├── README.md
├── SQLDukeWeek1.md
├── SQLDukeWeek2.md
├── SQLDukeWeek3 - Inner & Outer Joins.md
├── SQLDukeWeek3.md
├── SQLDukeWeek4.md
├── Teradata Cheatsheet.md
├── Week2-Dillards.md
├── Week3-Dillards.md
├── Week3Ex7-InnerJoin.sql
├── Week3Ex8-OuterJoins.sql
├── Week4Ex10-BizInt.sql
├── Week4Ex12-BizInt.sql
├── Week4Ex9-Subqueries.sql
└── Week5 - Dillards.md

/README.md:
--------------------------------------------------------------------------------

# SQL Duke

Answers to Duke University's Managing Big Data with MySQL course (2017), found [here](https://www.coursera.org/learn/analytics-mysql/home/info "Course Information for 'Managing Big Data with MySQL'").

### Lecture Notes

* Week 1 - ER diagrams, relational schemas, database concepts
* Week 2 - SELECT, WHERE, FROM, ORDER BY, LIMIT, operations
* Week 3 - GROUP BY, HAVING, DISTINCT, operators
* Week 3 - Inner joins, outer joins + examples
* Week 4 - Subqueries & more operators + examples

*Note: For Weeks 3 & 4, I chose to focus more on* examples *of joins instead of lengthy theoretical explanations of what joins are. Rather than paraphrasing Duke's notes, I focused on listing examples that helped me understand the concepts better.*

### MySQL Assignments

* MySQL answers for Week 3: Ex 7
* MySQL answers for Week 3: Ex 8
* MySQL answers for Week 4: Ex 9
* MySQL answers for Week 4: Ex 10

*Answers to Exercises 1-6 of Week 1 & Week 2 are not included because they're already available online as part of the course.*

### Teradata Assignments

* Teradata Week 2 answers
* Teradata Week 3 answers
* *No Teradata assignment for Week 4, do quiz instead*
* Teradata Week 5 FINAL EXAM ANSWERS

### License

MIT.
--------------------------------------------------------------------------------
/SQLDukeWeek1.md:
--------------------------------------------------------------------------------

# Week 1: Introduction to Databases

### Background
##### What is SQL? Why do we need it?
* SQL = Structured Query Language
* SQL is used by every relational database (DB) management system, or DBMS.
* It lets us efficiently store and extract large amounts of data

> Prof: "Imagine that you have multiple users trying to access the same excel workbook/ data spreadsheet. It is going to lag like hell. Now multiply that problem by 1000. Holy crap. This is why we need a database system, so we don't go crazy."

- Other DBMSs also use languages _based on_ SQL (e.g., MySQL, PostgreSQL)
- Once you learn how to use the general SQL language, it will be easy to switch between systems
- Same as driving a car - once you know how to drive one car, you can switch with relative ease between different car brands :)

##### What are relational databases?

* It is something awesome
* Officially, a relational database is "a database structured to _recognize relations_ between stored items of information"
* Basically, recognising relations between stored items lets the DB extract only the items that are absolutely needed, reducing run time
* (You guessed it!) Based on set theory.
> Summary: It only interacts with the subsets of data needed to provide the information you asked for, rather than opening an entire Excel sheet

##### Benefits
* More memory efficient for large datasets
* Faster responses to queries too!
* Having structure can prevent or minimise data overrides
* Supports greater data entry accuracy; you can specify what data type is allowed per field (e.g.
only numbers)

##### Basic Features
- Tables = smallest logical subset
- Column names = must be unique
- Order of columns & rows MUST NOT matter, so the db can retrieve information in whatever order or fashion it determines to be the fastest.

##### About the Field/ Course
- We will generally focus on making queries
- Only early-stage startups need to set up or maintain the db
- This course will mostly cover how to make queries, but not how to make a DB

> Alright hotstuff. In case you're wondering/forgot, why bother with diagrams?
>
> Because understanding how a db is laid out helps greatly with learning to write queries later.
> Now let's get started!

-----

# ER Diagrams
### Entities
- Shape: boxes
- Each box = one category, possibly a table
- Each individual member of a box's category = one entity instance
- Every entity must have at least one column that serves as a UA, or unique key identifier (see next section for UA)

### Attributes
- Shape: circles
- Each circle = one attribute of a box, i.e. an attribute of an entity.

**Unique Attribute**
- A Unique Attribute or a Primary Key is an attribute with a unique value in each entity instance.
- Underline the UA
- This is the column that lets you link tables together
- Eg. Student IDs are unique for every student

**Composite Attribute**
- Composite attributes are those that can be completely reconstructed from other attributes
- Eg. Classroom ID = building ID + room unit no.
- Usually, the composite attribute itself (aka the final product) is not included in the main DB to save space. Only its parts are included in the DB.
**Examples of composite attribute**
- Classroom is the entity
- Identified by "classroomID" value
- All classroom IDs will have a building and room number attribute attached to them

### Relationships
- ---- lines between entities
- < > diamonds to describe relationship

### Cardinality constraints
- Describe the minimum (min) or maximum (max) number of items the other entity can be linked to.
- Bracket notes: always written left to right, even if the diagram orientation or page orientation is right to left.
- M = many (no fixed maximum)
- 0 = optional (minimum of zero)
- Lines closest to rectangle: MAXIMUM no. of instances associated with that entity
- Lines furthest away: MINIMUM no. of instances associated with that entity
- --- straight line = single
- / > crows feet = many

**Examples of cardinality constraints**

- Each college can be attended by (max) multiple students, but is attended by (min) at least one student. (M, 1)
- Each student attends (min) one and (max) one college. (1, 1)
- (10, 1000) = each college needs a minimum of 10 students, max of 1000 students.

### Weak Entities
- Weak entities are not identifiable on their own (not fully unique like a full entity)
- Weak entities have a double outline
- Can be combined with another entity's key to form a fully unique key
- A dotted underline on an attribute means that attribute is a partial key

**Example: Building & apartment IDs**

- Building ID is the unique primary key
- The partial key (Apartment ID) can become a full unique key (equivalent of Building ID) IF it is combined with the unique key of the entity it is connected to via the double-diamond (identifying) relationship.
- Apartment ID is only unique if combined with Building ID.
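The building & apartment example above can be sketched as actual table definitions. This is a hypothetical schema (the table and column names are mine, not from the course database), run in SQLite through Python's `sqlite3` module, which accepts the same core SQL: the composite primary key makes an apartment ID unique only *within* its building.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE building (
        building_id TEXT PRIMARY KEY       -- the strong entity's unique key
    );
    CREATE TABLE apartment (
        building_id  TEXT,                 -- key borrowed from the strong entity
        apartment_id TEXT,                 -- partial key: not unique on its own
        PRIMARY KEY (building_id, apartment_id),  -- composite key = full unique key
        FOREIGN KEY (building_id) REFERENCES building (building_id)
    );
""")

conn.execute("INSERT INTO building VALUES ('B1'), ('B2')")

# The same apartment number may exist in two different buildings...
conn.execute("INSERT INTO apartment VALUES ('B1', '0101'), ('B2', '0101')")

# ...but not twice in the same building: the composite key rejects it.
try:
    conn.execute("INSERT INTO apartment VALUES ('B1', '0101')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The second insert of ('B1', '0101') fails because the pair (building_id, apartment_id) already exists, which is exactly what "Apartment ID is only unique if combined with Building ID" means.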
# Relational Schemas

- Similar items, but new names:
- tables (or "relations")
- columns ("fields" / "attributes")
- rows ("records" / "tuples")
- An RS is a simplified version of (or plan for) a db
- Reflects logical ideas, but NOT physical (actual) design
- Strictly no order; variables must be independent
- Benefit: Looks less messy lol
- Problem: they lack the cardinality constraints found in ER diagrams. So sometimes one value matches more than one key, but you won't know this until you see the ER diagram too.

### PRIMARY KEY (PK)
- Each table = one box
- PK should be underlined and put at top of box.
- The PK must have a value unique for every row in that table.
- PK strictly CANNOT have null values.
- Columns that can double up as primary keys, because they also have unique values, can be marked with a (U)

### FOREIGN KEY (FK)
- Columns that refer to the primary key of another table
- Write (FK) next to the item to highlight its status
- Draw arrows to the table it refers to.

### WEAK ENTITIES
- Will have TWO underlined keys: their own partial key paired with a foreign key, aka their own key paired with the primary key of another table
- Together, this composite key forms the primary key.

### MANY TO MANY RELATIONSHIP
- Clue that columns can have many instances of one another
- For example, many classes can have combinations of many students
- Usually, primary keys must not be duplicated. But in a many-to-many relationship, an exception is made to illustrate the relationship.
- So, this table has nothing but 2 foreign keys in it, making 1 composite primary key

# Conclusion: Building ERD diagrams
- Using the ERDPlus tool to make diagrams
- www.erdplus.com
- Can export diagrams as images

--------------------------------------------------------------------------------
/SQLDukeWeek2.md:
--------------------------------------------------------------------------------

# Week 2

### INTRODUCTION

Note to self: Typical syntax follows this order
```sql
SELECT item
FROM table
WHERE condition
GROUP BY variable
HAVING condition
ORDER BY category ASC / DESC
```
- Only SELECT + FROM are actually required. Rest = optional
- The DB will physically manage and plan *how* best to retrieve the data (aka not your problem for now). Focus on extracting data first.
- To tell the db that data is missing, type NULL, not zero.

### Style Notes
1. CAPITALISE all commands, aka the first word
2. CAPITALISE other keywords like 'sum' or 'avg' too
3. End all queries with ;
4. Start each command on a new line (easier to read)

### Optional but Good To Follow
The DUKE course uses the dialect *MySQL*. However, MySQL is a nappy hipster that doesn't quite follow SQL convention. In fact, it's pretty far out. If you ever need to change to a different SQL dialect, you'll need these rules too. Hence, it might be a good idea to start making them a habit now.

1. Not all DBs are case insensitive. Try to write the EXACT name used in the database, CaPiTal letters and all.
2. Names inside "inverted commas" are strictly case sensitive. Use the EXACT name used in the db too.
3. Make indentations for new subqueries or new lines. You'll learn more about this later.
4. Although MySQL accepts both single and double inverted commas, stick to single quotes where possible. Most other DBs only accept single quotes.
# Start: First Look At Your Database
*Let's assume you're exploring a new DB, but have no diagrams about it. How would you explore and get to know it?*

To make something the default database for our queries, run this command:

```sql
USE dognitiondb
```

To show all tables in the database:
```sql
SHOW tables
```

To show all columns in a table:
```sql
SHOW columns FROM [table name, without brackets]
-- or:
DESCRIBE [table name, without brackets]
```
Note: In the output, the SHOW/DESCRIBE command will reveal whether NULL values can be stored in that field of the table. The "Key" column of the output also provides the following information about each field of data in the table being described (see [here](https://dev.mysql.com/doc/refman/5.6/en/show-columns.html "SQL documentation") for more information).

##### Hints

- PRI - the column is a PRIMARY KEY or is one of the columns in a multiple-column PRIMARY KEY.
- UNI - the column is the first column of a UNIQUE index.
- MUL - the column is the first column of a nonunique index in which multiple occurrences of a given value are permitted within the column.
- Empty - the column either is not indexed or is indexed only as a secondary column in a multiple-column, nonunique index.

*Note: The "Default" field of the output indicates the default value that is assigned to the field. The "Extra" field contains any additional information that is available about a given field in that table. For now, you won't die yet if you don't understand this.*

To show all the data in a column:
```sql
SELECT [column name, without brackets]
FROM [table name, without brackets];
```

If you have multiple databases loaded:
```sql
SHOW columns FROM [table name] FROM [database name]
SHOW columns FROM databasename.tablename
```
# MySQL Variable Types

In a MySQL database, there are three (3) main data types: text, numbers and dates/times. When you design your database, it is important that you select the appropriate type, since this determines what type of data you can store in that column. Using the most appropriate type can also increase the database's overall performance.

### Text Types

| Name | Description |
| ------ | -------- |
| CHAR( ) | A fixed section from 0 to 255 characters long.|
| VARCHAR( ) | A variable section from 0 to 255 characters long. |
| TINYTEXT | A string with a maximum length of 255 characters. |
| TEXT | A string with a maximum length of 65535 characters.|
| BLOB | A string with a maximum length of 65535 characters.|
| MEDIUMTEXT | A string with a maximum length of 16777215 characters. |
| MEDIUMBLOB | A string with a maximum length of 16777215 characters.|
| LONGTEXT| A string with a maximum length of 4294967295 characters.|
| LONGBLOB| A string with a maximum length of 4294967295 characters.|

The ( ) brackets allow you to specify the maximum number of characters that can be used in the column. Meanwhile, BLOB stands for Binary Large OBject, and can be used to store non-text information that is encoded into text.
*How cute is that?!*

### Number Types

| Name | Description | Length |
| --- | ---- | ---- |
| TINYINT( ) | -128 to 127 normal | 0 to 255 UNSIGNED |
| SMALLINT( ) | -32768 to 32767 normal | 0 to 65535 UNSIGNED |
| MEDIUMINT( ) | -8388608 to 8388607 normal | 0 to 16777215 UNSIGNED |
| INT( ) | -2147483648 to 2147483647 normal | 0 to 4294967295 UNSIGNED |
| BIGINT( ) | -9223372036854775808 to 9223372036854775807 normal | 0 to 18446744073709551615 UNSIGNED |
| FLOAT | A small number with a floating decimal point. | |
| DOUBLE( , ) | A large number with a floating decimal point. | |
| DECIMAL( , ) | A DOUBLE stored as a string, allowing for a fixed decimal point. | |

By default, the integer types will allow a range between a negative number and a positive number, as indicated in the table above.

You can use the UNSIGNED keyword, which will instead only allow positive numbers, which start at 0 and count up.

#### Useful Num Commands
| Command | Description |
| --- | ---- |
| AVG( ) | Finds the average of all rows of the variable |
| SUM( ) | Finds the sum of all rows of the variable |
| FLOOR( ) | Rounds a floating decimal down to the nearest integer |
| CEIL( ) | Rounds a floating decimal up to the nearest integer |
| ROUND(num, x) | Rounds num to x decimal places, eg. ROUND(height, 2) |
| % | Modulus |
| Var % 2 = 0 | Used in conjunction with the WHERE command, to return all rows where 'var' is even numbered |
| Var % 2 != 0 | Used in conjunction with the WHERE command, to return all rows where 'var' is odd numbered |

# SELECT, FROM
**SELECT** is used anytime you want to retrieve data from a table. In order to retrieve that data, you always have to provide at least two pieces of information:

>(1) WHAT you want to select, and
(2) FROM where you want to select it.
Example of most basic select:
```sql
SELECT breed
FROM dogs;
```
SELECT statements can also be used to make new derivations of individual columns using "+" for addition, "-" for subtraction, "*" for multiplication, or "/" for division. For example, if you wanted the median inter-test intervals in hours instead of minutes or days, you could query:
```sql
SELECT median_iti_minutes/60, median_iti_minutes
FROM dogs
```
# LIMIT / OFFSET

LIMIT restricts the number of rows in the output.
OFFSET skips the first X rows before the output begins.
##### Examples
Select only 10 rows of data
```sql
SELECT breed
FROM dogs
LIMIT 10;
```
Select 10 rows of data, but AFTER the first 5 rows.
```sql
SELECT breed
FROM dogs
LIMIT 5, 10; -- offset 5, then return 10 rows

SELECT breed
FROM dogs
LIMIT 10 OFFSET 5; -- same result; note that OFFSET is written after LIMIT
```

# WHERE + BETWEEN, AND, OR

We can use the WHERE statement to specify our queries, like the examples below. We can add BETWEEN, AND, OR operators in conjunction with variables to make them more specific, like this:
```SQL
SELECT dog_guid, weight
FROM dogs
WHERE weight BETWEEN 10 AND 50;

SELECT dog_guid, dog_fixed, dna_tested
FROM dogs
WHERE dog_fixed=1 OR dna_tested=1;

SELECT dog_guid, dog_fixed, dna_tested
FROM dogs
WHERE dog_fixed=1 AND dna_tested!=1;

SELECT dog_guid
FROM dogs
WHERE YEAR(created_at) > 2015 -- you will learn more about dates later
```
##### Using WHERE + Strings
Strings need to be surrounded by quotation marks in SQL. MySQL accepts both double and single quotation marks, but some database systems only accept single quotation marks, so it might be a good idea to start that habit right now.
Note that backticks are not for strings at all: they are used to quote table and column names (identifiers), which you'll need whenever a name clashes with an SQL keyword.

>'the marks that surround this phrase are single quotation marks'
"the marks that surround this phrase are double quotation marks"
`the marks that surround this phrase are backticks`

```SQL
SELECT dog_guid, weight
FROM dogs
WHERE breed = 'Golden Retriever';
```
### Date/Time Types
In the previous section, we saw one example of using date-time to specify a query further. Let's learn more about them now. We can use the WHERE statement to interact with datetime data. Time-related data is a little more complicated to work with than other types of data, because it must have a very specific format. Examples of datetime types:

```sql
DATE: YYYY-MM-DD
DATETIME: YYYY-MM-DD HH:MM:SS
TIMESTAMP: YYYYMMDDHHMMSS
TIME: HH:MM:SS
YEAR: YYYY
```
Date/Time fields will only accept a valid date or time. A time stamp stored in one row of data might look like this:
```sql
2013-02-07 02:50:52
```
Using the same date-time format in combination with WHERE, we can select specific rows of data that fit the date criteria. For example, we can specify a range of dates we'd like to retrieve data from:

```sql
SELECT dog_guid, created_at
FROM complete_tests
WHERE created_at >= '2014-01-01' AND created_at <= '2015-01-01'
```
However, instead of typing out full specifications of date ranges every time, there are other functions that interact well with dates too. For instance:
```sql
SELECT dog_guid, updated_at
FROM reviews
WHERE YEAR(created_at) = 2014 -- selects entries created in 2014
```
In that vein, two similar functions are **DAY** and **MONTH**, which let you extract all rows created on a specified day or month.
```sql
SELECT dog_guid, created_at
FROM complete_tests
WHERE DAY(created_at) > 15 -- day of month: 1 to 31

SELECT dog_guid, created_at
FROM complete_tests
WHERE MONTH(created_at) = 12 -- month of year: Dec
```
**DAYNAME** is a function that will select data from only a single day of the week. This example selects all IDs created on Tuesday:
```sql
SELECT dog_guid, created_at
FROM complete_tests
WHERE DAYNAME(created_at) = "Tuesday" -- dayname here
```
You have to use a different set of functions than you would use for regular numerical data to add or subtract time from any values in these datetime formats. You would use the **TIMESTAMPDIFF** or **DATEDIFF** function.
```sql
SELECT user_guid, TIMESTAMPDIFF(MINUTE, start_time, end_time)
FROM exam_answers
WHERE TIMESTAMPDIFF(MINUTE, start_time, end_time) < 0;

SELECT user_guid, TIMESTAMPDIFF(HOUR, start_time, end_time)
FROM exam_answers
WHERE TIMESTAMPDIFF(HOUR, start_time, end_time) > 1;

SELECT user_guid, TIMESTAMPDIFF(SECOND, start_time, end_time)
FROM exam_answers
WHERE TIMESTAMPDIFF(SECOND, start_time, end_time) > 60;
```
# SUBSETS: IN, LIKE
The IN operator allows you to specify multiple values in a WHERE clause. Each of these values must be separated by a comma from the other values, and the entire list of values should be enclosed in parentheses.
```sql
SELECT dog_guid, breed
FROM dogs
WHERE breed IN ('retriever', 'poodle');

SELECT * -- this means select all columns
FROM users
WHERE state NOT IN ('NC','NY');
```
The **LIKE** operator allows you to specify a pattern that the textual data you query has to match.
For example, if you wanted to look at all the data from breeds whose names started with "s", you could query:
```sql
SELECT dog_guid, breed
FROM dogs
WHERE breed LIKE ("s%");
```
In this syntax, the percent sign indicates a wild card. Wild cards represent unlimited numbers of missing letters. This is how the placement of the percent sign would affect the results of the query:

1. WHERE breed LIKE ("s%") = the breed must start with "s", but can have any number of letters after the "s"
2. WHERE breed LIKE ("%s") = the breed must end with "s", but can have any number of letters before the "s"
3. WHERE breed LIKE ("%s%") = the breed must contain an "s" somewhere in its name, but can have any number of letters before or after the "s"

# IS, IS NOT, NULL
To select only the rows that have NON-NULL data you could query:
```sql
SELECT user_guid
FROM users
WHERE free_start_user IS NOT NULL;
```
To select only the rows that have null data, so that you can examine whether these rows share something else in common, you could query:
```sql
SELECT user_guid
FROM users
WHERE free_start_user IS NULL;
```
You will see that ISNULL is a logical function that returns a 1 for every row that has a NULL value in the specified column, and a 0 for everything else. With it, we can get the total number of NULL values in any column. Here's what that query would look like:
```sql
SELECT SUM(ISNULL(breed)) -- counts dogs with breed = NULL
FROM dogs
```
More complicated example: Printing the number of dog IDs for each breed group and gender, where there are at least 1,000 dogs in each breed group. Note the useful NULL function.
```sql
SELECT COUNT(dog_guid) AS num_dogs, gender, breed_group
FROM dogs
WHERE breed_group IS NOT NULL AND breed_group <> ''
GROUP BY gender, breed_group
HAVING COUNT(breed_group) > 1000
ORDER BY COUNT(dog_guid) DESC;
```
Can you guess what the other functions mean? If you can't, we'll learn about them next so don't stress about it.

# AS / REPLACE / REMOVE

If you wanted to **rename** the time stamp field of the completed_tests table from "created_at" to "time_stamp" in your output, you could take advantage of the **AS** clause and execute the following query:
```sql
SELECT dog_guid, created_at AS time_stamp
FROM complete_tests;
```
Note that if you use an alias that includes a space, the full alias MUST be surrounded in **quotation marks**:
```sql
SELECT dog_guid, created_at AS 'time stamp'
FROM complete_tests;
```
You could also make an alias for a table, and just about everything:
```sql
SELECT dog_guid, created_at AS 'time stamp'
FROM complete_tests AS tests

SELECT user_guid, (median_ITI_minutes * 60) AS 'Median Sec'
FROM dogs;
```
It is possible to replace unwanted stuff too, or remove them. For example, you can **delete** unwanted leading characters (here, dashes) from every value with the **TRIM** function:
```sql
SELECT breed, TRIM(LEADING '-' FROM breed) AS breed_fixed
FROM dogs;
```
Or, you could **replace** them instead with blanks, or any other item. The syntax for **REPLACE( )** is

```sql
-- REPLACE(variable, replace FOR, replace WITH)

SELECT breed, REPLACE(breed, '-', '') AS breed_fixed
FROM dogs;
```
One last way to edit output is to simply **WRITE** your own stuff using **CONCAT**. The syntax for concat is to lump everything together, separated by commas, like this: ['STRING 1', 'STRING 2' ...
]
```sql
SELECT breed,
CONCAT('This dog is a ', breed, ' dog.') AS new_statement
FROM dogs
ORDER BY breed
```
# DISTINCT, COUNT, ORDER BY

When the DISTINCT clause is used with multiple columns in a SELECT statement, the combination of all the columns together is used to determine the uniqueness of a row in a result set. Note that the distinct values returned include NULL too.
```sql
SELECT DISTINCT breed
FROM dogs; -- distinct dog breeds

SELECT DISTINCT state, city
FROM users; -- distinct combo of state AND city
```
If you wanted the breeds of dogs in the dog table sorted in alphabetical order, you could query this using the **ORDER BY** function:
```sql
SELECT DISTINCT breed
FROM dogs
ORDER BY breed ASC;
```
To sort the output in descending order as well:
```sql
SELECT DISTINCT breed
FROM dogs
ORDER BY breed DESC;
```
Note: When applied to numerical rather than alphabetical data, ORDER BY gives ascending order by default.
```sql
SELECT DISTINCT user_guid, state, membership_type
FROM users
WHERE country="US" AND state IS NOT NULL AND membership_type IS NOT NULL
ORDER BY state ASC, membership_type ASC
```
##### Important Note:

COUNT and DISTINCT cannot be combined as separate output columns, like this:
```sql
SELECT count (apples), distinct pears
FROM fruit
```
Because count = only 1 row of output (the total for that variable), while pears = many pear types. However, they can be used this way, because this will produce the number of distinct apple types, *grouped by* each country, so that each unique country will only have 1 number attached to it.
```sql
SELECT COUNT(DISTINCT apples), country
FROM fruit
GROUP BY country; -- we will learn group by next
```
Lastly, remember that DISTINCT keeps NULL as one of its values, but COUNT(column) does not count NULLs at all. So, it is good practice to add IS NOT NULL or != '' conditions as much as possible when using COUNT.

# How to Export your Query Results to a Text File
You can tell MySQL to put the results of a query into a variable, and then use Python code to format the data in the variable as a CSV file (comma separated value file, a .CSV file) that can be downloaded. When you use this strategy, all of the results of a query will be saved into the variable, not just the first 1000 rows as displayed in Jupyter.

To tell MySQL to put the results of a query into a variable, use the following syntax:
```sql
variable_name_of_your_choice = %sql [your full query goes here, but don't include square brackets];

breed_list = %sql SELECT DISTINCT breed FROM dogs ORDER BY breed;
num_dogs = %sql SELECT COUNT(DISTINCT dog_guid) FROM dogs;
```
Once your variable is created, tell Jupyter to format the variable as a csv file using the following syntax:
```sql
variable_name.csv('the_output_name_you_want.csv')
breed_list.csv('breed_list.csv')
num_dogs.csv('unrelated.csv')
```

--------------------------------------------------------------------------------
/SQLDukeWeek3 - Inner & Outer Joins.md:
--------------------------------------------------------------------------------

# Week 3 - JOINS

- Joins are based on cartesian products
- (x, a) (x, b) (y, a) (y, b)
- JOIN works by retrieving data only where the cartesian products match

### Inner Joins

- Only items with exactly matching primary keys from both tables will be put into the result table
- NULL values can't be matched
- Order of which table joins to which table *does not
matter*

### Left / Right Outer Join

- Here, the ORDER of the join matters.
- Example: LEFT outer join vs RIGHT outer join
- Left (or first) table will have ALL its rows included, even null values
- Right (or second) table's items will only be included if they match the chosen key used in the left table
- Rows from the left table that don't have a matching ID in the right will instead get a NULL value
- Can switch to RIGHT OUTER JOIN to reverse the order of the tables (see example below)
- Basically the same as reversing the positions of the join. So left and right don't really matter; order matters more.

### Full Outer Join

- ALL rows of both tables are included
- Any row that doesn't have a matching partner is given a NULL value.
- Rarely used (why would anyone want this? jkjk)
- Note: Not all DBs support full outer joins. MySQL doesn't, but PostgreSQL does. Can test this out using the Teradata db!

### Many to Many Relationships

- Recall example 2.3 of Week 1? While building the "fashion shop" relational schema, there was a many-to-many relationship, with another table between two big entities as a linking/bridge table.
- This table had only the foreign keys + primary keys of the two tables in various combinations
- **For many to many, left join 1 & 2 first, then left join the results again to 3.**

Caution:
- Stick to left outer joins! Right/inner joins would mess up the data by deleting NULL values, since the right table now acts as the 'primary key'. (example below)
- Beware of duplicates too. Joining three tables with a single duplicate (2 rows) across them can result in 6 rows. This could quickly get out of hand in a big db. (example below)

### Notes to self before starting (!!!)
- Where possible, clean data before you start
- Try to be aware of table relationships, who has null data, subsets, duplicates etc
- When doing joins, count the number of unique IDs / keys in each table you are joining first. This helps you see which is larger or smaller, and gives a reasonable expectation of what the final result should look like.
- On handling errors:
  - Be aware of duplicates and NULL values (sometimes they exist despite the rules)
  - Null values can exist even in the primary key column when the database is young, and the company is desperate for data so they accept any data, even incomplete sets
  - **It is NOT your job to clean this up, or restructure their db -- instead, just try to make as much business value out of the items you have.**
- Start with small data and tables (<10 rows), and see if they output what you are expecting
- Double-check at the beginning! Otherwise you won't even know your results are incorrect

### Reminder: (Proper) Technical Terms

- Table = Relation
- Row = Tuple
- Column/Field = Attribute

# INNER JOINS

Let's start with an inner join.

- SQL needs to be told which IDs overlap
- SQL needs to be told which is left/right

We will use *equijoin* syntax for the first few examples because it's not as confusing. We will switch to traditional syntax for outer joins later.

Example: SIMPLE INNER JOIN FOR 2 TABLES
**Find the total number of reviews, and the average rating given, for EACH dog.
Combine information from the Dogs table and the Reviews table:**
69 | ```sql
70 | SELECT
71 | d.dog_guid AS DogID,
72 | AVG(r.rating) AS AvgRating,
73 | COUNT(r.rating) AS NumRatings
74 | FROM dogs d, reviews r -- the single letters after the table names are aliases
75 | WHERE d.dog_guid=r.dog_guid
76 | AND d.user_guid=r.user_guid -- joining on both keys excludes any unmatched IDs
77 | GROUP BY DogID
78 | ORDER BY AvgRating DESC;
79 | ```
80 | Example: INNER JOIN 2 TABLES, CONDITIONAL
81 | **Extract the user_guid, dog_guid, breed, breed_type, and breed_group for all animals who completed the "Yawn Warm-up" game. Join on dog_guid only.**
82 |
83 | ```sql
84 | SELECT
85 | c.user_guid,
86 | c.dog_guid,
87 | d.breed,
88 | d.breed_type,
89 | d.breed_group
90 | FROM complete_tests c, dogs d
91 | WHERE c.dog_guid=d.dog_guid
92 | AND c.test_name = "Yawn Warm-up";
93 | ```
94 | Example: INNER JOIN 3 TABLES
95 | **Join 3 tables to extract the user ID, user's state of residence, user's zip code, dog ID, breed, breed_type, and breed_group for all animals who completed the "Yawn Warm-up" game.**
96 |
97 | ```sql
98 | SELECT
99 | d.user_guid AS UserID,
100 | d.dog_guid AS DogID,
101 | d.breed,
102 | d.breed_type,
103 | d.breed_group,
104 | u.state,
105 | u.zip
106 | FROM dogs d, complete_tests c, users u -- inner join so order doesn't matter
107 | WHERE d.dog_guid = c.dog_guid
108 | AND d.user_guid = u.user_guid
109 | AND c.test_name = "Yawn Warm-up";
110 | ```
111 | Notes: Here, I avoided using c.user_guid to join the tables because the user GUID in complete_tests is null. I wouldn't have known this if I hadn't checked the tables first. So, always test in small batches! And be prepared to deal with missing data.
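The note above is worth turning into a habit. Before writing any join, a quick pair of counts shows how many distinct keys each side has — a sketch against the same Dognition tables used above (the exact counts will depend on your copy of the database):

```sql
-- Count unique join keys on each side before joining.
-- A big mismatch warns you to expect NULLs or duplicate rows.
SELECT COUNT(DISTINCT dog_guid) FROM dogs;
SELECT COUNT(DISTINCT dog_guid) FROM complete_tests;
```

If the two counts differ greatly, decide up front whether you want an inner join (drop the unmatched keys) or an outer join (keep them as NULLs).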
112 |
113 | Example: INNER JOIN 3 TABLES
114 | **How would you extract the user ID, membership type, and dog ID of all the golden retrievers who completed at least 1 Dognition test (you should get 711 rows)?**
115 | ```sql
116 | SELECT DISTINCT
117 | u.user_guid,
118 | u.membership_type,
119 | d.dog_guid,
120 | d.breed
121 | FROM complete_tests c, dogs d, users u
122 | WHERE c.dog_guid = d.dog_guid
123 | AND d.user_guid = u.user_guid
124 | AND d.breed = 'Golden Retriever';
125 | ```
126 | Example: INNER JOIN 2 TABLES
127 | **How many unique Golden Retrievers who live in North Carolina are there in the Dognition database (you should get 30)?**
128 | ```sql
129 | SELECT DISTINCT
130 | u.user_guid,
131 | d.dog_guid,
132 | d.breed
133 | FROM dogs d, users u
134 | WHERE d.user_guid = u.user_guid
135 | AND d.breed = 'Golden Retriever'
136 | AND u.state = 'NC';
137 | ```
138 |
139 | ### NOTE: USING TRADITIONAL SYNTAX
140 |
141 | The equijoin syntax is accepted with inner joins, but not with full/left/right outer joins. Instead, the (traditional) syntax for those looks something like this (below).
142 |
143 | Why do we still have the traditional version when it is longer? Because:
144 |
145 | - With the join conditions moved into ON instead of = signs in WHERE, the WHERE clause is saved for other conditions
146 | - Unless otherwise specified, a join is understood as an INNER join
147 | - If inner join, order doesn't matter
148 | - If outer join, the table written before the JOIN keyword is the LEFT table, and the one after it is the RIGHT table
149 |
150 | Re-writing an earlier example using traditional syntax:
151 | ```sql
152 | SELECT
153 | d.user_guid AS UserID,
154 | d.dog_guid AS DogID,
155 | d.breed,
156 | d.breed_type,
157 | d.breed_group
158 | FROM dogs d JOIN complete_tests c -- look here
159 | ON c.dog_guid=d.dog_guid -- look here
160 | WHERE test_name='Yawn Warm-up';
161 | ```
162 | From now on, we will be using traditional syntax for OUTER JOINS.
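For extra practice, the three-table inner join from earlier can also be rewritten in traditional syntax — a sketch that should be equivalent to the equijoin version above:

```sql
SELECT
d.user_guid AS UserID,
d.dog_guid AS DogID,
d.breed,
d.breed_type,
d.breed_group,
u.state,
u.zip
FROM dogs d
JOIN complete_tests c -- JOIN with no modifier is an INNER join
ON d.dog_guid = c.dog_guid
JOIN users u
ON d.user_guid = u.user_guid
WHERE c.test_name = 'Yawn Warm-up';
```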
163 |
164 | # Outer Joins
165 |
166 | Unfortunately, Duke only gave two examples on outer joins -__- So I included one more from the internet.
167 |
168 | Example: LEFT JOIN 2 TABLES
169 | **Find the number of complete tests each unique dog (from the dogs table) has completed. Put the dog with the most tests completed first.**
170 | ```sql
171 | SELECT
172 | d.dog_guid AS dDogID,
173 | c.dog_guid AS cDogID,
174 | COUNT(c.test_name) AS 'Tests Completed'
175 | FROM dogs d
176 | LEFT JOIN complete_tests c
177 | ON d.dog_guid = c.dog_guid
178 | WHERE d.dog_guid IS NOT NULL
179 | GROUP BY d.dog_guid
180 | ORDER BY COUNT(c.dog_guid) DESC;
181 | ```
182 | Example: LEFT JOIN 2 TABLES + COUNT
183 | **Create a list of all the unique dog_guids that are contained in the site_activities table, but not the dogs table, and how many times each one is entered. Remove NULL values.**
184 | ```sql
185 | SELECT
186 | DISTINCT sa.dog_guid,
187 | d.dog_guid,
188 | COUNT(sa.dog_guid)
189 | FROM site_activities sa
190 | LEFT JOIN dogs d
191 | ON sa.dog_guid = d.dog_guid -- join on dog_guid, since that is the ID we are comparing
192 | WHERE d.dog_guid IS NULL
193 | AND sa.dog_guid IS NOT NULL
194 | GROUP BY sa.dog_guid;
195 | ```
196 | Example: LEFT JOIN 3 TABLES
197 | **Join 3 tables to combine the bill number, bill amount, item name, and the company each item is from.** (PS: Got this example from the internet)
198 | ```sql
199 | SELECT
200 | a.bill_no,
201 | a.bill_amt,
202 | b.item_name,
203 | c.company_name,
204 | c.company_city
205 | FROM counter_sale a
206 | LEFT JOIN foods b
207 | ON a.item_ID = b.item_ID
208 | LEFT JOIN company c
209 | ON b.company_ID = c.company_ID
210 | WHERE c.company_name IS NOT NULL
211 | ORDER BY a.bill_no;
212 | ```
213 | --------------------------------------------------------------------------------
/SQLDukeWeek3.md:
--------------------------------------------------------------------------------
1 | # Week 3
2 | Continuing all functions learnt in week 2, we will learn the final 3 this week.
Notes for this week will emphasize applications of functions more than explanations. Week 3 also includes notes on Inner Joins & Outer Joins (see next markdown file).
3 |
4 | # COUNT, SUM
5 | Count is, well, count.
6 | ```sql
7 | SELECT COUNT(breed)
8 | FROM dogs;
9 |
10 | SELECT COUNT(DISTINCT breed)
11 | FROM dogs;
12 |
13 | SELECT COUNT(DISTINCT user_guid)
14 | FROM complete_tests
15 | WHERE created_at > __
16 |
17 | SELECT state, zip, COUNT(DISTINCT user_guid)
18 | FROM users
19 | WHERE country = "US"
20 | GROUP BY state, zip
21 | HAVING COUNT(DISTINCT user_guid) > 5
22 | ORDER BY state ASC;
23 | ```
24 | Note: When a column is included in a count function, null values in that column are ignored in the count. But when an asterisk is used in a count function, nulls are included in the count.
25 |
26 | Next, SUM finds the total across all rows matching a given criteria. It only works for numerical values though, not for strings, and not for date-time types.
27 | ```sql
28 | SELECT SUM(ISNULL(exclude))
29 | FROM dogs;
30 | -- Result: 34,025
31 | ```
32 | Note: SUM and COUNT behave differently here. SUM(ISNULL(exclude)) adds up the 1s returned by ISNULL, so it counts only the rows where exclude IS NULL. COUNT(ISNULL(exclude)) counts every row, because ISNULL always returns 0 or 1 and never NULL.
33 | ```sql
34 | SELECT COUNT(ISNULL(exclude))
35 | FROM dogs;
36 | -- Result: 35,035
37 | ```
38 |
39 | # AVERAGE, MIN, MAX
40 | AVG, MIN, MAX are mathematical operators that work with numerical data. They can be used together or used separately. The minimum and maximum amounts also work on dates -- via picking the earliest or latest date. It's pretty basic so just read the examples to learn their syntax.
41 |
42 | ```sql
43 | SELECT test_name,
44 | AVG(rating) AS AVG_rating,
45 | MIN(rating) AS MIN_rating,
46 | MAX(rating) AS MAX_rating
47 | FROM reviews
48 | WHERE test_name = "Eye Contact Game";
49 |
50 | SELECT AVG(TIMESTAMPDIFF(minute, start_time, end_time)) AS Duration,
51 | test_name AS Test
52 | FROM exam_answers GROUP BY test_name; -- group by the non-aggregated column
53 |
54 | SELECT AVG(TIMESTAMPDIFF(hour, start_time, end_time)) AS Avg_duration,
55 | MIN(TIMESTAMPDIFF(hour, start_time, end_time)) AS min_time,
56 | MAX(TIMESTAMPDIFF(hour, start_time, end_time)) AS max_time,
57 | test_name AS Test
58 | FROM exam_answers
59 | WHERE TIMESTAMPDIFF(minute, start_time, end_time)>0 GROUP BY test_name;
60 | ```
61 | # GROUP BY
62 |
63 | GROUP BY aggregates all data for other columns based on the column selected to be grouped by. For instance, this groups the data by MONTH:
64 | ```SQL
65 | SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed
66 | FROM complete_tests
67 | GROUP BY Month;
68 | ```
69 | Note: Although this correctly groups data by month, **this example gives an incorrect test_name answer**. This is because there is only 1 row allocated for each Month, but more than one type of test done per month. In this situation, MySQL will populate it with a randomly chosen Test done in that month, while other DBs may throw an error, but both are incorrect. Overall, there is no way to present an aggregated and non-aggregated dataset in the same table.
70 |
71 | **Solution**: We can either group by all non-aggregated variables too (B), or further aggregate ALL variables (A).
72 |
73 | (A) This gives the number of test types and tests completed per month.
74 | ```SQL
75 | SELECT COUNT(DISTINCT test_name), MONTH(created_at) AS Month,
76 | COUNT(created_at) AS Num_Completed
77 | FROM complete_tests
78 | GROUP BY Month;
79 | ```
80 | (B) This gives number of tests completed per test type AND month.
81 | ```sql
82 | SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed
83 | FROM complete_tests
84 | GROUP BY Month, test_name;
85 | ```
86 | Note: Not all databases accept aliases in the GROUP BY clause (eg. MONTH(created_at) stored as Month). If they don't, just retype the full expression in the GROUP BY line.
87 |
88 | # HAVING
89 | The HAVING command is similar to WHERE, in that it adds another layer of specificity to your query. However, the difference is that *HAVING filters on aggregated (grouped) values*, while WHERE filters individual rows before any grouping happens.
90 |
91 | **Example using WHERE:** Print the test name, the month it was completed in, and the number of tests done that month -- for Nov & Dec ONLY.
92 | ```sql
93 | SELECT test_name,
94 | MONTH(created_at) AS Month_Name,
95 | COUNT(created_at) AS Num_Completed_Tests
96 | FROM complete_tests
97 | WHERE MONTH(created_at)=11 OR MONTH(created_at)=12
98 | GROUP BY test_name, Month_Name
99 | ORDER BY Num_Completed_Tests DESC;
100 | ```
101 | **Example using HAVING:** Print the test name, the month it was completed in, and the number of tests done that month -- still for Nov & Dec, but only for combinations WITH at least 20 tests done that month.
102 | ```sql
103 | SELECT test_name,
104 | MONTH(created_at) AS Month,
105 | COUNT(created_at) AS Num_Completed_Tests
106 | FROM complete_tests
107 | WHERE MONTH(created_at)=11 OR MONTH(created_at)=12
108 | GROUP BY 1, 2
109 | HAVING COUNT(created_at)>=20
110 | ORDER BY 3 DESC;
111 | ```
112 | #### More Examples
113 | Prints the average time taken per test, in minutes. Excludes rows where a test took more than 6,000 minutes, or less than 0 seconds.
114 | ```sql
115 | SELECT test_name,
116 | AVG(TIMESTAMPDIFF(minute, start_time, end_time)) AS 'Time (Min)',
117 | subcategory_name
118 | FROM exam_answers
119 | WHERE TIMESTAMPDIFF(minute, start_time, end_time)<6000
120 | AND TIMESTAMPDIFF(second, start_time, end_time)>0
121 | GROUP BY test_name, subcategory_name; -- group by both non-aggregated columns
122 | ```
123 | Print the number of users in each combination of state & zip -- where there are at least 5 users in that combination. Order ascending by state, and descending by number of users.
124 | ```sql
125 | SELECT state, zip,
126 | COUNT(DISTINCT user_guid) AS UserID
127 | FROM users
128 | WHERE state != ""
129 | AND state IS NOT NULL
130 | AND zip IS NOT NULL
131 | AND zip != ""
132 | GROUP BY state, zip
133 | HAVING UserID >= 5
134 | ORDER BY state ASC, UserID DESC;
135 | ```
136 | Revise the query you wrote in Question 2 so that it (1) excludes the NULL and empty string entries in the breed_group field, and (2) excludes any groups that don't have at least 1,000 distinct Dog_Guids in them.
137 | ```sql
138 | SELECT COUNT(DISTINCT dog_guid) AS num_dogs, gender, breed_group
139 | FROM dogs
140 | WHERE breed_group IS NOT NULL AND breed_group != ''
141 | GROUP BY 2, 3 -- group by both non-aggregated columns
142 | HAVING num_dogs > 1000
143 | ORDER BY 1 DESC;
144 | ```
145 | # Conclusion
146 | These functions sum up the last of all the basic commands. Last week, you learnt SELECT, FROM, WHERE, ORDER BY. This week, you learnt HAVING, GROUP BY, as well as OPERATORS, SUM, AVG, DISTINCT, COUNT. These let you add a greater layer of specificity to your queries.
147 |
148 | *See notes in next section for inner and outer joins.*
149 | --------------------------------------------------------------------------------
/SQLDukeWeek4.md:
--------------------------------------------------------------------------------
1 | # Week 4 - Subqueries & Operators
2 |
3 | Subqueries, which are also sometimes called inner queries or nested queries, are queries that are embedded within the context of another query.
They are useful for complex queries, and also for testing smaller parts of the query to make sure each gives you what you want before assembling the whole thing. Some basic rules are:
4 |
5 | - ORDER BY phrases cannot be used in subqueries (although ORDER BY phrases can still be used in outer queries that contain subqueries)
6 | - Subqueries in SELECT or WHERE clauses can output no more than 1 row, unless they are combined with operators that are explicitly designed to handle multiple values, such as the IN operator.
7 |
8 | Lastly, when they are used in FROM clauses, they create what are called **derived tables**. This comes into play later when you want to optimise your query to run faster. Having smaller derived tables helps the query be answered quicker because the db does not need to hold such a large derived table in memory. But for now, focus on writing the damn thing right first.
9 |
10 | ### #1: SUBQUERIES FOR ON-THE-FLY CALCULATIONS
11 |
12 | Example: Find all details of test records whose duration is greater than the average duration across the community.
13 | ```sql
14 | SELECT *,
15 | TIMESTAMPDIFF(minute,start_time,end_time) AS Duration
16 | FROM exam_answers
17 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) >
18 | (SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time))
19 | FROM exam_answers
20 | WHERE TIMESTAMPDIFF(minute,start_time,end_time)>0);
21 | ```
22 |
23 | Example: Find all details of test records whose duration is greater than the community's average duration for the "Yawn Warm-Up" game.
24 | ```sql
25 | SELECT *,
26 | TIMESTAMPDIFF(minute,start_time,end_time) AS Duration
27 | FROM exam_answers
28 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) >
29 | (SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time))
30 | FROM exam_answers
31 | WHERE TIMESTAMPDIFF(minute,start_time,end_time)>0
32 | AND test_name = 'Yawn Warm-Up');
33 | ```
34 | ### #2: SUBQUERIES FOR TESTING MEMBERSHIP
35 | Subqueries can be used to test membership of items in one group against another, by calling the test group in the subquery. We can use EXISTS / NOT EXISTS for this specifically. Some rules:
36 |
37 | - EXISTS and NOT EXISTS can ONLY be used in subqueries
38 | - They are similar to the IN and NOT IN operators, but those can be used in any query
39 | - Cannot be preceded by a column name or any other expression
40 | - Returns TRUE/FALSE logical statements
41 | - Since the only concern for the subquery is whether it is TRUE/FALSE, can use SELECT * in the subquery
42 |
43 | Example: Retrieve a list of all the users in the users table who were also in the dogs table, using the EXISTS operator.
44 | ```sql
45 | SELECT DISTINCT u.user_guid AS uUserID
46 | FROM users u
47 | WHERE EXISTS
48 | (SELECT *
49 | FROM dogs d
50 | WHERE u.user_guid = d.user_guid);
51 | ```
52 | Example: Find the stores that exist in one or more cities.
53 | ```sql
54 | SELECT DISTINCT store_type
55 | FROM stores
56 | WHERE EXISTS (
57 | SELECT *
58 | FROM cities_stores
59 | WHERE cities_stores.store_type = stores.store_type);
60 | ```
61 | ### #3: SUBQUERIES FOR LOGIC WITH DERIVED TABLES
62 | Subqueries can be more elegant than joins, especially when they let us select/exclude rows more efficiently than a lengthy join command. In addition, we can fix the problem of duplicates immediately, instead of having to patch it with a GROUP BY clause afterwards.
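Here is a minimal before/after sketch of that duplicate fix, using the same users and dogs tables (and assuming a user_guid can appear more than once in users):

```sql
-- Deduplicate INSIDE the subquery, so the join cannot
-- multiply rows for users listed twice in the users table.
-- Without the derived table, each duplicated user row would
-- inflate its dog counts, and we'd have to patch it afterwards.
SELECT clean.user_guid, COUNT(d.dog_guid) AS NumDogs
FROM (SELECT DISTINCT user_guid FROM users) AS clean
LEFT JOIN dogs d
ON clean.user_guid = d.user_guid
GROUP BY clean.user_guid;
```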
63 |
64 | ##### Rules for subqueries
65 |
66 | - We are required to give an alias to any derived table we create in subqueries within FROM statements.
67 | - We need to use this alias every time we want to execute a function that uses the derived table.
68 | - Aliases used within subqueries CAN refer to tables OUTSIDE of the subqueries. However, outer queries cannot refer to aliases created within subqueries unless those aliases are explicitly part of the subquery output.
69 | - If using LIMIT with derived tables, put the LIMIT inside the subquery that builds the derived table. If you put it in the outermost query, the db will still have to hold the huge inner derived table in memory, which will make your query slow.
70 |
71 | Example: We want a list of each dog a user in the users table owns, with its accompanying breed information whenever possible.
72 | ```sql
73 | SELECT
74 | clean.user_guid AS uUserID,
75 | d.user_guid AS dUserID,
76 | count(*) AS NumDogs
77 | FROM
78 | (SELECT DISTINCT u.user_guid
79 | FROM users u)
80 | AS clean
81 | LEFT JOIN dogs d
82 | ON clean.user_guid=d.user_guid
83 | GROUP BY clean.user_guid
84 | ORDER BY NumDogs DESC;
85 | ```
86 | The query we just wrote extracts the distinct user_guids from the users table first, and then left joins that reduced subset of user_guids on the dogs table. As mentioned at the beginning of the lesson, since the subquery is in the FROM statement, it actually creates a temporary table, called a derived table, that is then incorporated into the rest of the query.
87 |
88 | Example: Write a query to retrieve a full list of all the DogIDs a user in the users table owns. Add dog breed and dog weight to the columns that will be included in the final output of your query. In addition, use a HAVING clause to include only UserIDs who would have more than 10 rows in the output of the left join.
89 | ```sql
90 | SELECT
91 | APPLES.user_guid AS uUserID,
92 | d.user_guid AS dUserID,
93 | d.breed,
94 | d.weight,
95 | count(*) AS numrows
96 | FROM
97 | (SELECT DISTINCT u.user_guid
98 | FROM users u)
99 | AS APPLES
100 | LEFT JOIN dogs d
101 | ON APPLES.user_guid=d.user_guid
102 | GROUP BY APPLES.user_guid
103 | HAVING numrows > 10
104 | ORDER BY numrows DESC;
105 | ```
106 |
107 | # OPERATORS
108 |
109 | * IF
110 | * CASE
111 | * NOT, AND, OR
112 |
113 | ### #1: OPERATORS - IF
114 |
115 | Can segment queries conditionally using IF, especially if the situation has clear true/false conditions. IF statements can also be nested. Note on IF syntax:
116 |
117 | ```
118 | IF(condition, value_if_true, value_if_false)
119 | ```
120 | Example: Count the number of users in America, and outside America. Output 2 columns with the groups America, Not in America, and the count for each. Exclude all null values.
121 |
122 | ```sql
123 | SELECT
124 | IF(cleanedset.country = 'US','In America','Not in America') AS Location,
125 | COUNT(cleanedset.country) AS 'Number of Users'
126 | FROM
127 | (SELECT DISTINCT user_guid, country
128 | FROM users
129 | WHERE user_guid IS NOT NULL
130 | AND country IS NOT NULL)
131 | AS cleanedset
132 | GROUP BY Location;
133 | ```
134 | Example: Sort users by early users and late users. Print the total number of users in each group. Early users = those who signed up before 1 June 2014.
135 | ```sql
136 | SELECT
137 | IF(cleaned_users.first_account<'2014-06-01','early_user','late_user') AS user_type,
138 | COUNT(cleaned_users.first_account)
139 | FROM
140 | (SELECT user_guid,
141 | MIN(created_at) AS first_account
142 | FROM users
143 | GROUP BY user_guid)
144 | AS cleaned_users
145 | GROUP BY user_type;
146 | ```
147 |
148 | #### Nested IF Example
149 | Print all users and their country status.
150 | ```sql
151 | SELECT
152 | IF(cleaned_users.country='US','In US',
153 | IF(cleaned_users.country='N/A','Not Applicable','Outside US'))
154 | AS US_user,
155 | count(cleaned_users.user_guid)
156 | FROM
157 | (SELECT DISTINCT user_guid, country
158 | FROM users
159 | WHERE country IS NOT NULL)
160 | AS cleaned_users
161 | GROUP BY US_user;
162 | ```
163 | Example: For each dog, output its dog ID, breed_type, number of completed tests, and use an IF statement to include an extra column that reads "Pure_Breed" whenever breed_type equals "Pure Breed" and "Not_Pure_Breed" whenever breed_type equals anything else.
164 | ```sql
165 | SELECT DISTINCT
166 | d.dog_guid AS 'Dog ID',
167 | IF(d.breed_type="Pure Breed", 'Pure_Breed', 'Not_Pure_Breed') AS 'Breed Type',
168 | count(c.created_at) AS 'Num Tests Done'
169 | FROM dogs d
170 | LEFT JOIN complete_tests c
171 | ON d.dog_guid = c.dog_guid
172 | WHERE d.dog_guid IS NOT NULL
173 | GROUP BY d.dog_guid
174 | ORDER BY count(c.created_at) DESC
175 | LIMIT 50;
176 | ```
177 | However, you can see this gets inefficient to write as the number of conditions increases. For those cases, it is better to use CASE.
178 |
179 | ### #2: OPERATORS - CASE
180 |
181 | Syntax for CASE:
182 | ```
183 | SELECT
184 | apples,
185 | oranges,
186 | CASE
187 | WHEN ..... (condition) THEN .... (label)
188 | WHEN ..... (condition) THEN .... (label)
189 | ELSE .... (optional default) END AS new_column -- ps: no commas needed within
190 | FROM tablename
191 | ```
192 | Example: Print cases of users based on their country locations.
193 | ```sql
194 | SELECT
195 | CASE
196 | WHEN cleaned_users.country="US" THEN "In US"
197 | WHEN cleaned_users.country="N/A" THEN "Not Applicable"
198 | ELSE "Outside US"
199 | END AS US_user,
200 | count(cleaned_users.user_guid)
201 | FROM
202 | (SELECT DISTINCT user_guid, country
203 | FROM users
204 | WHERE country IS NOT NULL)
205 | AS cleaned_users
206 | GROUP BY US_user
207 | ORDER BY count(cleaned_users.user_guid);
208 | ```
209 | Example: Write a query to label each dog with a weight group, based on the range its weight falls into.
210 | ```sql
211 | SELECT
212 | DISTINCT dog_guid,
213 | breed,
214 | weight,
215 | CASE
216 | WHEN weight<=10 THEN "very small"
217 | WHEN weight>10 AND weight<=30 THEN "small"
218 | WHEN weight>30 AND weight<=50 THEN "medium"
219 | WHEN weight>50 AND weight<=85 THEN "large"
220 | WHEN weight>85 THEN "very large"
221 | END AS Category
222 | FROM dogs
223 | WHERE weight > 0
224 | LIMIT 200;
225 | ```
226 | Example: Binary tree question. Find the root, inner and leaf nodes.
227 | ```sql
228 | SELECT N,
229 | CASE
230 | WHEN P IS NULL THEN "Root" -- capitalisation matters inside the quotes
231 | WHEN N IN (SELECT P FROM BST) THEN "Inner"
232 | ELSE "Leaf"
233 | END
234 | FROM BST
235 | ORDER BY N;
236 | ```
237 | ### #3: OPERATORS - NOT, AND, OR
238 |
239 | These operators can be used to make true/false logic statements. They are evaluated in this order: NOT, then AND, then OR. This means that any NOT statements will be evaluated first, followed by AND, then OR.
240 |
241 | > CASE WHEN "condition 1" OR "condition 2" AND "condition 3"...
242 |
243 | will lead to different results than this expression:
244 |
245 | > CASE WHEN "condition 3" AND "condition 1" OR "condition 2"...
246 |
247 | or this expression:
248 |
249 | > CASE WHEN ("condition 1" OR "condition 2") AND "condition 3"...
250 |
251 | In the first case you will get rows that meet condition 2 and 3, or condition 1.
In the second case you will get rows that meet condition 1 and 3, or condition 2. In the third case, you will get rows that meet condition 1 or 2, and condition 3.
252 |
253 |
--------------------------------------------------------------------------------
/Teradata Cheatsheet.md:
--------------------------------------------------------------------------------
1 | # Teradata Cheatsheet
2 |
3 | This document is a compilation of differences between MySQL and the SQL
4 | dialect Teradata uses with regards to major commands. It was made with
5 | reference to course notes from Duke University's "Managing Big Data with
6 | MySQL" course.
7 |
8 | This document assumes that one is already familiar with
9 | some SQL or MySQL, as it mainly serves to point out the differences between them.
10 |
11 | Date created: 18 March 2017
12 |
13 | ### Set Database
14 |
15 | To select the database, enter ``DATABASE [name];`` into the SQL scratchpad.
16 |
17 | ### Explore Database
18 |
19 | To display the tables and columns in a database:
20 |
21 | ```sql
22 | HELP TABLE [name]
23 |
24 | HELP COLUMN [tablename].[columnname]
25 | ```
26 | *Note: Don't include the brackets when executing the query.*
27 |
28 | ### Primary Keys
29 |
30 | To confirm which are the primary keys of a table:
31 |
32 | ```sql
33 | SHOW TABLE [name];
34 | ```
35 | *Note: Don't include the brackets when executing the query.*
36 |
37 | ### Restricting Query Output
38 |
39 | Teradata uses TOP instead of LIMIT to restrict output.
40 | To select the first 10 rows:
41 |
42 | ```sql
43 | SELECT TOP 10 student_IDs
44 | FROM class_info;
45 | ```
46 |
47 | To select 10 random rows instead:
48 |
49 | ```sql
50 | SELECT student_IDs
51 | FROM class_info
52 | SAMPLE 10;
53 | ```
54 |
55 | To select 10% of all rows instead:
56 |
57 | ```sql
58 | SELECT student_IDs
59 | FROM class_info
60 | SAMPLE .10;
61 | ```
62 | *Note: The last two commands will return a different selection of rows each time.*
63 |
64 | ### Aggregation & Group By
65 |
66 | Any non-aggregate column in the ``SELECT`` list or ``HAVING`` list of a query with
67 | a ``GROUP BY`` clause must also be listed in the ``GROUP BY`` clause. Unlike MySQL,
68 | Teradata will not pick a random selection to populate a field that cannot be aggregated.
69 |
70 | This will not run:
71 | ```sql
72 | SELECT shopname, clothes_ID, cost
73 | FROM shop
74 | GROUP BY shopname
75 | ```
76 | However, this will run:
77 | ```sql
78 | SELECT shopname, clothes_ID, avg(cost) -- find average to aggregate this column
79 | FROM shop
80 | GROUP BY shopname, clothes_ID -- group by non-aggregates
81 | ```
82 | ### Operators
83 |
84 | Both Teradata and MySQL accept the symbols ``<>`` for *not equals to*, but
85 | Teradata does not accept ``!=``.
86 |
87 | ### String selection
88 |
89 | Teradata only accepts **single quotation marks** around strings.
90 |
91 | ### Date Time Format
92 |
93 | Teradata will output dates in the format ``YY-MM-DD``. However, it expects dates
94 | to be entered in ``YYYY-MM-DD``.
95 |
96 | ``TIMESTAMPDIFF(hour/minute/second, var1, var2)``
97 | calculates the difference between 2 variables in the specified unit.
98 |
99 | ``DAYOFWEEK(datevar)`` returns the day of the week as an
100 | integer from 1 - 7, where 1 = Sunday, 2 = Monday, etc.
101 |
102 | ### Extract Date
103 |
104 | The command for extracting parts of the datestamp returns the day/month/year as
105 | their respective numerical values.
106 |
107 | * ``EXTRACT(day FROM variable)`` returns the day of the month (1-31).
108 | * ``EXTRACT(month FROM variable)`` returns the month (1-12).
109 | * ``EXTRACT(year FROM variable)`` returns the year (``YYYY``).
110 |
111 | This can be used in such a manner to return a count of the number of days in each year and month:
112 |
113 | ```sql
114 | SELECT
115 | EXTRACT(month FROM datelog) AS month_num,
116 | EXTRACT(year FROM datelog) AS year_num,
117 | COUNT(DISTINCT EXTRACT(day FROM datelog)) AS days_per_month
118 | FROM catalog
119 | GROUP BY month_num, year_num;
120 | ```
121 |
122 | ### IF ELSE
123 |
124 | Teradata does *not* accept ``IF`` functions. However, we can replace them with ``CASE``.
125 |
126 |
127 |
--------------------------------------------------------------------------------
/Week2-Dillards.md:
--------------------------------------------------------------------------------
1 | # Week 2 - Dillard's Database Exercises
2 |
3 | Date created: 14 March 2017
4 |
5 | This is the COMPLETE answer key (including explanations where necessary)
6 | for Week 2 of the **"Managing Big Data with MySQL"** course by Duke University:
7 | 'Queries to Extract Data from Single Tables'.
8 |
9 | I wrote this answer key as no official answers have been released online.
10 | These answers reflect my own work and are accurate to the best of my knowledge.
11 | I will update them if the professors ever release an "official" answer key.
12 |
13 | **Update**: These answers are based on the original UA_Dillards dataset (not UA_Dillards1,
14 | nor UA_Dillards_2016). This means I am using the table ``SKSTINFO`` and not
15 | ``SKSTINFO_FIX``, which is the newer version.
16 |
17 | Meanwhile, let's start.
18 |
19 | # Answers
20 |
21 | To start, enter ``DATABASE ua_dillards;`` into the Teradata SQL scratchpad.
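Before running the exercises, it can help to confirm you are pointed at the right database. A quick sanity check (a sketch — this assumes your account has permission to run Teradata's HELP DATABASE statement):

```sql
HELP DATABASE ua_dillards; -- lists the objects (tables) the database contains
```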
22 |
23 | ### Exercise 1
24 |
25 | **Use HELP and SHOW to confirm the relational schema provided to us for the
26 | Dillard's dataset shows the correct column names and primary keys for each table.**
27 |
28 | ```sql
29 | HELP TABLE strinfo
30 | HELP TABLE skstinfo
31 | HELP TABLE skuinfo
32 | HELP TABLE trnsact
33 | HELP TABLE deptinfo
34 | HELP TABLE store_msa
35 | ```
36 |
37 | Note: *The course's notes contain an error.* It suggests:
38 |
39 | > "To get information about a single column in a table, you could write:
40 | >
41 | > HELP COLUMN [name of column goes here; don't include the
42 | > brackets when executing the query]"
43 |
44 | This is incorrect. You need to specify **which table** the column is from too,
45 | as some column names are common to more than one table. The correct syntax should be
46 | ``HELP COLUMN tablename.columnname``. Thus, to find out more information
47 | about a single column, you should do this:
48 |
49 | ```sql
50 | HELP COLUMN skstinfo.sku
51 | HELP COLUMN skuinfo.sku
52 | HELP COLUMN trnsact.sku
53 | ...
54 | etc
55 | ```
56 |
57 | Lastly, to confirm which is the primary key of each table, do this:
58 | ``SHOW TABLE [tablename here -- but don't include the
59 | brackets when executing the query];``. When applied, it looks like this:
60 |
61 | ```sql
62 | SHOW TABLE strinfo
63 | SHOW TABLE skstinfo
64 | SHOW TABLE skuinfo
65 | SHOW TABLE trnsact
66 | SHOW TABLE deptinfo
67 | SHOW TABLE store_msa
68 | ```
69 |
70 | ### Exercise 2
71 |
72 | **Look at examples of data from each of the tables. Pay particular attention to
73 | the ``skuinfo`` table.**
74 |
75 | Things to note:
76 | - There are two types of transactions: purchases (P) and returns (R). We will need to
77 | make sure we specify which type we are interested in when running queries using the
78 | transaction table.
79 | - There are a lot of strange values in the "color", "style", and "size" fields of
80 | the skuinfo table.
The information recorded in these columns is not always related to
81 | the column title (for example there are entries like "BMK/TOUR K" and "ALOE COMBO" in
82 | the color field, even though those entries do not represent colors).
83 | - The department descriptions (``deptdesc`` from ``DEPTINFO``) seem to represent brand
84 | names. However, if you look at entries in the skuinfo table from only one department,
85 | you will see that many brands are in the same department.
86 |
87 | ### Exercise 3
88 |
89 | **Examine lists of distinct values in each of the tables.**
90 |
91 | Okay...
92 |
93 | ### Exercise 4
94 |
95 | **Examine instances of transaction table where "amt" is different than "sprice".
96 | What did you learn about how the values in "amt", "quantity", and "sprice"
97 | relate to one another?**
98 |
99 | To query all rows where ``amt`` (total transaction amount) is different from
100 | ``sprice`` (sale price):
101 |
102 | ```sql
103 | SELECT *
104 | FROM trnsact
105 | WHERE amt <> sprice;
106 | ```
107 |
108 | We see 7 rows appear. What the rows have in common is that they are all return
109 | transactions (``R``), and have an ``INTERID`` of 000000000. The items, which were originally
110 | $20-$80 each, are now $0.10 to $1.00 each.
111 |
112 | ### Exercise 5
113 |
114 | Even though the Dillard's dataset had primary keys declared and there were not
115 | many NULL values, there are still many bizarre entries that likely reflect entry errors.
116 | To see some examples of these likely errors, examine:
117 |
118 | **(a) Rows in the trnsact table that have "0" in their orgprice column (how could the original
119 | price be 0?)**
120 |
121 | ```sql
122 | SELECT *
123 | FROM trnsact
124 | WHERE orgprice = 0;
125 | ```
126 | *Notes: There should be 1425811 rows where the original price = $0.00, or approx 1.18%
127 | of all rows in the ``TRNSACT`` table.
There appears to be nothing in common between these items.* 128 | 129 | **(b) Rows in the skstinfo table where both the cost and retail price are listed as 0.00** 130 | 131 | ```sql 132 | SELECT * 133 | FROM skstinfo 134 | WHERE cost = '0' 135 | AND retail = '0'; 136 | ``` 137 | 138 | *Notes: There should be 350340 rows where both the cost and retail price = $0.00, or 139 | approx 0.89% of all rows in the ``SKSTINFO`` table. There appears to be nothing in common 140 | between these items.* 141 | 142 | **(c) Rows in the skstinfo table where the cost is greater than the retail price (although 143 | occasionally retailers will sell an item at a loss for strategic reasons, it is very 144 | unlikely that a manufacturer would provide a suggested retail price that is lower than 145 | the cost of the item).** 146 | 147 | ```sql 148 | SELECT * 149 | FROM skstinfo 150 | WHERE cost > retail 151 | AND retail > '0'; -- to exclude erroneous values 152 | ``` 153 | 154 | *Notes: There should be 7535205 rows where cost price is greater than retail price. 155 | This forms approx 19.2% of all rows in the ``SKSTINFO`` table.* 156 | 157 | ### Exercise 6 158 | 159 | **Write your own queries that retrieve multiple columns in a precise order from 160 | a table, and that restrict the rows retrieved from those columns using “BETWEEN”, “IN”, 161 | and references to text strings. Try at least one query that uses dates to restrict the rows 162 | you retrieve.** 163 | 164 | Okay... 165 | 166 | ```sql 167 | SELECT count(store) 168 | FROM strinfo 169 | WHERE state = 'NY'; 170 | ``` 171 | Seems like New York has only 2 stores. Actually, let's explore how many stores there are 172 | in each state, and see who has the most. 
173 | 174 | ```sql 175 | SELECT STATE, COUNT(STORE) 176 | FROM strinfo 177 | GROUP BY STATE 178 | ORDER BY COUNT(STORE) DESC; 179 | ``` 180 | 181 | | State | Stores | 182 | | ---- | ----- | 183 | | TX | 79 184 | | FL | 48 185 | | AR | 27 186 | | AZ | 26 187 | | OH | 25 188 | 189 | Okay, let's try to find the earliest and latest sale date in this dataset. 190 | 191 | ```sql 192 | SELECT distinct saledate 193 | FROM trnsact 194 | ORDER BY saledate ASC; 195 | 196 | SELECT distinct saledate 197 | FROM trnsact 198 | ORDER BY saledate DESC; -- I'm too lazy to scroll. 199 | ``` 200 | Earliest date: ``04/08/01``. Latest date: ``05/08/27``. Seems like we have 389 dates on 201 | record. 202 | 203 | Let's mess around further, and see which dates have the highest number of transactions. 204 | I bet that the total number of transactions will peak on 24 Dec (aka right before Christmas). 205 | Let's check: 206 | ```sql 207 | SELECT saledate, count(saledate) 208 | FROM trnsact 209 | GROUP BY saledate 210 | ORDER BY count(saledate) DESC; 211 | ``` 212 | HOLY CRAP. I am so wrong. Here are the top 10 dates with the highest transactions: 213 | 214 | | No. | Date | Transactions | 215 | | ---- | ---- | ---- | 216 | | 1 | 05/02/26 | 1198813 217 | | 2 | 05/02/25 | 947451 218 | | 3 | 05/02/24 | 888352 219 | | 4 | 05/07/30 | 875042 220 | | 5 | 05/02/23 | 855037 221 | | 6 | 05/08/27 | 771760 222 | | 7 | 04/10/02 | 758200 223 | | 8 | 04/12/18* | 744268 224 | | 9 | 04/11/26 | 690396 225 | | 10 | 04/12/23* | 675139 226 | 227 | Seems like Christmas doesn't even come close. WTF? Let's find out what happened 228 | on ``05/02/26``. 229 | 230 | According to Google, it seems like they had the [mother of all sales](https://sgbonline.com/dillards-february-comps-increase-5-percent/ "DillardsReport"). 231 | 232 | Well, that must have been one epic sale. 
Because judging by the number of transactions, it 233 | appears that people spent **1.75x** more on 25th and 26th Feb than on the 2 days leading up 234 | to Christmas (23rd and 24th Dec; *I excluded 25th Dec because Dillards was not open on Christmas Day*). 235 | 236 | ![alt text](https://cdn.meme.am/instances/400x/64773524.jpg) 237 | 238 | I don't understand, America. How do you spend more for yourself *in a single day* 239 | than for all your friends and cousins combined? 240 | 241 | Anyway, that's all the questions for this exercise. *I've spent an hour on this already and it's 242 | 3am here. :(* 243 | 244 | One final note from the assignment: while **dates** will be output as: 245 | 246 | ``YY-MM-DD`` 247 | 248 | during queries, **date** strings should be entered as: 249 | 250 | ``YYYY-MM-DD``. 251 | 252 | *Thanks for reading, hope this was useful to you. I had fun writing this!* 253 | 254 | 255 | -------------------------------------------------------------------------------- /Week3-Dillards.md: -------------------------------------------------------------------------------- 1 | # Week 3 - Dillard's Database Exercises 2 | 3 | Date created: 17 March 2017 4 | 5 | This is the COMPLETE answer key (including explanations where necessary) 6 | for Week 3 of the "Managing Big Data with MySQL" course 7 | by Duke University. 8 | 9 | I wrote this answer key as no official answers have been released online. 10 | These answers reflect my own work and are accurate to the best of my knowledge. 11 | I will update them if the professors ever release an "official" answer key. 12 | 13 | Update: These answers are based on the original UA_Dillards dataset 14 | (not UA_Dillards1, nor UA_Dillards_2016). For example, this means I am using 15 | the table ``SKSTINFO`` and not ``SKSTINFO_FIX``, which is the newer version. 16 | 17 | Now, let's start. 
18 | 19 | # Answers 20 | 21 | To start, enter ``DATABASE ua_dillards;`` into the Teradata SQL scratchpad. 22 | 23 | ### Exercise 1 24 | 25 | **(a) Use COUNT and DISTINCT to determine how many distinct skus there are in 26 | pairs of the skuinfo, skstinfo, and trnsact tables. Which skus are common to 27 | pairs of tables, or unique to specific tables?** 28 | 29 | ```sql 30 | SELECT COUNT(DISTINCT a.sku) 31 | FROM skuinfo a 32 | JOIN skstinfo b 33 | ON a.sku = b.sku; 34 | 35 | SELECT COUNT(DISTINCT a.sku) 36 | FROM skuinfo a 37 | JOIN trnsact b 38 | ON a.sku = b.sku; 39 | 40 | SELECT COUNT(DISTINCT a.sku) 41 | FROM skstinfo a 42 | JOIN trnsact b 43 | ON a.sku = b.sku; 44 | ``` 45 | 46 | Results: 47 | 48 | | Combi | Pair 1 | Pair 2 | Distinct SKU | 49 | | ----- | ------ | ------ | ------------ | 50 | | 1 | skuinfo | skstinfo | 760212 | 51 | | 2 | skuinfo | trnsact | 714499 | 52 | | 3 | skstinfo | trnsact | 542513 | 53 | 54 | To test which ``SKU``s are in which tables: 55 | 56 | ```sql 57 | SELECT a.sku, b.sku 58 | FROM skuinfo a 59 | LEFT JOIN skstinfo b 60 | ON a.sku = b.sku 61 | WHERE b.sku IS NULL; 62 | 63 | SELECT a.sku, b.sku 64 | FROM skuinfo a 65 | LEFT JOIN trnsact b 66 | ON a.sku = b.sku 67 | WHERE b.sku IS NULL; 68 | 69 | ``` 70 | * All items in ``SKSTINFO`` are listed in ``SKUINFO``, but not vice versa 71 | * All items in ``TRNSACT`` are listed in ``SKUINFO``, but not vice versa 72 | 73 | **(b) Use COUNT to determine how many instances there are of each sku associated 74 | with each store in the skstinfo table and the trnsact table?** 75 | 76 | ```sql 77 | SELECT sku, store, COUNT(sku) 78 | FROM skstinfo 79 | GROUP BY sku, store; 80 | ``` 81 | Seems like there's only one instance of each sku-store combo in the ``SKSTINFO`` table. 82 | ```sql 83 | SELECT sku, store, COUNT(sku) 84 | FROM trnsact 85 | GROUP BY sku, store; 86 | ``` 87 | Seems like there are multiple instances of each sku-store combo in the ``TRNSACT`` table. 
88 | 89 | *Notes from lecture: You should see there are multiple instances of every 90 | sku/store combination in the ``trnsact`` table, but only one instance of every 91 | sku/store combination in the ``skstinfo`` table. Therefore you could join the 92 | ``trnsact`` and ``skstinfo`` tables, but you would need to join them on both of the 93 | following conditions: ``trnsact.sku = skstinfo.sku`` AND ``trnsact.store = skstinfo.store``.* 94 | 95 | ### Exercise 2 96 | 97 | **(a) Use COUNT and DISTINCT to determine how many distinct stores there are in the 98 | strinfo, store_msa, skstinfo, and trnsact tables.** 99 | 100 | ```sql 101 | SELECT COUNT(DISTINCT store) 102 | FROM strinfo; 103 | 104 | SELECT COUNT(DISTINCT store) 105 | FROM skstinfo; 106 | 107 | SELECT COUNT(DISTINCT store) 108 | FROM store_msa; 109 | 110 | SELECT COUNT(DISTINCT store) 111 | FROM trnsact; 112 | ``` 113 | 114 | |Table Name | Unique Stores | 115 | | --------- | ------------- | 116 | | STRINFO | 453 117 | | SKSTINFO | 357 118 | | STORE_MSA | 333 119 | | TRNSACT | 332 120 | 121 | **(b) Which stores are common to all four tables, or unique to specific tables?** 122 | 123 | Since we know that ALL stores can be found in the ``STRINFO`` table, we can left join 124 | the three other tables to it. 125 | 126 | ```sql 127 | SELECT a.store, b.store, c.store, d.store 128 | FROM strinfo a 129 | LEFT JOIN skstinfo b 130 | ON a.store = b.store 131 | LEFT JOIN trnsact c 132 | ON a.store = c.store 133 | LEFT JOIN store_msa d 134 | ON a.store = d.store; -- join on a.store, not c.store, so NULLs from the trnsact join don't drop rows 135 | ``` 136 | 137 | ### Exercise 3 138 | 139 | It turns out there are many skus in the trnsact table that are not in the skstinfo 140 | table. As a consequence, we will not be able to complete many desirable analyses of 141 | Dillard’s profit, as opposed to revenue, because we do not have the cost information 142 | for all the skus in the ``trnsact`` table (recall that profit = revenue - cost). 
143 | 144 | **Examine some of the rows in the trnsact table that are not in the skstinfo table; 145 | can you find any common features that could explain why the cost information is missing?** 146 | 147 | ```sql 148 | SELECT * 149 | FROM trnsact a 150 | LEFT JOIN skstinfo b 151 | ON a.sku = b.sku AND a.store = b.store 152 | WHERE b.sku IS NULL; 153 | ``` 154 | This returns every column of the rows that are in ``TRNSACT`` but 155 | **not in** ``SKSTINFO``. Honestly, I can't see much difference just eyeballing it. 156 | There are 52,338,840 such rows, or approx 43.3% of the roughly 120 million rows in the table. 157 | 158 | To check how many of them are *unique*: 159 | 160 | ```sql 161 | SELECT a.sku, a.store 162 | FROM trnsact a 163 | LEFT JOIN skstinfo b 164 | ON a.sku = b.sku AND a.store = b.store 165 | WHERE b.sku IS NULL 166 | GROUP BY a.sku, a.store; -- GROUP BY already de-duplicates, so DISTINCT would be redundant 167 | ``` 168 | 169 | That leaves exactly 17,816,793 sku-store combinations found in the transactions table 170 | that are not listed in the master ``skstinfo`` table. I still can't tell what's 171 | unique about the missing values, so let's see what the next question is and 172 | come back to this later. 173 | 174 | ### Exercise 4 175 | 176 | **Although we can’t complete all the analyses we’d like to on Dillard’s profit, 177 | we can look at general trends. What is Dillard’s average profit per day?** 178 | 179 | Assumptions: 180 | 181 | 1. With **over 40% of the necessary data missing** (see Exercise 3), whatever data we 182 | have left is accurate and worth calculating -.-" 183 | 2. For each transaction recorded (row), only 1 type of item is purchased at a time. 184 | In other words, that: 185 | 186 | > Total amount paid per transaction = number of items x price of each item. 187 | 188 | This is important because if each transaction contains numerous items of different prices, 189 | we will lack the necessary information about the unique composition of each transaction to make 190 | this query. 
191 | 192 | Back to the question, 193 | 194 | > Profit = revenue - cost 195 | 196 | This can be written as 197 | 198 | ``PROFIT = trnsact.amt - (trnsact.quantity * skstinfo.cost)`` 199 | 200 | Further, since we want the **average** daily profit, we can divide the total profit 201 | by the number of days, i.e. by ``COUNT(DISTINCT saledate)``. 202 | 203 | Overall, we can build the rest of the query around it like so: 204 | 205 | ```sql 206 | SELECT SUM(a.amt - a.quantity*b.cost)/COUNT(DISTINCT a.saledate) -- avg profit 207 | FROM trnsact a 208 | LEFT JOIN SKSTINFO b 209 | ON a.sku = b.sku AND a.store = b.store 210 | WHERE a.stype = 'P'; -- purchases only 211 | ``` 212 | This returns an average profit of ``$1,527,903.46`` per day. Let's check this 213 | against what the question expects - that the average profit for Register 640 214 | should be ``$10,779.20``. 215 | 216 | ```sql 217 | SELECT SUM(a.amt - a.quantity*b.cost)/COUNT(DISTINCT a.saledate) 218 | FROM trnsact a 219 | LEFT JOIN SKSTINFO b 220 | ON a.sku = b.sku AND a.store = b.store 221 | WHERE a.stype = 'P' 222 | AND register = '640'; 223 | ``` 224 | The answer is correct. 225 | 226 | ### Exercise 5 227 | 228 | **On what day was the total value (in $) of returned goods the greatest?** 229 | 230 | ```sql 231 | SELECT saledate, sum(amt) -- I didn't limit this cos I'm kaypoh 232 | FROM trnsact 233 | WHERE stype = 'R' 234 | GROUP BY saledate 235 | ORDER BY sum(amt) DESC; 236 | ``` 237 | 238 | To select only the day with the *greatest* value, add ``TOP 1`` after ``SELECT`` (Teradata's equivalent of MySQL's ``LIMIT 1``). 
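In Teradata syntax, collapsing the ranked list above to just the top day would look something like this (a sketch; I haven't re-run it against the database):

```sql
-- Return only the single day with the greatest total value of returns.
-- Teradata uses TOP n; in MySQL you would drop TOP 1 and append LIMIT 1 instead.
SELECT TOP 1 saledate, SUM(amt)
FROM trnsact
WHERE stype = 'R'
GROUP BY saledate
ORDER BY SUM(amt) DESC;
```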
239 | 240 | | Sale date | Total value of returned goods | 241 | | --------- | ----------------------------- | 242 | | **04/12/27** | **$3,030,259.76** 243 | | 04/12/26 | $2,665,283.86 244 | | 04/12/28 | $2,332,544.44 245 | | 04/12/29 | $1,983,898.91 246 | | 04/12/30 | $1,884,052.85 247 | | 04/12/31 | $1,631,004.76 248 | | 05/01/08 | $1,438,745.35 249 | | 05/02/26 | $1,403,971.89 250 | | 05/01/03 | $1,357,311.82 251 | | 05/01/02 | $1,270,440.95 252 | 253 | **On what day was the total number of individual returned items the greatest?** 254 | 255 | ```sql 256 | SELECT saledate, sum(quantity) 257 | FROM trnsact 258 | WHERE stype = 'R' 259 | GROUP BY saledate 260 | ORDER BY sum(quantity) DESC; 261 | ``` 262 | 263 | | Sale date | Total num of returned goods | 264 | | --------- | ----------------------------- | 265 | | **04/12/27** | **82512** | 266 | |04/12/26|71710 267 | |04/12/28|64265 268 | |05/02/26|62462 269 | |04/12/29|55356 270 | |05/02/25|54597 271 | |04/12/30|53171 272 | |05/02/24|49199 273 | |05/07/30|46436 274 | |05/08/27|45704 275 | 276 | Well, at least it appears that there is some correlation between the two results. 277 | 278 | ### Exercise 6 279 | 280 | **What is the maximum price paid for an item in our database? What is the minimum price 281 | paid for an item in our database?** 282 | 283 | I'm not sure whether the tables are reliable, so I am going to check all possible values 284 | from ``skstinfo.retail``, ``trnsact.orgprice`` and ``trnsact.sprice``. 
285 | 286 | ```sql 287 | SELECT max(orgprice) 288 | FROM trnsact 289 | WHERE stype = 'P'; 290 | 291 | SELECT min(orgprice) 292 | FROM trnsact 293 | WHERE stype = 'P'; 294 | 295 | SELECT max(sprice) 296 | FROM trnsact 297 | WHERE stype = 'P'; 298 | 299 | SELECT min(sprice) 300 | FROM trnsact 301 | WHERE stype = 'P'; 302 | 303 | SELECT max(retail) 304 | FROM skstinfo; 305 | 306 | SELECT min(retail) 307 | FROM skstinfo; 308 | ``` 309 | 310 | | Source | Max price | Min price | 311 | | ----------- | -----| --------- | 312 | | skstinfo.retail | 6017.00 | 0.00 | 313 | | trnsact.orgprice | 6017.00 | 0.00 | 314 | | trnsact.sprice | 6017.00 | 0.00 | 315 | 316 | It's nice that they are consistent. Being careful pays off. It appears safe to conclude that 317 | the **maximum price** for any item is ``$6017.00`` and the **minimum price** is ``$0.00``. 318 | 319 | ### Exercise 7 320 | 321 | **How many departments have more than 100 brands associated with them, and what are their 322 | descriptions?** 323 | 324 | ```sql 325 | SELECT DISTINCT a.dept, b.deptdesc, count(distinct a.brand) 326 | FROM skuinfo a 327 | LEFT JOIN deptinfo b 328 | ON a.dept=b.dept 329 | GROUP BY a.dept, b.deptdesc 330 | HAVING count(distinct a.brand) > 100; 331 | ``` 332 | 333 | There are **three** departments with more than 100 brands associated with them, and these are their 334 | descriptions: 335 | 336 | | Department ID | Description | Num brands | 337 | | ----------- | -----| --------- | 338 | |4407 | ENVIRON | 389 339 | | 7104 | CARTERS | 109 340 | | 5203 | COLEHAAN | 118 341 | 342 | ### Exercise 8 343 | 344 | **Write a query that retrieves the department descriptions of each of the skus in the skstinfo 345 | table.** 346 | 347 | ```sql 348 | SELECT a.sku, c.deptdesc 349 | FROM skstinfo a 350 | LEFT JOIN skuinfo b 351 | ON a.sku = b.sku 352 | LEFT JOIN deptinfo c 353 | ON b.dept = c.dept 354 | SAMPLE 100; -- remove this during exam 355 | ``` 356 | The department description for ``SKU5020024`` is ``LESLIE``. 
357 | 358 | ### Exercise 9 359 | 360 | **What department (with department description), brand, style, and color had the greatest total 361 | value of returned items?** 362 | 363 | ### Exercise 10 364 | 365 | **In what state and zip code is the store that had the greatest total revenue during the time 366 | period monitored in our dataset?** 367 | 368 | *Note: There is an error in the notes. The question should ask for state and **city**, not just **zip**. 369 | The assignment statement provided (below) suggests that you are expected to know the city too.* 370 | 371 | > "If you have written your query correctly, you will find that the department with the 372 | 10th highest total revenue is in Hurst, TX." 373 | 374 | ```sql 375 | SELECT b.state, b.zip, b.city, SUM(a.amt) -- no need to include sum(a.amt), but this is good for checking. 376 | FROM strinfo b 377 | LEFT JOIN trnsact a 378 | ON a.store = b.store 379 | WHERE a.stype = 'P' 380 | GROUP BY b.state, b.zip, b.city -- every non-aggregated column in the SELECT must be grouped 381 | ORDER BY SUM(a.amt) DESC; 382 | 383 | ``` 384 | 385 | | State | ZIP | City | Total Revenue | 386 | | ----- | --- | ---- | ------------------ | 387 | | LA | 70002 | METAIRIE |$24,171,426.58 388 | |AR |72205 |LITTLE ROCK |$22,792,579.65 389 | |TX |78501 |MCALLEN |$22,331,884.55 390 | |TX |75225 |DALLAS |$22,063,797.73 391 | |KY |40207 |LOUISVILLE| $20,114,154.20 392 | |TX |77056 |HOUSTON| $19,040,376.84 393 | |KS |66214 |OVERLAND PARK |$18,642,976.76 394 | |OK |73118 |OKLAHOMA CITY |$18,458,644.39 395 | |TX |78216 |SAN ANTONIO |$18,455,775.63 396 | | **TX** | **76053** | **HURST** | **$17,740,181.20** 397 | 398 | The answer is correct. The store with the 10th highest revenue is in ``Hurst, TX``, with ``$17,740,181.20``. 
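P.S. I never wrote out an answer for Exercise 9 above, so here is an untested sketch of one way to attack it: join the returns in ``trnsact`` to ``skuinfo`` (for brand, style, color) and ``deptinfo`` (for the description), then rank by total returned value. Treat the exact column list as an assumption based on the schema used throughout these notes:

```sql
-- Sketch only (not verified against the database):
-- total $ value of returns per department/brand/style/color, highest first.
SELECT TOP 1
    c.dept,
    c.deptdesc,
    b.brand,
    b.style,
    b.color,
    SUM(a.amt) AS total_returned
FROM trnsact a
JOIN skuinfo b
    ON a.sku = b.sku
JOIN deptinfo c
    ON b.dept = c.dept
WHERE a.stype = 'R'   -- returns only
GROUP BY c.dept, c.deptdesc, b.brand, b.style, b.color
ORDER BY total_returned DESC;
```

Drop the ``TOP 1`` if you want to eyeball the full ranking.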
399 | -------------------------------------------------------------------------------- /Week3Ex7-InnerJoin.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 3 Exercise 7 - Inner Joins 2 | 3 | This is the COMPLETE answer key (including explanations) for Week 3 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 4 | Date created: 15 March 2017 5 | 6 | */ 7 | 8 | -- BOX 1: LOAD SERVER 9 | 10 | %load_ext sql 11 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 12 | %sql USE dognitiondb 13 | 14 | 15 | 16 | -- BOX 2 17 | -- Note: This should throw an error. This is to demonstrate what the error looks like. 18 | 19 | SELECT 20 | dog_guid AS DogID, 21 | user_guid AS UserID, 22 | AVG(rating) AS AvgRating, 23 | COUNT(rating) AS NumRatings, 24 | breed, breed_group, breed_type 25 | FROM dogs, reviews 26 | GROUP BY user_guid, dog_guid, breed, breed_group, breed_type 27 | HAVING NumRatings >= 10 28 | ORDER BY AvgRating DESC 29 | LIMIT 200; 30 | 31 | 32 | 33 | -- BOX 3 34 | -- Expected: 38 rows 35 | 36 | SELECT 37 | d.dog_guid AS DogID, 38 | d.user_guid AS UserID, 39 | AVG(r.rating) AS AvgRating, 40 | COUNT(r.rating) AS NumRatings, 41 | d.breed, 42 | d.breed_group, 43 | d.breed_type 44 | FROM dogs d, reviews r 45 | WHERE d.dog_guid=r.dog_guid 46 | AND d.user_guid=r.user_guid 47 | GROUP BY DogID, d.breed, d.breed_group, d.breed_type 48 | HAVING NumRatings >= 10 49 | ORDER BY AvgRating DESC 50 | LIMIT 200; 51 | 52 | 53 | 54 | -- BOX 4 55 | -- Expected: 389 rows 56 | 57 | /* IMPORTANT NOTE 58 | 59 | There is some discrepancy between what the question asks and what it actually wants. As the student mentors have admitted, this question could be better worded. Most of us (including me) got 395 rows the first time. 
This is the explanation of what went wrong, and how to fix it: 60 | 61 | Doing exactly as the question instructs, which is to run the query from BOX 3 without the HAVING and LIMIT clauses, most people got 395 rows as their answer. However, the question tells us to expect 389 rows instead. 62 | 63 | What do these answers represent? 64 | 65 | 395 rows is the number of unique DOG IDs common to both the dogs and reviews tables. 66 | 67 | 389 rows is the number of unique USER IDs common to both the dogs and reviews tables. 68 | 69 | Although we are technically right in following the assignment's exact instructions, the instructions themselves were misleading. 70 | 71 | The original purpose of this question was to explore if users who gave a high average surprise rating for their dog's performance were users who tend to have more than one dog of the same breed. Hence, the question should have prompted us to compare on the basis of USERS instead of DOG IDs, but the instructors forgot to tell us we could modify it. 72 | 73 | The correct query to get 389 rows should be: 74 | */ 75 | 76 | SELECT DISTINCT 77 | r.user_guid AS UserID, 78 | AVG(r.rating) AS AvgRating, 79 | COUNT(r.rating) AS NumRatings 80 | FROM dogs d, reviews r 81 | WHERE d.dog_guid=r.dog_guid 82 | AND d.user_guid=r.user_guid 83 | GROUP BY UserID 84 | ORDER BY AvgRating DESC; 85 | 86 | /* 87 | Note: The reason for this discrepancy (users vs dogs) is that some users have more than one dog. 
88 | */ 89 | 90 | 91 | 92 | -- BOX 5 QN 1 93 | -- Expected: 5991 (1 row) 94 | 95 | SELECT COUNT(DISTINCT dog_guid) 96 | FROM reviews 97 | 98 | 99 | 100 | -- BOX 6 QN 2 101 | -- Expected: 5586 (1 row) 102 | 103 | SELECT COUNT(DISTINCT user_guid) 104 | FROM reviews 105 | 106 | 107 | 108 | -- BOX 7 QN 3 109 | -- Expected: 30967 (1 row) 110 | 111 | SELECT COUNT(DISTINCT user_guid) 112 | FROM dogs 113 | 114 | 115 | 116 | -- BOX 8 QN 4 117 | -- Expected: 35050 (1 row) 118 | 119 | SELECT COUNT(DISTINCT dog_guid) 120 | FROM dogs 121 | 122 | 123 | 124 | -- BOX 9 125 | -- Expected: 5589 (1 row) 126 | 127 | SELECT COUNT(DISTINCT d.user_guid) 128 | FROM dogs d, 129 | reviews r 130 | WHERE d.user_guid=r.user_guid; 131 | 132 | -- OR, joining on dog IDs instead: 133 | 134 | -- Expected: 389 (1 row) 135 | 136 | SELECT COUNT(DISTINCT d.user_guid) 137 | FROM dogs d, 138 | reviews r 139 | WHERE d.dog_guid=r.dog_guid; 140 | 141 | 142 | 143 | -- BOX 10 QN 5 144 | -- Expected: 20845 rows 145 | 146 | SELECT 147 | c.user_guid, 148 | c.dog_guid, 149 | d.breed, 150 | d.breed_type, 151 | d.breed_group 152 | FROM complete_tests c, dogs d 153 | WHERE c.dog_guid=d.dog_guid 154 | AND test_name = "Yawn Warm-up"; 155 | 156 | 157 | 158 | -- BOX 11 QN 6 159 | -- Expected: 711 rows 160 | 161 | SELECT DISTINCT 162 | u.user_guid, 163 | u.membership_type, 164 | d.dog_guid, 165 | d.breed 166 | FROM complete_tests c, dogs d, users u 167 | WHERE c.dog_guid = d.dog_guid 168 | AND d.user_guid = u.user_guid 169 | AND d.breed = 'Golden Retriever'; 170 | 171 | 172 | 173 | -- BOX 12 QN 7 174 | -- Expected: 30 rows 175 | 176 | SELECT DISTINCT 177 | d.dog_guid, 178 | d.breed 179 | FROM dogs d, users u 180 | WHERE d.user_guid = u.user_guid 181 | AND d.breed = "Golden Retriever" 182 | AND u.state = 'NC'; 183 | 184 | 185 | 186 | -- BOX 12 QN 8 187 | -- Expected: 5 rows (first row should be 1, 2900) 188 | 189 | SELECT 190 | u.membership_type AS 'Membership Type', 191 | COUNT(DISTINCT r.user_guid) AS 'Total Reviews' 192 | FROM users u, reviews r 193 
| WHERE r.user_guid = u.user_guid 194 | AND r.rating IS NOT NULL 195 | GROUP BY u.membership_type 196 | ORDER BY COUNT(r.user_guid) DESC; 197 | 198 | 199 | 200 | -- BOX 13 QN 9 201 | -- Expected: 5 rows (first row should be 1, 2900) 202 | 203 | SELECT 204 | u.membership_type AS 'Membership Type', 205 | COUNT(DISTINCT r.user_guid) AS 'Total Reviews' 206 | FROM users u, reviews r 207 | WHERE r.user_guid = u.user_guid 208 | AND r.rating IS NOT NULL 209 | GROUP BY u.membership_type 210 | ORDER BY COUNT(r.user_guid) DESC; 211 | 212 | 213 | 214 | -- BOX 14 QN 10 215 | -- Expected: 3 rows (breeds should be mixed, golden retriever, and golden retriever-labrador mix) 216 | 217 | SELECT 218 | d.breed, 219 | COUNT(sa.script_detail_id) 220 | FROM dogs d, site_activities sa 221 | WHERE d.dog_guid = sa.dog_guid 222 | AND sa.script_detail_id IS NOT NULL 223 | GROUP BY d.breed 224 | ORDER BY COUNT(sa.script_detail_id) DESC 225 | LIMIT 3; 226 | 227 | -- END -- 228 | -------------------------------------------------------------------------------- /Week3Ex8-OuterJoins.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 3 Exercise 8 - Outer Joins 2 | 3 | This is the COMPLETE answer key (including explanations) 4 | for Week 3 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 
5 | 6 | Date created: 16 March 2017 7 | */ 8 | 9 | -- BOX 1: LOAD SERVER 10 | 11 | %load_ext sql 12 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 13 | %sql USE dognitiondb 14 | 15 | 16 | 17 | -- BOX 2 18 | -- Expected: 20845 rows 19 | 20 | SELECT 21 | d.user_guid AS UserID, 22 | d.dog_guid AS DogID, 23 | d.breed, 24 | d.breed_type, 25 | d.breed_group 26 | 27 | FROM dogs d JOIN complete_tests c 28 | ON d.dog_guid=c.dog_guid 29 | 30 | AND test_name='Yawn Warm-up'; 31 | 32 | 33 | 34 | 35 | -- BOX 3 36 | -- Expected: 932 rows 37 | 38 | SELECT 39 | r.dog_guid AS rDogID, 40 | r.user_guid AS rUserID, 41 | d.dog_guid AS dDogID, 42 | d.user_guid AS dUserID, 43 | AVG(r.rating) AS AvgRating, 44 | COUNT(r.rating) AS NumRatings 45 | FROM dogs d RIGHT JOIN reviews r 46 | ON r.dog_guid=d.dog_guid 47 | AND r.user_guid=d.user_guid 48 | WHERE r.dog_guid IS NOT NULL 49 | GROUP BY r.dog_guid 50 | HAVING NumRatings >= 10 51 | ORDER BY AvgRating DESC 52 | 53 | 54 | 55 | -- BOX 4 56 | -- Expected: 894 rows 57 | 58 | SELECT 59 | r.dog_guid AS rDogID, 60 | d.dog_guid AS dDogID, 61 | r.user_guid AS rUserID, 62 | d.user_guid AS dUserID, 63 | AVG(r.rating) AS AvgRating, 64 | COUNT(r.rating) AS NumRatings 65 | FROM reviews r LEFT JOIN dogs d 66 | ON r.dog_guid=d.dog_guid 67 | AND r.user_guid=d.user_guid 68 | WHERE d.dog_guid IS NULL 69 | GROUP BY r.dog_guid 70 | HAVING NumRatings >= 10 71 | ORDER BY AvgRating DESC; 72 | 73 | 74 | 75 | -- BOX 5 76 | -- Expected: 35050 rows 77 | 78 | SELECT 79 | d.dog_guid AS dDogID, 80 | COUNT(c.test_name) AS 'Tests Completed' 81 | FROM dogs d LEFT JOIN complete_tests c 82 | ON d.dog_guid = c.dog_guid 83 | WHERE d.dog_guid IS NOT NULL 84 | GROUP BY d.dog_guid 85 | ORDER BY COUNT(c.dog_guid) ASC; 86 | 87 | 88 | 89 | -- BOX 6 90 | -- Expected: 17987 rows 91 | 92 | SELECT 93 | d.dog_guid AS dDogID, 94 | COUNT(c.test_name) AS 'Tests Completed' 95 | FROM dogs d LEFT JOIN complete_tests c 96 | ON d.dog_guid = c.dog_guid 97 | WHERE d.dog_guid IS NOT 
NULL 98 | GROUP BY c.dog_guid -- DIFFERENCE! 99 | ORDER BY COUNT(c.dog_guid) ASC; 100 | 101 | 102 | 103 | -- BOX 7 QN 5 104 | -- Expected: 1 row (17986) 105 | 106 | SELECT count(distinct dog_guid) 107 | FROM complete_tests; 108 | 109 | 110 | 111 | -- BOX 8 QN 6 112 | -- Expected: 952557 rows 113 | 114 | SELECT 115 | u.user_guid, 116 | d.user_guid, 117 | d.dog_guid, 118 | d.breed, 119 | d.breed_type, 120 | d.breed_group 121 | FROM users u LEFT JOIN dogs d 122 | ON u.user_guid = d.user_guid 123 | 124 | 125 | 126 | -- BOX 9 QN 7 127 | -- Expected: 33193 rows 128 | 129 | SELECT 130 | u.user_guid AS uUserID, 131 | d.user_guid AS dUserID, 132 | d.dog_guid AS dDogID, 133 | d.breed, 134 | count(*) AS numrows 135 | FROM users u LEFT JOIN dogs d 136 | ON u.user_guid = d.user_guid 137 | GROUP BY u.user_guid 138 | ORDER BY numrows DESC; 139 | 140 | 141 | 142 | -- BOX 10 QN 8 143 | -- Expected: 17 (1 row, since this is a COUNT) 144 | 145 | SELECT count(user_guid) 146 | from users 147 | where user_guid = 'ce225842-7144-11e5-ba71-058fbc01cf0b' 148 | 149 | 150 | 151 | -- BOX 11 QN 9 152 | -- Expected: 26 (1 row, since this is a COUNT) 153 | 154 | SELECT count(user_guid) 155 | from dogs 156 | where user_guid = 'ce225842-7144-11e5-ba71-058fbc01cf0b' 157 | 158 | 159 | -- BOX 12 QN 10 160 | -- Expected: 2226 rows 161 | 162 | SELECT DISTINCT 163 | u.user_guid AS uUserID, 164 | d.user_guid AS dUserID 165 | FROM users u LEFT JOIN dogs d 166 | ON u.user_guid = d.user_guid 167 | WHERE d.user_guid IS NULL 168 | 169 | 170 | 171 | -- BOX 13 QN 11 172 | -- Expected: 2226 rows 173 | 174 | SELECT DISTINCT 175 | u.user_guid AS uUserID, 176 | d.user_guid AS dUserID 177 | 178 | FROM dogs d RIGHT JOIN users u 179 | ON u.user_guid = d.user_guid 180 | 181 | WHERE d.user_guid IS NULL 182 | 183 | 184 | 185 | -- BOX 14 QN 12 186 | -- Expected: 5833 rows 187 | SELECT DISTINCT 188 | sa.dog_guid AS 'Dog ID', 189 | d.dog_guid AS 'Should be NULL', 190 | COUNT(sa.dog_guid) AS Times 191 | FROM site_activities sa LEFT JOIN dogs d 192 | ON sa.user_guid = 
d.user_guid 193 | WHERE d.dog_guid IS NULL 194 | AND sa.dog_guid IS NOT NULL 195 | GROUP BY sa.dog_guid 196 | ORDER BY Times DESC; 197 | -------------------------------------------------------------------------------- /Week4Ex10-BizInt.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 4 Exercise 10 - Business Intelligence 2 | 3 | This is the COMPLETE answer key (including explanations) for Week 4 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 4 | Date created: 17 March 2017 5 | 6 | */ 7 | 8 | -- BOX 1: LOAD SERVER 9 | 10 | %load_ext sql 11 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 12 | %sql USE dognitiondb 13 | 14 | 15 | 16 | -- BOX 2, Qn 1 17 | -- Expected: 11 rows 18 | 19 | SELECT DISTINCT dimension 20 | FROM dogs; 21 | 22 | 23 | 24 | -- BOX 3, Qn 2 25 | -- Expected: 100 rows 26 | 27 | /* Note: This question is rather misleading. The question suggests that a subquery is required, but an inner join will do. 28 | 29 | It's also not obvious whether the question wants you to group by the dog's personality dimensions (as that was the main focus of the preamble), or to produce a report of EVERY dog in the database. As it turns out, they want the latter. 
30 | 31 | */ 32 | 33 | SELECT 34 | d.dog_guid AS dogID, 35 | d.dimension AS dimension, 36 | count(c.created_at) AS numtests 37 | FROM dogs d, complete_tests c 38 | WHERE d.dog_guid=c.dog_guid 39 | GROUP BY dogID 40 | ORDER BY numtests DESC 41 | LIMIT 100; -- feel free to remove this line if you're curious 42 | -- Expected output otherwise: 17986 43 | 44 | 45 | -- BOX 4, Qn 3 46 | -- Expected: 100 rows 47 | 48 | SELECT 49 | d.dog_guid AS dogID, 50 | d.dimension AS dimension, 51 | count(c.created_at) AS numtests 52 | FROM dogs d 53 | INNER JOIN complete_tests c -- Or just JOIN 54 | ON d.dog_guid=c.dog_guid 55 | GROUP BY dogID 56 | ORDER BY numtests DESC 57 | LIMIT 100; 58 | 59 | 60 | -- BOX 5, Qn 4 61 | -- Expected: 11 rows 62 | 63 | SELECT 64 | indiv_scores.personality, 65 | AVG(indiv_scores.testcount) 66 | FROM 67 | (SELECT 68 | d.dog_guid AS dogID, 69 | d.dimension AS personality, 70 | count(c.created_at) AS testcount 71 | FROM dogs d 72 | INNER JOIN complete_tests c 73 | ON d.dog_guid=c.dog_guid 74 | GROUP BY dogID) 75 | AS indiv_scores 76 | GROUP BY indiv_scores.personality; 77 | 78 | 79 | 80 | -- BOX 6, Qn 5 81 | 82 | /* The question is not well-worded either. This question asks, "How many unique DogIDs are summarized in the Dognition dimensions labeled 'None' or ''? (You should retrieve values of 13,705 and 71)". However, it expects you to ONLY count unique Dog IDs that have ALSO completed tests. 83 | 84 | A better question would be, "How many unique Dog IDs that have completed at least one test have Dognition dimensions labelled 'None' or ''?" 
85 | 86 | */ 87 | 88 | SELECT 89 | indiv_scores.personality, 90 | count(indiv_scores.dogID) 91 | FROM 92 | (SELECT 93 | d.dog_guid AS dogID, 94 | d.dimension AS personality 95 | FROM dogs d 96 | INNER JOIN complete_tests c 97 | ON d.dog_guid=c.dog_guid 98 | WHERE c.created_at IS NOT NULL 99 | GROUP BY dogID) 100 | AS indiv_scores 101 | WHERE indiv_scores.personality IS NULL 102 | OR indiv_scores.personality='' 103 | GROUP BY indiv_scores.personality; 104 | 105 | 106 | 107 | -- BOX 7, Qn 6 108 | -- Expected (71 rows) 109 | 110 | SELECT 111 | indiv_scores.dogID, 112 | indiv_scores.breed, 113 | indiv_scores.weight, 114 | indiv_scores.exclude, 115 | indiv_scores.testcount, 116 | indiv_scores.Earliest, 117 | indiv_scores.Latest 118 | FROM 119 | (SELECT 120 | d.dog_guid AS dogID, 121 | d.breed AS breed, 122 | d.weight AS weight, 123 | d.exclude AS exclude, 124 | count(c.created_at) AS testcount, 125 | min(c.created_at) AS Earliest, 126 | max(c.created_at) AS Latest 127 | FROM dogs d 128 | INNER JOIN complete_tests c 129 | ON d.dog_guid=c.dog_guid 130 | WHERE c.created_at IS NOT NULL 131 | AND d.dimension = '' 132 | GROUP BY dogID) 133 | AS indiv_scores 134 | GROUP BY indiv_scores.dogID; 135 | 136 | -- A shorter version would be: 137 | 138 | SELECT 139 | d.dog_guid AS dogID, 140 | d.breed AS breed, 141 | d.weight AS weight, 142 | d.exclude AS exclude, 143 | count(c.created_at) AS testcount, 144 | min(c.created_at) AS Earliest, 145 | max(c.created_at) AS Latest 146 | FROM dogs d 147 | INNER JOIN complete_tests c 148 | ON d.dog_guid=c.dog_guid 149 | WHERE c.created_at IS NOT NULL 150 | AND d.dimension = '' 151 | GROUP BY dogID; 152 | 153 | 154 | -- BOX 8, Qn 7 155 | -- Expected: 9 Rows (ace = 402, charmer = 626) 156 | 157 | SELECT 158 | indiv_scores.personality, 159 | count(indiv_scores.dogID) AS NumDogs, 160 | AVG(indiv_scores.testcount) AS AvgScore 161 | FROM 162 | (SELECT 163 | d.dog_guid AS dogID, 164 | d.dimension AS personality, 165 | count(c.created_at) AS testcount 
166 | FROM dogs d 167 | INNER JOIN complete_tests c 168 | ON d.dog_guid=c.dog_guid 169 | WHERE d.dimension IS NOT NULL -- (2) 170 | AND d.dimension != '' -- (1) 171 | AND (d.exclude IS NULL OR d.exclude = 0) -- (3) 172 | GROUP BY dogID) 173 | AS indiv_scores 174 | GROUP BY indiv_scores.personality; 175 | 176 | 177 | -- BOX 9 178 | 179 | SELECT DISTINCT breed_group 180 | FROM dogs 181 | 182 | 183 | 184 | -- BOX 10, Qn 9 185 | -- Expected: 8816 rows 186 | 187 | SELECT 188 | d.dog_guid AS 'Dog ID', 189 | d.breed, d.weight, d.exclude, 190 | MIN(c.created_at) AS 'Earliest Time', 191 | MAX(c.created_at) AS 'Latest Time', 192 | count(c.created_at) AS 'Num Tests Done' 193 | FROM dogs d 194 | JOIN complete_tests c 195 | ON d.dog_guid = c.dog_guid 196 | WHERE c.created_at IS NOT NULL 197 | AND d.breed_group IS NULL 198 | GROUP BY d.dog_guid 199 | 200 | 201 | 202 | -- BOX 11, Qn 10 203 | -- Expected: 9 rows (Herding = 1774) 204 | 205 | SELECT 206 | indiv_scores.doggroup AS 'Breed Group', 207 | count(indiv_scores.dogID) AS 'Num of Dogs', 208 | AVG(indiv_scores.testcount) AS 'Their Avg Score' 209 | FROM 210 | (SELECT 211 | d.dog_guid AS dogID, 212 | d.breed_group AS doggroup, 213 | count(c.created_at) AS testcount 214 | FROM dogs d 215 | INNER JOIN complete_tests c 216 | ON d.dog_guid=c.dog_guid 217 | WHERE d.breed_group IS NOT NULL -- remove 218 | AND d.breed_group != '' -- remove 219 | AND (d.exclude IS NULL OR d.exclude = 0) -- (specified by qn) 220 | GROUP BY dogID) 221 | AS indiv_scores 222 | GROUP BY indiv_scores.doggroup; 223 | 224 | /* *HOUND* breed groups, NOT *toy* breed groups, complete the least tests. Hound groups = 564. Toy groups = 1041. 
225 | 226 | */ 227 | 228 | -- BOX 12, Qn 11 229 | -- Expected: 4 rows 230 | 231 | SELECT 232 | indiv_scores.doggroup AS 'Breed Group', 233 | count(indiv_scores.dogID) AS 'Num of Dogs', 234 | AVG(indiv_scores.testcount) AS 'Their Avg Score' 235 | FROM 236 | (SELECT 237 | d.dog_guid AS dogID, 238 | d.breed_group AS doggroup, 239 | count(c.created_at) AS testcount 240 | FROM dogs d 241 | INNER JOIN complete_tests c 242 | ON d.dog_guid=c.dog_guid 243 | WHERE d.breed_group IN ('Sporting', 'Hound', 'Herding', 'Working') 244 | AND (d.exclude IS NULL OR d.exclude = 0) -- (specified by qn) 245 | GROUP BY dogID) 246 | AS indiv_scores 247 | GROUP BY indiv_scores.doggroup; 248 | 249 | 250 | 251 | -- BOX 13, Qn 12 252 | -- Expected: 4 rows (pure breed = 8865) 253 | 254 | SELECT DISTINCT breed_type 255 | FROM dogs 256 | 257 | 258 | 259 | -- BOX 14, Qn 13 260 | -- Expected: 4 rows 261 | 262 | SELECT 263 | d.breed_type AS 'Breed Type', 264 | COUNT(DISTINCT d.dog_guid) AS 'Num of dogs', 265 | COUNT(c.created_at) AS 'Num of tests', 266 | COUNT(c.created_at)/COUNT(DISTINCT d.dog_guid) AS 'Tests Done per Dog' -- bonus to make relationship clearer 267 | FROM dogs d 268 | JOIN complete_tests c 269 | ON d.dog_guid = c.dog_guid 270 | WHERE (d.exclude IS NULL OR d.exclude = '0') 271 | AND d.breed_type IS NOT NULL 272 | AND c.created_at IS NOT NULL 273 | GROUP BY d.breed_type 274 | 275 | 276 | 277 | -- BOX 15, Qn 14 278 | -- Expected: 50 rows 279 | 280 | SELECT 281 | DISTINCT d.dog_guid AS 'DogID', 282 | d.breed_type AS 'Breed Type', 283 | count(c.created_at) AS 'Completed tests', 284 | CASE 285 | WHEN d.breed_type = 'Pure Breed' THEN "Pure Breed" 286 | ELSE "Not_Pure_Breed" 287 | END AS Label 288 | FROM dogs d 289 | JOIN complete_tests c 290 | ON d.dog_guid = c.dog_guid 291 | GROUP BY d.dog_guid 292 | LIMIT 50; 293 | 294 | 295 | 296 | 297 | -- BOX 16, Qn 15 298 | -- Expected: 2 rows (Not Pure Breed = 8336 IDs) 299 | 300 | SELECT 301 | cleaned.Label, 302 | count(distinct cleaned.DogID), 303 
| AVG(cleaned.testcount) 304 | FROM 305 | (SELECT 306 | DISTINCT d.dog_guid AS DogID, 307 | d.breed_type AS BreedType, 308 | count(c.created_at) AS testcount, 309 | CASE 310 | WHEN d.breed_type = 'Pure Breed' THEN 'Pure Breed' 311 | ELSE 'Not_Pure_Breed' 312 | END AS Label 313 | FROM dogs d 314 | JOIN complete_tests c 315 | ON d.dog_guid = c.dog_guid 316 | WHERE c.created_at IS NOT NULL 317 | AND d.breed_type IS NOT NULL 318 | AND (d.exclude IS NULL OR d.exclude = '0') 319 | GROUP BY d.dog_guid) 320 | AS cleaned 321 | GROUP BY cleaned.Label 322 | 323 | 324 | 325 | 326 | -- BOX 17, Qn 16 327 | -- Expected: 8816 rows 328 | 329 | SELECT 330 | cleaned.Label, 331 | cleaned.neutered, 332 | AVG(cleaned.testcount), 333 | COUNT(cleaned.dog_guid) 334 | FROM 335 | (SELECT -- subquery part from Qn. 15 336 | d.dog_guid, 337 | d.breed_type, 338 | d.dog_fixed AS neutered, 339 | COUNT(c.created_at) AS testcount, 340 | CASE 341 | WHEN d.breed_type = 'Pure Breed' THEN 'Pure_Breed' 342 | ELSE 'Not_Pure_Breed' 343 | END AS Label 344 | FROM dogs d 345 | JOIN complete_tests c 346 | ON d.dog_guid = c.dog_guid 347 | WHERE (d.exclude = '0' OR d.exclude IS NULL) -- exclusion criteria 348 | GROUP BY d.dog_guid) 349 | AS cleaned 350 | GROUP BY cleaned.Label, cleaned.neutered; 351 | 352 | 353 | 354 | -- BOX 18, Qn 17 355 | -- Expected: 9 rows (ace = 5.4896, charmer = 5.1919) 356 | 357 | SELECT 358 | indiv_scores.personality, 359 | count(indiv_scores.dogID) AS NumDogs, 360 | AVG(indiv_scores.testcount) AS AvgScore, 361 | STDDEV(indiv_scores.testcount) AS StdDevScore 362 | FROM 363 | (SELECT 364 | d.dog_guid AS dogID, 365 | d.dimension AS personality, 366 | count(c.created_at) AS testcount 367 | FROM dogs d 368 | INNER JOIN complete_tests c 369 | ON d.dog_guid=c.dog_guid 370 | WHERE d.dimension IS NOT NULL -- (2) 371 | AND d.dimension != '' -- (1) 372 | AND (d.exclude IS NULL OR d.exclude = 0) -- (3) 373 | GROUP BY dogID) 374 | AS indiv_scores 375 | GROUP BY indiv_scores.personality; 376 | 
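-- Side note (not part of the exercise set): MySQL's STDDEV(), used in Qn 17 above,
-- is an alias for STDDEV_POP(), the population standard deviation. If a sample
-- standard deviation (n-1 denominator) is wanted instead, STDDEV_SAMP() can be
-- swapped in. A throwaway illustration on the values {1, 2, 3}:

SELECT
    STDDEV_POP(t.x)  AS pop_sd,   -- sqrt(2/3), about 0.8165
    STDDEV_SAMP(t.x) AS samp_sd   -- sqrt(1), exactly 1.0
FROM (SELECT 1 AS x UNION SELECT 2 UNION SELECT 3) AS t;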
377 | 378 | 379 | -- BOX 19, Qn 18 380 | -- Expected: 4 rows (cross breed std dev = 13849) 381 | 382 | SELECT 383 | DISTINCT d.breed_type, 384 | AVG(TIMESTAMPDIFF(minute, e.start_time, e.end_time)) AS AvgTime, 385 | STDDEV(TIMESTAMPDIFF(minute, e.start_time, e.end_time)) AS StdDevTime 386 | FROM dogs d 387 | JOIN exam_answers e 388 | ON d.dog_guid = e.dog_guid 389 | GROUP BY d.breed_type 390 | 391 | -- END -- 392 | -------------------------------------------------------------------------------- /Week4Ex12-BizInt.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 4 Exercise 12 - Practicing Business Queries 2 | 3 | This is the COMPLETE answer key (including explanations) for Week 4 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 4 | Date created: 18 March 2017 5 | 6 | */ 7 | 8 | -- BOX 1: LOAD SERVER 9 | 10 | %load_ext sql 11 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 12 | %sql USE dognitiondb 13 | 14 | 15 | 16 | -- Qn 1 17 | -- Expected: 200 rows 18 | 19 | SELECT created_at, dayofweek(created_at) 20 | FROM complete_tests 21 | LIMIT 50, 200; 22 | 23 | 24 | 25 | -- Qn 2 26 | -- Expected: 200 rows 27 | 28 | SELECT created_at, 29 | CASE 30 | WHEN dayofweek(created_at)=1 THEN 'Sunday' 31 | WHEN dayofweek(created_at)=2 THEN 'Monday' 32 | WHEN dayofweek(created_at)=3 THEN 'Tuesday' 33 | WHEN dayofweek(created_at)=4 THEN 'Wednesday' 34 | WHEN dayofweek(created_at)=5 THEN 'Thursday' 35 | WHEN dayofweek(created_at)=6 THEN 'Friday' 36 | WHEN dayofweek(created_at)=7 THEN 'Saturday' 37 | END AS Day 38 | FROM complete_tests 39 | LIMIT 50, 200; 40 | 41 | 42 | 43 | -- Qn 3 44 | -- Expected: 7 rows (Sunday = 33,190 tests) 45 | 46 | SELECT 47 | CASE 48 | WHEN dayofweek(created_at)=1 THEN 'Sunday' 49 | WHEN dayofweek(created_at)=2 THEN 'Monday' 50 | WHEN dayofweek(created_at)=3 THEN 'Tuesday' 51 | WHEN dayofweek(created_at)=4 THEN 'Wednesday' 52 | WHEN dayofweek(created_at)=5 THEN 'Thursday' 53 | WHEN
dayofweek(created_at)=6 THEN 'Friday' 54 | WHEN dayofweek(created_at)=7 THEN 'Saturday' 55 | END AS Day, 56 | COUNT(created_at) AS 'Number of Tests' 57 | FROM complete_tests 58 | GROUP BY Day 59 | ORDER BY COUNT(created_at) DESC; 60 | 61 | 62 | 63 | -- Qn 4 64 | -- Expected: 7 rows 65 | 66 | SELECT 67 | CASE 68 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 69 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 70 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 71 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 72 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 73 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 74 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 75 | END AS Day, 76 | COUNT(c.created_at) AS 'Number of Tests' 77 | FROM complete_tests c 78 | JOIN dogs d 79 | ON c.dog_guid = d.dog_guid 80 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 81 | GROUP BY Day 82 | ORDER BY count(c.created_at) DESC; 83 | 84 | 85 | 86 | -- Qn 5 87 | -- Expected: 950,331 rows 88 | 89 | SELECT d.dog_guid 90 | FROM dogs d 91 | JOIN users u 92 | ON d.user_guid = u.user_guid; 93 | 94 | 95 | 96 | -- Qn 6 97 | -- Expected 35,048 rows 98 | 99 | SELECT DISTINCT d.dog_guid 100 | FROM dogs d 101 | JOIN users u 102 | ON d.user_guid = u.user_guid; 103 | 104 | 105 | 106 | -- Qn 7 107 | -- Expected: 34,121 Rows 108 | 109 | SELECT DISTINCT d.dog_guid 110 | FROM dogs d 111 | JOIN users u 112 | ON d.user_guid = u.user_guid 113 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 114 | AND (u.exclude = 0 OR u.exclude IS NULL); 115 | 116 | 117 | 118 | -- BOX 8 119 | -- Expected: 7 rows 120 | 121 | SELECT 122 | CASE 123 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 124 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 125 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 126 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 127 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 128 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 129 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 130 | END AS Day, 131 | 
COUNT(c.created_at) AS 'Number of Tests' 132 | FROM complete_tests c 133 | JOIN ( 134 | SELECT DISTINCT d.dog_guid 135 | FROM dogs d 136 | JOIN users u 137 | ON d.user_guid = u.user_guid 138 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 139 | AND (u.exclude = 0 OR u.exclude IS NULL) 140 | ) 141 | AS cleandogs 142 | ON c.dog_guid = cleandogs.dog_guid 143 | GROUP BY Day 144 | ORDER BY count(c.created_at) DESC; 145 | 146 | 147 | 148 | -- Qn 9 149 | -- Expected: 21 rows 150 | 151 | SELECT 152 | CASE 153 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 154 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 155 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 156 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 157 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 158 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 159 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 160 | END AS Day, 161 | YEAR(c.created_at) AS Year, 162 | COUNT(c.created_at) AS 'Number of Tests' 163 | FROM complete_tests c 164 | JOIN ( 165 | SELECT DISTINCT d.dog_guid 166 | FROM dogs d 167 | JOIN users u 168 | ON d.user_guid = u.user_guid 169 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 170 | AND (u.exclude = 0 OR u.exclude IS NULL) 171 | ) 172 | AS cleandogs 173 | ON c.dog_guid = cleandogs.dog_guid 174 | GROUP BY Day, Year 175 | ORDER BY Year ASC, count(c.created_at) DESC; 176 | 177 | 178 | 179 | -- Qn 10 180 | -- Expected: 21 rows (Sunday - 5860) 181 | 182 | SELECT 183 | CASE 184 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 185 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 186 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 187 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 188 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 189 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 190 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 191 | END AS Day, 192 | YEAR(c.created_at) AS Year, 193 | COUNT(c.created_at) AS 'Number of Tests' 194 | FROM complete_tests c 195 | JOIN ( 196 | SELECT DISTINCT d.dog_guid, 197 
| u.country, 198 | u.state 199 | FROM dogs d 200 | JOIN users u 201 | ON d.user_guid = u.user_guid 202 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 203 | AND (u.exclude = 0 OR u.exclude IS NULL) 204 | ) 205 | AS cleandogs 206 | ON c.dog_guid = cleandogs.dog_guid 207 | WHERE cleandogs.country = 'US' AND cleandogs.state NOT IN ('HI', 'AK') 208 | GROUP BY Day, Year 209 | ORDER BY Year ASC, count(c.created_at) DESC; 210 | 211 | -- Qn 11 212 | -- Expected: 100 rows 213 | 214 | SELECT created_at, 215 | DATE_SUB(created_at, INTERVAL 6 HOUR) AS NewTime 216 | FROM complete_tests 217 | LIMIT 100; 218 | 219 | 220 | 221 | -- Qn 12 222 | -- Expected: 21 rows 223 | 224 | SELECT 225 | CASE 226 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=1 THEN 'Sunday' 227 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=2 THEN 'Monday' 228 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=3 THEN 'Tuesday' 229 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=4 THEN 'Wednesday' 230 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=5 THEN 'Thursday' 231 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=6 THEN 'Friday' 232 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=7 THEN 'Saturday' 233 | END AS Day, 234 | YEAR(c.created_at) AS Year, 235 | COUNT(c.created_at) AS 'Number of Tests' 236 | FROM complete_tests c 237 | JOIN ( 238 | SELECT DISTINCT d.dog_guid, 239 | u.country, 240 | u.state 241 | FROM dogs d 242 | JOIN users u 243 | ON d.user_guid = u.user_guid 244 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 245 | AND (u.exclude = 0 OR u.exclude IS NULL) 246 | ) 247 | AS cleandogs 248 | ON c.dog_guid = cleandogs.dog_guid 249 | WHERE cleandogs.country = 'US' AND cleandogs.state NOT IN ('HI', 'AK') 250 | GROUP BY Day, Year 251 | ORDER BY Year ASC, count(c.created_at) DESC; 252 | 253 | 254 | 255 | -- Qn 13 256 | -- Expected: 21 rows 257 | 258 | SELECT 259 | CASE 260 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=1 THEN 'Sunday' 
261 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=2 THEN 'Monday' 262 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=3 THEN 'Tuesday' 263 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=4 THEN 'Wednesday' 264 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=5 THEN 'Thursday' 265 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=6 THEN 'Friday' 266 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=7 THEN 'Saturday' 267 | END AS Day, 268 | YEAR(c.created_at) AS Year, 269 | COUNT(c.created_at) AS 'Number of Tests' 270 | FROM complete_tests c 271 | JOIN ( 272 | SELECT DISTINCT d.dog_guid, 273 | u.country, 274 | u.state 275 | FROM dogs d 276 | JOIN users u 277 | ON d.user_guid = u.user_guid 278 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 279 | AND (u.exclude = 0 OR u.exclude IS NULL) 280 | ) 281 | AS cleandogs 282 | ON c.dog_guid = cleandogs.dog_guid 283 | WHERE cleandogs.country = 'US' AND cleandogs.state NOT IN ('HI', 'AK') 284 | GROUP BY Day, Year 285 | ORDER BY Year ASC, FIELD(Day, 'Monday', 'Tuesday', 286 | 'Wednesday', 'Thursday', 'Friday', 287 | 'Saturday', 'Sunday'), count(c.created_at) DESC; 288 | 289 | 290 | 291 | -- Qn 14 292 | -- Expected: 5 rows 293 | 294 | SELECT 295 | clean.state AS 'State', 296 | COUNT(DISTINCT clean.user_guid) AS 'Number of Users' 297 | FROM complete_tests c 298 | JOIN ( 299 | SELECT DISTINCT d.dog_guid, 300 | d.user_guid, 301 | u.state, 302 | u.country 303 | FROM dogs d 304 | JOIN users u 305 | ON d.user_guid = u.user_guid 306 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 307 | AND (u.exclude = 0 OR u.exclude IS NULL) 308 | AND u.country = 'US' 309 | ) 310 | AS clean 311 | ON c.dog_guid = clean.dog_guid 312 | GROUP BY clean.state 313 | ORDER BY COUNT(DISTINCT clean.user_guid) DESC 314 | LIMIT 5; 315 | 316 | 317 | 318 | -- Qn 15 319 | -- Expected: 10 rows 320 | 321 | SELECT 322 | clean.country AS 'Country', 323 | clean.state AS 'State', 324 | COUNT(DISTINCT clean.user_guid) AS 'Number of Users' 325 | FROM complete_tests c 326 | JOIN ( 327 | SELECT
DISTINCT d.dog_guid, 326 | d.user_guid, 327 | u.state, 328 | u.country 329 | FROM dogs d 330 | JOIN users u 331 | ON d.user_guid = u.user_guid 332 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 333 | AND (u.exclude = 0 OR u.exclude IS NULL) 334 | ) 335 | AS clean 336 | ON c.dog_guid = clean.dog_guid 337 | GROUP BY clean.country, clean.state 338 | ORDER BY COUNT(DISTINCT clean.user_guid) DESC 339 | LIMIT 10; 340 | 341 | -- END -- 342 | -------------------------------------------------------------------------------- /Week4Ex9-Subqueries.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 4 Exercise 9 - Subqueries & Derived Tables 2 | 3 | This is the COMPLETE answer key (including explanations) 4 | for Week 4 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 5 | 6 | Date created: 17 March 2017 7 | */ 8 | 9 | -- BOX 1: LOAD SERVER 10 | 11 | %load_ext sql 12 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 13 | %sql USE dognitiondb 14 | 15 | 16 | 17 | -- BOX 2 18 | -- Expected: 1 row (9934) 19 | 20 | SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time)) 21 | FROM exam_answers 22 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) > 0 23 | AND test_name = 'Yawn Warm-Up'; 24 | 25 | 26 | 27 | 28 | -- BOX 3 29 | -- Expected: 11059 rows 30 | 31 | SELECT * 32 | FROM exam_answers 33 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) > 34 | ( 35 | SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time)) 36 | FROM exam_answers 37 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) > 0 38 | AND test_name = 'Yawn Warm-Up' 39 | ); 40 | 41 | 42 | -- BOX 4 43 | -- Expected: 1 row (163022) 44 | 45 | SELECT count(*) 46 | FROM exam_answers 47 | WHERE subcategory_name IN ("Puzzles", "Numerosity", "Bark Game"); 48 | 49 | 50 | 51 | -- BOX 5 52 | -- Expected: 1 row (7961) 53 | 54 | SELECT count(distinct dog_guid) 55 | FROM dogs 56 | WHERE breed_group NOT IN ("Working", "Sporting", "Herding") 57 | 58 | 59 | 60 | -- BOX 6 61 | -- Expected: 2226 rows 62 | 63 | SELECT
DISTINCT u.user_guid 64 | FROM users u 65 | WHERE NOT EXISTS 66 | (SELECT d.user_guid 67 | FROM dogs d 68 | WHERE u.user_guid = d.user_guid); 69 | 70 | 71 | 72 | -- BOX 7 73 | -- Expected: 33193 rows 74 | 75 | SELECT 76 | clean.user_guid AS uUserID, 77 | d.user_guid AS dUserID, 78 | count(*) AS numrows 79 | FROM 80 | (SELECT DISTINCT u.user_guid 81 | FROM users u) 82 | AS clean 83 | LEFT JOIN dogs d 84 | ON clean.user_guid=d.user_guid 85 | GROUP BY clean.user_guid 86 | ORDER BY numrows DESC 87 | 88 | 89 | 90 | -- BOX 8 91 | -- Expected: Note the type of error message 92 | 93 | SELECT 94 | u.user_guid AS uUserID, 95 | d.user_guid AS dUserID, 96 | count(*) AS numrows 97 | 98 | FROM 99 | (SELECT DISTINCT u.user_guid 100 | FROM users u) 101 | AS DistinctUUsersID 102 | 103 | LEFT JOIN dogs d 104 | ON DistinctUUsersID.user_guid=d.user_guid 105 | 106 | GROUP BY DistinctUUsersID.user_guid 107 | ORDER BY numrows DESC 108 | 109 | 110 | 111 | -- BOX 9 QN 6 112 | -- Expected: 10254 rows 113 | 114 | SELECT distinct 115 | d.dog_guid, 116 | d.breed_group, 117 | u.state, 118 | u.zip 119 | FROM dogs d, users u 120 | WHERE d.user_guid = u.user_guid 121 | AND breed_group IN ('Working', 'Sporting', 'Herding') 122 | 123 | 124 | 125 | -- BOX 10 126 | -- Expected: 10254 rows 127 | 128 | SELECT distinct 129 | d.dog_guid, 130 | d.breed_group, 131 | u.state, 132 | u.zip 133 | FROM dogs d JOIN users u 134 | ON d.user_guid = u.user_guid 135 | WHERE breed_group IN ('Working', 'Sporting', 'Herding') 136 | 137 | 138 | 139 | -- BOX 11 Qn 8 140 | -- Expected: 2 rows 141 | 142 | SELECT d.user_guid 143 | FROM dogs d 144 | WHERE NOT EXISTS 145 | (SELECT DISTINCT u.user_guid 146 | FROM users u 147 | WHERE d.user_guid = u.user_guid); 148 | 149 | 150 | 151 | -- BOX 12 152 | -- Expected: 1 rows (1819) 153 | 154 | SELECT 155 | DistinctUUsersID.user_guid AS uUserID, 156 | d.user_guid AS dUserID, 157 | count(*) AS numrows 158 | FROM (SELECT DISTINCT u.user_guid 159 | FROM users u 160 | WHERE user_guid = 
'ce7b75bc-7144-11e5-ba71-058fbc01cf0b') 161 | AS DistinctUUsersID 162 | LEFT JOIN dogs d 163 | ON DistinctUUsersID.user_guid=d.user_guid 164 | GROUP BY DistinctUUsersID.user_guid 165 | ORDER BY numrows DESC; 166 | 167 | 168 | 169 | -- BOX 13 170 | -- Expected: 30968 rows 171 | 172 | SELECT DISTINCT d.user_guid 173 | FROM dogs d 174 | 175 | 176 | 177 | -- BOX 14 QN 11 178 | -- Expected: 1 row 179 | 180 | SELECT 181 | APPLES.user_guid AS uUserID, 182 | ORANGES.user_guid AS dUserID, 183 | count(*) AS numrows 184 | FROM 185 | (SELECT DISTINCT u.user_guid 186 | FROM users u 187 | WHERE user_guid = 'ce7b75bc-7144-11e5-ba71-058fbc01cf0b') 188 | AS APPLES 189 | LEFT JOIN 190 | (SELECT DISTINCT d.user_guid 191 | FROM dogs d) 192 | AS ORANGES 193 | ON APPLES.user_guid=ORANGES.user_guid 194 | GROUP BY APPLES.user_guid 195 | ORDER BY numrows DESC; 196 | 197 | 198 | 199 | -- BOX 15 QN 12 200 | -- Expected: 100 rows 201 | 202 | SELECT 203 | APPLES.user_guid AS uUserID, 204 | ORANGES.user_guid AS dUserID, 205 | count(*) AS numrows 206 | FROM 207 | (SELECT DISTINCT u.user_guid 208 | FROM users u 209 | LIMIT 100) 210 | AS APPLES 211 | LEFT JOIN 212 | (SELECT DISTINCT d.user_guid 213 | FROM dogs d) 214 | AS ORANGES 215 | ON APPLES.user_guid=ORANGES.user_guid 216 | GROUP BY APPLES.user_guid 217 | ORDER BY numrows DESC; 218 | 219 | 220 | 221 | -- BOX 16 QN 13 222 | -- Expected: 5 rows (shih tzu, 190, 1819) 223 | 224 | SELECT 225 | APPLES.user_guid AS uUserID, 226 | d.user_guid AS dUserID, 227 | d.breed, 228 | d.weight, 229 | count(*) AS numrows 230 | FROM 231 | (SELECT DISTINCT u.user_guid 232 | FROM users u) 233 | AS APPLES 234 | LEFT JOIN dogs d 235 | ON APPLES.user_guid=d.user_guid 236 | GROUP BY APPLES.user_guid 237 | HAVING numrows > 10 238 | ORDER BY numrows DESC; 239 | -------------------------------------------------------------------------------- /Week5 - Dillards.md: -------------------------------------------------------------------------------- 1 | # Final Week -
Dillard's Database Exercises 2 | 3 | Date created: 25 April 2017 4 | 5 | Last updated: 9 Sept 2017 6 | 7 | This is the COMPLETE answer key (including explanations where necessary) 8 | for Week 5 (final week) of the ["Managing Big Data with MySQL"](https://www.coursera.org/learn/analytics-mysql/home/week/5) 9 | course by Duke University. 10 | 11 | I wrote this answer key as no official answers have been released online. 12 | These answers reflect my own work and are accurate to the best of my knowledge. 13 | I will update them if the professors ever release an "official" answer key. 14 | 15 | **These answers will come in handy during the final exam for the course, which 16 | requires one to make similar queries.** 17 | 18 | Update: These answers are based on the older version of the ``UA_Dillards`` dataset 19 | (not ``UA_Dillards1``, nor ``UA_Dillards_2016``). For example, this means I am using 20 | the table ``SKSTINFO`` and not ``SKSTINFO_FIX``, which is the newer version. 21 | 22 | With that, let's start. 23 | 24 | # Answers 25 | 26 | To start, enter ``DATABASE ua_dillards;`` into the Teradata SQL scratchpad.
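Coming from the MySQL weeks, two Teradata syntax differences are worth keeping in mind: row limits use ``TOP`` rather than ``LIMIT``, and date parts are pulled out with ``EXTRACT`` instead of ``MONTH()``/``YEAR()``. A quick sanity check on the scratchpad (table and column names as used in the exercises below):

```sql
SELECT TOP 5
    saledate,
    EXTRACT(month FROM saledate) AS month_num,
    EXTRACT(year FROM saledate) AS year_num
FROM trnsact;
```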
27 | 28 | ### Question 1 29 | 30 | **How many distinct dates are there in the saledate column of the transaction 31 | table for each month/year combination in the database?** 32 | 33 | ```sql 34 | SELECT 35 | EXTRACT (month FROM saledate) AS month_num, 36 | EXTRACT (year FROM saledate) AS year_num, 37 | COUNT (DISTINCT EXTRACT (day FROM saledate)) AS days_in_month, 38 | COUNT (EXTRACT (day FROM saledate)) AS num_transactions -- I'm curious abt num transactions per mth 39 | FROM trnsact 40 | GROUP BY month_num, year_num 41 | ORDER BY year_num, month_num 42 | ``` 43 | 44 | Result 45 | 46 | | MONTH_NUM | YEAR_NUM | DAYS_IN_MONTH | NUM_TRANSACTIONS | 47 | | -- | -- | -- | -- | 48 | | 8 | 2004 | 31 | 8292953 49 | | 9 | 2004 | 30 | 8967415 50 | | 10 | 2004 | 31 | 8412131 51 | | 11 | 2004 | 29 | 7047319 52 | | 12 | 2004 | 30 | 13383892 53 | | 1 | 2005 | 31 | 8952311 54 | | 2 | 2005 | 28 | 11352221 55 | | 3 | 2005 | 30 | 8940444 56 | | 4 | 2005 | 30 | 9082523 57 | | 5 | 2005 | 31 | 7715779 58 | | 6 | 2005 | 30 | 7922997 59 | | 7 | 2005 | 31 | 11122770 60 | | 8 | 2005 | 27 | 9724141 61 | 62 | There appears to be an incomplete record for August 2005 63 | (it has only 27 days of data). 64 | 65 | *As the homework instructs, I will restrict all further analysis of August sales 66 | to those recorded in 2004, not 2005.* 67 | 68 | Next, it appears that Dillard's excludes designated holidays from its sales calendar. 69 | None of the stores have data for ``25 November`` (Thanksgiving), 70 | ``25 December`` (Christmas), or ``27 March`` (Easter Sunday in 2005).
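Those gaps can be double-checked directly. The query below (assuming the same ``trnsact``/``saledate`` columns used above) should return no rows if the three holiday dates are truly absent:

```sql
SELECT DISTINCT saledate
FROM trnsact
WHERE (EXTRACT(month FROM saledate) = 11 AND EXTRACT(day FROM saledate) = 25)
   OR (EXTRACT(month FROM saledate) = 12 AND EXTRACT(day FROM saledate) = 25)
   OR (EXTRACT(month FROM saledate) = 3  AND EXTRACT(day FROM saledate) = 27);
```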
71 | 72 | ### Question 2 73 | 74 | **Use a CASE statement within an aggregate function to determine which sku 75 | had the greatest total sales during the combined summer months of June, July, 76 | and August.** 77 | 78 | ```sql 79 | SELECT DISTINCT sku, 80 | SUM (CASE WHEN EXTRACT(month FROM saledate)=6 AND stype='p' THEN amt END) AS rev_june, 81 | SUM (CASE WHEN EXTRACT(month FROM saledate)=7 AND stype='p' THEN amt END) AS rev_july, 82 | SUM (CASE WHEN EXTRACT(month FROM saledate)=8 AND stype='p' THEN amt END) AS rev_aug, -- ! 83 | (rev_aug + rev_june + rev_july) AS rev_total_summer 84 | FROM trnsact 85 | GROUP BY sku 86 | HAVING rev_total_summer > 0 -- exclude null values 87 | ORDER BY rev_total_summer DESC 88 | ``` 89 | 90 | There is a problem with this question statement. It suggests that: 91 | 92 | > *'If your query is correct, you should find that sku #2783996 has the fifth greatest total sales 93 | during the combined months of June, July, and August, with a total summer sales sum of 94 | $897,807.01.'* 95 | 96 | However, you will only get this value if you include values from **both** August 2004 97 | and August 2005, which **Question 1 explicitly tells us not to do**.
98 | 99 | A more sensible answer, one that includes *only one copy of each month per year*, would be: 100 | 101 | ```sql 102 | SELECT DISTINCT sku, 103 | SUM (CASE WHEN EXTRACT(month FROM saledate)=6 AND stype='p' THEN amt END) AS rev_june, 104 | SUM (CASE WHEN EXTRACT(month FROM saledate)=7 AND stype='p' THEN amt END) AS rev_july, 105 | SUM (CASE WHEN EXTRACT(month FROM saledate)=8 AND stype='p' 106 | AND EXTRACT(year FROM saledate)=2004 -- new line: keep only August 2004 107 | THEN amt END) AS rev_aug, 108 | (rev_aug + rev_june + rev_july) AS rev_total_summer 109 | FROM trnsact 110 | GROUP BY sku 111 | HAVING rev_total_summer > 0 -- exclude null values 112 | ORDER BY rev_total_summer DESC 113 | ``` 114 | 115 | This gives the answer: 116 | 117 | | SKU ITEM CODE | REV_JUNE 2005 | REV_JULY 2005 | REV_AUG 2004 | REV_TOTAL_SUMMER 118 | | -- | -- | -- | -- | -- | 119 | | 4108011 | 309511.88 | 379326.00 | 499821.00 | 1,188,658.88 120 | | 3524026 | 269934.50 | 344833.00 | 458227.50 | 1,072,995.00 121 | | 5528349 | 339349.00 | 325156.50 | 337221.00 | 1,001,726.50 122 | | 3978011 | 197885.37 | 259279.60 | 308910.00 | 766,074.97 123 | | **2783996** | 190252.01 | 197414.50 | 313736.50 | **701,403.01** 124 | 125 | Additional background information on the most popular summer items: *(because I'm 126 | curious lol)* 127 | 128 | ```sql 129 | SELECT * 130 | FROM SKUINFO 131 | WHERE sku IN (4108011, 3524026, 5528349, 3978011, 2783996) 132 | ``` 133 | 134 | | SKU CODE | COLOUR | SIZE | PACKSIZE | BRAND | 135 | | -- | -- | -- | -- | -- | 136 | | 4108011 | DDML | DDML 4OZ | 6 | CLINIQUE 137 | | 3524026 | DDML | PUMP 4.2 OZ | 6 | CLINIQUE 138 | | 5528349 | 01-BLACK | 01-BLACK | 3 | LANCOME 139 | | 3978011 | CLARIFY #2 | 13.5 OZ | 3 | CLINIQUE 140 | | 2783996 | 01-BLACK | NO SIZE | 3 | LANCOME 141 | 142 | ### Question 3. 143 | 144 | **How many distinct dates are there in the saledate column of the transaction 145 | table for each month/year/store combination in the database?
Sort your results by the 146 | number of days per combination in ascending order.** 147 | 148 | ```sql 149 | SELECT 150 | EXTRACT (month FROM saledate) AS month_num, 151 | EXTRACT (year FROM saledate) AS year_num, 152 | store, 153 | COUNT (DISTINCT saledate) AS num_dates 154 | FROM trnsact 155 | GROUP BY month_num, year_num, store 156 | ORDER BY num_dates asc 157 | ``` 158 | Some stores appear to have missing or removed data (i.e., fewer than 30 days of data in a month). 159 | 160 | | MONTH | YEAR | STORE ID | NUM_DATES | 161 | | ----- | ---- | -------- | --------- | 162 | | 7 | 2005 | 7604 | 1 163 | | 3 | 2005 | 8304 | 1 164 | | 9 | 2004 | 4402 | 1 165 | | 8 | 2004 | 9906 | 1 166 | | 8 | 2004 | 8304 | 1 167 | | 8 | 2004 | 7203 | 3 168 | | 3 | 2005 | 6402 | 11 169 | 170 | We will note the missing data for future calculations. Where possible, we 171 | will aim to exclude months that do not meet our criteria when doing trend analysis. 172 | 173 | ### Question 4a. 174 | 175 | **What is the average daily revenue for each store/month/year combination in 176 | the database? Calculate this by dividing the total revenue for a group by the number of 177 | sales days available in the transaction table for that group.** 178 | 179 | We can solve this by modifying the solution from Qn 3 to include revenue data. 180 | 181 | ```sql 182 | SELECT 183 | store, 184 | EXTRACT (month FROM saledate) AS month_num, 185 | EXTRACT (year FROM saledate) AS year_num, 186 | COUNT (DISTINCT saledate) AS num_dates, 187 | SUM(amt) AS total_revenue, 188 | total_revenue/num_dates AS daily_revenue 189 | FROM trnsact 190 | WHERE stype='p' 191 | GROUP BY store, month_num, year_num 192 | ORDER BY daily_revenue desc 193 | ``` 194 | > Dr Jana: If your query is correct, you should find that store #204 has an average daily revenue of 195 | $16,303.65 in August of 2005.
196 | 197 | ```sql 198 | -- Modified to check results 199 | SELECT 200 | store, 201 | EXTRACT (month FROM saledate) AS month_num, 202 | EXTRACT (year FROM saledate) AS year_num, 203 | COUNT (DISTINCT saledate) AS num_dates, 204 | SUM(amt) AS total_revenue, 205 | total_revenue/num_dates AS daily_revenue 206 | FROM trnsact 207 | WHERE stype='p' AND store=204 -- ! 208 | GROUP BY store, month_num, year_num 209 | ORDER BY year_num desc, month_num desc -- ! 210 | ``` 211 | 212 | Results 213 | 214 | | STORE | MONTH_NUM | YEAR_NUM | NUM_DATES | TOTAL_REVENUE | DAILY_REVENUE | 215 | | ----- | --------- | -------- | --------- | ------------- | ------------- | 216 | | 204 | 12 | 2004 | 30 | 651309.29 | 21710.31 | 217 | | 204 | 7 | 2005 | 31 | 520512.72 | 16790.73 | 218 | | 204 | 4 | 2005 | 30 | 503312.54 | 16777.08 | 219 | | 204 | 8 | 2005 | 27 | 440198.68 | 16303.65 | 220 | 221 | Awesome! 222 | 223 | *For all of the exercises that follow, unless otherwise specified, we will assess sales by summing 224 | the total revenue for a given time period, and dividing by the total number of days that 225 | contributed to that time period. This will give us “average daily revenue”.* 226 | 227 | ### Question 4b. 228 | 229 | **Modify the query you wrote above to assess the average daily revenue for each store/month/year 230 | combination with a clause that removes all the data from August, 2005. 
Then, given the data we 231 | have available in our data set, I propose that we only examine store/month/year combinations that 232 | have at least 20 days of data within that month.** 233 | 234 | ```sql 235 | SELECT 236 | sub.store, 237 | sub.year_num, 238 | sub.month_num, 239 | sub.num_dates, 240 | sub.daily_revenue 241 | FROM ( 242 | SELECT 243 | store, 244 | EXTRACT (month FROM saledate) AS month_num, 245 | EXTRACT (year FROM saledate) AS year_num, 246 | COUNT (DISTINCT saledate) AS num_dates, 247 | SUM(amt) AS total_revenue, 248 | total_revenue/num_dates AS daily_revenue, 249 | (CASE 250 | WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 251 | END) As can_use_anot 252 | FROM trnsact 253 | WHERE stype='p' AND can_use_anot='can' 254 | GROUP BY store, month_num, year_num 255 | ) AS sub 256 | GROUP BY sub.store, sub.year_num, sub.month_num, sub.num_dates, sub.daily_revenue 257 | HAVING sub.num_dates >= 20 258 | ORDER BY sub.num_dates ASC; 259 | ``` 260 | 261 | > DR JANA: Save your final queries that remove “bad data” for use in subsequent exercises. From 262 | now on (and in the graded quiz), when I ask for average daily revenue: (1) Only examine purchases 263 | (not returns). (2) Exclude all stores with less than 20 days of data. (3) Exclude all data from 264 | August, 2005. 265 | 266 | ### Question 5. 267 | 268 | **What is the average daily revenue brought in by Dillard’s stores in areas of high, medium, or 269 | low levels of high school education? Define areas of “low” education as those that have high 270 | school graduation rates between 50-60%, areas of “medium” education as those that have high 271 | school graduation rates between 60.01-70%, and areas of “high” education as those that have 272 | high school graduation rates of above 70%.** 273 | 274 | I'll start by counting the number of stores within each education level first.
```sql
SELECT
    (CASE
        WHEN msa_high>=50 AND msa_high<=60 THEN 'low'
        WHEN msa_high>60 AND msa_high<=70 THEN 'med'
        WHEN msa_high>70 THEN 'high'
    END) AS education_levels,
    COUNT (DISTINCT store) AS num_stores
FROM store_msa
GROUP BY education_levels
```

Unfortunately it's not a nice distribution:

| EDUCATION_LEVEL | NUM_STORES |
| --------------- | ---------- |
| LOW (50-60%) | 324 |
| MED (60-70%) | 5 |
| HIGH (>70%) | 4 |

The distribution would look healthier if we could redraw the cutoffs at 70% and 80% instead, say something like:

| EDUCATION_LEVEL | NUM_STORES |
| --------------- | ---------- |
| LOW (50-70%) | 213 |
| MED (70-80%) | 111 |
| HIGH (>=80%) | 9 |

But that's not what the question asked, so I'll leave it aside for now.
Back to the question, let's merge them:

```sql
SELECT
    (CASE
        WHEN s.msa_high >= 50 AND s.msa_high < 60 THEN 'low'
        WHEN s.msa_high >= 60 AND s.msa_high < 70 THEN 'medium'
        WHEN s.msa_high >= 70 THEN 'high'
    END) AS education_levels,
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM store_msa s
JOIN (
    SELECT
        store,
        EXTRACT (year FROM saledate) AS year_num,
        EXTRACT (month FROM saledate) AS month_num,
        SUM(amt) AS total_revenue,
        COUNT (DISTINCT saledate) AS num_dates,
        (CASE
            WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
        END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
    GROUP BY year_num, month_num, store
    HAVING num_dates >= 20 -- moving this back to within the subquery
) AS sub
ON s.store = sub.store
GROUP BY education_levels;
```

I'm not sure why the ``HAVING num_dates >= 20`` filter only works when inside the subquery but not when requested from the outer query.
It worked fine in the previous question because ``num_dates`` was part of the outer query's ``GROUP BY`` there. Here the outer query only groups by ``education_levels``, so ``sub.num_dates`` is neither a grouping column nor an aggregate at that level, and ``HAVING`` can't reference it. To filter in the outer query it would have to be ``WHERE sub.num_dates >= 20``; inside the subquery, ``num_dates`` is an aggregate, which is exactly what ``HAVING`` expects.

> DR JANA: If you have executed this query correctly, you will find that the average daily revenue brought in by Dillard’s stores in the low education group is a little more than $34,000, the average daily revenue brought in by Dillard’s stores in the medium education group is a little more than $25,000, and the average daily revenue brought in by Dillard’s stores in the high education group is just under $21,000.

| EDUCATION_LEVEL | AVG_DAILY_REVENUE |
| --------------- | ----------------- |
| low | 34,159.76 |
| medium | 27,112.67 |
| high | 20,921.32 |

Hooray! Moving forward...

*Whenever I ask you to calculate the average daily revenue for a group of stores in either these exercises or the quiz, do so by summing together all the revenue from all the entries in that group, and then dividing that summed total by the total number of sale days that contributed to the total. Do not compute averages of averages.*

### Question 6.

**Compare the average daily revenues of the stores with the highest median msa_income and the lowest median msa_income. In what city and state were these stores, and which store had a higher average daily revenue?
Use ``msa_income`` to calculate.**

```sql
SELECT
    s.city,
    s.state,
    s.msa_income,
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM store_msa s
JOIN (
    SELECT
        store,
        EXTRACT (year FROM saledate) AS year_num,
        EXTRACT (month FROM saledate) AS month_num,
        SUM(amt) AS total_revenue,
        COUNT(DISTINCT saledate) AS num_dates,
        (CASE
            WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
        END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
    GROUP BY year_num, month_num, store
    HAVING num_dates >= 20
) AS sub
ON s.store = sub.store
WHERE s.msa_income IN (
    (SELECT MAX(msa_income) FROM store_msa),
    (SELECT MIN(msa_income) FROM store_msa))
GROUP BY s.city, s.state, s.msa_income; -- every non-aggregate in the SELECT has to be grouped
```

Overall pretty similar to Qn 5.

| CITY | STATE | AVG_DAILY_REVENUE |
| ---- | ----- | ----------------- |
| SPANISH FORT | AL | 17884.08 |
| MCALLEN | TX | 56601.99 |

### Exercise 7:

**What is the brand of the sku with the greatest standard deviation in sprice? Only examine skus that have been part of over 100 transactions.**

```sql
SELECT
    t.sku AS item, -- DISTINCT would be redundant here; GROUP BY already deduplicates
    s.brand AS brand,
    s.style,
    s.color,
    s.size,
    STDDEV_SAMP(t.sprice) AS dev_price,
    COUNT(DISTINCT(t.SEQ||t.STORE||t.REGISTER||t.TRANNUM||t.SALEDATE)) AS distinct_transactions
FROM TRNSACT t
JOIN SKUINFO s
ON t.sku=s.sku
WHERE t.stype='p'
GROUP BY item, brand, s.style, s.color, s.size
HAVING distinct_transactions>100
ORDER BY dev_price DESC
```

I'm not sure which combination of these columns uniquely identifies a transaction, so I concatenated them all: ``SEQ``, ``STORE``, ``REGISTER``, ``TRANNUM``, ``SALEDATE``.
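One caveat on the bare ``||`` concatenation (my own aside, not from the course): two different column combinations can collide into the same string and deflate a ``COUNT(DISTINCT ...)``. A toy Python illustration:

```python
# Two different (register, trannum) pairs...
a = ('12', '3')
b = ('1', '23')

# ...collide when concatenated bare, the way SEQ||STORE||... does,
print(a[0] + a[1] == b[0] + b[1])   # → True ('123' both times)

# but stay distinct with a separator that can't appear in the values.
print('|'.join(a) == '|'.join(b))   # → False ('12|3' vs '1|23')
```

The Dillard's ID columns are probably fixed-width, so the bare concatenation is likely safe here, but inserting a separator character is the defensive habit.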
| ITEM | BRAND | STYLE | COLOR | SIZE | DEV_PRICE |
| ---- | ----- | ----- | ----- | ---- | --------- |
| 2762683 | HART SCH | 403154133510 | BLACK | 42REG | 175.8106 |
| 5453849 | POLO FAS | 9HA 726680 | FA02 | L | 169.4284 |
| 5623849 | POLO FAS | 9HA 726680 | FA02 | M | 164.4187 |

### Exercise 8:

**Examine all the transactions for the sku with the greatest standard deviation in sprice, but only consider skus that are part of more than 100 transactions. Do you think the retail price was set too high, or just right?**

```sql
SELECT
    s.sku AS items,
    s.brand,
    AVG(t.sprice) AS avg_price,
    STDDEV_SAMP(t.sprice) AS variation_price,
    AVG(t.orgprice)-AVG(t.sprice) AS sale_price_diff,
    COUNT(DISTINCT t.trannum) AS distinct_transactions -- note: Exercise 7 used a full composite key; trannum alone may repeat across stores
FROM skuinfo s
JOIN trnsact t
ON s.sku=t.sku
WHERE t.stype='p'
GROUP BY items, s.brand
HAVING distinct_transactions > 100
ORDER BY variation_price DESC;
```

Not a perfect analysis, but notice that the items with the highest price variation (``variation_price``) are not quite the ones with the greatest gap between original and sale price (``sale_price_diff``). This suggests that some stores simply price items higher or lower across the board, rather than offering massively discounted sale prices (vs original prices) to clear stock; that may just reflect the ``msa_income`` differences around each store.

So... Was the retail price just right? Can't say for sure, but it's definitely not too high.
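For reference, ``STDDEV_SAMP`` is the *sample* standard deviation (n−1 denominator), and the per-SKU statistics in these two exercises can be mimicked on a toy price list with Python's ``statistics`` module (the prices here are invented, not real Dillard's numbers):

```python
import statistics

# Hypothetical sale prices and a single original price for one SKU
sprices = [40.0, 50.0, 60.0, 70.0]
orgprice = 65.0

avg_price = statistics.mean(sprices)       # like AVG(sprice)
dev_price = statistics.stdev(sprices)      # like STDDEV_SAMP(sprice): n-1 denominator
sale_price_diff = orgprice - avg_price     # like AVG(orgprice) - AVG(sprice)

print(avg_price, round(dev_price, 4), sale_price_diff)
# → 55.0 12.9099 10.0
```

A wide ``dev_price`` with a small ``sale_price_diff``, as in this toy SKU, is exactly the "priced differently across stores rather than discounted" pattern discussed above.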
460 | 461 | ### Exercise 9 462 | 463 | **What was the average daily revenue Dillard’s brought in during each month of 464 | the year?** 465 | 466 | ```sql 467 | SELECT 468 | (CASE 469 | WHEN sub.month_num=1 THEN 'Jan' 470 | WHEN sub.month_num=2 THEN 'Feb' 471 | WHEN sub.month_num=3 THEN 'Mar' 472 | WHEN sub.month_num=4 THEN 'Apr' 473 | WHEN sub.month_num=5 THEN 'May' 474 | WHEN sub.month_num=6 THEN 'Jun' 475 | WHEN sub.month_num=7 THEN 'Jul' 476 | WHEN sub.month_num=8 THEN 'Aug' 477 | WHEN sub.month_num=9 THEN 'Sep' 478 | WHEN sub.month_num=10 THEN 'Oct' 479 | WHEN sub.month_num=11 THEN 'Nov' 480 | WHEN sub.month_num=12 THEN 'Dec' 481 | END) as month_name, 482 | SUM(num_dates) AS num_days_in_month, 483 | SUM(total_revenue)/SUM(num_dates) AS avg_monthly_revenue 484 | FROM ( 485 | SELECT 486 | EXTRACT (month FROM saledate) AS month_num, 487 | EXTRACT (year FROM saledate) AS year_num, 488 | COUNT (DISTINCT saledate) AS num_dates, 489 | SUM(amt) AS total_revenue, 490 | (CASE 491 | WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 492 | END) As can_use_anot 493 | FROM trnsact 494 | WHERE stype='p' AND can_use_anot='can' 495 | GROUP BY month_num, year_num 496 | HAVING num_dates>=20 497 | ) AS sub 498 | GROUP BY month_name 499 | ORDER BY avg_monthly_revenue DESC; 500 | ``` 501 | 502 | | MONTH_NUM | DAYS_IN_MONTH | AVG_MONTHLY_REVENUE | 503 | | --------- | ------------- | ------------------- | 504 | | Dec | 30 | 11333356.01 505 | | Feb | 28 | 7363752.69 506 | | Jul | 31 | 7271088.69 507 | | Apr | 30 | 6949616.95 508 | | Mar | 30 | 6736315.39 509 | | May | 31 | 6666962.59 510 | | Jun | 30 | 6524845.42 511 | | Nov | 29 | 6296913.50 512 | | Oct | 31 | 6106357.90 513 | | Jan | 31 | 5836833.31 514 | | Aug | 31 | 5616841.37 515 | | Sep | 30 | 5596588.02 516 | 517 | > DR JANA: you should find that December consistently has the best sales, September consistently 518 | has the worst or close to the worst sales, and July has very good sales, although less than December. 
### Question 10

**Which department, in which city and state of what store, had the greatest percentage increase in average daily sales revenue from November to December? Note: Use percentage change.**

Hints from the notes:

1. Need to join 4 tables
1. Use two CASE statements within an aggregate function to sum all the revenue for November and December, separately
1. Use two CASE statements within an aggregate function to count the number of sale days that contributed to the revenue in November and December, separately
1. Use these 4 fields to calculate the ``average daily revenue`` for November and December. You can then calculate the change in these values using the % change formula: ``((X-Y)/Y)*100``
1. Don’t forget to exclude “bad data” and to exclude ``return`` transactions.

First I'll find just the percentage increase in revenue from November to December for each ``store``, and join the extra details like ``dept`` later.
537 | 538 | ```sql 539 | SELECT 540 | sub.store, 541 | SUM(CASE WHEN sub.month_num=11 THEN sub.amt END) AS Nov_revenue, 542 | SUM(CASE WHEN sub.month_num=12 THEN sub.amt END) AS Dec_revenue, 543 | COUNT(DISTINCT CASE WHEN sub.month_num=11 THEN sub.saledate END) AS Nov_days, 544 | COUNT(DISTINCT CASE WHEN sub.month_num=12 THEN sub.saledate END) AS Dec_days, 545 | Nov_revenue/Nov_days AS Nov_daily_rev, 546 | Dec_revenue/Dec_days AS Dec_daily_rev, 547 | ((Dec_daily_rev-Nov_daily_rev)/Nov_daily_rev)*100 AS percent_increase 548 | FROM ( 549 | SELECT 550 | store, 551 | amt, 552 | saledate, 553 | EXTRACT (month FROM saledate) AS month_num, 554 | EXTRACT (year FROM saledate) AS year_num, 555 | (CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) As can_use_anot 556 | FROM trnsact 557 | WHERE stype='p' AND can_use_anot='can' 558 | ) AS sub 559 | GROUP BY sub.store 560 | HAVING Nov_days>=20 AND Dec_days>=20 561 | ORDER BY percent_increase DESC; 562 | ``` 563 | 564 | | STORE | NOV_REV | DEC_REV | NOV_DAYS | DEC_DAYS | NOV_DAILY_REV | DEC_DAILY_REV | PERCENT_INC | 565 | | ----- | ------- | ------- | -------- | -------- | ------------- | ------------- | ----------- | 566 | | 3809 | 210139.08 | 486314.01 | 29 | 30 | 7246.18 | 16210.47 | 124.00 567 | | 303 | 175003.74 | 399975.83 | 29 | 30 | 6034.61 | 13332.53 | 121.00 568 | | 7003 | 169776.27 | 380024.73 | 29 | 30 | 5854.35 | 12667.49 | 116.00 569 | 570 | Seems okay. Let's add the others in. 
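A quick sanity check of the percent-change arithmetic, using store 3809's row from the table above:

```python
# Store 3809's figures from the results table
nov_revenue, nov_days = 210139.08, 29
dec_revenue, dec_days = 486314.01, 30

nov_daily = nov_revenue / nov_days   # ≈ 7246.18
dec_daily = dec_revenue / dec_days   # ≈ 16210.47

percent_increase = (dec_daily - nov_daily) / nov_daily * 100
print(round(percent_increase))       # → 124, matching the (rounded) PERCENT_INC column
```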
```sql
SELECT -- outer query keeps only the columns we need from the inner query
    clean.store,
    clean.dept,
    clean.deptdesc,
    clean.city,
    clean.state,
    clean.percent_increase
FROM (
    SELECT
        sub.store,
        d.dept,
        d.deptdesc,
        str.city,
        str.state,
        SUM(CASE WHEN sub.month_num=11 THEN sub.amt END) AS Nov_revenue,
        SUM(CASE WHEN sub.month_num=12 THEN sub.amt END) AS Dec_revenue,
        COUNT(DISTINCT CASE WHEN sub.month_num=11 THEN sub.saledate END) AS Nov_days,
        COUNT(DISTINCT CASE WHEN sub.month_num=12 THEN sub.saledate END) AS Dec_days,
        Nov_revenue/Nov_days AS Nov_daily_rev,
        Dec_revenue/Dec_days AS Dec_daily_rev,
        ((Dec_daily_rev-Nov_daily_rev)/Nov_daily_rev)*100 AS percent_increase
    FROM (
        SELECT
            sku.dept, -- NEW: include this here because we need to group by department at the most granular level
            t.store,
            t.amt,
            t.saledate,
            EXTRACT (month FROM t.saledate) AS month_num,
            EXTRACT (year FROM t.saledate) AS year_num,
            (CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) AS can_use_anot
        FROM trnsact t
        INNER JOIN skuinfo sku
        ON t.sku=sku.sku
        WHERE stype='p' AND can_use_anot='can' -- only query purchases, from legal dates
    ) AS sub
    INNER JOIN strinfo str
    ON str.store = sub.store -- to select city and state
    INNER JOIN deptinfo d
    ON d.dept = sub.dept -- to select department description
    GROUP BY sub.store, d.dept, d.deptdesc, str.city, str.state
    HAVING Nov_days>=20 AND Dec_days>=20
) AS clean
GROUP BY 1,2,3,4,5,6
ORDER BY clean.percent_increase DESC
```

| STORE | DEPT | DEPT_DESC | CITY | STATE | PERCENTAGE_INCREASE |
| ----- | ---- | --------- | ---- | ----- | ------------------- |
| 3403 | 7205 | LOUIS VL | SALINA | KS | 596.00 |
| 9806 | 6402 | FREDERI | MABELVALE | AR | 476.00 |
| 404 | 2107 | MAI | PINE BLUFF | AR | 442.00 |

### Question 11

**What is the city and state of the store that had the greatest decrease in average daily revenue from August to September?**

This is easy; just adapt the query from Qn 10 and remove the unnecessary tables.

```sql
SELECT
    sub.store,
    str.city, -- join the strinfo table for these two
    str.state,
    SUM(CASE WHEN sub.month_num=8 THEN sub.amt END) AS Aug_revenue,
    SUM(CASE WHEN sub.month_num=9 THEN sub.amt END) AS Sep_revenue,
    COUNT(DISTINCT CASE WHEN sub.month_num=8 THEN sub.saledate END) AS Aug_days,
    COUNT(DISTINCT CASE WHEN sub.month_num=9 THEN sub.saledate END) AS Sep_days,
    Aug_revenue/Aug_days AS Aug_daily_rev,
    Sep_revenue/Sep_days AS Sep_daily_rev,
    (Sep_daily_rev-Aug_daily_rev) AS rev_difference
FROM ( -- clean inner query for legal dates and purchases only
    SELECT
        store,
        amt,
        saledate,
        EXTRACT (month FROM saledate) AS month_num,
        EXTRACT (year FROM saledate) AS year_num,
        (CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
) AS sub
INNER JOIN strinfo str -- to extract the store's city and state
ON str.store = sub.store
GROUP BY sub.store, str.city, str.state
HAVING Aug_days>=20 AND Sep_days>=20 -- only keep stores with at least 20 sale days per month
ORDER BY rev_difference ASC
```

| STORE | CITY | STATE | REV_DIFFERENCE |
| ----- | ---- | ----- | -------------- |
| 4003 | WEST DES MOINES | IA | -6479.60 |
| 9103 | LOUISVILLE | KY | -5233.12 |
| 2707 | MCALLEN | TX | -5109.47 |

### Question 12

**Determine the month of maximum total revenue for each store.
Count the number of stores whose month of maximum total revenue was in each of the twelve months.**

**Then determine the month of maximum average daily revenue. Count the number of stores whose month of maximum average daily revenue was in each of the twelve months. How do they compare?**

I'm guessing the assignment wants us to see which month the most stores hit their maximum total revenue in, and likewise which month the most stores hit their maximum average daily revenue in.

If the two counts don't match, it might point to hidden trends, outliers, or missing data within the set.

Things to do:

1. Calculate the average daily revenue for each store, for each month (for each year, but there will only be one year associated with each month)
1. Order the rows within a store according to average daily revenue from high to low
1. Assign a rank to each of the ordered rows
1. Retrieve all of the rows that have the rank you want
1. Count all of your retrieved rows

> DR JANA: You can assign ranks using the ``ROW_NUMBER`` or ``RANK()`` function. Make sure you “partition” by store in your ``ROW_NUMBER`` clause. Lastly when you have confirmed that the output is reasonable, introduce a ``QUALIFY`` clause (described in the references above) into your query in order to restrict the output to rows that represent the month with the minimum average daily revenue for each store.

Starting with tasks (1) and (2), I'll calculate the average daily revenue for each ``store``, by ``month``. We can do this by recycling the query from Qn 9.
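As a warm-up, the rank-and-count idea from the to-do list can be sketched in plain Python (the store/month/revenue rows are invented toy data): partition by store, keep each store's best month — which is what Teradata's ``ROW_NUMBER() OVER (PARTITION BY store ...)`` plus ``QUALIFY row_num = 1`` does — then count peaks per month.

```python
from collections import Counter

# (store, month, avg_daily_revenue) -- toy numbers
rows = [
    (1, 'Nov', 9000.0), (1, 'Dec', 15000.0), (1, 'Jul', 11000.0),
    (2, 'Nov', 8000.0), (2, 'Dec', 14000.0),
    (3, 'Jul', 9500.0), (3, 'Dec', 9000.0),
]

best_month = {}  # per-store "rank 1" row, like QUALIFY row_num = 1
for store, month, rev in rows:
    if store not in best_month or rev > best_month[store][1]:
        best_month[store] = (month, rev)

# Count how many stores peak in each month, like the outer COUNT/GROUP BY
print(Counter(m for m, _ in best_month.values()))
# → Counter({'Dec': 2, 'Jul': 1})
```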
```sql
SELECT
    (CASE
        WHEN sub.month_num=1 THEN 'Jan'
        WHEN sub.month_num=2 THEN 'Feb'
        WHEN sub.month_num=3 THEN 'Mar'
        WHEN sub.month_num=4 THEN 'Apr'
        WHEN sub.month_num=5 THEN 'May'
        WHEN sub.month_num=6 THEN 'Jun'
        WHEN sub.month_num=7 THEN 'Jul'
        WHEN sub.month_num=8 THEN 'Aug'
        WHEN sub.month_num=9 THEN 'Sep'
        WHEN sub.month_num=10 THEN 'Oct'
        WHEN sub.month_num=11 THEN 'Nov'
        WHEN sub.month_num=12 THEN 'Dec'
    END) AS month_name,
    sub.store,
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM (
    SELECT
        store,
        EXTRACT (month FROM saledate) AS month_num,
        EXTRACT (year FROM saledate) AS year_num,
        COUNT (DISTINCT saledate) AS num_dates,
        SUM(amt) AS total_revenue,
        (CASE
            WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
        END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
    GROUP BY month_num, year_num, store -- store must be grouped too, since it is selected
    HAVING num_dates>=20
) AS sub
GROUP BY month_name, sub.store
ORDER BY avg_daily_revenue DESC;
```

(3) Let's add the bit for ``ROW_NUMBER()`` and ``PARTITION``: (a snippet)

```sql
SELECT
    (CASE
        WHEN sub.month_num=1 THEN 'Jan'
        ...
        WHEN sub.month_num=12 THEN 'Dec'
    END) AS month_name,
    sub.store,
    SUM(sub.total_revenue) AS sum_monthly_revenue, -- TOTAL monthly rev
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue, -- AVERAGE daily rev within the month
    ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY sum_monthly_revenue DESC) AS Row_sum_rev, -- each rank ordered by the measure its name says
    ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY avg_daily_revenue DESC) AS Row_avg_rev
FROM (
    ...
) AS sub
GROUP BY month_name, sub.store
ORDER BY avg_daily_revenue DESC;
```

(4)+(5) Finally, let's retrieve all rows with the top-ranking month, to see which month performed best.

```sql
SELECT
    clean.month_name AS month_n,
    COUNT(CASE WHEN clean.Row_sum_rev =1 THEN clean.store END) AS Total_monthly_rev_count, -- count the number of rank 1s per month
    COUNT(CASE WHEN clean.Row_avg_rev =1 THEN clean.store END) AS Average_daily_rev_count -- count the number of rank 1s per month
FROM (
    SELECT
        (CASE
            WHEN sub.month_num=1 THEN 'Jan'
            WHEN sub.month_num=2 THEN 'Feb'
            WHEN sub.month_num=3 THEN 'Mar'
            WHEN sub.month_num=4 THEN 'Apr'
            WHEN sub.month_num=5 THEN 'May'
            WHEN sub.month_num=6 THEN 'Jun'
            WHEN sub.month_num=7 THEN 'Jul'
            WHEN sub.month_num=8 THEN 'Aug'
            WHEN sub.month_num=9 THEN 'Sep'
            WHEN sub.month_num=10 THEN 'Oct'
            WHEN sub.month_num=11 THEN 'Nov'
            WHEN sub.month_num=12 THEN 'Dec'
        END) AS month_name,
        sub.store,
        SUM(sub.total_revenue) AS sum_monthly_revenue,
        SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue,
        ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY sum_monthly_revenue DESC) AS Row_sum_rev,
        ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY avg_daily_revenue DESC) AS Row_avg_rev
    FROM (
        SELECT
            store,
            EXTRACT (month FROM saledate) AS month_num,
            EXTRACT (year FROM saledate) AS year_num,
            COUNT (DISTINCT saledate) AS num_dates,
            SUM(amt) AS total_revenue,
            (CASE
                WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
            END) AS can_use_anot
        FROM trnsact
        WHERE stype='p' AND can_use_anot='can'
        GROUP BY month_num, year_num, store
        HAVING num_dates>=20
    ) AS sub
    GROUP BY month_name, sub.store
) AS clean
GROUP BY month_n
ORDER BY Total_monthly_rev_count DESC
```

> DR JANA: If you write your queries correctly, you will find that 8 stores have the greatest total sales in April, while only 4 stores have the greatest average daily revenue in April.

| MONTH | TOTAL_MONTHLY | AVG_DAILY |
| ----- | ------------- | --------- |
| Dec | 317 | 321 |
| Mar | 4 | 3 |
| Jul | 3 | 3 |

While the output fits our expectations of the data (i.e. that ``Dec`` should be the most popular month), it doesn't match Dr Jana's hint.

After reading the forum, I realised the official assignment seems to give the wrong hint (quite a significant mistake!). We get the expected result if we write our queries to find the ``LOWEST`` total sales as ranked by month instead of the ``HIGHEST``, i.e. rank 12 rather than rank 1:

```sql
SELECT
    clean.month_name AS month_n,
    COUNT(CASE WHEN clean.Row_sum_rev =12 THEN clean.store END) AS Total_monthly_rev_count, -- rank 12 = a store's worst of its twelve months
    COUNT(CASE WHEN clean.Row_avg_rev =12 THEN clean.store END) AS Average_daily_rev_count
FROM (
    ...
```

| MONTH | LOW_TOTAL_MONTH | LOW_AVG_DAILY |
| ----- | --------------- | ------------- |
| Aug | 120 | 77 |
| Jan | 73 | 54 |
| Sep | 72 | 108 |
| ... | ... | ... |
| Apr | 4 | 8 |
| ... | ... | ... |
| Dec | 0 | 0 |

# End

*Thoughts on this course: the notes were messy, with quite a few significant mistakes, like that last one we saw. But overall it was a good introduction to SQL, and I appreciate the resources that let us play around with it.*

Key takeaways:

* Computational thinking: learning how to split large, complex problems into smaller pieces that can be reassembled later
* Rigorous testing, and checking for trend inconsistencies using month and year aggregations or standard deviations
* Dealing with outliers and missing data by setting predefined criteria in subqueries
* Syntax nuances between the MySQL and Teradata dialects
* Perseverance for long queries lol
--------------------------------------------------------------------------------