├── README.md
├── SQLDukeWeek1.md
├── SQLDukeWeek2.md
├── SQLDukeWeek3 - Inner & Outer Joins.md
├── SQLDukeWeek3.md
├── SQLDukeWeek4.md
├── Teradata Cheatsheet.md
├── Week2-Dillards.md
├── Week3-Dillards.md
├── Week3Ex7-InnerJoin.sql
├── Week3Ex8-OuterJoins.sql
├── Week4Ex10-BizInt.sql
├── Week4Ex12-BizInt.sql
├── Week4Ex9-Subqueries.sql
└── Week5 - Dillards.md

/README.md:
--------------------------------------------------------------------------------

# SQL Duke

Answers to Duke University's Managing Big Data with MySQL course (2017), found [here](https://www.coursera.org/learn/analytics-mysql/home/info "Course Information for 'Managing Big Data with MySQL'").

### Lecture Notes

* Week 1 - ER diagrams, relational schemas, database concepts
* Week 2 - SELECT, WHERE, FROM, ORDER BY, LIMIT, operations
* Week 3 - GROUP BY, HAVING, DISTINCT, operators
* Week 3 - Inner joins, outer joins + examples
* Week 4 - Subqueries & more operators + examples

*Note: For Weeks 3 & 4, I chose to focus more on* examples *of joins instead of lengthy theoretical explanations of what joins are. Rather than paraphrasing Duke's notes, I focused on listing examples that helped me understand the concepts better.*

### MySQL Assignments

* MySQL answers for Week 3: Ex 7
* MySQL answers for Week 3: Ex 8
* MySQL answers for Week 4: Ex 9
* MySQL answers for Week 4: Ex 10

*Answers to Exercises 1-6 of Week 1 & Week 2 are not included because they're already available online as part of the course.*

### Teradata Assignments

* Teradata Week 2 answers
* Teradata Week 3 answers
* *No Teradata assignment for Week 4, do quiz instead*
* Teradata Week 5 FINAL EXAM ANSWERS

### License

MIT.
--------------------------------------------------------------------------------
/SQLDukeWeek1.md:
--------------------------------------------------------------------------------

# Week 1: Introduction to Databases

### Background
##### What is SQL? Why do we need it?
* SQL = Structured Query Language
* SQL is used by every relational database (DB) management system, or DBMS.
* It lets us efficiently store and extract large amounts of data

> Prof: "Imagine that you have multiple users trying to access the same excel workbook/ data spreadsheet. It is going to lag like hell. Now multiply that problem by 1000. Holy crap. This is why we need a database system, so we don't go crazy."

- Other DBMSs also use languages _based on_ SQL (e.g., MySQL, PostgreSQL)
- Once you learn how to use the general SQL language, it will be easy to switch between systems
- Same as driving a car - once you know how to drive one car, you can switch with relative ease between different car brands :)

##### What are relational databases?

* It is something awesome
* Officially, a relational database is "a database structured to _recognize relations_ between stored items of information"
* Basically, recognising relations between stored items lets the DB extract only the items that are absolutely needed, reducing run time
* (You guessed it!) Based on set theory.
> Summary: It only interacts with the subsets of data needed to provide the information you asked for, rather than opening an entire Excel sheet

##### Benefits
* More memory efficient for large datasets
* Faster responses to queries too!
* Having structure can prevent or minimise data overrides
* Supports greater data entry accuracy; you can specify what data type is allowed per field (e.g.
only numbers)

##### Basic Features
- Tables = smallest logical subset
- Column names = must be unique
- Order of columns & rows MUST NOT matter, so the db can retrieve information in whatever order or fashion it determines to be the fastest.

##### About the Field/ Course
- We will generally focus on making queries
- Only early-stage startups need to set up or maintain the db
- This course will mostly cover how to make queries, but not how to make a DB

> Alright hotstuff. In case you're wondering/forgot, why bother with diagrams?
>
> Because understanding how a db is laid out helps greatly with learning to write queries later.
> Now let's get started!

-----

# ER Diagrams
### Entities
- Shape: boxes
- Each box = one category, possibly a table
- Each individual member of a box's category = one entity instance
- Every entity must have at least one column that serves as a UA, or unique key identifier (see next section for UA)

### Attributes
- Shape: circles
- Each circle = one attribute of a box, i.e. an attribute of an entity.

**Unique Attribute**
- A Unique Attribute or a Primary Key is an attribute with a unique value in each entity instance.
- Underline the UA
- This is the column that lets you link tables together
- Eg. Student IDs are unique for every student

**Composite Attribute**
- Composite attributes are those that can be completely reconstructed from other attributes
- Eg. Classroom ID = building ID + room unit no.
- Usually, the composite attribute itself (aka the final product) is not included in the main DB to save space. Only its parts are included in the DB.
**Examples of composite attribute**
- Classroom is the entity
- Identified by "classroomID" value
- All classroom IDs will have a building and room number attribute attached to them

### Relationships
- ---- lines between entities
- < > diamonds to describe relationship

### Cardinality constraints
- Describe the minimum (min) or maximum (max) number of items the other entity can be linked to.
- Bracket notes: always written left to right, even if the diagram orientation or page orientation is right to left.
- M = many (no fixed maximum)
- 0 = optional (minimum of zero)
- Lines closest to rectangle: MAXIMUM no. of instances associated with that entity
- Lines furthest away: MINIMUM no. of instances associated with that entity
- --- straight line = single
- / > crows feet = many

**Examples of cardinality constraints**

- Each college can be attended by (max) multiple students, but is attended by (min) at least one student. (M, 1)
- Each student attends (min) one and (max) one college. (1, 1)
- (10, 1000) = each college needs a minimum of 10 students, max of 1000 students.

### Weak Entities
- Weak entities are not identifiable on their own (not fully unique like a full entity)
- Weak entities have a double outline
- Can be combined with another entity's key to form a fully unique key
- A dotted underline on an attribute means that attribute is a partial key

**Example: Building & apartment IDs**

- Building ID is the unique primary key
- The partial key (Apartment ID) can become a full unique key (equivalent of Building ID) IF it is combined with the unique key of the entity it is connected to via the double-diamond (identifying) relationship.
- Apartment ID is only unique if combined with Building ID.
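The building & apartment example above can be sketched as actual table definitions. This is a hypothetical schema (the table and column names are mine, not from the course database), run in SQLite through Python's `sqlite3` module, which accepts the same core SQL: the composite primary key makes an apartment ID unique only *within* its building.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE building (
        building_id TEXT PRIMARY KEY       -- the strong entity's unique key
    );
    CREATE TABLE apartment (
        building_id  TEXT,                 -- key borrowed from the strong entity
        apartment_id TEXT,                 -- partial key: not unique on its own
        PRIMARY KEY (building_id, apartment_id),  -- composite key = full unique key
        FOREIGN KEY (building_id) REFERENCES building (building_id)
    );
""")

conn.execute("INSERT INTO building VALUES ('B1'), ('B2')")

# The same apartment number may exist in two different buildings...
conn.execute("INSERT INTO apartment VALUES ('B1', '0101'), ('B2', '0101')")

# ...but not twice in the same building: the composite key rejects it.
try:
    conn.execute("INSERT INTO apartment VALUES ('B1', '0101')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The second insert of ('B1', '0101') fails because the pair (building_id, apartment_id) already exists, which is exactly what "Apartment ID is only unique if combined with Building ID" means.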
# Relational Schemas

- Similar items, but new names:
- tables (or "relations")
- columns ("fields" / "attributes")
- rows ("records" / "tuples")
- An RS is a simplified version of (or plan for) a db
- Reflects logical ideas, but NOT physical (actual) design
- Strictly no order; variables must be independent
- Benefit: Looks less messy lol
- Problem: they lack the cardinality constraints found in ER diagrams. So sometimes one value matches more than one key, but you won't know this until you see the ER diagram too.

### PRIMARY KEY (PK)
- Each table = one box
- PK should be underlined and put at top of box.
- The PK must have a value unique for every row in that table.
- PK strictly CANNOT have null values.
- Columns that can double up as primary keys, because they also have unique values, can be marked with a (U)

### FOREIGN KEY (FK)
- Columns that refer to the primary key of another table
- Write (FK) next to the item to highlight its status
- Draw arrows to the table it refers to.

### WEAK ENTITIES
- Will have TWO underlined keys: their own partial key paired with a foreign key, aka their own key paired with the primary key of another table
- Together, this composite key forms the primary key.

### MANY TO MANY RELATIONSHIP
- Clue that columns can have many instances of one another
- For example, many classes can have combinations of many students
- Usually, primary keys must not be duplicated. But in a many-to-many relationship, an exception is made to illustrate the relationship.
- So, this table has nothing but 2 foreign keys in it, making 1 composite primary key

# Conclusion: Building ERD diagrams
- Using the ERDPlus tool to make diagrams
- www.erdplus.com
- Can export diagrams as images

--------------------------------------------------------------------------------
/SQLDukeWeek2.md:
--------------------------------------------------------------------------------

# Week 2

### INTRODUCTION

Note to self: Typical syntax follows this order
```sql
SELECT item
FROM table
WHERE condition
GROUP BY variable
HAVING condition
ORDER BY category ASC / DESC
```
- Only SELECT + FROM are actually required. Rest = optional
- The DB will physically manage and plan *how* best to retrieve the data (aka not your problem for now). Focus on extracting data first.
- To tell the db that data is missing, type NULL, not zero.

### Style Notes
1. CAPITALISE all commands, aka the first word
2. CAPITALISE other keywords like 'sum' or 'avg' too
3. End all queries with ;
4. Start each command on a new line (easier to read)

### Optional but Good To Follow
The DUKE course uses the dialect *MySQL*. However, MySQL is a nappy hipster that doesn't quite follow SQL convention. In fact, it's pretty far out. If you ever need to change to a different SQL dialect, you'll need these rules too. Hence, it might be a good idea to start making them a habit now.

1. Not all DBs are case insensitive. Try to write the EXACT name used in the database, CaPiTal letters and all.
2. Names inside "inverted commas" are strictly case sensitive. Use the EXACT name used in the db too.
3. Make indentations for new subqueries or new lines. You'll learn more about this later.
4. Although MySQL accepts both single and double inverted commas, stick to single quotes where possible. Most other DBs only accept single quotes.
# Start: First Look At Your Database
*Let's assume you're exploring a new DB, but have no diagrams about it. How would you explore and get to know it?*

To make something the default database for our queries, run this command:

```sql
USE dognitiondb
```

To show all tables in the database:
```sql
SHOW tables
```

To show all columns in a table:
```sql
SHOW columns FROM [table name, without brackets]
-- or:
DESCRIBE [table name, without brackets]
```
Note: In the output, the SHOW/DESCRIBE command will reveal whether NULL values can be stored in that field of the table. The "Key" column of the output also provides the following information about each field of data in the table being described (see [here](https://dev.mysql.com/doc/refman/5.6/en/show-columns.html "SQL documentation") for more information).

##### Hints

- PRI - the column is a PRIMARY KEY or is one of the columns in a multiple-column PRIMARY KEY.
- UNI - the column is the first column of a UNIQUE index.
- MUL - the column is the first column of a nonunique index in which multiple occurrences of a given value are permitted within the column.
- Empty - the column either is not indexed or is indexed only as a secondary column in a multiple-column, nonunique index.

*Note: The "Default" field of the output indicates the default value that is assigned to the field. The "Extra" field contains any additional information that is available about a given field in that table. For now, you won't die yet if you don't understand this.*

To show all the data in a column:
```sql
SELECT [column name, without brackets]
FROM [table name, without brackets];
```

If you have multiple databases loaded:
```sql
SHOW columns FROM [table name] FROM [database name]
SHOW columns FROM databasename.tablename
```
# MySQL Variable Types

In a MySQL database, there are three (3) main data types: text, numbers and dates/times. When you design your database, it is important that you select the appropriate type, since this determines what type of data you can store in that column. Using the most appropriate type can also increase the database's overall performance.

### Text Types

| Name | Description |
| ------ | -------- |
| CHAR( ) | A fixed section from 0 to 255 characters long.|
| VARCHAR( ) | A variable section from 0 to 255 characters long. |
| TINYTEXT | A string with a maximum length of 255 characters. |
| TEXT | A string with a maximum length of 65535 characters.|
| BLOB | A string with a maximum length of 65535 characters.|
| MEDIUMTEXT | A string with a maximum length of 16777215 characters. |
| MEDIUMBLOB | A string with a maximum length of 16777215 characters.|
| LONGTEXT| A string with a maximum length of 4294967295 characters.|
| LONGBLOB| A string with a maximum length of 4294967295 characters.|

The ( ) brackets allow you to specify the maximum number of characters that can be used in the column. Meanwhile, BLOB stands for Binary Large OBject, and can be used to store non-text information that is encoded into text.
*How cute is that?!*

### Number Types

| Name | Description | Length |
| --- | ---- | ---- |
| TINYINT( ) | -128 to 127 normal | 0 to 255 UNSIGNED |
| SMALLINT( ) | -32768 to 32767 normal | 0 to 65535 UNSIGNED |
| MEDIUMINT( ) | -8388608 to 8388607 normal | 0 to 16777215 UNSIGNED |
| INT( ) | -2147483648 to 2147483647 normal | 0 to 4294967295 UNSIGNED |
| BIGINT( ) | -9223372036854775808 to 9223372036854775807 normal | 0 to 18446744073709551615 UNSIGNED |
| FLOAT | A small number with a floating decimal point. | |
| DOUBLE( , ) | A large number with a floating decimal point. | |
| DECIMAL( , ) | A DOUBLE stored as a string, allowing for a fixed decimal point. | |

By default, the integer types will allow a range between a negative number and a positive number, as indicated in the table above.

You can use the UNSIGNED keyword, which will instead only allow positive numbers, which start at 0 and count up.

#### Useful Num Commands
| Command | Description |
| --- | ---- |
| AVG( ) | Finds the average of all rows of the variable |
| SUM( ) | Finds the sum of all rows of the variable |
| FLOOR( ) | Rounds a floating decimal down to the nearest integer |
| CEIL( ) | Rounds a floating decimal up to the nearest integer |
| ROUND(num, x) | Rounds num to x decimal places, eg. ROUND(height, 2) |
| % | Modulus |
| Var % 2 = 0 | Used in conjunction with the WHERE command, to return all rows where 'var' is even numbered |
| Var % 2 != 0 | Used in conjunction with the WHERE command, to return all rows where 'var' is odd numbered |

# SELECT, FROM
**SELECT** is used anytime you want to retrieve data from a table. In order to retrieve that data, you always have to provide at least two pieces of information:

>(1) WHAT you want to select, and
(2) FROM where you want to select it.
Example of most basic select:
```sql
SELECT breed
FROM dogs;
```
SELECT statements can also be used to make new derivations of individual columns using "+" for addition, "-" for subtraction, "*" for multiplication, or "/" for division. For example, if you wanted the median inter-test intervals in hours instead of minutes or days, you could query:
```sql
SELECT median_iti_minutes/60, median_iti_minutes
FROM dogs
```
# LIMIT / OFFSET

LIMIT restricts the number of rows in the output.
OFFSET skips the first X rows before the output begins.
##### Examples
Select only 10 rows of data
```sql
SELECT breed
FROM dogs
LIMIT 10;
```
Select 10 rows of data, but AFTER the first 5 rows.
```sql
SELECT breed
FROM dogs
LIMIT 5, 10; -- offset 5, then return 10 rows

SELECT breed
FROM dogs
LIMIT 10 OFFSET 5; -- same result; note that OFFSET is written after LIMIT
```

# WHERE + BETWEEN, AND, OR

We can use the WHERE statement to specify our queries, like the examples below. We can add BETWEEN, AND, OR operators in conjunction with variables to make them more specific, like this:
```SQL
SELECT dog_guid, weight
FROM dogs
WHERE weight BETWEEN 10 AND 50;

SELECT dog_guid, dog_fixed, dna_tested
FROM dogs
WHERE dog_fixed=1 OR dna_tested=1;

SELECT dog_guid, dog_fixed, dna_tested
FROM dogs
WHERE dog_fixed=1 AND dna_tested!=1;

SELECT dog_guid
FROM dogs
WHERE YEAR(created_at) > 2015 -- you will learn more about dates later
```
##### Using WHERE + Strings
Strings need to be surrounded by quotation marks in SQL. MySQL accepts both double and single quotation marks, but some database systems only accept single quotation marks, so it might be a good idea to start that habit right now.
Note that backticks are not for strings at all: they are used to quote table and column names (identifiers), which you'll need whenever a name clashes with an SQL keyword.

>'the marks that surround this phrase are single quotation marks'
"the marks that surround this phrase are double quotation marks"
`the marks that surround this phrase are backticks`

```SQL
SELECT dog_guid, weight
FROM dogs
WHERE breed = 'Golden Retriever';
```
### Date/Time Types
In the previous section, we saw one example of using date-time to specify a query further. Let's learn more about them now. We can use the WHERE statement to interact with datetime data. Time-related data is a little more complicated to work with than other types of data, because it must have a very specific format. Examples of datetime types:

```sql
DATE: YYYY-MM-DD
DATETIME: YYYY-MM-DD HH:MM:SS
TIMESTAMP: YYYYMMDDHHMMSS
TIME: HH:MM:SS
YEAR: YYYY
```
Date/Time fields will only accept a valid date or time. A time stamp stored in one row of data might look like this:
```sql
2013-02-07 02:50:52
```
Using the same date-time format in combination with WHERE, we can select specific rows of data that fit the date criteria. For example, we can specify a range of dates we'd like to retrieve data from:

```sql
SELECT dog_guid, created_at
FROM complete_tests
WHERE created_at >= '2014-01-01' AND created_at <= '2015-01-01'
```
However, instead of typing out full specifications of date ranges every time, there are other functions that interact well with dates too. For instance:
```sql
SELECT dog_guid, updated_at
FROM reviews
WHERE YEAR(created_at) = 2014 -- selects entries created in 2014
```
In that vein, two similar functions are **DAY** and **MONTH**, which let you extract all rows created on a specified day or month.
```sql
SELECT dog_guid, created_at
FROM complete_tests
WHERE DAY(created_at) > 15 -- day of month: 1 to 31

SELECT dog_guid, created_at
FROM complete_tests
WHERE MONTH(created_at) = 12 -- month of year: Dec
```
**DAYNAME** is a function that will select data from only a single day of the week. This example selects all IDs created on Tuesday:
```sql
SELECT dog_guid, created_at
FROM complete_tests
WHERE DAYNAME(created_at) = "Tuesday" -- dayname here
```
You have to use a different set of functions than you would use for regular numerical data to add or subtract time from any values in these datetime formats. You would use the **TIMESTAMPDIFF** or **DATEDIFF** function.
```sql
SELECT user_guid, TIMESTAMPDIFF(MINUTE, start_time, end_time)
FROM exam_answers
WHERE TIMESTAMPDIFF(MINUTE, start_time, end_time) < 0;

SELECT user_guid, TIMESTAMPDIFF(HOUR, start_time, end_time)
FROM exam_answers
WHERE TIMESTAMPDIFF(HOUR, start_time, end_time) > 1;

SELECT user_guid, TIMESTAMPDIFF(SECOND, start_time, end_time)
FROM exam_answers
WHERE TIMESTAMPDIFF(SECOND, start_time, end_time) > 60;
```
# SUBSETS: IN, LIKE
The IN operator allows you to specify multiple values in a WHERE clause. Each of these values must be separated by a comma from the other values, and the entire list of values should be enclosed in parentheses.
```sql
SELECT dog_guid, breed
FROM dogs
WHERE breed IN ('retriever', 'poodle');

SELECT * -- this means select all columns
FROM users
WHERE state NOT IN ('NC','NY');
```
The **LIKE** operator allows you to specify a pattern that the textual data you query has to match.
For example, if you wanted to look at all the data from breeds whose names started with "s", you could query:
```sql
SELECT dog_guid, breed
FROM dogs
WHERE breed LIKE ("s%");
```
In this syntax, the percent sign indicates a wild card. Wild cards represent unlimited numbers of missing letters. This is how the placement of the percent sign would affect the results of the query:

1. WHERE breed LIKE ("s%") = the breed must start with "s", but can have any number of letters after the "s"
2. WHERE breed LIKE ("%s") = the breed must end with "s", but can have any number of letters before the "s"
3. WHERE breed LIKE ("%s%") = the breed must contain an "s" somewhere in its name, but can have any number of letters before or after the "s"

# IS, IS NOT, NULL
To select only the rows that have NON-NULL data you could query:
```sql
SELECT user_guid
FROM users
WHERE free_start_user IS NOT NULL;
```
To select only the rows that have null data, so that you can examine whether these rows share something else in common, you could query:
```sql
SELECT user_guid
FROM users
WHERE free_start_user IS NULL;
```
You will see that ISNULL is a logical function that returns a 1 for every row that has a NULL value in the specified column, and a 0 for everything else. With it, we can get the total number of NULL values in any column. Here's what that query would look like:
```sql
SELECT SUM(ISNULL(breed)) -- counts dogs with breed = NULL
FROM dogs
```
More complicated example: Printing the number of dog IDs for each breed group and gender, where there are at least 1,000 dogs in each breed group. Note the useful NULL function.
```sql
SELECT COUNT(dog_guid) AS num_dogs, gender, breed_group
FROM dogs
WHERE breed_group IS NOT NULL AND breed_group <> ''
GROUP BY gender, breed_group
HAVING COUNT(breed_group) > 1000
ORDER BY COUNT(dog_guid) DESC;
```
Can you guess what the other functions mean? If you can't, we'll learn about them next so don't stress about it.

# AS / REPLACE / REMOVE

If you wanted to **rename** the time stamp field of the completed_tests table from "created_at" to "time_stamp" in your output, you could take advantage of the **AS** clause and execute the following query:
```sql
SELECT dog_guid, created_at AS time_stamp
FROM complete_tests;
```
Note that if you use an alias that includes a space, the full alias MUST be surrounded in **quotation marks**:
```sql
SELECT dog_guid, created_at AS 'time stamp'
FROM complete_tests;
```
You could also make an alias for a table, and just about everything:
```sql
SELECT dog_guid, created_at AS 'time stamp'
FROM complete_tests AS tests

SELECT user_guid, (median_ITI_minutes * 60) AS 'Median Sec'
FROM dogs;
```
It is possible to replace unwanted stuff too, or remove them. For example, you can **delete** unwanted leading characters (here, dashes) from every value with the **TRIM** function:
```sql
SELECT breed, TRIM(LEADING '-' FROM breed) AS breed_fixed
FROM dogs;
```
Or, you could **replace** them instead with blanks, or any other item. The syntax for **REPLACE( )** is

```sql
-- REPLACE(variable, replace FOR, replace WITH)

SELECT breed, REPLACE(breed, '-', '') AS breed_fixed
FROM dogs;
```
One last way to edit output is to simply **WRITE** your own stuff using **CONCAT**. The syntax for concat is to lump everything together, separated by commas, like this: ['STRING 1', 'STRING 2' ...
]
```sql
SELECT breed,
CONCAT('This dog is a ', breed, ' dog.') AS new_statement
FROM dogs
ORDER BY breed
```
# DISTINCT, COUNT, ORDER BY

When the DISTINCT clause is used with multiple columns in a SELECT statement, the combination of all the columns together is used to determine the uniqueness of a row in a result set. Note that the distinct values returned include NULL too.
```sql
SELECT DISTINCT breed
FROM dogs; -- distinct dog breeds

SELECT DISTINCT state, city
FROM users; -- distinct combo of state AND city
```
If you wanted the breeds of dogs in the dog table sorted in alphabetical order, you could query this using the **ORDER BY** function:
```sql
SELECT DISTINCT breed
FROM dogs
ORDER BY breed ASC;
```
To sort the output in descending order as well:
```sql
SELECT DISTINCT breed
FROM dogs
ORDER BY breed DESC;
```
Note: When applied to numerical rather than alphabetical data, ORDER BY gives ascending order by default.
```sql
SELECT DISTINCT user_guid, state, membership_type
FROM users
WHERE country="US" AND state IS NOT NULL AND membership_type IS NOT NULL
ORDER BY state ASC, membership_type ASC
```
##### Important Note:

COUNT and DISTINCT cannot be combined as separate output columns, like this:
```sql
SELECT count (apples), distinct pears
FROM fruit
```
Because count = only 1 row of output (the total for that variable), while pears = many pear types. However, they can be used this way, because this will produce the number of distinct apple types, *grouped by* each country, so that each unique country will only have 1 number attached to it.
```sql
SELECT COUNT(DISTINCT apples), country
FROM fruit
GROUP BY country; -- we will learn group by next
```
Lastly, remember that DISTINCT keeps NULL as one of its values, but COUNT(column) does not count NULLs at all. So, it is good practice to add IS NOT NULL or != '' conditions as much as possible when using COUNT.

# How to Export your Query Results to a Text File
You can tell MySQL to put the results of a query into a variable, and then use Python code to format the data in the variable as a CSV file (comma separated value file, a .CSV file) that can be downloaded. When you use this strategy, all of the results of a query will be saved into the variable, not just the first 1000 rows as displayed in Jupyter.

To tell MySQL to put the results of a query into a variable, use the following syntax:
```sql
variable_name_of_your_choice = %sql [your full query goes here, but don't include square brackets];

breed_list = %sql SELECT DISTINCT breed FROM dogs ORDER BY breed;
num_dogs = %sql SELECT COUNT(DISTINCT dog_guid) FROM dogs;
```
Once your variable is created, tell Jupyter to format the variable as a csv file using the following syntax:
```sql
variable_name.csv('the_output_name_you_want.csv')
breed_list.csv('breed_list.csv')
num_dogs.csv('unrelated.csv')
```

--------------------------------------------------------------------------------
/SQLDukeWeek3 - Inner & Outer Joins.md:
--------------------------------------------------------------------------------

# Week 3 - JOINS

- Joins are based on cartesian products
- (x, a) (x, b) (y, a) (y, b)
- JOIN works by retrieving data only where the cartesian products match

### Inner Joins

- Only items with exactly matching primary keys from both tables will be put into the result table
- NULL values can't be matched
- Order of which table joins to which table *does not
matter*

### Left / Right Outer Join

- Here, the ORDER of the join matters.
- Example: LEFT outer join vs RIGHT outer join
- Left (or first) table will have ALL its rows included, even null values
- Right (or second) table's items will only be included if they match the chosen key used in the left table
- Rows from the left table that don't have a matching ID in the right will instead get a NULL value
- Can switch to RIGHT OUTER JOIN to reverse the order of the tables (see example below)
- Basically the same as reversing the positions of the join. So left and right don't really matter; order matters more.

### Full Outer Join

- ALL rows of both tables are included
- Any row that doesn't have a matching partner is given a NULL value.
- Rarely used (why would anyone want this? jkjk)
- Note: Not all DBs support full outer joins. MySQL doesn't, but PostgreSQL does. Can test this out using the Teradata db!

### Many to Many Relationships

- Recall example 2.3 of Week 1? While building the "fashion shop" relational schema, there was a many-to-many relationship, with another table between two big entities as a linking/bridge table.
- This table had only the foreign keys + primary keys of the two tables in various combinations
- **For many to many, left join 1 & 2 first, then left join the results again to 3.**

Caution:
- Stick to left outer joins! Right/inner joins would mess up the data by deleting NULL values, since the right table now acts as the 'primary key'. (example below)
- Beware of duplicates too. Joining three tables with a single duplicate (2 rows) across them can result in 6 rows. This could quickly get out of hand in a big db. (example below)

### Notes to self before starting (!!!)
- Where possible, clean data before you start
- Try to be aware of table relationships, who has null data, subsets, duplicates etc
- When doing joins, count the number of unique IDs / keys in each table you are joining first. This helps you see which is larger or smaller, and gives a reasonable expectation of what the final result should look like.
- On handling errors:
  - Be aware of duplicates and NULL values (sometimes they exist despite the rules)
  - Null values can exist even in the primary key column when the database is young, and the company is desperate for data so they accept any data, even incomplete sets
  - **It is NOT your job to clean this up, or restructure their db -- instead, just try to make as much business value out of the items you have.**
- Start with small data and tables (<10 rows), and see if they output what you are expecting
- Double-check at the beginning! Otherwise you won't even know your results are incorrect

### Reminder: (Proper) Technical Terms

- Table = Relation
- Row = Tuple
- Column/Field = Attribute

# INNER JOINS

Let's start with an inner join.

- SQL needs to be told which IDs overlap
- SQL needs to be told which is left/right

We will use *equijoin* syntax for the first few examples because it's not as confusing. We will switch to traditional syntax for outer joins later.

Example: SIMPLE INNER JOIN FOR 2 TABLES
**Find the total number of reviews, and the average rating given, for EACH dog.
Combine information from the Dogs table and the Reviews table:**
69 | ```sql
70 | SELECT
71 | d.dog_guid AS DogID,
72 | AVG(r.rating) AS AvgRating,
73 | COUNT(r.rating) AS NumRatings
74 | FROM dogs d, reviews r -- the single letters after the table names are aliases
75 | WHERE d.dog_guid=r.dog_guid
76 | AND d.user_guid=r.user_guid -- joining on both keys excludes any unmatched IDs
77 | GROUP BY DogID
78 | ORDER BY AvgRating DESC;
79 | ```
80 | Example: INNER JOIN 2 TABLES, CONDITIONAL
81 | **Extract the user_guid, dog_guid, breed, breed_type, and breed_group for all animals who completed the "Yawn Warm-up" game. Join on dog_guid only.**
82 |
83 | ```sql
84 | SELECT
85 | c.user_guid,
86 | c.dog_guid,
87 | d.breed,
88 | d.breed_type,
89 | d.breed_group
90 | FROM complete_tests c, dogs d
91 | WHERE c.dog_guid=d.dog_guid
92 | AND c.test_name = "Yawn Warm-up";
93 | ```
94 | Example: INNER JOIN 3 TABLES
95 | **Join 3 tables to extract the user ID, user's state of residence, user's zip code, dog ID, breed, breed_type, and breed_group for all animals who completed the "Yawn Warm-up" game.**
96 |
97 | ```sql
98 | SELECT
99 | d.user_guid AS UserID,
100 | d.dog_guid AS DogID,
101 | d.breed,
102 | d.breed_type,
103 | d.breed_group,
104 | u.state,
105 | u.zip
106 | FROM dogs d, complete_tests c, users u -- inner join so order doesn't matter
107 | WHERE d.dog_guid = c.dog_guid
108 | AND d.user_guid = u.user_guid
109 | AND c.test_name = "Yawn Warm-up";
110 | ```
111 | Notes: Here, I avoided using c.user_guid to join the tables because the user GUID in complete_tests is null. I wouldn't have known this if I hadn't checked the tables first. So, always test in small batches! And be prepared to deal with missing data.
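The note above is worth turning into a habit. Before writing any join, a quick pair of counts shows how many distinct keys each side has — a sketch against the same Dognition tables used above (the exact counts will depend on your copy of the database):

```sql
-- Count unique join keys on each side before joining.
-- A big mismatch warns you to expect NULLs or duplicate rows.
SELECT COUNT(DISTINCT dog_guid) FROM dogs;
SELECT COUNT(DISTINCT dog_guid) FROM complete_tests;
```

If the two counts differ greatly, decide up front whether you want an inner join (drop the unmatched keys) or an outer join (keep them as NULLs).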
112 |
113 | Example: INNER JOIN 3 TABLES
114 | **How would you extract the user ID, membership type, and dog ID of all the golden retrievers who completed at least 1 Dognition test (you should get 711 rows)?**
115 | ```sql
116 | SELECT DISTINCT
117 | u.user_guid,
118 | u.membership_type,
119 | d.dog_guid,
120 | d.breed
121 | FROM complete_tests c, dogs d, users u
122 | WHERE c.dog_guid = d.dog_guid
123 | AND d.user_guid = u.user_guid
124 | AND d.breed = 'Golden Retriever';
125 | ```
126 | Example: INNER JOIN 2 TABLES
127 | **How many unique Golden Retrievers who live in North Carolina are there in the Dognition database (you should get 30)?**
128 | ```sql
129 | SELECT DISTINCT
130 | u.user_guid,
131 | d.dog_guid,
132 | d.breed
133 | FROM dogs d, users u
134 | WHERE d.user_guid = u.user_guid
135 | AND d.breed = 'Golden Retriever'
136 | AND u.state = 'NC';
137 | ```
138 |
139 | ### NOTE: USING TRADITIONAL SYNTAX
140 |
141 | The equijoin syntax is accepted with inner joins, but not with full/left/right outer joins. Instead, the (traditional) syntax for those looks something like this (below).
142 |
143 | Why do we still have the traditional version when it is longer? Because:
144 |
145 | - With the join conditions moved into ON instead of = signs in WHERE, the WHERE clause is saved for other conditions
146 | - Unless otherwise specified, a join is understood as an INNER join
147 | - If inner join, order doesn't matter
148 | - If outer join, the table written before the JOIN keyword is the LEFT table, and the one after it is the RIGHT table
149 |
150 | Re-writing an earlier example using traditional syntax:
151 | ```sql
152 | SELECT
153 | d.user_guid AS UserID,
154 | d.dog_guid AS DogID,
155 | d.breed,
156 | d.breed_type,
157 | d.breed_group
158 | FROM dogs d JOIN complete_tests c -- look here
159 | ON c.dog_guid=d.dog_guid -- look here
160 | WHERE test_name='Yawn Warm-up';
161 | ```
162 | From now on, we will be using traditional syntax for OUTER JOINS.
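For extra practice, the three-table inner join from earlier can also be rewritten in traditional syntax — a sketch that should be equivalent to the equijoin version above:

```sql
SELECT
d.user_guid AS UserID,
d.dog_guid AS DogID,
d.breed,
d.breed_type,
d.breed_group,
u.state,
u.zip
FROM dogs d
JOIN complete_tests c -- JOIN with no modifier is an INNER join
ON d.dog_guid = c.dog_guid
JOIN users u
ON d.user_guid = u.user_guid
WHERE c.test_name = 'Yawn Warm-up';
```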
163 |
164 | # Outer Joins
165 |
166 | Unfortunately, Duke only gave two examples on outer joins -__- So I included one more from the internet.
167 |
168 | Example: LEFT JOIN 2 TABLES
169 | **Find the number of complete tests each unique dog (from the dogs table) has completed. Put the dog with the most tests completed first.**
170 | ```sql
171 | SELECT
172 | d.dog_guid AS dDogID,
173 | c.dog_guid AS cDogID,
174 | COUNT(c.test_name) AS 'Tests Completed'
175 | FROM dogs d
176 | LEFT JOIN complete_tests c
177 | ON d.dog_guid = c.dog_guid
178 | WHERE d.dog_guid IS NOT NULL
179 | GROUP BY d.dog_guid
180 | ORDER BY COUNT(c.dog_guid) DESC;
181 | ```
182 | Example: LEFT JOIN 2 TABLES + COUNT
183 | **Create a list of all the unique dog_guids that are contained in the site_activities table, but not the dogs table, and how many times each one is entered. Remove NULL values.**
184 | ```sql
185 | SELECT
186 | DISTINCT sa.dog_guid,
187 | d.dog_guid,
188 | COUNT(sa.dog_guid)
189 | FROM site_activities sa
190 | LEFT JOIN dogs d
191 | ON sa.dog_guid = d.dog_guid -- join on dog_guid, since that is the ID we are comparing
192 | WHERE d.dog_guid IS NULL
193 | AND sa.dog_guid IS NOT NULL
194 | GROUP BY sa.dog_guid;
195 | ```
196 | Example: LEFT JOIN 3 TABLES
197 | **Join 3 tables to combine the bill number, bill amount, item name, and the company each item is from.** (PS: Got this example from the internet)
198 | ```sql
199 | SELECT
200 | a.bill_no,
201 | a.bill_amt,
202 | b.item_name,
203 | c.company_name,
204 | c.company_city
205 | FROM counter_sale a
206 | LEFT JOIN foods b
207 | ON a.item_ID = b.item_ID
208 | LEFT JOIN company c
209 | ON b.company_ID = c.company_ID
210 | WHERE c.company_name IS NOT NULL
211 | ORDER BY a.bill_no;
212 | ```
213 | --------------------------------------------------------------------------------
/SQLDukeWeek3.md:
--------------------------------------------------------------------------------
1 | # Week 3
2 | Continuing all functions learnt in week 2, we will learn the final 3 this week.
Notes for this week will emphasize applications of functions more than explanations. Week 3 also includes notes on Inner Joins & Outer Joins (see next markdown file).
3 |
4 | # COUNT, SUM
5 | Count is, well, count.
6 | ```sql
7 | SELECT COUNT(breed)
8 | FROM dogs;
9 |
10 | SELECT COUNT(DISTINCT breed)
11 | FROM dogs;
12 |
13 | SELECT COUNT(DISTINCT user_guid)
14 | FROM complete_tests
15 | WHERE created_at > __
16 |
17 | SELECT state, zip, COUNT(DISTINCT user_guid)
18 | FROM users
19 | WHERE country = "US"
20 | GROUP BY state, zip
21 | HAVING COUNT(DISTINCT user_guid) > 5
22 | ORDER BY state ASC;
23 | ```
24 | Note: When a column is included in a count function, null values in that column are ignored in the count. But when an asterisk is used in a count function, nulls are included in the count.
25 |
26 | Next, SUM finds the total across all rows matching a given criteria. It only works for numerical values though, not for strings, and not for date-time types.
27 | ```sql
28 | SELECT SUM(ISNULL(exclude))
29 | FROM dogs;
30 | -- Result: 34,025
31 | ```
32 | Note: SUM and COUNT behave differently here. SUM(ISNULL(exclude)) adds up the 1s returned by ISNULL, so it counts only the rows where exclude IS NULL. COUNT(ISNULL(exclude)) counts every row, because ISNULL always returns 0 or 1 and never NULL.
33 | ```sql
34 | SELECT COUNT(ISNULL(exclude))
35 | FROM dogs;
36 | -- Result: 35,035
37 | ```
38 |
39 | # AVERAGE, MIN, MAX
40 | AVG, MIN, MAX are mathematical operators that work with numerical data. They can be used together or used separately. The minimum and maximum amounts also work on dates -- via picking the earliest or latest date. It's pretty basic so just read the examples to learn their syntax.
41 |
42 | ```sql
43 | SELECT test_name,
44 | AVG(rating) AS AVG_rating,
45 | MIN(rating) AS MIN_rating,
46 | MAX(rating) AS MAX_rating
47 | FROM reviews
48 | WHERE test_name = "Eye Contact Game";
49 |
50 | SELECT AVG(TIMESTAMPDIFF(minute, start_time, end_time)) AS Duration,
51 | test_name AS Test
52 | FROM exam_answers GROUP BY test_name; -- group by the non-aggregated column
53 |
54 | SELECT AVG(TIMESTAMPDIFF(hour, start_time, end_time)) AS Avg_duration,
55 | MIN(TIMESTAMPDIFF(hour, start_time, end_time)) AS min_time,
56 | MAX(TIMESTAMPDIFF(hour, start_time, end_time)) AS max_time,
57 | test_name AS Test
58 | FROM exam_answers
59 | WHERE TIMESTAMPDIFF(minute, start_time, end_time)>0 GROUP BY test_name;
60 | ```
61 | # GROUP BY
62 |
63 | GROUP BY aggregates all data for other columns based on the column selected to be grouped by. For instance, this groups the data by MONTH:
64 | ```SQL
65 | SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed
66 | FROM complete_tests
67 | GROUP BY Month;
68 | ```
69 | Note: Although this correctly groups data by month, **this example gives an incorrect test_name answer**. This is because there is only 1 row allocated for each Month, but more than one type of test done per month. In this situation, MySQL will populate it with a randomly chosen Test done in that month, while other DBs may throw an error, but both are incorrect. Overall, there is no way to present an aggregated and non-aggregated dataset in the same table.
70 |
71 | **Solution**: We can either group by all non-aggregated variables too (B), or further aggregate ALL variables (A).
72 |
73 | (A) This gives the number of test types and tests completed per month.
74 | ```SQL
75 | SELECT COUNT(DISTINCT test_name), MONTH(created_at) AS Month,
76 | COUNT(created_at) AS Num_Completed
77 | FROM complete_tests
78 | GROUP BY Month;
79 | ```
80 | (B) This gives number of tests completed per test type AND month.
81 | ```sql
82 | SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed
83 | FROM complete_tests
84 | GROUP BY Month, test_name;
85 | ```
86 | Note: Not all databases accept aliases in the GROUP BY clause (eg. MONTH(created_at) stored as Month). If they don't, just retype the full expression in the GROUP BY line.
87 |
88 | # HAVING
89 | The HAVING command is similar to WHERE, in that it adds another layer of specificity to your query. However, the difference is that *HAVING filters on aggregated (grouped) values*, while WHERE filters individual rows before any grouping happens.
90 |
91 | **Example using WHERE:** Print the test name, the month it was completed in, and the number of tests done that month -- for Nov & Dec ONLY.
92 | ```sql
93 | SELECT test_name,
94 | MONTH(created_at) AS Month_Name,
95 | COUNT(created_at) AS Num_Completed_Tests
96 | FROM complete_tests
97 | WHERE MONTH(created_at)=11 OR MONTH(created_at)=12
98 | GROUP BY test_name, Month_Name
99 | ORDER BY Num_Completed_Tests DESC;
100 | ```
101 | **Example using HAVING:** Print the test name, the month it was completed in, and the number of tests done that month -- still for Nov & Dec, but only for combinations WITH at least 20 tests done that month.
102 | ```sql
103 | SELECT test_name,
104 | MONTH(created_at) AS Month,
105 | COUNT(created_at) AS Num_Completed_Tests
106 | FROM complete_tests
107 | WHERE MONTH(created_at)=11 OR MONTH(created_at)=12
108 | GROUP BY 1, 2
109 | HAVING COUNT(created_at)>=20
110 | ORDER BY 3 DESC;
111 | ```
112 | #### More Examples
113 | Prints the average time taken per test, in minutes. Excludes rows where a test took more than 6,000 minutes, or less than 0 seconds.
114 | ```sql
115 | SELECT test_name,
116 | AVG(TIMESTAMPDIFF(minute, start_time, end_time)) AS 'Time (Min)',
117 | subcategory_name
118 | FROM exam_answers
119 | WHERE TIMESTAMPDIFF(minute, start_time, end_time)<6000
120 | AND TIMESTAMPDIFF(second, start_time, end_time)>0
121 | GROUP BY test_name, subcategory_name; -- group by both non-aggregated columns
122 | ```
123 | Print the number of users in each combination of state & zip -- where there are at least 5 users in that combination. Order ascending by state, and descending by number of users.
124 | ```sql
125 | SELECT state, zip,
126 | COUNT(DISTINCT user_guid) AS UserID
127 | FROM users
128 | WHERE state != ""
129 | AND state IS NOT NULL
130 | AND zip IS NOT NULL
131 | AND zip != ""
132 | GROUP BY state, zip
133 | HAVING UserID >= 5
134 | ORDER BY state ASC, UserID DESC;
135 | ```
136 | Revise the query you wrote in Question 2 so that it (1) excludes the NULL and empty string entries in the breed_group field, and (2) excludes any groups that don't have at least 1,000 distinct Dog_Guids in them.
137 | ```sql
138 | SELECT COUNT(DISTINCT dog_guid) AS num_dogs, gender, breed_group
139 | FROM dogs
140 | WHERE breed_group IS NOT NULL AND breed_group != ''
141 | GROUP BY 2, 3 -- group by both non-aggregated columns
142 | HAVING num_dogs > 1000
143 | ORDER BY 1 DESC;
144 | ```
145 | # Conclusion
146 | These functions sum up the last of all the basic commands. Last week, you learnt SELECT, FROM, WHERE, ORDER BY. This week, you learnt HAVING, GROUP BY, as well as OPERATORS, SUM, AVG, DISTINCT, COUNT. These let you add a greater layer of specificity to your queries.
147 |
148 | *See notes in next section for inner and outer joins.*
149 | --------------------------------------------------------------------------------
/SQLDukeWeek4.md:
--------------------------------------------------------------------------------
1 | # Week 4 - Subqueries & Operators
2 |
3 | Subqueries, which are also sometimes called inner queries or nested queries, are queries that are embedded within the context of another query.
They are useful for complex queries, and also for testing smaller parts of the query to make sure each gives you what you want before assembling the whole thing. Some basic rules are:
4 |
5 | - ORDER BY phrases cannot be used in subqueries (although ORDER BY phrases can still be used in outer queries that contain subqueries)
6 | - Subqueries in SELECT or WHERE clauses can output no more than 1 row, unless they are combined with operators that are explicitly designed to handle multiple values, such as the IN operator.
7 |
8 | Lastly, when they are used in FROM clauses, they create what are called **derived tables**. This comes into play later when you want to optimise your query to run faster. Having smaller derived tables helps the query be answered quicker because the db does not need to hold such a large derived table in memory. But for now, focus on writing the damn thing right first.
9 |
10 | ### #1: SUBQUERIES FOR ON-THE-FLY CALCULATIONS
11 |
12 | Example: Find all details of test records whose duration is greater than the average duration across the community.
13 | ```sql
14 | SELECT *,
15 | TIMESTAMPDIFF(minute,start_time,end_time) AS Duration
16 | FROM exam_answers
17 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) >
18 | (SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time))
19 | FROM exam_answers
20 | WHERE TIMESTAMPDIFF(minute,start_time,end_time)>0);
21 | ```
22 |
23 | Example: Find all details of test records whose duration is greater than the community's average duration for the "Yawn Warm-Up" game.
24 | ```sql
25 | SELECT *,
26 | TIMESTAMPDIFF(minute,start_time,end_time) AS Duration
27 | FROM exam_answers
28 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) >
29 | (SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time))
30 | FROM exam_answers
31 | WHERE TIMESTAMPDIFF(minute,start_time,end_time)>0
32 | AND test_name = 'Yawn Warm-Up');
33 | ```
34 | ### #2: SUBQUERIES FOR TESTING MEMBERSHIP
35 | Subqueries can be used to test membership of items in one group against another, by calling the test group in the subquery. We can use EXISTS / NOT EXISTS for this specifically. Some rules:
36 |
37 | - EXISTS and NOT EXISTS can ONLY be used in subqueries
38 | - They are similar to the IN and NOT IN operators, but those can be used in any query
39 | - Cannot be preceded by a column name or any other expression
40 | - Returns TRUE/FALSE logical statements
41 | - Since the only concern for the subquery is whether it is TRUE/FALSE, can use SELECT * in the subquery
42 |
43 | Example: Retrieve a list of all the users in the users table who were also in the dogs table, using the EXISTS operator.
44 | ```sql
45 | SELECT DISTINCT u.user_guid AS uUserID
46 | FROM users u
47 | WHERE EXISTS
48 | (SELECT *
49 | FROM dogs d
50 | WHERE u.user_guid = d.user_guid);
51 | ```
52 | Example: Find the stores that exist in one or more cities.
53 | ```sql
54 | SELECT DISTINCT store_type
55 | FROM stores
56 | WHERE EXISTS (
57 | SELECT *
58 | FROM cities_stores
59 | WHERE cities_stores.store_type = stores.store_type);
60 | ```
61 | ### #3: SUBQUERIES FOR LOGIC WITH DERIVED TABLES
62 | Subqueries can be more elegant than joins, especially when they let us select/exclude rows more efficiently than a lengthy join command. In addition, we can fix the problem of duplicates immediately, instead of having to patch it with a GROUP BY clause afterwards.
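Here is a minimal before/after sketch of that duplicate fix, using the same users and dogs tables (and assuming a user_guid can appear more than once in users):

```sql
-- Deduplicate INSIDE the subquery, so the join cannot
-- multiply rows for users listed twice in the users table.
-- Without the derived table, each duplicated user row would
-- inflate its dog counts, and we'd have to patch it afterwards.
SELECT clean.user_guid, COUNT(d.dog_guid) AS NumDogs
FROM (SELECT DISTINCT user_guid FROM users) AS clean
LEFT JOIN dogs d
ON clean.user_guid = d.user_guid
GROUP BY clean.user_guid;
```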
63 |
64 | ##### Rules for subqueries
65 |
66 | - We are required to give an alias to any derived table we create in subqueries within FROM statements.
67 | - We need to use this alias every time we want to execute a function that uses the derived table.
68 | - Aliases used within subqueries CAN refer to tables OUTSIDE of the subqueries. However, outer queries cannot refer to aliases created within subqueries unless those aliases are explicitly part of the subquery output.
69 | - If using LIMIT with derived tables, put the LIMIT inside the subquery that builds the derived table. If you put it in the outermost query, the db will still have to hold the huge inner derived table in memory, which will make your query slow.
70 |
71 | Example: We want a list of each dog a user in the users table owns, with its accompanying breed information whenever possible.
72 | ```sql
73 | SELECT
74 | clean.user_guid AS uUserID,
75 | d.user_guid AS dUserID,
76 | count(*) AS NumDogs
77 | FROM
78 | (SELECT DISTINCT u.user_guid
79 | FROM users u)
80 | AS clean
81 | LEFT JOIN dogs d
82 | ON clean.user_guid=d.user_guid
83 | GROUP BY clean.user_guid
84 | ORDER BY NumDogs DESC;
85 | ```
86 | The query we just wrote extracts the distinct user_guids from the users table first, and then left joins that reduced subset of user_guids on the dogs table. As mentioned at the beginning of the lesson, since the subquery is in the FROM statement, it actually creates a temporary table, called a derived table, that is then incorporated into the rest of the query.
87 |
88 | Example: Write a query to retrieve a full list of all the DogIDs a user in the users table owns. Add dog breed and dog weight to the columns that will be included in the final output of your query. In addition, use a HAVING clause to include only UserIDs who would have more than 10 rows in the output of the left join.
89 | ```sql
90 | SELECT
91 | APPLES.user_guid AS uUserID,
92 | d.user_guid AS dUserID,
93 | d.breed,
94 | d.weight,
95 | count(*) AS numrows
96 | FROM
97 | (SELECT DISTINCT u.user_guid
98 | FROM users u)
99 | AS APPLES
100 | LEFT JOIN dogs d
101 | ON APPLES.user_guid=d.user_guid
102 | GROUP BY APPLES.user_guid
103 | HAVING numrows > 10
104 | ORDER BY numrows DESC;
105 | ```
106 |
107 | # OPERATORS
108 |
109 | * IF
110 | * CASE
111 | * NOT, AND, OR
112 |
113 | ### #1: OPERATORS - IF
114 |
115 | Can segment queries conditionally using IF, especially if the situation has clear true/false conditions. IF statements can also be nested. Note on IF syntax:
116 |
117 | ```
118 | IF(condition, value_if_true, value_if_false)
119 | ```
120 | Example: Count the number of users in America, and outside America. Output 2 columns with the groups America, Not in America, and the count for each. Exclude all null values.
121 |
122 | ```sql
123 | SELECT
124 | IF(cleanedset.country = 'US','In America','Not in America') AS Location,
125 | COUNT(cleanedset.country) AS 'Number of Users'
126 | FROM
127 | (SELECT DISTINCT user_guid, country
128 | FROM users
129 | WHERE user_guid IS NOT NULL
130 | AND country IS NOT NULL)
131 | AS cleanedset
132 | GROUP BY Location;
133 | ```
134 | Example: Sort users by early users and late users. Print the total number of users in each group. Early users = those who signed up before 1 June 2014.
135 | ```sql
136 | SELECT
137 | IF(cleaned_users.first_account<'2014-06-01','early_user','late_user') AS user_type,
138 | COUNT(cleaned_users.first_account)
139 | FROM
140 | (SELECT user_guid,
141 | MIN(created_at) AS first_account
142 | FROM users
143 | GROUP BY user_guid)
144 | AS cleaned_users
145 | GROUP BY user_type;
146 | ```
147 |
148 | #### Nested IF Example
149 | Print all users and their country status.
150 | ```sql
151 | SELECT
152 | IF(cleaned_users.country='US','In US',
153 | IF(cleaned_users.country='N/A','Not Applicable','Outside US'))
154 | AS US_user,
155 | count(cleaned_users.user_guid)
156 | FROM
157 | (SELECT DISTINCT user_guid, country
158 | FROM users
159 | WHERE country IS NOT NULL)
160 | AS cleaned_users
161 | GROUP BY US_user;
162 | ```
163 | Example: For each dog, output its dog ID, breed_type, number of completed tests, and use an IF statement to include an extra column that reads "Pure_Breed" whenever breed_type equals "Pure Breed" and "Not_Pure_Breed" whenever breed_type equals anything else.
164 | ```sql
165 | SELECT DISTINCT
166 | d.dog_guid AS 'Dog ID',
167 | IF(d.breed_type="Pure Breed", 'Pure_Breed', 'Not_Pure_Breed') AS 'Breed Type',
168 | count(c.created_at) AS 'Num Tests Done'
169 | FROM dogs d
170 | LEFT JOIN complete_tests c
171 | ON d.dog_guid = c.dog_guid
172 | WHERE d.dog_guid IS NOT NULL
173 | GROUP BY d.dog_guid
174 | ORDER BY count(c.created_at) DESC
175 | LIMIT 50;
176 | ```
177 | However, you can see this gets inefficient to write as the number of conditions increases. For those cases, it is better to use CASE.
178 |
179 | ### #2: OPERATORS - CASE
180 |
181 | Syntax for CASE:
182 | ```
183 | SELECT
184 | apples,
185 | oranges,
186 | CASE
187 | WHEN ..... (condition) THEN .... (label)
188 | WHEN ..... (condition) THEN .... (label)
189 | ELSE .... (optional default) END AS new_column -- ps: no commas needed within
190 | FROM tablename
191 | ```
192 | Example: Print cases of users based on their country locations.
193 | ```sql
194 | SELECT
195 | CASE
196 | WHEN cleaned_users.country="US" THEN "In US"
197 | WHEN cleaned_users.country="N/A" THEN "Not Applicable"
198 | ELSE "Outside US"
199 | END AS US_user,
200 | count(cleaned_users.user_guid)
201 | FROM
202 | (SELECT DISTINCT user_guid, country
203 | FROM users
204 | WHERE country IS NOT NULL)
205 | AS cleaned_users
206 | GROUP BY US_user
207 | ORDER BY count(cleaned_users.user_guid);
208 | ```
209 | Example: Write a query to label each dog with a weight group, based on the range its weight falls into.
210 | ```sql
211 | SELECT
212 | DISTINCT dog_guid,
213 | breed,
214 | weight,
215 | CASE
216 | WHEN weight<=10 THEN "very small"
217 | WHEN weight>10 AND weight<=30 THEN "small"
218 | WHEN weight>30 AND weight<=50 THEN "medium"
219 | WHEN weight>50 AND weight<=85 THEN "large"
220 | WHEN weight>85 THEN "very large"
221 | END AS Category
222 | FROM dogs
223 | WHERE weight > 0
224 | LIMIT 200;
225 | ```
226 | Example: Binary tree question. Find the root, inner and leaf nodes.
227 | ```sql
228 | SELECT N,
229 | CASE
230 | WHEN P IS NULL THEN "Root" -- capitalisation matters inside the quotes
231 | WHEN N IN (SELECT P FROM BST) THEN "Inner"
232 | ELSE "Leaf"
233 | END
234 | FROM BST
235 | ORDER BY N;
236 | ```
237 | ### #3: OPERATORS - NOT, AND, OR
238 |
239 | These operators can be used to make true/false logic statements. They are evaluated in this order: NOT, then AND, then OR. This means that any NOT statements will be evaluated first, followed by AND, then OR.
240 |
241 | > CASE WHEN "condition 1" OR "condition 2" AND "condition 3"...
242 |
243 | will lead to different results than this expression:
244 |
245 | > CASE WHEN "condition 3" AND "condition 1" OR "condition 2"...
246 |
247 | or this expression:
248 |
249 | > CASE WHEN ("condition 1" OR "condition 2") AND "condition 3"...
250 |
251 | In the first case you will get rows that meet condition 2 and 3, or condition 1.
In the second case you will get rows that meet condition 1 and 3, or condition 2. In the third case, you will get rows that meet condition 1 or 2, and condition 3.
252 |
253 |
--------------------------------------------------------------------------------
/Teradata Cheatsheet.md:
--------------------------------------------------------------------------------
1 | # Teradata Cheatsheet
2 |
3 | This document is a compilation of differences between MySQL and the SQL
4 | dialect Teradata uses with regards to major commands. It was made with
5 | reference to course notes from Duke University's "Managing Big Data with
6 | MySQL" course.
7 |
8 | This document assumes that one is already familiar with
9 | some SQL or MySQL, as it mainly serves to point out the differences between them.
10 |
11 | Date created: 18 March 2017
12 |
13 | ### Set Database
14 |
15 | To select the database, enter ``DATABASE [name];`` into the SQL scratchpad.
16 |
17 | ### Explore Database
18 |
19 | To display the tables and columns in a database:
20 |
21 | ```sql
22 | HELP TABLE [name]
23 |
24 | HELP COLUMN [tablename].[columnname]
25 | ```
26 | *Note: Don't include the brackets when executing the query.*
27 |
28 | ### Primary Keys
29 |
30 | To confirm which are the primary keys of a table:
31 |
32 | ```sql
33 | SHOW TABLE [name];
34 | ```
35 | *Note: Don't include the brackets when executing the query.*
36 |
37 | ### Restricting Query Output
38 |
39 | Teradata uses TOP instead of LIMIT to restrict output.
40 | To select the first 10 rows:
41 |
42 | ```sql
43 | SELECT TOP 10 student_IDs
44 | FROM class_info;
45 | ```
46 |
47 | To select 10 random rows instead:
48 |
49 | ```sql
50 | SELECT student_IDs
51 | FROM class_info
52 | SAMPLE 10;
53 | ```
54 |
55 | To select 10% of all rows instead:
56 |
57 | ```sql
58 | SELECT student_IDs
59 | FROM class_info
60 | SAMPLE .10;
61 | ```
62 | *Note: The last two commands will return a different selection of rows each time.*
63 |
64 | ### Aggregation & Group By
65 |
66 | Any non-aggregate column in the ``SELECT`` list or ``HAVING`` list of a query with
67 | a ``GROUP BY`` clause must also be listed in the ``GROUP BY`` clause. Unlike MySQL,
68 | Teradata will not pick a random selection to populate a field that cannot be aggregated.
69 |
70 | This will not run:
71 | ```sql
72 | SELECT shopname, clothes_ID, cost
73 | FROM shop
74 | GROUP BY shopname
75 | ```
76 | However, this will run:
77 | ```sql
78 | SELECT shopname, clothes_ID, avg(cost) -- find average to aggregate this column
79 | FROM shop
80 | GROUP BY shopname, clothes_ID -- group by non-aggregates
81 | ```
82 | ### Operators
83 |
84 | Both Teradata and MySQL accept the symbols ``<>`` for *not equals to*, but
85 | Teradata does not accept ``!=``.
86 |
87 | ### String selection
88 |
89 | Teradata only accepts **single quotation marks** around strings.
90 |
91 | ### Date Time Format
92 |
93 | Teradata will output dates in the format ``YY-MM-DD``. However, it expects dates
94 | to be entered in ``YYYY-MM-DD``.
95 |
96 | ``TIMESTAMPDIFF(hour/minute/second, var1, var2)``
97 | calculates the difference between 2 variables in the specified unit.
98 |
99 | ``DAYOFWEEK(datevar)`` returns the day of the week as an
100 | integer from 1 - 7, where 1 = Sunday, 2 = Monday, etc.
101 |
102 | ### Extract Date
103 |
104 | The command for extracting parts of the datestamp returns the day/month/year as
105 | their respective numerical values.
106 |
107 | * ``EXTRACT(day FROM variable)`` returns the day of the month (1-31).
108 | * ``EXTRACT(month FROM variable)`` returns the month (1-12).
109 | * ``EXTRACT(year FROM variable)`` returns the year (``YYYY``).
110 |
111 | This can be used in such a manner to return a count of the number of days in each year and month:
112 |
113 | ```sql
114 | SELECT
115 | EXTRACT(month FROM datelog) AS month_num,
116 | EXTRACT(year FROM datelog) AS year_num,
117 | COUNT(DISTINCT EXTRACT(day FROM datelog)) AS days_per_month
118 | FROM catalog
119 | GROUP BY month_num, year_num;
120 | ```
121 |
122 | ### IF ELSE
123 |
124 | Teradata does *not* accept ``IF`` functions. However, we can replace them with ``CASE``.
125 |
126 |
127 |
--------------------------------------------------------------------------------
/Week2-Dillards.md:
--------------------------------------------------------------------------------
1 | # Week 2 - Dillard's Database Exercises
2 |
3 | Date created: 14 March 2017
4 |
5 | This is the COMPLETE answer key (including explanations where necessary)
6 | for Week 2 of the **"Managing Big Data with MySQL"** course by Duke University:
7 | 'Queries to Extract Data from Single Tables'.
8 |
9 | I wrote this answer key as no official answers have been released online.
10 | These answers reflect my own work and are accurate to the best of my knowledge.
11 | I will update them if the professors ever release an "official" answer key.
12 |
13 | **Update**: These answers are based on the original UA_Dillards dataset (not UA_Dillards1,
14 | nor UA_Dillards_2016). This means I am using the table ``SKSTINFO`` and not
15 | ``SKSTINFO_FIX``, which is the newer version.
16 |
17 | Meanwhile, let's start.
18 |
19 | # Answers
20 |
21 | To start, enter ``DATABASE ua_dillards;`` into the Teradata SQL scratchpad.
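Before running the exercises, it can help to confirm you are pointed at the right database. A quick sanity check (a sketch — this assumes your account has permission to run Teradata's HELP DATABASE statement):

```sql
HELP DATABASE ua_dillards; -- lists the objects (tables) the database contains
```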
22 |
23 | ### Exercise 1
24 |
25 | **Use HELP and SHOW to confirm the relational schema provided to us for the
26 | Dillard's dataset shows the correct column names and primary keys for each table.**
27 |
28 | ```sql
29 | HELP TABLE strinfo
30 | HELP TABLE skstinfo
31 | HELP TABLE skuinfo
32 | HELP TABLE trnsact
33 | HELP TABLE deptinfo
34 | HELP TABLE store_msa
35 | ```
36 |
37 | Note: *The course's notes contain an error.* It suggests:
38 |
39 | > "To get information about a single column in a table, you could write:
40 | >
41 | > HELP COLUMN [name of column goes here; don't include the
42 | > brackets when executing the query]"
43 |
44 | This is incorrect. You need to specify **which table** the column is from too,
45 | as some column names are common to more than one table. The correct syntax should be
46 | ``HELP COLUMN tablename.columnname``. Thus, to find out more information
47 | about a single column, you should do this:
48 |
49 | ```sql
50 | HELP COLUMN skstinfo.sku
51 | HELP COLUMN skuinfo.sku
52 | HELP COLUMN trnsact.sku
53 | ...
54 | etc
55 | ```
56 |
57 | Lastly, to confirm which is the primary key of each table, do this:
58 | ``SHOW TABLE [tablename here -- but don't include the
59 | brackets when executing the query];``. When applied, it looks like this:
60 |
61 | ```sql
62 | SHOW TABLE strinfo
63 | SHOW TABLE skstinfo
64 | SHOW TABLE skuinfo
65 | SHOW TABLE trnsact
66 | SHOW TABLE deptinfo
67 | SHOW TABLE store_msa
68 | ```
69 |
70 | ### Exercise 2
71 |
72 | **Look at examples of data from each of the tables. Pay particular attention to
73 | the ``skuinfo`` table.**
74 |
75 | Things to note:
76 | - There are two types of transactions: purchases (P) and returns (R). We will need to
77 | make sure we specify which type we are interested in when running queries using the
78 | transaction table.
79 | - There are a lot of strange values in the "color", "style", and "size" fields of
80 | the skuinfo table.
The information recorded in these columns is not always related to
81 | the column title (for example there are entries like "BMK/TOUR K" and "ALOE COMBO" in
82 | the color field, even though those entries do not represent colors).
83 | - The department descriptions (``deptdesc`` from ``DEPTINFO``) seem to represent brand
84 | names. However, if you look at entries in the skuinfo table from only one department,
85 | you will see that many brands are in the same department.
86 |
87 | ### Exercise 3
88 |
89 | **Examine lists of distinct values in each of the tables.**
90 |
91 | Okay...
92 |
93 | ### Exercise 4
94 |
95 | **Examine instances of transaction table where "amt" is different than "sprice".
96 | What did you learn about how the values in "amt", "quantity", and "sprice"
97 | relate to one another?**
98 |
99 | To query all rows where ``amt`` (total transaction amount) is different from
100 | ``sprice`` (sale price):
101 |
102 | ```sql
103 | SELECT *
104 | FROM trnsact
105 | WHERE amt <> sprice;
106 | ```
107 |
108 | We see 7 rows appear. What the rows have in common is that they are all return
109 | transactions (``R``), and have an ``INTERID`` of 000000000. The items, which were originally
110 | $20-$80 each, are now $0.10 to $1.00 each.
111 |
112 | ### Exercise 5
113 |
114 | Even though the Dillard's dataset had primary keys declared and there were not
115 | many NULL values, there are still many bizarre entries that likely reflect entry errors.
116 | To see some examples of these likely errors, examine:
117 |
118 | **(a) Rows in the trnsact table that have "0" in their orgprice column (how could the original
119 | price be 0?)**
120 |
121 | ```sql
122 | SELECT *
123 | FROM trnsact
124 | WHERE orgprice = 0;
125 | ```
126 | *Notes: There should be 1425811 rows where the original price = $0.00, or approx 1.18%
127 | of all rows in the ``TRNSACT`` table.
There appears to be nothing in common between these items.* 128 | 129 | **(b) Rows in the skstinfo table where both the cost and retail price are listed as 0.00** 130 | 131 | ```sql 132 | SELECT * 133 | FROM skstinfo 134 | WHERE cost = '0' 135 | AND retail = '0'; 136 | ``` 137 | 138 | *Notes: There should be 350340 rows where both the cost and retail price = $0.00, or 139 | approx 0.89% of all rows in the ``SKSTINFO`` table. There appears to be nothing in common 140 | between these items.* 141 | 142 | **(c) Rows in the skstinfo table where the cost is greater than the retail price (although 143 | occasionally retailers will sell an item at a loss for strategic reasons, it is very 144 | unlikely that a manufacturer would provide a suggested retail price that is lower than 145 | the cost of the item).** 146 | 147 | ```sql 148 | SELECT * 149 | FROM skstinfo 150 | WHERE cost > retail 151 | AND retail > '0'; -- to exclude erroneous values 152 | ``` 153 | 154 | *Notes: There should be 7535205 rows where cost price is greater than retail price. 155 | This forms approx 19.2% of all rows in the ``SKSTINFO`` table.* 156 | 157 | ### Exercise 6 158 | 159 | **Write your own queries that retrieve multiple columns in a precise order from 160 | a table, and that restrict the rows retrieved from those columns using “BETWEEN”, “IN”, 161 | and references to text strings. Try at least one query that uses dates to restrict the rows 162 | you retrieve.** 163 | 164 | Okay... 165 | 166 | ```sql 167 | SELECT count(store) 168 | FROM strinfo 169 | WHERE state = 'NY'; 170 | ``` 171 | Seems like New York has only 2 stores. Actually, let's explore how many stores there are 172 | in each state, and see who has the most. 
173 | 174 | ```sql 175 | SELECT STATE, COUNT(STORE) 176 | FROM strinfo 177 | GROUP BY STATE 178 | ORDER BY COUNT(STORE) DESC; 179 | ``` 180 | 181 | | State | Stores | 182 | | ---- | ----- | 183 | | TX | 79 184 | | FL | 48 185 | | AR | 27 186 | | AZ | 26 187 | | OH | 25 188 | 189 | Okay, let's try to find the earliest and latest sale date in this dataset. 190 | 191 | ```sql 192 | SELECT distinct saledate 193 | FROM trnsact 194 | ORDER BY saledate ASC; 195 | 196 | SELECT distinct saledate 197 | FROM trnsact 198 | ORDER BY saledate DESC; -- I'm too lazy to scroll. 199 | ``` 200 | Earliest date: ``04/08/01``. Latest date: ``05/08/27``. Seems like we have 389 dates on 201 | record. 202 | 203 | Let's mess around further, and see which dates have the highest number of transactions. 204 | I bet that the total number of transactions will peak on 24 Dec (aka right before Christmas). 205 | Let's check: 206 | ```sql 207 | SELECT saledate, count(saledate) 208 | FROM trnsact 209 | GROUP BY saledate 210 | ORDER BY count(saledate) DESC; 211 | ``` 212 | HOLY CRAP. I am so wrong. Here are the top 10 dates with the highest transactions: 213 | 214 | | No. | Date | Transactions | 215 | | ---- | ---- | ---- | 216 | | 1 | 05/02/26 | 1198813 217 | | 2 | 05/02/25 | 947451 218 | | 3 | 05/02/24 | 888352 219 | | 4 | 05/07/30 | 875042 220 | | 5 | 05/02/23 | 855037 221 | | 6 | 05/08/27 | 771760 222 | | 7 | 04/10/02 | 758200 223 | | 8 | 04/12/18* | 744268 224 | | 9 | 04/11/26 | 690396 225 | | 10 | 04/12/23* | 675139 226 | 227 | Seems like Christmas doesn't even come close. WTF? Let's find out what happened 228 | on ``05/02/26``. 229 | 230 | According to Google, it seems like they had the [mother of all sales](https://sgbonline.com/dillards-february-comps-increase-5-percent/ "DillardsReport"). 231 | 232 | Well, that must have been one epic sale. 
Because judging by the number of transactions, it 233 | appears that people spent **1.75x** more on 25th and 26th Feb than on the 2 days leading up 234 | to Christmas (23rd and 24th Dec; *I excluded 25th Dec because Dillards was not open on Christmas Day*). 235 | 236 | ![alt text](https://cdn.meme.am/instances/400x/64773524.jpg) 237 | 238 | I don't understand, America. How do you spend more for yourself *in a single day* 239 | than for all your friends and cousins combined? 240 | 241 | Anyway, that's all the questions for this exercise. *I've spent an hour on this already and it's 242 | 3am here. :(* 243 | 244 | One final note from the assignment: while **dates** will be output as: 245 | 246 | ``YY-MM-DD`` 247 | 248 | during queries, **date** strings should be entered as: 249 | 250 | ``YYYY-MM-DD``. 251 | 252 | *Thanks for reading, hope this was useful to you. I had fun writing this!* 253 | 254 | 255 | -------------------------------------------------------------------------------- /Week3-Dillards.md: -------------------------------------------------------------------------------- 1 | # Week 3 - Dillard's Database Exercises 2 | 3 | Date created: 17 March 2017 4 | 5 | This is the COMPLETE answer key (including explanations where necessary) 6 | for Week 3 of the "Managing Big Data with MySQL" course 7 | by Duke University. 8 | 9 | I wrote this answer key as no official answers have been released online. 10 | These answers reflect my own work and are accurate to the best of my knowledge. 11 | I will update them if the professors ever release an "official" answer key. 12 | 13 | Update: These answers are based on the original UA_Dillards dataset 14 | (not UA_Dillards1, nor UA_Dillards_2016). For example, this means I am using 15 | the table ``SKSTINFO`` and not ``SKSTINFO_FIX``, which is the newer version. 16 | 17 | Now, let's start. 
18 | 19 | # Answers 20 | 21 | To start, enter ``DATABASE ua_dillards;`` into the Teradata SQL scratchpad. 22 | 23 | ### Exercise 1 24 | 25 | **(a) Use COUNT and DISTINCT to determine how many distinct skus there are in 26 | pairs of the skuinfo, skstinfo, and trnsact tables. Which skus are common to 27 | pairs of tables, or unique to specific tables?** 28 | 29 | ```sql 30 | SELECT COUNT(DISTINCT a.sku) 31 | FROM skuinfo a 32 | JOIN skstinfo b 33 | ON a.sku = b.sku; 34 | 35 | SELECT COUNT(DISTINCT a.sku) 36 | FROM skuinfo a 37 | JOIN trnsact b 38 | ON a.sku = b.sku; 39 | 40 | SELECT COUNT(DISTINCT a.sku) 41 | FROM skstinfo a 42 | JOIN trnsact b 43 | ON a.sku = b.sku; 44 | ``` 45 | 46 | Results: 47 | 48 | | Combi | Pair 1 | Pair 2 | Distinct SKU | 49 | | ----- | ------ | ------ | ------------ | 50 | | 1 | skuinfo | skstinfo | 760212 | 51 | | 2 | skuinfo | trnsact | 714499 | 52 | | 3 | skstinfo | trnsact | 542513 | 53 | 54 | To test which ``SKU``s are in which tables: 55 | 56 | ```sql 57 | SELECT a.sku, b.sku 58 | FROM skuinfo a 59 | LEFT JOIN skstinfo b 60 | ON a.sku = b.sku 61 | WHERE b.sku IS NULL; 62 | 63 | SELECT a.sku, b.sku 64 | FROM skuinfo a 65 | LEFT JOIN trnsact b 66 | ON a.sku = b.sku 67 | WHERE b.sku IS NULL; 68 | 69 | ``` 70 | * All items in ``SKSTINFO`` are listed in ``SKUINFO``, but not vice versa 71 | * All items in ``TRNSACT`` are listed in ``SKUINFO``, but not vice versa 72 | 73 | **(b) Use COUNT to determine how many instances there are of each sku associated 74 | with each store in the skstinfo table and the trnsact table?** 75 | 76 | ```sql 77 | SELECT sku, store, COUNT(sku) 78 | FROM skstinfo 79 | GROUP BY sku, store; 80 | ``` 81 | Seems like there's only one instance of each sku-store combo in the ``SKSTINFO`` table. 82 | ```sql 83 | SELECT sku, store, COUNT(sku) 84 | FROM trnsact 85 | GROUP BY sku, store; 86 | ``` 87 | Seems like there are multiple instances of each sku-store combo in the ``TRNSACT`` table. 
88 | 89 | *Notes from lecture: You should see there are multiple instances of every 90 | sku/store combination in the ``trnsact`` table, but only one instance of every 91 | sku/store combination in the ``skstinfo`` table. Therefore you could join the 92 | ``trnsact`` and ``skstinfo`` tables, but you would need to join them on both of the 93 | following conditions: ``trnsact.sku = skstinfo.sku`` AND ``trnsact.store = skstinfo.store``.* 94 | 95 | ### Exercise 2 96 | 97 | **(a) Use COUNT and DISTINCT to determine how many distinct stores there are in the 98 | strinfo, store_msa, skstinfo, and trnsact tables.** 99 | 100 | ```sql 101 | SELECT COUNT(DISTINCT store) 102 | FROM strinfo; 103 | 104 | SELECT COUNT(DISTINCT store) 105 | FROM skstinfo; 106 | 107 | SELECT COUNT(DISTINCT store) 108 | FROM store_msa; 109 | 110 | SELECT COUNT(DISTINCT store) 111 | FROM trnsact; 112 | ``` 113 | 114 | |Table Name | Unique Stores | 115 | | --------- | ------------- | 116 | | STRINFO | 453 117 | | SKSTINFO | 357 118 | | STORE_MSA | 333 119 | | TRNSACT | 332 120 | 121 | **(b) Which stores are common to all four tables, or unique to specific tables?** 122 | 123 | Since we know that ALL stores can be found in the ``STRINFO`` table, we can left join 124 | the three other tables to it. 125 | 126 | ```sql 127 | SELECT a.store, b.store, c.store, d.store 128 | FROM strinfo a 129 | LEFT JOIN skstinfo b 130 | ON a.store = b.store 131 | LEFT JOIN trnsact c 132 | ON a.store = c.store 133 | LEFT JOIN store_msa d 134 | ON a.store = d.store; -- join on a.store, not c.store, so NULLs from the trnsact join don't drop rows 135 | ``` 136 | 137 | ### Exercise 3 138 | 139 | It turns out there are many skus in the trnsact table that are not in the skstinfo 140 | table. As a consequence, we will not be able to complete many desirable analyses of 141 | Dillard’s profit, as opposed to revenue, because we do not have the cost information 142 | for all the skus in the ``trnsact`` table (recall that profit = revenue - cost). 
143 | 144 | **Examine some of the rows in the trnsact table that are not in the skstinfo table; 145 | can you find any common features that could explain why the cost information is missing?** 146 | 147 | ```sql 148 | SELECT * 149 | FROM trnsact a 150 | LEFT JOIN skstinfo b 151 | ON a.sku = b.sku AND a.store = b.store 152 | WHERE b.sku IS NULL; 153 | ``` 154 | This returns every column of the rows that are in ``TRNSACT`` but 155 | **not in** ``SKSTINFO``. Honestly, I can't see much difference just eyeballing it. 156 | There are 52,338,840 such rows, or approx 43.3% of the roughly 120 million rows in the table. 157 | 158 | To check how many of them are *unique*: 159 | 160 | ```sql 161 | SELECT a.sku, a.store 162 | FROM trnsact a 163 | LEFT JOIN skstinfo b 164 | ON a.sku = b.sku AND a.store = b.store 165 | WHERE b.sku IS NULL 166 | GROUP BY a.sku, a.store; -- GROUP BY already de-duplicates, so DISTINCT would be redundant 167 | ``` 168 | 169 | That leaves exactly 17,816,793 sku-store combinations found in the transactions table 170 | that are not listed in the master ``skstinfo`` table. I still can't tell what's 171 | unique about the missing values, so let's see what the next question is and 172 | come back to this later. 173 | 174 | ### Exercise 4 175 | 176 | **Although we can’t complete all the analyses we’d like to on Dillard’s profit, 177 | we can look at general trends. What is Dillard’s average profit per day?** 178 | 179 | Assumptions: 180 | 181 | 1. With **over 40% of the necessary data missing** (see Exercise 3), whatever data we 182 | have left is accurate and worth calculating -.-" 183 | 2. For each transaction recorded (row), only 1 type of item is purchased at a time. 184 | In other words, that: 185 | 186 | > Total amount paid per transaction = number of items x price of each item. 187 | 188 | This is important because if each transaction contains numerous items of different prices, 189 | we will lack the necessary information about the unique composition of each transaction to make 190 | this query. 
191 | 192 | Back to the question, 193 | 194 | > Profit = revenue - cost 195 | 196 | This can be written as 197 | 198 | ``PROFIT = trnsact.amt - (trnsact.quantity * skstinfo.cost)`` 199 | 200 | Further, since we want the **average** daily profit, we can divide the total profit 201 | by the number of days, i.e. by ``COUNT(DISTINCT saledate)``. 202 | 203 | Overall, we can build the rest of the query around it like so: 204 | 205 | ```sql 206 | SELECT SUM(a.amt - a.quantity*b.cost)/COUNT(DISTINCT a.saledate) -- avg profit 207 | FROM trnsact a 208 | LEFT JOIN SKSTINFO b 209 | ON a.sku = b.sku AND a.store = b.store 210 | WHERE a.stype = 'P'; -- purchases only 211 | ``` 212 | This returns an average profit of ``$1,527,903.46`` per day. Let's check this 213 | against what the question expects - that the average profit for Register 640 214 | should be ``$10,779.20``. 215 | 216 | ```sql 217 | SELECT SUM(a.amt - a.quantity*b.cost)/COUNT(DISTINCT a.saledate) 218 | FROM trnsact a 219 | LEFT JOIN SKSTINFO b 220 | ON a.sku = b.sku AND a.store = b.store 221 | WHERE a.stype = 'P' 222 | AND register = '640'; 223 | ``` 224 | The answer is correct. 225 | 226 | ### Exercise 5 227 | 228 | **On what day was the total value (in $) of returned goods the greatest?** 229 | 230 | ```sql 231 | SELECT saledate, sum(amt) -- I didn't limit this cos I'm kaypoh 232 | FROM trnsact 233 | WHERE stype = 'R' 234 | GROUP BY saledate 235 | ORDER BY sum(amt) DESC; 236 | ``` 237 | 238 | To select only the day with the *greatest* value, add ``TOP 1`` after ``SELECT`` (Teradata's equivalent of MySQL's ``LIMIT 1``). 
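In Teradata syntax, collapsing the ranked list above to just the top day would look something like this (a sketch; I haven't re-run it against the database):

```sql
-- Return only the single day with the greatest total value of returns.
-- Teradata uses TOP n; in MySQL you would drop TOP 1 and append LIMIT 1 instead.
SELECT TOP 1 saledate, SUM(amt)
FROM trnsact
WHERE stype = 'R'
GROUP BY saledate
ORDER BY SUM(amt) DESC;
```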
239 | 240 | | Sale date | Total value of returned goods | 241 | | --------- | ----------------------------- | 242 | | **04/12/27** | **$3,030,259.76** 243 | | 04/12/26 | $2,665,283.86 244 | | 04/12/28 | $2,332,544.44 245 | | 04/12/29 | $1,983,898.91 246 | | 04/12/30 | $1,884,052.85 247 | | 04/12/31 | $1,631,004.76 248 | | 05/01/08 | $1,438,745.35 249 | | 05/02/26 | $1,403,971.89 250 | | 05/01/03 | $1,357,311.82 251 | | 05/01/02 | $1,270,440.95 252 | 253 | **On what day was the total number of individual returned items the greatest?** 254 | 255 | ```sql 256 | SELECT saledate, sum(quantity) 257 | FROM trnsact 258 | WHERE stype = 'R' 259 | GROUP BY saledate 260 | ORDER BY sum(quantity) DESC; 261 | ``` 262 | 263 | | Sale date | Total num of returned goods | 264 | | --------- | ----------------------------- | 265 | | **04/12/27** | **82512** | 266 | |04/12/26|71710 267 | |04/12/28|64265 268 | |05/02/26|62462 269 | |04/12/29|55356 270 | |05/02/25|54597 271 | |04/12/30|53171 272 | |05/02/24|49199 273 | |05/07/30|46436 274 | |05/08/27|45704 275 | 276 | Well, at least it appears that there is some correlation between the two results. 277 | 278 | ### Exercise 6 279 | 280 | **What is the maximum price paid for an item in our database? What is the minimum price 281 | paid for an item in our database?** 282 | 283 | I'm not sure whether the tables are reliable, so I am going to check all possible values 284 | from ``skstinfo.retail``, ``trnsact.orgprice`` and ``trnsact.sprice``. 
285 | 286 | ```sql 287 | SELECT max(orgprice) 288 | FROM trnsact 289 | WHERE stype = 'P'; 290 | 291 | SELECT min(orgprice) 292 | FROM trnsact 293 | WHERE stype = 'P'; 294 | 295 | SELECT max(sprice) 296 | FROM trnsact 297 | WHERE stype = 'P'; 298 | 299 | SELECT min(sprice) 300 | FROM trnsact 301 | WHERE stype = 'P'; 302 | 303 | SELECT max(retail) 304 | FROM skstinfo; 305 | 306 | SELECT min(retail) 307 | FROM skstinfo; 308 | ``` 309 | 310 | | Source | Max price | Min price | 311 | | ----------- | -----| --------- | 312 | | skstinfo.retail | 6017.00 | 0.00 | 313 | | trnsact.orgprice | 6017.00 | 0.00 | 314 | | trnsact.sprice | 6017.00 | 0.00 | 315 | 316 | It's nice that they are consistent. Being careful pays off. It appears safe to conclude that 317 | the **maximum price** for any item is ``$6017.00`` and the **minimum price** is ``$0.00``. 318 | 319 | ### Exercise 7 320 | 321 | **How many departments have more than 100 brands associated with them, and what are their 322 | descriptions?** 323 | 324 | ```sql 325 | SELECT DISTINCT a.dept, b.deptdesc, count(distinct a.brand) 326 | FROM skuinfo a 327 | LEFT JOIN deptinfo b 328 | ON a.dept=b.dept 329 | GROUP BY a.dept, b.deptdesc 330 | HAVING count(distinct a.brand) > 100; 331 | ``` 332 | 333 | There are **three** departments with more than 100 brands associated with them, and these are their 334 | descriptions: 335 | 336 | | Department ID | Description | Num brands | 337 | | ----------- | -----| --------- | 338 | |4407 | ENVIRON | 389 339 | | 7104 | CARTERS | 109 340 | | 5203 | COLEHAAN | 118 341 | 342 | ### Exercise 8 343 | 344 | **Write a query that retrieves the department descriptions of each of the skus in the skstinfo 345 | table.** 346 | 347 | ```sql 348 | SELECT a.sku, c.deptdesc 349 | FROM skstinfo a 350 | LEFT JOIN skuinfo b 351 | ON a.sku = b.sku 352 | LEFT JOIN deptinfo c 353 | ON b.dept = c.dept 354 | SAMPLE 100; -- remove this during exam 355 | ``` 356 | The department description for ``SKU5020024`` is ``LESLIE``. 
357 | 358 | ### Exercise 9 359 | 360 | **What department (with department description), brand, style, and color had the greatest total 361 | value of returned items?** 362 | 363 | ### Exercise 10 364 | 365 | **In what state and zip code is the store that had the greatest total revenue during the time 366 | period monitored in our dataset?** 367 | 368 | *Note: There is an error in the notes. The question should ask for state and **city**, not just **zip**. 369 | The assignment statement provided (below) suggests that you are expected to know the city too.* 370 | 371 | > "If you have written your query correctly, you will find that the department with the 372 | 10th highest total revenue is in Hurst, TX." 373 | 374 | ```sql 375 | SELECT b.state, b.zip, b.city, SUM(a.amt) -- no need to include sum(a.amt), but this is good for checking. 376 | FROM strinfo b 377 | LEFT JOIN trnsact a 378 | ON a.store = b.store 379 | WHERE a.stype = 'P' 380 | GROUP BY b.state, b.zip, b.city -- every non-aggregated column in the SELECT must be grouped 381 | ORDER BY SUM(a.amt) DESC; 382 | 383 | ``` 384 | 385 | | State | ZIP | City | Total Revenue | 386 | | ----- | --- | ---- | ------------------ | 387 | | LA | 70002 | METAIRIE |$24,171,426.58 388 | |AR |72205 |LITTLE ROCK |$22,792,579.65 389 | |TX |78501 |MCALLEN |$22,331,884.55 390 | |TX |75225 |DALLAS |$22,063,797.73 391 | |KY |40207 |LOUISVILLE| $20,114,154.20 392 | |TX |77056 |HOUSTON| $19,040,376.84 393 | |KS |66214 |OVERLAND PARK |$18,642,976.76 394 | |OK |73118 |OKLAHOMA CITY |$18,458,644.39 395 | |TX |78216 |SAN ANTONIO |$18,455,775.63 396 | | **TX** | **76053** | **HURST** | **$17,740,181.20** 397 | 398 | The answer is correct. The store with the 10th highest revenue is in ``Hurst, TX``, with ``$17,740,181.20``. 
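P.S. I never wrote out an answer for Exercise 9 above, so here is an untested sketch of one way to attack it: join the returns in ``trnsact`` to ``skuinfo`` (for brand, style, color) and ``deptinfo`` (for the description), then rank by total returned value. Treat the exact column list as an assumption based on the schema used throughout these notes:

```sql
-- Sketch only (not verified against the database):
-- total $ value of returns per department/brand/style/color, highest first.
SELECT TOP 1
    c.dept,
    c.deptdesc,
    b.brand,
    b.style,
    b.color,
    SUM(a.amt) AS total_returned
FROM trnsact a
JOIN skuinfo b
    ON a.sku = b.sku
JOIN deptinfo c
    ON b.dept = c.dept
WHERE a.stype = 'R'   -- returns only
GROUP BY c.dept, c.deptdesc, b.brand, b.style, b.color
ORDER BY total_returned DESC;
```

Drop the ``TOP 1`` if you want to eyeball the full ranking.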
399 | -------------------------------------------------------------------------------- /Week3Ex7-InnerJoin.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 3 Exercise 7 - Inner Joins 2 | 3 | This is the COMPLETE answer key (including explanations) for Week 3 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 4 | Date created: 15 March 2017 5 | 6 | */ 7 | 8 | -- BOX 1: LOAD SERVER 9 | 10 | %load_ext sql 11 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 12 | %sql USE dognitiondb 13 | 14 | 15 | 16 | -- BOX 2 17 | -- Note: This should throw an error. This is to demonstrate what the error looks like. 18 | 19 | SELECT 20 | dog_guid AS DogID, 21 | user_guid AS UserID, 22 | AVG(rating) AS AvgRating, 23 | COUNT(rating) AS NumRatings, 24 | breed, breed_group, breed_type 25 | FROM dogs, reviews 26 | GROUP BY user_guid, dog_guid, breed, breed_group, breed_type 27 | HAVING NumRatings >= 10 28 | ORDER BY AvgRating DESC 29 | LIMIT 200; 30 | 31 | 32 | 33 | -- BOX 3 34 | -- Expected: 38 rows 35 | 36 | SELECT 37 | d.dog_guid AS DogID, 38 | d.user_guid AS UserID, 39 | AVG(r.rating) AS AvgRating, 40 | COUNT(r.rating) AS NumRatings, 41 | d.breed, 42 | d.breed_group, 43 | d.breed_type 44 | FROM dogs d, reviews r 45 | WHERE d.dog_guid=r.dog_guid 46 | AND d.user_guid=r.user_guid 47 | GROUP BY DogID, d.breed, d.breed_group, d.breed_type 48 | HAVING NumRatings >= 10 49 | ORDER BY AvgRating DESC 50 | LIMIT 200; 51 | 52 | 53 | 54 | -- BOX 4 55 | -- Expected: 389 rows 56 | 57 | /* IMPORTANT NOTE 58 | 59 | There is some discrepancy between what the question asks and what it actually wants. As the student mentors have admitted, this question could be better worded. Most of us (including me) got 395 rows the first time. 
This is the explanation of what went wrong, and how to fix it: 60 | 61 | Doing exactly as the question instructs, which is to run the query from BOX 3 without the HAVING and LIMIT clauses, most people got 395 rows as their answer. However, the question tells us to expect 389 rows instead. 62 | 63 | What do these answers represent? 64 | 65 | 395 rows is the number of unique DOG IDs common to both the dogs and reviews tables. 66 | 67 | 389 rows is the number of unique USER IDs common to both the dogs and reviews tables. 68 | 69 | Although we are technically right in following the assignment's exact instructions, the instructions themselves were misleading. 70 | 71 | The original purpose of this question was to explore if users who gave a high average surprise rating for their dog's performance were users who tend to have more than one dog of the same breed. Hence, the question should have prompted us to compare on the basis of USERS instead of DOG IDs, but the instructors forgot to tell us we could modify it. 72 | 73 | The correct query to get 389 rows should be: 74 | */ 75 | 76 | SELECT DISTINCT 77 | r.user_guid AS UserID, 78 | AVG(r.rating) AS AvgRating, 79 | COUNT(r.rating) AS NumRatings 80 | FROM dogs d, reviews r 81 | WHERE d.dog_guid=r.dog_guid 82 | AND d.user_guid=r.user_guid 83 | GROUP BY UserID 84 | ORDER BY AvgRating DESC; 85 | 86 | /* 87 | Note: The reason for this discrepancy (users vs dogs) is that some users have more than one dog. 
88 | */ 89 | 90 | 91 | 92 | -- BOX 5 QN 1 93 | -- Expected: 5991 (1 row) 94 | 95 | SELECT COUNT(DISTINCT dog_guid) 96 | FROM reviews 97 | 98 | 99 | 100 | -- BOX 6 QN 2 101 | -- Expected: 5586 (1 row) 102 | 103 | SELECT COUNT(DISTINCT user_guid) 104 | FROM reviews 105 | 106 | 107 | 108 | -- BOX 7 QN 3 109 | -- Expected: 30967 (1 row) 110 | 111 | SELECT COUNT(DISTINCT user_guid) 112 | FROM dogs 113 | 114 | 115 | 116 | -- BOX 8 QN 4 117 | -- Expected: 35050 (1 row) 118 | 119 | SELECT COUNT(DISTINCT dog_guid) 120 | FROM dogs 121 | 122 | 123 | 124 | -- BOX 9 125 | -- Expected: 5589 (1 row) 126 | 127 | SELECT COUNT(DISTINCT d.user_guid) 128 | FROM dogs d, 129 | reviews r 130 | WHERE d.user_guid=r.user_guid; 131 | 132 | -- OR, joining on dog IDs instead: 133 | 134 | -- Expected: 389 (1 row) 135 | 136 | SELECT COUNT(DISTINCT d.user_guid) 137 | FROM dogs d, 138 | reviews r 139 | WHERE d.dog_guid=r.dog_guid; 140 | 141 | 142 | 143 | -- BOX 10 QN 5 144 | -- Expected: 20845 rows 145 | 146 | SELECT 147 | c.user_guid, 148 | c.dog_guid, 149 | d.breed, 150 | d.breed_type, 151 | d.breed_group 152 | FROM complete_tests c, dogs d 153 | WHERE c.dog_guid=d.dog_guid 154 | AND test_name = "Yawn Warm-up"; 155 | 156 | 157 | 158 | -- BOX 11 QN 6 159 | -- Expected: 711 rows 160 | 161 | SELECT DISTINCT 162 | u.user_guid, 163 | u.membership_type, 164 | d.dog_guid, 165 | d.breed 166 | FROM complete_tests c, dogs d, users u 167 | WHERE c.dog_guid = d.dog_guid 168 | AND d.user_guid = u.user_guid 169 | AND d.breed = 'Golden Retriever'; 170 | 171 | 172 | 173 | -- BOX 12 QN 7 174 | -- Expected: 30 rows 175 | 176 | SELECT DISTINCT 177 | d.dog_guid, 178 | d.breed 179 | FROM dogs d, users u 180 | WHERE d.user_guid = u.user_guid 181 | AND d.breed = "Golden Retriever" 182 | AND u.state = 'NC'; 183 | 184 | 185 | 186 | -- BOX 12 QN 8 187 | -- Expected: 5 rows (first row should be 1, 2900) 188 | 189 | SELECT 190 | u.membership_type AS 'Membership Type', 191 | COUNT(DISTINCT r.user_guid) AS 'Total Reviews' 192 | FROM users u, reviews r 193 
| WHERE r.user_guid = u.user_guid 194 | AND r.rating IS NOT NULL 195 | GROUP BY u.membership_type 196 | ORDER BY COUNT(r.user_guid) DESC; 197 | 198 | 199 | 200 | -- BOX 13 QN 9 201 | -- Expected: 5 rows (first row should be 1, 2900) 202 | 203 | SELECT 204 | u.membership_type AS 'Membership Type', 205 | COUNT(DISTINCT r.user_guid) AS 'Total Reviews' 206 | FROM users u, reviews r 207 | WHERE r.user_guid = u.user_guid 208 | AND r.rating IS NOT NULL 209 | GROUP BY u.membership_type 210 | ORDER BY COUNT(r.user_guid) DESC; 211 | 212 | 213 | 214 | -- BOX 14 QN 10 215 | -- Expected: 3 rows (breeds should be mixed, golden retriever, and golden retriever-labrador mix) 216 | 217 | SELECT 218 | d.breed, 219 | COUNT(sa.script_detail_id) 220 | FROM dogs d, site_activities sa 221 | WHERE d.dog_guid = sa.dog_guid 222 | AND sa.script_detail_id IS NOT NULL 223 | GROUP BY d.breed 224 | ORDER BY COUNT(sa.script_detail_id) DESC 225 | LIMIT 3; 226 | 227 | -- END -- 228 | -------------------------------------------------------------------------------- /Week3Ex8-OuterJoins.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 3 Exercise 8 - Outer Joins 2 | 3 | This is the COMPLETE answer key (including explanations) 4 | for Week 3 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 
5 | 6 | Date created: 16 March 2017 7 | */ 8 | 9 | -- BOX 1: LOAD SERVER 10 | 11 | %load_ext sql 12 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 13 | %sql USE dognitiondb 14 | 15 | 16 | 17 | -- BOX 2 18 | -- Expected: 20845 rows 19 | 20 | SELECT 21 | d.user_guid AS UserID, 22 | d.dog_guid AS DogID, 23 | d.breed, 24 | d.breed_type, 25 | d.breed_group 26 | 27 | FROM dogs d JOIN complete_tests c 28 | ON d.dog_guid=c.dog_guid 29 | 30 | AND test_name='Yawn Warm-up'; 31 | 32 | 33 | 34 | 35 | -- BOX 3 36 | -- Expected: 932 rows 37 | 38 | SELECT 39 | r.dog_guid AS rDogID, 40 | r.user_guid AS rUserID, 41 | d.dog_guid AS dDogID, 42 | d.user_guid AS dUserID, 43 | AVG(r.rating) AS AvgRating, 44 | COUNT(r.rating) AS NumRatings 45 | FROM dogs d RIGHT JOIN reviews r 46 | ON r.dog_guid=d.dog_guid 47 | AND r.user_guid=d.user_guid 48 | WHERE r.dog_guid IS NOT NULL 49 | GROUP BY r.dog_guid 50 | HAVING NumRatings >= 10 51 | ORDER BY AvgRating DESC 52 | 53 | 54 | 55 | -- BOX 4 56 | -- Expected: 894 rows 57 | 58 | SELECT 59 | r.dog_guid AS rDogID, 60 | d.dog_guid AS dDogID, 61 | r.user_guid AS rUserID, 62 | d.user_guid AS dUserID, 63 | AVG(r.rating) AS AvgRating, 64 | COUNT(r.rating) AS NumRatings 65 | FROM reviews r LEFT JOIN dogs d 66 | ON r.dog_guid=d.dog_guid 67 | AND r.user_guid=d.user_guid 68 | WHERE d.dog_guid IS NULL 69 | GROUP BY r.dog_guid 70 | HAVING NumRatings >= 10 71 | ORDER BY AvgRating DESC; 72 | 73 | 74 | 75 | -- BOX 5 76 | -- Expected: 35050 rows 77 | 78 | SELECT 79 | d.dog_guid AS dDogID, 80 | COUNT(c.test_name) AS 'Tests Completed' 81 | FROM dogs d LEFT JOIN complete_tests c 82 | ON d.dog_guid = c.dog_guid 83 | WHERE d.dog_guid IS NOT NULL 84 | GROUP BY d.dog_guid 85 | ORDER BY COUNT(c.dog_guid) ASC; 86 | 87 | 88 | 89 | -- BOX 6 90 | -- Expected: 17987 rows 91 | 92 | SELECT 93 | d.dog_guid AS dDogID, 94 | COUNT(c.test_name) AS 'Tests Completed' 95 | FROM dogs d LEFT JOIN complete_tests c 96 | ON d.dog_guid = c.dog_guid 97 | WHERE d.dog_guid IS NOT 
NULL 98 | GROUP BY c.dog_guid -- DIFFERENCE! 99 | ORDER BY COUNT(c.dog_guid) ASC; 100 | 101 | 102 | 103 | -- BOX 7 QN 5 104 | -- Expected: 1 row (17986) 105 | 106 | SELECT count(distinct dog_guid) 107 | FROM complete_tests; 108 | 109 | 110 | 111 | -- BOX 8 QN 6 112 | -- Expected: 952557 rows 113 | 114 | SELECT 115 | u.user_guid, 116 | d.user_guid, 117 | d.dog_guid, 118 | d.breed, 119 | d.breed_type, 120 | d.breed_group 121 | FROM users u LEFT JOIN dogs d 122 | ON u.user_guid = d.user_guid 123 | 124 | 125 | 126 | -- BOX 9 QN 7 127 | -- Expected: 33193 rows 128 | 129 | SELECT 130 | u.user_guid AS uUserID, 131 | d.user_guid AS dUserID, 132 | d.dog_guid AS dDogID, 133 | d.breed, 134 | count(*) AS numrows 135 | FROM users u LEFT JOIN dogs d 136 | ON u.user_guid = d.user_guid 137 | GROUP BY u.user_guid 138 | ORDER BY numrows DESC; 139 | 140 | 141 | 142 | -- BOX 10 QN 8 143 | -- Expected: 17 (1 row, since this is a COUNT) 144 | 145 | SELECT count(user_guid) 146 | from users 147 | where user_guid = 'ce225842-7144-11e5-ba71-058fbc01cf0b' 148 | 149 | 150 | 151 | -- BOX 11 QN 9 152 | -- Expected: 26 (1 row, since this is a COUNT) 153 | 154 | SELECT count(user_guid) 155 | from dogs 156 | where user_guid = 'ce225842-7144-11e5-ba71-058fbc01cf0b' 157 | 158 | 159 | -- BOX 12 QN 10 160 | -- Expected: 2226 rows 161 | 162 | SELECT DISTINCT 163 | u.user_guid AS uUserID, 164 | d.user_guid AS dUserID 165 | FROM users u LEFT JOIN dogs d 166 | ON u.user_guid = d.user_guid 167 | WHERE d.user_guid IS NULL 168 | 169 | 170 | 171 | -- BOX 13 QN 11 172 | -- Expected: 2226 rows 173 | 174 | SELECT DISTINCT 175 | u.user_guid AS uUserID, 176 | d.user_guid AS dUserID 177 | 178 | FROM dogs d RIGHT JOIN users u 179 | ON u.user_guid = d.user_guid 180 | 181 | WHERE d.user_guid IS NULL 182 | 183 | 184 | 185 | -- BOX 14 QN 12 186 | -- Expected: 5833 rows 187 | SELECT DISTINCT 188 | sa.dog_guid AS 'Dog ID', 189 | d.dog_guid AS 'Should be NULL', 190 | COUNT(sa.dog_guid) AS Times 191 | FROM site_activities sa LEFT JOIN dogs d 192 | ON sa.user_guid = 
d.user_guid 193 | WHERE d.dog_guid IS NULL 194 | AND sa.dog_guid IS NOT NULL 195 | GROUP BY sa.dog_guid 196 | ORDER BY Times DESC; 197 | -------------------------------------------------------------------------------- /Week4Ex10-BizInt.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 4 Exercise 10 - Business Intelligence 2 | 3 | This is the COMPLETE answer key (including explanations) for Week 4 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 4 | Date created: 17 March 2017 5 | 6 | */ 7 | 8 | -- BOX 1: LOAD SERVER 9 | 10 | %load_ext sql 11 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 12 | %sql USE dognitiondb 13 | 14 | 15 | 16 | -- BOX 2, Qn 1 17 | -- Expected: 11 rows 18 | 19 | SELECT DISTINCT dimension 20 | FROM dogs; 21 | 22 | 23 | 24 | -- BOX 3, Qn 2 25 | -- Expected: 100 rows 26 | 27 | /* Note: This question is rather misleading. The question suggests that a subquery is required, but an inner join will do. 28 | 29 | It's also not obvious whether the question wants you to group by the dog's personality dimensions (as that was the main focus of the preamble), or to produce a report of EVERY dog in the database. As it turns out, they want the latter. 
30 | 31 | */ 32 | 33 | SELECT 34 | d.dog_guid AS dogID, 35 | d.dimension AS dimension, 36 | count(c.created_at) AS numtests 37 | FROM dogs d, complete_tests c 38 | WHERE d.dog_guid=c.dog_guid 39 | GROUP BY dogID 40 | ORDER BY numtests DESC 41 | LIMIT 100; -- feel free to remove this line if you're curious 42 | -- Expected output otherwise: 17986 43 | 44 | 45 | -- BOX 4, Qn 3 46 | -- Expected: 100 rows 47 | 48 | SELECT 49 | d.dog_guid AS dogID, 50 | d.dimension AS dimension, 51 | count(c.created_at) AS numtests 52 | FROM dogs d 53 | INNER JOIN complete_tests c -- Or just JOIN 54 | ON d.dog_guid=c.dog_guid 55 | GROUP BY dogID 56 | ORDER BY numtests DESC 57 | LIMIT 100; 58 | 59 | 60 | -- BOX 5, Qn 4 61 | -- Expected: 11 rows 62 | 63 | SELECT 64 | indiv_scores.personality, 65 | AVG(indiv_scores.testcount) 66 | FROM 67 | (SELECT 68 | d.dog_guid AS dogID, 69 | d.dimension AS personality, 70 | count(c.created_at) AS testcount 71 | FROM dogs d 72 | INNER JOIN complete_tests c 73 | ON d.dog_guid=c.dog_guid 74 | GROUP BY dogID) 75 | AS indiv_scores 76 | GROUP BY indiv_scores.personality; 77 | 78 | 79 | 80 | -- BOX 6, Qn 5 81 | 82 | /* The question is not well-worded either. This question asks, "How many unique DogIDs are summarized in the Dognition dimensions labeled 'None' or ''? (You should retrieve values of 13,705 and 71)". However, it expects you to ONLY count unique Dog IDs that have ALSO completed tests. 83 | 84 | A better question would be, "How many unique Dog IDs that have completed at least one test have Dognition dimensions labelled 'None' or ''?" 
85 | 86 | */ 87 | 88 | SELECT 89 | indiv_scores.personality, 90 | count(indiv_scores.dogID) 91 | FROM 92 | (SELECT 93 | d.dog_guid AS dogID, 94 | d.dimension AS personality 95 | FROM dogs d 96 | INNER JOIN complete_tests c 97 | ON d.dog_guid=c.dog_guid 98 | WHERE c.created_at IS NOT NULL 99 | GROUP BY dogID) 100 | AS indiv_scores 101 | WHERE indiv_scores.personality IS NULL 102 | OR indiv_scores.personality='' 103 | GROUP BY indiv_scores.personality; 104 | 105 | 106 | 107 | -- BOX 7, Qn 6 108 | -- Expected (71 rows) 109 | 110 | SELECT 111 | indiv_scores.dogID, 112 | indiv_scores.breed, 113 | indiv_scores.weight, 114 | indiv_scores.exclude, 115 | indiv_scores.testcount, 116 | indiv_scores.Earliest, 117 | indiv_scores.Latest 118 | FROM 119 | (SELECT 120 | d.dog_guid AS dogID, 121 | d.breed AS breed, 122 | d.weight AS weight, 123 | d.exclude AS exclude, 124 | count(c.created_at) AS testcount, 125 | min(c.created_at) AS Earliest, 126 | max(c.created_at) AS Latest 127 | FROM dogs d 128 | INNER JOIN complete_tests c 129 | ON d.dog_guid=c.dog_guid 130 | WHERE c.created_at IS NOT NULL 131 | AND d.dimension = '' 132 | GROUP BY dogID) 133 | AS indiv_scores 134 | GROUP BY indiv_scores.dogID; 135 | 136 | -- A shorter version would be: 137 | 138 | SELECT 139 | d.dog_guid AS dogID, 140 | d.breed AS breed, 141 | d.weight AS weight, 142 | d.exclude AS exclude, 143 | count(c.created_at) AS testcount, 144 | min(c.created_at) AS Earliest, 145 | max(c.created_at) AS Latest 146 | FROM dogs d 147 | INNER JOIN complete_tests c 148 | ON d.dog_guid=c.dog_guid 149 | WHERE c.created_at IS NOT NULL 150 | AND d.dimension = '' 151 | GROUP BY dogID; 152 | 153 | 154 | -- BOX 8, Qn 7 155 | -- Expected: 9 Rows (ace = 402, charmer = 626) 156 | 157 | SELECT 158 | indiv_scores.personality, 159 | count(indiv_scores.dogID) AS NumDogs, 160 | AVG(indiv_scores.testcount) AS AvgScore 161 | FROM 162 | (SELECT 163 | d.dog_guid AS dogID, 164 | d.dimension AS personality, 165 | count(c.created_at) AS testcount 
166 | FROM dogs d 167 | INNER JOIN complete_tests c 168 | ON d.dog_guid=c.dog_guid 169 | WHERE d.dimension IS NOT NULL -- (2) 170 | AND d.dimension != '' -- (1) 171 | AND (d.exclude IS NULL OR d.exclude = 0) -- (3) 172 | GROUP BY dogID) 173 | AS indiv_scores 174 | GROUP BY indiv_scores.personality; 175 | 176 | 177 | -- BOX 9 178 | 179 | SELECT DISTINCT breed_group 180 | FROM dogs 181 | 182 | 183 | 184 | -- BOX 10, Qn 9 185 | -- Expected: 8816 rows 186 | 187 | SELECT 188 | d.dog_guid AS 'Dog ID', 189 | d.breed, d.weight, d.exclude, 190 | MIN(c.created_at) AS 'Earliest Time', 191 | MAX(c.created_at) AS 'Latest Time', 192 | count(c.created_at) AS 'Num Tests Done' 193 | FROM dogs d 194 | JOIN complete_tests c 195 | ON d.dog_guid = c.dog_guid 196 | WHERE c.created_at IS NOT NULL 197 | AND d.breed_group IS NULL 198 | GROUP BY d.dog_guid 199 | 200 | 201 | 202 | -- BOX 11, Qn 10 203 | -- Expected: 9 rows (Herding = 1774) 204 | 205 | SELECT 206 | indiv_scores.doggroup AS 'Breed Group', 207 | count(indiv_scores.dogID) AS 'Num of Dogs', 208 | AVG(indiv_scores.testcount) AS 'Their Avg Score' 209 | FROM 210 | (SELECT 211 | d.dog_guid AS dogID, 212 | d.breed_group AS doggroup, 213 | count(c.created_at) AS testcount 214 | FROM dogs d 215 | INNER JOIN complete_tests c 216 | ON d.dog_guid=c.dog_guid 217 | WHERE d.breed_group IS NOT NULL -- remove 218 | AND d.breed_group != '' -- remove 219 | AND (d.exclude IS NULL OR d.exclude = 0) -- (specified by qn) 220 | GROUP BY dogID) 221 | AS indiv_scores 222 | GROUP BY indiv_scores.doggroup; 223 | 224 | /* *HOUND* breed groups, NOT *toy* breed groups, complete the least tests. Hound groups = 564. Toy groups = 1041. 
225 | 226 | */ 227 | 228 | -- BOX 12, Qn 11 229 | -- Expected: 4 rows 230 | 231 | SELECT 232 | indiv_scores.doggroup AS 'Breed Group', 233 | count(indiv_scores.dogID) AS 'Num of Dogs', 234 | AVG(indiv_scores.testcount) AS 'Their Avg Score' 235 | FROM 236 | (SELECT 237 | d.dog_guid AS dogID, 238 | d.breed_group AS doggroup, 239 | count(c.created_at) AS testcount 240 | FROM dogs d 241 | INNER JOIN complete_tests c 242 | ON d.dog_guid=c.dog_guid 243 | WHERE d.breed_group IN ('Sporting', 'Hound', 'Herding', 'Working') 244 | AND (d.exclude IS NULL OR d.exclude = 0) -- (specified by qn) 245 | GROUP BY dogID) 246 | AS indiv_scores 247 | GROUP BY indiv_scores.doggroup; 248 | 249 | 250 | 251 | -- BOX 13, Qn 12 252 | -- Expected: 4 rows (pure breed = 8865) 253 | 254 | SELECT DISTINCT breed_type 255 | FROM dogs 256 | 257 | 258 | 259 | -- BOX 14, Qn 13 260 | -- Expected: 4 rows 261 | 262 | SELECT 263 | d.breed_type AS 'Breed Type', 264 | COUNT(DISTINCT d.dog_guid) AS 'Num of dogs', 265 | COUNT(c.created_at) AS 'Num of tests', 266 | COUNT(c.created_at)/COUNT(DISTINCT d.dog_guid) AS 'Tests Done per Dog' -- bonus to make relationship clearer 267 | FROM dogs d 268 | JOIN complete_tests c 269 | ON d.dog_guid = c.dog_guid 270 | WHERE (d.exclude IS NULL OR d.exclude = '0') 271 | AND d.breed_type IS NOT NULL 272 | AND c.created_at IS NOT NULL 273 | GROUP BY d.breed_type 274 | 275 | 276 | 277 | -- BOX 15, Qn 14 278 | -- Expected: 50 rows 279 | 280 | SELECT 281 | DISTINCT d.dog_guid AS 'DogID', 282 | d.breed_type AS 'Breed Type', 283 | count(c.created_at) AS 'Completed tests', 284 | CASE 285 | WHEN d.breed_type = 'Pure Breed' THEN "Pure Breed" 286 | ELSE "Not_Pure_Breed" 287 | END AS Label 288 | FROM dogs d 289 | JOIN complete_tests c 290 | ON d.dog_guid = c.dog_guid 291 | GROUP BY d.dog_guid 292 | LIMIT 50; 293 | 294 | 295 | 296 | 297 | -- BOX 16, Qn 15 298 | -- Expected: 2 rows (Not Pure Breed = 8336 IDs) 299 | 300 | SELECT 301 | cleaned.Label, 302 | count(distinct cleaned.DogID), 303 
| AVG(cleaned.testcount) 304 | FROM 305 | (SELECT 306 | DISTINCT d.dog_guid AS DogID, 307 | d.breed_type AS BreedType, 308 | count(c.created_at) AS testcount, 309 | CASE 310 | WHEN d.breed_type = 'Pure Breed' THEN 'Pure Breed' 311 | ELSE 'Not_Pure_Breed' 312 | END AS Label 313 | FROM dogs d 314 | JOIN complete_tests c 315 | ON d.dog_guid = c.dog_guid 316 | WHERE c.created_at IS NOT NULL 317 | AND d.breed_type IS NOT NULL 318 | AND (d.exclude IS NULL OR d.exclude = '0') 319 | GROUP BY d.dog_guid) 320 | AS cleaned 321 | GROUP BY cleaned.Label 322 | 323 | 324 | 325 | 326 | -- BOX 17, Qn 16 327 | -- Expected: 8816 rows 328 | 329 | SELECT 330 | cleaned.Label, 331 | cleaned.neutered, 332 | AVG(cleaned.testcount), 333 | COUNT(cleaned.dog_guid) 334 | FROM 335 | (SELECT -- subquery part from Qn. 15 336 | d.dog_guid, 337 | d.breed_type, 338 | d.dog_fixed AS neutered, 339 | COUNT(c.created_at) AS testcount, 340 | CASE 341 | WHEN d.breed_type = 'Pure Breed' THEN 'Pure_Breed' 342 | ELSE 'Not_Pure_Breed' 343 | END AS Label 344 | FROM dogs d 345 | JOIN complete_tests c 346 | ON d.dog_guid = c.dog_guid 347 | WHERE (d.exclude = '0' OR d.exclude IS NULL) -- exclusion criteria 348 | GROUP BY d.dog_guid) 349 | AS cleaned 350 | GROUP BY cleaned.Label, cleaned.neutered; 351 | 352 | 353 | 354 | -- BOX 18, Qn 17 355 | -- Expected: 9 rows (ace = 5.4896, charmer = 5.1919) 356 | 357 | SELECT 358 | indiv_scores.personality, 359 | count(indiv_scores.dogID) AS NumDogs, 360 | AVG(indiv_scores.testcount) AS AvgScore, 361 | STDDEV(indiv_scores.testcount) AS StdDevScore 362 | FROM 363 | (SELECT 364 | d.dog_guid AS dogID, 365 | d.dimension AS personality, 366 | count(c.created_at) AS testcount 367 | FROM dogs d 368 | INNER JOIN complete_tests c 369 | ON d.dog_guid=c.dog_guid 370 | WHERE d.dimension IS NOT NULL -- (2) 371 | AND d.dimension != '' -- (1) 372 | AND (d.exclude IS NULL OR d.exclude = 0) -- (3) 373 | GROUP BY dogID) 374 | AS indiv_scores 375 | GROUP BY indiv_scores.personality; 376 | 
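-- Side note (not part of the exercise set): MySQL's STDDEV(), used in Qn 17 above,
-- is an alias for STDDEV_POP(), the population standard deviation. If a sample
-- standard deviation (n-1 denominator) is wanted instead, STDDEV_SAMP() can be
-- swapped in. A throwaway illustration on the values {1, 2, 3}:

SELECT
    STDDEV_POP(t.x)  AS pop_sd,   -- sqrt(2/3), about 0.8165
    STDDEV_SAMP(t.x) AS samp_sd   -- sqrt(1), exactly 1.0
FROM (SELECT 1 AS x UNION SELECT 2 UNION SELECT 3) AS t;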
377 | 378 | 379 | -- BOX 19, Qn 18 380 | -- Expected: 4 rows (cross breed std dev = 13849) 381 | 382 | SELECT 383 | DISTINCT d.breed_type, 384 | AVG(TIMESTAMPDIFF(minute, e.start_time, e.end_time)) AS AvgTime, 385 | STDDEV(TIMESTAMPDIFF(minute, e.start_time, e.end_time)) AS StdDevTime 386 | FROM dogs d 387 | JOIN exam_answers e 388 | ON d.dog_guid = e.dog_guid 389 | GROUP BY d.breed_type 390 | 391 | -- END -- 392 | -------------------------------------------------------------------------------- /Week4Ex12-BizInt.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 4 Exercise 12 - Practicing Business Queries 2 | 3 | This is the COMPLETE answer key (including explanations) for Week 4 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 4 | Date created: 18 March 2017 5 | 6 | */ 7 | 8 | -- BOX 1: LOAD SERVER 9 | 10 | %load_ext sql 11 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 12 | %sql USE dognitiondb 13 | 14 | 15 | 16 | -- Qn 1 17 | -- Expected: 200 rows 18 | 19 | SELECT created_at, dayofweek(created_at) 20 | FROM complete_tests 21 | LIMIT 50, 200; 22 | 23 | 24 | 25 | -- Qn 2 26 | -- Expected: 200 rows 27 | 28 | SELECT created_at, 29 | CASE 30 | WHEN dayofweek(created_at)=1 THEN 'Sunday' 31 | WHEN dayofweek(created_at)=2 THEN 'Monday' 32 | WHEN dayofweek(created_at)=3 THEN 'Tuesday' 33 | WHEN dayofweek(created_at)=4 THEN 'Wednesday' 34 | WHEN dayofweek(created_at)=5 THEN 'Thursday' 35 | WHEN dayofweek(created_at)=6 THEN 'Friday' 36 | WHEN dayofweek(created_at)=7 THEN 'Saturday' 37 | END AS Day 38 | FROM complete_tests 39 | LIMIT 50, 200; 40 | 41 | 42 | 43 | -- Qn 3 44 | -- Expected: 7 rows (Sunday = 33,190 tests) 45 | 46 | SELECT 47 | CASE 48 | WHEN dayofweek(created_at)=1 THEN 'Sunday' 49 | WHEN dayofweek(created_at)=2 THEN 'Monday' 50 | WHEN dayofweek(created_at)=3 THEN 'Tuesday' 51 | WHEN dayofweek(created_at)=4 THEN 'Wednesday' 52 | WHEN dayofweek(created_at)=5 THEN 'Thursday' 53 | WHEN
dayofweek(created_at)=6 THEN 'Friday' 54 | WHEN dayofweek(created_at)=7 THEN 'Saturday' 55 | END AS Day, 56 | COUNT(created_at) AS 'Number of Tests' 57 | FROM complete_tests 58 | GROUP BY Day 59 | ORDER BY COUNT(created_at) DESC; 60 | 61 | 62 | 63 | -- Qn 4 64 | -- Expected: 7 rows 65 | 66 | SELECT 67 | CASE 68 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 69 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 70 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 71 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 72 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 73 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 74 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 75 | END AS Day, 76 | COUNT(c.created_at) AS 'Number of Tests' 77 | FROM complete_tests c 78 | JOIN dogs d 79 | ON c.dog_guid = d.dog_guid 80 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 81 | GROUP BY Day 82 | ORDER BY count(c.created_at) DESC; 83 | 84 | 85 | 86 | -- Qn 5 87 | -- Expected: 950,331 rows 88 | 89 | SELECT d.dog_guid 90 | FROM dogs d 91 | JOIN users u 92 | ON d.user_guid = u.user_guid; 93 | 94 | 95 | 96 | -- Qn 6 97 | -- Expected 35,048 rows 98 | 99 | SELECT DISTINCT d.dog_guid 100 | FROM dogs d 101 | JOIN users u 102 | ON d.user_guid = u.user_guid; 103 | 104 | 105 | 106 | -- Qn 7 107 | -- Expected: 34,121 Rows 108 | 109 | SELECT DISTINCT d.dog_guid 110 | FROM dogs d 111 | JOIN users u 112 | ON d.user_guid = u.user_guid 113 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 114 | AND (u.exclude = 0 OR u.exclude IS NULL); 115 | 116 | 117 | 118 | -- BOX 8 119 | -- Expected: 7 rows 120 | 121 | SELECT 122 | CASE 123 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 124 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 125 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 126 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 127 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 128 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 129 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 130 | END AS Day, 131 | 
COUNT(c.created_at) AS 'Number of Tests' 132 | FROM complete_tests c 133 | JOIN ( 134 | SELECT DISTINCT d.dog_guid 135 | FROM dogs d 136 | JOIN users u 137 | ON d.user_guid = u.user_guid 138 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 139 | AND (u.exclude = 0 OR u.exclude IS NULL) 140 | ) 141 | AS cleandogs 142 | ON c.dog_guid = cleandogs.dog_guid 143 | GROUP BY Day 144 | ORDER BY count(c.created_at) DESC; 145 | 146 | 147 | 148 | -- Qn 9 149 | -- Expected: 21 rows 150 | 151 | SELECT 152 | CASE 153 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 154 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 155 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 156 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 157 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 158 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 159 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 160 | END AS Day, 161 | YEAR(c.created_at) AS Year, 162 | COUNT(c.created_at) AS 'Number of Tests' 163 | FROM complete_tests c 164 | JOIN ( 165 | SELECT DISTINCT d.dog_guid 166 | FROM dogs d 167 | JOIN users u 168 | ON d.user_guid = u.user_guid 169 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 170 | AND (u.exclude = 0 OR u.exclude IS NULL) 171 | ) 172 | AS cleandogs 173 | ON c.dog_guid = cleandogs.dog_guid 174 | GROUP BY Day, Year 175 | ORDER BY Year ASC, count(c.created_at) DESC; 176 | 177 | 178 | 179 | -- Qn 10 180 | -- Expected: 21 rows (Sunday - 5860) 181 | 182 | SELECT 183 | CASE 184 | WHEN dayofweek(c.created_at)=1 THEN 'Sunday' 185 | WHEN dayofweek(c.created_at)=2 THEN 'Monday' 186 | WHEN dayofweek(c.created_at)=3 THEN 'Tuesday' 187 | WHEN dayofweek(c.created_at)=4 THEN 'Wednesday' 188 | WHEN dayofweek(c.created_at)=5 THEN 'Thursday' 189 | WHEN dayofweek(c.created_at)=6 THEN 'Friday' 190 | WHEN dayofweek(c.created_at)=7 THEN 'Saturday' 191 | END AS Day, 192 | YEAR(c.created_at) AS Year, 193 | COUNT(c.created_at) AS 'Number of Tests' 194 | FROM complete_tests c 195 | JOIN ( 196 | SELECT DISTINCT d.dog_guid, 197 
| u.country, 198 | u.state 199 | FROM dogs d 200 | JOIN users u 201 | ON d.user_guid = u.user_guid 202 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 203 | AND (u.exclude = 0 OR u.exclude IS NULL) 204 | ) 205 | AS cleandogs 206 | ON c.dog_guid = cleandogs.dog_guid 207 | WHERE cleandogs.country = 'US' AND cleandogs.state NOT IN ('HI', 'AK') 208 | GROUP BY Day, Year 209 | ORDER BY Year ASC, count(c.created_at) DESC; 210 | 211 | -- Qn 11 212 | -- Expected: 100 rows 213 | 214 | SELECT created_at, 215 | DATE_SUB(created_at, INTERVAL 6 HOUR) AS NewTime 216 | FROM complete_tests 217 | LIMIT 100; 218 | 219 | 220 | 221 | -- Qn 12 222 | -- Expected: 21 rows 223 | 224 | SELECT 225 | CASE 226 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=1 THEN 'Sunday' 227 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=2 THEN 'Monday' 228 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=3 THEN 'Tuesday' 229 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=4 THEN 'Wednesday' 230 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=5 THEN 'Thursday' 231 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=6 THEN 'Friday' 232 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=7 THEN 'Saturday' 233 | END AS Day, 234 | YEAR(c.created_at) AS Year, 235 | COUNT(c.created_at) AS 'Number of Tests' 236 | FROM complete_tests c 237 | JOIN ( 238 | SELECT DISTINCT d.dog_guid, 239 | u.country, 240 | u.state 241 | FROM dogs d 242 | JOIN users u 243 | ON d.user_guid = u.user_guid 244 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 245 | AND (u.exclude = 0 OR u.exclude IS NULL) 246 | ) 247 | AS cleandogs 248 | ON c.dog_guid = cleandogs.dog_guid 249 | WHERE cleandogs.country = 'US' AND cleandogs.state NOT IN ('HI', 'AK') 250 | GROUP BY Day, Year 251 | ORDER BY Year ASC, count(c.created_at) DESC; 252 | 253 | 254 | 255 | -- Qn 13 256 | -- Expected: 21 rows 257 | 258 | SELECT 259 | CASE 260 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=1 THEN 'Sunday' 
261 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=2 THEN 'Monday' 262 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=3 THEN 'Tuesday' 263 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=4 THEN 'Wednesday' 264 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=5 THEN 'Thursday' 265 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=6 THEN 'Friday' 266 | WHEN dayofweek(DATE_SUB(c.created_at, INTERVAL 6 HOUR))=7 THEN 'Saturday' 267 | END AS Day, 268 | YEAR(c.created_at) AS Year, 269 | COUNT(c.created_at) AS 'Number of Tests' 270 | FROM complete_tests c 271 | JOIN ( 272 | SELECT DISTINCT d.dog_guid, 273 | u.country, 274 | u.state 275 | FROM dogs d 276 | JOIN users u 277 | ON d.user_guid = u.user_guid 278 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 279 | AND (u.exclude = 0 OR u.exclude IS NULL) 280 | ) 281 | AS cleandogs 282 | ON c.dog_guid = cleandogs.dog_guid 283 | WHERE cleandogs.country = 'US' AND cleandogs.state NOT IN ('HI', 'AK') 284 | GROUP BY Day, Year 285 | ORDER BY Year ASC, FIELD(Day, 'Monday', 'Tuesday', 286 | 'Wednesday', 'Thursday', 'Friday', 287 | 'Saturday', 'Sunday'), count(c.created_at) DESC; 288 | 289 | 290 | 291 | -- Qn 14 292 | -- Expected: 5 rows 293 | 294 | SELECT 295 | clean.state AS 'State', 296 | COUNT(DISTINCT clean.user_guid) AS 'Number of Users' 297 | FROM complete_tests c 298 | JOIN ( 299 | SELECT DISTINCT d.dog_guid, 300 | d.user_guid, 301 | u.state, 302 | u.country 303 | FROM dogs d 304 | JOIN users u 305 | ON d.user_guid = u.user_guid 306 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 307 | AND (u.exclude = 0 OR u.exclude IS NULL) 308 | AND u.country = 'US' 309 | ) 310 | AS clean 311 | ON c.dog_guid = clean.dog_guid 312 | GROUP BY clean.state 313 | ORDER BY COUNT(DISTINCT clean.user_guid) DESC 314 | LIMIT 5; 315 | 316 | 317 | 318 | -- Qn 15 319 | -- Expected: 10 rows 320 | 321 | SELECT 322 | clean.country AS 'Country', 323 | clean.state AS 'State', 324 | COUNT(DISTINCT clean.user_guid) AS 'Number of Users' 325 | FROM complete_tests c 326 | JOIN ( 327 | SELECT
DISTINCT d.dog_guid, 326 | d.user_guid, 327 | u.state, 328 | u.country 329 | FROM dogs d 330 | JOIN users u 331 | ON d.user_guid = u.user_guid 332 | WHERE (d.exclude = 0 OR d.exclude IS NULL) 333 | AND (u.exclude = 0 OR u.exclude IS NULL) 334 | ) 335 | AS clean 336 | ON c.dog_guid = clean.dog_guid 337 | GROUP BY clean.country, clean.state 338 | ORDER BY COUNT(DISTINCT clean.user_guid) DESC 339 | LIMIT 10; 340 | 341 | -- END -- 342 | -------------------------------------------------------------------------------- /Week4Ex9-Subqueries.sql: -------------------------------------------------------------------------------- 1 | /* ANSWER KEY: Week 4 Exercise 9 - Subqueries & Derived Tables 2 | 3 | This is the COMPLETE answer key (including explanations) 4 | for Week 4 of the DUKE UNIVERSITY "Managing Big Data with MySQL" course. 5 | 6 | Date created: 17 March 2017 7 | */ 8 | 9 | -- BOX 1: LOAD SERVER 10 | 11 | %load_ext sql 12 | %sql mysql://studentuser:studentpw@mysqlserver/dognitiondb 13 | %sql USE dognitiondb 14 | 15 | 16 | 17 | -- BOX 2 18 | -- Expected: 1 row (9934) 19 | 20 | SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time)) 21 | FROM exam_answers 22 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) > 0 23 | AND test_name = 'Yawn Warm-Up'; 24 | 25 | 26 | 27 | 28 | -- BOX 3 29 | -- Expected: 11059 rows 30 | 31 | SELECT * 32 | FROM exam_answers 33 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) > 34 | ( 35 | SELECT AVG(TIMESTAMPDIFF(minute,start_time,end_time)) 36 | FROM exam_answers 37 | WHERE TIMESTAMPDIFF(minute,start_time,end_time) > 0 38 | AND test_name = 'Yawn Warm-Up' 39 | ); 40 | 41 | 42 | -- BOX 4 43 | -- Expected: 1 row (163022) 44 | 45 | SELECT count(*) 46 | FROM exam_answers 47 | WHERE subcategory_name IN ("Puzzles", "Numerosity", "Bark Game"); 48 | 49 | 50 | 51 | -- BOX 5 52 | -- Expected: 1 row (7961) 53 | 54 | SELECT count(distinct dog_guid) 55 | FROM dogs 56 | WHERE breed_group NOT IN ("Working", "Sporting", "Herding") 57 | 58 | 59 | 60 | -- BOX 6 61 | -- Expected: 2226 rows 62 | 63 | SELECT
DISTINCT u.user_guid 64 | FROM users u 65 | WHERE NOT EXISTS 66 | (SELECT d.user_guid 67 | FROM dogs d 68 | WHERE u.user_guid = d.user_guid); 69 | 70 | 71 | 72 | -- BOX 7 73 | -- Expected: 33193 rows 74 | 75 | SELECT 76 | clean.user_guid AS uUserID, 77 | d.user_guid AS dUserID, 78 | count(*) AS numrows 79 | FROM 80 | (SELECT DISTINCT u.user_guid 81 | FROM users u) 82 | AS clean 83 | LEFT JOIN dogs d 84 | ON clean.user_guid=d.user_guid 85 | GROUP BY clean.user_guid 86 | ORDER BY numrows DESC 87 | 88 | 89 | 90 | -- BOX 8 91 | -- Expected: Note the type of error message 92 | 93 | SELECT 94 | u.user_guid AS uUserID, 95 | d.user_guid AS dUserID, 96 | count(*) AS numrows 97 | 98 | FROM 99 | (SELECT DISTINCT u.user_guid 100 | FROM users u) 101 | AS DistinctUUsersID 102 | 103 | LEFT JOIN dogs d 104 | ON DistinctUUsersID.user_guid=d.user_guid 105 | 106 | GROUP BY DistinctUUsersID.user_guid 107 | ORDER BY numrows DESC 108 | 109 | 110 | 111 | -- BOX 9 QN 6 112 | -- Expected: 10254 rows 113 | 114 | SELECT distinct 115 | d.dog_guid, 116 | d.breed_group, 117 | u.state, 118 | u.zip 119 | FROM dogs d, users u 120 | WHERE d.user_guid = u.user_guid 121 | AND breed_group IN ('Working', 'Sporting', 'Herding') 122 | 123 | 124 | 125 | -- BOX 10 126 | -- Expected: 10254 rows 127 | 128 | SELECT distinct 129 | d.dog_guid, 130 | d.breed_group, 131 | u.state, 132 | u.zip 133 | FROM dogs d JOIN users u 134 | ON d.user_guid = u.user_guid 135 | WHERE breed_group IN ('Working', 'Sporting', 'Herding') 136 | 137 | 138 | 139 | -- BOX 11 Qn 8 140 | -- Expected: 2 rows 141 | 142 | SELECT d.user_guid 143 | FROM dogs d 144 | WHERE NOT EXISTS 145 | (SELECT DISTINCT u.user_guid 146 | FROM users u 147 | WHERE d.user_guid = u.user_guid); 148 | 149 | 150 | 151 | -- BOX 12 152 | -- Expected: 1 rows (1819) 153 | 154 | SELECT 155 | DistinctUUsersID.user_guid AS uUserID, 156 | d.user_guid AS dUserID, 157 | count(*) AS numrows 158 | FROM (SELECT DISTINCT u.user_guid 159 | FROM users u 160 | WHERE user_guid = 
'ce7b75bc-7144-11e5-ba71-058fbc01cf0b') 161 | AS DistinctUUsersID 162 | LEFT JOIN dogs d 163 | ON DistinctUUsersID.user_guid=d.user_guid 164 | GROUP BY DistinctUUsersID.user_guid 165 | ORDER BY numrows DESC; 166 | 167 | 168 | 169 | -- BOX 13 170 | -- Expected: 30968 rows 171 | 172 | SELECT DISTINCT d.user_guid 173 | FROM dogs d 174 | 175 | 176 | 177 | -- BOX 14 QN 11 178 | -- Expected: 1 row 179 | 180 | SELECT 181 | APPLES.user_guid AS uUserID, 182 | ORANGES.user_guid AS dUserID, 183 | count(*) AS numrows 184 | FROM 185 | (SELECT DISTINCT u.user_guid 186 | FROM users u 187 | WHERE user_guid = 'ce7b75bc-7144-11e5-ba71-058fbc01cf0b') 188 | AS APPLES 189 | LEFT JOIN 190 | (SELECT DISTINCT d.user_guid 191 | FROM dogs d) 192 | AS ORANGES 193 | ON APPLES.user_guid=ORANGES.user_guid 194 | GROUP BY APPLES.user_guid 195 | ORDER BY numrows DESC; 196 | 197 | 198 | 199 | -- BOX 15 QN 12 200 | -- Expected: 100 rows 201 | 202 | SELECT 203 | APPLES.user_guid AS uUserID, 204 | ORANGES.user_guid AS dUserID, 205 | count(*) AS numrows 206 | FROM 207 | (SELECT DISTINCT u.user_guid 208 | FROM users u 209 | LIMIT 100) 210 | AS APPLES 211 | LEFT JOIN 212 | (SELECT DISTINCT d.user_guid 213 | FROM dogs d) 214 | AS ORANGES 215 | ON APPLES.user_guid=ORANGES.user_guid 216 | GROUP BY APPLES.user_guid 217 | ORDER BY numrows DESC; 218 | 219 | 220 | 221 | -- BOX 16 QN 13 222 | -- Expected: 5 rows (shih tzu, 190, 1819) 223 | 224 | SELECT 225 | APPLES.user_guid AS uUserID, 226 | d.user_guid AS dUserID, 227 | d.breed, 228 | d.weight, 229 | count(*) AS numrows 230 | FROM 231 | (SELECT DISTINCT u.user_guid 232 | FROM users u) 233 | AS APPLES 234 | LEFT JOIN dogs d 235 | ON APPLES.user_guid=d.user_guid 236 | GROUP BY APPLES.user_guid 237 | HAVING numrows > 10 238 | ORDER BY numrows DESC; 239 | -------------------------------------------------------------------------------- /Week5 - Dillards.md: -------------------------------------------------------------------------------- 1 | # Final Week -
Dillard's Database Exercises 2 | 3 | Date created: 25 April 2017 4 | 5 | Last updated: 9 Sept 2017 6 | 7 | This is the COMPLETE answer key (including explanations where necessary) 8 | for Week 5 (final week) of the ["Managing Big Data with MySQL"](https://www.coursera.org/learn/analytics-mysql/home/week/5) 9 | course by Duke University. 10 | 11 | I wrote this answer key as no official answers have been released online. 12 | These answers reflect my own work and are accurate to the best of my knowledge. 13 | I will update them if the professors ever release an "official" answer key. 14 | 15 | **These answers will come in handy during the final exam for the course, which 16 | requires one to make similar queries.** 17 | 18 | Update: These answers are based on the older version of the ``UA_Dillards`` dataset 19 | (not ``UA_Dillards1``, nor ``UA_Dillards_2016``). For example, this means I am using 20 | the table ``SKSTINFO`` and not ``SKSTINFO_FIX``, which is the newer version. 21 | 22 | With that, let's start. 23 | 24 | # Answers 25 | 26 | To start, enter ``DATABASE ua_dillards;`` into the Teradata SQL scratchpad.
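Coming from the MySQL weeks, two Teradata syntax differences are worth keeping in mind: row limits use ``TOP`` rather than ``LIMIT``, and date parts are pulled out with ``EXTRACT`` instead of ``MONTH()``/``YEAR()``. A quick sanity check on the scratchpad (table and column names as used in the exercises below):

```sql
SELECT TOP 5
    saledate,
    EXTRACT(month FROM saledate) AS month_num,
    EXTRACT(year FROM saledate) AS year_num
FROM trnsact;
```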
27 | 28 | ### Question 1 29 | 30 | **How many distinct dates are there in the saledate column of the transaction 31 | table for each month/year combination in the database?** 32 | 33 | ```sql 34 | SELECT 35 | EXTRACT (month FROM saledate) AS month_num, 36 | EXTRACT (year FROM saledate) AS year_num, 37 | COUNT (DISTINCT EXTRACT (day FROM saledate)) AS days_in_month, 38 | COUNT (EXTRACT (day FROM saledate)) AS num_transactions -- I'm curious abt num transactions per mth 39 | FROM trnsact 40 | GROUP BY month_num, year_num 41 | ORDER BY year_num, month_num 42 | ``` 43 | 44 | Result 45 | 46 | | MONTH_NUM | YEAR_NUM | DAYS_IN_MONTH | NUM_TRANSACTIONS | 47 | | -- | -- | -- | -- | 48 | | 8 | 2004 | 31 | 8292953 49 | | 9 | 2004 | 30 | 8967415 50 | | 10 | 2004 | 31 | 8412131 51 | | 11 | 2004 | 29 | 7047319 52 | | 12 | 2004 | 30 | 13383892 53 | | 1 | 2005 | 31 | 8952311 54 | | 2 | 2005 | 28 | 11352221 55 | | 3 | 2005 | 30 | 8940444 56 | | 4 | 2005 | 30 | 9082523 57 | | 5 | 2005 | 31 | 7715779 58 | | 6 | 2005 | 30 | 7922997 59 | | 7 | 2005 | 31 | 11122770 60 | | 8 | 2005 | 27 | 9724141 61 | 62 | There appears to be an incomplete record for August 2005 63 | (it has only 27 days of data). 64 | 65 | *As the homework instructs, I will restrict all further analysis of August sales 66 | to those recorded in 2004, not 2005.* 67 | 68 | Next, it appears that Dillard's excludes designated holidays from its sales calendar. 69 | None of the stores have data for ``25 November`` (Thanksgiving), 70 | ``25 December`` (Christmas), or ``27 March`` (Easter Sunday in 2005).
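Those gaps can be double-checked directly. The query below (assuming the same ``trnsact``/``saledate`` columns used above) should return no rows if the three holiday dates are truly absent:

```sql
SELECT DISTINCT saledate
FROM trnsact
WHERE (EXTRACT(month FROM saledate) = 11 AND EXTRACT(day FROM saledate) = 25)
   OR (EXTRACT(month FROM saledate) = 12 AND EXTRACT(day FROM saledate) = 25)
   OR (EXTRACT(month FROM saledate) = 3  AND EXTRACT(day FROM saledate) = 27);
```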
71 | 72 | ### Question 2 73 | 74 | **Use a CASE statement within an aggregate function to determine which sku 75 | had the greatest total sales during the combined summer months of June, July, 76 | and August.** 77 | 78 | ```sql 79 | SELECT DISTINCT sku, 80 | SUM (CASE WHEN EXTRACT(month FROM saledate)=6 AND stype='p' THEN amt END) AS rev_june, 81 | SUM (CASE WHEN EXTRACT(month FROM saledate)=7 AND stype='p' THEN amt END) AS rev_july, 82 | SUM (CASE WHEN EXTRACT(month FROM saledate)=8 AND stype='p' THEN amt END) AS rev_aug, -- ! 83 | (rev_aug + rev_june + rev_july) AS rev_total_summer 84 | FROM trnsact 85 | GROUP BY sku 86 | HAVING rev_total_summer > 0 -- exclude null values 87 | ORDER BY rev_total_summer DESC 88 | ``` 89 | 90 | There is a problem with this question statement. It suggests that: 91 | 92 | > *'If your query is correct, you should find that sku #2783996 has the fifth greatest total sales 93 | during the combined months of June, July, and August, with a total summer sales sum of 94 | $897,807.01.'* 95 | 96 | However, you will only get this value if you include values from **both** August 2004 97 | and August 2005, which **Question 1 explicitly tells us not to do**.
98 | 99 | A more sensible answer, one that includes *only one copy of each month per year*, would be: 100 | 101 | ```sql 102 | SELECT DISTINCT sku, 103 | SUM (CASE WHEN EXTRACT(month FROM saledate)=6 AND stype='p' THEN amt END) AS rev_june, 104 | SUM (CASE WHEN EXTRACT(month FROM saledate)=7 AND stype='p' THEN amt END) AS rev_july, 105 | SUM (CASE WHEN EXTRACT(month FROM saledate)=8 AND stype='p' 106 | AND EXTRACT(year FROM saledate)=2004 -- new line: keep only August 2004 107 | THEN amt END) AS rev_aug, 108 | (rev_aug + rev_june + rev_july) AS rev_total_summer 109 | FROM trnsact 110 | GROUP BY sku 111 | HAVING rev_total_summer > 0 -- exclude null values 112 | ORDER BY rev_total_summer DESC 113 | ``` 114 | 115 | This gives the answer: 116 | 117 | | SKU ITEM CODE | REV_JUNE 2005 | REV_JULY 2005 | REV_AUG 2004 | REV_TOTAL_SUMMER 118 | | -- | -- | -- | -- | -- | 119 | | 4108011 | 309511.88 | 379326.00 | 499821.00 | 1,188,658.88 120 | | 3524026 | 269934.50 | 344833.00 | 458227.50 | 1,072,995.00 121 | | 5528349 | 339349.00 | 325156.50 | 337221.00 | 1,001,726.50 122 | | 3978011 | 197885.37 | 259279.60 | 308910.00 | 766,074.97 123 | | **2783996** | 190252.01 | 197414.50 | 313736.50 | **701,403.01** 124 | 125 | Additional background information on the most popular summer items: *(because I'm 126 | curious lol)* 127 | 128 | ```sql 129 | SELECT * 130 | FROM SKUINFO 131 | WHERE sku IN (4108011, 3524026, 5528349, 3978011, 2783996) 132 | ``` 133 | 134 | | SKU CODE | COLOUR | SIZE | PACKSIZE | BRAND | 135 | | -- | -- | -- | -- | -- | 136 | | 4108011 | DDML | DDML 4OZ | 6 | CLINIQUE 137 | | 3524026 | DDML | PUMP 4.2 OZ | 6 | CLINIQUE 138 | | 5528349 | 01-BLACK | 01-BLACK | 3 | LANCOME 139 | | 3978011 | CLARIFY #2 | 13.5 OZ | 3 | CLINIQUE 140 | | 2783996 | 01-BLACK | NO SIZE | 3 | LANCOME 141 | 142 | ### Question 3. 143 | 144 | **How many distinct dates are there in the saledate column of the transaction 145 | table for each month/year/store combination in the database?
Sort your results by the 146 | number of days per combination in ascending order.** 147 | 148 | ```sql 149 | SELECT 150 | EXTRACT (month FROM saledate) AS month_num, 151 | EXTRACT (year FROM saledate) AS year_num, 152 | store, 153 | COUNT (DISTINCT saledate) AS num_dates 154 | FROM trnsact 155 | GROUP BY month_num, year_num, store 156 | ORDER BY num_dates asc 157 | ``` 158 | Some stores appear to have missing or removed data (i.e., fewer than 30 days of data in a month). 159 | 160 | | MONTH | YEAR | STORE ID | NUM_DATES | 161 | | ----- | ---- | -------- | --------- | 162 | | 7 | 2005 | 7604 | 1 163 | | 3 | 2005 | 8304 | 1 164 | | 9 | 2004 | 4402 | 1 165 | | 8 | 2004 | 9906 | 1 166 | | 8 | 2004 | 8304 | 1 167 | | 8 | 2004 | 7203 | 3 168 | | 3 | 2005 | 6402 | 11 169 | 170 | We will note the missing data for future calculations. Where possible, we 171 | will aim to exclude months that do not meet our criteria when doing trend analysis. 172 | 173 | ### Question 4a. 174 | 175 | **What is the average daily revenue for each store/month/year combination in 176 | the database? Calculate this by dividing the total revenue for a group by the number of 177 | sales days available in the transaction table for that group.** 178 | 179 | We can solve this by modifying the solution from Qn 3 to include revenue data. 180 | 181 | ```sql 182 | SELECT 183 | store, 184 | EXTRACT (month FROM saledate) AS month_num, 185 | EXTRACT (year FROM saledate) AS year_num, 186 | COUNT (DISTINCT saledate) AS num_dates, 187 | SUM(amt) AS total_revenue, 188 | total_revenue/num_dates AS daily_revenue 189 | FROM trnsact 190 | WHERE stype='p' 191 | GROUP BY store, month_num, year_num 192 | ORDER BY daily_revenue desc 193 | ``` 194 | > Dr Jana: If your query is correct, you should find that store #204 has an average daily revenue of 195 | $16,303.65 in August of 2005.
196 | 197 | ```sql 198 | -- Modified to check results 199 | SELECT 200 | store, 201 | EXTRACT (month FROM saledate) AS month_num, 202 | EXTRACT (year FROM saledate) AS year_num, 203 | COUNT (DISTINCT saledate) AS num_dates, 204 | SUM(amt) AS total_revenue, 205 | total_revenue/num_dates AS daily_revenue 206 | FROM trnsact 207 | WHERE stype='p' AND store=204 -- ! 208 | GROUP BY store, month_num, year_num 209 | ORDER BY year_num desc, month_num desc -- ! 210 | ``` 211 | 212 | Results 213 | 214 | | STORE | MONTH_NUM | YEAR_NUM | NUM_DATES | TOTAL_REVENUE | DAILY_REVENUE | 215 | | ----- | --------- | -------- | --------- | ------------- | ------------- | 216 | | 204 | 12 | 2004 | 30 | 651309.29 | 21710.31 | 217 | | 204 | 7 | 2005 | 31 | 520512.72 | 16790.73 | 218 | | 204 | 4 | 2005 | 30 | 503312.54 | 16777.08 | 219 | | 204 | 8 | 2005 | 27 | 440198.68 | 16303.65 | 220 | 221 | Awesome! 222 | 223 | *For all of the exercises that follow, unless otherwise specified, we will assess sales by summing 224 | the total revenue for a given time period, and dividing by the total number of days that 225 | contributed to that time period. This will give us “average daily revenue”.* 226 | 227 | ### Question 4b. 228 | 229 | **Modify the query you wrote above to assess the average daily revenue for each store/month/year 230 | combination with a clause that removes all the data from August, 2005. 
Then, given the data we 231 | have available in our data set, I propose that we only examine store/month/year combinations that 232 | have at least 20 days of data within that month.** 233 | 234 | ```sql 235 | SELECT 236 | sub.store, 237 | sub.year_num, 238 | sub.month_num, 239 | sub.num_dates, 240 | sub.daily_revenue 241 | FROM ( 242 | SELECT 243 | store, 244 | EXTRACT (month FROM saledate) AS month_num, 245 | EXTRACT (year FROM saledate) AS year_num, 246 | COUNT (DISTINCT saledate) AS num_dates, 247 | SUM(amt) AS total_revenue, 248 | total_revenue/num_dates AS daily_revenue, 249 | (CASE 250 | WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 251 | END) As can_use_anot 252 | FROM trnsact 253 | WHERE stype='p' AND can_use_anot='can' 254 | GROUP BY store, month_num, year_num 255 | ) AS sub 256 | GROUP BY sub.store, sub.year_num, sub.month_num, sub.num_dates, sub.daily_revenue 257 | HAVING sub.num_dates >= 20 258 | ORDER BY sub.num_dates ASC; 259 | ``` 260 | 261 | > DR JANA: Save your final queries that remove “bad data” for use in subsequent exercises. From 262 | now on (and in the graded quiz), when I ask for average daily revenue: (1) Only examine purchases 263 | (not returns). (2) Exclude all stores with less than 20 days of data. (3) Exclude all data from 264 | August, 2005. 265 | 266 | ### Question 5. 267 | 268 | **What is the average daily revenue brought in by Dillard’s stores in areas of high, medium, or 269 | low levels of high school education? Define areas of “low” education as those that have high 270 | school graduation rates between 50-60%, areas of “medium” education as those that have high 271 | school graduation rates between 60.01-70%, and areas of “high” education as those that have 272 | high school graduation rates of above 70%.** 273 | 274 | I'll start by counting the number of stores within each education level first.
```sql
SELECT
    (CASE
        WHEN msa_high>=50 AND msa_high<=60 THEN 'low'
        WHEN msa_high>60 AND msa_high<=70 THEN 'med'
        WHEN msa_high>70 THEN 'high'
    END) AS education_levels,
    COUNT (DISTINCT store) AS num_stores
FROM store_msa
GROUP BY education_levels
```

Unfortunately it's not a nice distribution:

| EDUCATION_LEVEL | NUM_STORES |
| --------------- | ---------- |
| LOW (50-60%) | 324 |
| MED (60-70%) | 5 |
| HIGH (>70%) | 4 |

The distribution would look healthier if we could redraw the cutoffs at 70% and 80% instead, say something like:

| EDUCATION_LEVEL | NUM_STORES |
| --------------- | ---------- |
| LOW (50-70%) | 213 |
| MED (70-80%) | 111 |
| HIGH (>=80%) | 9 |

But that's not what the question asked, so I'll leave it aside for now.
Back to the question, let's merge them:

```sql
SELECT
    (CASE
        WHEN s.msa_high >= 50 AND s.msa_high < 60 THEN 'low'
        WHEN s.msa_high >= 60 AND s.msa_high < 70 THEN 'medium'
        WHEN s.msa_high >= 70 THEN 'high'
    END) AS education_levels,
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM store_msa s
JOIN (
    SELECT
        store,
        EXTRACT (year FROM saledate) AS year_num,
        EXTRACT (month FROM saledate) AS month_num,
        SUM(amt) AS total_revenue,
        COUNT (DISTINCT saledate) AS num_dates,
        (CASE
            WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
        END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
    GROUP BY year_num, month_num, store
    HAVING num_dates >= 20 -- moving this back to within the subquery
) AS sub
ON s.store = sub.store
GROUP BY education_levels;
```

I'm not sure why the ``HAVING num_dates >= 20`` filter only works when inside the subquery but not when requested from the outer query.
It worked fine in the previous question because ``num_dates`` was part of the outer query's ``GROUP BY`` there. Here the outer query only groups by ``education_levels``, so ``sub.num_dates`` is neither a grouping column nor an aggregate at that level, and ``HAVING`` can't reference it. To filter in the outer query it would have to be ``WHERE sub.num_dates >= 20``; inside the subquery, ``num_dates`` is an aggregate, which is exactly what ``HAVING`` expects.

> DR JANA: If you have executed this query correctly, you will find that the average daily revenue brought in by Dillard’s stores in the low education group is a little more than $34,000, the average daily revenue brought in by Dillard’s stores in the medium education group is a little more than $25,000, and the average daily revenue brought in by Dillard’s stores in the high education group is just under $21,000.

| EDUCATION_LEVEL | AVG_DAILY_REVENUE |
| --------------- | ----------------- |
| low | 34,159.76 |
| medium | 27,112.67 |
| high | 20,921.32 |

Hooray! Moving forward...

*Whenever I ask you to calculate the average daily revenue for a group of stores in either these exercises or the quiz, do so by summing together all the revenue from all the entries in that group, and then dividing that summed total by the total number of sale days that contributed to the total. Do not compute averages of averages.*

### Question 6.

**Compare the average daily revenues of the stores with the highest median msa_income and the lowest median msa_income. In what city and state were these stores, and which store had a higher average daily revenue?
Use ``msa_income`` to calculate.**

```sql
SELECT
    s.city,
    s.state,
    s.msa_income,
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM store_msa s
JOIN (
    SELECT
        store,
        EXTRACT (year FROM saledate) AS year_num,
        EXTRACT (month FROM saledate) AS month_num,
        SUM(amt) AS total_revenue,
        COUNT(DISTINCT saledate) AS num_dates,
        (CASE
            WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
        END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
    GROUP BY year_num, month_num, store
    HAVING num_dates >= 20
) AS sub
ON s.store = sub.store
WHERE s.msa_income IN (
    (SELECT MAX(msa_income) FROM store_msa),
    (SELECT MIN(msa_income) FROM store_msa))
GROUP BY s.city, s.state, s.msa_income; -- every non-aggregate in the SELECT has to be grouped
```

Overall pretty similar to Qn 5.

| CITY | STATE | AVG_DAILY_REVENUE |
| ---- | ----- | ----------------- |
| SPANISH FORT | AL | 17884.08 |
| MCALLEN | TX | 56601.99 |

### Exercise 7:

**What is the brand of the sku with the greatest standard deviation in sprice? Only examine skus that have been part of over 100 transactions.**

```sql
SELECT
    t.sku AS item, -- DISTINCT would be redundant here; GROUP BY already deduplicates
    s.brand AS brand,
    s.style,
    s.color,
    s.size,
    STDDEV_SAMP(t.sprice) AS dev_price,
    COUNT(DISTINCT(t.SEQ||t.STORE||t.REGISTER||t.TRANNUM||t.SALEDATE)) AS distinct_transactions
FROM TRNSACT t
JOIN SKUINFO s
ON t.sku=s.sku
WHERE t.stype='p'
GROUP BY item, brand, s.style, s.color, s.size
HAVING distinct_transactions>100
ORDER BY dev_price DESC
```

I'm not sure which combination of these columns uniquely identifies a transaction, so I concatenated them all: ``SEQ``, ``STORE``, ``REGISTER``, ``TRANNUM``, ``SALEDATE``.
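One caveat on the bare ``||`` concatenation (my own aside, not from the course): two different column combinations can collide into the same string and deflate a ``COUNT(DISTINCT ...)``. A toy Python illustration:

```python
# Two different (register, trannum) pairs...
a = ('12', '3')
b = ('1', '23')

# ...collide when concatenated bare, the way SEQ||STORE||... does,
print(a[0] + a[1] == b[0] + b[1])   # → True ('123' both times)

# but stay distinct with a separator that can't appear in the values.
print('|'.join(a) == '|'.join(b))   # → False ('12|3' vs '1|23')
```

The Dillard's ID columns are probably fixed-width, so the bare concatenation is likely safe here, but inserting a separator character is the defensive habit.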
| ITEM | BRAND | STYLE | COLOR | SIZE | DEV_PRICE |
| ---- | ----- | ----- | ----- | ---- | --------- |
| 2762683 | HART SCH | 403154133510 | BLACK | 42REG | 175.8106 |
| 5453849 | POLO FAS | 9HA 726680 | FA02 | L | 169.4284 |
| 5623849 | POLO FAS | 9HA 726680 | FA02 | M | 164.4187 |

### Exercise 8:

**Examine all the transactions for the sku with the greatest standard deviation in sprice, but only consider skus that are part of more than 100 transactions. Do you think the retail price was set too high, or just right?**

```sql
SELECT
    s.sku AS items,
    s.brand,
    AVG(t.sprice) AS avg_price,
    STDDEV_SAMP(t.sprice) AS variation_price,
    AVG(t.orgprice)-AVG(t.sprice) AS sale_price_diff,
    COUNT(DISTINCT t.trannum) AS distinct_transactions -- note: Exercise 7 used a full composite key; trannum alone may repeat across stores
FROM skuinfo s
JOIN trnsact t
ON s.sku=t.sku
WHERE t.stype='p'
GROUP BY items, s.brand
HAVING distinct_transactions > 100
ORDER BY variation_price DESC;
```

Not a perfect analysis, but notice that the items with the highest price variation (``variation_price``) are not quite the ones with the greatest gap between original and sale price (``sale_price_diff``). This suggests that some stores simply price items higher or lower across the board, rather than offering massively discounted sale prices (vs original prices) to clear stock; that may just reflect the ``msa_income`` differences around each store.

So... Was the retail price just right? Can't say for sure, but it's definitely not too high.
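For reference, ``STDDEV_SAMP`` is the *sample* standard deviation (n−1 denominator), and the per-SKU statistics in these two exercises can be mimicked on a toy price list with Python's ``statistics`` module (the prices here are invented, not real Dillard's numbers):

```python
import statistics

# Hypothetical sale prices and a single original price for one SKU
sprices = [40.0, 50.0, 60.0, 70.0]
orgprice = 65.0

avg_price = statistics.mean(sprices)       # like AVG(sprice)
dev_price = statistics.stdev(sprices)      # like STDDEV_SAMP(sprice): n-1 denominator
sale_price_diff = orgprice - avg_price     # like AVG(orgprice) - AVG(sprice)

print(avg_price, round(dev_price, 4), sale_price_diff)
# → 55.0 12.9099 10.0
```

A wide ``dev_price`` with a small ``sale_price_diff``, as in this toy SKU, is exactly the "priced differently across stores rather than discounted" pattern discussed above.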
460 | 461 | ### Exercise 9 462 | 463 | **What was the average daily revenue Dillard’s brought in during each month of 464 | the year?** 465 | 466 | ```sql 467 | SELECT 468 | (CASE 469 | WHEN sub.month_num=1 THEN 'Jan' 470 | WHEN sub.month_num=2 THEN 'Feb' 471 | WHEN sub.month_num=3 THEN 'Mar' 472 | WHEN sub.month_num=4 THEN 'Apr' 473 | WHEN sub.month_num=5 THEN 'May' 474 | WHEN sub.month_num=6 THEN 'Jun' 475 | WHEN sub.month_num=7 THEN 'Jul' 476 | WHEN sub.month_num=8 THEN 'Aug' 477 | WHEN sub.month_num=9 THEN 'Sep' 478 | WHEN sub.month_num=10 THEN 'Oct' 479 | WHEN sub.month_num=11 THEN 'Nov' 480 | WHEN sub.month_num=12 THEN 'Dec' 481 | END) as month_name, 482 | SUM(num_dates) AS num_days_in_month, 483 | SUM(total_revenue)/SUM(num_dates) AS avg_monthly_revenue 484 | FROM ( 485 | SELECT 486 | EXTRACT (month FROM saledate) AS month_num, 487 | EXTRACT (year FROM saledate) AS year_num, 488 | COUNT (DISTINCT saledate) AS num_dates, 489 | SUM(amt) AS total_revenue, 490 | (CASE 491 | WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 492 | END) As can_use_anot 493 | FROM trnsact 494 | WHERE stype='p' AND can_use_anot='can' 495 | GROUP BY month_num, year_num 496 | HAVING num_dates>=20 497 | ) AS sub 498 | GROUP BY month_name 499 | ORDER BY avg_monthly_revenue DESC; 500 | ``` 501 | 502 | | MONTH_NUM | DAYS_IN_MONTH | AVG_MONTHLY_REVENUE | 503 | | --------- | ------------- | ------------------- | 504 | | Dec | 30 | 11333356.01 505 | | Feb | 28 | 7363752.69 506 | | Jul | 31 | 7271088.69 507 | | Apr | 30 | 6949616.95 508 | | Mar | 30 | 6736315.39 509 | | May | 31 | 6666962.59 510 | | Jun | 30 | 6524845.42 511 | | Nov | 29 | 6296913.50 512 | | Oct | 31 | 6106357.90 513 | | Jan | 31 | 5836833.31 514 | | Aug | 31 | 5616841.37 515 | | Sep | 30 | 5596588.02 516 | 517 | > DR JANA: you should find that December consistently has the best sales, September consistently 518 | has the worst or close to the worst sales, and July has very good sales, although less than December. 
### Question 10

**Which department, in which city and state of what store, had the greatest percentage increase in average daily sales revenue from November to December? Note: Use percentage change.**

Hints from the notes:

1. Need to join 4 tables
1. Use two CASE statements within an aggregate function to sum all the revenue for November and December, separately
1. Use two CASE statements within an aggregate function to count the number of sale days that contributed to the revenue in November and December, separately
1. Use these 4 fields to calculate the ``average daily revenue`` for November and December. You can then calculate the change in these values using the % change formula: ``((X-Y)/Y)*100``
1. Don’t forget to exclude “bad data” and to exclude ``return`` transactions.

First I'll find just the percentage increase in revenue from November to December for each ``store``, and join the extra details like ``dept`` later.
537 | 538 | ```sql 539 | SELECT 540 | sub.store, 541 | SUM(CASE WHEN sub.month_num=11 THEN sub.amt END) AS Nov_revenue, 542 | SUM(CASE WHEN sub.month_num=12 THEN sub.amt END) AS Dec_revenue, 543 | COUNT(DISTINCT CASE WHEN sub.month_num=11 THEN sub.saledate END) AS Nov_days, 544 | COUNT(DISTINCT CASE WHEN sub.month_num=12 THEN sub.saledate END) AS Dec_days, 545 | Nov_revenue/Nov_days AS Nov_daily_rev, 546 | Dec_revenue/Dec_days AS Dec_daily_rev, 547 | ((Dec_daily_rev-Nov_daily_rev)/Nov_daily_rev)*100 AS percent_increase 548 | FROM ( 549 | SELECT 550 | store, 551 | amt, 552 | saledate, 553 | EXTRACT (month FROM saledate) AS month_num, 554 | EXTRACT (year FROM saledate) AS year_num, 555 | (CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) As can_use_anot 556 | FROM trnsact 557 | WHERE stype='p' AND can_use_anot='can' 558 | ) AS sub 559 | GROUP BY sub.store 560 | HAVING Nov_days>=20 AND Dec_days>=20 561 | ORDER BY percent_increase DESC; 562 | ``` 563 | 564 | | STORE | NOV_REV | DEC_REV | NOV_DAYS | DEC_DAYS | NOV_DAILY_REV | DEC_DAILY_REV | PERCENT_INC | 565 | | ----- | ------- | ------- | -------- | -------- | ------------- | ------------- | ----------- | 566 | | 3809 | 210139.08 | 486314.01 | 29 | 30 | 7246.18 | 16210.47 | 124.00 567 | | 303 | 175003.74 | 399975.83 | 29 | 30 | 6034.61 | 13332.53 | 121.00 568 | | 7003 | 169776.27 | 380024.73 | 29 | 30 | 5854.35 | 12667.49 | 116.00 569 | 570 | Seems okay. Let's add the others in. 
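A quick sanity check of the percent-change arithmetic, using store 3809's row from the table above:

```python
# Store 3809's figures from the results table
nov_revenue, nov_days = 210139.08, 29
dec_revenue, dec_days = 486314.01, 30

nov_daily = nov_revenue / nov_days   # ≈ 7246.18
dec_daily = dec_revenue / dec_days   # ≈ 16210.47

percent_increase = (dec_daily - nov_daily) / nov_daily * 100
print(round(percent_increase))       # → 124, matching the (rounded) PERCENT_INC column
```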
```sql
SELECT -- outer query keeps only the columns we need from the inner query
    clean.store,
    clean.dept,
    clean.deptdesc,
    clean.city,
    clean.state,
    clean.percent_increase
FROM (
    SELECT
        sub.store,
        d.dept,
        d.deptdesc,
        str.city,
        str.state,
        SUM(CASE WHEN sub.month_num=11 THEN sub.amt END) AS Nov_revenue,
        SUM(CASE WHEN sub.month_num=12 THEN sub.amt END) AS Dec_revenue,
        COUNT(DISTINCT CASE WHEN sub.month_num=11 THEN sub.saledate END) AS Nov_days,
        COUNT(DISTINCT CASE WHEN sub.month_num=12 THEN sub.saledate END) AS Dec_days,
        Nov_revenue/Nov_days AS Nov_daily_rev,
        Dec_revenue/Dec_days AS Dec_daily_rev,
        ((Dec_daily_rev-Nov_daily_rev)/Nov_daily_rev)*100 AS percent_increase
    FROM (
        SELECT
            sku.dept, -- NEW: include this here because we need to group by department at the most granular level
            t.store,
            t.amt,
            t.saledate,
            EXTRACT (month FROM t.saledate) AS month_num,
            EXTRACT (year FROM t.saledate) AS year_num,
            (CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) AS can_use_anot
        FROM trnsact t
        INNER JOIN skuinfo sku
        ON t.sku=sku.sku
        WHERE stype='p' AND can_use_anot='can' -- only query purchases, from legal dates
    ) AS sub
    INNER JOIN strinfo str
    ON str.store = sub.store -- to select city and state
    INNER JOIN deptinfo d
    ON d.dept = sub.dept -- to select department description
    GROUP BY sub.store, d.dept, d.deptdesc, str.city, str.state
    HAVING Nov_days>=20 AND Dec_days>=20
) AS clean
GROUP BY 1,2,3,4,5,6
ORDER BY clean.percent_increase DESC
```

| STORE | DEPT | DEPT_DESC | CITY | STATE | PERCENTAGE_INCREASE |
| ----- | ---- | --------- | ---- | ----- | ------------------- |
| 3403 | 7205 | LOUIS VL | SALINA | KS | 596.00 |
| 9806 | 6402 | FREDERI | MABELVALE | AR | 476.00 |
| 404 | 2107 | MAI | PINE BLUFF | AR | 442.00 |

### Question 11

**What is the city and state of the store that had the greatest decrease in average daily revenue from August to September?**

This is easy; just adapt the query from Qn 10 and remove the unnecessary tables.

```sql
SELECT
    sub.store,
    str.city, -- join the strinfo table for these two
    str.state,
    SUM(CASE WHEN sub.month_num=8 THEN sub.amt END) AS Aug_revenue,
    SUM(CASE WHEN sub.month_num=9 THEN sub.amt END) AS Sep_revenue,
    COUNT(DISTINCT CASE WHEN sub.month_num=8 THEN sub.saledate END) AS Aug_days,
    COUNT(DISTINCT CASE WHEN sub.month_num=9 THEN sub.saledate END) AS Sep_days,
    Aug_revenue/Aug_days AS Aug_daily_rev,
    Sep_revenue/Sep_days AS Sep_daily_rev,
    (Sep_daily_rev-Aug_daily_rev) AS rev_difference
FROM ( -- clean inner query for legal dates and purchases only
    SELECT
        store,
        amt,
        saledate,
        EXTRACT (month FROM saledate) AS month_num,
        EXTRACT (year FROM saledate) AS year_num,
        (CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
) AS sub
INNER JOIN strinfo str -- to extract the store's city and state
ON str.store = sub.store
GROUP BY sub.store, str.city, str.state
HAVING Aug_days>=20 AND Sep_days>=20 -- only keep stores with at least 20 sale days per month
ORDER BY rev_difference ASC
```

| STORE | CITY | STATE | REV_DIFFERENCE |
| ----- | ---- | ----- | -------------- |
| 4003 | WEST DES MOINES | IA | -6479.60 |
| 9103 | LOUISVILLE | KY | -5233.12 |
| 2707 | MCALLEN | TX | -5109.47 |

### Question 12

**Determine the month of maximum total revenue for each store.
Count the number of stores whose month of maximum total revenue was in each of the twelve months.**

**Then determine the month of maximum average daily revenue. Count the number of stores whose month of maximum average daily revenue was in each of the twelve months. How do they compare?**

I'm guessing the assignment wants us to see which month the most stores hit their maximum total revenue in, and likewise which month the most stores hit their maximum average daily revenue in.

If the two counts don't match, it might point to hidden trends, outliers, or missing data within the set.

Things to do:

1. Calculate the average daily revenue for each store, for each month (for each year, but there will only be one year associated with each month)
1. Order the rows within a store according to average daily revenue from high to low
1. Assign a rank to each of the ordered rows
1. Retrieve all of the rows that have the rank you want
1. Count all of your retrieved rows

> DR JANA: You can assign ranks using the ``ROW_NUMBER`` or ``RANK()`` function. Make sure you “partition” by store in your ``ROW_NUMBER`` clause. Lastly when you have confirmed that the output is reasonable, introduce a ``QUALIFY`` clause (described in the references above) into your query in order to restrict the output to rows that represent the month with the minimum average daily revenue for each store.

Starting with tasks (1) and (2), I'll calculate the average daily revenue for each ``store``, by ``month``. We can do this by recycling the query from Qn 9.
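As a warm-up, the rank-and-count idea from the to-do list can be sketched in plain Python (the store/month/revenue rows are invented toy data): partition by store, keep each store's best month — which is what Teradata's ``ROW_NUMBER() OVER (PARTITION BY store ...)`` plus ``QUALIFY row_num = 1`` does — then count peaks per month.

```python
from collections import Counter

# (store, month, avg_daily_revenue) -- toy numbers
rows = [
    (1, 'Nov', 9000.0), (1, 'Dec', 15000.0), (1, 'Jul', 11000.0),
    (2, 'Nov', 8000.0), (2, 'Dec', 14000.0),
    (3, 'Jul', 9500.0), (3, 'Dec', 9000.0),
]

best_month = {}  # per-store "rank 1" row, like QUALIFY row_num = 1
for store, month, rev in rows:
    if store not in best_month or rev > best_month[store][1]:
        best_month[store] = (month, rev)

# Count how many stores peak in each month, like the outer COUNT/GROUP BY
print(Counter(m for m, _ in best_month.values()))
# → Counter({'Dec': 2, 'Jul': 1})
```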
```sql
SELECT
    (CASE
        WHEN sub.month_num=1 THEN 'Jan'
        WHEN sub.month_num=2 THEN 'Feb'
        WHEN sub.month_num=3 THEN 'Mar'
        WHEN sub.month_num=4 THEN 'Apr'
        WHEN sub.month_num=5 THEN 'May'
        WHEN sub.month_num=6 THEN 'Jun'
        WHEN sub.month_num=7 THEN 'Jul'
        WHEN sub.month_num=8 THEN 'Aug'
        WHEN sub.month_num=9 THEN 'Sep'
        WHEN sub.month_num=10 THEN 'Oct'
        WHEN sub.month_num=11 THEN 'Nov'
        WHEN sub.month_num=12 THEN 'Dec'
    END) AS month_name,
    sub.store,
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM (
    SELECT
        store,
        EXTRACT (month FROM saledate) AS month_num,
        EXTRACT (year FROM saledate) AS year_num,
        COUNT (DISTINCT saledate) AS num_dates,
        SUM(amt) AS total_revenue,
        (CASE
            WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
        END) AS can_use_anot
    FROM trnsact
    WHERE stype='p' AND can_use_anot='can'
    GROUP BY month_num, year_num, store -- store must be grouped too, since it is selected
    HAVING num_dates>=20
) AS sub
GROUP BY month_name, sub.store
ORDER BY avg_daily_revenue DESC;
```

(3) Let's add the bit for ``ROW_NUMBER()`` and ``PARTITION``: (a snippet)

```sql
SELECT
    (CASE
        WHEN sub.month_num=1 THEN 'Jan'
        ...
        WHEN sub.month_num=12 THEN 'Dec'
    END) AS month_name,
    sub.store,
    SUM(sub.total_revenue) AS sum_monthly_revenue, -- TOTAL monthly rev
    SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue, -- AVERAGE daily rev within the month
    ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY sum_monthly_revenue DESC) AS Row_sum_rev, -- each rank ordered by the measure its name says
    ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY avg_daily_revenue DESC) AS Row_avg_rev
FROM (
    ...
) AS sub
GROUP BY month_name, sub.store
ORDER BY avg_daily_revenue DESC;
```

(4)+(5) Finally, let's retrieve all rows with the top-ranking month, to see which month performed best.

```sql
SELECT
    clean.month_name AS month_n,
    COUNT(CASE WHEN clean.Row_sum_rev =1 THEN clean.store END) AS Total_monthly_rev_count, -- count the number of rank 1s per month
    COUNT(CASE WHEN clean.Row_avg_rev =1 THEN clean.store END) AS Average_daily_rev_count -- count the number of rank 1s per month
FROM (
    SELECT
        (CASE
            WHEN sub.month_num=1 THEN 'Jan'
            WHEN sub.month_num=2 THEN 'Feb'
            WHEN sub.month_num=3 THEN 'Mar'
            WHEN sub.month_num=4 THEN 'Apr'
            WHEN sub.month_num=5 THEN 'May'
            WHEN sub.month_num=6 THEN 'Jun'
            WHEN sub.month_num=7 THEN 'Jul'
            WHEN sub.month_num=8 THEN 'Aug'
            WHEN sub.month_num=9 THEN 'Sep'
            WHEN sub.month_num=10 THEN 'Oct'
            WHEN sub.month_num=11 THEN 'Nov'
            WHEN sub.month_num=12 THEN 'Dec'
        END) AS month_name,
        sub.store,
        SUM(sub.total_revenue) AS sum_monthly_revenue,
        SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue,
        ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY sum_monthly_revenue DESC) AS Row_sum_rev,
        ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY avg_daily_revenue DESC) AS Row_avg_rev
    FROM (
        SELECT
            store,
            EXTRACT (month FROM saledate) AS month_num,
            EXTRACT (year FROM saledate) AS year_num,
            COUNT (DISTINCT saledate) AS num_dates,
            SUM(amt) AS total_revenue,
            (CASE
                WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can'
            END) AS can_use_anot
        FROM trnsact
        WHERE stype='p' AND can_use_anot='can'
        GROUP BY month_num, year_num, store
        HAVING num_dates>=20
    ) AS sub
    GROUP BY month_name, sub.store
) AS clean
GROUP BY month_n
ORDER BY Total_monthly_rev_count DESC
```

> DR JANA: If you write your queries correctly, you will find that 8 stores have the greatest total sales in April, while only 4 stores have the greatest average daily revenue in April.

| MONTH | TOTAL_MONTHLY | AVG_DAILY |
| ----- | ------------- | --------- |
| Dec | 317 | 321 |
| Mar | 4 | 3 |
| Jul | 3 | 3 |

While the output fits our expectations of the data (i.e. that ``Dec`` should be the most popular month), it doesn't match Dr Jana's hint.

After reading the forum, I realised the official assignment seems to give the wrong hint (quite a significant mistake!). We get the expected result if we write our queries to find the ``LOWEST`` total sales as ranked by month instead of the ``HIGHEST``, i.e. rank 12 rather than rank 1:

```sql
SELECT
    clean.month_name AS month_n,
    COUNT(CASE WHEN clean.Row_sum_rev =12 THEN clean.store END) AS Total_monthly_rev_count, -- rank 12 = a store's worst of its twelve months
    COUNT(CASE WHEN clean.Row_avg_rev =12 THEN clean.store END) AS Average_daily_rev_count
FROM (
    ...
```

| MONTH | LOW_TOTAL_MONTH | LOW_AVG_DAILY |
| ----- | --------------- | ------------- |
| Aug | 120 | 77 |
| Jan | 73 | 54 |
| Sep | 72 | 108 |
| ... | ... | ... |
| Apr | 4 | 8 |
| ... | ... | ... |
| Dec | 0 | 0 |

# End

*Thoughts on this course: the notes were messy, with quite a few significant mistakes, like that last one we saw. But overall it was a good introduction to SQL, and I appreciate the resources that let us play around with it.*

Key takeaways:

* Computational thinking: learning how to split large, complex problems into smaller pieces that can be reassembled later
* Rigorous testing, and checking for trend inconsistencies using month and year aggregations or standard deviations
* Dealing with outliers and missing data by setting predefined criteria in subqueries
* Syntax nuances between the MySQL and Teradata dialects
* Perseverance for long queries lol
--------------------------------------------------------------------------------