├── Assignment #1 Quiz - Queries in Spark SQL.md
├── Assignment #2 Quiz - Spark Internals.md
├── Assignment #3 Quiz - Engineering Data Pipelines Graded Quiz .sql
├── Assignment #4 Quiz - Logistic Regression Classifier.sql
├── Assignment.sql
├── Module 1 Quiz. md
├── Module 2 Quiz.md
├── Module 3 Quiz. md
├── Module 4 Quiz.md
└── README.md

/Assignment #1 Quiz - Queries in Spark SQL.md:
--------------------------------------------------------------------------------
Question 1
What is the first value for "Incident Number"?

**16000003**


Question 2
What is the first value for "Incident Number" on April 4th, 2016?

**16037478**


Question 3
Is the first fire call in this table on Brooke or Conor's birthday? Conor's birthday is 4/4 and Brooke's is 9/27 (in MM/DD format).

**Conor's birthday**

Brooke's birthday


Question 4
What is the "Station Area" for the first fire call in this table? Note that this table is a subset of the dataset.

**29**

Question 5
How many incidents were on Conor's birthday in 2016?

**80**

Question 6
How many fire calls had an "Ignition Cause" of "4 act of nature"?

**5**

Question 7
What is the most common "Ignition Cause"?
Hint: Put the entire string.

**2 unintentional**

Question 8
What is the total number of incidents from the two joined tables?

**847094402**

--------------------------------------------------------------------------------
/Assignment #2 Quiz - Spark Internals.md:
--------------------------------------------------------------------------------
Question 1
How many fire calls are in our table?

**240613**

Question 2
Which "Unit Type" is the most common?

**ENGINE**

Question 3
What type of transformation, wide or narrow, did the `GROUP BY` and `ORDER BY` queries result in?

**Wide**

Narrow

Question 4
How many tasks were in the last stage of the last job?

**2**
--------------------------------------------------------------------------------
/Assignment #3 Quiz - Engineering Data Pipelines Graded Quiz .sql:
--------------------------------------------------------------------------------
-- Databricks notebook source
-- MAGIC
-- MAGIC %md-sandbox
-- MAGIC
-- MAGIC
-- MAGIC Databricks Learning
-- MAGIC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Engineering Data Pipelines
-- MAGIC ## Module 3 Assignment
-- MAGIC
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this assignment you:
-- MAGIC * Create a table with persistent data and a specified schema
-- MAGIC * Populate the table with specific entries
-- MAGIC * Change the partition number to compare query speeds
-- MAGIC
-- MAGIC For each **bold** question, input its answer in Coursera.

-- COMMAND ----------

-- MAGIC %run ../Includes/Classroom-Setup

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC Create a table whose data will remain after you drop the table and after the cluster shuts down. Name this table `newTable` and specify its location to be `/tmp/newTableLoc`.
-- MAGIC
-- MAGIC Set up the table to have the following schema:
-- MAGIC
-- MAGIC ```
-- MAGIC `Address` STRING,
-- MAGIC `City` STRING,
-- MAGIC `Battalion` STRING,
-- MAGIC `Box` STRING,
-- MAGIC ```
-- MAGIC
-- MAGIC Run the following cell first to remove any files stored at `/tmp/newTableLoc` before creating our table. Be sure to re-run that cell each time before you re-create `newTable`.
-- MAGIC
-- MAGIC **Side Note:** This course was designed to work with Databricks Runtime 5.5 LTS ML, which uses Spark 2.4. If you are running a later version of the Databricks Runtime, you might have to add an additional `STORED AS parquet` clause to your query [due to a bug](https://issues.apache.org/jira/browse/SPARK-30436).

-- COMMAND ----------

-- MAGIC %python
-- MAGIC # removes files stored at '/tmp/newTableLoc'
-- MAGIC dbutils.fs.rm("/tmp/newTableLoc", True)

-- COMMAND ----------

-- TODO
DROP TABLE IF EXISTS newTable;
CREATE EXTERNAL TABLE newTable (
  `Address` STRING,
  `City` STRING,
  `Battalion` STRING,
  `Box` STRING
)
STORED AS parquet
LOCATION '/tmp/newTableLoc'

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Check that the data type of each column is what we want.
-- MAGIC
-- MAGIC ### Question 1
-- MAGIC **What type of table is `newTable`? "EXTERNAL" or "MANAGED"?**

-- COMMAND ----------

DESCRIBE EXTENDED newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Run the following cell to read in the data stored at `/mnt/davis/fire-calls/fire-calls-truncated.json`. Check that the columns of the data are of the correct types (not all strings).

-- COMMAND ----------

CREATE OR REPLACE TEMPORARY VIEW fireCallsJSON (
  `ALS Unit` boolean,
  `Address` string,
  `Available DtTm` string,
  `Battalion` string,
  `Box` string,
  `Call Date` string,
  `Call Final Disposition` string,
  `Call Number` long,
  `Call Type` string,
  `Call Type Group` string,
  `City` string,
  `Dispatch DtTm` string,
  `Entry DtTm` string,
  `Final Priority` long,
  `Fire Prevention District` string,
  `Hospital DtTm` string,
  `Incident Number` long,
  `Location` string,
  `Neighborhooods - Analysis Boundaries` string,
  `Number of Alarms` long,
  `On Scene DtTm` string,
  `Original Priority` string,
  `Priority` string,
  `Received DtTm` string,
  `Response DtTm` string,
  `RowID` string,
  `Station Area` string,
  `Supervisor District` string,
  `Transport DtTm` string,
  `Unit ID` string,
  `Unit Type` string,
  `Unit sequence in call dispatch` long,
  `Watch Date` string,
  `Zipcode of Incident` long
)
USING JSON
OPTIONS (
  path "/mnt/davis/fire-calls/fire-calls-truncated.json"
);

DESCRIBE fireCallsJSON
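
-- COMMAND ----------

-- MAGIC %md
-- MAGIC _Aside (not part of the graded assignment):_ declaring the schema explicitly, as above, avoids scanning the JSON file. As a sanity check, you could also let Spark infer the schema and compare it against the declared one. A minimal sketch, assuming the file is readable from Python:

-- COMMAND ----------

-- MAGIC %python
-- MAGIC # Schema inference scans the data (slower), but is handy for verifying column types.
-- MAGIC inferred = spark.read.json("/mnt/davis/fire-calls/fire-calls-truncated.json")
-- MAGIC inferred.printSchema()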

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Take a look at the table to make sure it looks correct.

-- COMMAND ----------

SELECT * FROM fireCallsJSON LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now let's populate `newTable` with some of the rows from the `fireCallsJSON` table you just loaded. We only want to include fire calls whose `Final Priority` is `3`.

-- COMMAND ----------

-- TODO
INSERT INTO newTable
SELECT Address, City, Battalion, Box
FROM fireCallsJSON
WHERE `Final Priority` = 3;

-- COMMAND ----------

SELECT * FROM newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 2
-- MAGIC
-- MAGIC **How many rows are in `newTable`?**

-- COMMAND ----------

-- TODO
SELECT COUNT(*) FROM newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC
-- MAGIC Sort the rows of `newTable` by ascending `Battalion`.
-- MAGIC
-- MAGIC ### Question 3
-- MAGIC
-- MAGIC **What is the "Battalion" of the first entry in the sorted table?**

-- COMMAND ----------

-- TODO
SELECT * FROM newTable
ORDER BY Battalion ASC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Let's see how this table is stored in our file system.
-- MAGIC
-- MAGIC Note: You should have specified the location of the table to be `/tmp/newTableLoc` when you created it.

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/tmp/newTableLoc

-- COMMAND ----------

-- MAGIC %md
-- MAGIC
-- MAGIC First run the following cell to check how many partitions are in this table. Did the number of partitions match the number of files our data was stored in?

-- COMMAND ----------
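
-- MAGIC %python
-- MAGIC # This cell was blank in the original notebook; a minimal sketch of the check it
-- MAGIC # describes, assuming the usual RDD-based way of counting a table's partitions:
-- MAGIC spark.table("newTable").rdd.getNumPartitions()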

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Let's try increasing the number of partitions to 256. Create this as a new table and call it `newTablePartitioned`.

-- COMMAND ----------

-- TODO
CREATE TABLE IF NOT EXISTS newTablePartitioned
AS
SELECT /*+ REPARTITION(256) */ *
FROM newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now let's take a look at how this new table is stored.

-- COMMAND ----------

DESCRIBE EXTENDED newTablePartitioned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Copy the location of `newTablePartitioned` from the table above and take a look at the files stored at that location. How many parts is our data stored in now?

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/user/hive/warehouse/databricks.db/newtablepartitioned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now sort the rows of `newTablePartitioned` by ascending `Battalion` and compare how long this query takes.
-- MAGIC
-- MAGIC ### Question 4
-- MAGIC
-- MAGIC **Was this query faster or slower on the table with increased partitions?**

-- COMMAND ----------

SELECT * FROM newTablePartitioned ORDER BY `Battalion`

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Run the following cell to see where the data of the original `newTable` is stored.

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/tmp/newTableLoc

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now drop the table `newTable`.

-- COMMAND ----------

DROP TABLE newTable;

-- The following line should error!
-- SELECT * FROM newTable;

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 5
-- MAGIC
-- MAGIC **Does the data stored within the table still exist at the original location (`dbfs:/tmp/newTableLoc`) after you dropped the table? (Answer "yes" or "no")**

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/tmp/newTableLoc

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC © 2020 Databricks, Inc. All rights reserved.
-- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
-- MAGIC
-- MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------
/Assignment #4 Quiz - Logistic Regression Classifier.sql:
--------------------------------------------------------------------------------
-- Databricks notebook source
-- MAGIC
-- MAGIC %md-sandbox
-- MAGIC
-- MAGIC
-- MAGIC Databricks Learning
-- MAGIC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Logistic Regression Classifier
-- MAGIC ## Module 4 Assignment
-- MAGIC
-- MAGIC This final assignment is broken up into 2 parts:
-- MAGIC 1. Completing this Logistic Regression Classifier notebook
-- MAGIC    * Submitting question answers to Coursera
-- MAGIC    * Uploading the notebook to Coursera for peer review
-- MAGIC 2. Answering 3 free response questions on the Coursera platform
-- MAGIC
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:
-- MAGIC * Preprocess data for use in a machine learning model
-- MAGIC * Step through creating a sklearn logistic regression model for classification
-- MAGIC * Predict the `Call_Type_Group` for incidents in a SQL table
-- MAGIC
-- MAGIC For each **bold** question, input its answer in Coursera.

-- COMMAND ----------

-- MAGIC %run ../Includes/Classroom-Setup

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Load the `/mnt/davis/fire-calls/fire-calls-clean.parquet` data as the `fireCallsClean` table.

-- COMMAND ----------

-- TODO
USE DATABRICKS;

DROP TABLE IF EXISTS fireCallsClean;
CREATE TABLE fireCallsClean
USING parquet
OPTIONS (
  path "/mnt/davis/fire-calls/fire-calls-clean.parquet"
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Check that your data is loaded in properly.

-- COMMAND ----------

SELECT * FROM fireCallsClean LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC By the end of this assignment, we would like to train a logistic regression model to predict 2 of the most common `Call_Type_Group` values given information from the rest of the table.

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Write a query to see what the different `Call_Type_Group` values are and their respective counts.
-- MAGIC
-- MAGIC ### Question 1
-- MAGIC
-- MAGIC **How many calls have a `Call_Type_Group` of "Fire"?**

-- COMMAND ----------

-- TODO
SELECT `Call_Type_Group`, COUNT(`Call_Type_Group`) AS numberOfCalls
FROM fireCallsClean
GROUP BY `Call_Type_Group`

-- COMMAND ----------

SELECT COUNT(`Call_Type_Group`)
FROM fireCallsClean

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Let's drop all the rows where `Call_Type_Group = null`. Since we don't have a lot of `Call_Type_Group` entries with the value `Alarm` or `Fire`, we will also drop these calls from the table. Call this new temporary view `fireCallsGroupCleaned`.

-- COMMAND ----------

-- TODO
CREATE OR REPLACE TEMPORARY VIEW fireCallsGroupCleaned
AS
SELECT *
FROM fireCallsClean
WHERE Call_Type_Group IS NOT NULL AND Call_Type_Group NOT IN ('Alarm', 'Fire')

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Check that every entry in `fireCallsGroupCleaned` has a `Call_Type_Group` of either `Potentially Life-Threatening` or `Non Life-threatening`.

-- COMMAND ----------

-- TODO
SELECT Call_Type_Group, COUNT(*)
FROM fireCallsGroupCleaned
GROUP BY Call_Type_Group

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 2
-- MAGIC
-- MAGIC **How many rows are in `fireCallsGroupCleaned`?**

-- COMMAND ----------

SELECT COUNT(*) FROM fireCallsGroupCleaned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC We probably don't need all the columns of `fireCallsGroupCleaned` to make our prediction. Select the following columns from `fireCallsGroupCleaned` and create a view called `fireCallsDF` so we can access this table in Python:
-- MAGIC
-- MAGIC * "Call_Type"
-- MAGIC * "Fire_Prevention_District"
-- MAGIC * "Neighborhooods_-\_Analysis_Boundaries"
-- MAGIC * "Number_of_Alarms"
-- MAGIC * "Original_Priority"
-- MAGIC * "Unit_Type"
-- MAGIC * "Battalion"
-- MAGIC * "Call_Type_Group"

-- COMMAND ----------

-- TODO
CREATE OR REPLACE TEMPORARY VIEW fireCallsDF
AS
SELECT Call_Type, Fire_Prevention_District, `Neighborhooods_-_Analysis_Boundaries`, Number_of_Alarms, Original_Priority, Unit_Type, Battalion, Call_Type_Group
FROM fireCallsGroupCleaned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Fill in the SQL statement to load the `fireCallsDF` view you just created into Python.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC # TODO
-- MAGIC spark.conf.set("spark.sql.execution.arrow.enabled", "true")
-- MAGIC df = sql("SELECT * FROM fireCallsDF")
-- MAGIC display(df)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## Creating a Logistic Regression Model in Sklearn

-- COMMAND ----------

-- MAGIC %md
-- MAGIC First we will convert the Spark DataFrame to pandas so we can use sklearn. We preprocess the data into numbers with a [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) so that it is compatible with the logistic regression algorithm.
-- MAGIC
-- MAGIC Then we'll perform a train test split on our pandas DataFrame. Remember that the column we are trying to predict is `Call_Type_Group`.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC from sklearn.model_selection import train_test_split
-- MAGIC from sklearn.preprocessing import LabelEncoder
-- MAGIC
-- MAGIC pdDF = df.toPandas()
-- MAGIC le = LabelEncoder()
-- MAGIC numerical_pdDF = pdDF.apply(le.fit_transform)
-- MAGIC
-- MAGIC X = numerical_pdDF.drop("Call_Type_Group", axis=1)
-- MAGIC y = numerical_pdDF["Call_Type_Group"].values
-- MAGIC X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Look at our training data `X_train`, which should only have numerical values now.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC display(X_train)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC We'll create a pipeline with 2 steps.
-- MAGIC
-- MAGIC 0. [One Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder): Converts our features into vectorized features by creating a dummy column for each value in each category.
-- MAGIC
-- MAGIC 0. [Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html): Although the name includes "regression", it is used for classification by predicting the probability that the `Call_Type_Group` is one label and not the other.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC from sklearn.linear_model import LogisticRegression
-- MAGIC from sklearn.preprocessing import OneHotEncoder
-- MAGIC from sklearn.pipeline import Pipeline
-- MAGIC
-- MAGIC ohe = ("ohe", OneHotEncoder(handle_unknown="ignore"))
-- MAGIC lr = ("lr", LogisticRegression())
-- MAGIC
-- MAGIC pipeline = Pipeline(steps=[ohe, lr]).fit(X_train, y_train)
-- MAGIC y_pred = pipeline.predict(X_test)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Run the following cell to see how well our model performed on test data (data that wasn't used to train the model)!

-- COMMAND ----------

-- MAGIC %python
-- MAGIC from sklearn.metrics import accuracy_score
-- MAGIC print(f"Accuracy of model: {accuracy_score(y_test, y_pred)}")

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 3
-- MAGIC
-- MAGIC **What is the accuracy of our model on test data? Round to the nearest percent.**

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Save the pipeline (with both stages) to disk.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC import mlflow
-- MAGIC from mlflow.sklearn import save_model
-- MAGIC
-- MAGIC model_path = "/dbfs/" + username + "/Call_Type_Group_lr"
-- MAGIC dbutils.fs.rm(username + "/Call_Type_Group_lr", recurse=True)
-- MAGIC save_model(pipeline, model_path)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## UDF

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now that we have created and trained a machine learning pipeline, we will use MLflow to register the `.predict` function of the sklearn pipeline as a UDF which we can later apply in parallel. We can then refer to it by the name `predictUDF` in SQL.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC import mlflow
-- MAGIC from mlflow.pyfunc import spark_udf
-- MAGIC
-- MAGIC predict = spark_udf(spark, model_path, result_type="int")
-- MAGIC spark.udf.register("predictUDF", predict)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Create a view called `testTable` of our test data `X_test` so that we can see this table in SQL.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC spark_df = spark.createDataFrame(X_test)
-- MAGIC spark_df.createOrReplaceTempView("testTable")

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Create a table called `predictions` using the `predictUDF` function we registered beforehand. Apply `predictUDF` to every row of `testTable` in parallel so that each row of `testTable` has a `Call_Type_Group` prediction.

-- COMMAND ----------

-- TODO
USE DATABRICKS;

DROP TABLE IF EXISTS predictions;

CREATE TABLE predictions AS (
  SELECT *, CAST(predictUDF(Call_Type, Fire_Prevention_District, `Neighborhooods_-_Analysis_Boundaries`, Number_of_Alarms, Original_Priority, Unit_Type, Battalion) AS double) AS prediction
  FROM testTable
  -- LIMIT 10000
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now take a look at the table and see what your model predicted for each call entry!

-- COMMAND ----------

SELECT * FROM predictions LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 4:
-- MAGIC
-- MAGIC **What 2 values are in the `prediction` column?**

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Congrats on finishing your last assignment notebook!
-- MAGIC
-- MAGIC Now you will have to upload this notebook to Coursera for peer review.
-- MAGIC 1. Make sure that all your code will run without errors
-- MAGIC    * Check this by clicking the "Clear State & Run All" dropdown option at the top of your notebook
-- MAGIC    * ![](http://files.training.databricks.com/images/eLearning/ucdavis/clearstaterunall.png)
-- MAGIC 2. Click on the "Workspace" icon on the side bar
-- MAGIC 3. Next to the notebook you're working in right now, click on the dropdown arrow
-- MAGIC 4. In the dropdown, click on "Export" then "HTML"
-- MAGIC    * ![](http://files.training.databricks.com/images/eLearning/ucdavis/export.png)
-- MAGIC 5. On the Coursera platform, upload this HTML file to Week 4's Peer Review Assignment
-- MAGIC
-- MAGIC Go back onto the Coursera platform for the free response portion of this assignment and for instructions on how to review your peer's work.

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC © 2020 Databricks, Inc. All rights reserved.
-- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
-- MAGIC
-- MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------
/Assignment.sql:
--------------------------------------------------------------------------------
-- Databricks notebook source
-- MAGIC
-- MAGIC %md-sandbox
-- MAGIC
-- MAGIC
-- MAGIC Databricks Learning
-- MAGIC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Queries in Spark SQL
-- MAGIC ## Module 1 Assignment
-- MAGIC
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this assignment you:
-- MAGIC * Create a table
-- MAGIC * Write SQL queries
-- MAGIC
-- MAGIC For each **bold** question, input its answer in Coursera.

-- COMMAND ----------

-- MAGIC %run ../Includes/Classroom-Setup

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Working with Incident Data
-- MAGIC
-- MAGIC For this assignment, we'll be using a new dataset: the [SF Fire Incident](https://data.sfgov.org/Public-Safety/Fire-Incidents/wr8u-xric) dataset. It has been mounted for you using the script above. The path to this dataset is as follows:
-- MAGIC
-- MAGIC `/mnt/davis/fire-incidents/fire-incidents-2016.csv`
-- MAGIC
-- MAGIC In this assignment, you will read the dataset and perform a number of different queries.

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create a Table
-- MAGIC
-- MAGIC Create a new table called `fireIncidents` for this dataset. Be sure to use options to properly parse the data.

-- COMMAND ----------

-- TODO
CREATE TABLE fireIncidents
USING csv
OPTIONS (
  header "true",
  path "/mnt/davis/fire-incidents/fire-incidents-2016.csv",
  inferSchema "true"
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 1
-- MAGIC
-- MAGIC Return the first 10 lines of the data. On the Coursera platform, input the result to the following question:
-- MAGIC
-- MAGIC **What is the first value for "Incident Number"?**

-- COMMAND ----------

-- TODO
SELECT `Incident Number` FROM fireIncidents LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) `WHERE` Clauses
-- MAGIC
-- MAGIC A `WHERE` clause is used to filter data that meets certain criteria, returning all values that evaluate to true.

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 2
-- MAGIC
-- MAGIC Return all incidents that occurred on Conor's birthday in 2016. For those of you who forgot his birthday, it's April 4th. On the Coursera platform, input the result to the following question:
-- MAGIC
-- MAGIC **What is the first value for "Incident Number" on April 4th, 2016?**
-- MAGIC
-- MAGIC **Remember to use backticks (\`\`) instead of single quotes ('') for columns that have spaces in the name.**

-- COMMAND ----------

SELECT * FROM fireIncidents

-- COMMAND ----------

-- TODO
SELECT `Incident Number`, `Incident Date` FROM fireIncidents
WHERE `Incident Date` = "04/04/2016"

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 3
-- MAGIC
-- MAGIC Return all incidents that occurred on Conor's _or_ Brooke's birthday. For those of you who forgot her birthday too, it's `9/27`.
-- MAGIC
-- MAGIC **Is the first fire call in this table on Brooke or Conor's birthday?**

-- COMMAND ----------

-- TODO
SELECT * FROM fireIncidents
WHERE `Incident Date` = "04/04/2016" OR `Incident Date` = "09/27/2016"

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 4
-- MAGIC Return all incidents on either Conor or Brooke's birthday where the `Station Area` is greater than 20.
-- MAGIC
-- MAGIC **What is the "Station Area" for the first fire call in this table?**

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Aggregate Functions
-- MAGIC
-- MAGIC Aggregate functions compute a single result value from a set of input values. Use the aggregate function `COUNT` to count the total records in the dataset.

-- COMMAND ----------

-- TODO
SELECT `Station Area` FROM fireIncidents
WHERE (`Incident Date` = "04/04/2016" OR `Incident Date` = "09/27/2016") AND `Station Area` > 20

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 5
-- MAGIC
-- MAGIC Count the incidents on Conor's birthday.
-- MAGIC
-- MAGIC **How many incidents were on Conor's birthday in 2016?**

-- COMMAND ----------

-- TODO
SELECT COUNT(`Incident Number`) AS incidents FROM fireIncidents
WHERE `Incident Date` = "04/04/2016"

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC ### Question 6
-- MAGIC
-- MAGIC Return the total counts by `Ignition Cause`. Be sure to return the field `Ignition Cause` as well.
-- MAGIC
-- MAGIC **Hint:** You'll have to use `GROUP BY` for this.
-- MAGIC
-- MAGIC **How many fire calls had an "Ignition Cause" of "4 act of nature"?**

-- COMMAND ----------

-- TODO
SELECT `Ignition Cause`, COUNT(`Ignition Cause`) AS total FROM fireIncidents
GROUP BY `Ignition Cause`

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Sorting
-- MAGIC
-- MAGIC ### Question 7
-- MAGIC
-- MAGIC Return the total counts by `Ignition Cause` sorted in ascending order.
-- MAGIC
-- MAGIC **Hint:** You'll have to use `ORDER BY` for this.
-- MAGIC
-- MAGIC **What is the most common "Ignition Cause"? (Put the entire string)**

-- COMMAND ----------

-- TODO
SELECT `Ignition Cause`, COUNT(`Ignition Cause`) AS total FROM fireIncidents
GROUP BY `Ignition Cause`
ORDER BY total ASC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Return the total counts by `Ignition Cause` sorted in descending order.

-- COMMAND ----------

-- TODO
SELECT `Ignition Cause`, COUNT(`Ignition Cause`) AS total FROM fireIncidents
GROUP BY `Ignition Cause`
ORDER BY total DESC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Joins
-- MAGIC
-- MAGIC Create the table `fireCalls` if it doesn't already exist.
-- MAGIC The path is as follows: `/mnt/davis/fire-calls/fire-calls-truncated-comma.csv`

-- COMMAND ----------

-- TODO
CREATE TABLE IF NOT EXISTS fireCalls
USING csv
OPTIONS (
  header "true",
  path "/mnt/davis/fire-calls/fire-calls-truncated-comma.csv",
  inferSchema "true"
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Join the two tables on `Battalion` by performing an inner join.

-- COMMAND ----------

-- TODO
SELECT * FROM fireCalls
INNER JOIN fireIncidents
ON fireCalls.`Battalion` = fireIncidents.`Battalion`

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 8
-- MAGIC
-- MAGIC Count the total incidents from the two tables joined on `Battalion`.
-- MAGIC
-- MAGIC **What is the total number of incidents from the two joined tables?**

-- COMMAND ----------

-- TODO
SELECT COUNT(fc.`Incident Number`) AS total
FROM fireCalls fc
INNER JOIN fireIncidents fi
ON fc.`Battalion` = fi.`Battalion`

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Congratulations! You made it to the end of the assignment!

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC © 2020 Databricks, Inc. All rights reserved.
-- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
-- MAGIC
-- MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------
/Module 1 Quiz. md:
--------------------------------------------------------------------------------
Question 1
Which of the following are true when it comes to the business value of big data? (Select all that apply.)

**The size of the data businesses collect is growing**

**Businesses are increasingly making data-driven decisions**

Automated technologies mean that data scientists and data analysts are no longer needed


Question 2
Spark uses... (Select all that apply.)

Your database technology (e.g., Postgres or SQL Server) to run Spark queries

**A driver node to distribute work across a number of executor nodes**

**A distributed cluster of networked computers made of a driver node and many executor nodes**

One very large computer that is able to run computation against large databases

A distributed cluster of networked computers made of many driver nodes and many executor nodes


Question 3
How does Spark execute code backed by DataFrames? (Select all that apply.)

It executes code determined in advance

It iterates over all of the source data to exhaustively evaluate queries

**It separates the "logical plan" of what you want to accomplish from the "physical plan" of how to do it so it can optimize the query**

**It optimizes your query by figuring out the best "how" to execute what you want**


Question 4
What are the properties of Spark DataFrames? (Select all that apply.)

**Dataset: Collection of partitioned data**

**Resilient: Fault-tolerant**

Tables: Operates as any table in SQL environments

**Distributed: Computed across multiple nodes**


Question 5
What is the difference between Spark and database technologies? (Select all that apply.)

**Spark is a computation engine and is not for data storage**

Spark does not interact with databases but uses its proprietary DataFrame technology instead

Spark operates for both data storage and computation

**Spark is a highly optimized compute engine and is not a database**

Spark is an alternative to traditional databases


Question 6
What is Amdahl's law of scalability? (Select all that apply.)

**Amdahl's law states that the speedup of a task is a function of how much of that task can be parallelized**

A formula that gives the expected speed of a single processor performing a computation

A formula that gives the theoretical speedup as a function of the size of a partition (or subset) of data

A formula that gives the number of processors (or other unit of parallelism) needed to complete a task

**A formula that gives the theoretical speedup as a function of the percentage of a computation that can be parallelized**


Question 7
Spark offers a unified approach to analytics. What does this include? (Select all that apply.)

**Spark unifies applications such as SQL queries, streaming, and machine learning**

Spark unifies databases with optimized computation allowing for faster computation against the data it stores

**Spark allows analysts, data scientists, and data engineers to all use the same core technology**

Spark is able to connect to data where it lives in any number of sources, unifying the components of a data application

**Spark code can be written in the following languages: SQL, Scala, Java, Python, and R**


Question 8
What is a Databricks notebook?

A Spark instance that executes queries

A cluster that executes Spark code

**A collaborative, interactive workspace that allows you to execute Spark queries at scale**

A single Spark query


Question 9
How can you get data into Databricks? (Select all that apply.)

**By uploading it through the user interface**

**By registering the data as a table**

By connecting to Dropbox or Google Drive

**By "mounting" data backed by cloud storage**


Question 10
What are the qualities of big data? (Select all that apply.)

**Volume: the amount of data**

Valorous: the positive impact of data

**Variety: the diversity of data**

**Velocity: the speed of data**

**Veracity: the reliability of data**
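
As a worked illustration of Question 6, a minimal sketch (not from the course) of Amdahl's law, which gives the theoretical speedup when a fraction `p` of a computation can be parallelized across `n` units:

```python
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n)
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# A job that is 95% parallelizable tops out at 1 / (1 - 0.95) = 20x,
# no matter how many executors are added.
for n in (2, 8, 64, 1024):
    print(f"n={n}: {amdahl_speedup(0.95, n):.2f}x")
```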
--------------------------------------------------------------------------------
/Module 2 Quiz.md:
--------------------------------------------------------------------------------
Question 1
What are the different units of parallelism? (Select all that apply.)

**Task**

**Executor**

**Core**

**Partition**


Question 2
What is a partition?

**A portion of a large distributed set of data**

A synonym for "task"

A division of computation that executes a query

The result of data filtered by a WHERE clause


Question 3
What is the difference between in-memory computing and other technologies? (Select all that apply.)

**In-memory operations were not realistic in older technologies when memory was more expensive**

In-memory computing is slower than other types of computing

**In-memory computing operates from RAM while other technologies operate from disk**

**Computation not done in-memory (such as Hadoop) reads and writes from disk in between each step**


Question 4
Why is caching important?

It always stores data in-memory to improve performance

**It stores data on the cluster to improve query performance**

It reformats data already stored in RAM for faster access

It improves queries against data read one or more times


Question 5
Which of the following is a wide transformation? (Select all that apply.)

**ORDER BY**

**GROUP BY**

SELECT

WHERE


Question 6
Broadcast joins...

Shuffle both of the tables, minimizing data transfer by transferring data in parallel

Shuffle both of the tables, minimizing computational resources

**Transfer the smaller of two tables to the larger, minimizing data transfer**

Transfer the smaller of two tables to the larger, increasing data transfer requirements


Question 7
When is it appropriate to use a shuffle join?

**When both tables are moderately sized or large**

When both tables are very small

Never. Broadcast joins always out-perform shuffle joins.

When the smaller table is significantly smaller than the larger table


Question 8
Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)

**Shuffle reads**

**Data Skew**

Incompatible data formats

**Shuffle writes**


Question 9
What is a stage boundary?

An action caused by a SQL query's predicate

Any transition between Spark tasks

A narrow transformation

**When all of the slots or available units of processing have to sync with one another**


Question 10
What happens when Spark code is executed in local mode?

A cluster of virtual machines is used rather than physical machines

**The executor and driver are on the same machine**

The code is executed in the cloud

The code is executed against a local cluster
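
A sketch of the ideas behind Questions 5 and 6 (the `battalionLookup` table is hypothetical; adapt the names to your own data): a `WHERE` filter is narrow, a `GROUP BY` forces a shuffle, and a broadcast hint ships the smaller table to the larger one:

```python
# Narrow: each partition is filtered independently; no shuffle, no stage boundary.
narrow = spark.sql("SELECT * FROM fireCalls WHERE `Final Priority` = 3")

# Wide: grouping shuffles rows so matching keys land together (a stage boundary).
wide = spark.sql("SELECT Battalion, COUNT(*) AS n FROM fireCalls GROUP BY Battalion")

# Broadcast join: copy the small dimension table to every executor instead of
# shuffling both sides of the join.
joined = spark.sql("""
  SELECT /*+ BROADCAST(b) */ *
  FROM fireCalls f
  JOIN battalionLookup b ON f.Battalion = b.Battalion
""")
```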
--------------------------------------------------------------------------------
/Module 3 Quiz. md:
--------------------------------------------------------------------------------
Question 1
Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)

+ It makes updates to new software versions easier

Resources are isolated and therefore more manageable and debuggable

+ It results in copies of the data in case of a data center outage

+ It allows for elastic resources so larger storage or compute resources are used only when needed


Question 2
You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?

CPU: the transfer of data is more demanding than the computation

IO: computation is more demanding than the data transfer

CPU: computation is more demanding than the data transfer

+ IO: the transfer of data is more demanding than the computation


Question 3
Processing virtual shopping cart orders in real time is an example of...

Online Analytical Processing (OLAP)

+ Online Transaction Processing (OLTP)


Question 4
When are BLOB stores an appropriate place to store data? (Select all that apply.)

+ For cheap storage

+ For a "data lake" of largely unstructured data

+ For storing large files

For online transaction processing on a website


Question 5
JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?

Specify the number of partitions using COALESCE. Spark then creates one parallel connection for each partition.

Specify the numPartitions configuration setting. Spark then creates one parallel connection for each partition.

+ Specify a column, number of partitions, and the column's minimum and maximum values. Spark then divides that range of values between parallel connections.

Specify the number of partitions using REPARTITION. Spark then creates one parallel connection for each partition.


Question 6
What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)

+ Compression

Corruptible

+ Parallelism

+ Columnar


Question 7
SQL is normally used to query tabular (or "structured") data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)

+ It allows for missing data

+ It allows for complex data types

+ It allows for data change over time

+ It does not need a formal structure

It allows for easy joins between relational JSON tables


Question 8
Data writes in Spark can happen in serial or in parallel. What controls this parallelism?

+ The number of data partitions in a DataFrame

The number of jobs in a write operation

The number of stages in a write operation

The numPartitions setting in the Spark configuration


Question 9
Fill in the blanks with the appropriate response below:

A _______ table manages _______ and a DROP TABLE command will result in data loss.

Unmanaged, both the data and metadata such as the schema and data location

+ Managed, both the data and metadata such as the schema and data location

Managed, only the metadata such as the schema and data location

Unmanaged, only the metadata such as the schema and data location
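
Question 5 in code: a hedged sketch of a partitioned JDBC read (the connection details and column names below are placeholders, not from the course):

```python
# Spark splits [lowerBound, upperBound] of partitionColumn into numPartitions
# ranges and opens one parallel JDBC connection per range.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   # placeholder
      .option("dbtable", "fire_calls")                   # placeholder
      .option("user", "username")                        # placeholder
      .option("password", "password")                    # placeholder
      .option("partitionColumn", "incident_number")
      .option("lowerBound", "0")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())
```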
--------------------------------------------------------------------------------
/Module 4 Quiz.md:
--------------------------------------------------------------------------------
Question 1
Machine learning is suited to solve which of the following tasks? (Select all that apply.)

**Churn Analysis**

Reporting

**Natural Language Processing**

**Image Recognition**

**Financial Forecasting**

**Fraud Detection**

**A/B Testing**

Question 2
Is a model that is 99% accurate at predicting breast cancer a good model?

Likely no because there are too many false positives

**Likely no because there are not many cases of cancer in a general population**

Likely yes because this is generally a high score

Likely yes because it accounts for false negatives and we'd want to make sure we catch every case of cancer

Question 3
What is an appropriate baseline model to compare a machine learning solution to?

**The average of the dataset**

Zero

The minimum value of the dataset


Question 4
What is Machine Learning? (Select all that apply.)

Statistical moments calculated against a dataset

Hand-coded logic

**Learning patterns in your data without being explicitly programmed**

**A function that maps features to an output**


Question 5
(Fill in the blanks with the appropriate answer below.)

Predicting whether a website user is fraudulent or not is an example of _________ machine learning. It is a __________ task.

**supervised, classification**

unsupervised, classification

unsupervised, regression

supervised, regression


Question 6
(Fill in the blanks with the appropriate answer below.)

Grouping similar users together based on past activity is an example of _________ machine learning. It is a _________ task.

**unsupervised, clustering**

unsupervised, classification

supervised, classification

supervised, clustering


Question 7
Predicting the next quarter of a company's earnings is an example of...

Reinforcement

Classification

**Regression**

Semi-supervised

Clustering


Question 8
Why do we want to perform a train/test split before we train a machine learning model? (Select all that apply.)

**To keep the model from "overfitting" where it memorizes the data it has seen**

To calculate a baseline model

To give us subsets of our data so we can compare a model trained on one versus the model trained on the other

**To evaluate how our model performs on unseen data**


Question 9
What is a linear regression model learning about your data?

The value of the closest points to the one you're trying to predict

**The formula for the line of best fit**

The best split points in a decision tree

The average of the data


Question 10
How do you define a custom function not already part of core Spark?

You can't write your own functions in Spark

**With a User-Defined Function**

By extending the open source code base
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Queries-in-Spark-SQL
--------------------------------------------------------------------------------