├── Assignment #1 Quiz - Queries in Spark SQL.md
├── Assignment #2 Quiz - Spark Internals.md
├── Assignment #3 Quiz - Engineering Data Pipelines Graded Quiz.sql
├── Assignment #4 Quiz - Logistic Regression Classifier.sql
├── Assignment.sql
├── Module 1 Quiz.md
├── Module 2 Quiz.md
├── Module 3 Quiz.md
├── Module 4 Quiz.md
└── README.md
/Assignment #1 Quiz - Queries in Spark SQL.md:
--------------------------------------------------------------------------------
1 | Question 1
2 | What is the first value for "Incident Number"?
3 |
4 | **16000003**
5 |
6 |
7 | Question 2
8 | What is the first value for "Incident Number" on April 4th, 2016?
9 |
10 | **16037478**
11 |
12 |
13 | Question 3
14 | Is the first fire call in this table on Brooke or Conor's birthday? Conor's birthday is 4/4 and Brooke's is 9/27 (in MM/DD format).
15 |
16 | **Conor's birthday**
17 |
18 | Brooke's birthday
19 |
20 |
21 | Question 4
22 | What is the "Station Area" for the first fire call in this table? Note that this table is a subset of the dataset.
23 |
24 | **29**
25 |
26 | Question 5
27 | How many incidents were on Conor's birthday in 2016?
28 |
29 | **80**
30 |
31 | Question 6
32 | How many fire calls had an "Ignition Cause" of "4 act of nature"?
33 |
34 | **5**
35 |
36 | Question 7
37 | What is the most common "Ignition Cause"?
38 | Hint: Put the entire string.
39 |
40 | **2 unintentional**
41 |
42 | Question 8
44 | What is the total number of incidents from the two joined tables?
44 |
45 | **847094402**
46 |
47 |
--------------------------------------------------------------------------------
/Assignment #2 Quiz - Spark Internals.md:
--------------------------------------------------------------------------------
1 | Question 1
2 | How many fire calls are in our table?
3 |
4 | **240613**
5 |
6 | Question 2
7 | Which "Unit Type" is the most common?
8 |
9 | **ENGINE**
10 |
11 | Question 3
12 | What type of transformation, wide or narrow, did the 'GROUP BY' and 'ORDER BY' queries result in?
13 |
14 | **Wide**
15 |
16 | Narrow
17 |
18 | Question 4
19 | How many tasks were in the last stage of the last job?
20 |
21 | **2**
22 |
--------------------------------------------------------------------------------
/Assignment #3 Quiz - Engineering Data Pipelines Graded Quiz.sql:
--------------------------------------------------------------------------------
1 | -- Databricks notebook source
2 | -- MAGIC
3 | -- MAGIC %md-sandbox
4 | -- MAGIC
5 | -- MAGIC
6 | -- MAGIC
7 | -- MAGIC
8 |
9 | -- COMMAND ----------
10 |
11 | -- MAGIC %md
12 | -- MAGIC # Engineering Data Pipelines
13 | -- MAGIC ## Module 3 Assignment
14 | -- MAGIC
15 | -- MAGIC ##  In this assignment you:
16 | -- MAGIC * Create a table with persistent data and a specified schema
17 | -- MAGIC * Populate table with specific entries
18 | -- MAGIC * Change partition number to compare query speeds
19 | -- MAGIC
20 | -- MAGIC For each **bold** question, input its answer in Coursera.
21 |
22 | -- COMMAND ----------
23 |
24 | -- MAGIC %run ../Includes/Classroom-Setup
25 |
26 | -- COMMAND ----------
27 |
28 | -- MAGIC %md-sandbox
29 | -- MAGIC Create a table whose data will remain after you drop the table and after the cluster shuts down. Name this table `newTable` and specify the location to be at `/tmp/newTableLoc`
30 | -- MAGIC
31 | -- MAGIC Set up the table to have the following schema:
32 | -- MAGIC
33 | -- MAGIC ```
34 | -- MAGIC `Address` STRING,
35 | -- MAGIC `City` STRING,
36 | -- MAGIC `Battalion` STRING,
37 | -- MAGIC `Box` STRING,
38 | -- MAGIC ```
39 | -- MAGIC
40 | -- MAGIC Run the following cell first to remove any files stored at `/tmp/newTableLoc` before creating the table. Be sure to re-run that cell each time you re-create `newTable`.
41 | -- MAGIC
42 | -- MAGIC This course was designed to work with Databricks Runtime 5.5 LTS ML, which uses Spark 2.4. If you are running a later version of the Databricks Runtime, you might have to add `STORED AS parquet` to your query [due to a bug](https://issues.apache.org/jira/browse/SPARK-30436).
43 |
44 | -- COMMAND ----------
45 |
46 | -- MAGIC %python
47 | -- MAGIC # removes files stored at '/tmp/newTableLoc'
48 | -- MAGIC dbutils.fs.rm("/tmp/newTableLoc", True)
49 |
50 | -- COMMAND ----------
51 |
52 | -- TODO
53 | DROP TABLE IF EXISTS newTable;
54 | CREATE EXTERNAL TABLE newTable (
55 | `Address` STRING,
56 | `City` STRING,
57 | `Battalion` STRING,
58 | `Box` STRING
59 | )
60 | STORED AS parquet
61 | LOCATION '/tmp/newTableLoc'
62 |
63 | -- COMMAND ----------
64 |
65 | -- MAGIC %md
66 | -- MAGIC Check that the data type of each column is what we want.
67 | -- MAGIC
68 | -- MAGIC ### Question 1
69 | -- MAGIC **What type of table is `newTable`? "EXTERNAL" or "MANAGED"?**
70 |
71 | -- COMMAND ----------
72 |
73 | DESCRIBE EXTENDED newTable
74 |
75 | -- COMMAND ----------
76 |
77 | -- MAGIC %md
78 | -- MAGIC Run the following cell to read in the data stored at `/mnt/davis/fire-calls/fire-calls-truncated.json`. Check that the columns of the data are of the correct types (not all strings).
79 |
80 | -- COMMAND ----------
81 |
82 | CREATE OR REPLACE TEMPORARY VIEW fireCallsJSON (
83 | `ALS Unit` boolean,
84 | `Address` string,
85 | `Available DtTm` string,
86 | `Battalion` string,
87 | `Box` string,
88 | `Call Date` string,
89 | `Call Final Disposition` string,
90 | `Call Number` long,
91 | `Call Type` string,
92 | `Call Type Group` string,
93 | `City` string,
94 | `Dispatch DtTm` string,
95 | `Entry DtTm` string,
96 | `Final Priority` long,
97 | `Fire Prevention District` string,
98 | `Hospital DtTm` string,
99 | `Incident Number` long,
100 | `Location` string,
101 | `Neighborhooods - Analysis Boundaries` string,
102 | `Number of Alarms` long,
103 | `On Scene DtTm` string,
104 | `Original Priority` string,
105 | `Priority` string,
106 | `Received DtTm` string,
107 | `Response DtTm` string,
108 | `RowID` string,
109 | `Station Area` string,
110 | `Supervisor District` string,
111 | `Transport DtTm` string,
112 | `Unit ID` string,
113 | `Unit Type` string,
114 | `Unit sequence in call dispatch` long,
115 | `Watch Date` string,
116 | `Zipcode of Incident` long
117 | )
118 | USING JSON
119 | OPTIONS (
120 | path "/mnt/davis/fire-calls/fire-calls-truncated.json"
121 | );
122 |
123 | DESCRIBE fireCallsJSON
124 |
125 | -- COMMAND ----------
126 |
127 | -- MAGIC %md
128 | -- MAGIC Take a look at the table to make sure it looks correct.
129 |
130 | -- COMMAND ----------
131 |
132 | SELECT * FROM fireCallsJSON LIMIT 10
133 |
134 | -- COMMAND ----------
135 |
136 | -- MAGIC %md
137 | -- MAGIC Now let's populate `newTable` with some of the rows from the `fireCallsJSON` table you just loaded. We only want to include fire calls whose `Final Priority` is `3`.
138 |
139 | -- COMMAND ----------
140 |
141 | -- TODO
142 |
143 | INSERT INTO newTable
144 | SELECT Address, City, Battalion, Box
145 | FROM fireCallsJSON
146 | WHERE `Final Priority`=3;
147 |
148 |
149 | -- COMMAND ----------
150 |
151 | select * from newTable
152 |
153 | -- COMMAND ----------
154 |
155 | -- MAGIC %md
156 | -- MAGIC ### Question 2
157 | -- MAGIC
158 | -- MAGIC **How many rows are in `newTable`?**
159 |
160 | -- COMMAND ----------
161 |
162 | -- TODO
163 | SELECT COUNT(*) FROM newTable
164 |
165 | -- COMMAND ----------
166 |
167 | -- MAGIC %md
168 | -- MAGIC
169 | -- MAGIC Sort the rows of `newTable` by ascending `Battalion`.
170 | -- MAGIC
171 | -- MAGIC ### Question 3
172 | -- MAGIC
173 | -- MAGIC **What is the "Battalion" of the first entry in the sorted table?**
174 |
175 | -- COMMAND ----------
176 |
177 | -- TODO
178 | SELECT * FROM newTable
179 | ORDER BY Battalion ASC
180 |
181 | -- COMMAND ----------
182 |
183 | -- MAGIC %md
184 | -- MAGIC Let's see how this table is stored in our file system.
185 | -- MAGIC
186 | -- MAGIC Note: You should have specified the location of the table to be `/tmp/newTableLoc` when you created it.
187 |
188 | -- COMMAND ----------
189 |
190 | -- MAGIC %fs ls dbfs:/tmp/newTableLoc
191 |
192 | -- COMMAND ----------
193 |
194 | -- MAGIC %md
195 | -- MAGIC
196 | -- MAGIC First run the following cell to check how many partitions are in this table. Did the number of partitions match the number of files our data was stored as?
197 |
198 | -- COMMAND ----------
199 |
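-- MAGIC %python
-- MAGIC # A minimal sketch, assuming the ambient Databricks `spark` session: the
-- MAGIC # original partition-check cell is missing from this copy, so this reads
-- MAGIC # newTable back as a DataFrame and counts its underlying partitions.
-- MAGIC # Compare the result against the file listing of /tmp/newTableLoc above.
-- MAGIC spark.table("newTable").rdd.getNumPartitions()

-- COMMAND ----------
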
200 | -- MAGIC %md
201 | -- MAGIC Let's try increasing the number of partitions to 256. Create this as a new table called `newTablePartitioned`.
202 |
203 | -- COMMAND ----------
204 |
205 | -- TODO
206 | CREATE TABLE IF NOT EXISTS newTablePartitioned
207 | AS
208 | SELECT /*+ REPARTITION(256) */ *
209 | FROM newTable
210 |
211 |
212 | -- COMMAND ----------
213 |
214 | -- MAGIC %md
215 | -- MAGIC Now let's take a look at how this new table is stored.
216 |
217 | -- COMMAND ----------
218 |
219 | DESCRIBE EXTENDED newTablePartitioned
220 |
221 | -- COMMAND ----------
222 |
223 | -- MAGIC %md
224 | -- MAGIC Copy the location of `newTablePartitioned` from the output above and take a look at the files stored at that location. How many parts is the data stored in now?
225 |
226 | -- COMMAND ----------
227 |
228 | -- MAGIC %fs ls dbfs:/user/hive/warehouse/databricks.db/newtablepartitioned
229 |
230 | -- COMMAND ----------
231 |
232 | -- MAGIC %md
233 | -- MAGIC Now sort the rows of `newTablePartitioned` by ascending `Battalion` and compare how long this query takes.
234 | -- MAGIC
235 | -- MAGIC ### Question 4
236 | -- MAGIC
237 | -- MAGIC **Was this query faster or slower on the table with increased partitions?**
238 |
239 | -- COMMAND ----------
240 |
241 | SELECT * FROM newTablePartitioned ORDER BY `Battalion`
242 |
243 | -- COMMAND ----------
244 |
245 | -- MAGIC %md
246 | -- MAGIC Run the following cell to see where the data of the original `newTable` is stored.
247 |
248 | -- COMMAND ----------
249 |
250 | -- MAGIC %fs ls dbfs:/tmp/newTableLoc
251 |
252 | -- COMMAND ----------
253 |
254 | -- MAGIC %md
255 | -- MAGIC Now drop the table `newTable`.
256 |
257 | -- COMMAND ----------
258 |
259 | DROP TABLE newTable;
260 |
261 | -- The following line should error!
262 | -- SELECT * FROM newTable;
263 |
264 | -- COMMAND ----------
265 |
266 | -- MAGIC %md
267 | -- MAGIC ### Question 5
268 | -- MAGIC
269 | -- MAGIC **Does the data stored within the table still exist at the original location (`dbfs:/tmp/newTableLoc`) after you dropped the table? (Answer "yes" or "no")**
270 |
271 | -- COMMAND ----------
272 |
273 | -- MAGIC %fs ls dbfs:/tmp/newTableLoc
274 |
275 | -- COMMAND ----------
276 |
277 | -- MAGIC %md-sandbox
278 | -- MAGIC © 2020 Databricks, Inc. All rights reserved.
279 | -- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
280 | -- MAGIC
281 | -- MAGIC Privacy Policy | Terms of Use | Support
282 |
--------------------------------------------------------------------------------
/Assignment #4 Quiz - Logistic Regression Classifier.sql:
--------------------------------------------------------------------------------
1 | -- Databricks notebook source
2 | -- MAGIC
3 | -- MAGIC %md-sandbox
4 | -- MAGIC
5 | -- MAGIC
6 | -- MAGIC
7 | -- MAGIC
8 |
9 | -- COMMAND ----------
10 |
11 | -- MAGIC %md
12 | -- MAGIC # Logistic Regression Classifier
13 | -- MAGIC ## Module 4 Assignment
14 | -- MAGIC
15 | -- MAGIC This final assignment is broken up into 2 parts:
16 | -- MAGIC 1. Completing this Logistic Regression Classifier notebook
17 | -- MAGIC * Submitting question answers to Coursera
18 | -- MAGIC * Uploading notebook to Coursera for peer reviewing
19 | -- MAGIC 2. Answering 3 free response questions on Coursera platform
20 | -- MAGIC
21 | -- MAGIC ##  In this notebook you:
22 | -- MAGIC * Preprocess data for use in a machine learning model
23 | -- MAGIC * Step through creating a sklearn logistic regression model for classification
24 | -- MAGIC * Predict the `Call_Type_Group` for incidents in a SQL table
25 | -- MAGIC
26 | -- MAGIC
27 | -- MAGIC For each **bold** question, input its answer in Coursera.
28 |
29 | -- COMMAND ----------
30 |
31 | -- MAGIC %run ../Includes/Classroom-Setup
32 |
33 | -- COMMAND ----------
34 |
35 | -- MAGIC %md
36 | -- MAGIC Load the `/mnt/davis/fire-calls/fire-calls-clean.parquet` data as a table called `fireCallsClean`.
37 |
38 | -- COMMAND ----------
39 |
40 | -- TODO
41 | USE DATABRICKS;
42 | -- FILL IN
43 | DROP TABLE IF EXISTS fireCallsClean;
44 | CREATE TABLE fireCallsClean
45 | USING Parquet
46 | OPTIONS (
47 | path "/mnt/davis/fire-calls/fire-calls-clean.parquet"
48 | )
49 |
50 | -- COMMAND ----------
51 |
52 | -- MAGIC %md
53 | -- MAGIC Check that your data is loaded in properly.
54 |
55 | -- COMMAND ----------
56 |
57 | SELECT * FROM fireCallsClean LIMIT 10
58 |
59 | -- COMMAND ----------
60 |
61 | -- MAGIC %md
62 | -- MAGIC By the end of this assignment, we would like to train a logistic regression model to predict 2 of the most common `Call_Type_Group` given information from the rest of the table.
63 |
64 | -- COMMAND ----------
65 |
66 | -- MAGIC %md
67 | -- MAGIC Write a query to see what the different `Call_Type_Group` values are and their respective counts.
68 | -- MAGIC
69 | -- MAGIC ### Question 1
70 | -- MAGIC
71 | -- MAGIC **How many calls have a `Call_Type_Group` of "Fire"?**
72 |
73 | -- COMMAND ----------
74 |
75 | -- TODO
76 | select `Call_Type_Group`, count(`Call_Type_Group`) as numberofcalls
77 | from fireCallsClean
78 | group by `Call_Type_Group`
79 |
80 | -- COMMAND ----------
81 |
82 | select count(`Call_Type_Group`)
83 | from fireCallsClean
84 |
85 | -- COMMAND ----------
86 |
87 | -- MAGIC %md
88 | -- MAGIC Let's drop all the rows where `Call_Type_Group` is null. Since there aren't many calls with a `Call_Type_Group` of `Alarm` or `Fire`, we will also drop those calls from the table. Call this new temporary view `fireCallsGroupCleaned`.
89 |
90 | -- COMMAND ----------
91 |
92 | -- TODO
93 | create or replace temporary view fireCallsGroupCleaned as
94 | select *
95 | from fireCallsClean
96 | where Call_Type_Group is not null and Call_Type_Group not in ('Alarm', 'Fire')
99 |
100 | -- COMMAND ----------
101 |
102 | -- MAGIC %md
103 | -- MAGIC Check that every entry in `fireCallsGroupCleaned` has a `Call_Type_Group` of either `Potentially Life-Threatening` or `Non Life-threatening`.
104 |
105 | -- COMMAND ----------
106 |
107 | -- TODO
108 | select Call_Type_Group, count(*) from fireCallsGroupCleaned
109 | group by Call_Type_Group
110 |
111 | -- COMMAND ----------
112 |
113 | -- MAGIC %md
114 | -- MAGIC ### Question 2
115 | -- MAGIC
116 | -- MAGIC **How many rows are in `fireCallsGroupCleaned`?**
117 |
118 | -- COMMAND ----------
119 |
120 | select count(*) from fireCallsGroupCleaned
121 |
122 | -- COMMAND ----------
123 |
124 | -- MAGIC %md
125 | -- MAGIC We probably don't need all the columns of `fireCallsGroupCleaned` to make our prediction. Select the following columns from `fireCallsGroupCleaned` and create a view called `fireCallsDF` so we can access this table in Python:
126 | -- MAGIC
127 | -- MAGIC * "Call_Type"
128 | -- MAGIC * "Fire_Prevention_District"
129 | -- MAGIC * "Neighborhooods_-\_Analysis_Boundaries"
130 | -- MAGIC * "Number_of_Alarms"
131 | -- MAGIC * "Original_Priority"
132 | -- MAGIC * "Unit_Type"
133 | -- MAGIC * "Battalion"
134 | -- MAGIC * "Call_Type_Group"
135 |
136 | -- COMMAND ----------
137 |
138 | -- TODO
139 | create or replace temporary view fireCallsDF as
140 | select Call_Type, Fire_Prevention_District, `Neighborhooods_-_Analysis_Boundaries`, Number_of_Alarms, Original_Priority, Unit_Type, Battalion, Call_Type_Group
142 | from fireCallsGroupCleaned
143 |
144 | -- COMMAND ----------
145 |
146 | -- MAGIC %md
147 | -- MAGIC Fill in the string SQL statement to load the `fireCallsDF` table you just created into python.
148 |
149 | -- COMMAND ----------
150 |
151 | -- MAGIC %python
152 | -- MAGIC # TODO
153 | -- MAGIC spark.conf.set("spark.sql.execution.arrow.enabled", "true")
154 | -- MAGIC df = spark.sql("SELECT * FROM fireCallsDF")
156 | -- MAGIC display(df)
157 |
158 | -- COMMAND ----------
159 |
160 | -- MAGIC %md
161 | -- MAGIC ## Creating a Logistic Regression Model in Sklearn
162 |
163 | -- COMMAND ----------
164 |
165 | -- MAGIC %md
166 | -- MAGIC First we will convert the Spark DataFrame to pandas, then use sklearn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to encode the categorical features as numbers so the data is compatible with the logistic regression algorithm.
167 | -- MAGIC
168 | -- MAGIC Then we'll perform a train test split on our pandas DataFrame. Remember that the column we are trying to predict is the `Call_Type_Group`.
169 |
170 | -- COMMAND ----------
171 |
172 | -- MAGIC %python
173 | -- MAGIC from sklearn.model_selection import train_test_split
174 | -- MAGIC from sklearn.preprocessing import LabelEncoder
175 | -- MAGIC
176 | -- MAGIC pdDF = df.toPandas()
177 | -- MAGIC le = LabelEncoder()
178 | -- MAGIC numerical_pdDF = pdDF.apply(le.fit_transform)
179 | -- MAGIC
180 | -- MAGIC X = numerical_pdDF.drop("Call_Type_Group", axis=1)
181 | -- MAGIC y = numerical_pdDF["Call_Type_Group"].values
182 | -- MAGIC X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
183 |
184 | -- COMMAND ----------
185 |
186 | -- MAGIC %md
187 | -- MAGIC Look at our training data `X_train` which should only have numerical values now.
188 |
189 | -- COMMAND ----------
190 |
191 | -- MAGIC %python
192 | -- MAGIC display(X_train)
193 |
194 | -- COMMAND ----------
195 |
196 | -- MAGIC %md
197 | -- MAGIC We'll create a pipeline with 2 steps.
198 | -- MAGIC
199 | -- MAGIC 0. [One Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder): Converts our features into vectorized features by creating a dummy column for each value in that category.
200 | -- MAGIC
201 | -- MAGIC 0. [Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html): Although the name includes "regression", it is used for classification by predicting the probability that the `Call Type Group` is one label and not the other.
202 |
203 | -- COMMAND ----------
204 |
205 | -- MAGIC %python
206 | -- MAGIC from sklearn.linear_model import LogisticRegression
207 | -- MAGIC from sklearn.preprocessing import OneHotEncoder
208 | -- MAGIC from sklearn.pipeline import Pipeline
209 | -- MAGIC
210 | -- MAGIC ohe = ("ohe", OneHotEncoder(handle_unknown="ignore"))
211 | -- MAGIC lr = ("lr", LogisticRegression())
212 | -- MAGIC
213 | -- MAGIC pipeline = Pipeline(steps = [ohe, lr]).fit(X_train, y_train)
214 | -- MAGIC y_pred = pipeline.predict(X_test)
215 |
216 | -- COMMAND ----------
217 |
218 | -- MAGIC %md
219 | -- MAGIC Run the following cell to see how well our model performed on test data (data that wasn't used to train the model)!
220 |
221 | -- COMMAND ----------
222 |
223 | -- MAGIC %python
224 | -- MAGIC from sklearn.metrics import accuracy_score
225 | -- MAGIC print(f"Accuracy of model: {accuracy_score(y_pred, y_test)}")
226 |
227 | -- COMMAND ----------
228 |
229 | -- MAGIC %md
230 | -- MAGIC ### Question 3
231 | -- MAGIC
232 | -- MAGIC **What is the accuracy of our model on test data? Round to the nearest percent.**
233 |
234 | -- COMMAND ----------
235 |
236 | -- MAGIC %md
237 | -- MAGIC Save pipeline (with both stages) to disk.
238 |
239 | -- COMMAND ----------
240 |
241 | -- MAGIC %python
242 | -- MAGIC import mlflow
243 | -- MAGIC from mlflow.sklearn import save_model
244 | -- MAGIC
245 | -- MAGIC model_path = "/dbfs/" + username + "/Call_Type_Group_lr"
246 | -- MAGIC dbutils.fs.rm(username + "/Call_Type_Group_lr", recurse=True)
247 | -- MAGIC save_model(pipeline, model_path)
248 |
249 | -- COMMAND ----------
250 |
251 | -- MAGIC %md
252 | -- MAGIC ## UDF
253 |
254 | -- COMMAND ----------
255 |
256 | -- MAGIC %md
257 | -- MAGIC Now that we have created and trained a machine learning pipeline, we will use MLflow to register the `.predict` function of the sklearn pipeline as a UDF that we can later apply in parallel. We can then refer to it by the name `predictUDF` in SQL.
258 |
259 | -- COMMAND ----------
260 |
261 | -- MAGIC %python
262 | -- MAGIC import mlflow
263 | -- MAGIC from mlflow.pyfunc import spark_udf
264 | -- MAGIC
265 | -- MAGIC predict = spark_udf(spark, model_path, result_type="int")
266 | -- MAGIC spark.udf.register("predictUDF", predict)
267 |
268 | -- COMMAND ----------
269 |
270 | -- MAGIC %md
271 | -- MAGIC Create a view called `testTable` of our test data `X_test` so that we can see this table in SQL.
272 |
273 | -- COMMAND ----------
274 |
275 | -- MAGIC %python
276 | -- MAGIC spark_df = spark.createDataFrame(X_test)
277 | -- MAGIC spark_df.createOrReplaceTempView("testTable")
278 |
279 | -- COMMAND ----------
280 |
281 | -- MAGIC %md
282 | -- MAGIC Create a table called `predictions` using the `predictUDF` function we registered beforehand. Apply the `predictUDF` to every row of `testTable` in parallel so that each row of `testTable` has a `Call_Type_Group` prediction.
283 |
284 | -- COMMAND ----------
285 |
286 | -- TODO
287 |
288 | USE DATABRICKS;
289 | drop table if exists predictions;
290 |
291 | create table predictions as (
292 | select *, cast(predictUDF(Call_Type, Fire_Prevention_District, `Neighborhooods_-_Analysis_Boundaries`, Number_of_Alarms, Original_Priority, Unit_Type, Battalion) as double) as prediction
293 | FROM testTable
294 | --LIMIT 10000
295 | )
296 |
297 | -- COMMAND ----------
298 |
299 | -- MAGIC %md
300 | -- MAGIC Now take a look at the table and see what your model predicted for each call entry!
301 |
302 | -- COMMAND ----------
303 |
304 | SELECT * FROM predictions LIMIT 10
305 |
306 | -- COMMAND ----------
307 |
308 | -- MAGIC %md
309 | -- MAGIC ### Question 4
310 | -- MAGIC
311 | -- MAGIC **What 2 values are in the `prediction` column?**
312 |
313 | -- COMMAND ----------
314 |
315 | -- MAGIC %md
316 | -- MAGIC Congrats on finishing your last assignment notebook!
317 | -- MAGIC
318 | -- MAGIC
319 | -- MAGIC Now you will have to upload this notebook to Coursera for peer reviewing.
320 | -- MAGIC 1. Make sure that all your code will run without errors
321 | -- MAGIC * Check this by clicking the "Clear State & Run All" dropdown option at the top of your notebook
322 | -- MAGIC
323 | -- MAGIC 2. Click on the "Workspace" icon on the side bar
324 | -- MAGIC 3. Next to the notebook you're working in right now, click on the dropdown arrow
325 | -- MAGIC 4. In the dropdown, click on "Export" then "HTML"
326 | -- MAGIC
327 | -- MAGIC 5. On the Coursera platform, upload this HTML file to Week 4's Peer Review Assignment
328 | -- MAGIC
329 | -- MAGIC Go back onto the Coursera platform for the free response portion of this assignment and for instructions on how to review your peer's work.
330 |
331 | -- COMMAND ----------
332 |
333 | -- MAGIC %md-sandbox
334 | -- MAGIC © 2020 Databricks, Inc. All rights reserved.
335 | -- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
336 | -- MAGIC
337 | -- MAGIC Privacy Policy | Terms of Use | Support
338 |
--------------------------------------------------------------------------------
/Assignment.sql:
--------------------------------------------------------------------------------
1 | -- Databricks notebook source
2 | -- MAGIC
3 | -- MAGIC %md-sandbox
4 | -- MAGIC
5 | -- MAGIC
6 | -- MAGIC
7 | -- MAGIC
8 |
9 | -- COMMAND ----------
10 |
11 | -- MAGIC %md
12 | -- MAGIC # Queries in Spark SQL
13 | -- MAGIC ## Module 1 Assignment
14 | -- MAGIC
15 | -- MAGIC ##  In this assignment you:
16 | -- MAGIC * Create a table
17 | -- MAGIC * Write SQL queries
18 | -- MAGIC
19 | -- MAGIC
20 | -- MAGIC For each **bold** question, input its answer in Coursera.
21 |
22 | -- COMMAND ----------
23 |
24 | -- MAGIC %run ../Includes/Classroom-Setup
25 |
26 | -- COMMAND ----------
27 |
28 | -- MAGIC %md
29 | -- MAGIC ### Working with Incident Data
30 | -- MAGIC
31 | -- MAGIC For this assignment, we'll be using a new dataset: the [SF Fire Incident](https://data.sfgov.org/Public-Safety/Fire-Incidents/wr8u-xric) dataset. It has been mounted for you using the script above. The path to this dataset is as follows:
32 | -- MAGIC
33 | -- MAGIC `/mnt/davis/fire-incidents/fire-incidents-2016.csv`
34 | -- MAGIC
35 | -- MAGIC In this assignment, you will read the dataset and perform a number of different queries.
36 |
37 | -- COMMAND ----------
38 |
39 | -- MAGIC %md
40 | -- MAGIC ##  Create a Table
41 | -- MAGIC
42 | -- MAGIC Create a new table called `fireIncidents` for this dataset. Be sure to use options to properly parse the data.
43 |
44 | -- COMMAND ----------
45 |
46 | -- TODO
47 | create table fireIncidents
48 | USING csv
49 | OPTIONS (
50 | header "true",
51 | path "/mnt/davis/fire-incidents/fire-incidents-2016.csv",
52 | inferSchema "true"
53 | )
54 |
55 | -- COMMAND ----------
56 |
57 | -- MAGIC %md
58 | -- MAGIC ### Question 1
59 | -- MAGIC
60 | -- MAGIC Return the first 10 lines of the data. On the Coursera platform, input the result to the following question:
61 | -- MAGIC
62 | -- MAGIC **What is the first value for "Incident Number"?**
63 |
64 | -- COMMAND ----------
65 |
66 | -- TODO
67 | select `Incident Number` from fireIncidents LIMIT 10
68 |
69 | -- COMMAND ----------
70 |
71 | -- MAGIC %md
72 | -- MAGIC ##  `WHERE` Clauses
73 | -- MAGIC
74 | -- MAGIC A `WHERE` clause is used to filter data that meets certain criteria, returning all values that evaluate to be true.
75 |
76 | -- COMMAND ----------
77 |
78 | -- MAGIC %md
79 | -- MAGIC ### Question 2
80 | -- MAGIC
81 | -- MAGIC Return all incidents that occurred on Conor's birthday in 2016. For those of you who forgot his birthday, it's April 4th. On the Coursera platform, input the result to the following question:
82 | -- MAGIC
83 | -- MAGIC **What is the first value for "Incident Number" on April 4th, 2016?**
84 | -- MAGIC
85 | -- MAGIC **Remember to use backticks (\`\`) instead of single quotes ('') for columns that have spaces in the name.**
86 |
87 | -- COMMAND ----------
88 |
89 | select * from fireIncidents
90 |
91 | -- COMMAND ----------
92 |
93 | -- TODO
94 | select `Incident Number`,`Incident Date` from fireIncidents
95 | where `Incident Date` = "04/04/2016"
96 |
97 | -- COMMAND ----------
98 |
99 | -- MAGIC %md
100 | -- MAGIC ### Question 3
101 | -- MAGIC
102 | -- MAGIC Return all incidents that occurred on Conor's _or_ Brooke's birthday. For those of you who forgot her birthday too, it's `9/27`.
103 | -- MAGIC
104 | -- MAGIC **Is the first fire call in this table on Brooke or Conor's birthday?**
105 |
106 | -- COMMAND ----------
107 |
108 | -- TODO
109 |
110 | select * from fireIncidents
111 | where `Incident Date` = "04/04/2016" or `Incident Date` = "09/27/2016"
112 |
113 | -- COMMAND ----------
114 |
115 | -- MAGIC %md
116 | -- MAGIC ### Question 4
117 | -- MAGIC Return all incidents on either Conor or Brooke's birthday where the `Station Area` is greater than 20.
118 | -- MAGIC
119 | -- MAGIC **What is the "Station Area" for the first fire call in this table?**
120 |
121 | -- COMMAND ----------
122 |
123 | -- MAGIC %md
124 | -- MAGIC ##  Aggregate Functions
125 | -- MAGIC
126 | -- MAGIC Aggregate functions compute a single result value from a set of input values. Use the aggregate function `COUNT` to count the total records in the dataset.
127 |
128 | -- COMMAND ----------
129 |
130 | -- TODO
131 | select `Station Area` from fireIncidents
132 | where (`Incident Date` = "04/04/2016" or `Incident Date` = "09/27/2016") and `Station Area` > 20
133 |
134 | -- COMMAND ----------
135 |
136 | -- MAGIC %md
137 | -- MAGIC ### Question 5
138 | -- MAGIC
139 | -- MAGIC Count the incidents on Conor's birthday.
140 | -- MAGIC
141 | -- MAGIC **How many incidents were on Conor's birthday in 2016?**
142 |
143 | -- COMMAND ----------
144 |
145 | -- TODO
146 | select count(`Incident Number`) as incidents from fireIncidents
147 | where `Incident Date` = "04/04/2016"
148 |
149 | -- COMMAND ----------
150 |
151 | -- MAGIC %md-sandbox
152 | -- MAGIC ### Question 6
153 | -- MAGIC
154 | -- MAGIC Return the total counts by `Ignition Cause`. Be sure to return the field `Ignition Cause` as well.
155 | -- MAGIC
156 | -- MAGIC **Hint:** You'll have to use `GROUP BY` for this.
157 | -- MAGIC
158 | -- MAGIC **How many fire calls had an "Ignition Cause" of "4 act of nature"?**
159 |
160 | -- COMMAND ----------
161 |
162 | -- TODO
163 | select `Ignition Cause`,count(`Ignition Cause`) as total from fireIncidents
164 | group by `Ignition Cause`
165 |
166 | -- COMMAND ----------
167 |
168 | -- MAGIC %md-sandbox
169 | -- MAGIC ##  Sorting
170 | -- MAGIC
171 | -- MAGIC ### Question 7
172 | -- MAGIC
173 | -- MAGIC Return the total counts by `Ignition Cause` sorted in ascending order.
174 | -- MAGIC
175 | -- MAGIC **Hint:** You'll have to use `ORDER BY` for this.
176 | -- MAGIC
177 | -- MAGIC **What is the most common "Ignition Cause"? (Put the entire string)**
178 |
179 | -- COMMAND ----------
180 |
181 | -- TODO
182 | select `Ignition Cause`,count(`Ignition Cause`) as total from fireIncidents
183 | group by `Ignition Cause`
184 | order by total asc
185 |
186 | -- COMMAND ----------
187 |
188 | -- MAGIC %md
189 | -- MAGIC Return the total counts by `Ignition Cause` sorted in descending order.
190 |
191 | -- COMMAND ----------
192 |
193 | -- TODO
194 | select `Ignition Cause`,count(`Ignition Cause`) as total from fireIncidents
195 | group by `Ignition Cause`
196 | order by total desc
197 |
198 | -- COMMAND ----------
199 |
200 | -- MAGIC %md
201 | -- MAGIC ##  Joins
202 | -- MAGIC
203 | -- MAGIC Create the table `fireCalls` if it doesn't already exist. The path is as follows: `/mnt/davis/fire-calls/fire-calls-truncated-comma.csv`
204 |
205 | -- COMMAND ----------
206 |
207 | -- TODO
208 | create table if not exists fireCalls
209 | USING csv
210 | OPTIONS (
211 | header "true",
212 | path "/mnt/davis/fire-calls/fire-calls-truncated-comma.csv",
213 | inferSchema "true"
214 | )
215 |
216 | -- COMMAND ----------
217 |
218 | -- MAGIC %md
219 | -- MAGIC Join the two tables on `Battalion` by performing an inner join.
220 |
221 | -- COMMAND ----------
222 |
223 | -- TODO
224 | select * from fireCalls
225 | inner join fireIncidents
226 | on fireCalls.`Battalion`=fireIncidents.`Battalion`
227 |
228 | -- COMMAND ----------
229 |
230 | -- MAGIC %md
231 | -- MAGIC ### Question 8
232 | -- MAGIC
233 | -- MAGIC Count the total incidents from the two tables joined on `Battalion`.
234 | -- MAGIC
235 | -- MAGIC **What is the total number of incidents from the two joined tables?**
236 |
237 | -- COMMAND ----------
238 |
239 | -- TODO
240 | select count(fc.`Incident Number`) as total
241 | from fireCalls fc
242 | inner join fireIncidents fi
243 | on fc.`Battalion`=fi.`Battalion`
244 |
245 |
246 |
247 | -- COMMAND ----------
248 |
249 | -- MAGIC %md
250 | -- MAGIC Congratulations! You made it to the end of the assignment!
251 |
252 | -- COMMAND ----------
253 |
254 | -- MAGIC %md-sandbox
255 | -- MAGIC © 2020 Databricks, Inc. All rights reserved.
256 | -- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
257 | -- MAGIC
258 | -- MAGIC Privacy Policy | Terms of Use | Support
259 |
--------------------------------------------------------------------------------
/Module 1 Quiz.md:
--------------------------------------------------------------------------------
1 | Question 1
2 | Which of the following are true when it comes to the business value of big data? (Select all that apply.)
3 |
4 |
5 | **The size of the data businesses collect is growing**
6 |
7 |
8 | **Businesses are increasingly making data-driven decisions**
9 |
10 |
11 | Automated technologies mean that data scientists and data analysts are no longer needed
12 |
13 |
14 | Question 2
15 | Spark uses...
16 |
17 | (Select all that apply.)
18 |
19 | Your database technology (e.g., Postgres or SQL Server) to run Spark queries
20 |
21 |
22 | **A driver node to distribute work across a number of executor nodes**
23 |
24 |
25 | **A distributed cluster of networked computers made of a driver node and many executor nodes**
26 |
27 |
28 |
29 | One very large computer that is able to run computation against large databases
30 |
31 |
32 | A distributed cluster of networked computers made of many driver nodes and many executor nodes
33 |
34 |
35 | Question 3
36 | How does Spark execute code backed by DataFrames? (Select all that apply.)
37 |
38 | It executes code determined in advance
39 |
40 |
41 | It iterates over all of the source data to exhaustively evaluate queries
42 |
43 |
44 | **It separates the "logical plan" of what you want to accomplish from the "physical plan" of how to do it so it can optimize the query**
45 |
46 | **It optimizes your query by figuring out the best "how" to execute what you want**
47 |
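For reference, this "logical vs. physical plan" separation is easy to see from a notebook. A minimal sketch, assuming the ambient Databricks `spark` session and a hypothetical `fireCalls` table:

```python
# Illustrative only: explain(True) prints the parsed, analyzed, and
# optimized logical plans ("what"), followed by the physical plan ("how").
df = spark.table("fireCalls").groupBy("Battalion").count()
df.explain(True)
```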
48 |
49 | Question 4
50 | What are the properties of Spark DataFrames? (Select all that apply.)
51 |
52 | **Dataset: Collection of partitioned data**
53 |
54 | **Resilient: Fault-tolerant**
55 |
56 | Tables: Operates as any table in SQL environments
57 |
58 |
59 | **Distributed: Computed across multiple nodes**
60 |
61 |
62 |
63 | Question 5
64 | What is the difference between Spark and database technologies? (Select all that apply.)
65 |
66 |
67 | **Spark is a computation engine and is not for data storage**
68 |
69 | Spark does not interact with databases but uses its proprietary DataFrame technology instead
70 |
71 | Spark operates for both data storage and computation
72 |
73 | **Spark is a highly optimized compute engine and is not a database**
74 |
75 | Spark is an alternative to traditional databases
76 |
77 |
78 | Question 6
79 | What is Amdahl's law of scalability? (Select all that apply.)
80 |
81 |
82 | **Amdahl's law states that the speedup of a task is a function of how much of that task can be parallelized**
83 |
84 | A formula that gives the expected speed of a single processor performing a computation
85 |
86 | A formula that gives the theoretical speedup as a function of the size of a partition (or subset) of data
87 |
88 | A formula that gives the number of processors (or other unit of parallelism) needed to complete a task
89 |
90 | **A formula that gives the theoretical speedup as a function of the percentage of a computation that can be parallelized**
91 |
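For reference, the standard statement of Amdahl's law: if a fraction $p$ of a task can be parallelized and $n$ processing units are used, the theoretical speedup is

$$S(n) = \frac{1}{(1 - p) + \frac{p}{n}}$$

so even as $n \to \infty$ the speedup is bounded above by $1/(1 - p)$.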
92 |
93 | Question 7
94 | Spark offers a unified approach to analytics. What does this include? (Select all that apply.)
95 |
96 |
97 | **Spark unifies applications such as SQL queries, streaming, and machine learning**
98 |
99 | Spark unifies databases with optimized computation allowing for faster computation against the data it stores
100 |
101 | **Spark allows analysts, data scientists, and data engineers to all use the same core technology**
102 |
103 | Spark is able to connect to data where it lives in any number of sources, unifying the components of a data application
104 |
105 | **Spark code can be written in the following languages: SQL, Scala, Java, Python, and R**
106 |
107 |
108 | Question 8
109 | What is a Databricks notebook?
110 |
111 | A Spark instance that executes queries
112 |
113 | A cluster that executes Spark code
114 |
115 | **A collaborative, interactive workspace that allows you to execute Spark queries at scale**
116 |
117 | A single Spark query
118 |
119 |
120 |
121 | Question 9
122 | How can you get data into Databricks? (Select all that apply.)
123 |
124 |
125 | **By uploading it through the user interface**
126 |
127 | **By registering the data as a table**
128 |
129 | By connecting to Dropbox or Google Drive
130 |
131 | **By "mounting" data backed by cloud storage**
132 |
133 |
134 | Question 10
135 | What are the qualities of big data? (Select all that apply.)
136 |
137 |
138 |
139 | **Volume: the amount of data**
140 |
141 | Valorous: the positive impact of data
142 |
143 |
144 | **Variety: the diversity of data**
145 |
146 | **Velocity: the speed of data**
147 |
148 | **Veracity: the reliability of data**
149 |
150 |
151 |
152 |
--------------------------------------------------------------------------------
/Module 2 Quiz.md:
--------------------------------------------------------------------------------
1 | Question 1
2 | What are the different units of parallelism? (Select all that apply.)
3 |
4 | **Task**
5 |
6 | **Executor**
7 |
8 | **Core**
9 |
10 | **Partition**
11 |
12 |
13 | Question 2
14 | What is a partition?
15 |
16 |
17 | **A portion of a large distributed set of data**
18 |
19 | A synonym for "task"
20 |
21 | A division of computation that executes a query
22 |
23 | The result of data filtered by a WHERE clause
24 |
25 |
26 | Question 3
27 | What is the difference between in-memory computing and other technologies? (Select all that apply.)
28 |
29 |
30 | **In-memory operations were not realistic in older technologies when memory was more expensive**
31 |
32 | In-memory computing is slower than other types of computing
33 |
34 | **In-memory operates from RAM while other technologies operate from disk**
35 |
36 | **Computation not done in-memory (such as Hadoop) reads and writes from disk in between each step**
37 |
38 |
39 | Question 4
40 | Why is caching important?
41 |
42 |
43 | It always stores data in-memory to improve performance
44 |
45 |
46 | **It stores data on the cluster to improve query performance**
47 |
48 |
49 | It reformats data already stored in RAM for faster access
50 |
51 |
52 | It improves queries against data read one or more times
53 |
54 |
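For reference, a minimal sketch of caching, assuming the ambient Databricks `spark` session and a hypothetical `fireCalls` table; the cache fills lazily on the first action and is reused afterwards:

```python
df = spark.table("fireCalls").cache()   # mark the data for caching on the cluster
df.count()                              # first action materializes the cache
df.count()                              # later reads are served from the cache
# Equivalent SQL form: spark.sql("CACHE TABLE fireCalls")
```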
55 | Question 5
56 | Which of the following is a wide transformation? (Select all that apply.)
57 |
58 |
59 | **ORDER BY**
60 |
61 | **GROUP BY**
62 |
63 | SELECT
64 |
65 | WHERE
66 |
67 |
68 | Question 6
69 | Broadcast joins...
70 |
71 |
72 | Shuffle both of the tables, minimizing data transfer by transferring data in parallel
73 |
74 | Shuffle both of the tables, minimizing computational resources
75 |
76 | **Transfer the smaller of two tables to the larger, minimizing data transfer**
77 |
78 | Transfer the smaller of two tables to the larger, increasing data transfer requirements
79 |
80 |
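For reference, a minimal sketch of a broadcast join with the DataFrame API, assuming the ambient Databricks `spark` session; the table names are illustrative:

```python
from pyspark.sql.functions import broadcast

large_df = spark.table("fireCalls")       # hypothetical large table
small_df = spark.table("fireIncidents")   # hypothetical smaller table

# Ship the smaller table to every executor so the larger one never shuffles.
joined = large_df.join(broadcast(small_df), "Battalion")
```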
81 |
82 | Question 7
83 | When is it appropriate to use a shuffle join?
84 |
85 |
86 |
87 | **When both tables are moderately sized or large**
88 |
89 | When both tables are very small
90 |
91 | Never. Broadcast joins always outperform shuffle joins.
92 |
93 | When the smaller table is significantly smaller than the larger table
94 |
95 |
96 | Question 8
97 | Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)
98 |
99 |
100 | **Shuffle reads**
101 |
102 | **Data Skew**
103 |
104 | Incompatible data formats
105 |
106 | **Shuffle writes**
107 |
108 |
109 | Question 9
110 | What is a stage boundary?
111 |
112 |
113 | An action caused by a SQL query's predicate
114 |
115 | Any transition between Spark tasks
116 |
117 | A narrow transformation
118 |
119 | **When all of the slots or available units of processing have to sync with one another**
120 |
121 |
122 | Question 10
123 | What happens when Spark code is executed in local mode?
124 |
125 |
126 | A cluster of virtual machines is used rather than physical machines
127 |
128 | **The executor and driver are on the same machine**
129 |
130 | The code is executed in the cloud
131 |
132 | The code is executed against a local cluster
133 |
--------------------------------------------------------------------------------
/Module 3 Quiz.md:
--------------------------------------------------------------------------------
1 | Question 1
2 | Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)
3 |
4 |
5 | **It makes updates to new software versions easier**
6 |
7 | Resources are isolated and therefore more manageable and debuggable
8 |
9 | **It results in copies of the data in case of a data center outage**
10 |
11 | **It allows for elastic resources so larger storage or compute resources are used only when needed**
12 |
13 |
14 | Question 2
15 | You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?
16 |
17 | CPU: the transfer of data is more demanding than the computation
18 |
19 | IO: computation is more demanding than the data transfer
20 |
21 | CPU: computation is more demanding than the data transfer
22 |
23 | **IO: the transfer of data is more demanding than the computation**
24 |
25 | Question 3
26 | Processing virtual shopping cart orders in real time is an example of...
27 |
28 | Online Analytical Processing (OLAP)
29 |
30 | **Online Transaction Processing (OLTP)**
31 |
32 |
33 | Question 4
34 | When are BLOB stores an appropriate place to store data? (Select all that apply.)
35 |
36 |
37 |
38 | **For cheap storage**
39 |
40 | **For a "data lake" of largely unstructured data**
41 |
42 | **For storing large files**
43 |
44 | For online transaction processing on a website
45 |
46 |
47 | Question 5
48 | JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?
49 |
50 |
51 |
52 | Specify the number of partitions using COALESCE. Spark then creates one parallel connection for each partition.
53 |
54 |
55 | Specify the numPartitions configuration setting. Spark then creates one parallel connection for each partition.
56 |
57 |
58 | **Specify a column, number of partitions, and the column's minimum and maximum values. Spark then divides that range of values between parallel connections.**
59 |
60 |
61 | Specify the number of partitions using REPARTITION. Spark then creates one parallel connection for each partition.
62 |
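For reference, a minimal sketch of a partitioned JDBC read, assuming the ambient Databricks `spark` session; the URL, table, credentials, and column are all hypothetical. Spark splits the `id` range across `numPartitions` parallel connections:

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")  # hypothetical endpoint
      .option("dbtable", "calls")                         # hypothetical table
      .option("user", "reader")                           # hypothetical credentials
      .option("password", "...")
      .option("partitionColumn", "id")    # numeric column to split on
      .option("lowerBound", "0")          # minimum value of that column
      .option("upperBound", "1000000")    # maximum value of that column
      .option("numPartitions", "8")       # one parallel connection per partition
      .load())
```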
63 |
64 | Question 6
65 | What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)
66 |
67 |
68 | **Compression**
69 |
70 | Corruptible
71 |
72 | **Parallelism**
73 |
74 | **Columnar**
75 |
76 |
77 | Question 7
78 | SQL is normally used to query tabular (or "structured") data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)
79 |
80 |
81 | **It allows for missing data**
82 |
83 | **It allows for complex data types**
84 |
85 | **It allows for data change over time**
86 |
87 | **It does not need a formal structure**
88 |
89 | It allows for easy joins between relational JSON tables
90 |
91 |
92 | Question 8
93 | Data writes in Spark can happen in serial or in parallel. What controls this parallelism?
94 |
95 |
96 | **The number of data partitions in a DataFrame**
97 |
98 | The number of jobs in a write operation
99 |
100 | The number of stages in a write operation
101 |
102 | The numPartitions setting in the Spark configuration
103 |
104 |
105 | Question 9
106 | Fill in the blanks with the appropriate response below:
107 |
108 | A _______ table manages _______and a DROP TABLE command will result in data loss.
109 |
110 |
111 |
112 | Unmanaged, both the data and metadata such as the schema and data location
113 |
114 |
115 | **Managed, both the data and metadata such as the schema and data location**
116 |
117 |
118 | Managed, only the metadata such as the schema and data location
119 |
120 |
121 | Unmanaged, only the metadata such as the schema and data location
122 |
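For reference, a minimal sketch of the distinction, assuming the ambient Databricks `spark` session; the table names and location are illustrative:

```python
# Managed: Spark owns data and metadata, so DROP TABLE deletes the files too.
spark.sql("CREATE TABLE managed_t AS SELECT 1 AS id")

# Unmanaged (external): Spark owns only the metadata; the files survive a drop.
spark.sql("CREATE TABLE ext_t (id INT) USING parquet LOCATION '/tmp/ext_t'")

spark.sql("DROP TABLE managed_t")   # underlying data files are deleted
spark.sql("DROP TABLE ext_t")       # files at /tmp/ext_t remain
```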
--------------------------------------------------------------------------------
/Module 4 Quiz.md:
--------------------------------------------------------------------------------
1 | Question 1
2 | Machine learning is suited to solve which of the following tasks? (Select all that apply.)
3 |
4 |
5 | **Churn Analysis**
6 |
7 | Reporting
8 |
9 | **Natural Language Processing**
10 |
11 | **Image Recognition**
12 |
13 | **Financial Forecasting**
14 |
15 | **Fraud Detection**
16 |
17 | **A/B Testing**
18 |
19 | Question 2
20 | Is a model that is 99% accurate at predicting breast cancer a good model?
21 |
22 |
23 | Likely no because there are too many false positives
24 |
25 | **Likely no because there are not many cases of cancer in a general population**
26 |
27 | Likely yes because this is generally a high score
28 |
29 | Likely yes because it accounts for false negatives and we'd want to make sure we catch every case of cancer
30 |
31 | Question 3
32 | What is an appropriate baseline model to compare a machine learning solution to?
33 |
34 | **The average of the dataset**
35 |
36 | Zero
37 |
38 | The minimum value of the dataset
39 |
40 |
41 | Question 4
42 | What is Machine Learning? (Select all that apply.)
43 |
44 |
45 | Statistical moments calculated against a dataset
46 |
47 | Hand-coded logic
48 |
49 | **Learning patterns in your data without being explicitly programmed**
50 |
51 | **A function that maps features to an output**
52 |
53 |
54 | Question 5
55 | (Fill in the blanks with the appropriate answer below.)
56 |
57 | Predicting whether a website user is fraudulent or not is an example of _________ machine learning. It is a __________ task.
58 |
59 |
60 | **supervised, classification**
61 |
62 | unsupervised, classification
63 |
64 | unsupervised, regression
65 |
66 | supervised, regression
67 |
68 |
69 | Question 6
70 | (Fill in the blanks with the appropriate answer below.)
71 |
72 | Grouping similar users together based on past activity is an example of _________ machine learning. It is a _________ task.
73 |
74 | **unsupervised, clustering**
75 |
76 | unsupervised, classification
77 |
78 | supervised, classification
79 |
80 | supervised, clustering
81 |
82 |
83 | Question 7
84 | Predicting the next quarter of a company's earnings is an example of...
85 |
86 |
87 | Reinforcement
88 |
89 | Classification
90 |
91 | **Regression**
92 |
93 | Semi-supervised
94 |
95 | Clustering
96 |
97 |
98 | Question 8
99 | Why do we want to perform a train/test split before we train a machine learning model? (Select all that apply.)
100 |
101 |
102 | **To keep the model from "overfitting" where it memorizes the data it has seen**
103 |
104 | To calculate a baseline model
105 |
106 | To give us subsets of our data so we can compare a model trained on one versus the model trained on the other
107 |
108 | **To evaluate how our model performs on unseen data**
109 |
110 |
111 | Question 9
112 | What is a linear regression model learning about your data?
113 |
114 |
115 | The value of the closest points to the one you're trying to predict
116 |
117 | **The formula for the line of best fit**
118 |
119 | The best split points in a decision tree
120 |
121 | The average of the data
122 |
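For reference, a linear regression model learns the coefficients of the line (or hyperplane) of best fit,

$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$$

typically by minimizing the squared error $\sum_i (y_i - \hat{y}_i)^2$ over the training data.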
123 |
124 | Question 10
125 | How do you define a custom function not already part of core Spark?
126 |
127 |
128 | You can't write your own functions in Spark
129 |
130 | **With a User-Defined Function**
131 |
132 | By extending the open source code base
133 |
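For reference, a minimal sketch of defining and registering a UDF in PySpark, assuming the ambient Databricks `spark` session; the function and names are illustrative:

```python
from pyspark.sql.types import StringType

def shout(s):
    # Illustrative logic: upper-case a string column, passing nulls through.
    return s.upper() if s is not None else None

# Register for use from SQL, e.g. SELECT shoutUDF(City) FROM fireCalls
spark.udf.register("shoutUDF", shout, StringType())
```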
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Queries-in-Spark-SQL
--------------------------------------------------------------------------------