├── Assignment #1 Quiz - Queries in Spark SQL.md
├── Assignment #2 Quiz - Spark Internals.md
├── Assignment #3 Quiz - Engineering Data Pipelines Graded Quiz .sql
├── Assignment #4 Quiz - Logistic Regression Classifier.sql
├── Assignment.sql
├── Module 1 Quiz. md
├── Module 2 Quiz.md
├── Module 3 Quiz. md
├── Module 4 Quiz.md
└── README.md

/Assignment #1 Quiz - Queries in Spark SQL.md:
--------------------------------------------------------------------------------
Question 1
What is the first value for "Incident Number"?

**16000003**


Question 2
What is the first value for "Incident Number" on April 4th, 2016?

**16037478**


Question 3
Is the first fire call in this table on Brooke or Conor's birthday? Conor's birthday is 4/4 and Brooke's is 9/27 (in MM/DD format).

**Conor's birthday**

Brooke's birthday


Question 4
What is the "Station Area" for the first fire call in this table? Note that this table is a subset of the dataset.

**29**

Question 5
How many incidents were on Conor's birthday in 2016?

**80**

Question 6
How many fire calls had an "Ignition Cause" of "4 act of nature"?

**5**

Question 7
What is the most common "Ignition Cause"?
Hint: Put the entire string.

**2 unintentional**

Question 8
What is the total number of incidents from the two joined tables?

**847094402**

--------------------------------------------------------------------------------
/Assignment #2 Quiz - Spark Internals.md:
--------------------------------------------------------------------------------
Question 1
How many fire calls are in our table?

**240613**

Question 2
Which "Unit Type" is the most common?

**ENGINE**

Question 3
What type of transformation, wide or narrow, did the `GROUP BY` and `ORDER BY` queries result in?

**Wide**

Narrow

Question 4
How many tasks were in the last stage of the last job?

**2**
--------------------------------------------------------------------------------
/Assignment #3 Quiz - Engineering Data Pipelines Graded Quiz .sql:
--------------------------------------------------------------------------------
-- Databricks notebook source
-- MAGIC
-- MAGIC %md-sandbox
-- MAGIC
-- MAGIC
-- MAGIC Databricks Learning
-- MAGIC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Engineering Data Pipelines
-- MAGIC ## Module 3 Assignment
-- MAGIC
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this assignment you:
-- MAGIC * Create a table with persistent data and a specified schema
-- MAGIC * Populate the table with specific entries
-- MAGIC * Change the partition number to compare query speeds
-- MAGIC
-- MAGIC For each **bold** question, input its answer in Coursera.

-- COMMAND ----------

-- MAGIC %run ../Includes/Classroom-Setup

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC Create a table whose data will remain after you drop the table and after the cluster shuts down. Name this table `newTable` and specify its location to be `/tmp/newTableLoc`.
-- MAGIC
-- MAGIC Set up the table to have the following schema:
-- MAGIC
-- MAGIC ```
-- MAGIC `Address` STRING,
-- MAGIC `City` STRING,
-- MAGIC `Battalion` STRING,
-- MAGIC `Box` STRING,
-- MAGIC ```
-- MAGIC
-- MAGIC Run the following cell first to remove any files stored at `/tmp/newTableLoc` before creating our table. Be sure to re-run that cell each time before you re-create `newTable`.
-- MAGIC
-- MAGIC **Side Note:** This course was designed to work with Databricks Runtime 5.5 LTS ML, which uses Spark 2.4. If you are running a later version of the Databricks Runtime, you might have to add an additional `STORED AS parquet` clause to your query [due to a bug](https://issues.apache.org/jira/browse/SPARK-30436).

-- COMMAND ----------

-- MAGIC %python
-- MAGIC # removes files stored at '/tmp/newTableLoc'
-- MAGIC dbutils.fs.rm("/tmp/newTableLoc", True)

-- COMMAND ----------

-- TODO
DROP TABLE IF EXISTS newTable;
CREATE EXTERNAL TABLE newTable (
  `Address` STRING,
  `City` STRING,
  `Battalion` STRING,
  `Box` STRING
)
STORED AS parquet
LOCATION '/tmp/newTableLoc'

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Check that the data type of each column is what we want.
-- MAGIC
-- MAGIC ### Question 1
-- MAGIC **What type of table is `newTable`? "EXTERNAL" or "MANAGED"?**

-- COMMAND ----------

DESCRIBE EXTENDED newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Run the following cell to read in the data stored at `/mnt/davis/fire-calls/fire-calls-truncated.json`. Check that the columns of the data are of the correct types (not all strings).

-- COMMAND ----------

CREATE OR REPLACE TEMPORARY VIEW fireCallsJSON (
  `ALS Unit` boolean,
  `Address` string,
  `Available DtTm` string,
  `Battalion` string,
  `Box` string,
  `Call Date` string,
  `Call Final Disposition` string,
  `Call Number` long,
  `Call Type` string,
  `Call Type Group` string,
  `City` string,
  `Dispatch DtTm` string,
  `Entry DtTm` string,
  `Final Priority` long,
  `Fire Prevention District` string,
  `Hospital DtTm` string,
  `Incident Number` long,
  `Location` string,
  `Neighborhooods - Analysis Boundaries` string,
  `Number of Alarms` long,
  `On Scene DtTm` string,
  `Original Priority` string,
  `Priority` string,
  `Received DtTm` string,
  `Response DtTm` string,
  `RowID` string,
  `Station Area` string,
  `Supervisor District` string,
  `Transport DtTm` string,
  `Unit ID` string,
  `Unit Type` string,
  `Unit sequence in call dispatch` long,
  `Watch Date` string,
  `Zipcode of Incident` long
)
USING JSON
OPTIONS (
  path "/mnt/davis/fire-calls/fire-calls-truncated.json"
);

DESCRIBE fireCallsJSON
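
-- COMMAND ----------

-- MAGIC %md
-- MAGIC _Aside (not part of the graded assignment):_ declaring the schema explicitly, as above, avoids scanning the JSON file. As a sanity check, you could also let Spark infer the schema and compare it against the declared one. A minimal sketch, assuming the file is readable from Python:

-- COMMAND ----------

-- MAGIC %python
-- MAGIC # Schema inference scans the data (slower), but is handy for verifying column types.
-- MAGIC inferred = spark.read.json("/mnt/davis/fire-calls/fire-calls-truncated.json")
-- MAGIC inferred.printSchema()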

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Take a look at the table to make sure it looks correct.

-- COMMAND ----------

SELECT * FROM fireCallsJSON LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now let's populate `newTable` with some of the rows from the `fireCallsJSON` table you just loaded. We only want to include fire calls whose `Final Priority` is `3`.

-- COMMAND ----------

-- TODO
INSERT INTO newTable
SELECT Address, City, Battalion, Box
FROM fireCallsJSON
WHERE `Final Priority` = 3;

-- COMMAND ----------

SELECT * FROM newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 2
-- MAGIC
-- MAGIC **How many rows are in `newTable`?**

-- COMMAND ----------

-- TODO
SELECT COUNT(*) FROM newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC
-- MAGIC Sort the rows of `newTable` by ascending `Battalion`.
-- MAGIC
-- MAGIC ### Question 3
-- MAGIC
-- MAGIC **What is the "Battalion" of the first entry in the sorted table?**

-- COMMAND ----------

-- TODO
SELECT * FROM newTable
ORDER BY Battalion ASC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Let's see how this table is stored in our file system.
-- MAGIC
-- MAGIC Note: You should have specified the location of the table to be `/tmp/newTableLoc` when you created it.

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/tmp/newTableLoc

-- COMMAND ----------

-- MAGIC %md
-- MAGIC
-- MAGIC First run the following cell to check how many partitions are in this table. Did the number of partitions match the number of files our data was stored in?

-- COMMAND ----------
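
-- MAGIC %python
-- MAGIC # This cell was blank in the original notebook; a minimal sketch of the check it
-- MAGIC # describes, assuming the usual RDD-based way of counting a table's partitions:
-- MAGIC spark.table("newTable").rdd.getNumPartitions()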

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Let's try increasing the number of partitions to 256. Create this as a new table and call it `newTablePartitioned`.

-- COMMAND ----------

-- TODO
CREATE TABLE IF NOT EXISTS newTablePartitioned
AS
SELECT /*+ REPARTITION(256) */ *
FROM newTable

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now let's take a look at how this new table is stored.

-- COMMAND ----------

DESCRIBE EXTENDED newTablePartitioned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Copy the location of `newTablePartitioned` from the table above and take a look at the files stored at that location. How many parts is our data stored in now?

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/user/hive/warehouse/databricks.db/newtablepartitioned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now sort the rows of `newTablePartitioned` by ascending `Battalion` and compare how long this query takes.
-- MAGIC
-- MAGIC ### Question 4
-- MAGIC
-- MAGIC **Was this query faster or slower on the table with increased partitions?**

-- COMMAND ----------

SELECT * FROM newTablePartitioned ORDER BY `Battalion`

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Run the following cell to see where the data of the original `newTable` is stored.

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/tmp/newTableLoc

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now drop the table `newTable`.

-- COMMAND ----------

DROP TABLE newTable;

-- The following line should error!
-- SELECT * FROM newTable;

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 5
-- MAGIC
-- MAGIC **Does the data stored within the table still exist at the original location (`dbfs:/tmp/newTableLoc`) after you dropped the table? (Answer "yes" or "no")**

-- COMMAND ----------

-- MAGIC %fs ls dbfs:/tmp/newTableLoc

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC © 2020 Databricks, Inc. All rights reserved.
-- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
-- MAGIC
-- MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------
/Assignment #4 Quiz - Logistic Regression Classifier.sql:
--------------------------------------------------------------------------------
-- Databricks notebook source
-- MAGIC
-- MAGIC %md-sandbox
-- MAGIC
-- MAGIC
-- MAGIC Databricks Learning
-- MAGIC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Logistic Regression Classifier
-- MAGIC ## Module 4 Assignment
-- MAGIC
-- MAGIC This final assignment is broken up into 2 parts:
-- MAGIC 1. Completing this Logistic Regression Classifier notebook
-- MAGIC    * Submitting question answers to Coursera
-- MAGIC    * Uploading the notebook to Coursera for peer review
-- MAGIC 2. Answering 3 free response questions on the Coursera platform
-- MAGIC
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:
-- MAGIC * Preprocess data for use in a machine learning model
-- MAGIC * Step through creating a sklearn logistic regression model for classification
-- MAGIC * Predict the `Call_Type_Group` for incidents in a SQL table
-- MAGIC
-- MAGIC For each **bold** question, input its answer in Coursera.

-- COMMAND ----------

-- MAGIC %run ../Includes/Classroom-Setup

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Load the `/mnt/davis/fire-calls/fire-calls-clean.parquet` data as the `fireCallsClean` table.

-- COMMAND ----------

-- TODO
USE DATABRICKS;

DROP TABLE IF EXISTS fireCallsClean;
CREATE TABLE fireCallsClean
USING parquet
OPTIONS (
  path "/mnt/davis/fire-calls/fire-calls-clean.parquet"
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Check that your data is loaded in properly.

-- COMMAND ----------

SELECT * FROM fireCallsClean LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC By the end of this assignment, we would like to train a logistic regression model to predict 2 of the most common `Call_Type_Group` values given information from the rest of the table.

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Write a query to see what the different `Call_Type_Group` values are and their respective counts.
-- MAGIC
-- MAGIC ### Question 1
-- MAGIC
-- MAGIC **How many calls have a `Call_Type_Group` of "Fire"?**

-- COMMAND ----------

-- TODO
SELECT `Call_Type_Group`, COUNT(`Call_Type_Group`) AS numberOfCalls
FROM fireCallsClean
GROUP BY `Call_Type_Group`

-- COMMAND ----------

SELECT COUNT(`Call_Type_Group`)
FROM fireCallsClean

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Let's drop all the rows where `Call_Type_Group = null`. Since we don't have a lot of `Call_Type_Group` entries with the value `Alarm` or `Fire`, we will also drop these calls from the table. Call this new temporary view `fireCallsGroupCleaned`.

-- COMMAND ----------

-- TODO
CREATE OR REPLACE TEMPORARY VIEW fireCallsGroupCleaned
AS
SELECT *
FROM fireCallsClean
WHERE Call_Type_Group IS NOT NULL AND Call_Type_Group NOT IN ('Alarm', 'Fire')

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Check that every entry in `fireCallsGroupCleaned` has a `Call_Type_Group` of either `Potentially Life-Threatening` or `Non Life-threatening`.

-- COMMAND ----------

-- TODO
SELECT Call_Type_Group, COUNT(*)
FROM fireCallsGroupCleaned
GROUP BY Call_Type_Group

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 2
-- MAGIC
-- MAGIC **How many rows are in `fireCallsGroupCleaned`?**

-- COMMAND ----------

SELECT COUNT(*) FROM fireCallsGroupCleaned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC We probably don't need all the columns of `fireCallsGroupCleaned` to make our prediction. Select the following columns from `fireCallsGroupCleaned` and create a view called `fireCallsDF` so we can access this table in Python:
-- MAGIC
-- MAGIC * "Call_Type"
-- MAGIC * "Fire_Prevention_District"
-- MAGIC * "Neighborhooods_-\_Analysis_Boundaries"
-- MAGIC * "Number_of_Alarms"
-- MAGIC * "Original_Priority"
-- MAGIC * "Unit_Type"
-- MAGIC * "Battalion"
-- MAGIC * "Call_Type_Group"

-- COMMAND ----------

-- TODO
CREATE OR REPLACE TEMPORARY VIEW fireCallsDF
AS
SELECT Call_Type, Fire_Prevention_District, `Neighborhooods_-_Analysis_Boundaries`, Number_of_Alarms, Original_Priority, Unit_Type, Battalion, Call_Type_Group
FROM fireCallsGroupCleaned

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Fill in the SQL statement to load the `fireCallsDF` view you just created into Python.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC # TODO
-- MAGIC spark.conf.set("spark.sql.execution.arrow.enabled", "true")
-- MAGIC df = sql("SELECT * FROM fireCallsDF")
-- MAGIC display(df)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## Creating a Logistic Regression Model in Sklearn

-- COMMAND ----------

-- MAGIC %md
-- MAGIC First we will convert the Spark DataFrame to pandas so we can use sklearn. We preprocess the data into numbers with a [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) so that it is compatible with the logistic regression algorithm.
-- MAGIC
-- MAGIC Then we'll perform a train test split on our pandas DataFrame. Remember that the column we are trying to predict is `Call_Type_Group`.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC from sklearn.model_selection import train_test_split
-- MAGIC from sklearn.preprocessing import LabelEncoder
-- MAGIC
-- MAGIC pdDF = df.toPandas()
-- MAGIC le = LabelEncoder()
-- MAGIC numerical_pdDF = pdDF.apply(le.fit_transform)
-- MAGIC
-- MAGIC X = numerical_pdDF.drop("Call_Type_Group", axis=1)
-- MAGIC y = numerical_pdDF["Call_Type_Group"].values
-- MAGIC X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Look at our training data `X_train`, which should only have numerical values now.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC display(X_train)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC We'll create a pipeline with 2 steps.
-- MAGIC
-- MAGIC 0. [One Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder): Converts our features into vectorized features by creating a dummy column for each value in each category.
-- MAGIC
-- MAGIC 0. [Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html): Although the name includes "regression", it is used for classification by predicting the probability that the `Call_Type_Group` is one label and not the other.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC from sklearn.linear_model import LogisticRegression
-- MAGIC from sklearn.preprocessing import OneHotEncoder
-- MAGIC from sklearn.pipeline import Pipeline
-- MAGIC
-- MAGIC ohe = ("ohe", OneHotEncoder(handle_unknown="ignore"))
-- MAGIC lr = ("lr", LogisticRegression())
-- MAGIC
-- MAGIC pipeline = Pipeline(steps=[ohe, lr]).fit(X_train, y_train)
-- MAGIC y_pred = pipeline.predict(X_test)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Run the following cell to see how well our model performed on test data (data that wasn't used to train the model)!

-- COMMAND ----------

-- MAGIC %python
-- MAGIC from sklearn.metrics import accuracy_score
-- MAGIC print(f"Accuracy of model: {accuracy_score(y_test, y_pred)}")

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 3
-- MAGIC
-- MAGIC **What is the accuracy of our model on test data? Round to the nearest percent.**

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Save the pipeline (with both stages) to disk.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC import mlflow
-- MAGIC from mlflow.sklearn import save_model
-- MAGIC
-- MAGIC model_path = "/dbfs/" + username + "/Call_Type_Group_lr"
-- MAGIC dbutils.fs.rm(username + "/Call_Type_Group_lr", recurse=True)
-- MAGIC save_model(pipeline, model_path)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## UDF

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now that we have created and trained a machine learning pipeline, we will use MLflow to register the `.predict` function of the sklearn pipeline as a UDF which we can later apply in parallel. We can then refer to it by the name `predictUDF` in SQL.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC import mlflow
-- MAGIC from mlflow.pyfunc import spark_udf
-- MAGIC
-- MAGIC predict = spark_udf(spark, model_path, result_type="int")
-- MAGIC spark.udf.register("predictUDF", predict)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Create a view called `testTable` of our test data `X_test` so that we can see this table in SQL.

-- COMMAND ----------

-- MAGIC %python
-- MAGIC spark_df = spark.createDataFrame(X_test)
-- MAGIC spark_df.createOrReplaceTempView("testTable")

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Create a table called `predictions` using the `predictUDF` function we registered beforehand. Apply `predictUDF` to every row of `testTable` in parallel so that each row of `testTable` has a `Call_Type_Group` prediction.

-- COMMAND ----------

-- TODO
USE DATABRICKS;

DROP TABLE IF EXISTS predictions;

CREATE TABLE predictions AS (
  SELECT *, CAST(predictUDF(Call_Type, Fire_Prevention_District, `Neighborhooods_-_Analysis_Boundaries`, Number_of_Alarms, Original_Priority, Unit_Type, Battalion) AS double) AS prediction
  FROM testTable
  -- LIMIT 10000
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Now take a look at the table and see what your model predicted for each call entry!

-- COMMAND ----------

SELECT * FROM predictions LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 4:
-- MAGIC
-- MAGIC **What 2 values are in the `prediction` column?**

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Congrats on finishing your last assignment notebook!
-- MAGIC
-- MAGIC Now you will have to upload this notebook to Coursera for peer review.
-- MAGIC 1. Make sure that all your code will run without errors
-- MAGIC    * Check this by clicking the "Clear State & Run All" dropdown option at the top of your notebook
-- MAGIC    * ![](http://files.training.databricks.com/images/eLearning/ucdavis/clearstaterunall.png)
-- MAGIC 2. Click on the "Workspace" icon on the side bar
-- MAGIC 3. Next to the notebook you're working in right now, click on the dropdown arrow
-- MAGIC 4. In the dropdown, click on "Export" then "HTML"
-- MAGIC    * ![](http://files.training.databricks.com/images/eLearning/ucdavis/export.png)
-- MAGIC 5. On the Coursera platform, upload this HTML file to Week 4's Peer Review Assignment
-- MAGIC
-- MAGIC Go back onto the Coursera platform for the free response portion of this assignment and for instructions on how to review your peer's work.

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC © 2020 Databricks, Inc. All rights reserved.
-- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
-- MAGIC
-- MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------
/Assignment.sql:
--------------------------------------------------------------------------------
-- Databricks notebook source
-- MAGIC
-- MAGIC %md-sandbox
-- MAGIC
-- MAGIC
-- MAGIC Databricks Learning
-- MAGIC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Queries in Spark SQL
-- MAGIC ## Module 1 Assignment
-- MAGIC
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this assignment you:
-- MAGIC * Create a table
-- MAGIC * Write SQL queries
-- MAGIC
-- MAGIC For each **bold** question, input its answer in Coursera.

-- COMMAND ----------

-- MAGIC %run ../Includes/Classroom-Setup

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Working with Incident Data
-- MAGIC
-- MAGIC For this assignment, we'll be using a new dataset: the [SF Fire Incident](https://data.sfgov.org/Public-Safety/Fire-Incidents/wr8u-xric) dataset. It has been mounted for you using the script above. The path to this dataset is as follows:
-- MAGIC
-- MAGIC `/mnt/davis/fire-incidents/fire-incidents-2016.csv`
-- MAGIC
-- MAGIC In this assignment, you will read the dataset and perform a number of different queries.

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create a Table
-- MAGIC
-- MAGIC Create a new table called `fireIncidents` for this dataset. Be sure to use options to properly parse the data.

-- COMMAND ----------

-- TODO
CREATE TABLE fireIncidents
USING csv
OPTIONS (
  header "true",
  path "/mnt/davis/fire-incidents/fire-incidents-2016.csv",
  inferSchema "true"
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 1
-- MAGIC
-- MAGIC Return the first 10 lines of the data. On the Coursera platform, input the result to the following question:
-- MAGIC
-- MAGIC **What is the first value for "Incident Number"?**

-- COMMAND ----------

-- TODO
SELECT `Incident Number` FROM fireIncidents LIMIT 10

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) `WHERE` Clauses
-- MAGIC
-- MAGIC A `WHERE` clause is used to filter data that meets certain criteria, returning all values that evaluate to true.

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 2
-- MAGIC
-- MAGIC Return all incidents that occurred on Conor's birthday in 2016. For those of you who forgot his birthday, it's April 4th. On the Coursera platform, input the result to the following question:
-- MAGIC
-- MAGIC **What is the first value for "Incident Number" on April 4th, 2016?**
-- MAGIC
-- MAGIC **Remember to use backticks (\`\`) instead of single quotes ('') for columns that have spaces in the name.**

-- COMMAND ----------

SELECT * FROM fireIncidents

-- COMMAND ----------

-- TODO
SELECT `Incident Number`, `Incident Date` FROM fireIncidents
WHERE `Incident Date` = "04/04/2016"

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 3
-- MAGIC
-- MAGIC Return all incidents that occurred on Conor's _or_ Brooke's birthday. For those of you who forgot her birthday too, it's `9/27`.
-- MAGIC
-- MAGIC **Is the first fire call in this table on Brooke or Conor's birthday?**

-- COMMAND ----------

-- TODO
SELECT * FROM fireIncidents
WHERE `Incident Date` = "04/04/2016" OR `Incident Date` = "09/27/2016"

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 4
-- MAGIC Return all incidents on either Conor or Brooke's birthday where the `Station Area` is greater than 20.
-- MAGIC
-- MAGIC **What is the "Station Area" for the first fire call in this table?**

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Aggregate Functions
-- MAGIC
-- MAGIC Aggregate functions compute a single result value from a set of input values. Use the aggregate function `COUNT` to count the total records in the dataset.

-- COMMAND ----------

-- TODO
SELECT `Station Area` FROM fireIncidents
WHERE (`Incident Date` = "04/04/2016" OR `Incident Date` = "09/27/2016") AND `Station Area` > 20

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 5
-- MAGIC
-- MAGIC Count the incidents on Conor's birthday.
-- MAGIC
-- MAGIC **How many incidents were on Conor's birthday in 2016?**

-- COMMAND ----------

-- TODO
SELECT COUNT(`Incident Number`) AS incidents FROM fireIncidents
WHERE `Incident Date` = "04/04/2016"

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC ### Question 6
-- MAGIC
-- MAGIC Return the total counts by `Ignition Cause`. Be sure to return the field `Ignition Cause` as well.
-- MAGIC
-- MAGIC **Hint:** You'll have to use `GROUP BY` for this.
-- MAGIC
-- MAGIC **How many fire calls had an "Ignition Cause" of "4 act of nature"?**

-- COMMAND ----------

-- TODO
SELECT `Ignition Cause`, COUNT(`Ignition Cause`) AS total FROM fireIncidents
GROUP BY `Ignition Cause`

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Sorting
-- MAGIC
-- MAGIC ### Question 7
-- MAGIC
-- MAGIC Return the total counts by `Ignition Cause` sorted in ascending order.
-- MAGIC
-- MAGIC **Hint:** You'll have to use `ORDER BY` for this.
-- MAGIC
-- MAGIC **What is the most common "Ignition Cause"? (Put the entire string)**

-- COMMAND ----------

-- TODO
SELECT `Ignition Cause`, COUNT(`Ignition Cause`) AS total FROM fireIncidents
GROUP BY `Ignition Cause`
ORDER BY total ASC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Return the total counts by `Ignition Cause` sorted in descending order.

-- COMMAND ----------

-- TODO
SELECT `Ignition Cause`, COUNT(`Ignition Cause`) AS total FROM fireIncidents
GROUP BY `Ignition Cause`
ORDER BY total DESC

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Joins
-- MAGIC
-- MAGIC Create the table `fireCalls` if it doesn't already exist.
-- MAGIC The path is as follows: `/mnt/davis/fire-calls/fire-calls-truncated-comma.csv`

-- COMMAND ----------

-- TODO
CREATE TABLE IF NOT EXISTS fireCalls
USING csv
OPTIONS (
  header "true",
  path "/mnt/davis/fire-calls/fire-calls-truncated-comma.csv",
  inferSchema "true"
)

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Join the two tables on `Battalion` by performing an inner join.

-- COMMAND ----------

-- TODO
SELECT * FROM fireCalls
INNER JOIN fireIncidents
ON fireCalls.`Battalion` = fireIncidents.`Battalion`

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ### Question 8
-- MAGIC
-- MAGIC Count the total incidents from the two tables joined on `Battalion`.
-- MAGIC
-- MAGIC **What is the total number of incidents from the two joined tables?**

-- COMMAND ----------

-- TODO
SELECT COUNT(fc.`Incident Number`) AS total
FROM fireCalls fc
INNER JOIN fireIncidents fi
ON fc.`Battalion` = fi.`Battalion`

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Congratulations! You made it to the end of the assignment!

-- COMMAND ----------

-- MAGIC %md-sandbox
-- MAGIC © 2020 Databricks, Inc. All rights reserved.
-- MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
-- MAGIC
-- MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------
/Module 1 Quiz. md:
--------------------------------------------------------------------------------
Question 1
Which of the following are true when it comes to the business value of big data? (Select all that apply.)

**The size of the data businesses collect is growing**

**Businesses are increasingly making data-driven decisions**

Automated technologies mean that data scientists and data analysts are no longer needed


Question 2
Spark uses... (Select all that apply.)

Your database technology (e.g., Postgres or SQL Server) to run Spark queries

**A driver node to distribute work across a number of executor nodes**

**A distributed cluster of networked computers made of a driver node and many executor nodes**

One very large computer that is able to run computation against large databases

A distributed cluster of networked computers made of many driver nodes and many executor nodes


Question 3
How does Spark execute code backed by DataFrames? (Select all that apply.)

It executes code determined in advance

It iterates over all of the source data to exhaustively evaluate queries

**It separates the "logical plan" of what you want to accomplish from the "physical plan" of how to do it so it can optimize the query**

**It optimizes your query by figuring out the best "how" to execute what you want**


Question 4
What are the properties of Spark DataFrames? (Select all that apply.)

**Dataset: Collection of partitioned data**

**Resilient: Fault-tolerant**

Tables: Operates as any table in SQL environments

**Distributed: Computed across multiple nodes**


Question 5
What is the difference between Spark and database technologies? (Select all that apply.)

**Spark is a computation engine and is not for data storage**

Spark does not interact with databases but uses its proprietary DataFrame technology instead

Spark operates for both data storage and computation

**Spark is a highly optimized compute engine and is not a database**

Spark is an alternative to traditional databases


Question 6
What is Amdahl's law of scalability? (Select all that apply.)

**Amdahl's law states that the speedup of a task is a function of how much of that task can be parallelized**

A formula that gives the expected speed of a single processor performing a computation

A formula that gives the theoretical speedup as a function of the size of a partition (or subset) of data

A formula that gives the number of processors (or other unit of parallelism) needed to complete a task

**A formula that gives the theoretical speedup as a function of the percentage of a computation that can be parallelized**


Question 7
Spark offers a unified approach to analytics. What does this include? (Select all that apply.)

**Spark unifies applications such as SQL queries, streaming, and machine learning**

Spark unifies databases with optimized computation allowing for faster computation against the data it stores

**Spark allows analysts, data scientists, and data engineers to all use the same core technology**

Spark is able to connect to data where it lives in any number of sources, unifying the components of a data application

**Spark code can be written in the following languages: SQL, Scala, Java, Python, and R**


Question 8
What is a Databricks notebook?

A Spark instance that executes queries

A cluster that executes Spark code

**A collaborative, interactive workspace that allows you to execute Spark queries at scale**

A single Spark query


Question 9
How can you get data into Databricks? (Select all that apply.)

**By uploading it through the user interface**

**By registering the data as a table**

By connecting to Dropbox or Google Drive

**By "mounting" data backed by cloud storage**


Question 10
What are the qualities of big data? (Select all that apply.)

**Volume: the amount of data**

Valorous: the positive impact of data

**Variety: the diversity of data**

**Velocity: the speed of data**

**Veracity: the reliability of data**
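
As a worked illustration of Question 6, a minimal sketch (not from the course) of Amdahl's law, which gives the theoretical speedup when a fraction `p` of a computation can be parallelized across `n` units:

```python
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n)
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# A job that is 95% parallelizable tops out at 1 / (1 - 0.95) = 20x,
# no matter how many executors are added.
for n in (2, 8, 64, 1024):
    print(f"n={n}: {amdahl_speedup(0.95, n):.2f}x")
```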
--------------------------------------------------------------------------------
/Module 2 Quiz.md:
--------------------------------------------------------------------------------
Question 1
What are the different units of parallelism? (Select all that apply.)

**Task**

**Executor**

**Core**

**Partition**


Question 2
What is a partition?

**A portion of a large distributed set of data**

A synonym for "task"

A division of computation that executes a query

The result of data filtered by a WHERE clause


Question 3
What is the difference between in-memory computing and other technologies? (Select all that apply.)

**In-memory operations were not realistic in older technologies when memory was more expensive**

In-memory computing is slower than other types of computing

**In-memory computing operates from RAM while other technologies operate from disk**

**Computation not done in-memory (such as Hadoop) reads and writes from disk in between each step**


Question 4
Why is caching important?

It always stores data in-memory to improve performance

**It stores data on the cluster to improve query performance**

It reformats data already stored in RAM for faster access

It improves queries against data read one or more times


Question 5
Which of the following is a wide transformation? (Select all that apply.)

**ORDER BY**

**GROUP BY**

SELECT

WHERE


Question 6
Broadcast joins...

Shuffle both of the tables, minimizing data transfer by transferring data in parallel

Shuffle both of the tables, minimizing computational resources

**Transfer the smaller of two tables to the larger, minimizing data transfer**

Transfer the smaller of two tables to the larger, increasing data transfer requirements


Question 7
When is it appropriate to use a shuffle join?

**When both tables are moderately sized or large**

When both tables are very small

Never. Broadcast joins always out-perform shuffle joins.

When the smaller table is significantly smaller than the larger table


Question 8
Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)

**Shuffle reads**

**Data Skew**

Incompatible data formats

**Shuffle writes**


Question 9
What is a stage boundary?

An action caused by a SQL query's predicate

Any transition between Spark tasks

A narrow transformation

**When all of the slots or available units of processing have to sync with one another**


Question 10
What happens when Spark code is executed in local mode?

A cluster of virtual machines is used rather than physical machines

**The executor and driver are on the same machine**

The code is executed in the cloud

The code is executed against a local cluster
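
A sketch of the ideas behind Questions 5 and 6 (the `battalionLookup` table is hypothetical; adapt the names to your own data): a `WHERE` filter is narrow, a `GROUP BY` forces a shuffle, and a broadcast hint ships the smaller table to the larger one:

```python
# Narrow: each partition is filtered independently; no shuffle, no stage boundary.
narrow = spark.sql("SELECT * FROM fireCalls WHERE `Final Priority` = 3")

# Wide: grouping shuffles rows so matching keys land together (a stage boundary).
wide = spark.sql("SELECT Battalion, COUNT(*) AS n FROM fireCalls GROUP BY Battalion")

# Broadcast join: copy the small dimension table to every executor instead of
# shuffling both sides of the join.
joined = spark.sql("""
  SELECT /*+ BROADCAST(b) */ *
  FROM fireCalls f
  JOIN battalionLookup b ON f.Battalion = b.Battalion
""")
```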
--------------------------------------------------------------------------------
/Module 3 Quiz. md:
--------------------------------------------------------------------------------
Question 1
Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)

+ It makes updates to new software versions easier

Resources are isolated and therefore more manageable and debuggable

+ It results in copies of the data in case of a data center outage

+ It allows for elastic resources so larger storage or compute resources are used only when needed


Question 2
You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?

CPU: the transfer of data is more demanding than the computation

IO: computation is more demanding than the data transfer

CPU: computation is more demanding than the data transfer

+ IO: the transfer of data is more demanding than the computation


Question 3
Processing virtual shopping cart orders in real time is an example of...

Online Analytical Processing (OLAP)

+ Online Transaction Processing (OLTP)


Question 4
When are BLOB stores an appropriate place to store data? (Select all that apply.)

+ For cheap storage

+ For a "data lake" of largely unstructured data

+ For storing large files

For online transaction processing on a website


Question 5
JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?

Specify the number of partitions using COALESCE. Spark then creates one parallel connection for each partition.

Specify the numPartitions configuration setting. Spark then creates one parallel connection for each partition.

+ Specify a column, number of partitions, and the column's minimum and maximum values. Spark then divides that range of values between parallel connections.

Specify the number of partitions using REPARTITION. Spark then creates one parallel connection for each partition.


Question 6
What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)

+ Compression

Corruptible

+ Parallelism

+ Columnar


Question 7
SQL is normally used to query tabular (or "structured") data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)

+ It allows for missing data

+ It allows for complex data types

+ It allows for data change over time

+ It does not need a formal structure

It allows for easy joins between relational JSON tables


Question 8
Data writes in Spark can happen in serial or in parallel. What controls this parallelism?

+ The number of data partitions in a DataFrame

The number of jobs in a write operation

The number of stages in a write operation

The numPartitions setting in the Spark configuration


Question 9
Fill in the blanks with the appropriate response below:

A _______ table manages _______ and a DROP TABLE command will result in data loss.

Unmanaged, both the data and metadata such as the schema and data location

+ Managed, both the data and metadata such as the schema and data location

Managed, only the metadata such as the schema and data location

Unmanaged, only the metadata such as the schema and data location
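
Question 5 in code: a hedged sketch of a partitioned JDBC read (the connection details and column names below are placeholders, not from the course):

```python
# Spark splits [lowerBound, upperBound] of partitionColumn into numPartitions
# ranges and opens one parallel JDBC connection per range.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   # placeholder
      .option("dbtable", "fire_calls")                   # placeholder
      .option("user", "username")                        # placeholder
      .option("password", "password")                    # placeholder
      .option("partitionColumn", "incident_number")
      .option("lowerBound", "0")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())
```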
--------------------------------------------------------------------------------
/Module 4 Quiz.md:
--------------------------------------------------------------------------------
Question 1
Machine learning is suited to solve which of the following tasks? (Select all that apply.)

**Churn Analysis**

Reporting

**Natural Language Processing**

**Image Recognition**

**Financial Forecasting**

**Fraud Detection**

**A/B Testing**

Question 2
Is a model that is 99% accurate at predicting breast cancer a good model?

Likely no because there are too many false positives

**Likely no because there are not many cases of cancer in a general population**

Likely yes because this is generally a high score

Likely yes because it accounts for false negatives and we'd want to make sure we catch every case of cancer

Question 3
What is an appropriate baseline model to compare a machine learning solution to?

**The average of the dataset**

Zero

The minimum value of the dataset


Question 4
What is Machine Learning? (Select all that apply.)

Statistical moments calculated against a dataset

Hand-coded logic

**Learning patterns in your data without being explicitly programmed**

**A function that maps features to an output**


Question 5
(Fill in the blanks with the appropriate answer below.)

Predicting whether a website user is fraudulent or not is an example of _________ machine learning. It is a __________ task.

**supervised, classification**

unsupervised, classification

unsupervised, regression

supervised, regression


Question 6
(Fill in the blanks with the appropriate answer below.)

Grouping similar users together based on past activity is an example of _________ machine learning. It is a _________ task.

**unsupervised, clustering**

unsupervised, classification

supervised, classification

supervised, clustering


Question 7
Predicting the next quarter of a company's earnings is an example of...

Reinforcement

Classification

**Regression**

Semi-supervised

Clustering


Question 8
Why do we want to perform a train/test split before we train a machine learning model? (Select all that apply.)

**To keep the model from "overfitting" where it memorizes the data it has seen**

To calculate a baseline model

To give us subsets of our data so we can compare a model trained on one versus the model trained on the other

**To evaluate how our model performs on unseen data**


Question 9
What is a linear regression model learning about your data?

The value of the closest points to the one you're trying to predict

**The formula for the line of best fit**

The best split points in a decision tree

The average of the data


Question 10
How do you define a custom function not already part of core Spark?

You can't write your own functions in Spark

**With a User-Defined Function**

By extending the open source code base
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Queries-in-Spark-SQL
--------------------------------------------------------------------------------