├── Distributed-Computing-with-Spark-SQL-1.2.3.dbc
├── README.md
├── assignments
    ├── module1-assignment1.md
    ├── module2-assignment2.md
    ├── module3-assignment3.md
    └── module4-assignment4.md
└── quizes
    ├── module1-quiz.md
    ├── module2-quiz.md
    ├── module3-quiz.md
    └── module4-quiz.md

/Distributed-Computing-with-Spark-SQL-1.2.3.dbc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/67cc61080727dd224761cc330de966ebe9d4b81a/Distributed-Computing-with-Spark-SQL-1.2.3.dbc
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Distributed Computing with Spark SQL
2 | This course is offered by the University of California, Davis on Coursera and provides a comprehensive overview of distributed computing using Spark.
3 | 
4 | The four modules build on one another and cover:
5 | - Spark architecture
6 | - Spark DataFrames
7 | - Optimizing reading/writing data
8 | - How to build a machine learning model
9 | 
10 | By understanding when to use Spark, whether to scale out because the model or data is too large to process on a single machine or simply to get results faster, students like me will hone their SQL skills and become more adept data scientists.
11 | 
12 | This repository includes the following:
13 | #### 1. [Assignments](https://github.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/tree/main/assignments)
14 | 
15 | #### 2. [Quizzes](https://github.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/tree/main/quizes)
16 | 
17 | #### 3. [Notebooks](https://github.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/blob/main/Distributed-Computing-with-Spark-SQL-1.2.3.dbc)
18 | 
19 | 
20 | 
21 | 
--------------------------------------------------------------------------------
/assignments/module1-assignment1.md:
--------------------------------------------------------------------------------
1 | 1. What is the first value for "Incident Number"?
2 | 
3 | - 16000003
4 | 
5 | 2. What is the first value for "Incident Number" on April 4th, 2016?
6 | 
7 | - 16037478
8 | 
9 | 3. Is the first fire call in this table on Brooke or Conor's birthday? Conor's birthday is 4/4 and Brooke's is 9/27 (in MM/DD format).
10 | 
11 | - Conor
12 | 
13 | 4. What is the "Station Area" for the first fire call in this table? Note that this table is a subset of the dataset.
14 | 
15 | - 29
16 | 
17 | 5. How many incidents were on Conor's birthday in 2016?
18 | 
19 | - 80
20 | 
21 | 6. How many fire calls had an "Ignition Cause" of "4 act of nature"?
22 | 
23 | - 5
24 | 
25 | 7. What is the most common "Ignition Cause"?
26 | 
27 | - 2 unintentional
28 | 
29 | 8. What is the total incidents from the two joined tables?
30 | 
31 | - 847094402
--------------------------------------------------------------------------------
/assignments/module2-assignment2.md:
--------------------------------------------------------------------------------
1 | 1. How many fire calls are in our table?
2 | 
3 | - 240613
4 | 
5 | 2. Which "Unit Type" is the most common?
6 | 
7 | - ENGINE
8 | 
9 | 3. What type of transformation, wide or narrow, did the 'GROUP BY' and 'ORDER BY' queries result in?
10 | 
11 | - Wide
12 | 
13 | 4. How many tasks were in the last stage of the last job?
14 | 
15 | - 2
--------------------------------------------------------------------------------
/assignments/module3-assignment3.md:
--------------------------------------------------------------------------------
1 | 1. What type of table is "newTable"?
2 | 
3 | - EXTERNAL
4 | 
5 | 2. How many rows are in "newTable"?
6 | 
7 | - 191039
8 | 
9 | 3. What is the "Battalion" of the first entry in the sorted table?
10 | 
11 | - B01
12 | 
13 | 4. Was this query faster or slower on the table with increased partitions?
14 | 
15 | - Slower
16 | 
17 | 5. Does the data stored within the table still exist at the original location ('dbfs:/tmp/newTableLoc') after you dropped the table?
18 | 
19 | - Yes
--------------------------------------------------------------------------------
/assignments/module4-assignment4.md:
--------------------------------------------------------------------------------
1 | 1. How many calls have a 'Call_Type_Group' of "Fire"?
2 | 
3 | - 4196
4 | 
5 | 2. How many rows are in 'fireCallsGroupCleaned'?
6 | 
7 | - 134198
8 | 
9 | 3. What is the accuracy of our model on test data as a percentage? Round to the nearest percent. (e.g. an accuracy of ".125" should be reported as "13")
10 | 
11 | - 82
12 | 
13 | 4. What two values are in the 'prediction' column?
14 | 
15 | - 0
16 | - 1
--------------------------------------------------------------------------------
/quizes/module1-quiz.md:
--------------------------------------------------------------------------------
1 | 1. Which of the following are true when it comes to the business value of big data? (Select all that apply.)
2 | 
3 | - Businesses are increasingly making data-driven decisions
4 | 
5 | - The size of the data businesses collect is growing
6 | 
7 | 2. Spark uses... (Select all that apply.)
8 | 
9 | - A distributed cluster of networked computers made of a driver node and many executor nodes
10 | 
11 | - A driver node to distribute work across a number of executor nodes
12 | 
13 | 3. How does Spark execute code backed by DataFrames? (Select all that apply.)
14 | 
15 | - It separates the "logical plan" of what you want to accomplish from the "physical plan" of how to do it so it can optimize the query
16 | 
17 | - It optimizes your query by figuring out the best "how" to execute what you want
18 | 
19 | 4. What are the properties of Spark DataFrames? (Select all that apply.)
20 | 
21 | - Resilient: Fault-tolerant
22 | 
23 | - Distributed: Computed across multiple nodes
24 | 
25 | - Dataset: Collection of partitioned data
26 | 
27 | 5. What is the difference between Spark and database technologies? (Select all that apply.)
28 | 
29 | - Spark is a highly optimized compute engine and is not a database
30 | 
31 | - Spark is a computation engine and is not for data storage
32 | 
33 | 6. What is Amdahl's law of scalability? (Select all that apply.)
34 | 
35 | - Amdahl's law states that the speedup of a task is a function of how much of that task can be parallelized
36 | 
37 | - A formula that gives the theoretical speedup as a function of the percentage of a computation that can be parallelized
38 | 
39 | 7. Spark offers a unified approach to analytics. What does this include? (Select all that apply.)
40 | 
41 | - Spark code can be written in the following languages: SQL, Scala, Java, Python, and R
42 | 
43 | - Spark is able to connect to data where it lives in any number of sources, unifying the components of a data application
44 | 
45 | - Spark allows analysts, data scientists, and data engineers to all use the same core technology
46 | 
47 | - Spark unifies applications such as SQL queries, streaming, and machine learning
48 | 
49 | 8. What is a Databricks notebook?
50 | 
51 | - A collaborative, interactive workspace that allows you to execute Spark queries at scale
52 | 
53 | 9. How can you get data into Databricks? (Select all that apply.)
54 | 
55 | - By uploading it through the user interface
56 | 
57 | - By "mounting" data backed by cloud storage
58 | 
59 | - By registering the data as a table
60 | 
61 | 10. What are the qualities of big data? (Select all that apply.)
62 | 
63 | - Volume
64 | 
65 | - Velocity
66 | 
67 | - Veracity
68 | 
69 | - Variety
--------------------------------------------------------------------------------
/quizes/module2-quiz.md:
--------------------------------------------------------------------------------
1 | 1. What are the different units of parallelism? (Select all that apply.)
2 | 
3 | - Partition
4 | 
5 | - Core
6 | 
7 | - Executor
8 | 
9 | - Task
10 | 
11 | 2. What is a partition?
12 | 
13 | - A portion of a large distributed set of data
14 | 
15 | 3. What is the difference between in-memory computing and other technologies? (Select all that apply.)
16 | 
17 | - In-memory operates from RAM while other technologies operate from disk
18 | 
19 | - Computation not done in-memory (such as Hadoop) reads and writes from disk in between each step
20 | 
21 | - In-memory operations were not realistic in older technologies when memory was more expensive
22 | 
23 | 4. Why is caching important?
24 | 
25 | - It stores data on the cluster to improve query performance
26 | 
27 | 5. Which of the following is a wide transformation? (Select all that apply.)
28 | 
29 | - ORDER BY
30 | 
31 | - GROUP BY
32 | 
33 | 6. Broadcast joins...
34 | 
35 | - Transfer the smaller of two tables to the larger, minimizing data transfer
36 | 
37 | 7. When is it appropriate to use a shuffle join?
38 | 
39 | - When both tables are moderately sized or large
40 | 
41 | 8. Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)
42 | 
43 | - Shuffle reads
44 | 
45 | - Shuffle writes
46 | 
47 | - Data Skew
48 | 
49 | 9. What is a stage boundary?
50 | 
51 | - When all of the slots or available units of processing have to sync with one another
52 | 
53 | 10. What happens when Spark code is executed in local mode?
54 | 
55 | - The executor and driver are on the same machine
56 | 
--------------------------------------------------------------------------------
/quizes/module3-quiz.md:
--------------------------------------------------------------------------------
1 | 1. Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)
2 | 
3 | - It allows for elastic resources so larger storage or compute resources are used only when needed
4 | - It makes updates to new software versions easier
5 | - Resources are isolated and therefore more manageable and debuggable
6 | 
7 | 2. You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?
8 | 
9 | - IO: the transfer of data is more demanding than the computation
10 | 
11 | 3. Processing virtual shopping cart orders in real time is an example of...
12 | 
13 | - OLTP
14 | 
15 | 4. When are BLOB stores an appropriate place to store data? (Select all that apply.)
16 | 
17 | - For storing large files
18 | - For cheap storage
19 | - For a "data lake" of largely unstructured data
20 | 
21 | 5. JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?
22 | 
23 | - Specify a column, number of partitions, and the column's minimum and maximum values. Spark then divides that range of values between parallel connections.
24 | 
25 | 6. What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)
26 | 
27 | - Parallelism
28 | - Compression
29 | - Columnar
30 | 
31 | 7. SQL is normally used to query tabular (or "structured") data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)
32 | 
33 | - It allows for missing data
34 | - It does not need a formal structure
35 | - It allows for data change over time
36 | - It allows for complex data types
37 | 
38 | 8. Data writes in Spark can happen in serial or in parallel. What controls this parallelism?
39 | 
40 | - The number of data partitions in a DataFrame
41 | 
42 | 9. Fill in the blanks with the appropriate response below:
43 | 
44 | A _________ table manages _________ and a DROP TABLE command will result in data loss.
45 | 
46 | - managed, both the data and metadata such as the schema and data location
--------------------------------------------------------------------------------
/quizes/module4-quiz.md:
--------------------------------------------------------------------------------
1 | 1. Machine learning is suited to solve which of the following tasks? (Select all that apply.)
2 | 
3 | - Image Recognition
4 | - Fraud Detection
5 | - Natural Language Processing
6 | - Financial Forecasting
7 | - Churn Analysis
8 | - A/B testing
9 | 
10 | 2. Is a model that is 99\% accurate at predicting breast cancer a good model?
11 | 
12 | - Likely no because there are not many cases of cancer in a general population
13 | 
14 | 3. What is an appropriate baseline model to compare a machine learning solution to?
15 | 
16 | - The average of the dataset
17 | 
18 | 4. What is Machine Learning? (Select all that apply.)
19 | 
20 | - A function that maps features to an output
21 | - Learning patterns in your data without being explicitly programmed
22 | 
23 | 5. (Fill in the blanks with the appropriate answer below.)
24 | 
25 | Predicting whether a website user is fraudulent or not is an example of _________ machine learning. It is a __________ task.
26 | 
27 | - supervised, classification
28 | 
29 | 6. (Fill in the blanks with the appropriate answer below.)
30 | 
31 | Grouping similar users together based on past activity is an example of _________ machine learning. It is a _________ task.
32 | 
33 | - unsupervised, clustering
34 | 
35 | 7. Predicting the next quarter of a company's earnings is an example of...
36 | 
37 | - Regression
38 | 
39 | 8. Why do we want to perform a train/test split before we train a machine learning model? (Select all that apply.)
40 | 
41 | - To evaluate how our model performs on unseen data
42 | - To keep the model from "overfitting" where it memorizes the data it has seen
43 | 
44 | 9. What is a linear regression model learning about your data?
45 | 
46 | - The formula for the line of best fit
47 | 
48 | 10. How do you define a custom function not already part of core Spark?
49 | 
50 | - With a User-Defined Function
--------------------------------------------------------------------------------
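Questions 3, 8, and 9 of the module 4 quiz (the average as a baseline, the train/test split, and the line of best fit) can be illustrated with a short sketch. It is plain Python rather than the Spark ML code used in the course notebooks, and the synthetic data, the 80/20 split, and the helper names `fit_line` and `rmse` are assumptions made purely for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, predict):
    """Root mean squared error of `predict` over the given points."""
    squared = [(predict(x) - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


# Synthetic data (an assumption): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
data = [(float(x), 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)) for x in range(100)]

# Train/test split: shuffle, then hold out 20% that the model never sees.
rng.shuffle(data)
train, test = data[:80], data[80:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# The model "learns" the slope and intercept of the line of best fit.
slope, intercept = fit_line(train_x, train_y)

# Baseline: always predict the average of the training targets.
baseline = sum(train_y) / len(train_y)

model_rmse = rmse(test_x, test_y, lambda x: slope * x + intercept)
baseline_rmse = rmse(test_x, test_y, lambda x: baseline)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"test RMSE: model={model_rmse:.3f} baseline={baseline_rmse:.3f}")
```

Because both errors are measured on the held-out 20%, the comparison reflects performance on unseen data; on this synthetic set the fitted line should score a far lower RMSE than the average baseline.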