├── Distributed-Computing-with-Spark-SQL-1.2.3.dbc
├── README.md
├── assignments
    ├── module1-assignment1.md
    ├── module2-assignment2.md
    ├── module3-assignment3.md
    └── module4-assignment4.md
└── quizes
    ├── module1-quiz.md
    ├── module2-quiz.md
    ├── module3-quiz.md
    └── module4-quiz.md


/Distributed-Computing-with-Spark-SQL-1.2.3.dbc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/67cc61080727dd224761cc330de966ebe9d4b81a/Distributed-Computing-with-Spark-SQL-1.2.3.dbc


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Distributed Computing with Spark SQL
 2 | This course, offered by the University of California, Davis on Coursera, provides a comprehensive overview of distributed computing using Spark.
 3 | 
 4 | The four modules build on one another and cover the following topics:
 5 | - Spark architecture
 6 | - Spark DataFrames
 7 | - Optimizing reading and writing data
 8 | - How to build a machine learning model
 9 | 
10 | By understanding when to use Spark, whether to scale out because the model or data is too large to process on a single machine or simply to get results faster, students like me can hone their SQL skills and become more adept data scientists.
11 | 
12 | This repository includes the following:
13 | #### 1. [Assignments](https://github.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/tree/main/assignments)
14 | 
15 | #### 2. [Quizzes](https://github.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/tree/main/quizes)
16 | 
17 | #### 3. [Notebooks](https://github.com/Linlin-Li-1/Distributed-Computing-with-Spark-SQL/blob/main/Distributed-Computing-with-Spark-SQL-1.2.3.dbc)
18 | 
19 | 
20 | 
21 | 


--------------------------------------------------------------------------------
/assignments/module1-assignment1.md:
--------------------------------------------------------------------------------
 1 | 1. What is the first value for "Incident Number"?
 2 | 
 3 | - 16000003
 4 | 
 5 | 2. What is the first value for "Incident Number" on April 4th, 2016?
 6 | 
 7 | - 16037478
 8 | 
 9 | 3. Is the first fire call in this table on Brooke or Conor's birthday? Conor's birthday is 4/4 and Brooke's is 9/27 (in MM/DD format).
10 | 
11 | - Conor
12 | 
13 | 4. What is the "Station Area" for the first fire call in this table? Note that this table is a subset of the dataset.
14 | 
15 | - 29
16 | 
17 | 5. How many incidents were on Conor's birthday in 2016?
18 | 
19 | - 80
20 | 
21 | 6. How many fire calls had an "Ignition Cause" of "4 act of nature"?
22 | 
23 | - 5
24 | 
25 | 7. What is the most common "Ignition Cause"?
26 | 
27 | - 2 unintentional
28 | 
29 | 8. What is the total number of incidents from the two joined tables?
30 | 
31 | - 847094402


--------------------------------------------------------------------------------
/assignments/module2-assignment2.md:
--------------------------------------------------------------------------------
 1 | 1. How many fire calls are in our table?
 2 | 
 3 | - 240613
 4 | 
 5 | 2. Which "Unit Type" is the most common?
 6 | 
 7 | - ENGINE
 8 | 
 9 | 3. What type of transformation, wide or narrow, did the 'GROUP BY' and 'ORDER BY' queries result in?
10 | 
11 | - Wide
12 | 
13 | 4. How many tasks were in the last stage of the last job?
14 | 
15 | - 2


--------------------------------------------------------------------------------
/assignments/module3-assignment3.md:
--------------------------------------------------------------------------------
 1 | 1. What type of table is "newTable"?
 2 | 
 3 | - EXTERNAL
 4 | 
 5 | 2. How many rows are in "newTable"?
 6 | 
 7 | - 191039
 8 | 
 9 | 3. What is the "Battalion" of the first entry in the sorted table?
10 | 
11 | - B01
12 | 
13 | 4. Was this query faster or slower on the table with increased partitions?
14 | 
15 | - Slower
16 | 
17 | 5. Does the data stored within the table still exist at the original location ('dbfs:/tmp/newTableLoc') after you dropped the table?
18 | 
19 | - Yes


--------------------------------------------------------------------------------
/assignments/module4-assignment4.md:
--------------------------------------------------------------------------------
 1 | 1. How many calls have a 'Call_Type_Group' of "Fire"?
 2 | 
 3 | - 4196
 4 | 
 5 | 2. How many rows are in 'fireCallsGroupCleaned'?
 6 | 
 7 | - 134198
 8 | 
 9 | 3. What is the accuracy of our model on test data as a percentage? Round to the nearest percent. (e.g. an accuracy of ".125" should be reported as "13")
10 | 
11 | - 82
12 | 
13 | 4. What two values are in the 'prediction' column?
14 | 
15 | - 0
16 | - 1
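
The rounding rule in question 3 (".125" reported as "13") rounds halves up, which is not what Python's built-in `round` does: it rounds halves to the nearest even integer. The sketch below shows one way to get the behavior the question describes. The helper name `accuracy_as_percent` is hypothetical and is not taken from the course notebooks.

```python
from decimal import Decimal, ROUND_HALF_UP


def accuracy_as_percent(accuracy: float) -> int:
    """Convert a fractional accuracy to a whole percent, rounding halves up.

    Hypothetical helper for illustration only.
    """
    # Going through str() keeps the decimal digits the user sees (e.g. "0.125").
    percent = Decimal(str(accuracy)) * 100
    return int(percent.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# The built-in round() uses round-half-to-even, so 12.5 becomes 12.
builtin_result = round(0.125 * 100)

# The helper rounds the same value up to 13, matching the question's example.
half_up_result = accuracy_as_percent(0.125)
```

For values that are not exactly halfway, such as an accuracy of 0.82, both approaches agree.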


--------------------------------------------------------------------------------
/quizes/module1-quiz.md:
--------------------------------------------------------------------------------
 1 | 1. Which of the following are true when it comes to the business value of big data? (Select all that apply.)
 2 | 
 3 | - Businesses are increasingly making data-driven decisions
 4 | 
 5 | - The size of the data businesses collect is growing
 6 | 
 7 | 2. Spark uses...(Select all that apply.)
 8 | 
 9 | - A distributed cluster of networked computers made of a driver node and many executor nodes
10 | 
11 | - A driver node to distribute work across a number of executor nodes
12 | 
13 | 3. How does Spark execute code backed by DataFrames? (Select all that apply.)
14 | 
15 | - It separates the "logical plan" of what you want to accomplish from the "physical plan" of how to do it so it can optimize the query
16 | 
17 | - It optimizes your query by figuring out the best "how" to execute what you want
18 | 
19 | 4. What are the properties of Spark DataFrames? (Select all that apply.)
20 | 
21 | - Resilient: Fault-tolerant
22 | 
23 | - Distributed: Computed across multiple nodes
24 | 
25 | - Dataset: Collection of partitioned data
26 | 
27 | 5. What is the difference between Spark and database technologies? (Select all that apply.)
28 | 
29 | - Spark is a highly optimized compute engine and is not a database
30 | 
31 | - Spark is a computation engine and is not for data storage
32 | 
33 | 6. What is Amdahl's law of scalability? (Select all that apply.)
34 | 
35 | - Amdahl's law states that the speedup of a task is a function of how much of that task can be parallelized
36 | 
37 | - A formula that gives the theoretical speedup as a function of the percentage of a computation that can be parallelized
38 | 
39 | 7. Spark offers a unified approach to analytics. What does this include? (Select all that apply.)
40 | 
41 | - Spark code can be written in the following languages: SQL, Scala, Java, Python, and R
42 | 
43 | - Spark is able to connect to data where it lives in any number of sources, unifying the components of a data application
44 | 
45 | - Spark allows analysts, data scientists, and data engineers to all use the same core technology
46 | 
47 | - Spark unifies applications such as SQL queries, streaming, and machine learning
48 | 
49 | 8. What is a Databricks notebook?
50 | 
51 | - A collaborative, interactive workspace that allows you to execute Spark queries at scale
52 | 
53 | 9. How can you get data into Databricks? (Select all that apply.)
54 | 
55 | - By uploading it through the user interface
56 | 
57 | - By "mounting" data backed by cloud storage
58 | 
59 | - By registering the data as a table
60 | 
61 | 10. What are the qualities of big data? (Select all that apply.)
62 | 
63 | - Volume
64 | 
65 | - Velocity
66 | 
67 | - Veracity
68 | 
69 | - Variety
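
Question 6 describes Amdahl's law in words. As a worked illustration, the law is usually written as S = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of workers. The function name and the sample values below are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    workers: number of parallel workers (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # With 90% of the work parallelizable, the speedup approaches but never
    # reaches 10x, no matter how many workers are added.
    for workers in (1, 2, 8, 64, 1024):
        print(workers, round(amdahl_speedup(0.9, workers), 2))
```

The serial fraction sets the ceiling: with p = 0.9 the speedup can never exceed 1 / 0.1 = 10.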


--------------------------------------------------------------------------------
/quizes/module2-quiz.md:
--------------------------------------------------------------------------------
 1 | 1. What are the different units of parallelism? (Select all that apply.)
 2 | 
 3 | - Partition
 4 | 
 5 | - Core
 6 | 
 7 | - Executor
 8 | 
 9 | - Task
10 | 
11 | 2. What is a partition?
12 | 
13 | - A portion of a large distributed set of data
14 | 
15 | 3. What is the difference between in-memory computing and other technologies? (Select all that apply.)
16 | 
17 | - In-memory operates from RAM while other technologies operate from disk
18 | 
19 | - Computation not done in-memory (such as Hadoop) reads and writes from disk in between each step
20 | 
21 | - In-memory operations were not realistic in older technologies when memory was more expensive
22 | 
23 | 4. Why is caching important?
24 | 
25 | - It stores data on the cluster to improve query performance
26 | 
27 | 5. Which of the following is a wide transformation? (Select all that apply.)
28 | 
29 | - ORDER BY 
30 | 
31 | - GROUP BY
32 | 
33 | 6. Broadcast joins...
34 | 
35 | - Transfer the smaller of two tables to the larger, minimizing data transfer
36 | 
37 | 7. When is it appropriate to use a shuffle join?
38 | 
39 | - When both tables are moderately sized or large
40 | 
41 | 8. Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)
42 | 
43 | - Shuffle reads
44 | 
45 | - Shuffle writes
46 | 
47 | - Data Skew
48 | 
49 | 9. What is a stage boundary?
50 | 
51 | - When all of the slots or available units of processing have to sync with one another
52 | 
53 | 10. What happens when Spark code is executed in local mode?
54 | 
55 | - The executor and driver are on the same machine
56 | 
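
Questions 5 and 9 concern wide transformations and stage boundaries. The toy sketch below, in plain Python rather than Spark, shows why a filter can run on each partition independently (narrow) while a GROUP BY must first move rows with the same key into the same partition (wide). All function names and the sample unit types are assumptions for illustration. This is not how Spark implements its shuffle.

```python
import zlib
from collections import Counter


def narrow_filter(partitions, predicate):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in partitions]


def shuffle_by_key(partitions, num_output_partitions):
    """Wide: route every row to an output partition chosen from its key."""
    shuffled = [[] for _ in range(num_output_partitions)]
    for part in partitions:
        for key in part:
            # crc32 gives a hash that is stable across Python processes.
            target = zlib.crc32(key.encode("utf-8")) % num_output_partitions
            shuffled[target].append(key)
    return shuffled


def group_by_count(partitions, num_output_partitions=2):
    """Count rows per key, which requires a shuffle before counting."""
    counts = {}
    for part in shuffle_by_key(partitions, num_output_partitions):
        # After the shuffle, each key lives in exactly one partition,
        # so per-partition counts can be merged without conflicts.
        counts.update(Counter(part))
    return counts


# Made-up rows: one "Unit Type" value per fire call, spread over 3 partitions.
toy_partitions = [["ENGINE", "MEDIC", "ENGINE"], ["TRUCK", "ENGINE"], ["MEDIC"]]
engine_only = narrow_filter(toy_partitions, lambda unit: unit == "ENGINE")
unit_counts = group_by_count(toy_partitions)
```

The shuffle step is the point at which every partition has to finish and exchange data, which corresponds to the stage boundary described in question 9.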


--------------------------------------------------------------------------------
/quizes/module3-quiz.md:
--------------------------------------------------------------------------------
 1 | 1. Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)
 2 | 
 3 | - It allows for elastic resources so larger storage or compute resources are used only when needed
 4 | - It makes updates to new software versions easier
 5 | - Resources are isolated and therefore more manageable and debuggable
 6 | 
 7 | 2. You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?
 8 | 
 9 | - IO: the transfer of data is more demanding than the computation
10 | 
11 | 3. Processing virtual shopping cart orders in real time is an example of...
12 | 
13 | - OLTP
14 | 
15 | 4. When are BLOB stores an appropriate place to store data? (Select all that apply.)
16 | 
17 | - For storing large files
18 | - For cheap storage
19 | - For a "data lake" of largely unstructured data
20 | 
21 | 5. JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?
22 | 
23 | - Specify a column, number of partitions, and the column's minimum and maximum values. Spark then divides that range of values between parallel connections.
24 | 
25 | 6. What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)
26 | 
27 | - Parallelism
28 | - Compression
29 | - Columnar
30 | 
31 | 7. SQL is normally used to query tabular (or "structured") data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)
32 | 
33 | - It allows for missing data
34 | - It does not need a formal structure
35 | - It allows for data change over time
36 | - It allows for complex data types
37 | 
38 | 8. Data writes in Spark can happen in serial or in parallel. What controls this parallelism?
39 | 
40 | - The number of data partitions in a DataFrame
41 | 
42 | 9. Fill in the blanks with the appropriate response below:
43 | 
44 | A _________ table manages  _________  and a DROP TABLE command will result in data loss.
45 | 
46 | - managed, both the data and metadata such as the schema and data location
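
Question 5 describes how a parallel JDBC read divides a column's range between connections. In Spark these settings are the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options. The sketch below is a simplified illustration of turning them into one WHERE clause per connection. The function name and the column `id` are hypothetical, and the boundary arithmetic is not guaranteed to match Spark's exactly.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one SQL predicate per parallel connection (simplified sketch)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be less than upper_bound")

    # Never create more partitions than there are integer values in the range.
    num_partitions = min(num_partitions, upper_bound - lower_bound)
    if num_partitions == 1:
        return ["1 = 1"]  # a single connection reads everything

    stride = (upper_bound - lower_bound) // num_partitions
    boundaries = [lower_bound + i * stride for i in range(1, num_partitions)]

    # The first and last predicates are open-ended, so rows outside the
    # bounds (and NULLs) are still read: the bounds only shape the split.
    predicates = [f"{column} < {boundaries[0]} OR {column} IS NULL"]
    for low, high in zip(boundaries, boundaries[1:]):
        predicates.append(f"{column} >= {low} AND {column} < {high}")
    predicates.append(f"{column} >= {boundaries[-1]}")
    return predicates


for predicate in jdbc_partition_predicates("id", 0, 100, 4):
    print(predicate)
```

Because the bounds only decide the stride, a poor choice of minimum and maximum does not drop rows, but it can leave most of the data in the first or last partition.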


--------------------------------------------------------------------------------
/quizes/module4-quiz.md:
--------------------------------------------------------------------------------
 1 | 1. Machine learning is suited to solve which of the following tasks? (Select all that apply.)
 2 | 
 3 | - Image Recognition
 4 | - Fraud Detection
 5 | - Natural Language Processing
 6 | - Financial Forecasting
 7 | - Churn Analysis
 8 | - A/B testing
 9 | 
10 | 2. Is a model that is 99\% accurate at predicting breast cancer a good model?
11 | 
12 | - Likely no because there are not many cases of cancer in a general population
13 | 
14 | 3. What is an appropriate baseline model to compare a machine learning solution to?
15 | 
16 | - The average of the dataset
17 | 
18 | 4. What is Machine Learning? (Select all that apply.)
19 | 
20 | - A function that maps features to an output
21 | - Learning patterns in your data without being explicitly programmed
22 | 
23 | 5. (Fill in the blanks with the appropriate answer below.)
24 | 
25 | Predicting whether a website user is fraudulent or not is an example of _________ machine learning. It is a __________ task.
26 | 
27 | - supervised, classification
28 | 
29 | 6. (Fill in the blanks with the appropriate answer below.)
30 | 
31 | Grouping similar users together based on past activity is an example of _________ machine learning. It is a _________ task.
32 | 
33 | - unsupervised, clustering
34 | 
35 | 7. Predicting the next quarter of a company's earnings is an example of...
36 | 
37 | - Regression
38 | 
39 | 8. Why do we want to perform a train/test split before we train a machine learning model? (Select all that apply.)
40 | 
41 | - To evaluate how our model performs on unseen data
42 | - To keep the model from "overfitting" where it memorizes the data it has seen
43 | 
44 | 9. What is a linear regression model learning about your data?
45 | 
46 | - The formula for the line of best fit
47 | 
48 | 10. How do you define a custom function not already part of core Spark?
49 | 
50 | - With a User-Defined Function
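
Question 2 points out that 99% accuracy can be misleading when the positive class is rare. The sketch below makes that concrete with made-up labels: 1,000 patients, of whom 10 have cancer. These numbers are an assumption for illustration and are not from the course. A baseline that always predicts "no cancer" reaches 99% accuracy while detecting none of the real cases.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Made-up data: 10 positive cases (1) among 1,000 patients.
y_true = [1] * 10 + [0] * 990

# Baseline "model" that always predicts the majority class (0).
y_always_negative = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_always_negative)
baseline_recall = recall(y_true, y_always_negative)
print(baseline_accuracy, baseline_recall)
```

Comparing a trained model against a trivial baseline like this one is a quick check on whether a high accuracy figure reflects anything the model has learned.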


--------------------------------------------------------------------------------