├── static ├── DPP.jpg ├── datalake.jpg ├── deltalake.jpg ├── skew_join.jpg ├── AQEPlanning.jpg ├── dl_elements.png ├── partition_guide.png ├── columnar_storage.png ├── Wide-Transformation.png ├── partition_coalesce.jpg ├── query_optimization.png ├── Narrow-Transformation.png └── Catalyst-Optimizer-diagram.png ├── Contributions.md ├── advanced ├── streaming.md ├── new_in_spark_3.md ├── joins.md ├── deltalake.md └── optimizations.md └── README.md /static/DPP.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/DPP.jpg -------------------------------------------------------------------------------- /static/datalake.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/datalake.jpg -------------------------------------------------------------------------------- /static/deltalake.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/deltalake.jpg -------------------------------------------------------------------------------- /static/skew_join.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/skew_join.jpg -------------------------------------------------------------------------------- /static/AQEPlanning.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/AQEPlanning.jpg -------------------------------------------------------------------------------- /static/dl_elements.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/dl_elements.png -------------------------------------------------------------------------------- /static/partition_guide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/partition_guide.png -------------------------------------------------------------------------------- /static/columnar_storage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/columnar_storage.png -------------------------------------------------------------------------------- /static/Wide-Transformation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/Wide-Transformation.png -------------------------------------------------------------------------------- /static/partition_coalesce.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/partition_coalesce.jpg -------------------------------------------------------------------------------- /static/query_optimization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/query_optimization.png -------------------------------------------------------------------------------- /static/Narrow-Transformation.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/Narrow-Transformation.png -------------------------------------------------------------------------------- /static/Catalyst-Optimizer-diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/Catalyst-Optimizer-diagram.png -------------------------------------------------------------------------------- /Contributions.md: -------------------------------------------------------------------------------- 1 | ### You are welcome to raise a PR - 2 | - If you want to add more information/topics to this resource. 3 | - If you find any grammatical errors or outdated information. 4 | - If you want to improve the indentation or readability. 5 | - If any of the answers seem incomplete or incorrect. Please add to it and provide an explanation on what was missing or incorrect in the PR comments. 6 | -------------------------------------------------------------------------------- /advanced/streaming.md: -------------------------------------------------------------------------------- 1 | ## Spark Streaming 2 | 3 | Stream processing is the act of continuously incorporating new data to calculate results. The input data is unbounded and has no predetermined beginning or end. 4 | 5 | Examples- 6 | 7 | - Credit card transactions 8 | - IOT device data 9 | - Click stream 10 | 11 | Any business logic in your application should give same results for streaming and batch applications. -------------------------------------------------------------------------------- /advanced/new_in_spark_3.md: -------------------------------------------------------------------------------- 1 | # What's new in Spark 3? 2 | 3 | ### Adaptive Query Execution 4 | 5 | In Spark 2, the logical and physical optimizations were rule based optimizations. While they improve the performance, they are all based on the estimates and statistics that are generated before runtime. There may be unanticipated problems or tuning opportunities that appear as the query runs. 6 | 7 | ![Catalyst Optimizer Diagram](../static/Catalyst-Optimizer-diagram.png) 8 | 9 | Adaptive Query Execution allows Spark to re-optimize and adjust query plans based on runtime statistics collected during query execution. 10 | 11 | ![Image showing the catalyst optimizer with adaptive planning.](../static/AQEPlanning.jpg) 12 | 13 | When AQE is on, spark will feed back statistics about the size of the data in the shuffle files, so that for the next stage, when working out the logical plan, it can dynamically switch join strategies, coalesce number of shuffle partitions or optimize skew joins. 14 | 15 | #### Switch Join Strategies 16 | 17 | Existing rule-based optimizations include planning a broadcast hash join if the estimated size of a join relation is lower than the broadcast-size threshold. It relies on a estimation of data based on the file size. A number of things can make the estimation go wrong - 18 | 19 | - Presence of a very selective filter 20 | - Join relation being a series of complex operators other than just a scan 21 | 22 | To solve this problem, AQE now replans the join strategy at runtime based on the most accurate join relation size. 
So if the estimated file size was 20MB and the actual file size turns out to be 8MB after the scan, then AQE will dynamically switch the join strategy from Sort-Merge Join to Broadcast-Hash Join. 23 | 24 | #### Coalesce Shuffle Partitions 25 | 26 | Tuning shuffle partitions in Spark is a common pain point. The best number of partitions depend on the data size, but the data sizes may differ vastly from stage to stage so this number can be hard to tune. 27 | 28 | - If there are too few partitions - 29 | - then the data size of each partition may be very large, and the tasks to process these large partitions may need to spill the data to disk and as a result slow down the query. 30 | - If there are too many partitions - 31 | - then the data size of each partition may be very small, leading to too many network data fetches to read the shuffle blocks. Which will also slow down the query because of inefficient I/O pattern. Having a large number of tasks also puts more burden on the task scheduler. 32 | 33 | To solve this problem, we can set a relatively large number of shuffle partitions at the beginning, then combine the adjacent small partitions into bigger partitions at runtime by looking at the shuffle files statistics. 34 | 35 | For example, a small dataset of two partition is involved in a group by operation. The shuffle partition is set to 5, which leads to 5 partitions during the shuffle operation. With AQE, the other three smaller partitions are coalesced into 1 larger partition, as a result, the final aggregation now only needs to perform three tasks rather than five. 36 | 37 | ![Partition Coalesce](../static/partition_coalesce.jpg) 38 | 39 | #### Optimize Skew Joins 40 | 41 | Data skew occurs when data is unevenly distributed among partitions in the cluster. AQE detects such a skew automatically from shuffle file statistics. It then splits the skewed partitions into smaller subpartitions, which will be joined to the corresponding partition from the other side respectively. 42 | 43 | In the below example, AQE splits the A0 partition into two smaller partitions and joins the with B0. This leads to 5 similar sized tasks that complete nearly at the same time versus one outlier task that takes much more time than the other tasks. 44 | 45 | ![Skew Join](../static/skew_join.jpg) 46 | 47 | ### Dynamic Partition Pruning 48 | 49 | [Dynamic Partition Pruning - Data Savvy Youtube](https://youtu.be/rwUgZP-EBZw) 50 | 51 | ![Dynamic Partition Pruning](../static/DPP.jpg) 52 | 53 | -------------------------------------------------------------------------------- /advanced/joins.md: -------------------------------------------------------------------------------- 1 | ## Communication Strategies 2 | 3 | ### Big table-to-big table 4 | - Joining a big table with another big table leads to shuffle join. 5 | - In shuffle join, every node talks with every other node and they share data according to which node has a certain key or a set of keys. 6 | - If data is not partitioned well network can become congested with traffic. 7 | 8 | #### Shuffle Sort Merge Join 9 | 10 | As the name indicates, this join scheme has two phases: a sort phase followed by a merge phase. The sort phase sorts each data set by its desired join key; the merge phase iterates over each key in the row from each data set and merges the rows if the two keys match. 
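To make the mechanics concrete, here is a minimal PySpark sketch (the datasets and column names are made up for illustration) that disables the automatic broadcast threshold so the equi-join falls back to a shuffle sort merge join; `explain()` then shows the `Exchange`, `Sort`, and `SortMergeJoin` steps discussed in the next section.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smj-demo").getOrCreate()

# Disable auto-broadcast so the equi-join below uses a shuffle sort merge join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Two synthetic "large" tables sharing the join key `uid`
users = spark.range(1_000_000).withColumnRenamed("id", "uid")
orders = spark.range(5_000_000).selectExpr("id % 1000000 AS uid", "id AS order_id")

joined = users.join(orders, "uid")

# Physical plan: both sides are shuffled (Exchange), sorted, then merged
joined.explain()
```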
11 | 12 | #### Optimizing the shuffle sort merge join 13 | 14 | We can eliminate the `Exchange` step from this scheme if we create partitioned buckets for common sorted keys or columns on which we want to perform frequent equi-joins. That is, we can create an explicit number of buckets to store specific sorted columns (one key per bucket). Presorting and reorganizing data in this way boosts performance, as it allows us to skip the expensive `Exchange` operation and go straight to `WholeStageCodegen`. 15 | 16 | ```scala 17 | // Save as managed tables by bucketing them in Parquet format 18 | usersDF.orderBy(asc("uid")) 19 | .write.format("parquet") 20 | .bucketBy(8, "uid") 21 | .mode(SaveMode.Overwrite) 22 | .saveAsTable("UsersTbl") 23 | 24 | ordersDF.orderBy(asc("users_id")) 25 | .write.format("parquet") 26 | .bucketBy(8, "users_id") 27 | .mode(SaveMode.Overwrite) 28 | .saveAsTable("OrdersTbl") 29 | ``` 30 | 31 | Bucketing the data by user ID helps skip the expensive `Exchange` step, as there is no need to sort at join time. 32 | 33 | #### When to use a shuffle sort merge join 34 | 35 | Use this type of join under the following conditions for maximum benefit: 36 | 37 | - When each key within two large data sets can be sorted and hashed to the same partition by Spark 38 | - When you want to perform only equi-joins to combine two data sets based on matching sorted keys 39 | - When you want to prevent Exchange and Sort operations to save large shuffles across the network 40 | 41 | ### Big table-to-small table 42 | - The small table should be small enough to fit into the memory of a worker node with some breathing room. 43 | - Replicate our small dataframe onto every worker node. 44 | - Prevents the all-to-all communication described earlier. 45 | - Instead we perform it only once at the beginning and let each individual node perform the work. 46 | - The driver first collects the data from the executors and then broadcasts it back to the executors. Hence the driver should be large enough to handle a collect operation. 47 | 48 | #### When to use a broadcast hash join 49 | 50 | Use this type of join under the following conditions for maximum benefit: 51 | 52 | - When each key within the smaller and larger data sets is hashed to the same partition 53 | by Spark 54 | - When one data set is much smaller than the other (and within the default config 55 | of 10 MB, or more if you have sufficient memory) 56 | - When you only want to perform an equi-join, to combine two data sets based on 57 | matching unsorted keys 58 | - When you are not worried by excessive network bandwidth usage or OOM 59 | errors, because the smaller data set will be broadcast to all Spark executors 60 | 61 | | Broadcast Join | Shuffle Join | 62 | |:---------------------------------------------------:|:--------------------------------------------:| 63 | | Avoids shuffling the larger side | Shuffles both sides | 64 | | Naturally handles data skew | Can suffer from data skew | 65 | | Cheap for selective joins | Can produce unnecessary intermediate results | 66 | | Broadcasted data needs to fit into memory | Data can be spilled and read from disk | 67 | | Cannot be used for certain joins, e.g.
Full outer join | Can be used for all joins | -------------------------------------------------------------------------------- /advanced/deltalake.md: -------------------------------------------------------------------------------- 1 | ## Delta Lake 2 | 3 | Delta Lake is an open-format storage layer that puts standards in place in an organization's data lake to provide data structure and governance. It is a storage solution specifically designed to work with Apache Spark. 4 | 5 | Data stored in Delta Lake is ACID compliant. 6 | 7 | ### Shortcomings Of A Data Lake 8 | 9 | #### Data Lakes Offer - 10 | 11 | - Flexibility in data storage 12 | - Relatively cheaper compared to a warehouse 13 | - Can store structured, semi-structured or unstructured data 14 | 15 | ![Data Lake](../static/datalake.jpg) 16 | 17 | #### Challenges with Data Lakes - 18 | 19 | - Long storage periods & variety of data can convert it into a data swamp 20 | - Data swamps are data lakes that are difficult to manage and navigate 21 | - Reading and writing data is not reliable 22 | - Need to build workarounds to ensure readers always see consistent data during writes 23 | - Jobs often fail midway 24 | - Difficult to recover the correct state of the data 25 | - Countless hours are spent troubleshooting failed jobs and getting a valid state of the data 26 | - Modification of existing data is difficult 27 | - Lakes are meant for data written once and read many times 28 | - The concept of modifying or deleting data wasn’t of central importance at the time they were designed 29 | - Keeping historical versions of data is costly 30 | - Can become expensive and difficult with large-scale data 31 | - It is difficult to handle large metadata 32 | - Reading metadata can often create substantial delays in processing and overhead costs 33 | - Too many files cause problems 34 | - Query performance suffers greatly when a small amount of data is spread over too many files 35 | - It is hard to get great performance 36 | - Tuning jobs is a difficult task 37 | - Great performance should be easy to achieve 38 | - Data quality issues affect analysis results 39 | - No built-in quality checks like Data Warehouses 40 | - Expensive, long-running queries may fail or produce meaningless results 41 | 42 | #### How does Delta Lake resolve the above issues?
43 | 44 | Acid Transactions 45 | 46 | - Each transaction has a distinct beginning and end 47 | - Appending data is easy and each new write will create a new version of the table 48 | - New data won't be read until the transaction is complete 49 | - Jobs that fail midway can be discarded entirely 50 | - Many changes can be applied to the data in a single transaction, eliminating the possibility of incomplete deletes or updates 51 | 52 | Schema Management 53 | 54 | - Specify and enforce schema 55 | - Schema validation during writes 56 | - Throws an exception if extra data is added 57 | - Can make changes to tables schema 58 | 59 | Scalable Metadata Handling 60 | 61 | - Metadata is processed just like regular data - with distributed processing 62 | 63 | Unified Batch and Streaming data 64 | 65 | - Supports both streaming and batch processes 66 | - Each micro-batch transaction creates a new version of a table 67 | 68 | Data Versioning and Time Travel 69 | 70 | - A maintained version of historical data 71 | - Databricks uses Spark to scan the transaction logs for efficient processing 72 | 73 | ### Why Delta Lake 74 | 75 | By bringing the structure and governance inherent to data warehouses to data lakes with Delta Lake, you create the foundation for a Lakehouse. 76 | 77 | ![Delta Lake](../static/deltalake.jpg) 78 | 79 | #### Delta Lake Features - 80 | 81 | - ACID transactions on Spark 82 | - Scalable metadata handling 83 | - Streaming and batch unification 84 | - Schema enforcement 85 | - Time travel 86 | - Upserts and deletes 87 | - Fully configurable/optimizable 88 | - Structured streaming support 89 | 90 | #### Delta Lake Storage Layer - 91 | 92 | - Highly performant and persistent 93 | - Low-cost, easily scalable object storage 94 | - Ensures consistency 95 | - Allows for flexibility 96 | 97 | #### Delta Tables - 98 | 99 | - Contain data in Parquet files that are kept in object storage 100 | - Keep transaction logs in object storage 101 | - Can be registered in a metastore 102 | 103 | ### Elements of Delta Lake 104 | 105 | #### Delta Files 106 | 107 | - Uses parquet files to store a customer's data 108 | - Provide an additional layer over parquet files 109 | - data versioning and metadata 110 | - stores transaction logs 111 | - provides ACID transactions 112 | 113 | #### Delta Tables 114 | 115 | A delta table is a collection of data kept using the Delta Lake technology and consists - 116 | 117 | - Delta files containing the data and kept in object storage 118 | - A Delta table registered in a Metastore 119 | - The delta transaction log saved with Delta files in object storage 120 | 121 | #### Delta Optimization Engine 122 | 123 | - Delta Engine is a high-performance query engine that provides an efficient way to process data in data lakes. 124 | 125 | #### **Delta Lake Storage Layer** 126 | 127 | #### What is the Delta transaction log? 128 | 129 | - Ordered record of the transactions performed on a Delta table 130 | - Single source of truth for that table 131 | - Mechanism that the Delta Engine uses to guarantee atomicity -------------------------------------------------------------------------------- /advanced/optimizations.md: -------------------------------------------------------------------------------- 1 | ## Optimizing and Tuning Spark Applications: 2 | 3 | ### Static vs. 
Dynamic resource allocation - 4 | - One can use the `spark.dynamicAllocation.enabled` property to enable dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload. Example use cases would be streaming data, or on-demand analytics where more is asked of the application during peak hours. In a multi-tenant environment, Spark may soak up resources from other applications. 5 | 6 | ```python 7 | spark.dynamicAllocation.enabled true 8 | spark.dynamicAllocation.minExecutors 2 9 | spark.dynamicAllocation.schedulerBacklogTimeout 1m 10 | spark.dynamicAllocation.maxExecutors 20 11 | spark.dynamicAllocation.executorIdleTimeout 2min 12 | ``` 13 | - Request two executors to be created at start. 14 | - Whenever there are pending tasks that have not been scheduled for over 1 minute, the driver will request a new executor, up to a maximum of 20. 15 | - If an executor is idle for 2 minutes, the driver will terminate it. 16 | 17 | 18 | ### Configuring Spark executors' memory and shuffle service- 19 | 20 | - The amount of memory available to each executor is controlled by `spark.executor.memory`. 21 | 22 | - Executor memory is divided into three sections 23 | - Execution Memory 24 | - Storage Memory 25 | - Reserved Memory 26 | 27 | - The default division is 60% for execution and 40% for storage, after allowing for 300 MB of reserved memory to safeguard against OOM errors. 28 | 29 | - Execution memory is used for shuffles, joins, sorts and aggregations. 30 | 31 | - Storage memory is primarily used for caching user data structures and partitions derived from DataFrames. 32 | 33 | - During map and shuffle operations, Spark writes to and reads from the local disk's shuffle files, so there is heavy I/O activity. 34 | 35 | - The following configurations can be used during heavy workloads to reduce I/O bottlenecks. 36 | ```python 37 | spark.driver.memory 38 | spark.shuffle.file.buffer 39 | spark.file.transferTo 40 | spark.shuffle.unsafe.file.output.buffer 41 | spark.io.compression.lz4.blockSize 42 | spark.shuffle.service.index.cache.size 43 | spark.shuffle.registration.timeout 44 | spark.shuffle.registration.maxAttempts 45 | ``` 46 | 47 | ### Spark Parallelism - 48 | - To optimize resource utilization and maximize parallelism, the ideal is to have at least as many partitions as there are cores on the executor. 49 | - How partitions are created - 50 | - Data on disk is laid out in chunks or contiguous file blocks 51 | - Default size is 128 MB in HDFS and S3. A contiguous collection of these blocks is a partition. 52 | - The size of a partition in Spark is dictated by `spark.sql.files.maxPartitionBytes`, which is 128 MB by default. 53 | - Decreasing the partition file size too much may cause the "small file problem", increasing disk I/O and degrading performance. 54 | - Shuffle Partitions 55 | - For smaller workloads, the shuffle partitions should be reduced from the default of 200 to the number of cores or executors, or less. 56 | - During shuffle operations, Spark will spill results to executors' local disks at the location specified in `spark.local.dir`. Having performant SSDs for this operation will boost performance. 57 | 58 | - When writing, use the `maxRecordsPerFile` option to control how many records go into each partition file. This helps mitigate the small-file or very-large-file problems. 59 | 60 | ### Caching and Persistence of Data - 61 | 62 | `Dataframe.cache()` 63 | - cache() will store as many partitions as memory allows.
64 | - Dataframes can be fractionally cached, but a partition cannot. 65 | - Note: A dataframe is not fully cached until you invoke an action that goes through all the records (e.g. count). If you use take(1), only one partition will be cached, because Catalyst realizes that you do not need to compute all the partitions just to retrieve one record. 66 | 67 | `Dataframe.persist()` 68 | - persist(StorageLevel.level) provides control over how your data is cached via StorageLevel. Data on disk is always serialized using either Java or Kryo serialization. 69 | - MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, MEMORY_AND_DISK_SER are different persist levels one can use. 70 | 71 | #### When to Cache and Persist- 72 | 73 | - When you want to access a large dataset repeatedly for queries and transformations. 74 | 75 | #### When not to Cache and Persist- 76 | 77 | - Dataframes are too big to cache in memory 78 | - An inexpensive transformation on a dataframe that does not require frequent use, regardless of size. 79 | 80 | ### Statistics Collection - 81 | 82 | - The cost-based query optimizer can make use of statistics for named tables, but not for arbitrary dataframes or RDDs, to make optimization decisions. 83 | - The statistics should be collected and maintained. 84 | 85 | #### Table Level 86 | 87 | ```SQL 88 | ANALYZE TABLE table_name COMPUTE STATISTICS 89 | ``` 90 | 91 | #### Column Level 92 | 93 | ```SQL 94 | ANALYZE TABLE table_name COMPUTE STATISTICS FOR 95 | COLUMNS column_name1, column_name2, ... 96 | ``` 97 | 98 | Column-level statistics are slower to collect, but provide more information for the cost-based optimizer to use about those data columns. Both types of statistics can help with joins, aggregations, filters, and a number of other potential things (e.g., automatically choosing when to do a broadcast join). 99 | 100 | ### Spark Joins - 101 | 102 | #### Broadcast Hash Join- 103 | 104 | - Also known as a map-side-only join 105 | - By default, Spark uses a broadcast join if the smaller data set is less than 10 MB. 106 | - When to use a broadcast hash join - 107 | - When each key within the smaller and larger data sets is hashed to the same partition by Spark. 108 | - When one data set is much smaller than the other. 109 | - When you are not worried by excessive network bandwidth usage or OOM errors, because the smaller data set will be broadcast to all Spark executors. 110 | 111 | #### Shuffle Sort Merge Join- 112 | 113 | - Performed over a common key that is sortable, unique and can be stored in the same partition. 114 | - The sort phase sorts each data set by its desired join key. The merge phase iterates over the keys from each dataset and merges the rows if the two keys match. 115 | - Optimizing the shuffle sort merge join - 116 | - Create partitioned buckets for common sorted keys or columns on which we want to perform frequent equi-joins. 117 | - For a column with high cardinality, use bucketing; otherwise use partitioning. 118 | - When to use a shuffle sort merge join - 119 | - When each key within two large data sets can be sorted and hashed to the same partition by Spark. 120 | - When you want to perform only equi-joins to combine two data sets based on matching sorted keys. 121 | - When you want to prevent Exchange and Sort operations to save large shuffles across the network.
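As a small illustration of the broadcast hash join described above (the table and column names are hypothetical), the smaller side can be broadcast explicitly with a hint; Spark also does this automatically when the estimated size of the smaller side is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# A large fact table and a small dimension table (synthetic data for illustration)
orders = spark.range(10_000_000).selectExpr("id AS order_id", "id % 100 AS country_id")
countries = spark.range(100).selectExpr("id AS country_id",
                                        "concat('country_', cast(id AS string)) AS name")

# Broadcast the small side so the large side is never shuffled
joined = orders.join(broadcast(countries), "country_id")

# The physical plan should show BroadcastHashJoin instead of SortMergeJoin
joined.explain()
```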
122 | 123 | ## Notes from the video 124 | 125 | [Fine Tuning and Enhancing Performance of Apache Spark Jobs](https://youtu.be/WSplTjBKijU) 126 | Some general points to note- 127 | 128 | - Increasing executor memory will increase garbage collection time 129 | - Adding more CPUs can lead to scheduling issues as well as additional shuffles 130 | 131 | ### Skew 132 | 133 | #### How to check? 134 | 135 | - The Spark UI shows the job waiting on only some of the tasks 136 | - An executor missing a heartbeat 137 | - Check the partition sizes of RDDs (row count) while debugging to confirm 138 | - Check the partition sizes in the Spark logs 139 | 140 | #### How to handle? 141 | 142 | - A fix at ingestion time goes a long way 143 | - JDBC 144 | - When reading from an RDBMS source using JDBC connectors, use the options for partitioned reads. The default fetch size depends on the database that you are reading from 145 | - The partition column should be numeric and relatively evenly distributed 146 | - If no such column is present, you should create one using mod or hash functions 147 | - Already partitioned data (S3 etc.) 148 | - Read the data and repartition if needed 149 | 150 | ### Cache/Persist 151 | 152 | - Unpersist when done to free up memory for garbage collection 153 | - For self joins, cache the table to avoid reading and deserializing the same data twice 154 | - Don't over-persist, it can lead to - 155 | - Increased spill to disk 156 | - Slow garbage collection 157 | 158 | ### Avoid UDFs 159 | 160 | - UDFs have to deserialize every row to an object 161 | - Then apply the lambda function 162 | - And then reserialize it 163 | - This leads to increased garbage collection 164 | 165 | ### Join Optimizations 166 | 167 | - Filter trick 168 | - Get the keys from the medium table and filter the records from the large table before performing a join 169 | - Dynamic partition pruning 170 | - Salting 171 | - Use the salting technique to reduce/eliminate skew on the joining keys 172 | - Salting will use up more memory 173 | 174 | ### Things to remember 175 | 176 | - Keep the largest dataframe on the left because Spark tries to shuffle the right dataframe first. A smaller right-hand dataframe will lead to less shuffling 177 | - Follow good partitioning strategies 178 | - Filter as early as possible 179 | - Try to use the same partitioner between DFs for joins 180 | 181 | ### Task Scheduling 182 | 183 | - Default scheduling is FIFO 184 | - Fair Scheduling 185 | - Allows scheduling longer tasks alongside smaller tasks 186 | - Better resource utilization 187 | - Harder to debug. Turn it off locally when debugging 188 | 189 | ### Serialization 190 | 191 | | Java | Kryo | 192 | | ------------------------ | ---------------------------------------------------- | 193 | | Default for most types | Default for shuffling RDDs and simple types like int | 194 | | Can work for any class | For serializable types | 195 | | More flexible but slower | Significantly faster and more compact | 196 | | | Set your SparkConf to use Kryo serialization | 197 | 198 | ### Garbage Collection 199 | 200 | #### How to check?
201 | 202 | - Check time spent on tasks vs GC on Spark UI 203 | - Check the memory used in server 204 | 205 | ## Databricks Delta 206 | 207 | [Optimize performance with file management](https://docs.databricks.com/delta/optimizations/file-mgmt.html) 208 | 209 | ### Compaction (Bin packing) 210 | 211 | - Improve speed of the queries by coalescing smaller files into larger files 212 | 213 | - You trigger compaction by running the following command 214 | 215 | ```bash 216 | OPTIMIZE events 217 | ``` 218 | 219 | - Bin packing optimization is idempotent 220 | 221 | - Evenly balanced in size but not necessarily in terms of records. The two measures are more often correlated 222 | 223 | - Returns min, max, total and so on for the files removed and added 224 | 225 | - Also returns Z-Ordering statistic 226 | 227 | ### Data Skipping 228 | 229 | - Works for any comparison of nature `column 'op' literal` where op could be `>`,`<`,`=`,`like`, `and`, `or` etc. 230 | - By default, generates stats for only the first 32 columns in the data. For more columns, reordering is required. Or the threshold limit can be changed 231 | - Collecting stats on long string is expensive. One can skip long string like columns using `delta.dataSkippingNumIndexedCols` 232 | 233 | ### Z-Ordering (multi-dimensional clustering) 234 | 235 | - Colocates related information in same set of files 236 | - Use for high cardinality columns 237 | - Effectiveness drops with each additional column 238 | - Columns that do not have statistics collected on them, would be ineffective as it requires stats like min, max, count etc. for data skipping 239 | - Z-ordering is not idempotent 240 | - Evenly balanced files in terms of records but not necessarily size. The two measures are more often correlated 241 | 242 | For running`OPTIMIZE` (bin-packing or z-ordering) compute intensive machines like c5d series is recommended as both operations will be doing large amounts of Parquet decoding and encoding -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Spark Learning Guide 2 | 3 | ### This material has been created using multiple sources from the internet like Databricks blogs and courses, official docs, Stack Overflow, Learning Spark 2.0 and The Definitive Guide. 4 | 5 | #### You can use this guide to learn about different components of Spark and as a reference material. This section covers all the topics that should be enough for you to get started with Spark Theory. 6 | 7 | #### You can refer to the advanced topics here - 8 | 9 | - [Optimization Techniques](advanced/optimizations.md) 10 | - [Joins Internal Working](advanced/joins.md) 11 | - [Delta Lake](advanced/deltalake.md) 12 | - [Spark 3.0](advanced/new_in_spark_3.md) 13 | -------------------------- 14 | 1. What is Spark? 15 | Apache Spark is a cluster computing platform designed to be fast and general-purpose. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines or a computing cluster. 16 | -------------------------- 17 | 2. What is a Spark Core? 18 | Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. 
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections. 19 | -------------------------- 20 | 3. Key features of Spark - 21 | - Spark can run over multiple file systems. 22 | - Multiple software systems need not run to achieve a single task because spark provides a lot of capabilities under the hood. A single application can leverage streaming, ML and Spark SQL capabilities of spark. 23 | - Spark has the philosophy of tight integration where each component is designed to interoperate closely. Hence any improvements at the lower level improve all the libraries running over it. 24 | - Spark offers in-memory computations 25 | -------------------------- 26 | 4. Major libraries that constitute the Spark Ecosystem - 27 | Spark MLib- Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc. 28 | Spark Streaming – This library is used to process real-time streaming data. 29 | Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc. 30 | Spark SQL – Helps execute SQL like queries on Spark data using standard visualization or BI tools. 31 | -------------------------- 32 | 5. What is an RDD? 33 | The main abstraction Spark provides is a *resilient distributed dataset* (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 34 | 35 | An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any data source, e.g. text files, a database via JDBC, etc. 36 | 37 | Definition - *RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.* 38 | 39 | 6. How are RDDs created? 40 | Spark provides two ways to create RDDs: 41 | 42 | - loading an external dataset 43 | - parallelizing a collection in your driver program. 44 | -------------------------- 45 | 7. What is a partition? 46 | A partition is a logical or small chunk of a large distributed data set. It provides the possibility to distribute the work across the cluster, divide the task into smaller parts, and reduce memory requirements for each node. **Partition is the main unit of parallelism in Apache Spark**. 47 | -------------------------- 48 | 8. How is an RDD fault-tolerant? 49 | When a set of operations happen on an RDD the spark engine views these operations as a DAG. If a node processing the RDD crashes and was performing operations X->Y->Z on the RDD and failed at Z, then the resource manager assigns a new node for the operation and the processing begins from X again using the directed graph. 50 | -------------------------- 51 | 9. Why are RDDs immutable? 52 | Immutability rules out a big set of potential problems due to updates from multiple threads at once. Immutable data is safe to share across processes. 53 | They're not just immutable but a deterministic function (a function that returns the same result with the same input) of their input. This plus immutability also means the RDD's parts can be recreated at any time. 
This makes caching, sharing and replication easy. 54 | These are significant design wins, at the cost of having to copy data rather than mutate it in place. Generally, that's a decent tradeoff to make: gaining the fault tolerance and correctness with no developer effort is worth spending memory and CPU on since the latter are cheap. 55 | A Corollary: immutable data can as easily live in memory as on disk. This makes it reasonable to easily move operations that hit the disk to instead use data in memory, and again, adding memory is much easier than adding I/O bandwidth. 56 | -------------------------- 57 | 10. What are Transformations? 58 | Spark Transformations are functions that produce a new RDD from an existing RDD. An RDD Lineage is built when we apply Transformations on an RDD. Basic Transformations are - map and filter. After the transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter, count, distinct, sample), bigger (e.g. `flatMap(), union(), Cartesian()`) or the same size (e.g. map). 59 | - Narrow dependency : RDD operations like `map(), union(), filter()` can operate on a single partition and map the data of that partition to the resulting single partition. These kinds of operations that map data from one to one partition are referred to as Narrow operations. Narrow operations don’t require distributing the data across the partitions. Each partition of the parent RDD is used by at most one partition of the child RDD. 60 | 61 | ![narrow](static/Narrow-Transformation.png) 62 | 63 | - Wide dependency : RDD operations like `groupByKey, distinct, join` may require mapping the data across the partitions in the new RDD. These kinds of operations which maps data from one to many partitions are referred to as Wide operations 64 | Each partition of the parent RDD may be depended on by multiple child partitions. 65 | 66 | ![wide](static/Wide-Transformation.png) 67 | -------------------------- 68 | 11. What are Actions? 69 | Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. In other words, an RDD operation that returns a value of any type but `RDD[T]` is an action. They trigger the execution of RDD transformations to return values. Simply put, an action evaluates the RDD lineage graph. 70 | Actions are one of two ways to send data from executors to the driver (the other being accumulators). 71 | Some examples of actions are - `aggregate, collect, count, countApprox, countByValue, first, fold, foreach, foreachPartition, max, min, reduce, saveAs* actions, saveAsTextFile, saveAsHadoopFile, take, takeOrdered, takeSample, toLocalIterator, top, treeAggregate, treeReduce` 72 | -------------------------- 73 | [Anatomy of Spark Application - Luminousmen](https://luminousmen.com/post/spark-anatomy-of-spark-application) 74 | 75 | 12. What is a driver? 76 | The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and analyzing, distributing, and scheduling work across the executors (defined momentarily). 77 | 78 | ​ In a single databricks cluster, there will only be one driver irrespective of the number of executors. 79 | 80 | - Prepares Spark Context 81 | - Declares operations on the RDD using Transformations and Actions. 82 | - Submits serialized RDD graph to master. 
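As a minimal sketch of a driver program (the application name and data here are arbitrary), the snippet below prepares the SparkSession/SparkContext, declares transformations that only build up the RDD lineage, and triggers actual execution on the executors with an action:

```python
from pyspark.sql import SparkSession

# This code runs in the driver process
spark = SparkSession.builder.appName("driver-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 101))        # RDD created from a local collection
evens = rdd.filter(lambda x: x % 2 == 0)   # narrow transformation, nothing runs yet
doubled = evens.map(lambda x: x * 2)       # still only building the lineage (DAG)

print(doubled.count())                     # action: the driver schedules tasks on executors
```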
83 | [Spark Driver - Stackoverflow](https://stackoverflow.com/a/24638280) 84 | -------------------------- 85 | 13. What is a Task? 86 | A task is a unit of work that can be run on a partition of a distributed dataset and gets executed on a single executor. The unit of parallel execution is at the task level. All the tasks within a single stage can be executed in parallel. 87 | -------------------------- 88 | 14. What is a Stage? 89 | 90 | A stage is a collection of tasks that can run in parallel. A new stage is created when there is data shuffling. 91 | -------------------------- 92 | 15. What is a Core? 93 | A core is a basic computation unit of a CPU and a CPU may have one or more cores to perform tasks at a given time. The more cores we have, the more work we can do. In spark, this controls the number of parallel tasks an executor can run. 94 | -------------------------- 95 | 16. What is Hadoop, Hive, Hbase? 96 | Hadoop is basically 2 things: a Distributed FileSystem (HDFS) + a Computation or Processing framework (MapReduce). Like all other FS, HDFS also provides us with storage, but in a fault-tolerant manner with high throughput and lower risk of data loss (because of the replication). But, being an FS, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs. 97 | Hive: It provides us with data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them. 98 | -------------------------- 99 | 17. What is parquet? 100 | [Parquet and it's pros and cons - Stackoverflow](https://stackoverflow.com/a/36831549/8515731) 101 | 102 | ![Row Vs Columnar](static/columnar_storage.png) 103 | 104 | - The schema is stored in the footer of the file 105 | - Doesn't waste space storing missing value 106 | - Has predicate pushdown 107 | - Loads only required columns 108 | - Allows data skipping 109 | 110 | -------------------------- 111 | 18. What file systems do Spark support? 112 | - Hadoop Distributed File System (HDFS). 113 | - Local File system. 114 | - S3 115 | -------------------------- 116 | 19. What is a Cluster Manager? 117 | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). Spark is agnostic to a cluster manager as long as it can acquire executor processes and those can communicate with each other. We are primarily interested in Yarn as the cluster manager. A spark cluster with Yarn as a cluster/resource manager can run in either yarn cluster or yarn-client mode: 118 | yarn-client mode – A driver runs on client process, Application Master is only used for requesting resources from YARN. 119 | yarn-cluster mode – A driver runs inside the application master process, the client goes away once the application is initialized 120 | [Cluster Mode Overview - Spark Documentation](https://spark.apache.org/docs/latest/cluster-overview.html) 121 | -------------------------- 122 | 20. What is yarn? 
123 | [What is Yarn - Hadoop Documentation](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) 124 | A good guide to understand how Spark works with YARN - 125 | - [Spark on YARN: A Deep Dive - Sandy Ryza - Youtube](https://youtu.be/N6pJhxCPe-Y) 126 | - [Spark over Yarn - Stackoverflow](https://stackoverflow.com/questions/24909958/spark-on-yarn-concept-understanding) 127 | -------------------------- 128 | 21. What is MapReduce? 129 | [Introduction to MapReduce - Guru99](https://www.guru99.com/introduction-to-mapreduce.html) 130 | -------------------------- 131 | 22. Spark vs MapReduce? 132 | [Spark vs MapReduce - Medium @bradanderson](https://medium.com/@bradanderson.contacts/spark-vs-hadoop-mapreduce-c3b998285578) 133 | -------------------------- 134 | 23. What is an Executor? 135 | An executor is a single JVM process that is launched for an application on a worker node. Executor runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. A single node can run multiple executors and executors for an application can span across multiple worker nodes. An executor stays up for the duration of the Spark Application and runs the tasks in multiple threads. The number of executors for a spark application can be specified inside the SparkConf or via the flag –num-executors from the command line. 136 | - Executor performs all the data processing. 137 | - Reads from and writes data to external sources. 138 | - Executor stores the computed data in-memory, cache or on hard disk drives. 139 | - Interacts with the storage systems. 140 | -------------------------- 141 | 24. What are workers, executors, cores in the Spark Standalone cluster? 142 | 143 | A worker node hosts the executor process. It has a fixed number of executors allocated at any point in time. Each executor will hold the chunk of the data to be processed. This chunk is called a partition. 144 | 145 | Spark parallelizes at two levels - first with the executors and second with the cores allocated to the executors. 146 | 147 | [Workers, executors, cores - Stackoverflow](https://stackoverflow.com/questions/32621990/what-are-workers-executors-cores-in-spark-standalone-cluster) 148 | -------------------------- 149 | 25. Name types of Cluster Managers in Spark. 150 | The Spark framework supports four major types of Cluster Managers - 151 | - Standalone: a basic manager to set up a cluster. 152 | - Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications. 153 | - Yarn: responsible for resource management in Hadoop 154 | - Kubernetes: an open-source system for automating deployment, scaling, and management of containerized applications 155 | -------------------------- 156 | 26. How can you minimize data transfers when working with Spark? 157 | Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are: 158 | - Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs. 159 | - Using Accumulators – Accumulators help update the values of variables in parallel while executing. 160 | - The most common way is to avoid `ByKey` operations, repartition or any other operations which trigger shuffles. 161 | -------------------------- 162 | 27. What are broadcast variables? 
163 | Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication costs. 164 | Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in the deserialized form is important. 165 | Broadcast variables are created from a variable v by calling `SparkContext.broadcast(v)`. The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this: 166 | 167 | ```python 168 | >>> broadcastVar = sc.broadcast([1, 2, 3]) 169 | 170 | 171 | >>> broadcastVar.value 172 | [1, 2, 3] 173 | ``` 174 | After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later). 175 | -------------------------- 176 | 28. Why is there a need for broadcast variables when working with Apache Spark? 177 | These are read-only variables, present in-memory cache on every machine. When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup (). 178 | [Why use broadcast variables? - Stackoverflow](https://stackoverflow.com/questions/26884871/what-are-broadcast-variables-what-problems-do-they-solve) 179 | -------------------------- 180 | 29. What is a Closure? 181 | The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). The closure is serialized and sent to each executor. The variables within the closure sent to each executor are copies. 182 | [Closure Spark Documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#understanding-closures-a-nameclosureslinka) 183 | -------------------------- 184 | 30. What are Accumulators? 185 | Accumulators are variables that are "added" to through an associative and commutative "add" operation. They act as a container for accumulating partial values across multiple tasks (running on executors). They are designed to be used safely and efficiently in parallel and distributed Spark computations and are meant for distributed counters and sums. 186 | [Accumulators Spark Documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#accumulators) 187 | -------------------------- 188 | 31. How can you trigger automatic clean-ups in Spark to handle accumulated metadata? 
189 | You can trigger the clean-ups by setting the parameter `spark.cleaner.ttl` or by dividing the long-running jobs into different batches and writing the intermediary results to the disk. 190 | -------------------------- 191 | 32. What is the significance of the Sliding Window operation? 192 | In Spark Streaming, a sliding window controls how incoming data is grouped over time. The Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream. 193 | -------------------------- 194 | 33. What is a DStream? 195 | A Discretized Stream is a sequence of Resilient Distributed Datasets (RDDs) that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS and Apache Flume. DStreams have two operations – 196 | - Transformations that produce a new DStream. 197 | - Output operations that write data to an external system. 198 | -------------------------- 199 | 34. When running Spark applications, is it necessary to install Spark on all the nodes of a YARN cluster? 200 | Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster. 201 | -------------------------- 202 | 35. What is the Catalyst framework? 203 | 204 | ![Query Optimization](static/query_optimization.png) 205 | 206 | The Catalyst framework is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system. 207 | 208 | It goes through 4 transformational phases - 209 | - Analysis 210 | - Logical optimization 211 | - Physical planning 212 | - Code generation 213 | 214 | Phase 1: Analysis 215 | An abstract syntax tree is generated from the dataframe or query. 216 | The column names, datatypes, functions, databases etc. are resolved or validated by consulting an internal 'Metadata Catalog'. 217 | 218 | From this analysis, we get a logical plan. 219 | 220 | Phase 2: Logical Optimization 221 | The logical optimization phase applies standard rule-based optimizations to the logical plan. These include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and other rules. 222 | 223 | Phase 3: Physical Planning 224 | At this stage, the Catalyst optimizer generates one or more physical plans. These represent what the query engine will actually do. Each physical plan is evaluated according to a cost model and the best-performing plan is selected. 225 | 226 | Phase 4: Code generation 227 | Generation of efficient Java bytecode to run on each machine. 228 | Spark acts as a compiler, facilitated by Project Tungsten for whole-stage code generation. 229 | Whole-stage code generation is a physical query optimization that gets rid of virtual function calls and employs CPU registers for intermediate data. This generates a compact RDD for final execution. 230 | 231 | [Physical Plans In Spark SQL - Databricks Spark Summit - Part 1](https://youtu.be/99fYi2mopbs) 232 | 233 | [Physical Plans In Spark SQL - Databricks Spark Summit - Part 2](https://youtu.be/9EIzhRKpiM8) 234 | 235 | -------------------------- 236 | 36. Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
237 | Tachyon 238 | -------------------------- 239 | 37. Explain the different types of transformations on DStreams? 240 | - Stateless Transformations- Processing of the batch does not depend on the output of the previous batch. Examples- `map ()`, `reduceByKey ()`, `filter ()`. 241 | - Stateful Transformations- Processing of the batch depends on the intermediary results of the previous batch. Examples- Transformations that depend on sliding windows 242 | -------------------------- 243 | 37. Explain the popular use cases of Apache Spark 244 | Apache Spark is mainly used for - 245 | - Iterative machine learning. 246 | - Interactive data analytics and processing. 247 | - Stream processing. 248 | - Batch Processing. 249 | - Sensor data processing. 250 | -------------------------- 251 | 38. How can you remove the elements with a key present in any other RDD? 252 | Use the `subtractByKey()` function 253 | -------------------------- 254 | 39. What is the difference between persist() and cache()? 255 | `persist()` allows the user to specify the storage levels whereas `cache()` uses the default storage level. 256 | -------------------------- 257 | 40. What are the various levels of persistence in Apache Spark? 258 | Apache Spark automatically persists some intermediary data from various shuffle operations. This is done to avoid recomputing the entire input if a node fails during the shuffle. However, it is often suggested that users call `persist()` method on the RDD in case they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory or as a combination of both with different replication levels. 259 | The various storage/persistence levels in Spark are - 260 | 261 | - MEMORY_ONLY 262 | - MEMORY_ONLY_SER 263 | - MEMORY_AND_DISK 264 | - MEMORY_AND_DISK_SER, DISK_ONLY 265 | - MEMORY_ONLY_2, MEMORY_AND_DISK_2 266 | - OFF_HEAP 267 | 268 | [Which Storage Level to Choose?](https://spark.apache.org/docs/3.0.0-preview/rdd-programming-guide.html#which-storage-level-to-choose) 269 | -------------------------- 270 | 41. Does Apache Spark provide checkpointing? 271 | Lineage graphs are always useful to recover RDDs from failure but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing i.e. a REPLICATE flag to persist. However, the decision on which data to the checkpoint - is decided by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies. 272 | -------------------------- 273 | 42. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark? 274 | The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. RDD always has information on how to build from other datasets. If any partition of an RDD is lost due to failure, lineage helps build only that particular lost partition. 275 | 276 | Assuming that all of the RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark Cluster. 277 | -------------------------- 278 | 43. Explain the core components of a distributed Spark application. 279 | Driver - The process that runs the main() method of the program to create RDDs and perform transformations and actions on them. 280 | Executor - The worker processes that run the individual tasks of a Spark job. 281 | Cluster Manager - A pluggable component in Spark, to launch Executors and Drivers. 
The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN. 282 | -------------------------- 283 | 44. What do you understand by Lazy Evaluation? 284 | Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it takes note of the instructions but does nothing until the final result is asked for. When a transformation like `map()` is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow. 285 | -------------------------- 286 | 45. Define a worker node- 287 | A node that can run the Spark application code in a cluster can be called a worker node. A worker node can have more than one worker, which is configured by setting the `SPARK_WORKER_INSTANCES` property in the `spark-env.sh` file. Only one worker is started if the `SPARK_WORKER_INSTANCES` property is not defined. 288 | -------------------------- 289 | 46. What do you understand by `SchemaRDD`? 290 | Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using Spark. At the core of this component is a new type of RDD, the `SchemaRDD`: an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column. A `SchemaRDD` is similar to a table in a traditional relational database. 291 | -------------------------- 292 | 47. What are the disadvantages of using Apache Spark over Hadoop MapReduce? 293 | Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Apache Spark’s in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop. 294 | -------------------------- 295 | 48. What do you understand by Executor Memory in a Spark application? 296 | Every Spark application has the same fixed heap size and a fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the `spark.executor.memory` property or the `--executor-memory` flag. Every Spark application will have one executor on each worker node. The executor memory is a measure of how much of the worker node's memory the application will utilize. 297 | -------------------------- 298 | 49. What, according to you, is a common mistake Apache Spark developers make when using Spark? 299 | - [Not avoiding `GroupByKey`](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html) 300 | - [Collecting Large Datasets](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html) 301 | - [Not Dealing with Bad Input](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html) 302 | - [Not managing shuffle partitions](https://nealanalytics.com/blog/databricks-spark-jobs-optimization-techniques-shuffle-partition-technique-part-1/) 303 | 50. (A) Suppose that there is an RDD named `Samplerdd` that contains a huge list of numbers.
303 | 51. (A) Suppose that there is an RDD named `Samplerdd` that contains a huge list of numbers. The following Spark code is written to calculate the average -
304 | 
305 | ```python
306 | def SampleAvg(x, y):
307 |     return (x + y) / 2.0
308 | 
309 | avg = Samplerdd.reduce(SampleAvg)
310 | ```
311 | 
312 | --------------------------
313 | 51. (B) What is wrong with the above code and how will you correct it?
314 | The average function is not associative, so using it with `reduce` gives an incorrect result. The better way to compute the average is to first sum the values and then divide by the count, as shown below -
315 | 
316 | ```python
317 | def add(x, y):
318 |     return x + y
319 | 
320 | total = Samplerdd.reduce(add)
321 | avg = total / Samplerdd.count()
322 | ```
323 | However, the above code could lead to an overflow if the total becomes very big. So, another way to compute the average is to divide each number by the count first and then add them up, as shown below -
324 | ```python
325 | cnt = Samplerdd.count()
326 | 
327 | def divideByCnt(x):
328 |     return x / cnt
329 | 
330 | myrdd1 = Samplerdd.map(divideByCnt)
331 | avg = myrdd1.reduce(add)  # reduce the scaled RDD, not the original Samplerdd
332 | ```
333 | --------------------------
334 | 52. Explain the difference between Spark SQL and Hive.
335 | 
336 | - Spark SQL is faster than Hive.
337 | - Any Hive query can easily be executed in Spark SQL, but not vice-versa.
338 | - Spark SQL is a library whereas Hive is a framework.
339 | - It is not mandatory to create a metastore in Spark SQL, but it is mandatory to create a Hive metastore.
340 | - Spark SQL automatically infers the schema, whereas in Hive the schema needs to be explicitly declared.
341 | --------------------------
342 | 53. What is a Spark Session?
343 | The first step of any Spark application is creating a `SparkSession`, which enables you to run Spark code.
344 | The `SparkSession` class provides the single entry point to all functionality in Spark using the `DataFrame` API.
345 | It is automatically created for you in a Databricks notebook or the Spark shell as the variable `spark`.
346 | --------------------------
347 | 54. Why should one not use a UDF?
348 | UDFs cannot be optimized by the Catalyst Optimizer. To use UDFs, functions must be serialized and sent to the executors. And for Python, there is the additional overhead of spinning up a Python interpreter on an executor to run the UDF.
349 | 
350 | `sampleUDF = udf(sample_function)` serializes the function and sends the UDF to the executors so that we can use it on a DataFrame.
351 | 
352 | --------------------------
353 | 55. What is an `UnsafeRow`?
354 | The data that is "shuffled" is in a format known as `UnsafeRow`, or more commonly, the Tungsten Binary Format. `UnsafeRow` is the in-memory storage format for Spark SQL and DataFrames.
355 | 
356 | Advantages include -
357 | 
358 | - Compactness: Column values are encoded using custom encoders, not as JVM objects (as with RDDs).
359 | The benefit of using Spark's custom encoders is that you get almost the same compactness as Java serialization, but significantly faster encoding/decoding speeds.
360 | For custom data types, it is possible to write custom encoders from scratch.
361 | - Efficiency: Spark can operate directly out of Tungsten, without first deserializing Tungsten data into JVM objects.
362 | --------------------------
363 | 56. What are some best Caching Practices?
364 | - Don't cache unless you're sure the DataFrame is going to be used multiple times (see the sketch below).
365 | - Omit unneeded columns to reduce the storage footprint.
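A small sketch of these practices under assumed inputs - the path, the `events_df` DataFrame and its column names are hypothetical. Select only the columns you need, persist while the DataFrame is being reused, and unpersist once done.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("caching-practices").getOrCreate()

# Hypothetical input path and columns, used only for illustration.
events_df = spark.read.parquet("/tmp/data/events")

# Keep only the columns that will actually be reused, then persist.
trimmed_df = events_df.select("user_id", "event_type", "event_ts")
trimmed_df.persist(StorageLevel.MEMORY_AND_DISK)  # or trimmed_df.cache() for the default level

# The DataFrame is reused by more than one action, so caching pays off.
print(trimmed_df.count())
trimmed_df.groupBy("event_type").count().show()

# Free the storage once the reuse is over.
trimmed_df.unpersist()
```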
366 | --------------------------
367 | 57. Understanding the Spark UI
368 | 
369 | - Use `setJobDescription` for better tracking of jobs in the Spark UI.
370 | 
371 | - Use the event timeline to analyze jobs that are taking a long time to execute.
372 | 
373 | - The event timeline for a stage shows the various task components, including executor computing time, which should be the dominant colour in the timeline. The other coloured components are overhead and should be examined if we want to optimize the process. If there is a lot of overhead time, consider creating larger partitions of data.
374 | 
375 | - In the Summary Metrics tab, we can see the statistics by quartile for the green tasks in the event timeline. Here we should analyze Duration to see if the partitions are skewed - a large difference between the min and max values indicates skewed partitions.
376 | 
377 | - Input Size/Records can be used in a similar way, by checking whether there is a large difference between the min, median and max partition sizes.
378 | 
379 | - Inside the SQL tab, we can click on the job descriptions that we set. This leads to a more explanatory visualization mapped to the actual code that we wrote.
380 | 
381 | [Spark Web UI - Official Documentation](https://spark.apache.org/docs/3.0.0-preview/web-ui.html)
382 | 
383 | [Deep Dive into Monitoring Spark Applications - Jacek Laskowski - YouTube](https://youtu.be/mVP9sZ6K__Y)
384 | 
385 | [Spark UI Visualization - Andrew Or - YouTube](https://www.youtube.com/watch?v=VQOKk9jJGcw&list=PL-x35fyliRwif48cPXQ1nFM85_7e200Jp)
386 | --------------------------
387 | 58. Shared resources -
388 | - Executors share machine-level resources. That is, if a node has 4 executors, the machine's resources are shared between them.
389 | - Tasks share executor-level resources.
390 | - Resources are shared by the cores under a single executor - they share its memory, disk and network. If any core under an executor fails because of OOM or any other reason, the whole executor is affected and the processing on that executor has to be stopped.
391 | --------------------------
392 | 59. Local and Global Results -
393 | For certain actions and transformations, tasks first operate on each partition individually, and then the same operation needs to be performed again globally to get the accurate result. For example, if 5 executors report the record counts of their partitions as 4, 5, 5, 6 and 4, a final global count is needed to conclude that the dataset has 24 records. More such operations are -
394 | 
395 | | Stage 1         | Stage 2          |
396 | |-----------------|------------------|
397 | | Local filter    | No global filter |
398 | | Local count     | Global count     |
399 | | Local distinct  | Global distinct  |
400 | | Local sort      | Global sort      |
401 | | Local aggregate | Global aggregate |
402 | --------------------------
403 | 60. What is shuffling?
404 | Shuffling is the process of rearranging data within a cluster between stages.
405 | It is triggered by wide transformations like -
406 | - Repartition
407 | - ByKey operations (except counting)
408 | - Joins, the worst being cross joins
409 | - Sorting
410 | - Distinct
411 | - GroupBy
412 | --------------------------
413 | 61. What is a DataFrame?
414 | A DataFrame is a distributed collection of data grouped into named columns.
415 | --------------------------
416 | 62. Why DataFrames and not RDDs?
417 | Spark does not know what computation happens inside an RDD operation. Whether you are performing a join, filter, select or aggregation, Spark only sees it as a lambda expression. Even the `Iterator[T]` datatype is not visible to Spark. That leaves no room for Spark to perform optimizations.
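A short sketch of the contrast, using a made-up list of (name, age) records - in the RDD version Spark only sees opaque lambdas, while in the DataFrame version the filter and aggregation are declarative expressions that the Catalyst Optimizer can analyze and rearrange.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-dataframe").getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Cathy", 29), ("Dan", 45)]

# RDD version - the intent is hidden inside lambdas, so Spark cannot optimize the plan.
rdd = spark.sparkContext.parallelize(data)
rdd_result = (rdd.filter(lambda row: row[1] > 30)
                 .map(lambda row: (row[1], 1))
                 .reduceByKey(lambda a, b: a + b)
                 .collect())
print(rdd_result)

# DataFrame version - named columns and operators that Catalyst can inspect,
# reorder and compile into an optimized physical plan.
df = spark.createDataFrame(data, ["name", "age"])
df.filter(F.col("age") > 30).groupBy("age").count().show()

spark.stop()
```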
418 | --------------------------
419 | 63. Why should you always define your schema upfront when reading a file?
420 | 
421 | - You relieve Spark from the onus of inferring data types.
422 | 
423 | - You prevent Spark from creating a separate job just to read a large portion of
424 |   your file to ascertain the schema, which for a large data file can be expensive and
425 |   time-consuming.
426 | 
427 | - You can detect errors early if the data doesn't match the schema.
428 | 
429 | --------------------------
430 | 64. Managed vs Unmanaged tables?
431 | For a managed table, Spark manages both the data in the file store (HDFS, S3, etc.) and the metadata for the table, while for an unmanaged table, Spark only manages the metadata and you manage the data yourself in an external data source such as Cassandra. So for a command like `DROP TABLE`, Spark will only delete the metadata of an unmanaged table.
432 | 
433 | Unmanaged tables in Spark can be created like this -
434 | 
435 | ```python
436 | (flights_df
437 |   .write
438 |   .option("path", "/tmp/data/us_flights_delay")
439 |   .saveAsTable("us_delay_flights_tbl"))
440 | ```
441 | --------------------------
442 | 65. How can you speed up PySpark UDFs?
443 | One can create a Pandas UDF using the `pandas_udf` decorator.
444 | Before the introduction of Pandas UDFs -
445 | 
446 | - Collect all rows to the Spark driver.
447 | - Each row is serialized into Python's pickle format and sent to the Python worker process.
448 | - The child process unpickles each row into a huge list of tuples.
449 | - A Pandas DataFrame is created using `pandas.DataFrame.from_records()`.
450 | This causes issues like -
451 | - Even using `cPickle`, Python serialization is a slow process.
452 | - `from_records` iterates over the list of pure Python data and converts each value to the pandas format.
453 | 
454 | 
455 | Introduction of Arrow -
456 | - Once data is in the Arrow format, there is no need for pickling/serialization, as Arrow data can be sent directly to Python.
457 | - `PyArrow` in Python utilizes zero-copy methods to create a `pandas.DataFrame` from entire chunks of data instead of processing individual scalar values.
458 | - Additionally, the conversion to Arrow data can be done on the JVM and pushed back for the Spark executors to perform in parallel, drastically reducing the load on the driver.
459 | 
460 | The use of Arrow when calling `toPandas()` needs to be enabled by setting `spark.sql.execution.arrow.enabled` (or `spark.sql.execution.arrow.pyspark.enabled` in Spark 3.x) to `true`.
461 | 
462 | [Pandas UDF - Microsoft](https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/udf-python-pandas)
463 | 
464 | ------
465 | 
466 | 66. What are the most important configuration parameters in Spark?
467 | 
468 | `spark.default.parallelism` - Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by the user. A short sketch of setting such parameters follows below.
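A minimal sketch of how such parameters are typically set when building a session - the values here are placeholders, not recommendations, and `spark.sql.shuffle.partitions` and `spark.executor.memory` are added only as further commonly tuned examples, not parameters named in the original answer.

```python
from pyspark.sql import SparkSession

# Illustrative values only - the right numbers depend on the cluster and the data.
spark = (SparkSession.builder
         .appName("config-demo")
         .config("spark.default.parallelism", "200")     # default partition count for RDD shuffles
         .config("spark.sql.shuffle.partitions", "200")  # partition count for DataFrame/SQL shuffles
         .config("spark.executor.memory", "4g")          # heap size per executor
         .getOrCreate())

print(spark.conf.get("spark.sql.shuffle.partitions"))
```

The same keys can equally be passed with `--conf` on `spark-submit` or set in `spark-defaults.conf`.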
469 | 
470 | --------------------------
471 | ### Random Video Resources
472 | 
473 | 📹 [Spark Out Of Memory Issues](https://youtu.be/FdT5o7M35kU) - Data Savvy
474 | 
475 | 📹 [Decide number of Executors and Memory](https://www.youtube.com/watch?v=V9E-bWarMNw) - Data Savvy
476 | 
477 | 📹 [Big Data Architecture Patterns](https://youtu.be/-N9i-YXoQBE) - Eddie Satterly
478 | 
479 | 📹 [Threat Detection and Response at Scale](https://vimeo.com/274267634?embedded=true&source=video_title&owner=40921445) - Dominique Brezinski and Michael Armbrust
480 | 
481 | ------
482 | 
483 | ### Random Reading Resources
484 | 
485 | 📝 [How partitions are calculated when reading a file in Spark](https://dzone.com/articles/guide-to-partitions-calculation-for-processing-dat)
486 | 
487 | 📝 [Why the number of tasks can be much larger than the number of row groups](http://cloudsqale.com/2021/03/19/spark-reading-parquet-why-the-number-of-tasks-can-be-much-larger-than-the-number-of-row-groups/)
488 | 
489 | 📝 [Processing Petabytes of Data in Seconds with Databricks Delta](https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html)
490 | 
491 | ------
492 | 
493 | ### Performance Recommendations -
494 | 
495 | - One executor per node is considered more stable than the two or three executors per node used in systems like YARN. ([Databricks Guideline](https://docs.databricks.com/clusters/configure.html#worker-node-1))
496 | - Try to group wide transformations together for the best automatic optimization.
497 | - [Spark Performance Optimization - IBM Developer Blog](https://developer.ibm.com/blogs/spark-performance-optimization-guidelines/)
498 | - [Spark Tips, Partition Tuning - Luminousmen](https://luminousmen.com/post/spark-tips-partition-tuning)
499 | - [Cluster Configuration Best Practices - Databricks](https://docs.databricks.com/clusters/cluster-config-best-practices.html)
500 | - [Delta Lake Best Practices - Databricks](https://docs.databricks.com/delta/best-practices.html)
501 | - [Spark troubleshooting challenges - Unravel](https://www.unraveldata.com/resources/spark-troubleshooting-part-1-ten-challenges/)
502 | - [Why memory management is causing your spark apps to be slow or fail - Unravel](https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/)
503 | - [Why data skew and garbage collection cause spark apps to be slow or fail - Unravel](https://www.unraveldata.com/resources/spark-troubleshooting-part-1-ten-challenges/)
504 | - [Configuring memory and CPU options - IBM](https://www.ibm.com/docs/en/zpfas/1.1.0?topic=spark-configuring-memory-cpu-options)
505 | - [Optimize performance with file management - Databricks](https://docs.databricks.com/delta/optimizations/file-mgmt.html)
506 | 
507 | ![Partition Guideline](static/partition_guide.png)
508 | 
509 | 
--------------------------------------------------------------------------------