├── static ├── DPP.jpg ├── datalake.jpg ├── deltalake.jpg ├── skew_join.jpg ├── AQEPlanning.jpg ├── dl_elements.png ├── partition_guide.png ├── columnar_storage.png ├── Wide-Transformation.png ├── partition_coalesce.jpg ├── query_optimization.png ├── Narrow-Transformation.png └── Catalyst-Optimizer-diagram.png ├── Contributions.md ├── advanced ├── streaming.md ├── new_in_spark_3.md ├── joins.md ├── deltalake.md └── optimizations.md └── README.md /static/DPP.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/DPP.jpg -------------------------------------------------------------------------------- /static/datalake.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/datalake.jpg -------------------------------------------------------------------------------- /static/deltalake.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/deltalake.jpg -------------------------------------------------------------------------------- /static/skew_join.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/skew_join.jpg -------------------------------------------------------------------------------- /static/AQEPlanning.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/AQEPlanning.jpg -------------------------------------------------------------------------------- /static/dl_elements.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/dl_elements.png -------------------------------------------------------------------------------- /static/partition_guide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/partition_guide.png -------------------------------------------------------------------------------- /static/columnar_storage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/columnar_storage.png -------------------------------------------------------------------------------- /static/Wide-Transformation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/Wide-Transformation.png -------------------------------------------------------------------------------- /static/partition_coalesce.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/partition_coalesce.jpg -------------------------------------------------------------------------------- /static/query_optimization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/query_optimization.png -------------------------------------------------------------------------------- /static/Narrow-Transformation.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/Narrow-Transformation.png -------------------------------------------------------------------------------- /static/Catalyst-Optimizer-diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ankurchavda/SparkLearning/HEAD/static/Catalyst-Optimizer-diagram.png -------------------------------------------------------------------------------- /Contributions.md: -------------------------------------------------------------------------------- 1 | ### You are welcome to raise a PR - 2 | - If you want to add more information/topics to this resource. 3 | - If you find any grammatical errors or outdated information. 4 | - If you want to improve the indentation or readability. 5 | - If any of the answers seem incomplete or incorrect. Please add to it and provide an explanation on what was missing or incorrect in the PR comments. 6 | -------------------------------------------------------------------------------- /advanced/streaming.md: -------------------------------------------------------------------------------- 1 | ## Spark Streaming 2 | 3 | Stream processing is the act of continuously incorporating new data to calculate results. The input data is unbounded and has no predetermined beginning or end. 4 | 5 | Examples- 6 | 7 | - Credit card transactions 8 | - IOT device data 9 | - Click stream 10 | 11 | Any business logic in your application should give same results for streaming and batch applications. -------------------------------------------------------------------------------- /advanced/new_in_spark_3.md: -------------------------------------------------------------------------------- 1 | # What's new in Spark 3? 2 | 3 | ### Adaptive Query Execution 4 | 5 | In Spark 2, the logical and physical optimizations were rule based optimizations. While they improve the performance, they are all based on the estimates and statistics that are generated before runtime. There may be unanticipated problems or tuning opportunities that appear as the query runs. 6 | 7 | ![Catalyst Optimizer Diagram](../static/Catalyst-Optimizer-diagram.png) 8 | 9 | Adaptive Query Execution allows Spark to re-optimize and adjust query plans based on runtime statistics collected during query execution. 10 | 11 | ![Image showing the catalyst optimizer with adaptive planning.](../static/AQEPlanning.jpg) 12 | 13 | When AQE is on, spark will feed back statistics about the size of the data in the shuffle files, so that for the next stage, when working out the logical plan, it can dynamically switch join strategies, coalesce number of shuffle partitions or optimize skew joins. 14 | 15 | #### Switch Join Strategies 16 | 17 | Existing rule-based optimizations include planning a broadcast hash join if the estimated size of a join relation is lower than the broadcast-size threshold. It relies on a estimation of data based on the file size. A number of things can make the estimation go wrong - 18 | 19 | - Presence of a very selective filter 20 | - Join relation being a series of complex operators other than just a scan 21 | 22 | To solve this problem, AQE now replans the join strategy at runtime based on the most accurate join relation size. 
So if the estimated file size was 20MB and the actual file size turns out to be 8MB after the scan, then AQE will dynamically switch the join strategy from Sort-Merge Join to Broadcast-Hash Join. 23 | 24 | #### Coalesce Shuffle Partitions 25 | 26 | Tuning shuffle partitions in Spark is a common pain point. The best number of partitions depend on the data size, but the data sizes may differ vastly from stage to stage so this number can be hard to tune. 27 | 28 | - If there are too few partitions - 29 | - then the data size of each partition may be very large, and the tasks to process these large partitions may need to spill the data to disk and as a result slow down the query. 30 | - If there are too many partitions - 31 | - then the data size of each partition may be very small, leading to too many network data fetches to read the shuffle blocks. Which will also slow down the query because of inefficient I/O pattern. Having a large number of tasks also puts more burden on the task scheduler. 32 | 33 | To solve this problem, we can set a relatively large number of shuffle partitions at the beginning, then combine the adjacent small partitions into bigger partitions at runtime by looking at the shuffle files statistics. 34 | 35 | For example, a small dataset of two partition is involved in a group by operation. The shuffle partition is set to 5, which leads to 5 partitions during the shuffle operation. With AQE, the other three smaller partitions are coalesced into 1 larger partition, as a result, the final aggregation now only needs to perform three tasks rather than five. 36 | 37 | ![Partition Coalesce](../static/partition_coalesce.jpg) 38 | 39 | #### Optimize Skew Joins 40 | 41 | Data skew occurs when data is unevenly distributed among partitions in the cluster. AQE detects such a skew automatically from shuffle file statistics. It then splits the skewed partitions into smaller subpartitions, which will be joined to the corresponding partition from the other side respectively. 42 | 43 | In the below example, AQE splits the A0 partition into two smaller partitions and joins the with B0. This leads to 5 similar sized tasks that complete nearly at the same time versus one outlier task that takes much more time than the other tasks. 44 | 45 | ![Skew Join](../static/skew_join.jpg) 46 | 47 | ### Dynamic Partition Pruning 48 | 49 | [Dynamic Partition Pruning - Data Savvy Youtube](https://youtu.be/rwUgZP-EBZw) 50 | 51 | ![Dynamic Partition Pruning](../static/DPP.jpg) 52 | 53 | -------------------------------------------------------------------------------- /advanced/joins.md: -------------------------------------------------------------------------------- 1 | ## Communication Strategies 2 | 3 | ### Big table-to-big table 4 | - Joining a big table with another big table leads to shuffle join. 5 | - In shuffle join, every node talks with every other node and they share data according to which node has a certain key or a set of keys. 6 | - If data is not partitioned well network can become congested with traffic. 7 | 8 | #### Shuffle Sort Merge Join 9 | 10 | As the name indicates, this join scheme has two phases: a sort phase followed by a merge phase. The sort phase sorts each data set by its desired join key; the merge phase iterates over each key in the row from each data set and merges the rows if the two keys match. 
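To make the mechanics concrete, here is a minimal PySpark sketch (the datasets and column names are made up for illustration) that disables the automatic broadcast threshold so the equi-join falls back to a shuffle sort merge join; `explain()` then shows the `Exchange`, `Sort`, and `SortMergeJoin` steps discussed in the next section.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smj-demo").getOrCreate()

# Disable auto-broadcast so the equi-join below uses a shuffle sort merge join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Two synthetic "large" tables sharing the join key `uid`
users = spark.range(1_000_000).withColumnRenamed("id", "uid")
orders = spark.range(5_000_000).selectExpr("id % 1000000 AS uid", "id AS order_id")

joined = users.join(orders, "uid")

# Physical plan: both sides are shuffled (Exchange), sorted, then merged
joined.explain()
```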
11 | 12 | #### Optimizing the shuffle sort merge join 13 | 14 | We can eliminate the `Exchange` step from this scheme if we create partitioned buckets for common sorted keys or columns on which we want to perform frequent equi-joins. That is, we can create an explicit number of buckets to store specific sorted columns (one key per bucket). Presorting and reorganizing data in this way boosts performance, as it allows us to skip the expensive `Exchange` operation and go straight to `WholeStageCodegen`. 15 | 16 | ```scala 17 | // Save as managed tables by bucketing them in Parquet format 18 | usersDF.orderBy(asc("uid")) 19 | .write.format("parquet") 20 | .bucketBy(8, "uid") 21 | .mode(SaveMode.Overwrite) 22 | .saveAsTable("UsersTbl") 23 | 24 | ordersDF.orderBy(asc("users_id")) 25 | .write.format("parquet") 26 | .bucketBy(8, "users_id") 27 | .mode(SaveMode.Overwrite) 28 | .saveAsTable("OrdersTbl") 29 | ``` 30 | 31 | Bucketing the data by user ID helps skip the expensive `Exchange` step, as there is no need to sort at join time. 32 | 33 | #### When to use a shuffle sort merge join 34 | 35 | Use this type of join under the following conditions for maximum benefit: 36 | 37 | - When each key within two large data sets can be sorted and hashed to the same partition by Spark 38 | - When you want to perform only equi-joins to combine two data sets based on matching sorted keys 39 | - When you want to prevent Exchange and Sort operations to save large shuffles across the network 40 | 41 | ### Big table-to-small table 42 | - The small table should be small enough to fit into the memory of a worker node with some breathing room. 43 | - Replicate our small dataframe onto every worker node. 44 | - Prevents the all-to-all communication described earlier. 45 | - Instead we perform it only once at the beginning and let each individual node perform the work. 46 | - The driver first collects the data from the executors and then broadcasts it back to the executors. Hence the driver should be large enough to handle a collect operation. 47 | 48 | #### When to use a broadcast hash join 49 | 50 | Use this type of join under the following conditions for maximum benefit: 51 | 52 | - When each key within the smaller and larger data sets is hashed to the same partition 53 | by Spark 54 | - When one data set is much smaller than the other (and within the default config 55 | of 10 MB, or more if you have sufficient memory) 56 | - When you only want to perform an equi-join, to combine two data sets based on 57 | matching unsorted keys 58 | - When you are not worried by excessive network bandwidth usage or OOM 59 | errors, because the smaller data set will be broadcast to all Spark executors 60 | 61 | | Broadcast Join | Shuffle Join | 62 | |:---------------------------------------------------:|:--------------------------------------------:| 63 | | Avoids shuffling the larger side | Shuffles both sides | 64 | | Naturally handles data skew | Can suffer from data skew | 65 | | Cheap for selective joins | Can produce unnecessary intermediate results | 66 | | Broadcasted data needs to fit into memory | Data can be spilled and read from disk | 67 | | Cannot be used for certain joins, e.g.
Full outer join | Can be used for all joins | -------------------------------------------------------------------------------- /advanced/deltalake.md: -------------------------------------------------------------------------------- 1 | ## Delta Lake 2 | 3 | Delta Lake is an open-format storage layer that puts standards in place in an organization's data lake to provide data structure and governance. It is a storage solution specifically designed to work with Apache Spark. 4 | 5 | Data stored in Delta Lake is ACID compliant. 6 | 7 | ### Shortcomings Of A Data Lake 8 | 9 | #### Data Lakes Offer - 10 | 11 | - Flexibility in data storage 12 | - Relatively cheaper compared to a warehouse 13 | - Can store structured, semi-structured or unstructured data 14 | 15 | ![Data Lake](../static/datalake.jpg) 16 | 17 | #### Challenges with Data Lakes - 18 | 19 | - Long storage periods & variety of data can convert it into a data swamp 20 | - Data swamps are data lakes that are difficult to manage and navigate 21 | - Reading and writing data is not reliable 22 | - Need to build workarounds to ensure readers always see consistent data during writes 23 | - Jobs often fail midway 24 | - Difficult to recover the correct state of the data 25 | - Countless hours are spent troubleshooting failed jobs and getting a valid state of the data 26 | - Modification of existing data is difficult 27 | - Lakes are meant for data written once and read many times 28 | - The concept of modifying or deleting data wasn’t of central importance at the time they were designed 29 | - Keeping historical versions of data is costly 30 | - Can become expensive and difficult with large-scale data 31 | - It is difficult to handle large metadata 32 | - Reading metadata can often create substantial delays in processing and overhead costs 33 | - Too many files cause problems 34 | - Query performance suffers greatly when a small amount of data is spread over too many files 35 | - It is hard to get great performance 36 | - Tuning jobs is a difficult task 37 | - Great performance should be easy to achieve 38 | - Data quality issues affect analysis results 39 | - No built-in quality checks like Data Warehouses 40 | - Expensive, long-running queries may fail or produce meaningless results 41 | 42 | #### How does Delta Lake resolve the above issues?
43 | 44 | Acid Transactions 45 | 46 | - Each transaction has a distinct beginning and end 47 | - Appending data is easy and each new write will create a new version of the table 48 | - New data won't be read until the transaction is complete 49 | - Jobs that fail midway can be discarded entirely 50 | - Many changes can be applied to the data in a single transaction, eliminating the possibility of incomplete deletes or updates 51 | 52 | Schema Management 53 | 54 | - Specify and enforce schema 55 | - Schema validation during writes 56 | - Throws an exception if extra data is added 57 | - Can make changes to tables schema 58 | 59 | Scalable Metadata Handling 60 | 61 | - Metadata is processed just like regular data - with distributed processing 62 | 63 | Unified Batch and Streaming data 64 | 65 | - Supports both streaming and batch processes 66 | - Each micro-batch transaction creates a new version of a table 67 | 68 | Data Versioning and Time Travel 69 | 70 | - A maintained version of historical data 71 | - Databricks uses Spark to scan the transaction logs for efficient processing 72 | 73 | ### Why Delta Lake 74 | 75 | By bringing the structure and governance inherent to data warehouses to data lakes with Delta Lake, you create the foundation for a Lakehouse. 76 | 77 | ![Delta Lake](../static/deltalake.jpg) 78 | 79 | #### Delta Lake Features - 80 | 81 | - ACID transactions on Spark 82 | - Scalable metadata handling 83 | - Streaming and batch unification 84 | - Schema enforcement 85 | - Time travel 86 | - Upserts and deletes 87 | - Fully configurable/optimizable 88 | - Structured streaming support 89 | 90 | #### Delta Lake Storage Layer - 91 | 92 | - Highly performant and persistent 93 | - Low-cost, easily scalable object storage 94 | - Ensures consistency 95 | - Allows for flexibility 96 | 97 | #### Delta Tables - 98 | 99 | - Contain data in Parquet files that are kept in object storage 100 | - Keep transaction logs in object storage 101 | - Can be registered in a metastore 102 | 103 | ### Elements of Delta Lake 104 | 105 | #### Delta Files 106 | 107 | - Uses parquet files to store a customer's data 108 | - Provide an additional layer over parquet files 109 | - data versioning and metadata 110 | - stores transaction logs 111 | - provides ACID transactions 112 | 113 | #### Delta Tables 114 | 115 | A delta table is a collection of data kept using the Delta Lake technology and consists - 116 | 117 | - Delta files containing the data and kept in object storage 118 | - A Delta table registered in a Metastore 119 | - The delta transaction log saved with Delta files in object storage 120 | 121 | #### Delta Optimization Engine 122 | 123 | - Delta Engine is a high-performance query engine that provides an efficient way to process data in data lakes. 124 | 125 | #### **Delta Lake Storage Layer** 126 | 127 | #### What is the Delta transaction log? 128 | 129 | - Ordered record of the transactions performed on a Delta table 130 | - Single source of truth for that table 131 | - Mechanism that the Delta Engine uses to guarantee atomicity -------------------------------------------------------------------------------- /advanced/optimizations.md: -------------------------------------------------------------------------------- 1 | ## Optimizing and Tuning Spark Applications: 2 | 3 | ### Static vs. 
Dynamic resource allocation - 4 | - One can use the `spark.dynamicAllocation.enabled` property to enable dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload. Example use cases would be streaming data, or on-demand analytics where more is asked of the application during peak hours. In a multi-tenant environment, Spark may soak up resources from other applications. 5 | 6 | ```python 7 | spark.dynamicAllocation.enabled true 8 | spark.dynamicAllocation.minExecutors 2 9 | spark.dynamicAllocation.schedulerBacklogTimeout 1m 10 | spark.dynamicAllocation.maxExecutors 20 11 | spark.dynamicAllocation.executorIdleTimeout 2min 12 | ``` 13 | - Request two executors to be created at start. 14 | - Whenever there are pending tasks that have not been scheduled for over 1 minute, the driver will request a new executor, up to a maximum of 20. 15 | - If an executor is idle for 2 minutes, the driver will terminate it. 16 | 17 | 18 | ### Configuring Spark executors' memory and shuffle service- 19 | 20 | - The amount of memory available to each executor is controlled by `spark.executor.memory`. 21 | 22 | - Executor memory is divided into three sections 23 | - Execution Memory 24 | - Storage Memory 25 | - Reserved Memory 26 | 27 | - The default division is 60% for execution and 40% for storage, after allowing for 300 MB of reserved memory to safeguard against OOM errors. 28 | 29 | - Execution memory is used for shuffles, joins, sorts and aggregations. 30 | 31 | - Storage memory is primarily used for caching user data structures and partitions derived from DataFrames. 32 | 33 | - During map and shuffle operations, Spark writes to and reads from the local disk's shuffle files, so there is heavy I/O activity. 34 | 35 | - The following configurations can be used during heavy workloads to reduce I/O bottlenecks. 36 | ```python 37 | spark.driver.memory 38 | spark.shuffle.file.buffer 39 | spark.file.transferTo 40 | spark.shuffle.unsafe.file.output.buffer 41 | spark.io.compression.lz4.blockSize 42 | spark.shuffle.service.index.cache.size 43 | spark.shuffle.registration.timeout 44 | spark.shuffle.registration.maxAttempts 45 | ``` 46 | 47 | ### Spark Parallelism - 48 | - To optimize resource utilization and maximize parallelism, the ideal is to have at least as many partitions as there are cores on the executor. 49 | - How partitions are created - 50 | - Data on disk is laid out in chunks or contiguous file blocks 51 | - Default size is 128 MB in HDFS and S3. A contiguous collection of these blocks is a partition. 52 | - The size of a partition in Spark is dictated by `spark.sql.files.maxPartitionBytes`, which is 128 MB by default. 53 | - Decreasing the partition file size too much may cause the "small file problem", increasing disk I/O and degrading performance. 54 | - Shuffle Partitions 55 | - For smaller workloads, the shuffle partitions should be reduced from the default of 200 to the number of cores or executors, or less. 56 | - During shuffle operations, Spark will spill results to executors' local disks at the location specified in `spark.local.dir`. Having performant SSDs for this operation will boost performance. 57 | 58 | - When writing, use the `maxRecordsPerFile` option to control how many records go into each partition file. This helps mitigate the small-file or very-large-file problems. 59 | 60 | ### Caching and Persistence of Data - 61 | 62 | `Dataframe.cache()` 63 | - cache() will store as many partitions as memory allows.
64 | - Dataframes can be fractionally cached, but a partition cannot. 65 | - Note: A dataframe is not fully cached until you invoke an action that goes through all the records (e.g. count). If you use take(1), only one partition will be cached, because Catalyst realizes that you do not need to compute all the partitions just to retrieve one record. 66 | 67 | `Dataframe.persist()` 68 | - persist(StorageLevel.level) provides control over how your data is cached via StorageLevel. Data on disk is always serialized using either Java or Kryo serialization. 69 | - MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, MEMORY_AND_DISK_SER are different persist levels one can use. 70 | 71 | #### When to Cache and Persist- 72 | 73 | - When you want to access a large dataset repeatedly for queries and transformations. 74 | 75 | #### When not to Cache and Persist- 76 | 77 | - Dataframes are too big to cache in memory 78 | - An inexpensive transformation on a dataframe that does not require frequent use, regardless of size. 79 | 80 | ### Statistics Collection - 81 | 82 | - The cost-based query optimizer can make use of statistics for named tables, but not for arbitrary dataframes or RDDs, to make optimization decisions. 83 | - The statistics should be collected and maintained. 84 | 85 | #### Table Level 86 | 87 | ```SQL 88 | ANALYZE TABLE table_name COMPUTE STATISTICS 89 | ``` 90 | 91 | #### Column Level 92 | 93 | ```SQL 94 | ANALYZE TABLE table_name COMPUTE STATISTICS FOR 95 | COLUMNS column_name1, column_name2, ... 96 | ``` 97 | 98 | Column-level statistics are slower to collect, but provide more information for the cost-based optimizer to use about those data columns. Both types of statistics can help with joins, aggregations, filters, and a number of other potential things (e.g., automatically choosing when to do a broadcast join). 99 | 100 | ### Spark Joins - 101 | 102 | #### Broadcast Hash Join- 103 | 104 | - Also known as a map-side-only join 105 | - By default, Spark uses a broadcast join if the smaller data set is less than 10 MB. 106 | - When to use a broadcast hash join - 107 | - When each key within the smaller and larger data sets is hashed to the same partition by Spark. 108 | - When one data set is much smaller than the other. 109 | - When you are not worried by excessive network bandwidth usage or OOM errors, because the smaller data set will be broadcast to all Spark executors. 110 | 111 | #### Shuffle Sort Merge Join- 112 | 113 | - Performed over a common key that is sortable, unique and can be stored in the same partition. 114 | - The sort phase sorts each data set by its desired join key. The merge phase iterates over the keys from each dataset and merges the rows if the two keys match. 115 | - Optimizing the shuffle sort merge join - 116 | - Create partitioned buckets for common sorted keys or columns on which we want to perform frequent equi-joins. 117 | - For a column with high cardinality, use bucketing; otherwise use partitioning. 118 | - When to use a shuffle sort merge join - 119 | - When each key within two large data sets can be sorted and hashed to the same partition by Spark. 120 | - When you want to perform only equi-joins to combine two data sets based on matching sorted keys. 121 | - When you want to prevent Exchange and Sort operations to save large shuffles across the network.
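As a small illustration of the broadcast hash join described above (the table and column names are hypothetical), the smaller side can be broadcast explicitly with a hint; Spark also does this automatically when the estimated size of the smaller side is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# A large fact table and a small dimension table (synthetic data for illustration)
orders = spark.range(10_000_000).selectExpr("id AS order_id", "id % 100 AS country_id")
countries = spark.range(100).selectExpr("id AS country_id",
                                        "concat('country_', cast(id AS string)) AS name")

# Broadcast the small side so the large side is never shuffled
joined = orders.join(broadcast(countries), "country_id")

# The physical plan should show BroadcastHashJoin instead of SortMergeJoin
joined.explain()
```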
122 | 123 | ## Notes from the video 124 | 125 | [Fine Tuning and Enhancing Performance of Apache Spark Jobs](https://youtu.be/WSplTjBKijU) 126 | Some general points to note- 127 | 128 | - Increasing executor memory will increase garbage collection time 129 | - Adding more CPUs can lead to scheduling issues as well as additional shuffles 130 | 131 | ### Skew 132 | 133 | #### How to check? 134 | 135 | - The Spark UI shows the job waiting on only some of the tasks 136 | - An executor missing a heartbeat 137 | - Check the partition sizes of RDDs (row count) while debugging to confirm 138 | - Check the partition sizes in the Spark logs 139 | 140 | #### How to handle? 141 | 142 | - A fix at ingestion time goes a long way 143 | - JDBC 144 | - When reading from an RDBMS source using JDBC connectors, use the options for partitioned reads. The default fetch size depends on the database that you are reading from 145 | - The partition column should be numeric and relatively evenly distributed 146 | - If no such column is present, you should create one using mod or hash functions 147 | - Already partitioned data (S3 etc.) 148 | - Read the data and repartition if needed 149 | 150 | ### Cache/Persist 151 | 152 | - Unpersist when done to free up memory for garbage collection 153 | - For self joins, cache the table to avoid reading and deserializing the same data twice 154 | - Don't over-persist, it can lead to - 155 | - Increased spill to disk 156 | - Slow garbage collection 157 | 158 | ### Avoid UDFs 159 | 160 | - UDFs have to deserialize every row to an object 161 | - Then apply the lambda function 162 | - And then reserialize it 163 | - This leads to increased garbage collection 164 | 165 | ### Join Optimizations 166 | 167 | - Filter trick 168 | - Get the keys from the medium table and filter the records from the large table before performing a join 169 | - Dynamic partition pruning 170 | - Salting 171 | - Use the salting technique to reduce/eliminate skew on the joining keys 172 | - Salting will use up more memory 173 | 174 | ### Things to remember 175 | 176 | - Keep the largest dataframe on the left because Spark tries to shuffle the right dataframe first. A smaller right-hand dataframe will lead to less shuffling 177 | - Follow good partitioning strategies 178 | - Filter as early as possible 179 | - Try to use the same partitioner between DFs for joins 180 | 181 | ### Task Scheduling 182 | 183 | - Default scheduling is FIFO 184 | - Fair Scheduling 185 | - Allows scheduling longer tasks alongside smaller tasks 186 | - Better resource utilization 187 | - Harder to debug. Turn it off locally when debugging 188 | 189 | ### Serialization 190 | 191 | | Java | Kryo | 192 | | ------------------------ | ---------------------------------------------------- | 193 | | Default for most types | Default for shuffling RDDs and simple types like int | 194 | | Can work for any class | For serializable types | 195 | | More flexible but slower | Significantly faster and more compact | 196 | | | Set your SparkConf to use Kryo serialization | 197 | 198 | ### Garbage Collection 199 | 200 | #### How to check?
201 | 202 | - Check time spent on tasks vs GC on Spark UI 203 | - Check the memory used in server 204 | 205 | ## Databricks Delta 206 | 207 | [Optimize performance with file management](https://docs.databricks.com/delta/optimizations/file-mgmt.html) 208 | 209 | ### Compaction (Bin packing) 210 | 211 | - Improve speed of the queries by coalescing smaller files into larger files 212 | 213 | - You trigger compaction by running the following command 214 | 215 | ```bash 216 | OPTIMIZE events 217 | ``` 218 | 219 | - Bin packing optimization is idempotent 220 | 221 | - Evenly balanced in size but not necessarily in terms of records. The two measures are more often correlated 222 | 223 | - Returns min, max, total and so on for the files removed and added 224 | 225 | - Also returns Z-Ordering statistic 226 | 227 | ### Data Skipping 228 | 229 | - Works for any comparison of nature `column 'op' literal` where op could be `>`,`<`,`=`,`like`, `and`, `or` etc. 230 | - By default, generates stats for only the first 32 columns in the data. For more columns, reordering is required. Or the threshold limit can be changed 231 | - Collecting stats on long string is expensive. One can skip long string like columns using `delta.dataSkippingNumIndexedCols` 232 | 233 | ### Z-Ordering (multi-dimensional clustering) 234 | 235 | - Colocates related information in same set of files 236 | - Use for high cardinality columns 237 | - Effectiveness drops with each additional column 238 | - Columns that do not have statistics collected on them, would be ineffective as it requires stats like min, max, count etc. for data skipping 239 | - Z-ordering is not idempotent 240 | - Evenly balanced files in terms of records but not necessarily size. The two measures are more often correlated 241 | 242 | For running`OPTIMIZE` (bin-packing or z-ordering) compute intensive machines like c5d series is recommended as both operations will be doing large amounts of Parquet decoding and encoding -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Spark Learning Guide 2 | 3 | ### This material has been created using multiple sources from the internet like Databricks blogs and courses, official docs, Stack Overflow, Learning Spark 2.0 and The Definitive Guide. 4 | 5 | #### You can use this guide to learn about different components of Spark and as a reference material. This section covers all the topics that should be enough for you to get started with Spark Theory. 6 | 7 | #### You can refer to the advanced topics here - 8 | 9 | - [Optimization Techniques](advanced/optimizations.md) 10 | - [Joins Internal Working](advanced/joins.md) 11 | - [Delta Lake](advanced/deltalake.md) 12 | - [Spark 3.0](advanced/new_in_spark_3.md) 13 | -------------------------- 14 | 1. What is Spark? 15 | Apache Spark is a cluster computing platform designed to be fast and general-purpose. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines or a computing cluster. 16 | -------------------------- 17 | 2. What is a Spark Core? 18 | Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. 
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections. 19 | -------------------------- 20 | 3. Key features of Spark - 21 | - Spark can run over multiple file systems. 22 | - Multiple software systems need not run to achieve a single task because spark provides a lot of capabilities under the hood. A single application can leverage streaming, ML and Spark SQL capabilities of spark. 23 | - Spark has the philosophy of tight integration where each component is designed to interoperate closely. Hence any improvements at the lower level improve all the libraries running over it. 24 | - Spark offers in-memory computations 25 | -------------------------- 26 | 4. Major libraries that constitute the Spark Ecosystem - 27 | Spark MLib- Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc. 28 | Spark Streaming – This library is used to process real-time streaming data. 29 | Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc. 30 | Spark SQL – Helps execute SQL like queries on Spark data using standard visualization or BI tools. 31 | -------------------------- 32 | 5. What is an RDD? 33 | The main abstraction Spark provides is a *resilient distributed dataset* (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 34 | 35 | An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any data source, e.g. text files, a database via JDBC, etc. 36 | 37 | Definition - *RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.* 38 | 39 | 6. How are RDDs created? 40 | Spark provides two ways to create RDDs: 41 | 42 | - loading an external dataset 43 | - parallelizing a collection in your driver program. 44 | -------------------------- 45 | 7. What is a partition? 46 | A partition is a logical or small chunk of a large distributed data set. It provides the possibility to distribute the work across the cluster, divide the task into smaller parts, and reduce memory requirements for each node. **Partition is the main unit of parallelism in Apache Spark**. 47 | -------------------------- 48 | 8. How is an RDD fault-tolerant? 49 | When a set of operations happen on an RDD the spark engine views these operations as a DAG. If a node processing the RDD crashes and was performing operations X->Y->Z on the RDD and failed at Z, then the resource manager assigns a new node for the operation and the processing begins from X again using the directed graph. 50 | -------------------------- 51 | 9. Why are RDDs immutable? 52 | Immutability rules out a big set of potential problems due to updates from multiple threads at once. Immutable data is safe to share across processes. 53 | They're not just immutable but a deterministic function (a function that returns the same result with the same input) of their input. This plus immutability also means the RDD's parts can be recreated at any time. 
This makes caching, sharing and replication easy. 54 | These are significant design wins, at the cost of having to copy data rather than mutate it in place. Generally, that's a decent tradeoff to make: gaining the fault tolerance and correctness with no developer effort is worth spending memory and CPU on since the latter are cheap. 55 | A Corollary: immutable data can as easily live in memory as on disk. This makes it reasonable to easily move operations that hit the disk to instead use data in memory, and again, adding memory is much easier than adding I/O bandwidth. 56 | -------------------------- 57 | 10. What are Transformations? 58 | Spark Transformations are functions that produce a new RDD from an existing RDD. An RDD Lineage is built when we apply Transformations on an RDD. Basic Transformations are - map and filter. After the transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter, count, distinct, sample), bigger (e.g. `flatMap(), union(), Cartesian()`) or the same size (e.g. map). 59 | - Narrow dependency : RDD operations like `map(), union(), filter()` can operate on a single partition and map the data of that partition to the resulting single partition. These kinds of operations that map data from one to one partition are referred to as Narrow operations. Narrow operations don’t require distributing the data across the partitions. Each partition of the parent RDD is used by at most one partition of the child RDD. 60 | 61 | ![narrow](static/Narrow-Transformation.png) 62 | 63 | - Wide dependency : RDD operations like `groupByKey, distinct, join` may require mapping the data across the partitions in the new RDD. These kinds of operations which maps data from one to many partitions are referred to as Wide operations 64 | Each partition of the parent RDD may be depended on by multiple child partitions. 65 | 66 | ![wide](static/Wide-Transformation.png) 67 | -------------------------- 68 | 11. What are Actions? 69 | Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. In other words, an RDD operation that returns a value of any type but `RDD[T]` is an action. They trigger the execution of RDD transformations to return values. Simply put, an action evaluates the RDD lineage graph. 70 | Actions are one of two ways to send data from executors to the driver (the other being accumulators). 71 | Some examples of actions are - `aggregate, collect, count, countApprox, countByValue, first, fold, foreach, foreachPartition, max, min, reduce, saveAs* actions, saveAsTextFile, saveAsHadoopFile, take, takeOrdered, takeSample, toLocalIterator, top, treeAggregate, treeReduce` 72 | -------------------------- 73 | [Anatomy of Spark Application - Luminousmen](https://luminousmen.com/post/spark-anatomy-of-spark-application) 74 | 75 | 12. What is a driver? 76 | The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and analyzing, distributing, and scheduling work across the executors (defined momentarily). 77 | 78 | ​ In a single databricks cluster, there will only be one driver irrespective of the number of executors. 79 | 80 | - Prepares Spark Context 81 | - Declares operations on the RDD using Transformations and Actions. 82 | - Submits serialized RDD graph to master. 
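As a minimal sketch of a driver program (the application name and data here are arbitrary), the snippet below prepares the SparkSession/SparkContext, declares transformations that only build up the RDD lineage, and triggers actual execution on the executors with an action:

```python
from pyspark.sql import SparkSession

# This code runs in the driver process
spark = SparkSession.builder.appName("driver-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 101))        # RDD created from a local collection
evens = rdd.filter(lambda x: x % 2 == 0)   # narrow transformation, nothing runs yet
doubled = evens.map(lambda x: x * 2)       # still only building the lineage (DAG)

print(doubled.count())                     # action: the driver schedules tasks on executors
```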
83 | [Spark Driver - Stackoverflow](https://stackoverflow.com/a/24638280) 84 | -------------------------- 85 | 13. What is a Task? 86 | A task is a unit of work that can be run on a partition of a distributed dataset and gets executed on a single executor. The unit of parallel execution is at the task level. All the tasks within a single stage can be executed in parallel. 87 | -------------------------- 88 | 14. What is a Stage? 89 | 90 | A stage is a collection of tasks that can run in parallel. A new stage is created when there is data shuffling. 91 | -------------------------- 92 | 15. What is a Core? 93 | A core is a basic computation unit of a CPU and a CPU may have one or more cores to perform tasks at a given time. The more cores we have, the more work we can do. In spark, this controls the number of parallel tasks an executor can run. 94 | -------------------------- 95 | 16. What is Hadoop, Hive, Hbase? 96 | Hadoop is basically 2 things: a Distributed FileSystem (HDFS) + a Computation or Processing framework (MapReduce). Like all other FS, HDFS also provides us with storage, but in a fault-tolerant manner with high throughput and lower risk of data loss (because of the replication). But, being an FS, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs. 97 | Hive: It provides us with data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them. 98 | -------------------------- 99 | 17. What is parquet? 100 | [Parquet and it's pros and cons - Stackoverflow](https://stackoverflow.com/a/36831549/8515731) 101 | 102 | ![Row Vs Columnar](static/columnar_storage.png) 103 | 104 | - The schema is stored in the footer of the file 105 | - Doesn't waste space storing missing value 106 | - Has predicate pushdown 107 | - Loads only required columns 108 | - Allows data skipping 109 | 110 | -------------------------- 111 | 18. What file systems do Spark support? 112 | - Hadoop Distributed File System (HDFS). 113 | - Local File system. 114 | - S3 115 | -------------------------- 116 | 19. What is a Cluster Manager? 117 | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). Spark is agnostic to a cluster manager as long as it can acquire executor processes and those can communicate with each other. We are primarily interested in Yarn as the cluster manager. A spark cluster with Yarn as a cluster/resource manager can run in either yarn cluster or yarn-client mode: 118 | yarn-client mode – A driver runs on client process, Application Master is only used for requesting resources from YARN. 119 | yarn-cluster mode – A driver runs inside the application master process, the client goes away once the application is initialized 120 | [Cluster Mode Overview - Spark Documentation](https://spark.apache.org/docs/latest/cluster-overview.html) 121 | -------------------------- 122 | 20. What is yarn? 
123 | [What is Yarn - Hadoop Documentation](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) 124 | A good guide to understand how Spark works with YARN - 125 | - [Spark on YARN: A Deep Dive - Sandy Ryza - Youtube](https://youtu.be/N6pJhxCPe-Y) 126 | - [Spark over Yarn - Stackoverflow](https://stackoverflow.com/questions/24909958/spark-on-yarn-concept-understanding) 127 | -------------------------- 128 | 21. What is MapReduce? 129 | [Introduction to MapReduce - Guru99](https://www.guru99.com/introduction-to-mapreduce.html) 130 | -------------------------- 131 | 22. Spark vs MapReduce? 132 | [Spark vs MapReduce - Medium @bradanderson](https://medium.com/@bradanderson.contacts/spark-vs-hadoop-mapreduce-c3b998285578) 133 | -------------------------- 134 | 23. What is an Executor? 135 | An executor is a single JVM process that is launched for an application on a worker node. Executor runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. A single node can run multiple executors and executors for an application can span across multiple worker nodes. An executor stays up for the duration of the Spark Application and runs the tasks in multiple threads. The number of executors for a spark application can be specified inside the SparkConf or via the flag –num-executors from the command line. 136 | - Executor performs all the data processing. 137 | - Reads from and writes data to external sources. 138 | - Executor stores the computed data in-memory, cache or on hard disk drives. 139 | - Interacts with the storage systems. 140 | -------------------------- 141 | 24. What are workers, executors, cores in the Spark Standalone cluster? 142 | 143 | A worker node hosts the executor process. It has a fixed number of executors allocated at any point in time. Each executor will hold the chunk of the data to be processed. This chunk is called a partition. 144 | 145 | Spark parallelizes at two levels - first with the executors and second with the cores allocated to the executors. 146 | 147 | [Workers, executors, cores - Stackoverflow](https://stackoverflow.com/questions/32621990/what-are-workers-executors-cores-in-spark-standalone-cluster) 148 | -------------------------- 149 | 25. Name types of Cluster Managers in Spark. 150 | The Spark framework supports four major types of Cluster Managers - 151 | - Standalone: a basic manager to set up a cluster. 152 | - Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications. 153 | - Yarn: responsible for resource management in Hadoop 154 | - Kubernetes: an open-source system for automating deployment, scaling, and management of containerized applications 155 | -------------------------- 156 | 26. How can you minimize data transfers when working with Spark? 157 | Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are: 158 | - Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs. 159 | - Using Accumulators – Accumulators help update the values of variables in parallel while executing. 160 | - The most common way is to avoid `ByKey` operations, repartition or any other operations which trigger shuffles. 161 | -------------------------- 162 | 27. What are broadcast variables? 
163 | Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication costs. 164 | Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in the deserialized form is important. 165 | Broadcast variables are created from a variable v by calling `SparkContext.broadcast(v)`. The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this: 166 | 167 | ```python 168 | >>> broadcastVar = sc.broadcast([1, 2, 3]) 169 | 170 | 171 | >>> broadcastVar.value 172 | [1, 2, 3] 173 | ``` 174 | After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later). 175 | -------------------------- 176 | 28. Why is there a need for broadcast variables when working with Apache Spark? 177 | These are read-only variables, present in-memory cache on every machine. When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup (). 178 | [Why use broadcast variables? - Stackoverflow](https://stackoverflow.com/questions/26884871/what-are-broadcast-variables-what-problems-do-they-solve) 179 | -------------------------- 180 | 29. What is a Closure? 181 | The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). The closure is serialized and sent to each executor. The variables within the closure sent to each executor are copies. 182 | [Closure Spark Documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#understanding-closures-a-nameclosureslinka) 183 | -------------------------- 184 | 30. What are Accumulators? 185 | Accumulators are variables that are "added" to through an associative and commutative "add" operation. They act as a container for accumulating partial values across multiple tasks (running on executors). They are designed to be used safely and efficiently in parallel and distributed Spark computations and are meant for distributed counters and sums. 186 | [Accumulators Spark Documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#accumulators) 187 | -------------------------- 188 | 31. How can you trigger automatic clean-ups in Spark to handle accumulated metadata? 
189 | You can trigger the clean-ups by setting the parameter `spark.cleaner.ttl` or by dividing the long-running jobs into different batches and writing the intermediary results to the disk. 190 | -------------------------- 191 | 32. What is the significance of the Sliding Window operation? 192 | In Spark Streaming, a sliding window controls how incoming data is grouped over time. The Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream. 193 | -------------------------- 194 | 33. What is a DStream? 195 | A Discretized Stream is a sequence of Resilient Distributed Datasets (RDDs) that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS and Apache Flume. DStreams have two operations – 196 | - Transformations that produce a new DStream. 197 | - Output operations that write data to an external system. 198 | -------------------------- 199 | 34. When running Spark applications, is it necessary to install Spark on all the nodes of a YARN cluster? 200 | Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster. 201 | -------------------------- 202 | 35. What is the Catalyst framework? 203 | 204 | ![Query Optimization](static/query_optimization.png) 205 | 206 | The Catalyst framework is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system. 207 | 208 | It goes through 4 transformational phases - 209 | - Analysis 210 | - Logical optimization 211 | - Physical planning 212 | - Code generation 213 | 214 | Phase 1: Analysis 215 | An abstract syntax tree is generated from the dataframe or query. 216 | The column names, datatypes, functions, databases etc. are resolved or validated by consulting an internal 'Metadata Catalog'. 217 | 218 | From this analysis, we get a logical plan. 219 | 220 | Phase 2: Logical Optimization 221 | The logical optimization phase applies standard rule-based optimizations to the logical plan. These include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and other rules. 222 | 223 | Phase 3: Physical Planning 224 | At this stage, the Catalyst optimizer generates one or more physical plans. These represent what the query engine will actually do. Each physical plan is evaluated according to a cost model and the best-performing plan is selected. 225 | 226 | Phase 4: Code generation 227 | Generation of efficient Java bytecode to run on each machine. 228 | Spark acts as a compiler, facilitated by Project Tungsten for whole-stage code generation. 229 | Whole-stage code generation is a physical query optimization that gets rid of virtual function calls and employs CPU registers for intermediate data. This generates a compact RDD for final execution. 230 | 231 | [Physical Plans In Spark SQL - Databricks Spark Summit - Part 1](https://youtu.be/99fYi2mopbs) 232 | 233 | [Physical Plans In Spark SQL - Databricks Spark Summit - Part 2](https://youtu.be/9EIzhRKpiM8) 234 | 235 | -------------------------- 236 | 36. Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
237 | Tachyon 238 | -------------------------- 239 | 37. Explain the different types of transformations on DStreams? 240 | - Stateless Transformations- Processing of the batch does not depend on the output of the previous batch. Examples- `map ()`, `reduceByKey ()`, `filter ()`. 241 | - Stateful Transformations- Processing of the batch depends on the intermediary results of the previous batch. Examples- Transformations that depend on sliding windows 242 | -------------------------- 243 | 37. Explain the popular use cases of Apache Spark 244 | Apache Spark is mainly used for - 245 | - Iterative machine learning. 246 | - Interactive data analytics and processing. 247 | - Stream processing. 248 | - Batch Processing. 249 | - Sensor data processing. 250 | -------------------------- 251 | 38. How can you remove the elements with a key present in any other RDD? 252 | Use the `subtractByKey()` function 253 | -------------------------- 254 | 39. What is the difference between persist() and cache()? 255 | `persist()` allows the user to specify the storage levels whereas `cache()` uses the default storage level. 256 | -------------------------- 257 | 40. What are the various levels of persistence in Apache Spark? 258 | Apache Spark automatically persists some intermediary data from various shuffle operations. This is done to avoid recomputing the entire input if a node fails during the shuffle. However, it is often suggested that users call `persist()` method on the RDD in case they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory or as a combination of both with different replication levels. 259 | The various storage/persistence levels in Spark are - 260 | 261 | - MEMORY_ONLY 262 | - MEMORY_ONLY_SER 263 | - MEMORY_AND_DISK 264 | - MEMORY_AND_DISK_SER, DISK_ONLY 265 | - MEMORY_ONLY_2, MEMORY_AND_DISK_2 266 | - OFF_HEAP 267 | 268 | [Which Storage Level to Choose?](https://spark.apache.org/docs/3.0.0-preview/rdd-programming-guide.html#which-storage-level-to-choose) 269 | -------------------------- 270 | 41. Does Apache Spark provide checkpointing? 271 | Lineage graphs are always useful to recover RDDs from failure but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing i.e. a REPLICATE flag to persist. However, the decision on which data to the checkpoint - is decided by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies. 272 | -------------------------- 273 | 42. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark? 274 | The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. RDD always has information on how to build from other datasets. If any partition of an RDD is lost due to failure, lineage helps build only that particular lost partition. 275 | 276 | Assuming that all of the RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark Cluster. 277 | -------------------------- 278 | 43. Explain the core components of a distributed Spark application. 279 | Driver - The process that runs the main() method of the program to create RDDs and perform transformations and actions on them. 280 | Executor - The worker processes that run the individual tasks of a Spark job. 281 | Cluster Manager - A pluggable component in Spark, to launch Executors and Drivers. 
The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN. 282 | -------------------------- 283 | 44. What do you understand by Lazy Evaluation? 284 | Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it takes note of the instructions but does nothing until the final result is asked for. When a transformation like `map()` is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow. 285 | -------------------------- 286 | 45. Define a worker node- 287 | A node that can run the Spark application code in a cluster can be called a worker node. A worker node can have more than one worker, which is configured by setting the `SPARK_WORKER_INSTANCES` property in the `spark-env.sh` file. Only one worker is started if the `SPARK_WORKER_INSTANCES` property is not defined. 288 | -------------------------- 289 | 46. What do you understand by `SchemaRDD`? 290 | Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using Spark. At the core of this component is a new type of RDD, the `SchemaRDD`: an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column. A `SchemaRDD` is similar to a table in a traditional relational database. 291 | -------------------------- 292 | 47. What are the disadvantages of using Apache Spark over Hadoop MapReduce? 293 | Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Apache Spark’s in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop. 294 | -------------------------- 295 | 48. What do you understand by Executor Memory in a Spark application? 296 | Every Spark application has the same fixed heap size and a fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the `spark.executor.memory` property or the `--executor-memory` flag. Every Spark application will have one executor on each worker node. The executor memory is a measure of how much of the worker node's memory the application will utilize. 297 | -------------------------- 298 | 49. What, according to you, is a common mistake Apache Spark developers make when using Spark? 299 | - [Not avoiding `GroupByKey`](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html) 300 | - [Collecting Large Datasets](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html) 301 | - [Not Dealing with Bad Input](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html) 302 | - [Not managing shuffle partitions](https://nealanalytics.com/blog/databricks-spark-jobs-optimization-techniques-shuffle-partition-technique-part-1/) 303 | 50. (A) Suppose that there is an RDD named `Samplerdd` that contains a huge list of numbers.
303 | 51. (A) Suppose that there is an RDD named `Samplerdd` that contains a huge list of numbers. The following Spark code is written to calculate the average -
304 | 
305 | ```python
306 | def SampleAvg(x, y):
307 |     return (x + y) / 2.0
308 | 
309 | avg = Samplerdd.reduce(SampleAvg)
310 | ```
311 | 
312 | --------------------------
313 | 51. (B) What is wrong with the above code and how will you correct it?
314 | The average function is not associative, so using it with `reduce` gives an incorrect result. The better way to compute the average is to first sum the values and then divide by the count, as shown below -
315 | 
316 | ```python
317 | def add(x, y):
318 |     return x + y
319 | 
320 | total = Samplerdd.reduce(add)
321 | avg = total / Samplerdd.count()
322 | ```
323 | However, the above code could lead to an overflow if the total becomes very big. So, another way to compute the average is to divide each number by the count first and then add them up, as shown below -
324 | ```python
325 | cnt = Samplerdd.count()
326 | 
327 | def divideByCnt(x):
328 |     return x / cnt
329 | 
330 | myrdd1 = Samplerdd.map(divideByCnt)
331 | avg = myrdd1.reduce(add)  # reduce the scaled RDD, not the original Samplerdd
332 | ```
333 | --------------------------
334 | 52. Explain the difference between Spark SQL and Hive.
335 | 
336 | - Spark SQL is faster than Hive.
337 | - Any Hive query can easily be executed in Spark SQL, but not vice-versa.
338 | - Spark SQL is a library whereas Hive is a framework.
339 | - It is not mandatory to create a metastore in Spark SQL, but it is mandatory to create a Hive metastore.
340 | - Spark SQL automatically infers the schema, whereas in Hive the schema needs to be explicitly declared.
341 | --------------------------
342 | 53. What is a Spark Session?
343 | The first step of any Spark application is creating a `SparkSession`, which enables you to run Spark code.
344 | The `SparkSession` class provides the single entry point to all functionality in Spark using the `DataFrame` API.
345 | It is automatically created for you in a Databricks notebook or the Spark shell as the variable `spark`.
346 | --------------------------
347 | 54. Why should one not use a UDF?
348 | UDFs cannot be optimized by the Catalyst Optimizer. To use UDFs, functions must be serialized and sent to the executors. And for Python, there is the additional overhead of spinning up a Python interpreter on an executor to run the UDF.
349 | 
350 | `sampleUDF = udf(sample_function)` serializes the function and sends the UDF to the executors so that we can use it on a DataFrame.
351 | 
352 | --------------------------
353 | 55. What is an `UnsafeRow`?
354 | The data that is "shuffled" is in a format known as `UnsafeRow`, or more commonly, the Tungsten Binary Format. `UnsafeRow` is the in-memory storage format for Spark SQL and DataFrames.
355 | 
356 | Advantages include -
357 | 
358 | - Compactness: Column values are encoded using custom encoders, not as JVM objects (as with RDDs).
359 | The benefit of using Spark's custom encoders is that you get almost the same compactness as Java serialization, but significantly faster encoding/decoding speeds.
360 | For custom data types, it is possible to write custom encoders from scratch.
361 | - Efficiency: Spark can operate directly out of Tungsten, without first deserializing Tungsten data into JVM objects.
362 | --------------------------
363 | 56. What are some best Caching Practices?
364 | - Don't cache unless you're sure the DataFrame is going to be used multiple times (see the sketch below).
365 | - Omit unneeded columns to reduce the storage footprint.
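A small sketch of these practices under assumed inputs - the path, the `events_df` DataFrame and its column names are hypothetical. Select only the columns you need, persist while the DataFrame is being reused, and unpersist once done.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("caching-practices").getOrCreate()

# Hypothetical input path and columns, used only for illustration.
events_df = spark.read.parquet("/tmp/data/events")

# Keep only the columns that will actually be reused, then persist.
trimmed_df = events_df.select("user_id", "event_type", "event_ts")
trimmed_df.persist(StorageLevel.MEMORY_AND_DISK)  # or trimmed_df.cache() for the default level

# The DataFrame is reused by more than one action, so caching pays off.
print(trimmed_df.count())
trimmed_df.groupBy("event_type").count().show()

# Free the storage once the reuse is over.
trimmed_df.unpersist()
```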
366 | --------------------------
367 | 57. Understanding the Spark UI
368 | 
369 | - Use `setJobDescription` for better tracking of jobs in the Spark UI.
370 | 
371 | - Use the event timeline to analyze jobs that are taking a long time to execute.
372 | 
373 | - The event timeline for a stage shows the various task components, including executor computing time, which should be the dominant colour in the timeline. The other coloured components are overhead and should be examined if we want to optimize the process. If there is a lot of overhead time, consider creating larger partitions of data.
374 | 
375 | - In the Summary Metrics tab, we can see the statistics by quartile for the green tasks in the event timeline. Here we should analyze Duration to see if the partitions are skewed - a large difference between the min and max values indicates skewed partitions.
376 | 
377 | - Input Size/Records can be used in a similar way, by checking whether there is a large difference between the min, median and max partition sizes.
378 | 
379 | - Inside the SQL tab, we can click on the job descriptions that we set. This leads to a more explanatory visualization mapped to the actual code that we wrote.
380 | 
381 | [Spark Web UI - Official Documentation](https://spark.apache.org/docs/3.0.0-preview/web-ui.html)
382 | 
383 | [Deep Dive into Monitoring Spark Applications - Jacek Laskowski - YouTube](https://youtu.be/mVP9sZ6K__Y)
384 | 
385 | [Spark UI Visualization - Andrew Or - YouTube](https://www.youtube.com/watch?v=VQOKk9jJGcw&list=PL-x35fyliRwif48cPXQ1nFM85_7e200Jp)
386 | --------------------------
387 | 58. Shared resources -
388 | - Executors share machine-level resources. That is, if a node has 4 executors, the machine's resources are shared between them.
389 | - Tasks share executor-level resources.
390 | - Resources are shared by the cores under a single executor - they share its memory, disk and network. If any core under an executor fails because of OOM or any other reason, the whole executor is affected and the processing on that executor has to be stopped.
391 | --------------------------
392 | 59. Local and Global Results -
393 | For certain actions and transformations, tasks first operate on each partition individually, and then the same operation needs to be performed again globally to get the accurate result. For example, if 5 executors report the record counts of their partitions as 4, 5, 5, 6 and 4, a final global count is needed to conclude that the dataset has 24 records. More such operations are -
394 | 
395 | | Stage 1         | Stage 2          |
396 | |-----------------|------------------|
397 | | Local filter    | No global filter |
398 | | Local count     | Global count     |
399 | | Local distinct  | Global distinct  |
400 | | Local sort      | Global sort      |
401 | | Local aggregate | Global aggregate |
402 | --------------------------
403 | 60. What is shuffling?
404 | Shuffling is the process of rearranging data within a cluster between stages.
405 | It is triggered by wide transformations like -
406 | - Repartition
407 | - ByKey operations (except counting)
408 | - Joins, the worst being cross joins
409 | - Sorting
410 | - Distinct
411 | - GroupBy
412 | --------------------------
413 | 61. What is a DataFrame?
414 | A DataFrame is a distributed collection of data grouped into named columns.
415 | --------------------------
416 | 62. Why DataFrames and not RDDs?
417 | Spark does not know what computation happens inside an RDD operation. Whether you are performing a join, filter, select or aggregation, Spark only sees it as a lambda expression. Even the `Iterator[T]` datatype is not visible to Spark. That leaves no room for Spark to perform optimizations.
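A short sketch of the contrast, using a made-up list of (name, age) records - in the RDD version Spark only sees opaque lambdas, while in the DataFrame version the filter and aggregation are declarative expressions that the Catalyst Optimizer can analyze and rearrange.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-dataframe").getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Cathy", 29), ("Dan", 45)]

# RDD version - the intent is hidden inside lambdas, so Spark cannot optimize the plan.
rdd = spark.sparkContext.parallelize(data)
rdd_result = (rdd.filter(lambda row: row[1] > 30)
                 .map(lambda row: (row[1], 1))
                 .reduceByKey(lambda a, b: a + b)
                 .collect())
print(rdd_result)

# DataFrame version - named columns and operators that Catalyst can inspect,
# reorder and compile into an optimized physical plan.
df = spark.createDataFrame(data, ["name", "age"])
df.filter(F.col("age") > 30).groupBy("age").count().show()

spark.stop()
```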
418 | --------------------------
419 | 63. Why should you always define your schema upfront when reading a file?
420 | 
421 | - You relieve Spark from the onus of inferring data types.
422 | 
423 | - You prevent Spark from creating a separate job just to read a large portion of
424 |   your file to ascertain the schema, which for a large data file can be expensive and
425 |   time-consuming.
426 | 
427 | - You can detect errors early if the data doesn't match the schema.
428 | 
429 | --------------------------
430 | 64. Managed vs Unmanaged tables?
431 | For a managed table, Spark manages both the data in the file store (HDFS, S3, etc.) and the metadata for the table, while for an unmanaged table, Spark only manages the metadata and you manage the data yourself in an external data source such as Cassandra. So for a command like `DROP TABLE`, Spark will only delete the metadata of an unmanaged table.
432 | 
433 | Unmanaged tables in Spark can be created like this -
434 | 
435 | ```python
436 | (flights_df
437 |   .write
438 |   .option("path", "/tmp/data/us_flights_delay")
439 |   .saveAsTable("us_delay_flights_tbl"))
440 | ```
441 | --------------------------
442 | 65. How can you speed up PySpark UDFs?
443 | One can create a Pandas UDF using the `pandas_udf` decorator.
444 | Before the introduction of Pandas UDFs -
445 | 
446 | - Collect all rows to the Spark driver.
447 | - Each row is serialized into Python's pickle format and sent to the Python worker process.
448 | - The child process unpickles each row into a huge list of tuples.
449 | - A Pandas DataFrame is created using `pandas.DataFrame.from_records()`.
450 | This causes issues like -
451 | - Even using `cPickle`, Python serialization is a slow process.
452 | - `from_records` iterates over the list of pure Python data and converts each value to the pandas format.
453 | 
454 | 
455 | Introduction of Arrow -
456 | - Once data is in the Arrow format, there is no need for pickling/serialization, as Arrow data can be sent directly to Python.
457 | - `PyArrow` in Python utilizes zero-copy methods to create a `pandas.DataFrame` from entire chunks of data instead of processing individual scalar values.
458 | - Additionally, the conversion to Arrow data can be done on the JVM and pushed back for the Spark executors to perform in parallel, drastically reducing the load on the driver.
459 | 
460 | The use of Arrow when calling `toPandas()` needs to be enabled by setting `spark.sql.execution.arrow.enabled` (or `spark.sql.execution.arrow.pyspark.enabled` in Spark 3.x) to `true`.
461 | 
462 | [Pandas UDF - Microsoft](https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/udf-python-pandas)
463 | 
464 | ------
465 | 
466 | 66. What are the most important configuration parameters in Spark?
467 | 
468 | `spark.default.parallelism` - Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by the user. A short sketch of setting such parameters follows below.
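A minimal sketch of how such parameters are typically set when building a session - the values here are placeholders, not recommendations, and `spark.sql.shuffle.partitions` and `spark.executor.memory` are added only as further commonly tuned examples, not parameters named in the original answer.

```python
from pyspark.sql import SparkSession

# Illustrative values only - the right numbers depend on the cluster and the data.
spark = (SparkSession.builder
         .appName("config-demo")
         .config("spark.default.parallelism", "200")     # default partition count for RDD shuffles
         .config("spark.sql.shuffle.partitions", "200")  # partition count for DataFrame/SQL shuffles
         .config("spark.executor.memory", "4g")          # heap size per executor
         .getOrCreate())

print(spark.conf.get("spark.sql.shuffle.partitions"))
```

The same keys can equally be passed with `--conf` on `spark-submit` or set in `spark-defaults.conf`.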
469 | 
470 | --------------------------
471 | ### Random Video Resources
472 | 
473 | 📹 [Spark Out Of Memory Issues](https://youtu.be/FdT5o7M35kU) - Data Savvy
474 | 
475 | 📹 [Decide number of Executors and Memory](https://www.youtube.com/watch?v=V9E-bWarMNw) - Data Savvy
476 | 
477 | 📹 [Big Data Architecture Patterns](https://youtu.be/-N9i-YXoQBE) - Eddie Satterly
478 | 
479 | 📹 [Threat Detection and Response at Scale](https://vimeo.com/274267634?embedded=true&source=video_title&owner=40921445) - Dominique Brezinski and Michael Armbrust
480 | 
481 | ------
482 | 
483 | ### Random Reading Resources
484 | 
485 | 📝 [How partitions are calculated when reading a file in Spark](https://dzone.com/articles/guide-to-partitions-calculation-for-processing-dat)
486 | 
487 | 📝 [Why the number of tasks can be much larger than the number of row groups](http://cloudsqale.com/2021/03/19/spark-reading-parquet-why-the-number-of-tasks-can-be-much-larger-than-the-number-of-row-groups/)
488 | 
489 | 📝 [Processing Petabytes of Data in Seconds with Databricks Delta](https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html)
490 | 
491 | ------
492 | 
493 | ### Performance Recommendations -
494 | 
495 | - One executor per node is considered more stable than the two or three executors per node used in systems like YARN. ([Databricks Guideline](https://docs.databricks.com/clusters/configure.html#worker-node-1))
496 | - Try to group wide transformations together for the best automatic optimization.
497 | - [Spark Performance Optimization - IBM Developer Blog](https://developer.ibm.com/blogs/spark-performance-optimization-guidelines/)
498 | - [Spark Tips, Partition Tuning - Luminousmen](https://luminousmen.com/post/spark-tips-partition-tuning)
499 | - [Cluster Configuration Best Practices - Databricks](https://docs.databricks.com/clusters/cluster-config-best-practices.html)
500 | - [Delta Lake Best Practices - Databricks](https://docs.databricks.com/delta/best-practices.html)
501 | - [Spark troubleshooting challenges - Unravel](https://www.unraveldata.com/resources/spark-troubleshooting-part-1-ten-challenges/)
502 | - [Why memory management is causing your spark apps to be slow or fail - Unravel](https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/)
503 | - [Why data skew and garbage collection cause spark apps to be slow or fail - Unravel](https://www.unraveldata.com/resources/spark-troubleshooting-part-1-ten-challenges/)
504 | - [Configuring memory and CPU options - IBM](https://www.ibm.com/docs/en/zpfas/1.1.0?topic=spark-configuring-memory-cpu-options)
505 | - [Optimize performance with file management - Databricks](https://docs.databricks.com/delta/optimizations/file-mgmt.html)
506 | 
507 | ![Partition Guideline](static/partition_guide.png)
508 | 
509 | 
--------------------------------------------------------------------------------