**Spark-overflow**
===================

A collection of Spark-related information, solutions, debugging tips and tricks, etc. PRs are always welcome! Share what you know about Apache Spark.


# **Knowledge**
### Spark executor memory([Reference Link](http://www.slideshare.net/AGrishchenko/apache-spark-architecture/57))

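The referenced slides walk through how an executor's memory is laid out on YARN: the JVM heap you request with `--executor-memory`, plus an off-heap overhead for the container, with the heap further split between storage and shuffle/execution (`spark.storage.memoryFraction` and `spark.shuffle.memoryFraction` on the legacy memory manager). A minimal sketch of the knobs involved; the values are illustrative, not recommendations:

```
# spark-defaults.conf style settings
# JVM heap of each executor (same as --executor-memory)
spark.executor.memory                4g
# Off-heap overhead added to each executor container on YARN, in MB
spark.yarn.executor.memoryOverhead   512
# Share of the heap usable for cached RDDs (legacy memory manager)
spark.storage.memoryFraction         0.6
```
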
### spark-submit --verbose([Reference Link](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Always add the `--verbose` option to `spark-submit` (see the example below) to print the following information:
    - All default properties.
    - Command line options.
    - Settings from the Spark conf file.

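A minimal example; the class name and jar are placeholders for your own application:

```
# --verbose makes spark-submit print the resolved configuration before the job starts
spark-submit --verbose \
  --master yarn \
  --class com.example.MyApp \
  my-app.jar
```
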
### Spark Executor on YARN([Reference Link](http://www.slideshare.net/AmazonWebServices/bdt309-data-science-best-practices-for-apache-spark-on-amazon-emr))
The following configs control the memory sizing on YARN:
- YARN container size - `yarn.nodemanager.resource.memory-mb`.
- Memory overhead - `spark.yarn.executor.memoryOverhead`.

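As a sketch (values illustrative): each executor container asks YARN for `--executor-memory` plus the memory overhead, and the containers scheduled on a node must fit within `yarn.nodemanager.resource.memory-mb`:

```
# 4g heap + 768 MB overhead -> roughly a 4.75 GB container request per executor
spark-submit \
  --master yarn \
  --executor-memory 4G \
  --conf spark.yarn.executor.memoryOverhead=768 \
  my-app.jar
```
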
- An example of how to set up YARN and launch Spark jobs with a specific number of executors ([Reference Link](http://stackoverflow.com/questions/29940711/apache-spark-setting-executor-instances-does-not-change-the-executors)); a short sketch follows.

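A minimal sketch of requesting a fixed number of executors on YARN (flag values are illustrative; if dynamic allocation is enabled it can override `--num-executors`):

```
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4G \
  my-app.jar
```
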
# **Tunings**
### Tune the shuffle partitions
- Tune the number of shuffle partitions via `spark.sql.shuffle.partitions` (see below).

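A minimal sketch, assuming an existing `sqlContext` (the default is 200 partitions; the right value depends on your data volume and cluster parallelism):

```scala
// Raise the number of partitions used when Spark SQL shuffles data (joins, aggregations)
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
```
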
### Avoid using jets3t 1.9([Reference Link](http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users))
- It's the default jar on Hadoop 2.0.
- Inexplicably terrible performance.

### Use `reduceByKey()` instead of `groupByKey()`
- reduceByKey - combines values for the same key on each partition before the shuffle (map-side combine), so far less data crosses the network.
- groupByKey - ships every key-value pair across the network; large groups for hot keys can blow up executor memory.

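A word-count sketch showing both (assumes an existing `SparkContext` named `sc`; the input path is a placeholder):

```scala
val pairs = sc.textFile("hdfs:///path/to/input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

// reduceByKey: partial sums are computed on each partition before shuffling
val counts = pairs.reduceByKey(_ + _)

// groupByKey: every (word, 1) pair is shuffled, and the sums are computed afterwards
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)
```
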
### GC policy([Reference Link](https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html))
- G1GC is a newer collector you can switch to.
- Enable it with `-XX:+UseG1GC`.

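One way to pass the flag to the executor (and optionally driver) JVMs:

```
spark-submit \
  --conf spark.executor.extraJavaOptions=-XX:+UseG1GC \
  --conf spark.driver.extraJavaOptions=-XX:+UseG1GC \
  my-app.jar
```
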
### Join a large table with a small table([Reference Link](http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs))
- By default Spark uses a `ShuffledHashJoin`; the problem is that all the data of the large table gets shuffled.
- Use `BroadcastHashJoin` instead:
    - It broadcasts the small table to all workers.
    - Set `spark.sql.autoBroadcastJoinThreshold`.

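A minimal sketch with the DataFrame API (`largeDF`, `smallDF` and the join key are placeholders); `broadcast()` hints the planner even when the table is above the automatic threshold:

```scala
import org.apache.spark.sql.functions.broadcast

// Raise the automatic broadcast threshold to about 50 MB (value is in bytes)
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// Or hint the join explicitly: the small table is shipped to every executor
val joined = largeDF.join(broadcast(smallDF), largeDF("id") === smallDF("id"))
```
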
### Use `foreachPartition`
- If your task involves a large setup cost, use `foreachPartition` instead of `foreach`, so the setup runs once per partition rather than once per record.
    - For example: DB connections, remote calls, etc.

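A sketch of the pattern; `createConnection` and `insert` are placeholders for whatever expensive setup your job needs:

```scala
rdd.foreachPartition { records =>
  // One connection per partition, not per record (createConnection is a placeholder helper)
  val conn = createConnection()
  try {
    records.foreach(record => conn.insert(record))
  } finally {
    conn.close()
  }
}
```
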
### Data Serialization
- The default Java serialization is too slow.
- Use Kryo:
    - `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");`

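In Scala the same setting looks like this; registering your classes up front is optional but keeps the serialized form compact (`MyClass` is a placeholder):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyClass]))   // optional, but avoids writing full class names
```
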
# **Solutions**
### java.io.IOException: No space left on device
- `/tmp` is probably full; check `spark.local.dir` in `spark-defaults.conf`.
- How to fix it?
    - Mount more disk space:
      `spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp`

### java.lang.OutOfMemoryError: GC overhead limit exceeded([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Too much time is being spent in GC; you can check that in the Spark metrics.
- How to fix it?
    - Increase the executor heap size with `--executor-memory`.
    - Increase `spark.storage.memoryFraction`.
    - Change the GC policy (e.g. use G1GC).

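The three fixes above combined into one submit command (values are illustrative only):

```
spark-submit \
  --executor-memory 8G \
  --conf spark.storage.memoryFraction=0.8 \
  --conf spark.executor.extraJavaOptions=-XX:+UseG1GC \
  my-app.jar
```
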
### shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- OOM on the Spark driver.
- This usually happens when you fetch a huge amount of data back to the driver (client).
- Spark SQL and Streaming are typical workloads that need a large heap on the driver.
- How to fix it?
    - Increase `--driver-memory`.

### java.lang.NoClassDefFoundError([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Compiles okay, but the error shows up at run time.
- How to fix it?
    - Use `--jars` to upload JARs and place them on the `classpath` of your application.
    - Use `--packages` to include a comma-separated list of Maven coordinates of JARs.
      EX: `--packages com.google.code.gson:gson:2.6.2`
      This example adds the gson JAR to both the executor and driver `classpath`.

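Both options in one command (the local jar path is a placeholder):

```
spark-submit \
  --jars /path/to/extra-lib.jar \
  --packages com.google.code.gson:gson:2.6.2 \
  my-app.jar
```
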
### Serialization stack error
- Error message:

        Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: com.spark.demo.MyClass
        Serialization stack:
            - Object is not serializable (class: com.spark.demo.MyClass, value: com.spark.demo.MyClass@6951e281)
            - Element of array (index: 0)
            - Array (class [Ljava.lang.Object;, size 6)

- How to fix it?
    - Make `com.spark.demo.MyClass` implement `java.io.Serializable`.

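A minimal sketch of the fix in Scala (the fields are placeholders):

```scala
// Implementing java.io.Serializable lets Spark ship instances between driver and executors
class MyClass(val id: Int, val name: String) extends java.io.Serializable
```
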
### java.io.FileNotFoundException: spark-assembly.jar does not exist
- How to fix it?
    1. Upload spark-assembly.jar to HDFS.
    2. Set `spark.yarn.jar`; there are two ways to configure it:
        - Add `--conf spark.yarn.jar` when launching spark-submit.
        - Set `spark.yarn.jar` on `SparkConf` in your Spark driver.

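For example (the HDFS path is a placeholder for wherever you uploaded the assembly):

```
spark-submit \
  --conf spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly.jar \
  my-app.jar
```
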
### java.io.IOException: Resource spark-assembly.jar changed on src filesystem ([Reference Link](http://stackoverflow.com/questions/30893995/spark-on-yarn-jar-upload-problems))
- spark-assembly.jar already exists in HDFS, but you still get the "assembly jar changed" error.
- How to fix it?
    1. Upload spark-assembly.jar to HDFS.
    2. Set `spark.yarn.jar`; there are two ways to configure it:
        - Add `--conf spark.yarn.jar` when launching spark-submit.
        - Set `spark.yarn.jar` on `SparkConf` in your Spark driver.

### How to find the size of a dataframe in Spark
- In Java, you can use [org.apache.spark.util.SizeEstimator](http://stackoverflow.com/a/35008549).
- In PySpark, one way is to persist the dataframe to disk, then check its size on the Storage tab of the Spark UI.

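A sketch of the `SizeEstimator` approach in Scala (`df` is a placeholder for your dataframe):

```scala
import org.apache.spark.util.SizeEstimator

// Rough in-memory footprint (in bytes) of the object graph reachable from df;
// most meaningful once the data is actually cached
println(SizeEstimator.estimate(df))
```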