**Spark-overflow**
===================

A collection of Spark-related information, solutions, debugging tips and tricks, etc. PRs are always welcome! Share what you know about Apache Spark.


# **Knowledge**
### Spark executor memory([Reference Link](http://www.slideshare.net/AGrishchenko/apache-spark-architecture/57))


### spark-submit --verbose([Reference Link](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Always add the `--verbose` option to `spark-submit` to print the following information:
  - All default properties.
  - Command line options.
  - Settings from the Spark conf file.

### Spark Executor on YARN([Reference Link](http://www.slideshare.net/AmazonWebServices/bdt309-data-science-best-practices-for-apache-spark-on-amazon-emr))
The following configs control executor memory on YARN:
- YARN container size - `yarn.nodemanager.resource.memory-mb`.
- Memory overhead - `spark.yarn.executor.memoryOverhead`.

- An example of how to set up YARN and launch Spark jobs with a specific number of executors ([Reference Link](http://stackoverflow.com/questions/29940711/apache-spark-setting-executor-instances-does-not-change-the-executors))

# **Tunings**
### Tune the shuffle partitions
- Tune the number of partitions via `spark.sql.shuffle.partitions`.

### Avoid using jets3t 1.9([Reference Link](http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users))
- It's the default jar on Hadoop 2.0.
- Inexplicably terrible performance.

### Use `reduceByKey()` instead of `groupByKey()`
- `reduceByKey()` combines values for each key on the map side, so only one partial result per key per partition is shuffled.
- `groupByKey()` ships every individual value across the network before grouping, which is slower and can OOM on skewed keys.
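A minimal Scala sketch of the difference, assuming a `SparkContext` named `sc` (as in `spark-shell`); the data and variable names are illustrative only:

```scala
// Toy (key, value) pairs; stand-in for any pair RDD.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey: partial sums are computed per partition (map-side combine),
// so only one record per key per partition is shuffled.
val sums = pairs.reduceByKey(_ + _)

// groupByKey: every individual value crosses the network before it is summed,
// which shuffles far more data and can OOM on skewed keys.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

sums.collect()          // e.g. Array((a,2), (b,1))
sumsViaGroup.collect()  // same result, much more shuffle
```

Both produce the same per-key sums; the only difference is how much data crosses the network during the shuffle.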
### GC policy([Reference Link](https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html))
- G1GC is a newer collector that you can try.
- Enable it with `-XX:+UseG1GC`.

### Join a large table with a small table([Reference Link](http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs))
- By default Spark uses `ShuffledHashJoin`; the problem is that all the data of the large table gets shuffled.
- Use `BroadcastHashJoin` instead:
  - It broadcasts the small table to all workers.
  - Set `spark.sql.autoBroadcastJoinThreshold`.

### Use `foreachPartition()`
- If your task involves expensive per-partition setup, use `foreachPartition()` instead of `foreach()`.
- For example: DB connections, remote calls, etc.

### Data Serialization
- The default Java serialization is too slow.
- Use Kryo:
  - `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");`

# **Solutions**
### java.io.IOException: No space left on device
- `/tmp` is probably full; check `spark.local.dir` in `spark-defaults.conf`.
- How to fix it?
  - Mount more disk space:
    `spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp`

### java.lang.OutOfMemoryError: GC overhead limit exceeded([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Too much time is spent in GC; you can check that in the Spark metrics.
- How to fix it?
  - Increase the executor heap size with `--executor-memory`.
  - Increase `spark.storage.memoryFraction`.
  - Change the GC policy (e.g. use G1GC).

### shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- OOM on the Spark driver.
- This usually happens when you fetch a large amount of data back to the driver (client).
- Spark SQL and Streaming are typical workloads that need a large heap on the driver.
- How to fix it?
  - Increase `--driver-memory`.

### java.lang.NoClassDefFoundError([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Compiles fine, but throws the error at runtime.
- How to fix it?
  - Use `--jars` to upload jars and place them on the `classpath` of your application.
  - Use `--packages` to include a comma-separated list of Maven coordinates of jars,
    e.g. `--packages com.google.code.gson:gson:2.6.2`.
    This example adds the gson jar to both the executor and driver `classpath`.

### Serialization stack error
- Error message:

      Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: com.spark.demo.MyClass
      Serialization stack:
        - Object is not serializable (class: com.spark.demo.MyClass, value: com.spark.demo.MyClass@6951e281)
        - Element of array (index: 0)
        - Array (class [Ljava.lang.Object;, size 6)

- How to fix it?
  - Make `com.spark.demo.MyClass` implement `java.io.Serializable`.

### java.io.FileNotFoundException: spark-assembly.jar does not exist
- How to fix it?
  1. Upload `spark-assembly.jar` to HDFS.
  2. Set `spark.yarn.jar`; there are two ways to configure it:
     - Add `--conf spark.yarn.jar` when launching `spark-submit`.
     - Set `spark.yarn.jar` on the `SparkConf` in your Spark driver.

### java.io.IOException: Resource spark-assembly.jar changed on src filesystem ([Reference Link](http://stackoverflow.com/questions/30893995/spark-on-yarn-jar-upload-problems))
- `spark-assembly.jar` exists in HDFS, but you still get the "assembly jar changed" error.
- How to fix it?
  1. Upload `spark-assembly.jar` to HDFS.
  2. Set `spark.yarn.jar`; there are two ways to configure it:
     - Add `--conf spark.yarn.jar` when launching `spark-submit`.
     - Set `spark.yarn.jar` on the `SparkConf` in your Spark driver.

### How to find the size of a DataFrame in Spark
- In Java, you can use [org.apache.spark.util.SizeEstimator](http://stackoverflow.com/a/35008549).
- In PySpark, one way to do it is to persist the DataFrame, then go to the Storage tab in the Spark UI and read off its size.
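A rough Scala sketch of both ideas; the DataFrame `df` and the exact steps are illustrative assumptions, not from the links above, and `SizeEstimator.estimate` only measures the in-memory size of the object graph you hand it, so treat the numbers as estimates:

```scala
import org.apache.spark.util.SizeEstimator

// df is assumed to be an existing DataFrame.

// 1) SizeEstimator on the materialized rows.
//    Beware: collect() pulls the whole DataFrame to the driver.
val rows = df.collect()
println(s"Approximate size: ${SizeEstimator.estimate(rows)} bytes")

// 2) Programmatic version of the "Storage tab" trick: cache the DataFrame,
//    force evaluation, then read the cached size from the driver's storage info.
//    This only reflects this DataFrame if nothing else is cached.
df.cache()
df.count()
val cachedBytes = df.rdd.sparkContext.getRDDStorageInfo.map(_.memSize).sum
println(s"Cached size: $cachedBytes bytes")
```

If the DataFrame is too large to collect, skip the first option and rely on the cached size (or simply read it from the Spark UI Storage tab, as described above).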