**Spark-overflow**
===================

A collection of Spark-related information, solutions, debugging tips and tricks, etc. PRs are always welcome! Share what you know about Apache Spark.


# **Knowledge**
### Spark executor memory([Reference Link](http://www.slideshare.net/AGrishchenko/apache-spark-architecture/57))


### spark-submit --verbose([Reference Link](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Always add the `--verbose` option to `spark-submit` to print the following information:
  - All default properties.
  - Command line options.
  - Settings from the Spark conf file.

### Spark Executor on YARN([Reference Link](http://www.slideshare.net/AmazonWebServices/bdt309-data-science-best-practices-for-apache-spark-on-amazon-emr))
The following configs control executor memory on YARN:
- YARN container size - `yarn.nodemanager.resource.memory-mb`.
- Memory overhead - `spark.yarn.executor.memoryOverhead`.

- An example of how to set up YARN and launch Spark jobs with a specific number of executors ([Reference Link](http://stackoverflow.com/questions/29940711/apache-spark-setting-executor-instances-does-not-change-the-executors))

# **Tunings**
### Tune the shuffle partitions
- Tune the number of partitions via `spark.sql.shuffle.partitions`.

### Avoid using jets3t 1.9([Reference Link](http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users))
- It's the default jar on Hadoop 2.0.
- Inexplicably terrible performance.

### Use `reduceByKey()` instead of `groupByKey()`
- `reduceByKey()` combines values for each key on the map side, so only one partial result per key per partition is shuffled.
- `groupByKey()` ships every individual value across the network before grouping, which is slower and can OOM on skewed keys.
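A minimal Scala sketch of the difference, assuming a `SparkContext` named `sc` (as in `spark-shell`); the data and variable names are illustrative only:

```scala
// Toy (key, value) pairs; stand-in for any pair RDD.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey: partial sums are computed per partition (map-side combine),
// so only one record per key per partition is shuffled.
val sums = pairs.reduceByKey(_ + _)

// groupByKey: every individual value crosses the network before it is summed,
// which shuffles far more data and can OOM on skewed keys.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

sums.collect()          // e.g. Array((a,2), (b,1))
sumsViaGroup.collect()  // same result, much more shuffle
```

Both produce the same per-key sums; the only difference is how much data crosses the network during the shuffle.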
### GC policy([Reference Link](https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html))
- G1GC is a newer collector that you can try.
- Enable it with `-XX:+UseG1GC`.

### Join a large table with a small table([Reference Link](http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs))
- By default Spark uses `ShuffledHashJoin`; the problem is that all the data of the large table gets shuffled.
- Use `BroadcastHashJoin` instead:
  - It broadcasts the small table to all workers.
  - Set `spark.sql.autoBroadcastJoinThreshold`.

### Use `foreachPartition()`
- If your task involves expensive per-partition setup, use `foreachPartition()` instead of `foreach()`.
- For example: DB connections, remote calls, etc.

### Data Serialization
- The default Java serialization is too slow.
- Use Kryo:
  - `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");`

# **Solutions**
### java.io.IOException: No space left on device
- `/tmp` is probably full; check `spark.local.dir` in `spark-defaults.conf`.
- How to fix it?
  - Mount more disk space:
    `spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp`

### java.lang.OutOfMemoryError: GC overhead limit exceeded([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Too much time is spent in GC; you can check that in the Spark metrics.
- How to fix it?
  - Increase the executor heap size with `--executor-memory`.
  - Increase `spark.storage.memoryFraction`.
  - Change the GC policy (e.g. use G1GC).

### shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- OOM on the Spark driver.
- This usually happens when you fetch a large amount of data back to the driver (client).
- Spark SQL and Streaming are typical workloads that need a large heap on the driver.
- How to fix it?
  - Increase `--driver-memory`.

### java.lang.NoClassDefFoundError([ref](http://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications?qid=25ed4f3f-fc2e-43b2-bc8a-7f78b21bdebb&v=&b=&from_search=34))
- Compiles fine, but throws the error at runtime.
- How to fix it?
  - Use `--jars` to upload jars and place them on the `classpath` of your application.
  - Use `--packages` to include a comma-separated list of Maven coordinates of jars,
    e.g. `--packages com.google.code.gson:gson:2.6.2`.
    This example adds the gson jar to both the executor and driver `classpath`.

### Serialization stack error
- Error message:

      Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: com.spark.demo.MyClass
      Serialization stack:
        - Object is not serializable (class: com.spark.demo.MyClass, value: com.spark.demo.MyClass@6951e281)
        - Element of array (index: 0)
        - Array (class [Ljava.lang.Object;, size 6)

- How to fix it?
  - Make `com.spark.demo.MyClass` implement `java.io.Serializable`.

### java.io.FileNotFoundException: spark-assembly.jar does not exist
- How to fix it?
  1. Upload `spark-assembly.jar` to HDFS.
  2. Set `spark.yarn.jar`; there are two ways to configure it:
     - Add `--conf spark.yarn.jar` when launching `spark-submit`.
     - Set `spark.yarn.jar` on the `SparkConf` in your Spark driver.

### java.io.IOException: Resource spark-assembly.jar changed on src filesystem ([Reference Link](http://stackoverflow.com/questions/30893995/spark-on-yarn-jar-upload-problems))
- `spark-assembly.jar` exists in HDFS, but you still get the "assembly jar changed" error.
- How to fix it?
  1. Upload `spark-assembly.jar` to HDFS.
  2. Set `spark.yarn.jar`; there are two ways to configure it:
     - Add `--conf spark.yarn.jar` when launching `spark-submit`.
     - Set `spark.yarn.jar` on the `SparkConf` in your Spark driver.

### How to find the size of a DataFrame in Spark
- In Java, you can use [org.apache.spark.util.SizeEstimator](http://stackoverflow.com/a/35008549).
- In PySpark, one way to do it is to persist the DataFrame, then go to the Storage tab in the Spark UI and read off its size.
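A rough Scala sketch of both ideas; the DataFrame `df` and the exact steps are illustrative assumptions, not from the links above, and `SizeEstimator.estimate` only measures the in-memory size of the object graph you hand it, so treat the numbers as estimates:

```scala
import org.apache.spark.util.SizeEstimator

// df is assumed to be an existing DataFrame.

// 1) SizeEstimator on the materialized rows.
//    Beware: collect() pulls the whole DataFrame to the driver.
val rows = df.collect()
println(s"Approximate size: ${SizeEstimator.estimate(rows)} bytes")

// 2) Programmatic version of the "Storage tab" trick: cache the DataFrame,
//    force evaluation, then read the cached size from the driver's storage info.
//    This only reflects this DataFrame if nothing else is cached.
df.cache()
df.count()
val cachedBytes = df.rdd.sparkContext.getRDDStorageInfo.map(_.memSize).sum
println(s"Cached size: $cachedBytes bytes")
```

If the DataFrame is too large to collect, skip the first option and rely on the cached size (or simply read it from the Spark UI Storage tab, as described above).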