├── .gitignore ├── LICENSE ├── README.md ├── SUMMARY.md ├── best_practices ├── README.md ├── dont_call_collect_on_a_very_large_rdd.md └── prefer_reducebykey_over_groupbykey.md ├── cover.jpg ├── images ├── cached-partitions.png ├── group_by.png ├── locality.png ├── partitions-as-tasks.png └── reduce_by.png ├── performance_optimization ├── README.md ├── data_locality.md └── how_many_partitions_does_an_rdd_have.md ├── spark_streaming ├── README.md └── error_oneforonestrategy.md └── troubleshooting ├── README.md ├── connectivity_issues.md ├── java_io_not_serializable_exception.md ├── missing_dependencies_in_jar_files.md └── port_22_connection_refused.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Node rules: 2 | ## Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files) 3 | .grunt 4 | 5 | ## Dependency directory 6 | ## Commenting this out is preferred by some people, see 7 | ## https://npmjs.org/doc/faq.html#Should-I-check-my-node_modules-folder-into-git 8 | node_modules 9 | 10 | # Book build output 11 | _book 12 | 13 | # eBook build output 14 | *.epub 15 | *.mobi 16 | *.pdf 17 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | License 2 | 3 | THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED. 4 | 5 | BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS. 6 | 7 | 1. 
Definitions 8 | 9 | "Adaptation" means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work, arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License. 10 | "Collection" means a collection of literary or artistic works, such as encyclopedias and anthologies, or performances, phonograms or broadcasts, or other works or subject matter other than works listed in Section 1(f) below, which, by reason of the selection and arrangement of their contents, constitute intellectual creations, in which the Work is included in its entirety in unmodified form along with one or more other contributions, each constituting separate and independent works in themselves, which together are assembled into a collective whole. A work that constitutes a Collection will not be considered an Adaptation (as defined above) for the purposes of this License. 11 | "Distribute" means to make available to the public the original and copies of the Work or Adaptation, as appropriate, through sale or other transfer of ownership. 12 | "Licensor" means the individual, individuals, entity or entities that offer(s) the Work under the terms of this License. 
13 | "Original Author" means, in the case of a literary or artistic work, the individual, individuals, entity or entities who created the Work or if no individual or entity can be identified, the publisher; and in addition (i) in the case of a performance the actors, singers, musicians, dancers, and other persons who act, sing, deliver, declaim, play in, interpret or otherwise perform literary or artistic works or expressions of folklore; (ii) in the case of a phonogram the producer being the person or legal entity who first fixes the sounds of a performance or other sounds; and, (iii) in the case of broadcasts, the organization that transmits the broadcast. 14 | "Work" means the literary and/or artistic work offered under the terms of this License including without limitation any production in the literary, scientific and artistic domain, whatever may be the mode or form of its expression including digital form, such as a book, pamphlet and other writing; a lecture, address, sermon or other work of the same nature; a dramatic or dramatico-musical work; a choreographic work or entertainment in dumb show; a musical composition with or without words; a cinematographic work to which are assimilated works expressed by a process analogous to cinematography; a work of drawing, painting, architecture, sculpture, engraving or lithography; a photographic work to which are assimilated works expressed by a process analogous to photography; a work of applied art; an illustration, map, plan, sketch or three-dimensional work relative to geography, topography, architecture or science; a performance; a broadcast; a phonogram; a compilation of data to the extent it is protected as a copyrightable work; or a work performed by a variety or circus performer to the extent it is not otherwise considered a literary or artistic work. 
15 | "You" means an individual or entity exercising rights under this License who has not previously violated the terms of this License with respect to the Work, or who has received express permission from the Licensor to exercise rights under this License despite a previous violation. 16 | "Publicly Perform" means to perform public recitations of the Work and to communicate to the public those public recitations, by any means or process, including by wire or wireless means or public digital performances; to make available to the public Works in such a way that members of the public may access these Works from a place and at a place individually chosen by them; to perform the Work to the public by any means or process and the communication to the public of the performances of the Work, including by public digital performance; to broadcast and rebroadcast the Work by any means including signs, sounds or images. 17 | "Reproduce" means to make copies of the Work by any means including without limitation by sound or visual recordings and the right of fixation and reproducing fixations of the Work, including storage of a protected performance or phonogram in digital form or other electronic medium. 18 | 2. Fair Dealing Rights. Nothing in this License is intended to reduce, limit, or restrict any uses free from copyright or rights arising from limitations or exceptions that are provided for in connection with the copyright protection under copyright law or other applicable laws. 19 | 20 | 3. License Grant. 
Subject to the terms and conditions of this License, Licensor hereby grants You a worldwide, royalty-free, non-exclusive, perpetual (for the duration of the applicable copyright) license to exercise the rights in the Work as stated below: 21 | 22 | to Reproduce the Work, to incorporate the Work into one or more Collections, and to Reproduce the Work as incorporated in the Collections; 23 | to create and Reproduce Adaptations provided that any such Adaptation, including any translation in any medium, takes reasonable steps to clearly label, demarcate or otherwise identify that changes were made to the original Work. For example, a translation could be marked "The original work was translated from English to Spanish," or a modification could indicate "The original work has been modified."; 24 | to Distribute and Publicly Perform the Work including as incorporated in Collections; and, 25 | to Distribute and Publicly Perform Adaptations. 26 | The above rights may be exercised in all media and formats whether now known or hereafter devised. The above rights include the right to make such modifications as are technically necessary to exercise the rights in other media and formats. Subject to Section 8(f), all rights not expressly granted by Licensor are hereby reserved, including but not limited to the rights set forth in Section 4(d). 27 | 28 | 4. Restrictions. The license granted in Section 3 above is expressly made subject to and limited by the following restrictions: 29 | 30 | You may Distribute or Publicly Perform the Work only under the terms of this License. You must include a copy of, or the Uniform Resource Identifier (URI) for, this License with every copy of the Work You Distribute or Publicly Perform. You may not offer or impose any terms on the Work that restrict the terms of this License or the ability of the recipient of the Work to exercise the rights granted to that recipient under the terms of the License. You may not sublicense the Work. 
You must keep intact all notices that refer to this License and to the disclaimer of warranties with every copy of the Work You Distribute or Publicly Perform. When You Distribute or Publicly Perform the Work, You may not impose any effective technological measures on the Work that restrict the ability of a recipient of the Work from You to exercise the rights granted to that recipient under the terms of the License. This Section 4(a) applies to the Work as incorporated in a Collection, but this does not require the Collection apart from the Work itself to be made subject to the terms of this License. If You create a Collection, upon notice from any Licensor You must, to the extent practicable, remove from the Collection any credit as required by Section 4(c), as requested. If You create an Adaptation, upon notice from any Licensor You must, to the extent practicable, remove from the Adaptation any credit as required by Section 4(c), as requested. 31 | You may not exercise any of the rights granted to You in Section 3 above in any manner that is primarily intended for or directed toward commercial advantage or private monetary compensation. The exchange of the Work for other copyrighted works by means of digital file-sharing or otherwise shall not be considered to be intended for or directed toward commercial advantage or private monetary compensation, provided there is no payment of any monetary compensation in connection with the exchange of copyrighted works. 
32 | If You Distribute, or Publicly Perform the Work or any Adaptations or Collections, You must, unless a request has been made pursuant to Section 4(a), keep intact all copyright notices for the Work and provide, reasonable to the medium or means You are utilizing: (i) the name of the Original Author (or pseudonym, if applicable) if supplied, and/or if the Original Author and/or Licensor designate another party or parties (e.g., a sponsor institute, publishing entity, journal) for attribution ("Attribution Parties") in Licensor's copyright notice, terms of service or by other reasonable means, the name of such party or parties; (ii) the title of the Work if supplied; (iii) to the extent reasonably practicable, the URI, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and, (iv) consistent with Section 3(b), in the case of an Adaptation, a credit identifying the use of the Work in the Adaptation (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). The credit required by this Section 4(c) may be implemented in any reasonable manner; provided, however, that in the case of a Adaptation or Collection, at a minimum such credit will appear, if a credit for all contributing authors of the Adaptation or Collection appears, then as part of these credits and in a manner at least as prominent as the credits for the other contributing authors. 
For the avoidance of doubt, You may only use the credit required by this Section for the purpose of attribution in the manner set out above and, by exercising Your rights under this License, You may not implicitly or explicitly assert or imply any connection with, sponsorship or endorsement by the Original Author, Licensor and/or Attribution Parties, as appropriate, of You or Your use of the Work, without the separate, express prior written permission of the Original Author, Licensor and/or Attribution Parties. 33 | For the avoidance of doubt: 34 | 35 | Non-waivable Compulsory License Schemes. In those jurisdictions in which the right to collect royalties through any statutory or compulsory licensing scheme cannot be waived, the Licensor reserves the exclusive right to collect such royalties for any exercise by You of the rights granted under this License; 36 | Waivable Compulsory License Schemes. In those jurisdictions in which the right to collect royalties through any statutory or compulsory licensing scheme can be waived, the Licensor reserves the exclusive right to collect such royalties for any exercise by You of the rights granted under this License if Your exercise of such rights is for a purpose or use which is otherwise than noncommercial as permitted under Section 4(b) and otherwise waives the right to collect royalties through any statutory or compulsory licensing scheme; and, 37 | Voluntary License Schemes. The Licensor reserves the right to collect royalties, whether individually or, in the event that the Licensor is a member of a collecting society that administers voluntary licensing schemes, via that society, from any exercise by You of the rights granted under this License that is for a purpose or use which is otherwise than noncommercial as permitted under Section 4(c). 
38 | Except as otherwise agreed in writing by the Licensor or as may be otherwise permitted by applicable law, if You Reproduce, Distribute or Publicly Perform the Work either by itself or as part of any Adaptations or Collections, You must not distort, mutilate, modify or take other derogatory action in relation to the Work which would be prejudicial to the Original Author's honor or reputation. Licensor agrees that in those jurisdictions (e.g. Japan), in which any exercise of the right granted in Section 3(b) of this License (the right to make Adaptations) would be deemed to be a distortion, mutilation, modification or other derogatory action prejudicial to the Original Author's honor and reputation, the Licensor will waive or not assert, as appropriate, this Section, to the fullest extent permitted by the applicable national law, to enable You to reasonably exercise Your right under Section 3(b) of this License (right to make Adaptations) but not otherwise. 39 | 5. Representations, Warranties and Disclaimer 40 | 41 | UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU. 42 | 43 | 6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 44 | 45 | 7. 
Termination 46 | 47 | This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License. Individuals or entities who have received Adaptations or Collections from You under this License, however, will not have their licenses terminated provided such individuals or entities remain in full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License. 48 | Subject to the above terms and conditions, the license granted here is perpetual (for the duration of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to release the Work under different license terms or to stop distributing the Work at any time; provided, however that any such election will not serve to withdraw this License (or any other license that has been, or is required to be, granted under the terms of this License), and this License will continue in full force and effect unless terminated as stated above. 49 | 8. Miscellaneous 50 | 51 | Each time You Distribute or Publicly Perform the Work or a Collection, the Licensor offers to the recipient a license to the Work on the same terms and conditions as the license granted to You under this License. 52 | Each time You Distribute or Publicly Perform an Adaptation, Licensor offers to the recipient a license to the original Work on the same terms and conditions as the license granted to You under this License. 53 | If any provision of this License is invalid or unenforceable under applicable law, it shall not affect the validity or enforceability of the remainder of the terms of this License, and without further action by the parties to this agreement, such provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable. 
54 | No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent. 55 | This License constitutes the entire agreement between the parties with respect to the Work licensed here. There are no understandings, agreements or representations with respect to the Work not specified here. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This License may not be modified without the mutual written agreement of the Licensor and You. 56 | The rights granted under, and the subject matter referenced, in this License were drafted utilizing the terminology of the Berne Convention for the Protection of Literary and Artistic Works (as amended on September 28, 1979), the Rome Convention of 1961, the WIPO Copyright Treaty of 1996, the WIPO Performances and Phonograms Treaty of 1996 and the Universal Copyright Convention (as revised on July 24, 1971). These rights and subject matter take effect in the relevant jurisdiction in which the License terms are sought to be enforced according to the corresponding provisions of the implementation of those treaty provisions in the applicable national law. If the standard suite of rights granted under applicable copyright law includes additional rights not granted under this License, such additional rights are deemed to be included in the License; this License is not intended to restrict the license of any rights under applicable law. 57 | Creative Commons Notice 58 | 59 | Creative Commons is not a party to this License, and makes no warranty whatsoever in connection with the Work. Creative Commons will not be liable to You or any party on any legal theory for any damages whatsoever, including without limitation any general, special, incidental or consequential damages arising in connection to this license. 
Notwithstanding the foregoing two (2) sentences, if Creative Commons has expressly identified itself as the Licensor hereunder, it shall have all rights and obligations of Licensor. 60 | 61 | Except for the limited purpose of indicating to the public that the Work is licensed under the CCPL, Creative Commons does not authorize the use by either party of the trademark "Creative Commons" or any related trademark or logo of Creative Commons without the prior written consent of Creative Commons. Any permitted use will be in compliance with Creative Commons' then-current trademark usage guidelines, as may be published on its website or otherwise made available upon request from time to time. For the avoidance of doubt, this trademark restriction does not form part of the License. 62 | 63 | Creative Commons may be contacted at http://creativecommons.org/. 64 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Databricks Spark 知识库 2 | ===================================== 3 | 4 | * [最佳实践](best_practices/README.md) 5 | * [避免使用 GroupByKey](best_practices/prefer_reducebykey_over_groupbykey.md) 6 | * [不要将大型 RDD 的所有元素拷贝到请求驱动者](best_practices/dont_call_collect_on_a_very_large_rdd.md) 7 | * [常规故障处理](troubleshooting/README.md) 8 | * [Job aborted due to stage failure: Task not serializable](troubleshooting/java_io_not_serializable_exception.md) 9 | * [缺失依赖](troubleshooting/missing_dependencies_in_jar_files.md) 10 | * [执行 start-all.sh 错误 - Connection refused](troubleshooting/port_22_connection_refused.md) 11 | * [Spark 组件之间的网络连接问题](troubleshooting/connectivity_issues.md) 12 | * [性能 & 优化](performance_optimization/README.md) 13 | * [一个 RDD 有多少个分区](performance_optimization/how_many_partitions_does_an_rdd_have.md) 14 | * [数据本地性](performance_optimization/data_locality.md) 15 | * [Spark Streaming](spark_streaming/README.md) 16 | * [ERROR 
OneForOneStrategy](spark_streaming/error_oneforonestrategy.md) 17 | 18 | ## Copyright 19 | 20 | 本文翻译自: http://databricks.gitbooks.io/databricks-spark-knowledge-base/ 著作权归原作者所有。 21 | 22 | ## License 23 | 24 | 此内容使用的授权许可请查看[这里](LICENSE)。 25 | -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | * [Introduction](README.md) 4 | * [最佳实践](best_practices/README.md) 5 | * [避免使用 GroupByKey](best_practices/prefer_reducebykey_over_groupbykey.md) 6 | * [不要将大型 RDD 的所有元素拷贝到请求驱动者](best_practices/dont_call_collect_on_a_very_large_rdd.md) 7 | * [常规故障处理](troubleshooting/README.md) 8 | * [Job aborted due to stage failure: Task not serializable](troubleshooting/java_io_not_serializable_exception.md) 9 | * [缺失依赖](troubleshooting/missing_dependencies_in_jar_files.md) 10 | * [执行 start-all.sh 错误 - Connection refused](troubleshooting/port_22_connection_refused.md) 11 | * [Spark 组件之间的网络连接问题](troubleshooting/connectivity_issues.md) 12 | * [性能 & 优化](performance_optimization/README.md) 13 | * [一个 RDD 有多少个分区](performance_optimization/how_many_partitions_does_an_rdd_have.md) 14 | * [数据本地性](performance_optimization/data_locality.md) 15 | * [Spark Streaming](spark_streaming/README.md) 16 | * [ERROR OneForOneStrategy](spark_streaming/error_oneforonestrategy.md) 17 | -------------------------------------------------------------------------------- /best_practices/README.md: -------------------------------------------------------------------------------- 1 | # 最佳实践 2 | 3 | - [避免使用 GroupByKey](prefer_reducebykey_over_groupbykey.md) 4 | - [勿在大型 RDD 上直接调用 collect](dont_call_collect_on_a_very_large_rdd.md) -------------------------------------------------------------------------------- /best_practices/dont_call_collect_on_a_very_large_rdd.md: -------------------------------------------------------------------------------- 1 | # 不要将大型 RDD 的所有元素拷贝到请求驱动者 2 | 3 | 
如果你的驱动机器(提交 submit 请求的机器)的内存容纳不下一个大型 RDD 的所有数据,不要做以下操作:
 4 | 
 5 | ```scala
 6 | val values = myVeryLargeRDD.collect()
 7 | ```
 8 | 
 9 | Collect 操作会试图把 RDD 里的每一条数据都复制到驱动机器(提交 submit 请求的机器)上,这时驱动程序会发生内存溢出并崩溃。
10 | 
11 | 相反,你可以调用 `take` 或者 `takeSample` 来限制返回数据量的上限,或者先对 RDD 进行过滤或抽样。
12 | 
13 | 同样,除非你能确定结果数据集小到足以放进内存,否则要谨慎使用下面的操作:
14 | 
15 | - `countByKey`
16 | - `countByValue`
17 | - `collectAsMap`
18 | 
19 | 如果你确实需要保留 RDD 里的大量数据,可以把 RDD 写成一个文件,或者把它导出到一个容量足够大的数据库中。
20 | 
21 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html)
22 | 
--------------------------------------------------------------------------------
/best_practices/prefer_reducebykey_over_groupbykey.md:
--------------------------------------------------------------------------------
 1 | # 避免使用 GroupByKey
 2 | 
 3 | 让我们看一下用两种不同的方式计算单词个数:第一种方式使用 `reduceByKey`,另一种方式使用 `groupByKey`:
 4 | 
 5 | ```scala
 6 | val words = Array("one", "two", "two", "three", "three", "three")
 7 | val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
 8 | 
 9 | val wordCountsWithReduce = wordPairsRDD
10 |   .reduceByKey(_ + _)
11 |   .collect()
12 | 
13 | val wordCountsWithGroup = wordPairsRDD
14 |   .groupByKey()
15 |   .map(t => (t._1, t._2.sum))
16 |   .collect()
17 | ```
18 | 
19 | 虽然两个函数都能得出正确的结果,但 `reduceByKey` 更适合用在大数据集上。
20 | 这是因为 Spark 在搬移数据之前,可以先在每个分区上按相同的 key 合并输出。
21 | 
22 | 借助下图可以理解 `reduceByKey` 里发生了什么。注意在数据对被搬移之前,同一台机器上相同 key 的数据是如何先被组合的(使用的正是传给 `reduceByKey` 的 lambda 函数);之后该 lambda 函数会在每个分区上被再次调用,把所有值 reduce 成一个最终结果。
23 | 
24 | ![](../images/reduce_by.png)
25 | 
26 | 另一方面,当调用 `groupByKey` 时,所有的键值对(key-value pair)都会被搬移,这会带来大量不必要的网络传输。
27 | 
28 | 为了确定把数据对搬到哪台主机,Spark 会对数据对的 key 调用一个分区函数。
29 | 当搬移的数据量大于单台执行机器的内存总量时,Spark 会把数据落到磁盘上。
30 | 不过落盘时每次只处理一个 key 的数据,所以当单个 key 的键值对超出内存容量时,仍会发生内存溢出异常。
31 | 后续发行的 Spark 版本会更优雅地处理这种情况,让作业仍能继续运行。
32 | 尽管如此,仍应尽量避免把数据落到磁盘上,这会严重影响性能。
33 | 
34 | ![](../images/group_by.png)
35 | 
36 | 你可以想象一个非常大的数据集,在使用 `reduceByKey` 和 `groupByKey`
时它们的差别会被放大更多倍。
37 | 
38 | 以下函数应该优先于 `groupByKey` 使用:
39 | 
40 | - `combineByKey` 用于组合元素,组合后的结果类型可以与输入值的类型不同。
41 | - `foldByKey` 使用一个结合函数和一个“零值”来合并每个 key 的所有值。
42 | 
43 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)
44 | 
--------------------------------------------------------------------------------
/cover.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aiyanbo/databricks-spark-knowledge-base-zh-cn/66888d852cb410d9a6c97097978ec6e561070785/cover.jpg
--------------------------------------------------------------------------------
/images/cached-partitions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aiyanbo/databricks-spark-knowledge-base-zh-cn/66888d852cb410d9a6c97097978ec6e561070785/images/cached-partitions.png
--------------------------------------------------------------------------------
/images/group_by.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aiyanbo/databricks-spark-knowledge-base-zh-cn/66888d852cb410d9a6c97097978ec6e561070785/images/group_by.png
--------------------------------------------------------------------------------
/images/locality.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aiyanbo/databricks-spark-knowledge-base-zh-cn/66888d852cb410d9a6c97097978ec6e561070785/images/locality.png
--------------------------------------------------------------------------------
/images/partitions-as-tasks.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aiyanbo/databricks-spark-knowledge-base-zh-cn/66888d852cb410d9a6c97097978ec6e561070785/images/partitions-as-tasks.png
--------------------------------------------------------------------------------
/images/reduce_by.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aiyanbo/databricks-spark-knowledge-base-zh-cn/66888d852cb410d9a6c97097978ec6e561070785/images/reduce_by.png
--------------------------------------------------------------------------------
/performance_optimization/README.md:
--------------------------------------------------------------------------------
 1 | # 性能优化
 2 | 
 3 | - [一个 RDD 有多少个分区](how_many_partitions_does_an_rdd_have.md)
 4 | - [数据本地性](data_locality.md)
 5 | 
--------------------------------------------------------------------------------
/performance_optimization/data_locality.md:
--------------------------------------------------------------------------------
 1 | # 数据本地性
 2 | 
 3 | Spark 是一个并行数据处理框架,这意味着任务应该在离数据尽可能近的地方执行(即最少的数据传输)。
 4 | 
 5 | ## 检查本地性
 6 | 
 7 | 检查任务是否在本地运行的最好方式,是在 Spark UI 上查看 stage 信息。注意下面截图中的 "Locality Level" 列,它显示了任务是在哪个本地性级别上运行的。
 8 | 
 9 | ![](../images/locality.png)
10 | 
11 | ## 调整本地性配置
12 | 
13 | 你可以调整 Spark 在每个数据本地性级别(data local --> process local --> node local --> rack local --> Any)上等待的时长。更多详细的参数信息,请查看[程序配置文档的 Scheduling 章节](http://spark.apache.org/docs/latest/configuration.html#scheduling)里 `spark.locality.*` 一类的配置。
14 | 
15 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html)
16 | 
--------------------------------------------------------------------------------
/performance_optimization/how_many_partitions_does_an_rdd_have.md:
--------------------------------------------------------------------------------
 1 | # 一个 RDD 有多少个分区?
 2 | 
 3 | 在调试和故障处理的时候,我们通常有必要知道 RDD 有多少个分区。这里有几个方法可以找到这些信息:
 4 | 
 5 | ## 使用 UI 查看在分区上执行的任务数
 6 | 
 7 | 当 stage 执行的时候,你可以在 Spark UI 上看到这个 stage 的分区数。下面的例子中,一个简单的任务在 4 个分区上创建了共 100 个元素的 RDD,然后在把这些元素收集到 driver 之前,先对它们执行一次 map 操作:
 8 | 
 9 | ```scala
10 | scala> val someRDD = sc.parallelize(1 to 100, 4)
11 | someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
12 | 
13 | scala> someRDD.map(x => x).collect
14 | res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
15 | ```
16 | 
17 | 在 Spark 的应用 UI 里,下面截图上的 "Total Tasks" 就代表了分区数。
18 | 
19 | ![](../images/partitions-as-tasks.png)
20 | 
21 | ## 使用 UI 查看分区缓存
22 | 
23 | 持久化(即缓存)RDD 时,通常需要知道有多少个分区被存储。下面的例子和之前的一样,不同之处在于这次我们会缓存这个 RDD。操作完成之后,我们可以在 UI 上看到这次操作缓存了哪些内容。
24 | 
25 | ```scala
26 | scala> someRDD.setName("toy").cache
27 | res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12
28 | 
29 | scala> someRDD.map(x => x).collect
30 | res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
31 | ```
32 | 
33 | 注意:下面的截图显示有 4 个分区被缓存。
34 | 
35 | ![](../images/cached-partitions.png)
36 | 
37 | ## 编程查看 RDD 分区
38 | 
39 | 在 Scala API 里,RDD 持有一个对其分区数组的引用,你可以用它找到分区数:
40 | 
41 | ```scala
42 | scala> val someRDD = sc.parallelize(1 to 100, 30)
43 | someRDD: org.apache.spark.rdd.RDD[Int] =
ParallelCollectionRDD[0] at parallelize at <console>:12
44 | 
45 | scala> someRDD.partitions.size
46 | res0: Int = 30
47 | ```
48 | 
49 | 在 Python API 里,有一个方法可以直接查询分区数:
50 | 
51 | ```python
52 | In [1]: someRDD = sc.parallelize(range(101), 30)
53 | 
54 | In [2]: someRDD.getNumPartitions()
55 | Out[2]: 30
56 | ```
57 | 
58 | 注意:上面的例子中,是故意把分区数初始化成 30 的。
59 | 
60 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html)
61 | 
--------------------------------------------------------------------------------
/spark_streaming/README.md:
--------------------------------------------------------------------------------
 1 | # Spark Streaming
 2 | 
 3 | - [ERROR OneForOneStrategy](error_oneforonestrategy.md)
--------------------------------------------------------------------------------
/spark_streaming/error_oneforonestrategy.md:
--------------------------------------------------------------------------------
 1 | # ERROR OneForOneStrategy
 2 | 
 3 | 如果你在 Spark Streaming 里启用了 checkpointing,那么传给 `foreachRDD` 的函数里用到的对象都必须可以被序列化(Serializable),否则会出现 "ERROR OneForOneStrategy: ... java.io.NotSerializableException:" 这样的异常:
 4 | 
 5 | ```java
 6 | JavaStreamingContext jssc = new JavaStreamingContext(sc, INTERVAL);
 7 | 
 8 | // This enables checkpointing.
 9 | jssc.checkpoint("/tmp/checkpoint_test");
10 | 
11 | JavaDStream<String> dStream = jssc.socketTextStream("localhost", 9999);
12 | 
13 | NotSerializable notSerializable = new NotSerializable();
14 | dStream.foreachRDD(rdd -> {
15 |     if (rdd.count() == 0) {
16 |         return null;
17 |     }
18 |     String first = rdd.first();
19 | 
20 |     notSerializable.doSomething(first);
21 |     return null;
22 | }
23 | );
24 | 
25 | // This does not work!!!!
26 | ``` 27 | 28 | 按照下面的方式之一进行修改,上面的代码才能正常运行: 29 | 30 | - 在配置文件里面删除 `jssc.checkpoint` 这一行关闭 checkpointing。 31 | - 让对象能被序列化。 32 | - 在 forEachRDD 函数里面声明 NotSerializable,下面的示例代码是可以正常运行的: 33 | 34 | ```scala 35 | JavaStreamingContext jssc = new JavaStreamingContext(sc, INTERVAL); 36 | 37 | jssc.checkpoint("/tmp/checkpoint_test"); 38 | 39 | JavaDStream dStream = jssc.socketTextStream("localhost", 9999); 40 | 41 | dStream.foreachRDD(rdd -> { 42 | if (rdd.count() == 0) { 43 | return null; 44 | } 45 | String first = rdd.first(); 46 | NotSerializable notSerializable = new NotSerializable(); 47 | notSerializable.doSomething(first); 48 | return null; 49 | } 50 | ); 51 | 52 | // This code snippet is fine since the NotSerializable object 53 | // is declared and only used within the forEachRDD function. 54 | ``` 55 | 56 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/spark_streaming/error_oneforonestrategy.html) 57 | -------------------------------------------------------------------------------- /troubleshooting/README.md: -------------------------------------------------------------------------------- 1 | # 常规故障处理 2 | 3 | - [Job aborted due to stage failure: Task not serializable](java_io_not_serializable_exception.md) 4 | - [缺失依赖](missing_dependencies_in_jar_files.md) 5 | - [执行 start-all.sh 错误 - Connection refused](port_22_connection_refused.md) 6 | - [Spark 组件之间的网络连接问题](connectivity_issues.md) 7 | -------------------------------------------------------------------------------- /troubleshooting/connectivity_issues.md: -------------------------------------------------------------------------------- 1 | # Spark 组件之间的网络连接问题 2 | 3 | Spark 组件之间的网络连接问题会导致各式各样的警告/错误: 4 | 5 | - **SparkContext <-> Spark Standalone Master:** 6 | 7 | 如果 SparkContext 不能连接到 Spark standalone master,会显示下面的错误 8 | 9 | ``` 10 | ERROR AppClient$ClientActor: All masters are unresponsive! Giving up. 11 | ERROR SparkDeploySchedulerBackend: Spark cluster looks dead, giving up. 
12 | ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Spark cluster looks down 13 | ``` 14 | 15 | 如果 driver 能够连接到 master 但是 master 不能回连到 driver上,这时 Master 的日志会记录多次尝试连接 driver 失败并且会报告不能连接: 16 | 17 | ``` 18 | INFO Master: Registering app SparkPi 19 | INFO Master: Registered app SparkPi with ID app-XXX-0000 20 | INFO: Master: Removing app app-app-XXX-0000 21 | [...] 22 | INFO Master: Registering app SparkPi 23 | INFO Master: Registered app SparkPi with ID app-YYY-0000 24 | INFO: Master: Removing app app-YYY-0000 25 | [...] 26 | ``` 27 | 28 | 在这样的情况下,master 报告应用已经被成功地注册了。但是注册成功的通知 driver 接收失败了, 这时 driver 会自动尝试几次重新连接直到失败的次数太多而放弃重试。其结果是 Master web UI 会报告多个失败的应用,即使只有一个 SparkContext 被创建。 29 | 30 | ## 建议 31 | 32 | 如果你遇到上述的任何错误: 33 | 34 | - 检查 workers 和 drivers 配置的 Spark master 的地址就是在 Spark master web UI/日志中列出的那个地址。 35 | - 设置 driver,master,worker 的 `SPARK_LOCAL_IP` 为集群的可寻地址主机名。 36 | 37 | ## 配置 hostname/port 38 | 39 | 这节将描述我们如何绑定 Spark 组件的网络接口和端口。 40 | 41 | 在每节里,配置会按照优先级降序的方式排列。如果前面所有配置没有提供则使用最后一条作为默认配置。 42 | 43 | ### SparkContext actor system: 44 | 45 | **Hostname:** 46 | 47 | - `spark.driver.host` 属性 48 | - 如果 `SPARK_LOCAL_IP` 环境变量的设置是主机名(hostname),就会使用设置时的主机名。如果 `SPARK_LOCAL_IP` 设置的是一个 IP 地址,这个 IP 地址会被解析为主机名。 49 | - 使用默认的 IP 地址,这个 IP 地址是Java 接口 `InetAddress.getLocalHost` 方法的返回值。 50 | 51 | **Port:** 52 | 53 | - `spark.driver.port` 属性。 54 | - 从操作系统(OS)选择一个临时端口。 55 | 56 | ### Spark Standalone Master / Worker actor systems: 57 | 58 | **Hostname:** 59 | 60 | - 当 `Master` 或 `Worker` 进程启动时使用 `--host` 或 `-h` 选项(或是过期的选项 `--ip` 或 `-i`)。 61 | - `SPARK_MASTER_HOST` 环境变量(仅应用在 `Master` 上)。 62 | - 如果 `SPARK_LOCAL_IP` 环境变量的设置是主机名(hostname),就会使用设置时的主机名。如果 `SPARK_LOCAL_IP` 设置的是一个 IP 地址,这个 IP 地址会被解析为主机名。 63 | - 使用默认的 IP 地址,这个 IP 地址是Java 接口 `InetAddress.getLocalHost` 方法的返回值。 64 | 65 | **Port:** 66 | 67 | - 当 `Master` 或 `Worker` 进程启动时使用 `--port` 或 `-p` 选项。 68 | - `SPARK_MASTER_PORT` 或 `SPARK_WORKER_PORT` 环境变量(分别应用到 `Master` 和 `Worker` 上)。 69 | - 从操作系统(OS)选择一个临时端口。 70 | 71 
| [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html) 72 | -------------------------------------------------------------------------------- /troubleshooting/java_io_not_serializable_exception.md: -------------------------------------------------------------------------------- 1 | # Job aborted due to stage failure: Task not serializable: 2 | 3 | 如果你能看到以下错误: 4 | 5 | ``` 6 | org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: ... 7 | ``` 8 | 9 | 上述的错误在这个时候会被触发:当你在 master 上初始化一个变量,但是试图在 worker 上使用。在这个示例中, Spark Streaming 试图将对象序列化之后发送到 worker 上,如果这个对象不能被序列化就会失败。思考下面的代码片段: 10 | 11 | ```java 12 | NotSerializable notSerializable = new NotSerializable(); 13 | JavaRDD rdd = sc.textFile("/tmp/myfile"); 14 | 15 | rdd.map(s -> notSerializable.doSomething(s)).collect(); 16 | ``` 17 | 18 | 这段代码会触发那个错误。这里有一些建议修复这个错误: 19 | 20 | - 让 class 实现序列化 21 | - 在作为参数传递给 map 方法的 lambda 表达式内部声明实例 22 | - 在每一台机器上创建一个 NotSerializable 的静态实例 23 | - 调用 `rdd.forEachPartition` 并且像下面这样创建 NotSerializable 对象: 24 | 25 | ```java 26 | rdd.forEachPartition(iter -> { 27 | NotSerializable notSerializable = new NotSerializable(); 28 | 29 | // ...Now process iter 30 | }); 31 | ``` 32 | 33 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html) 34 | -------------------------------------------------------------------------------- /troubleshooting/missing_dependencies_in_jar_files.md: -------------------------------------------------------------------------------- 1 | # 缺失依赖 2 | 3 | 在默认状态下,Maven 在 build 的时候不会包含所依赖的 jar 包。当运行一个 Spark 任务,如果 Spark worker 机器上没有包含所依赖的 jar 包会发生类无法找到的错误(`ClassNotFoundException`)。 4 | 5 | 有一个简单的方式,在 Maven 打包的时候创建 _shaded_ 或 _uber_ 任务可以让那些依赖的 jar 包很好地打包进去。 6 | 7 | 使用 `provided` 可以排除那些没有必要打包进去的依赖,对 Spark 的依赖必须使用 `provided` 标记,因为这些依赖已经包含在 Spark cluster中。在你的 worker 机器上已经安装的 jar 
包你同样需要排除掉它们。 8 | 9 | 下面是一个 Maven pom.xml 的例子,工程了包含了一些需要的依赖,但是 Spark 的 libraries 不会被打包进去,因为它使用了 `provided`: 10 | 11 | ```xml 12 | 13 | com.databricks.apps.logs 14 | log-analyzer 15 | 4.0.0 16 | Databricks Spark Logs Analyzer 17 | jar 18 | 1.0 19 | 20 | 21 | Akka repository 22 | http://repo.akka.io/releases 23 | 24 | 25 | 26 | 27 | org.apache.spark 28 | spark-core_2.10 29 | 1.1.0 30 | provided 31 | 32 | 33 | org.apache.spark 34 | spark-sql_2.10 35 | 1.1.0 36 | provided 37 | 38 | 39 | org.apache.spark 40 | spark-streaming_2.10 41 | 1.1.0 42 | provided 43 | 44 | 45 | commons-cli 46 | commons-cli 47 | 1.2 48 | 49 | 50 | 51 | 52 | 53 | org.apache.maven.plugins 54 | maven-compiler-plugin 55 | 2.3.2 56 | 57 | 1.8 58 | 1.8 59 | 60 | 61 | 62 | org.apache.maven.plugins 63 | maven-shade-plugin 64 | 2.3 65 | 66 | 67 | package 68 | 69 | shade 70 | 71 | 72 | 73 | 74 | 75 | 76 | *:* 77 | 78 | META-INF/*.SF 79 | META-INF/*.DSA 80 | META-INF/*.RSA 81 | 82 | 83 | 84 | uber-${project.artifactId}-${project.version} 85 | 86 | 87 | 88 | 89 | 90 | ``` 91 | 92 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html) -------------------------------------------------------------------------------- /troubleshooting/port_22_connection_refused.md: -------------------------------------------------------------------------------- 1 | # 执行 start-all.sh 错误: Connection refused 2 | 3 | 如果是使用 Mac 操作系统运行 start-all.sh 发生下面错误时: 4 | 5 | ``` 6 | % sh start-all.sh 7 | starting org.apache.spark.deploy.master.Master, logging to ... 8 | localhost: ssh: connect to host localhost port 22: Connection refused 9 | ``` 10 | 11 | 你需要在你的电脑上打开 “远程登录” 功能。进入 `系统偏好设置` ---> `共享` 勾选打开 `远程登录`。 12 | 13 | [阅读原文](http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/port_22_connection_refused.html) --------------------------------------------------------------------------------
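The `partitions.size` / `getNumPartitions()` examples in `performance_optimization/how_many_partitions_does_an_rdd_have.md` above show *how many* partitions `parallelize` creates, and the UI shows one task per partition. To make the partitions-as-tasks idea concrete without a cluster, here is a plain-Python sketch of slicing a collection into N roughly equal partitions. This is only an approximation for illustration: Spark's actual `ParallelCollectionRDD` slicing logic (which special-cases ranges) is an assumption here, and `slice_collection` is a hypothetical helper, not a Spark API.

```python
def slice_collection(seq, num_partitions):
    """Split seq into num_partitions contiguous slices of near-equal size.

    Approximates how a driver-side collection becomes one task per
    partition: partition i covers [i*n//p, (i+1)*n//p).
    """
    n = len(seq)
    return [seq[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
            for i in range(num_partitions)]

# Mirrors sc.parallelize(1 to 100, 4): 4 partitions -> "Total Tasks: 4" in the UI.
parts = slice_collection(list(range(1, 101)), 4)
print(len(parts))                # 4
print([len(p) for p in parts])   # [25, 25, 25, 25]

# Mirrors sc.parallelize(1 to 100, 30): uneven sizes, but all 100 elements covered.
parts30 = slice_collection(list(range(1, 101)), 30)
print(len(parts30))              # 30
```

Since each partition is processed by exactly one task, choosing the partition count directly controls the parallelism of the stage.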
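The `Task not serializable` failures described in `troubleshooting/java_io_not_serializable_exception.md` above are not specific to Spark: any framework that ships work to remote processes must serialize the state that work refers to. The stand-alone Python sketch below (an illustration using the standard `pickle` module, not Spark; `NotSerializable`, `ship`, and `worker` are hypothetical names) shows the driver-side failure and the fix of constructing the object where it is used, as with `rdd.foreachPartition`.

```python
import pickle
import threading


class NotSerializable:
    """Stands in for a driver-side object holding an unpicklable resource
    (here a thread lock; in practice a socket, DB connection, file handle)."""

    def __init__(self):
        self.lock = threading.Lock()  # _thread.lock objects cannot be pickled

    def do_something(self, s):
        return s.upper()


def ship(task_state):
    """Simulates serializing task state to send to a remote worker."""
    return pickle.dumps(task_state)


# Anti-pattern: the driver-side instance becomes part of the shipped state,
# which is the analog of java.io.NotSerializableException.
obj = NotSerializable()
try:
    ship(("process", obj))
    shipped = True
except TypeError:
    shipped = False
print("driver-side instance shipped:", shipped)  # False


def worker(lines):
    """The fix: construct the resource on the worker side, per batch of work."""
    local = NotSerializable()
    return [local.do_something(s) for s in lines]


# Plain data round-trips through serialization without trouble.
data = pickle.loads(ship(("process", ["a", "b"])))
print(worker(data[1]))  # ['A', 'B']
```

This is the same reasoning behind the `foreachRDD` and `foreachPartition` fixes above: move construction of non-serializable objects inside the function that runs on the worker, so only the function's plain inputs need to cross the wire.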