├── .gitignore ├── Spark_With_Scala_Testing ├── data │ ├── test │ │ ├── _SUCCESS │ │ ├── ._SUCCESS.crc │ │ ├── part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv │ │ └── .part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv.crc │ ├── result │ │ ├── _SUCCESS │ │ ├── ._SUCCESS.crc │ │ ├── part-00000 │ │ └── .part-00000.crc │ ├── secondarySort.txt │ ├── people.csv │ ├── people.txt │ ├── hello.txt │ ├── people.json │ ├── spark.md │ ├── users.parquet │ ├── employees.json │ └── topN.txt ├── .gitignore ├── .cache-main ├── .settings │ └── org.scala-ide.sdt.core.prefs ├── .classpath ├── src │ ├── test │ │ └── Test .scala │ ├── sparkCore │ │ ├── MyKey.scala │ │ ├── SecondarySort.scala │ │ ├── Action.scala │ │ ├── Persist.scala │ │ ├── sortedWordCount.scala │ │ ├── TopN .scala │ │ └── Transformation.scala │ └── sparkSql │ │ ├── LoadAndSave.scala │ │ ├── SqlContextTest.scala │ │ ├── RDDtoDataFrame2.scala │ │ ├── RDDtoDataFrame.scala │ │ └── DataFrameOperations.scala └── .project ├── notes ├── assets │ ├── cache.png │ ├── source.jpg │ ├── 导入spark.png │ ├── 1550161394643.png │ ├── 1550670063721.png │ ├── 20190215002003.png │ └── cluster-client.png ├── LearningSpark(1)数据来源.md ├── RDD如何作为参数传给函数.md ├── eclipse中Attach Source找不到源码,该如何查看jar包源码.md ├── Spark DataFrame如何更改列column的类型.md ├── 判断RDD是否为空.md ├── 高级排序和topN问题.md ├── eclipse如何导入Spark源码方便阅读.md ├── LearningSpark(3)RDD操作.md ├── Spark2.4+Hive使用Hive现有仓库.md ├── LearningSpark(6)Spark内核架构剖析.md ├── LearningSpark(2)spark-submit可选参数.md ├── LearningSpark(4)Spark持久化操作.md ├── Scala排序函数使用.md ├── LearningSpark(8)RDD如何转化为DataFrame.md ├── 使用JDBC将DataFrame写入mysql.md ├── LearningSpark(7)SparkSQL之DataFrame学习.md ├── 报错和问题归纳.md ├── LearningSpark(5)Spark共享变量.md └── LearningSpark(9)SparkSQL数据来源.md ├── Pic └── DAGScheduler划分及提交stage-代码调用过程.jpg ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.class 2 | *.log 3 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/test/_SUCCESS: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/.gitignore: -------------------------------------------------------------------------------- 1 | /bin/ 2 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/result/_SUCCESS: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/result/._SUCCESS.crc: -------------------------------------------------------------------------------- 1 | crc -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/test/._SUCCESS.crc: -------------------------------------------------------------------------------- 1 | crc -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/secondarySort.txt: -------------------------------------------------------------------------------- 1 | 1 5 2 | 2 4 3 | 3 6 4 | 1 3 5 | 2 1 -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/result/part-00000: -------------------------------------------------------------------------------- 1 | (you,2) 2 | (jump,8) 3 | (i,3) 4 | (u,1) 5 | 
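(说明:data/result/part-00000 即对 data/hello.txt 做词频统计后 saveAsTextFile 的输出,对应下文 Action.scala 中注释掉的那段逻辑。下面给出一个最小可运行的示例草稿,其中 setMaster("local")、对象名 WordCountExample 以及相对路径均为本地测试的假设,按需调整。)

```scala
package sparkCore

import org.apache.spark.{SparkConf, SparkContext}

// 最小示例草稿:统计 data/hello.txt 中各单词出现次数,结果写入 data/result
// 与 Action.scala 中注释掉的写法一致;local 模式与相对路径仅为本地测试假设
object WordCountExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("wordCount")
    val sc = new SparkContext(conf)
    sc.textFile("data/hello.txt", 1)
      .flatMap(_.split(","))          // 按逗号切分出单词
      .map((_, 1))                    // 组成 (word, 1) 键值对
      .reduceByKey(_ + _)             // 按 key 聚合计数,得到 (you,2)、(jump,8) 等
      .saveAsTextFile("data/result")  // 只能指定输出目录,生成 part-00000 等文件
    sc.stop()
  }
}
```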
-------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/people.csv: -------------------------------------------------------------------------------- 1 | name;age;job 2 | Jorge;30;Developer 3 | Bob;32;Developer 4 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/people.txt: -------------------------------------------------------------------------------- 1 | name;age;job 2 | Jorge;30;Developer 3 | Bob;32;Developer 4 | -------------------------------------------------------------------------------- /notes/assets/cache.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/cache.png -------------------------------------------------------------------------------- /notes/assets/source.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/source.jpg -------------------------------------------------------------------------------- /notes/assets/导入spark.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/导入spark.png -------------------------------------------------------------------------------- /notes/assets/1550161394643.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/1550161394643.png -------------------------------------------------------------------------------- /notes/assets/1550670063721.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/1550670063721.png -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/hello.txt: -------------------------------------------------------------------------------- 1 | you,jump 2 | i,jump 3 | you,jump 4 | i,jump 5 | jump,jump,jump 6 | u,i,jump 7 | -------------------------------------------------------------------------------- /notes/assets/20190215002003.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/20190215002003.png -------------------------------------------------------------------------------- /notes/assets/cluster-client.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/cluster-client.png -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/people.json: -------------------------------------------------------------------------------- 1 | {"name":"Michael"} 2 | {"name":"Andy", "age":30} 3 | {"name":"Justin", "age":19} 4 | -------------------------------------------------------------------------------- /Pic/DAGScheduler划分及提交stage-代码调用过程.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Pic/DAGScheduler划分及提交stage-代码调用过程.jpg -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/.cache-main: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/.cache-main -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/spark.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/spark.md -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/test/part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv: -------------------------------------------------------------------------------- 1 | age;name 2 | "";Michael 3 | 30;Andy 4 | 19;Justin 5 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/users.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/users.parquet -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/result/.part-00000.crc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/result/.part-00000.crc -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/employees.json: -------------------------------------------------------------------------------- 1 | {"name":"Michael", "salary":3000} 2 | {"name":"Andy", "salary":4500} 3 | {"name":"Justin", "salary":3500} 4 | {"name":"Berta", "salary":4000} 5 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/test/.part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv.crc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/test/.part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv.crc -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/.settings/org.scala-ide.sdt.core.prefs: -------------------------------------------------------------------------------- 1 | eclipse.preferences.version=1 2 | scala.compiler.additionalParams=\ -Xsource\:2.11 -Ymacro-expand\:none 3 | scala.compiler.installation=2.11 4 | scala.compiler.sourceLevel=2.11 5 | scala.compiler.useProjectSettings=true 6 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/data/topN.txt: -------------------------------------------------------------------------------- 1 | t001 2067 2 | t002 2055 3 | t003 109 4 | t004 1200 5 | t005 3368 6 | t006 251 7 | t001 3067 8 | t002 255 9 | t003 19 10 | t004 2000 11 | t005 368 12 | t006 2512 13 | t006 2510 14 | t001 367 15 | t002 2155 16 | t005 338 17 | t006 1251 18 | t001 3667 19 | t002 1255 20 | t003 190 21 | t003 1090 -------------------------------------------------------------------------------- /notes/LearningSpark(1)数据来源.md: -------------------------------------------------------------------------------- 1 | ## 数据源自并行集合 2 | 调用 SparkContext 的 parallelize 方法,在一个已经存在的 Scala 集合上创建一个 Seq 对象 3 | 4 | ## 外部数据源 5 | Spark支持任何 `Hadoop InputFormat` 格式的输入,如本地文件、HDFS上的文件、Hive表、HBase上的数据、Amazon 
S3、Hypertable等,以上都可以用来创建RDD。 6 | 7 | 常用函数是 `sc.textFile()` ,参数是Path和最小分区数[可选]。Path是文件的 URI 地址,该地址可以是本地路径,或者 `hdfs://`、`s3n://` 等 URL 地址。其次,使用本地文件时,如果在集群上运行要确保worker节点也能访问到文件 -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/.classpath: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/test/Test .scala: -------------------------------------------------------------------------------- 1 | package test 2 | 3 | import org.apache.spark.{SparkConf,SparkContext} 4 | 5 | class Test { 6 | def main(args: Array[String]): Unit = { 7 | val conf = new SparkConf().setMaster("local").setAppName("test-tools") 8 | val sc = new SparkContext(conf) 9 | val rdd = sc.parallelize(List()) 10 | println("测试:" + rdd.count) 11 | 12 | if (sc.emptyRDD[(String, Int)].isEmpty()) { 13 | println("空") 14 | } 15 | } 16 | } -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/.project: -------------------------------------------------------------------------------- 1 | 2 | 3 | Spark_With_Scala_Testing 4 | 5 | 6 | 7 | 8 | 9 | org.scala-ide.sdt.core.scalabuilder 10 | 11 | 12 | 13 | 14 | 15 | org.scala-ide.sdt.core.scalanature 16 | org.eclipse.jdt.core.javanature 17 | 18 | 19 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkCore/MyKey.scala: -------------------------------------------------------------------------------- 1 | package sparkCore 2 | 3 | class MyKey(val first: Int, val second: Int) extends Ordered[MyKey] with Serializable { 4 | def compare(that:MyKey): Int = { 5 | //第一列升序,第二列升序 6 | /*if(first - that.first==0){ 7 | second - that.second 8 | }else { 9 | first - that.first 10 | }*/ 11 | //第一列降序,第二列升序 12 | if(first - that.first==0){ 13 | second - that.second 14 | }else { 15 | that.first - first 16 | } 17 | //参照MapReduce二次排序原理:https://github.com/josonle/MapReduce-Demo#%E8%A7%A3%E7%AD%94%E6%80%9D%E8%B7%AF-1 18 | } 19 | } -------------------------------------------------------------------------------- /notes/RDD如何作为参数传给函数.md: -------------------------------------------------------------------------------- 1 | ```scala 2 | //分组TopN 3 | def groupTopN(data: RDD[String],n:Int): RDD[(String,List[Int])] = { 4 | //先不考虑其他的 5 | //分组后类似 (t003,(19,1090,190,109)) 6 | val groupParis = data.map { x => 7 | (x.split(" ")(0), x.split(" ")(1).toInt) 8 | }.groupByKey() 9 | val sortedData = groupParis.map(x=> 10 | { 11 | //分组后,排序(默认升序),再倒序去前n 12 | val sortedLists = x._2.toList.sorted.reverse.take(n) 13 | (x._1,sortedLists) 14 | }) 15 | sortedData 16 | } 17 | ``` 18 | 19 | 关键是知道传入rdd的类型以及要返回值的类型,如果有编辑器就很简单,知道自己要返回什么,然后拿鼠标移到这个值上方就会显示它的类型 20 | 21 | 其次是要导入`import org.apache.spark.rdd._` -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkCore/SecondarySort.scala: -------------------------------------------------------------------------------- 1 | package sparkCore 2 | 3 | import org.apache.spark.SparkConf 4 | import org.apache.spark.SparkContext 5 | import org.apache.spark.SparkContext._ 6 | 7 | object SecondarySort { 8 | def main(args: Array[String]): Unit = { 9 | val conf = new SparkConf().setMaster("local").setAppName("sortedWordCount") 10 | val sc = new SparkContext(conf) 11 | val data = sc.textFile("data/secondarySort.txt", 1) 12 
| val keyWD = data.map(x => ( 13 | new MyKey(x.split(" ")(0).toInt, x.split(" ")(1).toInt), x)) 14 | val sortedWD = keyWD.sortByKey() 15 | sortedWD.map(_._2).foreach(println) 16 | } 17 | } -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkSql/LoadAndSave.scala: -------------------------------------------------------------------------------- 1 | package sparkSql 2 | 3 | import org.apache.spark.sql.SparkSession 4 | 5 | object LoadAndSave { 6 | def main(args: Array[String]): Unit = { 7 | val spark = SparkSession.builder().master("local").appName("load and save datas").getOrCreate() 8 | val df = spark.read.load("data/users.parquet") 9 | val df1 = spark.read.format("json").load("data/people.json") 10 | // df.printSchema() 11 | // df.show() 12 | // df1.show() 13 | // df1.select("name","age").write.format("csv").mode("overwrite").save("data/people") 14 | df1.write.option("header", true).option("sep", ";").csv("data/test") 15 | } 16 | } -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkSql/SqlContextTest.scala: -------------------------------------------------------------------------------- 1 | package sparkSql 2 | 3 | import org.apache.spark.SparkConf 4 | import org.apache.spark.SparkContext 5 | import org.apache.spark.sql.SQLContext 6 | 7 | object SqlContextTest { 8 | // Spark2.x中SQLContext已经被SparkSession代替,此处只为了解 9 | def main(args: Array[String]): Unit = { 10 | val conf = new SparkConf().setMaster("local").setAppName("test sqlContext") 11 | val sc = new SparkContext(conf) 12 | val sqlContext = new SQLContext(sc) 13 | // 读取spark项目中example中带的几个示例数据,创建DataFrame 14 | val people = sqlContext.read.format("json").load("data/people.json") 15 | // DataFrame即RDD+Schema(元数据信息) 16 | people.show() //打印DF 17 | people.printSchema() //打印DF结构 18 | } 19 | } -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkCore/Action.scala: -------------------------------------------------------------------------------- 1 | package sparkCore 2 | 3 | import org.apache.spark.{SparkConf,SparkContext} 4 | 5 | object Action { 6 | def main(args: Array[String]): Unit = { 7 | /*Action操作不作过多解释*/ 8 | val conf = new SparkConf().setMaster("local").setAppName("Action") 9 | val sc = new SparkContext(conf) 10 | 11 | /* val rdd = sc.parallelize((1 to 10), 1) 12 | println(rdd.take(3)) //返回数组 13 | println(rdd.reduce(_+_)) 14 | println(rdd.collect()) //返回数组 15 | 16 | val wc = sc.textFile("data/hello.txt", 1) 17 | wc.flatMap(_.split(",")).map((_,1)).reduceByKey(_+_) 18 | .saveAsTextFile("data/result") //只能指定保存的目录 19 | //countByKey见Transformation中求解平均成绩 20 | */ 21 | sc.stop() 22 | } 23 | } -------------------------------------------------------------------------------- /notes/eclipse中Attach Source找不到源码,该如何查看jar包源码.md: -------------------------------------------------------------------------------- 1 | 主要是在eclipse中不像idea一样查看Spark源码麻烦 2 | 3 | ### maven引入的jar 4 | 5 | a:自动下载 6 | 7 | eclipse勾选windows->Preferences->Maven->Download Artifact Sources 这个选项,然后右键项目maven->maven update project就可以 8 | 9 | b.手动下载 10 | 11 | 使用maven命令行下载依赖包的源代码: 12 | 13 | mvn dependency:sources mvn dependency 14 | 15 | mvn dependency:sources -DdownloadSources=true -DdownloadJavadocs=true 16 | -DdownloadSources=true 下载源代码Jar -DdownloadJavadocs=true 下载javadoc包 17 | 18 | 如果执行后还是没有下载到,可以到仓库上搜一下,下载下来放到本地仓库,在eclise里面设置下关联就可以了。 19 | 20 | ### 其他jar 21 | 22 | 下载下来或者反编译出来再手动关联 23 | a:方法一 24 | 25 | 
1. 按住Ctrl,用鼠标去点一些jar包里的方法,你可以选择跳转到implementation, 26 | 27 | 2. 到时候它会有一个attach to source的选项,你点击,然后选择source的压缩包,就关联好了。 28 | 29 | b.方法二 30 | 31 | 对jar包右击,选properties属性,进行关联 32 | ![source](assets/source.jpg) 33 | 34 | 转载自:https://blog.csdn.net/qq_21209681/article/details/72917837 -------------------------------------------------------------------------------- /notes/Spark DataFrame如何更改列column的类型.md: -------------------------------------------------------------------------------- 1 | 如下示例,通过最初json文件所生成的df的age列是Long类型,给它改成其他类型。当然不止如下两种方法,但我觉得这是最为简单的两种了 2 | 3 | ```scala 4 | val spark = SparkSession.builder().master("local").appName("DataFrame API").getOrCreate() 5 | 6 | // 读取spark项目中example中带的几个示例数据,创建DataFrame 7 | val people = spark.read.format("json").load("data/people.json") 8 | people.show() 9 | people.printSchema() 10 | 11 | val p = people.selectExpr("cast(age as string) age_toString","name") 12 | p.printSchema() 13 | 14 | import spark.implicits._ //导入这个为了隐式转换,或RDD转DataFrame之用 15 | import org.apache.spark.sql.types.DataTypes 16 | people withColumn("age", $"age".cast(DataTypes.IntegerType)) //DataTypes下有若干数据类型,记住类的位置 17 | people.printSchema() 18 | ``` 19 | 20 | 21 | 22 | 参考这个:[How to change column types in Spark SQL's DataFrame?](http://padown.com/questions/29383107/how-to-change-column-types-in-spark-sqls-dataframe) 23 | 24 | 25 | 26 | https://www.jianshu.com/p/0634527f3cce -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkCore/Persist.scala: -------------------------------------------------------------------------------- 1 | package sparkCore 2 | 3 | import org.apache.spark.SparkConf 4 | import org.apache.spark.SparkContext 5 | 6 | object Persist { 7 | def main(args:Array[String]): Unit = { 8 | val conf = new SparkConf().setMaster("local").setAppName("persist_cache") 9 | val sc = new SparkContext(conf) 10 | val rdd = sc.textFile("data/spark.md").cache() 11 | println("all length : "+rdd.map(_.length).reduce(_+_)) 12 | //2019-02-11 16:02:57 INFO DAGScheduler:54 - Job 0 finished: reduce at Persist.scala:11, took 0.391666 s 13 | println("all_length not None:"+rdd.flatMap(_.split(" ")).map(_.length).reduce(_+_)) 14 | //2019-02-11 16:02:58 INFO DAGScheduler:54 - Job 1 finished: reduce at Persist.scala:12, took 0.036668 s 15 | 16 | //不持久化 17 | //2019-02-11 16:05:50 INFO DAGScheduler:54 - Job 0 finished: reduce at Persist.scala:11, took 0.370967 s 18 | //2019-02-11 16:05:50 INFO DAGScheduler:54 - Job 1 finished: reduce at Persist.scala:13, took 0.050201 s 19 | } 20 | } -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 JosonLee 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkCore/sortedWordCount.scala: -------------------------------------------------------------------------------- 1 | package sparkCore 2 | 3 | import org.apache.spark.SparkConf 4 | import org.apache.spark.SparkContext 5 | import org.apache.spark.SparkContext._ 6 | import org.apache.spark.rdd._ 7 | 8 | object sortedWordCount { 9 | /*1、对文本文件内的每个单词都统计出其出现的次数。 10 | 2、按照每个单词出现次数的数量,降序排序。 11 | */ 12 | def main(args: Array[String]): Unit = { 13 | val conf = new SparkConf().setMaster("local").setAppName("sortedWordCount") 14 | val sc = new SparkContext(conf) 15 | val data = sc.textFile("data/spark.md", 1) 16 | data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false) 17 | .foreach(x ⇒ { 18 | println(x._1 + "出现" + x._2 + "次") 19 | }) 20 | //另一种方法,还不如sortBy 21 | anotherSolution(data).foreach(x ⇒ { 22 | println(x._1 + "出现" + x._2 + "次") 23 | }) 24 | } 25 | 26 | def anotherSolution(data: RDD[String]): RDD[(String, Int)] = { 27 | //传入的rdd是所读取的文件rdd 28 | val wc = data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) 29 | val countWords = wc.map(x => (x._2, x._1)).sortByKey(false) 30 | val sortedWC = countWords.map(x => (x._2, x._1)) 31 | 32 | sortedWC 33 | } 34 | } -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkSql/RDDtoDataFrame2.scala: -------------------------------------------------------------------------------- 1 | package sparkSql 2 | 3 | import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType} 4 | import org.apache.spark.sql.Row 5 | import org.apache.spark.sql.SparkSession 6 | 7 | object RDDtoDataFrame2 { 8 | def main(args: Array[String]): Unit = { 9 | val spark = SparkSession.builder().master("local").appName("rdd to dataFrame").getOrCreate() 10 | 11 | import spark.implicits._ 12 | //toDF需要导入隐式转换 13 | val rdd = spark.sparkContext.textFile("data/people.txt").cache() 14 | val header = rdd.first() 15 | val personRDD = rdd.filter(row => row != header) //去头 16 | .map { line => Row(line.split(";")(0).toString, line.split(";")(1).toInt, line.split(";")(2).toString) } 17 | 18 | val structType = StructType(Array( 19 | StructField("name", StringType, true), 20 | StructField("age", IntegerType, true), 21 | StructField("job", StringType, true))) 22 | 23 | val rddToDF = spark.createDataFrame(personRDD, structType) 24 | rddToDF.createOrReplaceTempView("people") 25 | 26 | val results = spark.sql("SELECT name FROM people") 27 | 28 | results.map(attributes => "Name: " + attributes(0)).show() 29 | } 30 | } -------------------------------------------------------------------------------- /notes/判断RDD是否为空.md: -------------------------------------------------------------------------------- 1 | ### 如何创建空RDD? 2 | 3 | 空RDD用处暂且不知 4 | 5 | `sc.parallelize(List()) //或seq()` 6 | 7 | 或者,spark定义了一个 emptyRDD 8 | ```scala 9 | /** 10 | * An RDD that has no partitions and no elements. 
11 | */ 12 | private[spark] class EmptyRDD[T: ClassTag](sc: SparkContext) extends RDD[T](sc, Nil) { 13 | 14 | override def getPartitions: Array[Partition] = Array.empty 15 | override def compute(split: Partition, context: TaskContext): Iterator[T] = { 16 | throw new UnsupportedOperationException("empty RDD") 17 | } 18 | 19 | //可以通过 sc.emptyRDD[T] 创建 20 | ``` 21 | ### rdd.count == 0 和 rdd.isEmpty 22 | 我第一想到的方法要么通过count算子判断个数是否为0,要么直接通过isEmpty算子来判断是否为空 23 | 24 | ```scala 25 | def isEmpty(): Boolean = withScope { 26 | partitions.length == 0 || take(1).length == 0 27 | } 28 | ``` 29 | 30 | isEmpty的源码也是判断分区长度或者是否有数据,如果是空RDD,isEmpty会抛异常:`Exception in thread "main" org.apache.spark.SparkDriverExecutionException: Execution error` 31 | 32 | > emptyRDD判断isEmpty不会报这个错,不知道为什么 33 | 34 | 之后搜索了一下,看到下面这个办法 (见: https://my.oschina.net/u/2362111/blog/743754 35 | 36 | ### rdd.partitions().isEmpty() 37 | 38 | ``` 39 | 这种比较适合Dstream 进来后没有经过 类似 reduce 操作的 。 40 | ``` 41 | 42 | ### rdd.rdd().dependencies().apply(0).rdd().partitions().length==0 43 | 44 | ``` 45 | 这种就可以用来作为 经过 reduce 操作的 了 46 | ``` -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkSql/RDDtoDataFrame.scala: -------------------------------------------------------------------------------- 1 | package sparkSql 2 | 3 | import org.apache.spark.sql.SparkSession 4 | 5 | object RDDtoDataFrame { 6 | case class Person(name: String, age: Int, job: String) 7 | 8 | def main(args: Array[String]): Unit = { 9 | val spark = SparkSession.builder().master("local").appName("rdd to dataFrame").getOrCreate() 10 | 11 | import spark.implicits._ 12 | //toDF需要导入隐式转换 13 | val rdd = spark.sparkContext.textFile("data/people.txt").cache() 14 | val header = rdd.first() 15 | val rddToDF = rdd.filter(row => row != header) //去头 16 | .map(_.split(";")) 17 | .map(x => Person(x(0).toString, x(1).toInt, x(2).toString)) 18 | .toDF("name","age","job") 19 | rddToDF.show() 20 | rddToDF.printSchema() 21 | 22 | val dfToRDD = rddToDF.rdd//返回RDD[row],RDD类型的row对象 23 | dfToRDD.foreach(println) 24 | 25 | dfToRDD.map(row =>Person(row.getAs[String]("name"), row.getAs[Int]("age"), row(2).toString())) //也可用row.getAs[String](2) 26 | .foreach(p => println(p.name + ":" + p.age + ":"+ p.job)) 27 | dfToRDD.map{row=>{ 28 | val columnMap = row.getValuesMap[Any](Array("name","age","job")) 29 | Person(columnMap("name").toString(),columnMap("age").toString.toInt,columnMap("job").toString) 30 | }}.foreach(p => println(p.name + ":" + p.age + ":"+ p.job)) 31 | // isNullAt方法 32 | } 33 | } -------------------------------------------------------------------------------- /notes/高级排序和topN问题.md: -------------------------------------------------------------------------------- 1 | ### 按列排序和二次排序问题 2 | 3 | 搞清楚sortByKey、sortBy作用;其次是自定义Key用来排序,详情可见MapReduce中排序问题 4 | 5 | ``` scala 6 | class MyKey(val first: Int, val second: Int) extends Ordered[MyKey] with Serializable { 7 | //记住这种定义写法 8 | def compare(that:MyKey): Int = { 9 | //定义比较方法 10 | } 11 | } 12 | ``` 13 | 14 | ### topN和分组topN问题 15 | 16 | topN包含排序,分组topN只是在分组的基础上考虑排序 17 | 18 | 分组要搞清groupBy和groupByKey,前者分组后类似 (t003,((t1003,19),(t1003,1090))),后者类似(t003,(19,1090)) 19 | 20 | 其次是排序,如果是简单集合内排序直接调用工具方法:sorted、sortWith、sortBy 21 | 22 | > sorted:适合单集合的升降序 23 | > 24 | > sortBy:适合对单个或多个属性的排序,代码量比较少 25 | > 26 | > sortWith:适合定制化场景比较高的排序规则,比较灵活,也能支持单个或多个属性的排序,但代码量稍多,内部实际是通过java里面的Comparator接口来完成排序的 27 | 28 | ```scala 29 | scala> val arr = Array(3,4,5,1,2,6) 30 | arr: Array[Int] = Array(3, 4, 5, 1, 2, 6) 31 | scala> 
arr.sorted 32 | res0: Array[Int] = Array(1, 2, 3, 4, 5, 6) 33 | scala> arr.sorted.take(3) 34 | res1: Array[Int] = Array(1, 2, 3) 35 | scala> arr.sorted.takeRight(3) 36 | res3: Array[Int] = Array(4, 5, 6) 37 | scala> arr.sorted(Ordering.Int.reverse) 38 | res4: Array[Int] = Array(6, 5, 4, 3, 2, 1) 39 | scala> arr.sorted.reverse 40 | res5: Array[Int] = Array(6, 5, 4, 3, 2, 1) 41 | scala> arr.sortWith(_>_) 42 | res6: Array[Int] = Array(6, 5, 4, 3, 2, 1) 43 | ``` 44 | 45 | 考虑降序几种写法如下 46 | 47 | ```scala 48 | arr.sorted(Ordering.Int.reverse) 49 | arr.sorted.reverse 50 | arr.sortWith(_>_) 51 | ``` 52 | 53 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkCore/TopN .scala: -------------------------------------------------------------------------------- 1 | package sparkCore 2 | 3 | import org.apache.spark.SparkConf 4 | import org.apache.spark.SparkContext 5 | import org.apache.spark.rdd._ 6 | 7 | object TopN { 8 | //Top3 9 | def top3(data: RDD[String]): Array[String] = { 10 | if (data.isEmpty()) { 11 | println("RDD为空,返回空Array") 12 | Array() 13 | } 14 | if (data.count() <= 3) { 15 | data.collect() 16 | } else { 17 | val sortedData = data.map(x => (x, x.split(" ")(1).toInt)).sortBy(_._2, false) 18 | val result = sortedData.take(3).map(_._1) 19 | result 20 | } 21 | } 22 | //分组TopN 23 | def groupTopN(data: RDD[String],n:Int): RDD[(String,List[Int])] = { 24 | //先不考虑其他的 25 | //分组后类似 (t003,(19,1090,190,109)) 26 | val groupParis = data.map { x => 27 | (x.split(" ")(0), x.split(" ")(1).toInt) 28 | }.groupByKey() 29 | val sortedData = groupParis.map(x=> 30 | { 31 | //分组后,排序(默认升序),再倒序去前n 32 | val sortedLists = x._2.toList.sorted.reverse.take(n) 33 | (x._1,sortedLists) 34 | }) 35 | sortedData 36 | } 37 | def main(args: Array[String]): Unit = { 38 | val conf = new SparkConf() 39 | .setAppName("Top3") 40 | .setMaster("local") 41 | val sc = new SparkContext(conf) 42 | val lines = sc.textFile("data/topN.txt").cache() 43 | 44 | top3(lines).foreach(println) 45 | 46 | groupTopN(lines, 3).sortBy(_._1, true).foreach(x=>{ 47 | print(x._1+": ") 48 | println(x._2.mkString(",")) 49 | }) 50 | } 51 | } 52 | 53 | /*排序几种写法: 54 | arr.sorted(Ordering.Int.reverse) 55 | arr.sorted.reverse 56 | arr.sortWith(_>_) 57 | */ -------------------------------------------------------------------------------- /notes/eclipse如何导入Spark源码方便阅读.md: -------------------------------------------------------------------------------- 1 | 最近想看下spark sql的源码,就查了些相关文章。很多都是IDEA怎么导入的,还有就是谈到了自己编译spark源码再倒入,但我还没有强到修改源码的地步,所以跳过编译直接导入阅读源码,过程如下 2 | 3 | 4 | 5 | ### 下载spark源码 6 | 7 | 从 https://github.com/apache/spark 下载你需要的spark版本,如图 8 | 9 | ![1550161394643](assets/1550161394643.png) 10 | 11 | > 当然,也方便eclipse中 Ctrl+点击 来跳转到源码查看。具体是Attach Source中指定下载的源码所在位置即可 12 | > 13 | > 1. 按住Ctrl,用鼠标去点一些jar包里的方法,你可以选择跳转到implementation 14 | > 15 | > 2. 
到时候它会有一个attach to source的选项,点击,然后选择下好的源码,就关联好了 16 | 17 | ### 导入eclipse 18 | 19 | 下载下来的项目是maven项目所以直接导入即可,当然eclipse要有安装maven插件 20 | 21 | 22 | 23 | Eclipse中File->Import->Import Existing Maven Projects 24 | 25 | 26 | 27 | 如图,我下载并解压的源码包spark-2.4.0(为了区分,重命名了spark-2.4.0-src),导入然后选择你想要阅读的源码(不想要的下面取消勾选即可) 28 | 29 | > Spark子项目模块有: 30 | > 31 | > - spark-catalyst:Spark的词法、语法分析、抽象语法树(AST)生成、优化器、生成逻辑执行计划、生成物理执行计划等。 32 | > - spark-core:Spark最为基础和核心的功能模块。 33 | > - spark-examples:使用多种语言,为Spark学习人员提供的应用例子。 34 | > - spark-sql:Spark基于SQL标准,实现的通用查询引擎。 35 | > - spark-hive:Spark基于Spark SQL,对Hive元数据、数据的支持。 36 | > - spark-mesos:Spark对Mesos的支持模块。 37 | > - spark-mllib:Spark的机器学习模块。 38 | > - spark-streaming:Spark对流式计算的支持模块。 39 | > - spark-unsafe:Spark对系统内存直接操作,以提升性能的模块。 40 | > - spark-yarn:Spark对Yarn的支持模块。 41 | 42 | ![](assets/导入spark.png) 43 | 44 | 45 | 46 | 点击Finish后,maven会自动下载相关依赖。没有自动下载的话,右键项目->Maven->Download Source->Update Project...即可 47 | 48 | 49 | 50 | 还有一个问题是可能最后会报错,比如有些jar包没有下载下来。我遇到的问题是,因为spark2.4.0是基于scala2.11的,但是我的eclipse插件较新下的是scala2.12,创建maven项目时会指定scala为2.12,所以依赖啥的可能就有问题吧(我猜测)。最后是在项目中指定scala2.11,maven会自动处理,然后就可以用了。 51 | 52 | 如图,我是导入了spark-sql源码 53 | 54 | 55 | 56 | ![20190215002003](assets/20190215002003.png) 57 | 58 | 参考: 59 | 60 | - https://ymgd.github.io/codereader/2018/04/16/Spark%E4%BB%A3%E7%A0%81%E7%BB%93%E6%9E%84%E5%8F%8A%E8%BD%BD%E5%85%A5Ecplise%E6%96%B9%E6%B3%95/ 61 | - http://www.cnblogs.com/zlslch/p/7457352.html -------------------------------------------------------------------------------- /notes/LearningSpark(3)RDD操作.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ### 键值对RDD上的操作 隐式转换 4 | shuffle操作中常用针对某个key对一组数据进行操作,比如说groupByKey、reduceByKey这类PairRDDFunctions中需要启用Spark的隐式转换,scala就会自动地包装成元组 RDD。导入 `org.apache.spark.SparkContext._`即可 5 | 6 | 没啥意思,就是记着导入`import org.apache.spark.SparkContext._`就有隐式转换即可 7 | 8 | ### 常用Transformation算子 9 | 10 | - map:对RDD中的每个数据通过函数映射成一个新的数。**输入与输出分区一对一** 11 | 12 | - flatMap:同上,不过会把输出分区合并成一个 13 | 14 | - **注意**:flatMap会把String扁平化为字符数组,但不会把字符串数组Array[String]扁平化 15 | 16 | map和flatMap还有个区别,如下代码 [更多map、flatmap区别见这里](http://www.brunton-spall.co.uk/post/2011/12/02/map-map-and-flatmap-in-scala/) 17 | 18 | ```scala 19 | scala> val list = List(1,2,3,4,5) 20 | list:List [Int] = List(1,2,3,4,5) 21 | 22 | scala> def g(v:Int)= List(v-1,v,v + 1) 23 | g:(v:Int)List [Int] 24 | 25 | scala> list.map(x => g(x)) 26 | res0:List [List [Int]] = List(List(0,1,2),List(1,2,3),List(2,3,4),List(3,4,5),List(List) 4,5,6)) 27 | 28 | scala> list.flatMap(x => g(x)) 29 | res1:List [Int] = List(0,1,2,1,2,3,2,3,4,3,4,5,4,5,6) 30 | ``` 31 | 32 | - filter:通过func对RDD中数据进行过滤,func返回true保留,反之滤除 33 | 34 | - distinct:对RDD中数据去重 35 | 36 | - reduceByKey(func):对RDD中的每个Key对应的Value进行reduce聚合操作 37 | 38 | - groupByKey():根据key进行group分组,每个key对应一个`Iterable` 39 | 40 | - sortByKey([ascending]):对RDD中的每个Key进行排序,ascending布尔值是否升序 【默认升序】 41 | 42 | - join(otherDataset):对两个包含对的RDD进行join操作,每个key join上的pair,都会传入自定义函数进行处理 43 | 44 | 45 | ### 常用Action算子 46 | 47 | - reduce(func):通过函数func聚合数据集,func 输入为两个元素,返回为一个元素。多用来作运算 48 | - count():统计数据集中元素个数 49 | - collect():以一个数组的形式返回数据集的所有元素。因为是加载到内存中,要求数据集小,否则会有溢出可能 50 | - take(n):数据集中的前 n 个元素作为一个数组返回,并非并行执行,而是由驱动程序计算所有的元素 51 | - foreach(func):调用func来遍历RDD中每个元素 52 | - saveAsTextFile(path):将数据集中的元素以文本文件(或文本文件集合)的形式写入本地文件系统、HDFS 或其它 Hadoop 支持的文件系统中的给定目录中。Spark 将对每个元素调用 toString 方法,将数据元素转换为文本文件中的一行记录 53 | - countByKey():仅适用于(K,V)类型的 RDD 。返回具有每个 key 的计数的 (K , Int)对 的 Map 54 | - sortBy(func, ascending):通过指定排序函数func对RDD中元素排序,ascending 是否升序 
【默认true升序,一般func指定按哪个值排序】 55 | 56 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkSql/DataFrameOperations.scala: -------------------------------------------------------------------------------- 1 | package sparkSql 2 | 3 | import org.apache.spark.sql.SparkSession 4 | import org.apache.spark.sql 5 | import org.apache.spark.sql.functions 6 | import org.apache.spark.sql.types.DataTypes 7 | 8 | object DataFrameOperations { 9 | def main(args: Array[String]): Unit = { 10 | val spark = SparkSession.builder().master("local").appName("DataFrame API").getOrCreate() 11 | 12 | // 读取spark项目中example中带的几个示例数据,创建DataFrame 13 | val people = spark.read.format("json").load("data/people.json") 14 | // people.show() 15 | // people.printSchema() 16 | 17 | // val p = people.selectExpr("cast(age as string) age_toString","name") 18 | // p.printSchema() 19 | 20 | import spark.implicits._ //导入这个为了隐式转换,或RDD转DataFrame之用 21 | // 更改column的类型,也可以通过上面selectExpr实现 22 | people withColumn("age", $"age".cast(sql.types.StringType)) //DataTypes下有若干数据类型,记住类的位置 23 | people.printSchema() 24 | // people.select(functions.col("age").cast(DataTypes.DoubleType)).show() 25 | 26 | // people.select($"name", $"age" + 1).show() 27 | // people.select(people("age")+1, people.col("name")).show() 28 | // people.select("name").s1how() //select($"name")等效,后者好处看上面 29 | // people.filter($"name".contains("ust")).show() 30 | // people.filter($"name".like("%ust%")).show 31 | // people.filter($"name".rlike(".*ust.*")).show() 32 | println("Test Filter*****************") 33 | // people.filter(people("name") contains ("ust")).show() 34 | // people.filter(people("name") like ("%ust%")).show() 35 | // people.filter(people("name") rlike (".*?ust.*?")).show() 36 | 37 | println("Filter中如何取反*****************") 38 | // people.filter(!(people("name") contains ("ust"))).select("name", "age").show() 39 | people.groupBy("age").count().show() 40 | // people.createOrReplaceTempView("sqlDF") 41 | // spark.sql("select * from sqlDF where name not like '%ust%' ").show() 42 | 43 | } 44 | } -------------------------------------------------------------------------------- /notes/Spark2.4+Hive使用Hive现有仓库.md: -------------------------------------------------------------------------------- 1 | ### 使用前准备 2 | 3 | - hive-site.xml复制到$SPARK_HOME/conf目录下 4 | - hive连接mysql的jar包(mysql-connector-java-8.0.13.jar)也要复制到$SPARK_HOME/jars目录下 5 | - 或者在spark-submit脚本中通过--jars指明该jar包位置 6 | - 或者在spark-env.xml中把该jar包位置加入Class Path `export SPARK_CLASSPATH=$SPARK_CLASSPATH:/jar包位置` 7 | > 我测试不起作用 8 | 9 | ### spark.sql.warehouse.dir参数 10 | 11 | 入门文档讲解spark sql如何作用在hive上时,[提到了下面这个例子](http://spark.apachecn.org/#/docs/7?id=hive-%E8%A1%A8),其次有个配置spark.sql.warehouse.dir 12 | ```scala 13 | val spark = SparkSession 14 | .builder() 15 | .appName("Spark Hive Example") 16 | .config("spark.sql.warehouse.dir", warehouseLocation) 17 | .enableHiveSupport() 18 | .getOrCreate() 19 | ``` 20 | 该参数指明的是hive数据仓库位置 21 | > spark 1.x 版本使用的参数是"hive.metastore.warehouse" ,在spark 2.0.0 后,该参数已经不再生效,用户应使用 spark.sql.warehouse.dir进行代替 22 | 23 | 比如说我hive仓库是配置在hdfs上的,所以spark.sql.warehouse.dir=hdfs://master:9000/hive/warehouse 24 | ``` 25 | 26 | hive.metastore.warehouse.dir 27 | /hive/warehouse 28 | 29 | ``` 30 | 31 | 有一点是以上要达到期望效果前提是hive要部署好,没部署好的话,会在spark会默认覆盖hive的配置项。因为spark下也有spark-hive模快的,此时会使用内置hive。比如你可以尝试命令行下使用spark-shell使用hive仓库,你会发现当前目录下会产生metastore_db(元数据信息) 32 | 33 | 
另外是该参数可以从hive-site.xml中获取,所以可以不写。但是在eclipse、idea中编程如果不把hive-site.xml放在resource文件夹下无法读取,所以还是配置该参数为好 34 | 35 | ### 报错解决 36 | 1. 报无法加载mysql驱动jar包,导致一些列错误无法访问数据库等。解决是把该jar包如上文所述放在$SPARK_HOME/jars目录下 37 | > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. 38 | 39 | 2. 报Table 'hive.PARTITIONS' doesn't exist 40 | > ERROR Datastore:115 - Error thrown executing ALTER TABLE `PARTITIONS` ADD COLUMN `TBL_ID` BIGINT NULL : Table 'hive.PARTITIONS' doesn't exist 41 | java.sql.SQLSyntaxErrorException: Table 'hive.PARTITIONS' doesn't exist 42 | 43 | 这个不知道怎么说,程序可以运行不过会抛该错误和warn 44 | 在stackoverflow中搜到了,配置一个参数config("spark.sql.hive.verifyPartitionPath", "false") 45 | 见:https://stackoverflow.com/questions/47933705/spark-sql-fails-if-there-is-no-specified-partition-path-available 46 | 47 | 参考: 48 | 49 | - [Spark 2.2.1 + Hive 案例之不使用现有的Hive环境;使用现有的Hive数据仓库;UDF自定义函数](https://blog.csdn.net/duan_zhihua/article/details/79335625) 50 | - [Spark的spark.sql.warehouse.dir相关](https://blog.csdn.net/u013560925/article/details/79854072) 51 | - https://www.jianshu.com/p/60e7e16fb3ce 52 | -------------------------------------------------------------------------------- /notes/LearningSpark(6)Spark内核架构剖析.md: -------------------------------------------------------------------------------- 1 | ## Standalone模式下内核架构分析 2 | 3 | ### Application 和 spark-submit 4 | 5 | Application是编写的spark应用程序,spark-submit是提交应用程序给spark集群运行的脚本 6 | 7 | ### Driver 8 | 9 | spark-submit在哪里提交,那台机器就会启动Drive进程,用以执行应用程序。 10 | 11 | 像我们编写程序一样首先做的就是创建SparkContext 12 | 13 | ### SparkContext作用 14 | 15 | sc负责上下文环境(下文sc也指代SparkContext),sc初始化时负责创建DAGScheduler、TaskScheduler、Spark UI【先不考虑这个】 16 | 17 | 主要就是DAGScheduler、TaskScheduler,下面一一提到 18 | 19 | ### TaskScheduler功能 20 | 21 | TaskScheduler任务调度器,从字面可知和任务执行有关 22 | 23 | sc构造的TaskScheduler有自己后台的进程,负责连接spark集群的Master节点并注册Application 24 | 25 | #### Master 26 | 27 | Master在接受Application注册请求后,会通过自己的资源调度算法在集群的Worker节点上启动一系列Executor进程(Master通知Worker启动Executor) 28 | 29 | #### Worker和Executor 30 | 31 | Worker节点启动Executor进程,Executor进程会创建线程池,并且启动后会再次向Driver的TaskScheduler反向注册,告知有哪些Executor可用。 32 | 33 | 当然不止这些作用,先在这里谈一下task任务的执行。 34 | 35 | Executor每接收到一个task后,会调用TaskRunner对task封装(所谓封装就是对代码、算子、函数等拷贝、反序列化),然后再从线程池中取线程执行该task 36 | 37 | #### 聊聊Task 38 | 39 | 说的task这里就顺便提下task。task分两类:ShuffleMapTask、ResultTask,每个task对应的是RDD的一个Partition分区。下面会提到DAGScheduler会把提交的job作业分为多个Stage(依据宽依赖、窄依赖),一个Stage对应一个TaskSet(系列task),而ResultTask正好对应最后一个Stage,之前的Stage都是ShuffleMapTask 40 | 41 | ### DAGScheduler 42 | 43 | SparkContext另一重要作用就是构造DAGScheduler。程序中每遇到Action算子就会提交一个Job(想懒加载),DAGScheduler会把提交的job作业分为多个Stage。Stage就是一个TaskSet,会提交给TaskScheduler,因为之前Executor已经向TaskScheduler注册了哪些资源可用,所有TaskScheduler会分配TaskSet里的task给Executor执行(涉及task分配算法)。如何执行?向上看Executor部分 44 | 45 | ## Spark on Yarn模式下的不同之处 46 | 47 | 之前提到过on Yarn有yarn-client和yarn-cluster两种模式,在spark-submit脚本中通过`--master`、`--deploy-mode`来区分以哪种方式运行 【具体可见:[LearningSpark(2)spark-submit可选参数.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(2)spark-submit%E5%8F%AF%E9%80%89%E5%8F%82%E6%95%B0.md)】 48 | 49 | 其中,官方文档中所提及`--deploy-mode` 指定部署模式,是在 worker 节点(cluster)上还是在本地作为一个外部的客户端(client)部署您的 driver(默认 : client),这和接下来所提及的内容有关 50 | 51 | 因为是运行在Yarn集群上,所有没有什么Master、Worker节点,取而代之是ResourceManager、NodeManager(下文会以RM、NM代替) 52 | 53 | ### yarn-cluster运行模式 54 | 55 | 
首先spark-submit提交Application后会向RM发送请求,请求启动ApplicationMaster(同standalone模式下的Master,但同时该节点也会运行Drive进程【这里和yarn-client有区别】)。RM就会分配container在某个NM上启动ApplicationMaster 56 | 57 | 要执行task就得有Executor,所以ApplicationMaster要向RM申请container来启动Executor。RM分配一些container(就是一些NM节点)给ApplicationMaster用来启动Executor,ApplicationMaster就会连接这些NM(这里NM就如同Worker)。NM启动Executor后向ApplicationMaster注册 58 | 59 | ### yarn-client运行模式 60 | 61 | 如上所提的,这种模式的不同在于Driver是部署在本地提交的那台机器上的。过程大致如yarn-cluster,不同在于ApplicationMaster实际上是ExecutorLauncher,而申请到的NodeManager所启动的Executor是要向本地的Driver注册的,而不是向ApplicationMaster注册 -------------------------------------------------------------------------------- /notes/LearningSpark(2)spark-submit可选参数.md: -------------------------------------------------------------------------------- 1 | ## 提交应用的脚本和可选参数 2 | 3 | 可以选择local模式下运行来测试程序,但要是在集群上运行还需要通过spark-submit脚本来完成。官方文档上的示例是这样写的(其中表明哪些是必要参数): 4 | 5 | ``` 6 | ./bin/spark-submit \ 7 | --class \ 8 | --master \ 9 | --deploy-mode \ 10 | --conf = \ 11 | ... # other options 12 | \ 13 | [application-arguments] 14 | ``` 15 | 16 | 常用参数如下: 17 | - `--master` 参数来设置 SparkContext 要连接的集群,默认不写就是local[*]【可以不用在SparkContext中写死master信息】 18 | 19 | - `--jars` 来设置需要添加到 classpath 中的 JAR 包,有多个 JAR 包使用逗号分割符连接 20 | 21 | - `--class` 指定程序的类入口 22 | 23 | - `--deploy-mode` 指定部署模式,是在 worker 节点(cluster)上还是在本地作为一个外部的客户端(client)部署您的 driver(默认 : client) 24 | 25 | > 这里顺便提一下yarn-client和yarn-cluster区别 26 | ![cluster-client](assets/cluster-client.png) 27 | 28 | - `application-jar` : 包括您的应用以及所有依赖的一个打包的 Jar 的路径。该Jar包的 URL 在您的集群上必须是全局可见的,例如,一个 hdfs:// path 或者一个 file:// path 在所有节点是可见的。 29 | 30 | - `application-arguments` : 传递到您的 main class 的 main 方法的参数 31 | 32 | - `driver-memory`是 driver 使用的内存,不可超过单机的最大可使用的 33 | 34 | - `num-executors`是创建多少个 executor 35 | 36 | - `executor-memory`是各个 executor 使用的最大内存,不可超过单机的最大可使用内存 37 | 38 | - `executor-cores`是每个 executor 最大可并发执行的 Task 数目 39 | 40 | ``` 41 | #如下是spark on yarn模式下运行计算Pi的测试程序 42 | # 有一点务必注意,每行最后换行时务必多敲个空格,否则解析该语句时就是和下一句相连的,不知道会爆些什么古怪的错误 43 | [hadoop@master spark-2.4.0-bin-hadoop2.6]$ ./bin/spark-submit \ 44 | > --master yarn \ 45 | > --class org.apache.spark.examples.SparkPi \ 46 | > --deploy-mode client \ 47 | > --driver-memory 1g \ 48 | > --num-executors 2 \ 49 | > --executor-memory 2g \ 50 | > --executor-cores 2 \ 51 | > examples/jars/spark-examples_2.11-2.4.0.jar \ 52 | > 10 53 | ``` 54 | 每次提交都写这么多肯定麻烦,可以写个脚本 55 | 56 | ## 从文件中加载配置 57 | 58 | **spark-submit** 脚本可以从一个 **properties** 文件加载默认的 [Spark configuration values](http://spark.apache.org/docs/latest/configuration.html) 并且传递它们到您的应用中去。默认情况下,它将从 **Spark** 目录下的 ***conf/spark-defaults.conf*** 读取配置。更多详细信息,请看 [加载默认配置](http://spark.apache.org/docs/latest/configuration.html#loading-default-configurations) 部分。 59 | 60 | 加载默认的 **Spark** 配置,这种方式可以消除某些标记到 **spark-submit** 的必要性。例如,如果 ***spark.master*** 属性被设置了,您可以在 **spark-submit** 中安全的省略。一般情况下,明确设置在 **SparkConf** 上的配置值的优先级最高,然后是传递给 **spark-submit** 的值,最后才是 **default value**(默认文件)中的值。 61 | 62 | 如果您不是很清楚其中的配置设置来自哪里,您可以通过使用 ***--verbose*** 选项来运行 **spark-submit** 打印出细粒度的调试信息 63 | 64 | 更多内容可参考文档:[提交应用](http://cwiki.apachecn.org/pages/viewpage.action?pageId=3539265) ,[Spark-Submit 参数设置说明和考虑](https://www.alibabacloud.com/help/zh/doc-detail/28124.htm) 65 | 66 | 67 | 68 | ## 配置参数优先级问题 69 | 70 | sparkConf中配置的参数优先级最高,其次是spark-submit脚本中,最后是默认属性文件(spark-defaults.conf)中的配置参数 71 | 72 | 默认情况下,spark-submit也会从spark-defaults.conf中读取配置 -------------------------------------------------------------------------------- /notes/LearningSpark(4)Spark持久化操作.md: 
-------------------------------------------------------------------------------- 1 | ## 持久化 2 | 3 | Spark的一个重要特性,对RDD持久化操作时每个节点将RDD中的分区持久化到内存(或磁盘)上,之后的对该RDD反复操作过程中不需要重新计算该RDD,而是直接从内存中调用已缓存的分区即可。 4 | 当然,持久化适用于将要多次计算反复调用的RDD。不然的话会出现RDD重复计算,浪费资源降低性能的情况 5 | 6 | > 巧妙使用RDD持久化,甚至在某些场景下,可以将spark应用程序的性能提升10倍。对于迭代式算法和快速交互式应用来说,RDD持久化,是非常重要的 7 | 8 | 其次,持久化机制还有自动容错机制,如果哪个缓存的分区丢失,就会自动从其源RDD通过系列transformation操作重新计算该丢失分区 9 | 10 | Spark的一些shuffle操作也会自动对中间数据进行持久化,避免了在shuffle出错情况下,需要重复计算整个输入 11 | 12 | ## 持久化方法 13 | 14 | cache()和persist()方法,二者都是Transformation算子。要使用持久化必须将缓存好的RDD付给一个变量,之后重复使用该变量即可,其次不能在cache、persist后立刻调用action算子,否则也不叫持久化 15 | 16 | cache()等同于只缓存在内存中的persist(),源码如下 17 | ```scala 18 | def cache(): this.type = persist() 19 | def persist(): this.type = persist(StorageLevel.MEMORY_ONLY) 20 | ``` 21 | 22 | 另外persist持久化还有一个Storage Level的概念 【持久化策略】 23 | 24 | | Storage Level | meaning | 25 | | ------------------------------------- | ------------------------------------------------------------ | 26 | | MEMORY_ONLY | 以非序列化的Java对象的方式持久化在JVM内存中。如果内存无法完全存储RDD所有的partition,那么那些没有持久化的partition就会在下一次需要使用它的时候,重新被计算 | 27 | | MEMORY_AND_DISK | 先缓存到内存,但是当内存不够时,会持久化到磁盘中。下次需要使用这些partition时,需要从磁盘上读取 | 28 | | MEMORY_ONLY_SER | 同MEMORY_ONLY,但是会使用Java序列化方式,将Java对象序列化后进行持久化。可以减少内存开销,但是需要进行反序列化,因此会加大CPU开销 | 29 | | MEMORY_AND_DSK_SER | 同MEMORY_AND_DSK,_SER同上也会使用Java序列化方式 | 30 | | DISK_ONLY | 使用非序列化Java对象的方式持久化,完全存储到磁盘上 | 31 | | MEMORY_ONLY_2【或者其他尾部加了_2的】 | 尾部_2的级别会将持久化数据复制一份,保存到其他节点,从而在数据丢失时,不需要再次计算,只需要使用备份数据即可 | 32 | 33 | 上面提过持久化也是Transformation算子,所以也是遇到action算子才执行 34 | 35 | ```scala 36 | final def iterator(split: Partition, context: TaskContext): Iterator[T] = { 37 | if (storageLevel != StorageLevel.NONE) { 38 | getOrCompute(split, context) 39 | } else { 40 | computeOrReadCheckpoint(split, context) 41 | } 42 | } 43 | ``` 44 | 45 | 会根据StorageLevel判断是否是持久化的,getOrCompute 就是根据RDD的RDDBlockId作为BlockManager 的key,判断是否缓存过这个RDD,如果没有,通过依赖计算生成,然后放入到BlockManager 中。如果已经存在,则直接从BlockManager 获取 46 | 47 | ### 如何选择持久化策略 48 | 49 | > Spark提供的多种持久化级别,主要是为了在CPU和内存消耗之间进行取舍。下面是一些通用的持久化级别的选择建议: 50 | > 51 | > 1、优先使用MEMORY_ONLY,如果可以缓存所有数据的话,那么就使用这种策略。因为纯内存速度最快,而且没有序列化,不需要消耗CPU进行反序列化操作。 52 | > 2、如果MEMORY_ONLY策略,无法存储的下所有数据的话,那么使用MEMORY_ONLY_SER,将数据进行序列化进行存储,纯内存操作还是非常快,只是要消耗CPU进行反序列化。 53 | > 3、如果需要进行快速的失败恢复,那么就选择带后缀为_2的策略,进行数据的备份,这样在失败时,就不需要重新计算了。 54 | > 4、能不使用DISK相关的策略,就不用使用,有的时候,从磁盘读取数据,还不如重新计算一次。 55 | 56 | ### 非持久化方法 unpersist() 57 | 58 | 59 | 60 | 参考:https://blog.csdn.net/weixin_35602748/article/details/78667489 和北风网spark245讲 61 | 62 | ![cache](assets/cache.png) 63 | 64 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Learning-Spark 2 | 学习Spark的代码,关于Spark Core、Spark SQL、Spark Streaming、Spark MLLib 3 | 4 | ## 说明 5 | 6 | ### 开发环境 7 | 8 | - 基于Deepin Linux 15.9版本 9 | - 基于Hadoop2.6、Spark2.4、Scala2.11、java8等 10 | 11 | > 系列环境搭建相关文章,见下方 12 | > - [【向Linux迁移记录】Deepin下java、大数据开发环境配置【一】](https://blog.csdn.net/lzw2016/article/details/86566873) 13 | > - [【向Linux迁移记录】Deepin下Python开发环境搭建](https://blog.csdn.net/lzw2016/article/details/86567436) 14 | > - [【向Linux迁移记录】Deepin Linux下快速Hadoop完全分布式集群搭建](https://blog.csdn.net/lzw2016/article/details/86618345) 15 | > - [【向Linux迁移记录】基于Hadoop集群的Hive安装与配置详解](https://blog.csdn.net/lzw2016/article/details/86631115) 16 | > - [【向Linux迁移记录】Deepin Linux下Spark本地模式及基于Yarn的分布式集群环境搭建](https://blog.csdn.net/lzw2016/article/details/86718403) 17 | > - [eclipse安装Scala IDE插件及An internal error occurred during: 
"Computing additional info"报错解决](https://blog.csdn.net/lzw2016/article/details/86717728) 18 | > - [Deepin Linux 安装启动scala报错 java.lang.NumberFormatException: For input string: "0x100" 解决](https://blog.csdn.net/lzw2016/article/details/86618570) 19 | > - 更多内容见:【https://blog.csdn.net/lzw2016/ 】【https://github.com/josonle/Coding-Now 】 20 | 21 | ### 文件说明 22 | 23 | - [Spark_With_Scala_Testing](https://github.com/josonle/Learning-Spark/tree/master/Spark_With_Scala_Testing) 存放平时练习代码 24 | - notes存放笔记 25 | - [LearningSpark(1)数据来源.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(1)%E6%95%B0%E6%8D%AE%E6%9D%A5%E6%BA%90.md) 26 | - [LearningSpark(2)spark-submit可选参数.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(2)spark-submit%E5%8F%AF%E9%80%89%E5%8F%82%E6%95%B0.md) 27 | - [LearningSpark(3)RDD操作.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(3)RDD%E6%93%8D%E4%BD%9C.md) 28 | - [LearningSpark(4)Spark持久化操作](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(4)Spark%E6%8C%81%E4%B9%85%E5%8C%96%E6%93%8D%E4%BD%9C.md) 29 | - [LearningSpark(5)Spark共享变量.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(5)Spark%E5%85%B1%E4%BA%AB%E5%8F%98%E9%87%8F.md) 30 | - [LearningSpark(6)Spark内核架构剖析.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(6)Spark%E5%86%85%E6%A0%B8%E6%9E%B6%E6%9E%84%E5%89%96%E6%9E%90.md) 31 | - [LearningSpark(7)SparkSQL之DataFrame学习(含Row).md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(7)SparkSQL%E4%B9%8BDataFrame%E5%AD%A6%E4%B9%A0.md) 32 | - [LearningSpark(8)RDD如何转化为DataFrame](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(8)RDD%E5%A6%82%E4%BD%95%E8%BD%AC%E5%8C%96%E4%B8%BADataFrame.md) 33 | - [LearningSpark(9)SparkSQL数据来源](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(9)SparkSQL%E6%95%B0%E6%8D%AE%E6%9D%A5%E6%BA%90.md) 34 | - [RDD如何作为参数传给函数.md](https://github.com/josonle/Learning-Spark/blob/master/notes/RDD%E5%A6%82%E4%BD%95%E4%BD%9C%E4%B8%BA%E5%8F%82%E6%95%B0%E4%BC%A0%E7%BB%99%E5%87%BD%E6%95%B0.md) 35 | - [判断RDD是否为空](https://github.com/josonle/Learning-Spark/blob/master/notes/%E5%88%A4%E6%96%ADRDD%E6%98%AF%E5%90%A6%E4%B8%BA%E7%A9%BA) 36 | - [高级排序和topN问题.md](https://github.com/josonle/Learning-Spark/blob/master/notes/%E9%AB%98%E7%BA%A7%E6%8E%92%E5%BA%8F%E5%92%8CtopN%E9%97%AE%E9%A2%98.md) 37 | - [Spark1.x和2.x如何读取和写入csv文件](https://blog.csdn.net/lzw2016/article/details/85562172) 38 | - [Spark DataFrame如何更改列column的类型.md](https://github.com/josonle/Learning-Spark/blob/master/notes/Spark%20DataFrame%E5%A6%82%E4%BD%95%E6%9B%B4%E6%94%B9%E5%88%97column%E7%9A%84%E7%B1%BB%E5%9E%8B.md) 39 | - [使用JDBC将DataFrame写入mysql.md](https://github.com/josonle/Learning-Spark/blob/master/notes/%E4%BD%BF%E7%94%A8JDBC%E5%B0%86DataFrame%E5%86%99%E5%85%A5mysql.md) 40 | - Scala 语法点 41 | - [Scala排序函数使用.md](https://github.com/josonle/Learning-Spark/blob/master/notes/Scala%E6%8E%92%E5%BA%8F%E5%87%BD%E6%95%B0%E4%BD%BF%E7%94%A8.md) 42 | - [报错和问题归纳.md](https://github.com/josonle/Learning-Spark/blob/master/notes/%E6%8A%A5%E9%94%99%E5%92%8C%E9%97%AE%E9%A2%98%E5%BD%92%E7%BA%B3.md) 43 | 44 | 待续 45 | -------------------------------------------------------------------------------- /notes/Scala排序函数使用.md: -------------------------------------------------------------------------------- 1 | Scala里面有三种排序方法,分别是: sorted,sortBy ,sortWith 2 | 3 | 分别介绍下他们的功能: 4 | 5 | (1)sorted 6 | 7 | 
对一个集合进行自然排序,通过传递隐式的Ordering 8 | 9 | (2)sortBy 10 | 11 | 对一个属性或多个属性进行排序,通过它的类型。 12 | 13 | (3)sortWith 14 | 15 | 基于函数的排序,通过一个comparator函数,实现自定义排序的逻辑。 16 | 17 | 例子一:基于单集合单字段的排序 18 | 19 | ```scala 20 | val xs=Seq(1,5,3,4,6,2) 21 | println("==============sorted排序=================") 22 | println(xs.sorted) //升序 23 | println(xs.sorted.reverse) //降序 24 | println("==============sortBy排序=================") 25 | println( xs.sortBy(d=>d) ) //升序 26 | println( xs.sortBy(d=>d).reverse ) //降序 27 | println("==============sortWith排序=================") 28 | println( xs.sortWith(_<_) )//升序 29 | println( xs.sortWith(_>_) )//降序 30 | ``` 31 | 32 | 结果: 33 | 34 | ```scala 35 | ==============sorted排序================= 36 | List(1, 2, 3, 4, 5, 6) 37 | List(6, 5, 4, 3, 2, 1) 38 | ==============sortBy排序================= 39 | List(1, 2, 3, 4, 5, 6) 40 | List(6, 5, 4, 3, 2, 1) 41 | ==============sortWith排序================= 42 | List(1, 2, 3, 4, 5, 6) 43 | List(6, 5, 4, 3, 2, 1) 44 | ``` 45 | 46 | 47 | 48 | 例子二:基于元组多字段的排序 49 | 50 | 注意多字段的排序,使用sorted比较麻烦,这里给出使用sortBy和sortWith的例子 51 | 52 | 先看基于sortBy的实现: 53 | 54 | ```scala 55 | val pairs = Array( 56 | ("a", 5, 1), 57 | ("c", 3, 1), 58 | ("b", 1, 3) 59 | ) 60 | 61 | //按第三个字段升序,第一个字段降序,注意,排序的字段必须和后面的tuple对应 62 | val bx= pairs. 63 | sortBy(r => (r._3, r._1))( Ordering.Tuple2(Ordering.Int, Ordering.String.reverse) ) 64 | //打印结果 65 | bx.map( println ) 66 | ``` 67 | 68 | 结果: 69 | 70 | ``` 71 | (c,3,1) 72 | (a,5,1) 73 | (b,1,3) 74 | ``` 75 | 76 | 77 | 78 | 再看基于sortWith的实现: 79 | 80 | ```scala 81 | val pairs = Array( 82 | ("a", 5, 1), 83 | ("c", 3, 1), 84 | ("b", 1, 3) 85 | ) 86 | val b= pairs.sortWith{ 87 | case (a,b)=>{ 88 | if(a._3==b._3) {//如果第三个字段相等,就按第一个字段降序 89 | a._1>b._1 90 | }else{ 91 | a._3 (person.age, person.name))( Ordering.Tuple2(Ordering.Int, Ordering.String.reverse) ) 120 | 121 | bx.map( 122 | println 123 | ) 124 | ``` 125 | 126 | 结果: 127 | 128 | ``` 129 | Person(dog,23) 130 | Person(cat,23) 131 | Person(andy,25) 132 | ``` 133 | 134 | 135 | 136 | 再看sortWith的实现方法: 137 | 138 | ```scala 139 | case class Person(val name:String,val age:Int) 140 | 141 | val p1=Person("cat",23) 142 | val p2=Person("dog",23) 143 | val p3=Person("andy",25) 144 | 145 | val pairs = Array(p1,p2,p3) 146 | 147 | val b=pairs.sortWith{ 148 | case (person1,person2)=>{ 149 | person1.age==person2.age match { 150 | case true=> person1.name>person2.name //年龄一样,按名字降序排 151 | case false=>person1.age row != header) //去头 22 | .map(_.split(";")) 23 | .map(x => Person(x(0).toString, x(1).toInt, x(2).toString)) 24 | .toDF() //调用toDF方法 25 | rddToDF.show() 26 | rddToDF.printSchema() 27 | 28 | val dfToRDD = rddToDF.rdd //返回RDD[row],RDD类型的row对像,类似[Jorge,30,Developer] 29 | ``` 30 | 31 | - toDF方法 32 | 33 | toDF方法是将以通过case Class构建的类对象转为DataFrame,其可以指定列名参数colNames,否则就默认以类对象中参数名为列名。不仅如此,还可以将本地序列(seq), 数组转为DataFrame 34 | 35 | > 要导入Spark sql implicits ,如import spark.implicits._ 36 | > 37 | > 如果直接用toDF()而不指定列名字,那么默认列名为"\_1", "\_2", ... 
38 | > 39 | > ```scala 40 | > val df = Seq( 41 | > (1, "First Value", java.sql.Date.valueOf("2019-01-01")), 42 | > (2, "Second Value", java.sql.Date.valueOf("2019-02-01")) 43 | > ).toDF("int_column", "string_column", "date_column") 44 | > ``` 45 | 46 | - toDS方法:转为DataSets 47 | 48 | ### 方法二:基于编程方式 49 | 50 | 通过一个允许你构造一个 **Schema** 然后把它应用到一个已存在的 **RDD** 的编程接口。然而这种方法更繁琐,当列和它们的类型知道运行时都是未知时它允许你去构造 **Dataset** 51 | 52 | ```scala 53 | import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType} 54 | import org.apache.spark.sql.Row 55 | //创建RDD[row]对象 56 | val personRDD = rdd.map { line => Row(line.split(";")(0).toString, line.split(";")(1).toInt, line.split(";")(2).toString) } 57 | //创建Schema,需要StructType/StructField 58 | val structType = StructType(Array( 59 | StructField("name", StringType, true), 60 | StructField("age", IntegerType, true), 61 | StructField("job", StringType, true))) 62 | //createDataFrame(rowRDD: RDD[Row], schema: StructType)方法 63 | //Creates a `DataFrame` from an `RDD` containing [[Row]]s using the given schema. 64 | val rddToDF = spark.createDataFrame(personRDD, structType) 65 | ``` 66 | 67 | createDataFrame源码如下: 68 | 69 | ```scala 70 | /** 71 | * Creates a `DataFrame` from an `RDD[Row]`. 72 | * User can specify whether the input rows should be converted to Catalyst rows. 73 | */ 74 | private[sql] def createDataFrame( 75 | rowRDD: RDD[Row], 76 | schema: StructType, 77 | needsConversion: Boolean) = { 78 | // TODO: use MutableProjection when rowRDD is another DataFrame and the applied 79 | // schema differs from the existing schema on any field data type. 80 | val catalystRows = if (needsConversion) { 81 | val encoder = RowEncoder(schema) 82 | rowRDD.map(encoder.toRow) 83 | } else { 84 | rowRDD.map { r: Row => InternalRow.fromSeq(r.toSeq) } 85 | } 86 | internalCreateDataFrame(catalystRows.setName(rowRDD.name), schema) 87 | } 88 | ``` 89 | 90 | ### Row对象的方法 91 | 92 | - 直接通过下标,如row(0) 93 | 94 | - getAs[T]:获取指定列名的列,也可用getAsInt、getAsString等 95 | 96 | ```scala 97 | //参数要么是列名,要么是列所在位置(从0开始) 98 | def getAs[T](i: Int): T = get(i).asInstanceOf[T] 99 | def getAs[T](fieldName: String): T = getAs[T](fieldIndex(fieldName)) 100 | ``` 101 | 102 | - getValuesMap:获取指定几列的值,返回的是个map 103 | 104 | ```scala 105 | //源码 106 | def getValuesMap[T](fieldNames: Seq[String]): Map[String, T] = { 107 | fieldNames.map { name => 108 | name -> getAs[T](name) 109 | }.toMap 110 | } 111 | //使用 112 | dfToRDD.map{row=>{ 113 | val columnMap = row.getValuesMap[Any](Array("name","age","job")) Person(columnMap("name").toString(),columnMap("age").toString.toInt,columnMap("job").toString) 114 | }} 115 | ``` 116 | 117 | - isNullAt:判断所在位置i的值是否为null 118 | 119 | ```scala 120 | def isNullAt(i: Int): Boolean = get(i) == null 121 | ``` 122 | 123 | - length -------------------------------------------------------------------------------- /notes/使用JDBC将DataFrame写入mysql.md: -------------------------------------------------------------------------------- 1 | # spark foreachPartition 把df 数据插入到mysql 2 | 3 | > 转载自:http://www.waitingfy.com/archives/4370,确实写的不错 4 | 5 | ```scala 6 | import java.sql.{Connection, DriverManager, PreparedStatement} 7 | 8 | import org.apache.spark.sql.SparkSession 9 | import org.apache.spark.sql.functions._ 10 | 11 | import scala.collection.mutable.ListBuffer 12 | 13 | object foreachPartitionTest { 14 | 15 | case class TopSongAuthor(songAuthor:String, songCount:Long) 16 | 17 | 18 | def getConnection() = { 19 | 
DriverManager.getConnection("jdbc:mysql://localhost:3306/baidusong?user=root&password=root&useUnicode=true&characterEncoding=UTF-8") 20 | } 21 | 22 | def release(connection: Connection, pstmt: PreparedStatement): Unit = { 23 | try { 24 | if (pstmt != null) { 25 | pstmt.close() 26 | } 27 | } catch { 28 | case e: Exception => e.printStackTrace() 29 | } finally { 30 | if (connection != null) { 31 | connection.close() 32 | } 33 | } 34 | } 35 | 36 | def insertTopSong(list:ListBuffer[TopSongAuthor]):Unit ={ 37 | 38 | var connect:Connection = null 39 | var pstmt:PreparedStatement = null 40 | 41 | try{ 42 | connect = getConnection() 43 | connect.setAutoCommit(false) 44 | val sql = "insert into topSinger(song_author, song_count) values(?,?)" 45 | pstmt = connect.prepareStatement(sql) 46 | for(ele <- list){ 47 | pstmt.setString(1, ele.songAuthor) 48 | pstmt.setLong(2,ele.songCount) 49 | 50 | pstmt.addBatch() 51 | } 52 | pstmt.executeBatch() 53 | connect.commit() 54 | }catch { 55 | case e:Exception => e.printStackTrace() 56 | }finally { 57 | release(connect, pstmt) 58 | } 59 | } 60 | 61 | def main(args: Array[String]): Unit = { 62 | val spark = SparkSession 63 | .builder() 64 | .master("local[2]") 65 | .appName("foreachPartitionTest") 66 | .getOrCreate() 67 | val gedanDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306").option("dbtable", "baidusong.gedan").option("user", "root").option("password", "root").option("driver", "com.mysql.jdbc.Driver").load() 68 | // mysqlDF.show() 69 | val detailDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306").option("dbtable", "baidusong.gedan_detail").option("user", "root").option("password", "root").option("driver", "com.mysql.jdbc.Driver").load() 70 | 71 | val joinDF = gedanDF.join(detailDF, gedanDF.col("id") === detailDF.col("gedan_id")) 72 | 73 | // joinDF.show() 74 | import spark.implicits._ 75 | val resultDF = joinDF.groupBy("song_author").agg(count("song_name").as("song_count")).orderBy($"song_count".desc).limit(100) 76 | // resultDF.show() 77 | 78 | 79 | resultDF.foreachPartition(partitionOfRecords =>{ 80 | val list = new ListBuffer[TopSongAuthor] 81 | partitionOfRecords.foreach(info =>{ 82 | val song_author = info.getAs[String]("song_author") 83 | val song_count = info.getAs[Long]("song_count") 84 | 85 | list.append(TopSongAuthor(song_author, song_count)) 86 | }) 87 | insertTopSong(list) 88 | 89 | }) 90 | 91 | spark.close() 92 | } 93 | 94 | } 95 | ``` 96 | 97 | 98 | 99 | 上面的例子是用[《python pandas 实战 百度音乐歌单 数据分析》](http://www.waitingfy.com/archives/4105)用spark 重新实现了一次 100 | 101 | 默认的foreach的性能缺陷在哪里? 102 | 103 | 首先,对于每条数据,都要单独去调用一次function,task为每个数据,都要去执行一次function函数。 104 | 如果100万条数据,(一个partition),调用100万次。性能比较差。 105 | 106 | 另外一个非常非常重要的一点 107 | 如果每个数据,你都去创建一个数据库连接的话,那么你就得创建100万次数据库连接。 108 | 但是要注意的是,数据库连接的创建和销毁,都是非常非常消耗性能的。虽然我们之前已经用了 109 | 数据库连接池,只是创建了固定数量的数据库连接。 110 | 111 | 你还是得多次通过数据库连接,往数据库(MySQL)发送一条SQL语句,然后MySQL需要去执行这条SQL语句。 112 | 如果有100万条数据,那么就是100万次发送SQL语句。 113 | 114 | 以上两点(数据库连接,多次发送SQL语句),都是非常消耗性能的。 115 | 116 | foreachPartition,在生产环境中,通常来说,都使用foreachPartition来写数据库的 117 | 118 | 使用批处理操作(一条SQL和多组参数) 119 | 发送一条SQL语句,发送一次 120 | 一下子就批量插入100万条数据。 121 | 122 | 用了foreachPartition算子之后,好处在哪里? 
123 | 124 | 1、对于我们写的function函数,就调用一次,一次传入一个partition所有的数据 125 | 2、只要创建或者获取一个数据库连接就可以 126 | 3、只要向数据库发送一次SQL语句和多组参数即可 127 | 128 | 参考《算子优化 foreachPartition》 https://blog.csdn.net/u013939918/article/details/60881711 -------------------------------------------------------------------------------- /notes/LearningSpark(7)SparkSQL之DataFrame学习.md: -------------------------------------------------------------------------------- 1 | DataFrame说白了就是RDD+Schema(元数据信息),spark1.3之前还叫SchemaRDD,是以列的形式组织的分布式的数据集合 2 | 3 | Spark-SQL 可以以 RDD 对象、Parquet 文件、JSON 文件、Hive 表, 4 | 以及通过JDBC连接到其他关系型数据库表作为数据源来生成DataFrame对象 5 | 6 | ### 如何创建Spark SQL的入口 7 | 8 | 同Spark Core要先创建SparkContext对象一样,在spark2.0之前使用SQLContext对象,或者是它的子类的对象,比如HiveContext的对象;spark2.0开始提供了SparkSession对象,代替了上述两种方式作为程序入口。如下所示 9 | 10 | ```scala 11 | import org.apache.spark.SparkConf 12 | import org.apache.spark.SparkContext 13 | import org.apache.spark.sql.SQLContext 14 | 15 | val conf = new SparkConf().setMaster("local").setAppName("test sqlContext") 16 | val sc = new SparkContext(conf) 17 | val sqlContext = new SQLContext(sc) 18 | //导入隐式转换import sqlContext.implicits._ 19 | ``` 20 | 21 | > SparkSession中封装了spark.sparkContext和spark.sqlContext,所以可直接通过spark.sparkContext创建sc,进而创建rdd(这里spark是SparkSession对象) 22 | 23 | ```scala 24 | import org.apache.spark.sql.SparkSession 25 | 26 | val spark = SparkSession.builder().master("local").appName("DataFrame API").getOrCreate() 27 | //或者 SparkSession.builder().config(new SparkConf()).getOrCreate() 28 | //导入隐式转换import spark.implicits._ 29 | ``` 30 | 31 | 32 | 33 | > 了解下HiveContext: 34 | > 35 | > 除了基本的SQLContext以外,还可以使用它的子类——HiveContext。HiveContext的功能除了包含SQLContext提供的所有功能之外,还包括了额外的专门针对Hive的一些功能。这些额外功能包括:使用HiveQL语法来编写和执行SQL,使用Hive中的UDF函数,从Hive表中读取数据。 36 | > 37 | > 要使用HiveContext,就必须预先安装好Hive;SQLContext支持的数据源,HiveContext也同样支持——而不只是支持Hive。对于Spark 1.3.x以上的版本,都推荐使用HiveContext,因为其功能更加丰富和完善。 38 | > 39 | > Spark SQL还支持用spark.sql.dialect参数设置SQL的方言。使用SQLContext的setConf()即可进行设置。对于SQLContext,它只支持“sql”一种方言。对于HiveContext,它默认的方言是“hiveql”。 40 | 41 | ### DataFrame API 42 | 43 | #### show 44 | 45 | ```scala 46 | def show(numRows: Int): Unit = show(numRows, truncate = true) 47 | 48 | /** 49 | * Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters 50 | * will be truncated, and all cells will be aligned right. 
51 | * 52 | * @group action 53 | * @since 1.6.0 54 | */ 55 | def show(): Unit = show(20) 56 | ``` 57 | 58 | 看源码中所示,numRows=20默认显示前20行,truncate表示一个字段是否最多显示 20 个字符,默认为 true 59 | 60 | 61 | 62 | #### select 选择 63 | 64 | ```scala 65 | people.select($"name", $"age" + 1).show() 66 | people.select("name").show() //select($"name")等效,后者好处看上面 67 | ``` 68 | 69 | >使用$"age"提取 age 列数据作比较时,用到了隐式转换,故需在程序中引入相应包 70 | > 71 | >`import spark.implicits._` 72 | > 73 | >注意这里所谓的spark是SparkSession或sqlContext的实例对象 74 | 75 | 76 | 77 | >另外也可以直接使用people("age")+1,或者people.col("age")+1,以上几种方法都是选择某列 78 | 79 | #### selectExpr 80 | 81 | 可以对指定字段进行特殊处理的选择 82 | 83 | ``` 84 | people.selectExpr("cast(age as string) age_toString","name") 85 | ``` 86 | 87 | 88 | 89 | #### filter 过滤 90 | 91 | 同sql中的where,无非数字和字符串的比较,注意以下 92 | 93 | ```scala 94 | //spark2.x写法 95 | people.filter($"age" > 21).show() 96 | people.filter($"name".contains("ust")).show() 97 | people.filter($"name".like("%ust%")).show 98 | people.filter($"name".rlike(".*?ust.*?")).show() 99 | // 同上,spark1.6写法 100 | //people.filter(people("name") contains("ust")).show() 101 | //people.filter(people("name") like("%ust%")).show() 102 | //people.filter(people("name") rlike(".*?ust.*?")).show() 103 | ``` 104 | 105 | 106 | 107 | - contains:包含某个子串substring 108 | - like:同SQL语句中的like,要借助通配符`_`、`%` 109 | - rlike:这个是java中的正则匹配,用法和正则pattern一样 110 | - 可以用and、or 111 | - 取反not contains:如`people.filter(!(people("name") contains ("ust")))` 112 | 113 | #### groupBy 聚合 114 | 115 | #### count 计数 116 | 117 | #### createOrReplaceTempView 注册为临时SQL表 118 | 119 | 注册为临时SQL表,便于直接使用sql语言编程 120 | 121 | ```scala 122 | people.createOrReplaceTempView("sqlDF") 123 | spark.sql("select * from sqlDF where name not like '%ust%' ").show() //spark是创建的SparkSession对象 124 | ``` 125 | 126 | 127 | 128 | #### createGlobalTempView 注册为全局临时SQL表 129 | 130 | 上面创建的TempView是与SparkSession相关的,session结束就会销毁,想跨多个Session共享的话需要使用Global Temporary View。spark examples中给出如下参考示例 131 | 132 | ```scala 133 | // Global temporary view is tied to a system preserved database `global_temp` 134 | spark.sql("SELECT * FROM global_temp.people").show() 135 | // +----+-------+ 136 | // | age| name| 137 | // +----+-------+ 138 | // |null|Michael| 139 | // | 30| Andy| 140 | // | 19| Justin| 141 | // +----+-------+ 142 | 143 | // Global temporary view is cross-session 144 | spark.newSession().sql("SELECT * FROM global_temp.people").show() 145 | // +----+-------+ 146 | // | age| name| 147 | // +----+-------+ 148 | // |null|Michael| 149 | // | 30| Andy| 150 | // | 19| Justin| 151 | // +----+-------+ 152 | // $example off:global_temp_view$ 153 | ``` 154 | 155 | -------------------------------------------------------------------------------- /Spark_With_Scala_Testing/src/sparkCore/Transformation.scala: -------------------------------------------------------------------------------- 1 | package sparkCore 2 | 3 | import org.apache.spark.SparkConf 4 | import org.apache.spark.SparkContext 5 | import org.apache.spark.SparkConf 6 | import org.apache.spark.SparkConf 7 | import org.spark_project.dmg.pmml.True 8 | 9 | object Transformation { 10 | def main(args: Array[String]): Unit = { 11 | System.out.println("Start***********") 12 | /*getData("data/hello.txt") 13 | getData("", Array(1,2,3,4,5))*/ 14 | 15 | /*map_flatMap_result()*/ 16 | /*distinct_result()*/ 17 | 18 | /* filter_result() 19 | groupByKey_result() 20 | reduceByKey_result() 21 | sortByKey_result()*/ 22 | join_result() 23 | } 24 | 25 | //### 数据来源 26 | 27 | def getData(path: String, arr: Array[Int] = 
Array()) { 28 | // path:file:// 或 hdfs://master:9000/ [指定端口9000] 29 | val conf = new SparkConf().setMaster("local").setAppName("getData") 30 | val sc = new SparkContext(conf) 31 | 32 | if (!arr.isEmpty) { 33 | sc.parallelize(arr).map(_ + 1).foreach(println) 34 | } else { 35 | sc.textFile(path, 1).foreach(println) 36 | } 37 | sc.stop() 38 | } 39 | 40 | def map_flatMap_result() { 41 | val conf = new SparkConf().setMaster("local").setAppName("map_flatMap") 42 | val sc = new SparkContext(conf) 43 | val rdd = sc.textFile("data/hello.txt", 1) 44 | val mapResult = rdd.map(_.split(",")) 45 | val flatMapResult = rdd.flatMap(_.split(",")) 46 | 47 | mapResult.foreach(println) 48 | flatMapResult.foreach(println) 49 | 50 | println("差别×××××××××") 51 | flatMapResult.map(_.toUpperCase).foreach(println) 52 | flatMapResult.flatMap(_.toUpperCase).foreach(println) 53 | } 54 | //去重 55 | def distinct_result() { 56 | val conf = new SparkConf().setMaster("local").setAppName("去重") 57 | val sc = new SparkContext(conf) 58 | val rdd = sc.textFile("data/hello.txt", 1) 59 | rdd.flatMap(_.split(",")).distinct().foreach(println) 60 | } 61 | //保留偶数 62 | def filter_result() { 63 | val conf = new SparkConf().setMaster("local").setAppName("过滤偶数") 64 | val sc = new SparkContext(conf) 65 | val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6), 1) 66 | rdd.filter(_ % 2 == 0).foreach(println) 67 | } 68 | //每个班级的学生 69 | def groupByKey_result() { 70 | val conf = new SparkConf().setMaster("local").setAppName("分组") 71 | val sc = new SparkContext(conf) 72 | val classmates = Array(Tuple2("class1", "Lee"), Tuple2("class2", "Liu"), 73 | Tuple2("class2", "Ma"), Tuple2("class3", "Wang"), 74 | Tuple2("class1", "Zhao"), Tuple2("class1", "Zhang"), 75 | Tuple2("class1", "Mao"), Tuple2("class4", "Hao"), 76 | Tuple2("class3", "Zha"), Tuple2("class2", "Zhao")) 77 | 78 | val students = sc.parallelize(classmates) 79 | students.groupByKey().foreach(x ⇒ { 80 | println(x._1 + " :") 81 | x._2.foreach(println) 82 | println("***************") 83 | }) 84 | } 85 | //每个班级总分 86 | def reduceByKey_result() { 87 | val conf = new SparkConf().setMaster("local").setAppName("聚合") 88 | val sc = new SparkContext(conf) 89 | val classmates = Array(Tuple2("class1", 90), Tuple2("class2", 85), 90 | Tuple2("class2", 60), Tuple2("class3", 95), 91 | Tuple2("class1", 70), Tuple2("class4", 100), 92 | Tuple2("class3", 80), Tuple2("class2", 65)) 93 | 94 | val students = sc.parallelize(classmates) 95 | val scores = students.reduceByKey(_ + _).foreach(x ⇒ { 96 | println(x._1 + " :" + x._2) 97 | println("****************") 98 | }) 99 | } 100 | //按分数排序 101 | def sortByKey_result() { 102 | val conf = new SparkConf() 103 | .setAppName("sortByKey") 104 | .setMaster("local") 105 | val sc = new SparkContext(conf) 106 | 107 | val scoreList = Array(Tuple2(65, "leo"), Tuple2(50, "tom"), 108 | Tuple2(100, "marry"), Tuple2(85, "jack")) 109 | val scores = sc.parallelize(scoreList, 1) 110 | val sortedScores = scores.sortByKey(false) 111 | 112 | sortedScores.foreach(studentScore => println(studentScore._1 + ": " + studentScore._2)) 113 | sc.stop() 114 | println("排序Over×××××××××") 115 | } 116 | 117 | def join_result(){ 118 | val conf = new SparkConf().setMaster("local").setAppName("聚合") 119 | val sc = new SparkContext(conf) 120 | val classmates = Array(Tuple2("class1", 90), Tuple2("class2", 85), 121 | Tuple2("class2", 60), Tuple2("class3", 95), 122 | Tuple2("class1", 70), Tuple2("class4", 100), 123 | Tuple2("class3", 80), Tuple2("class2", 65)) 124 | 125 | val students = sc.parallelize(classmates) 126 
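    // 先用 reduceByKey 求每个班级的总分,再用 countByKey 求每个班级的人数,
    // join 之后按班级依次输出总分、人数,并相除得到平均分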
| val allScores = students.reduceByKey(_ + _) 127 | val nums = students.countByKey().toArray 128 | allScores.join(sc.parallelize(nums, 1)).sortByKey(true).foreach(x⇒{ 129 | println(x._1+" :") 130 | println("All_scores: "+x._2._1+",Nums: "+x._2._2) 131 | println("Avg_scores: "+x._2._1.toDouble/x._2._2) 132 | }) 133 | } 134 | } 135 | -------------------------------------------------------------------------------- /notes/报错和问题归纳.md: -------------------------------------------------------------------------------- 1 | 2 | ### spark读取HDFS文件java.net.ConnectException: Connection refused异常 3 | 4 | 报错信息如下: 5 | ``` 6 | java.net.ConnectException: Call From josonlee-PC/127.0.1.1 to 192.168.17.10:8020 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused 7 | at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 8 | at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 9 | at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 10 | at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 11 | at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) 12 | at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731) 13 | at org.apache.hadoop.ipc.Client.call(Client.java:1474) 14 | at org.apache.hadoop.ipc.Client.call(Client.java:1401) 15 | at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) 16 | at com.sun.proxy.$Proxy24.getListing(Unknown Source) 17 | at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:554) 18 | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 19 | at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 20 | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 21 | at java.lang.reflect.Method.invoke(Method.java:498) 22 | at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) 23 | at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) 24 | at com.sun.proxy.$Proxy25.getListing(Unknown Source) 25 | at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1958) 26 | at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1941) 27 | at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693) 28 | at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105) 29 | at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755) 30 | at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751) 31 | at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) 32 | at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751) 33 | at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69) 34 | at org.apache.hadoop.fs.Globber.glob(Globber.java:217) 35 | at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644) 36 | at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257) 37 | at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 38 | at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) 39 | at 
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) 40 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) 41 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) 42 | at scala.Option.getOrElse(Option.scala:121) 43 | at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) 44 | at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) 45 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) 46 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) 47 | at scala.Option.getOrElse(Option.scala:121) 48 | at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) 49 | at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) 50 | at org.apache.spark.rdd.RDD.count(RDD.scala:1168) 51 | ... 49 elided 52 | Caused by: java.net.ConnectException: 拒绝连接 53 | at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 54 | at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) 55 | at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) 56 | at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) 57 | at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) 58 | at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609) 59 | at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) 60 | at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370) 61 | at org.apache.hadoop.ipc.Client.getConnection(Client.java:1523) 62 | at org.apache.hadoop.ipc.Client.call(Client.java:1440) 63 | ... 86 more 64 | ``` 65 | 66 | 一开始看着官方文档写代码,加载外部数据集这一块知道是写hdfs的文件的url,但也没看到示例,就写成下面这样了 67 | ``` 68 | scala> val data=sc.textFile("hdfs://192.168.17.10//sparkData/test/*") 69 | ``` 70 | 报错信息中也指出了 `Call From josonlee-PC/127.0.1.1 to 192.168.17.10:8020 failed`,说明spark默认是通过8020端口访问hdfs的,错误也好解决,hadoop的配置文件【core-site.xml】中指明了是通过9000端口对外的,所以在url中写死端口即可 71 | ``` 72 | scala> val data=sc.textFile("hdfs://192.168.17.10:9000//sparkData/test/*") 73 | ``` 74 | 75 | 76 | 77 | ## Spark集群(或spark-shell)读取本地文件报错:无法找到文件 78 | 79 | spark-shell默认不是本地模式的。集群要读取文件首先就要确保worker都能访问该文件,而本地文件只在Master节点下,不存在Worker节点下,所有Worker不能通过`file:///`的形式读取文件 80 | 81 | 解决办法: 82 | 83 | - 文件上传到hdfs上,通过`hdfs://`的方式读取 84 | - 或者把文件复制到Worker节点下对应目录下即可 -------------------------------------------------------------------------------- /notes/LearningSpark(5)Spark共享变量.md: -------------------------------------------------------------------------------- 1 | ## 共享变量 2 | 3 | Spark又一重要特性————共享变量 4 | 5 | worker节点中每个Executor会有多个task任务,而算子调用函数要使用外部变量时,默认会每个task拷贝一份变量。这就导致如果该变量很大时网络传输、占用的内存空间也会很大,所以就有了 **共享变量**。每个节点拷贝一份该变量,节点上task共享这份变量 6 | 7 | spark提过两种共享变量:Broadcast Variable(广播变量),Accumulator(累加变量) 8 | 9 | ## Broadcast Variable 和 Accumulator 10 | 11 | 广播变量**只可读**不可修改,所以其用处是优化性能,减少网络传输以及内存消耗;累加变量能让多个task共同操作该变量,**起到累加作用**,通常用来实现计数器(counter)和求和(sum)功能 12 | 13 | ### 广播变量 14 | 15 | 广播变量通过调用SparkContext的broadcast()方法,来对某个变量创建一个Broadcast[T]对象,Scala中通过value属性访问该变量,Java中通过value()方法访问该变量 16 | 17 | 通过广播方式进行传播的变量,会经过序列化,然后在被任务使用时再进行反序列化 18 | 19 | ```scala 20 | scala> val brodcast = sc.broadcast(1) 21 | brodcast: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(12) 22 | 23 | scala> brodcast 24 | res4: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(12) 25 | 26 | scala> brodcast.value 27 | res5: Int = 1 28 | ``` 29 | ```scala 30 | val factor = 3 31 | val factorBroadcast = sc.broadcast(factor) 32 | 33 | val arr = Array(1, 2, 3, 4, 5) 34 | val rdd = sc.parallelize(arr) 35 | val 
multipleRdd = rdd.map(num => num * factorBroadcast.value()) 36 | ``` 37 | 38 | ### 如何更新广播变量 39 | 40 | 通过unpersist()将老的广播变量删除,然后重新广播一遍新的广播变量 41 | 42 | ```scala 43 | import java.io.{ ObjectInputStream, ObjectOutputStream } 44 | import org.apache.spark.broadcast.Broadcast 45 | import org.apache.spark.streaming.StreamingContext 46 | import scala.reflect.ClassTag 47 | 48 | /* wrapper lets us update brodcast variables within DStreams' foreachRDD 49 | without running into serialization issues */ 50 | case class BroadcastWrapper[T: ClassTag]( 51 | @transient private val ssc: StreamingContext, 52 | @transient private val _v: T) { 53 | 54 | @transient private var v = ssc.sparkContext.broadcast(_v) 55 | 56 | def update(newValue: T, blocking: Boolean = false): Unit = { 57 | // 删除RDD是否需要锁定 58 | v.unpersist(blocking) 59 | v = ssc.sparkContext.broadcast(newValue) 60 | } 61 | 62 | def value: T = v.value 63 | 64 | private def writeObject(out: ObjectOutputStream): Unit = { 65 | out.writeObject(v) 66 | } 67 | 68 | private def readObject(in: ObjectInputStream): Unit = { 69 | v = in.readObject().asInstanceOf[Broadcast[T]] 70 | } 71 | } 72 | ``` 73 | 74 | 参考: 75 | 76 | [How can I update a broadcast variable in spark streaming?](https://stackoverflow.com/questions/33372264/how-can-i-update-a-broadcast-variable-in-spark-streaming) 77 | 78 | [Spark踩坑记——共享变量](https://www.cnblogs.com/xlturing/p/6652945.html) 79 | 80 | *** 81 | 82 | ### 累加变量 83 | 84 | 累加变量有几点注意: 85 | 86 | - 集群上只有Driver程序可以读取Accumulator的值,task只对该变量调用add方法进行累加操作; 87 | - 累加器的更新只发生在 **action** 操作中,不会改变懒加载,**Spark** 保证每个任务只更新累加器一次,比如,重启任务不会更新值。在 transformations(转换)中, 用户需要注意的是,如果 task(任务)或 job stages(阶段)重新执行,每个任务的更新操作可能会执行多次 88 | - Spark原生支持数值类型的累加器longAccumulator(Long类型)、doubleAccumulator(Double类型),我们也可以自己添加支持的类型,在2.0.0之前的版本中,通过继承AccumulatorParam来实现,而2.0.0之后的版本需要继承AccumulatorV2来实现自定义类型的累加器 89 | 90 | ```scala 91 | scala> val accum1 = sc.longAccumulator("what?") //what?是name属性,可直接sc.accumulator(0) 92 | accum1: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 379, name: Some(what?), value: 0) 93 | scala> accum1.value 94 | res30: Long = 0 95 | scala> sc.parallelize(Array(1,2,3,4,5)).foreach(accum1.add(_)) 96 | scala> accum1 97 | res31: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 379, name: Some(what?), value: 15) 98 | scala> accum1.value 99 | res32: Long = 15 100 | ``` 101 | ```scala 102 | scala> accum1.value 103 | res35: Long = 15 104 | scala> val rdd = sc.parallelize(Array(1,2,3,4,5)) 105 | rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at :24 106 | scala> val tmp = rdd.map(x=>(x,accum1.add(x))) 107 | tmp: org.apache.spark.rdd.RDD[(Int, Unit)] = MapPartitionsRDD[33] at map at :27 108 | scala> accum1.value 109 | res36: Long = 15 110 | scala> tmp.map(x=>(x,accum1.add(x._1))).collect //遇到action操作才执行累加操作 111 | res37: Array[((Int, Unit), Unit)] = Array(((1,()),()), ((2,()),()), ((3,()),()), ((4,()),()), ((5,()),())) 112 | scala> accum1.value //结果是两次累加操作的结果 113 | res38: Long = 45 114 | ``` 115 | 116 | ```scala 117 | scala> val data = sc.parallelize(1 to 10) 118 | data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at parallelize at :24 119 | scala> accum.value 120 | res54: Int = 0 121 | scala> val newData = data.map{x => { 122 | | if(x%2 == 0){ 123 | | accum += 1 124 | | 0 125 | | }else 1 126 | | }} 127 | newData: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at map at :27 128 | 129 | scala> newData.count 130 | res55: Long = 10 131 | scala> accum.value //newData执行Action后累加操作执行一次,结果为5 132 
| res56: Int = 5 133 | scala> newData.collect //newData再次执行Action后累加操作又执行一次 134 | res57: Array[Int] = Array(1, 0, 1, 0, 1, 0, 1, 0, 1, 0) 135 | scala> accum.value 136 | res58: Int = 10 137 | ``` 138 | 139 | 如上代码所示,解释了前面所提的几点。因为newData是data经map而来的,而map函数中有累加操作,所以会有两次累加操作。解决办法如下: 140 | 141 | 1. 要么一次Action操作就求出累加变量结果 142 | 2. 在Action操作前进行持久化,避免了RDD的重复计算导致多次累加 【**推荐**】 143 | 144 | - longAccumulator、doubleAccumulator方法 145 | 146 | ``` 147 | add方法:赋值操作 148 | value方法:获取累加器中的值 149 | merge方法:该方法特别重要,一定要写对,这个方法是各个task的累加器进行合并的方法(下面介绍执行流程中将要用到) 150 | iszero方法:判断是否为初始值 151 | reset方法:重置累加器中的值 152 | copy方法:拷贝累加器 153 | name方法:累加器名称 154 | ``` 155 | 156 | > 累加器执行流程: 首先有几个task,spark engine就调用copy方法拷贝几个累加器(不注册的),然后在各个task中进行累加(注意在此过程中,被最初注册的累加器的值是不变的),执行最后将调用merge方法和各个task的结果累计器进行合并(此时被注册的累加器是初始值) 157 | > 见 http://www.ccblog.cn/103.htm 158 | 159 | ### 自定义累加变量 160 | 161 | **注意:使用时需要register注册一下** 162 | 163 | 1. 类继承extends AccumulatorV2[String, String],第一个为输入类型,第二个为输出类型 164 | 2. 要override以下六个方法 165 | 166 | ``` 167 | isZero: 当AccumulatorV2中存在类似数据不存在这种问题时,是否结束程序。 168 | copy: 拷贝一个新的AccumulatorV2 169 | reset: 重置AccumulatorV2中的数据 170 | add: 操作数据累加方法实现 171 | merge: 合并数据 172 | value: AccumulatorV2对外访问的数据结果 173 | ``` 174 | 175 | 详情参考这个:https://blog.csdn.net/leen0304/article/details/78866353 176 | 177 | -------------------------------------------------------------------------------- /notes/LearningSpark(9)SparkSQL数据来源.md: -------------------------------------------------------------------------------- 1 | > 以下源码在 `org.apache.spark.sql.DataFrameReader/DataFrameWriter`中 2 | 3 | ### format指定内置数据源 4 | 5 | 无论是load还是save都可以手动指定用来操作的数据源类型,**format方法**,通过eclipse查看相关源码,spark内置支持的数据源包括parquet(默认)、json、csv、text(文本文件)、 jdbc、orc,如图 6 | 7 | ![1550670063721](assets/1550670063721.png) 8 | 9 | ```scala 10 | def format(source: String): DataFrameWriter[T] = { 11 | this.source = source 12 | this} 13 | 14 | private var source: String = df.sparkSession.sessionState.conf.defaultDataSourceName 15 | 16 | def defaultDataSourceName: String = getConf(DEFAULT_DATA_SOURCE_NAME) 17 | // This is used to set the default data source,默认处理parquet格式数据 18 | val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default") 19 | .doc("The default data source to use in input/output.") 20 | .stringConf 21 | .createWithDefault("parquet") 22 | ``` 23 | 24 | #### 列举csv方法使用 25 | 26 | 源码如下,可见其实本质就是format指定数据源再调用save方法保存 27 | 28 | ```scala 29 | def csv(paths: String*): DataFrame = format("csv").load(paths : _*) //可以指定多个文件或目录 30 | def csv(path: String): Unit = { 31 | format("csv").save(path) 32 | } 33 | ``` 34 | 35 | 可以看下我这篇文章:[Spark1.x和2.x如何读取和写入csv文件](https://blog.csdn.net/lzw2016/article/details/85562172#commentBox) 36 | 37 | #### 可选的option方法 38 | 39 | 还有就是这几个内置数据源读取或保存的方法都有写可选功能方法option,比如csv方法可选功能有(截取部分自认为有用的): 40 | 41 | > You can set the following CSV-specific options to deal with CSV files: 42 | > 43 | > - **`sep` (default `,`): sets a single character as a separator for each field and value**. 44 | > - `encoding` (default `UTF-8`): decodes the CSV files by the given encoding type. 45 | > - **`escape` (default `\`):** sets a single character used for escaping quotes inside an already quoted value. 46 | > - `comment` (default empty string): sets a single character used for skipping lines beginning with this character. By default, it is disabled. 
47 | > - **`header` (default `false`): uses the first line as names of columns.** 48 | > - `enforceSchema` (default `true`): If it is set to `true`, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to `false`, the schema will be validated against all headers in CSV files in the case when the `header` option is set to `true`. Field names in the schema and column names in CSV headers are checked by their positions taking into account`spark.sql.caseSensitive`. Though the default value is true, it is recommended to disable the `enforceSchema` option to avoid incorrect results. 49 | > - `inferSchema` (default `false`): infers the input schema automatically from data. It requires one extra pass over the data. 50 | > - `ignoreLeadingWhiteSpace` (default `false`): a flag indicating whether or not leading whitespaces from values being read should be skipped. 51 | > - `ignoreTrailingWhiteSpace` (default `false`): a flag indicating whether or not trailing whitespaces from values being read should be skipped. 52 | > - `nullValue` (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type. 53 | > - **`emptyValue` (default empty string):** sets the string representation of an empty value. 54 | > - `maxColumns` (default `20480`): defines a hard limit of how many columns a record can have. 55 | > 56 | > 像eclipse、Idea这类编辑器可以通过鼠标移动到对应方法上查看可以用到哪些options 57 | 58 | **有一点要注意:不要对该方法没有的option强加上**,比如text用上option("sep",";") 59 | 60 | ### SaveMode保存模式和mode方法 61 | 62 | 63 | | **Save Mode** | **意义** | 64 | | ----------------------------- | ------------------------------------------------------------ | 65 | | SaveMode.ErrorIfExists (默认) | 如果目标位置已经存在数据,那么抛出一个异常(默认的SaveMode) | 66 | | SaveMode.Append | 如果目标位置已经存在数据,那么将数据追加进去 | 67 | | SaveMode.Overwrite | 如果目标位置已经存在数据,那么就将已经存在的数据删除,用新数据进行覆盖 | 68 | | SaveMode.Ignore | 如果目标位置已经存在数据,那么就忽略,不做任何操作。 | 69 | 70 | Scala中通过mode方法指定SaveMode,源码如下 71 | 72 | ```scala 73 | def mode(saveMode: String): DataFrameWriter[T] = { 74 | this.mode = saveMode.toLowerCase(Locale.ROOT) match { 75 | case "overwrite" => SaveMode.Overwrite 76 | case "append" => SaveMode.Append 77 | case "ignore" => SaveMode.Ignore 78 | case "error" | "errorifexists" | "default" => SaveMode.ErrorIfExists 79 | case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. 
" + 80 | "Accepted save modes are 'overwrite', 'append', 'ignore', 'error', 'errorifexists'.") 81 | } 82 | this 83 | } 84 | ``` 85 | 86 | ### 使用jdbc连接数据库 87 | 88 | - 使用 **JDBC** 访问特定数据库时,需要在 **spark classpath** 上添加对应的 **JDBC** 驱动配置,可以放在Spark的library目录,也可以在使用Spark Submit的使用指定具体的Jar(编码和打包的时候都不需要这个JDBC的Jar) 89 | 90 | ``` 91 | # 就是在spark-submit 脚本中加类似下方两个参数 92 | --jars $HOME/tools/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar \ 93 | --driver-class-path $HOME/tools/mysql-connector-java-5.1.40-bin.jar \ 94 | ``` 95 | 96 | 当然,我查相关资料时也看到了如下添加驱动的配置 97 | 98 | > 第一种是在${SPARK_HOME}/conf目录下的spark-defaults.conf中添加:spark.jars /opt/lib/mysql-connector-java-5.1.26-bin.jar 99 | > 100 | > 第二种是通过 添加 :spark.driver.extraClassPath /opt/lib2/mysql-connector-java-5.1.26-bin.jar 这种方式也可以实现添加多个依赖jar,比较方便 101 | > 102 | > 参见:https://blog.csdn.net/u013468917/article/details/52748342 103 | 104 | - 写入数据库几点注意项 [参考](https://my.oschina.net/bindyy/blog/680195) 105 | 106 | - 操作该应用程序的用户有对相应数据库操作的权限 107 | - DataFrame应该转为RDD后,再通过foreachPartition操作把每一个partition插入数据库(不要用foreach,不然每条记录都会连接一次数据库) 108 | 109 | 110 | ```scala 111 | spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/数据库").option("dbtable", "表名").option("user", "xxxx").option("password", "xxxx").option("driver", "com.mysql.jdbc.Driver").load() 112 | ``` 113 | 114 | 115 | 首先,是通过SQLContext的read系列方法,将mysql中的数据加载为DataFrame 116 | 然后可以将DataFrame转换为RDD,使用Spark Core提供的各种算子进行操作 117 | 最后可以将得到的数据结果,通过foreach()算子,写入mysql、hbase、redis等等 118 | 119 | ### 保存为持久化的表 120 | 121 | **DataFrames** 也可以通过 **saveAsTable** 命令来保存为一张持久表到 **Hive** **metastore** 中。值得注意的是对于这个功能来说已经存在的 **Hive** 部署不是必须的。**Spark** 将会为你创造一个默认的本地 **Hive metastore**(使用 **Derby**)。不像 **createOrReplaceTempView** 命令那样,**saveAsTable** 将会持久化 **DataFrame** 中的内容并在 **Hive metastore** 中创建一个指向数据的指针。持久化的表将会一直存在甚至当你的 **Spark** 应用已经重启,只要保持你的连接是和一个相同的 **metastore**。一个相对于持久化表的 **DataFrame** 可以通过在 **SparkSession** 中调用 **table** 方法创建。 122 | 123 | 默认情况下 **saveAsTable** 操作将会创建一个 “**managed table**”,意味着数据的位置将会被 **metastore** 控制。**Managed tables** 在表 **drop** 后也数据也会自动删除 124 | 125 | 126 | 127 | ### 注意项 128 | 129 | - 不要对该方法没有的option强加上该option,比如text用上option("sep",";"),会报错 130 | - 注意作为 **json file** 提供的文件不是一个典型的 **JSON** 文件。**每一行必须包含一个分开的独立的有效 JSON 对象**。因此,常规的多行 **JSON** 文件通常会失败 --------------------------------------------------------------------------------