├── .gitignore
├── Spark_With_Scala_Testing
│   ├── data
│   │   ├── test
│   │   │   ├── _SUCCESS
│   │   │   ├── ._SUCCESS.crc
│   │   │   ├── part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv
│   │   │   └── .part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv.crc
│   │   ├── result
│   │   │   ├── _SUCCESS
│   │   │   ├── ._SUCCESS.crc
│   │   │   ├── part-00000
│   │   │   └── .part-00000.crc
│   │   ├── secondarySort.txt
│   │   ├── people.csv
│   │   ├── people.txt
│   │   ├── hello.txt
│   │   ├── people.json
│   │   ├── spark.md
│   │   ├── users.parquet
│   │   ├── employees.json
│   │   └── topN.txt
│   ├── .gitignore
│   ├── .cache-main
│   ├── .settings
│   │   └── org.scala-ide.sdt.core.prefs
│   ├── .classpath
│   ├── src
│   │   ├── test
│   │   │   └── Test .scala
│   │   ├── sparkCore
│   │   │   ├── MyKey.scala
│   │   │   ├── SecondarySort.scala
│   │   │   ├── Action.scala
│   │   │   ├── Persist.scala
│   │   │   ├── sortedWordCount.scala
│   │   │   ├── TopN .scala
│   │   │   └── Transformation.scala
│   │   └── sparkSql
│   │       ├── LoadAndSave.scala
│   │       ├── SqlContextTest.scala
│   │       ├── RDDtoDataFrame2.scala
│   │       ├── RDDtoDataFrame.scala
│   │       └── DataFrameOperations.scala
│   └── .project
├── notes
│   ├── assets
│   │   ├── cache.png
│   │   ├── source.jpg
│   │   ├── 导入spark.png
│   │   ├── 1550161394643.png
│   │   ├── 1550670063721.png
│   │   ├── 20190215002003.png
│   │   └── cluster-client.png
│   ├── LearningSpark(1)数据来源.md
│   ├── RDD如何作为参数传给函数.md
│   ├── eclipse中Attach Source找不到源码,该如何查看jar包源码.md
│   ├── Spark DataFrame如何更改列column的类型.md
│   ├── 判断RDD是否为空.md
│   ├── 高级排序和topN问题.md
│   ├── eclipse如何导入Spark源码方便阅读.md
│   ├── LearningSpark(3)RDD操作.md
│   ├── Spark2.4+Hive使用Hive现有仓库.md
│   ├── LearningSpark(6)Spark内核架构剖析.md
│   ├── LearningSpark(2)spark-submit可选参数.md
│   ├── LearningSpark(4)Spark持久化操作.md
│   ├── Scala排序函数使用.md
│   ├── LearningSpark(8)RDD如何转化为DataFrame.md
│   ├── 使用JDBC将DataFrame写入mysql.md
│   ├── LearningSpark(7)SparkSQL之DataFrame学习.md
│   ├── 报错和问题归纳.md
│   ├── LearningSpark(5)Spark共享变量.md
│   └── LearningSpark(9)SparkSQL数据来源.md
├── Pic
│   └── DAGScheduler划分及提交stage-代码调用过程.jpg
├── LICENSE
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | *.class
2 | *.log
3 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/test/_SUCCESS:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/.gitignore:
--------------------------------------------------------------------------------
1 | /bin/
2 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/result/_SUCCESS:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/result/._SUCCESS.crc:
--------------------------------------------------------------------------------
1 | crc
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/test/._SUCCESS.crc:
--------------------------------------------------------------------------------
1 | crc
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/secondarySort.txt:
--------------------------------------------------------------------------------
1 | 1 5
2 | 2 4
3 | 3 6
4 | 1 3
5 | 2 1
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/result/part-00000:
--------------------------------------------------------------------------------
1 | (you,2)
2 | (jump,8)
3 | (i,3)
4 | (u,1)
5 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/people.csv:
--------------------------------------------------------------------------------
1 | name;age;job
2 | Jorge;30;Developer
3 | Bob;32;Developer
4 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/people.txt:
--------------------------------------------------------------------------------
1 | name;age;job
2 | Jorge;30;Developer
3 | Bob;32;Developer
4 |
--------------------------------------------------------------------------------
/notes/assets/cache.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/cache.png
--------------------------------------------------------------------------------
/notes/assets/source.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/source.jpg
--------------------------------------------------------------------------------
/notes/assets/导入spark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/导入spark.png
--------------------------------------------------------------------------------
/notes/assets/1550161394643.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/1550161394643.png
--------------------------------------------------------------------------------
/notes/assets/1550670063721.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/1550670063721.png
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/hello.txt:
--------------------------------------------------------------------------------
1 | you,jump
2 | i,jump
3 | you,jump
4 | i,jump
5 | jump,jump,jump
6 | u,i,jump
7 |
--------------------------------------------------------------------------------
/notes/assets/20190215002003.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/20190215002003.png
--------------------------------------------------------------------------------
/notes/assets/cluster-client.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/notes/assets/cluster-client.png
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/people.json:
--------------------------------------------------------------------------------
1 | {"name":"Michael"}
2 | {"name":"Andy", "age":30}
3 | {"name":"Justin", "age":19}
4 |
--------------------------------------------------------------------------------
/Pic/DAGScheduler划分及提交stage-代码调用过程.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Pic/DAGScheduler划分及提交stage-代码调用过程.jpg
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/.cache-main:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/.cache-main
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/spark.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/spark.md
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/test/part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv:
--------------------------------------------------------------------------------
1 | age;name
2 | "";Michael
3 | 30;Andy
4 | 19;Justin
5 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/users.parquet:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/users.parquet
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/result/.part-00000.crc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/result/.part-00000.crc
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/employees.json:
--------------------------------------------------------------------------------
1 | {"name":"Michael", "salary":3000}
2 | {"name":"Andy", "salary":4500}
3 | {"name":"Justin", "salary":3500}
4 | {"name":"Berta", "salary":4000}
5 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/test/.part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv.crc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josonle/Learning-Spark/HEAD/Spark_With_Scala_Testing/data/test/.part-00000-305efd32-4d97-4a4d-acf9-fb22c9c6e05e-c000.csv.crc
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/.settings/org.scala-ide.sdt.core.prefs:
--------------------------------------------------------------------------------
1 | eclipse.preferences.version=1
2 | scala.compiler.additionalParams=\ -Xsource\:2.11 -Ymacro-expand\:none
3 | scala.compiler.installation=2.11
4 | scala.compiler.sourceLevel=2.11
5 | scala.compiler.useProjectSettings=true
6 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/data/topN.txt:
--------------------------------------------------------------------------------
1 | t001 2067
2 | t002 2055
3 | t003 109
4 | t004 1200
5 | t005 3368
6 | t006 251
7 | t001 3067
8 | t002 255
9 | t003 19
10 | t004 2000
11 | t005 368
12 | t006 2512
13 | t006 2510
14 | t001 367
15 | t002 2155
16 | t005 338
17 | t006 1251
18 | t001 3667
19 | t002 1255
20 | t003 190
21 | t003 1090
--------------------------------------------------------------------------------
/notes/LearningSpark(1)数据来源.md:
--------------------------------------------------------------------------------
1 | ## Data from a parallelized collection
2 | Call SparkContext's parallelize method on an existing Scala collection (a Seq) to create an RDD from it
3 |
4 | ## External data sources
5 | Spark accepts any input in a `Hadoop InputFormat`: local files, files on HDFS, Hive tables, data in HBase, Amazon S3, Hypertable, and so on; all of these can be used to create RDDs.
6 |
7 | The most common function is `sc.textFile()`, whose parameters are a path and an optional minimum number of partitions. The path is the file's URI, which can be a local path or an `hdfs://`, `s3n://`, etc. URL. Also note that when using local files on a cluster, make sure the worker nodes can access the file too
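8 |
9 | As a minimal sketch (assuming an already-created SparkContext named `sc` and the repo's data/hello.txt), the two ways of creating an RDD described above look like this:
10 |
11 | ```scala
12 | // RDD from an in-memory Scala collection, here with 2 partitions
13 | val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
14 |
15 | // RDD from an external file; the path may also be an hdfs:// or s3n:// URL
16 | val lines = sc.textFile("data/hello.txt", 1)
17 |
18 | println(numbers.count() + " numbers, " + lines.count() + " lines")
19 | ```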
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/.classpath:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/test/Test .scala:
--------------------------------------------------------------------------------
1 | package test
2 |
3 | import org.apache.spark.{SparkConf,SparkContext}
4 |
5 | object Test { // main() must live in an object to be a runnable entry point
6 | def main(args: Array[String]): Unit = {
7 | val conf = new SparkConf().setMaster("local").setAppName("test-tools")
8 | val sc = new SparkContext(conf)
9 | val rdd = sc.parallelize(List())
10 | println("测试:" + rdd.count)
11 |
12 | if (sc.emptyRDD[(String, Int)].isEmpty()) {
13 | println("empty")
14 | }
15 | }
16 | }
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/.project:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <projectDescription>
3 |   <name>Spark_With_Scala_Testing</name>
4 |   <comment></comment>
5 |   <projects>
6 |   </projects>
7 |   <buildSpec>
8 |     <buildCommand>
9 |       <name>org.scala-ide.sdt.core.scalabuilder</name>
10 |       <arguments>
11 |       </arguments>
12 |     </buildCommand>
13 |   </buildSpec>
14 |   <natures>
15 |     <nature>org.scala-ide.sdt.core.scalanature</nature>
16 |     <nature>org.eclipse.jdt.core.javanature</nature>
17 |   </natures>
18 | </projectDescription>
19 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkCore/MyKey.scala:
--------------------------------------------------------------------------------
1 | package sparkCore
2 |
3 | class MyKey(val first: Int, val second: Int) extends Ordered[MyKey] with Serializable {
4 | def compare(that:MyKey): Int = {
5 | //both columns ascending
6 | /*if(first - that.first==0){
7 | second - that.second
8 | }else {
9 | first - that.first
10 | }*/
11 | //first column descending, second column ascending
12 | if(first - that.first==0){
13 | second - that.second
14 | }else {
15 | that.first - first
16 | }
17 | //follows the MapReduce secondary-sort idea: https://github.com/josonle/MapReduce-Demo#%E8%A7%A3%E7%AD%94%E6%80%9D%E8%B7%AF-1
18 | }
19 | }
--------------------------------------------------------------------------------
/notes/RDD如何作为参数传给函数.md:
--------------------------------------------------------------------------------
1 | ```scala
2 | //分组TopN
3 | def groupTopN(data: RDD[String],n:Int): RDD[(String,List[Int])] = {
4 | //keep it simple for now
5 | //after grouping, entries look like (t003,(19,1090,190,109))
6 | val groupParis = data.map { x =>
7 | (x.split(" ")(0), x.split(" ")(1).toInt)
8 | }.groupByKey()
9 | val sortedData = groupParis.map(x=>
10 | {
11 | //within each group, sort (ascending by default), then reverse and take the first n
12 | val sortedLists = x._2.toList.sorted.reverse.take(n)
13 | (x._1,sortedLists)
14 | })
15 | sortedData
16 | }
17 | ```
18 |
19 | The key is knowing the type of the RDD being passed in and the type of the value to return. With an IDE this is easy: once you know what you want to return, hover the mouse over that value and its type is displayed
20 |
21 | Also remember to `import org.apache.spark.rdd._`
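22 |
23 | A hedged usage sketch of the function above (assuming an existing SparkContext `sc` and the repo's data/topN.txt of "group value" lines):
24 |
25 | ```scala
26 | val lines = sc.textFile("data/topN.txt")
27 | // pass the RDD straight into the function, then collect the grouped top 3
28 | groupTopN(lines, 3).collect().foreach { case (group, top) =>
29 |   println(group + ": " + top.mkString(","))
30 | }
31 | ```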
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkCore/SecondarySort.scala:
--------------------------------------------------------------------------------
1 | package sparkCore
2 |
3 | import org.apache.spark.SparkConf
4 | import org.apache.spark.SparkContext
5 | import org.apache.spark.SparkContext._
6 |
7 | object SecondarySort {
8 | def main(args: Array[String]): Unit = {
9 | val conf = new SparkConf().setMaster("local").setAppName("sortedWordCount")
10 | val sc = new SparkContext(conf)
11 | val data = sc.textFile("data/secondarySort.txt", 1)
12 | val keyWD = data.map(x => (
13 | new MyKey(x.split(" ")(0).toInt, x.split(" ")(1).toInt), x))
14 | val sortedWD = keyWD.sortByKey()
15 | sortedWD.map(_._2).foreach(println)
16 | }
17 | }
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkSql/LoadAndSave.scala:
--------------------------------------------------------------------------------
1 | package sparkSql
2 |
3 | import org.apache.spark.sql.SparkSession
4 |
5 | object LoadAndSave {
6 | def main(args: Array[String]): Unit = {
7 | val spark = SparkSession.builder().master("local").appName("load and save datas").getOrCreate()
8 | val df = spark.read.load("data/users.parquet")
9 | val df1 = spark.read.format("json").load("data/people.json")
10 | // df.printSchema()
11 | // df.show()
12 | // df1.show()
13 | // df1.select("name","age").write.format("csv").mode("overwrite").save("data/people")
14 | df1.write.option("header", true).option("sep", ";").csv("data/test")
15 | }
16 | }
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkSql/SqlContextTest.scala:
--------------------------------------------------------------------------------
1 | package sparkSql
2 |
3 | import org.apache.spark.SparkConf
4 | import org.apache.spark.SparkContext
5 | import org.apache.spark.sql.SQLContext
6 |
7 | object SqlContextTest {
8 | // In Spark 2.x, SQLContext has been replaced by SparkSession; this is only for reference
9 | def main(args: Array[String]): Unit = {
10 | val conf = new SparkConf().setMaster("local").setAppName("test sqlContext")
11 | val sc = new SparkContext(conf)
12 | val sqlContext = new SQLContext(sc)
13 | // Read one of the sample datasets shipped with the Spark examples and create a DataFrame
14 | val people = sqlContext.read.format("json").load("data/people.json")
15 | // A DataFrame is essentially an RDD plus a Schema (metadata)
16 | people.show() //print the DataFrame
17 | people.printSchema() //print the DataFrame's schema
18 | }
19 | }
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkCore/Action.scala:
--------------------------------------------------------------------------------
1 | package sparkCore
2 |
3 | import org.apache.spark.{SparkConf,SparkContext}
4 |
5 | object Action {
6 | def main(args: Array[String]): Unit = {
7 | /*The action operators need little explanation*/
8 | val conf = new SparkConf().setMaster("local").setAppName("Action")
9 | val sc = new SparkContext(conf)
10 |
11 | /* val rdd = sc.parallelize((1 to 10), 1)
12 | println(rdd.take(3)) //returns an array
13 | println(rdd.reduce(_+_))
14 | println(rdd.collect()) //returns an array
15 |
16 | val wc = sc.textFile("data/hello.txt", 1)
17 | wc.flatMap(_.split(",")).map((_,1)).reduceByKey(_+_)
18 | .saveAsTextFile("data/result") //only a target directory can be given
19 | //for countByKey, see the average-score example in Transformation
20 | */
21 | sc.stop()
22 | }
23 | }
--------------------------------------------------------------------------------
/notes/eclipse中Attach Source找不到源码,该如何查看jar包源码.md:
--------------------------------------------------------------------------------
1 | The main point: unlike in IDEA, viewing Spark source code inside a jar is awkward in eclipse
2 |
3 | ### Jars pulled in by maven
4 |
5 | a: automatic download
6 |
7 | In eclipse, tick Windows -> Preferences -> Maven -> Download Artifact Sources, then right-click the project and run Maven -> Update Project
8 |
9 | b: manual download
10 |
11 | Use the maven command line to download the source code of the dependencies:
12 |
13 | mvn dependency:sources
14 |
15 | mvn dependency:sources -DdownloadSources=true -DdownloadJavadocs=true
16 | -DdownloadSources=true downloads the source jars; -DdownloadJavadocs=true downloads the javadoc packages
17 |
18 | If the sources still are not downloaded after this, search the repository, download them into the local repository manually, and attach them in eclipse.
19 |
20 | ### Other jars
21 |
22 | Download or decompile the source and attach it manually
23 | a: method one
24 |
25 | 1. Hold Ctrl and click a method from the jar; you can choose to jump to the implementation,
26 |
27 | 2. an "attach to source" option will appear; click it, choose the source archive, and the source is attached.
28 |
29 | b: method two
30 |
31 | Right-click the jar, open Properties, and attach the source there
32 |
33 |
34 | Reposted from: https://blog.csdn.net/qq_21209681/article/details/72917837
--------------------------------------------------------------------------------
/notes/Spark DataFrame如何更改列column的类型.md:
--------------------------------------------------------------------------------
1 | In the example below, the age column of the DataFrame created from the original json file is of type Long, and it gets changed to another type. These are certainly not the only two ways, but I think they are the two simplest
2 |
3 | ```scala
4 | val spark = SparkSession.builder().master("local").appName("DataFrame API").getOrCreate()
5 |
6 | // Read one of the sample datasets shipped with the Spark examples and create a DataFrame
7 | val people = spark.read.format("json").load("data/people.json")
8 | people.show()
9 | people.printSchema()
10 |
11 | val p = people.selectExpr("cast(age as string) age_toString","name")
12 | p.printSchema()
13 |
14 | import spark.implicits._ //needed for the implicit conversions ($"col") and for RDD-to-DataFrame conversion
15 | import org.apache.spark.sql.types.DataTypes
16 | val people2 = people.withColumn("age", $"age".cast(DataTypes.IntegerType)) //DataTypes holds the available data types; note where the class lives
17 | people2.printSchema()
18 | ```
19 |
20 |
21 |
22 | See: [How to change column types in Spark SQL's DataFrame?](http://padown.com/questions/29383107/how-to-change-column-types-in-spark-sqls-dataframe)
23 |
24 |
25 |
26 | https://www.jianshu.com/p/0634527f3cce
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkCore/Persist.scala:
--------------------------------------------------------------------------------
1 | package sparkCore
2 |
3 | import org.apache.spark.SparkConf
4 | import org.apache.spark.SparkContext
5 |
6 | object Persist {
7 | def main(args:Array[String]): Unit = {
8 | val conf = new SparkConf().setMaster("local").setAppName("persist_cache")
9 | val sc = new SparkContext(conf)
10 | val rdd = sc.textFile("data/spark.md").cache()
11 | println("all length : "+rdd.map(_.length).reduce(_+_))
12 | //2019-02-11 16:02:57 INFO DAGScheduler:54 - Job 0 finished: reduce at Persist.scala:11, took 0.391666 s
13 | println("all_length not None:"+rdd.flatMap(_.split(" ")).map(_.length).reduce(_+_))
14 | //2019-02-11 16:02:58 INFO DAGScheduler:54 - Job 1 finished: reduce at Persist.scala:12, took 0.036668 s
15 |
16 | //without persistence
17 | //2019-02-11 16:05:50 INFO DAGScheduler:54 - Job 0 finished: reduce at Persist.scala:11, took 0.370967 s
18 | //2019-02-11 16:05:50 INFO DAGScheduler:54 - Job 1 finished: reduce at Persist.scala:13, took 0.050201 s
19 | }
20 | }
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 JosonLee
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkCore/sortedWordCount.scala:
--------------------------------------------------------------------------------
1 | package sparkCore
2 |
3 | import org.apache.spark.SparkConf
4 | import org.apache.spark.SparkContext
5 | import org.apache.spark.SparkContext._
6 | import org.apache.spark.rdd._
7 |
8 | object sortedWordCount {
9 | /*1. Count how many times each word in the text file appears.
10 | 2. Sort by each word's count, in descending order.
11 | */
12 | def main(args: Array[String]): Unit = {
13 | val conf = new SparkConf().setMaster("local").setAppName("sortedWordCount")
14 | val sc = new SparkContext(conf)
15 | val data = sc.textFile("data/spark.md", 1)
16 | data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false)
17 | .foreach(x ⇒ {
18 | println(x._1 + " appears " + x._2 + " times")
19 | })
20 | //an alternative approach, though sortBy above is simpler
21 | anotherSolution(data).foreach(x ⇒ {
22 | println(x._1 + " appears " + x._2 + " times")
23 | })
24 | }
25 |
26 | def anotherSolution(data: RDD[String]): RDD[(String, Int)] = {
27 | //the rdd passed in is the RDD read from the input file
28 | val wc = data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
29 | val countWords = wc.map(x => (x._2, x._1)).sortByKey(false)
30 | val sortedWC = countWords.map(x => (x._2, x._1))
31 |
32 | sortedWC
33 | }
34 | }
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkSql/RDDtoDataFrame2.scala:
--------------------------------------------------------------------------------
1 | package sparkSql
2 |
3 | import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
4 | import org.apache.spark.sql.Row
5 | import org.apache.spark.sql.SparkSession
6 |
7 | object RDDtoDataFrame2 {
8 | def main(args: Array[String]): Unit = {
9 | val spark = SparkSession.builder().master("local").appName("rdd to dataFrame").getOrCreate()
10 |
11 | import spark.implicits._
12 | //toDF needs the implicit conversions imported
13 | val rdd = spark.sparkContext.textFile("data/people.txt").cache()
14 | val header = rdd.first()
15 | val personRDD = rdd.filter(row => row != header) //drop the header row
16 | .map { line => Row(line.split(";")(0).toString, line.split(";")(1).toInt, line.split(";")(2).toString) }
17 |
18 | val structType = StructType(Array(
19 | StructField("name", StringType, true),
20 | StructField("age", IntegerType, true),
21 | StructField("job", StringType, true)))
22 |
23 | val rddToDF = spark.createDataFrame(personRDD, structType)
24 | rddToDF.createOrReplaceTempView("people")
25 |
26 | val results = spark.sql("SELECT name FROM people")
27 |
28 | results.map(attributes => "Name: " + attributes(0)).show()
29 | }
30 | }
--------------------------------------------------------------------------------
/notes/判断RDD是否为空.md:
--------------------------------------------------------------------------------
1 | ### How do you create an empty RDD?
2 |
3 | What an empty RDD is actually useful for, I don't know yet
4 |
5 | `sc.parallelize(List()) //or Seq()`
6 |
7 | Alternatively, spark defines an EmptyRDD
8 | ```scala
9 | /**
10 | * An RDD that has no partitions and no elements.
11 | */
12 | private[spark] class EmptyRDD[T: ClassTag](sc: SparkContext) extends RDD[T](sc, Nil) {
13 |
14 | override def getPartitions: Array[Partition] = Array.empty
15 | override def compute(split: Partition, context: TaskContext): Iterator[T] = {
16 | throw new UnsupportedOperationException("empty RDD")
17 | }
18 | }
19 | //it can be created via sc.emptyRDD[T]
20 | ```
21 | ### rdd.count == 0 vs rdd.isEmpty
22 | The first approaches that come to mind are either checking whether count is 0, or calling the isEmpty operator directly
23 |
24 | ```scala
25 | def isEmpty(): Boolean = withScope {
26 | partitions.length == 0 || take(1).length == 0
27 | }
28 | ```
29 |
30 | The isEmpty source likewise just checks the partition count and whether any data exists. On an empty RDD, isEmpty threw an exception: `Exception in thread "main" org.apache.spark.SparkDriverExecutionException: Execution error`
31 |
32 | > Calling isEmpty on emptyRDD does not raise this error; I don't know why
33 |
34 | I searched around afterwards and found the approaches below (see: https://my.oschina.net/u/2362111/blog/743754 )
35 |
36 | ### rdd.partitions().isEmpty()
37 |
38 | ```
39 | This suits a DStream that has not gone through something like a reduce operation.
40 | ```
41 |
42 | ### rdd.rdd().dependencies().apply(0).rdd().partitions().length==0
43 |
44 | ```
45 | This one can be used after a reduce-like operation.
46 | ```
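47 |
48 | As a small sketch of the plain-RDD checks discussed above (assuming an existing SparkContext `sc`):
49 |
50 | ```scala
51 | // giving the empty List an element type avoids the RDD[Nothing] issue that can trigger the exception above
52 | val rdd = sc.parallelize(List[Int]())
53 |
54 | // count() works, but it runs a full job just to learn the size
55 | println(rdd.count() == 0)
56 |
57 | // isEmpty() only looks at the partitions / the first element
58 | println(rdd.isEmpty())
59 |
60 | // equivalent to what isEmpty() does internally
61 | println(rdd.partitions.length == 0 || rdd.take(1).isEmpty)
62 | ```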
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkSql/RDDtoDataFrame.scala:
--------------------------------------------------------------------------------
1 | package sparkSql
2 |
3 | import org.apache.spark.sql.SparkSession
4 |
5 | object RDDtoDataFrame {
6 | case class Person(name: String, age: Int, job: String)
7 |
8 | def main(args: Array[String]): Unit = {
9 | val spark = SparkSession.builder().master("local").appName("rdd to dataFrame").getOrCreate()
10 |
11 | import spark.implicits._
12 | //toDF needs the implicit conversions imported
13 | val rdd = spark.sparkContext.textFile("data/people.txt").cache()
14 | val header = rdd.first()
15 | val rddToDF = rdd.filter(row => row != header) //drop the header row
16 | .map(_.split(";"))
17 | .map(x => Person(x(0).toString, x(1).toInt, x(2).toString))
18 | .toDF("name","age","job")
19 | rddToDF.show()
20 | rddToDF.printSchema()
21 |
22 | val dfToRDD = rddToDF.rdd //returns RDD[Row], an RDD of Row objects
23 | dfToRDD.foreach(println)
24 |
25 | dfToRDD.map(row =>Person(row.getAs[String]("name"), row.getAs[Int]("age"), row(2).toString())) //row.getAs[String](2) also works
26 | .foreach(p => println(p.name + ":" + p.age + ":"+ p.job))
27 | dfToRDD.map{row=>{
28 | val columnMap = row.getValuesMap[Any](Array("name","age","job"))
29 | Person(columnMap("name").toString(),columnMap("age").toString.toInt,columnMap("job").toString)
30 | }}.foreach(p => println(p.name + ":" + p.age + ":"+ p.job))
31 | // there is also an isNullAt method
32 | }
33 | }
--------------------------------------------------------------------------------
/notes/高级排序和topN问题.md:
--------------------------------------------------------------------------------
1 | ### Sorting by column and secondary sort
2 |
3 | Understand what sortByKey and sortBy do; beyond that, define a custom key to sort on (for details see the sorting discussion for MapReduce)
4 |
5 | ``` scala
6 | class MyKey(val first: Int, val second: Int) extends Ordered[MyKey] with Serializable {
7 | //remember this definition pattern
8 | def compare(that:MyKey): Int = {
9 | //define the comparison here
10 | }
11 | }
12 | ```
13 |
14 | ### Top-N and grouped top-N
15 |
16 | Top-N involves sorting; grouped top-N just adds sorting on top of grouping
17 |
18 | For grouping, be clear about groupBy vs groupByKey: the former groups into something like (t003,((t1003,19),(t1003,1090))), the latter into something like (t003,(19,1090)); a small sketch is at the end of this note
19 |
20 | Then comes the sorting itself; for a plain collection just call the utility methods: sorted, sortWith, sortBy
21 |
22 | > sorted: suits simple ascending/descending sorts of a single collection
23 | >
24 | > sortBy: suits sorting by one or more attributes, with little code
25 | >
26 | > sortWith: suits highly customized sorting rules; it is flexible and also supports one or more attributes, but takes slightly more code; internally the sorting goes through Java's Comparator interface
27 |
28 | ```scala
29 | scala> val arr = Array(3,4,5,1,2,6)
30 | arr: Array[Int] = Array(3, 4, 5, 1, 2, 6)
31 | scala> arr.sorted
32 | res0: Array[Int] = Array(1, 2, 3, 4, 5, 6)
33 | scala> arr.sorted.take(3)
34 | res1: Array[Int] = Array(1, 2, 3)
35 | scala> arr.sorted.takeRight(3)
36 | res3: Array[Int] = Array(4, 5, 6)
37 | scala> arr.sorted(Ordering.Int.reverse)
38 | res4: Array[Int] = Array(6, 5, 4, 3, 2, 1)
39 | scala> arr.sorted.reverse
40 | res5: Array[Int] = Array(6, 5, 4, 3, 2, 1)
41 | scala> arr.sortWith(_>_)
42 | res6: Array[Int] = Array(6, 5, 4, 3, 2, 1)
43 | ```
44 |
45 | Several ways to write a descending sort:
46 |
47 | ```scala
48 | arr.sorted(Ordering.Int.reverse)
49 | arr.sorted.reverse
50 | arr.sortWith(_>_)
51 | ```
52 |
53 |
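54 | A quick sketch of the groupBy / groupByKey difference mentioned above (assuming an existing SparkContext `sc`; the sample data is made up):
55 |
56 | ```scala
57 | val pairs = sc.parallelize(Seq(("t003", 19), ("t003", 1090), ("t001", 367)))
58 |
59 | // groupBy keeps the whole element as the value: (t003, List((t003,19), (t003,1090)))
60 | pairs.groupBy(_._1).mapValues(_.toList).foreach(println)
61 |
62 | // groupByKey keeps only the values: (t003, List(19, 1090))
63 | pairs.groupByKey().mapValues(_.toList).foreach(println)
64 | ```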
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkCore/TopN .scala:
--------------------------------------------------------------------------------
1 | package sparkCore
2 |
3 | import org.apache.spark.SparkConf
4 | import org.apache.spark.SparkContext
5 | import org.apache.spark.rdd._
6 |
7 | object TopN {
8 | //Top3
9 | def top3(data: RDD[String]): Array[String] = {
10 | if (data.isEmpty()) {
11 | println("RDD为空,返回空Array")
12 | Array()
13 | }
14 | else if (data.count() <= 3) {
15 | data.collect()
16 | } else {
17 | val sortedData = data.map(x => (x, x.split(" ")(1).toInt)).sortBy(_._2, false)
18 | val result = sortedData.take(3).map(_._1)
19 | result
20 | }
21 | }
22 | //grouped top-N
23 | def groupTopN(data: RDD[String],n:Int): RDD[(String,List[Int])] = {
24 | //keep it simple for now
25 | //after grouping, entries look like (t003,(19,1090,190,109))
26 | val groupParis = data.map { x =>
27 | (x.split(" ")(0), x.split(" ")(1).toInt)
28 | }.groupByKey()
29 | val sortedData = groupParis.map(x=>
30 | {
31 | //within each group, sort (ascending by default), then reverse and take the first n
32 | val sortedLists = x._2.toList.sorted.reverse.take(n)
33 | (x._1,sortedLists)
34 | })
35 | sortedData
36 | }
37 | def main(args: Array[String]): Unit = {
38 | val conf = new SparkConf()
39 | .setAppName("Top3")
40 | .setMaster("local")
41 | val sc = new SparkContext(conf)
42 | val lines = sc.textFile("data/topN.txt").cache()
43 |
44 | top3(lines).foreach(println)
45 |
46 | groupTopN(lines, 3).sortBy(_._1, true).foreach(x=>{
47 | print(x._1+": ")
48 | println(x._2.mkString(","))
49 | })
50 | }
51 | }
52 |
53 | /*several ways to write the descending sort:
54 | arr.sorted(Ordering.Int.reverse)
55 | arr.sorted.reverse
56 | arr.sortWith(_>_)
57 | */
--------------------------------------------------------------------------------
/notes/eclipse如何导入Spark源码方便阅读.md:
--------------------------------------------------------------------------------
1 | I recently wanted to read the Spark SQL source code, so I looked up some articles. Most cover how to import it into IDEA, or talk about building the Spark source yourself before importing; I am not yet at the level of modifying the source, so I skipped the build and imported the source directly for reading. The process is as follows
2 |
3 |
4 |
5 | ### Download the Spark source
6 |
7 | Download the Spark version you need from https://github.com/apache/spark , as shown
8 |
9 | 
10 |
11 | > This also makes it easy to Ctrl+click in eclipse to jump into the source. Concretely, just point Attach Source at the location of the downloaded source
12 | >
13 | > 1. Hold Ctrl and click a method from a jar; you can choose to jump to the implementation
14 | >
15 | > 2. An attach to source option will appear; click it, select the downloaded source, and it is attached
16 |
17 | ### Import into eclipse
18 |
19 | The downloaded project is a maven project, so it can be imported directly; eclipse of course needs the maven plugin installed
20 |
21 |
22 |
23 | In Eclipse: File->Import->Import Existing Maven Projects
24 |
25 |
26 |
27 | As shown, I downloaded and unpacked the source package spark-2.4.0 (renamed to spark-2.4.0-src to tell it apart), imported it, and then selected the sources I wanted to read (just untick the ones you don't want)
28 |
29 | > The Spark sub-modules are:
30 | >
31 | > - spark-catalyst: Spark's lexing, parsing, abstract syntax tree (AST) generation, optimizer, logical plan generation, physical plan generation, and so on.
32 | > - spark-core: Spark's most fundamental and core functionality.
33 | > - spark-examples: example applications in multiple languages for people learning Spark.
34 | > - spark-sql: Spark's general-purpose query engine based on the SQL standard.
35 | > - spark-hive: support for Hive metadata and data, built on Spark SQL.
36 | > - spark-mesos: Spark's support for Mesos.
37 | > - spark-mllib: Spark's machine learning module.
38 | > - spark-streaming: Spark's support for stream processing.
39 | > - spark-unsafe: direct manipulation of system memory to improve performance.
40 | > - spark-yarn: Spark's support for Yarn.
41 |
42 | 
43 |
44 |
45 |
46 | After clicking Finish, maven downloads the dependencies automatically. If it does not, right-click the project -> Maven -> Download Source -> Update Project...
47 |
48 |
49 |
50 | One more issue: it may still error out at the end, for example some jars not getting downloaded. The problem I hit was that spark 2.4.0 is built on scala 2.11, but my newer eclipse plugin ships scala 2.12 and sets scala to 2.12 when creating a maven project, so the dependencies presumably get confused (my guess). In the end I set scala 2.11 on the project, maven handled the rest, and it worked.
51 |
52 | As shown, I imported the spark-sql source
53 |
54 |
55 |
56 | 
57 |
58 | References:
59 |
60 | - https://ymgd.github.io/codereader/2018/04/16/Spark%E4%BB%A3%E7%A0%81%E7%BB%93%E6%9E%84%E5%8F%8A%E8%BD%BD%E5%85%A5Ecplise%E6%96%B9%E6%B3%95/
61 | - http://www.cnblogs.com/zlslch/p/7457352.html
--------------------------------------------------------------------------------
/notes/LearningSpark(3)RDD操作.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ### Operations on key-value RDDs and the implicit conversions
4 | Shuffle operations usually act on a group of values for a given key. Operations such as groupByKey and reduceByKey live in PairRDDFunctions and need Spark's implicit conversions enabled, so that scala automatically wraps RDDs of tuples. Just import `org.apache.spark.SparkContext._`
5 |
6 | Nothing deep here; just remember that `import org.apache.spark.SparkContext._` brings the implicit conversions in
7 |
8 | ### Common Transformation operators
9 |
10 | - map: maps each element of the RDD through a function into a new element. **Input and output partitions correspond one to one**
11 |
12 | - flatMap: like map, but the outputs are flattened (merged) into a single collection
13 |
14 | - **Note**: flatMap flattens a String into an array of characters, but it will not flatten an Array[String]
15 |
16 | map and flatMap differ in one more way, shown in the code below [more on the map/flatMap difference here](http://www.brunton-spall.co.uk/post/2011/12/02/map-map-and-flatmap-in-scala/)
17 |
18 | ```scala
19 | scala> val list = List(1,2,3,4,5)
20 | list: List[Int] = List(1, 2, 3, 4, 5)
21 |
22 | scala> def g(v:Int) = List(v-1, v, v+1)
23 | g: (v: Int)List[Int]
24 |
25 | scala> list.map(x => g(x))
26 | res0: List[List[Int]] = List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6))
27 |
28 | scala> list.flatMap(x => g(x))
29 | res1: List[Int] = List(0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6)
30 | ```
31 |
32 | - filter: filters the RDD's elements through func; elements for which func returns true are kept, the rest are dropped
33 |
34 | - distinct: removes duplicate elements from the RDD
35 |
36 | - reduceByKey(func): reduces (aggregates) the values belonging to each key in the RDD
37 |
38 | - groupByKey(): groups by key; each key maps to an `Iterable` of its values
39 |
40 | - sortByKey([ascending]): sorts the RDD by key; the boolean ascending controls the direction [ascending by default]
41 |
42 | - join(otherDataset): joins two RDDs of pairs; every pair of values joined on a key can then be passed to a user-defined function for processing
43 |
44 |
45 | ### Common Action operators
46 |
47 | - reduce(func): aggregates the dataset with func, which takes two elements and returns one. Mostly used for computations
48 | - count(): counts the number of elements in the dataset
49 | - collect(): returns all elements of the dataset as an array. Since everything is loaded into memory, the dataset must be small, otherwise it may overflow
50 | - take(n): returns the first n elements of the dataset as an array; this is not executed in parallel, the driver computes the elements
51 | - foreach(func): applies func to each element of the RDD
52 | - saveAsTextFile(path): writes the dataset's elements as a text file (or set of text files) into the given directory on the local filesystem, HDFS, or any other Hadoop-supported filesystem. Spark calls toString on each element to turn it into a line of the text file
53 | - countByKey(): only for RDDs of type (K, V). Returns a Map of (K, Int) pairs with the count of each key
54 | - sortBy(func, ascending): sorts the RDD's elements by the key function func; ascending controls the direction [default true, ascending; func usually picks which value to sort by]
55 |
56 |
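57 | A short sketch (assuming an existing SparkContext `sc` and the repo's data/hello.txt) exercising a few of the operators above:
58 |
59 | ```scala
60 | val words = sc.textFile("data/hello.txt").flatMap(_.split(","))  // transformations are lazy
61 | val counts = words.map((_, 1)).reduceByKey(_ + _)                // still lazy
62 |
63 | counts.sortBy(_._2, false).take(2).foreach(println)  // actions trigger the job: top 2 by frequency
64 | println(counts.count())                              // number of distinct words
65 | println(words.map((_, 1)).countByKey())              // Map(word -> number of occurrences)
66 | ```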
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkSql/DataFrameOperations.scala:
--------------------------------------------------------------------------------
1 | package sparkSql
2 |
3 | import org.apache.spark.sql.SparkSession
4 | import org.apache.spark.sql
5 | import org.apache.spark.sql.functions
6 | import org.apache.spark.sql.types.DataTypes
7 |
8 | object DataFrameOperations {
9 | def main(args: Array[String]): Unit = {
10 | val spark = SparkSession.builder().master("local").appName("DataFrame API").getOrCreate()
11 |
12 | // Read one of the sample datasets shipped with the Spark examples and create a DataFrame
13 | val people = spark.read.format("json").load("data/people.json")
14 | // people.show()
15 | // people.printSchema()
16 |
17 | // val p = people.selectExpr("cast(age as string) age_toString","name")
18 | // p.printSchema()
19 |
20 | import spark.implicits._ //needed for the implicit conversions ($"col") and for RDD-to-DataFrame conversion
21 | // change a column's type; the selectExpr above can do this too
22 | val people2 = people.withColumn("age", $"age".cast(sql.types.StringType)) //sql.types / DataTypes hold the available types; note where the classes live
23 | people2.printSchema()
24 | // people.select(functions.col("age").cast(DataTypes.DoubleType)).show()
25 |
26 | // people.select($"name", $"age" + 1).show()
27 | // people.select(people("age")+1, people.col("name")).show()
28 | // people.select("name").show() //equivalent to select($"name"); the latter's advantages are noted above
29 | // people.filter($"name".contains("ust")).show()
30 | // people.filter($"name".like("%ust%")).show
31 | // people.filter($"name".rlike(".*ust.*")).show()
32 | println("Test Filter*****************")
33 | // people.filter(people("name") contains ("ust")).show()
34 | // people.filter(people("name") like ("%ust%")).show()
35 | // people.filter(people("name") rlike (".*?ust.*?")).show()
36 |
37 | println("Filter中如何取反*****************")
38 | // people.filter(!(people("name") contains ("ust"))).select("name", "age").show()
39 | people.groupBy("age").count().show()
40 | // people.createOrReplaceTempView("sqlDF")
41 | // spark.sql("select * from sqlDF where name not like '%ust%' ").show()
42 |
43 | }
44 | }
--------------------------------------------------------------------------------
/notes/Spark2.4+Hive使用Hive现有仓库.md:
--------------------------------------------------------------------------------
1 | ### Preparation
2 |
3 | - copy hive-site.xml into the $SPARK_HOME/conf directory
4 | - the jar hive uses to connect to mysql (mysql-connector-java-8.0.13.jar) must also be copied into the $SPARK_HOME/jars directory
5 | - or point at the jar with --jars in the spark-submit script
6 | - or add the jar's location to the classpath in spark-env.sh: `export SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/the/jar`
7 | > in my test this did not work
8 |
9 | ### The spark.sql.warehouse.dir parameter
10 |
11 | When the getting-started docs explain how spark sql works on hive, [they give the example below](http://spark.apachecn.org/#/docs/7?id=hive-%E8%A1%A8), along with the spark.sql.warehouse.dir setting
12 | ```scala
13 | val spark = SparkSession
14 | .builder()
15 | .appName("Spark Hive Example")
16 | .config("spark.sql.warehouse.dir", warehouseLocation)
17 | .enableHiveSupport()
18 | .getOrCreate()
19 | ```
20 | This parameter points at the location of the hive data warehouse
21 | > spark 1.x used the parameter "hive.metastore.warehouse"; since spark 2.0.0 it no longer takes effect and spark.sql.warehouse.dir should be used instead
22 |
23 | For example, my hive warehouse is configured on hdfs, so spark.sql.warehouse.dir=hdfs://master:9000/hive/warehouse
24 | ```
25 | <property>
26 |   <name>hive.metastore.warehouse.dir</name>
27 |   <value>/hive/warehouse</value>
28 | </property>
29 | ```
30 |
31 | One caveat: all of the above only works as expected if hive itself is properly deployed. If it is not, spark will by default override hive's configuration, because spark also ships a spark-hive module and will fall back to its built-in hive. For example, try using the hive warehouse from spark-shell on the command line and you will find a metastore_db directory (the metadata) created in the current directory
32 |
33 | Also, this parameter can be read from hive-site.xml, so it can be left out. But when programming in eclipse or idea, hive-site.xml cannot be read unless it sits in the resource folder, so it is better to set the parameter explicitly
34 |
35 | ### Fixing the errors
36 | 1. The mysql driver jar cannot be loaded, leading to a series of errors such as failing to access the database. The fix is to put the jar into the $SPARK_HOME/jars directory as described above
37 | > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
38 |
39 | 2. Table 'hive.PARTITIONS' doesn't exist
40 | > ERROR Datastore:115 - Error thrown executing ALTER TABLE `PARTITIONS` ADD COLUMN `TBL_ID` BIGINT NULL : Table 'hive.PARTITIONS' doesn't exist
41 | java.sql.SQLSyntaxErrorException: Table 'hive.PARTITIONS' doesn't exist
42 |
43 | Not sure what to make of this one; the program still runs but throws this error and a warning
44 | Found on stackoverflow: set the parameter config("spark.sql.hive.verifyPartitionPath", "false")
45 | See: https://stackoverflow.com/questions/47933705/spark-sql-fails-if-there-is-no-specified-partition-path-available
46 |
47 | References:
48 |
49 | - [Spark 2.2.1 + Hive 案例之不使用现有的Hive环境;使用现有的Hive数据仓库;UDF自定义函数](https://blog.csdn.net/duan_zhihua/article/details/79335625)
50 | - [Spark的spark.sql.warehouse.dir相关](https://blog.csdn.net/u013560925/article/details/79854072)
51 | - https://www.jianshu.com/p/60e7e16fb3ce
52 |
--------------------------------------------------------------------------------
/notes/LearningSpark(6)Spark内核架构剖析.md:
--------------------------------------------------------------------------------
1 | ## Kernel architecture under Standalone mode
2 |
3 | ### Application and spark-submit
4 |
5 | The Application is the spark program you write; spark-submit is the script that submits the application to the spark cluster to run
6 |
7 | ### Driver
8 |
9 | Whichever machine spark-submit is run on starts the Driver process, which executes the application.
10 |
11 | Just as in the programs we write, the first thing it does is create a SparkContext
12 |
13 | ### What SparkContext does
14 |
15 | sc provides the context (below, sc also stands for SparkContext); while initializing, sc creates the DAGScheduler, the TaskScheduler, and the Spark UI [ignore the UI for now]
16 |
17 | The important parts are the DAGScheduler and the TaskScheduler, covered one by one below
18 |
19 | ### What the TaskScheduler does
20 |
21 | The TaskScheduler is, as the name says, the task scheduler and is tied to task execution
22 |
23 | The TaskScheduler constructed by sc has its own background process, which connects to the Master node of the spark cluster and registers the Application
24 |
25 | #### Master
26 |
27 | After accepting the Application's registration request, the Master uses its resource-scheduling algorithm to start a series of Executor processes on the cluster's Worker nodes (the Master tells the Workers to start Executors)
28 |
29 | #### Worker and Executor
30 |
31 | A Worker node starts an Executor process; the Executor creates a thread pool and, once started, registers back with the Driver's TaskScheduler, telling it which Executors are available.
32 |
33 | They do more than that, but first a word here on how tasks are executed.
34 |
35 | For every task an Executor receives, it wraps the task with a TaskRunner (wrapping means copying and deserializing the code, operators, functions, and so on), and then takes a thread from the thread pool to run the task
36 |
37 | #### About Task
38 |
39 | Since tasks came up, a quick note on them. Tasks come in two kinds, ShuffleMapTask and ResultTask, and each task corresponds to one Partition of an RDD. As mentioned below, the DAGScheduler splits a submitted job into multiple Stages (based on wide vs. narrow dependencies); each Stage corresponds to one TaskSet (a series of tasks). The ResultTask corresponds to the final Stage; all the Stages before it consist of ShuffleMapTasks
40 |
41 | ### DAGScheduler
42 |
43 | The other important job of SparkContext is constructing the DAGScheduler. Every time an Action operator is hit in the program, a Job is submitted (think lazy evaluation), and the DAGScheduler splits the submitted job into multiple Stages. A Stage is a TaskSet and is handed to the TaskScheduler; since the Executors have already registered their available resources with the TaskScheduler, the TaskScheduler assigns the tasks in the TaskSet to Executors to run (this involves the task-assignment algorithm). How they are run is described in the Executor section above
44 |
45 | ## What differs in Spark on Yarn mode
46 |
47 | As mentioned before, on Yarn there are the yarn-client and yarn-cluster modes, distinguished in the spark-submit script with `--master` and `--deploy-mode` [details in: [LearningSpark(2)spark-submit可选参数.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(2)spark-submit%E5%8F%AF%E9%80%89%E5%8F%82%E6%95%B0.md)]
48 |
49 | As the official docs put it, `--deploy-mode` chooses whether your driver is deployed on a worker node (cluster) or locally as an external client (client, the default), which is relevant to what follows
50 |
51 | Because everything runs on a Yarn cluster, there are no Master or Worker nodes; in their place are the ResourceManager and NodeManager (RM and NM below)
52 |
53 | ### The yarn-cluster mode
54 |
55 | After spark-submit submits the Application, a request is sent to the RM asking it to start the ApplicationMaster (like the Master in standalone mode, except this node also runs the Driver process [this is where it differs from yarn-client]). The RM allocates a container and starts the ApplicationMaster on some NM
56 |
57 | Running tasks requires Executors, so the ApplicationMaster requests containers from the RM to start Executors in. The RM assigns some containers (that is, some NM nodes) to the ApplicationMaster for starting Executors, and the ApplicationMaster connects to those NMs (here the NM acts like a Worker). Once an NM has started an Executor, the Executor registers with the ApplicationMaster
58 |
59 | ### The yarn-client mode
60 |
61 | As noted above, the difference in this mode is that the Driver is deployed on the local machine that did the submitting. The flow is roughly the same as yarn-cluster, except that the ApplicationMaster is actually an ExecutorLauncher, and the Executors started on the granted NodeManagers register with the local Driver instead of with the ApplicationMaster
--------------------------------------------------------------------------------
/notes/LearningSpark(2)spark-submit可选参数.md:
--------------------------------------------------------------------------------
1 | ## The submission script and its options
2 |
3 | You can run in local mode to test a program, but running on a cluster goes through the spark-submit script. The template in the official docs looks like this (showing which parameters are required):
4 |
5 | ```
6 | ./bin/spark-submit \
7 |   --class <main-class> \
8 |   --master <master-url> \
9 |   --deploy-mode <deploy-mode> \
10 |   --conf <key>=<value> \
11 |   ... # other options
12 |   <application-jar> \
13 |   [application-arguments]
14 | ```
15 |
16 | Common parameters:
17 | - `--master` sets the cluster the SparkContext connects to; if omitted it defaults to local[*] [so the master does not have to be hard-coded in the SparkContext]
18 |
19 | - `--jars` lists JAR packages to add to the classpath; separate multiple JARs with commas
20 |
21 | - `--class` names the program's entry class
22 |
23 | - `--deploy-mode` chooses where your driver is deployed: on a worker node (cluster) or locally as an external client (client, the default)
24 |
25 | > A quick aside on the difference between yarn-client and yarn-cluster
26 | 
27 |
28 | - `application-jar`: the path to a bundled Jar containing your application and all its dependencies. The Jar's URL must be globally visible on your cluster, e.g. an hdfs:// path or a file:// path visible on every node.
29 |
30 | - `application-arguments`: the arguments passed to the main method of your main class
31 |
32 | - `driver-memory` is the memory the driver uses; it cannot exceed what a single machine has available
33 |
34 | - `num-executors` is how many executors to create
35 |
36 | - `executor-memory` is the maximum memory each executor uses; it cannot exceed a single machine's available memory
37 |
38 | - `executor-cores` is the maximum number of Tasks each executor can run concurrently
39 |
40 | ```
41 | # the following runs the SparkPi test program in spark on yarn mode
42 | # one thing to watch: when wrapping a line, put a space before the trailing backslash, otherwise the statement gets glued to the next line when parsed and you get odd errors
43 | [hadoop@master spark-2.4.0-bin-hadoop2.6]$ ./bin/spark-submit \
44 | > --master yarn \
45 | > --class org.apache.spark.examples.SparkPi \
46 | > --deploy-mode client \
47 | > --driver-memory 1g \
48 | > --num-executors 2 \
49 | > --executor-memory 2g \
50 | > --executor-cores 2 \
51 | > examples/jars/spark-examples_2.11-2.4.0.jar \
52 | > 10
53 | ```
54 | Typing all of this for every submission is tedious; wrap it in a script
55 |
56 | ## Loading configuration from a file
57 |
58 | The **spark-submit** script can load default [Spark configuration values](http://spark.apache.org/docs/latest/configuration.html) from a **properties** file and pass them to your application. By default it reads ***conf/spark-defaults.conf*** in the **Spark** directory. For details see the [loading default configurations](http://spark.apache.org/docs/latest/configuration.html#loading-default-configurations) section.
59 |
60 | Loading defaults this way removes the need for some of the spark-submit flags. For example, if the ***spark.master*** property is set, you can safely omit it from **spark-submit**. In general, values set explicitly on a **SparkConf** have the highest precedence, then values passed to **spark-submit**, and finally the values in the defaults file.
61 |
62 | If you are not sure where a given setting comes from, run **spark-submit** with the ***--verbose*** option to print fine-grained debug information
63 |
64 | More in the docs: [Submitting applications](http://cwiki.apachecn.org/pages/viewpage.action?pageId=3539265), [Spark-Submit parameter notes](https://www.alibabacloud.com/help/zh/doc-detail/28124.htm)
65 |
66 |
67 |
68 | ## Configuration precedence
69 |
70 | Parameters set in the SparkConf have the highest precedence, then those given to the spark-submit script, and finally those in the default properties file (spark-defaults.conf)
71 |
72 | By default, spark-submit also reads configuration from spark-defaults.conf
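73 |
74 | For instance, a conf/spark-defaults.conf might hold entries like the following (the values here are only illustrative):
75 |
76 | ```
77 | spark.master                    yarn
78 | spark.executor.memory           2g
79 | spark.serializer                org.apache.spark.serializer.KryoSerializer
80 | ```
81 |
82 | With these in place, --master can be dropped from the spark-submit command above, while anything set explicitly on the SparkConf in code still takes precedence.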
--------------------------------------------------------------------------------
/notes/LearningSpark(4)Spark持久化操作.md:
--------------------------------------------------------------------------------
1 | ## Persistence
2 |
3 | An important Spark feature: when an RDD is persisted, each node keeps the RDD partitions it holds in memory (or on disk), so later repeated operations on that RDD reuse the cached partitions instead of recomputing the RDD.
4 | Persistence is of course meant for RDDs that will be computed on and reused many times; otherwise the RDD gets recomputed repeatedly, wasting resources and hurting performance
5 |
6 | > Used well, RDD persistence can in some scenarios improve a spark application's performance tenfold. For iterative algorithms and fast interactive applications, RDD persistence is essential
7 |
8 | Persistence also has built-in fault tolerance: if a cached partition is lost, it is automatically recomputed from its source RDD through the chain of transformations
9 |
10 | Some of Spark's shuffle operations also persist intermediate data automatically, to avoid recomputing the whole input when a shuffle fails
11 |
12 | ## How to persist
13 |
14 | Use the cache() and persist() methods; both are transformation operators. To use persistence, assign the cached RDD to a variable and reuse that variable later; also, do not just chain an action right after cache/persist, or nothing is really persisted
15 |
16 | cache() is simply persist() with the memory-only level; the source:
17 | ```scala
18 | def cache(): this.type = persist()
19 | def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
20 | ```
21 |
22 | persist additionally takes a Storage Level [the persistence strategy]
23 |
24 | | Storage Level | Meaning |
25 | | ------------------------------------- | ------------------------------------------------------------ |
26 | | MEMORY_ONLY | Persist as deserialized Java objects in JVM memory. If the RDD's partitions cannot all fit in memory, the partitions that were not persisted are recomputed the next time they are needed |
27 | | MEMORY_AND_DISK | Cache in memory first; when memory runs out, spill to disk. Partitions needed later are read back from disk |
28 | | MEMORY_ONLY_SER | Same as MEMORY_ONLY, but the Java objects are serialized before being persisted. This reduces memory use but adds CPU cost for deserialization |
29 | | MEMORY_AND_DISK_SER | Same as MEMORY_AND_DISK, with the same serialization as above |
30 | | DISK_ONLY | Persist entirely on disk |
31 | | MEMORY_ONLY_2 [or any level ending in _2] | Levels ending in _2 keep a replica of the persisted data on another node, so when data is lost nothing has to be recomputed; the backup copy is used instead |
32 |
33 | As noted above, persistence is itself a transformation operator, so it only takes effect once an action operator is reached
34 |
35 | ```scala
36 | final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
37 | if (storageLevel != StorageLevel.NONE) {
38 | getOrCompute(split, context)
39 | } else {
40 | computeOrReadCheckpoint(split, context)
41 | }
42 | }
43 | ```
44 |
45 | Whether the RDD is persisted is decided by its StorageLevel. getOrCompute uses the RDD's RDDBlockId as the key into the BlockManager to check whether this RDD has been cached; if not, it is computed from its dependencies and put into the BlockManager; if it already exists, it is fetched from the BlockManager directly
46 |
47 | ### How to choose a persistence strategy
48 |
49 | > The persistence levels Spark offers are mainly a trade-off between CPU and memory use. Some general advice on choosing a level:
50 | >
51 | > 1. Prefer MEMORY_ONLY; if all the data fits in the cache, use it. Pure in-memory is fastest, and with no serialization there is no CPU spent on deserialization.
52 | > 2. If MEMORY_ONLY cannot hold all the data, use MEMORY_ONLY_SER and store the data serialized; it is still all in memory and very fast, it just costs CPU to deserialize.
53 | > 3. If fast failure recovery is needed, choose a level with the _2 suffix so the data is replicated; on failure nothing has to be recomputed.
54 | > 4. Avoid the DISK-based levels if at all possible; sometimes reading from disk is slower than simply recomputing.
55 |
56 | ### Un-persisting: unpersist()
57 |
58 |
59 |
60 | References: https://blog.csdn.net/weixin_35602748/article/details/78667489 and lecture 245 of the Beifeng spark course
61 |
62 | 
63 |
64 |
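65 |
66 | A minimal sketch of picking a storage level and releasing it again with unpersist() (assuming an existing SparkContext `sc`):
67 |
68 | ```scala
69 | import org.apache.spark.storage.StorageLevel
70 |
71 | val logs = sc.textFile("data/spark.md").persist(StorageLevel.MEMORY_AND_DISK_SER)
72 |
73 | println(logs.count())                     // the first action computes and caches the partitions
74 | println(logs.filter(_.nonEmpty).count())  // reuses the cached partitions
75 |
76 | logs.unpersist()  // drop the cached blocks once the RDD is no longer needed
77 | ```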
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Learning-Spark
2 | Code for learning Spark, covering Spark Core, Spark SQL, Spark Streaming, and Spark MLlib
3 |
4 | ## Notes
5 |
6 | ### Development environment
7 |
8 | - Based on Deepin Linux 15.9
9 | - Based on Hadoop 2.6, Spark 2.4, Scala 2.11, Java 8, etc.
10 |
11 | > Articles on setting up this environment are listed below
12 | > - [【向Linux迁移记录】Deepin下java、大数据开发环境配置【一】](https://blog.csdn.net/lzw2016/article/details/86566873)
13 | > - [【向Linux迁移记录】Deepin下Python开发环境搭建](https://blog.csdn.net/lzw2016/article/details/86567436)
14 | > - [【向Linux迁移记录】Deepin Linux下快速Hadoop完全分布式集群搭建](https://blog.csdn.net/lzw2016/article/details/86618345)
15 | > - [【向Linux迁移记录】基于Hadoop集群的Hive安装与配置详解](https://blog.csdn.net/lzw2016/article/details/86631115)
16 | > - [【向Linux迁移记录】Deepin Linux下Spark本地模式及基于Yarn的分布式集群环境搭建](https://blog.csdn.net/lzw2016/article/details/86718403)
17 | > - [eclipse安装Scala IDE插件及An internal error occurred during: "Computing additional info"报错解决](https://blog.csdn.net/lzw2016/article/details/86717728)
18 | > - [Deepin Linux 安装启动scala报错 java.lang.NumberFormatException: For input string: "0x100" 解决](https://blog.csdn.net/lzw2016/article/details/86618570)
19 | > - More at: https://blog.csdn.net/lzw2016/ and https://github.com/josonle/Coding-Now
20 |
21 | ### Repository layout
22 |
23 | - [Spark_With_Scala_Testing](https://github.com/josonle/Learning-Spark/tree/master/Spark_With_Scala_Testing) holds everyday practice code
24 | - notes holds the study notes
25 | - [LearningSpark(1)数据来源.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(1)%E6%95%B0%E6%8D%AE%E6%9D%A5%E6%BA%90.md)
26 | - [LearningSpark(2)spark-submit可选参数.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(2)spark-submit%E5%8F%AF%E9%80%89%E5%8F%82%E6%95%B0.md)
27 | - [LearningSpark(3)RDD操作.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(3)RDD%E6%93%8D%E4%BD%9C.md)
28 | - [LearningSpark(4)Spark持久化操作](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(4)Spark%E6%8C%81%E4%B9%85%E5%8C%96%E6%93%8D%E4%BD%9C.md)
29 | - [LearningSpark(5)Spark共享变量.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(5)Spark%E5%85%B1%E4%BA%AB%E5%8F%98%E9%87%8F.md)
30 | - [LearningSpark(6)Spark内核架构剖析.md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(6)Spark%E5%86%85%E6%A0%B8%E6%9E%B6%E6%9E%84%E5%89%96%E6%9E%90.md)
31 | - [LearningSpark(7)SparkSQL之DataFrame学习(含Row).md](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(7)SparkSQL%E4%B9%8BDataFrame%E5%AD%A6%E4%B9%A0.md)
32 | - [LearningSpark(8)RDD如何转化为DataFrame](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(8)RDD%E5%A6%82%E4%BD%95%E8%BD%AC%E5%8C%96%E4%B8%BADataFrame.md)
33 | - [LearningSpark(9)SparkSQL数据来源](https://github.com/josonle/Learning-Spark/blob/master/notes/LearningSpark(9)SparkSQL%E6%95%B0%E6%8D%AE%E6%9D%A5%E6%BA%90.md)
34 | - [RDD如何作为参数传给函数.md](https://github.com/josonle/Learning-Spark/blob/master/notes/RDD%E5%A6%82%E4%BD%95%E4%BD%9C%E4%B8%BA%E5%8F%82%E6%95%B0%E4%BC%A0%E7%BB%99%E5%87%BD%E6%95%B0.md)
35 | - [判断RDD是否为空](https://github.com/josonle/Learning-Spark/blob/master/notes/%E5%88%A4%E6%96%ADRDD%E6%98%AF%E5%90%A6%E4%B8%BA%E7%A9%BA)
36 | - [高级排序和topN问题.md](https://github.com/josonle/Learning-Spark/blob/master/notes/%E9%AB%98%E7%BA%A7%E6%8E%92%E5%BA%8F%E5%92%8CtopN%E9%97%AE%E9%A2%98.md)
37 | - [Spark1.x和2.x如何读取和写入csv文件](https://blog.csdn.net/lzw2016/article/details/85562172)
38 | - [Spark DataFrame如何更改列column的类型.md](https://github.com/josonle/Learning-Spark/blob/master/notes/Spark%20DataFrame%E5%A6%82%E4%BD%95%E6%9B%B4%E6%94%B9%E5%88%97column%E7%9A%84%E7%B1%BB%E5%9E%8B.md)
39 | - [使用JDBC将DataFrame写入mysql.md](https://github.com/josonle/Learning-Spark/blob/master/notes/%E4%BD%BF%E7%94%A8JDBC%E5%B0%86DataFrame%E5%86%99%E5%85%A5mysql.md)
40 | - Scala language points
41 | - [Scala排序函数使用.md](https://github.com/josonle/Learning-Spark/blob/master/notes/Scala%E6%8E%92%E5%BA%8F%E5%87%BD%E6%95%B0%E4%BD%BF%E7%94%A8.md)
42 | - [报错和问题归纳.md](https://github.com/josonle/Learning-Spark/blob/master/notes/%E6%8A%A5%E9%94%99%E5%92%8C%E9%97%AE%E9%A2%98%E5%BD%92%E7%BA%B3.md)
43 |
44 | To be continued
45 |
--------------------------------------------------------------------------------
/notes/Scala排序函数使用.md:
--------------------------------------------------------------------------------
1 | Scala has three sorting methods: sorted, sortBy, and sortWith
2 |
3 | What each of them does:
4 |
5 | (1) sorted
6 |
7 | Natural ordering of a collection, driven by an implicit Ordering
8 |
9 | (2) sortBy
10 |
11 | Sorts on one or more attributes, using their types.
12 |
13 | (3) sortWith
14 |
15 | Function-based sorting: a comparator function implements the custom sorting logic.
16 |
17 | Example one: sorting a single collection on a single field
18 |
19 | ```scala
20 | val xs=Seq(1,5,3,4,6,2)
21 | println("==============sorted排序=================")
22 | println(xs.sorted) //升序
23 | println(xs.sorted.reverse) //降序
24 | println("==============sortBy排序=================")
25 | println( xs.sortBy(d=>d) ) //升序
26 | println( xs.sortBy(d=>d).reverse ) //降序
27 | println("==============sortWith排序=================")
28 | println( xs.sortWith(_<_) )//升序
29 | println( xs.sortWith(_>_) )//降序
30 | ```
31 |
32 | Result:
33 |
34 | ```scala
35 | ==============sorted=================
36 | List(1, 2, 3, 4, 5, 6)
37 | List(6, 5, 4, 3, 2, 1)
38 | ==============sortBy=================
39 | List(1, 2, 3, 4, 5, 6)
40 | List(6, 5, 4, 3, 2, 1)
41 | ==============sortWith=================
42 | List(1, 2, 3, 4, 5, 6)
43 | List(6, 5, 4, 3, 2, 1)
44 | ```
45 |
46 |
47 |
48 | Example two: sorting tuples on multiple fields
49 |
50 | Note that multi-field sorting is awkward with sorted, so the examples here use sortBy and sortWith
51 |
52 | First the sortBy version:
53 |
54 | ```scala
55 | val pairs = Array(
56 | ("a", 5, 1),
57 | ("c", 3, 1),
58 | ("b", 1, 3)
59 | )
60 |
61 | //ascending by the third field, descending by the first; note that the sort fields must line up with the tuple that follows
62 | val bx= pairs.
63 | sortBy(r => (r._3, r._1))( Ordering.Tuple2(Ordering.Int, Ordering.String.reverse) )
64 | //print the result
65 | bx.map( println )
66 | ```
67 |
68 | Result:
69 |
70 | ```
71 | (c,3,1)
72 | (a,5,1)
73 | (b,1,3)
74 | ```
75 |
76 |
77 |
78 | Now the sortWith version:
79 |
80 | ```scala
81 | val pairs = Array(
82 | ("a", 5, 1),
83 | ("c", 3, 1),
84 | ("b", 1, 3)
85 | )
86 | val b= pairs.sortWith{
87 | case (a,b)=>{
88 | if(a._3==b._3) {//if the third fields are equal, sort descending by the first field
89 | a._1>b._1
90 | }else{
91 | a._3<b._3 //otherwise sort ascending by the third field
92 | }
93 | }
94 | }
95 | //print the result
96 | b.map( println )
97 | ```
98 |
99 | Result:
100 |
101 | ```
102 | (c,3,1)
103 | (a,5,1)
104 | (b,1,3)
105 | ```
106 |
107 | Example three: sorting instances of a class (case class) on multiple fields
108 |
109 | First the sortBy version:
110 |
111 | ```scala
112 | case class Person(val name:String,val age:Int)
113 |
114 | val p1=Person("cat",23)
115 | val p2=Person("dog",23)
116 | val p3=Person("andy",25)
117 | val pairs = Array(p1,p2,p3)
118 | //ascending by age; when ages are equal, descending by name
119 | val bx= pairs.sortBy(person => (person.age, person.name))( Ordering.Tuple2(Ordering.Int, Ordering.String.reverse) )
120 |
121 | bx.map(
122 | println
123 | )
124 | ```
125 |
126 | Result:
127 |
128 | ```
129 | Person(dog,23)
130 | Person(cat,23)
131 | Person(andy,25)
132 | ```
133 |
134 |
135 |
136 | Now the sortWith version:
137 |
138 | ```scala
139 | case class Person(val name:String,val age:Int)
140 |
141 | val p1=Person("cat",23)
142 | val p2=Person("dog",23)
143 | val p3=Person("andy",25)
144 |
145 | val pairs = Array(p1,p2,p3)
146 |
147 | val b=pairs.sortWith{
148 | case (person1,person2)=>{
149 | person1.age==person2.age match {
150 | case true=> person1.name>person2.name //same age: sort descending by name
151 | case false=>person1.age<person2.age //different ages: sort ascending by age
152 | }
153 | }
154 | }
155 | b.map( println )
156 | ```
157 |
158 | Result:
159 |
160 | ```
161 | Person(dog,23)
162 | Person(cat,23)
163 | Person(andy,25)
164 | ```
--------------------------------------------------------------------------------
/notes/LearningSpark(8)RDD如何转化为DataFrame.md:
--------------------------------------------------------------------------------
1 | There are two ways to turn an RDD into a DataFrame (the full code is in RDDtoDataFrame.scala and RDDtoDataFrame2.scala)
2 |
3 | ### Method one: infer the schema by reflection (case class + toDF)
4 |
5 | Define a case class whose fields describe the columns, map each line of the RDD onto it, and call toDF:
6 |
7 | ```scala
8 | import org.apache.spark.sql.SparkSession
9 |
10 | case class Person(name: String, age: Int, job: String)
11 |
12 | val spark = SparkSession.builder().master("local").appName("rdd to dataFrame").getOrCreate()
13 |
14 | import spark.implicits._
15 | //toDF needs the implicit conversions imported
16 | val rdd = spark.sparkContext.textFile("data/people.txt").cache()
17 |
18 | //data/people.txt starts with a "name;age;job" header line
19 | val header = rdd.first()
20 |
21 | val rddToDF = rdd.filter(row => row != header) //drop the header row
22 | .map(_.split(";"))
23 | .map(x => Person(x(0).toString, x(1).toInt, x(2).toString))
24 | .toDF() //call the toDF method
25 | rddToDF.show()
26 | rddToDF.printSchema()
27 |
28 | val dfToRDD = rddToDF.rdd //returns RDD[Row], an RDD of Row objects like [Jorge,30,Developer]
29 | ```
30 |
31 | - The toDF method
32 |
33 | toDF converts objects built from a case class into a DataFrame; it can take column names (colNames) as parameters, otherwise the case class's field names are used as column names by default. Beyond that, it can also turn a local sequence (Seq) or an array into a DataFrame
34 |
35 | > The Spark sql implicits must be imported, e.g. import spark.implicits._
36 | >
37 | > If toDF() is used without specifying column names, the default column names are "\_1", "\_2", ...
38 | >
39 | > ```scala
40 | > val df = Seq(
41 | > (1, "First Value", java.sql.Date.valueOf("2019-01-01")),
42 | > (2, "Second Value", java.sql.Date.valueOf("2019-02-01"))
43 | > ).toDF("int_column", "string_column", "date_column")
44 | > ```
45 |
46 | - The toDS method: converts to a Dataset
47 |
48 | ### Method two: build the schema programmatically
49 |
50 | Via a programmatic interface that lets you construct a **Schema** and apply it to an existing **RDD**. This approach is more verbose, but it lets you construct the **Dataset** when the columns and their types are not known until runtime
51 |
52 | ```scala
53 | import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
54 | import org.apache.spark.sql.Row
55 | //build the RDD[Row]
56 | val personRDD = rdd.map { line => Row(line.split(";")(0).toString, line.split(";")(1).toInt, line.split(";")(2).toString) }
57 | //build the Schema; this needs StructType/StructField
58 | val structType = StructType(Array(
59 | StructField("name", StringType, true),
60 | StructField("age", IntegerType, true),
61 | StructField("job", StringType, true)))
62 | //the createDataFrame(rowRDD: RDD[Row], schema: StructType) method
63 | //Creates a `DataFrame` from an `RDD` containing [[Row]]s using the given schema.
64 | val rddToDF = spark.createDataFrame(personRDD, structType)
65 | ```
66 |
67 | The createDataFrame source:
68 |
69 | ```scala
70 | /**
71 | * Creates a `DataFrame` from an `RDD[Row]`.
72 | * User can specify whether the input rows should be converted to Catalyst rows.
73 | */
74 | private[sql] def createDataFrame(
75 | rowRDD: RDD[Row],
76 | schema: StructType,
77 | needsConversion: Boolean) = {
78 | // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
79 | // schema differs from the existing schema on any field data type.
80 | val catalystRows = if (needsConversion) {
81 | val encoder = RowEncoder(schema)
82 | rowRDD.map(encoder.toRow)
83 | } else {
84 | rowRDD.map { r: Row => InternalRow.fromSeq(r.toSeq) }
85 | }
86 | internalCreateDataFrame(catalystRows.setName(rowRDD.name), schema)
87 | }
88 | ```
89 |
90 | ### Methods on the Row object
91 |
92 | - Access a field directly by index, e.g. row(0)
93 |
94 | - getAs[T]: get a column by name or by position; there are also typed getters such as getInt, getString, etc.
95 |
96 | ```scala
97 | //the argument is either the column name or the column position (starting from 0)
98 | def getAs[T](i: Int): T = get(i).asInstanceOf[T]
99 | def getAs[T](fieldName: String): T = getAs[T](fieldIndex(fieldName))
100 | ```
101 |
102 | - getValuesMap: get the values of several specified columns as a Map
103 |
104 | ```scala
105 | //source
106 | def getValuesMap[T](fieldNames: Seq[String]): Map[String, T] = {
107 | fieldNames.map { name =>
108 | name -> getAs[T](name)
109 | }.toMap
110 | }
111 | //usage
112 | dfToRDD.map{row=>{
113 | val columnMap = row.getValuesMap[Any](Array("name","age","job")); Person(columnMap("name").toString(),columnMap("age").toString.toInt,columnMap("job").toString)
114 | }}
115 | ```
116 |
117 | - isNullAt: checks whether the value at position i is null
118 |
119 | ```scala
120 | def isNullAt(i: Int): Boolean = get(i) == null
121 | ```
122 |
123 | - length: the number of fields in the Row (see the usage sketch below)
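
A small usage sketch of these Row methods, reusing the dfToRDD RDD[Row] from above (column positions assumed to be name, age, job):

```scala
dfToRDD.foreach { row =>
  val name = row.getAs[String]("name")                   //by column name
  val age  = if (row.isNullAt(1)) -1 else row.getInt(1)  //by position, with a null check
  println(s"$name, $age, fields = ${row.length}")
}
```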
--------------------------------------------------------------------------------
/notes/使用JDBC将DataFrame写入mysql.md:
--------------------------------------------------------------------------------
1 | # Inserting DataFrame data into MySQL with spark foreachPartition
2 |
3 | > Reposted from: http://www.waitingfy.com/archives/4370 (a genuinely good write-up)
4 |
5 | ```scala
6 | import java.sql.{Connection, DriverManager, PreparedStatement}
7 |
8 | import org.apache.spark.sql.SparkSession
9 | import org.apache.spark.sql.functions._
10 |
11 | import scala.collection.mutable.ListBuffer
12 |
13 | object foreachPartitionTest {
14 |
15 | case class TopSongAuthor(songAuthor:String, songCount:Long)
16 |
17 |
18 | def getConnection() = {
19 | DriverManager.getConnection("jdbc:mysql://localhost:3306/baidusong?user=root&password=root&useUnicode=true&characterEncoding=UTF-8")
20 | }
21 |
22 | def release(connection: Connection, pstmt: PreparedStatement): Unit = {
23 | try {
24 | if (pstmt != null) {
25 | pstmt.close()
26 | }
27 | } catch {
28 | case e: Exception => e.printStackTrace()
29 | } finally {
30 | if (connection != null) {
31 | connection.close()
32 | }
33 | }
34 | }
35 |
36 | def insertTopSong(list:ListBuffer[TopSongAuthor]):Unit ={
37 |
38 | var connect:Connection = null
39 | var pstmt:PreparedStatement = null
40 |
41 | try{
42 | connect = getConnection()
43 | connect.setAutoCommit(false)
44 | val sql = "insert into topSinger(song_author, song_count) values(?,?)"
45 | pstmt = connect.prepareStatement(sql)
46 | for(ele <- list){
47 | pstmt.setString(1, ele.songAuthor)
48 | pstmt.setLong(2,ele.songCount)
49 |
50 | pstmt.addBatch()
51 | }
52 | pstmt.executeBatch()
53 | connect.commit()
54 | }catch {
55 | case e:Exception => e.printStackTrace()
56 | }finally {
57 | release(connect, pstmt)
58 | }
59 | }
60 |
61 | def main(args: Array[String]): Unit = {
62 | val spark = SparkSession
63 | .builder()
64 | .master("local[2]")
65 | .appName("foreachPartitionTest")
66 | .getOrCreate()
67 | val gedanDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306").option("dbtable", "baidusong.gedan").option("user", "root").option("password", "root").option("driver", "com.mysql.jdbc.Driver").load()
68 | // mysqlDF.show()
69 | val detailDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306").option("dbtable", "baidusong.gedan_detail").option("user", "root").option("password", "root").option("driver", "com.mysql.jdbc.Driver").load()
70 |
71 | val joinDF = gedanDF.join(detailDF, gedanDF.col("id") === detailDF.col("gedan_id"))
72 |
73 | // joinDF.show()
74 | import spark.implicits._
75 | val resultDF = joinDF.groupBy("song_author").agg(count("song_name").as("song_count")).orderBy($"song_count".desc).limit(100)
76 | // resultDF.show()
77 |
78 |
79 | resultDF.foreachPartition(partitionOfRecords =>{
80 | val list = new ListBuffer[TopSongAuthor]
81 | partitionOfRecords.foreach(info =>{
82 | val song_author = info.getAs[String]("song_author")
83 | val song_count = info.getAs[Long]("song_count")
84 |
85 | list.append(TopSongAuthor(song_author, song_count))
86 | })
87 | insertTopSong(list)
88 |
89 | })
90 |
91 | spark.close()
92 | }
93 |
94 | }
95 | ```
96 |
97 |
98 |
99 | The example above re-implements [《python pandas 实战 百度音乐歌单 数据分析》](http://www.waitingfy.com/archives/4105) with Spark.
100 |
101 | Where does the default foreach fall short on performance?
102 |
103 | First, the function is invoked once per record: the task calls the function for every single element.
104 | With 1 million records in a partition, that means 1 million calls, which performs poorly.
105 |
106 | Another very important point:
107 | if you create a database connection for every record, you end up creating 1 million connections.
108 | Creating and tearing down database connections is extremely expensive, even though we already use a
109 | connection pool that only keeps a fixed number of physical connections.
110 |
111 | You still send one SQL statement to the database (MySQL) per record, and MySQL has to execute each one.
112 | With 1 million records, that is 1 million SQL round trips.
113 |
114 | Both of these (connection handling and repeated SQL statements) are very costly.
115 |
116 | foreachPartition: in production, writing to a database is usually done with foreachPartition
117 |
118 | Use batched statements (one SQL statement with multiple sets of parameters):
119 | the SQL statement is sent once,
120 | and 1 million rows are inserted in a single batch.
121 |
122 | What do you gain from the foreachPartition operator? (see the sketch below)
123 |
124 | 1. The function we wrote is called only once per partition, receiving all of that partition's data at once
125 | 2. Only one database connection needs to be created or obtained per partition
126 | 3. Only one SQL statement, with multiple sets of parameters, has to be sent to the database
127 |
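To make the contrast concrete, here is a minimal sketch reusing the getConnection/release/insertTopSong helpers and the TopSongAuthor case class defined in the code above (the foreach version is shown only as the anti-pattern):

```scala
//anti-pattern: foreach opens a connection and sends one INSERT per record
resultDF.rdd.foreach { row =>
  val conn = getConnection()                   //one connection per record
  val pstmt = conn.prepareStatement("insert into topSinger(song_author, song_count) values(?,?)")
  pstmt.setString(1, row.getAs[String]("song_author"))
  pstmt.setLong(2, row.getAs[Long]("song_count"))
  pstmt.executeUpdate()                        //one SQL round trip per record
  release(conn, pstmt)
}

//preferred: foreachPartition batches a whole partition over a single connection
resultDF.rdd.foreachPartition { rows =>
  val list = new ListBuffer[TopSongAuthor]
  rows.foreach(r => list.append(TopSongAuthor(r.getAs[String]("song_author"), r.getAs[Long]("song_count"))))
  insertTopSong(list)                          //one connection, one batch, one commit
}
```
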
128 | Reference: 《算子优化 foreachPartition》 https://blog.csdn.net/u013939918/article/details/60881711
--------------------------------------------------------------------------------
/notes/LearningSpark(7)SparkSQL之DataFrame学习.md:
--------------------------------------------------------------------------------
1 | A DataFrame is essentially RDD + Schema (metadata); before Spark 1.3 it was still called SchemaRDD. It is a distributed collection of data organized into columns.
2 |
3 | Spark SQL can build a DataFrame from an RDD, a Parquet file, a JSON file, a Hive table,
4 | or any relational database table reachable through a JDBC connection.
5 |
6 | ### How to create the Spark SQL entry point
7 |
8 | Just as Spark Core first needs a SparkContext, Spark SQL needs an entry point: before Spark 2.0 this was the SQLContext object or one of its subclasses such as HiveContext, while Spark 2.0+ provides the SparkSession object, which replaces both. For example:
9 |
10 | ```scala
11 | import org.apache.spark.SparkConf
12 | import org.apache.spark.SparkContext
13 | import org.apache.spark.sql.SQLContext
14 |
15 | val conf = new SparkConf().setMaster("local").setAppName("test sqlContext")
16 | val sc = new SparkContext(conf)
17 | val sqlContext = new SQLContext(sc)
18 | //import the implicit conversions: import sqlContext.implicits._
19 | ```
20 |
21 | > SparkSession wraps both spark.sparkContext and spark.sqlContext, so you can get sc directly via spark.sparkContext and then create RDDs from it (here spark is the SparkSession object)
22 |
23 | ```scala
24 | import org.apache.spark.sql.SparkSession
25 |
26 | val spark = SparkSession.builder().master("local").appName("DataFrame API").getOrCreate()
27 | //or SparkSession.builder().config(new SparkConf()).getOrCreate()
28 | //import the implicit conversions: import spark.implicits._
29 | ```
30 |
31 |
32 |
33 | > A quick note on HiveContext:
34 | >
35 | > Besides the basic SQLContext you can also use its subclass, HiveContext. HiveContext offers everything SQLContext provides plus extra Hive-specific functionality: writing and executing SQL in HiveQL syntax, using Hive UDFs, and reading data from Hive tables.
36 | >
37 | > To use HiveContext, Hive must be installed beforehand. Every data source SQLContext supports is also supported by HiveContext, so it is not limited to Hive. For Spark 1.3.x and later, HiveContext is recommended because its functionality is richer and more complete.
38 | >
39 | > Spark SQL also lets you choose the SQL dialect with the spark.sql.dialect parameter, set via setConf() on the context. SQLContext supports only the "sql" dialect; HiveContext defaults to "hiveql".
40 |
41 | ### DataFrame API
42 |
43 | #### show
44 |
45 | ```scala
46 | def show(numRows: Int): Unit = show(numRows, truncate = true)
47 |
48 | /**
49 | * Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters
50 | * will be truncated, and all cells will be aligned right.
51 | *
52 | * @group action
53 | * @since 1.6.0
54 | */
55 | def show(): Unit = show(20)
56 | ```
57 |
58 | As the source shows, numRows defaults to 20 (the first 20 rows are displayed), and truncate controls whether each field is cut off at 20 characters; it defaults to true
59 |
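A small usage sketch (assuming a DataFrame called people, as in the rest of this note):

```scala
people.show()          //first 20 rows, long strings truncated to 20 characters
people.show(5)         //first 5 rows
people.show(5, false)  //first 5 rows, without truncating long values
```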
60 |
61 |
62 | #### select
63 |
64 | ```scala
65 | people.select($"name", $"age" + 1).show()
66 | people.select("name").show() //equivalent to select($"name"); see the note on $ below
67 | ```
68 |
69 | >When $"age" is used to reference the age column (for example in a comparison), an implicit conversion is involved, so the corresponding package must be imported:
70 | >
71 | >`import spark.implicits._`
72 | >
73 | >Note that spark here is the SparkSession (or sqlContext) instance
74 |
75 |
76 |
77 | >Alternatively you can write people("age")+1 or people.col("age")+1; all of the above are ways of selecting a column
78 |
79 | #### selectExpr
80 |
81 | A select that can apply expressions to the chosen fields
82 |
83 | ```scala
84 | people.selectExpr("cast(age as string) age_toString","name")
85 | ```
86 |
87 |
88 |
89 | #### filter
90 |
91 | The same as WHERE in SQL; it mostly comes down to numeric and string comparisons. Note the following:
92 |
93 | ```scala
94 | //Spark 2.x style
95 | people.filter($"age" > 21).show()
96 | people.filter($"name".contains("ust")).show()
97 | people.filter($"name".like("%ust%")).show
98 | people.filter($"name".rlike(".*?ust.*?")).show()
99 | // same as above, Spark 1.6 style
100 | //people.filter(people("name") contains("ust")).show()
101 | //people.filter(people("name") like("%ust%")).show()
102 | //people.filter(people("name") rlike(".*?ust.*?")).show()
103 | ```
104 |
105 |
106 |
107 | - contains: the column contains a given substring
108 | - like: same as LIKE in SQL, used with the wildcards `_` and `%`
109 | - rlike: Java-style regular expression matching; use it exactly like a regex pattern
110 | - conditions can be combined with and and or
111 | - negation, e.g. not contains: `people.filter(!(people("name") contains ("ust")))`
112 |
113 | #### groupBy (aggregation)
114 |
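No example was given here, so below is a minimal sketch of grouped aggregation (column names assumed from the people DataFrame used in this note):

```scala
import org.apache.spark.sql.functions._

people.groupBy("age").count().show()                       //number of rows per age
people.groupBy("age").agg(count("name").as("cnt")).show()  //the same thing expressed via agg
```
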
115 | #### count
116 |
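count is an action that returns the number of rows, for example:

```scala
val n: Long = people.count()
```
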
117 | #### createOrReplaceTempView: register as a temporary SQL view
118 |
119 | Registers the DataFrame as a temporary SQL view, which makes it convenient to work directly in SQL
120 |
121 | ```scala
122 | people.createOrReplaceTempView("sqlDF")
123 | spark.sql("select * from sqlDF where name not like '%ust%' ").show() //spark is the SparkSession object created earlier
124 | ```
125 |
126 |
127 |
128 | #### createGlobalTempView: register as a global temporary SQL view
129 |
130 | The temp view created above is tied to its SparkSession and is destroyed when the session ends. To share a view across multiple sessions you need a Global Temporary View (registered with df.createGlobalTempView("people")). The Spark examples give the following reference:
131 |
132 | ```scala
133 | // Global temporary view is tied to a system preserved database `global_temp`
134 | spark.sql("SELECT * FROM global_temp.people").show()
135 | // +----+-------+
136 | // | age| name|
137 | // +----+-------+
138 | // |null|Michael|
139 | // | 30| Andy|
140 | // | 19| Justin|
141 | // +----+-------+
142 |
143 | // Global temporary view is cross-session
144 | spark.newSession().sql("SELECT * FROM global_temp.people").show()
145 | // +----+-------+
146 | // | age| name|
147 | // +----+-------+
148 | // |null|Michael|
149 | // | 30| Andy|
150 | // | 19| Justin|
151 | // +----+-------+
152 | // $example off:global_temp_view$
153 | ```
154 |
155 |
--------------------------------------------------------------------------------
/Spark_With_Scala_Testing/src/sparkCore/Transformation.scala:
--------------------------------------------------------------------------------
1 | package sparkCore
2 |
3 | import org.apache.spark.SparkConf
4 | import org.apache.spark.SparkContext
8 |
9 | object Transformation {
10 | def main(args: Array[String]): Unit = {
11 | System.out.println("Start***********")
12 | /*getData("data/hello.txt")
13 | getData("", Array(1,2,3,4,5))*/
14 |
15 | /*map_flatMap_result()*/
16 | /*distinct_result()*/
17 |
18 | /* filter_result()
19 | groupByKey_result()
20 | reduceByKey_result()
21 | sortByKey_result()*/
22 | join_result()
23 | }
24 |
25 | //### data sources
26 |
27 | def getData(path: String, arr: Array[Int] = Array()) {
28 | // path: file:// or hdfs://master:9000/ [port 9000 specified explicitly]
29 | val conf = new SparkConf().setMaster("local").setAppName("getData")
30 | val sc = new SparkContext(conf)
31 |
32 | if (!arr.isEmpty) {
33 | sc.parallelize(arr).map(_ + 1).foreach(println)
34 | } else {
35 | sc.textFile(path, 1).foreach(println)
36 | }
37 | sc.stop()
38 | }
39 |
40 | def map_flatMap_result() {
41 | val conf = new SparkConf().setMaster("local").setAppName("map_flatMap")
42 | val sc = new SparkContext(conf)
43 | val rdd = sc.textFile("data/hello.txt", 1)
44 | val mapResult = rdd.map(_.split(","))
45 | val flatMapResult = rdd.flatMap(_.split(","))
46 |
47 | mapResult.foreach(println)
48 | flatMapResult.foreach(println)
49 |
50 | println("差别×××××××××")
51 | flatMapResult.map(_.toUpperCase).foreach(println)
52 | flatMapResult.flatMap(_.toUpperCase).foreach(println)
53 | }
54 | //distinct: remove duplicates
55 | def distinct_result() {
56 | val conf = new SparkConf().setMaster("local").setAppName("去重")
57 | val sc = new SparkContext(conf)
58 | val rdd = sc.textFile("data/hello.txt", 1)
59 | rdd.flatMap(_.split(",")).distinct().foreach(println)
60 | }
61 | //keep only even numbers
62 | def filter_result() {
63 | val conf = new SparkConf().setMaster("local").setAppName("过滤偶数")
64 | val sc = new SparkContext(conf)
65 | val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6), 1)
66 | rdd.filter(_ % 2 == 0).foreach(println)
67 | }
68 | //students in each class
69 | def groupByKey_result() {
70 | val conf = new SparkConf().setMaster("local").setAppName("分组")
71 | val sc = new SparkContext(conf)
72 | val classmates = Array(Tuple2("class1", "Lee"), Tuple2("class2", "Liu"),
73 | Tuple2("class2", "Ma"), Tuple2("class3", "Wang"),
74 | Tuple2("class1", "Zhao"), Tuple2("class1", "Zhang"),
75 | Tuple2("class1", "Mao"), Tuple2("class4", "Hao"),
76 | Tuple2("class3", "Zha"), Tuple2("class2", "Zhao"))
77 |
78 | val students = sc.parallelize(classmates)
79 | students.groupByKey().foreach(x ⇒ {
80 | println(x._1 + " :")
81 | x._2.foreach(println)
82 | println("***************")
83 | })
84 | }
85 | //total score per class
86 | def reduceByKey_result() {
87 | val conf = new SparkConf().setMaster("local").setAppName("聚合")
88 | val sc = new SparkContext(conf)
89 | val classmates = Array(Tuple2("class1", 90), Tuple2("class2", 85),
90 | Tuple2("class2", 60), Tuple2("class3", 95),
91 | Tuple2("class1", 70), Tuple2("class4", 100),
92 | Tuple2("class3", 80), Tuple2("class2", 65))
93 |
94 | val students = sc.parallelize(classmates)
95 | val scores = students.reduceByKey(_ + _).foreach(x ⇒ {
96 | println(x._1 + " :" + x._2)
97 | println("****************")
98 | })
99 | }
100 | //sort by score
101 | def sortByKey_result() {
102 | val conf = new SparkConf()
103 | .setAppName("sortByKey")
104 | .setMaster("local")
105 | val sc = new SparkContext(conf)
106 |
107 | val scoreList = Array(Tuple2(65, "leo"), Tuple2(50, "tom"),
108 | Tuple2(100, "marry"), Tuple2(85, "jack"))
109 | val scores = sc.parallelize(scoreList, 1)
110 | val sortedScores = scores.sortByKey(false)
111 |
112 | sortedScores.foreach(studentScore => println(studentScore._1 + ": " + studentScore._2))
113 | sc.stop()
114 | println("排序Over×××××××××")
115 | }
116 |
117 | def join_result(){
118 | val conf = new SparkConf().setMaster("local").setAppName("聚合")
119 | val sc = new SparkContext(conf)
120 | val classmates = Array(Tuple2("class1", 90), Tuple2("class2", 85),
121 | Tuple2("class2", 60), Tuple2("class3", 95),
122 | Tuple2("class1", 70), Tuple2("class4", 100),
123 | Tuple2("class3", 80), Tuple2("class2", 65))
124 |
125 | val students = sc.parallelize(classmates)
126 | val allScores = students.reduceByKey(_ + _)
127 | val nums = students.countByKey().toArray
128 | allScores.join(sc.parallelize(nums, 1)).sortByKey(true).foreach(x⇒{
129 | println(x._1+" :")
130 | println("All_scores: "+x._2._1+",Nums: "+x._2._2)
131 | println("Avg_scores: "+x._2._1.toDouble/x._2._2)
132 | })
133 | }
134 | }
135 |
--------------------------------------------------------------------------------
/notes/报错和问题归纳.md:
--------------------------------------------------------------------------------
1 |
2 | ### java.net.ConnectException: Connection refused when Spark reads a file from HDFS
3 |
4 | The error message is as follows:
5 | ```
6 | java.net.ConnectException: Call From josonlee-PC/127.0.1.1 to 192.168.17.10:8020 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
7 | at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
8 | at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
9 | at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
10 | at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
11 | at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
12 | at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
13 | at org.apache.hadoop.ipc.Client.call(Client.java:1474)
14 | at org.apache.hadoop.ipc.Client.call(Client.java:1401)
15 | at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
16 | at com.sun.proxy.$Proxy24.getListing(Unknown Source)
17 | at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:554)
18 | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
19 | at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
20 | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
21 | at java.lang.reflect.Method.invoke(Method.java:498)
22 | at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
23 | at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
24 | at com.sun.proxy.$Proxy25.getListing(Unknown Source)
25 | at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1958)
26 | at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1941)
27 | at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693)
28 | at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
29 | at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
30 | at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
31 | at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
32 | at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
33 | at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
34 | at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
35 | at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)
36 | at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
37 | at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
38 | at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
39 | at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
40 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
41 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
42 | at scala.Option.getOrElse(Option.scala:121)
43 | at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
44 | at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
45 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
46 | at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
47 | at scala.Option.getOrElse(Option.scala:121)
48 | at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
49 | at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
50 | at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
51 | ... 49 elided
52 | Caused by: java.net.ConnectException: 拒绝连接
53 | at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
54 | at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
55 | at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
56 | at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
57 | at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
58 | at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
59 | at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
60 | at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
61 | at org.apache.hadoop.ipc.Client.getConnection(Client.java:1523)
62 | at org.apache.hadoop.ipc.Client.call(Client.java:1440)
63 | ... 86 more
64 | ```
65 |
66 | At first I was writing code while following the official docs. For loading external datasets I knew the URL should point to the file on HDFS, but there was no example, so I wrote it like this:
67 | ```
68 | scala> val data=sc.textFile("hdfs://192.168.17.10//sparkData/test/*")
69 | ```
70 | The error message also points out `Call From josonlee-PC/127.0.1.1 to 192.168.17.10:8020 failed`, which shows that Spark accesses HDFS on port 8020 by default. The fix is simple: Hadoop's config file (core-site.xml) specifies that HDFS is exposed on port 9000, so just write that port explicitly in the URL
71 | ```
72 | scala> val data=sc.textFile("hdfs://192.168.17.10:9000//sparkData/test/*")
73 | ```
74 |
75 |
76 |
77 | ## Spark cluster (or spark-shell) fails to read a local file: file not found
78 |
79 | spark-shell does not run in local mode by default. For the cluster to read a file, every worker must be able to access it; a local file exists only on the master node and not on the worker nodes, so the workers cannot read it via `file:///`
80 |
81 | Solutions:
82 |
83 | - Upload the file to HDFS and read it via `hdfs://`
84 | - Or copy the file to the corresponding directory on every worker node
--------------------------------------------------------------------------------
/notes/LearningSpark(5)Spark共享变量.md:
--------------------------------------------------------------------------------
1 | ## Shared variables
2 |
3 | Another important Spark feature: shared variables
4 |
5 | Each Executor on a worker node runs multiple tasks, and when the function passed to an operator uses an external variable, every task gets its own copy of that variable by default. If the variable is large, the network transfer and memory footprint become large as well, which is why **shared variables** exist: each node keeps a single copy of the variable, and the tasks on that node share it.
6 |
7 | Spark provides two kinds of shared variables: Broadcast Variables and Accumulators
8 |
9 | ## Broadcast Variable and Accumulator
10 |
11 | A broadcast variable is **read-only** and cannot be modified, so its purpose is performance: it reduces network transfer and memory consumption. An accumulator lets multiple tasks operate on the same variable **additively**, and is typically used to implement counters and sums
12 |
13 | ### Broadcast variables
14 |
15 | A broadcast variable is created by calling SparkContext's broadcast() method on a value, which produces a Broadcast[T] object. In Scala the value is accessed through the value property, in Java through the value() method
16 |
17 | A variable distributed by broadcast is serialized, and then deserialized again when a task uses it
18 |
19 | ```scala
20 | scala> val brodcast = sc.broadcast(1)
21 | brodcast: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(12)
22 |
23 | scala> brodcast
24 | res4: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(12)
25 |
26 | scala> brodcast.value
27 | res5: Int = 1
28 | ```
29 | ```scala
30 | val factor = 3
31 | val factorBroadcast = sc.broadcast(factor)
32 |
33 | val arr = Array(1, 2, 3, 4, 5)
34 | val rdd = sc.parallelize(arr)
35 | val multipleRdd = rdd.map(num => num * factorBroadcast.value)
36 | ```
37 |
38 | ### How to update a broadcast variable
39 |
40 | Remove the old broadcast variable with unpersist(), then broadcast the new value again
41 |
42 | ```scala
43 | import java.io.{ ObjectInputStream, ObjectOutputStream }
44 | import org.apache.spark.broadcast.Broadcast
45 | import org.apache.spark.streaming.StreamingContext
46 | import scala.reflect.ClassTag
47 |
48 | /* wrapper lets us update broadcast variables within DStreams' foreachRDD
49 | without running into serialization issues */
50 | case class BroadcastWrapper[T: ClassTag](
51 | @transient private val ssc: StreamingContext,
52 | @transient private val _v: T) {
53 |
54 | @transient private var v = ssc.sparkContext.broadcast(_v)
55 |
56 | def update(newValue: T, blocking: Boolean = false): Unit = {
57 | // whether to block while the old broadcast data is removed
58 | v.unpersist(blocking)
59 | v = ssc.sparkContext.broadcast(newValue)
60 | }
61 |
62 | def value: T = v.value
63 |
64 | private def writeObject(out: ObjectOutputStream): Unit = {
65 | out.writeObject(v)
66 | }
67 |
68 | private def readObject(in: ObjectInputStream): Unit = {
69 | v = in.readObject().asInstanceOf[Broadcast[T]]
70 | }
71 | }
72 | ```
73 |
74 | References:
75 |
76 | [How can I update a broadcast variable in spark streaming?](https://stackoverflow.com/questions/33372264/how-can-i-update-a-broadcast-variable-in-spark-streaming)
77 |
78 | [Spark踩坑记——共享变量](https://www.cnblogs.com/xlturing/p/6652945.html)
79 |
80 | ***
81 |
82 | ### Accumulators
83 |
84 | A few things to note about accumulators:
85 |
86 | - In a cluster, only the driver program can read an accumulator's value; tasks can only call add on it;
87 | - Accumulator updates only take effect in **action** operations; lazy evaluation is unchanged, and **Spark** guarantees each task updates an accumulator only once, i.e. a restarted task will not update the value again. Inside transformations, however, be aware that if a task or job stage is re-executed, each task's update may be applied more than once
88 | - Spark natively supports numeric accumulators: longAccumulator (Long) and doubleAccumulator (Double). You can also add support for your own types: before 2.0.0 by extending AccumulatorParam, from 2.0.0 on by extending AccumulatorV2
89 |
90 | ```scala
91 | scala> val accum1 = sc.longAccumulator("what?") //"what?" is the name attribute; you could also simply use sc.accumulator(0) (old API)
92 | accum1: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 379, name: Some(what?), value: 0)
93 | scala> accum1.value
94 | res30: Long = 0
95 | scala> sc.parallelize(Array(1,2,3,4,5)).foreach(accum1.add(_))
96 | scala> accum1
97 | res31: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 379, name: Some(what?), value: 15)
98 | scala> accum1.value
99 | res32: Long = 15
100 | ```
101 | ```scala
102 | scala> accum1.value
103 | res35: Long = 15
104 | scala> val rdd = sc.parallelize(Array(1,2,3,4,5))
105 | rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at :24
106 | scala> val tmp = rdd.map(x=>(x,accum1.add(x)))
107 | tmp: org.apache.spark.rdd.RDD[(Int, Unit)] = MapPartitionsRDD[33] at map at :27
108 | scala> accum1.value
109 | res36: Long = 15
110 | scala> tmp.map(x=>(x,accum1.add(x._1))).collect //the accumulation only runs once an action is triggered
111 | res37: Array[((Int, Unit), Unit)] = Array(((1,()),()), ((2,()),()), ((3,()),()), ((4,()),()), ((5,()),()))
112 | scala> accum1.value //the value reflects two rounds of accumulation
113 | res38: Long = 45
114 | ```
115 |
116 | ```scala
117 | scala> val data = sc.parallelize(1 to 10)
118 | data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at parallelize at :24
119 | scala> accum.value
120 | res54: Int = 0
121 | scala> val newData = data.map{x => {
122 | | if(x%2 == 0){
123 | | accum += 1
124 | | 0
125 | | }else 1
126 | | }}
127 | newData: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at map at :27
128 |
129 | scala> newData.count
130 | res55: Long = 10
131 | scala> accum.value //after newData runs one action, the accumulation has executed once, so the result is 5
132 | res56: Int = 5
133 | scala> newData.collect //running another action on newData executes the accumulation again
134 | res57: Array[Int] = Array(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
135 | scala> accum.value
136 | res58: Int = 10
137 | ```
138 |
139 | The code above illustrates the points made earlier: newData is derived from data via map, and the map function contains the accumulation, so the accumulation runs twice (once per action). The solutions are:
140 |
141 | 1. Either compute the accumulator result with a single action
142 | 2. Or persist the RDD before the actions, so the RDD is not recomputed and the accumulation is not repeated 【**recommended**】(see the sketch below)
143 |
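A minimal sketch of option 2, using a fresh longAccumulator for illustration:

```scala
val accum2 = sc.longAccumulator("even-counter")
val newData = data.map { x => if (x % 2 == 0) { accum2.add(1); 0 } else 1 }
newData.cache()        //persist before the first action
newData.count()        //the accumulation runs during this action
newData.collect()      //served from the cache, so accum2 is not incremented again
println(accum2.value)  //5
```
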
144 | - The longAccumulator / doubleAccumulator methods
145 |
146 | ```
147 | add: accumulate a value into the accumulator
148 | value: get the accumulator's current value
149 | merge: this method is particularly important and must be implemented correctly; it is how the per-task accumulators are combined (used in the execution flow described below)
150 | iszero: whether the accumulator still holds its initial value
151 | reset: reset the accumulator's value
152 | copy: copy the accumulator
153 | name: the accumulator's name
154 | ```
155 |
156 | > Accumulator execution flow: for every task, the Spark engine calls copy to create an (unregistered) copy of the accumulator; the accumulation then happens inside each task (during which the value of the originally registered accumulator does not change); finally merge is called to combine the per-task accumulators into the registered one (which at that point still holds its initial value)
157 | > See http://www.ccblog.cn/103.htm
158 |
159 | ### Custom accumulators
160 |
161 | **Note: the accumulator must be registered (register) before use**
162 |
163 | 1. Extend AccumulatorV2[String, String]; the first type parameter is the input type, the second the output type
164 | 2. Override the following six methods (a sketch is given after the list below)
165 |
166 | ```
167 | isZero: whether the accumulator is in its zero (initial) state
168 | copy: create a copy of this AccumulatorV2
169 | reset: reset the data held by the AccumulatorV2
170 | add: how a value is accumulated
171 | merge: how two accumulators are combined
172 | value: the result the AccumulatorV2 exposes to the outside
173 | ```
174 |
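A minimal sketch of such a custom accumulator (here a hypothetical string-concatenating accumulator with String input and output, just to show the six overrides and the registration step):

```scala
import org.apache.spark.util.AccumulatorV2

//collects strings into one comma-separated result
class StringAccumulator extends AccumulatorV2[String, String] {
  private var result = ""

  override def isZero: Boolean = result.isEmpty
  override def copy(): AccumulatorV2[String, String] = {
    val acc = new StringAccumulator
    acc.result = this.result
    acc
  }
  override def reset(): Unit = result = ""
  override def add(v: String): Unit =
    result = if (result.isEmpty) v else result + "," + v
  override def merge(other: AccumulatorV2[String, String]): Unit =
    result = Seq(result, other.value).filter(_.nonEmpty).mkString(",")
  override def value: String = result
}

//usage: register it first, then call add inside tasks
val strAccum = new StringAccumulator
sc.register(strAccum, "myStringAccum")
sc.parallelize(Seq("a", "b", "c")).foreach(strAccum.add)
println(strAccum.value)  //e.g. a,b,c (order depends on partitioning)
```
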
175 | For details see: https://blog.csdn.net/leen0304/article/details/78866353
176 |
177 |
--------------------------------------------------------------------------------
/notes/LearningSpark(9)SparkSQL数据来源.md:
--------------------------------------------------------------------------------
1 | > The source code below lives in `org.apache.spark.sql.DataFrameReader/DataFrameWriter`
2 |
3 | ### format: specifying a built-in data source
4 |
5 | Both load and save let you specify the data source type manually with the **format method**. Looking at the relevant source in eclipse, the data sources Spark supports out of the box are parquet (the default), json, csv, text (plain text files), jdbc and orc, as shown in the figure
6 |
7 | 
8 |
9 | ```scala
10 | def format(source: String): DataFrameWriter[T] = {
11 | this.source = source
12 | this}
13 |
14 | private var source: String = df.sparkSession.sessionState.conf.defaultDataSourceName
15 |
16 | def defaultDataSourceName: String = getConf(DEFAULT_DATA_SOURCE_NAME)
17 | // This is used to set the default data source; parquet format is handled by default
18 | val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
19 | .doc("The default data source to use in input/output.")
20 | .stringConf
21 | .createWithDefault("parquet")
22 | ```
23 |
24 | #### Example: the csv method
25 |
26 | The source is shown below; essentially it just sets the data source with format and then calls load/save
27 |
28 | ```scala
29 | def csv(paths: String*): DataFrame = format("csv").load(paths : _*) //multiple files or directories may be given
30 | def csv(path: String): Unit = {
31 | format("csv").save(path)
32 | }
33 | ```
34 |
35 | You can also have a look at my article: [Spark1.x和2.x如何读取和写入csv文件](https://blog.csdn.net/lzw2016/article/details/85562172#commentBox)
36 |
37 | #### The optional option method
38 |
39 | The read/save methods of these built-in data sources also accept optional settings via option. For the csv method, for example, the available options include (an excerpt of the ones I consider useful):
40 |
41 | > You can set the following CSV-specific options to deal with CSV files:
42 | >
43 | > - **`sep` (default `,`): sets a single character as a separator for each field and value**.
44 | > - `encoding` (default `UTF-8`): decodes the CSV files by the given encoding type.
45 | > - **`escape` (default `\`):** sets a single character used for escaping quotes inside an already quoted value.
46 | > - `comment` (default empty string): sets a single character used for skipping lines beginning with this character. By default, it is disabled.
47 | > - **`header` (default `false`): uses the first line as names of columns.**
48 | > - `enforceSchema` (default `true`): If it is set to `true`, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to `false`, the schema will be validated against all headers in CSV files in the case when the `header` option is set to `true`. Field names in the schema and column names in CSV headers are checked by their positions taking into account`spark.sql.caseSensitive`. Though the default value is true, it is recommended to disable the `enforceSchema` option to avoid incorrect results.
49 | > - `inferSchema` (default `false`): infers the input schema automatically from data. It requires one extra pass over the data.
50 | > - `ignoreLeadingWhiteSpace` (default `false`): a flag indicating whether or not leading whitespaces from values being read should be skipped.
51 | > - `ignoreTrailingWhiteSpace` (default `false`): a flag indicating whether or not trailing whitespaces from values being read should be skipped.
52 | > - `nullValue` (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
53 | > - **`emptyValue` (default empty string):** sets the string representation of an empty value.
54 | > - `maxColumns` (default `20480`): defines a hard limit of how many columns a record can have.
55 | >
56 | > In editors like eclipse or IDEA you can hover over the method to see which options it accepts
57 |
58 | **One thing to watch out for: do not force an option onto a method that does not support it**, e.g. using option("sep",";") with text
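
A small usage sketch of option (assuming a semicolon-separated people.csv with a header line, as in this repo's data directory):

```scala
val df = spark.read
  .option("header", "true")      //the first line holds the column names
  .option("sep", ";")            //field separator
  .option("inferSchema", "true") //infer the column types with one extra pass over the data
  .csv("data/people.csv")

df.write.option("header", "true").mode("overwrite").csv("data/people_out")
```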
59 |
60 | ### SaveMode and the mode method
61 |
62 |
63 | | **Save Mode** | **Meaning** |
64 | | ----------------------------- | ------------------------------------------------------------ |
65 | | SaveMode.ErrorIfExists (default) | If data already exists at the target location, throw an exception (the default SaveMode) |
66 | | SaveMode.Append | If data already exists at the target location, append the new data to it |
67 | | SaveMode.Overwrite | If data already exists at the target location, delete it and overwrite it with the new data |
68 | | SaveMode.Ignore | If data already exists at the target location, do nothing |
69 |
70 | In Scala the SaveMode is specified with the mode method; the source is as follows
71 |
72 | ```scala
73 | def mode(saveMode: String): DataFrameWriter[T] = {
74 | this.mode = saveMode.toLowerCase(Locale.ROOT) match {
75 | case "overwrite" => SaveMode.Overwrite
76 | case "append" => SaveMode.Append
77 | case "ignore" => SaveMode.Ignore
78 | case "error" | "errorifexists" | "default" => SaveMode.ErrorIfExists
79 | case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. " +
80 | "Accepted save modes are 'overwrite', 'append', 'ignore', 'error', 'errorifexists'.")
81 | }
82 | this
83 | }
84 | ```
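
A small usage sketch of mode (df is any DataFrame):

```scala
df.write.mode("overwrite").format("json").save("data/people_json")
//or pass the enum directly
df.write.mode(org.apache.spark.sql.SaveMode.Append).parquet("data/people_parquet")
```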
85 |
86 | ### Connecting to databases with JDBC
87 |
88 | - To access a particular database over **JDBC** you need the corresponding **JDBC** driver on the **spark classpath**. You can place it in Spark's library directory, or point spark-submit at the concrete jar (the JDBC jar is not needed while coding or packaging)
89 |
90 | ```
91 | # i.e. add two parameters like the following to the spark-submit command
92 | --jars $HOME/tools/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar \
93 | --driver-class-path $HOME/tools/mysql-connector-java-5.1.40-bin.jar \
94 | ```
95 |
96 | Of course, while looking into this I also came across the following ways of adding the driver:
97 |
98 | > The first is to add to spark-defaults.conf in ${SPARK_HOME}/conf: spark.jars /opt/lib/mysql-connector-java-5.1.26-bin.jar
99 | >
100 | > The second is to add: spark.driver.extraClassPath /opt/lib2/mysql-connector-java-5.1.26-bin.jar; this way multiple dependency jars can be added as well, which is convenient
101 | >
102 | > See: https://blog.csdn.net/u013468917/article/details/52748342
103 |
104 | - Things to note when writing to a database [reference](https://my.oschina.net/bindyy/blog/680195)
105 |
106 |   - The user running the application must have the corresponding permissions on the database
107 |   - Convert the DataFrame to an RDD and insert each partition via foreachPartition (do not use foreach, which connects to the database once per record)
108 |
109 |
110 | ```scala
111 | spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/数据库").option("dbtable", "表名").option("user", "xxxx").option("password", "xxxx").option("driver", "com.mysql.jdbc.Driver").load()
112 | ```
113 |
114 |
115 | First, load the data from MySQL as a DataFrame through the SQLContext read methods;
116 | then the DataFrame can be converted to an RDD and processed with the various Spark Core operators;
117 | finally the resulting data can be written back to MySQL, HBase, Redis and so on via the foreach()/foreachPartition() operators
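
For plain inserts, the DataFrameWriter can also write over JDBC directly, as an alternative to hand-written foreachPartition code (a minimal sketch; df, the database and the credentials are placeholders):

```scala
import java.util.Properties

val props = new Properties()
props.put("user", "root")
props.put("password", "root")
props.put("driver", "com.mysql.jdbc.Driver")

df.write.mode("append").jdbc("jdbc:mysql://localhost:3306/baidusong", "topSinger", props)
```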
118 |
119 | ### Saving to persistent tables
120 |
121 | **DataFrames** can also be saved as persistent tables in the **Hive metastore** with the **saveAsTable** command. Note that an existing **Hive** deployment is not required for this feature: **Spark** will create a default local **Hive metastore** (using **Derby**) for you. Unlike the **createOrReplaceTempView** command, **saveAsTable** materializes the contents of the **DataFrame** and creates a pointer to the data in the **Hive metastore**. Persistent tables still exist after your **Spark** application restarts, as long as you keep connecting to the same **metastore**. A **DataFrame** for a persistent table can be created by calling the **table** method on a **SparkSession**.
122 |
123 | By default the **saveAsTable** operation creates a "**managed table**", which means the location of the data is controlled by the **metastore**. For **managed tables** the data is also deleted automatically when the table is dropped
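
For example (a small sketch; the table name is illustrative):

```scala
df.write.mode("overwrite").saveAsTable("people_table")
val fromTable = spark.table("people_table")  //read it back via the table method
```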
124 |
125 |
126 |
127 | ### Caveats
128 |
129 | - Do not force an option onto a method that does not support it, e.g. option("sep",";") with text will throw an error
130 | - Note that a file supplied as a **json file** is not a typical **JSON** file: **every line must contain a separate, self-contained, valid JSON object**. A regular multi-line **JSON** file will therefore usually fail
--------------------------------------------------------------------------------