├── .gitattributes
├── README.md
├── Spark Streaming 源码解析系列
    ├── .gitignore
    ├── 0.1 Spark Streaming 实现思路与模块概述.md
    ├── 0.imgs
    │   ├── 001.png
    │   ├── 002.png
    │   ├── 005.png
    │   ├── 006.png
    │   ├── 010.png
    │   ├── 020.png
    │   ├── 030.png
    │   ├── 032.png
    │   ├── 035.png
    │   ├── 040.png
    │   ├── 045.png
    │   ├── 046.png
    │   ├── 050.png
    │   ├── 055.png
    │   ├── 060.png
    │   ├── 065.png
    │   ├── 070.png
    │   ├── 075a.png
    │   ├── 075b.png
    │   ├── 075c.png
    │   ├── 075d.png
    │   ├── snapshot-01.png
    │   ├── streaming-arch.png
    │   └── streaming-flow.png
    ├── 1.1 DStream, DStreamGraph 详解.md
    ├── 1.2 DStream 生成 RDD 实例详解.md
    ├── 1.imgs
    │   ├── 005.png
    │   ├── 010.png
    │   ├── 020.png
    │   ├── 025.png
    │   ├── 030.png
    │   ├── 035.png
    │   ├── 040.png
    │   ├── 050.png
    │   ├── 051.png
    │   ├── 052.png
    │   └── 053.png
    ├── 2.1 JobScheduler, Job, JobSet 详解.md
    ├── 2.2 JobGenerator 详解.md
    ├── 2.imgs
    │   ├── 020.png
    │   └── 021.png
    ├── 3.1 Receiver 分发详解.md
    ├── 3.2 Receiver, ReceiverSupervisor, BlockGenerator, ReceivedBlockHandler 详解.md
    ├── 3.3 ReceiverTraker, ReceivedBlockTracker 详解.md
    ├── 3.imgs
    │   ├── 010.png
    │   ├── 020.png
    │   ├── 030.png
    │   ├── 040.png
    │   ├── 050.png
    │   ├── 060.png
    │   ├── 070.png
    │   └── 075.png
    ├── 4.1 Executor 端长时容错详解.md
    ├── 4.2 Driver 端长时容错详解.md
    ├── Q&A 什么是 end-to-end exactly-once.md
    ├── img.png
    ├── q&a.imgs
    │   └── end-to-end exactly-once.png
    └── readme.md
├── Spark 样例工程
    ├── .gitignore
    ├── README.md
    ├── spark_hello_world.zip
    └── spark_hello_world
    │   ├── pom.xml
    │   └── src
    │       └── main
    │           └── scala
    │               └── com
    │                   └── github
    │                       └── lw_lin
    │                           └── spark
    │                               └── SparkHelloWorld.scala
├── Spark 资源集合
    ├── README.md
    └── resources
    │   ├── spark_ai_summit_2019_san_francisco.png
    │   ├── spark_summit_east_2017.png
    │   ├── spark_summit_europe_2016.jpg
    │   ├── spark_summit_europe_2016.png
    │   ├── spark_summit_europe_2017.png
    │   ├── wechat_sh_meetup.PNG
    │   ├── wechat_sh_meetup_small.PNG
    │   ├── wechat_spark_streaming.PNG
    │   ├── wechat_spark_streaming_small.PNG
    │   └── wechat_spark_streaming_small_.PNG
├── Structured Streaming 源码解析系列
    ├── 1.1 Structured Streaming 实现思路与实现概述.md
    ├── 1.imgs
    │   ├── 010.png
    │   ├── 015.png
    │   ├── 030.png
    │   ├── 040.png
    │   ├── 050.png
    │   ├── 070.png
    │   ├── 100.png
    │   ├── 110.png
    │   ├── 120.png
    │   ├── 170.png
    │   ├── checked.png
    │   └── negative.png
    ├── 2.1 Structured Streaming 之 Source 解析.md
    ├── 2.2 Structured Streaming 之 Sink 解析.md
    ├── 3.1 Structured Streaming 之状态存储解析.md
    ├── 3.imgs
    │   ├── 100.png
    │   └── 200.png
    ├── 4.1 Structured Streaming 之 Event Time 解析.md
    ├── 4.2 Structured Streaming 之 Watermark 解析.md
    ├── 4.imgs
    │   ├── 100.png
    │   ├── 150.png
    │   ├── 200.png
    │   ├── 210.png
    │   ├── 220.png
    │   ├── 230.png
    │   ├── 300.png
    │   └── 300_large.png
    └── README.md
├── coolplay_spark_logo_cn.png
└── coolplay_spark_logo_cn_small.png


/.gitattributes:
--------------------------------------------------------------------------------
 1 | # Auto detect text files and perform LF normalization
 2 | * text=auto
 3 | 
 4 | # Custom for Visual Studio
 5 | *.cs     diff=csharp
 6 | 
 7 | # Standard to msysgit
 8 | *.doc	 diff=astextplain
 9 | *.DOC	 diff=astextplain
10 | *.docx diff=astextplain
11 | *.DOCX diff=astextplain
12 | *.dot  diff=astextplain
13 | *.DOT  diff=astextplain
14 | *.pdf  diff=astextplain
15 | *.PDF	 diff=astextplain
16 | *.rtf	 diff=astextplain
17 | *.RTF	 diff=astextplain
18 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ![coolplay_spark_logo](coolplay_spark_logo_cn_small.png)
 2 | 
 3 | Welcome to Coolplay Spark (Chinese name: 酷玩 Spark)!
 4 | 
 5 | Coolplay Spark will include Spark source code analysis, Spark libraries, Spark code, and more.
 6 | 
 7 | ## Published content
 8 | 
 9 | - [Structured Streaming Source Code Analysis Series](https://github.com/lw-lin/CoolplaySpark/tree/master/Structured%20Streaming%20%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90%E7%B3%BB%E5%88%97)
10 |   - Structured Streaming is the stream processing system of Spark 2.x
11 | - [Spark Streaming Source Code Analysis Series](https://github.com/lw-lin/CoolplaySpark/tree/master/Spark%20Streaming%20%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90%E7%B3%BB%E5%88%97)
12 |   - Spark Streaming is the stream processing system of Spark 1.x
13 | - [Spark Resource Collection](https://github.com/lw-lin/CoolplaySpark/tree/master/Spark%20%E8%B5%84%E6%BA%90%E9%9B%86%E5%90%88)
14 |   - Includes the Spark technical exchange group (Chinese community)<br/>![wechat_spark_streaming_small](Spark%20%E8%B5%84%E6%BA%90%E9%9B%86%E5%90%88/resources/wechat_spark_streaming_small_.PNG)
15 |   - Spark Summit videos and other resources
16 | 
17 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/.gitignore:
--------------------------------------------------------------------------------
1 | todos.md
2 | *.thumbnail.png
3 | *.vsd


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.1 Spark Streaming 实现思路与模块概述.md:
--------------------------------------------------------------------------------
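  1 | # Spark Streaming: Implementation Approach and Module Overview #
  2 | 
  3 | ***[Coolplay Spark] Spark Streaming Source Code Analysis Series***; to return to the table of contents, [click here](readme.md)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 13 | ```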
 14 | <br/>
 15 | <br/>
 16 | 
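 17 | ## 1. The idea behind building Spark Streaming on Spark
 18 | 
 19 | The relationship between Spark Streaming and Spark Core can be described with the classic component diagram below:
 20 | 
 21 | ![image](0.imgs/010.png)
 22 | 
 23 | In this section we first look at how streaming data can be processed with the RDD API of Spark Core. **Understanding the idea described below is very important, because once it is expanded in detail, the module division and code logic of the whole of Spark Streaming follow from it**.
 24 | 
 25 | Step 1: suppose we have a small chunk of data. Using the RDD API, we can construct an RDD DAG that processes it (as shown below).
 26 | 
 27 | ![image](0.imgs/020.png)
 28 | 
 29 | Step 2: we slice the continuous streaming data, for example by accumulating the events of the most recent 200ms. Each slice is a batch, and we process the data of that batch with the RDD DAG from step 1.
 30 | 
 31 | > Note: we use the term batch here. An interval such as 200ms is usually called a mini-batch in other comparable systems, but since the official Spark Streaming term is batch, we use batch to mean mini-batch here :)
 32 | 
 33 | So, slicing the continuous streaming data repeatedly yields multiple batches, and correspondingly multiple RDD DAGs (each RDD DAG handles the data of one batch). As a result, **these RDD DAGs are isomorphic to one another, yet they are distinct instances**. The figure below shows this relationship:
 34 | 
 35 | ![image](0.imgs/030.png)
 36 | 
 37 | So we will need:
 38 | 
 39 | - (1) a **static** **template** of the RDD DAG, to represent the processing logic;
 40 | 
 41 | - (2) a **dynamic** **job controller** that cuts the continuous streaming data into segments, **copies** new RDD DAG **instances** from the template, and processes the data segments;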
 42 | 
 43 | ![image](0.imgs/032.png)
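
The split between a static template (1) and a dynamic controller (2) can be sketched in a few lines of plain Scala. This is a toy model rather than Spark code: ordinary collections stand in for RDDs, batches are cut by record count instead of by time, and the names `TemplateSketch`, `template` and `run` are made up for illustration.

```scala
// A toy model, not Spark code: plain Scala collections stand in for RDDs.
object TemplateSketch {
  // (1) The static template: processing logic defined once, independent of any batch.
  val template: Seq[String] => Map[String, Int] = lines =>
    lines
      .flatMap(_.split(" ").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // (2) The dynamic controller: cut the stream into batches (by count here, an
  // assumption made to keep the sketch simple) and apply the template to each one.
  def run(stream: Seq[String], batchSize: Int): Seq[Map[String, Int]] =
    stream.grouped(batchSize).map(template).toSeq

  def main(args: Array[String]): Unit = {
    val stream = Seq("a b", "b c", "c c", "a")
    run(stream, 2).zipWithIndex.foreach { case (result, i) =>
      println(s"batch ${i + 1}: $result")
    }
  }
}
```

Every batch is processed by the same `template`, yet each call produces its own result, which mirrors the "isomorphic but distinct instances" relationship described above.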
 44 | 
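 45 | Step 3: let us go back to how the streaming data itself is produced. When Hadoop MapReduce or the Spark RDD API do batch processing, the data is generally assumed to already be in HDFS, HBase, or other storage. Streaming data, such as a Twitter stream, may instead be produced in real time outside the system, so it has to be imported into the Spark Streaming system, in the same way that a Spout in Apache Storm or an Adapter in Apache S4 brings data into the system. So we will need:
 46 | 
 47 | - (3) production and import of the raw data;
 48 | 
 49 | Step 4: with parts (1), (2), and (3) in place, can we now process streaming data smoothly with the RDD API? Not quite. A batch job usually finishes within a few hours, whereas the running time of a streaming job is +∞ (positive infinity), so we will also need:
 50 | 
 51 | - (4) safeguards for long-running jobs, including rebuilding input data after it is lost and rescheduling processing tasks after they fail.
 52 | 
 53 | At this point, the characteristics of streaming data dictate that, to process streaming data on top of Spark Core, we still need to solve the four problems (1)(2)(3)(4) listed above within the Spark Core framework: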
 54 | 
 55 | ![image](0.imgs/035.png)
 56 | 
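 57 | ## 2. Overall module division of Spark Streaming
 58 | 
 59 | According to which of these 4 problems each part of Spark Streaming focuses on, Spark Streaming can be divided into four major modules:
 60 | 
 61 | - Module 1: static DAG definition
 62 | - Module 2: dynamic job generation
 63 | - Module 3: data production and import
 64 | - Module 4: long-running fault tolerance
 65 | 
 66 | The main classes involved in each module are sketched below:
 67 | 
 68 | ![image](0.imgs/040.png)
 69 | 
 70 | Do not worry about the exact purpose of each class yet; we describe them briefly in this article and cover each module in detail in later articles of this series.
 71 | 
 72 | ### 2.1 Module 1: static DAG definition
 73 | 
 74 | From the description above, we know that the computation logic should first be described as a "template" of an RDD DAG. Later, during dynamic job generation, Spark Streaming creates an instance of the RDD DAG from this template for each batch.
 75 | 
 76 | #### DStream and DStreamGraph
 77 | 
 78 | In Spark Streaming, the concrete class corresponding to the RDD "template" is `DStream`, and the concrete class corresponding to the RDD DAG "template" is `DStreamGraph`. `RDD` itself has many subclasses, and almost every one has a corresponding `DStream`; for example, `UnionRDD` corresponds to `UnionDStream`. `RDD`s are connected into an RDD DAG through `transformation`s (although the RDD DAG has no concrete class in Spark Core), and `DStream`s are likewise connected into a `DStreamGraph` through `transformation`s.
 79 | 
 80 |     Fully qualified name of DStream:      org.apache.spark.streaming.dstream.DStream
 81 |     Fully qualified name of DStreamGraph: org.apache.spark.streaming.DStreamGraph
 82 |     
 83 | #### Relationship between DStream and RDD
 84 | 
 85 | Since `DStream` is the template of `RDD`, and `DStream` and `RDD` have the same *transformation* operations, such as map(), filter(), reduce(), and so on (it is these shared *transformations* that let `DStreamGraph` faithfully record the computation logic of the RDD DAG), is there any difference between `RDD` and `DStream`?
 86 | 
 87 | There is.
 88 | 
 89 | For example, a `DStream` keeps a reference to every `RDD` instance it produces. In the figure below, `DStream A` instantiates 3 `RDD`s in 3 batches, namely `a[1]`, `a[2]`, `a[3]`, so `DStream A` keeps a hash table of `batch → produced RDD` with the 3 entries `batch 1 → a[1]`, `batch 2 → a[2]`, `batch 3 → a[3]`.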
 90 | 
 91 | ![image](0.imgs/045.png)
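
The `batch → produced RDD` hash table can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions: `ToyRdd` and `ToyDStream` are hypothetical stand-ins written for this example, and batches are identified by plain integers rather than by time.

```scala
import scala.collection.mutable

// A toy model, not Spark code: ToyRdd and ToyDStream are illustrative stand-ins.
object GeneratedRddSketch {
  // Stand-in for an RDD: just the records of one batch.
  final case class ToyRdd(batch: Int, records: Seq[String])

  // Stand-in for a DStream: a template (compute) plus a table of produced instances.
  class ToyDStream(compute: Int => ToyRdd) {
    private val generated = mutable.HashMap.empty[Int, ToyRdd]

    // Return the instance for this batch, creating and remembering it on first request.
    def getOrCompute(batch: Int): ToyRdd =
      generated.getOrElseUpdate(batch, compute(batch))

    // The batches for which an instance has been produced so far.
    def generatedBatches: Seq[Int] = generated.keys.toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val a = new ToyDStream(batch => ToyRdd(batch, Seq(s"record-of-batch-$batch")))
    (1 to 3).foreach(batch => a.getOrCompute(batch))
    println(a.generatedBatches)
  }
}
```

Asking for the same batch twice returns the remembered instance instead of computing a new one, which is the point of keeping the table.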
 92 | 
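 93 | In addition, `DStream` subclasses that support rate control, such as `ReceiverInputDStream`, also keep values such as the number of source records in past batches and the time past batches took to compute, which are used to compute accurate rate-control information in real time. All of this is recorded in the `DStream`, whereas `RDD a[1]` and the others do not keep it.
 94 | 
 95 | As a mental model, an `RDD` plus the batch dimension is a `DStream`, and a `DStream` minus the batch dimension is an `RDD`, that is, `RDD = DStream at batch T`.
 96 | 
 97 | One point deserves special mention: in the graph of a `DStreamGraph`, DStreams (the data) are the vertices and the transformations between `DStream`s (the computation) are the edges, which is the opposite of Apache Storm and similar systems.
 98 | 
 99 | In an Apache Storm topology, computations are the vertices and streams (continuous tuples, that is, the data) are the edges. This is why readers who know Storm well often find DStream hard to grasp at first. To repeat: a DStream is the data itself, and in the directed graph it is a vertex, not an edge.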
100 | 
101 | ![image](0.imgs/046.png)
102 |     
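103 | ### 2.2 Module 2: dynamic job generation
104 | 
105 | Now that we have `DStreamGraph` and `DStream`, that is, the statically defined computation logic, let us look at how Spark Streaming schedules it dynamically.
106 | 
107 | At the entry point of a Spark Streaming program we always define a batchDuration, which is how often an RDD DAG instance is generated dynamically from the static `DStreamGraph`. In Spark Streaming, the concrete class in overall charge of dynamic job scheduling is `JobScheduler`; when a Spark Streaming program starts running, an instance of `JobScheduler` is created and started via start().
108 | 
109 | `JobScheduler` has two very important members: `JobGenerator` and `ReceiverTracker`. `JobScheduler` delegates the actual generation of each batch's RDD DAG to `JobGenerator`, and delegates the bookkeeping of source input data to `ReceiverTracker`.
110 | 
111 | ![image](0.imgs/050.png)
112 | 
113 |     Fully qualified name of JobScheduler:    org.apache.spark.streaming.scheduler.JobScheduler
114 |     Fully qualified name of JobGenerator:    org.apache.spark.streaming.scheduler.JobGenerator
115 |     Fully qualified name of ReceiverTracker: org.apache.spark.streaming.scheduler.ReceiverTracker
116 | 
117 | **`JobGenerator` maintains a timer** whose period is the batchDuration mentioned above, and **on each tick generates an RDD DAG instance for the batch**. Specifically, each RDD DAG generation consists of 5 steps:
118 | 
119 | - (1) **ask `ReceiverTracker` to allocate the data received so far**, that is, assign the data that arrived after the previous batch was cut to this new batch;
120 | - (2) **ask `DStreamGraph` to copy out a new RDD DAG instance**. In detail, `DStreamGraph` asks the tail `DStream` nodes of the graph to generate concrete RDD instances, which recursively call their upstream `DStream` nodes, and so on until the whole `DStreamGraph` has been traversed; when the traversal ends, the RDD DAG instance has been generated;
121 | - (3) **fetch the meta information of the source data that `ReceiverTracker` allocated to this batch in step 1**;
122 | - (4) **submit** the RDD DAG of this batch generated in step 2, together with the meta information from step 3, **to `JobScheduler` for asynchronous execution**;
123 | - (5) as soon as submission finishes (whether or not asynchronous execution has started), **immediately checkpoint the current running state of the whole system**.
124 | 
125 | The call relationships of these 5 steps are shown below: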
126 | 
127 | ![image](0.imgs/055.png)
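
The order of the 5 steps can be sketched with a toy model. Nothing below is the actual Spark implementation: `ToyTracker`, `ToyGenerator`, `Block` and `Job` are hypothetical stand-ins, the timer is replaced by explicit calls to `generateJobs`, and "checkpointing" just records the batch number.

```scala
import scala.collection.mutable

// A toy model, not Spark code: all class names here are illustrative stand-ins.
object JobGenerationSketch {
  final case class Block(id: Int, records: Seq[String])
  final case class Job(batch: Int, run: () => Int)

  // Stand-in for ReceiverTracker: holds unallocated blocks and the per-batch allocation.
  class ToyTracker {
    private val unallocated = mutable.ArrayBuffer.empty[Block]
    private val allocated = mutable.Map.empty[Int, Seq[Block]]

    def addBlock(block: Block): Unit = unallocated += block

    // Everything received since the previous allocation goes to this batch.
    def allocateBlocksToBatch(batch: Int): Unit = {
      allocated(batch) = unallocated.toList
      unallocated.clear()
    }

    def blocksOf(batch: Int): Seq[Block] = allocated.getOrElse(batch, Seq.empty)
  }

  // Stand-in for JobGenerator: one call to generateJobs corresponds to one timer tick.
  class ToyGenerator(tracker: ToyTracker) {
    val submitted = mutable.Buffer.empty[(Job, Seq[Int])]
    val checkpoints = mutable.Buffer.empty[Int]

    def generateJobs(batch: Int): Unit = {
      tracker.allocateBlocksToBatch(batch)                        // step 1: allocate
      val blocks = tracker.blocksOf(batch)
      val job = Job(batch, () => blocks.map(_.records.size).sum)  // step 2: instantiate the template
      val meta = blocks.map(_.id)                                 // step 3: fetch meta information
      submitted += ((job, meta))                                  // step 4: submit (not yet run)
      checkpoints += batch                                        // step 5: checkpoint right away
    }
  }

  def main(args: Array[String]): Unit = {
    val tracker = new ToyTracker
    val generator = new ToyGenerator(tracker)
    tracker.addBlock(Block(1, Seq("a", "b")))
    generator.generateJobs(1)
    tracker.addBlock(Block(2, Seq("c")))
    generator.generateJobs(2)
    generator.submitted.foreach { case (job, meta) =>
      println(s"batch ${job.batch}: blocks $meta, records ${job.run()}")
    }
  }
}
```

Note that step 5 happens immediately after submission, before any job has run; in the sketch the jobs are only executed later, in `main`.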
128 | 
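129 | ### 2.3 Module 3: data production and import
130 | 
131 | Next we look at the module with which Spark Streaming solves the third problem, namely data production and import.
132 | 
133 | `DStream` has an important and special subclass, `ReceiverInputDStream`: besides instantiating an `RDD` in each batch like other `DStream`s, it also needs an additional `Receiver` to produce data for that `RDD`!
134 | 
135 | Specifically, when the program first starts running, Spark Streaming does the following:
136 | 
137 | - (1) `ReceiverTracker`, the coordinator of all `Receiver`s, dispatches multiple jobs (each with 1 task) to multiple executors, each of which starts a `ReceiverSupervisor` instance;
138 | 
139 | - (2) each `ReceiverSupervisor`, once started, immediately creates an instance of the user-provided `Receiver` implementation (which can continuously produce data or continuously receive data from outside the system; for example, `TwitterReceiver` can crawl Twitter data in real time) and, once the `Receiver` instance is created, calls `Receiver.onStart()`;
140 | 
141 | ![image](0.imgs/060.png)
142 | 
143 |     Fully qualified name of ReceiverSupervisor: org.apache.spark.streaming.receiver.ReceiverSupervisor
144 |     Fully qualified name of Receiver:           org.apache.spark.streaming.receiver.Receiver
145 | 
146 | Steps (1) and (2) are shown in the figure above; at this point the `Receiver` startup work is complete.
147 | 
148 | From then on, `ReceiverSupervisor` plays the main role on the executor side, and:
149 | 
150 | - (3) after `onStart()`, the `Receiver` **continuously** receives external data and continuously hands it to `ReceiverSupervisor` for storage;
151 | 
152 | - (4) `ReceiverSupervisor` **continuously** receives the data handed over by the `Receiver`:
153 | 
154 | 	- if the records are small, `BlockGenerator` must accumulate many records into one block (4a), which is then stored as a block (4b or 4c)
155 | 	- otherwise no accumulation is needed and the data is stored directly as a block (4b or 4c)
156 |   
157 | 	- Spark Streaming currently supports two ways of storing blocks: `BlockManagerBasedBlockHandler` stores directly to the executor's memory or disk, while `WriteAheadLogBasedBlockHandler` writes to both the WAL (4c) and the executor's memory or disk
158 | 
159 | - (5) each time a block has been stored on the executor, `ReceiverSupervisor` promptly reports the block's meta information to `ReceiverTracker` on the driver; this meta information includes the block id, its location, the number of records, its size, and so on;
160 | 
161 | - (6) `ReceiverTracker` then passes the received block meta information straight to its member `ReceivedBlockTracker`, which is dedicated to managing the meta information of received blocks.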
162 | 
163 | ![image](0.imgs/065.png)
164 | 
165 |     Fully qualified name of BlockGenerator:                 org.apache.spark.streaming.receiver.BlockGenerator
166 |     Fully qualified name of BlockManagerBasedBlockHandler:  org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler
167 |     Fully qualified name of WriteAheadLogBasedBlockHandler: org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler
168 |     Fully qualified name of ReceivedBlockTracker:           org.apache.spark.streaming.scheduler.ReceivedBlockTracker
169 |     Fully qualified name of ReceiverInputDStream:           org.apache.spark.streaming.dstream.ReceiverInputDStream
170 | 
171 | Steps (3)(4)(5)(6) happen **continuously**, which is also marked in the figure above.
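
Steps (4a), (4b) and (5) can be sketched as a toy model: small records are accumulated into a block, the block is stored, and its meta information is reported. This is an illustration only. `ToySupervisor`, `Block` and `BlockMeta` are hypothetical names, and cutting a block after a fixed number of records is an assumption made for this sketch; how the real `BlockGenerator` decides when to cut a block is not covered here.

```scala
import scala.collection.mutable.ArrayBuffer

// A toy model, not Spark code: all names are illustrative stand-ins.
object BlockAccumulationSketch {
  final case class Block(id: Int, records: Seq[String])
  final case class BlockMeta(id: Int, numRecords: Int)

  // Stand-in for the executor side: accumulate (4a), store (4b), report meta (5).
  class ToySupervisor(recordsPerBlock: Int, report: BlockMeta => Unit) {
    private val buffer = ArrayBuffer.empty[String]
    private var nextId = 0
    val stored = ArrayBuffer.empty[Block]

    // Called for every record handed over by the receiver.
    def push(record: String): Unit = {
      buffer += record
      if (buffer.size >= recordsPerBlock) flush()
    }

    // Cut the buffered records into one block, store it, then report its meta information.
    def flush(): Unit = if (buffer.nonEmpty) {
      val block = Block(nextId, buffer.toList)
      nextId += 1
      buffer.clear()
      stored += block
      report(BlockMeta(block.id, block.records.size))
    }
  }

  def main(args: Array[String]): Unit = {
    val supervisor = new ToySupervisor(2, meta => println(s"reported to driver: $meta"))
    Seq("a", "b", "c").foreach(record => supervisor.push(record))
    supervisor.flush()
  }
}
```

The `report` callback plays the role of the executor-to-driver message in step (5); only the meta information crosses that boundary, never the records themselves.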
172 | 
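173 | Later, on the driver, `ReceiverInputDStream` checks in each batch the block meta information received by `ReceiverTracker`, determines which new data should be processed in this batch, and then generates the corresponding `RDD` instance to process those blocks. This process was described in `Module 1: static DAG definition` and `Module 2: dynamic job generation`.
174 |  
175 | ### 2.4 Module 4: long-running fault tolerance
176 | 
177 | Having sketched the 3 modules of functionality that Spark Streaming adds on top of Spark Core, we now look at how the 4th module keeps Spark Streaming running over the long term, that is, how it combines with the first 3 modules to keep them running.
178 | 
179 | From the analysis of the key classes of the first 3 modules, we know that safeguarding modules 1 and 2 must be done on the driver, while safeguarding module 3 must be done on both the executor and the driver.
180 | 
181 | #### Long-running fault tolerance on the executor
182 | 
183 | First, the executor side.
184 | 
185 | On the executor, a failed `ReceiverSupervisor` or `Receiver` can simply be restarted; the key is to protect the received block data. Once the source block data is protected, the RDD DAG (the lineage of Spark Core) can be recomputed.
186 | 
187 | Spark Streaming protects source block data at 4 levels, which are comprehensive, complement one another, and can be configured flexibly for different scenarios:
188 | 
189 | - **(1) Hot standby**: when storing block data, store it on the local executor and simultaneously replicate it to another executor. If one replica fails, computation can switch immediately and transparently to the other. To enable this, specify a `StorageLevel` of `MEMORY_ONLY_2` or `MEMORY_AND_DISK_2` when implementing your own Receiver.
190 | 
191 | > // 1.5.2 update: this is now the default.
192 | 
193 | - **(2) Cold standby**: before each block is stored, it is first written out as a log entry to the `WriteAheadLog`, and then stored on the local executor. When the executor fails, another executor reads the WAL and replays the log to recover the block data. The WAL is usually written to reliable storage such as HDFS, so recovery may take some recover time.
194 | 
195 | ![image](0.imgs/070.png)
196 | 
197 | - **(3) Replay**: if the upstream supports replay, as Apache Kafka does, you can choose not to store the data separately with hot or cold standby, and instead replay the data on another executor after a failure.
198 | 
199 | - **(4) Ignore**: finally, if the application values timeliness over accuracy, we can choose to ignore a lost block and not recover the lost source data.
200 | 
201 | We summarize this in a table: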
202 | 
203 | <table>
204 | <tr>
205 | 	<td align="center"></td>
206 | 	<td align="center"><strong>图示</strong></td>
207 | 	<td align="center"><strong>优点</strong></td>
208 | 	<td align="center"><strong>缺点</strong></td>
209 | </tr>
210 | <tr>
211 | 	<td align="center"><strong>(1) 热备</strong></td>
212 | 	<td align="center"><img src="0.imgs/075a.png"></img></td>
213 | 	<td align="center">无 recover time</td>
214 | 	<td align="center">需要占用双倍资源</td>
215 | </tr>
216 | <tr>
217 | 	<td align="center"><strong>(2) 冷备</strong></td>
218 | 	<td align="center"><img src="0.imgs/075b.png"></img></td>
219 | 	<td align="center">十分可靠</td>
220 | 	<td align="center">存在 recover time</td>
221 | </tr>
222 | <tr>
223 | 	<td align="center"><strong>(3) 重放</strong></td>
224 | 	<td align="center"><img src="0.imgs/075c.png"></img></td>
225 | 	<td align="center">不占用额外资源</td>
226 | 	<td align="center">存在 recover time</td>
227 | </tr>
228 | <tr>
229 | 	<td align="center"><strong>(4) 忽略</strong></td>
230 | 	<td align="center"><img src="0.imgs/075d.png"></img></td>
231 | 	<td align="center">无 recover time</td>
232 | 	<td align="center">准确性有损失</td>
233 | </tr>
234 | </table>
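
The cold-standby idea (2), log first, store second, replay on recovery, can be sketched as follows. This is a toy model and not the Spark `WriteAheadLog` API: `ToyLog` stands in for reliable storage, `ToyExecutor` for an executor's in-memory block store, and both names are hypothetical.

```scala
import scala.collection.mutable

// A toy model, not Spark code: ToyLog and ToyExecutor are illustrative stand-ins.
object WriteAheadLogSketch {
  final case class Block(id: Int, records: Seq[String])

  // Stand-in for reliable storage such as HDFS: it survives executor failures.
  class ToyLog {
    private val entries = mutable.ArrayBuffer.empty[Block]
    def append(block: Block): Unit = entries += block
    def readAll(): Seq[Block] = entries.toList
  }

  // Stand-in for an executor's block store: its memory is lost when the executor fails.
  class ToyExecutor(log: ToyLog) {
    private val memory = mutable.Map.empty[Int, Block]

    // Write to the log first, then store locally.
    def store(block: Block): Unit = {
      log.append(block)
      memory(block.id) = block
    }

    // Replay the log to rebuild the local block store.
    def recover(): Unit = log.readAll().foreach(block => memory(block.id) = block)

    def get(id: Int): Option[Block] = memory.get(id)
  }

  def main(args: Array[String]): Unit = {
    val log = new ToyLog
    val failed = new ToyExecutor(log)
    failed.store(Block(1, Seq("x", "y")))

    // The first executor is gone; a fresh one starts empty and replays the log.
    val replacement = new ToyExecutor(log)
    println(replacement.get(1))
    replacement.recover()
    println(replacement.get(1))
  }
}
```

The replay in `recover()` is where the "recover time" in the table comes from: the replacement executor has nothing until it has read the log back.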
235 | 
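236 | #### Long-running fault tolerance on the driver
237 | 
238 | As mentioned earlier, block meta information is reported to `ReceiverTracker` and then handed to `ReceivedBlockTracker` for actual management. `ReceivedBlockTracker` is also backed up with WAL cold standby; after a driver failure, the new `ReceivedBlockTracker` reads the WAL and recovers the block meta information.
239 | 
240 | In addition, `DStreamGraph` and `JobScheduler` need a periodic `Checkpoint`, to record changes to the whole `DStreamGraph` and the completion status of each batch's jobs.
241 | 
242 | Note that this uses a full checkpoint, which differs from the WAL approach above. A `Checkpoint` is usually also written to reliable storage such as HDFS. By default the `Checkpoint` interval equals `batchDuration`: a `Checkpoint` is made each time a batch is launched and its jobs have been submitted, and another `Checkpoint` is made when the jobs complete and the task state is updated.
243 | 
244 | This way, after the driver fails and recovers, it can read the most recent `Checkpoint` to restore the job's `DStreamGraph` and the running and completion status of the jobs.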
245 | 
246 | #### Summary ####
247 | 
248 | <table>
249 |     <tr>
250 |         <td align="center"><strong>Module</strong></td>
251 |         <td align="center" colspan="2"><strong>How long-running fault tolerance is ensured</strong></td>
252 |     </tr>
253 |     <tr>
254 |         <td align="center">Module 1: static DAG definition</td>
255 |         <td align="center">driver</td>
256 |         <td>Periodically checkpoint DStreamGraph, to record changes to the whole DStreamGraph</td>
257 |     </tr>
258 |     <tr>
259 |         <td align="center">Module 2: dynamic job generation</td>
260 |         <td align="center">driver</td>
261 |         <td>Periodically checkpoint JobScheduler, to record the completion status of each batch's jobs</td>
262 |     </tr>
263 |     <tr>
264 |         <td align="center">Module 3: data production and import</td>
265 |         <td align="center">driver</td>
266 |         <td>Write the meta information of source blocks to the WAL when it is reported to ReceiverTracker</td>
267 |     </tr>
268 |     <tr>
269 |         <td align="center">Module 3: data production and import</td>
270 |         <td align="center">executor</td>
271 |         <td>Protection of source block data: (1) hot standby; (2) cold standby; (3) replay; (4) ignore</td>
272 |     </tr>
273 | </table>
274 | 
275 | The table above summarizes "Module 4: long-running fault tolerance". As it shows, the long-running fault tolerance of Spark Streaming can provide processing semantics with no duplicates and no losses, that is, exactly-once.
276 | 
277 | ## 3. Entry point: StreamingContext
278 | 
279 | We have spent a lot of space introducing the four modules of Spark Streaming; to finish, we introduce `StreamingContext`.
280 | 
281 | Below we use this complete [quick example](http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example) of only 11 lines to show how user code interacts with the modules above through `StreamingContext`:
282 | 
283 | ```scala
284 | import org.apache.spark._
285 | import org.apache.spark.streaming._
286 | 
287 | // First configure this quick example to run locally, with the app name NetworkWordCount
288 | val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
289 | // Set batchDuration to 1 second, then create the streaming entry point
290 | val ssc = new StreamingContext(conf, Seconds(1))
291 | 
292 | // ssc.socketTextStream() creates a SocketInputDStream; the SocketReceiver of this InputDStream reads from port 9999 on localhost
293 | val lines = ssc.socketTextStream("localhost", 9999)
294 | 
295 | val words = lines.flatMap(_.split(" "))      // DStream transformation
296 | val pairs = words.map(word => (word, 1))     // DStream transformation
297 | val wordCounts = pairs.reduceByKey(_ + _)    // DStream transformation
298 | wordCounts.print()                           // DStream output
299 | // The 4 lines above use DStream transformations to build the DStreamGraph lines -> words -> pairs -> wordCounts -> .print()
300 | // Note that so far we have only defined the SocketReceiver that produces data and a DStreamGraph; both are static
301 | 
302 | // The start() call below starts JobScheduler behind the scenes, which in turn starts JobGenerator and ReceiverTracker
303 | // ssc.start()
304 | //    -> JobScheduler.start()
305 | //        -> JobGenerator.start();    starts generating batches one after another
306 | //        -> ReceiverTracker.start(); starts distributing ReceiverSupervisors to executors, which then create and start Receivers
307 | ssc.start()
308 | 
309 | // The main thread of the user code then blocks on the line below
310 | // While it blocks, the background JobScheduler threads keep producing batch after batch without stopping
311 | // It is only here that the print() of the statically defined DStreamGraph is invoked on RDD instances again and again, printing the result of each batch
312 | ssc.awaitTermination()
313 | ```
314 | 
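315 | So we see that `StreamingContext` is the simple, unified entry point that Spark Streaming offers user code for interacting with the 4 modules described above.
316 | 
317 | ## 4. Summary and review
318 | 
319 | To finish, we reproduce part of the [official Spark Streaming Programming Guide](http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example) here as a review and summary of this article. Have a look: if you understood this article, these rather high-level descriptions should read much more clearly :-)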
320 | 
321 | > **Spark Streaming** is an extension of the **core Spark API** that enables **scalable**, **high-throughput**, **fault-tolerant stream processing of live data streams**. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
322 | 
323 | > ![](0.imgs/streaming-arch.png)
324 | 
325 | > Internally, it works as follows. **Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches**.
326 | 
327 | > ![](0.imgs/streaming-flow.png)
328 | 
329 | > Spark Streaming provides a high-level abstraction called **discretized stream** or **DStream**, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. **Internally, a DStream is represented as a sequence of RDDs**.
330 | 
331 | > ...
332 | 
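333 | ## Creative Commons
334 | 
335 | ![](https://licensebuttons.net/l/by-nc/4.0/88x31.png)
336 | 
337 | Unless otherwise noted, this article and the other articles in the Spark Streaming Source Code Analysis Series are licensed under the [CC BY-NC (Attribution-NonCommercial)](https://creativecommons.org/licenses/by-nc/4.0/) Creative Commons license.
338 | 
339 | <br/>
340 | <br/>
341 | 
342 | (End of article. To join the discussion of this article, [click here](https://github.com/proflin/CoolplaySpark/issues/1); to return to the table of contents, [click here](readme.md))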
343 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/001.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/002.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/002.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/005.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/005.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/006.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/006.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/010.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/010.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/020.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/030.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/030.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/032.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/032.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/035.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/035.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/040.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/040.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/045.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/045.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/046.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/046.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/050.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/050.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/055.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/055.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/060.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/060.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/065.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/065.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/070.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/070.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/075a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/075a.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/075b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/075b.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/075c.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/075c.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/075d.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/075d.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/snapshot-01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/snapshot-01.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/streaming-arch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/streaming-arch.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/0.imgs/streaming-flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/0.imgs/streaming-flow.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.1 DStream, DStreamGraph 详解.md:
--------------------------------------------------------------------------------
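  1 | # DStream, DStreamGraph in Detail #
  2 | 
  3 | ***[Coolplay Spark] Spark Streaming Source Code Analysis Series***; to return to the table of contents, [click here](readme.md)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 13 | ```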
 14 | <br/>
 15 | <br/>
 16 | 
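 17 | Before reading this article, please be sure to read [Spark Streaming: Implementation Approach and Module Overview](0.1 Spark Streaming 实现思路与模块概述.md) first. It outlines the basic roles of the 4 modules of Spark Streaming; with that overall picture in mind, come back to this article's explanation of the details of `Module 1: static DAG definition`.
 18 | 
 19 | ## Introduction
 20 | 
 21 | As discussed in the previous article, the problem that `Module 1: static DAG definition` of Spark Streaming solves is how to describe the computation logic as a "template" of an RDD DAG, so that later, during dynamic job generation, an instance of the RDD DAG is created from this template for each batch.
 22 | 
 23 | ![image](0.imgs/032.png)
 24 | 
 25 | In Spark Streaming, the concrete class corresponding to the RDD "template" is `DStream`, and the concrete class corresponding to the RDD DAG "template" is `DStreamGraph`.
 26 | 
 27 |     Fully qualified name of DStream:      org.apache.spark.streaming.dstream.DStream
 28 |     Fully qualified name of DStreamGraph: org.apache.spark.streaming.DStreamGraph
 29 | 
 30 | ![image](1.imgs/005.png)
 31 | 
 32 | The figure above shows where the classes covered in this article sit within Spark Streaming; below we explain `DStream` and `DStreamGraph` in detail.
 33 | 
 34 | ## DStream, *transformation*, *output operation* explained
 35 | 
 36 | Recall that an RDD is defined as a read-only, partitioned dataset (`an RDD is a read-only, partitioned collection of records`). Since a DStream is the template of an RDD, we also treat a DStream as a dataset.
 37 | 
 38 | Let us first look at the ***transformation*** and ***output*** operations defined on this DStream dataset.
 39 | 
 40 | Now suppose we have a `DStream` dataset `a`: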
 41 | 
 42 | ```scala
 43 | val a = new DStream()
 44 | ```
 45 | 
 46 | Then the `filter()` operation produces a new `DStream` dataset `b` from `a`:
 47 | 
 48 | ```scala
 49 | val b = a.filter(func)
 50 | ```
 51 | 
 52 | 这里能够由已有的 `DStream` 产生新 `DStream` 的操作统称 ***transformation***。一些典型的 *tansformation* 包括 `map()`, `filter()`, `reduce()`, `join()` 等 。
 53 | 
 54 | > | Transformation | Meaning |
 55 | > | --- | --- |
 56 | > | map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
 57 | > | flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
 58 | > | filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
 59 | > | repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
 60 | 
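 61 | Other operations do not produce a new `DStream` dataset; they only act on an existing `DStream` dataset and output its content. These are collectively called ***output*** operations. For example, `a.print()` does not produce a new dataset; it only prints the content of `a`, so `print()` is an *output* operation. Typical *outputs* include `print()`, `saveAsTextFiles()`, `saveAsHadoopFiles()`, `foreachRDD()`, and so on.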
 62 | 
 63 | > | Output Operation | Meaning |
 64 | > | --- | --- |
 65 | > | print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. Python API: this is called pprint() in the Python API. |
 66 | > | saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
 67 | > | saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Python API: this is not available in the Python API. |
 68 | > | saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Python API: this is not available in the Python API. |
 69 | > | foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
 70 | 
 71 | 
 72 | ## *transformation* and *output* in a quick example
 73 | 
 74 | Let us look at how the [official Spark Streaming quick example](http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example) defines its DStream DAG; pay attention to the comments in the code:
 75 | 
 76 | ```scala
 77 | // ssc.socketTextStream() creates a SocketInputDStream; the SocketReceiver of this InputDStream connects to port 9999 on localhost
 78 | val lines = ssc.socketTextStream("localhost", 9999)
 79 | 
 80 | val words = lines.flatMap(_.split(" "))      // DStream transformation
 81 | val pairs = words.map(word => (word, 1))     // DStream transformation
 82 | val wordCounts = pairs.reduceByKey(_ + _)    // DStream transformation
 83 | wordCounts.print()                           // DStream output
 84 | ```
 85 | 
 86 | Here is the source behind `ssc.socketTextStream("localhost", 9999)`, which delegates to `socketStream()`:
 87 | 
 88 | ```scala
 89 | def socketStream[T: ClassTag](
 90 |     hostname: String,
 91 |     port: Int,
 92 |     converter: (InputStream) => Iterator[T],
 93 |     storageLevel: StorageLevel
 94 |   ): ReceiverInputDStream[T] = {
 95 |   new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
 96 | }
 97 | ```
 98 | 
 99 | In other words, `ssc.socketTextStream()` creates an instance of `SocketInputDStream`, a concrete subclass of `DStream`.
100 | 
101 | Next, we look up the source of the following line, `lines.flatMap(_.split(" "))`:
102 | 
103 | ```scala
104 | def flatMap[U: ClassTag](flatMapFunc: T => Traversable[U]): DStream[U] = ssc.withScope {
105 |   new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
106 | }
107 | ```
108 | 
109 | In other words, `lines.flatMap(_.split(" "))` creates an instance of `FlatMappedDStream`, another concrete subclass of `DStream`.
110 | 
111 | The remaining lines work the same way, so the quick example above can be drawn as the following DStream DAG:
112 | 
113 | ![image](1.imgs/010.png)
114 | 
115 | That is, if we replace each call in that code with the concrete implementation behind it, we get:
116 | 
117 | ```scala
118 | val lines = new SocketInputDStream("localhost", 9999)   // type: SocketInputDStream
119 | 
120 | val words = new FlatMappedDStream(lines, _.split(" "))  // type: FlatMappedDStream
121 | val pairs = new MappedDStream(words, word => (word, 1)) // type: MappedDStream
122 | val wordCounts = new ShuffledDStream(pairs, _ + _)      // type: ShuffledDStream
123 | new ForEachDStream(wordCounts, cnt => cnt.print())      // type: ForEachDStream
124 | ```
125 | 
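126 | To summarize:
127 | 
128 | - *transformation*: essentially, each kind of *transformation* produces a new instance of a corresponding `DStream` subclass, for example:
129 | 	- `.flatMap()` produces a `FlatMappedDStream` instance
130 | 	- `.map()`     produces a `MappedDStream` instance
131 | - *output*: always produces an instance of one subclass, `ForEachDStream`, which uses a function `func` to record the operation to perform
132 | 	- for `print()`, for example: `func` = `cnt => cnt.print()`
133 | 
134 | A separate article will enumerate and analyze all `DStream` *transformations*; we do not expand on them here.
135 | 
136 | ## DStream class hierarchy
137 | 
138 | The `SocketInputDStream`, `FlatMappedDStream`, `ForEachDStream`, and so on that we saw above are all concrete subclasses of `DStream`.
139 | 
140 | All subclasses of `DStream` are shown below:
141 | 
142 | ![image](1.imgs/040.png)
143 | 
144 | We will classify these `DStream` subclasses shortly.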
145 | 
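146 | ## Dependency and DStreamGraph Explained
147 | 
148 | Let us look at *transformations* once more. When we write `c = a.join(b), d = c.filter()`, the logical DAG relationship is `a/b → c, c → d`, but Spark Streaming physically records it in the reverse direction, `a/b ← c, c ← d`, as shown below:
149 | 
150 | ![image](1.imgs/020.png)
151 | 
152 | Why is the physical record kept in reverse rather than following the direction of the DAG?
153 | 
154 | The reason is that in Spark Core's RDD API, an RDD is evaluated lazily, only once something triggers it: to obtain the value of `d`, its upstream dependency `c` is computed first, and computing `c` in turn first computes `c`'s upstream dependencies `a` and `b`. Spark Streaming stays consistent with this reverse representation of the RDD DAG and represents DStreams in reverse as well.
155 | 
156 | So the reference from `d` to `c` expresses an upstream *dependency* (***dependency***). Nothing is evaluated until it is needed, but once the *output* operation `d.print()` triggers the evaluation of `d`, the computation has to be traced upstream starting from `d`.
157 | 
158 | Concretely, `d.print()` creates a `ForEachDStream` `x` downstream of `d`, and `x` records the operation to perform, `func = print()`. Then, whenever RDD instances are generated dynamically for a batch, a single BFS (breadth-first search) rooted at `x` quickly yields the minimal set of nodes that actually need to be computed. As the figure below shows, this minimal set is {`a`, `b`, `c`, `d`}.
159 | 
160 | ![image](1.imgs/025.png)
161 | 
162 | Here is another example. As shown below, if the `print()` *output* operation is called on both `d` and `f`, new `DStream`s `x` and `y` are created downstream of `d` and `f` respectively, each recording the operation `func = print()`. When RDD instances are generated for a batch, a BFS is run from `x` and from `y`, yielding the upstream sets {`a`,`b`,`c`,`d`} and {`b`,`e`,`f`} respectively. By contrast, no `print()` *output* operation is called on `h`, so `g` and `h` are never traversed.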
163 | 
164 | ![image](1.imgs/030.png)
165 | 
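166 | From the analysis above, we can summarize:
167 | 
168 | - (1) Logically, DStreams form a DAG through *transformations*; physically, the DAG is represented by *dependencies* (***dependency***) that point in the opposite direction to the *transformations*
169 | 
170 | - (2) When an *output* operation is called on a node, a new `ForEachDStream` is created, and it records what the *output* operation is
171 | 
172 | - (3) When RDD instances are generated dynamically for each batch, a BFS is run from the `DStream` created in (2)
173 | 
174 | We call the `DStream` created by an *output* operation in (2) an *output stream*.
175 | 
176 | Finally:
177 | 
178 | - (4) **Spark Streaming records the entire DStream DAG through a single `DStreamGraph` instance that holds references to all *output stream* nodes**
179 | 	- Traversing from all *output stream* nodes reaches every upstream `DStream` they depend on
180 | 	- `DStream` nodes that cannot be reached, such as `g` and `h`, appear in the logical DAG but are not part of the physical `DStreamGraph`, and have no effect while Spark Streaming runs
181 | 
182 | - (5) The `DStreamGraph` instance also records references to all *input stream* nodes
183 | 	- `DStreamGraph` frequently needs to visit the `DStream` nodes that have no upstream dependency, called *input streams*; recording them avoids the cost of running a BFS from the *output streams* every time the *input streams* are needed
184 | 
185 | The following figure summarizes everything described in this section:
186 | 
187 | ![image](1.imgs/035.png)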
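
The bookkeeping summarized in this section can be sketched in plain Scala without Spark. This is a toy model, not Spark's implementation: `Node` and `ToyGraph` are hypothetical stand-ins for `DStream` and `DStreamGraph`, and the parents chosen for `e`, `g`, and `h` are assumptions, since the text only fixes the reachable sets. Each node stores references to its upstream dependencies only, the graph stores references to its output streams only, and a breadth-first walk from the outputs finds every node that has to be computed.

```scala
import scala.collection.mutable

// Toy stand-in for a DStream: a node only knows its upstream dependencies.
final class Node(val name: String, val dependencies: List[Node] = Nil)

// Toy stand-in for DStreamGraph: it holds references to output streams only.
final class ToyGraph(val outputStreams: List[Node]) {
  // Breadth-first walk from the output streams toward the inputs.
  def reachable: Set[String] = {
    val seen  = mutable.Set.empty[Node]
    val queue = mutable.Queue.empty[Node]
    outputStreams.foreach { o => if (seen.add(o)) queue.enqueue(o) }
    while (queue.nonEmpty) {
      val node = queue.dequeue()
      node.dependencies.foreach { d => if (seen.add(d)) queue.enqueue(d) }
    }
    seen.map(_.name).toSet
  }
}

val a = new Node("a")
val b = new Node("b")
val c = new Node("c", List(a, b))   // c = a.join(b)
val d = new Node("d", List(c))      // d = c.filter(...)
val e = new Node("e", List(b))      // assumed parent
val f = new Node("f", List(e))
val g = new Node("g", List(e))      // assumed parent
val h = new Node("h", List(g))      // no output operation is called on h
val x = new Node("x", List(d))      // created by d.print()
val y = new Node("y", List(f))      // created by f.print()

val graph   = new ToyGraph(List(x, y))
val reached = graph.reachable       // contains a..f, x, y but neither g nor h
```

`g` and `h` are defined but never reached from an output stream, which matches the statement that they play no part at run time.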
188 | 
189 | <br/>
190 | <br/>
191 | 
192 | (End of article. To join the discussion of this article, [click here](https://github.com/proflin/CoolplaySpark/issues/2); to return to the table of contents, [click here](readme.md).)
193 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.2 DStream 生成 RDD 实例详解.md:
--------------------------------------------------------------------------------
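  1 | # How DStream Generates RDD Instances #
  2 | 
  3 | ***[CoolplaySpark] Spark Streaming Source Code Analysis Series***; to return to the table of contents, [click here](readme.md)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 13 | ```
 14 | <br/>
 15 | <br/>
 16 | 
 17 | Before reading this article, please first read [Spark Streaming Design and Module Overview](0.1 Spark Streaming 实现思路与模块概述.md). It outlines what each of Spark Streaming's 4 major modules does; with that overall picture in mind, this article's explanation of the details of `Module 1: static DAG definition` is easier to follow.
 18 | 
 19 | ## Introduction
 20 | 
 21 | As discussed in the previous article, the problem that Spark Streaming's `Module 1: static DAG definition` solves is how to describe the computation logic as a "template" of an RDD DAG. Later, during dynamic Job generation, an instance of the RDD DAG is generated from this "template" for every batch.
 22 | 
 23 | ![image](0.imgs/032.png)
 24 | 
 25 | In Spark Streaming, the concrete class for the RDD "template" is `DStream`, and the concrete class for the RDD DAG "template" is `DStreamGraph`.
 26 | 
 27 |     Fully qualified name of DStream:      org.apache.spark.streaming.dstream.DStream
 28 |     Fully qualified name of DStreamGraph: org.apache.spark.streaming.DStreamGraph
 29 | 
 30 | This article explains the most important function of `DStream`: generating an `RDD` instance for every batch.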
 31 | 
 32 | ## Quick Example ##
 33 | 
 34 | 
 35 | In the previous article, [DStream and DStreamGraph in Detail](1.1 DStream, DStreamGraph 详解.md), we quoted the DStream DAG definition from the [official Spark Streaming quick example](http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example); pay attention to the comments in the code:
 36 | 
 37 | ```scala
 38 | // ssc.socketTextStream() creates a SocketInputDStream; the SocketReceiver of this InputDStream connects to port 9999 on localhost
 39 | val lines = ssc.socketTextStream("localhost", 9999)
 40 | 
 41 | val words = lines.flatMap(_.split(" "))      // DStream transformation
 42 | val pairs = words.map(word => (word, 1))     // DStream transformation
 43 | val wordCounts = pairs.reduceByKey(_ + _)    // DStream transformation
 44 | wordCounts.print()                           // DStream output
 45 | ```
 46 | 
 47 | Here is the source behind `ssc.socketTextStream("localhost", 9999)`, which delegates to `socketStream()`:
 48 | 
 49 | ```scala
 50 | def socketStream[T: ClassTag](hostname: String, port: Int, converter: (InputStream) => Iterator[T], storageLevel: StorageLevel): ReceiverInputDStream[T] = {
 51 |   new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
 52 | }
 53 | ```
 54 | 
 55 | In other words, `ssc.socketTextStream()` creates an instance of `SocketInputDStream`, a concrete subclass of `DStream`.
 56 | 
 57 | Next, we look up the source of the following line, `lines.flatMap(_.split(" "))`:
 58 | 
 59 | ```scala
 60 | def flatMap[U: ClassTag](flatMapFunc: T => Traversable[U]): DStream[U] = ssc.withScope {
 61 |   new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
 62 | }
 63 | ```
 64 | 
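 65 | In other words, `lines.flatMap(_.split(" "))` creates an instance of `FlatMappedDStream`, another concrete subclass of `DStream`.
 66 | 
 67 | The remaining lines work the same way, so the quick example above can be drawn as the following DStream DAG:
 68 | 
 69 | ![image](1.imgs/010.png)
 70 | 
 71 | That is, if we replace each call in that code with the concrete implementation behind it, we get: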
 72 | 
 73 | ```scala
 74 | val lines = new SocketInputDStream("localhost", 9999)   // type: SocketInputDStream
 75 | 
 76 | val words = new FlatMappedDStream(lines, _.split(" "))  // type: FlatMappedDStream
 77 | val pairs = new MappedDStream(words, word => (word, 1)) // type: MappedDStream
 78 | val wordCounts = new ShuffledDStream(pairs, _ + _)      // type: ShuffledDStream
 79 | new ForEachDStream(wordCounts, cnt => cnt.print())      // type: ForEachDStream
 80 | ```
 81 | 
 82 | 
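 83 | ## DStream manages generated `RDD`s through `generatedRDDs`
 84 | 
 85 | Internally, `DStream` uses a `HashMap` variable named `generatedRDDs` to record the `RDD`s that have already been generated:
 86 | 
 87 | ```scala
 88 | private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] ()
 89 | ```
 90 | 
 91 | The key of `generatedRDDs` is a `Time` aligned to the user-specified `batchDuration`. For example, if a batch is generated every 15s, the keys are times such as `08h:00m:00s`, `08h:00m:15s`, and so on, so a key effectively identifies which batch it is. The value of `generatedRDDs` is the `RDD` instance itself.
 92 | 
 93 | Note that every `DStream` instance has its own `generatedRDDs`. In the figure below, `DStream`s `a`, `b`, `c`, `d` each have their own `generatedRDDs` variable; the figure also sketches the `generatedRDDs` variable of `DStream a`.
 94 | 
 95 | ![image](0.imgs/045.png)
 96 | 
 97 | `DStream` reads and writes this `HashMap` mainly through the `getOrCompute(time: Time)` method. The implementation is simple: look up the table; if the entry exists, return it directly; otherwise generate it, put it into the table, and return it: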
 98 | 
 99 | ```scala
100 | private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
101 |     // look up generatedRDDs first: if an rdd exists, return it; otherwise run the rdd generation steps in orElse below
102 |     generatedRDDs.get(time).orElse {
103 |       // check that time is valid
104 |       if (isTimeValid(time)) {
105 |         // then call compute(time) to obtain the rdd instance and store it in rddOption
106 |         val rddOption = createRDDWithLocalProperties(time) {
107 |           PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
108 |             compute(time)
109 |           }
110 |         }
111 | 
112 |         rddOption.foreach { case newRDD =>
113 |           if (storageLevel != StorageLevel.NONE) {
114 |             newRDD.persist(storageLevel)
115 |             logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
116 |           }
117 |           if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
118 |             newRDD.checkpoint()
119 |             logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
120 |           }
121 |           // put the newly created RDD into generatedRDDs under the key time
122 |           generatedRDDs.put(time, newRDD)
123 |         }
124 |         // return the newly created rddOption
125 |         rddOption
126 |       } else {
127 |         None
128 |       }
129 |     }
130 |   }
131 | ```
132 | 
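133 | The key step is the call to the abstract method `compute(time)`. This method generates the `RDD` instance, which is then placed into `generatedRDDs` for later lookup and use. `compute(time)` is abstract in the `DStream` class, but every concrete subclass provides an implementation.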
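
The lookup-then-compute pattern can be illustrated with a small Spark-free model. Everything here is an assumption made for illustration: `MemoStream` and `CounterInput` are hypothetical names, a batch "RDD" is modeled as a plain `Vector`, batch time is a `Long` in milliseconds, and the time validation, persistence, and checkpointing steps of the real method are left out.

```scala
import scala.collection.mutable

// Toy model of the memoization in getOrCompute (not Spark's class).
abstract class MemoStream[T] {
  private val generated = mutable.HashMap.empty[Long, Vector[T]]
  var computeCalls = 0   // instrumentation for the demo only

  // Subclasses decide how a batch is actually produced.
  protected def compute(time: Long): Option[Vector[T]]

  // Look up first; compute and remember only on a miss.
  final def getOrCompute(time: Long): Option[Vector[T]] =
    generated.get(time).orElse {
      computeCalls += 1
      val batch = compute(time)
      batch.foreach(b => generated.put(time, b))
      batch
    }
}

// Hypothetical input stream that fabricates three numbers per batch.
final class CounterInput extends MemoStream[Long] {
  protected def compute(time: Long): Option[Vector[Long]] =
    Some(Vector(time, time + 1, time + 2))
}

val input  = new CounterInput
val first  = input.getOrCompute(15000L)   // miss: calls compute
val second = input.getOrCompute(15000L)   // hit: served from the map
```

After both calls `computeCalls` is 1, because the second request for the same batch time never reaches `compute`.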
134 | 
135 | ## (a) The `compute(time)` implementation of `InputDStream`
136 | 
137 | `InputDStream` is an abstract class with many subclasses; let us look at one concrete subclass, `FileInputDStream`.
138 | 
139 | ```scala
140 | // from FileInputDStream
141 | override def compute(validTime: Time): Option[RDD[(K, V)]] = {
142 |     // use findNewFiles() to find the new files that have appeared as of validTime
143 |     val newFiles = findNewFiles(validTime.milliseconds)
144 |     logInfo("New files at time " + validTime + ":\n" + newFiles.mkString("\n"))
145 |     batchTimeToSelectedFiles += ((validTime, newFiles))
146 |     recentlySelectedFiles ++= newFiles
147 |     
148 |     // some new files were found; pass the array of new files to filesToRDD() to build a single RDD instance, rdds
149 |     val rdds = Some(filesToRDD(newFiles))
150 | 
151 |     val metadata = Map(
152 |       "files" -> newFiles.toList,
153 |       StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
154 |     val inputInfo = StreamInputInfo(id, 0, metadata)
155 |     ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
156 |     
157 |     // return the single generated RDD instance, rdds
158 |     rdds
159 |   }
160 | ```
161 | 
162 | `filesToRDD()` is implemented as follows:
163 | 
164 | ```scala
165 | // from FileInputDStream
166 | private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
167 |   // for each file, call sc.newAPIHadoopFile(file) to create an RDD
168 |   val fileRDDs = files.map { file =>
169 |     val rdd = serializableConfOpt.map(_.value) match {
170 |       case Some(config) => context.sparkContext.newAPIHadoopFile(
171 |         file,
172 |         fm.runtimeClass.asInstanceOf[Class[F]],
173 |         km.runtimeClass.asInstanceOf[Class[K]],
174 |         vm.runtimeClass.asInstanceOf[Class[V]],
175 |         config)
176 |       case None => context.sparkContext.newAPIHadoopFile[K, V, F](file)
177 |     }
178 |     if (rdd.partitions.size == 0) {
179 |       logError("File " + file + " has no data in it. Spark Streaming can only ingest " +
180 |         "files that have been \"moved\" to the directory assigned to the file stream. " +
181 |         "Refer to the streaming programming guide for more details.")
182 |     }
183 |     rdd
184 |   }
185 |   // union the RDDs of all files and return the resulting UnionRDD
186 |   new UnionRDD(context.sparkContext, fileRDDs)
187 | }
188 | ```
189 | 
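190 | Putting the `compute(validTime: Time)` and `filesToRDD(files: Seq[String])` methods together, `FileInputDStream` generates the RDD instance for each batch as follows:
191 | 
192 | - (1) First, use the findNewFiles() method to find the new files that have appeared as of validTime
193 | - (2) For each new file, call sc.newAPIHadoopFile(file) with it as the argument to create an RDD instance
194 | - (3) Union the RDD instances from (2), one per new file, and return the resulting UnionRDD
195 | 
196 | Other `InputDStream`s generate the `RDD` instance for each batch in a similar way.
197 | 
198 | ## (b) The `compute(time)` implementation of ordinary `DStream`s
199 | 
200 | The `InputDStream`s of the previous section have no upstream `DStream` to depend on, so they can produce an `RDD` instance for each batch directly. Ordinary `DStream`s are created by *transformations* and all depend on upstream `DStream`s, so to produce the `RDD` instance for a batch, their `compute(time)` method must first obtain the `RDD` instances produced by the upstream `DStream`s.
201 | 
202 | Specifically, let us look at the implementations of two concrete `DStream`s, `MappedDStream` and `FilteredDStream`:
203 | 
204 | ### The `compute(time)` implementation of `MappedDStream`
205 | 
206 | `MappedDStream` is very simple; the whole class is: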
207 | 
208 | ```scala
209 | package org.apache.spark.streaming.dstream
210 | 
211 | import org.apache.spark.streaming.{Duration, Time}
212 | import org.apache.spark.rdd.RDD
213 | import scala.reflect.ClassTag
214 | 
215 | private[streaming]
216 | class MappedDStream[T: ClassTag, U: ClassTag] (
217 |     parent: DStream[T],
218 |     mapFunc: T => U
219 |   ) extends DStream[U](parent.ssc) {
220 | 
221 |   override def dependencies: List[DStream[_]] = List(parent)
222 | 
223 |   override def slideDuration: Duration = parent.slideDuration
224 | 
225 |   override def compute(validTime: Time): Option[RDD[U]] = {
226 |     parent.getOrCompute(validTime).map(_.map[U](mapFunc))
227 |   }
228 | }
229 | ```
230 | 
231 | As you can see, the constructor receives two important things:
232 | 
233 | - parent: the upstream `DStream` that this `MappedDStream` depends on
234 | - mapFunc: the function applied by this map() transformation
235 | 	- in the quick example from the previous article, [DStream and DStreamGraph in Detail](1.1 DStream, DStreamGraph 详解.md), `val pairs = words.map(word => (word, 1))` has `mapFunc` equal to `word => (word, 1)`
236 | 	
237 | So the implementation of `compute(time)` is straightforward:
238 | 
239 | - (1) Get the `RDD` instance of the parent `DStream` for this batch
240 | - (2) Call `.map(mapFunc)` on that parent `RDD` instance and return the resulting new `RDD` instance
241 | 	- this is exactly equivalent to writing `return parentRDD.map(mapFunc)` with the RDD API
242 | 	
243 | ### The `compute(time)` implementation of `FilteredDStream`
244 | 
245 | Now the full implementation of `FilteredDStream`:
246 | 
247 | ```scala
248 | package org.apache.spark.streaming.dstream
249 | 
250 | import org.apache.spark.streaming.{Duration, Time}
251 | import org.apache.spark.rdd.RDD
252 | import scala.reflect.ClassTag
253 | 
254 | private[streaming]
255 | class FilteredDStream[T: ClassTag](
256 |     parent: DStream[T],
257 |     filterFunc: T => Boolean
258 |   ) extends DStream[T](parent.ssc) {
259 | 
260 |   override def dependencies: List[DStream[_]] = List(parent)
261 | 
262 |   override def slideDuration: Duration = parent.slideDuration
263 | 
264 |   override def compute(validTime: Time): Option[RDD[T]] = {
265 |     parent.getOrCompute(validTime).map(_.filter(filterFunc))
266 |   }
267 | }
268 | ```
269 | 
270 | Like `MappedDStream`, `FilteredDStream` receives two important things in its constructor:
271 | 
272 | - parent: the upstream `DStream` that this `FilteredDStream` depends on
273 | - filterFunc: the function applied by this filter() transformation
274 | 	
275 | So the implementation of `compute(time)` is straightforward:
276 | 
277 | - (1) Get the `RDD` instance of the parent `DStream` for this batch
278 | - (2) Call `.filter(filterFunc)` on that parent `RDD` instance and return the resulting new `RDD` instance
279 | 	- this is exactly equivalent to writing `return parentRDD.filter(filterFunc)` with the RDD API
280 | 	
281 | ### Summary of `compute(time)` in ordinary `DStream`s
282 | 
283 | Summarizing the implementations of `MappedDStream` and `FilteredDStream` above:
284 | 
285 | - The `.map()` operation on a `DStream` creates a `MappedDStream`, and when a `MappedDStream` generates the `RDD` instance for a batch, it calls the `RDD`'s `.map()` on `parentRDD`: **`DStream.map()` is faithfully replicated as `RDD.map()` in every batch**
286 | - The `.filter()` operation on a `DStream` creates a `FilteredDStream`, and when a `FilteredDStream` generates the `RDD` instance for a batch, it calls the `RDD`'s `.filter()` on `parentRDD`: **`DStream.filter()` is faithfully replicated as `RDD.filter()` in every batch**
287 | 
288 | Because the *transformation* API of `DStream` was designed from the outset to mirror the *transformation* API of `RDD`, every new `dStreamB` obtained from `dStreamA`.*transformation*() can faithfully replicate the `dStreamA.`*transformation()* operation as an `rddA.`*transformation()* operation in each batch.
289 | 
290 | **This is the fundamental reason why a `DStream` can serve as an `RDD` template and instantiate `RDD`s in every batch.**
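
To make the "template" idea concrete, here is a minimal Spark-free sketch. The names `TStream`, `TMapped`, `TFiltered`, and `TSource` are hypothetical, a batch "RDD" is modeled as a `Vector`, and memoization is omitted to keep the focus on delegation. The chain is defined once, and every batch time replays the same `filter` and `map` on that batch's data.

```scala
// Toy model: a stream is a recipe that yields one Vector per batch time.
abstract class TStream[T] {
  def getOrCompute(time: Long): Option[Vector[T]]
  def map[U](f: T => U): TStream[U] = new TMapped(this, f)
  def filter(p: T => Boolean): TStream[T] = new TFiltered(this, p)
}

// Mirrors MappedDStream.compute: fetch the parent's batch, then map it.
final class TMapped[T, U](parent: TStream[T], f: T => U) extends TStream[U] {
  def getOrCompute(time: Long): Option[Vector[U]] =
    parent.getOrCompute(time).map(_.map(f))
}

// Mirrors FilteredDStream.compute: fetch the parent's batch, then filter it.
final class TFiltered[T](parent: TStream[T], p: T => Boolean) extends TStream[T] {
  def getOrCompute(time: Long): Option[Vector[T]] =
    parent.getOrCompute(time).map(_.filter(p))
}

// Hypothetical source whose batches are fixed up front.
final class TSource(data: Map[Long, Vector[String]]) extends TStream[String] {
  def getOrCompute(time: Long): Option[Vector[String]] = data.get(time)
}

val source   = new TSource(Map(0L -> Vector("a", "bb", "ccc"), 15L -> Vector("dddd")))
val template = source.filter(_.length > 1).map(w => (w, w.length))  // defined once

val batch0  = template.getOrCompute(0L)    // the template applied to batch 0
val batch15 = template.getOrCompute(15L)   // the same template applied to batch 15
val missing = template.getOrCompute(30L)   // no input for this time, so no batch
```

Nothing is computed when `template` is built; the work happens per batch inside `getOrCompute`, which is the same division of labor the text describes between `DStream` and `RDD`.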
291 | 
292 | ## (c) The `compute(time)` implementation of `ForEachDStream`
293 | 
294 | Having seen how a `DStream` *transformation* is replicated as an `RDD` *transformation* in `compute(time)`, we now look at how a `DStream` *output* is turned into an `RDD` *action*.
295 | 
296 | As mentioned earlier, performing an *output* operation on a `DStream` creates a new `ForEachDStream`, which uses its `foreachFunc` member to record what the *output* does.
297 | 
298 | The full implementation of `ForEachDStream` is:
299 | 
300 | ```scala
301 | package org.apache.spark.streaming.dstream
302 | 
303 | import org.apache.spark.rdd.RDD
304 | import org.apache.spark.streaming.{Duration, Time}
305 | import org.apache.spark.streaming.scheduler.Job
306 | import scala.reflect.ClassTag
307 | 
308 | private[streaming]
309 | class ForEachDStream[T: ClassTag] (
310 |     parent: DStream[T],
311 |     foreachFunc: (RDD[T], Time) => Unit
312 |   ) extends DStream[Unit](parent.ssc) {
313 | 
314 |   override def dependencies: List[DStream[_]] = List(parent)
315 | 
316 |   override def slideDuration: Duration = parent.slideDuration
317 | 
318 |   override def compute(validTime: Time): Option[RDD[Unit]] = None
319 | 
320 |   override def generateJob(time: Time): Option[Job] = {
321 |     parent.getOrCompute(time) match {
322 |       case Some(rdd) =>
323 |         val jobFunc = () => createRDDWithLocalProperties(time) {
324 |           ssc.sparkContext.setCallSite(creationSite)
325 |           foreachFunc(rdd, time)
326 |         }
327 |         Some(new Job(time, jobFunc))
328 |       case None => None
329 |     }
330 |   }
331 | }
332 | ```
333 | 
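334 | As before, `ForEachDStream` receives two important things in its constructor:
335 | 
336 | - parent: the upstream `DStream` that this `ForEachDStream` depends on
337 | - foreachFunc: the function that performs this *output*
338 | 	
339 | Note that `compute(time)` simply returns `None` here; the real work is done in `generateJob(time)`, which is straightforward:
340 | 
341 | - (1) Get the `RDD` instance of the parent `DStream` for this batch
342 | - (2) Build a `Job` whose function, when run, calls `foreachFunc(parentRDD, time)` with this parent `RDD` and the time of the batch
343 | 
344 | For example, here is the `foreachFunc(rdd, time)` used by `DStream.print()`: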
345 | 
346 | ```scala
347 | def foreachFunc: (RDD[T], Time) => Unit = (rdd: RDD[T], time: Time) => {
348 |   val firstNum = rdd.take(num + 1)
349 |   println("-------------------------------------------")
350 |   println("Time: " + time)
351 |   println("-------------------------------------------")
352 |   firstNum.take(num).foreach(println)
353 |   if (firstNum.length > num) println("...")
354 |   println()
355 | }
356 | ```
357 | 
358 | So when this `foreachFunc` is invoked on `rdd`, every batch runs `.take()` on the `rdd` to fetch some elements to the driver and then runs `.foreach(println)` over them, which prints part of the `DStream`'s content on the driver.
359 | 
360 | ## DStreamGraph generates RDD DAG instances
361 | 
362 | In the earlier article [Spark Streaming Design and Module Overview](0.1 Spark Streaming 实现思路与模块概述.md), we explained that for every batch, `JobGenerator` asks the `RDD` DAG "template" to create an `RDD` DAG instance, which is step (2) in the figure below.
363 | 
364 | ![image](0.imgs/055.png)
365 | 
366 | Specifically, `JobGenerator` calls the `generateJobs(time)` method of `DStreamGraph`.
367 | 
368 | Here is the implementation of `generateJobs()`:
369 | 
370 | ```scala
371 | // from DStreamGraph
372 | def generateJobs(time: Time): Seq[Job] = {
373 |   logDebug("Generating jobs for time " + time)
374 |   val jobs = this.synchronized {
375 |     outputStreams.flatMap(outputStream => outputStream.generateJob(time))
376 |   }
377 |   logDebug("Generated " + jobs.length + " jobs for time " + time)
378 |   jobs
379 | }
380 | ```
381 | In other words, `DStreamGraph` in turn calls `generateJob(time)` on every `outputStream`. Since only `ForEachDStream`s are output streams, this calls `ForEachDStream.generateJob(time)`.
382 | 
383 | ![image](1.imgs/035.png)
384 | 
385 | For example, in the figure above the two print() calls in our code produced two `ForEachDStream` nodes, `x` and `y`, so `DStreamGraph.generateJobs(time)` calls `x.generateJob(time)` and then `y.generateJob(time)`, obtaining one Job from each.
386 | 
387 | But what exactly is the Job returned by `x.generateJob(time)` and `y.generateJob(time)`? Let us take a brief detour to look at `Job`.
388 | 
389 | ### The Job of Spark Streaming
390 | 
391 | Spark Streaming defines its own `Job` class, which works much like Java's `Runnable`: a `Job` carries a user-defined function `func()`, and the `Job`'s `.run()` method simply executes that `func()`.
392 | 
393 | ```scala
394 | // excerpt from org.apache.spark.streaming.scheduler.Job
395 | private[streaming]
396 | class Job(val time: Time, func: () => _) {
397 |   ...
398 | 
399 |   def run() {
400 |     _result = Try(func())
401 |   }
402 | 
403 |   ...
404 | }
405 | ```
406 | 
407 | So in essence, a `Job` separates the definition of `func()` from its invocation, just as `Runnable` separates the definition of `run()` from its invocation.
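
The separation between defining and invoking can be seen in a few lines of plain Scala. `ToyJob` is a hypothetical stand-in written for this illustration, not the real `Job` class; it keeps only the idea of storing a function and wrapping its outcome in a `Try` when `run()` is called.

```scala
import scala.util.{Success, Try}

// Toy version of a Job: it stores a function and runs it only when asked.
final class ToyJob(val time: Long, func: () => Any) {
  private var _result: Option[Try[Any]] = None
  def run(): Unit = { _result = Some(Try(func())) }
  def result: Option[Try[Any]] = _result
}

var sideEffects = 0
val job = new ToyJob(15000L, () => { sideEffects += 1; "done" })

val before = sideEffects                        // 0: defining the job ran nothing
val after  = { job.run(); sideEffects }         // 1: run() executed the stored function
val outcome = job.result                        // the wrapped outcome of func()
```

Because the outcome is captured in a `Try`, a function that throws does not propagate the exception out of `run()`; the failure is simply recorded as the result.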
408 | 
409 | Now let us return to the implementations of `x.generateJob(time)` and `y.generateJob(time)`.
410 | 
411 | ### The `x.generateJob(time)` process
412 | 
413 | `x` is a `ForEachDStream`, and its `generateJob(time)` is implemented as follows:
414 | 
415 | ```scala
416 | // from ForEachDStream
417 | override def generateJob(time: Time): Option[Job] = {
418 |   // [First, call getOrCompute() on the parent DStream to obtain parentRDD]
419 |   parent.getOrCompute(time) match {
420 |     case Some(rdd) =>
421 |       // [Then define jobFunc as running foreachFunc() on parentRDD]
422 |       val jobFunc = () => createRDDWithLocalProperties(time) {
423 |         ssc.sparkContext.setCallSite(creationSite)
424 |         foreachFunc(rdd, time)
425 |       }
426 |       // [Finally, wrap jobFunc in a Job and return it]
427 |       Some(new Job(time, jobFunc))
428 |     case None => None
429 |   }
430 | }
431 | ```
432 | 
433 | This is where `x`'s `parentDStream.getOrCompute(time)` comes in, that is, `d.getOrCompute(time)`; `d.getOrCompute(time)` in turn involves `c.getOrCompute(time)`, and further `a.getOrCompute(time)` and `b.getOrCompute(time)`.
434 | 
435 | A sequence diagram makes this call chain much clearer:
436 | 
437 | ![image](1.imgs/050.png)
438 | 
439 | So in the end, the recursive calls triggered by `x.generateJob(time)` produce a Job whose `func` is shown below:
440 | 
441 | ![image](1.imgs/051.png)
442 | 
443 | ### The `y.generateJob(time)` process
444 | 
445 | Node `y` generates its Job in much the same way as node `x`. The only difference is that `b.getOrCompute(time)` hits `get(time)` and does not need to trigger `compute(time)`, because that `RDD` instance was already created while generating the Job for node `x`, so it only needs to be fetched.
446 | 
447 | Likewise, the recursive calls triggered by `y.generateJob(time)` produce a Job whose `func` is shown below:
448 | 
449 | ![image](1.imgs/052.png)
450 | 
451 | ### Returning Seq[Job]
452 | 
453 | So when `DStreamGraph.generateJobs(time)` finishes, it returns multiple `Job`s, because every `ForEachDStream` acting as an `output stream` contributes one `Job` through its `generateJob(time)` method.
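
The whole walk, including the cache hit on `b`, can be replayed with a small Spark-free model. All names here (`GJob`, `GStream`, `GOutput`, `GGraph`) are hypothetical, batches are plain `Vector[Int]` values with made-up input data, and the graph shape for `e` and `f` is an assumption consistent with the upstream set {`b`,`e`,`f`} given earlier.

```scala
import scala.collection.mutable

final class GJob(val time: Long, func: () => Unit) { def run(): Unit = func() }

// Toy stream: one Vector[Int] per batch time, memoized as in getOrCompute.
final class GStream(val name: String, parents: List[GStream], log: mutable.Buffer[String]) {
  private val generated = mutable.HashMap.empty[Long, Vector[Int]]

  private def compute(time: Long): Option[Vector[Int]] = {
    log += s"compute $name"
    if (parents.isEmpty) Some(Vector(1, 2, 3))   // fake input data
    else Some(parents.flatMap(_.getOrCompute(time)).flatten.toVector)
  }

  def getOrCompute(time: Long): Option[Vector[Int]] =
    generated.get(time).orElse {
      val batch = compute(time)
      batch.foreach(b => generated.put(time, b))
      batch
    }
}

// Toy output stream: generateJob wraps the parent's batch in a deferred job.
final class GOutput(val name: String, parent: GStream, out: mutable.Buffer[String]) {
  def generateJob(time: Long): Option[GJob] =
    parent.getOrCompute(time).map { batch =>
      new GJob(time, () => { out += s"$name@$time: sum=${batch.sum}"; () })
    }
}

// Toy graph: one job per output stream, in the spirit of generateJobs.
final class GGraph(outputStreams: List[GOutput]) {
  def generateJobs(time: Long): Seq[GJob] = outputStreams.flatMap(_.generateJob(time))
}

val log = mutable.ArrayBuffer.empty[String]
val out = mutable.ArrayBuffer.empty[String]
val a = new GStream("a", Nil, log)
val b = new GStream("b", Nil, log)
val c = new GStream("c", List(a, b), log)
val d = new GStream("d", List(c), log)
val e = new GStream("e", List(b), log)   // assumed parent
val f = new GStream("f", List(e), log)
val graph = new GGraph(List(new GOutput("x", d, out), new GOutput("y", f, out)))

val jobs = graph.generateJobs(15000L)    // builds two jobs; nothing is printed yet
```

After `generateJobs` returns, `log` contains `compute b` exactly once even though both `x` and `y` depend on `b`, and `out` stays empty until someone calls `run()` on the returned jobs.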
454 | 
455 | ![image](1.imgs/035.png)
456 | 
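457 | For example, in the figure above `DStreamGraph.generateJobs(time)` returns a sequence of `Job`s of size `2`, with the following contents:
458 | 
459 | ![image](1.imgs/053.png)
460 | 
461 | At this point, the work of `DStreamGraph.generateJobs(time)` for the given batch is complete. The `Seq[Job]` is returned to `JobGenerator`, which promptly submits it to `JobScheduler`; `JobScheduler` then calls `Job.run()` as soon as possible so that these `2` `RDD` DAGs start running.
462 | 
463 | Moreover, `DStreamGraph.generateJobs(time)` is called for every new batch, triggering the `Job` generation process discussed above again and again.
464 | 
465 | This completes our analysis of how `DStream`, as the `RDD` "template", instantiates an `RDD` for every batch, and how `DStreamGraph`, as the `RDD` DAG "template", instantiates an `RDD` DAG for every batch.
466 | 
467 | <br/>
468 | <br/>
469 | 
470 | (End of article. To join the discussion of this article, [click here](https://github.com/proflin/CoolplaySpark/issues/3); to return to the table of contents, [click here](readme.md).)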
471 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/005.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/005.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/010.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/010.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/020.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/025.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/025.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/030.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/030.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/035.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/035.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/040.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/040.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/050.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/050.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/051.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/051.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/052.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/052.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/1.imgs/053.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/1.imgs/053.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/2.1 JobScheduler, Job, JobSet 详解.md:
--------------------------------------------------------------------------------
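  1 | # JobScheduler, Job, and JobSet in Detail #
  2 | 
  3 | ***[CoolplaySpark] Spark Streaming Source Code Analysis Series***; to return to the table of contents, [click here](readme.md)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 13 | ```
 14 | <br/>
 15 | <br/>
 16 | 
 17 | Before reading this article, please first read [Spark Streaming Design and Module Overview](0.1 Spark Streaming 实现思路与模块概述.md). It outlines what each of Spark Streaming's 4 major modules does; with that overall picture in mind, this article's explanation of the details of `Module 2: dynamic Job generation` is easier to follow.
 18 | 
 19 | ## Introduction
 20 | 
 21 | In [Spark Streaming Design and Module Overview](0.1 Spark Streaming 实现思路与模块概述.md) and [How DStream Generates RDD Instances](1.2 DStream 生成 RDD 实例详解.md) we saw that `DStream` and `DStreamGraph` are able to instantiate `RDD`s and `RDD` DAGs. Now let us see how Spark Streaming schedules them dynamically.
 22 | 
 23 | At the entry point of a Spark Streaming program we always define a `batchDuration`, the interval at which an RDD DAG instance is generated from the static `DStreamGraph`. In Spark Streaming, the class with overall responsibility for dynamic job scheduling is `JobScheduler`; when the program starts running via `ssc.start()`, the `JobScheduler` instance is started with start().
 24 | 
 25 | ```scala
 26 | // from StreamingContext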
 27 | def start(): Unit = synchronized {
 28 |   ...
 29 |   ThreadUtils.runInNewThread("streaming-start") {
 30 |     sparkContext.setCallSite(startSite.get)
 31 |     sparkContext.clearJobGroup()
 32 |     sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
 33 |     scheduler.start()  // [this calls JobScheduler.start()]
 34 |   }
 35 |   state = StreamingContextState.ACTIVE
 36 |   ...
 37 | }
 38 | ```
 39 | 
 40 | ## JobScheduler: the Overall Job Scheduler of Spark Streaming
 41 | 
 42 | **`JobScheduler` is the overall job scheduler of Spark Streaming**.
 43 | 
 44 | `JobScheduler` has two very important members: `JobGenerator` and `ReceiverTracker`. `JobScheduler` delegates the generation of each batch's RDD DAG to `JobGenerator`, and delegates the bookkeeping of source input data to `ReceiverTracker`.
 45 | 
 46 | ![image](0.imgs/050.png)
 47 | 
 48 |     Fully qualified name of JobScheduler:    org.apache.spark.streaming.scheduler.JobScheduler
 49 |     Fully qualified name of JobGenerator:    org.apache.spark.streaming.scheduler.JobGenerator
 50 |     Fully qualified name of ReceiverTracker: org.apache.spark.streaming.scheduler.ReceiverTracker
 51 | 
 52 | **`JobGenerator` maintains a timer** whose period is the `batchDuration` mentioned above, and **on each tick it generates the RDD DAG instance for one batch**.
 53 | Specifically, as analyzed in [How DStream Generates RDD Instances](1.2 DStream 生成 RDD 实例详解.md), `DStreamGraph.generateJobs(time)` returns a `Seq[Job]`, in which each `Job` is the result of calling `generateJob(time)` on one `ForEachDStream` instance.
 54 | 
 55 | ![image](0.imgs/055.png)
 56 | 
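 57 | Once `JobGenerator` has obtained the `Seq[Job]` (`(2)` in the figure above), it wraps it into a `JobSet` (`(3)` in the figure) and then calls `JobScheduler.submitJobSet(jobSet)` to hand it back to `JobScheduler` (`(4)` in the figure).
 58 | 
 59 | So how exactly does `JobScheduler` handle the `jobSet` it receives? Let us look at the implementation: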
 60 | ```scala
 61 | // From JobScheduler.submitJobSet(jobSet: JobSet)
 62 | if (jobSet.jobs.isEmpty) {
 63 |   logInfo("No jobs added for time " + jobSet.time)
 64 | } else {
 65 |   listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
 66 |   jobSets.put(jobSet.time, jobSet)
 67 |   // [The next line is the core logic: each job is run in the jobExecutor thread pool, wrapped in a new JobHandler]
 68 |   jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
 69 |   logInfo("Added jobs for time " + jobSet.time)
 70 | }
 71 | ```
 72 | 
 73 | The key logic here is `job => jobExecutor.execute(new JobHandler(job))`: every job is handed to the `jobExecutor` thread pool, where it is processed by a new `JobHandler`.
 74 | 
 75 | ### JobHandler
 76 | 
 77 | Let us first look at what `JobHandler` mainly does with a `Job`:
 78 | ```scala
 79 | // From JobHandler
 80 | def run()
 81 | {
 82 |   ...
 83 |   // [post the JobStarted message]
 84 |   _eventLoop.post(JobStarted(job))
 85 |   PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
 86 |     // [core logic: simply call job.run()]
 87 |     job.run()
 88 |   }
 89 |   _eventLoop = eventLoop
 90 |   if (_eventLoop != null) {
 91 |     // [post the JobCompleted message]
 92 |     _eventLoop.post(JobCompleted(job))
 93 |   }
 94 |   ...
 95 | }
 96 | ```
 97 | 
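 98 | In other words, apart from some state bookkeeping, the main thing `JobHandler` does is call `job.run()`! This matches what we analyzed in [How DStream Generates RDD Instances](1.2 DStream 生成 RDD 实例详解.md):
 99 | `ForEachDStream.generateJob(time)` only defines the logic a `Job` will run, that is, it defines `Job.func`. It is here in `JobHandler` that `Job.run()` is really called, which triggers the actual execution of `Job.func`!
100 | 
101 | ### jobExecutor: the Thread Pool That Runs Jobs
102 | 
103 | `JobHandler` above answers the question of what to do; `jobExecutor`, covered in this section, answers where a `Job` gets done.
104 | 
105 | Specifically, `jobExecutor` is a member of `JobScheduler`: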
106 | 
107 | ```scala
108 | // From JobScheduler
109 | private[streaming]
110 | class JobScheduler(val ssc: StreamingContext) extends Logging {
111 |   ...
112 |   private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
113 |   private val jobExecutor =
114 |       ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
115 |   ...
116 | }
117 | ```
118 | 
119 | That is, the `ThreadUtils.newDaemonFixedThreadPool()` call creates a thread pool named `"streaming-job-executor"`, and it is on the threads of this pool that each `Job`'s `func` is actually executed.
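
The split between defining a job's `func` and running it later on a pool thread can be sketched without Spark. The following is a minimal sketch under stated assumptions: `ToyJob`, `ToyJobHandler`, and `JobSketch.runJobSet` are hypothetical names introduced only for illustration, and a plain `java.util.concurrent` fixed thread pool stands in for `jobExecutor`. It is not the Spark implementation.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Toy stand-in for the streaming Job: it only wraps a function to run later.
final class ToyJob(val time: Long, func: () => Unit) {
  def run(): Unit = func()
}

// Toy stand-in for JobHandler: a Runnable that runs exactly one job.
final class ToyJobHandler(job: ToyJob, completed: AtomicInteger) extends Runnable {
  override def run(): Unit = {
    job.run()                   // the deferred func executes only here
    completed.incrementAndGet() // stands in for posting a "job completed" event
    ()
  }
}

object JobSketch {
  // Toy stand-in for submitting a job set: every job goes to a fixed-size pool.
  // Returns how many jobs finished.
  def runJobSet(jobs: Seq[ToyJob], poolSize: Int): Int = {
    val completed = new AtomicInteger(0)
    val pool = Executors.newFixedThreadPool(poolSize)
    jobs.foreach(job => pool.execute(new ToyJobHandler(job, completed)))
    pool.shutdown()                             // accept no new work, finish queued jobs
    pool.awaitTermination(10, TimeUnit.SECONDS) // wait for the queued jobs to finish
    completed.get()
  }
}
```

Constructing a `ToyJob` runs nothing; the wrapped function executes only when a pool thread calls `run()`, which mirrors the relationship between `generateJob(time)` and `JobHandler` described above.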
120 | 
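121 | ## The spark.streaming.concurrentJobs Parameter
122 | 
123 | The size of the `jobExecutor` thread pool is controlled by the `spark.streaming.concurrentJobs` parameter; when it is not set explicitly, the value defaults to `1`.
124 | 
125 | Put differently, the size of the `jobExecutor` thread pool is the number of `Job`s that can run in parallel. Recall the `DStreamGraph.generateJobs(time)` process described earlier: each batch produces one `Seq[Job]`, which may contain several `Job`s. To be precise, **for every *output* operation there is one call to `ForEachDStream.generateJob(time)`, which produces one `Job`**.
126 | 
127 | To verify this, we run a small test: set `spark.streaming.concurrentJobs = 10`, and perform `2` *output* operations of the `foreachRDD()` kind in every batch: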
128 | 
129 | ```scala
130 | // The complete code is in the appendix at the end of this article
131 | val BLOCK_INTERVAL = 1 // in seconds
132 | val BATCH_INTERVAL = 5 // in seconds
133 | val CURRENT_JOBS = 10
134 | ...
135 | 
136 | // Start of the DStream DAG definition
137 | val inputStream = ssc.receiverStream(...)
138 | inputStream.foreachRDD(_ => Thread.sleep(Int.MaxValue)) // output 1
139 | inputStream.foreachRDD(_ => Thread.sleep(Int.MaxValue)) // output 2
140 | // End of the DStream DAG definition
141 | ...
142 | ```
143 | 
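144 | With this setup it is easy to see that `10 / 2 = 5` batches can be in processing at the same time; the `Job`s of all other batches can only wait.
145 | 
146 | Below is the result of running the test code, which confirms the analysis and calculation above: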
147 | 
148 | ![image](2.imgs/020.png)
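
The `10 / 2 = 5` arithmetic can also be reproduced without Spark. The following is a minimal sketch, assuming a plain `java.util.concurrent` fixed thread pool in place of `jobExecutor`; `ConcurrencySketch` and `batchesInFlight` are hypothetical names introduced here, and each toy job simply blocks until released, much like the `Thread.sleep(Int.MaxValue)` jobs in the test.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import scala.collection.mutable

object ConcurrencySketch {
  // Submits `outputsPerBatch` blocking jobs for each batch, in batch order, and
  // returns the batches that have a running job once the pool is saturated.
  def batchesInFlight(concurrentJobs: Int, outputsPerBatch: Int, batches: Int): Set[Int] = {
    val pool = Executors.newFixedThreadPool(concurrentJobs)
    val started = mutable.Set.empty[Int]
    val totalJobs = batches * outputsPerBatch
    val saturated = new CountDownLatch(math.min(concurrentJobs, totalJobs))
    val release = new CountDownLatch(1)

    for (batch <- 0 until batches; _ <- 0 until outputsPerBatch) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          started.synchronized { started += batch } // record that this batch is running
          saturated.countDown()
          release.await()                           // keep the pool thread occupied
        }
      })
    }

    saturated.await(10, TimeUnit.SECONDS)           // every pool thread is now busy
    val snapshot = started.synchronized { started.toSet }
    release.countDown()                             // let all jobs finish
    pool.shutdown()
    snapshot
  }
}
```

With `concurrentJobs = 10` and 2 outputs per batch, the snapshot contains batches 0 to 4; later batches are queued behind them.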
149 | 
150 | ## Spark Streaming's JobSet and Job vs. Spark Core's Job, Stage, TaskSet, and Task
151 | 
152 | Finally, we devote a short section to telling apart Spark Streaming's JobSet and Job from Spark Core's Job, Stage, TaskSet, and Task.
153 | 
154 |     [Spark Streaming]
155 |     Fully qualified name of JobSet:  org.apache.spark.streaming.scheduler.JobSet
156 |     Fully qualified name of Job:     org.apache.spark.streaming.scheduler.Job
157 |     
158 |     [Spark Core]
159 |     Job has no corresponding entity class; a specific job is mainly identified by jobId: Int
160 |     Fully qualified name of Stage:   org.apache.spark.scheduler.Stage
161 |     Fully qualified name of TaskSet: org.apache.spark.scheduler.TaskSet
162 |     Fully qualified name of Task:    org.apache.spark.scheduler.Task
163 | 
164 | Spark Core's Job, Stage, and Task carry the meanings we use in everyday talk about Spark jobs, and the Spark web UI shows them very clearly. The figure below, for example, shows 1 `Job` containing 3 `Stage`s, which contain 8, 2, and 4 `Task`s respectively. `TaskSet` is a class used inside Spark Core: it is the set of `Task`s submitted for one `Stage`, so the two correspond closely.
165 | 
166 | ![image](2.imgs/021.png)
167 | 
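168 | Spark Streaming also has a `Job`, but it is not the same thing. A Spark Streaming `Job` is more like a `Runnable` in Java: it can `run()` a user-defined `func`. This `func` can:
169 | - call an `RDD` *action* directly, thereby producing 1 or more Spark Core `Job`s
170 | - print a header line, then fetch the first few elements with `firstTen = RDD.take(...)` and print the contents of `firstTen`, and finally print a footer line; this is how the `Job` of `DStream.print()` is implemented
171 | - be any user-defined code at all, so that the entire Spark Streaming run may produce no Spark Core `Job` whatsoever. In the test code of the previous section, the `func` of each `Job` is just `Thread.sleep(Int.MaxValue)`, whose only purpose is to keep the `Job` occupying a thread of the `jobExecutor` pool so that we can test the pool's parallelism :)
172 | 
173 | Finally, a Spark Streaming `JobSet` is simply a collection of `Job`s.
174 | 
175 | If we arrange these concepts into levels (each level mostly has a one-to-many relationship with the level below it, though this is not entirely exact), we get the following table: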
176 | 
177 | <table>
178 |     <tr>
179 |         <td></td>
180 |         <td>Spark Core</td>
181 |         <td>Spark Streaming</td>
182 |     </tr>
183 |     <tr>
184 |         <td>lv 5</td>
185 |         <td>RDD DAGs</td>
186 |         <td>DStreamGraph</td>
187 |     </tr>
188 |     <tr>
189 |         <td>lv 4</td>
190 |         <td>RDD DAG</td>
191 |         <td>JobSet</td>
192 |     </tr>
193 |     <tr>
194 |         <td>lv 3</td>
195 |         <td>Job</td>
196 |         <td>Job</td>
197 |     </tr>
198 |     <tr>
199 |         <td>lv 2</td>
200 |         <td>Stage</td>
201 |         <td>←</td>
202 |     </tr>
203 |     <tr>
204 |         <td>lv 1</td>
205 |         <td>Task</td>
206 |         <td>←</td>
207 |     </tr>
208 | </table>
209 | 
210 | ## Appendix
211 | 
212 | ```scala
213 | import java.util.concurrent.{Executors, TimeUnit}
214 | 
215 | import org.apache.spark.storage.StorageLevel
216 | import org.apache.spark.streaming.receiver.Receiver
217 | import org.apache.spark.streaming.{Seconds, StreamingContext}
218 | import org.apache.spark.SparkConf
219 | 
220 | object ConcurrentJobsDemo {
221 | 
222 |   def main(args: Array[String]): Unit = {
223 | 
224 |     // Same settings as in the excerpt shown earlier in this article
225 |     val BLOCK_INTERVAL = 1 // in seconds
226 |     val BATCH_INTERVAL = 5 // in seconds
227 |     val CURRENT_JOBS = 10
228 | 
229 |     val conf = new SparkConf()
230 |     conf.setAppName(this.getClass.getSimpleName)
231 |     conf.setMaster("local[2]")
232 |     conf.set("spark.streaming.blockInterval", s"${BLOCK_INTERVAL}s")
233 |     conf.set("spark.streaming.concurrentJobs", s"${CURRENT_JOBS}")
234 |     val ssc = new StreamingContext(conf, Seconds(BATCH_INTERVAL))
235 | 
236 |     // Start of the DStream DAG definition
237 |     val inputStream = ssc.receiverStream(new MyReceiver)
238 |     inputStream.foreachRDD(_ => Thread.sleep(Int.MaxValue)) // output 1
239 |     inputStream.foreachRDD(_ => Thread.sleep(Int.MaxValue)) // output 2
240 |     // End of the DStream DAG definition
241 | 
242 |     ssc.start()
243 |     ssc.awaitTermination()
244 |   }
245 | 
246 |   class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
247 | 
248 |     override def onStart(): Unit = {
249 |       // invoke store("str") every 100ms
250 |       Executors.newScheduledThreadPool(1).scheduleAtFixedRate(new Runnable {
251 |         override def run(): Unit = store("str")
252 |       }, 0, 100, TimeUnit.MILLISECONDS)
253 |     }
254 | 
255 |     override def onStop(): Unit = {}
256 |   }
257 | 
258 | }
259 | ```
260 | 
261 | <br/>
262 | <br/>
263 | 
264 | (End of article. To join the discussion of this article, [click here](https://github.com/proflin/CoolplaySpark/issues/4); to return to the table of contents, [click here](readme.md).)
265 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/2.2 JobGenerator 详解.md:
--------------------------------------------------------------------------------
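  1 | # JobGenerator in Detail #
  2 | 
  3 | ***[CoolplaySpark] Spark Streaming Source Code Walkthrough Series***. To return to the table of contents, [click here](readme.md).
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team).
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released so far: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released so far: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released so far: 2.2.0 ~ 2.2.3)
 13 | ```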
 14 | <br/>
 15 | <br/>
 16 | 
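 17 | Before reading this article, please be sure to read [Overview of Spark Streaming's Design and Modules](0.1 Spark Streaming 实现思路与模块概述.md) first. It outlines what each of Spark Streaming's 4 major modules does; with that overall picture in mind, come back to this article for the details of `Module 2: Dynamic Job Generation`.
 18 | 
 19 | ## Introduction
 20 | 
 21 | In [Overview of Spark Streaming's Design and Modules](0.1 Spark Streaming 实现思路与模块概述.md) and [How DStream Generates RDD Instances](1.2 DStream 生成 RDD 实例详解.md) we saw that `DStream` and `DStreamGraph` are able to instantiate `RDD`s and the `RDD` DAG. Now let us look at how Spark Streaming schedules them dynamically.
 22 | 
 23 | At the entry point of a Spark Streaming program we always define a `batchDuration`, the interval at which an RDD DAG instance is generated dynamically from the static `DStreamGraph` template. The class with overall responsibility for this dynamic job scheduling is `JobScheduler`.
 24 | 
 25 | `JobScheduler` has two very important members: `JobGenerator` and `ReceiverTracker`. `JobScheduler` delegates the generation of each batch's RDD DAG to `JobGenerator`, and delegates the bookkeeping of source input data to `ReceiverTracker`.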
 26 | 
 27 | ![image](0.imgs/050.png)
 28 | 
 29 |     Fully qualified name of JobScheduler:    org.apache.spark.streaming.scheduler.JobScheduler
 30 |     Fully qualified name of JobGenerator:    org.apache.spark.streaming.scheduler.JobGenerator
 31 |     Fully qualified name of ReceiverTracker: org.apache.spark.streaming.scheduler.ReceiverTracker
 32 | 
 33 | In this article we look at `JobGenerator` in detail.
 34 | 
 35 | ## Starting JobGenerator
 36 | 
 37 | When user code finally calls `ssc.start()`, it implicitly triggers the startup of a series of modules. The call chain that leads to starting `JobGenerator` is:
 38 | ```scala
 39 | // From StreamingContext.start(), JobScheduler.start(), JobGenerator.start()
 40 | 
 41 | ssc.start()                              // [user code: StreamingContext.start()]
 42 |     -> scheduler.start()                 // [JobScheduler.start()]
 43 |                  -> jobGenerator.start() // [JobGenerator.start()]
 44 | ```
 45 | 
 46 | Concretely, the code of `JobGenerator.start()` is:
 47 | 
 48 | ```scala
 49 | // From JobGenerator.start()
 50 | 
 51 | def start(): Unit = synchronized {
 52 |   ...
 53 |   eventLoop.start()                      // [start the event-processing thread]
 54 | 
 55 |   if (ssc.isCheckpointPresent) {
 56 |     restart()                            // [not the first start: recover from the checkpoint]
 57 |   } else {
 58 |     startFirstTime()                     // [first start: call startFirstTime()]
 59 |   }
 60 | }
 61 | ```
 62 | 
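 63 | As we can see, after starting the event-processing thread `eventLoop`, the method decides between `restart()` and `startFirstTime()` depending on whether this is the first start, that is, whether a checkpoint exists.
 64 | 
 65 | We will analyze the `restart()` path taken after a failure later; here we focus on `startFirstTime()`: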
 66 | 
 67 | ```scala
 68 | // From JobGenerator.startFirstTime()
 69 | 
 70 | private def startFirstTime() {
 71 |   val startTime = new Time(timer.getStartTime())
 72 |   graph.start(startTime - graph.batchDuration)
 73 |   timer.start(startTime.milliseconds)
 74 |   logInfo("Started JobGenerator at " + startTime)
 75 | }
 76 | ```
 77 | 
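 78 | As we can see, on the first start the method first calls `graph.start()` to tell `DStreamGraph` its start time (one `batchDuration` before the first timer tick, as the code shows), and then calls `timer.start()` to start the all-important timer.
 79 | 
 80 | Once the timer `timer` has started, `startFirstTime()` of `JobGenerator` is complete.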
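
To make the time arithmetic in `startFirstTime()` concrete, here is a minimal sketch. It assumes the first tick is aligned to the next multiple of the batch interval after "now", which is modeled on how `RecurringTimer.getStartTime()` is understood to round; treat that rounding as an assumption rather than a quotation of the source. `TimerSketch`, `firstTriggerTime`, and `firstBatches` are hypothetical names.

```scala
object TimerSketch {
  // Assumption: the first tick falls on the next multiple of the period after `nowMs`.
  def firstTriggerTime(nowMs: Long, periodMs: Long): Long =
    (math.floor(nowMs.toDouble / periodMs) + 1).toLong * periodMs

  // Returns the time handed to graph.start(...) and the first `n` tick times.
  def firstBatches(nowMs: Long, batchMs: Long, n: Int): (Long, Seq[Long]) = {
    val startTime = firstTriggerTime(nowMs, batchMs)
    val graphStart = startTime - batchMs // mirrors graph.start(startTime - graph.batchDuration)
    val ticks = (0 until n).map(i => startTime + i * batchMs)
    (graphStart, ticks)
  }
}
```

For example, with a 5000 ms batch interval and a current time of 12345 ms, the graph start time is 10000 ms and the first ticks are at 15000, 20000, and 25000 ms.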
 81 | 
 82 | ## RecurringTimer
 83 | 
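 84 | From the previous articles we know that **`JobGenerator` maintains a timer** whose period is the user-configured `batchDuration`, and that **on each tick it generates the RDD DAG instance for one batch**.
 85 | 
 86 | Concretely, this timer instance is: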
 87 | ```scala
 88 | // From JobGenerator
 89 | 
 90 | private[streaming]
 91 | class JobGenerator(jobScheduler: JobScheduler) extends Logging {
 92 | ...
 93 |   private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
 94 |       longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
 95 | ...
 96 | }
 97 | ```
 98 | 
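 99 | The code also shows that the scheduling period of `timer` is `batchDuration`, and that each tick does one very simple thing: it posts a message to `eventLoop` saying that it is time to GenerateJobs for the current batch (`new Time(longTime)`).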
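
The timer-to-event-loop handoff can be sketched as follows. This is a minimal sketch under stated assumptions: `ToyEventLoop` drains its queue synchronously on the calling thread instead of running a dedicated thread, `fireTimer` is a manual stand-in for the timer, and the message classes are toy case classes that only borrow the names `GenerateJobs` and `DoCheckpoint`. None of this is Spark's own code.

```scala
import scala.collection.mutable

object EventLoopSketch {
  sealed trait Event
  final case class GenerateJobs(time: Long) extends Event
  final case class DoCheckpoint(time: Long) extends Event

  final class ToyEventLoop {
    private val queue = mutable.Queue.empty[Event]
    val handled = mutable.ArrayBuffer.empty[Event]

    def post(event: Event): Unit = queue.enqueue(event)

    // Drain the queue in FIFO order; handling one event may post further events.
    def drain(): Unit = while (queue.nonEmpty) {
      val event = queue.dequeue()
      handled += event
      event match {
        case GenerateJobs(t) => post(DoCheckpoint(t)) // job generation itself is elided
        case DoCheckpoint(_) => ()                    // checkpoint writing is elided
      }
    }
  }

  // Manual stand-in for the timer: one callback per batch boundary.
  def fireTimer(startMs: Long, periodMs: Long, ticks: Int)(callback: Long => Unit): Unit =
    (0 until ticks).foreach(i => callback(startMs + i * periodMs))
}
```

The timer callback does nothing but post a message; all real work happens when the loop handles that message, which keeps the timer itself cheap and punctual.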
100 | 
101 | ## GenerateJobs
102 | 
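103 | Next, when `eventLoop` receives the message, it performs the corresponding operation on its message-processing thread. Here, the operation that handles a `GenerateJobs(time)` message is `generateJobs(time)`: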
104 | 
105 | ```scala
106 | private def generateJobs(time: Time) {
107 |   SparkEnv.set(ssc.env)
108 |   Try {
109 |     jobScheduler.receiverTracker.allocateBlocksToBatch(time)                 // [step (1)]
110 |     graph.generateJobs(time)                                                 // [step (2)]
111 |   } match {
112 |     case Success(jobs) =>
113 |       val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time) // [step (3)]
114 |       jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))    // [step (4)]
115 |     case Failure(e) =>
116 |       jobScheduler.reportError("Error generating jobs for time " + time, e)
117 |   }
118 |   eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))       // [step (5)]
119 | }
120 | ```
121 | 
122 | This code is remarkably compact: it contains the 5 steps that make up the main work of `JobGenerator`, as shown in the figure below.
123 | 
124 | ![image](0.imgs/055.png)
125 | 
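126 | - (1) **Ask `ReceiverTracker` to allocate the data received so far**, that is, to assign the data that has arrived since the previous batch was cut off to the new batch
127 |     - `ReceiverTracker`'s `allocateBlocksToBatch(time)` over the meta information of received data, and `ReceiverTracker`'s own handling of block meta information reported by `ReceiverSupervisorImpl`, are independent processes, but they are mutually excluded through the `synchronized` keyword
128 |     - In other words, regardless of the timestamp `t1` at which `ReceiverSupervisorImpl` formed a block, the timestamp `t2` at which it sent the block, and the timestamp `t3` at which `ReceiverTracker` received it, the batch a block ends up in is decided at the moment `ReceiverTracker.allocateBlocksToBatch(time)` acquires the `synchronized` lock: all block meta information not yet assigned to any earlier batch is assigned to the newest batch
129 |     - Therefore the meta information of every block is assigned to one, and only one, batch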
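
The "one and only one batch" property of step (1) follows from guarding both operations with the same lock. The following is a minimal sketch of that idea, assuming string block ids and a hypothetical `ToyBlockTracker` class; it illustrates the locking pattern only and is not the `ReceivedBlockTracker` implementation.

```scala
import scala.collection.mutable

final class ToyBlockTracker {
  private val unallocated = mutable.ArrayBuffer.empty[String]
  private val allocated = mutable.Map.empty[Long, List[String]]

  // Called whenever a receiver reports a stored block.
  def addBlock(blockId: String): Unit = synchronized {
    unallocated += blockId
  }

  // Called once per batch: everything still unallocated goes to this batch.
  // It holds the same lock as addBlock, so a block is either in or out, never both.
  def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
    allocated(batchTime) = unallocated.toList
    unallocated.clear()
  }

  def blocksOfBatch(batchTime: Long): List[String] = synchronized {
    allocated.getOrElse(batchTime, Nil)
  }
}
```

A block reported just after `allocateBlocksToBatch(t)` returns is simply picked up by the next allocation, so no block is lost and none is counted twice.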
130 | 
131 | <br/>
132 | 
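133 | - (2) **Ask `DStreamGraph` to produce a fresh RDD DAG instance**. Specifically, `DStreamGraph` asks the tail `DStream` nodes of the graph to generate concrete RDD instances, which recursively call their upstream `DStream` nodes, and so on through the whole `DStreamGraph`; when the traversal ends, the RDD DAG instance has been generated
134 |     - For a detailed explanation of this process, see the earlier article [How DStream Generates RDD Instances](1.2 DStream 生成 RDD 实例详解.md)
135 |     - To be exact, the return value of the whole `DStreamGraph.generateJobs(time)` traversal is a `Seq[Job]`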
136 | 
137 | <br/>
138 | 
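139 | - (3) **Fetch the meta information of the source data that was allocated to this batch in step 1**
140 |     - Step 1 only assigned the source data's meta information to a batch; this step queries by batch time (through `jobScheduler.inputInfoTracker` in the code above) to obtain the meta information of the input data assigned to this batch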
141 | 
142 | <br/>
143 | 
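144 | - (4) Submit the RDD DAG of this batch generated in step 2, together with the meta information fetched in step 3, **to `JobScheduler` for asynchronous execution**
145 |     - What we submit is (a) `time`, (b) `Seq[Job]`, and (c) the block meta information, wrapped into a `JobSet`, which is handed over by calling `JobScheduler.submitJobSet(JobSet)`
146 |     - Submitting to `JobScheduler` is decoupled from the subsequent execution in `jobExecutor`, which happens asynchronously, so this step returns very quickly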
147 | 
148 | <br/>
149 | 
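150 | - (5) As soon as the submission is done (whether or not asynchronous execution has started), **immediately checkpoint the current running state of the whole system**
151 |     - Checkpointing here also merely posts a `DoCheckpoint` message asynchronously and returns without waiting for the checkpoint to be written
152 |     - In brief, the checkpoint contains runtime information such as the JobSets that have been submitted but have not yet finished running.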
153 | 
154 | <br/>
155 | <br/>
156 | 
157 | (End of article. To join the discussion of this article, [click here](https://github.com/proflin/CoolplaySpark/issues/5); to return to the table of contents, [click here](readme.md).)
158 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/2.imgs/020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/2.imgs/020.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/2.imgs/021.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/2.imgs/021.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.1 Receiver 分发详解.md:
--------------------------------------------------------------------------------
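  1 | # Receiver Distribution in Detail
  2 | 
  3 | ***[CoolplaySpark] Spark Streaming Source Code Walkthrough Series***. To return to the table of contents, [click here](readme.md).
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team).
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released so far: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released so far: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released so far: 2.2.0 ~ 2.2.3)
 13 | ```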
 14 | <br/>
 15 | <br/>
 16 | 
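 17 | Before reading this article, please be sure to read [Overview of Spark Streaming's Design and Modules](0.1 Spark Streaming 实现思路与模块概述.md) first. It outlines what each of Spark Streaming's 4 major modules does; with that overall picture in mind, come back to this article for the details of `Module 3: Data Generation and Ingestion`.
 18 | 
 19 | ## Introduction
 20 | 
 21 | As described earlier in [DStream and DStreamGraph in Detail](1.1 DStream, DStreamGraph 详解.md), the whole `DStreamGraph` is reached from the `output stream`s, which reference their upstream `DStream` nodes through *dependency* relationships. Tracing back recursively to the most upstream `InputDStream` nodes, we find no further dependencies on other `DStream` nodes, because an `InputDStream` node itself represents the original data set.
 22 | 
 23 | ![image](1.imgs/035.png)
 24 | 
 25 | Our explanation of `Module 3: Data Generation and Ingestion` covers only `ReceiverInputDStream` and its subclasses; other `InputDStream` subclasses are discussed in separate articles. That is, the scope of this module is: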
 26 | 
 27 | ```scala
 28 | - ReceiverInputDStream
 29 |   - subclass SocketInputDStream
 30 |   - subclass TwitterInputDStream
 31 |   - subclass RawInputDStream
 32 |   - subclass FlumePollingInputDStream
 33 |   - subclass MQTTInputDStream
 34 |   - subclass FlumeInputDStream
 35 |   - subclass PluggableInputDStream
 36 |   - subclass KafkaInputDStream
 37 | ```
 38 | 
 39 | ## How ReceiverTracker Distributes Receivers
 40 | 
 41 | As we already know, `ReceiverTracker` runs on the driver and is the commander in chief of the `Receiver`s spread across the executors.
 42 | 
 43 | `ssc.start()` implicitly calls `ReceiverTracker.start()`, whose most important task is to call its own `launchReceivers()` method to distribute the `Receiver`s to multiple executors. On each executor, a `ReceiverSupervisor` then starts one `Receiver` to receive data. The figure below illustrates this process:
 44 | 
 45 | ![image](0.imgs/060.png)
 46 | 
 47 | We will take versions 1.4.0 and 1.5.0 as representatives and look closely at the implementation of `launchReceivers()`.
 48 |   
 49 |     1.4.0 represents the versions before 1.5.0, such as 1.2.x, 1.3.x, 1.4.x
 50 |   
 51 |     1.5.0 represents 1.5.0 and later versions, such as 1.5.x, 1.6.x
 52 | 
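 53 | ## launchReceivers() in Spark 1.4.0
 54 | 
 55 | In Spark 1.4.0, `launchReceivers()` works as follows:
 56 | 
 57 | - (1.a) **Build the Receiver RDD**. First, iterate over all `ReceiverInputDStream`s to obtain the `x` `Receiver` instances to be started. Then treat these instances as `x` pieces of data and build an `RDD` instance on the driver; this `RDD` has `x` partitions, each containing one piece of `Receiver` data (that is, one `Receiver` instance).
 58 | 
 59 | - (1.b) **Define the computation func**. A total of `x` `Task`s will be started across the executors, each responsible for one partition of data, that is, one `Receiver` instance. The computation to perform on this `Receiver` instance is defined as the function `func`, which:
 60 |   - constructs a new `ReceiverSupervisor` instance `supervisor` with this `Receiver` instance as argument: `supervisor = new ReceiverSupervisorImpl(receiver, ...)`
 61 |   - calls `supervisor.start()`, which starts the `Receiver` instance on a new thread and returns quickly
 62 |   - calls `supervisor.awaitTermination()`, which blocks the current `Task` thread indefinitely
 63 | 
 64 | - (1.c) **Ship the RDD (of Receivers) and func to the executors**. Steps (1.a) and (1.b) only define, on the driver, the `RDD[Receiver]` and the `func` to run on it; nothing has been executed yet. This step ships both definitions to the executors, where they can be executed right away.
 65 | 
 66 | - (2) **On each executor, run the `func` defined in (1.b)**, that is, start the `Receiver` instance and block the current thread indefinitely.
 67 | 
 68 | In this way, 1 `RDD` instance containing `x` `Receiver`s, and correspondingly 1 `Job` containing `x` `Task`s, is enough to distribute and deploy the `Receiver`s. The process (1.a)(1.b)(1.c)(2) is illustrated below: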
 69 | 
 70 | ![image](3.imgs/020.png)
 71 | 
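 72 | Spark Core, the layer underneath Spark Streaming, is completely unaware of `Receiver` distribution here. It merely executes an ordinary `Job` at the "application level" (to Spark Core, Spark Streaming is the application level); yet through this ordinary `Job` alone Spark Streaming accomplishes the "special function" of distributing `Receiver`s, which is quite ingenious.
 73 | 
 74 | For the source code implementing this logic, see [ReceiverTracker in Spark 1.4.0](https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala).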
 75 | 
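 76 | ## launchReceivers() in Spark 1.5.0
 77 | 
 78 | The implementation above, with its long-running distribution `Job`, still has some problems:
 79 | 
 80 | - If a `Task` reaches `spark.task.maxFailures` (default 4) failures, the whole `Job` fails. In a long-running Spark Streaming program, a few more `Executor` failures may be enough for a `Task` to hit this limit.
 81 | - If a `Task` fails, the Spark Core `TaskScheduler` redeploys it to another executor and reruns it. The problem is that the executor chosen for the rerun may be the one running relatively few `Task`s at the moment the rerun is dispatched, which is not necessarily the choice that distributes the `Receiver`s most evenly.
 82 | - Some user code may want to define its own `Receiver` distribution strategy, for example deploying all `Receiver`s to the same node.
 83 | 
 84 | Starting with 1.5.0, Spark Streaming has an enhanced `Receiver` distribution strategy. Compared with earlier versions, the main changes are:
 85 | 
 86 | 1. A pluggable `ReceiverSchedulingPolicy` is added
 87 | 2. `1` `Job` (containing `x` `Task`s) is replaced by `x` `Job`s (each containing only `1` `Task`)
 88 | 3. A monitoring-and-restart mechanism for `Receiver`s is added
 89 | 
 90 | Let us look at them one by one.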
 91 | 
 92 | ### (1) Pluggable ReceiverSchedulingPolicy
 93 | 
 94 | The main purpose of `ReceiverSchedulingPolicy` is to compute the destinations of `Receiver`s at the Spark Streaming level. Whereas earlier versions relied on Spark Core's `TaskScheduler` for generic placement, the new `ReceiverSchedulingPolicy` has a better understanding of the semantics of a streaming application and can therefore compute a better placement.
 95 | 
 96 | `ReceiverSchedulingPolicy` has two methods, used respectively:
 97 | 
 98 | - when the streaming program first starts:
 99 |   - collect all `Receiver` instances contained in all `InputDStream`s (`receivers`)
100 |   - collect all executors (`executors`) as candidate destinations
101 |   - then call `ReceiverSchedulingPolicy.scheduleReceivers(receivers, executors)` to compute the list of destination executors for each `Receiver`
102 | 
103 | - while the streaming program is running, if a `Receiver` needs to be restarted:
104 |   - first check whether any of the previously computed destination executors is still alive
105 |   - if none is, call `ReceiverSchedulingPolicy.rescheduleReceiver(receiver, ...)` to recompute the list of destination executors for this `Receiver`
106 | 
107 | [The default `ReceiverSchedulingPolicy`](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverSchedulingPolicy.scala) is implemented in a `round-robin` fashion. The following example illustrates the two methods:
108 | 
109 | ![image](3.imgs/030.png)
110 | 
111 | Here, when `Receiver y` fails, earlier versions of Spark Streaming might restart `Receiver y` on executor 1, whereas since 1.5.0 `Receiver y` is restarted on executor 3.
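
The two placement decisions can be illustrated with a deliberately simplified sketch. It assumes a plain round-robin for the initial placement and a "least loaded alive executor" rule for rescheduling; the actual `ReceiverSchedulingPolicy` takes more into account (for example preferred locations), so `PlacementSketch`, `scheduleRoundRobin`, and `rescheduleReceiver` below are hypothetical helpers, not Spark's API.

```scala
object PlacementSketch {
  // Initial placement: receiver i goes to executor i modulo the executor count.
  def scheduleRoundRobin(receiverIds: Seq[Int], executors: Seq[String]): Map[Int, String] = {
    require(executors.nonEmpty, "need at least one executor")
    receiverIds.zipWithIndex.map { case (id, i) => id -> executors(i % executors.size) }.toMap
  }

  // Restart placement: pick the alive executor that currently runs the fewest
  // other receivers (ties go to the first candidate in `alive`).
  def rescheduleReceiver(current: Map[Int, String], receiverId: Int, alive: Seq[String]): String = {
    require(alive.nonEmpty, "need at least one alive executor")
    val load = (current - receiverId).values.groupBy(identity).map { case (e, rs) => e -> rs.size }
    alive.minBy(e => load.getOrElse(e, 0))
  }
}
```

With 4 receivers on 3 executors, the first executor ends up with two receivers; if the receiver on the second executor has to be restarted and only the first and third executors are alive, the sketch picks the third, because it carries less load.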
112 | 
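113 | ### (2) A Separate Job for Distributing Each Receiver
114 | 
115 | Since 1.5.0, Spark Streaming assigns each `Receiver` its own `Job` with just 1 `Task` to attempt the distribution, which is very different from earlier versions, where `x` `Receiver`s were distributed in a single `Job` with `x` `Task`s.
116 | 
117 | Moreover, this single `Task` tries to start the `Receiver` only on its first execution. If the `Task` fails and is rescheduled to another executor, it no longer tries to start the `Receiver` and performs a no-op instead, so the `Job` ends in the state "completed successfully". `ReceiverTracker` then launches another `Job`, possibly recomputing the `Receiver`'s destination, to keep trying to distribute the `Receiver`, and so on until it succeeds.
118 | 
119 | In addition, when dispatching a `Task`, Spark Core only takes the `preferredLocation` set by Spark Streaming into consideration and honors it most of the time, so there is still some chance that the `Job` distributing a `Receiver` does not run on the executor we wanted. Therefore, on its first execution the `Task` first sends a `RegisterReceiver` message to `ReceiverTracker`, and only on a positive reply does it actually start the `Receiver`; otherwise it again performs a no-op, and the `Job` ends in the state "completed successfully". Of course, `ReceiverTracker` then launches another `Job` to keep trying to distribute the `Receiver`, and so on until it succeeds.
120 | 
121 | The following figure illustrates this change: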
122 | 
123 | ![image](3.imgs/040.png)
124 | 
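125 | As shown above, a `Job` that distributes a `Receiver` may fail to achieve its goal, in which case `ReceiverTracker` launches another `Job` to try again. This mechanism ensures that a `Receiver` that did not land on the precomputed executor gets another chance to be distributed, giving Spark Streaming better control over where each `Receiver` runs.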
126 | 
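127 | ### (3) Monitoring and Restarting `Receiver`s
128 | 
129 | Having seen that each `Receiver` has a dedicated `Job` to ensure its distribution, we notice that restarting a failed `Receiver` is consequently no longer limited by `spark.task.maxFailures` (default 4).
130 | 
131 | The reason is that `Receiver` retries now happen at the `Job` level rather than the `Task` level, and a failed `Receiver` does not cause the previous `Job` to fail: the previous `Job` succeeds and a new `Job` is launched to distribute the `Receiver` again. As a result, no matter how long Spark Streaming runs, the `Receiver`s stay alive and do not die as executors are lost.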
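
The job-level retry loop can be sketched as follows. This is a minimal sketch under stated assumptions: each attempt is modeled as one single-task job that always completes successfully, `landings` is a scripted sequence of the executors those jobs happen to run on, and the tracker is assumed to approve registration only on a scheduled executor. `RestartSketch` and `jobsUntilStarted` are hypothetical names.

```scala
object RestartSketch {
  // Returns how many single-task jobs were submitted before the receiver
  // actually started, or None if the scripted landings ran out first.
  def jobsUntilStarted(scheduled: Set[String], landings: Iterator[String]): Option[Int] = {
    var jobsSubmitted = 0
    while (landings.hasNext) {
      jobsSubmitted += 1
      val executor = landings.next()
      // The receiver starts only if registration is approved on this executor.
      if (scheduled.contains(executor)) return Some(jobsSubmitted)
      // Otherwise the task is a no-op: the job succeeds and a new job is submitted.
    }
    None
  }
}
```

Because no job in this loop ever fails, a per-task failure limit never comes into play, which is the point made in the paragraph above.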
132 | 
133 | ## Summary
134 | 
135 | Let us briefly compare `Receiver` distribution in 1.4.0 and 1.5.0 once more:
136 | 
137 | ![image](3.imgs/020.png)
138 | ![image](3.imgs/040.png)
139 | 
140 | From the analysis above, we conclude:
141 | 
142 | <table>
143 |     <tr>
144 |         <td align="center"></td>
145 |         <td align="center"><strong>Spark Streaming 1.4.0</strong></td>
146 |         <td align="center"><strong>Spark Streaming 1.5.0</strong></td>
147 |     </tr>
148 |     <tr>
149 |         <td align="center"><strong>Receiver liveness</strong></td>
150 |         <td align="center">not guaranteed to stay alive</td>
151 |         <td align="center">retried indefinitely, guaranteed to stay alive</td>
152 |     </tr>
153 |     <tr>
154 |         <td align="center"><strong>Balanced Receiver distribution</strong></td>
155 |         <td align="center">no guarantee</td>
156 |         <td align="center">round-robin policy</td>
157 |     </tr>
158 |     <tr>
159 |         <td align="center"><strong>Custom Receiver distribution</strong></td>
160 |         <td align="center">very tricky</td>
161 |         <td align="center">convenient</td>
162 |     </tr>
163 | </table>
164 | 
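165 | ## Acknowledgments
166 | 
167 | The enhanced `Receiver` distribution strategy available since 1.5.0, analyzed in this article, was contributed to the community by Shixiong Zhu (朱诗雄):
168 | 
169 | - ![Shixiong Zhu](https://avatars2.githubusercontent.com/u/1000778?v=3&s=80)
170 | - Shixiong Zhu (朱诗雄), Apache Spark committer and an outstanding engineer from China at Databricks, has been contributing code to Spark continuously for nearly two years
171 | - Check out [his GitHub](https://github.com/zsxwing) and [the code he is contributing to Spark](https://github.com/apache/spark/commits/master?author=zsxwing)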
172 | 
173 | <br/>
174 | <br/>
175 | 
176 | (End of article. To join the discussion of this article, [click here](https://github.com/proflin/CoolplaySpark/issues/6); to return to the table of contents, [click here](readme.md).)
177 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.2 Receiver, ReceiverSupervisor, BlockGenerator, ReceivedBlockHandler 详解.md:
--------------------------------------------------------------------------------
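  1 | # Receiver, ReceiverSupervisor, BlockGenerator, and ReceivedBlockHandler in Detail
  2 | 
  3 | ***[CoolplaySpark] Spark Streaming Source Code Walkthrough Series***. To return to the table of contents, [click here](readme.md).
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team).
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released so far: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released so far: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released so far: 2.2.0 ~ 2.2.3)
 13 | ```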
 14 | <br/>
 15 | <br/>
 16 | 
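 17 | Before reading this article, please be sure to read [Overview of Spark Streaming's Design and Modules](0.1 Spark Streaming 实现思路与模块概述.md) first. It outlines what each of Spark Streaming's 4 major modules does; with that overall picture in mind, come back to this article for the details of `Module 3: Data Generation and Ingestion`.
 18 | 
 19 | ## Introduction
 20 | 
 21 | As analyzed earlier in [Overview of Spark Streaming's Design and Modules](0.1 Spark Streaming 实现思路与模块概述.md), when a Spark Streaming program starts running:
 22 | 
 23 | - (1) `ReceiverTracker`, the commander in chief of `Receiver`s, dispatches multiple jobs (each with 1 task) to multiple executors, each of which starts a `ReceiverSupervisor` instance;
 24 | 
 25 | - (2) As soon as each `ReceiverSupervisor` starts, it creates an instance of the user-provided `Receiver` implementation and then calls `Receiver.onStart()` on it. Such a `Receiver` implementation continuously produces data or receives data from outside the system; `TwitterReceiver`, for example, crawls Twitter data in real time.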
 26 | 
 27 | ![image](0.imgs/060.png)
 28 | 
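 29 |     Fully qualified name of ReceiverSupervisor: org.apache.spark.streaming.receiver.ReceiverSupervisor
 30 |     Fully qualified name of Receiver:           org.apache.spark.streaming.receiver.Receiver
 31 | 
 32 | Steps (1) and (2) are shown in the figure above; at this point the `Receiver` startup work is complete.
 33 | 
 34 | From here on, `ReceiverSupervisor` plays the main role on the executor side, and:
 35 | 
 36 | - (3) After being started through `onStart()`, the `Receiver` **continuously** receives external data and keeps handing it to `ReceiverSupervisor` for storage;
 37 | 
 38 | - (4) `ReceiverSupervisor` **continuously** receives the data handed over by the `Receiver`:
 39 | 
 40 | 	- if the data items are small, `BlockGenerator` has to accumulate multiple items into a block (4a), which is then stored as a block (4b or 4c)
 41 | 	- otherwise no accumulation is needed and the data is stored as a block directly (4b or 4c)
 42 |   
 43 | 	- Spark Streaming currently supports two ways of storing blocks: `BlockManagerBasedBlockHandler` stores them directly in the executor's memory or disk, while `WriteAheadLogBasedBlockHandler` writes them both to the WAL (4c) and to the executor's memory or disk
 44 | 
 45 | - (5) Each time a block has been stored on the executor, `ReceiverSupervisor` promptly reports the block's meta information to `ReceiverTracker` on the driver; this meta information includes the block's id, location, record count, size, and so on.
 46 | 
 47 | - (6) `ReceiverTracker` then passes the received block meta information straight to its member `ReceivedBlockTracker`, which is dedicated to managing the meta information of received blocks.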
 48 | 
 49 | ![image](0.imgs/065.png)
 50 | 
 51 |     Fully qualified name of BlockGenerator:                 org.apache.spark.streaming.receiver.BlockGenerator
 52 |     Fully qualified name of BlockManagerBasedBlockHandler:  org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler
 53 |     Fully qualified name of WriteAheadLogBasedBlockHandler: org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler
 54 |     Fully qualified name of ReceivedBlockTracker:           org.apache.spark.streaming.scheduler.ReceivedBlockTracker
 55 |     Fully qualified name of ReceiverInputDStream:           org.apache.spark.streaming.dstream.ReceiverInputDStream
 56 | 
 57 | Steps (3)(4)(5)(6) happen **continuously**, and they are also marked in the figure above.
 58 | 
 59 | Later, on the driver, `ReceiverInputDStream` checks in every batch the block meta information received by `ReceiverTracker`, determines which new data has to be processed in this batch, and then generates the corresponding `RDD` instance to process these blocks.
 60 | 
 61 | Below we look in detail at the three classes Receiver, ReceiverSupervisor, and BlockGenerator.
 62 | 
 63 | ## Receiver in Detail
 64 | 
 65 | `Receiver` is an abstract base class:
 66 | 
 67 | ```scala
 68 | // From Receiver
 69 | 
 70 | abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
 71 |   // To be implemented by subclasses
 72 |   def onStart()
 73 |   def onStop()
 74 |   
 75 |   // Implemented by the base class, to be called by subclasses
 76 |   def store(dataItem: T) {...}                  // [store a single small data item]
 77 |   def store(dataBuffer: ArrayBuffer[T]) {...}   // [store block data given as an array buffer]
 78 |   def store(dataIterator: Iterator[T]) {...}    // [store block data given as an iterator]
 79 |   def store(bytes: ByteBuffer) {...}            // [store block data given as a ByteBuffer]
 80 |   
 81 |   ...
 82 | }
 83 | ```
 84 | 
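 85 | What a `Receiver` subclass has to implement are the `onStart()` and `onStop()` methods. `onStart()` is called by `ReceiverSupervisor` on the executor, and its implementation should return quickly rather than block.
 86 | 
 87 | For example, the `onStart()` of `SocketReceiver`, which ships with Spark Streaming, is implemented as follows: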
 88 | 
 89 | ```scala
 90 | // From SocketReceiver
 91 | 
 92 | def onStart() {
 93 |   new Thread("Socket Receiver") {
 94 |     setDaemon(true)
 95 |     override def run() { receive() }
 96 |   }.start()  // [only spawns a new thread to receive data]
 97 |   // [onStart() returns very quickly]
 98 | }
 99 | ```
100 | 
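101 | The other method, `onStop()`, is called when the `Receiver` is shut down and can do cleanup work such as closing resources.
102 | 
103 | Once the `Receiver` is really up and running and can start producing or receiving data, how does the received data get stored into Spark Streaming?
104 | 
105 | The answer is simple: just call `store()`. The `Receiver` base class provides `store()` with 4 signatures, for storing respectively:
106 | - (a) a single small data item
107 | - (b) block data given as an array buffer
108 | - (c) block data given as an iterator
109 | - (d) block data given as a ByteBuffer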
110 | 
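111 | All 4 `store()` variants simply hand the data to `ReceiverSupervisor`, which is responsible for the actual storage.
112 | 
113 | So a concrete `Receiver` subclass only needs to start a data-receiving thread in `onStart()` and call `store()` to hand data to the Spark Streaming framework as it arrives.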
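
The "non-blocking `onStart()` plus `store()`" pattern can be tried out without Spark. The following is a minimal sketch, assuming a hypothetical `ToyReceiver` base class whose `store()` appends to an in-memory queue instead of handing data to a supervisor; `CountingReceiver` is likewise a hypothetical example subclass.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit}

// Hypothetical, Spark-free stand-in for the Receiver base class.
abstract class ToyReceiver[T] {
  private val sink = new ConcurrentLinkedQueue[T]()
  def onStart(): Unit
  def onStop(): Unit
  // Here store() only appends to a queue; the real store() hands data to the supervisor.
  def store(item: T): Unit = { sink.add(item); () }
  def storedCount: Int = sink.size()
}

// Stores `n` records from a background thread, then signals completion.
final class CountingReceiver(n: Int) extends ToyReceiver[String] {
  val done = new CountDownLatch(1)

  override def onStart(): Unit = {
    val worker = new Thread("toy-receiver") {
      override def run(): Unit = {
        (1 to n).foreach(i => store(s"record-$i"))
        done.countDown()
      }
    }
    worker.setDaemon(true)
    worker.start() // onStart() returns right away; receiving continues on the worker thread
  }

  override def onStop(): Unit = ()
}
```

Calling `onStart()` returns immediately, and the records show up in the sink once the worker thread has run, which is the behavior a real `Receiver` is expected to have.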
114 | 
115 | ## ReceiverSupervisor in Detail
116 | 
117 | As analyzed in [Receiver Distribution in Detail](3.1 Receiver 分发详解.md), on the executor side the `Task` of the `Job` that distributes a `Receiver` executes the following:
118 | 
119 | ```scala
120 | (iterator: Iterator[Receiver[_]]) => {
121 |   ...
122 |   val receiver = iterator.next()
123 |   assert(iterator.hasNext == false)
124 |   // [ReceiverSupervisorImpl is the concrete implementation of ReceiverSupervisor]
125 |   val supervisor = new ReceiverSupervisorImpl(receiver, ...)
126 |   supervisor.start()
127 |   supervisor.awaitTermination()
128 |   ...
129 | }
130 | ```
131 | 
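132 | `ReceiverSupervisor` defines a number of method interfaces; its concrete implementation class is `ReceiverSupervisorImpl`.
133 | 
134 | In the code above, the executor first creates a `ReceiverSupervisorImpl` with `new` and then calls `ReceiverSupervisorImpl.start()`. An important part of `.start()` is calling `Receiver.onStart()` to start the `Receiver`'s data-receiving thread:
135 | 
136 | ![image](3.imgs/050.png)
137 | 
138 | After `start()` succeeds, the most important job of `ReceiverSupervisorImpl` is to take in the data that the `Receiver` passes over via `store()`.
139 | 
140 | `ReceiverSupervisorImpl` has `push...()` methods with 4 signatures, called one-to-one by the 4 `store()` variants of `Receiver`. The handling of single small items, however, differs slightly from that of the three kinds of block data.
141 | 
142 | For single items, `ReceiverSupervisorImpl` relies on `BlockGenerator` to accumulate multiple single items into one block, and then calls `push` again to hand this block to `ReceiverSupervisorImpl` for processing. We will look at this `BlockGenerator` process in detail shortly.
143 | 
144 | So next we focus on the 3 `push...()` methods that store block data. Their implementations are very simple: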
145 | 
146 | ```scala
147 | // From ReceiverSupervisorImpl
148 | 
149 | def pushArrayBuffer(arrayBuffer: ArrayBuffer[_], ...) {
150 |   pushAndReportBlock(ArrayBufferBlock(...), ...)
151 | }
152 | 
153 | def pushIterator(iterator: Iterator[_], ...) {
154 |   pushAndReportBlock(IteratorBlock(...), ...)
155 | }
156 | 
157 | def pushBytes(bytes: ByteBuffer, ...){
158 |   pushAndReportBlock(ByteBufferBlock(...), ...)
159 | }
160 | 
161 | def pushAndReportBlock(receivedBlock: ReceivedBlock, ...) {
162 | ...
163 | }
164 | ```
165 | 
166 | As the names suggest, each of these 3 `push...()` methods wraps its data into a `ReceivedBlock`, and `pushAndReportBlock()` then does two things:
167 | 
168 | - (a) push: hand the `ReceivedBlock` to a `ReceivedBlockHandler` for storage; concretely, one of the two `ReceivedBlockHandler` storage implementations is used
169 | - (b) report: report the meta information of the stored `ReceivedBlock` to `ReceiverTracker`
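
The push-then-report sequence can be sketched with a small self-contained model. This is only an illustration under stated assumptions: `BlockMeta`, `BlockHandler`, `InMemoryHandler`, `Tracker` and `Supervisor` are hypothetical stand-ins, not the real Spark classes, and the block id format is made up.

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object PushAndReportSketch {
  // Hypothetical, simplified stand-in for the block meta information.
  final case class BlockMeta(blockId: String, numRecords: Long)

  // Plays the role of ReceivedBlockHandler: stores a block, returns its meta.
  trait BlockHandler {
    def storeBlock(blockId: String, records: Seq[Any]): BlockMeta
  }

  // Keeps blocks in a local map instead of a real block store.
  final class InMemoryHandler extends BlockHandler {
    val stored = mutable.Map.empty[String, Seq[Any]]
    def storeBlock(blockId: String, records: Seq[Any]): BlockMeta = {
      stored(blockId) = records
      BlockMeta(blockId, records.size.toLong)
    }
  }

  // Plays the role of the driver-side tracker that collects block meta.
  final class Tracker {
    val reported = ArrayBuffer.empty[BlockMeta]
    def addBlock(meta: BlockMeta): Boolean = { reported += meta; true }
  }

  final class Supervisor(handler: BlockHandler, tracker: Tracker) {
    private var nextId = 0L

    // The block-level push methods only wrap their input and delegate.
    def pushArrayBuffer(records: ArrayBuffer[Any]): Unit = pushAndReportBlock(records.toList)
    def pushIterator(records: Iterator[Any]): Unit = pushAndReportBlock(records.toList)

    private def pushAndReportBlock(records: Seq[Any]): Unit = {
      val blockId = s"input-0-$nextId" // made-up id format
      nextId += 1
      val meta = handler.storeBlock(blockId, records) // (a) push
      if (!tracker.addBlock(meta)) {                  // (b) report
        throw new IllegalStateException(s"Failed to report block $blockId")
      }
    }
  }
}
```

The point of the sketch is the ordering: the meta information is reported only after the store call has returned.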
170 | 
171 | The process above can be summarized as:
172 | 
173 | ![image](0.imgs/065.png)
174 | 
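175 | ## ReceivedBlockHandler in detail
176 | 
177 | `ReceivedBlockHandler` is an interface (a Scala trait) that, on the executor side, is responsible for actually storing and cleaning up received block data:
178 | 
179 | ```scala
180 | // From ReceivedBlockHandler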
181 | 
182 | private[streaming] trait ReceivedBlockHandler {
183 | 
184 |   /** Store a received block with the given block id and return related metadata */
185 |   def storeBlock(blockId: StreamBlockId, receivedBlock: ReceivedBlock): ReceivedBlockStoreResult
186 | 
187 |   /** Cleanup old blocks older than the given threshold time */
188 |   def cleanupOldBlocks(threshTime: Long)
189 | }
190 | ```
191 | 
192 | `ReceivedBlockHandler` has two concrete storage strategies:
193 | 
194 | - (a) `BlockManagerBasedBlockHandler` stores directly to the executor's memory or disk
195 | - (b) `WriteAheadLogBasedBlockHandler` writes to the WAL in addition to storing to the executor's memory or disk
196 | 
197 | ### (a) BlockManagerBasedBlockHandler implementation
198 | 
199 | `BlockManagerBasedBlockHandler` stores data directly into the `BlockManager` of Spark Core.
200 | 
201 | `BlockManager` receives `Block` data on the executor side and maintains `Block` meta information on the driver side. According to the `StorageLevel` requested by the caller, `BlockManager` stores the data in the local executor's `RAM` or `DISK`, and can additionally replicate a copy to the `RAM` or `DISK` of another executor. [See here](http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence) for all the values that `StorageLevel` supports.
202 | 
203 | Below is how `BlockManagerBasedBlockHandler.storeBlock()` stores the 3 kinds of block data into `BlockManager`:
204 | ```scala
205 | // From BlockManagerBasedBlockHandler
206 | 
207 | def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
208 |   val putResult: Seq[(BlockId, BlockStatus)] = block match {
209 |     case ArrayBufferBlock(arrayBuffer) =>
210 |       blockManager.putIterator(blockId, arrayBuffer.iterator, ...)   // store the array into blockManager
211 |     case IteratorBlock(iterator) =>
212 |       blockManager.putIterator(blockId, countIterator, ...)          // store the iterator into blockManager
213 |     case ByteBufferBlock(byteBuffer) =>
214 |       blockManager.putBytes(blockId, byteBuffer, ...)                // store the ByteBuffer into blockManager
215 |   ...
216 | }
217 | ```
218 | 
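219 | ### (b) WriteAheadLogBasedBlockHandler implementation
220 | 
221 | `WriteAheadLogBasedBlockHandler` writes to both the WAL on reliable storage and the executor's `BlockManager` at the same time; only after both writes complete does it report the block's meta information.
222 | 
223 | The block data in `BlockManager` is what computation uses first; the data written to the WAL is read back only when the executor fails.
224 | 
225 | As with WALs in other systems, data is written to the WAL strictly sequentially. The block meta information reported afterwards additionally includes the path of the WAL file holding the block, plus the offset and length within that file.
226 | 
227 | The concrete write logic is:
228 | 
229 | ```scala
230 | // From WriteAheadLogBasedBlockHandler
231 | 
232 | def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
233 |   ...
234 |   // Create the future that stores the data into BlockManager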
235 |   val storeInBlockManagerFuture = Future {
236 |     val putResult =
237 |       blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
238 |     if (!putResult.map { _._1 }.contains(blockId)) {
239 |       throw new SparkException(
240 |         s"Could not store $blockId to block manager with storage level $storageLevel")
241 |     }
242 |   }
243 | 
244 |   // Create the future that stores the data into the WAL
245 |   val storeInWriteAheadLogFuture = Future {
246 |     writeAheadLog.write(serializedBlock, clock.getTimeMillis())
247 |   }
248 | 
249 |   // Combine the two futures and wait until both have finished
250 |   val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
251 |   val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
252 |   
253 |   // Return the store result, used later when reporting the block meta
254 |   WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
255 | }
256 | ```
257 | 
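258 | ## BlockGenerator in detail
259 | 
260 | Finally, let us fill in how `ReceiverSupervisorImpl`, after receiving single small records, delegates to `BlockGenerator` to accumulate them and package many small records into one block.
261 | 
262 | Internally, `BlockGenerator` mainly maintains a temporary growable array `currentBuffer`; every record forwarded by `ReceiverSupervisorImpl` is appended to `currentBuffer`.
263 | 
264 | An important detail: before a record is added to `currentBuffer`, a `rateLimiter` checks whether records are being added too fast. If so, the caller is blocked until the next second. The maximum rate is controlled by `spark.streaming.receiver.maxRate (default = Long.MaxValue)`, the number of records a single `Receiver` may add per second. Limiting this rate bounds the maximum amount of data the whole Spark Streaming system must process per batch. Earlier versions of Spark Streaming set this upper limit statically and all `Receiver`s followed it; since 1.5.0, Spark Streaming can dynamically control the rate of each `Receiver` individually, which we will cover in a separate article.
265 | 
266 | A timer then fires every `blockInterval`: it creates a new empty growable array to replace the old one as `currentBuffer`, and appends the old array to the generator's own `blocksForPushing` queue.
267 | 
268 | The `blocksForPushing` queue is actually an `ArrayBlockingQueue` whose size is controlled by `spark.streaming.blockQueueSize` (default = 10). A separate dedicated thread takes the packaged blocks out of this queue and calls `ReceiverSupervisorImpl.pushArrayBuffer(...)` to hand each block back to `ReceiverSupervisorImpl`.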
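
The buffer swap and the hand-off queue described above can be sketched as follows. This is a simplified model, not the real `BlockGenerator`: the class name `MiniBlockGenerator` and the `onPushBlock` callback are hypothetical, rate limiting is omitted, and the timer and the pushing thread are replaced by methods that the caller invokes explicitly so that the behavior stays deterministic.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

object BlockGeneratorSketch {
  final class MiniBlockGenerator(queueSize: Int, onPushBlock: ArrayBuffer[Any] => Unit) {
    private var currentBuffer = new ArrayBuffer[Any]
    private val blocksForPushing = new ArrayBlockingQueue[ArrayBuffer[Any]](queueSize)

    // Called once per single record (the rate limiter check would go here).
    def addData(record: Any): Unit = synchronized {
      currentBuffer += record
      ()
    }

    // What the timer does every blockInterval: swap in an empty buffer
    // and queue the old one if it holds any data.
    def updateCurrentBuffer(): Unit = {
      val full = synchronized {
        val old = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        old
      }
      if (full.nonEmpty) blocksForPushing.put(full) // blocks while the queue is full
    }

    // What the pushing thread does in a loop: take one block and hand it over.
    // Returns false when no block was waiting.
    def pushOneBlock(): Boolean = {
      val block = blocksForPushing.poll()
      if (block == null) false
      else { onPushBlock(block); true }
    }
  }
}
```

Because `put` blocks when the bounded queue is full, a slow consumer of blocks eventually slows down the producer side as well.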
269 | 
270 | The whole workflow of `BlockGenerator` is illustrated below:
271 | 
272 | ![image](3.imgs/060.png) *//TODO(lwlin): the style of this figure does not match the rest of this series and needs polishing*
273 | 
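274 | ## Summary
275 | 
276 | To summarize what this article covered: `ReceiverSupervisor` plays the main role on the executor side, and:
277 | 
278 | - (3) After `onStart()`, the `Receiver` **continuously** receives external data and keeps handing it to `ReceiverSupervisor` for storage;
279 | 
280 | - (4) `ReceiverSupervisor` **continuously** receives the data passed on by the `Receiver`:
281 | 
282 | 	- If the records are small, `BlockGenerator` accumulates many records into one block (4a), which is then stored as a block (4b or 4c)
283 | 	- Otherwise no accumulation is needed and the data is stored directly as a block (4b or 4c)
284 |   
285 | 	- Spark Streaming currently supports two ways of storing blocks: `BlockManagerBasedBlockHandler` stores directly to the executor's memory or disk, while `WriteAheadLogBasedBlockHandler` writes to both the WAL (4c) and the executor's memory or disk
286 | 
287 | - (5) Each time a block has been stored on the executor, `ReceiverSupervisor` promptly reports the block's meta information to `ReceiverTracker` on the driver; the meta information includes the block's id, its location, its record count, its size, and so on.
288 | 
289 | - (6) `ReceiverTracker` passes the received block meta information straight to its member `ReceivedBlockTracker`, which is dedicated to managing the meta information of received blocks.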
290 | 
291 | ![image](0.imgs/065.png)
292 | 
293 | <br/>
294 | <br/>
295 | 
296 | (End of article. To join the discussion of this article [click here](https://github.com/proflin/CoolplaySpark/issues/7); to return to the table of contents [click here](readme.md))
297 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.3 ReceiverTraker, ReceivedBlockTracker 详解.md:
--------------------------------------------------------------------------------
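  1 | # ReceiverTracker, ReceivedBlockTracker in detail
  2 | 
  3 | ***[CoolplaySpark] Spark Streaming Source Code Analysis Series***, to return to the table of contents [click here](readme.md)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | Scope of this series:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)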
 13 | ```
 14 | <br/>
 15 | <br/>
 16 | 
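 17 | Before reading this article, be sure to read [Spark Streaming implementation ideas and module overview](0.1 Spark Streaming 实现思路与模块概述.md) first. It outlines the basic roles of the 4 major modules of Spark Streaming; with that overall picture in mind, come back to this article for the details of `Module 3: data generation and ingestion`.
 18 | 
 19 | ## Introduction
 20 | 
 21 | In [Spark Streaming implementation ideas and module overview](0.1 Spark Streaming 实现思路与模块概述.md) we gave the basic workflow of `Module 3: data generation and ingestion`:
 22 | 
 23 | - (1) `ReceiverTracker`, the commander of all `Receiver`s, distributes multiple jobs (each with 1 task) to multiple executors, starting a `ReceiverSupervisor` instance on each;
 24 | 
 25 | - (2) Once started, each `ReceiverSupervisor` immediately instantiates the user-provided `Receiver` implementation, which continuously produces data or continuously receives data from outside the system (for example, `TwitterReceiver` crawls Twitter data in real time), and calls `Receiver.onStart()` once the `Receiver` instance is created.
 26 | 
 27 | ![image](0.imgs/060.png)
 28 | 
 29 | Steps (1)(2) are shown in the figure above; at this point the `Receiver` startup work is complete.
 30 | 
 31 | Next, `ReceiverSupervisor` plays the main role on the executor side, and:
 32 | 
 33 | - (3) After `onStart()`, the `Receiver` **continuously** receives external data and keeps handing it to `ReceiverSupervisor` for storage;
 34 | 
 35 | - (4) `ReceiverSupervisor` **continuously** receives the data passed on by the `Receiver`:
 36 | 
 37 | 	- If the records are small, `BlockGenerator` accumulates many records into one block (4a), which is then stored as a block (4b or 4c)
 38 | 	- Otherwise no accumulation is needed and the data is stored directly as a block (4b or 4c)
 39 |   
 40 | 	- Spark Streaming currently supports two ways of storing blocks: `BlockManagerBasedBlockHandler` stores directly to the executor's memory or disk, while `WriteAheadLogBasedBlockHandler` writes to both the WAL (4c) and the executor's memory or disk
 41 | 
 42 | - (5) Each time a block has been stored on the executor, `ReceiverSupervisor` promptly reports the block's meta information to `ReceiverTracker` on the driver; the meta information includes the block's id, its location, its record count, its size, and so on.
 43 | 
 44 | - (6) `ReceiverTracker` passes the received block meta information straight to its member `ReceivedBlockTracker`, which is dedicated to managing the meta information of received blocks.
 45 | 
 46 | ![image](0.imgs/065.png)
 47 | 
 48 | Steps (3)(4)(5)(6) happen **continuously**, and we have marked them in the figure above as well.
 49 | 
 50 | The content above has already been covered in [Receiver distribution in detail](3.1 Receiver 分发详解.md) and [Receiver, ReceiverSupervisor, BlockGenerator, ReceivedBlockHandler in detail](3.2 Receiver, ReceiverSupervisor, BlockGenerator, ReceivedBlockHandler 详解.md).
 51 | 
 52 | This article explains the driver-side `ReceiverTracker` and `ReceivedBlockTracker` in detail:
 53 | 
 54 |     Fully qualified name of ReceiverTracker:      org.apache.spark.streaming.scheduler.ReceiverTracker
 55 |     Fully qualified name of ReceivedBlockTracker: org.apache.spark.streaming.scheduler.ReceivedBlockTracker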
 56 | 
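 57 | ## ReceiverTracker in detail
 58 | 
 59 | The code of `ReceiverTracker` changed considerably in Spark 1.5.0, but its main functions remain largely the same. Let us go through them one by one:
 60 | 
 61 | - (1) `ReceiverTracker` distributes and monitors `Receiver`s
 62 | 	- `ReceiverTracker` is responsible for distributing `Receiver`s across executors
 63 | 	- including restarting failed `Receiver`s
 64 | - (2) `ReceiverTracker` as an `RpcEndpoint`
 65 | 	- As the manager of `Receiver`s, `ReceiverTracker` is the entry point through which each `Receiver` reports information
 66 | 	- and the exit point through which the driver issues management commands to `Receiver`s
 67 | - (3) `ReceiverTracker` manages the meta information of reported blocks
 68 | 
 69 | Overall, `ReceiverTracker` is the hub for all `Receiver`-related information.
 70 | 
 71 | ### (1) ReceiverTracker distributes and monitors Receivers
 72 | 
 73 | How `ReceiverTracker` distributes and monitors `Receiver`s was already explained in [Receiver distribution in detail](3.1 Receiver 分发详解.md); here is a summary.
 74 | 
 75 | When `ssc.start()` is called, `ReceiverTracker.start()` is invoked implicitly; the most important task of `ReceiverTracker.start()` is to call its own `launchReceivers()` method to distribute `Receiver`s to multiple executors. On each executor, a `ReceiverSupervisor` then starts one `Receiver` to receive data. This process is shown below:
 76 | 
 77 | ![image](0.imgs/060.png)
 78 | 
 79 | In addition, `ReceiverSchedulingPolicy` was introduced in 1.5.0. It computes the destination of each `Receiver` at the Spark Streaming level. Compared with earlier versions, which relied on the generic `TaskScheduler` of Spark Core for distribution, the new `ReceiverSchedulingPolicy` understands the semantics of a Streaming application better and can therefore compute a better distribution.
 80 | 
 81 | Furthermore, by giving each `Receiver` its own `Job` (`1` `Receiver` per `Job`), it becomes possible to distribute a `Receiver` multiple times and to restart it after a failure, keeping it alive indefinitely.
 82 | 
 83 | From the perspective of distributing and monitoring `Receiver`s, the table below compares earlier versions of Spark Streaming (represented by 1.4.0) with newer versions (represented by 1.5.0):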
 84 | 
 85 | <table>
 86 |     <tr>
 87 |         <td align="center"></td>
 88 |         <td align="center"><strong>Spark Streaming 1.4.0</strong></td>
 89 |         <td align="center"><strong>Spark Streaming 1.5.0</strong></td>
 90 |     </tr>
 91 |     <tr>
 92 |         <td align="center"><strong>Receiver liveness</strong></td>
 93 |         <td align="center">not guaranteed to stay alive</td>
 94 |         <td align="center">unlimited retries, guaranteed to stay alive</td>
 95 |     </tr>
 96 |     <tr>
 97 |         <td align="center"><strong>Balanced Receiver distribution</strong></td>
 98 |         <td align="center">no guarantee</td>
 99 |         <td align="center">round-robin policy</td>
100 |     </tr>
101 |     <tr>
102 |         <td align="center"><strong>Custom Receiver distribution</strong></td>
103 |         <td align="center">very tricky</td>
104 |         <td align="center">convenient</td>
105 |     </tr>
106 | </table>
107 | 
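108 | ### (2) ReceiverTracker as an RpcEndpoint
109 | 
110 | An `RpcEndpoint` can be thought of as the server side of an RPC, to be called by clients.
111 | 
112 | The address of `ReceiverTracker` as an `RpcEndpoint` (that is, the driver's address) is public, so `Receiver`s can connect to it. Once a `Receiver` connects successfully, `ReceiverTracker` also holds that `Receiver`'s `RpcEndpoint`. From then on, the two sides can communicate in both directions by sending messages.
113 | 
114 | Since 1.5.0, `ReceiverTracker` supports 10 kinds of messages, summarized below: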
115 | 
116 | <table>
117 |     <tr>
118 |     	<td></td>
119 |         <td>Message</td>
120 |         <td>Explanation</td>
121 |     </tr>
122 |     <tr>
123 |         <td rowspan="5">ReceiverTracker<br/>receives only, no reply</td>
124 |         <td>StartAllReceivers message</td>
125 |         <td>Sent by ReceiverTracker to itself right after it starts; triggers the schedulingPolicy computation and the subsequent distribution</td>
126 |     </tr>
127 |     <tr>
128 |         <td>RestartReceiver message</td>
129 |         <td>Sent to itself when the initially assigned executor is wrong, a Receiver fails, or similar; triggers redistribution of the Receiver</td>
130 |     </tr>
131 |     <tr>
132 |         <td>CleanupOldBlocks message</td>
133 |         <td>Sent to itself when block data has been fully computed and is no longer needed; the CleanupOldBlocks message is then forwarded to all Receivers</td>
134 |     </tr>
135 |     <tr>
136 |         <td>UpdateReceiverRateLimit message</td>
137 |         <td>After ReceiverTracker dynamically computes a new rate limit for a Receiver, an UpdateRateLimit message is sent to that Receiver</td>
138 |     </tr>
139 |     <tr>
140 |         <td>ReportError message</td>
141 |         <td>Reported by a Receiver; triggers reportError(), which propagates the error to the listenerBus</td>
142 |     </tr>
143 |     <tr>
144 |         <td rowspan="5">ReceiverTracker<br/>receives and replies</td>
145 |         <td>RegisterReceiver message</td>
146 |         <td>Sent by a Receiver while it is trying to start; the reply either allows or disallows the start</td>
147 |     </tr>
148 |     <tr>
149 |         <td>AddBlock message</td>
150 |         <td>The actual block meta report, sent by a Receiver; the reply is success or failure</td>
151 |     </tr>
152 |     <tr>
153 |         <td>DeregisterReceiver message</td>
154 |         <td>Sent by a Receiver; after handling it, true is always returned</td>
155 |     </tr>
156 |     <tr>
157 |         <td>AllReceiverIds message</td>
158 |         <td>Used during ReceiverTracker stop() to query whether there are still active Receivers</td>
159 |     </tr>
160 |     <tr>
161 |         <td>StopAllReceivers message</td>
162 |         <td>Sent at the very beginning of ReceiverTracker stop() to ask all Receivers to stop; a stop message is sent to every Receiver</td>
163 |     </tr>
164 | </table>
165 | 
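166 | ### (3) ReceiverTracker manages block meta information
167 | 
168 | On one hand, `Receiver`s report meta information to `ReceiverTracker` through `AddBlock` messages; on the other hand, at the start of each batch `JobGenerator` asks `ReceiverTracker` to assign the reported blocks to that batch. Between these two, `ReceiverTracker` takes care of managing block meta information.
169 | 
170 | Concretely, `ReceiverTracker` has a member `ReceivedBlockTracker` that is dedicated to managing the meta information of reported blocks.
171 | 
172 | ## ReceivedBlockTracker in detail
173 | 
174 | As just mentioned, `ReceivedBlockTracker` is dedicated to managing the meta information of reported blocks. However, it does not interact with the outside world itself: everything is forwarded through `ReceiverTracker`. In this sense `ReceiverTracker` acts as a facade for `ReceivedBlockTracker` (see the [facade pattern](http://www.cnblogs.com/zhenyulu/articles/55992.html)).
175 | 
176 | Inside `ReceivedBlockTracker` there are several important members, related as follows:
177 | 
178 | ![image](3.imgs/070.png) *//TODO(lwlin): the style of this figure does not match the rest of this series and needs polishing*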
179 | 
180 | 
181 | - `streamIdToUnallocatedBlockQueues`
182 | 	- holds the meta of `Block`s that have been reported but not yet allocated to a batch
183 | 	- keeps a separate queue per `Receiver`, so it is a `HashMap: receiverId → mutable.Queue[ReceivedBlockInfo]`
184 | - `timeToAllocatedBlocks`
185 | 	- holds the meta of `Block`s that have been reported and already allocated to a batch
186 | 	- indexed first by batch and then by `receiverId`, so it is a `HashMap: time → HashMap`
187 | - `lastAllocatedBatchTime`
188 | 	- records which batch was most recently allocated
189 | 
190 | These are the main data structures for recording state. The state is accessed mainly through 4 methods:
191 | 
192 | - `addBlock(receivedBlockInfo: ReceivedBlockInfo)`
193 | 	- on receiving the block meta reported by a `Receiver`, adds it to `streamIdToUnallocatedBlockQueues`
194 | - `allocateBlocksToBatch(batchTime: Time)`
195 | 	- called by `JobGenerator` as the very first step when it kicks off the computation of a new batch
196 | 	- moves the content of `streamIdToUnallocatedBlockQueues` into `timeToAllocatedBlocks`, keyed by the given `batchTime`
197 | 	- and updates `lastAllocatedBatchTime`
198 | - `getBlocksOfBatch(batchTime: Time)`
199 | 	- called when `JobGenerator` kicks off a new batch and `DStreamGraph` generates the RDD DAG instance
200 | 	- looks up `timeToAllocatedBlocks` to get the meta of the blocks assigned to this batch, from which the RDDs that process those blocks are generated
201 | - `cleanupOldBatches(cleanupThreshTime: Time, ...)`
202 | 	- called once a batch has finished computing and the tracked block meta can be cleaned up
203 | 	- removes from `timeToAllocatedBlocks` the block meta of all batches before `cleanupThreshTime`
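
The three pieces of state and the four methods can be modeled in a few lines. The sketch below is a simplified illustration, not the real `ReceivedBlockTracker`: `MiniBlockTracker` and `BlockInfo` are hypothetical names, batch times are plain `Long` values instead of `Time`, and the WAL write that the real class performs before changing state is left out.

```scala
import scala.collection.mutable

object BlockTrackerSketch {
  // Hypothetical, simplified stand-in for ReceivedBlockInfo.
  final case class BlockInfo(streamId: Int, blockId: String)

  final class MiniBlockTracker {
    private val streamIdToUnallocatedBlockQueues =
      mutable.HashMap.empty[Int, mutable.Queue[BlockInfo]]
    private val timeToAllocatedBlocks =
      mutable.HashMap.empty[Long, Map[Int, Seq[BlockInfo]]]
    private var lastAllocatedBatchTime: Option[Long] = None

    // A receiver reported a block: queue it as "not yet allocated".
    def addBlock(info: BlockInfo): Unit = synchronized {
      val queue = streamIdToUnallocatedBlockQueues
        .getOrElseUpdate(info.streamId, mutable.Queue.empty[BlockInfo])
      queue += info
      ()
    }

    // Move everything queued so far into the given batch.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      if (lastAllocatedBatchTime.forall(_ < batchTime)) {
        val allocated: Map[Int, Seq[BlockInfo]] =
          streamIdToUnallocatedBlockQueues.map { case (streamId, queue) =>
            streamId -> queue.dequeueAll(_ => true).toList
          }.toMap
        timeToAllocatedBlocks(batchTime) = allocated
        lastAllocatedBatchTime = Some(batchTime)
      }
    }

    // Look up the blocks that were assigned to a batch.
    def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[BlockInfo]] = synchronized {
      timeToAllocatedBlocks.getOrElse(batchTime, Map.empty[Int, Seq[BlockInfo]])
    }

    // Drop the meta of all batches strictly older than the threshold.
    def cleanupOldBatches(cleanupThreshTime: Long): Unit = synchronized {
      val old = timeToAllocatedBlocks.keys.filter(_ < cleanupThreshTime).toList
      old.foreach(t => timeToAllocatedBlocks.remove(t))
    }
  }
}
```

Because allocation drains the queues, each reported block ends up in exactly one batch.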
204 | 
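205 | The relationship between these 4 methods and the state they modify is summarized in the figure below:
206 | 
207 | ![image](3.imgs/075.png) *//TODO(lwlin): the style of this figure does not match the rest of this series and needs polishing*
208 | 
209 | That is the main body of `ReceivedBlockTracker`.
210 | 
211 | One more very important point: `ReceivedBlockTracker` must provide fault tolerance for the driver. That is, if the driver fails, the newly started driver must be able to recover from the WAL the state of `ReceivedBlockTracker` before the failure, including `streamIdToUnallocatedBlockQueues`, `timeToAllocatedBlocks`, `lastAllocatedBatchTime`, and so on. This requires that whenever one of the methods above modifies the state of `ReceivedBlockTracker`, it first writes to the WAL, so that the information can be recovered from the WAL after a failure.
212 | 
213 | WAL writing and failure recovery will be explained systematically in `Module 4: long-running fault tolerance`.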
214 | 
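215 | ## Summary
216 | 
217 | This article explained the main functions of `ReceiverTracker`, the driver-side manager of `Receiver`s:
218 | 
219 | - (1) `ReceiverTracker` distributes and monitors `Receiver`s
220 | 	- `ReceiverTracker` is responsible for distributing `Receiver`s across executors
221 | 	- including restarting failed `Receiver`s
222 | - (2) `ReceiverTracker` as an `RpcEndpoint`
223 | 	- As the manager of `Receiver`s, `ReceiverTracker` is the entry point through which each `Receiver` reports information
224 | 	- and the exit point through which the driver issues management commands to `Receiver`s
225 | - (3) `ReceiverTracker` manages the meta information of reported blocks
226 | 	- delegating the actual management to its member `ReceivedBlockTracker`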
227 | 
228 | <br/>
229 | <br/>
230 | 
231 | (End of article. To join the discussion of this article [click here](https://github.com/proflin/CoolplaySpark/issues/8); to return to the table of contents [click here](readme.md))
232 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/010.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/010.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/020.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/030.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/030.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/040.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/040.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/050.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/050.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/060.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/060.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/070.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/070.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/3.imgs/075.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/3.imgs/075.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/4.1 Executor 端长时容错详解.md:
--------------------------------------------------------------------------------
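  1 | # Executor-side long-running fault tolerance in detail #
  2 | 
  3 | ***[CoolplaySpark] Spark Streaming Source Code Analysis Series***, to return to the table of contents [click here](readme.md)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | Scope of this series:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)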
 13 | ```
 14 | <br/>
 15 | <br/>
 16 | 
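 17 | Before reading this article, be sure to read [Spark Streaming implementation ideas and module overview](0.1 Spark Streaming 实现思路与模块概述.md) first. It outlines the basic roles of the 4 major modules of Spark Streaming; with that overall picture in mind, come back to this article for the details of `Module 4: long-running fault tolerance`.
 18 | 
 19 | ## Introduction
 20 | 
 21 | The previous articles explained the 3 modules that implement what Spark Streaming adds on top of Spark Core. Next we look at how the 4th module guarantees that Spark Streaming can run for a long time, that is, how it works together with the first 3 modules to keep them running over the long term.
 22 | 
 23 | From the analysis of the key classes of the first 3 modules, we know that protecting modules 1 and 2 must be done on the driver side, while protecting module 3 must be done on both the executor side and the driver side.
 24 | 
 25 | ![image](0.imgs/040.png)
 26 | 
 27 | This article covers the executor-side guarantees.
 28 | 
 29 | On the executor side, a failed `ReceiverSupervisor` or `Receiver` can simply be restarted; the key is to keep the received block data safe. Once the source block data is protected, the RDD DAG (the lineage of Spark Core) can be recomputed.
 30 | 
 31 | Spark Streaming protects source block data at 4 levels, which are comprehensive, complement each other, and can be configured flexibly for different scenarios:
 32 | - (1) hot standby
 33 | - (2) cold standby
 34 | - (3) replay
 35 | - (4) ignore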
 36 | 
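 37 | ## (1) Hot standby
 38 | 
 39 | Hot standby means that when a block is stored, it is stored on the local executor and at the same time replicated to another executor. When one replica is lost, computation can switch to the other replica immediately and transparently.
 40 | 
 41 | To enable it, specify a `StorageLevel` of `MEMORY_ONLY_2` or `MEMORY_AND_DISK_2` when implementing your own `Receiver`.
 42 | 
 43 | For example: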
 44 | 
 45 | ```scala
 46 | class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY_2) {
 47 |   override def onStart(): Unit = {}
 48 |   override def onStop(): Unit = {}
 49 | }
 50 | ```
 51 | 
 52 | This way, when the `Receiver` `store()`s data to `ReceiverSupervisorImpl`, the `storageLevel` is passed along, and `ReceiverSupervisorImpl` stores the block into `BlockManager` according to this `storageLevel`.
 53 | 
 54 | Hot standby is then provided by `BlockManager`. Concretely, taking the case where `ReceiverSupervisorImpl` stores a `byteBuffer` into `BlockManager` as an example: when `BlockManager` receives `putBytes(byteBuffer)`, it actually calls `doPut(byteBuffer)` directly. So let us look at the `doPut(...)` method (tip: focus on the comments in the code):
 55 | 
 56 | ```scala
 57 | private def doPut(blockId: BlockId, data: BlockValues, level: StorageLevel, ...)
 58 |   : Seq[(BlockId, BlockStatus)] = {
 59 |   ...
 60 |   // If putLevel.replication > 1, define this future to replicate the data to another executor
 61 |   val replicationFuture = data match {
 62 |     case b: ByteBufferValues if putLevel.replication > 1 =>
 63 |       val bufferView = b.buffer.duplicate()
 64 |       Future {
 65 |         // Important: once the future starts, it actually calls replicate() to copy the data to another executor
 66 |         replicate(blockId, bufferView, putLevel)
 67 |       }(futureExecutionContext)
 68 |     case _ => null
 69 |   }
 70 | 
 71 |   putBlockInfo.synchronized {
 72 |     ...
 73 |     // Store into the blockStore of the local blockManager
 74 |     val result = data match {
 75 |       case IteratorValues(iterator) =>
 76 |         blockStore.putIterator(blockId, iterator, putLevel, returnValues)
 77 |       case ArrayValues(array) =>
 78 |         blockStore.putArray(blockId, array, putLevel, returnValues)
 79 |       case ByteBufferValues(bytes) =>
 80 |         bytes.rewind()
 81 |         blockStore.putBytes(blockId, bytes, putLevel)
 82 |     }
 83 |   }
 84 |       
 85 |   // Check putLevel.replication > 1 again
 86 |   if (putLevel.replication > 1) {
 87 |     data match {
 88 |       case ByteBufferValues(bytes) =>
 89 |         // If a replication future was started earlier, wait synchronously for it to finish
 90 |         if (replicationFuture != null) {
 91 |           Await.ready(replicationFuture, Duration.Inf)
 92 |         }
 93 |       case _ =>
 94 |         val remoteStartTime = System.currentTimeMillis
 95 |         if (bytesAfterPut == null) {
 96 |           if (valuesAfterPut == null) {
 97 |             throw new SparkException(
 98 |               "Underlying put returned neither an Iterator nor bytes! This shouldn't happen.")
 99 |           }
100 |           bytesAfterPut = dataSerialize(blockId, valuesAfterPut)
101 |         }
102 |         // Otherwise no replication future was started, so call replicate() synchronously to copy the data to another executor
103 |         replicate(blockId, bytesAfterPut, putLevel)
104 |         logDebug("Put block %s remotely took %s"
105 |           .format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
106 |     }
107 |   }
108 | 
109 |   ...
110 | }
111 | ```
112 | 
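113 | So the semantics of `BlockManager.putBytes()` are a promise: if replication is requested, then by the time `putBytes()` returns, the data has been stored locally and has also been replicated to another executor. `BlockManager.putIterator()` has the same semantics, because `putIterator()` and `putBytes()` are both implemented on top of `BlockManager.doPut()`.
114 | 
115 | To summarize this section: data received by a `Receiver` is handed, via `ReceiverSupervisorImpl`, to `BlockManager` for storage; `BlockManager` itself supports `replicate()`-ing data to another executor, which completes the hot standby of the `Receiver`'s source data.
116 | 
117 | At computation time, a task first fetches the blocks it needs. If an executor failure has caused one copy to be lost, the task fetches the same block from the other executor instead. Because the other copy is readily available and does not need to be re-read as in cold standby, there is no recovery time.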
118 | 
119 | ## (2) Cold standby
120 | 
121 | !!! Needs to be updated at the same time
122 | 
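123 | Cold standby means that each time a block is stored, besides being stored on the local executor, the block is also written out as a log record to the WriteAheadLog. When the executor fails, another executor reads the WAL and replays the log to restore the block. The WAL is usually written to reliable storage such as HDFS, so recovery may take some recovery time.
124 | 
125 | The write path of cold standby is shown as step 4(c) in the figure below:
126 | 
127 | ![image](0.imgs/065.png)
128 | 
129 | Here we take a detour to explain the `WriteAheadLog` framework.
130 | 
131 | ### The WriteAheadLog framework
132 | 
133 | Write-ahead logging is widely used in both single-node RDBMSs and NoSQL/NewSQL systems: the former use it, for example, to record transaction logs; among the latter, HBase, for example, can write inserted data to the HLog first.
134 | 
135 | A `WriteAheadLog` is written sequentially, so backing up data is efficient; but restoring data generally requires reading sequentially as well, which takes some recovery time.
136 | 
137 | For the cold standby of blocks in Spark Streaming, however, recovery is also cheap. Each block is operated on only once (when it is added), with no later appends, modifications, or deletions, so the WAL contains exactly one log entry for the block. To recover, we only need to seek to that log entry and read it, without reading the whole WAL sequentially.
138 | 
139 | In other words, the recovery time needed by Spark Streaming's WAL-based cold standby is just the time to seek to and read one log entry, not the time to read the whole WAL, which is a very large saving.
140 | 
141 | The WAL framework in Spark Streaming consists of a set of abstract classes plus a set of file-based concrete implementations. The class structure is:
142 | 
143 | ![image](img.png)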
144 | 
145 | ### WriteAheadLog, WriteAheadLogRecordHandle
146 | 
147 | A `WriteAheadLog` is a collection of log records, and the reference to each individual record is a `WriteAheadLogRecordHandle`. The two abstract interfaces are defined as follows:
148 | 
149 | ```java
150 | //  From WriteAheadLog
151 | 
152 | @org.apache.spark.annotation.DeveloperApi
153 | public abstract class WriteAheadLog {
154 |   // Write: appends one log record and returns a handle that refers to it
155 |   abstract public WriteAheadLogRecordHandle write(ByteBuffer record, long time);
156 | 
157 |   // Read: given the handle of a log record, reads that record
158 |   abstract public ByteBuffer read(WriteAheadLogRecordHandle handle);
159 | 
160 |   // Read: reads all log records
161 |   abstract public Iterator<ByteBuffer> readAll();
162 | 
163 |   // Cleans up log records older than the threshold time
164 |   abstract public void clean(long threshTime, boolean waitForCompletion);
165 | 
166 |   // Closes the log
167 |   abstract public void close();
168 | }
169 | 
170 | //  From WriteAheadLogRecordHandle
171 | 
172 | @org.apache.spark.annotation.DeveloperApi
173 | public abstract class WriteAheadLogRecordHandle implements java.io.Serializable {
174 |   // The handle is an empty abstraction; concrete subclasses define the actual content
175 | }
176 | ```
177 | 
178 | Here the file-based implementation of `WriteAheadLog` is `FileBasedWriteAheadLog`, and that of `WriteAheadLogRecordHandle` is `FileBasedWriteAheadLogSegment`. Let us look at these two classes in detail.
179 | 
180 | ### FileBasedWriteAheadLog
181 | 
182 | `FileBasedWriteAheadLog` has 3 important configuration items or members:
183 | 
184 | - the rolling configuration
185 |   - `FileBasedWriteAheadLog` writes log records to a file (usually on reliable storage such as HDFS), and every so often closes the current file and starts a new one to continue writing; this is the rolling style of writing
186 |   - the benefits of rolling are that no single file grows too large and that deleting old, unneeded data is very easy
187 |   - the rolling interval is controlled by `spark.streaming.receiver.writeAheadLog.rollingIntervalSecs` (default = 60 seconds)
188 | 
189 | - the WAL directory: `{checkpointDir}/receivedData/{receiverId}`
190 |   - `{checkpointDir}` is the one specified in `ssc.checkpoint(checkpointDir)`
191 |   - `{receiverId}` is the id of the `Receiver`
192 |   - within this WAL directory, the rolling log files are named `log-{startTime}-{stopTime}`
193 | 
194 | - `FileBasedWriteAheadLog.currentLogWriter`
195 |   - one `LogWriter` corresponds to one log file; since log files roll, once the previous file is finished its writer can be `close()`d, and a new writer takes over the new file
196 |   - `currentLogWriter` points to the newest `LogWriter`
197 | 
198 | Next come the read and write methods of `FileBasedWriteAheadLog`:
199 | 
200 | - `write(byteBuffer: ByteBuffer, time: Long)`
201 |   - first, and most importantly, calls `getCurrentWriter()` to obtain the current writer
202 |   - note that if the log file needs to roll over to a new one, `currentWriter` must be updated accordingly; `getCurrentWriter()` performs this on-demand update of `currentWriter`
203 |   - then it simply calls `writer.write(byteBuffer)`
204 | - `read(segment: WriteAheadLogRecordHandle): ByteBuffer`
205 |   - directly calls `reader.read(fileSegment)`
206 |   - in the reader, since the given `segment` (a `WriteAheadLogRecordHandle`) contains the log file and the offset, it can seek directly to the record, read the data, and return it
207 | 
208 | To summarize, `FileBasedWriteAheadLog` mainly manages the rolling files, while the actual writing and reading are delegated to a concrete `LogWriter` and `LogReader`.
209 | 
210 | ### FileBasedWriteAheadLogSegment
211 | 
212 | As mentioned above, `WriteAheadLogRecordHandle` is an empty abstraction of a log handle; subclasses must define what the handle actually contains.
213 | 
214 | The file-based subclass `FileBasedWriteAheadLogSegment` records 3 things:
215 | 
216 | ```scala
217 | // From FileBasedWriteAheadLogSegment
218 | 
219 | private[streaming] case class FileBasedWriteAheadLogSegment(path: String, offset: Long, length: Int)
220 |   extends WriteAheadLogRecordHandle
221 | ```
222 | 
223 | - `path: String`
224 | - `offset: Long`
225 | - `length: Int`
226 | 
227 | These 3 fields are straightforward: given a file, an offset, and a length, a log record is uniquely identified.
228 | 
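229 | ### FileBasedWriteAheadLogWriter
230 | 
231 | `FileBasedWriteAheadLogWriter` is given a file and a block of data, and writes the data into the file.
232 | 
233 | When the write completes, it records the file path, offset, and length, and returns them wrapped in a `FileBasedWriteAheadLogSegment`.
234 | 
235 | One thing to note: when flushing the data to HDFS, it has to decide which method to use, preferring `hflush()` and falling back to `sync()` if `hflush()` is not available:
236 | 
237 | ```scala
238 | // From FileBasedWriteAheadLogWriter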
239 | 
240 | private lazy val hadoopFlushMethod = {
241 |   // Use reflection to get the right flush operation
242 |   val cls = classOf[FSDataOutputStream]
243 |   Try(cls.getMethod("hflush")).orElse(Try(cls.getMethod("sync"))).toOption
244 | }
245 | ```
246 | 
247 | ### FileBasedWriteAheadLogRandomReader
248 | 
249 | The main method of `FileBasedWriteAheadLogRandomReader` is `read(segment: FileBasedWriteAheadLogSegment): ByteBuffer`: given a log handle, it returns the corresponding log record.
250 | 
251 | The main code is shown below; note that the key step is `seek(segment.offset)`:
252 | 
253 | ```scala
254 | // From FileBasedWriteAheadLogRandomReader
255 | 
256 | def read(segment: FileBasedWriteAheadLogSegment): ByteBuffer = synchronized {
257 |   assertOpen()
258 |   // Seek to the offset where this log record starts
259 |   instream.seek(segment.offset)
260 |   // Read the length
261 |   val nextLength = instream.readInt()
262 |   HdfsUtils.checkState(nextLength == segment.length,
263 |     s"Expected message length to be ${segment.length}, but was $nextLength")
264 |   val buffer = new Array[Byte](nextLength)
265 |   // Read the actual content
266 |   instream.readFully(buffer)
267 |   // Return the content as a ByteBuffer
268 |   ByteBuffer.wrap(buffer)
269 | }
270 | ```
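
The "length prefix plus seek" idea used by the writer and the random reader can be tried on a local file. The sketch below is an assumption-laden illustration, not Spark code: `WalSegmentSketch` and `Segment` are hypothetical names, it uses `java.io.RandomAccessFile` on the local file system instead of HDFS streams, and it reopens the file on every call for simplicity.

```scala
import java.io.{File, RandomAccessFile}
import java.nio.ByteBuffer

object WalSegmentSketch {
  // Simplified stand-in for a file-based record handle: file, offset, length.
  final case class Segment(path: String, offset: Long, length: Int)

  // Append one record as [4-byte length][payload] and return a handle to it.
  def write(file: File, record: Array[Byte]): Segment = {
    val raf = new RandomAccessFile(file, "rw")
    try {
      val offset = raf.length() // append at the current end of the file
      raf.seek(offset)
      raf.writeInt(record.length)
      raf.write(record)
      Segment(file.getPath, offset, record.length)
    } finally {
      raf.close()
    }
  }

  // Read exactly one record by seeking to its offset; nothing else is scanned.
  def read(segment: Segment): ByteBuffer = {
    val raf = new RandomAccessFile(segment.path, "r")
    try {
      raf.seek(segment.offset)
      val length = raf.readInt()
      require(length == segment.length,
        s"Expected record length ${segment.length}, but was $length")
      val buffer = new Array[Byte](length)
      raf.readFully(buffer)
      ByteBuffer.wrap(buffer)
    } finally {
      raf.close()
    }
  }
}
```

Reading a record costs one seek and one read regardless of how many records precede it in the file, which is the property the recovery-time argument above relies on.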
271 | 
272 | ### FileBasedWriteAheadLogReader
273 | 
274 | `FileBasedWriteAheadLogReader` is implemented much like `FileBasedWriteAheadLogRandomReader`, except that it does not take a log handle and instead iterates over all log records:
275 | 
276 | ```scala
277 | // From FileBasedWriteAheadLogReader
278 | 
279 | // Iterator method: hasNext()
280 | override def hasNext: Boolean = synchronized {
281 |   if (closed) {
282 |     // Once closed, there is definitely no next item
283 |     return false
284 |   }
285 | 
286 |   if (nextItem.isDefined) {
287 |     true
288 |   } else {
289 |     try {
290 |       // Try to read the next record; if one exists, hasNext is indeed true
291 |       val length = instream.readInt()
292 |       val buffer = new Array[Byte](length)
293 |       instream.readFully(buffer)
294 |       nextItem = Some(ByteBuffer.wrap(buffer))
295 |       logTrace("Read next item " + nextItem.get)
296 |       true
297 |     } catch {
298 |      ...
299 |     }
300 |   }
301 | }
302 | 
303 | // Iterator method: next()
304 | override def next(): ByteBuffer = synchronized {
305 |   // Simply return the data that hasNext() actually read
306 |   val data = nextItem.getOrElse {
307 |     close()
308 |     throw new IllegalStateException(
309 |       "next called without calling hasNext or after hasNext returned false")
310 |   }
311 |   nextItem = None // Ensure the next hasNext call loads new data.
312 |   data
313 | }
314 | ```
315 | 
316 | ### WAL summary ###
317 | 
318 | From the sections above we see that Spark Streaming has a WAL implementation based on rolling files, which provides one write method and two read methods:
319 | 
320 | - `WriteAheadLogRecordHandle write(ByteBuffer record, long time)`
321 |   - implemented by `FileBasedWriteAheadLogWriter`
322 | - `ByteBuffer read(WriteAheadLogRecordHandle handle)`
323 |   - implemented by `FileBasedWriteAheadLogRandomReader`
324 | - `Iterator<ByteBuffer> readAll()`
325 |   - implemented by `FileBasedWriteAheadLogReader`
326 | 
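327 | ## (3) Replay
328 | 
329 | If the upstream supports replay, as Apache Kafka does, we can choose not to store the data separately via hot or cold standby, and instead replay the data on another executor after a failure.
330 | 
331 | Concretely, [Spark Streaming can read from Kafka in two ways](http://spark.apache.org/docs/latest/streaming-kafka-integration.html):
332 | 
333 | - `Receiver`-based
334 |   - offset management of the Kafka consumer is left to Kafka: offsets are stored in ZooKeeper, and after a failure Kafka replays based on the offset
335 |   - a possible problem is that Kafka replays the data of the same offset to two batch instances, so only at-least-once semantics can be guaranteed
336 | - Direct, not `Receiver`-based
337 |   - Spark Streaming manages offsets directly: given an offset range, it reads the data straight from Kafka's disks, using Spark Streaming's own balancing in place of Kafka's
338 |   - this guarantees that each offset range belongs to one and only one batch, which ensures exactly-once
339 | 
340 | Taking the Direct approach as an example, let us look at how Spark Streaming replays data from upstream after the source data is lost.
341 | 
342 | The implementation has two layers:
343 | 
344 | - `DirectKafkaInputDStream`: detects the latest offsets and assigns each offset to exactly one batch
345 |   - each time a batch is generated, it uses the `latestLeaderOffsets()` method to detect the latest offsets
346 |   - combining these with the offsets detected for the previous batch gives an offset range `offsetRange`
347 |   - the data within this offset range is assigned exclusively to the current batch
348 | - `KafkaRDD`: reads the data in the given offset range and computes on it
349 |   - it creates a Kafka `SimpleConsumer`, Kafka's lowest-level consumer class, which reads data directly from the files on Kafka's disks
350 |   - if a `Task` fails and is rescheduled, the offset range stays the same, and a new Kafka `SimpleConsumer` is simply created to read the data again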
351 | 
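The offset bookkeeping described above can be sketched in plain Scala. This is a simplified model under stated assumptions, not the connector's code: `PartitionKey`, `OffsetRange` and `OffsetAllocation.nextRanges` are hypothetical names, ranges are half-open (`fromOffset` inclusive, `untilOffset` exclusive), and a partition seen for the first time is assumed to start at offset 0.

```scala
// Hypothetical, simplified model of per-partition offset ranges.
final case class PartitionKey(topic: String, partition: Int)

final case class OffsetRange(key: PartitionKey, fromOffset: Long, untilOffset: Long) {
  // Number of messages in the half-open range [fromOffset, untilOffset).
  def count: Long = untilOffset - fromOffset
}

object OffsetAllocation {
  // For each partition, the new batch gets [previous until, latest).
  // Assumption: a partition with no previous offset starts from 0.
  def nextRanges(
      previousUntil: Map[PartitionKey, Long],
      latest: Map[PartitionKey, Long]): Seq[OffsetRange] =
    latest.toSeq
      .sortBy { case (key, _) => (key.topic, key.partition) }
      .map { case (key, until) =>
        OffsetRange(key, previousUntil.getOrElse(key, 0L), until)
      }
}
```

Because each range starts exactly where the previous batch's range ended, no offset can fall into two batches; a retried task simply re-reads the same fixed range.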
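352 | So with the Direct approach, the Spark Streaming framework is ultimately responsible for offset detection, batch assignment, and the actual reads; and the batch assignment information is checkpointed to reliable storage (usually HDFS). Kafka's use of ZooKeeper to balance consumers and record offsets is not used at all; Kafka is treated as a low-level file system.
353 | 
354 | Of course, upstream replay is not limited to Kafka; any upstream that supports message replay will do. HDFS, for example, can also be seen as a reliable, replayable upstream: `FileInputDStream` relies on replay to keep source data readable after an executor fails.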
355 | 
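356 | ## (4) Ignore
357 | 
358 | Finally, if the application values timeliness over accuracy, we can choose to ignore a lost piece of data and not recover the failed source data.
359 | 
360 | Suppose we have three `Receiver`s r1, r2 and r3, each producing one block every 5 seconds, with one batch every 15 seconds. Each batch then has `15 s ÷ 5 s/block × 3 receivers = 9 blocks`. Now suppose r1 fails and its 3 blocks are lost with it.
361 | 
362 | How does the application ignore them? There are two granularities.
363 | 
364 | ### Coarse-grained ignore
365 | 
366 | In the coarse-grained approach, a computation that tries to read the lost source data fails, which fails some tasks, which in turn fails the whole batch's job; the failure finally surfaces on the driver as a `SparkException`. Catching that `SparkException` masks the failure of this batch's job.
367 | 
368 | This is very simple to implement, but it discards the result of the whole batch: although 6 blocks are still intact, the data of all 9 is ignored.
369 | 
370 | ### Fine-grained ignore
371 | 
372 | The fine-grained approach confines the ignored part to the 3 lost blocks and keeps the other 6. Stock Spark Streaming cannot fully do this yet, but a small modification to Spark Streaming makes it possible.
373 | 
374 | The basic idea is that when a task finds its source block lost, it does not fail immediately; it substitutes an empty collection as the "corrected" source data and carries on, so the task succeeds.
375 | 
376 | As a result, only the 3 lost blocks go through the "ignore" path, the 6 good blocks are computed normally, and the job as a whole succeeds.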
377 | 
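The fine-grained idea can be illustrated without Spark. The sketch below is only a model of the behaviour, not the authors' patch: `BlockStore`, `readBlock`, `readOrEmpty` and `sumBatch` are hypothetical names, a plain `Map` stands in for the block manager, and task retries are not modeled.

```scala
import scala.util.{Failure, Success, Try}

object IgnoreMissingBlocks {
  // Hypothetical block store: block id -> records. A missing key models a lost block.
  type BlockStore = Map[String, Seq[Int]]

  // Reads a block and throws if it is lost, as a failed block fetch would.
  def readBlock(store: BlockStore, blockId: String): Seq[Int] =
    store.getOrElse(blockId, throw new NoSuchElementException(s"block $blockId lost"))

  // Fine-grained ignore: a lost block contributes an empty collection
  // instead of failing the whole computation.
  def readOrEmpty(store: BlockStore, blockId: String): Seq[Int] =
    Try(readBlock(store, blockId)) match {
      case Success(records)                   => records
      case Failure(_: NoSuchElementException) => Seq.empty
      case Failure(other)                     => throw other
    }

  // Sums every record of a batch, skipping only the lost blocks.
  def sumBatch(store: BlockStore, blockIds: Seq[String]): Int =
    blockIds.flatMap(readOrEmpty(store, _)).sum
}
```

With blocks `b1` and `b3` present and `b2` lost, `sumBatch` still returns the sum over `b1` and `b3`, whereas calling `readBlock` directly on `b2` would throw.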
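378 | The change to Spark Streaming still has to handle some details. For example, it should take effect only in Spark Streaming and not affect Spark Core or Spark SQL; and since tasks are normally retried on failure, we want the first few attempts to retry normally and to ignore the block only when the last retry still fails.
379 | 
380 | We have put the modified code and usage instructions here; feel free to use them.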
381 | 
382 | ## Summary
383 | 
384 | The four sections above covered how Spark Streaming keeps source data highly available; the table below summarizes them:
385 | 
386 | <table>
387 | <tr>
388 | 	<td></td>
389 | 	<td>Diagram</td>
390 | 	<td>Pros</td>
391 | 	<td>Cons</td>
392 | </tr>
393 | <tr>
394 | 	<td>(1) Hot standby</td>
395 | 	<td><img src="0.imgs/075a.png"></img></td>
396 | 	<td>No recovery time</td>
397 | 	<td>Requires double the resources</td>
398 | </tr>
399 | <tr>
400 | 	<td>(2) Cold standby</td>
401 | 	<td><img src="0.imgs/075b.png"></img></td>
402 | 	<td>Very reliable</td>
403 | 	<td>Has recovery time</td>
404 | </tr>
405 | <tr>
406 | 	<td>(3) Replay</td>
407 | 	<td><img src="0.imgs/075c.png"></img></td>
408 | 	<td>No extra resources</td>
409 | 	<td>Has recovery time</td>
410 | </tr>
411 | <tr>
412 | 	<td>(4) Ignore</td>
413 | 	<td><img src="0.imgs/075d.png"></img></td>
414 | 	<td>No recovery time</td>
415 | 	<td>Accuracy is reduced</td>
416 | </tr>
417 | </table>
418 | 
419 | 
420 | <br/>
421 | <br/>
422 | 
423 | (End of article. To join the discussion, [click here](https://github.com/proflin/CoolplaySpark/issues/11); to return to the table of contents, [click here](readme.md).)
424 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/4.2 Driver 端长时容错详解.md:
--------------------------------------------------------------------------------
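  1 | ## Driver-side long-running fault tolerance in detail
  2 | 
  3 | ***[酷玩 Spark] Spark Streaming 源码解析系列***; to return to the table of contents, [click here](readme.md)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team)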
  6 | 
  7 | ```
  8 | This series applies to:
  9 | 
 10 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 11 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 12 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 13 | ```
 14 | <br/>
 15 | <br/>
 16 | 
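 17 | Before reading this article, be sure to read [Spark Streaming 实现思路与模块概述](0.1 Spark Streaming 实现思路与模块概述.md) first. It outlines the basic roles of Spark Streaming's 4 modules; with that overall picture in mind, this article's details on `Module 4: long-running fault tolerance` are easier to follow.
 18 | 
 19 | ## Introduction
 20 | 
 21 | The previous articles covered the 3 modules that implement what Spark Streaming adds on top of Spark Core. Next we look at how the 4th module keeps Spark Streaming running over the long term, that is, how it works with the first 3 modules to keep them running.
 22 | 
 23 | From the analysis of the key classes in the first 3 modules, we know that modules 1 and 2 must be protected on the driver side, and module 3 on both the executor side and the driver side.
 24 | 
 25 | ![image](0.imgs/040.png)
 26 | 
 27 | This article covers the driver-side protection. It has two parts:
 28 | 
 29 | - (1) ReceivedBlockTracker fault tolerance
 30 |   - uses WAL cold standby
 31 | - (2) DStream, JobGenerator fault tolerance
 32 |   - uses Checkpoint cold standby
 33 | 
 34 | ## (1) ReceivedBlockTracker fault tolerance in detail
 35 | 
 36 | As discussed earlier, block metadata is reported to `ReceiverTracker`, which hands it to `ReceivedBlockTracker` for the actual bookkeeping. `ReceivedBlockTracker` also uses WAL cold standby for backup: after a driver failure, the new `ReceivedBlockTracker` reads the WAL and restores the block metadata.
 37 | 
 38 | `WriteAheadLog` is widely used in both single-node RDBMSs and NoSQL/NewSQL systems: the former use it for transaction logs, and in the latter, HBase for example can write inserts to the HLog first.
 39 | 
 40 | A `WriteAheadLog` is written sequentially, so backing data up is efficient; but recovery also requires reading it sequentially, so it takes some recovery time.
 41 | 
 42 | `WriteAheadLog` and its rolling-file implementation `FileBasedWriteAheadLog` were covered in [Executor 端长时容错详解](4.1 Executor 端长时容错详解.md); below we focus on how `ReceivedBlockTracker` uses the WAL.
 43 | 
 44 | `ReceivedBlockTracker` has a `writeToLog()` method that writes a log record to the rolling log. Let us see where the code calls `writeToLog()`: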
 45 | 
 46 | ```scala
 47 | 
 48 | def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = synchronized {
 49 |   ...
 50 |   // 【after receiving the metadata reported by a Receiver, first write it to the WAL via writeToLog()】
 51 |   writeToLog(BlockAdditionEvent(receivedBlockInfo))
 52 |   // 【then index the metadata】
 53 |   getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
 54 |   ...
 55 | }
 56 | 
 57 | def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
 58 |   ...
 59 |   // 【after JobGenerator asks to allocate metadata to the latest batch, first write to the WAL via writeToLog()】
 60 |   writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))
 61 |   // 【then allocate the metadata to the latest batch】
 62 |   timeToAllocatedBlocks(batchTime) = allocatedBlocks
 63 |   ...
 64 | }
 65 | 
 66 | def cleanupOldBatches(cleanupThreshTime: Time, waitForCompletion: Boolean): Unit = synchronized {
 67 |   ...
 68 |   // 【after JobGenerator asks to clean up outdated metadata, first write to the WAL via writeToLog()】
 69 |   writeToLog(BatchCleanupEvent(timesToCleanup))
 70 |   // 【then remove the outdated metadata】
 71 |   timeToAllocatedBlocks --= timesToCleanup
 72 |   // 【then remove the WAL entries that correspond to the outdated metadata】
 73 |   writeAheadLogOption.foreach(_.clean(cleanupThreshTime.milliseconds, waitForCompletion))
 74 | }
 75 | ```
 76 | 
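 77 | As the code above shows, 3 kinds of events, `BlockAdditionEvent`, `BatchAllocationEvent` and `BatchCleanupEvent`, are saved to the WAL.
 78 | 
 79 | In other words, when recovering from the WAL we get these 3 kinds of events back, and replaying the log from the beginning rebuilds the state members of `ReceivedBlockTracker`: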
 80 | 
 81 | ![image](3.imgs/070.png)
 82 | 
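Rebuilding state by replaying the log can be sketched as a fold over the recorded events. The following is a simplified model, not Spark's `ReceivedBlockTracker`: the event and state types are hypothetical stand-ins, block metadata is reduced to a string id, and an allocation is assumed to clear the queues of unallocated blocks.

```scala
object TrackerRecovery {
  // Hypothetical, simplified versions of the three logged event types.
  sealed trait TrackerEvent
  final case class BlockAddition(streamId: Int, blockId: String) extends TrackerEvent
  final case class BatchAllocation(batchTime: Long, blockIds: Seq[String]) extends TrackerEvent
  final case class BatchCleanup(batchTimes: Seq[Long]) extends TrackerEvent

  // Simplified tracker state: unallocated blocks per stream, allocated blocks per batch.
  final case class TrackerState(
      unallocated: Map[Int, Vector[String]],
      allocated: Map[Long, Seq[String]])

  val empty: TrackerState = TrackerState(Map.empty, Map.empty)

  // Replays the events in log order to rebuild the state.
  def replay(events: Seq[TrackerEvent]): TrackerState =
    events.foldLeft(empty) {
      case (s, BlockAddition(streamId, blockId)) =>
        val queue = s.unallocated.getOrElse(streamId, Vector.empty) :+ blockId
        s.copy(unallocated = s.unallocated.updated(streamId, queue))
      case (s, BatchAllocation(batchTime, blockIds)) =>
        // Assumption: allocating a batch empties all unallocated queues.
        TrackerState(Map.empty, s.allocated.updated(batchTime, blockIds))
      case (s, BatchCleanup(batchTimes)) =>
        s.copy(allocated = s.allocated -- batchTimes)
    }
}
```

Because every state change is logged before it is applied, replaying the same sequence after a restart ends in the same state the tracker held before the failure.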
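 83 | ## (2) DStream, JobGenerator fault tolerance in detail
 84 | 
 85 | In addition, `DStreamGraph` and `JobScheduler` must be checkpointed periodically, to record changes to the `DStreamGraph` and the completion status of each batch's jobs.
 86 | 
 87 | Note that this is a full checkpoint, unlike the WAL approach above. Checkpoints are also usually persisted to reliable storage such as HDFS. By default the checkpoint interval equals batchDuration: a checkpoint is taken each time a batch is started and its jobs are submitted, and once more when the jobs finish and the job status is updated.
 88 | 
 89 | Specifically, `JobGenerator.doCheckpoint()` creates a `Checkpoint` of the current state with `new` and writes it out through `CheckpointWriter`: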
 90 | 
 91 | ```scala
 92 | // From JobGenerator
 93 | 
 94 | private def doCheckpoint(time: Time, clearCheckpointDataLater: Boolean) {
 95 |   if (shouldCheckpoint && (time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)) {
 96 |     logInfo("Checkpointing graph for time " + time)
 97 |     ssc.graph.updateCheckpointData(time)
 98 |     // 【create a Checkpoint of the current state with new, then write it out through CheckpointWriter】
 99 |     checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
100 |   }
101 | }
102 | ```
103 | 
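The guard in `doCheckpoint()` means a checkpoint is written only when the time elapsed since `zeroTime` is a whole multiple of the checkpoint interval. A minimal sketch of that condition, using plain `Long` milliseconds in place of `Time` and `Duration` (`CheckpointSchedule` and `shouldCheckpointAt` are hypothetical names):

```scala
object CheckpointSchedule {
  // All arguments are in milliseconds; plain Longs stand in for Time and Duration.
  // True when the elapsed time since zeroTime is a whole multiple of checkpointDuration.
  def shouldCheckpointAt(time: Long, zeroTime: Long, checkpointDuration: Long): Boolean =
    (time - zeroTime) % checkpointDuration == 0
}
```

For instance, with `zeroTime = 1000`, a 5-second batch interval and a 10-second checkpoint interval, the batch at `6000` is skipped and the batch at `11000` is checkpointed.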
104 | Next, let us see where `JobGenerator.doCheckpoint()` is called:
105 | 
106 | ```scala
107 | // From JobGenerator
108 | 
109 | private def processEvent(event: JobGeneratorEvent) {
110 |   logDebug("Got event " + event)
111 |   event match {
112 |     ...
113 |     // 【the DoCheckpoint message is received asynchronously, and doCheckpoint() then runs on the event-loop thread】
114 |     case DoCheckpoint(time, clearCheckpointDataLater) =>
115 |       doCheckpoint(time, clearCheckpointDataLater)
116 |     ...
117 |   }
118 | }
119 | ```
120 | 
121 | So, one step further: where is the `DoCheckpoint` message actually sent?
122 | 
123 | ```scala
124 | // From JobGenerator
125 | 
126 | private def generateJobs(time: Time) {
127 |   SparkEnv.set(ssc.env)
128 |   Try {
129 |     jobScheduler.receiverTracker.allocateBlocksToBatch(time)                 // 【step (1)】
130 |     graph.generateJobs(time)                                                 // 【step (2)】
131 |   } match {
132 |     case Success(jobs) =>
133 |       val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time) // 【step (3)】
134 |       jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))    // 【step (4)】
135 |     case Failure(e) =>
136 |       jobScheduler.reportError("Error generating jobs for time " + time, e)
137 |   }
138 |   eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))       // 【step (5)】
139 | }
140 | 
141 | // From JobGenerator (runs after JobScheduler reports the batch as completed)
142 | private def clearMetadata(time: Time) {
143 |   ssc.graph.clearMetadata(time)
144 | 
145 |   if (shouldCheckpoint) {
146 |     // 【when a batch is done and its metadata needs to be cleaned】
147 |     eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
148 |   }
149 |   ...
150 | }
151 | ```
152 | 
153 | It turns out the `DoCheckpoint` message is sent from two places:
154 | 
155 | - The 1st is step (5) of the classic `JobGenerator.generateJobs()`
156 |   - right after step (4) submits the `JobSet` to `JobScheduler` for asynchronous execution, step (5) sends the `DoCheckpoint` message (see the figure below)
157 |   - ![image](0.imgs/055.png)
158 | - The 2nd is after `JobScheduler` has successfully run the submitted `JobSet`, when this batch's information can be cleared
159 |   - first the various pieces of information are cleared
160 |   - then a `DoCheckpoint` message is sent, triggering `doCheckpoint()`, which records that a batch has been completed
161 | 
162 | With the question of when `doCheckpoint()` runs settled, the only remaining question is what a `Checkpoint` contains.
163 | 
164 | ## Checkpoint in detail
165 | 
166 | Here is what a `Checkpoint` contains; the full list is:
167 | 
168 | ```scala
169 | // From Checkpoint
170 | 
171 | val checkpointTime: Time
172 | val master: String = ssc.sc.master
173 | val framework: String = ssc.sc.appName
174 | val jars: Seq[String] = ssc.sc.jars
175 | val graph: DStreamGraph = ssc.graph // 【important】
176 | val checkpointDir: String = ssc.checkpointDir
177 | val checkpointDuration: Duration = ssc.checkpointDuration
178 | val pendingTimes: Array[Time] = ssc.scheduler.getPendingTimes().toArray // 【important】
179 | val delaySeconds: Int = MetadataCleaner.getDelaySeconds(ssc.conf)
180 | val sparkConfPairs: Array[(String, String)] = ssc.conf.getAll
181 | ```
182 | <br/>
183 | <br/>
184 | 
185 | (End of article. To join the discussion, [click here](https://github.com/proflin/CoolplaySpark/issues/12); to return to the table of contents, [click here](readme.md).)
186 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/Q&A 什么是 end-to-end exactly-once.md:
--------------------------------------------------------------------------------
 1 | ## [Q] What is end-to-end exactly-once?
 2 | [A] We usually treat the upstream data source (Source) as one end and the downstream data receiver (Sink) as the other end:
 3 | 
 4 | ```
 5 |  Source  -->  Spark Streaming  -->  Sink
 6 |  [end]                             [end]
 7 | 
 8 | ```
 9 | 
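10 | Today's Spark Streaming processing is **itself** exactly-once, and it manages data on the upstream end fairly well (for example, it stores Kafka offsets itself in direct mode). However, it is not yet friendly to downstream ends other than HDFS, such as HBase, MySQL and Redis: user code must implement idempotent logic to guarantee end-to-end exactly-once.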
11 | 
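One common way for user code to make a sink idempotent is to key every write by the batch it came from, so that a replayed batch overwrites its earlier output instead of appending to it. The sketch below only illustrates that idea under stated assumptions: `KeyedSink` is a hypothetical in-memory stand-in for a store such as HBase or Redis, and keying by `(batchTime, partitionId)` is an illustrative choice, not an API of Spark.

```scala
import scala.collection.mutable

object IdempotentSink {
  // Hypothetical key-value sink; each (batchTime, partitionId) key holds one write.
  final class KeyedSink {
    private val store = mutable.Map.empty[(Long, Int), Seq[String]]

    // Writing the same key again replaces the earlier value, so replays are harmless.
    def write(batchTime: Long, partitionId: Int, records: Seq[String]): Unit =
      store((batchTime, partitionId)) = records

    // All stored records, ordered by batch time and then partition id.
    def allRecords: Seq[String] =
      store.toSeq.sortBy(_._1).flatMap(_._2)
  }
}
```

If a batch's output for partition 0 is written twice because the batch was re-executed after a failure, `allRecords` still contains each record once.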
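12 | Structured Streaming, introduced in Spark 2.0, is designed to manage the common downstream ends as well (for example, natively supporting idempotence via batch id), so end-to-end exactly-once can be guaranteed without extra work in user code; see the following slide from Databricks [1]: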
13 | 
14 |  ![end-to-end exactly-once](q%26a.imgs/end-to-end%20exactly-once.png)
15 | 
16 | 
17 | - [1] Reynold Xin (Databricks), *"the Future of Real-time in Spark"*, 2016.02, http://www.slideshare.net/rxin/the-future-of-realtime-in-spark.
18 | 
19 | --
20 | 
21 | (End of article. To join the discussion, [click here](https://github.com/lw-lin/CoolplaySpark/issues/25); to return to the table of contents, [click here](readme.md).)
22 | 


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/img.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/img.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/q&a.imgs/end-to-end exactly-once.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark Streaming 源码解析系列/q&a.imgs/end-to-end exactly-once.png


--------------------------------------------------------------------------------
/Spark Streaming 源码解析系列/readme.md:
--------------------------------------------------------------------------------
 1 | ## Spark Streaming 源码解析系列
 2 | 
 3 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team)
 4 | 
 5 | ```
 6 | This series applies to:
 7 | 
 8 | * 2022.02.10 update, Spark 3.1.3     √
 9 | * 2021.10.13 update, all Spark 3.2 releases √
10 | * 2021.01.07 update, all Spark 3.1 releases √
11 | * 2020.06.18 update, all Spark 3.0 releases √ 
12 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
13 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
14 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
15 | ```
16 | 
17 | - *Overview*
18 |   - [0.1 Spark Streaming 实现思路与模块概述](0.1%20Spark%20Streaming%20实现思路与模块概述.md)
19 | - *Module 1: static DAG definition*
20 |   - [1.1 DStream, DStreamGraph 详解](1.1%20DStream%2C%20DStreamGraph%20详解.md)
21 |   - [1.2 DStream 生成 RDD 实例详解](1.2%20DStream%20生成%20RDD%20实例详解.md)
22 | - *Module 2: dynamic job generation*
23 |   - [2.1 JobScheduler, Job, JobSet 详解](2.1%20JobScheduler%2C%20Job%2C%20JobSet%20详解.md)
24 |   - [2.2 JobGenerator 详解](2.2%20JobGenerator%20详解.md)
25 | - *Module 3: data generation and ingestion*
26 |   - [3.1 Receiver 分发详解](3.1%20Receiver%20分发详解.md) 
27 |   - [3.2 Receiver, ReceiverSupervisor, BlockGenerator, ReceivedBlockHandler 详解](3.2%20Receiver%2C%20ReceiverSupervisor%2C%20BlockGenerator%2C%20ReceivedBlockHandler%20详解.md)
28 |   - [3.3 ReceiverTraker, ReceivedBlockTracker 详解](3.3%20ReceiverTraker%2C%20ReceivedBlockTracker%20详解.md)
29 | - *Module 4: long-running fault tolerance*
30 |   - [4.1 Executor 端长时容错详解](4.1%20Executor%20端长时容错详解.md)
31 |   - [4.2 Driver 端长时容错详解](4.2%20Driver%20端长时容错详解.md)
32 | - *StreamingContext*
33 |   - 5.1 StreamingContext 详解
34 | - *Resources and Q&A*
35 |   - [Spark 资源集合](https://github.com/lw-lin/CoolplaySpark/tree/master/Spark%20%E8%B5%84%E6%BA%90%E9%9B%86%E5%90%88) (包括 Spark Summit 视频，Spark 中文微信群等资源集合)<br/>![wechat_spark_streaming_small](../Spark%20%E8%B5%84%E6%BA%90%E9%9B%86%E5%90%88/resources/wechat_spark_streaming_small_.PNG)
36 |   - [(Q&A) 什么是 end-to-end exactly-once?](Q%26A%20什么是%20end-to-end%20exactly-once.md)
37 | 
38 | ## Acknowledgments
39 | 
40 | - GitHub @wongxingjun pointed out 3 typos and submitted a Pull Request to fix them (PR merged)
41 | - GitHub @endymecy pointed out 2 typos and submitted a Pull Request to fix them (PR merged)
42 | - GitHub @Lemonjing pointed out several typos and submitted a Pull Request to fix them (PR merged)
43 | - GitHub @xiaoguoqiang pointed out 1 typo and submitted a Pull Request to fix it (PR merged)
44 | - GitHub 张瀚 (@AntikaSmith) pointed out 1 issue (fixed)
45 | - GitHub Tao Meng (@mtunique) pointed out 1 typo and submitted a Pull Request to fix it (PR merged)
46 | - GitHub @ouyangshourui pointed out 1 issue and submitted a Pull Request to fix it (PR merged)
47 | - GitHub @jacksu pointed out 1 issue and submitted a Pull Request to fix it (PR merged)
48 | - GitHub @klion26 pointed out 1 typo (fixed)
49 | - GitHub @397090770 pointed out 1 error in a figure (fixed)
50 | - GitHub @ubtaojiang1982 pointed out 1 typo (fixed)
51 | - GitHub @marlin5555 pointed out 1 piece of information missing from a figure (fixed)
52 | - Weibo @wyggggo pointed out 1 typo (fixed)
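54 | ## Spark Streaming prehistory (1)
55 | 
56 | As a big-data processing framework running on commodity hardware, Apache Hadoop became wildly popular in the years after its birth (2005 to present) and practically the industry's de facto standard tool for processing big data:
57 | 
58 | ![image](0.imgs/001.png)
59 | 
60 | ## Spark Streaming prehistory (2)
61 | 
62 | However, people gradually found a separate need for processing streaming data (characterized by highly real-time source data and low required processing latency). So from 2010 on, many general-purpose stream processing frameworks became popular, and the "batch + real-time" dual-engine architecture, used alongside batch frameworks such as Hadoop, became the current de facto standard:
63 | 
64 | ![image](0.imgs/002.png)
65 | 
66 |   ps: Some time ago, over a meal with a former Googler (who happened to be among the first MillWheel users), I learned that MillWheel was developed around 2010 and is said to be extremely pleasant to use.
67 | 
68 | ## The birth of Spark Streaming
69 | 
70 | ![image](0.imgs/005.png)
71 | 
72 | ![image](0.imgs/006.png)
73 | 
74 | This series explains Spark Streaming, released in 2013, in detail.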
75 | 
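76 | ## Creative Commons
77 | 
78 | ![](https://licensebuttons.net/l/by-nc/4.0/88x31.png)
79 | 
80 | Unless otherwise noted, the articles in this series, 《Spark Streaming 源码解析系列》, are licensed under the [CC BY-NC (Attribution-NonCommercial)](https://creativecommons.org/licenses/by-nc/4.0/) Creative Commons license.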
81 | 


--------------------------------------------------------------------------------
/Spark 样例工程/.gitignore:
--------------------------------------------------------------------------------
1 | *.iml
2 | */.idea
3 | */target
4 | *.class


--------------------------------------------------------------------------------
/Spark 样例工程/README.md:
--------------------------------------------------------------------------------
1 | A simple hello world sample project. From here you can:
2 | - click the `spark_hello_world` directory above to view the Scala source directly;
3 | - click `spark_hello_world.zip` above to download it and run it locally.
4 | 


--------------------------------------------------------------------------------
/Spark 样例工程/spark_hello_world.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 样例工程/spark_hello_world.zip


--------------------------------------------------------------------------------
/Spark 样例工程/spark_hello_world/pom.xml:
--------------------------------------------------------------------------------
  1 | <?xml version="1.0" encoding="UTF-8"?>
  2 | <project xmlns="http://maven.apache.org/POM/4.0.0"
  3 |          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  4 |          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  5 |     <modelVersion>4.0.0</modelVersion>
  6 | 
  7 |     <groupId>com.github.lw_lin.spark</groupId>
  8 |     <artifactId>spark_hello_world</artifactId>
  9 |     <version>1.0-SNAPSHOT</version>
 10 | 
 11 |     <properties>
 12 |         <scala.version>2.10</scala.version>
 13 |         <spark.version>1.6.0</spark.version>
 14 |     </properties>
 15 | 
 16 |     <dependencies>
 17 |         <dependency>
 18 |             <groupId>org.apache.spark</groupId>
 19 |             <artifactId>spark-core_${scala.version}</artifactId>
 20 |             <version>${spark.version}</version>
 21 |         </dependency>
 22 |         <dependency>
 23 |             <groupId>org.apache.spark</groupId>
 24 |             <artifactId>spark-sql_${scala.version}</artifactId>
 25 |             <version>${spark.version}</version>
 26 |         </dependency>
 27 |         <dependency>
 28 |             <groupId>org.apache.spark</groupId>
 29 |             <artifactId>spark-streaming_${scala.version}</artifactId>
 30 |             <version>${spark.version}</version>
 31 |         </dependency>
 32 |     </dependencies>
 33 | 
 34 |     <build>
 35 |         <sourceDirectory>src/main/java</sourceDirectory>
 36 |         <plugins>
 37 |             <plugin>
 38 |                 <groupId>net.alchim31.maven</groupId>
 39 |                 <artifactId>scala-maven-plugin</artifactId>
 40 |                 <version>3.2.0</version>
 41 |                 <executions>
 42 |                     <execution>
 43 |                         <id>scala-compile</id>
 44 |                         <goals>
 45 |                             <goal>compile</goal>
 46 |                         </goals>
 47 |                     </execution>
 48 |                     <execution>
 49 |                         <id>scala-test-compile</id>
 50 |                         <goals>
 51 |                             <goal>testCompile</goal>
 52 |                         </goals>
 53 |                     </execution>
 54 |                 </executions>
 55 |                 <configuration>
 56 |                     <recompileMode>incremental</recompileMode>
 57 |                     <args>
 58 |                         <arg>-target:jvm-1.6</arg>
 59 |                         <arg>-encoding</arg>
 60 |                         <arg>UTF-8</arg>
 61 |                     </args>
 62 |                     <javacArgs>
 63 |                         <javacArg>-source</javacArg>
 64 |                         <javacArg>1.6</javacArg>
 65 |                         <javacArg>-target</javacArg>
 66 |                         <javacArg>1.6</javacArg>
 67 |                     </javacArgs>
 68 |                 </configuration>
 69 |             </plugin>
 70 |             <plugin>
 71 |                 <artifactId>maven-compiler-plugin</artifactId>
 72 |                 <executions>
 73 |                     <execution>
 74 |                         <id>default-compile</id>
 75 |                         <phase>none</phase>
 76 |                     </execution>
 77 |                     <execution>
 78 |                         <id>default-testCompile</id>
 79 |                         <phase>none</phase>
 80 |                     </execution>
 81 |                 </executions>
 82 |             </plugin>
 83 |             <plugin>
 84 |                 <groupId>org.apache.maven.plugins</groupId>
 85 |                 <artifactId>maven-assembly-plugin</artifactId>
 86 |                 <version>2.2-beta-5</version>
 87 |                 <configuration>
 88 |                     <descriptorRefs>
 89 |                         <descriptorRef>jar-with-dependencies</descriptorRef>
 90 |                     </descriptorRefs>
 91 |                 </configuration>
 92 |                 <executions>
 93 |                     <execution>
 94 |                         <phase>package</phase>
 95 |                         <goals>
 96 |                             <goal>single</goal>
 97 |                         </goals>
 98 |                     </execution>
 99 |                 </executions>
100 |             </plugin>
101 |         </plugins>
102 |         <resources>
103 |             <resource>
104 |                 <directory>src/main/resources</directory>
105 |             </resource>
106 |         </resources>
107 |     </build>
108 | </project>


--------------------------------------------------------------------------------
/Spark 样例工程/spark_hello_world/src/main/scala/com/github/lw_lin/spark/SparkHelloWorld.scala:
--------------------------------------------------------------------------------
 1 | package com.github.lw_lin.spark
 2 | 
 3 | import org.apache.spark.{SparkContext, SparkConf}
 4 | 
 5 | /**
 6 |  * This program can be downloaded at:
 7 |  * https://github.com/lw-lin/CoolplaySpark/tree/master/Spark%20%E6%A0%B7%E4%BE%8B%E5%B7%A5%E7%A8%8B
 8 |  */
 9 | object SparkHelloWorld {
10 | 
11 |   def main(args: Array[String]) {
12 |     val conf = new SparkConf()
13 |     conf.setAppName("SparkHelloWorld")
14 |     conf.setMaster("local[2]")
15 |     val sc = new SparkContext(conf)
16 | 
17 |     val lines = sc.parallelize(Seq("hello world", "hello tencent"))
18 |     val wc = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
19 |     wc.foreach(println)
20 | 
21 |     Thread.sleep(10 * 60 * 1000) // block for 10 minutes; meanwhile you can check the Spark UI at http://localhost:4040
22 |   }
23 | }
24 | 


--------------------------------------------------------------------------------
/Spark 资源集合/README.md:
--------------------------------------------------------------------------------
 1 | # Spark+AI Summit resources (latest: 2019)
 2 | 
 3 | ![spark_ai summit_2019](resources/spark_ai_summit_2019_san_francisco.png)
 4 | - [2019.04.23~25] Official schedule => [official schedule](https://databricks.com/sparkaisummit/north-america/schedule-static)
 5 | - [2019.04.23~25] Slide collection => [slide collection from 示说网](https://mp.weixin.qq.com/s/CSTqXHCpJPvlkVAeaY1mIw)
 6 | - [2019.04.23~25] Video collection => [mirror accessible in mainland China @ Baidu Cloud](https://pan.baidu.com/s/10HmEy1zbVnfsZQrllTwl8A)
 7 |   <br/>
 8 | 
 9 | # Spark Chinese-language WeChat group
10 | 
11 | ![wechat_spark_streaming_small](resources/wechat_spark_streaming_small_.PNG)
12 | <br/>
13 | 
14 | # Spark resources
15 | 
16 | - [Databricks 的博客](https://databricks.com/blog)
17 |   - the blog of the big-data company behind Spark, covering Spark internals, Spark case studies, industry news and more
18 | - [Apache Spark JIRA issues](https://issues.apache.org/jira/issues/?jql=project+%3D+SPARK)
19 |   - worth following regularly for developers, to learn where Spark is heading over the next 3 to 6 months
20 | - [Structured Streaming 源码解析系列](https://github.com/lw-lin/CoolplaySpark/tree/master/Structured%20Streaming%20%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90%E7%B3%BB%E5%88%97)
21 |   - the author keeps updating and revising it for the latest Spark version
22 | - [Spark Streaming 源码解析系列](https://github.com/lw-lin/CoolplaySpark/tree/master/Spark%20Streaming%20%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90%E7%B3%BB%E5%88%97)
23 |   - the author keeps updating and revising it for the latest Spark version
24 |     <br/>
25 | 
26 | # More resources, continuously updated
27 | 
28 | *Contributions of resource links are welcome (just open an issue in this repo), thanks!*
29 | 
30 | 


--------------------------------------------------------------------------------
/Spark 资源集合/resources/spark_ai_summit_2019_san_francisco.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/spark_ai_summit_2019_san_francisco.png


--------------------------------------------------------------------------------
/Spark 资源集合/resources/spark_summit_east_2017.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/spark_summit_east_2017.png


--------------------------------------------------------------------------------
/Spark 资源集合/resources/spark_summit_europe_2016.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/spark_summit_europe_2016.jpg


--------------------------------------------------------------------------------
/Spark 资源集合/resources/spark_summit_europe_2016.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/spark_summit_europe_2016.png


--------------------------------------------------------------------------------
/Spark 资源集合/resources/spark_summit_europe_2017.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/spark_summit_europe_2017.png


--------------------------------------------------------------------------------
/Spark 资源集合/resources/wechat_sh_meetup.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/wechat_sh_meetup.PNG


--------------------------------------------------------------------------------
/Spark 资源集合/resources/wechat_sh_meetup_small.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/wechat_sh_meetup_small.PNG


--------------------------------------------------------------------------------
/Spark 资源集合/resources/wechat_spark_streaming.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/wechat_spark_streaming.PNG


--------------------------------------------------------------------------------
/Spark 资源集合/resources/wechat_spark_streaming_small.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/wechat_spark_streaming_small.PNG


--------------------------------------------------------------------------------
/Spark 资源集合/resources/wechat_spark_streaming_small_.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Spark 资源集合/resources/wechat_spark_streaming_small_.PNG


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.1 Structured Streaming 实现思路与实现概述.md:
--------------------------------------------------------------------------------
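  1 | # Structured Streaming: design approach and implementation overview #
  2 | 
  3 | ***[酷玩 Spark] Structured Streaming 源码解析系列***; to return to the table of contents, [click here](.)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team)
  6 | 
  7 | ```
  8 | This article applies to:
  9 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 10 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 11 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 12 | ```
 13 | 
 14 | Contents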
 15 | 
 16 | <p>
 17 | <a href="#一引言spark-2-时代">一、引言：Spark 2 时代!</a><br/>
 18 | <a href="#二从-structured-data-到-structured-streaming">二、从 Structured Data 到 Structured Streaming</a><br/>
 19 | <a href="#三structured-streaming无限增长的表格">三、Structured Streaming：无限增长的表格</a><br/>
 20 | <a href="#四streamexecution持续查询的运转引擎">四、StreamExecution：持续查询的运转引擎</a><br/>
 21 | 　　<a href="#1-streamexecution-的初始状态">1. StreamExecution 的初始状态</a><br/>
 22 | 　　<a href="#2-streamexecution-的持续查询">2. StreamExecution 的持续查询</a><br/>
 23 | 　　<a href="#3-streamexecution-的持续查询增量">3. StreamExecution 的持续查询（增量）</a><br/>
 24 | 　　<a href="#4-故障恢复">4. 故障恢复</a><br/>
 25 | 　　<a href="#5-sources-与-sinks">5. Sources 与 Sinks</a><br/>
 26 | 　　<a href="#6-小结end-to-end-exactly-once-guarantees">6. 小结：end-to-end exactly-once guarantees</a><br/>
 27 | <a href="#五全文总结">五、全文总结</a><br/>
 28 | <a href="#六扩展阅读">六、扩展阅读</a><br/>
 29 | <a href="#参考资料">参考资料</a>
 30 | </p>
 31 | 
 32 | ## 一、引言：Spark 2 时代!
 33 | 
 34 | <p align="center"><img src="1.imgs/010.png" alt="Spark 1.x stack"></p>
 35 | 
 36 | In the Spark 1.x era, built on SparkContext (and the RDD API), structured data gave rise to SQLContext and HiveContext, and streaming gave rise to StreamingContext: quite a dazzling array.
 37 | 
 38 | <p align="center"><img src="1.imgs/015.png" alt="Spark 2.x stack"></p>
 39 | 
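 40 | Spark 2.x trims this down to a single SparkSession as the main program entry point, with Dataset/DataFrame as the main user API, serving structured data, streaming data, machine learning, graph and other scenarios at once. This greatly reduces what users have to learn and happily realizes once more the old ideal of "one stack to rule them all".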
 41 | 
 42 | <p align="center"><img src="1.imgs/030.png" alt="RDD vs Dataset/DataFrame"></p>
 43 | 
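 44 | Here is a brief review of how the Dataset/DataFrame of Spark 2.x differs from the RDD of Spark 1.x:
 45 | 
 46 | - An RDD in Spark 1.x is more of a one-dimensional dataset with only the notion of rows. In an `RDD[Person]`, for example, each row is a `Person`, and in memory a `Person` is stored as a whole (a Java object before serialization, or bytes after).
 47 | - In Spark 2.x, a Dataset or DataFrame of `Person` is a two-dimensional dataset of rows and columns: one `Person` per row, with three columns `name:String`, `age:Int`, `height:Double`; the physical in-memory layout also marks column boundaries explicitly.
 48 |   - Dataset and DataFrame differ in API usage: compared with DataFrame, Dataset is type-safe and can report at compile time what would otherwise be an AnalysisException (see the example in the figure below): <img src="1.imgs/040.png">
 49 |   - Dataset and DataFrame do not differ in storage: both are stored in memory in exactly the same way, as two-dimensional rows and columns (UnsafeRow). So where the API-level difference between `Dataset` and `DataFrame` does not matter, we write `Dataset/DataFrame` for both
 50 | 
 51 | > [Section note] Spark 1.x already had the Dataset/DataFrame concept, but only as the main API of the SparkSQL module; from 2.0 on, Dataset/DataFrame is no longer confined to SparkSQL and has become the main API of Spark as a whole.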
 52 | 
 53 | ## 二、从 Structured Data 到 Structured Streaming
 54 | 
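 55 | Expressing structured data as a Dataset/DataFrame table of rows and columns is both easy to understand and broadly applicable:
 56 | 
 57 | - Multiple objects of a Java class `class Person { String name; int age; double height}` can easily be converted to a `Dataset/DataFrame`
 58 | - Multiple JSON objects such as `{name: "Alice", age: 20, height: 1.68}, {name: "Bob", age: 25, height: 1.76}` can easily be converted to a `Dataset/DataFrame`
 59 | - MySQL tables, row-oriented files, column-oriented files and so on can likewise easily be converted to a `Dataset/DataFrame`
 60 | 
 61 | Spark 2.0 goes a step further and extends the Dataset/DataFrame row-column table to express streaming data, hence the arrival of Structured Streaming and of this series, 《Structured Streaming 源码解析系列》. Unlike static structured data, the row-column table of dynamic streaming data keeps growing without bound (because streaming data keeps arriving)!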
 62 | 
 63 | <p align="center"><img src="1.imgs/050.png" alt="single API: Dataset/DataFrame"></p>
 64 | 
 65 | ## <a name="三structured-streaming无限增长的表格"></a>3. Structured Streaming: An Unbounded Table
 66 | 
 67 | Based on the "unbounded table" programming model [1], let us write a streaming word count:
 68 | 
 69 | <p align="center"><img src="1.imgs/070.png" alt="word count"></p>
 70 | 
 71 | The corresponding Structured Streaming code snippet:
 72 | 
 73 | ```scala
 74 | val spark = SparkSession.builder().master("...").getOrCreate()  // create the SparkSession entry point
 75 | import spark.implicits._                           // encoders required by flatMap on a Dataset[String]
 76 | val lines = spark.readStream.textFile("some_dir")  // turn the contents of some_dir into a Dataset/DataFrame: the input table
 77 | val words = lines.flatMap(_.split(" "))
 78 | val wordCounts = words.groupBy("value").count()    // count by the "value" column, giving a two-column Dataset/DataFrame: the result table
 79 | 
 80 | val query = wordCounts.writeStream                 // plan to write out the wordCounts Dataset/DataFrame
 81 |   .outputMode("complete")                          // write out the full contents of wordCounts
 82 |   .format("console")                               // write to the console
 83 |   .start()                                         // start a new thread that keeps writing results
 84 | 
 85 | query.awaitTermination()                           // block the user's main thread until the writing thread ends
 86 | ```
 87 | 
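 88 | A few points need explaining here:
 89 | 
 90 | - Structured Streaming also follows a define-first, trigger-later pattern:
 91 |   - most of the code above ***only defines*** how the Dataset/DataFrame is produced, transformed, and written out
 92 |   - only near the end do we actually ***start*** a new thread that triggers execution of those definitions
 93 | - In the new execution thread we need to ***continuously*** discover new data and ***continuously*** compute and write out the latest result
 94 |   - this process is called a ***continuous query***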
 95 | 
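 96 | ## <a name="四streamexecution持续查询的运转引擎"></a>4. StreamExecution: The Engine Behind Continuous Queries
 97 | 
 98 | We now turn to StreamExecution, the engine that drives a ***continuous query*** (and therefore all of Structured Streaming).
 99 | 
100 | ### 1. Initial State of StreamExecution
101 | 
102 | As discussed above, we first define how the Dataset/DataFrame is produced, transformed, and written out, and then start a StreamExecution to run the continuous query. This information is stored in three very important member variables of StreamExecution:
103 | 
104 | - `sources`: where the streaming data is produced (Kafka, for example)
105 | - `logicalPlan`: the series of DataFrame/Dataset transformations (the computation logic)
106 | - `sink`: where the final results are written (a file system, for example)
107 | 
108 | Other important member variables of StreamExecution are:
109 | 
110 | - `currentBatchId`: the id of the current execution
111 | - `batchCommitLog`: which batches have already been processed successfully
112 | - `offsetLog`, `availableOffsets`, `committedOffsets`: metadata about the source data that the current execution needs to process
113 | - `offsetSeqMetadata`: watermark information for the current execution (related to event time; not covered here and analyzed in a separate article), among other things
114 | 
115 | The figure below marks Source, Sink, StreamExecution and its important member variables; we analyze them one by one next.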
116 | 
117 | ![Spark 1.0](1.imgs/100.png)
118 | 
119 | ### 2. Continuous Query in StreamExecution
120 | 
121 | ![Spark 1.0](1.imgs/110.png)
122 | 
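123 | The figure above shows one execution; there are 6 key steps:
124 | 
125 | 1. StreamExecution calls Source.getOffset() to get the latest offsets, i.e. the latest data progress;
126 | 2. StreamExecution writes the offsets and related information to offsetLog
127 |      - offsetLog is a persistent WAL (Write-Ahead Log) that can later be used for failure recovery
128 | 3. StreamExecution builds the LogicalPlan for this execution
129 |      - (3a) make a copy of the predefined logic (the logicalPlan member variable of StreamExecution)
130 |      - (3b) given the offsets just obtained, call Source.getBatch(offsets) to get a Dataset/DataFrame representing the data newly received in this execution, and substitute it into the copy from (3a)
131 |      - after (3a) and (3b), the resulting LogicalPlan is the Dataset/DataFrame transformation (the whole processing logic) applied to the data newly received in this execution
132 | 4. Trigger optimization of this execution's LogicalPlan, which yields an IncrementalExecution
133 |      - logical plan optimization: done by the Catalyst optimizer
134 |      - physical plan generation and selection: the result is an RDD DAG that can be executed directly
135 |      - the logical plan, the optimized logical plan, the physical plan, and the final RDD DAG together make up the IncrementalExecution
136 | 5. Hand the Dataset/DataFrame that represents the result (it contains the IncrementalExecution) to the Sink by calling Sink.addBatch(batchId, ds/df)
137 | 6. Commit after the computation finishes
138 |      - (6a) call Source.commit() to tell the Source that the data has been fully processed; the Source can garbage-collect data as needed
139 |      - (6b) write the batch id of this execution to batchCommitLog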
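
The six steps can be sketched without Spark as a toy loop. This is only a minimal illustration under simplifying assumptions: `ToySource`, `ToySink`, and `ToyExecution` are hypothetical names (not Spark classes), offsets are plain `Long` values, a batch is a `Seq[String]`, and step 4 has no optimizer.

```scala
import scala.collection.mutable

// Hypothetical stand-ins, not Spark classes.
final class ToySource {
  val data = mutable.ArrayBuffer.empty[String]
  def getOffset: Option[Long] = if (data.isEmpty) None else Some(data.length.toLong)
  // Returns the records in the offset range (start, end].
  def getBatch(start: Option[Long], end: Long): Seq[String] =
    data.slice(start.getOrElse(0L).toInt, end.toInt).toVector
  def commit(end: Long): Unit = ()  // nothing to garbage-collect in this toy
}

final class ToySink {
  val written = mutable.LinkedHashMap.empty[Long, Seq[String]]
  def addBatch(batchId: Long, rows: Seq[String]): Unit = written.update(batchId, rows)
}

final class ToyExecution(source: ToySource, sink: ToySink, plan: Seq[String] => Seq[String]) {
  val offsetLog = mutable.LinkedHashMap.empty[Long, Long]  // batchId -> end offset (the WAL)
  val batchCommitLog = mutable.LinkedHashSet.empty[Long]   // ids of completed batches
  private var currentBatchId = 0L
  private var committed: Option[Long] = None

  // Runs one batch if the source has new data; returns whether a batch ran.
  def runOneBatch(): Boolean = source.getOffset match {    // step 1
    case Some(end) if !committed.contains(end) =>
      offsetLog.update(currentBatchId, end)                // step 2
      val input = source.getBatch(committed, end)          // step 3
      val result = plan(input)                             // step 4 (no optimizer here)
      sink.addBatch(currentBatchId, result)                // step 5
      source.commit(end)                                   // step 6a
      batchCommitLog += currentBatchId                     // step 6b
      committed = Some(end)
      currentBatchId += 1
      true
    case _ => false                                        // no new data
  }
}
```

Writing the offset to `offsetLog` before computing anything is what makes a restart able to redo exactly the same batch later.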
140 | 
141 | ### 3. Continuous Query in StreamExecution (Incremental)
142 | 
143 | ![Spark 1.0](1.imgs/120.png)
144 | 
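145 | In its programming model, Structured Streaming lets users treat every run of the continuous query as operating on the full data (not just the data newly received in this execution), so each result is the result of computing over the full data.
146 | 
147 | In practice, however, the full data keeps accumulating, so computing over all of it would cost more and more on every execution.
148 | 
149 | Structured Streaming addresses this as follows:
150 | 
151 | - introduce a global, highly available StateStore
152 | - turn full computation into incremental computation; on each execution:
153 |     - first restore the state left by the previous execution from the StateStore
154 |     - then add this execution's new data and compute
155 |     - if any state changed, save the changed state back to the StateStore
156 | - to perform the StateStore restore and save inside the Dataset/DataFrame framework, introduce two new physical plan nodes: StateStoreRestoreExec and StateStoreSaveExec
157 | 
158 | So the programming model exposed to users treats each run of the continuous query as operating on the full data, while the implementation turns it into an incremental continuous query.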
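
As a rough illustration of the restore / compute / save cycle, here is a minimal word-count sketch in plain Scala. `ToyStateStore` and `IncrementalWordCount` are hypothetical names; the real StateStore is distributed and versioned, which this in-memory stand-in does not model.

```scala
// Hypothetical in-memory stand-in for a StateStore: word -> running count.
final class ToyStateStore {
  private var state = Map.empty[String, Long]
  def restore(): Map[String, Long] = state
  def save(updated: Map[String, Long]): Unit = { state = updated }
}

object IncrementalWordCount {
  // Touches only this batch's new words, yet returns counts over all data seen so far.
  def runBatch(store: ToyStateStore, newWords: Seq[String]): Map[String, Long] = {
    val previous = store.restore()                        // restore the state of the last execution
    val updated = newWords.foldLeft(previous) { (acc, w) =>
      acc.updated(w, acc.getOrElse(w, 0L) + 1L)           // fold in the new data
    }
    if (updated != previous) store.save(updated)          // save only if the state changed
    updated
  }
}
```

The cost of each call depends on the size of `newWords`, not on how much data earlier batches processed.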
159 | 
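160 | ### 4. Failure Recovery
161 | 
162 | From the preceding sections we know that offsetLog, which stores source offsets, and StateStore, which stores computation state, are both globally highly available. In the same diagram as before, offsetLog and StateStore are highlighted in purple to indicate high availability.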
163 | 
164 | ![Spark 1.0](1.imgs/120.png)
165 | 
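166 | Executor failures are handled well by the Spark framework itself and cause no availability problems, so this section only discusses recovery from driver failures.
167 | 
168 | If the driver fails during some execution, the restarted StreamExecution:
169 | 
170 | - reads the WAL offsetLog to recover the latest offsets and related information; this replaces steps (1) and (2) of the normal flow
171 | - reads batchCommitLog to decide whether the most recent batch needs to be redone
172 | - if so, redoes steps (3a), (3b), (4), (5), (6a), (6b)
173 |   - step (5) has two cases
174 |     - (i) if the previous execution ***failed before*** (5) finished, the sink should write out the complete result in this execution
175 |     - (ii) if the previous execution ***failed after*** (5) finished, the sink may either rewrite the result (overwriting the previous output) or skip writing (because the previous execution already wrote the complete result)
176 | 
177 | This guarantees that, at the sink level, the result of every execution is ***neither duplicated nor lost***, even if one or more failures and recoveries happened along the way.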
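
The restart decision can be sketched as a pure function over the two logs. This is a simplified model, not the StreamExecution code: `RecoverySketch` and its `Action` cases are hypothetical names, and the logs are modelled as an in-memory `Map` and `Set`.

```scala
object RecoverySketch {
  sealed trait Action
  case object StartFresh extends Action
  final case class RedoBatch(batchId: Long, endOffset: Long) extends Action
  final case class StartNextBatch(batchId: Long) extends Action

  // offsetLog: batchId -> end offset written in step (2)
  // commitLog: batch ids written in step (6b)
  def decide(offsetLog: Map[Long, Long], commitLog: Set[Long]): Action =
    if (offsetLog.isEmpty) StartFresh
    else {
      val (latestId, endOffset) = offsetLog.maxBy(_._1)
      if (commitLog.contains(latestId)) StartNextBatch(latestId + 1)  // last batch fully finished
      else RedoBatch(latestId, endOffset)                             // redo it with the same offsets
    }
}
```

Redoing a batch with the offsets recorded in the WAL, rather than asking the source again, is what keeps the redone batch identical to the failed one.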
178 | 
179 | ### 5. Sources and Sinks
180 | 
181 | As we can see, a Source in Structured Streaming must be able to ***replay data by offsets*** [2]. So:
182 | 
183 | | Sources | Replayable | Built-in support | Notes |
184 | | :-----------------------------: | :------------------------------: | :-----: | :--------------------------------------: |
185 | | **HDFS-compatible file system** | ![checked](1.imgs/checked.png) | Supported | Including but not limited to text, json, csv, parquet, orc, ... |
186 | | **Kafka** | ![checked](1.imgs/checked.png) | Supported | Kafka 0.10.0+ |
187 | | **RateStream** | ![checked](1.imgs/checked.png) | Supported | Generates data at a given rate |
188 | | **RDBMS** | ![checked](1.imgs/checked.png) | *(Planned)* | Expected to be supported soon |
189 | | **Socket** | ![negative](1.imgs/negative.png) | Supported | Mainly for demos at tech conferences and talks |
190 | | **Receiver-based** | ![negative](1.imgs/negative.png) | Will not be supported | Let this older generation retire gracefully |
191 | 
192 | Likewise, a Sink in Structured Streaming must be able to ***write data idempotently*** [3]. So:
193 | 
194 | | Sinks | Idempotent writes | Built-in support | Notes |
195 | | :-----------------------------: | :------------------------------: | :-----: | :--------------------------------------: |
196 | | **HDFS-compatible file system** | ![checked](1.imgs/checked.png) | Supported | Including but not limited to text, json, csv, parquet, orc, ... |
197 | | **ForeachSink** (idempotent custom operation) | ![checked](1.imgs/checked.png) | Supported | A highly customizable sink |
198 | | **RDBMS** | ![checked](1.imgs/checked.png) | *(Planned)* | Expected to be supported soon |
199 | | **Kafka** | ![negative](1.imgs/negative.png) | Supported | Writes to Kafka are not idempotent here, so duplicates are possible<br/>(deduplicating downstream of Kafka with streaming de-duplication is recommended) |
200 | | **ForeachSink** (non-idempotent custom operation) | ![negative](1.imgs/negative.png) | Supported | Non-idempotent custom operations are not recommended |
201 | | **Console** | ![negative](1.imgs/negative.png) | Supported | Mainly for demos at tech conferences and talks |
202 | 
203 | ### 6. Recap: end-to-end exactly-once guarantees
204 | 
205 | So in Structured Streaming we can summarize the following relationship [4]:
206 | 
207 | <p align="center"><img src="1.imgs/170.png" alt="exactly-once guarantees"></p>
208 | 
209 | Here end-to-end means that if the source is something like Kafka or HDFS and the sink is something like HDFS or MySQL, Structured Streaming automatically guarantees that the results in the sink are exactly-once. Structured Streaming finally takes over the sink de-duplication logic that users previously had to maintain themselves! :-)
210 | 
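211 | ## <a name="五全文总结"></a>5. Summary
212 | 
213 | Starting with Spark 2.0, Dataset/DataFrame, originally built for structured data, was extended to handle streaming data as well, giving rise to Structured Streaming.
214 | 
215 | Structured Streaming uses the "unbounded table" as its programming model, executes incrementally inside StreamExecution, and provides an end-to-end exactly-once guarantee.
216 | 
217 | In the Spark 2.0 era, Dataset/DataFrame is the main user API. It covers structured data, streaming data, machine learning, graph processing and more, which greatly reduces what users have to learn and happily realizes the old "one stack to rule them all" ideal once again.
218 | 
219 | > With this "Structured Streaming Source Code Analysis" series and the earlier "Spark Streaming Source Code Analysis" series, we express our thanks and respect to the innovators who "make big data simple".
220 | 
221 | ## <a name="六扩展阅读"></a>6. Further Reading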
222 | 
223 | 1. Spark Summit East 2016: [The Future of Real-time in Spark](https://spark-summit.org/east-2016/events/keynote-day-3/)
224 | 2. Blog: [Continuous Applications: Evolving Streaming in Apache Spark 2.0](https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html)
225 | 3. Blog: [Structured Streaming In Apache Spark: A new high-level API for streaming](https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html)
226 | 
227 | ## <a name="参考资料"></a>References
228 | 
229 | 1. [Structured Streaming Programming Guide](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
230 | 2. [Github: org/apache/spark/sql/execution/streaming/Source.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala)
231 | 3. [Github: org/apache/spark/sql/execution/streaming/Sink.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala)
232 | 4. [A Deep Dive into Structured Streaming](http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming?qid=51953136-8233-4d5d-a1c2-ce30051f16d1&v=&b=&from_search=1)
233 | 
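234 | ## Creative Commons
235 | 
236 | ![](https://licensebuttons.net/l/by-nc/4.0/88x31.png)
237 | 
238 | Unless otherwise noted, the articles in this "Structured Streaming Source Code Analysis" series are licensed under the [CC BY-NC (Attribution-NonCommercial)](https://creativecommons.org/licenses/by-nc/4.0/) Creative Commons license.<br/>
239 | <br/>
240 | (End of article. To join the discussion, [click here](https://github.com/lw-lin/CoolplaySpark/issues/29); to return to the table of contents, [click here](.))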
241 | 


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/010.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/010.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/015.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/015.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/030.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/030.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/040.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/040.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/050.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/050.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/070.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/070.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/100.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/100.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/110.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/110.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/120.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/120.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/170.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/170.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/checked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/checked.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/1.imgs/negative.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/1.imgs/negative.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/2.1 Structured Streaming 之 Source 解析.md:
--------------------------------------------------------------------------------
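  1 | # Structured Streaming: Source Explained #
  2 | 
  3 | ***[酷玩 Spark] Structured Streaming Source Code Analysis Series***. To return to the table of contents, [click here](.)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team)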
  6 | 
  7 | ```
  8 | Versions this article applies to:
  9 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 10 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 11 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 12 | ```
 13 | 
 14 | 
 15 | 
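 16 | Before reading this article, please be sure to read [Structured Streaming: Implementation Approach and Overview](1.1%20Structured%20Streaming%20实现思路与实现概述.md) first. It outlines how Structured Streaming is implemented (including the roles of StreamExecution, Source, and Sink), and the details here are easier to follow once you have that overall picture.
 17 | 
 18 | ## Introduction
 19 | 
 20 | Structured Streaming explicitly defines three components, input (Source), execution (StreamExecution), and output (Sink), and makes each of them explicitly fault-tolerant; this is what gives the whole streaming program its end-to-end exactly-once guarantees.
 21 | 
 22 | In the source code, Source is an abstract interface, [trait Source](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala) [1], which contains the functionality Structured Streaming must have to achieve end-to-end exactly-once processing (we analyze these methods in detail shortly):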
 23 | 
 24 | ```scala
 25 | trait Source  {
 26 |   /* method (1) */ def schema: StructType
 27 |   /* method (2) */ def getOffset: Option[Offset]
 28 |   /* method (3) */ def getBatch(start: Option[Offset], end: Offset): DataFrame
 29 |   /* method (4) */ def commit(end: Offset) : Unit = {}
 30 |   /* method (5) */ def stop(): Unit
 31 | }
 32 | ```
 33 | 
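 34 | By contrast, InputDStream [2], the input abstraction of the earlier Spark Streaming, does not require sources to be reliable and replayable, so some unreliable input sources exist (such as receiver-based sources) that can lose input data on failure. Even though the Spark Streaming framework itself can redo the work, the source data no longer exists, so the computation is not exactly-once. For reliable sources such as HDFS and Kafka, of course, Spark Streaming still gives an exactly-once guarantee.
 35 | 
 36 | Structured Streaming keeps support only for ***reliable data sources***:
 37 | 
 38 | - Supported
 39 |   - Kafka, implemented as KafkaSource extends Source
 40 |   - HDFS-compatible file system, implemented as FileStreamSource extends Source
 41 |   - RateStream, implemented as RateStreamSource extends Source
 42 | - Expected to be supported soon
 43 |   - RDBMS
 44 | 
 45 | ## Source: Methods and Functionality
 46 | 
 47 | In Structured Streaming, StreamExecution drives the continuous query, repeatedly running batches as follows: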
 48 | 
 49 | ![Spark 1.0](1.imgs/110.png)
 50 | 
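 51 | 1. At the very start of each batch, StreamExecution asks the Source for its latest progress, i.e. the latest offset
 52 |      - StreamExecution does this by calling the Source's `def getOffset: Option[Offset]`, method (2)
 53 |      - The Kafka (KafkaSource) implementation of `getOffset()` uses a long-running consumer on the driver to fetch the latest offsets of each topic from the Kafka brokers (note that neither the driver nor the consumer connects to ZooKeeper directly), for example `topicA_partition1:300, topicB_partition1:50, topicB_partition2:60`, and returns those offsets
 54 |      - The HDFS-compatible file system (FileStreamSource) implementation of `getOffset()` first scans for the latest group of files, assigns it an increasing sequence number and persists the mapping, for example `2 -> {c.txt, d.txt}`, and then returns the number `2` as the latest offset
 55 | 2. StreamExecution receives this Offset and persists it to its own WAL
 56 | 3. The Source provides the data in the range `(start, end]` for the start offset and end offset requested by StreamExecution
 57 |      - StreamExecution does this by calling the Source's `def getBatch(start: Option[Offset], end: Offset): DataFrame`, method (3)
 58 |      - The start offset and end offset are usually the latest offset the Source reported in the previous batch and the latest offset it reported in this batch; note that the range is ***open on the left and closed on the right***!
 59 |      - The data is returned as a DataFrame (at this point the DataFrame only describes the data; no data has actually been fetched)
 60 | 4. StreamExecution triggers optimization and compilation of the computation logic, logicalPlan
 61 | 5. The result is written out to the Sink
 62 |      - Note that only now does the Sink trigger the actual data fetch and the computation
 63 | 6. After the data has been completely written to the Sink, StreamExecution notifies the Source that it may discard the data, and then writes the id of the successful batch to batchCommitLog
 64 |      - StreamExecution does this by calling the Source's `def commit(end: Offset): Unit`, method (4)
 65 |      - `commit()` mainly helps the Source with garbage collection; if the external data source already garbage-collects on its own, as Kafka does, the Source's `commit()` implementation can be empty and leave this to the external source
 66 | 
 67 | This covers the semantics that methods (2), (3), and (4) of Source must implement, and how they are called, during a batch executed by StreamExecution.
 68 | 
 69 | There are also methods (1) and (5):
 70 | 
 71 | - Method (1) `def schema: StructType`
 72 |   - returns the schema of this Source's data: the name, type, nullability, etc. of each column
 73 |   - called before Structured Streaming starts executing batches, not on every batch
 74 | - Method (5) `def stop(): Unit`
 75 |   - called on the Source when a continuous query ends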
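
To make the offset numbering and the `(start, end]` range concrete, here is a small file-style source modelled in plain Scala. It is a sketch under assumptions, not FileStreamSource: `ToyFileSource` is a hypothetical name, `getOffset` takes the directory listing as a parameter so the example needs no file system, and sequence numbers start at 0.

```scala
import scala.collection.mutable

// Hypothetical model of a file-style source; offsets are batch sequence numbers.
final class ToyFileSource {
  private val seenFiles = mutable.LinkedHashSet.empty[String]
  private val log = mutable.LinkedHashMap.empty[Long, Seq[String]]  // e.g. 2 -> Seq(c.txt, d.txt)

  // `listing` stands in for scanning the directory.
  def getOffset(listing: Seq[String]): Option[Long] = {
    val fresh = listing.filterNot(seenFiles.contains)
    if (fresh.nonEmpty) {
      seenFiles ++= fresh
      log.update(log.size.toLong, fresh)   // next increasing sequence number
    }
    if (log.isEmpty) None else Some(log.size.toLong - 1L)
  }

  // Files in the range (start, end]: start is exclusive, end is inclusive.
  def getBatch(start: Option[Long], end: Long): Seq[String] = {
    val first = start.map(_ + 1L).getOrElse(0L)
    (first to end).flatMap(id => log.getOrElse(id, Seq.empty[String]))
  }
}
```

Because the mapping from sequence number to files is recorded, calling `getBatch` again with the same offsets returns the same files, which is the replay property a Source needs.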
 76 | 
 77 | ## Source Implementations: HDFS-compatible file system, Kafka, Rate
 78 | 
 79 | To summarize, these are the Source implementations available so far:
 80 | 
 81 | | Sources | Replayable | Built-in support | Notes |
 82 | | :-----------------------------: | :------------------------------: | :----: | :--------------------------------------: |
 83 | | **HDFS-compatible file system** | ![checked](1.imgs/checked.png) | Supported | Including but not limited to text, json, csv, parquet, orc, ... |
 84 | | **Kafka** | ![checked](1.imgs/checked.png) | Supported | Kafka 0.10.0+ |
 85 | | **RateStream** | ![checked](1.imgs/checked.png) | Supported | Generates data at a given rate |
 86 | | **Socket** | ![negative](1.imgs/negative.png) | Supported | Mainly for demos at tech conferences and talks |
 87 | 
 88 | We want to stress that although Structured Streaming also has a built-in `socket` Source, it cannot replay data reliably and therefore does not fit the Structured Streaming architecture. Its main use is demos at tech conferences and talks; do not use it in production systems.
 89 | 
 90 | ## Further Reading
 91 | 
 92 | 1. [Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
 93 | 
 94 | ## References
 95 | 
 96 | 1. [Github: org/apache/spark/sql/execution/streaming/Source.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala)
 97 | 2. [Github: org/apache/spark/streaming/dstream/InputDStream.scala](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala)
 98 | 
 99 | <br/>
100 | <br/>
101 | (End of article. To join the discussion, [click here](https://github.com/lw-lin/CoolplaySpark/issues/31); to return to the table of contents, [click here](.))
102 | 
103 | 


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/2.2 Structured Streaming 之 Sink 解析.md:
--------------------------------------------------------------------------------
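  1 | # Structured Streaming: Sink Explained #
  2 | 
  3 | ***[酷玩 Spark] Structured Streaming Source Code Analysis Series***. To return to the table of contents, [click here](.)
  4 | 
  5 | Proudly presented by the [Tencent Ads](http://e.qq.com) engineering team (formerly the Tencent Guangdiantong engineering team)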
  6 | 
  7 | ```
  8 | Versions this article applies to:
  9 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 10 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 11 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 12 | ```
 13 | 
 14 | 
 15 | 
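 16 | Before reading this article, please be sure to read [Structured Streaming: Implementation Approach and Overview](1.1%20Structured%20Streaming%20实现思路与实现概述.md) first. It outlines how Structured Streaming is implemented (including the roles of StreamExecution, Source, and Sink), and the details here are easier to follow once you have that overall picture.
 17 | 
 18 | ## Introduction
 19 | 
 20 | Structured Streaming explicitly defines three components, input (Source), execution (StreamExecution), and output (Sink), and makes each of them explicitly fault-tolerant; this is what gives the whole streaming program its end-to-end exactly-once guarantees.
 21 | 
 22 | In the source code, Sink is an abstract interface, [trait Sink](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala) [1], with a single method: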
 23 | 
 24 | ```scala
 25 | trait Sink {
 26 |   def addBatch(batchId: Long, data: DataFrame): Unit
 27 | }
 28 | ```
 29 | 
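 30 | This single `addBatch()` method supplies the functionality Structured Streaming must have to achieve end-to-end exactly-once processing. We analyze `addBatch()` shortly.
 31 | 
 32 | By contrast, the earlier Spark Streaming has no special abstraction for output; it merely marks some dstreams as output in DStreamGraph [2]. When exactly-once behavior is needed, the programmer can use the time of the current batch to ***track and decide on their own*** whether a batch has already been executed.
 33 | 
 34 | Structured Streaming explicitly abstracts the Sink and ships several built-in Sink implementations, some of which are natively idempotent:
 35 | 
 36 | - Supported
 37 |   - HDFS-compatible file system, implemented as FileStreamSink extends Sink
 38 |   - Foreach sink, implemented as ForeachSink extends Sink
 39 |   - Kafka sink, implemented as KafkaSink extends Sink
 40 | - Expected to be supported soon
 41 |   - RDBMS
 42 | 
 43 | ## Sink: Methods and Functionality
 44 | 
 45 | In Structured Streaming, StreamExecution drives the continuous query, repeatedly running batches as follows: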
 46 | 
 47 | ![Spark 1.0](1.imgs/110.png)
 48 | 
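 49 | 1. At the very start of each batch, StreamExecution asks the Source for its latest progress, i.e. the latest offset
 50 | 2. StreamExecution receives this Offset and persists it to its own WAL
 51 | 3. The Source provides the data in the range `(start, end]` for the start offset and end offset requested by StreamExecution
 52 | 4. StreamExecution triggers optimization and compilation of the computation logic, logicalPlan
 53 | 5. The result is written out to the Sink
 54 |      - StreamExecution does this by calling `Sink.addBatch(batchId: Long, data: DataFrame)`
 55 |      - Note that only now does the Sink trigger the actual data fetch and the computation
 56 |      - Usually the Sink can simply write out the data in `data: DataFrame` and record `batchId: Long` once it is done
 57 |      - On failure recovery there are two cases:
 58 |        - (i) if the previous execution ***failed before*** this step finished, the sink should write out the complete result in this execution
 59 |        - (ii) if the previous execution ***failed after*** this step finished, the sink may either rewrite the result (overwriting the previous output) or skip writing (because the previous execution already wrote the complete result)
 60 | 6. After the data has been completely written to the Sink, StreamExecution notifies the Source that it may discard the data, and then writes the id of the successful batch to batchCommitLog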
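
The "record the batchId, skip on replay" behaviour of step 5 can be sketched as follows. `ToyIdempotentSink` is a hypothetical class, not a Spark one; it keeps everything in memory, whereas a real sink would need to persist the data and the recorded batch id atomically.

```scala
import scala.collection.mutable

// Hypothetical sink that remembers the last fully written batchId.
final class ToyIdempotentSink {
  private var latestBatchId = -1L
  val output = mutable.ArrayBuffer.empty[String]

  def addBatch(batchId: Long, rows: Seq[String]): Unit =
    if (batchId <= latestBatchId) {
      // case (ii): this batch was already written completely, so skip it
    } else {
      output ++= rows          // case (i) or a brand-new batch: write it out
      latestBatchId = batchId  // record the batch only after the write finished
    }
}
```

Replaying a batch after recovery then leaves the output unchanged, which is what makes the write idempotent at batch granularity.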
 61 | 
 62 | ## Sink Implementations: HDFS-API compatible FS, Foreach, Kafka
 63 | 
 64 | ### (a) Implementation: HDFS-API compatible FS
 65 | 
 66 | We usually write to an HDFS-API compatible FS like this:
 67 | 
 68 | ```scala
 69 | writeStream
 70 |   .format("parquet")      // parquet, csv, json, text, orc ...
 71 |   .option("checkpointLocation", "path/to/checkpoint/dir")
 72 |   .option("path", "path/to/destination/dir")
 73 | ```
 74 | 
 75 | Here is how `FileStreamSink` implements `addBatch()`:
 76 | 
 77 | ```scala
 78 |   // From: class FileStreamSink extends Sink
 79 |   // Version: Spark 2.1.0
 80 |   override def addBatch(batchId: Long, data: DataFrame): Unit = {
 81 |     /* First use the persisted fileLog to check whether this batchId has already been written */
 82 |     if (batchId <= fileLog.getLatest().map(_._1).getOrElse(-1L)) {
 83 |       /* batchId has already been written completely, so skip addBatch this time */
 84 |       logInfo(s"Skipping already committed batch $batchId")
 85 |     } else {
 86 |       /* This batch's data actually needs to be written */
 87 |       /* Initialize the FileCommitter, which correctly handles speculative tasks, task retries after failure, etc. */
 88 |       val committer = FileCommitProtocol.instantiate(
 89 |         className = sparkSession.sessionState.conf.streamingFileCommitProtocolClass,
 90 |         jobId = batchId.toString,
 91 |         outputPath = path,
 92 |         isAppend = false)
 93 | 
 94 |       committer match {
 95 |         case manifestCommitter: ManifestFileCommitProtocol =>
 96 |           manifestCommitter.setupManifestOptions(fileLog, batchId)
 97 |         case _ =>  // Do nothing
 98 |       }
 99 | 
100 |       /* Get the columns to partition by */
101 |       val partitionColumns: Seq[Attribute] = partitionColumnNames.map { col =>
102 |         val nameEquality = data.sparkSession.sessionState.conf.resolver
103 |         data.logicalPlan.output.find(f => nameEquality(f.name, col)).getOrElse {
104 |           throw new RuntimeException(s"Partition column $col not found in schema ${data.schema}")
105 |         }
106 |       }
107 | 
108 |       /* Actually write the data */
109 |       FileFormatWriter.write(
110 |         sparkSession = sparkSession,
111 |         queryExecution = data.queryExecution,
112 |         fileFormat = fileFormat,
113 |         committer = committer,
114 |         outputSpec = FileFormatWriter.OutputSpec(path, Map.empty),
115 |         hadoopConf = hadoopConf,
116 |         partitionColumns = partitionColumns,
117 |         bucketSpec = None,
118 |         refreshFunction = _ => (),
119 |         options = options)
120 |     }
121 |   }
122 | ```
123 | 
124 | ### (b) Implementation: Foreach
125 | 
126 | We usually write to the foreach sink like this:
127 | 
128 | ```scala
129 | writeStream
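130 |   /* Assume each incoming record is a String */
131 |   .foreach(new ForeachWriter[String] {
132 |     /* Each partition, i.e. each task, calls open() when it starts */
133 |     /* Note: for the same partitionId/version this method may be called several times in succession, e.g. when a task is retried after failure */
134 |     /* Note: for the same partitionId/version this method may also be called concurrently, e.g. under speculative execution */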
135 |     override def open(partitionId: Long, version: Long): Boolean = {
136 |       println(s"open($partitionId, $version)")
137 |       true
138 |     }
139 |     /* Called for every record in this partition, i.e. in this task */
140 |     override def process(value: String): Unit = println(s"process $value")
141 |     /* Called on normal or abnormal termination, although in some failure cases it may not be called at all */
142 |     override def close(errorOrNull: Throwable): Unit = println(s"close($errorOrNull)")
143 |   })
144 | ```
145 | 
146 | Here is how `ForeachSink` implements `addBatch()`:
147 | 
148 | ```scala
149 |   // From: class ForeachSink extends Sink with Serializable
150 |   // Version: Spark 2.1.0
151 |   override def addBatch(batchId: Long, data: DataFrame): Unit = {
152 |     val encoder = encoderFor[T].resolveAndBind(
153 |       data.logicalPlan.output,
154 |       data.sparkSession.sessionState.analyzer)
155 |     /* This is the RDD's foreachPartition, so it runs at task level */
156 |     data.queryExecution.toRdd.foreachPartition { iter =>
157 |       /* partition/task-level open */
158 |       if (writer.open(TaskContext.getPartitionId(), batchId)) {
159 |         try {
160 |           while (iter.hasNext) {
161 |             /* Call process() for each record */
162 |             writer.process(encoder.fromRow(iter.next()))
163 |           }
164 |         } catch {
165 |           case e: Throwable =>
166 |             /* On exception, call close() */
167 |             writer.close(e)
168 |             throw e
169 |         }
170 |         /* After a normal finish, call close() */
171 |         writer.close(null)
172 |       } else {
173 |         /* Write nothing and call close() directly */
174 |         writer.close(null)
175 |       }
176 |     }
177 |   }
178 | ```
179 | 
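180 | So the foreach sink requires the user to supply the writer, which makes it highly customizable.
181 | 
182 | Note, however, that a foreach writer may be open()ed multiple times, and several tasks may invoke the writer for the same partition at the same time. The writer should therefore always be written to be idempotent; if it is not, the Structured Streaming framework has no further means to ensure end-to-end exactly-once guarantees.
183 | 
184 | ### (c) Implementation: Kafka
185 | 
186 | KafkaSink was added in Spark 2.1.1, enabling Spark to write data to Kafka as well.
187 | 
188 | We usually write to the Kafka sink like this: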
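
One possible way to make such a writer idempotent is to key every write by `(partitionId, version, record index)` so that retries overwrite rather than append. The sketch below mirrors the open/process/close shape but deliberately avoids the Spark dependency: `IdempotentWriter` is a hypothetical class, the `TrieMap` stands in for an external store with keyed upserts, and it assumes a retried task sees its records in the same order.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical writer with the same open/process/close shape as a foreach writer.
final class IdempotentWriter(store: TrieMap[(Long, Long, Int), String]) {
  private var partitionId = 0L
  private var version = 0L
  private var index = 0

  def open(partitionId: Long, version: Long): Boolean = {
    this.partitionId = partitionId
    this.version = version
    index = 0
    true
  }

  def process(value: String): Unit = {
    // Upsert under a deterministic key: a retry rewrites the same entries.
    store.put((partitionId, version, index), value)
    index += 1
  }

  def close(errorOrNull: Throwable): Unit = ()
}
```

With this keying, running the same partition and version twice leaves the store in the same state as running it once.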
189 | 
190 | ```scala
191 | writeStream
192 |   .format("kafka")
193 |   .option("checkpointLocation", ...)
194 |   .outputMode(...)
195 |   .option("kafka.bootstrap.servers", ...) // which cluster to write to
196 |   .option("topic", ...) // which topic to write to
197 | ```
198 | 
199 | Here is how `KafkaSink` implements `addBatch()`:
200 | 
201 | ```scala
202 |   // From: class KafkaSink extends Sink
203 |   // Version: Spark 2.1.1, 2.2.0
204 |   override def addBatch(batchId: Long, data: DataFrame): Unit = {
205 |     if (batchId <= latestBatchId) {
206 |       logInfo(s"Skipping already committed batch $batchId")
207 |     } else {
208 |       // The write is done mainly through KafkaWriter.write();
209 |       // inside KafkaWriter.write(), it is in turn done mainly through KafkaWriteTask.execute()
210 |       KafkaWriter.write(sqlContext.sparkSession,
211 |         data.queryExecution, executorKafkaParams, topic)
212 |       latestBatchId = batchId
213 |     }
214 |   }
215 | ```
216 | 
217 | Next, here is how `KafkaWriteTask` implements `execute()`:
218 | 
219 | ```scala
220 |   // From: class KafkaWriteTask
221 |   // Version: Spark 2.1.1, 2.2.0
222 |   def execute(iterator: Iterator[InternalRow]): Unit = {
223 |     producer = new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
224 |     while (iterator.hasNext && failedWrite == null) {
225 |       val currentRow = iterator.next()
226 |       // The projection builds projectedRow such that:
227 |       // element 0 is the topic
228 |       // element 1 is the binary form of the key
229 |       // element 2 is the binary form of the value
230 |       val projectedRow = projection(currentRow)
231 |       val topic = projectedRow.getUTF8String(0)
232 |       val key = projectedRow.getBinary(1)
233 |       val value = projectedRow.getBinary(2)
234 |       if (topic == null) {
235 |         throw new NullPointerException(s"null topic present in the data. Use the " +
236 |         s"${KafkaSourceProvider.TOPIC_OPTION_KEY} option for setting a default topic.")
237 |       }
238 |       val record = new ProducerRecord[Array[Byte], Array[Byte]](topic.toString, key, value)
239 |       val callback = new Callback() {
240 |         override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
241 |           if (failedWrite == null && e != null) {
242 |             failedWrite = e
243 |           }
244 |         }
245 |       }
246 |       producer.send(record, callback)
247 |     }
248 |   }
249 | ```
250 | 
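251 | One thing to note: Spark itself retries on failure, including retries of a single task, of a stage, or of the whole topology, so the same record may be written to Kafka more than once. Because the write path shown above is not transactional, the extra records cannot be undone and some duplicates result. Kafka's own highly available write path can also produce duplicates, for example when the ack for data already written to a broker fails to reach the producer and the producer resends the data.
252 | 
253 | Until this write path becomes transactional, downstream consumers may need their own de-duplication. If the downstream is still Structured Streaming, for example, streaming deduplication can be used to obtain de-duplicated results.
254 | 
255 | ## Summary
256 | 
257 | To summarize, these are the Sink implementations available so far:
258 | 
259 | | Sinks | Idempotent writes | Built-in support | Notes |
260 | | :-----------------------------: | :------------------------------: | :----: | :--------------------------------------: |
261 | | **HDFS-compatible file system** | ![checked](1.imgs/checked.png) | Supported | Including but not limited to text, json, csv, parquet, orc, ... |
262 | | **ForeachSink** (idempotent custom operation) | ![checked](1.imgs/checked.png) | Supported | A highly customizable sink |
263 | | **Kafka** | ![negative](1.imgs/negative.png) | Supported | Writes to Kafka are not idempotent here, so duplicates are possible<br/>(deduplicating downstream of Kafka with streaming de-duplication is recommended) |
264 | | **ForeachSink** (non-idempotent custom operation) | ![negative](1.imgs/negative.png) | Supported | Non-idempotent custom operations are not recommended |
265 | 
266 | We want to stress that although Structured Streaming also has a built-in `console` Sink, its main use is demos at tech conferences and talks; do not use it in production systems.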
267 | 
268 | ## References
269 | 
270 | 1. [Github: org/apache/spark/sql/execution/streaming/Sink.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala)
271 | 2. [Github: org/apache/spark/streaming/DStreamGraph.scala](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala)
272 | 
273 | <br/>
274 | <br/>
275 | (End of article. To join the discussion, [click here](https://github.com/lw-lin/CoolplaySpark/issues/32); to return to the table of contents, [click here](.))
276 | 


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/3.1 Structured Streaming 之状态存储解析.md:
--------------------------------------------------------------------------------
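  1 | # Structured Streaming: State Store Explained #
  2 | 
  3 | ***[CoolplaySpark] Structured Streaming Source Code Analysis Series***. To return to the table of contents, [click here](.)
  4 | 
  5 | Proudly presented by the [Tencent Ads (腾讯广告)](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | This article applies to:
  9 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 10 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 11 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 12 | ```
 13 | 
 14 | 
 15 | 
 16 | Before reading this article, please be sure to read [Structured Streaming: Implementation Approach and Overview](1.1%20Structured%20Streaming%20实现思路与实现概述.md) first. It outlines how Structured Streaming is implemented (including the roles of StreamExecution, StateStore, and others); with that overall picture in mind, the details here are easier to follow.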
 17 | 
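 18 | ## Introduction
 19 | 
 20 | As we know, StreamExecution, the engine that drives a continuous query, keeps driving the execution of one batch after another.
 21 | 
 22 | Continuous queries that do not span batches, such as `map()` and `filter()`, execute each batch independently and need no state. Aggregating continuous queries such as `count()`, by contrast, need state across batches, so that the current batch depends only on the result of the previous batch rather than on the results of all earlier batches. This is the incremental continuous query: it keeps the execution time of each batch stable and avoids later batches taking longer and longer.
 23 | 
 24 | We covered the idea and implementation of incremental continuous queries in [Structured Streaming: Implementation Approach and Overview](1.1%20Structured%20Streaming%20实现思路与实现概述.md):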
 25 | 
 26 | <p align="center"><img src="1.imgs/120.png"></p>
 27 | 
 28 | The StateStore in this picture is the Structured Streaming component that stores state across batches. This article analyzes the StateStore module.
 29 | 
 30 | ## Overall design of the StateStore module
 31 | 
 32 | <p align="center"><img src="3.imgs/100.png"></p>
 33 | 
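 34 | The overall design of the StateStore module:
 35 | - Distributed implementation
 36 |     - Runs on Spark's existing driver-executors architecture
 37 |     - The driver side is a lightweight coordinator that only coordinates
 38 |     - The executor side does the actual reading and writing of state shards
 39 | - State sharding
 40 |     - An application may contain several stateful operators, and each operator itself runs in partitions, so state storage is sharded by `operatorId` + `partitionId`
 41 |     - State is read in and written out one shard at a time
 42 |     - Each shard is a key-value store whose keys and values are both of type `UnsafeRow` (think of it as SparkSQL's generic Object type); it can be queried or updated by key
 43 | - State versioning
 44 |     - Because StreamExecution keeps executing batches, the state of a given operator and partition keeps being updated over time, producing new versions
 45 |     - State versions track the progress of StreamExecution: for example, once StreamExecution's batch id = 7 completes, all state with version = 7 has been persisted
 46 | - Reading and writing shards in bulk
 47 |     - For each shard, on read
 48 |         - Data is read from HDFS by operator + partition + version and cached in memory
 49 |     - For each shard, on write
 50 |         - The row-level state changes of the current version (that is, StreamExecution's current batch) are accumulated and written to HDFS in one go as a change log; once the change log is written, the state changes of this batch are complete
 51 |         - The changes are also applied to the in-memory state cache
 52 | 
 53 | The following figure may help with understanding StateStore's operator, partition, and version: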
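
The sharding and versioning ideas above can be sketched in plain Scala. This is a minimal in-memory illustration under stated assumptions, not Spark's implementation: `ShardId`, `ShardSketch`, and `deltaLog` are hypothetical names, `String`/`Long` stand in for `UnsafeRow` keys and values, and a map stands in for the change-log files on HDFS.

```scala
import scala.collection.mutable

// Hypothetical sketch types; String/Long stand in for UnsafeRow keys and values.
final case class ShardId(operatorId: Long, partitionId: Int)

final class ShardSketch(val id: ShardId) {
  // State as of the last committed version.
  private var committed: Map[String, Long] = Map.empty
  // Changes of the batch in progress: Some(v) is a put, None is a removal.
  private val pending = mutable.LinkedHashMap.empty[String, Option[Long]]
  private var currentVersion: Long = 0L
  // Stand-in for the per-version change-log files on HDFS.
  val deltaLog = mutable.LinkedHashMap.empty[Long, Map[String, Option[Long]]]

  def version: Long = currentVersion

  // Reads see uncommitted changes of the current batch first.
  def get(key: String): Option[Long] =
    pending.getOrElse(key, committed.get(key))

  def put(key: String, value: Long): Unit = pending.update(key, Some(value))
  def remove(key: String): Unit = pending.update(key, None)

  // Write all accumulated changes as one delta, apply them, and bump the version.
  def commit(): Long = {
    val delta = pending.toMap
    deltaLog.update(currentVersion + 1, delta)
    committed = delta.foldLeft(committed) {
      case (state, (key, Some(value))) => state.updated(key, value)
      case (state, (key, None))        => state - key
    }
    pending.clear()
    currentVersion += 1
    currentVersion
  }

  // Drop the changes of the batch in progress; committed state is untouched.
  def abort(): Unit = pending.clear()
}
```

In this sketch one `ShardSketch` corresponds to one `operatorId` + `partitionId` pair, and each `commit()` produces exactly one delta, mirroring the "one change log per batch" idea described in the list.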
 54 | 
 55 | <p align="center"><img src="3.imgs/200.png"></p>
 56 | 
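 57 | ## StateStore: (a) migration, (b) updates and queries, (c) maintenance, (d) failure recovery
 58 | 
 59 | <p align="center"><img src="3.imgs/100.png"></p>
 60 | 
 61 | ### (a) How a StateStore migrates between nodes
 62 | 
 63 | While StreamExecution runs, a state store shard is brought up on demand on whichever executor node actually runs the operator, and the previous version's data is read in (if the executor already holds the shard, it is simply reused, with no need to bring it up or read data again).
 64 | 
 65 | As covered in the previous section, persisted state lives on HDFS. As the figure above shows:
 66 | 
 67 | - `executor a` brings up the state store shard for `operator = 1, partition = 1` and loads the `version = 5` data from the HDFS replica located on its own machine;
 68 | - An executor node can run several operators, so several state store shards (each for a different operator + partition) can be brought up on one executor, as with `executor b` in the figure;
 69 | - In some cases the state data has to be loaded from an HDFS replica on another node, as when `executor c` in the figure loads data from `executor b`'s disk;
 70 | - In other cases the same data is loaded onto different executors at the same time, as with `executor d` and `executor a`, which read the same data. Speculative execution easily causes this. It is not a problem: both nodes load the same data and modify it independently, and in the end only one of them can successfully commit its changes to the state.
 71 | 
 72 | ### (b) Updating and querying a StateStore
 73 | 
 74 | As mentioned earlier, a state store shard is a key-value store. It supports the following operations: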
 75 | 
 76 | ```scala
 77 |   /* == CRUD =============================== */
 78 | 
 79 |   // Look up one key-value pair
 80 |   def get(key: UnsafeRow): Option[UnsafeRow]
 81 |     
 82 |   // Insert or update one key-value pair
 83 |   def put(key: UnsafeRow, value: UnsafeRow): Unit
 84 |     
 85 |   // Remove the key-value pairs that match a condition
 86 |   def remove(condition: UnsafeRow => Boolean): Unit
 87 |   // Remove a key-value pair by key
 88 |   def remove(key: UnsafeRow): Unit
 89 |   
 90 |   /* == Batch operations =============================== */
 91 |     
 92 |   // Commit all changes of the current batch, flushing them to HDFS; the version is incremented on success
 93 |   def commit(): Long
 94 | 
 95 |   // Discard all changes of the current batch
 96 |   def abort(): Unit
 97 |     
 98 |   // All key-value state of the current shard at the current version
 99 |   def iterator(): Iterator[(UnsafeRow, UnsafeRow)]
100 |     
101 |   // All incremental updates of the current shard's current version relative to the previous version
102 |   def updates(): Iterator[StoreUpdate]
103 | ```
104 | 
105 | Code that uses StateStore can be written as follows (at present only Structured Streaming's internals use StateStore, so end users do not have to deal with these details):
106 | 
107 | ```scala
108 |   // First, obtain the right state shard (reusing an existing shard or loading a new one as needed)
109 |   val store = StateStore.get(StateStoreId(checkpointLocation, operatorId, partitionId), ..., version, ...)
110 | 
111 |   // Make some changes
112 |   store.put(...)
113 |   store.remove(...)
114 |     
115 |   // Changes done: commit the changes buffered in memory to HDFS in one batch
116 |   store.commit()
117 |     
118 |   // Inspect all key-value pairs of the current shard / the key-value pairs just updated
119 |   store.iterator()
120 |   store.updates()
121 | ```
122 | 
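123 | ### (c) StateStore maintenance
124 | 
125 | As we saw above, when StateStore writes out state updates, it writes a change log.
126 | 
127 | StateStore also comes with a maintenance module that periodically, in the background, merges the earlier state with the change logs of the most recent versions and writes the merged result back to HDFS: `old_snapshot + delta_a + delta_b + … => latest_snapshot`.
128 | 
129 | This process is much like HBase's major/minor compaction, though it does not yet distinguish between major and minor.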
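
The merge `old_snapshot + delta_a + delta_b + … => latest_snapshot` can be illustrated with a small sketch. It is an assumption-laden simplification, not the maintenance code itself: `CompactionSketch`, `State`, and `Delta` are hypothetical names, and a delta is modeled as a map from key to `Some(newValue)` for a put or `None` for a removal.

```scala
object CompactionSketch {
  type State = Map[String, Long]
  // A delta maps a key to Some(newValue) for a put, or None for a removal.
  type Delta = Map[String, Option[Long]]

  // Apply one version's change log on top of a state.
  def applyDelta(state: State, delta: Delta): State =
    delta.foldLeft(state) {
      case (acc, (key, Some(value))) => acc.updated(key, value)
      case (acc, (key, None))        => acc - key
    }

  // old_snapshot + delta_a + delta_b + ... => latest_snapshot (deltas in version order).
  def compact(oldSnapshot: State, deltas: Seq[Delta]): State =
    deltas.foldLeft(oldSnapshot)((state, delta) => applyDelta(state, delta))
}
```

The order of the deltas matters: a later delta overrides an earlier one for the same key, which is why the sketch folds from the oldest version to the newest.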
130 | 
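131 | ### (d) StateStore failure recovery
132 | 
133 | HDFS is the source of truth for all StateStore state. If a state shard fails partway through an update, the updates that have not yet been written out are not visible.
134 | 
135 | Recovery likewise reads the latest visible state from HDFS and redoes the work together with StreamExecution's batch. Put another way, everything (the input data and the state store) first rolls back to where it was at the start of the current batch and then recomputes. The granularity of this recomputation is a single Spark task, that is, one partition of input data plus one partition of state.
136 | 
137 | When reading the latest visible state from HDFS, the latest snapshot is used if there is one; otherwise a slightly older snapshot and the newer deltas are read and merged into the latest state first.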
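
The recovery rule just described (use the newest snapshot that is available, then replay the deltas after it) can be sketched as follows. This is a hypothetical illustration: `RecoverySketch.load` and its in-memory `snapshots` and `deltas` maps are assumptions standing in for files on HDFS, and version 0 is taken to be the empty state.

```scala
object RecoverySketch {
  type State = Map[String, Long]
  // Some(newValue) is a put, None is a removal.
  type Delta = Map[String, Option[Long]]

  // Rebuild the state at `version` from the newest snapshot at or below it,
  // replaying the deltas of every later version up to `version`.
  def load(version: Long, snapshots: Map[Long, State], deltas: Map[Long, Delta]): State = {
    val candidates = snapshots.keys.filter(_ <= version)
    val base = if (candidates.isEmpty) 0L else candidates.max
    val start = snapshots.getOrElse(base, Map.empty[String, Long])
    ((base + 1) to version).foldLeft(start) { (state, v) =>
      // deltas(v) throws if a change log is missing; a real store would have to handle that.
      deltas(v).foldLeft(state) {
        case (acc, (key, Some(value))) => acc.updated(key, value)
        case (acc, (key, None))        => acc - key
      }
    }
  }
}
```

When a snapshot exists for the requested version, the range of deltas to replay is empty and the snapshot is returned as is.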
138 | 
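139 | ## Summary
140 | 
141 | In Structured Streaming, the StateStore module provides a ***sharded***, ***versioned***, ***migratable***, ***highly available*** key-value store.
142 | 
143 | On top of the StateStore module, StreamExecution implements ***incremental*** continuous queries and solid failure recovery to maintain ***end-to-end exactly-once guarantees***.
144 | 
145 | ## Further reading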
146 | 
147 | 1. [Github: org/apache/spark/sql/execution/streaming/state/StateStore.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala)
148 | 2. [Github: org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala)
149 | 
150 | <br/>
151 | <br/>
152 | (End of article. To join the discussion, [click here](https://github.com/lw-lin/CoolplaySpark/issues/33); to return to the table of contents, [click here](.))
153 | 


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/3.imgs/100.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/3.imgs/100.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/3.imgs/200.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/3.imgs/200.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.1 Structured Streaming 之 Event Time 解析.md:
--------------------------------------------------------------------------------
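  1 | # Structured Streaming: Event Time Explained #
  2 | 
  3 | ***[CoolplaySpark] Structured Streaming Source Code Analysis Series***. To return to the table of contents, [click here](.)
  4 | 
  5 | Proudly presented by the [Tencent Ads (腾讯广告)](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | This article applies to:
  9 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 10 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 11 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 12 | ```
 13 | 
 14 | 
 15 | 
 16 | Before reading this article, please be sure to read [Structured Streaming: Implementation Approach and Overview](1.1%20Structured%20Streaming%20实现思路与实现概述.md) first. It outlines how Structured Streaming is implemented; with that overall picture in mind, the details here are easier to follow.
 17 | 
 18 | ## Event Time !
 19 | 
 20 | In the Spark Streaming era there were unofficial attempts at event time support [1]; Structured Streaming, its successor, adds native support for event time.
 21 | 
 22 | Let's look at an example from the official programming guide [2]: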
 23 | 
 24 | ```scala
 25 | import spark.implicits._
 26 | 
 27 | val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 28 | 
 29 | // Group the data by window and word and compute the count of each group
 30 | // Please note: we'll revise this example in <Structured Streaming: Watermark Explained>
 31 | val windowedCounts = words.groupBy(
 32 |   window($"timestamp", "10 minutes", "5 minutes"),
 33 |   $"word"
 34 | ).count()
 35 | ```
 36 | 
 37 | The execution proceeds as shown in the figure below.
 38 | 
 39 | <p align="center"><img src="4.imgs/100.png"></p>
 40 | 
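 41 | - We have a series of arriving records
 42 | - First comes a *window()* operation on the time column `timestamp`, with a length of `10m` and a slide of `5m`
 43 |   - For example, in the dashed box at the top right of the figure, when the record `12:22|dog` arrives, `12:22` falls into two windows, `12:15-12:25` and `12:20-12:30`, so two records are produced: `12:15-12:25|dog` and `12:20-12:30|dog`. Likewise, the record `12:24|dog owl` produces two records: `12:15-12:25|dog owl` and `12:20-12:30|dog owl`
 44 |   - So the *window()* operation is essentially an *explode()*: one record can produce several
 45 | - Then the result of *window()* is keyed by the `window` and `word` columns, and *groupBy().count()* is applied
 46 |   - This aggregation is incremental (with the help of StateStore)
 47 | - The final result is a state set with three columns: `window`, `word`, and `count`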
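
The "window() is essentially explode()" step can be checked with a few lines of plain Scala. This is only a sketch of the arithmetic, assuming times expressed as minutes since midnight; `WindowSketch`, `windowsFor`, and `explode` are hypothetical helpers, not Spark functions.

```scala
object WindowSketch {
  // All sliding windows [start, start + lengthMin) that contain eventMin,
  // with window starts aligned to multiples of slideMin. Times are minutes since midnight.
  def windowsFor(eventMin: Int, lengthMin: Int, slideMin: Int): List[(Int, Int)] = {
    val lastStart = eventMin - (eventMin % slideMin) // latest window start <= event time
    Iterator.iterate(lastStart)(_ - slideMin)
      .takeWhile(start => start + lengthMin > eventMin)
      .toList
      .reverse
      .map(start => (start, start + lengthMin))
  }

  def fmt(min: Int): String = f"${min / 60}%02d:${min % 60}%02d"

  // One input record becomes one output record per window (10m length, 5m slide).
  def explode(eventMin: Int, words: String): List[(String, String)] =
    windowsFor(eventMin, 10, 5).map { case (s, e) => (s"${fmt(s)}-${fmt(e)}", words) }
}
```

For the record `12:22|dog`, `WindowSketch.explode(12 * 60 + 22, "dog")` yields the two windowed records `12:15-12:25|dog` and `12:20-12:30|dog` described in the list.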
 48 | 
 49 | ## Handling Late Data
 50 | 
 51 | Continuing with the *window()* + *groupBy().count()* example above, note that there is one late record, `12:06|cat`:
 52 | 
 53 | <p align="center"><img src="4.imgs/150.png"></p>
 54 | 
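 55 | As you can see, the late data here is correctly applied to the right place in the State.
 56 | 
 57 | ## OutputModes
 58 | 
 59 | Let's continue with the *window()* + *groupBy().count()* example and now consider emitting the results, that is, the OutputModes: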
 60 | 
 61 | #### (a) Complete
 62 | 
 63 | The output in Complete mode is exactly the same as the State:
 64 | 
 65 | <p align="center"><img src="4.imgs/200.png"></p>
 66 | 
 67 | #### (b) Append
 68 | 
 69 | Append semantics guarantee that once a key has been output, the same key will never be output again.
 70 | 
 71 | <p align="center"><img src="4.imgs/210.png"></p>
 72 | 
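 73 | So in the figure above, directly outputting `12:00-12:10|cat|1` and `12:05-12:15|cat|1` in the `12:10` batch would be wrong: the `12:20` batch updates the result to `12:00-12:10|cat|2`, yet in Append mode `12:00-12:10|cat|2` would not be output again, because the result `12:00-12:10|cat|1` for the same key `12:00-12:10|cat` has already been output.
 74 | 
 75 | To solve this, in Append mode Structured Streaming needs to know when the result for a given key will no longer be updated. Once it is certain that the result will not change (the next article explains in detail how the watermark establishes this), the result can be output.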
 76 | 
 77 | <p align="center"><img src="4.imgs/220.png"></p>
 78 | 
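 79 | As the figure above shows, if we are certain that from the `12:30` batch on there will be no more updates to the window `12:00-12:10`, we can output the result for `12:00-12:10` in the `12:30` batch. Later batches are then guaranteed not to output results for the `12:00-12:10` window again, which preserves Append semantics.
 80 | 
 81 | #### (c) Update
 82 | 
 83 | Update mode has been officially supported since Spark 2.1.1.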
 84 | 
 85 | <p align="center"><img src="4.imgs/230.png"></p>
 86 | 
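 87 | As the figure above shows, in Update mode only the entries in the State that were updated in the current batch are output:
 88 | 
 89 | - In the 12:10 batch, both entries in the State are new (and therefore updated), so both are output;
 90 | - In the 12:20 batch, 2 entries in the State are updated and 4 are new (and therefore also updated), so all 6 are output;
 91 | - In the 12:30 batch, 4 entries in the State are updated, so 4 are output. One point deserves special attention here: as in Append mode, this batch establishes (through the watermark mechanism) that the window `12:00-12:10` will no longer be updated, so it is removed from the State, but this removal produces no output.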
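
The difference between the three output modes can be summarized in a small sketch. It is a simplification under stated assumptions: `OutputModeSketch` is a hypothetical helper, "updated" is approximated as "new key or changed value", and the set of finalized windows is passed in by hand rather than derived from a watermark.

```scala
object OutputModeSketch {
  type Key = (String, String) // (window, word)
  type State = Map[Key, Long]

  // Complete: the whole state is output after every batch.
  def complete(after: State): State = after

  // Update: only entries that are new or whose value changed in this batch.
  def update(before: State, after: State): State =
    after.filter { case (key, value) => !before.get(key).contains(value) }

  // Append: only entries whose window is known to be final.
  def append(after: State, finalizedWindows: Set[String]): State =
    after.filter { case ((window, _), _) => finalizedWindows(window) }
}
```

The sketch leaves out the removal of finalized entries from the State, which the 12:30 item above describes; it only shows which rows each mode would emit for a given before/after pair.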
 92 | 
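 93 | ## Summary
 94 | 
 95 | This article covered Structured Streaming's native support for event time, including window(), incremental aggregation with groupBy(), support for late data, and the output produced in Complete, Append, and Update modes.
 96 | 
 97 | ## Further reading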
 98 | 
 99 | 1. [Github: org/apache/spark/sql/catalyst/analysis/Analyzer.scala#TimeWindowing](https://github.com/apache/spark/blob/v2.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2232)
100 | 2. [Github: org/apache/spark/sql/catalyst/expressions/TimeWindow](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala)
101 | 
102 | ## References
103 | 
104 | 1. https://github.com/cloudera/spark-dataflow
105 | 2. [Structured Streaming Programming Guide - Window Operations on Event Time](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time)
106 | 
107 | <br/>
108 | <br/>
109 | (End of article. To join the discussion, [click here](https://github.com/lw-lin/CoolplaySpark/issues/34); to return to the table of contents, [click here](.))
110 | 


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.2 Structured Streaming 之 Watermark 解析.md:
--------------------------------------------------------------------------------
  1 | # 4.2 Structured Streaming: Watermark Explained #
  2 | 
  3 | ***[CoolplaySpark] Structured Streaming Source Code Analysis Series***. To return to the table of contents, [click here](.)
  4 | 
  5 | Proudly presented by the [Tencent Ads (腾讯广告)](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
  6 | 
  7 | ```
  8 | This article applies to:
  9 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 10 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 11 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
 12 | ```
 13 | 
 14 | 
 15 | 
 16 | Before reading this article, please be sure to read [Structured Streaming: Event Time Explained](4.1%20Structured%20Streaming%20之%20Event%20Time%20解析.md) first, which covers Event Time in Structured Streaming and why a Watermark is needed.
 17 | 
 18 | ## Introduction
 19 | 
 20 | <p align="center"><img src="4.imgs/220.png"></p>
 21 | 
 22 | Take the example from the previous article, [Structured Streaming: Event Time Explained](4.1%20Structured%20Streaming%20之%20Event%20Time%20解析.md). When:
 23 | 
 24 | - (a) *window()* + *groupBy().count()* is applied on event time, that is, state is used to aggregate across batches, and
 25 | - (b) the output mode is Append,
 26 | 
 27 | we need to know that after `12:30` ends there will be no more updates to `window 12:00-12:10`, so that the 1 result for `window 12:00-12:10` can be output when the `12:30` batch ends.
 28 | 
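 29 | ## The Watermark mechanism
 30 | 
 31 | Generalizing the example above a little, when:
 32 | 
 33 | - (a+) *window()* + *groupBy().aggregation()* is applied on event time, that is, state is used to aggregate across batches, and
 34 | - (b+) the output mode is Append or Update,
 35 | 
 36 | Structured Streaming relies on the watermark mechanism to bound the growth of the state store and (in Append mode) to output results that will no longer change as early as possible.
 37 | 
 38 | Conversely, if the mode is neither Append nor Update, or if it is Append or Update but no state is needed for aggregation across batches, the watermark mechanism does not need to be enabled.
 39 | 
 40 | Concretely, the watermark mechanism is enabled like this: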
 41 | 
 42 | ```scala
 43 | val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 44 | 
 45 | // Group the data by window and word and compute the count of each group
 46 | val windowedCounts = words
 47 |     .withWatermark("timestamp", "10 minutes")  // note the watermark setting here!
 48 |     .groupBy(
 49 |         window($"timestamp", "10 minutes", "5 minutes"),
 50 |         $"word")
 51 |     .count()
 52 | ```
 53 | 
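 54 | This tells Structured Streaming that, taking the maximum value of the `timestamp` column as the anchor, data older than 10 minutes before that anchor will no longer arrive. That value, the current maximum timestamp minus 10 minutes, is a Long that keeps being updated as timestamps advance, and it is the watermark.
 55 | 
 56 | <p align="center"><img src="4.imgs/220.png"></p>
 57 | 
 58 | So, in the figure shown earlier:
 59 | 
 60 | - After the `12:20` batch ends, the anchor becomes `12:20`, the event time of the record `12:20|dog owl`, and the watermark becomes `12:20 - 10min = 12:10`;
 61 | - So when the `12:30` batch ends, we know that data with event time before `12:10` will no longer arrive, hence the result for the window `12:00-12:10` will no longer be updated and `12:00-12:10|cat|2` can be output safely;
 62 | - Once `12:00-12:10|cat|2` has been output, the State no longer keeps anything about the window `12:00-12:10`; in other words, this entry in the State Store has been cleaned up.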
 63 | 
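 64 | ## Watermark progress, illustrated
 65 | 
 66 | The example in the figure below, taken from the official guide [1], shows intuitively how the watermark advances with event time (the parameters are the same as in the earlier example):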
 67 | 
 68 | <p align="center"><img src="4.imgs/300.png"></p>
 69 | 
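 70 | ## Watermark progress, in detail
 71 | 
 72 | ### (a) Saving and restoring the watermark
 73 | 
 74 | As we know, after each incremental execution (that is, each IncrementalExecution) of StreamExecution starts, the source offsets are first persisted to the offsetLog on the driver side, which is step (1) in the figure below. In practice, the current watermark and related values are saved along with them in this step.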
 75 | 
 76 | ![Spark 1.0](1.imgs/120.png)
 77 | 
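 78 | This way, on failure recovery the watermark value can be restored from the offsetLog. On first start, when there is no offsetLog yet, the watermark is initialized to 0.
 79 | 
 80 | ### (b) The watermark as a filter condition
 81 | 
 82 | When each incremental execution (IncrementalExecution) of StreamExecution starts, the driver's latest watermark value (that is, the value already written to the offsetLog) is added to the logicalPlan of the whole execution as a filter condition.
 83 | 
 84 | Specifically, in Append and Update modes, when interaction with the StateStore is needed, the filter condition is set by the following code: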
 85 | 
 86 | ```scala
 87 | /** Generate a predicate that matches data older than the watermark */
 88 |   private lazy val watermarkPredicate: Option[Predicate] = {
 89 |     val optionalWatermarkAttribute =
 90 |       keyExpressions.find(_.metadata.contains(EventTimeWatermark.delayKey))
 91 | 
 92 |     optionalWatermarkAttribute.map { watermarkAttribute =>
 93 |       // If we are evicting based on a window, use the end of the window.  Otherwise just
 94 |       // use the attribute itself.
 95 |       val evictionExpression =
 96 |         if (watermarkAttribute.dataType.isInstanceOf[StructType]) {
 97 |           LessThanOrEqual(
 98 |             GetStructField(watermarkAttribute, 1),
 99 |             Literal(eventTimeWatermark.get * 1000))
100 |         } else {
101 |           LessThanOrEqual(
102 |             watermarkAttribute,
103 |             Literal(eventTimeWatermark.get * 1000))
104 |         }
105 | 
106 |       logInfo(s"Filtering state store on: $evictionExpression")
107 |       newPredicate(evictionExpression, keyExpressions)
108 |     }
109 |   }
110 | ```
111 | 
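112 | In short, it filters on `event time field <= watermark`.
113 | 
114 | So in Append mode, the state in the StateStore that matches this filter is output, because it will not be updated again; in Update mode, the state that matches this filter is deleted, because it will not be updated again.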
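
The effect of the `event time field <= watermark` condition on windowed state can be sketched as a simple partition of the state. This is an illustration under assumptions, not the Spark operator: `EvictionSketch`, `Window`, and `split` are hypothetical names, and times are plain milliseconds.

```scala
object EvictionSketch {
  final case class Window(startMs: Long, endMs: Long)
  type State = Map[(Window, String), Long] // (window, word) -> count

  // Same idea as the predicate above: a windowed key matches once its window end <= watermark.
  def evictable(key: (Window, String), watermarkMs: Long): Boolean =
    key._1.endMs <= watermarkMs

  // Returns (entries matching the filter, entries kept in the state).
  // In Append mode the first part is what gets output; in Update mode it is what gets deleted.
  def split(state: State, watermarkMs: Long): (State, State) =
    state.partition { case (key, _) => evictable(key, watermarkMs) }
}
```

With a watermark of `12:10`, the window `12:00-12:10` matches (its end equals the watermark) while `12:05-12:15` stays in the state, which is consistent with the `12:30` batch in the earlier figure discussion.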
115 | 
116 | ### (c) Updating the watermark
117 | 
118 | During a single incremental execution, each partition (that is, each task) collects event time statistics while processing every record:
119 | 
120 | ```scala
121 | // from EventTimeWatermarkExec
122 | case class EventTimeStats(var max: Long, var min: Long, var sum: Long, var count: Long) {
123 |   def add(eventTime: Long): Unit = {
124 |     this.max = math.max(this.max, eventTime)
125 |     this.min = math.min(this.min, eventTime)
126 |     this.sum += eventTime
127 |     this.count += 1
128 |   }
129 | 
130 |   def merge(that: EventTimeStats): Unit = {
131 |     this.max = math.max(this.max, that.max)
132 |     this.min = math.min(this.min, that.min)
133 |     this.sum += that.sum
134 |     this.count += that.count
135 |   }
136 | 
137 |   def avg: Long = sum / count
138 | }
139 | ```
140 | 
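141 | Each partition (each task) thus collects the `max`, `min`, `sum`, and `count` of event time. When the whole job ends, the `EventTimeStats` of all partitions (tasks) are gathered on the driver.
142 | 
143 | On the driver, after each incremental execution ends, the maximum is taken over all the collected eventTimeStats and the watermark is updated if needed (it may or may not change this time):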
144 | 
145 | ```scala
146 | // from StreamExecution
147 | lastExecution.executedPlan.collect {
148 |   case e: EventTimeWatermarkExec if e.eventTimeStats.value.count > 0 =>
149 |     logDebug(s"Observed event time stats: ${e.eventTimeStats.value}")
150 |     /* max of the collected eventTimeStats, minus the delayMs given earlier in withWatermark() */
151 |     /* the result becomes newWatermarkMs */
152 |     e.eventTimeStats.value.max - e.delayMs
153 | }.headOption.foreach { newWatermarkMs =>
154 |   /* compare newWatermarkMs with the current batchWatermarkMs */
155 |   if (newWatermarkMs > offsetSeqMetadata.batchWatermarkMs) {
156 |     /* update the current batchWatermarkMs to newWatermarkMs */
157 |     logInfo(s"Updating eventTime watermark to: $newWatermarkMs ms")
158 |     offsetSeqMetadata.batchWatermarkMs = newWatermarkMs
159 |   } else {
160 |     /* the current batchWatermarkMs does not need updating */
161 |     logDebug(
162 |       s"Event time didn't move: $newWatermarkMs < " +
163 |       s"${offsetSeqMetadata.batchWatermarkMs}")
164 |   }
165 | }
166 | ```
167 | 
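168 | So we see that during a single incremental execution, specifically while the filtering of step (b) above (the watermark as a filter condition) is being done, the watermark stays unchanged.
169 | 
170 | Only when the incremental execution ends is the watermark updated, based on the collected eventTimeStats. The updated watermark is then saved, and restored on failure, as analyzed in (a) above.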
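
The update rule at the end of a batch (largest observed event time minus the delay, applied only if it moves the watermark forward) can be condensed into one function. This is a hedged sketch of the rule rather than the StreamExecution code: `WatermarkSketch`, `nextWatermark`, and `minutes` are hypothetical names, and only the per-partition `max` values are modeled.

```scala
object WatermarkSketch {
  def minutes(m: Long): Long = m * 60L * 1000L

  // currentMs: watermark used during the batch that just finished
  // partitionMaxMs: max event time seen by each partition (empty if the batch had no data)
  // delayMs: the delay given to withWatermark()
  def nextWatermark(currentMs: Long, partitionMaxMs: Seq[Long], delayMs: Long): Long =
    if (partitionMaxMs.isEmpty) currentMs // no data observed: keep the old value
    else math.max(currentMs, partitionMaxMs.max - delayMs) // never move backward
}
```

For example, with partition maxima of `12:20` and `12:15` and a delay of 10 minutes, the watermark moves from 0 to `12:10`; a later batch whose largest event time is only `12:06` leaves it at `12:10`.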
171 | 
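172 | ## Notes on the watermark
173 | 
174 | A few notes on Structured Streaming's current watermark mechanism:
175 | 
176 | 1. To repeat: a watermark is needed only when (a+) *window()* + *groupBy().aggregation()* is applied on event time, that is, state is used to aggregate across batches, and (b+) the output mode is Append or Update; otherwise it is not needed;
177 | 2. The essence of the watermark is to help the StateStore clean up state so that it does not grow without bound; it also maintains correct Append semantics (that is, deciding when a result will no longer change so that it can be output);
178 | 3. In the current version (Spark 2.2), the watermark is obtained by subtracting a late threshold from the maximum event time; watermarks supplied by the Source are not yet supported;
179 |    - A possible future improvement is for the Source to supply special watermark information that is passed on to StreamExecution [2], which would let the watermark advance faster and output be produced earlier
180 | 4. Structured Streaming's promises about the watermark are: (a) the watermark never moves backward (both in normal operation and during failure recovery); (b) once the watermark is reached, results are usually output in the next batch, but may be delayed by a batch or two (during failure recovery), and applications should not depend on this.
181 | 
182 | ## Further reading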
183 | 
184 | 1. [Github: org/apache/spark/sql/execution/streaming/StatefulAggregate.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala)
185 | 2. [Flink Doc: Generating Timestamps / Watermarks](https://ci.apache.org/projects/flink/flink-docs-master/dev/event_timestamps_watermarks.html)
186 | 
187 | ## References
188 | 
189 | 1. [Structured Streaming Programming Guide](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
190 | 2. [Design Doc: Structured Streaming Watermarks for handling late data and dropping old aggregates](https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit)
191 | 
192 | <br/>
193 | <br/>
194 | (End of article. To join the discussion, [click here](https://github.com/lw-lin/CoolplaySpark/issues/35); to return to the table of contents, [click here](.))
195 | 


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/100.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/100.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/150.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/150.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/200.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/200.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/210.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/210.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/220.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/220.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/230.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/230.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/300.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/300.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/4.imgs/300_large.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/Structured Streaming 源码解析系列/4.imgs/300_large.png


--------------------------------------------------------------------------------
/Structured Streaming 源码解析系列/README.md:
--------------------------------------------------------------------------------
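 1 | ## Structured Streaming Source Code Analysis Series
 2 | 
 3 | Proudly presented by the [Tencent Ads (腾讯广告)](http://e.qq.com) technical team (formerly the Tencent Guangdiantong technical team)
 4 | 
 5 | ```
 6 | This series applies to:
 7 | * 2018.11.02 update, all Spark 2.4 releases √ (released: 2.4.0)
 8 | * 2018.02.28 update, all Spark 2.3 releases √ (released: 2.3.0 ~ 2.3.2)
 9 | * 2017.07.11 update, all Spark 2.2 releases √ (released: 2.2.0 ~ 2.2.3)
10 | ```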
11 | 
12 | - *I. Overview*
13 |   - [1.1 Structured Streaming: Implementation Approach and Overview](1.1%20Structured%20Streaming%20实现思路与实现概述.md)
14 | - *II. Sources and Sinks*
15 |   - [2.1 Structured Streaming: Source Explained](2.1%20Structured%20Streaming%20之%20Source%20解析.md)
16 |   - [2.2 Structured Streaming: Sink Explained](2.2%20Structured%20Streaming%20之%20Sink%20解析.md)
17 | - *III. State Storage*
18 |   - [3.1 Structured Streaming: State Store Explained](3.1%20Structured%20Streaming%20之状态存储解析.md)
19 | - *IV. Event Time and Watermark*
20 |   - [4.1 Structured Streaming: Event Time Explained](4.1%20Structured%20Streaming%20之%20Event%20Time%20解析.md)
21 |   - [4.2 Structured Streaming: Watermark Explained](4.2%20Structured%20Streaming%20之%20Watermark%20解析.md)
22 | - *#. Resources and Q&A*
23 |   - [Spark resource collection](https://github.com/lw-lin/CoolplaySpark/tree/master/Spark%20%E8%B5%84%E6%BA%90%E9%9B%86%E5%90%88) (including the Spark Streaming Chinese WeChat group, Spark Summit videos, and other resources)<br/>![wechat_spark_streaming_small](../Spark%20%E8%B5%84%E6%BA%90%E9%9B%86%E5%90%88/resources/wechat_spark_streaming_small_.PNG)
24 |   - [Q&A] Differences between Structured Streaming and Spark Streaming
25 | 
26 | ## Acknowledgments
27 | 
28 | - Github [@wongxingjun](http://github.com/wongxingjun) pointed out 2 typos and submitted a Pull Request to fix them (PR merged)
29 | - Github [@wangmiao1981](http://github.com/wangmiao1981) pointed out several typos and submitted a Pull Request to fix them (PR merged)
30 | - Github [@zilutang](http://github.com/zilutang) pointed out 1 typo and submitted a Pull Request to fix it (PR merged)
31 | 
32 | > With this *Structured Streaming Source Code Analysis Series* and the earlier *Spark Streaming Source Code Analysis Series*, we offer our thanks and respect to the innovators who "make big data simple".
33 | 
34 | ## Creative Commons
35 | 
36 | ![](https://licensebuttons.net/l/by-nc/4.0/88x31.png)
37 | 
38 | Unless otherwise noted, the articles in this *Structured Streaming Source Code Analysis Series* are licensed under the [CC BY-NC (Attribution-NonCommercial)](https://creativecommons.org/licenses/by-nc/4.0/) Creative Commons license.
39 | 


--------------------------------------------------------------------------------
/coolplay_spark_logo_cn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/coolplay_spark_logo_cn.png


--------------------------------------------------------------------------------
/coolplay_spark_logo_cn_small.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lw-lin/CoolplaySpark/d4880cfb051d3e03ba8d8189eb3699000628ebdc/coolplay_spark_logo_cn_small.png


--------------------------------------------------------------------------------