├── .gitignore
├── README.md
├── SS-SF-2015-Koeninger.key
├── blogpost.md
├── build.sbt
├── index.html
├── project
│   ├── build.properties
│   └── plugins.sbt
├── schema.sql
├── slides
│   ├── kafka-new.png
│   ├── kafka-old.png
│   ├── kixer-logo.png
│   ├── remark-latest.min.js
│   └── spark-kafka-change-cpu-utilization.png
└── src
    └── main
        ├── resources
        │   └── application.conf
        └── scala
            └── example
                ├── BasicKafkaConsumer.scala
                ├── BasicRDD.scala
                ├── BasicStream.scala
                ├── CommitAsync.scala
                ├── IdempotentExample.scala
                ├── SetupJdbc.scala
                ├── Throughput.scala
                ├── TlsStream.scala
                ├── TransactionalPerBatch.scala
                ├── TransactionalPerPartition.scala
                └── Windowed.scala
/.gitignore:
--------------------------------------------------------------------------------
1 | ## generic files to ignore
2 | *~
3 | *.lock
4 | *.DS_Store
5 | *.swp
6 | *.out
7 |
8 | # rails specific
9 | *.sqlite3
10 | config/database.yml
11 | log/*
12 | tmp/*
13 |
14 | # java specific
15 | *.class
16 |
17 | # python specific
18 | *.pyc
19 |
20 | # xcode/iphone specific
21 | build/*
22 | *.pbxuser
23 | *.mode2v3
24 | *.mode1v3
25 | *.perspective
26 | *.perspectivev3
27 | *~.nib
28 |
29 | # akka specific
30 | logs/*
31 |
32 | # sbt specific
33 | target/
34 | project/boot
35 | lib_managed/*
36 | project/build/target
37 | project/build/lib_managed
38 | project/build/src_managed
39 | project/plugins/lib_managed
40 | project/plugins/target
41 | project/plugins/src_managed
42 | project/plugins/project
43 |
44 | core/lib_managed
45 | core/target
46 | pubsub/lib_managed
47 | pubsub/target
48 |
49 | # eclipse specific
50 | .metadata
51 | jrebel.lic
52 | .settings
53 | .classpath
54 | .project
55 |
56 | .ensime*
57 | *.sublime-*
58 | .cache
59 |
60 | # intellij
61 | *.eml
62 | *.iml
63 | *.ipr
64 | *.iws
65 | .*.sw?
66 | .idea
67 |
68 | # paulp script
69 | /.lib/
70 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | kafka-exactly-once
2 | ==================
3 | Usage examples for the Kafka createDirectStream / createRDD API that I contributed to Spark (available since Spark 1.3)
4 |
5 | Master corresponds to Spark 2.0 / Kafka 0.10
6 |
7 | If you're looking for earlier versions, see the [Spark 1.6 branch](https://github.com/koeninger/kafka-exactly-once/tree/spark-1.6.0)
8 |
9 | For more detail, see the [presentation](https://www.youtube.com/watch?v=fXnNEq1v3VA) or the [blog post](https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md) or the [slides](http://koeninger.github.io/kafka-exactly-once/) or the [jira ticket](https://issues.apache.org/jira/browse/SPARK-4964)
10 |
11 | If you want to try running this,
12 |
13 | schema.sql contains postgres schemas for the tables used
14 |
15 | src/main/resources/application.conf contains jdbc and kafka config info
16 |
17 | The examples are indifferent to the exact kafka topic or message format used,
18 | although IdempotentExample assumes each message body is unique.
19 |
--------------------------------------------------------------------------------
/SS-SF-2015-Koeninger.key:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/koeninger/kafka-exactly-once/3b442e1d83ad2c4496a786ab116089f87a5e0344/SS-SF-2015-Koeninger.key
--------------------------------------------------------------------------------
/blogpost.md:
--------------------------------------------------------------------------------
1 | # Exactly-once Spark Streaming from Kafka
2 |
3 | The upcoming release of [Spark](http://spark.apache.org) 1.3 includes new experimental RDD and DStream implementations for reading data from [Kafka](http://kafka.apache.org). As the primary author of those features, I'd like to explain their implementation and usage. You may be interested if you would benefit from:
4 |
5 | * more uniform usage of Spark cluster resources when consuming from Kafka
6 | * control of message delivery semantics
7 | * delivery guarantees without reliance on a write-ahead log in HDFS
8 | * access to message metadata
9 |
10 | I'll assume you're familiar with the [Spark Streaming docs](http://spark.apache.org/docs/latest/streaming-programming-guide.html) and [Kafka docs](http://kafka.apache.org/documentation.html). All code examples are in Scala, but there are Java-friendly methods in the API.
11 |
12 | ## Basic Usage
13 |
14 | The new API for both Kafka RDD and DStream is in the spark-streaming-kafka artifact.
15 |
16 | SBT dependency:
17 |
18 | ```scala
19 | libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.3.0"
20 | ```
21 |
22 | Maven dependency:
23 |
24 | ```xml
25 | <dependency>
26 |   <groupId>org.apache.spark</groupId>
27 |   <artifactId>spark-streaming-kafka_2.10</artifactId>
28 |   <version>1.3.0</version>
29 | </dependency>
30 | ```
31 |
32 | To read from Kafka in a Spark Streaming job, use KafkaUtils.createDirectStream:
33 |
34 | ```scala
35 | import kafka.serializer.StringDecoder
36 | import org.apache.spark.SparkConf
37 | import org.apache.spark.streaming.{Seconds, StreamingContext}
38 | import org.apache.spark.streaming.kafka.KafkaUtils
39 |
40 | val ssc = new StreamingContext(new SparkConf, Seconds(60))
41 |
42 | // hostname:port for Kafka brokers, not Zookeeper
43 | val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
44 |
45 | val topics = Set("sometopic", "anothertopic")
46 |
47 | val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
48 | ssc, kafkaParams, topics)
49 | ```
50 |
51 | The call to createDirectStream returns a stream of tuples formed from each Kafka message's key and value. The exposed return type is InputDStream[(K, V)], where K and V in this case are both String. The private implementation is DirectKafkaInputDStream. There are other overloads of createDirectStream that allow you to access message metadata, and to specify the exact per-topic-and-partition starting offsets.
52 |
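For instance, here's a rough sketch of the overload that takes explicit per-topic-and-partition starting offsets and a messageHandler, reusing the ssc and kafkaParams from above (the offset values are made up):

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata

// explicit starting offsets per topic and partition (values are made up)
val fromOffsets = Map(
  TopicAndPartition("sometopic", 0) -> 1234L,
  TopicAndPartition("sometopic", 1) -> 5678L
)

// the messageHandler turns each MessageAndMetadata into whatever you need,
// in this case a (topic, offset, value) triple
val streamWithMeta = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, Long, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.offset, mmd.message))
```
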
53 | To read from Kafka in a non-streaming Spark job, use KafkaUtils.createRDD:
54 |
55 | ```scala
56 | import kafka.serializer.StringDecoder
57 | import org.apache.spark.{SparkContext, SparkConf}
58 | import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
59 |
60 | val sc = new SparkContext(new SparkConf)
61 |
62 | // hostname:port for Kafka brokers, not Zookeeper
63 | val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
64 |
65 | val offsetRanges = Array(
66 | OffsetRange("sometopic", 0, 110, 220),
67 | OffsetRange("sometopic", 1, 100, 313),
68 | OffsetRange("anothertopic", 0, 456, 789)
69 | )
70 |
71 | val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
72 | sc, kafkaParams, offsetRanges)
73 | ```
74 |
75 | The call to createRDD returns a single RDD of (key, value) tuples for each Kafka message in the specified batch of offset ranges. The exposed return type is RDD[(K, V)], the private implementation is KafkaRDD. There are other overloads of createRDD that allow you to access message metadata, and to specify the current per-topic-and-partition Kafka leaders.
76 |
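Because the offset ranges fully determine which messages are included, a quick sanity check on the result is easy (reusing the rdd and offsetRanges from above):

```scala
// each range is [fromOffset, untilOffset), so its size is just the difference
val expected = offsetRanges.map(osr => osr.untilOffset - osr.fromOffset).sum

// the rdd contains exactly the messages in the requested ranges
val actual = rdd.count

assert(actual == expected, s"expected $expected messages, got $actual")
```
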
77 | ## Implementation
78 |
79 | DirectKafkaInputDStream is a stream of batches. Each batch corresponds to a KafkaRDD. Each partition of the KafkaRDD corresponds to an OffsetRange. Most of this implementation is private, but it's still useful to understand.
80 |
81 | ### OffsetRange
82 |
83 | An [OffsetRange](https://github.com/apache/spark/blob/branch-1.3/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/OffsetRange.scala) represents the lower and upper boundaries for a particular sequence of messages in a given Kafka topic and partition. The following data structure:
84 |
85 | ```scala
86 | OffsetRange("visits", 2, 300, 310)
87 | ```
88 |
89 | identifies the 10 messages from offset 300 (inclusive) until offset 310 (exclusive) in partition 2 of the "visits" topic. Note that it does not actually contain the contents of the messages, it's just a way of identifying the range.
90 |
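The fields of an OffsetRange are directly accessible, so the number of messages a range identifies is simple arithmetic:

```scala
val osr = OffsetRange("visits", 2, 300, 310)

osr.topic                         // "visits"
osr.partition                     // 2
osr.fromOffset                    // 300, inclusive
osr.untilOffset                   // 310, exclusive
osr.untilOffset - osr.fromOffset  // 10 messages identified by this range
```
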
91 | Also note that because Kafka ordering is only defined on a per-partition basis, the messages referred to by
92 |
93 | ```scala
94 | OffsetRange("visits", 3, 300, 310)
95 | ```
96 |
97 | may be from a completely different time period; even though the offsets are the same as above, the partition is different.
98 |
99 | ### KafkaRDD
100 |
101 | Recall that an RDD is defined by (a minimal skeleton is sketched after this list):
102 |
103 | * a method to divide the work into partitions (getPartitions)
104 | * a method to do the work for a given partition (compute)
105 | * a list of parent RDDs. KafkaRDD is an input, not a transformation, so it has no parents.
106 | * optionally, a partitioner defining how keys are hashed. KafkaRDD doesn't define one.
107 | * optionally, a list of preferred hosts for a given partition, in order to push computation to where the data is (getPreferredLocations)
108 |
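To make that list concrete, here's a minimal skeleton of a custom RDD; it's purely illustrative (not Spark code and not part of this repo), but it shows where the three methods fit. KafkaRDD fills in these same methods with Kafka-specific logic.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// a partition that just remembers its index and a preferred host
class ExamplePartition(val index: Int, val host: String) extends Partition

class ExampleRDD(sc: SparkContext, hosts: Array[String]) extends RDD[Int](sc, Nil) {
  // divide the work: one partition per host, analogous to one partition per OffsetRange
  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => new ExamplePartition(i, h): Partition }

  // do the work for a single partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)

  // push computation to where the data is
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[ExamplePartition].host)
}
```
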
109 | The [KafkaRDD constructor](https://github.com/apache/spark/blob/v1.3.0-rc1/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala#L45) takes an array of OffsetRanges and a map with the current leader host and port for each Kafka topic and partition. The reason for the separation of leader info is to allow for the KafkaUtils.createRDD convenience constructor that doesn't require you to know the leaders. In that case, createRDD will do the Kafka API metadata calls necessary to find the current leaders, using the list of hosts specified in metadata.broker.list as the initial points of contact. That initial lookup will happen once, in the Spark driver process.
110 |
111 | The [getPartitions](https://github.com/apache/spark/blob/v1.3.0-rc1/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala#L57) method of KafkaRDD takes each OffsetRange in the array and turns it into an RDD Partition by adding the leader's host and port info. The important thing to notice here is there is a 1:1 correspondence between Kafka partition and RDD partition. This means the degree of Spark parallelism (at least for reading messages) will be directly tied to the degree of Kafka parallelism.
112 |
113 | The [getPreferredLocations](https://github.com/apache/spark/blob/v1.3.0-rc1/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala#L64) method uses the Kafka leader for the given partition as the preferred host. I don't run my Spark executors on the same hosts as Kafka, so if you do, let me know how this works out for you.
114 |
115 | The [compute](https://github.com/apache/spark/blob/v1.3.0-rc1/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala#L85) method runs in the Spark executor processes. It uses a Kafka SimpleConsumer to connect to the leader for the given topic and partition, then makes repeated fetch requests to read messages for the specified range of offsets.
116 |
117 | Each message is converted using the messageHandler argument to the constructor. messageHandler is a function from Kafka MessageAndMetadata to a user-defined type, with the default being a tuple of key and value. In most cases, it's more efficient to access topic and offset metadata on a per-partition basis (see the discussion of HasOffsetRanges below), but if you really need to associate each message with its offset, you can do so.
118 |
119 | The key point to notice about compute is that, because offset ranges are defined in advance on the driver, then read directly from Kafka by executors, the messages returned by a particular KafkaRDD are deterministic. There is no important state maintained on the executors, and no notion of committing read offsets to Zookeeper, as there is with prior solutions that used the Kafka high-level consumer.
120 |
121 | Because the compute operation is deterministic, it is in general safe to re-try a task if it fails. If a Kafka leader is lost, for instance, the compute method will just sleep for the amount of time defined by the **refresh.leader.backoff.ms** Kafka param, then fail the task and let the normal Spark task retry mechanism handle it. On subsequent attempts after the first, the new leader will be looked up on the executor as part of the compute method.
122 |
123 | ### DirectKafkaInputDStream
124 |
125 | The KafkaRDD returned by KafkaUtils.createRDD is usable in batch jobs if you have existing code to obtain and manage offsets. In most cases however, you'll probably be using KafkaUtils.createDirectStream, which returns a DirectKafkaInputDStream. Similar to an RDD, a DStream is defined by:
126 |
127 | * a list of parent DStreams. Again, this is an input DStream, not a transformation, so it has no parents.
128 | * a time interval at which the stream will generate batches. This stream uses the interval of the streaming context.
129 | * a method to generate an RDD for a given time interval (compute)
130 |
131 | The [compute](https://github.com/apache/spark/blob/v1.3.0-rc1/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L115) method runs on the driver. It connects to the leader for each topic and partition, not to read messages, but just to get the latest available offset. It then defines a KafkaRDD with offset ranges spanning from the ending point of the last batch until the latest leader offsets.
132 |
133 | To define the starting point of the very first batch, you can either specify exact offsets per TopicAndPartition, or use the Kafka parameter **auto.offset.reset**, which may be set to "largest" or "smallest" (defaults to "largest"). For rate limiting, you can use the Spark configuration variable **spark.streaming.kafka.maxRatePerPartition** to cap the number of messages read per second from each partition (the per-batch maximum is that rate multiplied by the batch duration).
134 |
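As a sketch (values are illustrative, and assuming the imports from the earlier examples), those two knobs look like this:

```scala
// start from the beginning of the log on the first run (Kafka 0.8 uses "smallest"/"largest")
val kafkaParams = Map(
  "metadata.broker.list" -> "localhost:9092,anotherhost:9092",
  "auto.offset.reset" -> "smallest"
)

// cap each partition at 10,000 messages per second; with a 60 second batch interval,
// that works out to at most 600,000 messages per partition per batch
val sparkConf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
```
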
135 | Once the KafkaRDD for a given time interval is defined, it executes exactly as described above for the batch usage case. Unlike prior Kafka DStream implementations, there is no long-running receiver task that occupies a core per stream regardless of what the message volume is. For our use cases at [Kixer](http://kixer.com), it's common to have important but low-volume topics in the same job as high-volume topics. With the direct stream, the low-volume partitions result in smaller tasks that finish quickly and free up that node to process other partitions in the batch. It's a pretty big win to have uniform cluster usage while still keeping topics logically separate.
136 |
137 | A significant difference from the batch use case is that there **is** some important state that varies over time, namely the offset ranges generated at each time interval. Executor or Kafka leader failure isn't a big deal, as discussed above, but if the driver fails, offset ranges will be lost, unless stored somewhere. I'll discuss this in more detail under Delivery Semantics below, but you basically have 3 choices:
138 |
139 | 1. Don't worry about it if you don't care about lost or duplicated messages, and just restart the stream from the earliest or latest offset
140 | 2. Checkpoint the stream, in which case the offset ranges (not the messages, just the offset range definitions) will be stored in the checkpoint
141 | 3. Store the offset ranges yourself, and provide the correct starting offsets when restarting the stream
142 |
143 | Again, no consumer offsets are stored in Zookeeper. If you want interop with existing Kafka monitoring tools that talk to ZK directly, you'll need to store the offsets into ZK yourself (ZK doesn't need to be your system of record for offsets; you can just duplicate them there).
144 |
145 | Note that because Kafka is being treated as a durable store of messages, not a transient network source, you don't need to duplicate messages into HDFS for error recovery. This design does have some implications, however. The first is that you can't read messages that no longer exist in Kafka, so make sure your retention is adequate. The second is that you can't read messages that don't exist in Kafka yet. To put it another way, the consumers on the executors aren't polling for new messages, the driver is just periodically checking with the leaders at every batch interval, so there is some inherent latency.
146 |
147 | ### HasOffsetRanges
148 |
149 | One other implementation detail is a public interface, HasOffsetRanges, with a single method returning an array of OffsetRange. KafkaRDD implements this interface, allowing you to obtain topic and offset information on a per-partition basis.
150 |
151 | ```scala
152 | val stream = KafkaUtils.createDirectStream(...)
153 | ...
154 | stream.foreachRDD { rdd =>
155 | // Cast the rdd to an interface that lets us get an array of OffsetRange
156 | val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
157 |
158 | rdd.foreachPartition { iter =>
159 | // index to get the correct offset range for the rdd partition we're working on
160 | val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
161 |
162 | // get any needed data from the offset range
163 | val topic = osr.topic
164 | val kafkaPartitionId = osr.partition
165 | val begin = osr.fromOffset
166 | val end = osr.untilOffset
167 | ...
168 | ```
169 |
170 | The reason for this layer of indirection is that the static type used by DStream methods like foreachRDD and transform is just RDD, not the type of the underlying (and in this case, private) implementation. Because the DStream returned by createDirectStream generates batches of KafkaRDD, you can safely cast to HasOffsetRanges.
171 |
172 | Also note that because of the 1:1 correspondence between offset ranges and rdd partitions, the indexes of the rdd partitions correspond to the indexes into the array returned by offsetRanges. Similarly, all of the messages in the given Spark partition are from the same Kafka topic. However, this 1:1 correspondence only lasts until the first Spark transformation that incurs a shuffle (e.g. reduceByKey), because shuffles can repartition the data.
173 |
174 | ```scala
175 | // WON'T DO WHAT YOU WANT
176 | rdd.mapPartitionsWithIndex { (i, iter) =>
177 | val osr: OffsetRange = offsets(i)
178 | iter.map { x =>
179 | (x.someKey, (osr.topic, x.someValue))
180 | }
181 | }.reduceByKey // this changes partitioning
182 | .foreachPartition {
183 | // now the partition contains values from more than one topic
184 | }
185 | ```
186 |
187 | Because of this, if you want to apply a per-partition transformation using offset range information, it's easiest to use normal Scala code to do the work inside a single Spark mapPartitionsWithIndex call.
188 |
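For example, here's a sketch that summarizes each partition together with its offset range before any shuffle, reusing the offsetRanges array obtained above:

```scala
// do any work that needs the offset range inside one mapPartitionsWithIndex,
// before a shuffle has a chance to change the partitioning
val summaries = rdd.mapPartitionsWithIndex { (i, iter) =>
  val osr: OffsetRange = offsetRanges(i)
  // plain Scala code operating on this partition's messages
  val count = iter.size
  Iterator((osr.topic, osr.partition, osr.fromOffset, osr.untilOffset, count))
}.collect
```
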
189 | ## Delivery Semantics
190 |
191 | First, understand the [Kafka docs on delivery semantics](http://kafka.apache.org/documentation.html#semantics). If you've already read them, go read them again. In short: **consumer delivery semantics are up to you**, not Kafka.
192 |
193 | Second, understand that Spark **does not guarantee exactly-once semantics for output actions**. When the Spark streaming guide talks about exactly-once, it's only referring to a given item in an RDD being included in a calculated value once, in a purely functional sense. Any side-effecting output operations (i.e. anything you do in foreachRDD to save the result) may be repeated, because any stage of the process might fail and be retried.
194 |
195 | Third, understand that Spark **checkpoints may not be recoverable**, for instance in cases where you need to change the application code in order to get the stream restarted. This situation may improve by Spark 1.4, but be aware that it is an issue. I've been bitten by it before; you may be too. Any place I mention "checkpoint the stream" as an option, consider the risk involved. Also note that any windowing transformations are going to rely on checkpointing anyway.
196 |
197 | Finally, I'll repeat that any semantics beyond at-most-once require that you have **sufficient log retention in Kafka**. If you're seeing things like OffsetOutOfRangeException, it's probably because you underprovisioned Kafka storage, not because something's wrong with Spark or Kafka.
198 |
199 | Given all that, how do you obtain the equivalent of the semantics you want?
200 |
201 | ### At-most-once
202 |
203 | This could be useful in cases where you're sending results to something that isn't a system of record, you don't want duplicates, and it's not worth the hassle of ensuring that messages don't get lost. An example might be sending summary statistics over UDP, since it's an unreliable protocol to begin with.
204 |
205 | To get at-most-once semantics, do all of the following:
206 |
207 | * set **spark.task.maxFailures** to 1, so the job dies as soon as a task fails
208 | * make sure **spark.speculation** is false (the default), so multiple copies of tasks don't get speculatively run
209 | * when the job dies, start the stream back up using the Kafka param **auto.offset.reset** set to "largest", so it will skip to the current end of the log
210 |
211 | This will mean you lose messages on restart, but at least they shouldn't get replayed. Probably. Test this carefully if it's actually important to you that a message **never** gets repeated, because it's not a common use case, and I'm not providing example code for it.
212 |
213 |
214 | ### At-least-once
215 |
216 | You're okay with duplicate messages, but not okay with losing messages. An example of this might be sending internal email alerts on relatively rare occurrences in the stream. Getting duplicate critical alerts in a short time frame is much better than not getting them at all.
217 |
218 | Basic options here are either
219 |
220 | 1. Checkpoint the stream *or*
221 | 2. restart the job with **auto.offset.reset** set to smallest. This will replay the whole log from the beginning of your retention, so you'd better have relatively short retention or *really* be ok with duplicate messages.
222 |
223 | Checkpointing the stream serves as the basis of the next option, so see the example code for it.
224 |
225 | ### Exactly-once using idempotent writes
226 |
227 | [Idempotent](http://en.wikipedia.org/wiki/Idempotence#Computer_science_meaning) writes make duplicate messages safe, turning at-least-once into the equivalent of exactly-once. The typical way of doing this is by having a unique key of some kind (either embedded in the message, or using topic/partition/offset as the key), and storing the results according to that key. Relying on a per-message unique key means this is useful for transforming or filtering individually valuable messages, less so for aggregating multiple messages.
228 |
229 | There's a complete sample of this idea at [IdempotentExample.scala](https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala). It's using postgres for the sake of consistency with the next example, but any storage system that allows for unique keys could be used.
230 |
231 | The important points here are that the [schema](https://github.com/koeninger/kafka-exactly-once/blob/master/schema.sql) is set up with a unique key and a rule to allow for no-op duplicate inserts. For this example, the message body is being used as the unique key, but any appropriate key could be used.
232 |
233 | ```scala
234 | stream.foreachRDD { rdd =>
235 | rdd.foreachPartition { iter =>
236 | // make sure connection pool is set up on the executor before writing
237 | SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
238 |
239 | iter.foreach { case (key, msg) =>
240 | DB.autoCommit { implicit session =>
241 | // the unique key for idempotency is just the text of the message itself, for example purposes
242 | sql"insert into idem_data(msg) values (${msg})".update.apply
243 | }
244 | }
245 | }
246 | }
247 | ```
248 |
249 | In the case of a failure, the above output action can safely be retried. Checkpointing the stream ensures that offset ranges are saved as they are generated. Checkpointing is accomplished in the usual way, by defining a function that configures the streaming context (ssc) and sets up the stream, then calling
250 |
251 | ```scala
252 | ssc.checkpoint(checkpointDir)
253 | ```
254 |
255 | before returning the ssc. See the [streaming guide](http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing) for more details.
256 |
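Putting that together, a sketch of the usual pattern looks like the following, assuming kafkaParams, topics and checkpointDir are already defined and using the imports from the earlier examples:

```scala
def setupSsc(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf, Seconds(60))
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  // ... define output actions on the stream here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// recover the stream (including offset ranges) from the checkpoint if one exists,
// otherwise call setupSsc to build a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, setupSsc _)
ssc.start()
ssc.awaitTermination()
```
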
257 | ### Exactly-once using transactional writes
258 |
259 | For data stores that support transactions, saving offsets in the same transaction as the results can keep the two in sync, even in failure situations. If you're careful about detecting repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting results. This gives the equivalent of exactly-once semantics, and is straightforward to use even for aggregations.
260 |
261 | [TransactionalExample.scala](https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalExample.scala) is a complete Spark job implementing this idea. It's using postgres, but any data store that has transactional semantics could be used.
262 |
263 | The first important point is that the stream is started using the last successfully committed offsets as the beginning point. This allows for failure recovery:
264 |
265 | ```scala
266 | // begin from the offsets committed to the database
267 | val fromOffsets = DB.readOnly { implicit session =>
268 | sql"select topic, part, off from txn_offsets".
269 | map { resultSet =>
270 | TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
271 | }.list.apply().toMap
272 | }
273 |
274 | val stream: InputDStream[Long] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Long](
275 | ssc, kafkaParams, fromOffsets,
276 | // we're just going to count messages, don't care about the contents, so convert each message to a 1
277 | (mmd: MessageAndMetadata[String, String]) => 1L)
278 | ```
279 |
280 | The very first time the job is run, the table can be pre-loaded with appropriate starting offsets.
281 |
282 | The example accesses offset ranges on a per-partition basis, as mentioned in the discussion of HasOffsetRanges above. Also notice that some iterator methods, such as map, are lazy. If you're setting up transient state, like a network or database connection, by the time the map is fully forced the connection may already be closed. In that case, be sure to instead use methods like foreach, that eagerly consume the iterator.
283 |
284 | ```scala
285 | rdd.foreachPartition { iter =>
286 | val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
287 |
288 | // set up some connection
289 |
290 | iter.foreach {
291 | // use the connection
292 | }
293 |
294 | // close the connection
295 | }
296 | ```
297 |
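For contrast, here's a sketch of the lazy-iterator pitfall; openSomeConnection and useConnection are hypothetical placeholders, not real methods:

```scala
// ANTI-PATTERN: iterator.map is lazy, so the connection is never actually used
rdd.foreachPartition { iter =>
  val conn = openSomeConnection()                     // hypothetical helper
  iter.map { record => useConnection(conn, record) }  // builds a lazy iterator that is never consumed
  conn.close()                                        // closes before (and instead of) doing any work
}
```
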
298 | The final thing to notice about the example is that it's important to ensure that saving the results and saving the offsets either both succeed, or both fail. Storing offsets should fail if the prior committed offset doesn't equal the beginning of the current offset range; this prevents gaps or repeats. Kafka semantics ensure that there aren't gaps in messages within a range of offsets (if you're especially concerned, you could verify by comparing the size of the offset range to the number of messages).
299 |
300 | ```scala
301 | // localTx is transactional, if metric update or offset update fails, neither will be committed
302 | DB.localTx { implicit session =>
303 | // store metric data
304 | val metricRows = sql"""
305 | update txn_data set metric = metric + ${metric}
306 | where topic = ${osr.topic}
307 | """.update.apply()
308 | if (metricRows != 1) {
309 | throw new Exception("...")
310 | }
311 |
312 | // store offsets
313 | val offsetRows = sql"""
314 | update txn_offsets set off = ${osr.untilOffset}
315 | where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
316 | """.update.apply()
317 | if (offsetRows != 1) {
318 | throw new Exception("...")
319 | }
320 | }
321 | ```
322 |
323 | The example code is throwing an exception, which will result in a transaction rollback. Other failure handling strategies may be appropriate, as long as they result in a transaction rollback as well.
324 |
325 | ## Future Improvements
326 |
327 | Although this feature is considered experimental for Spark 1.3, the underlying KafkaRDD design has been in production at [Kixer](http://kixer.com) for months. It's currently handling billions of messages per day, in batch sizes ranging from 2 seconds to 5 minutes. That being said, there are known areas for improvement (and probably a few unknown ones as well).
328 |
329 | * Connection Pooling. Currently, Kafka consumer connections are created as needed; pooling should help efficiency. Hopefully this can be implemented in a way that integrates nicely with ongoing work towards a Kafka producer API in Spark. Edit: I tested out caching consumer connections; it ended up having little to no impact on batch processing time, even with 200ms batches / 100 partitions.
330 | * Kafka metadata API. The [class for interacting with Kafka](https://github.com/apache/spark/blob/v1.3.0-rc1/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala) is currently private, meaning you'll need to duplicate some of that work if you want low-level access to Kafka metadata. This is partly because the Kafka consumer offset API is a moving target right now. If this code proves to be stable, it would be nice to have a user-facing API for interacting with Kafka metadata.
331 | * Batch generation policies. Right now, rate-limiting is the only tuning available for how the next batch in the stream is defined. We have some use cases that involve larger tweaks, such as a fixed time delay. A flexible way of defining batch generation policies might be useful.
332 |
333 | If there are other improvements you can think of, please let me know.
334 |
--------------------------------------------------------------------------------
/build.sbt:
--------------------------------------------------------------------------------
1 | import AssemblyKeys._
2 |
3 | name := "kafka-exactly-once"
4 |
5 | scalaVersion := "2.11.7"
6 |
7 | version := "2.1.2"
8 |
9 | val sparkVersion = "2.1.2"
10 |
11 | externalResolvers ++= Seq(
12 | "Local Maven Repository" at "file://"+Path.userHome.absolutePath+"/.m2/repository"
13 | )
14 |
15 | libraryDependencies ++= Seq(
16 | ("org.apache.spark" %% "spark-core" % sparkVersion % "provided").
17 | exclude("org.apache.spark", "spark-network-common_2.11").
18 | exclude("org.apache.spark", "spark-network-shuffle_2.11"),
19 | // avoid an ivy bug
20 | "org.apache.spark" %% "spark-network-common" % sparkVersion % "provided",
21 | "org.apache.spark" %% "spark-network-shuffle" % sparkVersion % "provided",
22 | ("org.apache.spark" %% "spark-streaming" % sparkVersion % "provided").
23 | exclude("org.apache.spark", "spark-core_2.11"),
24 | ("org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion).
25 | exclude("org.apache.spark", "spark-core_2.11"),
26 | ("org.scalikejdbc" %% "scalikejdbc" % "2.2.1").
27 | exclude("org.slf4j", "slf4j-api"),
28 | ("org.postgresql" % "postgresql" % "9.3-1101-jdbc4").
29 | exclude("org.slf4j", "slf4j-api"),
30 | "com.typesafe" % "config" % "1.2.1"
31 | )
32 |
33 | assemblySettings
34 |
35 | mergeStrategy in assembly := {
36 | case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
37 | case x => (mergeStrategy in assembly).value(x)
38 | }
39 |
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
[index.html content not captured in this dump; only fragments of the page skeleton (a "Title" placeholder) survive.]
--------------------------------------------------------------------------------
/project/build.properties:
--------------------------------------------------------------------------------
1 | sbt.version=0.13.9
2 |
--------------------------------------------------------------------------------
/project/plugins.sbt:
--------------------------------------------------------------------------------
1 | addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
2 |
--------------------------------------------------------------------------------
/schema.sql:
--------------------------------------------------------------------------------
1 | -- tables for IdempotentExample
2 |
3 | create table idem_data(
4 | msg character varying(255),
5 | primary key (msg)
6 | );
7 |
8 | -- postgres isn't the best for idempotent storage; this is for example purposes only
9 | create or replace rule idem_data_ignore_duplicate_inserts as
10 | on insert to idem_data
11 | where (exists (select 1 from idem_data where idem_data.msg = new.msg))
12 | do instead nothing
13 | ;
14 |
15 |
16 | -- tables for TransactionalExample
17 | create table txn_data(
18 | topic character varying(255),
19 | metric bigint
20 | );
21 |
22 | create table txn_offsets(
23 | topic character varying(255),
24 | part integer,
25 | off bigint,
26 | unique (topic, part)
27 | );
28 |
29 | insert into txn_data(topic, metric) values
30 | ('test', 0)
31 | ;
32 |
33 | insert into txn_offsets(topic, part, off) values
34 | -- or whatever your initial offsets per partition are, if non-0
35 | ('test', 0, 0),
36 | ('test', 1, 0)
37 | ;
38 |
--------------------------------------------------------------------------------
/slides/kafka-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/koeninger/kafka-exactly-once/3b442e1d83ad2c4496a786ab116089f87a5e0344/slides/kafka-new.png
--------------------------------------------------------------------------------
/slides/kafka-old.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/koeninger/kafka-exactly-once/3b442e1d83ad2c4496a786ab116089f87a5e0344/slides/kafka-old.png
--------------------------------------------------------------------------------
/slides/kixer-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/koeninger/kafka-exactly-once/3b442e1d83ad2c4496a786ab116089f87a5e0344/slides/kixer-logo.png
--------------------------------------------------------------------------------
/slides/spark-kafka-change-cpu-utilization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/koeninger/kafka-exactly-once/3b442e1d83ad2c4496a786ab116089f87a5e0344/slides/spark-kafka-change-cpu-utilization.png
--------------------------------------------------------------------------------
/src/main/resources/application.conf:
--------------------------------------------------------------------------------
1 | kafka {
2 | topics = "test"
3 | brokers = "localhost:9092"
4 | }
5 | jdbc {
6 | driver = "org.postgresql.Driver"
7 | url = "jdbc:postgresql://localhost/test"
8 | user = "cody"
9 | password = ""
10 | }
11 | checkpointDir = "/var/tmp/cp"
12 | batchDurationMs = 5000
13 |
--------------------------------------------------------------------------------
/src/main/scala/example/BasicKafkaConsumer.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.common.serialization.{ ByteArrayDeserializer, StringDeserializer }
4 | import org.apache.kafka.clients.consumer.KafkaConsumer
5 | import com.typesafe.config.ConfigFactory
6 | import scala.collection.JavaConverters._
7 |
8 | // direct usage of the KafkaConsumer
9 | object BasicKafkaConsumer {
10 | def main(args: Array[String]): Unit = {
11 | val conf = ConfigFactory.load
12 | val kafkaParams = Map[String, Object](
13 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
14 | "key.deserializer" -> classOf[ByteArrayDeserializer],
15 | "value.deserializer" -> classOf[ByteArrayDeserializer],
16 | "group.id" -> "example",
17 | "receive.buffer.bytes" -> (65536: java.lang.Integer),
18 | // auto offset reset is unfortunately necessary with dynamic topic subscription
19 | "auto.offset.reset" -> "latest"
20 | ).asJava
21 | val topics = conf.getString("kafka.topics").split(",").toList.asJava
22 | val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
23 | consumer.subscribe(topics)
24 | consumer.poll(0)
25 | println("Starting positions are: ")
26 | consumer.assignment.asScala.foreach { tp =>
27 | println(s"${tp.topic} ${tp.partition} ${consumer.position(tp)}")
28 | }
29 | while (true) {
30 | println(consumer.poll(512).asScala.map(_.value.size.toLong).fold(0L)(_+_))
31 | Thread.sleep(1000)
32 | }
33 | }
34 | }
35 |
--------------------------------------------------------------------------------
/src/main/scala/example/BasicRDD.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.common.serialization.StringDeserializer
4 | import org.apache.spark.{SparkContext, SparkConf}
5 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, OffsetRange }
6 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
7 | import scala.collection.JavaConverters._
8 | import com.typesafe.config.ConfigFactory
9 |
10 | object BasicRDD {
11 | def main(args: Array[String]): Unit = {
12 | val conf = ConfigFactory.load
13 | val kafkaParams = Map[String, Object](
14 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
15 | "key.deserializer" -> classOf[StringDeserializer],
16 | "value.deserializer" -> classOf[StringDeserializer]
17 | ).asJava
18 |
19 | val sc = new SparkContext(new SparkConf())
20 |
21 | val topic = conf.getString("kafka.topics").split(",").toSet.head
22 |
23 | // change these values to offsets that actually exist for the topic
24 | val offsetRanges = Array(
25 | OffsetRange(topic, 0, 0, 100),
26 | OffsetRange(topic, 1, 0, 100)
27 | )
28 |
29 | val rdd = KafkaUtils.createRDD[String, String](sc, kafkaParams, offsetRanges, PreferConsistent)
30 |
31 | rdd.collect.foreach(println)
32 | sc.stop
33 | }
34 | }
35 |
--------------------------------------------------------------------------------
/src/main/scala/example/BasicStream.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.clients.consumer.ConsumerRecord
4 | import org.apache.kafka.common.serialization.StringDeserializer
5 | import org.apache.spark.{ SparkConf, TaskContext }
6 | import org.apache.spark.streaming.{ Seconds, StreamingContext }
7 | import org.apache.spark.streaming.dstream.InputDStream
8 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, HasOffsetRanges, OffsetRange }
9 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
10 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
11 | import com.typesafe.config.ConfigFactory
12 | import java.net.InetAddress
13 | import scala.collection.JavaConverters._
14 |
15 | object BasicStream {
16 | def main(args: Array[String]): Unit = {
17 | val conf = ConfigFactory.load
18 | val kafkaParams = Map[String, Object](
19 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
20 | "key.deserializer" -> classOf[StringDeserializer],
21 | "value.deserializer" -> classOf[StringDeserializer],
22 | "group.id" -> "example",
23 | "auto.offset.reset" -> "latest"
24 | )
25 | val topics = conf.getString("kafka.topics").split(",")
26 | val ssc = new StreamingContext(new SparkConf, Seconds(5))
27 | val stream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
28 | ssc,
29 | PreferConsistent,
30 | Subscribe[String, String](topics, kafkaParams)
31 | )
32 |
33 | stream.map(record => (record.key, record.value))
34 |
35 | stream.foreachRDD { rdd =>
36 | val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
37 | rdd.mapPartitions { iter =>
38 | val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
39 | val host = InetAddress.getLocalHost().getHostName()
40 | val count = iter.size
41 | Seq(s"${host} ${osr.topic} ${osr.partition} ${count}").toIterator
42 | }.collect.sorted.foreach(println)
43 | }
44 |
45 | ssc.start()
46 | ssc.awaitTermination()
47 | }
48 | }
49 |
--------------------------------------------------------------------------------
/src/main/scala/example/CommitAsync.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.clients.consumer.ConsumerRecord
4 | import org.apache.kafka.common.serialization.StringDeserializer
5 | import org.apache.spark.{ SparkConf, TaskContext }
6 | import org.apache.spark.streaming.{ Seconds, StreamingContext }
7 | import org.apache.spark.streaming.dstream.InputDStream
8 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, HasOffsetRanges, OffsetRange, CanCommitOffsets }
9 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
10 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
11 | import com.typesafe.config.ConfigFactory
12 | import java.net.InetAddress
13 | import scala.collection.JavaConverters._
14 |
15 | object CommitAsync {
16 | def main(args: Array[String]): Unit = {
17 | val conf = ConfigFactory.load
18 | val kafkaParams = Map[String, Object](
19 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
20 | "key.deserializer" -> classOf[StringDeserializer],
21 | "value.deserializer" -> classOf[StringDeserializer],
22 | "group.id" -> "commitexample",
23 | "auto.offset.reset" -> "latest",
24 | "enable.auto.commit" -> (false: java.lang.Boolean)
25 | )
26 | val topics = conf.getString("kafka.topics").split(",")
27 | val ssc = new StreamingContext(new SparkConf, Seconds(5))
28 |
29 | val stream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
30 | ssc,
31 | PreferConsistent,
32 | Subscribe[String, String](topics, kafkaParams)
33 | )
34 |
35 | stream.map(record => (record.key, record.value))
36 |
37 | stream.foreachRDD { rdd =>
38 | val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
39 | rdd.mapPartitions { iter =>
40 | val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
41 | val host = InetAddress.getLocalHost().getHostName()
42 | val count = iter.size
43 | Seq(s"host ${host} topic ${osr.topic} partition ${osr.partition} messagecount ${count}").toIterator
44 | }.collect.sorted.foreach(println)
45 | offsetRanges.foreach(println)
46 | stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
47 | }
48 |
49 | ssc.start()
50 | ssc.awaitTermination()
51 | }
52 | }
53 |
--------------------------------------------------------------------------------
/src/main/scala/example/IdempotentExample.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.common.serialization.StringDeserializer
4 | import org.apache.kafka.clients.consumer.ConsumerRecord
5 |
6 | import scalikejdbc._
7 | import com.typesafe.config.ConfigFactory
8 |
9 | import org.apache.spark.{SparkContext, SparkConf}
10 | import org.apache.spark.SparkContext._
11 | import org.apache.spark.streaming._
12 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, HasOffsetRanges }
13 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
14 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
15 |
16 | import scala.collection.JavaConverters._
17 |
18 | /** exactly-once semantics from kafka, by storing data idempotently so that replay is safe */
19 | object IdempotentExample {
20 | def main(args: Array[String]): Unit = {
21 | val conf = ConfigFactory.load
22 | val topics = conf.getString("kafka.topics").split(",").toSet
23 | val kafkaParams = Map[String, Object](
24 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
25 | "key.deserializer" -> classOf[StringDeserializer],
26 | "value.deserializer" -> classOf[StringDeserializer],
27 | "group.id" -> "idempotent-example",
28 | // kafka autocommit can happen before batch is finished, turn it off in favor of checkpoint only
29 | "enable.auto.commit" -> (false: java.lang.Boolean),
30 | // start from the smallest available offset, ie the beginning of the kafka log
31 | "auto.offset.reset" -> "earliest"
32 | )
33 |
34 | val jdbcDriver = conf.getString("jdbc.driver")
35 | val jdbcUrl = conf.getString("jdbc.url")
36 | val jdbcUser = conf.getString("jdbc.user")
37 | val jdbcPassword = conf.getString("jdbc.password")
38 |
39 | // while the job doesn't strictly need checkpointing,
40 | // we'll checkpoint to avoid replaying the whole kafka log in case of failure
41 | val checkpointDir = conf.getString("checkpointDir")
42 |
43 | val ssc = StreamingContext.getOrCreate(
44 | checkpointDir,
45 | setupSsc(topics, kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword, checkpointDir) _
46 | )
47 | ssc.start()
48 | ssc.awaitTermination()
49 | }
50 |
51 | def setupSsc(
52 | topics: Set[String],
53 | kafkaParams: Map[String, Object],
54 | jdbcDriver: String,
55 | jdbcUrl: String,
56 | jdbcUser: String,
57 | jdbcPassword: String,
58 | checkpointDir: String
59 | )(): StreamingContext = {
60 | val ssc = new StreamingContext(new SparkConf, Seconds(60))
61 |
62 | val stream = KafkaUtils.createDirectStream[String, String](
63 | ssc,
64 | PreferConsistent,
65 | Subscribe[String, String](topics, kafkaParams)
66 | )
67 |
68 | stream.foreachRDD { rdd =>
69 | rdd.foreachPartition { iter =>
70 | // make sure connection pool is set up on the executor before writing
71 | SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
72 |
73 | iter.foreach { record: ConsumerRecord[String, String] =>
74 | DB.autoCommit { implicit session =>
75 | // the unique key for idempotency is just the text of the message itself, for example purposes
76 | sql"insert into idem_data(msg) values (${record.value()})".update.apply
77 | }
78 | }
79 | }
80 | }
81 | // the offset ranges for the stream will be stored in the checkpoint
82 | ssc.checkpoint(checkpointDir)
83 | ssc
84 | }
85 | }
86 |
--------------------------------------------------------------------------------
/src/main/scala/example/SetupJdbc.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import scalikejdbc._
4 |
5 | object SetupJdbc {
6 | def apply(driver: String, host: String, user: String, password: String): Unit = {
7 | Class.forName(driver)
8 | ConnectionPool.singleton(host, user, password)
9 | }
10 | }
11 |
--------------------------------------------------------------------------------
/src/main/scala/example/Throughput.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.clients.consumer.KafkaConsumer
4 | import org.apache.kafka.common.serialization.ByteArrayDeserializer
5 | import org.apache.spark.{ SparkConf, TaskContext }
6 | import org.apache.spark.streaming.{Milliseconds, StreamingContext}
7 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, HasOffsetRanges, OffsetRange }
8 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
9 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
10 | import com.typesafe.config.ConfigFactory
11 | import java.net.InetAddress
12 | import scala.collection.JavaConverters._
13 |
14 | object Throughput {
15 | def main(args: Array[String]): Unit = {
16 | val conf = ConfigFactory.load
17 | val kafkaParams = Map[String, Object](
18 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
19 | "key.deserializer" -> classOf[ByteArrayDeserializer],
20 | "value.deserializer" -> classOf[ByteArrayDeserializer],
21 | "group.id" -> "example",
22 | "auto.offset.reset" -> "latest"
23 | )
24 | val topics = conf.getString("kafka.topics").split(",")
25 | val ssc = new StreamingContext(new SparkConf, Milliseconds(conf.getLong("batchDurationMs")))
26 | val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
27 | ssc,
28 | PreferConsistent,
29 | Subscribe[Array[Byte], Array[Byte]](topics, kafkaParams)
30 | )
31 |
32 | stream.foreachRDD { rdd =>
33 | println(rdd.map(_.value.size.toLong).fold(0L)(_+_))
34 | }
35 |
36 | ssc.start()
37 | ssc.awaitTermination()
38 | }
39 | }
40 |
--------------------------------------------------------------------------------
/src/main/scala/example/TlsStream.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.common.serialization.StringDeserializer
4 | import org.apache.spark.{ SparkConf, TaskContext }
5 | import org.apache.spark.streaming.{Seconds, StreamingContext}
6 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, HasOffsetRanges, OffsetRange }
7 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
8 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
9 | import com.typesafe.config.ConfigFactory
10 | import java.net.InetAddress
11 | import scala.collection.JavaConverters._
12 |
13 | object TlsStream {
14 | def main(args: Array[String]): Unit = {
15 | val conf = ConfigFactory.load
16 | val kafkaParams = Map[String, Object](
17 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
18 | "key.deserializer" -> classOf[StringDeserializer],
19 | "value.deserializer" -> classOf[StringDeserializer],
20 | "group.id" -> "tlsexample",
21 | "auto.offset.reset" -> "earliest",
22 | // see the instructions at http://kafka.apache.org/documentation.html#security
23 | // make sure to change the port in bootstrap.servers if 9092 is not TLS
24 | "security.protocol" -> "SSL",
25 | "ssl.truststore.location" -> "/Users/cody/Downloads/kafka-keystore/kafka.client.truststore.jks",
26 | "ssl.truststore.password" -> "test1234",
27 | "ssl.keystore.location" -> "/Users/cody/Downloads/kafka-keystore/kafka.client.keystore.jks",
28 | "ssl.keystore.password" -> "test1234",
29 | "ssl.key.password" -> "test1234"
30 | )
31 | val topics = conf.getString("kafka.topics").split(",")
32 | val ssc = new StreamingContext(new SparkConf, Seconds(5))
33 | val stream = KafkaUtils.createDirectStream[String, String](
34 | ssc,
35 | PreferConsistent,
36 | Subscribe[String, String](topics, kafkaParams)
37 | )
38 |
39 | stream.foreachRDD { rdd =>
40 | val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
41 | rdd.mapPartitions { iter =>
42 | val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
43 | val host = InetAddress.getLocalHost().getHostName()
44 | val count = iter.size
45 | Seq(s"${host} ${osr.topic} ${osr.partition} ${count}").toIterator
46 | }.collect.sorted.foreach(println)
47 | }
48 |
49 | ssc.start()
50 | ssc.awaitTermination()
51 | }
52 | }
53 |
--------------------------------------------------------------------------------
/src/main/scala/example/TransactionalPerBatch.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.clients.consumer.ConsumerRecord
4 | import org.apache.kafka.common.serialization.StringDeserializer
5 | import org.apache.kafka.common.TopicPartition
6 |
7 | import scalikejdbc._
8 | import com.typesafe.config.ConfigFactory
9 |
10 | import org.apache.spark.{SparkContext, SparkConf, TaskContext}
11 | import org.apache.spark.SparkContext._
12 | import org.apache.spark.streaming._
13 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, HasOffsetRanges, OffsetRange }
14 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
15 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
16 |
17 | import scala.collection.JavaConverters._
18 |
19 | /** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
20 | Offsets and results will be stored per-batch, on the driver
21 | */
22 | object TransactionalPerBatch {
23 | def main(args: Array[String]): Unit = {
24 | val conf = ConfigFactory.load
25 | val kafkaParams = Map[String, Object](
26 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
27 | "key.deserializer" -> classOf[StringDeserializer],
28 | "value.deserializer" -> classOf[StringDeserializer],
29 | "group.id" -> "transactional-example",
30 | "enable.auto.commit" -> (false: java.lang.Boolean),
31 | "auto.offset.reset" -> "none"
32 | )
33 | val jdbcDriver = conf.getString("jdbc.driver")
34 | val jdbcUrl = conf.getString("jdbc.url")
35 | val jdbcUser = conf.getString("jdbc.user")
36 | val jdbcPassword = conf.getString("jdbc.password")
37 |
38 | val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
39 | ssc.start()
40 | ssc.awaitTermination()
41 |
42 | }
43 |
44 | def setupSsc(
45 | kafkaParams: Map[String, Object],
46 | jdbcDriver: String,
47 | jdbcUrl: String,
48 | jdbcUser: String,
49 | jdbcPassword: String
50 | )(): StreamingContext = {
51 | val ssc = new StreamingContext(new SparkConf, Seconds(60))
52 |
53 | SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
54 |
55 |     // begin from the offsets committed to the database
56 | val fromOffsets = DB.readOnly { implicit session =>
57 | sql"select topic, part, off from txn_offsets".
58 | map { resultSet =>
59 | new TopicPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
60 | }.list.apply().toMap
61 | }
62 |
63 | val stream = KafkaUtils.createDirectStream[String, String](
64 | ssc,
65 | PreferConsistent,
66 | Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
67 | ).map { record =>
68 | // we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
69 | (record.topic, 1L)
70 | }
71 |
72 | stream.foreachRDD { rdd =>
73 | // Note this block is running on the driver
74 |
75 | // Cast the rdd to an interface that lets us get an array of OffsetRange
76 | val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
77 |
78 | // simplest possible "metric", namely a count of messages per topic
79 | // Notice the aggregation is done using spark methods, and results collected back to driver
80 | val results = rdd.reduceByKey {
81 | // This is the only block of code running on the executors.
82 | // reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
83 | _+_
84 | }.collect
85 |
86 | // Back to running on the driver
87 |
88 | // localTx is transactional, if metric update or offset update fails, neither will be committed
89 | DB.localTx { implicit session =>
90 | // store metric results
91 | results.foreach { pair =>
92 | val (topic, metric) = pair
93 | val metricRows = sql"""
94 | update txn_data set metric = metric + ${metric}
95 | where topic = ${topic}
96 | """.update.apply()
97 | if (metricRows != 1) {
98 | throw new Exception(s"""
99 | Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
100 | """)
101 | }
102 | }
103 |
104 | // store offsets
105 | offsetRanges.foreach { osr =>
106 | val offsetRows = sql"""
107 | update txn_offsets set off = ${osr.untilOffset}
108 | where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
109 | """.update.apply()
110 | if (offsetRows != 1) {
111 | throw new Exception(s"""
112 | Got $offsetRows rows affected instead of 1 when attempting to update offsets for
113 | ${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
114 | Was a partition repeated after a worker failure?
115 | """)
116 | }
117 | }
118 | }
119 | }
120 | ssc
121 | }
122 | }
123 |
--------------------------------------------------------------------------------
/src/main/scala/example/TransactionalPerPartition.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.clients.consumer.ConsumerRecord
4 | import org.apache.kafka.common.serialization.StringDeserializer
5 | import org.apache.kafka.common.TopicPartition
6 |
7 | import scalikejdbc._
8 | import com.typesafe.config.ConfigFactory
9 |
10 | import org.apache.spark.{SparkContext, SparkConf, TaskContext}
11 | import org.apache.spark.SparkContext._
12 | import org.apache.spark.streaming._
13 | import org.apache.spark.streaming.kafka010.{ KafkaUtils, HasOffsetRanges, OffsetRange }
14 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
15 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
16 |
17 | import scala.collection.JavaConverters._
18 |
19 | /** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
20 | Offsets and results will be stored per-partition, on the executors
21 | */
22 | object TransactionalPerPartition {
23 | def main(args: Array[String]): Unit = {
24 | val conf = ConfigFactory.load
25 | val kafkaParams = Map[String, Object](
26 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
27 | "key.deserializer" -> classOf[StringDeserializer],
28 | "value.deserializer" -> classOf[StringDeserializer],
29 | "group.id" -> "transactional-example",
30 | "enable.auto.commit" -> (false: java.lang.Boolean),
31 | "auto.offset.reset" -> "none"
32 | )
33 | val jdbcDriver = conf.getString("jdbc.driver")
34 | val jdbcUrl = conf.getString("jdbc.url")
35 | val jdbcUser = conf.getString("jdbc.user")
36 | val jdbcPassword = conf.getString("jdbc.password")
37 |
38 | val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
39 | ssc.start()
40 | ssc.awaitTermination()
41 |
42 | }
43 |
44 | def setupSsc(
45 | kafkaParams: Map[String, Object],
46 | jdbcDriver: String,
47 | jdbcUrl: String,
48 | jdbcUser: String,
49 | jdbcPassword: String
50 | )(): StreamingContext = {
51 | val ssc = new StreamingContext(new SparkConf, Seconds(60))
52 |
53 | SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
54 |
55 |     // begin from the offsets committed to the database
56 | val fromOffsets = DB.readOnly { implicit session =>
57 | sql"select topic, part, off from txn_offsets".
58 | map { resultSet =>
59 | new TopicPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
60 | }.list.apply().toMap
61 | }
62 |
63 | val stream = KafkaUtils.createDirectStream[String, String](
64 | ssc,
65 | PreferConsistent,
66 | Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
67 | )
68 |
69 | stream.foreachRDD { rdd =>
70 | // Cast the rdd to an interface that lets us get an array of OffsetRange
71 | val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
72 |
73 | rdd.foreachPartition { iter =>
74 |       // Note this entire block of code is running on the executors
75 | SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
76 |
77 | // index to get the correct offset range for the rdd partition we're working on
78 | // This is safe because we haven't shuffled or otherwise disrupted partitioning,
79 |       // and the original input rdd partitions were 1:1 with kafka partitions (see the sketch after this file for why a shuffle would break this)
80 | val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
81 |
82 | // simplest possible "metric", namely a count of messages
83 | val metric = iter.size
84 |
85 |         // localTx is transactional: if the metric update or the offset update fails, neither will be committed
86 | DB.localTx { implicit session =>
87 | // store metric data for this partition
88 | val metricRows = sql"""
89 | update txn_data set metric = metric + ${metric}
90 | where topic = ${osr.topic}
91 | """.update.apply()
92 | if (metricRows != 1) {
93 | throw new Exception(s"""
94 | Got $metricRows rows affected instead of 1 when attempting to update metrics for
95 | ${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
96 | """)
97 | }
98 |
99 | // store offsets for this partition
100 | val offsetRows = sql"""
101 | update txn_offsets set off = ${osr.untilOffset}
102 | where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
103 | """.update.apply()
104 | if (offsetRows != 1) {
105 | throw new Exception(s"""
106 | Got $offsetRows rows affected instead of 1 when attempting to update offsets for
107 | ${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
108 | Was a partition repeated after a worker failure?
109 | """)
110 | }
111 | }
112 | }
113 | }
114 | ssc
115 | }
116 | }
117 |
--------------------------------------------------------------------------------
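
Here is a minimal sketch, not one of the repo's files, contrasting when indexing offsetRanges by partition id is safe and when it is not, as referenced in the comment above. The broker address, topic, group id, and the object name PartitionMappingSketch are placeholders.

package example

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

/** Hypothetical example: the 1:1 mapping between spark and kafka partitions only holds before a shuffle */
object PartitionMappingSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf, Seconds(60))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092", // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "partition-mapping-sketch", // placeholder group id
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("test"), kafkaParams)) // placeholder topic

    stream.foreachRDD { rdd =>
      // grab offset ranges from the original input rdd, on the driver, before any transformation
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Safe: no shuffle yet, so spark partition i corresponds to kafka partition i
      rdd.foreachPartition { iter =>
        val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"kafka partition ${osr.partition} has ${osr.untilOffset - osr.fromOffset} messages in this batch")
      }

      // NOT safe: after a shuffle the spark partition id no longer identifies a kafka partition,
      // so indexing offsetRanges by it would pair data with the wrong range
      // (or throw, if the number of partitions changed)
      val shuffled = rdd.map(record => (record.key, record.value)).repartition(2)
      shuffled.foreachPartition { iter =>
        // do per-partition work here without touching offsetRanges
        println(s"shuffled spark partition ${TaskContext.get.partitionId} has ${iter.size} messages")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
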
/src/main/scala/example/Windowed.scala:
--------------------------------------------------------------------------------
1 | package example
2 |
3 | import org.apache.kafka.common.serialization.StringDeserializer
4 | import org.apache.spark.{SparkConf, TaskContext}
5 | import org.apache.spark.streaming.{Seconds, StreamingContext}
6 | import org.apache.spark.streaming.kafka010.{KafkaUtils, HasOffsetRanges, OffsetRange}
7 | import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
8 | import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
9 | import com.typesafe.config.ConfigFactory
10 |
11 | /** example of how windowing changes partitioning */
12 | object Windowed {
13 | def main(args: Array[String]): Unit = {
14 | val conf = ConfigFactory.load
15 | val ssc = new StreamingContext(new SparkConf, Seconds(1))
16 |
17 | val kafkaParams = Map[String, Object](
18 | "bootstrap.servers" -> conf.getString("kafka.brokers"),
19 | "key.deserializer" -> classOf[StringDeserializer],
20 | "value.deserializer" -> classOf[StringDeserializer],
21 | "group.id" -> "transactional-example",
22 | "enable.auto.commit" -> (false: java.lang.Boolean),
23 | "auto.offset.reset" -> "none"
24 | )
25 |
26 | val topics = conf.getString("kafka.topics").split(",").toSet
27 |
28 | val stream = KafkaUtils.createDirectStream[String, String](
29 | ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
30 |
31 | // reference to the most recently generated input rdd's offset ranges
32 | var offsetRanges = Array[OffsetRange]()
33 |
34 | stream.transform { rdd =>
35 | // It's possible to get each input rdd's offset ranges, BUT...
36 | offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
37 | println("got offset ranges on the driver:\n" + offsetRanges.mkString("\n"))
38 | println(s"number of kafka partitions before windowing: ${offsetRanges.size}")
39 | println(s"number of spark partitions before windowing: ${rdd.partitions.size}")
40 | rdd
41 | }.window(Seconds(6), Seconds(2)).foreachRDD { rdd =>
42 | //... if you then window, you're going to have partitions from multiple input rdds, not just the most recent one
43 | println(s"number of spark partitions after windowing: ${rdd.partitions.size}")
44 | rdd.foreachPartition { iter =>
45 | println("read offset ranges on the executor\n" + offsetRanges.mkString("\n"))
46 | // notice this partition ID can be higher than the number of partitions in a single input rdd
47 | println(s"this partition id ${TaskContext.get.partitionId}")
48 | iter.foreach(println)
49 | }
50 | // Moral of the story:
51 | // If you just care about the most recent rdd's offset ranges, a single reference is fine.
52 | // If you want to do something with all of the offset ranges in the window,
53 |       // you need to stick them in a data structure, e.g. a bounded queue (a sketch of this follows after this file).
54 |
55 | // But be aware, regardless of whether you use the createStream or createDirectStream api,
56 | // you will get a fundamentally wrong answer if your job fails and restarts at something other than the highest offset,
57 | // because the first window after restart will include all messages received while your job was down,
58 | // not just X seconds worth of messages.
59 |
60 | // In order to really solve this, you'd have to time-index kafka,
61 | // and override the behavior of the dstream's compute() method to only return messages for the correct time.
62 | // Or do your own bucketing into a data store based on the time in the message, not system clock at time of reading.
63 |
64 | // Or... don't worry about it :)
65 | // Restart the stream however you normally would (checkpoint, or save most recent offsets, or auto.offset.reset, whatever)
66 | // and accept that your first window will be wrong
67 | }
68 |
69 | ssc.start()
70 | ssc.awaitTermination()
71 | }
72 | }
73 |
--------------------------------------------------------------------------------
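
Here is a minimal sketch, not one of the repo's files, of the bounded-queue idea from the comments in Windowed.scala: keep the offset ranges of the last few batches on the driver so that windowed output can be related back to kafka offsets. The broker address, topic, group id, queue depth, and the object name WindowedOffsetsSketch are placeholders, and this only approximates the window contents; it does nothing about the wrong-first-window-after-restart problem described above.

package example

import scala.collection.mutable

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

/** Hypothetical example: accumulate each batch's offset ranges in a bounded queue on the driver */
object WindowedOffsetsSketch {
  // how many batches fit in the window: window length (6s) / batch interval (1s)
  private val maxBatches = 6
  private val recent = mutable.Queue[Array[OffsetRange]]()

  def record(ranges: Array[OffsetRange]): Unit = synchronized {
    recent.enqueue(ranges)
    while (recent.size > maxBatches) recent.dequeue()
  }

  def inWindow: Seq[OffsetRange] = synchronized { recent.toList.flatten }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf, Seconds(1))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092", // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "windowed-offsets-sketch", // placeholder group id
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("test"), kafkaParams)) // placeholder topic

    stream.transform { rdd =>
      // transform's body runs on the driver for each batch, before windowing
      record(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
      rdd
    }.window(Seconds(6), Seconds(2)).foreachRDD { rdd =>
      // offset ranges covering (approximately) the whole window, not just the latest batch
      println("offset ranges in window:\n" + inWindow.mkString("\n"))
      println(s"number of spark partitions in windowed rdd: ${rdd.partitions.size}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
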