Spark streaming techniques fall into two broad areas that don't have much to do with each other until you get to the advanced topics:

1. Transforming the streamed data
2. Managing and recovering state
After the single "getting started" example below, we'll look at these two areas separately. Eventually there may need to be a section of "advanced" examples that ties them together again.
Of course, to transform streaming data we need to set up a streaming data source. Many of the sources you'll encounter in practice take considerable setup, so I've chosen to use Spark's file streaming mechanism and to provide a utility class for generating a stream of files containing random data. My hope is that for most users of these examples it will need no setup at all, and it has the useful side effect of bringing streaming "down to earth" by using such a "low tech" mechanism.
## Utilities

| File | Purpose |
|---|---|
| CSVFileStreamGenerator.java | A utility for creating a sequence of files of integers in the file system so that Spark can treat them like a stream. This follows a standard pattern to ensure correctness: each file is first created in another folder and then atomically renamed into the destination folder, so that the file's point of creation is unambiguous and is correctly recognized by the streaming mechanism. Each generated file has the same number of key/value pairs, where the keys have the same names from file to file and the values are random numbers, and thus vary from file to file. This class is used by several of the streaming examples. (The rename pattern is sketched after this table.) |
| StreamingItem.java | An item of data to be streamed. This is used both to generate the records in the CSV files and to parse them. Several of the example stream processing pipelines parse the text data into these objects for further processing. |
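
The write-then-rename pattern the generator relies on can be shown in a few lines. This is a minimal sketch, not the generator class itself; the staging and destination directory names and the file contents are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class AtomicFileDrop {
    public static void main(String[] args) throws IOException {
        // Hypothetical staging and watched (destination) directories
        Path staging = Paths.get("/tmp/streamFiles-staging");
        Path dest = Paths.get("/tmp/streamFiles");
        Files.createDirectories(staging);
        Files.createDirectories(dest);

        // Write the file completely in the staging directory first ...
        Path tmp = staging.resolve("batch-1.csv");
        Files.write(tmp, Arrays.asList("key_1,42", "key_2,17"));

        // ... then move it atomically into the watched directory, so the
        // streaming mechanism never sees a partially written file. Note that
        // ATOMIC_MOVE requires both directories to be on the same filesystem.
        Files.move(tmp, dest.resolve("batch-1.csv"), StandardCopyOption.ATOMIC_MOVE);
    }
}
```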
## Getting Started

| File | What's Illustrated |
|---|---|
| FileBased.java | How to create a stream of data from files appearing in a directory. Start here. (A minimal version of this setup is sketched below.) |
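
A file-based stream needs nothing more than a streaming context and a directory to watch. Here is a minimal sketch (not the FileBased example itself); the directory name, batch interval, and timeout are placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local[2]");
        // Produce one batch of data every second
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Each new file appearing in the directory contributes to the stream
        JavaDStream<String> lines = jssc.textFileStream("/tmp/streamFiles");
        lines.print();

        jssc.start();
        jssc.awaitTerminationOrTimeout(30_000); // run for 30 seconds, then stop
        jssc.stop();
    }
}
```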
## Transformations

| File | What's Illustrated |
|---|---|
| MultipleTransformations.java | How to establish multiple streams on the same source of data and register multiple processing functions on a single stream. (Sketched after this table.) |
| Filtering.java | Much of the processing we require on streams is agnostic about batch boundaries. It's convenient to have methods on JavaDStream that allow us to transform the streamed data item by item (using map()) or filter it item by item (using filter()) without being concerned about batch boundaries as embodied by individual RDDs. This example again uses map() to parse the records in the text files and then filter() to filter out individual entries, so that by the time we receive batch RDDs only the desired items remain. (Sketched after this table.) |
| Windowing.java | This example creates two derived streams with different window and slide durations. All three streams print their batch size every time they produce a batch, so you can compare the number of records across streams and batches. (Sketched after this table.) |
| StateAccumulation.java | This example uses an accumulator to keep a running total of the number of records processed. Every batch that is processed is added to it, and the running total is printed. (Sketched after this table.) |
## State and Recovery

| File | What's Illustrated |
|---|---|
| SimpleRecoveryFromCheckpoint.java | This example demonstrates how to persist configured JavaDStreams across a failure and restart. It simulates failure by destroying the first streaming context (for which a checkpoint directory is configured) and creating a second one, not from scratch, but by reading the checkpoint directory. (The getOrCreate pattern is sketched after this table.) |
| MapWithState.java | (In progress) |
| Pairs.java | (In progress) |
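
The standard way to get this behavior is JavaStreamingContext.getOrCreate(): on a fresh start the factory function builds and checkpoints a new context, and after a restart the same call reconstructs the context, including its registered DStreams, from the checkpoint directory instead of invoking the factory. A minimal sketch, not the example itself; the directory paths are placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class RecoverySketch {
    private static final String CHECKPOINT_DIR = "/tmp/streamCheckpoint";

    private static JavaStreamingContext createContext() {
        SparkConf conf = new SparkConf().setAppName("RecoverySketch").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        jssc.checkpoint(CHECKPOINT_DIR);
        // DStreams configured here are recorded in the checkpoint ...
        jssc.textFileStream("/tmp/streamFiles").print();
        return jssc;
    }

    public static void main(String[] args) throws InterruptedException {
        // First run: createContext() is called and its result checkpointed.
        // After a restart: the context (and its DStreams) is rebuilt from
        // the checkpoint directory and the factory is never invoked.
        JavaStreamingContext jssc =
            JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, RecoverySketch::createContext);
        jssc.start();
        jssc.awaitTerminationOrTimeout(30_000);
        jssc.stop();
    }
}
```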