├── .gitignore
├── LICENSE
├── NOTICE.txt
├── README.md
├── data
│   ├── dummyLogFile.txt
│   ├── kibana.jpg
│   └── nycTaxiData.gz
├── pom.xml
├── src
│   └── main
│       ├── scala
│       │   └── com
│       │       └── dataartisans
│       │           └── flink_demo
│       │               ├── datatypes
│       │               │   └── TaxiRide.scala
│       │               ├── examples
│       │               │   ├── EarlyArrivalCount.scala
│       │               │   ├── SlidingArrivalCount.scala
│       │               │   └── TotalArrivalCount.scala
│       │               ├── sinks
│       │               │   └── ElasticsearchUpsertSink.scala
│       │               ├── sources
│       │               │   └── TaxiRideSource.scala
│       │               └── utils
│       │                   ├── DemoStreamEnvironment.scala
│       │                   └── NycGeoUtils.scala
│       └── scripts
│           └── convertTrips.sh
└── tools
    └── maven
        └── checkstyle.xml
/.gitignore:
--------------------------------------------------------------------------------
1 | .cache
2 | scalastyle-output.xml
3 | .classpath
4 | .idea
5 | .metadata
6 | .settings
7 | .project
8 | .version.properties
9 | filter.properties
10 | target
11 | tmp
12 | *.class
13 | *.iml
14 | *.swp
15 | *.jar
16 | *.log
17 | .DS_Store
18 | _site
19 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "{}"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright {yyyy} {name of copyright owner}
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
203 |
--------------------------------------------------------------------------------
/NOTICE.txt:
--------------------------------------------------------------------------------
1 | Cascading Connector for Apache Flink
2 | Copyright 2015 data Artisans GmbH
3 |
4 | This product includes software developed at
5 | data Artisans GmbH, Berlin, Germany (http://www.data-artisans.com).
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Demo Applications for Apache Flink™ DataStream
2 |
3 | This repository contains demo applications for [Apache Flink](https://flink.apache.org)'s
4 | [DataStream API](https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html).
5 |
6 | Apache Flink is a scalable open-source streaming dataflow engine with many competitive features.
7 | You can find a list of Flink's features at the bottom of this page.
8 |
9 | ### Run a demo application in your IDE
10 |
11 | You can run all examples in this repository from your IDE and play around with the code.
12 | Requirements:
13 |
14 | - Java JDK 7 (or 8)
15 | - Apache Maven 3.x
16 | - Git
17 | - an IDE with Scala support (we recommend IntelliJ IDEA)
18 |
19 | To run a demo application in your IDE, follow these steps:
20 |
21 | 1. **Clone the repository:** Open a terminal and clone the repository:
22 | `git clone https://github.com/dataArtisans/flink-streaming-demo.git`. Please note that the
23 | repository is about 100MB in size because it includes the input data of our demo applications.
24 |
25 | 2. **Import the project into your IDE:** The repository is a Maven project. Open your IDE and
26 | import the repository as an existing Maven project. This is usually done by selecting the folder that
27 | contains the `pom.xml` file or selecting the `pom.xml` file itself.
28 |
29 | 3. **Start a demo application:** Execute the `main()` method of one of the demo applications, for example
30 | `com.dataartisans.flink_demo.examples.TotalArrivalCount.scala`.
31 | Running an application will start a local Flink instance in the JVM process of your IDE.
32 | You will see Flink's log messages and the output produced by the program being printed to the standard output.
33 |
34 | 4. **Explore the web dashboard:** The local Flink instance starts a webserver that serves Flink's
35 | dashboard. Open [http://localhost:8081](http://localhost:8081) to access and explore the dashboard.
36 |
37 | ### Demo applications
38 |
39 | #### Taxi event stream
40 |
41 | All demo applications in this repository process a stream of taxi ride events that
42 | originate from a [public data set](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)
43 | of the [New York City Taxi and Limousine Commission](http://www.nyc.gov/html/tlc/html/home/home.shtml)
44 | (TLC). The data set consists of records about taxi trips in New York City from 2009 to 2015.
45 |
46 | We took some of this data and converted it into a data set of taxi ride events by splitting each
47 | trip record into a ride start and a ride end event. The events have the following schema:
48 |
49 | ```
50 | rideId: Long // unique id for each ride
51 | time: DateTime // timestamp of the start/end event
52 | isStart: Boolean // true = ride start, false = ride end
53 | location: GeoPoint // lon/lat of pick-up/drop-off location
54 | passengerCnt: Short // number of passengers
55 | travelDist: Float // total travel distance, -1 on start events
56 | ```
57 |
58 | A custom `SourceFunction` serves a `DataStream[TaxiRide]` from this data set.
59 | In order to generate the stream as realistically as possible, events are emitted according to their
60 | timestamp. Two events that occurred ten minutes after each other in reality are served ten minutes apart.
61 | A speed-up factor can be specified to "fast-forward" the stream, i.e., with a speed-up factor of 2,
62 | the events would be served five minutes apart. Moreover, you can specify a maximum serving delay
63 | which causes each event to be randomly delayed within the bound to simulate an out-of-order stream
64 | (a delay of 0 seconds results in an ordered stream). All examples operate in event-time mode.
65 | This guarantees consistent results even in case of historic data or data which is delivered out-of-order.
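
The timing rules described above can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions, not the code of `TaxiRideSource.scala`: the object name `ReplayTiming` and both function names are hypothetical, and timestamps are assumed to be milliseconds.

```scala
import scala.util.Random

// Hypothetical helper, not part of the repository.
object ReplayTiming {

  /** Wall-clock milliseconds to wait between two consecutive events,
    * given their event-time timestamps (ms) and a speed-up factor. */
  def servingGapMillis(prevEventTime: Long, nextEventTime: Long, speedup: Double): Long = {
    require(speedup > 0, "speed-up factor must be positive")
    val eventGap = math.max(0L, nextEventTime - prevEventTime)
    (eventGap / speedup).toLong
  }

  /** Random extra delay in [0, maxDelayMillis]; a bound of 0 keeps the stream ordered. */
  def randomDelayMillis(maxDelayMillis: Long, rnd: Random): Long =
    if (maxDelayMillis <= 0) 0L
    else (rnd.nextDouble() * (maxDelayMillis + 1)).toLong
}
```

With a speed-up factor of 2, two events that are ten minutes (600000 ms) apart in event time would be served 300000 ms apart, which matches the example in the paragraph above.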
66 |
67 | #### Identify popular locations
68 |
69 | The [`TotalArrivalCount.scala`](/src/main/scala/com/dataartisans/flink_demo/examples/TotalArrivalCount.scala)
70 | program identifies popular locations in New York City.
71 | It ingests the stream of taxi ride events and counts for each location the number of persons that
72 | arrive by taxi.
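
As a rough illustration of the counting logic, the following sketch sums passengers of ride end events per location over a finite collection. It deliberately avoids the Flink API: `RideEvent` is a simplified, hypothetical stand-in for the repository's `TaxiRide` type, and `cellId` assumes that each location has already been mapped to a discrete cell. The actual program works on an unbounded stream and emits updated counts continuously.

```scala
// Simplified stand-in for the repository's TaxiRide type (assumption).
final case class RideEvent(rideId: Long, isStart: Boolean, cellId: Int, passengerCnt: Short)

// Hypothetical helper, not part of the repository.
object TotalArrivals {

  /** Sums the passengers of all ride end events per location cell. */
  def countArrivals(events: Seq[RideEvent]): Map[Int, Int] =
    events
      .filter(e => !e.isStart)   // arrivals are ride end events
      .groupBy(_.cellId)         // group by location cell
      .map { case (cell, rides) => cell -> rides.map(_.passengerCnt.toInt).sum }
}
```

Ride start events are dropped first because only drop-offs count as arrivals.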
73 |
74 | #### Identify the popular locations of the last 15 minutes
75 |
76 | The [`SlidingArrivalCount.scala`](/src/main/scala/com/dataartisans/flink_demo/examples/SlidingArrivalCount.scala)
77 | program identifies popular locations of the last 15 minutes.
78 | It ingests the stream of taxi ride records and computes every five minutes the number of
79 | persons that arrived at each location within the last 15 minutes.
80 | This type of computation is known as a sliding window.
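
To make the sliding window idea concrete, here is a minimal sketch in plain Scala that assigns each arrival to every window containing it. It does not use Flink's windowing API; the object and function names are hypothetical, and it assumes windows aligned to timestamp 0 and non-negative millisecond timestamps.

```scala
// Hypothetical helper, not part of the repository.
object SlidingWindows {

  /** Start timestamps of all windows of length `size` that slide by `slide`
    * and contain `timestamp` (window = [start, start + size)). */
  def windowStarts(timestamp: Long, size: Long, slide: Long): Seq[Long] = {
    require(size > 0 && slide > 0 && timestamp >= 0)
    val lastStart = timestamp - (timestamp % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > timestamp - size)
      .toList
  }

  /** Sums passengers per window for (timestamp, passengers) arrivals. */
  def countPerWindow(arrivals: Seq[(Long, Int)], size: Long, slide: Long): Map[Long, Int] =
    arrivals
      .flatMap { case (ts, passengers) => windowStarts(ts, size, slide).map(_ -> passengers) }
      .groupBy(_._1)
      .map { case (start, xs) => start -> xs.map(_._2).sum }
}
```

With a 15-minute window that slides every five minutes, each arrival contributes to three overlapping windows, which is why the same ride can show up in three consecutive results.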
81 |
82 |
83 | #### Compute early arrival counts for popular locations
84 |
85 | Some stream processing use cases depend on timely event aggregation, for example to send out notifications or alerts.
86 | The [`EarlyArrivalCount.scala`](/src/main/scala/com/dataartisans/flink_demo/examples/EarlyArrivalCount.scala)
87 | program extends our previous sliding window application. As before, it computes every five minutes
88 | the number of persons that arrived at each location within the last 15 minutes.
89 | In addition, it emits an early partial count whenever the count for a location passes a multiple of
90 | 50 persons, i.e., it emits an updated count once more than 50, 100, 150 (and so on) persons have arrived at a location.
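
The early-firing rule can be sketched independently of Flink's trigger API. The names below are hypothetical, and the sketch assumes that an early result is emitted when the running count reaches or crosses a multiple of 50; whether the actual program fires exactly at the multiple or only after exceeding it is not stated here.

```scala
// Hypothetical helper, not part of the repository.
object EarlyCountTrigger {

  /** True if adding `added` persons to `previousCount` reaches or
    * crosses a multiple of `step`. */
  def crossesMultiple(previousCount: Int, added: Int, step: Int = 50): Boolean = {
    require(step > 0 && previousCount >= 0 && added >= 0)
    (previousCount + added) / step > previousCount / step
  }

  /** Running counts at which an early partial result would be emitted. */
  def earlyEmissions(passengerCounts: Seq[Int], step: Int = 50): Seq[Int] = {
    val running = passengerCounts.scanLeft(0)(_ + _)
    running.zip(running.tail).collect {
      case (before, after) if crossesMultiple(before, after - before, step) => after
    }
  }
}
```

Comparing the integer quotients before and after an update avoids firing more than once for the same multiple.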
91 |
92 | ### Setting up Elasticsearch and Kibana
93 |
94 | The demo applications in this repository are prepared to write their output to [Elasticsearch](https://www.elastic.co/products/elasticsearch).
95 | Data in Elasticsearch can be easily visualized using [Kibana](https://www.elastic.co/products/kibana)
96 | for real-time monitoring and interactive analysis.
97 |
98 | Our demo applications depend on Elasticsearch 1.7.3 and Kibana 4.1.3. Both systems have a nice
99 | out-of-the-box experience and operate well with their default configurations for our purpose.
100 |
101 | Follow these instructions to set up Elasticsearch and Kibana.
102 |
103 | #### Setup Elasticsearch
104 |
105 | 1. Download Elasticsearch 1.7.3 [here](https://www.elastic.co/downloads/past-releases/elasticsearch-1-7-3).
106 |
107 | 1. Extract the downloaded archive file and enter the extracted directory.
108 |
109 | 1. Start Elasticsearch using the start script: `./bin/elasticsearch`.
110 |
111 | 1. Create an index (here called `nyc-idx`): `curl -XPUT "http://localhost:9200/nyc-idx"`
112 |
113 | 1. Create a schema mapping for the index (here called `popular-locations`):
114 | ```
115 | curl -XPUT "http://localhost:9200/nyc-idx/_mapping/popular-locations" -d'
116 | {
117 | "popular-locations" : {
118 | "properties" : {
119 | "cnt": {"type": "integer"},
120 | "location": {"type": "geo_point"},
121 | "time": {"type": "date"}
122 | }
123 | }
124 | }'
125 | ```
126 | **Note:** This mapping can be used for all demo applications.
127 | 1. Configure a demo application to write its results to Elasticsearch. For that you have to change the corresponding parameters in the demo application's source code:
128 | - set `writeToElasticsearch = true`
129 | - set `elasticsearchHost` to the correct host name (see Elasticsearch's log output)
130 |
131 | 1. Run the Flink program to write its result to Elasticsearch.
132 |
133 | To clear the `nyc-idx` index in Elasticsearch, simply drop the mapping with
134 | `curl -XDELETE 'http://localhost:9200/nyc-idx/popular-locations'` and create it again with the previous
135 | command.
136 |
137 | #### Setup Kibana
138 |
139 | Setting up Kibana and visualizing data that is stored in Elasticsearch is also easy.
140 |
141 | 1. Download Kibana 4.1.3 [here](https://www.elastic.co/downloads/past-releases/kibana-4-1-3).
142 |
143 | 1. Extract the downloaded archive and enter the extracted directory.
144 |
145 | 1. Start Kibana using the start script: `./bin/kibana`.
146 |
147 | 1. Access Kibana by opening [http://localhost:5601](http://localhost:5601) in your browser.
148 |
149 | 1. Configure an index pattern by entering the index name "nyc-idx" and clicking on "Create".
150 | Do not uncheck the "Index contains time-based events" option.
151 |
152 | 1. Click on the "Discover" button at the top of the page. Kibana will tell you "No results found"
153 | because we have to configure the time range of the data to visualize in Kibana. Click on the
154 | "Last 15 minutes" label in the top right corner and enter an absolute time range from 2013-01-01
155 | to 2013-01-06 which is the time range of our taxi ride data stream. You can also configure a
156 | refresh interval to reload the page for updates.
157 |
158 | 1. Click on the “Visualize” button at the top of the page, select "Tile map", and click on "From a
159 | new search".
160 |
161 | 1. Next you need to configure the tile map visualization:
162 |
163 | - Top-left: Configure the displayed value to be a “Sum” aggregation over the "cnt" field.
164 | - Top-left: Select "Geo Coordinates" as bucket type and make sure that "location" is
165 | configured as field.
166 | - Top-left: You can change the visualization type by clicking on “Options” and selecting
167 | for example a “Shaded Geohash Grid” visualization.
168 | - The visualization is started by clicking on the green play button.
169 |
170 | The following screenshot shows how Kibana visualizes the result of `TotalArrivalCount.scala`.
171 |
172 | 
173 |
174 | ### Apache Flink's Feature Set
175 |
176 | - **Support for out-of-order streams and event-time processing**: In practice, streams of events rarely
177 | arrive in the order that they are produced, especially streams from distributed systems, devices, and sensors.
178 | Flink 0.10 is the first open source engine that supports out-of-order streams and event
179 | time, which is a hard requirement for many applications that aim for consistent and meaningful results.
180 |
181 | - **Expressive and easy-to-use APIs in Scala and Java**: Flink's DataStream API provides many
182 | operators which are well known from batch processing APIs such as `map`, `reduce`, and `join` as
183 | well as stream specific operations such as `window`, `split`, and `connect`.
184 | First-class support for user-defined functions eases the implementation of custom application
185 | behavior. The DataStream API is available in Scala and Java.
186 |
187 | - **Support for sessions and unaligned windows**: Most streaming systems have some concept of windowing,
188 | i.e., a temporal grouping of events based on some function of their timestamps. Unfortunately, in
189 | many systems these windows are hard-coded and connected with the system’s internal checkpointing
190 | mechanism. Flink is the first open source streaming engine that completely decouples windowing from
191 | fault tolerance, allowing for richer forms of windows, such as sessions.
192 |
193 | - **Consistency, fault tolerance, and high availability**: Flink guarantees consistent operator state
194 | in the presence of failures (often called "exactly-once processing"), and consistent data movement
195 | between selected sources and sinks (e.g., consistent data movement between Kafka and HDFS). Flink
196 | also supports master fail-over, eliminating any single point of failure.
197 |
198 | - **High throughput and low-latency processing**: We have clocked Flink at 1.5 million events per second per core,
199 | and have also observed latencies at the 25 millisecond range in jobs that include network data
200 | shuffling. Using a tuning knob, Flink users can control the latency-throughput trade-off, making
201 | the system suitable for both high-throughput data ingestion and transformations, as well as ultra
202 | low latency (millisecond range) applications.
203 |
204 | - **Integration with many systems for data input and output**: Flink integrates with a wide variety of
205 | open source systems for data input and output (e.g., HDFS, Kafka, Elasticsearch, HBase, and others),
206 | deployment (e.g., YARN), as well as acting as an execution engine for other frameworks (e.g.,
207 | Cascading, Google Cloud Dataflow). The Flink project itself comes bundled with a Hadoop MapReduce
208 | compatibility layer, a Storm compatibility layer, as well as libraries for Machine Learning and
209 | graph processing.
210 |
211 | - **Support for batch processing**: In Flink, batch processing is a special case of stream processing,
212 | as finite data sources are just streams that happen to end. Flink offers a dedicated execution mode
213 | for batch processing with a specialized DataSet API and libraries for Machine Learning and graph processing. In
214 | addition, Flink contains several batch-specific optimizations (e.g., for scheduling, memory
215 | management, and query optimization), matching and even out-performing dedicated batch processing
216 | engines in batch use cases.
217 |
218 | - **Developer productivity and operational simplicity**: Flink runs in a variety of environments. Local
219 | execution within an IDE significantly eases development and debugging of Flink applications.
220 | In distributed setups, Flink runs at massive scale-out. The YARN mode
221 | allows users to bring up Flink clusters in a matter of seconds. Flink serves monitoring metrics of
222 | jobs and the system as a whole via a well-defined REST interface. A built-in web dashboard
223 | displays these metrics and makes monitoring of Flink very convenient.
224 |
225 |
226 |