├── udfs
│   └── 01
│       ├── python_udf.py
│       └── 01_python_udfs.md
├── foundations
│   ├── 01
│   │   └── 01_create_table.md
│   ├── 04
│   │   └── 04_where.md
│   ├── 07
│   │   └── 07_views.md
│   ├── 05
│   │   └── 05_group_by.md
│   ├── 03
│   │   └── 03_temporary_table.md
│   ├── 02
│   │   └── 02_insert_into.md
│   ├── 06
│   │   └── 06_order_by.md
│   └── 08
│       └── 08_statement_sets.md
├── aggregations-and-analytics
│   ├── 06
│   │   └── 06_dedup.md
│   ├── 01
│   │   └── 01_group_by_window.md
│   ├── 05
│   │   └── 05_top_n.md
│   ├── 04
│   │   └── 04_over.md
│   ├── 02
│   │   └── 02_watermarks.md
│   ├── 03
│   │   └── 03_group_by_session_window.md
│   ├── 08
│   │   └── 08_match_recognize.md
│   └── 07
│       └── 07_chained_windows.md
├── joins
│   ├── 01
│   │   └── 01_regular_joins.md
│   ├── 02
│   │   └── 02_interval_joins.md
│   ├── 06
│   │   └── 06_lateral_join.md
│   ├── 04
│   │   └── 04_lookup_joins.md
│   ├── 03
│   │   └── 03_kafka_join.md
│   └── 05
│       └── 05_star_schema.md
├── README.md
├── other-builtin-functions
│   ├── 02
│   │   └── 02_union-all.md
│   └── 01
│       └── 01_date_time.md
└── LICENSE
/udfs/01/python_udf.py:
--------------------------------------------------------------------------------
1 | from pyflink.table import DataTypes
2 | from pyflink.table.udf import udf
3 |
4 | us_cities = {"Chicago","Portland","Seattle","New York"}
5 |
6 | @udf(input_types=[DataTypes.STRING(), DataTypes.FLOAT()],
7 | result_type=DataTypes.FLOAT())
8 | def to_fahr(city, temperature):
9 |     # Convert Celsius to Fahrenheit only for readings from US cities;
10 |     # readings from other cities pass through unchanged.
11 |     if city in us_cities:
12 |         fahr = (temperature * 9.0 / 5.0) + 32.0
13 |         return fahr
14 |     else:
15 |         return temperature
16 |
--------------------------------------------------------------------------------
/foundations/01/01_create_table.md:
--------------------------------------------------------------------------------
1 | # 01 Creating Tables
2 |
3 | :bulb: This example will show how to create a table using SQL DDL.
4 |
5 | Flink SQL operates against logical tables, just like a traditional database.
6 | However, it does not maintain tables internally but always operates against external systems.
7 |
8 | Table definitions have two parts: the logical schema and the connector configuration. The logical schema defines the columns and types in the table and is what queries operate against.
9 | The connector configuration is contained in the `WITH` clause and defines the physical system that backs this table.
10 | This example uses the `datagen` connector which generates rows in memory and is convenient for testing queries.
11 |
12 | You can test that the table was properly created by running a simple `SELECT` statement.
13 | In Ververica Platform you will see the results printed to the UI in the query preview.
14 |
15 | ## Script
16 |
17 | ```sql
18 |
19 | CREATE TABLE orders (
20 | order_uid BIGINT,
21 | product_id BIGINT,
22 | price DECIMAL(32, 2),
23 | order_time TIMESTAMP(3)
24 | ) WITH (
25 | 'connector' = 'datagen'
26 | );
27 |
28 | SELECT * FROM orders;
29 | ```
30 |
31 | ## Example Output
32 |
33 | 
34 |
--------------------------------------------------------------------------------
/foundations/04/04_where.md:
--------------------------------------------------------------------------------
1 | # 04 Filtering Data
2 |
3 | :bulb: This example will show how to filter server logs in real-time using a standard `WHERE` clause.
4 |
5 | The table it uses, `server_logs`, is backed by the [`faker` connector](https://github.com/knaufk/flink-faker) which continuously generates rows in memory based on Java Faker expressions and is convenient for testing queries.
6 | As such, it is an alternative to the built-in `datagen` connector used for example in [the first recipe](../01/01_create_table.md).
7 |
8 | You can continuously filter these logs for requests that experience authentication or authorization issues using a simple `SELECT` statement with a `WHERE` clause that filters on the auth-related HTTP status codes.
9 | In Ververica Platform you will see the results printed to the UI in the query preview.
10 |
11 | ## Script
12 |
13 | ```sql
14 | CREATE TABLE server_logs (
15 | client_ip STRING,
16 | client_identity STRING,
17 | userid STRING,
18 | log_time TIMESTAMP(3),
19 | request_line STRING,
20 | status_code STRING,
21 | size INT
22 | ) WITH (
23 | 'connector' = 'faker',
24 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
25 | 'fields.client_identity.expression' = '-',
26 | 'fields.userid.expression' = '-',
27 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
28 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
29 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
30 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
31 | );
32 |
33 | SELECT
34 | log_time,
35 | request_line,
36 | status_code
37 | FROM server_logs
38 | WHERE
39 | status_code IN ('403', '401')
40 | ```
41 |
42 | ## Example Output
43 |
44 | 
45 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/06/06_dedup.md:
--------------------------------------------------------------------------------
1 | # 06 Deduplication
2 |
3 | :bulb: This example will show how you can identify and filter out duplicates in a stream of events.
4 |
5 | There are different ways that duplicate events can end up in your data sources, from human error to application bugs. Regardless of the origin, unclean data can have a real impact on the quality (and correctness) of your results. Suppose that your order system occasionally generates duplicate events with the same `order_id`, and that you're only interested in keeping the most recent event for downstream processing.
6 |
7 | As a first step, you can use a combination of the `COUNT` function and the `HAVING` clause to check if (and which) orders have more than one event, and then filter out these duplicates using `ROW_NUMBER()`. In practice, deduplication is a special case of [Top-N aggregation](../05/05_top_n.md), where N is 1 (`rownum = 1`) and the ordering column is either the processing or event time of the events.
8 |
9 | ## Script
10 |
11 | The source table `orders` is backed by the built-in [`datagen` connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/datagen.html), which continuously generates rows in memory.
12 |
13 | ```sql
14 | CREATE TABLE orders (
15 | id INT,
16 | order_time AS CURRENT_TIMESTAMP,
17 | WATERMARK FOR order_time AS order_time - INTERVAL '5' SECONDS
18 | )
19 | WITH (
20 | 'connector' = 'datagen',
21 | 'rows-per-second'='10',
22 | 'fields.id.kind'='random',
23 | 'fields.id.min'='1',
24 | 'fields.id.max'='100'
25 | );
26 |
27 | --Check for duplicates in the `orders` table
28 | SELECT id AS order_id,
29 | COUNT(*) AS order_cnt
30 | FROM orders o
31 | GROUP BY id
32 | HAVING COUNT(*) > 1;
33 |
34 | --Use deduplication to keep only the latest record for each `order_id`
35 | SELECT
36 | order_id,
37 | order_time
38 | FROM (
39 | SELECT id AS order_id,
40 | order_time,
41 |        ROW_NUMBER() OVER (PARTITION BY id ORDER BY order_time DESC) AS rownum
42 | FROM orders
43 | )
44 | WHERE rownum = 1;
45 | ```
46 |
47 | ## Example Output
48 |
49 | 
50 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/01/01_group_by_window.md:
--------------------------------------------------------------------------------
1 | # 01 Aggregating Time Series Data
2 |
3 | :bulb: This example will show how to aggregate time-series data in real-time using a `TUMBLE` window.
4 |
5 | The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | Many streaming applications work with time-series data.
8 | To count the number of `DISTINCT` IP addresses seen each minute, rows need to be grouped based on a [time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#time-attributes).
9 | Grouping based on time is special, because time always moves forward, which means Flink can generate final results after the minute is completed.
10 |
11 | `TUMBLE` is a built-in function for grouping timestamps into time intervals called windows.
12 | Unlike other aggregations, it will only produce a single final result for each key when the interval is completed.
13 |
14 | If the logs do not have a timestamp, one can be generated using a [computed column](https://docs.ververica.com/user_guide/sql_development/table_view.html#computed-column).
15 | `log_time AS PROCTIME()` will append a column to the table with the current system time.
16 |
17 | ## Script
18 |
19 | ```sql
20 | CREATE TABLE server_logs (
21 | client_ip STRING,
22 | client_identity STRING,
23 | userid STRING,
24 | request_line STRING,
25 | status_code STRING,
26 | log_time AS PROCTIME()
27 | ) WITH (
28 | 'connector' = 'faker',
29 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
30 | 'fields.client_identity.expression' = '-',
31 | 'fields.userid.expression' = '-',
32 |   'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
33 |   'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}'
34 | );
35 |
36 | SELECT
37 | COUNT(DISTINCT client_ip) AS ip_addresses,
38 | TUMBLE_PROCTIME(log_time, INTERVAL '1' MINUTE) AS window_interval
39 | FROM server_logs
40 | GROUP BY
41 | TUMBLE(log_time, INTERVAL '1' MINUTE)
42 | ```
43 |
44 | ## Example Output
45 |
46 | 
47 |
--------------------------------------------------------------------------------
/joins/01/01_regular_joins.md:
--------------------------------------------------------------------------------
1 | # 01 Regular Joins
2 |
3 | :bulb: This example will show how you can use joins to correlate rows across multiple tables.
4 |
5 | Flink SQL supports complex and flexible join operations over continuous tables.
6 | There are several different types of joins to account for the wide variety of semantics queries may require.
7 |
8 | Regular joins are the most generic and flexible type of join.
9 | These include the standard `INNER` and `[FULL|LEFT|RIGHT] OUTER` joins that are available in most modern databases.
10 |
11 | Suppose we have a [NOC list](https://en.wikipedia.org/wiki/Non-official_cover) of secret agents all over the world.
12 | Your mission, should you choose to accept it, is to join this table with another table containing the agents' real names.
13 |
14 | In Flink SQL, this can be achieved using a simple `INNER JOIN`.
15 | Flink will join the tables using an equi-join predicate on the `agent_id` and output a new row every time there is a match.
16 |
17 | However, there is something to be careful of.
18 | Flink must retain every input row as part of the join to potentially join it with the other table in the future.
19 | This means the query's resource requirements will grow indefinitely, and it will eventually fail.
20 | While this type of join is useful in some scenarios, other joins are more powerful in a streaming context and significantly more space-efficient.
21 |
22 | In this example, both tables are bounded to remain space efficient.
23 |
24 | ```sql
25 | CREATE TABLE NOC (
26 | agent_id STRING,
27 | codename STRING
28 | )
29 | WITH (
30 | 'connector' = 'faker',
31 | 'fields.agent_id.expression' = '#{regexify ''(1|2|3|4|5){1}''}',
32 | 'fields.codename.expression' = '#{superhero.name}',
33 | 'number-of-rows' = '10'
34 | );
35 |
36 | CREATE TABLE RealNames (
37 | agent_id STRING,
38 | name STRING
39 | )
40 | WITH (
41 | 'connector' = 'faker',
42 | 'fields.agent_id.expression' = '#{regexify ''(1|2|3|4|5){1}''}',
43 | 'fields.name.expression' = '#{Name.full_name}',
44 | 'number-of-rows' = '10'
45 | );
46 |
47 | SELECT
48 | name,
49 | codename
50 | FROM NOC
51 | INNER JOIN RealNames ON NOC.agent_id = RealNames.agent_id;
52 | ```
53 |
54 | 
55 |
--------------------------------------------------------------------------------
/foundations/07/07_views.md:
--------------------------------------------------------------------------------
1 | # 07 Encapsulating Logic with (Temporary) Views
2 |
3 | :bulb: This example will show how you can use (temporary) views to reuse code and to structure long queries and scripts.
4 |
5 | `CREATE (TEMPORARY) VIEW` defines a view from a query.
6 | **A view is not physically materialized.**
7 | Instead, the query is run every time the view is referenced in a query.
8 |
9 | Temporary views are very useful to structure and decompose more complicated queries and to re-use queries within a longer script.
10 | Non-temporary views - stored in a persistent [catalog](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/catalogs.html) - can also be used to share common queries within your organization, e.g. common filters or pre-processing steps.
11 |
12 | Here, we create a view on the `server_logs` that only contains successful requests.
13 | This view encapsulates the logic of filtering the logs based on certain `status_code`s.
14 | This logic can subsequently be used by any query or script that has access to the catalog.
15 |
16 | ## Script
17 |
18 | ```sql
19 | CREATE TABLE server_logs (
20 | client_ip STRING,
21 | client_identity STRING,
22 | userid STRING,
23 | user_agent STRING,
24 | log_time TIMESTAMP(3),
25 | request_line STRING,
26 | status_code STRING,
27 | size INT
28 | ) WITH (
29 | 'connector' = 'faker',
30 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
31 | 'fields.client_identity.expression' = '-',
32 | 'fields.userid.expression' = '-',
33 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}',
34 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
35 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
36 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
37 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
38 | );
39 |
40 | CREATE VIEW successful_requests AS
41 | SELECT *
42 | FROM server_logs
43 | WHERE status_code SIMILAR TO '[23][0-9][0-9]';
44 |
45 | SELECT * FROM successful_requests;
46 | ```
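
Because the view lives in the catalog, any other query can build on top of it instead of repeating the filter. As a minimal sketch (not part of the original script), a follow-up query could aggregate over the view, for example counting successful requests per status code:

```sql
-- Reuse the view instead of re-stating the status_code filter
SELECT
  status_code,
  COUNT(*) AS cnt_status
FROM successful_requests
GROUP BY status_code;
```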
47 |
48 | ## Example Output
49 |
50 | 
51 |
52 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/05/05_top_n.md:
--------------------------------------------------------------------------------
1 | # 05 Continuous Top-N
2 |
3 | :bulb: This example will show how to continuously calculate the "Top-N" rows based on a given attribute, using an `OVER` window and the `ROW_NUMBER()` function.
4 |
5 | The source table (`spells_cast`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | The Ministry of Magic tracks every spell a wizard casts throughout Great Britain and wants to know every wizard's Top 2 all-time favorite spells.
8 |
9 | Flink SQL can be used to calculate continuous [aggregations](../../foundations/05/05_group_by.md), so if we know
10 | each spell a wizard has cast, we can maintain a continuous total of how many times they have cast that spell.
11 |
12 | ```sql
13 | SELECT wizard, spell, COUNT(*) AS times_cast
14 | FROM spells_cast
15 | GROUP BY wizard, spell;
16 | ```
17 |
18 | This result can be used in an `OVER` window to calculate a [Top-N](https://docs.ververica.com/user_guide/sql_development/queries.html#top-n).
19 | The rows are partitioned using the `wizard` column, and are then ordered based on the count of spell casts (`times_cast DESC`).
20 | The built-in function `ROW_NUMBER()` assigns a unique, sequential number to each row, starting from one, according to the rows' ordering within the partition.
21 | Finally, the results are filtered for only those rows with a `row_num <= 2` to find each wizard's top 2 favorite spells.
22 |
23 | What makes Flink particularly powerful in this query is its ability to issue retractions.
24 | As wizards cast more spells, their top 2 will change.
25 | When this occurs, Flink will issue a retraction, modifying its output, so the result is always correct and up to date.
26 |
27 |
28 | ```sql
29 | CREATE TABLE spells_cast (
30 | wizard STRING,
31 | spell STRING
32 | ) WITH (
33 | 'connector' = 'faker',
34 | 'fields.wizard.expression' = '#{harry_potter.characters}',
35 | 'fields.spell.expression' = '#{harry_potter.spells}'
36 | );
37 |
38 | SELECT wizard, spell, times_cast
39 | FROM (
40 | SELECT *,
41 | ROW_NUMBER() OVER (PARTITION BY wizard ORDER BY times_cast DESC) AS row_num
42 | FROM (SELECT wizard, spell, COUNT(*) AS times_cast FROM spells_cast GROUP BY wizard, spell)
43 | )
44 | WHERE row_num <= 2;
45 | ```
46 |
47 | 
48 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Apache Flink SQL Cookbook
2 |
3 | The [Apache Flink SQL](https://docs.ververica.com/user_guide/sql_development/index.html) Cookbook is a curated collection of examples, patterns, and use cases of Apache Flink SQL.
4 | Many of the recipes are completely self-contained and can be run in [Ververica Platform](https://docs.ververica.com/index.html) as is.
5 |
6 | The cookbook is a living document. :seedling:
7 |
8 | ## Table of Contents
9 |
10 | ### Foundations
11 |
12 | 1. [Creating Tables](foundations/01/01_create_table.md)
13 | 2. [Inserting Into Tables](foundations/02/02_insert_into.md)
14 | 3. [Working with Temporary Tables](foundations/03/03_temporary_table.md)
15 | 4. [Filtering Data](foundations/04/04_where.md)
16 | 5. [Aggregating Data](foundations/05/05_group_by.md)
17 | 6. [Sorting Tables](foundations/06/06_order_by.md)
18 | 7. [Encapsulating Logic with (Temporary) Views](foundations/07/07_views.md)
19 | 8. [Writing Results into Multiple Tables](foundations/08/08_statement_sets.md)
20 |
21 | ### Aggregations and Analytics
22 | 1. [Aggregating Time Series Data](aggregations-and-analytics/01/01_group_by_window.md)
23 | 2. [Watermarks](aggregations-and-analytics/02/02_watermarks.md)
24 | 3. [Analyzing Sessions in Time Series Data](aggregations-and-analytics/03/03_group_by_session_window.md)
25 | 4. [Rolling Aggregations on Time Series Data](aggregations-and-analytics/04/04_over.md)
26 | 5. [Continuous Top-N](aggregations-and-analytics/05/05_top_n.md)
27 | 6. [Deduplication](aggregations-and-analytics/06/06_dedup.md)
28 | 7. [Chained (Event) Time Windows](aggregations-and-analytics/07/07_chained_windows.md)
29 | 8. [Detecting Patterns with MATCH_RECOGNIZE](aggregations-and-analytics/08/08_match_recognize.md)
30 |
31 | ### Other Built-in Functions & Operators
32 | 1. [Working with Dates and Timestamps](other-builtin-functions/01/01_date_time.md)
33 | 2. [Building the Union of Multiple Streams](other-builtin-functions/02/02_union-all.md)
34 |
35 | ### User-Defined Functions (UDFs)
36 | 1. [Extending SQL with Python UDFs](udfs/01/01_python_udfs.md)
37 |
38 | ### Joins
39 |
40 | 1. [Regular Joins](joins/01/01_regular_joins.md)
41 | 2. [Interval Joins](joins/02/02_interval_joins.md)
42 | 3. [Temporal Table Join between a non-compacted and compacted Kafka Topic](joins/03/03_kafka_join.md)
43 | 4. [Lookup Joins](joins/04/04_lookup_joins.md)
44 | 5. [Star Schema Denormalization (N-Way Join)](joins/05/05_star_schema.md)
45 | 6. [Lateral Table Join](joins/06/06_lateral_join.md)
46 |
47 | ## About Apache Flink
48 |
49 | Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.
50 |
51 | Learn more about Flink at https://flink.apache.org/.
52 |
53 | ## License
54 |
55 | Copyright © 2020 Ververica GmbH
56 |
57 | Distributed under Apache License, Version 2.0.
58 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/04/04_over.md:
--------------------------------------------------------------------------------
1 | # 04 Rolling Aggregations on Time Series Data
2 |
3 | :bulb: This example will show how to calculate an aggregate or cumulative value based on a group of rows using an `OVER` window. A typical use case are rolling aggregations.
4 |
5 | The source table (`temperature_measurements`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | OVER window aggregates compute an aggregated value for every input row over a range of ordered rows.
8 | In contrast to GROUP BY aggregates, OVER aggregates do not reduce the number of result rows to a single row for every group.
9 | Instead, OVER aggregates produce an aggregated value for every input row.
10 |
11 | The order needs to be defined by a [time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#time-attributes).
12 | The range of rows can be defined by a number of rows or a time interval.
13 |
14 | In this example, we are trying to identify outliers in the `temperature_measurements` table.
15 | For this, we use an `OVER` window to calculate, for each measurement, the maximum (`MAX`), minimum (`MIN`) and average (`AVG`) temperature across all measurements, as well as the standard deviation (`STDDEV`), for the same city over the previous minute.
16 | > As an exercise, you can try to write another query to filter out any temperature measurements that are higher or lower than the average by more than four standard deviations (one possible solution is sketched after the script).
17 |
18 | ## Script
19 |
20 | ```sql
21 | CREATE TEMPORARY TABLE temperature_measurements (
22 | measurement_time TIMESTAMP(3),
23 | city STRING,
24 | temperature FLOAT,
25 | WATERMARK FOR measurement_time AS measurement_time - INTERVAL '15' SECONDS
26 | )
27 | WITH (
28 | 'connector' = 'faker',
29 | 'fields.measurement_time.expression' = '#{date.past ''15'',''SECONDS''}',
30 | 'fields.temperature.expression' = '#{number.numberBetween ''0'',''50''}',
31 |   'fields.city.expression' = '#{regexify ''(Chicago|Munich|Berlin|Portland|Hangzhou|Seattle|Beijing|New York){1}''}'
32 | );
33 |
34 | SELECT
35 | measurement_time,
36 | city,
37 | temperature,
38 | AVG(CAST(temperature AS FLOAT)) OVER last_minute AS avg_temperature_minute,
39 |   MAX(temperature) OVER last_minute AS max_temperature_minute,
40 |   MIN(temperature) OVER last_minute AS min_temperature_minute,
41 | STDDEV(CAST(temperature AS FLOAT)) OVER last_minute AS stdev_temperature_minute
42 | FROM temperature_measurements
43 | WINDOW last_minute AS (
44 | PARTITION BY city
45 | ORDER BY measurement_time
46 | RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW
47 | )
48 | ```
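
One possible solution to the exercise above, as a sketch rather than a definitive implementation: wrap the `OVER` aggregation in a subquery and keep only the rows that deviate from the rolling average by more than four standard deviations. The column aliases are illustrative.

```sql
SELECT *
FROM (
  SELECT
    measurement_time,
    city,
    temperature,
    AVG(temperature) OVER last_minute AS avg_temperature_minute,
    STDDEV(temperature) OVER last_minute AS stdev_temperature_minute
  FROM temperature_measurements
  WINDOW last_minute AS (
    PARTITION BY city
    ORDER BY measurement_time
    RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW
  )
)
-- Keep only measurements more than four standard deviations away from the rolling average
WHERE ABS(temperature - avg_temperature_minute) > 4 * stdev_temperature_minute;
```
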
49 | ## Example Output
50 |
51 | 
52 |
--------------------------------------------------------------------------------
/joins/02/02_interval_joins.md:
--------------------------------------------------------------------------------
1 | # 02 Interval Joins
2 |
3 | :bulb: This example will show how you can perform joins between tables with events that are related in a temporal context.
4 |
5 | ## Why Interval Joins?
6 |
7 | In a [previous recipe](../01/01_regular_joins.md), you learned about using _regular joins_ in Flink SQL. This kind of join works well for some scenarios, but for others a more efficient type of join is required to keep resource utilization from growing indefinitely.
8 |
9 | One of the ways to optimize joining operations in Flink SQL is to use [_interval joins_](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html#interval-joins). An interval join is defined by a join predicate that checks if the time attributes of the input events are within certain time constraints (i.e. a time window).
10 |
11 | ## Using Interval Joins
12 |
13 | Suppose you want to join events of two tables that correlate to each other in the [order fulfillment lifecycle](https://en.wikipedia.org/wiki/Order_fulfillment) (`orders` and `shipments`) and that are under a Service-Level Agreement (SLA) of **3 days**. To reduce the amount of input rows Flink has to retain and optimize the join operation, you can define a time constraint in the `WHERE` clause to bound the time on both sides to that specific interval using a `BETWEEN` predicate.
14 |
15 | ## Script
16 |
17 | The source tables (`orders` and `shipments`) are backed by the built-in [`datagen` connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/datagen.html), which continuously generates rows in memory.
18 |
19 | ```sql
20 | CREATE TABLE orders (
21 | id INT,
22 | order_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)+5)*(-1) AS INT), CURRENT_TIMESTAMP)
23 | )
24 | WITH (
25 | 'connector' = 'datagen',
26 | 'rows-per-second'='10',
27 | 'fields.id.kind'='sequence',
28 | 'fields.id.start'='1',
29 | 'fields.id.end'='1000'
30 | );
31 |
32 |
33 | CREATE TABLE shipments (
34 | id INT,
35 | order_id INT,
36 | shipment_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)) AS INT), CURRENT_TIMESTAMP)
37 | )
38 | WITH (
39 | 'connector' = 'datagen',
40 | 'rows-per-second'='5',
41 | 'fields.id.kind'='random',
42 | 'fields.id.min'='0',
43 | 'fields.order_id.kind'='sequence',
44 | 'fields.order_id.start'='1',
45 | 'fields.order_id.end'='1000'
46 | );
47 |
48 | SELECT
49 | o.id AS order_id,
50 | o.order_time,
51 | s.shipment_time,
52 | TIMESTAMPDIFF(DAY,o.order_time,s.shipment_time) AS day_diff
53 | FROM orders o
54 | JOIN shipments s ON o.id = s.order_id
55 | WHERE
56 | o.order_time BETWEEN s.shipment_time - INTERVAL '3' DAY AND s.shipment_time;
57 | ```
58 |
59 | ## Example Output
60 |
61 | 
62 |
--------------------------------------------------------------------------------
/foundations/05/05_group_by.md:
--------------------------------------------------------------------------------
1 | # 05 Aggregating Data
2 |
3 | :bulb: This example will show how to aggregate server logs in real-time using the standard `GROUP BY` clause.
4 |
5 | The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | To count the number of logs received per browser for each status code _over time_, you can combine the `COUNT` aggregate function with a `GROUP BY` clause. Because the `user_agent` field contains a lot of information, you can extract the browser using the built-in [string function](https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/functions/systemFunctions.html#string-functions) `REGEXP_EXTRACT`.
8 |
9 | A `GROUP BY` on a streaming table produces an updating result, so you will see the aggregated count for each browser continuously changing as new rows flow in.
10 |
11 | > As an exercise, you can play around with other standard SQL aggregate functions (e.g. `SUM`, `AVG`, `MIN`, `MAX`); one variation is sketched after the script.
12 |
13 | ## Script
14 |
15 | ```sql
16 | CREATE TABLE server_logs (
17 | client_ip STRING,
18 | client_identity STRING,
19 | userid STRING,
20 | user_agent STRING,
21 | log_time TIMESTAMP(3),
22 | request_line STRING,
23 | status_code STRING,
24 | size INT
25 | ) WITH (
26 | 'connector' = 'faker',
27 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
28 | 'fields.client_identity.expression' = '-',
29 | 'fields.userid.expression' = '-',
30 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}',
31 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
32 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
33 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
34 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
35 | );
36 |
37 | -- Sample user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A
38 | -- Regex pattern: '[^\/]+' (Match everything before '/')
39 | SELECT
40 | REGEXP_EXTRACT(user_agent,'[^\/]+') AS browser,
41 | status_code,
42 | COUNT(*) AS cnt_status
43 | FROM server_logs
44 | GROUP BY
45 | REGEXP_EXTRACT(user_agent,'[^\/]+'),
46 | status_code;
47 | ```
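
As a sketch of the exercise above (not part of the original script), the same grouping can be combined with other aggregate functions, for example to track response sizes per browser and status code:

```sql
SELECT
  REGEXP_EXTRACT(user_agent,'[^\/]+') AS browser,
  status_code,
  SUM(size) AS total_size,
  AVG(size) AS avg_size,
  MIN(size) AS min_size,
  MAX(size) AS max_size
FROM server_logs
GROUP BY
  REGEXP_EXTRACT(user_agent,'[^\/]+'),
  status_code;
```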
48 |
49 | ## Example Output
50 |
51 | This example can be run in the [SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html), a command line tool to develop and execute Flink SQL queries that is bundled in Flink.
52 |
53 | 
54 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/02/02_watermarks.md:
--------------------------------------------------------------------------------
1 | # 02 Watermarks
2 |
3 | :bulb: This example will show how to use `WATERMARK`s to work with timestamps in records.
4 |
5 | The source table (`doctor_sightings`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | The [previous recipe](../01/01_group_by_window.md) showed how a `TUMBLE` group window makes it simple to aggregate time-series data.
8 |
9 | [The Doctor](https://tardis.fandom.com/wiki/The_Doctor) is a renegade time lord who travels through space and time in a [TARDIS](https://tardis.fandom.com/wiki/The_Doctor%27s_TARDIS).
10 | As different versions of the Doctor travel through time, various people log their sightings.
11 | We want to track how many times each version of the Doctor is seen each minute.
12 | Unlike the previous recipe, these records have an embedded timestamp we need to use to perform our calculation.
13 |
14 | More often than not, data will come with embedded timestamps that we want to use for our time series calculations. We call this timestamp an [event-time attribute](https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/streaming_analytics.html#event-time-and-watermarks).
15 |
16 | Event time represents when something actually happened in the real world.
17 | It is unique because it is quasi-monotonically increasing; we generally see things that happened earlier before seeing things that happened later. Of course, data will never be perfectly ordered (systems go down, networks are laggy, doctor sightings take time to postmark and mail), and there will be some out-of-orderness in our data.
18 |
19 | Flink can account for all these variabilities using a [WATERMARK](https://docs.ververica.com/user_guide/sql_development/table_view.html#event-time) attribute in the table's DDL. The watermark marks a column as the table's event time attribute and tells Flink how out of order we expect our data to be.
20 |
21 | In the Doctor's case, we expect all records to arrive within 15 seconds after the sighting occurs.
22 |
23 | ## Script
24 |
25 | ```sql
26 | CREATE TABLE doctor_sightings (
27 | doctor STRING,
28 | sighting_time TIMESTAMP(3),
29 | WATERMARK FOR sighting_time AS sighting_time - INTERVAL '15' SECONDS
30 | )
31 | WITH (
32 | 'connector' = 'faker',
33 | 'fields.doctor.expression' = '#{dr_who.the_doctors}',
34 | 'fields.sighting_time.expression' = '#{date.past ''15'',''SECONDS''}'
35 | );
36 |
37 | SELECT
38 | doctor,
39 | TUMBLE_ROWTIME(sighting_time, INTERVAL '1' MINUTE) AS sighting_time,
40 | COUNT(*) AS sightings
41 | FROM doctor_sightings
42 | GROUP BY
43 | TUMBLE(sighting_time, INTERVAL '1' MINUTE),
44 | doctor;
45 | ```
46 |
47 | ## Example Output
48 |
49 | 
50 |
--------------------------------------------------------------------------------
/foundations/03/03_temporary_table.md:
--------------------------------------------------------------------------------
1 | # 03 Working with Temporary Tables
2 |
3 | :bulb: This example will show how and why to create a temporary table using SQL DDL.
4 |
5 | Non-temporary tables in Flink SQL are stored in a catalog, while temporary tables only live within the current session (Apache Flink CLI) or script (Ververica Platform).
6 | You can use a temporary table instead of a regular (catalog) table, if it is only meant to be used within the current session or script.
7 |
8 | This example is exactly the same as [Inserting Into Tables](../02/02_insert_into.md) except that both `server_logs` and `client_errors` are created as temporary tables.
9 |
10 | ### Why Temporary Tables?
11 |
12 | For result tables like `client_errors` that no one can ever read from (because of the `blackhole` connector type), it makes a lot of sense to use a temporary table instead of publishing its metadata in a catalog.
13 |
14 | Furthermore, temporary tables allow you to create fully self-contained scripts, which is why we will mostly use those in the Flink SQL Cookbook.
15 |
16 | ## Script
17 |
18 | ```sql
19 |
20 | CREATE TEMPORARY TABLE server_logs (
21 | client_ip STRING,
22 | client_identity STRING,
23 | userid STRING,
24 | user_agent STRING,
25 | log_time TIMESTAMP(3),
26 | request_line STRING,
27 | status_code STRING,
28 | size INT
29 | ) WITH (
30 | 'connector' = 'faker',
31 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
32 | 'fields.client_identity.expression' = '-',
33 | 'fields.userid.expression' = '-',
34 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}',
35 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
36 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
37 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
38 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
39 | );
40 |
41 | CREATE TEMPORARY TABLE client_errors (
42 | log_time TIMESTAMP(3),
43 | request_line STRING,
44 | status_code STRING,
45 | size INT
46 | )
47 | WITH (
48 | 'connector' = 'blackhole'
49 | );
50 |
51 | INSERT INTO client_errors
52 | SELECT
53 | log_time,
54 | request_line,
55 | status_code,
56 | size
57 | FROM server_logs
58 | WHERE
59 | status_code SIMILAR TO '4[0-9][0-9]'
60 | ```
61 |
62 | ## Example Output
63 |
64 | In comparison to [Inserting Into Tables](../02/02_insert_into.md), you can see that the two temporary tables do not appear in the catalog browser on the left.
65 | The table definitions never make it into the catalog, but are just submitted as part of the script that contains the `INSERT INTO` statement.
66 |
67 | 
68 |
--------------------------------------------------------------------------------
/foundations/02/02_insert_into.md:
--------------------------------------------------------------------------------
1 | # 02 Inserting Into Tables
2 |
3 | :bulb: This recipe shows how to insert rows into a table so that downstream applications can read them.
4 |
5 | As outlined in [the first recipe](../01/01_create_table.md), Flink SQL operates on tables that are stored in external systems.
6 | To publish results of a query for consumption by downstream applications, you write the results of a query into a table.
7 | This table can be read by Flink SQL, or directly by connecting to the external system that is storing the data (e.g. an Elasticsearch index).
8 |
9 | This example takes the `server_logs` table, filters for client errors, and writes these logs into another table called `client_errors`.
10 | Any number of external systems could back the result table, including Apache Kafka, Apache Hive, Elasticsearch, and JDBC databases, among many others.
11 | To keep this example self-contained, `client_errors` is of type `blackhole`: instead of actually writing the data to an external system, the table discards any rows written to it.
12 |
13 | ## Script
14 |
15 | ```sql
16 | CREATE TABLE server_logs (
17 | client_ip STRING,
18 | client_identity STRING,
19 | userid STRING,
20 | user_agent STRING,
21 | log_time TIMESTAMP(3),
22 | request_line STRING,
23 | status_code STRING,
24 | size INT
25 | ) WITH (
26 | 'connector' = 'faker',
27 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
28 | 'fields.client_identity.expression' = '-',
29 | 'fields.userid.expression' = '-',
30 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}',
31 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
32 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
33 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
34 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
35 | );
36 |
37 | CREATE TABLE client_errors (
38 | log_time TIMESTAMP(3),
39 | request_line STRING,
40 | status_code STRING,
41 | size INT
42 | )
43 | WITH (
44 | 'connector' = 'blackhole'
45 | );
46 |
47 | INSERT INTO client_errors
48 | SELECT
49 | log_time,
50 | request_line,
51 | status_code,
52 | size
53 | FROM server_logs
54 | WHERE
55 | status_code SIMILAR TO '4[0-9][0-9]'
56 | ```
57 |
58 | ## Example Output
59 |
60 | An `INSERT INTO` query that reads from an unbounded table (like `server_logs`) is a long-running application.
61 | When you run such a statement in Apache Flink's [SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html), a Flink job will be submitted to the configured cluster.
62 | In Ververica Platform, a so-called Deployment will be created to manage the execution of the statement.
63 |
64 | 
65 |
--------------------------------------------------------------------------------
/joins/06/06_lateral_join.md:
--------------------------------------------------------------------------------
1 | # 06 Lateral Table Join
2 |
3 | :bulb: This example will show how you can correlate events using a `LATERAL` join.
4 |
5 | A recent addition to the SQL standard is the `LATERAL` join, which allows you to combine
6 | the power of a correlated subquery with the expressiveness of a join.
7 |
8 | Given a table with people's addresses, you need to find the two most populous cities
9 | for each state and continuously update those rankings as people move. The input table
10 | of `People` contains a unique id for each person, their address, and the time they moved there.
11 |
12 | The first step is to calculate each city's population using a [continuous aggregation](../../foundations/05/05_group_by.md).
13 | While this is simple enough, the real power of Flink SQL comes when people move. By using
14 | [deduplication](../../aggregations-and-analytics/06/06_dedup.md) Flink will automatically issue a retraction for a person's old city when
15 | they move. So if John moves from New York to Los Angeles, the population of New York will
16 | automatically go down by 1. This gives us the power of change data capture without having
17 | to invest in the infrastructure of actually setting it up!
18 |
19 | With this dynamic population table at hand, you are ready to solve the original problem using a `LATERAL` table join.
20 | Unlike a normal join, lateral joins allow the subquery to correlate with columns from other arguments in the `FROM` clause. And unlike a regular subquery, as a join, the lateral can return multiple rows.
21 | You can now have a sub-query correlated with every individual state, and for every state it ranks by population and returns the top 2 cities.
22 |
23 | ## Script
24 |
25 | ```sql
26 | CREATE TABLE People (
27 | id INT,
28 | city STRING,
29 | state STRING,
30 | arrival_time TIMESTAMP(3),
31 | WATERMARK FOR arrival_time AS arrival_time - INTERVAL '1' MINUTE
32 | ) WITH (
33 | 'connector' = 'faker',
34 | 'fields.id.expression' = '#{number.numberBetween ''1'',''100''}',
35 | 'fields.city.expression' = '#{regexify ''(Newmouth|Newburgh|Portport|Southfort|Springfield){1}''}',
36 | 'fields.state.expression' = '#{regexify ''(New York|Illinois|California|Washington){1}''}',
37 | 'fields.arrival_time.expression' = '#{date.past ''15'',''SECONDS''}',
38 | 'rows-per-second' = '10'
39 | );
40 |
41 | CREATE TEMPORARY VIEW CurrentPopulation AS
42 | SELECT
43 | city,
44 | state,
45 | COUNT(*) as population
46 | FROM (
47 | SELECT
48 | city,
49 | state,
50 | ROW_NUMBER() OVER (PARTITION BY id ORDER BY arrival_time DESC) AS rownum
51 | FROM People
52 | )
53 | WHERE rownum = 1
54 | GROUP BY city, state;
55 |
56 | SELECT
57 | state,
58 | city,
59 | population
60 | FROM
61 | (SELECT DISTINCT state FROM CurrentPopulation) States,
62 | LATERAL (
63 | SELECT city, population
64 | FROM CurrentPopulation
65 | WHERE state = States.state
66 | ORDER BY population DESC
67 | LIMIT 2
68 | );
69 | ```
70 |
71 | ## Example Output
72 |
73 | 
74 |
--------------------------------------------------------------------------------
/other-builtin-functions/02/02_union-all.md:
--------------------------------------------------------------------------------
1 | # 02 Building the Union of Multiple Streams
2 |
3 | :bulb: This example will show how you can use the set operation `UNION ALL` to combine several streams of data.
4 |
5 | See [our documentation](https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/sql/queries/#set-operations)
6 | for a full list of fantastic set operations Apache Flink supports.
7 |
8 |
9 | ## The Sources
10 |
11 | The example assumes you are building an application that is tracking visits :fox_face: on foreign planets :chestnut:.
12 | There are three sources of visits: the universe of Rick and Morty, the very real world of NASA and such,
13 | and the not-so-real world of The Hitchhiker's Guide to the Galaxy.
14 |
15 | All three tables are `unbounded` and backed by the [`faker` connector](https://github.com/knaufk/flink-faker).
16 |
17 | All sources of tracked visits have the `location` and `visit_time` in common. Some have `visitors`, some have
18 | `spacecrafts` and one has both.
19 |
20 | ```sql
21 | CREATE TEMPORARY TABLE rickandmorty_visits (
22 | visitor STRING,
23 | location STRING,
24 | visit_time TIMESTAMP(3)
25 | ) WITH (
26 | 'connector' = 'faker',
27 | 'fields.visitor.expression' = '#{RickAndMorty.character}',
28 | 'fields.location.expression' = '#{RickAndMorty.location}',
29 | 'fields.visit_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}'
30 | );
31 |
32 | CREATE TEMPORARY TABLE spaceagency_visits (
33 | spacecraft STRING,
34 | location STRING,
35 | visit_time TIMESTAMP(3)
36 | ) WITH (
37 | 'connector' = 'faker',
38 | 'fields.spacecraft.expression' = '#{Space.nasaSpaceCraft}',
39 | 'fields.location.expression' = '#{Space.star}',
40 | 'fields.visit_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}'
41 | );
42 |
43 | CREATE TEMPORARY TABLE hitchhiker_visits (
44 | visitor STRING,
45 | starship STRING,
46 | location STRING,
47 | visit_time TIMESTAMP(3)
48 | ) WITH (
49 | 'connector' = 'faker',
50 | 'fields.visitor.expression' = '#{HitchhikersGuideToTheGalaxy.character}',
51 | 'fields.starship.expression' = '#{HitchhikersGuideToTheGalaxy.starship}',
52 | 'fields.location.expression' = '#{HitchhikersGuideToTheGalaxy.location}',
53 | 'fields.visit_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}'
54 | );
55 |
56 | ```
57 |
58 | ## The Query
59 |
60 | We are using `UNION ALL` because it doesn't try to combine equivalent rows like
61 | `UNION` would. That is also the reason why `UNION` can only be used on `bounded` tables.
62 |
63 |
64 | ```sql
65 | SELECT visitor, '' AS spacecraft, location, visit_time FROM rickandmorty_visits
66 | UNION ALL
67 | SELECT '' AS visitor, spacecraft, location, visit_time FROM spaceagency_visits
68 | UNION ALL
69 | SELECT visitor, starship AS spacecraft, location, visit_time FROM hitchhiker_visits;
70 | ```
71 |
72 | :alien: As we are using `CREATE TEMPORARY TABLE`, you need to run the `CREATE TEMPORARY TABLE` and the `SELECT` statements together.
73 |
74 | ## The Beauty in VVP
75 |
76 | 
77 |
78 |
79 | The result is a combined stream of people visiting a location in one of those fantastic universes.
80 | We are sure you'll understand why this is one of our favorite queries.
81 |
82 | :bird: [Let us know](https://twitter.com/ververicadata) about your favorite streaming SQL Query.
83 |
--------------------------------------------------------------------------------
/joins/04/04_lookup_joins.md:
--------------------------------------------------------------------------------
1 | # 04 Lookup Joins
2 |
3 | :bulb: This example will show how you can enrich a stream with an external table of reference data (i.e. a _lookup_ table).
4 |
5 | ## Data Enrichment
6 |
7 | Not all data changes frequently, even when working in real-time: in some cases, you might need to enrich streaming data with static — or _reference_ — data that is stored externally.
8 | For example, `user` metadata may be stored in a relational database that Flink needs to join against directly.
9 | Flink SQL allows you to look up reference data and join it with a stream using a _lookup join_. The join requires one table to have a [processing time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#processing-time-attributes) and the other table to be backed by a [lookup source connector](https://docs.ververica.com/user_guide/sql_development/connectors.html#id1), like the JDBC connector.
10 |
11 | ## Using Lookup Joins
12 |
13 | In this example, you will look up reference user data stored in MySQL to flag subscription events for users that are minors (`age < 18`). The `FOR SYSTEM_TIME AS OF` clause uses the processing time attribute to ensure that each row of the `subscriptions` table is joined with the `users` rows that match the join predicate at the point in time when the `subscriptions` row is processed by the join operator. The lookup join also requires an equality join predicate based on the `PRIMARY KEY` of the lookup table (`usub.user_id = u.user_id`). Here, the source does not have to read the entire table and can lazily fetch individual values from the external table when necessary.
14 |
15 | ## Script
16 |
17 | The source table (`subscriptions`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. The `users` table is backed by an existing MySQL reference table using the [JDBC connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/jdbc.html).
18 |
19 | ```sql
20 | CREATE TABLE subscriptions (
21 | id STRING,
22 | user_id INT,
23 | type STRING,
24 | start_date TIMESTAMP(3),
25 | end_date TIMESTAMP(3),
26 | payment_expiration TIMESTAMP(3),
27 | proc_time AS PROCTIME()
28 | ) WITH (
29 | 'connector' = 'faker',
30 | 'fields.id.expression' = '#{Internet.uuid}',
31 | 'fields.user_id.expression' = '#{number.numberBetween ''1'',''50''}',
32 | 'fields.type.expression'= '#{regexify ''(basic|premium|platinum){1}''}',
33 | 'fields.start_date.expression' = '#{date.past ''30'',''DAYS''}',
34 | 'fields.end_date.expression' = '#{date.future ''365'',''DAYS''}',
35 | 'fields.payment_expiration.expression' = '#{date.future ''365'',''DAYS''}'
36 | );
37 |
38 | CREATE TABLE users (
39 |   user_id INT PRIMARY KEY NOT ENFORCED,
40 | user_name VARCHAR(255) NOT NULL,
41 | age INT NOT NULL
42 | )
43 | WITH (
44 | 'connector' = 'jdbc',
45 | 'url' = 'jdbc:mysql://localhost:3306/mysql-database',
46 | 'table-name' = 'users',
47 | 'username' = 'mysql-user',
48 | 'password' = 'mysql-password'
49 | );
50 |
51 | SELECT
52 | id AS subscription_id,
53 | type AS subscription_type,
54 | age AS user_age,
55 | CASE
56 | WHEN age < 18 THEN 1
57 | ELSE 0
58 | END AS is_minor
59 | FROM subscriptions usub
60 | JOIN users FOR SYSTEM_TIME AS OF usub.proc_time AS u
61 | ON usub.user_id = u.user_id;
62 | ```
63 |
64 | ## Example Output
65 |
66 | 
67 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/03/03_group_by_session_window.md:
--------------------------------------------------------------------------------
1 | # 03 Analyzing Sessions in Time Series Data
2 |
3 | :bulb: This example will show how to aggregate time-series data in real-time using a `SESSION` window.
4 |
5 | The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | #### What are Session Windows?
8 |
9 | In a [previous recipe](../01/01_group_by_window.md), you learned about _tumbling windows_. Another way to group time-series data is using [_session windows_](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sql/queries.html#group-windows), which aggregate records into _sessions_ that represent periods of activity followed by gaps of idleness. Think, for example, of user sessions on a website: a user will be active for a given period of time, then leave the website; and each user will be active at different times. To analyze user behaviour, it's useful to aggregate their actions on the website for each period of activity (i.e. _session_).
10 |
11 | Unlike tumbling windows, session windows don't have a fixed duration and are tracked independently across keys (i.e. windows of different keys will have different durations).
12 |
13 | #### Using Session Windows
14 |
15 | To count the number of "Forbidden" (403) requests per user over the duration of a session, you can use the `SESSION` built-in group window function. In this example, a session is bounded by a gap of idleness of 10 seconds (`INTERVAL '10' SECOND`). This means that requests that occur within 10 seconds of the last seen request for each user will be merged into the same session window; and any request that occurs outside of this gap will trigger the creation of a new session window.
16 |
17 | > Tip: You can use the `SESSION_START` and `SESSION_ROWTIME` [auxiliary functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sql/queries.html#selecting-group-window-start-and-end-timestamps) to check the lower and upper bounds of session windows.
18 |
19 |
20 | ## Script
21 |
22 | ```sql
23 | CREATE TABLE server_logs (
24 | client_ip STRING,
25 | client_identity STRING,
26 | userid STRING,
27 | log_time TIMESTAMP(3),
28 | request_line STRING,
29 | status_code STRING,
30 | WATERMARK FOR log_time AS log_time - INTERVAL '5' SECONDS
31 | ) WITH (
32 | 'connector' = 'faker',
33 | 'rows-per-second' = '5',
34 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
35 | 'fields.client_identity.expression' = '-',
36 | 'fields.userid.expression' = '#{regexify ''(morsapaes|knauf|sjwiesman){1}''}',
37 | 'fields.log_time.expression' = '#{date.past ''5'',''SECONDS''}',
38 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
39 |   'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}'
40 | );
41 |
42 | SELECT
43 | userid,
44 | SESSION_START(log_time, INTERVAL '10' SECOND) AS session_beg,
45 | SESSION_ROWTIME(log_time, INTERVAL '10' SECOND) AS session_end,
46 | COUNT(request_line) AS request_cnt
47 | FROM server_logs
48 | WHERE status_code = '403'
49 | GROUP BY
50 | userid,
51 | SESSION(log_time, INTERVAL '10' SECOND);
52 | ```
53 |
54 | ## Example Output
55 |
56 | 
57 |
--------------------------------------------------------------------------------
/other-builtin-functions/01/01_date_time.md:
--------------------------------------------------------------------------------
1 | # 01 Working with Dates and Timestamps
2 |
3 | :bulb: This example will show how to use [built-in date and time functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html#temporal-functions) to manipulate temporal fields.
4 |
5 | The source table (`subscriptions`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | #### Date and Time Functions
8 |
9 | Working with dates and timestamps is commonplace in SQL, but your input may come in different types, formats or even timezones. Flink SQL has multiple built-in functions that are useful to deal with this kind of situation and make it convenient to handle temporal fields.
10 |
11 | Assume you have a table with service subscriptions and that you want to continuously [filter](../../foundations/04/04_where.md) these subscriptions to find the ones that have associated payment methods expiring in less than 30 days. The `start_date` and `end_date` are [Unix timestamps](https://en.wikipedia.org/wiki/Unix_time) (i.e. epochs) — which are not very human-readable and should be converted. Also, you want to parse the `payment_expiration` timestamp into its corresponding day, month and year parts. What are some functions that would be useful?
12 |
13 | * `TO_TIMESTAMP(string[, format])`: converts a `STRING` value to a `TIMESTAMP` using the specified format (default: 'yyyy-MM-dd HH:mm:ss')
14 |
15 | * `FROM_UNIXTIME(numeric[, string])`: converts an epoch to a formatted `STRING` (default: 'yyyy-MM-dd HH:mm:ss')
16 |
17 | * `DATE_FORMAT(timestamp, string)`: converts a `TIMESTAMP` to a `STRING` using the specified format
18 |
19 | * `EXTRACT(timeintervalunit FROM temporal)`: returns a `LONG` extracted from the specified date part of a temporal field (e.g. `DAY`,`MONTH`,`YEAR`)
20 |
21 | * `TIMESTAMPDIFF(unit, timepoint1, timepoint2)`: returns the number of time units (`SECOND`, `MINUTE`, `HOUR`, `DAY`, `MONTH` or `YEAR`) between `timepoint1` and `timepoint2`
22 |
23 | * `CURRENT_TIMESTAMP`: returns the current SQL timestamp (UTC)
24 |
25 | For a complete list of built-in date and time functions, check the Flink [documentation](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/functions/systemFunctions.html#temporal-functions).
26 |
27 | > As an exercise, you can try to reproduce the same filtering condition using `TIMESTAMPADD` instead (one possible solution is sketched after the script).
28 |
29 | ## Script
30 |
31 | ```sql
32 | CREATE TABLE subscriptions (
33 | id STRING,
34 | start_date INT,
35 | end_date INT,
36 | payment_expiration TIMESTAMP(3)
37 | ) WITH (
38 | 'connector' = 'faker',
39 | 'fields.id.expression' = '#{Internet.uuid}',
40 | 'fields.start_date.expression' = '#{number.numberBetween ''1576141834'',''1607764234''}',
41 | 'fields.end_date.expression' = '#{number.numberBetween ''1609060234'',''1639300234''}',
42 | 'fields.payment_expiration.expression' = '#{date.future ''365'',''DAYS''}'
43 | );
44 |
45 | SELECT
46 | id,
47 | TO_TIMESTAMP(FROM_UNIXTIME(start_date)) AS start_date,
48 | TO_TIMESTAMP(FROM_UNIXTIME(end_date)) AS end_date,
49 | DATE_FORMAT(payment_expiration,'YYYYww') AS exp_yweek,
50 | EXTRACT(DAY FROM payment_expiration) AS exp_day, --same as DAYOFMONTH(ts)
51 | EXTRACT(MONTH FROM payment_expiration) AS exp_month, --same as MONTH(ts)
52 | EXTRACT(YEAR FROM payment_expiration) AS exp_year --same as YEAR(ts)
53 | FROM subscriptions
54 | WHERE
55 | TIMESTAMPDIFF(DAY,CURRENT_TIMESTAMP,payment_expiration) < 30;
56 | ```
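
As a sketch of the exercise above (under the assumption that the 30-day window is measured from the current timestamp), the same condition can be expressed with `TIMESTAMPADD` by comparing `payment_expiration` against a shifted timestamp:

```sql
-- Roughly equivalent filter using TIMESTAMPADD instead of TIMESTAMPDIFF
SELECT id, payment_expiration
FROM subscriptions
WHERE payment_expiration < TIMESTAMPADD(DAY, 30, CURRENT_TIMESTAMP);
```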
57 |
58 | ## Example Output
59 |
60 | 
--------------------------------------------------------------------------------
/udfs/01/01_python_udfs.md:
--------------------------------------------------------------------------------
1 | # 01 Extending SQL with Python UDFs
2 |
3 | :bulb: This example will show how to extend Flink SQL with custom functions written in Python.
4 |
5 | Flink SQL provides a wide range of [built-in functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html) that cover most day-to-day SQL work. Sometimes, you need more flexibility to express custom business logic or transformations that aren't easily translatable to SQL: this can be achieved with [User-Defined Functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/udfs.html) (UDFs).
6 |
7 | In this example, you'll focus on [Python UDFs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/python_udfs.html) and implement a custom function (`to_fahr`) to convert temperature readings that are continuously generated for different EU and US cities. The Celsius->Fahrenheit conversion should only happen if the city associated with the reading is in the US.
8 |
9 | ## Scripts
10 |
11 | #### Python UDF
12 |
13 | The first step is to create a Python file with the UDF implementation (`python_udf.py`), using Flink's [Python Table API](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/intro_to_table_api.html). If this is new to you, there are examples on how to write [general](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/python_udfs.html) and [vectorized](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html) Python UDFs in the Flink documentation.
14 |
15 | ```python
16 | from pyflink.table import DataTypes
17 | from pyflink.table.udf import udf
18 |
19 | us_cities = {"Chicago","Portland","Seattle","New York"}
20 |
21 | @udf(input_types=[DataTypes.STRING(), DataTypes.FLOAT()],
22 | result_type=DataTypes.FLOAT())
23 | def to_fahr(city, temperature):
24 |
25 | if city in us_cities:
26 |
27 | fahr = ((temperature * 9.0 / 5.0) + 32.0)
28 |
29 | return fahr
30 | else:
31 | return temperature
32 | ```
33 |
34 | For detailed instructions on how to then make the Python file available as a UDF in the SQL Client, please refer to [this documentation page](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html#user-defined-functions).
35 |
36 | #### SQL
37 |
38 | The source table (`temperature_measurements`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
39 |
40 | ```sql
41 | --Register the Python UDF using the fully qualified
42 | --name of the function ([module name].[object name])
43 | CREATE FUNCTION to_fahr AS 'python_udf.to_fahr'
44 | LANGUAGE PYTHON;
45 |
46 |
47 | CREATE TABLE temperature_measurements (
48 | city STRING,
49 | temperature FLOAT,
50 | measurement_time TIMESTAMP(3),
51 | WATERMARK FOR measurement_time AS measurement_time - INTERVAL '15' SECONDS
52 | )
53 | WITH (
54 | 'connector' = 'faker',
55 | 'fields.temperature.expression' = '#{number.numberBetween ''0'',''42''}',
56 | 'fields.measurement_time.expression' = '#{date.past ''15'',''SECONDS''}',
57 | 'fields.city.expression' = '#{regexify ''(Copenhagen|Berlin|Chicago|Portland|Seattle|New York){1}''}'
58 | );
59 |
60 |
61 | --Use to_fahr() to convert temperatures in US cities from C to F
62 | SELECT city,
63 | temperature AS tmp,
64 | to_fahr(city,temperature) AS tmp_conv,
65 | measurement_time
66 | FROM temperature_measurements;
67 | ```
68 |
69 | ## Example Output
70 |
71 | 
--------------------------------------------------------------------------------
/foundations/06/06_order_by.md:
--------------------------------------------------------------------------------
1 | # 06 Sorting Tables
2 |
3 | :bulb: This example will show how you can sort a table, particularly unbounded tables.
4 |
5 | Flink SQL supports `ORDER BY`.
6 | Bounded tables can be sorted by any column, in descending or ascending order.
7 |
8 | To use `ORDER BY` on unbounded tables like `server_logs`, the primary sorting key needs to be a [time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#time-attributes) like `log_time`.
9 |
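As a quick sketch of this restriction (assuming the `server_logs` table defined in the script below):

```sql
-- Accepted on an unbounded table: the leading sort key is an ascending time attribute;
-- secondary sort keys can be arbitrary columns
SELECT * FROM server_logs ORDER BY log_time, size DESC;

-- Rejected by the streaming planner: the primary sort key is not a time attribute
-- SELECT * FROM server_logs ORDER BY size DESC;
```
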
10 | In the first example below, we sort the `server_logs` table by `log_time`.
11 | The second example is a bit more advanced:
12 | it sorts the number of logs per minute and browser by `window_time` (a time attribute) and `cnt_browser` (descending), so that the browser with the highest number of logs appears at the top of each window.
13 |
14 | ## Script
15 |
16 | ```sql
17 | CREATE TEMPORARY TABLE server_logs (
18 | client_ip STRING,
19 | client_identity STRING,
20 | userid STRING,
21 | user_agent STRING,
22 | log_time TIMESTAMP(3),
23 | request_line STRING,
24 | status_code STRING,
25 | size INT,
26 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS
27 | ) WITH (
28 | 'connector' = 'faker',
29 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
30 | 'fields.client_identity.expression' = '-',
31 | 'fields.userid.expression' = '-',
32 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}',
33 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
34 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
35 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
36 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
37 | );
38 |
39 | SELECT * FROM server_logs
40 | ORDER BY log_time;
41 | ```
42 |
43 | ## Example Output
44 |
45 | 
46 |
47 | ## Advanced Example
48 |
52 | ### Script
53 |
54 | ```sql
55 | CREATE TEMPORARY TABLE server_logs (
56 | client_ip STRING,
57 | client_identity STRING,
58 | userid STRING,
59 | user_agent STRING,
60 | log_time TIMESTAMP(3),
61 | request_line STRING,
62 | status_code STRING,
63 | size INT,
64 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS
65 | ) WITH (
66 | 'connector' = 'faker',
67 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
68 | 'fields.client_identity.expression' = '-',
69 | 'fields.userid.expression' = '-',
70 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}',
71 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
72 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
73 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
74 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
75 | );
76 | SELECT
77 | TUMBLE_ROWTIME(log_time, INTERVAL '1' MINUTE) AS window_time,
78 | REGEXP_EXTRACT(user_agent,'[^\/]+') AS browser,
79 | COUNT(*) AS cnt_browser
80 | FROM server_logs
81 | GROUP BY
82 | REGEXP_EXTRACT(user_agent,'[^\/]+'),
83 | TUMBLE(log_time, INTERVAL '1' MINUTE)
84 | ORDER BY
85 | window_time,
86 | cnt_browser DESC
87 | ```
88 |
89 | ### Example Output
90 |
91 | 
92 |
93 |
94 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/08/08_match_recognize.md:
--------------------------------------------------------------------------------
1 | # 08 Detecting patterns with MATCH_RECOGNIZE
2 |
3 | :bulb: This example will show how you can use Flink SQL to detect patterns in a stream of events with `MATCH_RECOGNIZE`.
4 |
5 | A common (but historically complex) task in SQL day-to-day work is to identify meaningful sequences of events in a data set, also known as Complex Event Processing (CEP). This becomes even more relevant when dealing with streaming data, as you want to react quickly to known patterns or changing trends to deliver up-to-date business insights. In Flink SQL, you can easily perform this kind of task using the standard SQL clause [`MATCH_RECOGNIZE`](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/match_recognize.html).
6 |
7 | ## Breaking down MATCH_RECOGNIZE
8 |
9 | In this example, you want to find users that downgraded their service subscription from one of the premium tiers (`type IN ('premium','platinum')`) to the basic tier.
10 |
11 | #### Input
12 |
13 | The input argument of `MATCH_RECOGNIZE` will be a row pattern table based on `subscriptions`. As a first step, logical partitioning and ordering must be applied to the input row pattern table to ensure that event processing is correct and deterministic:
14 |
15 | ```sql
16 | PARTITION BY user_id
17 | ORDER BY proc_time
18 | ```
19 |
20 | #### Output
21 |
22 | Row pattern columns are then defined in the `MEASURES` clause, which can be thought of as the `SELECT` of `MATCH_RECOGNIZE`. If you're interested in getting the type of premium subscription associated with the last event before the downgrade, you can fetch it using the [logical offset](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/match_recognize.html#logical-offsets) operator `LAST`. The downgrade date can be derived from the `start_date` of the first basic subscription event following any existing premium one(s). The `AFTER MATCH SKIP PAST LAST ROW` clause below specifies where matching resumes once a match is found: at the row after the last row of the current match.
23 |
24 | ```sql
25 | MEASURES
26 | LAST(PREMIUM.type) AS premium_type,
27 | AVG(TIMESTAMPDIFF(DAY,PREMIUM.start_date,PREMIUM.end_date)) AS premium_avg_duration,
28 | BASIC.start_date AS downgrade_date
29 | AFTER MATCH SKIP PAST LAST ROW
30 | ```
31 |
32 | #### Pattern Definition
33 |
34 | Patterns are specified in the `PATTERN` clause using row-pattern variables (i.e. event types) and regular expressions. These variables must also be associated with the matching conditions that events must meet to be included in the pattern, using the `DEFINE` clause. Here, you are interested in matching one or more premium subscription events (`PREMIUM+`) followed by a basic subscription event (`BASIC`):
35 |
36 | ```sql
37 | PATTERN (PREMIUM+ BASIC)
38 | DEFINE PREMIUM AS PREMIUM.type IN ('premium','platinum'),
39 | BASIC AS BASIC.type = 'basic');
40 | ```
41 |
42 | ## Script
43 |
44 | The source table (`subscriptions`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
45 |
46 | ```sql
47 | CREATE TABLE subscriptions (
48 | id STRING,
49 | user_id INT,
50 | type STRING,
51 | start_date TIMESTAMP(3),
52 | end_date TIMESTAMP(3),
53 | payment_expiration TIMESTAMP(3),
54 | proc_time AS PROCTIME()
55 | ) WITH (
56 | 'connector' = 'faker',
57 | 'fields.id.expression' = '#{Internet.uuid}',
58 | 'fields.user_id.expression' = '#{number.numberBetween ''1'',''50''}',
59 | 'fields.type.expression'= '#{regexify ''(basic|premium|platinum){1}''}',
60 | 'fields.start_date.expression' = '#{date.past ''30'',''DAYS''}',
61 | 'fields.end_date.expression' = '#{date.future ''15'',''DAYS''}',
62 | 'fields.payment_expiration.expression' = '#{date.future ''365'',''DAYS''}'
63 | );
64 |
65 | SELECT *
66 | FROM subscriptions
67 | MATCH_RECOGNIZE (PARTITION BY user_id
68 | ORDER BY proc_time
69 | MEASURES
70 | LAST(PREMIUM.type) AS premium_type,
71 | AVG(TIMESTAMPDIFF(DAY,PREMIUM.start_date,PREMIUM.end_date)) AS premium_avg_duration,
72 | BASIC.start_date AS downgrade_date
73 | AFTER MATCH SKIP PAST LAST ROW
74 | --Pattern: one or more 'premium' or 'platinum' subscription events (PREMIUM)
75 | --followed by a 'basic' subscription event (BASIC) for the same `user_id`
76 | PATTERN (PREMIUM+ BASIC)
77 | DEFINE PREMIUM AS PREMIUM.type IN ('premium','platinum'),
78 | BASIC AS BASIC.type = 'basic');
79 | ```
80 |
81 | ## Example Output
82 |
83 | 
84 |
--------------------------------------------------------------------------------
/foundations/08/08_statement_sets.md:
--------------------------------------------------------------------------------
1 | # 08 Writing Results into Multiple Tables
2 |
3 | :bulb: In this recipe, you will learn how to use [Statement Sets](https://docs.ververica.com/user_guide/sql_development/sql_scripts.html#sql-statements) to run multiple `INSERT INTO` statements in a single, optimized Flink Job.
4 |
5 | Many product requirements involve outputting the results of a streaming application to two or more sinks, such as [Apache Kafka](https://docs.ververica.com/user_guide/sql_development/connectors.html#apache-kafka) for real-time use cases, or a [Filesystem](https://docs.ververica.com/user_guide/sql_development/connectors.html#file-system) for offline ones.
6 | Other times, two queries are not the same but share some extensive intermediate operations.
7 |
8 | When working with server logs, the support team would like to see the number of status codes per browser every 5 minutes to gain real-time insight into the status of their web pages.
9 | Additionally, they would like the same information on an hourly basis made available as partitioned [Apache Parquet](https://docs.ververica.com/user_guide/sql_development/connectors.html#apache-parquet) files so they can perform historical analysis.
10 |
11 | We could quickly write two Flink SQL queries to solve both these requirements, but that would not be efficient.
12 | These queries have a lot of duplicated work, like reading the source logs Kafka topic and cleansing the data.
13 |
14 | Ververica Platform includes a feature called `STATEMENT SET`s, which allows multiplexing multiple `INSERT INTO` statements into a single query that is holistically optimized by Apache Flink and deployed as a single application.
15 |
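As a minimal, self-contained sketch of the syntax (using hypothetical `datagen`/`blackhole` tables purely for illustration, not the server logs pipeline of this recipe), every `INSERT INTO` statement placed between `BEGIN STATEMENT SET;` and `END;` becomes part of a single job:

```sql
-- Minimal statement set sketch (illustrative tables only)
CREATE TEMPORARY TABLE source_numbers (n BIGINT) WITH ('connector' = 'datagen');
CREATE TEMPORARY TABLE sink_a (n BIGINT) WITH ('connector' = 'blackhole');
CREATE TEMPORARY TABLE sink_b (n BIGINT) WITH ('connector' = 'blackhole');

BEGIN STATEMENT SET;
-- the planner can reuse the shared scan of source_numbers across both INSERTs
INSERT INTO sink_a SELECT n FROM source_numbers;
INSERT INTO sink_b SELECT n FROM source_numbers WHERE n % 2 = 0;
END;
```

The full recipe below applies the same pattern to the `server_logs` use case:
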
16 | ```sql
17 | CREATE TEMPORARY TABLE server_logs (
18 | client_ip STRING,
19 | client_identity STRING,
20 | userid STRING,
21 | user_agent STRING,
22 | log_time TIMESTAMP(3),
23 | request_line STRING,
24 | status_code STRING,
25 | size INT,
26 | WATERMARK FOR log_time AS log_time - INTERVAL '30' SECONDS
27 | ) WITH (
28 | 'connector' = 'faker',
29 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
30 | 'fields.client_identity.expression' = '-',
31 | 'fields.userid.expression' = '-',
32 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}',
33 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
34 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
35 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
36 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
37 | );
38 |
39 | CREATE TEMPORARY TABLE realtime_aggregations (
40 | `browser` STRING,
41 | `status_code` STRING,
42 | `end_time` TIMESTAMP(3),
43 | `requests` BIGINT NOT NULL
44 | ) WITH (
45 | 'connector' = 'kafka',
46 | 'topic' = 'browser-status-codes',
47 | 'properties.bootstrap.servers' = 'localhost:9092',
48 |   'properties.group.id' = 'browser-counts',
49 | 'format' = 'avro'
50 | );
51 |
52 |
53 | CREATE TEMPORARY TABLE offline_datawarehouse (
54 | `browser` STRING,
55 | `status_code` STRING,
56 | `dt` STRING,
57 | `hour` STRING,
58 | `requests` BIGINT NOT NULL
59 | ) PARTITIONED BY (`dt`, `hour`) WITH (
60 | 'connector' = 'filesystem',
61 | 'path' = 's3://my-bucket/browser-into',
62 | 'sink.partition-commit.trigger' = 'partition-time',
63 | 'format' = 'parquet'
64 | );
65 |
66 | -- This is a shared view that will be used by both
67 | -- insert into statements
68 | CREATE TEMPORARY VIEW browsers AS
69 | SELECT
70 | REGEXP_EXTRACT(user_agent,'[^\/]+') AS browser,
71 | status_code,
72 | log_time
73 | FROM server_logs;
74 |
75 | BEGIN STATEMENT SET;
76 | INSERT INTO realtime_aggregations
77 | SELECT
78 | browser,
79 | status_code,
80 | TUMBLE_ROWTIME(log_time, INTERVAL '5' MINUTE) AS end_time,
81 | COUNT(*) requests
82 | FROM browsers
83 | GROUP BY
84 | browser,
85 | status_code,
86 | TUMBLE(log_time, INTERVAL '5' MINUTE);
87 | INSERT INTO offline_datawarehouse
88 | SELECT
89 | browser,
90 | status_code,
91 | DATE_FORMAT(TUMBLE_ROWTIME(log_time, INTERVAL '1' HOUR), 'yyyy-MM-dd') AS `dt`,
92 | DATE_FORMAT(TUMBLE_ROWTIME(log_time, INTERVAL '1' HOUR), 'HH') AS `hour`,
93 | COUNT(*) requests
94 | FROM browsers
95 | GROUP BY
96 | browser,
97 | status_code,
98 | TUMBLE(log_time, INTERVAL '1' HOUR);
99 | END;
100 | ```
101 |
102 | Looking at the deployed Job Graph, we can see that Flink SQL performs the shared computation only once, resulting in a more cost- and resource-efficient execution of our query!
103 |
104 | 
105 |
--------------------------------------------------------------------------------
/joins/03/03_kafka_join.md:
--------------------------------------------------------------------------------
1 | # 03 Temporal Table Join between a non-compacted and compacted Kafka Topic
2 |
3 | :bulb: In this recipe, you will see how to correctly enrich records from one Kafka topic with the corresponding records of another Kafka topic when the order of events matters.
4 |
5 | Temporal table joins take an arbitrary table (left input/probe side) and correlate each row to the corresponding row’s relevant version in a versioned table (right input/build side).
6 | Flink uses the SQL syntax of ``FOR SYSTEM_TIME AS OF`` to perform this operation.
7 |
8 | In this recipe, we want to join each transaction (`transactions`) to its correct currency rate (`currency_rates`, a versioned table) **as of the time when the transaction happened**.
9 | A similar example would be to join each order with the customer details as of the time when the order happened.
10 | This is exactly what an event-time temporal table join does.
11 | A temporal table join in Flink SQL provides correct, deterministic results in the presence of out-of-orderness and arbitrary time skew between the two tables. For example, given the `BZD` rate updates shown in the data generator output further below (1.6545 at 22:22:03 and 3.616 at 22:22:10), a transaction with a `transaction_time` of 22:22:07 would be joined against the 1.6545 version, since that is the latest rate whose `rate_time` does not exceed the transaction time.
12 |
13 | Both the `transactions` and `currency_rates` tables are backed by Kafka topics, but in the case of rates this topic is compacted (i.e. only the most recent messages for a given key are kept as updated rates flow in).
14 | Records in `transactions` are interpreted as inserts only, and so the table is backed by the [standard Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/kafka.html) (`connector` = `kafka`); while the records in `currency_rates` need to be interpreted as upserts based on a primary key, which requires the [Upsert Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html) (`connector` = `upsert-kafka`).
15 |
16 | ## Script
17 |
18 | ```sql
19 | CREATE TEMPORARY TABLE currency_rates (
20 | `currency_code` STRING,
21 | `eur_rate` DECIMAL(6,4),
22 | `rate_time` TIMESTAMP(3),
23 | WATERMARK FOR `rate_time` AS rate_time - INTERVAL '15' SECONDS,
24 | PRIMARY KEY (currency_code) NOT ENFORCED
25 | ) WITH (
26 | 'connector' = 'upsert-kafka',
27 | 'topic' = 'currency_rates',
28 | 'properties.bootstrap.servers' = 'localhost:9092',
29 | 'key.format' = 'raw',
30 | 'value.format' = 'json'
31 | );
32 |
33 | CREATE TEMPORARY TABLE transactions (
34 | `id` STRING,
35 | `currency_code` STRING,
36 | `total` DECIMAL(10,2),
37 | `transaction_time` TIMESTAMP(3),
38 | WATERMARK FOR `transaction_time` AS transaction_time - INTERVAL '30' SECONDS
39 | ) WITH (
40 | 'connector' = 'kafka',
41 | 'topic' = 'transactions',
42 | 'properties.bootstrap.servers' = 'localhost:9092',
43 | 'key.format' = 'raw',
44 | 'key.fields' = 'id',
45 | 'value.format' = 'json',
46 | 'value.fields-include' = 'ALL'
47 | );
48 |
49 | SELECT
50 | t.id,
51 | t.total * c.eur_rate AS total_eur,
52 | t.total,
53 | c.currency_code,
54 | t.transaction_time
55 | FROM transactions t
56 | JOIN currency_rates FOR SYSTEM_TIME AS OF t.transaction_time AS c
57 | ON t.currency_code = c.currency_code;
58 | ```
59 |
60 | ## Example Output
61 |
62 | 
63 |
64 | ## Data Generators
65 |
69 | The two topics are populated using a Flink SQL job, too.
70 | We use the [`faker` connector](https://github.com/knaufk/flink-faker) to generate rows in memory based on Java Faker expressions and write those to the respective Kafka topics.
71 |
72 | ### ``currency_rates`` Topic
73 |
74 | ### Script
75 |
76 | ```sql
77 | CREATE TEMPORARY TABLE currency_rates_faker
78 | WITH (
79 | 'connector' = 'faker',
80 | 'fields.currency_code.expression' = '#{Currency.code}',
81 | 'fields.eur_rate.expression' = '#{Number.randomDouble ''4'',''0'',''10''}',
82 | 'fields.rate_time.expression' = '#{date.past ''15'',''SECONDS''}',
83 | 'rows-per-second' = '100'
84 | ) LIKE currency_rates (EXCLUDING OPTIONS);
85 |
86 | INSERT INTO currency_rates SELECT * FROM currency_rates_faker;
87 | ```
88 | #### Kafka Topic
89 |
90 | ```shell script
91 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic currency_rates --property print.key=true --property key.separator=" - "
92 | HTG - {"currency_code":"HTG","eur_rate":0.0136,"rate_time":"2020-12-16 22:22:02"}
93 | BZD - {"currency_code":"BZD","eur_rate":1.6545,"rate_time":"2020-12-16 22:22:03"}
94 | BZD - {"currency_code":"BZD","eur_rate":3.616,"rate_time":"2020-12-16 22:22:10"}
95 | BHD - {"currency_code":"BHD","eur_rate":4.5308,"rate_time":"2020-12-16 22:22:05"}
96 | KHR - {"currency_code":"KHR","eur_rate":1.335,"rate_time":"2020-12-16 22:22:06"}
97 | ```
98 |
99 | ### ``transactions`` Topic
100 |
101 | #### Script
102 |
103 | ```sql
104 | CREATE TEMPORARY TABLE transactions_faker
105 | WITH (
106 | 'connector' = 'faker',
107 | 'fields.id.expression' = '#{Internet.UUID}',
108 | 'fields.currency_code.expression' = '#{Currency.code}',
109 | 'fields.total.expression' = '#{Number.randomDouble ''2'',''10'',''1000''}',
110 | 'fields.transaction_time.expression' = '#{date.past ''30'',''SECONDS''}',
111 | 'rows-per-second' = '100'
112 | ) LIKE transactions (EXCLUDING OPTIONS);
113 |
114 | INSERT INTO transactions SELECT * FROM transactions_faker;
115 | ```
116 |
117 | #### Kafka Topic
118 |
119 | ```shell script
120 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic transactions --property print.key=true --property key.separator=" - "
121 | e102e91f-47b9-434e-86e1-34fb1196d91d - {"id":"e102e91f-47b9-434e-86e1-34fb1196d91d","currency_code":"SGD","total":494.07,"transaction_time":"2020-12-16 22:18:46"}
122 | bf028363-5ee4-4a5a-9068-b08392d59f0b - {"id":"bf028363-5ee4-4a5a-9068-b08392d59f0b","currency_code":"EEK","total":906.8,"transaction_time":"2020-12-16 22:18:46"}
123 | e22374b5-82da-4c6d-b4c6-f27a818a58ab - {"id":"e22374b5-82da-4c6d-b4c6-f27a818a58ab","currency_code":"GYD","total":80.66,"transaction_time":"2020-12-16 22:19:02"}
124 | 81b2ce89-26c2-4df3-b12a-8ca921902ac4 - {"id":"81b2ce89-26c2-4df3-b12a-8ca921902ac4","currency_code":"EGP","total":521.98,"transaction_time":"2020-12-16 22:18:57"}
125 | 53c4fd3f-af6e-41d3-a677-536f4c86e010 - {"id":"53c4fd3f-af6e-41d3-a677-536f4c86e010","currency_code":"UYU","total":936.26,"transaction_time":"2020-12-16 22:18:59"}
126 | ```
127 |
128 |
129 |
--------------------------------------------------------------------------------
/aggregations-and-analytics/07/07_chained_windows.md:
--------------------------------------------------------------------------------
1 | # 07 Chained (Event) Time Windows
2 |
3 | :bulb: This example will show how to efficiently aggregate time series data on two different levels of granularity.
4 |
5 | The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.
6 |
7 | Based on our `server_logs` table, we would like to compute the average request size over one-minute **as well as five-minute (event) windows**.
8 | For this, you could run two queries, similar to the one in [Aggregating Time Series Data](../01/01_group_by_window.md).
9 | At the end of the page are the script and the resulting JobGraph for this approach.
10 |
11 | In the main part, we will follow a slightly more efficient approach that chains the two aggregations: the one-minute aggregation output serves as the five-minute aggregation input.
12 |
13 | We then use a [Statement Set](../../foundations/08/08_statement_sets.md) to write out the two result tables.
14 | To keep this example self-contained, we use two tables of type `blackhole` instead of `kafka`, `filesystem`, or any other [connectors](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/).
15 |
16 | ## Script
17 |
18 | ```sql
19 | CREATE TEMPORARY TABLE server_logs (
20 | log_time TIMESTAMP(3),
21 | client_ip STRING,
22 | client_identity STRING,
23 | userid STRING,
24 | request_line STRING,
25 | status_code STRING,
26 | size INT,
27 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS
28 | ) WITH (
29 | 'connector' = 'faker',
30 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
31 | 'fields.client_identity.expression' = '-',
32 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
33 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
34 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
35 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
36 | );
37 |
38 | CREATE TEMPORARY TABLE avg_request_size_1m (
39 | window_start TIMESTAMP(3),
40 | window_end TIMESTAMP(3),
41 | avg_size BIGINT
42 | )
43 | WITH (
44 | 'connector' = 'blackhole'
45 | );
46 |
47 | CREATE TEMPORARY TABLE avg_request_size_5m (
48 | window_start TIMESTAMP(3),
49 | window_end TIMESTAMP(3),
50 | avg_size BIGINT
51 | )
52 | WITH (
53 | 'connector' = 'blackhole'
54 | );
55 |
56 | CREATE TEMPORARY VIEW server_logs_window_1m AS
57 | SELECT
58 | TUMBLE_START(log_time, INTERVAL '1' MINUTE) AS window_start,
59 | TUMBLE_ROWTIME(log_time, INTERVAL '1' MINUTE) AS window_end,
60 | SUM(size) AS total_size,
61 | COUNT(*) AS num_requests
62 | FROM server_logs
63 | GROUP BY
64 | TUMBLE(log_time, INTERVAL '1' MINUTE);
65 |
66 |
67 | CREATE TEMPORARY VIEW server_logs_window_5m AS
68 | SELECT
69 | TUMBLE_START(window_end, INTERVAL '5' MINUTE) AS window_start,
70 | TUMBLE_ROWTIME(window_end, INTERVAL '5' MINUTE) AS window_end,
71 | SUM(total_size) AS total_size,
72 | SUM(num_requests) AS num_requests
73 | FROM server_logs_window_1m
74 | GROUP BY
75 | TUMBLE(window_end, INTERVAL '5' MINUTE);
76 |
77 | BEGIN STATEMENT SET;
78 |
79 | INSERT INTO avg_request_size_1m SELECT
80 | window_start,
81 | window_end,
82 | total_size/num_requests AS avg_size
83 | FROM server_logs_window_1m;
84 |
85 | INSERT INTO avg_request_size_5m SELECT
86 | window_start,
87 | window_end,
88 | total_size/num_requests AS avg_size
89 | FROM server_logs_window_5m;
90 |
91 | END;
92 | ```
93 |
94 | ## Example Output
95 |
96 | ### JobGraph
97 |
98 | 
99 |
100 | ### Result 1 Minute Aggregations
101 |
102 | 
103 |
104 | ## Non-Chained Windows
105 |
109 | ### Script
110 |
111 | ```sql
112 | CREATE TEMPORARY TABLE server_logs (
113 | log_time TIMESTAMP(3),
114 | client_ip STRING,
115 | client_identity STRING,
116 | userid STRING,
117 | request_line STRING,
118 | status_code STRING,
119 | size INT,
120 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS
121 | ) WITH (
122 | 'connector' = 'faker',
123 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
124 | 'fields.client_identity.expression' = '-',
125 | 'fields.userid.expression' = '-',
126 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
127 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
128 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
129 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
130 | );
131 |
132 | CREATE TEMPORARY TABLE avg_request_size_1m (
133 | window_start TIMESTAMP(3),
134 | window_end TIMESTAMP(3),
135 | avg_size BIGINT
136 | )
137 | WITH (
138 | 'connector' = 'blackhole'
139 | );
140 |
141 | CREATE TEMPORARY TABLE avg_request_size_5m (
142 | window_start TIMESTAMP(3),
143 | window_end TIMESTAMP(3),
144 | avg_size BIGINT
145 | )
146 | WITH (
147 | 'connector' = 'blackhole'
148 | );
149 |
150 | CREATE TEMPORARY VIEW server_logs_window_1m AS
151 | SELECT
152 | TUMBLE_START(log_time, INTERVAL '1' MINUTE) AS window_start,
153 | TUMBLE_ROWTIME(log_time, INTERVAL '1' MINUTE) AS window_end,
154 | SUM(size) AS total_size,
155 | COUNT(*) AS num_requests
156 | FROM server_logs
157 | GROUP BY
158 | TUMBLE(log_time, INTERVAL '1' MINUTE);
159 |
160 |
161 | CREATE TEMPORARY VIEW server_logs_window_5m AS
162 | SELECT
163 | TUMBLE_START(log_time, INTERVAL '5' MINUTE) AS window_start,
164 | TUMBLE_ROWTIME(log_time, INTERVAL '5' MINUTE) AS window_end,
165 | SUM(size) AS total_size,
166 | COUNT(*) AS num_requests
167 | FROM server_logs
168 | GROUP BY
169 | TUMBLE(log_time, INTERVAL '5' MINUTE);
170 |
171 | BEGIN STATEMENT SET;
172 |
173 | INSERT INTO avg_request_size_1m SELECT
174 | window_start,
175 | window_end,
176 | total_size/num_requests AS avg_size
177 | FROM server_logs_window_1m;
178 |
179 | INSERT INTO avg_request_size_5m SELECT
180 | window_start,
181 | window_end,
182 | total_size/num_requests AS avg_size
183 | FROM server_logs_window_5m;
184 |
185 | END;
186 | ```
187 |
188 | ### Example Output
189 |
190 | #### JobGraph
191 |
192 | 
193 |
194 |
195 |
--------------------------------------------------------------------------------
/joins/05/05_star_schema.md:
--------------------------------------------------------------------------------
1 | # 05 Real Time Star Schema Denormalization (N-Way Join)
2 |
3 | :bulb: In this recipe, we will de-normalize a simple star schema with an n-way temporal table join.
4 |
5 | [Star schemas](https://en.wikipedia.org/wiki/Star_schema) are a popular way of normalizing data within a data warehouse.
6 | At the center of a star schema is a **fact table** whose rows contain metrics, measurements, and other facts about the world.
7 | Surrounding the fact table are one or more **dimension tables**, which contain metadata useful for enriching facts when computing queries.
8 | You are running a small data warehouse for a railroad company which consists of a fact table (`train_activity`) and three dimension tables (`stations`, `booking_channels`, and `passengers`).
9 | All inserts to the fact table, and all updates to the dimension tables, are mirrored to Apache Kafka.
10 | Records in the fact table are interpreted as inserts only, and so the table is backed by the [standard Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/kafka.html) (`connector` = `kafka`).
11 | In contrast, the records in the dimensional tables are upserts based on a primary key, which requires the [Upsert Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html) (`connector` = `upsert-kafka`).
12 |
13 | With Flink SQL you can easily join all dimensions to the fact table using a 5-way temporal table join.
14 | Temporal table joins take an arbitrary table (left input/probe side) and correlate each row to the corresponding row’s relevant version in a versioned table (right input/build side).
15 | Flink uses the SQL syntax of ``FOR SYSTEM_TIME AS OF`` to perform this operation.
16 | Using a temporal table join leads to consistent, reproducible results when joining a fact table with the (more slowly) changing dimension tables.
17 | Every event (row in the fact table) is joined to its corresponding value of each dimension based on when the event occurred in the real world.
18 |
19 | ## Script
20 |
21 | ```sql
22 | CREATE TEMPORARY TABLE passengers (
23 | passenger_key STRING,
24 | first_name STRING,
25 | last_name STRING,
26 | update_time TIMESTAMP(3),
27 | WATERMARK FOR update_time AS update_time - INTERVAL '10' SECONDS,
28 | PRIMARY KEY (passenger_key) NOT ENFORCED
29 | ) WITH (
30 | 'connector' = 'upsert-kafka',
31 | 'topic' = 'passengers',
32 | 'properties.bootstrap.servers' = 'localhost:9092',
33 | 'key.format' = 'raw',
34 | 'value.format' = 'json'
35 | );
36 |
37 | CREATE TEMPORARY TABLE stations (
38 | station_key STRING,
39 | update_time TIMESTAMP(3),
40 | city STRING,
41 | WATERMARK FOR update_time AS update_time - INTERVAL '10' SECONDS,
42 | PRIMARY KEY (station_key) NOT ENFORCED
43 | ) WITH (
44 | 'connector' = 'upsert-kafka',
45 | 'topic' = 'stations',
46 | 'properties.bootstrap.servers' = 'localhost:9092',
47 | 'key.format' = 'raw',
48 | 'value.format' = 'json'
49 | );
50 |
51 | CREATE TEMPORARY TABLE booking_channels (
52 | booking_channel_key STRING,
53 | update_time TIMESTAMP(3),
54 | channel STRING,
55 | WATERMARK FOR update_time AS update_time - INTERVAL '10' SECONDS,
56 | PRIMARY KEY (booking_channel_key) NOT ENFORCED
57 | ) WITH (
58 | 'connector' = 'upsert-kafka',
59 | 'topic' = 'booking_channels',
60 | 'properties.bootstrap.servers' = 'localhost:9092',
61 | 'key.format' = 'raw',
62 | 'value.format' = 'json'
63 | );
64 |
65 | CREATE TEMPORARY TABLE train_activities (
66 | scheduled_departure_time TIMESTAMP(3),
67 | actual_departure_date TIMESTAMP(3),
68 | passenger_key STRING,
69 | origin_station_key STRING,
70 | destination_station_key STRING,
71 | booking_channel_key STRING,
72 | WATERMARK FOR actual_departure_date AS actual_departure_date - INTERVAL '10' SECONDS
73 | ) WITH (
74 | 'connector' = 'kafka',
75 | 'topic' = 'train_activities',
76 | 'properties.bootstrap.servers' = 'localhost:9092',
77 | 'value.format' = 'json',
78 | 'value.fields-include' = 'ALL'
79 | );
80 |
81 | SELECT
82 | t.actual_departure_date,
83 | p.first_name,
84 | p.last_name,
85 | b.channel,
86 | os.city AS origin_station,
87 | ds.city AS destination_station
88 | FROM train_activities t
89 | LEFT JOIN booking_channels FOR SYSTEM_TIME AS OF t.actual_departure_date AS b
90 |   ON t.booking_channel_key = b.booking_channel_key
91 | LEFT JOIN passengers FOR SYSTEM_TIME AS OF t.actual_departure_date AS p
92 | ON t.passenger_key = p.passenger_key
93 | LEFT JOIN stations FOR SYSTEM_TIME AS OF t.actual_departure_date AS os
94 | ON t.origin_station_key = os.station_key
95 | LEFT JOIN stations FOR SYSTEM_TIME AS OF t.actual_departure_date AS ds
96 |   ON t.destination_station_key = ds.station_key;
97 | ```
98 |
99 | ## Example Output
100 |
101 | ### SQL Client
102 |
103 | 
104 |
105 | ### JobGraph
106 |
107 | 
108 |
109 | ## Data Generators
110 |
114 | The four topics are populated with Flink SQL jobs, too.
115 | We use the [`faker` connector](https://github.com/knaufk/flink-faker) to generate rows in memory based on Java Faker expressions and write those to the respective Kafka topics.
116 |
117 | ### ``train_activities`` Topic
118 |
119 | ### Script
120 |
121 | ```sql
122 | CREATE TEMPORARY TABLE train_activities_faker
123 | WITH (
124 | 'connector' = 'faker',
125 | 'fields.scheduled_departure_time.expression' = '#{date.past ''10'',''0'',''SECONDS''}',
126 | 'fields.actual_departure_date.expression' = '#{date.past ''10'',''5'',''SECONDS''}',
127 | 'fields.passenger_key.expression' = '#{number.numberBetween ''0'',''10000000''}',
128 | 'fields.origin_station_key.expression' = '#{number.numberBetween ''0'',''1000''}',
129 | 'fields.destination_station_key.expression' = '#{number.numberBetween ''0'',''1000''}',
130 | 'fields.booking_channel_key.expression' = '#{number.numberBetween ''0'',''7''}',
131 | 'rows-per-second' = '1000'
132 | ) LIKE train_activities (EXCLUDING OPTIONS);
133 |
134 | INSERT INTO train_activities SELECT * FROM train_activities_faker;
135 | ```
136 | #### Kafka Topic
137 |
138 | ```shell script
139 | ➜  bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic train_activities --property print.key=true --property key.separator=" - "
140 | null - {"scheduled_departure_time":"2020-12-19 13:52:37","actual_departure_date":"2020-12-19 13:52:16","passenger_key":7014937,"origin_station_key":577,"destination_station_key":862,"booking_channel_key":2}
141 | null - {"scheduled_departure_time":"2020-12-19 13:52:38","actual_departure_date":"2020-12-19 13:52:23","passenger_key":2244807,"origin_station_key":735,"destination_station_key":739,"booking_channel_key":2}
142 | null - {"scheduled_departure_time":"2020-12-19 13:52:46","actual_departure_date":"2020-12-19 13:52:18","passenger_key":2605313,"origin_station_key":216,"destination_station_key":453,"booking_channel_key":3}
143 | null - {"scheduled_departure_time":"2020-12-19 13:53:13","actual_departure_date":"2020-12-19 13:52:19","passenger_key":7111654,"origin_station_key":234,"destination_station_key":833,"booking_channel_key":5}
144 | null - {"scheduled_departure_time":"2020-12-19 13:52:22","actual_departure_date":"2020-12-19 13:52:17","passenger_key":2847474,"origin_station_key":763,"destination_station_key":206,"booking_channel_key":3}
145 | ```
146 |
147 | ### ``passengers`` Topic
148 |
149 | #### Script
150 |
151 | ```sql
152 | CREATE TEMPORARY TABLE passengers_faker
153 | WITH (
154 | 'connector' = 'faker',
155 | 'fields.passenger_key.expression' = '#{number.numberBetween ''0'',''10000000''}',
156 | 'fields.update_time.expression' = '#{date.past ''10'',''5'',''SECONDS''}',
157 | 'fields.first_name.expression' = '#{Name.firstName}',
158 | 'fields.last_name.expression' = '#{Name.lastName}',
159 | 'rows-per-second' = '1000'
160 | ) LIKE passengers (EXCLUDING OPTIONS);
161 |
162 | INSERT INTO passengers SELECT * FROM passengers_faker;
163 | ```
164 |
165 | #### Kafka Topic
166 |
167 | ```shell script
168 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic passengers --property print.key=true --property key.separator=" - "
169 | 749049 - {"passenger_key":"749049","first_name":"Booker","last_name":"Hackett","update_time":"2020-12-19 14:02:32"}
170 | 7065702 - {"passenger_key":"7065702","first_name":"Jeramy","last_name":"Breitenberg","update_time":"2020-12-19 14:02:38"}
171 | 3690329 - {"passenger_key":"3690329","first_name":"Quiana","last_name":"Macejkovic","update_time":"2020-12-19 14:02:27"}
172 | 1212728 - {"passenger_key":"1212728","first_name":"Lawerence","last_name":"Simonis","update_time":"2020-12-19 14:02:27"}
173 | 6993699 - {"passenger_key":"6993699","first_name":"Ardelle","last_name":"Frami","update_time":"2020-12-19 14:02:19"}
174 | ```
175 |
176 | ### ``stations`` Topic
177 |
178 | #### Script
179 |
180 | ```sql
181 | CREATE TEMPORARY TABLE stations_faker
182 | WITH (
183 | 'connector' = 'faker',
184 | 'fields.station_key.expression' = '#{number.numberBetween ''0'',''1000''}',
185 | 'fields.city.expression' = '#{Address.city}',
186 | 'fields.update_time.expression' = '#{date.past ''10'',''5'',''SECONDS''}',
187 | 'rows-per-second' = '100'
188 | ) LIKE stations (EXCLUDING OPTIONS);
189 |
190 | INSERT INTO stations SELECT * FROM stations_faker;
191 | ```
192 |
193 | #### Kafka Topic
194 |
195 | ```shell script
196 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic stations --property print.key=true --property key.separator=" - "
197 | 80 - {"station_key":"80","update_time":"2020-12-19 13:59:20","city":"Harlandport"}
198 | 33 - {"station_key":"33","update_time":"2020-12-19 13:59:12","city":"North Georgine"}
199 | 369 - {"station_key":"369","update_time":"2020-12-19 13:59:12","city":"Tillmanhaven"}
200 | 580 - {"station_key":"580","update_time":"2020-12-19 13:59:12","city":"West Marianabury"}
201 | 616 - {"station_key":"616","update_time":"2020-12-19 13:59:09","city":"West Sandytown"}
202 | ```
203 |
204 | ### ``booking_channels`` Topic
205 |
206 | #### Script
207 |
208 | ```sql
209 | CREATE TEMPORARY TABLE booking_channels_faker
210 | WITH (
211 | 'connector' = 'faker',
212 | 'fields.booking_channel_key.expression' = '#{number.numberBetween ''0'',''7''}',
213 | 'fields.channel.expression' = '#{regexify ''(bahn\.de|station|retailer|app|lidl|hotline|joyn){1}''}',
214 | 'fields.update_time.expression' = '#{date.past ''10'',''5'',''SECONDS''}',
215 | 'rows-per-second' = '100'
216 | ) LIKE booking_channels (EXCLUDING OPTIONS);
217 |
218 | INSERT INTO booking_channels SELECT * FROM booking_channels_faker;
219 | ```
220 |
221 | #### Kafka Topic
222 |
223 | ```shell script
224 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic booking_channels --property print.key=true --property key.separator=" - "
225 | 1 - {"booking_channel_key":"1","update_time":"2020-12-19 13:57:05","channel":"joyn"}
226 | 0 - {"booking_channel_key":"0","update_time":"2020-12-19 13:57:17","channel":"station"}
227 | 4 - {"booking_channel_key":"4","update_time":"2020-12-19 13:57:15","channel":"joyn"}
228 | 2 - {"booking_channel_key":"2","update_time":"2020-12-19 13:57:02","channel":"app"}
229 | 1 - {"booking_channel_key":"1","update_time":"2020-12-19 13:57:06","channel":"retailer"}
230 |
231 | ```
232 |
233 |
234 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright 2020 Ververica GmbH
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
--------------------------------------------------------------------------------