├── udfs └── 01 │ ├── python_udf.py │ └── 01_python_udfs.md ├── foundations ├── 01 │ └── 01_create_table.md ├── 04 │ └── 04_where.md ├── 07 │ └── 07_views.md ├── 05 │ └── 05_group_by.md ├── 03 │ └── 03_temporary_table.md ├── 02 │ └── 02_insert_into.md ├── 06 │ └── 06_order_by.md └── 08 │ └── 08_statement_sets.md ├── aggregations-and-analytics ├── 06 │ └── 06_dedup.md ├── 01 │ └── 01_group_by_window.md ├── 05 │ └── 05_top_n.md ├── 04 │ └── 04_over.md ├── 02 │ └── 02_watermarks.md ├── 03 │ └── 03_group_by_session_window.md ├── 08 │ └── 08_match_recognize.md └── 07 │ └── 07_chained_windows.md ├── joins ├── 01 │ └── 01_regular_joins.md ├── 02 │ └── 02_interval_joins.md ├── 06 │ └── 06_lateral_join.md ├── 04 │ └── 04_lookup_joins.md ├── 03 │ └── 03_kafka_join.md └── 05 │ └── 05_star_schema.md ├── README.md ├── other-builtin-functions ├── 02 │ └── 02_union-all.md └── 01 │ └── 01_date_time.md └── LICENSE /udfs/01/python_udf.py: -------------------------------------------------------------------------------- 1 | from pyflink.table import DataTypes 2 | from pyflink.table.udf import udf 3 | 4 | us_cities = {"Chicago","Portland","Seattle","New York"} 5 | 6 | @udf(input_types=[DataTypes.STRING(), DataTypes.FLOAT()], 7 | result_type=DataTypes.FLOAT()) 8 | def to_fahr(city, temperature): 9 | 10 | if city in us_cities: 11 | 12 | fahr = ((temperature * 9.0 / 5.0) + 32.0) 13 | 14 | return fahr 15 | else: 16 | return temperature 17 | -------------------------------------------------------------------------------- /foundations/01/01_create_table.md: -------------------------------------------------------------------------------- 1 | # 01 Creating Tables 2 | 3 | :bulb: This example will show how to create a table using SQL DDL. 4 | 5 | Flink SQL operates against logical tables, just like a traditional database. 6 | However, it does not maintain tables internally but always operates against external systems. 7 | 8 | Table definitions are in two parts; the logical schema and connector configuration. The logical schema defines the columns and types in the table and is what queries operate against. 9 | The connector configuration is contained in the `WITH` clause and defines the physical system that backs this table. 10 | This example uses the `datagen` connector which generates rows in memory and is convenient for testing queries. 11 | 12 | You can test the table is properly created by running a simple `SELECT` statement. 13 | In Ververica Platform you will see the results printed to the UI in the query preview. 14 | 15 | ## Script 16 | 17 | ```sql 18 | 19 | CREATE TABLE orders ( 20 | order_uid BIGINT, 21 | product_id BIGINT, 22 | price DECIMAL(32, 2), 23 | order_time TIMESTAMP(3) 24 | ) WITH ( 25 | 'connector' = 'datagen' 26 | ); 27 | 28 | SELECT * FROM orders; 29 | ``` 30 | 31 | ## Example Output 32 | 33 | ![01_create_table](https://user-images.githubusercontent.com/23521087/105504017-913eee80-5cc7-11eb-868c-7b78b1b95b71.png) 34 | -------------------------------------------------------------------------------- /foundations/04/04_where.md: -------------------------------------------------------------------------------- 1 | # 04 Filtering Data 2 | 3 | :bulb: This example will show how to filter server logs in real-time using a standard `WHERE` clause. 4 | 5 | The table it uses, `server_logs`, is backed by the [`faker` connector](https://github.com/knaufk/flink-faker) which continuously generates rows in memory based on Java Faker expressions and is convenient for testing queries. 
6 | As such, it is an alternative to the built-in `datagen` connector used for example in [the first recipe](../01/01_create_table.md). 7 | 8 | You can continuously filter these logs for those requests that experience authx issues with a simple `SELECT` statement with a `WHERE` clause filtering on the auth related HTTP status codes. 9 | In Ververica Platform you will see the results printed to the UI in the query preview. 10 | 11 | ## Script 12 | 13 | ```sql 14 | CREATE TABLE server_logs ( 15 | client_ip STRING, 16 | client_identity STRING, 17 | userid STRING, 18 | log_time TIMESTAMP(3), 19 | request_line STRING, 20 | status_code STRING, 21 | size INT 22 | ) WITH ( 23 | 'connector' = 'faker', 24 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 25 | 'fields.client_identity.expression' = '-', 26 | 'fields.userid.expression' = '-', 27 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 28 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 29 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 30 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 31 | ); 32 | 33 | SELECT 34 | log_time, 35 | request_line, 36 | status_code 37 | FROM server_logs 38 | WHERE 39 | status_code IN ('403', '401') 40 | ``` 41 | 42 | ## Example Output 43 | 44 | ![04_where](https://user-images.githubusercontent.com/23521087/105504095-a6b41880-5cc7-11eb-9606-978e86add144.png) 45 | -------------------------------------------------------------------------------- /aggregations-and-analytics/06/06_dedup.md: -------------------------------------------------------------------------------- 1 | # 06 Deduplication 2 | 3 | :bulb: This example will show how you can identify and filter out duplicates in a stream of events. 4 | 5 | There are different ways that duplicate events can end up in your data sources, from human error to application bugs. Regardless of the origin, unclean data can have a real impact in the quality (and correctness) of your results. Suppose that your order system occasionally generates duplicate events with the same `order_id`, and that you're only interested in keeping the most recent event for downstream processing. 6 | 7 | As a first step, you can use a combination of the `COUNT` function and the `HAVING` clause to check if and which orders have more than one event; and then filter out these events using `ROW_NUMBER()`. In practice, deduplication is a special case of [Top-N aggregation](../05/05_top_n.md), where N is 1 (`rownum = 1`) and the ordering column is either the processing or event time of events. 8 | 9 | ## Script 10 | 11 | The source table `orders` is backed by the built-in [`datagen` connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/datagen.html), which continuously generates rows in memory. 
```sql
CREATE TABLE orders (
    id INT,
    order_time AS CURRENT_TIMESTAMP,
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECONDS
)
WITH (
  'connector' = 'datagen',
  'rows-per-second'='10',
  'fields.id.kind'='random',
  'fields.id.min'='1',
  'fields.id.max'='100'
);

--Check for duplicates in the `orders` table
SELECT id AS order_id,
       COUNT(*) AS order_cnt
FROM orders o
GROUP BY id
HAVING COUNT(*) > 1;

--Use deduplication to keep only the latest record for each `order_id`
--(ordering by order_time descending so that rownum = 1 is the most recent event)
SELECT
  order_id,
  order_time
FROM (
  SELECT id AS order_id,
         order_time,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY order_time DESC) AS rownum
  FROM orders
)
WHERE rownum = 1;
```

## Example Output

![20_dedup](https://user-images.githubusercontent.com/23521087/102718503-b87d5700-42e8-11eb-8b45-4f9908e8e14e.gif)

--------------------------------------------------------------------------------
/aggregations-and-analytics/01/01_group_by_window.md:
--------------------------------------------------------------------------------
# 01 Aggregating Time Series Data

:bulb: This example will show how to aggregate time-series data in real-time using a `TUMBLE` window.

The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.

Many streaming applications work with time-series data.
To count the number of `DISTINCT` IP addresses seen each minute, rows need to be grouped based on a [time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#time-attributes).
Grouping based on time is special, because time always moves forward, which means Flink can generate final results after the minute is completed.

`TUMBLE` is a built-in function for grouping timestamps into time intervals called windows.
Unlike other aggregations, it will only produce a single final result for each key when the interval is completed.

If the logs do not have a timestamp, one can be generated using a [computed column](https://docs.ververica.com/user_guide/sql_development/table_view.html#computed-column).
`log_time AS PROCTIME()` will append a column to the table with the current system time.
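If your logs already carry a usable timestamp, you could instead declare that column as an event time attribute with a `WATERMARK`, as the [next recipe](../02/02_watermarks.md) shows in detail. A minimal sketch of what that DDL could look like — the table and field names here are illustrative and not part of the script below:

```sql
-- Sketch: declaring an existing timestamp column as an event time attribute,
-- tolerating up to 15 seconds of out-of-orderness (the bound is an assumption).
CREATE TABLE server_logs_with_ts (
    client_ip STRING,
    log_time  TIMESTAMP(3),
    WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS
) WITH (
    'connector' = 'faker',
    'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
    'fields.log_time.expression'  = '#{date.past ''15'',''5'',''SECONDS''}'
);
```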
## Script

```sql
CREATE TABLE server_logs (
    client_ip STRING,
    client_identity STRING,
    userid STRING,
    request_line STRING,
    status_code STRING,
    log_time AS PROCTIME()
) WITH (
  'connector' = 'faker',
  'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
  'fields.client_identity.expression' = '-',
  'fields.userid.expression' = '-',
  'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
  'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}'
);

SELECT
  COUNT(DISTINCT client_ip) AS ip_addresses,
  TUMBLE_PROCTIME(log_time, INTERVAL '1' MINUTE) AS window_interval
FROM server_logs
GROUP BY
  TUMBLE(log_time, INTERVAL '1' MINUTE);
```

## Example Output

![01_group_by_window](https://user-images.githubusercontent.com/23521087/105503522-fe05b900-5cc6-11eb-9c36-bd8dc2a8e9ce.png)

--------------------------------------------------------------------------------
/joins/01/01_regular_joins.md:
--------------------------------------------------------------------------------
# 01 Regular Joins

:bulb: This example will show how you can use joins to correlate rows across multiple tables.

Flink SQL supports complex and flexible join operations over continuous tables.
There are several different types of joins to account for the wide variety of semantics queries may require.

Regular joins are the most generic and flexible type of join.
These include the standard `INNER` and `[FULL|LEFT|RIGHT] OUTER` joins that are available in most modern databases.

Suppose we have a [NOC list](https://en.wikipedia.org/wiki/Non-official_cover) of secret agents all over the world.
Your mission, should you choose to accept it, is to join this table with another containing the agents' real names.

In Flink SQL, this can be achieved using a simple `INNER JOIN`.
Flink will join the tables using an equi-join predicate on the `agent_id` and output a new row every time there is a match.

However, there is something to be careful of.
Flink must retain every input row as part of the join to potentially join it with the other table in the future.
This means the query's resource requirements will grow indefinitely, and the query will eventually fail.
While this type of join is useful in some scenarios, other joins are more powerful in a streaming context and significantly more space-efficient.

In this example, both tables are bounded to remain space efficient.
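For truly unbounded inputs, a common mitigation is to cap how long Flink keeps idle join state around via the `table.exec.state.ttl` option. A sketch of how that could be set in the SQL Client — the exact syntax and a sensible retention value depend on your Flink version and workload:

```sql
-- Sketch: expire state that has not been touched for one hour (the value is an assumption).
-- Rows whose state has expired can no longer produce join matches.
SET 'table.exec.state.ttl' = '1 h';
```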
23 | 24 | ```sql 25 | CREATE TABLE NOC ( 26 | agent_id STRING, 27 | codename STRING 28 | ) 29 | WITH ( 30 | 'connector' = 'faker', 31 | 'fields.agent_id.expression' = '#{regexify ''(1|2|3|4|5){1}''}', 32 | 'fields.codename.expression' = '#{superhero.name}', 33 | 'number-of-rows' = '10' 34 | ); 35 | 36 | CREATE TABLE RealNames ( 37 | agent_id STRING, 38 | name STRING 39 | ) 40 | WITH ( 41 | 'connector' = 'faker', 42 | 'fields.agent_id.expression' = '#{regexify ''(1|2|3|4|5){1}''}', 43 | 'fields.name.expression' = '#{Name.full_name}', 44 | 'number-of-rows' = '10' 45 | ); 46 | 47 | SELECT 48 | name, 49 | codename 50 | FROM NOC 51 | INNER JOIN RealNames ON NOC.agent_id = RealNames.agent_id; 52 | ``` 53 | 54 | ![01_regular_joins](https://user-images.githubusercontent.com/23521087/105504538-280bab00-5cc8-11eb-962d-6f36432e422b.png) 55 | -------------------------------------------------------------------------------- /foundations/07/07_views.md: -------------------------------------------------------------------------------- 1 | # 07 Encapsulating Logic with (Temporary) Views 2 | 3 | :bulb: This example will show how you can use (temporary) views to reuse code and to structure long queries and scripts. 4 | 5 | `CREATE (TEMPORARY) VIEW` defines a view from a query. 6 | **A view is not physically materialized.** 7 | Instead, the query is run every time the view is referenced in a query. 8 | 9 | Temporary views are very useful to structure and decompose more complicated queries and to re-use queries within a longer script. 10 | Non-temporary views - stored in a persistent [catalog](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/catalogs.html) - can also be used to share common queries within your organization, e.g. common filters or pre-processing steps. 11 | 12 | Here, we create a view on the `server_logs` that only contains successful requests. 13 | This view encapsulates the logic of filtering the logs based on certain `status_code`s. 14 | This logic can subsequently be used by any query or script that has access to the catalog. 
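For example, once the `successful_requests` view from the script below exists, any follow-up query can select from it as if it were a table — a short sketch (the aggregation itself is just an illustration):

```sql
-- Sketch: build on the view instead of repeating the filter logic.
SELECT
  status_code,
  COUNT(*) AS request_cnt
FROM successful_requests
GROUP BY status_code;
```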
## Script

```sql
CREATE TABLE server_logs (
    client_ip STRING,
    client_identity STRING,
    userid STRING,
    user_agent STRING,
    log_time TIMESTAMP(3),
    request_line STRING,
    status_code STRING,
    size INT
) WITH (
  'connector' = 'faker',
  'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
  'fields.client_identity.expression' = '-',
  'fields.userid.expression' = '-',
  'fields.user_agent.expression' = '#{Internet.userAgentAny}',
  'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
  'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
  'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
  'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
);

CREATE VIEW successful_requests AS
SELECT *
FROM server_logs
WHERE status_code SIMILAR TO '[2,3][0-9][0-9]';

SELECT * FROM successful_requests;
```

## Example Output

![views](https://user-images.githubusercontent.com/11538663/102009292-c5250c80-3d36-11eb-85b3-05b8faf8df5a.gif)

--------------------------------------------------------------------------------
/aggregations-and-analytics/05/05_top_n.md:
--------------------------------------------------------------------------------
# 05 Continuous Top-N

:bulb: This example will show how to continuously calculate the "Top-N" rows based on a given attribute, using an `OVER` window and the `ROW_NUMBER()` function.

The source table (`spells_cast`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.

The Ministry of Magic tracks every spell a wizard casts throughout Great Britain and wants to know every wizard's Top 2 all-time favorite spells.

Flink SQL can be used to calculate continuous [aggregations](../../foundations/05/05_group_by.md), so if we know
each spell a wizard has cast, we can maintain a continuous total of how many times they have cast that spell.

```sql
SELECT wizard, spell, COUNT(*) AS times_cast
FROM spells_cast
GROUP BY wizard, spell;
```

This result can be used in an `OVER` window to calculate a [Top-N](https://docs.ververica.com/user_guide/sql_development/queries.html#top-n).
The rows are partitioned using the `wizard` column, and are then ordered based on the count of spell casts (`times_cast DESC`).
The built-in function `ROW_NUMBER()` assigns a unique, sequential number to each row, starting from one, according to the rows' ordering within the partition.
Finally, the results are filtered for only those rows with a `row_num <= 2` to find each wizard's top 2 favorite spells.

Where Flink really shines in this query is in its ability to issue retractions.
As wizards cast more spells, their top 2 will change.
When this occurs, Flink will issue a retraction, modifying its output, so the result is always correct and up to date.
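If you materialize this continuously updating result in an external system, the sink has to be able to apply those updates. A sketch of what an upsert sink for the Top 2 result could look like — the `upsert-kafka` topic and broker address are assumptions, not part of this recipe:

```sql
-- Sketch: an upsert sink keyed by (wizard, spell); Flink turns retractions
-- into upserts/deletes for the corresponding key (hypothetical topic/broker).
CREATE TABLE top_spells (
  wizard STRING,
  spell STRING,
  times_cast BIGINT,
  PRIMARY KEY (wizard, spell) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'top-spells',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- INSERT INTO top_spells SELECT ... ;  -- using the Top-N query below
```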
26 | 27 | 28 | ```sql 29 | CREATE TABLE spells_cast ( 30 | wizard STRING, 31 | spell STRING 32 | ) WITH ( 33 | 'connector' = 'faker', 34 | 'fields.wizard.expression' = '#{harry_potter.characters}', 35 | 'fields.spell.expression' = '#{harry_potter.spells}' 36 | ); 37 | 38 | SELECT wizard, spell, times_cast 39 | FROM ( 40 | SELECT *, 41 | ROW_NUMBER() OVER (PARTITION BY wizard ORDER BY times_cast DESC) AS row_num 42 | FROM (SELECT wizard, spell, COUNT(*) AS times_cast FROM spells_cast GROUP BY wizard, spell) 43 | ) 44 | WHERE row_num <= 2; 45 | ``` 46 | 47 | ![05_top_n](https://user-images.githubusercontent.com/23521087/105503736-3e653700-5cc7-11eb-9ddf-9a89d93841bc.png) 48 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Apache Flink SQL Cookbook 2 | 3 | The [Apache Flink SQL](https://docs.ververica.com/user_guide/sql_development/index.html) Cookbook is a curated collection of examples, patterns, and use cases of Apache Flink SQL. 4 | Many of the recipes are completely self-contained and can be run in [Ververica Platform](https://docs.ververica.com/index.html) as is. 5 | 6 | The cookbook is a living document. :seedling: 7 | 8 | ## Table of Contents 9 | 10 | ### Foundations 11 | 12 | 1. [Creating Tables](foundations/01/01_create_table.md) 13 | 2. [Inserting Into Tables](foundations/02/02_insert_into.md) 14 | 3. [Working with Temporary Tables](foundations/03/03_temporary_table.md) 15 | 4. [Filtering Data](foundations/04/04_where.md) 16 | 5. [Aggregating Data](foundations/05/05_group_by.md) 17 | 6. [Sorting Tables](foundations/06/06_order_by.md) 18 | 7. [Encapsulating Logic with (Temporary) Views](foundations/07/07_views.md) 19 | 8. [Writing Results into Multiple Tables](foundations/08/08_statement_sets.md) 20 | 21 | ### Aggregations and Analytics 22 | 1. [Aggregating Time Series Data](aggregations-and-analytics/01/01_group_by_window.md) 23 | 2. [Watermarks](aggregations-and-analytics/02/02_watermarks.md) 24 | 3. [Analyzing Sessions in Time Series Data](aggregations-and-analytics/03/03_group_by_session_window.md) 25 | 4. [Rolling Aggregations on Time Series Data](aggregations-and-analytics/04/04_over.md) 26 | 5. [Continuous Top-N](aggregations-and-analytics/05/05_top_n.md) 27 | 6. [Deduplication](aggregations-and-analytics/06/06_dedup.md) 28 | 7. [Chained (Event) Time Windows](aggregations-and-analytics/07/07_chained_windows.md) 29 | 8. [Detecting Patterns with MATCH_RECOGNIZE](aggregations-and-analytics/08/08_match_recognize.md) 30 | 31 | ### Other Built-in Functions & Operators 32 | 1. [Working with Dates and Timestamps](other-builtin-functions/01/01_date_time.md) 33 | 2. [Building the Union of Multiple Streams](other-builtin-functions/02/02_union-all.md) 34 | 35 | ### User-Defined Functions (UDFs) 36 | 1. [Extending SQL with Python UDFs](udfs/01/01_python_udfs.md) 37 | 38 | ### Joins 39 | 40 | 1. [Regular Joins](joins/01/01_regular_joins.md) 41 | 2. [Interval Joins](joins/02/02_interval_joins.md) 42 | 3. [Temporal Table Join between a non-compacted and compacted Kafka Topic](joins/03/03_kafka_join.md) 43 | 4. [Lookup Joins](joins/04/04_lookup_joins.md) 44 | 5. [Star Schema Denormalization (N-Way Join)](joins/05/05_star_schema.md) 45 | 6. [Lateral Table Join](joins/06/06_lateral_join.md) 46 | 47 | ## About Apache Flink 48 | 49 | Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. 
Learn more about Flink at https://flink.apache.org/.

## License

Copyright © 2020 Ververica GmbH

Distributed under Apache License, Version 2.0.

--------------------------------------------------------------------------------
/aggregations-and-analytics/04/04_over.md:
--------------------------------------------------------------------------------
# 04 Rolling Aggregations on Time Series Data

:bulb: This example will show how to calculate an aggregate or cumulative value based on a group of rows using an `OVER` window. A typical use case is rolling aggregations.

The source table (`temperature_measurements`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions.

OVER window aggregates compute an aggregated value for every input row over a range of ordered rows.
In contrast to GROUP BY aggregates, OVER aggregates do not reduce the number of result rows to a single row for every group.
Instead, OVER aggregates produce an aggregated value for every input row.

The order needs to be defined by a [time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#time-attributes).
The range of rows can be defined by a number of rows or a time interval.

In this example, we are trying to identify outliers in the `temperature_measurements` table.
For this, we use an `OVER` window to calculate, for each measurement, the maximum (`MAX`), minimum (`MIN`) and average (`AVG`) temperature across all measurements, as well as the standard deviation (`STDDEV`), for the same city over the previous minute.
> As an exercise, you can try to write another query to filter out any temperature measurements that are higher or lower than the average by more than four standard deviations.
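One possible shape for that follow-up query, assuming the `temperature_measurements` table and the `last_minute` window defined in the script below:

```sql
-- Sketch: keep only readings more than four standard deviations away from the
-- rolling per-city average (aliases and threshold handling are assumptions).
SELECT *
FROM (
  SELECT
    measurement_time,
    city,
    temperature,
    AVG(temperature)    OVER last_minute AS avg_temperature_minute,
    STDDEV(temperature) OVER last_minute AS stdev_temperature_minute
  FROM temperature_measurements
  WINDOW last_minute AS (
    PARTITION BY city
    ORDER BY measurement_time
    RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW
  )
)
WHERE ABS(temperature - avg_temperature_minute) > 4 * stdev_temperature_minute;
```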
## Script

```sql
CREATE TEMPORARY TABLE temperature_measurements (
  measurement_time TIMESTAMP(3),
  city STRING,
  temperature FLOAT,
  WATERMARK FOR measurement_time AS measurement_time - INTERVAL '15' SECONDS
)
WITH (
  'connector' = 'faker',
  'fields.measurement_time.expression' = '#{date.past ''15'',''SECONDS''}',
  'fields.temperature.expression' = '#{number.numberBetween ''0'',''50''}',
  'fields.city.expression' = '#{regexify ''(Chicago|Munich|Berlin|Portland|Hangzhou|Seattle|Beijing|New York){1}''}'
);

SELECT
  measurement_time,
  city,
  temperature,
  AVG(CAST(temperature AS FLOAT)) OVER last_minute AS avg_temperature_minute,
  MAX(temperature) OVER last_minute AS max_temperature_minute,
  MIN(temperature) OVER last_minute AS min_temperature_minute,
  STDDEV(CAST(temperature AS FLOAT)) OVER last_minute AS stdev_temperature_minute
FROM temperature_measurements
WINDOW last_minute AS (
  PARTITION BY city
  ORDER BY measurement_time
  RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW
);
```

## Example Output

![04_over](https://user-images.githubusercontent.com/23521087/105503670-2beafd80-5cc7-11eb-9e58-7a4ed71b1d7c.png)

--------------------------------------------------------------------------------
/joins/02/02_interval_joins.md:
--------------------------------------------------------------------------------
# 02 Interval Joins

:bulb: This example will show how you can perform joins between tables with events that are related in a temporal context.

## Why Interval Joins?

In a [previous recipe](../01/01_regular_joins.md), you learned about using _regular joins_ in Flink SQL. This kind of join works well for some scenarios, but for others a more efficient type of join is required to keep resource utilization from growing indefinitely.

One of the ways to optimize joining operations in Flink SQL is to use [_interval joins_](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html#interval-joins). An interval join is defined by a join predicate that checks if the time attributes of the input events are within certain time constraints (i.e. a time window).

## Using Interval Joins

Suppose you want to join events of two tables that correlate to each other in the [order fulfillment lifecycle](https://en.wikipedia.org/wiki/Order_fulfillment) (`orders` and `shipments`) and that are under a Service-Level Agreement (SLA) of **3 days**. To reduce the amount of input rows Flink has to retain and optimize the join operation, you can define a time constraint in the `WHERE` clause to bound the time on both sides to that specific interval using a `BETWEEN` predicate.

## Script

The source tables (`orders` and `shipments`) are backed by the built-in [`datagen` connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/datagen.html), which continuously generates rows in memory.
18 | 19 | ```sql 20 | CREATE TABLE orders ( 21 | id INT, 22 | order_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)+5)*(-1) AS INT), CURRENT_TIMESTAMP) 23 | ) 24 | WITH ( 25 | 'connector' = 'datagen', 26 | 'rows-per-second'='10', 27 | 'fields.id.kind'='sequence', 28 | 'fields.id.start'='1', 29 | 'fields.id.end'='1000' 30 | ); 31 | 32 | 33 | CREATE TABLE shipments ( 34 | id INT, 35 | order_id INT, 36 | shipment_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)) AS INT), CURRENT_TIMESTAMP) 37 | ) 38 | WITH ( 39 | 'connector' = 'datagen', 40 | 'rows-per-second'='5', 41 | 'fields.id.kind'='random', 42 | 'fields.id.min'='0', 43 | 'fields.order_id.kind'='sequence', 44 | 'fields.order_id.start'='1', 45 | 'fields.order_id.end'='1000' 46 | ); 47 | 48 | SELECT 49 | o.id AS order_id, 50 | o.order_time, 51 | s.shipment_time, 52 | TIMESTAMPDIFF(DAY,o.order_time,s.shipment_time) AS day_diff 53 | FROM orders o 54 | JOIN shipments s ON o.id = s.order_id 55 | WHERE 56 | o.order_time BETWEEN s.shipment_time - INTERVAL '3' DAY AND s.shipment_time; 57 | ``` 58 | 59 | ## Example Output 60 | 61 | ![15_interval_joins](https://user-images.githubusercontent.com/23521087/102237138-9ce30c80-3ef4-11eb-969f-8f157b249ebb.png) 62 | -------------------------------------------------------------------------------- /foundations/05/05_group_by.md: -------------------------------------------------------------------------------- 1 | # 05 Aggregating Data 2 | 3 | :bulb: This example will show how to aggregate server logs in real-time using the standard `GROUP BY` clause. 4 | 5 | The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. 6 | 7 | To count the number of logs received per browser for each status code _over time_, you can combine the `COUNT` aggregate function with a `GROUP BY` clause. Because the `user_agent` field contains a lot of information, you can extract the browser using the built-in [string function](https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/functions/systemFunctions.html#string-functions) `REGEXP_EXTRACT`. 8 | 9 | A `GROUP BY` on a streaming table produces an updating result, so you will see the aggregated count for each browser continuously changing as new rows flow in. 10 | 11 | > As an exercise, you can play around with other standard SQL aggregate functions (e.g. `SUM`,`AVG`,`MIN`,`MAX`). 
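For instance, once the `server_logs` table from the script below is defined, a variation could track response-size statistics per browser — a sketch (the choice of `size` as the aggregated column is just an illustration):

```sql
-- Sketch: response size statistics per browser.
SELECT
  REGEXP_EXTRACT(user_agent, '[^\/]+') AS browser,
  SUM(size) AS total_bytes,
  AVG(size) AS avg_bytes,
  MAX(size) AS max_bytes
FROM server_logs
GROUP BY
  REGEXP_EXTRACT(user_agent, '[^\/]+');
```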
12 | 13 | ## Script 14 | 15 | ```sql 16 | CREATE TABLE server_logs ( 17 | client_ip STRING, 18 | client_identity STRING, 19 | userid STRING, 20 | user_agent STRING, 21 | log_time TIMESTAMP(3), 22 | request_line STRING, 23 | status_code STRING, 24 | size INT 25 | ) WITH ( 26 | 'connector' = 'faker', 27 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 28 | 'fields.client_identity.expression' = '-', 29 | 'fields.userid.expression' = '-', 30 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}', 31 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 32 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 33 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 34 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 35 | ); 36 | 37 | -- Sample user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A 38 | -- Regex pattern: '[^\/]+' (Match everything before '/') 39 | SELECT 40 | REGEXP_EXTRACT(user_agent,'[^\/]+') AS browser, 41 | status_code, 42 | COUNT(*) AS cnt_status 43 | FROM server_logs 44 | GROUP BY 45 | REGEXP_EXTRACT(user_agent,'[^\/]+'), 46 | status_code; 47 | ``` 48 | 49 | ## Example Output 50 | 51 | This example can be run in the [SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html), a command line tool to develop and execute Flink SQL queries that is bundled in Flink. 52 | 53 | ![03_group_by](https://user-images.githubusercontent.com/23521087/101014385-19293780-3566-11eb-9d81-9c99d6ffa7e4.gif) 54 | -------------------------------------------------------------------------------- /aggregations-and-analytics/02/02_watermarks.md: -------------------------------------------------------------------------------- 1 | # 02 Watermarks 2 | 3 | :bulb: This example will show how to use `WATERMARK`s to work with timestamps in records. 4 | 5 | The source table (`doctor_sightings`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. 6 | 7 | The [previous recipe](../01/01_group_by_window.md) showed how a `TUMBLE` group window makes it simple to aggregate time-series data. 8 | 9 | [The Doctor](https://tardis.fandom.com/wiki/The_Doctor) is a renegade time lord who travels through space and time in a [TARDIS](https://tardis.fandom.com/wiki/The_Doctor%27s_TARDIS). 10 | As different versions of the Doctor travel through time, various people log their sightings. 11 | We want to track how many times each version of the Doctor is seen each minute. 12 | Unlike the previous recipe, these records have an embedded timestamp we need to use to perform our calculation. 13 | 14 | More often than not, most data will come with embedded timestamps that we want to use for our time series calculations. We call this timestamp an [event-time attribute](https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/streaming_analytics.html#event-time-and-watermarks). 15 | 16 | Event time represents when something actually happened in the real world. 17 | And it is unique because it is quasi-monotonically increasing; we generally see things that happened earlier before seeing things that happen later. 
Of course, data will never be perfectly ordered (systems go down, networks are laggy, doctor sightings take time to postmark and mail), and there will be some out-of-orderness in our data.

Flink can account for all of this variability using a [WATERMARK](https://docs.ververica.com/user_guide/sql_development/table_view.html#event-time) attribute in the table's DDL. The watermark signifies a column as the table's event time attribute and tells Flink how out of order we expect our data to be.

In the Doctor's case, we expect all records to arrive within 15 seconds of when the sighting occurs.

## Script

```sql
CREATE TABLE doctor_sightings (
  doctor STRING,
  sighting_time TIMESTAMP(3),
  WATERMARK FOR sighting_time AS sighting_time - INTERVAL '15' SECONDS
)
WITH (
  'connector' = 'faker',
  'fields.doctor.expression' = '#{dr_who.the_doctors}',
  'fields.sighting_time.expression' = '#{date.past ''15'',''SECONDS''}'
);

SELECT
    doctor,
    TUMBLE_ROWTIME(sighting_time, INTERVAL '1' MINUTE) AS sighting_time,
    COUNT(*) AS sightings
FROM doctor_sightings
GROUP BY
    TUMBLE(sighting_time, INTERVAL '1' MINUTE),
    doctor;
```

## Example Output

![02_watermarks](https://user-images.githubusercontent.com/23521087/105503592-12e24c80-5cc7-11eb-9155-243cc9c314f0.png)

--------------------------------------------------------------------------------
/foundations/03/03_temporary_table.md:
--------------------------------------------------------------------------------
# 03 Working with Temporary Tables

:bulb: This example will show how and why to create a temporary table using SQL DDL.

Non-temporary tables in Flink SQL are stored in a catalog, while temporary tables only live within the current session (Apache Flink CLI) or script (Ververica Platform).
You can use a temporary table instead of a regular (catalog) table, if it is only meant to be used within the current session or script.

This example is exactly the same as [Inserting Into Tables](../02/02_insert_into.md) except that both `server_logs` and `client_errors` are created as temporary tables.

### Why Temporary Tables?

For result tables like `client_errors` that no one can ever read from (because of its type `blackhole`), it makes a lot of sense to use a temporary table instead of publishing its metadata in a catalog.

Furthermore, temporary tables allow you to create fully self-contained scripts, which is why we will mostly use those in the Flink SQL Cookbook.
15 | 16 | ## Script 17 | 18 | ```sql 19 | 20 | CREATE TEMPORARY TABLE server_logs ( 21 | client_ip STRING, 22 | client_identity STRING, 23 | userid STRING, 24 | user_agent STRING, 25 | log_time TIMESTAMP(3), 26 | request_line STRING, 27 | status_code STRING, 28 | size INT 29 | ) WITH ( 30 | 'connector' = 'faker', 31 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 32 | 'fields.client_identity.expression' = '-', 33 | 'fields.userid.expression' = '-', 34 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}', 35 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 36 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 37 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 38 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 39 | ); 40 | 41 | CREATE TEMPORARY TABLE client_errors ( 42 | log_time TIMESTAMP(3), 43 | request_line STRING, 44 | status_code STRING, 45 | size INT 46 | ) 47 | WITH ( 48 | 'connector' = 'blackhole' 49 | ); 50 | 51 | INSERT INTO client_errors 52 | SELECT 53 | log_time, 54 | request_line, 55 | status_code, 56 | size 57 | FROM server_logs 58 | WHERE 59 | status_code SIMILAR TO '4[0-9][0-9]' 60 | ``` 61 | 62 | ## Example Output 63 | 64 | In comparison to [Inserting Into Tables](../02/02_insert_into.md), you can see that the two temporary tables do not appear in the catalog browser on the left. 65 | The table definitions never make it into the catalog, but are just submitted as part of the script that contains the `INSERT INTO` statement. 66 | 67 | ![Screencast GIF](https://user-images.githubusercontent.com/11538663/101192652-aac6a100-365b-11eb-82a3-5b86522e772c.gif) 68 | -------------------------------------------------------------------------------- /foundations/02/02_insert_into.md: -------------------------------------------------------------------------------- 1 | # 02 Inserting Into Tables 2 | 3 | :bulb: This recipe shows how to insert rows into a table so that downstream applications can read them. 4 | 5 | As outlined in [the first recipe](../01/01_create_table.md) Flink SQL operates on tables, that are stored in external systems. 6 | To publish results of a query for consumption by downstream applications, you write the results of a query into a table. 7 | This table can be read by Flink SQL, or directly by connecting to the external system that is storing the data (e.g. an ElasticSearch index.) 8 | 9 | This example takes the `server_logs` tables, filters for client errors, and writes these logs into another table called `client_errors`. 10 | Any number of external systems could back the result table, including Apache Kafka, Apache Hive, ElasticSearch, JDBC, among many others. 11 | To keep this example self-contained, `client_errors` is of type `blackhole`: instead of actually writing the data to an external system, the table discards any rows written to it. 
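If you later want downstream applications to actually consume these errors, you could swap the `blackhole` connector for a real sink. A sketch of a Kafka-backed version of the result table — the topic, broker address, and format are assumptions:

```sql
-- Sketch: a Kafka sink instead of the blackhole table (hypothetical topic/broker).
CREATE TABLE client_errors_kafka (
  log_time TIMESTAMP(3),
  request_line STRING,
  status_code STRING,
  size INT
) WITH (
  'connector' = 'kafka',
  'topic' = 'client-errors',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);
```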
## Script

```sql
CREATE TABLE server_logs (
    client_ip STRING,
    client_identity STRING,
    userid STRING,
    user_agent STRING,
    log_time TIMESTAMP(3),
    request_line STRING,
    status_code STRING,
    size INT
) WITH (
  'connector' = 'faker',
  'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
  'fields.client_identity.expression' = '-',
  'fields.userid.expression' = '-',
  'fields.user_agent.expression' = '#{Internet.userAgentAny}',
  'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
  'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
  'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
  'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
);

CREATE TABLE client_errors (
  log_time TIMESTAMP(3),
  request_line STRING,
  status_code STRING,
  size INT
)
WITH (
  'connector' = 'blackhole'
);

INSERT INTO client_errors
SELECT
  log_time,
  request_line,
  status_code,
  size
FROM server_logs
WHERE
  status_code SIMILAR TO '4[0-9][0-9]';
```

## Example Output

An INSERT INTO query that reads from an unbounded table (like `server_logs`) is a long-running application.
When you run such a statement in Apache Flink's [SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html) a Flink Job will be submitted to the configured cluster.
In Ververica Platform a so-called Deployment will be created to manage the execution of the statement.

![Screenshot GIF](https://user-images.githubusercontent.com/11538663/101192280-22480080-365b-11eb-97e9-35f151027c6e.gif)

--------------------------------------------------------------------------------
/joins/06/06_lateral_join.md:
--------------------------------------------------------------------------------
# 06 Lateral Table Join

:bulb: This example will show how you can correlate events using a `LATERAL` join.

A recent addition to the SQL standard is the `LATERAL` join, which allows you to combine
the power of a correlated subquery with the expressiveness of a join.

Given a table with people's addresses, you need to find the two most populous cities
for each state and continuously update those rankings as people move. The input table
of `People` contains a uid for each person and their address and when they moved there.

The first step is to calculate each city's population using a [continuous aggregation](../../foundations/05/05_group_by.md).
While this is simple enough, the real power of Flink SQL comes when people move. By using
[deduplication](../../aggregations-and-analytics/06/06_dedup.md) Flink will automatically issue a retraction for a person's old city when
they move. So if John moves from New York to Los Angeles, the population for New York will
automatically go down by 1. This gives us the power of change data capture without having
to invest in the actual infrastructure of setting it up!

With this dynamic population table at hand, you are ready to solve the original problem using a `LATERAL` table join.
20 | Unlike a normal join, lateral joins allow the subquery to correlate with columns from other arguments in the `FROM` clause. And unlike a regular subquery, as a join, the lateral can return multiple rows. 21 | You can now have a sub-query correlated with every individual state, and for every state it ranks by population and returns the top 2 cities. 22 | 23 | ## Script 24 | 25 | ```sql 26 | CREATE TABLE People ( 27 | id INT, 28 | city STRING, 29 | state STRING, 30 | arrival_time TIMESTAMP(3), 31 | WATERMARK FOR arrival_time AS arrival_time - INTERVAL '1' MINUTE 32 | ) WITH ( 33 | 'connector' = 'faker', 34 | 'fields.id.expression' = '#{number.numberBetween ''1'',''100''}', 35 | 'fields.city.expression' = '#{regexify ''(Newmouth|Newburgh|Portport|Southfort|Springfield){1}''}', 36 | 'fields.state.expression' = '#{regexify ''(New York|Illinois|California|Washington){1}''}', 37 | 'fields.arrival_time.expression' = '#{date.past ''15'',''SECONDS''}', 38 | 'rows-per-second' = '10' 39 | ); 40 | 41 | CREATE TEMPORARY VIEW CurrentPopulation AS 42 | SELECT 43 | city, 44 | state, 45 | COUNT(*) as population 46 | FROM ( 47 | SELECT 48 | city, 49 | state, 50 | ROW_NUMBER() OVER (PARTITION BY id ORDER BY arrival_time DESC) AS rownum 51 | FROM People 52 | ) 53 | WHERE rownum = 1 54 | GROUP BY city, state; 55 | 56 | SELECT 57 | state, 58 | city, 59 | population 60 | FROM 61 | (SELECT DISTINCT state FROM CurrentPopulation) States, 62 | LATERAL ( 63 | SELECT city, population 64 | FROM CurrentPopulation 65 | WHERE state = States.state 66 | ORDER BY population DESC 67 | LIMIT 2 68 | ); 69 | ``` 70 | 71 | ## Example Output 72 | 73 | ![lateral](https://user-images.githubusercontent.com/23521087/105504738-6bfeb000-5cc8-11eb-9517-1242dfa87bb4.gif) 74 | -------------------------------------------------------------------------------- /other-builtin-functions/02/02_union-all.md: -------------------------------------------------------------------------------- 1 | # 02 Building the Union of Multiple Streams 2 | 3 | :bulb: This example will show how you can use the set operation `UNION ALL` to combine several streams of data. 4 | 5 | See [our documentation](https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/sql/queries/#set-operations) 6 | for a full list of fantastic set operations Apache Flink supports. 7 | 8 | 9 | ## The Sources 10 | 11 | The examples assumes you are building an application that is tracking visits :fox_face: on foreign planets :chestnut:. 12 | There are three sources of visits. The universe of Rick and Morty, the very real world of NASA and such, 13 | and the not so real world of Hitchhikers Guide To The Galaxy. 14 | 15 | All three tables are `unbounded` and backed by the [`faker` connector](https://github.com/knaufk/flink-faker). 16 | 17 | All sources of tracked visits have the `location` and `visit_time` in common. Some have `visitors`, some have 18 | `spacecrafts` and one has both. 
19 | 20 | ```sql 21 | CREATE TEMPORARY TABLE rickandmorty_visits ( 22 | visitor STRING, 23 | location STRING, 24 | visit_time TIMESTAMP(3) 25 | ) WITH ( 26 | 'connector' = 'faker', 27 | 'fields.visitor.expression' = '#{RickAndMorty.character}', 28 | 'fields.location.expression' = '#{RickAndMorty.location}', 29 | 'fields.visit_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}' 30 | ); 31 | 32 | CREATE TEMPORARY TABLE spaceagency_visits ( 33 | spacecraft STRING, 34 | location STRING, 35 | visit_time TIMESTAMP(3) 36 | ) WITH ( 37 | 'connector' = 'faker', 38 | 'fields.spacecraft.expression' = '#{Space.nasaSpaceCraft}', 39 | 'fields.location.expression' = '#{Space.star}', 40 | 'fields.visit_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}' 41 | ); 42 | 43 | CREATE TEMPORARY TABLE hitchhiker_visits ( 44 | visitor STRING, 45 | starship STRING, 46 | location STRING, 47 | visit_time TIMESTAMP(3) 48 | ) WITH ( 49 | 'connector' = 'faker', 50 | 'fields.visitor.expression' = '#{HitchhikersGuideToTheGalaxy.character}', 51 | 'fields.starship.expression' = '#{HitchhikersGuideToTheGalaxy.starship}', 52 | 'fields.location.expression' = '#{HitchhikersGuideToTheGalaxy.location}', 53 | 'fields.visit_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}' 54 | ); 55 | 56 | ``` 57 | 58 | ## The Query 59 | 60 | We are using `UNION ALL` because it doesn't try to combine equivalent rows like 61 | `UNION` would do. That is also the reason why `UNION` can only be used with `bounded` streams. 62 | 63 | 64 | ```sql 65 | SELECT visitor, '' AS spacecraft, location, visit_time FROM rickandmorty_visits 66 | UNION ALL 67 | SELECT '' AS visitor, spacecraft, location, visit_time FROM spaceagency_visits 68 | UNION ALL 69 | SELECT visitor, starship AS spacecraft, location, visit_time FROM hitchhiker_visits; 70 | ``` 71 | 72 | :alien: As we are using `CREATE TEMPORARY TABLE`, you need to run both the `CREATE TABLE` and the `SELECT` statements together. 73 | 74 | ## The Beauty in VVP 75 | 76 | ![screeny](https://user-images.githubusercontent.com/68620/108173907-081cab00-70ff-11eb-823a-8a245b390485.png) 77 | 78 | 79 | The result is a combined stream of people visiting a location in one of those fantastic universes. 80 | We are sure you'll understand why this is one of our favorite queries. 81 | 82 | :bird: [Let us know](https://twitter.com/ververicadata) about your favorite streaming SQL Query. 83 | -------------------------------------------------------------------------------- /joins/04/04_lookup_joins.md: -------------------------------------------------------------------------------- 1 | # 04 Lookup Joins 2 | 3 | :bulb: This example will show how you can enrich a stream with an external table of reference data (i.e. a _lookup_ table). 4 | 5 | ## Data Enrichment 6 | 7 | Not all data changes frequently, even when working in real-time: in some cases, you might need to enrich streaming data with static — or _reference_ — data that is stored externally. 8 | For example, `user` metadata may be stored in a relational database that Flink needs to join against directly. 9 | Flink SQL allows you to look up reference data and join it with a stream using a _lookup join_. The join requires one table to have a [processing time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#processing-time-attributes) and the other table to be backed by a [lookup source connector](https://docs.ververica.com/user_guide/sql_development/connectors.html#id1), like the JDBC connector. 
10 | 11 | ## Using Lookup Joins 12 | 13 | In this example, you will look up reference user data stored in MySQL to flag subscription events for users that are minors (`age < 18`). The `FOR SYSTEM_TIME AS OF` clause uses the processing time attribute to ensure that each row of the `subscriptions` table is joined with the `users` rows that match the join predicate at the point in time when the `subscriptions` row is processed by the join operator. The lookup join also requires an equality join predicate based on the `PRIMARY KEY` of the lookup table (`usub.user_id = u.user_id`). Here, the source does not have to read the entire table and can lazily fetch individual values from the external table when necessary. 14 | 15 | ## Script 16 | 17 | The source table (`subscriptions`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. The `users` table is backed by an existing MySQL reference table using the [JDBC connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/jdbc.html). 18 | 19 | ```sql 20 | CREATE TABLE subscriptions ( 21 | id STRING, 22 | user_id INT, 23 | type STRING, 24 | start_date TIMESTAMP(3), 25 | end_date TIMESTAMP(3), 26 | payment_expiration TIMESTAMP(3), 27 | proc_time AS PROCTIME() 28 | ) WITH ( 29 | 'connector' = 'faker', 30 | 'fields.id.expression' = '#{Internet.uuid}', 31 | 'fields.user_id.expression' = '#{number.numberBetween ''1'',''50''}', 32 | 'fields.type.expression'= '#{regexify ''(basic|premium|platinum){1}''}', 33 | 'fields.start_date.expression' = '#{date.past ''30'',''DAYS''}', 34 | 'fields.end_date.expression' = '#{date.future ''365'',''DAYS''}', 35 | 'fields.payment_expiration.expression' = '#{date.future ''365'',''DAYS''}' 36 | ); 37 | 38 | CREATE TABLE users ( 39 | user_id INT PRIMARY KEY, 40 | user_name VARCHAR(255) NOT NULL, 41 | age INT NOT NULL 42 | ) 43 | WITH ( 44 | 'connector' = 'jdbc', 45 | 'url' = 'jdbc:mysql://localhost:3306/mysql-database', 46 | 'table-name' = 'users', 47 | 'username' = 'mysql-user', 48 | 'password' = 'mysql-password' 49 | ); 50 | 51 | SELECT 52 | id AS subscription_id, 53 | type AS subscription_type, 54 | age AS user_age, 55 | CASE 56 | WHEN age < 18 THEN 1 57 | ELSE 0 58 | END AS is_minor 59 | FROM subscriptions usub 60 | JOIN users FOR SYSTEM_TIME AS OF usub.proc_time AS u 61 | ON usub.user_id = u.user_id; 62 | ``` 63 | 64 | ## Example Output 65 | 66 | ![18_lookup_joins](https://user-images.githubusercontent.com/23521087/102645588-fcdeea80-4162-11eb-8581-55a06ea82518.png) 67 | -------------------------------------------------------------------------------- /aggregations-and-analytics/03/03_group_by_session_window.md: -------------------------------------------------------------------------------- 1 | # 03 Analyzing Sessions in Time Series Data 2 | 3 | :bulb: This example will show how to aggregate time-series data in real-time using a `SESSION` window. 4 | 5 | The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. 6 | 7 | #### What are Session Windows? 8 | 9 | In a [previous recipe](../01/01_group_by_window.md), you learned about _tumbling windows_. 
Another way to group time-series data is using [_session windows_](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sql/queries.html#group-windows), which aggregate records into _sessions_ that represent periods of activity followed by gaps of idleness. Think, for example, of user sessions on a website: a user will be active for a given period of time, then leave the website; and each user will be active at different times. To analyze user behaviour, it's useful to aggregate their actions on the website for each period of activity (i.e. _session_).

Unlike tumbling windows, session windows don't have a fixed duration and are tracked independently across keys (i.e. windows of different keys will have different durations).

#### Using Session Windows

To count the number of "Forbidden" (403) requests per user over the duration of a session, you can use the `SESSION` built-in group window function. In this example, a session is bounded by a gap of idleness of 10 seconds (`INTERVAL '10' SECOND`). This means that requests that occur within 10 seconds of the last seen request for each user will be merged into the same session window; and any request that occurs outside of this gap will trigger the creation of a new session window.

> Tip: You can use the `SESSION_START` and `SESSION_ROWTIME` [auxiliary functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sql/queries.html#selecting-group-window-start-and-end-timestamps) to check the lower and upper bounds of session windows.


## Script

```sql
CREATE TABLE server_logs (
    client_ip STRING,
    client_identity STRING,
    userid STRING,
    log_time TIMESTAMP(3),
    request_line STRING,
    status_code STRING,
    WATERMARK FOR log_time AS log_time - INTERVAL '5' SECONDS
) WITH (
  'connector' = 'faker',
  'rows-per-second' = '5',
  'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
  'fields.client_identity.expression' = '-',
  'fields.userid.expression' = '#{regexify ''(morsapaes|knauf|sjwiesman){1}''}',
  'fields.log_time.expression' = '#{date.past ''5'',''SECONDS''}',
  'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}',
  'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}'
);

SELECT
  userid,
  SESSION_START(log_time, INTERVAL '10' SECOND) AS session_beg,
  SESSION_ROWTIME(log_time, INTERVAL '10' SECOND) AS session_end,
  COUNT(request_line) AS request_cnt
FROM server_logs
WHERE status_code = '403'
GROUP BY
  userid,
  SESSION(log_time, INTERVAL '10' SECOND);
```

## Example Output

![03_session_windows](https://user-images.githubusercontent.com/23521087/101628701-7ae31900-3a20-11eb-89c2-231649b7d99f.png)

--------------------------------------------------------------------------------
/other-builtin-functions/01/01_date_time.md:
--------------------------------------------------------------------------------
# 01 Working with Dates and Timestamps

:bulb: This example will show how to use [built-in date and time functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html#temporal-functions) to manipulate temporal fields.
4 | 5 | The source table (`subscriptions`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. 6 | 7 | #### Date and Time Functions 8 | 9 | Working with dates and timestamps is commonplace in SQL, but your input may come in different types, formats or even timezones. Flink SQL has multiple built-in functions that are useful to deal with this kind of situation and make it convenient to handle temporal fields. 10 | 11 | Assume you have a table with service subscriptions and that you want to continuously [filter](../../foundations/04/04_where.md) these subscriptions to find the ones that have associated payment methods expiring in less than 30 days. The `start_date` and `end_date` are [Unix timestamps](https://en.wikipedia.org/wiki/Unix_time) (i.e. epochs) — which are not very human-readable and should be converted. Also, you want to parse the `payment_expiration` timestamp into its corresponding day, month and year parts. What are some functions that would be useful? 12 | 13 | * `TO_TIMESTAMP(string[, format])`: converts a `STRING` value to a `TIMESTAMP` using the specified format (default: 'yyyy-MM-dd HH:mm:ss') 14 | 15 | * `FROM_UNIXTIME(numeric[, string])`: converts an epoch to a formatted `STRING` (default: 'yyyy-MM-dd HH:mm:ss') 16 | 17 | * `DATE_FORMAT(timestamp, string)`: converts a `TIMESTAMP` to a `STRING` using the specified format 18 | 19 | * `EXTRACT(timeintervalunit FROM temporal)`: returns a `LONG` extracted from the specified date part of a temporal field (e.g. `DAY`,`MONTH`,`YEAR`) 20 | 21 | * `TIMESTAMPDIFF(unit, timepoint1, timepoint2)`: returns the number of time units (`SECOND`, `MINUTE`, `HOUR`, `DAY`, `MONTH` or `YEAR`) between `timepoint1` and `timepoint2` 22 | 23 | * `CURRENT_TIMESTAMP`: returns the current SQL timestamp (UTC) 24 | 25 | For a complete list of built-in date and time functions, check the Flink [documentation](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/functions/systemFunctions.html#temporal-functions). 26 | 27 | > As an exercise, you can try to reproduce the same filtering condition using `TIMESTAMPADD` instead. 
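One way that could look, assuming the same `subscriptions` table defined in the script below — this expresses essentially the same 30-day condition:

```sql
-- Sketch: the filtering condition rewritten with TIMESTAMPADD.
SELECT id, payment_expiration
FROM subscriptions
WHERE payment_expiration < TIMESTAMPADD(DAY, 30, CURRENT_TIMESTAMP);
```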
28 | 29 | ## Script 30 | 31 | ```sql 32 | CREATE TABLE subscriptions ( 33 | id STRING, 34 | start_date INT, 35 | end_date INT, 36 | payment_expiration TIMESTAMP(3) 37 | ) WITH ( 38 | 'connector' = 'faker', 39 | 'fields.id.expression' = '#{Internet.uuid}', 40 | 'fields.start_date.expression' = '#{number.numberBetween ''1576141834'',''1607764234''}', 41 | 'fields.end_date.expression' = '#{number.numberBetween ''1609060234'',''1639300234''}', 42 | 'fields.payment_expiration.expression' = '#{date.future ''365'',''DAYS''}' 43 | ); 44 | 45 | SELECT 46 | id, 47 | TO_TIMESTAMP(FROM_UNIXTIME(start_date)) AS start_date, 48 | TO_TIMESTAMP(FROM_UNIXTIME(end_date)) AS end_date, 49 | DATE_FORMAT(payment_expiration,'YYYYww') AS exp_yweek, 50 | EXTRACT(DAY FROM payment_expiration) AS exp_day, --same as DAYOFMONTH(ts) 51 | EXTRACT(MONTH FROM payment_expiration) AS exp_month, --same as MONTH(ts) 52 | EXTRACT(YEAR FROM payment_expiration) AS exp_year --same as YEAR(ts) 53 | FROM subscriptions 54 | WHERE 55 | TIMESTAMPDIFF(DAY,CURRENT_TIMESTAMP,payment_expiration) < 30; 56 | ``` 57 | 58 | ## Example Output 59 | 60 | ![12_date_time](https://user-images.githubusercontent.com/23521087/101981480-811a0500-3c6d-11eb-9b28-5603d76ba0e6.png) -------------------------------------------------------------------------------- /udfs/01/01_python_udfs.md: -------------------------------------------------------------------------------- 1 | # 01 Extending SQL with Python UDFs 2 | 3 | :bulb: This example will show how to extend Flink SQL with custom functions written in Python. 4 | 5 | Flink SQL provides a wide range of [built-in functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html) that cover most SQL day-to-day work. Sometimes, you need more flexibility to express custom business logic or transformations that aren't easily translatable to SQL: this can be achieved with [User-Defined Functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/udfs.html) (UDFs). 6 | 7 | In this example, you'll focus on [Python UDFs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/python_udfs.html) and implement a custom function (`to_fahr`) to convert temperature readings that are continuously generated for different EU and US cities. The Celsius->Fahrenheit conversion should only happen if the city associated with the reading is in the US. 8 | 9 | ## Scripts 10 | 11 | #### Python UDF 12 | 13 | The first step is to create a Python file with the UDF implementation (`python_udf.py`), using Flink's [Python Table API](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/intro_to_table_api.html). If this is new to you, there are examples on how to write [general](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/python_udfs.html) and [vectorized](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html) Python UDFs in the Flink documentation. 
14 | 15 | ```python 16 | from pyflink.table import DataTypes 17 | from pyflink.table.udf import udf 18 | 19 | us_cities = {"Chicago","Portland","Seattle","New York"} 20 | 21 | @udf(input_types=[DataTypes.STRING(), DataTypes.FLOAT()], 22 | result_type=DataTypes.FLOAT()) 23 | def to_fahr(city, temperature): 24 | 25 | if city in us_cities: 26 | 27 | fahr = ((temperature * 9.0 / 5.0) + 32.0) 28 | 29 | return fahr 30 | else: 31 | return temperature 32 | ``` 33 | 34 | For detailed instructions on how to then make the Python file available as a UDF in the SQL Client, please refer to [this documentation page](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html#user-defined-functions). 35 | 36 | #### SQL 37 | 38 | The source table (`temperature_measurements`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. 39 | 40 | ```sql 41 | --Register the Python UDF using the fully qualified 42 | --name of the function ([module name].[object name]) 43 | CREATE FUNCTION to_fahr AS 'python_udf.to_fahr' 44 | LANGUAGE PYTHON; 45 | 46 | 47 | CREATE TABLE temperature_measurements ( 48 | city STRING, 49 | temperature FLOAT, 50 | measurement_time TIMESTAMP(3), 51 | WATERMARK FOR measurement_time AS measurement_time - INTERVAL '15' SECONDS 52 | ) 53 | WITH ( 54 | 'connector' = 'faker', 55 | 'fields.temperature.expression' = '#{number.numberBetween ''0'',''42''}', 56 | 'fields.measurement_time.expression' = '#{date.past ''15'',''SECONDS''}', 57 | 'fields.city.expression' = '#{regexify ''(Copenhagen|Berlin|Chicago|Portland|Seattle|New York){1}''}' 58 | ); 59 | 60 | 61 | --Use to_fahr() to convert temperatures in US cities from C to F 62 | SELECT city, 63 | temperature AS tmp, 64 | to_fahr(city,temperature) AS tmp_conv, 65 | measurement_time 66 | FROM temperature_measurements; 67 | ``` 68 | 69 | ## Example Output 70 | 71 | ![01_python_udfs](https://user-images.githubusercontent.com/23521087/106733744-8ca4ff00-6612-11eb-9721-e4a74fb07329.gif) -------------------------------------------------------------------------------- /foundations/06/06_order_by.md: -------------------------------------------------------------------------------- 1 | # 06 Sorting Tables 2 | 3 | :bulb: This example will show how you can sort a table, particularly unbounded tables. 4 | 5 | Flink SQL supports `ORDER BY`. 6 | Bounded Tables can be sorted by any column, descending or ascending. 7 | 8 | To use `ORDER BY` on unbounded tables like `server_logs` the primary sorting key needs to be a [time attribute](https://docs.ververica.com/user_guide/sql_development/table_view.html#time-attributes) like `log_time`. 9 | 10 | In first example below, we are sorting the `server_logs` by `log_time`. 11 | The second example is a bit more advanced: 12 | It sorts the number of logs per minute and browser by the `window_time` (a time attribute) and the `cnt_browser` (descending), so that the browser with the highest number of logs is at the top of each window. 
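As a rough illustration of that rule (not part of the recipe itself): on the unbounded `server_logs` table defined below, a query whose leading sort key is the `log_time` time attribute is supported, while sorting primarily by a regular column is not.

```sql
-- Supported: the primary sort key is the time attribute (ascending);
-- additional sort keys may follow.
SELECT * FROM server_logs ORDER BY log_time, status_code;

-- Not supported on an unbounded table: the primary sort key is a regular column.
-- SELECT * FROM server_logs ORDER BY size DESC;
```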
13 | 14 | ## Script 15 | 16 | ```sql 17 | CREATE TEMPORARY TABLE server_logs ( 18 | client_ip STRING, 19 | client_identity STRING, 20 | userid STRING, 21 | user_agent STRING, 22 | log_time TIMESTAMP(3), 23 | request_line STRING, 24 | status_code STRING, 25 | size INT, 26 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS 27 | ) WITH ( 28 | 'connector' = 'faker', 29 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 30 | 'fields.client_identity.expression' = '-', 31 | 'fields.userid.expression' = '-', 32 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}', 33 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 34 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 35 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 36 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 37 | ); 38 | 39 | SELECT * FROM server_logs 40 | ORDER BY log_time; 41 | ``` 42 | 43 | ## Example Output 44 | 45 | ![06_order_by](https://user-images.githubusercontent.com/23521087/105504299-e24ee280-5cc7-11eb-8935-ed203e604f8d.png) 46 | 47 | ## Advanced Example 48 | 49 |
50 | Advanced Example 51 | 52 | ### Script 53 | 54 | ```sql 55 | CREATE TEMPORARY TABLE server_logs ( 56 | client_ip STRING, 57 | client_identity STRING, 58 | userid STRING, 59 | user_agent STRING, 60 | log_time TIMESTAMP(3), 61 | request_line STRING, 62 | status_code STRING, 63 | size INT, 64 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS 65 | ) WITH ( 66 | 'connector' = 'faker', 67 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 68 | 'fields.client_identity.expression' = '-', 69 | 'fields.userid.expression' = '-', 70 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}', 71 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 72 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 73 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 74 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 75 | ); 76 | SELECT 77 | TUMBLE_ROWTIME(log_time, INTERVAL '1' MINUTE) AS window_time, 78 | REGEXP_EXTRACT(user_agent,'[^\/]+') AS browser, 79 | COUNT(*) AS cnt_browser 80 | FROM server_logs 81 | GROUP BY 82 | REGEXP_EXTRACT(user_agent,'[^\/]+'), 83 | TUMBLE(log_time, INTERVAL '1' MINUTE) 84 | ORDER BY 85 | window_time, 86 | cnt_browser DESC 87 | ``` 88 | 89 | ### Example Output 90 | 91 | ![06_order_by_advanced](https://user-images.githubusercontent.com/23521087/105504249-d5ca8a00-5cc7-11eb-984a-e1eaf6622995.png) 92 | 93 |
94 | -------------------------------------------------------------------------------- /aggregations-and-analytics/08/08_match_recognize.md: -------------------------------------------------------------------------------- 1 | # 08 Detecting patterns with MATCH_RECOGNIZE 2 | 3 | :bulb: This example will show how you can use Flink SQL to detect patterns in a stream of events with `MATCH_RECOGNIZE`. 4 | 5 | A common (but historically complex) task in SQL day-to-day work is to identify meaningful sequences of events in a data set — also known as Complex Event Processing (CEP). This becomes even more relevant when dealing with streaming data, as you want to react quickly to known patterns or changing trends to deliver up-to-date business insights. In Flink SQL, you can easily perform this kind of tasks using the standard SQL clause [`MATCH_RECOGNIZE`](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/match_recognize.html). 6 | 7 | ## Breaking down MATCH_RECOGNIZE 8 | 9 | In this example, you want to find users that downgraded their service subscription from one of the premium tiers (`type IN ('premium','platinum')`) to the basic tier. 10 | 11 | #### Input 12 | 13 | The input argument of `MATCH_RECOGNIZE` will be a row pattern table based on `subscriptions`. As a first step, logical partitioning and ordering must be applied to the input row pattern table to ensure that event processing is correct and deterministic: 14 | 15 | ```sql 16 | PARTITION BY user_id 17 | ORDER BY proc_time 18 | ``` 19 | 20 | #### Output 21 | 22 | Row pattern columns are then defined in the `MEASURES` clause, which can be thought of as the `SELECT` of `MATCH_RECOGNIZE`. If you're interested in getting the type of premium subscription associated with the last event before the downgrade, you can fetch it using the [logical offset](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/match_recognize.html#logical-offsets) operator `LAST`. The downgrade date can be extrapolated from the `start_date` of the first basic subscription event following any existing premium one(s). 23 | 24 | ```sql 25 | MEASURES 26 | LAST(PREMIUM.type) AS premium_type, 27 | AVG(TIMESTAMPDIFF(DAY,PREMIUM.start_date,PREMIUM.end_date)) AS premium_avg_duration, 28 | BASIC.start_date AS downgrade_date 29 | AFTER MATCH SKIP PAST LAST ROW 30 | ``` 31 | 32 | #### Pattern Definition 33 | 34 | Patterns are specified in the `PATTERN` clause using row-pattern variables (i.e. event types) and regular expressions. These variables must also be associated with the matching conditions that events must meet to be included in the pattern, using the `DEFINE` clause. Here, you are interested in matching one or more premium subscription events (`PREMIUM+`) followed by a basic subscription event (`BASIC`): 35 | 36 | ```sql 37 | PATTERN (PREMIUM+ BASIC) 38 | DEFINE PREMIUM AS PREMIUM.type IN ('premium','platinum'), 39 | BASIC AS BASIC.type = 'basic'); 40 | ``` 41 | 42 | ## Script 43 | 44 | The source table (`subscriptions`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. 
45 | 46 | ```sql 47 | CREATE TABLE subscriptions ( 48 | id STRING, 49 | user_id INT, 50 | type STRING, 51 | start_date TIMESTAMP(3), 52 | end_date TIMESTAMP(3), 53 | payment_expiration TIMESTAMP(3), 54 | proc_time AS PROCTIME() 55 | ) WITH ( 56 | 'connector' = 'faker', 57 | 'fields.id.expression' = '#{Internet.uuid}', 58 | 'fields.user_id.expression' = '#{number.numberBetween ''1'',''50''}', 59 | 'fields.type.expression'= '#{regexify ''(basic|premium|platinum){1}''}', 60 | 'fields.start_date.expression' = '#{date.past ''30'',''DAYS''}', 61 | 'fields.end_date.expression' = '#{date.future ''15'',''DAYS''}', 62 | 'fields.payment_expiration.expression' = '#{date.future ''365'',''DAYS''}' 63 | ); 64 | 65 | SELECT * 66 | FROM subscriptions 67 | MATCH_RECOGNIZE (PARTITION BY user_id 68 | ORDER BY proc_time 69 | MEASURES 70 | LAST(PREMIUM.type) AS premium_type, 71 | AVG(TIMESTAMPDIFF(DAY,PREMIUM.start_date,PREMIUM.end_date)) AS premium_avg_duration, 72 | BASIC.start_date AS downgrade_date 73 | AFTER MATCH SKIP PAST LAST ROW 74 | --Pattern: one or more 'premium' or 'platinum' subscription events (PREMIUM) 75 | --followed by a 'basic' subscription event (BASIC) for the same `user_id` 76 | PATTERN (PREMIUM+ BASIC) 77 | DEFINE PREMIUM AS PREMIUM.type IN ('premium','platinum'), 78 | BASIC AS BASIC.type = 'basic'); 79 | ``` 80 | 81 | ## Example Output 82 | 83 | ![23_match_recognize](https://user-images.githubusercontent.com/2392216/108039085-ee665f80-703b-11eb-93f9-f8e3b684f315.png) 84 | -------------------------------------------------------------------------------- /foundations/08/08_statement_sets.md: -------------------------------------------------------------------------------- 1 | # 08 Writing Results into Multiple Tables 2 | 3 | :bulb: In this recipe, you will learn how to use [Statement Sets](https://docs.ververica.com/user_guide/sql_development/sql_scripts.html#sql-statements) to run multiple `INSERT INTO` statements in a single, optimized Flink Job. 4 | 5 | Many product requirements involve outputting the results of a streaming application to two or more sinks, such as [Apache Kafka](https://docs.ververica.com/user_guide/sql_development/connectors.html#apache-kafka) for real-time use cases, or a [Filesystem](https://docs.ververica.com/user_guide/sql_development/connectors.html#file-system) for offline ones. 6 | Other times, two queries are not the same but share some extensive intermediate operations. 7 | 8 | When working with server logs, the support team would like to see the number of status codes per browser every 5 minutes to have real-time insights into a web pages' status. 9 | Additionally, they would like the same information on an hourly basis made available as partitioned [Apache Parquet](https://docs.ververica.com/user_guide/sql_development/connectors.html#apache-parquet) files so they can perform historical analysis. 10 | 11 | We could quickly write two Flink SQL queries to solve both these requirements, but that would not be efficient. 12 | These queries have a lot of duplicated work, like reading the source logs Kafka topic and cleansing the data. 13 | 14 | Ververica Platform includes a feature called `STATEMENT SET`s, that allows for multiplexing `INSERT INTO` statements into a single query holistically optimized by Apache Flink and deployed as a single application. 
15 | 16 | ```sql 17 | CREATE TEMPORARY TABLE server_logs ( 18 | client_ip STRING, 19 | client_identity STRING, 20 | userid STRING, 21 | user_agent STRING, 22 | log_time TIMESTAMP(3), 23 | request_line STRING, 24 | status_code STRING, 25 | size INT, 26 | WATERMARK FOR log_time AS log_time - INTERVAL '30' SECONDS 27 | ) WITH ( 28 | 'connector' = 'faker', 29 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 30 | 'fields.client_identity.expression' = '-', 31 | 'fields.userid.expression' = '-', 32 | 'fields.user_agent.expression' = '#{Internet.userAgentAny}', 33 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 34 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 35 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 36 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 37 | ); 38 | 39 | CREATE TEMPORARY TABLE realtime_aggregations ( 40 | `browser` STRING, 41 | `status_code` STRING, 42 | `end_time` TIMESTAMP(3), 43 | `requests` BIGINT NOT NULL 44 | ) WITH ( 45 | 'connector' = 'kafka', 46 | 'topic' = 'browser-status-codes', 47 | 'properties.bootstrap.servers' = 'localhost:9092', 48 | 'properties.group.id' = 'browser-countds', 49 | 'format' = 'avro' 50 | ); 51 | 52 | 53 | CREATE TEMPORARY TABLE offline_datawarehouse ( 54 | `browser` STRING, 55 | `status_code` STRING, 56 | `dt` STRING, 57 | `hour` STRING, 58 | `requests` BIGINT NOT NULL 59 | ) PARTITIONED BY (`dt`, `hour`) WITH ( 60 | 'connector' = 'filesystem', 61 | 'path' = 's3://my-bucket/browser-into', 62 | 'sink.partition-commit.trigger' = 'partition-time', 63 | 'format' = 'parquet' 64 | ); 65 | 66 | -- This is a shared view that will be used by both 67 | -- insert into statements 68 | CREATE TEMPORARY VIEW browsers AS 69 | SELECT 70 | REGEXP_EXTRACT(user_agent,'[^\/]+') AS browser, 71 | status_code, 72 | log_time 73 | FROM server_logs; 74 | 75 | BEGIN STATEMENT SET; 76 | INSERT INTO realtime_aggregations 77 | SELECT 78 | browser, 79 | status_code, 80 | TUMBLE_ROWTIME(log_time, INTERVAL '5' MINUTE) AS end_time, 81 | COUNT(*) requests 82 | FROM browsers 83 | GROUP BY 84 | browser, 85 | status_code, 86 | TUMBLE(log_time, INTERVAL '5' MINUTE); 87 | INSERT INTO offline_datawarehouse 88 | SELECT 89 | browser, 90 | status_code, 91 | DATE_FORMAT(TUMBLE_ROWTIME(log_time, INTERVAL '1' HOUR), 'yyyy-MM-dd') AS `dt`, 92 | DATE_FORMAT(TUMBLE_ROWTIME(log_time, INTERVAL '1' HOUR), 'HH') AS `hour`, 93 | COUNT(*) requests 94 | FROM browsers 95 | GROUP BY 96 | browser, 97 | status_code, 98 | TUMBLE(log_time, INTERVAL '1' HOUR); 99 | END; 100 | ``` 101 | 102 | Looking at the deployed Job Graph, we can see Flink SQL only performs the shared computation once to achieve the most cost and resource-efficient execution of our query! 
103 | 104 | ![08_jobgraph](https://user-images.githubusercontent.com/23521087/105504375-fb579380-5cc7-11eb-888e-12a1ce7d6f50.png) 105 | -------------------------------------------------------------------------------- /joins/03/03_kafka_join.md: -------------------------------------------------------------------------------- 1 | # 03 Temporal Table Join between a non-compacted and compacted Kafka Topic 2 | 3 | :bulb: In this recipe, you will see how to correctly enrich records from one Kafka topic with the corresponding records of another Kafka topic when the order of events matters. 4 | 5 | Temporal table joins take an arbitrary table (left input/probe side) and correlate each row to the corresponding row’s relevant version in a versioned table (right input/build side). 6 | Flink uses the SQL syntax of ``FOR SYSTEM_TIME AS OF`` to perform this operation. 7 | 8 | In this recipe, we want to join each transaction (`transactions`) to its correct currency rate (`currency_rates`, a versioned table) **as of the time when the transaction happened**. 9 | A similar example would be to join each order with the customer details as of the time when the order happened. 10 | This is exactly what an event-time temporal table join does. 11 | A temporal table join in Flink SQL provides correct, deterministic results in the presence of out-of-orderness and arbitrary time skew between the two tables. 12 | 13 | Both the `transactions` and `currency_rates` tables are backed by Kafka topics, but in the case of rates this topic is compacted (i.e. only the most recent messages for a given key are kept as updated rates flow in). 14 | Records in `transactions` are interpreted as inserts only, and so the table is backed by the [standard Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/kafka.html) (`connector` = `kafka`); while the records in `currency_rates` need to be interpreted as upserts based on a primary key, which requires the [Upsert Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html) (`connector` = `upsert-kafka`).
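The recipe assumes both topics already exist. If you need to create the compacted `currency_rates` topic yourself, the command could look roughly like the following (the broker address matches the scripts below; the partition and replication settings are assumptions for a local setup):

```shell script
./kafka-topics.sh --create --bootstrap-server localhost:9092 --topic currency_rates --partitions 1 --replication-factor 1 --config cleanup.policy=compact
```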
15 | 16 | ## Script 17 | 18 | ```sql 19 | CREATE TEMPORARY TABLE currency_rates ( 20 | `currency_code` STRING, 21 | `eur_rate` DECIMAL(6,4), 22 | `rate_time` TIMESTAMP(3), 23 | WATERMARK FOR `rate_time` AS rate_time - INTERVAL '15' SECONDS, 24 | PRIMARY KEY (currency_code) NOT ENFORCED 25 | ) WITH ( 26 | 'connector' = 'upsert-kafka', 27 | 'topic' = 'currency_rates', 28 | 'properties.bootstrap.servers' = 'localhost:9092', 29 | 'key.format' = 'raw', 30 | 'value.format' = 'json' 31 | ); 32 | 33 | CREATE TEMPORARY TABLE transactions ( 34 | `id` STRING, 35 | `currency_code` STRING, 36 | `total` DECIMAL(10,2), 37 | `transaction_time` TIMESTAMP(3), 38 | WATERMARK FOR `transaction_time` AS transaction_time - INTERVAL '30' SECONDS 39 | ) WITH ( 40 | 'connector' = 'kafka', 41 | 'topic' = 'transactions', 42 | 'properties.bootstrap.servers' = 'localhost:9092', 43 | 'key.format' = 'raw', 44 | 'key.fields' = 'id', 45 | 'value.format' = 'json', 46 | 'value.fields-include' = 'ALL' 47 | ); 48 | 49 | SELECT 50 | t.id, 51 | t.total * c.eur_rate AS total_eur, 52 | t.total, 53 | c.currency_code, 54 | t.transaction_time 55 | FROM transactions t 56 | JOIN currency_rates FOR SYSTEM_TIME AS OF t.transaction_time AS c 57 | ON t.currency_code = c.currency_code; 58 | ``` 59 | 60 | ## Example Output 61 | 62 | ![kafka_join](https://user-images.githubusercontent.com/11538663/102418338-d2bfe800-3ffd-11eb-995a-13fa116b538f.gif) 63 | 64 | ## Data Generators 65 | 66 |
67 | Data Generators 68 | 69 | The two topics are populated using a Flink SQL job, too. 70 | We use the [`faker` connector](https://github.com/knaufk/flink-faker) to generate rows in memory based on Java Faker expressions and write those to the respective Kafka topics. 71 | 72 | ### ``currency_rates`` Topic 73 | 74 | ### Script 75 | 76 | ```sql 77 | CREATE TEMPORARY TABLE currency_rates_faker 78 | WITH ( 79 | 'connector' = 'faker', 80 | 'fields.currency_code.expression' = '#{Currency.code}', 81 | 'fields.eur_rate.expression' = '#{Number.randomDouble ''4'',''0'',''10''}', 82 | 'fields.rate_time.expression' = '#{date.past ''15'',''SECONDS''}', 83 | 'rows-per-second' = '100' 84 | ) LIKE currency_rates (EXCLUDING OPTIONS); 85 | 86 | INSERT INTO currency_rates SELECT * FROM currency_rates_faker; 87 | ``` 88 | #### Kafka Topic 89 | 90 | ```shell script 91 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic currency_rates --property print.key=true --property key.separator=" - " 92 | HTG - {"currency_code":"HTG","eur_rate":0.0136,"rate_time":"2020-12-16 22:22:02"} 93 | BZD - {"currency_code":"BZD","eur_rate":1.6545,"rate_time":"2020-12-16 22:22:03"} 94 | BZD - {"currency_code":"BZD","eur_rate":3.616,"rate_time":"2020-12-16 22:22:10"} 95 | BHD - {"currency_code":"BHD","eur_rate":4.5308,"rate_time":"2020-12-16 22:22:05"} 96 | KHR - {"currency_code":"KHR","eur_rate":1.335,"rate_time":"2020-12-16 22:22:06"} 97 | ``` 98 | 99 | ### ``transactions`` Topic 100 | 101 | #### Script 102 | 103 | ```sql 104 | CREATE TEMPORARY TABLE transactions_faker 105 | WITH ( 106 | 'connector' = 'faker', 107 | 'fields.id.expression' = '#{Internet.UUID}', 108 | 'fields.currency_code.expression' = '#{Currency.code}', 109 | 'fields.total.expression' = '#{Number.randomDouble ''2'',''10'',''1000''}', 110 | 'fields.transaction_time.expression' = '#{date.past ''30'',''SECONDS''}', 111 | 'rows-per-second' = '100' 112 | ) LIKE transactions (EXCLUDING OPTIONS); 113 | 114 | INSERT INTO transactions SELECT * FROM transactions_faker; 115 | ``` 116 | 117 | #### Kafka Topic 118 | 119 | ```shell script 120 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic transactions --property print.key=true --property key.separator=" - " 121 | e102e91f-47b9-434e-86e1-34fb1196d91d - {"id":"e102e91f-47b9-434e-86e1-34fb1196d91d","currency_code":"SGD","total":494.07,"transaction_time":"2020-12-16 22:18:46"} 122 | bf028363-5ee4-4a5a-9068-b08392d59f0b - {"id":"bf028363-5ee4-4a5a-9068-b08392d59f0b","currency_code":"EEK","total":906.8,"transaction_time":"2020-12-16 22:18:46"} 123 | e22374b5-82da-4c6d-b4c6-f27a818a58ab - {"id":"e22374b5-82da-4c6d-b4c6-f27a818a58ab","currency_code":"GYD","total":80.66,"transaction_time":"2020-12-16 22:19:02"} 124 | 81b2ce89-26c2-4df3-b12a-8ca921902ac4 - {"id":"81b2ce89-26c2-4df3-b12a-8ca921902ac4","currency_code":"EGP","total":521.98,"transaction_time":"2020-12-16 22:18:57"} 125 | 53c4fd3f-af6e-41d3-a677-536f4c86e010 - {"id":"53c4fd3f-af6e-41d3-a677-536f4c86e010","currency_code":"UYU","total":936.26,"transaction_time":"2020-12-16 22:18:59"} 126 | ``` 127 | 128 |
129 | -------------------------------------------------------------------------------- /aggregations-and-analytics/07/07_chained_windows.md: -------------------------------------------------------------------------------- 1 | # 07 Chained (Event) Time Windows 2 | 3 | :bulb: This example will show how to efficiently aggregate time series data on two different levels of granularity. 4 | 5 | The source table (`server_logs`) is backed by the [`faker` connector](https://github.com/knaufk/flink-faker), which continuously generates rows in memory based on Java Faker expressions. 6 | 7 | Based on our `server_logs` table we would like to compute the average request size over one minute **as well as five minute (event) windows.** 8 | For this, you could run two queries, similar to the one in [Aggregating Time Series Data](../01/01_group_by_window.md). 9 | At the end of the page is the script and resulting JobGraph from this approach. 10 | 11 | In the main part, we will follow a slightly more efficient approach that chains the two aggregations: the one-minute aggregation output serves as the five-minute aggregation input. 12 | 13 | We then use a [Statements Set](../../foundations/08/08_statement_sets.md) to write out the two result tables. 14 | To keep this example self-contained, we use two tables of type `blackhole` instead of `kafka`, `filesystem`, or any other [connectors](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/). 15 | 16 | ## Script 17 | 18 | ```sql 19 | CREATE TEMPORARY TABLE server_logs ( 20 | log_time TIMESTAMP(3), 21 | client_ip STRING, 22 | client_identity STRING, 23 | userid STRING, 24 | request_line STRING, 25 | status_code STRING, 26 | size INT, 27 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS 28 | ) WITH ( 29 | 'connector' = 'faker', 30 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 31 | 'fields.client_identity.expression' = '-', 32 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 33 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 34 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 35 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 36 | ); 37 | 38 | CREATE TEMPORARY TABLE avg_request_size_1m ( 39 | window_start TIMESTAMP(3), 40 | window_end TIMESTAMP(3), 41 | avg_size BIGINT 42 | ) 43 | WITH ( 44 | 'connector' = 'blackhole' 45 | ); 46 | 47 | CREATE TEMPORARY TABLE avg_request_size_5m ( 48 | window_start TIMESTAMP(3), 49 | window_end TIMESTAMP(3), 50 | avg_size BIGINT 51 | ) 52 | WITH ( 53 | 'connector' = 'blackhole' 54 | ); 55 | 56 | CREATE TEMPORARY VIEW server_logs_window_1m AS 57 | SELECT 58 | TUMBLE_START(log_time, INTERVAL '1' MINUTE) AS window_start, 59 | TUMBLE_ROWTIME(log_time, INTERVAL '1' MINUTE) AS window_end, 60 | SUM(size) AS total_size, 61 | COUNT(*) AS num_requests 62 | FROM server_logs 63 | GROUP BY 64 | TUMBLE(log_time, INTERVAL '1' MINUTE); 65 | 66 | 67 | CREATE TEMPORARY VIEW server_logs_window_5m AS 68 | SELECT 69 | TUMBLE_START(window_end, INTERVAL '5' MINUTE) AS window_start, 70 | TUMBLE_ROWTIME(window_end, INTERVAL '5' MINUTE) AS window_end, 71 | SUM(total_size) AS total_size, 72 | SUM(num_requests) AS num_requests 73 | FROM server_logs_window_1m 74 | GROUP BY 75 | TUMBLE(window_end, INTERVAL '5' MINUTE); 76 | 77 | BEGIN STATEMENT SET; 
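-- The two INSERT statements below share their input: server_logs_window_5m is
-- defined on top of server_logs_window_1m, so the raw server_logs scan and the
-- per-minute pre-aggregation are executed only once for both sinks.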
78 | 79 | INSERT INTO avg_request_size_1m SELECT 80 | window_start, 81 | window_end, 82 | total_size/num_requests AS avg_size 83 | FROM server_logs_window_1m; 84 | 85 | INSERT INTO avg_request_size_5m SELECT 86 | window_start, 87 | window_end, 88 | total_size/num_requests AS avg_size 89 | FROM server_logs_window_5m; 90 | 91 | END; 92 | ``` 93 | 94 | ## Example Output 95 | 96 | ### JobGraph 97 | 98 | ![jobgraph_chained](https://user-images.githubusercontent.com/23521087/105503848-5e94f600-5cc7-11eb-9a7f-2944dd4e1faf.png) 99 | 100 | ### Result 1 Minute Aggregations 101 | 102 | ![results_1m](https://user-images.githubusercontent.com/23521087/105503896-6c4a7b80-5cc7-11eb-958d-05d48c9921cf.png) 103 | 104 | ## Non-Chained Windows 105 | 106 |
107 | Non-Chained Windows 108 | 109 | ### Script 110 | 111 | ```shell script 112 | CREATE TEMPORARY TABLE server_logs ( 113 | log_time TIMESTAMP(3), 114 | client_ip STRING, 115 | client_identity STRING, 116 | userid STRING, 117 | request_line STRING, 118 | status_code STRING, 119 | size INT, 120 | WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS 121 | ) WITH ( 122 | 'connector' = 'faker', 123 | 'fields.client_ip.expression' = '#{Internet.publicIpV4Address}', 124 | 'fields.client_identity.expression' = '-', 125 | 'fields.userid.expression' = '-', 126 | 'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}', 127 | 'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|cart\.html|/order\.html){1}''} #{regexify ''(HTTP/1\.1|HTTP/2|/HTTP/1\.0){1}''}', 128 | 'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}', 129 | 'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}' 130 | ); 131 | 132 | CREATE TEMPORARY TABLE avg_request_size_1m ( 133 | window_start TIMESTAMP(3), 134 | window_end TIMESTAMP(3), 135 | avg_size BIGINT 136 | ) 137 | WITH ( 138 | 'connector' = 'blackhole' 139 | ); 140 | 141 | CREATE TEMPORARY TABLE avg_request_size_5m ( 142 | window_start TIMESTAMP(3), 143 | window_end TIMESTAMP(3), 144 | avg_size BIGINT 145 | ) 146 | WITH ( 147 | 'connector' = 'blackhole' 148 | ); 149 | 150 | CREATE TEMPORARY VIEW server_logs_window_1m AS 151 | SELECT 152 | TUMBLE_START(log_time, INTERVAL '1' MINUTE) AS window_start, 153 | TUMBLE_ROWTIME(log_time, INTERVAL '1' MINUTE) AS window_end, 154 | SUM(size) AS total_size, 155 | COUNT(*) AS num_requests 156 | FROM server_logs 157 | GROUP BY 158 | TUMBLE(log_time, INTERVAL '1' MINUTE); 159 | 160 | 161 | CREATE TEMPORARY VIEW server_logs_window_5m AS 162 | SELECT 163 | TUMBLE_START(log_time, INTERVAL '5' MINUTE) AS window_start, 164 | TUMBLE_ROWTIME(log_time, INTERVAL '5' MINUTE) AS window_end, 165 | SUM(size) AS total_size, 166 | COUNT(*) AS num_requests 167 | FROM server_logs 168 | GROUP BY 169 | TUMBLE(log_time, INTERVAL '5' MINUTE); 170 | 171 | BEGIN STATEMENT SET; 172 | 173 | INSERT INTO avg_request_size_1m SELECT 174 | window_start, 175 | window_end, 176 | total_size/num_requests AS avg_size 177 | FROM server_logs_window_1m; 178 | 179 | INSERT INTO avg_request_size_5m SELECT 180 | window_start, 181 | window_end, 182 | total_size/num_requests AS avg_size 183 | FROM server_logs_window_5m; 184 | 185 | END; 186 | ``` 187 | 188 | ### Example Output 189 | 190 | #### JobGraph 191 | 192 | ![jobgraph_non_chained](https://user-images.githubusercontent.com/23521087/105503946-79676a80-5cc7-11eb-9e8e-15d39482fee9.png) 193 | 194 |
195 | -------------------------------------------------------------------------------- /joins/05/05_star_schema.md: -------------------------------------------------------------------------------- 1 | # 05 Real Time Star Schema Denormalization (N-Way Join) 2 | 3 | :bulb: In this recipe, we will denormalize a simple star schema with an n-way temporal table join. 4 | 5 | [Star schemas](https://en.wikipedia.org/wiki/Star_schema) are a popular way of normalizing data within a data warehouse. 6 | At the center of a star schema is a **fact table** whose rows contain metrics, measurements, and other facts about the world. 7 | Surrounding the fact table are one or more **dimension tables** which have metadata useful for enriching facts when computing queries. 8 | You are running a small data warehouse for a railroad company which consists of a fact table (`train_activities`) and three dimension tables (`stations`, `booking_channels`, and `passengers`). 9 | All inserts to the fact table, and all updates to the dimension tables, are mirrored to Apache Kafka. 10 | Records in the fact table are interpreted as inserts only, and so the table is backed by the [standard Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/kafka.html) (`connector` = `kafka`). 11 | In contrast, the records in the dimensional tables are upserts based on a primary key, which requires the [Upsert Kafka connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html) (`connector` = `upsert-kafka`). 12 | 13 | With Flink SQL you can now easily join all dimensions to the fact table using a 5-way temporal table join. 14 | Temporal table joins take an arbitrary table (left input/probe side) and correlate each row to the corresponding row’s relevant version in a versioned table (right input/build side). 15 | Flink uses the SQL syntax of ``FOR SYSTEM_TIME AS OF`` to perform this operation. 16 | Using a temporal table join leads to consistent, reproducible results when joining a fact table with more slowly changing dimensional tables. 17 | Every event (row in the fact table) is joined to its corresponding value of each dimension based on when the event occurred in the real world.
18 | 19 | ## Script 20 | 21 | ```sql 22 | CREATE TEMPORARY TABLE passengers ( 23 | passenger_key STRING, 24 | first_name STRING, 25 | last_name STRING, 26 | update_time TIMESTAMP(3), 27 | WATERMARK FOR update_time AS update_time - INTERVAL '10' SECONDS, 28 | PRIMARY KEY (passenger_key) NOT ENFORCED 29 | ) WITH ( 30 | 'connector' = 'upsert-kafka', 31 | 'topic' = 'passengers', 32 | 'properties.bootstrap.servers' = 'localhost:9092', 33 | 'key.format' = 'raw', 34 | 'value.format' = 'json' 35 | ); 36 | 37 | CREATE TEMPORARY TABLE stations ( 38 | station_key STRING, 39 | update_time TIMESTAMP(3), 40 | city STRING, 41 | WATERMARK FOR update_time AS update_time - INTERVAL '10' SECONDS, 42 | PRIMARY KEY (station_key) NOT ENFORCED 43 | ) WITH ( 44 | 'connector' = 'upsert-kafka', 45 | 'topic' = 'stations', 46 | 'properties.bootstrap.servers' = 'localhost:9092', 47 | 'key.format' = 'raw', 48 | 'value.format' = 'json' 49 | ); 50 | 51 | CREATE TEMPORARY TABLE booking_channels ( 52 | booking_channel_key STRING, 53 | update_time TIMESTAMP(3), 54 | channel STRING, 55 | WATERMARK FOR update_time AS update_time - INTERVAL '10' SECONDS, 56 | PRIMARY KEY (booking_channel_key) NOT ENFORCED 57 | ) WITH ( 58 | 'connector' = 'upsert-kafka', 59 | 'topic' = 'booking_channels', 60 | 'properties.bootstrap.servers' = 'localhost:9092', 61 | 'key.format' = 'raw', 62 | 'value.format' = 'json' 63 | ); 64 | 65 | CREATE TEMPORARY TABLE train_activities ( 66 | scheduled_departure_time TIMESTAMP(3), 67 | actual_departure_date TIMESTAMP(3), 68 | passenger_key STRING, 69 | origin_station_key STRING, 70 | destination_station_key STRING, 71 | booking_channel_key STRING, 72 | WATERMARK FOR actual_departure_date AS actual_departure_date - INTERVAL '10' SECONDS 73 | ) WITH ( 74 | 'connector' = 'kafka', 75 | 'topic' = 'train_activities', 76 | 'properties.bootstrap.servers' = 'localhost:9092', 77 | 'value.format' = 'json', 78 | 'value.fields-include' = 'ALL' 79 | ); 80 | 81 | SELECT 82 | t.actual_departure_date, 83 | p.first_name, 84 | p.last_name, 85 | b.channel, 86 | os.city AS origin_station, 87 | ds.city AS destination_station 88 | FROM train_activities t 89 | LEFT JOIN booking_channels FOR SYSTEM_TIME AS OF t.actual_departure_date AS b 90 | ON t.booking_channel_key = b.booking_channel_key 91 | LEFT JOIN passengers FOR SYSTEM_TIME AS OF t.actual_departure_date AS p 92 | ON t.passenger_key = p.passenger_key 93 | LEFT JOIN stations FOR SYSTEM_TIME AS OF t.actual_departure_date AS os 94 | ON t.origin_station_key = os.station_key 95 | LEFT JOIN stations FOR SYSTEM_TIME AS OF t.actual_departure_date AS ds 96 | ON t.destination_station_key = ds.station_key; 97 | ``` 98 | 99 | ## Example Output 100 | 101 | ### SQL Client 102 | 103 | ![05_output](https://user-images.githubusercontent.com/23521087/105504672-54272c00-5cc8-11eb-88da-901bb0006da1.png) 104 | 105 | ### JobGraph 106 | 107 | ![05_jobgraph](https://user-images.githubusercontent.com/23521087/105504615-440f4c80-5cc8-11eb-94f2-d07d0315dec5.png) 108 | 109 | ## Data Generators 110 | 111 |
112 | Data Generators 113 | 114 | The four topics are populated with Flink SQL jobs, too. 115 | We use the [`faker` connector](https://github.com/knaufk/flink-faker) to generate rows in memory based on Java Faker expressions and write those to the respective Kafka topics. 116 | 117 | ### ``train_activities`` Topic 118 | 119 | ### Script 120 | 121 | ```sql 122 | CREATE TEMPORARY TABLE train_activities_faker 123 | WITH ( 124 | 'connector' = 'faker', 125 | 'fields.scheduled_departure_time.expression' = '#{date.past ''10'',''0'',''SECONDS''}', 126 | 'fields.actual_departure_date.expression' = '#{date.past ''10'',''5'',''SECONDS''}', 127 | 'fields.passenger_key.expression' = '#{number.numberBetween ''0'',''10000000''}', 128 | 'fields.origin_station_key.expression' = '#{number.numberBetween ''0'',''1000''}', 129 | 'fields.destination_station_key.expression' = '#{number.numberBetween ''0'',''1000''}', 130 | 'fields.booking_channel_key.expression' = '#{number.numberBetween ''0'',''7''}', 131 | 'rows-per-second' = '1000' 132 | ) LIKE train_activities (EXCLUDING OPTIONS); 133 | 134 | INSERT INTO train_activities SELECT * FROM train_activities_faker; 135 | ``` 136 | #### Kafka Topic 137 | 138 | ```shell script 139 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic train_actitivies --property print.key=true --property key.separator=" - " 140 | null - {"scheduled_departure_time":"2020-12-19 13:52:37","actual_departure_date":"2020-12-19 13:52:16","passenger_key":7014937,"origin_station_key":577,"destination_station_key":862,"booking_channel_key":2} 141 | null - {"scheduled_departure_time":"2020-12-19 13:52:38","actual_departure_date":"2020-12-19 13:52:23","passenger_key":2244807,"origin_station_key":735,"destination_station_key":739,"booking_channel_key":2} 142 | null - {"scheduled_departure_time":"2020-12-19 13:52:46","actual_departure_date":"2020-12-19 13:52:18","passenger_key":2605313,"origin_station_key":216,"destination_station_key":453,"booking_channel_key":3} 143 | null - {"scheduled_departure_time":"2020-12-19 13:53:13","actual_departure_date":"2020-12-19 13:52:19","passenger_key":7111654,"origin_station_key":234,"destination_station_key":833,"booking_channel_key":5} 144 | null - {"scheduled_departure_time":"2020-12-19 13:52:22","actual_departure_date":"2020-12-19 13:52:17","passenger_key":2847474,"origin_station_key":763,"destination_station_key":206,"booking_channel_key":3} 145 | ``` 146 | 147 | ### ``passengers`` Topic 148 | 149 | #### Script 150 | 151 | ```sql 152 | CREATE TEMPORARY TABLE passengers_faker 153 | WITH ( 154 | 'connector' = 'faker', 155 | 'fields.passenger_key.expression' = '#{number.numberBetween ''0'',''10000000''}', 156 | 'fields.update_time.expression' = '#{date.past ''10'',''5'',''SECONDS''}', 157 | 'fields.first_name.expression' = '#{Name.firstName}', 158 | 'fields.last_name.expression' = '#{Name.lastName}', 159 | 'rows-per-second' = '1000' 160 | ) LIKE passengers (EXCLUDING OPTIONS); 161 | 162 | INSERT INTO passengers SELECT * FROM passengers_faker; 163 | ``` 164 | 165 | #### Kafka Topic 166 | 167 | ```shell script 168 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic passengers --property print.key=true --property key.separator=" - " 169 | 749049 - {"passenger_key":"749049","first_name":"Booker","last_name":"Hackett","update_time":"2020-12-19 14:02:32"} 170 | 7065702 - {"passenger_key":"7065702","first_name":"Jeramy","last_name":"Breitenberg","update_time":"2020-12-19 14:02:38"} 171 | 3690329 - 
{"passenger_key":"3690329","first_name":"Quiana","last_name":"Macejkovic","update_time":"2020-12-19 14:02:27"} 172 | 1212728 - {"passenger_key":"1212728","first_name":"Lawerence","last_name":"Simonis","update_time":"2020-12-19 14:02:27"} 173 | 6993699 - {"passenger_key":"6993699","first_name":"Ardelle","last_name":"Frami","update_time":"2020-12-19 14:02:19"} 174 | ``` 175 | 176 | ### ``stations`` Topic 177 | 178 | #### Script 179 | 180 | ```sql 181 | CREATE TEMPORARY TABLE stations_faker 182 | WITH ( 183 | 'connector' = 'faker', 184 | 'fields.station_key.expression' = '#{number.numberBetween ''0'',''1000''}', 185 | 'fields.city.expression' = '#{Address.city}', 186 | 'fields.update_time.expression' = '#{date.past ''10'',''5'',''SECONDS''}', 187 | 'rows-per-second' = '100' 188 | ) LIKE stations (EXCLUDING OPTIONS); 189 | 190 | INSERT INTO stations SELECT * FROM stations_faker; 191 | ``` 192 | 193 | #### Kafka Topic 194 | 195 | ```shell script 196 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic stations --property print.key=true --property key.separator=" - " 197 | 80 - {"station_key":"80","update_time":"2020-12-19 13:59:20","city":"Harlandport"} 198 | 33 - {"station_key":"33","update_time":"2020-12-19 13:59:12","city":"North Georgine"} 199 | 369 - {"station_key":"369","update_time":"2020-12-19 13:59:12","city":"Tillmanhaven"} 200 | 580 - {"station_key":"580","update_time":"2020-12-19 13:59:12","city":"West Marianabury"} 201 | 616 - {"station_key":"616","update_time":"2020-12-19 13:59:09","city":"West Sandytown"} 202 | ``` 203 | 204 | ### ``booking_channels`` Topic 205 | 206 | #### Script 207 | 208 | ```sql 209 | CREATE TEMPORARY TABLE booking_channels_faker 210 | WITH ( 211 | 'connector' = 'faker', 212 | 'fields.booking_channel_key.expression' = '#{number.numberBetween ''0'',''7''}', 213 | 'fields.channel.expression' = '#{regexify ''(bahn\.de|station|retailer|app|lidl|hotline|joyn){1}''}', 214 | 'fields.update_time.expression' = '#{date.past ''10'',''5'',''SECONDS''}', 215 | 'rows-per-second' = '100' 216 | ) LIKE booking_channels (EXCLUDING OPTIONS); 217 | 218 | INSERT INTO booking_channels SELECT * FROM booking_channels_faker; 219 | ``` 220 | 221 | #### Kafka Topic 222 | 223 | ```shell script 224 | ➜ bin ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic booking_channels --property print.key=true --property key.separator=" - " 225 | 1 - {"booking_channel_key":"1","update_time":"2020-12-19 13:57:05","channel":"joyn"} 226 | 0 - {"booking_channel_key":"0","update_time":"2020-12-19 13:57:17","channel":"station"} 227 | 4 - {"booking_channel_key":"4","update_time":"2020-12-19 13:57:15","channel":"joyn"} 228 | 2 - {"booking_channel_key":"2","update_time":"2020-12-19 13:57:02","channel":"app"} 229 | 1 - {"booking_channel_key":"1","update_time":"2020-12-19 13:57:06","channel":"retailer"} 230 | 231 | ``` 232 | 233 |
234 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 
179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2020 Ververica GmbH 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. --------------------------------------------------------------------------------