├── .gitignore
├── README.md
├── cumulative_table_dag.py
├── images
│   └── cumulative_table_design.jpg
├── queries
│   ├── active_users_cumulated_populate.sql
│   └── active_users_daily_populate.sql
└── tables
    ├── active_users_cumulated.sql
    ├── active_users_daily.sql
    └── events.sql
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Cumulative Table Design

Cumulative table design is an extremely powerful data engineering tool that all data engineers should know.

This design produces tables that can provide efficient analyses over long (up to **thousands** of days) time frames.

Here's a diagram of the high-level pipeline design for this pattern:

![Cumulative Table diagram](images/cumulative_table_design.jpg)

We initially build our **daily** metrics table at the grain of whatever our entity is. This data is derived from whatever event sources we have upstream.

After we have our daily metrics, we `FULL OUTER JOIN` yesterday's cumulative table with today's daily data and build our metric arrays for each user. This allows us to bring the new history in without having to scan all of it. **(a big performance boost)**

These metric arrays allow us to easily answer queries about the history of all users, using things like `ARRAY_SUM` to calculate whatever metric we want on whatever time frame the array allows.

> The longer the time frame of your analysis, the more critical this pattern becomes!

## Example: User activity and engagement, cumulated

> All query syntax uses Presto/Trino syntax and functions. This example would need to be modified for other SQL variants!
>
> We'll be using the dates:
> - **2022-01-01** as **today**; in Airflow templating terms this is `{{ ds }}`
> - **2021-12-31** as **yesterday**; in Airflow templating terms this is `{{ yesterday_ds }}`

In this example, we'll look at how to use this design to calculate daily, weekly, and monthly active users, as well as each user's likes, comments, and shares.

Our source table in this case is **[events](tables/events.sql)**.
- **A user is active on any given day if they generate an event for that day.**
- The table has an `event_type` column which is `like`, `comment`, `share`, or `view`

It's tempting to think the solution to this is running a pipeline something like:
```
SELECT
    COUNT(DISTINCT user_id) as num_monthly_active_users,
    COUNT(CASE WHEN event_type = 'like' THEN 1 END) as num_likes_30d,
    COUNT(CASE WHEN event_type = 'comment' THEN 1 END) as num_comments_30d,
    COUNT(CASE WHEN event_type = 'share' THEN 1 END) as num_shares_30d,
    ...
FROM events
WHERE event_date BETWEEN DATE_ADD('day', -30, DATE '2022-01-01') AND DATE '2022-01-01'
```

The problem with this is that we're scanning 30 days of event data every single day to produce these numbers: a simple, yet pretty wasteful, pipeline.
There should be a way where we only have to scan the event data once and combine it with the results from the previous 29 days, right? Can we build a data structure that lets a data scientist query our data and easily know the number of actions a user took in the last N days?

This design is pretty simple, with only 3 steps:

### The Daily table step
- In this step we aggregate just the events of today to see who is daily active.
The table schema is [here](tables/active_users_daily.sql)
- The query for this step is pretty simple and straightforward; it's [here](queries/active_users_daily_populate.sql)
  - `GROUP BY user_id` and count a user as daily active if they have any events
  - We add a few `COUNT(CASE WHEN event_type = 'like' THEN 1 END)` statements to figure out the number of daily likes, comments, and shares as well
### The Cumulation step
- The table schema for this step is [here](tables/active_users_cumulated.sql)
- The query for this step is much more complex. It's [here](queries/active_users_cumulated_populate.sql)
- In this step we take **today's** data from the daily table and **yesterday's** data from the cumulated table
- We `FULL OUTER JOIN` these two data sets on `today.user_id = yesterday.user_id`
  - If a user is brand new, they won't be in yesterday's data. Also, if a user wasn't active today, they aren't in today's data
  - So we need `COALESCE(today.user_id, yesterday.user_id) as user_id` to keep track of all the users
- Next we want to build the `activity_array` column. We only want `activity_array` to store the data of the last 30 days
  - So we check whether `CARDINALITY(activity_array) < 30` to decide if we can just add today's value to the front of the array, or if we need to slice an element off the end of the array before adding today's value to the front
  - We need `COALESCE(t.is_active_today, 0)` to put zero values into the array when a user isn't active
- We follow a very similar pattern to how we built `activity_array` for likes, comments, and shares as well!
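The prepend-and-cap logic described above can be sketched in plain Python (a minimal illustration with a hypothetical helper name; the actual pipeline does this in the Presto query):

```python
def roll_window(yesterday_array, today_value, max_days=30):
    """Prepend today's value (newest first) and cap the history at max_days."""
    # brand-new user: no history yet, start a fresh array holding only today
    if yesterday_array is None:
        return [today_value if today_value is not None else 0]
    # COALESCE(t.is_active_today, 0): user inactive today contributes a zero
    value = today_value if today_value is not None else 0
    # room left in the window: just prepend today's value
    if len(yesterday_array) < max_days:
        return [value] + yesterday_array
    # window is full: drop the oldest (last) element before prepending
    return [value] + yesterday_array[: max_days - 1]
```

This mirrors the `CARDINALITY(...) < 30` branch of the cumulation query: the array always stays at most 30 elements long, with today at index 1 (SQL) / index 0 (Python).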
- After we build all our metric arrays, calculating weekly and monthly activity is pretty straightforward
  - `CASE WHEN ARRAY_SUM(activity_array) > 0 THEN 1 ELSE 0 END` gives us monthly actives, since we limit the array size to 30
  - `CASE WHEN ARRAY_SUM(SLICE(activity_array, 1, 7)) > 0 THEN 1 ELSE 0 END` gives us weekly actives, since we only check the first 7 elements of the array *(i.e. the last 7 days)*
- Summing the number of likes, comments, and shares in the past 7 days is also easy
  - `ARRAY_SUM(SLICE(like_array, 1, 7)) as num_likes_7d` gives us the number of likes this user did in the past 7 days
  - `ARRAY_SUM(like_array) as num_likes_30d` gives us the number of likes this user did in the past 30 days, since the array is fixed to that size
### The DAG step
- The example DAG can be found [here](cumulative_table_dag.py)
- Key things to remember for the DAG are:
  - Cumulative DAGs should always be `depends_on_past: True`
  - The Daily aggregation step needs to be upstream of the cumulation step
--------------------------------------------------------------------------------
/cumulative_table_dag.py:
--------------------------------------------------------------------------------
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.presto_operator import PrestoOperator

default_args = {
    'owner': 'Data with Zach',
    # Very important: cumulative DAGs must set `depends_on_past = True`
    'depends_on_past': True,
    'email': ['airflow@example.com'],
}
with DAG(
    'cumulative_table_example_dag',
    default_args=default_args,
    description='A simple cumulative table example',
    schedule_interval='@daily',
    # start_date must be a datetime object, not a string
    start_date=datetime(2022, 1, 1),
    catchup=False,
    tags=['data with zach', 'cumulative table example'],
) as dag:
    daily_task = PrestoOperator(
        task_id='compute_active_user_daily',
        sql='queries/active_users_daily_populate.sql'
    )

    cumulative_task = PrestoOperator(
        task_id='compute_active_users_cumulated',
        sql='queries/active_users_cumulated_populate.sql'
    )

    # the daily task has to run before the cumulative task
    daily_task >> cumulative_task
--------------------------------------------------------------------------------
/images/cumulative_table_design.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataExpert-io/cumulative-table-design/7177daa195a4c8b4c08d123d9910f641847bcb94/images/cumulative_table_design.jpg
--------------------------------------------------------------------------------
/queries/active_users_cumulated_populate.sql:
--------------------------------------------------------------------------------
INSERT INTO active_users_cumulated

-- First read in yesterday from the cumulated table
WITH yesterday AS (
    SELECT * FROM active_users_cumulated
    WHERE snapshot_date = DATE '2021-12-31'
),
-- Read in the daily active user numbers for just today from the daily table
today AS (
    SELECT * FROM active_users_daily
    WHERE snapshot_date = DATE '2022-01-01'
),

-- We FULL OUTER JOIN today and yesterday. We need some COALESCEs because
-- activity_array may not exist yet for a given user (i.e. they are brand new), and
-- is_active_today may be null on days when a user didn't generate an event
combined AS (
    SELECT
        -- We need to COALESCE here since t.user_id may be a new user who never existed in the cumulative table before
        COALESCE(y.user_id, t.user_id) AS user_id,
        -- if y.activity_array is null (indicating a brand-new user), we coalesce with an array of size 1;
        -- this array just holds the value for today, since that's the only history we have
        COALESCE(
            IF(CARDINALITY(y.activity_array) < 30,
                ARRAY[COALESCE(t.is_active_today, 0)] || y.activity_array,
                -- window is full: keep the newest 29 days, then prepend today
                ARRAY[COALESCE(t.is_active_today, 0)] || SLICE(y.activity_array, 1, 29)
            ),
            ARRAY[t.is_active_today]
        ) AS activity_array,
        COALESCE(
            IF(CARDINALITY(y.like_array) < 30,
                ARRAY[COALESCE(t.num_likes, 0)] || y.like_array,
                ARRAY[COALESCE(t.num_likes, 0)] || SLICE(y.like_array, 1, 29)
            ),
            ARRAY[t.num_likes]
        ) AS like_array,
        COALESCE(
            IF(CARDINALITY(y.comment_array) < 30,
                ARRAY[COALESCE(t.num_comments, 0)] || y.comment_array,
                ARRAY[COALESCE(t.num_comments, 0)] || SLICE(y.comment_array, 1, 29)
            ),
            ARRAY[t.num_comments]
        ) AS comment_array,
        COALESCE(
            IF(CARDINALITY(y.share_array) < 30,
                ARRAY[COALESCE(t.num_shares, 0)] || y.share_array,
                ARRAY[COALESCE(t.num_shares, 0)] || SLICE(y.share_array, 1, 29)
            ),
            ARRAY[t.num_shares]
        ) AS share_array,
        -- t.snapshot_date is null when the user wasn't active today, so fall back to yesterday's date + 1 day
        COALESCE(t.snapshot_date, DATE_ADD('day', 1, y.snapshot_date)) AS snapshot_date
    FROM yesterday y
    FULL OUTER JOIN today t
        ON y.user_id = t.user_id
)

SELECT
    user_id,
    activity_array[1] AS is_daily_active,
    -- if any of the first 7 array values are non-zero, the user was active in the last week
    CASE WHEN ARRAY_SUM(SLICE(activity_array, 1, 7)) > 0 THEN 1 ELSE 0 END AS is_weekly_active,
    -- if any of the array values are 1, the user was active in the last month
    CASE WHEN ARRAY_SUM(activity_array) > 0 THEN 1 ELSE 0 END AS is_monthly_active,
    activity_array,
    like_array,
    share_array,
    comment_array,
    ARRAY_SUM(SLICE(like_array, 1, 7)) AS num_likes_7d,
    ARRAY_SUM(SLICE(comment_array, 1, 7)) AS num_comments_7d,
    ARRAY_SUM(SLICE(share_array, 1, 7)) AS num_shares_7d,
    ARRAY_SUM(like_array) AS num_likes_30d,
    ARRAY_SUM(comment_array) AS num_comments_30d,
    ARRAY_SUM(share_array) AS num_shares_30d,
    snapshot_date
FROM combined
--------------------------------------------------------------------------------
/queries/active_users_daily_populate.sql:
--------------------------------------------------------------------------------
INSERT INTO active_users_daily

SELECT
    user_id,
    -- If the user_id has at least 1 event, they are daily active
    IF(COUNT(user_id) > 0, 1, 0) AS is_active_today,
    COUNT(CASE WHEN event_type = 'like' THEN 1 END) AS num_likes,
    COUNT(CASE WHEN event_type = 'comment' THEN 1 END) AS num_comments,
    COUNT(CASE WHEN event_type = 'share' THEN 1 END) AS num_shares,
    CAST('2022-01-01' AS DATE) AS snapshot_date
FROM events
WHERE event_date = DATE '2022-01-01'
GROUP BY user_id
--------------------------------------------------------------------------------
/tables/active_users_cumulated.sql:
--------------------------------------------------------------------------------
CREATE TABLE active_users_cumulated (
    user_id INTEGER,
    is_daily_active INTEGER,
    is_weekly_active INTEGER,
    is_monthly_active INTEGER,
    activity_array ARRAY(INTEGER),
    like_array ARRAY(INTEGER),
    share_array ARRAY(INTEGER),
    comment_array ARRAY(INTEGER),
    num_likes_7d INTEGER,
    num_comments_7d INTEGER,
    num_shares_7d INTEGER,
    num_likes_30d INTEGER,
    num_comments_30d INTEGER,
    num_shares_30d INTEGER,
    snapshot_date DATE
)
--------------------------------------------------------------------------------
/tables/active_users_daily.sql:
--------------------------------------------------------------------------------
CREATE TABLE active_users_daily (
    user_id INTEGER,
    is_active_today INTEGER,
    num_likes INTEGER,
    num_comments INTEGER,
    num_shares INTEGER,
    snapshot_date DATE
)
--------------------------------------------------------------------------------
/tables/events.sql:
--------------------------------------------------------------------------------
CREATE TABLE events (
    user_id INTEGER,
    event_type VARCHAR,
    event_timestamp TIMESTAMP,
    event_date DATE
)
--------------------------------------------------------------------------------
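For intuition, the cumulation step as a whole can be mimicked in plain Python over in-memory rows (a sketch with made-up sample data and hypothetical function names; the real pipeline runs the Presto queries against the tables above):

```python
WINDOW = 30  # days of history kept in each metric array

def cumulate(yesterday_rows, today_rows):
    """One run of the cumulation step.

    yesterday_rows: {user_id: activity_array} from the cumulated table
    today_rows:     {user_id: is_active_today} from the daily table
    """
    out = {}
    # the union of user_ids plays the role of the FULL OUTER JOIN
    for user_id in set(yesterday_rows) | set(today_rows):
        history = yesterday_rows.get(user_id, [])   # [] for brand-new users
        today = today_rows.get(user_id, 0)          # COALESCE(t.is_active_today, 0)
        out[user_id] = [today] + history[:WINDOW - 1]  # prepend today, cap at 30
    return out

def derive_flags(activity_array):
    """Derive the activity flags the final SELECT computes from the array."""
    return {
        'is_daily_active': activity_array[0],
        'is_weekly_active': 1 if sum(activity_array[:7]) > 0 else 0,
        'is_monthly_active': 1 if sum(activity_array) > 0 else 0,
    }
```

Each day's run only touches yesterday's arrays plus one day of daily data, which is exactly why the pattern avoids the 30-day scan of the naive query.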