├── README.md ├── dbt_project.yml ├── docker ├── PostgresSink.json ├── catalog │ ├── datalake.properties │ ├── oltp.properties │ └── website.properties ├── docker-compose.yaml ├── install_connectors.sh ├── listen_eventsSink.json ├── mongo │ ├── init.sh │ └── keyfile.pem ├── page_view_eventsSink.json └── postConnect.sh ├── models ├── docs.md ├── mart_hourly_stream_evolution.sql ├── mart_top_songs_artists.sql ├── mart_total_unique_users.sql ├── mart_user_activity_per_country.sql ├── mart_user_activity_per_state.sql ├── mart_user_level_by_gender.sql ├── overview.md ├── schema.yml ├── source.yml ├── source │ ├── schema.yml │ ├── src_auth_events.sql │ ├── src_listen_events.sql │ └── src_page_view_events.sql └── stage │ ├── stg_streams_hourly.sql │ ├── stg_top_songs_artists.sql │ ├── stg_unique_users.sql │ ├── stg_user_activity_per_country.sql │ ├── stg_user_activity_per_location.sql │ └── stg_user_level_by_gender.sql ├── packages.yml └── profiles.yml /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Iceberg + Dbt + Trino + Hive: Modern, Open-Source Data Stack 3 | 4 | ![](https://cdn-images-1.medium.com/max/3086/1*Hg-ZHsBd_yJ54legC570oA.png) 5 | 6 | 7 | The repository showcases a demo of integrating Iceberg, Dbt, Trino, and Hive, forming a modern and open-source data stack suitable for various analytical needs. This guide provides a structured approach to setting up and utilizing this stack effectively, ensuring a seamless workflow from data ingestion to analysis. 8 | 9 | ## Run the Local Trino Server 10 | 11 | Before diving into the specifics of data transformation and analysis with Dbt, it's essential to have the Trino server up and running. Trino serves as a distributed SQL query engine that allows you to query your data across different sources seamlessly. Here's how to start the Trino server locally using Docker: 12 | 13 | 14 | ``` 15 | cd docker 16 | docker-compose up --build -d 17 | ``` 18 | 19 | This command navigates to the Docker directory within your project and initiates the Docker Compose process, which builds and starts the containers defined in your `docker-compose.yml` file in detached mode. 20 | 21 | 22 | ## Integration with Kafka for Data Streaming 23 | 24 | To simulate real-time data streaming in a music event context, follow the instructions from the GitHub repository [Stefen-Taime/eventmusic](https://github.com/Stefen-Taime/eventmusic.git). This repository contains scripts and configurations necessary for producing messages to Kafka, which acts as the backbone for real-time data handling in this stack. 25 | 26 | ### Preparing Kafka Connectors 27 | 28 | After setting up the Docker containers and running the local Trino server, proceed with the Kafka connectors setup: 29 | 30 | 31 | 1. **Set Permissions for `install_connectors.sh`**: This script installs the necessary Kafka connectors for integrating with PostgreSQL and MongoDB. Adjust the file permissions to make it executable. 32 | ``` 33 | chmod +x install_connectors.sh 34 | ``` 35 | 36 | 2. **Execute `install_connectors.sh`**: Run the script to install the Kafka connectors. 37 | ``` 38 | ./install_connectors.sh 39 | ``` 40 | 41 | ### Configuring Connectors and Producing Data 42 | 43 | With the connectors installed: 44 | 45 | 1. **Set Permissions for `postConnect.sh`**: This script configures the connectors. Modify the permissions to ensure executability. 46 | ``` 47 | chmod +x postConnect.sh 48 | ``` 49 | 50 | 2. 
**Execute `postConnect.sh`**: Run the script to configure the connectors and initiate data streaming. 51 | ``` 52 | ./postConnect.sh 53 | ``` 54 | 55 | ## Run the Dbt Commands 56 | 57 | With the Trino server running, the next step is to execute the necessary Dbt commands to manage your data transformations: 58 | 59 | ``` 60 | dbt deps 61 | dbt run 62 | ``` 63 | 64 | `dbt deps` fetches the project's dependencies, ensuring that all required packages and modules are available. `dbt run` then executes the transformations defined in your dbt project, building your data models according to the specifications in your dbt files. 65 | 66 | ## Get Superset 67 | 68 | To get started with Apache Superset, follow these steps to pull and run the Superset Docker image. Ensure you have Docker installed and running on your machine. 69 | 70 | 1. **Set Superset Version**: 71 | Set the `SUPERSET_VERSION` environment variable with the latest Superset version. Check the [Apache Superset releases](https://github.com/apache/superset/releases) for the latest version. 72 | ``` 73 | export SUPERSET_VERSION= 74 | ``` 75 | 76 | 2. **Pull Superset Image**: 77 | Pull the Superset image from Docker Hub. 78 | ``` 79 | docker pull apache/superset:$SUPERSET_VERSION 80 | ``` 81 | 82 | 3. **Start Superset**: 83 | Note that Superset requires a user-specified value of `SECRET_KEY` or `SUPERSET_SECRET_KEY` as an environment variable to start. 84 | ``` 85 | docker run -d -p 3000:8088 \ 86 | -e "SUPERSET_SECRET_KEY=$(openssl rand -base64 42)" \ 87 | -e "TALISMAN_ENABLED=False" \ 88 | --name superset apache/superset:$SUPERSET_VERSION 89 | ``` 90 | 91 | 4. **Create an Account**: 92 | Create an admin account in Superset. 93 | ``` 94 | docker exec -it superset superset fab create-admin \ 95 | --username admin \ 96 | --firstname Admin \ 97 | --lastname Admin \ 98 | --email admin@localhost \ 99 | --password admin 100 | ``` 101 | 102 | 5. **Configure Superset**: 103 | Configure the database and load example data. 104 | ``` 105 | docker exec -it superset superset db upgrade && \ 106 | docker exec -it superset superset load_examples && \ 107 | docker exec -it superset superset init 108 | ``` 109 | 110 | 6. 
**Start Using Superset**:
111 |    After configuration, access Superset at `http://localhost:3000` (the host port mapped in the `docker run` command above) with the admin credentials created in step 4:
112 |    - Username: `admin`
113 |    - Password: `admin`
114 | 
--------------------------------------------------------------------------------
/dbt_project.yml:
--------------------------------------------------------------------------------
 1 | 
 2 | name: 'my_dbt_trino_project'
 3 | version: '1.0.0'
 4 | config-version: 2
 5 | 
 6 | profile: 'my_dbt_trino_project'
 7 | 
 8 | 
 9 | model-paths: ["models"]
10 | seed-paths: ["seeds"]
11 | test-paths: ["tests"]
12 | analysis-paths: ["analysis"]
13 | macro-paths: ["macros"]
14 | 
15 | target-path: "target"
16 | clean-targets:
17 |   - "target"
18 |   - "dbt_packages"
19 |   - "logs"
20 | 
21 | 
22 | require-dbt-version: [">=1.0.0", "<2.0.0"]
23 | 
24 | 
25 | models:
26 |   my_dbt_trino_project:
27 |     +materialized: table
28 |     source:
29 |       schema: source
30 |     stage:
31 |       +materialized: table
32 |       schema: stage
33 | 
34 | 
35 | dispatch:
36 |   - macro_namespace: dbt_utils
37 |     search_order: ['trino_utils', 'dbt_utils']
38 | 
--------------------------------------------------------------------------------
/docker/PostgresSink.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "name": "jdbc-sink",
 3 |   "config": {
 4 |     "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
 5 |     "tasks.max": "1",
 6 |     "topics": "auth_events",
 7 |     "connection.url": "jdbc:postgresql://oltp:5432/postgres",
 8 |     "connection.user": "postgres",
 9 |     "connection.password": "postgres",
10 |     "auto.create": "true"
11 |   }
12 | }
13 | 
--------------------------------------------------------------------------------
/docker/catalog/datalake.properties:
--------------------------------------------------------------------------------
 1 | connector.name=iceberg
 2 | hive.metastore.uri=thrift://hive-metastore:9083
 3 | hive.s3.endpoint=http://minio:9000
 4 | hive.s3.path-style-access=true
 5 | hive.s3.aws-access-key=minio
 6 | hive.s3.aws-secret-key=minio123
 7 | hive.metastore-cache-ttl=0s
 8 | hive.metastore-refresh-interval=5s
 9 | hive.metastore-timeout=10s
10 | 
--------------------------------------------------------------------------------
/docker/catalog/oltp.properties:
--------------------------------------------------------------------------------
 1 | connector.name=postgresql
 2 | connection-url=jdbc:postgresql://oltp:5432/postgres
 3 | connection-user=postgres
 4 | connection-password=postgres
 5 | 
--------------------------------------------------------------------------------
/docker/catalog/website.properties:
--------------------------------------------------------------------------------
 1 | connector.name=mongodb
 2 | mongodb.connection-url=mongodb://debezium:dbz@mongo:27017/demo?authSource=admin
 3 | 
--------------------------------------------------------------------------------
/docker/docker-compose.yaml:
--------------------------------------------------------------------------------
 1 | version: "3.9"
 2 | services:
 3 |   oltp:
 4 |     image: postgres:latest
 5 |     container_name: oltp
 6 |     environment:
 7 |       - POSTGRES_DB=postgres
 8 |       - POSTGRES_USER=postgres
 9 |       - POSTGRES_PASSWORD=postgres
10 |     ports:
11 |       - "15432:5432"
12 |     networks:
13 |       kafka:
14 |         ipv4_address: 172.20.0.15
15 | 
16 |   mongo:
17 |     container_name: mongo
18 |     image: mongo:5.0.5
19 |     ports:
20 |       - 27017:27017
21 |     volumes:
22 |       - ./mongo/init.sh:/docker-entrypoint-initdb.d/mongo-init.sh
23 |       - ./mongo/keyfile.pem:/tmp/keyfile.pem.orig:ro
24 |     entrypoint:
25 |       - bash
26 |       - -c
27 | - | 28 | cp /tmp/keyfile.pem.orig /tmp/keyfile.pem 29 | chmod 400 /tmp/keyfile.pem 30 | chown 999:999 /tmp/keyfile.pem 31 | exec docker-entrypoint.sh $$@ 32 | command: ["mongod", "--bind_ip", "0.0.0.0", "--replSet", "rs0", "--auth", "--keyFile", "/tmp/keyfile.pem"] 33 | networks: 34 | kafka: 35 | ipv4_address: 172.20.0.14 36 | 37 | 38 | trino: 39 | hostname: trino 40 | container_name: trino 41 | image: 'trinodb/trino:latest' 42 | ports: 43 | - '8080:8080' 44 | volumes: 45 | - ./catalog:/etc/trino/catalog 46 | networks: 47 | kafka: 48 | ipv4_address: 172.20.0.12 49 | 50 | metastore_db: 51 | image: postgres:11 52 | hostname: metastore_db 53 | container_name: metastore_db 54 | environment: 55 | POSTGRES_USER: hive 56 | POSTGRES_PASSWORD: hive 57 | POSTGRES_DB: metastore 58 | networks: 59 | kafka: 60 | ipv4_address: 172.20.0.11 61 | 62 | hive-metastore: 63 | container_name: hive-metastore 64 | hostname: hive-metastore 65 | image: 'starburstdata/hive:3.1.2-e.15' 66 | ports: 67 | - '9083:9083' 68 | environment: 69 | HIVE_METASTORE_DRIVER: org.postgresql.Driver 70 | HIVE_METASTORE_JDBC_URL: jdbc:postgresql://metastore_db:5432/metastore 71 | HIVE_METASTORE_USER: hive 72 | HIVE_METASTORE_PASSWORD: hive 73 | HIVE_METASTORE_WAREHOUSE_DIR: s3://datalake/ 74 | S3_ENDPOINT: http://minio:9000 75 | S3_ACCESS_KEY: minio 76 | S3_SECRET_KEY: minio123 77 | S3_PATH_STYLE_ACCESS: "true" 78 | REGION: "" 79 | GOOGLE_CLOUD_KEY_FILE_PATH: "" 80 | AZURE_ADL_CLIENT_ID: "" 81 | AZURE_ADL_CREDENTIAL: "" 82 | AZURE_ADL_REFRESH_URL: "" 83 | AZURE_ABFS_STORAGE_ACCOUNT: "" 84 | AZURE_ABFS_ACCESS_KEY: "" 85 | AZURE_WASB_STORAGE_ACCOUNT: "" 86 | AZURE_ABFS_OAUTH: "" 87 | AZURE_ABFS_OAUTH_TOKEN_PROVIDER: "" 88 | AZURE_ABFS_OAUTH_CLIENT_ID: "" 89 | AZURE_ABFS_OAUTH_SECRET: "" 90 | AZURE_ABFS_OAUTH_ENDPOINT: "" 91 | AZURE_WASB_ACCESS_KEY: "" 92 | depends_on: 93 | - metastore_db 94 | networks: 95 | kafka: 96 | ipv4_address: 172.20.0.16 97 | 98 | minio: 99 | hostname: minio 100 | image: minio/minio 101 | container_name: minio 102 | ports: 103 | - '9000:9000' 104 | - '9001:9001' 105 | environment: 106 | MINIO_ACCESS_KEY: minio 107 | MINIO_SECRET_KEY: minio123 108 | command: server /data --console-address ":9001" 109 | networks: 110 | kafka: 111 | ipv4_address: 172.20.0.2 112 | 113 | 114 | mc-job: 115 | image: 'minio/mc' 116 | container_name: mc-job 117 | entrypoint: | 118 | /bin/bash -c " 119 | sleep 5; 120 | /usr/bin/mc config --quiet host add myminio http://minio:9000 minio minio123; 121 | /usr/bin/mc mb --quiet myminio/datalake 122 | " 123 | depends_on: 124 | - minio 125 | networks: 126 | kafka: 127 | ipv4_address: 172.20.0.3 128 | 129 | admin: 130 | image: adminer 131 | restart: always 132 | ports: 133 | - 8085:8080 134 | networks: 135 | kafka: 136 | ipv4_address: 172.20.0.4 137 | 138 | zookeeper: 139 | image: confluentinc/cp-zookeeper:7.5.0 140 | hostname: zookeeper 141 | container_name: zookeeper 142 | ports: 143 | - "2181:2181" 144 | environment: 145 | ZOOKEEPER_CLIENT_PORT: 2181 146 | ZOOKEEPER_TICK_TIME: 2000 147 | networks: 148 | kafka: 149 | ipv4_address: 172.20.0.5 150 | 151 | broker: 152 | image: confluentinc/cp-server:7.5.0 153 | hostname: broker 154 | container_name: broker 155 | depends_on: 156 | - zookeeper 157 | ports: 158 | - "9092:9092" 159 | - "9101:9101" 160 | environment: 161 | KAFKA_BROKER_ID: 1 162 | KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181' 163 | KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT 164 | KAFKA_ADVERTISED_LISTENERS: 
PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092 165 | KAFKA_METRIC_REPORTERS: io.confluent.metrics.reporter.ConfluentMetricsReporter 166 | KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 167 | KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0 168 | KAFKA_CONFLUENT_LICENSE_TOPIC_REPLICATION_FACTOR: 1 169 | KAFKA_CONFLUENT_BALANCER_TOPIC_REPLICATION_FACTOR: 1 170 | KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1 171 | KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1 172 | KAFKA_JMX_PORT: 9101 173 | KAFKA_JMX_HOSTNAME: localhost 174 | KAFKA_CONFLUENT_SCHEMA_REGISTRY_URL: http://schema-registry:8081 175 | CONFLUENT_METRICS_REPORTER_BOOTSTRAP_SERVERS: broker:29092 176 | CONFLUENT_METRICS_REPORTER_TOPIC_REPLICAS: 1 177 | CONFLUENT_METRICS_ENABLE: 'true' 178 | CONFLUENT_SUPPORT_CUSTOMER_ID: 'anonymous' 179 | networks: 180 | kafka: 181 | ipv4_address: 172.20.0.6 182 | 183 | schema-registry: 184 | image: confluentinc/cp-schema-registry:7.5.0 185 | hostname: schema-registry 186 | container_name: schema-registry 187 | depends_on: 188 | - broker 189 | ports: 190 | - "8081:8081" 191 | environment: 192 | SCHEMA_REGISTRY_HOST_NAME: schema-registry 193 | SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: 'broker:29092' 194 | SCHEMA_REGISTRY_LISTENERS: http://0.0.0.0:8081 195 | networks: 196 | kafka: 197 | ipv4_address: 172.20.0.7 198 | 199 | connect: 200 | image: cnfldemos/cp-server-connect-datagen:0.6.2-7.5.0 201 | hostname: connect 202 | container_name: connect 203 | depends_on: 204 | - broker 205 | - schema-registry 206 | ports: 207 | - "8083:8083" 208 | environment: 209 | CONNECT_BOOTSTRAP_SERVERS: 'broker:29092' 210 | CONNECT_REST_ADVERTISED_HOST_NAME: connect 211 | CONNECT_GROUP_ID: compose-connect-group 212 | CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs 213 | CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1 214 | CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000 215 | CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets 216 | CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1 217 | CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status 218 | CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1 219 | CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter 220 | CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter 221 | CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081 222 | # CLASSPATH required due to CC-2422 223 | CLASSPATH: /usr/share/java/monitoring-interceptors/monitoring-interceptors-7.5.0.jar 224 | CONNECT_PRODUCER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor" 225 | CONNECT_CONSUMER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor" 226 | CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components" 227 | CONNECT_LOG4J_LOGGERS: org.apache.zookeeper=ERROR,org.I0Itec.zkclient=ERROR,org.reflections=ERROR 228 | networks: 229 | kafka: 230 | ipv4_address: 172.20.0.8 231 | 232 | control-center: 233 | image: confluentinc/cp-enterprise-control-center:7.5.0 234 | hostname: control-center 235 | container_name: control-center 236 | depends_on: 237 | - broker 238 | - schema-registry 239 | - connect 240 | ports: 241 | - "9021:9021" 242 | environment: 243 | CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092' 244 | CONTROL_CENTER_CONNECT_CONNECT-DEFAULT_CLUSTER: 'connect:8083' 245 | CONTROL_CENTER_KSQL_KSQLDB1_ADVERTISED_URL: "http://localhost:8088" 246 | CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081" 247 | CONTROL_CENTER_REPLICATION_FACTOR: 1 248 | 
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
249 |       CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
250 |       CONFLUENT_METRICS_TOPIC_REPLICATION: 1
251 |       PORT: 9021
252 |     networks:
253 |       kafka:
254 |         ipv4_address: 172.20.0.9
255 | 
256 |   rest-proxy:
257 |     image: confluentinc/cp-kafka-rest:7.5.0
258 |     depends_on:
259 |       - broker
260 |       - schema-registry
261 |     ports:
262 |       - 8082:8082
263 |     hostname: rest-proxy
264 |     container_name: rest-proxy
265 |     environment:
266 |       KAFKA_REST_HOST_NAME: rest-proxy
267 |       KAFKA_REST_BOOTSTRAP_SERVERS: 'broker:29092'
268 |       KAFKA_REST_LISTENERS: "http://0.0.0.0:8082"
269 |       KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
270 |     networks:
271 |       kafka:
272 |         ipv4_address: 172.20.0.10
273 | 
274 | networks:
275 |   kafka:
276 |     driver: bridge
277 |     ipam:
278 |       config:
279 |         - subnet: 172.20.0.0/16
280 | 
281 | 
282 | 
283 | 
--------------------------------------------------------------------------------
/docker/install_connectors.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Define the name of the Kafka Connect container
 4 | CONNECT_CONTAINER_NAME="connect"
 5 | 
 6 | # Function to install Kafka Connect connectors
 7 | install_connector() {
 8 |     local connector=$1
 9 |     local version=$2
10 | 
11 |     echo "Installing $connector connector..."
12 |     docker exec $CONNECT_CONTAINER_NAME confluent-hub install --no-prompt $connector:$version
13 |     if [ $? -ne 0 ]; then
14 |         echo "Error installing $connector connector. Exiting."
15 |         exit 1
16 |     fi
17 | }
18 | 
19 | # Install the connectors
20 | echo "Installing Kafka Connect connectors..."
21 | 
22 | install_connector "confluentinc/kafka-connect-s3" "10.5.0"
23 | install_connector "confluentinc/kafka-connect-jdbc" "10.7.4"
24 | install_connector "debezium/debezium-connector-mongodb" "2.0.1"
25 | install_connector "mongodb/kafka-connect-mongodb" "1.9.1"
26 | 
27 | echo "Connectors have been successfully installed."
28 | 
29 | # Restart the Kafka Connect container
30 | echo "Restarting Kafka Connect container..."
31 | docker restart $CONNECT_CONTAINER_NAME
32 | if [ $? -ne 0 ]; then
33 |     echo "Error restarting Kafka Connect container. Exiting."
34 |     exit 1
35 | fi
36 | 
37 | echo "Kafka Connect container restarted successfully."
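Once the script above has finished and the `connect` container is back up, it is worth confirming that the worker actually loaded the new plugins before the sink configurations are posted with `postConnect.sh`. The commands below are a sketch rather than part of the repository; they assume the Connect REST port 8083 published in `docker-compose.yaml` and use the connector name `jdbc-sink` defined in `PostgresSink.json`:

```
# List the plugins the Connect worker has loaded; the JDBC and MongoDB
# sink connector classes should both appear in the output.
curl -s http://localhost:8083/connector-plugins | grep -E 'JdbcSinkConnector|MongoSinkConnector'

# After postConnect.sh has run, list the registered connectors and
# check that a given connector and its task are RUNNING.
curl -s http://localhost:8083/connectors
curl -s http://localhost:8083/connectors/jdbc-sink/status
```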
38 | -------------------------------------------------------------------------------- /docker/listen_eventsSink.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "mongodb-sink-connector-listen_events", 3 | "config": { 4 | "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector", 5 | "tasks.max": "1", 6 | "topics": "listen_events", 7 | "connection.uri": "mongodb://debezium:dbz@mongo:27017/demo?authSource=admin", 8 | "database": "demo", 9 | "collection": "listen_events", 10 | "publish.full.document.only": "true", 11 | "output.format.value" : "schema", 12 | "value.converter.schemas.enable": "true", 13 | "value.converter " : "org.apache.kafka.connect.json.JsonConverter ", 14 | "key.converter ": "org.apache.kafka.connect.storage.StringConverter " 15 | } 16 | } -------------------------------------------------------------------------------- /docker/mongo/init.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh -x 2 | 3 | mongo <<-EOF 4 | rs.initiate({ 5 | }); 6 | rs.status() 7 | EOF 8 | echo "Initiated replica set" 9 | 10 | sleep 5 11 | 12 | 13 | mongosh localhost:27017/admin <<-EOF 14 | db.createUser({ user: 'admin', pwd: 'admin', roles: [ { role: "userAdminAnyDatabase", db: "admin" } ] }); 15 | db.grantRolesToUser("admin", ["clusterManager"]); 16 | EOF 17 | 18 | mongosh -u admin -p admin localhost:27017/admin <<-EOF 19 | db.runCommand({ 20 | createRole: "listDatabases", 21 | privileges: [ 22 | { resource: { cluster : true }, actions: ["listDatabases"]} 23 | ], 24 | roles: [] 25 | }); 26 | db.createUser({ 27 | user: 'debezium', 28 | pwd: 'dbz', 29 | roles: [ 30 | { role: "readWrite", db: "demo" }, 31 | { role: "read", db: "local" }, 32 | { role: "listDatabases", db: "admin" }, 33 | { role: "read", db: "config" }, 34 | { role: "read", db: "admin" } 35 | ] 36 | }); 37 | EOF -------------------------------------------------------------------------------- /docker/mongo/keyfile.pem: -------------------------------------------------------------------------------- 1 | DjULHRL4rNWCVSSzRAaWl3ca4y4I0HNobmscElxdNVT/gfDJo/HpWDZsF6ViEXX0 2 | czxAOM94O+XIxXSuoCoq3J/wJJFACUAtA/5PkXUywdQeE/sh3TerBuihVqZWFSNN 3 | ibR8N7iJZs6TMAE9H7Yhn09pVpbppECt3h5Ucu7CPcyESdH7ECjZfY267DaQG3/N 4 | 1o1/ac+10jzqYp1GwUeDF3TlCjap0H7odGIopKvMt1XgzST2XcM9dVJtJz7f1AsN 5 | UDsvKUVzsPgKGU1Kj8OZMNXlnt8TOFxW7zUJ/XlVImYEzJjPFsR6Io5SpcDu33+c 6 | jfRncFMzXU+Jf5UFX550FKCMSYIeVsdWdGHXSxtUicQZMBmcMo5guX/VuAH1HIzJ 7 | TS4BTCURCcXv5NJLMvVcF0NssXNx0ERNBa3mSHgzz7bzbrfjkSW0lr9DybXB57p/ 8 | 0IJ2lha5YnkVLAkJ/xZdUt0Sh0vmJqfVjBDqRcEaAIGgsWvxWSOqpAhEveYIgXH/ 9 | WBPoP7PSOPQSHmdo7Xuzxfuapzv4YtBjG8GAUpoE3rELrFAVNLfmMITvnHQZcvoG 10 | FgbREuBGx0/UhVnEWcMO+nOpa2TTS9tEwXnt6R6ylIoRqevchVPTkgBzC1t676fz 11 | nTCnfNjtLzfd8dp0mqnU6fz6wscq0HTkUvQ9XvzpuxNW6sr2g3tAwnNX+A4I23VX 12 | MoGC4XD9I5NtkbAkdJpIfK5JUCSAQa0YjTf92+anKr0Rk8/3ssvBwXgRbqAf5OL+ 13 | xAixldtAjpE2rfHKBak325g3ZuP1W/+VvSR+zMKeRyO6+8CZRHC/ArxIniHFF4yr 14 | vOKKUMFlIgL8D2igzhVX4Xj3Euim4TMy/BUm1TKeafFTv1oc3ODFac5AwKRELCrh 15 | mwq3fwECXNdwNS8U0PfmWv047draE98ZikVoKU9n1jj6imZ44fwmdmLvmqY7KOOc 16 | NMdzrcVwGb7d3l1gbUvBrYn430xAQizueRnKnntwiEcE6bbx -------------------------------------------------------------------------------- /docker/page_view_eventsSink.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "mongodb-sink-connector-page_view_events", 3 | "config": { 4 | "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector", 5 | "tasks.max": 
"1", 6 | "topics": "page_view_events", 7 | "connection.uri": "mongodb://debezium:dbz@mongo:27017/demo?authSource=admin", 8 | "database": "demo", 9 | "collection": "page_view_events", 10 | "publish.full.document.only": "true", 11 | "output.format.value" : "schema", 12 | "value.converter.schemas.enable": "true", 13 | "value.converter " : "org.apache.kafka.connect.json.JsonConverter ", 14 | "key.converter ": "org.apache.kafka.connect.storage.StringConverter " 15 | } 16 | } -------------------------------------------------------------------------------- /docker/postConnect.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | KAFKA_CONNECT_URL="http://localhost:8083/connectors" 4 | 5 | curl -X POST -H "Content-Type: application/json" --data @listen_eventsSink.json $KAFKA_CONNECT_URL 6 | 7 | sleep 2 8 | 9 | curl -X POST -H "Content-Type: application/json" --data @page_view_eventsSink.json $KAFKA_CONNECT_URL 10 | 11 | sleep 2 12 | 13 | curl -X POST -H "Content-Type: application/json" --data @PostgresSink.json $KAFKA_CONNECT_URL 14 | 15 | echo "Configuration des connecteurs envoyée à Kafka Connect." 16 | -------------------------------------------------------------------------------- /models/docs.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Stefen-Taime/Iceberg-Dbt-Trino-Hive-modern-open-source-data-stack/1f563471176cbcd5a9185c8447c3141a7b5672e1/models/docs.md -------------------------------------------------------------------------------- /models/mart_hourly_stream_evolution.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='table') }} 3 | 4 | SELECT 5 | stream_date, 6 | stream_hour, 7 | total_streams, 8 | SUM(total_streams) OVER (PARTITION BY stream_date ORDER BY stream_hour) AS cumulative_streams 9 | FROM {{ ref('stg_streams_hourly') }} 10 | ORDER BY 11 | stream_date, 12 | stream_hour 13 | -------------------------------------------------------------------------------- /models/mart_top_songs_artists.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='table', unique_key='song') }} 3 | 4 | WITH top_songs AS ( 5 | SELECT 6 | artist, 7 | song, 8 | play_count 9 | FROM {{ ref('stg_top_songs_artists') }} 10 | ) 11 | 12 | SELECT 13 | artist, 14 | song, 15 | play_count 16 | FROM top_songs 17 | -------------------------------------------------------------------------------- /models/mart_total_unique_users.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='table') }} 3 | 4 | SELECT 5 | total_unique_users 6 | FROM {{ ref('stg_unique_users') }} 7 | -------------------------------------------------------------------------------- /models/mart_user_activity_per_country.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='table') }} 3 | 4 | SELECT 5 | country, 6 | SUM(activity_count) AS total_activity 7 | FROM {{ ref('stg_user_activity_per_country') }} 8 | GROUP BY country 9 | ORDER BY SUM(activity_count) DESC 10 | -------------------------------------------------------------------------------- /models/mart_user_activity_per_state.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='table') }} 3 | 4 | SELECT 5 | state, 6 | city, 7 | SUM(activity_count) AS 
total_activity 8 | FROM {{ ref('stg_user_activity_per_location') }} 9 | GROUP BY state, city 10 | ORDER BY state, SUM(activity_count) DESC 11 | -------------------------------------------------------------------------------- /models/mart_user_level_by_gender.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='table') }} 3 | 4 | SELECT 5 | gender, 6 | level, 7 | SUM(user_count) AS total_users, 8 | ROUND(SUM(user_count) * 100.0 / SUM(SUM(user_count)) OVER (), 2) AS percentage_of_total_users 9 | FROM {{ ref('stg_user_level_by_gender') }} 10 | GROUP BY gender, level 11 | -------------------------------------------------------------------------------- /models/overview.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Stefen-Taime/Iceberg-Dbt-Trino-Hive-modern-open-source-data-stack/1f563471176cbcd5a9185c8447c3141a7b5672e1/models/overview.md -------------------------------------------------------------------------------- /models/schema.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: auth_events 5 | description: "This table records authentication events, including both successful and unsuccessful login attempts." 6 | 7 | columns: 8 | - name: ts 9 | description: "Timestamp of the event." 10 | 11 | - name: sessionId 12 | description: "Unique identifier for the user session." 13 | tests: 14 | - unique 15 | 16 | - name: level 17 | description: "Subscription level of the user at the time of the event." 18 | tests: 19 | - accepted_values: 20 | values: ['free', 'paid'] 21 | 22 | - name: itemInSession 23 | description: "The sequential number of the event in the current session." 24 | tests: 25 | - not_null 26 | 27 | - name: city 28 | description: "City from which the event was generated." 29 | 30 | 31 | - name: zip 32 | description: "Zip/postal code from which the event was generated." 33 | 34 | 35 | - name: state 36 | description: "State or region from which the event was generated." 37 | 38 | 39 | - name: userAgent 40 | description: "User agent of the browser used for the event." 41 | 42 | 43 | - name: lon 44 | description: "Longitude from which the event was generated." 45 | 46 | 47 | - name: lat 48 | description: "Latitude from which the event was generated." 49 | 50 | 51 | - name: userId 52 | description: "Unique identifier for the user." 53 | 54 | 55 | - name: lastName 56 | description: "Last name of the user. PII." 57 | 58 | 59 | - name: firstName 60 | description: "First name of the user. PII." 61 | 62 | 63 | - name: gender 64 | description: "Gender of the user." 65 | tests: 66 | - accepted_values: 67 | values: ['M', 'F'] 68 | 69 | - name: registration 70 | description: "Timestamp of the user's registration." 71 | 72 | 73 | - name: success 74 | description: "Indicates whether the authentication was successful." 
75 | tests: 76 | 77 | - accepted_values: 78 | values: ['True', 'False'] 79 | -------------------------------------------------------------------------------- /models/source.yml: -------------------------------------------------------------------------------- 1 | # source.yml 2 | version: 2 3 | 4 | sources: 5 | - name: oltp 6 | database: oltp 7 | schema: public 8 | tables: 9 | - name: auth_events 10 | 11 | - name: website 12 | database: website 13 | schema: demo 14 | tables: 15 | - name: listen_events 16 | - name: page_view_events -------------------------------------------------------------------------------- /models/source/schema.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: src_auth_events 5 | description: "This model transforms auth_events data from the OLTP source for analytical purposes." 6 | columns: 7 | - name: ts 8 | description: "Timestamp of the event." 9 | tests: 10 | - not_null 11 | - name: sessionId 12 | tests: 13 | - unique 14 | - not_null 15 | - name: level 16 | tests: 17 | - accepted_values: 18 | values: ['free', 'paid'] 19 | - name: itemInSession 20 | - name: city 21 | 22 | - name: zip 23 | - name: state 24 | - name: userAgent 25 | - name: lon 26 | - name: lat 27 | - name: userId 28 | tests: 29 | - unique 30 | - name: lastName 31 | - name: firstName 32 | - name: gender 33 | tests: 34 | - accepted_values: 35 | values: ['F', 'M'] 36 | - name: registration 37 | - name: success 38 | tests: 39 | - accepted_values: 40 | values: ['True', 'False'] 41 | -------------------------------------------------------------------------------- /models/source/src_auth_events.sql: -------------------------------------------------------------------------------- 1 | {{ config(materialized='incremental', unique_key='id') }} 2 | 3 | with source as ( 4 | select 5 | CAST(ts AS VARCHAR) as id, 6 | ts, 7 | CAST(sessionId AS VARCHAR) as sessionId, 8 | level, 9 | itemInSession, 10 | city, 11 | zip, 12 | state, 13 | userAgent, 14 | CAST(NULLIF(lon, '') AS double) as lon, 15 | CAST(NULLIF(lat, '') AS double) as lat, 16 | CAST(FLOOR(CAST(NULLIF(userId, '') AS DOUBLE)) AS INT) as userId, 17 | lastName, 18 | firstName, 19 | gender, 20 | CAST(FLOOR(CAST(NULLIF(registration, '') AS DOUBLE)) AS BIGINT) as registration, 21 | success 22 | from {{ source('oltp', 'auth_events') }} 23 | ) 24 | 25 | select 26 | id, 27 | ts, 28 | sessionId, 29 | level, 30 | itemInSession, 31 | city, 32 | zip, 33 | state, 34 | userAgent, 35 | lon, 36 | lat, 37 | userId, 38 | lastName, 39 | firstName, 40 | gender, 41 | registration, 42 | CASE 43 | WHEN CAST(success AS VARCHAR) = 't' THEN true 44 | WHEN CAST(success AS VARCHAR) = 'f' THEN false 45 | ELSE NULL 46 | END AS success_boolean 47 | from source 48 | 49 | {% if is_incremental() %} 50 | where ts > (select max(ts) from {{ this }}) 51 | {% endif %} 52 | -------------------------------------------------------------------------------- /models/source/src_listen_events.sql: -------------------------------------------------------------------------------- 1 | {{ config(materialized='incremental', unique_key='_id') }} 2 | 3 | with source as ( 4 | 5 | select 6 | CAST(_id AS VARCHAR) as id, 7 | artist, 8 | song, 9 | CAST(NULLIF(duration, '') AS DOUBLE) as duration, 10 | ts, 11 | sessionid, 12 | auth, 13 | level, 14 | CAST(NULLIF(itemInSession, '') AS INTEGER) as itemInSession, 15 | city, 16 | zip, 17 | state, 18 | country, 19 | userAgent, 20 | CAST(NULLIF(lon, '') AS DOUBLE) as lon, 21 | CAST(NULLIF(lat, '') AS 
DOUBLE) as lat, 22 | CAST(CAST(NULLIF(userId, '') AS DOUBLE) AS INTEGER) as userId, 23 | -- Assurez-vous que userId peut aussi être converti de cette manière. 24 | lastName, 25 | firstName, 26 | gender, 27 | CAST(CAST(NULLIF(registration, '') AS DOUBLE) AS BIGINT) as registration 28 | from {{ source('website', 'listen_events') }} 29 | 30 | ) 31 | 32 | select * from source 33 | 34 | {% if is_incremental() %} 35 | 36 | -- this filter will only be applied on an incremental run 37 | where ts > (select max(ts) from {{ this }}) 38 | 39 | {% endif %} 40 | -------------------------------------------------------------------------------- /models/source/src_page_view_events.sql: -------------------------------------------------------------------------------- 1 | {{ config(materialized='incremental', unique_key='_id') }} 2 | 3 | with source as ( 4 | select 5 | CAST(_id AS VARCHAR) as id, 6 | ts, 7 | CAST(sessionId AS VARCHAR) as sessionId, 8 | page, 9 | auth, 10 | level, 11 | CAST(CAST(NULLIF(itemInSession, '') AS DOUBLE) AS INTEGER) as itemInSession, 12 | city, 13 | zip, 14 | state, 15 | userAgent, 16 | CAST(NULLIF(lon, '') AS double) as lon, 17 | CAST(NULLIF(lat, '') AS double) as lat, 18 | CAST(CAST(NULLIF(userId, '') AS DOUBLE) AS INTEGER) as userId, 19 | lastName, 20 | firstName, 21 | gender, 22 | CAST(CAST(NULLIF(registration, '') AS DOUBLE) AS BIGINT) as registration, 23 | artist, 24 | song, 25 | CAST(NULLIF(duration, '') AS double) as duration 26 | from {{ source('website', 'page_view_events') }} 27 | ) 28 | 29 | select * from source 30 | 31 | {% if is_incremental() %} 32 | where ts > (select max(ts) from {{ this }}) 33 | {% endif %} 34 | -------------------------------------------------------------------------------- /models/stage/stg_streams_hourly.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='view') }} 3 | 4 | SELECT 5 | DATE_FORMAT(FROM_UNIXTIME(ts), '%Y-%m-%d') AS stream_date, 6 | DATE_FORMAT(FROM_UNIXTIME(ts), '%H') AS stream_hour, 7 | COUNT(*) AS total_streams 8 | FROM {{ ref('src_listen_events') }} 9 | GROUP BY 10 | DATE_FORMAT(FROM_UNIXTIME(ts), '%Y-%m-%d'), 11 | DATE_FORMAT(FROM_UNIXTIME(ts), '%H') 12 | -------------------------------------------------------------------------------- /models/stage/stg_top_songs_artists.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='view') }} 3 | 4 | WITH song_plays AS ( 5 | SELECT 6 | artist, 7 | song, 8 | COUNT(*) AS play_count 9 | FROM {{ ref('src_listen_events') }} 10 | GROUP BY artist, song 11 | ) 12 | 13 | SELECT 14 | artist, 15 | song, 16 | play_count 17 | FROM song_plays 18 | ORDER BY play_count DESC 19 | -------------------------------------------------------------------------------- /models/stage/stg_unique_users.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='view') }} 3 | 4 | SELECT 5 | COUNT(DISTINCT userid) AS total_unique_users 6 | FROM {{ ref('src_listen_events') }} 7 | WHERE userid IS NOT NULL 8 | -------------------------------------------------------------------------------- /models/stage/stg_user_activity_per_country.sql: -------------------------------------------------------------------------------- 1 | 2 | {{ config(materialized='view') }} 3 | 4 | WITH combined_activities AS ( 5 | SELECT 6 | country, 7 | 'listen_event' AS event_type, 8 | ts, 9 | userid 10 | FROM {{ ref('src_listen_events') }} 11 | WHERE country IS NOT 
NULL
12 |     UNION ALL
13 |     SELECT
14 |         'Unknown' AS country, 'auth_event' AS event_type,
15 |         ts,
16 |         userid
17 |     FROM {{ ref('src_auth_events') }}
18 |     UNION ALL
19 |     SELECT
20 |         'Unknown' AS country,
21 |         'page_view_event' AS event_type,
22 |         ts,
23 |         userid
24 |     FROM {{ ref('src_page_view_events') }}
25 | )
26 | 
27 | SELECT
28 |     country,
29 |     COUNT(*) AS activity_count
30 | FROM combined_activities
31 | GROUP BY country
32 | 
--------------------------------------------------------------------------------
/models/stage/stg_user_activity_per_location.sql:
--------------------------------------------------------------------------------
 1 | 
 2 | {{ config(materialized='view') }}
 3 | 
 4 | WITH combined_activities AS (
 5 |     SELECT
 6 |         city,
 7 |         state,
 8 |         'auth_event' AS event_type,
 9 |         ts,
10 |         userid
11 |     FROM {{ ref('src_auth_events') }}
12 |     UNION ALL
13 |     SELECT
14 |         city,
15 |         state,
16 |         'listen_event' AS event_type,
17 |         ts,
18 |         userid
19 |     FROM {{ ref('src_listen_events') }}
20 |     UNION ALL
21 |     SELECT
22 |         city,
23 |         state,
24 |         'page_view_event' AS event_type,
25 |         ts,
26 |         userid
27 |     FROM {{ ref('src_page_view_events') }}
28 | )
29 | 
30 | SELECT
31 |     state,
32 |     city,
33 |     COUNT(*) AS activity_count
34 | FROM combined_activities
35 | GROUP BY state, city
36 | 
--------------------------------------------------------------------------------
/models/stage/stg_user_level_by_gender.sql:
--------------------------------------------------------------------------------
 1 | 
 2 | {{ config(materialized='view') }}
 3 | 
 4 | WITH user_subscriptions AS (
 5 |     SELECT
 6 |         gender,
 7 |         level,
 8 |         COUNT(DISTINCT userId) AS user_count
 9 |     FROM {{ ref('src_auth_events') }}
10 |     WHERE gender IS NOT NULL AND level IS NOT NULL
11 |     GROUP BY gender, level
12 | )
13 | 
14 | SELECT
15 |     gender,
16 |     level,
17 |     user_count
18 | FROM user_subscriptions
19 | 
--------------------------------------------------------------------------------
/packages.yml:
--------------------------------------------------------------------------------
 1 | packages:
 2 |   - package: dbt-labs/dbt_utils
 3 |     version: 1.1.1
 4 |   - package: starburstdata/trino_utils
 5 |     version: 0.6.0
 6 | 
 7 | 
--------------------------------------------------------------------------------
/profiles.yml:
--------------------------------------------------------------------------------
 1 | my_dbt_trino_project:
 2 |   target: dev
 3 |   outputs:
 4 |     dev:
 5 |       type: trino
 6 |       method: none
 7 |       user: admin
 8 |       database: datalake
 9 |       host: localhost
10 |       port: 8080
11 |       schema: analytics
12 |       threads: 1
--------------------------------------------------------------------------------
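Once `dbt deps` and `dbt run` have completed against the `dev` target above, the resulting Iceberg tables can be inspected straight from Trino. The commands below are a sketch rather than part of the repository: they assume the CLI bundled in the `trino` container and dbt's default schema naming, under which the `stage` models land in `analytics_stage` and the mart tables in `analytics` inside the `datalake` catalog.

```
# List the schemas in the Iceberg catalog (the dbt-created ones should appear here).
docker exec -it trino trino --execute "SHOW SCHEMAS FROM datalake"

# List the mart tables and sample one of them.
docker exec -it trino trino --execute "SHOW TABLES FROM datalake.analytics"
docker exec -it trino trino --execute "SELECT artist, song, play_count FROM datalake.analytics.mart_top_songs_artists ORDER BY play_count DESC LIMIT 10"
```

If the schema names differ, the output of `SHOW SCHEMAS` shows where the models were actually written; these are the same tables Superset queries once it is connected to Trino.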