├── .archive ├── .touch ├── bkp.setup_datagen.sh └── setup_data_origination_apps.sh ├── README.md ├── datagen ├── comsume_topic_dgCustomer.py ├── consume_panda_2_iceberg_customer.py ├── consume_stream_customer_2_console.py ├── consume_stream_txn_2_console.py ├── datagenerator.py ├── parameter_get_schema.py ├── pg_upsert_dg.py ├── redpanda_dg.py ├── spark_from_dbz_customer_2_iceberg.py └── test_pg.py ├── db_ddl ├── create_ddl_icecatalog.sql ├── create_user_datagen.sql ├── customer_ddl.sql ├── customer_function_ddl.sql ├── grants4dbz.sql └── hive_metastore_ddl.sql ├── dbz_server ├── .touch └── application.properties ├── downloads └── .touch ├── explore_postgresql.md ├── get_files.sh ├── hive_metastore └── hive-site.xml ├── images ├── .placeholder ├── Iceberg.gif ├── access_keys_view.png ├── adminer_login_screen.png ├── adminer_login_screen_icecatalog.png ├── bucket_first_table_metadata_view.png ├── connect_ouput_detail_msg.png ├── connect_output_summary_msg.png ├── console_view_run_connect.png ├── detail_view_of_cust_msg.png ├── drunk-cheers.gif ├── first_login.png ├── initial_bucket_view.png ├── minio_login_screen.png ├── panda_topic_view_connect_topic.png ├── panda_view__dg_load_topics.png ├── panda_view_topics.png ├── spark_master_view.png └── topic_customer_view.png ├── kafka_connect ├── connect.properties └── pg-source-connector.properties ├── prework.md ├── redpanda └── redpanda.yaml ├── sample_output ├── .touch └── connect.output ├── sample_spark_jobs.md ├── setup_datagen.sh ├── spark_items ├── all_workshop1_items.sql ├── conf.properties ├── ice_spark-sql_i-cli.sh ├── iceberg_workshop_sql_items.sh ├── iceberg_workshop_tbl_ddl.sql ├── load_ice_customer_batch.sql ├── load_ice_transactions_pyspark.py ├── merge_ice_customer_batch.sql ├── stream_customer_ddl.sql ├── stream_customer_ddl_script.sh ├── stream_customer_event_history_ddl.sql └── stream_customer_event_history_ddl_script.sh ├── stop_start_services.sh ├── tick2705-1.png ├── tick2705-2.png ├── utils.sh └── workshop1_revisit.md /.archive/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /.archive/bkp.setup_datagen.sh: -------------------------------------------------------------------------------- 1 | 2 | #!/bin/bash 3 | 4 | 5 | ########################################################################################## 6 | # install some OS utilities 7 | ######################################################################################### 8 | sudo apt-get install wget curl apt-transport-https unzip chrony -y 9 | sudo apt-get install -y figlet cowsay 10 | sudo apt-get update 11 | 12 | ########################################################################################## 13 | # download and install community edition of redpanda 14 | ########################################################################################## 15 | 16 | echo 17 | echo "---------------------------------------------------------------------" 18 | echo "starting redpanda install ..." 
19 | echo "---------------------------------------------------------------------" 20 | echo 21 | 22 | ## Run the setup script to download and install the repo 23 | curl -1sLf 'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' | sudo -E bash 24 | 25 | sudo apt-get update 26 | sudo apt install redpanda -y 27 | 28 | ########################################################################################## 29 | # download and install 'rpk' - cli tools for working with red panda 30 | ######################################################################################### 31 | curl -LO https://github.com/redpanda-data/redpanda/releases/latest/download/rpk-linux-amd64.zip 32 | 33 | ########################################################################################## 34 | # create a few directories 35 | ########################################################################################## 36 | mkdir -p ~/.local/bin 37 | mkdir -p ~/datagen 38 | 39 | ######################################################################################### 40 | # add items to path for future use 41 | ######################################################################################### 42 | export PATH="~/.local/bin:$PATH" 43 | export REDPANDA_HOME=~/.local/bin 44 | 45 | ########################################################################################## 46 | # add to path perm https://help.ubuntu.com/community/EnvironmentVariables 47 | ########################################################################################## 48 | echo "" >> ~/.profile 49 | echo "# set path variables here:" >> ~/.profile 50 | echo "export REDPANDA_HOME=~/.local/bin" >> ~/.profile 51 | echo "PATH=$PATH:$REDPANDA_HOME" >> ~/.profile 52 | 53 | ########################################################################################## 54 | # unzip rpk to --> ~/.local/bin 55 | ########################################################################################## 56 | unzip rpk-linux-amd64.zip -d ~/.local/bin/ 57 | 58 | ########################################################################################## 59 | # Install the red panda console package 60 | ########################################################################################## 61 | curl -1sLf \ 62 | 'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' \ 63 | | sudo -E bash 64 | 65 | sudo apt-get install redpanda-console -y 66 | 67 | ########################################################################################## 68 | # install pip for python3 69 | ########################################################################################## 70 | sudo apt install python3-pip -y 71 | 72 | ########################################################################################## 73 | # install jq 74 | ########################################################################################## 75 | sudo apt install -y jq 76 | 77 | ########################################################################################## 78 | # create the redpanda conig.yaml # needed to change default console port to 8888 to avoid conflict with debezium server 79 | ########################################################################################## 80 | cat < ~/redpanda-console-config.yaml 81 | kafka: 82 | brokers: ":9092" 83 | schemaRegistry: 84 | enabled: true 85 | urls: ["http://:8081"] 86 | connect: 87 | enabled: true 88 | clusters: 89 | - name: postgres-dbz-connector 90 | url: http://:8083 91 | server: 92 | 
listenPort: 8888 93 | EOF 94 | 95 | sudo cp ~/data_origination_workshop/redpanda/redpanda.yaml /etc/redpanda/ 96 | 97 | 98 | ########################################################################################## 99 | # Need to update the value of '' in a bunch of files 100 | ########################################################################################## 101 | PRIVATE_IP=`ip -o route get to 8.8.8.8 | sed -n 's/.*src \([0-9.]\+\).*/\1/p'` 102 | sudo sed -e "s,,$PRIVATE_IP,g" -i ~/redpanda-console-config.yaml 103 | sudo sed -e "s,,$PRIVATE_IP,g" -i /etc/redpanda/redpanda.yaml 104 | 105 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/comsume_topic_dgCustomer.py 106 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/consume_panda_2_iceberg_customer.py 107 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/consume_stream_customer_2_console.py 108 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/consume_stream_txn_2_console.py 109 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/pg_upsert_dg.py 110 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/redpanda_dg.py 111 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/spark_from_dbz_customer_2_iceberg.py 112 | 113 | ########################################################################################## 114 | # move this file to proper directory 115 | ########################################################################################## 116 | sudo mv ~/redpanda-console-config.yaml /etc/redpanda/redpanda-console-config.yaml 117 | sudo chown redpanda:redpanda -R /etc/redpanda 118 | 119 | ########################################################################################## 120 | # start redpanda & the console: 121 | ########################################################################################## 122 | sudo systemctl start redpanda 123 | sudo systemctl start redpanda-console 124 | 125 | echo 126 | echo "---------------------------------------------------------------------" 127 | echo "redpanda setup completed..." 128 | echo "---------------------------------------------------------------------" 129 | echo 130 | ########################################################################################## 131 | # install a specific version of postgresql (version 14) 132 | ########################################################################################## 133 | echo 134 | echo "---------------------------------------------------------------------" 135 | echo "installing postgresql..." 
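# Note (illustrative): the sed substitutions earlier in this script swap a host-IP
# placeholder token -- assumed here to be "<private_ip>" -- for the address detected
# in $PRIVATE_IP, along the lines of:
#   sed -e "s,<private_ip>,$PRIVATE_IP,g" -i /etc/redpanda/redpanda.yaml
# A quick way to confirm the broker and console services came up:
#   systemctl is-active redpanda redpanda-console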
136 | echo "---------------------------------------------------------------------" 137 | echo 138 | 139 | apt policy postgresql 140 | 141 | ########################################################################################## 142 | # install the pgp key for this version of postgresql: 143 | ########################################################################################## 144 | curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc|sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg 145 | 146 | sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' 147 | 148 | sudo apt update 149 | 150 | sudo apt install postgresql-14 -y 151 | 152 | sudo systemctl enable postgresql 153 | 154 | ########################################################################################## 155 | # backup the original postgresql conf file 156 | ########################################################################################## 157 | sudo cp /etc/postgresql/14/main/postgresql.conf /etc/postgresql/14/main/postgresql.conf.orig 158 | 159 | ########################################################################################## 160 | # setup the database to allow listeners from any host 161 | ########################################################################################## 162 | sudo sed -e 's,#listen_addresses = \x27localhost\x27,listen_addresses = \x27*\x27,g' -i /etc/postgresql/14/main/postgresql.conf 163 | 164 | ########################################################################################## 165 | # increase number of connections allowed in the database 166 | ########################################################################################## 167 | sudo sed -e 's,max_connections = 100,max_connections = 300,g' -i /etc/postgresql/14/main/postgresql.conf 168 | 169 | ########################################################################################## 170 | # need to setup postgres WAL to allow debezium to read from the logs 171 | ########################################################################################## 172 | sudo sed -e 's,#listen_addresses = 'localhost',listen_addresses = '*',g' -i /etc/postgresql/14/main/postgresql.conf 173 | sudo sed -e 's,#wal_level = replica,wal_level = logical,g' -i /etc/postgresql/14/main/postgresql.conf 174 | sudo sed -e 's,#max_wal_senders = 10,max_wal_senders = 4,g' -i /etc/postgresql/14/main/postgresql.conf 175 | sudo sed -e 's,#max_replication_slots = 10,max_replication_slots = 4,g' -i /etc/postgresql/14/main/postgresql.conf 176 | 177 | ########################################################################################## 178 | # create a new 'pg_hba.conf' file 179 | ########################################################################################## 180 | # backup the orig 181 | sudo mv /etc/postgresql/14/main/pg_hba.conf /etc/postgresql/14/main/pg_hba.conf.orig 182 | 183 | cat < pg_hba.conf 184 | # TYPE DATABASE USER ADDRESS METHOD 185 | local all all peer 186 | host datagen datagen 0.0.0.0/0 md5 187 | host icecatalog icecatalog 0.0.0.0/0 md5 188 | EOF 189 | 190 | ########################################################################################## 191 | # set owner and permissions of this conf file 192 | ########################################################################################## 193 | sudo mv pg_hba.conf /etc/postgresql/14/main/pg_hba.conf 194 | sudo chown postgres:postgres 
/etc/postgresql/14/main/pg_hba.conf 195 | sudo chmod 600 /etc/postgresql/14/main/pg_hba.conf 196 | 197 | ########################################################################################## 198 | # restart postgresql 199 | ########################################################################################## 200 | sudo systemctl restart postgresql 201 | 202 | ########################################################################################## 203 | # install Java 11 204 | ########################################################################################## 205 | sudo apt install openjdk-11-jdk -y 206 | 207 | ########################################################################################## 208 | ## Run the sql file to create the schema for all DB’s 209 | ########################################################################################## 210 | sudo -u postgres psql < ~/data_origination_workshop/db_ddl/create_user_datagen.sql 211 | sudo -u datagen psql < ~/data_origination_workshop/db_ddl/customer_ddl.sql 212 | sudo -u datagen psql < ~/data_origination_workshop/db_ddl/customer_function_ddl.sql 213 | sudo -u datagen psql < ~/data_origination_workshop/db_ddl/grants4dbz.sql 214 | sudo -u postgres psql < ~/data_origination_workshop/db_ddl/create_ddl_icecatalog.sql 215 | #sudo -u postgres psql < ~/data_origination_workshop/db_ddl/hive_metastore_ddl.sql 216 | 217 | echo 218 | echo "---------------------------------------------------------------------" 219 | echo "postgresql install completed..." 220 | echo "---------------------------------------------------------------------" 221 | echo 222 | ########################################################################################## 223 | # 224 | ########################################################################################## 225 | 226 | echo 227 | echo "---------------------------------------------------------------------" 228 | echo "setup data generator items..." 229 | echo "---------------------------------------------------------------------" 230 | echo 231 | 232 | 233 | ########################################################################################## 234 | # copy these files to the os user 'datagen' and set owner and permissions 235 | ########################################################################################## 236 | #sudo mv ~/data_origination_workshop/datagen/* /home/datagen/datagen/ 237 | #sudo chown datagen:datagen -R /home/datagen/ 238 | 239 | mv ~/data_origination_workshop/datagen/* /home/datagen/datagen/ 240 | chown datagen:datagen -R /home/datagen/ 241 | 242 | ########################################################################################## 243 | # pip install some items 244 | ########################################################################################## 245 | sudo pip install kafka-python uuid simplejson faker psycopg2-binary 246 | 247 | echo 248 | echo "---------------------------------------------------------------------" 249 | echo "data generator setup completed..." 250 | echo "---------------------------------------------------------------------" 251 | echo 252 | 253 | ########################################################################################## 254 | # kafka connect downloads 255 | ########################################################################################## 256 | echo 257 | echo "---------------------------------------------------------------------" 258 | echo "installing kafka-connect..." 
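# (Optional, illustrative) confirm the replication settings Debezium depends on
# survived the postgresql restart before wiring up Kafka Connect:
#   sudo -u postgres psql -c "SHOW wal_level;"              # expect: logical
#   sudo -u postgres psql -c "SHOW max_replication_slots;"  # expect: 4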
259 | echo "---------------------------------------------------------------------" 260 | echo 261 | # create some directories 262 | mkdir -p ~/kafka_connect/configuration 263 | mkdir -p ~/kafka_connect/plugins 264 | 265 | # get the public key 266 | sudo wget https://dlcdn.apache.org/kafka/KEYS 267 | 268 | # get the file: 269 | wget https://dlcdn.apache.org/kafka/3.3.2/kafka_2.13-3.3.2.tgz -P ~/kafka_connect 270 | 271 | #untar the file: 272 | tar -xzf ~/kafka_connect/kafka_2.13-3.3.2.tgz --directory ~/kafka_connect/ 273 | 274 | # remove the tar file: 275 | rm ~/kafka_connect/kafka_2.13-3.3.2.tgz 276 | 277 | # copy the properties files: 278 | cp ~/data_origination_workshop/kafka_connect/*.properties ~/kafka_connect/configuration/ 279 | 280 | # update the private IP address in this config file: 281 | 282 | #sudo sed -e "s,,$PRIVATE_IP,g" -i ~/kafka_connect/configuration/connect.properties 283 | sed -e "s,,$PRIVATE_IP,g" -i ~/kafka_connect/configuration/connect.properties 284 | 285 | ########################################################################################## 286 | # debezium download 287 | ########################################################################################## 288 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.1.1.Final/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz -P ~/kafka_connect 289 | 290 | # untar this file: 291 | tar -xzf ~/kafka_connect/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz --directory ~/kafka_connect/plugins/ 292 | 293 | # remove tar file 294 | rm ~/kafka_connect/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz 295 | ########################################################################################## 296 | # postgresql jdbc download 297 | ########################################################################################## 298 | wget https://jdbc.postgresql.org/download/postgresql-42.5.1.jar -P ~/kafka_connect/plugins/debezium-connector-postgres/ 299 | 300 | ########################################################################################## 301 | # copy jars to the kafka libs folder 302 | ########################################################################################## 303 | cp ~/kafka_connect/plugins/debezium-connector-postgres/*.jar ~/kafka_connect/kafka_2.13-3.3.2/libs/ 304 | 305 | 306 | 307 | echo 308 | echo "---------------------------------------------------------------------" 309 | echo "kafka-connect setup completed..." 310 | echo "---------------------------------------------------------------------" 311 | echo 312 | ########################################################################################## 313 | # Items below this line are from the iceberg workshop & tweaked to run here: 314 | ########################################################################################## 315 | echo 316 | echo "---------------------------------------------------------------------" 317 | echo "install apache iceberg & spark stand alone..." 
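# For reference (illustrative, using the paths staged above): a standalone Kafka
# Connect worker can later be launched with the Debezium source connector like so:
#   ~/kafka_connect/kafka_2.13-3.3.2/bin/connect-standalone.sh \
#       ~/kafka_connect/configuration/connect.properties \
#       ~/kafka_connect/configuration/pg-source-connector.properties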
318 | echo "---------------------------------------------------------------------" 319 | echo 320 | ########################################################################################## 321 | # Install maven 322 | ########################################################################################## 323 | sudo apt install maven -y 324 | 325 | ########################################################################################## 326 | # create a directory for spark events, logs and some json files to be used 327 | ########################################################################################## 328 | mkdir -p /opt/spark/logs 329 | mkdir -p /opt/spark/spark-events 330 | mkdir -p /opt/spark/input 331 | mkdir -p /opt/spark/checkpoint 332 | 333 | ########################################################################################## 334 | # download apache spark standalone 335 | ########################################################################################## 336 | wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz 337 | 338 | tar -xzvf spark-3.3.1-bin-hadoop3.tgz 339 | 340 | sudo mv spark-3.3.1-bin-hadoop3/ /opt/spark 341 | #mv spark-3.3.1-bin-hadoop3/ /opt/spark 342 | 343 | ########################################################################################## 344 | # install aws cli 345 | ########################################################################################## 346 | sudo apt install awscli -y 347 | 348 | ########################################################################################## 349 | # install mlocate 350 | ########################################################################################## 351 | sudo apt install -y mlocate 352 | 353 | ########################################################################################## 354 | # download the jdbc jar file for postgres: 355 | ########################################################################################## 356 | wget https://jdbc.postgresql.org/download/postgresql-42.5.1.jar 357 | 358 | #sudo mv postgresql-42.5.1.jar /opt/spark/jars/ 359 | mv postgresql-42.5.1.jar /opt/spark/jars/ 360 | 361 | 362 | ########################################################################################## 363 | # download some aws jars: 364 | ########################################################################################## 365 | wget https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.19.19/bundle-2.19.19.jar 366 | 367 | #sudo mv bundle-2.19.19.jar /opt/spark/jars/ 368 | mv bundle-2.19.19.jar /opt/spark/jars/ 369 | 370 | 371 | wget https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.19.19/url-connection-client-2.19.19.jar 372 | mv url-connection-client-2.19.19.jar /opt/spark/jars/ 373 | 374 | ########################################################################################## 375 | # download iceberg spark runtime 376 | ########################################################################################## 377 | wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar 378 | wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.3.1/spark-sql-kafka-0-10_2.12-3.3.1.jar 379 | wget https://repo.mavenlibs.com/maven/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.3.1/spark-token-provider-kafka-0-10_2.12-3.3.1.jar 380 | wget 
https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.3.1/kafka-clients-3.3.1.jar 381 | wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar 382 | 383 | mv ~/iceberg-spark-runtime-3.3_2.12-1.1.0.jar /opt/spark/jars/ 384 | mv ~/spark-sql-kafka-0-10_2.12-3.3.1.jar /opt/spark/jars/ 385 | mv ~/spark-token-provider-kafka-0-10_2.12-3.3.1.jar /opt/spark/jars/ 386 | mv ~/kafka-clients-3.3.1.jar /opt/spark/jars/ 387 | mv ~/commons-pool2-2.11.1.jar /opt/spark/jars/ 388 | 389 | echo 390 | echo "---------------------------------------------------------------------" 391 | echo "iceberg & spark items completed..." 392 | echo "---------------------------------------------------------------------" 393 | echo 394 | ########################################################################################## 395 | # download minio debian package 396 | ########################################################################################## 397 | 398 | echo 399 | echo "---------------------------------------------------------------------" 400 | echo "install minio..." 401 | echo "---------------------------------------------------------------------" 402 | echo 403 | wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230112020616.0.0_amd64.deb -O minio.deb 404 | 405 | ########################################################################################## 406 | # install minio 407 | ########################################################################################## 408 | sudo dpkg -i minio.deb 409 | 410 | ########################################################################################## 411 | # create directory for minio data to be stored 412 | ########################################################################################## 413 | sudo mkdir -p /opt/app/minio/data 414 | 415 | sudo groupadd -r minio-user 416 | sudo useradd -M -r -g minio-user minio-user 417 | 418 | ########################################################################################## 419 | # grant permission to this directory to minio-user 420 | ########################################################################################## 421 | 422 | sudo chown -R minio-user:minio-user /opt/app/minio/ 423 | 424 | ########################################################################################## 425 | # create an enviroment variable file for minio 426 | ########################################################################################## 427 | 428 | cat < ~/minio.properties 429 | # MINIO_ROOT_USER and MINIO_ROOT_PASSWORD sets the root account for the MinIO server. 430 | # This user has unrestricted permissions to perform S3 and administrative API operations on any resource in the deployment. 431 | # Omit to use the default values 'minioadmin:minioadmin'. 432 | # MinIO recommends setting non-default values as a best practice, regardless of environment 433 | #MINIO_ROOT_USER=myminioadmin 434 | #MINIO_ROOT_PASSWORD=minio-secret-key-change-me 435 | MINIO_ROOT_USER=minioroot 436 | MINIO_ROOT_PASSWORD=supersecret1 437 | # MINIO_VOLUMES sets the storage volume or path to use for the MinIO server. 
438 | #MINIO_VOLUMES="/mnt/data" 439 | MINIO_VOLUMES="/opt/app/minio/data" 440 | # MINIO_SERVER_URL sets the hostname of the local machine for use with the MinIO Server 441 | # MinIO assumes your network control plane can correctly resolve this hostname to the local machine 442 | # Uncomment the following line and replace the value with the correct hostname for the local machine. 443 | #MINIO_SERVER_URL="http://minio.example.net" 444 | EOF 445 | 446 | ########################################################################################## 447 | # move this file to proper directory 448 | ########################################################################################## 449 | sudo mv ~/minio.properties /etc/default/minio 450 | 451 | sudo chown root:root /etc/default/minio 452 | 453 | 454 | ########################################################################################## 455 | # start the minio server: 456 | ########################################################################################## 457 | sudo systemctl start minio.service 458 | 459 | ########################################################################################## 460 | # install the 'MinIO Client' on this server 461 | ########################################################################################## 462 | curl https://dl.min.io/client/mc/release/linux-amd64/mc \ 463 | --create-dirs \ 464 | -o $HOME/minio-binaries/mc 465 | 466 | chmod +x $HOME/minio-binaries/mc 467 | export PATH=$PATH:$HOME/minio-binaries/ 468 | 469 | 470 | ########################################################################################## 471 | # create an alias on this host for the minio cli (using the minio root credentials) 472 | ########################################################################################## 473 | mc alias set local http://127.0.0.1:9000 minioroot supersecret1 474 | 475 | ########################################################################################## 476 | # lets create a user for iceberg metadata & tables using the minio cli and the alias we just set 477 | ########################################################################################## 478 | mc admin user add local icebergadmin supersecret1! 479 | 480 | ########################################################################################## 481 | # need to add the 'readwrite' minio policy to this new user: (these are just like aws policies) 482 | ########################################################################################## 483 | mc admin policy set local readwrite user=icebergadmin 484 | 485 | ########################################################################################## 486 | # create a new alias for this admin user: 487 | ########################################################################################## 488 | mc alias set icebergadmin http://127.0.0.1:9000 icebergadmin supersecret1! 
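# (Optional, illustrative) verify the new MinIO user, its readwrite policy, and the
# alias before creating access keys and buckets below:
#   mc admin user info local icebergadmin
#   mc alias list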
489 | 490 | ########################################################################################## 491 | # create new 'Access Keys' for this user and redirect output to a file for automation later 492 | ########################################################################################## 493 | mc admin user svcacct add icebergadmin icebergadmin >> ~/minio-output.properties 494 | 495 | ########################################################################################## 496 | # create a bucket as user icebergadmin for our iceberg data 497 | ########################################################################################## 498 | mc mb icebergadmin/iceberg-data icebergadmin 499 | 500 | ########################################################################################## 501 | # let's reformat the output of access keys from an earlier step 502 | ########################################################################################## 503 | sed -i "s/Access Key: /access_key=/g" ~/minio-output.properties 504 | sed -i "s/Secret Key: /secret_key=/g" ~/minio-output.properties 505 | 506 | ########################################################################################## 507 | # let's read the update file into memory to use these values to set aws configure 508 | ########################################################################################## 509 | . ~/minio-output.properties 510 | 511 | echo 512 | echo "---------------------------------------------------------------------" 513 | echo "minio install completed..." 514 | echo "---------------------------------------------------------------------" 515 | echo 516 | ########################################################################################## 517 | # let's set up aws configure files from code (this is using the minio credentials) - The default region doesn't get used in minio 518 | ########################################################################################## 519 | aws configure set aws_access_key_id $access_key 520 | aws configure set aws_secret_access_key $secret_key 521 | aws configure set default.region us-east-1 522 | 523 | ########################################################################################## 524 | # let's test that the aws cli can list our buckets in minio: 525 | ########################################################################################## 526 | aws --endpoint-url http://127.0.0.1:9000 s3 ls 527 | 528 | echo 529 | 530 | 531 | ########################################################################################## 532 | # Create a json records file of sample customer data to be used in a lab 533 | ########################################################################################## 534 | 535 | cat < /opt/spark/input/customers.json 536 | {"last_name": "Thompson", "first_name": "Brenda", "street_address": "321 Nicole Ports Suite 204", "city": "South Lisachester", "state": "AS", "zip_code": "89409", "email": "wmoran@example.net", "home_phone": "486.884.6221x4431", "mobile": "(290)274-1564", "ssn": "483-79-5404", "job_title": "Housing manager/officer", "create_date": "2022-12-25 01:10:43", "cust_id": 10} 537 | {"last_name": "Anderson", "first_name": "Jennifer", "street_address": "1392 Cervantes Isle", "city": "Adrianaton", "state": "IN", "zip_code": "15867", "email": "michaeltodd@example.com", "home_phone": "939-630-6773", "mobile": "904.337.2023x17453", "ssn": "583-07-6994", "job_title": "Clinical embryologist", "create_date": "2022-12-03 
04:50:07", "cust_id": 11} 538 | {"last_name": "Jefferson", "first_name": "William", "street_address": "543 Matthew Courts", "city": "South Nicholaston", "state": "WA", "zip_code": "17687", "email": "peterhouse@example.net", "home_phone": "+1-599-587-9051x2899", "mobile": "(915)689-1450", "ssn": "792-52-6700", "job_title": "Land", "create_date": "2022-11-28 08:17:10", "cust_id": 12} 539 | {"last_name": "Romero", "first_name": "Jack", "street_address": "5929 Karen Ridges", "city": "Lake Richardburgh", "state": "OR", "zip_code": "78947", "email": "michellemitchell@example.net", "home_phone": "(402)664-1399x71255", "mobile": "450.580.6817x043", "ssn": "216-24-7271", "job_title": "Engineer, building services", "create_date": "2022-12-11 19:09:30", "cust_id": 13} 540 | {"last_name": "Johnson", "first_name": "Robert", "street_address": "4313 Adams Islands", "city": "Tammybury", "state": "UT", "zip_code": "07361", "email": "morrischristopher@example.com", "home_phone": "(477)888-9999", "mobile": "220-403-9274x9709", "ssn": "012-26-8650", "job_title": "Rural practice surveyor", "create_date": "2022-12-08 05:28:56", "cust_id": 14} 541 | EOF 542 | 543 | ########################################################################################## 544 | # Create another json records file to test out a Merge Query in a lab 545 | ########################################################################################## 546 | 547 | cat < /opt/spark/input/update_customers.json 548 | {"last_name": "Rogers", "first_name": "Caitlyn", "street_address": "37761 Robert Center Apt. 743", "city": "Port Matthew", "state": "MS", "zip_code": "70534", "email": "pamelacooper@example.net", "home_phone": "726-856-7295x731", "mobile": "+1-423-331-9415x66671", "ssn": "718-18-3807", "job_title": "Merchandiser, retail", "create_date": "2022-12-16 03:19:35", "cust_id": 10} 549 | {"last_name": "Williams", "first_name": "Brittany", "street_address": "820 Lopez Vista", "city": "Jordanland", "state": "NM", "zip_code": "02887", "email": "stephendawson@example.org", "home_phone": "(149)065-2341x761", "mobile": "(353)203-7938x325", "ssn": "304-90-3213", "job_title": "English as a second language teacher", "create_date": "2022-12-04 23:29:48", "cust_id": 11} 550 | {"last_name": "Gordon", "first_name": "Victor", "street_address": "01584 Hernandez Ramp Suite 822", "city": "Smithmouth", "state": "VI", "zip_code": "88806", "email": "holly51@example.com", "home_phone": "707-269-9666x8446", "mobile": "+1-868-584-1822", "ssn": "009-27-3700", "job_title": "Ergonomist", "create_date": "2022-12-22 18:03:13", "cust_id": 12} 551 | {"last_name": "Martinez", "first_name": "Shelby", "street_address": "715 Benitez Plaza", "city": "Patriciaside", "state": "MT", "zip_code": "70724", "email": "tiffanysmith@example.com", "home_phone": "854.472.8345", "mobile": "+1-187-913-4579x115", "ssn": "306-94-1636", "job_title": "Private music teacher", "create_date": "2022-11-27 16:10:42", "cust_id": 13} 552 | {"last_name": "Bridges", "first_name": "Corey", "street_address": "822 Kaitlyn Haven Apt. 
314", "city": "Port Elizabeth", "state": "OH", "zip_code": "58802", "email": "rosewayne@example.org", "home_phone": "001-809-935-9112x17961", "mobile": "+1-732-477-7876x9314", "ssn": "801-31-5673", "job_title": "Scientist, research (maths)", "create_date": "2022-12-11 23:29:52", "cust_id": 14} 553 | {"last_name": "Rocha", "first_name": "Benjamin", "street_address": "294 William Skyway", "city": "Fowlerville", "state": "WA", "zip_code": "75495", "email": "fwhite@example.com", "home_phone": "001-476-468-4403x364", "mobile": "4731036956", "ssn": "571-78-6278", "job_title": "Probation officer", "create_date": "2022-12-10 07:39:35", "cust_id": 15} 554 | {"last_name": "Lawrence", "first_name": "Jonathan", "street_address": "4610 Kelly Road Suite 333", "city": "Michaelfort", "state": "PR", "zip_code": "03033", "email": "raymisty@example.com", "home_phone": "936.011.1602x5883", "mobile": "(577)016-2546x30390", "ssn": "003-05-2317", "job_title": "Dancer", "create_date": "2022-11-27 23:44:14", "cust_id": 16} 555 | {"last_name": "Taylor", "first_name": "Thomas", "street_address": "51884 Kelsey Ridges Apt. 973", "city": "Lake Morgan", "state": "RI", "zip_code": "36056", "email": "vanggary@example.net", "home_phone": "541-784-5497x32009", "mobile": "+1-337-857-9219x83198", "ssn": "133-61-4337", "job_title": "Town planner", "create_date": "2022-12-07 12:33:45", "cust_id": 17} 556 | {"last_name": "Williamson", "first_name": "Jeffrey", "street_address": "6094 Powell Passage", "city": "Stevenland", "state": "VT", "zip_code": "88479", "email": "jwallace@example.com", "home_phone": "4172910794", "mobile": "494.361.3094x223", "ssn": "512-84-0907", "job_title": "Clinical cytogeneticist", "create_date": "2022-12-13 16:58:43", "cust_id": 18} 557 | {"last_name": "Mccullough", "first_name": "Joseph", "street_address": "7329 Santiago Point Apt. 070", "city": "Reedland", "state": "MH", "zip_code": "85316", "email": "michellecain@example.com", "home_phone": "(449)740-1390", "mobile": "(663)381-3306x19170", "ssn": "605-84-9744", "job_title": "Seismic interpreter", "create_date": "2022-12-05 05:33:56", "cust_id": 19} 558 | {"last_name": "Kirby", "first_name": "Evan", "street_address": "95959 Brown Rue Apt. 657", "city": "Lake Vanessa", "state": "MH", "zip_code": "92042", "email": "tayloralexandra@example.org", "home_phone": "342-317-5803", "mobile": "185-084-4719x39341", "ssn": "264-14-4935", "job_title": "Interpreter", "create_date": "2022-12-20 14:23:43", "cust_id": 20} 559 | {"last_name": "Pittman", "first_name": "Teresa", "street_address": "3249 Danielle Parks Apt. 472", "city": "East Ryan", "state": "ME", "zip_code": "33108", "email": "hamiltondanielle@example.org", "home_phone": "+1-814-789-0109x88291", "mobile": "(749)434-0916", "ssn": "302-61-5936", "job_title": "Medical physicist", "create_date": "2022-12-26 05:14:24", "cust_id": 21} 560 | {"last_name": "Byrd", "first_name": "Alicia", "street_address": "1232 Jenkins Pine Apt. 
472", "city": "Woodton", "state": "NC", "zip_code": "82330", "email": "shelly47@example.net", "home_phone": "001-930-450-7297x258", "mobile": "+1-968-526-2756x661", "ssn": "656-69-9593", "job_title": "Therapist, art", "create_date": "2022-12-17 18:20:51", "cust_id": 22} 561 | {"last_name": "Ellis", "first_name": "Kathleen", "street_address": "935 Kristina Club", "city": "East Maryton", "state": "AK", "zip_code": "86759", "email": "jacksonkaren@example.com", "home_phone": "001-089-194-5982x828", "mobile": "127.892.8518", "ssn": "426-13-9463", "job_title": "English as a foreign language teacher", "create_date": "2022-12-08 04:01:44", "cust_id": 23} 562 | {"last_name": "Lee", "first_name": "Tony", "street_address": "830 Elizabeth Mill Suite 184", "city": "New Heather", "state": "UT", "zip_code": "59612", "email": "vmayo@example.net", "home_phone": "001-593-666-0198", "mobile": "060.108.7218", "ssn": "048-20-6647", "job_title": "Civil engineer, consulting", "create_date": "2022-12-24 17:10:32", "cust_id": 24} 563 | EOF 564 | 565 | ########################################################################################## 566 | # Let's add some transactions for these customers for a lab 567 | ########################################################################################## 568 | cat < /opt/spark/input/transactions.json 569 | {"transact_id": "e786c399-ee9a-4053-a716-671bd456d06c", "category": "green", "barcode": "9688687184711", "item_desc": "Though evidence push.", "amount": 61.47, "transaction_date": "2022-12-31 03:52:13", "cust_id": 11} 570 | {"transact_id": "58ccab06-38fe-45ab-a105-994f8bc51e1f", "category": "maroon", "barcode": "6270293172737", "item_desc": "Hotel toward radio exactly.", "amount": 18.26, "transaction_date": "2023-01-26 23:42:58", "cust_id": 11} 571 | {"transact_id": "9f5a1c46-ac16-46c9-87fd-ff3ec4f36377", "category": "maroon", "barcode": "0000885336836", "item_desc": "West truth dog staff professor just.", "amount": 9.64, "transaction_date": "2023-01-24 15:51:44", "cust_id": 11} 572 | {"transact_id": "c37e87fd-8833-44e9-85e3-ba2cb5e32c5d", "category": "purple", "barcode": "3898859302683", "item_desc": "Half chance hard.", "amount": 20.5, "transaction_date": "2023-01-13 08:54:35", "cust_id": 11} 573 | {"transact_id": "ae165ddf-e99d-473f-a8ec-75c3235e2ca9", "category": "black", "barcode": "8835416937716", "item_desc": "Song tough born station break long.", "amount": 52.7, "transaction_date": "2023-01-24 00:04:58", "cust_id": 11} 574 | {"transact_id": "3ed16c03-607f-40a1-b446-c1b2c18b8a58", "category": "purple", "barcode": "2387695378019", "item_desc": "Cover likely dog.", "amount": 94.27, "transaction_date": "2022-12-31 11:15:18", "cust_id": 12} 575 | {"transact_id": "830e2d42-594c-4531-9256-6c7e3036f132", "category": "olive", "barcode": "1655418639701", "item_desc": "Difference major fast hear answer character.", "amount": 54.44, "transaction_date": "2023-01-03 22:01:20", "cust_id": 13} 576 | {"transact_id": "4c8db6cf-2a66-4a3a-8474-00d3db8aeb92", "category": "aqua", "barcode": "4088755032541", "item_desc": "On without probably of.", "amount": 94.67, "transaction_date": "2023-01-08 02:11:48", "cust_id": 13} 577 | {"transact_id": "ae54bcf5-250d-4076-854b-40a13cd74b7c", "category": "yellow", "barcode": "3783631322815", "item_desc": "Somebody yourself maintain only together.", "amount": 6.37, "transaction_date": "2023-01-02 09:39:39", "cust_id": 13} 578 | {"transact_id": "c3d3f77a-54ba-4503-bf7c-53db29a775e7", "category": "lime", "barcode": "9466888768004", "item_desc": 
"By fear hospital certainly.", "amount": 94.8, "transaction_date": "2023-01-05 06:37:27", "cust_id": 13} 579 | {"transact_id": "b5caf452-a44c-442d-a4cf-d88e2c08f7b3", "category": "black", "barcode": "5032052452372", "item_desc": "Imagine occur environment according more.", "amount": 62.94, "transaction_date": "2023-01-27 09:59:41", "cust_id": 14} 580 | {"transact_id": "731fd64e-74af-4364-999e-67e8cccfd6ee", "category": "gray", "barcode": "2687016061218", "item_desc": "Game cover trade discover me read.", "amount": 70.9, "transaction_date": "2022-12-30 02:23:15", "cust_id": 14} 581 | {"transact_id": "40edcc76-0ca0-4b88-990a-7e9abe400cbb", "category": "teal", "barcode": "1212133800184", "item_desc": "Form budget listen.", "amount": 31.5, "transaction_date": "2023-01-04 13:29:38", "cust_id": 14} 582 | {"transact_id": "a811b772-8149-4ba2-ace0-7b658cd45c20", "category": "teal", "barcode": "8751563802922", "item_desc": "Weight hot mean.", "amount": 51.46, "transaction_date": "2023-01-20 23:50:30", "cust_id": 16} 583 | {"transact_id": "8cc2a57f-5007-42b1-a4a2-722cf609bb76", "category": "purple", "barcode": "6267199327651", "item_desc": "Recognize ten area general.", "amount": 2.41, "transaction_date": "2023-01-19 17:20:57", "cust_id": 16} 584 | {"transact_id": "931d9ff0-c82d-49e9-bc8d-bad319b20d84", "category": "white", "barcode": "9009659885601", "item_desc": "Safe medical start receive.", "amount": 61.77, "transaction_date": "2023-01-26 20:34:36", "cust_id": 16} 585 | {"transact_id": "b0457bb2-8b72-4a1a-a247-6ca9d2c06be9", "category": "yellow", "barcode": "6453786338029", "item_desc": "Force set think cost.", "amount": 45.59, "transaction_date": "2023-01-24 11:39:20", "cust_id": 17} 586 | {"transact_id": "b189cb88-6a14-4741-9286-102c379052d4", "category": "purple", "barcode": "2036094483571", "item_desc": "Nation consumer film fact only to.", "amount": 55.86, "transaction_date": "2023-01-12 16:29:53", "cust_id": 17} 587 | {"transact_id": "c2564bb3-4485-4f2a-82e0-aa7e53cfc622", "category": "silver", "barcode": "8282187103947", "item_desc": "Sign standard pass evidence.", "amount": 38.78, "transaction_date": "2023-01-02 00:25:31", "cust_id": 18} 588 | {"transact_id": "884469c2-32ee-439c-9f8a-570b9d49b152", "category": "lime", "barcode": "8529678377198", "item_desc": "Member write create.", "amount": 82.95, "transaction_date": "2023-01-03 13:49:19", "cust_id": 18} 589 | {"transact_id": "0a722403-a7dd-4c9c-b958-95191ae841c1", "category": "green", "barcode": "6500182661487", "item_desc": "Over usually who table compare area model.", "amount": 54.1, "transaction_date": "2023-01-18 18:43:36", "cust_id": 18} 590 | {"transact_id": "2b23b8c7-28db-4204-902f-a6fd3dd1f475", "category": "navy", "barcode": "1378348043058", "item_desc": "Technology one ahead general.", "amount": 54.67, "transaction_date": "2022-12-30 00:24:16", "cust_id": 19} 591 | {"transact_id": "aacce2c5-2472-4d66-a445-3bc126745e0b", "category": "navy", "barcode": "2056653042902", "item_desc": "Speech hot letter hot.", "amount": 5.9, "transaction_date": "2023-01-08 16:16:32", "cust_id": 21} 592 | {"transact_id": "20d157be-8a47-435a-a61f-c8ab68b34c8d", "category": "blue", "barcode": "7125652103787", "item_desc": "Strong society officer bag.", "amount": 46.41, "transaction_date": "2023-01-04 20:29:32", "cust_id": 21} 593 | {"transact_id": "098478b0-d0bc-4140-b621-abe9a03a768e", "category": "fuchsia", "barcode": "8780633730896", "item_desc": "Oil stock film source.", "amount": 78.61, "transaction_date": "2023-01-26 22:36:26", "cust_id": 21} 
594 | {"transact_id": "8dbe22d8-050a-48a3-8526-b5c11230589e", "category": "navy", "barcode": "6879593096691", "item_desc": "Form affect seem side job.", "amount": 69.92, "transaction_date": "2022-12-31 21:41:30", "cust_id": 21} 595 | {"transact_id": "d27cde76-40df-4eda-9567-07aba2e2a0b8", "category": "gray", "barcode": "3376554112825", "item_desc": "Inside page bag.", "amount": 76.63, "transaction_date": "2023-01-10 20:53:23", "cust_id": 22} 596 | {"transact_id": "a4b87dc7-f401-4f13-9cfd-4858b0d575c0", "category": "yellow", "barcode": "0922971679088", "item_desc": "Guy more national.", "amount": 2.55, "transaction_date": "2023-01-25 14:29:42", "cust_id": 22} 597 | {"transact_id": "48ce0556-fc57-4748-bd79-a146cd32147b", "category": "aqua", "barcode": "8702162059583", "item_desc": "Sometimes president response want.", "amount": 16.91, "transaction_date": "2023-01-03 12:00:34", "cust_id": 22} 598 | {"transact_id": "b2dd711c-4d23-4c99-b980-bd7afd1ef62a", "category": "purple", "barcode": "0983651241193", "item_desc": "Born under focus budget east free.", "amount": 53.43, "transaction_date": "2023-01-01 16:42:59", "cust_id": 22} 599 | {"transact_id": "913475de-32bb-4d80-aed0-1d9631dd0677", "category": "silver", "barcode": "9827839337951", "item_desc": "Address operation hold.", "amount": 55.79, "transaction_date": "2023-01-04 19:19:01", "cust_id": 23} 600 | {"transact_id": "885ccb18-3d19-48aa-9ad3-095a562fe0a7", "category": "navy", "barcode": "5176084629125", "item_desc": "Thus second hospital development ball.", "amount": 65.89, "transaction_date": "2023-01-27 15:26:02", "cust_id": 24} 601 | {"transact_id": "62cb9752-6da5-404a-a0a3-08192731db90", "category": "blue", "barcode": "8670289379405", "item_desc": "Prevent great yes travel where real.", "amount": 51.36, "transaction_date": "2023-01-11 23:35:22", "cust_id": 24} 602 | {"transact_id": "04f07ef8-8453-40af-9d3e-da6e9693919b", "category": "olive", "barcode": "2009850879093", "item_desc": "Weight spring baby be thought degree.", "amount": 27.82, "transaction_date": "2023-01-22 13:56:49", "cust_id": 24} 603 | EOF 604 | 605 | ######################################################################################### 606 | # add items to path for future use 607 | ######################################################################################### 608 | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 609 | export SPARK_HOME=/opt/spark 610 | 611 | echo "" >> ~/.profile 612 | echo "# set path variables here:" >> ~/.profile 613 | echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> ~/.profile 614 | echo "export SPARK_HOME=/opt/spark" >> ~/.profile 615 | 616 | echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$JAVA_HOME/bin:$HOME/minio-binaries" >> ~/.profile 617 | 618 | # let's make this visible 619 | . 
~/.profile 620 | 621 | 622 | ######################################################################################### 623 | # install docker ce (needed for dbz server build with maven) 624 | ######################################################################################### 625 | sudo apt install -y apt-transport-https ca-certificates curl software-properties-common 626 | curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - 627 | sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable" 628 | apt-cache policy docker-ce 629 | sudo apt install -y docker-ce 630 | sudo chmod 666 /var/run/docker.sock 631 | sudo usermod -aG docker ${USER} 632 | 633 | echo 634 | echo "---------------------------------------------------------------------" 635 | echo "install of docker-ce complete..." 636 | echo "---------------------------------------------------------------------" 637 | echo 638 | ######################################################################################### 639 | # install debezium server items 640 | ######################################################################################### 641 | cd ~ 642 | git clone https://github.com/memiiso/debezium-server-iceberg.git 643 | cd debezium-server-iceberg 644 | 645 | 646 | echo 647 | echo "---------------------------------------------------------------------" 648 | echo "starting maven build of Debezium Server..." 649 | echo "---------------------------------------------------------------------" 650 | echo 651 | mvn -Passembly -Dmaven.test.skip package 652 | 653 | 654 | echo 655 | echo "---------------------------------------------------------------------" 656 | echo "maven build of Debezium Server complete..." 657 | echo "---------------------------------------------------------------------" 658 | echo 659 | 660 | echo 661 | echo "---------------------------------------------------------------------" 662 | echo "configure Debezium Server items..." 663 | echo "---------------------------------------------------------------------" 664 | echo 665 | cp ~/debezium-server-iceberg/debezium-server-iceberg-dist/target/debezium-server-iceberg-dist-0.3.0-SNAPSHOT.zip ~ 666 | 667 | unzip ~/debezium-server-iceberg-dist*.zip -d ~/appdist 668 | 669 | mkdir -p ~/debezium-server-iceberg/data 670 | 671 | 672 | ######################################################################################### 673 | # configure our dbz source-sink.properties file 674 | ######################################################################################### 675 | #sudo cp ~/data_origination_workshop/dbz_server/application.properties ~/appdist/debezium-server-iceberg/conf/ 676 | cp ~/data_origination_workshop/dbz_server/application.properties ~/appdist/debezium-server-iceberg/conf/ 677 | 678 | ########################################################################################## 679 | # let's update the properties files to use our minio keys. 680 | ########################################################################################## 681 | 682 | . 
~/minio-output.properties 683 | 684 | sed -e "s,,$access_key,g" -i ~/appdist/debezium-server-iceberg/conf/application.properties 685 | sed -e "s,,$secret_key,g" -i ~/appdist/debezium-server-iceberg/conf/application.properties 686 | 687 | # change ownership 688 | #sudo chown datagen:datagen -R /home/datagen/appdist 689 | 690 | # remove the example file: 691 | rm /home/datagen/appdist/debezium-server-iceberg/conf/application.properties.example 692 | 693 | # remove the zip file: 694 | rm /home/datagen/debezium-server-iceberg-dist-*-SNAPSHOT.zip 695 | 696 | echo 697 | echo "---------------------------------------------------------------------" 698 | echo "Debezium Server setup complete..." 699 | echo "---------------------------------------------------------------------" 700 | echo 701 | 702 | ######################################################################################### 703 | # let's start our spark master and workers. 704 | ######################################################################################### 705 | ######################################################################################### 706 | # need to change the spark master gui port from 8080 to 8085 to avoid conflict with redpanda 707 | ######################################################################################### 708 | 709 | echo 710 | echo "---------------------------------------------------------------------" 711 | echo "configure Spark and start master and worker services..." 712 | echo "---------------------------------------------------------------------" 713 | echo 714 | 715 | ######################################################################################### 716 | # need to change the default ports for master and workers to avoid conflicts with red panda and kafka connect 717 | ######################################################################################### 718 | sed -e 's,SPARK_MASTER_WEBUI_PORT=8080,SPARK_MASTER_WEBUI_PORT=8085,g' -i /opt/spark/sbin/start-master.sh 719 | sed -e 's,SPARK_WORKER_WEBUI_PORT=8081,SPARK_WORKER_WEBUI_PORT=8090,g' -i /opt/spark/sbin/start-worker.sh 720 | 721 | echo "starting spark master..." 722 | /opt/spark/sbin/start-master.sh 723 | echo 724 | echo "starting spark worker..." 725 | /opt/spark/sbin/start-worker.sh spark://$(hostname -f):7077 726 | echo 727 | 728 | ######################################################################################### 729 | # setup complete. 730 | ######################################################################################### 731 | figlet -f small -w 300 "Setup is complete!"'!' | cowsay -n -f "$(ls -1 /usr/share/cowsay/cows | grep "\.cow" | sed 's/\.cow//' | egrep -v "bong|head-in|sodomized|telebears" | shuf -n 1)" 732 | 733 | ######################################################################################### 734 | # source this to set our new variables in current session 735 | ######################################################################################### 736 | bash -l 737 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | --- 2 | Title: The Journey to Apache Iceberg with Red Panda & Debezium 3 | Author: Tim Lepple 4 | Create Date: 03.04.2023 5 | Last Updated: 3.15.2024 6 | Comments: This repo will set up a data integration platform to evaluate some technology. 
7 | Tags: Iceberg | Spark | Redpanda | PostgreSQL | Kafka Connect | Python | Debezium | Minio 8 | --- 9 | 10 | 11 | --- 12 | --- 13 | 14 | --- 15 | --- 16 | 17 | 18 | # The Journey to Apache Iceberg with Red Panda & Debezium 19 | --- 20 | --- 21 | 22 | ## Objective: 23 | The goal of this workshop was to evaluate [Redpanda](https://redpanda.com/) and Kafka Connect (with the Debezium CDC plugin). Set up a data generator that streams events directly into Redpanda and also into a traditional database platform and deliver it to an [Apache Iceberg](https://iceberg.apache.org/) data lake. 24 | 25 | I took the time to install these components manually on a traditional Linux server and then wrote the setup script in this repo so others could try it out too. Please take the time to review that script [`setup_datagen.sh`](./setup_data_origination_apps.sh). Hopefully, it will become a reference for you one day if you use any of this technology. 26 | 27 | In this workshop, we will integrate this data platform and stream data from here into our Apache Iceberg data lake built in a previous workshop (all of those components will be installed here too). For step-by-step instructions on working with the Iceberg components, please check out my [Apache Iceberg Workshop](https://github.com/tlepple/iceberg-intro-workshop) for more details. All of the tasks from that workshop can be run on this new server. 28 | 29 | --- 30 | --- 31 | 32 | # Highlights: 33 | 34 | --- 35 | 36 | The setup script will build and install our `Data Integration Platform` onto a single Linux instance. It installs a data-generating application, a local SQL database (PostgreSQL), a Red Panda instance, a stand-alone Kafka Connect instance, a Debezium plugin for Kafka Connect, a Debezium Server, Minio, Spark and Apache Iceberg. In addition, it will configure them all to work together. 37 | 38 | --- 39 | --- 40 | 41 | ### Pre-Requisites: 42 | 43 | --- 44 | 45 | * I built this on a new install of Ubuntu Server 46 | * Version: 20.04.5 LTS 47 | * Instance Specs: (min 4 core w/ 16 GB ram & 30 GB of disk) -- add more RAM if you have it to spare. 48 | * If you are going to test this in `AWS`, it ran smoothly for me using AMI: `ami-03a311cadf2d2a6f8` in region: `us-east-2` with a instance type of: `t3.xlarge` 49 | 50 | --- 51 | ### Create an OS User `Datagen` 52 | 53 | * This user account will be the owner of all the objects that get installed 54 | * Security is not in place for any of this workshop. 55 | 56 | ``` 57 | ########################################################################################## 58 | # create an osuser datagen and add to sudo file 59 | ########################################################################################## 60 | sudo useradd -m -s /usr/bin/bash datagen 61 | 62 | echo supersecret1 > passwd.txt 63 | echo supersecret1 >> passwd.txt 64 | 65 | sudo passwd datagen < passwd.txt 66 | 67 | rm -f passwd.txt 68 | sudo usermod -aG sudo datagen 69 | ########################################################################################## 70 | # let's complete this install as this user: 71 | ########################################################################################## 72 | # password: supersecret1 73 | su - datagen 74 | ``` 75 | --- 76 | 77 | ### Install Git tools and pull this repo. 
78 | * ssh into your new Ubuntu 20.04 instance and run the below command: 79 | 80 | --- 81 | ``` 82 | sudo apt-get install git -y 83 | 84 | cd ~ 85 | git clone https://github.com/tlepple/data_origination_workshop.git 86 | ``` 87 | 88 | --- 89 | 90 | ### Start the build: 91 | 92 | ``` 93 | # run it: 94 | . ~/data_origination_workshop/setup_datagen.sh 95 | ``` 96 | * This should complete within 10 minutes. 97 | --- 98 | 99 | --- 100 | ### Workshop One Refresher: 101 | 102 | If you didn't complete my first workshop and need a primer on Iceberg, you can complete that work again on this platform by following this guide: [Workshop 1 Exercises](./workshop1_revisit.md). If you are already familiar with those items please proceed. A later step has all of that workshop automated if you prefer. 103 | 104 | --- 105 | 106 | ### What is Redpanda 107 | * Information in this section was gathered from their website. You can find more detailed information about their platform here: [Red Panda](https://redpanda.com/platform) 108 | --- 109 | 110 | Redpanda is an event streaming platform: it provides the infrastructure for streaming real-time data. It has been proven to be 10x faster and 6x lower in total costs. It is also JVM-free, ZooKeeper®-free, Jepsen-tested and source available. 111 | 112 | Producers are client applications that send data to Redpanda in the form of events. Redpanda safely stores these events in sequence and organizes them into topics, which represent a replayable log of changes in the system. 113 | 114 | Consumers are client applications that subscribe to Redpanda topics to asynchronously read events. Consumers can store, process, or react to events. 115 | 116 | Redpanda decouples producers from consumers to allow for asynchronous event processing, event tracking, event manipulation, and event archiving. Producers and consumers interact with Redpanda using the Apache Kafka® API. 117 | 118 | | Event-driven architecture (Redpanda) | Message-driven architecture | 119 | | ----------- | ----------- | 120 | | Producers send events to an event processing system (Redpanda) that acknowledges receipt of the write. This guarantees that the write is durable within the system and can be read by multiple consumers. | Producers send messages directly to each consumer. The producer must wait for acknowledgment that the consumer received the message before it can continue with its processes. | 121 | 122 | 123 | Event streaming lets you extract value from each event by analyzing, mining, or transforming it for insights. You can: 124 | 125 | * Take one event and consume it in multiple ways. 126 | * Replay events from the past and route them to new processes in your application. 127 | * Run transformations on the data in real-time or historically. 128 | * Integrate with other event processing systems that use the Kafka API. 129 | 130 | 131 | #### Redpanda differentiators: 132 | Redpanda is less complex and less costly than any other commercial mission-critical event streaming platform. It's fast, it's easy, and it keeps your data safe. 133 | 134 | * Redpanda is designed for maximum performance on any data streaming workload. 135 | 136 | * It can scale up to use all available resources on a single machine and scale out to distribute performance across multiple nodes. Built on C++, Redpanda delivers greater throughput and up to 10x lower p99 latencies than other platforms. This enables previously-unimaginable use cases that require high throughput, low latency, and a minimal hardware footprint. 
137 | 138 | * Redpanda is packaged as a single binary: it doesn't rely on any external systems. 139 | 140 | * It's compatible with the Kafka API, so it works with the full ecosystem of tools and integrations built on Kafka. Redpanda can be deployed on bare metal, containers, or virtual machines in a data center or in the cloud. Redpanda Console also makes it easy to set up, manage, and monitor your clusters. Additionally, Tiered Storage lets you offload log segments to cloud storage in near real-time, providing infinite data retention and topic recovery. 141 | 142 | * Redpanda uses the Raft consensus algorithm throughout the platform to coordinate writing data to log files and replicating that data across multiple servers. 143 | 144 | * Raft facilitates communication between the nodes in a Redpanda cluster to make sure that they agree on changes and remain in sync, even if a minority of them are in a failure state. This allows Redpanda to tolerate partial environmental failures and deliver predictable performance, even at high loads. 145 | 146 | * Redpanda provides data sovereignty. 147 | 148 | --- 149 | --- 150 | ### Hands-On Workshop begins here: 151 | --- 152 | --- 153 | 154 | #### Explore the Red Panda CLI tool `RPK` 155 | * Redpanda Keeper `rpk` is Redpanda's command line interface (CLI) utility. Detailed documentation of the CLI can be explored further here: [Redpanda Keeper Commands](https://docs.redpanda.com/docs/reference/rpk/) 156 | 157 | ##### Create our first Redpanda topic with the CLI: 158 | * run this from a terminal window: 159 | ``` 160 | # Let's create a topic with RPK 161 | rpk topic create movie_list 162 | ``` 163 | #### Start a Redpanda `Producer` using the `rpk` CLI to add messages: 164 | * this will open a producer session and await your input until you close it with ` + d` 165 | ``` 166 | rpk topic produce movie_list 167 | ``` 168 | 169 | #### Add some messages to the `movie_list` topic: 170 | * The producer will appear to be hung in the terminal window. It is really just waiting for you to type in a message and hit ``. 171 | 172 | 173 | ###### Entry 1: 174 | ``` 175 | Top Gun Maverick 176 | ``` 177 | ###### Entry 2: 178 | ``` 179 | Star Wars - Return of the Jedi 180 | ``` 181 | #### Expected Output: 182 | ``` 183 | Produced to partition 0 at offset 0 with timestamp 1675085635701. 184 | Star Wars - Return of the Jedi 185 | Produced to partition 0 at offset 1 with timestamp 1675085644895. 186 | ``` 187 | 188 | ##### Exit the producer: ` + d` 189 | 190 | #### View these messages from Redpanda `Consumer` using the `rpk` CLI: 191 | 192 | ``` 193 | rpk topic consume movie_list --num 2 194 | ``` 195 | 196 | --- 197 | 198 | #### Expected Output: 199 | 200 | ``` 201 | { 202 | "topic": "movie_list", 203 | "value": "Top Gun Maverick", 204 | "timestamp": 1675085635701, 205 | "partition": 0, 206 | "offset": 0 207 | } 208 | { 209 | "topic": "movie_list", 210 | "value": "Star Wars - Return of the Jedi", 211 | "timestamp": 1675085644895, 212 | "partition": 0, 213 | "offset": 1 214 | } 215 | 216 | ``` 217 | --- 218 | --- 219 | 220 | ## Explore the Red Panda GUI: 221 | * Open a browser and navigate to your host ip address: `http:\\:8888` This will open the Red Panda GUI. 222 | * This is not the standard port for the Redpanda Console. 
It has been modified to avoid confilcts with other tools used in this workshop 223 | 224 | --- 225 | --- 226 | 227 | ![](./images/panda_view_topics.png) 228 | 229 | --- 230 | 231 | #### We can delete this topic from the `rpk` CLI: 232 | 233 | ``` 234 | rpk topic delete movie_list 235 | ``` 236 | --- 237 | --- 238 | 239 | ### Data Generator: 240 | --- 241 | 242 | I have written a data generator CLI application and included it in this workshop to simplify creating some realistic data for us to explore. We will use this data generator application to stream some realistic data directly into some topics (and later into a database). The data generator is written in python and uses the component [Faker](https://faker.readthedocs.io/en/master/). I encourage you to look at the code here if you want to look deeper into it. [Data Generator Items](./datagen) 243 | 244 | --- 245 | 246 | ##### Let's create some topics for our data generator using the CLI: 247 | ``` 248 | rpk topic create dgCustomer 249 | rpk topic create dgTxn 250 | ``` 251 | --- 252 | ##### Console view of our `Topics`: 253 | ![](./images/panda_view__dg_load_topics.png) 254 | 255 | --- 256 | --- 257 | 258 | ##### Data Generator Notes: 259 | --- 260 | 261 | The data generator app in this section accepts 3 integer arguments: 262 | * An integer value for the `customer key`. 263 | * An integer value for the `N` number of groups to produce in small batches. 264 | * An integer value for `N` number of times to loop until it will exit the script. 265 | 266 | --- 267 | ##### Call the `Data Generator` to stream some messages to our topics: 268 | --- 269 | 270 | ``` 271 | cd ~/datagen 272 | 273 | # start the script: 274 | python3 redpanda_dg.py 10 3 2 275 | ``` 276 | 277 | ##### Sample Output: 278 | 279 | This will load sample JSON data into our two new topics and write out a copy of those records to your terminal that looks something like this: 280 | 281 | --- 282 | 283 | ``` 284 | {"last_name": "Mcmillan", "first_name": "Linda", "street_address": "7471 Charlotte Fall Suite 835", "city": "Lake Richardborough", "state": "OH", "zip_code": "25649", "email": "tim47@example.org", "home_phone": "001-133-135-5972", "mobile": "001-942-819-7717", "ssn": "321-16-7039", "job_title": "Tourism officer", "create_date": "2022-12-19 20:45:34", "cust_id": 10} 285 | {"last_name": "Hatfield", "first_name": "Denise", "street_address": "5799 Solis Isle", "city": "Josephbury", "state": "LA", "zip_code": "61947", "email": "lhernandez@example.org", "home_phone": "(110)079-8975x48785", "mobile": "976.262.7268", "ssn": "185-93-0904", "job_title": "Engineer, chemical", "create_date": "2022-12-31 00:29:36", "cust_id": 11} 286 | {"last_name": "Adams", "first_name": "Zachary", "street_address": "6065 Dawn Inlet Suite 631", "city": "East Vickiechester", "state": "MS", "zip_code": "52115", "email": "fgrimes@example.com", "home_phone": "001-445-395-1773x238", "mobile": "(071)282-1174", "ssn": "443-22-3631", "job_title": "Maintenance engineer", "create_date": "2022-12-07 20:40:25", "cust_id": 12} 287 | Customer Done. 
288 | 289 | 290 | {"transact_id": "020d5f1c-741d-40b0-8b2a-88ff2cdc0d9a", "category": "teal", "barcode": "5178387219027", "item_desc": "Government training especially.", "amount": 85.19, "transaction_date": "2023-01-07 21:24:17", "cust_id": 10} 291 | {"transact_id": "af9b7e7e-9068-4772-af7e-a8cb63bf555f", "category": "aqua", "barcode": "5092525324087", "item_desc": "Take study after catch.", "amount": 82.28, "transaction_date": "2023-01-18 01:13:13", "cust_id": 10} 292 | {"transact_id": "b11ae666-b85c-4a86-9fbe-8f4fddd364df", "category": "purple", "barcode": "3527261055442", "item_desc": "Likely age store hold.", "amount": 11.8, "transaction_date": "2023-01-26 01:15:46", "cust_id": 10} 293 | {"transact_id": "e968daad-6c14-475f-a183-1afec555dd5f", "category": "olive", "barcode": "7687223414666", "item_desc": "Performance call myself send.", "amount": 67.48, "transaction_date": "2023-01-25 01:51:05", "cust_id": 10} 294 | {"transact_id": "d171c8d7-d099-4a41-bf23-d9534b711371", "category": "teal", "barcode": "9761406515291", "item_desc": "Charge no when.", "amount": 94.57, "transaction_date": "2023-01-05 12:09:58", "cust_id": 11} 295 | {"transact_id": "2297de89-c731-42f1-97a6-98f6b50dd91a", "category": "lime", "barcode": "6484138725655", "item_desc": "Little unit total money raise.", "amount": 47.88, "transaction_date": "2023-01-13 08:16:24", "cust_id": 11} 296 | {"transact_id": "d3e08d65-7806-4d03-a494-6ec844204f64", "category": "black", "barcode": "9827295498272", "item_desc": "Yeah claim city threat approach our.", "amount": 45.83, "transaction_date": "2023-01-07 20:29:59", "cust_id": 11} 297 | {"transact_id": "97cf1092-6f03-400d-af31-d276eff05ecf", "category": "silver", "barcode": "2072026095184", "item_desc": "Heart table see share fish.", "amount": 95.67, "transaction_date": "2023-01-12 19:10:11", "cust_id": 11} 298 | {"transact_id": "11da28af-e463-4f7c-baf2-fc0641004dec", "category": "blue", "barcode": "3056115432639", "item_desc": "Writer exactly single toward same.", "amount": 9.33, "transaction_date": "2023-01-29 02:49:30", "cust_id": 12} 299 | {"transact_id": "c9ebc8a5-3d1a-446e-ac64-8bdd52a1ce36", "category": "fuchsia", "barcode": "6534191981175", "item_desc": "Morning who lay yeah travel use.", "amount": 73.2, "transaction_date": "2023-01-21 02:25:02", "cust_id": 12} 300 | Transaction Done. 301 | 302 | ``` 303 | --- 304 | 305 | #### Explore messages in the Red Panda Console from a browser 306 | * `http:\\:8888` Make sure to click the `Topics` tab on the left side of our Console Application: 307 | --- 308 | ##### Click on the topic `dgCustomer` from the list. 309 | 310 | --- 311 | 312 | ![](./images/topic_customer_view.png) 313 | 314 | --- 315 | 316 | ##### Click on the topic '+' icon under the `Value` column to see the record details of a message. 317 | 318 | --- 319 | 320 | ![](./images/detail_view_of_cust_msg.png) 321 | 322 | --- 323 | --- 324 | ## Explore Change Data Capture (CDC) via `Kafka Connect` and `Debezium` 325 | 326 | --- 327 | 328 | ##### Define Change Data Capture (CDC): 329 | 330 | Change Data Capture (CDC) is a database technique used to track and record changes made to data in a database. The changes are captured as soon as they occur and stored in a separate log or table, allowing applications to access the most up-to-date information without having to perform a full database query. CDC is often used for real-time data integration and data replication, enabling organizations to maintain a consistent view of their data across multiple systems. 
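To make that concrete, a CDC event typically pairs the row's state before and after a change with a little metadata about the operation and its origin. The sketch below is a simplified, hypothetical envelope written as plain Python (real Debezium payloads carry more metadata, and exact field names vary by tool):

```
# Simplified, hypothetical CDC event for a single row change.
# Real Debezium events also include schema info, LSNs, transaction ids, etc.
cdc_event = {
    "op": "u",                                   # c = create, u = update, d = delete, r = snapshot read
    "source": {"db": "datagen", "table": "customer", "ts_ms": 1677080674193},
    "before": {"cust_id": 10, "city": "Lake Richardborough", "last_name": "Mcmillan"},
    "after":  {"cust_id": 10, "city": "North Kimberly",      "last_name": "Mcmillan"},
}

# A downstream consumer can translate each event into an upsert or a delete:
def apply_event(event):
    if event["op"] in ("c", "r", "u"):
        print("UPSERT", event["after"])
    elif event["op"] == "d":
        print("DELETE", event["before"])

apply_event(cdc_event)
```

Later in this workshop you will see real versions of these fields (`before`, `after`, `__op`, and so on) in the Kafka Connect messages and in the Iceberg CDC table.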
331 | 332 | --- 333 | 334 | ##### Define `Kafka Connect`: 335 | 336 | Kafka Connect is a tool for scalable and reliable data import/export between Apache Kafka and other data systems. It allows you to integrate Kafka or Red Panda with sources such as databases, key-value stores, and file systems, as well as with sinks such as data warehouses and NoSQL databases. Kafka Connect provides pre-built connectors for popular data sources and also supports custom connectors developed by users. It uses the publish-subscribe model of Kafka to ensure that data is transported between systems in a fault-tolerant and scalable manner. 337 | 338 | --- 339 | 340 | ##### Define `Debezium`: 341 | 342 | Debezium is an open-source change data capture (CDC) platform that helps to stream changes from databases such as MySQL, PostgreSQL, and MongoDB into Red Panda and Apache Kafka, among other data sources and sinks. Debezium is designed to be used for real-time data streaming and change data capture for applications, data integration, and analytics. This component is a must for getting at legacy data in an efficient manner. 343 | 344 | --- 345 | 346 | ##### Why use these tools together? 347 | 348 | By combining CDC with Kafa Connect (and using the Debezium plugin) we easily roll out a new system that could eliminate expensive legacy solutions for extracting data from databases and replicating them to a modern `Data Lake`. This approach requires very little configuration and will have a minimal performance impact on your legacy databases. It will also allow you to harness data in your legacy applications and implement new real-time streaming applications to gather insights that were previously very difficult and expensive to get at. 349 | 350 | --- 351 | --- 352 | 353 | #### Integrate PostgreSQL with Kafka Connect: 354 | 355 | In these next few exercises, we will load data into a SQL database and configure Kafka Connect to extract the CDC records and stream them to a new topic in Red Panda. 356 | 357 | --- 358 | --- 359 | 360 | #### Data Generator to load data into PostgreSQL: 361 | 362 | There is a second data generator application and we will use it to stream JSON records and load them directly into a Postgresql database. 363 | 364 | --- 365 | --- 366 | 367 | ##### Data Generator Notes for stream to PostgreSQL: 368 | --- 369 | This data generator application accepts 2 integer arguments: 370 | * An integer value for the starting `customer key`. 371 | * An integer value for `N` number of records to produce and load to the database. 
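Once you have run the generator in the next step, you can spot-check what actually landed in PostgreSQL straight from Python. This is an optional sketch using `psycopg2` (the same driver the generator uses); it assumes the rows land in a `datagen.customer` table, which is inferred from the CDC topic name you will see later (`pg_datagen2panda.datagen.customer`) rather than confirmed here:

```
# Optional sanity check: run this after the generator in the next step completes.
# The table name datagen.customer is an assumption inferred from the CDC topic name.
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", database="datagen",
                        user="datagen", password="supersecret1")
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM datagen.customer;")
    print("customer rows:", cur.fetchone()[0])
conn.close()
```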
372 | 373 | ##### Call the Data Generator: 374 | 375 | ``` 376 | cd ~/datagen 377 | 378 | # start the script: 379 | python3 pg_upsert_dg.py 10 4 380 | 381 | ``` 382 | 383 | ##### Sample Output: 384 | --- 385 | ``` 386 | Connection Established 387 | {"last_name": "Carson", "first_name": "Aaron", "street_address": "124 Campbell Overpass", "city": "Cummingsburgh", "state": "MT", "zip_code": "76816", "email": "aaron08@example.net", "home_phone": "786-888-8409x21666", "mobile": "001-737-014-7684x1271", "ssn": "394-84-0730", "job_title": "Tourist information centre manager", "create_date": "2022-12-04 00:00:13", "cust_id": 10} 388 | {"last_name": "Allen", "first_name": "Kristen", "street_address": "00782 Richard Freeway", "city": "East Josephfurt", "state": "NJ", "zip_code": "87309", "email": "xwyatt@example.com", "home_phone": "085-622-1720x88354", "mobile": "4849824808", "ssn": "130-35-4851", "job_title": "Psychologist, occupational", "create_date": "2022-12-23 14:33:56", "cust_id": 11} 389 | {"last_name": "Knight", "first_name": "William", "street_address": "1959 Coleman Drives", "city": "Williamsville", "state": "OH", "zip_code": "31621", "email": "farrellchristopher@example.org", "home_phone": "(572)744-6444x306", "mobile": "+1-587-017-1677", "ssn": "797-80-6749", "job_title": "Visual merchandiser", "create_date": "2022-12-11 03:57:01", "cust_id": 12} 390 | {"last_name": "Joyce", "first_name": "Susan", "street_address": "137 Butler Via Suite 789", "city": "West Linda", "state": "IN", "zip_code": "63240", "email": "jeffreyjohnson@example.org", "home_phone": "+1-422-918-6473x3418", "mobile": "483-124-5433x956", "ssn": "435-50-2408", "job_title": "Gaffer", "create_date": "2022-12-14 01:20:02", "cust_id": 13} 391 | Records inserted successfully 392 | PostgreSQL connection is closed 393 | script complete! 394 | 395 | ``` 396 | --- 397 | --- 398 | ### Configure Integration of `Redpanda` and `Kafka Connect` 399 | --- 400 | --- 401 | 402 | #### Kafka Connect Setup: 403 | 404 | In the setup script, we downloaded and installed all the components and needed jar files that Kafka Connect will use. Please review that setup file again if you want a refresher. The script also configured the settings for our integration of PostgreSQL with Red Panda. Let's review the configuration files that make it all work. 
405 | 406 | --- 407 | 408 | ##### The property file that will link Kafka Connect to Red Panda is located here: 409 | * make sure you are logged into the OS as user `datagen` with a password of `supersecret1` 410 | 411 | ``` 412 | 413 | cd ~/kafka_connect/configuration 414 | cat connect.properties 415 | ``` 416 | --- 417 | 418 | ##### Expected output: 419 | 420 | ``` 421 | #Kafka broker addresses 422 | bootstrap.servers=localhost:9092 423 | 424 | #Cluster level converters 425 | #These apply when the connectors don't define any converter 426 | key.converter=org.apache.kafka.connect.json.JsonConverter 427 | value.converter=org.apache.kafka.connect.json.JsonConverter 428 | 429 | #JSON schemas enabled to false in cluster level 430 | key.converter.schemas.enable=true 431 | value.converter.schemas.enable=true 432 | 433 | #Where to keep the Connect topic offset configurations 434 | offset.storage.file.filename=/tmp/connect.offsets 435 | offset.flush.interval.ms=10000 436 | 437 | #Plugin path to put the connector binaries 438 | plugin.path=:~/kafka_connect/plugins/debezium-connector-postgres/ 439 | 440 | ``` 441 | 442 | 443 | --- 444 | 445 | ##### The property file that will link Kafka Connect to PostgreSQL is located here: 446 | 447 | ``` 448 | cd ~/kafka_connect/configuration 449 | cat pg-source-connector.properties 450 | ``` 451 | --- 452 | 453 | ##### Expected output: 454 | 455 | ``` 456 | connector.class=io.debezium.connector.postgresql.PostgresConnector 457 | offset.storage=org.apache.kafka.connect.storage.FileOffsetBackingStore 458 | offset.storage.file.filename=offset.dat 459 | offset.flush.interval.ms=5000 460 | name=postgres-dbz-connector 461 | database.hostname=localhost 462 | database.port=5432 463 | database.user=datagen 464 | database.password=supersecret1 465 | database.dbname=datagen 466 | schema.include.list=datagen 467 | plugin.name=pgoutput 468 | topic.prefix=pg_datagen2panda 469 | 470 | ``` 471 | 472 | --- 473 | --- 474 | ### Start the `Kafka Connect` processor: 475 | * This will start our processor and pull all the CDC records out of the PostgreSQL database for our 'customer' table and ship them to a new Redpanda topic. 476 | * This process will run and pull the messages and then sleep until new messages get written to the originating database. To exit out of the processor when it completes, use the commands ` + c`. 477 | --- 478 | 479 | ##### Start Kafka Connect: 480 | * make sure you are logged into OS as user `datagen` with a password of `supersecret1` 481 | 482 | ``` 483 | cd ~/kafka_connect/configuration 484 | 485 | export CLASSPATH=/home/datagen/kafka_connect/plugins/debezium-connector-postgres/* 486 | ../kafka_2.13-3.3.2/bin/connect-standalone.sh connect.properties pg-source-connector.properties 487 | ``` 488 | --- 489 | 490 | ##### Expected Output: 491 | 492 | In this link, you can see the expected sample output: [`connect.output`](./sample_output/connect.output) 493 | 494 | 495 | 496 | --- 497 | ##### Explore the `Connect` tab in the Redpanda console from a browser: 498 | * This view is only available when `Connect` processes are running. 
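If you prefer a terminal view while Kafka Connect is running, you can also tail the connector's output topic with a short `kafka-python` consumer, similar to the workshop's `comsume_topic_dgCustomer.py` script. This is an optional sketch: the topic name combines the `topic.prefix`, schema, and table (`pg_datagen2panda.datagen.customer`), and the broker address assumes Redpanda's default listener on `localhost:9092`:

```
# Optional: tail the CDC topic while Kafka Connect is running (Ctrl + C to stop).
# Broker address and topic name are assumptions based on the configs shown above.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pg_datagen2panda.datagen.customer",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for msg in consumer:
    if msg.value is None:            # skip tombstone records
        continue
    # with JSON schemas enabled, the change data sits under the 'payload' key
    payload = msg.value.get("payload", msg.value)
    print(payload.get("op"), payload.get("after"))
```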
499 | --- 500 | ![](./images/console_view_run_connect.png) 501 | --- 502 | 503 | ##### Exit out of Kafka Connect from the terminal with: ` + c` 504 | 505 | --- 506 | --- 507 | #### Explore our new Redpanda topic `pg_datagen2panda.datagen.customer` in the console from a browser: 508 | 509 | --- 510 | ##### Console View of topic: 511 | 512 | ![](./images/panda_topic_view_connect_topic.png) 513 | 514 | --- 515 | --- 516 | ##### Click on the topic `pg_datagen2panda.datagen.customer` from the list. 517 | 518 | --- 519 | 520 | ![](./images/connect_output_summary_msg.png) 521 | 522 | --- 523 | 524 | ##### Click on the topic '+' icon under the `Value` column to see the record details of a message. 525 | 526 | --- 527 | 528 | ![](./images/connect_ouput_detail_msg.png) 529 | 530 | #### Kafka Connect Observations: 531 | 532 | --- 533 | --- 534 | As you can see, this message contains the values of the record `before` and `after` it was inserted into our PostgreSQL database. In this next section, we explore loading all of the data currently in our Redpanda topics and delivering it into our Iceberg data lake. 535 | 536 | --- 537 | --- 538 | # Integration with our Apache Iceberg Data Lake Exercises 539 | --- 540 | --- 541 | 542 | #### Load Data to Iceberg with Spark 543 | 544 | 545 | --- 546 | 547 | In this shell script [`stream_customer_ddl_script.sh`](./spark_items/stream_customer_ddl_script.sh) we will launch a `spark-sql` cli and run the DDL code [`stream_customer_ddl.sql`](./spark_items/stream_customer_ddl.sql) to create our `icecatalog.icecatalog.stream_customer` table in iceberg. 548 | 549 | ``` 550 | . /opt/spark/sql/stream_customer_ddl_script.sh 551 | ``` 552 | --- 553 | 554 | In this spark streaming job [`consume_panda_2_iceberg_customer.py`](./datagen/consume_panda_2_iceberg_customer.py) we will consume our messages loaded into topic `dgCustomer` with our data generator and append them into our `icecatalog.icecatalog.stream_customer` table in Iceberg. 555 | 556 | ``` 557 | 558 | spark-submit ~/datagen/consume_panda_2_iceberg_customer.py 559 | ``` 560 | 561 | --- 562 | #### Review tables in our Iceberg datalake 563 | ``` 564 | cd /opt/spark/sql 565 | 566 | . ice_spark-sql_i-cli.sh 567 | 568 | # query 569 | SHOW TABLES IN icecatalog.icecatalog; 570 | 571 | # Query 2: 572 | SELECT * FROM icecatalog.icecatalog.stream_customer; 573 | ``` 574 | --- 575 | 576 | In this shell script [`stream_customer_event_history_ddl_script.sh`](./spark_items/stream_customer_event_history_ddl_script.sh) we will launch a `spark-sql` cli and run the DDL code [`stream_customer_event_history_ddl.sql`](./spark_items/stream_customer_event_history_ddl.sql) to create our `icecatalog.icecatalog.stream_customer_event_history` table in Iceberg. 577 | 578 | ``` 579 | . /opt/spark/sql/stream_customer_event_history_ddl_script.sh 580 | ``` 581 | --- 582 | 583 | In this spark streaming job [`spark_from_dbz_customer_2_iceberg.py`](./datagen/spark_from_dbz_customer_2_iceberg.py) we will consume our messages loaded into topic `pg_datagen2panda.datagen.customer` from the `kafka_connect` processor and append them into the `icecatalog.icecatalog.stream_customer_event_history` table in iceberg. Spark Streaming does not have the ability to merge this data directly into our Iceberg table yet. This feature should become available soon. In the interim, we will have to create a separate batch job to apply them. 
In an upcoming section, we will demonstrate a better solution that will merge this information and simplify the amount of code needed to accomplish this task. This specific job will only append the activity to our table. 584 | 585 | ``` 586 | spark-submit ~/datagen/spark_from_dbz_customer_2_iceberg.py 587 | ``` 588 | --- 589 | 590 | Let's explore our Iceberg tables with the interactive `spark-sql` shell. 591 | 592 | ``` 593 | cd /opt/spark/sql 594 | 595 | . ice_spark-sql_i-cli.sh 596 | ``` 597 | 598 | * to exit the shell, type `exit;` and hit `Enter` 599 | 600 | --- 601 | 602 | Run the query to see some output: 603 | 604 | ``` 605 | SELECT * FROM icecatalog.icecatalog.stream_customer_event_history; 606 | ``` 607 | 608 | --- 609 | 610 | #### Here are some additional Spark & Python exercises that may be of interest: 611 | 612 | [Additional Spark Exercises](./sample_spark_jobs.md) 613 | 614 | --- 615 | --- 616 | ### Automation of Workshop 1 Exercises 617 | --- 618 | --- 619 | 620 | 621 | * Please skip these 2 commands if you completed them by hand in the earlier reference to Workshop 1. They were included again to add additional data to our applications for use with the `Debezium Server` in the next section. 622 | 623 | Let's load all the customer data from Workshop 1 in one simple `spark-sql` shell command. In this shell script [`iceberg_workshop_sql_items.sh`](./spark_items/iceberg_workshop_sql_items.sh) we will launch a `spark-sql` cli and run the DDL code [`all_workshop1_items.sql`](./spark_items/all_workshop1_items.sql) to load our `icecatalog.icecatalog.customer` table in Iceberg. 624 | 625 | ``` 626 | . /opt/spark/sql/iceberg_workshop_sql_items.sh 627 | ``` 628 | 629 | --- 630 | 631 | In this Spark job [`load_ice_transactions_pyspark.py`](./spark_items/load_ice_transactions_pyspark.py) we will load all the transactions from Workshop 1 as a PySpark batch job: 632 | 633 | ``` 634 | spark-submit /opt/spark/sql/load_ice_transactions_pyspark.py 635 | ``` 636 | 637 | --- 638 | --- 639 | --- 640 | --- 641 | 642 | # What is Debezium Server? 643 | --- 644 | Debezium Server is an open-source runtime for change data capture (CDC) that captures change events from databases in real time and streams them to event streaming platforms or directly to other sinks. It is part of the Debezium project, which aims to simplify and automate the process of extracting change events from different sources and making them available to downstream applications. 645 | 646 | Debezium Server provides a number of benefits, including the ability to capture data changes in real time, to process events in a scalable and fault-tolerant manner, and to integrate with a variety of data storage and streaming technologies. It embeds the same Debezium connectors that Kafka Connect uses, but runs them as a standalone service, so it does not require a Kafka cluster or a Kafka Connect deployment. 647 | 648 | Debezium Server supports a wide range of data sources, including popular databases like MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Cassandra, and others. On the output side, it can deliver change events to sinks such as Apache Kafka, Apache Pulsar, and RabbitMQ, and community-built sinks exist for targets such as Apache Iceberg (the sink used in this workshop). 649 | 650 | Debezium Server can be deployed on-premises or in the cloud, and it is available under the Apache 2.0 open-source license, which means that it is free to use, modify, and distribute.
651 | 652 | You can find more information about Debezium Server here: [Debezium Server Website](https://debezium.io/documentation/reference/stable/operations/debezium-server.html) 653 | 654 | --- 655 | 656 | ### Debezium Server Observations: 657 | 658 | Using `Debezium Server` greatly reduces the amount of code needed to capture changes in upstream systems and deliver them automatically to a downstream destination. It requires only a few configuration files. 659 | 660 | It captures every change to our PostgreSQL database, including: 661 | * inserts, updates, and deletes on tables 662 | * adding columns to existing tables 663 | * creation of new tables 664 | 665 | If you recall, in an earlier exercise we used Kafka Connect to push these same change records from the PostgreSQL database to a Redpanda topic and then wrote Spark code to land them in Iceberg. We had to write a significant amount of code for each table to achieve only half of the goal. Debezium Server is a much cleaner approach. It is worth noting that the open-source community is actively working to bring this same end-to-end functionality to `Kafka Connect`, so I expect to see more options soon. 666 | 667 | --- 668 | --- 669 | #### Debezium Server Configuration File: 670 | * Link to the configuration: [Debezium Server Configuration](./dbz_server/application.properties) 671 | --- 672 | --- 673 | 674 | ## Debezium Server Exercises: 675 | --- 676 | --- 677 | 678 | 679 | #### Query the Iceberg catalog for a list of current tables: 680 | 681 | ``` 682 | # start the spark-sql cli in interactive mode: 683 | cd /opt/spark/sql 684 | . ice_spark-sql_i-cli.sh 685 | 686 | # run query: 687 | SHOW TABLES IN icecatalog.icecatalog; 688 | ``` 689 | --- 690 | 691 | #### Expected Sample Output: 692 | 693 | ``` 694 | namespace tableName isTemporary 695 | customer 696 | stream_customer 697 | stream_customer_event_history 698 | transactions 699 | 700 | ``` 701 | 702 | --- 703 | 704 | #### Start the Debezium Server in a new terminal window: 705 | 706 | ``` 707 | cd ~/appdist/debezium-server-iceberg/ 708 | 709 | bash run.sh 710 | ``` 711 | * This will run until terminated and will pull database changes into our Iceberg Data Lake. 712 | 713 | --- 714 | 715 | #### Explore our Iceberg Catalog now (in the previous terminal window): 716 | 717 | ``` 718 | cd /opt/spark/sql 719 | . ice_spark-sql_i-cli.sh 720 | 721 | # query: 722 | SHOW TABLES IN icecatalog.icecatalog; 723 | ``` 724 | --- 725 | 726 | #### Expected Sample Output: 727 | 728 | ``` 729 | namespace tableName isTemporary 730 | cdc_localhost_datagen_customer 731 | customer 732 | stream_customer 733 | stream_customer_event_history 734 | transactions 735 | ``` 736 | 737 | --- 738 | 739 | #### Query our new CDC table `cdc_localhost_datagen_customer` in our Data Lake that was replicated by `Debezium Server`: 740 | 741 | ``` 742 | cd /opt/spark/sql 743 | . 
ice_spark-sql_i-cli.sh 744 | 745 | # query: 746 | SELECT 747 | cust_id, 748 | last_name, 749 | city, 750 | state, 751 | create_date, 752 | __op, 753 | __table, 754 | __source_ts_ms, 755 | __db, 756 | __deleted 757 | FROM icecatalog.icecatalog.cdc_localhost_datagen_customer 758 | ORDER by cust_id; 759 | ``` 760 | --- 761 | #### Expected Sample Output: 762 | 763 | ``` 764 | cust_id last_name city state create_date __op __table __source_ts_ms __db __deleted 765 | 10 Jackson North Kimberly MP 2023-01-20 22:47:05 r customer 2023-02-22 16:04:34.193 datagen false 766 | 11 Downs Conwaychester MD 2022-12-27 23:54:51 r customer 2023-02-22 16:04:34.193 datagen false 767 | 12 Webster Phillipmouth VI 2023-01-17 20:54:46 r customer 2023-02-22 16:04:34.193 datagen false 768 | 13 Miller Jessicahaven OH 2023-01-13 05:03:57 r customer 2023-02-22 16:04:34.193 datagen false 769 | Time taken: 0.384 seconds, Fetched 4 row(s) 770 | 771 | ``` 772 | --- 773 | 774 | #### Add additional rows to our Postgresql table via `datagen`: 775 | 776 | ``` 777 | cd ~/datagen/ 778 | python3 pg_upsert_dg.py 12 5 779 | ``` 780 | 781 | --- 782 | 783 | #### Review New & Updated Records 784 | 785 | * Query our updated Data Lake table and review the `inserts` and `updates` applied from the running Debezium Server service. 786 | 787 | ``` 788 | cd /opt/spark/sql 789 | . ice_spark-sql_i-cli.sh 790 | 791 | # query: 792 | SELECT 793 | cust_id, 794 | last_name, 795 | city, 796 | state, 797 | create_date, 798 | __op, 799 | __table, 800 | __source_ts_ms, 801 | __db, 802 | __deleted 803 | FROM icecatalog.icecatalog.cdc_localhost_datagen_customer 804 | ORDER by cust_id; 805 | ``` 806 | --- 807 | #### Expected Sample Output: 808 | 809 | ``` 810 | cust_id last_name city state create_date __op __table __source_ts_ms __db __deleted 811 | 10 Jackson North Kimberly MP 2023-01-20 22:47:05 r customer 2023-02-22 16:06:19.9 datagen false 812 | 11 Downs Conwaychester MD 2022-12-27 23:54:51 r customer 2023-02-22 16:06:19.9 datagen false 813 | 12 Cook New Catherinemouth NJ 2023-01-03 18:38:35 u customer 2023-02-22 19:03:52.62 datagen false 814 | 13 Ramos West Laurabury NY 2023-01-04 04:48:18 u customer 2023-02-22 19:03:52.62 datagen false 815 | 14 Scott West Thomastown AL 2022-12-29 07:21:28 c customer 2023-02-22 19:03:52.62 datagen false 816 | 15 Holden East Danieltown MT 2023-01-15 17:17:54 c customer 2023-02-22 19:03:52.62 datagen false 817 | 16 Carpenter Lake Jamesberg GU 2023-01-05 22:16:55 c customer 2023-02-22 19:03:52.62 datagen false 818 | Time taken: 0.318 seconds, Fetched 7 row(s) 819 | ``` 820 | 821 | --- 822 | --- 823 | --- 824 | 825 | ### Final Summary: 826 | 827 | Integrating a database using Kafka Connect (via Debezium plugins) to stream data to a system like Red Panda and our Iceberg Data Lake can have several benefits: 828 | 829 | 1. **Real-time data streaming:** The integration provides a real-time stream of data from the SQL database to Red Panda and our Iceberg Data Lake, making it easier to analyze and process data in real-time. 830 | 831 | 2. **Scalability:** Kafka Connect or the Debezium Server can handle high volume and velocity of data, allowing for scalability as the data grows. 832 | 833 | 3. **Ease of Use:** Kafka Connect & Debezium Server simplifies the process of integrating the SQL database and delivering it to other destinations, making it easier for developers to set up and maintain. 834 | 835 | 4. 
**Improved data consistency:** The integration helps ensure data consistency by providing a single source of truth for data being streamed to Red Panda or any other downstream consumer like our Iceberg Data Lake. 836 | 837 | However, the integration may also have challenges such as data compatibility, security, and performance. It is important to thoroughly assess the requirements and constraints before implementing the integration. 838 | 839 | --- 840 | 841 | If you have made it this far, I want to thank you for spending your time reviewing the materials. Please give me a 'Star' at the top of this page if you found it useful. 842 | 843 | --- 844 | --- 845 | 846 | #### Extra Credit 847 | 848 | * Interested in exploring the underlying PostgreSQL Databases for `datagen` or the database that hosts the `Iceberg Catalog`? 849 | [Additional Exercises to Explore Databases in Postgresql](./explore_postgresql.md) 850 | 851 | --- 852 | --- 853 | 854 | ![](./images/drunk-cheers.gif) 855 | 856 | [Tim Lepple](www.linkedin.com/in/tim-lepple-9141452) 857 | 858 | --- 859 | --- 860 | 861 | -------------------------------------------------------------------------------- /datagen/comsume_topic_dgCustomer.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sys 3 | import argparse 4 | from kafka import KafkaConsumer 5 | 6 | startKey = int(1) 7 | #iterateVal = int(5) 8 | 9 | parser = argparse.ArgumentParser() 10 | 11 | # define our required arguments to pass in: 12 | parser.add_argument("recordCount", help="Enter int value for desired number of records", type=int) 13 | 14 | # parse these args 15 | args = parser.parse_args() 16 | 17 | # assign args to vars: 18 | stopVal = int(args.recordCount) 19 | 20 | try: 21 | # define our Kafka Consumer 22 | consumer = KafkaConsumer( 23 | 'dgCustomer', 24 | bootstrap_servers=':9092', 25 | auto_offset_reset='earliest', 26 | value_deserializer=lambda m: json.loads(m.decode('utf-8')) 27 | ) 28 | for message in consumer: 29 | print(message.value) 30 | startKey += 1 31 | 32 | if startKey == stopVal: 33 | print("\n") 34 | print(str(stopVal) + " msgs have been consumed.") 35 | print("\n") 36 | consumer.close() 37 | sys.exit() 38 | 39 | 40 | except KeyboardInterrupt: 41 | sys.exit() 42 | finally: 43 | print("script complete!") 44 | 45 | -------------------------------------------------------------------------------- /datagen/consume_panda_2_iceberg_customer.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import * 4 | from pyspark.sql.functions import udf 5 | from pyspark.streaming import StreamingContext 6 | #from pyspark.streaming.kafka import KafkaUtils 7 | import json 8 | import uuid 9 | 10 | 11 | ####################################################################################### 12 | # define schema for a DF with data json data from Kafka msgs 13 | ####################################################################################### 14 | customer_schema = StructType() \ 15 | .add("first_name", StringType()) \ 16 | .add("last_name", StringType()) \ 17 | .add("street_address", StringType()) \ 18 | .add("city", StringType()) \ 19 | .add("state", StringType()) \ 20 | .add("zip_code", StringType()) \ 21 | .add("home_phone", StringType()) \ 22 | .add("mobile", StringType()) \ 23 | .add("email", StringType()) \ 24 | .add("ssn", StringType()) \ 25 | .add("job_title", 
StringType()) \ 26 | .add("create_date", StringType()) \ 27 | .add("cust_id", IntegerType()) 28 | 29 | 30 | spark = SparkSession \ 31 | .builder \ 32 | .appName("cust_panda_2_ice") \ 33 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 34 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 35 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 36 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 37 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 38 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 39 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 40 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 41 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 42 | .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 43 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 44 | .config("spark.eventLog.enabled", "true") \ 45 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 46 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 47 | .config("spark.sql.catalogImplementation", "in-memory") \ 48 | .config("groupId", "org.apache.spark") \ 49 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 50 | .config("version", "3.3.1") \ 51 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 52 | .config("spark.sql.adaptive", "true") \ 53 | .getOrCreate() 54 | 55 | ####################################################################################### 56 | # Create DataFrame representing the stream of msgs from kafka (unbounded table) 57 | ####################################################################################### 58 | 59 | parsed = spark \ 60 | .readStream \ 61 | .format("kafka") \ 62 | .option("kafka.bootstrap.servers", ":9092") \ 63 | .option("subscribe", "dgCustomer") \ 64 | .option("startingOffsets", "earliest") \ 65 | .option("kafka.session.timeout.ms", "10000") \ 66 | .load() \ 67 | .select( \ 68 | from_json(col("value").cast("string"), customer_schema).alias("parsed_value")) 69 | 70 | ########################################################################################## 71 | # project the kafka 'value' column into a new data frame: 72 | ########################################################################################## 73 | 74 | projected = parsed \ 75 | .select("parsed_value.*") 76 | 77 | 78 | ########################################################################################## 79 | # write to console 80 | ########################################################################################## 81 | 82 | query = projected.writeStream \ 83 | .outputMode("append") \ 84 | .format("iceberg") \ 85 | .trigger(processingTime='30 seconds') \ 86 | .option("path", "icecatalog.icecatalog.stream_customer") \ 87 | .option("checkpointLocation", "/opt/spark/checkpoint") \ 88 | .start() \ 89 | .awaitTermination() 90 | 91 | spark.stop() 92 | 93 | # .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS)) \ 94 | -------------------------------------------------------------------------------- /datagen/consume_stream_customer_2_console.py: 
-------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import * 4 | from pyspark.sql.functions import udf 5 | from pyspark.streaming import StreamingContext 6 | #from pyspark.streaming.kafka import KafkaUtils 7 | import json 8 | import uuid 9 | 10 | 11 | ####################################################################################### 12 | # define a uuid function for the kafka key 13 | ####################################################################################### 14 | uuidUdf = udf(lambda : str(uuid.uuid4()),StringType()) 15 | nowUdf = udf(lambda : now(),TimestampType()) 16 | 17 | ####################################################################################### 18 | # define schema for a DF with data json data from Kafka msgs 19 | ####################################################################################### 20 | customer_schema = StructType() \ 21 | .add("first_name", StringType()) \ 22 | .add("last_name", StringType()) \ 23 | .add("street_address", StringType()) \ 24 | .add("city", StringType()) \ 25 | .add("state", StringType()) \ 26 | .add("zip_code", StringType()) \ 27 | .add("home_phone", StringType()) \ 28 | .add("mobile", StringType()) \ 29 | .add("email", StringType()) \ 30 | .add("ssn", StringType()) \ 31 | .add("job_title", StringType()) \ 32 | .add("create_date", StringType()) \ 33 | .add("cust_id", IntegerType()) 34 | 35 | 36 | spark = SparkSession \ 37 | .builder \ 38 | .appName("redpanda") \ 39 | .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 40 | .config("groupId", "org.apache.spark") \ 41 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 42 | .config("version", "3.3.1") \ 43 | .config("spark.eventLog.enabled", "true") \ 44 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 45 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 46 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 47 | .config("spark.sql.adaptive", "true") \ 48 | .getOrCreate() 49 | 50 | ####################################################################################### 51 | # Create DataFrame representing the stream of msgs from kafka (unbounded table) 52 | ####################################################################################### 53 | 54 | parsed = spark \ 55 | .readStream \ 56 | .format("kafka") \ 57 | .option("kafka.bootstrap.servers", ":9092") \ 58 | .option("subscribe", "dgCustomer") \ 59 | .option("startingOffsets", "earliest") \ 60 | .option("kafka.session.timeout.ms", "10000") \ 61 | .load() \ 62 | .select( \ 63 | from_json(col("value").cast("string"), customer_schema).alias("parsed_value")) 64 | 65 | ########################################################################################## 66 | # project the kafka 'value' column into a new data frame: 67 | ########################################################################################## 68 | 69 | projected = parsed \ 70 | .select("parsed_value.*") 71 | 72 | 73 | ########################################################################################## 74 | # write to console 75 | ########################################################################################## 76 | 77 | query = projected \ 78 | .writeStream.outputMode("append") \ 79 | .format("console") \ 80 | .trigger(processingTime='6 seconds') \ 81 | .start() \ 82 | .awaitTermination() 83 | 
-------------------------------------------------------------------------------- /datagen/consume_stream_txn_2_console.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import * 4 | from pyspark.sql.functions import udf 5 | from pyspark.streaming import StreamingContext 6 | import json 7 | import uuid 8 | 9 | ####################################################################################### 10 | # define schema for a DF with data json data from Kafka msgs 11 | ####################################################################################### 12 | txn_schema = StructType() \ 13 | .add("amount", DoubleType()) \ 14 | .add("barcode", StringType()) \ 15 | .add("category", StringType()) \ 16 | .add("cust_id", StringType()) \ 17 | .add("item_desc", StringType()) \ 18 | .add("transact_id", StringType()) \ 19 | .add("transaction_date", StringType()) 20 | 21 | 22 | 23 | spark = SparkSession \ 24 | .builder \ 25 | .appName("redpanda") \ 26 | .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 27 | .config("groupId", "org.apache.spark") \ 28 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 29 | .config("version", "3.3.1") \ 30 | .config("spark.eventLog.enabled", "true") \ 31 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 32 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 33 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 34 | .config("spark.sql.adaptive", "true") \ 35 | .getOrCreate() 36 | 37 | ####################################################################################### 38 | # Create DataFrame representing the stream of msgs from kafka (unbounded table) 39 | ####################################################################################### 40 | 41 | parsed = spark \ 42 | .readStream \ 43 | .format("kafka") \ 44 | .option("kafka.bootstrap.servers", ":9092") \ 45 | .option("subscribe", "dgTxn") \ 46 | .option("startingOffsets", "earliest") \ 47 | .option("kafka.session.timeout.ms", "10000") \ 48 | .load() \ 49 | .select( \ 50 | from_json(col("value").cast("string"), txn_schema).alias("parsed_value")) 51 | 52 | ########################################################################################## 53 | # project the kafka 'value' column into a new data frame: 54 | ########################################################################################## 55 | 56 | projected = parsed \ 57 | .select("parsed_value.*") 58 | 59 | 60 | ########################################################################################## 61 | # write to console 62 | ########################################################################################## 63 | 64 | query = projected \ 65 | .writeStream.outputMode("append") \ 66 | .format("console") \ 67 | .trigger(processingTime='6 seconds') \ 68 | .start() \ 69 | .awaitTermination() 70 | 71 | spark.stop() 72 | -------------------------------------------------------------------------------- /datagen/datagenerator.py: -------------------------------------------------------------------------------- 1 | import time 2 | import collections 3 | import datetime 4 | from decimal import Decimal 5 | from random import randrange, randint, sample 6 | import sys 7 | class DataGenerator(): 8 | # DataGenerator 9 | def __init__(self): 10 | # comments 11 | self.z = 0 12 | def fake_person_generator(self, startkey, 
iterateval, f): 13 | self.startkey = startkey 14 | self.iterateval = iterateval 15 | self.f = f 16 | endkey = startkey + iterateval 17 | for x in range(startkey, endkey): 18 | yield {'last_name': f.last_name(), 19 | 'first_name': f.first_name(), 20 | 'street_address': f.street_address(), 21 | 'city': f.city(), 22 | 'state': f.state_abbr(), 23 | 'zip_code': f.postcode(), 24 | 'email': f.email(), 25 | 'home_phone': f.phone_number(), 26 | 'mobile': f.phone_number(), 27 | 'ssn': f.ssn(), 28 | 'job_title': f.job(), 29 | 'create_date': (f.date_time_between(start_date="-60d", end_date="-30d", tzinfo=None)).strftime('%Y-%m-%d %H:%M:%S'), 30 | 'cust_id': x} 31 | def fake_txn_generator(self, txnsKey, txniKey, fake): 32 | self.txnsKey = txnsKey 33 | self.txniKey = txniKey 34 | self.fake = fake 35 | 36 | txnendKey = txnsKey + txniKey 37 | for x in range(txnsKey, txnendKey): 38 | for i in range(1,randrange(1,7,1)): 39 | yield {'transact_id': fake.uuid4(), 40 | 'category': fake.safe_color_name(), 41 | 'barcode': fake.ean13(), 42 | 'item_desc': fake.sentence(nb_words=5, variable_nb_words=True, ext_word_list=None), 43 | 'amount': fake.pyfloat(left_digits=2, right_digits=2, positive=True), 44 | 'transaction_date': (fake.date_time_between(start_date="-29d", end_date="now", tzinfo=None)).strftime('%Y-%m-%d %H:%M:%S'), 45 | 'cust_id': x} 46 | -------------------------------------------------------------------------------- /datagen/parameter_get_schema.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | import json 4 | import re 5 | from pyspark.sql import SparkSession 6 | from pyspark.sql.functions import * 7 | from pyspark.sql.types import * 8 | 9 | spark = SparkSession \ 10 | .builder \ 11 | .appName("pg_cust_from_connect_schema") \ 12 | .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 13 | .config("groupId", "org.apache.spark") \ 14 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 15 | .config("version", "3.3.1") \ 16 | .config("spark.eventLog.enabled", "true") \ 17 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 18 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 19 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 20 | .config("spark.sql.adaptive", "true") \ 21 | .getOrCreate() 22 | 23 | 24 | ########################################################################################## 25 | # Defines some variables 26 | ########################################################################################## 27 | parser = argparse.ArgumentParser() 28 | 29 | # define our required arguments to pass in: 30 | parser.add_argument("--p_broker", help="Enter the ip address and port of kakfa broker", required=True, type=str) 31 | parser.add_argument("--p_topic", help="Enter the topic name for needed schema", required=True, type=str) 32 | 33 | # parse these args 34 | args = parser.parse_args() 35 | 36 | 37 | kafka_broker = str(args.p_broker) 38 | kafka_topic = str(args.p_topic) 39 | 40 | print(kafka_broker) 41 | print(kafka_topic) 42 | ########################################################################################## 43 | # function to get the json value from a kafka topic 44 | ########################################################################################## 45 | def read_kafka_topic(topic): 46 | 47 | df_json = (spark.read 48 | .format("kafka") 49 | .option("kafka.bootstrap.servers", kafka_broker) 50 | .option("subscribe", 
topic) 51 | .option("startingOffsets", "earliest") 52 | .option("endingOffsets", "latest") 53 | .option("failOnDataLoss", "false") 54 | .load() 55 | # filter out empty values 56 | .withColumn("value", expr("string(value)")) 57 | .filter(col("value").isNotNull()) 58 | # get latest version of each record 59 | .select("key", expr("struct(offset, value) r")) 60 | .groupBy("key").agg(expr("max(r) r")) 61 | .select("r.value")) 62 | 63 | # decode the json values 64 | df_read = spark.read.json( 65 | df_json.rdd.map(lambda x: x.value), multiLine=True) 66 | 67 | # drop corrupt records 68 | if "_corrupt_record" in df_read.columns: 69 | df_read = (df_read 70 | .filter(col("_corrupt_record").isNotNull()) 71 | .drop("_corrupt_record")) 72 | 73 | return df_read 74 | 75 | ########################################################################################## 76 | # function to cleanup schema for humans to read: 77 | ########################################################################################## 78 | 79 | def prettify_spark_schema_json(json: str): 80 | 81 | import re, json 82 | 83 | parsed = json.loads(json_schema) 84 | raw = json.dumps(parsed, indent=1, sort_keys=False) 85 | 86 | str1 = raw 87 | 88 | # replace empty meta data 89 | str1 = re.sub('"metadata": {},\n +', '', str1) 90 | 91 | # replace enters between properties 92 | str1 = re.sub('",\n +"', '", "', str1) 93 | str1 = re.sub('e,\n +"', 'e, "', str1) 94 | 95 | # replace endings and beginnings of simple objects 96 | str1 = re.sub('"\n +},', '" },', str1) 97 | str1 = re.sub('{\n +"', '{ "', str1) 98 | 99 | # replace end of complex objects 100 | str1 = re.sub('"\n +}', '" }', str1) 101 | str1 = re.sub('e\n +}', 'e }', str1) 102 | 103 | # introduce the meta data on a different place 104 | str1 = re.sub('(, "type": "[^"]+")', '\\1, "metadata": {}', str1) 105 | str1 = re.sub('(, "type": {)', ', "metadata": {}\\1', str1) 106 | 107 | # make sure nested ending is not on a single line 108 | str1 = re.sub('}\n\s+},', '} },', str1) 109 | 110 | return str1 111 | 112 | ########################################################################################## 113 | # call the function to get the schema 114 | ########################################################################################## 115 | 116 | df = read_kafka_topic(kafka_topic) 117 | json_schema = df.schema.json() 118 | 119 | ########################################################################################## 120 | # read the JSON into a schema 121 | ########################################################################################## 122 | 123 | obj = json.loads(json_schema) 124 | topic_schema = StructType.fromJson(obj) 125 | 126 | ########################################################################################## 127 | # print raw schema suitable for performant code 128 | ########################################################################################## 129 | 130 | print('\n') 131 | print('------------------------------------------SparkStreaming Schema------------------------------------------\n') 132 | print(topic_schema) 133 | print('\n') 134 | 135 | ########################################################################################## 136 | # make the schema readable and print to screen. 
137 | ########################################################################################## 138 | 139 | #pretty_json_schema = prettify_spark_schema_json(json_schema) 140 | 141 | ########################################################################################## 142 | # read the JSON into a schema 143 | ########################################################################################## 144 | 145 | #prettyObj = json.loads(pretty_json_schema) 146 | #pretty_topic_schema = StructType.fromJson(prettyObj) 147 | 148 | 149 | 150 | #print('\n') 151 | #print('------------------------------------------Pretty Schema------------------------------------------\n') 152 | #print(pretty_topic_schema) 153 | #print('\n') 154 | 155 | -------------------------------------------------------------------------------- /datagen/pg_upsert_dg.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from faker import Faker 3 | from datagenerator import DataGenerator 4 | import simplejson 5 | import sys 6 | import argparse 7 | import psycopg2 8 | ######################################################################################### 9 | # Define variables 10 | ######################################################################################### 11 | dg = DataGenerator() 12 | fake = Faker() # <--- Don't Forgot this 13 | parser = argparse.ArgumentParser() 14 | 15 | # define our required arguments to pass in: 16 | parser.add_argument("startingCustomerID", help="Enter int value to assign to the first customerID field", type=int) 17 | parser.add_argument("recordCount", help="Enter int value for desired number of records", type=int) 18 | 19 | # parse these args 20 | args = parser.parse_args() 21 | 22 | # assign args to vars: 23 | startKey = int(args.startingCustomerID) 24 | stopVal = int(args.recordCount) 25 | 26 | 27 | # functions to display errors 28 | def printf (format,*args): 29 | sys.stdout.write (format % args) 30 | def printException (exception): 31 | error, = exception.args 32 | printf("Error code = %s\n",error.code); 33 | printf("Error message = %s\n",error.message); 34 | def myconverter(obj): 35 | if isinstance(obj, (datetime.datetime)): 36 | return obj.__str__() 37 | ######################################################################################### 38 | # Code execution below 39 | ######################################################################################### 40 | try: 41 | try: 42 | conn = psycopg2.connect(host="127.0.0.1",database="datagen", user="datagen", password="supersecret1") 43 | print("Connection Established") 44 | except psycopg2.Error as exception: 45 | printf ('Failed to connect to database') 46 | printException (exception) 47 | exit (1) 48 | cursor = conn.cursor() 49 | try: 50 | fpg = dg.fake_person_generator(startKey, stopVal, fake) 51 | for person in fpg: 52 | json_out = simplejson.dumps(person, ensure_ascii=False, default = myconverter) 53 | print(json_out) 54 | insert_stmt = "SELECT datagen.insert_from_json('" + json_out +"');" 55 | cursor.execute(insert_stmt) 56 | print("Records inserted successfully") 57 | except psycopg2.Error as exception: 58 | printf ('Failed to insert\n') 59 | printException (exception) 60 | exit (1) 61 | finally: 62 | if(conn): 63 | conn.commit() 64 | cursor.close() 65 | conn.close() 66 | print("PostgreSQL connection is closed") 67 | except (Exception, psycopg2.Error) as error: 68 | print("Something else went wrong...\n", error) 69 | finally: 70 | 
print("script complete!") 71 | -------------------------------------------------------------------------------- /datagen/redpanda_dg.py: -------------------------------------------------------------------------------- 1 | import time 2 | from faker import Faker 3 | from datagenerator import DataGenerator 4 | import simplejson as json 5 | 6 | import argparse 7 | 8 | from kafka import KafkaProducer 9 | 10 | ######################################################################################### 11 | # Define variables 12 | ######################################################################################### 13 | dg = DataGenerator() 14 | fake = Faker() # <--- Don't Forgot this 15 | parser = argparse.ArgumentParser() 16 | 17 | # define our required arguments to pass in: 18 | parser.add_argument("startingCustomerID", help="Enter int value to assign to the first customerID field", type=int) 19 | parser.add_argument("recordCount", help="Enter int value for desired number of records per group", type=int) 20 | parser.add_argument("loopCount", help="Enter int value for iteration count", type=int) 21 | 22 | # parse these args 23 | args = parser.parse_args() 24 | 25 | # assign args to vars: 26 | startKey = int(args.startingCustomerID) 27 | iterateVal = int(args.recordCount) 28 | stopVal = int(args.loopCount) 29 | 30 | # Define some functions: 31 | def myconverter(obj): 32 | if isinstance(obj, (datetime.datetime)): 33 | return obj.__str__() 34 | 35 | def encode_complex(obj): 36 | if isinstance(obj, complex): 37 | return [ojb.real, obj.imag] 38 | raise TypeError(repr(obj) + " is not JSON serializable") 39 | 40 | # Messages will be serialized as JSON 41 | def my_serializer(message): 42 | return json.dumps(message).encode('utf-8') 43 | 44 | # define variable for our producer 45 | producer = KafkaProducer(bootstrap_servers=":9092",value_serializer=my_serializer) 46 | 47 | ######################################################################################### 48 | # Code execution below 49 | ######################################################################################### 50 | try: 51 | for i in range(stopVal): 52 | # person start here: 53 | try: 54 | fpg = dg.fake_person_generator(startKey, iterateVal, fake) 55 | for person in fpg: 56 | #print(json.dumps(person, ensure_ascii=False, default = myconverter)) 57 | #print("\n") 58 | data = json.dumps(person, default = encode_complex) 59 | print(data) 60 | #print ("dataVarType", type(data)) 61 | # convert json string to dict obj 62 | dictData = json.loads(data) 63 | producer.send('dgCustomer', dictData) 64 | #print("\n") 65 | producer.flush() 66 | print("Customer Done.") 67 | print('\n') 68 | except: 69 | print("failing in person generator") 70 | producer.flush() 71 | 72 | # txn start here: 73 | try: 74 | txn = dg.fake_txn_generator(startKey, iterateVal, fake) 75 | for tranx in txn: 76 | #print(json.dumps(tranx, ensure_ascii=False, default = myconverter)) 77 | txnData = json.dumps(tranx, default = encode_complex) 78 | print(txnData) 79 | producer.send('dgTxn', tranx) 80 | producer.flush() 81 | print("Transaction Done.") 82 | print('\n') 83 | 84 | #txn ends here: 85 | except: 86 | print("failing in txn generator") 87 | producer.flush() 88 | # increment counter and sleep 89 | startKey += iterateVal 90 | time.sleep(5) 91 | 92 | except: 93 | print("failing in loop.") 94 | finally: 95 | print("script complete") 96 | 97 | 98 | -------------------------------------------------------------------------------- /datagen/spark_from_dbz_customer_2_iceberg.py: 
-------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import col, udf 4 | from pyspark.sql.functions import * 5 | from pyspark.sql import * 6 | from pyspark.streaming import StreamingContext 7 | import json 8 | import uuid 9 | import re 10 | 11 | 12 | spark = SparkSession \ 13 | .builder \ 14 | .appName("cust_panda_2_ice") \ 15 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 16 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 17 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 18 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 19 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 20 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 21 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 22 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 23 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 24 | .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 25 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 26 | .config("spark.sql.adaptive.enabled", "true") \ 27 | .config("spark.eventLog.enabled", "true") \ 28 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 29 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 30 | .config("spark.sql.catalogImplementation", "in-memory") \ 31 | .config("groupId", "org.apache.spark") \ 32 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 33 | .config("version", "3.3.1") \ 34 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 35 | .config("spark.sql.adaptive", "true") \ 36 | .getOrCreate() 37 | 38 | 39 | ########################################################################################## 40 | # debezium schema 41 | ########################################################################################## 42 | dbz_schema_value = StructType([StructField('payload', StructType([StructField('after', StructType([StructField('city', StringType(), True), StructField('create_date', StringType(), True), StructField('cust_id', LongType(), True), StructField('email', StringType(), True), StructField('first_name', StringType(), True), StructField('home_phone', StringType(), True), StructField('job_title', StringType(), True), StructField('last_name', StringType(), True), StructField('mobile', StringType(), True), StructField('ssn', StringType(), True), StructField('state', StringType(), True), StructField('street_address', StringType(), True), StructField('zip_code', StringType(), True)]), True), StructField('before', StructType([StructField('city', StringType(), True), StructField('create_date', StringType(), True), StructField('cust_id', LongType(), True), StructField('email', StringType(), True), StructField('first_name', StringType(), True), StructField('home_phone', StringType(), True), StructField('job_title', StringType(), True), StructField('last_name', StringType(), True), StructField('mobile', StringType(), True), StructField('ssn', StringType(), True), 
StructField('state', StringType(), True), StructField('street_address', StringType(), True), StructField('zip_code', StringType(), True)]), True), StructField('op', StringType(), True), StructField('source', StructType([StructField('connector', StringType(), True), StructField('db', StringType(), True), StructField('lsn', LongType(), True), StructField('name', StringType(), True), StructField('schema', StringType(), True), StructField('sequence', StringType(), True), StructField('snapshot', StringType(), True), StructField('table', StringType(), True), StructField('ts_ms', LongType(), True), StructField('txId', LongType(), True), StructField('version', StringType(), True), StructField('xmin', StringType(), True)]), True), StructField('transaction', StringType(), True), StructField('ts_ms', LongType(), True)]), True), StructField('schema', StructType([StructField('fields', ArrayType(StructType([StructField('field', StringType(), True), StructField('fields', ArrayType(StructType([StructField('default', StringType(), True), StructField('field', StringType(), True), StructField('name', StringType(), True), StructField('optional', BooleanType(), True), StructField('parameters', StructType([StructField('allowed', StringType(), True)]), True), StructField('type', StringType(), True), StructField('version', LongType(), True)]), True), True), StructField('name', StringType(), True), StructField('optional', BooleanType(), True), StructField('type', StringType(), True), StructField('version', LongType(), True)]), True), True), StructField('name', StringType(), True), StructField('optional', BooleanType(), True), StructField('type', StringType(), True), StructField('version', LongType(), True)]), True)]) 43 | 44 | 45 | 46 | 47 | 48 | ########################################################################################## 49 | # read from topic 50 | ########################################################################################## 51 | connectCustTopicDF = spark \ 52 | .readStream \ 53 | .format("kafka") \ 54 | .option("kafka.bootstrap.servers", ":9092") \ 55 | .option("subscribe", "pg_datagen2panda.datagen.customer") \ 56 | .option("startingOffsets", "earliest") \ 57 | .option("kafka.session.timeout.ms", "10000") \ 58 | .load() \ 59 | .select( \ 60 | from_json(col("value").cast("string"), dbz_schema_value).alias("parsed_value")) 61 | 62 | 63 | ########################################################################################## 64 | # get just the payload from 'connectCustTopicDF' 65 | ########################################################################################## 66 | 67 | payloadDF = connectCustTopicDF \ 68 | .select("parsed_value.payload.*") 69 | 70 | ########################################################################################## 71 | # This worked in the define of our function for each batch for the future ... the function stuff goes here: 72 | ########################################################################################## 73 | 74 | # create a unique Id for the row... ideally this should already have been in the payload (and is available if I took the time to load the msg correctly ;) ) 75 | #uuidUDF = udf(lambda : str(uuid.uuid4()),StringType()) 76 | 77 | 78 | df = payloadDF 79 | # .withColumn("row_key", uuidUDF()) 80 | 81 | 82 | def foreach_batch_function(microdf, batchId): 83 | print(f"inside forEachBatch for batchid:{batchId}. 
Rows in passed dataframe:{microdf.count()}") 84 | microdf.show() 85 | # microdf.printSchema() 86 | microdf.filter((microdf.op == "r") | (microdf.op == "c") | (microdf.op == "u")) \ 87 | .select(microdf.op.alias("type"), \ 88 | microdf.ts_ms.alias("event_ts"), \ 89 | microdf.source.txId.alias("tx_id"), \ 90 | microdf.after.first_name.alias("first_name"), \ 91 | microdf.after.last_name.alias("last_name"), \ 92 | microdf.after.street_address.alias("street_address"), \ 93 | microdf.after.city.alias("city"), \ 94 | microdf.after.state.alias("state"), \ 95 | microdf.after.zip_code.alias("zip_code"), \ 96 | microdf.after.home_phone.alias("home_phone"), \ 97 | microdf.after.mobile.alias("mobile"), \ 98 | microdf.after.email.alias("email"), \ 99 | microdf.after.ssn.alias("ssn"), \ 100 | microdf.after.job_title.alias("job_title"), \ 101 | microdf.after.create_date.alias("create_date"), \ 102 | microdf.after.cust_id.alias("cust_id")).createOrReplaceGlobalTempView("tmp_merge") 103 | mergeCount=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_merge").count() 104 | print(f"mergeCount= {mergeCount} for batch: {batchId}") 105 | showMergeDF=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_merge").show() 106 | mergeDF=(microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_merge")) 107 | mergeDF.writeTo("icecatalog.icecatalog.stream_customer_event_history").append() 108 | microdf.filter(microdf.op =="d") \ 109 | .select(microdf.op.alias("type"), \ 110 | microdf.ts_ms.alias("event_ts"), \ 111 | microdf.source.txId.alias("tx_id"), \ 112 | microdf.before.first_name.alias("first_name"), \ 113 | microdf.before.last_name.alias("last_name"), \ 114 | microdf.before.street_address.alias("street_address"), \ 115 | microdf.before.city.alias("city"), \ 116 | microdf.before.state.alias("state"), \ 117 | microdf.before.zip_code.alias("zip_code"), \ 118 | microdf.before.home_phone.alias("home_phone"), \ 119 | microdf.before.mobile.alias("mobile"), \ 120 | microdf.before.email.alias("email"), \ 121 | microdf.before.ssn.alias("ssn"), \ 122 | microdf.before.job_title.alias("job_title"), \ 123 | microdf.before.create_date.alias("create_date"), \ 124 | microdf.before.cust_id.alias("cust_id")).createOrReplaceGlobalTempView("tmp_delete") 125 | deleteCount=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_delete").count() 126 | print(f"deleteCount= {deleteCount} for batch: {batchId}") 127 | showDeleteDF=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_delete").show() 128 | deleteDF=(microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_delete")) 129 | # deleteDF.printSchema() 130 | deleteDF.writeTo("icecatalog.icecatalog.stream_customer_event_history").append() 131 | 132 | ########################################################################################## 133 | # send stream into foreachBatch 134 | ########################################################################################## 135 | 136 | streamQuery = (df.writeStream \ 137 | .option("checkpointLocation", "/opt/spark/checkpoint2") \ 138 | .foreachBatch(foreach_batch_function) \ 139 | .trigger(processingTime='60 seconds') \ 140 | .start() \ 141 | .awaitTermination()) 142 | 143 | spark.stop() 144 | -------------------------------------------------------------------------------- /datagen/test_pg.py: -------------------------------------------------------------------------------- 1 | import psycopg2 2 | 3 | try: 4 | try: 5 | # Connect to your PostgreSQL database on a remote server 6 | conn 
= psycopg2.connect(host="127.0.0.1", port="5432", dbname="datagen", user="datagen", password="supersecret1") 7 | print("Connection Established!") 8 | print("\n") 9 | except psycopg2.Error as exception: 10 | printf ('Failed to connect to database') 11 | printException (exception) 12 | exit (1) 13 | 14 | # Open a cursor to perform database operations 15 | cur = conn.cursor() 16 | try: 17 | # Execute a test query 18 | cur.execute("SELECT * FROM customer") 19 | 20 | # Retrieve query results 21 | records = cur.fetchall() 22 | 23 | #print records 24 | print(records) 25 | 26 | except psycopg2.Error as exception: 27 | printf ('Failed to insert\n') 28 | printException (exception) 29 | exit (1) 30 | finally: 31 | if(conn): 32 | cur.close() 33 | conn.close() 34 | print("\n") 35 | print("PostgreSQL connection is closed") 36 | 37 | except (Exception, psycopg2.Error) as error: 38 | print("Something else went wrong...\n", error) 39 | 40 | finally: 41 | print("\n") 42 | print("script complete!") 43 | 44 | -------------------------------------------------------------------------------- /db_ddl/create_ddl_icecatalog.sql: -------------------------------------------------------------------------------- 1 | CREATE ROLE icecatalog LOGIN PASSWORD 'supersecret1'; 2 | CREATE DATABASE icecatalog OWNER icecatalog ENCODING 'UTF-8'; 3 | ALTER USER icecatalog WITH SUPERUSER; 4 | ALTER USER icecatalog WITH CREATEDB; 5 | CREATE SCHEMA icecatalog; 6 | -------------------------------------------------------------------------------- /db_ddl/create_user_datagen.sql: -------------------------------------------------------------------------------- 1 | CREATE ROLE datagen LOGIN PASSWORD 'supersecret1'; 2 | CREATE DATABASE datagen OWNER datagen ENCODING 'UTF-8'; 3 | ALTER ROLE "datagen" WITH LOGIN; 4 | ALTER ROLE "datagen" WITH REPLICATION; 5 | -------------------------------------------------------------------------------- /db_ddl/customer_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE schema datagen; 2 | CREATE TABLE datagen.customer 3 | ( 4 | first_name character varying(50) COLLATE pg_catalog."default", 5 | last_name character varying(50) COLLATE pg_catalog."default", 6 | street_address character varying(100) COLLATE pg_catalog."default", 7 | city character varying(50) COLLATE pg_catalog."default", 8 | state character varying(50) COLLATE pg_catalog."default", 9 | zip_code character varying(50) COLLATE pg_catalog."default", 10 | home_phone character varying(50) COLLATE pg_catalog."default", 11 | mobile character varying(50) COLLATE pg_catalog."default", 12 | email character varying(50) COLLATE pg_catalog."default", 13 | ssn character varying(25) COLLATE pg_catalog."default", 14 | job_title character varying(50) COLLATE pg_catalog."default", 15 | create_date character varying(50) COLLATE pg_catalog."default", 16 | cust_id integer NOT NULL, 17 | CONSTRAINT customer_pkey PRIMARY KEY (cust_id) 18 | ); 19 | CREATE PUBLICATION dbz_publication FOR TABLE datagen.customer; 20 | -------------------------------------------------------------------------------- /db_ddl/customer_function_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE or REPLACE FUNCTION datagen.insert_from_json(json) 2 | RETURNS void 3 | LANGUAGE 'plpgsql' 4 | COST 100 5 | VOLATILE 6 | AS $BODY$ 7 | 8 | BEGIN 9 | INSERT INTO datagen.customer(first_name, last_name, street_address, city, state, zip_code, home_phone, mobile, email, ssn, job_title, create_date, cust_id) 
10 | SELECT 11 | x.first_name 12 | ,x.last_name 13 | ,x.street_address 14 | ,x.city 15 | ,x.state 16 | ,x.zip_code 17 | ,x.home_phone 18 | ,x.mobile 19 | ,x.email 20 | ,x.ssn 21 | ,x.job_title 22 | ,x.create_date 23 | ,x.cust_id 24 | FROM json_to_record($1) AS x 25 | ( 26 | first_name text, 27 | last_name text, 28 | street_address text, 29 | city text, 30 | state text, 31 | zip_code text, 32 | home_phone text, 33 | mobile text, 34 | email text, 35 | ssn text, 36 | job_title text, 37 | create_date text, 38 | cust_id int 39 | ) 40 | ON CONFLICT (cust_id) DO UPDATE SET 41 | first_name = EXCLUDED.first_name 42 | ,last_name = EXCLUDED.last_name 43 | ,street_address = EXCLUDED.street_address 44 | ,city = EXCLUDED.city 45 | ,state = EXCLUDED.state 46 | ,zip_code = EXCLUDED.zip_code 47 | ,home_phone = EXCLUDED.home_phone 48 | ,mobile = EXCLUDED.mobile 49 | ,email = EXCLUDED.email 50 | ,ssn = EXCLUDED.ssn 51 | ,job_title = EXCLUDED.job_title 52 | ,create_date = EXCLUDED.create_date; 53 | 54 | 55 | END; 56 | $BODY$; 57 | -------------------------------------------------------------------------------- /db_ddl/grants4dbz.sql: -------------------------------------------------------------------------------- 1 | GRANT ALL ON ALL TABLES IN SCHEMA datagen TO datagen; 2 | -------------------------------------------------------------------------------- /db_ddl/hive_metastore_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE USER hive; 2 | ALTER ROLE hive WITH PASSWORD 'supersecret1'; 3 | CREATE DATABASE hive_metastore; 4 | GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive; 5 | -------------------------------------------------------------------------------- /dbz_server/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /dbz_server/application.properties: -------------------------------------------------------------------------------- 1 | debezium.sink.type=iceberg 2 | debezium.format.value.schemas.enable=true 3 | 4 | #################################################################################### 5 | # postgresql source config: 6 | #################################################################################### 7 | debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector 8 | debezium.source.offset.storage.file.filename=data/offsets.dat 9 | debezium.source.offset.flush.interval.ms=0 10 | debezium.source.database.hostname=127.0.0.1 11 | debezium.source.database.port=5432 12 | debezium.source.database.user=datagen 13 | debezium.source.database.password=supersecret1 14 | debezium.source.database.dbname=datagen 15 | debezium.source.database.server.name=localhost 16 | debezium.source.schema.include.list=datagen 17 | debezium.source.plugin.name=pgoutput 18 | # below is new as of 3.15.2024 19 | debezium.source.topic.prefix=dbz_ 20 | 21 | 22 | #################################################################################### 23 | # Iceberg sink config: 24 | #################################################################################### 25 | debezium.sink.iceberg.warehouse=s3://iceberg-data 26 | debezium.sink.iceberg.catalog-name=icecatalog 27 | debezium.sink.iceberg.table-namespace=icecatalog 28 | #debezium.sink.iceberg.table-prefix=cdc_ 29 | # above was the orig. 
below is new on 3.15.2024 30 | debezium.sink.iceberg.table-prefix=debeziumcdc_ 31 | debezium.sink.iceberg.write.format.default=parquet 32 | debezium.sink.iceberg.upsert=true 33 | debezium.sink.iceberg.upsert-keep-deletes=true 34 | debezium.sink.iceberg.table.auto-create=true 35 | 36 | debezium.sink.iceberg.name=icecatalog 37 | debezium.sink.iceberg.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog 38 | debezium.sink.iceberg.uri=jdbc:postgresql://127.0.0.1:5432/icecatalog 39 | debezium.sink.iceberg.jdbc.user=icecatalog 40 | debezium.sink.iceberg.jdbc.password=supersecret1 41 | 42 | #################################################################################### 43 | # S3 config to a local minio instance 44 | #################################################################################### 45 | debezium.sink.iceberg.fs.defaultFS=s3://iceberg-data/icecatalog 46 | debezium.sink.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO 47 | debezium.sink.iceberg.com.amazonaws.services.s3a.enableV4=true 48 | debezium.sink.iceberg.s3.endpoint=http://127.0.0.1:9000 49 | debezium.sink.iceberg.s3.path-style-access=true 50 | debezium.sink.iceberg.s3.access-key-id= 51 | debezium.sink.iceberg.s3.secret-access-key= 52 | 53 | #################################################################################### 54 | # do event flattening. unwrap message! 55 | #################################################################################### 56 | debezium.transforms=unwrap 57 | debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState 58 | debezium.transforms.unwrap.add.fields=op,table,source.ts_ms,db 59 | debezium.transforms.unwrap.delete.handling.mode=rewrite 60 | debezium.transforms.unwrap.drop.tombstones=true 61 | 62 | #################################################################################### 63 | # ############ SET LOG LEVELS ############ 64 | #################################################################################### 65 | quarkus.log.level=INFO 66 | quarkus.log.console.json=false 67 | # hadoop, parquet 68 | quarkus.log.category."org.apache.hadoop".level=WARN 69 | quarkus.log.category."org.apache.parquet".level=WARN 70 | # Ignore messages below warning level from Jetty, because it's a bit verbose 71 | quarkus.log.category."org.eclipse.jetty".level=WARN 72 | 73 | -------------------------------------------------------------------------------- /downloads/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /explore_postgresql.md: -------------------------------------------------------------------------------- 1 | ### SQL GUI client to access the PostgreSQL databases 2 | 3 | * It uses a tool called `Adminer` that was installed during setup 4 | 5 | --- 6 | 7 | #### Access the `datagen` database with these credentials: 8 | --- 9 | * user --> `datagen` 10 | * Password --> `supersecret1` 11 | 12 | * From a browser navigate to: `http:///adminer` 13 | 14 | ##### `Datagen` database login screen: 15 | --- 16 | 17 | ![](./images/adminer_login_screen.png) 18 | 19 | --- 20 | 21 | #### Access the `icecatalog` database with these credentials: 22 | --- 23 | * user --> `icecatalog` 24 | * Password --> `supersecret1` 25 | 26 | * From a browser navigate to: `http:///adminer` 27 | 28 | ##### `icecatalog` database login screen: 29 | --- 30 | 31 | 32 | ![](./images/adminer_login_screen_icecatalog.png) 33 | 34 | 35 | --- 36 | --- 37 | 38 | Click here to return to 
main page: [`Workshop 2 Exercises`](./README.md/#extra-credit). 39 | -------------------------------------------------------------------------------- /get_files.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ########################################################################################## 4 | # load of the utilities functions: 5 | ########################################################################################## 6 | echo "load utils" 7 | echo 8 | . ~/data_origination_workshop/utils.sh 9 | 10 | ########################################################################################## 11 | ########################################################################################## 12 | ########################################################################################## 13 | ########################################################################################## 14 | # Define the files as variables; 15 | ########################################################################################## 16 | ########################################################################################## 17 | ########################################################################################## 18 | ########################################################################################## 19 | echo "defining vars" 20 | echo 21 | ########################################################################################## 22 | # SPARK & ICEBERG ITEMS: 23 | ########################################################################################## 24 | #SPARK_FILE=https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz 25 | #SPARK_FILE=https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz; echo "SPARK_STANDALONE_FILE=${SPARK_FILE##*/}" >> ~/file_variables.output 26 | SPARK_FILE=https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz; echo "SPARK_STANDALONE_FILE=${SPARK_FILE##*/}" >> ~/file_variables.output 27 | COMMONS_POOL2_JAR=https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar; echo "COMMONS_POOL2_FILE=${COMMONS_POOL2_JAR##*/}" >> ~/file_variables.output 28 | KAFKA_CLIENT_JAR=https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.3.1/kafka-clients-3.3.1.jar; echo "KAFKA_CLIENT_FILE=${KAFKA_CLIENT_JAR##*/}" >> ~/file_variables.output 29 | SPARK_TOKEN_JAR=https://repo.mavenlibs.com/maven/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.3.1/spark-token-provider-kafka-0-10_2.12-3.3.1.jar; echo "SPARK_TOKEN_FILE=${SPARK_TOKEN_JAR##*/}" >> ~/file_variables.output 30 | SPARK_SQL_KAFKA_JAR=https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.3.1/spark-sql-kafka-0-10_2.12-3.3.1.jar; echo "SPARK_SQL_KAFKA_FILE=${SPARK_SQL_KAFKA_JAR##*/}" >> ~/file_variables.output 31 | ICEBERG_SPARK_JAR=https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar; echo "SPARK_ICEBERG_FILE=${ICEBERG_SPARK_JAR##*/}" >> ~/file_variables.output 32 | URL_CONNECT_JAR=https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.19.19/url-connection-client-2.19.19.jar; echo "URL_CONNECT_FILE=${URL_CONNECT_JAR##*/}" >> ~/file_variables.output 33 | AWS_BUNDLE_JAR=https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.19.19/bundle-2.19.19.jar; echo "AWS_BUNDLE_FILE=${AWS_BUNDLE_JAR##*/}" >> ~/file_variables.output 34 | 35 | 36 | 
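# NOTE: each ${VAR##*/} expansion above strips everything up to the last '/'
# in the URL, so only the bare file name (e.g. spark-3.3.2-bin-hadoop3.tgz)
# is recorded in ~/file_variables.output for later steps to reference.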
########################################################################################## 37 | # KAKFA CONNECT ITEMS: 38 | ########################################################################################## 39 | #KCONNECT_FILE=https://dlcdn.apache.org/kafka/3.3.2/kafka_2.13-3.3.2.tgz; echo "KAFKA_CONNECT_FILE=${KCONNECT_FILE##*/}" >> ~/file_variables.output 40 | KCONNECT_FILE=https://archive.apache.org/dist/kafka//3.3.2/kafka_2.13-3.3.2.tgz; echo "KAFKA_CONNECT_FILE=${KCONNECT_FILE##*/}" >> ~/file_variables.output 41 | KCONNECT_JDBC_JAR=https://jdbc.postgresql.org/download/postgresql-42.5.1.jar; echo "KCONNECT_JDBC_FILE=${KCONNECT_JDBC_JAR##*/}" >> ~/file_variables.output 42 | DBZ_CONNECT_FILE=https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.1.1.Final/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz; echo "DEBEZIUM_CONNECT_FILE=${DBZ_CONNECT_FILE##*/}" >> ~/file_variables.output 43 | 44 | ########################################################################################## 45 | # REDPANDA ITEMS: 46 | ########################################################################################## 47 | REDPANDA_REPO_FILE=https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh; echo "PANDA_REPO_FILE=${REDPANDA_REPO_FILE##*/}" >> ~/file_variables.output 48 | REDPANDA_FILE=https://github.com/redpanda-data/redpanda/releases/latest/download/rpk-linux-amd64.zip; echo "PANDA_FILE=${REDPANDA_FILE##*/}" >> ~/file_variables.output 49 | 50 | ########################################################################################## 51 | # POSTGRESQL ITEMS: 52 | ########################################################################################## 53 | PSQL_REPO_KEY=https://www.postgresql.org/media/keys/ACCC4CF8.asc; echo "POSTGRESQL_KEY_FILE=${PSQL_REPO_KEY##*/}" >> ~/file_variables.output 54 | PSQL_JDBC_JAR=https://jdbc.postgresql.org/download/postgresql-42.5.1.jar; echo "POSTGRESQL_FILE=${KCONNECT_JDBC_JAR##*/}" >> ~/file_variables.output 55 | 56 | ########################################################################################## 57 | # MINIO ITEMS: 58 | ########################################################################################## 59 | #MINIO_CLI_FILE=https://dl.min.io/client/mc/release/linux-amd64/mc; echo "MINIO_FILE=${MINIO_CLI_FILE##*/}" >> ~/file_variables.output 60 | MINIO_CLI_FILE=https://dl.min.io/client/mc/release/linux-amd64/archive/mc.RELEASE.2023-01-11T03-14-16Z; echo "MINIO_FILE=${MINIO_CLI_FILE##*/}" >> ~/file_variables.output 61 | MINIO_PACKAGE=https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230112020616.0.0_amd64.deb; echo "MINIO_PACKAGE_FILE=${MINIO_PACKAGE##*/}" >> ~/file_variables.output 62 | 63 | ########################################################################################## 64 | # DOCKER ITEMS: 65 | ########################################################################################## 66 | DOCKER_KEY_FILE=https://download.docker.com/linux/ubuntu/gpg; echo "DOCKER_REPO_KEY_FILE=${DOCKER_KEY_FILE##*/}" >> ~/file_variables.output 67 | 68 | ########################################################################################## 69 | ########################################################################################## 70 | ########################################################################################## 71 | ########################################################################################## 72 | # Get the files from the above variables; 73 | 
########################################################################################## 74 | ########################################################################################## 75 | ########################################################################################## 76 | ########################################################################################## 77 | 78 | ########################################################################################## 79 | # GET - SPARK & ICEBERG ITEMS: 80 | ########################################################################################## 81 | echo "calling get_valid_urls" 82 | echo 83 | get_valid_url $SPARK_FILE 84 | get_valid_url $COMMONS_POOL2_JAR 85 | get_valid_url $KAFKA_CLIENT_JAR 86 | get_valid_url $SPARK_TOKEN_JAR 87 | get_valid_url $SPARK_SQL_KAFKA_JAR 88 | get_valid_url $ICEBERG_SPARK_JAR 89 | get_valid_url $URL_CONNECT_JAR 90 | get_valid_url $AWS_BUNDLE_JAR 91 | 92 | ########################################################################################## 93 | # GET - KAKFA CONNECT ITEMS: 94 | ########################################################################################## 95 | get_valid_url $KCONNECT_FILE 96 | get_valid_url $KCONNECT_JDBC_JAR 97 | get_valid_url $DBZ_CONNECT_FILE 98 | 99 | ########################################################################################## 100 | # GET - REDPANDA ITEMS: 101 | ########################################################################################## 102 | get_valid_url $REDPANDA_REPO_FILE 103 | get_valid_url $REDPANDA_FILE 104 | 105 | ########################################################################################## 106 | # GET - POSTGRESQL ITEMS: 107 | ########################################################################################## 108 | get_valid_url $PSQL_REPO_KEY 109 | get_valid_url $PSQL_JDBC_JAR 110 | 111 | ########################################################################################## 112 | # GET - MINIO ITEMS: 113 | ########################################################################################## 114 | get_valid_url $MINIO_CLI_FILE 115 | get_valid_url $MINIO_PACKAGE 116 | 117 | ########################################################################################## 118 | # GET - DOCKER ITEMS: 119 | ########################################################################################## 120 | get_valid_url $DOCKER_KEY_FILE 121 | -------------------------------------------------------------------------------- /hive_metastore/hive-site.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | javax.jdo.option.ConnectionURL 4 | jdbc:postgresql://localhost:5432/hive_metastore 5 | 6 | 7 | 8 | javax.jdo.option.ConnectionDriverName 9 | org.postgresql.Driver 10 | 11 | 12 | 13 | javax.jdo.option.ConnectionUserName 14 | hive 15 | 16 | 17 | 18 | javax.jdo.option.ConnectionPassword 19 | supersecret1 20 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /images/.placeholder: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /images/Iceberg.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/Iceberg.gif 
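
For reference, the `get_valid_url` helper that `get_files.sh` above calls for each download (it is defined in `utils.sh`, shown later in this repo) checks that a URL answers before pulling the file down. A rough Python equivalent of that validate-then-download pattern is sketched below; it assumes the third-party `requests` library and a local `downloads` directory, neither of which is part of the workshop scripts.

```
import sys
import requests

def get_valid_url(url: str, dest_dir: str = "downloads") -> None:
    """Rough Python analogue of get_valid_url in utils.sh: verify, then download."""
    head = requests.head(url, allow_redirects=True, timeout=15)
    if head.status_code != 200:
        print(f"file: {url} -- does not exist. Aborting the install.")
        sys.exit(1)

    filename = url.rsplit("/", 1)[-1]        # same effect as ${VAR##*/} in the shell script
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    with open(f"{dest_dir}/{filename}", "wb") as fh:
        fh.write(resp.content)
    print(f"file exists. downloaded {filename}")
```
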
-------------------------------------------------------------------------------- /images/access_keys_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/access_keys_view.png -------------------------------------------------------------------------------- /images/adminer_login_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/adminer_login_screen.png -------------------------------------------------------------------------------- /images/adminer_login_screen_icecatalog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/adminer_login_screen_icecatalog.png -------------------------------------------------------------------------------- /images/bucket_first_table_metadata_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/bucket_first_table_metadata_view.png -------------------------------------------------------------------------------- /images/connect_ouput_detail_msg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/connect_ouput_detail_msg.png -------------------------------------------------------------------------------- /images/connect_output_summary_msg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/connect_output_summary_msg.png -------------------------------------------------------------------------------- /images/console_view_run_connect.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/console_view_run_connect.png -------------------------------------------------------------------------------- /images/detail_view_of_cust_msg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/detail_view_of_cust_msg.png -------------------------------------------------------------------------------- /images/drunk-cheers.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/drunk-cheers.gif -------------------------------------------------------------------------------- /images/first_login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/first_login.png -------------------------------------------------------------------------------- /images/initial_bucket_view.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/initial_bucket_view.png -------------------------------------------------------------------------------- /images/minio_login_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/minio_login_screen.png -------------------------------------------------------------------------------- /images/panda_topic_view_connect_topic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/panda_topic_view_connect_topic.png -------------------------------------------------------------------------------- /images/panda_view__dg_load_topics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/panda_view__dg_load_topics.png -------------------------------------------------------------------------------- /images/panda_view_topics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/panda_view_topics.png -------------------------------------------------------------------------------- /images/spark_master_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/spark_master_view.png -------------------------------------------------------------------------------- /images/topic_customer_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/topic_customer_view.png -------------------------------------------------------------------------------- /kafka_connect/connect.properties: -------------------------------------------------------------------------------- 1 | #Kafka broker addresses 2 | bootstrap.servers=:9092 3 | 4 | #Cluster level converters 5 | #These applies when the connectors don't define any converter 6 | key.converter=org.apache.kafka.connect.json.JsonConverter 7 | value.converter=org.apache.kafka.connect.json.JsonConverter 8 | 9 | #JSON schemas enabled to false in cluster level 10 | key.converter.schemas.enable=true 11 | value.converter.schemas.enable=true 12 | 13 | #Where to keep the Connect topic offset configurations 14 | offset.storage.file.filename=/tmp/connect.offsets 15 | offset.flush.interval.ms=10000 16 | 17 | #Plugin path to put the connector binaries 18 | plugin.path=:~/kafka_connect/plugins/debezium-connector-postgres/ 19 | -------------------------------------------------------------------------------- /kafka_connect/pg-source-connector.properties: -------------------------------------------------------------------------------- 1 | connector.class=io.debezium.connector.postgresql.PostgresConnector 2 | offset.storage=org.apache.kafka.connect.storage.FileOffsetBackingStore 3 | 
offset.storage.file.filename=offset.dat 4 | offset.flush.interval.ms=5000 5 | name=postgres-dbz-connector 6 | database.hostname=localhost 7 | database.port=5432 8 | database.user=datagen 9 | database.password=supersecret1 10 | database.dbname=datagen 11 | schema.include.list=datagen 12 | plugin.name=pgoutput 13 | topic.prefix=pg_datagen2panda 14 | -------------------------------------------------------------------------------- /prework.md: -------------------------------------------------------------------------------- 1 | --- 2 | --- 3 | # Temp items for my setup on proxmox: 4 | ``` 5 | 6 | ########################################################################################## 7 | # notes: 8 | ########################################################################################## 9 | -- I built a new standalone ubuntu 20 server to install this with proxmox: 10 | 11 | # create a clone from the template 12 | qm clone 9400 670 --name ice-integration 13 | 14 | # put your ssh key into a file: `~/cloud_images/ssh_stuff` 15 | qm set 670 --sshkey ~/cloud_images/ssh_stuff/id_rsa.pub 16 | 17 | # change the default username: 18 | qm set 670 --ciuser centos 19 | 20 | # Let's setup dhcp for the network in this image: 21 | qm set 670 --ipconfig0 ip=dhcp 22 | 23 | # start the image from gui 24 | qm start 670 25 | 26 | ########################################################################################## 27 | # If I need to stop and destroy 28 | ########################################################################################## 29 | qm stop 670 && qm destroy 670 30 | 31 | ########################################################################################## 32 | # ssh to our new host: 33 | ########################################################################################## 34 | 35 | ssh -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -o UserKnownHostsFile=/dev/null -i ~/fishermans_wharf/proxmox/id_rsa centos@192.168.1.43 36 | 37 | ``` 38 | --- 39 | --- 40 | -------------------------------------------------------------------------------- /redpanda/redpanda.yaml: -------------------------------------------------------------------------------- 1 | redpanda: 2 | data_directory: /var/lib/redpanda/data 3 | seed_servers: [] 4 | rpc_server: 5 | address: 6 | port: 33145 7 | kafka_api: 8 | - address: 9 | port: 9092 10 | admin: 11 | - address: 12 | port: 9644 13 | developer_mode: true 14 | auto_create_topics_enabled: true 15 | fetch_reads_debounce_timeout: 10 16 | group_initial_rebalance_delay: 0 17 | group_topic_partitions: 3 18 | storage_min_free_bytes: 10485760 19 | topic_partitions_per_shard: 1000 20 | rpk: 21 | enable_usage_stats: true 22 | coredump_dir: /var/lib/redpanda/coredump 23 | overprovisioned: true 24 | pandaproxy: {} 25 | schema_registry: {} 26 | -------------------------------------------------------------------------------- /sample_output/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /sample_spark_jobs.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | #### Addition Spark & Python Exercises: 4 | 5 | Here are some additional spark examples that demonstrate how to interact with the data generated with spark. 
6 | 7 | In this spark job [`consume_stream_customer_2_console.py`](./datagen/consume_stream_customer_2_console.py) we will consume the records from the topic `dgCustomer` and just stream them to our console. 8 | 9 | ``` 10 | spark-submit ~/datagen/consume_stream_customer_2_console.py 11 | ``` 12 | --- 13 | In this spark job [`consume_stream_txn_2_console.py`](./datagen/consume_stream_txn_2_console.py) we will consume the records from the topic `dgTxn` and just stream them to our console. 14 | 15 | ``` 16 | spark-submit ~/datagen/consume_stream_txn_2_console.py 17 | ``` 18 | 19 | --- 20 | In this python job [`comsume_topic_dgCustomer.py`](./datagen/comsume_topic_dgCustomer.py) we will consume 4 records from the topic `dgCustomer` and just stream them to our console. 21 | 22 | ``` 23 | python3 ~/datagen/comsume_topic_dgCustomer.py 4 24 | ``` 25 | --- 26 | 27 | Click here to return to the workshop: [`Workshop 2 Exercises`](./README.md/#automation-of-workshop-1-exercises). 28 | 29 | --- 30 | -------------------------------------------------------------------------------- /spark_items/all_workshop1_items.sql: -------------------------------------------------------------------------------- 1 | -- create the customer table 2 | CREATE TABLE icecatalog.icecatalog.customer ( 3 | first_name STRING, 4 | last_name STRING, 5 | street_address STRING, 6 | city STRING, 7 | state STRING, 8 | zip_code STRING, 9 | home_phone STRING, 10 | mobile STRING, 11 | email STRING, 12 | ssn STRING, 13 | job_title STRING, 14 | create_date STRING, 15 | cust_id BIGINT) 16 | USING iceberg 17 | OPTIONS ( 18 | 'write.object-storage.enabled'=true, 19 | 'write.data.path'='s3://iceberg-data'); 20 | 21 | -- Create the Transactions table 22 | CREATE TABLE icecatalog.icecatalog.transactions ( 23 | transact_id STRING, 24 | transaction_date STRING, 25 | item_desc STRING, 26 | barcode STRING, 27 | category STRING, 28 | amount STRING, 29 | cust_id BIGINT) 30 | USING iceberg 31 | OPTIONS ( 32 | 'write.object-storage.enabled'=true, 33 | 'write.data.path'='s3://iceberg-data'); 34 | 35 | -- load customer table from json records 36 | CREATE TEMPORARY VIEW customerView 37 | USING org.apache.spark.sql.json 38 | OPTIONS ( 39 | path "/opt/spark/input/customers.json" 40 | ); 41 | INSERT INTO icecatalog.icecatalog.customer 42 | SELECT 43 | first_name, 44 | last_name, 45 | street_address, 46 | city, 47 | state, 48 | zip_code, 49 | home_phone, 50 | mobile, 51 | email, 52 | ssn, 53 | job_title, 54 | create_date, 55 | cust_id 56 | FROM customerView; 57 | 58 | -- Merge customer json records: 59 | CREATE TEMPORARY VIEW mergeCustomerView 60 | USING org.apache.spark.sql.json 61 | OPTIONS ( 62 | path "/opt/spark/input/update_customers.json" 63 | ); 64 | MERGE INTO icecatalog.icecatalog.customer c 65 | USING (SELECT 66 | first_name, 67 | last_name, 68 | street_address, 69 | city, 70 | state, 71 | zip_code, 72 | home_phone, 73 | mobile, 74 | email, 75 | ssn, 76 | job_title, 77 | create_date, 78 | cust_id 79 | FROM mergeCustomerView) j 80 | ON c.cust_id = j.cust_id 81 | WHEN MATCHED THEN UPDATE SET 82 | c.first_name = j.first_name, 83 | c.last_name = j.last_name, 84 | c.street_address = j.street_address, 85 | c.city = j.city, 86 | c.state = j.state, 87 | c.zip_code = j.zip_code, 88 | c.home_phone = j.home_phone, 89 | c.mobile = j.mobile, 90 | c.email = j.email, 91 | c.ssn = j.ssn, 92 | c.job_title = j.job_title, 93 | c.create_date = j.create_date 94 | WHEN NOT MATCHED THEN INSERT *; 95 | 
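
The batch items in `all_workshop1_items.sql` above can also be driven from PySpark instead of the `spark-sql` CLI. The sketch below assumes a SparkSession configured with the same `icecatalog` settings shown in `conf.properties` and `load_ice_transactions_pyspark.py` (the Iceberg/AWS package list is omitted here for brevity) and that `/opt/spark/input/update_customers.json` exists; note that `spark.sql()` runs one statement at a time, so each statement from the file would be submitted separately.

```
from pyspark.sql import SparkSession

# Session config mirrors conf.properties / load_ice_transactions_pyspark.py;
# the spark.jars.packages entries for Iceberg and the AWS SDK are assumed.
spark = (SparkSession.builder
         .appName("workshop1_batch_sql")
         .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
         .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog")
         .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog")
         .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1")
         .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data")
         .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
         .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000")
         .getOrCreate())

# Register the update file as a temporary view, then run the same MERGE logic
# as the SQL above (the column-by-column SET list is abbreviated to SET * here).
spark.read.json("/opt/spark/input/update_customers.json") \
     .createOrReplaceTempView("mergeCustomerView")

spark.sql("""
  MERGE INTO icecatalog.icecatalog.customer c
  USING (SELECT * FROM mergeCustomerView) j
  ON c.cust_id = j.cust_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

spark.stop()
```
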
-------------------------------------------------------------------------------- /spark_items/conf.properties: -------------------------------------------------------------------------------- 1 | spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 2 | spark.sql.cli.print.header=true 3 | spark.sql.catalog.icecatalog=org.apache.iceberg.spark.SparkCatalog 4 | spark.sql.catalog.icecatalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog 5 | spark.sql.catalog.icecatalog.uri=jdbc:postgresql://127.0.0.1:5432/icecatalog 6 | spark.sql.catalog.icecatalog.jdbc.user=icecatalog 7 | spark.sql.catalog.icecatalog.jdbc.password=supersecret1 8 | spark.sql.catalog.icecatalog.warehouse=s3://iceberg-data 9 | spark.sql.catalog.icecatalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO 10 | spark.sql.catalog.icecatalog.s3.endpoint=http://127.0.0.1:9000 11 | spark.sql.catalog.sparkcatalog=org.apache.iceberg.spark.SparkSessionCatalog 12 | spark.sql.defaultCatalog=icecatalog 13 | spark.eventLog.enabled=true 14 | spark.eventLog.dir=/opt/spark/spark-events 15 | spark.history.fs.logDirectory=/opt/spark/spark-events 16 | spark.sql.catalogImplementation=in-memory 17 | -------------------------------------------------------------------------------- /spark_items/ice_spark-sql_i-cli.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . ~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties 19 | 20 | -------------------------------------------------------------------------------- /spark_items/iceberg_workshop_sql_items.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . 
~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties \ 19 | -f /opt/spark/sql/all_workshop1_items.sql \ 20 | --verbose 21 | -------------------------------------------------------------------------------- /spark_items/iceberg_workshop_tbl_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE icecatalog.icecatalog.customer ( 2 | first_name STRING, 3 | last_name STRING, 4 | street_address STRING, 5 | city STRING, 6 | state STRING, 7 | zip_code STRING, 8 | home_phone STRING, 9 | mobile STRING, 10 | email STRING, 11 | ssn STRING, 12 | job_title STRING, 13 | create_date STRING, 14 | cust_id BIGINT) 15 | USING iceberg 16 | OPTIONS ( 17 | 'write.object-storage.enabled'=true, 18 | 'write.data.path'='s3://iceberg-data'); 19 | 20 | CREATE TABLE icecatalog.icecatalog.transactions ( 21 | transact_id STRING, 22 | transaction_date STRING, 23 | item_desc STRING, 24 | barcode STRING, 25 | category STRING, 26 | amount STRING, 27 | cust_id BIGINT) 28 | USING iceberg 29 | OPTIONS ( 30 | 'write.object-storage.enabled'=true, 31 | 'write.data.path'='s3://iceberg-data'); 32 | -------------------------------------------------------------------------------- /spark_items/load_ice_customer_batch.sql: -------------------------------------------------------------------------------- 1 | CREATE TEMPORARY VIEW customerView 2 | USING org.apache.spark.sql.json 3 | OPTIONS ( 4 | path "/opt/spark/input/customers.json" 5 | ); 6 | INSERT INTO icecatalog.icecatalog.customer 7 | SELECT 8 | first_name, 9 | last_name, 10 | street_address, 11 | city, 12 | state, 13 | zip_code, 14 | home_phone, 15 | mobile, 16 | email, 17 | ssn, 18 | job_title, 19 | create_date, 20 | cust_id 21 | FROM customerView; 22 | -------------------------------------------------------------------------------- /spark_items/load_ice_transactions_pyspark.py: -------------------------------------------------------------------------------- 1 | # import SparkSession 2 | from pyspark.sql import SparkSession 3 | 4 | # create SparkSession 5 | spark = SparkSession.builder \ 6 | .appName("Python Spark SQL example") \ 7 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19") \ 8 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 9 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 10 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 11 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 12 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 13 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 14 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 15 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 16 | 
.config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 17 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 18 | .config("spark.eventLog.enabled", "true") \ 19 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 20 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 21 | .config("spark.sql.catalogImplementation", "in-memory") \ 22 | .getOrCreate() 23 | 24 | # A JSON dataset is pointed to by 'path' variable 25 | path = "/opt/spark/input/transactions.json" 26 | 27 | # read json into the DataFrame 28 | transactionsDF = spark.read.json(path) 29 | 30 | # visualize the inferred schema 31 | transactionsDF.printSchema() 32 | 33 | # print out the dataframe in this cli 34 | transactionsDF.show() 35 | 36 | # Append these transactions to the table we created in an earlier step `icecatalog.icecatalog.transactions` 37 | transactionsDF.writeTo("icecatalog.icecatalog.transactions").append() 38 | 39 | # stop the sparkSession 40 | spark.stop() 41 | 42 | # Exit out of the editor: 43 | quit(); 44 | -------------------------------------------------------------------------------- /spark_items/merge_ice_customer_batch.sql: -------------------------------------------------------------------------------- 1 | CREATE TEMPORARY VIEW mergeCustomerView 2 | USING org.apache.spark.sql.json 3 | OPTIONS ( 4 | path "/opt/spark/input/update_customers.json" 5 | ); 6 | MERGE INTO icecatalog.icecatalog.customer c 7 | USING (SELECT 8 | first_name, 9 | last_name, 10 | street_address, 11 | city, 12 | state, 13 | zip_code, 14 | home_phone, 15 | mobile, 16 | email, 17 | ssn, 18 | job_title, 19 | create_date, 20 | cust_id 21 | FROM mergeCustomerView) j 22 | ON c.cust_id = j.cust_id 23 | WHEN MATCHED THEN UPDATE SET 24 | c.first_name = j.first_name, 25 | c.last_name = j.last_name, 26 | c.street_address = j.street_address, 27 | c.city = j.city, 28 | c.state = j.state, 29 | c.zip_code = j.zip_code, 30 | c.home_phone = j.home_phone, 31 | c.mobile = j.mobile, 32 | c.email = j.email, 33 | c.ssn = j.ssn, 34 | c.job_title = j.job_title, 35 | c.create_date = j.create_date 36 | WHEN NOT MATCHED THEN INSERT *; 37 | -------------------------------------------------------------------------------- /spark_items/stream_customer_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE icecatalog.icecatalog.stream_customer ( 2 | first_name STRING, 3 | last_name STRING, 4 | street_address STRING, 5 | city STRING, 6 | state STRING, 7 | zip_code STRING, 8 | home_phone STRING, 9 | mobile STRING, 10 | email STRING, 11 | ssn STRING, 12 | job_title STRING, 13 | create_date STRING, 14 | cust_id BIGINT) 15 | USING iceberg 16 | OPTIONS ( 17 | 'write.object-storage.enabled'=true, 18 | 'write.data.path'='s3://iceberg-data'); 19 | -------------------------------------------------------------------------------- /spark_items/stream_customer_ddl_script.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . 
~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties \ 19 | -f /opt/spark/sql/stream_customer_ddl.sql \ 20 | --verbose 21 | -------------------------------------------------------------------------------- /spark_items/stream_customer_event_history_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE icecatalog.icecatalog.stream_customer_event_history ( 2 | type STRING, 3 | event_ts LONG, 4 | tx_id STRING, 5 | first_name STRING, 6 | last_name STRING, 7 | street_address STRING, 8 | city STRING, 9 | state STRING, 10 | zip_code STRING, 11 | home_phone STRING, 12 | mobile STRING, 13 | email STRING, 14 | ssn STRING, 15 | job_title STRING, 16 | create_date STRING, 17 | cust_id BIGINT) 18 | USING iceberg 19 | OPTIONS ( 20 | 'write.object-storage.enabled'=true, 21 | 'write.data.path'='s3://iceberg-data'); 22 | -------------------------------------------------------------------------------- /spark_items/stream_customer_event_history_ddl_script.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . ~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties \ 19 | -f /opt/spark/sql/stream_customer_event_history_ddl.sql \ 20 | --verbose 21 | -------------------------------------------------------------------------------- /stop_start_services.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ########################################################################################## 4 | # Stop of Spark items 5 | ########################################################################################## 6 | 7 | echo "stopping spark worker..." 8 | /opt/spark/sbin/stop-worker.sh spark://$(hostname -f):7077 9 | echo 10 | sleep 5 11 | 12 | echo "stopping spark master..." 13 | /opt/spark/sbin/stop-master.sh 14 | echo 15 | sleep 5 16 | 17 | ########################################################################################## 18 | # Stop of panda items 19 | ########################################################################################## 20 | 21 | 22 | echo "stopping redpanda..." 23 | sudo systemctl stop redpanda 24 | echo 25 | sleep 5 26 | 27 | echo "stopping redpanda console..." 
28 | sudo systemctl stop redpanda-console 29 | echo 30 | sleep 5 31 | 32 | ########################################################################################## 33 | # stop of minio 34 | ########################################################################################## 35 | 36 | echo "stopping minio.service..." 37 | sudo systemctl stop minio.service 38 | echo 39 | sleep 5 40 | 41 | ########################################################################################## 42 | # Start of Spark items 43 | ########################################################################################## 44 | 45 | echo "starting spark master..." 46 | /opt/spark/sbin/start-master.sh 47 | echo 48 | sleep 5 49 | echo "starting spark worker..." 50 | /opt/spark/sbin/start-worker.sh spark://$(hostname -f):7077 51 | echo 52 | sleep 5 53 | 54 | ########################################################################################## 55 | # Start of panda items 56 | ########################################################################################## 57 | 58 | 59 | echo "starting redpanda..." 60 | sudo systemctl start redpanda 61 | echo 62 | sleep 5 63 | 64 | echo "starting redpanda console..." 65 | sudo systemctl start redpanda-console 66 | echo 67 | sleep 5 68 | ########################################################################################## 69 | # start of minio 70 | ########################################################################################## 71 | 72 | echo "starting minio.service..." 73 | sudo systemctl start minio.service 74 | echo 75 | sleep 5 76 | echo "services have been restarted..." 77 | -------------------------------------------------------------------------------- /tick2705-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/tick2705-1.png -------------------------------------------------------------------------------- /tick2705-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/tick2705-2.png -------------------------------------------------------------------------------- /utils.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | function validate_url(){ 4 | if [[ `wget -S --spider $1 --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 --tries=4 2>&1 | grep 'HTTP/1.1 200 OK'` ]]; then 5 | return 0 6 | else 7 | return 1 8 | fi 9 | 10 | } 11 | 12 | function get_valid_url(){ 13 | if validate_url $1; then 14 | # Download when exists 15 | echo "file exists. downloading..." 16 | wget $1 --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 --tries=4 -P ~/data_origination_workshop/downloads 17 | 18 | else 19 | # print error and exit the install 20 | echo "file: $1 -- does not exist. Aborting the install." 21 | exit 1 22 | fi 23 | } 24 | 25 | -------------------------------------------------------------------------------- /workshop1_revisit.md: -------------------------------------------------------------------------------- 1 | --- 2 | Title: Apache Iceberg Exploration with S3A storage. 3 | Author: Tim Lepple 4 | Last Updated: 1.30.2023 5 | Comments: This repo will evolve over time with new items. 
6 | Tags: Apache Iceberg | Minio | Apache SparkSQL | Apache PySpark | Ubuntu 7 | --- 8 | 9 | # Apache Iceberg Introduction Workshop 10 | 11 | --- 12 | 13 | ## Objective: 14 | My goal in this workshop was to evaluate Apache Iceberg with data stored in an S3a-compliant object store on a traditional Linux server. It has been modified slightly from the original to avoid port conflicts. Documentation here may need additional updates. 15 | 16 | 17 | --- 18 | --- 19 | ### What is Apache Iceberg: 20 | 21 | Apache Iceberg is an open-source data management system for large-scale data lakes. It provides a table abstraction for big data workloads, allowing for schema evolution, data discovery and simplified data access. Iceberg uses Apache Avro, Parquet or ORC as its data format and supports various storage systems like HDFS, S3, ADLS, etc. 22 | 23 | Iceberg uses a versioning approach to manage schema changes, enabling multiple versions of a schema to coexist in a table, providing the ability to perform schema evolution without the need for copying data. Additionally, Iceberg provides data discovery capabilities, allowing users to identify the data they need for their specific use case and extract only that data, reducing the amount of I/O required to perform a query. 24 | 25 | Iceberg provides an easy-to-use API for querying data, supporting SQL and other query languages through Apache Spark, Hive, Presto and other engines. Iceberg’s table-centric design helps to manage large datasets with high scalability, reliability and performance. 26 | 27 | --- 28 | --- 29 | ### But Why Apache Iceberg: 30 | 31 | A couple of items really jumped out at me when I read the documentation for the first time, and I immediately saw the significant benefit it could provide. Namely, it could reduce the overall expense enterprises pay to store and process the data they produce. We all know that saving money in an enterprise is a good thing. 32 | 33 | It can also perform standard `CRUD` operations on our tables seamlessly. Here are the two items that really hit home for me: 34 | 35 | --- 36 | ### Item 1: 37 | --- 38 | * Iceberg is designed for huge tables and is used in production where a single table can contain tens of petabytes of data. This data can be stored in modern-day object stores similar to these: 39 | * A Cloud provider like [Amazon S3](https://aws.amazon.com/s3/?nc2=h_ql_prod_st_s3) 40 | * An on-premise solution that you build and support yourself like [Minio](https://min.io/). 41 | * Or a vendor hardware appliance like the [Dell ECS Enterprise Object Storage](https://www.dell.com/en-us/dt/storage/ecs/index.htm) 42 | 43 | Regardless of which object store you choose, your overall expense to support this platform will see significant savings over what you probably spend today. 44 | 45 | --- 46 | ### Item 2: 47 | --- 48 | * Multi-petabyte tables can be read from a single node without needing a distributed SQL engine to sift through table metadata. That means the tools used in the examples I give below could be used to query the data stored in object stores without needing to dedicate expensive compute servers. You could spin up virtual instances or containers and execute queries against the data stored in the object store. 49 | 50 | --- 51 | 52 | The image below, from Starburst.io, gives a good visual overview of how Iceberg organizes a table.
53 | 54 | --- 55 | ![](./images/Iceberg.gif) 56 | 57 | --- 58 | --- 59 | 60 | # Highlights: 61 | 62 | --- 63 | 64 | This setup script built a single-node platform: it set up a local S3a-compliant object store, installed a local SQL database, installed a single-node Apache Iceberg processing engine, and laid the groundwork to support our Apache Iceberg tables and catalog. 65 | 66 | --- 67 | #### Object Storage Notes: 68 | --- 69 | * This type of object store could also be set up to run in your own data center if that is a requirement. Otherwise, you could build and deploy something very similar in AWS using their S3 service instead. I chose this option to demonstrate that you have a lot of options you might not have considered. It will store all of our Apache Iceberg data and catalog database objects. 70 | * This particular service is running Minio and it has a REST API that supports direct integration with the AWS CLI tool. The script also installed the AWS CLI tools and configured the properties of the AWS CLI to work directly with Minio. 71 | 72 | --- 73 | #### Local Database Notes: 74 | --- 75 | * The local SQL database is PostgreSQL and it will host metadata with pointers to the Apache Iceberg table data persisted in our object store and the metadata for our Apache Iceberg catalog. It maintains a very small footprint. 76 | 77 | --- 78 | #### Apache Iceberg Processing Engine Notes: 79 | --- 80 | * This particular workshop is using Apache Spark but we could have chosen any of the currently supported platforms. We could also choose to use a combination of these tools and have them share the same Apache Iceberg Catalog. Here is the current list of supported tools: 81 | * Spark 82 | * Flink 83 | * Trino 84 | * Presto 85 | * Dremio 86 | * StarRocks 87 | * Amazon Athena 88 | * Amazon EMR 89 | * Impala (Cloudera) 90 | * Doris 91 | 92 | --- 93 | --- 94 | 95 | 96 | --- 97 | --- 98 | 99 | ### Testing the `AWS CLI` and the `Minio CLI`: 100 | 101 | --- 102 | --- 103 | 104 | ### AWS CLI Integration: 105 | 106 | Let's test out the AWS CLI that was installed and configured during the build and run an `AWS S3` command to list the buckets currently stored in our Minio object store. 107 | 108 | --- 109 | 110 | ##### Command: 111 | 112 | ``` 113 | aws --endpoint-url http://127.0.0.1:9000 s3 ls 114 | ``` 115 | 116 | ##### Expected Output: The bucket name. 117 | ``` 118 | 2023-01-24 22:58:38 iceberg-data 119 | ``` 120 | --- 121 | 122 | --- 123 | 124 | ### Minio CLI Integration: 125 | 126 | Minio also ships a REST API and its own command-line client for accomplishing many administrative tasks and working with buckets without using the AWS CLI. The Minio client was also installed and configured during setup. Here is a link to the documentation: [Minio Client](https://min.io/docs/minio/linux/reference/minio-mc.html). 127 | 128 | --- 129 | 130 | ##### List Command: 131 | 132 | ``` 133 | mc ls icebergadmin 134 | ``` 135 | 136 | ##### Expected Output: The bucket name. 137 | ``` 138 | [2023-01-26 16:54:33 UTC] 0B iceberg-data/ 139 | 140 | ``` 141 | 142 | --- 143 | --- 144 | ### Minio Overview: 145 | 146 | Minio is an open-source, high-performance, and scalable object storage system. It is designed to be API-compatible with Amazon S3, allowing applications written for Amazon S3 to work seamlessly with Minio. Minio can be deployed on-premises, in the cloud, or in a hybrid environment, providing a unified, centralized repository for storing and managing unstructured data, such as images, videos, and backups.
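Because Minio speaks the S3 API, any standard AWS SDK can talk to it by simply overriding the endpoint. Here is a minimal sketch (not one of the workshop scripts) showing how `boto3` could list our buckets against the local Minio service; it assumes `boto3` is installed and that the access/secret keys from `~/minio-output.properties` are exported as environment variables, just as we do later for the Spark-SQL configuration.

```
# sketch: point the standard AWS SDK (boto3) at the local Minio endpoint
# assumes: `pip install boto3`, and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# exported from ~/minio-output.properties
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",   # Minio API endpoint used in this workshop
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name="us-east-1",
)

# list buckets -- should include the `iceberg-data` bucket created during setup
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```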
147 | 148 | Minio provides features such as versioning, access control, encryption, and event notifications, making it suitable for use cases such as data archiving, backup and disaster recovery, and media and entertainment. Minio also supports distributed mode, allowing multiple Minio nodes to be combined into a single object storage cluster for increased scalability and reliability. 149 | 150 | Minio can be used with a variety of tools and frameworks, including popular cloud-native technologies like Kubernetes, Docker, and Ansible, making it easy to deploy and manage. 151 | 152 | --- 153 | --- 154 | 155 | ### Explore Minio GUI from a browser. 156 | 157 | Let's log in to the Minio GUI: navigate to `http://<host-ip>:9000` in a browser 158 | 159 | - Username: `icebergadmin` 160 | - Password: `supersecret1!` 161 | 162 | --- 163 | 164 | ![](./images/minio_login_screen.png) 165 | 166 | --- 167 | 168 | `Object Browser` view with one bucket that was created during the install. Bucket Name: `iceberg-data` 169 | 170 | --- 171 | 172 | ![](./images/first_login.png) 173 | 174 | --- 175 | 176 | Click on the tab `Access Keys`: the key was created during the build too. We use this access key & secret key to configure the AWS CLI. 177 | 178 | --- 179 | 180 | ![](./images/access_keys_view.png) 181 | 182 | --- 183 | 184 | Click on the tab: `Buckets` 185 | 186 | --- 187 | 188 | ![](./images/initial_bucket_view.png) 189 | 190 | --- 191 | 192 | 193 | 194 | --- 195 | ## Apache Iceberg Processing Engine Setup: 196 | 197 | --- 198 | The setup here has changed since workshop one; these items were already started by the setup process in workshop 2. 199 | 200 | --- 201 | 202 | ##### Check that the Spark GUI is up: 203 | * navigate to `http://<host-ip>:8085` in a browser 204 | 205 | --- 206 | 207 | ##### Sample view of Spark Master. 208 | 209 | --- 210 | 211 | ![](./images/spark_master_view.png) 212 | 213 | --- 214 | 215 | #### Configure the Spark-SQL service: 216 | --- 217 | In this step, we will initialize some variables that will be used when we start the Spark-SQL service. Copy this entire block and run it in a terminal window. 218 | 219 | ``` 220 | . ~/minio-output.properties 221 | 222 | export AWS_ACCESS_KEY_ID=$access_key 223 | export AWS_SECRET_ACCESS_KEY=$secret_key 224 | export AWS_S3_ENDPOINT=127.0.0.1:9000 225 | export AWS_REGION=us-east-1 226 | export MINIO_REGION=us-east-1 227 | export DEPENDENCIES="org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0" 228 | export AWS_SDK_VERSION=2.19.19 229 | export AWS_MAVEN_GROUP=software.amazon.awssdk 230 | export AWS_PACKAGES=( 231 | "bundle" 232 | "url-connection-client" 233 | ) 234 | for pkg in "${AWS_PACKAGES[@]}"; do 235 | export DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION" 236 | done 237 | ``` 238 | 239 | ##### Start the Spark-SQL client service: 240 | --- 241 | Starting this service will connect to our PostgreSQL database and store database objects that point to the Apache Iceberg Catalog on our behalf. The metadata for our catalog & tables (along with table records) will be stored in files persisted in our object stores.
242 | 243 | ``` 244 | cd $SPARK_HOME 245 | 246 | spark-sql --packages $DEPENDENCIES \ 247 | --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ 248 | --conf spark.sql.cli.print.header=true \ 249 | --conf spark.sql.catalog.icecatalog=org.apache.iceberg.spark.SparkCatalog \ 250 | --conf spark.sql.catalog.icecatalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \ 251 | --conf spark.sql.catalog.icecatalog.uri=jdbc:postgresql://127.0.0.1:5432/icecatalog \ 252 | --conf spark.sql.catalog.icecatalog.jdbc.user=icecatalog \ 253 | --conf spark.sql.catalog.icecatalog.jdbc.password=supersecret1 \ 254 | --conf spark.sql.catalog.icecatalog.warehouse=s3://iceberg-data \ 255 | --conf spark.sql.catalog.icecatalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ 256 | --conf spark.sql.catalog.icecatalog.s3.endpoint=http://127.0.0.1:9000 \ 257 | --conf spark.sql.catalog.sparkcatalog=org.apache.iceberg.spark.SparkSessionCatalog \ 258 | --conf spark.sql.defaultCatalog=icecatalog \ 259 | --conf spark.eventLog.enabled=true \ 260 | --conf spark.eventLog.dir=/opt/spark/spark-events \ 261 | --conf spark.history.fs.logDirectory=/opt/spark/spark-events \ 262 | --conf spark.sql.catalogImplementation=in-memory 263 | ``` 264 | --- 265 | ##### Expected Output: 266 | * the warnings can be ignored 267 | ``` 268 | 23/01/25 19:48:19 WARN Utils: Your hostname, spark-ice2 resolves to a loopback address: 127.0.1.1; using 192.168.1.167 instead (on interface eth0) 269 | 23/01/25 19:48:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 270 | :: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml 271 | Ivy Default Cache set to: /home/centos/.ivy2/cache 272 | The jars for the packages stored in: /home/centos/.ivy2/jars 273 | org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency 274 | software.amazon.awssdk#bundle added as a dependency 275 | software.amazon.awssdk#url-connection-client added as a dependency 276 | :: resolving dependencies :: org.apache.spark#spark-submit-parent-59d47579-1c2b-4e66-a92d-206be33d8afe;1.0 277 | confs: [default] 278 | found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.1.0 in central 279 | found software.amazon.awssdk#bundle;2.19.19 in central 280 | found software.amazon.eventstream#eventstream;1.0.1 in central 281 | found software.amazon.awssdk#url-connection-client;2.19.19 in central 282 | found software.amazon.awssdk#utils;2.19.19 in central 283 | found org.reactivestreams#reactive-streams;1.0.3 in central 284 | found software.amazon.awssdk#annotations;2.19.19 in central 285 | found org.slf4j#slf4j-api;1.7.30 in central 286 | found software.amazon.awssdk#http-client-spi;2.19.19 in central 287 | found software.amazon.awssdk#metrics-spi;2.19.19 in central 288 | :: resolution report :: resolve 423ms :: artifacts dl 19ms 289 | :: modules in use: 290 | org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.1.0 from central in [default] 291 | org.reactivestreams#reactive-streams;1.0.3 from central in [default] 292 | org.slf4j#slf4j-api;1.7.30 from central in [default] 293 | software.amazon.awssdk#annotations;2.19.19 from central in [default] 294 | software.amazon.awssdk#bundle;2.19.19 from central in [default] 295 | software.amazon.awssdk#http-client-spi;2.19.19 from central in [default] 296 | software.amazon.awssdk#metrics-spi;2.19.19 from central in [default] 297 | software.amazon.awssdk#url-connection-client;2.19.19 from central in [default] 298 |
software.amazon.awssdk#utils;2.19.19 from central in [default] 299 | software.amazon.eventstream#eventstream;1.0.1 from central in [default] 300 | --------------------------------------------------------------------- 301 | | | modules || artifacts | 302 | | conf | number| search|dwnlded|evicted|| number|dwnlded| 303 | --------------------------------------------------------------------- 304 | | default | 10 | 0 | 0 | 0 || 10 | 0 | 305 | --------------------------------------------------------------------- 306 | :: retrieving :: org.apache.spark#spark-submit-parent-59d47579-1c2b-4e66-a92d-206be33d8afe 307 | confs: [default] 308 | 0 artifacts copied, 10 already retrieved (0kB/10ms) 309 | 23/01/25 19:48:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 310 | Setting default log level to "WARN". 311 | To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 312 | 23/01/25 19:48:28 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 313 | 23/01/25 19:48:28 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist 314 | 23/01/25 19:48:31 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 315 | 23/01/25 19:48:31 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore centos@127.0.1.1 316 | Spark master: local[*], Application Id: local-1674676103468 317 | spark-sql> 318 | 319 | ``` 320 | --- 321 | 322 | ##### Cursory Check: 323 | From our new spark-sql terminal session run the following command: 324 | 325 | ``` 326 | SHOW CURRENT NAMESPACE; 327 | ``` 328 | 329 | ##### Expected Output: 330 | 331 | ``` 332 | icecatalog 333 | Time taken: 2.692 seconds, Fetched 1 row(s) 334 | ``` 335 | --- 336 | --- 337 | 338 | ### Exercises: 339 | In this lab, we will create our first iceberg table with `Spark-SQL` 340 | 341 | --- 342 | --- 343 | 344 | #### Start the `Spark-SQL` cli tool 345 | * from the spark-sql console run the below commands: 346 | 347 | --- 348 | 349 | ##### Create Tables: 350 | * These will be run in the spark-sql cli 351 | ``` 352 | CREATE TABLE icecatalog.icecatalog.customer ( 353 | first_name STRING, 354 | last_name STRING, 355 | street_address STRING, 356 | city STRING, 357 | state STRING, 358 | zip_code STRING, 359 | home_phone STRING, 360 | mobile STRING, 361 | email STRING, 362 | ssn STRING, 363 | job_title STRING, 364 | create_date STRING, 365 | cust_id BIGINT) 366 | USING iceberg 367 | OPTIONS ( 368 | 'write.object-storage.enabled'=true, 369 | 'write.data.path'='s3://iceberg-data') 370 | PARTITIONED BY (state); 371 | 372 | CREATE TABLE icecatalog.icecatalog.transactions ( 373 | transact_id STRING, 374 | transaction_date STRING, 375 | item_desc STRING, 376 | barcode STRING, 377 | category STRING, 378 | amount STRING, 379 | cust_id BIGINT) 380 | USING iceberg 381 | OPTIONS ( 382 | 'write.object-storage.enabled'=true, 383 | 'write.data.path'='s3://iceberg-data'); 384 | ``` 385 | 386 | --- 387 | 388 | ### Examine the bucket in Minio from the GUI 389 | * It wrote out all the metadata and files into our object storage from the Apache Iceberg Catalog we created. 
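You can also verify this without the GUI. Here is a minimal `boto3` sketch (same assumptions as the earlier Minio example: `boto3` installed and the Minio keys exported as environment variables from `~/minio-output.properties`) that lists the objects Iceberg just wrote into the `iceberg-data` bucket; the screenshot below shows the same thing from the Minio console.

```
# sketch: list the objects Iceberg wrote into the warehouse bucket
# assumes boto3 is installed and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are
# exported from ~/minio-output.properties
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# you should see *.metadata.json objects for each table we just created
resp = s3.list_objects_v2(Bucket="iceberg-data")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```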
390 | 391 | --- 392 | ![](./images/bucket_first_table_metadata_view.png) 393 | --- 394 | 395 | #### Insert some records with our SparkSQL CLI: 396 | * In this step we will load up some JSON records from a file created during setup. 397 | * We will create a temporary view against this JSON file and then load the file with an INSERT statement. 398 | --- 399 | 400 | ##### Create temporary view statement: 401 | ``` 402 | CREATE TEMPORARY VIEW customerView 403 | USING org.apache.spark.sql.json 404 | OPTIONS ( 405 | path "/opt/spark/input/customers.json" 406 | ); 407 | ``` 408 | 409 | ##### Query our temporary view with this statement: 410 | ``` 411 | SELECT cust_id, first_name, last_name FROM customerView; 412 | ``` 413 | 414 | ##### Sample Output: 415 | ``` 416 | cust_id first_name last_name 417 | 10 Brenda Thompson 418 | 11 Jennifer Anderson 419 | 12 William Jefferson 420 | 13 Jack Romero 421 | 14 Robert Johnson 422 | Time taken: 0.173 seconds, Fetched 5 row(s) 423 | 424 | ``` 425 | 426 | --- 427 | ##### Query our customer table before we load data to it: 428 | ``` 429 | SELECT cust_id, first_name, last_name FROM icecatalog.icecatalog.customer; 430 | ``` 431 | 432 | ##### Sample Output: 433 | 434 | ``` 435 | cust_id first_name last_name 436 | Time taken: 0.111 seconds 437 | 438 | ``` 439 | 440 | ##### Load the existing Iceberg table (created earlier) with an `INSERT as SELECT` type of query: 441 | ``` 442 | INSERT INTO icecatalog.icecatalog.customer 443 | SELECT 444 | first_name, 445 | last_name, 446 | street_address, 447 | city, 448 | state, 449 | zip_code, 450 | home_phone, 451 | mobile, 452 | email, 453 | ssn, 454 | job_title, 455 | create_date, 456 | cust_id 457 | FROM customerView; 458 | 459 | ``` 460 | 461 | --- 462 | ##### Query our customer table after we have loaded this JSON file: 463 | ``` 464 | SELECT cust_id, first_name, last_name FROM icecatalog.icecatalog.customer; 465 | ``` 466 | 467 | ##### Sample Output: 468 | 469 | ``` 470 | cust_id first_name last_name 471 | 10 Brenda Thompson 472 | 11 Jennifer Anderson 473 | 13 Jack Romero 474 | 14 Robert Johnson 475 | 12 William Jefferson 476 | Time taken: 0.262 seconds, Fetched 5 row(s) 477 | 478 | 479 | ``` 480 | 481 | --- 482 | 483 | ### Now let's run a more advanced query: 484 | 485 | Let's add and update some rows in one step with an example `MERGE` statement. This will create a view on top of a JSON file and then run a query that updates existing rows when they match on the field `cust_id` and appends new rows to our `customer` table when they don't, all in the same query.
486 | 487 | --- 488 | ##### Create temporary view statement: 489 | ``` 490 | CREATE TEMPORARY VIEW mergeCustomerView 491 | USING org.apache.spark.sql.json 492 | OPTIONS ( 493 | path "/opt/spark/input/update_customers.json" 494 | ); 495 | ``` 496 | 497 | ##### Merge records from a json file: 498 | ``` 499 | MERGE INTO icecatalog.icecatalog.customer c 500 | USING (SELECT 501 | first_name, 502 | last_name, 503 | street_address, 504 | city, 505 | state, 506 | zip_code, 507 | home_phone, 508 | mobile, 509 | email, 510 | ssn, 511 | job_title, 512 | create_date, 513 | cust_id 514 | FROM mergeCustomerView) j 515 | ON c.cust_id = j.cust_id 516 | WHEN MATCHED THEN UPDATE SET 517 | c.first_name = j.first_name, 518 | c.last_name = j.last_name, 519 | c.street_address = j.street_address, 520 | c.city = j.city, 521 | c.state = j.state, 522 | c.zip_code = j.zip_code, 523 | c.home_phone = j.home_phone, 524 | c.mobile = j.mobile, 525 | c.email = j.email, 526 | c.ssn = j.ssn, 527 | c.job_title = j.job_title, 528 | c.create_date = j.create_date 529 | WHEN NOT MATCHED THEN INSERT *; 530 | ``` 531 | --- 532 | 533 | --- 534 | ##### Query our customer table after running our merge query: 535 | ``` 536 | SELECT cust_id, first_name, last_name FROM icecatalog.icecatalog.customer ORDER BY cust_id; 537 | ``` 538 | 539 | ##### Sample Output: 540 | 541 | ``` 542 | cust_id first_name last_name 543 | 10 Caitlyn Rogers 544 | 11 Brittany Williams 545 | 12 Victor Gordon 546 | 13 Shelby Martinez 547 | 14 Corey Bridges 548 | 15 Benjamin Rocha 549 | 16 Jonathan Lawrence 550 | 17 Thomas Taylor 551 | 18 Jeffrey Williamson 552 | 19 Joseph Mccullough 553 | 20 Evan Kirby 554 | 21 Teresa Pittman 555 | 22 Alicia Byrd 556 | 23 Kathleen Ellis 557 | 24 Tony Lee 558 | Time taken: 0.381 seconds, Fetched 15 row(s) 559 | 560 | ``` 561 | * Note that the values for customers with `cust_id` between 10-14 have new updated information. 562 | 563 | --- 564 | 565 | ### Explore Time Travel with Apache Iceberg: 566 | 567 | --- 568 | So far in our workshop we have loaded some tables and run some `CRUD` operations with our platform. In this exercise, we are going to see a really cool feature called `Time Travel`. 569 | 570 | Time travel queries refer to the ability to query data as it existed at a specific point in time in the past. This feature is useful in a variety of scenarios, such as auditing, disaster recovery, and debugging. 571 | 572 | In a database or data warehousing system with time travel capability, historical data is stored along with a timestamp, allowing users to query the data as it existed at a specific time. This is achieved by either using a separate historical store or by maintaining multiple versions of the data in the same store. 573 | 574 | Time travel queries are typically implemented using tools like snapshots, temporal tables, or versioned data stores. These tools allow users to roll back to a previous version of the data and access it as if it were the current version. Time travel queries can also be combined with other data management features, such as data compression, data partitioning, and indexing, to improve performance and make historical data more easily accessible. 575 | 576 | In order to run a time travel query we need some metadata to pass into our query. The metadata exists in our catalog and it can be accessed with a query. The following query will return some metadata from our database. 577 | 578 | * your results will be slightly different. 
579 | 580 | ##### Query from SparkSQL CLI: 581 | ``` 582 | SELECT 583 | committed_at, 584 | snapshot_id, 585 | parent_id 586 | FROM icecatalog.icecatalog.customer.snapshots 587 | ORDER BY committed_at; 588 | ``` 589 | --- 590 | 591 | #### Expected Output: 592 | 593 | ``` 594 | committed_at snapshot_id parent_id 595 | 2023-01-26 16:58:31.873 2216914164877191507 NULL 596 | 2023-01-26 17:00:18.585 3276759594719593733 2216914164877191507 597 | Time taken: 0.195 seconds, Fetched 2 row(s) 598 | ``` 599 | 600 | --- 601 | 602 | ### `Time Travel` example from data in our customer table: 603 | 604 | When we loaded our `customer` table initially it had only 5 rows of data. We then ran a `MERGE` query to update some existing rows and insert new rows. With this query, we can see our table results as they existed in that initial phase before the `MERGE`. 605 | 606 | We need to grab the `snapshot_id` value from the query above and edit the following query with your `snapshot_id` value. 607 | 608 | The query of the table after our first INSERT statement: 609 | * replace `<snapshot_id>` with your value: 610 | 611 | In this step, we will get results that show the data as it was originally loaded. 612 | ``` 613 | SELECT 614 | cust_id, 615 | first_name, 616 | last_name, 617 | create_date 618 | FROM icecatalog.icecatalog.customer 619 | VERSION AS OF <snapshot_id> 620 | ORDER BY cust_id; 621 | 622 | ``` 623 | 624 | ##### Expected Output: 625 | 626 | ``` 627 | 628 | cust_id first_name last_name create_date 629 | 10 Brenda Thompson 2022-12-25 01:10:43 630 | 11 Jennifer Anderson 2022-12-03 04:50:07 631 | 12 William Jefferson 2022-11-28 08:17:10 632 | 13 Jack Romero 2022-12-11 19:09:30 633 | 14 Robert Johnson 2022-12-08 05:28:56 634 | Time taken: 0.349 seconds, Fetched 5 row(s) 635 | 636 | ``` 637 | --- 638 | 639 | ### Example from data in our customer table after running our `MERGE` statement: 640 | 641 | In this step, we will see sample results from our customer table after we ran the `MERGE` step earlier. It will show the updated existing rows and our new rows. 642 | * remember to replace `<snapshot_id>` with the `snapshot_id` from your table metadata. 643 | 644 | ##### Query: 645 | ``` 646 | SELECT 647 | cust_id, 648 | first_name, 649 | last_name, 650 | create_date 651 | FROM icecatalog.icecatalog.customer 652 | 653 | VERSION AS OF <snapshot_id> 654 | ORDER BY cust_id; 655 | ``` 656 | 657 | ##### Expected Output: 658 | 659 | ``` 660 | cust_id first_name last_name create_date 661 | 10 Caitlyn Rogers 2022-12-16 03:19:35 662 | 11 Brittany Williams 2022-12-04 23:29:48 663 | 12 Victor Gordon 2022-12-22 18:03:13 664 | 13 Shelby Martinez 2022-11-27 16:10:42 665 | 14 Corey Bridges 2022-12-11 23:29:52 666 | 15 Benjamin Rocha 2022-12-10 07:39:35 667 | 16 Jonathan Lawrence 2022-11-27 23:44:14 668 | 17 Thomas Taylor 2022-12-07 12:33:45 669 | 18 Jeffrey Williamson 2022-12-13 16:58:43 670 | 19 Joseph Mccullough 2022-12-05 05:33:56 671 | 20 Evan Kirby 2022-12-20 14:23:43 672 | 21 Teresa Pittman 2022-12-26 05:14:24 673 | 22 Alicia Byrd 2022-12-17 18:20:51 674 | 23 Kathleen Ellis 2022-12-08 04:01:44 675 | 24 Tony Lee 2022-12-24 17:10:32 676 | Time taken: 0.432 seconds, Fetched 15 row(s) 677 | ``` 678 | 679 | --- 680 | 681 | ### Exit out of the `spark-sql` cli. 682 | 683 | ``` 684 | exit; 685 | 686 | ``` 687 | --- 688 | ### Explore Iceberg operations using Spark DataFrames. 689 | 690 | We will use `pyspark` in this example and load our `Transactions` table with a PySpark DataFrame.
691 | 692 | ##### Notes: 693 | * pyspark isn't as feature-rich as the spark-sql client (in future versions it should catch up). For example, it doesn't support the `MERGE` example we tested earlier. 694 | 695 | --- 696 | 697 | ### Start `pyspark` cli 698 | * run this in a terminal window 699 | ``` 700 | cd $SPARK_HOME 701 | pyspark 702 | ``` 703 | 704 | --- 705 | 706 | #### Expected Output: 707 | 708 | --- 709 | 710 | ``` 711 | Python 3.8.10 (default, Nov 14 2022, 12:59:47) 712 | [GCC 9.4.0] on linux 713 | Type "help", "copyright", "credits" or "license" for more information. 714 | 23/01/26 01:44:27 WARN Utils: Your hostname, spark-ice2 resolves to a loopback address: 127.0.1.1; using 192.168.1.167 instead (on interface eth0) 715 | 23/01/26 01:44:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 716 | Setting default log level to "WARN". 717 | To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 718 | 23/01/26 01:44:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 719 | Welcome to 720 | ____ __ 721 | / __/__ ___ _____/ /__ 722 | _\ \/ _ \/ _ `/ __/ '_/ 723 | /__ / .__/\_,_/_/ /_/\_\ version 3.3.1 724 | /_/ 725 | 726 | Using Python version 3.8.10 (default, Nov 14 2022 12:59:47) 727 | Spark context Web UI available at http://192.168.1.167:4040 728 | Spark context available as 'sc' (master = local[*], app id = local-1674697469102). 729 | SparkSession available as 'spark'. 730 | >>> 731 | 732 | ``` 733 | 734 | ### In this section we will load our `Transactions` data from a JSON file using `PySpark` 735 | 736 | * code blocks are commented: 737 | * copy and paste this block into our pyspark session in a terminal window: 738 | --- 739 | 740 | ``` 741 | # import SparkSession 742 | from pyspark.sql import SparkSession 743 | 744 | # create SparkSession 745 | spark = SparkSession.builder \ 746 | .appName("Python Spark SQL example") \ 747 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19") \ 748 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 749 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 750 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 751 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 752 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 753 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 754 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 755 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 756 | .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 757 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 758 | .config("spark.eventLog.enabled", "true") \ 759 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 760 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 761 | .config("spark.sql.catalogImplementation", "in-memory") \ 762 | .getOrCreate() 763 | 764 | # A JSON dataset is pointed to by 'path' variable 765 | path = "/opt/spark/input/transactions.json" 766 | 767 | # read json into the DataFrame 768 | transactionsDF = spark.read.json(path) 769 | 770 | # visualize
the inferred schema 771 | transactionsDF.printSchema() 772 | 773 | # print out the dataframe in this cli 774 | transactionsDF.show() 775 | 776 | # Append these transactions to the table we created in an earlier step `icecatalog.icecatalog.transactions` 777 | transactionsDF.writeTo("icecatalog.icecatalog.transactions").append() 778 | 779 | # stop the sparkSession 780 | spark.stop() 781 | 782 | # Exit out of the editor: 783 | quit(); 784 | 785 | ``` 786 | --- 787 | 788 | ##### Expected Output: 789 | 790 | --- 791 | 792 | ``` 793 | >>> # import SparkSession 794 | >>> from pyspark.sql import SparkSession 795 | >>> 796 | >>> # create SparkSession 797 | >>> spark = SparkSession.builder \ 798 | ... .appName("Python Spark SQL example") \ 799 | ... .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19") \ 800 | ... .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 801 | ... .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 802 | ... .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 803 | ... .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 804 | ... .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 805 | ... .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 806 | ... .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 807 | ... .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 808 | ... .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 809 | ... .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 810 | ... .config("spark.eventLog.enabled", "true") \ 811 | ... .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 812 | ... .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 813 | ... .config("spark.sql.catalogImplementation", "in-memory") \ 814 | ... .getOrCreate() 815 | 23/01/26 02:04:13 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect. 
816 | >>> 817 | >>> # A JSON dataset is pointed to by path 818 | >>> path = "/opt/spark/input/transactions.json" 819 | >>> 820 | >>> # read json into the DataFrame 821 | >>> transactionsDF = spark.read.json(path) 822 | >>> 823 | >>> # visualize the inferred schema 824 | >>> transactionsDF.printSchema() 825 | root 826 | |-- amount: double (nullable = true) 827 | |-- barcode: string (nullable = true) 828 | |-- category: string (nullable = true) 829 | |-- cust_id: long (nullable = true) 830 | |-- item_desc: string (nullable = true) 831 | |-- transact_id: string (nullable = true) 832 | |-- transaction_date: string (nullable = true) 833 | 834 | >>> 835 | >>> # print out the dataframe in this cli 836 | >>> transactionsDF.show() 837 | +------+-------------+--------+-------+--------------------+--------------------+-------------------+ 838 | |amount| barcode|category|cust_id| item_desc| transact_id| transaction_date| 839 | +------+-------------+--------+-------+--------------------+--------------------+-------------------+ 840 | | 50.63|4541397840276| purple| 10| Than explain cover.|586fef8b-00da-421...|2023-01-08 00:11:25| 841 | | 95.37|2308832642138| green| 10| Necessary body oil.|e8809684-7997-4cc...|2023-01-23 17:23:04| 842 | | 9.71|1644304420912| teal| 10|Recent property a...|18bb3472-56c0-48e...|2023-01-18 18:12:44| 843 | | 92.69|6996277154185| white| 10|Entire worry hosp...|a520859f-7cde-429...|2023-01-03 13:45:03| 844 | | 21.89|7318960584434| purple| 11|Finally kind coun...|3922d6a1-d112-411...|2022-12-29 09:00:26| 845 | | 24.97|4676656262244| olive| 11|Strong likely spe...|fe40fd4c-6111-49b...|2023-01-19 03:47:12| 846 | | 68.98|2299973443220| aqua| 14|Store blue confer...|331def13-f644-409...|2023-01-13 10:07:46| 847 | | 66.5|1115162814798| silver| 14|Court dog method ...|57cdb9b6-d370-4aa...|2022-12-29 06:04:30| 848 | | 26.96|5617858920203| gray| 14|Black director af...|9124d0ef-9374-441...|2023-01-11 19:20:39| 849 | | 11.24|1829792571456| yellow| 14|Lead today best p...|d418abe1-63dc-4ca...|2022-12-31 03:16:32| 850 | | 6.82|9406622469286| aqua| 15|Power itself job ...|422a413a-590b-4f7...|2023-01-09 19:09:29| 851 | | 89.39|7753423715275| black| 15|Material risk first.|bc4125fc-08cb-4ab...|2023-01-23 03:24:02| 852 | | 63.49|2242895060556| black| 15|Foreign strong wa...|ff4e4369-bcef-438...|2022-12-29 22:12:09| 853 | | 49.7|3010754625845| black| 15| Own book move for.|d00a9e7a-0cea-428...|2023-01-12 21:42:32| 854 | | 10.45|7885711282777| green| 15|Without beat then...|33afa171-a652-429...|2023-01-05 04:33:24| 855 | | 34.12|8802078025372| aqua| 16| Site win movie.|cfba6338-f816-4b7...|2023-01-07 12:22:34| 856 | | 96.14|9389514040254| olive| 16|Agree enjoy four ...|5223b620-5eef-4fa...|2022-12-28 17:06:04| 857 | | 3.38|6079280166809| blue| 16|Concern his debat...|33725df2-e14b-45a...|2023-01-17 20:53:25| 858 | | 2.67|5723406697760| yellow| 16|Republican sure r...|6a707466-7b43-4af...|2023-01-02 15:40:17| 859 | | 68.85|0555188918000| black| 16|Sense recently th...|5a31670b-9b68-43f...|2023-01-12 03:21:06| 860 | +------+-------------+--------+-------+--------------------+--------------------+-------------------+ 861 | only showing top 20 rows 862 | 863 | >>> transactionsDF.writeTo("icecatalog.icecatalog.transactions").append() 864 | SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 865 | SLF4J: Defaulting to no-operation (NOP) logger implementation 866 | SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
867 | >>> spark.stop() 868 | >>> quit(); 869 | 870 | ``` 871 | 872 | --- 873 | 874 | ### Explore our `Transactions` tables within SparkSQL 875 | 876 | Let's open our spark-sql cli again (follow the same steps as above) and run the following query to join our 2 tables and view some sample data. 877 | 878 | * Run these commands in a new spark-sql session in your terminal. 879 | --- 880 | 881 | ##### Query: 882 | ``` 883 | SELECT 884 | c.cust_id 885 | , c.first_name 886 | , c.last_name 887 | , t.transact_id 888 | , t.item_desc 889 | , t.amount 890 | FROM 891 | icecatalog.icecatalog.customer c 892 | , icecatalog.icecatalog.transactions t 893 | INNER JOIN icecatalog.icecatalog.customer cj ON c.cust_id = t.cust_id 894 | LIMIT 20; 895 | ``` 896 | 897 | --- 898 | 899 | ##### Expected Output: 900 | 901 | ``` 902 | cust_id first_name last_name transact_id item_desc amount 903 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 904 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 905 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 906 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 907 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 908 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 909 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 910 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 911 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 912 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 913 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 914 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 915 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 916 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 917 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 918 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 919 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 920 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 921 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 922 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 923 | 924 | ``` 925 | 926 | --- 927 | --- 928 | 929 | ## Summary: 930 | 931 | --- 932 | Using Apache Spark with Apache Iceberg can provide many benefits for big data processing and data lake management. Apache Spark is a fast and flexible data processing engine that can be used for a variety of big data use cases, such as batch processing, streaming, and machine learning. By integrating with Apache Iceberg, Spark can leverage Iceberg's table abstraction, versioning capabilities, and data discovery features to manage large-scale data lakes with increased efficiency, reliability, and scalability. 933 | 934 | Using Apache Spark with Apache Iceberg allows organizations to leverage the benefits of Spark's distributed processing capabilities, while at the same time reducing the complexity of managing large-scale data lakes.
Additionally, the integration of Spark and Iceberg provides the ability to perform complex data processing operations while still providing data management capabilities such as schema evolution, versioning, and data discovery. 935 | 936 | Finally, as both Spark and Iceberg are open-source projects, organizations can benefit from a large and active community of developers who are contributing to the development of these technologies. This makes it easier for organizations to adopt and use these tools, and to quickly resolve any issues that may arise. 937 | 938 | --- 939 | #### Final Thoughts: 940 | --- 941 | 942 | In a series of upcoming workshops, I will build out and document some new technologies that can be integrated with legacy solutions deployed by most organizations today. It will give you a roadmap into how you can gain insights (in near real-time) from data produced in your legacy systems with minimal impact on those servers. We will use a Change Data Capture (CDC) approach to pull the data from log files produced by the database providers and deliver it to our Apache Iceberg solution we built today. 943 | 944 | --- 945 | --- 946 | 947 | If you have made it this far, I want to thank you for spending your time reviewing the materials. Please give me a 'Star' at the top of this page if you found it useful. 948 | 949 | Click here to return to workshop 2: [`Workshop 2 Exercises`](./README.md/#what-is-redpanda). 950 | 951 | --- 952 | --- 953 | --- 954 | --- 955 | 956 | ![](./images/drunk-cheers.gif) 957 | 958 | [Tim Lepple](www.linkedin.com/in/tim-lepple-9141452) 959 | 960 | --- 961 | --- 962 | --- 963 | --- 964 | 965 | 966 | --------------------------------------------------------------------------------