├── .archive ├── .touch ├── bkp.setup_datagen.sh └── setup_data_origination_apps.sh ├── README.md ├── datagen ├── comsume_topic_dgCustomer.py ├── consume_panda_2_iceberg_customer.py ├── consume_stream_customer_2_console.py ├── consume_stream_txn_2_console.py ├── datagenerator.py ├── parameter_get_schema.py ├── pg_upsert_dg.py ├── redpanda_dg.py ├── spark_from_dbz_customer_2_iceberg.py └── test_pg.py ├── db_ddl ├── create_ddl_icecatalog.sql ├── create_user_datagen.sql ├── customer_ddl.sql ├── customer_function_ddl.sql ├── grants4dbz.sql └── hive_metastore_ddl.sql ├── dbz_server ├── .touch └── application.properties ├── downloads └── .touch ├── explore_postgresql.md ├── get_files.sh ├── hive_metastore └── hive-site.xml ├── images ├── .placeholder ├── Iceberg.gif ├── access_keys_view.png ├── adminer_login_screen.png ├── adminer_login_screen_icecatalog.png ├── bucket_first_table_metadata_view.png ├── connect_ouput_detail_msg.png ├── connect_output_summary_msg.png ├── console_view_run_connect.png ├── detail_view_of_cust_msg.png ├── drunk-cheers.gif ├── first_login.png ├── initial_bucket_view.png ├── minio_login_screen.png ├── panda_topic_view_connect_topic.png ├── panda_view__dg_load_topics.png ├── panda_view_topics.png ├── spark_master_view.png └── topic_customer_view.png ├── kafka_connect ├── connect.properties └── pg-source-connector.properties ├── prework.md ├── redpanda └── redpanda.yaml ├── sample_output ├── .touch └── connect.output ├── sample_spark_jobs.md ├── setup_datagen.sh ├── spark_items ├── all_workshop1_items.sql ├── conf.properties ├── ice_spark-sql_i-cli.sh ├── iceberg_workshop_sql_items.sh ├── iceberg_workshop_tbl_ddl.sql ├── load_ice_customer_batch.sql ├── load_ice_transactions_pyspark.py ├── merge_ice_customer_batch.sql ├── stream_customer_ddl.sql ├── stream_customer_ddl_script.sh ├── stream_customer_event_history_ddl.sql └── stream_customer_event_history_ddl_script.sh ├── stop_start_services.sh ├── tick2705-1.png ├── tick2705-2.png ├── utils.sh └── workshop1_revisit.md /.archive/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /.archive/bkp.setup_datagen.sh: -------------------------------------------------------------------------------- 1 | 2 | #!/bin/bash 3 | 4 | 5 | ########################################################################################## 6 | # install some OS utilities 7 | ######################################################################################### 8 | sudo apt-get install wget curl apt-transport-https unzip chrony -y 9 | sudo apt-get install -y figlet cowsay 10 | sudo apt-get update 11 | 12 | ########################################################################################## 13 | # download and install community edition of redpanda 14 | ########################################################################################## 15 | 16 | echo 17 | echo "---------------------------------------------------------------------" 18 | echo "starting redpanda install ..." 
19 | echo "---------------------------------------------------------------------" 20 | echo 21 | 22 | ## Run the setup script to download and install the repo 23 | curl -1sLf 'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' | sudo -E bash 24 | 25 | sudo apt-get update 26 | sudo apt install redpanda -y 27 | 28 | ########################################################################################## 29 | # download and install 'rpk' - cli tools for working with red panda 30 | ######################################################################################### 31 | curl -LO https://github.com/redpanda-data/redpanda/releases/latest/download/rpk-linux-amd64.zip 32 | 33 | ########################################################################################## 34 | # create a few directories 35 | ########################################################################################## 36 | mkdir -p ~/.local/bin 37 | mkdir -p ~/datagen 38 | 39 | ######################################################################################### 40 | # add items to path for future use 41 | ######################################################################################### 42 | export PATH="~/.local/bin:$PATH" 43 | export REDPANDA_HOME=~/.local/bin 44 | 45 | ########################################################################################## 46 | # add to path perm https://help.ubuntu.com/community/EnvironmentVariables 47 | ########################################################################################## 48 | echo "" >> ~/.profile 49 | echo "# set path variables here:" >> ~/.profile 50 | echo "export REDPANDA_HOME=~/.local/bin" >> ~/.profile 51 | echo "PATH=$PATH:$REDPANDA_HOME" >> ~/.profile 52 | 53 | ########################################################################################## 54 | # unzip rpk to --> ~/.local/bin 55 | ########################################################################################## 56 | unzip rpk-linux-amd64.zip -d ~/.local/bin/ 57 | 58 | ########################################################################################## 59 | # Install the red panda console package 60 | ########################################################################################## 61 | curl -1sLf \ 62 | 'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' \ 63 | | sudo -E bash 64 | 65 | sudo apt-get install redpanda-console -y 66 | 67 | ########################################################################################## 68 | # install pip for python3 69 | ########################################################################################## 70 | sudo apt install python3-pip -y 71 | 72 | ########################################################################################## 73 | # install jq 74 | ########################################################################################## 75 | sudo apt install -y jq 76 | 77 | ########################################################################################## 78 | # create the redpanda conig.yaml # needed to change default console port to 8888 to avoid conflict with debezium server 79 | ########################################################################################## 80 | cat < ~/redpanda-console-config.yaml 81 | kafka: 82 | brokers: ":9092" 83 | schemaRegistry: 84 | enabled: true 85 | urls: ["http://:8081"] 86 | connect: 87 | enabled: true 88 | clusters: 89 | - name: postgres-dbz-connector 90 | url: http://:8083 91 | server: 92 | 
listenPort: 8888 93 | EOF 94 | 95 | sudo cp ~/data_origination_workshop/redpanda/redpanda.yaml /etc/redpanda/ 96 | 97 | 98 | ########################################################################################## 99 | # Need to update the value of '' in a bunch of files 100 | ########################################################################################## 101 | PRIVATE_IP=`ip -o route get to 8.8.8.8 | sed -n 's/.*src \([0-9.]\+\).*/\1/p'` 102 | sudo sed -e "s,,$PRIVATE_IP,g" -i ~/redpanda-console-config.yaml 103 | sudo sed -e "s,,$PRIVATE_IP,g" -i /etc/redpanda/redpanda.yaml 104 | 105 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/comsume_topic_dgCustomer.py 106 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/consume_panda_2_iceberg_customer.py 107 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/consume_stream_customer_2_console.py 108 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/consume_stream_txn_2_console.py 109 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/pg_upsert_dg.py 110 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/redpanda_dg.py 111 | sed -e "s,,$PRIVATE_IP,g" -i ~/data_origination_workshop/datagen/spark_from_dbz_customer_2_iceberg.py 112 | 113 | ########################################################################################## 114 | # move this file to proper directory 115 | ########################################################################################## 116 | sudo mv ~/redpanda-console-config.yaml /etc/redpanda/redpanda-console-config.yaml 117 | sudo chown redpanda:redpanda -R /etc/redpanda 118 | 119 | ########################################################################################## 120 | # start redpanda & the console: 121 | ########################################################################################## 122 | sudo systemctl start redpanda 123 | sudo systemctl start redpanda-console 124 | 125 | echo 126 | echo "---------------------------------------------------------------------" 127 | echo "redpanda setup completed..." 128 | echo "---------------------------------------------------------------------" 129 | echo 130 | ########################################################################################## 131 | # install a specific version of postgresql (version 14) 132 | ########################################################################################## 133 | echo 134 | echo "---------------------------------------------------------------------" 135 | echo "installing postgresql..." 
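# Note (illustrative): the sed substitutions earlier in this script swap a host-IP
# placeholder token -- assumed here to be "<private_ip>" -- for the address detected
# in $PRIVATE_IP, along the lines of:
#   sed -e "s,<private_ip>,$PRIVATE_IP,g" -i /etc/redpanda/redpanda.yaml
# A quick way to confirm the broker and console services came up:
#   systemctl is-active redpanda redpanda-console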
136 | echo "---------------------------------------------------------------------" 137 | echo 138 | 139 | apt policy postgresql 140 | 141 | ########################################################################################## 142 | # install the pgp key for this version of postgresql: 143 | ########################################################################################## 144 | curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc|sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg 145 | 146 | sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' 147 | 148 | sudo apt update 149 | 150 | sudo apt install postgresql-14 -y 151 | 152 | sudo systemctl enable postgresql 153 | 154 | ########################################################################################## 155 | # backup the original postgresql conf file 156 | ########################################################################################## 157 | sudo cp /etc/postgresql/14/main/postgresql.conf /etc/postgresql/14/main/postgresql.conf.orig 158 | 159 | ########################################################################################## 160 | # setup the database to allow listeners from any host 161 | ########################################################################################## 162 | sudo sed -e 's,#listen_addresses = \x27localhost\x27,listen_addresses = \x27*\x27,g' -i /etc/postgresql/14/main/postgresql.conf 163 | 164 | ########################################################################################## 165 | # increase number of connections allowed in the database 166 | ########################################################################################## 167 | sudo sed -e 's,max_connections = 100,max_connections = 300,g' -i /etc/postgresql/14/main/postgresql.conf 168 | 169 | ########################################################################################## 170 | # need to setup postgres WAL to allow debezium to read from the logs 171 | ########################################################################################## 172 | sudo sed -e 's,#listen_addresses = 'localhost',listen_addresses = '*',g' -i /etc/postgresql/14/main/postgresql.conf 173 | sudo sed -e 's,#wal_level = replica,wal_level = logical,g' -i /etc/postgresql/14/main/postgresql.conf 174 | sudo sed -e 's,#max_wal_senders = 10,max_wal_senders = 4,g' -i /etc/postgresql/14/main/postgresql.conf 175 | sudo sed -e 's,#max_replication_slots = 10,max_replication_slots = 4,g' -i /etc/postgresql/14/main/postgresql.conf 176 | 177 | ########################################################################################## 178 | # create a new 'pg_hba.conf' file 179 | ########################################################################################## 180 | # backup the orig 181 | sudo mv /etc/postgresql/14/main/pg_hba.conf /etc/postgresql/14/main/pg_hba.conf.orig 182 | 183 | cat < pg_hba.conf 184 | # TYPE DATABASE USER ADDRESS METHOD 185 | local all all peer 186 | host datagen datagen 0.0.0.0/0 md5 187 | host icecatalog icecatalog 0.0.0.0/0 md5 188 | EOF 189 | 190 | ########################################################################################## 191 | # set owner and permissions of this conf file 192 | ########################################################################################## 193 | sudo mv pg_hba.conf /etc/postgresql/14/main/pg_hba.conf 194 | sudo chown postgres:postgres 
/etc/postgresql/14/main/pg_hba.conf 195 | sudo chmod 600 /etc/postgresql/14/main/pg_hba.conf 196 | 197 | ########################################################################################## 198 | # restart postgresql 199 | ########################################################################################## 200 | sudo systemctl restart postgresql 201 | 202 | ########################################################################################## 203 | # install Java 11 204 | ########################################################################################## 205 | sudo apt install openjdk-11-jdk -y 206 | 207 | ########################################################################################## 208 | ## Run the sql file to create the schema for all DB’s 209 | ########################################################################################## 210 | sudo -u postgres psql < ~/data_origination_workshop/db_ddl/create_user_datagen.sql 211 | sudo -u datagen psql < ~/data_origination_workshop/db_ddl/customer_ddl.sql 212 | sudo -u datagen psql < ~/data_origination_workshop/db_ddl/customer_function_ddl.sql 213 | sudo -u datagen psql < ~/data_origination_workshop/db_ddl/grants4dbz.sql 214 | sudo -u postgres psql < ~/data_origination_workshop/db_ddl/create_ddl_icecatalog.sql 215 | #sudo -u postgres psql < ~/data_origination_workshop/db_ddl/hive_metastore_ddl.sql 216 | 217 | echo 218 | echo "---------------------------------------------------------------------" 219 | echo "postgresql install completed..." 220 | echo "---------------------------------------------------------------------" 221 | echo 222 | ########################################################################################## 223 | # 224 | ########################################################################################## 225 | 226 | echo 227 | echo "---------------------------------------------------------------------" 228 | echo "setup data generator items..." 229 | echo "---------------------------------------------------------------------" 230 | echo 231 | 232 | 233 | ########################################################################################## 234 | # copy these files to the os user 'datagen' and set owner and permissions 235 | ########################################################################################## 236 | #sudo mv ~/data_origination_workshop/datagen/* /home/datagen/datagen/ 237 | #sudo chown datagen:datagen -R /home/datagen/ 238 | 239 | mv ~/data_origination_workshop/datagen/* /home/datagen/datagen/ 240 | chown datagen:datagen -R /home/datagen/ 241 | 242 | ########################################################################################## 243 | # pip install some items 244 | ########################################################################################## 245 | sudo pip install kafka-python uuid simplejson faker psycopg2-binary 246 | 247 | echo 248 | echo "---------------------------------------------------------------------" 249 | echo "data generator setup completed..." 250 | echo "---------------------------------------------------------------------" 251 | echo 252 | 253 | ########################################################################################## 254 | # kafka connect downloads 255 | ########################################################################################## 256 | echo 257 | echo "---------------------------------------------------------------------" 258 | echo "installing kafka-connect..." 
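# (Optional, illustrative) confirm the replication settings Debezium depends on
# survived the postgresql restart before wiring up Kafka Connect:
#   sudo -u postgres psql -c "SHOW wal_level;"              # expect: logical
#   sudo -u postgres psql -c "SHOW max_replication_slots;"  # expect: 4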
259 | echo "---------------------------------------------------------------------" 260 | echo 261 | # create some directories 262 | mkdir -p ~/kafka_connect/configuration 263 | mkdir -p ~/kafka_connect/plugins 264 | 265 | # get the public key 266 | sudo wget https://dlcdn.apache.org/kafka/KEYS 267 | 268 | # get the file: 269 | wget https://dlcdn.apache.org/kafka/3.3.2/kafka_2.13-3.3.2.tgz -P ~/kafka_connect 270 | 271 | #untar the file: 272 | tar -xzf ~/kafka_connect/kafka_2.13-3.3.2.tgz --directory ~/kafka_connect/ 273 | 274 | # remove the tar file: 275 | rm ~/kafka_connect/kafka_2.13-3.3.2.tgz 276 | 277 | # copy the properties files: 278 | cp ~/data_origination_workshop/kafka_connect/*.properties ~/kafka_connect/configuration/ 279 | 280 | # update the private IP address in this config file: 281 | 282 | #sudo sed -e "s,,$PRIVATE_IP,g" -i ~/kafka_connect/configuration/connect.properties 283 | sed -e "s,,$PRIVATE_IP,g" -i ~/kafka_connect/configuration/connect.properties 284 | 285 | ########################################################################################## 286 | # debezium download 287 | ########################################################################################## 288 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.1.1.Final/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz -P ~/kafka_connect 289 | 290 | # untar this file: 291 | tar -xzf ~/kafka_connect/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz --directory ~/kafka_connect/plugins/ 292 | 293 | # remove tar file 294 | rm ~/kafka_connect/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz 295 | ########################################################################################## 296 | # postgresql jdbc download 297 | ########################################################################################## 298 | wget https://jdbc.postgresql.org/download/postgresql-42.5.1.jar -P ~/kafka_connect/plugins/debezium-connector-postgres/ 299 | 300 | ########################################################################################## 301 | # copy jars to the kafka libs folder 302 | ########################################################################################## 303 | cp ~/kafka_connect/plugins/debezium-connector-postgres/*.jar ~/kafka_connect/kafka_2.13-3.3.2/libs/ 304 | 305 | 306 | 307 | echo 308 | echo "---------------------------------------------------------------------" 309 | echo "kafka-connect setup completed..." 310 | echo "---------------------------------------------------------------------" 311 | echo 312 | ########################################################################################## 313 | # Items below this line are from the iceberg workshop & tweaked to run here: 314 | ########################################################################################## 315 | echo 316 | echo "---------------------------------------------------------------------" 317 | echo "install apache iceberg & spark stand alone..." 
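# For reference (illustrative, using the paths staged above): a standalone Kafka
# Connect worker can later be launched with the Debezium source connector like so:
#   ~/kafka_connect/kafka_2.13-3.3.2/bin/connect-standalone.sh \
#       ~/kafka_connect/configuration/connect.properties \
#       ~/kafka_connect/configuration/pg-source-connector.properties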
318 | echo "---------------------------------------------------------------------" 319 | echo 320 | ########################################################################################## 321 | # Install maven 322 | ########################################################################################## 323 | sudo apt install maven -y 324 | 325 | ########################################################################################## 326 | # create a directory for spark events, logs and some json files to be used 327 | ########################################################################################## 328 | mkdir -p /opt/spark/logs 329 | mkdir -p /opt/spark/spark-events 330 | mkdir -p /opt/spark/input 331 | mkdir -p /opt/spark/checkpoint 332 | 333 | ########################################################################################## 334 | # download apache spark standalone 335 | ########################################################################################## 336 | wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz 337 | 338 | tar -xzvf spark-3.3.1-bin-hadoop3.tgz 339 | 340 | sudo mv spark-3.3.1-bin-hadoop3/ /opt/spark 341 | #mv spark-3.3.1-bin-hadoop3/ /opt/spark 342 | 343 | ########################################################################################## 344 | # install aws cli 345 | ########################################################################################## 346 | sudo apt install awscli -y 347 | 348 | ########################################################################################## 349 | # install mlocate 350 | ########################################################################################## 351 | sudo apt install -y mlocate 352 | 353 | ########################################################################################## 354 | # download the jdbc jar file for postgres: 355 | ########################################################################################## 356 | wget https://jdbc.postgresql.org/download/postgresql-42.5.1.jar 357 | 358 | #sudo mv postgresql-42.5.1.jar /opt/spark/jars/ 359 | mv postgresql-42.5.1.jar /opt/spark/jars/ 360 | 361 | 362 | ########################################################################################## 363 | # download some aws jars: 364 | ########################################################################################## 365 | wget https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.19.19/bundle-2.19.19.jar 366 | 367 | #sudo mv bundle-2.19.19.jar /opt/spark/jars/ 368 | mv bundle-2.19.19.jar /opt/spark/jars/ 369 | 370 | 371 | wget https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.19.19/url-connection-client-2.19.19.jar 372 | mv url-connection-client-2.19.19.jar /opt/spark/jars/ 373 | 374 | ########################################################################################## 375 | # download iceberg spark runtime 376 | ########################################################################################## 377 | wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar 378 | wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.3.1/spark-sql-kafka-0-10_2.12-3.3.1.jar 379 | wget https://repo.mavenlibs.com/maven/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.3.1/spark-token-provider-kafka-0-10_2.12-3.3.1.jar 380 | wget 
https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.3.1/kafka-clients-3.3.1.jar 381 | wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar 382 | 383 | mv ~/iceberg-spark-runtime-3.3_2.12-1.1.0.jar /opt/spark/jars/ 384 | mv ~/spark-sql-kafka-0-10_2.12-3.3.1.jar /opt/spark/jars/ 385 | mv ~/spark-token-provider-kafka-0-10_2.12-3.3.1.jar /opt/spark/jars/ 386 | mv ~/kafka-clients-3.3.1.jar /opt/spark/jars/ 387 | mv ~/commons-pool2-2.11.1.jar /opt/spark/jars/ 388 | 389 | echo 390 | echo "---------------------------------------------------------------------" 391 | echo "iceberg & spark items completed..." 392 | echo "---------------------------------------------------------------------" 393 | echo 394 | ########################################################################################## 395 | # download minio debian package 396 | ########################################################################################## 397 | 398 | echo 399 | echo "---------------------------------------------------------------------" 400 | echo "install minio..." 401 | echo "---------------------------------------------------------------------" 402 | echo 403 | wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230112020616.0.0_amd64.deb -O minio.deb 404 | 405 | ########################################################################################## 406 | # install minio 407 | ########################################################################################## 408 | sudo dpkg -i minio.deb 409 | 410 | ########################################################################################## 411 | # create directory for minio data to be stored 412 | ########################################################################################## 413 | sudo mkdir -p /opt/app/minio/data 414 | 415 | sudo groupadd -r minio-user 416 | sudo useradd -M -r -g minio-user minio-user 417 | 418 | ########################################################################################## 419 | # grant permission to this directory to minio-user 420 | ########################################################################################## 421 | 422 | sudo chown -R minio-user:minio-user /opt/app/minio/ 423 | 424 | ########################################################################################## 425 | # create an enviroment variable file for minio 426 | ########################################################################################## 427 | 428 | cat < ~/minio.properties 429 | # MINIO_ROOT_USER and MINIO_ROOT_PASSWORD sets the root account for the MinIO server. 430 | # This user has unrestricted permissions to perform S3 and administrative API operations on any resource in the deployment. 431 | # Omit to use the default values 'minioadmin:minioadmin'. 432 | # MinIO recommends setting non-default values as a best practice, regardless of environment 433 | #MINIO_ROOT_USER=myminioadmin 434 | #MINIO_ROOT_PASSWORD=minio-secret-key-change-me 435 | MINIO_ROOT_USER=minioroot 436 | MINIO_ROOT_PASSWORD=supersecret1 437 | # MINIO_VOLUMES sets the storage volume or path to use for the MinIO server. 
438 | #MINIO_VOLUMES="/mnt/data" 439 | MINIO_VOLUMES="/opt/app/minio/data" 440 | # MINIO_SERVER_URL sets the hostname of the local machine for use with the MinIO Server 441 | # MinIO assumes your network control plane can correctly resolve this hostname to the local machine 442 | # Uncomment the following line and replace the value with the correct hostname for the local machine. 443 | #MINIO_SERVER_URL="http://minio.example.net" 444 | EOF 445 | 446 | ########################################################################################## 447 | # move this file to proper directory 448 | ########################################################################################## 449 | sudo mv ~/minio.properties /etc/default/minio 450 | 451 | sudo chown root:root /etc/default/minio 452 | 453 | 454 | ########################################################################################## 455 | # start the minio server: 456 | ########################################################################################## 457 | sudo systemctl start minio.service 458 | 459 | ########################################################################################## 460 | # install the 'MinIO Client' on this server 461 | ########################################################################################## 462 | curl https://dl.min.io/client/mc/release/linux-amd64/mc \ 463 | --create-dirs \ 464 | -o $HOME/minio-binaries/mc 465 | 466 | chmod +x $HOME/minio-binaries/mc 467 | export PATH=$PATH:$HOME/minio-binaries/ 468 | 469 | 470 | ########################################################################################## 471 | # create an alias on this host for the minio cli (using the minio root credentials) 472 | ########################################################################################## 473 | mc alias set local http://127.0.0.1:9000 minioroot supersecret1 474 | 475 | ########################################################################################## 476 | # lets create a user for iceberg metadata & tables using the minio cli and the alias we just set 477 | ########################################################################################## 478 | mc admin user add local icebergadmin supersecret1! 479 | 480 | ########################################################################################## 481 | # need to add the 'readwrite' minio policy to this new user: (these are just like aws policies) 482 | ########################################################################################## 483 | mc admin policy set local readwrite user=icebergadmin 484 | 485 | ########################################################################################## 486 | # create a new alias for this admin user: 487 | ########################################################################################## 488 | mc alias set icebergadmin http://127.0.0.1:9000 icebergadmin supersecret1! 
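# (Optional, illustrative) verify the new MinIO user, its readwrite policy, and the
# alias before creating access keys and buckets below:
#   mc admin user info local icebergadmin
#   mc alias list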
489 | 490 | ########################################################################################## 491 | # create new 'Access Keys' for this user and redirect output to a file for automation later 492 | ########################################################################################## 493 | mc admin user svcacct add icebergadmin icebergadmin >> ~/minio-output.properties 494 | 495 | ########################################################################################## 496 | # create a bucket as user icebergadmin for our iceberg data 497 | ########################################################################################## 498 | mc mb icebergadmin/iceberg-data icebergadmin 499 | 500 | ########################################################################################## 501 | # let's reformat the output of access keys from an earlier step 502 | ########################################################################################## 503 | sed -i "s/Access Key: /access_key=/g" ~/minio-output.properties 504 | sed -i "s/Secret Key: /secret_key=/g" ~/minio-output.properties 505 | 506 | ########################################################################################## 507 | # let's read the update file into memory to use these values to set aws configure 508 | ########################################################################################## 509 | . ~/minio-output.properties 510 | 511 | echo 512 | echo "---------------------------------------------------------------------" 513 | echo "minio install completed..." 514 | echo "---------------------------------------------------------------------" 515 | echo 516 | ########################################################################################## 517 | # let's set up aws configure files from code (this is using the minio credentials) - The default region doesn't get used in minio 518 | ########################################################################################## 519 | aws configure set aws_access_key_id $access_key 520 | aws configure set aws_secret_access_key $secret_key 521 | aws configure set default.region us-east-1 522 | 523 | ########################################################################################## 524 | # let's test that the aws cli can list our buckets in minio: 525 | ########################################################################################## 526 | aws --endpoint-url http://127.0.0.1:9000 s3 ls 527 | 528 | echo 529 | 530 | 531 | ########################################################################################## 532 | # Create a json records file of sample customer data to be used in a lab 533 | ########################################################################################## 534 | 535 | cat < /opt/spark/input/customers.json 536 | {"last_name": "Thompson", "first_name": "Brenda", "street_address": "321 Nicole Ports Suite 204", "city": "South Lisachester", "state": "AS", "zip_code": "89409", "email": "wmoran@example.net", "home_phone": "486.884.6221x4431", "mobile": "(290)274-1564", "ssn": "483-79-5404", "job_title": "Housing manager/officer", "create_date": "2022-12-25 01:10:43", "cust_id": 10} 537 | {"last_name": "Anderson", "first_name": "Jennifer", "street_address": "1392 Cervantes Isle", "city": "Adrianaton", "state": "IN", "zip_code": "15867", "email": "michaeltodd@example.com", "home_phone": "939-630-6773", "mobile": "904.337.2023x17453", "ssn": "583-07-6994", "job_title": "Clinical embryologist", "create_date": "2022-12-03 
04:50:07", "cust_id": 11} 538 | {"last_name": "Jefferson", "first_name": "William", "street_address": "543 Matthew Courts", "city": "South Nicholaston", "state": "WA", "zip_code": "17687", "email": "peterhouse@example.net", "home_phone": "+1-599-587-9051x2899", "mobile": "(915)689-1450", "ssn": "792-52-6700", "job_title": "Land", "create_date": "2022-11-28 08:17:10", "cust_id": 12} 539 | {"last_name": "Romero", "first_name": "Jack", "street_address": "5929 Karen Ridges", "city": "Lake Richardburgh", "state": "OR", "zip_code": "78947", "email": "michellemitchell@example.net", "home_phone": "(402)664-1399x71255", "mobile": "450.580.6817x043", "ssn": "216-24-7271", "job_title": "Engineer, building services", "create_date": "2022-12-11 19:09:30", "cust_id": 13} 540 | {"last_name": "Johnson", "first_name": "Robert", "street_address": "4313 Adams Islands", "city": "Tammybury", "state": "UT", "zip_code": "07361", "email": "morrischristopher@example.com", "home_phone": "(477)888-9999", "mobile": "220-403-9274x9709", "ssn": "012-26-8650", "job_title": "Rural practice surveyor", "create_date": "2022-12-08 05:28:56", "cust_id": 14} 541 | EOF 542 | 543 | ########################################################################################## 544 | # Create another json records file to test out a Merge Query in a lab 545 | ########################################################################################## 546 | 547 | cat < /opt/spark/input/update_customers.json 548 | {"last_name": "Rogers", "first_name": "Caitlyn", "street_address": "37761 Robert Center Apt. 743", "city": "Port Matthew", "state": "MS", "zip_code": "70534", "email": "pamelacooper@example.net", "home_phone": "726-856-7295x731", "mobile": "+1-423-331-9415x66671", "ssn": "718-18-3807", "job_title": "Merchandiser, retail", "create_date": "2022-12-16 03:19:35", "cust_id": 10} 549 | {"last_name": "Williams", "first_name": "Brittany", "street_address": "820 Lopez Vista", "city": "Jordanland", "state": "NM", "zip_code": "02887", "email": "stephendawson@example.org", "home_phone": "(149)065-2341x761", "mobile": "(353)203-7938x325", "ssn": "304-90-3213", "job_title": "English as a second language teacher", "create_date": "2022-12-04 23:29:48", "cust_id": 11} 550 | {"last_name": "Gordon", "first_name": "Victor", "street_address": "01584 Hernandez Ramp Suite 822", "city": "Smithmouth", "state": "VI", "zip_code": "88806", "email": "holly51@example.com", "home_phone": "707-269-9666x8446", "mobile": "+1-868-584-1822", "ssn": "009-27-3700", "job_title": "Ergonomist", "create_date": "2022-12-22 18:03:13", "cust_id": 12} 551 | {"last_name": "Martinez", "first_name": "Shelby", "street_address": "715 Benitez Plaza", "city": "Patriciaside", "state": "MT", "zip_code": "70724", "email": "tiffanysmith@example.com", "home_phone": "854.472.8345", "mobile": "+1-187-913-4579x115", "ssn": "306-94-1636", "job_title": "Private music teacher", "create_date": "2022-11-27 16:10:42", "cust_id": 13} 552 | {"last_name": "Bridges", "first_name": "Corey", "street_address": "822 Kaitlyn Haven Apt. 
314", "city": "Port Elizabeth", "state": "OH", "zip_code": "58802", "email": "rosewayne@example.org", "home_phone": "001-809-935-9112x17961", "mobile": "+1-732-477-7876x9314", "ssn": "801-31-5673", "job_title": "Scientist, research (maths)", "create_date": "2022-12-11 23:29:52", "cust_id": 14} 553 | {"last_name": "Rocha", "first_name": "Benjamin", "street_address": "294 William Skyway", "city": "Fowlerville", "state": "WA", "zip_code": "75495", "email": "fwhite@example.com", "home_phone": "001-476-468-4403x364", "mobile": "4731036956", "ssn": "571-78-6278", "job_title": "Probation officer", "create_date": "2022-12-10 07:39:35", "cust_id": 15} 554 | {"last_name": "Lawrence", "first_name": "Jonathan", "street_address": "4610 Kelly Road Suite 333", "city": "Michaelfort", "state": "PR", "zip_code": "03033", "email": "raymisty@example.com", "home_phone": "936.011.1602x5883", "mobile": "(577)016-2546x30390", "ssn": "003-05-2317", "job_title": "Dancer", "create_date": "2022-11-27 23:44:14", "cust_id": 16} 555 | {"last_name": "Taylor", "first_name": "Thomas", "street_address": "51884 Kelsey Ridges Apt. 973", "city": "Lake Morgan", "state": "RI", "zip_code": "36056", "email": "vanggary@example.net", "home_phone": "541-784-5497x32009", "mobile": "+1-337-857-9219x83198", "ssn": "133-61-4337", "job_title": "Town planner", "create_date": "2022-12-07 12:33:45", "cust_id": 17} 556 | {"last_name": "Williamson", "first_name": "Jeffrey", "street_address": "6094 Powell Passage", "city": "Stevenland", "state": "VT", "zip_code": "88479", "email": "jwallace@example.com", "home_phone": "4172910794", "mobile": "494.361.3094x223", "ssn": "512-84-0907", "job_title": "Clinical cytogeneticist", "create_date": "2022-12-13 16:58:43", "cust_id": 18} 557 | {"last_name": "Mccullough", "first_name": "Joseph", "street_address": "7329 Santiago Point Apt. 070", "city": "Reedland", "state": "MH", "zip_code": "85316", "email": "michellecain@example.com", "home_phone": "(449)740-1390", "mobile": "(663)381-3306x19170", "ssn": "605-84-9744", "job_title": "Seismic interpreter", "create_date": "2022-12-05 05:33:56", "cust_id": 19} 558 | {"last_name": "Kirby", "first_name": "Evan", "street_address": "95959 Brown Rue Apt. 657", "city": "Lake Vanessa", "state": "MH", "zip_code": "92042", "email": "tayloralexandra@example.org", "home_phone": "342-317-5803", "mobile": "185-084-4719x39341", "ssn": "264-14-4935", "job_title": "Interpreter", "create_date": "2022-12-20 14:23:43", "cust_id": 20} 559 | {"last_name": "Pittman", "first_name": "Teresa", "street_address": "3249 Danielle Parks Apt. 472", "city": "East Ryan", "state": "ME", "zip_code": "33108", "email": "hamiltondanielle@example.org", "home_phone": "+1-814-789-0109x88291", "mobile": "(749)434-0916", "ssn": "302-61-5936", "job_title": "Medical physicist", "create_date": "2022-12-26 05:14:24", "cust_id": 21} 560 | {"last_name": "Byrd", "first_name": "Alicia", "street_address": "1232 Jenkins Pine Apt. 
472", "city": "Woodton", "state": "NC", "zip_code": "82330", "email": "shelly47@example.net", "home_phone": "001-930-450-7297x258", "mobile": "+1-968-526-2756x661", "ssn": "656-69-9593", "job_title": "Therapist, art", "create_date": "2022-12-17 18:20:51", "cust_id": 22} 561 | {"last_name": "Ellis", "first_name": "Kathleen", "street_address": "935 Kristina Club", "city": "East Maryton", "state": "AK", "zip_code": "86759", "email": "jacksonkaren@example.com", "home_phone": "001-089-194-5982x828", "mobile": "127.892.8518", "ssn": "426-13-9463", "job_title": "English as a foreign language teacher", "create_date": "2022-12-08 04:01:44", "cust_id": 23} 562 | {"last_name": "Lee", "first_name": "Tony", "street_address": "830 Elizabeth Mill Suite 184", "city": "New Heather", "state": "UT", "zip_code": "59612", "email": "vmayo@example.net", "home_phone": "001-593-666-0198", "mobile": "060.108.7218", "ssn": "048-20-6647", "job_title": "Civil engineer, consulting", "create_date": "2022-12-24 17:10:32", "cust_id": 24} 563 | EOF 564 | 565 | ########################################################################################## 566 | # Let's add some transactions for these customers for a lab 567 | ########################################################################################## 568 | cat < /opt/spark/input/transactions.json 569 | {"transact_id": "e786c399-ee9a-4053-a716-671bd456d06c", "category": "green", "barcode": "9688687184711", "item_desc": "Though evidence push.", "amount": 61.47, "transaction_date": "2022-12-31 03:52:13", "cust_id": 11} 570 | {"transact_id": "58ccab06-38fe-45ab-a105-994f8bc51e1f", "category": "maroon", "barcode": "6270293172737", "item_desc": "Hotel toward radio exactly.", "amount": 18.26, "transaction_date": "2023-01-26 23:42:58", "cust_id": 11} 571 | {"transact_id": "9f5a1c46-ac16-46c9-87fd-ff3ec4f36377", "category": "maroon", "barcode": "0000885336836", "item_desc": "West truth dog staff professor just.", "amount": 9.64, "transaction_date": "2023-01-24 15:51:44", "cust_id": 11} 572 | {"transact_id": "c37e87fd-8833-44e9-85e3-ba2cb5e32c5d", "category": "purple", "barcode": "3898859302683", "item_desc": "Half chance hard.", "amount": 20.5, "transaction_date": "2023-01-13 08:54:35", "cust_id": 11} 573 | {"transact_id": "ae165ddf-e99d-473f-a8ec-75c3235e2ca9", "category": "black", "barcode": "8835416937716", "item_desc": "Song tough born station break long.", "amount": 52.7, "transaction_date": "2023-01-24 00:04:58", "cust_id": 11} 574 | {"transact_id": "3ed16c03-607f-40a1-b446-c1b2c18b8a58", "category": "purple", "barcode": "2387695378019", "item_desc": "Cover likely dog.", "amount": 94.27, "transaction_date": "2022-12-31 11:15:18", "cust_id": 12} 575 | {"transact_id": "830e2d42-594c-4531-9256-6c7e3036f132", "category": "olive", "barcode": "1655418639701", "item_desc": "Difference major fast hear answer character.", "amount": 54.44, "transaction_date": "2023-01-03 22:01:20", "cust_id": 13} 576 | {"transact_id": "4c8db6cf-2a66-4a3a-8474-00d3db8aeb92", "category": "aqua", "barcode": "4088755032541", "item_desc": "On without probably of.", "amount": 94.67, "transaction_date": "2023-01-08 02:11:48", "cust_id": 13} 577 | {"transact_id": "ae54bcf5-250d-4076-854b-40a13cd74b7c", "category": "yellow", "barcode": "3783631322815", "item_desc": "Somebody yourself maintain only together.", "amount": 6.37, "transaction_date": "2023-01-02 09:39:39", "cust_id": 13} 578 | {"transact_id": "c3d3f77a-54ba-4503-bf7c-53db29a775e7", "category": "lime", "barcode": "9466888768004", "item_desc": 
"By fear hospital certainly.", "amount": 94.8, "transaction_date": "2023-01-05 06:37:27", "cust_id": 13} 579 | {"transact_id": "b5caf452-a44c-442d-a4cf-d88e2c08f7b3", "category": "black", "barcode": "5032052452372", "item_desc": "Imagine occur environment according more.", "amount": 62.94, "transaction_date": "2023-01-27 09:59:41", "cust_id": 14} 580 | {"transact_id": "731fd64e-74af-4364-999e-67e8cccfd6ee", "category": "gray", "barcode": "2687016061218", "item_desc": "Game cover trade discover me read.", "amount": 70.9, "transaction_date": "2022-12-30 02:23:15", "cust_id": 14} 581 | {"transact_id": "40edcc76-0ca0-4b88-990a-7e9abe400cbb", "category": "teal", "barcode": "1212133800184", "item_desc": "Form budget listen.", "amount": 31.5, "transaction_date": "2023-01-04 13:29:38", "cust_id": 14} 582 | {"transact_id": "a811b772-8149-4ba2-ace0-7b658cd45c20", "category": "teal", "barcode": "8751563802922", "item_desc": "Weight hot mean.", "amount": 51.46, "transaction_date": "2023-01-20 23:50:30", "cust_id": 16} 583 | {"transact_id": "8cc2a57f-5007-42b1-a4a2-722cf609bb76", "category": "purple", "barcode": "6267199327651", "item_desc": "Recognize ten area general.", "amount": 2.41, "transaction_date": "2023-01-19 17:20:57", "cust_id": 16} 584 | {"transact_id": "931d9ff0-c82d-49e9-bc8d-bad319b20d84", "category": "white", "barcode": "9009659885601", "item_desc": "Safe medical start receive.", "amount": 61.77, "transaction_date": "2023-01-26 20:34:36", "cust_id": 16} 585 | {"transact_id": "b0457bb2-8b72-4a1a-a247-6ca9d2c06be9", "category": "yellow", "barcode": "6453786338029", "item_desc": "Force set think cost.", "amount": 45.59, "transaction_date": "2023-01-24 11:39:20", "cust_id": 17} 586 | {"transact_id": "b189cb88-6a14-4741-9286-102c379052d4", "category": "purple", "barcode": "2036094483571", "item_desc": "Nation consumer film fact only to.", "amount": 55.86, "transaction_date": "2023-01-12 16:29:53", "cust_id": 17} 587 | {"transact_id": "c2564bb3-4485-4f2a-82e0-aa7e53cfc622", "category": "silver", "barcode": "8282187103947", "item_desc": "Sign standard pass evidence.", "amount": 38.78, "transaction_date": "2023-01-02 00:25:31", "cust_id": 18} 588 | {"transact_id": "884469c2-32ee-439c-9f8a-570b9d49b152", "category": "lime", "barcode": "8529678377198", "item_desc": "Member write create.", "amount": 82.95, "transaction_date": "2023-01-03 13:49:19", "cust_id": 18} 589 | {"transact_id": "0a722403-a7dd-4c9c-b958-95191ae841c1", "category": "green", "barcode": "6500182661487", "item_desc": "Over usually who table compare area model.", "amount": 54.1, "transaction_date": "2023-01-18 18:43:36", "cust_id": 18} 590 | {"transact_id": "2b23b8c7-28db-4204-902f-a6fd3dd1f475", "category": "navy", "barcode": "1378348043058", "item_desc": "Technology one ahead general.", "amount": 54.67, "transaction_date": "2022-12-30 00:24:16", "cust_id": 19} 591 | {"transact_id": "aacce2c5-2472-4d66-a445-3bc126745e0b", "category": "navy", "barcode": "2056653042902", "item_desc": "Speech hot letter hot.", "amount": 5.9, "transaction_date": "2023-01-08 16:16:32", "cust_id": 21} 592 | {"transact_id": "20d157be-8a47-435a-a61f-c8ab68b34c8d", "category": "blue", "barcode": "7125652103787", "item_desc": "Strong society officer bag.", "amount": 46.41, "transaction_date": "2023-01-04 20:29:32", "cust_id": 21} 593 | {"transact_id": "098478b0-d0bc-4140-b621-abe9a03a768e", "category": "fuchsia", "barcode": "8780633730896", "item_desc": "Oil stock film source.", "amount": 78.61, "transaction_date": "2023-01-26 22:36:26", "cust_id": 21} 
594 | {"transact_id": "8dbe22d8-050a-48a3-8526-b5c11230589e", "category": "navy", "barcode": "6879593096691", "item_desc": "Form affect seem side job.", "amount": 69.92, "transaction_date": "2022-12-31 21:41:30", "cust_id": 21} 595 | {"transact_id": "d27cde76-40df-4eda-9567-07aba2e2a0b8", "category": "gray", "barcode": "3376554112825", "item_desc": "Inside page bag.", "amount": 76.63, "transaction_date": "2023-01-10 20:53:23", "cust_id": 22} 596 | {"transact_id": "a4b87dc7-f401-4f13-9cfd-4858b0d575c0", "category": "yellow", "barcode": "0922971679088", "item_desc": "Guy more national.", "amount": 2.55, "transaction_date": "2023-01-25 14:29:42", "cust_id": 22} 597 | {"transact_id": "48ce0556-fc57-4748-bd79-a146cd32147b", "category": "aqua", "barcode": "8702162059583", "item_desc": "Sometimes president response want.", "amount": 16.91, "transaction_date": "2023-01-03 12:00:34", "cust_id": 22} 598 | {"transact_id": "b2dd711c-4d23-4c99-b980-bd7afd1ef62a", "category": "purple", "barcode": "0983651241193", "item_desc": "Born under focus budget east free.", "amount": 53.43, "transaction_date": "2023-01-01 16:42:59", "cust_id": 22} 599 | {"transact_id": "913475de-32bb-4d80-aed0-1d9631dd0677", "category": "silver", "barcode": "9827839337951", "item_desc": "Address operation hold.", "amount": 55.79, "transaction_date": "2023-01-04 19:19:01", "cust_id": 23} 600 | {"transact_id": "885ccb18-3d19-48aa-9ad3-095a562fe0a7", "category": "navy", "barcode": "5176084629125", "item_desc": "Thus second hospital development ball.", "amount": 65.89, "transaction_date": "2023-01-27 15:26:02", "cust_id": 24} 601 | {"transact_id": "62cb9752-6da5-404a-a0a3-08192731db90", "category": "blue", "barcode": "8670289379405", "item_desc": "Prevent great yes travel where real.", "amount": 51.36, "transaction_date": "2023-01-11 23:35:22", "cust_id": 24} 602 | {"transact_id": "04f07ef8-8453-40af-9d3e-da6e9693919b", "category": "olive", "barcode": "2009850879093", "item_desc": "Weight spring baby be thought degree.", "amount": 27.82, "transaction_date": "2023-01-22 13:56:49", "cust_id": 24} 603 | EOF 604 | 605 | ######################################################################################### 606 | # add items to path for future use 607 | ######################################################################################### 608 | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 609 | export SPARK_HOME=/opt/spark 610 | 611 | echo "" >> ~/.profile 612 | echo "# set path variables here:" >> ~/.profile 613 | echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> ~/.profile 614 | echo "export SPARK_HOME=/opt/spark" >> ~/.profile 615 | 616 | echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$JAVA_HOME/bin:$HOME/minio-binaries" >> ~/.profile 617 | 618 | # let's make this visible 619 | . 
~/.profile 620 | 621 | 622 | ######################################################################################### 623 | # install docker ce (needed for dbz server build with maven) 624 | ######################################################################################### 625 | sudo apt install -y apt-transport-https ca-certificates curl software-properties-common 626 | curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - 627 | sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable" 628 | apt-cache policy docker-ce 629 | sudo apt install -y docker-ce 630 | sudo chmod 666 /var/run/docker.sock 631 | sudo usermod -aG docker ${USER} 632 | 633 | echo 634 | echo "---------------------------------------------------------------------" 635 | echo "install of docker-ce complete..." 636 | echo "---------------------------------------------------------------------" 637 | echo 638 | ######################################################################################### 639 | # install debezium server items 640 | ######################################################################################### 641 | cd ~ 642 | git clone https://github.com/memiiso/debezium-server-iceberg.git 643 | cd debezium-server-iceberg 644 | 645 | 646 | echo 647 | echo "---------------------------------------------------------------------" 648 | echo "starting maven build of Debezium Server..." 649 | echo "---------------------------------------------------------------------" 650 | echo 651 | mvn -Passembly -Dmaven.test.skip package 652 | 653 | 654 | echo 655 | echo "---------------------------------------------------------------------" 656 | echo "maven build of Debezium Server complete..." 657 | echo "---------------------------------------------------------------------" 658 | echo 659 | 660 | echo 661 | echo "---------------------------------------------------------------------" 662 | echo "configure Debezium Server items..." 663 | echo "---------------------------------------------------------------------" 664 | echo 665 | cp ~/debezium-server-iceberg/debezium-server-iceberg-dist/target/debezium-server-iceberg-dist-0.3.0-SNAPSHOT.zip ~ 666 | 667 | unzip ~/debezium-server-iceberg-dist*.zip -d ~/appdist 668 | 669 | mkdir -p ~/debezium-server-iceberg/data 670 | 671 | 672 | ######################################################################################### 673 | # configure our dbz source-sink.properties file 674 | ######################################################################################### 675 | #sudo cp ~/data_origination_workshop/dbz_server/application.properties ~/appdist/debezium-server-iceberg/conf/ 676 | cp ~/data_origination_workshop/dbz_server/application.properties ~/appdist/debezium-server-iceberg/conf/ 677 | 678 | ########################################################################################## 679 | # let's update the properties files to use our minio keys. 680 | ########################################################################################## 681 | 682 | . 
~/minio-output.properties 683 | 684 | sed -e "s,,$access_key,g" -i ~/appdist/debezium-server-iceberg/conf/application.properties 685 | sed -e "s,,$secret_key,g" -i ~/appdist/debezium-server-iceberg/conf/application.properties 686 | 687 | # change ownership 688 | #sudo chown datagen:datagen -R /home/datagen/appdist 689 | 690 | # remove the example file: 691 | rm /home/datagen/appdist/debezium-server-iceberg/conf/application.properties.example 692 | 693 | # remove the zip file: 694 | rm /home/datagen/debezium-server-iceberg-dist-*-SNAPSHOT.zip 695 | 696 | echo 697 | echo "---------------------------------------------------------------------" 698 | echo "Debezium Server setup complete..." 699 | echo "---------------------------------------------------------------------" 700 | echo 701 | 702 | ######################################################################################### 703 | # let's start our spark master and workers. 704 | ######################################################################################### 705 | ######################################################################################### 706 | # need to change the spark master gui port from 8080 to 8085 to avoid conflict with redpanda 707 | ######################################################################################### 708 | 709 | echo 710 | echo "---------------------------------------------------------------------" 711 | echo "configure Spark and start master and worker services..." 712 | echo "---------------------------------------------------------------------" 713 | echo 714 | 715 | ######################################################################################### 716 | # need to change the default ports for master and workers to avoid conflicts with red panda and kafka connect 717 | ######################################################################################### 718 | sed -e 's,SPARK_MASTER_WEBUI_PORT=8080,SPARK_MASTER_WEBUI_PORT=8085,g' -i /opt/spark/sbin/start-master.sh 719 | sed -e 's,SPARK_WORKER_WEBUI_PORT=8081,SPARK_WORKER_WEBUI_PORT=8090,g' -i /opt/spark/sbin/start-worker.sh 720 | 721 | echo "starting spark master..." 722 | /opt/spark/sbin/start-master.sh 723 | echo 724 | echo "starting spark worker..." 725 | /opt/spark/sbin/start-worker.sh spark://$(hostname -f):7077 726 | echo 727 | 728 | ######################################################################################### 729 | # setup complete. 730 | ######################################################################################### 731 | figlet -f small -w 300 "Setup is complete!"'!' | cowsay -n -f "$(ls -1 /usr/share/cowsay/cows | grep "\.cow" | sed 's/\.cow//' | egrep -v "bong|head-in|sodomized|telebears" | shuf -n 1)" 732 | 733 | ######################################################################################### 734 | # source this to set our new variables in current session 735 | ######################################################################################### 736 | bash -l 737 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | --- 2 | Title: The Journey to Apache Iceberg with Red Panda & Debezium 3 | Author: Tim Lepple 4 | Create Date: 03.04.2023 5 | Last Updated: 3.15.2024 6 | Comments: This repo will set up a data integration platform to evaluate some technology. 
7 | Tags: Iceberg | Spark | Redpanda | PostgreSQL | Kafka Connect | Python | Debezium | Minio 8 | --- 9 | 10 | 11 | --- 12 | --- 13 | 14 | --- 15 | --- 16 | 17 | 18 | # The Journey to Apache Iceberg with Red Panda & Debezium 19 | --- 20 | --- 21 | 22 | ## Objective: 23 | The goal of this workshop was to evaluate [Redpanda](https://redpanda.com/) and Kafka Connect (with the Debezium CDC plugin). Set up a data generator that streams events directly into Redpanda and also into a traditional database platform and deliver it to an [Apache Iceberg](https://iceberg.apache.org/) data lake. 24 | 25 | I took the time to install these components manually on a traditional Linux server and then wrote the setup script in this repo so others could try it out too. Please take the time to review that script [`setup_datagen.sh`](./setup_data_origination_apps.sh). Hopefully, it will become a reference for you one day if you use any of this technology. 26 | 27 | In this workshop, we will integrate this data platform and stream data from here into our Apache Iceberg data lake built in a previous workshop (all of those components will be installed here too). For step-by-step instructions on working with the Iceberg components, please check out my [Apache Iceberg Workshop](https://github.com/tlepple/iceberg-intro-workshop) for more details. All of the tasks from that workshop can be run on this new server. 28 | 29 | --- 30 | --- 31 | 32 | # Highlights: 33 | 34 | --- 35 | 36 | The setup script will build and install our `Data Integration Platform` onto a single Linux instance. It installs a data-generating application, a local SQL database (PostgreSQL), a Red Panda instance, a stand-alone Kafka Connect instance, a Debezium plugin for Kafka Connect, a Debezium Server, Minio, Spark and Apache Iceberg. In addition, it will configure them all to work together. 37 | 38 | --- 39 | --- 40 | 41 | ### Pre-Requisites: 42 | 43 | --- 44 | 45 | * I built this on a new install of Ubuntu Server 46 | * Version: 20.04.5 LTS 47 | * Instance Specs: (min 4 core w/ 16 GB ram & 30 GB of disk) -- add more RAM if you have it to spare. 48 | * If you are going to test this in `AWS`, it ran smoothly for me using AMI: `ami-03a311cadf2d2a6f8` in region: `us-east-2` with a instance type of: `t3.xlarge` 49 | 50 | --- 51 | ### Create an OS User `Datagen` 52 | 53 | * This user account will be the owner of all the objects that get installed 54 | * Security is not in place for any of this workshop. 55 | 56 | ``` 57 | ########################################################################################## 58 | # create an osuser datagen and add to sudo file 59 | ########################################################################################## 60 | sudo useradd -m -s /usr/bin/bash datagen 61 | 62 | echo supersecret1 > passwd.txt 63 | echo supersecret1 >> passwd.txt 64 | 65 | sudo passwd datagen < passwd.txt 66 | 67 | rm -f passwd.txt 68 | sudo usermod -aG sudo datagen 69 | ########################################################################################## 70 | # let's complete this install as this user: 71 | ########################################################################################## 72 | # password: supersecret1 73 | su - datagen 74 | ``` 75 | --- 76 | 77 | ### Install Git tools and pull this repo. 
78 | * ssh into your new Ubuntu 20.04 instance and run the below command: 79 | 80 | --- 81 | ``` 82 | sudo apt-get install git -y 83 | 84 | cd ~ 85 | git clone https://github.com/tlepple/data_origination_workshop.git 86 | ``` 87 | 88 | --- 89 | 90 | ### Start the build: 91 | 92 | ``` 93 | # run it: 94 | . ~/data_origination_workshop/setup_datagen.sh 95 | ``` 96 | * This should complete within 10 minutes. 97 | --- 98 | 99 | --- 100 | ### Workshop One Refresher: 101 | 102 | If you didn't complete my first workshop and need a primer on Iceberg, you can complete that work again on this platform by following this guide: [Workshop 1 Exercises](./workshop1_revisit.md). If you are already familiar with those items please proceed. A later step has all of that workshop automated if you prefer. 103 | 104 | --- 105 | 106 | ### What is Redpanda 107 | * Information in this section was gathered from their website. You can find more detailed information about their platform here: [Red Panda](https://redpanda.com/platform) 108 | --- 109 | 110 | Redpanda is an event streaming platform: it provides the infrastructure for streaming real-time data. It has been proven to be 10x faster and 6x lower in total costs. It is also JVM-free, ZooKeeper®-free, Jepsen-tested and source available. 111 | 112 | Producers are client applications that send data to Redpanda in the form of events. Redpanda safely stores these events in sequence and organizes them into topics, which represent a replayable log of changes in the system. 113 | 114 | Consumers are client applications that subscribe to Redpanda topics to asynchronously read events. Consumers can store, process, or react to events. 115 | 116 | Redpanda decouples producers from consumers to allow for asynchronous event processing, event tracking, event manipulation, and event archiving. Producers and consumers interact with Redpanda using the Apache Kafka® API. 117 | 118 | | Event-driven architecture (Redpanda) | Message-driven architecture | 119 | | ----------- | ----------- | 120 | | Producers send events to an event processing system (Redpanda) that acknowledges receipt of the write. This guarantees that the write is durable within the system and can be read by multiple consumers. | Producers send messages directly to each consumer. The producer must wait for acknowledgment that the consumer received the message before it can continue with its processes. | 121 | 122 | 123 | Event streaming lets you extract value from each event by analyzing, mining, or transforming it for insights. You can: 124 | 125 | * Take one event and consume it in multiple ways. 126 | * Replay events from the past and route them to new processes in your application. 127 | * Run transformations on the data in real-time or historically. 128 | * Integrate with other event processing systems that use the Kafka API. 129 | 130 | 131 | #### Redpanda differentiators: 132 | Redpanda is less complex and less costly than any other commercial mission-critical event streaming platform. It's fast, it's easy, and it keeps your data safe. 133 | 134 | * Redpanda is designed for maximum performance on any data streaming workload. 135 | 136 | * It can scale up to use all available resources on a single machine and scale out to distribute performance across multiple nodes. Built on C++, Redpanda delivers greater throughput and up to 10x lower p99 latencies than other platforms. This enables previously-unimaginable use cases that require high throughput, low latency, and a minimal hardware footprint. 
137 | 138 | * Redpanda is packaged as a single binary: it doesn't rely on any external systems. 139 | 140 | * It's compatible with the Kafka API, so it works with the full ecosystem of tools and integrations built on Kafka. Redpanda can be deployed on bare metal, containers, or virtual machines in a data center or in the cloud. Redpanda Console also makes it easy to set up, manage, and monitor your clusters. Additionally, Tiered Storage lets you offload log segments to cloud storage in near real-time, providing infinite data retention and topic recovery. 141 | 142 | * Redpanda uses the Raft consensus algorithm throughout the platform to coordinate writing data to log files and replicating that data across multiple servers. 143 | 144 | * Raft facilitates communication between the nodes in a Redpanda cluster to make sure that they agree on changes and remain in sync, even if a minority of them are in a failure state. This allows Redpanda to tolerate partial environmental failures and deliver predictable performance, even at high loads. 145 | 146 | * Redpanda provides data sovereignty. 147 | 148 | --- 149 | --- 150 | ### Hands-On Workshop begins here: 151 | --- 152 | --- 153 | 154 | #### Explore the Red Panda CLI tool `RPK` 155 | * Redpanda Keeper `rpk` is Redpanda's command line interface (CLI) utility. Detailed documentation of the CLI can be explored further here: [Redpanda Keeper Commands](https://docs.redpanda.com/docs/reference/rpk/) 156 | 157 | ##### Create our first Redpanda topic with the CLI: 158 | * run this from a terminal window: 159 | ``` 160 | # Let's create a topic with RPK 161 | rpk topic create movie_list 162 | ``` 163 | #### Start a Redpanda `Producer` using the `rpk` CLI to add messages: 164 | * this will open a producer session and await your input until you close it with ` + d` 165 | ``` 166 | rpk topic produce movie_list 167 | ``` 168 | 169 | #### Add some messages to the `movie_list` topic: 170 | * The producer will appear to be hung in the terminal window. It is really just waiting for you to type in a message and hit ``. 171 | 172 | 173 | ###### Entry 1: 174 | ``` 175 | Top Gun Maverick 176 | ``` 177 | ###### Entry 2: 178 | ``` 179 | Star Wars - Return of the Jedi 180 | ``` 181 | #### Expected Output: 182 | ``` 183 | Produced to partition 0 at offset 0 with timestamp 1675085635701. 184 | Star Wars - Return of the Jedi 185 | Produced to partition 0 at offset 1 with timestamp 1675085644895. 186 | ``` 187 | 188 | ##### Exit the producer: ` + d` 189 | 190 | #### View these messages from Redpanda `Consumer` using the `rpk` CLI: 191 | 192 | ``` 193 | rpk topic consume movie_list --num 2 194 | ``` 195 | 196 | --- 197 | 198 | #### Expected Output: 199 | 200 | ``` 201 | { 202 | "topic": "movie_list", 203 | "value": "Top Gun Maverick", 204 | "timestamp": 1675085635701, 205 | "partition": 0, 206 | "offset": 0 207 | } 208 | { 209 | "topic": "movie_list", 210 | "value": "Star Wars - Return of the Jedi", 211 | "timestamp": 1675085644895, 212 | "partition": 0, 213 | "offset": 1 214 | } 215 | 216 | ``` 217 | --- 218 | --- 219 | 220 | ## Explore the Red Panda GUI: 221 | * Open a browser and navigate to your host ip address: `http:\\:8888` This will open the Red Panda GUI. 222 | * This is not the standard port for the Redpanda Console. 
It has been modified to avoid confilcts with other tools used in this workshop 223 | 224 | --- 225 | --- 226 | 227 | ![](./images/panda_view_topics.png) 228 | 229 | --- 230 | 231 | #### We can delete this topic from the `rpk` CLI: 232 | 233 | ``` 234 | rpk topic delete movie_list 235 | ``` 236 | --- 237 | --- 238 | 239 | ### Data Generator: 240 | --- 241 | 242 | I have written a data generator CLI application and included it in this workshop to simplify creating some realistic data for us to explore. We will use this data generator application to stream some realistic data directly into some topics (and later into a database). The data generator is written in python and uses the component [Faker](https://faker.readthedocs.io/en/master/). I encourage you to look at the code here if you want to look deeper into it. [Data Generator Items](./datagen) 243 | 244 | --- 245 | 246 | ##### Let's create some topics for our data generator using the CLI: 247 | ``` 248 | rpk topic create dgCustomer 249 | rpk topic create dgTxn 250 | ``` 251 | --- 252 | ##### Console view of our `Topics`: 253 | ![](./images/panda_view__dg_load_topics.png) 254 | 255 | --- 256 | --- 257 | 258 | ##### Data Generator Notes: 259 | --- 260 | 261 | The data generator app in this section accepts 3 integer arguments: 262 | * An integer value for the `customer key`. 263 | * An integer value for the `N` number of groups to produce in small batches. 264 | * An integer value for `N` number of times to loop until it will exit the script. 265 | 266 | --- 267 | ##### Call the `Data Generator` to stream some messages to our topics: 268 | --- 269 | 270 | ``` 271 | cd ~/datagen 272 | 273 | # start the script: 274 | python3 redpanda_dg.py 10 3 2 275 | ``` 276 | 277 | ##### Sample Output: 278 | 279 | This will load sample JSON data into our two new topics and write out a copy of those records to your terminal that looks something like this: 280 | 281 | --- 282 | 283 | ``` 284 | {"last_name": "Mcmillan", "first_name": "Linda", "street_address": "7471 Charlotte Fall Suite 835", "city": "Lake Richardborough", "state": "OH", "zip_code": "25649", "email": "tim47@example.org", "home_phone": "001-133-135-5972", "mobile": "001-942-819-7717", "ssn": "321-16-7039", "job_title": "Tourism officer", "create_date": "2022-12-19 20:45:34", "cust_id": 10} 285 | {"last_name": "Hatfield", "first_name": "Denise", "street_address": "5799 Solis Isle", "city": "Josephbury", "state": "LA", "zip_code": "61947", "email": "lhernandez@example.org", "home_phone": "(110)079-8975x48785", "mobile": "976.262.7268", "ssn": "185-93-0904", "job_title": "Engineer, chemical", "create_date": "2022-12-31 00:29:36", "cust_id": 11} 286 | {"last_name": "Adams", "first_name": "Zachary", "street_address": "6065 Dawn Inlet Suite 631", "city": "East Vickiechester", "state": "MS", "zip_code": "52115", "email": "fgrimes@example.com", "home_phone": "001-445-395-1773x238", "mobile": "(071)282-1174", "ssn": "443-22-3631", "job_title": "Maintenance engineer", "create_date": "2022-12-07 20:40:25", "cust_id": 12} 287 | Customer Done. 
288 | 289 | 290 | {"transact_id": "020d5f1c-741d-40b0-8b2a-88ff2cdc0d9a", "category": "teal", "barcode": "5178387219027", "item_desc": "Government training especially.", "amount": 85.19, "transaction_date": "2023-01-07 21:24:17", "cust_id": 10} 291 | {"transact_id": "af9b7e7e-9068-4772-af7e-a8cb63bf555f", "category": "aqua", "barcode": "5092525324087", "item_desc": "Take study after catch.", "amount": 82.28, "transaction_date": "2023-01-18 01:13:13", "cust_id": 10} 292 | {"transact_id": "b11ae666-b85c-4a86-9fbe-8f4fddd364df", "category": "purple", "barcode": "3527261055442", "item_desc": "Likely age store hold.", "amount": 11.8, "transaction_date": "2023-01-26 01:15:46", "cust_id": 10} 293 | {"transact_id": "e968daad-6c14-475f-a183-1afec555dd5f", "category": "olive", "barcode": "7687223414666", "item_desc": "Performance call myself send.", "amount": 67.48, "transaction_date": "2023-01-25 01:51:05", "cust_id": 10} 294 | {"transact_id": "d171c8d7-d099-4a41-bf23-d9534b711371", "category": "teal", "barcode": "9761406515291", "item_desc": "Charge no when.", "amount": 94.57, "transaction_date": "2023-01-05 12:09:58", "cust_id": 11} 295 | {"transact_id": "2297de89-c731-42f1-97a6-98f6b50dd91a", "category": "lime", "barcode": "6484138725655", "item_desc": "Little unit total money raise.", "amount": 47.88, "transaction_date": "2023-01-13 08:16:24", "cust_id": 11} 296 | {"transact_id": "d3e08d65-7806-4d03-a494-6ec844204f64", "category": "black", "barcode": "9827295498272", "item_desc": "Yeah claim city threat approach our.", "amount": 45.83, "transaction_date": "2023-01-07 20:29:59", "cust_id": 11} 297 | {"transact_id": "97cf1092-6f03-400d-af31-d276eff05ecf", "category": "silver", "barcode": "2072026095184", "item_desc": "Heart table see share fish.", "amount": 95.67, "transaction_date": "2023-01-12 19:10:11", "cust_id": 11} 298 | {"transact_id": "11da28af-e463-4f7c-baf2-fc0641004dec", "category": "blue", "barcode": "3056115432639", "item_desc": "Writer exactly single toward same.", "amount": 9.33, "transaction_date": "2023-01-29 02:49:30", "cust_id": 12} 299 | {"transact_id": "c9ebc8a5-3d1a-446e-ac64-8bdd52a1ce36", "category": "fuchsia", "barcode": "6534191981175", "item_desc": "Morning who lay yeah travel use.", "amount": 73.2, "transaction_date": "2023-01-21 02:25:02", "cust_id": 12} 300 | Transaction Done. 301 | 302 | ``` 303 | --- 304 | 305 | #### Explore messages in the Red Panda Console from a browser 306 | * `http:\\:8888` Make sure to click the `Topics` tab on the left side of our Console Application: 307 | --- 308 | ##### Click on the topic `dgCustomer` from the list. 309 | 310 | --- 311 | 312 | ![](./images/topic_customer_view.png) 313 | 314 | --- 315 | 316 | ##### Click on the topic '+' icon under the `Value` column to see the record details of a message. 317 | 318 | --- 319 | 320 | ![](./images/detail_view_of_cust_msg.png) 321 | 322 | --- 323 | --- 324 | ## Explore Change Data Capture (CDC) via `Kafka Connect` and `Debezium` 325 | 326 | --- 327 | 328 | ##### Define Change Data Capture (CDC): 329 | 330 | Change Data Capture (CDC) is a database technique used to track and record changes made to data in a database. The changes are captured as soon as they occur and stored in a separate log or table, allowing applications to access the most up-to-date information without having to perform a full database query. CDC is often used for real-time data integration and data replication, enabling organizations to maintain a consistent view of their data across multiple systems. 
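To make that concrete, a CDC event typically pairs the row's state before and after a change with a little metadata about the operation and its origin. The sketch below is a simplified, hypothetical envelope written as plain Python (real Debezium payloads carry more metadata, and exact field names vary by tool):

```
# Simplified, hypothetical CDC event for a single row change.
# Real Debezium events also include schema info, LSNs, transaction ids, etc.
cdc_event = {
    "op": "u",                                   # c = create, u = update, d = delete, r = snapshot read
    "source": {"db": "datagen", "table": "customer", "ts_ms": 1677080674193},
    "before": {"cust_id": 10, "city": "Lake Richardborough", "last_name": "Mcmillan"},
    "after":  {"cust_id": 10, "city": "North Kimberly",      "last_name": "Mcmillan"},
}

# A downstream consumer can translate each event into an upsert or a delete:
def apply_event(event):
    if event["op"] in ("c", "r", "u"):
        print("UPSERT", event["after"])
    elif event["op"] == "d":
        print("DELETE", event["before"])

apply_event(cdc_event)
```

Later in this workshop you will see real versions of these fields (`before`, `after`, `__op`, and so on) in the Kafka Connect messages and in the Iceberg CDC table.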
331 | 332 | --- 333 | 334 | ##### Define `Kafka Connect`: 335 | 336 | Kafka Connect is a tool for scalable and reliable data import/export between Apache Kafka and other data systems. It allows you to integrate Kafka or Red Panda with sources such as databases, key-value stores, and file systems, as well as with sinks such as data warehouses and NoSQL databases. Kafka Connect provides pre-built connectors for popular data sources and also supports custom connectors developed by users. It uses the publish-subscribe model of Kafka to ensure that data is transported between systems in a fault-tolerant and scalable manner. 337 | 338 | --- 339 | 340 | ##### Define `Debezium`: 341 | 342 | Debezium is an open-source change data capture (CDC) platform that helps to stream changes from databases such as MySQL, PostgreSQL, and MongoDB into Red Panda and Apache Kafka, among other data sources and sinks. Debezium is designed to be used for real-time data streaming and change data capture for applications, data integration, and analytics. This component is a must for getting at legacy data in an efficient manner. 343 | 344 | --- 345 | 346 | ##### Why use these tools together? 347 | 348 | By combining CDC with Kafa Connect (and using the Debezium plugin) we easily roll out a new system that could eliminate expensive legacy solutions for extracting data from databases and replicating them to a modern `Data Lake`. This approach requires very little configuration and will have a minimal performance impact on your legacy databases. It will also allow you to harness data in your legacy applications and implement new real-time streaming applications to gather insights that were previously very difficult and expensive to get at. 349 | 350 | --- 351 | --- 352 | 353 | #### Integrate PostgreSQL with Kafka Connect: 354 | 355 | In these next few exercises, we will load data into a SQL database and configure Kafka Connect to extract the CDC records and stream them to a new topic in Red Panda. 356 | 357 | --- 358 | --- 359 | 360 | #### Data Generator to load data into PostgreSQL: 361 | 362 | There is a second data generator application and we will use it to stream JSON records and load them directly into a Postgresql database. 363 | 364 | --- 365 | --- 366 | 367 | ##### Data Generator Notes for stream to PostgreSQL: 368 | --- 369 | This data generator application accepts 2 integer arguments: 370 | * An integer value for the starting `customer key`. 371 | * An integer value for `N` number of records to produce and load to the database. 
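Once you have run the generator in the next step, you can spot-check what actually landed in PostgreSQL straight from Python. This is an optional sketch using `psycopg2` (the same driver the generator uses); it assumes the rows land in a `datagen.customer` table, which is inferred from the CDC topic name you will see later (`pg_datagen2panda.datagen.customer`) rather than confirmed here:

```
# Optional sanity check: run this after the generator in the next step completes.
# The table name datagen.customer is an assumption inferred from the CDC topic name.
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", database="datagen",
                        user="datagen", password="supersecret1")
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM datagen.customer;")
    print("customer rows:", cur.fetchone()[0])
conn.close()
```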
372 | 373 | ##### Call the Data Generator: 374 | 375 | ``` 376 | cd ~/datagen 377 | 378 | # start the script: 379 | python3 pg_upsert_dg.py 10 4 380 | 381 | ``` 382 | 383 | ##### Sample Output: 384 | --- 385 | ``` 386 | Connection Established 387 | {"last_name": "Carson", "first_name": "Aaron", "street_address": "124 Campbell Overpass", "city": "Cummingsburgh", "state": "MT", "zip_code": "76816", "email": "aaron08@example.net", "home_phone": "786-888-8409x21666", "mobile": "001-737-014-7684x1271", "ssn": "394-84-0730", "job_title": "Tourist information centre manager", "create_date": "2022-12-04 00:00:13", "cust_id": 10} 388 | {"last_name": "Allen", "first_name": "Kristen", "street_address": "00782 Richard Freeway", "city": "East Josephfurt", "state": "NJ", "zip_code": "87309", "email": "xwyatt@example.com", "home_phone": "085-622-1720x88354", "mobile": "4849824808", "ssn": "130-35-4851", "job_title": "Psychologist, occupational", "create_date": "2022-12-23 14:33:56", "cust_id": 11} 389 | {"last_name": "Knight", "first_name": "William", "street_address": "1959 Coleman Drives", "city": "Williamsville", "state": "OH", "zip_code": "31621", "email": "farrellchristopher@example.org", "home_phone": "(572)744-6444x306", "mobile": "+1-587-017-1677", "ssn": "797-80-6749", "job_title": "Visual merchandiser", "create_date": "2022-12-11 03:57:01", "cust_id": 12} 390 | {"last_name": "Joyce", "first_name": "Susan", "street_address": "137 Butler Via Suite 789", "city": "West Linda", "state": "IN", "zip_code": "63240", "email": "jeffreyjohnson@example.org", "home_phone": "+1-422-918-6473x3418", "mobile": "483-124-5433x956", "ssn": "435-50-2408", "job_title": "Gaffer", "create_date": "2022-12-14 01:20:02", "cust_id": 13} 391 | Records inserted successfully 392 | PostgreSQL connection is closed 393 | script complete! 394 | 395 | ``` 396 | --- 397 | --- 398 | ### Configure Integration of `Redpanda` and `Kafka Connect` 399 | --- 400 | --- 401 | 402 | #### Kafka Connect Setup: 403 | 404 | In the setup script, we downloaded and installed all the components and needed jar files that Kafka Connect will use. Please review that setup file again if you want a refresher. The script also configured the settings for our integration of PostgreSQL with Red Panda. Let's review the configuration files that make it all work. 
405 | 406 | --- 407 | 408 | ##### The property file that will link Kafka Connect to Red Panda is located here: 409 | * make sure you are logged into the OS as user `datagen` with a password of `supersecret1` 410 | 411 | ``` 412 | 413 | cd ~/kafka_connect/configuration 414 | cat connect.properties 415 | ``` 416 | --- 417 | 418 | ##### Expected output: 419 | 420 | ``` 421 | #Kafka broker addresses 422 | bootstrap.servers=localhost:9092 423 | 424 | #Cluster level converters 425 | #These apply when the connectors don't define any converter 426 | key.converter=org.apache.kafka.connect.json.JsonConverter 427 | value.converter=org.apache.kafka.connect.json.JsonConverter 428 | 429 | #JSON schemas enabled to false in cluster level 430 | key.converter.schemas.enable=true 431 | value.converter.schemas.enable=true 432 | 433 | #Where to keep the Connect topic offset configurations 434 | offset.storage.file.filename=/tmp/connect.offsets 435 | offset.flush.interval.ms=10000 436 | 437 | #Plugin path to put the connector binaries 438 | plugin.path=:~/kafka_connect/plugins/debezium-connector-postgres/ 439 | 440 | ``` 441 | 442 | 443 | --- 444 | 445 | ##### The property file that will link Kafka Connect to PostgreSQL is located here: 446 | 447 | ``` 448 | cd ~/kafka_connect/configuration 449 | cat pg-source-connector.properties 450 | ``` 451 | --- 452 | 453 | ##### Expected output: 454 | 455 | ``` 456 | connector.class=io.debezium.connector.postgresql.PostgresConnector 457 | offset.storage=org.apache.kafka.connect.storage.FileOffsetBackingStore 458 | offset.storage.file.filename=offset.dat 459 | offset.flush.interval.ms=5000 460 | name=postgres-dbz-connector 461 | database.hostname=localhost 462 | database.port=5432 463 | database.user=datagen 464 | database.password=supersecret1 465 | database.dbname=datagen 466 | schema.include.list=datagen 467 | plugin.name=pgoutput 468 | topic.prefix=pg_datagen2panda 469 | 470 | ``` 471 | 472 | --- 473 | --- 474 | ### Start the `Kafka Connect` processor: 475 | * This will start our processor and pull all the CDC records out of the PostgreSQL database for our 'customer' table and ship them to a new Redpanda topic. 476 | * This process will run and pull the messages and then sleep until new messages get written to the originating database. To exit out of the processor when it completes, use the commands ` + c`. 477 | --- 478 | 479 | ##### Start Kafka Connect: 480 | * make sure you are logged into OS as user `datagen` with a password of `supersecret1` 481 | 482 | ``` 483 | cd ~/kafka_connect/configuration 484 | 485 | export CLASSPATH=/home/datagen/kafka_connect/plugins/debezium-connector-postgres/* 486 | ../kafka_2.13-3.3.2/bin/connect-standalone.sh connect.properties pg-source-connector.properties 487 | ``` 488 | --- 489 | 490 | ##### Expected Output: 491 | 492 | In this link, you can see the expected sample output: [`connect.output`](./sample_output/connect.output) 493 | 494 | 495 | 496 | --- 497 | ##### Explore the `Connect` tab in the Redpanda console from a browser: 498 | * This view is only available when `Connect` processes are running. 
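If you prefer a terminal view while Kafka Connect is running, you can also tail the connector's output topic with a short `kafka-python` consumer, similar to the workshop's `comsume_topic_dgCustomer.py` script. This is an optional sketch: the topic name combines the `topic.prefix`, schema, and table (`pg_datagen2panda.datagen.customer`), and the broker address assumes Redpanda's default listener on `localhost:9092`:

```
# Optional: tail the CDC topic while Kafka Connect is running (Ctrl + C to stop).
# Broker address and topic name are assumptions based on the configs shown above.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pg_datagen2panda.datagen.customer",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for msg in consumer:
    if msg.value is None:            # skip tombstone records
        continue
    # with JSON schemas enabled, the change data sits under the 'payload' key
    payload = msg.value.get("payload", msg.value)
    print(payload.get("op"), payload.get("after"))
```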
499 | --- 500 | ![](./images/console_view_run_connect.png) 501 | --- 502 | 503 | ##### Exit out of Kafka Connect from the terminal with: ` + c` 504 | 505 | --- 506 | --- 507 | #### Explore our new Redpanda topic `pg_datagen2panda.datagen.customer` in the console from a browser: 508 | 509 | --- 510 | ##### Console View of topic: 511 | 512 | ![](./images/panda_topic_view_connect_topic.png) 513 | 514 | --- 515 | --- 516 | ##### Click on the topic `pg_datagen2panda.datagen.customer` from the list. 517 | 518 | --- 519 | 520 | ![](./images/connect_output_summary_msg.png) 521 | 522 | --- 523 | 524 | ##### Click on the topic '+' icon under the `Value` column to see the record details of a message. 525 | 526 | --- 527 | 528 | ![](./images/connect_ouput_detail_msg.png) 529 | 530 | #### Kafka Connect Observations: 531 | 532 | --- 533 | --- 534 | As you can see, this message contains the values of the record `before` and `after` it was inserted into our PostgreSQL database. In this next section, we explore loading all of the data currently in our Redpanda topics and delivering it into our Iceberg data lake. 535 | 536 | --- 537 | --- 538 | # Integration with our Apache Iceberg Data Lake Exercises 539 | --- 540 | --- 541 | 542 | #### Load Data to Iceberg with Spark 543 | 544 | 545 | --- 546 | 547 | In this shell script [`stream_customer_ddl_script.sh`](./spark_items/stream_customer_ddl_script.sh) we will launch a `spark-sql` cli and run the DDL code [`stream_customer_ddl.sql`](./spark_items/stream_customer_ddl.sql) to create our `icecatalog.icecatalog.stream_customer` table in iceberg. 548 | 549 | ``` 550 | . /opt/spark/sql/stream_customer_ddl_script.sh 551 | ``` 552 | --- 553 | 554 | In this spark streaming job [`consume_panda_2_iceberg_customer.py`](./datagen/consume_panda_2_iceberg_customer.py) we will consume our messages loaded into topic `dgCustomer` with our data generator and append them into our `icecatalog.icecatalog.stream_customer` table in Iceberg. 555 | 556 | ``` 557 | 558 | spark-submit ~/datagen/consume_panda_2_iceberg_customer.py 559 | ``` 560 | 561 | --- 562 | #### Review tables in our Iceberg datalake 563 | ``` 564 | cd /opt/spark/sql 565 | 566 | . ice_spark-sql_i-cli.sh 567 | 568 | # query 569 | SHOW TABLES IN icecatalog.icecatalog; 570 | 571 | # Query 2: 572 | SELECT * FROM icecatalog.icecatalog.stream_customer; 573 | ``` 574 | --- 575 | 576 | In this shell script [`stream_customer_event_history_ddl_script.sh`](./spark_items/stream_customer_event_history_ddl_script.sh) we will launch a `spark-sql` cli and run the DDL code [`stream_customer_event_history_ddl.sql`](./spark_items/stream_customer_event_history_ddl.sql) to create our `icecatalog.icecatalog.stream_customer_event_history` table in Iceberg. 577 | 578 | ``` 579 | . /opt/spark/sql/stream_customer_event_history_ddl_script.sh 580 | ``` 581 | --- 582 | 583 | In this spark streaming job [`spark_from_dbz_customer_2_iceberg.py`](./datagen/spark_from_dbz_customer_2_iceberg.py) we will consume our messages loaded into topic `pg_datagen2panda.datagen.customer` from the `kafka_connect` processor and append them into the `icecatalog.icecatalog.stream_customer_event_history` table in iceberg. Spark Streaming does not have the ability to merge this data directly into our Iceberg table yet. This feature should become available soon. In the interim, we will have to create a separate batch job to apply them. 
In an upcoming section, we will demonstrate a better solution that will merge this information and simplify the amount of code needed to accomplish this task. This specific job will only append the activity to our table. 584 | 585 | ``` 586 | spark-submit ~/datagen/spark_from_dbz_customer_2_iceberg.py 587 | ``` 588 | --- 589 | 590 | Let's explore our Iceberg tables with the interactive `spark-sql` shell. 591 | 592 | ``` 593 | cd /opt/spark/sql 594 | 595 | . ice_spark-sql_i-cli.sh 596 | ``` 597 | 598 | * to exit the shell, type `exit;` and hit `Enter` 599 | 600 | --- 601 | 602 | Run the query to see some output: 603 | 604 | ``` 605 | SELECT * FROM icecatalog.icecatalog.stream_customer_event_history; 606 | ``` 607 | 608 | --- 609 | 610 | #### Here are some additional Spark & Python exercises that may be of interest: 611 | 612 | [Additional Spark Exercises](./sample_spark_jobs.md) 613 | 614 | --- 615 | --- 616 | ### Automation of Workshop 1 Exercises 617 | --- 618 | --- 619 | 620 | 621 | * Please skip these 2 commands if you completed them by hand in the earlier reference to Workshop 1. They were included again to add additional data to our applications for use with the `Debezium Server` in the next section. 622 | 623 | Let's load all the customer data from Workshop 1 in one simple `spark-sql` shell command. In this shell script [`iceberg_workshop_sql_items.sh`](./spark_items/iceberg_workshop_sql_items.sh) we will launch a `spark-sql` cli and run the DDL code [`all_workshop1_items.sql`](./spark_items/all_workshop1_items.sql) to load our `icecatalog.icecatalog.customer` table in Iceberg. 624 | 625 | ``` 626 | . /opt/spark/sql/iceberg_workshop_sql_items.sh 627 | ``` 628 | 629 | --- 630 | 631 | In this Spark job [`load_ice_transactions_pyspark.py`](./spark_items/load_ice_transactions_pyspark.py) we will load all the transactions from Workshop 1 as a PySpark batch job: 632 | 633 | ``` 634 | spark-submit /opt/spark/sql/load_ice_transactions_pyspark.py 635 | ``` 636 | 637 | --- 638 | --- 639 | --- 640 | --- 641 | 642 | # What is Debezium Server? 643 | --- 644 | Debezium Server is an open-source runtime for change data capture (CDC) that captures change events from databases in real time and streams them to event streaming platforms or directly to other sinks. It is part of the Debezium project, which aims to simplify and automate the process of extracting change events from different sources and making them available to downstream applications. 645 | 646 | Debezium Server provides a number of benefits, including the ability to capture data changes in real time, to process events in a scalable and fault-tolerant manner, and to integrate with a variety of data storage and streaming technologies. It embeds the same Debezium connectors that Kafka Connect uses, but runs them as a standalone service, so it does not require a Kafka cluster or a Kafka Connect deployment. 647 | 648 | Debezium Server supports a wide range of data sources, including popular databases like MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Cassandra, and others. On the output side, it can deliver change events to sinks such as Apache Kafka, Apache Pulsar, and RabbitMQ, and community-built sinks exist for targets such as Apache Iceberg (the sink used in this workshop). 649 | 650 | Debezium Server can be deployed on-premises or in the cloud, and it is available under the Apache 2.0 open-source license, which means that it is free to use, modify, and distribute.
651 | 652 | You can find more information about Debezium Server here: [Debezium Server Website](https://debezium.io/documentation/reference/stable/operations/debezium-server.html) 653 | 654 | --- 655 | 656 | ### Debezium Server Observations: 657 | 658 | Using `Debezium Server` greatly reduces the amount of code needed to capture changes in upstream systems and deliver them automatically to a downstream destination. It requires only a few configuration files. 659 | 660 | It captures every change to our PostgreSQL database, including: 661 | * inserts, updates, and deletes on tables 662 | * adding columns to existing tables 663 | * creation of new tables 664 | 665 | If you recall, in an earlier exercise we used Kafka Connect to push these same change records from the PostgreSQL database to a Redpanda topic and then wrote Spark code to land them in Iceberg. We had to write a significant amount of code for each table to achieve only half of the goal. Debezium Server is a much cleaner approach. It is worth noting that the open-source community is actively working to bring this same end-to-end functionality to `Kafka Connect`, so I expect to see more options soon. 666 | 667 | --- 668 | --- 669 | #### Debezium Server Configuration File: 670 | * Link to the configuration: [Debezium Server Configuration](./dbz_server/application.properties) 671 | --- 672 | --- 673 | 674 | ## Debezium Server Exercises: 675 | --- 676 | --- 677 | 678 | 679 | #### Query the Iceberg catalog for a list of current tables: 680 | 681 | ``` 682 | # start the spark-sql cli in interactive mode: 683 | cd /opt/spark/sql 684 | . ice_spark-sql_i-cli.sh 685 | 686 | # run query: 687 | SHOW TABLES IN icecatalog.icecatalog; 688 | ``` 689 | --- 690 | 691 | #### Expected Sample Output: 692 | 693 | ``` 694 | namespace tableName isTemporary 695 | customer 696 | stream_customer 697 | stream_customer_event_history 698 | transactions 699 | 700 | ``` 701 | 702 | --- 703 | 704 | #### Start the Debezium Server in a new terminal window: 705 | 706 | ``` 707 | cd ~/appdist/debezium-server-iceberg/ 708 | 709 | bash run.sh 710 | ``` 711 | * This will run until terminated and will pull database changes into our Iceberg Data Lake. 712 | 713 | --- 714 | 715 | #### Explore our Iceberg Catalog now (in the previous terminal window): 716 | 717 | ``` 718 | cd /opt/spark/sql 719 | . ice_spark-sql_i-cli.sh 720 | 721 | # query: 722 | SHOW TABLES IN icecatalog.icecatalog; 723 | ``` 724 | --- 725 | 726 | #### Expected Sample Output: 727 | 728 | ``` 729 | namespace tableName isTemporary 730 | cdc_localhost_datagen_customer 731 | customer 732 | stream_customer 733 | stream_customer_event_history 734 | transactions 735 | ``` 736 | 737 | --- 738 | 739 | #### Query our new CDC table `cdc_localhost_datagen_customer` in our Data Lake that was replicated by `Debezium Server`: 740 | 741 | ``` 742 | cd /opt/spark/sql 743 | . 
ice_spark-sql_i-cli.sh 744 | 745 | # query: 746 | SELECT 747 | cust_id, 748 | last_name, 749 | city, 750 | state, 751 | create_date, 752 | __op, 753 | __table, 754 | __source_ts_ms, 755 | __db, 756 | __deleted 757 | FROM icecatalog.icecatalog.cdc_localhost_datagen_customer 758 | ORDER by cust_id; 759 | ``` 760 | --- 761 | #### Expected Sample Output: 762 | 763 | ``` 764 | cust_id last_name city state create_date __op __table __source_ts_ms __db __deleted 765 | 10 Jackson North Kimberly MP 2023-01-20 22:47:05 r customer 2023-02-22 16:04:34.193 datagen false 766 | 11 Downs Conwaychester MD 2022-12-27 23:54:51 r customer 2023-02-22 16:04:34.193 datagen false 767 | 12 Webster Phillipmouth VI 2023-01-17 20:54:46 r customer 2023-02-22 16:04:34.193 datagen false 768 | 13 Miller Jessicahaven OH 2023-01-13 05:03:57 r customer 2023-02-22 16:04:34.193 datagen false 769 | Time taken: 0.384 seconds, Fetched 4 row(s) 770 | 771 | ``` 772 | --- 773 | 774 | #### Add additional rows to our Postgresql table via `datagen`: 775 | 776 | ``` 777 | cd ~/datagen/ 778 | python3 pg_upsert_dg.py 12 5 779 | ``` 780 | 781 | --- 782 | 783 | #### Review New & Updated Records 784 | 785 | * Query our updated Data Lake table and review the `inserts` and `updates` applied from the running Debezium Server service. 786 | 787 | ``` 788 | cd /opt/spark/sql 789 | . ice_spark-sql_i-cli.sh 790 | 791 | # query: 792 | SELECT 793 | cust_id, 794 | last_name, 795 | city, 796 | state, 797 | create_date, 798 | __op, 799 | __table, 800 | __source_ts_ms, 801 | __db, 802 | __deleted 803 | FROM icecatalog.icecatalog.cdc_localhost_datagen_customer 804 | ORDER by cust_id; 805 | ``` 806 | --- 807 | #### Expected Sample Output: 808 | 809 | ``` 810 | cust_id last_name city state create_date __op __table __source_ts_ms __db __deleted 811 | 10 Jackson North Kimberly MP 2023-01-20 22:47:05 r customer 2023-02-22 16:06:19.9 datagen false 812 | 11 Downs Conwaychester MD 2022-12-27 23:54:51 r customer 2023-02-22 16:06:19.9 datagen false 813 | 12 Cook New Catherinemouth NJ 2023-01-03 18:38:35 u customer 2023-02-22 19:03:52.62 datagen false 814 | 13 Ramos West Laurabury NY 2023-01-04 04:48:18 u customer 2023-02-22 19:03:52.62 datagen false 815 | 14 Scott West Thomastown AL 2022-12-29 07:21:28 c customer 2023-02-22 19:03:52.62 datagen false 816 | 15 Holden East Danieltown MT 2023-01-15 17:17:54 c customer 2023-02-22 19:03:52.62 datagen false 817 | 16 Carpenter Lake Jamesberg GU 2023-01-05 22:16:55 c customer 2023-02-22 19:03:52.62 datagen false 818 | Time taken: 0.318 seconds, Fetched 7 row(s) 819 | ``` 820 | 821 | --- 822 | --- 823 | --- 824 | 825 | ### Final Summary: 826 | 827 | Integrating a database using Kafka Connect (via Debezium plugins) to stream data to a system like Red Panda and our Iceberg Data Lake can have several benefits: 828 | 829 | 1. **Real-time data streaming:** The integration provides a real-time stream of data from the SQL database to Red Panda and our Iceberg Data Lake, making it easier to analyze and process data in real-time. 830 | 831 | 2. **Scalability:** Kafka Connect or the Debezium Server can handle high volume and velocity of data, allowing for scalability as the data grows. 832 | 833 | 3. **Ease of Use:** Kafka Connect & Debezium Server simplifies the process of integrating the SQL database and delivering it to other destinations, making it easier for developers to set up and maintain. 834 | 835 | 4. 
**Improved data consistency:** The integration helps ensure data consistency by providing a single source of truth for data being streamed to Red Panda or any other downstream consumer like our Iceberg Data Lake. 836 | 837 | However, the integration may also have challenges such as data compatibility, security, and performance. It is important to thoroughly assess the requirements and constraints before implementing the integration. 838 | 839 | --- 840 | 841 | If you have made it this far, I want to thank you for spending your time reviewing the materials. Please give me a 'Star' at the top of this page if you found it useful. 842 | 843 | --- 844 | --- 845 | 846 | #### Extra Credit 847 | 848 | * Interested in exploring the underlying PostgreSQL Databases for `datagen` or the database that hosts the `Iceberg Catalog`? 849 | [Additional Exercises to Explore Databases in Postgresql](./explore_postgresql.md) 850 | 851 | --- 852 | --- 853 | 854 | ![](./images/drunk-cheers.gif) 855 | 856 | [Tim Lepple](www.linkedin.com/in/tim-lepple-9141452) 857 | 858 | --- 859 | --- 860 | 861 | -------------------------------------------------------------------------------- /datagen/comsume_topic_dgCustomer.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sys 3 | import argparse 4 | from kafka import KafkaConsumer 5 | 6 | startKey = int(1) 7 | #iterateVal = int(5) 8 | 9 | parser = argparse.ArgumentParser() 10 | 11 | # define our required arguments to pass in: 12 | parser.add_argument("recordCount", help="Enter int value for desired number of records", type=int) 13 | 14 | # parse these args 15 | args = parser.parse_args() 16 | 17 | # assign args to vars: 18 | stopVal = int(args.recordCount) 19 | 20 | try: 21 | # define our Kafka Consumer 22 | consumer = KafkaConsumer( 23 | 'dgCustomer', 24 | bootstrap_servers=':9092', 25 | auto_offset_reset='earliest', 26 | value_deserializer=lambda m: json.loads(m.decode('utf-8')) 27 | ) 28 | for message in consumer: 29 | print(message.value) 30 | startKey += 1 31 | 32 | if startKey == stopVal: 33 | print("\n") 34 | print(str(stopVal) + " msgs have been consumed.") 35 | print("\n") 36 | consumer.close() 37 | sys.exit() 38 | 39 | 40 | except KeyboardInterrupt: 41 | sys.exit() 42 | finally: 43 | print("script complete!") 44 | 45 | -------------------------------------------------------------------------------- /datagen/consume_panda_2_iceberg_customer.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import * 4 | from pyspark.sql.functions import udf 5 | from pyspark.streaming import StreamingContext 6 | #from pyspark.streaming.kafka import KafkaUtils 7 | import json 8 | import uuid 9 | 10 | 11 | ####################################################################################### 12 | # define schema for a DF with data json data from Kafka msgs 13 | ####################################################################################### 14 | customer_schema = StructType() \ 15 | .add("first_name", StringType()) \ 16 | .add("last_name", StringType()) \ 17 | .add("street_address", StringType()) \ 18 | .add("city", StringType()) \ 19 | .add("state", StringType()) \ 20 | .add("zip_code", StringType()) \ 21 | .add("home_phone", StringType()) \ 22 | .add("mobile", StringType()) \ 23 | .add("email", StringType()) \ 24 | .add("ssn", StringType()) \ 25 | .add("job_title", 
StringType()) \ 26 | .add("create_date", StringType()) \ 27 | .add("cust_id", IntegerType()) 28 | 29 | 30 | spark = SparkSession \ 31 | .builder \ 32 | .appName("cust_panda_2_ice") \ 33 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 34 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 35 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 36 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 37 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 38 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 39 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 40 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 41 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 42 | .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 43 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 44 | .config("spark.eventLog.enabled", "true") \ 45 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 46 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 47 | .config("spark.sql.catalogImplementation", "in-memory") \ 48 | .config("groupId", "org.apache.spark") \ 49 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 50 | .config("version", "3.3.1") \ 51 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 52 | .config("spark.sql.adaptive", "true") \ 53 | .getOrCreate() 54 | 55 | ####################################################################################### 56 | # Create DataFrame representing the stream of msgs from kafka (unbounded table) 57 | ####################################################################################### 58 | 59 | parsed = spark \ 60 | .readStream \ 61 | .format("kafka") \ 62 | .option("kafka.bootstrap.servers", ":9092") \ 63 | .option("subscribe", "dgCustomer") \ 64 | .option("startingOffsets", "earliest") \ 65 | .option("kafka.session.timeout.ms", "10000") \ 66 | .load() \ 67 | .select( \ 68 | from_json(col("value").cast("string"), customer_schema).alias("parsed_value")) 69 | 70 | ########################################################################################## 71 | # project the kafka 'value' column into a new data frame: 72 | ########################################################################################## 73 | 74 | projected = parsed \ 75 | .select("parsed_value.*") 76 | 77 | 78 | ########################################################################################## 79 | # write to console 80 | ########################################################################################## 81 | 82 | query = projected.writeStream \ 83 | .outputMode("append") \ 84 | .format("iceberg") \ 85 | .trigger(processingTime='30 seconds') \ 86 | .option("path", "icecatalog.icecatalog.stream_customer") \ 87 | .option("checkpointLocation", "/opt/spark/checkpoint") \ 88 | .start() \ 89 | .awaitTermination() 90 | 91 | spark.stop() 92 | 93 | # .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS)) \ 94 | -------------------------------------------------------------------------------- /datagen/consume_stream_customer_2_console.py: 
-------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import * 4 | from pyspark.sql.functions import udf 5 | from pyspark.streaming import StreamingContext 6 | #from pyspark.streaming.kafka import KafkaUtils 7 | import json 8 | import uuid 9 | 10 | 11 | ####################################################################################### 12 | # define a uuid function for the kafka key 13 | ####################################################################################### 14 | uuidUdf = udf(lambda : str(uuid.uuid4()),StringType()) 15 | nowUdf = udf(lambda : now(),TimestampType()) 16 | 17 | ####################################################################################### 18 | # define schema for a DF with data json data from Kafka msgs 19 | ####################################################################################### 20 | customer_schema = StructType() \ 21 | .add("first_name", StringType()) \ 22 | .add("last_name", StringType()) \ 23 | .add("street_address", StringType()) \ 24 | .add("city", StringType()) \ 25 | .add("state", StringType()) \ 26 | .add("zip_code", StringType()) \ 27 | .add("home_phone", StringType()) \ 28 | .add("mobile", StringType()) \ 29 | .add("email", StringType()) \ 30 | .add("ssn", StringType()) \ 31 | .add("job_title", StringType()) \ 32 | .add("create_date", StringType()) \ 33 | .add("cust_id", IntegerType()) 34 | 35 | 36 | spark = SparkSession \ 37 | .builder \ 38 | .appName("redpanda") \ 39 | .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 40 | .config("groupId", "org.apache.spark") \ 41 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 42 | .config("version", "3.3.1") \ 43 | .config("spark.eventLog.enabled", "true") \ 44 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 45 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 46 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 47 | .config("spark.sql.adaptive", "true") \ 48 | .getOrCreate() 49 | 50 | ####################################################################################### 51 | # Create DataFrame representing the stream of msgs from kafka (unbounded table) 52 | ####################################################################################### 53 | 54 | parsed = spark \ 55 | .readStream \ 56 | .format("kafka") \ 57 | .option("kafka.bootstrap.servers", ":9092") \ 58 | .option("subscribe", "dgCustomer") \ 59 | .option("startingOffsets", "earliest") \ 60 | .option("kafka.session.timeout.ms", "10000") \ 61 | .load() \ 62 | .select( \ 63 | from_json(col("value").cast("string"), customer_schema).alias("parsed_value")) 64 | 65 | ########################################################################################## 66 | # project the kafka 'value' column into a new data frame: 67 | ########################################################################################## 68 | 69 | projected = parsed \ 70 | .select("parsed_value.*") 71 | 72 | 73 | ########################################################################################## 74 | # write to console 75 | ########################################################################################## 76 | 77 | query = projected \ 78 | .writeStream.outputMode("append") \ 79 | .format("console") \ 80 | .trigger(processingTime='6 seconds') \ 81 | .start() \ 82 | .awaitTermination() 83 | 
-------------------------------------------------------------------------------- /datagen/consume_stream_txn_2_console.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import * 4 | from pyspark.sql.functions import udf 5 | from pyspark.streaming import StreamingContext 6 | import json 7 | import uuid 8 | 9 | ####################################################################################### 10 | # define schema for a DF with data json data from Kafka msgs 11 | ####################################################################################### 12 | txn_schema = StructType() \ 13 | .add("amount", DoubleType()) \ 14 | .add("barcode", StringType()) \ 15 | .add("category", StringType()) \ 16 | .add("cust_id", StringType()) \ 17 | .add("item_desc", StringType()) \ 18 | .add("transact_id", StringType()) \ 19 | .add("transaction_date", StringType()) 20 | 21 | 22 | 23 | spark = SparkSession \ 24 | .builder \ 25 | .appName("redpanda") \ 26 | .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 27 | .config("groupId", "org.apache.spark") \ 28 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 29 | .config("version", "3.3.1") \ 30 | .config("spark.eventLog.enabled", "true") \ 31 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 32 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 33 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 34 | .config("spark.sql.adaptive", "true") \ 35 | .getOrCreate() 36 | 37 | ####################################################################################### 38 | # Create DataFrame representing the stream of msgs from kafka (unbounded table) 39 | ####################################################################################### 40 | 41 | parsed = spark \ 42 | .readStream \ 43 | .format("kafka") \ 44 | .option("kafka.bootstrap.servers", ":9092") \ 45 | .option("subscribe", "dgTxn") \ 46 | .option("startingOffsets", "earliest") \ 47 | .option("kafka.session.timeout.ms", "10000") \ 48 | .load() \ 49 | .select( \ 50 | from_json(col("value").cast("string"), txn_schema).alias("parsed_value")) 51 | 52 | ########################################################################################## 53 | # project the kafka 'value' column into a new data frame: 54 | ########################################################################################## 55 | 56 | projected = parsed \ 57 | .select("parsed_value.*") 58 | 59 | 60 | ########################################################################################## 61 | # write to console 62 | ########################################################################################## 63 | 64 | query = projected \ 65 | .writeStream.outputMode("append") \ 66 | .format("console") \ 67 | .trigger(processingTime='6 seconds') \ 68 | .start() \ 69 | .awaitTermination() 70 | 71 | spark.stop() 72 | -------------------------------------------------------------------------------- /datagen/datagenerator.py: -------------------------------------------------------------------------------- 1 | import time 2 | import collections 3 | import datetime 4 | from decimal import Decimal 5 | from random import randrange, randint, sample 6 | import sys 7 | class DataGenerator(): 8 | # DataGenerator 9 | def __init__(self): 10 | # comments 11 | self.z = 0 12 | def fake_person_generator(self, startkey, 
iterateval, f): 13 | self.startkey = startkey 14 | self.iterateval = iterateval 15 | self.f = f 16 | endkey = startkey + iterateval 17 | for x in range(startkey, endkey): 18 | yield {'last_name': f.last_name(), 19 | 'first_name': f.first_name(), 20 | 'street_address': f.street_address(), 21 | 'city': f.city(), 22 | 'state': f.state_abbr(), 23 | 'zip_code': f.postcode(), 24 | 'email': f.email(), 25 | 'home_phone': f.phone_number(), 26 | 'mobile': f.phone_number(), 27 | 'ssn': f.ssn(), 28 | 'job_title': f.job(), 29 | 'create_date': (f.date_time_between(start_date="-60d", end_date="-30d", tzinfo=None)).strftime('%Y-%m-%d %H:%M:%S'), 30 | 'cust_id': x} 31 | def fake_txn_generator(self, txnsKey, txniKey, fake): 32 | self.txnsKey = txnsKey 33 | self.txniKey = txniKey 34 | self.fake = fake 35 | 36 | txnendKey = txnsKey + txniKey 37 | for x in range(txnsKey, txnendKey): 38 | for i in range(1,randrange(1,7,1)): 39 | yield {'transact_id': fake.uuid4(), 40 | 'category': fake.safe_color_name(), 41 | 'barcode': fake.ean13(), 42 | 'item_desc': fake.sentence(nb_words=5, variable_nb_words=True, ext_word_list=None), 43 | 'amount': fake.pyfloat(left_digits=2, right_digits=2, positive=True), 44 | 'transaction_date': (fake.date_time_between(start_date="-29d", end_date="now", tzinfo=None)).strftime('%Y-%m-%d %H:%M:%S'), 45 | 'cust_id': x} 46 | -------------------------------------------------------------------------------- /datagen/parameter_get_schema.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | import json 4 | import re 5 | from pyspark.sql import SparkSession 6 | from pyspark.sql.functions import * 7 | from pyspark.sql.types import * 8 | 9 | spark = SparkSession \ 10 | .builder \ 11 | .appName("pg_cust_from_connect_schema") \ 12 | .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 13 | .config("groupId", "org.apache.spark") \ 14 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 15 | .config("version", "3.3.1") \ 16 | .config("spark.eventLog.enabled", "true") \ 17 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 18 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 19 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 20 | .config("spark.sql.adaptive", "true") \ 21 | .getOrCreate() 22 | 23 | 24 | ########################################################################################## 25 | # Defines some variables 26 | ########################################################################################## 27 | parser = argparse.ArgumentParser() 28 | 29 | # define our required arguments to pass in: 30 | parser.add_argument("--p_broker", help="Enter the ip address and port of kakfa broker", required=True, type=str) 31 | parser.add_argument("--p_topic", help="Enter the topic name for needed schema", required=True, type=str) 32 | 33 | # parse these args 34 | args = parser.parse_args() 35 | 36 | 37 | kafka_broker = str(args.p_broker) 38 | kafka_topic = str(args.p_topic) 39 | 40 | print(kafka_broker) 41 | print(kafka_topic) 42 | ########################################################################################## 43 | # function to get the json value from a kafka topic 44 | ########################################################################################## 45 | def read_kafka_topic(topic): 46 | 47 | df_json = (spark.read 48 | .format("kafka") 49 | .option("kafka.bootstrap.servers", kafka_broker) 50 | .option("subscribe", 
topic) 51 | .option("startingOffsets", "earliest") 52 | .option("endingOffsets", "latest") 53 | .option("failOnDataLoss", "false") 54 | .load() 55 | # filter out empty values 56 | .withColumn("value", expr("string(value)")) 57 | .filter(col("value").isNotNull()) 58 | # get latest version of each record 59 | .select("key", expr("struct(offset, value) r")) 60 | .groupBy("key").agg(expr("max(r) r")) 61 | .select("r.value")) 62 | 63 | # decode the json values 64 | df_read = spark.read.json( 65 | df_json.rdd.map(lambda x: x.value), multiLine=True) 66 | 67 | # drop corrupt records 68 | if "_corrupt_record" in df_read.columns: 69 | df_read = (df_read 70 | .filter(col("_corrupt_record").isNotNull()) 71 | .drop("_corrupt_record")) 72 | 73 | return df_read 74 | 75 | ########################################################################################## 76 | # function to cleanup schema for humans to read: 77 | ########################################################################################## 78 | 79 | def prettify_spark_schema_json(json: str): 80 | 81 | import re, json 82 | 83 | parsed = json.loads(json_schema) 84 | raw = json.dumps(parsed, indent=1, sort_keys=False) 85 | 86 | str1 = raw 87 | 88 | # replace empty meta data 89 | str1 = re.sub('"metadata": {},\n +', '', str1) 90 | 91 | # replace enters between properties 92 | str1 = re.sub('",\n +"', '", "', str1) 93 | str1 = re.sub('e,\n +"', 'e, "', str1) 94 | 95 | # replace endings and beginnings of simple objects 96 | str1 = re.sub('"\n +},', '" },', str1) 97 | str1 = re.sub('{\n +"', '{ "', str1) 98 | 99 | # replace end of complex objects 100 | str1 = re.sub('"\n +}', '" }', str1) 101 | str1 = re.sub('e\n +}', 'e }', str1) 102 | 103 | # introduce the meta data on a different place 104 | str1 = re.sub('(, "type": "[^"]+")', '\\1, "metadata": {}', str1) 105 | str1 = re.sub('(, "type": {)', ', "metadata": {}\\1', str1) 106 | 107 | # make sure nested ending is not on a single line 108 | str1 = re.sub('}\n\s+},', '} },', str1) 109 | 110 | return str1 111 | 112 | ########################################################################################## 113 | # call the function to get the schema 114 | ########################################################################################## 115 | 116 | df = read_kafka_topic(kafka_topic) 117 | json_schema = df.schema.json() 118 | 119 | ########################################################################################## 120 | # read the JSON into a schema 121 | ########################################################################################## 122 | 123 | obj = json.loads(json_schema) 124 | topic_schema = StructType.fromJson(obj) 125 | 126 | ########################################################################################## 127 | # print raw schema suitable for performant code 128 | ########################################################################################## 129 | 130 | print('\n') 131 | print('------------------------------------------SparkStreaming Schema------------------------------------------\n') 132 | print(topic_schema) 133 | print('\n') 134 | 135 | ########################################################################################## 136 | # make the schema readable and print to screen. 
137 | ########################################################################################## 138 | 139 | #pretty_json_schema = prettify_spark_schema_json(json_schema) 140 | 141 | ########################################################################################## 142 | # read the JSON into a schema 143 | ########################################################################################## 144 | 145 | #prettyObj = json.loads(pretty_json_schema) 146 | #pretty_topic_schema = StructType.fromJson(prettyObj) 147 | 148 | 149 | 150 | #print('\n') 151 | #print('------------------------------------------Pretty Schema------------------------------------------\n') 152 | #print(pretty_topic_schema) 153 | #print('\n') 154 | 155 | -------------------------------------------------------------------------------- /datagen/pg_upsert_dg.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from faker import Faker 3 | from datagenerator import DataGenerator 4 | import simplejson 5 | import sys 6 | import argparse 7 | import psycopg2 8 | ######################################################################################### 9 | # Define variables 10 | ######################################################################################### 11 | dg = DataGenerator() 12 | fake = Faker() # <--- Don't Forgot this 13 | parser = argparse.ArgumentParser() 14 | 15 | # define our required arguments to pass in: 16 | parser.add_argument("startingCustomerID", help="Enter int value to assign to the first customerID field", type=int) 17 | parser.add_argument("recordCount", help="Enter int value for desired number of records", type=int) 18 | 19 | # parse these args 20 | args = parser.parse_args() 21 | 22 | # assign args to vars: 23 | startKey = int(args.startingCustomerID) 24 | stopVal = int(args.recordCount) 25 | 26 | 27 | # functions to display errors 28 | def printf (format,*args): 29 | sys.stdout.write (format % args) 30 | def printException (exception): 31 | error, = exception.args 32 | printf("Error code = %s\n",error.code); 33 | printf("Error message = %s\n",error.message); 34 | def myconverter(obj): 35 | if isinstance(obj, (datetime.datetime)): 36 | return obj.__str__() 37 | ######################################################################################### 38 | # Code execution below 39 | ######################################################################################### 40 | try: 41 | try: 42 | conn = psycopg2.connect(host="127.0.0.1",database="datagen", user="datagen", password="supersecret1") 43 | print("Connection Established") 44 | except psycopg2.Error as exception: 45 | printf ('Failed to connect to database') 46 | printException (exception) 47 | exit (1) 48 | cursor = conn.cursor() 49 | try: 50 | fpg = dg.fake_person_generator(startKey, stopVal, fake) 51 | for person in fpg: 52 | json_out = simplejson.dumps(person, ensure_ascii=False, default = myconverter) 53 | print(json_out) 54 | insert_stmt = "SELECT datagen.insert_from_json('" + json_out +"');" 55 | cursor.execute(insert_stmt) 56 | print("Records inserted successfully") 57 | except psycopg2.Error as exception: 58 | printf ('Failed to insert\n') 59 | printException (exception) 60 | exit (1) 61 | finally: 62 | if(conn): 63 | conn.commit() 64 | cursor.close() 65 | conn.close() 66 | print("PostgreSQL connection is closed") 67 | except (Exception, psycopg2.Error) as error: 68 | print("Something else went wrong...\n", error) 69 | finally: 70 | 
print("script complete!") 71 | -------------------------------------------------------------------------------- /datagen/redpanda_dg.py: -------------------------------------------------------------------------------- 1 | import time 2 | from faker import Faker 3 | from datagenerator import DataGenerator 4 | import simplejson as json 5 | 6 | import argparse 7 | 8 | from kafka import KafkaProducer 9 | 10 | ######################################################################################### 11 | # Define variables 12 | ######################################################################################### 13 | dg = DataGenerator() 14 | fake = Faker() # <--- Don't Forgot this 15 | parser = argparse.ArgumentParser() 16 | 17 | # define our required arguments to pass in: 18 | parser.add_argument("startingCustomerID", help="Enter int value to assign to the first customerID field", type=int) 19 | parser.add_argument("recordCount", help="Enter int value for desired number of records per group", type=int) 20 | parser.add_argument("loopCount", help="Enter int value for iteration count", type=int) 21 | 22 | # parse these args 23 | args = parser.parse_args() 24 | 25 | # assign args to vars: 26 | startKey = int(args.startingCustomerID) 27 | iterateVal = int(args.recordCount) 28 | stopVal = int(args.loopCount) 29 | 30 | # Define some functions: 31 | def myconverter(obj): 32 | if isinstance(obj, (datetime.datetime)): 33 | return obj.__str__() 34 | 35 | def encode_complex(obj): 36 | if isinstance(obj, complex): 37 | return [ojb.real, obj.imag] 38 | raise TypeError(repr(obj) + " is not JSON serializable") 39 | 40 | # Messages will be serialized as JSON 41 | def my_serializer(message): 42 | return json.dumps(message).encode('utf-8') 43 | 44 | # define variable for our producer 45 | producer = KafkaProducer(bootstrap_servers=":9092",value_serializer=my_serializer) 46 | 47 | ######################################################################################### 48 | # Code execution below 49 | ######################################################################################### 50 | try: 51 | for i in range(stopVal): 52 | # person start here: 53 | try: 54 | fpg = dg.fake_person_generator(startKey, iterateVal, fake) 55 | for person in fpg: 56 | #print(json.dumps(person, ensure_ascii=False, default = myconverter)) 57 | #print("\n") 58 | data = json.dumps(person, default = encode_complex) 59 | print(data) 60 | #print ("dataVarType", type(data)) 61 | # convert json string to dict obj 62 | dictData = json.loads(data) 63 | producer.send('dgCustomer', dictData) 64 | #print("\n") 65 | producer.flush() 66 | print("Customer Done.") 67 | print('\n') 68 | except: 69 | print("failing in person generator") 70 | producer.flush() 71 | 72 | # txn start here: 73 | try: 74 | txn = dg.fake_txn_generator(startKey, iterateVal, fake) 75 | for tranx in txn: 76 | #print(json.dumps(tranx, ensure_ascii=False, default = myconverter)) 77 | txnData = json.dumps(tranx, default = encode_complex) 78 | print(txnData) 79 | producer.send('dgTxn', tranx) 80 | producer.flush() 81 | print("Transaction Done.") 82 | print('\n') 83 | 84 | #txn ends here: 85 | except: 86 | print("failing in txn generator") 87 | producer.flush() 88 | # increment counter and sleep 89 | startKey += iterateVal 90 | time.sleep(5) 91 | 92 | except: 93 | print("failing in loop.") 94 | finally: 95 | print("script complete") 96 | 97 | 98 | -------------------------------------------------------------------------------- /datagen/spark_from_dbz_customer_2_iceberg.py: 
-------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.types import * 3 | from pyspark.sql.functions import col, udf 4 | from pyspark.sql.functions import * 5 | from pyspark.sql import * 6 | from pyspark.streaming import StreamingContext 7 | import json 8 | import uuid 9 | import re 10 | 11 | 12 | spark = SparkSession \ 13 | .builder \ 14 | .appName("cust_panda_2_ice") \ 15 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \ 16 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 17 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 18 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 19 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 20 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 21 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 22 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 23 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 24 | .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 25 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 26 | .config("spark.sql.adaptive.enabled", "true") \ 27 | .config("spark.eventLog.enabled", "true") \ 28 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 29 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 30 | .config("spark.sql.catalogImplementation", "in-memory") \ 31 | .config("groupId", "org.apache.spark") \ 32 | .config("artifactId", "spark-sql-kafka-0-10_2.12") \ 33 | .config("version", "3.3.1") \ 34 | .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true") \ 35 | .config("spark.sql.adaptive", "true") \ 36 | .getOrCreate() 37 | 38 | 39 | ########################################################################################## 40 | # debezium schema 41 | ########################################################################################## 42 | dbz_schema_value = StructType([StructField('payload', StructType([StructField('after', StructType([StructField('city', StringType(), True), StructField('create_date', StringType(), True), StructField('cust_id', LongType(), True), StructField('email', StringType(), True), StructField('first_name', StringType(), True), StructField('home_phone', StringType(), True), StructField('job_title', StringType(), True), StructField('last_name', StringType(), True), StructField('mobile', StringType(), True), StructField('ssn', StringType(), True), StructField('state', StringType(), True), StructField('street_address', StringType(), True), StructField('zip_code', StringType(), True)]), True), StructField('before', StructType([StructField('city', StringType(), True), StructField('create_date', StringType(), True), StructField('cust_id', LongType(), True), StructField('email', StringType(), True), StructField('first_name', StringType(), True), StructField('home_phone', StringType(), True), StructField('job_title', StringType(), True), StructField('last_name', StringType(), True), StructField('mobile', StringType(), True), StructField('ssn', StringType(), True), 
StructField('state', StringType(), True), StructField('street_address', StringType(), True), StructField('zip_code', StringType(), True)]), True), StructField('op', StringType(), True), StructField('source', StructType([StructField('connector', StringType(), True), StructField('db', StringType(), True), StructField('lsn', LongType(), True), StructField('name', StringType(), True), StructField('schema', StringType(), True), StructField('sequence', StringType(), True), StructField('snapshot', StringType(), True), StructField('table', StringType(), True), StructField('ts_ms', LongType(), True), StructField('txId', LongType(), True), StructField('version', StringType(), True), StructField('xmin', StringType(), True)]), True), StructField('transaction', StringType(), True), StructField('ts_ms', LongType(), True)]), True), StructField('schema', StructType([StructField('fields', ArrayType(StructType([StructField('field', StringType(), True), StructField('fields', ArrayType(StructType([StructField('default', StringType(), True), StructField('field', StringType(), True), StructField('name', StringType(), True), StructField('optional', BooleanType(), True), StructField('parameters', StructType([StructField('allowed', StringType(), True)]), True), StructField('type', StringType(), True), StructField('version', LongType(), True)]), True), True), StructField('name', StringType(), True), StructField('optional', BooleanType(), True), StructField('type', StringType(), True), StructField('version', LongType(), True)]), True), True), StructField('name', StringType(), True), StructField('optional', BooleanType(), True), StructField('type', StringType(), True), StructField('version', LongType(), True)]), True)]) 43 | 44 | 45 | 46 | 47 | 48 | ########################################################################################## 49 | # read from topic 50 | ########################################################################################## 51 | connectCustTopicDF = spark \ 52 | .readStream \ 53 | .format("kafka") \ 54 | .option("kafka.bootstrap.servers", ":9092") \ 55 | .option("subscribe", "pg_datagen2panda.datagen.customer") \ 56 | .option("startingOffsets", "earliest") \ 57 | .option("kafka.session.timeout.ms", "10000") \ 58 | .load() \ 59 | .select( \ 60 | from_json(col("value").cast("string"), dbz_schema_value).alias("parsed_value")) 61 | 62 | 63 | ########################################################################################## 64 | # get just the payload from 'connectCustTopicDF' 65 | ########################################################################################## 66 | 67 | payloadDF = connectCustTopicDF \ 68 | .select("parsed_value.payload.*") 69 | 70 | ########################################################################################## 71 | # This worked in the define of our function for each batch for the future ... the function stuff goes here: 72 | ########################################################################################## 73 | 74 | # create a unique Id for the row... ideally this should already have been in the payload (and is available if I took the time to load the msg correctly ;) ) 75 | #uuidUDF = udf(lambda : str(uuid.uuid4()),StringType()) 76 | 77 | 78 | df = payloadDF 79 | # .withColumn("row_key", uuidUDF()) 80 | 81 | 82 | def foreach_batch_function(microdf, batchId): 83 | print(f"inside forEachBatch for batchid:{batchId}. 
Rows in passed dataframe:{microdf.count()}") 84 | microdf.show() 85 | # microdf.printSchema() 86 | microdf.filter((microdf.op == "r") | (microdf.op == "c") | (microdf.op == "u")) \ 87 | .select(microdf.op.alias("type"), \ 88 | microdf.ts_ms.alias("event_ts"), \ 89 | microdf.source.txId.alias("tx_id"), \ 90 | microdf.after.first_name.alias("first_name"), \ 91 | microdf.after.last_name.alias("last_name"), \ 92 | microdf.after.street_address.alias("street_address"), \ 93 | microdf.after.city.alias("city"), \ 94 | microdf.after.state.alias("state"), \ 95 | microdf.after.zip_code.alias("zip_code"), \ 96 | microdf.after.home_phone.alias("home_phone"), \ 97 | microdf.after.mobile.alias("mobile"), \ 98 | microdf.after.email.alias("email"), \ 99 | microdf.after.ssn.alias("ssn"), \ 100 | microdf.after.job_title.alias("job_title"), \ 101 | microdf.after.create_date.alias("create_date"), \ 102 | microdf.after.cust_id.alias("cust_id")).createOrReplaceGlobalTempView("tmp_merge") 103 | mergeCount=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_merge").count() 104 | print(f"mergeCount= {mergeCount} for batch: {batchId}") 105 | showMergeDF=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_merge").show() 106 | mergeDF=(microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_merge")) 107 | mergeDF.writeTo("icecatalog.icecatalog.stream_customer_event_history").append() 108 | microdf.filter(microdf.op =="d") \ 109 | .select(microdf.op.alias("type"), \ 110 | microdf.ts_ms.alias("event_ts"), \ 111 | microdf.source.txId.alias("tx_id"), \ 112 | microdf.before.first_name.alias("first_name"), \ 113 | microdf.before.last_name.alias("last_name"), \ 114 | microdf.before.street_address.alias("street_address"), \ 115 | microdf.before.city.alias("city"), \ 116 | microdf.before.state.alias("state"), \ 117 | microdf.before.zip_code.alias("zip_code"), \ 118 | microdf.before.home_phone.alias("home_phone"), \ 119 | microdf.before.mobile.alias("mobile"), \ 120 | microdf.before.email.alias("email"), \ 121 | microdf.before.ssn.alias("ssn"), \ 122 | microdf.before.job_title.alias("job_title"), \ 123 | microdf.before.create_date.alias("create_date"), \ 124 | microdf.before.cust_id.alias("cust_id")).createOrReplaceGlobalTempView("tmp_delete") 125 | deleteCount=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_delete").count() 126 | print(f"deleteCount= {deleteCount} for batch: {batchId}") 127 | showDeleteDF=microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_delete").show() 128 | deleteDF=(microdf.sql_ctx.sparkSession.sql("SELECT * FROM global_temp.tmp_delete")) 129 | # deleteDF.printSchema() 130 | deleteDF.writeTo("icecatalog.icecatalog.stream_customer_event_history").append() 131 | 132 | ########################################################################################## 133 | # send stream into foreachBatch 134 | ########################################################################################## 135 | 136 | streamQuery = (df.writeStream \ 137 | .option("checkpointLocation", "/opt/spark/checkpoint2") \ 138 | .foreachBatch(foreach_batch_function) \ 139 | .trigger(processingTime='60 seconds') \ 140 | .start() \ 141 | .awaitTermination()) 142 | 143 | spark.stop() 144 | -------------------------------------------------------------------------------- /datagen/test_pg.py: -------------------------------------------------------------------------------- 1 | import psycopg2 2 | 3 | try: 4 | try: 5 | # Connect to your PostgreSQL database on a remote server 6 | conn 
= psycopg2.connect(host="127.0.0.1", port="5432", dbname="datagen", user="datagen", password="supersecret1") 7 | print("Connection Established!") 8 | print("\n") 9 | except psycopg2.Error as exception: 10 | printf ('Failed to connect to database') 11 | printException (exception) 12 | exit (1) 13 | 14 | # Open a cursor to perform database operations 15 | cur = conn.cursor() 16 | try: 17 | # Execute a test query 18 | cur.execute("SELECT * FROM customer") 19 | 20 | # Retrieve query results 21 | records = cur.fetchall() 22 | 23 | #print records 24 | print(records) 25 | 26 | except psycopg2.Error as exception: 27 | printf ('Failed to insert\n') 28 | printException (exception) 29 | exit (1) 30 | finally: 31 | if(conn): 32 | cur.close() 33 | conn.close() 34 | print("\n") 35 | print("PostgreSQL connection is closed") 36 | 37 | except (Exception, psycopg2.Error) as error: 38 | print("Something else went wrong...\n", error) 39 | 40 | finally: 41 | print("\n") 42 | print("script complete!") 43 | 44 | -------------------------------------------------------------------------------- /db_ddl/create_ddl_icecatalog.sql: -------------------------------------------------------------------------------- 1 | CREATE ROLE icecatalog LOGIN PASSWORD 'supersecret1'; 2 | CREATE DATABASE icecatalog OWNER icecatalog ENCODING 'UTF-8'; 3 | ALTER USER icecatalog WITH SUPERUSER; 4 | ALTER USER icecatalog WITH CREATEDB; 5 | CREATE SCHEMA icecatalog; 6 | -------------------------------------------------------------------------------- /db_ddl/create_user_datagen.sql: -------------------------------------------------------------------------------- 1 | CREATE ROLE datagen LOGIN PASSWORD 'supersecret1'; 2 | CREATE DATABASE datagen OWNER datagen ENCODING 'UTF-8'; 3 | ALTER ROLE "datagen" WITH LOGIN; 4 | ALTER ROLE "datagen" WITH REPLICATION; 5 | -------------------------------------------------------------------------------- /db_ddl/customer_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE schema datagen; 2 | CREATE TABLE datagen.customer 3 | ( 4 | first_name character varying(50) COLLATE pg_catalog."default", 5 | last_name character varying(50) COLLATE pg_catalog."default", 6 | street_address character varying(100) COLLATE pg_catalog."default", 7 | city character varying(50) COLLATE pg_catalog."default", 8 | state character varying(50) COLLATE pg_catalog."default", 9 | zip_code character varying(50) COLLATE pg_catalog."default", 10 | home_phone character varying(50) COLLATE pg_catalog."default", 11 | mobile character varying(50) COLLATE pg_catalog."default", 12 | email character varying(50) COLLATE pg_catalog."default", 13 | ssn character varying(25) COLLATE pg_catalog."default", 14 | job_title character varying(50) COLLATE pg_catalog."default", 15 | create_date character varying(50) COLLATE pg_catalog."default", 16 | cust_id integer NOT NULL, 17 | CONSTRAINT customer_pkey PRIMARY KEY (cust_id) 18 | ); 19 | CREATE PUBLICATION dbz_publication FOR TABLE datagen.customer; 20 | -------------------------------------------------------------------------------- /db_ddl/customer_function_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE or REPLACE FUNCTION datagen.insert_from_json(json) 2 | RETURNS void 3 | LANGUAGE 'plpgsql' 4 | COST 100 5 | VOLATILE 6 | AS $BODY$ 7 | 8 | BEGIN 9 | INSERT INTO datagen.customer(first_name, last_name, street_address, city, state, zip_code, home_phone, mobile, email, ssn, job_title, create_date, cust_id) 
10 | SELECT 11 | x.first_name 12 | ,x.last_name 13 | ,x.street_address 14 | ,x.city 15 | ,x.state 16 | ,x.zip_code 17 | ,x.home_phone 18 | ,x.mobile 19 | ,x.email 20 | ,x.ssn 21 | ,x.job_title 22 | ,x.create_date 23 | ,x.cust_id 24 | FROM json_to_record($1) AS x 25 | ( 26 | first_name text, 27 | last_name text, 28 | street_address text, 29 | city text, 30 | state text, 31 | zip_code text, 32 | home_phone text, 33 | mobile text, 34 | email text, 35 | ssn text, 36 | job_title text, 37 | create_date text, 38 | cust_id int 39 | ) 40 | ON CONFLICT (cust_id) DO UPDATE SET 41 | first_name = EXCLUDED.first_name 42 | ,last_name = EXCLUDED.last_name 43 | ,street_address = EXCLUDED.street_address 44 | ,city = EXCLUDED.city 45 | ,state = EXCLUDED.state 46 | ,zip_code = EXCLUDED.zip_code 47 | ,home_phone = EXCLUDED.home_phone 48 | ,mobile = EXCLUDED.mobile 49 | ,email = EXCLUDED.email 50 | ,ssn = EXCLUDED.ssn 51 | ,job_title = EXCLUDED.job_title 52 | ,create_date = EXCLUDED.create_date; 53 | 54 | 55 | END; 56 | $BODY$; 57 | -------------------------------------------------------------------------------- /db_ddl/grants4dbz.sql: -------------------------------------------------------------------------------- 1 | GRANT ALL ON ALL TABLES IN SCHEMA datagen TO datagen; 2 | -------------------------------------------------------------------------------- /db_ddl/hive_metastore_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE USER hive; 2 | ALTER ROLE hive WITH PASSWORD 'supersecret1'; 3 | CREATE DATABASE hive_metastore; 4 | GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive; 5 | -------------------------------------------------------------------------------- /dbz_server/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /dbz_server/application.properties: -------------------------------------------------------------------------------- 1 | debezium.sink.type=iceberg 2 | debezium.format.value.schemas.enable=true 3 | 4 | #################################################################################### 5 | # postgresql source config: 6 | #################################################################################### 7 | debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector 8 | debezium.source.offset.storage.file.filename=data/offsets.dat 9 | debezium.source.offset.flush.interval.ms=0 10 | debezium.source.database.hostname=127.0.0.1 11 | debezium.source.database.port=5432 12 | debezium.source.database.user=datagen 13 | debezium.source.database.password=supersecret1 14 | debezium.source.database.dbname=datagen 15 | debezium.source.database.server.name=localhost 16 | debezium.source.schema.include.list=datagen 17 | debezium.source.plugin.name=pgoutput 18 | # below is new as of 3.15.2024 19 | debezium.source.topic.prefix=dbz_ 20 | 21 | 22 | #################################################################################### 23 | # Iceberg sink config: 24 | #################################################################################### 25 | debezium.sink.iceberg.warehouse=s3://iceberg-data 26 | debezium.sink.iceberg.catalog-name=icecatalog 27 | debezium.sink.iceberg.table-namespace=icecatalog 28 | #debezium.sink.iceberg.table-prefix=cdc_ 29 | # above was the orig. 
below is new on 3.15.2024 30 | debezium.sink.iceberg.table-prefix=debeziumcdc_ 31 | debezium.sink.iceberg.write.format.default=parquet 32 | debezium.sink.iceberg.upsert=true 33 | debezium.sink.iceberg.upsert-keep-deletes=true 34 | debezium.sink.iceberg.table.auto-create=true 35 | 36 | debezium.sink.iceberg.name=icecatalog 37 | debezium.sink.iceberg.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog 38 | debezium.sink.iceberg.uri=jdbc:postgresql://127.0.0.1:5432/icecatalog 39 | debezium.sink.iceberg.jdbc.user=icecatalog 40 | debezium.sink.iceberg.jdbc.password=supersecret1 41 | 42 | #################################################################################### 43 | # S3 config to a local minio instance 44 | #################################################################################### 45 | debezium.sink.iceberg.fs.defaultFS=s3://iceberg-data/icecatalog 46 | debezium.sink.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO 47 | debezium.sink.iceberg.com.amazonaws.services.s3a.enableV4=true 48 | debezium.sink.iceberg.s3.endpoint=http://127.0.0.1:9000 49 | debezium.sink.iceberg.s3.path-style-access=true 50 | debezium.sink.iceberg.s3.access-key-id= 51 | debezium.sink.iceberg.s3.secret-access-key= 52 | 53 | #################################################################################### 54 | # do event flattening. unwrap message! 55 | #################################################################################### 56 | debezium.transforms=unwrap 57 | debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState 58 | debezium.transforms.unwrap.add.fields=op,table,source.ts_ms,db 59 | debezium.transforms.unwrap.delete.handling.mode=rewrite 60 | debezium.transforms.unwrap.drop.tombstones=true 61 | 62 | #################################################################################### 63 | # ############ SET LOG LEVELS ############ 64 | #################################################################################### 65 | quarkus.log.level=INFO 66 | quarkus.log.console.json=false 67 | # hadoop, parquet 68 | quarkus.log.category."org.apache.hadoop".level=WARN 69 | quarkus.log.category."org.apache.parquet".level=WARN 70 | # Ignore messages below warning level from Jetty, because it's a bit verbose 71 | quarkus.log.category."org.eclipse.jetty".level=WARN 72 | 73 | -------------------------------------------------------------------------------- /downloads/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /explore_postgresql.md: -------------------------------------------------------------------------------- 1 | ### SQL GUI client to access the PostgreSQL databases 2 | 3 | * It uses a tool called `Adminer` that was installed during setup 4 | 5 | --- 6 | 7 | #### Access the `datagen` database with these credentials: 8 | --- 9 | * user --> `datagen` 10 | * Password --> `supersecret1` 11 | 12 | * From a browser navigate to: `http:///adminer` 13 | 14 | ##### `Datagen` database login screen: 15 | --- 16 | 17 | ![](./images/adminer_login_screen.png) 18 | 19 | --- 20 | 21 | #### Access the `icecatalog` database with these credentials: 22 | --- 23 | * user --> `icecatalog` 24 | * Password --> `supersecret1` 25 | 26 | * From a browser navigate to: `http:///adminer` 27 | 28 | ##### `icecatalog` database login screen: 29 | --- 30 | 31 | 32 | ![](./images/adminer_login_screen_icecatalog.png) 33 | 34 | 35 | --- 36 | --- 37 | 38 | Click here to return to 
main page: [`Workshop 2 Exercises`](./README.md/#extra-credit). 39 | -------------------------------------------------------------------------------- /get_files.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ########################################################################################## 4 | # load of the utilities functions: 5 | ########################################################################################## 6 | echo "load utils" 7 | echo 8 | . ~/data_origination_workshop/utils.sh 9 | 10 | ########################################################################################## 11 | ########################################################################################## 12 | ########################################################################################## 13 | ########################################################################################## 14 | # Define the files as variables; 15 | ########################################################################################## 16 | ########################################################################################## 17 | ########################################################################################## 18 | ########################################################################################## 19 | echo "defining vars" 20 | echo 21 | ########################################################################################## 22 | # SPARK & ICEBERG ITEMS: 23 | ########################################################################################## 24 | #SPARK_FILE=https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz 25 | #SPARK_FILE=https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz; echo "SPARK_STANDALONE_FILE=${SPARK_FILE##*/}" >> ~/file_variables.output 26 | SPARK_FILE=https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz; echo "SPARK_STANDALONE_FILE=${SPARK_FILE##*/}" >> ~/file_variables.output 27 | COMMONS_POOL2_JAR=https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar; echo "COMMONS_POOL2_FILE=${COMMONS_POOL2_JAR##*/}" >> ~/file_variables.output 28 | KAFKA_CLIENT_JAR=https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.3.1/kafka-clients-3.3.1.jar; echo "KAFKA_CLIENT_FILE=${KAFKA_CLIENT_JAR##*/}" >> ~/file_variables.output 29 | SPARK_TOKEN_JAR=https://repo.mavenlibs.com/maven/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.3.1/spark-token-provider-kafka-0-10_2.12-3.3.1.jar; echo "SPARK_TOKEN_FILE=${SPARK_TOKEN_JAR##*/}" >> ~/file_variables.output 30 | SPARK_SQL_KAFKA_JAR=https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.3.1/spark-sql-kafka-0-10_2.12-3.3.1.jar; echo "SPARK_SQL_KAFKA_FILE=${SPARK_SQL_KAFKA_JAR##*/}" >> ~/file_variables.output 31 | ICEBERG_SPARK_JAR=https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar; echo "SPARK_ICEBERG_FILE=${ICEBERG_SPARK_JAR##*/}" >> ~/file_variables.output 32 | URL_CONNECT_JAR=https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.19.19/url-connection-client-2.19.19.jar; echo "URL_CONNECT_FILE=${URL_CONNECT_JAR##*/}" >> ~/file_variables.output 33 | AWS_BUNDLE_JAR=https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.19.19/bundle-2.19.19.jar; echo "AWS_BUNDLE_FILE=${AWS_BUNDLE_JAR##*/}" >> ~/file_variables.output 34 | 35 | 36 | 
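# NOTE: each ${VAR##*/} expansion above strips everything up to the last '/'
# in the URL, so only the bare file name (e.g. spark-3.3.2-bin-hadoop3.tgz)
# is recorded in ~/file_variables.output for later steps to reference.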
########################################################################################## 37 | # KAKFA CONNECT ITEMS: 38 | ########################################################################################## 39 | #KCONNECT_FILE=https://dlcdn.apache.org/kafka/3.3.2/kafka_2.13-3.3.2.tgz; echo "KAFKA_CONNECT_FILE=${KCONNECT_FILE##*/}" >> ~/file_variables.output 40 | KCONNECT_FILE=https://archive.apache.org/dist/kafka//3.3.2/kafka_2.13-3.3.2.tgz; echo "KAFKA_CONNECT_FILE=${KCONNECT_FILE##*/}" >> ~/file_variables.output 41 | KCONNECT_JDBC_JAR=https://jdbc.postgresql.org/download/postgresql-42.5.1.jar; echo "KCONNECT_JDBC_FILE=${KCONNECT_JDBC_JAR##*/}" >> ~/file_variables.output 42 | DBZ_CONNECT_FILE=https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.1.1.Final/debezium-connector-postgres-2.1.1.Final-plugin.tar.gz; echo "DEBEZIUM_CONNECT_FILE=${DBZ_CONNECT_FILE##*/}" >> ~/file_variables.output 43 | 44 | ########################################################################################## 45 | # REDPANDA ITEMS: 46 | ########################################################################################## 47 | REDPANDA_REPO_FILE=https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh; echo "PANDA_REPO_FILE=${REDPANDA_REPO_FILE##*/}" >> ~/file_variables.output 48 | REDPANDA_FILE=https://github.com/redpanda-data/redpanda/releases/latest/download/rpk-linux-amd64.zip; echo "PANDA_FILE=${REDPANDA_FILE##*/}" >> ~/file_variables.output 49 | 50 | ########################################################################################## 51 | # POSTGRESQL ITEMS: 52 | ########################################################################################## 53 | PSQL_REPO_KEY=https://www.postgresql.org/media/keys/ACCC4CF8.asc; echo "POSTGRESQL_KEY_FILE=${PSQL_REPO_KEY##*/}" >> ~/file_variables.output 54 | PSQL_JDBC_JAR=https://jdbc.postgresql.org/download/postgresql-42.5.1.jar; echo "POSTGRESQL_FILE=${KCONNECT_JDBC_JAR##*/}" >> ~/file_variables.output 55 | 56 | ########################################################################################## 57 | # MINIO ITEMS: 58 | ########################################################################################## 59 | #MINIO_CLI_FILE=https://dl.min.io/client/mc/release/linux-amd64/mc; echo "MINIO_FILE=${MINIO_CLI_FILE##*/}" >> ~/file_variables.output 60 | MINIO_CLI_FILE=https://dl.min.io/client/mc/release/linux-amd64/archive/mc.RELEASE.2023-01-11T03-14-16Z; echo "MINIO_FILE=${MINIO_CLI_FILE##*/}" >> ~/file_variables.output 61 | MINIO_PACKAGE=https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230112020616.0.0_amd64.deb; echo "MINIO_PACKAGE_FILE=${MINIO_PACKAGE##*/}" >> ~/file_variables.output 62 | 63 | ########################################################################################## 64 | # DOCKER ITEMS: 65 | ########################################################################################## 66 | DOCKER_KEY_FILE=https://download.docker.com/linux/ubuntu/gpg; echo "DOCKER_REPO_KEY_FILE=${DOCKER_KEY_FILE##*/}" >> ~/file_variables.output 67 | 68 | ########################################################################################## 69 | ########################################################################################## 70 | ########################################################################################## 71 | ########################################################################################## 72 | # Get the files from the above variables; 73 | 
########################################################################################## 74 | ########################################################################################## 75 | ########################################################################################## 76 | ########################################################################################## 77 | 78 | ########################################################################################## 79 | # GET - SPARK & ICEBERG ITEMS: 80 | ########################################################################################## 81 | echo "calling get_valid_urls" 82 | echo 83 | get_valid_url $SPARK_FILE 84 | get_valid_url $COMMONS_POOL2_JAR 85 | get_valid_url $KAFKA_CLIENT_JAR 86 | get_valid_url $SPARK_TOKEN_JAR 87 | get_valid_url $SPARK_SQL_KAFKA_JAR 88 | get_valid_url $ICEBERG_SPARK_JAR 89 | get_valid_url $URL_CONNECT_JAR 90 | get_valid_url $AWS_BUNDLE_JAR 91 | 92 | ########################################################################################## 93 | # GET - KAKFA CONNECT ITEMS: 94 | ########################################################################################## 95 | get_valid_url $KCONNECT_FILE 96 | get_valid_url $KCONNECT_JDBC_JAR 97 | get_valid_url $DBZ_CONNECT_FILE 98 | 99 | ########################################################################################## 100 | # GET - REDPANDA ITEMS: 101 | ########################################################################################## 102 | get_valid_url $REDPANDA_REPO_FILE 103 | get_valid_url $REDPANDA_FILE 104 | 105 | ########################################################################################## 106 | # GET - POSTGRESQL ITEMS: 107 | ########################################################################################## 108 | get_valid_url $PSQL_REPO_KEY 109 | get_valid_url $PSQL_JDBC_JAR 110 | 111 | ########################################################################################## 112 | # GET - MINIO ITEMS: 113 | ########################################################################################## 114 | get_valid_url $MINIO_CLI_FILE 115 | get_valid_url $MINIO_PACKAGE 116 | 117 | ########################################################################################## 118 | # GET - DOCKER ITEMS: 119 | ########################################################################################## 120 | get_valid_url $DOCKER_KEY_FILE 121 | -------------------------------------------------------------------------------- /hive_metastore/hive-site.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | javax.jdo.option.ConnectionURL 4 | jdbc:postgresql://localhost:5432/hive_metastore 5 | 6 | 7 | 8 | javax.jdo.option.ConnectionDriverName 9 | org.postgresql.Driver 10 | 11 | 12 | 13 | javax.jdo.option.ConnectionUserName 14 | hive 15 | 16 | 17 | 18 | javax.jdo.option.ConnectionPassword 19 | supersecret1 20 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /images/.placeholder: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /images/Iceberg.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/Iceberg.gif 
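
For reference, the `get_valid_url` helper that `get_files.sh` above calls for each download (it is defined in `utils.sh`, shown later in this repo) checks that a URL answers before pulling the file down. A rough Python equivalent of that validate-then-download pattern is sketched below; it assumes the third-party `requests` library and a local `downloads` directory, neither of which is part of the workshop scripts.

```
import sys
import requests

def get_valid_url(url: str, dest_dir: str = "downloads") -> None:
    """Rough Python analogue of get_valid_url in utils.sh: verify, then download."""
    head = requests.head(url, allow_redirects=True, timeout=15)
    if head.status_code != 200:
        print(f"file: {url} -- does not exist. Aborting the install.")
        sys.exit(1)

    filename = url.rsplit("/", 1)[-1]        # same effect as ${VAR##*/} in the shell script
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    with open(f"{dest_dir}/{filename}", "wb") as fh:
        fh.write(resp.content)
    print(f"file exists. downloaded {filename}")
```
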
-------------------------------------------------------------------------------- /images/access_keys_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/access_keys_view.png -------------------------------------------------------------------------------- /images/adminer_login_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/adminer_login_screen.png -------------------------------------------------------------------------------- /images/adminer_login_screen_icecatalog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/adminer_login_screen_icecatalog.png -------------------------------------------------------------------------------- /images/bucket_first_table_metadata_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/bucket_first_table_metadata_view.png -------------------------------------------------------------------------------- /images/connect_ouput_detail_msg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/connect_ouput_detail_msg.png -------------------------------------------------------------------------------- /images/connect_output_summary_msg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/connect_output_summary_msg.png -------------------------------------------------------------------------------- /images/console_view_run_connect.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/console_view_run_connect.png -------------------------------------------------------------------------------- /images/detail_view_of_cust_msg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/detail_view_of_cust_msg.png -------------------------------------------------------------------------------- /images/drunk-cheers.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/drunk-cheers.gif -------------------------------------------------------------------------------- /images/first_login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/first_login.png -------------------------------------------------------------------------------- /images/initial_bucket_view.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/initial_bucket_view.png -------------------------------------------------------------------------------- /images/minio_login_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/minio_login_screen.png -------------------------------------------------------------------------------- /images/panda_topic_view_connect_topic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/panda_topic_view_connect_topic.png -------------------------------------------------------------------------------- /images/panda_view__dg_load_topics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/panda_view__dg_load_topics.png -------------------------------------------------------------------------------- /images/panda_view_topics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/panda_view_topics.png -------------------------------------------------------------------------------- /images/spark_master_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/spark_master_view.png -------------------------------------------------------------------------------- /images/topic_customer_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/images/topic_customer_view.png -------------------------------------------------------------------------------- /kafka_connect/connect.properties: -------------------------------------------------------------------------------- 1 | #Kafka broker addresses 2 | bootstrap.servers=:9092 3 | 4 | #Cluster level converters 5 | #These applies when the connectors don't define any converter 6 | key.converter=org.apache.kafka.connect.json.JsonConverter 7 | value.converter=org.apache.kafka.connect.json.JsonConverter 8 | 9 | #JSON schemas enabled to false in cluster level 10 | key.converter.schemas.enable=true 11 | value.converter.schemas.enable=true 12 | 13 | #Where to keep the Connect topic offset configurations 14 | offset.storage.file.filename=/tmp/connect.offsets 15 | offset.flush.interval.ms=10000 16 | 17 | #Plugin path to put the connector binaries 18 | plugin.path=:~/kafka_connect/plugins/debezium-connector-postgres/ 19 | -------------------------------------------------------------------------------- /kafka_connect/pg-source-connector.properties: -------------------------------------------------------------------------------- 1 | connector.class=io.debezium.connector.postgresql.PostgresConnector 2 | offset.storage=org.apache.kafka.connect.storage.FileOffsetBackingStore 3 | 
offset.storage.file.filename=offset.dat 4 | offset.flush.interval.ms=5000 5 | name=postgres-dbz-connector 6 | database.hostname=localhost 7 | database.port=5432 8 | database.user=datagen 9 | database.password=supersecret1 10 | database.dbname=datagen 11 | schema.include.list=datagen 12 | plugin.name=pgoutput 13 | topic.prefix=pg_datagen2panda 14 | -------------------------------------------------------------------------------- /prework.md: -------------------------------------------------------------------------------- 1 | --- 2 | --- 3 | # Temp items for my setup on proxmox: 4 | ``` 5 | 6 | ########################################################################################## 7 | # notes: 8 | ########################################################################################## 9 | -- I built a new standalone ubuntu 20 server to install this with proxmox: 10 | 11 | # create a clone from the template 12 | qm clone 9400 670 --name ice-integration 13 | 14 | # put your ssh key into a file: `~/cloud_images/ssh_stuff` 15 | qm set 670 --sshkey ~/cloud_images/ssh_stuff/id_rsa.pub 16 | 17 | # change the default username: 18 | qm set 670 --ciuser centos 19 | 20 | # Let's setup dhcp for the network in this image: 21 | qm set 670 --ipconfig0 ip=dhcp 22 | 23 | # start the image from gui 24 | qm start 670 25 | 26 | ########################################################################################## 27 | # If I need to stop and destroy 28 | ########################################################################################## 29 | qm stop 670 && qm destroy 670 30 | 31 | ########################################################################################## 32 | # ssh to our new host: 33 | ########################################################################################## 34 | 35 | ssh -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -o UserKnownHostsFile=/dev/null -i ~/fishermans_wharf/proxmox/id_rsa centos@192.168.1.43 36 | 37 | ``` 38 | --- 39 | --- 40 | -------------------------------------------------------------------------------- /redpanda/redpanda.yaml: -------------------------------------------------------------------------------- 1 | redpanda: 2 | data_directory: /var/lib/redpanda/data 3 | seed_servers: [] 4 | rpc_server: 5 | address: 6 | port: 33145 7 | kafka_api: 8 | - address: 9 | port: 9092 10 | admin: 11 | - address: 12 | port: 9644 13 | developer_mode: true 14 | auto_create_topics_enabled: true 15 | fetch_reads_debounce_timeout: 10 16 | group_initial_rebalance_delay: 0 17 | group_topic_partitions: 3 18 | storage_min_free_bytes: 10485760 19 | topic_partitions_per_shard: 1000 20 | rpk: 21 | enable_usage_stats: true 22 | coredump_dir: /var/lib/redpanda/coredump 23 | overprovisioned: true 24 | pandaproxy: {} 25 | schema_registry: {} 26 | -------------------------------------------------------------------------------- /sample_output/.touch: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /sample_spark_jobs.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | #### Addition Spark & Python Exercises: 4 | 5 | Here are some additional spark examples that demonstrate how to interact with the data generated with spark. 
6 | 7 | In this spark job [`consume_stream_customer_2_console.py`](./datagen/consume_stream_customer_2_console.py) we will consume the records from the topic `dgCustomer` and just stream them to our console. 8 | 9 | ``` 10 | spark-submit ~/datagen/consume_stream_customer_2_console.py 11 | ``` 12 | --- 13 | In this spark job [`consume_stream_txn_2_console.py`](./datagen/consume_stream_txn_2_console.py) we will consume the records from the topic `dgTxn` and just stream them to our console. 14 | 15 | ``` 16 | spark-submit ~/datagen/consume_stream_txn_2_console.py 17 | ``` 18 | 19 | --- 20 | In this python job [`comsume_topic_dgCustomer.py`](./datagen/comsume_topic_dgCustomer.py) we will consume 4 records from the topic `dgCustomer` and just stream them to our console. 21 | 22 | ``` 23 | python3 ~/datagen/comsume_topic_dgCustomer.py 4 24 | ``` 25 | --- 26 | 27 | Click here to return to the workshop: [`Workshop 2 Exercises`](./README.md/#automation-of-workshop-1-exercises). 28 | 29 | --- 30 | -------------------------------------------------------------------------------- /spark_items/all_workshop1_items.sql: -------------------------------------------------------------------------------- 1 | -- create the customer table 2 | CREATE TABLE icecatalog.icecatalog.customer ( 3 | first_name STRING, 4 | last_name STRING, 5 | street_address STRING, 6 | city STRING, 7 | state STRING, 8 | zip_code STRING, 9 | home_phone STRING, 10 | mobile STRING, 11 | email STRING, 12 | ssn STRING, 13 | job_title STRING, 14 | create_date STRING, 15 | cust_id BIGINT) 16 | USING iceberg 17 | OPTIONS ( 18 | 'write.object-storage.enabled'=true, 19 | 'write.data.path'='s3://iceberg-data'); 20 | 21 | -- Create the Transactions table 22 | CREATE TABLE icecatalog.icecatalog.transactions ( 23 | transact_id STRING, 24 | transaction_date STRING, 25 | item_desc STRING, 26 | barcode STRING, 27 | category STRING, 28 | amount STRING, 29 | cust_id BIGINT) 30 | USING iceberg 31 | OPTIONS ( 32 | 'write.object-storage.enabled'=true, 33 | 'write.data.path'='s3://iceberg-data'); 34 | 35 | -- load customer table from json records 36 | CREATE TEMPORARY VIEW customerView 37 | USING org.apache.spark.sql.json 38 | OPTIONS ( 39 | path "/opt/spark/input/customers.json" 40 | ); 41 | INSERT INTO icecatalog.icecatalog.customer 42 | SELECT 43 | first_name, 44 | last_name, 45 | street_address, 46 | city, 47 | state, 48 | zip_code, 49 | home_phone, 50 | mobile, 51 | email, 52 | ssn, 53 | job_title, 54 | create_date, 55 | cust_id 56 | FROM customerView; 57 | 58 | -- Merge customer json records: 59 | CREATE TEMPORARY VIEW mergeCustomerView 60 | USING org.apache.spark.sql.json 61 | OPTIONS ( 62 | path "/opt/spark/input/update_customers.json" 63 | ); 64 | MERGE INTO icecatalog.icecatalog.customer c 65 | USING (SELECT 66 | first_name, 67 | last_name, 68 | street_address, 69 | city, 70 | state, 71 | zip_code, 72 | home_phone, 73 | mobile, 74 | email, 75 | ssn, 76 | job_title, 77 | create_date, 78 | cust_id 79 | FROM mergeCustomerView) j 80 | ON c.cust_id = j.cust_id 81 | WHEN MATCHED THEN UPDATE SET 82 | c.first_name = j.first_name, 83 | c.last_name = j.last_name, 84 | c.street_address = j.street_address, 85 | c.city = j.city, 86 | c.state = j.state, 87 | c.zip_code = j.zip_code, 88 | c.home_phone = j.home_phone, 89 | c.mobile = j.mobile, 90 | c.email = j.email, 91 | c.ssn = j.ssn, 92 | c.job_title = j.job_title, 93 | c.create_date = j.create_date 94 | WHEN NOT MATCHED THEN INSERT *; 95 | 
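
The batch items in `all_workshop1_items.sql` above can also be driven from PySpark instead of the `spark-sql` CLI. The sketch below assumes a SparkSession configured with the same `icecatalog` settings shown in `conf.properties` and `load_ice_transactions_pyspark.py` (the Iceberg/AWS package list is omitted here for brevity) and that `/opt/spark/input/update_customers.json` exists; note that `spark.sql()` runs one statement at a time, so each statement from the file would be submitted separately.

```
from pyspark.sql import SparkSession

# Session config mirrors conf.properties / load_ice_transactions_pyspark.py;
# the spark.jars.packages entries for Iceberg and the AWS SDK are assumed.
spark = (SparkSession.builder
         .appName("workshop1_batch_sql")
         .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
         .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog")
         .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog")
         .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1")
         .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data")
         .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
         .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000")
         .getOrCreate())

# Register the update file as a temporary view, then run the same MERGE logic
# as the SQL above (the column-by-column SET list is abbreviated to SET * here).
spark.read.json("/opt/spark/input/update_customers.json") \
     .createOrReplaceTempView("mergeCustomerView")

spark.sql("""
  MERGE INTO icecatalog.icecatalog.customer c
  USING (SELECT * FROM mergeCustomerView) j
  ON c.cust_id = j.cust_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

spark.stop()
```
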
-------------------------------------------------------------------------------- /spark_items/conf.properties: -------------------------------------------------------------------------------- 1 | spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 2 | spark.sql.cli.print.header=true 3 | spark.sql.catalog.icecatalog=org.apache.iceberg.spark.SparkCatalog 4 | spark.sql.catalog.icecatalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog 5 | spark.sql.catalog.icecatalog.uri=jdbc:postgresql://127.0.0.1:5432/icecatalog 6 | spark.sql.catalog.icecatalog.jdbc.user=icecatalog 7 | spark.sql.catalog.icecatalog.jdbc.password=supersecret1 8 | spark.sql.catalog.icecatalog.warehouse=s3://iceberg-data 9 | spark.sql.catalog.icecatalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO 10 | spark.sql.catalog.icecatalog.s3.endpoint=http://127.0.0.1:9000 11 | spark.sql.catalog.sparkcatalog=org.apache.iceberg.spark.SparkSessionCatalog 12 | spark.sql.defaultCatalog=icecatalog 13 | spark.eventLog.enabled=true 14 | spark.eventLog.dir=/opt/spark/spark-events 15 | spark.history.fs.logDirectory=/opt/spark/spark-events 16 | spark.sql.catalogImplementation=in-memory 17 | -------------------------------------------------------------------------------- /spark_items/ice_spark-sql_i-cli.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . ~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties 19 | 20 | -------------------------------------------------------------------------------- /spark_items/iceberg_workshop_sql_items.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . 
~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties \ 19 | -f /opt/spark/sql/all_workshop1_items.sql \ 20 | --verbose 21 | -------------------------------------------------------------------------------- /spark_items/iceberg_workshop_tbl_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE icecatalog.icecatalog.customer ( 2 | first_name STRING, 3 | last_name STRING, 4 | street_address STRING, 5 | city STRING, 6 | state STRING, 7 | zip_code STRING, 8 | home_phone STRING, 9 | mobile STRING, 10 | email STRING, 11 | ssn STRING, 12 | job_title STRING, 13 | create_date STRING, 14 | cust_id BIGINT) 15 | USING iceberg 16 | OPTIONS ( 17 | 'write.object-storage.enabled'=true, 18 | 'write.data.path'='s3://iceberg-data'); 19 | 20 | CREATE TABLE icecatalog.icecatalog.transactions ( 21 | transact_id STRING, 22 | transaction_date STRING, 23 | item_desc STRING, 24 | barcode STRING, 25 | category STRING, 26 | amount STRING, 27 | cust_id BIGINT) 28 | USING iceberg 29 | OPTIONS ( 30 | 'write.object-storage.enabled'=true, 31 | 'write.data.path'='s3://iceberg-data'); 32 | -------------------------------------------------------------------------------- /spark_items/load_ice_customer_batch.sql: -------------------------------------------------------------------------------- 1 | CREATE TEMPORARY VIEW customerView 2 | USING org.apache.spark.sql.json 3 | OPTIONS ( 4 | path "/opt/spark/input/customers.json" 5 | ); 6 | INSERT INTO icecatalog.icecatalog.customer 7 | SELECT 8 | first_name, 9 | last_name, 10 | street_address, 11 | city, 12 | state, 13 | zip_code, 14 | home_phone, 15 | mobile, 16 | email, 17 | ssn, 18 | job_title, 19 | create_date, 20 | cust_id 21 | FROM customerView; 22 | -------------------------------------------------------------------------------- /spark_items/load_ice_transactions_pyspark.py: -------------------------------------------------------------------------------- 1 | # import SparkSession 2 | from pyspark.sql import SparkSession 3 | 4 | # create SparkSession 5 | spark = SparkSession.builder \ 6 | .appName("Python Spark SQL example") \ 7 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19") \ 8 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 9 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 10 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 11 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 12 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 13 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 14 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 15 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 16 | 
.config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 17 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 18 | .config("spark.eventLog.enabled", "true") \ 19 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 20 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 21 | .config("spark.sql.catalogImplementation", "in-memory") \ 22 | .getOrCreate() 23 | 24 | # A JSON dataset is pointed to by 'path' variable 25 | path = "/opt/spark/input/transactions.json" 26 | 27 | # read json into the DataFrame 28 | transactionsDF = spark.read.json(path) 29 | 30 | # visualize the inferred schema 31 | transactionsDF.printSchema() 32 | 33 | # print out the dataframe in this cli 34 | transactionsDF.show() 35 | 36 | # Append these transactions to the table we created in an earlier step `icecatalog.icecatalog.transactions` 37 | transactionsDF.writeTo("icecatalog.icecatalog.transactions").append() 38 | 39 | # stop the sparkSession 40 | spark.stop() 41 | 42 | # Exit out of the editor: 43 | quit(); 44 | -------------------------------------------------------------------------------- /spark_items/merge_ice_customer_batch.sql: -------------------------------------------------------------------------------- 1 | CREATE TEMPORARY VIEW mergeCustomerView 2 | USING org.apache.spark.sql.json 3 | OPTIONS ( 4 | path "/opt/spark/input/update_customers.json" 5 | ); 6 | MERGE INTO icecatalog.icecatalog.customer c 7 | USING (SELECT 8 | first_name, 9 | last_name, 10 | street_address, 11 | city, 12 | state, 13 | zip_code, 14 | home_phone, 15 | mobile, 16 | email, 17 | ssn, 18 | job_title, 19 | create_date, 20 | cust_id 21 | FROM mergeCustomerView) j 22 | ON c.cust_id = j.cust_id 23 | WHEN MATCHED THEN UPDATE SET 24 | c.first_name = j.first_name, 25 | c.last_name = j.last_name, 26 | c.street_address = j.street_address, 27 | c.city = j.city, 28 | c.state = j.state, 29 | c.zip_code = j.zip_code, 30 | c.home_phone = j.home_phone, 31 | c.mobile = j.mobile, 32 | c.email = j.email, 33 | c.ssn = j.ssn, 34 | c.job_title = j.job_title, 35 | c.create_date = j.create_date 36 | WHEN NOT MATCHED THEN INSERT *; 37 | -------------------------------------------------------------------------------- /spark_items/stream_customer_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE icecatalog.icecatalog.stream_customer ( 2 | first_name STRING, 3 | last_name STRING, 4 | street_address STRING, 5 | city STRING, 6 | state STRING, 7 | zip_code STRING, 8 | home_phone STRING, 9 | mobile STRING, 10 | email STRING, 11 | ssn STRING, 12 | job_title STRING, 13 | create_date STRING, 14 | cust_id BIGINT) 15 | USING iceberg 16 | OPTIONS ( 17 | 'write.object-storage.enabled'=true, 18 | 'write.data.path'='s3://iceberg-data'); 19 | -------------------------------------------------------------------------------- /spark_items/stream_customer_ddl_script.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . 
~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties \ 19 | -f /opt/spark/sql/stream_customer_ddl.sql \ 20 | --verbose 21 | -------------------------------------------------------------------------------- /spark_items/stream_customer_event_history_ddl.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE icecatalog.icecatalog.stream_customer_event_history ( 2 | type STRING, 3 | event_ts LONG, 4 | tx_id STRING, 5 | first_name STRING, 6 | last_name STRING, 7 | street_address STRING, 8 | city STRING, 9 | state STRING, 10 | zip_code STRING, 11 | home_phone STRING, 12 | mobile STRING, 13 | email STRING, 14 | ssn STRING, 15 | job_title STRING, 16 | create_date STRING, 17 | cust_id BIGINT) 18 | USING iceberg 19 | OPTIONS ( 20 | 'write.object-storage.enabled'=true, 21 | 'write.data.path'='s3://iceberg-data'); 22 | -------------------------------------------------------------------------------- /spark_items/stream_customer_event_history_ddl_script.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | . ~/minio-output.properties 5 | 6 | export AWS_ACCESS_KEY_ID=$access_key 7 | export AWS_SECRET_ACCESS_KEY=$secret_key 8 | export AWS_S3_ENDPOINT=127.0.0.1:9000 9 | export AWS_REGION=us-east-1 10 | export MINIO_REGION=us-east-1 11 | export AWS_SDK_VERSION=2.19.19 12 | export AWS_MAVEN_GROUP=software.amazon.awssdk 13 | 14 | spark-sql --packages \ 15 | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0, \ 16 | software.amazon.awssdk:bundle:2.19.19, \ 17 | software.amazon.awssdk:url-connection-client:2.19.19 \ 18 | --properties-file /opt/spark/sql/conf.properties \ 19 | -f /opt/spark/sql/stream_customer_event_history_ddl.sql \ 20 | --verbose 21 | -------------------------------------------------------------------------------- /stop_start_services.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ########################################################################################## 4 | # Stop of Spark items 5 | ########################################################################################## 6 | 7 | echo "stopping spark worker..." 8 | /opt/spark/sbin/stop-worker.sh spark://$(hostname -f):7077 9 | echo 10 | sleep 5 11 | 12 | echo "stopping spark master..." 13 | /opt/spark/sbin/stop-master.sh 14 | echo 15 | sleep 5 16 | 17 | ########################################################################################## 18 | # Stop of panda items 19 | ########################################################################################## 20 | 21 | 22 | echo "stopping redpanda..." 23 | sudo systemctl stop redpanda 24 | echo 25 | sleep 5 26 | 27 | echo "stopping redpanda console..." 
28 | sudo systemctl stop redpanda-console 29 | echo 30 | sleep 5 31 | 32 | ########################################################################################## 33 | # stop of minio 34 | ########################################################################################## 35 | 36 | echo "stopping minio.service..." 37 | sudo systemctl stop minio.service 38 | echo 39 | sleep 5 40 | 41 | ########################################################################################## 42 | # Start of Spark items 43 | ########################################################################################## 44 | 45 | echo "starting spark master..." 46 | /opt/spark/sbin/start-master.sh 47 | echo 48 | sleep 5 49 | echo "starting spark worker..." 50 | /opt/spark/sbin/start-worker.sh spark://$(hostname -f):7077 51 | echo 52 | sleep 5 53 | 54 | ########################################################################################## 55 | # Start of panda items 56 | ########################################################################################## 57 | 58 | 59 | echo "starting redpanda..." 60 | sudo systemctl start redpanda 61 | echo 62 | sleep 5 63 | 64 | echo "starting redpanda console..." 65 | sudo systemctl start redpanda-console 66 | echo 67 | sleep 5 68 | ########################################################################################## 69 | # start of minio 70 | ########################################################################################## 71 | 72 | echo "starting minio.service..." 73 | sudo systemctl start minio.service 74 | echo 75 | sleep 5 76 | echo "services have been restarted..." 77 | -------------------------------------------------------------------------------- /tick2705-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/tick2705-1.png -------------------------------------------------------------------------------- /tick2705-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlepple/data_origination_workshop/c3b2c7311db6f5167cefdfe68c4af1bbe3e85572/tick2705-2.png -------------------------------------------------------------------------------- /utils.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | function validate_url(){ 4 | if [[ `wget -S --spider $1 --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 --tries=4 2>&1 | grep 'HTTP/1.1 200 OK'` ]]; then 5 | return 0 6 | else 7 | return 1 8 | fi 9 | 10 | } 11 | 12 | function get_valid_url(){ 13 | if validate_url $1; then 14 | # Download when exists 15 | echo "file exists. downloading..." 16 | wget $1 --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 --tries=4 -P ~/data_origination_workshop/downloads 17 | 18 | else 19 | # print error and exit the install 20 | echo "file: $1 -- does not exist. Aborting the install." 21 | exit 1 22 | fi 23 | } 24 | 25 | -------------------------------------------------------------------------------- /workshop1_revisit.md: -------------------------------------------------------------------------------- 1 | --- 2 | Title: Apache Iceberg Exploration with S3A storage. 3 | Author: Tim Lepple 4 | Last Updated: 1.30.2023 5 | Comments: This repo will evolve over time with new items. 
6 | Tags: Apache Iceberg | Minio | Apache SparkSQL | Apache PySpark | Ubuntu 7 | --- 8 | 9 | # Apache Iceberg Introduction Workshop 10 | 11 | --- 12 | 13 | ## Objective: 14 | My goal in this workshop was to evaluate Apache Iceberg with data stored in an S3a-compliant object store on a traditional Linux server. It has been modified slightly from the original to avoid port conflicts. Documentation here may need additional updates. 15 | 16 | 17 | --- 18 | --- 19 | ### What is Apache Iceberg: 20 | 21 | Apache Iceberg is an open-source data management system for large-scale data lakes. It provides a table abstraction for big data workloads, allowing for schema evolution, data discovery and simplified data access. Iceberg uses Apache Avro, Parquet or ORC as its data format and supports various storage systems like HDFS, S3, ADLS, etc. 22 | 23 | Iceberg uses a versioning approach to manage schema changes, enabling multiple versions of a schema to coexist in a table, providing the ability to perform schema evolution without the need for copying data. Additionally, Iceberg provides data discovery capabilities, allowing users to identify the data they need for their specific use case and extract only that data, reducing the amount of I/O required to perform a query. 24 | 25 | Iceberg provides an easy-to-use API for querying data, supporting SQL and other query languages through Apache Spark, Hive, Presto and other engines. Iceberg’s table-centric design helps to manage large datasets with high scalability, reliability and performance. 26 | 27 | --- 28 | --- 29 | ### But Why Apache Iceberg: 30 | 31 | A couple of items really jumped out at me when I read the documentation for the first time, and I immediately saw the significant benefit it could provide. Namely, it could reduce the overall expense enterprises pay to store and process the data they produce. We all know that saving money in an enterprise is a good thing. 32 | 33 | It can also perform standard `CRUD` operations on our tables seamlessly. Here are the two items that really hit home for me: 34 | 35 | --- 36 | ### Item 1: 37 | --- 38 | * Iceberg is designed for huge tables and is used in production where a single table can contain tens of petabytes of data. This data can be stored in modern-day object stores similar to these: 39 | * A Cloud provider like [Amazon S3](https://aws.amazon.com/s3/?nc2=h_ql_prod_st_s3) 40 | * An on-premise solution that you build and support yourself like [Minio](https://min.io/). 41 | * Or a vendor hardware appliance like the [Dell ECS Enterprise Object Storage](https://www.dell.com/en-us/dt/storage/ecs/index.htm) 42 | 43 | Regardless of which object store you choose, your overall expense to support this platform will see significant savings over what you probably spend today. 44 | 45 | --- 46 | ### Item 2: 47 | --- 48 | * Multi-petabyte tables can be read from a single node without needing a distributed SQL engine to sift through table metadata. That means the tools used in the examples I give below could be used to query the data stored in object stores without needing to dedicate expensive compute servers. You could spin up virtual instances or containers and execute queries against the data stored in the object store. 49 | 50 | --- 51 | 52 | The image below, from Starburst.io, gives a good visual overview of how Iceberg organizes a table.
53 | 54 | --- 55 | ![](./images/Iceberg.gif) 56 | 57 | --- 58 | --- 59 | 60 | # Highlights: 61 | 62 | --- 63 | 64 | This setup script built a single-node platform: it set up a local S3a-compliant object store, installed a local SQL database, installed a single-node Apache Iceberg processing engine, and laid the groundwork to support our Apache Iceberg tables and catalog. 65 | 66 | --- 67 | #### Object Storage Notes: 68 | --- 69 | * This type of object store could also be set up to run in your own data center if that is a requirement. Otherwise, you could build and deploy something very similar in AWS using their S3 service instead. I chose this option to demonstrate that you have a lot of options you might not have considered. It will store all of our Apache Iceberg data and catalog database objects. 70 | * This particular service is running Minio and it has a REST API that supports direct integration with the AWS CLI tool. The script also installed the AWS CLI tools and configured the properties of the AWS CLI to work directly with Minio. 71 | 72 | --- 73 | #### Local Database Notes: 74 | --- 75 | * The local SQL database is PostgreSQL and it will host metadata with pointers to the Apache Iceberg table data persisted in our object store and the metadata for our Apache Iceberg catalog. It maintains a very small footprint. 76 | 77 | --- 78 | #### Apache Iceberg Processing Engine Notes: 79 | --- 80 | * This particular workshop is using Apache Spark but we could have chosen any of the currently supported platforms. We could also choose to use a combination of these tools and have them share the same Apache Iceberg Catalog. Here is the current list of supported tools: 81 | * Spark 82 | * Flink 83 | * Trino 84 | * Presto 85 | * Dremio 86 | * StarRocks 87 | * Amazon Athena 88 | * Amazon EMR 89 | * Impala (Cloudera) 90 | * Doris 91 | 92 | --- 93 | --- 94 | 95 | 96 | --- 97 | --- 98 | 99 | ### Testing the `AWS CLI` and the `Minio CLI`: 100 | 101 | --- 102 | --- 103 | 104 | ### AWS CLI Integration: 105 | 106 | Let's test out the AWS CLI that was installed and configured during the build and run an `AWS S3` command to list the buckets currently stored in our Minio object store. 107 | 108 | --- 109 | 110 | ##### Command: 111 | 112 | ``` 113 | aws --endpoint-url http://127.0.0.1:9000 s3 ls 114 | ``` 115 | 116 | ##### Expected Output: The bucket name. 117 | ``` 118 | 2023-01-24 22:58:38 iceberg-data 119 | ``` 120 | --- 121 | 122 | --- 123 | 124 | ### Minio CLI Integration: 125 | 126 | Minio also ships a REST API and its own command-line client for accomplishing many administrative tasks and working with buckets without using the AWS CLI. The Minio client was also installed and configured during setup. Here is a link to the documentation: [Minio Client](https://min.io/docs/minio/linux/reference/minio-mc.html). 127 | 128 | --- 129 | 130 | ##### List Command: 131 | 132 | ``` 133 | mc ls icebergadmin 134 | ``` 135 | 136 | ##### Expected Output: The bucket name. 137 | ``` 138 | [2023-01-26 16:54:33 UTC] 0B iceberg-data/ 139 | 140 | ``` 141 | 142 | --- 143 | --- 144 | ### Minio Overview: 145 | 146 | Minio is an open-source, high-performance, and scalable object storage system. It is designed to be API-compatible with Amazon S3, allowing applications written for Amazon S3 to work seamlessly with Minio. Minio can be deployed on-premises, in the cloud, or in a hybrid environment, providing a unified, centralized repository for storing and managing unstructured data, such as images, videos, and backups.
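Because Minio speaks the S3 API, any standard AWS SDK can talk to it by simply overriding the endpoint. Here is a minimal sketch (not one of the workshop scripts) showing how `boto3` could list our buckets against the local Minio service; it assumes `boto3` is installed and that the access/secret keys from `~/minio-output.properties` are exported as environment variables, just as we do later for the Spark-SQL configuration.

```
# sketch: point the standard AWS SDK (boto3) at the local Minio endpoint
# assumes: `pip install boto3`, and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# exported from ~/minio-output.properties
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",   # Minio API endpoint used in this workshop
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name="us-east-1",
)

# list buckets -- should include the `iceberg-data` bucket created during setup
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```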
147 | 148 | Minio provides features such as versioning, access control, encryption, and event notifications, making it suitable for use cases such as data archiving, backup and disaster recovery, and media and entertainment. Minio also supports distributed mode, allowing multiple Minio nodes to be combined into a single object storage cluster for increased scalability and reliability. 149 | 150 | Minio can be used with a variety of tools and frameworks, including popular cloud-native technologies like Kubernetes, Docker, and Ansible, making it easy to deploy and manage. 151 | 152 | --- 153 | --- 154 | 155 | ### Explore Minio GUI from a browser. 156 | 157 | Let's log in to the Minio GUI: navigate to `http://<host-ip>:9000` in a browser 158 | 159 | - Username: `icebergadmin` 160 | - Password: `supersecret1!` 161 | 162 | --- 163 | 164 | ![](./images/minio_login_screen.png) 165 | 166 | --- 167 | 168 | `Object Browser` view with one bucket that was created during the install. Bucket Name: `iceberg-data` 169 | 170 | --- 171 | 172 | ![](./images/first_login.png) 173 | 174 | --- 175 | 176 | Click on the tab `Access Keys`: the key was created during the build too. We use this access key & secret key to configure the AWS CLI. 177 | 178 | --- 179 | 180 | ![](./images/access_keys_view.png) 181 | 182 | --- 183 | 184 | Click on the tab: `Buckets` 185 | 186 | --- 187 | 188 | ![](./images/initial_bucket_view.png) 189 | 190 | --- 191 | 192 | 193 | 194 | --- 195 | ## Apache Iceberg Processing Engine Setup: 196 | 197 | --- 198 | The setup here has changed since workshop one; these items were already started by the setup process in workshop 2. 199 | 200 | --- 201 | 202 | ##### Check that the Spark GUI is up: 203 | * navigate to `http://<host-ip>:8085` in a browser 204 | 205 | --- 206 | 207 | ##### Sample view of Spark Master. 208 | 209 | --- 210 | 211 | ![](./images/spark_master_view.png) 212 | 213 | --- 214 | 215 | #### Configure the Spark-SQL service: 216 | --- 217 | In this step, we will initialize some variables that will be used when we start the Spark-SQL service. Copy this entire block and run it in a terminal window. 218 | 219 | ``` 220 | . ~/minio-output.properties 221 | 222 | export AWS_ACCESS_KEY_ID=$access_key 223 | export AWS_SECRET_ACCESS_KEY=$secret_key 224 | export AWS_S3_ENDPOINT=127.0.0.1:9000 225 | export AWS_REGION=us-east-1 226 | export MINIO_REGION=us-east-1 227 | export DEPENDENCIES="org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0" 228 | export AWS_SDK_VERSION=2.19.19 229 | export AWS_MAVEN_GROUP=software.amazon.awssdk 230 | export AWS_PACKAGES=( 231 | "bundle" 232 | "url-connection-client" 233 | ) 234 | for pkg in "${AWS_PACKAGES[@]}"; do 235 | export DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION" 236 | done 237 | ``` 238 | 239 | ##### Start the Spark-SQL client service: 240 | --- 241 | Starting this service will connect to our PostgreSQL database and store database objects that point to the Apache Iceberg Catalog on our behalf. The metadata for our catalog & tables (along with table records) will be stored in files persisted in our object stores.
242 | 243 | ``` 244 | cd $SPARK_HOME 245 | 246 | spark-sql --packages $DEPENDENCIES \ 247 | --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ 248 | --conf spark.sql.cli.print.header=true \ 249 | --conf spark.sql.catalog.icecatalog=org.apache.iceberg.spark.SparkCatalog \ 250 | --conf spark.sql.catalog.icecatalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \ 251 | --conf spark.sql.catalog.icecatalog.uri=jdbc:postgresql://127.0.0.1:5432/icecatalog \ 252 | --conf spark.sql.catalog.icecatalog.jdbc.user=icecatalog \ 253 | --conf spark.sql.catalog.icecatalog.jdbc.password=supersecret1 \ 254 | --conf spark.sql.catalog.icecatalog.warehouse=s3://iceberg-data \ 255 | --conf spark.sql.catalog.icecatalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ 256 | --conf spark.sql.catalog.icecatalog.s3.endpoint=http://127.0.0.1:9000 \ 257 | --conf spark.sql.catalog.sparkcatalog=org.apache.iceberg.spark.SparkSessionCatalog \ 258 | --conf spark.sql.defaultCatalog=icecatalog \ 259 | --conf spark.eventLog.enabled=true \ 260 | --conf spark.eventLog.dir=/opt/spark/spark-events \ 261 | --conf spark.history.fs.logDirectory=/opt/spark/spark-events \ 262 | --conf spark.sql.catalogImplementation=in-memory 263 | ``` 264 | --- 265 | ##### Expected Output: 266 | * the warnings can be ignored 267 | ``` 268 | 23/01/25 19:48:19 WARN Utils: Your hostname, spark-ice2 resolves to a loopback address: 127.0.1.1; using 192.168.1.167 instead (on interface eth0) 269 | 23/01/25 19:48:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 270 | :: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml 271 | Ivy Default Cache set to: /home/centos/.ivy2/cache 272 | The jars for the packages stored in: /home/centos/.ivy2/jars 273 | org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency 274 | software.amazon.awssdk#bundle added as a dependency 275 | software.amazon.awssdk#url-connection-client added as a dependency 276 | :: resolving dependencies :: org.apache.spark#spark-submit-parent-59d47579-1c2b-4e66-a92d-206be33d8afe;1.0 277 | confs: [default] 278 | found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.1.0 in central 279 | found software.amazon.awssdk#bundle;2.19.19 in central 280 | found software.amazon.eventstream#eventstream;1.0.1 in central 281 | found software.amazon.awssdk#url-connection-client;2.19.19 in central 282 | found software.amazon.awssdk#utils;2.19.19 in central 283 | found org.reactivestreams#reactive-streams;1.0.3 in central 284 | found software.amazon.awssdk#annotations;2.19.19 in central 285 | found org.slf4j#slf4j-api;1.7.30 in central 286 | found software.amazon.awssdk#http-client-spi;2.19.19 in central 287 | found software.amazon.awssdk#metrics-spi;2.19.19 in central 288 | :: resolution report :: resolve 423ms :: artifacts dl 19ms 289 | :: modules in use: 290 | org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.1.0 from central in [default] 291 | org.reactivestreams#reactive-streams;1.0.3 from central in [default] 292 | org.slf4j#slf4j-api;1.7.30 from central in [default] 293 | software.amazon.awssdk#annotations;2.19.19 from central in [default] 294 | software.amazon.awssdk#bundle;2.19.19 from central in [default] 295 | software.amazon.awssdk#http-client-spi;2.19.19 from central in [default] 296 | software.amazon.awssdk#metrics-spi;2.19.19 from central in [default] 297 | software.amazon.awssdk#url-connection-client;2.19.19 from central in [default] 298 |
software.amazon.awssdk#utils;2.19.19 from central in [default] 299 | software.amazon.eventstream#eventstream;1.0.1 from central in [default] 300 | --------------------------------------------------------------------- 301 | | | modules || artifacts | 302 | | conf | number| search|dwnlded|evicted|| number|dwnlded| 303 | --------------------------------------------------------------------- 304 | | default | 10 | 0 | 0 | 0 || 10 | 0 | 305 | --------------------------------------------------------------------- 306 | :: retrieving :: org.apache.spark#spark-submit-parent-59d47579-1c2b-4e66-a92d-206be33d8afe 307 | confs: [default] 308 | 0 artifacts copied, 10 already retrieved (0kB/10ms) 309 | 23/01/25 19:48:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 310 | Setting default log level to "WARN". 311 | To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 312 | 23/01/25 19:48:28 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 313 | 23/01/25 19:48:28 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist 314 | 23/01/25 19:48:31 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 315 | 23/01/25 19:48:31 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore centos@127.0.1.1 316 | Spark master: local[*], Application Id: local-1674676103468 317 | spark-sql> 318 | 319 | ``` 320 | --- 321 | 322 | ##### Cursory Check: 323 | From our new spark-sql terminal session run the following command: 324 | 325 | ``` 326 | SHOW CURRENT NAMESPACE; 327 | ``` 328 | 329 | ##### Expected Output: 330 | 331 | ``` 332 | icecatalog 333 | Time taken: 2.692 seconds, Fetched 1 row(s) 334 | ``` 335 | --- 336 | --- 337 | 338 | ### Exercises: 339 | In this lab, we will create our first iceberg table with `Spark-SQL` 340 | 341 | --- 342 | --- 343 | 344 | #### Start the `Spark-SQL` cli tool 345 | * from the spark-sql console run the below commands: 346 | 347 | --- 348 | 349 | ##### Create Tables: 350 | * These will be run in the spark-sql cli 351 | ``` 352 | CREATE TABLE icecatalog.icecatalog.customer ( 353 | first_name STRING, 354 | last_name STRING, 355 | street_address STRING, 356 | city STRING, 357 | state STRING, 358 | zip_code STRING, 359 | home_phone STRING, 360 | mobile STRING, 361 | email STRING, 362 | ssn STRING, 363 | job_title STRING, 364 | create_date STRING, 365 | cust_id BIGINT) 366 | USING iceberg 367 | OPTIONS ( 368 | 'write.object-storage.enabled'=true, 369 | 'write.data.path'='s3://iceberg-data') 370 | PARTITIONED BY (state); 371 | 372 | CREATE TABLE icecatalog.icecatalog.transactions ( 373 | transact_id STRING, 374 | transaction_date STRING, 375 | item_desc STRING, 376 | barcode STRING, 377 | category STRING, 378 | amount STRING, 379 | cust_id BIGINT) 380 | USING iceberg 381 | OPTIONS ( 382 | 'write.object-storage.enabled'=true, 383 | 'write.data.path'='s3://iceberg-data'); 384 | ``` 385 | 386 | --- 387 | 388 | ### Examine the bucket in Minio from the GUI 389 | * It wrote out all the metadata and files into our object storage from the Apache Iceberg Catalog we created. 
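You can also verify this without the GUI. Here is a minimal `boto3` sketch (same assumptions as the earlier Minio example: `boto3` installed and the Minio keys exported as environment variables from `~/minio-output.properties`) that lists the objects Iceberg just wrote into the `iceberg-data` bucket; the screenshot below shows the same thing from the Minio console.

```
# sketch: list the objects Iceberg wrote into the warehouse bucket
# assumes boto3 is installed and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are
# exported from ~/minio-output.properties
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# you should see *.metadata.json objects for each table we just created
resp = s3.list_objects_v2(Bucket="iceberg-data")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```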
390 | 391 | --- 392 | ![](./images/bucket_first_table_metadata_view.png) 393 | --- 394 | 395 | #### Insert some records with our SparkSQL CLI: 396 | * In this step we will load up some JSON records from a file created during setup. 397 | * We will create a temporary view against this JSON file and then load the file with an INSERT statement. 398 | --- 399 | 400 | ##### Create temporary view statement: 401 | ``` 402 | CREATE TEMPORARY VIEW customerView 403 | USING org.apache.spark.sql.json 404 | OPTIONS ( 405 | path "/opt/spark/input/customers.json" 406 | ); 407 | ``` 408 | 409 | ##### Query our temporary view with this statement: 410 | ``` 411 | SELECT cust_id, first_name, last_name FROM customerView; 412 | ``` 413 | 414 | ##### Sample Output: 415 | ``` 416 | cust_id first_name last_name 417 | 10 Brenda Thompson 418 | 11 Jennifer Anderson 419 | 12 William Jefferson 420 | 13 Jack Romero 421 | 14 Robert Johnson 422 | Time taken: 0.173 seconds, Fetched 5 row(s) 423 | 424 | ``` 425 | 426 | --- 427 | ##### Query our customer table before we load data to it: 428 | ``` 429 | SELECT cust_id, first_name, last_name FROM icecatalog.icecatalog.customer; 430 | ``` 431 | 432 | ##### Sample Output: 433 | 434 | ``` 435 | cust_id first_name last_name 436 | Time taken: 0.111 seconds 437 | 438 | ``` 439 | 440 | ##### Load the existing Iceberg table (created earlier) with an `INSERT as SELECT` type of query: 441 | ``` 442 | INSERT INTO icecatalog.icecatalog.customer 443 | SELECT 444 | first_name, 445 | last_name, 446 | street_address, 447 | city, 448 | state, 449 | zip_code, 450 | home_phone, 451 | mobile, 452 | email, 453 | ssn, 454 | job_title, 455 | create_date, 456 | cust_id 457 | FROM customerView; 458 | 459 | ``` 460 | 461 | --- 462 | ##### Query our customer table after we have loaded this JSON file: 463 | ``` 464 | SELECT cust_id, first_name, last_name FROM icecatalog.icecatalog.customer; 465 | ``` 466 | 467 | ##### Sample Output: 468 | 469 | ``` 470 | cust_id first_name last_name 471 | 10 Brenda Thompson 472 | 11 Jennifer Anderson 473 | 13 Jack Romero 474 | 14 Robert Johnson 475 | 12 William Jefferson 476 | Time taken: 0.262 seconds, Fetched 5 row(s) 477 | 478 | 479 | ``` 480 | 481 | --- 482 | 483 | ### Now let's run a more advanced query: 484 | 485 | Let's add and update some rows in one step with an example `MERGE` statement. This will create a view on top of a JSON file and then run a query that updates existing rows when they match on the field `cust_id` and appends new rows to our `customer` table when they don't, all in the same query.
486 | 487 | --- 488 | ##### Create temporary view statement: 489 | ``` 490 | CREATE TEMPORARY VIEW mergeCustomerView 491 | USING org.apache.spark.sql.json 492 | OPTIONS ( 493 | path "/opt/spark/input/update_customers.json" 494 | ); 495 | ``` 496 | 497 | ##### Merge records from a json file: 498 | ``` 499 | MERGE INTO icecatalog.icecatalog.customer c 500 | USING (SELECT 501 | first_name, 502 | last_name, 503 | street_address, 504 | city, 505 | state, 506 | zip_code, 507 | home_phone, 508 | mobile, 509 | email, 510 | ssn, 511 | job_title, 512 | create_date, 513 | cust_id 514 | FROM mergeCustomerView) j 515 | ON c.cust_id = j.cust_id 516 | WHEN MATCHED THEN UPDATE SET 517 | c.first_name = j.first_name, 518 | c.last_name = j.last_name, 519 | c.street_address = j.street_address, 520 | c.city = j.city, 521 | c.state = j.state, 522 | c.zip_code = j.zip_code, 523 | c.home_phone = j.home_phone, 524 | c.mobile = j.mobile, 525 | c.email = j.email, 526 | c.ssn = j.ssn, 527 | c.job_title = j.job_title, 528 | c.create_date = j.create_date 529 | WHEN NOT MATCHED THEN INSERT *; 530 | ``` 531 | --- 532 | 533 | --- 534 | ##### Query our customer table after running our merge query: 535 | ``` 536 | SELECT cust_id, first_name, last_name FROM icecatalog.icecatalog.customer ORDER BY cust_id; 537 | ``` 538 | 539 | ##### Sample Output: 540 | 541 | ``` 542 | cust_id first_name last_name 543 | 10 Caitlyn Rogers 544 | 11 Brittany Williams 545 | 12 Victor Gordon 546 | 13 Shelby Martinez 547 | 14 Corey Bridges 548 | 15 Benjamin Rocha 549 | 16 Jonathan Lawrence 550 | 17 Thomas Taylor 551 | 18 Jeffrey Williamson 552 | 19 Joseph Mccullough 553 | 20 Evan Kirby 554 | 21 Teresa Pittman 555 | 22 Alicia Byrd 556 | 23 Kathleen Ellis 557 | 24 Tony Lee 558 | Time taken: 0.381 seconds, Fetched 15 row(s) 559 | 560 | ``` 561 | * Note that the values for customers with `cust_id` between 10-14 have new updated information. 562 | 563 | --- 564 | 565 | ### Explore Time Travel with Apache Iceberg: 566 | 567 | --- 568 | So far in our workshop we have loaded some tables and run some `CRUD` operations with our platform. In this exercise, we are going to see a really cool feature called `Time Travel`. 569 | 570 | Time travel queries refer to the ability to query data as it existed at a specific point in time in the past. This feature is useful in a variety of scenarios, such as auditing, disaster recovery, and debugging. 571 | 572 | In a database or data warehousing system with time travel capability, historical data is stored along with a timestamp, allowing users to query the data as it existed at a specific time. This is achieved by either using a separate historical store or by maintaining multiple versions of the data in the same store. 573 | 574 | Time travel queries are typically implemented using tools like snapshots, temporal tables, or versioned data stores. These tools allow users to roll back to a previous version of the data and access it as if it were the current version. Time travel queries can also be combined with other data management features, such as data compression, data partitioning, and indexing, to improve performance and make historical data more easily accessible. 575 | 576 | In order to run a time travel query we need some metadata to pass into our query. The metadata exists in our catalog and it can be accessed with a query. The following query will return some metadata from our database. 577 | 578 | * your results will be slightly different. 
579 | 580 | ##### Query from SparkSQL CLI: 581 | ``` 582 | SELECT 583 | committed_at, 584 | snapshot_id, 585 | parent_id 586 | FROM icecatalog.icecatalog.customer.snapshots 587 | ORDER BY committed_at; 588 | ``` 589 | --- 590 | 591 | #### Expected Output: 592 | 593 | ``` 594 | committed_at snapshot_id parent_id 595 | 2023-01-26 16:58:31.873 2216914164877191507 NULL 596 | 2023-01-26 17:00:18.585 3276759594719593733 2216914164877191507 597 | Time taken: 0.195 seconds, Fetched 2 row(s) 598 | ``` 599 | 600 | --- 601 | 602 | ### `Time Travel` example from data in our customer table: 603 | 604 | When we loaded our `customer` table initially it had only 5 rows of data. We then ran a `MERGE` query to update some existing rows and insert new rows. With this query, we can see our table results as they existed in that initial phase before the `MERGE`. 605 | 606 | We need to grab the `snapshot_id` value from the query above and edit the following query with your `snapshot_id` value. 607 | 608 | The query of the table after our first INSERT statement: 609 | * replace `<snapshot_id>` with your value: 610 | 611 | In this step, we will get results that show the data as it was originally loaded. 612 | ``` 613 | SELECT 614 | cust_id, 615 | first_name, 616 | last_name, 617 | create_date 618 | FROM icecatalog.icecatalog.customer 619 | VERSION AS OF <snapshot_id> 620 | ORDER BY cust_id; 621 | 622 | ``` 623 | 624 | ##### Expected Output: 625 | 626 | ``` 627 | 628 | cust_id first_name last_name create_date 629 | 10 Brenda Thompson 2022-12-25 01:10:43 630 | 11 Jennifer Anderson 2022-12-03 04:50:07 631 | 12 William Jefferson 2022-11-28 08:17:10 632 | 13 Jack Romero 2022-12-11 19:09:30 633 | 14 Robert Johnson 2022-12-08 05:28:56 634 | Time taken: 0.349 seconds, Fetched 5 row(s) 635 | 636 | ``` 637 | --- 638 | 639 | ### Example from data in our customer table after running our `MERGE` statement: 640 | 641 | In this step, we will see sample results from our customer table after we ran the `MERGE` step earlier. It will show the updated existing rows and our new rows. 642 | * remember to replace `<snapshot_id>` with the `snapshot_id` from your table metadata. 643 | 644 | ##### Query: 645 | ``` 646 | SELECT 647 | cust_id, 648 | first_name, 649 | last_name, 650 | create_date 651 | FROM icecatalog.icecatalog.customer 652 | 653 | VERSION AS OF <snapshot_id> 654 | ORDER BY cust_id; 655 | ``` 656 | 657 | ##### Expected Output: 658 | 659 | ``` 660 | cust_id first_name last_name create_date 661 | 10 Caitlyn Rogers 2022-12-16 03:19:35 662 | 11 Brittany Williams 2022-12-04 23:29:48 663 | 12 Victor Gordon 2022-12-22 18:03:13 664 | 13 Shelby Martinez 2022-11-27 16:10:42 665 | 14 Corey Bridges 2022-12-11 23:29:52 666 | 15 Benjamin Rocha 2022-12-10 07:39:35 667 | 16 Jonathan Lawrence 2022-11-27 23:44:14 668 | 17 Thomas Taylor 2022-12-07 12:33:45 669 | 18 Jeffrey Williamson 2022-12-13 16:58:43 670 | 19 Joseph Mccullough 2022-12-05 05:33:56 671 | 20 Evan Kirby 2022-12-20 14:23:43 672 | 21 Teresa Pittman 2022-12-26 05:14:24 673 | 22 Alicia Byrd 2022-12-17 18:20:51 674 | 23 Kathleen Ellis 2022-12-08 04:01:44 675 | 24 Tony Lee 2022-12-24 17:10:32 676 | Time taken: 0.432 seconds, Fetched 15 row(s) 677 | ``` 678 | 679 | --- 680 | 681 | ### Exit out of the `spark-sql` cli. 682 | 683 | ``` 684 | exit; 685 | 686 | ``` 687 | --- 688 | ### Explore Iceberg operations using Spark DataFrames. 689 | 690 | We will use `pyspark` in this example and load our `Transactions` table with a PySpark DataFrame.
691 | 692 | ##### Notes: 693 | * pyspark isn't as feature-rich as the spark-sql client (in future versions it should catch up). For example, it doesn't support the `MERGE` example we tested earlier. 694 | 695 | --- 696 | 697 | ### Start `pyspark` cli 698 | * run this in a terminal window 699 | ``` 700 | cd $SPARK_HOME 701 | pyspark 702 | ``` 703 | 704 | --- 705 | 706 | #### Expected Output: 707 | 708 | --- 709 | 710 | ``` 711 | Python 3.8.10 (default, Nov 14 2022, 12:59:47) 712 | [GCC 9.4.0] on linux 713 | Type "help", "copyright", "credits" or "license" for more information. 714 | 23/01/26 01:44:27 WARN Utils: Your hostname, spark-ice2 resolves to a loopback address: 127.0.1.1; using 192.168.1.167 instead (on interface eth0) 715 | 23/01/26 01:44:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 716 | Setting default log level to "WARN". 717 | To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 718 | 23/01/26 01:44:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 719 | Welcome to 720 | ____ __ 721 | / __/__ ___ _____/ /__ 722 | _\ \/ _ \/ _ `/ __/ '_/ 723 | /__ / .__/\_,_/_/ /_/\_\ version 3.3.1 724 | /_/ 725 | 726 | Using Python version 3.8.10 (default, Nov 14 2022 12:59:47) 727 | Spark context Web UI available at http://192.168.1.167:4040 728 | Spark context available as 'sc' (master = local[*], app id = local-1674697469102). 729 | SparkSession available as 'spark'. 730 | >>> 731 | 732 | ``` 733 | 734 | ### In this section we will load our `Transactions` data from a JSON file using `PySpark` 735 | 736 | * code blocks are commented: 737 | * copy and paste this block into our pyspark session in a terminal window: 738 | --- 739 | 740 | ``` 741 | # import SparkSession 742 | from pyspark.sql import SparkSession 743 | 744 | # create SparkSession 745 | spark = SparkSession.builder \ 746 | .appName("Python Spark SQL example") \ 747 | .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19") \ 748 | .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 749 | .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 750 | .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 751 | .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 752 | .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 753 | .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 754 | .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 755 | .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 756 | .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 757 | .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 758 | .config("spark.eventLog.enabled", "true") \ 759 | .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 760 | .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 761 | .config("spark.sql.catalogImplementation", "in-memory") \ 762 | .getOrCreate() 763 | 764 | # A JSON dataset is pointed to by 'path' variable 765 | path = "/opt/spark/input/transactions.json" 766 | 767 | # read json into the DataFrame 768 | transactionsDF = spark.read.json(path) 769 | 770 | # visualize
the inferred schema 771 | transactionsDF.printSchema() 772 | 773 | # print out the dataframe in this cli 774 | transactionsDF.show() 775 | 776 | # Append these transactions to the table we created in an earlier step `icecatalog.icecatalog.transactions` 777 | transactionsDF.writeTo("icecatalog.icecatalog.transactions").append() 778 | 779 | # stop the sparkSession 780 | spark.stop() 781 | 782 | # Exit out of the editor: 783 | quit(); 784 | 785 | ``` 786 | --- 787 | 788 | ##### Expected Output: 789 | 790 | --- 791 | 792 | ``` 793 | >>> # import SparkSession 794 | >>> from pyspark.sql import SparkSession 795 | >>> 796 | >>> # create SparkSession 797 | >>> spark = SparkSession.builder \ 798 | ... .appName("Python Spark SQL example") \ 799 | ... .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,software.amazon.awssdk:bundle:2.19.19,software.amazon.awssdk:url-connection-client:2.19.19") \ 800 | ... .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ 801 | ... .config("spark.sql.catalog.icecatalog", "org.apache.iceberg.spark.SparkCatalog") \ 802 | ... .config("spark.sql.catalog.icecatalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \ 803 | ... .config("spark.sql.catalog.icecatalog.uri", "jdbc:postgresql://127.0.0.1:5432/icecatalog") \ 804 | ... .config("spark.sql.catalog.icecatalog.jdbc.user", "icecatalog") \ 805 | ... .config("spark.sql.catalog.icecatalog.jdbc.password", "supersecret1") \ 806 | ... .config("spark.sql.catalog.icecatalog.warehouse", "s3://iceberg-data") \ 807 | ... .config("spark.sql.catalog.icecatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \ 808 | ... .config("spark.sql.catalog.icecatalog.s3.endpoint", "http://127.0.0.1:9000") \ 809 | ... .config("spark.sql.catalog.sparkcatalog", "icecatalog") \ 810 | ... .config("spark.eventLog.enabled", "true") \ 811 | ... .config("spark.eventLog.dir", "/opt/spark/spark-events") \ 812 | ... .config("spark.history.fs.logDirectory", "/opt/spark/spark-events") \ 813 | ... .config("spark.sql.catalogImplementation", "in-memory") \ 814 | ... .getOrCreate() 815 | 23/01/26 02:04:13 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect. 
816 | >>> 817 | >>> # A JSON dataset is pointed to by path 818 | >>> path = "/opt/spark/input/transactions.json" 819 | >>> 820 | >>> # read json into the DataFrame 821 | >>> transactionsDF = spark.read.json(path) 822 | >>> 823 | >>> # visualize the inferred schema 824 | >>> transactionsDF.printSchema() 825 | root 826 | |-- amount: double (nullable = true) 827 | |-- barcode: string (nullable = true) 828 | |-- category: string (nullable = true) 829 | |-- cust_id: long (nullable = true) 830 | |-- item_desc: string (nullable = true) 831 | |-- transact_id: string (nullable = true) 832 | |-- transaction_date: string (nullable = true) 833 | 834 | >>> 835 | >>> # print out the dataframe in this cli 836 | >>> transactionsDF.show() 837 | +------+-------------+--------+-------+--------------------+--------------------+-------------------+ 838 | |amount| barcode|category|cust_id| item_desc| transact_id| transaction_date| 839 | +------+-------------+--------+-------+--------------------+--------------------+-------------------+ 840 | | 50.63|4541397840276| purple| 10| Than explain cover.|586fef8b-00da-421...|2023-01-08 00:11:25| 841 | | 95.37|2308832642138| green| 10| Necessary body oil.|e8809684-7997-4cc...|2023-01-23 17:23:04| 842 | | 9.71|1644304420912| teal| 10|Recent property a...|18bb3472-56c0-48e...|2023-01-18 18:12:44| 843 | | 92.69|6996277154185| white| 10|Entire worry hosp...|a520859f-7cde-429...|2023-01-03 13:45:03| 844 | | 21.89|7318960584434| purple| 11|Finally kind coun...|3922d6a1-d112-411...|2022-12-29 09:00:26| 845 | | 24.97|4676656262244| olive| 11|Strong likely spe...|fe40fd4c-6111-49b...|2023-01-19 03:47:12| 846 | | 68.98|2299973443220| aqua| 14|Store blue confer...|331def13-f644-409...|2023-01-13 10:07:46| 847 | | 66.5|1115162814798| silver| 14|Court dog method ...|57cdb9b6-d370-4aa...|2022-12-29 06:04:30| 848 | | 26.96|5617858920203| gray| 14|Black director af...|9124d0ef-9374-441...|2023-01-11 19:20:39| 849 | | 11.24|1829792571456| yellow| 14|Lead today best p...|d418abe1-63dc-4ca...|2022-12-31 03:16:32| 850 | | 6.82|9406622469286| aqua| 15|Power itself job ...|422a413a-590b-4f7...|2023-01-09 19:09:29| 851 | | 89.39|7753423715275| black| 15|Material risk first.|bc4125fc-08cb-4ab...|2023-01-23 03:24:02| 852 | | 63.49|2242895060556| black| 15|Foreign strong wa...|ff4e4369-bcef-438...|2022-12-29 22:12:09| 853 | | 49.7|3010754625845| black| 15| Own book move for.|d00a9e7a-0cea-428...|2023-01-12 21:42:32| 854 | | 10.45|7885711282777| green| 15|Without beat then...|33afa171-a652-429...|2023-01-05 04:33:24| 855 | | 34.12|8802078025372| aqua| 16| Site win movie.|cfba6338-f816-4b7...|2023-01-07 12:22:34| 856 | | 96.14|9389514040254| olive| 16|Agree enjoy four ...|5223b620-5eef-4fa...|2022-12-28 17:06:04| 857 | | 3.38|6079280166809| blue| 16|Concern his debat...|33725df2-e14b-45a...|2023-01-17 20:53:25| 858 | | 2.67|5723406697760| yellow| 16|Republican sure r...|6a707466-7b43-4af...|2023-01-02 15:40:17| 859 | | 68.85|0555188918000| black| 16|Sense recently th...|5a31670b-9b68-43f...|2023-01-12 03:21:06| 860 | +------+-------------+--------+-------+--------------------+--------------------+-------------------+ 861 | only showing top 20 rows 862 | 863 | >>> transactionsDF.writeTo("icecatalog.icecatalog.transactions").append() 864 | SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 865 | SLF4J: Defaulting to no-operation (NOP) logger implementation 866 | SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
867 | >>> spark.stop() 868 | >>> quit(); 869 | 870 | ``` 871 | 872 | --- 873 | 874 | ### Explore our `Transactions` tables within SparkSQL 875 | 876 | Let's open our spark-sql cli again (follow the same steps as above) and run the following query to join our 2 tables and view some sample data. 877 | 878 | * Run these commands in a new spark-sql session in your terminal. 879 | --- 880 | 881 | ##### Query: 882 | ``` 883 | SELECT 884 | c.cust_id 885 | , c.first_name 886 | , c.last_name 887 | , t.transact_id 888 | , t.item_desc 889 | , t.amount 890 | FROM 891 | icecatalog.icecatalog.customer c 892 | , icecatalog.icecatalog.transactions t 893 | INNER JOIN icecatalog.icecatalog.customer cj ON c.cust_id = t.cust_id 894 | LIMIT 20; 895 | ``` 896 | 897 | --- 898 | 899 | ##### Expected Output: 900 | 901 | ``` 902 | cust_id first_name last_name transact_id item_desc amount 903 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 904 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 905 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 906 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 907 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 908 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 909 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 910 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 911 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 912 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 913 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 914 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 915 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 916 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 917 | 10 Caitlyn Rogers 586fef8b-00da-4216-832a-a0eb5211b54a Than explain cover. 50.63 918 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 919 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 920 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 921 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 922 | 10 Caitlyn Rogers e8809684-7997-4ccf-96df-02fd57ca9d6f Necessary body oil. 95.37 923 | 924 | ``` 925 | 926 | --- 927 | --- 928 | 929 | ## Summary: 930 | 931 | --- 932 | Using Apache Spark with Apache Iceberg can provide many benefits for big data processing and data lake management. Apache Spark is a fast and flexible data processing engine that can be used for a variety of big data use cases, such as batch processing, streaming, and machine learning. By integrating with Apache Iceberg, Spark can leverage Iceberg's table abstraction, versioning capabilities, and data discovery features to manage large-scale data lakes with increased efficiency, reliability, and scalability. 933 | 934 | Using Apache Spark with Apache Iceberg allows organizations to leverage the benefits of Spark's distributed processing capabilities, while at the same time reducing the complexity of managing large-scale data lakes.
Additionally, the integration of Spark and Iceberg provides the ability to perform complex data processing operations while still providing data management capabilities such as schema evolution, versioning, and data discovery. 935 | 936 | Finally, as both Spark and Iceberg are open-source projects, organizations can benefit from a large and active community of developers who are contributing to the development of these technologies. This makes it easier for organizations to adopt and use these tools, and to quickly resolve any issues that may arise. 937 | 938 | --- 939 | #### Final Thoughts: 940 | --- 941 | 942 | In a series of upcoming workshops, I will build out and document some new technologies that can be integrated with legacy solutions deployed by most organizations today. It will give you a roadmap into how you can gain insights (in near real-time) from data produced in your legacy systems with minimal impact on those servers. We will use a Change Data Capture (CDC) approach to pull the data from log files produced by the database providers and deliver it to our Apache Iceberg solution we built today. 943 | 944 | --- 945 | --- 946 | 947 | If you have made it this far, I want to thank you for spending your time reviewing the materials. Please give me a 'Star' at the top of this page if you found it useful. 948 | 949 | Click here to return to workshop 2: [`Workshop 2 Exercises`](./README.md/#what-is-redpanda). 950 | 951 | --- 952 | --- 953 | --- 954 | --- 955 | 956 | ![](./images/drunk-cheers.gif) 957 | 958 | [Tim Lepple](www.linkedin.com/in/tim-lepple-9141452) 959 | 960 | --- 961 | --- 962 | --- 963 | --- 964 | 965 | 966 | --------------------------------------------------------------------------------