├── containers
│   └── setup
│       ├── requirements.txt
│       ├── Dockerfile
│       └── create_buckets.py
├── etc
│   ├── catalog
│   │   ├── tpch.properties
│   │   ├── tpcds.properties
│   │   ├── iceberg.properties
│   │   └── minio.properties
│   ├── log.properties
│   ├── node.properties
│   ├── config.properties
│   └── jvm.config
├── images
│   ├── dbeaver.png
│   └── tpch_erd.png
├── .github
│   ├── CODEOWNERS
│   └── ISSUE_TEMPLATE
│       └── bug_report.md
├── Makefile
├── conf
│   ├── core-site.xml
│   └── metastore-site.xml
├── docker-compose.yml
└── README.md
/containers/setup/requirements.txt:
--------------------------------------------------------------------------------
1 | boto3==1.34.104
2 |
--------------------------------------------------------------------------------
/etc/catalog/tpch.properties:
--------------------------------------------------------------------------------
1 | connector.name=tpch
2 | tpch.splits-per-node=4
--------------------------------------------------------------------------------
/etc/catalog/tpcds.properties:
--------------------------------------------------------------------------------
1 | connector.name=tpcds
2 | tpcds.splits-per-node=4
--------------------------------------------------------------------------------
/etc/log.properties:
--------------------------------------------------------------------------------
1 | # Enable verbose logging from Trino
2 | #io.trino=DEBUG
--------------------------------------------------------------------------------
/images/dbeaver.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josephmachado/analytical_dp_with_sql/HEAD/images/dbeaver.png
--------------------------------------------------------------------------------
/images/tpch_erd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josephmachado/analytical_dp_with_sql/HEAD/images/tpch_erd.png
--------------------------------------------------------------------------------
/etc/node.properties:
--------------------------------------------------------------------------------
1 | node.environment=docker
2 | node.data-dir=/data/trino
3 | plugin.dir=/usr/lib/trino/plugin
4 | spiller-spill-path=/tmp
5 | max-spill-per-node=2GB
6 | query-max-spill-per-node=1GB
--------------------------------------------------------------------------------
/etc/config.properties:
--------------------------------------------------------------------------------
1 | # Single-node install config
2 | coordinator=true
3 | node-scheduler.include-coordinator=true
4 | http-server.http.port=8080
5 | discovery-server.enabled=true
6 | discovery.uri=http://localhost:8080
--------------------------------------------------------------------------------
/etc/catalog/iceberg.properties:
--------------------------------------------------------------------------------
1 | connector.name=iceberg
2 | hive.metastore.uri=thrift://hive-metastore:9083
3 | hive.s3.path-style-access=true
4 | hive.s3.endpoint=http://minio:9000
5 | hive.s3.aws-access-key=minio
6 | hive.s3.aws-secret-key=minio123
--------------------------------------------------------------------------------
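For orientation, once the stack is up this catalog can be exercised from the Trino CLI roughly as sketched below. This is illustrative and not part of the repo: the `icebergwarehouse` bucket name comes from the bucket-creation setup, while the schema and table names are made up.

```sql
-- Sketch: create an Iceberg table backed by MinIO via the catalog defined above
CREATE SCHEMA IF NOT EXISTS iceberg.demo
WITH (location = 's3a://icebergwarehouse/demo');

CREATE TABLE iceberg.demo.orders
WITH (format = 'PARQUET') AS
SELECT * FROM tpch.tiny.orders;
```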
/.github/CODEOWNERS:
--------------------------------------------------------------------------------
1 | # These owners will be the default owners for everything in
2 | # the repo. Unless a later match takes precedence,
3 | # @josephmachado will be requested for
4 | # review when someone opens a pull request.
5 |
6 | * @josephmachado
--------------------------------------------------------------------------------
/etc/jvm.config:
--------------------------------------------------------------------------------
1 | -server
2 | -Xmx16G
3 | -XX:-UseBiasedLocking
4 | -XX:+UseG1GC
5 | -XX:G1HeapRegionSize=32M
6 | -XX:+ExplicitGCInvokesConcurrent
7 | -XX:+HeapDumpOnOutOfMemoryError
8 | -XX:+UseGCOverheadLimit
9 | -XX:+ExitOnOutOfMemoryError
10 | -XX:ReservedCodeCacheSize=256M
11 | -Xlog:gc*,safepoint::time,level,tags,tid
12 | -Djdk.attach.allowAttachSelf=true
13 | -Djdk.nio.maxCachedBufferSize=2000000
14 |
--------------------------------------------------------------------------------
/containers/setup/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.9.5
2 |
3 | # set up location of code
4 | WORKDIR /code
5 | ENV PYTHONPATH=/code/src
6 |
7 | # install python requirements
8 | COPY ./containers/setup/requirements.txt requirements.txt
9 | RUN pip install -r requirements.txt
10 |
11 | # Copy the bucket creation script
12 | # into the container image
13 | COPY ./containers/setup/create_buckets.py /code/create_buckets.py
14 |
15 | ENTRYPOINT ["tail", "-f", "/dev/null"]
16 |
--------------------------------------------------------------------------------
/etc/catalog/minio.properties:
--------------------------------------------------------------------------------
1 | connector.name=hive
2 | hive.metastore.uri=thrift://hive-metastore:9083
3 | hive.s3.path-style-access=true
4 | hive.s3.endpoint=http://minio:9000
5 | hive.s3.aws-access-key=minio
6 | hive.s3.aws-secret-key=minio123
7 | hive.non-managed-table-writes-enabled=true
8 | hive.s3select-pushdown.enabled=true
9 | hive.storage-format=ORC
10 | hive.metastore.thrift.delete-files-on-drop=true
11 |
12 | hive.allow-drop-table=true
13 | hive.max-partitions-per-writers=5000
14 |
--------------------------------------------------------------------------------
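As a rough illustration (not part of the repo) of how this `minio` hive catalog can be used once everything is running, you can materialize a TPC-H table into MinIO. The `warehouse` bucket name comes from the bucket-creation setup; the schema and table names below are made up:

```sql
-- Sketch: write a TPC-H table as ORC files into MinIO via the hive connector
CREATE SCHEMA IF NOT EXISTS minio.tiny
WITH (location = 's3a://warehouse/tiny');

CREATE TABLE minio.tiny.orders
WITH (format = 'ORC') AS
SELECT * FROM tpch.tiny.orders;
```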
/Makefile:
--------------------------------------------------------------------------------
1 | setup-containers:
2 | 	docker volume rm --force minio-data
3 | 	docker compose up -d --build
4 |
5 | create-buckets:
6 | 	docker exec createbucketsbkp python /code/create_buckets.py
7 |
8 | up: setup-containers create-buckets
9 |
10 | down:
11 | 	docker compose down
12 |
13 | trino:
14 | 	docker container exec -it trino-coordinator trino
15 |
16 | minio:
17 | 	open "http://localhost:9001"
18 |
19 | trino-ui:
20 | 	open "http://localhost:8080"
21 |
22 | logs:
23 | 	docker logs trino-coordinator
24 |
25 | metadata-db:
26 | 	docker exec -ti mariadb /usr/bin/mariadb -padmin
27 |
28 | restart: down up
29 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **To Reproduce**
14 | Steps to reproduce the behavior:
15 | 1. Go to '...'
16 | 2. Click on '....'
17 | 3. Scroll down to '....'
18 | 4. See error
19 |
20 | **Expected behavior**
21 | A clear and concise description of what you expected to happen.
22 |
23 | **Screenshots**
24 | If applicable, add screenshots to help explain your problem.
25 |
26 | **Desktop (please complete the following information):**
27 | - OS: [e.g. iOS]
28 | - Browser [e.g. chrome, safari]
29 | - Version [e.g. 22]
30 |
31 | **Smartphone (please complete the following information):**
32 | - Device: [e.g. iPhone6]
33 | - OS: [e.g. iOS8.1]
34 | - Browser [e.g. stock browser, safari]
35 | - Version [e.g. 22]
36 |
37 | **Additional context**
38 | Add any other context about the problem here.
39 |
--------------------------------------------------------------------------------
/conf/core-site.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0"?>
2 | <configuration>
3 |     <property>
4 |         <name>fs.defaultFS</name>
5 |         <value>s3a://minio:9000</value>
6 |     </property>
7 |     <property>
8 |         <name>fs.s3a.connection.ssl.enabled</name>
9 |         <value>false</value>
10 |     </property>
11 |     <property>
12 |         <name>fs.s3a.endpoint</name>
13 |         <value>http://minio:9000</value>
14 |     </property>
15 |     <property>
16 |         <name>fs.s3a.access.key</name>
17 |         <value>minio</value>
18 |     </property>
19 |     <property>
20 |         <name>fs.s3a.secret.key</name>
21 |         <value>minio123</value>
22 |     </property>
23 |     <property>
24 |         <name>fs.s3a.path.style.access</name>
25 |         <value>true</value>
26 |     </property>
27 |     <property>
28 |         <name>fs.s3a.impl</name>
29 |         <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
30 |     </property>
31 | </configuration>
--------------------------------------------------------------------------------
/conf/metastore-site.xml:
--------------------------------------------------------------------------------
1 | <configuration>
2 |     <property>
3 |         <name>metastore.thrift.uris</name>
4 |         <value>thrift://hive-metastore:9083</value>
5 |         <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
6 |     </property>
7 |     <property>
8 |         <name>metastore.task.threads.always</name>
9 |         <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask</value>
10 |     </property>
11 |     <property>
12 |         <name>metastore.expression.proxy</name>
13 |         <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
14 |     </property>
15 |     <property>
16 |         <name>javax.jdo.option.ConnectionDriverName</name>
17 |         <value>com.mysql.cj.jdbc.Driver</value>
18 |     </property>
19 |     <property>
20 |         <name>javax.jdo.option.ConnectionURL</name>
21 |         <value>jdbc:mysql://mariadb:3306/metastore_db</value>
22 |     </property>
23 |     <property>
24 |         <name>javax.jdo.option.ConnectionUserName</name>
25 |         <value>admin</value>
26 |     </property>
27 |     <property>
28 |         <name>javax.jdo.option.ConnectionPassword</name>
29 |         <value>admin</value>
30 |     </property>
31 |     <property>
32 |         <name>fs.s3a.access.key</name>
33 |         <value>minio</value>
34 |     </property>
35 |     <property>
36 |         <name>fs.s3a.secret.key</name>
37 |         <value>minio123</value>
38 |     </property>
39 |     <property>
40 |         <name>fs.s3a.endpoint</name>
41 |         <value>http://minio:9000</value>
42 |     </property>
43 |     <property>
44 |         <name>fs.s3a.path.style.access</name>
45 |         <value>true</value>
46 |     </property>
47 | </configuration>
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
1 | version: '3.7'
2 | services:
3 |   trino-coordinator:
4 |     image: 'trinodb/trino:422'
5 |     hostname: trino-coordinator
6 |     container_name: trino-coordinator
7 |     ports:
8 |       - '8080:8080'
9 |     volumes:
10 |       - ./etc:/etc/trino
11 |     networks:
12 |       - trino-network
13 |
14 |   mariadb:
15 |     image: 'mariadb:10.3.32'
16 |     hostname: mariadb
17 |     container_name: mariadb
18 |     ports:
19 |       - '3306:3306'
20 |     environment:
21 |       MYSQL_ROOT_PASSWORD: admin
22 |       MYSQL_USER: admin
23 |       MYSQL_PASSWORD: admin
24 |       MYSQL_DATABASE: metastore_db
25 |     networks:
26 |       - trino-network
27 |
28 |   hive-metastore:
29 |     image: 'bitsondatadev/hive-metastore:latest'
30 |     hostname: hive-metastore
31 |     container_name: hive-metastore
32 |     ports:
33 |       - '9083:9083' # Metastore Thrift
34 |     volumes:
35 |       - ./conf/metastore-site.xml:/opt/apache-hive-metastore-3.0.0-bin/conf/metastore-site.xml:ro
36 |     environment:
37 |       METASTORE_DB_HOSTNAME: mariadb
38 |     depends_on:
39 |       - mariadb
40 |     networks:
41 |       - trino-network
42 |
43 |   minio:
44 |     image: 'minio/minio:RELEASE.2023-07-21T21-12-44Z'
45 |     hostname: minio
46 |     container_name: minio
47 |     ports:
48 |       - '9000:9000'
49 |       - '9001:9001'
50 |     environment:
51 |       MINIO_ACCESS_KEY: minio
52 |       MINIO_SECRET_KEY: minio123
53 |     command: server --console-address ":9001" /data
54 |     networks:
55 |       - trino-network
56 |
57 |   createbucketsbkp:
58 |     image: createbucketsbkp
59 |     container_name: createbucketsbkp
60 |     build:
61 |       context: ./
62 |       dockerfile: ./containers/setup/Dockerfile
63 |     networks:
64 |       - trino-network
65 |
66 |
67 |   createbuckets:
68 |     image: minio/mc
69 |     container_name: createbuckets
70 |     depends_on:
71 |       - minio
72 |     entrypoint: >
73 |       /bin/sh -c " /usr/bin/mc config host add myminio http://minio:9000 minio minio123; /usr/bin/mc rm -r --force myminio/tpch myminio/warehouse myminio/icebergwarehouse; /usr/bin/mc mb myminio/tpch myminio/warehouse myminio/icebergwarehouse; /usr/bin/mc policy download myminio/tpch myminio/warehouse myminio/icebergwarehouse; exit 0; "
74 |     networks:
75 |       - trino-network
76 |
77 | networks:
78 |   trino-network:
79 |     driver: bridge
80 |
--------------------------------------------------------------------------------
/containers/setup/create_buckets.py:
--------------------------------------------------------------------------------
1 | import boto3
2 | from botocore.client import Config
3 | from botocore.exceptions import ClientError
4 |
5 |
6 | def create_s3_client(access_key, secret_key, endpoint, region):
7 |     """
8 |     Create a boto3 client configured for Minio or any S3-compatible service.
9 |
10 |     :param access_key: S3 access key
11 |     :param secret_key: S3 secret key
12 |     :param endpoint: Endpoint URL for the S3 service
13 |     :param region: Region to use, defaults to us-east-1
14 |     :return: Configured S3 client
15 |     """
16 |     return boto3.client(
17 |         's3',
18 |         region_name=region,
19 |         endpoint_url=endpoint,
20 |         aws_access_key_id=access_key,
21 |         aws_secret_access_key=secret_key,
22 |         config=Config(signature_version='s3v4'),
23 |     )
24 |
25 |
26 | def create_bucket_if_not_exists(s3_client, bucket_name):
27 |     """
28 |     Check if an S3 bucket exists, and if not, create it.
29 |
30 |     :param s3_client: Configured S3 client
31 |     :param bucket_name: Name of the bucket to create or check
32 |     :return: None
33 |     """
34 |     try:
35 |         s3_client.head_bucket(Bucket=bucket_name)
36 |         print(f"Bucket '{bucket_name}' already exists.")
37 |     except ClientError as e:
38 |         error_code = int(e.response['Error']['Code'])
39 |         if error_code == 404:
40 |             # Bucket does not exist, create it
41 |             try:
42 |                 s3_client.create_bucket(Bucket=bucket_name)
43 |                 print(f"Bucket '{bucket_name}' created.")
44 |             except ClientError as error:
45 |                 print(f"Failed to create bucket: {error}")
46 |         else:
47 |             print(f"Error: {e}")
48 |
49 |
50 | # Credentials and connection info for the local Minio container
51 | access_key = 'minio'
52 | secret_key = 'minio123'
53 | endpoint = 'http://minio:9000'
54 | region = 'us-east-1'
55 |
56 | # Create the client and the buckets used by the catalogs
57 | try:
58 |     s3_client = create_s3_client(access_key, secret_key, endpoint, region)
59 |     for bucket_name in ['tpch', 'rainforest', 'warehouse', 'icebergwarehouse']:
60 |         create_bucket_if_not_exists(s3_client, bucket_name)
61 | except Exception as e:
62 |     print(f"Full catch ({e}); check the bucket creation script at create_buckets.py")
63 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Setup
2 |
3 | Please install the following software:
4 |
5 | 1. [git version >= 2.37.1](https://github.com/git-guides/install-git)
6 | 2. [Docker version >= 20.10.17](https://docs.docker.com/engine/install/) and [Docker compose v2 version >= v2.10.2](https://docs.docker.com/compose/#compose-v2-and-the-new-docker-compose-command). Make sure that docker is running using `docker ps`.
7 |
8 | **Windows users**: please set up WSL and a local Ubuntu virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)** (only Step 1 is necessary). Install the `make` command with `sudo apt install make -y` (if it's not already present).
9 |
10 | All the commands shown below are to be run via the terminal (WSL users should use the Ubuntu terminal). We will use Docker to set up our containers. Clone the lab repository and move into it, as shown below.
11 |
12 | ```bash
13 | git clone \
14 | https://github.com/josephmachado/analytical_dp_with_sql.git
15 | cd analytical_dp_with_sql
16 | ```
17 |
18 | **Note**: If you are using an M1 MacBook (Apple silicon), please follow [the instructions here](https://github.com/josephmachado/analytical_dp_with_sql/issues/4#issuecomment-1426902080) to use the appropriate Docker image.
19 |
20 | We provide some helpful make commands to simplify working with the containers. The commands and their definitions are shown below:
21 |
22 | 1. `make up`: Spin up the Docker containers.
23 | 2. `make trino`: Open the Trino CLI; use `exit` to quit. **This is where you will type your SQL queries**.
24 | 3. `make down`: Stop the Docker containers.
25 |
26 | You can see the commands in [this Makefile](https://github.com/josephmachado/analytical_dp_with_sql/blob/main/Makefile). If your terminal does not support **make** commands, please use the commands in [the Makefile](https://github.com/josephmachado/analytical_dp_with_sql/blob/main/Makefile) directly. All the commands in this book assume that you have the docker containers running.
27 |
28 | In your terminal, do the following:
29 |
30 | ```bash
31 | # Make sure docker is running using docker ps
32 | make up # starts the docker containers
33 | # If you are having issues with existing containers
34 | # stop them all with the following command
35 | # docker rm -f $(docker ps -a -q)
36 |
37 | sleep 60 # wait 1 minute for all the containers to set up
38 | make trino # opens the trino cli
39 | ```
40 |
41 | In Trino, we can connect to multiple databases (called catalogs in Trino). TPC-H is a dataset used to benchmark analytical database performance. Trino's tpch catalog comes preloaded with TPC-H datasets at different sizes: tiny, sf1, sf100, sf300, and so on, where sf stands for scaling factor.
42 |
43 | ```sql
44 | -- run "make trino" or
45 | -- "docker container exec -it trino-coordinator trino"
46 | -- to open trino cli
47 |
48 | USE tpch.tiny;
49 | SHOW tables;
50 | SELECT * FROM orders LIMIT 5;
51 | -- shows five rows, press q to quit the interactive results screen
52 | exit -- quit the cli
53 | ```
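To get a feel for the scale factors, you can compare row counts across schemas. The counts noted below follow the TPC-H sizing rules (the orders table has 1,500,000 rows per unit of scale factor, and tiny corresponds to scale factor 0.01), so treat them as approximate reference points rather than repo output:

```sql
-- Row counts grow with the scale factor
SELECT count(*) FROM tpch.tiny.orders; -- 15,000 rows
SELECT count(*) FROM tpch.sf1.orders;  -- 1,500,000 rows
```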
54 |
55 | **Note**: Run `make trino` or `docker container exec -it trino-coordinator trino` in your terminal to open the Trino CLI. The SQL code shown throughout the book assumes you are running it in the Trino CLI.
56 |
57 | Starting the Docker containers will also start Minio (an S3 alternative); we will use Minio as our data store when explaining efficient data storage.
58 |
59 | **UI**: Open the Trino UI at http://localhost:8080 (username: any word) and Minio (S3 alternative) at http://localhost:9001 (username: minio, password: minio123) in a browser of your choice.
60 |
61 | If you prefer to connect to Trino via a SQL IDE, download [DBeaver](https://dbeaver.io/) (this addresses the issues [mentioned here](https://github.com/josephmachado/analytical_dp_with_sql/issues/8)). Open `DBeaver`, then:
62 |
63 | 1. Click on Database -> New Database Connection
64 | 2. A `Connect to a database` box will open; search for and select `Trino`, then press Next.
65 | 3. Keep the default settings and use `user` as the username. Testing the connection should show a `Connected` message. Click Finish, and you will be able to explore our Trino database.
66 |
67 | 
68 |
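Once connected, DBeaver runs the same SQL as the CLI. For example, this illustrative query against the preloaded tpch catalog is a quick way to verify the connection:

```sql
-- Verify the DBeaver connection against the preloaded TPC-H data
SELECT orderpriority, count(*) AS num_orders
FROM tpch.tiny.orders
GROUP BY orderpriority
ORDER BY num_orders DESC;
```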
69 | # Data Model
70 |
71 | The [TPC-H](https://www.tpc.org/tpch/) data represents a car parts seller's data warehouse, where we record orders, the items that make up each order (lineitem), supplier, customer, part (parts sold), region, nation, and partsupp.
72 |
73 | **Note:** Keep a copy of the data model handy as you follow along; it will help with the examples provided and with answering the exercise questions.
74 |
75 | 
76 |
77 | # Acknowledgments
78 |
79 | We use the [TPC-H](https://www.tpc.org/tpch/) dataset and [Trino](https://trino.io/) as our OLAP DB.
80 |
--------------------------------------------------------------------------------