├── containers
│   └── setup
│       ├── requirements.txt
│       ├── Dockerfile
│       └── create_buckets.py
├── etc
│   ├── catalog
│   │   ├── tpch.properties
│   │   ├── tpcds.properties
│   │   ├── iceberg.properties
│   │   └── minio.properties
│   ├── log.properties
│   ├── node.properties
│   ├── config.properties
│   └── jvm.config
├── images
│   ├── dbeaver.png
│   └── tpch_erd.png
├── .github
│   ├── CODEOWNERS
│   └── ISSUE_TEMPLATE
│       └── bug_report.md
├── Makefile
├── conf
│   ├── core-site.xml
│   └── metastore-site.xml
├── docker-compose.yml
└── README.md

/containers/setup/requirements.txt:
--------------------------------------------------------------------------------
boto3==1.34.104
--------------------------------------------------------------------------------
/etc/catalog/tpch.properties:
--------------------------------------------------------------------------------
connector.name=tpch
tpch.splits-per-node=4
--------------------------------------------------------------------------------
/etc/catalog/tpcds.properties:
--------------------------------------------------------------------------------
connector.name=tpcds
tpcds.splits-per-node=4
--------------------------------------------------------------------------------
/etc/log.properties:
--------------------------------------------------------------------------------
# Enable verbose logging from Trino
#io.trino=DEBUG
--------------------------------------------------------------------------------
/images/dbeaver.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josephmachado/analytical_dp_with_sql/HEAD/images/dbeaver.png
--------------------------------------------------------------------------------
/images/tpch_erd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/josephmachado/analytical_dp_with_sql/HEAD/images/tpch_erd.png
--------------------------------------------------------------------------------
/etc/node.properties:
--------------------------------------------------------------------------------
node.environment=docker
node.data-dir=/data/trino
plugin.dir=/usr/lib/trino/plugin
spiller-spill-path=/tmp
max-spill-per-node=2GB
query-max-spill-per-node=1GB
--------------------------------------------------------------------------------
/etc/config.properties:
--------------------------------------------------------------------------------
#single node install config
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
--------------------------------------------------------------------------------
/etc/catalog/iceberg.properties:
--------------------------------------------------------------------------------
connector.name=iceberg
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.path-style-access=true
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
--------------------------------------------------------------------------------
/.github/CODEOWNERS:
--------------------------------------------------------------------------------
# These owners will be the default owners for everything in
# the repo. Unless a later match takes precedence,
# @josephmachado will be requested for
# review when someone opens a pull request.

* @josephmachado
--------------------------------------------------------------------------------
/etc/jvm.config:
--------------------------------------------------------------------------------
-server
-Xmx16G
-XX:-UseBiasedLocking
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+UseGCOverheadLimit
-XX:+ExitOnOutOfMemoryError
-XX:ReservedCodeCacheSize=256M
-Xlog:gc*,safepoint::time,level,tags,tid
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
--------------------------------------------------------------------------------
/containers/setup/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.9.5

# set up location of code
WORKDIR /code
ENV PYTHONPATH=/code/src

# install python requirements
ADD ./containers/setup/requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Copy create buckets script
# copy repo
COPY ./containers/setup/create_buckets.py /code/create_buckets.py

ENTRYPOINT ["tail", "-f", "/dev/null"]
--------------------------------------------------------------------------------
/etc/catalog/minio.properties:
--------------------------------------------------------------------------------
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.path-style-access=true
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.non-managed-table-writes-enabled=true
hive.s3select-pushdown.enabled=true
hive.storage-format=ORC
hive.metastore.thrift.delete-files-on-drop=true

hive.allow-drop-table=true
hive.max-partitions-per-writers=5000
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
setup-containers:
	docker volume rm --force minio-data
	docker compose up -d --build

create-buckets:
	docker exec createbucketsbkp python /code/create_buckets.py

up: setup-containers create-buckets

down:
	docker compose down

trino:
	docker container exec -it trino-coordinator trino

minio:
	open "http://localhost:9001"

trino-ui:
	open "http://localhost:8080"

logs:
	docker logs trino-coordinator

metadata-db:
	docker exec -ti mariadb /usr/bin/mariadb -padmin

restart: down up
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
 - OS: [e.g. iOS]
 - Browser [e.g. chrome, safari]
 - Version [e.g. 22]

**Smartphone (please complete the following information):**
 - Device: [e.g. iPhone6]
 - OS: [e.g. iOS8.1]
 - Browser [e.g. stock browser, safari]
 - Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
--------------------------------------------------------------------------------
/conf/core-site.xml:
--------------------------------------------------------------------------------
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>s3a://minio:9000</value>
    </property>

    <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
    </property>

    <property>
        <name>fs.s3a.endpoint</name>
        <value>http://minio:9000</value>
    </property>

    <property>
        <name>fs.s3a.access.key</name>
        <value>minio</value>
    </property>

    <property>
        <name>fs.s3a.secret.key</name>
        <value>minio123</value>
    </property>

    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>

    <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
</configuration>
--------------------------------------------------------------------------------
/conf/metastore-site.xml:
--------------------------------------------------------------------------------
<configuration>
    <property>
        <name>metastore.thrift.uris</name>
        <value>thrift://hive-metastore:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
    <property>
        <name>metastore.task.threads.always</name>
        <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask</value>
    </property>
    <property>
        <name>metastore.expression.proxy</name>
        <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://mariadb:3306/metastore_db</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>admin</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>admin</value>
    </property>

    <property>
        <name>fs.s3a.access.key</name>
        <value>minio</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>minio123</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>http://minio:9000</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
</configuration>
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
version: '3.7'
services:
  trino-coordinator:
    image: 'trinodb/trino:422'
    hostname: trino-coordinator
    container_name: trino-coordinator
    ports:
      - '8080:8080'
    volumes:
      - ./etc:/etc/trino
    networks:
      - trino-network

  mariadb:
    image: 'mariadb:10.3.32'
    hostname: mariadb
    container_name: mariadb
    ports:
      - '3306:3306'
    environment:
      MYSQL_ROOT_PASSWORD: admin
      MYSQL_USER: admin
      MYSQL_PASSWORD: admin
      MYSQL_DATABASE: metastore_db
    networks:
      - trino-network

  hive-metastore:
    image: 'bitsondatadev/hive-metastore:latest'
    hostname: hive-metastore
    container_name: hive-metastore
    ports:
      - '9083:9083' # Metastore Thrift
    volumes:
      - ./conf/metastore-site.xml:/opt/apache-hive-metastore-3.0.0-bin/conf/metastore-site.xml:ro
    environment:
      METASTORE_DB_HOSTNAME: mariadb
    depends_on:
      - mariadb
    networks:
      - trino-network

  minio:
    image: 'minio/minio:RELEASE.2023-07-21T21-12-44Z'
    hostname: minio
    container_name: minio
    ports:
      - '9000:9000'
      - '9001:9001'
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY: minio123
    command: server --console-address ":9001" /data
    networks:
      - trino-network

  createbucketsbkp:
    image: createbucketsbkp
    container_name: createbucketsbkp
    build:
      context: ./
      dockerfile: ./containers/setup/Dockerfile
    networks:
      - trino-network


  createbuckets:
    image: minio/mc
    container_name: createbuckets
    depends_on:
      - minio
    entrypoint: >
      /bin/sh -c " /usr/bin/mc config host add myminio http://minio:9000 minio minio123; /usr/bin/mc rm -r --force myminio/tpch myminio/warehouse myminio/icebergwarehouse; /usr/bin/mc mb myminio/tpch myminio/warehouse myminio/icebergwarehouse; /usr/bin/mc policy download myminio/tpch myminio/warehouse myminio/icebergwarehouse; exit 0; "
    networks:
      - trino-network

networks:
  trino-network:
    driver: bridge
--------------------------------------------------------------------------------
/containers/setup/create_buckets.py:
--------------------------------------------------------------------------------
import boto3
from botocore.exceptions import ClientError
from botocore.client import Config


def create_s3_client(access_key, secret_key, endpoint, region='us-east-1'):
    """
    Create a boto3 client configured for Minio or any S3-compatible service.

    :param access_key: S3 access key
    :param secret_key: S3 secret key
    :param endpoint: Endpoint URL for the S3 service
    :param region: Region to use, defaults to us-east-1
    :return: Configured S3 client
    """
    return boto3.client(
        's3',
        region_name=region,
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        config=Config(signature_version='s3v4')
    )


def create_bucket_if_not_exists(s3_client, bucket_name):
    """
    Check if an S3 bucket exists, and if not, create it.

    :param s3_client: Configured S3 client
    :param bucket_name: Name of the bucket to create or check
    :return: None
    """
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket '{bucket_name}' already exists.")
    except ClientError as e:
        # The error code is a string and is not always numeric
        # (e.g. 'NoSuchBucket'), so compare strings instead of casting to int.
        error_code = e.response['Error']['Code']
        if error_code in ('404', 'NoSuchBucket'):
            # Bucket does not exist, create it
            try:
                s3_client.create_bucket(Bucket=bucket_name)
                print(f"Bucket '{bucket_name}' created.")
            except ClientError as error:
                print(f"Failed to create bucket: {error}")
        else:
            print(f"Error: {e}")


# Credentials and Connection Info
access_key = 'minio'
secret_key = 'minio123'
endpoint = 'http://minio:9000'
region = 'us-east-1'

# Buckets to create; replace with your bucket names
bucket_names = ['tpch', 'rainforest', 'warehouse', 'icebergwarehouse']

# Client creation and usage
try:
    s3_client = create_s3_client(access_key, secret_key, endpoint, region)
    for bucket_name in bucket_names:
        create_bucket_if_not_exists(s3_client, bucket_name)
except Exception as e:
    # Report the underlying error instead of silently swallowing it
    print(f"Bucket creation failed ({e}); check the script at create_buckets.py")
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Setup

Please install the following software:

1. [git version >= 2.37.1](https://github.com/git-guides/install-git)
2. [Docker version >= 20.10.17](https://docs.docker.com/engine/install/) and [Docker Compose v2 version >= v2.10.2](https://docs.docker.com/compose/#compose-v2-and-the-new-docker-compose-command).
   Make sure that docker is running using `docker ps`.

**Windows users**: please set up WSL and a local Ubuntu virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)** (only Step 1 is necessary). Please install the make command with `sudo apt install make -y` (if it's not already present).

All the commands shown below are to be run via the terminal (use the Ubuntu terminal for WSL users). We will use docker to set up our containers. Clone and move into the lab repository, as shown below.

```bash
git clone \
  https://github.com/josephmachado/analytical_dp_with_sql.git
cd analytical_dp_with_sql
```

**Note**: If you are using a MacBook M1, please follow [the instructions here](https://github.com/josephmachado/analytical_dp_with_sql/issues/4#issuecomment-1426902080) to use the appropriate docker image.

We have some helpful make commands that make working with our systems easier. The make commands and their definitions are shown below:

1. `make up`: Spin up the docker containers.
2. `make trino`: Open the trino cli; use `exit` to quit the cli. **This is where you will type your SQL queries**.
3. `make down`: Stop the docker containers.

You can see the commands in [this Makefile](https://github.com/josephmachado/analytical_dp_with_sql/blob/main/Makefile). If your terminal does not support **make** commands, please use the commands in [the Makefile](https://github.com/josephmachado/analytical_dp_with_sql/blob/main/Makefile) directly. All the commands in this book assume that you have the docker containers running.
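
Since everything that follows assumes the containers are up, it can help to check that Trino has finished starting before opening the cli. The sketch below is one possible way to do that from Python using only the standard library. It assumes the coordinator answers on `http://localhost:8080` (the port mapped in `docker-compose.yml`) and exposes a `/v1/info` endpoint whose JSON reply has a `starting` field; the function names `trino_is_ready` and `wait_for` are hypothetical and not part of this repository.

```python
import json
import time
import urllib.request


def trino_is_ready(info_url="http://localhost:8080/v1/info"):
    """Return True once the coordinator reports that startup has finished.

    Assumption: the reply is a JSON object with a boolean "starting" field.
    """
    try:
        with urllib.request.urlopen(info_url, timeout=5) as response:
            info = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply:
        # treat all of these as "not ready yet".
        return False
    return isinstance(info, dict) and info.get("starting") is False


def wait_for(check, timeout_s=120, interval_s=5,
             sleep=time.sleep, clock=time.monotonic):
    """Call check() every interval_s seconds until it returns True.

    Returns True on success and False once timeout_s seconds have elapsed.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for(trino_is_ready)` after `make up` returns `True` as soon as the coordinator reports that it is no longer starting, or `False` if nothing answers within two minutes.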

In your terminal, do the following:

```bash
# Make sure docker is running using docker ps
make up # starts the docker containers
# If you are having issues with existing containers
# stop them all with the following command
# docker rm -f $(docker ps -a -q)

sleep 60 # wait 1 minute for all the containers to set up
make trino # opens the trino cli
```

In Trino, we can connect to multiple databases (called catalogs in Trino). TPC-H is a dataset used to benchmark analytical database performance. Trino's tpch catalog comes with preloaded tpch datasets of different sizes: tiny, sf1, sf100, sf300, and so on, where sf = scaling factor.

```sql
-- run "make trino" or
-- "docker container exec -it trino-coordinator trino"
-- to open trino cli

USE tpch.tiny;
SHOW tables;
SELECT * FROM orders LIMIT 5;
-- shows five rows, press q to quit the interactive results screen
exit -- quit the cli
```

**Note**: Run `make trino` or `docker container exec -it trino-coordinator trino` on your terminal to open the trino cli. The SQL code shown throughout the book assumes you are running it in trino cli.

Starting the docker containers will also start Minio (an S3 alternative); we will use Minio as our data store to explain efficient data storage.

**UI**: Open the Trino UI at http://localhost:8080 (username: any word) and Minio (S3 alternative) at http://localhost:9001 (username: minio, password: minio123) in a browser of your choice.

If you prefer to connect to Trino via a SQL IDE, download [DBeaver](https://dbeaver.io/) (addresses issues [mentioned here](https://github.com/josephmachado/analytical_dp_with_sql/issues/8)). Open `DBeaver` and follow these steps:

1. Click on Database -> New Database Connection
2. A `Connect to a database` box will open; search for and select `Trino`, then press Next.
3. Do not change settings; use `user` as the user name. You will get a `connected` text box if you test the connection. Click Finish, and you will be able to explore our Trino database.

![DBeaver](./images/dbeaver.png)

# Data Model

The [TPC-H](https://www.tpc.org/tpch/) data represents a car parts seller's data warehouse, where we record orders, items that make up that order (lineitem), supplier, customer, part (parts sold), region, nation, and partsupp.

**Note:** Have a copy of the data model as you follow along; this will help with the examples provided and with answering exercise questions.

![TPC-H data model](./images/tpch_erd.png)

# Acknowledgments

We use the [TPC-H](https://www.tpc.org/tpch/) dataset and [Trino](https://trino.io/) as our OLAP DB.
--------------------------------------------------------------------------------