├── requirements.txt
├── .devcontainer
│   ├── conf
│   │   ├── minio_mc-entrypoint.sh
│   │   ├── .postCreateCommand.sh
│   │   ├── run-spark.sh
│   │   └── spark-defaults.conf
│   ├── Dockerfile.spark
│   ├── devcontainer.json
│   └── docker-compose.yml
├── .github
│   └── dependabot.yml
├── src
│   ├── main.py
│   └── scratch.py
└── README.md

/requirements.txt:
--------------------------------------------------------------------------------
pyspark[connect]==3.5.1
delta-spark~=3.1.0
--------------------------------------------------------------------------------
/.devcontainer/conf/minio_mc-entrypoint.sh:
--------------------------------------------------------------------------------
#!/bin/bash

/usr/bin/mc config host add local http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY}
/usr/bin/mc mb local/delta-lake
exit 0
--------------------------------------------------------------------------------
/.devcontainer/conf/.postCreateCommand.sh:
--------------------------------------------------------------------------------
#!/bin/bash
sudo apt-get update && sudo apt-get -y upgrade

# # Install Java
# sudo apt-get -y install default-jdk-headless

# Install Python Package
pip install --upgrade pip
pip install -r requirements.txt
--------------------------------------------------------------------------------
/.github/dependabot.yml:
--------------------------------------------------------------------------------
# To get started with Dependabot version updates, you'll need to specify which
# package ecosystems to update and where the package manifests are located.
# Please see the documentation for more information:
# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates
# https://containers.dev/guide/dependabot

version: 2
updates:
  - package-ecosystem: "devcontainers"
    directory: "/"
    schedule:
      interval: weekly
--------------------------------------------------------------------------------
/.devcontainer/Dockerfile.spark:
--------------------------------------------------------------------------------
FROM docker.io/bitnami/spark:3.5.1

# modifying the bitnami image to start spark-connect in master
COPY ./conf/run-spark.sh /opt/bitnami/scripts/spark/run-spark.sh
CMD [ "/opt/bitnami/scripts/spark/run-spark.sh" ]

# adding additional jars for delta lake
# https://github.com/bitnami/containers/blob/main/bitnami/spark/README.md#installing-additional-jars
USER root
RUN install_packages curl
USER 1001
RUN curl https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.1.0/delta-spark_2.12-3.1.0.jar --output /opt/bitnami/spark/jars/delta-spark_2.12-3.1.0.jar
RUN curl https://repo1.maven.org/maven2/io/delta/delta-storage/3.1.0/delta-storage-3.1.0.jar --output /opt/bitnami/spark/jars/delta-storage-3.1.0.jar
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
from pyspark.sql import SparkSession
from datetime import datetime, date
from pyspark.sql import Row
# from delta import *


builder = SparkSession.builder.appName("spark_connect_app").remote("sc://spark:15002")
spark = builder.getOrCreate()
# spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Create a DataFrame
df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="string1", d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
        Row(a=2, b=3.0, c="string2", d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
        Row(a=4, b=5.0, c="string3", d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
    ]
)

# Write Delta Table to Minio
df.write.mode("overwrite").format("delta").save("s3a://delta-lake/my_table")

# Read Delta Table from Minio
df = spark.read.format("delta").load("s3a://delta-lake/my_table")

# Display Delta Table
df.show()

# DeltaTable.forPath(spark, "s3a://delta-lake/my_table").toDF().show()
--------------------------------------------------------------------------------
/.devcontainer/conf/run-spark.sh:
--------------------------------------------------------------------------------
#!/bin/bash
# Copyright VMware, Inc.
# SPDX-License-Identifier: APACHE-2.0

# shellcheck disable=SC1091

set -o errexit
set -o nounset
set -o pipefail
#set -o xtrace

# Load libraries
. /opt/bitnami/scripts/libspark.sh
. /opt/bitnami/scripts/libos.sh

# Load Spark environment settings
. /opt/bitnami/scripts/spark-env.sh

if [ "$SPARK_MODE" == "master" ]; then
    # Master constants
    EXEC=$(command -v start-master.sh)
    ARGS=()
    EXEC_CONNECT=$(command -v start-connect-server.sh)
    info "** Starting Spark in master mode **"
else
    # Worker constants
    EXEC=$(command -v start-worker.sh)
    ARGS=("$SPARK_MASTER_URL")
    info "** Starting Spark in worker mode **"
fi
if am_i_root; then
    # exec_as_user
    if [ "$SPARK_MODE" == "master" ]; then
        "$SPARK_DAEMON_USER" "$EXEC" "${ARGS[@]-}" &
        "$SPARK_DAEMON_USER" "$EXEC_CONNECT" --packages org.apache.spark:spark-connect_2.12:3.5.1
    else
        exec "$SPARK_DAEMON_USER" "$EXEC" "${ARGS[@]-}"
    fi
else
    # exec
    if [ "$SPARK_MODE" == "master" ]; then
        "$EXEC" "${ARGS[@]-}" &
        "$EXEC_CONNECT" --packages org.apache.spark:spark-connect_2.12:3.5.1
    else
        exec "$EXEC" "${ARGS[@]-}"
    fi
fi
--------------------------------------------------------------------------------
/.devcontainer/devcontainer.json:
--------------------------------------------------------------------------------
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/python
{
    "name": "Python 3",
    // Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
    // "image": "mcr.microsoft.com/devcontainers/python:1-3.10-bullseye",
    // "build": {
    //     // Path is relative to the devcontainer.json file.
    //     "dockerfile": "Dockerfile"
    //     // "dockerComposeFile": "docker-compose.yml",
    // },
    // Environmental Variables / Secrets
    // "runArgs": [
    //     "--env-file",
    //     ".devcontainer/devcontainer.env"
    // ],

    "dockerComposeFile": "docker-compose.yml",
    "service": "devcontainer",
    "workspaceFolder": "/workspaces/",
    "shutdownAction": "stopCompose",
    "customizations": {
        "vscode": {
            "extensions": [
                "tamasfe.even-better-toml",
                "GitHub.copilot",
                "ms-python.black-formatter",
                "ms-python.isort"
            ]
        }
    },

    // Features to add to the dev container. More info: https://containers.dev/features.
    // "features": {},

    // Use 'forwardPorts' to make a list of ports inside the container available locally.
    // "forwardPorts": [4040, 4041, 9000, 9001, 8080],

    // Use 'postCreateCommand' to run commands after the container is created.
    "postCreateCommand": "bash .devcontainer/conf/.postCreateCommand.sh"

    // Configure tool-specific properties.
    // "customizations": {},

    // Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
    // "remoteUser": "root"
}
--------------------------------------------------------------------------------
/.devcontainer/conf/spark-defaults.conf:
--------------------------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.sql.extensions                    io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog         org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.s3a.path.style.access   true
spark.hadoop.fs.s3a.access.key          minioadmin
spark.hadoop.fs.s3a.secret.key          minioadmin
spark.hadoop.fs.s3a.endpoint            http://minio:9000
--------------------------------------------------------------------------------
/.devcontainer/docker-compose.yml:
--------------------------------------------------------------------------------
version: '3.8'

services:
  devcontainer:
    image: mcr.microsoft.com/devcontainers/python:1-3.11-bullseye
    volumes:
      - ..:/workspaces:cached
    command: sleep infinity

  minio:
    image: minio/minio:latest
    volumes:
      - minio-data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    command: server /data --console-address ":9001"
    ports:
      - '9000:9000'
      - '9001:9001'

  minio_mc:
    image: minio/mc
    environment:
      - MINIO_ACCESS_KEY=minioadmin
      - MINIO_SECRET_KEY=minioadmin
    volumes:
      - ./conf/minio_mc-entrypoint.sh:/usr/bin/entrypoint.sh
    entrypoint: /bin/sh /usr/bin/entrypoint.sh
    depends_on:
      - minio

  spark:
    build:
      context: .
      dockerfile: Dockerfile.spark
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '4040:4040' # Spark UI
      - '8080:8080' # Spark Master
      - '7077:7077'
    volumes:
      - ./conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf

  spark-worker:
    image: docker.io/bitnami/spark:3.5.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    volumes:
      - ./conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf

volumes:
  minio-data:
--------------------------------------------------------------------------------
/src/scratch.py:
--------------------------------------------------------------------------------
from pyspark.sql import SparkSession
from datetime import datetime, date
from pyspark.sql import Row
from pathlib import Path
from delta import *

builder = (
    SparkSession.builder.appName("devcontainer")
    .remote("sc://spark:15002")

    # .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    # .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    # .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    # .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    # .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    # .config(
    #     "spark.sql.catalog.spark_catalog",
    #     "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    # )
)

# https://stackoverflow.com/questions/75472225/java-lang-classnotfoundexception-class-org-apache-hadoop-fs-s3a-s3afilesystem-n
# my_packages = ["org.apache.hadoop:hadoop-aws:3.3.4",
#                "org.apache.hadoop:hadoop-client-runtime:3.3.4",
#                "org.apache.hadoop:hadoop-client-api:3.3.4",
#                "io.delta:delta-contribs_2.12:3.0.0",
#                "io.delta:delta-hive_2.12:3.0.0",
#                "com.amazonaws:aws-java-sdk-bundle:1.12.262",
#                ]

# spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()

spark = builder.getOrCreate()


df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="string1", d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
        Row(a=2, b=3.0, c="string2", d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
        Row(a=4, b=5.0, c="string3", d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
    ]
)

df.write.mode("overwrite").format("delta").save("s3a://delta-lake/my_table")

df = spark.read.format("delta").load("s3a://delta-lake/my_table")

df.show()

# DeltaTable.forPath(spark, "s3a://delta-lake/my_table").toDF().show()


# df = spark.read.format("delta").load("s3a://delta-lake/my_table")

# df.show()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Spark Connect with Minio and Delta Lake

![Image](https://i.imgur.com/o4YdA6V.jpeg)


## Summary

This post will walk you through setting up a PySpark application that connects to a remote Spark cluster using Spark Connect.

You can find the working code in the [GitHub Repo](https://github.com/gardnmi/spark-connect-local-env)

## Motivation

Spark Connect might become the preferred way to communicate with remote Spark clusters when Spark 4.0 is eventually released. Currently the documentation is sparse, and there aren't many examples beyond toy setups that run in local mode. I hope this post helps others get started with Spark Connect and encourages more discussion around the topic.

## Prerequisites

* Docker
* Visual Studio Code

## Setting up the Environment

Our environment configuration is contained within a docker-compose file that consists of the following services:

1. PySpark application
2. Minio
3. Spark Master
4. Spark Worker

### PySpark Service

The first step in this project is to build the PySpark application that will connect to the remote Spark cluster. We will use Visual Studio Code with the Dev Containers extension to create a containerized environment for the application.

Steps to create the application:

1. Create an empty project folder on your local machine.
2. Open the project folder in Visual Studio Code.
3. Create a folder called `.devcontainer`.
4. Inside the .devcontainer folder create a file called `devcontainer.json` and add the following code:

```json
{
    "name": "Python 3",
    "dockerComposeFile": "docker-compose.yml",
    "service": "devcontainer",
    "workspaceFolder": "/workspaces/",
    "shutdownAction": "stopCompose",
    "postCreateCommand": "bash .devcontainer/conf/.postCreateCommand.sh"
}
```

* Note: For more information on the devcontainer.json file see the [Visual Studio Code Documentation](https://code.visualstudio.com/docs/remote/containers#_devcontainerjson-reference)

5. Inside the .devcontainer folder create a file called `docker-compose.yml` and add the following code:

```yaml
version: '3.8'

services:
  devcontainer:
    image: mcr.microsoft.com/devcontainers/python:1-3.11-bullseye
    volumes:
      - ..:/workspaces:cached
    command: sleep infinity
```

6. Create a folder called `src` in the root of the project, create a file called `main.py` inside it, and add the following code:

```python
from pyspark.sql import SparkSession
from datetime import datetime, date
from pyspark.sql import Row


builder = SparkSession.builder.appName("spark_connect_app").remote("sc://spark:15002")
spark = builder.getOrCreate()

# Create a DataFrame
df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="string1", d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
        Row(a=2, b=3.0, c="string2", d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
        Row(a=4, b=5.0, c="string3", d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
    ]
)

# Write Delta Table to Minio
df.write.mode("overwrite").format("delta").save("s3a://delta-lake/my_table")

# Read Delta Table from Minio
df = spark.read.format("delta").load("s3a://delta-lake/my_table")

# Display Delta Table
df.show()
```

7. Create a file called `requirements.txt` in the root of the project and add the following code:

```
pyspark[connect]==3.5.1
delta-spark~=3.1.0
```

* Note: The version of pyspark[connect] should match the version of Spark you are using. In this example we are using Spark 3.5.1 (a quick way to check the connected server's version is shown in the sketch after these steps).
* Note: Use the [Compatibility Matrix for Delta Lake](https://docs.delta.io/latest/releases.html) to choose the correct version of delta-spark.

8. Create a file called `.postCreateCommand.sh` in the .devcontainer/conf folder and add the following code:

```bash
#!/bin/bash
sudo apt-get update && sudo apt-get -y upgrade

# Install Python Package
pip install --upgrade pip
pip install -r requirements.txt
```

* Note: The .postCreateCommand.sh file is used to install the required Python packages when our application container is started.
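
Before moving on to the other services, it can help to confirm that the client really is talking to a remote server and that the Delta table is usable over Spark Connect. The snippet below is a small sketch of such a check and is not part of the repository; it assumes the docker-compose services are running and that `main.py` has already written `s3a://delta-lake/my_table` at least once (so that version 0 exists).

```python
from pyspark.sql import SparkSession

# Connect to the same Spark Connect endpoint used by main.py
spark = SparkSession.builder.appName("sanity_check").remote("sc://spark:15002").getOrCreate()

# The server's version should line up with pyspark[connect]==3.5.1 on the client
print(spark.version)

# Delta time travel: re-read the first version of the table written by main.py
df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3a://delta-lake/my_table")
)
df_v0.show()
```

Because the plan is resolved on the server, Delta-specific read options such as `versionAsOf` work even though the client only ships a logical plan over gRPC.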

### Minio Services

Minio is an open source object storage server that is compatible with Amazon S3. We will use Minio to store the Delta table created by our PySpark application.

Steps to set up Minio:

1. Open .devcontainer/docker-compose.yml and append the following code:

```yaml
# Existing Code Above

  minio:
    image: minio/minio:latest
    volumes:
      - minio-data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    command: server /data --console-address ":9001"
    ports:
      - '9000:9000'
      - '9001:9001'

  minio_mc:
    image: minio/mc
    environment:
      - MINIO_ACCESS_KEY=minioadmin
      - MINIO_SECRET_KEY=minioadmin
    volumes:
      - ./conf/minio_mc-entrypoint.sh:/usr/bin/entrypoint.sh
    entrypoint: /bin/sh /usr/bin/entrypoint.sh
    depends_on:
      - minio
```

2. Create a folder called `conf` in the .devcontainer folder, create a file called `minio_mc-entrypoint.sh` inside it, and add the following code:

```bash
#!/bin/bash

/usr/bin/mc config host add local http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY}
/usr/bin/mc mb local/delta-lake
exit 0
```

* Note: The minio service is set up to use the default credentials of minioadmin/minioadmin. The minio_mc service is used to pre-create a bucket called delta-lake when we start the minio service.
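
If that bucket is missing, the Delta write in `main.py` will fail, so it can be worth verifying that the entrypoint script ran. Here is a minimal sketch of such a check from the dev container; it assumes you `pip install boto3` first, since boto3 is not part of this project's requirements.txt.

```python
import boto3

# Minio speaks the S3 API, so a plain S3 client pointed at the compose service works
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

buckets = [bucket["Name"] for bucket in s3.list_buckets()["Buckets"]]
print(buckets)  # expect ['delta-lake'] once the minio_mc service has finished
```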

### Spark Services

Our Spark cluster will consist of a Spark master and a Spark worker. We will use the bitnami/spark image as the base for the Spark services.

Steps to set up the Spark cluster:

1. Open `.devcontainer/docker-compose.yml` and append the following code:

```yaml
# Existing Code Above

  spark:
    build:
      context: .
      dockerfile: Dockerfile.spark
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '4040:4040' # Spark UI
      - '8080:8080' # Spark Master
      - '7077:7077'
    volumes:
      - ./conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf

  spark-worker:
    image: docker.io/bitnami/spark:3.5.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    volumes:
      - ./conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf

volumes:
  minio-data:
```

* Note: The default bitnami/spark image does not start the Spark Connect server on startup and is also missing some jar files that are needed to use Delta Lake. We will create a custom Dockerfile that extends the bitnami/spark image, adds the necessary jar files, and starts the Spark Connect server.

2. Create a file called `Dockerfile.spark` in the .devcontainer folder and add the following code:

```Dockerfile
FROM docker.io/bitnami/spark:3.5.1

# modifying the bitnami image to start spark-connect in master
COPY ./conf/run-spark.sh /opt/bitnami/scripts/spark/run-spark.sh
CMD [ "/opt/bitnami/scripts/spark/run-spark.sh" ]

# adding additional jars for delta lake
# https://github.com/bitnami/containers/blob/main/bitnami/spark/README.md#installing-additional-jars
USER root
RUN install_packages curl
USER 1001
RUN curl https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.1.0/delta-spark_2.12-3.1.0.jar --output /opt/bitnami/spark/jars/delta-spark_2.12-3.1.0.jar
RUN curl https://repo1.maven.org/maven2/io/delta/delta-storage/3.1.0/delta-storage-3.1.0.jar --output /opt/bitnami/spark/jars/delta-storage-3.1.0.jar
```

* Note: To start the Spark Connect server we need to run the `start-connect-server.sh` script with the argument `--packages org.apache.spark:spark-connect_2.12:3.5.1` on our Spark master.

To achieve this, we modified the CMD instruction in the Dockerfile to run a slightly modified version of the run.sh script that the base image uses to start Spark (a quick way to confirm the Connect server is reachable is sketched at the end of this section).

3. Create a file in .devcontainer/conf called `run-spark.sh` and add the following code:

```bash
#!/bin/bash
# Copyright VMware, Inc.
# SPDX-License-Identifier: APACHE-2.0

# shellcheck disable=SC1091

set -o errexit
set -o nounset
set -o pipefail
#set -o xtrace

# Load libraries
. /opt/bitnami/scripts/libspark.sh
. /opt/bitnami/scripts/libos.sh

# Load Spark environment settings
. /opt/bitnami/scripts/spark-env.sh

if [ "$SPARK_MODE" == "master" ]; then
    # Master constants
    EXEC=$(command -v start-master.sh)
    ARGS=()
    EXEC_CONNECT=$(command -v start-connect-server.sh)
    info "** Starting Spark in master mode **"
else
    # Worker constants
    EXEC=$(command -v start-worker.sh)
    ARGS=("$SPARK_MASTER_URL")
    info "** Starting Spark in worker mode **"
fi
if am_i_root; then
    # exec_as_user
    if [ "$SPARK_MODE" == "master" ]; then
        "$SPARK_DAEMON_USER" "$EXEC" "${ARGS[@]-}" &
        "$SPARK_DAEMON_USER" "$EXEC_CONNECT" --packages org.apache.spark:spark-connect_2.12:3.5.1
    else
        exec "$SPARK_DAEMON_USER" "$EXEC" "${ARGS[@]-}"
    fi
else
    # exec
    if [ "$SPARK_MODE" == "master" ]; then
        "$EXEC" "${ARGS[@]-}" &
        "$EXEC_CONNECT" --packages org.apache.spark:spark-connect_2.12:3.5.1
    else
        exec "$EXEC" "${ARGS[@]-}"
    fi
fi
```

The final step is to add configuration to our Spark cluster so it can connect to Minio and use Delta Lake.

* Note: This is documented in the bitnami/spark image README [here](https://github.com/bitnami/containers/blob/main/bitnami/spark/README.md#mount-a-custom-configuration-file)

4. Create a file called spark-defaults.conf in the .devcontainer/conf folder and add the following code:

```properties
spark.sql.extensions                    io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog         org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.s3a.path.style.access   true
spark.hadoop.fs.s3a.access.key          minioadmin
spark.hadoop.fs.s3a.secret.key          minioadmin
spark.hadoop.fs.s3a.endpoint            http://minio:9000
```

* Note: The spark-defaults.conf file configures the Spark cluster to use Delta Lake and Minio.

* Note: More information on modifying the bitnami/spark image can be found [here](https://github.com/bitnami/containers/blob/main/bitnami/spark/README.md#mount-a-custom-configuration-file)
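
As referenced above, here is a small sketch (not part of the repository) for checking from the dev container that the Spark Connect server is actually listening before you run the application. It uses only the Python standard library; `spark` is the compose service name and 15002 is the default Spark Connect port used throughout this post.

```python
import socket

# The Connect server can take a few seconds to come up after `docker compose up`
try:
    with socket.create_connection(("spark", 15002), timeout=5):
        print("Spark Connect server is reachable at spark:15002")
except OSError as exc:
    print(f"Could not reach spark:15002: {exc}")
```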

## Running the Application

To run the application:

1. Open the project in Visual Studio Code if it is not already open.
2. Open the command palette (Ctrl+Shift+P) and run the command `Dev Containers: Reopen in Container`.
3. Open a terminal in Visual Studio Code and run the following command:

```console
python src/main.py
```

The application will create a dataframe and write a Delta table to Minio. It will then read the Delta table back from Minio and display the contents of the table.

Output:
```console
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
+---+---+-------+----------+-------------------+
```


To view the underlying files of the Delta table you can use the Minio web console.

The Minio web console can be accessed at [localhost:9001](http://localhost:9001/).

* Note: Use the default credentials `minioadmin/minioadmin` to login.
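
If you prefer to inspect the table's layout programmatically instead of through the console, here is a sketch along the same lines as the bucket check earlier (again assuming `pip install boto3`, which is not in requirements.txt).

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# A Delta table is just parquet data files plus a _delta_log/ directory of JSON commits
response = s3.list_objects_v2(Bucket="delta-lake", Prefix="my_table/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```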

To view the Spark Connect job, open the Spark UI at [localhost:4040](http://localhost:4040/)

* Note: The Spark UI will contain a new tab called "Connect".


## Conclusion

Due to the lack of documentation and examples I had to do a lot of trial and error to get everything working, and I am not sure whether this is the best way to set up a Spark cluster with Spark Connect. There were even a few things I couldn't get working, such as the delta-spark Python library, which currently has an open issue on the [Delta Lake Repo](https://github.com/delta-io/delta/issues/1967).

Hopefully this post encourages more discussion around the topic and helps others get started. If you have any questions or comments please leave them below or contribute to the [GitHub Repo](https://github.com/gardnmi/spark-connect-local-env)
--------------------------------------------------------------------------------