├── .gitignore
├── Dockerfile
├── README.md
├── app
│   ├── .gitkeep
│   └── test_submit.py
├── config
│   └── log4j.properties
├── data
│   └── .gitkeep
├── docker-compose.yml
├── docs
│   └── img
│       ├── alocamento_jobs.png
│       ├── dh_banner.png
│       ├── spark_master.png
│       ├── spark_worker.png
│       └── stages.png
├── download_dataset.sh
└── output
    └── .gitkeep

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.tsv
*.csv

data/*
!.gitkeep
output/*
!.gitkeep

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM openjdk:8-jdk-slim

ENV HADOOP_VERSION 3.2
ENV SPARK_VERSION 3.1.1
ENV TAR_FILE spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME /opt/spark
ENV PATH "${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin"

# Install Python and wget, and symlink python3 to python
RUN apt update; \
    apt install -y python3 \
    python3-pip \
    wget; \
    ln -sL $(which python3) /usr/bin/python;\
    mkdir -p ${SPARK_HOME}

# Download Spark+Hadoop
RUN wget -nv https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${TAR_FILE}

# Untar into SPARK_HOME, then remove the tarball and wget
RUN tar -xzvf ${TAR_FILE} --strip-components=1 -C ${SPARK_HOME}; \
    rm /${TAR_FILE}; \
    apt remove -y wget

COPY config/log4j.properties $SPARK_HOME/conf/log4j.properties

WORKDIR ${SPARK_HOME}

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![DataHackers logo](docs/img/dh_banner.png?raw=true "Data Hackers Logo")

# Data Hackers - Supletivo Apache Spark

## Intro
:brazil:
This repository is meant to help you understand a basic Apache Spark standalone-mode setup, as well as explore a bit of the UI and its simplest features, without having to configure Spark locally on your machine.

**PS:** This Docker image was built purely for local/*standalone* exploration and is not suitable for deployment in other scenarios (such as K8s).

**ENGLISH DISCLAIMER** :us: *This repo is meant for a Portuguese/BR community - if you have trouble understanding these docs, please open an issue and let me know so I can translate it.*

---
## Setup
To run this repository you will need:
- [x] [Docker](https://docs.docker.com/engine/install/)
- [x] [docker-compose](https://docs.docker.com/compose/install/)

:warning: Make sure you have these dependencies before moving on.
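
The `docker-compose.yml` pulls the prebuilt `naniviaa/spark-dhbr:latest` image. If you prefer to build it yourself from the repository's `Dockerfile`, a minimal (optional) sketch - the assumption here is that you reuse the same tag so docker-compose picks up the local build:

```
docker build -t naniviaa/spark-dhbr:latest .
```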

---
## Run

To get started, just run:

```
docker-compose up -d
```
(you can omit the `-d` to follow the container logs)

To validate, `docker ps` should show something like:

| CONTAINER ID | IMAGE                      | COMMAND                | STATUS        | NAMES         |
|--------------|----------------------------|------------------------|---------------|---------------|
| 726038fdf7b2 | naniviaa/spark-dhbr:latest | "bin/spark-class org…" | Up 17 seconds | worker-b-dhbr |
| 806eef561d69 | naniviaa/spark-dhbr:latest | "bin/spark-class org…" | Up 17 seconds | worker-a-dhbr |
| 3049b4d182d6 | naniviaa/spark-dhbr:latest | "bin/spark-class org…" | Up 17 seconds | master-dhbr   |
---
## What to explore

### **UIs**
The first step is to explore the Master UI and the Workers' UIs:
- Master: http://localhost:8080/
- WorkerA: http://localhost:8081/
- WorkerB: http://localhost:8082/
- Spark Context UI: http://localhost:4040/ - keep this URL for when you submit a job (it is only available while an application is running)


#### **Spark Master**

![Spark Master](docs/img/spark_master.png?raw=true "Spark Master")


#### Worker
![Spark Worker](docs/img/spark_worker.png?raw=true "Spark Worker")

---

### Running tasks

To run tasks in this client mode, you can either:
- Attach to one of the containers;
- Spin up a container just for submitting the job; or
- Submit straight from your machine, referencing the container's IP.

:fireworks: Before moving on, download the data by running the following script (or the curl command inside it): `./download_dataset.sh`

To keep things simple, we will follow the *attach* approach, using one of the existing containers:
```
docker exec -it master-dhbr bash ./bin/spark-submit --master spark://master:7077 /tmp/app/test_submit.py
```

The Python file reads the dataset (CSV), adds a new constant column and writes it back to disk, partitioning the data by the `variety` column. The script includes a 5-minute sleep so you can explore the logs and how the job is organized.

:information_source: Make the most of all the job information, including memory usage, the DAG and how the workers were allocated across the splits of each job.


**Job UI and overview:**

http://localhost:4040/jobs/

![Visão Geral](docs/img/alocamento_jobs.png?raw=true "Visão Geral")
(Note that the event timeline shows both when the workers were allocated and when each job ran)


**Execution stages (lazy):**

http://localhost:4040/stages/
![Job Stages](docs/img/stages.png?raw=true "Job Stages")


**DAG report or Query/Execution Plan**:
- http://localhost:4040/SQL/execution/?id=0
- http://localhost:4040/SQL/execution/?id=1


:no_bicycles: **Job output**:
If you are not used to disk persistence with partitioning, take a look at the CSV structure created under the `./output/` folder and compare how it differs from the original file inside `./data`.
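
To see how Spark turns the partition directories back into a column on read, here is a minimal PySpark sketch (an illustration, not part of the repo - you can run it, for example, from `./bin/pyspark --master spark://master:7077` inside the master container; the path assumes the output location used by `test_submit.py`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Read partitioned output").getOrCreate()

# The job writes the CSV files without headers, so the data columns come back as _c0, _c1, ...
df_back = spark.read.csv("/tmp/output/job_output.txt")

# The variety=... directory names are restored as a regular `variety` column
df_back.groupBy("variety").count().show()

spark.stop()
```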

:information_source: Go back to the Master/Worker UIs and see what they show after the execution.

---
### Going deeper

Take the opportunity to explore other scripts (see the sketch below) - just remember to always keep:
- Datasets in `./data/`
- Apps in `./app/`
- Writes in `./output/`
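
For example, a minimal sketch of a second app that follows this layout (the file name `app/iris_summary.py` and the renamed/output names are illustrative, not part of the repo):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("Iris Summary").getOrCreate()

# Same dataset downloaded by download_dataset.sh, mounted at /tmp/data inside the containers
df = spark.read.option("header", True).option("inferSchema", True).csv("/tmp/data/iris.csv")

# The gist's header uses dotted column names (e.g. "sepal.length"), so rename before aggregating
df = df.withColumnRenamed("sepal.length", "sepal_length")

# Average sepal length per variety
summary = df.groupBy("variety").agg(avg("sepal_length").alias("avg_sepal_length"))

# Write the result to the shared output volume (visible on the host under ./output/)
summary.write.mode("overwrite").option("header", True).csv("/tmp/output/iris_summary")

spark.stop()
```

It can be submitted the same way as `test_submit.py`, e.g. `docker exec -it master-dhbr bash ./bin/spark-submit --master spark://master:7077 /tmp/app/iris_summary.py`.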

--------------------------------------------------------------------------------
/app/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/app/.gitkeep

--------------------------------------------------------------------------------
/app/test_submit.py:
--------------------------------------------------------------------------------
import logging
from time import sleep
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

log = logging.getLogger()


# Create a new SparkSession (which wraps a SparkContext plus other things)
spark = SparkSession.builder.appName("DataHackers Supletivo").getOrCreate()

# Read the CSV with header=True - PS: the path refers to the volume mounted inside the container
df = spark.read.option("header", True).csv("/tmp/data/iris.csv")

# Add a new constant column
df = df.withColumn("datahackers", lit(True))

# Write the CSV back to disk, partitioned by a column
df.write.mode("overwrite").partitionBy("variety").format("csv").save("/tmp/output/job_output.txt")

sleep(60 * 5)  # Sleep for 5 minutes so you can explore the UI
# Explore the UI at localhost:4040

# Stop the Session and the Context (this call also shuts down the UI)
spark.stop()

--------------------------------------------------------------------------------
/config/log4j.properties:
--------------------------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to info. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=INFO

# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO

--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/data/.gitkeep

--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
version: "2.2"
services:
  master:
    image: naniviaa/spark-dhbr:latest
    container_name: master-dhbr
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_DRIVER_MEMORY: 1g
    ports:
      - 4040:4040 # Spark UI
      - 7077:7077 # Spark Master
      - 8080:8080 # Spark Master - UI
    volumes:
      - ./data:/tmp/data # To share data
      - ./app:/tmp/app # To share scripts
      - ./output:/tmp/output/
    ulimits:
      nproc: 8192 # Process limit (nproc) for the container
  worker-a:
    image: naniviaa/spark-dhbr:latest
    container_name: worker-a-dhbr
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker_a
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
    links:
      - master
    ports:
      - 8081:8081 # Spark Worker - UI
    volumes:
      - ./data:/tmp/data
      - ./app:/tmp/app
      - ./output:/tmp/output/
    ulimits:
      nproc: 8192
  worker-b:
    image: naniviaa/spark-dhbr:latest
    container_name: worker-b-dhbr
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker_b
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8882
      SPARK_WORKER_WEBUI_PORT: 8082
    links:
      - master
    ports:
      - 8082:8082 # Spark Worker - UI
    volumes:
      - ./data:/tmp/data
      - ./app:/tmp/app
      - ./output:/tmp/output/
    ulimits:
      nproc: 8192

--------------------------------------------------------------------------------
/docs/img/alocamento_jobs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/docs/img/alocamento_jobs.png

--------------------------------------------------------------------------------
/docs/img/dh_banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/docs/img/dh_banner.png
--------------------------------------------------------------------------------
/docs/img/spark_master.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/docs/img/spark_master.png

--------------------------------------------------------------------------------
/docs/img/spark_worker.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/docs/img/spark_worker.png

--------------------------------------------------------------------------------
/docs/img/stages.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/docs/img/stages.png

--------------------------------------------------------------------------------
/download_dataset.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# Downloads the Iris dataset (Setosa, Versicolor and Virginica samples)
echo "Downloading the Iris dataset @gist"
curl -k https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv -o ./data/iris.csv

--------------------------------------------------------------------------------
/output/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nickrvieira/docker-spark-datahackers/00c73a8aadd72965f553a6b82df0dbe1c3264f83/output/.gitkeep
--------------------------------------------------------------------------------