├── bootcamp-overview
│   ├── 1111-Overview.md
│   ├── 1118-How to ask for help.md
│   └── 1119-Community notes.md
├── week-1
│   ├── 1112-Introduction.md
│   ├── 1113-Docker + Postgres.md
│   ├── 1114-GCP + Terraform.md
│   └── 1115-Environment setup.md
├── week-2
│   ├── 1134-Data Lake (GCS).md
│   ├── 1135-Introduction to Workflow orchestration.md
│   ├── 1136-Setting up Airflow locally.md
│   ├── 1137-Ingesting data to GCP with Airflow.md
│   ├── 1138-Ingesting data to Local Postgres with Airflow.md
│   └── 1139-Transfer service (AWS -> GCP).md
├── week-3
│   ├── 1141-Data Warehouse and BigQuery.md
│   ├── 1142-Partitoning and clustering.md
│   ├── 1143-Best practices.md
│   ├── 1144-Internals of BigQuery.md
│   ├── 1145-Advance.md
│   └── 1147-Workshop.md
├── week-4
│   ├── 1167-Prerequisites.md
│   ├── 1168-Introduction to analytics engineering.md
│   ├── 1170-What is dbt.md
│   ├── 1171-Starting a dbt project.md
│   ├── 1172-Development of dbt models.md
│   ├── 1173-Testing and documenting dbt models.md
│   ├── 1174-Deploying a dbt project.md
│   ├── 1175-Visualising the transformed data.md
│   └── 1177-Advanced knowledge.md
├── week-5
│   ├── 1181-Introduction.md
│   ├── 1183-Installation.md
│   ├── 1184-Spark SQL and DataFrames.md
│   └── 1185-Spark Internals.md
├── week-6
│   ├── 1232-Introduction to Kafka.md
│   ├── 1233-KStreams.md
│   ├── 1234-Kafka Connect and KSQL.md
│   └── 1235-Kafka connect.md
└── week-7-8-9
    ├── 1377-Data Engineering Project.md
    └── datasets.md

--------------------------------------------------------------------------------
/bootcamp-overview/1111-Overview.md:
--------------------------------------------------------------------------------
# Overview

Welcome to the Data Engineering Bootcamp :)

We are running this bootcamp with the support of DataTalks.Club. The content was created by renowned data leaders. Many thanks to DataTalks.Club and the instructors for creating this material and allowing us to put the bootcamp together. Check them out:

* [DataTalks.Club](https://datatalks.club/)
* [Alexey Grigorev](https://linkedin.com/in/agrigorev)
* [Ankush Khanna](https://linkedin.com/in/ankushkhanna2)
* [Sejal Vaidya](https://linkedin.com/in/vaidyasejal)
* [Victoria Perez Mola](https://www.linkedin.com/in/victoriaperezmola/)

In this bootcamp, there will be:

* 6 weeks of learning content, released every Friday ([detailed schedule here](https://docs.google.com/document/d/1zqCxW8gQ5ZUqDsja_E3ffwtUeE08VXvX9x4qJLNUWSc/edit))
* Homeworks for practice
* 1 graded hands-on assignment

### Self-paced mode

* All the materials of the course are freely available, so you can take the course at your own pace
* Follow the suggested syllabus (see below) week by week

### Learning Modules

* Learning modules will be released every Friday, starting from 13th May, 7:00 PM CET / 10:30 PM IST.
* Please keep checking this space for new learning modules as we add them.
* To accommodate learners from across the globe, we are releasing the learning modules in an offline format instead of holding live sessions. This lets you learn on your own schedule.
* If you face any issues while learning, feel free to drop a message in the #help channel on [Discord](https://discord.gg/E2XfSEYm2W).

### Assignment Guidelines

* There will be 1 final assignment, which you must attempt to complete the bootcamp.
* Homeworks are for practice and we recommend working on them; however, they carry no weight toward the certificate.
* Certificate: to be eligible for the certificate, you must submit the assignment with a total score of at least 60%.

Happy Learning :)
--------------------------------------------------------------------------------
/bootcamp-overview/1118-How to ask for help.md:
--------------------------------------------------------------------------------
We understand that you may encounter issues during your learning journey. In that case, we recommend following the steps below to make the best use of the community support.

* Before asking a question, please check the [FAQs](https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit)
* If you don't find an answer there, you can post your question in our [Discord community](https://discord.gg/E2XfSEYm2W) under #data-engineering

The FAQs are quite exhaustive; they were compiled during the previous cohort of the course run by the DTC community.

--------------------------------------------------------------------------------
/bootcamp-overview/1119-Community notes.md:
--------------------------------------------------------------------------------
# Community notes

During the last course cohort, some learners took notes and published them on GitHub. These notes could be a handy resource while you learn.

* [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/1_intro.md)
* [Notes from Abd](https://itnadigital.notion.site/Week-1-Introduction-f18de7e69eb4453594175d0b1334b2f4)
* [Notes from Aaron](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_1_basics_n_setup/README.md)
* [Notes from Faisal](https://github.com/FaisalMohd/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/Notes/DE%20Zoomcamp%20Week-1.pdf)
* [Michael Harty's Notes](https://github.com/mharty3/data_engineering_zoomcamp_2022/tree/main/week01)
* [Blog post from Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/18/data-engineering-w1.html)

--------------------------------------------------------------------------------
/week-1/1112-Introduction.md:
--------------------------------------------------------------------------------
# Introduction

**Note:** The video below was recorded when the course was first launched a few months ago on DataTalks.Club. Some aspects may no longer be relevant.

The video covers some non-technical aspects and a basic intro to what will be covered during the bootcamp.
* Intros: 12 min 40 sec
* Flowchart: 20 min 11 sec
* Questions: 33 min 25 sec

https://www.loom.com/share/9fd3389b75a14cad8c08177e51783ccb

# Overview of Architecture, Technologies & Pre-Requisites

# Architecture diagram
![arch_1.jpeg](https://dphi-live.s3.amazonaws.com/media_uploads/arch_1_bc9b4ccd305c4135ade8c4929b67f084.jpeg)

# Technologies
* Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
  * Google Cloud Storage (GCS): Data Lake
  * BigQuery: Data Warehouse
* Terraform: Infrastructure-as-Code (IaC)
* Docker: Containerization
* SQL: Data Analysis & Exploration
* Airflow: Pipeline Orchestration
* dbt: Data Transformation
* Spark: Distributed Processing
* Kafka: Streaming

# Prerequisites

* To get the most out of this course, you should feel comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively fast if you have experience with other programming languages.
* Prior experience with data engineering is not required.
* We suggest watching the videos in the same order as in this document.
* The last video (setting up the environment) is optional, but you can check it earlier if you have trouble setting up the environment and following along with the videos.
--------------------------------------------------------------------------------
/week-1/1113-Docker + Postgres.md:
--------------------------------------------------------------------------------
# Docker + Postgres

[Code](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/2_docker_sql)

# Introduction to Docker
* Why do we need Docker
* Creating a simple "data pipeline" in Docker

# Ingesting NY Taxi Data to Postgres
* Running Postgres locally with Docker
* Using pgcli for connecting to the database
* Exploring the NY Taxi dataset
* Ingesting the data into the database
* Note: if you have problems with pgcli, check [this video](https://www.youtube.com/watch?v=3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) for an alternative way to connect to your database

# Connecting pgAdmin and Postgres
* The pgAdmin tool
* Docker networks

# Putting the ingestion script into Docker
* Converting the Jupyter notebook to a Python script
* Parametrizing the script with argparse (sketched below)
* Dockerizing the ingestion script
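For reference, here is a minimal sketch of what such a parametrized ingestion script can look like. It is not the course's exact script: the flag names, table name, and chunk size are illustrative choices, and it assumes pandas, SQLAlchemy, and a Postgres driver (e.g. psycopg2) are installed.

```python
import argparse

import pandas as pd
from sqlalchemy import create_engine


def main():
    # Parametrize the script so the same Docker image can ingest any CSV
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user", required=True)
    parser.add_argument("--password", required=True)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", default=5432, type=int)
    parser.add_argument("--db", required=True)
    parser.add_argument("--table", required=True)
    parser.add_argument("--url", required=True, help="URL of the CSV file")
    args = parser.parse_args()

    engine = create_engine(
        f"postgresql://{args.user}:{args.password}@{args.host}:{args.port}/{args.db}"
    )

    # Stream the file in chunks so large datasets don't have to fit in memory
    chunks = pd.read_csv(args.url, iterator=True, chunksize=100_000)
    for i, chunk in enumerate(chunks):
        chunk.to_sql(args.table, engine, if_exists="append" if i else "replace")
        print(f"inserted chunk {i}")


if __name__ == "__main__":
    main()
```

Dockerizing it is then mostly a matter of copying this file into an image and passing the flags through `docker run`.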
# Running Postgres and pgAdmin with Docker-Compose
* Why do we need Docker-Compose
* Docker-Compose YAML file
* Running multiple containers with docker-compose up

# SQL refresher
* Adding the Zones table
* Inner joins
* Basic data quality checks
* Left, Right and Outer joins
* Group by

# Optional
* If you have problems with Docker networking, check Port Mapping and Networks in Docker
  * Docker networks
  * Port forwarding to the host environment
  * Communicating between containers in the network
  * .dockerignore file

--------------------------------------------------------------------------------
/week-1/1114-GCP + Terraform.md:
--------------------------------------------------------------------------------
# GCP + Terraform

[Code](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp)

# Introduction to GCP (Google Cloud Platform)

# Introduction to Terraform Concepts & GCP Pre-Requisites

* [Companion Notes](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp)

# Workshop: Creating GCP Infrastructure with Terraform

* [Workshop](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/terraform)

# Configuring Terraform and GCP SDK on Windows
* [Instructions](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/windows.md)

--------------------------------------------------------------------------------
/week-1/1115-Environment setup.md:
--------------------------------------------------------------------------------
# Environment setup

For the course you'll need:
* Python 3 (e.g. installed with Anaconda)
* Google Cloud SDK
* Docker with docker-compose
* Terraform

If you have problems setting up the environment, you can check this video:
* [Setting up the environment on a cloud VM](https://www.youtube.com/watch?v=ae-CV2KfoN0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
  * Generating SSH keys
  * Creating a virtual machine on GCP
  * Connecting to the VM with SSH
  * Installing Anaconda
  * Installing Docker
  * Creating an SSH config file
  * Accessing the remote machine with VS Code and SSH remote
  * Installing docker-compose
  * Installing pgcli
  * Port-forwarding with VS Code: connecting to pgAdmin and Jupyter from the local computer
  * Installing Terraform
  * Using sftp to put the credentials on the remote machine
  * Shutting down and removing the instance
--------------------------------------------------------------------------------
/week-2/1134-Data Lake (GCS).md:
--------------------------------------------------------------------------------
# Data Lake (GCS)

* What is a Data Lake
* ELT vs. ETL
* Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)

Slides

--------------------------------------------------------------------------------
/week-2/1135-Introduction to Workflow orchestration.md:
--------------------------------------------------------------------------------
# Introduction to Workflow orchestration

* What is an Orchestration Pipeline?
* What is a DAG? (illustrated below)
* Video
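To make the DAG idea concrete, here is a tiny plain-Python illustration (not course code, and the step names are made up): pipeline tasks form a directed acyclic graph, so there is always at least one valid execution order. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on; because the graph
# has no cycles, a topological order always exists.
dag = {
    "download": set(),
    "to_parquet": {"download"},
    "upload_to_gcs": {"to_parquet"},
    "create_bq_table": {"upload_to_gcs"},
}

for task in TopologicalSorter(dag).static_order():
    print(f"running {task}")  # an orchestrator like Airflow executes tasks here
```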
--------------------------------------------------------------------------------
/week-2/1136-Setting up Airflow locally.md:
--------------------------------------------------------------------------------
# Setting up Airflow locally

* Setting up Airflow with Docker-Compose
* Video

* More information in the [airflow folder](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow)

If you want to run a lighter version of Airflow with fewer services, check this [video](https://www.youtube.com/watch?v=A1p5LQ0zzaQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb). It's optional.

--------------------------------------------------------------------------------
/week-2/1137-Ingesting data to GCP with Airflow.md:
--------------------------------------------------------------------------------
# Ingesting data to GCP with Airflow

* Extraction: Download and unpack the data
* Pre-processing: Convert this raw data to parquet
* Upload the parquet files to GCS
* Create an external table in BigQuery
* Video
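A rough sketch of how the steps above translate into an Airflow DAG follows. The URL, file paths, task names, and schedule are placeholders rather than the course's exact DAG, and the GCS/BigQuery tasks are only indicated in a comment.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

DATASET_URL = "https://example.com/yellow_tripdata_2021-01.csv"  # placeholder
LOCAL_CSV = "/tmp/data.csv"


def to_parquet():
    # Import inside the callable so the scheduler parses the file quickly
    import pandas as pd
    pd.read_csv(LOCAL_CSV).to_parquet("/tmp/data.parquet")


with DAG(
    dag_id="ingest_to_gcp",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download",
        bash_command=f"curl -sSL {DATASET_URL} -o {LOCAL_CSV}",
    )
    convert = PythonOperator(task_id="to_parquet", python_callable=to_parquet)
    # Upload and external-table tasks would follow, e.g. using the GCS and
    # BigQuery operators from the Google provider package.
    download >> convert
```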
--------------------------------------------------------------------------------
/week-2/1138-Ingesting data to Local Postgres with Airflow.md:
--------------------------------------------------------------------------------
# Ingesting data to Local Postgres with Airflow

* Converting the ingestion script for loading data into Postgres into an Airflow DAG
* Video

--------------------------------------------------------------------------------
/week-2/1139-Transfer service (AWS -> GCP).md:
--------------------------------------------------------------------------------
# Transfer service (AWS -> GCP)

Moving files from AWS to GCP.
You will need an AWS account for this. This section is optional.

Video 1

Video 2
5 | 6 | * [Big Query basic SQL](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/big_query.sql) 7 | 8 | 9 | * Data Warehouse and BigQuery 10 | -------------------------------------------------------------------------------- /week-3/1142-Partitoning and clustering.md: -------------------------------------------------------------------------------- 1 | # Partioning and Clustering 2 | 3 | 4 | 5 | # Partioning vs Clustering 6 | 7 | -------------------------------------------------------------------------------- /week-3/1143-Best practices.md: -------------------------------------------------------------------------------- 1 | # BigQuery Best Practices 2 | 3 | -------------------------------------------------------------------------------- /week-3/1144-Internals of BigQuery.md: -------------------------------------------------------------------------------- 1 | # Internals of Big Query 2 | 3 | -------------------------------------------------------------------------------- /week-3/1145-Advance.md: -------------------------------------------------------------------------------- 1 | # ML 2 | 3 | * BigQuery Machine Learning 4 | 5 | 6 | 7 | * [SQL for ML in BigQuery](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/big_query_ml.sql) 8 | 9 | # Important links 10 | 11 | * [BigQuery ML Tutorials](https://cloud.google.com/bigquery-ml/docs/tutorials) 12 | * [BigQuery ML Reference Parameter](https://cloud.google.com/bigquery-ml/docs/analytics-reference-patterns) 13 | * [Hyper Parameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm) 14 | * [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview) 15 | 16 | # Deploying ML model 17 | * BigQuery Machine Learning Deployment 18 | 19 | 20 | * [Steps to extract and deploy model with docker](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/extract_model.md) -------------------------------------------------------------------------------- /week-3/1147-Workshop.md: -------------------------------------------------------------------------------- 1 | # [Workshop](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow.md) 2 | 3 | * Integrating Bigquery with Airflow (+ Week 2 Review) - Video 4 | 5 | 6 | 7 | * Setup: Copy over the `airflow` directory (i.e. the Dockerized setup) from `week_2_data_ingestion`: 8 | 9 | `cp ../week_2_data_ingestion/airflow airflow` 10 | 11 | Also, empty the logs directory, if you find it necessary. 12 | 13 | * DAG: [gcs_to_bq_dag.py](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow/dags/gcs_to_bq_dag.py) -------------------------------------------------------------------------------- /week-4/1167-Prerequisites.md: -------------------------------------------------------------------------------- 1 | Week 4: Analytics Engineering 2 | Goal: Transforming the data loaded in DWH to Analytical Views developing a [dbt project]( 3 | https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/taxi_rides_ny/README.md). 4 | 5 | Slides 6 | 7 |
--------------------------------------------------------------------------------
/week-3/1143-Best practices.md:
--------------------------------------------------------------------------------
# BigQuery Best Practices

--------------------------------------------------------------------------------
/week-3/1144-Internals of BigQuery.md:
--------------------------------------------------------------------------------
# Internals of Big Query

--------------------------------------------------------------------------------
/week-3/1145-Advance.md:
--------------------------------------------------------------------------------
# ML

* BigQuery Machine Learning

* [SQL for ML in BigQuery](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/big_query_ml.sql)
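As a short illustration of what BigQuery ML looks like in practice, here is a sketch that trains a model and queries it from Python; the dataset, table, and column names are placeholders, not the course's exact queries.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a linear regression model directly in the warehouse with BigQuery ML
client.query("""
CREATE OR REPLACE MODEL `my_project.trips_data_all.tip_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['tip_amount']) AS
SELECT passenger_count, trip_distance, fare_amount, tip_amount
FROM `my_project.trips_data_all.yellow_trips`
WHERE tip_amount IS NOT NULL
""").result()

# Predictions are just another query
rows = client.query("""
SELECT * FROM ML.PREDICT(
  MODEL `my_project.trips_data_all.tip_model`,
  (SELECT passenger_count, trip_distance, fare_amount
   FROM `my_project.trips_data_all.yellow_trips` LIMIT 10))
""").result()
for row in rows:
    print(dict(row))
```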
# Important links

* [BigQuery ML Tutorials](https://cloud.google.com/bigquery-ml/docs/tutorials)
* [BigQuery ML Reference Patterns](https://cloud.google.com/bigquery-ml/docs/analytics-reference-patterns)
* [Hyperparameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm)
* [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview)

# Deploying ML model
* BigQuery Machine Learning Deployment

* [Steps to extract and deploy model with docker](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/extract_model.md)

--------------------------------------------------------------------------------
/week-3/1147-Workshop.md:
--------------------------------------------------------------------------------
# [Workshop](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow.md)

* Integrating BigQuery with Airflow (+ Week 2 Review) - Video

* Setup: copy over the `airflow` directory (i.e. the Dockerized setup) from `week_2_data_ingestion`:

  `cp -r ../week_2_data_ingestion/airflow airflow`

  Also, empty the logs directory if you find it necessary.

* DAG: [gcs_to_bq_dag.py](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow/dags/gcs_to_bq_dag.py)

--------------------------------------------------------------------------------
/week-4/1167-Prerequisites.md:
--------------------------------------------------------------------------------
Week 4: Analytics Engineering

Goal: transforming the data loaded in the DWH into analytical views by developing a [dbt project](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/taxi_rides_ny/README.md).

Slides

# Prerequisites

We will build a project using dbt and a running data warehouse. By this stage of the course you should already have:
* A running warehouse (BigQuery or Postgres)
* A set of running pipelines ingesting the project dataset (week 3 completed): [Taxi Rides NY dataset](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/dataset.md)
  * Yellow taxi data - years 2019 and 2020
  * Green taxi data - years 2019 and 2020
  * fhv data - year 2019

Note:
* A quick hack has been shared to load that data quicker; check the instructions in [week3/extras](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_3_data_warehouse/extras)
* If you receive an error stating "Permission denied while globbing file pattern." when attempting to run fact_trips.sql, this video may be helpful in resolving the issue

# Setting up dbt for using BigQuery (Alternative A - preferred)

You will need to create a dbt cloud account using [this link](https://www.getdbt.com/signup/) and connect to your warehouse [following these instructions](https://docs.getdbt.com/docs/dbt-cloud/cloud-configuring-dbt-cloud/cloud-setting-up-bigquery-oauth).

More detailed instructions are in [dbt_cloud_setup.md](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/dbt_cloud_setup.md)

Optional: If you feel more comfortable developing locally, you can use a local installation of dbt instead. You can follow the [official dbt documentation](https://docs.getdbt.com/dbt-cli/installation) or the [dbt with BigQuery on Docker](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/docker_setup/README.md) guide to set up dbt locally on Docker. You will need to install the latest version (1.0) with the BigQuery adapter (dbt-bigquery).

# Setting up dbt for using Postgres locally (Alternative B)

As an alternative to dbt cloud, which requires a cloud database, you can run the project by installing dbt locally. You can follow the [official dbt documentation](https://docs.getdbt.com/dbt-cli/installation) or use a docker image from the [official dbt repo](https://github.com/dbt-labs/dbt/). You will need to install the latest version (1.0) with the Postgres adapter (dbt-postgres). After the local installation you will have to set up the connection to Postgres in profiles.yml; you can find the templates [here](https://docs.getdbt.com/reference/warehouse-profiles/postgres-profile)

--------------------------------------------------------------------------------
/week-4/1168-Introduction to analytics engineering.md:
--------------------------------------------------------------------------------
# Introduction to analytics engineering
* What is analytics engineering?
* ETL vs ELT
* Data modeling concepts (fact and dim tables)

--------------------------------------------------------------------------------
/week-4/1170-What is dbt.md:
--------------------------------------------------------------------------------
# What is dbt?
* Intro to dbt

--------------------------------------------------------------------------------
/week-4/1171-Starting a dbt project.md:
--------------------------------------------------------------------------------
# Starting a dbt project

Alternative a: using BigQuery + dbt cloud
* Starting a new project with dbt init (dbt cloud and core)
* dbt cloud setup
* project.yml

Alternative b: using Postgres + dbt core (locally)
* Starting a new project with dbt init (dbt cloud and core)
* dbt core local setup
* profiles.yml
* project.yml

Video

--------------------------------------------------------------------------------
/week-4/1172-Development of dbt models.md:
--------------------------------------------------------------------------------
# Development of dbt models

* Anatomy of a dbt model: written code vs compiled sources
* Materialisations: table, view, incremental, ephemeral
* Seeds, sources and ref
* Jinja and macros
* Packages
* Variables

Note: This video is shown entirely on the dbt cloud IDE, but the same steps can be followed locally on the IDE of your choice.

--------------------------------------------------------------------------------
/week-4/1173-Testing and documenting dbt models.md:
--------------------------------------------------------------------------------
# Testing and documenting dbt models
* Tests
* Documentation

Note: This video is shown entirely on the dbt cloud IDE, but the same steps can be followed locally on the IDE of your choice.

--------------------------------------------------------------------------------
/week-4/1174-Deploying a dbt project.md:
--------------------------------------------------------------------------------
# Deploying a dbt project

Alternative a: using BigQuery + dbt cloud
* Deployment: development environment vs production
* dbt cloud: scheduler, sources and hosted documentation

Alternative b: using Postgres + dbt core (locally)
* Deployment: development environment vs production
* dbt cloud: scheduler, sources and hosted documentation

--------------------------------------------------------------------------------
/week-4/1175-Visualising the transformed data.md:
--------------------------------------------------------------------------------
# Visualising the transformed data
* Google Data Studio
* [Metabase (local installation)](https://www.metabase.com/)

Google Data Studio Video

Metabase Video

--------------------------------------------------------------------------------
/week-4/1177-Advanced knowledge.md:
--------------------------------------------------------------------------------
# Advanced knowledge

* [Make a model incremental](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models)
* [Use of tags](https://docs.getdbt.com/reference/resource-configs/tags)
* [Hooks](https://docs.getdbt.com/docs/building-a-dbt-project/hooks-operations)
* [Analysis](https://docs.getdbt.com/docs/building-a-dbt-project/analyses)
* [Snapshots](https://docs.getdbt.com/docs/building-a-dbt-project/snapshots)
* [Exposure](https://docs.getdbt.com/docs/building-a-dbt-project/exposures)
* [Metrics](https://docs.getdbt.com/docs/building-a-dbt-project/metrics)
# Useful links
* [Visualizing data with Metabase course](https://www.metabase.com/learn/visualization/)

--------------------------------------------------------------------------------
/week-5/1181-Introduction.md:
--------------------------------------------------------------------------------
# Introduction to Batch Processing

# Introduction to Spark

--------------------------------------------------------------------------------
/week-5/1183-Installation.md:
--------------------------------------------------------------------------------
# Installation

Follow [these instructions](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup) to install Spark:

* [Windows](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/windows.md)
* [Linux](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/linux.md)
* [MacOS](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/macos.md)

And follow [this](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/pyspark.md) to run PySpark in Jupyter.

# Installing Spark (Linux)

--------------------------------------------------------------------------------
/week-5/1184-Spark SQL and DataFrames.md:
--------------------------------------------------------------------------------
# First Look at Spark/PySpark

# Spark Dataframes

# (Optional) Preparing Yellow and Green Taxi Data

Script to prepare the dataset: [download_data.sh](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/code/download_data.sh)

Note: The other way to infer the schema for the CSV files (apart from pandas) is to set the inferSchema option to true while reading the files in Spark, as shown in the sketch below.

# SQL with Spark
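Putting the inferSchema note and Spark SQL together, here is a small PySpark sketch; the file path and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()

# inferSchema asks Spark to sample the file and guess column types,
# instead of reading every column as a string
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/yellow_tripdata_2021-01.csv")  # placeholder path
)
df.printSchema()

# Register the DataFrame as a temporary view to query it with SQL
df.createOrReplaceTempView("trips")
spark.sql("""
    SELECT DATE(tpep_pickup_datetime) AS day, COUNT(*) AS n_trips
    FROM trips
    GROUP BY 1
    ORDER BY 1
""").show()
```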
--------------------------------------------------------------------------------
/week-5/1185-Spark Internals.md:
--------------------------------------------------------------------------------
# Anatomy of a Spark Cluster

# GroupBy in Spark

# Joins in Spark
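A short DataFrame-API sketch of the GroupBy and Join operations covered here (paths and column names are placeholders); the comments note roughly what happens at the cluster level.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("joins").getOrCreate()

green = spark.read.parquet("data/green")    # placeholder paths
yellow = spark.read.parquet("data/yellow")

# GroupBy runs in two stages: partial aggregation on each partition,
# then a shuffle that merges the partial results by key
green_agg = green.groupBy("zone").agg(F.sum("amount").alias("green_amount"))
yellow_agg = yellow.groupBy("zone").agg(F.sum("amount").alias("yellow_amount"))

# Joining two large inputs uses a shuffle (sort-merge) join; if one side
# were small, Spark could broadcast it to every executor instead
joined = green_agg.join(yellow_agg, on="zone", how="outer")
joined.show()
```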
--------------------------------------------------------------------------------
/week-6/1232-Introduction to Kafka.md:
--------------------------------------------------------------------------------
# Introduction to Kafka

Slides

# Video: Intro to Kafka

# Video: Configuration Terms

# Video: Avro and schema registry

# Configuration

Please take a look at all the Kafka configuration options [here](https://docs.confluent.io/platform/current/installation/configuration/).

# Docker

Starting the cluster:

`docker-compose up`

# Command line for Kafka

Create a topic:

`./bin/kafka-topics.sh --create --topic demo_1 --bootstrap-server localhost:9092 --partitions 2`
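To complement the CLI commands, here is a sketch of producing and consuming JSON messages from Python. It assumes the third-party kafka-python package (not part of the course materials) and the demo_1 topic created above.

```python
import json

from kafka import KafkaProducer, KafkaConsumer

# Produce a few JSON messages to the demo_1 topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for ride_id in range(3):
    producer.send("demo_1", {"ride_id": ride_id, "status": "completed"})
producer.flush()

# Read them back; consumers in the same group split the topic's
# partitions between them
consumer = KafkaConsumer(
    "demo_1",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="demo-group",
    consumer_timeout_ms=5000,  # stop iterating after 5s without messages
)
for msg in consumer:
    print(msg.partition, msg.offset, json.loads(msg.value))
```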
--------------------------------------------------------------------------------
/week-6/1233-KStreams.md:
--------------------------------------------------------------------------------
# KStreams

* Slides

* Concepts: https://docs.confluent.io/platform/current/streams/concepts.html

* Video: KStream basics

* Video: KStream join and windowing

* Video: KStream advanced features

# Python Faust
* [Faust Doc](https://faust.readthedocs.io/en/latest/index.html)
* [KStream vs Faust](https://faust.readthedocs.io/en/latest/playbooks/vskafka.html)
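For a taste of Faust, here is a minimal sketch of a stream-processing app; the app name, topic name, and record fields are placeholders.

```python
import faust


class Ride(faust.Record):
    ride_id: int
    amount: float


app = faust.App("demo-stream", broker="kafka://localhost:9092")
topic = app.topic("rides", value_type=Ride)


@app.agent(topic)
async def process(rides):
    # Agents are async stream processors: each event is handled as it arrives
    async for ride in rides:
        if ride.amount > 50:
            print(f"expensive ride: {ride.ride_id}")


if __name__ == "__main__":
    app.main()  # run with: python this_script.py worker
```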
# JVM library
* [Confluent Kafka Stream](https://kafka.apache.org/documentation/streams/)
* [Example](https://github.com/AnkushKhanna/kafka-helper/tree/master/src/main/scala/kafka/schematest)

--------------------------------------------------------------------------------
/week-6/1234-Kafka Connect and KSQL.md:
--------------------------------------------------------------------------------
# Kafka Connect and KSQL

Video: Kafka connect and KSQL

--------------------------------------------------------------------------------
/week-6/1235-Kafka connect.md:
--------------------------------------------------------------------------------
# Kafka connect

* [Blog post](https://medium.com/analytics-vidhya/making-sense-of-stream-data-b74c1252a8f5)

--------------------------------------------------------------------------------
/week-7-8-9/1377-Data Engineering Project.md:
--------------------------------------------------------------------------------
# Data Engineering Project

For the next three weeks, you will work on a Data Engineering Project.

The goal of this project is to apply everything we learned in this course and build an end-to-end data pipeline.

Remember that to pass the project, you must evaluate the submissions of 3 peers. If you don't do that, your project can't be considered complete.

## Problem statement

For the project, we will ask you to build a dashboard with two tiles.

For that, you will need to:

* Select a dataset that you're interested in (see [datasets](https://github.com/dphi-official/data-engineering/blob/main/week-7-8-9/datasets.md))
* Create a pipeline for processing this dataset and putting it into a data lake
* Create a pipeline for moving the data from the lake to a data warehouse
* Transform the data in the data warehouse: prepare it for the dashboard
* Create a dashboard

## Data Pipeline

The pipeline could be stream or batch: this is the first thing you'll need to decide.

* If you want to consume data in real time and put it into a data lake, go with a stream pipeline.
* If you want to run things periodically (e.g. hourly/daily), go with a batch pipeline.

## Technologies

You don't have to limit yourself to the technologies covered in the course. You can use alternatives as well:

* Cloud: AWS, GCP, Azure, or others
* Infrastructure as code (IaC): Terraform, Pulumi, Cloud Formation, ...
* Workflow orchestration: Airflow, Prefect, Luigi, ...
* Data Warehouse: BigQuery, Snowflake, Redshift, ...
* Batch processing: Spark, Flink, AWS Batch, ...
* Stream processing: Kafka, Pulsar, Kinesis, ...

If you use something that wasn't covered in the course, be sure to explain what the tool does.

If you're not certain about some tools, ask on the [DPhi Discord Server](https://discord.gg/E2XfSEYm2W).
## Dashboard

You can build a dashboard with any of the tools shown in the course (Data Studio or Metabase) or any other BI tool of your choice. If you use another tool, please specify it and make sure the dashboard is somehow accessible to your peers.

Your dashboard should contain at least two tiles; we suggest you include:

- 1 graph that shows the distribution of some categorical data
- 1 graph that shows the distribution of the data across a temporal line

Make sure that your graphs are easy to understand by adding titles and references.

Example of a dashboard:
![image.png](https://dphi-live.s3.amazonaws.com/media_uploads/image_712f5875b8d34eb3ab859dfee6afefcc.png)

## Submitting

* Form: https://forms.gle/KgnCGM5cgXUxk8Bb9
* Deadline: 21st July, 7:00 PM CET / 10:30 PM IST

## Going the extra mile

If you finish the project and want to improve it further, here are a few things you can do:

* Add tests
* Use make
* Add a CI/CD pipeline

This is not covered in the course and is entirely optional.

If you plan to use this project as your portfolio project, it'll definitely help you stand out from others.

**Note**: this part will not be graded.

Some links to refer to:

* [Unit Tests + CI for Airflow](https://www.astronomer.io/events/recaps/testing-airflow-to-bulletproof-your-code/)
* [CI/CD for Airflow (with Gitlab & GCP state file)](https://engineering.ripple.com/building-ci-cd-with-airflow-gitlab-and-terraform-in-gcp)
* [CI/CD for Airflow (with GitHub and S3 state file)](https://programmaticponderings.com/2021/12/14/devops-for-dataops-building-a-ci-cd-pipeline-for-apache-airflow-dags/)
* [CD for Terraform](https://towardsdatascience.com/git-actions-terraform-for-data-engineers-scientists-gcp-aws-azure-448dc7c60fcc)
* [Spark + Airflow](https://medium.com/doubtnut/github-actions-airflow-for-automating-your-spark-pipeline-c9dff32686b)

## Grading

- **Grade by Peers**: Each assignment will be reviewed by 3 peers. Only people who have submitted their own work take part in the peer-review process. Peers will score the assignment out of 26 marks. We then take the median of the three peer ratings, so a single outlier rating does not skew the result; this median is your peer-review score.
- **Peer Review Completion**: In addition, 14 marks are allotted for completing your peer reviews. This is a binary score: if you complete your peer reviews, you get the full 14 marks; otherwise, you get zero.
- **Final Score**: Your final score in this bootcamp is the sum of the Grade by Peers and Peer Review Completion scores, out of 40 marks.
--------------------------------------------------------------------------------
/week-7-8-9/datasets.md:
--------------------------------------------------------------------------------
## Datasets

Here are some datasets that you could use for the project:

* [Kaggle](https://www.kaggle.com/datasets)
* [AWS datasets](https://registry.opendata.aws/)
* [UK government open data](https://data.gov.uk/)
* [Github archive](https://www.gharchive.org)
* [Awesome public datasets](https://github.com/awesomedata/awesome-public-datasets)
* [Million songs dataset](http://millionsongdataset.com)
* [Some random datasets](https://components.one/datasets/)
* [COVID Datasets](https://www.reddit.com/r/datasets/comments/n3ph2d/coronavirus_datsets/)
* [Datasets from Azure](https://docs.microsoft.com/en-us/azure/azure-sql/public-data-sets)
* [Datasets from BigQuery](https://cloud.google.com/bigquery/public-data/)
* [Dataset search engine from Google](https://datasetsearch.research.google.com/)
* [Public datasets offered by different GCP services](https://cloud.google.com/solutions/datasets)
* [European statistics datasets](https://webgate.acceptance.ec.europa.eu/eurostat/data/database)
* [Datasets for streaming](https://github.com/ColinEberhardt/awesome-public-streaming-datasets)
* [Dataset for Santander bicycle rentals in London](https://cycling.data.tfl.gov.uk/)
* [Common crawl data](https://commoncrawl.org/) (a copy of the internet)
* Collection of data repositories
  * [part 1](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-1.html) (from agriculture and finance to government)
  * [part 2](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-2.html) (from healthcare to transportation)

PRs with more datasets are welcome!
--------------------------------------------------------------------------------