├── bootcamp-overview
│   ├── 1111-Overview.md
│   ├── 1118-How to ask for help.md
│   └── 1119-Community notes.md
├── week-1
│   ├── 1112-Introduction.md
│   ├── 1113-Docker + Postgres.md
│   ├── 1114-GCP + Terraform.md
│   └── 1115-Environment setup.md
├── week-2
│   ├── 1134-Data Lake (GCS).md
│   ├── 1135-Introduction to Workflow orchestration.md
│   ├── 1136-Setting up Airflow locally.md
│   ├── 1137-Ingesting data to GCP with Airflow.md
│   ├── 1138-Ingesting data to Local Postgres with Airflow.md
│   └── 1139-Transfer service (AWS -> GCP).md
├── week-3
│   ├── 1141-Data Warehouse and BigQuery.md
│   ├── 1142-Partitoning and clustering.md
│   ├── 1143-Best practices.md
│   ├── 1144-Internals of BigQuery.md
│   ├── 1145-Advance.md
│   └── 1147-Workshop.md
├── week-4
│   ├── 1167-Prerequisites.md
│   ├── 1168-Introduction to analytics engineering.md
│   ├── 1170-What is dbt.md
│   ├── 1171-Starting a dbt project.md
│   ├── 1172-Development of dbt models.md
│   ├── 1173-Testing and documenting dbt models.md
│   ├── 1174-Deploying a dbt project.md
│   ├── 1175-Visualising the transformed data.md
│   └── 1177-Advanced knowledge.md
├── week-5
│   ├── 1181-Introduction.md
│   ├── 1183-Installation.md
│   ├── 1184-Spark SQL and DataFrames.md
│   └── 1185-Spark Internals.md
├── week-6
│   ├── 1232-Introduction to Kafka.md
│   ├── 1233-KStreams.md
│   ├── 1234-Kafka Connect and KSQL.md
│   └── 1235-Kafka connect.md
└── week-7-8-9
    ├── 1377-Data Engineering Project.md
    └── datasets.md

--------------------------------------------------------------------------------
/bootcamp-overview/1111-Overview.md:
--------------------------------------------------------------------------------
# Overview

Welcome to the Data Engineering Bootcamp :)

We are running this bootcamp with the support of DataTalks.Club. The content was created by renowned data leaders. Many thanks to DataTalks.Club and the instructors for creating this material and allowing us to put the bootcamp together. Check them out:

* [DataTalks.Club](https://datatalks.club/)
* [Alexey Grigorev](https://linkedin.com/in/agrigorev)
* [Ankush Khanna](https://linkedin.com/in/ankushkhanna2)
* [Sejal Vaidya](https://linkedin.com/in/vaidyasejal)
* [Victoria Perez Mola](https://www.linkedin.com/in/victoriaperezmola/)

In this bootcamp, there will be:

* 6 weeks of learning content, released every Friday ([detailed schedule here](https://docs.google.com/document/d/1zqCxW8gQ5ZUqDsja_E3ffwtUeE08VXvX9x4qJLNUWSc/edit))
* Homeworks for practice
* 1 graded hands-on assignment

### Self-paced mode

* All the materials of the course are freely available, so you can take the course at your own pace
* Follow the suggested syllabus (see below) week by week

### Learning Modules

* Learning modules will be released every Friday, starting from 13th May, 7:00 PM CET / 10:30 PM IST.
* Please keep checking this space for new learning modules as we add them.
* To accommodate learners from across the globe, we are releasing the learning modules in an offline format instead of holding live sessions. This lets you learn on your own schedule.
* If you face any issues while learning, feel free to drop a message in the #help channel on [Discord](https://discord.gg/E2XfSEYm2W).

### Assignment Guidelines

* There will be 1 final assignment, which you must attempt to complete the bootcamp.
* Homeworks are for practice and we recommend working on them; however, they carry no weight toward the certificate.
* Certificate: to be eligible for the certificate, you must submit the assignment with a total score of at least 60%.

Happy Learning :)
--------------------------------------------------------------------------------
/bootcamp-overview/1118-How to ask for help.md:
--------------------------------------------------------------------------------
We understand that you may encounter issues during your learning journey. In that case, we recommend following the steps below to make the best use of the community support.

* Before asking a question, please check the [FAQs](https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit)
* If you don't find an answer there, you can post your question in our [Discord community](https://discord.gg/E2XfSEYm2W) under #data-engineering

The FAQs are quite exhaustive; they were compiled during the previous cohort of the course run by the DTC community.

--------------------------------------------------------------------------------
/bootcamp-overview/1119-Community notes.md:
--------------------------------------------------------------------------------
# Community notes

During the last course cohort, some learners took notes and published them on GitHub. These notes could be a handy resource while you learn.

* [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/1_intro.md)
* [Notes from Abd](https://itnadigital.notion.site/Week-1-Introduction-f18de7e69eb4453594175d0b1334b2f4)
* [Notes from Aaron](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_1_basics_n_setup/README.md)
* [Notes from Faisal](https://github.com/FaisalMohd/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/Notes/DE%20Zoomcamp%20Week-1.pdf)
* [Michael Harty's Notes](https://github.com/mharty3/data_engineering_zoomcamp_2022/tree/main/week01)
* [Blog post from Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/18/data-engineering-w1.html)

--------------------------------------------------------------------------------
/week-1/1112-Introduction.md:
--------------------------------------------------------------------------------
# Introduction

**Note:** The video below was recorded when the course was first launched a few months ago on DataTalks.Club. Some aspects may no longer be relevant.

The video covers some non-technical aspects and a basic intro to what will be covered during the bootcamp.
* Intros: 12 min 40 sec
* Flowchart: 20 min 11 sec
* Questions: 33 min 25 sec

https://www.loom.com/share/9fd3389b75a14cad8c08177e51783ccb

# Overview of Architecture, Technologies & Pre-Requisites

# Architecture diagram
![arch_1.jpeg](https://dphi-live.s3.amazonaws.com/media_uploads/arch_1_bc9b4ccd305c4135ade8c4929b67f084.jpeg)

# Technologies
* Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
  * Google Cloud Storage (GCS): Data Lake
  * BigQuery: Data Warehouse
* Terraform: Infrastructure-as-Code (IaC)
* Docker: Containerization
* SQL: Data Analysis & Exploration
* Airflow: Pipeline Orchestration
* dbt: Data Transformation
* Spark: Distributed Processing
* Kafka: Streaming

# Prerequisites

* To get the most out of this course, you should feel comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively fast if you have experience with other programming languages.
* Prior experience with data engineering is not required.
* We suggest watching the videos in the same order as in this document.
* The last video (setting up the environment) is optional, but you can check it earlier if you have trouble setting up the environment and following along with the videos.
--------------------------------------------------------------------------------
/week-1/1113-Docker + Postgres.md:
--------------------------------------------------------------------------------
# Docker + Postgres

[Code](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/2_docker_sql)

# Introduction to Docker
* Why do we need Docker
* Creating a simple "data pipeline" in Docker

# Ingesting NY Taxi Data to Postgres
* Running Postgres locally with Docker
* Using pgcli for connecting to the database
* Exploring the NY Taxi dataset
* Ingesting the data into the database
* Note: if you have problems with pgcli, check [this video](https://www.youtube.com/watch?v=3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) for an alternative way to connect to your database

# Connecting pgAdmin and Postgres
* The pgAdmin tool
* Docker networks

# Putting the ingestion script into Docker
* Converting the Jupyter notebook to a Python script
* Parametrizing the script with argparse (sketched below)
* Dockerizing the ingestion script
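For reference, here is a minimal sketch of what such a parametrized ingestion script can look like. It is not the course's exact script: the flag names, table name, and chunk size are illustrative choices, and it assumes pandas, SQLAlchemy, and a Postgres driver (e.g. psycopg2) are installed.

```python
import argparse

import pandas as pd
from sqlalchemy import create_engine


def main():
    # Parametrize the script so the same Docker image can ingest any CSV
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user", required=True)
    parser.add_argument("--password", required=True)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", default=5432, type=int)
    parser.add_argument("--db", required=True)
    parser.add_argument("--table", required=True)
    parser.add_argument("--url", required=True, help="URL of the CSV file")
    args = parser.parse_args()

    engine = create_engine(
        f"postgresql://{args.user}:{args.password}@{args.host}:{args.port}/{args.db}"
    )

    # Stream the file in chunks so large datasets don't have to fit in memory
    chunks = pd.read_csv(args.url, iterator=True, chunksize=100_000)
    for i, chunk in enumerate(chunks):
        chunk.to_sql(args.table, engine, if_exists="append" if i else "replace")
        print(f"inserted chunk {i}")


if __name__ == "__main__":
    main()
```

Dockerizing it is then mostly a matter of copying this file into an image and passing the flags through `docker run`.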
# Running Postgres and pgAdmin with Docker-Compose
* Why do we need Docker-Compose
* Docker-Compose YAML file
* Running multiple containers with docker-compose up

# SQL refresher
* Adding the Zones table
* Inner joins
* Basic data quality checks
* Left, Right and Outer joins
* Group by

# Optional
* If you have problems with Docker networking, check Port Mapping and Networks in Docker
  * Docker networks
  * Port forwarding to the host environment
  * Communicating between containers in the network
  * .dockerignore file

--------------------------------------------------------------------------------
/week-1/1114-GCP + Terraform.md:
--------------------------------------------------------------------------------
# GCP + Terraform

[Code](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp)

# Introduction to GCP (Google Cloud Platform)

# Introduction to Terraform Concepts & GCP Pre-Requisites

* [Companion Notes](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp)

# Workshop: Creating GCP Infrastructure with Terraform

* [Workshop](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/terraform)

# Configuring Terraform and GCP SDK on Windows
* [Instructions](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/windows.md)

--------------------------------------------------------------------------------
/week-1/1115-Environment setup.md:
--------------------------------------------------------------------------------
# Environment setup

For the course you'll need:
* Python 3 (e.g. installed with Anaconda)
* Google Cloud SDK
* Docker with docker-compose
* Terraform

If you have problems setting up the environment, you can check this video:
* [Setting up the environment on a cloud VM](https://www.youtube.com/watch?v=ae-CV2KfoN0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
  * Generating SSH keys
  * Creating a virtual machine on GCP
  * Connecting to the VM with SSH
  * Installing Anaconda
  * Installing Docker
  * Creating an SSH config file
  * Accessing the remote machine with VS Code and SSH remote
  * Installing docker-compose
  * Installing pgcli
  * Port-forwarding with VS Code: connecting to pgAdmin and Jupyter from the local computer
  * Installing Terraform
  * Using sftp to put the credentials on the remote machine
  * Shutting down and removing the instance
--------------------------------------------------------------------------------
/week-2/1134-Data Lake (GCS).md:
--------------------------------------------------------------------------------
# Data Lake (GCS)

* What is a Data Lake
* ELT vs. ETL
* Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)

Slides

--------------------------------------------------------------------------------
/week-2/1135-Introduction to Workflow orchestration.md:
--------------------------------------------------------------------------------
# Introduction to Workflow orchestration

* What is an Orchestration Pipeline?
* What is a DAG? (illustrated below)
* Video
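To make the DAG idea concrete, here is a tiny plain-Python illustration (not course code, and the step names are made up): pipeline tasks form a directed acyclic graph, so there is always at least one valid execution order. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on; because the graph
# has no cycles, a topological order always exists.
dag = {
    "download": set(),
    "to_parquet": {"download"},
    "upload_to_gcs": {"to_parquet"},
    "create_bq_table": {"upload_to_gcs"},
}

for task in TopologicalSorter(dag).static_order():
    print(f"running {task}")  # an orchestrator like Airflow executes tasks here
```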
--------------------------------------------------------------------------------
/week-2/1136-Setting up Airflow locally.md:
--------------------------------------------------------------------------------
# Setting up Airflow locally

* Setting up Airflow with Docker-Compose
* Video

* More information in the [airflow folder](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow)

If you want to run a lighter version of Airflow with fewer services, check this [video](https://www.youtube.com/watch?v=A1p5LQ0zzaQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb). It's optional.

--------------------------------------------------------------------------------
/week-2/1137-Ingesting data to GCP with Airflow.md:
--------------------------------------------------------------------------------
# Ingesting data to GCP with Airflow

* Extraction: Download and unpack the data
* Pre-processing: Convert this raw data to parquet
* Upload the parquet files to GCS
* Create an external table in BigQuery
* Video
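A rough sketch of how the steps above translate into an Airflow DAG follows. The URL, file paths, task names, and schedule are placeholders rather than the course's exact DAG, and the GCS/BigQuery tasks are only indicated in a comment.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

DATASET_URL = "https://example.com/yellow_tripdata_2021-01.csv"  # placeholder
LOCAL_CSV = "/tmp/data.csv"


def to_parquet():
    # Import inside the callable so the scheduler parses the file quickly
    import pandas as pd
    pd.read_csv(LOCAL_CSV).to_parquet("/tmp/data.parquet")


with DAG(
    dag_id="ingest_to_gcp",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download",
        bash_command=f"curl -sSL {DATASET_URL} -o {LOCAL_CSV}",
    )
    convert = PythonOperator(task_id="to_parquet", python_callable=to_parquet)
    # Upload and external-table tasks would follow, e.g. using the GCS and
    # BigQuery operators from the Google provider package.
    download >> convert
```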
--------------------------------------------------------------------------------
/week-2/1138-Ingesting data to Local Postgres with Airflow.md:
--------------------------------------------------------------------------------
# Ingesting data to Local Postgres with Airflow

* Converting the ingestion script for loading data into Postgres into an Airflow DAG
* Video

--------------------------------------------------------------------------------
/week-2/1139-Transfer service (AWS -> GCP).md:
--------------------------------------------------------------------------------
# Transfer service (AWS -> GCP)

Moving files from AWS to GCP.
You will need an AWS account for this. This section is optional.

Video 1

Video 2
5 | 6 | * [Big Query basic SQL](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/big_query.sql) 7 | 8 | 9 | * Data Warehouse and BigQuery 10 | -------------------------------------------------------------------------------- /week-3/1142-Partitoning and clustering.md: -------------------------------------------------------------------------------- 1 | # Partioning and Clustering 2 | 3 | 4 | 5 | # Partioning vs Clustering 6 | 7 | -------------------------------------------------------------------------------- /week-3/1143-Best practices.md: -------------------------------------------------------------------------------- 1 | # BigQuery Best Practices 2 | 3 | -------------------------------------------------------------------------------- /week-3/1144-Internals of BigQuery.md: -------------------------------------------------------------------------------- 1 | # Internals of Big Query 2 | 3 | -------------------------------------------------------------------------------- /week-3/1145-Advance.md: -------------------------------------------------------------------------------- 1 | # ML 2 | 3 | * BigQuery Machine Learning 4 | 5 | 6 | 7 | * [SQL for ML in BigQuery](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/big_query_ml.sql) 8 | 9 | # Important links 10 | 11 | * [BigQuery ML Tutorials](https://cloud.google.com/bigquery-ml/docs/tutorials) 12 | * [BigQuery ML Reference Parameter](https://cloud.google.com/bigquery-ml/docs/analytics-reference-patterns) 13 | * [Hyper Parameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm) 14 | * [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview) 15 | 16 | # Deploying ML model 17 | * BigQuery Machine Learning Deployment 18 | 19 | 20 | * [Steps to extract and deploy model with docker](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/extract_model.md) -------------------------------------------------------------------------------- /week-3/1147-Workshop.md: -------------------------------------------------------------------------------- 1 | # [Workshop](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow.md) 2 | 3 | * Integrating Bigquery with Airflow (+ Week 2 Review) - Video 4 | 5 | 6 | 7 | * Setup: Copy over the `airflow` directory (i.e. the Dockerized setup) from `week_2_data_ingestion`: 8 | 9 | `cp ../week_2_data_ingestion/airflow airflow` 10 | 11 | Also, empty the logs directory, if you find it necessary. 12 | 13 | * DAG: [gcs_to_bq_dag.py](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow/dags/gcs_to_bq_dag.py) -------------------------------------------------------------------------------- /week-4/1167-Prerequisites.md: -------------------------------------------------------------------------------- 1 | Week 4: Analytics Engineering 2 | Goal: Transforming the data loaded in DWH to Analytical Views developing a [dbt project]( 3 | https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/taxi_rides_ny/README.md). 4 | 5 | Slides 6 | 7 |
--------------------------------------------------------------------------------
/week-3/1143-Best practices.md:
--------------------------------------------------------------------------------
# BigQuery Best Practices

--------------------------------------------------------------------------------
/week-3/1144-Internals of BigQuery.md:
--------------------------------------------------------------------------------
# Internals of Big Query

--------------------------------------------------------------------------------
/week-3/1145-Advance.md:
--------------------------------------------------------------------------------
# ML

* BigQuery Machine Learning

* [SQL for ML in BigQuery](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/big_query_ml.sql)
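As a short illustration of what BigQuery ML looks like in practice, here is a sketch that trains a model and queries it from Python; the dataset, table, and column names are placeholders, not the course's exact queries.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a linear regression model directly in the warehouse with BigQuery ML
client.query("""
CREATE OR REPLACE MODEL `my_project.trips_data_all.tip_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['tip_amount']) AS
SELECT passenger_count, trip_distance, fare_amount, tip_amount
FROM `my_project.trips_data_all.yellow_trips`
WHERE tip_amount IS NOT NULL
""").result()

# Predictions are just another query
rows = client.query("""
SELECT * FROM ML.PREDICT(
  MODEL `my_project.trips_data_all.tip_model`,
  (SELECT passenger_count, trip_distance, fare_amount
   FROM `my_project.trips_data_all.yellow_trips` LIMIT 10))
""").result()
for row in rows:
    print(dict(row))
```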
# Important links

* [BigQuery ML Tutorials](https://cloud.google.com/bigquery-ml/docs/tutorials)
* [BigQuery ML Reference Patterns](https://cloud.google.com/bigquery-ml/docs/analytics-reference-patterns)
* [Hyperparameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm)
* [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview)

# Deploying ML model
* BigQuery Machine Learning Deployment

* [Steps to extract and deploy model with docker](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/extract_model.md)

--------------------------------------------------------------------------------
/week-3/1147-Workshop.md:
--------------------------------------------------------------------------------
# [Workshop](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow.md)

* Integrating BigQuery with Airflow (+ Week 2 Review) - Video

* Setup: copy over the `airflow` directory (i.e. the Dockerized setup) from `week_2_data_ingestion`:

  `cp -r ../week_2_data_ingestion/airflow airflow`

  Also, empty the logs directory if you find it necessary.

* DAG: [gcs_to_bq_dag.py](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_3_data_warehouse/airflow/dags/gcs_to_bq_dag.py)

--------------------------------------------------------------------------------
/week-4/1167-Prerequisites.md:
--------------------------------------------------------------------------------
Week 4: Analytics Engineering

Goal: transforming the data loaded in the DWH into analytical views by developing a [dbt project](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/taxi_rides_ny/README.md).

Slides

# Prerequisites

We will build a project using dbt and a running data warehouse. By this stage of the course you should already have:
* A running warehouse (BigQuery or Postgres)
* A set of running pipelines ingesting the project dataset (week 3 completed): [Taxi Rides NY dataset](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/dataset.md)
  * Yellow taxi data - years 2019 and 2020
  * Green taxi data - years 2019 and 2020
  * fhv data - year 2019

Note:
* A quick hack has been shared to load that data quicker; check the instructions in [week3/extras](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_3_data_warehouse/extras)
* If you receive an error stating "Permission denied while globbing file pattern." when attempting to run fact_trips.sql, this video may be helpful in resolving the issue

# Setting up dbt for using BigQuery (Alternative A - preferred)

You will need to create a dbt cloud account using [this link](https://www.getdbt.com/signup/) and connect to your warehouse [following these instructions](https://docs.getdbt.com/docs/dbt-cloud/cloud-configuring-dbt-cloud/cloud-setting-up-bigquery-oauth).

More detailed instructions are in [dbt_cloud_setup.md](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/dbt_cloud_setup.md)

Optional: If you feel more comfortable developing locally, you can use a local installation of dbt instead. You can follow the [official dbt documentation](https://docs.getdbt.com/dbt-cli/installation) or the [dbt with BigQuery on Docker](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/docker_setup/README.md) guide to set up dbt locally on Docker. You will need to install the latest version (1.0) with the BigQuery adapter (dbt-bigquery).

# Setting up dbt for using Postgres locally (Alternative B)

As an alternative to dbt cloud, which requires a cloud database, you can run the project by installing dbt locally. You can follow the [official dbt documentation](https://docs.getdbt.com/dbt-cli/installation) or use a docker image from the [official dbt repo](https://github.com/dbt-labs/dbt/). You will need to install the latest version (1.0) with the Postgres adapter (dbt-postgres). After the local installation you will have to set up the connection to Postgres in profiles.yml; you can find the templates [here](https://docs.getdbt.com/reference/warehouse-profiles/postgres-profile)

--------------------------------------------------------------------------------
/week-4/1168-Introduction to analytics engineering.md:
--------------------------------------------------------------------------------
# Introduction to analytics engineering
* What is analytics engineering?
* ETL vs ELT
* Data modeling concepts (fact and dim tables)

--------------------------------------------------------------------------------
/week-4/1170-What is dbt.md:
--------------------------------------------------------------------------------
# What is dbt?
* Intro to dbt

--------------------------------------------------------------------------------
/week-4/1171-Starting a dbt project.md:
--------------------------------------------------------------------------------
# Starting a dbt project

Alternative a: using BigQuery + dbt cloud
* Starting a new project with dbt init (dbt cloud and core)
* dbt cloud setup
* project.yml

Alternative b: using Postgres + dbt core (locally)
* Starting a new project with dbt init (dbt cloud and core)
* dbt core local setup
* profiles.yml
* project.yml

Video

--------------------------------------------------------------------------------
/week-4/1172-Development of dbt models.md:
--------------------------------------------------------------------------------
# Development of dbt models

* Anatomy of a dbt model: written code vs compiled sources
* Materialisations: table, view, incremental, ephemeral
* Seeds, sources and ref
* Jinja and macros
* Packages
* Variables

Note: This video is shown entirely on the dbt cloud IDE, but the same steps can be followed locally on the IDE of your choice.

--------------------------------------------------------------------------------
/week-4/1173-Testing and documenting dbt models.md:
--------------------------------------------------------------------------------
# Testing and documenting dbt models
* Tests
* Documentation

Note: This video is shown entirely on the dbt cloud IDE, but the same steps can be followed locally on the IDE of your choice.

--------------------------------------------------------------------------------
/week-4/1174-Deploying a dbt project.md:
--------------------------------------------------------------------------------
# Deploying a dbt project

Alternative a: using BigQuery + dbt cloud
* Deployment: development environment vs production
* dbt cloud: scheduler, sources and hosted documentation

Alternative b: using Postgres + dbt core (locally)
* Deployment: development environment vs production
* dbt cloud: scheduler, sources and hosted documentation

--------------------------------------------------------------------------------
/week-4/1175-Visualising the transformed data.md:
--------------------------------------------------------------------------------
# Visualising the transformed data
* Google Data Studio
* [Metabase (local installation)](https://www.metabase.com/)

Google Data Studio Video

Metabase Video

--------------------------------------------------------------------------------
/week-4/1177-Advanced knowledge.md:
--------------------------------------------------------------------------------
# Advanced knowledge

* [Make a model incremental](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models)
* [Use of tags](https://docs.getdbt.com/reference/resource-configs/tags)
* [Hooks](https://docs.getdbt.com/docs/building-a-dbt-project/hooks-operations)
* [Analysis](https://docs.getdbt.com/docs/building-a-dbt-project/analyses)
* [Snapshots](https://docs.getdbt.com/docs/building-a-dbt-project/snapshots)
* [Exposure](https://docs.getdbt.com/docs/building-a-dbt-project/exposures)
* [Metrics](https://docs.getdbt.com/docs/building-a-dbt-project/metrics)
# Useful links
* [Visualizing data with Metabase course](https://www.metabase.com/learn/visualization/)

--------------------------------------------------------------------------------
/week-5/1181-Introduction.md:
--------------------------------------------------------------------------------
# Introduction to Batch Processing

# Introduction to Spark

--------------------------------------------------------------------------------
/week-5/1183-Installation.md:
--------------------------------------------------------------------------------
# Installation

Follow [these instructions](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup) to install Spark:

* [Windows](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/windows.md)
* [Linux](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/linux.md)
* [MacOS](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/macos.md)

And follow [this](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/pyspark.md) to run PySpark in Jupyter.

# Installing Spark (Linux)

--------------------------------------------------------------------------------
/week-5/1184-Spark SQL and DataFrames.md:
--------------------------------------------------------------------------------
# First Look at Spark/PySpark

# Spark Dataframes

# (Optional) Preparing Yellow and Green Taxi Data

Script to prepare the dataset: [download_data.sh](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/code/download_data.sh)

Note: The other way to infer the schema for the CSV files (apart from pandas) is to set the inferSchema option to true while reading the files in Spark, as shown in the sketch below.

# SQL with Spark
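Putting the inferSchema note and Spark SQL together, here is a small PySpark sketch; the file path and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()

# inferSchema asks Spark to sample the file and guess column types,
# instead of reading every column as a string
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/yellow_tripdata_2021-01.csv")  # placeholder path
)
df.printSchema()

# Register the DataFrame as a temporary view to query it with SQL
df.createOrReplaceTempView("trips")
spark.sql("""
    SELECT DATE(tpep_pickup_datetime) AS day, COUNT(*) AS n_trips
    FROM trips
    GROUP BY 1
    ORDER BY 1
""").show()
```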
--------------------------------------------------------------------------------
/week-5/1185-Spark Internals.md:
--------------------------------------------------------------------------------
# Anatomy of a Spark Cluster

# GroupBy in Spark

# Joins in Spark
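A short DataFrame-API sketch of the GroupBy and Join operations covered here (paths and column names are placeholders); the comments note roughly what happens at the cluster level.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("joins").getOrCreate()

green = spark.read.parquet("data/green")    # placeholder paths
yellow = spark.read.parquet("data/yellow")

# GroupBy runs in two stages: partial aggregation on each partition,
# then a shuffle that merges the partial results by key
green_agg = green.groupBy("zone").agg(F.sum("amount").alias("green_amount"))
yellow_agg = yellow.groupBy("zone").agg(F.sum("amount").alias("yellow_amount"))

# Joining two large inputs uses a shuffle (sort-merge) join; if one side
# were small, Spark could broadcast it to every executor instead
joined = green_agg.join(yellow_agg, on="zone", how="outer")
joined.show()
```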
--------------------------------------------------------------------------------
/week-6/1232-Introduction to Kafka.md:
--------------------------------------------------------------------------------
# Introduction to Kafka

Slides

# Video: Intro to Kafka

# Video: Configuration Terms

# Video: Avro and schema registry

# Configuration

Please take a look at all the Kafka configuration options [here](https://docs.confluent.io/platform/current/installation/configuration/).

# Docker

Starting the cluster:

`docker-compose up`

# Command line for Kafka

Create a topic:

`./bin/kafka-topics.sh --create --topic demo_1 --bootstrap-server localhost:9092 --partitions 2`
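To complement the CLI commands, here is a sketch of producing and consuming JSON messages from Python. It assumes the third-party kafka-python package (not part of the course materials) and the demo_1 topic created above.

```python
import json

from kafka import KafkaProducer, KafkaConsumer

# Produce a few JSON messages to the demo_1 topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for ride_id in range(3):
    producer.send("demo_1", {"ride_id": ride_id, "status": "completed"})
producer.flush()

# Read them back; consumers in the same group split the topic's
# partitions between them
consumer = KafkaConsumer(
    "demo_1",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="demo-group",
    consumer_timeout_ms=5000,  # stop iterating after 5s without messages
)
for msg in consumer:
    print(msg.partition, msg.offset, json.loads(msg.value))
```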
--------------------------------------------------------------------------------
/week-6/1233-KStreams.md:
--------------------------------------------------------------------------------
# KStreams

* Slides

* Concepts: https://docs.confluent.io/platform/current/streams/concepts.html

* Video: KStream basics

* Video: KStream join and windowing

* Video: KStream advanced features

# Python Faust
* [Faust Doc](https://faust.readthedocs.io/en/latest/index.html)
* [KStream vs Faust](https://faust.readthedocs.io/en/latest/playbooks/vskafka.html)
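For a taste of Faust, here is a minimal sketch of a stream-processing app; the app name, topic name, and record fields are placeholders.

```python
import faust


class Ride(faust.Record):
    ride_id: int
    amount: float


app = faust.App("demo-stream", broker="kafka://localhost:9092")
topic = app.topic("rides", value_type=Ride)


@app.agent(topic)
async def process(rides):
    # Agents are async stream processors: each event is handled as it arrives
    async for ride in rides:
        if ride.amount > 50:
            print(f"expensive ride: {ride.ride_id}")


if __name__ == "__main__":
    app.main()  # run with: python this_script.py worker
```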
# JVM library
* [Confluent Kafka Stream](https://kafka.apache.org/documentation/streams/)
* [Example](https://github.com/AnkushKhanna/kafka-helper/tree/master/src/main/scala/kafka/schematest)

--------------------------------------------------------------------------------
/week-6/1234-Kafka Connect and KSQL.md:
--------------------------------------------------------------------------------
# Kafka Connect and KSQL

Video: Kafka connect and KSQL

--------------------------------------------------------------------------------
/week-6/1235-Kafka connect.md:
--------------------------------------------------------------------------------
# Kafka connect

* [Blog post](https://medium.com/analytics-vidhya/making-sense-of-stream-data-b74c1252a8f5)

--------------------------------------------------------------------------------
/week-7-8-9/1377-Data Engineering Project.md:
--------------------------------------------------------------------------------
# Data Engineering Project

For the next three weeks, you will work on a Data Engineering Project.

The goal of this project is to apply everything we learned in this course and build an end-to-end data pipeline.

Remember that to pass the project, you must evaluate the submissions of 3 peers. If you don't do that, your project can't be considered complete.

## Problem statement

For the project, we will ask you to build a dashboard with two tiles.

For that, you will need to:

* Select a dataset that you're interested in (see [datasets](https://github.com/dphi-official/data-engineering/blob/main/week-7-8-9/datasets.md))
* Create a pipeline for processing this dataset and putting it into a data lake
* Create a pipeline for moving the data from the lake to a data warehouse
* Transform the data in the data warehouse: prepare it for the dashboard
* Create a dashboard

## Data Pipeline

The pipeline could be stream or batch: this is the first thing you'll need to decide.

* If you want to consume data in real time and put it into a data lake, go with a stream pipeline.
* If you want to run things periodically (e.g. hourly/daily), go with a batch pipeline.

## Technologies

You don't have to limit yourself to the technologies covered in the course. You can use alternatives as well:

* Cloud: AWS, GCP, Azure, or others
* Infrastructure as code (IaC): Terraform, Pulumi, Cloud Formation, ...
* Workflow orchestration: Airflow, Prefect, Luigi, ...
* Data Warehouse: BigQuery, Snowflake, Redshift, ...
* Batch processing: Spark, Flink, AWS Batch, ...
* Stream processing: Kafka, Pulsar, Kinesis, ...

If you use something that wasn't covered in the course, be sure to explain what the tool does.

If you're not certain about some tools, ask on the [DPhi Discord Server](https://discord.gg/E2XfSEYm2W).
## Dashboard

You can build a dashboard with any of the tools shown in the course (Data Studio or Metabase) or any other BI tool of your choice. If you use another tool, please specify it and make sure the dashboard is somehow accessible to your peers.

Your dashboard should contain at least two tiles; we suggest you include:

- 1 graph that shows the distribution of some categorical data
- 1 graph that shows the distribution of the data across a temporal line

Make sure that your graphs are easy to understand by adding titles and references.

Example of a dashboard:
![image.png](https://dphi-live.s3.amazonaws.com/media_uploads/image_712f5875b8d34eb3ab859dfee6afefcc.png)

## Submitting

* Form: https://forms.gle/KgnCGM5cgXUxk8Bb9
* Deadline: 21st July, 7:00 PM CET / 10:30 PM IST

## Going the extra mile

If you finish the project and want to improve it further, here are a few things you can do:

* Add tests
* Use make
* Add a CI/CD pipeline

This is not covered in the course and is entirely optional.

If you plan to use this project as your portfolio project, it'll definitely help you stand out from others.

**Note**: this part will not be graded.

Some links to refer to:

* [Unit Tests + CI for Airflow](https://www.astronomer.io/events/recaps/testing-airflow-to-bulletproof-your-code/)
* [CI/CD for Airflow (with Gitlab & GCP state file)](https://engineering.ripple.com/building-ci-cd-with-airflow-gitlab-and-terraform-in-gcp)
* [CI/CD for Airflow (with GitHub and S3 state file)](https://programmaticponderings.com/2021/12/14/devops-for-dataops-building-a-ci-cd-pipeline-for-apache-airflow-dags/)
* [CD for Terraform](https://towardsdatascience.com/git-actions-terraform-for-data-engineers-scientists-gcp-aws-azure-448dc7c60fcc)
* [Spark + Airflow](https://medium.com/doubtnut/github-actions-airflow-for-automating-your-spark-pipeline-c9dff32686b)

## Grading

- **Grade by Peers**: Each assignment will be reviewed by 3 peers. Only people who have submitted their own work take part in the peer-review process. Peers will score the assignment out of 26 marks. We then take the median of the three peer ratings, so a single outlier rating does not skew the result; this median is your peer-review score.
- **Peer Review Completion**: In addition, 14 marks are allotted for completing your peer reviews. This is a binary score: if you complete your peer reviews, you get the full 14 marks; otherwise, you get zero.
- **Final Score**: Your final score in this bootcamp is the sum of the Grade by Peers and Peer Review Completion scores, out of 40 marks.
--------------------------------------------------------------------------------
/week-7-8-9/datasets.md:
--------------------------------------------------------------------------------
## Datasets

Here are some datasets that you could use for the project:

* [Kaggle](https://www.kaggle.com/datasets)
* [AWS datasets](https://registry.opendata.aws/)
* [UK government open data](https://data.gov.uk/)
* [Github archive](https://www.gharchive.org)
* [Awesome public datasets](https://github.com/awesomedata/awesome-public-datasets)
* [Million songs dataset](http://millionsongdataset.com)
* [Some random datasets](https://components.one/datasets/)
* [COVID Datasets](https://www.reddit.com/r/datasets/comments/n3ph2d/coronavirus_datsets/)
* [Datasets from Azure](https://docs.microsoft.com/en-us/azure/azure-sql/public-data-sets)
* [Datasets from BigQuery](https://cloud.google.com/bigquery/public-data/)
* [Dataset search engine from Google](https://datasetsearch.research.google.com/)
* [Public datasets offered by different GCP services](https://cloud.google.com/solutions/datasets)
* [European statistics datasets](https://webgate.acceptance.ec.europa.eu/eurostat/data/database)
* [Datasets for streaming](https://github.com/ColinEberhardt/awesome-public-streaming-datasets)
* [Dataset for Santander bicycle rentals in London](https://cycling.data.tfl.gov.uk/)
* [Common crawl data](https://commoncrawl.org/) (a copy of the internet)
* Collection of data repositories
  * [part 1](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-1.html) (from agriculture and finance to government)
  * [part 2](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-2.html) (from healthcare to transportation)

PRs with more datasets are welcome!
--------------------------------------------------------------------------------