├── 02_SQL
├── README.md
├── 05-sql-join.md
├── 01-sql-overview.md
├── 02-sql- syntax-fundamentals.md
├── 03-basic-querying.md
└── 04-DDL-DML-commands.md
├── 05_dbt
└── README.md
├── 03_python
├── README.md
├── 5_python_modules.ipynb
├── 9_python_API.ipynb
├── 4_python_functions.ipynb
├── 7_Working_with_files.ipynb
├── 6_Errors_and_Exceptions.ipynb
└── 8_python_virtual_environments_and_packages.ipynb
├── 07_airbyte
└── README.md
├── 08_airflow
├── README.md
└── 01_installation_and_setup
│ └── installation_and_setup_guide.md
├── 09_aws_cloud
├── README.md
├── 06-Glue-and-Athena
│ ├── 01-Athena.md
│ ├── 02-S3-Glue-Athena.md
│ └── 00-Glue-Data-Catalog.md
├── 03-Virtual-Private-Cloud(VPC)
│ ├── README.md
│ ├── 04-SecurityGroup-and-NACL.md
│ ├── 02-Subnets.md
│ ├── 03-RouteTable-InternetGateway-NatGateway.md
│ ├── 01-VPC-Overview.md
│ └── 00-IP-Addressing.md
├── 08-Amazon-Redshift
│ ├── 10-System-Tables.md
│ ├── 07-Redshift-Spectrum.md
│ ├── 08-Work-Load-Management.md
│ ├── 05-Query-Perfomance-Factors.md
│ ├── 09-Snapshots-and-Backups.md
│ ├── 03-SortKeys-and-DistributionStyle.md
│ ├── 06-Copy-and-Unload-Operations.md
│ ├── 04-Query-Plan-and-Execution-Workflow.md
│ ├── 02-Node-Types.md
│ ├── 01-Redshift-Cluster-Architecture.md
│ └── 00-Redshift-Cluster-Overview.md
├── 05-Simple-Storage-Service(S3)
│ ├── 02-Security.md
│ ├── 01-Access-Control.md
│ ├── 03-Cost-Optimisation.md
│ └── 00-S3-Overview.md
├── 07-Relational-Database-Service(RDS)
│ ├── 01-DB-Instance-Class.md
│ └── 00-RDS-Overview.md
├── 04-Elastic-Compute-Cloud(EC2)
│ ├── 05-Storage.md
│ ├── 04-Security.md
│ ├── 00-EC2-Overview.md
│ └── 03-Networking.md
├── 02-Identity-And-Access-Management(IAM)
│ ├── 02-iam-role.md
│ └── 00-iam-resources.md
└── 01-Cloud-Computing-overview
│ └── 01-Cloud-Computing.md
├── 13_kubernetes
├── README.md
├── 02-pod
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 04-ReplicaSet
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 14-Service
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 09-Node-Affinity
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 05-Deployment
│ ├── 01-Deployment.md
│ ├── Resources
│ │ └── README.md
│ └── 02-RollingUpdate-and-RollingStrategy.md
├── 06-Jobs-and-CronJobs
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 07-Labels-and-Selectors
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 00-Prerequisite-and-Overview
│ └── README.md
├── 01-Architecture
│ ├── 03-Resources
│ │ └── README.md
│ ├── 01-Control-Plane
│ │ └── README.md
│ └── 02-Worker-Node
│ │ └── README.md
├── 08-Taints-and-Tolerations
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 10-ServiceAccount-and-RBAC
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 11-ConfigMaps-and-Secrets
│ ├── README.md
│ └── Resources
│ │ └── README.md
├── 12-External-Secret-Operator
│ ├── README.md
│ └── Resources
│ │ └── README.md
└── 13-Resource-Request-and-Limit
│ ├── README.md
│ └── Resources
│ └── README.md
├── README.md
├── 01_linux
├── 02-linux-setup.md
├── 06-process-monitoring.md
├── 03-user-group-management.md
├── 01-linux-overview.md
├── 04-basic-commands.md
└── 05-file-permissions.md
├── 04_data_modelling
└── README.md
├── 06_docker
├── 07-Port-Mapping.md
├── 01-Docker-Overview.md
├── 04-Docker-Containers.md
├── 05-Image-Layers-and-Cache.md
├── 03-Dockerfile-and-Docker-Image.md
├── README.md
├── 02-Docker-Architecture.md
├── 06-Docker-Network.md
└── 08-Docker-Volume.md
├── 11_spark
├── 02-Spark-Setup.md
├── 06-Partitioning.md
├── 00-Big-Data-Overview.md
├── 04-Memory-Management.md
├── 08-Catalyst-Optimizer.md
├── 03-Driver-and-Executors.md
├── 07-Caching-Persistence.md
├── 09-Client-Cluster-Mode.md
├── 10-Spark-Configurations.md
├── 05-Executor-Resource-Tuning.md
└── 01-Spark-Overview-and-Architecture.md
├── .gitignore
├── 12_apache_kafka
├── 08-Consumer-Group.md
├── 09-Consumer-Lag-and-Rebalancing.md
├── 02-Installing-Kafka.md
├── 07-Consumer-and-Configurations.md
├── 05-Partition-and-Offset.md
├── 03-Kafka-Cluster.md
├── 06-Producer-and Configurations.md
├── 01-Kafka-Overview.md
└── 04-Kafka-Topic-and-Configurations.md
└── 10_terraform
├── 07-DataSource-and-Locals.md
├── 06-Terraform-Resource-Referencing.md
├── Resource-Provisioning
├── 00-Important-Read.md
├── 02-vpc.md
├── 03-s3.md
├── 05-redshift.md
├── 04-rds.md
└── 01-iam.md
├── 02-Installation.md
├── 05-Terraform-State-File.md
├── 03-Terraform-Resource-Template.md
├── 04-Basic-Resource-Provisioning.md
├── 01-Terraform-Overview.md
└── README.md
/02_SQL/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/05_dbt/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/03_python/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/07_airbyte/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/08_airflow/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CDE-Bootcamp
--------------------------------------------------------------------------------
/01_linux/02-linux-setup.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/04_data_modelling/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/06_docker/07-Port-Mapping.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/02-Spark-Setup.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/06-Partitioning.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *ipynb_checkpoints*
--------------------------------------------------------------------------------
/01_linux/06-process-monitoring.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/06_docker/01-Docker-Overview.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/06_docker/04-Docker-Containers.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/00-Big-Data-Overview.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/04-Memory-Management.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/08-Catalyst-Optimizer.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/02-pod/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/06_docker/05-Image-Layers-and-Cache.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/03-Driver-and-Executors.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/07-Caching-Persistence.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/09-Client-Cluster-Mode.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/10-Spark-Configurations.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/12_apache_kafka/08-Consumer-Group.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/04-ReplicaSet/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/14-Service/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/10_terraform/07-DataSource-and-Locals.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/05-Executor-Resource-Tuning.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/02-pod/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/09-Node-Affinity/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/06-Glue-and-Athena/01-Athena.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/11_spark/01-Spark-Overview-and-Architecture.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/04-ReplicaSet/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/05-Deployment/01-Deployment.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/05-Deployment/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/06-Jobs-and-CronJobs/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/07-Labels-and-Selectors/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/14-Service/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/06-Glue-and-Athena/02-S3-Glue-Athena.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/10-System-Tables.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/10_terraform/06-Terraform-Resource-Referencing.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/12_apache_kafka/09-Consumer-Lag-and-Rebalancing.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/00-Prerequisite-and-Overview/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/01-Architecture/03-Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/08-Taints-and-Tolerations/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/09-Node-Affinity/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/10-ServiceAccount-and-RBAC/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/11-ConfigMaps-and-Secrets/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/12-External-Secret-Operator/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/13-Resource-Request-and-Limit/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/05-Simple-Storage-Service(S3)/02-Security.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/06-Glue-and-Athena/00-Glue-Data-Catalog.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/07-Redshift-Spectrum.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/08-Work-Load-Management.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/01-Architecture/01-Control-Plane/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/01-Architecture/02-Worker-Node/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/06-Jobs-and-CronJobs/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/07-Labels-and-Selectors/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/05-Simple-Storage-Service(S3)/01-Access-Control.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/05-Query-Perfomance-Factors.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/09-Snapshots-and-Backups.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/08-Taints-and-Tolerations/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/10-ServiceAccount-and-RBAC/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/11-ConfigMaps-and-Secrets/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/12-External-Secret-Operator/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/13-Resource-Request-and-Limit/Resources/README.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/05-Simple-Storage-Service(S3)/03-Cost-Optimisation.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/03-SortKeys-and-DistributionStyle.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/06-Copy-and-Unload-Operations.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/13_kubernetes/05-Deployment/02-RollingUpdate-and-RollingStrategy.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/07-Relational-Database-Service(RDS)/01-DB-Instance-Class.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/04-Query-Plan-and-Execution-Workflow.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/03_python/5_python_modules.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/03_python/9_python_API.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/10_terraform/Resource-Provisioning/00-Important-Read.md:
--------------------------------------------------------------------------------
1 | Before taking this section, please ensure you've already covered the topics below.
2 |
--------------------------------------------------------------------------------
/03_python/4_python_functions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/03_python/7_Working_with_files.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/08_airflow/01_installation_and_setup/installation_and_setup_guide.md:
--------------------------------------------------------------------------------
1 | To install and setup Apache Airflow on your PC, follow the comprehensive guide [here](https://github.com/coredataengineers/set-up-guides/tree/main/airflow_installation_guide).
--------------------------------------------------------------------------------
/10_terraform/Resource-Provisioning/02-vpc.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ```
4 | resource "aws_vpc" "main" {
5 | cidr_block = "10.0.0.0/16"
6 | instance_tenancy = "default"
7 |
8 | tags = {
9 | Name = "main"
10 | }
11 | }
12 | ```
13 |
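A note on how this VPC is typically used: other resources reference it through its Terraform identifier. As an illustrative sketch (the subnet name, CIDR block, and availability zone below are assumptions, not part of the original snippet), a subnet inside this VPC could look like:

```
resource "aws_subnet" "public_a" {
  # Reference the VPC above by its Terraform identifier, not its Name tag
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"   # must sit inside the VPC's 10.0.0.0/16 range
  availability_zone = "eu-west-1a"    # hypothetical AZ; pick one in your region

  tags = {
    Name = "main-public-a"
  }
}
```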
--------------------------------------------------------------------------------
/10_terraform/Resource-Provisioning/03-s3.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ```
4 | resource "aws_s3_bucket" "example" {
5 | bucket = "my-tf-test-bucket"
6 |
7 | tags = {
8 | Name = "My bucket"
9 | Environment = "Dev"
10 | }
11 | }
12 | ```
13 |
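As a hedged follow-on (not part of the original snippet), bucket settings such as versioning are configured with separate resources that reference the bucket above by its Terraform identifier, for example:

```
resource "aws_s3_bucket_versioning" "example" {
  # Reference the bucket defined above
  bucket = aws_s3_bucket.example.id

  versioning_configuration {
    status = "Enabled"
  }
}
```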
--------------------------------------------------------------------------------
/10_terraform/Resource-Provisioning/05-redshift.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ```
4 | resource "aws_redshift_cluster" "example" {
5 | cluster_identifier = "tf-redshift-cluster"
6 | database_name = "mydb"
7 | master_username = "exampleuser"
8 | master_password = "Mustbe8characters"
9 | node_type = "dc1.large"
10 | cluster_type = "single-node"
11 | }
12 | ```
13 |
--------------------------------------------------------------------------------
/10_terraform/Resource-Provisioning/04-rds.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ```
4 | resource "aws_db_instance" "default" {
5 | allocated_storage = 10
6 | db_name = "mydb"
7 | engine = "mysql"
8 | engine_version = "8.0"
9 | instance_class = "db.t3.micro"
10 | username = "foo"
11 | password = "foobarbaz"
12 | parameter_group_name = "default.mysql8.0"
13 | skip_final_snapshot = true
14 | }
15 | ```
16 |
--------------------------------------------------------------------------------
/10_terraform/02-Installation.md:
--------------------------------------------------------------------------------
1 | ## PREREQUISITE
2 | Before you start working with Terraform, you'll need to have certain prerequisites in place.
3 | - Install Terraform on your Computer
4 | - MAC/LINUX users: Open your terminal and run the commands below (make sure you have brew installed; if not, install it [HERE](https://brew.sh/)).
5 | - `brew tap hashicorp/tap`
6 | - `brew install hashicorp/tap/terraform`
7 | - WINDOWS users: Follow the manual installation [HERE](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli)
8 | - After the installation, to verify everything is good, run `terraform --version` in your terminal.
9 |
--------------------------------------------------------------------------------
/03_python/6_Errors_and_Exceptions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "6e502056",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": []
10 | }
11 | ],
12 | "metadata": {
13 | "kernelspec": {
14 | "display_name": "Python 3 (ipykernel)",
15 | "language": "python",
16 | "name": "python3"
17 | },
18 | "language_info": {
19 | "codemirror_mode": {
20 | "name": "ipython",
21 | "version": 3
22 | },
23 | "file_extension": ".py",
24 | "mimetype": "text/x-python",
25 | "name": "python",
26 | "nbconvert_exporter": "python",
27 | "pygments_lexer": "ipython3",
28 | "version": "3.10.9"
29 | }
30 | },
31 | "nbformat": 4,
32 | "nbformat_minor": 5
33 | }
34 |
--------------------------------------------------------------------------------
/03_python/8_python_virtual_environments_and_packages.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "2c98cca7",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": []
10 | }
11 | ],
12 | "metadata": {
13 | "kernelspec": {
14 | "display_name": "Python 3 (ipykernel)",
15 | "language": "python",
16 | "name": "python3"
17 | },
18 | "language_info": {
19 | "codemirror_mode": {
20 | "name": "ipython",
21 | "version": 3
22 | },
23 | "file_extension": ".py",
24 | "mimetype": "text/x-python",
25 | "name": "python",
26 | "nbconvert_exporter": "python",
27 | "pygments_lexer": "ipython3",
28 | "version": "3.10.9"
29 | }
30 | },
31 | "nbformat": 4,
32 | "nbformat_minor": 5
33 | }
34 |
--------------------------------------------------------------------------------
/10_terraform/Resource-Provisioning/01-iam.md:
--------------------------------------------------------------------------------
1 | This page covers provisioning IAM resources and explains what each resource does. The goal is not simply to
2 | let you copy the code snippet, but, more importantly, to help you understand what you are copying.
3 |
4 | ## IAM USER
5 | ```
6 | resource "aws_iam_user" "lb" {
7 | name = "loadbalancer"
8 | path = "/system/"
9 |
10 | tags = {
11 | tag-key = "tag-value"
12 | }
13 | }
14 |
15 | resource "aws_iam_access_key" "lb" {
16 | user = aws_iam_user.lb.name
17 | }
18 |
19 | data "aws_iam_policy_document" "lb_ro" {
20 | statement {
21 | effect = "Allow"
22 | actions = ["ec2:Describe*"]
23 | resources = ["*"]
24 | }
25 | }
26 |
27 | resource "aws_iam_user_policy" "lb_ro" {
28 | name = "test"
29 | user = aws_iam_user.lb.name
30 | policy = data.aws_iam_policy_document.lb_ro.json
31 | }
32 | ```
33 |
--------------------------------------------------------------------------------
/02_SQL/05-sql-join.md:
--------------------------------------------------------------------------------
1 | ## 🔗 Join Types
2 |
3 | ### INNER JOIN (Default)
4 | ```sql
5 | -- Basic equijoin
6 | SELECT u.username, o.order_date
7 | FROM users u
8 | INNER JOIN orders o ON u.user_id = o.user_id;
9 |
10 | -- With additional filters
11 | SELECT p.name, oi.quantity
12 | FROM products p
13 | JOIN order_items oi
14 | ON p.product_id = oi.product_id
15 | WHERE oi.quantity > 5;
16 | ```
17 |
18 | ### OUTER JOINS
19 | #### LEFT JOIN
20 | ```sql
21 | -- All users + their orders (if any)
22 | SELECT u.username, COUNT(o.order_id) AS order_count
23 | FROM users u
24 | LEFT JOIN orders o ON u.user_id = o.user_id
25 | GROUP BY u.username;
26 | ```
27 |
28 | #### FULL OUTER JOIN
29 | ```sql
30 | -- All relationships including orphans
31 | SELECT u.username, o.order_id
32 | FROM users u
33 | FULL OUTER JOIN orders o ON u.user_id = o.user_id
34 | WHERE u.user_id IS NULL OR o.order_id IS NULL;
35 | ```
36 |
37 | ### Specialized Joins
38 | #### CROSS JOIN
39 | ```sql
40 | -- Generate all combinations
41 | SELECT s.size, c.color
42 | FROM sizes s
43 | CROSS JOIN colors c;
44 | ```
45 |
46 | #### SELF JOIN
47 | ```sql
48 | -- Employee hierarchy
49 | SELECT e.name AS employee, m.name AS manager
50 | FROM employees e
51 | LEFT JOIN employees m ON e.manager_id = m.employee_id;
52 | ```
53 |
54 | 
55 |
56 |
57 |
58 |
--------------------------------------------------------------------------------
/09_aws_cloud/04-Elastic-Compute-Cloud(EC2)/05-Storage.md:
--------------------------------------------------------------------------------
1 | # ELASTIC BLOCK STORE
2 | Amazon Elastic Block Store (Amazon EBS) provides scalable, high-performance block storage resources that can be used with Amazon EC2 instances.
3 | With Amazon EBS, you can create and manage the following block storage resources:
4 |
5 | - Amazon EBS volumes — These are storage volumes that you attach to Amazon EC2 instances.
6 | After you attach a volume to an instance, you can use it in the same way you would use block storage.
7 | The instance can interact with the volume just as it would with a local drive.
8 |
9 | - Amazon EBS snapshots — These are point-in-time backups of Amazon EBS volumes that persist independently from the volume itself.
10 | You can create snapshots to back up the data on your Amazon EBS volumes. You can then restore new volumes from those snapshots at any time.
11 |
12 | You can create and attach EBS volumes to an instance during launch, and you can create and attach EBS volumes to an instance at any time after launch.
13 | You can also increase the size or performance of your EBS volumes without detaching the volume or restarting your instance.
14 |
15 | You can create EBS snapshots from an EBS volume at any time after creation. You can use EBS snapshots to back up the data stored on your volumes.
16 | You can then use those snapshots to instantly restore volumes, or to migrate data across AWS accounts, AWS Regions, or Availability Zones.
17 | You can use Amazon Data Lifecycle Manager or AWS Backup to automate the creation, retention, and deletion of your EBS snapshots.
18 |
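As a small, illustrative sketch of the operations described above (the Availability Zone, size, and volume ID below are placeholders, not values from this guide), the AWS CLI exposes the same actions:

```bash
# Create a new 20 GiB gp3 volume in a chosen Availability Zone
aws ec2 create-volume --availability-zone eu-west-1a --size 20 --volume-type gp3

# Take a point-in-time snapshot of an existing volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "nightly backup"
```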
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/02-Node-Types.md:
--------------------------------------------------------------------------------
1 | As we covered previously, a `Redshift Cluster` is made up of one or more `Compute Nodes`.
2 | AWS creates the Leader Node and manages it behind the scenes; however, applications and `SQL Clients` can communicate with
3 | the Leader Node.
4 |
5 | This simply means our main concern is the `Compute Node`. When we launch a `Redshift Cluster`, we are asked which Node type we want as
6 | our `Compute Node`, so it's important to know what kind of Node type we need to form our cluster. It's also important to know the kind of
7 | workload we have and what we care about: do we care more about storage, or more about compute?
8 |
9 | If you have a `Compute Intensive` workload, you need to opt for a Node type that is more compute efficient; if storage is what you care about,
10 | it's important to provision a storage-optimised Node type. This is why we will cover the various Node types available and
11 | the hardware specs they come with.
12 |
13 | Below are the important `hardware/resources` that come with a Redshift Cluster Compute Node:
14 | - `CPU` is the number of virtual CPUs for each node
15 | - `RAM` is the amount of memory in gibibytes (GiB) for each node.
16 | - `Default slices` per node is the number of slices into which a Compute Node is partitioned when a cluster is created.
17 | - `Storage` is the capacity and type of storage for each node.
18 |
19 | `RA3` node types are generally designed for storage-heavy workloads.
20 |
21 |
22 | `Dense Compute` node types are generally suitable for compute-intensive workloads.
23 |
24 |
--------------------------------------------------------------------------------
/06_docker/03-Dockerfile-and-Docker-Image.md:
--------------------------------------------------------------------------------
1 | # Dockerfile and Docker Image
2 | Understanding a Dockerfile is like holding a cookbook in your hands. As long as you have it, you have access to the various recipes in the book.
3 |
4 | This guide is to help you understand the basics of a Dockerfile and how Docker images fit into this picture.
5 |
6 | At the end of this guide, you should:
7 | - Understand what a Dockerfile is and why it’s foundational.
8 | - Learn how Docker uses a Dockerfile to create a Docker image.
9 | - Understand the purpose of Docker images, and how they relate to containers.
10 | - Get comfortable with basic Docker commands like docker build and docker run.
11 | - Practice by building and running your own Docker image and container.
12 |
13 | ## Understanding a Dockerfile and why it is foundational
14 |
15 | First, we need to understand that **Docker** exists so that when you want to cook fried rice in your house, your friend's house or your parents' house, the food tastes exactly the same.
16 |
17 | You’ve probably heard the phrase: “It works on my laptop, but not on yours.”
18 | This happens because different computers have different operating systems and software versions, which can cause code to behave unexpectedly.
19 |
20 | In technical terms, Docker exists to make sure your code is in a container with everything it needs to work exactly the same on any machine.
21 | To get to that container, you first need a Docker image. Before that, there is the Dockerfile, the root of any container you want to create and that is why it is foundational.
22 |
23 | **A Dockerfile is a text file that contains the instructions that tell Docker how to build a Docker image. That image is what Docker uses to create and run containers.**
24 |
25 | In summary: **Dockerfile → builds Docker image → which creates running containers that makes sure your code runs the same regardless of who runs it or where**
26 |
27 | ## How Docker Uses a Dockerfile to Create a Docker Image
28 | Now that we know (Dockerfile → builds Docker image → Containers), let's understand how to build a Dockerfile.
29 |
30 |
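To make the next step concrete, here is a minimal, illustrative Dockerfile. The file names and base image are assumptions for the example, not requirements:

```dockerfile
# Base image: every image is built on top of an existing one
FROM python:3.10-slim

# Working directory inside the image
WORKDIR /app

# Copy your code (a hypothetical app.py) into the image
COPY app.py .

# Default command the container runs when it starts
CMD ["python", "app.py"]
```

With this file saved as `Dockerfile` next to `app.py`, `docker build -t my-app .` would build the image and `docker run my-app` would start a container from it.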
--------------------------------------------------------------------------------
/10_terraform/05-Terraform-State-File.md:
--------------------------------------------------------------------------------
1 | ## WHAT IS A TERRAFORM STATE FILE?
2 | It is crucial to bear in mind that the Terraform State file is so **VERY IMPORTANT** that Terraform cannot function without it. With that in mind, let us delve into what the state file is all about.
3 | - Terraform State File is a file that contains the summary/Metadata of any resource that has been created. It has the `.tfstate` extension. A typical file name could be `terraform.tfstate`
4 | - If you define a resource `A` in your Terraform configuration file called `example.tf`, then when you `apply`, Terraform will automatically record this resource creation in the State file.
5 |
6 | ## STATE AND CONFIGURATION FILE HANDLING IN TERRAFORM
7 | - Let us assume you go back to your Terraform configuration file `example.tf`, where you defined resource `A`, and change it to `B` (see the short illustration at the end of this page).
8 | - Terraform will compare what you have in the configuration file (which has now changed from `A` to `B`) with what exists in the Terraform State file, which is still `A`.
9 | - When you run a `plan` on this configuration file, Terraform notices the difference and immediately understands you now want `B`.
10 | - In essence, Terraform uses the State file as a reference to what you previously created, which is `A` in this case (remember it holds the metadata for the most recent state of your infrastructure), and compares it to what you now have inside your configuration file, `B`. It then shows you any differences detected between the two versions.
11 | - Assuming you create another new configuration file with a resource `JJJ`, Terraform will again check the State file to determine whether `JJJ` is there; if it's not, the plan summary will show that you are about to create `JJJ`.
12 | - State file Reference:
13 | - https://developer.hashicorp.com/terraform/language/state
14 | - https://developer.hashicorp.com/terraform/language/state/purpose
15 | - **NOTE**: When working with the Terraform state file, **PLEASE DO NOT** push the state file to GitHub. The Terraform state file contains your infrastructure details in plain text, which means that if you create a Database, its username and password will be available in plain text inside the State file. See how to manage State files in Production [HERE](https://developer.hashicorp.com/terraform/language/state/remote).
16 |
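To make the `A` → `B` discussion above concrete, here is an illustrative sketch (the resource type and names are assumptions, not from the text above). If `example.tf` first contains:

```
resource "aws_s3_bucket" "demo" {
  bucket = "resource-a-bucket"
}
```

and, after an `apply`, you edit the bucket name to `"resource-b-bucket"`, then on the next `plan` Terraform compares the edited configuration against the state it recorded for the old bucket and reports the change it intends to make, rather than silently creating a second bucket.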
--------------------------------------------------------------------------------
/06_docker/README.md:
--------------------------------------------------------------------------------
1 | Articles/videos to understand Docker basics
2 | Containers Deep Dive - https://medium.com/techbeatly/container-internals-deep-dive-5cc424957413
3 |
4 | Understanding Container Images, Part 3: Working with Overlays - https://blogs.cisco.com/developer/373-containerimages-03
5 |
6 | Deep into Container — Linux Namespaces and Cgroups: What are containers made from? - https://faun.pub/kubernetes-story-linux-namespaces-and-cgroups-what-are-containers-made-from-d544ac9bd622
7 |
8 | Docker CMD vs. ENTRYPOINT: What’s the Difference and How to Choose - https://www.bmc.com/blogs/docker-cmd-vs-entrypoint/#:~:text=Use%20ENTRYPOINT%20instructions%20when%20building,when%20a%20Docker%20container%20runs
9 |
10 | Docker Part-2: Layered File System | Union File System - https://www.youtube.com/watch?v=X4E06YE1a7o
11 |
12 | Use the OverlayFS storage driver - https://docs.docker.com/storage/storagedriver/overlayfs-driver/
13 |
14 | What is the difference between a Docker image and a container? - https://stackoverflow.com/questions/23735149/what-is-the-difference-between-a-docker-image-and-a-container
15 |
16 | Articles on Docker Network
17 | docker official documentation - https://docs.docker.com/reference/cli/docker/network/create/
18 |
19 | Understanding Docker Networking - https://medium.com/@MetricFire/understanding-docker-networking-9f81244cf824
20 |
21 | Explaining Four Basic Modes of Docker Network - https://community.pivotal.io/s/article/Explaining-Four-Basic-Modes-of-Docker-Network?language=en_US
22 |
23 | stackoverflow question - https://stackoverflow.com/questions/43316376/what-does-net-host-option-in-docker-command-really-do
24 |
25 | Articles/videos on the basics of Dockerfile and docker-compose
26 | Dockerfile >Docker Image > Docker Container | Beginners Hands-On | Step by Step - https://www.youtube.com/watch?v=C-bX86AgyiA
27 |
28 | Docker Compose in 12 Minutes - https://www.youtube.com/watch?v=Qw9zlE3t8Ko
29 |
30 | Ultimate Docker Compose Tutorial - https://www.youtube.com/watch?v=SXwC9fSwct8
31 |
32 | Articles on building ETL pipelines with Docker, and use cases
33 | How are DEs using Docker containers for their ETLs? - https://www.reddit.com/r/dataengineering/comments/186fvwk/how_are_des_using_docker_containers_for_their_etls/
34 |
35 | Why Should Data Engineers Use Docker? - https://blog.det.life/why-should-data-engineers-use-docker-81918fd7e90f
36 |
37 | ETL using Docker, Python, Postgres, Airflow - https://medium.com/@patricklowe33/etl-using-docker-python-postgres-airflow-ed3e9508bd2e
38 |
--------------------------------------------------------------------------------
/02_SQL/01-sql-overview.md:
--------------------------------------------------------------------------------
1 | ## Overview: Why Learn SQL?
2 |
3 | SQL (Structured Query Language) is the standard language for interacting with relational databases. In today's data-driven world:
4 |
5 | - 🔍 **Universal Usage**: Used by 99% of Fortune 500 companies
6 | - 📈 **Career Growth**: #2 most in-demand tech skill (LinkedIn 2025)
7 | - 💰 **High Value**: SQL developers earn 20-30% more than non-SQL peers
8 | - 🔗 **Interoperability**: Works across all major database systems
9 | - 🛠 **Versatility**: Essential for developers, analysts, scientists, and managers
10 |
11 | SQL enables you to:
12 | - Store and organize data efficiently
13 | - Retrieve information with precision
14 | - Transform raw data into business insights
15 | - Build data-driven applications
16 |
17 | ---
18 |
19 | ## 1. Introduction to Databases
20 |
21 | ### What is a Database?
22 | A database is an organized collection of structured information stored electronically in a computer system. Key characteristics:
23 |
24 | ```mermaid
25 | graph TD
26 | A[Database] --> B[Persistent Storage]
27 | A --> C[Structured Organization]
28 | A --> D[Controlled Access]
29 | A --> E[Data Relationships]
30 | ```
31 |
32 | ### Relational vs. Non-Relational Databases
33 |
34 | | Feature | Relational (SQL) | Non-Relational (NoSQL) |
35 | |---------------|---------------------------|----------------------------|
36 | | **Structure** | Tables with fixed schema | Flexible document/key-value |
37 | | **Scaling** | Vertical | Horizontal |
38 | | **Transactions**| ACID compliant | BASE model |
39 | | **Use Cases** | Complex queries | High velocity data |
40 | | **Examples** | MySQL, PostgreSQL | MongoDB, Cassandra |
41 |
42 | ### Common Database Management Systems
43 |
44 | 1. **MySQL**
45 | 2. **PostgreSQL**
46 | 3. **SQL Server**
47 | 4. **Oracle**
48 | ### Database Schemas and Tables
49 |
50 | ```sql
51 | -- Example schema creation
52 | CREATE SCHEMA ecommerce;
53 |
54 | -- Table structure example
55 | CREATE TABLE ecommerce.users (
56 | user_id INT PRIMARY KEY,
57 | username VARCHAR(50) UNIQUE NOT NULL,
58 | email VARCHAR(255),
59 | signup_date DATE DEFAULT CURRENT_DATE
60 | );
61 | ```
62 |
63 | Key concepts:
64 | - **Schema**: Container for database objects (tables, views, etc.)
65 | - **Table**: Collection of related data in rows and columns
66 | - **Column**: Attribute with specific data type
67 | - **Row**: Single record in a table
68 |
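Putting these concepts together, a short example against the `ecommerce.users` table defined above (the values are illustrative):

```sql
-- Each INSERT adds one row (a record); each column holds one attribute
INSERT INTO ecommerce.users (user_id, username, email)
VALUES (1, 'ada_l', 'ada@example.com');

-- Retrieve the row; signup_date falls back to its DEFAULT value
SELECT user_id, username, signup_date
FROM ecommerce.users
WHERE user_id = 1;
```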
69 |
--------------------------------------------------------------------------------
/01_linux/03-user-group-management.md:
--------------------------------------------------------------------------------
1 | # LINUX USER MANAGEMENT
2 | This section will give you the knowledge needed to understand `User Management` in `Linux`. Let's think about how a typical
3 | laptop is set up when we newly purchase one. Ideally we create a user and a password, and with that we can
4 | start to use the laptop, create different things, and install various apps. We can also decide to create another user on the same laptop for guest usage.
5 |
6 | It's a bit similar in Linux: you also create users, however users don't have all permissions by default.
7 | We are going to go through the different user management commands that allow us to create a new user, create a new group, add that user to the group, delete the user and
8 | delete the group.
9 |
10 | # CREATE USER AND GROUP
11 | A user in Linux is simply an individual who needs access to the Linux Operating System.
12 | To create a user, you can use either of the following 2 commands (a worked example follows further down this page):
13 | - `adduser TheUserName`
14 |   - This command will prompt for a password and some other information about the user.
15 |   - To verify the user has been created, you can run `cat /etc/passwd` to see the list of users in the `passwd` file.
16 | - `useradd TheUserName`
17 |   - This command doesn't prompt for a password; the administrator can later set one by running `passwd TheUserName`, which prompts for the password to be entered.
18 |   - To verify the user has been created, you can run `cat /etc/passwd` to see the list of users in the `passwd` file.
19 |   - To check the encrypted passwords of users, run `cat /etc/shadow`; this lists the encrypted password entries for all users.
20 |
21 | Groups make it much easier to manage users: users with the same function can go into the same group and have a single set of permissions applied to that group, rather than per-user permissions.
22 |
23 | To create a group in Linux, run the below command
24 | - `addgroup TheGroupName`
25 |
26 | To add a User to a Group, run the below command
27 | - `usermod -aG TheGroupName TheUserName`
28 |
29 | To check that the user was added, run the below command
30 | - `cat /etc/group`
31 |
32 | To check the list of Users in a specific group, run the below command
33 | - `cat /etc/group | grep TheGroupName`
34 |
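Putting the commands above together, a typical session might look like this (user and group names are the same placeholders used above; most of these commands need `sudo`):

```bash
# Create a group and a user, then add the user to the group
sudo addgroup TheGroupName
sudo adduser TheUserName                       # prompts for a password and user details
sudo usermod -aG TheGroupName TheUserName      # -aG appends the user to the group

# Verify the results
cat /etc/passwd | grep TheUserName             # the new user appears here
cat /etc/group  | grep TheGroupName            # ...and is listed as a member of the group
```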
35 |
36 | # DELETE USER AND GROUP
37 | Sometimes there are users or groups that are no longer needed; it's ideal to clean up and keep the Linux environment tidy.
38 |
39 | To remove a user from the group
40 | - `gpasswd -d TheUserName TheGroupName`
41 |
42 | To delete the group
43 | - `delgroup TheGroupName`
44 |
45 | To delete the user
46 | - `deluser TheUserName`
47 |
--------------------------------------------------------------------------------
/12_apache_kafka/02-Installing-Kafka.md:
--------------------------------------------------------------------------------
1 | # Installing Kafka
2 |
3 | Confluent Kafka has been chosen for use during this bootcamp, and this is due to its easy setup. This way, you are free from technical and financial debt that would otherwise be accrued by other available options of setting up a Kafka cluster.
4 |
5 |
6 | ## Prerequisites
7 | To run this quick start, you will need Git, Docker Desktop, and Docker Compose installed on a computer with a supported Operating System. Make sure you have Docker Desktop running.
8 |
9 |
10 | ## Windows Users
11 | To set up your Confluent Kafka cluster on Windows, click [here](https://www.confluent.io/blog/set-up-and-run-kafka-on-windows-linux-wsl-2/)
12 |
13 |
14 | ## Mac/Linux Users
15 |
16 | **Step 1**: Download and start Confluent Platform
17 |
18 | In this step, you start by cloning a GitHub repository. This repository contains a Docker compose file and some required configuration files. The docker-compose.yml file sets ports and Docker environment variables such as the replication factor and listener properties for Confluent Platform and its components. To learn more about the settings in this file, see [Docker Image Configuration Reference for Confluent Platform](https://docs.confluent.io/platform/current/installation/docker/config-reference.html#config-reference).
19 |
20 | Clone the Confluent Platform all-in-one example repository, for example:
21 |
22 | ```bash
23 | git clone https://github.com/confluentinc/cp-all-in-one.git
24 | ```
25 |
26 |
27 | **Step 2**: Change to the `cp-all-in-one` directory.
28 |
29 | The default branch that is checked out is the latest version of Confluent Platform:
30 |
31 | ```bash
32 | cd cp-all-in-one/cp-all-in-one
33 | ```
34 |
35 | **Step 3**: Start the Confluent Platform stack with the `-d` option to run in detached mode:
36 |
37 | ```bash
38 | docker compose up -d
39 | ```
40 |
41 | **Note**: If you are using Docker Compose V1, you need to use a dash in the Docker Compose commands. For example:
42 |
43 | ```bash
44 | docker-compose up -d
45 | ```
46 | To learn more, see Migrate to Compose V2.
47 |
48 | Each Confluent Platform component starts in a separate container. Your output should resemble the following. Your output may vary slightly from these examples depending on your operating system.
49 |
50 |
51 |
52 |
53 | **Step 4**: Verify that the services are up and running:
54 |
55 | ```bash
56 | docker compose ps
57 | ```
58 | Your output should resemble:
59 |
60 |
61 |
62 | After a few minutes, if the state of any component isn’t **Up**, run the `docker compose up -d` command again, or try `docker compose restart <service-name>`, for example:
63 | ```bash
64 | docker compose restart control-center
65 | ```
66 |
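As an optional smoke test once everything is **Up** (a hedged sketch: the broker service name and topic name below are assumptions and may differ depending on your version of the compose file), you can create and list a topic from inside the broker container:

```bash
# Create a test topic on the broker
docker compose exec broker kafka-topics --create --topic smoke-test \
  --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

# Confirm it exists
docker compose exec broker kafka-topics --list --bootstrap-server localhost:9092
```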
--------------------------------------------------------------------------------
/12_apache_kafka/07-Consumer-and-Configurations.md:
--------------------------------------------------------------------------------
1 | # Kafka Consumers and Configurations
2 |
3 | Producers put data into Kafka topics, and consumers take it out to process, analyze, or store somewhere else. You can also think of this as an input/output system: producers put information in, and consumers take it out.
4 |
5 | A good illustration: you are out of water, you need to have your bath, and luckily for you there's power. What do you do? You immediately pump water, but you also turn on the tap so you can bathe with whatever water has now been added to the tank. The pumping machine in this case is the producer, your tank is the broker, and the tap is the consumer, channeling the water out as soon as it gets to the tank.
6 |
7 | Simply put, if a producer is the writer in Kafka, then a `consumer` is the reader.
8 |
9 | In this module, we will be covering the following topics:
10 |
11 | - [What is a Consumer?](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/07-Consumer-and-Configurations.md#what-is-a-consumer)
12 | - [How Consumers work](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/07-Consumer-and-Configurations.md#how-consumers-work)
13 | - [Consumer Configurations[Key Settings]](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/07-Consumer-and-Configurations.md#consumer-configurations-key-settings)
14 |
15 | ## What is a Consumer?
16 |
17 | * A consumer is an application that reads (or subscribes to) messages from one or more Kafka topics.
18 | * It connects to the Kafka cluster.
19 | * Reads data from topics (partition by partition).
20 | * Keeps track of what it has already read using something called an offset.
21 |
22 |
23 | Example:
24 |
25 | * A dashboard that reads real-time user activity.
26 | * A fraud detection system that consumes payment events.
27 | * A database sync service that consumes topic data and writes it into a table.
28 |
29 | ## How Consumers Work
30 |
31 | When a consumer subscribes to a topic:
32 |
33 | * Kafka assigns one or more **partitions** from that topic to the consumer.
34 | * The consumer reads messages in order (based on **offsets**).
35 | * After processing each message, it can commit the offset, meaning “*I’ve read this message; move to the next.*”
36 |
37 | If a consumer crashes and restarts, it resumes reading from the last committed offset, not from the beginning.
38 |
39 |
40 | ## Consumer Configurations (Key Settings)
41 |
42 | Here are the most important configuration properties you’ll use with Kafka consumers 👇
43 |
44 | ### 1. `bootstrap.servers`
45 |
46 | * The Kafka brokers your consumer connects to.
47 | * Example: `localhost:9092` or `broker1:9092`,`broker2:9092`
48 |
49 | ### 2. `group.id`
50 |
51 | * Consumers belong to consumer groups (more on that later).
52 | * The `group.id` identifies which group the consumer belongs to.
53 | * All consumers in the same group share the work. Kafka makes sure each partition is read by only one consumer in the group.
54 |
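A minimal consumer sketch tying these settings together, using the `confluent-kafka` Python client (`pip install confluent-kafka`). The broker address, topic name, and group id are placeholders for illustration:

```python
from confluent_kafka import Consumer

conf = {
    "bootstrap.servers": "localhost:9092",  # broker(s) to connect to
    "group.id": "demo-dashboard",           # the consumer group this reader belongs to
    "auto.offset.reset": "earliest",        # where to start when no committed offset exists
}

consumer = Consumer(conf)
consumer.subscribe(["user-activity"])       # hypothetical topic

try:
    while True:
        msg = consumer.poll(1.0)            # wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # Offsets are auto-committed periodically by default (enable.auto.commit)
        print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value().decode('utf-8')}")
finally:
    consumer.close()                        # leave the group cleanly
```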
55 |
56 |
57 |
58 |
59 |
60 |
--------------------------------------------------------------------------------
/09_aws_cloud/04-Elastic-Compute-Cloud(EC2)/04-Security.md:
--------------------------------------------------------------------------------
1 | # EC2 SECURITY
2 |
3 | Launching instances is one thing; securing them so that the right people have access, and unauthorised individuals
4 | are kept out, is very CRITICAL.
5 | We will cover 3 security mechanisms:
6 | - IAM
7 | - Key Pair
8 | - Security Group
9 |
10 | ## IAM
11 | Applications must sign their API requests with AWS credentials. Therefore,
12 | if you are an application developer, you need a strategy for managing credentials for your applications that run on EC2 instances.
13 | For example, you can securely distribute your AWS credentials to the instances, enabling the applications on those instances to use
14 | your credentials to sign requests, while protecting your credentials from other users. However, it's challenging to securely
15 | distribute credentials to each instance, especially those that AWS creates on your behalf, such as Spot Instances or instances in
16 | Auto Scaling groups. You must also be able to update the credentials on each instance when you rotate your AWS credentials.
17 |
18 | We designed IAM roles so that your applications can securely make API requests from your instances,
19 | without requiring you to manage the security credentials that the applications use. Instead of creating and distributing your AWS credentials,
20 | you can delegate permission to make API requests using IAM roles.
21 |
22 | ## KEY PAIRS
23 | A key pair, consisting of a public key and a private key, is a set of security credentials that you use to prove your identity when connecting
24 | to an Amazon EC2 instance. For Linux instances, the private key allows you to securely SSH into your instance. For Windows instances,
25 | the private key is required to decrypt the administrator password, which you then use to connect to your instance.
26 |
27 | Amazon EC2 stores the public key on your instance, and you store the private key, as shown in the following diagram.
28 | It's important that you store your private key in a secure place because anyone who possesses your private key can connect to your instances
29 | that use the key pair.
30 |
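As an illustrative sketch of using a key pair to connect to a Linux instance (the file name, user name, and IP address are placeholders):

```bash
# The private key file must not be readable by others, or SSH will refuse to use it
chmod 400 my-key-pair.pem

# Connect to the instance; ec2-user is typical for Amazon Linux AMIs, other AMIs differ
ssh -i my-key-pair.pem ec2-user@203.0.113.10
```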
31 | ## SECURITY GROUP
32 | A security group acts as a virtual firewall for your EC2 instances to control incoming and outgoing traffic.
33 | Inbound rules control the incoming traffic to your instance, and outbound rules control the outgoing traffic from your instance.
34 | When you launch an instance, you can specify one or more security groups.
35 | If you don't specify a security group, Amazon EC2 uses the default security group for the VPC. After you launch an instance,
36 | you can change its security groups.
37 |
38 | Security is a shared responsibility between AWS and you. For more information, see Security in Amazon EC2. AWS provides security groups as one of the tools for securing your instances, and you need to configure them to meet your security needs. If you have requirements that aren't fully met by security groups, you can maintain your own firewall on any of your instances in addition to using security groups.
39 |
40 |
41 |
--------------------------------------------------------------------------------
/01_linux/01-linux-overview.md:
--------------------------------------------------------------------------------
1 | # LINUX OVERVIEW
2 | This section will give an overview of what `Linux` is about in layman's terms, but before we dive into that, it's important to take a step back and understand what an `Operating System` is.
3 |
4 | Let's consider a scenario where you just bought a new laptop. Typically it will come with one of the popular `Operating Systems`, such as `MacOS` or `Windows`.
5 | It's also worth mentioning that this laptop will come with its hardware specifications like `CPU`, `Memory`, `Storage`, etc.
6 |
7 | The below image shows what the laptop will have and this also demonstrates the relationship between the `User`, the `Operating System` and the underlying `Hardware`.
8 |
9 |
10 |
11 | Image Summary below 👇
12 | - The `First Layer` shows the User who owns the laptop, the User will install applications on his or her laptop like `VScode`.
13 | - The `Second Layer` is the application layer, in this case, we assumed the User launched the VScode to enable he or she to write code.
14 | - The `Third Layer` is where the `Operating System software` sits, in this case, the application launched on the second layer will communicate with the `Operating System`.
15 | - The `Fourth Layer` highlights the `Operating System` talking to the hardware to request for resources like `CPU`, `Memory` to be allocated to the VScode application that was launched.
16 |
17 | In short, `Operating Systems` are software that bridge the communication between User applications and the underlying Hardware. As users, we launch various applications on our laptop, but how
18 | the applications get the resources to run is hidden away from us; this is made possible because of the Operating System software.
19 |
20 | # WHAT IS LINUX
21 | `Linux` is an open source `Operating System`, similar to `MacOS` and `Windows`. Open Source in this context means it is developed by the community, unlike `MacOS` and `Windows`, which are developed and owned
22 | by Apple Inc and Microsoft respectively.
23 |
24 | The image below shows similar layers to the ones above; the major difference is that the applications communicate with the `Linux Kernel`, which is the actual Operating System
25 | software, via the `System Calls` interface.
26 |
27 |
28 |
29 | Brief read on Linux Kernel [here](https://www.redhat.com/en/topics/linux/what-is-the-linux-kernel).
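If you already have access to a Linux machine, a quick optional check (a minimal sketch, nothing more) is to print the kernel release of the system you are on:

```bash
# Print the kernel name and release version of the running system
uname -sr
```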
30 |
31 | # WHY LINUX
32 |
33 | - `Zero Cost`: Linux is free to use, the source code is available to everyone.
34 | - `Security`: Linux is very secure, largely as a result of the strong community backing the project.
35 | - `Industry Standard`: Linux is widely adopted in the tech space; the vast majority of servers, cloud workloads and supercomputers run on Linux.
36 | - `High Demand`: Linux skills are in high demand; because Linux powers so much of the infrastructure out there, Engineers with Linux skills are sought after.
37 |
38 |
39 |
--------------------------------------------------------------------------------
/10_terraform/03-Terraform-Resource-Template.md:
--------------------------------------------------------------------------------
1 | # RESOURCE TEMPLATE
2 | Let's take a look at a resource template, what it looks like, and how we can confidently work with and understand
3 | any one we see later. The Resource template is basically the script available in the Terraform documentation
4 | to provision resources.
5 |
6 | But we don't just want to copy it, we want to understand what it is. A Terraform resource template file ends
7 | with `.tf`. Let's assume we've created a file called `demo.tf` and we would like to create a simple
8 | IAM User resource. If you want to understand what an `IAM USER` is, please check our previous
9 | note [HERE](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02-Identity-And-Access-Management(IAM)/00-iam-resources.md#iam-user).
10 |
11 | The script to create an `IAM User` looks like the below:
12 | ```
13 | resource "aws_iam_user" "lb" {
14 | name = "testing"
15 | path = "/system/"
16 |
17 | tags = {
18 | tag-key = "tag-value"
19 | }
20 | }
21 | ```
22 |
23 | **SCRIPT BREAKDOWN**
24 | - `resource`: This simply means you want to create a new resource; it will always be constant any time you want to create a new resource with Terraform.
25 | - `aws_iam_user`: This is telling Terraform the specific resource type you want to create, and there are lots of
26 | resource types. For example, if you want to create an S3 bucket, the type will be different, and you can see the resource
27 | name for an S3 bucket [HERE](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket).
28 | - `lb`: This is the identifier for that resource in the Terraform State file. For example, if you want to create 2 IAM Users, their names will obviously be different; in the above code block, the name of the IAM User is `testing`, but in Terraform you don't use the resource properties to reference a resource, the resource identifier is used to reference it (see the command sketch after this list).
29 | - `name`, `path`, `tags`: These are the `Properties` the resource supports. It's just like creating a "human being" resource: a human being will have properties like eyes, legs, etc.
30 | - Every resource you create with Terraform will have one or more properties.
31 | - All supported properties are always defined in the Terraform Documentation. For example, you can see all properties supported for `IAM USER` [HERE](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_user#argument-reference).
32 | - Please note that any Property not listed there is not supported.
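As a small illustration of how that identifier is used (assuming the resource above has already been applied), you can address the resource from the CLI as `<resource_type>.<identifier>`:

```bash
# List every resource Terraform is tracking in the state file
terraform state list

# Show the attributes recorded for this specific resource, addressed by type + identifier
terraform state show aws_iam_user.lb
```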
33 |
34 | Running `terraform plan` against this template produces output like the below (more on `plan` in the next note):
35 | ```
36 | Terraform will perform the following actions:
37 |
38 | # aws_iam_user.lb will be created
39 | + resource "aws_iam_user" "lb" {
40 | + arn = (known after apply)
41 | + force_destroy = false
42 | + id = (known after apply)
43 | + name = "testing"
44 | + path = "/system/"
45 | + tags = {
46 | + "tag-key" = "tag-value"
47 | }
48 | + tags_all = {
49 | + "tag-key" = "tag-value"
50 | }
51 | + unique_id = (known after apply)
52 | }
53 |
54 | Plan: 1 to add, 0 to change, 0 to destroy.
55 | ```
56 |
57 |
58 |
--------------------------------------------------------------------------------
/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/04-SecurityGroup-and-NACL.md:
--------------------------------------------------------------------------------
1 | # SECURITY GROUP AND NACL
2 |
3 | A `Security Group` is a firewall that controls the traffic that is allowed to reach and leave the resources that it is associated with.
4 | For example, you can create a server in a specific subnet and create a security group that is attached to that server.
5 | This essentially acts as a firewall that controls communication into and out of the server.
6 |
7 | When you create a VPC, it comes with a default security group. You can create custom security groups for a VPC,
8 | each with its own inbound and outbound rules.
9 | - `Inbound`: Communication that comes into the resource, for example a Server or a Database.
10 | - `Outbound`: Communication that leaves the resource.
11 |
12 |
13 |
14 | ## HOW DOES A SECURITY GROUP WORK?
15 | Let's assume you create a Database in a public subnet; don't forget a public subnet has a route to the internet. The fact that it has a
16 | route to the internet does not automatically allow connections from anyone on the internet.
17 | - You will need to create a Security Group that will be attached to that Database Instance.
18 | - You will create a `Security Group Rule`:
19 | - `Ingress Rule`: Rules that control inbound traffic, basically who can connect to that Database.
20 | - Here you can say allow any connection into the Database from this specific IP Address or CIDR range.
21 | - `Egress Rule`: Rules that control outbound traffic, pretty much the communication that can leave the Database instance.
22 | - Here you can say allow any connection from the Database to go to this specific IP Address or CIDR range.
23 |
24 | ## WHAT ABOUT NACL
25 | `NACL` stands for `Network Access Control List`. This is a `firewall` that sits not on the resource residing inside a subnet, but at the `subnet` level itself. When you create a `NACL` and associate a subnet with it, if you deny inbound traffic from a specific IP into that subnet, no connection can reach the resources inside that subnet, even if the security group allows the connection.
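As a rough sketch (the NACL ID, rule number and IP are placeholders), denying inbound traffic from a single address at the subnet level could look like this with the AWS CLI:

```bash
# Add an inbound NACL rule (--no-egress) that denies all protocols from one specific address.
# NACL rules are evaluated in ascending rule-number order, so a low number takes effect first.
aws ec2 create-network-acl-entry \
  --network-acl-id acl-0123456789abcdef0 \
  --no-egress \
  --rule-number 90 \
  --protocol -1 \
  --rule-action deny \
  --cidr-block 203.0.113.10/32
```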
26 |
27 |
28 |
29 | IMAGE SUMMARY
30 | - The VPC has 1 subnet
31 | - The subnet has a server deployed in it
32 | - We have a Security Group that acts as a firewall controlling inbound and outbound traffic for the Server
33 | - Lastly, there is a NACL that acts as a firewall controlling inbound and outbound traffic on the subnet.
34 |
35 | The image below demonstrates that even if everything is configured correctly at the security group level to allow a connection to the server, if communication is blocked at the subnet level, nothing can come in. It's like a house whose main entrance is locked:
36 | there is no way to even get into the rooms inside.
37 |
38 |
39 |
40 |
41 |
--------------------------------------------------------------------------------
/01_linux/04-basic-commands.md:
--------------------------------------------------------------------------------
1 | # LINUX FILE SYSTEM
2 | Just like the other Operating Systems we mentioned previously, Linux has a file system.
3 | To work efficiently with Linux, it's important to understand the Linux file system and the various commands
4 | required to work with it.
5 |
6 | Before we dive into this, we need to know that when you create files or folders on your laptop, you are
7 | doing this on the Operating System's file system. The image below 👇 shows the layout of a Linux file system.
8 |
9 |
10 |
11 |
12 | ## SUMMARY OF SOME IMPORTANT DIRECTORIES
13 | - `/` is the Root directory of the Linux file system; all other directories (folders) like mnt, home, bin are inside the Root directory.
14 | - `/bin` directory contains common user command binaries, for example `printenv`, which users use to see all environment variables, etc.
15 | - `/sbin` directory is strictly for Linux administrators; it contains administrative commands like `adduser`, which is used for creating users. This directory should not be made available to basic users.
16 | - `/home` directory contains a directory for each individual user created; it serves as the starting point for each user.
17 | - `/etc` directory serves as a central directory for system-wide configuration files; for example, the `shadow` file in the `/etc` directory contains users' encrypted passwords.
18 | - `/proc` directory provides access to Linux Kernel process information and system metrics; for example, the `meminfo` file in this directory provides detailed information about the system's memory usage.
19 |
20 | ## BASIC LINUX COMMANDS
21 | Linux has lots of commands available to users; we will cover a few important ones, but the exhaustive list can be found in the `/bin` directory inside the Root directory. A short session putting these commands together is shown after this list.
22 | - `pwd`: This command means Print Working Directory; it shows the current directory you are in, starting from the `/` Root directory.
23 | - Command usage is simply running `pwd` on the terminal.
24 | - `cd`: This means change directory; it helps to move/switch from one directory to another. For example, if you are in the Root directory `/`, running `cd bin` will take you into the bin directory.
25 | - Command usage: `cd DirectoryName` on the terminal.
26 | - `mkdir`: This means make a new directory; it essentially creates a new directory right inside the directory where you initiate the command. For example, if you are inside the `/` Root directory and you run `mkdir CoreDataEngineers`, the folder `CoreDataEngineers` will be created inside the `/` Root directory of your Linux file system.
27 | - Command usage: `mkdir DirectoryName`.
28 | - `touch`: The touch command is used to create files inside a directory in Linux. Let's assume that in our newly created `CoreDataEngineers` directory we want to create a file called `registration.txt`; we will use touch to achieve this. You simply change directory into the `CoreDataEngineers` directory and run `touch registration.txt`.
29 | - Command usage: `touch FileName`.
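Putting the commands above together, a minimal terminal session (using the example names from this note) might look like this:

```bash
pwd                        # show where you currently are in the file system
mkdir CoreDataEngineers    # create a new directory in the current location
cd CoreDataEngineers       # move into the new directory
touch registration.txt     # create an empty file inside it
ls -l                      # list the directory contents with their permissions
```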
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 |
--------------------------------------------------------------------------------
/06_docker/02-Docker-Architecture.md:
--------------------------------------------------------------------------------
1 | ## Docker Architecture
2 |
3 | The architecture of Docker explains how Docker works under the hood.
4 |
5 | ### What is Docker (In Simple Terms)?
6 | Imagine you have a recipe for a delicious dish. You want to make it on your friend's gas cooker, on your office's gas cooker, and even in a rented apartment, and you want it to taste the same every time. But every kitchen is different.
7 |
8 | Docker is like a portable *kitchen-in-a-box*. It lets developers package their application (the recipe) along with everything it needs to run (ingredients, cookware, spices) into one neat box (called a container) — so it runs the same way anywhere.
9 | ________________________________________
10 | ## The Building Blocks of Docker Architecture
11 | Docker has several key components that work together to make this magic happen:
12 | 1. **Docker Client**
13 | * Analogy: You are the chef placing an order.
14 | * The Docker client is what you use to interact with Docker — usually via the docker command in your terminal.
15 | * It talks to the Docker Engine (daemon) to carry out your instructions.
16 | 2. **Docker Daemon (Engine)**
17 | * Analogy: The kitchen staff who prepares what you ordered.
18 | * Runs in the background and manages Docker objects (containers, images, networks, etc.).
19 | 3. **Docker Images**
20 | * Analogy: A blueprint or recipe.
21 | * A Docker image is a read-only template with instructions for creating a container (e.g., an image with Python installed).
22 | 4. **Docker Containers**
23 | * Analogy: A dish prepared using the recipe.
24 | * A running instance of a Docker image. It is isolated and includes everything needed to run the app.
25 | 5. **Dockerfile**
26 | * Analogy: A written set of steps for the recipe.
27 | * A plain-text file with instructions on how to build a Docker image.
28 | 6. **Docker Compose**
29 | * Analogy: A dinner set with multiple recipes (e.g., rice + soup + drink).
30 | * Used to define and run multi-container Docker applications using a YAML file (```docker-compose.yml```).
31 | 7. **Docker Hub / Registry**
32 | * Analogy: A public cookbook library.
33 | * A place to store and share Docker images (e.g., Docker Hub).
34 | ________________________________________
35 | ### How They All Work Together
36 | When you run this:
37 | ```bash
38 | docker run -it python:3.10
39 | ```
40 | Here's what happens:
41 | 1. Docker Client sends the command to the Docker Daemon.
42 | 2. Daemon checks if it has the image locally. If not, it pulls it from Docker Hub.
43 | 3. The image is used to spin up a container.
44 | 4. You get a terminal inside that container (Python shell).
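You can watch roughly the same flow with a few everyday Docker commands (a sketch; the image and options are just examples):

```bash
docker pull python:3.10        # client asks the daemon to fetch the image from Docker Hub
docker images                  # list the images the daemon now has stored locally
docker run -it python:3.10     # daemon creates and starts a container from that image
docker ps -a                   # list containers, including ones that have exited
```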
45 | ________________________________________
46 | ## Docker Architecture Diagram
47 | [Docker Architecture](https://github.com/user-attachments/assets/cc0b0914-2e26-4807-8593-55f686613aaa)
48 |
49 | ## Summary Table
50 |
51 | |Component |Role | Analogy|
52 | ----- |--- |--- |
53 | |**Docker Client** | Sends commands |You (the chef)|
54 | |**Docker Daemon** |Executes commands |Kitchen staff|
55 | |**Docker Image**| Blueprint for container |Recipe|
56 | |**Docker Container**| Running instance of an image |Prepared dish|
57 | |**Dockerfile** |Instructions to build image |Written recipe|
58 | |**Docker Compose** |Manages multiple containers |Full-course meal|
59 | |**Docker Registry** |Stores and shares images |Online cookbook (e.g. Hub)|
60 |
--------------------------------------------------------------------------------
/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/02-Subnets.md:
--------------------------------------------------------------------------------
1 | # SUBNETS
2 | Next up is the Subnet. This is a `CRITICAL` resource of the VPC; please ensure you've covered the below in the specified order:
3 | - [IP Addressing & CIDR](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/00-IP-Addressing.md)
4 | - [VPC Overview](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/01-VPC-Overview.md)
5 |
6 | Let's start exploring other resources inside a VPC. In our last note on the VPC, we highlighted that it's a Private Network
7 | that comes with a CIDR Range, which is nothing but a list of IPs that make up the whole network. We
8 | further said that if an IP is not part of that VPC CIDR Range, it will not be able to communicate with
9 | resources inside that VPC.
10 |
11 | ### WHAT IS A SUBNET
12 | In the context of a VPC, it's nothing but a range of IPs. Yes, you heard that correctly, but in this case, it's
13 | a range of IPs taken from the VPC CIDR Range.
14 |
15 | This simply means that when you create a subnet, you need to specify the CIDR Range for that subnet; it's a way of dividing the overall network into smaller chunks. But it's important to know that the subnet CIDR Range must come from within the VPC CIDR Range.
16 |
17 | Let's represent this visually and summarise it.
18 |
19 |
20 |
21 | Image Summary
22 | - The VPC has a CIDR Range `10.0.0.0/28`
23 | - The VPC has `2` Subnets
24 | - Subnet-A has a CIDR Range `10.0.0.0/30`, which has `4` IPs.
25 | - Subnet-B has a CIDR Range `10.0.0.8/29`, which has `8` IPs (a quick way to verify these counts is shown after this list).
26 | - Use this [tool](https://www.zerobounce.net/ip-range-cidr-converter/) to convert a range of IPs to CIDR Range
27 | - The VPC has `4` more IPs not allocated to any Subnet.
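If you want to sanity-check these numbers yourself, the number of addresses in a CIDR block is 2 to the power of (32 minus the prefix length); a quick shell check (plain arithmetic, nothing AWS-specific):

```bash
echo $(( 2 ** (32 - 28) ))   # /28 -> 16 addresses (the whole VPC)
echo $(( 2 ** (32 - 30) ))   # /30 -> 4 addresses  (Subnet-A)
echo $(( 2 ** (32 - 29) ))   # /29 -> 8 addresses  (Subnet-B)
```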
28 |
29 | ### BENEFITS OF SUBNETS
30 | - `Logical Isolation`: Subnets allow you to divide your VPC into smaller, manageable units. This isolation is crucial for security, as it prevents resources in one subnet from directly accessing resources in another, unless explicitly allowed through routing rules and security configurations which we will cover soon.
31 | - `Security Control`: Subnets can be either `public` or `private`. `Public Subnets` have access to the internet through `Routing`, while `Private Subnets` do not have a direct route to the internet. This allows for granular control over which resources are exposed to the internet and which are kept internal.
32 |
33 | To be more specific, when you create a Database or a Server in a VPC, it is actually deployed into a specific Subnet. Let's represent that visually.
34 |
35 |
36 |
37 | IMAGE SUMMARY
38 | - The Overall VPC Network is 10.0.0.0/28, which has 16 IPs
39 | - The Network is further divided into 2 smaller chunks: Subnet A and Subnet B
40 | - A Server is created in Subnet A (4 IPs); for your information, one of those IPs will be attached to the Server.
41 | - A Database is created in Subnet B (8 IPs); one of those IPs will be attached to the server hosting that Database behind the scenes.
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 |
50 |
--------------------------------------------------------------------------------
/10_terraform/04-Basic-Resource-Provisioning.md:
--------------------------------------------------------------------------------
1 |
2 | # BASIC RESOURCE PROVISIONING
3 | This guide will walk us through basic resource provisioning; basically, we will create a simple resource with Terraform and go
4 | through some relevant commands used to achieve this. In addition to this, we will try to understand what we see when we run these commands.
5 |
6 | From our previous note about what a resource template looks like, we will use that script for this walkthrough.
7 | Let's assume you want to start from scratch; ideally you will want to create one important file first, and the naming convention is to call that file
8 | `provider.tf`.
9 |
10 | ## STEP 1
11 | - Create a `provider.tf` and add the below block of code into it, save the file.
12 | ```
13 | provider "aws" {
14 | region = "eu-central-1"
15 | }
16 | ```
17 | - If you run the `terraform init` command, it will output the below.
18 |
19 |
20 | **SUMMARY**
21 | The `terraform init` command uses what you defined in the `provider.tf` file to know which platform it needs to communicate with; basically, Terraform will initialise
22 | your project and download the provider plugin you specified in your `provider.tf`. This plugin is what Terraform will use to communicate with your cloud
23 | provider.
24 |
25 | ## STEP 2
26 | Now that we've initialised the project and the plugins Terraform needs to communicate with AWS have been downloaded, next is to create a
27 | sample file called `demo.tf`.
28 | - Add the below terraform script and save it, the script is simply to create an IAM User in AWS.
29 | ```
30 | resource "aws_iam_user" "lb" {
31 | name = "loadbalancer"
32 | path = "/system/"
33 |
34 | tags = {
35 | tag-key = "tag-value"
36 | }
37 | }
38 | ```
39 | - If you run `terraform plan`, Terraform will give you the plan output of what you want to create, with a detailed summary and the properties;
40 | you should see something like the below.
41 |
42 |
43 |
44 | - The image tells us you plan to add 1 resource
45 | - The IAM User `id`, `arn`, `unique_id` will all be known after you apply
46 | - We added `3 Properties/Arguments` in our above block of code, but Terraform added some other properties and some `Attributes` of the resource.
47 |
48 | ## STEP 3
49 | Review the plan carefully; if it shows output that doesn't match your intent, you might want to revisit your script.
50 | - Now we need to create the resource, since the plan looks fine to us.
51 | - Running `terraform apply` will ask you if you really want to create the resource; it will prompt you to type `yes` if you do, something like the below.
52 |
53 |
54 |
55 | - After typing `yes` and hitting enter, it will create the resource; you should see something like the below.
56 |
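For reference, the whole walkthrough boils down to this command sequence, run from the directory that contains `provider.tf` and `demo.tf`:

```bash
terraform init      # download the AWS provider plugin declared in provider.tf
terraform plan      # preview the changes Terraform intends to make
terraform apply     # create the IAM User after you confirm with "yes"
```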
57 |
58 |
59 | Now that we've successfully provisioned our IAM User resource, next is understanding something that allows Terraform to function, and that is the
60 | `State File`.
61 |
62 |
63 |
64 |
65 |
--------------------------------------------------------------------------------
/12_apache_kafka/05-Partition-and-Offset.md:
--------------------------------------------------------------------------------
1 | # Partitions and Offsets in Kafka
2 | To understand how Kafka organizes data, let’s revisit topics briefly.
3 | A topic is like a folder where related events (messages) are stored. But inside a topic, data isn’t just thrown into one big pile; it is divided into partitions.
4 |
5 | - [What is a Partition?](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/05-Partition-and-Offset.md#what-is-a-partition)
6 | - [What is an Offset?](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/05-Partition-and-Offset.md#what-is-an-offset)
7 | - [Why Partitions and Offsets are Important](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/05-Partition-and-Offset.md#why-partitions-and-offsets-are-important)
8 | - [Quick Analogy](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/05-Partition-and-Offset.md#quick-analogy)
9 |
10 | ## What is a Partition?
11 | A partition is a smaller chunk of a topic that holds a sequence of events (messages).
12 |
13 | * Each partition is an ordered log, meaning new messages are always added at the end.
14 | * Once written, messages never change (they are immutable).
15 | * Partitions make topics scalable and distributed because data can be spread across multiple brokers (servers) in a Kafka cluster.
16 |
17 | Think of a partition like a page in a notebook: each new line you write is the next message. If the notebook is too small, you can add more pages (partitions).
18 |
19 | **Example:**
20 | Imagine a topic called user-signups, where:
21 |
22 | * Partition 0 might store signups from users A–M.
23 | * Partition 1 might store signups from users N–Z.
24 | This way, data is balanced and can be processed in parallel.
25 |
26 | **NOTE:** It doesn't always work like this, as there are different partition strategies, but that is outside the scope of this lesson.
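If you have a local Kafka broker available, a topic like this could be created and inspected with the CLI tools that ship with Kafka (the topic name and broker address are just examples):

```bash
# Create a topic with 2 partitions on a broker listening on localhost:9092
# (replication factor 1 assumes a single-broker test setup)
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-signups \
  --partitions 2 \
  --replication-factor 1

# Describe the topic to see its partitions and which broker leads each one
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic user-signups
```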
27 |
28 |
29 | ## What is an Offset?
30 | Inside a partition, each message is given a unique number called an `offset`.
31 |
32 | * The offset is like the line number on a notebook page.
33 | * It tells Kafka (and consumers) the exact position of a message in a partition.
34 | * Offsets start at 0 and increase by 1 for each new message.
35 |
36 | Offsets are local to a partition, not global to the topic.
37 |
38 | * For example, Partition 0 might have messages with offsets 0, 1, 2...
39 | * Partition 1 also starts at offset 0, 1, 2... independently.
40 |
41 | ## Why Partitions and Offsets are Important
42 | 1. Scalability:
43 | * Partitions let Kafka spread topic data across multiple brokers, enabling it to efficiently handle massive volumes of messages/events.
44 | 2. Parallelism:
45 | * Multiple consumers can read from different partitions at the same time, speeding up processing.
46 | 3. Ordering Guarantees:
47 | * Within a single partition, messages are strictly ordered.
48 | * Across multiple partitions, Kafka does not guarantee global ordering.
49 | 4. Tracking Progress:
50 | * Consumers keep track of the offset they last read.
51 | * If a consumer crashes, it can resume from the last committed offset without losing data.
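To see offsets in practice, the console consumer bundled with Kafka (names below are the same examples as above) can replay a topic from the earliest offset:

```bash
# Replay all messages in the topic, starting from the earliest available offset in each partition
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic user-signups \
  --from-beginning
```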
52 |
53 |
54 | ## Quick Analogy
55 |
56 | Imagine a library shelf (topic):
57 | * Each shelf section (partition) contains books lined up in order (0,1,2,3,4,...).
58 | * Each book has a page number (offset) that never changes (Page 1, 2, 3, 4, 5, ...).
59 | * Multiple people (consumers) can read different shelf sections at once, but within a section, everyone sees the books in the same order.
60 |
61 |
62 |
--------------------------------------------------------------------------------
/01_linux/05-file-permissions.md:
--------------------------------------------------------------------------------
1 | # LINUX FILE PERMISSION
2 | As we covered previously, Users are usually created and a `User` can be added to a specific `Group`. We also covered some common `Linux` commands which are useful for navigating around the file system, and not just navigation, but also for creating and deleting files and directories.
3 |
4 | As much as these are the ways users engage with the file system, it is also very dangerous if access to the `Linux` file system is not well controlled. Let's take a scenario where the administrator creates two users, `A` and `B`. Let's assume User `A` has access to the `/etc/shadow` file that contains all users' encrypted passwords, and User `A` accidentally deletes that file; this will cause a massive problem.
5 |
6 | Another scenario: let's say User `A` created a script inside a file in a specific folder. We assume User `B` has access to this file and executes it; this can cause an unexpected incident in a case where User `A` only wants to run that script once per year, maybe to do some clean-up.
7 |
8 | To prevent critical incidents due to users' unrestricted actions, `Linux` introduces `file permissions`. The aim is to restrict each individual to only what the administrator thinks they should have access to.
9 |
10 | # PERMISSION OUTPUT BREAKDOWN
11 | The command used to verify the permissions applied to `files` and `directories` in Linux is `ls -l`; if you run this command from a specific directory, it will list all the files and directories with the permissions set on each of them.
12 |
13 | Let's quickly understand the labelled image below in detail; it shows the complete output of the command Linux users run to check what access they have.
14 |
15 | 
16 | Starting from left to the right
17 |
18 | - The letter `d` stands for `directory`; this means the object whose permission you are checking is a `directory`. If you have a `-` dash, it means it's a `file`.
19 | - For example, if you see an output like this `-rwxrwxrwx`, it simply means it's a `file`; however, if you have `drwxrwxrwx`, then it's a `directory`.
20 | - Next is a box labelled `USER`; inside the box we have `rwx`. This first box shows the permissions for the `USER (owner)` who created the directory or the file. Let's assume we have this output `-rwxrwxrwx`:
21 | - `r` means the User that owns the file has read permission on that file; read permission includes being able to view and list the file.
22 | - `w` means the User that owns the file has write permission; write permission includes being able to delete the file and being able to write something inside the file.
23 | - `x` shows the User who owns the file is allowed to execute the file.
24 | - The middle box labelled `GROUP` shows the permissions for the group which the User who owns the file or directory belongs to. Basically, the User can have write access on their file, but the group the User belongs to might not have that permission, because maybe you don't want anyone else to alter your file.
25 | - `r` means the `Group` of the `User` that owns the file has read permission on that file; anyone who is part of that group will inherit that permission.
26 | - `w` means the `Group` of the `User` that owns the file has write permission; write permission includes being able to delete the file and being able to write something inside the file.
27 | - `x` shows the `Group` of the `User` who owns the file is allowed to execute the file.
28 | - The last box labelled `OTHERS` is for anyone who is not the owner of the file or the directory and who is also not in the group the owner belongs to.
29 | - `r` means other users have read permission on the directory or file.
30 | - `w` means other users have write permission on the directory or file.
31 | - `x` means other users have permission to execute the file.
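As a quick, hypothetical illustration (the file name and permission bits below are made up for the example), this is how you would check and read such output in a shell:

```bash
ls -l registration.txt
# Example output (illustrative only):
# -rwxr-x--- 1 userA engineers 120 Jan 10 09:30 registration.txt
#  ^ '-'  = regular file (a 'd' here would mean directory)
#   rwx   = userA (the owner) can read, write and execute
#      r-x = members of the 'engineers' group can read and execute, but not write
#         --- = everyone else has no access at all
```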
32 |
--------------------------------------------------------------------------------
/02_SQL/02-sql- syntax-fundamentals.md:
--------------------------------------------------------------------------------
1 | ## SQL Syntax Fundamentals
2 | SQL follows a declarative syntax pattern:
3 |
4 | ```sql
5 | SELECT [DISTINCT] column_names / aggregations
6 | FROM table_name
7 | [WHERE conditions]
8 | [GROUP BY groupings]
9 | [HAVING aggregate_conditions]
10 | [ORDER BY sort_columns]
11 | [LIMIT row_count];
12 | ```
13 |
14 | Key syntax rules:
15 | - **Case Insensitive**: `SELECT` ≡ `select` (but conventions use uppercase for keywords)
16 | - **Termination**: Semicolon `;` ends statements (required in most DBMS)
17 | - **Whitespace**: Insensitive to spaces/tabs/newlines
18 | - **Comments**:
19 | ```sql
20 | -- Single line comment
21 | /* Multi-line
22 | comment */
23 | ```
24 |
25 | ## Data Types in SQL
26 |
27 | | Category | Common Types | Description | Example Values |
28 | |----------------|----------------------------|--------------------------------------|--------------------------|
29 | | **Numeric** | `INT`, `BIGINT`, `DECIMAL` | Whole and decimal numbers | `42`, `3.14159` |
30 | | **Character** | `VARCHAR(n)`, `CHAR(n)` | Text with variable/fixed length | `'SQL'`, `'X'` |
31 | | **Temporal** | `DATE`, `TIMESTAMP` | Dates and precise timestamps | `'2025-05-23'`, `NOW()` |
32 | | **Boolean** | `BOOLEAN` | True/false values | `TRUE`, `FALSE`, `NULL` |
33 |
34 |
35 | *Note: Type availability varies by database system*
36 |
37 | ## Basic SQL Commands
38 |
39 | ### SELECT (Retrieves Data)
40 | ```sql
41 | -- Basic form
42 | SELECT column1, column2 FROM table_name;
43 |
44 | -- Selecting all columns
45 | SELECT * FROM employees;
46 |
47 | -- Calculations
48 | SELECT product, price * quantity AS total_value FROM orders;
49 | ```
50 |
51 | ### INSERT (Adds Data into the existing table)
52 | ```sql
53 | -- Specify columns
54 | INSERT INTO customers (name, email)
55 | VALUES ('John Doe', 'john@example.com');
56 |
57 | -- Insert multiple rows
58 | INSERT INTO products (id, name, price) VALUES
59 | (1, 'Laptop', 999.99),
60 | (2, 'Mouse', 24.95);
61 | ```
62 |
63 | ### UPDATE (Modify Data)
64 | ```sql
65 | -- Single column update
66 | UPDATE inventory SET stock = 50 WHERE product_id = 101;
67 |
68 | ```
69 |
70 | ### DELETE (Remove Data)
71 | ```sql
72 | -- With condition
73 | DELETE FROM logs WHERE created_at < '2025-01-01';
74 |
75 | -- WARNING! Unconditional delete
76 | DELETE FROM temp_data; -- Removes ALL rows
77 | ```
78 |
79 | ## Filtering with WHERE Clause
80 |
81 | ```sql
82 | -- Basic comparisons
83 | SELECT * FROM products WHERE price > 100;
84 |
85 | -- Multiple conditions
86 | SELECT name FROM employees
87 | WHERE department = 'Sales' AND hire_date > '2024-01-01';
88 |
89 | -- Pattern matching
90 | SELECT * FROM contacts
91 | WHERE email LIKE '%@gmail.com';
92 |
93 | -- NULL handling
94 | SELECT * FROM orders
95 | WHERE ship_date IS NULL;
96 | ```
97 |
98 | Common WHERE operators:
99 | - `=`, `<>`/`!=` (equality)
100 | - `>`, `<`, `>=`, `<=` (comparison)
101 | - `BETWEEN` (range)
102 | - `IN` (value list)
103 | - `LIKE` (pattern)
104 | - `IS [NOT] NULL` (null checks)
105 | - `AND`, `OR`
106 |
107 | ## Sorting with ORDER BY
108 |
109 | ```sql
110 | -- Single column sort
111 | SELECT * FROM products ORDER BY price DESC;
112 |
113 | -- Multi-column sort
114 | SELECT first_name, last_name FROM employees
115 | ORDER BY department ASC, hire_date DESC;
116 |
117 | -- Using column positions
118 | SELECT name, price, stock FROM inventory
119 | ORDER BY 2 DESC, 3 ASC; -- Sorts by price then stock
120 | ```
121 |
122 | ## Limiting Results
123 |
124 | ```sql
125 | -- MySQL/PostgreSQL/SQLite
126 | SELECT * FROM table_name LIMIT 10;
127 |
128 | -- SQL Server
129 | SELECT TOP 10 * FROM table_name;
130 |
131 | -- Oracle
132 | SELECT * FROM table_name
133 | FETCH FIRST 10 ROWS ONLY;
134 | ```
135 |
136 | Performance Tip: Always use `ORDER BY` with `LIMIT` for predictable results!
137 |
--------------------------------------------------------------------------------
/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/03-RouteTable-InternetGateway-NatGateway.md:
--------------------------------------------------------------------------------
1 | # ROUTE TABLE
2 | Let's talk about another `CRUCIAL` component in a VPC called the Route Table. To be able to proceed with this topic, it's important to have gone through the below in this order:
3 | - [IP Addressing & CIDR](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/00-IP-Addressing.md)
4 | - [VPC Overview](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/01-VPC-Overview.md)
5 | - [Subnets](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/02-Subnets.md)
6 |
7 | ## WHAT IS A ROUTE TABLE
8 | A Route Table is the component that controls the flow of communication in your VPC.
9 | - Every Route Table has a set of rules called `Routes`.
10 | - These Routes determine where communication from your Subnet is directed.
11 | - Every VPC comes with a default Route Table called the `Main Route Table`.
12 | - Every Subnet you create will automatically be associated with this Main Route Table.
13 | - You can create your own Route Table and explicitly associate your subnets with it.
14 | - A Destination in a Route Table simply means where you want to direct communication from your subnet to; this is represented as an IP or CIDR Range.
15 | - A Target in a Route Table is just the medium you want to use to get to the Destination.
16 |
17 | For example, you can create a Route Table and define a Route that says any communication from Subnet A should go to a specific IP Address on the internet. Let's represent this example visually.
18 |
19 |
20 |
21 | Wait, we have something new in the image called `Internet Gateway` 🤔.
22 | - Internet gateway is nothing but a component you attach to your VPC.
23 | - It facilitates communication from your VPC to the internet and also facilitates communication from the internet into your VPC.
24 | - Every Default VPC already has an Internet Gateway attached to it.
25 | - You cannot attach 2 Internet Gateways to a VPC; it can only take 1 Internet Gateway.
26 |
27 | Let's summarise the Image above
28 | - John is outside of the VPC, here we say he is on the internet considering his IP Address is not part of the VPC CIDR Range.
29 | - We have a VPC with 2 subnets ( SUBNET A and SUBNET B )
30 | - Subnet A is where we have our Database with no custom Route Table associated.
31 | - It's associated with the Main Route Table with no link to the Internet Gateway.
32 | - Any Subnet associated with a Route Table that has no route to the Internet Gateway is called a `PRIVATE SUBNET`.
33 | - Subnet B, on the other hand, has a custom Route Table explicitly associated with it.
34 | - That custom Route Table has a route (a link) to the Internet Gateway.
35 | - Any Subnet associated with a Route Table that has a route to the Internet Gateway is called a `PUBLIC SUBNET`.
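In practice (the IDs are placeholders), the route that makes a subnet public is a single entry pointing the default destination at the Internet Gateway, for example via the AWS CLI:

```bash
# Send all traffic not destined for the VPC itself (0.0.0.0/0) to the Internet Gateway
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id igw-0123456789abcdef0
```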
36 |
37 | ## NAT GATEWAY
38 | A NAT Gateway is used to establish communication from a Private Subnet to the internet.
39 | - The NAT Gateway is created in a public subnet.
40 | - An Elastic IP must be created and attached to the NAT Gateway.
41 | - An Elastic IP is nothing but a static public IP Address that you create and attach to resources like Servers or a NAT Gateway.
42 | - The reason is, if you stop a Server that already has a Public IP Address attached, when you restart
43 | the server it will get another Public IP, which can affect connectivity.
44 | - The private subnet's route table is configured to have a route to the internet using the NAT Gateway.
45 | - This essentially means the communication flows from the private subnet to the NAT Gateway in the public subnet, which then forwards it to the Internet Gateway already attached to the VPC on its way to the internet, as illustrated below.
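A rough CLI sketch of those steps (all IDs are placeholders, and the private route table is assumed to exist already):

```bash
# 1. Allocate an Elastic IP for the NAT Gateway
aws ec2 allocate-address --domain vpc

# 2. Create the NAT Gateway in a PUBLIC subnet, using the allocation id returned above
aws ec2 create-nat-gateway \
  --subnet-id subnet-0123456789abcdef0 \
  --allocation-id eipalloc-0123456789abcdef0

# 3. Point the PRIVATE subnet's route table at the NAT Gateway for internet-bound traffic
aws ec2 create-route \
  --route-table-id rtb-0fedcba9876543210 \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-0123456789abcdef0
```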
46 |
47 |
48 |
49 |
50 |
51 | ## CONCLUSION
52 | If you need John to communicate with a resource in your subnet, make sure that resource is provisioned inside a PUBLIC SUBNET;
53 | basically the subnet has to be associated with a Route Table that has a route linked to the Internet Gateway.
54 |
55 |
56 |
57 |
--------------------------------------------------------------------------------
/09_aws_cloud/04-Elastic-Compute-Cloud(EC2)/00-EC2-Overview.md:
--------------------------------------------------------------------------------
1 | # ELASTIC COMPUTE CLOUD - EC2
2 |
3 | An `EC2 instance` is a virtual server in the `AWS Cloud`; see it as a computer hosted by AWS somewhere
4 | that you can access from wherever you are.
5 |
6 | When you launch an `EC2 instance`, the instance type that you specify determines the hardware
7 | available to your instance.
8 | Each instance type offers a different spec of compute, memory, network, and storage resources.
9 |
10 | It's important to know that the EC2 instance is a very `CRITICAL` resource in AWS; in fact, it powers a lot of
11 | services behind the scenes. As a `Data Engineer`, all of our workloads need to run somewhere, and they typically
12 | run on servers like an `EC2 Instance`.
13 |
14 | ## AMAZON MACHINE IMAGE - AMI
15 | An Amazon Machine Image (AMI) is an image that provides the software that is required to
16 | set up and boot an Amazon EC2 instance.
17 | - You must specify an AMI when you launch an instance.
18 | - The AMI must be compatible with the instance type that you chose for your instance.
19 | - You can use an AMI provided by AWS, a public AMI, an AMI that someone else shared with you,
20 | or an AMI that you purchased from the AWS Marketplace.
21 |
22 | You can launch multiple instances from a single AMI when you require multiple instances with
23 | the same configuration. You can use different AMIs to launch instances when you require
24 | instances with different configurations, as shown in the following diagram.
25 |
26 | ## INSTANCE TYPES AND FAMILY
27 | The instance type is just the type of Server; instance types are differentiated mainly by the hardware available to them. For example, Instance A might have 1 CPU and Instance B might have 5 CPUs; this is similar to
28 | our computers, which all vary based on their hardware spec.
29 |
30 | An instance type is a combination of the Instance Family + the Instance size.
31 | See the image below from the AWS official Documentation.
32 |
33 |
34 |
35 | **IMAGE SUMMARY**
36 | - Every EC2 Instance comes with hardware like `CPU`, `Memory`, `Storage` and `Network`.
37 | - `c7gn` is the Family this instance belongs to.
38 | - `2xlarge` is the instance size.
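If you want to check the exact hardware behind a given instance type, the AWS CLI can describe it; a minimal sketch using the example type from the image:

```bash
# Show the full hardware specification (vCPUs, memory, network, storage) for one instance type
aws ec2 describe-instance-types --instance-types c7gn.2xlarge
```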
39 |
40 |
41 | EC2 instances are categorised based on the workload they suit.
42 | We will cover 5 of these categories that we consider the most common:
43 | - `General Purpose` ---> Provide a balance of compute, memory, and networking resources. These instances are ideal for applications that use these resources in equal proportions, such as web servers.
44 | - See [HERE](https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html) to see all Instance Types and specifications for the General purpose category
45 | - `Compute optimized` –--> Designed for compute intensive applications that benefit from high performance processors. These instances are ideal for batch processing workloads, high performance web servers, high performance computing (HPC), scientific modeling, dedicated gaming servers, ad server engines, and machine learning inference.
46 | - See [HERE](https://docs.aws.amazon.com/ec2/latest/instancetypes/co.html) to see all Instance Types and specifications for the Compute optimized category.
47 | - `Memory optimized` –--> Designed to deliver fast performance for workloads that process large data sets in memory.
48 | - See [HERE](https://docs.aws.amazon.com/ec2/latest/instancetypes/mo.html) to see all Instance Types and specifications for the Memory optimized category.
49 | - `Storage optimized` –--> Designed for workloads that require high, sequential read and write access to very large data sets on local storage. They are optimized to deliver tens of thousands of low-latency, random I/O operations per second (IOPS) to applications.
50 | - See [HERE](https://docs.aws.amazon.com/ec2/latest/instancetypes/so.html) to see all Instance Types and specifications for the Storage Optimized category.
51 | - `High-performance computing` –--> Purpose built to offer the best price performance for running HPC workloads at scale on AWS. These instances are ideal for applications that benefit from high-performance processors, such as large, complex simulations and deep learning workloads.
52 | - See [HERE](https://docs.aws.amazon.com/ec2/latest/instancetypes/hpc.html) to see all Instance Types and specifications for the High-performance computing category.
53 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/01-Redshift-Cluster-Architecture.md:
--------------------------------------------------------------------------------
1 | # REDSHIFT ARCHITECTURE
2 | From the overview of `Redshift`, we understood that it's a `Distributed System` which uses `Master` and `Slave` concept.
3 | We also saw it leverage a `Columnar Storage` architecture to store its data on Disk.
4 | Now let's dive into the architecture of a `Redshift Cluster`.
5 |
6 | Firstly, `Amazon Redshift` is based on `PostgreSQL`, so most existing SQL client applications will work with
7 | only minimal changes; however, the backend was rewritten to support `OLAP queries`, `Massively parallel processing (MPP)`
8 | and to be `Columnar` in nature. This allows Redshift to scale horizontally, meaning more `Compute Nodes` can be added
9 | to handle more complex workloads.
10 |
11 | `Clusters------>` The core infrastructure component of an Amazon Redshift data warehouse is a `Cluster`.
12 | - We know that a Cluster is made up of one or more Compute Nodes.
13 | - If a Cluster is provisioned with two or more Compute Nodes, an additional leader node coordinates the compute nodes and handles external communication. This Leader Node will be created additionally by AWS.
14 | - When you run any SQL Queries, you are actually talking to the Leader Node.
15 |
16 | `Leader Node------>` The leader node manages external communications with client programs and all communication with compute nodes.
17 | - This basically mean, if you connect to the Redshift Cluster using your favorite SQL client like Dbeaver, DataGrip e.t.c, you are communicating with the Leader Node behind the scene.
18 | - The Leader Node parses queries and develops `execution plans` to carry out database operations, in particular the series of steps necessary to obtain results for complex queries.
19 | - Based on the execution plan, the Leader Node compiles the code, distributes the compiled code to the Compute Nodes, and assigns a portion of the data to each Compute Node.
20 | - The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes.
21 | - All other queries run exclusively on the leader node.
22 | - Amazon Redshift is designed to implement certain SQL functions only on the leader node
23 |
24 | `Compute Nodes------>` The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes.
25 | - The compute nodes run the compiled code and send intermediate results back to the leader node for final aggregation.
26 | - Each compute node has its own dedicated CPU and memory, which are determined by the node type.
27 | - As your workload grows, you can increase the compute capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
28 |
29 | `Redshift Managed Storage------>` Data warehouse data is stored in a separate storage tier Redshift Managed Storage (RMS).
30 | - RMS provides the ability to scale your storage to petabytes using Amazon S3 storage.
31 | - RMS lets you scale and pay for computing and storage independently, so that you can size your cluster based only on your computing needs.
32 |
33 | `Node slices------>` A compute node is partitioned into slices.
34 | - Each slice is allocated a portion of the `Node's Memory` and `Disk Space`, where it processes a portion of the workload assigned to the node.
35 | - The `Leader Node` manages distributing data to the `Slice`s and apportions the workload for any queries or other database operations to the `Slices`.
36 | - The `Slices` then work in parallel to complete the operation.
37 |
38 | `Databases------>` A cluster contains one or more databases.
39 | - User data is stored on the Compute Nodes.
40 | - Your `SQL Client` communicates with the leader node, which in turn coordinates query execution with the compute nodes.
41 | - `Amazon Redshift` is a relational database management system (RDBMS), so it is compatible with other RDBMS applications.
42 | - Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data,
43 | Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.
44 |
45 |
46 |
47 |
48 | For more on the Architecture:
49 | - https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
50 | - https://docs.aws.amazon.com/prescriptive-guidance/latest/query-best-practices-redshift/data-warehouse-arch-components.html
51 |
52 |
53 |
54 |
55 |
--------------------------------------------------------------------------------
/09_aws_cloud/04-Elastic-Compute-Cloud(EC2)/03-Networking.md:
--------------------------------------------------------------------------------
1 | # EC2 NETWORKING
2 | We all know `Amazon VPC` enables us to create a private network where we can launch AWS resources, such as `Amazon EC2 instances`.
3 | When you launch an `EC2 Instance`, you must select the subnet within that VPC to launch the instance into.
4 | Don't forget that the subnet is just a division of the `VPC CIDR Range`, which is nothing but a bunch of `IP Addresses`.
5 |
6 | When you launch an instance in a subnet, that instance will be assigned one of the available IPs in that
7 | subnet. Let's understand this with the image below.
8 |
9 |
10 |
11 | **IMAGE SUMMARY**
12 | - We have a VPC with CIDR Range `10.0.0.0/28`, which corresponds to 16 IPs.
13 | - We have 2 subnets with CIDR Ranges `10.0.0.0/29` and `10.0.0.8/29`; each CIDR Range corresponds to 8 IPs.
14 | - An `EC2 Instance` launched in `SUBNET A` automatically gets one of the available IPs; this is done by AWS.
15 |
16 | ## ELASTIC NETWORK INTERFACE
17 | When you launch an instance, we know it gets an IP address from the available IPs in the Subnet it was launched into,
18 | but the reality is that the IP address is not attached to the EC2 Instance directly; in fact, it's attached to an `Elastic Network Interface`.
19 |
20 | An `Elastic Network Interface (ENI)` is a `CRITICAL` networking component in a VPC that represents a virtual network card; see it as a card. You can create and configure network interfaces and attach them to instances that you launch in the same Availability Zone. When you create an ENI in a specific Availability Zone, make sure that the Instance you are attaching it to is also in that same Availability Zone.
21 |
22 | So this is what happens when you launch an Instance:
23 | - AWS creates an Elastic Network Interface called the `Primary ENI`.
24 | - AWS attaches an available `IP Address` from that subnet to the `ENI`.
25 | - The `EC2 Instance` will be created and the `ENI` will be attached to that `EC2 Instance`.
26 |
27 | ## ELASTIC IP
28 | Before we go into `Elastic IP`, it's important to know that if you want anyone outside of your `VPC CIDR Range` to connect to your `EC2 instance`,
29 | you need to ensure `Public IP` is attached to the EC2 Instance during the time of creation. Again, behind the scene, that `Public IP` will be
30 | attached to the primary `Elastic Network Interface`.
31 |
32 | There is one issue that comes with this: when you stop that instance and start it again, you will not have the same `Public IP`; it will be changed
33 | to another one. This can be a problem sometimes, depending on the use case.
34 |
35 | To solve this problem, you can create an `Elastic IP` and attach it to the instance; when you stop and restart the Instance, the IP still
36 | remains the same. This `IP Address` is yours until you release it back to AWS. You can move an `Elastic IP` from one instance to another instance.
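A rough sketch of how that looks with the AWS CLI (the instance ID is a placeholder; the allocation ID comes from the first command's output):

```bash
# Allocate an Elastic IP address in the account
aws ec2 allocate-address --domain vpc

# Associate the Elastic IP with a running instance using the returned allocation id
aws ec2 associate-address \
  --instance-id i-0123456789abcdef0 \
  --allocation-id eipalloc-0123456789abcdef0
```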
37 |
38 | This is what the general representation looks like below.
39 |
40 |
41 |
42 | ## EC2 DNS HOSTNAME
43 | It's important to know that when you create an `EC2 Instance`, in the end it's a `Server` that someone has to connect to, either a human being or an application.
44 | But how do we communicate with Servers? We can't rely on their IP addresses: they are not easy to remember, plus an IP Address can change. Instead, we give the Server a `Domain Name` that is easy to remember.
45 |
46 | AWS also gives a DNS Hostname to the `EC2 instances` that you launch; if you have an EC2 instance that you want someone from the internet to reach, it will have a Public DNS hostname,
47 | while all instances within your VPC have a Private DNS Hostname.
48 |
49 | Let's see what the DNS Hostnames would look like for the image we have above:
50 | - Private DNS Name: `ip-10-0-0-2.us-west-2.compute.internal`.
51 | - Public DNS Name: `ec2-43-39-19-2.us-west-2.compute.amazonaws.com`.
52 | So essentially, these hostnames are what you use to connect to the instance; behind the scenes, the `DNS Hostnames` are resolved back to their corresponding `IP addresses`. You don't have to worry about this, AWS handles it.
53 |
54 | See more [HERE](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/understanding-ec2-instance-hostnames-domains.html) regarding the EC2 DNS hostname
55 |
56 |
57 |
58 |
59 |
60 |
61 |
--------------------------------------------------------------------------------
/02_SQL/03-basic-querying.md:
--------------------------------------------------------------------------------
1 | ## Selecting Specific Columns
2 |
3 | ```sql
4 | -- Select single column
5 | SELECT product_name FROM products;
6 |
7 | -- Select multiple columns
8 | SELECT id, name, price FROM products;
9 |
10 | -- Column aliases
11 | SELECT
12 | username AS "User Name",
13 | signup_date AS "Member Since"
14 | FROM users;
15 | ```
16 |
17 | **Best Practices**:
18 | - Avoid `SELECT *` in production code
19 | - Use AS for calculated columns or clearer naming
20 |
21 | ## Using DISTINCT for Unique Values
22 |
23 | ```sql
24 | -- Basic distinct
25 | SELECT DISTINCT department -- removes duplicates
26 | FROM employees;
27 |
28 | -- Multi-column distinct
29 | SELECT DISTINCT city, state
30 | FROM customers;
31 |
32 | -- using DISTINCT with counts (advanced)
33 | SELECT
34 | COUNT(*) AS total_rows,
35 | COUNT(DISTINCT product_id) AS unique_products
36 | FROM orders;
37 | ```
38 |
39 | *Note: DISTINCT applies to the entire row, not just the first column*
40 |
41 | ## Comparison Operators
42 |
43 | | Operator | Meaning | Example Usage |
44 | |----------|--------------------------|------------------------------------|
45 | | `=` | Equal | `WHERE age = 25` |
46 | | `<>` | Not equal | `WHERE status <> 'inactive'` |
47 | | `>` | Greater than | `WHERE price > 100` |
48 | | `<` | Less than | `WHERE created_at < '2025-01-01'` |
49 | | `>=` | Greater or equal | `WHERE quantity >= 10` |
50 | | `<=` | Less or equal | `WHERE rating <= 5` |
51 |
52 | ```sql
53 | -- Combined example
54 | SELECT order_id
55 | FROM orders
56 | WHERE total_amount > 500
57 | AND order_date <= CURRENT_DATE - INTERVAL '30 days';
58 | ```
59 |
60 | ## Logical Operators
61 |
62 | - `LIKE:` This allows you to perform operations similar to using WHERE and =, but for cases when you might not know exactly what you are looking for.
63 |
64 | - `IN:` This allows you to perform operations similar to using WHERE and =, but for more than one condition.
65 |
66 | - `NOT:` This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition.
67 |
68 | - `AND & BETWEEN:` These allow you to combine operations where all combined conditions must be true.
69 |
70 | - `OR:` This allows you to combine operations where at least one of the combined conditions must be true.
71 |
72 | ### AND (All conditions must be true)
73 | ```sql
74 | SELECT * FROM inventory
75 | WHERE warehouse = 'NYC'
76 | AND quantity > 50
77 | AND last_restock > '2025-04-01';
78 |
79 | -- AND with BETWEEN
80 | SELECT *
81 | FROM orders
82 | WHERE order_date BETWEEN '2016-01-01' AND '2017-01-01';
83 | ```
84 |
85 | ### OR (Any condition can be true)
86 | ```sql
87 | SELECT name FROM employees
88 | WHERE department = 'Sales'
89 | OR department = 'Marketing';
90 | ```
91 |
92 | ### NOT (Inverse condition)
93 | ```sql
94 | SELECT * FROM customers
95 | WHERE NOT (status = 'VIP' OR total_purchases > 1000);
96 | ```
97 |
98 | ## Special Condition Operators
99 |
100 | ### BETWEEN (Range inclusion)
101 | ```sql
102 | -- Numeric range
103 | SELECT * FROM products
104 | WHERE price BETWEEN 50 AND 100;
105 |
106 | -- Date range
107 | SELECT * FROM events
108 | WHERE event_date BETWEEN '2025-06-01' AND '2025-06-30';
109 | ```
110 |
111 | ### IN (Value list matching)
112 | ```sql
113 | -- Literal values
114 | SELECT * FROM countries
115 | WHERE region IN ('Europe', 'Asia', 'Africa');
116 |
117 | -- Subquery
118 | SELECT name FROM products
119 | WHERE category_id IN (
120 | SELECT id FROM categories
121 | WHERE type = 'Electronics'
122 | );
123 | ```
124 |
125 | ### LIKE (Pattern matching)
126 | ```sql
127 | -- % = any sequence, _ = single character
128 | SELECT * FROM contacts
129 | WHERE phone LIKE '234-___-____'; -- NG area code
130 |
131 | SELECT name
132 | FROM accounts
133 | WHERE name LIKE 'C%'; --- returns all the names that start with the letter 'C'
134 |
135 | ```
136 |
137 | Common patterns:
138 | - `'abc%'` - Starts with "abc"
139 | - `'%xyz'` - Ends with "xyz"
140 | - `'%123%'` - Contains "123"
141 | - `'_b%'` - Second character is "b"
142 |
143 | ## NULL Handling
144 |
145 | ```sql
146 | -- Find missing data
147 | SELECT student_id FROM grades
148 | WHERE test_score IS NULL;
149 |
150 | -- Exclude nulls
151 | SELECT * FROM employees
152 | WHERE manager_id IS NOT NULL;
153 | ```
154 |
155 | **Critical Behaviors**:
156 | - `NULL = NULL` → NULL (not TRUE)
157 | - `NULL AND TRUE` → NULL
158 | - Use `COALESCE()` for default values:
159 | ```sql
160 | SELECT name, COALESCE(bio, 'No biography') FROM authors;
161 | ```
162 |
--------------------------------------------------------------------------------
/09_aws_cloud/02-Identity-And-Access-Management(IAM)/02-iam-role.md:
--------------------------------------------------------------------------------
1 | # IAM ROLE DEEP DIVE
2 |
3 | ### WHAT IS IAM ROLE
4 | - An IAM role is an IAM identity that you can create in your account that has specific permissions.
5 | - An IAM role is similar to an IAM user, in that it is an AWS identity with permission policies that determine what the identity can and cannot do in AWS.
6 | - However, instead of being uniquely associated with one person, a role is intended to be assumable by anyone who needs it.
7 | - Also, a role does not have standard long-term credentials such as a password or access keys associated with it.
8 | - You can use IAM Role to delegate access to users, AWS services or applications that don't normally have access to your AWS resources.
9 |
10 | ### A REAL WORLD MAPPING
11 | Let's map IAM Role to a real world scenario to better understand it ...
12 | - In many countries around the world, we know there is an office of the president or prime minister.
13 | - This office has permissions and privileges attached to it.
14 | - Different people assume this office over the years; who the individual is does not matter.
15 | - Once a person wins the election and is elected, that person assumes the office of the President.
16 | - Automatically, this person gets all the permissions and privileges attached to that office.
17 | - It is the same with an IAM Role on AWS:
18 | - It can be assumed by anyone, including a human, an Application or an AWS Service.
19 |
20 | ### IAM ROLE SCENARIO WITHIN AWS EXAMPLE
21 | - Amazon Redshift is a Data Warehouse which allows us to store and access our data.
22 | - Amazon Redshift is an AWS service that allows us to create a Redshift Cluster resource within the service; this cluster acts as a Data Warehouse. We will cover this service in depth in subsequent topics.
23 | - Amazon S3 is an AWS service that offers object/file-based storage for our data.
24 | - Basically, you can have your data in a CSV or any file format you want and store it in S3; we will cover this later as well.
25 | - An Amazon Redshift cluster has 2 operations it can perform as part of its other operations:
26 | - `COPY` - Redshift can copy objects from S3 and save them as a table in the Redshift Cluster; this allows Data Analysts, for example, to run SQL queries against that table for analytics.
27 | - `UNLOAD` - Redshift can unload a Redshift table into S3.
28 | - Before any of the operations mentioned above are possible:
29 | - The Amazon Redshift Cluster needs permission to be able to `COPY` the object; this action is a `getObject` action.
30 | - Also, the Amazon Redshift Cluster needs permission to `UNLOAD` the data from the Redshift table into S3; this action is a `putObject` action. In the end, the Redshift table is converted into a file before it can be put into S3.
31 | - Below is the illustration
32 |
33 |
34 |
35 | **IMAGE SUMMARY**
36 | - First and foremost, the Amazon Redshift service and Amazon S3 need to talk to each other in order to perform operations together.
37 | - By default, Redshift doesn't have access to get objects from, nor is it allowed to put any objects into, Amazon S3.
38 | - To allow Redshift to achieve both operations, we need to do the following:
39 | - Create an IAM Role
40 | - We will specify Amazon Redshift in the role's trust policy, meaning we trust that service to assume the role.
41 | - Below you can see we said the `Principal` we want to `Allow` to `AssumeRole` is `redshift.amazonaws.com`.
42 |
43 | ```json
44 | {
45 | "Version": "2012-10-17",
46 | "Statement": [
47 | {
48 | "Effect": "Allow",
49 | "Principal": {
50 | "Service": [
51 | "redshift.amazonaws.com"
52 | ]
53 | },
54 | "Action": "sts:AssumeRole"
55 | }
56 | ]
57 | }
58 | ```
59 |
60 | - Next, we create an IAM Policy; this policy will contain a statement saying:
61 | - Allow `getObject` and `putObject` action on the specific s3 bucket
62 | - `getObject` means retrieve a file from an s3 bucket
63 | - `putObject` means put a file into an s3 bucket
64 | - Next is to attach the IAM Policy created to the IAM Role
65 | - This IAM Role can now be used or assumed by Amazon Redshift whenever it needs to do a COPY operation or UNLOAD operation.
66 | - Example of a COPY operation on a CSV file [HERE](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#load-from-csv)
67 | - Example of an UNLOAD operation on a Redshift table [HERE](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD_command_examples.html)
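
If you prefer to script these steps rather than click through the console, the sketch below shows the same flow with `boto3` (the AWS SDK for Python). It is a minimal illustration, not the course's official script; the role name, policy name, and bucket name are hypothetical placeholders, and it assumes your AWS credentials are already configured.

```python
import json

import boto3

iam = boto3.client("iam")

# Trust policy: who is allowed to assume the role (Amazon Redshift, as shown above)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permission policy: what the role can do (getObject for COPY, putObject for UNLOAD)
permission_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::my-example-bucket/*",  # hypothetical bucket
    }],
}

# 1. Create the IAM Role with the trust policy
iam.create_role(
    RoleName="redshift-s3-access-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# 2. Create the IAM Policy and attach it to the role
policy = iam.create_policy(
    PolicyName="redshift-s3-get-put",  # hypothetical policy name
    PolicyDocument=json.dumps(permission_policy),
)
iam.attach_role_policy(
    RoleName="redshift-s3-access-role",
    PolicyArn=policy["Policy"]["Arn"],
)
```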
68 |
--------------------------------------------------------------------------------
/12_apache_kafka/03-Kafka-Cluster.md:
--------------------------------------------------------------------------------
1 | # Kafka Cluster
2 |
3 | A **Kafka cluster** is a network of one or more broker nodes that work together to provide fault-tolerant, high-throughput messaging.
4 |
5 | ## Key Components
6 |
7 | - **Brokers**
8 | - Handle read/write of messages
9 | - Store partitions on disk
10 | - **ZooKeeper (legacy)**
11 | - Coordinates cluster membership, leader election
12 | - **Controller**
13 | - One broker elected to manage partition leaders
14 | - **Confluent Control Center** (optional)
15 | - GUI for monitoring, managing topics, ACLs, and connectors
16 |
17 | ## Cluster Deployment Models
18 |
19 | 1. **Self-Managed On-Prem**
20 | 2. **Self-Managed Cloud (IaaS)**
21 | 3. **Confluent Cloud (Managed)**
22 |
23 | ## High-Availability & Scaling
24 |
25 | - **Replicas & ISR**
26 | - **Automatic Leader Rebalancing**
27 | - **Rack-Aware Placement**
28 |
29 | Don't worry if all of this sounds complex; each piece is broken down below.
30 |
31 | ## Kafka Cluster: The Core Building Block
32 | A Kafka cluster is a group of machines working together to handle large-scale, real-time data streams. It forms the core infrastructure behind Kafka’s ability to store, scale, and stream data reliably.
33 |
34 | But before we dive into the components, here’s a simple analogy:
35 |
36 | * Think of a Kafka cluster as a postal system:
37 | * You have post offices (brokers),
38 | * mailboxes (topics),
39 | * letters (messages),
40 | * and senders/receivers (producers/consumers).
41 |
42 | Let’s further break down the pieces one by one.
43 |
44 |
45 | ## What Is a Kafka Cluster?
46 | A Kafka cluster is made up of multiple Kafka brokers (servers), working together to:
47 |
48 | * Store incoming messages (events)
49 | * Distribute the load
50 | * Provide fault tolerance
51 | * Allow consumers to read messages efficiently
52 |
53 | ## Core Components of a Kafka Cluster
54 |
1. **Broker**
55 |
A broker is a Kafka server that:
56 |
57 | * Stores message data
58 |
59 | * Receives messages from producers
60 |
61 | * Serves messages to consumers
62 |
63 | You can think of a broker as a message warehouse.
64 | **Brokers are scalable**: You can add more brokers to handle more data.
65 |
66 |
2. **Cluster**
67 |
When you have multiple brokers, they form a Kafka cluster.
68 |
69 | * One broker is elected as the Controller (it manages metadata and broker coordination).
70 | * Other brokers do the heavy lifting of storing and delivering messages.
71 |
72 |
3. **Topic**
73 |
A topic is a named channel to which data is sent. Producers write messages to a topic, and consumers read from it.
74 |
75 | Kafka topics are partitioned across brokers, which helps with:
76 |
77 | * Load balancing
78 | * Parallel processing
79 | * High throughput
80 |
81 | We’ll cover partitions in detail soon.
82 |
83 |
4. **Zookeeper** (Legacy but still in use in many setups)
84 |
Kafka originally relied on Apache ZooKeeper to manage:
85 |
86 | * Broker metadata
87 |
88 | * Cluster coordination
89 |
90 | * Leader election (which broker leads a partition)
91 |
92 |
*Note:* Confluent and Apache Kafka are moving toward a KRaft mode, which removes the need for ZooKeeper and makes Kafka self-managed. But ZooKeeper is still used in many current deployments.
93 |
94 |
## Cluster Workflow: High-Level View
95 | Here’s how data flows in a Kafka cluster:
96 |
97 |
98 | `[ Producer ] ---> [ Kafka Broker ] ---> [ Topic (with Partitions) ] ---> [ Consumer ]`
99 |
Behind the scenes:
100 |
101 | * The producer sends a message to Topic A.
102 | * Kafka decides which partition of Topic A the message should go to.
103 | * That partition is stored on a broker.
104 | * Consumers subscribe to Topic A and read from the partition.
105 |
106 |
## Fault Tolerance and Replication
107 | Kafka handles failures gracefully through replication:
108 |
109 | * Each partition can have multiple replicas (copies).
110 | * One replica is the leader — producers/consumers talk to it.
111 | * Others are followers, ready to take over if the leader fails.
112 |
113 | This means even if a broker crashes, no data is lost and processing continues.
114 |
115 |
## Summary: Why the Kafka Cluster Matters
116 |
117 | | Kafka Cluster Feature | What It Enables |
118 | |------------------------|--------------------------------------------|
119 | | Multiple brokers | Horizontal scalability and fault tolerance |
120 | | Topic partitions | Load distribution and parallelism |
121 | | Leader election | Automatic failover and high availability |
122 | | Replication | Durable data even during node failure |
123 | | ZooKeeper (or KRaft) | Coordination of brokers and metadata |
124 |
--------------------------------------------------------------------------------
/10_terraform/01-Terraform-Overview.md:
--------------------------------------------------------------------------------
1 | # Brief Introduction to IAC
2 |
3 | Infrastructure as Code (IaC) is a method of managing and provisioning infrastructure by using code instead of relying on manual processes. This approach involves defining your infrastructure in configuration files, which makes it easier to adjust and share settings, while also ensuring that each environment you set up is identical to the previous one. By documenting these configurations in code, IaC also helps prevent the unintentional changes that can occur with manual setups.
4 |
5 | An important aspect of IaC is version control, where these configuration files are managed just like any other software code. This practice allows you to break down your infrastructure into reusable, modular parts that can be combined and automated in various ways.
6 |
7 | By automating infrastructure tasks through IaC, developers no longer need to manually set up servers, operating systems, or other infrastructure components whenever they work on new applications or updates.
8 |
9 | In the past, setting up infrastructure was a labor-intensive and expensive manual task. With the advent of virtualization, containers, and cloud technologies, the management of infrastructure has shifted away from physical hardware in data centers. While this transition offers many benefits, it also introduces new challenges, such as the need to handle an increasing number of infrastructure components and the frequent scaling of resources. Without IaC, managing today’s complex infrastructure can be quite challenging.
10 |
11 | IaC helps organizations effectively manage their infrastructure by enhancing consistency, reducing errors, and eliminating the need for repetitive manual configurations.
12 |
13 | The key advantages of IaC include:
14 | - Lower costs
15 | - Faster deployment processes
16 | - Reduced chances of errors
17 | - Greater consistency in infrastructure setup
18 | - Prevention of configuration drift
19 |
20 |
21 |
22 | ## A REAL LIFE SCENARIO
23 | Before we start Terraform, let's consider a real-life example and we map that to Terraform afterward, so we understand what Terraform is doing.
24 | - Let's assume you want to build a 3-bedroom apartment, you will need a lot of resources like bricks, water, sand, wood, roofing materials, etc. These resources can be interchangeably called infrastructures.
25 | - Ideally, you will call a contractor to start the construction from scratch. They start to combine all the above resources to make the floor, create the rooms, get to the roofing stage, and finally, the whole apartment will be ready for use.
26 | - If you need to maintain the apartment, maybe change something, you simply call the contractor and they change what they need to change.
27 | - This is exactly what happens in the Terraform world
28 | - Let's assume Terraform is the Contractor in this case, Terraform will ask the owner of the apartment to represent what he or she wants in a configuration file. You can see the configuration as the building’s schematic/plan.
29 | - The owner in this case will specify, 3 rooms, the size, 2 toilets, 1 kitchen, etc.
30 | - This file will be submitted to Terraform.
31 | - Terraform will create everything as described in the configuration file.
32 | - If the owner has to change anything in the apartment, he/she returns to that building plan/schematics (configuration file) and makes the changes, let us assume a change from 2 toilets to 1.
33 | - Once the changes are completed, Terraform will again check the file and modify the apartment to match what is specified by the owner.
34 |
35 | ## WHAT IS TERRAFORM ?
36 |
37 | Terraform is an infrastructure as code (IaC) tool that lets anyone define what they want their cloud resources/infrastructure
38 | to look like in a human-readable Terraform configuration file with a `.tf` extension.
39 | Basically, you create a terraform file and define what you want your cloud infrastructure or resource to
40 | be, and Terraform takes care of the rest for you.
41 | So it is exactly what we described in the real-world scenario; you create a building blueprint/Terraform file, for example, `my_cloud_resources.tf`, specify what resources you want and how they should look, and then Terraform takes care of the rest.
42 |
43 | ## HOW DOES TERRAFORM WORK?
44 | Before we know how Terraform works, it is important to talk about a few things. Some of the major Cloud providers are Amazon Web Service(AWS), Microsoft Azure, and Google Cloud Platform (GCP). Assuming we would like to create a resource on AWS, how does Terraform carry out this operation?
45 | These are the things that will happen...
46 | - Firstly, you need to create a Terraform configuration file that ends with `.tf`
47 | - Secondly, in the `my_cloud_resources.tf` file, you will need to specify that the provider you want to create the resource on is `aws` in `us-east-1`. Please read on AWS region and Availability Zones [HERE](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/). This will look like this.
48 |
49 |
50 |
51 | **Visual Representation of Terraform’s Resource Provisioning**
52 |
53 |
54 |
55 |
56 |
57 |
--------------------------------------------------------------------------------
/12_apache_kafka/06-Producer-and Configurations.md:
--------------------------------------------------------------------------------
1 | # Kafka Producers and Configurations
2 | So far, we know that a Kafka topic is like a notebook (or diary) where events are written, and partitions are like the pages that keep data organized.
3 |
4 | But who actually writes into this notebook?
5 |
6 | That is the `Producer`.
7 |
8 | In this module, we will be covering the following topics:
9 |
10 | - [What is a Producer?](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#what-is-a-producer)
11 | - [How a Producer works?](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#how-a-producer-works)
12 | - [Producer Configurations (Key Settings)](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#producer-configurations-key-settings)
13 | - [bootstrap.servers](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#1-bootstrapservers)
14 | - [key.serializer & value.serializer](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#2-keyserializer--valueserializer)
15 | - [acks(Acknowledgements)](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#3-acks-acknowledgments)
16 | - [retries & retry.backoff.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#4-retries--retrybackoffms)
17 | - [linger.ms & batch size](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#5-lingerms--batchsize)
18 | - [partitioner.class (Optional)](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#6-partitionerclass-optional)
19 | - [Quick Analogy](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/06-Producer-and%20Configurations.md#quick-analogy)
20 |
21 |
22 |
23 | ## What is a Producer?
24 |
25 | A producer is any application that sends (produces) messages to a Kafka topic.
26 |
27 | * Producers decide which topic to send data to.
28 | * They can also decide which partition in the topic gets the message.
29 | * Producers are stateless — they don’t keep old data; they just send new events into Kafka.
30 |
31 | Example:
32 |
33 | * A mobile app sending user clicks to Kafka.
34 | * A payment service sending transaction logs.
35 | * An IoT device sending temperature readings.
36 |
37 | ## How a Producer Works
38 |
39 | When a producer sends a message:
40 |
41 | * It connects to a Kafka broker (server).
42 | * Kafka assigns the message to a topic partition.
43 | * The message is stored with a unique offset.
44 |
45 | ## Producer Configurations (Key Settings)
46 |
47 | Producers in Kafka are very configurable. Here are the most important beginner-friendly ones:
48 |
49 | ### 1. `bootstrap.servers`
50 |
51 | * The address of the Kafka brokers (default address/port is localhost:9092).
52 | * This is how the producer knows where to send data.
53 |
54 | ### 2. `key.serializer & value.serializer`
55 |
56 | * Kafka messages have a key and a value.
57 | * Both need to be converted into bytes before sending.
58 | * Serializers handle this conversion.
59 |
60 | Some Common options:
61 |
62 | * StringSerializer (for text data)
63 | * IntegerSerializer (for numbers)
64 | * Avro / JSON Serializer (for structured data)
65 |
66 | ### 3. `acks (Acknowledgments)`
67 |
68 | This controls how "safe" message delivery is:
69 |
70 | * acks=0 → Fire-and-forget (producer doesn’t wait for confirmation). Fast, but risky.
71 | * acks=1 → Waits for leader partition to confirm. Balanced.
72 | * acks=all → Waits for all replicas to confirm. Safest, but slower.
73 |
74 | ### 4. `retries & retry.backoff.ms`
75 |
76 | * If a message fails to send, how many times should Kafka retry?
77 | * This also helps to handle temporary network issues.
78 |
79 | ### 5. `linger.ms & batch.size`
80 |
81 | * Producers can group messages into batches before sending.
82 | * linger.ms waits a little before sending to allow batching.
83 | * batch.size sets the maximum batch size.
84 | * Batching improves throughput (more efficient) but may add small delays.
85 |
86 | ### 6. `partitioner.class (Optional)`
87 |
88 | * Decides which partition a message goes to.
89 | * Default: Kafka uses the message key’s hash to pick a partition.
90 | * If no key is given, Kafka distributes messages in a round-robin fashion.
91 |
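Putting these settings together, here is a minimal `kafka-python` sketch (a hedged illustration, not the course's official example). It assumes a local broker on `localhost:9092` and a hypothetical topic called `user-clicks`; install the client with `pip install kafka-python`.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # 1. where the brokers are
    key_serializer=str.encode,                                 # 2. convert the key to bytes
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # 2. convert the value to bytes
    acks="all",                                                # 3. wait for all in-sync replicas (safest)
    retries=3,                                                 # 4. retry failed sends
    retry_backoff_ms=100,                                      # 4. wait 100 ms between retries
    linger_ms=5,                                               # 5. wait up to 5 ms so messages can batch
    batch_size=16384,                                          # 5. maximum batch size in bytes
)

# The key ("user-42") is hashed to pick the partition; the same key always
# lands on the same partition (the default partitioner behaviour).
producer.send("user-clicks", key="user-42", value={"page": "/home", "action": "click"})
producer.flush()
producer.close()
```
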
92 | ## Quick Analogy
93 |
94 | Think of a producer like a post office clerk:
95 |
96 | * You (the producer) take letters (messages).
97 | * Each letter has a recipient (topic and partition).
98 | * Before sending, you pack the letter in an envelope (serialization).
99 | * You decide how careful you want to be:
100 | * Drop it in the bin (acks=0)
101 | * Wait for a receipt from the recipient (acks=1)
102 | * Wait until all family members in the house confirm they got it (acks=all).
103 |
104 |
105 |
106 |
107 |
108 |
109 |
110 |
111 |
112 |
113 |
114 |
115 |
116 |
117 |
118 |
119 |
120 |
121 |
--------------------------------------------------------------------------------
/02_SQL/04-DDL-DML-commands.md:
--------------------------------------------------------------------------------
1 | DDL (Data Definition Language) and DML (Data Manipulation Language) are two of the main categories of SQL commands.
2 |
3 | ## DDL (Data Definition Language)
4 | Purpose: DDL commands are used to create, modify, and delete the structure of your database objects. They deal with the schema of your database.
5 |
6 | Key Characteristics:
7 | - They affect the structure, not the data within the structure.
8 | - They are auto-committed, meaning changes are immediately and permanently saved to the database; you cannot roll them back.
9 |
10 | #### Common DDL Commands:
11 | ##### CREATE: Used to create new database objects.
12 | - CREATE DATABASE: Creates a new database.
13 | - CREATE TABLE: Creates a new table within a database.
14 | - CREATE INDEX: Creates an index on a table to improve query performance.
15 | - CREATE VIEW: Creates a virtual table based on the result set of a SQL query.
16 | - CREATE PROCEDURE/FUNCTION: Creates stored procedures or functions.
17 |
18 | #### CREATE TABLE statement
19 | ```sql
20 | CREATE TABLE orders ( -- create orders table
21 | id integer,
22 | account_id integer,
23 | occurred_at timestamp,
24 | standard_qty integer,
25 | gloss_qty integer,
26 | poster_qty integer,
27 | total integer
28 | );
29 |
30 | CREATE TABLE employees ( -- create employees table
31 | emp_id INT PRIMARY KEY,
32 | name VARCHAR(100) NOT NULL,
33 | dept_id INT,
34 | salary DECIMAL(10,2) CHECK (salary > 0),
35 | hire_date DATE DEFAULT CURRENT_DATE
36 | );
37 | ```
38 |
39 | ### Constraints Deep Dive
40 | | Constraint | Purpose | Example |
41 | |------------------|----------------------------------|----------------------------------|
42 | | `PRIMARY KEY` | Uniquely identifies rows | `id INT PRIMARY KEY` |
43 | | `FOREIGN KEY` | Enforces relational integrity | `dept_id INT REFERENCES depts` |
44 | | `UNIQUE` | Allows only distinct values | `email VARCHAR(255) UNIQUE` |
45 | | `CHECK` | Custom validation rules | `age INT CHECK (age >= 18)` |
46 | | `DEFAULT` | Automatic value when unspecified | `created_at TIMESTAMP DEFAULT NOW()` |
47 |
48 | #### ALTER: Used to modify the structure of existing database objects.
49 | - ALTER TABLE table_name ADD column_name: Adds a new column to a table.
50 | - ALTER TABLE table_name DROP column_name: Deletes a column from a table.
51 | - ALTER TABLE table_name MODIFY column_name: Changes the data type or constraints of an existing column.
52 |
53 | ##### ALTER TABLE examples
54 | ```sql
55 | -- Add/drop columns
56 | ALTER TABLE Customers ADD Email VARCHAR(100);
57 | ALTER TABLE users DROP COLUMN legacy_password;
58 |
59 | -- Change data types (caution: data conversion issues)
60 | ALTER TABLE transactions ALTER COLUMN amount TYPE NUMERIC(12,2);
61 | ```
62 | #### DROP: Used to delete existing database objects.
63 | - DROP DATABASE: Deletes an entire database.
64 | - DROP TABLE: Deletes an entire table (structure and all its data).
65 | - DROP INDEX: Deletes an index.
66 |
67 | ##### drop table examples
68 | ```sql
69 | -- Full removal (irreversible)
70 | DROP TABLE obsolete_data CASCADE;
71 | ```
72 |
73 | #### TRUNCATE: Used to remove all records from a table, but it keeps the table structure.
74 | - It's faster and uses less undo space than DELETE
75 | - It is preferred for removing all rows because it's a DDL operation (not DML) and logs less.
76 | - It cannot be rolled back.
77 |
78 | ```sql
79 | truncate table table_name;
80 |
81 | truncate table orders;
82 |
83 | ```
84 |
85 |
86 |
87 | ## DML (Data Manipulation Language)
88 | Purpose: DML commands are used to manage and manipulate data within the database objects. They interact with the actual records.
89 |
90 | #### Key Characteristics:
91 |
92 | - They affect the data stored in the tables.
93 | - They are not auto-committed and can be rolled back (undone) if necessary, often within a transaction.
94 |
95 | #### Common DML Commands:
96 |
97 | - `SELECT:` Used to retrieve data from one or more tables. This is the most frequently used DML command.
98 | - `INSERT:` Used to add new rows (records) of data into a table.
99 | - `UPDATE:` Used to modify existing data within a table.
100 | - `DELETE:` Used to remove one or more rows (records) from a table.
101 |
102 | ##### INSERT Variants: ways to insert values into a table
103 | ```sql
104 | -- insert into orders table
105 | INSERT INTO orders VALUES (1,1001,'2015-10-06 17:31:14',123,22,24,169);
106 | INSERT INTO orders VALUES (2,1001,'2015-11-05 03:34:33',190,41,57,288);
107 | INSERT INTO orders VALUES (3,1001,'2015-12-04 04:21:55',85,47,0,132);
108 | INSERT INTO orders VALUES (4,1001,'2016-01-02 01:18:24',144,32,0,176);
109 |
110 | -- insert into customers table
111 | INSERT INTO Customers (CustomerID, FirstName, LastName, Email)
112 | VALUES (1, 'John', 'Doe', 'john.doe@example.com');
113 |
114 | -- CTAS (Create Table As Select)
115 | CREATE TABLE high_value_customers AS
116 | SELECT * FROM customers WHERE lifetime_spend > 10000;
117 |
118 | -- IIAS (Insert Into As Select)
119 | -- used for bulk inserts from a query
120 | INSERT INTO order_archive
121 | SELECT * FROM orders WHERE status = 'completed';
122 | ```
123 |
124 | ##### UPDATE
125 | ```sql
126 | UPDATE Customers
127 | SET Email = 'johndoe@example.com'
128 | WHERE CustomerID = 1;
129 |
130 | ```
131 |
132 | ##### DELETE
133 |
134 | ```sql
135 | DELETE FROM Customers
136 | WHERE CustomerID = 1;
137 | ```
138 |
139 |
140 |
141 |
--------------------------------------------------------------------------------
/06_docker/06-Docker-Network.md:
--------------------------------------------------------------------------------
1 | # Docker Networking
2 | ___
3 |
4 | ## Contents
5 | - [Introduction](#introduction)
6 | - [Types of Networks in docker](#types-of-networks-in-docker)
7 | - [Bridge](#bridge-network)
8 | - [Host](#host-network)
9 | - [Custom](#custom-network)
10 | - [Illustration](#illustration)
11 | - [Summary](#summary)
12 |
13 | ## Introduction
14 |
15 | Networking, in simple terms, is the process of building relationships and connections with others.
16 |
17 | Let's relate this concept to processes (containers) in Docker.
18 | Docker container networking is the means by which Docker containers connect to and communicate with each other or with the host they are running on.
19 |
20 | When a container is created, a network interface with an IP address is assigned to it. This network interface is what sends traffic to, and receives traffic from, the network interfaces of other containers or non-container hosts.
21 |
22 | _In layman's terms, see the network interface as the maid in a __coloured uniform__ (IP Address)_.
23 |
24 | ___
25 |
26 | ___
27 | ## Types of Networks in Docker
28 |
29 | Docker comes with 3 different networks which can be confirmed by running the command below:
30 | ```bash
31 | docker network ls # List networks
32 | ```
33 |
34 |
35 |
36 | ### Network drivers
37 | | Driver | Description|
38 | | --- | --- |
39 | bridge | The default network driver.
40 | host | Remove network isolation between the container and the Docker host.
41 | none |Completely isolate a container from the host and other containers.
42 | overlay | Overlay networks connect multiple Docker daemons together.
43 | ipvlan | IPvlan networks provide full control over both IPv4 and IPv6 addressing.
44 | macvlan | Assign a MAC address to a container.
45 | [source](https://docs.docker.com/engine/network/drivers/)
46 |
47 |
48 | ### Bridge Network
49 | The bridge network is a type of network that allows communication to occur between the host's network (on a more granular level, its IP address) and the container(s)' network. This is also the default network that Docker attaches new containers to when they are not assigned any network.
50 |
51 | For instance, if the host is on a network with a __16.0.0.7__ IP address while the containers are on a network with a __19.0.0.10__ IP address, communication would never occur as long as the "uniform" (IP range) is different and there is nothing connecting them to _bridge_ the communication gap.
52 |
53 | Due to this isolation, Docker creates a virtual network type called __bridge__ which is the medium connecting the host/server to the container.
54 |
55 | ### Host Network
56 | Contrary to the concept of isolation between the host and Docker containers, the Docker host network, just as the name implies, maps the containers to whatever network the host/server is on; hence the containers are on the same IP range as the host.
57 |
58 | For example: a server running on a __16.0.0.2__ IP will have its Docker containers on the same __16.0.0.x__ network range if the host network is assigned to the container upon creation. See the code below:
59 |
60 | ```bash
61 | docker run -d --name container_using_hostnetwork --network host myimage
62 | ```
63 |
64 | However, keep in mind that this is bad practice from a security standpoint, because the containers share the host's network namespace, which increases risk exposure.
65 |
66 | ### _Custom_ Network
67 | A custom network is a user-defined network in Docker. It allows better control over container communication. Containers on the same custom network can easily communicate with each other, while those on different networks are isolated by default — which is useful for segmenting sensitive services like databases from the rest of the application stack.
68 |
69 | To create a custom network:
70 | ```bash
71 | docker network create my-custom-network
72 | ```
73 |
74 | Map a new container to the custom network created:
75 |
76 | ```bash
77 | docker run -d --name container_using_customnetwork --network my-custom-network myimage
78 | ```
79 |
80 | Detach an existing container from its network:
81 | ```bash
82 | docker network disconnect my-custom-network container_using_customnetwork
83 | ```
84 | and attach it to another network (here, the default bridge network):
85 | ```bash
86 | docker network connect bridge container_using_customnetwork
87 | ```
92 |
93 | ### Illustration
94 | Let's assume we have a server/virtual machine hosting 3 containers:
95 | One container is a database of sensitive information that needs to be tightly secured, while the other containers do not need the same level of security.
96 | 
97 |
98 |
99 | ### Summary
100 | - Networking is the means of connection and communication between containers.
101 | - Understanding the concept of networking is crucial in knowing how to isolate containers from each other.
102 | - Unless explicitly defined, all containers will be mapped to the bridge network by default.
103 |
--------------------------------------------------------------------------------
/09_aws_cloud/08-Amazon-Redshift/00-Redshift-Cluster-Overview.md:
--------------------------------------------------------------------------------
1 | # OVERVIEW
2 | `Amazon Redshift` is a fully managed, `petabyte-scale` data warehouse service in the cloud.
3 | But let's break this down a bit:
4 | - `Fully Managed` means it's managed by AWS.
5 | - You don't need to build it yourself; basically, with a few clicks you have a functioning `Data Warehousing
6 | System`.
7 |
8 | In addition to the above, Amazon Redshift is a `Distributed System`; this means it leverages the concept of a `Master and Slave Architecture`, where the `Master` delegates what needs to be done to the `Slaves`. It's also worth knowing that Redshift leverages a `Columnar Storage` architecture to store its data.
9 |
10 | Before anything, note that Amazon Redshift is an AWS service, and a Cluster is a resource you create within that service that lets you build your Data Warehouse. To better understand Amazon Redshift, we need to understand what a `Redshift Cluster`, a `Distributed System` and `Columnar Storage` mean; these concepts need to be clear.
11 | We are not just here to write SQL against this cluster; we need to properly
12 | deep dive into what the system is, so as to enable us to work efficiently with it.
13 |
14 | ## WHAT IS A REDSHIFT CLUSTER
15 | `Redshift Cluster`: a collection of 2 or more computers, called `Nodes`, networked together to act as a single system. These Nodes act as a unified system that processes our queries and stores the data within the cluster. The Cluster is designed to allow the `Nodes` to communicate with one another. The image below shows a Redshift cluster with 3 Nodes connected to each other.
16 |
17 |
18 |
19 | ## DISTRIBUTED SYSTEM CONCEPT
20 | `Distributed System`: These are systems that leverage a Master and Slave architecture. Redshift is a distributed system, and the image above shows what a typical Redshift Cluster looks like.
21 | - We can see we have 3 Nodes.
22 | - But typically in a distributed system, there is a `Master Node`, which we call a `LEADER NODE` in the context of a Redshift cluster.
23 | - The Leader Node coordinates the tasks and assigns them to the Slaves.
24 | - The remaining 2 Nodes are the `Slaves`, which we call `Compute Nodes` in the context of a Redshift Cluster.
25 | - These are the guys that do the work delegated to them by the `Leader Node`.
26 | The visual representation looks like the one below.
27 |
28 |
29 |
30 | ## WHAT DO WE MEAN BY COLUMNAR STORAGE
31 | `Columnar storage`: This is a database architecture that organizes and stores data by column rather than by row.
32 | - This method stores all values of a single column together in a `Block` on Disk.
33 | - `Block`: This is where data is stored on `Disk` in a Database; `Disk` is just storage.
34 | - When you read or write data to a database table, you are actually reading from or writing to those blocks.
35 | - This method is very efficient for analytical queries and data warehousing applications that only need to access a subset of a table's columns.
36 |
37 | Let us illustrate how a table is stored on Disk in a `Row Based` and a `Columnar Based` Database. Let's assume
38 | we have a `Customer Table` with just the 3 rows below.
39 |
40 | | Name    | Age     | Location  |
41 | |---------|---------|-----------|
42 | | John    | 20      | Berlin    |
43 | | Terry   | 18      | Lagos     |
44 | | Cynthia | 25      | London    |
47 |
48 | 1. Let's represent how these 3 rows will be stored on a Postgres Database Disk. Note: Postgres is a Row-Oriented Database.
49 |
50 |
51 |
52 | **Image Summary**
53 | - We assume `Postgres` stores the 3 rows inside `3 blocks` on Disk.
54 | - Please be aware that this is just an example; a block can take more than one row depending on its fixed size. We are not here to discuss the details of Database blocks.
55 | - If you are interested in only the `Age Column` and only where `Age is 18`, the query will look like this: `SELECT Age FROM Customer WHERE Age = 18;`
56 | - As simple as this query is, it will scan all the blocks, because Postgres is not sure whether any age is 18 in the `first block` or not; it will check all the blocks, and that will impact the query time.
57 | - In this scenario, we say the frequency of `I/O (Input/Output)` is high for `Row Oriented` Databases when they run `analytical queries`, because they need to keep checking in and out of blocks.
58 |
59 | 2. Let's represent how these 3 rows will be stored on a Redshift Cluster Disk. Note: Redshift is a Column-Oriented Database.
60 |
61 |
62 |
63 | **Image Summary**
64 | - We assume `Redshift` stores the 3 rows inside `3 blocks` on Disk.
65 | - Again, this is just an example; the fixed block size in Redshift is 1 MB, so a block can take in more rows, but we are using this to explain the concept.
66 | - If you are interested in the same `Age` column and specifically where `Age is 18`, then because the data is stored by column, Redshift will check only Block 2; this reduces `I/O (Input/Output)` because Redshift only cares about that column.
67 |
68 | For more on Redshift Columnar Storage: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
69 |
70 |
71 |
72 |
73 |
74 |
75 |
76 |
77 |
78 |
--------------------------------------------------------------------------------
/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/01-VPC-Overview.md:
--------------------------------------------------------------------------------
1 | # AMAZON VIRTUAL PRIVATE CLOUD (VPC)
2 | This is a VPC guide that introduces you to what a VPC is and how it relates to the `IP Addressing` and `CIDR Range` concepts covered in the previous section. If you have not read that note yet, please see it [HERE](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/00-IP-Addressing.md); it's important to understand it because it's a prerequisite for `VPC`.
3 |
4 | ## WHAT IS VPC
5 | VPC stands for `Virtual Private Cloud`; see it as your own small Cloud environment with a defined [Private Network](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/00-IP-Addressing.md#:~:text=A%20CIDR%20Range%20is%20like%20a%20private%20Network%2C%20which%20is%20a%20range%20of%20IPs). This private network is defined as a `CIDR Range`, which is nothing but a list of `IPs`, which we already covered in our `IP Addressing` and `CIDR Range` section. You can provision resources like `Databases`, `Servers`, etc. into this Private Network.
6 |
7 | - This Private Network we call a `VPC` is a secure environment; you can control what can communicate with the resources that reside inside your VPC, like `Databases` and `Servers`, and also control where the resources that reside in your VPC can go.
8 |
9 | - The reality is this: when you are creating a `VPC` on AWS, you will be asked to specify the [CIDR Range](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/00-IP-Addressing.md#classless-inter-domain-routing-cidr) for that VPC; this simply means the list of `IP Addresses` you want to make up your Private Network.
10 |
11 | - Let's assume you are creating a VPC and you specify a CIDR Range of `10.8.0.0/16`.
12 | - This means you are going to have `2 ** (32-16) = 65,536` IP Addresses that will make up your entire Private Network (VPC).
13 | - If you want your friend to connect to a `Server` or `Database` inside that VPC from his home Network, you need to `whitelist` your friend's `IP Address` and take care of some other things, which we will cover in a later section, before he can connect to that `Server` or `Database`, because your friend's IP is not part of the `VPC CIDR Range`.
14 |
15 | - It's worth knowing that every `AWS Region` comes with a `default VPC` (created by AWS), but know that this VPC allows routing to the internet; this will become clear in subsequent sections, so ignore it for now.
16 | - The default VPC cannot be deleted.
17 | - A custom VPC can also be created by you; all configuration has to be set up by you, including the `CIDR Range` and some other resources, which will be covered in subsequent sections.
18 |
19 | The image below represents what a VPC looks like.
20 |
21 |
22 |
23 | Image Summary
24 | - A VPC created with `10.0.0.0/28` CIDR Range.
25 | - The `CIDR Range` has `16 IPs` that form the Private Network. Read how we got that from our [previous note](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/00-IP-Addressing.md#check-list-of-ip-addresses-in-cidr-range).
26 | - The first IP is `10.0.0.0` and the last IP is `10.0.0.15`.
27 | - John wants to communicate with the Network; let's say he wants to connect to a `Server` or `Database` that is inside this VPC. Unfortunately, it's not possible because his IP `52.12.19.145` is not part of the VPC CIDR Range.
28 | - We will cover how IPs that are not part of a VPC can be whitelisted, so relax 😁.
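
If you want to verify the numbers in the image summary yourself, the small sketch below uses Python's built-in `ipaddress` module (just an optional check, not something you need for the AWS console):

```python
import ipaddress

vpc_cidr = ipaddress.ip_network("10.0.0.0/28")

print(vpc_cidr.num_addresses)                            # 16 IPs in the VPC CIDR Range
print(vpc_cidr[0], vpc_cidr[-1])                         # first: 10.0.0.0, last: 10.0.0.15
print(ipaddress.ip_address("52.12.19.145") in vpc_cidr)  # False -> John's IP is not in the VPC
```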
29 |
30 | ## VPC DNS CONFIGURATIONS
31 | Before going into the `DNS` configurations for a VPC, let's understand the `DNS` concept itself.
32 | - When you access `www.google.com` on your laptop, you are essentially calling the `IP address` of the google.com application server behind the scenes. Ultimately, that application is running on a `Server`.
33 | - `www.google.com` is a human-readable domain name that humans find easier to remember compared to IP Addresses.
34 | - When you look for `www.google.com` on the internet, a `DNS Server` will help you translate `www.google.com` into the corresponding IP address of the actual `google.com` application running on the `Server`.
35 | - That process is called `DNS Resolution`, and the one doing the resolution is the `DNS Server`; AWS calls its DNS service `Amazon Route 53`. Read this [small piece](https://aws.amazon.com/route53/what-is-dns/#:~:text=DNS%2C%20or%20the,called%C2%A0queries.) to solidify your understanding.
36 |
37 | There are `2` DNS configurations that are important for every VPC:
38 | - `DNS Hostnames`: If this is enabled in a VPC, every Server launched in that VPC receives a `public DNS hostname` that corresponds to its `public IP Address`. This is exactly what we said regarding the DNS concept above.
39 | - `DNS Resolution`: If this is enabled, DNS resolution for private DNS hostnames is provided for the VPC by the Amazon DNS server.
40 |
41 | This is an overview of what a VPC is all about; we will dive into Subnets and Routing next. Please feel free to use the Documentation below for a deeper dive if you want.
42 |
43 | Documentation Reference
44 | - https://docs.aws.amazon.com/vpc/latest/userguide/configure-your-vpc.html
45 | - https://docs.aws.amazon.com/vpc/latest/userguide/vpc-ip-addressing.html
46 | - https://docs.aws.amazon.com/vpc/latest/userguide/vpc-cidr-blocks.html
47 | - https://aws.amazon.com/route53/what-is-dns/
48 | - https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html
49 | - https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc-options.html
50 | - https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html
51 |
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
60 |
61 |
62 |
63 |
64 |
65 |
--------------------------------------------------------------------------------
/09_aws_cloud/01-Cloud-Computing-overview/01-Cloud-Computing.md:
--------------------------------------------------------------------------------
1 | # CLOUD COMPUTING
2 | In this section, we'll begin exploring the fundamentals of Cloud computing from the ground up, making it accessible for complete beginners. Some of the terms and concepts may seem complex at first, but we'll break them down into simple, easy-to-understand language to help you grasp the ideas more quickly and effectively. Our goal is to ensure these concepts resonate with you, even if you're just starting.
3 |
4 | ### TOPICS TO BE COVERED
5 | - [Life before Cloud](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/00_cloud_concept/README.md#life-before-cloud)
6 | - [What is Cloud Computing](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/00_cloud_concept/README.md#what-is-cloud-computing)
7 | - [What is AWS](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/00_cloud_concept/README.md#what-is-aws)
8 | - [What is a Data Center](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/00_cloud_concept/README.md#what-is-data-center)
9 | - [AWS Global Infrastructure](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/00_cloud_concept/README.md#aws-global-infrastructure)
10 |
11 | ### LIFE BEFORE CLOUD
12 | Before exploring Cloud Computing, it's important to understand the traditional setup that existed before its emergence. A commonly used term from that era is `On-Premise.` This term plainly refers to something that is located within one's environment or vicinity.
13 | - `On-premise` in the context of Technology refers to private `Data Centers` that companies house in their facilities and maintain themselves.
14 | - `Data Center` is a physical location that stores computers which are called `Servers` and their related hardware equipment. It contains the computing infrastructure that IT systems require, such as servers, data storage drives, and network equipment. Read more [HERE](https://en.wikipedia.org/wiki/Data_center#:~:text=A%20data%20center%20(American%20English)%5B1%5D%20or%20data%20centre%20(Commonwealth%20English)%5B2%5D%5Bnote%201%5D%20is%20a%20building%2C%20a%20dedicated%20space%20within%20a%20building%2C%20or%20a%20group%20of%20buildings%5B3%5D%20used%20to%20house%20computer%20systems%20and%20associated%20components%2C%20such%20as%20telecommunications%20and%20storage%20systems.%5B4%5D%5B5%5D)
15 | - `Server` in simple words is a computer program or device that provides a service to another computer program and its user, also known as the client. A server can also be a client; the distinction is based on which side is providing the service (server) and which is requesting it (client). Similarly, websites you visit are hosted on servers: computers that run constantly to keep the sites accessible. While servers can be more complex, this provides a simple overview of what they do.
16 |
17 | ### WHAT IS CLOUD COMPUTING
18 | Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. This means that instead of owning, running, and maintaining physical data centers and servers on-premise, companies/you can access technology services, such as Servers, storage, and databases, on an as-needed basis from any of the cloud providers. The major Cloud service providers in the world are Amazon Web Service (AWS), Microsoft Azure (Azure), and Google Cloud Platform (GCP).
19 | - `Cloud Providers`: These are companies that run their own data centers in a BIG physical location, so basically they will have Servers, Storage, Database, and other IT infrastructures in this location. Any individual or organization can contact them via their website to rent these resources as a pay-as-you-go means, so you pay for what you use monthly. These resources (Servers, Databases, etc.) will then be accessed via the internet from your location.
20 |
21 | Our focus on this boot camp is Amazon Web Service (AWS).
22 |
23 | ### WHAT IS AWS?
24 | - Amazon Web Service popularly known as AWS is a company offering a Cloud Computing service.
25 | - You sign up on their website to get the infrastructures/resources you need and pay based on usage.
26 | - If you need a Server or Database for example, you can easily do that with a few clicks on their website without having to manage those physically wherever you are.
27 | - It's still worth knowing that AWS runs the Data Centers in some physical locations across the world.
28 |
29 | ### AWS GLOBAL INFRASTRUCTURE
30 | This section provides an overview of AWS's Global Infrastructure, with a focus on the locations of its physical data centers. We'll explore two key concepts: `REGION` and `AVAILABILITY ZONES`. Before diving into a detailed explanation of these terms, let's first look at an image that illustrates them.
31 |
32 |
33 | Image Reference: https://aws.amazon.com/about-aws/global-infrastructure/?p=ngi&loc=1
34 |
35 | - `Region`: A region refers to a specific geographic area where AWS houses its data centers. As mentioned earlier, data centers are physical locations containing resources such as servers, databases, and various IT infrastructures. AWS has numerous such locations globally, and each of these areas is classified as a region. From the provided image, it's clear that AWS services are distributed across multiple regions, one of the most notable being Cape Town, which consists of three distinct `Availability Zones`.
36 | - `Availability Zones`: A region typically consists of one or more data centers, known as `Availability Zones` (AZs). Each AZ corresponds to a distinct data center. For example, in the Cape Town region, there are three AZs, meaning there are three separate data centers. One of the key benefits of having multiple AZs is improved disaster recovery. By distributing your servers across these zones, you ensure that if one data center experiences an outage or failure, your servers remain accessible because they are replicated across other `Availability Zones`.
37 | - `Region Codes`: Every Region has a corresponding code that represents it. For example:
38 | - Region Cape Town code representation is `af-south-1`
39 | - Frankfurt code representation is `eu-central-1`
40 | - Find the full list [HERE](https://www.aws-services.info/regions.html)
41 |
42 | ### MORE RESOURCES
43 | - https://aws.amazon.com/about-aws/global-infrastructure/regions_az/?p=ngi&loc=2
44 | - https://www.aws-services.info/regions.html
45 |
--------------------------------------------------------------------------------
/09_aws_cloud/02-Identity-And-Access-Management(IAM)/00-iam-resources.md:
--------------------------------------------------------------------------------
1 | # IDENTITY AND ACCESS MANAGEMENT (IAM)
2 |
3 | # TOPICS
4 | - [What is IAM](https://github.com/coredataengineers/CDE-BOOTCAMP/edit/main/09_aws_cloud/01_iam/00_IAM_resources.md#what-is-iam)
5 | - [Root User](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/01_iam/00_IAM_Resources.md#root-user)
6 | - [IAM user](https://github.com/coredataengineers/CDE-BOOTCAMP/edit/main/09_aws_cloud/01_iam/00_IAM_resources.md#iam-user)
7 | - [IAM group](https://github.com/coredataengineers/CDE-BOOTCAMP/edit/main/09_aws_cloud/01_iam/00_IAM_resources.md#iam-group)
8 | - [IAM Policy](https://github.com/coredataengineers/CDE-BOOTCAMP/edit/main/09_aws_cloud/01_iam/00_IAM_resources.md#iam-policy)
9 | - [IAM Role](https://github.com/coredataengineers/CDE-BOOTCAMP/edit/main/09_aws_cloud/01_iam/00_IAM_resources.md#iam-role)
10 |
11 | ### WHAT IS IAM
12 | - AWS Identity and Access Management (IAM) is one of the key services AWS provides that helps you securely control who can access your account and the resources in it, like Servers, databases, etc.
13 | - With IAM, you can manage permissions that control which AWS resources a specific user can access. You use IAM to control who is authenticated (signed in) and authorized (has permissions) to use resources.
14 | - IAM provides the infrastructure necessary to control authentication and authorization for your AWS accounts.
15 | - In fact, nearly all the services AWS provides require strong IAM knowledge because:
16 | - No one has access to any resources by default and no AWS services are allowed to talk to other AWS services by default.
17 | - The only person that has access to everything is the user who created the AWS account, which is called the `ROOT USER`.
18 |
19 | ### ROOT USER
20 | - When you create an AWS account, you begin with one identity that has complete access to all AWS services and resources in the account.
21 | - This identity is called the AWS account `root user` and is accessed by signing in with the email address and password that you used to create the account.
22 | - AWS strongly recommends that you don't use the root user for your everyday tasks, this user has permission to delete anything and create anything.
23 | - Safeguard your root user credentials and use them to perform the tasks that only the root user can perform.
24 | - The best practice is to create an `IAM user` who can access the account, this user will be the identity that will carry out administrative tasks like creating other users and groups, giving permissions, and many more.
25 | - Ideally, the IAM user created by the root user credential will be given administrative permission to carry out tasks, including the creation of other users.
26 | - Below is the representation of a `ROOT USER` who visited https://aws.amazon.com to create an AWS account for the first time.
27 |
28 |
29 |
30 | ### IAM USER
31 | - An AWS Identity and Access Management (IAM) user is an entity that you create in AWS.
32 | - The IAM user represents the human user or application that uses the IAM user to interact with the AWS account.
33 | - Below is an image of a ROOT USER who creates a bunch of IAM USERS to be able to access the AWS account
34 | - Note that these IAM users cannot do anything with the account, except given permission. We will cover something called the IAM policy later.
35 |
36 |
37 |
38 | ### IAM ACCESS AND SECRET KEY
39 | - These are long-term credentials for an IAM user or the AWS account root user.
40 | - These credentials are mainly used to authenticate to your AWS account programmatically.
41 | - For example, you might have Python code that needs to put a CSV file in an S3 bucket in AWS.
42 | - In this case, Python cannot log in to AWS because it's not a human being; the only option is to use an access key and secret key as a means to authenticate and put the file there.
43 | - Basically, you will create an IAM user for your Python application.
44 | - You will grant the necessary permission to that IAM user to be able to put files in the S3 bucket.
45 | - You create the access key and secret key for that IAM user.
46 | - Your Python application can then use the access and secret keys to authenticate and carry out the action of putting the CSV in the bucket, as sketched below.
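
Below is a minimal `boto3` sketch of that flow (an illustration, not the official course script). The bucket name, file name, and key values are hypothetical placeholders; in practice, prefer environment variables or an IAM role over hard-coding keys in your source code.

```python
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # from the IAM user you created
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # keep this secret; never commit it
)

# Put the local CSV file into the bucket under the given key
s3.upload_file("sales.csv", "my-example-bucket", "raw/sales.csv")
```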
47 |
48 | ### IAM GROUP
49 | - An IAM user group is a collection of IAM users.
50 | - User groups let you easily group IAM users that belong to the same function in one entity called an IAM group.
51 | - This makes it easier to manage the permissions for those users. For example, you could have a user group called Admins and give that user group typical administrator permissions. Any user in that group automatically has Admins permissions.
52 | - Similar to the image we have below, you can put users who are Data Engineers in the same group and attach the permission to that group; any Data Engineer who joins the team and is added to the IAM group will automatically inherit the permissions attached to the group.
53 |
54 |
55 |
56 | ### IAM POLICY
57 | You manage access in AWS by creating policies and attaching them to IAM identities (users, groups of users, or roles) or AWS resources. A policy is an object in AWS that, when associated with an identity or resource, defines its permissions. AWS evaluates these policies when an IAM principal (user or role) makes a request. Permissions in the policies determine whether the request is allowed or denied. Most policies are stored in AWS as JSON documents. We covered IAM Policy separately [HERE](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/01_iam/01_IAM_Policy.md) because it's very important to understand it in detail as a Data Engineer.
58 |
59 | ### IAM ROLE
60 | Another IAM Resource is the IAM Role; this role is very important.
61 | - An IAM role is an IAM identity that you can create in your account that has specific permissions.
62 | - Basically you create an `IAM Role`, you attach an IAM Policy to it and anyone or specific service can assume that Role to perform a specific action.
63 | `IAM Role` is `VERY CRITICAL` to know and understand, we covered it in detail [HERE](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/01_iam/02_IAM_Role.md)
64 |
--------------------------------------------------------------------------------
/09_aws_cloud/03-Virtual-Private-Cloud(VPC)/00-IP-Addressing.md:
--------------------------------------------------------------------------------
1 | # IP Addressing and CIDR
2 | This guide will give you the foundational knowledge you need to work confidently with `AWS Virtual Private Cloud (VPC)`. Please be aware that Networking is a wide topic, but we aim to cover what you will most likely come across most of the time, and that should be enough to confidently start working with Amazon VPC.
3 |
4 | We will talk about the `2` important Networking concepts below; by the end of this guide, you will be confident about what's going on the next time you hit a networking challenge.
5 | - `IP Addressing`
6 | - `Classless Inter-Domain Routing (CIDR)`
7 |
8 | ## INTERNET PROTOCOL ADDRESS (IP Address)
9 | `IP Addressing:` An IP Address is a unique, dot-separated set of 4 numbers that acts as an identity attached to every device when it communicates over the internet, e.g. `23.56.98.11`, `123.34.100.0`.
10 | - For example, say you are sitting on the balcony surfing the internet with your laptop, and let's assume you decide to visit `coredataengineers.com` on your favorite browser to sign up.
11 | - Basically, you want to communicate with the [CoreDataEngineers website](https://coredataengineers.com/); in this case, your device (laptop) will be attached with a unique label called an `IP Address`. How the IP Address is generated is outside the scope of this guide.
12 | - This IP Address is what your computer will use to identify itself when you send that request to the `CoreDataEngineers.com` server that hosts the `CDE` website.
13 |
14 | ## CLASSLESS INTER-DOMAIN ROUTING (CIDR)
15 | Let's talk about `CIDR Range or CIDR Block`.
16 | - `CIDR Range` is a collection of `IP Addresses`.
17 | - Basically, it's a range of `IP Addresses` that starts at a specific IP Address and ends at a specific IP Address.
18 | - For example, a range of numbers between `1-5` would contain `1, 2, 3, 4`.
19 | - A `CIDR Range` follows the same concept, though of course not entirely the same. In addition, knowing the specific IP Addresses in a `CIDR Range` is not easy, because some `CIDR Ranges` have thousands of IPs in them.
20 | - Rest assured, there are tools already created that will help you to know:
21 | - How many IPs we have in a `CIDR Range`
22 | - What the first and the last IP Address in a given CIDR Range are.
23 | - If a specific IP is in a CIDR Range.
24 |
25 | Let's see an example of a `CIDR Range`. We are going to use a tool to see how many `IP Addresses` we have in the `CIDR Range`, plus the first IP and the last IP. Lastly, we will see how to check if a specific IP is in the `CIDR Range` we have.
26 | - Let's consider this CIDR Range of `10.5.0.0/30`.
27 | - The `/30` is called the `subnet mask` (more precisely, the prefix length), which tells us how many IPs we will have in the CIDR Range, calculated below
28 | - 2 ** (32 - 30) = 4 IPs
29 | - The `2` and `32` are always constants.
30 | - Let's consider another CIDR Range of `170.8.0.0/28`
31 | - The `/28` tells us how many IPs we will have in the CIDR Range, which is calculated below
32 | - 2 ** (32 - 28) = 16
33 | - The `2` and `32` are always constants (see the quick terminal check just after this list).
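
You can reproduce that arithmetic straight from a terminal; this is only a quick sanity check of the `2 ** (32 - prefix)` formula above:

```bash
# IPs in 10.5.0.0/30
echo $((2 ** (32 - 30)))   # prints 4

# IPs in 170.8.0.0/28
echo $((2 ** (32 - 28)))   # prints 16
```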
34 |
35 | ## CHECK LIST OF IP ADDRESSES IN CIDR RANGE
36 | Now that we know how to calculate how many IPs are within a CIDR Range, how do we know the individual `IP Addresses` in a specific `CIDR Range`?
37 | - Use this [tool](https://ipgen.hasarin.com/) to see the total list of the IPs in the above `CIDR Ranges`; make sure to select `CIDR` instead of `plain range`, then input your dotted IP and `/30`, which is your subnet mask, like in the image below.
38 | - You will see 4 IPs
39 | - The first IP is `10.5.0.0`
40 | - The last IP is `10.5.0.3`.
41 |
42 |
43 |
44 | If you want to do a deep dive on how each of the IPs is derived, and a general deep dive on IP Addressing and CIDR Ranges, we highly recommend this [Youtube Video](https://www.youtube.com/watch?v=7hIbzlxbebc).
45 |
46 | ## WHAT REALLY MATTERS
47 | Now that we've understood the meaning of and the difference between an `IP Address` and a `CIDR Range`, and also how to use the tool to see how many IPs we have in a `CIDR Range`, let's talk about what you will most likely deal with most of the time. Before then, let's understand this first
48 | - A `CIDR Range` is like a `private Network`, which is a range of IPs.
49 | - Any IP that is not part of a specific `CIDR Range` is not part of that network.
50 | - When an IP is not part of a `Network`, it cannot communicate with anything that lives inside that Network.
51 |
52 | Let's look at the image below to understand this better.
53 |
54 |
55 |
56 | - The image above shows `John` trying to connect to a `Database` that lives inside a private network with `CIDR Range 10.5.0.0/30`.
57 | - The `CIDR Range` has `4 IPs`, as we saw when we used the tool previously to check it.
58 | - The communication from `John's` laptop to the `Database` inside the `Private network` is not possible, because `John's IP` is not part of the `CIDR Range`.
59 | - You can use [this tool](https://tehnoblog.org/ip-tools/ip-address-in-cidr-range/) to check if an IP is part of a CIDR Range, i.e. if the IP is part of the list of IPs in the `CIDR Range`. The tool looks like the one below; in this case we are checking if `John's IP 56.120.2.8` is in the `CIDR Range 10.5.0.0/30` (a command-line alternative is sketched below).
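
If you prefer the command line to the web tools, Python's built-in `ipaddress` module (available on most machines as `python3`) can answer the same questions; this is just an illustrative sketch, not part of the tools referenced above:

```bash
# List every IP in the CIDR Range 10.5.0.0/30 (prints the 4 addresses)
python3 -c "import ipaddress; print(list(ipaddress.ip_network('10.5.0.0/30')))"

# Check whether John's IP 56.120.2.8 falls inside 10.5.0.0/30 (prints False)
python3 -c "import ipaddress; print(ipaddress.ip_address('56.120.2.8') in ipaddress.ip_network('10.5.0.0/30'))"
```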
60 |
61 |
62 |
63 | When you click submit and scroll down, you should see the image below, saying that the IP is not part of the `CIDR Range`.
64 |
65 |
66 |
67 | In short, what you need in order to work with `Amazon VPC` is a strong understanding of whether an `IP Address` belongs to a `CIDR Range`. IP Addresses and CIDR Ranges are everywhere in AWS, which is why we are covering these 2 concepts specifically.
68 |
69 | Resource References
70 | - [IP Addressing and CIDR Deep Dive](https://www.youtube.com/watch?v=7hIbzlxbebc).
71 | - [Check List of IPs in a CIDR Range](https://ipgen.hasarin.com/).
72 | - [Check if IP in CIDR Range ](https://tehnoblog.org/ip-tools/ip-address-in-cidr-range/).
73 | - [Show first IP, last IP and how many IPs in a CIDR Range](https://www.ipaddressguide.com/cidr.aspx)
74 |
--------------------------------------------------------------------------------
/09_aws_cloud/07-Relational-Database-Service(RDS)/00-RDS-Overview.md:
--------------------------------------------------------------------------------
1 | ### WHAT IS RDS
2 | - Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up,
3 | operate, and scale a relational database in the AWS Cloud.
4 | - In simple terms, it's a service in AWS that offers a managed Database service.
5 | - For example, if you have a company and you want to store your customers' data and products data in a Database, you can easily create a Database with the RDS service.
6 |
7 | ### WHAT IS A DATABASE
8 | - A database is an organized collection of data, stored so it can be accessed and managed.
9 | - Basically, you store and manage your data within that database and you also provide access so that anyone can retrieve data from it.
10 |
11 | ### DATABASE ENGINE
12 | - A Database engine is the specific relational database software that runs on your `Database Instance`.
13 | - Available Database Engines supported by RDS can be found [HERE](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html#Welcome.Concepts.DBInstance.architecture:~:text=access%20them%20directly.-,DB%20engines,-A%20DB%20engine)
14 |
15 | ### LOCAL DATABASE INSTANCE
16 | To properly understand the RDS concept, we need to take a step back and look at how you would run a Database on your local laptop; let's use the Postgres Database as an example. Let's go through the steps of what that looks like
17 | - Ideally you will visit the PostgreSQL [Website](https://www.postgresql.org/download/)
18 | - You will select your Operating System, like macOS, Windows or Linux
19 | - Afterwards, you download the installer to your computer, which is essentially the PostgreSQL software
20 | - You then run the installer up to the point where it asks what you want the password of your Postgres Database to be
21 | - This means you have a Postgres Database running on your computer and you can connect to it using any of your SQL clients like PGAdmin, DBeaver, DataGrip.
22 | - The visual representation can look like this below
23 |
24 | 
25 |
26 | ### WHAT IS A DATABASE INSTANCE IN THE CLOUD
27 | Before we get to what a `Database Instance` means, let's break some terms down first
28 | - An `Instance` is like a computer; they are Virtual Machines (VMs).
29 | - In the context of AWS we call them `EC2` instances, also known as Virtual Servers.
30 | - A `Database Instance` is an isolated database running inside an EC2 instance, which as we now know is a Server or Virtual Machine in the cloud.
31 | - For example, the image below shows a Postgres Database Instance in AWS. We can see the Database Engine is Postgres, while the Virtual Server in this case is an EC2 Instance.
32 | - NOTE: we only say `Computer` so it will be easy for beginners to remember.
33 |
34 | 
35 |
36 | ### DATABASE INSTANCE CLASS
37 | - The Database Engine concept stays the same: once you are creating a Database Instance, you will specify the Database Engine, for example Postgres.
38 | - But the Instance where the Database Engine will run differs.
39 | - That's where the Instance Class comes in; this is basically the type of computer you want your Database Engine to run inside.
40 | - Instance Classes differ in computation; at the end of the day they are virtual computers, so they have different CPU, RAM, etc., just like your personal laptop.
41 | - The Instance Class you want will depend on your use case and how big your workload is.
42 | - For example, the Instance Class will determine the RAM, CPU, etc. of your Database. Basically, whatever your Instance has in terms of CPU and Memory is what your Database will inherit.
43 | - For example, if you are running queries that are CPU intensive, that will automatically affect your Database performance.
44 | - See more on different Instance Types [HERE](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.Types.html)
45 |
46 | ### SECURITY GROUP
47 | Before we talk about what a Security Group is, it's important to know that when you create a Database Instance in the cloud, you can't connect to that Database by default. This is another reason why Cloud Databases are secure. The local Database on your laptop can easily be connected to, but in this case you have a Database running inside an Instance in the cloud, so how do you connect to that Database from your house?
48 | - AWS uses a `Security Group` as a security/firewall mechanism to control the communication/traffic that enters or leaves the EC2 Instance where our Database is running. Once traffic can reach your Instance, anyone who has the credentials to your Database will be able to connect to it. So basically the first requirement is to be allowed to reach the Instance, and that's what a Security Group helps with.
49 | - Any communication or traffic that goes into your Instance is called `Inbound Traffic`
50 | - Any communication or traffic that goes out of your Instance is called `Outbound Traffic`
51 | - A Security Group Rule (sgr) is a rule attached to a Security Group that specifies the traffic that can enter or leave an Instance.
52 | - Basically, you create a Security Group and then create Security Group Rules. A Security Group without a Rule is useless. These Rules are divided into two:
53 | - An `Inbound Rule` specifies the traffic that is allowed to enter the Instance.
54 | - An `Outbound Rule` specifies the traffic that is allowed to leave the Instance.
55 | - More about Security Group [HERE](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html).
56 | - More on Security Group Rule [HERE](https://docs.aws.amazon.com/vpc/latest/userguide/security-group-rules.html)
57 | - Note that knowledge of Networking is `STRONGLY REQUIRED` to know how to specify a specific rule for specific traffic.
58 | - The visual representation of a Database Instance and Security Group is below
59 |
60 | 
61 |
62 | **IMAGE SUMMARY**
63 | - An individual outside our network sends a Query towards our EC2 Instance in order to query our Database table.
64 | - That request is traffic aiming to go inside the Instance, which makes it Inbound Traffic.
65 | - We have our Postgres Database running inside our EC2 instance, but we attached a Security Group, which has both Inbound and Outbound rules, to our EC2 Instance.
66 | - The Inbound rule will check the traffic's source location; if it's not on the inbound rule list, it will be declined, and it will be allowed to connect to the Database if the location is on the list.
67 | - The Outbound rule will check any traffic leaving the Instance to ensure it matches the locations on the Outbound rule. NOTE: the Outbound rule by default allows traffic to go to any location around the world.
68 | - Locations are usually represented as IP addresses; this requires STRONG networking knowledge (a hedged CLI example follows below).
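
As an illustration of what such an inbound rule can look like in practice, here is a minimal AWS CLI sketch; the security group ID and IP address are placeholder values, not ones used in this guide:

```bash
# Allow inbound PostgreSQL traffic (TCP 5432) from one specific laptop IP.
# sg-0123456789abcdef0 and 203.0.113.10/32 are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5432 \
  --cidr 203.0.113.10/32
```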
69 |
70 |
71 |
72 |
--------------------------------------------------------------------------------
/09_aws_cloud/05-Simple-Storage-Service(S3)/00-S3-Overview.md:
--------------------------------------------------------------------------------
1 | # AMAZON SIMPLE STORAGE SERVICE ( S3 )
2 | This guide gives beginners what they need to know about Amazon s3. There is a lot to Amazon s3, but this guide is enough to understand what Amazon s3 is all about and its important features.
3 |
4 | ### TOPICS TO COVER
5 | - [What is Amazon s3](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#what-is-amazon-s3)
6 | - [Amazon s3 features](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#amazon-s3-features)
7 | - [Storage Classes](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#:~:text=of%20Amazon%20s3-,Storage%20classes,-Amazon%20s3%20offer)
8 | - [Storage management](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#:~:text=their%20corresponding%20detals-,Storage%20management,-Storage%20management%20help)
9 | - [Access management and security](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#:~:text=lifecycle%20rule%20HERE-,Access%20management%20and%20security,-Security%20is%20CRITICAL)
10 | - [Analytics and insights](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#:~:text=a%20specific%20service.-,Analytics%20and%20insights,-Amazon%20S3%20offers)
11 | - [s3 Versioning](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#:~:text=optimize%20your%20storage.-,s3%20Versioning,-Amazon%20s3%20provides)
12 | - [Regions Support](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#:~:text=s3%20Versioning%20HERE-,Regions%20Support,-You%20can%20choose)
13 | - [s3 access medium](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#:~:text=To%20access%20s3%20bucket%20on%20AWS%2C%20You%20can%20use%20any%20of%20the%20below%20medium)
14 | - [AWS Documenation Reference](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/09_aws_cloud/02_s3/README.md#aws-documentation-reference)
15 |
16 | ### WHAT IS AMAZON S3
17 | - Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance.
18 | - Let's explain what `Object` means in the above; Objects are files in the context of s3.
19 | - For example, if you have data in a CSV file, that is called an Object in the Amazon s3 world.
20 | - Your JPEG and PNG files are all called Objects in Amazon s3.
21 | - In Amazon s3, we create something called an `s3 Bucket`; this bucket is like a container for your objects or files.
22 | - Basically, you create a `Bucket` and store your Objects or files inside the bucket.
23 | - Customers of all sizes and industries can use Amazon S3 to store and protect any amount of data for a range of use cases.
24 | - Basically, any organisation can collect data about their business and store it in Amazon s3.
25 | - This data can be of any size; that's exactly part of what s3 is built for.
26 | - Amazon S3 provides management features so that you can optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements.
27 | - Your company data stored in Amazon s3 needs to be secure and organised; these are some of the many features s3 offers you.
28 |
29 |
30 | ### AMAZON S3 FEATURES
31 | Let's discuss some of the top features of Amazon s3
32 | - Storage classes
33 | - Amazon s3 offers something called storage classes, which are designed for the different use cases that meet your needs. There are different types of storage classes available in s3, and they all vary in cost.
34 | - For example, there are some objects in s3 you will need from time to time; those should be stored in a storage class that allows you to quickly retrieve them.
35 | - For data that is used very infrequently, maybe only needed once a year, it makes sense to use a storage class that allows you to pay less.
36 | - Please find all the storage classes [HERE](https://aws.amazon.com/s3/storage-classes/) and their corresponding details
37 | - Storage management
38 | - Storage management helps us manage the lifecycle of our objects, which can be very useful for cost savings.
39 | - For example, you can configure something called a `lifecycle rule` that automatically transitions data from a specific storage class to a more cost-effective storage class after a specific period of time.
40 | - You can also automatically delete an object when it reaches a specific age you set; for example, you can have less critical data deleted after 30 days or after 1 year.
41 | - More on lifecycle rule [HERE](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#:~:text=for%20compliance%20requirements.-,S3%20Lifecycle,-%E2%80%93%20Configure%20a%20lifecycle)
42 | - Access management and security
43 | - Security is CRITICAL on AWS; in fact, we said previously that no one has permission to do anything until you give them access.
44 | - You can configure restrictions on your s3 bucket, and also on the objects inside it, so they can only be accessed by a specific IAM user or a specific service.
45 | - Analytics and insights
46 | - Amazon S3 offers analytics and insights to help you gain visibility into your storage usage, which empowers you to better understand, analyze, and optimize your storage.
47 | - s3 Versioning
48 | - Amazon s3 provides a powerful feature called `Versioning`
49 | - Versioning helps in the case of accidental deletion of an object in an s3 bucket
50 | - But you have to enable Versioning on the s3 Bucket
51 | - Once Versioning is enabled, every Object that lands in the bucket automatically gets versioned; it's basically keeping a copy of every object in the bucket. If an object is accidentally deleted, it can be recovered since it's versioned somewhere.
52 | - Read more on s3 Versioning [HERE](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html)
53 | - Regions Support
54 | - You can choose the geographical AWS Region where Amazon S3 stores the buckets that you create.
55 | - You might choose a Region to optimize latency, minimize costs, or address regulatory requirements.
56 | - For example, if your customers are in `Nigeria` in Africa, for regulatory purposes and to reach your data with lower latency, it makes sense to create your s3 bucket in the `Cape Town` Region in South Africa.
57 | - Objects stored in an AWS Region never leave the Region unless you explicitly transfer or replicate them to another Region. For example, objects stored in the Europe (Ireland) Region never leave it.
58 | - To access an s3 bucket on AWS, you can use any of the below mediums (see the CLI sketch just after this list)
59 | - AWS Console (AWS Website)
60 | - [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/s3/)
61 | - AWS SDK (Python)
62 | - [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html)
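
As a hedged illustration of the CLI medium, a few common calls might look like the sketch below; the bucket name and file are placeholders, and `af-south-1` (Cape Town) simply echoes the Region example above:

```bash
# Create a bucket (bucket names are globally unique; this one is a placeholder)
aws s3 mb s3://my-example-cde-bucket --region af-south-1

# Upload (copy) a local CSV file as an object into the bucket
aws s3 cp ./customers.csv s3://my-example-cde-bucket/raw/customers.csv

# Enable Versioning on the bucket
aws s3api put-bucket-versioning \
  --bucket my-example-cde-bucket \
  --versioning-configuration Status=Enabled
```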
63 |
64 | ### AWS DOCUMENTATION REFERENCE
65 | - https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
66 | - Note: there is a lot in the Documentation; the notes above summarise what you need to be aware of, and that's enough to work with Amazon s3 for now.
67 |
--------------------------------------------------------------------------------
/06_docker/08-Docker-Volume.md:
--------------------------------------------------------------------------------
1 | ## Docker containers are not designed for long term storage
2 |
3 | When you run an application inside a Docker container, any data that the application generates is written to a location inside the container's filesystem. This may seem fine but there is a critical limitation that you should be aware of.
4 |
5 | The filesystem of a Docker container is **ephemeral** by nature, i.e. it's not designed for long-term storage. If the container crashes or gets deleted, all data that is stored inside it cannot be recovered. This makes relying on the container's internal storage risky for anything you want to keep.
6 |
7 | Consider this scenario:
8 | You have an application running inside a container. It generates log files and saves them to a directory like `/app/logs` inside the container. These logs may contain valuable information (e.g. debug messages, user activity information, etc.) that you may need to analyze later to improve your app. But if the container goes down, those logs are gone along with it.
9 |
10 | That’s why it’s important to treat a container’s filesystem as temporary storage. You should never assume that anything written inside a container today will still be there tomorrow.
11 |
12 | To avoid any potential data loss, you need a reliable way to store data so that it does not go down with the container. This is where **Docker Volumes** come in.
13 |
14 | ## What is a Docker Volume?
15 |
16 | A Docker volume is a persistent storage mechanism in Docker that allows you to store data that is generated inside a container to a location on the **HOST** machine. This ensures that the data is not lost when the container stops, restarts, or is removed.
17 |
18 | > HOST: The machine that is running the Docker engine (i.e. the machine that is creating the containers). It can be your local computer, a virtual machine, or a server in the cloud.
19 |
20 | ## How volumes work
21 |
22 | A Docker volume, in its essence, is just a directory on the **host machine** that is mounted into another directory inside a container. Any data that is written to the mount point inside the container is stored on the host machine inside the corresponding volume directory.
23 | This setup ensures that the data is safe even if the container is stopped or deleted.
24 |
25 | > Docker volumes are managed completely by **Docker** and are typically stored at `/var/lib/docker/volumes/` on the host machine
26 |
27 | Let’s break this down with an example:
28 | You have an app that generates data and writes it to `/app/data` inside your container. If you mount a volume named `my-vol` to that directory, Docker will map it to the `/var/lib/docker/volumes/my-vol/_data` directory on the host.
29 |
30 | This means that any data that is written to `/app/data` inside the container will be automatically stored at `/var/lib/docker/volumes/my-vol/_data` on the host machine.
31 |
32 | ## Common commands in docker volumes
33 |
34 | **To create a volume**
35 |
36 | ```bash
37 | docker volume create my-vol
38 | ```
39 |
40 | This creates a new Docker-managed volume named `my-vol`.
41 |
42 | > Note: If you don’t specify a name, Docker will generate one automatically.
43 |
44 | **To list all volumes**
45 |
46 | ```bash
47 | docker volume ls
48 | ```
49 |
50 | This command returns a list of all volumes currently on your system
51 |
52 | ``` bash
53 | DRIVER VOLUME NAME
54 | local my-vol
55 | local another-vol
56 | ```
57 |
58 | **To inspect a specific volume**
59 |
60 | ```bash
61 | docker volume inspect volume-name
62 | ```
63 |
64 | This command returns detailed information about the volume in JSON format
65 |
66 | For example:
67 | ```bash
68 | docker volume inspect my-vol
69 | ```
70 |
71 | Output:
72 |
73 | ```json
74 | [
75 | {
76 | "CreatedAt": "2025-06-18T14:28:37+01:00",
77 | "Driver": "local",
78 | "Labels": {},
79 | "Mountpoint": "/var/lib/docker/volumes/my-vol/_data",
80 | "Name": "my-vol",
81 | "Options": null,
82 | "Scope": "local"
83 | }
84 | ]
85 | ```
86 |
87 | What the fields mean:
88 |
89 | | Field | Meaning |
90 | | ------ | ----- |
91 | | `CreatedAt` | The date and time the volume was created |
92 | | `Driver` | The driver used to manage the volume. By default, this is `local`. |
93 | | `Mountpoint` | The actual location on the host machine where the volume data is stored. For `local` volumes, this is usually somewhere like `/var/lib/docker/volumes/<volume-name>/_data`. |
94 | | `Name` | The name of the volume (`my-vol` in this case). |
95 | | `Labels` | Optional metadata. You can add labels using the `--label` flag when creating the volume |
96 | | `Options` | This field indicates driver-specific options used when the volume was created. It is empty unless specified |
97 | | `Scope` | This field indicates where the volume is accessible from. For `local` volumes, this is typically `local`, meaning the volume is only usable on the HOST machine |
98 |
99 | **To delete a volume**
100 |
101 | ```bash
102 | docker volume rm my-vol
103 | ```
104 | > Note: A volume **cannot** be deleted if it is currently in use by a container.
105 |
106 | ## Benefits of Docker Volumes
107 |
108 | 1. **Persistence Beyond Container Lifecycle**
109 | Docker volumes are independent of the containers to which they are attached. This means that if you delete a container or it crashes, the data stored in the mounted volume remains intact, because the volume is stored on the host system outside of the container.
110 | This behaviour is useful when you have
111 | + An app that generates data (e.g. logs, photos, files, etc.) that you don't want to lose.
112 | + An app that writes data to a database. When you mount the database's data directory to a Docker volume, you ensure that your data persists on the host.
113 |
114 | 2. **Data Sharing between containers**
115 | You can mount the **same** named volume into multiple containers at once. Every container can see and modify the same files in real time (see the sketch after this list). This setup makes it easy to
116 |
117 | * Share configuration files or resource caches.
118 | * Coordinate work: one container writes data, another reads or processes it without complicated copy operations
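
A minimal sketch of that sharing behaviour (the volume and file names here are made up for illustration):

```bash
# Create a named volume and let one container write a file into it
docker volume create shared-data
docker run --rm -v shared-data:/data alpine sh -c 'echo "hello from writer" > /data/message.txt'

# A second container mounting the same volume can read that file immediately
docker run --rm -v shared-data:/data alpine cat /data/message.txt
```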
119 |
120 | ## Using Volumes in a container
121 |
122 | You can mount a Docker volume into a container using the `-v` (or `--volume`) flag.
123 | This is the syntax
124 |
125 | ```bash
126 | docker run -v <volume-name>:<container-path> <image-name>
127 | ```
128 |
129 | * `docker run`: Starts a new container.
130 | * `-v`: Tells Docker to mount a volume into the container.
131 | * `<volume-name>`: The name of the Docker volume. If it doesn’t already exist, Docker will create it automatically.
132 | * `<container-path>`: The directory inside the container where the volume will be mounted.
133 | * `<image-name>`: The Docker image to base the container on (e.g., `ubuntu:24.04`).
134 |
135 |
136 | **Example**
137 |
138 | ```bash
139 | docker run -it -v mydata:/app/data ubuntu:24.04 bash
140 | ```
141 |
142 | This command will
143 | + mount the volume named `mydata` to the `/app/data` directory inside the container.
144 | + Launch an interactive Bash shell using the Ubuntu 24.04 image.
145 |
146 | Now, anything you write to `/app/data` inside the container will be stored in the `mydata` volume on the host machine.
147 |
148 | Even if the container stops or gets deleted, the volume and its data will persist. You can reuse the volume by creating another container and mapping the same volume to a directory in that container.
149 |
150 | > NOTE: You can even go further by creating multiple containers and mounting the **same volume** into each of them. Any changes made by one container (i.e. writing or deleting files) are immediately visible to the others.
151 |
152 | ## Mount Options
153 |
154 | When you mount a volume into a container using the `-v` flag, you can control how the container accesses that volume using mount options.
155 |
156 | This is the syntax:
157 |
158 | ```bash
159 | docker run -v <volume-name>:<container-path>:<options> <image-name>
160 | ```
161 |
162 | As you can see, in addition to providing the volume name and container path, you can also provide options to define how your container should interact with the volume. Here are common volume mount options:
163 |
164 | | Option | Description |
165 | | ------ | ----------- |
166 | | `ro`, `readonly` | Mounts the volume as **read-only**. The container can **read** data but **cannot write** to it. |
167 | | `nocopy` | Prevents Docker from copying the existing content from the container's target directory into the volume on first mount (if the volume is empty). |
168 |
169 | > By default, if no option is specified, the volume is mounted as `read-write`, i.e. containers can read and write to it freely.
170 |
171 | **For example:**
172 |
173 | ```bash
174 | docker run -it -v myvolume:/app/data:ro ubuntu:24.04 bash
175 | ```
176 | This command mounts `myvolume` into `/app/data` in the container as **read only**, i.e. you will only be able to read the data in `/app/data` inside the container, but you will not be able to write to it.
--------------------------------------------------------------------------------
/12_apache_kafka/01-Kafka-Overview.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 |
3 | If you are here, I want to give you a heartwarming congratulations, because you are about to explore and understand the beautiful world of `Apache Kafka`. The contents of the entire Kafka module in this repository are based on [Confluent Kafka](https://docs.confluent.io/), the company founded by the original creators of Apache Kafka. The core Apache Kafka (Confluent) concepts we'll be covering include:
4 |
5 | ## Table of Contents
6 |
7 | 1. [Kafka Overview](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md)
8 | 2. [Installing Kafka](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/02-Installing-Kafka.md)
9 | 3. [Kafka Cluster](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/03-Kafka-Cluster.md)
10 | 4. [Topics & Configuration](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md)
11 | 5. [Partitions & Offsets](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/05-Partition-and-Offset.md)
12 | 6. [Serialization & Deserialization](https://github.com/coredataengineers/CDE-BOOTCAMP/tree/main/12_apache_kafka/05-Serialisation-and-Deserialisation/README.md)
13 | 7. [Producers & Configurations](docs/06-producers-and-configuration.md)
14 | 8. [Consumers & Configurations](docs/07-consumers-and-configuration.md)
15 | 9. [Consumer Groups](docs/08-consumer-groups.md)
16 | 10. [Consumer Group Protocol](docs/09-consumer-group-protocol.md)
17 | 11. [Schema Registry](docs/10-schema-registry.md)
18 | 12. [Production Kafka Clusters](docs/11-production-kafka-clusters.md)
19 |
20 |
21 | Here, we will be discussing the following topics:
22 | - [Life Without Kafka: The Problem Kafka Solves](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#life-without-kafka-the-problem-kafka-solves)
23 | - [Option 1: Direct Communication Between Systems](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#option-1-direct-communication-between-systems)
24 | - [Option 2: Traditional Message Queues](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#option-2-traditional-message-queues)
25 | - [Option 3: Batch Processing](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#option-3-batch-processing)
26 | - [Problems with Batch Processing](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#problems-with-batch-processing)
27 | - [Kafka: The Real-Time Streaming Solution](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#kafka-the-real-time-streaming-solution)
28 | - [How Kafka Solves These Problems](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#how-kafka-solves-these-problems)
29 | - [Apache Kafka Overview](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#apache-kafka-overview)
30 | - [What is Kafka?](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#what-is-kafka)
31 | - [Kafka vs Batch: A Quick Comparison](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/01-Kafka-Overview.md#kafka-vs-batch-a-quick-comparison)
32 |
33 |
34 | We won't go straight into Apache Kafka; I would like you to first understand what happened before event streaming as we know it today.
35 |
36 | ## Life Without Kafka: The Problem Kafka Solves
37 | Before Kafka, organizations often relied on direct communication between services or traditional data pipelines. These setups were hard to manage, unreliable, and not designed for real-time needs.
38 |
39 | Let’s explore what life looked like without Kafka — and why even batch processing couldn’t solve the full picture.
40 |
41 | ### Option 1: Direct Communication Between Systems
42 | Each service is wired to talk directly to another.
43 |
44 |
45 |
46 | **Problems**:
47 | * Tightly coupled: Changes in one service break the others.
48 | * Fragile: If the destination is down, messages are lost or delayed.
49 | * Complex: Adding a new consumer means touching the producer code.
50 |
51 |
52 | ### Option 2: Traditional Message Queues
53 | Message queues (like RabbitMQ or ActiveMQ) improved decoupling but still had limits:
54 |
55 | * Messages are often removed after being read.
56 | * Not built for high throughput or large-scale replay.
57 | * Lacked storage—used for moving data, not persisting it.
58 |
59 | ### Option 3: Batch Processing
60 | Some teams turned to batch processing pipelines, like using cron jobs or ETL tools (Extract, Transform, Load) to move data periodically:
61 |
62 |
63 |
64 | ## Problems with Batch Processing:
65 |
66 | | Issue | Why It's a Problem |
67 | | ----------------------------- | ----------------------------------------------------------------- |
68 | | **Delayed Data** | Batches often run every hour or daily. You get outdated insights. |
69 | | **Rigid Scheduling** | Missed jobs = missed data. Dependencies are brittle. |
70 | | **Poor Real-Time Support** | No way to react instantly to new events. |
71 | | **Not Scalable** | Moving millions of rows in batches causes spikes and failures. |
72 | | **Replay is Hard** | You can’t “go back” unless you re-run the entire job. |
73 | | **No Streaming** | Not suitable for dashboards, alerts, or event-driven systems. |
74 |
75 |
76 | **Example:**
77 |
77 | Imagine you work at a ride-sharing company.
78 | Riders open the app. Drivers complete trips.
79 | A dashboard shows active drivers in real time.
80 |
81 | If you use batch processing, you’ll only update the dashboard every 15 minutes. That’s too late as users expect instant updates.
82 |
83 |
84 | ## Kafka: The Real-Time Streaming Solution
85 | Kafka is built to solve these limitations. It acts as a central pipeline where producers publish events and consumers subscribe to them.
86 |
87 | Here’s what it looks like with Kafka
88 |
89 |
90 |
91 | **What now happens**
92 | * Producers write events to Kafka.
93 | * Consumers read at their own speed, even replaying old messages if needed.
94 |
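To make that concrete, here is a minimal command-line sketch, assuming a local broker on `localhost:9092` and a topic called `rides` (the scripts ship with Apache Kafka; Confluent distributions name them without the `.sh` suffix):

```bash
# Produce events: each line you type becomes a message in the topic
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic rides

# Consume events at your own pace; --from-beginning replays everything
# the topic still retains, not just newly arriving messages
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic rides --from-beginning
```
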
95 | ### How Kafka Solves These Problems
96 |
97 | | Problem | Kafka’s Answer |
98 | | ----------------------------- | ----------------------------------------- |
99 | | Delayed batch insights | Streams data in real time |
100 | | Fragile point-to-point links | Centralized topic-based communication |
101 | | Consumers blocked by failures | Kafka stores data until consumer is ready |
102 | | Difficult scaling | Horizontal scaling with partitions |
103 | | No historical data in queues | Kafka retains messages for days/weeks |
104 | | Replay not possible | Consumers can replay by resetting offsets |
105 |
106 |
107 | ## Apache Kafka Overview
108 |
109 | Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, and scalable real-time data pipelines. Originally developed at LinkedIn and later open-sourced, Kafka has become the de facto standard for building modern data streaming architectures.
110 |
111 | Confluent Kafka is an enterprise-grade distribution of Apache Kafka that adds additional tools, services, and APIs to simplify deployment, monitoring, security, and integration.
112 |
113 | ### What is Kafka?
114 | Kafka is fundamentally a publish-subscribe messaging system based on distributed commit logs. It enables applications to:
115 |
116 | * Publish (write) streams of data (events, logs, metrics, etc.)
117 | * Subscribe (read) those data streams in real-time
118 | * Store data durably and reliably
119 | * Process streams either in real-time or batch
120 |
121 | Kafka can handle trillions of events per day across thousands of clients.
122 |
123 | ### Kafka vs Batch: A Quick Comparison
124 |
125 | | Feature | Batch Processing | Kafka Streaming |
126 | | ------------------- | ----------------------- | --------------------------------------- |
127 | | Data delivery speed | Delayed (minutes/hours) | Near real-time |
128 | | Processing method | Periodic jobs | Continuous stream |
129 | | Replayability | Complex/manual | Simple (via offsets) |
130 | | Scalability | Often brittle | Built-in partitioning |
131 | | Use case fit | Reports, backups | Dashboards, alerts, real-time services |
132 | | Failure handling | Retry whole batch | Consumer resumes from last known offset |
133 |
134 | **With Kafka:**
135 |
136 | * Systems are loosely coupled.
137 | * Data is streamed continuously.
138 | * Consumers can act in real-time or replay the past.
139 |
140 |
141 | **Conclusion**
142 |
143 | Kafka replaces rigid, slow, and fragile communication pipelines with a fast, scalable, and reliable event streaming platform.
144 |
--------------------------------------------------------------------------------
/10_terraform/README.md:
--------------------------------------------------------------------------------
1 | # TERRAFORM GUIDE
2 | This Terraform guide covers the basics of Terraform as an important concept for beginners to know;
3 | it will help beginners understand what Terraform is and how it works before
4 | writing a Terraform configuration file to provision resources. We will cover the topics below:
5 |
6 | - [Prerequisite](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#prerequisite)
7 | - [Brief Introduction to IAC](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#Brief-Introduction-to-IAC)
8 | - [Real life Scenario](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#a-real-life-scenrario)
9 | - [What is Terraform](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#what-is-terraform-)
10 | - [How Terraform Works](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#how-does-terraform-work-)
11 | - [Terraform state file](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#what-is-terraform-state-file-)
12 | - [State and Configuration File Handling In Terraform](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#State-and-Configuration-File-Handling-In-Terraform)
13 | - [Terraform Commands](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/10_terraform/README.md#useful-terraform-commands)
14 |
15 | ## PREREQUISITE
16 | Before you start working with Terraform, you'll need to have certain prerequisites in place.
17 | - Install Terraform on your Computer
18 | - MAC/LINUX users: Open your terminal and run the below commands (make sure you have brew installed; if not, install it [HERE](https://brew.sh/))
19 | - `brew tap hashicorp/tap`
20 | - `brew install hashicorp/tap/terraform`
21 | - WINDOWS users: Follow the manual installation [HERE](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli)
22 | - After the installation, to verify everything is good, run `terraform --version` on your terminal
23 |
24 | ## BRIEF INTRODUCTION TO IAC
25 |
26 | Infrastructure as Code (IaC) is a method of managing and provisioning infrastructure by using code instead of relying on manual processes. This approach involves defining your infrastructure in configuration files, which makes it easier to adjust and share settings, while also ensuring that each environment you set up is identical to the previous one. By documenting these configurations in code, IaC also helps prevent the unintentional changes that can occur with manual setups.
27 |
28 | An important aspect of IaC is version control, where these configuration files are managed just like any other software code. This practice allows you to break down your infrastructure into reusable, modular parts that can be combined and automated in various ways.
29 |
30 | By automating infrastructure tasks through IaC, developers no longer need to manually set up servers, operating systems, or other infrastructure components whenever they work on new applications or updates.
31 |
32 | In the past, setting up infrastructure was a labor-intensive and expensive manual task. With the advent of virtualization, containers, and cloud technologies, the management of infrastructure has shifted away from physical hardware in data centers. While this transition offers many benefits, it also introduces new challenges, such as the need to handle an increasing number of infrastructure components and the frequent scaling of resources. Without IaC, managing today’s complex infrastructure can be quite challenging.
33 |
34 | IaC helps organizations effectively manage their infrastructure by enhancing consistency, reducing errors, and eliminating the need for repetitive manual configurations.
35 |
36 | The key advantages of IaC include:
37 | - Lower costs
38 | - Faster deployment processes
39 | - Reduced chances of errors
40 | - Greater consistency in infrastructure setup
41 | - Prevention of configuration drift
42 |
43 |
44 |
45 | ## A REAL LIFE SCENARIO
46 | Before we start Terraform, let's consider a real-life example, and we will map that to Terraform afterward so we understand what Terraform is doing.
47 | - Let's assume you want to build a 3-bedroom apartment, you will need a lot of resources like bricks, water, sand, wood, roofing materials, etc. These resources can be interchangeably called infrastructures.
48 | - Ideally, you will call a contractor to start the construction from scratch. They start to combine all the above resources to make the floor, create the rooms, get to the roofing stage, and finally, the whole apartment will be ready for use.
49 | - If you need to maintain the apartment, maybe change something, you simply call the contractor and they change what they need to change.
50 | - This is exactly what happens in the Terraform world
51 | - Let's assume Terraform is the Contractor in this case, Terraform will ask the owner of the apartment to represent what he or she wants in a configuration file. You can see the configuration as the building’s schematic/plan.
52 | - The owner in this case will specify, 3 rooms, the size, 2 toilets, 1 kitchen, etc.
53 | - This file will be submitted to Terraform.
54 | - Terraform will create everything as described in the configuration file.
55 | - If the owner has to change anything in the apartment, he/she returns to that building plan/schematics (configuration file) and makes the changes, let us assume a change from 2 toilets to 1.
56 | - Once the changes are completed, again Terraform will check the file and modify it to match what is specified by the owner.
57 |
58 | ## WHAT IS TERRAFORM ?
59 |
60 | Terraform is an infrastructure as code (IaC) tool that allows anyone to define what they want their cloud resource/infrastructure
61 | to look like in a human-readable Terraform configuration file with a `.tf` extension.
62 | Basically, you create a terraform file and define what you want your cloud infrastructure or resource to
63 | be, and Terraform takes care of the rest for you.
64 | So it is exactly what we described in the real-world scenario; you create a building blueprint/Terraform file, for example, `my_cloud_resources.tf`, specify what resources you want and how they should look, and then Terraform takes care of the rest.
65 |
66 | ## HOW DOES TERRAFORM WORK?
67 | Before we know how Terraform works, it is important to talk about a few things. Some of the major Cloud providers are Amazon Web Service(AWS), Microsoft Azure, and Google Cloud Platform (GCP). Assuming we would like to create a resource on AWS, how does Terraform carry out this operation?
68 | These are the things that will happen...
69 | - Firstly, you need to create a Terraform configuration file that ends with `.tf`
70 | - Secondly, in the `my_cloud_resources.tf` file, you will need to specify that the provider you want to create the resource on is `aws` in `us-east-1`. Please read on AWS Regions and Availability Zones [HERE](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/). This will look like the sketch below.
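
Since the screenshot is not reproduced here, a minimal sketch of that step (writing the provider block into `my_cloud_resources.tf`) could look like this; treat it as illustrative only:

```bash
# Write a minimal configuration file declaring AWS as the provider in us-east-1
cat > my_cloud_resources.tf <<'EOF'
provider "aws" {
  region = "us-east-1"
}
EOF
```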
71 |
72 |
73 |
74 | **Visual Representation of Terraform’s Resource Provisioning**
75 |
76 |
77 |
78 |
79 |
80 |
81 |
82 |
83 | **Source Reference**: https://developer.hashicorp.com/terraform/intro
84 |
85 | **IMAGE SUMMARY**
86 | - STEP 1: Write your configuration file, this is where you will represent what your resources will look like.
87 | - STEP 2: **Terraform** plans your earlier defined resources based on the configuration file. Here Terraform reiterates the resource you want to provision, and summarises it, so you can review it before creating it.
88 | - STEP 3: The configuration file is applied. This is what initiates the creation of the defined resource inside the configuration file
89 | - **NOTE**: After the resources are applied and created, the copy of that operation is registered in the Terraform state file.
90 |
91 |
92 | ## WHAT IS A TERRAFORM STATE FILE?
93 | It is crucial to bear in mind that the Terraform State file is so **VERY IMPORTANT** that Terraform cannot function without it. Moving on, let us delve into what the state file is all about.
94 | - Terraform State File is a file that contains the summary/Metadata of any resource that has been created. It has the `.tfstate` extension. A typical file name could be `terraform.tfstate`
95 | - If you define a resource `A` in your Terraform configuration file called `example.tf` when you `apply`, Terraform will automatically document this resource creation in the State file.
96 |
97 | ## STATE AND CONFIGURATION FILE HANDLING IN TERRAFORM
98 | - Let us assume you go back to your Terraform configuration file `example.tf` where you define resource `A` to change it to B.
99 | - Terraform will compare what you have in the configuration file (which has now changed from `A` to `B`) with what exists in the Terraform State file, A.
100 | - When you run a `plan` on this configuration file, Terraform notices a difference and immediately assumes you now want to create B.
101 | - In essence, Terraform uses the State file as a reference to what you previously created, which is `A` in this case (remember it has the metadata to the most recent state of your configuration file), and compares it to what you now have inside your configuration file, `B`. It then shows you any differences detected from both versions.
102 | - Assuming you want to create another new configuration file with a resource, `JJJ`. Again Terraform will check the State file to determine if `JJJ` is there, if it's not, then it will show you that you are about to create `JJJ` in the plan summary.
103 | - State file Reference:
104 | - https://developer.hashicorp.com/terraform/language/state
105 | - https://developer.hashicorp.com/terraform/language/state/purpose
106 | - **NOTE**: When working with Terraform state file, **PLEASE DO NOT** push the state file to GitHub. Terraform state file contains your infrastructure in plain text, which means if you create a Database, its username and password will be available in plain text inside the State file. See how to manage State files in Production [HERE](https://developer.hashicorp.com/terraform/language/state/remote).
107 |
108 | ## USEFUL TERRAFORM COMMANDS
109 | To efficiently work with Terraform, specific commands are essential for executing various operations. Simply put, creating cloud resources is not possible without utilizing these Terraform commands. Here are some of the most frequently used ones.
110 | **NOTE**: Terraform commands will only work if you are running them in a directory that contains a configuration file, i.e. a file whose name ends with `.tf`.
111 |
112 | - `terraform init`: This command sets up your Terraform project. If your project uses a specific provider, such as AWS, running this command will download the necessary plugins for that provider, enabling Terraform to interact with it. If you add a new provider, like Azure, you’ll need to rerun `terraform init` to install the corresponding plugins.
113 | - `terraform plan`: This command generates a detailed preview of the infrastructure Terraform is about to create. After you define your desired infrastructure in a configuration file, running `terraform plan` provides a summary of what will be built, allowing you to review and verify the setup before proceeding.
114 | - `terraform apply`: This command executes the creation of the specified resources in your chosen provider, such as AWS. Before Terraform proceeds with creating the infrastructure, it prompts you to confirm by typing `yes`. Once confirmed, the resources are deployed.
115 | - `terraform validate`: This command checks the accuracy of your Terraform configuration files. It verifies whether the configuration is valid and properly structured, ensuring that the resources you’ve defined can be correctly interpreted and deployed by Terraform.
116 | - Terraform Command Reference: [Terraform CLI Commands](https://developer.hashicorp.com/terraform/cli/commands)
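
Putting these together, a typical run from a directory containing a `.tf` file looks roughly like the sketch below (`apply` will still ask you to confirm with `yes`):

```bash
terraform init      # download provider plugins (e.g. AWS)
terraform validate  # check that the configuration is valid and well-structured
terraform plan      # preview what will be created or changed
terraform apply     # create the resources after you confirm with "yes"
```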
117 |
--------------------------------------------------------------------------------
/12_apache_kafka/04-Kafka-Topic-and-Configurations.md:
--------------------------------------------------------------------------------
1 | # Kafka Topic and Configurations
2 |
3 | Here, we are going to be covering the basics of Kafka Topics and Configurations under the following titles:
4 |
5 | - [Kafka Topic](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#what-is-a-kafka-topic)
6 | - [How Kafka Writes Events](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#how-kafka-writes-events)
7 | - [Not a Queue](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#not-a-queue)
8 | - [What if the Notebook Gets Too Full?](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#what-if-the-notebook-gets-too-full)
9 | - [Kafka Topic Configuration](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#Kafka-Topic-Configuration)
10 | - [cleanup.policy](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#cleanuppolicy--what-to-do-with-old-pages)
11 | - [compression.type](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#compressiontype--How-to-Pack-Each-Note)
12 | - [default.replication.factor](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#defaultreplicationfactor--how-many-copies-of-the-notebook)
13 | - [file.delete.delay.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#filedeletedelayms--delay-before-erasing-a-page)
14 | - [flush.messages/flush.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#flushmessages--flushms--when-to-force-save)
15 | - [index.interval.bytes](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#indexintervalbytes--how-often-to-add-page-markers)
16 | - [max.message.bytes](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#maxmessagebytes--max-size-of-one-note)
17 | - [max.compaction.lag.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#maxcompactionlagms--wait-time-before-cleaning)
18 | - [message.downconversion.enable](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#messagedownconversionenable--support-for-old-formats)
19 | - [message.timestamp](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#messagetimestamp--rules-for-time-differences)
20 | - [message.timestamp.type](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#messagetimestamptype--when-was-the-note-written)
21 | - [min.cleanable.dirty.ratio](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#mincleanabledirtyratio--when-to-clean)
22 | - [min.compaction.lag.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#mincompactionlagms--wait-before-allowing-cleanup)
23 | - [min.insync.replicas](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#mininsyncreplicas--minimum-notebooks-that-must-be-updated)
24 | - [num.partitions](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#numpartitions--how-many-sections-in-the-notebook)
25 | - [preallocate](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#preallocate--reserve-space-for-pages)
26 | - [retention.bytes](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#retentionbytes--max-size-of-notebook)
27 | - [retention.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#retentionms--max-time-to-keep-notes)
28 | - [segment.bytes](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#segmentbytes--how-big-each-notebook-file-is)
29 | - [segment.index.bytes](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#segmentindexbytes--how-big-is-the-page-index)
30 | - [segment.jitter.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#segmentjitterms--add-random-delay-to-page-rotation)
31 | - [segment.ms](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#segmentms--max-time-for-one-notebook-file)
32 | - [unclean.leader.election.enable](https://github.com/coredataengineers/CDE-BOOTCAMP/blob/main/12_apache_kafka/04-Kafka-Topic-and-Configurations.md#uncleanleaderelectionenable--risky-recovery-allowed)
33 |
34 | ## Kafka Topic
35 |
36 | A Kafka topic is like a named folder where messages are stored.
37 | Producers send messages into a topic, and consumers read messages from it.
38 | Topics live inside Kafka clusters, which are grouped into larger environments.
39 |
40 | Imagine you have a notebook where you write down everything that happens in your day: every single event, from waking up to going to bed.
41 |
42 | That’s exactly what a Kafka topic is:
43 |
44 | A place where Kafka writes down every event, in the order it happened — like a diary.
45 |
46 | ### How Kafka Writes Events
47 |
48 | Every time something happens, like a temperature sensor sending a new reading, Kafka writes it as a new line in the notebook.
49 | It never erases or overwrites old lines. It just keeps adding new ones at the bottom.
50 |
51 |
52 | Each line (or message) in this notebook has:
53 |
54 | * Key – Who or what it's about (e.g., “Sensor 12”)
55 | * Value – What happened (e.g., “Temperature is 23°C”)
56 | * Timestamp – When it happened (e.g., “9:35 AM”)
57 |
58 | ### Not a Queue
59 | Kafka topics aren’t like queues, where once you read a message, it's gone.
60 |
61 | Instead:
62 |
63 | * Everyone gets to read the same notebook
64 | * Messages stay there for as long as you want
65 | * If someone needs to read it again (maybe they missed a part), they can go back and re-read old pages
66 |
67 | This makes Kafka great for apps that need to:
68 |
69 | * Process data at different speeds
70 | * Recover from a crash
71 | * Or replay past events
72 |
73 | ### What if the Notebook Gets Too Full?
74 | Kafka lets you control how long you keep the notebook pages:
75 |
76 | * **Retention:** Keep data for 1 day, 7 days, or forever
77 | * **Compaction:** If you only care about the latest info, Kafka can clean up old entries with the same key
78 |
79 | Example: You only want the latest location of a delivery truck. Kafka can remove old locations and keep just the newest one.
80 |
81 |
82 |
83 | ## Kafka Topic Configuration: Configuring Your Kafka Notebook
84 | Imagine you’re using a shared digital notebook to track important events (like messages from a sensor or app). Now, you want to customize how this notebook works.
85 |
86 |
87 | Think of each setting below as a notebook rule or behavior switch you can adjust.
88 |
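Before going through the individual settings, it may help to see where they are applied. Here is a hedged sketch using the standard Kafka CLI tools; the broker address, topic name, and chosen values are assumptions for illustration:

```bash
# Create a topic and override some of the settings discussed below per-topic
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic sensor-readings \
  --partitions 3 --replication-factor 3 \
  --config cleanup.policy=compact --config retention.ms=604800000

# Later, change a setting on the existing topic (where the setting is editable)
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name sensor-readings \
  --add-config max.message.bytes=2097152
```
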
89 | ### `cleanup.policy` – What to Do With Old Pages
90 |
91 | Do you want to delete old notes or keep only the latest version of each key?
92 |
93 | * `delete`: Throw away old pages after a while.
94 | * `compact`: Clean the notebook but keep only the latest info per key.
95 |
96 | NOTE: You can't switch a topic straight between `compact` and `delete`; you move it to the combined `compact,delete` policy first, and then to the target policy.
97 |
98 | * Default setting: delete
99 | * Is this configuration editable after a Topic has been created?: Yes
100 |
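As a sketch of that two-step move, assuming the confluent-kafka `AdminClient` and a hypothetical topic. One caveat worth a comment: `alter_configs()` replaces a topic's whole set of overridden configs, so in real use read the current overrides first (or use an incremental alter API where available).

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

def set_cleanup_policy(topic: str, policy: str) -> None:
    # alter_configs() overwrites the topic's existing overrides; keep that in mind.
    resource = ConfigResource(ConfigResource.Type.TOPIC, topic,
                              set_config={"cleanup.policy": policy})
    for _, future in admin.alter_configs([resource]).items():
        future.result()   # raises if the broker rejected the change

# Step through the combined policy instead of jumping straight to the target.
set_cleanup_policy("sensor-readings", "compact,delete")   # hypothetical topic
set_cleanup_policy("sensor-readings", "compact")
```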
101 |
102 | ### `compression.type` – How to Pack Each Note
103 |
104 | Do you let the sender decide how to pack notes (like zipping them)?
105 |
106 | * `producer`: Keep the sender's packing choice; the actual codec (`gzip`, `snappy`, `lz4`, `zstd`, or none) is therefore picked on the producer side.
107 |
108 | * Default setting: producer
109 | * Is this configuration editable after a Topic has been created?: No
110 |
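Because the default is `producer`, whatever codec you pick on the producer side is what ends up stored. A sketch of choosing one, assuming the confluent-kafka client (available codecs depend on how librdkafka was built):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "compression.type": "lz4",   # other producer-side options: gzip, snappy, zstd, none
})

producer.produce("sensor-readings", key="sensor-12", value="Temperature is 23C")
producer.flush()
```
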
111 | ### `default.replication.factor` – How Many Copies of the Notebook?
112 |
113 | How many copies of each notebook (partition) does the cluster keep? The count includes the original, so a factor of 3 means one leader copy plus two backups.
114 |
115 | More copies = safer, but more storage is used.
116 |
117 | * Default: 3
118 | * Is this configuration editable after a Topic has been created?: No
119 |
120 | ### `file.delete.delay.ms` – Delay Before Erasing a Page?
121 | After deciding to delete a page, how long should we wait before really removing it?
122 |
123 | * Default: 60 seconds
124 | * Is this configuration editable after a Topic has been created?: No
125 |
126 | ### `flush.messages / flush.ms` – When to Force Save?
127 |
128 | How many messages (or how much time) can pass before Kafka forces a write to disk? By default it never forces a flush and relies on the operating system's page cache and on replication for durability.
129 |
130 | * Default: Never
131 | * Is this configuration editable after a Topic has been created?: No
132 |
133 | ### `index.interval.bytes` – How Often to Add Page Markers?
134 |
135 | Every 4KB, add a sticky note so you can quickly jump to that spot later.
136 |
137 | * Default: 4096 bytes
138 | * Is this configuration editable after a Topic has been created?: No
139 |
140 | ### `max.message.bytes` – Max Size of One Note
141 |
142 | What’s the biggest message you can write on a single page?
143 |
144 | * Default: ~2MB
145 | * Is this configuration editable after a Topic has been created?: Yes
146 |
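If you want to check what a topic currently allows, here is a sketch of reading the setting back with the admin API, assuming the confluent-kafka `AdminClient` and the hypothetical `sensor-readings` topic:

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
resource = ConfigResource(ConfigResource.Type.TOPIC, "sensor-readings")

for _, future in admin.describe_configs([resource]).items():
    configs = future.result()             # dict of config name -> ConfigEntry
    entry = configs["max.message.bytes"]
    print(entry.name, entry.value)
```
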
147 | ### `max.compaction.lag.ms` – Wait Time Before Cleaning?
148 |
149 | An upper bound on cleaning delay: a message can stay uncompacted for at most this long before Kafka forces its segment to become eligible for compaction.
150 |
151 | * Default: Very long (practically unlimited)
152 | * Is this configuration editable after a Topic has been created?: Yes
153 |
154 | ### `message.downconversion.enable` – Support for Old Formats?
155 |
156 | Can older readers get converted versions of new notes?
157 |
158 | * Default: true
159 | * Is this configuration editable after a Topic has been created?: No
160 |
161 |
162 | ### `message.timestamp.*` – Rules for Time Differences
163 | How strict should we be if a note says it was written way in the future or past?
164 |
165 | * Default: Unlimited
166 | * Is this configuration editable after a Topic has been created?: Yes
167 |
168 | ### `message.timestamp.type` – When Was the Note Written?
169 |
170 | There are two modes:
171 | * `CreateTime`: Use the time the sender wrote it.
172 | * `LogAppendTime`: Use the time it got into Kafka.
173 |
175 | * Default: CreateTime
176 | * Is this configuration editable after a Topic has been created?: Yes
177 |
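On the consumer side you can see which kind of timestamp a message carries. A sketch, assuming the confluent-kafka client and the hypothetical topic and group names:

```python
from confluent_kafka import Consumer, TIMESTAMP_CREATE_TIME, TIMESTAMP_LOG_APPEND_TIME

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "timestamp-demo",         # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sensor-readings"])   # hypothetical topic

msg = consumer.poll(5.0)
if msg is not None and not msg.error():
    ts_type, ts_ms = msg.timestamp()      # (type, milliseconds since epoch)
    if ts_type == TIMESTAMP_CREATE_TIME:
        print("time the sender wrote it:", ts_ms)
    elif ts_type == TIMESTAMP_LOG_APPEND_TIME:
        print("time the broker appended it:", ts_ms)
consumer.close()
```
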
178 | ### `min.cleanable.dirty.ratio` – When to Clean?
179 | Only start cleaning when at least 50% of the notebook is outdated junk.
180 |
181 | * Default: 0.5
182 | * Is this configuration editable after a Topic has been created?: No
183 |
184 | ### `min.compaction.lag.ms` – Wait Before Allowing Cleanup?
185 | Minimum age of a note before it's eligible to be cleaned.
186 |
187 | * Default: 0
188 | * Is this configuration editable after a Topic has been created?: Yes
189 |
190 | ### `min.insync.replicas` – Minimum Notebooks That Must Be Updated
191 | How many notebook copies must have the update before a write is accepted? This only takes effect when the producer asks for full acknowledgement (`acks=all`).
192 |
193 | * Default: 2
194 | * Is this configuration editable after a Topic has been created?: Yes (only 1 or 2)
195 |
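This setting only bites when the producer asks for full acknowledgement. A sketch of that pairing, assuming the confluent-kafka client (topic and broker are made up): with `acks=all`, a write only succeeds once at least `min.insync.replicas` copies have it; otherwise the producer is told the write failed rather than silently losing data.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                # wait for the in-sync replicas, not just the leader
    "enable.idempotence": True,   # optional, but a common pairing with acks=all
})

def on_delivery(err, msg):
    # err is set when, for example, too few replicas are in sync to satisfy the
    # topic's min.insync.replicas requirement.
    if err is not None:
        print("write rejected:", err)
    else:
        print("write confirmed at offset", msg.offset())

producer.produce("sensor-readings", key="sensor-12", value="23C",
                 on_delivery=on_delivery)
producer.flush()
```
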
196 | ### `num.partitions` – How Many Sections in the Notebook?
197 | Split your notebook into sections so multiple people can write/read faster.
198 |
199 | You can add sections later, but can’t remove them.
200 |
201 | * Default: 6
202 | * Is this configuration editable after a Topic has been created?: Yes (but only increase)
203 |
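A sketch of adding sections to an existing topic with the admin API, assuming the confluent-kafka `AdminClient` and the hypothetical `sensor-readings` topic. Note that the count you pass is the new total, and it can only go up, never down.

```python
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Grow "sensor-readings" to 12 partitions in total.
for _, future in admin.create_partitions([NewPartitions("sensor-readings", 12)]).items():
    future.result()   # raises if the broker refused (e.g. count lower than current)
```
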
204 | ### `preallocate` – Reserve Space for Pages?
205 | Should we block out space in advance for future pages?
206 |
207 | * Default: false
208 | * Is this configuration editable after a Topic has been created?: No
209 |
210 |
211 | ### `retention.bytes` – Max Size of Notebook?
212 | If a notebook section (partition) grows past this size, delete its oldest pages (only applies with the `delete` policy).
213 |
214 | * Default: Unlimited (-1)
215 | * Is this configuration editable after a Topic has been created?: Yes
216 |
217 | ### `retention.ms` – Max Time to Keep Notes?
218 | How long do you want to keep old notes before discarding?
219 |
220 | * Default: 7 days
221 | * Is this configuration editable after a Topic has been created?: Yes
222 |
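A sketch of tightening both retention settings above on an existing topic, assuming the confluent-kafka `AdminClient` and a hypothetical topic name (the same `alter_configs()` caveat about overwriting the topic's override set applies here too):

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "sensor-readings",                                  # hypothetical topic
    set_config={
        "retention.ms": str(3 * 24 * 60 * 60 * 1000),   # keep notes for 3 days
        "retention.bytes": str(1024 ** 3),              # or at most ~1 GB per partition
    },
)

for _, future in admin.alter_configs([resource]).items():
    future.result()
```
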
223 | ### `segment.bytes` – How Big Each Notebook File Is
224 | Once a file reaches this size, start a new notebook segment.
225 |
226 | * Default: 100MB
227 | * Is this configuration editable after a Topic has been created?: Yes
228 |
229 | ### `segment.index.bytes` – How Big is the Page Index?
230 | Size of the table of contents for fast lookups.
231 |
232 | * Default: 10MB
233 | * Is this configuration editable after a Topic has been created?: No
234 |
235 | ### `segment.jitter.ms` – Add Random Delay to Page Rotation?
236 | Add random delay so not all notebooks rotate at the same time.
237 |
238 | * Default: 0
239 | * Is this configuration editable after a Topic has been created?: No
240 |
241 | ### `segment.ms` – Max Time for One Notebook File
242 | Even if the file’s not full, start a new one after this time.
243 |
244 | * Default: 7 days
245 | * Is this configuration editable after a Topic has been created?: Yes
246 |
247 | ### `unclean.leader.election.enable` – Risky Recovery Allowed?
248 | If your main notebook keeper disappears, can we promote someone who might have missing pages? Enabling this favors availability but risks losing data.
249 |
250 | * Default: false
251 | * Is this configuration editable after a Topic has been created?: No
252 |
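To tie several of these settings together, here is a sketch of creating a topic with its cleanup, retention, segment and durability rules declared up front. It assumes the confluent-kafka `AdminClient`; the topic name, broker address and values are illustrative only.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

events = NewTopic(
    "daily-events",                                    # hypothetical topic name
    num_partitions=6,
    replication_factor=3,                              # needs at least 3 brokers
    config={
        "cleanup.policy": "delete",
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep pages for 7 days
        "segment.ms": str(24 * 60 * 60 * 1000),        # roll a new notebook file daily
        "segment.bytes": str(100 * 1024 * 1024),       # or once a file reaches 100 MB
        "min.insync.replicas": "2",
        "max.message.bytes": str(2 * 1024 * 1024),     # ~2 MB per note
    },
)

for topic, future in admin.create_topics([events]).items():
    future.result()
    print(f"created {topic}")
```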
--------------------------------------------------------------------------------