├── Data Governance Frameworks and Data Security Principles.md ├── SQL-PRACTICE-QUESTIONS.md ├── samplejson.json ├── ToolsAndTechnologiesInstallation Guide.md ├── minio_setup.md ├── Day1-WeekOneDayOneGude.md ├── Apache Airflow Operators Guide.md ├── CH02-2025-DE-Capstone-Project.md ├── Data Engineer Apache Kafka Producers.md ├── DATA-PROCESSING.md ├── CoreConceptsDataModeling.md ├── Tuesday-Kafka-Lab.md ├── scrapping.md ├── DATA-MODELING.md ├── introduction-to-Kafka.md ├── MySQLQueryExecutionPlans.md ├── AivenProjectVersionWeekOneProject.md ├── WeekOneProject.md ├── Day3-WeekOneDayThreeClass.md ├── Apache Kafka 101: Apache Kafka for Data Engineering Guide.md ├── Apache Kafka 102: Apache Kafka for Data Engineering Guide.md ├── SQL-Manual.md ├── Apache Airflow 101 Guide.md ├── ETL └── ETL-ELT.md ├── README.md ├── Apache Spark.md ├── GDPR & HIPAA Compliance Guide.md ├── Change Data Capture.md ├── PYTHON ├── chapter_1.md └── chapter_2.md ├── IntroductiontoCloudComputing.md └── data_lake.md /Data Governance Frameworks and Data Security Principles.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /SQL-PRACTICE-QUESTIONS.md: -------------------------------------------------------------------------------- 1 | # SQL Practice Questions 2 | 3 | ## **Section 1: SELECT + ORDER BY + LIMIT + OFFSET** 4 | 1. Retrieve all columns from `customer_info` and sort results **alphabetically** by `full_name`. 5 | 2. Get the top **5 most expensive products** from `products`. 6 | 3. Display products from row **6 to 10** when ordered by `price` in descending order. 7 | 8 | ## **Section 2: WHERE + CASE** 9 | 1. Find all customers located in **Kisumu**. 10 | 2. List products priced between **100 and 500**, and add a column called `price_category` that says `"Low"` if price < 300, `"Medium"` if 300–1000, else `"High"`. 11 | 3. Get all sales where `total_sales` is greater than **1000**, and show `"Big Sale"` or `"Small Sale"` using `CASE`. 12 | 13 | ## **Section 3: JOIN + ORDER BY** 14 | 1. Show `sales_id`, `product_name`, and `full_name` for every sale, ordered by `total_sales` in descending order. 15 | 2. List all products along with their customer's `location`, sorted by location then product name. 16 | 3. Display all sales with `product_name`, `price`, and `full_name`, ordered by `price` from highest to lowest. 17 | 18 | ## **Section 4: GROUP BY + HAVING** 19 | 1. Count how many products each customer owns, and only show customers with **more than 2 products**. 20 | 2. Find the total sales for each product and only include products with sales totaling **over 2000**. 21 | 3. Get the number of customers in each location, sorted by **customer count** in descending order. 
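A sample solution can make the expected answer format clearer. The sketch below answers Section 4, Question 1; the table and column names (`customer_info.customer_id`, `products.product_id`, etc.) are assumed from the wording of the questions and may differ in your actual schema.

```sql
-- Section 4, Q1: customers owning more than 2 products (schema assumed from the questions above)
SELECT c.full_name,
       COUNT(p.product_id) AS product_count
FROM customer_info c
JOIN products p ON p.customer_id = c.customer_id
GROUP BY c.full_name
HAVING COUNT(p.product_id) > 2;
```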
22 | -------------------------------------------------------------------------------- /samplejson.json: -------------------------------------------------------------------------------- 1 | [{"name":"Ivett Latehouse","position":"Computer Systems Analyst II","country":"Ukraine"}, 2 | {"name":"Demetria Ollet","position":"Web Designer III","country":"Indonesia"}, 3 | {"name":"Iolande Ornelas","position":"Clinical Specialist","country":"China"}, 4 | {"name":"Sheila Carty","position":"Web Developer III","country":"Central African Republic"}, 5 | {"name":"Darrelle Novotni","position":"Technical Writer","country":"Philippines"}, 6 | {"name":"Gianna de Glanville","position":"Junior Executive","country":"Portugal"}, 7 | {"name":"Estelle Staite","position":"Web Designer I","country":"Nigeria"}, 8 | {"name":"Rivi Elsmore","position":"Assistant Media Planner","country":"Philippines"}, 9 | {"name":"Sherilyn Paten","position":"Account Coordinator","country":"Pakistan"}, 10 | {"name":"Layton Sweynson","position":"Help Desk Operator","country":"Portugal"}, 11 | {"name":"Ezekiel Carvil","position":"Mechanical Systems Engineer","country":"Indonesia"}, 12 | {"name":"Prescott Dodsley","position":"Software Test Engineer IV","country":"China"}, 13 | {"name":"Durant Steanyng","position":"Chief Design Engineer","country":"United States"}, 14 | {"name":"Eirena Lorey","position":"Nurse","country":"Lithuania"}, 15 | {"name":"Blithe De Brett","position":"Recruiting Manager","country":"Russia"}, 16 | {"name":"Tedi Grogona","position":"Executive Secretary","country":"Japan"}, 17 | {"name":"Doro Swinburne","position":"Accountant I","country":"China"}, 18 | {"name":"Ernesta Cassam","position":"Physical Therapy Assistant","country":"China"}, 19 | {"name":"Gerardo Reide","position":"Automation Specialist I","country":"China"}, 20 | {"name":"Brigida Durgan","position":"Web Developer IV","country":"Armenia"}] -------------------------------------------------------------------------------- /ToolsAndTechnologiesInstallation Guide.md: -------------------------------------------------------------------------------- 1 | 2 | ### **Data Bases.** 3 | 1. **PostgreSQL Installation** 4 | To install PostgreSQL locally on your computer, visit [EnterpriseDB Downloads](https://www.enterprisedb.com/downloads/postgres-postgresql-downloads) and download the latest stable version compatible with your operating system. 5 | 6 | You can use this guide for assistance: [W3Schools PostgreSQL Installation Guide](https://www.w3schools.com/postgresql/postgresql_install.php) 7 | 8 | 2. **MySQL Installation** 9 | To install MySQL locally on your computer, visit [MySQL Downloads](https://dev.mysql.com/downloads/installer/) and download the latest stable version compatible with your operating system. 10 | 11 | You can use this guide for assistance: [W3Schools MySQL Installation Guide](https://www.w3schools.com/mysql/mysql_install_windows.asp) 12 | 13 | 3. **MongoDB Installation** 14 | To install MongoDB locally on your computer, visit [MongoDB Community Edition Downloads](https://www.mongodb.com/try/download/community) and download the latest stable version compatible with your operating system. 15 | 16 | You can use this guide for assistance: [MongoDB Installation Documentation](https://www.mongodb.com/docs/manual/installation/) 17 | 18 | 4. **MongoDB Compass** 19 | MongoDB Compass is a graphical user interface (GUI) for MongoDB. You can download it from [MongoDB Compass Downloads](https://www.mongodb.com/try/download/compass). 
20 | 21 | You can use this guide for assistance: [MongoDB Compass Documentation](https://www.mongodb.com/docs/compass/current/) 22 | 23 | 5. **MongoDB Atlas** 24 | MongoDB Atlas is a cloud database service for MongoDB. You can sign up and create a free cluster at [MongoDB Atlas](https://www.mongodb.com/atlas/database). 25 | 26 | You can use this guide for assistance: [MongoDB Atlas Getting Started](https://www.mongodb.com/docs/atlas/getting-started/) 27 | 28 | -------------------------------------------------------------------------------- /minio_setup.md: -------------------------------------------------------------------------------- 1 | # MinIO Setup Guide 2 | 3 | ### Step 1: Download MinIO Server Binary 4 | 5 | ```bash 6 | wget https://dl.min.io/server/minio/release/linux-amd64/minio 7 | chmod +x minio 8 | sudo mv minio /usr/local/bin/ 9 | ``` 10 | 11 | This downloads and installs the `minio` binary system-wide. 12 | 13 | ### Step 2: Create MinIO Data Directory 14 | 15 | ```bash 16 | mkdir -p ~/minio-data 17 | ``` 18 | 19 | This is where your data lake files (CSV, JSON, Parquet, etc.) will be stored. 20 | 21 | ### Step 3: Run MinIO Server 22 | 23 | ```bash 24 | export MINIO_ROOT_USER=minioadmin 25 | export MINIO_ROOT_PASSWORD=minioadmin 26 | 27 | minio server ~/minio-data --console-address ":9001" 28 | ``` 29 | 30 | - Web UI will be available at: http://localhost:9001 31 | - API endpoint: http://localhost:9000 32 | - Credentials: 33 | - Username: `minioadmin` 34 | - Password: `minioadmin` 35 | 36 | ### Step 4: Access MinIO Web Console 37 | 38 | 1. Go to: http://localhost:9001 39 | 2. Login with: 40 | - Username: `minioadmin` 41 | - Password: `minioadmin` 42 | 3. Create a bucket called `datalake`. 43 | 44 | ### Step 5: Upload Files 45 | 46 | Click **"Buckets" → "datalake" → "Upload"**, and upload sample files like: 47 | - `sales.csv` 48 | - `products.csv` 49 | - `customers.csv` 50 | 51 | ## Step 6: Access MinIO with Python 52 | 53 | Install dependencies: 54 | 55 | ```bash 56 | pip install pandas s3fs boto3 57 | ``` 58 | 59 | Python example: 60 | 61 | ```python 62 | import pandas as pd 63 | 64 | df = pd.read_csv( 65 | 's3://datalake/sales.csv', 66 | storage_options={ 67 | "key": "minioadmin", 68 | "secret": "minioadmin", 69 | "client_kwargs": { 70 | "endpoint_url": "http://localhost:9000" 71 | } 72 | } 73 | ) 74 | 75 | print(df.head()) 76 | ``` 77 | 78 | ## Optional Tools to Add 79 | 80 | | Tool | Purpose | 81 | |------|---------| 82 | | **Apache Spark** | To query data in MinIO as a Data Lake | 83 | | **Airflow** | To orchestrate ETL jobs | 84 | | **Streamlit** | To create dashboards using cleaned data | 85 | | **Parquet/Feather** | For storing large processed data efficiently | 86 | 87 | ## Mini Data Lake Project Idea (No Docker Needed) 88 | 89 | **Project Name:** "Mini Data Lake for Kenyan Retail Sales" 90 | 91 | **Data Flow:** 92 | 1. **Raw Data** → Upload to MinIO in bucket `datalake` 93 | 2. **ETL** → Use Python to clean and join data 94 | 3. **Save Output** → Store clean output back to MinIO as `clean_data.parquet` 95 | 4. 
**Analytics** → Use Power BI, Jupyter, or Streamlit
96 | 
--------------------------------------------------------------------------------
/Day1-WeekOneDayOneGude.md:
--------------------------------------------------------------------------------
1 | ## **Week 1, Day 1: Introduction to Data Engineering: Fundamentals, Tools, and Practices.**
2 | 
3 | ### Objectives
4 | **By the end of the lesson, students will:**
5 | - Understand what data engineering is and its role in data pipelines.
6 | - Learn about the tools commonly used in data engineering.
7 | - Gain insight into best practices for building and maintaining reliable data systems.
8 | - Be familiar with real-world examples of data engineering applications.
9 | 
10 | ---
11 | 
12 | ### 1. Introduction
13 | - Overview of the role of data engineers in the data lifecycle.
14 | - Key differences between data engineering and data analytics.
15 | 
16 | ---
17 | 
18 | ### 2. What Is Data Engineering?
19 | **Definition:**
20 | The discipline of designing and building systems for collecting, storing, and analyzing data at scale.
21 | 
22 | #### Core Concepts:
23 | - Data pipelines.
24 | - ETL (Extract, Transform, Load) processes.
25 | - Data warehouses and data lakes.
26 | - Scalability and reliability.
27 | 
28 | ---
29 | 
30 | ### 3. Importance of Data Engineering
31 | - Enabling analytics and machine learning through robust infrastructure.
32 | - Streamlining access to clean, consistent data.
33 | 
34 | #### Real-World Applications:
35 | - Powering recommendation systems.
36 | - Supporting real-time analytics in e-commerce and finance.
37 | 
38 | ---
39 | 
40 | ### 4. Tools for Data Engineering and Their Functions
41 | 
42 | #### Data Storage:
43 | - Relational databases (e.g., PostgreSQL, MySQL).
44 | - Cloud storage solutions (e.g., AWS S3, Google Cloud Storage).
45 | 
46 | #### Data Processing:
47 | - Batch processing tools (e.g., Apache Spark, Hadoop).
48 | - Streaming tools (e.g., Kafka, Flink).
49 | 
50 | #### Workflow Orchestration:
51 | - Airflow, Prefect.
52 | 
53 | #### ETL Tools:
54 | - Informatica, Talend, dbt (data build tool).
55 | 
56 | #### Programming Languages:
57 | - Python, Scala, SQL.
58 | 
59 | ---
60 | 
61 | ### 5. Practical Case Study
62 | #### Design and Build a Simple Data Pipeline:
63 | **Example:**
64 | Ingest data from an API, store it in a database, and transform it for reporting.
65 | 
66 | ---
67 | 
68 | ### 6. Developing a Data Engineering Mindset
69 | #### Key Practices:
70 | - Prioritizing scalability and maintainability.
71 | - Building with automation in mind.
72 | - Documenting and monitoring systems effectively.
73 | 
74 | #### Critical Thinking:
75 | - Anticipating edge cases.
76 | - Debugging complex pipelines.
77 | 
78 | ---
79 | 
80 | ### 7. Recap and Q&A
81 | - Summary of key takeaways.
82 | - Open discussion to address questions and explore additional use cases.
--------------------------------------------------------------------------------
/Apache Airflow Operators Guide.md:
--------------------------------------------------------------------------------
1 | ### **1. Bash & Python Operators**
2 | 
3 | - `BashOperator` - Executes a bash command.
4 | Example:
5 | 
6 | ```bash
7 | source envname/bin/activate
8 | ```
9 | 
10 | - `PythonOperator` - Runs a Python function.
11 | - `BranchPythonOperator` - Executes one of multiple Python functions based on logic.
12 | 
13 | ### **2.
SQL & Database Operators** 14 | - `MySqlOperator` - Executes SQL queries in MySQL. 15 | - `PostgresOperator` - Executes SQL queries in PostgreSQL. 16 | - `SqliteOperator` - Executes SQL queries in SQLite. 17 | - `MSSqlOperator` - Executes SQL queries in MS SQL Server. 18 | - `OracleOperator` - Executes SQL queries in Oracle. 19 | - `SnowflakeOperator` - Executes SQL queries in Snowflake. 20 | - `BigQueryOperator` - Runs queries in Google BigQuery. 21 | - `RedshiftSQLOperator` - Runs SQL commands in Amazon Redshift. 22 | 23 | ### **3. File Transfer & Storage Operators** 24 | - `S3FileTransformOperator` - Processes files stored in Amazon S3. 25 | - `S3ToRedshiftOperator` - Loads data from S3 to Redshift. 26 | - `GCSToBigQueryOperator` - Loads files from Google Cloud Storage to BigQuery. 27 | - `FTPOperator` - Transfers files via FTP. 28 | - `FileSensor` - Waits for a file to appear in a directory. 29 | 30 | ### **4. Data Processing & ETL Operators** 31 | - `SparkSubmitOperator` - Submits a Spark job. 32 | - `DataProcPySparkOperator` - Runs PySpark jobs on Google Dataproc. 33 | - `HiveOperator` - Runs Hive queries. 34 | - `DruidOperator` - Submits queries to Apache Druid. 35 | - `PrestoOperator` - Runs Presto SQL queries. 36 | 37 | ### **5. AWS Operators** 38 | - `S3ToSnowflakeOperator` - Loads S3 data into Snowflake. 39 | - `DynamoDBToS3Operator` - Copies DynamoDB data to S3. 40 | - `EMRCreateJobFlowOperator` - Starts an EMR cluster. 41 | - `LambdaInvokeFunctionOperator` - Calls an AWS Lambda function. 42 | 43 | ### **6. Google Cloud Operators** 44 | - `BigQueryCheckOperator` - Checks data in BigQuery. 45 | - `DataflowTemplateOperator` - Runs a Google Cloud Dataflow job. 46 | - `GCSCreateBucketOperator` - Creates a Google Cloud Storage bucket. 47 | 48 | ### **7. Kubernetes & Docker Operators** 49 | - `KubernetesPodOperator` - Runs a task inside a Kubernetes pod. 50 | - `DockerOperator` - Runs a Docker container. 51 | 52 | ### **8. Email & Notification Operators** 53 | - `EmailOperator` - Sends an email. 54 | - `SlackAPIPostOperator` - Sends messages to Slack. 55 | 56 | ### **9. Sensors (Wait for Events)** 57 | - `HttpSensor` - Waits for an HTTP endpoint response. 58 | - `S3KeySensor` - Waits for an object to appear in S3. 59 | - `HdfsSensor` - Waits for a file to appear in HDFS. 60 | - `ExternalTaskSensor` - Waits for another DAG task to complete. 61 | 62 | ### **10. Miscellaneous Operators** 63 | - `DummyOperator` - A placeholder for dependencies. 64 | - `HttpOperator` - Calls an HTTP endpoint. 65 | -------------------------------------------------------------------------------- /CH02-2025-DE-Capstone-Project.md: -------------------------------------------------------------------------------- 1 | ### **CH02-2025-Data Engineering Capstone Project: Build a Data Platform for Analyzing Kenya’s Food Prices and Inflation Trends.** 2 | 3 | **Domain**: Public Data || Economics || Agriculture. 4 | 5 | **Data Availability**: Easy – Publicly available from official government sources 6 | 7 | --- 8 | #### **🧩 Project Brief** 9 | 10 | Your team has been contracted by a government think tank to build a **data platform** that tracks food prices across Kenyan counties, detects inflation patterns, and generates insights for consumers, farmers, and policymakers. 11 | 12 | You'll pull **real data** from public sources (see below), clean and model it, and build both **batch and near-real-time** pipelines for analysis and visualization. 
13 | 14 | --- 15 | 16 | ### **🗃️ Suggested Data Sources** 17 | - 🇰🇪 [Kenya National Bureau of Statistics (KNBS)](https://www.knbs.or.ke/) 18 | - Monthly food price reports (PDF/Excel) 19 | - CPI & inflation datasets 20 | - [World Bank Open Data – Kenya](https://data.worldbank.org/country/kenya) 21 | - [FAOSTAT – Food & Agriculture Data](https://www.fao.org/faostat/) 22 | - [Kenya Open Data](https://kenya.opendataforafrica.org/) 23 | - County-level data on market prices, commodities, population, etc. 24 | 25 | --- 26 | 27 | ### **🔧 Requirements** 28 | 29 | #### **🏗️ Batch Pipeline** 30 | - Ingest food price data (monthly or weekly) from KNBS or Open Data portal. 31 | - Clean and normalize pricing formats (handle missing values, different currencies/units). 32 | - Use **Airflow** to automate downloads and ETL processes with **PySpark** or **Pandas**. 33 | - Create **fact/dimension tables** with a star schema (e.g., product, county, time). 34 | 35 | #### **🌐 Optional Web Scraping Add-on** 36 | - Scrape data from a public market pricing site or KNBS portal (if available). 37 | - Use **BeautifulSoup** or **Selenium** (optional, only if permitted). 38 | 39 | #### **📡 Streaming Component (Optional but Impressive)** 40 | - Simulate daily market price updates using a **Kafka producer** (e.g., tomatoes in Nairobi). 41 | - Consume and store using **Spark Streaming → Delta Lake/S3/PostgreSQL**. 42 | 43 | #### **📊 Visualization & Dashboarding** 44 | - Build an analytics dashboard with: 45 | - Price changes over time 46 | - Inflation heatmaps by county 47 | - Product comparison across regions 48 | - Tool: **Grafana or Power BI** 49 | 50 | #### **Data Governance** 51 | - Add metadata to tag data sources, update frequency, and validation steps. 52 | - Track data lineage through Airflow logs or a simple metadata table. 
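A lightweight way to meet this governance requirement is a single metadata table that every Airflow task writes a row to after it runs, as sketched below; the table and column names are illustrative, not prescribed by the brief.

```sql
-- Illustrative metadata/lineage table (PostgreSQL); adapt names and types as needed
CREATE TABLE pipeline_metadata (
    run_id        VARCHAR(64)  NOT NULL,  -- Airflow DAG run identifier
    source_name   VARCHAR(128) NOT NULL,  -- e.g. 'KNBS monthly food price report'
    source_url    TEXT,                   -- where the raw file was downloaded from
    ingested_at   TIMESTAMP    NOT NULL,  -- when the batch was loaded
    row_count     INTEGER,                -- rows loaded after cleaning
    validation_ok BOOLEAN,                -- did the null/unit/currency checks pass?
    output_table  VARCHAR(128)            -- where the cleaned data landed
);
```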
53 | 54 | --- 55 | 56 | ### **📁 Deliverables** 57 | - GitHub repo with pipeline code, Airflow DAGs, and documentation 58 | - Final dashboard (hosted or screenshots) 59 | - README with architecture diagram and data model 60 | - Presentation deck with insights and demo 61 | 62 | --- 63 | 64 | ### **🧠 Learning Outcomes** 65 | - Automate real-world data collection and transformation 66 | - Practice ETL, data modeling, and basic analytics 67 | - Work with government and open datasets 68 | - Communicate insights through dashboards and presentations 69 | 70 | -------------------------------------------------------------------------------- /Data Engineer Apache Kafka Producers.md: -------------------------------------------------------------------------------- 1 | ```python 2 | from confluent_kafka import Producer 3 | from faker import Faker 4 | import random 5 | import time 6 | import datetime 7 | import json 8 | 9 | # Kafka Configuration 10 | KAFKA_BROKER = "localhost:9092" # Change to match your Kafka setup 11 | TOPIC = "kenyan_users" 12 | 13 | # Create a Kafka Producer instance 14 | producer = Producer({'bootstrap.servers': KAFKA_BROKER}) 15 | 16 | # Create a Faker instance with British English locale 17 | fake = Faker('en_GB') 18 | 19 | def generate_kenyan_phone(): 20 | prefixes = ['0700', '0701', '0702', '0703', '0704', '0705', '0706', '0707', '0708', '0709', 21 | '0710', '0711', '0712', '0713', '0714', '0715', '0716', '0717', '0718', '0719', 22 | '0720', '0721', '0722', '0723', '0724', '0725', '0726', '0727', '0728', '0729', 23 | '0730', '0731', '0732', '0733', '0734', '0735', '0736', '0737', '0738', '0739', 24 | '0740', '0741', '0742', '0743', '0744', '0745', '0746', '0747', '0748', '0749', 25 | '0750', '0751', '0752', '0753', '0754', '0755', '0756', '0757', '0758', '0759', 26 | '0760', '0761', '0762', '0763', '0764', '0765', '0766', '0767', '0768', '0769', 27 | '0770', '0771', '0772', '0773', '0774', '0775', '0776', '0777', '0778', '0779', 28 | '0790', '0791', '0792', '0793', '0794', '0795', '0796', '0797', '0798', '0799'] 29 | 30 | prefix = random.choice(prefixes) 31 | suffix = ''.join(random.choices('0123456789', k=6)) 32 | return f"{prefix} {suffix}" 33 | 34 | def generate_kenyan_amount(): 35 | amount = random.randint(100, 100000) 36 | return f"KES {amount:,}" 37 | 38 | def generate_user(): 39 | name = fake.name() 40 | domain = random.choice(['gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com', 'ke-mail.com']) 41 | email = f"{name.lower().replace(' ', '.').replace('.', random.choice(['.','-','_']))}{random.randint(1, 99)}@{domain}" 42 | phone = generate_kenyan_phone() 43 | amount = generate_kenyan_amount() 44 | timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 45 | 46 | return { 47 | "timestamp": timestamp, 48 | "name": name, 49 | "email": email, 50 | "phone": phone, 51 | "amount": amount 52 | } 53 | 54 | # Callback function to confirm message delivery 55 | def delivery_report(err, msg): 56 | if err is not None: 57 | print(f"❌ Message delivery failed: {err}") 58 | else: 59 | print(f"✅ Message delivered to {msg.topic()} [{msg.partition()}]") 60 | 61 | def main(): 62 | print("Kafka Producer: Sending Kenyan user data to topic 'kenyan_users' every 5 seconds...") 63 | 64 | try: 65 | while True: 66 | user = generate_user() 67 | user_data = json.dumps(user) # Convert to JSON format 68 | 69 | # Send data to Kafka 70 | producer.produce(TOPIC, value=user_data, callback=delivery_report) 71 | producer.flush() # Ensure message is sent 72 | 73 | print(f"📤 Sent: {user_data}") 74 | 
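            # Note: flush() after every message blocks until delivery is confirmed, which is fine
            # for a demo sending one message every 5 seconds; for higher throughput, a common
            # alternative is calling producer.poll(0) here and flushing once on shutdown.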
75 | time.sleep(5) 76 | 77 | except KeyboardInterrupt: 78 | print("\n🚪 Stopping Kafka producer.") 79 | 80 | finally: 81 | producer.flush() 82 | 83 | if __name__ == "__main__": 84 | main() 85 | ``` 86 | -------------------------------------------------------------------------------- /DATA-PROCESSING.md: -------------------------------------------------------------------------------- 1 | # DATA PROCESSING 2 | 3 | Data processing refers to the collection, transformation, and organization of raw data into a meaningful format for analysis and decision-making. It is a core responsibility in data engineering workflows. 4 | 5 | ## TYPES OF DATA PROCESSING 6 | 7 | Data processing can be broadly categorized into two main types: 8 | 9 | 1. **Batch Data Processing** 10 | 2. **Streaming Data Processing (Real-time Processing)** 11 | 12 | --- 13 | 14 | ## Batch Data Processing 15 | 16 | Batch data processing is defined as processing large volumes of data at scheduled intervals (e.g., hourly, daily, weekly). 17 | 18 | For example, sales figures typically undergo batch processing, allowing businesses to use data visualization features like charts, graphs, and reports to derive value from data. Since a large volume of data is involved, the system will take time to process it. Processing the data in batches saves on computational resources. 19 | 20 | ### Characteristics 21 | 22 | - Data is collected over time and processed in fixed-size chunks. 23 | - Jobs run at predefined times (e.g., end-of-day reports). 24 | - High latency (delay between data collection and processing). 25 | - Efficient for large-scale, non-time-sensitive computations. 26 | 27 | ### When to Use Batch Processing? 28 | 29 | - When doing historical analysis (e.g., monthly sales reports). 30 | - During large-scale ETL (Extract, Transform, Load) jobs. 31 | - Cost-effective for processing massive datasets. 32 | - Perfect for non-real-time applications (e.g., billing systems, payroll processing). 33 | 34 | ### Examples 35 | 36 | - Generating monthly sales reports. 37 | - Data warehouse ETL jobs. 38 | 39 | ### Advantages 40 | 41 | - Efficient for large datasets. 42 | - Lower infrastructure cost (scheduled, not real-time). 43 | - Simpler to debug and manage. 44 | 45 | ### Disadvantages 46 | 47 | - Not real-time (latency). 48 | - Not suitable for time-sensitive data. 49 | 50 | ### Tools & Technologies 51 | 52 | - Apache Hadoop (MapReduce, HDFS). 53 | - Apache Spark (batch mode). 54 | - ETL tools: Apache Airflow. 55 | 56 | > You might prefer batch processing over real-time processing when accuracy is more important than speed. 57 | 58 | --- 59 | 60 | ## Streaming Data Processing (Real-time Processing) 61 | 62 | Streaming (real-time) processing means processing data as it is generated or received. 63 | 64 | ### How it Works 65 | 66 | Data is ingested continuously and processed immediately (or within milliseconds to seconds). This immediate processing requires low latency and quick response times, making it suitable for applications like monitoring systems and financial trading. 67 | 68 | ### Examples 69 | 70 | - Monitoring stock prices. 71 | - Real-time fraud detection. 72 | - IoT sensor data processing. 73 | 74 | ### Advantages 75 | 76 | - Real-time insights. 77 | - Better for event-driven use cases. 78 | - Immediate action or alert possible. 79 | 80 | ### Disadvantages 81 | 82 | - More complex to implement and maintain. 83 | - Higher cost due to always-on infrastructure. 84 | - Debugging is more difficult. 
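To make the streaming side concrete, here is a minimal sketch of a consumer that processes events as they arrive, using the same `confluent_kafka` library as the producer example in this repository; the broker address, topic name, and group id are assumptions you would adapt.

```python
from confluent_kafka import Consumer

# Assumed settings: match them to your own broker and topic
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'streaming-demo',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['kenyan_users'])

try:
    while True:
        msg = consumer.poll(1.0)   # wait up to 1 second for the next event
        if msg is None:
            continue               # nothing new yet; keep listening
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # React to each event immediately (the essence of stream processing)
        print(f"Processed event: {msg.value().decode('utf-8')}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```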
85 | 86 | ### Tools & Technologies 87 | 88 | - Apache Kafka (event streaming platform). 89 | - Apache Spark Streaming (micro-batch processing). 90 | 91 | --- 92 | 93 | ## Conclusion 94 | 95 | - **Use Batch Processing** for large-scale, historical data with no urgency. 96 | - **Use Streaming Processing** for real-time insights and event-driven applications. 97 | -------------------------------------------------------------------------------- /CoreConceptsDataModeling.md: -------------------------------------------------------------------------------- 1 | # Data Modeling 2 | 3 | ## Objectives 4 | By the end of this section, students will: 5 | - Understand the fundamentals of data modeling. 6 | - Learn the different types of data models and their uses. 7 | - Gain insight into designing effective data models for real-world applications. 8 | - Explore best practices for data modeling in various systems. 9 | 10 | --- 11 | 12 | ## 1. Introduction to Data Modeling 13 | - **Definition:** 14 | Data modeling is the process of designing and creating a visual representation (model) of a system’s data structures, which can be used for storing, organizing, and manipulating data in databases. 15 | 16 | - **Purpose:** 17 | To structure data in a way that supports the business processes, enhances data quality, and ensures scalability and performance. 18 | 19 | --- 20 | 21 | ## 2. Types of Data Models 22 | ### 1. **Conceptual Data Model:** 23 | - Focuses on high-level business requirements. 24 | - Describes entities, relationships, and their attributes. 25 | - Typically used for communicating with non-technical stakeholders. 26 | 27 | ### 2. **Logical Data Model:** 28 | - Provides a more detailed representation of data entities and their relationships. 29 | - Does not include physical details like indexes or storage locations. 30 | - Serves as a blueprint for physical data models. 31 | 32 | ### 3. **Physical Data Model:** 33 | - Describes how data is physically stored in a database. 34 | - Includes specific details like table structures, indexes, and storage paths. 35 | - Optimized for performance and efficient data retrieval. 36 | 37 | --- 38 | 39 | ## 3. Key Concepts in Data Modeling 40 | 41 | ### 1. **Entities and Attributes:** 42 | - **Entity:** An object or concept about which data is stored (e.g., Customer, Order). 43 | - **Attribute:** Characteristics or properties of an entity (e.g., Customer Name, Order Date). 44 | 45 | ### 2. **Relationships:** 46 | - Describes how entities are related to each other (e.g., One-to-many, Many-to-many). 47 | - Defined using primary keys and foreign keys. 48 | 49 | ### 3. **Normalization:** 50 | - The process of organizing data to minimize redundancy and dependency. 51 | - Involves breaking down large tables into smaller, more manageable ones. 52 | 53 | ### 4. **Denormalization:** 54 | - The process of combining tables to improve query performance. 55 | - Useful for read-heavy operations, but can lead to redundancy. 56 | 57 | --- 58 | 59 | ## 4. Best Practices in Data Modeling 60 | - **Consistency:** Ensure consistent naming conventions, data types, and attributes. 61 | - **Scalability:** Design models that can handle growing data volumes. 62 | - **Maintainability:** Keep models simple and easy to update. 63 | - **Performance:** Optimize models to balance speed and efficiency. 64 | - **Documentation:** Document your data models for future use and clarity. 65 | 66 | --- 67 | 68 | ## 5. 
Tools for Data Modeling 69 | - **ERD Tools (Entity-Relationship Diagrams):** 70 | - Microsoft Visio, Lucidchart, Draw.io. 71 | - **Database Design Tools:** 72 | - MySQL Workbench, Oracle SQL Developer, dbForge Studio. 73 | - **Data Modeling Tools:** 74 | - Erwin Data Modeler, IBM InfoSphere Data Architect, PowerDesigner. 75 | 76 | --- 77 | 78 | ## 6. Practical Example 79 | - **Scenario:** 80 | Designing a data model for an e-commerce platform to manage products, customers, and orders. 81 | 82 | ### Steps: 83 | 1. Identify key entities (e.g., Customer, Order, Product). 84 | 2. Define relationships (e.g., one customer can place many orders). 85 | 3. Normalize the model to eliminate data redundancy. 86 | 4. Create an ERD diagram to visualize the model. 87 | 5. Implement the model in a database. 88 | 89 | --- 90 | 91 | ## 7. Recap and Q&A 92 | - Review the types of data models and their purposes. 93 | - Open discussion for questions and exploration of real-world data modeling challenges. 94 | -------------------------------------------------------------------------------- /Tuesday-Kafka-Lab.md: -------------------------------------------------------------------------------- 1 | # Tuesday: Kafka Producer and Consumer (Lab) 2 | 3 | ## Objectives 4 | 5 | Set up a basic Kafka environment with: 6 | 7 | * A **Producer** that sends messages to a topic 8 | * A **Topic** to hold the messages 9 | * A **Consumer** that reads messages from the topic 10 | 11 | ## Step-by-Step Setup 12 | 13 | ### 1. Ensure Java is Installed 14 | 15 | Kafka requires Java. 16 | 17 | ```bash 18 | java -version 19 | ``` 20 | 21 | If not installed: 22 | 23 | ```bash 24 | sudo apt install openjdk-11-jdk -y 25 | ``` 26 | 27 | ### 2. Download and Extract Kafka 28 | 29 | Kafka is distributed as a compressed archive (`.tgz` file). Here’s how to download and unpack it: 30 | 31 | ```bash 32 | wget https://downloads.apache.org/kafka/3.6.1/kafka_2.13-3.6.1.tgz 33 | ``` 34 | 35 | `wget` is a command-line tool used to download files from the web. 36 | 37 | This command fetches Kafka version **3.6.1** built for **Scala 2.13** (Kafka is written in Scala and Java). 38 | 39 | You can always check for the latest version at [Kafka downloads](https://kafka.apache.org/downloads). 40 | 41 | Next, extract the downloaded archive: 42 | 43 | ```bash 44 | tar -xzf kafka_2.13-3.6.1.tgz 45 | ``` 46 | 47 | * `tar` is used to extract files. 48 | * `-xzf` means: 49 | 50 | * `x`: extract 51 | * `z`: decompress gzip 52 | * `f`: specify the file name 53 | 54 | Then, delete the downloaded archive to free up space: 55 | 56 | ```bash 57 | rm kafka_2.13-3.6.1.tgz 58 | ``` 59 | 60 | Rename the extracted folder to something simpler: 61 | 62 | ```bash 63 | mv kafka_2.13-3.6.1 kafka 64 | ``` 65 | 66 | Change into the Kafka directory: 67 | 68 | ```bash 69 | cd kafka 70 | ``` 71 | 72 | This is Kafka’s **home directory** where all config files, scripts, and binaries live. 73 | 74 | You’ll run all Kafka-related commands from here. 75 | 76 | #### Kafka Folder Structure Overview 77 | 78 | | Folder / File | Purpose | 79 | | ------------------- | ------------------------------------------------ | 80 | | `bin/` | Scripts to start Kafka, ZooKeeper, and CLI tools | 81 | | `config/` | Configuration files for Kafka and ZooKeeper | 82 | | `libs/` | Kafka and ZooKeeper libraries | 83 | | `logs/` | Kafka server logs during runtime | 84 | | `LICENSE`, `NOTICE` | Legal/license information | 85 | 86 | Kafka is now ready to be started. 87 | 88 | ### 3. 
Start ZooKeeper and Kafka 89 | 90 | Start ZooKeeper (in one terminal): 91 | 92 | ```bash 93 | bin/zookeeper-server-start.sh config/zookeeper.properties 94 | ``` 95 | 96 | Start Kafka (in another terminal): 97 | 98 | ```bash 99 | bin/kafka-server-start.sh config/server.properties 100 | ``` 101 | ### 4. Create a Kafka Topic 102 | 103 | ```bash 104 | bin/kafka-topics.sh --create \ 105 | --topic test-topic \ 106 | --bootstrap-server localhost:9092 \ 107 | --partitions 1 \ 108 | --replication-factor 1 109 | ``` 110 | 111 | - **Partitions**: Allow parallelism and scalability. 112 | - **Replication factor**: Number of copies for fault tolerance. 113 | 114 | ### 5. Start a Producer 115 | 116 | ```bash 117 | bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092 118 | ``` 119 | 120 | Type messages and press Enter to send them to the topic. 121 | 122 | ### 6. Start a Consumer 123 | 124 | ```bash 125 | bin/kafka-console-consumer.sh --topic test-topic --from-beginning --bootstrap-server localhost:9092 126 | ``` 127 | 128 | Messages sent by the producer will appear here. 129 | 130 | ## Topic Management Commands 131 | 132 | **List topics** 133 | 134 | ```bash 135 | bin/kafka-topics.sh --list --bootstrap-server localhost:9092 136 | ``` 137 | 138 | **Describe a topic** 139 | 140 | ```bash 141 | bin/kafka-topics.sh --describe --topic test-topic --bootstrap-server localhost:9092 142 | ``` 143 | 144 | **Delete a topic** 145 | 146 | ```bash 147 | bin/kafka-topics.sh --delete --topic test-topic --bootstrap-server localhost:9092 148 | ``` 149 | 150 | **Add more partitions** 151 | 152 | ```bash 153 | bin/kafka-topics.sh --alter --topic test-topic --partitions 3 --bootstrap-server localhost:9092 154 | ``` 155 | 156 | > Note: You can **only increase** partitions, not reduce them. 157 | 158 | ## Summary 159 | 160 | * Kafka and ZooKeeper are up and running 161 | * Topic created 162 | * Messages produced and consumed 163 | * Topics managed using CLI tools 164 | 165 | 166 | -------------------------------------------------------------------------------- /scrapping.md: -------------------------------------------------------------------------------- 1 | ### **1. What is Beautiful Soup?** 2 | 3 | **Definition:** 4 | 5 | Beautiful Soup is a Python library that turns messy HTML into a structured object, so you can easily search, navigate, and extract data from web pages. When you download a web page (as text), it usually looks like a big, ugly string with lots of tags. 6 | 7 | Beautiful Soup: 8 | 9 | - Parses that HTML (understands the structure) 10 | - Builds a tree of tags (like a family tree of elements: ` → →

`<html>` → `<body>` → `<p>` etc.)
11 | - Gives you friendly tools to:
12 |   - Find tags (e.g. `<p>`, `<a>`, `<div>`…)
13 |   - Get their text
14 |   - Read their attributes (`href`, `class`, `id`, etc.)
15 | 
16 | ---
17 | 
18 | ### **2. Why do we use Beautiful Soup?**
19 | 
20 | You use Beautiful Soup when you want to:
21 | 
22 | - **Scrape data from websites**
23 |   - *Example:* Get all product names and prices from an online store page.
24 | - **Clean and analyze HTML**
25 |   - *Example:* Extract only the article text from a news page.
26 | - **Automate manual tasks**
27 |   - *Example:* Collect all links from a set of pages instead of copying them by hand.
28 | 
29 | Without Beautiful Soup, you would have to:
30 | 
31 | - Manually search through raw HTML strings
32 | - Write a lot of complicated regular expressions
33 |   → This is hard and very error-prone.
34 | 
35 | With Beautiful Soup, you write code like:
36 | 
37 | - "Find all `<a>` tags"
38 | - "Get the text inside each `<p>`"
39 | 
40 | Much cleaner and easier to understand.
41 | 
42 | ---
43 | 
44 | ### **3. How Beautiful Soup fits in a scraping workflow**
45 | 
46 | A typical web scraping workflow looks like this:
47 | 
48 | 1. **Use `requests` (or another HTTP library) to download the web page:**
49 | 
50 |    ```python
51 |    import requests
52 | 
53 |    response = requests.get("https://example.com")
54 |    html = response.text
55 |    ```
56 | 
57 | 2. **Use Beautiful Soup to parse the HTML:**
58 | 
59 |    ```python
60 |    from bs4 import BeautifulSoup
61 | 
62 |    soup = BeautifulSoup(html, "html.parser")
63 |    ```
64 | 
65 | 3. **Use Beautiful Soup to extract what you need:**
66 | 
67 |    ```python
68 |    title = soup.find("h1").get_text(strip=True)
69 |    links = [a.get("href") for a in soup.find_all("a")]
70 |    ```
71 | 
72 | So:
73 | 
74 | - `requests` → gets the HTML
75 | - `BeautifulSoup` → understands and extracts from the HTML
76 | 
77 | ---
78 | 
79 | ### **4. Key concepts in Beautiful Soup**
80 | 
81 | When teaching, focus on these core ideas:
82 | 
83 | #### The `soup` object
84 | 
85 | - Created with `BeautifulSoup(html, "html.parser")`
86 | - Represents the entire HTML document
87 | 
88 | #### Tags
89 | 
90 | - Elements like `<p>`, `<a>`, `<div>` are called *tags*
91 | - You can access them like `soup.p`, `soup.find("a")`, etc.
92 | 
93 | #### Text
94 | 
95 | - Use `.get_text()` or `.text` to get the text inside a tag
96 | 
97 | #### Attributes
98 | 
99 | - Things like `href`, `class`, `id` are attributes of a tag
100 | - Access them with `tag['href']` or `tag.get('href')`
101 | 
102 | #### Search methods
103 | 
104 | - `.find()` → first matching element
105 | - `.find_all()` → all matching elements
106 | - `.select()` / `.select_one()` → CSS selector style
107 | 
108 | ---
109 | 
110 | ### **5. Tiny "hello world" example**
111 | 
112 | You can show this as the first demo:
113 | 
114 | ```python
115 | from bs4 import BeautifulSoup
116 | 
117 | html = """
118 | <html>
119 | <body>
120 | <h1>My Website</h1>
121 | <p class="intro">Welcome to my site.</p>
122 | <a href="https://example.com">Click here</a>
123 | </body>
124 | </html>
125 | """
126 | 
127 | # Create the soup object
128 | soup = BeautifulSoup(html, "html.parser")
129 | 
130 | # Get the title (h1 text)
131 | title = soup.find("h1").get_text(strip=True)
132 | 
133 | # Get the paragraph text
134 | intro = soup.find("p", class_="intro").get_text(strip=True)
135 | 
136 | # Get the link and its URL
137 | link_tag = soup.find("a")
138 | link_text = link_tag.get_text(strip=True)
139 | link_url = link_tag.get("href")
140 | 
141 | print("Title:", title)
142 | print("Intro:", intro)
143 | print("Link text:", link_text)
144 | print("Link URL:", link_url)
145 | ```
146 | 
147 | **What this demo shows:**
148 | 
149 | - How to create a `soup` object
150 | - How to find tags
151 | - How to get text and attributes
152 | 
153 | ---
154 | 
155 | ### **6. One-sentence summary for students**
156 | 
157 | Beautiful Soup is a Python tool that makes it easy to read HTML and pull out just the data you need from web pages.
158 | 
159 | Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#api-documentation
160 | 
--------------------------------------------------------------------------------
/DATA-MODELING.md:
--------------------------------------------------------------------------------
1 | # **Introduction to Data Modeling**
2 | 
3 | ### **What is a Model?**
4 | 
5 | A **model** is a structure or dimension representing data.
6 | 
7 | ### **What is Data Modeling?**
8 | 
9 | The process of designing how data will be **organized**, **stored**, and **accessed**.
10 | 
11 | * It provides a **visual representation** (tables, rows, columns).
12 | * It acts as a **blueprint** for database design.
13 | 
14 | ### **Purpose of Data Modeling**
15 | 
16 | * It ensures **accuracy**, **consistency**, and **integrity** of data.
17 | * It optimizes **database performance**.
18 | * It facilitates **communication** between technical and business stakeholders.
19 | 
20 | ---
21 | 
22 | # **Types of Data Modeling**
23 | 
24 | ## 1. Conceptual Data Modeling
25 | 
26 | * This represents a high-level overview of business/domain data without going into details.
27 | * No technical details, no attributes or data types.
28 | * It focuses on **entities** and their relationships.
29 | * Example (Hospital Domain):
30 | 
31 |   * Entities: Patient, Doctor, Appointment.
32 | 
33 | ---
34 | 
35 | ## 2. Logical Data Modeling
36 | 
37 | * This model describes **data elements and relationships** in detail (without considering physical storage).
38 | * It defines **attributes** for each entity.
39 | * Example:
40 | 
41 |   * Doctor: doctor\_id, name, specialization.
42 |   * Appointment: start\_time, end\_time, doctor\_id (FK).
43 | 
44 | ---
45 | 
46 | ## 3. Physical Data Modeling
47 | 
48 | * This model defines **how data is physically stored** in a database.
49 | * It includes:
50 | 
51 |   * **Data types** (e.g., INT, VARCHAR).
52 |   * **Constraints** (e.g., PRIMARY KEY, FOREIGN KEY).
53 |   * **Indexes** for performance.
54 | 55 | Example Schema: 56 | 57 | ### Doctor Table 58 | 59 | * doctor\_id (INT) – Primary Key 60 | * doctor\_name (VARCHAR) 61 | 62 | ### Customers Table 63 | 64 | * customer\_id (INT) – Primary Key 65 | * name (VARCHAR) 66 | * age (INT) 67 | * email (VARCHAR) 68 | * DOB (DATE) 69 | * phone (VARCHAR) 70 | 71 | ### Account Table 72 | 73 | * account\_id (INT) – Primary Key 74 | * balance (INT) 75 | * dr (INT) (Debit) 76 | * cr (INT) (Credit) 77 | * acc\_type (VARCHAR) 78 | * customer\_id (INT) – Foreign Key 79 | 80 | ### Branch Table 81 | 82 | * branch\_id (INT) – Primary Key 83 | * location (VARCHAR) 84 | 85 | --- 86 | 87 | ## 4. Entity-Relationship Data Modeling 88 | 89 | * **Entity**: An object or concept representing data (e.g., Patient, Doctor, Appointment). 90 | * **Attributes**: Properties of an entity (e.g., patient\_id, doctor\_name). 91 | * **ERD (Entity-Relationship Diagram)**: Visual diagram showing entities and relationships. 92 | 93 | ### Relationship Types 94 | 95 | | Type | Example | 96 | | ------------ | ------------------------------------------------------------------------------ | 97 | | One-to-One | A Doctor has one doctor\_id, and each doctor\_id belongs to one Doctor. | 98 | | One-to-Many | A Doctor has many Patients, but a Patient has only one primary Doctor. | 99 | | Many-to-Many | A Patient can have many Appointments, and a Doctor can have many Appointments. | 100 | 101 | Detailed Examples: 102 | 103 | * **One-to-Many (1\:M):** 104 | 105 | * "A Doctor has many Patients, but a Patient has only one primary Doctor." 106 | * Implementation: Patients table has a foreign key (doctor\_id). 107 | 108 | * **Many-to-Many (M\:N):** 109 | 110 | * "A Patient can book many Appointments, and a Doctor can handle many Appointments." 111 | * Implementation: A junction table (Appointment) with patient\_id and doctor\_id as foreign keys. 112 | 113 | * **One-to-One (1:1):** 114 | 115 | * "A Doctor has exactly one unique doctor\_id." 116 | * Implementation: doctor\_id is both a primary key and unique. 117 | 118 | --- 119 | 120 | ## 5. Dimensional Data Modeling 121 | 122 | * Dimensional Data Modeling is primarily used in **data warehouses** for analytical purposes. 123 | * it organizes data into **fact tables** and **dimension tables**. 124 | 125 | ### Key Components 126 | 127 | | Component | Description | Example | 128 | | --------------- | ----------------------------------------- | --------------------------------- | 129 | | Fact Table | Numerical/measurable data (metrics/KPIs). | sales\_amount, quantity\_sold | 130 | | Dimension Table | Descriptive context for facts. | customer\_name, product\_category | 131 | 132 | Definitions: 133 | 134 | * **Dimensions**: Describe business entities (name, age, product category). 135 | * **Measures**: Quantitative facts (e.g., number of products sold). 136 | 137 | --- 138 | 139 | # **Summary** 140 | 141 | * **Conceptual**: High-level overview (what data is important?). 142 | * **Logical**: Detailed attributes and relationships (how is data related?). 143 | * **Physical**: Technical implementation (how is data stored?). 144 | * **ER Modeling**: Graphical view of entities and relationships. 145 | * **Dimensional**: Optimized for analytics (facts and dimensions). 
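To tie the physical and dimensional sections together, a minimal star schema for the sales example above might look like the following; the names and types are illustrative only.

```sql
-- Illustrative star schema: two dimension tables and one fact table
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    age           INT
);

CREATE TABLE dim_product (
    product_id       INT PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50)
);

CREATE TABLE fact_sales (
    sale_id       INT PRIMARY KEY,
    customer_id   INT,              -- FK to dim_customer
    product_id    INT,              -- FK to dim_product
    sales_amount  DECIMAL(12,2),    -- measure
    quantity_sold INT,              -- measure
    FOREIGN KEY (customer_id) REFERENCES dim_customer(customer_id),
    FOREIGN KEY (product_id)  REFERENCES dim_product(product_id)
);
```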
146 | 147 | -------------------------------------------------------------------------------- /introduction-to-Kafka.md: -------------------------------------------------------------------------------- 1 | # Introduction to Data Streaming and Apache Kafka 2 | 3 | ## **What is Streaming Data?** 4 | **Streaming data** (also called **event stream processing**) is the **continuous flow of data** generated by various sources, processed in **real-time** to extract insights and trigger actions. 5 | 6 | ## **Characteristics of Real-Time Data Processing** 7 | 1. **Continuous flow** – Data is constantly generated with no "end." 8 | 2. **Real-time processing** – Instant analysis for timely insights (no batch delays). 9 | 3. **Event-driven architecture** – Systems react dynamically to individual events. 10 | 4. **Scalability & fault tolerance** – Handles high traffic and recovers from failures. 11 | 5. **Varied Data Sources** – Streaming data originates from sensors, logs, APIs, applications, mobile devices, and more. 12 | 13 | --- 14 | 15 | ## **Key Benefits of Real-Time Data Processing** 16 | - **Immediate Insights**: Analyze data as it’s generated. 17 | - **Instant Decision-Making**: Respond to events in real-time (e.g., fraud detection). 18 | - **Operational Efficiency**: Optimize workflows and reduce downtime. 19 | - **Enhanced User Experience**: Personalize experiences using live data. 20 | 21 | ### **Critical Use Cases** 22 | - **Fraud Detection**: Block suspicious transactions instantly. 23 | - **IoT Monitoring**: Track device health in real-time. 24 | - **Live Analytics**: Power dashboards with up-to-the-second data. 25 | 26 | ### **Discussion Questions** 27 | 1. Can you think of a real-time use case near you (e.g., mobile money, delivery apps)? 28 | 2. What happens when systems can’t process data in real-time? 29 | 30 | --- 31 | 32 | # **Introduction to Apache Kafka** 33 | 34 | ## **Key Concepts** 35 | - **Publish/Subscribe Model**: 36 | - Producers send/write (**publish**) messages. 37 | - Consumers receive/read (**subscribe**) messages. 38 | > Asynchronous means not happening or done at the same time or speed — producer and consumer don’t need to wait on each other. 39 | 40 | - **Common Use Cases**: 41 | - **Real-time analytics** Analyze data as it's generated, instead of waiting for batch jobs or reports. 42 | - **Log collection** Gather logs from multiple systems/services into a centralized location for monitoring, debugging, or auditing. 43 | - **Event sourcing** Store state-changing events (like deposit $20, withdraw $30) rather than only storing the final state (balance = $50). 44 | 45 | --- 46 | 47 | ## **Kafka Architecture** 48 | | Component | Role | 49 | |--------------|----------------------------------------------------------------------| 50 | | **Producer** | Sends messages/events to a Kafka topic. | 51 | | **Consumer** | Reads messages/events from a topic. | 52 | | **Broker** | Kafka server storing/serving messages (clusters = multiple brokers). | 53 | | **ZooKeeper**/**KRaft** | Manages cluster state/metadata (KRaft replaces ZooKeeper in Kafka 4.0+). | 54 | 55 | --- 56 | ## **Event**: 57 | - An event records the fact that "something happened" in the real world or in your system. 58 | - It’s the fundamental unit of data in Kafka and may also be referred to as a record or message. 59 | - Events are immutable — once written, they are not updated 60 | - When you produce (write) or consume (read) data in Kafka, you're interacting with events. 
61 | 62 | #### Structure of an Event 63 | An event in Kafka typically contains the following components: 64 | - Key: Identifies the event (e.g., the user, transaction ID, or source). Used for partitioning logic. 65 | - Value: The actual data or payload (e.g., what happened). 66 | - Timestamp: Time when the event occurred or was written. 67 | - Headers (optional): Metadata about the event, such as content type or correlation ID. 68 | 69 | #### Example Event 70 | ```plaintext 71 | Event key: "Alice" 72 | Event value: "Made a payment of $200" 73 | Timestamp: 2025-06-27T08:45:30Z 74 | Headers: { "source": "mobile-app", "transaction-id": "TXN-4490" } 75 | ``` 76 | This event could be sent to a topic like `payments` and later consumed by analytics or fraud detection services. 77 | 78 | ## **Topic**: 79 | - A topic is a category/feed name to which messages/events are published to.(similar to a database table). 80 | - Topics are split into **partitions** for scalability/parallel processing. 81 | 82 | #### **Examples** 83 | - `orders` – E-commerce purchases. 84 | - `user-logins` – Authentication events. 85 | - `click-events` – Website/app interactions. 86 | 87 | ## **Partition**: 88 | - A topic can be split into multiple partitions, which enables scalability (more messages) and parallelism (faster processing). 89 | - Messages within a partition are strictly ordered. 90 | 91 | ## **Offset**: 92 | - A unique identifier (number) that Kafka assigns to each message within a partition. 93 | - It allows consumers to track where they left off in reading the stream of messages. (e.g., "read up to offset #5"). 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | -------------------------------------------------------------------------------- /MySQLQueryExecutionPlans.md: -------------------------------------------------------------------------------- 1 | # MySQL Query Execution Plans: A Complete Guide 2 | 3 | ## 1. Understanding Query Execution Plans 4 | 5 | A **Query Execution Plan** shows you **how** MySQL decides to retrieve data — which indexes it will use, how many rows it will check, and the join strategy. You can get it using: 6 | 7 | ```sql 8 | EXPLAIN SELECT ...; 9 | ``` 10 | 11 | For deeper insight with actual runtime statistics: 12 | 13 | ```sql 14 | EXPLAIN ANALYZE SELECT ...; 15 | ``` 16 | 17 | ## 2. Example 1 – Basic Table Scan 18 | 19 | Let's query all customers. 20 | 21 | ```sql 22 | EXPLAIN SELECT * FROM customer_info; 23 | ``` 24 | 25 | **Possible output:** 26 | 27 | | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | 28 | |----|-------------|-------|------|---------------|-----|---------|-----|------|-------| 29 | | 1 | SIMPLE | customer_info | ALL | NULL | NULL | NULL | NULL | 1000 | Using where | 30 | 31 | **Interpretation:** 32 | - **type = ALL** → full table scan (slow for large datasets) 33 | - No indexes used because there's no filtering 34 | - **Optimization:** Avoid `SELECT *` unless necessary. Use `WHERE` + indexed columns 35 | 36 | ## 3. Example 2 – Using an Index 37 | 38 | Suppose we query customers by `customer_id` (indexed as PRIMARY KEY). 
39 | 40 | ```sql 41 | EXPLAIN SELECT full_name FROM customer_info WHERE customer_id = 10; 42 | ``` 43 | 44 | **Possible output:** 45 | 46 | | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | 47 | |----|-------------|-------|------|---------------|-----|---------|-----|------|-------| 48 | | 1 | SIMPLE | customer_info | const | PRIMARY | PRIMARY | 4 | const | 1 | NULL | 49 | 50 | **Interpretation:** 51 | - **type = const** → MySQL knows it will return at most one row (fast) 52 | - Using `PRIMARY` key (O(1) lookup) 53 | - **Optimization:** Always filter with indexed columns when possible 54 | 55 | ## 4. Example 3 – Index Usage in Products Table 56 | 57 | Query products by `customer_id` (foreign key). 58 | 59 | ```sql 60 | EXPLAIN SELECT product_name FROM products WHERE customer_id = 3; 61 | ``` 62 | 63 | If `customer_id` is not indexed, **type** will be `ALL` (slow). **Solution:** Add index: 64 | 65 | ```sql 66 | CREATE INDEX idx_customer_id ON products(customer_id); 67 | ``` 68 | 69 | After indexing, the execution plan might show: 70 | 71 | | type | possible_keys | key | rows | Extra | 72 | |------|---------------|-----|------|-------| 73 | | ref | idx_customer_id | idx_customer_id | 5 | Using where | 74 | 75 | ## 5. Example 4 – Join with Index 76 | 77 | Query sales with customer names. 78 | 79 | ```sql 80 | EXPLAIN 81 | SELECT s.sales_id, s.total_sales, c.full_name 82 | FROM sales s 83 | JOIN customer_info c ON s.customer_id = c.customer_id; 84 | ``` 85 | 86 | If both `sales.customer_id` and `customer_info.customer_id` are indexed: 87 | 88 | | id | select_type | table | type | possible_keys | key | ref | rows | Extra | 89 | |----|-------------|-------|------|---------------|-----|-----|------|-------| 90 | | 1 | SIMPLE | c | ALL | PRIMARY | NULL | NULL | 1000 | Using where | 91 | | 1 | SIMPLE | s | ref | idx_customer_id | idx_customer_id | c.customer_id | 10 | NULL | 92 | 93 | **Optimization tips:** 94 | - Always index **join columns** 95 | - Make sure both sides of the join use the same data type 96 | 97 | ## 6. Example 5 – Filtering and Joining 98 | 99 | Retrieve all sales above 500 made by customers in "Nairobi". 100 | 101 | ```sql 102 | EXPLAIN 103 | SELECT s.sales_id, s.total_sales, c.full_name 104 | FROM sales s 105 | JOIN customer_info c ON s.customer_id = c.customer_id 106 | WHERE c.location = 'Nairobi' AND s.total_sales > 500; 107 | ``` 108 | 109 | Possible bottlenecks: 110 | - If `location` isn't indexed → table scan on `customer_info` 111 | - **Solution:** Create index: 112 | 113 | ```sql 114 | CREATE INDEX idx_location ON customer_info(location); 115 | ``` 116 | 117 | Execution plan should now show **ref** instead of **ALL** for `customer_info`. 118 | 119 | ## 7. Example 6 – Multi-table Join with Products 120 | 121 | ```sql 122 | EXPLAIN 123 | SELECT c.full_name, p.product_name, s.total_sales 124 | FROM customer_info c 125 | JOIN products p ON c.customer_id = p.customer_id 126 | JOIN sales s ON p.product_id = s.product_id 127 | WHERE s.total_sales > 1000; 128 | ``` 129 | 130 | Optimization tips: 131 | - Index `products.customer_id` 132 | - Index `sales.product_id` 133 | - Filter early (`WHERE s.total_sales > 1000`) so MySQL processes fewer rows 134 | 135 | ## 8. 
Example 7 – Using `EXPLAIN ANALYZE` 136 | 137 | ```sql 138 | EXPLAIN ANALYZE 139 | SELECT c.full_name, p.product_name 140 | FROM customer_info c 141 | JOIN products p ON c.customer_id = p.customer_id 142 | WHERE p.price > 500; 143 | ``` 144 | 145 | **Benefit:** 146 | - Shows **actual execution time** for each step 147 | - If you see that a join step takes much longer than expected → check indexes 148 | 149 | ## 9. Best Practices Recap 150 | 151 | ✅ Index frequently queried columns (especially in `WHERE`, `JOIN`, `ORDER BY`) 152 | ✅ Avoid `SELECT *` for performance 153 | ✅ Use `EXPLAIN` before and after schema changes 154 | ✅ Filter data as early as possible in your query 155 | ✅ Keep an eye on `rows` in EXPLAIN — smaller is better 156 | -------------------------------------------------------------------------------- /AivenProjectVersionWeekOneProject.md: -------------------------------------------------------------------------------- 1 | # Project: Setting Up Aiven Cloud Storage and Connecting a PostgreSQL Database Using DBeaver 2 | 3 | ### **Objective**: 4 | The goal of this project is to set up a **managed PostgreSQL database** on **Aiven**, use their storage options, and connect to it using **DBeaver** to manage and query the data. 5 | 6 | --- 7 | 8 | ### **Steps to Complete the Project**: 9 | 10 | #### **1. Set Up Aiven Account and Create PostgreSQL Service** 11 | - **Create an Aiven account** if you don’t already have one: [Aiven Sign Up](https://aiven.io/). 12 | - **Create a PostgreSQL service** on Aiven: 13 | - Log into the Aiven console. 14 | - Click **Create Service**. 15 | - Select **PostgreSQL** from the list of available services. 16 | - Choose the cloud provider (AWS, Google Cloud, etc.) and the region. 17 | - Configure your service and choose the storage options provided by Aiven. 18 | - Aiven will manage your PostgreSQL setup automatically, including backups, monitoring, and scaling. 19 | - **Obtain connection details**: 20 | - Once the PostgreSQL service is created, note down the **hostname**, **port**, **username**, and **password**. 21 | - You'll use this information to connect from DBeaver. 22 | 23 | #### **2. Set Up Cloud Storage on Aiven** 24 | - **Aiven provides object storage** integrated with their managed services, so you don't need to manually set up AWS S3 or Azure Blob Storage. You can upload data directly to Aiven storage or integrate with other services. 25 | - **Upload a sample file** to Aiven storage: 26 | - Go to the **Aiven dashboard** and look for storage options related to your service. 27 | - Upload a file (like a CSV) that you want to interact with using your PostgreSQL database. 28 | - **Alternatively**, you can use an external storage service (e.g., AWS S3 or Azure Blob) to interact with Aiven if required. However, Aiven's managed storage service should work well for this project. 29 | 30 | #### **3. Install and Configure DBeaver** 31 | - **Download and install DBeaver** (SQL client for PostgreSQL): 32 | - Go to [DBeaver's official site](https://dbeaver.io/) and download the version suitable for your OS. 33 | - **Connect to Aiven PostgreSQL service**: 34 | - Open DBeaver. 35 | - Create a **new connection**: 36 | - Select **PostgreSQL**. 37 | - Fill in the connection details (hostname, port, username, password, and database name from Aiven). 38 | - Test the connection and make sure it's successful. 39 | 40 | #### **4. 
Set Up PostgreSQL Database and Create Tables** 41 | - Once connected via DBeaver, you can **create a new database** or use the existing one. 42 | - Example: Create a simple `products` table: 43 | ```sql 44 | CREATE TABLE products ( 45 | id SERIAL PRIMARY KEY, 46 | name VARCHAR(100), 47 | price DECIMAL 48 | ); 49 | ``` 50 | - **Insert some data** into the `products` table: 51 | ```sql 52 | INSERT INTO products (name, price) 53 | VALUES ('Laptop', 1000), ('Smartphone', 700); 54 | ``` 55 | - **Query the data**: 56 | ```sql 57 | SELECT * FROM products; 58 | ``` 59 | 60 | #### **5. (Optional) Integrate Aiven Storage with PostgreSQL** 61 | - If you're using Aiven's managed storage, you can perform the following operations: 62 | - **Download data from Aiven storage** (if required), using Aiven's integration options or by connecting to storage buckets. 63 | - **Load data into PostgreSQL** (if you’ve uploaded a CSV): 64 | - You can use `COPY` commands in PostgreSQL or perform an import directly through DBeaver’s **Import Data** option. 65 | - Example `COPY` command: 66 | ```sql 67 | COPY products FROM '/path/to/your/file.csv' DELIMITER ',' CSV HEADER; 68 | ``` 69 | 70 | #### **6. Perform Data Operations Using DBeaver** 71 | - Use DBeaver to interact with the **PostgreSQL database**. 72 | - **CRUD Operations**: Create, read, update, and delete data. 73 | - **Querying**: Run SQL queries and get results directly in DBeaver. 74 | - **Database Management**: Create new tables, define schemas, and more. 75 | 76 | --- 77 | 78 | ### **Deliverables**: 79 | 1. **Screenshots** of the Aiven dashboard with the PostgreSQL service and storage bucket setup. 80 | 2. **SQL scripts** for creating and inserting data into the `products` table. 81 | 3. **Python script** (optional) for uploading files to Aiven storage (if applicable). 82 | 4. **Connection details** and queries executed via DBeaver. 83 | 5. A brief report documenting the steps taken, cloud setup, and any challenges faced. 84 | 85 | --- 86 | 87 | ### **Skills Gained**: 88 | - Configuring and using **Aiven's managed PostgreSQL**. 89 | - **Uploading data** to managed cloud storage. 90 | - Using **DBeaver** to connect and query PostgreSQL. 91 | - Integrating **cloud storage** with your database system. 92 | - Performing **ETL operations** (optional if data is being uploaded). 93 | 94 | --- 95 | 96 | ### **Why Use Aiven for this Project?** 97 | - **Managed PostgreSQL**: Aiven handles your PostgreSQL installation, backups, scaling, and monitoring, so you can focus on the data engineering tasks. 98 | - **Storage Integration**: Easily manage cloud storage for your data and avoid manual setups of services like AWS S3 or Azure Blob. 99 | - **Simplified Setup**: Aiven offers a streamlined, unified experience for cloud services, databases, and storage. 100 | 101 | This updated version with **Aiven** simplifies your cloud storage and database setup while still providing the core hands-on experience in managing cloud databases and interacting with them via DBeaver. 102 | -------------------------------------------------------------------------------- /WeekOneProject.md: -------------------------------------------------------------------------------- 1 | ### Project: Setting Up Cloud Storage and Connecting a Database with DBeaver 2 | 3 | #### Objective: 4 | The goal of this project is to set up a cloud storage service (AWS S3 or Azure Blob Storage), create a PostgreSQL database, and connect to it using DBeaver for managing and querying the data. 
This project will help you understand how to configure cloud storage, set up a relational database, and use a SQL client to interact with the database. 5 | 6 | --- 7 | 8 | ### Steps to Complete the Project: 9 | 10 | #### 1. Set Up Cloud Storage (AWS S3 or Azure Blob Storage) 11 | 12 | **Using AWS S3:** 13 | - Create an AWS account (if you don’t have one). 14 | - Go to the **S3 dashboard** and create a new **S3 bucket**. 15 | - Set a unique bucket name and choose a region. 16 | - Leave default settings for now. 17 | - Upload a sample file (e.g., a CSV file or any dataset) to your S3 bucket. 18 | 19 | **OR** 20 | 21 | **Using Azure Blob Storage:** 22 | - Create an Azure account (if you don’t have one). 23 | - Go to the **Azure Portal** and create a **Storage Account**. 24 | - Choose the appropriate region and resource group. 25 | - Once created, navigate to the Blob Storage section and create a **container**. 26 | - Upload a sample file (e.g., a CSV file) to your Azure Blob Storage. 27 | 28 | #### 2. Set Up PostgreSQL Database 29 | 30 | - Install PostgreSQL locally on your machine or use a cloud database provider like AWS RDS or Azure PostgreSQL. 31 | - For local installation: 32 | - **Windows**: Download the installer from the official PostgreSQL website. 33 | - **Mac**: Use Homebrew (`brew install postgresql`). 34 | - **Linux**: Use the package manager (`sudo apt-get install postgresql`). 35 | 36 | - Create a PostgreSQL database named `test_db` (or any other name). 37 | - Connect to the database using the `psql` terminal. 38 | - Create a simple table to store data (e.g., a table for storing basic product information): 39 | ```sql 40 | CREATE TABLE products ( 41 | id SERIAL PRIMARY KEY, 42 | name VARCHAR(100), 43 | price DECIMAL 44 | ); 45 | ``` 46 | 47 | #### 3. Install and Configure DBeaver 48 | 49 | - Download and install **DBeaver** (a SQL client tool that connects to databases). 50 | - Go to [DBeaver website](https://dbeaver.io/) and download the version compatible with your operating system. 51 | 52 | - Open DBeaver and create a **new connection** to the PostgreSQL database: 53 | - Select **PostgreSQL** as the database type. 54 | - Enter the database connection details (host, port, username, password, and database name). 55 | - For local PostgreSQL installation, the default values are typically: 56 | - Host: `localhost` 57 | - Port: `5432` 58 | - Username: `postgres` 59 | - Password: Your PostgreSQL password 60 | - Database: `test_db` 61 | 62 | #### 4. Connect Cloud Storage with PostgreSQL 63 | 64 | - **(Optional) For AWS S3**: Use a tool like `boto3` (AWS SDK for Python) to interact with the files stored in your S3 bucket. You could upload a CSV file and load it into your PostgreSQL database using Python. 65 | - Example Python code using `boto3` to download a file from S3: 66 | ```python 67 | import boto3 68 | 69 | s3 = boto3.client('s3') 70 | bucket_name = 'your-bucket-name' 71 | file_key = 'your-file.csv' 72 | local_file_path = '/path/to/save/file.csv' 73 | 74 | s3.download_file(bucket_name, file_key, local_file_path) 75 | ``` 76 | 77 | - **(Optional) For Azure Blob Storage**: Use the `azure-storage-blob` Python library to interact with Azure Blob Storage. 
78 | - Example Python code to download a file from Azure Blob Storage: 79 | ```python 80 | from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient 81 | 82 | connection_string = "your_connection_string" 83 | container_name = "your_container_name" 84 | blob_name = "your-file.csv" 85 | download_path = "/path/to/save/file.csv" 86 | 87 | blob_service_client = BlobServiceClient.from_connection_string(connection_string) 88 | container_client = blob_service_client.get_container_client(container_name) 89 | blob_client = container_client.get_blob_client(blob_name) 90 | 91 | with open(download_path, "wb") as download_file: 92 | download_file.write(blob_client.download_blob().readall()) 93 | ``` 94 | 95 | #### 5. Use DBeaver to Interact with PostgreSQL 96 | 97 | - Open **DBeaver** and connect to your **PostgreSQL database**. 98 | - Execute basic SQL queries such as: 99 | - Inserting data: 100 | ```sql 101 | INSERT INTO products (name, price) VALUES ('Laptop', 1000); 102 | ``` 103 | - Querying the data: 104 | ```sql 105 | SELECT * FROM products; 106 | ``` 107 | - You can now use DBeaver to perform other SQL operations like creating new tables, updating data, etc. 108 | 109 | #### 6. (Optional) Data Import from CSV 110 | 111 | If you’ve uploaded a CSV file to your cloud storage (S3 or Azure Blob), you can use DBeaver to import this file into your PostgreSQL database: 112 | - In DBeaver, right-click on the table where you want to import data and select **Import Data**. 113 | - Choose the CSV file from your local machine (after downloading from cloud storage). 114 | - Map the CSV columns to the corresponding table columns. 115 | 116 | --- 117 | 118 | ### Deliverables: 119 | 1. Screenshots of the cloud storage (AWS S3 or Azure Blob) with uploaded files. 120 | 2. PostgreSQL database schema (SQL script) for the `products` table. 121 | 3. A Python script for interacting with the cloud storage and PostgreSQL database (if applicable). 122 | 4. DBeaver connection details and queries performed on the database. 123 | 5. A brief report documenting the steps taken and any challenges faced. 124 | 125 | --- 126 | 127 | ### Skills Gained: 128 | - Configuring cloud storage (AWS S3 or Azure Blob Storage). 129 | - Setting up and connecting to a PostgreSQL database. 130 | - Using DBeaver as a SQL client for managing and querying data. 131 | - Integrating cloud storage with PostgreSQL (optional, but adds a real-world dimension). 132 | 133 | This is a simple and effective project that will help learners get hands-on experience with cloud services, databases, and SQL client tools while reinforcing key data engineering concepts. 134 | -------------------------------------------------------------------------------- /Day3-WeekOneDayThreeClass.md: -------------------------------------------------------------------------------- 1 | ### Data Governance, Security, Compliance, and Access Control 2 | 3 | Data has become a critical asset in today’s world, driving decisions and fueling innovation. However, the value of data comes with the responsibility to manage it effectively, secure it from threats, ensure compliance with legal and regulatory standards, and control who can access it. Here’s an overview of these core principles: 4 | 5 | #### 1. Data Governance 6 | **Definition:** 7 | Data governance involves the management of data availability, usability, integrity, and security within an organization. It sets the framework for how data is handled and ensures it aligns with business objectives. 
8 | 9 | **Key Components:** 10 | - **Data Ownership:** Clearly defining who is responsible for data. 11 | - **Data Quality:** Establishing standards to maintain accuracy and reliability. 12 | - **Policies and Procedures:** Creating rules for data usage and handling. 13 | 14 | **Benefits:** 15 | - Enhanced decision-making. 16 | - Compliance with regulations. 17 | - Improved data security. 18 | 19 | #### 2. Data Security 20 | **Definition:** 21 | Protecting data from unauthorized access, breaches, and theft. 22 | 23 | **Key Practices:** 24 | - **Encryption:** Securing data both at rest and in transit. 25 | - **Firewalls and Intrusion Detection:** Preventing unauthorized access to systems. 26 | - **Authentication and Authorization:** Ensuring only legitimate users can access sensitive data. 27 | 28 | **Emerging Threats:** 29 | - Ransomware attacks. 30 | - Phishing schemes targeting data storage systems. 31 | 32 | **Mitigation:** 33 | - Regular security audits. 34 | - Employee training. 35 | - Investment in robust security tools. 36 | 37 | #### 3. Compliance 38 | **Definition:** 39 | Ensuring data handling practices meet legal and regulatory requirements. 40 | 41 | **Major Regulations:** 42 | - **GDPR (General Data Protection Regulation):** European Union data privacy law. 43 | - **CCPA (California Consumer Privacy Act):** Data privacy law for California residents. 44 | - **HIPAA (Health Insurance Portability and Accountability Act):** U.S. law governing healthcare data. 45 | 46 | **Consequences of Non-Compliance:** 47 | - Fines. 48 | - Reputational damage. 49 | - Legal liabilities. 50 | 51 | **Steps to Achieve Compliance:** 52 | - Regular audits. 53 | - Documentation of data handling procedures. 54 | - Collaboration with legal and compliance experts. 55 | 56 | #### 4. Access Control 57 | **Definition:** 58 | Restricting access to data based on user roles and responsibilities. 59 | 60 | **Key Methods:** 61 | - **Role-Based Access Control (RBAC):** Permissions are assigned based on job functions. 62 | - **Least Privilege Principle:** Users are given the minimum level of access required to perform their tasks. 63 | - **Multi-Factor Authentication (MFA):** Adding layers of verification for secure access. 64 | 65 | **Tools:** 66 | - Identity and Access Management (IAM) solutions. 67 | - Audit trails to monitor access logs. 68 | 69 | --- 70 | 71 | ### Introduction to SQL for Data Engineering and PostgreSQL Setup 72 | 73 | SQL (Structured Query Language) is the backbone of data engineering, used to manipulate, query, and manage relational databases. PostgreSQL, a robust open-source database management system, is a popular choice for data engineering projects. 74 | 75 | #### 1. What is SQL? 76 | **Definition:** 77 | A language designed for interacting with relational databases. 78 | 79 | **Common SQL Operations:** 80 | - **SELECT:** Retrieve data from tables. 81 | - **INSERT:** Add new data. 82 | - **UPDATE:** Modify existing data. 83 | - **DELETE:** Remove data. 84 | - **JOIN:** Combine data from multiple tables. 85 | 86 | #### 2. Why SQL for Data Engineering? 87 | **Use Cases:** 88 | - **Data Transformation:** Clean, aggregate, and reshape data for analysis. 89 | - **Data Integration:** Combine data from multiple sources into a central repository. 90 | - **Data Management:** Create and maintain database schemas and indexes. 91 | 92 | **Efficiency:** 93 | SQL is optimized for high-performance queries, essential for big data workloads. 94 | 95 | #### 3. 
Introduction to PostgreSQL 96 | **Overview:** 97 | PostgreSQL is a powerful, feature-rich database system known for its reliability, scalability, and extensibility. 98 | 99 | **Features:** 100 | - **ACID compliance:** Reliable transactions. 101 | - **Support for JSON and array data types.** 102 | - **Advanced indexing options:** Like GiST and GIN. 103 | - **Built-in support for full-text search and stored procedures.** 104 | 105 | **Use Cases:** 106 | - Data warehouses. 107 | - Web applications. 108 | - Analytics. 109 | 110 | #### 4. Setting Up PostgreSQL 111 | **Installation:** 112 | - **On Linux:** `sudo apt install postgresql` 113 | - **On macOS:** `brew install postgresql` 114 | - **On Windows:** Use the official installer from the PostgreSQL website. 115 | 116 | **Basic Commands:** 117 | - Start the PostgreSQL server: `sudo service postgresql start` 118 | - Access the PostgreSQL shell: `psql` 119 | 120 | **Creating a Database:** 121 | ```sql 122 | CREATE DATABASE my_database; 123 | ```` 124 | 125 | ### Connecting to the Database 126 | 127 | To connect to the PostgreSQL database, use the following command in your terminal: 128 | 129 | ```bash 130 | psql -d my_database 131 | ``` 132 | ### Creating a Table 133 | To create a table named employees, use the following SQL command. This table includes an automatically incrementing id, the name of the employee, their role, and their salary: 134 | 135 | ```sql 136 | CREATE TABLE employees ( 137 | id SERIAL PRIMARY KEY, 138 | name VARCHAR(100), 139 | role VARCHAR(50), 140 | salary NUMERIC 141 | ); 142 | ``` 143 | 144 | ### Inserting Data 145 | Add a record to the employees table using the following SQL command. This example inserts a new employee, "John Doe," with the role "Data Engineer" and a salary of 75,000: 146 | 147 | 148 | ```sql 149 | INSERT INTO employees (name, role, salary) 150 | VALUES ('John Doe', 'Data Engineer', 75000); 151 | ``` 152 | ### Querying Data 153 | To retrieve all data from the employees table, use the SELECT command: 154 | 155 | ```sql 156 | SELECT * FROM employees; 157 | ``` 158 | This command will display all rows and columns in the table. 159 | 160 | ### Conclusion 161 | Understanding data governance, security, compliance, and access control is essential for protecting organizational data and meeting regulatory standards. These principles help ensure data is used effectively, remains secure, and complies with legal requirements. 162 | 163 | At the same time, mastering SQL and PostgreSQL equips data engineers with powerful tools to build and manage data pipelines. SQL provides the foundation for querying and manipulating data, while PostgreSQL offers a robust platform for efficient storage and retrieval, enabling effective data analytics and decision-making. 164 | 165 | -------------------------------------------------------------------------------- /Apache Kafka 101: Apache Kafka for Data Engineering Guide.md: -------------------------------------------------------------------------------- 1 | ### Kafka-cheat-sheet 2 | 3 | Apache Kafka® serves as an open-source distributed streaming platform. Similar to other distributed systems, Kafka boasts a complex architecture, which may pose a challenge for new developers. Setting up Kafka involves navigating a formidable command line interface and configuring numerous settings. In this guide, I will provide insights into architectural concepts and essential commands frequently used by developers to initiate their journey with Kafka. 
4 | 5 | #### Key Concepts: 6 | 7 | - **Clusters**: A group of servers working together to provide speed (low latency), durability, and scalability. 8 | - **Topic**: A named stream of records into which Kafka organizes data. 9 | - **Brokers**: Kafka server instances that store and replicate messages. 10 | - **Producers**: Applications that write data to Kafka topics. 11 | - **Consumers**: Applications that read data from Kafka topics. 12 | - **Partitions**: Divisions of a topic for scalability and parallelism. 13 | - **Connect**: Kafka Connect streams data between Kafka and external systems. The framework manages the Tasks; a Connector is only responsible for generating the set of Tasks and indicating to the framework when they need to be updated. 14 | 15 | The easiest way to run Kafka clusters is to use **Confluent Cloud**, a managed Kafka service from Confluent (the company founded by Kafka's original creators), which also provides client libraries for writing producers and consumers and for working with a schema registry. 16 | 17 | #### Summary 18 | 19 | 1. **Apache Kafka**: A distributed streaming platform. 20 | 21 | The Kafka CLI is a powerful tool. However, the user experience can be challenging if you don’t already know the exact command needed for your task. Below are the commonly used CLI commands to interact with Kafka: 22 | 23 | #### Start Zookeeper 24 | ```sh 25 | zookeeper-server-start config/zookeeper.properties 26 | ``` 27 | 28 | #### Start Kafka Server 29 | ```sh 30 | kafka-server-start config/server.properties 31 | ``` 32 | 33 | ### Kafka Topics 34 | 35 | - **List existing topics** 36 | ```sh 37 | bin/kafka-topics.sh --zookeeper localhost:2181 --list 38 | ``` 39 | 40 | - **Describe a topic** 41 | ```sh 42 | bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic mytopic 43 | ``` 44 | 45 | - **Purge a topic** 46 | ```sh 47 | bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --config retention.ms=1000 48 | ``` 49 | 50 | Then, once the old messages have been deleted, restore the default retention: 51 | 52 | ```sh 53 | bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --delete-config retention.ms 54 | ``` 55 | 56 | - **Delete a topic** 57 | ```sh 58 | bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic mytopic 59 | ``` 60 | 61 | - **Get number of messages in a topic** 62 | ```sh 63 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}' 64 | ``` 65 | 66 | - **Get the earliest offset still in a topic** 67 | ```sh 68 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -2 69 | ``` 70 | 71 | - **Get the latest offset still in a topic** 72 | ```sh 73 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -1 74 | ``` 75 | 76 | - **Consume messages with the console consumer** 77 | ```sh 78 | bin/kafka-console-consumer.sh --new-consumer --bootstrap-server localhost:9092 --topic mytopic --from-beginning 79 | ``` 80 | 81 | - **Get the consumer offsets for a topic** 82 | ```sh 83 | bin/kafka-consumer-offset-checker.sh --zookeeper=localhost:2181 --topic=mytopic --group=my_consumer_group 84 | ``` 85 | 86 | - **Read from `__consumer_offsets`** 87 | Add the following property to `config/consumer.properties`: 88 | ```sh 89 | exclude.internal.topics=false 90 | ``` 91 | 92 | Then run: 93 | ```sh 94 | bin/kafka-console-consumer.sh --consumer.config config/consumer.properties --from-beginning --topic __consumer_offsets --zookeeper localhost:2181 --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter"
95 | ``` 96 | 97 | ### Kafka Consumer Groups 98 | 99 | - **List the consumer groups known to Kafka** 100 | ```sh 101 | bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list # (old API) 102 | ``` 103 | ```sh 104 | bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --list # (new API) 105 | ``` 106 | 107 | - **View the details of a consumer group** 108 | ```sh 109 | bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group my_consumer_group 110 | ``` 111 | 112 | ### Kafkacat 113 | 114 | - **Getting the last five messages of a topic** 115 | ```sh 116 | kafkacat -C -b localhost:9092 -t mytopic -p 0 -o -5 -e 117 | ``` 118 | 119 | ### Zookeeper 120 | 121 | - **Starting the Zookeeper Shell** 122 | ```sh 123 | bin/zookeeper-shell.sh localhost:2181 124 | ``` 125 | 126 | ### Running Java Class 127 | 128 | - **Run `ConsumerOffsetChecker` once the Kafka server is up and a topic has had messages produced and consumed** 129 | ```sh 130 | bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --broker-info --zookeeper localhost:2181 --group test-consumer-group 131 | ``` 132 | 133 | **Note:** `ConsumerOffsetChecker` has been removed in Kafka 1.0.0. Use `kafka-consumer-groups.sh` to get consumer group details: 134 | ```sh 135 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-38063 136 | ``` 137 | 138 | ### Kafka Server & Topics 139 | 140 | - **Start Zookeeper** 141 | ```sh 142 | bin/zookeeper-server-start.sh config/zookeeper.properties 143 | ``` 144 | 145 | - **Start Kafka brokers (Servers = cluster)** 146 | ```sh 147 | bin/kafka-server-start.sh config/server.properties 148 | ``` 149 | 150 | - **Create a topic** 151 | ```sh 152 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test 153 | ``` 154 | 155 | - **List all topics** 156 | ```sh 157 | bin/kafka-topics.sh --list --zookeeper localhost:2181 158 | ``` 159 | 160 | - **See topic details (partition, replication factor, etc.)** 161 | ```sh 162 | bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test 163 | ``` 164 | 165 | - **Change partition number of a topic (`--alter`)** 166 | ```sh 167 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic test --partitions 3 168 | ``` 169 | 170 | ### Producer 171 | 172 | ```sh 173 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test 174 | ``` 175 | 176 | ### Consumer 177 | 178 | ```sh 179 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic test 180 | ``` 181 | 182 | - **To consume only new messages** 183 | ```sh 184 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test 185 | ``` 186 | 187 | ### Kafka Connect 188 | 189 | - **Standalone connectors (run in a single, local, dedicated process)** 190 | ```sh 191 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties 192 | ``` 193 | 194 | ### Reference 195 | 196 | [Redpanda Kafka Tutorial](https://redpanda.com/guides/kafka-tutorial) 197 | -------------------------------------------------------------------------------- /Apache Kafka 102: Apache Kafka for Data Engineering Guide.md: -------------------------------------------------------------------------------- 1 | ### **Apache Kafka Cheat Sheet** 2 | 3 | #### **Introduction** 4 | Apache Kafka® is an **open-source distributed event streaming platform** used for building **real-time data pipelines** and
streaming applications. Kafka is **horizontally scalable**, **fault-tolerant**, and **highly durable**. 5 | 6 | This guide provides an in-depth look at Kafka's **architecture**, **core concepts**, and **commonly used commands** with detailed explanations and examples. 7 | 8 | --- 9 | 10 | #### **1. Key Concepts** 11 | ##### **1.1 Clusters** 12 | - A **Kafka Cluster** is a collection of **brokers (servers)** working together. 13 | - Provides **fault tolerance, scalability, and high throughput**. 14 | - Clusters handle **millions of messages per second** in distributed systems. 15 | 16 | ##### **1.2 Topics** 17 | - A **topic** is a logical channel where messages are **produced and consumed**. 18 | - Each topic is **split into partitions** for parallel processing. 19 | - Topics are **multi-subscriber**, meaning multiple consumers can read from them. 20 | 21 | **Rules for Naming Topics:** 22 | 1. Topic names should **only contain** letters (`a-z`, `A-Z`), numbers (`0-9`), dots (`.`), underscores (`_`), and hyphens (`-`). 23 | 2. Topic names should be **descriptive** and meaningful. 24 | 3. **Avoid special characters** like `@`, `#`, `!`, `*`, as Kafka does not support them. 25 | 26 | **Examples of Topic Names:** 27 | ``` 28 | # Valid topic names 29 | customer_orders 30 | logs.application-errors 31 | user_activity 32 | 33 | # Invalid topic names (containing special characters) 34 | customer@orders # Invalid '@' 35 | logs#errors # Invalid '#' 36 | ``` 37 | 38 | --- 39 | 40 | #### **2. Brokers** 41 | - A **broker** is a Kafka server that stores and serves messages. 42 | - Kafka brokers manage: 43 | - **Topic partitions** 44 | - **Message replication** 45 | - **Data storage & retrieval** 46 | - Brokers are part of a **Kafka cluster** and work together. 47 | 48 | --- 49 | 50 | #### **3. Producers** 51 | - Producers send (publish) messages to **Kafka topics**. 52 | - Messages are assigned to **partitions** based on: 53 | - **Round-robin (default)** 54 | - **Key-based partitioning** (Ensures messages with the same key go to the same partition) 55 | 56 | **Example: Writing messages to a Kafka topic** 57 | ``` 58 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic customer_orders 59 | ``` 60 | Type messages and press **Enter** to send them. 61 | 62 | --- 63 | 64 | #### **4. Consumers** 65 | - Consumers **read messages** from Kafka topics. 66 | - Consumers belong to **consumer groups**, allowing **parallel processing**. 67 | 68 | **Example: Reading messages from a topic** 69 | ``` 70 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --from-beginning 71 | ``` 72 | This will print **all past and new messages** from the `customer_orders` topic. 73 | 74 | --- 75 | 76 | #### **5. Partitions** 77 | - Topics are split into **partitions** to enable **parallel consumption**. 78 | - Each partition is **ordered** and messages are assigned an **offset**. 79 | - Partitions allow **horizontal scaling**. 80 | 81 | **Example of a topic with 3 partitions:** 82 | ``` 83 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic logs 84 | ``` 85 | 86 | **Rules for Partitions:** 87 | 1. More partitions = **better parallelism**. 88 | 2. Cannot **reduce** partitions, only **increase** them. 89 | 3. Messages are assigned partitions **based on key hashing** or round-robin. 90 | 91 | --- 92 | 93 | #### **6. 
Kafka Connect** 94 | Kafka Connect is used to **stream data** between Kafka and **external data systems** like **databases, file systems, and cloud storage**. 95 | 96 | **Example: Running a Kafka Connect Worker** 97 | ``` 98 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties 99 | ``` 100 | 101 | --- 102 | 103 | #### **7. Kafka CLI Commands** 104 | ##### **7.1 Starting Kafka Components** 105 | ``` 106 | # Start Zookeeper 107 | zookeeper-server-start.sh config/zookeeper.properties 108 | 109 | # Start Kafka Server 110 | kafka-server-start.sh config/server.properties 111 | ``` 112 | 113 | --- 114 | 115 | ##### **7.2 Managing Topics** 116 | ###### **List Topics** 117 | ``` 118 | bin/kafka-topics.sh --zookeeper localhost:2181 --list 119 | ``` 120 | 121 | ###### **Describe a Topic** 122 | ``` 123 | bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic customer_orders 124 | ``` 125 | 126 | ###### **Create a Topic (3 Examples)** 127 | ``` 128 | # Create a topic with 1 partition and replication factor of 1 129 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic customer_orders 130 | 131 | # Create a topic with 3 partitions and replication factor of 2 132 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic logs 133 | 134 | # Create a topic for real-time analytics with 5 partitions 135 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 5 --topic analytics_stream 136 | ``` 137 | 138 | ###### **Delete a Topic** 139 | ``` 140 | bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic old_topic 141 | ``` 142 | 143 | ###### **Increase Partitions** 144 | ``` 145 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic logs --partitions 5 146 | ``` 147 | **⚠️ Note:** Kafka **does not** allow **decreasing** partitions! 148 | 149 | --- 150 | 151 | ##### **7.3 Managing Messages** 152 | ###### **Find Number of Messages in a Topic** 153 | ``` 154 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}' 155 | ``` 156 | 157 | ###### **Get Earliest and Latest Offsets** 158 | ``` 159 | # Earliest offset (first message) 160 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -2 161 | 162 | # Latest offset (last message) 163 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -1 164 | ``` 165 | 166 | --- 167 | 168 | ##### **7.4 Consumer Groups** 169 | ###### **List Consumer Groups** 170 | ``` 171 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list 172 | ``` 173 | 174 | ###### **Describe a Consumer Group** 175 | ``` 176 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my_consumer_group 177 | ``` 178 | 179 | --- 180 | 181 | ##### **7.5 Using Kafkacat** 182 | ###### **Read Last 5 Messages from a Topic** 183 | ``` 184 | kafkacat -C -b localhost:9092 -t logs -p 0 -o -5 -e 185 | ``` 186 | 187 | --- 188 | 189 | #### **8. Advanced Notes** 190 | 1. **Kafka Retention Policies**: 191 | - Kafka can retain messages **forever**, for a set **time period**, or until reaching a **size limit**. 
192 | - Configure **log retention** with: 193 | ``` 194 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic logs --config retention.ms=604800000 # 7 days 195 | ``` 196 | 197 | 2. **Monitoring Kafka**: 198 | - Use Kafka's built-in tools or third-party **monitoring solutions** like **Confluent Control Center, Prometheus, or Grafana**. 199 | 200 | 3. **Kafka Streams API**: 201 | - Used for **real-time data processing** within Kafka itself. 202 | - Helps build **event-driven applications**. 203 | 204 | --- 205 | 206 | #### **9. References** 207 | For more details, check out: 208 | - [Apache Kafka Documentation](https://kafka.apache.org/documentation/) 209 | - [Redpanda Kafka Guide](https://redpanda.com/guides/kafka-tutorial) 210 | -------------------------------------------------------------------------------- /SQL-Manual.md: -------------------------------------------------------------------------------- 1 | # Structured Query Language (SQL) 2 | **Course Manual – Version 1.2** 3 | © 2025 4 | 5 | --- 6 | 7 | ## Table of Contents 8 | 1. [Introduction](#introduction) 9 | 2. [Basic Queries](#basic-queries) 10 | 3. [Advanced Operators](#advanced-operators) 11 | 4. [Expressions](#expressions) 12 | 5. [Functions](#functions) 13 | 6. [Multi-Table Queries](#multi-table-queries) 14 | 7. [Queries Within Queries](#queries-within-queries) 15 | 8. [Maintaining Tables](#maintaining-tables) 16 | 9. [Defining Database Objects](#defining-database-objects) 17 | A. [Appendices](#appendices) 18 | 19 | --- 20 | 21 | 22 | # 1. Introduction 23 | 24 | ### Course Objectives 25 | By the end of the course you will be able to: 26 | - **Query** relational databases 27 | - **Maintain** relational databases 28 | - **Define** relational databases 29 | 30 | > “MAINTAIN – QUERY – RELATIONAL DATABASE – DEFINE” 31 | 32 | --- 33 | 34 | ### What is a Relational Database? 35 | - **Tables** (unique names, ≤ 18 chars, start with letter). 36 | - **Columns** (unique within table). 37 | - **Rows** (identified by data values, not position). 38 | 39 | --- 40 | 41 | ### What is SQL? 42 | - **Structured Query Language** – IBM (1974), ANSI (1986), ISO (1987). 43 | - **Six core statements**: `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`. 44 | 45 | | Category | Statements | 46 | |-----------------|---------------------| 47 | | **Query** | `SELECT` | 48 | | **Maintenance** | `INSERT`, `UPDATE`, `DELETE` | 49 | | **Definition** | `CREATE`, `DROP` | 50 | 51 | --- 52 | 53 | 54 | # 2. Basic Queries 55 | 56 | ### 2.1 Selecting All Columns & Rows 57 | ```sql 58 | SELECT * 59 | FROM countries; 60 | ``` 61 | 62 | ### 2.2 Selecting Specific Columns 63 | ```sql 64 | SELECT title, job 65 | FROM jobs; 66 | ``` 67 | 68 | ### 2.3 Selecting Specific Rows 69 | | Value Type | Format Example | 70 | |------------|----------------| 71 | | Numeric | `123`, `-45.67` | 72 | | String | `'Canada'` | 73 | | Date | `'1999-12-31'` | 74 | 75 | ```sql 76 | SELECT name, country 77 | FROM persons 78 | WHERE country = 'Canada'; 79 | ``` 80 | 81 | ### 2.4 Sorting Rows 82 | ```sql 83 | SELECT country, area 84 | FROM countries 85 | ORDER BY area DESC; -- largest first 86 | ``` 87 | 88 | Multiple columns: 89 | ```sql 90 | ORDER BY language, pop DESC; 91 | ``` 92 | 93 | ### 2.5 Eliminating Duplicate Rows 94 | ```sql 95 | SELECT DISTINCT job 96 | FROM persons; 97 | ``` 98 | 99 | --- 100 | 101 | 102 | # 3. 
Advanced Operators 103 | 104 | | Operator | Meaning | Example | 105 | |----------|---------|---------| 106 | | `LIKE` | Pattern | `WHERE name LIKE 'Z%'` | 107 | | `AND` | Both true | `WHERE gnp < 3000 AND literacy < 40` | 108 | | `BETWEEN`| Inclusive range | `WHERE pop BETWEEN 100000 AND 200000` | 109 | | `OR` | Either true | `WHERE country = 'USA' OR country = 'Canada'` | 110 | | `IN` | List match | `WHERE language IN ('English', 'French')` | 111 | | `IS NULL`| Missing value | `WHERE gnp IS NULL` | 112 | | `NOT` | Negation | `WHERE NOT country = 'USA'` | 113 | | `( )` | Precedence | `WHERE (job='S' OR job='W') AND country='Italy'` | 114 | 115 | --- 116 | 117 | 118 | # 4. Expressions 119 | 120 | ### 4.1 Arithmetic Expressions 121 | ```sql 122 | SELECT country, pop/area AS density 123 | FROM countries; 124 | ``` 125 | 126 | ### 4.2 Expressions in WHERE / ORDER BY 127 | ```sql 128 | SELECT * 129 | FROM countries 130 | WHERE pop/area > 1000 131 | ORDER BY pop/area DESC; 132 | ``` 133 | 134 | ### 4.3 Column Aliases (`AS`) 135 | ```sql 136 | SELECT gnp*1000000/pop AS gpp 137 | FROM countries 138 | WHERE gnp*1000000/pop > 20000 -- an alias cannot be referenced in WHERE, so the expression is repeated 139 | ORDER BY gpp DESC; -- the alias can be used in ORDER BY 140 | ``` 141 | 142 | --- 143 | 144 | 145 | # 5. Functions 146 | 147 | ### 5.1 Statistical Functions 148 | | Function | Purpose | 149 | |----------|---------| 150 | | `COUNT(*)` | Rows | 151 | | `COUNT(col)` | Non-NULL | 152 | | `SUM(col)` | Total | 153 | | `AVG(col)` | Average | 154 | | `MIN(col)` / `MAX(col)` | Min / Max | 155 | 156 | Grand totals: 157 | ```sql 158 | SELECT AVG(pop) AS avg_pop 159 | FROM countries 160 | WHERE language = 'English'; 161 | ``` 162 | 163 | ### 5.2 Grouping 164 | ```sql 165 | SELECT job, COUNT(*) AS total 166 | FROM persons 167 | GROUP BY job; 168 | ``` 169 | 170 | ### 5.3 HAVING (post-group filter) 171 | ```sql 172 | SELECT language, AVG(literacy) AS avg_lit 173 | FROM countries 174 | GROUP BY language 175 | HAVING AVG(literacy) > 90; 176 | ``` 177 | 178 | --- 179 | 180 | 181 | # 6. Multi-Table Queries 182 | 183 | ### 6.1 Joins 184 | ```sql 185 | SELECT p.name, j.title 186 | FROM persons p 187 | JOIN jobs j ON p.job = j.job; 188 | ``` 189 | 190 | ### 6.2 Table Aliases 191 | Short-hand: 192 | ```sql 193 | SELECT c.country, a.budget 194 | FROM countries c 195 | JOIN armies a ON c.country = a.country; 196 | ``` 197 | 198 | ### 6.3 Union 199 | Combine results (distinct): 200 | ```sql 201 | SELECT country FROM religions WHERE percent > 40 202 | UNION 203 | SELECT country FROM countries WHERE language = 'German'; 204 | ``` 205 | 206 | --- 207 | 208 | 209 | # 7. Queries Within Queries (Subqueries) 210 | 211 | | Type | Template | Example | 212 | |------|----------|---------| 213 | | **Single-valued** | `WHERE col = (SELECT …)` | `WHERE pop > (SELECT AVG(pop) FROM countries)` | 214 | | **Multi-valued** | `WHERE col IN (SELECT …)` | `WHERE country IN (SELECT country FROM religions WHERE percent > 95)` | 215 | | **Correlated** | Inner query references outer alias | `WHERE bdate = (SELECT MIN(bdate) FROM persons p2 WHERE p2.job = p1.job)` | 216 | 217 | --- 218 | 219 | 220 | # 8. 
Maintaining Tables 221 | 222 | | Action | Syntax | Example | 223 | |--------|--------|---------| 224 | | **Insert** | `INSERT INTO … VALUES …` | `INSERT INTO jobs(job,title) VALUES ('A','Author');` | 225 | | **Update** | `UPDATE … SET … WHERE …` | `UPDATE persons SET job='E' WHERE person=500;` | 226 | | **Delete** | `DELETE FROM … WHERE …` | `DELETE FROM persons WHERE person=500;` | 227 | | **Transaction** | `COMMIT;` or `ROLLBACK;` | Undo or save all changes since last `COMMIT`. | 228 | 229 | --- 230 | 231 | 232 | # 9. Defining Database Objects 233 | 234 | ### 9.1 Tables 235 | ```sql 236 | CREATE TABLE theologians ( 237 | name CHAR(20) PRIMARY KEY, 238 | bdate DATE, 239 | gender CHAR(6) NOT NULL CHECK (gender IN ('Male','Female')), 240 | country CHAR(20) REFERENCES countries(country) 241 | ); 242 | ``` 243 | 244 | ### 9.2 Indexes 245 | ```sql 246 | CREATE INDEX idx_gender_job ON persons(gender, job); 247 | DROP INDEX idx_gender_job; 248 | ``` 249 | 250 | ### 9.3 Views 251 | ```sql 252 | CREATE VIEW iv AS 253 | SELECT * FROM persons WHERE country = 'Israel'; 254 | 255 | SELECT * FROM iv; 256 | DROP VIEW iv; 257 | ``` 258 | 259 | --- 260 | 261 | 262 | # A. Appendices 263 | 264 | ### Exercise Database Schema 265 | | Table | Key Columns (sample) | 266 | |---------|----------------------| 267 | | **persons** | person, name, bdate, gender, country, job | 268 | | **countries** | country, pop, area, gnp, language, literacy | 269 | | **armies** | country, budget, troops, tanks, ships, planes | 270 | | **jobs** | job, title | 271 | | **religions** | country, religion, percent | 272 | 273 | > All monetary values in millions; population in people; area in sq mi; literacy in %. 274 | 275 | ### Answers to Selected Exercises 276 | See the original PDF **pages 94-110** for the complete answer key. 277 | 278 | --- 279 | 280 | ## Syntax Summary Cheat-Sheet 281 | ```sql 282 | SELECT [DISTINCT] columns|functions 283 | FROM table [alias] [, ...] 284 | [WHERE conditions] 285 | [GROUP BY columns] 286 | [HAVING aggregate_conditions] 287 | [ORDER BY column|expr|position [ASC|DESC]]; 288 | 289 | INSERT INTO table(col1,...) VALUES(val1,...); 290 | UPDATE table SET col=val [, ...] WHERE ...; 291 | DELETE FROM table WHERE ...; 292 | CREATE TABLE tbl (...); 293 | DROP TABLE tbl; 294 | COMMIT; 295 | ROLLBACK; 296 | ``` 297 | -------------------------------------------------------------------------------- /Apache Airflow 101 Guide.md: -------------------------------------------------------------------------------- 1 | ## **Apache Airflow Setup & Introduction (Multi-Component Mode)** 2 | 3 | #### **Understanding Workflow Orchestration** 4 | 5 | **Workflow orchestration** is the automated coordination and management of complex data workflows. Think of it as a conductor directing an orchestra - it ensures that different data processing tasks run in the correct order, at the right time, and handles failures gracefully. 
6 | 7 | **Why Apache Airflow?** 8 | - **Dependency Management**: Automatically handles task dependencies 9 | - **Scheduling**: Run workflows on time-based or event-based triggers 10 | - **Monitoring**: Visual interface to track job progress and failures 11 | - **Scalability**: Handles complex workflows with hundreds of tasks 12 | - **Flexibility**: Python-based, extensible with custom operators 13 | 14 | ## Airflow Architecture Overview (Multi-Component Setup) 15 | 16 | ``` 17 | ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ 18 | │ Web Server │ │ Scheduler │ │ Executor │ 19 | │ (Port │ │ (Triggers │ │(Runs Tasks) │ 20 | │ 8080) │ │ DAGs) │ │ │ 21 | └─────────────┘ └─────────────┘ └─────────────┘ 22 | │ │ │ 23 | └───────────────────┼───────────────────┘ 24 | │ 25 | ┌─────────────┐ 26 | │ Metadata DB │ 27 | │ (PostgreSQL)│ 28 | └─────────────┘ 29 | ``` 30 | 31 | ### Step-by-Step Installation Guide (Multi-Component Mode) 32 | 33 | #### Setup with Separate Web Server and Scheduler 34 | 35 | ```bash 36 | # 1. Create a new directory for your Airflow project 37 | mkdir airflow-tutorial 38 | cd airflow-tutorial 39 | 40 | # 2. Create and activate virtual environment 41 | python -m venv airflow-env 42 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate 43 | 44 | # 3. Set Airflow home directory 45 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow 46 | 47 | # 4. Install Airflow (using constraints for compatibility) 48 | pip install apache-airflow==2.8.0 --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.9.txt 49 | 50 | # 5. Initialize the database 51 | airflow db init 52 | 53 | # 6. Create an admin user (replace with your details) 54 | airflow users create \ 55 | --username admin \ 56 | --firstname Admin \ 57 | --lastname User \ 58 | --role Admin \ 59 | --email admin@example.com \ 60 | --password admin123 61 | ``` 62 | 63 | #### Running Web Server and Scheduler Separately 64 | 65 | You'll need **two terminal windows** for this setup: 66 | 67 | ##### Terminal 1: Start the Web Server 68 | ```bash 69 | # Activate virtual environment 70 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate 71 | 72 | # Set Airflow home 73 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow 74 | 75 | # Start the web server 76 | airflow webserver --port 8080 77 | ``` 78 | 79 | ##### Terminal 2: Start the Scheduler 80 | ```bash 81 | # Open a new terminal window/tab 82 | cd airflow-tutorial 83 | 84 | # Activate virtual environment 85 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate 86 | 87 | # Set Airflow home 88 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow 89 | 90 | # Start the scheduler 91 | airflow scheduler 92 | ``` 93 | 94 | ### Why Use Separate Components? 
95 | 96 | #### Benefits of Multi-Component Setup: 97 | - **Production-Ready**: Mirrors production deployment patterns 98 | - **Resource Management**: Each component can be scaled independently 99 | - **Monitoring**: Easier to monitor individual component performance 100 | - **Debugging**: Separate logs for web server and scheduler 101 | - **High Availability**: Can run multiple instances of each component 102 | 103 | #### Component Responsibilities: 104 | 105 | **Web Server:** 106 | - Serves the Airflow UI 107 | - Handles user authentication 108 | - Provides REST API endpoints 109 | - Displays DAG visualization and monitoring 110 | 111 | **Scheduler:** 112 | - Monitors DAG files for changes 113 | - Triggers task execution based on schedule 114 | - Manages task dependencies 115 | - Handles task retries and failures 116 | 117 | ### Accessing the Airflow UI 118 | 119 | 1. **Ensure both components are running**: 120 | - Web server should show: `Serving on http://0.0.0.0:8080` 121 | - Scheduler should show: `Starting the scheduler` 122 | 123 | 2. **Open your browser** and navigate to: `http://localhost:8080` 124 | 125 | 3. **Login credentials**: 126 | - Username: `admin` 127 | - Password: `admin123` 128 | 129 | ### Exploring the Airflow UI 130 | 131 | #### 1. DAGs View (Main Dashboard) 132 | - **What you'll see**: List of all available DAGs 133 | - **Key elements**: 134 | - DAG name and description 135 | - Recent runs (green = success, red = failed) 136 | - Schedule interval 137 | - Last run date 138 | - Toggle to pause/unpause DAGs 139 | 140 | #### 2. Tree View 141 | - **Purpose**: Shows DAG runs over time 142 | - **How to access**: Click on any DAG → Tree View tab 143 | - **What it shows**: Task instances arranged by execution date 144 | 145 | #### 3. Graph View 146 | - **Purpose**: Visual representation of DAG structure 147 | - **Shows**: Task dependencies and current status 148 | - **Color coding**: 149 | - Green: Success 150 | - Red: Failed 151 | - Yellow: Running 152 | - Light Blue: Queued 153 | - Gray: Not started 154 | 155 | #### 4. Code View 156 | - **Purpose**: Shows the Python code that defines the DAG 157 | - **Useful for**: Understanding DAG logic and debugging 158 | 159 | #### 5. Gantt Chart 160 | - **Purpose**: Shows task execution timeline 161 | - **Useful for**: Identifying bottlenecks and optimizing performance 162 | 163 | ### Key Concepts Explained Simply 164 | 165 | #### DAG (Directed Acyclic Graph) 166 | A workflow definition - like a recipe that tells Airflow what tasks to run and in what order. 167 | 168 | #### Tasks 169 | Individual units of work (like "download file", "process data", "send email"). 170 | 171 | #### Operators 172 | Templates for tasks (PythonOperator, BashOperator, EmailOperator, etc.). 173 | 174 | #### Task Instance 175 | A specific execution of a task for a particular DAG run. 
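
To tie these four terms together, below is a minimal sketch of a DAG file. It is illustrative only: the `dag_id`, task names, and commands are hypothetical placeholders, not something this tutorial requires you to create yet. Dropped into the `$AIRFLOW_HOME/dags/` folder, a file like this is what the scheduler would pick up and what you would then see in the DAGs, Tree, and Graph views.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def process_data():
    # Stand-in for a real transformation step.
    print("processing data")


with DAG(
    dag_id="hello_etl",                # hypothetical example DAG
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # time-based trigger handled by the scheduler
    catchup=False,                     # do not backfill past dates
) as dag:
    download = BashOperator(task_id="download_file", bash_command="echo 'downloading file'")
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    notify = BashOperator(task_id="send_notification", bash_command="echo 'done'")

    # Dependencies: download -> process -> notify
    download >> process >> notify
```

Each task here is built from an operator template, every scheduled run of the DAG creates one task instance per task, and the `>>` chaining is what the Graph View draws as dependency arrows.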
176 | 177 | ### Monitoring Your Setup 178 | 179 | #### Checking Component Status: 180 | 181 | **Web Server Logs:** 182 | - Look for: `Serving on http://0.0.0.0:8080` 183 | - No error messages about port conflicts 184 | 185 | **Scheduler Logs:** 186 | - Look for: `Starting the scheduler` 187 | - Regular DAG processing messages 188 | - No database connection errors 189 | 190 | #### Health Check Commands: 191 | ```bash 192 | # Check if web server is responding 193 | curl http://localhost:8080/health 194 | 195 | # List DAGs (requires both components running) 196 | airflow dags list 197 | 198 | # Check scheduler status 199 | airflow jobs check --job-type SchedulerJob 200 | ``` 201 | 202 | ### Assignment Solutions 203 | 204 | #### Part 1: Why Airflow is Useful in Data Engineering 205 | 206 | Apache Airflow is essential in data engineering because it provides automated workflow orchestration that eliminates manual intervention in complex data pipelines. Unlike traditional cron jobs or script-based scheduling, Airflow offers dependency management, failure handling, and retry mechanisms that ensure data workflows run reliably at scale. Its Python-based approach allows data engineers to define workflows as code, making them version-controlled, testable, and maintainable, while the web UI provides real-time monitoring and debugging capabilities that are crucial for managing production data pipelines. The separation of the web server and scheduler components allows for better resource allocation and mirrors production deployment patterns used in enterprise environments. 207 | 208 | #### Part 2: Screenshot Documentation 209 | 210 | **Expected Screenshot Elements:** 211 | - Airflow UI header with "Apache Airflow" logo 212 | - Navigation menu (DAGs, Browse, Admin, etc.) 213 | - DAGs list showing example DAGs 214 | - Status indicators (green/red circles) 215 | - URL showing `localhost:8080` 216 | - Both terminal windows showing web server and scheduler running 217 | 218 | **Troubleshooting Common Issues:** 219 | 220 | 1. **Port 8080 already in use**: 221 | ```bash 222 | # Kill process using port 8080 223 | sudo lsof -t -i:8080 | xargs sudo kill -9 224 | # Or use a different port 225 | airflow webserver --port 8081 226 | ``` 227 | 228 | 2. **Scheduler not picking up DAGs**: 229 | - Ensure scheduler is running 230 | - Check DAG file syntax 231 | - Verify DAG is not paused 232 | 233 | 3. **Database lock errors**: 234 | - Stop all Airflow processes 235 | - Delete `airflow.db` file 236 | - Run `airflow db init` again 237 | 238 | 4. **Web server can't connect to database**: 239 | - Ensure scheduler is running (it initializes the database) 240 | - Check file permissions on `airflow.db` 241 | 242 | ### Success Checklist ✅ 243 | 244 | - [ ] Airflow installed in virtual environment 245 | - [ ] Database initialized successfully 246 | - [ ] Admin user created 247 | - [ ] Web server running on port 8080 248 | - [ ] Scheduler running in separate terminal 249 | - [ ] Can access http://localhost:8080 250 | - [ ] Can login with admin credentials 251 | - [ ] Can see example DAGs in the interface 252 | - [ ] Both components showing healthy status in logs 253 | - [ ] Can navigate between different views (Tree, Graph, Code) 254 | - [ ] Screenshot saved showing successful multi-component setup 255 | 256 | ### Stopping Airflow Properly 257 | 258 | To stop Airflow cleanly: 259 | 1. **Stop the scheduler**: `Ctrl+C` in the scheduler terminal 260 | 2. **Stop the web server**: `Ctrl+C` in the web server terminal 261 | 3. 
**Deactivate virtual environment**: `deactivate` 262 | 263 | ### Next Steps Preview 264 | Tomorrow we'll create our first custom DAG and understand how to define tasks and dependencies using this multi-component setup! 265 | -------------------------------------------------------------------------------- /ETL/ETL-ELT.md: -------------------------------------------------------------------------------- 1 | # Theory: Introduction to ETL/ELT Workflows 2 | 3 | ## Overview 4 | 5 | ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two fundamental paradigms for data integration workflows. These methodologies define how data moves from source systems to target systems, determining when and where data transformations occur in the pipeline. 6 | 7 | ## ETL (Extract, Transform, Load) 8 | 9 | ### Definition 10 | ETL is a traditional data integration approach where data is extracted from source systems, transformed in a separate processing layer, and then loaded into the target system. 11 | 12 | ### Workflow Process 13 | 14 | #### 1. Extract 15 | - **Purpose**: Retrieve data from various source systems 16 | - **Sources**: Databases, APIs, flat files, web services, applications 17 | - **Methods**: 18 | - Full extraction (complete dataset) 19 | - Incremental extraction (only changed data) 20 | - Delta extraction (changes since last extraction) 21 | - **Challenges**: Handling different data formats, connection issues, rate limits 22 | 23 | #### 2. Transform 24 | - **Purpose**: Convert raw data into a format suitable for analysis 25 | - **Operations**: 26 | - **Data Cleaning**: Remove duplicates, handle null values, correct errors 27 | - **Data Standardization**: Unify formats, units, naming conventions 28 | - **Data Validation**: Ensure data quality and integrity 29 | - **Data Enrichment**: Add calculated fields, lookup values 30 | - **Data Aggregation**: Summarize data at different granularities 31 | - **Data Type Conversion**: Convert between different data types 32 | - **Business Logic Application**: Apply domain-specific rules 33 | 34 | #### 3. 
Load 35 | - **Purpose**: Insert transformed data into target system 36 | - **Loading Strategies**: 37 | - **Full Load**: Replace entire dataset 38 | - **Incremental Load**: Add only new or changed records 39 | - **Upsert**: Insert new records, update existing ones 40 | - **Target Systems**: Data warehouses, databases, data marts 41 | 42 | ### ETL Characteristics 43 | 44 | **Advantages:** 45 | - **Data Quality**: Ensures clean, validated data enters target system 46 | - **Performance**: Target system optimized for queries, not transformations 47 | - **Security**: Sensitive data can be masked/encrypted during transformation 48 | - **Compliance**: Easier to implement data governance rules 49 | - **Predictable Structure**: Target schema is well-defined and stable 50 | 51 | **Disadvantages:** 52 | - **Processing Time**: Sequential process can be time-consuming 53 | - **Resource Intensive**: Requires dedicated transformation infrastructure 54 | - **Less Flexibility**: Schema changes require pipeline modifications 55 | - **Data Freshness**: Batch processing introduces latency 56 | 57 | ### When to Use ETL 58 | - **Complex Transformations**: Heavy business logic or data cleansing required 59 | - **Limited Target Resources**: Target system has limited processing power 60 | - **Strict Data Quality**: High data quality standards must be enforced 61 | - **Regulatory Compliance**: Data governance and audit trails essential 62 | - **Predictable Workloads**: Batch processing acceptable for use case 63 | 64 | ## ELT (Extract, Load, Transform) 65 | 66 | ### Definition 67 | ELT is a modern approach where raw data is loaded directly into the target system, and transformations are performed within that system using its computational resources. 68 | 69 | ### Workflow Process 70 | 71 | #### 1. Extract 72 | - **Same as ETL**: Retrieve data from source systems 73 | - **Minimal Processing**: Little to no transformation during extraction 74 | - **Faster Extraction**: Reduced complexity in extraction phase 75 | 76 | #### 2. Load 77 | - **Raw Data Loading**: Data loaded in its original format 78 | - **Target Systems**: Usually data lakes or cloud data warehouses 79 | - **Schema-on-Write vs Schema-on-Read**: Often uses schema-on-read approach 80 | - **Staging Areas**: May use intermediate staging for organization 81 | 82 | #### 3. 
Transform 83 | - **In-Target Processing**: Transformations occur within target system 84 | - **On-Demand**: Transformations can be applied as needed 85 | - **Multiple Views**: Same raw data can be transformed differently for various use cases 86 | - **Leverages Target Power**: Uses computational resources of target system 87 | 88 | ### ELT Characteristics 89 | 90 | **Advantages:** 91 | - **Faster Loading**: Raw data loads quickly without transformation overhead 92 | - **Scalability**: Leverages powerful cloud computing resources 93 | - **Flexibility**: Multiple transformation views from same raw data 94 | - **Data Preservation**: Original data remains unchanged 95 | - **Real-time Potential**: Enables near real-time data availability 96 | - **Cost-Effective**: Cloud warehouses optimized for large-scale processing 97 | 98 | **Disadvantages:** 99 | - **Target Resource Usage**: Transformation workload on target system 100 | - **Data Quality Risks**: Raw data may contain errors or inconsistencies 101 | - **Security Concerns**: Sensitive data stored in raw format 102 | - **Complex Queries**: End users may need advanced SQL skills 103 | - **Storage Costs**: Raw data requires more storage space 104 | 105 | ### When to Use ELT 106 | - **Big Data Scenarios**: Large volumes of diverse data types 107 | - **Cloud-Native Architecture**: Using cloud data warehouses (Snowflake, BigQuery, Redshift) 108 | - **Agile Analytics**: Rapid development and changing requirements 109 | - **Real-time Insights**: Near real-time data processing needed 110 | - **Data Lake Architecture**: Storing raw data for future unknown use cases 111 | - **Sufficient Target Resources**: Powerful target systems available 112 | 113 | ## Comparison Matrix 114 | 115 | | Aspect | ETL | ELT | 116 | |--------|-----|-----| 117 | | **Processing Location** | External transformation engine | Within target system | 118 | | **Data Quality** | High (pre-loading validation) | Variable (post-loading validation) | 119 | | **Time to Insight** | Higher latency | Lower latency | 120 | | **Flexibility** | Lower (predefined transformations) | Higher (on-demand transformations) | 121 | | **Resource Requirements** | Dedicated transformation infrastructure | Powerful target system | 122 | | **Data Storage** | Only transformed data stored | Raw + transformed data stored | 123 | | **Complexity** | Higher upfront complexity | Lower initial complexity | 124 | | **Maintenance** | More complex schema change management | Easier to adapt to changes | 125 | | **Cost Model** | Infrastructure + processing costs | Storage + compute costs | 126 | 127 | ## Modern Hybrid Approaches 128 | 129 | ### ELT with Pre-processing 130 | - Light transformations during extraction (data type conversion, basic cleaning) 131 | - Bulk of transformation occurs in target system 132 | - Balances benefits of both approaches 133 | 134 | ### Lambda Architecture 135 | - Combines batch (ETL) and stream (ELT) processing 136 | - Handles both historical and real-time data 137 | - Provides comprehensive data processing solution 138 | 139 | ### Medallion Architecture 140 | - **Bronze Layer**: Raw data (ELT approach) 141 | - **Silver Layer**: Cleaned and conformed data (ETL transformations) 142 | - **Gold Layer**: Business-ready data (ETL aggregations) 143 | 144 | ## Technology Considerations 145 | 146 | ### ETL-Optimized Tools 147 | - **Traditional ETL**: Informatica, IBM DataStage, Microsoft SSIS 148 | - **Modern ETL**: Apache Airflow, Talend, Apache NiFi 149 | - **Cloud ETL**: AWS Glue, Azure Data 
Factory, Google Dataflow 150 | 151 | ### ELT-Optimized Platforms 152 | - **Cloud Warehouses**: Snowflake, Google BigQuery, Amazon Redshift 153 | - **Data Lakes**: Apache Spark, Amazon S3 with Athena 154 | - **Stream Processing**: Apache Kafka, Amazon Kinesis 155 | 156 | ### Programming Approaches 157 | - **ETL**: Python with Pandas, Java with Apache Beam 158 | - **ELT**: SQL-based transformations, dbt (data build tool) 159 | 160 | ## Best Practices 161 | 162 | ### For ETL 163 | 1. **Design for Reusability**: Create modular transformation components 164 | 2. **Implement Error Handling**: Robust exception management and logging 165 | 3. **Optimize Performance**: Parallel processing and efficient algorithms 166 | 4. **Document Transformations**: Clear documentation of business rules 167 | 5. **Test Thoroughly**: Unit tests for transformation logic 168 | 169 | ### For ELT 170 | 1. **Data Governance**: Implement data lineage and quality monitoring 171 | 2. **Storage Optimization**: Use appropriate file formats (Parquet, Delta Lake) 172 | 3. **Query Optimization**: Leverage target system's optimization features 173 | 4. **Security Implementation**: Apply row-level security and column masking 174 | 5. **Cost Management**: Monitor and optimize compute and storage costs 175 | 176 | ## Future Trends 177 | 178 | ### Real-time Processing 179 | - Stream processing becoming standard 180 | - Event-driven architectures gaining popularity 181 | - CDC (Change Data Capture) integration 182 | 183 | ### DataOps Integration 184 | - CI/CD for data pipelines 185 | - Automated testing and deployment 186 | - Infrastructure as Code 187 | 188 | ### AI-Enhanced Processing 189 | - Automated data profiling and mapping 190 | - Intelligent error detection and correction 191 | - ML-powered transformation suggestions 192 | 193 | ## Conclusion 194 | 195 | The choice between ETL and ELT depends on specific use cases, infrastructure, data volumes, and business requirements. Modern data architectures often employ hybrid approaches, leveraging the strengths of both paradigms. Understanding these workflows is fundamental to designing effective data integration solutions that meet organizational needs while maintaining data quality, performance, and scalability. 196 | 197 | Key decision factors include: 198 | - Data volume and velocity requirements 199 | - Available infrastructure and resources 200 | - Data quality and governance needs 201 | - Time-to-insight requirements 202 | - Budget and cost considerations 203 | - Team skills and capabilities 204 | 205 | Both ETL and ELT remain relevant in today's data landscape, and the most successful data teams understand when and how to apply each approach effectively. 206 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## **LuxDevHQ Data Engineering Course Outline** 2 | 3 | This comprehensive course spans **4 months** (16 weeks) and equips learners with expertise in Python, SQL, Azure, AWS, Apache Airflow, Kafka, Spark, and more. 4 | - **Learning Days**: Monday to Thursday (theory and practice). 5 | - **Friday**: Job shadowing or peer projects. 6 | - **Saturday**: Hands-on lab sessions and project-based learning. 7 | 8 | --- 9 | ## Table of Contents 10 | 11 | 1. [Week 1](#week-1-Onboarding-and-Environment-Setup) 12 | 2. [Week 2](#week-2-SQL-Essentials-for-Data-Engineering) 13 | 3. [Week 3](#week-3-Introduction-to-Data-Pipelines) 14 | 4. 
[Week 4](#week-4-Introduction-to-Apache-Airflow) 15 | 5. [Week 5](#week-5-Data-Warehousing-and-Data-Lakes) 16 | 6. [Week 6](#week-6-Data-Governance-and-Security) 17 | 7. [Week 7](#week-7-Real-Time-Data-Processing-with-Kafka) 18 | 8. [Week 8](#week-8-Batch-vs-Stream-Processing) 19 | 9. [Week 9](#week-9-Machine-Learning-Integration-in-Data-Pipelines) 20 | 10. [Week 10](#week-10-Spark-and-PySpark-for-Big-Data) 21 | 11. [Week 11](#week-11-Advanced-Apache-Airflow-Techniques) 22 | 12. [Week 12](#week-12-Data-Lakes-and-Delta-Lake) 23 | 13. [Week 13](#week-13-Batch-Data-Pipeline-Development) 24 | 14. [Week 14](#week-14-Real-Time-Data-Pipeline-Development) 25 | 15. [Week 15](#week-15-Final-Project-Integration) 26 | 16. [Week 16](#week-16-capstone-project-presentation) 27 | 28 | 29 | 30 | --- 31 | 32 | ### **Month 1: Foundations of Data Engineering** 33 | 34 | #### Week 1: Onboarding and Environment Setup 35 | - **Monday**: 36 | - Onboarding, course overview, career pathways, tools introduction. 37 | - **Tuesday**: 38 | - Introduction to cloud computing (Azure and AWS). 39 | - **Wednesday**: 40 | - Data governance, security, compliance, and access control. 41 | - **Thursday**: 42 | - Introduction to SQL for data engineering and PostgreSQL setup. 43 | - **Friday**: 44 | - **Peer Project**: Environment setup challenges. 45 | - **Saturday (Lab)**: 46 | - **Mini Project**: Build a basic pipeline with PostgreSQL and Azure Blob Storage. 47 | 48 | --- 49 | 50 | #### Week 2: SQL Essentials for Data Engineering 51 | - **Monday**: 52 | - Core SQL concepts (`SELECT`, `WHERE`, `JOIN`, `GROUP BY`). 53 | - **Tuesday**: 54 | - Advanced SQL techniques: recursive queries, window functions, views, stored procedures, subqueries, and CTEs. 55 | - **Wednesday**: 56 | - Query optimization and execution plans. 57 | - **Thursday**: 58 | - Data modeling: normalization, denormalization, and star schemas. 59 | - **Friday**: 60 | - **Job Shadowing**: Observe senior engineers writing and optimizing SQL queries. 61 | - **Saturday (Lab)**: 62 | - **Mini Project**: Create a star schema and analyze data using SQL. 63 | 64 | --- 65 | 66 | #### Week 3: Introduction to Data Pipelines 67 | - **Monday**: 68 | - Theory: Introduction to ETL/ELT workflows. 69 | - **Tuesday**: 70 | - Lab: Create a simple Python-based ETL pipeline for CSV data. 71 | - **Wednesday**: 72 | - Theory: Extract, transform, load (ETL) concepts and best practices. 73 | - **Thursday**: 74 | - Lab: Build a Python ETL pipeline for batch data processing. 75 | - **Friday**: 76 | - **Peer Project**: Collaborate to design a basic ETL workflow. 77 | - **Saturday (Lab)**: 78 | - **Mini Project**: Develop a simple ETL pipeline to process sales data. 79 | 80 | --- 81 | 82 | #### Week 4: Introduction to Apache Airflow 83 | - **Monday**: 84 | - Theory: Introduction to Apache Airflow, DAGs, and scheduling. 85 | - **Tuesday**: 86 | - Lab: Set up Apache Airflow and create a basic DAG. 87 | - **Wednesday**: 88 | - Theory: DAG best practices and scheduling in Airflow. 89 | - **Thursday**: 90 | - Lab: Integrate Airflow with PostgreSQL and Azure Blob Storage. 91 | - **Friday**: 92 | - **Job Shadowing**: Observe real-world Airflow pipelines. 93 | - **Saturday (Lab)**: 94 | - **Mini Project**: Automate an ETL pipeline with Airflow for batch data processing. 95 | 96 | --- 97 | 98 | ### **Month 2: Intermediate Tools and Concepts** 99 | 100 | #### Week 5: Data Warehousing and Data Lakes 101 | - **Monday**: 102 | - Theory: Introduction to data warehousing (OLAP vs. 
OLTP, partitioning, clustering). 103 | - **Tuesday**: 104 | - Lab: Work with Amazon Redshift and Snowflake for data warehousing. 105 | - **Wednesday**: 106 | - Theory: Data lakes and Lakehouse architecture. 107 | - **Thursday**: 108 | - Lab: Set up Delta Lake for raw and curated data. 109 | - **Friday**: 110 | - **Peer Project**: Implement a data warehouse model and data lake for sales data. 111 | - **Saturday (Lab)**: 112 | - **Mini Project**: Design and implement a basic Lakehouse architecture. 113 | 114 | --- 115 | 116 | #### Week 6: Data Governance and Security 117 | - **Monday**: 118 | - Theory: Data governance frameworks and data security principles. 119 | - **Tuesday**: 120 | - Lab: Use AWS Lake Formation for access control and security enforcement. 121 | - **Wednesday**: 122 | - Theory: Managing sensitive data and compliance (GDPR, HIPAA). 123 | - **Thursday**: 124 | - Lab: Implement security policies in S3 and Azure Blob Storage. 125 | - **Friday**: 126 | - **Job Shadowing**: Observe senior engineers applying governance policies. 127 | - **Saturday (Lab)**: 128 | - **Mini Project**: Secure data in the cloud using AWS and Azure. 129 | 130 | --- 131 | 132 | #### Week 7: Real-Time Data Processing with Kafka 133 | - **Monday**: 134 | - Theory: - [Introduction to Apache Kafka for real-time data streaming](/introduction-to-Kafka.md) 135 | - **Tuesday**: 136 | - Lab: [Set up a Kafka producer and consumer.](/Tuesday-Kafka-Lab.md) 137 | - **Wednesday**: 138 | - Theory: Kafka topics, partitions, and message brokers. 139 | - **Thursday**: 140 | - Lab: Integrate Kafka with PostgreSQL for real-time updates. 141 | - **Friday**: 142 | - **Peer Project**: Build a real-time Kafka pipeline for transactional data. 143 | - **Saturday (Lab)**: 144 | - **Mini Project**: Create a pipeline to stream e-commerce data with Kafka. 145 | 146 | [Apache Kafka 101](./Apache%20Kafka%20101%3A%20Apache%20Kafka%20for%20Data%20Engineering%20Guide.md) 147 | 148 | [Apache Kafka 102](/Apache%20Kafka%20102%3A%20Apache%20Kafka%20for%20Data%20Engineering%20Guide.md) 149 | 150 | 151 | 152 | --- 153 | 154 | #### Week 8: Batch vs. Stream Processing 155 | - **Monday**: 156 | - Theory: Introduction to batch vs. stream processing. 157 | - **Tuesday**: 158 | - Lab: Batch processing with PySpark. 159 | - **Wednesday**: 160 | - Theory: Combining batch and stream processing workflows. 161 | - **Thursday**: 162 | - Lab: Real-time processing with Apache Flink and Spark Streaming. 163 | - **Friday**: 164 | - **Job Shadowing**: Observe a real-time processing pipeline. 165 | - **Saturday (Lab)**: 166 | - **Mini Project**: Build a hybrid pipeline combining batch and real-time processing. 167 | 168 | --- 169 | 170 | ### **Month 3: Advanced Data Engineering** 171 | 172 | #### Week 9: Machine Learning Integration in Data Pipelines 173 | - **Monday**: 174 | - Theory: Overview of ML workflows in data engineering. 175 | - **Tuesday**: 176 | - Lab: Preprocess data for machine learning using Pandas and PySpark. 177 | - **Wednesday**: 178 | - Theory: Feature engineering and automated feature extraction. 179 | - **Thursday**: 180 | - Lab: Automate feature extraction using Apache Airflow. 181 | - **Friday**: 182 | - **Peer Project**: Build a simple pipeline that integrates ML models. 183 | - **Saturday (Lab)**: 184 | - **Mini Project**: Build an ML-powered recommendation system in a pipeline. 
185 | 186 | --- 187 | 188 | #### Week 10: Spark and PySpark for Big Data 189 | - **Monday**: 190 | - Theory: Introduction to Apache Spark for big data processing. 191 | - **Tuesday**: 192 | - Lab: Set up Spark and PySpark for data analysis. 193 | - **Wednesday**: 194 | - Theory: Spark RDDs, DataFrames, Performance Optimization and SQL. 195 | - **Thursday**: 196 | - Lab: Analyze large datasets using Spark SQL. 197 | - **Friday**: 198 | - **Peer Project**: Build a PySpark pipeline for large-scale data processing. 199 | - **Saturday (Lab)**: 200 | - **Mini Project**: Analyze big data sets with Spark and PySpark. 201 | 202 | --- 203 | 204 | #### Week 11: Advanced Apache Airflow Techniques 205 | - **Monday**: 206 | - Theory: Advanced Airflow features (XCom, task dependencies). 207 | - **Tuesday**: 208 | - Lab: Implement dynamic DAGs and task dependencies in Airflow. 209 | - **Wednesday**: 210 | - Theory: Airflow scheduling, monitoring, and error handling. 211 | - **Thursday**: 212 | - Lab: Create complex DAGs for multi-step ETL pipelines. 213 | - **Friday**: 214 | - **Job Shadowing**: Observe advanced Airflow pipeline implementations. 215 | - **Saturday (Lab)**: 216 | - **Mini Project**: Design an advanced Airflow DAG for complex data workflows. 217 | 218 | --- 219 | 220 | #### Week 12: Data Lakes and Delta Lake 221 | - **Monday**: 222 | - Theory: Data lakes, Lakehouses, and Delta Lake architecture. 223 | - **Tuesday**: 224 | - Lab: Set up Delta Lake on AWS for data storage and management. 225 | - **Wednesday**: 226 | - Theory: Managing schema evolution in Delta Lake. 227 | - **Thursday**: 228 | - Lab: Implement batch and real-time data loading to Delta Lake. 229 | - **Friday**: 230 | - **Peer Project**: Design a Lakehouse architecture for an e-commerce platform. 231 | - **Saturday (Lab)**: 232 | - **Mini Project**: Implement a scalable Delta Lake architecture. 233 | 234 | --- 235 | 236 | ### **Month 4: Capstone Projects** 237 | 238 | #### Week 13: Batch Data Pipeline Development 239 | - **Monday to Thursday**: 240 | - **Design and Implementation**: 241 | - Build an end-to-end batch data pipeline for e-commerce sales analytics. 242 | - **Tools**: PySpark, SQL, PostgreSQL, Airflow, S3. 243 | - **Friday**: 244 | - **Peer Review**: Present progress and receive feedback. 245 | - **Saturday (Lab)**: 246 | - **Project Milestone**: Finalize and present batch pipeline results. 247 | 248 | --- 249 | 250 | #### Week 14: Real-Time Data Pipeline Development 251 | - **Monday to Thursday**: 252 | - **Design and Implementation**: 253 | - Build an end-to-end real-time data pipeline for IoT sensor monitoring. 254 | - **Tools**: Kafka, Spark Streaming, Flink, S3. 255 | - **Friday**: 256 | - **Peer Review**: Present progress and receive feedback. 257 | - **Saturday (Lab)**: 258 | - **Project Milestone**: Finalize and present real-time pipeline results. 259 | 260 | --- 261 | 262 | #### Week 15: Final Project Integration 263 | - **Monday to Thursday**: 264 | - **Design and Implementation**: 265 | - Integrate both batch and real-time pipelines for a comprehensive end-to-end solution. 266 | - **Tools**: Kafka, PySpark, Airflow, Delta Lake, PostgreSQL, and S3. 267 | - **Friday**: 268 | - **Job Shadowing**: Observe senior engineers integrating complex pipelines. 269 | - **Saturday (Lab)**: 270 | - **Project Milestone**: Showcase integrated solution for review. 
271 | 272 | --- 273 | 274 | #### Week 16: Capstone Project Presentation 275 | - **Monday to Thursday**: 276 | - Final Presentation Preparation: 277 | - Polish, test, and document the final project. 278 | - **Friday**: 279 | - **Peer Review**: Present final projects to peers and receive feedback. 280 | - **Saturday (Lab)**: 281 | - **Capstone Presentation**: Showcase completed capstone projects to industry professionals and instructors. 282 | -------------------------------------------------------------------------------- /Apache Spark.md: -------------------------------------------------------------------------------- 1 | # Apache Spark & PySpark: Learning Guide 2 | 3 | ## Understanding Apache Spark 4 | 5 | Apache Spark is an open-source distributed computing framework designed for processing large datasets across clusters of computers. Spark maintains data in memory between operations, which significantly reduces the time spent reading from and writing to disk storage. This in-memory processing capability makes Spark substantially faster than traditional disk-based processing systems. 6 | 7 | The framework provides a unified computing engine that handles multiple types of data processing workloads including batch processing (handling large chunks of data at once), stream processing (handling data as it arrives continuously), machine learning, and graph computation. Spark can scale from running on your laptop to running across thousands of computers in a data center. 8 | 9 | ## Core Spark Architecture 10 | 11 | Understanding Spark's architecture helps you comprehend how your data processing actually happens behind the scenes. The Spark runtime consists of several key components that work together to execute distributed computations. 12 | 13 | **Driver Program**: The main application process that runs your Spark code. The driver creates the SparkContext, converts your program into tasks, and coordinates the execution across the cluster. 14 | 15 | **Cluster Manager**: The external service responsible for acquiring resources and allocating them to Spark applications. Common cluster managers include YARN, Mesos, and Kubernetes. 16 | 17 | **Executors**: Worker processes that run on cluster nodes. Executors execute tasks assigned by the driver and store data for caching operations. 18 | 19 | **Tasks**: Individual units of work that executors perform on data partitions. 20 | 21 | **Partitions**: Logical divisions of your data that enable parallel processing across multiple executor cores. 22 | 23 | ## Essential Spark Concepts 24 | 25 | **Transformations** are operations that create new datasets from existing ones. The key insight here is that transformations follow lazy evaluation, meaning Spark doesn't actually do the work immediately. Instead, it builds up a plan of what you want to do. 26 | 27 | Common transformations include filtering data (keeping only rows that meet certain criteria), mapping data (applying a function to transform each row), grouping data by certain columns, and joining different datasets together. 28 | 29 | **Actions** are operations that actually trigger Spark to execute all those planned transformations and give you results. Actions either return results to your driver program or save data to external storage. 30 | 31 | **RDDs (Resilient Distributed Datasets)** represent Spark's most fundamental way of thinking about data. RDDs are immutable (they never change once created), distributed across your cluster, and can be processed in parallel. 
They maintain lineage information, which means Spark remembers how each RDD was created so it can recreate it if something goes wrong. 32 | 33 | **DataFrames** are a higher-level way to work with data that's more familiar if you've used databases or tools like Excel. DataFrames organize data into named columns and provide powerful optimization through Spark's Catalyst optimizer, which automatically makes your queries run faster. 34 | 35 | **Spark SQL** lets you write familiar SQL queries against your DataFrames. This means you can use SELECT, WHERE, GROUP BY, and other SQL statements you might already know, while still getting all the benefits of distributed processing. 36 | 37 | 38 | 39 | ## Practical Example: Understanding the Basics 40 | 41 | Let's walk through a comprehensive example that demonstrates how these concepts work together. I'll explain each part as we go, so you can see both what the code does and why we're doing it that way. 42 | 43 | ```python 44 | from pyspark.sql import SparkSession 45 | from pyspark.sql.functions import col, avg, count, max, min, sum as spark_sum 46 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType 47 | 48 | # Initialize Spark session - this creates your SparkContext 49 | # Think of this as starting up your distributed computing engine 50 | spark = SparkSession.builder \ 51 | .appName("SparkFundamentals") \ 52 | .config("spark.sql.adaptive.enabled", "true") \ 53 | .getOrCreate() 54 | ``` 55 | 56 | This first section creates our Spark session, which is our entry point to all Spark functionality. The appName helps you identify your application when multiple Spark jobs are running. The config setting enables adaptive query execution, which helps Spark automatically optimize your queries as they run. 57 | 58 | ```python 59 | # Create sample data representing employee records 60 | # In real scenarios, this data would come from files, databases, or APIs 61 | employee_data = [ 62 | ("E001", "Alice Johnson", "Engineering", "Senior", 85000, 5, 92.5, "2019-03-15"), 63 | ("E002", "Bob Smith", "Marketing", "Manager", 72000, 3, 88.2, "2021-07-20"), 64 | ("E003", "Carol Davis", "Engineering", "Junior", 65000, 2, 85.7, "2022-01-10"), 65 | ("E004", "David Wilson", "Sales", "Senior", 78000, 4, 90.1, "2020-05-18"), 66 | ("E005", "Eva Brown", "Engineering", "Manager", 95000, 6, 94.3, "2018-11-22"), 67 | ("E006", "Frank Miller", "Marketing", "Junior", 58000, 1, 82.4, "2023-02-14"), 68 | ("E007", "Grace Lee", "Sales", "Senior", 82000, 4, 91.8, "2020-08-30"), 69 | ("E008", "Henry Taylor", "Engineering", "Senior", 88000, 5, 93.1, "2019-06-12"), 70 | ("E009", "Iris Chen", "Marketing", "Manager", 75000, 3, 89.6, "2021-10-05"), 71 | ("E010", "Jack Anderson", "Sales", "Junior", 62000, 2, 84.9, "2022-09-28") 72 | ] 73 | 74 | # Define schema explicitly for data quality and performance 75 | # This tells Spark exactly what type of data to expect in each column 76 | employee_schema = StructType([ 77 | StructField("employee_id", StringType(), False), # False means this field cannot be null 78 | StructField("name", StringType(), False), 79 | StructField("department", StringType(), False), 80 | StructField("level", StringType(), False), 81 | StructField("salary", IntegerType(), False), 82 | StructField("years_experience", IntegerType(), False), 83 | StructField("performance_score", FloatType(), False), 84 | StructField("hire_date", StringType(), False) 85 | ]) 86 | ``` 87 | 88 | Here we're creating sample data and defining its structure. 
In real applications, your data would typically come from files, databases, or streaming sources. The schema definition is important because it tells Spark exactly what to expect, which enables better performance and catches data quality issues early. 89 | 90 | ```python 91 | # Create DataFrame - this represents your distributed dataset 92 | # Even though our data is small, Spark treats it as if it could be distributed across many machines 93 | df = spark.createDataFrame(employee_data, employee_schema) 94 | 95 | # Basic transformations and actions 96 | print("Dataset Overview:") 97 | df.show() # Action: displays data - this actually executes and shows results 98 | print(f"Total employees: {df.count()}") # Action: counts rows - another execution 99 | ``` 100 | 101 | This is where we create our DataFrame, which is Spark's way of representing structured data. Even though our example uses small data that fits in memory, Spark handles it the same way it would handle terabytes of data spread across hundreds of machines. 102 | 103 | The show() and count() operations are actions, which means they trigger Spark to actually process the data and return results. Up until this point, Spark was just planning what to do. 104 | 105 | ```python 106 | # Transformation: filter high performers 107 | # This creates a new DataFrame but doesn't execute yet (lazy evaluation) 108 | high_performers = df.filter(col("performance_score") > 90.0) 109 | print(f"High performers (>90 score): {high_performers.count()}") # Now it executes 110 | ``` 111 | 112 | This demonstrates the difference between transformations and actions. The filter operation creates a new DataFrame that represents "employees with performance scores above 90," but Spark doesn't actually do the filtering until we call count(), which is an action. 113 | 114 | ```python 115 | # Transformation and action: departmental analysis 116 | # This shows how to perform aggregations - very common in data processing 117 | dept_analysis = df.groupBy("department").agg( 118 | avg("salary").alias("avg_salary"), # Calculate average salary per department 119 | count("*").alias("employee_count"), # Count employees per department 120 | avg("performance_score").alias("avg_performance"), # Average performance per department 121 | max("years_experience").alias("max_experience") # Maximum experience per department 122 | ) 123 | 124 | print("\nDepartmental Analysis:") 125 | dept_analysis.show() # Action: triggers execution of the entire aggregation 126 | ``` 127 | 128 | This section shows aggregation, which is one of the most common patterns in data processing. We're grouping employees by department and calculating various statistics for each group. The alias() method gives friendly names to our calculated columns. 129 | 130 | ```python 131 | # Using Spark SQL - same functionality, different syntax 132 | # Some people prefer SQL syntax for complex queries 133 | df.createOrReplaceTempView("employees") # Creates a temporary SQL table 134 | 135 | print("\nSenior Employee Analysis (SQL):") 136 | spark.sql(""" 137 | SELECT department, 138 | COUNT(*) as senior_count, 139 | AVG(salary) as avg_senior_salary, 140 | AVG(performance_score) as avg_senior_performance 141 | FROM employees 142 | WHERE level = 'Senior' 143 | GROUP BY department 144 | ORDER BY avg_senior_salary DESC 145 | """).show() 146 | ``` 147 | 148 | This demonstrates that you can use SQL syntax to accomplish the same data processing tasks. 
Some people find SQL more intuitive for complex queries, especially when joining multiple tables or doing complex filtering and aggregation. 149 | 150 | ```python 151 | # Demonstrate caching for performance 152 | # This tells Spark to keep this DataFrame in memory for faster access 153 | df.cache() # Keeps data in memory for faster subsequent operations 154 | ``` 155 | 156 | Caching is a performance optimization technique. When you cache a DataFrame, Spark stores it in memory across your cluster, so subsequent operations on that DataFrame don't need to recompute it from the original data source. 157 | 158 | ## Understanding the Concepts in Action 159 | 160 | Let's trace through what happens when you run this code to understand how the concepts work together: 161 | 162 | When you create the SparkSession, you're establishing your connection to Spark's distributed computing capabilities. Even if you're running on a single machine, Spark still uses the same distributed architecture internally. 163 | 164 | When you create the DataFrame, Spark doesn't immediately load or process the data. Instead, it creates a logical representation of what the data looks like and how it's structured. This is part of Spark's lazy evaluation strategy. 165 | 166 | When you call transformations like filter() or groupBy(), Spark adds these operations to its execution plan but still doesn't do any actual work. It's building up a recipe for how to process your data when the time comes. 167 | 168 | When you call an action like show() or count(), Spark finally executes the entire chain of transformations. It looks at all the operations you've requested, optimizes the execution plan, and then processes the data across your cluster. 169 | 170 | The caching operation tells Spark to store the results in memory after the first computation, so if you perform additional operations on the same DataFrame, it can reuse the cached data instead of recomputing everything from scratch. 171 | 172 | 173 | 174 | --- 175 | 176 | # Weather Data ETL Assignment 177 | 178 | ## Assignment Overview 179 | 180 | Build a data pipeline that extracts weather data from the OpenWeatherMap API, processes it using Apache Spark, and visualizes the results in Grafana. 181 | 182 | ## Requirements 183 | 184 | Extract weather data for at least 10 cities using the OpenWeatherMap API. Select any 10 columns from the API response that you find interesting or relevant. 185 | 186 | Transform the data using PySpark to prepare it for visualization. Apply data cleaning, type conversions, or calculations as needed. 187 | 188 | Store the processed data in any format that allows you to visualize it in Grafana. 189 | 190 | Create visualizations in Grafana that demonstrate your data processing results. Include screenshots of your panels. 191 | 192 | 193 | 194 | ## Deliverables 195 | 196 | 1. **GitHub Repository**: Create a public repository containing your Python scripts and any configuration files 197 | 2. 
**Technical Article**: Write an article on Dev.to or Medium explaining your project, including: 198 | - Overview of your approach 199 | - Code explanations 200 | - Screenshots of your Grafana panels 201 | - Any challenges you encountered and how you solved them 202 | -------------------------------------------------------------------------------- /GDPR & HIPAA Compliance Guide.md: -------------------------------------------------------------------------------- 1 | # GDPR & HIPAA Compliance Guide 2 | 3 | ## Overview 4 | 5 | ### Learning Objectives 6 | By the end of this study session, you should be able to: 7 | - Define GDPR and HIPAA and explain their purposes 8 | - Compare and contrast the two regulatory frameworks 9 | - Identify which regulation applies to different scenarios 10 | - Explain key compliance requirements for both frameworks 11 | - Describe implementation strategies and best practices 12 | 13 | --- 14 | 15 | ## Quick Reference Cards 16 | 17 | ### GDPR at a Glance 18 | - **Full Name:** General Data Protection Regulation 19 | - **Effective Date:** May 25, 2018 20 | - **Jurisdiction:** EU + anywhere processing EU citizen data 21 | - **Scope:** ALL personal data, ALL industries 22 | - **Max Penalty:** €20M or 4% global revenue 23 | - **Key Concept:** "Privacy by Design" 24 | 25 | ### HIPAA at a Glance 26 | - **Full Name:** Health Insurance Portability and Accountability Act 27 | - **Enacted:** 1996 28 | - **Jurisdiction:** United States only 29 | - **Scope:** Healthcare data (PHI) only 30 | - **Max Penalty:** $50,000 per violation 31 | - **Key Concept:** "Minimum Necessary Rule" 32 | 33 | --- 34 | 35 | ## Core Concepts & Definitions 36 | 37 | ### GDPR Key Terms 38 | | Term | Definition | Example | 39 | |------|------------|---------| 40 | | **Personal Data** | Any info relating to identifiable person | Name, email, IP address, location | 41 | | **Data Subject** | The individual whose data is processed | Patient, customer, employee | 42 | | **Data Controller** | Determines purposes/means of processing | Hospital, company, organization | 43 | | **Data Processor** | Processes data on behalf of controller | Cloud provider, software vendor | 44 | | **Special Category Data** | Sensitive personal data requiring extra protection | Health, biometric, genetic data | 45 | 46 | ### HIPAA Key Terms 47 | | Term | Definition | Example | 48 | |------|------------|---------| 49 | | **PHI** | Protected Health Information | Medical records, billing info | 50 | | **Covered Entity** | Must comply with HIPAA | Hospitals, doctors, health plans | 51 | | **Business Associate** | Works with covered entities, handles PHI | IT vendors, billing companies | 52 | | **ePHI** | Electronic Protected Health Information | Digital medical records | 53 | | **TPO** | Treatment, Payment, Operations | Core healthcare functions | 54 | 55 | --- 56 | 57 | ## Detailed Analysis 58 | 59 | ### Section 1: GDPR Deep Dive 60 | 61 | #### The 6 Lawful Bases for Processing (MEMORIZE!) 62 | 1. **Consent** - Clear, specific, informed agreement 63 | 2. **Contract** - Necessary for contract performance 64 | 3. **Legal Obligation** - Required by law 65 | 4. **Vital Interests** - Life or death situations 66 | 5. **Public Task** - Official/governmental functions 67 | 6. 
**Legitimate Interests** - Balancing test with individual rights 68 | 69 | #### GDPR Rights (The "Rights Menu") 70 | - **Right to be Informed** - Know what data is collected 71 | - **Right of Access** - See what data is held 72 | - **Right to Rectification** - Correct inaccurate data 73 | - **Right to Erasure** - "Right to be forgotten" 74 | - **Right to Restrict Processing** - Limit how data is used 75 | - **Right to Data Portability** - Move data between services 76 | - **Right to Object** - Say no to processing 77 | - **Rights Related to Automated Decision-Making** - Human review of AI decisions 78 | 79 | #### DPO Requirements (When Mandatory) 80 | - Public authorities (always) 81 | - Large-scale systematic monitoring 82 | - Large-scale special category data processing 83 | 84 | ### Section 2: HIPAA Deep Dive 85 | 86 | #### The 3 HIPAA Rules 87 | 1. **Privacy Rule** - Who can see PHI and when 88 | 2. **Security Rule** - How to protect ePHI technically 89 | 3. **Breach Notification Rule** - What to do when things go wrong 90 | 91 | #### HIPAA Safeguards (The Security Triangle) 92 | 1. **Administrative Safeguards** 93 | - Assign security officer 94 | - Create policies/procedures 95 | - Train workforce 96 | - Control access 97 | 98 | 2. **Physical Safeguards** 99 | - Lock facilities 100 | - Control workstation access 101 | - Secure devices/media 102 | 103 | 3. **Technical Safeguards** 104 | - Access controls 105 | - Audit controls 106 | - Data integrity 107 | - Transmission security 108 | 109 | --- 110 | 111 | ## Instructional Notes & Discussion Points 112 | 113 | ### Opening Discussion Questions 114 | 1. "Why do we need data protection laws?" 115 | - *Lead students to discuss privacy as fundamental right* 116 | 2. "What happens when your medical records are leaked vs. your shopping preferences?" 117 | - *Highlight different types of harm from data breaches* 118 | 119 | ### Interactive Teaching Activities 120 | 121 | #### Activity 1: Jurisdiction Quiz 122 | Present scenarios, students identify GDPR/HIPAA/Both/Neither: 123 | - US hospital treating EU tourist *(Both)* 124 | - EU company with US employees *(GDPR only)* 125 | - US pharmacy chain *(HIPAA only)* 126 | - Australian company, no EU/US connections *(Neither)* 127 | 128 | #### Activity 2: Data Classification Game 129 | Show different data types, students categorize: 130 | - Email address *(GDPR: Personal Data)* 131 | - X-ray image with patient name *(Both: Personal Data + PHI)* 132 | - Anonymous survey results *(Neither)* 133 | - Fitness tracker data *(GDPR: Personal Data)* 134 | 135 | ### Common Student Misconceptions 136 | ❌ **"GDPR only applies to EU companies"** 137 | ✅ **Correct:** Applies to ANY company processing EU citizen data 138 | 139 | ❌ **"HIPAA applies to all health data"** 140 | ✅ **Correct:** Only applies to covered entities and business associates 141 | 142 | ❌ **"Consent is always required"** 143 | ✅ **Correct:** GDPR has 6 legal bases; HIPAA allows TPO without consent 144 | 145 | --- 146 | 147 | ## Practice Exercises & Questions 148 | 149 | ### Quick Quiz Questions 150 | 151 | #### Multiple Choice 152 | 1. **GDPR applies to:** 153 | a) Only EU companies 154 | b) Any company processing EU citizen data 155 | c) Only healthcare companies 156 | d) Only large corporations 157 | *Answer: b* 158 | 159 | 2. 
**Under HIPAA, PHI can be shared without authorization for:** 160 | a) Marketing purposes 161 | b) Research studies 162 | c) Treatment, payment, operations 163 | d) Employee background checks 164 | *Answer: c* 165 | 166 | 3. **Maximum GDPR fine is:** 167 | a) €10 million 168 | b) €20 million or 4% global revenue 169 | c) €50 million 170 | d) $50,000 per violation 171 | *Answer: b* 172 | 173 | #### True/False 174 | - GDPR requires 72-hour breach notification *(True)* 175 | - HIPAA requires Data Protection Officer *(False - Security Officer)* 176 | - Both regulations require encryption *(True)* 177 | - GDPR allows indefinite data storage *(False)* 178 | 179 | ### Case Study Practice 180 | 181 | #### Case 1: The International Hospital 182 | **Scenario:** US hospital chain opens branch in Germany, treats both US and EU patients, uses cloud storage in Canada. 183 | 184 | **Questions:** 185 | 1. Which regulations apply? 186 | 2. What are the main compliance challenges? 187 | 3. How should they handle data transfers? 188 | 189 | #### Case 2: The Health App 190 | **Scenario:** Startup creates fitness app used by EU citizens, partners with US healthcare providers, stores data on AWS. 191 | 192 | **Questions:** 193 | 1. What type of data are they handling? 194 | 2. What legal basis could they use under GDPR? 195 | 3. Do they need a DPO? 196 | 197 | --- 198 | 199 | ## Exam Preparation 200 | 201 | ### Key Facts to Memorize 202 | 203 | #### GDPR Numbers 204 | - **72 hours** - breach notification to authority 205 | - **30 days** - respond to data subject requests 206 | - **€20M or 4%** - maximum fine 207 | - **May 25, 2018** - effective date 208 | 209 | #### HIPAA Numbers 210 | - **1996** - year enacted 211 | - **60 days** - breach notification to individuals 212 | - **500 individuals** - threshold for immediate HHS notification 213 | - **$50,000** - maximum penalty per violation 214 | 215 | ### Memory Techniques 216 | 217 | #### GDPR Rights Acronym: "I AREPORT" 218 | - **I**nformed 219 | - **A**ccess 220 | - **R**ectification 221 | - **E**rasure 222 | - **R**estrict processing 223 | - **P**ortability 224 | - **O**bject 225 | - **R**elated to automated decision-making 226 | - **T**ransparency (bonus) 227 | 228 | #### HIPAA Safeguards: "APT" 229 | - **A**dministrative 230 | - **P**hysical 231 | - **T**echnical 232 | 233 | --- 234 | 235 | ## Comparison Tables for Quick Review 236 | 237 | ### Similarities & Differences Matrix 238 | 239 | | Aspect | GDPR | HIPAA | Same/Different | 240 | |--------|------|-------|----------------| 241 | | Geographic Scope | Global (EU data) | US only | Different | 242 | | Industry Scope | All industries | Healthcare only | Different | 243 | | Requires Encryption | Yes | Yes | Same | 244 | | Requires DPO/Security Officer | Yes (DPO) | Yes (Security Officer) | Same | 245 | | Breach Notification Timeline | 72 hours | 60 days | Different | 246 | | Right to Delete Data | Yes (Right to Erasure) | No (permanent records) | Different | 247 | | Consent Requirements | Strict | Flexible for TPO | Different | 248 | 249 | ### Penalty Comparison 250 | 251 | | Violation Level | GDPR | HIPAA | 252 | |----------------|------|-------| 253 | | **Minor** | Warning or €10M/2% | $100-$50,000 | 254 | | **Major** | €20M/4% global revenue | Up to $1.5M annually | 255 | | **Criminal** | Varies by country | Up to $250K + 10 years prison | 256 | 257 | --- 258 | 259 | ## Best Practices for Implementation 260 | 261 | ### For Instructors 262 | 263 | #### Making It Relevant 264 | - Use current breach 
examples (Equifax, Anthem, etc.) 265 | - Discuss social media privacy settings 266 | - Connect to students' personal experiences with healthcare 267 | 268 | #### Common Teaching Pitfalls to Avoid 269 | - Don't get lost in legal details - focus on practical application 270 | - Avoid presenting as "US vs EU" - many companies need both 271 | - Don't oversimplify consent - it's more complex than "just ask permission" 272 | 273 | #### Assessment Ideas 274 | - **Case study analysis** - real-world application 275 | - **Compliance checklist creation** - practical skills 276 | - **Risk scenario evaluation** - critical thinking 277 | - **Policy writing exercise** - hands-on experience 278 | 279 | ### Study Group Activities 280 | 1. **Mock DPA Investigation** - role-play compliance audit 281 | 2. **Breach Response Simulation** - practice incident response 282 | 3. **Privacy Notice Comparison** - analyze real company notices 283 | 4. **Compliance Cost Calculation** - estimate implementation costs 284 | 285 | --- 286 | 287 | ## Additional Resources 288 | 289 | ### Essential Reading 290 | - GDPR Official Text (Articles 5, 6, 7, 12-22, 25, 32-34) 291 | - HIPAA Privacy Rule Summary 292 | - ICO (UK) Guidance Documents 293 | - HHS HIPAA Security Rule Guidance 294 | 295 | ### Recommended Cases to Study 296 | - **Schrems II** (EU-US data transfers) 297 | - **Google Spain** (Right to be forgotten) 298 | - **Anthem Breach** (largest healthcare breach) 299 | - **Facebook-Cambridge Analytica** (consent and data sharing) 300 | 301 | ### Online Tools 302 | - ICO Self-Assessment Tool 303 | - HHS Security Risk Assessment Tool 304 | - GDPR Compliance Checkers 305 | - Breach Cost Calculators 306 | 307 | --- 308 | 309 | ## Final Assessment Checklist 310 | 311 | ### Before the Exam, Can You: 312 | - [ ] Explain when GDPR vs HIPAA applies? 313 | - [ ] List all 6 GDPR lawful bases? 314 | - [ ] Name all GDPR rights? 315 | - [ ] Describe the 3 HIPAA rules? 316 | - [ ] Compare breach notification requirements? 317 | - [ ] Explain DPO vs Security Officer roles? 318 | - [ ] Calculate potential penalty amounts? 319 | - [ ] Identify required security safeguards? 320 | - [ ] Distinguish between personal data and PHI? 321 | - [ ] Apply regulations to real-world scenarios? 322 | 323 | ### Red Flag Concepts (Review if Unclear) 324 | - Cross-border data transfers 325 | - Legitimate interests balancing test 326 | - Business associate agreements 327 | - Data processing vs data controlling 328 | - Special category data protections 329 | - Minimum necessary rule application 330 | 331 | --- 332 | 333 | ## Instructional Script Snippets 334 | 335 | ### Opening Hook 336 | *"Imagine your medical records, including mental health visits, appear in a Google search of your name. Or your location data shows you visiting a cancer clinic every Tuesday. This isn't science fiction - it's why we need GDPR and HIPAA."* 337 | 338 | ### Transition Between Topics 339 | *"Now that we understand what GDPR protects - all personal data - let's look at HIPAA's more focused approach to healthcare information..."* 340 | 341 | ### Concept Reinforcement 342 | *"Remember: GDPR is the speed limit everywhere you drive with EU citizens as passengers. HIPAA is the special rules only in the hospital parking lot."* 343 | 344 | ### Closing Summary 345 | *"Both regulations share the same goal: protecting people's most sensitive information. The difference is scope - GDPR casts a wide net globally, HIPAA goes deep in US healthcare. 
Master both, and you'll understand the future of privacy law."* 346 | -------------------------------------------------------------------------------- /Change Data Capture.md: -------------------------------------------------------------------------------- 1 | # Change Data Capture (CDC) Learning Guide 2 | 3 | ## What is Change Data Capture ? 4 | 5 | Change Data Capture is a powerful technique that tracks changes—specifically inserts, updates, and deletes—in a source database and streams them to a target system in real-time or near real-time. Think of CDC as a vigilant observer that watches your database and immediately reports any changes to other systems that need to stay synchronized. 6 | 7 | This approach ensures data consistency across different systems like data warehouses, caches, or analytics platforms without the need to process the entire dataset repeatedly. CDC serves as the backbone for data integration, real-time analytics, and maintaining up-to-date information across distributed systems. 8 | 9 | ## Why Use CDC? 10 | 11 | Understanding the benefits of CDC helps explain why it has become essential in modern data architectures: 12 | 13 | **Real-Time Data Synchronization**: Unlike traditional batch processing that updates target systems at scheduled intervals (perhaps once a day or hour), CDC updates target systems instantly as changes occur. This means your analytics dashboard can reflect customer purchases within seconds rather than waiting for the next batch job. 14 | 15 | **Exceptional Efficiency**: CDC processes only the data that has actually changed, dramatically reducing resource usage compared to full dataset transfers. Instead of copying an entire million-row table every hour, CDC might only transfer the dozen rows that actually changed. 16 | 17 | **Guaranteed Consistency**: CDC ensures that downstream systems accurately reflect changes in the source database. When a customer updates their address in your main application, that change propagates reliably to your data warehouse, recommendation engine, and reporting systems. 18 | 19 | ## How Does CDC Work? 20 | 21 | The magic of CDC lies in its ability to capture changes from a source database's write-ahead log (WAL), which is essentially the database's diary of all modifications. Every database maintains this log for recovery purposes, and CDC leverages this existing infrastructure. 22 | 23 | Here's how the process flows: when you make a change to your PostgreSQL database, that change first gets written to the WAL before being applied to the actual data files. CDC tools read these WAL entries and convert them into events that can be streamed to target systems. For example, tools like Debezium read these logs and stream changes to Apache Kafka, which then delivers them to targets such as Cassandra. 24 | 25 | ## Debezium Connector (Source) 26 | 27 | Debezium represents one of the most popular open-source platforms for CDC. It captures changes from various databases including PostgreSQL, MySQL, and Oracle, then streams these changes to Apache Kafka. Think of Debezium as a translator that speaks both "database language" and "streaming language." 28 | 29 | ### How Debezium Works 30 | 31 | Debezium connectors act as careful monitors of a database's WAL, detecting every change that occurs. Each detected change gets converted into a structured event and sent to a designated Kafka topic. 
This process happens with remarkable precision: 32 | 33 | When you insert a new row into a PostgreSQL table, Debezium generates an "INSERT" event containing all the new data. For updates, Debezium creates both "before" and "after" events, showing you exactly what changed from the old values to the new ones. Delete operations generate "DELETE" events that capture what was removed. 34 | 35 | ### Simple Example: Debezium with PostgreSQL 36 | 37 | Let's walk through setting up Debezium with PostgreSQL using a practical example. Imagine you have a PostgreSQL table called `users` with columns for id, name, and email. 38 | 39 | #### Step 1: Enable WAL in PostgreSQL 40 | 41 | First, you need to configure PostgreSQL to use logical replication, which Debezium requires to read changes: 42 | 43 | ```bash 44 | # Edit the PostgreSQL configuration file 45 | sudo nano /etc/postgresql/14/main/postgresql.conf 46 | ``` 47 | 48 | Update these settings in the configuration file: 49 | 50 | ``` 51 | wal_level = logical 52 | max_wal_senders = 1 53 | max_replication_slots = 1 54 | ``` 55 | 56 | After making these changes, restart PostgreSQL to apply them: 57 | 58 | ```bash 59 | sudo systemctl restart postgresql 60 | ``` 61 | 62 | Next, grant the necessary replication permissions to your database user: 63 | 64 | ```bash 65 | psql -U postgres -c "ALTER USER myuser WITH REPLICATION;" 66 | ``` 67 | 68 | #### Step 2: Set Up Kafka and Debezium 69 | 70 | Now you'll set up the streaming infrastructure. Download and extract Apache Kafka: 71 | 72 | ```bash 73 | wget https://downloads.apache.org/kafka/3.6.0/kafka_2.13-3.6.0.tgz 74 | tar -xzf kafka_2.13-3.6.0.tgz 75 | cd kafka_2.13-3.6.0 76 | ``` 77 | 78 | Start Zookeeper and Kafka in separate terminals. Zookeeper manages Kafka's configuration, while Kafka handles the actual message streaming: 79 | 80 | ```bash 81 | # Terminal 1: Start Zookeeper 82 | bin/zookeeper-server-start.sh config/zookeeper.properties 83 | 84 | # Terminal 2: Start Kafka 85 | bin/kafka-server-start.sh config/server.properties 86 | ``` 87 | 88 | Download and set up the Debezium PostgreSQL connector: 89 | 90 | ```bash 91 | mkdir -p /path/to/kafka/plugins 92 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.7.0.Final/debezium-connector-postgres-2.7.0.Final-plugin.tar.gz 93 | tar -xzf debezium-connector-postgres-2.7.0.Final-plugin.tar.gz -C /path/to/kafka/plugins 94 | ``` 95 | 96 | Configure Kafka Connect to recognize your Debezium plugin: 97 | 98 | ```bash 99 | nano config/connect-distributed.properties 100 | ``` 101 | 102 | Add this line to tell Kafka Connect where to find plugins: 103 | 104 | ``` 105 | plugin.path=/path/to/kafka/plugins 106 | ``` 107 | 108 | Start Kafka Connect in distributed mode: 109 | 110 | ```bash 111 | bin/connect-distributed.sh config/connect-distributed.properties 112 | ``` 113 | 114 | Create a configuration file for your Debezium connector. 
This JSON file tells Debezium exactly how to connect to your PostgreSQL database and which tables to monitor (note that Debezium 2.x connectors, including the 2.7.0 release used here, expect `topic.prefix` in place of the older `database.server.name` property): 115 | 116 | ```json 117 | { 118 | "name": "postgres-connector", 119 | "config": { 120 | "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 121 | "database.hostname": "localhost", 122 | "database.port": "5432", 123 | "database.user": "myuser", 124 | "database.password": "mypassword", 125 | "database.dbname": "mydb", 126 | "topic.prefix": "server1", 127 | "table.include.list": "public.users", 128 | "plugin.name": "pgoutput" 129 | } 130 | } 131 | ``` 132 | 133 | Deploy the connector to start monitoring your database: 134 | 135 | ```bash 136 | curl -X POST -H "Content-Type: application/json" --data @postgres-connector.json http://localhost:8083/connectors 137 | ``` 138 | 139 | #### Step 3: Observe Changes in Action 140 | 141 | Now for the exciting part—watching CDC work in real-time. Create your users table and insert some data: 142 | 143 | ```sql 144 | psql -U myuser -d mydb -c "CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT, email TEXT);" 145 | psql -U myuser -d mydb -c "INSERT INTO users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');" 146 | ``` 147 | 148 | Debezium captures this insert operation and creates an event in a Kafka topic named `server1.public.users`. You can view this event using Kafka's console consumer: 149 | 150 | ```bash 151 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic server1.public.users --from-beginning 152 | ``` 153 | 154 | The event structure looks like this: 155 | 156 | ```json 157 | { 158 | "schema": { ... }, 159 | "payload": { 160 | "before": null, 161 | "after": { "id": 1, "name": "Alice", "email": "alice@example.com" }, 162 | "op": "c", // 'c' indicates create (insert) 163 | "ts_ms": 1697051234567 164 | } 165 | } 166 | ``` 167 | 168 | Notice how the event includes both "before" and "after" states. For an insert, "before" is null since the row didn't exist previously. The "op" field indicates the operation type, and "ts_ms" provides a timestamp. 169 | 170 | ## Cassandra Sink Connector 171 | 172 | The Cassandra sink connector completes the CDC pipeline by reading data from Kafka topics and writing it to a Cassandra database. This connector excels at storing CDC events in a scalable, distributed NoSQL environment. 173 | 174 | ### How Cassandra Sink Connector Works 175 | 176 | The connector acts as a bridge between Kafka and Cassandra, reading events from Kafka topics and translating them into appropriate Cassandra operations. It handles the mapping between Kafka record structures and Cassandra table schemas, managing inserts, updates, and deletes automatically. 177 | 178 | ### Simple Example: Cassandra Sink Connector 179 | 180 | Let's continue our example by setting up Cassandra to receive the user events from our Debezium setup. 
181 | 182 | #### Step 1: Set Up Cassandra 183 | 184 | Install and start Cassandra on Ubuntu: 185 | 186 | ```bash 187 | sudo apt update 188 | sudo apt install cassandra 189 | sudo systemctl start cassandra 190 | ``` 191 | 192 | Verify that Cassandra is running properly: 193 | 194 | ```bash 195 | nodetool status 196 | ``` 197 | 198 | Create a keyspace and table structure that matches your source data: 199 | 200 | ```bash 201 | cqlsh -e "CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};" 202 | cqlsh -e "CREATE TABLE mykeyspace.users (id int PRIMARY KEY, name text, email text);" 203 | ``` 204 | 205 | #### Step 2: Configure Cassandra Sink Connector 206 | 207 | Download the Cassandra sink connector: 208 | 209 | ```bash 210 | wget https://d1i4a8756m6x7j.cloudfront.net/repo/7.5/confluent-kafka-connect-cassandra-1.5.0.tar.gz 211 | tar -xzf confluent-kafka-connect-cassandra-1.5.0.tar.gz -C /path/to/kafka/plugins 212 | ``` 213 | 214 | Create a configuration file that tells the connector how to map Kafka events to Cassandra records: 215 | 216 | ```json 217 | { 218 | "name": "cassandra-sink", 219 | "config": { 220 | "connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector", 221 | "tasks.max": "1", 222 | "topics": "server1.public.users", 223 | "cassandra.contact.points": "localhost", 224 | "cassandra.keyspace": "mykeyspace", 225 | "cassandra.table.name": "users", 226 | "cassandra.kcql": "INSERT INTO users SELECT id, name, email FROM server1.public.users" 227 | } 228 | } 229 | ``` 230 | 231 | Deploy the connector to start the data flow: 232 | 233 | ```bash 234 | curl -X POST -H "Content-Type: application/json" --data @cassandra-sink.json http://localhost:8083/connectors 235 | ``` 236 | 237 | #### Step 3: Observe Data in Cassandra 238 | 239 | When Debezium sends user events to Kafka, the Cassandra sink connector automatically writes them to your Cassandra table. Query Cassandra to see the results: 240 | 241 | ```bash 242 | cqlsh -e "SELECT * FROM mykeyspace.users;" 243 | ``` 244 | 245 | You should see the result: `id=1, name='Alice', email='alice@example.com'`. 246 | 247 | The beauty of this setup is that any changes you make to the PostgreSQL users table will automatically appear in Cassandra within seconds, maintaining perfect synchronization between your systems. 248 | 249 | ## Key Considerations 250 | 251 | When implementing CDC in production environments, several important factors require careful attention: 252 | 253 | **Performance Impact**: CDC does add some overhead to your source database since it needs to read and process the WAL continuously. Monitor WAL usage in PostgreSQL, especially in high-transaction systems where the log can grow quickly. Consider the additional I/O load and ensure your database server has adequate resources. 254 | 255 | **Schema Evolution**: Debezium handles schema changes gracefully—when you add columns to a PostgreSQL table, Debezium automatically detects and includes them in future events. However, you must ensure that your target systems (like Cassandra tables) can accommodate these schema changes. Plan your schema evolution strategy carefully. 256 | 257 | **Scalability Considerations**: Cassandra's distributed architecture makes it excellent for handling high-volume CDC streams. Configure appropriate replication factors for reliability, and consider partitioning strategies that align with your query patterns and data access requirements. 
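To make the WAL-monitoring advice above concrete, here is a small illustrative sketch (not part of the original pipeline) that checks how much WAL each replication slot is retaining on the source PostgreSQL server. It assumes the `psycopg2` driver is installed and reuses the connection details from the earlier example (`mydb`, `myuser`); the slot created by the Debezium connector (named `debezium` unless you override `slot.name`) should appear in the output.

```python
import psycopg2  # assumes psycopg2 / psycopg2-binary is installed

# Connection details reused from the earlier example; adjust for your environment.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

# pg_replication_slots lists every slot; pg_wal_lsn_diff() measures how far
# behind the current WAL position each slot's restart point is.
QUERY = """
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for slot_name, active, retained_wal in cur.fetchall():
        # An inactive slot that keeps retaining WAL is the classic cause of
        # runaway disk usage on a CDC source database.
        print(f"slot={slot_name} active={active} retained_wal={retained_wal}")

conn.close()
```

If a slot keeps accumulating retained WAL while its connector is stopped, PostgreSQL cannot recycle those log segments, which is the most common way a CDC setup quietly fills up the source server's disk.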
258 | 259 | ## Assignment: Hands-On Project 260 | 261 | To solidify your understanding of CDC concepts, I encourage you to work through a practical implementation project. This hands-on experience will help you see how all the pieces fit together in a real-world scenario. 262 | 263 | **Project Repository**: [LuxDevHQ Data Engineering Project](https://github.com/LuxDevHQ/LuxDevHQ-Data-Engineering-Project) 264 | 265 | **Your Task**: Clone the repository and follow the comprehensive instructions to set up a complete CDC pipeline using Debezium and a sink connector. The project guides you through using Linux commands to set up PostgreSQL tables, make various changes (inserts, updates, deletes), and verify that these changes propagate correctly to the target system. 266 | 267 | ```bash 268 | git clone https://github.com/LuxDevHQ/LuxDevHQ-Data-Engineering-Project.git 269 | cd LuxDevHQ-Data-Engineering-Project 270 | ``` 271 | 272 | Follow the project's README file for detailed setup and execution instructions. This project provides a practical environment where you can experiment with CDC concepts, helping you understand how Debezium and sink connectors work together in real-world scenarios. 273 | 274 | As you work through this project, pay attention to how the different components interact, observe the timing of data propagation, and experiment with different types of database changes to see how they're handled by the CDC pipeline. 275 | -------------------------------------------------------------------------------- /PYTHON/chapter_1.md: -------------------------------------------------------------------------------- 1 | # Chapter 1: Introduction to Python and The Way of the Program 2 | 3 | ## What is Python? 4 | 5 | Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming. It is widely used in various domains such as web development, data analysis, artificial intelligence, scientific computing, and data engineering. 6 | 7 | ### Year Created 8 | Python was created in the late 1980s, and its first official version, Python 0.9.0, was released in February 1991. Python 2.0 was released in 2000, and Python 3.0, which is not backward-compatible with Python 2, was released in 2008. 9 | 10 | **Current stable Python version:** Python 3.13.2 11 | 12 | ### Important Resources 13 | - [W3School Python tutorial](https://www.w3schools.com/python/) (beginner friendly) 14 | - [Python official documentation](https://docs.python.org/) 15 | - [Programiz Python tutorial](https://www.programiz.com/python-programming) 16 | - [Pythontutorial.net](https://www.pythontutorial.net/) 17 | 18 | ## The Way of the Program 19 | 20 | The goal of learning Python is to teach you to think like a computer scientist. This way of thinking combines some of the best features of mathematics, engineering, and natural science. 21 | 22 | Like mathematicians, computer scientists use formal languages to denote ideas (specifically computations). Like engineers, they design things, assembling components into systems and evaluating tradeoffs among alternatives. Like scientists, they observe the behavior of complex systems, form hypotheses, and test predictions. 23 | 24 | The single most important skill for a computer scientist is **problem solving**. 
Problem solving means the ability to formulate problems, think creatively about solutions, and express a solution clearly and accurately. As it turns out, the process of learning to program is an excellent opportunity to practice problem-solving skills. 25 | 26 | ## 1.1 What is a Program? 27 | 28 | A program is a sequence of instructions that specifies how to perform a computation. The computation might be something mathematical, such as solving a system of equations or finding the roots of a polynomial, but it can also be a symbolic computation, such as searching and replacing text in a document or something graphical, like processing an image or playing a video. 29 | 30 | The details look different in different languages, but a few basic instructions appear in just about every language: 31 | 32 | - **input:** Get data from the keyboard, a file, the network, or some other device 33 | - **output:** Display data on the screen, save it in a file, send it over the network, etc. 34 | - **math:** Perform basic mathematical operations like addition and multiplication 35 | - **conditional execution:** Check for certain conditions and run the appropriate code 36 | - **repetition:** Perform some action repeatedly, usually with some variation 37 | 38 | Believe it or not, that's pretty much all there is to it. Every program you've ever used, no matter how complicated, is made up of instructions that look pretty much like these. 39 | 40 | You can think of programming as the process of breaking a large, complex task into smaller and smaller subtasks until the subtasks are simple enough to be performed with one of these basic instructions. 41 | 42 | ## 1.2 Running Python 43 | 44 | One of the challenges of getting started with Python is that you might have to install Python and related software on your computer. If you are familiar with your operating system, and especially if you are comfortable with the command-line interface, you will have no trouble installing Python. But for beginners, it can be painful to learn about system administration and programming at the same time. 45 | 46 | To avoid that problem, it's recommended that you start out running Python in a browser. Later, when you are comfortable with Python, you can install Python on your computer. 47 | 48 | There are a number of web pages you can use to run Python. Some popular options include: 49 | - [Replit](https://replit.com/) 50 | - [Python.org's online console](https://www.python.org/shell/) 51 | - [Trinket](https://trinket.io/python) 52 | 53 | ### Python Versions 54 | 55 | There are two major versions of Python: Python 2 and Python 3. They are very similar, so if you learn one, it is easy to switch to the other. However, Python 2 reached end-of-life on January 1, 2020, so all new projects should use Python 3. 56 | 57 | The Python interpreter is a program that reads and executes Python code. When you start the interpreter, you should see output like this: 58 | 59 | ``` 60 | Python 3.13.2 (default, Jan 15 2025, 14:20:21) 61 | [GCC 11.2.0] on linux 62 | Type "help", "copyright", "credits" or "license" for more information. 63 | >>> 64 | ``` 65 | 66 | The `>>>` is a prompt that indicates that the interpreter is ready for you to enter code. If you type a line of code and hit Enter, the interpreter displays the result: 67 | 68 | ```python 69 | >>> 1 + 1 70 | 2 71 | ``` 72 | 73 | ## 1.3 The First Program 74 | 75 | Traditionally, the first program you write in a new language is called "Hello, World!" 
because all it does is display the words "Hello, World!". In Python, it looks like this: 76 | 77 | ```python 78 | >>> print('Hello, World!') 79 | Hello, World! 80 | ``` 81 | 82 | This is an example of a print statement, although it doesn't actually print anything on paper. It displays a result on the screen. The quotation marks in the program mark the beginning and end of the text to be displayed; they don't appear in the result. 83 | 84 | The parentheses indicate that `print` is a function. We'll learn more about functions in later chapters. 85 | 86 | ## 1.4 Arithmetic Operators 87 | 88 | After "Hello, World!", the next step is arithmetic. Python provides operators, which are special symbols that represent computations like addition and multiplication. 89 | 90 | The operators `+`, `-`, and `*` perform addition, subtraction, and multiplication: 91 | 92 | ```python 93 | >>> 40 + 2 94 | 42 95 | >>> 43 - 1 96 | 42 97 | >>> 6 * 7 98 | 42 99 | ``` 100 | 101 | The operator `/` performs division: 102 | 103 | ```python 104 | >>> 84 / 2 105 | 42.0 106 | ``` 107 | 108 | Note that the result is `42.0` instead of `42` because division in Python 3 always returns a floating-point number. 109 | 110 | The operator `**` performs exponentiation (raising a number to a power): 111 | 112 | ```python 113 | >>> 6**2 + 6 114 | 42 115 | ``` 116 | 117 | **Warning:** In some other languages, `^` is used for exponentiation, but in Python it is a bitwise operator called XOR: 118 | 119 | ```python 120 | >>> 6 ^ 2 121 | 4 122 | ``` 123 | 124 | ## 1.5 Values and Types 125 | 126 | A value is one of the basic things a program works with, like a letter or a number. Some values we have seen so far are `2`, `42.0`, and `'Hello, World!'`. 127 | 128 | These values belong to different types: 129 | - `2` is an **integer** 130 | - `42.0` is a **floating-point number** 131 | - `'Hello, World!'` is a **string** 132 | 133 | If you are not sure what type a value has, the interpreter can tell you using the `type()` function: 134 | 135 | ```python 136 | >>> type(2) 137 | <class 'int'> 138 | >>> type(42.0) 139 | <class 'float'> 140 | >>> type('Hello, World!') 141 | <class 'str'> 142 | ``` 143 | 144 | In these results, the word "class" is used in the sense of a category; a type is a category of values. 145 | 146 | - Integers belong to the type `int` 147 | - Strings belong to `str` 148 | - Floating-point numbers belong to `float` 149 | 150 | ### String vs Numbers 151 | 152 | Values like `'2'` and `'42.0'` look like numbers, but they are in quotation marks, so they are strings: 153 | 154 | ```python 155 | >>> type('2') 156 | <class 'str'> 157 | >>> type('42.0') 158 | <class 'str'> 159 | ``` 160 | 161 | ## Python Basics 162 | 163 | ### Identifiers 164 | 165 | Identifiers are names used to identify variables, functions, classes, modules, or other objects. Rules for naming identifiers in Python: 166 | 167 | - Identifiers can be a combination of letters (a-z, A-Z), digits (0-9), and underscores (_) 168 | - Identifiers cannot start with a digit 169 | - Identifiers are case-sensitive (`myVar` and `myvar` are different) 170 | - Reserved keywords cannot be used as identifiers 171 | 172 | **Valid identifiers:** 173 | ```python 174 | my_variable 175 | _private_var 176 | variable1 177 | MyClass 178 | ``` 179 | 180 | **Invalid identifiers:** 181 | ```python 182 | 1variable # Cannot start with digit 183 | my-variable # Hyphen not allowed 184 | class # Reserved keyword 185 | ``` 186 | 187 | ### Keywords 188 | 189 | Keywords are reserved words in Python that have special meanings and cannot be used as identifiers. 
Some of the keywords in Python include: 190 | 191 | `if`, `else`, `elif`, `for`, `while`, `break`, `continue`, `def`, `return`, `lambda`, `class`, `import`, `from`, `try`, `except`, `finally`, `and`, `or`, `not`, `True`, `False`, `None` 192 | 193 | You can see all keywords by running: 194 | ```python 195 | import keyword 196 | print(keyword.kwlist) 197 | ``` 198 | 199 | ### PEP 8 Rules 200 | 201 | PEP 8 is the official style guide for Python code. It provides conventions for writing readable and consistent code. Some key PEP 8 rules include: 202 | 203 | - **Indentation:** Use 4 spaces per indentation level 204 | - **Line Length:** Limit all lines to a maximum of 79 characters (72 for docstrings/comments) 205 | - **Imports:** Imports should usually be on separate lines 206 | - **Whitespace:** Avoid extraneous whitespace in various situations 207 | - **Naming Conventions:** 208 | - Variables: `my_variable` 209 | - Functions: `my_function` 210 | - Classes: `MyClass` 211 | - Constants: `MY_CONSTANT` 212 | 213 | ## 1.6 Formal and Natural Languages 214 | 215 | Natural languages are the languages people speak, such as English, Spanish, and French. They were not designed by people (although people try to impose some order on them); they evolved naturally. 216 | 217 | Formal languages are languages that are designed by people for specific applications. For example, the notation that mathematicians use is a formal language that is particularly good at denoting relationships among numbers and symbols. Chemists use a formal language to represent the chemical structure of molecules. And most importantly: 218 | 219 | **Programming languages are formal languages that have been designed to express computations.** 220 | 221 | ### Key Differences 222 | 223 | Although formal and natural languages have many features in common—tokens, structure, and syntax—there are some differences: 224 | 225 | - **Ambiguity:** Natural languages are full of ambiguity, which people deal with by using contextual clues. Formal languages are designed to be nearly or completely unambiguous. 226 | 227 | - **Redundancy:** Natural languages employ lots of redundancy to reduce misunderstandings. Formal languages are less redundant and more concise. 228 | 229 | - **Literalness:** Natural languages are full of idiom and metaphor. Formal languages mean exactly what they say. 230 | 231 | ## 1.7 Debugging 232 | 233 | Programmers make mistakes. For whimsical reasons, programming errors are called **bugs** and the process of tracking them down is called **debugging**. 234 | 235 | Programming, and especially debugging, sometimes brings out strong emotions. If you are struggling with a difficult bug, you might feel angry, despondent, or embarrassed. 236 | 237 | ### Debugging Tips 238 | 239 | - Think of the computer as an employee with certain strengths (speed and precision) and weaknesses (lack of empathy and inability to grasp the big picture) 240 | - Find ways to take advantage of the strengths and mitigate the weaknesses 241 | - Use your emotions to engage with the problem, without letting reactions interfere with your ability to work effectively 242 | - Learning to debug is frustrating, but it's a valuable skill useful for many activities beyond programming 243 | 244 | ## Exercises 245 | 246 | ### Exercise 1.1 247 | Experiment with the "Hello, world!" program and try to make mistakes on purpose: 248 | 249 | 1. In a print statement, what happens if you leave out one of the parentheses, or both? 250 | 2. 
If you are trying to print a string, what happens if you leave out one of the quotation marks, or both? 251 | 3. You can use a minus sign to make a negative number like `-2`. What happens if you put a plus sign before a number? What about `2++2`? 252 | 4. In math notation, leading zeros are ok, as in `09`. What happens if you try this in Python? What about `011`? 253 | 5. What happens if you have two values with no operator between them? 254 | 255 | ### Exercise 1.2 256 | Start the Python interpreter and use it as a calculator: 257 | 258 | 1. How many seconds are there in 42 minutes 42 seconds? 259 | 2. How many miles are there in 10 kilometers? (Hint: there are 1.61 kilometers in a mile) 260 | 3. If you run a 10 kilometer race in 42 minutes 42 seconds, what is your average pace (time per mile in minutes and seconds)? What is your average speed in miles per hour? 261 | 262 | ## Glossary 263 | 264 | - **Problem solving:** The process of formulating a problem, finding a solution, and expressing it 265 | - **High-level language:** A programming language like Python that is designed to be easy for humans to read and write 266 | - **Interpreter:** A program that reads another program and executes it 267 | - **Program:** A set of instructions that specifies a computation 268 | - **Value:** One of the basic units of data, like a number or string, that a program manipulates 269 | - **Type:** A category of values (int, float, str) 270 | - **Bug:** An error in a program 271 | - **Debugging:** The process of finding and correcting bugs 272 | - **Identifier:** Names used to identify variables, functions, classes, or other objects 273 | - **Keyword:** Reserved words in Python with special meanings 274 | - **PEP 8:** The official style guide for Python code -------------------------------------------------------------------------------- /IntroductiontoCloudComputing.md: -------------------------------------------------------------------------------- 1 | ### Introduction to Cloud Computing (Azure and AWS) 2 | **Duration**: 90 minutes 3 | **Audience**: Data Engineers 4 | 5 | --- 6 | 7 | ### Learning Objectives 8 | By the end of this session, participants will be able to: 9 | 1. Define cloud computing and its core principles. 10 | 2. Compare key features of Azure and AWS. 11 | 3. Navigate basic services relevant to data engineering in both platforms. 12 | 4. Set up and configure a basic cloud environment. 13 | 14 | --- 15 | 16 | ### Agenda 17 | 1. **What is Cloud Computing?** (10 minutes) 18 | - Definition and characteristics of cloud computing. 19 | - Cloud deployment models: Public, Private, Hybrid. 20 | - Service models: IaaS, PaaS, SaaS. 21 | 22 | 2. **Azure vs. AWS: Key Concepts** (15 minutes) 23 | - Overview of Azure and AWS platforms. 24 | - Comparison of service categories: Compute, Storage, Networking, and Databases. 25 | - Strengths and use cases for Azure and AWS. 26 | 27 | 3. **Core Services for Data Engineering** (20 minutes) 28 | - **Compute**: 29 | - Azure: Azure Virtual Machines, Azure Databricks. 30 | - AWS: EC2, EMR (Elastic MapReduce). 31 | - **Storage**: 32 | - Azure: Blob Storage, Data Lake Storage. 33 | - AWS: S3, Glacier. 34 | - **Databases**: 35 | - Azure: Azure SQL Database, Cosmos DB. 36 | - AWS: RDS, DynamoDB. 37 | 38 | 4. **Hands-On Lab: Setting Up a Cloud Environment** (30 minutes) 39 | - Create free accounts on Azure and AWS. 40 | - Configure basic cloud storage: 41 | - Azure Blob Storage. 42 | - AWS S3 bucket. 43 | - Upload and retrieve sample data. 44 | 45 | 5. 
**Q&A and Wrap-Up** (15 minutes) 46 | - Address participants questions. 47 | - Discuss common challenges and best practices. 48 | - Share additional resources for continued learning. 49 | 50 | --- 51 | 52 | ### Detailed Session Plan 53 | 54 | #### What is Cloud Computing? (10 minutes) 55 | - **Definition**: Delivering computing services (e.g., servers, storage, databases, networking, software) over the internet. 56 | - **Characteristics**: 57 | - On-demand availability. 58 | - Scalability. 59 | - Pay-as-you-go pricing. 60 | - High availability and reliability. 61 | - **Service Models**: 62 | - **IaaS** (e.g., Virtual Machines): Full control over infrastructure. 63 | - **PaaS** (e.g., Azure Databricks): Managed environment for deploying applications. 64 | - **SaaS** (e.g., Office 365): Pre-built software accessed via the cloud. 65 | 66 | --- 67 | 68 | #### Azure vs. AWS: Key Concepts (15 minutes) 69 | - **Azure**: 70 | - Focus on hybrid cloud and enterprise solutions. 71 | - Tight integration with Microsoft tools (e.g., Power BI, Office 365). 72 | - **AWS**: 73 | - Largest cloud provider with a wide range of services. 74 | - Strong presence in startups and tech-first organizations. 75 | 76 | | Feature | Azure | AWS | 77 | |-----------------|---------------------------------|----------------------------------| 78 | | Compute | Azure VMs, Azure Kubernetes | EC2, Lambda, ECS | 79 | | Storage | Blob Storage, Data Lake | S3, Glacier | 80 | | Databases | Azure SQL, Cosmos DB | RDS, DynamoDB | 81 | | Analytics | Azure Synapse, Databricks | Redshift, EMR, Athena | 82 | 83 | --- 84 | 85 | #### Core Services for Data Engineering (20 minutes) 86 | **Compute Services**: 87 | - Azure: 88 | - **Azure Virtual Machines**: Scalable virtual servers. 89 | - **Azure Databricks**: Apache Spark-based analytics. 90 | - AWS: 91 | - **EC2**: Elastic Compute Cloud for scalable servers. 92 | - **EMR**: Managed Hadoop/Spark for big data processing. 93 | 94 | **Storage Services**: 95 | - Azure: 96 | - **Blob Storage**: Unstructured data storage. 97 | - **Data Lake Storage**: Analytics-optimized storage. 98 | - AWS: 99 | - **S3**: Highly available object storage. 100 | - **Glacier**: Long-term archival storage. 101 | 102 | **Database Services**: 103 | - Azure: 104 | - **Azure SQL Database**: Managed relational database. 105 | - **Cosmos DB**: Globally distributed, multi-model database. 106 | - AWS: 107 | - **RDS**: Managed relational databases. 108 | - **DynamoDB**: NoSQL database with high performance. 109 | 110 | --- 111 | 112 | #### Hands-On Lab: Setting Up a Cloud Environment (30 minutes) 113 | 114 | **Step 1**: Create Free Accounts 115 | 1. **Azure**: 116 | - Visit [Azure Free Account](https://azure.microsoft.com/free/). 117 | - Sign up with a Microsoft account. 118 | - Activate $200 free credit. 119 | 120 | 2. **AWS**: 121 | - Visit [AWS Free Tier](https://aws.amazon.com/free/). 122 | - Sign up with email and billing details. 123 | - Activate free-tier services. 124 | 125 | **Step 2**: Configure Basic Cloud Storage 126 | 1. **Azure Blob Storage**: 127 | - Navigate to **Storage Accounts** in Azure Portal. 128 | - Create a new storage account. 129 | - Upload a sample CSV file and view its properties. 130 | 131 | 2. **AWS S3 Bucket**: 132 | - Navigate to **S3** in AWS Console. 133 | - Create a new S3 bucket. 134 | - Upload a sample CSV file and view its properties. 
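If you want to verify the upload programmatically (a preview of Step 3 below), here is a minimal boto3 sketch. The bucket name is a placeholder and the snippet assumes your AWS credentials are already configured locally (`aws configure`); the Azure side is analogous with the `azure-storage-blob` package.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-dataeng-lab-bucket"  # placeholder: replace with the bucket you created above

# Upload the local sample.csv as an object called sample.csv
s3.upload_file("sample.csv", bucket, "sample.csv")

# Read the object back and print the first few lines to confirm the round trip
body = s3.get_object(Bucket=bucket, Key="sample.csv")["Body"].read().decode("utf-8")
print(body.splitlines()[:5])
```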
135 | 136 | **Step 3**: Retrieve and Use Data 137 | - Use Python or CLI tools to retrieve the uploaded file: 138 | - Azure: `azure-storage-blob` Python SDK. 139 | - AWS: `boto3` Python SDK. 140 | 141 | --- 142 | 143 | #### Q&A and Wrap-Up (15 minutes) 144 | - **Discussion Points**: 145 | - How to choose between Azure and AWS for specific use cases? 146 | - Best practices for managing costs in cloud platforms. 147 | - Common challenges faced by data engineers in cloud environments. 148 | - **Resources for Further Learning**: 149 | - Azure: Microsoft Learn - [Azure Fundamentals](https://learn.microsoft.com/en-us/azure/) 150 | - AWS: AWS Training - [AWS Fundamentals](https://aws.amazon.com/training/) 151 | 152 | --- 153 | 154 | ### Key Takeaways 155 | - Cloud computing provides scalable, cost-effective infrastructure and tools for data engineering. 156 | - Azure and AWS are the leading platforms, each with unique strengths. 157 | - Hands-on experience is crucial to understanding and leveraging cloud services. 158 | 159 | 160 | 161 | ###Bonus 162 | #### *Additional Notes and Tips for AWS Tools for Data Engineering.* 163 | 164 | AWS provides a rich ecosystem of tools and services tailored for data engineering tasks. Below is an expanded overview, along with notes and tips to maximize their utility. 165 | 166 | --- 167 | 168 | #### **Storage Services** 169 | 1. **Amazon S3 (Simple Storage Service)** 170 | - **Purpose**: Scalable object storage for raw, processed, and archived data. 171 | - **Common Use Cases**: 172 | - Data lakes. 173 | - Backup and disaster recovery. 174 | - Hosting static files. 175 | - **Tips**: 176 | - Use **S3 Lifecycle Policies** to move infrequently accessed data to cheaper storage classes (e.g., Glacier, Intelligent-Tiering). 177 | - Enable **versioning** to maintain file history and prevent accidental data loss. 178 | - Use **S3 Select** to query and retrieve subsets of data from objects directly, reducing data transfer costs. 179 | - Encrypt sensitive data using **SSE (Server-Side Encryption)** or **client-side encryption**. 180 | 181 | 2. **AWS Glue Data Catalog** 182 | - **Purpose**: Centralized metadata repository for datasets stored in S3 or other sources. 183 | - **Common Use Cases**: 184 | - Schema management for data lakes. 185 | - Integration with Athena, Redshift Spectrum, and EMR. 186 | - **Tips**: 187 | - Use AWS Glue Crawlers to automate schema detection for datasets. 188 | - Ensure proper IAM roles are configured for Glue to access S3 buckets. 189 | 190 | --- 191 | 192 | #### **Compute Services** 193 | 1. **Amazon EC2 (Elastic Compute Cloud)** 194 | - **Purpose**: General-purpose virtual servers. 195 | - **Common Use Cases**: 196 | - Hosting custom data pipelines. 197 | - Running one-time ETL jobs or long-running services. 198 | - **Tips**: 199 | - Use **Spot Instances** for cost savings on workloads that tolerate interruptions. 200 | - Implement **auto-scaling** to handle varying workloads. 201 | 202 | 2. **AWS Lambda** 203 | - **Purpose**: Serverless compute for event-driven processing. 204 | - **Common Use Cases**: 205 | - Lightweight ETL transformations. 206 | - Real-time data processing (e.g., responding to events from S3 or Kinesis). 207 | - **Tips**: 208 | - Keep Lambda functions small and focused on single tasks. 209 | - Optimize performance by minimizing package size and reusing connections. 210 | 211 | 3. **AWS EMR (Elastic MapReduce)** 212 | - **Purpose**: Managed Hadoop and Spark framework for big data processing. 
213 | - **Common Use Cases**: 214 | - Batch processing of large datasets. 215 | - Running machine learning models on large-scale data. 216 | - **Tips**: 217 | - Use **Spot Instances** with EMR to reduce costs. 218 | - Leverage **EMR File System (EMRFS)** for better integration with S3. 219 | 220 | --- 221 | 222 | #### **Database Services** 223 | 1. **Amazon Redshift** 224 | - **Purpose**: Managed data warehouse for OLAP workloads. 225 | - **Common Use Cases**: 226 | - Aggregating and analyzing large datasets. 227 | - Business intelligence and reporting. 228 | - **Tips**: 229 | - Use **Redshift Spectrum** to query data directly from S3 without loading it into Redshift. 230 | - Monitor and optimize queries using the **Query Monitoring Rules** feature. 231 | - Compress data using columnar storage formats like Parquet or ORC to improve query performance. 232 | 233 | 2. **Amazon DynamoDB** 234 | - **Purpose**: NoSQL database for key-value and document storage. 235 | - **Common Use Cases**: 236 | - Low-latency, high-throughput applications (e.g., user session storage). 237 | - Storing metadata or logs for data pipelines. 238 | - **Tips**: 239 | - Enable **DynamoDB Streams** for change data capture and event-driven workflows. 240 | - Use the **on-demand capacity mode** for unpredictable workloads to avoid over-provisioning. 241 | 242 | 3. **Amazon RDS (Relational Database Service)** 243 | - **Purpose**: Managed relational database with support for MySQL, PostgreSQL, Oracle, and SQL Server. 244 | - **Common Use Cases**: 245 | - Storing structured, transactional data. 246 | - Serving as a staging area for ETL workflows. 247 | - **Tips**: 248 | - Enable **Multi-AZ deployments** for high availability. 249 | - Use **Read Replicas** to offload read-heavy workloads. 250 | 251 | --- 252 | 253 | #### **Data Analytics Tools** 254 | 1. **Amazon Athena** 255 | - **Purpose**: Serverless query engine for analyzing data in S3 using SQL. 256 | - **Common Use Cases**: 257 | - Interactive exploration of data lakes. 258 | - Quick validation of ETL pipeline outputs. 259 | - **Tips**: 260 | - Use columnar formats like Parquet or ORC for faster queries. 261 | - Partition your data to reduce query costs. 262 | 263 | 2. **AWS Glue** 264 | - **Purpose**: Serverless ETL service. 265 | - **Common Use Cases**: 266 | - Cleaning and transforming datasets for downstream consumption. 267 | - Automating ETL workflows. 268 | - **Tips**: 269 | - Use **job bookmarks** to handle incremental data loads. 270 | - Test transformations locally using the AWS Glue Docker image. 271 | 272 | 3. **Amazon QuickSight** 273 | - **Purpose**: BI and data visualization tool. 274 | - **Common Use Cases**: 275 | - Creating dashboards for stakeholders. 276 | - Visualizing insights from Athena or Redshift queries. 277 | - **Tips**: 278 | - Leverage SPICE (Super-fast, Parallel, In-memory Calculation Engine) for faster dashboard performance. 279 | 280 | --- 281 | 282 | #### **Data Streaming Services** 283 | 1. **Amazon Kinesis** 284 | - **Purpose**: Platform for collecting, processing, and analyzing real-time data streams. 285 | - **Common Use Cases**: 286 | - IoT data ingestion. 287 | - Real-time log processing. 288 | - **Tips**: 289 | - Use **Kinesis Data Firehose** to automatically load streaming data into S3, Redshift, or Elasticsearch. 290 | - Monitor and scale Kinesis streams using **CloudWatch metrics**. 291 | 292 | 2. **AWS Managed Streaming for Apache Kafka (MSK)** 293 | - **Purpose**: Managed Apache Kafka service for real-time data processing. 
294 | - **Common Use Cases**: 295 | - Message brokering between services in a pipeline. 296 | - Event-driven architectures. 297 | - **Tips**: 298 | - Use Kafka connectors for seamless integration with AWS services like S3 or DynamoDB. 299 | - Optimize partitions and replication settings to balance performance and fault tolerance. 300 | 301 | --- 302 | 303 | #### **Data Security and Governance Tools** 304 | 1. **AWS IAM (Identity and Access Management)** 305 | - **Purpose**: Manage user permissions and access to AWS resources. 306 | - **Tips**: 307 | - Apply the **principle of least privilege** when assigning roles. 308 | - Use **IAM Policies** to define resource-level access. 309 | 310 | 2. **AWS Lake Formation** 311 | - **Purpose**: Simplify data lake creation with built-in governance. 312 | - **Tips**: 313 | - Define granular access policies using Lake Formation permissions. 314 | - Integrate with Glue Data Catalog for seamless schema management. 315 | 316 | 3. **AWS CloudTrail** 317 | - **Purpose**: Track user activity and API usage across AWS. 318 | - **Tips**: 319 | - Enable CloudTrail logs for all accounts to improve auditability. 320 | - Store logs in an S3 bucket for long-term analysis. 321 | 322 | --- 323 | 324 | ### Additional Resources 325 | - [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/) 326 | - [AWS Big Data Blog](https://aws.amazon.com/big-data/) 327 | - [Hands-On Labs for AWS](https://www.qwiklabs.com/) 328 | 329 | 330 | -------------------------------------------------------------------------------- /PYTHON/chapter_2.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: Variables, Expressions and Statements 2 | 3 | One of the most powerful features of a programming language is the ability to manipulate variables. A variable is a name that refers to a value. 4 | 5 | ## 2.1 Assignment Statements 6 | 7 | An assignment statement creates a new variable and gives it a value: 8 | 9 | ```python 10 | >>> message = 'And now for something completely different' 11 | >>> n = 17 12 | >>> pi = 3.1415926535897932 13 | ``` 14 | 15 | This example makes three assignments. The first assigns a string to a new variable named `message`; the second gives the integer 17 to `n`; the third assigns the (approximate) value of π to `pi`. 16 | 17 | A common way to represent variables on paper is to write the name with an arrow pointing to its value. This kind of figure is called a **state diagram** because it shows what state each of the variables is in (think of it as the variable's state of mind). 18 | 19 | ``` 20 | message ───────────────────→ 'And now for something completely different' 21 | n ──────────────────────────→ 17 22 | pi ─────────────────────────→ 3.1415926535897932 23 | ``` 24 | 25 | ## 2.2 Variable Names 26 | 27 | Programmers generally choose names for their variables that are meaningful—they document what the variable is used for. 28 | 29 | Variable names can be as long as you like. They can contain both letters and numbers, but they can't begin with a number. It is legal to use uppercase letters, but it is conventional to use only lower case for variable names. 30 | 31 | The underscore character, `_`, can appear in a name. It is often used in names with multiple words, such as `your_name` or `airspeed_of_unladen_swallow`. 
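For instance, all of these are legal, conventional variable names (the values are arbitrary and only there to complete the assignments):

```python
>>> your_name = 'Grace'
>>> airspeed_of_unladen_swallow = 11   # any value works; the point is the name
```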
32 | 33 | If you give a variable an illegal name, you get a syntax error: 34 | 35 | ```python 36 | >>> 76trombones = 'big parade' 37 | SyntaxError: invalid syntax 38 | >>> more@ = 1000000 39 | SyntaxError: invalid syntax 40 | >>> class = 'Advanced Theoretical Zymurgy' 41 | SyntaxError: invalid syntax 42 | ``` 43 | 44 | - `76trombones` is illegal because it begins with a number 45 | - `more@` is illegal because it contains an illegal character, `@` 46 | - `class` is illegal because it's a Python keyword 47 | 48 | ### Python Keywords 49 | 50 | The interpreter uses keywords to recognize the structure of the program, and they cannot be used as variable names. Python 3 has these keywords: 51 | 52 | | | | | | | 53 | |---------|----------|---------|----------|-------| 54 | | False | class | finally | is | return| 55 | | None | continue | for | lambda | try | 56 | | True | def | from | nonlocal | while | 57 | | and | del | global | not | with | 58 | | as | elif | if | or | yield | 59 | | assert | else | import | pass | | 60 | | break | except | in | raise | | 61 | 62 | You don't have to memorize this list. In most development environments, keywords are displayed in a different color; if you try to use one as a variable name, you'll know. 63 | 64 | ## 2.3 Expressions and Statements 65 | 66 | An **expression** is a combination of values, variables, and operators. A value all by itself is considered an expression, and so is a variable, so the following are all legal expressions: 67 | 68 | ```python 69 | >>> 42 70 | 42 71 | >>> n 72 | 17 73 | >>> n + 25 74 | 42 75 | ``` 76 | 77 | When you type an expression at the prompt, the interpreter **evaluates** it, which means that it finds the value of the expression. In this example, `n` has the value 17 and `n + 25` has the value 42. 78 | 79 | A **statement** is a unit of code that has an effect, like creating a variable or displaying a value. 80 | 81 | ```python 82 | >>> n = 17 83 | >>> print(n) 84 | ``` 85 | 86 | The first line is an assignment statement that gives a value to `n`. The second line is a print statement that displays the value of `n`. 87 | 88 | When you type a statement, the interpreter **executes** it, which means that it does whatever the statement says. In general, statements don't have values. 89 | 90 | ## 2.4 Script Mode 91 | 92 | So far we have run Python in **interactive mode**, which means that you interact directly with the interpreter. Interactive mode is a good way to get started, but if you are working with more than a few lines of code, it can be clumsy. 93 | 94 | The alternative is to save code in a file called a **script** and then run the interpreter in **script mode** to execute the script. By convention, Python scripts have names that end with `.py`. 95 | 96 | ### Differences Between Interactive and Script Mode 97 | 98 | Because Python provides both modes, you can test bits of code in interactive mode before you put them in a script. But there are differences between interactive mode and script mode that can be confusing. 99 | 100 | For example, if you are using Python as a calculator, you might type: 101 | 102 | ```python 103 | >>> miles = 26.2 104 | >>> miles * 1.61 105 | 42.182 106 | ``` 107 | 108 | The first line assigns a value to `miles`, but it has no visible effect. The second line is an expression, so the interpreter evaluates it and displays the result. It turns out that a marathon is about 42 kilometers. 109 | 110 | But if you type the same code into a script and run it, you get no output at all. 
In script mode an expression, all by itself, has no visible effect. Python evaluates the expression, but it doesn't display the result. To display the result, you need a print statement like this: 111 | 112 | ```python 113 | miles = 26.2 114 | print(miles * 1.61) 115 | ``` 116 | 117 | **Try this:** Type the following statements in the Python interpreter and see what they do: 118 | ```python 119 | 5 120 | x = 5 121 | x + 1 122 | ``` 123 | 124 | Now put the same statements in a script and run it. What is the output? Modify the script by transforming each expression into a print statement and then run it again. 125 | 126 | ## 2.5 Order of Operations 127 | 128 | When an expression contains more than one operator, the order of evaluation depends on the **order of operations**. For mathematical operators, Python follows mathematical convention. The acronym **PEMDAS** is a useful way to remember the rules: 129 | 130 | 1. **Parentheses** have the highest precedence and can be used to force an expression to evaluate in the order you want. Since expressions in parentheses are evaluated first, `2 * (3-1)` is 4, and `(1+1)**(5-2)` is 8. You can also use parentheses to make an expression easier to read, as in `(minute * 100) / 60`, even if it doesn't change the result. 131 | 132 | 2. **Exponentiation** has the next highest precedence, so `1 + 2**3` is 9, not 27, and `2 * 3**2` is 18, not 36. 133 | 134 | 3. **Multiplication and Division** have higher precedence than Addition and Subtraction. So `2*3-1` is 5, not 4, and `6+4/2` is 8, not 5. 135 | 136 | 4. **Operators with the same precedence** are evaluated from left to right (except exponentiation). So in the expression `degrees / 2 * pi`, the division happens first and the result is multiplied by pi. To divide by 2π, you can use parentheses or write `degrees / 2 / pi`. 137 | 138 | **Tip:** If you can't tell the order of operations by looking at an expression, use parentheses to make it obvious. 139 | 140 | ## 2.6 String Operations 141 | 142 | In general, you can't perform mathematical operations on strings, even if the strings look like numbers, so the following are illegal: 143 | 144 | ```python 145 | 'chinese'-'food' # Illegal 146 | 'eggs'/'easy' # Illegal 147 | 'third'*'a charm' # Illegal (wrong operand type) 148 | ``` 149 | 150 | But there are two exceptions, `+` and `*`. 151 | 152 | ### String Concatenation 153 | 154 | The `+` operator performs **string concatenation**, which means it joins the strings by linking them end-to-end. For example: 155 | 156 | ```python 157 | >>> first = 'throat' 158 | >>> second = 'warbler' 159 | >>> first + second 160 | 'throatwarbler' 161 | ``` 162 | 163 | ### String Repetition 164 | 165 | The `*` operator also works on strings; it performs **repetition**. For example, `'Spam'*3` is `'SpamSpamSpam'`. If one of the values is a string, the other has to be an integer. 166 | 167 | ```python 168 | >>> 'Spam' * 3 169 | 'SpamSpamSpam' 170 | >>> 4 * 'Na' 171 | 'NaNaNaNa' 172 | ``` 173 | 174 | This use of `+` and `*` makes sense by analogy with addition and multiplication. Just as `4*3` is equivalent to `4+4+4`, we expect `'Spam'*3` to be the same as `'Spam'+'Spam'+'Spam'`, and it is. 175 | 176 | **Think about it:** There is a significant way in which string concatenation and repetition are different from integer addition and multiplication. Can you think of a property that addition has that string concatenation does not? 
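If you want a nudge, one property you can test directly at the prompt is commutativity: whether swapping the operands changes the result. (This is a hint, not the only possible answer.)

```python
>>> 4 + 3 == 3 + 4
True
>>> 'throat' + 'warbler' == 'warbler' + 'throat'
False
```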
177 | 178 | ## 2.7 Comments 179 | 180 | As programs get bigger and more complicated, they get more difficult to read. Formal languages are dense, and it is often difficult to look at a piece of code and figure out what it is doing, or why. 181 | 182 | For this reason, it is a good idea to add notes to your programs to explain in natural language what the program is doing. These notes are called **comments**, and they start with the `#` symbol: 183 | 184 | ```python 185 | # compute the percentage of the hour that has elapsed 186 | percentage = (minute * 100) / 60 187 | ``` 188 | 189 | In this case, the comment appears on a line by itself. You can also put comments at the end of a line: 190 | 191 | ```python 192 | percentage = (minute * 100) / 60 # percentage of an hour 193 | ``` 194 | 195 | Everything from the `#` to the end of the line is ignored—it has no effect on the execution of the program. 196 | 197 | ### Writing Good Comments 198 | 199 | Comments are most useful when they document non-obvious features of the code. It is reasonable to assume that the reader can figure out what the code does; it is more useful to explain why. 200 | 201 | **Bad comment (redundant with the code and useless):** 202 | ```python 203 | v = 5 # assign 5 to v 204 | ``` 205 | 206 | **Good comment (contains useful information not in the code):** 207 | ```python 208 | v = 5 # velocity in meters/second 209 | ``` 210 | 211 | Good variable names can reduce the need for comments, but long names can make complex expressions hard to read, so there is a tradeoff. 212 | 213 | ## 2.8 Debugging 214 | 215 | Three kinds of errors can occur in a program: **syntax errors**, **runtime errors**, and **semantic errors**. It is useful to distinguish between them in order to track them down more quickly. 216 | 217 | ### Syntax Error 218 | "Syntax" refers to the structure of a program and the rules about that structure. For example, parentheses have to come in matching pairs, so `(1 + 2)` is legal, but `8)` is a syntax error. 219 | 220 | If there is a syntax error anywhere in your program, Python displays an error message and quits, and you will not be able to run the program. During the first few weeks of your programming career, you might spend a lot of time tracking down syntax errors. As you gain experience, you will make fewer errors and find them faster. 221 | 222 | ### Runtime Error 223 | The second type of error is a runtime error, so called because the error does not appear until after the program has started running. These errors are also called **exceptions** because they usually indicate that something exceptional (and bad) has happened. 224 | 225 | Runtime errors are rare in the simple programs you will see in the first few chapters, so it might be a while before you encounter one. 226 | 227 | ### Semantic Error 228 | The third type of error is "semantic", which means related to meaning. If there is a semantic error in your program, it will run without generating error messages, but it will not do the right thing. It will do something else. Specifically, it will do what you told it to do. 229 | 230 | Identifying semantic errors can be tricky because it requires you to work backward by looking at the output of the program and trying to figure out what it is doing. 
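To make the distinction concrete, here is one tiny, made-up snippet per kind of error. The first two are commented out so the block as a whole still runs; only the last one executes, and it silently produces the wrong answer:

```python
# Syntax error: Python refuses to run the program at all
# print('hello'            # missing closing parenthesis

# Runtime error (exception): the program starts, then stops with an error message
# print(1 / 0)             # ZeroDivisionError

# Semantic error: the program runs happily but does the wrong thing
minutes = 90
hours = minutes * 60       # wrong: should be minutes / 60
print(hours)               # prints 5400 instead of 1.5
```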
231 | 232 | ## Glossary 233 | 234 | - **variable:** A name that refers to a value 235 | - **assignment:** A statement that assigns a value to a variable 236 | - **state diagram:** A graphical representation of a set of variables and the values they refer to 237 | - **keyword:** A reserved word that is used to parse a program; you cannot use keywords like `if`, `def`, and `while` as variable names 238 | - **operand:** One of the values on which an operator operates 239 | - **expression:** A combination of variables, operators, and values that represents a single result 240 | - **evaluate:** To simplify an expression by performing the operations in order to yield a single value 241 | - **statement:** A section of code that represents a command or action. So far, the statements we have seen are assignments and print statements 242 | - **execute:** To run a statement and do what it says 243 | - **interactive mode:** A way of using the Python interpreter by typing code at the prompt 244 | - **script mode:** A way of using the Python interpreter to read code from a script and run it 245 | - **script:** A program stored in a file 246 | - **order of operations:** Rules governing the order in which expressions involving multiple operators and operands are evaluated 247 | - **concatenate:** To join two operands end-to-end 248 | - **comment:** Information in a program that is meant for other programmers (or anyone reading the source code) and has no effect on the execution of the program 249 | - **syntax error:** An error in a program that makes it impossible to parse (and therefore impossible to interpret) 250 | - **exception:** An error that is detected while the program is running 251 | - **semantics:** The meaning of a program 252 | - **semantic error:** An error in a program that makes it do something other than what the programmer intended 253 | 254 | ## Exercises 255 | 256 | ### Exercise 2.1 257 | Whenever you learn a new feature, you should try it out in interactive mode and make errors on purpose to see what goes wrong. 258 | 259 | - We've seen that `n = 42` is legal. What about `42 = n`? 260 | - How about `x = y = 1`? 261 | - In some languages every statement ends with a semi-colon, `;`. What happens if you put a semi-colon at the end of a Python statement? 262 | - What if you put a period at the end of a statement? 263 | - In math notation you can multiply x and y like this: xy. What happens if you try that in Python? 264 | 265 | ### Exercise 2.2 266 | Practice using the Python interpreter as a calculator: 267 | 268 | 1. The volume of a sphere with radius r is $\frac{4}{3}\pi r^3$. What is the volume of a sphere with radius 5? 269 | 270 | 2. Suppose the cover price of a book is $24.95, but bookstores get a 40% discount. Shipping costs $3 for the first copy and 75 cents for each additional copy. What is the total wholesale cost for 60 copies? 271 | 272 | 3. If I leave my house at 6:52 am and run 1 mile at an easy pace (8:15 per mile), then 3 miles at tempo (7:12 per mile) and 1 mile at easy pace again, what time do I get home for breakfast? 273 | -------------------------------------------------------------------------------- /data_lake.md: -------------------------------------------------------------------------------- 1 | # Comprehensive Guide to Data Lakes 2 | 3 | ## Table of Contents 4 | 1. [What is a Data Lake?](#what-is-a-data-lake) 5 | 2. [Why Do You Need a Data Lake?](#why-do-you-need-a-data-lake) 6 | 3. [Core Characteristics](#core-characteristics) 7 | 4. 
[Data Lake vs Data Warehouse vs Data Lakehouse](#data-lake-vs-data-warehouse-vs-data-lakehouse) 8 | 5. [Essential Components of Data Lake Architecture](#essential-components-of-data-lake-architecture) 9 | 6. [Common Use Cases](#common-use-cases) 10 | 7. [Benefits and Challenges](#benefits-and-challenges) 11 | 8. [Popular Technologies](#popular-technologies) 12 | 9. [Best Practices](#best-practices) 13 | 10. [Conclusion](#conclusion) 14 | 15 | ## What is a Data Lake? 16 | 17 | A **data lake** is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. Unlike traditional databases or data warehouses, data lakes enable you to store data in its native, raw format without requiring a predefined schema. 18 | 19 | ### Key Definition Points: 20 | - **Centralized storage**: Single location for all organizational data 21 | - **Any scale**: From gigabytes to petabytes 22 | - **Raw format**: Data stored as-is, without preprocessing 23 | - **Schema-on-read**: Structure applied when data is accessed, not when stored 24 | - **Multi-format support**: Handles structured, semi-structured, and unstructured data 25 | 26 | ## Why Do You Need a Data Lake? 27 | 28 | Organizations implementing modern data architectures, including data lakes, demonstrate measurable advantages in operational efficiency and revenue growth. Research shows that more than half of enterprises have implemented data lakes, with another 22% planning implementation within 36 months. 29 | 30 | ### Business Value: 31 | - **Faster decision-making**: Advanced analytics across diverse data sources 32 | - **Personalized experiences**: Comprehensive customer data analysis 33 | - **Operational optimization**: Predictive maintenance and efficiency improvements 34 | - **Competitive advantage**: Early identification of revenue opportunities 35 | - **Cost efficiency**: Leverage inexpensive object storage and open formats 36 | 37 | ## Core Characteristics 38 | 39 | ### 1. Scalability 40 | - Built to scale horizontally 41 | - Cloud-based object storage solutions (Amazon S3, Azure Data Lake Storage) 42 | - Growth from terabytes to petabytes without capacity concerns 43 | 44 | ### 2. Schema-on-Read 45 | - No predefined schema required at ingestion 46 | - Flexibility to apply different schemas based on use case 47 | - Structure determined during data retrieval or transformation 48 | 49 | ### 3. Raw Data Storage 50 | - Retains all data types in native format 51 | - Preserves complete data potential for future analysis 52 | - No upfront data transformation requirements 53 | 54 | ### 4. 
Diverse Data Type Support 55 | - **Structured**: SQL tables, CSV files 56 | - **Semi-structured**: JSON, XML, logs 57 | - **Unstructured**: Images, videos, audio, documents, PDFs 58 | 59 | ## Data Lake vs Data Warehouse vs Data Lakehouse 60 | 61 | | Feature | Data Lake | Data Lakehouse | Data Warehouse | 62 | |---------|-----------|----------------|----------------| 63 | | **Data Types** | All types (structured, semi-structured, unstructured) | All types (structured, semi-structured, unstructured) | Structured data only | 64 | | **Cost** | $ (Low) | $ (Low) | $$$ (High) | 65 | | **Format** | Open format | Open format | Closed, proprietary | 66 | | **Scalability** | Scales at low cost regardless of type | Scales at low cost regardless of type | Exponentially expensive scaling | 67 | | **Schema** | Schema-on-read | Schema-on-read with governance | Schema-on-write | 68 | | **Performance** | Variable, depends on compute engine | High performance | Optimized for fast SQL queries | 69 | | **Intended Users** | Data scientists | Data analysts, scientists, ML engineers | Data analysts | 70 | | **Reliability** | Low quality (data swamp risk) | High quality, reliable | High quality, reliable | 71 | | **Use Cases** | ML, big data, raw storage | Unified analytics, BI, ML | BI, structured analytics | 72 | 73 | ## Essential Components of Data Lake Architecture 74 | 75 | ### 1. Data Ingestion Layer 76 | Brings data from various sources into the data lake. 77 | 78 | **Ingestion Modes:** 79 | - **Batch ingestion**: Periodic loading (nightly, hourly) 80 | - **Stream ingestion**: Real-time data flows 81 | - **Hybrid ingestion**: Combination of batch and stream 82 | 83 | **Popular Tools:** 84 | - Apache Kafka, AWS Kinesis (streaming) 85 | - Apache NiFi, Flume, AWS Glue (batch ETL) 86 | 87 | ### 2. Storage Layer 88 | Built on cloud object storage for elastic scaling and cost efficiency. 89 | 90 | **Key Features:** 91 | - Durability and availability with automatic replication 92 | - Separation of storage and compute 93 | - Data tiering (hot, warm, cold storage) 94 | 95 | **Popular Options:** 96 | - Amazon S3 97 | - Azure Data Lake Storage 98 | - Google Cloud Storage 99 | - MinIO (on-premise) 100 | 101 | ### 3. Catalog and Metadata Management 102 | Prevents data lakes from becoming "data swamps" by maintaining organization. 103 | 104 | **Manages:** 105 | - Data schema and location 106 | - Partitioning information 107 | - Data lineage and versioning 108 | - Search and discovery capabilities 109 | 110 | **Tools:** 111 | - AWS Glue Data Catalog 112 | - Apache Hive Metastore 113 | - Apache Atlas 114 | - DataHub 115 | 116 | ### 4. Processing and Analytics Layer 117 | Transforms raw data into insights through various operations. 118 | 119 | **Capabilities:** 120 | - ETL/ELT pipelines 121 | - SQL querying 122 | - Machine learning pipelines 123 | - Real-time and batch processing 124 | 125 | ### 5. Security and Governance 126 | Protects sensitive data and ensures compliance. 127 | 128 | **Essential Features:** 129 | - Identity and Access Management (IAM) 130 | - Encryption (in-transit and at-rest) 131 | - Data masking and anonymization 132 | - Auditing and monitoring 133 | 134 | **Tools:** 135 | - AWS Lake Formation 136 | - Apache Ranger 137 | - Azure Purview 138 | 139 | ## Common Use Cases 140 | 141 | ### 1. 
Big Data Analytics 142 | - Historical and real-time data analysis 143 | - Cross-departmental analytics with single source of truth 144 | - Petabyte-scale dataset queries 145 | - Custom analytics on raw, unprocessed data 146 | 147 | ### 2. Machine Learning and AI 148 | - Multi-format training dataset storage 149 | - Raw data preservation for ML experimentation 150 | - Automated ML pipeline support 151 | - Enhanced model accuracy through comprehensive data access 152 | 153 | ### 3. Centralized Data Archiving 154 | - Long-term storage for compliance and auditing 155 | - Cost-effective historical data retention 156 | - Trend analysis and forecasting 157 | - Future ML model training preparation 158 | 159 | ### 4. Data Science Experimentation 160 | - Exploratory data analysis (EDA) 161 | - Hypothesis testing and prototyping 162 | - Unconstrained access to raw datasets 163 | - Innovation without data engineering dependencies 164 | 165 | ### 5. Improved Customer Interactions 166 | Combine data from multiple sources: 167 | - CRM platforms 168 | - Social media analytics 169 | - Marketing platforms with purchase history 170 | - Customer service interactions 171 | 172 | ### 6. R&D Innovation Support 173 | - Hypothesis testing and assumption refinement 174 | - Material selection for product design 175 | - Genomic research for medication development 176 | - Customer willingness-to-pay analysis 177 | 178 | ### 7. Operational Efficiency 179 | - IoT device data collection and analysis 180 | - Manufacturing process optimization 181 | - Predictive maintenance 182 | - Cost reduction and quality improvement 183 | 184 | ## Benefits and Challenges 185 | 186 | ### Benefits 187 | 188 | #### Flexibility and Scalability 189 | - No upfront schema requirements 190 | - Effortless scaling from gigabytes to petabytes 191 | - Cloud-native storage cost efficiency 192 | - Decoupled compute and storage architecture 193 | 194 | #### Comprehensive Data Support 195 | - All data types in single platform 196 | - Raw, unprocessed data preservation 197 | - Enhanced analytics capabilities 198 | - Cross-team collaboration improvement 199 | 200 | #### Cost and Performance 201 | - Significantly cheaper than traditional databases 202 | - Independent scaling of analytics workloads 203 | - Elimination of data silos 204 | - Improved decision-making through comprehensive analysis 205 | 206 | ### Challenges 207 | 208 | #### Data Quality and Organization 209 | - **Data swamp risk**: Without proper governance, becomes unusable 210 | - **Lack of structure**: Difficult to query and document 211 | - **Quality issues**: Poor data may go undetected 212 | - **Metadata gaps**: Users may not find or understand available data 213 | 214 | #### Governance and Security 215 | - **Complex governance**: Access control and compliance challenges 216 | - **Security concerns**: Protecting sensitive data in flexible environment 217 | - **Performance issues**: Traditional query engines slow on large datasets 218 | - **Reliability problems**: Difficulty combining batch and streaming data 219 | 220 | ### Mitigation Strategies 221 | 222 | #### Governance and Organization 223 | - Implement comprehensive data catalogs 224 | - Use standardized naming and folder structures 225 | - Apply data validation and profiling tools 226 | - Automate lifecycle management policies 227 | 228 | #### Security and Performance 229 | - Robust access control and encryption 230 | - Role-based access management 231 | - Regular data quality monitoring 232 | - Performance optimization through 
proper partitioning 233 | 234 | ## Popular Technologies 235 | 236 | ### Cloud-Native Solutions 237 | 238 | #### Amazon Web Services (AWS) 239 | - **Amazon S3**: Scalable object storage 240 | - **AWS Lake Formation**: Permissions, cataloging, governance 241 | - **AWS Glue**: ETL and data cataloging 242 | - **Amazon Athena**: SQL queries on S3 data 243 | 244 | #### Microsoft Azure 245 | - **Azure Data Lake Storage**: HDFS-like capabilities with blob storage 246 | - **Azure Synapse Analytics**: Integrated analytics service 247 | - **Azure Purview**: Data governance and cataloging 248 | 249 | #### Google Cloud Platform (GCP) 250 | - **Google Cloud Storage**: Durable object storage 251 | - **BigQuery**: Data warehouse with lake capabilities 252 | - **Vertex AI**: Machine learning platform integration 253 | 254 | ### Open-Source Tools 255 | 256 | #### Storage and Processing 257 | - **Apache Hadoop**: Original distributed data framework 258 | - **Delta Lake**: ACID transactions and versioning for object storage 259 | - **Apache Iceberg**: Table format with atomic operations and time travel 260 | - **Presto**: Distributed SQL query engine 261 | 262 | #### Analytics and ML 263 | - **Apache Spark**: Distributed computing for big data processing 264 | - **Apache Kafka**: Real-time data streaming 265 | - **Jupyter Notebooks**: Interactive data analysis and experimentation 266 | 267 | ### Analytics Platform Integrations 268 | 269 | #### Data Platforms 270 | - **Databricks**: Collaborative workspace with Delta Lake support 271 | - **Snowflake**: Hybrid lakehouse capabilities 272 | - **Confluent**: Enterprise Kafka platform 273 | 274 | #### Business Intelligence 275 | - **Power BI**: Microsoft's business intelligence platform 276 | - **Tableau**: Data visualization and analytics 277 | - **Looker**: Modern BI and data platform 278 | 279 | ## Best Practices 280 | 281 | ### 1. Use Data Lake as Landing Zone 282 | - Store all data without transformation or aggregation 283 | - Preserve raw format for machine learning and lineage 284 | - Maintain complete data history 285 | 286 | ### 2. Implement Data Security 287 | - **Mask PII**: Pseudonymize personally identifiable information 288 | - **Access controls**: Role-based and view-based ACLs 289 | - **Encryption**: Implement both in-transit and at-rest encryption 290 | - **Compliance**: Ensure GDPR and regulatory compliance 291 | 292 | ### 3. Build Reliability and Performance 293 | - **Use Delta Lake**: Brings database-like reliability to data lakes 294 | - **Implement ACID transactions**: Ensure data consistency 295 | - **Optimize partitioning**: Improve query performance 296 | - **Monitor data quality**: Regular validation and profiling 297 | 298 | ### 4. Establish Data Catalog 299 | - **Metadata management**: Track schema, location, and lineage 300 | - **Enable self-service**: Allow users to discover and understand data 301 | - **Document data sources**: Maintain comprehensive data documentation 302 | - **Version control**: Track data and schema changes 303 | 304 | ### 5. Lifecycle Management 305 | - **Automate tiering**: Move old data to cheaper storage tiers 306 | - **Retention policies**: Define and enforce data retention rules 307 | - **Archive management**: Efficient long-term data storage 308 | - **Cleanup procedures**: Remove obsolete or duplicate data 309 | 310 | ### 6. 
Monitoring and Governance 311 | - **Performance monitoring**: Track query performance and resource usage 312 | - **Cost optimization**: Monitor and optimize storage and compute costs 313 | - **Access auditing**: Log and review data access patterns 314 | - **Quality metrics**: Establish and monitor data quality indicators 315 | 316 | ## Conclusion 317 | 318 | Data lakes represent a fundamental shift in how organizations approach data storage and analytics. By providing a flexible, scalable foundation for raw and semi-structured data, they enable advanced analytics, machine learning, and real-time decision-making that wasn't possible with traditional data warehouses alone. 319 | 320 | ### Key Takeaways: 321 | 322 | 1. **Flexibility First**: Data lakes excel when you need to store diverse data types without predefined schemas 323 | 2. **Scale and Cost**: Cloud-native solutions provide virtually unlimited scalability at low cost 324 | 3. **Governance Critical**: Success depends on implementing strong metadata management and governance from the start 325 | 4. **Hybrid Approach**: Many organizations benefit from combining data lakes with data warehouses and lakehouses 326 | 5. **Technology Evolution**: The ecosystem continues to evolve with new tools addressing traditional data lake challenges 327 | 328 | ### When to Choose Data Lakes: 329 | 330 | Data lakes are ideal when your organization: 331 | - Handles complex, diverse, or large-scale data 332 | - Needs to enable faster experimentation and innovation 333 | - Wants to implement advanced analytics and AI/ML initiatives 334 | - Requires cost-effective long-term data storage 335 | - Operates in data-driven industries with rapidly changing requirements 336 | 337 | ### Success Factors: 338 | 339 | - **Start with governance**: Implement cataloging and security from day one 340 | - **Choose the right technology stack**: Align tools with team expertise and organizational needs 341 | - **Plan for growth**: Design architecture that scales with data volume and user needs 342 | - **Invest in training**: Ensure teams understand how to effectively use data lake capabilities 343 | - **Monitor and optimize**: Continuously improve performance, cost, and data quality 344 | 345 | Data lakes are not just a storage solution—they're a foundation for modern, data-driven organizations that want to unlock the full potential of their data assets while maintaining flexibility for future innovations and use cases. 346 | --------------------------------------------------------------------------------