├── Data Governance Frameworks and Data Security Principles.md ├── SQL-PRACTICE-QUESTIONS.md ├── samplejson.json ├── ToolsAndTechnologiesInstallation Guide.md ├── minio_setup.md ├── Day1-WeekOneDayOneGude.md ├── Apache Airflow Operators Guide.md ├── CH02-2025-DE-Capstone-Project.md ├── Data Engineer Apache Kafka Producers.md ├── DATA-PROCESSING.md ├── CoreConceptsDataModeling.md ├── Tuesday-Kafka-Lab.md ├── scrapping.md ├── DATA-MODELING.md ├── introduction-to-Kafka.md ├── MySQLQueryExecutionPlans.md ├── AivenProjectVersionWeekOneProject.md ├── WeekOneProject.md ├── Day3-WeekOneDayThreeClass.md ├── Apache Kafka 101: Apache Kafka for Data Engineering Guide.md ├── Apache Kafka 102: Apache Kafka for Data Engineering Guide.md ├── SQL-Manual.md ├── Apache Airflow 101 Guide.md ├── ETL └── ETL-ELT.md ├── README.md ├── Apache Spark.md ├── GDPR & HIPAA Compliance Guide.md ├── Change Data Capture.md ├── PYTHON ├── chapter_1.md └── chapter_2.md ├── IntroductiontoCloudComputing.md └── data_lake.md /Data Governance Frameworks and Data Security Principles.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /SQL-PRACTICE-QUESTIONS.md: -------------------------------------------------------------------------------- 1 | # SQL Practice Questions 2 | 3 | ## **Section 1: SELECT + ORDER BY + LIMIT + OFFSET** 4 | 1. Retrieve all columns from `customer_info` and sort results **alphabetically** by `full_name`. 5 | 2. Get the top **5 most expensive products** from `products`. 6 | 3. Display products from row **6 to 10** when ordered by `price` in descending order. 7 | 8 | ## **Section 2: WHERE + CASE** 9 | 1. Find all customers located in **Kisumu**. 10 | 2. List products priced between **100 and 500**, and add a column called `price_category` that says `"Low"` if price < 300, `"Medium"` if 300–1000, else `"High"`. 11 | 3. Get all sales where `total_sales` is greater than **1000**, and show `"Big Sale"` or `"Small Sale"` using `CASE`. 12 | 13 | ## **Section 3: JOIN + ORDER BY** 14 | 1. Show `sales_id`, `product_name`, and `full_name` for every sale, ordered by `total_sales` in descending order. 15 | 2. List all products along with their customer's `location`, sorted by location then product name. 16 | 3. Display all sales with `product_name`, `price`, and `full_name`, ordered by `price` from highest to lowest. 17 | 18 | ## **Section 4: GROUP BY + HAVING** 19 | 1. Count how many products each customer owns, and only show customers with **more than 2 products**. 20 | 2. Find the total sales for each product and only include products with sales totaling **over 2000**. 21 | 3. Get the number of customers in each location, sorted by **customer count** in descending order. 
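A sample solution can make the expected answer format clearer. The sketch below answers Section 4, Question 1; the table and column names (`customer_info.customer_id`, `products.product_id`, etc.) are assumed from the wording of the questions and may differ in your actual schema.

```sql
-- Section 4, Q1: customers owning more than 2 products (schema assumed from the questions above)
SELECT c.full_name,
       COUNT(p.product_id) AS product_count
FROM customer_info c
JOIN products p ON p.customer_id = c.customer_id
GROUP BY c.full_name
HAVING COUNT(p.product_id) > 2;
```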
22 | -------------------------------------------------------------------------------- /samplejson.json: -------------------------------------------------------------------------------- 1 | [{"name":"Ivett Latehouse","position":"Computer Systems Analyst II","country":"Ukraine"}, 2 | {"name":"Demetria Ollet","position":"Web Designer III","country":"Indonesia"}, 3 | {"name":"Iolande Ornelas","position":"Clinical Specialist","country":"China"}, 4 | {"name":"Sheila Carty","position":"Web Developer III","country":"Central African Republic"}, 5 | {"name":"Darrelle Novotni","position":"Technical Writer","country":"Philippines"}, 6 | {"name":"Gianna de Glanville","position":"Junior Executive","country":"Portugal"}, 7 | {"name":"Estelle Staite","position":"Web Designer I","country":"Nigeria"}, 8 | {"name":"Rivi Elsmore","position":"Assistant Media Planner","country":"Philippines"}, 9 | {"name":"Sherilyn Paten","position":"Account Coordinator","country":"Pakistan"}, 10 | {"name":"Layton Sweynson","position":"Help Desk Operator","country":"Portugal"}, 11 | {"name":"Ezekiel Carvil","position":"Mechanical Systems Engineer","country":"Indonesia"}, 12 | {"name":"Prescott Dodsley","position":"Software Test Engineer IV","country":"China"}, 13 | {"name":"Durant Steanyng","position":"Chief Design Engineer","country":"United States"}, 14 | {"name":"Eirena Lorey","position":"Nurse","country":"Lithuania"}, 15 | {"name":"Blithe De Brett","position":"Recruiting Manager","country":"Russia"}, 16 | {"name":"Tedi Grogona","position":"Executive Secretary","country":"Japan"}, 17 | {"name":"Doro Swinburne","position":"Accountant I","country":"China"}, 18 | {"name":"Ernesta Cassam","position":"Physical Therapy Assistant","country":"China"}, 19 | {"name":"Gerardo Reide","position":"Automation Specialist I","country":"China"}, 20 | {"name":"Brigida Durgan","position":"Web Developer IV","country":"Armenia"}] -------------------------------------------------------------------------------- /ToolsAndTechnologiesInstallation Guide.md: -------------------------------------------------------------------------------- 1 | 2 | ### **Data Bases.** 3 | 1. **PostgreSQL Installation** 4 | To install PostgreSQL locally on your computer, visit [EnterpriseDB Downloads](https://www.enterprisedb.com/downloads/postgres-postgresql-downloads) and download the latest stable version compatible with your operating system. 5 | 6 | You can use this guide for assistance: [W3Schools PostgreSQL Installation Guide](https://www.w3schools.com/postgresql/postgresql_install.php) 7 | 8 | 2. **MySQL Installation** 9 | To install MySQL locally on your computer, visit [MySQL Downloads](https://dev.mysql.com/downloads/installer/) and download the latest stable version compatible with your operating system. 10 | 11 | You can use this guide for assistance: [W3Schools MySQL Installation Guide](https://www.w3schools.com/mysql/mysql_install_windows.asp) 12 | 13 | 3. **MongoDB Installation** 14 | To install MongoDB locally on your computer, visit [MongoDB Community Edition Downloads](https://www.mongodb.com/try/download/community) and download the latest stable version compatible with your operating system. 15 | 16 | You can use this guide for assistance: [MongoDB Installation Documentation](https://www.mongodb.com/docs/manual/installation/) 17 | 18 | 4. **MongoDB Compass** 19 | MongoDB Compass is a graphical user interface (GUI) for MongoDB. You can download it from [MongoDB Compass Downloads](https://www.mongodb.com/try/download/compass). 
20 | 21 | You can use this guide for assistance: [MongoDB Compass Documentation](https://www.mongodb.com/docs/compass/current/) 22 | 23 | 5. **MongoDB Atlas** 24 | MongoDB Atlas is a cloud database service for MongoDB. You can sign up and create a free cluster at [MongoDB Atlas](https://www.mongodb.com/atlas/database). 25 | 26 | You can use this guide for assistance: [MongoDB Atlas Getting Started](https://www.mongodb.com/docs/atlas/getting-started/) 27 | 28 | -------------------------------------------------------------------------------- /minio_setup.md: -------------------------------------------------------------------------------- 1 | # MinIO Setup Guide 2 | 3 | ### Step 1: Download MinIO Server Binary 4 | 5 | ```bash 6 | wget https://dl.min.io/server/minio/release/linux-amd64/minio 7 | chmod +x minio 8 | sudo mv minio /usr/local/bin/ 9 | ``` 10 | 11 | This downloads and installs the `minio` binary system-wide. 12 | 13 | ### Step 2: Create MinIO Data Directory 14 | 15 | ```bash 16 | mkdir -p ~/minio-data 17 | ``` 18 | 19 | This is where your data lake files (CSV, JSON, Parquet, etc.) will be stored. 20 | 21 | ### Step 3: Run MinIO Server 22 | 23 | ```bash 24 | export MINIO_ROOT_USER=minioadmin 25 | export MINIO_ROOT_PASSWORD=minioadmin 26 | 27 | minio server ~/minio-data --console-address ":9001" 28 | ``` 29 | 30 | - Web UI will be available at: http://localhost:9001 31 | - API endpoint: http://localhost:9000 32 | - Credentials: 33 | - Username: `minioadmin` 34 | - Password: `minioadmin` 35 | 36 | ### Step 4: Access MinIO Web Console 37 | 38 | 1. Go to: http://localhost:9001 39 | 2. Login with: 40 | - Username: `minioadmin` 41 | - Password: `minioadmin` 42 | 3. Create a bucket called `datalake`. 43 | 44 | ### Step 5: Upload Files 45 | 46 | Click **"Buckets" → "datalake" → "Upload"**, and upload sample files like: 47 | - `sales.csv` 48 | - `products.csv` 49 | - `customers.csv` 50 | 51 | ## Step 6: Access MinIO with Python 52 | 53 | Install dependencies: 54 | 55 | ```bash 56 | pip install pandas s3fs boto3 57 | ``` 58 | 59 | Python example: 60 | 61 | ```python 62 | import pandas as pd 63 | 64 | df = pd.read_csv( 65 | 's3://datalake/sales.csv', 66 | storage_options={ 67 | "key": "minioadmin", 68 | "secret": "minioadmin", 69 | "client_kwargs": { 70 | "endpoint_url": "http://localhost:9000" 71 | } 72 | } 73 | ) 74 | 75 | print(df.head()) 76 | ``` 77 | 78 | ## Optional Tools to Add 79 | 80 | | Tool | Purpose | 81 | |------|---------| 82 | | **Apache Spark** | To query data in MinIO as a Data Lake | 83 | | **Airflow** | To orchestrate ETL jobs | 84 | | **Streamlit** | To create dashboards using cleaned data | 85 | | **Parquet/Feather** | For storing large processed data efficiently | 86 | 87 | ## Mini Data Lake Project Idea (No Docker Needed) 88 | 89 | **Project Name:** "Mini Data Lake for Kenyan Retail Sales" 90 | 91 | **Data Flow:** 92 | 1. **Raw Data** → Upload to MinIO in bucket `datalake` 93 | 2. **ETL** → Use Python to clean and join data 94 | 3. **Save Output** → Store clean output back to MinIO as `clean_data.parquet` 95 | 4. 
**Analytics** → Use Power BI, Jupyter, or Streamlit
96 | 
--------------------------------------------------------------------------------
/Day1-WeekOneDayOneGude.md:
--------------------------------------------------------------------------------
1 | ## **Week 1, Day 1: Introduction to Data Engineering: Fundamentals, Tools, and Practices.**
2 | 
3 | ### Objectives
4 | **By the end of the lesson, students will:**
5 | - Understand what data engineering is and its role in data pipelines.
6 | - Learn about the tools commonly used in data engineering.
7 | - Gain insight into best practices for building and maintaining reliable data systems.
8 | - Be familiar with real-world examples of data engineering applications.
9 | 
10 | ---
11 | 
12 | ### 1. Introduction
13 | - Overview of the role of data engineers in the data lifecycle.
14 | - Key differences between data engineering and data analytics.
15 | 
16 | ---
17 | 
18 | ### 2. What Is Data Engineering?
19 | **Definition:**
20 | The discipline of designing and building systems for collecting, storing, and analyzing data at scale.
21 | 
22 | #### Core Concepts:
23 | - Data pipelines.
24 | - ETL (Extract, Transform, Load) processes.
25 | - Data warehouses and data lakes.
26 | - Scalability and reliability.
27 | 
28 | ---
29 | 
30 | ### 3. Importance of Data Engineering
31 | - Enabling analytics and machine learning through robust infrastructure.
32 | - Streamlining access to clean, consistent data.
33 | 
34 | #### Real-World Applications:
35 | - Powering recommendation systems.
36 | - Supporting real-time analytics in e-commerce and finance.
37 | 
38 | ---
39 | 
40 | ### 4. Tools for Data Engineering and Their Functions
41 | 
42 | #### Data Storage:
43 | - Relational databases (e.g., PostgreSQL, MySQL).
44 | - Cloud storage solutions (e.g., AWS S3, Google Cloud Storage).
45 | 
46 | #### Data Processing:
47 | - Batch processing tools (e.g., Apache Spark, Hadoop).
48 | - Streaming tools (e.g., Kafka, Flink).
49 | 
50 | #### Workflow Orchestration:
51 | - Airflow, Prefect.
52 | 
53 | #### ETL Tools:
54 | - Informatica, Talend, dbt (data build tool).
55 | 
56 | #### Programming Languages:
57 | - Python, Scala, SQL.
58 | 
59 | ---
60 | 
61 | ### 5. Practical Case Study
62 | #### Design and Build a Simple Data Pipeline:
63 | **Example:**
64 | Ingest data from an API, store it in a database, and transform it for reporting.
65 | 
66 | ---
67 | 
68 | ### 6. Developing a Data Engineering Mindset
69 | #### Key Practices:
70 | - Prioritizing scalability and maintainability.
71 | - Building with automation in mind.
72 | - Documenting and monitoring systems effectively.
73 | 
74 | #### Critical Thinking:
75 | - Anticipating edge cases.
76 | - Debugging complex pipelines.
77 | 
78 | ---
79 | 
80 | ### 7. Recap and Q&A
81 | - Summary of key takeaways.
82 | - Open discussion to address questions and explore additional use cases.
--------------------------------------------------------------------------------
/Apache Airflow Operators Guide.md:
--------------------------------------------------------------------------------
1 | ### **1. Bash & Python Operators**
2 | 
3 | - `BashOperator` - Executes a bash command.
4 | Example:
5 | 
6 | ```bash
7 | source envname/bin/activate
8 | ```
9 | 
10 | - `PythonOperator` - Runs a Python function.
11 | - `BranchPythonOperator` - Executes one of multiple Python functions based on logic.
12 | 
13 | ### **2.
SQL & Database Operators** 14 | - `MySqlOperator` - Executes SQL queries in MySQL. 15 | - `PostgresOperator` - Executes SQL queries in PostgreSQL. 16 | - `SqliteOperator` - Executes SQL queries in SQLite. 17 | - `MSSqlOperator` - Executes SQL queries in MS SQL Server. 18 | - `OracleOperator` - Executes SQL queries in Oracle. 19 | - `SnowflakeOperator` - Executes SQL queries in Snowflake. 20 | - `BigQueryOperator` - Runs queries in Google BigQuery. 21 | - `RedshiftSQLOperator` - Runs SQL commands in Amazon Redshift. 22 | 23 | ### **3. File Transfer & Storage Operators** 24 | - `S3FileTransformOperator` - Processes files stored in Amazon S3. 25 | - `S3ToRedshiftOperator` - Loads data from S3 to Redshift. 26 | - `GCSToBigQueryOperator` - Loads files from Google Cloud Storage to BigQuery. 27 | - `FTPOperator` - Transfers files via FTP. 28 | - `FileSensor` - Waits for a file to appear in a directory. 29 | 30 | ### **4. Data Processing & ETL Operators** 31 | - `SparkSubmitOperator` - Submits a Spark job. 32 | - `DataProcPySparkOperator` - Runs PySpark jobs on Google Dataproc. 33 | - `HiveOperator` - Runs Hive queries. 34 | - `DruidOperator` - Submits queries to Apache Druid. 35 | - `PrestoOperator` - Runs Presto SQL queries. 36 | 37 | ### **5. AWS Operators** 38 | - `S3ToSnowflakeOperator` - Loads S3 data into Snowflake. 39 | - `DynamoDBToS3Operator` - Copies DynamoDB data to S3. 40 | - `EMRCreateJobFlowOperator` - Starts an EMR cluster. 41 | - `LambdaInvokeFunctionOperator` - Calls an AWS Lambda function. 42 | 43 | ### **6. Google Cloud Operators** 44 | - `BigQueryCheckOperator` - Checks data in BigQuery. 45 | - `DataflowTemplateOperator` - Runs a Google Cloud Dataflow job. 46 | - `GCSCreateBucketOperator` - Creates a Google Cloud Storage bucket. 47 | 48 | ### **7. Kubernetes & Docker Operators** 49 | - `KubernetesPodOperator` - Runs a task inside a Kubernetes pod. 50 | - `DockerOperator` - Runs a Docker container. 51 | 52 | ### **8. Email & Notification Operators** 53 | - `EmailOperator` - Sends an email. 54 | - `SlackAPIPostOperator` - Sends messages to Slack. 55 | 56 | ### **9. Sensors (Wait for Events)** 57 | - `HttpSensor` - Waits for an HTTP endpoint response. 58 | - `S3KeySensor` - Waits for an object to appear in S3. 59 | - `HdfsSensor` - Waits for a file to appear in HDFS. 60 | - `ExternalTaskSensor` - Waits for another DAG task to complete. 61 | 62 | ### **10. Miscellaneous Operators** 63 | - `DummyOperator` - A placeholder for dependencies. 64 | - `HttpOperator` - Calls an HTTP endpoint. 65 | -------------------------------------------------------------------------------- /CH02-2025-DE-Capstone-Project.md: -------------------------------------------------------------------------------- 1 | ### **CH02-2025-Data Engineering Capstone Project: Build a Data Platform for Analyzing Kenya’s Food Prices and Inflation Trends.** 2 | 3 | **Domain**: Public Data || Economics || Agriculture. 4 | 5 | **Data Availability**: Easy – Publicly available from official government sources 6 | 7 | --- 8 | #### **🧩 Project Brief** 9 | 10 | Your team has been contracted by a government think tank to build a **data platform** that tracks food prices across Kenyan counties, detects inflation patterns, and generates insights for consumers, farmers, and policymakers. 11 | 12 | You'll pull **real data** from public sources (see below), clean and model it, and build both **batch and near-real-time** pipelines for analysis and visualization. 
13 | 14 | --- 15 | 16 | ### **🗃️ Suggested Data Sources** 17 | - 🇰🇪 [Kenya National Bureau of Statistics (KNBS)](https://www.knbs.or.ke/) 18 | - Monthly food price reports (PDF/Excel) 19 | - CPI & inflation datasets 20 | - [World Bank Open Data – Kenya](https://data.worldbank.org/country/kenya) 21 | - [FAOSTAT – Food & Agriculture Data](https://www.fao.org/faostat/) 22 | - [Kenya Open Data](https://kenya.opendataforafrica.org/) 23 | - County-level data on market prices, commodities, population, etc. 24 | 25 | --- 26 | 27 | ### **🔧 Requirements** 28 | 29 | #### **🏗️ Batch Pipeline** 30 | - Ingest food price data (monthly or weekly) from KNBS or Open Data portal. 31 | - Clean and normalize pricing formats (handle missing values, different currencies/units). 32 | - Use **Airflow** to automate downloads and ETL processes with **PySpark** or **Pandas**. 33 | - Create **fact/dimension tables** with a star schema (e.g., product, county, time). 34 | 35 | #### **🌐 Optional Web Scraping Add-on** 36 | - Scrape data from a public market pricing site or KNBS portal (if available). 37 | - Use **BeautifulSoup** or **Selenium** (optional, only if permitted). 38 | 39 | #### **📡 Streaming Component (Optional but Impressive)** 40 | - Simulate daily market price updates using a **Kafka producer** (e.g., tomatoes in Nairobi). 41 | - Consume and store using **Spark Streaming → Delta Lake/S3/PostgreSQL**. 42 | 43 | #### **📊 Visualization & Dashboarding** 44 | - Build an analytics dashboard with: 45 | - Price changes over time 46 | - Inflation heatmaps by county 47 | - Product comparison across regions 48 | - Tool: **Grafana or Power BI** 49 | 50 | #### **Data Governance** 51 | - Add metadata to tag data sources, update frequency, and validation steps. 52 | - Track data lineage through Airflow logs or a simple metadata table. 
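A lightweight way to meet this governance requirement is a single metadata table that every Airflow task writes a row to after it runs, as sketched below; the table and column names are illustrative, not prescribed by the brief.

```sql
-- Illustrative metadata/lineage table (PostgreSQL); adapt names and types as needed
CREATE TABLE pipeline_metadata (
    run_id        VARCHAR(64)  NOT NULL,  -- Airflow DAG run identifier
    source_name   VARCHAR(128) NOT NULL,  -- e.g. 'KNBS monthly food price report'
    source_url    TEXT,                   -- where the raw file was downloaded from
    ingested_at   TIMESTAMP    NOT NULL,  -- when the batch was loaded
    row_count     INTEGER,                -- rows loaded after cleaning
    validation_ok BOOLEAN,                -- did the null/unit/currency checks pass?
    output_table  VARCHAR(128)            -- where the cleaned data landed
);
```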
53 | 54 | --- 55 | 56 | ### **📁 Deliverables** 57 | - GitHub repo with pipeline code, Airflow DAGs, and documentation 58 | - Final dashboard (hosted or screenshots) 59 | - README with architecture diagram and data model 60 | - Presentation deck with insights and demo 61 | 62 | --- 63 | 64 | ### **🧠 Learning Outcomes** 65 | - Automate real-world data collection and transformation 66 | - Practice ETL, data modeling, and basic analytics 67 | - Work with government and open datasets 68 | - Communicate insights through dashboards and presentations 69 | 70 | -------------------------------------------------------------------------------- /Data Engineer Apache Kafka Producers.md: -------------------------------------------------------------------------------- 1 | ```python 2 | from confluent_kafka import Producer 3 | from faker import Faker 4 | import random 5 | import time 6 | import datetime 7 | import json 8 | 9 | # Kafka Configuration 10 | KAFKA_BROKER = "localhost:9092" # Change to match your Kafka setup 11 | TOPIC = "kenyan_users" 12 | 13 | # Create a Kafka Producer instance 14 | producer = Producer({'bootstrap.servers': KAFKA_BROKER}) 15 | 16 | # Create a Faker instance with British English locale 17 | fake = Faker('en_GB') 18 | 19 | def generate_kenyan_phone(): 20 | prefixes = ['0700', '0701', '0702', '0703', '0704', '0705', '0706', '0707', '0708', '0709', 21 | '0710', '0711', '0712', '0713', '0714', '0715', '0716', '0717', '0718', '0719', 22 | '0720', '0721', '0722', '0723', '0724', '0725', '0726', '0727', '0728', '0729', 23 | '0730', '0731', '0732', '0733', '0734', '0735', '0736', '0737', '0738', '0739', 24 | '0740', '0741', '0742', '0743', '0744', '0745', '0746', '0747', '0748', '0749', 25 | '0750', '0751', '0752', '0753', '0754', '0755', '0756', '0757', '0758', '0759', 26 | '0760', '0761', '0762', '0763', '0764', '0765', '0766', '0767', '0768', '0769', 27 | '0770', '0771', '0772', '0773', '0774', '0775', '0776', '0777', '0778', '0779', 28 | '0790', '0791', '0792', '0793', '0794', '0795', '0796', '0797', '0798', '0799'] 29 | 30 | prefix = random.choice(prefixes) 31 | suffix = ''.join(random.choices('0123456789', k=6)) 32 | return f"{prefix} {suffix}" 33 | 34 | def generate_kenyan_amount(): 35 | amount = random.randint(100, 100000) 36 | return f"KES {amount:,}" 37 | 38 | def generate_user(): 39 | name = fake.name() 40 | domain = random.choice(['gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com', 'ke-mail.com']) 41 | email = f"{name.lower().replace(' ', '.').replace('.', random.choice(['.','-','_']))}{random.randint(1, 99)}@{domain}" 42 | phone = generate_kenyan_phone() 43 | amount = generate_kenyan_amount() 44 | timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 45 | 46 | return { 47 | "timestamp": timestamp, 48 | "name": name, 49 | "email": email, 50 | "phone": phone, 51 | "amount": amount 52 | } 53 | 54 | # Callback function to confirm message delivery 55 | def delivery_report(err, msg): 56 | if err is not None: 57 | print(f"❌ Message delivery failed: {err}") 58 | else: 59 | print(f"✅ Message delivered to {msg.topic()} [{msg.partition()}]") 60 | 61 | def main(): 62 | print("Kafka Producer: Sending Kenyan user data to topic 'kenyan_users' every 5 seconds...") 63 | 64 | try: 65 | while True: 66 | user = generate_user() 67 | user_data = json.dumps(user) # Convert to JSON format 68 | 69 | # Send data to Kafka 70 | producer.produce(TOPIC, value=user_data, callback=delivery_report) 71 | producer.flush() # Ensure message is sent 72 | 73 | print(f"📤 Sent: {user_data}") 74 | 
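            # Note: flush() after every message blocks until delivery is confirmed, which is fine
            # for a demo sending one message every 5 seconds; for higher throughput, a common
            # alternative is calling producer.poll(0) here and flushing once on shutdown.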
75 | time.sleep(5) 76 | 77 | except KeyboardInterrupt: 78 | print("\n🚪 Stopping Kafka producer.") 79 | 80 | finally: 81 | producer.flush() 82 | 83 | if __name__ == "__main__": 84 | main() 85 | ``` 86 | -------------------------------------------------------------------------------- /DATA-PROCESSING.md: -------------------------------------------------------------------------------- 1 | # DATA PROCESSING 2 | 3 | Data processing refers to the collection, transformation, and organization of raw data into a meaningful format for analysis and decision-making. It is a core responsibility in data engineering workflows. 4 | 5 | ## TYPES OF DATA PROCESSING 6 | 7 | Data processing can be broadly categorized into two main types: 8 | 9 | 1. **Batch Data Processing** 10 | 2. **Streaming Data Processing (Real-time Processing)** 11 | 12 | --- 13 | 14 | ## Batch Data Processing 15 | 16 | Batch data processing is defined as processing large volumes of data at scheduled intervals (e.g., hourly, daily, weekly). 17 | 18 | For example, sales figures typically undergo batch processing, allowing businesses to use data visualization features like charts, graphs, and reports to derive value from data. Since a large volume of data is involved, the system will take time to process it. Processing the data in batches saves on computational resources. 19 | 20 | ### Characteristics 21 | 22 | - Data is collected over time and processed in fixed-size chunks. 23 | - Jobs run at predefined times (e.g., end-of-day reports). 24 | - High latency (delay between data collection and processing). 25 | - Efficient for large-scale, non-time-sensitive computations. 26 | 27 | ### When to Use Batch Processing? 28 | 29 | - When doing historical analysis (e.g., monthly sales reports). 30 | - During large-scale ETL (Extract, Transform, Load) jobs. 31 | - Cost-effective for processing massive datasets. 32 | - Perfect for non-real-time applications (e.g., billing systems, payroll processing). 33 | 34 | ### Examples 35 | 36 | - Generating monthly sales reports. 37 | - Data warehouse ETL jobs. 38 | 39 | ### Advantages 40 | 41 | - Efficient for large datasets. 42 | - Lower infrastructure cost (scheduled, not real-time). 43 | - Simpler to debug and manage. 44 | 45 | ### Disadvantages 46 | 47 | - Not real-time (latency). 48 | - Not suitable for time-sensitive data. 49 | 50 | ### Tools & Technologies 51 | 52 | - Apache Hadoop (MapReduce, HDFS). 53 | - Apache Spark (batch mode). 54 | - ETL tools: Apache Airflow. 55 | 56 | > You might prefer batch processing over real-time processing when accuracy is more important than speed. 57 | 58 | --- 59 | 60 | ## Streaming Data Processing (Real-time Processing) 61 | 62 | Streaming (real-time) processing means processing data as it is generated or received. 63 | 64 | ### How it Works 65 | 66 | Data is ingested continuously and processed immediately (or within milliseconds to seconds). This immediate processing requires low latency and quick response times, making it suitable for applications like monitoring systems and financial trading. 67 | 68 | ### Examples 69 | 70 | - Monitoring stock prices. 71 | - Real-time fraud detection. 72 | - IoT sensor data processing. 73 | 74 | ### Advantages 75 | 76 | - Real-time insights. 77 | - Better for event-driven use cases. 78 | - Immediate action or alert possible. 79 | 80 | ### Disadvantages 81 | 82 | - More complex to implement and maintain. 83 | - Higher cost due to always-on infrastructure. 84 | - Debugging is more difficult. 
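To make the streaming side concrete, here is a minimal sketch of a consumer that processes events as they arrive, using the same `confluent_kafka` library as the producer example in this repository; the broker address, topic name, and group id are assumptions you would adapt.

```python
from confluent_kafka import Consumer

# Assumed settings: match them to your own broker and topic
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'streaming-demo',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['kenyan_users'])

try:
    while True:
        msg = consumer.poll(1.0)   # wait up to 1 second for the next event
        if msg is None:
            continue               # nothing new yet; keep listening
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # React to each event immediately (the essence of stream processing)
        print(f"Processed event: {msg.value().decode('utf-8')}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```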
85 | 86 | ### Tools & Technologies 87 | 88 | - Apache Kafka (event streaming platform). 89 | - Apache Spark Streaming (micro-batch processing). 90 | 91 | --- 92 | 93 | ## Conclusion 94 | 95 | - **Use Batch Processing** for large-scale, historical data with no urgency. 96 | - **Use Streaming Processing** for real-time insights and event-driven applications. 97 | -------------------------------------------------------------------------------- /CoreConceptsDataModeling.md: -------------------------------------------------------------------------------- 1 | # Data Modeling 2 | 3 | ## Objectives 4 | By the end of this section, students will: 5 | - Understand the fundamentals of data modeling. 6 | - Learn the different types of data models and their uses. 7 | - Gain insight into designing effective data models for real-world applications. 8 | - Explore best practices for data modeling in various systems. 9 | 10 | --- 11 | 12 | ## 1. Introduction to Data Modeling 13 | - **Definition:** 14 | Data modeling is the process of designing and creating a visual representation (model) of a system’s data structures, which can be used for storing, organizing, and manipulating data in databases. 15 | 16 | - **Purpose:** 17 | To structure data in a way that supports the business processes, enhances data quality, and ensures scalability and performance. 18 | 19 | --- 20 | 21 | ## 2. Types of Data Models 22 | ### 1. **Conceptual Data Model:** 23 | - Focuses on high-level business requirements. 24 | - Describes entities, relationships, and their attributes. 25 | - Typically used for communicating with non-technical stakeholders. 26 | 27 | ### 2. **Logical Data Model:** 28 | - Provides a more detailed representation of data entities and their relationships. 29 | - Does not include physical details like indexes or storage locations. 30 | - Serves as a blueprint for physical data models. 31 | 32 | ### 3. **Physical Data Model:** 33 | - Describes how data is physically stored in a database. 34 | - Includes specific details like table structures, indexes, and storage paths. 35 | - Optimized for performance and efficient data retrieval. 36 | 37 | --- 38 | 39 | ## 3. Key Concepts in Data Modeling 40 | 41 | ### 1. **Entities and Attributes:** 42 | - **Entity:** An object or concept about which data is stored (e.g., Customer, Order). 43 | - **Attribute:** Characteristics or properties of an entity (e.g., Customer Name, Order Date). 44 | 45 | ### 2. **Relationships:** 46 | - Describes how entities are related to each other (e.g., One-to-many, Many-to-many). 47 | - Defined using primary keys and foreign keys. 48 | 49 | ### 3. **Normalization:** 50 | - The process of organizing data to minimize redundancy and dependency. 51 | - Involves breaking down large tables into smaller, more manageable ones. 52 | 53 | ### 4. **Denormalization:** 54 | - The process of combining tables to improve query performance. 55 | - Useful for read-heavy operations, but can lead to redundancy. 56 | 57 | --- 58 | 59 | ## 4. Best Practices in Data Modeling 60 | - **Consistency:** Ensure consistent naming conventions, data types, and attributes. 61 | - **Scalability:** Design models that can handle growing data volumes. 62 | - **Maintainability:** Keep models simple and easy to update. 63 | - **Performance:** Optimize models to balance speed and efficiency. 64 | - **Documentation:** Document your data models for future use and clarity. 65 | 66 | --- 67 | 68 | ## 5. 
Tools for Data Modeling 69 | - **ERD Tools (Entity-Relationship Diagrams):** 70 | - Microsoft Visio, Lucidchart, Draw.io. 71 | - **Database Design Tools:** 72 | - MySQL Workbench, Oracle SQL Developer, dbForge Studio. 73 | - **Data Modeling Tools:** 74 | - Erwin Data Modeler, IBM InfoSphere Data Architect, PowerDesigner. 75 | 76 | --- 77 | 78 | ## 6. Practical Example 79 | - **Scenario:** 80 | Designing a data model for an e-commerce platform to manage products, customers, and orders. 81 | 82 | ### Steps: 83 | 1. Identify key entities (e.g., Customer, Order, Product). 84 | 2. Define relationships (e.g., one customer can place many orders). 85 | 3. Normalize the model to eliminate data redundancy. 86 | 4. Create an ERD diagram to visualize the model. 87 | 5. Implement the model in a database. 88 | 89 | --- 90 | 91 | ## 7. Recap and Q&A 92 | - Review the types of data models and their purposes. 93 | - Open discussion for questions and exploration of real-world data modeling challenges. 94 | -------------------------------------------------------------------------------- /Tuesday-Kafka-Lab.md: -------------------------------------------------------------------------------- 1 | # Tuesday: Kafka Producer and Consumer (Lab) 2 | 3 | ## Objectives 4 | 5 | Set up a basic Kafka environment with: 6 | 7 | * A **Producer** that sends messages to a topic 8 | * A **Topic** to hold the messages 9 | * A **Consumer** that reads messages from the topic 10 | 11 | ## Step-by-Step Setup 12 | 13 | ### 1. Ensure Java is Installed 14 | 15 | Kafka requires Java. 16 | 17 | ```bash 18 | java -version 19 | ``` 20 | 21 | If not installed: 22 | 23 | ```bash 24 | sudo apt install openjdk-11-jdk -y 25 | ``` 26 | 27 | ### 2. Download and Extract Kafka 28 | 29 | Kafka is distributed as a compressed archive (`.tgz` file). Here’s how to download and unpack it: 30 | 31 | ```bash 32 | wget https://downloads.apache.org/kafka/3.6.1/kafka_2.13-3.6.1.tgz 33 | ``` 34 | 35 | `wget` is a command-line tool used to download files from the web. 36 | 37 | This command fetches Kafka version **3.6.1** built for **Scala 2.13** (Kafka is written in Scala and Java). 38 | 39 | You can always check for the latest version at [Kafka downloads](https://kafka.apache.org/downloads). 40 | 41 | Next, extract the downloaded archive: 42 | 43 | ```bash 44 | tar -xzf kafka_2.13-3.6.1.tgz 45 | ``` 46 | 47 | * `tar` is used to extract files. 48 | * `-xzf` means: 49 | 50 | * `x`: extract 51 | * `z`: decompress gzip 52 | * `f`: specify the file name 53 | 54 | Then, delete the downloaded archive to free up space: 55 | 56 | ```bash 57 | rm kafka_2.13-3.6.1.tgz 58 | ``` 59 | 60 | Rename the extracted folder to something simpler: 61 | 62 | ```bash 63 | mv kafka_2.13-3.6.1 kafka 64 | ``` 65 | 66 | Change into the Kafka directory: 67 | 68 | ```bash 69 | cd kafka 70 | ``` 71 | 72 | This is Kafka’s **home directory** where all config files, scripts, and binaries live. 73 | 74 | You’ll run all Kafka-related commands from here. 75 | 76 | #### Kafka Folder Structure Overview 77 | 78 | | Folder / File | Purpose | 79 | | ------------------- | ------------------------------------------------ | 80 | | `bin/` | Scripts to start Kafka, ZooKeeper, and CLI tools | 81 | | `config/` | Configuration files for Kafka and ZooKeeper | 82 | | `libs/` | Kafka and ZooKeeper libraries | 83 | | `logs/` | Kafka server logs during runtime | 84 | | `LICENSE`, `NOTICE` | Legal/license information | 85 | 86 | Kafka is now ready to be started. 87 | 88 | ### 3. 
Start ZooKeeper and Kafka 89 | 90 | Start ZooKeeper (in one terminal): 91 | 92 | ```bash 93 | bin/zookeeper-server-start.sh config/zookeeper.properties 94 | ``` 95 | 96 | Start Kafka (in another terminal): 97 | 98 | ```bash 99 | bin/kafka-server-start.sh config/server.properties 100 | ``` 101 | ### 4. Create a Kafka Topic 102 | 103 | ```bash 104 | bin/kafka-topics.sh --create \ 105 | --topic test-topic \ 106 | --bootstrap-server localhost:9092 \ 107 | --partitions 1 \ 108 | --replication-factor 1 109 | ``` 110 | 111 | - **Partitions**: Allow parallelism and scalability. 112 | - **Replication factor**: Number of copies for fault tolerance. 113 | 114 | ### 5. Start a Producer 115 | 116 | ```bash 117 | bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092 118 | ``` 119 | 120 | Type messages and press Enter to send them to the topic. 121 | 122 | ### 6. Start a Consumer 123 | 124 | ```bash 125 | bin/kafka-console-consumer.sh --topic test-topic --from-beginning --bootstrap-server localhost:9092 126 | ``` 127 | 128 | Messages sent by the producer will appear here. 129 | 130 | ## Topic Management Commands 131 | 132 | **List topics** 133 | 134 | ```bash 135 | bin/kafka-topics.sh --list --bootstrap-server localhost:9092 136 | ``` 137 | 138 | **Describe a topic** 139 | 140 | ```bash 141 | bin/kafka-topics.sh --describe --topic test-topic --bootstrap-server localhost:9092 142 | ``` 143 | 144 | **Delete a topic** 145 | 146 | ```bash 147 | bin/kafka-topics.sh --delete --topic test-topic --bootstrap-server localhost:9092 148 | ``` 149 | 150 | **Add more partitions** 151 | 152 | ```bash 153 | bin/kafka-topics.sh --alter --topic test-topic --partitions 3 --bootstrap-server localhost:9092 154 | ``` 155 | 156 | > Note: You can **only increase** partitions, not reduce them. 157 | 158 | ## Summary 159 | 160 | * Kafka and ZooKeeper are up and running 161 | * Topic created 162 | * Messages produced and consumed 163 | * Topics managed using CLI tools 164 | 165 | 166 | -------------------------------------------------------------------------------- /scrapping.md: -------------------------------------------------------------------------------- 1 | ### **1. What is Beautiful Soup?** 2 | 3 | **Definition:** 4 | 5 | Beautiful Soup is a Python library that turns messy HTML into a structured object, so you can easily search, navigate, and extract data from web pages. When you download a web page (as text), it usually looks like a big, ugly string with lots of tags. 6 | 7 | Beautiful Soup: 8 | 9 | - Parses that HTML (understands the structure) 10 | - Builds a tree of tags (like a family tree of elements: ` → →

`<html>` → `<body>` → `<p>` etc.)
11 | - Gives you friendly tools to:
12 |   - Find tags (e.g. `<p>`, `<a>`, `<div>`…)
13 |   - Get their text
14 |   - Read their attributes (`href`, `class`, `id`, etc.)
15 | 
16 | ---
17 | 
18 | ### **2. Why do we use Beautiful Soup?**
19 | 
20 | You use Beautiful Soup when you want to:
21 | 
22 | - **Scrape data from websites**
23 |   - *Example:* Get all product names and prices from an online store page.
24 | - **Clean and analyze HTML**
25 |   - *Example:* Extract only the article text from a news page.
26 | - **Automate manual tasks**
27 |   - *Example:* Collect all links from a set of pages instead of copying them by hand.
28 | 
29 | Without Beautiful Soup, you would have to:
30 | 
31 | - Manually search through raw HTML strings
32 | - Write a lot of complicated regular expressions
33 |   → This is hard and very error-prone.
34 | 
35 | With Beautiful Soup, you write code like:
36 | 
37 | - "Find all `<a>` tags"
38 | - "Get the text inside each `<p>`"
39 | 
40 | Much cleaner and easier to understand.
41 | 
42 | ---
43 | 
44 | ### **3. How Beautiful Soup fits in a scraping workflow**
45 | 
46 | A typical web scraping workflow looks like this:
47 | 
48 | 1. **Use `requests` (or another HTTP library) to download the web page:**
49 | 
50 |    ```python
51 |    import requests
52 | 
53 |    response = requests.get("https://example.com")
54 |    html = response.text
55 |    ```
56 | 
57 | 2. **Use Beautiful Soup to parse the HTML:**
58 | 
59 |    ```python
60 |    from bs4 import BeautifulSoup
61 | 
62 |    soup = BeautifulSoup(html, "html.parser")
63 |    ```
64 | 
65 | 3. **Use Beautiful Soup to extract what you need:**
66 | 
67 |    ```python
68 |    title = soup.find("h1").get_text(strip=True)
69 |    links = [a.get("href") for a in soup.find_all("a")]
70 |    ```
71 | 
72 | So:
73 | 
74 | - `requests` → gets the HTML
75 | - `BeautifulSoup` → understands and extracts from the HTML
76 | 
77 | ---
78 | 
79 | ### **4. Key concepts in Beautiful Soup**
80 | 
81 | When teaching, focus on these core ideas:
82 | 
83 | #### The `soup` object
84 | 
85 | - Created with `BeautifulSoup(html, "html.parser")`
86 | - Represents the entire HTML document
87 | 
88 | #### Tags
89 | 
90 | - Elements like `<p>`, `<a>`, `<div>` are called *tags*
91 | - You can access them like `soup.p`, `soup.find("a")`, etc.
92 | 
93 | #### Text
94 | 
95 | - Use `.get_text()` or `.text` to get the text inside a tag
96 | 
97 | #### Attributes
98 | 
99 | - Things like `href`, `class`, `id` are attributes of a tag
100 | - Access them with `tag['href']` or `tag.get('href')`
101 | 
102 | #### Search methods
103 | 
104 | - `.find()` → first matching element
105 | - `.find_all()` → all matching elements
106 | - `.select()` / `.select_one()` → CSS selector style
107 | 
108 | ---
109 | 
110 | ### **5. Tiny "hello world" example**
111 | 
112 | You can show this as the first demo:
113 | 
114 | ```python
115 | from bs4 import BeautifulSoup
116 | 
117 | html = """
118 | <html>
119 | <body>
120 | <h1>My Website</h1>
121 | <p class="intro">Welcome to my site.</p>
122 | <a href="https://example.com">Click here</a>
123 | </body>
124 | </html>
125 | """
126 | 
127 | # Create the soup object
128 | soup = BeautifulSoup(html, "html.parser")
129 | 
130 | # Get the title (h1 text)
131 | title = soup.find("h1").get_text(strip=True)
132 | 
133 | # Get the paragraph text
134 | intro = soup.find("p", class_="intro").get_text(strip=True)
135 | 
136 | # Get the link and its URL
137 | link_tag = soup.find("a")
138 | link_text = link_tag.get_text(strip=True)
139 | link_url = link_tag.get("href")
140 | 
141 | print("Title:", title)
142 | print("Intro:", intro)
143 | print("Link text:", link_text)
144 | print("Link URL:", link_url)
145 | ```
146 | 
147 | **What this demo shows:**
148 | 
149 | - How to create a `soup` object
150 | - How to find tags
151 | - How to get text and attributes
152 | 
153 | ---
154 | 
155 | ### **6. One-sentence summary for students**
156 | 
157 | Beautiful Soup is a Python tool that makes it easy to read HTML and pull out just the data you need from web pages.
158 | 
159 | Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#api-documentation
160 | 
--------------------------------------------------------------------------------
/DATA-MODELING.md:
--------------------------------------------------------------------------------
1 | # **Introduction to Data Modeling**
2 | 
3 | ### **What is a Model?**
4 | 
5 | A **model** is a structure or dimension representing data.
6 | 
7 | ### **What is Data Modeling?**
8 | 
9 | The process of designing how data will be **organized**, **stored**, and **accessed**.
10 | 
11 | * It provides a **visual representation** (tables, rows, columns).
12 | * It acts as a **blueprint** for database design.
13 | 
14 | ### **Purpose of Data Modeling**
15 | 
16 | * It ensures **accuracy**, **consistency**, and **integrity** of data.
17 | * It optimizes **database performance**.
18 | * It facilitates **communication** between technical and business stakeholders.
19 | 
20 | ---
21 | 
22 | # **Types of Data Modeling**
23 | 
24 | ## 1. Conceptual Data Modeling
25 | 
26 | * This represents a high-level overview of business/domain data without going into details.
27 | * No technical details, no attributes or data types.
28 | * It focuses on **entities** and their relationships.
29 | * Example (Hospital Domain):
30 | 
31 |   * Entities: Patient, Doctor, Appointment.
32 | 
33 | ---
34 | 
35 | ## 2. Logical Data Modeling
36 | 
37 | * This model describes **data elements and relationships** in detail (without considering physical storage).
38 | * It defines **attributes** for each entity.
39 | * Example:
40 | 
41 |   * Doctor: doctor\_id, name, specialization.
42 |   * Appointment: start\_time, end\_time, doctor\_id (FK).
43 | 
44 | ---
45 | 
46 | ## 3. Physical Data Modeling
47 | 
48 | * This model defines **how data is physically stored** in a database.
49 | * It includes:
50 | 
51 |   * **Data types** (e.g., INT, VARCHAR).
52 |   * **Constraints** (e.g., PRIMARY KEY, FOREIGN KEY).
53 |   * **Indexes** for performance.
54 | 55 | Example Schema: 56 | 57 | ### Doctor Table 58 | 59 | * doctor\_id (INT) – Primary Key 60 | * doctor\_name (VARCHAR) 61 | 62 | ### Customers Table 63 | 64 | * customer\_id (INT) – Primary Key 65 | * name (VARCHAR) 66 | * age (INT) 67 | * email (VARCHAR) 68 | * DOB (DATE) 69 | * phone (VARCHAR) 70 | 71 | ### Account Table 72 | 73 | * account\_id (INT) – Primary Key 74 | * balance (INT) 75 | * dr (INT) (Debit) 76 | * cr (INT) (Credit) 77 | * acc\_type (VARCHAR) 78 | * customer\_id (INT) – Foreign Key 79 | 80 | ### Branch Table 81 | 82 | * branch\_id (INT) – Primary Key 83 | * location (VARCHAR) 84 | 85 | --- 86 | 87 | ## 4. Entity-Relationship Data Modeling 88 | 89 | * **Entity**: An object or concept representing data (e.g., Patient, Doctor, Appointment). 90 | * **Attributes**: Properties of an entity (e.g., patient\_id, doctor\_name). 91 | * **ERD (Entity-Relationship Diagram)**: Visual diagram showing entities and relationships. 92 | 93 | ### Relationship Types 94 | 95 | | Type | Example | 96 | | ------------ | ------------------------------------------------------------------------------ | 97 | | One-to-One | A Doctor has one doctor\_id, and each doctor\_id belongs to one Doctor. | 98 | | One-to-Many | A Doctor has many Patients, but a Patient has only one primary Doctor. | 99 | | Many-to-Many | A Patient can have many Appointments, and a Doctor can have many Appointments. | 100 | 101 | Detailed Examples: 102 | 103 | * **One-to-Many (1\:M):** 104 | 105 | * "A Doctor has many Patients, but a Patient has only one primary Doctor." 106 | * Implementation: Patients table has a foreign key (doctor\_id). 107 | 108 | * **Many-to-Many (M\:N):** 109 | 110 | * "A Patient can book many Appointments, and a Doctor can handle many Appointments." 111 | * Implementation: A junction table (Appointment) with patient\_id and doctor\_id as foreign keys. 112 | 113 | * **One-to-One (1:1):** 114 | 115 | * "A Doctor has exactly one unique doctor\_id." 116 | * Implementation: doctor\_id is both a primary key and unique. 117 | 118 | --- 119 | 120 | ## 5. Dimensional Data Modeling 121 | 122 | * Dimensional Data Modeling is primarily used in **data warehouses** for analytical purposes. 123 | * it organizes data into **fact tables** and **dimension tables**. 124 | 125 | ### Key Components 126 | 127 | | Component | Description | Example | 128 | | --------------- | ----------------------------------------- | --------------------------------- | 129 | | Fact Table | Numerical/measurable data (metrics/KPIs). | sales\_amount, quantity\_sold | 130 | | Dimension Table | Descriptive context for facts. | customer\_name, product\_category | 131 | 132 | Definitions: 133 | 134 | * **Dimensions**: Describe business entities (name, age, product category). 135 | * **Measures**: Quantitative facts (e.g., number of products sold). 136 | 137 | --- 138 | 139 | # **Summary** 140 | 141 | * **Conceptual**: High-level overview (what data is important?). 142 | * **Logical**: Detailed attributes and relationships (how is data related?). 143 | * **Physical**: Technical implementation (how is data stored?). 144 | * **ER Modeling**: Graphical view of entities and relationships. 145 | * **Dimensional**: Optimized for analytics (facts and dimensions). 
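To tie the physical and dimensional sections together, a minimal star schema for the sales example above might look like the following; the names and types are illustrative only.

```sql
-- Illustrative star schema: two dimension tables and one fact table
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    age           INT
);

CREATE TABLE dim_product (
    product_id       INT PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50)
);

CREATE TABLE fact_sales (
    sale_id       INT PRIMARY KEY,
    customer_id   INT,              -- FK to dim_customer
    product_id    INT,              -- FK to dim_product
    sales_amount  DECIMAL(12,2),    -- measure
    quantity_sold INT,              -- measure
    FOREIGN KEY (customer_id) REFERENCES dim_customer(customer_id),
    FOREIGN KEY (product_id)  REFERENCES dim_product(product_id)
);
```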
146 | 147 | -------------------------------------------------------------------------------- /introduction-to-Kafka.md: -------------------------------------------------------------------------------- 1 | # Introduction to Data Streaming and Apache Kafka 2 | 3 | ## **What is Streaming Data?** 4 | **Streaming data** (also called **event stream processing**) is the **continuous flow of data** generated by various sources, processed in **real-time** to extract insights and trigger actions. 5 | 6 | ## **Characteristics of Real-Time Data Processing** 7 | 1. **Continuous flow** – Data is constantly generated with no "end." 8 | 2. **Real-time processing** – Instant analysis for timely insights (no batch delays). 9 | 3. **Event-driven architecture** – Systems react dynamically to individual events. 10 | 4. **Scalability & fault tolerance** – Handles high traffic and recovers from failures. 11 | 5. **Varied Data Sources** – Streaming data originates from sensors, logs, APIs, applications, mobile devices, and more. 12 | 13 | --- 14 | 15 | ## **Key Benefits of Real-Time Data Processing** 16 | - **Immediate Insights**: Analyze data as it’s generated. 17 | - **Instant Decision-Making**: Respond to events in real-time (e.g., fraud detection). 18 | - **Operational Efficiency**: Optimize workflows and reduce downtime. 19 | - **Enhanced User Experience**: Personalize experiences using live data. 20 | 21 | ### **Critical Use Cases** 22 | - **Fraud Detection**: Block suspicious transactions instantly. 23 | - **IoT Monitoring**: Track device health in real-time. 24 | - **Live Analytics**: Power dashboards with up-to-the-second data. 25 | 26 | ### **Discussion Questions** 27 | 1. Can you think of a real-time use case near you (e.g., mobile money, delivery apps)? 28 | 2. What happens when systems can’t process data in real-time? 29 | 30 | --- 31 | 32 | # **Introduction to Apache Kafka** 33 | 34 | ## **Key Concepts** 35 | - **Publish/Subscribe Model**: 36 | - Producers send/write (**publish**) messages. 37 | - Consumers receive/read (**subscribe**) messages. 38 | > Asynchronous means not happening or done at the same time or speed — producer and consumer don’t need to wait on each other. 39 | 40 | - **Common Use Cases**: 41 | - **Real-time analytics** Analyze data as it's generated, instead of waiting for batch jobs or reports. 42 | - **Log collection** Gather logs from multiple systems/services into a centralized location for monitoring, debugging, or auditing. 43 | - **Event sourcing** Store state-changing events (like deposit $20, withdraw $30) rather than only storing the final state (balance = $50). 44 | 45 | --- 46 | 47 | ## **Kafka Architecture** 48 | | Component | Role | 49 | |--------------|----------------------------------------------------------------------| 50 | | **Producer** | Sends messages/events to a Kafka topic. | 51 | | **Consumer** | Reads messages/events from a topic. | 52 | | **Broker** | Kafka server storing/serving messages (clusters = multiple brokers). | 53 | | **ZooKeeper**/**KRaft** | Manages cluster state/metadata (KRaft replaces ZooKeeper in Kafka 4.0+). | 54 | 55 | --- 56 | ## **Event**: 57 | - An event records the fact that "something happened" in the real world or in your system. 58 | - It’s the fundamental unit of data in Kafka and may also be referred to as a record or message. 59 | - Events are immutable — once written, they are not updated 60 | - When you produce (write) or consume (read) data in Kafka, you're interacting with events. 
61 | 62 | #### Structure of an Event 63 | An event in Kafka typically contains the following components: 64 | - Key: Identifies the event (e.g., the user, transaction ID, or source). Used for partitioning logic. 65 | - Value: The actual data or payload (e.g., what happened). 66 | - Timestamp: Time when the event occurred or was written. 67 | - Headers (optional): Metadata about the event, such as content type or correlation ID. 68 | 69 | #### Example Event 70 | ```plaintext 71 | Event key: "Alice" 72 | Event value: "Made a payment of $200" 73 | Timestamp: 2025-06-27T08:45:30Z 74 | Headers: { "source": "mobile-app", "transaction-id": "TXN-4490" } 75 | ``` 76 | This event could be sent to a topic like `payments` and later consumed by analytics or fraud detection services. 77 | 78 | ## **Topic**: 79 | - A topic is a category/feed name to which messages/events are published to.(similar to a database table). 80 | - Topics are split into **partitions** for scalability/parallel processing. 81 | 82 | #### **Examples** 83 | - `orders` – E-commerce purchases. 84 | - `user-logins` – Authentication events. 85 | - `click-events` – Website/app interactions. 86 | 87 | ## **Partition**: 88 | - A topic can be split into multiple partitions, which enables scalability (more messages) and parallelism (faster processing). 89 | - Messages within a partition are strictly ordered. 90 | 91 | ## **Offset**: 92 | - A unique identifier (number) that Kafka assigns to each message within a partition. 93 | - It allows consumers to track where they left off in reading the stream of messages. (e.g., "read up to offset #5"). 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | -------------------------------------------------------------------------------- /MySQLQueryExecutionPlans.md: -------------------------------------------------------------------------------- 1 | # MySQL Query Execution Plans: A Complete Guide 2 | 3 | ## 1. Understanding Query Execution Plans 4 | 5 | A **Query Execution Plan** shows you **how** MySQL decides to retrieve data — which indexes it will use, how many rows it will check, and the join strategy. You can get it using: 6 | 7 | ```sql 8 | EXPLAIN SELECT ...; 9 | ``` 10 | 11 | For deeper insight with actual runtime statistics: 12 | 13 | ```sql 14 | EXPLAIN ANALYZE SELECT ...; 15 | ``` 16 | 17 | ## 2. Example 1 – Basic Table Scan 18 | 19 | Let's query all customers. 20 | 21 | ```sql 22 | EXPLAIN SELECT * FROM customer_info; 23 | ``` 24 | 25 | **Possible output:** 26 | 27 | | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | 28 | |----|-------------|-------|------|---------------|-----|---------|-----|------|-------| 29 | | 1 | SIMPLE | customer_info | ALL | NULL | NULL | NULL | NULL | 1000 | Using where | 30 | 31 | **Interpretation:** 32 | - **type = ALL** → full table scan (slow for large datasets) 33 | - No indexes used because there's no filtering 34 | - **Optimization:** Avoid `SELECT *` unless necessary. Use `WHERE` + indexed columns 35 | 36 | ## 3. Example 2 – Using an Index 37 | 38 | Suppose we query customers by `customer_id` (indexed as PRIMARY KEY). 
39 | 40 | ```sql 41 | EXPLAIN SELECT full_name FROM customer_info WHERE customer_id = 10; 42 | ``` 43 | 44 | **Possible output:** 45 | 46 | | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | 47 | |----|-------------|-------|------|---------------|-----|---------|-----|------|-------| 48 | | 1 | SIMPLE | customer_info | const | PRIMARY | PRIMARY | 4 | const | 1 | NULL | 49 | 50 | **Interpretation:** 51 | - **type = const** → MySQL knows it will return at most one row (fast) 52 | - Using `PRIMARY` key (O(1) lookup) 53 | - **Optimization:** Always filter with indexed columns when possible 54 | 55 | ## 4. Example 3 – Index Usage in Products Table 56 | 57 | Query products by `customer_id` (foreign key). 58 | 59 | ```sql 60 | EXPLAIN SELECT product_name FROM products WHERE customer_id = 3; 61 | ``` 62 | 63 | If `customer_id` is not indexed, **type** will be `ALL` (slow). **Solution:** Add index: 64 | 65 | ```sql 66 | CREATE INDEX idx_customer_id ON products(customer_id); 67 | ``` 68 | 69 | After indexing, the execution plan might show: 70 | 71 | | type | possible_keys | key | rows | Extra | 72 | |------|---------------|-----|------|-------| 73 | | ref | idx_customer_id | idx_customer_id | 5 | Using where | 74 | 75 | ## 5. Example 4 – Join with Index 76 | 77 | Query sales with customer names. 78 | 79 | ```sql 80 | EXPLAIN 81 | SELECT s.sales_id, s.total_sales, c.full_name 82 | FROM sales s 83 | JOIN customer_info c ON s.customer_id = c.customer_id; 84 | ``` 85 | 86 | If both `sales.customer_id` and `customer_info.customer_id` are indexed: 87 | 88 | | id | select_type | table | type | possible_keys | key | ref | rows | Extra | 89 | |----|-------------|-------|------|---------------|-----|-----|------|-------| 90 | | 1 | SIMPLE | c | ALL | PRIMARY | NULL | NULL | 1000 | Using where | 91 | | 1 | SIMPLE | s | ref | idx_customer_id | idx_customer_id | c.customer_id | 10 | NULL | 92 | 93 | **Optimization tips:** 94 | - Always index **join columns** 95 | - Make sure both sides of the join use the same data type 96 | 97 | ## 6. Example 5 – Filtering and Joining 98 | 99 | Retrieve all sales above 500 made by customers in "Nairobi". 100 | 101 | ```sql 102 | EXPLAIN 103 | SELECT s.sales_id, s.total_sales, c.full_name 104 | FROM sales s 105 | JOIN customer_info c ON s.customer_id = c.customer_id 106 | WHERE c.location = 'Nairobi' AND s.total_sales > 500; 107 | ``` 108 | 109 | Possible bottlenecks: 110 | - If `location` isn't indexed → table scan on `customer_info` 111 | - **Solution:** Create index: 112 | 113 | ```sql 114 | CREATE INDEX idx_location ON customer_info(location); 115 | ``` 116 | 117 | Execution plan should now show **ref** instead of **ALL** for `customer_info`. 118 | 119 | ## 7. Example 6 – Multi-table Join with Products 120 | 121 | ```sql 122 | EXPLAIN 123 | SELECT c.full_name, p.product_name, s.total_sales 124 | FROM customer_info c 125 | JOIN products p ON c.customer_id = p.customer_id 126 | JOIN sales s ON p.product_id = s.product_id 127 | WHERE s.total_sales > 1000; 128 | ``` 129 | 130 | Optimization tips: 131 | - Index `products.customer_id` 132 | - Index `sales.product_id` 133 | - Filter early (`WHERE s.total_sales > 1000`) so MySQL processes fewer rows 134 | 135 | ## 8. 
Example 7 – Using `EXPLAIN ANALYZE` 136 | 137 | ```sql 138 | EXPLAIN ANALYZE 139 | SELECT c.full_name, p.product_name 140 | FROM customer_info c 141 | JOIN products p ON c.customer_id = p.customer_id 142 | WHERE p.price > 500; 143 | ``` 144 | 145 | **Benefit:** 146 | - Shows **actual execution time** for each step 147 | - If you see that a join step takes much longer than expected → check indexes 148 | 149 | ## 9. Best Practices Recap 150 | 151 | ✅ Index frequently queried columns (especially in `WHERE`, `JOIN`, `ORDER BY`) 152 | ✅ Avoid `SELECT *` for performance 153 | ✅ Use `EXPLAIN` before and after schema changes 154 | ✅ Filter data as early as possible in your query 155 | ✅ Keep an eye on `rows` in EXPLAIN — smaller is better 156 | -------------------------------------------------------------------------------- /AivenProjectVersionWeekOneProject.md: -------------------------------------------------------------------------------- 1 | # Project: Setting Up Aiven Cloud Storage and Connecting a PostgreSQL Database Using DBeaver 2 | 3 | ### **Objective**: 4 | The goal of this project is to set up a **managed PostgreSQL database** on **Aiven**, use their storage options, and connect to it using **DBeaver** to manage and query the data. 5 | 6 | --- 7 | 8 | ### **Steps to Complete the Project**: 9 | 10 | #### **1. Set Up Aiven Account and Create PostgreSQL Service** 11 | - **Create an Aiven account** if you don’t already have one: [Aiven Sign Up](https://aiven.io/). 12 | - **Create a PostgreSQL service** on Aiven: 13 | - Log into the Aiven console. 14 | - Click **Create Service**. 15 | - Select **PostgreSQL** from the list of available services. 16 | - Choose the cloud provider (AWS, Google Cloud, etc.) and the region. 17 | - Configure your service and choose the storage options provided by Aiven. 18 | - Aiven will manage your PostgreSQL setup automatically, including backups, monitoring, and scaling. 19 | - **Obtain connection details**: 20 | - Once the PostgreSQL service is created, note down the **hostname**, **port**, **username**, and **password**. 21 | - You'll use this information to connect from DBeaver. 22 | 23 | #### **2. Set Up Cloud Storage on Aiven** 24 | - **Aiven provides object storage** integrated with their managed services, so you don't need to manually set up AWS S3 or Azure Blob Storage. You can upload data directly to Aiven storage or integrate with other services. 25 | - **Upload a sample file** to Aiven storage: 26 | - Go to the **Aiven dashboard** and look for storage options related to your service. 27 | - Upload a file (like a CSV) that you want to interact with using your PostgreSQL database. 28 | - **Alternatively**, you can use an external storage service (e.g., AWS S3 or Azure Blob) to interact with Aiven if required. However, Aiven's managed storage service should work well for this project. 29 | 30 | #### **3. Install and Configure DBeaver** 31 | - **Download and install DBeaver** (SQL client for PostgreSQL): 32 | - Go to [DBeaver's official site](https://dbeaver.io/) and download the version suitable for your OS. 33 | - **Connect to Aiven PostgreSQL service**: 34 | - Open DBeaver. 35 | - Create a **new connection**: 36 | - Select **PostgreSQL**. 37 | - Fill in the connection details (hostname, port, username, password, and database name from Aiven). 38 | - Test the connection and make sure it's successful. 39 | 40 | #### **4. 
Set Up PostgreSQL Database and Create Tables** 41 | - Once connected via DBeaver, you can **create a new database** or use the existing one. 42 | - Example: Create a simple `products` table: 43 | ```sql 44 | CREATE TABLE products ( 45 | id SERIAL PRIMARY KEY, 46 | name VARCHAR(100), 47 | price DECIMAL 48 | ); 49 | ``` 50 | - **Insert some data** into the `products` table: 51 | ```sql 52 | INSERT INTO products (name, price) 53 | VALUES ('Laptop', 1000), ('Smartphone', 700); 54 | ``` 55 | - **Query the data**: 56 | ```sql 57 | SELECT * FROM products; 58 | ``` 59 | 60 | #### **5. (Optional) Integrate Aiven Storage with PostgreSQL** 61 | - If you're using Aiven's managed storage, you can perform the following operations: 62 | - **Download data from Aiven storage** (if required), using Aiven's integration options or by connecting to storage buckets. 63 | - **Load data into PostgreSQL** (if you’ve uploaded a CSV): 64 | - You can use `COPY` commands in PostgreSQL or perform an import directly through DBeaver’s **Import Data** option. 65 | - Example `COPY` command: 66 | ```sql 67 | COPY products FROM '/path/to/your/file.csv' DELIMITER ',' CSV HEADER; 68 | ``` 69 | 70 | #### **6. Perform Data Operations Using DBeaver** 71 | - Use DBeaver to interact with the **PostgreSQL database**. 72 | - **CRUD Operations**: Create, read, update, and delete data. 73 | - **Querying**: Run SQL queries and get results directly in DBeaver. 74 | - **Database Management**: Create new tables, define schemas, and more. 75 | 76 | --- 77 | 78 | ### **Deliverables**: 79 | 1. **Screenshots** of the Aiven dashboard with the PostgreSQL service and storage bucket setup. 80 | 2. **SQL scripts** for creating and inserting data into the `products` table. 81 | 3. **Python script** (optional) for uploading files to Aiven storage (if applicable). 82 | 4. **Connection details** and queries executed via DBeaver. 83 | 5. A brief report documenting the steps taken, cloud setup, and any challenges faced. 84 | 85 | --- 86 | 87 | ### **Skills Gained**: 88 | - Configuring and using **Aiven's managed PostgreSQL**. 89 | - **Uploading data** to managed cloud storage. 90 | - Using **DBeaver** to connect and query PostgreSQL. 91 | - Integrating **cloud storage** with your database system. 92 | - Performing **ETL operations** (optional if data is being uploaded). 93 | 94 | --- 95 | 96 | ### **Why Use Aiven for this Project?** 97 | - **Managed PostgreSQL**: Aiven handles your PostgreSQL installation, backups, scaling, and monitoring, so you can focus on the data engineering tasks. 98 | - **Storage Integration**: Easily manage cloud storage for your data and avoid manual setups of services like AWS S3 or Azure Blob. 99 | - **Simplified Setup**: Aiven offers a streamlined, unified experience for cloud services, databases, and storage. 100 | 101 | This updated version with **Aiven** simplifies your cloud storage and database setup while still providing the core hands-on experience in managing cloud databases and interacting with them via DBeaver. 102 | -------------------------------------------------------------------------------- /WeekOneProject.md: -------------------------------------------------------------------------------- 1 | ### Project: Setting Up Cloud Storage and Connecting a Database with DBeaver 2 | 3 | #### Objective: 4 | The goal of this project is to set up a cloud storage service (AWS S3 or Azure Blob Storage), create a PostgreSQL database, and connect to it using DBeaver for managing and querying the data. 
This project will help you understand how to configure cloud storage, set up a relational database, and use a SQL client to interact with the database. 5 | 6 | --- 7 | 8 | ### Steps to Complete the Project: 9 | 10 | #### 1. Set Up Cloud Storage (AWS S3 or Azure Blob Storage) 11 | 12 | **Using AWS S3:** 13 | - Create an AWS account (if you don’t have one). 14 | - Go to the **S3 dashboard** and create a new **S3 bucket**. 15 | - Set a unique bucket name and choose a region. 16 | - Leave default settings for now. 17 | - Upload a sample file (e.g., a CSV file or any dataset) to your S3 bucket. 18 | 19 | **OR** 20 | 21 | **Using Azure Blob Storage:** 22 | - Create an Azure account (if you don’t have one). 23 | - Go to the **Azure Portal** and create a **Storage Account**. 24 | - Choose the appropriate region and resource group. 25 | - Once created, navigate to the Blob Storage section and create a **container**. 26 | - Upload a sample file (e.g., a CSV file) to your Azure Blob Storage. 27 | 28 | #### 2. Set Up PostgreSQL Database 29 | 30 | - Install PostgreSQL locally on your machine or use a cloud database provider like AWS RDS or Azure PostgreSQL. 31 | - For local installation: 32 | - **Windows**: Download the installer from the official PostgreSQL website. 33 | - **Mac**: Use Homebrew (`brew install postgresql`). 34 | - **Linux**: Use the package manager (`sudo apt-get install postgresql`). 35 | 36 | - Create a PostgreSQL database named `test_db` (or any other name). 37 | - Connect to the database using the `psql` terminal. 38 | - Create a simple table to store data (e.g., a table for storing basic product information): 39 | ```sql 40 | CREATE TABLE products ( 41 | id SERIAL PRIMARY KEY, 42 | name VARCHAR(100), 43 | price DECIMAL 44 | ); 45 | ``` 46 | 47 | #### 3. Install and Configure DBeaver 48 | 49 | - Download and install **DBeaver** (a SQL client tool that connects to databases). 50 | - Go to [DBeaver website](https://dbeaver.io/) and download the version compatible with your operating system. 51 | 52 | - Open DBeaver and create a **new connection** to the PostgreSQL database: 53 | - Select **PostgreSQL** as the database type. 54 | - Enter the database connection details (host, port, username, password, and database name). 55 | - For local PostgreSQL installation, the default values are typically: 56 | - Host: `localhost` 57 | - Port: `5432` 58 | - Username: `postgres` 59 | - Password: Your PostgreSQL password 60 | - Database: `test_db` 61 | 62 | #### 4. Connect Cloud Storage with PostgreSQL 63 | 64 | - **(Optional) For AWS S3**: Use a tool like `boto3` (AWS SDK for Python) to interact with the files stored in your S3 bucket. You could upload a CSV file and load it into your PostgreSQL database using Python. 65 | - Example Python code using `boto3` to download a file from S3: 66 | ```python 67 | import boto3 68 | 69 | s3 = boto3.client('s3') 70 | bucket_name = 'your-bucket-name' 71 | file_key = 'your-file.csv' 72 | local_file_path = '/path/to/save/file.csv' 73 | 74 | s3.download_file(bucket_name, file_key, local_file_path) 75 | ``` 76 | 77 | - **(Optional) For Azure Blob Storage**: Use the `azure-storage-blob` Python library to interact with Azure Blob Storage. 
78 | - Example Python code to download a file from Azure Blob Storage: 79 | ```python 80 | from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient 81 | 82 | connection_string = "your_connection_string" 83 | container_name = "your_container_name" 84 | blob_name = "your-file.csv" 85 | download_path = "/path/to/save/file.csv" 86 | 87 | blob_service_client = BlobServiceClient.from_connection_string(connection_string) 88 | container_client = blob_service_client.get_container_client(container_name) 89 | blob_client = container_client.get_blob_client(blob_name) 90 | 91 | with open(download_path, "wb") as download_file: 92 | download_file.write(blob_client.download_blob().readall()) 93 | ``` 94 | 95 | #### 5. Use DBeaver to Interact with PostgreSQL 96 | 97 | - Open **DBeaver** and connect to your **PostgreSQL database**. 98 | - Execute basic SQL queries such as: 99 | - Inserting data: 100 | ```sql 101 | INSERT INTO products (name, price) VALUES ('Laptop', 1000); 102 | ``` 103 | - Querying the data: 104 | ```sql 105 | SELECT * FROM products; 106 | ``` 107 | - You can now use DBeaver to perform other SQL operations like creating new tables, updating data, etc. 108 | 109 | #### 6. (Optional) Data Import from CSV 110 | 111 | If you’ve uploaded a CSV file to your cloud storage (S3 or Azure Blob), you can use DBeaver to import this file into your PostgreSQL database: 112 | - In DBeaver, right-click on the table where you want to import data and select **Import Data**. 113 | - Choose the CSV file from your local machine (after downloading from cloud storage). 114 | - Map the CSV columns to the corresponding table columns. 115 | 116 | --- 117 | 118 | ### Deliverables: 119 | 1. Screenshots of the cloud storage (AWS S3 or Azure Blob) with uploaded files. 120 | 2. PostgreSQL database schema (SQL script) for the `products` table. 121 | 3. A Python script for interacting with the cloud storage and PostgreSQL database (if applicable). 122 | 4. DBeaver connection details and queries performed on the database. 123 | 5. A brief report documenting the steps taken and any challenges faced. 124 | 125 | --- 126 | 127 | ### Skills Gained: 128 | - Configuring cloud storage (AWS S3 or Azure Blob Storage). 129 | - Setting up and connecting to a PostgreSQL database. 130 | - Using DBeaver as a SQL client for managing and querying data. 131 | - Integrating cloud storage with PostgreSQL (optional, but adds a real-world dimension). 132 | 133 | This is a simple and effective project that will help learners get hands-on experience with cloud services, databases, and SQL client tools while reinforcing key data engineering concepts. 134 | -------------------------------------------------------------------------------- /Day3-WeekOneDayThreeClass.md: -------------------------------------------------------------------------------- 1 | ### Data Governance, Security, Compliance, and Access Control 2 | 3 | Data has become a critical asset in today’s world, driving decisions and fueling innovation. However, the value of data comes with the responsibility to manage it effectively, secure it from threats, ensure compliance with legal and regulatory standards, and control who can access it. Here’s an overview of these core principles: 4 | 5 | #### 1. Data Governance 6 | **Definition:** 7 | Data governance involves the management of data availability, usability, integrity, and security within an organization. It sets the framework for how data is handled and ensures it aligns with business objectives. 
8 | 9 | **Key Components:** 10 | - **Data Ownership:** Clearly defining who is responsible for data. 11 | - **Data Quality:** Establishing standards to maintain accuracy and reliability. 12 | - **Policies and Procedures:** Creating rules for data usage and handling. 13 | 14 | **Benefits:** 15 | - Enhanced decision-making. 16 | - Compliance with regulations. 17 | - Improved data security. 18 | 19 | #### 2. Data Security 20 | **Definition:** 21 | Protecting data from unauthorized access, breaches, and theft. 22 | 23 | **Key Practices:** 24 | - **Encryption:** Securing data both at rest and in transit. 25 | - **Firewalls and Intrusion Detection:** Preventing unauthorized access to systems. 26 | - **Authentication and Authorization:** Ensuring only legitimate users can access sensitive data. 27 | 28 | **Emerging Threats:** 29 | - Ransomware attacks. 30 | - Phishing schemes targeting data storage systems. 31 | 32 | **Mitigation:** 33 | - Regular security audits. 34 | - Employee training. 35 | - Investment in robust security tools. 36 | 37 | #### 3. Compliance 38 | **Definition:** 39 | Ensuring data handling practices meet legal and regulatory requirements. 40 | 41 | **Major Regulations:** 42 | - **GDPR (General Data Protection Regulation):** European Union data privacy law. 43 | - **CCPA (California Consumer Privacy Act):** Data privacy law for California residents. 44 | - **HIPAA (Health Insurance Portability and Accountability Act):** U.S. law governing healthcare data. 45 | 46 | **Consequences of Non-Compliance:** 47 | - Fines. 48 | - Reputational damage. 49 | - Legal liabilities. 50 | 51 | **Steps to Achieve Compliance:** 52 | - Regular audits. 53 | - Documentation of data handling procedures. 54 | - Collaboration with legal and compliance experts. 55 | 56 | #### 4. Access Control 57 | **Definition:** 58 | Restricting access to data based on user roles and responsibilities. 59 | 60 | **Key Methods:** 61 | - **Role-Based Access Control (RBAC):** Permissions are assigned based on job functions. 62 | - **Least Privilege Principle:** Users are given the minimum level of access required to perform their tasks. 63 | - **Multi-Factor Authentication (MFA):** Adding layers of verification for secure access. 64 | 65 | **Tools:** 66 | - Identity and Access Management (IAM) solutions. 67 | - Audit trails to monitor access logs. 68 | 69 | --- 70 | 71 | ### Introduction to SQL for Data Engineering and PostgreSQL Setup 72 | 73 | SQL (Structured Query Language) is the backbone of data engineering, used to manipulate, query, and manage relational databases. PostgreSQL, a robust open-source database management system, is a popular choice for data engineering projects. 74 | 75 | #### 1. What is SQL? 76 | **Definition:** 77 | A language designed for interacting with relational databases. 78 | 79 | **Common SQL Operations:** 80 | - **SELECT:** Retrieve data from tables. 81 | - **INSERT:** Add new data. 82 | - **UPDATE:** Modify existing data. 83 | - **DELETE:** Remove data. 84 | - **JOIN:** Combine data from multiple tables. 85 | 86 | #### 2. Why SQL for Data Engineering? 87 | **Use Cases:** 88 | - **Data Transformation:** Clean, aggregate, and reshape data for analysis. 89 | - **Data Integration:** Combine data from multiple sources into a central repository. 90 | - **Data Management:** Create and maintain database schemas and indexes. 91 | 92 | **Efficiency:** 93 | SQL is optimized for high-performance queries, essential for big data workloads. 94 | 95 | #### 3. 
Introduction to PostgreSQL 96 | **Overview:** 97 | PostgreSQL is a powerful, feature-rich database system known for its reliability, scalability, and extensibility. 98 | 99 | **Features:** 100 | - **ACID compliance:** Reliable transactions. 101 | - **Support for JSON and array data types.** 102 | - **Advanced indexing options:** Like GiST and GIN. 103 | - **Built-in support for full-text search and stored procedures.** 104 | 105 | **Use Cases:** 106 | - Data warehouses. 107 | - Web applications. 108 | - Analytics. 109 | 110 | #### 4. Setting Up PostgreSQL 111 | **Installation:** 112 | - **On Linux:** `sudo apt install postgresql` 113 | - **On macOS:** `brew install postgresql` 114 | - **On Windows:** Use the official installer from the PostgreSQL website. 115 | 116 | **Basic Commands:** 117 | - Start the PostgreSQL server: `sudo service postgresql start` 118 | - Access the PostgreSQL shell: `psql` 119 | 120 | **Creating a Database:** 121 | ```sql 122 | CREATE DATABASE my_database; 123 | ```` 124 | 125 | ### Connecting to the Database 126 | 127 | To connect to the PostgreSQL database, use the following command in your terminal: 128 | 129 | ```bash 130 | psql -d my_database 131 | ``` 132 | ### Creating a Table 133 | To create a table named employees, use the following SQL command. This table includes an automatically incrementing id, the name of the employee, their role, and their salary: 134 | 135 | ```sql 136 | CREATE TABLE employees ( 137 | id SERIAL PRIMARY KEY, 138 | name VARCHAR(100), 139 | role VARCHAR(50), 140 | salary NUMERIC 141 | ); 142 | ``` 143 | 144 | ### Inserting Data 145 | Add a record to the employees table using the following SQL command. This example inserts a new employee, "John Doe," with the role "Data Engineer" and a salary of 75,000: 146 | 147 | 148 | ```sql 149 | INSERT INTO employees (name, role, salary) 150 | VALUES ('John Doe', 'Data Engineer', 75000); 151 | ``` 152 | ### Querying Data 153 | To retrieve all data from the employees table, use the SELECT command: 154 | 155 | ```sql 156 | SELECT * FROM employees; 157 | ``` 158 | This command will display all rows and columns in the table. 159 | 160 | ### Conclusion 161 | Understanding data governance, security, compliance, and access control is essential for protecting organizational data and meeting regulatory standards. These principles help ensure data is used effectively, remains secure, and complies with legal requirements. 162 | 163 | At the same time, mastering SQL and PostgreSQL equips data engineers with powerful tools to build and manage data pipelines. SQL provides the foundation for querying and manipulating data, while PostgreSQL offers a robust platform for efficient storage and retrieval, enabling effective data analytics and decision-making. 164 | 165 | -------------------------------------------------------------------------------- /Apache Kafka 101: Apache Kafka for Data Engineering Guide.md: -------------------------------------------------------------------------------- 1 | ### Kafka-cheat-sheet 2 | 3 | Apache Kafka® serves as an open-source distributed streaming platform. Similar to other distributed systems, Kafka boasts a complex architecture, which may pose a challenge for new developers. Setting up Kafka involves navigating a formidable command line interface and configuring numerous settings. In this guide, I will provide insights into architectural concepts and essential commands frequently used by developers to initiate their journey with Kafka. 
4 | 5 | #### Key Concepts: 6 | 7 | - **Clusters**: A group of servers working together to provide speed (low latency), durability, and scalability. 8 | - **Topic**: A named stream of records into which Kafka organizes data. 9 | - **Brokers**: Kafka server instances that store and replicate messages. 10 | - **Producers**: Applications that write data to Kafka topics. 11 | - **Consumers**: Applications that read data from Kafka topics. 12 | - **Partitions**: Divisions of a topic for scalability and parallelism. 13 | - **Connect**: Kafka Connect streams data between Kafka and external systems. The framework manages the Tasks; a Connector is only responsible for generating the set of Tasks and indicating to the framework when they need to be updated. 14 | 15 | The easiest way to run Kafka clusters is to use **Confluent Cloud**, a managed Kafka service from Confluent (the company founded by Kafka's original creators), which also provides client libraries for writing producers and consumers and for working with a schema registry. 16 | 17 | #### Summary 18 | 19 | 1. **Apache Kafka**: A distributed streaming platform. 20 | 21 | The Kafka CLI is a powerful tool. However, the user experience can be challenging if you don’t already know the exact command needed for your task. Below are the commonly used CLI commands to interact with Kafka: 22 | 23 | #### Start Zookeeper 24 | ```sh 25 | zookeeper-server-start config/zookeeper.properties 26 | ``` 27 | 28 | #### Start Kafka Server 29 | ```sh 30 | kafka-server-start config/server.properties 31 | ``` 32 | 33 | ### Kafka Topics 34 | 35 | - **List existing topics** 36 | ```sh 37 | bin/kafka-topics.sh --zookeeper localhost:2181 --list 38 | ``` 39 | 40 | - **Describe a topic** 41 | ```sh 42 | bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic mytopic 43 | ``` 44 | 45 | - **Purge a topic** 46 | ```sh 47 | bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --config retention.ms=1000 48 | ``` 49 | 50 | Then, once the old messages have been deleted, restore the default retention: 51 | 52 | ```sh 53 | bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --delete-config retention.ms 54 | ``` 55 | 56 | - **Delete a topic** 57 | ```sh 58 | bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic mytopic 59 | ``` 60 | 61 | - **Get number of messages in a topic** 62 | ```sh 63 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}' 64 | ``` 65 | 66 | - **Get the earliest offset still in a topic** 67 | ```sh 68 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -2 69 | ``` 70 | 71 | - **Get the latest offset still in a topic** 72 | ```sh 73 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -1 74 | ``` 75 | 76 | - **Consume messages with the console consumer** 77 | ```sh 78 | bin/kafka-console-consumer.sh --new-consumer --bootstrap-server localhost:9092 --topic mytopic --from-beginning 79 | ``` 80 | 81 | - **Get the consumer offsets for a topic** 82 | ```sh 83 | bin/kafka-consumer-offset-checker.sh --zookeeper=localhost:2181 --topic=mytopic --group=my_consumer_group 84 | ``` 85 | 86 | - **Read from `__consumer_offsets`** 87 | Add the following property to `config/consumer.properties`: 88 | ```sh 89 | exclude.internal.topics=false 90 | ``` 91 | 92 | Then run: 93 | ```sh 94 | bin/kafka-console-consumer.sh --consumer.config config/consumer.properties --from-beginning --topic __consumer_offsets --zookeeper localhost:2181 --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter"
95 | ``` 96 | 97 | ### Kafka Consumer Groups 98 | 99 | - **List the consumer groups known to Kafka** 100 | ```sh 101 | bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list # (old API) 102 | ``` 103 | ```sh 104 | bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --list # (new API) 105 | ``` 106 | 107 | - **View the details of a consumer group** 108 | ```sh 109 | bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group my_consumer_group 110 | ``` 111 | 112 | ### Kafkacat 113 | 114 | - **Getting the last five messages of a topic** 115 | ```sh 116 | kafkacat -C -b localhost:9092 -t mytopic -p 0 -o -5 -e 117 | ``` 118 | 119 | ### Zookeeper 120 | 121 | - **Starting the Zookeeper Shell** 122 | ```sh 123 | bin/zookeeper-shell.sh localhost:2181 124 | ``` 125 | 126 | ### Running Java Class 127 | 128 | - **Run `ConsumerOffsetChecker` once the Kafka server is up and a topic has had messages produced and consumed** 129 | ```sh 130 | bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --broker-info --zookeeper localhost:2181 --group test-consumer-group 131 | ``` 132 | 133 | **Note:** `ConsumerOffsetChecker` has been removed in Kafka 1.0.0. Use `kafka-consumer-groups.sh` to get consumer group details: 134 | ```sh 135 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-38063 136 | ``` 137 | 138 | ### Kafka Server & Topics 139 | 140 | - **Start Zookeeper** 141 | ```sh 142 | bin/zookeeper-server-start.sh config/zookeeper.properties 143 | ``` 144 | 145 | - **Start Kafka brokers (Servers = cluster)** 146 | ```sh 147 | bin/kafka-server-start.sh config/server.properties 148 | ``` 149 | 150 | - **Create a topic** 151 | ```sh 152 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test 153 | ``` 154 | 155 | - **List all topics** 156 | ```sh 157 | bin/kafka-topics.sh --list --zookeeper localhost:2181 158 | ``` 159 | 160 | - **See topic details (partition, replication factor, etc.)** 161 | ```sh 162 | bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test 163 | ``` 164 | 165 | - **Change partition number of a topic (`--alter`)** 166 | ```sh 167 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic test --partitions 3 168 | ``` 169 | 170 | ### Producer 171 | 172 | ```sh 173 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test 174 | ``` 175 | 176 | ### Consumer 177 | 178 | ```sh 179 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic test 180 | ``` 181 | 182 | - **To consume only new messages** 183 | ```sh 184 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test 185 | ``` 186 | 187 | ### Kafka Connect 188 | 189 | - **Standalone connectors (run in a single, local, dedicated process)** 190 | ```sh 191 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties 192 | ``` 193 | 194 | ### Reference 195 | 196 | [Redpanda Kafka Tutorial](https://redpanda.com/guides/kafka-tutorial) 197 | -------------------------------------------------------------------------------- /Apache Kafka 102: Apache Kafka for Data Engineering Guide.md: -------------------------------------------------------------------------------- 1 | ### **Apache Kafka Cheat Sheet** 2 | 3 | #### **Introduction** 4 | Apache Kafka® is an **open-source distributed event streaming platform** used for building **real-time data pipelines** and
streaming applications. Kafka is **horizontally scalable**, **fault-tolerant**, and **highly durable**. 5 | 6 | This guide provides an in-depth look at Kafka's **architecture**, **core concepts**, and **commonly used commands** with detailed explanations and examples. 7 | 8 | --- 9 | 10 | #### **1. Key Concepts** 11 | ##### **1.1 Clusters** 12 | - A **Kafka Cluster** is a collection of **brokers (servers)** working together. 13 | - Provides **fault tolerance, scalability, and high throughput**. 14 | - Clusters handle **millions of messages per second** in distributed systems. 15 | 16 | ##### **1.2 Topics** 17 | - A **topic** is a logical channel where messages are **produced and consumed**. 18 | - Each topic is **split into partitions** for parallel processing. 19 | - Topics are **multi-subscriber**, meaning multiple consumers can read from them. 20 | 21 | **Rules for Naming Topics:** 22 | 1. Topic names should **only contain** letters (`a-z`, `A-Z`), numbers (`0-9`), dots (`.`), underscores (`_`), and hyphens (`-`). 23 | 2. Topic names should be **descriptive** and meaningful. 24 | 3. **Avoid special characters** like `@`, `#`, `!`, `*`, as Kafka does not support them. 25 | 26 | **Examples of Topic Names:** 27 | ``` 28 | # Valid topic names 29 | customer_orders 30 | logs.application-errors 31 | user_activity 32 | 33 | # Invalid topic names (containing special characters) 34 | customer@orders # Invalid '@' 35 | logs#errors # Invalid '#' 36 | ``` 37 | 38 | --- 39 | 40 | #### **2. Brokers** 41 | - A **broker** is a Kafka server that stores and serves messages. 42 | - Kafka brokers manage: 43 | - **Topic partitions** 44 | - **Message replication** 45 | - **Data storage & retrieval** 46 | - Brokers are part of a **Kafka cluster** and work together. 47 | 48 | --- 49 | 50 | #### **3. Producers** 51 | - Producers send (publish) messages to **Kafka topics**. 52 | - Messages are assigned to **partitions** based on: 53 | - **Round-robin (default)** 54 | - **Key-based partitioning** (Ensures messages with the same key go to the same partition) 55 | 56 | **Example: Writing messages to a Kafka topic** 57 | ``` 58 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic customer_orders 59 | ``` 60 | Type messages and press **Enter** to send them. 61 | 62 | --- 63 | 64 | #### **4. Consumers** 65 | - Consumers **read messages** from Kafka topics. 66 | - Consumers belong to **consumer groups**, allowing **parallel processing**. 67 | 68 | **Example: Reading messages from a topic** 69 | ``` 70 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --from-beginning 71 | ``` 72 | This will print **all past and new messages** from the `customer_orders` topic. 73 | 74 | --- 75 | 76 | #### **5. Partitions** 77 | - Topics are split into **partitions** to enable **parallel consumption**. 78 | - Each partition is **ordered** and messages are assigned an **offset**. 79 | - Partitions allow **horizontal scaling**. 80 | 81 | **Example of a topic with 3 partitions:** 82 | ``` 83 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic logs 84 | ``` 85 | 86 | **Rules for Partitions:** 87 | 1. More partitions = **better parallelism**. 88 | 2. Cannot **reduce** partitions, only **increase** them. 89 | 3. Messages are assigned partitions **based on key hashing** or round-robin. 90 | 91 | --- 92 | 93 | #### **6. 
Kafka Connect** 94 | Kafka Connect is used to **stream data** between Kafka and **external data systems** like **databases, file systems, and cloud storage**. 95 | 96 | **Example: Running a Kafka Connect Worker** 97 | ``` 98 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties 99 | ``` 100 | 101 | --- 102 | 103 | #### **7. Kafka CLI Commands** 104 | ##### **7.1 Starting Kafka Components** 105 | ``` 106 | # Start Zookeeper 107 | zookeeper-server-start.sh config/zookeeper.properties 108 | 109 | # Start Kafka Server 110 | kafka-server-start.sh config/server.properties 111 | ``` 112 | 113 | --- 114 | 115 | ##### **7.2 Managing Topics** 116 | ###### **List Topics** 117 | ``` 118 | bin/kafka-topics.sh --zookeeper localhost:2181 --list 119 | ``` 120 | 121 | ###### **Describe a Topic** 122 | ``` 123 | bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic customer_orders 124 | ``` 125 | 126 | ###### **Create a Topic (3 Examples)** 127 | ``` 128 | # Create a topic with 1 partition and replication factor of 1 129 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic customer_orders 130 | 131 | # Create a topic with 3 partitions and replication factor of 2 132 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic logs 133 | 134 | # Create a topic for real-time analytics with 5 partitions 135 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 5 --topic analytics_stream 136 | ``` 137 | 138 | ###### **Delete a Topic** 139 | ``` 140 | bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic old_topic 141 | ``` 142 | 143 | ###### **Increase Partitions** 144 | ``` 145 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic logs --partitions 5 146 | ``` 147 | **⚠️ Note:** Kafka **does not** allow **decreasing** partitions! 148 | 149 | --- 150 | 151 | ##### **7.3 Managing Messages** 152 | ###### **Find Number of Messages in a Topic** 153 | ``` 154 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}' 155 | ``` 156 | 157 | ###### **Get Earliest and Latest Offsets** 158 | ``` 159 | # Earliest offset (first message) 160 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -2 161 | 162 | # Latest offset (last message) 163 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -1 164 | ``` 165 | 166 | --- 167 | 168 | ##### **7.4 Consumer Groups** 169 | ###### **List Consumer Groups** 170 | ``` 171 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list 172 | ``` 173 | 174 | ###### **Describe a Consumer Group** 175 | ``` 176 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my_consumer_group 177 | ``` 178 | 179 | --- 180 | 181 | ##### **7.5 Using Kafkacat** 182 | ###### **Read Last 5 Messages from a Topic** 183 | ``` 184 | kafkacat -C -b localhost:9092 -t logs -p 0 -o -5 -e 185 | ``` 186 | 187 | --- 188 | 189 | #### **8. Advanced Notes** 190 | 1. **Kafka Retention Policies**: 191 | - Kafka can retain messages **forever**, for a set **time period**, or until reaching a **size limit**. 
192 | - Configure **log retention** with: 193 | ``` 194 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic logs --config retention.ms=604800000 # 7 days 195 | ``` 196 | 197 | 2. **Monitoring Kafka**: 198 | - Use Kafka's built-in tools or third-party **monitoring solutions** like **Confluent Control Center, Prometheus, or Grafana**. 199 | 200 | 3. **Kafka Streams API**: 201 | - Used for **real-time data processing** within Kafka itself. 202 | - Helps build **event-driven applications**. 203 | 204 | --- 205 | 206 | #### **9. References** 207 | For more details, check out: 208 | - [Apache Kafka Documentation](https://kafka.apache.org/documentation/) 209 | - [Redpanda Kafka Guide](https://redpanda.com/guides/kafka-tutorial) 210 | -------------------------------------------------------------------------------- /SQL-Manual.md: -------------------------------------------------------------------------------- 1 | # Structured Query Language (SQL) 2 | **Course Manual – Version 1.2** 3 | © 2025 4 | 5 | --- 6 | 7 | ## Table of Contents 8 | 1. [Introduction](#introduction) 9 | 2. [Basic Queries](#basic-queries) 10 | 3. [Advanced Operators](#advanced-operators) 11 | 4. [Expressions](#expressions) 12 | 5. [Functions](#functions) 13 | 6. [Multi-Table Queries](#multi-table-queries) 14 | 7. [Queries Within Queries](#queries-within-queries) 15 | 8. [Maintaining Tables](#maintaining-tables) 16 | 9. [Defining Database Objects](#defining-database-objects) 17 | A. [Appendices](#appendices) 18 | 19 | --- 20 | 21 | 22 | # 1. Introduction 23 | 24 | ### Course Objectives 25 | By the end of the course you will be able to: 26 | - **Query** relational databases 27 | - **Maintain** relational databases 28 | - **Define** relational databases 29 | 30 | > “MAINTAIN – QUERY – RELATIONAL DATABASE – DEFINE” 31 | 32 | --- 33 | 34 | ### What is a Relational Database? 35 | - **Tables** (unique names, ≤ 18 chars, start with letter). 36 | - **Columns** (unique within table). 37 | - **Rows** (identified by data values, not position). 38 | 39 | --- 40 | 41 | ### What is SQL? 42 | - **Structured Query Language** – IBM (1974), ANSI (1986), ISO (1987). 43 | - **Six core statements**: `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`. 44 | 45 | | Category | Statements | 46 | |-----------------|---------------------| 47 | | **Query** | `SELECT` | 48 | | **Maintenance** | `INSERT`, `UPDATE`, `DELETE` | 49 | | **Definition** | `CREATE`, `DROP` | 50 | 51 | --- 52 | 53 | 54 | # 2. Basic Queries 55 | 56 | ### 2.1 Selecting All Columns & Rows 57 | ```sql 58 | SELECT * 59 | FROM countries; 60 | ``` 61 | 62 | ### 2.2 Selecting Specific Columns 63 | ```sql 64 | SELECT title, job 65 | FROM jobs; 66 | ``` 67 | 68 | ### 2.3 Selecting Specific Rows 69 | | Value Type | Format Example | 70 | |------------|----------------| 71 | | Numeric | `123`, `-45.67` | 72 | | String | `'Canada'` | 73 | | Date | `'1999-12-31'` | 74 | 75 | ```sql 76 | SELECT name, country 77 | FROM persons 78 | WHERE country = 'Canada'; 79 | ``` 80 | 81 | ### 2.4 Sorting Rows 82 | ```sql 83 | SELECT country, area 84 | FROM countries 85 | ORDER BY area DESC; -- largest first 86 | ``` 87 | 88 | Multiple columns: 89 | ```sql 90 | ORDER BY language, pop DESC; 91 | ``` 92 | 93 | ### 2.5 Eliminating Duplicate Rows 94 | ```sql 95 | SELECT DISTINCT job 96 | FROM persons; 97 | ``` 98 | 99 | --- 100 | 101 | 102 | # 3. 
Advanced Operators 103 | 104 | | Operator | Meaning | Example | 105 | |----------|---------|---------| 106 | | `LIKE` | Pattern | `WHERE name LIKE 'Z%'` | 107 | | `AND` | Both true | `WHERE gnp < 3000 AND literacy < 40` | 108 | | `BETWEEN`| Inclusive range | `WHERE pop BETWEEN 100000 AND 200000` | 109 | | `OR` | Either true | `WHERE country = 'USA' OR country = 'Canada'` | 110 | | `IN` | List match | `WHERE language IN ('English', 'French')` | 111 | | `IS NULL`| Missing value | `WHERE gnp IS NULL` | 112 | | `NOT` | Negation | `WHERE NOT country = 'USA'` | 113 | | `( )` | Precedence | `WHERE (job='S' OR job='W') AND country='Italy'` | 114 | 115 | --- 116 | 117 | 118 | # 4. Expressions 119 | 120 | ### 4.1 Arithmetic Expressions 121 | ```sql 122 | SELECT country, pop/area AS density 123 | FROM countries; 124 | ``` 125 | 126 | ### 4.2 Expressions in WHERE / ORDER BY 127 | ```sql 128 | SELECT * 129 | FROM countries 130 | WHERE pop/area > 1000 131 | ORDER BY pop/area DESC; 132 | ``` 133 | 134 | ### 4.3 Column Aliases (`AS`) 135 | ```sql 136 | SELECT gnp*1000000/pop AS gpp 137 | FROM countries 138 | WHERE gnp*1000000/pop > 20000 -- an alias cannot be referenced in WHERE, so the expression is repeated 139 | ORDER BY gpp DESC; -- the alias can be used in ORDER BY 140 | ``` 141 | 142 | --- 143 | 144 | 145 | # 5. Functions 146 | 147 | ### 5.1 Statistical Functions 148 | | Function | Purpose | 149 | |----------|---------| 150 | | `COUNT(*)` | Rows | 151 | | `COUNT(col)` | Non-NULL | 152 | | `SUM(col)` | Total | 153 | | `AVG(col)` | Average | 154 | | `MIN(col)` / `MAX(col)` | Min / Max | 155 | 156 | Grand totals: 157 | ```sql 158 | SELECT AVG(pop) AS avg_pop 159 | FROM countries 160 | WHERE language = 'English'; 161 | ``` 162 | 163 | ### 5.2 Grouping 164 | ```sql 165 | SELECT job, COUNT(*) AS total 166 | FROM persons 167 | GROUP BY job; 168 | ``` 169 | 170 | ### 5.3 HAVING (post-group filter) 171 | ```sql 172 | SELECT language, AVG(literacy) AS avg_lit 173 | FROM countries 174 | GROUP BY language 175 | HAVING AVG(literacy) > 90; 176 | ``` 177 | 178 | --- 179 | 180 | 181 | # 6. Multi-Table Queries 182 | 183 | ### 6.1 Joins 184 | ```sql 185 | SELECT p.name, j.title 186 | FROM persons p 187 | JOIN jobs j ON p.job = j.job; 188 | ``` 189 | 190 | ### 6.2 Table Aliases 191 | Short-hand: 192 | ```sql 193 | SELECT c.country, a.budget 194 | FROM countries c 195 | JOIN armies a ON c.country = a.country; 196 | ``` 197 | 198 | ### 6.3 Union 199 | Combine results (distinct): 200 | ```sql 201 | SELECT country FROM religions WHERE percent > 40 202 | UNION 203 | SELECT country FROM countries WHERE language = 'German'; 204 | ``` 205 | 206 | --- 207 | 208 | 209 | # 7. Queries Within Queries (Subqueries) 210 | 211 | | Type | Template | Example | 212 | |------|----------|---------| 213 | | **Single-valued** | `WHERE col = (SELECT …)` | `WHERE pop > (SELECT AVG(pop) FROM countries)` | 214 | | **Multi-valued** | `WHERE col IN (SELECT …)` | `WHERE country IN (SELECT country FROM religions WHERE percent > 95)` | 215 | | **Correlated** | Inner query references outer alias | `WHERE bdate = (SELECT MIN(bdate) FROM persons p2 WHERE p2.job = p1.job)` | 216 | 217 | --- 218 | 219 | 220 | # 8. 
Maintaining Tables 221 | 222 | | Action | Syntax | Example | 223 | |--------|--------|---------| 224 | | **Insert** | `INSERT INTO … VALUES …` | `INSERT INTO jobs(job,title) VALUES ('A','Author');` | 225 | | **Update** | `UPDATE … SET … WHERE …` | `UPDATE persons SET job='E' WHERE person=500;` | 226 | | **Delete** | `DELETE FROM … WHERE …` | `DELETE FROM persons WHERE person=500;` | 227 | | **Transaction** | `COMMIT;` or `ROLLBACK;` | Undo or save all changes since last `COMMIT`. | 228 | 229 | --- 230 | 231 | 232 | # 9. Defining Database Objects 233 | 234 | ### 9.1 Tables 235 | ```sql 236 | CREATE TABLE theologians ( 237 | name CHAR(20) PRIMARY KEY, 238 | bdate DATE, 239 | gender CHAR(6) NOT NULL CHECK (gender IN ('Male','Female')), 240 | country CHAR(20) REFERENCES countries(country) 241 | ); 242 | ``` 243 | 244 | ### 9.2 Indexes 245 | ```sql 246 | CREATE INDEX idx_gender_job ON persons(gender, job); 247 | DROP INDEX idx_gender_job; 248 | ``` 249 | 250 | ### 9.3 Views 251 | ```sql 252 | CREATE VIEW iv AS 253 | SELECT * FROM persons WHERE country = 'Israel'; 254 | 255 | SELECT * FROM iv; 256 | DROP VIEW iv; 257 | ``` 258 | 259 | --- 260 | 261 | 262 | # A. Appendices 263 | 264 | ### Exercise Database Schema 265 | | Table | Key Columns (sample) | 266 | |---------|----------------------| 267 | | **persons** | person, name, bdate, gender, country, job | 268 | | **countries** | country, pop, area, gnp, language, literacy | 269 | | **armies** | country, budget, troops, tanks, ships, planes | 270 | | **jobs** | job, title | 271 | | **religions** | country, religion, percent | 272 | 273 | > All monetary values in millions; population in people; area in sq mi; literacy in %. 274 | 275 | ### Answers to Selected Exercises 276 | See the original PDF **pages 94-110** for the complete answer key. 277 | 278 | --- 279 | 280 | ## Syntax Summary Cheat-Sheet 281 | ```sql 282 | SELECT [DISTINCT] columns|functions 283 | FROM table [alias] [, ...] 284 | [WHERE conditions] 285 | [GROUP BY columns] 286 | [HAVING aggregate_conditions] 287 | [ORDER BY column|expr|position [ASC|DESC]]; 288 | 289 | INSERT INTO table(col1,...) VALUES(val1,...); 290 | UPDATE table SET col=val [, ...] WHERE ...; 291 | DELETE FROM table WHERE ...; 292 | CREATE TABLE tbl (...); 293 | DROP TABLE tbl; 294 | COMMIT; 295 | ROLLBACK; 296 | ``` 297 | -------------------------------------------------------------------------------- /Apache Airflow 101 Guide.md: -------------------------------------------------------------------------------- 1 | ## **Apache Airflow Setup & Introduction (Multi-Component Mode)** 2 | 3 | #### **Understanding Workflow Orchestration** 4 | 5 | **Workflow orchestration** is the automated coordination and management of complex data workflows. Think of it as a conductor directing an orchestra - it ensures that different data processing tasks run in the correct order, at the right time, and handles failures gracefully. 
6 | 7 | **Why Apache Airflow?** 8 | - **Dependency Management**: Automatically handles task dependencies 9 | - **Scheduling**: Run workflows on time-based or event-based triggers 10 | - **Monitoring**: Visual interface to track job progress and failures 11 | - **Scalability**: Handles complex workflows with hundreds of tasks 12 | - **Flexibility**: Python-based, extensible with custom operators 13 | 14 | ## Airflow Architecture Overview (Multi-Component Setup) 15 | 16 | ``` 17 | ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ 18 | │ Web Server │ │ Scheduler │ │ Executor │ 19 | │ (Port │ │ (Triggers │ │(Runs Tasks) │ 20 | │ 8080) │ │ DAGs) │ │ │ 21 | └─────────────┘ └─────────────┘ └─────────────┘ 22 | │ │ │ 23 | └───────────────────┼───────────────────┘ 24 | │ 25 | ┌─────────────┐ 26 | │ Metadata DB │ 27 | │ (PostgreSQL)│ 28 | └─────────────┘ 29 | ``` 30 | 31 | ### Step-by-Step Installation Guide (Multi-Component Mode) 32 | 33 | #### Setup with Separate Web Server and Scheduler 34 | 35 | ```bash 36 | # 1. Create a new directory for your Airflow project 37 | mkdir airflow-tutorial 38 | cd airflow-tutorial 39 | 40 | # 2. Create and activate virtual environment 41 | python -m venv airflow-env 42 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate 43 | 44 | # 3. Set Airflow home directory 45 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow 46 | 47 | # 4. Install Airflow (using constraints for compatibility) 48 | pip install apache-airflow==2.8.0 --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.9.txt 49 | 50 | # 5. Initialize the database 51 | airflow db init 52 | 53 | # 6. Create an admin user (replace with your details) 54 | airflow users create \ 55 | --username admin \ 56 | --firstname Admin \ 57 | --lastname User \ 58 | --role Admin \ 59 | --email admin@example.com \ 60 | --password admin123 61 | ``` 62 | 63 | #### Running Web Server and Scheduler Separately 64 | 65 | You'll need **two terminal windows** for this setup: 66 | 67 | ##### Terminal 1: Start the Web Server 68 | ```bash 69 | # Activate virtual environment 70 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate 71 | 72 | # Set Airflow home 73 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow 74 | 75 | # Start the web server 76 | airflow webserver --port 8080 77 | ``` 78 | 79 | ##### Terminal 2: Start the Scheduler 80 | ```bash 81 | # Open a new terminal window/tab 82 | cd airflow-tutorial 83 | 84 | # Activate virtual environment 85 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate 86 | 87 | # Set Airflow home 88 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow 89 | 90 | # Start the scheduler 91 | airflow scheduler 92 | ``` 93 | 94 | ### Why Use Separate Components? 
95 | 96 | #### Benefits of Multi-Component Setup: 97 | - **Production-Ready**: Mirrors production deployment patterns 98 | - **Resource Management**: Each component can be scaled independently 99 | - **Monitoring**: Easier to monitor individual component performance 100 | - **Debugging**: Separate logs for web server and scheduler 101 | - **High Availability**: Can run multiple instances of each component 102 | 103 | #### Component Responsibilities: 104 | 105 | **Web Server:** 106 | - Serves the Airflow UI 107 | - Handles user authentication 108 | - Provides REST API endpoints 109 | - Displays DAG visualization and monitoring 110 | 111 | **Scheduler:** 112 | - Monitors DAG files for changes 113 | - Triggers task execution based on schedule 114 | - Manages task dependencies 115 | - Handles task retries and failures 116 | 117 | ### Accessing the Airflow UI 118 | 119 | 1. **Ensure both components are running**: 120 | - Web server should show: `Serving on http://0.0.0.0:8080` 121 | - Scheduler should show: `Starting the scheduler` 122 | 123 | 2. **Open your browser** and navigate to: `http://localhost:8080` 124 | 125 | 3. **Login credentials**: 126 | - Username: `admin` 127 | - Password: `admin123` 128 | 129 | ### Exploring the Airflow UI 130 | 131 | #### 1. DAGs View (Main Dashboard) 132 | - **What you'll see**: List of all available DAGs 133 | - **Key elements**: 134 | - DAG name and description 135 | - Recent runs (green = success, red = failed) 136 | - Schedule interval 137 | - Last run date 138 | - Toggle to pause/unpause DAGs 139 | 140 | #### 2. Tree View 141 | - **Purpose**: Shows DAG runs over time 142 | - **How to access**: Click on any DAG → Tree View tab 143 | - **What it shows**: Task instances arranged by execution date 144 | 145 | #### 3. Graph View 146 | - **Purpose**: Visual representation of DAG structure 147 | - **Shows**: Task dependencies and current status 148 | - **Color coding**: 149 | - Green: Success 150 | - Red: Failed 151 | - Yellow: Running 152 | - Light Blue: Queued 153 | - Gray: Not started 154 | 155 | #### 4. Code View 156 | - **Purpose**: Shows the Python code that defines the DAG 157 | - **Useful for**: Understanding DAG logic and debugging 158 | 159 | #### 5. Gantt Chart 160 | - **Purpose**: Shows task execution timeline 161 | - **Useful for**: Identifying bottlenecks and optimizing performance 162 | 163 | ### Key Concepts Explained Simply 164 | 165 | #### DAG (Directed Acyclic Graph) 166 | A workflow definition - like a recipe that tells Airflow what tasks to run and in what order. 167 | 168 | #### Tasks 169 | Individual units of work (like "download file", "process data", "send email"). 170 | 171 | #### Operators 172 | Templates for tasks (PythonOperator, BashOperator, EmailOperator, etc.). 173 | 174 | #### Task Instance 175 | A specific execution of a task for a particular DAG run. 
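
To tie these four terms together, below is a minimal sketch of a DAG file. It is illustrative only: the `dag_id`, task names, and commands are hypothetical placeholders, not something this tutorial requires you to create yet. Dropped into the `$AIRFLOW_HOME/dags/` folder, a file like this is what the scheduler would pick up and what you would then see in the DAGs, Tree, and Graph views.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def process_data():
    # Stand-in for a real transformation step.
    print("processing data")


with DAG(
    dag_id="hello_etl",                # hypothetical example DAG
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # time-based trigger handled by the scheduler
    catchup=False,                     # do not backfill past dates
) as dag:
    download = BashOperator(task_id="download_file", bash_command="echo 'downloading file'")
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    notify = BashOperator(task_id="send_notification", bash_command="echo 'done'")

    # Dependencies: download -> process -> notify
    download >> process >> notify
```

Each task here is built from an operator template, every scheduled run of the DAG creates one task instance per task, and the `>>` chaining is what the Graph View draws as dependency arrows.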
176 | 177 | ### Monitoring Your Setup 178 | 179 | #### Checking Component Status: 180 | 181 | **Web Server Logs:** 182 | - Look for: `Serving on http://0.0.0.0:8080` 183 | - No error messages about port conflicts 184 | 185 | **Scheduler Logs:** 186 | - Look for: `Starting the scheduler` 187 | - Regular DAG processing messages 188 | - No database connection errors 189 | 190 | #### Health Check Commands: 191 | ```bash 192 | # Check if web server is responding 193 | curl http://localhost:8080/health 194 | 195 | # List DAGs (requires both components running) 196 | airflow dags list 197 | 198 | # Check scheduler status 199 | airflow jobs check --job-type SchedulerJob 200 | ``` 201 | 202 | ### Assignment Solutions 203 | 204 | #### Part 1: Why Airflow is Useful in Data Engineering 205 | 206 | Apache Airflow is essential in data engineering because it provides automated workflow orchestration that eliminates manual intervention in complex data pipelines. Unlike traditional cron jobs or script-based scheduling, Airflow offers dependency management, failure handling, and retry mechanisms that ensure data workflows run reliably at scale. Its Python-based approach allows data engineers to define workflows as code, making them version-controlled, testable, and maintainable, while the web UI provides real-time monitoring and debugging capabilities that are crucial for managing production data pipelines. The separation of the web server and scheduler components allows for better resource allocation and mirrors production deployment patterns used in enterprise environments. 207 | 208 | #### Part 2: Screenshot Documentation 209 | 210 | **Expected Screenshot Elements:** 211 | - Airflow UI header with "Apache Airflow" logo 212 | - Navigation menu (DAGs, Browse, Admin, etc.) 213 | - DAGs list showing example DAGs 214 | - Status indicators (green/red circles) 215 | - URL showing `localhost:8080` 216 | - Both terminal windows showing web server and scheduler running 217 | 218 | **Troubleshooting Common Issues:** 219 | 220 | 1. **Port 8080 already in use**: 221 | ```bash 222 | # Kill process using port 8080 223 | sudo lsof -t -i:8080 | xargs sudo kill -9 224 | # Or use a different port 225 | airflow webserver --port 8081 226 | ``` 227 | 228 | 2. **Scheduler not picking up DAGs**: 229 | - Ensure scheduler is running 230 | - Check DAG file syntax 231 | - Verify DAG is not paused 232 | 233 | 3. **Database lock errors**: 234 | - Stop all Airflow processes 235 | - Delete `airflow.db` file 236 | - Run `airflow db init` again 237 | 238 | 4. **Web server can't connect to database**: 239 | - Ensure scheduler is running (it initializes the database) 240 | - Check file permissions on `airflow.db` 241 | 242 | ### Success Checklist ✅ 243 | 244 | - [ ] Airflow installed in virtual environment 245 | - [ ] Database initialized successfully 246 | - [ ] Admin user created 247 | - [ ] Web server running on port 8080 248 | - [ ] Scheduler running in separate terminal 249 | - [ ] Can access http://localhost:8080 250 | - [ ] Can login with admin credentials 251 | - [ ] Can see example DAGs in the interface 252 | - [ ] Both components showing healthy status in logs 253 | - [ ] Can navigate between different views (Tree, Graph, Code) 254 | - [ ] Screenshot saved showing successful multi-component setup 255 | 256 | ### Stopping Airflow Properly 257 | 258 | To stop Airflow cleanly: 259 | 1. **Stop the scheduler**: `Ctrl+C` in the scheduler terminal 260 | 2. **Stop the web server**: `Ctrl+C` in the web server terminal 261 | 3. 
**Deactivate virtual environment**: `deactivate` 262 | 263 | ### Next Steps Preview 264 | Tomorrow we'll create our first custom DAG and understand how to define tasks and dependencies using this multi-component setup! 265 | -------------------------------------------------------------------------------- /ETL/ETL-ELT.md: -------------------------------------------------------------------------------- 1 | # Theory: Introduction to ETL/ELT Workflows 2 | 3 | ## Overview 4 | 5 | ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two fundamental paradigms for data integration workflows. These methodologies define how data moves from source systems to target systems, determining when and where data transformations occur in the pipeline. 6 | 7 | ## ETL (Extract, Transform, Load) 8 | 9 | ### Definition 10 | ETL is a traditional data integration approach where data is extracted from source systems, transformed in a separate processing layer, and then loaded into the target system. 11 | 12 | ### Workflow Process 13 | 14 | #### 1. Extract 15 | - **Purpose**: Retrieve data from various source systems 16 | - **Sources**: Databases, APIs, flat files, web services, applications 17 | - **Methods**: 18 | - Full extraction (complete dataset) 19 | - Incremental extraction (only changed data) 20 | - Delta extraction (changes since last extraction) 21 | - **Challenges**: Handling different data formats, connection issues, rate limits 22 | 23 | #### 2. Transform 24 | - **Purpose**: Convert raw data into a format suitable for analysis 25 | - **Operations**: 26 | - **Data Cleaning**: Remove duplicates, handle null values, correct errors 27 | - **Data Standardization**: Unify formats, units, naming conventions 28 | - **Data Validation**: Ensure data quality and integrity 29 | - **Data Enrichment**: Add calculated fields, lookup values 30 | - **Data Aggregation**: Summarize data at different granularities 31 | - **Data Type Conversion**: Convert between different data types 32 | - **Business Logic Application**: Apply domain-specific rules 33 | 34 | #### 3. 
Load 35 | - **Purpose**: Insert transformed data into target system 36 | - **Loading Strategies**: 37 | - **Full Load**: Replace entire dataset 38 | - **Incremental Load**: Add only new or changed records 39 | - **Upsert**: Insert new records, update existing ones 40 | - **Target Systems**: Data warehouses, databases, data marts 41 | 42 | ### ETL Characteristics 43 | 44 | **Advantages:** 45 | - **Data Quality**: Ensures clean, validated data enters target system 46 | - **Performance**: Target system optimized for queries, not transformations 47 | - **Security**: Sensitive data can be masked/encrypted during transformation 48 | - **Compliance**: Easier to implement data governance rules 49 | - **Predictable Structure**: Target schema is well-defined and stable 50 | 51 | **Disadvantages:** 52 | - **Processing Time**: Sequential process can be time-consuming 53 | - **Resource Intensive**: Requires dedicated transformation infrastructure 54 | - **Less Flexibility**: Schema changes require pipeline modifications 55 | - **Data Freshness**: Batch processing introduces latency 56 | 57 | ### When to Use ETL 58 | - **Complex Transformations**: Heavy business logic or data cleansing required 59 | - **Limited Target Resources**: Target system has limited processing power 60 | - **Strict Data Quality**: High data quality standards must be enforced 61 | - **Regulatory Compliance**: Data governance and audit trails essential 62 | - **Predictable Workloads**: Batch processing acceptable for use case 63 | 64 | ## ELT (Extract, Load, Transform) 65 | 66 | ### Definition 67 | ELT is a modern approach where raw data is loaded directly into the target system, and transformations are performed within that system using its computational resources. 68 | 69 | ### Workflow Process 70 | 71 | #### 1. Extract 72 | - **Same as ETL**: Retrieve data from source systems 73 | - **Minimal Processing**: Little to no transformation during extraction 74 | - **Faster Extraction**: Reduced complexity in extraction phase 75 | 76 | #### 2. Load 77 | - **Raw Data Loading**: Data loaded in its original format 78 | - **Target Systems**: Usually data lakes or cloud data warehouses 79 | - **Schema-on-Write vs Schema-on-Read**: Often uses schema-on-read approach 80 | - **Staging Areas**: May use intermediate staging for organization 81 | 82 | #### 3. 
Transform 83 | - **In-Target Processing**: Transformations occur within target system 84 | - **On-Demand**: Transformations can be applied as needed 85 | - **Multiple Views**: Same raw data can be transformed differently for various use cases 86 | - **Leverages Target Power**: Uses computational resources of target system 87 | 88 | ### ELT Characteristics 89 | 90 | **Advantages:** 91 | - **Faster Loading**: Raw data loads quickly without transformation overhead 92 | - **Scalability**: Leverages powerful cloud computing resources 93 | - **Flexibility**: Multiple transformation views from same raw data 94 | - **Data Preservation**: Original data remains unchanged 95 | - **Real-time Potential**: Enables near real-time data availability 96 | - **Cost-Effective**: Cloud warehouses optimized for large-scale processing 97 | 98 | **Disadvantages:** 99 | - **Target Resource Usage**: Transformation workload on target system 100 | - **Data Quality Risks**: Raw data may contain errors or inconsistencies 101 | - **Security Concerns**: Sensitive data stored in raw format 102 | - **Complex Queries**: End users may need advanced SQL skills 103 | - **Storage Costs**: Raw data requires more storage space 104 | 105 | ### When to Use ELT 106 | - **Big Data Scenarios**: Large volumes of diverse data types 107 | - **Cloud-Native Architecture**: Using cloud data warehouses (Snowflake, BigQuery, Redshift) 108 | - **Agile Analytics**: Rapid development and changing requirements 109 | - **Real-time Insights**: Near real-time data processing needed 110 | - **Data Lake Architecture**: Storing raw data for future unknown use cases 111 | - **Sufficient Target Resources**: Powerful target systems available 112 | 113 | ## Comparison Matrix 114 | 115 | | Aspect | ETL | ELT | 116 | |--------|-----|-----| 117 | | **Processing Location** | External transformation engine | Within target system | 118 | | **Data Quality** | High (pre-loading validation) | Variable (post-loading validation) | 119 | | **Time to Insight** | Higher latency | Lower latency | 120 | | **Flexibility** | Lower (predefined transformations) | Higher (on-demand transformations) | 121 | | **Resource Requirements** | Dedicated transformation infrastructure | Powerful target system | 122 | | **Data Storage** | Only transformed data stored | Raw + transformed data stored | 123 | | **Complexity** | Higher upfront complexity | Lower initial complexity | 124 | | **Maintenance** | More complex schema change management | Easier to adapt to changes | 125 | | **Cost Model** | Infrastructure + processing costs | Storage + compute costs | 126 | 127 | ## Modern Hybrid Approaches 128 | 129 | ### ELT with Pre-processing 130 | - Light transformations during extraction (data type conversion, basic cleaning) 131 | - Bulk of transformation occurs in target system 132 | - Balances benefits of both approaches 133 | 134 | ### Lambda Architecture 135 | - Combines batch (ETL) and stream (ELT) processing 136 | - Handles both historical and real-time data 137 | - Provides comprehensive data processing solution 138 | 139 | ### Medallion Architecture 140 | - **Bronze Layer**: Raw data (ELT approach) 141 | - **Silver Layer**: Cleaned and conformed data (ETL transformations) 142 | - **Gold Layer**: Business-ready data (ETL aggregations) 143 | 144 | ## Technology Considerations 145 | 146 | ### ETL-Optimized Tools 147 | - **Traditional ETL**: Informatica, IBM DataStage, Microsoft SSIS 148 | - **Modern ETL**: Apache Airflow, Talend, Apache NiFi 149 | - **Cloud ETL**: AWS Glue, Azure Data 
Factory, Google Dataflow 150 | 151 | ### ELT-Optimized Platforms 152 | - **Cloud Warehouses**: Snowflake, Google BigQuery, Amazon Redshift 153 | - **Data Lakes**: Apache Spark, Amazon S3 with Athena 154 | - **Stream Processing**: Apache Kafka, Amazon Kinesis 155 | 156 | ### Programming Approaches 157 | - **ETL**: Python with Pandas, Java with Apache Beam 158 | - **ELT**: SQL-based transformations, dbt (data build tool) 159 | 160 | ## Best Practices 161 | 162 | ### For ETL 163 | 1. **Design for Reusability**: Create modular transformation components 164 | 2. **Implement Error Handling**: Robust exception management and logging 165 | 3. **Optimize Performance**: Parallel processing and efficient algorithms 166 | 4. **Document Transformations**: Clear documentation of business rules 167 | 5. **Test Thoroughly**: Unit tests for transformation logic 168 | 169 | ### For ELT 170 | 1. **Data Governance**: Implement data lineage and quality monitoring 171 | 2. **Storage Optimization**: Use appropriate file formats (Parquet, Delta Lake) 172 | 3. **Query Optimization**: Leverage target system's optimization features 173 | 4. **Security Implementation**: Apply row-level security and column masking 174 | 5. **Cost Management**: Monitor and optimize compute and storage costs 175 | 176 | ## Future Trends 177 | 178 | ### Real-time Processing 179 | - Stream processing becoming standard 180 | - Event-driven architectures gaining popularity 181 | - CDC (Change Data Capture) integration 182 | 183 | ### DataOps Integration 184 | - CI/CD for data pipelines 185 | - Automated testing and deployment 186 | - Infrastructure as Code 187 | 188 | ### AI-Enhanced Processing 189 | - Automated data profiling and mapping 190 | - Intelligent error detection and correction 191 | - ML-powered transformation suggestions 192 | 193 | ## Conclusion 194 | 195 | The choice between ETL and ELT depends on specific use cases, infrastructure, data volumes, and business requirements. Modern data architectures often employ hybrid approaches, leveraging the strengths of both paradigms. Understanding these workflows is fundamental to designing effective data integration solutions that meet organizational needs while maintaining data quality, performance, and scalability. 196 | 197 | Key decision factors include: 198 | - Data volume and velocity requirements 199 | - Available infrastructure and resources 200 | - Data quality and governance needs 201 | - Time-to-insight requirements 202 | - Budget and cost considerations 203 | - Team skills and capabilities 204 | 205 | Both ETL and ELT remain relevant in today's data landscape, and the most successful data teams understand when and how to apply each approach effectively. 206 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## **LuxDevHQ Data Engineering Course Outline** 2 | 3 | This comprehensive course spans **4 months** (16 weeks) and equips learners with expertise in Python, SQL, Azure, AWS, Apache Airflow, Kafka, Spark, and more. 4 | - **Learning Days**: Monday to Thursday (theory and practice). 5 | - **Friday**: Job shadowing or peer projects. 6 | - **Saturday**: Hands-on lab sessions and project-based learning. 7 | 8 | --- 9 | ## Table of Contents 10 | 11 | 1. [Week 1](#week-1-Onboarding-and-Environment-Setup) 12 | 2. [Week 2](#week-2-SQL-Essentials-for-Data-Engineering) 13 | 3. [Week 3](#week-3-Introduction-to-Data-Pipelines) 14 | 4. 
[Week 4](#week-4-Introduction-to-Apache-Airflow) 15 | 5. [Week 5](#week-5-Data-Warehousing-and-Data-Lakes) 16 | 6. [Week 6](#week-6-Data-Governance-and-Security) 17 | 7. [Week 7](#week-7-Real-Time-Data-Processing-with-Kafka) 18 | 8. [Week 8](#week-8-Batch-vs-Stream-Processing) 19 | 9. [Week 9](#week-9-Machine-Learning-Integration-in-Data-Pipelines) 20 | 10. [Week 10](#week-10-Spark-and-PySpark-for-Big-Data) 21 | 11. [Week 11](#week-11-Advanced-Apache-Airflow-Techniques) 22 | 12. [Week 12](#week-12-Data-Lakes-and-Delta-Lake) 23 | 13. [Week 13](#week-13-Batch-Data-Pipeline-Development) 24 | 14. [Week 14](#week-14-Real-Time-Data-Pipeline-Development) 25 | 15. [Week 15](#week-15-Final-Project-Integration) 26 | 16. [Week 16](#week-16-capstone-project-presentation) 27 | 28 | 29 | 30 | --- 31 | 32 | ### **Month 1: Foundations of Data Engineering** 33 | 34 | #### Week 1: Onboarding and Environment Setup 35 | - **Monday**: 36 | - Onboarding, course overview, career pathways, tools introduction. 37 | - **Tuesday**: 38 | - Introduction to cloud computing (Azure and AWS). 39 | - **Wednesday**: 40 | - Data governance, security, compliance, and access control. 41 | - **Thursday**: 42 | - Introduction to SQL for data engineering and PostgreSQL setup. 43 | - **Friday**: 44 | - **Peer Project**: Environment setup challenges. 45 | - **Saturday (Lab)**: 46 | - **Mini Project**: Build a basic pipeline with PostgreSQL and Azure Blob Storage. 47 | 48 | --- 49 | 50 | #### Week 2: SQL Essentials for Data Engineering 51 | - **Monday**: 52 | - Core SQL concepts (`SELECT`, `WHERE`, `JOIN`, `GROUP BY`). 53 | - **Tuesday**: 54 | - Advanced SQL techniques: recursive queries, window functions, views, stored procedures, subqueries, and CTEs. 55 | - **Wednesday**: 56 | - Query optimization and execution plans. 57 | - **Thursday**: 58 | - Data modeling: normalization, denormalization, and star schemas. 59 | - **Friday**: 60 | - **Job Shadowing**: Observe senior engineers writing and optimizing SQL queries. 61 | - **Saturday (Lab)**: 62 | - **Mini Project**: Create a star schema and analyze data using SQL. 63 | 64 | --- 65 | 66 | #### Week 3: Introduction to Data Pipelines 67 | - **Monday**: 68 | - Theory: Introduction to ETL/ELT workflows. 69 | - **Tuesday**: 70 | - Lab: Create a simple Python-based ETL pipeline for CSV data. 71 | - **Wednesday**: 72 | - Theory: Extract, transform, load (ETL) concepts and best practices. 73 | - **Thursday**: 74 | - Lab: Build a Python ETL pipeline for batch data processing. 75 | - **Friday**: 76 | - **Peer Project**: Collaborate to design a basic ETL workflow. 77 | - **Saturday (Lab)**: 78 | - **Mini Project**: Develop a simple ETL pipeline to process sales data. 79 | 80 | --- 81 | 82 | #### Week 4: Introduction to Apache Airflow 83 | - **Monday**: 84 | - Theory: Introduction to Apache Airflow, DAGs, and scheduling. 85 | - **Tuesday**: 86 | - Lab: Set up Apache Airflow and create a basic DAG. 87 | - **Wednesday**: 88 | - Theory: DAG best practices and scheduling in Airflow. 89 | - **Thursday**: 90 | - Lab: Integrate Airflow with PostgreSQL and Azure Blob Storage. 91 | - **Friday**: 92 | - **Job Shadowing**: Observe real-world Airflow pipelines. 93 | - **Saturday (Lab)**: 94 | - **Mini Project**: Automate an ETL pipeline with Airflow for batch data processing. 95 | 96 | --- 97 | 98 | ### **Month 2: Intermediate Tools and Concepts** 99 | 100 | #### Week 5: Data Warehousing and Data Lakes 101 | - **Monday**: 102 | - Theory: Introduction to data warehousing (OLAP vs. 
OLTP, partitioning, clustering). 103 | - **Tuesday**: 104 | - Lab: Work with Amazon Redshift and Snowflake for data warehousing. 105 | - **Wednesday**: 106 | - Theory: Data lakes and Lakehouse architecture. 107 | - **Thursday**: 108 | - Lab: Set up Delta Lake for raw and curated data. 109 | - **Friday**: 110 | - **Peer Project**: Implement a data warehouse model and data lake for sales data. 111 | - **Saturday (Lab)**: 112 | - **Mini Project**: Design and implement a basic Lakehouse architecture. 113 | 114 | --- 115 | 116 | #### Week 6: Data Governance and Security 117 | - **Monday**: 118 | - Theory: Data governance frameworks and data security principles. 119 | - **Tuesday**: 120 | - Lab: Use AWS Lake Formation for access control and security enforcement. 121 | - **Wednesday**: 122 | - Theory: Managing sensitive data and compliance (GDPR, HIPAA). 123 | - **Thursday**: 124 | - Lab: Implement security policies in S3 and Azure Blob Storage. 125 | - **Friday**: 126 | - **Job Shadowing**: Observe senior engineers applying governance policies. 127 | - **Saturday (Lab)**: 128 | - **Mini Project**: Secure data in the cloud using AWS and Azure. 129 | 130 | --- 131 | 132 | #### Week 7: Real-Time Data Processing with Kafka 133 | - **Monday**: 134 | - Theory: - [Introduction to Apache Kafka for real-time data streaming](/introduction-to-Kafka.md) 135 | - **Tuesday**: 136 | - Lab: [Set up a Kafka producer and consumer.](/Tuesday-Kafka-Lab.md) 137 | - **Wednesday**: 138 | - Theory: Kafka topics, partitions, and message brokers. 139 | - **Thursday**: 140 | - Lab: Integrate Kafka with PostgreSQL for real-time updates. 141 | - **Friday**: 142 | - **Peer Project**: Build a real-time Kafka pipeline for transactional data. 143 | - **Saturday (Lab)**: 144 | - **Mini Project**: Create a pipeline to stream e-commerce data with Kafka. 145 | 146 | [Apache Kafka 101](./Apache%20Kafka%20101%3A%20Apache%20Kafka%20for%20Data%20Engineering%20Guide.md) 147 | 148 | [Apache Kafka 102](/Apache%20Kafka%20102%3A%20Apache%20Kafka%20for%20Data%20Engineering%20Guide.md) 149 | 150 | 151 | 152 | --- 153 | 154 | #### Week 8: Batch vs. Stream Processing 155 | - **Monday**: 156 | - Theory: Introduction to batch vs. stream processing. 157 | - **Tuesday**: 158 | - Lab: Batch processing with PySpark. 159 | - **Wednesday**: 160 | - Theory: Combining batch and stream processing workflows. 161 | - **Thursday**: 162 | - Lab: Real-time processing with Apache Flink and Spark Streaming. 163 | - **Friday**: 164 | - **Job Shadowing**: Observe a real-time processing pipeline. 165 | - **Saturday (Lab)**: 166 | - **Mini Project**: Build a hybrid pipeline combining batch and real-time processing. 167 | 168 | --- 169 | 170 | ### **Month 3: Advanced Data Engineering** 171 | 172 | #### Week 9: Machine Learning Integration in Data Pipelines 173 | - **Monday**: 174 | - Theory: Overview of ML workflows in data engineering. 175 | - **Tuesday**: 176 | - Lab: Preprocess data for machine learning using Pandas and PySpark. 177 | - **Wednesday**: 178 | - Theory: Feature engineering and automated feature extraction. 179 | - **Thursday**: 180 | - Lab: Automate feature extraction using Apache Airflow. 181 | - **Friday**: 182 | - **Peer Project**: Build a simple pipeline that integrates ML models. 183 | - **Saturday (Lab)**: 184 | - **Mini Project**: Build an ML-powered recommendation system in a pipeline. 
185 | 186 | --- 187 | 188 | #### Week 10: Spark and PySpark for Big Data 189 | - **Monday**: 190 | - Theory: Introduction to Apache Spark for big data processing. 191 | - **Tuesday**: 192 | - Lab: Set up Spark and PySpark for data analysis. 193 | - **Wednesday**: 194 | - Theory: Spark RDDs, DataFrames, Performance Optimization and SQL. 195 | - **Thursday**: 196 | - Lab: Analyze large datasets using Spark SQL. 197 | - **Friday**: 198 | - **Peer Project**: Build a PySpark pipeline for large-scale data processing. 199 | - **Saturday (Lab)**: 200 | - **Mini Project**: Analyze big data sets with Spark and PySpark. 201 | 202 | --- 203 | 204 | #### Week 11: Advanced Apache Airflow Techniques 205 | - **Monday**: 206 | - Theory: Advanced Airflow features (XCom, task dependencies). 207 | - **Tuesday**: 208 | - Lab: Implement dynamic DAGs and task dependencies in Airflow. 209 | - **Wednesday**: 210 | - Theory: Airflow scheduling, monitoring, and error handling. 211 | - **Thursday**: 212 | - Lab: Create complex DAGs for multi-step ETL pipelines. 213 | - **Friday**: 214 | - **Job Shadowing**: Observe advanced Airflow pipeline implementations. 215 | - **Saturday (Lab)**: 216 | - **Mini Project**: Design an advanced Airflow DAG for complex data workflows. 217 | 218 | --- 219 | 220 | #### Week 12: Data Lakes and Delta Lake 221 | - **Monday**: 222 | - Theory: Data lakes, Lakehouses, and Delta Lake architecture. 223 | - **Tuesday**: 224 | - Lab: Set up Delta Lake on AWS for data storage and management. 225 | - **Wednesday**: 226 | - Theory: Managing schema evolution in Delta Lake. 227 | - **Thursday**: 228 | - Lab: Implement batch and real-time data loading to Delta Lake. 229 | - **Friday**: 230 | - **Peer Project**: Design a Lakehouse architecture for an e-commerce platform. 231 | - **Saturday (Lab)**: 232 | - **Mini Project**: Implement a scalable Delta Lake architecture. 233 | 234 | --- 235 | 236 | ### **Month 4: Capstone Projects** 237 | 238 | #### Week 13: Batch Data Pipeline Development 239 | - **Monday to Thursday**: 240 | - **Design and Implementation**: 241 | - Build an end-to-end batch data pipeline for e-commerce sales analytics. 242 | - **Tools**: PySpark, SQL, PostgreSQL, Airflow, S3. 243 | - **Friday**: 244 | - **Peer Review**: Present progress and receive feedback. 245 | - **Saturday (Lab)**: 246 | - **Project Milestone**: Finalize and present batch pipeline results. 247 | 248 | --- 249 | 250 | #### Week 14: Real-Time Data Pipeline Development 251 | - **Monday to Thursday**: 252 | - **Design and Implementation**: 253 | - Build an end-to-end real-time data pipeline for IoT sensor monitoring. 254 | - **Tools**: Kafka, Spark Streaming, Flink, S3. 255 | - **Friday**: 256 | - **Peer Review**: Present progress and receive feedback. 257 | - **Saturday (Lab)**: 258 | - **Project Milestone**: Finalize and present real-time pipeline results. 259 | 260 | --- 261 | 262 | #### Week 15: Final Project Integration 263 | - **Monday to Thursday**: 264 | - **Design and Implementation**: 265 | - Integrate both batch and real-time pipelines for a comprehensive end-to-end solution. 266 | - **Tools**: Kafka, PySpark, Airflow, Delta Lake, PostgreSQL, and S3. 267 | - **Friday**: 268 | - **Job Shadowing**: Observe senior engineers integrating complex pipelines. 269 | - **Saturday (Lab)**: 270 | - **Project Milestone**: Showcase integrated solution for review. 
271 | 272 | --- 273 | 274 | #### Week 16: Capstone Project Presentation 275 | - **Monday to Thursday**: 276 | - Final Presentation Preparation: 277 | - Polish, test, and document the final project. 278 | - **Friday**: 279 | - **Peer Review**: Present final projects to peers and receive feedback. 280 | - **Saturday (Lab)**: 281 | - **Capstone Presentation**: Showcase completed capstone projects to industry professionals and instructors. 282 | -------------------------------------------------------------------------------- /Apache Spark.md: -------------------------------------------------------------------------------- 1 | # Apache Spark & PySpark: Learning Guide 2 | 3 | ## Understanding Apache Spark 4 | 5 | Apache Spark is an open-source distributed computing framework designed for processing large datasets across clusters of computers. Spark maintains data in memory between operations, which significantly reduces the time spent reading from and writing to disk storage. This in-memory processing capability makes Spark substantially faster than traditional disk-based processing systems. 6 | 7 | The framework provides a unified computing engine that handles multiple types of data processing workloads including batch processing (handling large chunks of data at once), stream processing (handling data as it arrives continuously), machine learning, and graph computation. Spark can scale from running on your laptop to running across thousands of computers in a data center. 8 | 9 | ## Core Spark Architecture 10 | 11 | Understanding Spark's architecture helps you comprehend how your data processing actually happens behind the scenes. The Spark runtime consists of several key components that work together to execute distributed computations. 12 | 13 | **Driver Program**: The main application process that runs your Spark code. The driver creates the SparkContext, converts your program into tasks, and coordinates the execution across the cluster. 14 | 15 | **Cluster Manager**: The external service responsible for acquiring resources and allocating them to Spark applications. Common cluster managers include YARN, Mesos, and Kubernetes. 16 | 17 | **Executors**: Worker processes that run on cluster nodes. Executors execute tasks assigned by the driver and store data for caching operations. 18 | 19 | **Tasks**: Individual units of work that executors perform on data partitions. 20 | 21 | **Partitions**: Logical divisions of your data that enable parallel processing across multiple executor cores. 22 | 23 | ## Essential Spark Concepts 24 | 25 | **Transformations** are operations that create new datasets from existing ones. The key insight here is that transformations follow lazy evaluation, meaning Spark doesn't actually do the work immediately. Instead, it builds up a plan of what you want to do. 26 | 27 | Common transformations include filtering data (keeping only rows that meet certain criteria), mapping data (applying a function to transform each row), grouping data by certain columns, and joining different datasets together. 28 | 29 | **Actions** are operations that actually trigger Spark to execute all those planned transformations and give you results. Actions either return results to your driver program or save data to external storage. 30 | 31 | **RDDs (Resilient Distributed Datasets)** represent Spark's most fundamental way of thinking about data. RDDs are immutable (they never change once created), distributed across your cluster, and can be processed in parallel. 
They maintain lineage information, which means Spark remembers how each RDD was created so it can recreate it if something goes wrong. 32 | 33 | **DataFrames** are a higher-level way to work with data that's more familiar if you've used databases or tools like Excel. DataFrames organize data into named columns and provide powerful optimization through Spark's Catalyst optimizer, which automatically makes your queries run faster. 34 | 35 | **Spark SQL** lets you write familiar SQL queries against your DataFrames. This means you can use SELECT, WHERE, GROUP BY, and other SQL statements you might already know, while still getting all the benefits of distributed processing. 36 | 37 | 38 | 39 | ## Practical Example: Understanding the Basics 40 | 41 | Let's walk through a comprehensive example that demonstrates how these concepts work together. I'll explain each part as we go, so you can see both what the code does and why we're doing it that way. 42 | 43 | ```python 44 | from pyspark.sql import SparkSession 45 | from pyspark.sql.functions import col, avg, count, max, min, sum as spark_sum 46 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType 47 | 48 | # Initialize Spark session - this creates your SparkContext 49 | # Think of this as starting up your distributed computing engine 50 | spark = SparkSession.builder \ 51 | .appName("SparkFundamentals") \ 52 | .config("spark.sql.adaptive.enabled", "true") \ 53 | .getOrCreate() 54 | ``` 55 | 56 | This first section creates our Spark session, which is our entry point to all Spark functionality. The appName helps you identify your application when multiple Spark jobs are running. The config setting enables adaptive query execution, which helps Spark automatically optimize your queries as they run. 57 | 58 | ```python 59 | # Create sample data representing employee records 60 | # In real scenarios, this data would come from files, databases, or APIs 61 | employee_data = [ 62 | ("E001", "Alice Johnson", "Engineering", "Senior", 85000, 5, 92.5, "2019-03-15"), 63 | ("E002", "Bob Smith", "Marketing", "Manager", 72000, 3, 88.2, "2021-07-20"), 64 | ("E003", "Carol Davis", "Engineering", "Junior", 65000, 2, 85.7, "2022-01-10"), 65 | ("E004", "David Wilson", "Sales", "Senior", 78000, 4, 90.1, "2020-05-18"), 66 | ("E005", "Eva Brown", "Engineering", "Manager", 95000, 6, 94.3, "2018-11-22"), 67 | ("E006", "Frank Miller", "Marketing", "Junior", 58000, 1, 82.4, "2023-02-14"), 68 | ("E007", "Grace Lee", "Sales", "Senior", 82000, 4, 91.8, "2020-08-30"), 69 | ("E008", "Henry Taylor", "Engineering", "Senior", 88000, 5, 93.1, "2019-06-12"), 70 | ("E009", "Iris Chen", "Marketing", "Manager", 75000, 3, 89.6, "2021-10-05"), 71 | ("E010", "Jack Anderson", "Sales", "Junior", 62000, 2, 84.9, "2022-09-28") 72 | ] 73 | 74 | # Define schema explicitly for data quality and performance 75 | # This tells Spark exactly what type of data to expect in each column 76 | employee_schema = StructType([ 77 | StructField("employee_id", StringType(), False), # False means this field cannot be null 78 | StructField("name", StringType(), False), 79 | StructField("department", StringType(), False), 80 | StructField("level", StringType(), False), 81 | StructField("salary", IntegerType(), False), 82 | StructField("years_experience", IntegerType(), False), 83 | StructField("performance_score", FloatType(), False), 84 | StructField("hire_date", StringType(), False) 85 | ]) 86 | ``` 87 | 88 | Here we're creating sample data and defining its structure. 
In real applications, your data would typically come from files, databases, or streaming sources. The schema definition is important because it tells Spark exactly what to expect, which enables better performance and catches data quality issues early. 89 | 90 | ```python 91 | # Create DataFrame - this represents your distributed dataset 92 | # Even though our data is small, Spark treats it as if it could be distributed across many machines 93 | df = spark.createDataFrame(employee_data, employee_schema) 94 | 95 | # Basic transformations and actions 96 | print("Dataset Overview:") 97 | df.show() # Action: displays data - this actually executes and shows results 98 | print(f"Total employees: {df.count()}") # Action: counts rows - another execution 99 | ``` 100 | 101 | This is where we create our DataFrame, which is Spark's way of representing structured data. Even though our example uses small data that fits in memory, Spark handles it the same way it would handle terabytes of data spread across hundreds of machines. 102 | 103 | The show() and count() operations are actions, which means they trigger Spark to actually process the data and return results. Up until this point, Spark was just planning what to do. 104 | 105 | ```python 106 | # Transformation: filter high performers 107 | # This creates a new DataFrame but doesn't execute yet (lazy evaluation) 108 | high_performers = df.filter(col("performance_score") > 90.0) 109 | print(f"High performers (>90 score): {high_performers.count()}") # Now it executes 110 | ``` 111 | 112 | This demonstrates the difference between transformations and actions. The filter operation creates a new DataFrame that represents "employees with performance scores above 90," but Spark doesn't actually do the filtering until we call count(), which is an action. 113 | 114 | ```python 115 | # Transformation and action: departmental analysis 116 | # This shows how to perform aggregations - very common in data processing 117 | dept_analysis = df.groupBy("department").agg( 118 | avg("salary").alias("avg_salary"), # Calculate average salary per department 119 | count("*").alias("employee_count"), # Count employees per department 120 | avg("performance_score").alias("avg_performance"), # Average performance per department 121 | max("years_experience").alias("max_experience") # Maximum experience per department 122 | ) 123 | 124 | print("\nDepartmental Analysis:") 125 | dept_analysis.show() # Action: triggers execution of the entire aggregation 126 | ``` 127 | 128 | This section shows aggregation, which is one of the most common patterns in data processing. We're grouping employees by department and calculating various statistics for each group. The alias() method gives friendly names to our calculated columns. 129 | 130 | ```python 131 | # Using Spark SQL - same functionality, different syntax 132 | # Some people prefer SQL syntax for complex queries 133 | df.createOrReplaceTempView("employees") # Creates a temporary SQL table 134 | 135 | print("\nSenior Employee Analysis (SQL):") 136 | spark.sql(""" 137 | SELECT department, 138 | COUNT(*) as senior_count, 139 | AVG(salary) as avg_senior_salary, 140 | AVG(performance_score) as avg_senior_performance 141 | FROM employees 142 | WHERE level = 'Senior' 143 | GROUP BY department 144 | ORDER BY avg_senior_salary DESC 145 | """).show() 146 | ``` 147 | 148 | This demonstrates that you can use SQL syntax to accomplish the same data processing tasks. 
Some people find SQL more intuitive for complex queries, especially when joining multiple tables or doing complex filtering and aggregation. 149 | 150 | ```python 151 | # Demonstrate caching for performance 152 | # This tells Spark to keep this DataFrame in memory for faster access 153 | df.cache() # Keeps data in memory for faster subsequent operations 154 | ``` 155 | 156 | Caching is a performance optimization technique. When you cache a DataFrame, Spark stores it in memory across your cluster, so subsequent operations on that DataFrame don't need to recompute it from the original data source. 157 | 158 | ## Understanding the Concepts in Action 159 | 160 | Let's trace through what happens when you run this code to understand how the concepts work together: 161 | 162 | When you create the SparkSession, you're establishing your connection to Spark's distributed computing capabilities. Even if you're running on a single machine, Spark still uses the same distributed architecture internally. 163 | 164 | When you create the DataFrame, Spark doesn't immediately load or process the data. Instead, it creates a logical representation of what the data looks like and how it's structured. This is part of Spark's lazy evaluation strategy. 165 | 166 | When you call transformations like filter() or groupBy(), Spark adds these operations to its execution plan but still doesn't do any actual work. It's building up a recipe for how to process your data when the time comes. 167 | 168 | When you call an action like show() or count(), Spark finally executes the entire chain of transformations. It looks at all the operations you've requested, optimizes the execution plan, and then processes the data across your cluster. 169 | 170 | The caching operation tells Spark to store the results in memory after the first computation, so if you perform additional operations on the same DataFrame, it can reuse the cached data instead of recomputing everything from scratch. 171 | 172 | 173 | 174 | --- 175 | 176 | # Weather Data ETL Assignment 177 | 178 | ## Assignment Overview 179 | 180 | Build a data pipeline that extracts weather data from the OpenWeatherMap API, processes it using Apache Spark, and visualizes the results in Grafana. 181 | 182 | ## Requirements 183 | 184 | Extract weather data for at least 10 cities using the OpenWeatherMap API. Select any 10 columns from the API response that you find interesting or relevant. 185 | 186 | Transform the data using PySpark to prepare it for visualization. Apply data cleaning, type conversions, or calculations as needed. 187 | 188 | Store the processed data in any format that allows you to visualize it in Grafana. 189 | 190 | Create visualizations in Grafana that demonstrate your data processing results. Include screenshots of your panels. 191 | 192 | 193 | 194 | ## Deliverables 195 | 196 | 1. **GitHub Repository**: Create a public repository containing your Python scripts and any configuration files 197 | 2. 
**Technical Article**: Write an article on Dev.to or Medium explaining your project, including: 198 | - Overview of your approach 199 | - Code explanations 200 | - Screenshots of your Grafana panels 201 | - Any challenges you encountered and how you solved them 202 | -------------------------------------------------------------------------------- /GDPR & HIPAA Compliance Guide.md: -------------------------------------------------------------------------------- 1 | # GDPR & HIPAA Compliance Guide 2 | 3 | ## Overview 4 | 5 | ### Learning Objectives 6 | By the end of this study session, you should be able to: 7 | - Define GDPR and HIPAA and explain their purposes 8 | - Compare and contrast the two regulatory frameworks 9 | - Identify which regulation applies to different scenarios 10 | - Explain key compliance requirements for both frameworks 11 | - Describe implementation strategies and best practices 12 | 13 | --- 14 | 15 | ## Quick Reference Cards 16 | 17 | ### GDPR at a Glance 18 | - **Full Name:** General Data Protection Regulation 19 | - **Effective Date:** May 25, 2018 20 | - **Jurisdiction:** EU + anywhere processing EU citizen data 21 | - **Scope:** ALL personal data, ALL industries 22 | - **Max Penalty:** €20M or 4% global revenue 23 | - **Key Concept:** "Privacy by Design" 24 | 25 | ### HIPAA at a Glance 26 | - **Full Name:** Health Insurance Portability and Accountability Act 27 | - **Enacted:** 1996 28 | - **Jurisdiction:** United States only 29 | - **Scope:** Healthcare data (PHI) only 30 | - **Max Penalty:** $50,000 per violation 31 | - **Key Concept:** "Minimum Necessary Rule" 32 | 33 | --- 34 | 35 | ## Core Concepts & Definitions 36 | 37 | ### GDPR Key Terms 38 | | Term | Definition | Example | 39 | |------|------------|---------| 40 | | **Personal Data** | Any info relating to identifiable person | Name, email, IP address, location | 41 | | **Data Subject** | The individual whose data is processed | Patient, customer, employee | 42 | | **Data Controller** | Determines purposes/means of processing | Hospital, company, organization | 43 | | **Data Processor** | Processes data on behalf of controller | Cloud provider, software vendor | 44 | | **Special Category Data** | Sensitive personal data requiring extra protection | Health, biometric, genetic data | 45 | 46 | ### HIPAA Key Terms 47 | | Term | Definition | Example | 48 | |------|------------|---------| 49 | | **PHI** | Protected Health Information | Medical records, billing info | 50 | | **Covered Entity** | Must comply with HIPAA | Hospitals, doctors, health plans | 51 | | **Business Associate** | Works with covered entities, handles PHI | IT vendors, billing companies | 52 | | **ePHI** | Electronic Protected Health Information | Digital medical records | 53 | | **TPO** | Treatment, Payment, Operations | Core healthcare functions | 54 | 55 | --- 56 | 57 | ## Detailed Analysis 58 | 59 | ### Section 1: GDPR Deep Dive 60 | 61 | #### The 6 Lawful Bases for Processing (MEMORIZE!) 62 | 1. **Consent** - Clear, specific, informed agreement 63 | 2. **Contract** - Necessary for contract performance 64 | 3. **Legal Obligation** - Required by law 65 | 4. **Vital Interests** - Life or death situations 66 | 5. **Public Task** - Official/governmental functions 67 | 6. 
**Legitimate Interests** - Balancing test with individual rights 68 | 69 | #### GDPR Rights (The "Rights Menu") 70 | - **Right to be Informed** - Know what data is collected 71 | - **Right of Access** - See what data is held 72 | - **Right to Rectification** - Correct inaccurate data 73 | - **Right to Erasure** - "Right to be forgotten" 74 | - **Right to Restrict Processing** - Limit how data is used 75 | - **Right to Data Portability** - Move data between services 76 | - **Right to Object** - Say no to processing 77 | - **Rights Related to Automated Decision-Making** - Human review of AI decisions 78 | 79 | #### DPO Requirements (When Mandatory) 80 | - Public authorities (always) 81 | - Large-scale systematic monitoring 82 | - Large-scale special category data processing 83 | 84 | ### Section 2: HIPAA Deep Dive 85 | 86 | #### The 3 HIPAA Rules 87 | 1. **Privacy Rule** - Who can see PHI and when 88 | 2. **Security Rule** - How to protect ePHI technically 89 | 3. **Breach Notification Rule** - What to do when things go wrong 90 | 91 | #### HIPAA Safeguards (The Security Triangle) 92 | 1. **Administrative Safeguards** 93 | - Assign security officer 94 | - Create policies/procedures 95 | - Train workforce 96 | - Control access 97 | 98 | 2. **Physical Safeguards** 99 | - Lock facilities 100 | - Control workstation access 101 | - Secure devices/media 102 | 103 | 3. **Technical Safeguards** 104 | - Access controls 105 | - Audit controls 106 | - Data integrity 107 | - Transmission security 108 | 109 | --- 110 | 111 | ## Instructional Notes & Discussion Points 112 | 113 | ### Opening Discussion Questions 114 | 1. "Why do we need data protection laws?" 115 | - *Lead students to discuss privacy as fundamental right* 116 | 2. "What happens when your medical records are leaked vs. your shopping preferences?" 117 | - *Highlight different types of harm from data breaches* 118 | 119 | ### Interactive Teaching Activities 120 | 121 | #### Activity 1: Jurisdiction Quiz 122 | Present scenarios, students identify GDPR/HIPAA/Both/Neither: 123 | - US hospital treating EU tourist *(Both)* 124 | - EU company with US employees *(GDPR only)* 125 | - US pharmacy chain *(HIPAA only)* 126 | - Australian company, no EU/US connections *(Neither)* 127 | 128 | #### Activity 2: Data Classification Game 129 | Show different data types, students categorize: 130 | - Email address *(GDPR: Personal Data)* 131 | - X-ray image with patient name *(Both: Personal Data + PHI)* 132 | - Anonymous survey results *(Neither)* 133 | - Fitness tracker data *(GDPR: Personal Data)* 134 | 135 | ### Common Student Misconceptions 136 | ❌ **"GDPR only applies to EU companies"** 137 | ✅ **Correct:** Applies to ANY company processing EU citizen data 138 | 139 | ❌ **"HIPAA applies to all health data"** 140 | ✅ **Correct:** Only applies to covered entities and business associates 141 | 142 | ❌ **"Consent is always required"** 143 | ✅ **Correct:** GDPR has 6 legal bases; HIPAA allows TPO without consent 144 | 145 | --- 146 | 147 | ## Practice Exercises & Questions 148 | 149 | ### Quick Quiz Questions 150 | 151 | #### Multiple Choice 152 | 1. **GDPR applies to:** 153 | a) Only EU companies 154 | b) Any company processing EU citizen data 155 | c) Only healthcare companies 156 | d) Only large corporations 157 | *Answer: b* 158 | 159 | 2. 
**Under HIPAA, PHI can be shared without authorization for:** 160 | a) Marketing purposes 161 | b) Research studies 162 | c) Treatment, payment, operations 163 | d) Employee background checks 164 | *Answer: c* 165 | 166 | 3. **Maximum GDPR fine is:** 167 | a) €10 million 168 | b) €20 million or 4% global revenue 169 | c) €50 million 170 | d) $50,000 per violation 171 | *Answer: b* 172 | 173 | #### True/False 174 | - GDPR requires 72-hour breach notification *(True)* 175 | - HIPAA requires Data Protection Officer *(False - Security Officer)* 176 | - Both regulations require encryption *(True)* 177 | - GDPR allows indefinite data storage *(False)* 178 | 179 | ### Case Study Practice 180 | 181 | #### Case 1: The International Hospital 182 | **Scenario:** US hospital chain opens branch in Germany, treats both US and EU patients, uses cloud storage in Canada. 183 | 184 | **Questions:** 185 | 1. Which regulations apply? 186 | 2. What are the main compliance challenges? 187 | 3. How should they handle data transfers? 188 | 189 | #### Case 2: The Health App 190 | **Scenario:** Startup creates fitness app used by EU citizens, partners with US healthcare providers, stores data on AWS. 191 | 192 | **Questions:** 193 | 1. What type of data are they handling? 194 | 2. What legal basis could they use under GDPR? 195 | 3. Do they need a DPO? 196 | 197 | --- 198 | 199 | ## Exam Preparation 200 | 201 | ### Key Facts to Memorize 202 | 203 | #### GDPR Numbers 204 | - **72 hours** - breach notification to authority 205 | - **30 days** - respond to data subject requests 206 | - **€20M or 4%** - maximum fine 207 | - **May 25, 2018** - effective date 208 | 209 | #### HIPAA Numbers 210 | - **1996** - year enacted 211 | - **60 days** - breach notification to individuals 212 | - **500 individuals** - threshold for immediate HHS notification 213 | - **$50,000** - maximum penalty per violation 214 | 215 | ### Memory Techniques 216 | 217 | #### GDPR Rights Acronym: "I AREPORT" 218 | - **I**nformed 219 | - **A**ccess 220 | - **R**ectification 221 | - **E**rasure 222 | - **R**estrict processing 223 | - **P**ortability 224 | - **O**bject 225 | - **R**elated to automated decision-making 226 | - **T**ransparency (bonus) 227 | 228 | #### HIPAA Safeguards: "APT" 229 | - **A**dministrative 230 | - **P**hysical 231 | - **T**echnical 232 | 233 | --- 234 | 235 | ## Comparison Tables for Quick Review 236 | 237 | ### Similarities & Differences Matrix 238 | 239 | | Aspect | GDPR | HIPAA | Same/Different | 240 | |--------|------|-------|----------------| 241 | | Geographic Scope | Global (EU data) | US only | Different | 242 | | Industry Scope | All industries | Healthcare only | Different | 243 | | Requires Encryption | Yes | Yes | Same | 244 | | Requires DPO/Security Officer | Yes (DPO) | Yes (Security Officer) | Same | 245 | | Breach Notification Timeline | 72 hours | 60 days | Different | 246 | | Right to Delete Data | Yes (Right to Erasure) | No (permanent records) | Different | 247 | | Consent Requirements | Strict | Flexible for TPO | Different | 248 | 249 | ### Penalty Comparison 250 | 251 | | Violation Level | GDPR | HIPAA | 252 | |----------------|------|-------| 253 | | **Minor** | Warning or €10M/2% | $100-$50,000 | 254 | | **Major** | €20M/4% global revenue | Up to $1.5M annually | 255 | | **Criminal** | Varies by country | Up to $250K + 10 years prison | 256 | 257 | --- 258 | 259 | ## Best Practices for Implementation 260 | 261 | ### For Instructors 262 | 263 | #### Making It Relevant 264 | - Use current breach 
examples (Equifax, Anthem, etc.) 265 | - Discuss social media privacy settings 266 | - Connect to students' personal experiences with healthcare 267 | 268 | #### Common Teaching Pitfalls to Avoid 269 | - Don't get lost in legal details - focus on practical application 270 | - Avoid presenting as "US vs EU" - many companies need both 271 | - Don't oversimplify consent - it's more complex than "just ask permission" 272 | 273 | #### Assessment Ideas 274 | - **Case study analysis** - real-world application 275 | - **Compliance checklist creation** - practical skills 276 | - **Risk scenario evaluation** - critical thinking 277 | - **Policy writing exercise** - hands-on experience 278 | 279 | ### Study Group Activities 280 | 1. **Mock DPA Investigation** - role-play compliance audit 281 | 2. **Breach Response Simulation** - practice incident response 282 | 3. **Privacy Notice Comparison** - analyze real company notices 283 | 4. **Compliance Cost Calculation** - estimate implementation costs 284 | 285 | --- 286 | 287 | ## Additional Resources 288 | 289 | ### Essential Reading 290 | - GDPR Official Text (Articles 5, 6, 7, 12-22, 25, 32-34) 291 | - HIPAA Privacy Rule Summary 292 | - ICO (UK) Guidance Documents 293 | - HHS HIPAA Security Rule Guidance 294 | 295 | ### Recommended Cases to Study 296 | - **Schrems II** (EU-US data transfers) 297 | - **Google Spain** (Right to be forgotten) 298 | - **Anthem Breach** (largest healthcare breach) 299 | - **Facebook-Cambridge Analytica** (consent and data sharing) 300 | 301 | ### Online Tools 302 | - ICO Self-Assessment Tool 303 | - HHS Security Risk Assessment Tool 304 | - GDPR Compliance Checkers 305 | - Breach Cost Calculators 306 | 307 | --- 308 | 309 | ## Final Assessment Checklist 310 | 311 | ### Before the Exam, Can You: 312 | - [ ] Explain when GDPR vs HIPAA applies? 313 | - [ ] List all 6 GDPR lawful bases? 314 | - [ ] Name all GDPR rights? 315 | - [ ] Describe the 3 HIPAA rules? 316 | - [ ] Compare breach notification requirements? 317 | - [ ] Explain DPO vs Security Officer roles? 318 | - [ ] Calculate potential penalty amounts? 319 | - [ ] Identify required security safeguards? 320 | - [ ] Distinguish between personal data and PHI? 321 | - [ ] Apply regulations to real-world scenarios? 322 | 323 | ### Red Flag Concepts (Review if Unclear) 324 | - Cross-border data transfers 325 | - Legitimate interests balancing test 326 | - Business associate agreements 327 | - Data processing vs data controlling 328 | - Special category data protections 329 | - Minimum necessary rule application 330 | 331 | --- 332 | 333 | ## Instructional Script Snippets 334 | 335 | ### Opening Hook 336 | *"Imagine your medical records, including mental health visits, appear in a Google search of your name. Or your location data shows you visiting a cancer clinic every Tuesday. This isn't science fiction - it's why we need GDPR and HIPAA."* 337 | 338 | ### Transition Between Topics 339 | *"Now that we understand what GDPR protects - all personal data - let's look at HIPAA's more focused approach to healthcare information..."* 340 | 341 | ### Concept Reinforcement 342 | *"Remember: GDPR is the speed limit everywhere you drive with EU citizens as passengers. HIPAA is the special rules only in the hospital parking lot."* 343 | 344 | ### Closing Summary 345 | *"Both regulations share the same goal: protecting people's most sensitive information. The difference is scope - GDPR casts a wide net globally, HIPAA goes deep in US healthcare. 
Master both, and you'll understand the future of privacy law."* 346 | -------------------------------------------------------------------------------- /Change Data Capture.md: -------------------------------------------------------------------------------- 1 | # Change Data Capture (CDC) Learning Guide 2 | 3 | ## What is Change Data Capture ? 4 | 5 | Change Data Capture is a powerful technique that tracks changes—specifically inserts, updates, and deletes—in a source database and streams them to a target system in real-time or near real-time. Think of CDC as a vigilant observer that watches your database and immediately reports any changes to other systems that need to stay synchronized. 6 | 7 | This approach ensures data consistency across different systems like data warehouses, caches, or analytics platforms without the need to process the entire dataset repeatedly. CDC serves as the backbone for data integration, real-time analytics, and maintaining up-to-date information across distributed systems. 8 | 9 | ## Why Use CDC? 10 | 11 | Understanding the benefits of CDC helps explain why it has become essential in modern data architectures: 12 | 13 | **Real-Time Data Synchronization**: Unlike traditional batch processing that updates target systems at scheduled intervals (perhaps once a day or hour), CDC updates target systems instantly as changes occur. This means your analytics dashboard can reflect customer purchases within seconds rather than waiting for the next batch job. 14 | 15 | **Exceptional Efficiency**: CDC processes only the data that has actually changed, dramatically reducing resource usage compared to full dataset transfers. Instead of copying an entire million-row table every hour, CDC might only transfer the dozen rows that actually changed. 16 | 17 | **Guaranteed Consistency**: CDC ensures that downstream systems accurately reflect changes in the source database. When a customer updates their address in your main application, that change propagates reliably to your data warehouse, recommendation engine, and reporting systems. 18 | 19 | ## How Does CDC Work? 20 | 21 | The magic of CDC lies in its ability to capture changes from a source database's write-ahead log (WAL), which is essentially the database's diary of all modifications. Every database maintains this log for recovery purposes, and CDC leverages this existing infrastructure. 22 | 23 | Here's how the process flows: when you make a change to your PostgreSQL database, that change first gets written to the WAL before being applied to the actual data files. CDC tools read these WAL entries and convert them into events that can be streamed to target systems. For example, tools like Debezium read these logs and stream changes to Apache Kafka, which then delivers them to targets such as Cassandra. 24 | 25 | ## Debezium Connector (Source) 26 | 27 | Debezium represents one of the most popular open-source platforms for CDC. It captures changes from various databases including PostgreSQL, MySQL, and Oracle, then streams these changes to Apache Kafka. Think of Debezium as a translator that speaks both "database language" and "streaming language." 28 | 29 | ### How Debezium Works 30 | 31 | Debezium connectors act as careful monitors of a database's WAL, detecting every change that occurs. Each detected change gets converted into a structured event and sent to a designated Kafka topic. 
This process happens with remarkable precision: 32 | 33 | When you insert a new row into a PostgreSQL table, Debezium generates an "INSERT" event containing all the new data. For updates, Debezium creates both "before" and "after" events, showing you exactly what changed from the old values to the new ones. Delete operations generate "DELETE" events that capture what was removed. 34 | 35 | ### Simple Example: Debezium with PostgreSQL 36 | 37 | Let's walk through setting up Debezium with PostgreSQL using a practical example. Imagine you have a PostgreSQL table called `users` with columns for id, name, and email. 38 | 39 | #### Step 1: Enable WAL in PostgreSQL 40 | 41 | First, you need to configure PostgreSQL to use logical replication, which Debezium requires to read changes: 42 | 43 | ```bash 44 | # Edit the PostgreSQL configuration file 45 | sudo nano /etc/postgresql/14/main/postgresql.conf 46 | ``` 47 | 48 | Update these settings in the configuration file: 49 | 50 | ``` 51 | wal_level = logical 52 | max_wal_senders = 1 53 | max_replication_slots = 1 54 | ``` 55 | 56 | After making these changes, restart PostgreSQL to apply them: 57 | 58 | ```bash 59 | sudo systemctl restart postgresql 60 | ``` 61 | 62 | Next, grant the necessary replication permissions to your database user: 63 | 64 | ```bash 65 | psql -U postgres -c "ALTER USER myuser WITH REPLICATION;" 66 | ``` 67 | 68 | #### Step 2: Set Up Kafka and Debezium 69 | 70 | Now you'll set up the streaming infrastructure. Download and extract Apache Kafka: 71 | 72 | ```bash 73 | wget https://downloads.apache.org/kafka/3.6.0/kafka_2.13-3.6.0.tgz 74 | tar -xzf kafka_2.13-3.6.0.tgz 75 | cd kafka_2.13-3.6.0 76 | ``` 77 | 78 | Start Zookeeper and Kafka in separate terminals. Zookeeper manages Kafka's configuration, while Kafka handles the actual message streaming: 79 | 80 | ```bash 81 | # Terminal 1: Start Zookeeper 82 | bin/zookeeper-server-start.sh config/zookeeper.properties 83 | 84 | # Terminal 2: Start Kafka 85 | bin/kafka-server-start.sh config/server.properties 86 | ``` 87 | 88 | Download and set up the Debezium PostgreSQL connector: 89 | 90 | ```bash 91 | mkdir -p /path/to/kafka/plugins 92 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.7.0.Final/debezium-connector-postgres-2.7.0.Final-plugin.tar.gz 93 | tar -xzf debezium-connector-postgres-2.7.0.Final-plugin.tar.gz -C /path/to/kafka/plugins 94 | ``` 95 | 96 | Configure Kafka Connect to recognize your Debezium plugin: 97 | 98 | ```bash 99 | nano config/connect-distributed.properties 100 | ``` 101 | 102 | Add this line to tell Kafka Connect where to find plugins: 103 | 104 | ``` 105 | plugin.path=/path/to/kafka/plugins 106 | ``` 107 | 108 | Start Kafka Connect in distributed mode: 109 | 110 | ```bash 111 | bin/connect-distributed.sh config/connect-distributed.properties 112 | ``` 113 | 114 | Create a configuration file for your Debezium connector. 
This JSON file tells Debezium exactly how to connect to your PostgreSQL database and which tables to monitor (note that Debezium 2.x connectors, including the 2.7.0 release used here, expect `topic.prefix` in place of the older `database.server.name` property): 115 | 116 | ```json 117 | { 118 | "name": "postgres-connector", 119 | "config": { 120 | "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 121 | "database.hostname": "localhost", 122 | "database.port": "5432", 123 | "database.user": "myuser", 124 | "database.password": "mypassword", 125 | "database.dbname": "mydb", 126 | "topic.prefix": "server1", 127 | "table.include.list": "public.users", 128 | "plugin.name": "pgoutput" 129 | } 130 | } 131 | ``` 132 | 133 | Deploy the connector to start monitoring your database: 134 | 135 | ```bash 136 | curl -X POST -H "Content-Type: application/json" --data @postgres-connector.json http://localhost:8083/connectors 137 | ``` 138 | 139 | #### Step 3: Observe Changes in Action 140 | 141 | Now for the exciting part—watching CDC work in real-time. Create your users table and insert some data: 142 | 143 | ```sql 144 | psql -U myuser -d mydb -c "CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT, email TEXT);" 145 | psql -U myuser -d mydb -c "INSERT INTO users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');" 146 | ``` 147 | 148 | Debezium captures this insert operation and creates an event in a Kafka topic named `server1.public.users`. You can view this event using Kafka's console consumer: 149 | 150 | ```bash 151 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic server1.public.users --from-beginning 152 | ``` 153 | 154 | The event structure looks like this: 155 | 156 | ```json 157 | { 158 | "schema": { ... }, 159 | "payload": { 160 | "before": null, 161 | "after": { "id": 1, "name": "Alice", "email": "alice@example.com" }, 162 | "op": "c", // 'c' indicates create (insert) 163 | "ts_ms": 1697051234567 164 | } 165 | } 166 | ``` 167 | 168 | Notice how the event includes both "before" and "after" states. For an insert, "before" is null since the row didn't exist previously. The "op" field indicates the operation type, and "ts_ms" provides a timestamp. 169 | 170 | ## Cassandra Sink Connector 171 | 172 | The Cassandra sink connector completes the CDC pipeline by reading data from Kafka topics and writing it to a Cassandra database. This connector excels at storing CDC events in a scalable, distributed NoSQL environment. 173 | 174 | ### How Cassandra Sink Connector Works 175 | 176 | The connector acts as a bridge between Kafka and Cassandra, reading events from Kafka topics and translating them into appropriate Cassandra operations. It handles the mapping between Kafka record structures and Cassandra table schemas, managing inserts, updates, and deletes automatically. 177 | 178 | ### Simple Example: Cassandra Sink Connector 179 | 180 | Let's continue our example by setting up Cassandra to receive the user events from our Debezium setup. 
181 | 182 | #### Step 1: Set Up Cassandra 183 | 184 | Install and start Cassandra on Ubuntu: 185 | 186 | ```bash 187 | sudo apt update 188 | sudo apt install cassandra 189 | sudo systemctl start cassandra 190 | ``` 191 | 192 | Verify that Cassandra is running properly: 193 | 194 | ```bash 195 | nodetool status 196 | ``` 197 | 198 | Create a keyspace and table structure that matches your source data: 199 | 200 | ```bash 201 | cqlsh -e "CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};" 202 | cqlsh -e "CREATE TABLE mykeyspace.users (id int PRIMARY KEY, name text, email text);" 203 | ``` 204 | 205 | #### Step 2: Configure Cassandra Sink Connector 206 | 207 | Download the Cassandra sink connector: 208 | 209 | ```bash 210 | wget https://d1i4a8756m6x7j.cloudfront.net/repo/7.5/confluent-kafka-connect-cassandra-1.5.0.tar.gz 211 | tar -xzf confluent-kafka-connect-cassandra-1.5.0.tar.gz -C /path/to/kafka/plugins 212 | ``` 213 | 214 | Create a configuration file that tells the connector how to map Kafka events to Cassandra records: 215 | 216 | ```json 217 | { 218 | "name": "cassandra-sink", 219 | "config": { 220 | "connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector", 221 | "tasks.max": "1", 222 | "topics": "server1.public.users", 223 | "cassandra.contact.points": "localhost", 224 | "cassandra.keyspace": "mykeyspace", 225 | "cassandra.table.name": "users", 226 | "cassandra.kcql": "INSERT INTO users SELECT id, name, email FROM server1.public.users" 227 | } 228 | } 229 | ``` 230 | 231 | Deploy the connector to start the data flow: 232 | 233 | ```bash 234 | curl -X POST -H "Content-Type: application/json" --data @cassandra-sink.json http://localhost:8083/connectors 235 | ``` 236 | 237 | #### Step 3: Observe Data in Cassandra 238 | 239 | When Debezium sends user events to Kafka, the Cassandra sink connector automatically writes them to your Cassandra table. Query Cassandra to see the results: 240 | 241 | ```bash 242 | cqlsh -e "SELECT * FROM mykeyspace.users;" 243 | ``` 244 | 245 | You should see the result: `id=1, name='Alice', email='alice@example.com'`. 246 | 247 | The beauty of this setup is that any changes you make to the PostgreSQL users table will automatically appear in Cassandra within seconds, maintaining perfect synchronization between your systems. 248 | 249 | ## Key Considerations 250 | 251 | When implementing CDC in production environments, several important factors require careful attention: 252 | 253 | **Performance Impact**: CDC does add some overhead to your source database since it needs to read and process the WAL continuously. Monitor WAL usage in PostgreSQL, especially in high-transaction systems where the log can grow quickly. Consider the additional I/O load and ensure your database server has adequate resources. 254 | 255 | **Schema Evolution**: Debezium handles schema changes gracefully—when you add columns to a PostgreSQL table, Debezium automatically detects and includes them in future events. However, you must ensure that your target systems (like Cassandra tables) can accommodate these schema changes. Plan your schema evolution strategy carefully. 256 | 257 | **Scalability Considerations**: Cassandra's distributed architecture makes it excellent for handling high-volume CDC streams. Configure appropriate replication factors for reliability, and consider partitioning strategies that align with your query patterns and data access requirements. 
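To make the WAL-monitoring advice above concrete, here is a small illustrative sketch (not part of the original pipeline) that checks how much WAL each replication slot is retaining on the source PostgreSQL server. It assumes the `psycopg2` driver is installed and reuses the connection details from the earlier example (`mydb`, `myuser`); the slot created by the Debezium connector (named `debezium` unless you override `slot.name`) should appear in the output.

```python
import psycopg2  # assumes psycopg2 / psycopg2-binary is installed

# Connection details reused from the earlier example; adjust for your environment.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

# pg_replication_slots lists every slot; pg_wal_lsn_diff() measures how far
# behind the current WAL position each slot's restart point is.
QUERY = """
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for slot_name, active, retained_wal in cur.fetchall():
        # An inactive slot that keeps retaining WAL is the classic cause of
        # runaway disk usage on a CDC source database.
        print(f"slot={slot_name} active={active} retained_wal={retained_wal}")

conn.close()
```

If a slot keeps accumulating retained WAL while its connector is stopped, PostgreSQL cannot recycle those log segments, which is the most common way a CDC setup quietly fills up the source server's disk.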
258 | 259 | ## Assignment: Hands-On Project 260 | 261 | To solidify your understanding of CDC concepts, I encourage you to work through a practical implementation project. This hands-on experience will help you see how all the pieces fit together in a real-world scenario. 262 | 263 | **Project Repository**: [LuxDevHQ Data Engineering Project](https://github.com/LuxDevHQ/LuxDevHQ-Data-Engineering-Project) 264 | 265 | **Your Task**: Clone the repository and follow the comprehensive instructions to set up a complete CDC pipeline using Debezium and a sink connector. The project guides you through using Linux commands to set up PostgreSQL tables, make various changes (inserts, updates, deletes), and verify that these changes propagate correctly to the target system. 266 | 267 | ```bash 268 | git clone https://github.com/LuxDevHQ/LuxDevHQ-Data-Engineering-Project.git 269 | cd LuxDevHQ-Data-Engineering-Project 270 | ``` 271 | 272 | Follow the project's README file for detailed setup and execution instructions. This project provides a practical environment where you can experiment with CDC concepts, helping you understand how Debezium and sink connectors work together in real-world scenarios. 273 | 274 | As you work through this project, pay attention to how the different components interact, observe the timing of data propagation, and experiment with different types of database changes to see how they're handled by the CDC pipeline. 275 | -------------------------------------------------------------------------------- /PYTHON/chapter_1.md: -------------------------------------------------------------------------------- 1 | # Chapter 1: Introduction to Python and The Way of the Program 2 | 3 | ## What is Python? 4 | 5 | Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming. It is widely used in various domains such as web development, data analysis, artificial intelligence, scientific computing, and data engineering. 6 | 7 | ### Year Created 8 | Python was created in the late 1980s, and its first official version, Python 0.9.0, was released in February 1991. Python 2.0 was released in 2000, and Python 3.0, which is not backward-compatible with Python 2, was released in 2008. 9 | 10 | **Current stable Python version:** Python 3.13.2 11 | 12 | ### Important Resources 13 | - [W3School Python tutorial](https://www.w3schools.com/python/) (beginner friendly) 14 | - [Python official documentation](https://docs.python.org/) 15 | - [Programiz Python tutorial](https://www.programiz.com/python-programming) 16 | - [Pythontutorial.net](https://www.pythontutorial.net/) 17 | 18 | ## The Way of the Program 19 | 20 | The goal of learning Python is to teach you to think like a computer scientist. This way of thinking combines some of the best features of mathematics, engineering, and natural science. 21 | 22 | Like mathematicians, computer scientists use formal languages to denote ideas (specifically computations). Like engineers, they design things, assembling components into systems and evaluating tradeoffs among alternatives. Like scientists, they observe the behavior of complex systems, form hypotheses, and test predictions. 23 | 24 | The single most important skill for a computer scientist is **problem solving**. 
Problem solving means the ability to formulate problems, think creatively about solutions, and express a solution clearly and accurately. As it turns out, the process of learning to program is an excellent opportunity to practice problem-solving skills. 25 | 26 | ## 1.1 What is a Program? 27 | 28 | A program is a sequence of instructions that specifies how to perform a computation. The computation might be something mathematical, such as solving a system of equations or finding the roots of a polynomial, but it can also be a symbolic computation, such as searching and replacing text in a document or something graphical, like processing an image or playing a video. 29 | 30 | The details look different in different languages, but a few basic instructions appear in just about every language: 31 | 32 | - **input:** Get data from the keyboard, a file, the network, or some other device 33 | - **output:** Display data on the screen, save it in a file, send it over the network, etc. 34 | - **math:** Perform basic mathematical operations like addition and multiplication 35 | - **conditional execution:** Check for certain conditions and run the appropriate code 36 | - **repetition:** Perform some action repeatedly, usually with some variation 37 | 38 | Believe it or not, that's pretty much all there is to it. Every program you've ever used, no matter how complicated, is made up of instructions that look pretty much like these. 39 | 40 | You can think of programming as the process of breaking a large, complex task into smaller and smaller subtasks until the subtasks are simple enough to be performed with one of these basic instructions. 41 | 42 | ## 1.2 Running Python 43 | 44 | One of the challenges of getting started with Python is that you might have to install Python and related software on your computer. If you are familiar with your operating system, and especially if you are comfortable with the command-line interface, you will have no trouble installing Python. But for beginners, it can be painful to learn about system administration and programming at the same time. 45 | 46 | To avoid that problem, it's recommended that you start out running Python in a browser. Later, when you are comfortable with Python, you can install Python on your computer. 47 | 48 | There are a number of web pages you can use to run Python. Some popular options include: 49 | - [Replit](https://replit.com/) 50 | - [Python.org's online console](https://www.python.org/shell/) 51 | - [Trinket](https://trinket.io/python) 52 | 53 | ### Python Versions 54 | 55 | There are two major versions of Python: Python 2 and Python 3. They are very similar, so if you learn one, it is easy to switch to the other. However, Python 2 reached end-of-life on January 1, 2020, so all new projects should use Python 3. 56 | 57 | The Python interpreter is a program that reads and executes Python code. When you start the interpreter, you should see output like this: 58 | 59 | ``` 60 | Python 3.13.2 (default, Jan 15 2025, 14:20:21) 61 | [GCC 11.2.0] on linux 62 | Type "help", "copyright", "credits" or "license" for more information. 63 | >>> 64 | ``` 65 | 66 | The `>>>` is a prompt that indicates that the interpreter is ready for you to enter code. If you type a line of code and hit Enter, the interpreter displays the result: 67 | 68 | ```python 69 | >>> 1 + 1 70 | 2 71 | ``` 72 | 73 | ## 1.3 The First Program 74 | 75 | Traditionally, the first program you write in a new language is called "Hello, World!" 
because all it does is display the words "Hello, World!". In Python, it looks like this: 76 | 77 | ```python 78 | >>> print('Hello, World!') 79 | Hello, World! 80 | ``` 81 | 82 | This is an example of a print statement, although it doesn't actually print anything on paper. It displays a result on the screen. The quotation marks in the program mark the beginning and end of the text to be displayed; they don't appear in the result. 83 | 84 | The parentheses indicate that `print` is a function. We'll learn more about functions in later chapters. 85 | 86 | ## 1.4 Arithmetic Operators 87 | 88 | After "Hello, World!", the next step is arithmetic. Python provides operators, which are special symbols that represent computations like addition and multiplication. 89 | 90 | The operators `+`, `-`, and `*` perform addition, subtraction, and multiplication: 91 | 92 | ```python 93 | >>> 40 + 2 94 | 42 95 | >>> 43 - 1 96 | 42 97 | >>> 6 * 7 98 | 42 99 | ``` 100 | 101 | The operator `/` performs division: 102 | 103 | ```python 104 | >>> 84 / 2 105 | 42.0 106 | ``` 107 | 108 | Note that the result is `42.0` instead of `42` because division in Python 3 always returns a floating-point number. 109 | 110 | The operator `**` performs exponentiation (raising a number to a power): 111 | 112 | ```python 113 | >>> 6**2 + 6 114 | 42 115 | ``` 116 | 117 | **Warning:** In some other languages, `^` is used for exponentiation, but in Python it is a bitwise operator called XOR: 118 | 119 | ```python 120 | >>> 6 ^ 2 121 | 4 122 | ``` 123 | 124 | ## 1.5 Values and Types 125 | 126 | A value is one of the basic things a program works with, like a letter or a number. Some values we have seen so far are `2`, `42.0`, and `'Hello, World!'`. 127 | 128 | These values belong to different types: 129 | - `2` is an **integer** 130 | - `42.0` is a **floating-point number** 131 | - `'Hello, World!'` is a **string** 132 | 133 | If you are not sure what type a value has, the interpreter can tell you using the `type()` function: 134 | 135 | ```python 136 | >>> type(2) 137 | <class 'int'> 138 | >>> type(42.0) 139 | <class 'float'> 140 | >>> type('Hello, World!') 141 | <class 'str'> 142 | ``` 143 | 144 | In these results, the word "class" is used in the sense of a category; a type is a category of values. 145 | 146 | - Integers belong to the type `int` 147 | - Strings belong to `str` 148 | - Floating-point numbers belong to `float` 149 | 150 | ### String vs Numbers 151 | 152 | Values like `'2'` and `'42.0'` look like numbers, but they are in quotation marks, so they are strings: 153 | 154 | ```python 155 | >>> type('2') 156 | <class 'str'> 157 | >>> type('42.0') 158 | <class 'str'> 159 | ``` 160 | 161 | ## Python Basics 162 | 163 | ### Identifiers 164 | 165 | Identifiers are names used to identify variables, functions, classes, modules, or other objects. Rules for naming identifiers in Python: 166 | 167 | - Identifiers can be a combination of letters (a-z, A-Z), digits (0-9), and underscores (_) 168 | - Identifiers cannot start with a digit 169 | - Identifiers are case-sensitive (`myVar` and `myvar` are different) 170 | - Reserved keywords cannot be used as identifiers 171 | 172 | **Valid identifiers:** 173 | ```python 174 | my_variable 175 | _private_var 176 | variable1 177 | MyClass 178 | ``` 179 | 180 | **Invalid identifiers:** 181 | ```python 182 | 1variable # Cannot start with digit 183 | my-variable # Hyphen not allowed 184 | class # Reserved keyword 185 | ``` 186 | 187 | ### Keywords 188 | 189 | Keywords are reserved words in Python that have special meanings and cannot be used as identifiers. 
Some of the keywords in Python include: 190 | 191 | `if`, `else`, `elif`, `for`, `while`, `break`, `continue`, `def`, `return`, `lambda`, `class`, `import`, `from`, `try`, `except`, `finally`, `and`, `or`, `not`, `True`, `False`, `None` 192 | 193 | You can see all keywords by running: 194 | ```python 195 | import keyword 196 | print(keyword.kwlist) 197 | ``` 198 | 199 | ### PEP 8 Rules 200 | 201 | PEP 8 is the official style guide for Python code. It provides conventions for writing readable and consistent code. Some key PEP 8 rules include: 202 | 203 | - **Indentation:** Use 4 spaces per indentation level 204 | - **Line Length:** Limit all lines to a maximum of 79 characters (72 for docstrings/comments) 205 | - **Imports:** Imports should usually be on separate lines 206 | - **Whitespace:** Avoid extraneous whitespace in various situations 207 | - **Naming Conventions:** 208 | - Variables: `my_variable` 209 | - Functions: `my_function` 210 | - Classes: `MyClass` 211 | - Constants: `MY_CONSTANT` 212 | 213 | ## 1.6 Formal and Natural Languages 214 | 215 | Natural languages are the languages people speak, such as English, Spanish, and French. They were not designed by people (although people try to impose some order on them); they evolved naturally. 216 | 217 | Formal languages are languages that are designed by people for specific applications. For example, the notation that mathematicians use is a formal language that is particularly good at denoting relationships among numbers and symbols. Chemists use a formal language to represent the chemical structure of molecules. And most importantly: 218 | 219 | **Programming languages are formal languages that have been designed to express computations.** 220 | 221 | ### Key Differences 222 | 223 | Although formal and natural languages have many features in common—tokens, structure, and syntax—there are some differences: 224 | 225 | - **Ambiguity:** Natural languages are full of ambiguity, which people deal with by using contextual clues. Formal languages are designed to be nearly or completely unambiguous. 226 | 227 | - **Redundancy:** Natural languages employ lots of redundancy to reduce misunderstandings. Formal languages are less redundant and more concise. 228 | 229 | - **Literalness:** Natural languages are full of idiom and metaphor. Formal languages mean exactly what they say. 230 | 231 | ## 1.7 Debugging 232 | 233 | Programmers make mistakes. For whimsical reasons, programming errors are called **bugs** and the process of tracking them down is called **debugging**. 234 | 235 | Programming, and especially debugging, sometimes brings out strong emotions. If you are struggling with a difficult bug, you might feel angry, despondent, or embarrassed. 236 | 237 | ### Debugging Tips 238 | 239 | - Think of the computer as an employee with certain strengths (speed and precision) and weaknesses (lack of empathy and inability to grasp the big picture) 240 | - Find ways to take advantage of the strengths and mitigate the weaknesses 241 | - Use your emotions to engage with the problem, without letting reactions interfere with your ability to work effectively 242 | - Learning to debug is frustrating, but it's a valuable skill useful for many activities beyond programming 243 | 244 | ## Exercises 245 | 246 | ### Exercise 1.1 247 | Experiment with the "Hello, world!" program and try to make mistakes on purpose: 248 | 249 | 1. In a print statement, what happens if you leave out one of the parentheses, or both? 250 | 2. 
If you are trying to print a string, what happens if you leave out one of the quotation marks, or both? 251 | 3. You can use a minus sign to make a negative number like `-2`. What happens if you put a plus sign before a number? What about `2++2`? 252 | 4. In math notation, leading zeros are ok, as in `09`. What happens if you try this in Python? What about `011`? 253 | 5. What happens if you have two values with no operator between them? 254 | 255 | ### Exercise 1.2 256 | Start the Python interpreter and use it as a calculator: 257 | 258 | 1. How many seconds are there in 42 minutes 42 seconds? 259 | 2. How many miles are there in 10 kilometers? (Hint: there are 1.61 kilometers in a mile) 260 | 3. If you run a 10 kilometer race in 42 minutes 42 seconds, what is your average pace (time per mile in minutes and seconds)? What is your average speed in miles per hour? 261 | 262 | ## Glossary 263 | 264 | - **Problem solving:** The process of formulating a problem, finding a solution, and expressing it 265 | - **High-level language:** A programming language like Python that is designed to be easy for humans to read and write 266 | - **Interpreter:** A program that reads another program and executes it 267 | - **Program:** A set of instructions that specifies a computation 268 | - **Value:** One of the basic units of data, like a number or string, that a program manipulates 269 | - **Type:** A category of values (int, float, str) 270 | - **Bug:** An error in a program 271 | - **Debugging:** The process of finding and correcting bugs 272 | - **Identifier:** Names used to identify variables, functions, classes, or other objects 273 | - **Keyword:** Reserved words in Python with special meanings 274 | - **PEP 8:** The official style guide for Python code -------------------------------------------------------------------------------- /IntroductiontoCloudComputing.md: -------------------------------------------------------------------------------- 1 | ### Introduction to Cloud Computing (Azure and AWS) 2 | **Duration**: 90 minutes 3 | **Audience**: Data Engineers 4 | 5 | --- 6 | 7 | ### Learning Objectives 8 | By the end of this session, participants will be able to: 9 | 1. Define cloud computing and its core principles. 10 | 2. Compare key features of Azure and AWS. 11 | 3. Navigate basic services relevant to data engineering in both platforms. 12 | 4. Set up and configure a basic cloud environment. 13 | 14 | --- 15 | 16 | ### Agenda 17 | 1. **What is Cloud Computing?** (10 minutes) 18 | - Definition and characteristics of cloud computing. 19 | - Cloud deployment models: Public, Private, Hybrid. 20 | - Service models: IaaS, PaaS, SaaS. 21 | 22 | 2. **Azure vs. AWS: Key Concepts** (15 minutes) 23 | - Overview of Azure and AWS platforms. 24 | - Comparison of service categories: Compute, Storage, Networking, and Databases. 25 | - Strengths and use cases for Azure and AWS. 26 | 27 | 3. **Core Services for Data Engineering** (20 minutes) 28 | - **Compute**: 29 | - Azure: Azure Virtual Machines, Azure Databricks. 30 | - AWS: EC2, EMR (Elastic MapReduce). 31 | - **Storage**: 32 | - Azure: Blob Storage, Data Lake Storage. 33 | - AWS: S3, Glacier. 34 | - **Databases**: 35 | - Azure: Azure SQL Database, Cosmos DB. 36 | - AWS: RDS, DynamoDB. 37 | 38 | 4. **Hands-On Lab: Setting Up a Cloud Environment** (30 minutes) 39 | - Create free accounts on Azure and AWS. 40 | - Configure basic cloud storage: 41 | - Azure Blob Storage. 42 | - AWS S3 bucket. 43 | - Upload and retrieve sample data. 44 | 45 | 5. 
**Q&A and Wrap-Up** (15 minutes) 46 | - Address participants questions. 47 | - Discuss common challenges and best practices. 48 | - Share additional resources for continued learning. 49 | 50 | --- 51 | 52 | ### Detailed Session Plan 53 | 54 | #### What is Cloud Computing? (10 minutes) 55 | - **Definition**: Delivering computing services (e.g., servers, storage, databases, networking, software) over the internet. 56 | - **Characteristics**: 57 | - On-demand availability. 58 | - Scalability. 59 | - Pay-as-you-go pricing. 60 | - High availability and reliability. 61 | - **Service Models**: 62 | - **IaaS** (e.g., Virtual Machines): Full control over infrastructure. 63 | - **PaaS** (e.g., Azure Databricks): Managed environment for deploying applications. 64 | - **SaaS** (e.g., Office 365): Pre-built software accessed via the cloud. 65 | 66 | --- 67 | 68 | #### Azure vs. AWS: Key Concepts (15 minutes) 69 | - **Azure**: 70 | - Focus on hybrid cloud and enterprise solutions. 71 | - Tight integration with Microsoft tools (e.g., Power BI, Office 365). 72 | - **AWS**: 73 | - Largest cloud provider with a wide range of services. 74 | - Strong presence in startups and tech-first organizations. 75 | 76 | | Feature | Azure | AWS | 77 | |-----------------|---------------------------------|----------------------------------| 78 | | Compute | Azure VMs, Azure Kubernetes | EC2, Lambda, ECS | 79 | | Storage | Blob Storage, Data Lake | S3, Glacier | 80 | | Databases | Azure SQL, Cosmos DB | RDS, DynamoDB | 81 | | Analytics | Azure Synapse, Databricks | Redshift, EMR, Athena | 82 | 83 | --- 84 | 85 | #### Core Services for Data Engineering (20 minutes) 86 | **Compute Services**: 87 | - Azure: 88 | - **Azure Virtual Machines**: Scalable virtual servers. 89 | - **Azure Databricks**: Apache Spark-based analytics. 90 | - AWS: 91 | - **EC2**: Elastic Compute Cloud for scalable servers. 92 | - **EMR**: Managed Hadoop/Spark for big data processing. 93 | 94 | **Storage Services**: 95 | - Azure: 96 | - **Blob Storage**: Unstructured data storage. 97 | - **Data Lake Storage**: Analytics-optimized storage. 98 | - AWS: 99 | - **S3**: Highly available object storage. 100 | - **Glacier**: Long-term archival storage. 101 | 102 | **Database Services**: 103 | - Azure: 104 | - **Azure SQL Database**: Managed relational database. 105 | - **Cosmos DB**: Globally distributed, multi-model database. 106 | - AWS: 107 | - **RDS**: Managed relational databases. 108 | - **DynamoDB**: NoSQL database with high performance. 109 | 110 | --- 111 | 112 | #### Hands-On Lab: Setting Up a Cloud Environment (30 minutes) 113 | 114 | **Step 1**: Create Free Accounts 115 | 1. **Azure**: 116 | - Visit [Azure Free Account](https://azure.microsoft.com/free/). 117 | - Sign up with a Microsoft account. 118 | - Activate $200 free credit. 119 | 120 | 2. **AWS**: 121 | - Visit [AWS Free Tier](https://aws.amazon.com/free/). 122 | - Sign up with email and billing details. 123 | - Activate free-tier services. 124 | 125 | **Step 2**: Configure Basic Cloud Storage 126 | 1. **Azure Blob Storage**: 127 | - Navigate to **Storage Accounts** in Azure Portal. 128 | - Create a new storage account. 129 | - Upload a sample CSV file and view its properties. 130 | 131 | 2. **AWS S3 Bucket**: 132 | - Navigate to **S3** in AWS Console. 133 | - Create a new S3 bucket. 134 | - Upload a sample CSV file and view its properties. 
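If you want to verify the upload programmatically (a preview of Step 3 below), here is a minimal boto3 sketch. The bucket name is a placeholder and the snippet assumes your AWS credentials are already configured locally (`aws configure`); the Azure side is analogous with the `azure-storage-blob` package.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-dataeng-lab-bucket"  # placeholder: replace with the bucket you created above

# Upload the local sample.csv as an object called sample.csv
s3.upload_file("sample.csv", bucket, "sample.csv")

# Read the object back and print the first few lines to confirm the round trip
body = s3.get_object(Bucket=bucket, Key="sample.csv")["Body"].read().decode("utf-8")
print(body.splitlines()[:5])
```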
135 | 136 | **Step 3**: Retrieve and Use Data 137 | - Use Python or CLI tools to retrieve the uploaded file: 138 | - Azure: `azure-storage-blob` Python SDK. 139 | - AWS: `boto3` Python SDK. 140 | 141 | --- 142 | 143 | #### Q&A and Wrap-Up (15 minutes) 144 | - **Discussion Points**: 145 | - How to choose between Azure and AWS for specific use cases? 146 | - Best practices for managing costs in cloud platforms. 147 | - Common challenges faced by data engineers in cloud environments. 148 | - **Resources for Further Learning**: 149 | - Azure: Microsoft Learn - [Azure Fundamentals](https://learn.microsoft.com/en-us/azure/) 150 | - AWS: AWS Training - [AWS Fundamentals](https://aws.amazon.com/training/) 151 | 152 | --- 153 | 154 | ### Key Takeaways 155 | - Cloud computing provides scalable, cost-effective infrastructure and tools for data engineering. 156 | - Azure and AWS are the leading platforms, each with unique strengths. 157 | - Hands-on experience is crucial to understanding and leveraging cloud services. 158 | 159 | 160 | 161 | ###Bonus 162 | #### *Additional Notes and Tips for AWS Tools for Data Engineering.* 163 | 164 | AWS provides a rich ecosystem of tools and services tailored for data engineering tasks. Below is an expanded overview, along with notes and tips to maximize their utility. 165 | 166 | --- 167 | 168 | #### **Storage Services** 169 | 1. **Amazon S3 (Simple Storage Service)** 170 | - **Purpose**: Scalable object storage for raw, processed, and archived data. 171 | - **Common Use Cases**: 172 | - Data lakes. 173 | - Backup and disaster recovery. 174 | - Hosting static files. 175 | - **Tips**: 176 | - Use **S3 Lifecycle Policies** to move infrequently accessed data to cheaper storage classes (e.g., Glacier, Intelligent-Tiering). 177 | - Enable **versioning** to maintain file history and prevent accidental data loss. 178 | - Use **S3 Select** to query and retrieve subsets of data from objects directly, reducing data transfer costs. 179 | - Encrypt sensitive data using **SSE (Server-Side Encryption)** or **client-side encryption**. 180 | 181 | 2. **AWS Glue Data Catalog** 182 | - **Purpose**: Centralized metadata repository for datasets stored in S3 or other sources. 183 | - **Common Use Cases**: 184 | - Schema management for data lakes. 185 | - Integration with Athena, Redshift Spectrum, and EMR. 186 | - **Tips**: 187 | - Use AWS Glue Crawlers to automate schema detection for datasets. 188 | - Ensure proper IAM roles are configured for Glue to access S3 buckets. 189 | 190 | --- 191 | 192 | #### **Compute Services** 193 | 1. **Amazon EC2 (Elastic Compute Cloud)** 194 | - **Purpose**: General-purpose virtual servers. 195 | - **Common Use Cases**: 196 | - Hosting custom data pipelines. 197 | - Running one-time ETL jobs or long-running services. 198 | - **Tips**: 199 | - Use **Spot Instances** for cost savings on workloads that tolerate interruptions. 200 | - Implement **auto-scaling** to handle varying workloads. 201 | 202 | 2. **AWS Lambda** 203 | - **Purpose**: Serverless compute for event-driven processing. 204 | - **Common Use Cases**: 205 | - Lightweight ETL transformations. 206 | - Real-time data processing (e.g., responding to events from S3 or Kinesis). 207 | - **Tips**: 208 | - Keep Lambda functions small and focused on single tasks. 209 | - Optimize performance by minimizing package size and reusing connections. 210 | 211 | 3. **AWS EMR (Elastic MapReduce)** 212 | - **Purpose**: Managed Hadoop and Spark framework for big data processing. 
213 | - **Common Use Cases**: 214 | - Batch processing of large datasets. 215 | - Running machine learning models on large-scale data. 216 | - **Tips**: 217 | - Use **Spot Instances** with EMR to reduce costs. 218 | - Leverage **EMR File System (EMRFS)** for better integration with S3. 219 | 220 | --- 221 | 222 | #### **Database Services** 223 | 1. **Amazon Redshift** 224 | - **Purpose**: Managed data warehouse for OLAP workloads. 225 | - **Common Use Cases**: 226 | - Aggregating and analyzing large datasets. 227 | - Business intelligence and reporting. 228 | - **Tips**: 229 | - Use **Redshift Spectrum** to query data directly from S3 without loading it into Redshift. 230 | - Monitor and optimize queries using the **Query Monitoring Rules** feature. 231 | - Compress data using columnar storage formats like Parquet or ORC to improve query performance. 232 | 233 | 2. **Amazon DynamoDB** 234 | - **Purpose**: NoSQL database for key-value and document storage. 235 | - **Common Use Cases**: 236 | - Low-latency, high-throughput applications (e.g., user session storage). 237 | - Storing metadata or logs for data pipelines. 238 | - **Tips**: 239 | - Enable **DynamoDB Streams** for change data capture and event-driven workflows. 240 | - Use the **on-demand capacity mode** for unpredictable workloads to avoid over-provisioning. 241 | 242 | 3. **Amazon RDS (Relational Database Service)** 243 | - **Purpose**: Managed relational database with support for MySQL, PostgreSQL, Oracle, and SQL Server. 244 | - **Common Use Cases**: 245 | - Storing structured, transactional data. 246 | - Serving as a staging area for ETL workflows. 247 | - **Tips**: 248 | - Enable **Multi-AZ deployments** for high availability. 249 | - Use **Read Replicas** to offload read-heavy workloads. 250 | 251 | --- 252 | 253 | #### **Data Analytics Tools** 254 | 1. **Amazon Athena** 255 | - **Purpose**: Serverless query engine for analyzing data in S3 using SQL. 256 | - **Common Use Cases**: 257 | - Interactive exploration of data lakes. 258 | - Quick validation of ETL pipeline outputs. 259 | - **Tips**: 260 | - Use columnar formats like Parquet or ORC for faster queries. 261 | - Partition your data to reduce query costs. 262 | 263 | 2. **AWS Glue** 264 | - **Purpose**: Serverless ETL service. 265 | - **Common Use Cases**: 266 | - Cleaning and transforming datasets for downstream consumption. 267 | - Automating ETL workflows. 268 | - **Tips**: 269 | - Use **job bookmarks** to handle incremental data loads. 270 | - Test transformations locally using the AWS Glue Docker image. 271 | 272 | 3. **Amazon QuickSight** 273 | - **Purpose**: BI and data visualization tool. 274 | - **Common Use Cases**: 275 | - Creating dashboards for stakeholders. 276 | - Visualizing insights from Athena or Redshift queries. 277 | - **Tips**: 278 | - Leverage SPICE (Super-fast, Parallel, In-memory Calculation Engine) for faster dashboard performance. 279 | 280 | --- 281 | 282 | #### **Data Streaming Services** 283 | 1. **Amazon Kinesis** 284 | - **Purpose**: Platform for collecting, processing, and analyzing real-time data streams. 285 | - **Common Use Cases**: 286 | - IoT data ingestion. 287 | - Real-time log processing. 288 | - **Tips**: 289 | - Use **Kinesis Data Firehose** to automatically load streaming data into S3, Redshift, or Elasticsearch. 290 | - Monitor and scale Kinesis streams using **CloudWatch metrics**. 291 | 292 | 2. **AWS Managed Streaming for Apache Kafka (MSK)** 293 | - **Purpose**: Managed Apache Kafka service for real-time data processing. 
294 | - **Common Use Cases**: 295 | - Message brokering between services in a pipeline. 296 | - Event-driven architectures. 297 | - **Tips**: 298 | - Use Kafka connectors for seamless integration with AWS services like S3 or DynamoDB. 299 | - Optimize partitions and replication settings to balance performance and fault tolerance. 300 | 301 | --- 302 | 303 | #### **Data Security and Governance Tools** 304 | 1. **AWS IAM (Identity and Access Management)** 305 | - **Purpose**: Manage user permissions and access to AWS resources. 306 | - **Tips**: 307 | - Apply the **principle of least privilege** when assigning roles. 308 | - Use **IAM Policies** to define resource-level access. 309 | 310 | 2. **AWS Lake Formation** 311 | - **Purpose**: Simplify data lake creation with built-in governance. 312 | - **Tips**: 313 | - Define granular access policies using Lake Formation permissions. 314 | - Integrate with Glue Data Catalog for seamless schema management. 315 | 316 | 3. **AWS CloudTrail** 317 | - **Purpose**: Track user activity and API usage across AWS. 318 | - **Tips**: 319 | - Enable CloudTrail logs for all accounts to improve auditability. 320 | - Store logs in an S3 bucket for long-term analysis. 321 | 322 | --- 323 | 324 | ### Additional Resources 325 | - [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/) 326 | - [AWS Big Data Blog](https://aws.amazon.com/big-data/) 327 | - [Hands-On Labs for AWS](https://www.qwiklabs.com/) 328 | 329 | 330 | -------------------------------------------------------------------------------- /PYTHON/chapter_2.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: Variables, Expressions and Statements 2 | 3 | One of the most powerful features of a programming language is the ability to manipulate variables. A variable is a name that refers to a value. 4 | 5 | ## 2.1 Assignment Statements 6 | 7 | An assignment statement creates a new variable and gives it a value: 8 | 9 | ```python 10 | >>> message = 'And now for something completely different' 11 | >>> n = 17 12 | >>> pi = 3.1415926535897932 13 | ``` 14 | 15 | This example makes three assignments. The first assigns a string to a new variable named `message`; the second gives the integer 17 to `n`; the third assigns the (approximate) value of π to `pi`. 16 | 17 | A common way to represent variables on paper is to write the name with an arrow pointing to its value. This kind of figure is called a **state diagram** because it shows what state each of the variables is in (think of it as the variable's state of mind). 18 | 19 | ``` 20 | message ───────────────────→ 'And now for something completely different' 21 | n ──────────────────────────→ 17 22 | pi ─────────────────────────→ 3.1415926535897932 23 | ``` 24 | 25 | ## 2.2 Variable Names 26 | 27 | Programmers generally choose names for their variables that are meaningful—they document what the variable is used for. 28 | 29 | Variable names can be as long as you like. They can contain both letters and numbers, but they can't begin with a number. It is legal to use uppercase letters, but it is conventional to use only lower case for variable names. 30 | 31 | The underscore character, `_`, can appear in a name. It is often used in names with multiple words, such as `your_name` or `airspeed_of_unladen_swallow`. 
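For instance, all of these are legal, conventional variable names (the values are arbitrary and only there to complete the assignments):

```python
>>> your_name = 'Grace'
>>> airspeed_of_unladen_swallow = 11   # any value works; the point is the name
```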
32 | 33 | If you give a variable an illegal name, you get a syntax error: 34 | 35 | ```python 36 | >>> 76trombones = 'big parade' 37 | SyntaxError: invalid syntax 38 | >>> more@ = 1000000 39 | SyntaxError: invalid syntax 40 | >>> class = 'Advanced Theoretical Zymurgy' 41 | SyntaxError: invalid syntax 42 | ``` 43 | 44 | - `76trombones` is illegal because it begins with a number 45 | - `more@` is illegal because it contains an illegal character, `@` 46 | - `class` is illegal because it's a Python keyword 47 | 48 | ### Python Keywords 49 | 50 | The interpreter uses keywords to recognize the structure of the program, and they cannot be used as variable names. Python 3 has these keywords: 51 | 52 | | | | | | | 53 | |---------|----------|---------|----------|-------| 54 | | False | class | finally | is | return| 55 | | None | continue | for | lambda | try | 56 | | True | def | from | nonlocal | while | 57 | | and | del | global | not | with | 58 | | as | elif | if | or | yield | 59 | | assert | else | import | pass | | 60 | | break | except | in | raise | | 61 | 62 | You don't have to memorize this list. In most development environments, keywords are displayed in a different color; if you try to use one as a variable name, you'll know. 63 | 64 | ## 2.3 Expressions and Statements 65 | 66 | An **expression** is a combination of values, variables, and operators. A value all by itself is considered an expression, and so is a variable, so the following are all legal expressions: 67 | 68 | ```python 69 | >>> 42 70 | 42 71 | >>> n 72 | 17 73 | >>> n + 25 74 | 42 75 | ``` 76 | 77 | When you type an expression at the prompt, the interpreter **evaluates** it, which means that it finds the value of the expression. In this example, `n` has the value 17 and `n + 25` has the value 42. 78 | 79 | A **statement** is a unit of code that has an effect, like creating a variable or displaying a value. 80 | 81 | ```python 82 | >>> n = 17 83 | >>> print(n) 84 | ``` 85 | 86 | The first line is an assignment statement that gives a value to `n`. The second line is a print statement that displays the value of `n`. 87 | 88 | When you type a statement, the interpreter **executes** it, which means that it does whatever the statement says. In general, statements don't have values. 89 | 90 | ## 2.4 Script Mode 91 | 92 | So far we have run Python in **interactive mode**, which means that you interact directly with the interpreter. Interactive mode is a good way to get started, but if you are working with more than a few lines of code, it can be clumsy. 93 | 94 | The alternative is to save code in a file called a **script** and then run the interpreter in **script mode** to execute the script. By convention, Python scripts have names that end with `.py`. 95 | 96 | ### Differences Between Interactive and Script Mode 97 | 98 | Because Python provides both modes, you can test bits of code in interactive mode before you put them in a script. But there are differences between interactive mode and script mode that can be confusing. 99 | 100 | For example, if you are using Python as a calculator, you might type: 101 | 102 | ```python 103 | >>> miles = 26.2 104 | >>> miles * 1.61 105 | 42.182 106 | ``` 107 | 108 | The first line assigns a value to `miles`, but it has no visible effect. The second line is an expression, so the interpreter evaluates it and displays the result. It turns out that a marathon is about 42 kilometers. 109 | 110 | But if you type the same code into a script and run it, you get no output at all. 
In script mode an expression, all by itself, has no visible effect. Python evaluates the expression, but it doesn't display the result. To display the result, you need a print statement like this: 111 | 112 | ```python 113 | miles = 26.2 114 | print(miles * 1.61) 115 | ``` 116 | 117 | **Try this:** Type the following statements in the Python interpreter and see what they do: 118 | ```python 119 | 5 120 | x = 5 121 | x + 1 122 | ``` 123 | 124 | Now put the same statements in a script and run it. What is the output? Modify the script by transforming each expression into a print statement and then run it again. 125 | 126 | ## 2.5 Order of Operations 127 | 128 | When an expression contains more than one operator, the order of evaluation depends on the **order of operations**. For mathematical operators, Python follows mathematical convention. The acronym **PEMDAS** is a useful way to remember the rules: 129 | 130 | 1. **Parentheses** have the highest precedence and can be used to force an expression to evaluate in the order you want. Since expressions in parentheses are evaluated first, `2 * (3-1)` is 4, and `(1+1)**(5-2)` is 8. You can also use parentheses to make an expression easier to read, as in `(minute * 100) / 60`, even if it doesn't change the result. 131 | 132 | 2. **Exponentiation** has the next highest precedence, so `1 + 2**3` is 9, not 27, and `2 * 3**2` is 18, not 36. 133 | 134 | 3. **Multiplication and Division** have higher precedence than Addition and Subtraction. So `2*3-1` is 5, not 4, and `6+4/2` is 8, not 5. 135 | 136 | 4. **Operators with the same precedence** are evaluated from left to right (except exponentiation). So in the expression `degrees / 2 * pi`, the division happens first and the result is multiplied by pi. To divide by 2π, you can use parentheses or write `degrees / 2 / pi`. 137 | 138 | **Tip:** If you can't tell the order of operations by looking at an expression, use parentheses to make it obvious. 139 | 140 | ## 2.6 String Operations 141 | 142 | In general, you can't perform mathematical operations on strings, even if the strings look like numbers, so the following are illegal: 143 | 144 | ```python 145 | 'chinese'-'food' # Illegal 146 | 'eggs'/'easy' # Illegal 147 | 'third'*'a charm' # Illegal (wrong operand type) 148 | ``` 149 | 150 | But there are two exceptions, `+` and `*`. 151 | 152 | ### String Concatenation 153 | 154 | The `+` operator performs **string concatenation**, which means it joins the strings by linking them end-to-end. For example: 155 | 156 | ```python 157 | >>> first = 'throat' 158 | >>> second = 'warbler' 159 | >>> first + second 160 | 'throatwarbler' 161 | ``` 162 | 163 | ### String Repetition 164 | 165 | The `*` operator also works on strings; it performs **repetition**. For example, `'Spam'*3` is `'SpamSpamSpam'`. If one of the values is a string, the other has to be an integer. 166 | 167 | ```python 168 | >>> 'Spam' * 3 169 | 'SpamSpamSpam' 170 | >>> 4 * 'Na' 171 | 'NaNaNaNa' 172 | ``` 173 | 174 | This use of `+` and `*` makes sense by analogy with addition and multiplication. Just as `4*3` is equivalent to `4+4+4`, we expect `'Spam'*3` to be the same as `'Spam'+'Spam'+'Spam'`, and it is. 175 | 176 | **Think about it:** There is a significant way in which string concatenation and repetition are different from integer addition and multiplication. Can you think of a property that addition has that string concatenation does not? 
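If you want a nudge, one property you can test directly at the prompt is commutativity: whether swapping the operands changes the result. (This is a hint, not the only possible answer.)

```python
>>> 4 + 3 == 3 + 4
True
>>> 'throat' + 'warbler' == 'warbler' + 'throat'
False
```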
177 | 178 | ## 2.7 Comments 179 | 180 | As programs get bigger and more complicated, they get more difficult to read. Formal languages are dense, and it is often difficult to look at a piece of code and figure out what it is doing, or why. 181 | 182 | For this reason, it is a good idea to add notes to your programs to explain in natural language what the program is doing. These notes are called **comments**, and they start with the `#` symbol: 183 | 184 | ```python 185 | # compute the percentage of the hour that has elapsed 186 | percentage = (minute * 100) / 60 187 | ``` 188 | 189 | In this case, the comment appears on a line by itself. You can also put comments at the end of a line: 190 | 191 | ```python 192 | percentage = (minute * 100) / 60 # percentage of an hour 193 | ``` 194 | 195 | Everything from the `#` to the end of the line is ignored—it has no effect on the execution of the program. 196 | 197 | ### Writing Good Comments 198 | 199 | Comments are most useful when they document non-obvious features of the code. It is reasonable to assume that the reader can figure out what the code does; it is more useful to explain why. 200 | 201 | **Bad comment (redundant with the code and useless):** 202 | ```python 203 | v = 5 # assign 5 to v 204 | ``` 205 | 206 | **Good comment (contains useful information not in the code):** 207 | ```python 208 | v = 5 # velocity in meters/second 209 | ``` 210 | 211 | Good variable names can reduce the need for comments, but long names can make complex expressions hard to read, so there is a tradeoff. 212 | 213 | ## 2.8 Debugging 214 | 215 | Three kinds of errors can occur in a program: **syntax errors**, **runtime errors**, and **semantic errors**. It is useful to distinguish between them in order to track them down more quickly. 216 | 217 | ### Syntax Error 218 | "Syntax" refers to the structure of a program and the rules about that structure. For example, parentheses have to come in matching pairs, so `(1 + 2)` is legal, but `8)` is a syntax error. 219 | 220 | If there is a syntax error anywhere in your program, Python displays an error message and quits, and you will not be able to run the program. During the first few weeks of your programming career, you might spend a lot of time tracking down syntax errors. As you gain experience, you will make fewer errors and find them faster. 221 | 222 | ### Runtime Error 223 | The second type of error is a runtime error, so called because the error does not appear until after the program has started running. These errors are also called **exceptions** because they usually indicate that something exceptional (and bad) has happened. 224 | 225 | Runtime errors are rare in the simple programs you will see in the first few chapters, so it might be a while before you encounter one. 226 | 227 | ### Semantic Error 228 | The third type of error is "semantic", which means related to meaning. If there is a semantic error in your program, it will run without generating error messages, but it will not do the right thing. It will do something else. Specifically, it will do what you told it to do. 229 | 230 | Identifying semantic errors can be tricky because it requires you to work backward by looking at the output of the program and trying to figure out what it is doing. 
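To make the distinction concrete, here is one tiny, made-up snippet per kind of error. The first two are commented out so the block as a whole still runs; only the last one executes, and it silently produces the wrong answer:

```python
# Syntax error: Python refuses to run the program at all
# print('hello'            # missing closing parenthesis

# Runtime error (exception): the program starts, then stops with an error message
# print(1 / 0)             # ZeroDivisionError

# Semantic error: the program runs happily but does the wrong thing
minutes = 90
hours = minutes * 60       # wrong: should be minutes / 60
print(hours)               # prints 5400 instead of 1.5
```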
231 | 232 | ## Glossary 233 | 234 | - **variable:** A name that refers to a value 235 | - **assignment:** A statement that assigns a value to a variable 236 | - **state diagram:** A graphical representation of a set of variables and the values they refer to 237 | - **keyword:** A reserved word that is used to parse a program; you cannot use keywords like `if`, `def`, and `while` as variable names 238 | - **operand:** One of the values on which an operator operates 239 | - **expression:** A combination of variables, operators, and values that represents a single result 240 | - **evaluate:** To simplify an expression by performing the operations in order to yield a single value 241 | - **statement:** A section of code that represents a command or action. So far, the statements we have seen are assignments and print statements 242 | - **execute:** To run a statement and do what it says 243 | - **interactive mode:** A way of using the Python interpreter by typing code at the prompt 244 | - **script mode:** A way of using the Python interpreter to read code from a script and run it 245 | - **script:** A program stored in a file 246 | - **order of operations:** Rules governing the order in which expressions involving multiple operators and operands are evaluated 247 | - **concatenate:** To join two operands end-to-end 248 | - **comment:** Information in a program that is meant for other programmers (or anyone reading the source code) and has no effect on the execution of the program 249 | - **syntax error:** An error in a program that makes it impossible to parse (and therefore impossible to interpret) 250 | - **exception:** An error that is detected while the program is running 251 | - **semantics:** The meaning of a program 252 | - **semantic error:** An error in a program that makes it do something other than what the programmer intended 253 | 254 | ## Exercises 255 | 256 | ### Exercise 2.1 257 | Whenever you learn a new feature, you should try it out in interactive mode and make errors on purpose to see what goes wrong. 258 | 259 | - We've seen that `n = 42` is legal. What about `42 = n`? 260 | - How about `x = y = 1`? 261 | - In some languages every statement ends with a semi-colon, `;`. What happens if you put a semi-colon at the end of a Python statement? 262 | - What if you put a period at the end of a statement? 263 | - In math notation you can multiply x and y like this: xy. What happens if you try that in Python? 264 | 265 | ### Exercise 2.2 266 | Practice using the Python interpreter as a calculator: 267 | 268 | 1. The volume of a sphere with radius r is $\frac{4}{3}\pi r^3$. What is the volume of a sphere with radius 5? 269 | 270 | 2. Suppose the cover price of a book is $24.95, but bookstores get a 40% discount. Shipping costs $3 for the first copy and 75 cents for each additional copy. What is the total wholesale cost for 60 copies? 271 | 272 | 3. If I leave my house at 6:52 am and run 1 mile at an easy pace (8:15 per mile), then 3 miles at tempo (7:12 per mile) and 1 mile at easy pace again, what time do I get home for breakfast? 273 | -------------------------------------------------------------------------------- /data_lake.md: -------------------------------------------------------------------------------- 1 | # Comprehensive Guide to Data Lakes 2 | 3 | ## Table of Contents 4 | 1. [What is a Data Lake?](#what-is-a-data-lake) 5 | 2. [Why Do You Need a Data Lake?](#why-do-you-need-a-data-lake) 6 | 3. [Core Characteristics](#core-characteristics) 7 | 4. 
[Data Lake vs Data Warehouse vs Data Lakehouse](#data-lake-vs-data-warehouse-vs-data-lakehouse) 8 | 5. [Essential Components of Data Lake Architecture](#essential-components-of-data-lake-architecture) 9 | 6. [Common Use Cases](#common-use-cases) 10 | 7. [Benefits and Challenges](#benefits-and-challenges) 11 | 8. [Popular Technologies](#popular-technologies) 12 | 9. [Best Practices](#best-practices) 13 | 10. [Conclusion](#conclusion) 14 | 15 | ## What is a Data Lake? 16 | 17 | A **data lake** is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. Unlike traditional databases or data warehouses, data lakes enable you to store data in its native, raw format without requiring a predefined schema. 18 | 19 | ### Key Definition Points: 20 | - **Centralized storage**: Single location for all organizational data 21 | - **Any scale**: From gigabytes to petabytes 22 | - **Raw format**: Data stored as-is, without preprocessing 23 | - **Schema-on-read**: Structure applied when data is accessed, not when stored 24 | - **Multi-format support**: Handles structured, semi-structured, and unstructured data 25 | 26 | ## Why Do You Need a Data Lake? 27 | 28 | Organizations implementing modern data architectures, including data lakes, demonstrate measurable advantages in operational efficiency and revenue growth. Research shows that more than half of enterprises have implemented data lakes, with another 22% planning implementation within 36 months. 29 | 30 | ### Business Value: 31 | - **Faster decision-making**: Advanced analytics across diverse data sources 32 | - **Personalized experiences**: Comprehensive customer data analysis 33 | - **Operational optimization**: Predictive maintenance and efficiency improvements 34 | - **Competitive advantage**: Early identification of revenue opportunities 35 | - **Cost efficiency**: Leverage inexpensive object storage and open formats 36 | 37 | ## Core Characteristics 38 | 39 | ### 1. Scalability 40 | - Built to scale horizontally 41 | - Cloud-based object storage solutions (Amazon S3, Azure Data Lake Storage) 42 | - Growth from terabytes to petabytes without capacity concerns 43 | 44 | ### 2. Schema-on-Read 45 | - No predefined schema required at ingestion 46 | - Flexibility to apply different schemas based on use case 47 | - Structure determined during data retrieval or transformation 48 | 49 | ### 3. Raw Data Storage 50 | - Retains all data types in native format 51 | - Preserves complete data potential for future analysis 52 | - No upfront data transformation requirements 53 | 54 | ### 4. 
Diverse Data Type Support 55 | - **Structured**: SQL tables, CSV files 56 | - **Semi-structured**: JSON, XML, logs 57 | - **Unstructured**: Images, videos, audio, documents, PDFs 58 | 59 | ## Data Lake vs Data Warehouse vs Data Lakehouse 60 | 61 | | Feature | Data Lake | Data Lakehouse | Data Warehouse | 62 | |---------|-----------|----------------|----------------| 63 | | **Data Types** | All types (structured, semi-structured, unstructured) | All types (structured, semi-structured, unstructured) | Structured data only | 64 | | **Cost** | $ (Low) | $ (Low) | $$$ (High) | 65 | | **Format** | Open format | Open format | Closed, proprietary | 66 | | **Scalability** | Scales at low cost regardless of type | Scales at low cost regardless of type | Exponentially expensive scaling | 67 | | **Schema** | Schema-on-read | Schema-on-read with governance | Schema-on-write | 68 | | **Performance** | Variable, depends on compute engine | High performance | Optimized for fast SQL queries | 69 | | **Intended Users** | Data scientists | Data analysts, scientists, ML engineers | Data analysts | 70 | | **Reliability** | Low quality (data swamp risk) | High quality, reliable | High quality, reliable | 71 | | **Use Cases** | ML, big data, raw storage | Unified analytics, BI, ML | BI, structured analytics | 72 | 73 | ## Essential Components of Data Lake Architecture 74 | 75 | ### 1. Data Ingestion Layer 76 | Brings data from various sources into the data lake. 77 | 78 | **Ingestion Modes:** 79 | - **Batch ingestion**: Periodic loading (nightly, hourly) 80 | - **Stream ingestion**: Real-time data flows 81 | - **Hybrid ingestion**: Combination of batch and stream 82 | 83 | **Popular Tools:** 84 | - Apache Kafka, AWS Kinesis (streaming) 85 | - Apache NiFi, Flume, AWS Glue (batch ETL) 86 | 87 | ### 2. Storage Layer 88 | Built on cloud object storage for elastic scaling and cost efficiency. 89 | 90 | **Key Features:** 91 | - Durability and availability with automatic replication 92 | - Separation of storage and compute 93 | - Data tiering (hot, warm, cold storage) 94 | 95 | **Popular Options:** 96 | - Amazon S3 97 | - Azure Data Lake Storage 98 | - Google Cloud Storage 99 | - MinIO (on-premise) 100 | 101 | ### 3. Catalog and Metadata Management 102 | Prevents data lakes from becoming "data swamps" by maintaining organization. 103 | 104 | **Manages:** 105 | - Data schema and location 106 | - Partitioning information 107 | - Data lineage and versioning 108 | - Search and discovery capabilities 109 | 110 | **Tools:** 111 | - AWS Glue Data Catalog 112 | - Apache Hive Metastore 113 | - Apache Atlas 114 | - DataHub 115 | 116 | ### 4. Processing and Analytics Layer 117 | Transforms raw data into insights through various operations. 118 | 119 | **Capabilities:** 120 | - ETL/ELT pipelines 121 | - SQL querying 122 | - Machine learning pipelines 123 | - Real-time and batch processing 124 | 125 | ### 5. Security and Governance 126 | Protects sensitive data and ensures compliance. 127 | 128 | **Essential Features:** 129 | - Identity and Access Management (IAM) 130 | - Encryption (in-transit and at-rest) 131 | - Data masking and anonymization 132 | - Auditing and monitoring 133 | 134 | **Tools:** 135 | - AWS Lake Formation 136 | - Apache Ranger 137 | - Azure Purview 138 | 139 | ## Common Use Cases 140 | 141 | ### 1. 
Big Data Analytics 142 | - Historical and real-time data analysis 143 | - Cross-departmental analytics with single source of truth 144 | - Petabyte-scale dataset queries 145 | - Custom analytics on raw, unprocessed data 146 | 147 | ### 2. Machine Learning and AI 148 | - Multi-format training dataset storage 149 | - Raw data preservation for ML experimentation 150 | - Automated ML pipeline support 151 | - Enhanced model accuracy through comprehensive data access 152 | 153 | ### 3. Centralized Data Archiving 154 | - Long-term storage for compliance and auditing 155 | - Cost-effective historical data retention 156 | - Trend analysis and forecasting 157 | - Future ML model training preparation 158 | 159 | ### 4. Data Science Experimentation 160 | - Exploratory data analysis (EDA) 161 | - Hypothesis testing and prototyping 162 | - Unconstrained access to raw datasets 163 | - Innovation without data engineering dependencies 164 | 165 | ### 5. Improved Customer Interactions 166 | Combine data from multiple sources: 167 | - CRM platforms 168 | - Social media analytics 169 | - Marketing platforms with purchase history 170 | - Customer service interactions 171 | 172 | ### 6. R&D Innovation Support 173 | - Hypothesis testing and assumption refinement 174 | - Material selection for product design 175 | - Genomic research for medication development 176 | - Customer willingness-to-pay analysis 177 | 178 | ### 7. Operational Efficiency 179 | - IoT device data collection and analysis 180 | - Manufacturing process optimization 181 | - Predictive maintenance 182 | - Cost reduction and quality improvement 183 | 184 | ## Benefits and Challenges 185 | 186 | ### Benefits 187 | 188 | #### Flexibility and Scalability 189 | - No upfront schema requirements 190 | - Effortless scaling from gigabytes to petabytes 191 | - Cloud-native storage cost efficiency 192 | - Decoupled compute and storage architecture 193 | 194 | #### Comprehensive Data Support 195 | - All data types in single platform 196 | - Raw, unprocessed data preservation 197 | - Enhanced analytics capabilities 198 | - Cross-team collaboration improvement 199 | 200 | #### Cost and Performance 201 | - Significantly cheaper than traditional databases 202 | - Independent scaling of analytics workloads 203 | - Elimination of data silos 204 | - Improved decision-making through comprehensive analysis 205 | 206 | ### Challenges 207 | 208 | #### Data Quality and Organization 209 | - **Data swamp risk**: Without proper governance, becomes unusable 210 | - **Lack of structure**: Difficult to query and document 211 | - **Quality issues**: Poor data may go undetected 212 | - **Metadata gaps**: Users may not find or understand available data 213 | 214 | #### Governance and Security 215 | - **Complex governance**: Access control and compliance challenges 216 | - **Security concerns**: Protecting sensitive data in flexible environment 217 | - **Performance issues**: Traditional query engines slow on large datasets 218 | - **Reliability problems**: Difficulty combining batch and streaming data 219 | 220 | ### Mitigation Strategies 221 | 222 | #### Governance and Organization 223 | - Implement comprehensive data catalogs 224 | - Use standardized naming and folder structures 225 | - Apply data validation and profiling tools 226 | - Automate lifecycle management policies 227 | 228 | #### Security and Performance 229 | - Robust access control and encryption 230 | - Role-based access management 231 | - Regular data quality monitoring 232 | - Performance optimization through 
proper partitioning 233 | 234 | ## Popular Technologies 235 | 236 | ### Cloud-Native Solutions 237 | 238 | #### Amazon Web Services (AWS) 239 | - **Amazon S3**: Scalable object storage 240 | - **AWS Lake Formation**: Permissions, cataloging, governance 241 | - **AWS Glue**: ETL and data cataloging 242 | - **Amazon Athena**: SQL queries on S3 data 243 | 244 | #### Microsoft Azure 245 | - **Azure Data Lake Storage**: HDFS-like capabilities with blob storage 246 | - **Azure Synapse Analytics**: Integrated analytics service 247 | - **Azure Purview**: Data governance and cataloging 248 | 249 | #### Google Cloud Platform (GCP) 250 | - **Google Cloud Storage**: Durable object storage 251 | - **BigQuery**: Data warehouse with lake capabilities 252 | - **Vertex AI**: Machine learning platform integration 253 | 254 | ### Open-Source Tools 255 | 256 | #### Storage and Processing 257 | - **Apache Hadoop**: Original distributed data framework 258 | - **Delta Lake**: ACID transactions and versioning for object storage 259 | - **Apache Iceberg**: Table format with atomic operations and time travel 260 | - **Presto**: Distributed SQL query engine 261 | 262 | #### Analytics and ML 263 | - **Apache Spark**: Distributed computing for big data processing 264 | - **Apache Kafka**: Real-time data streaming 265 | - **Jupyter Notebooks**: Interactive data analysis and experimentation 266 | 267 | ### Analytics Platform Integrations 268 | 269 | #### Data Platforms 270 | - **Databricks**: Collaborative workspace with Delta Lake support 271 | - **Snowflake**: Hybrid lakehouse capabilities 272 | - **Confluent**: Enterprise Kafka platform 273 | 274 | #### Business Intelligence 275 | - **Power BI**: Microsoft's business intelligence platform 276 | - **Tableau**: Data visualization and analytics 277 | - **Looker**: Modern BI and data platform 278 | 279 | ## Best Practices 280 | 281 | ### 1. Use Data Lake as Landing Zone 282 | - Store all data without transformation or aggregation 283 | - Preserve raw format for machine learning and lineage 284 | - Maintain complete data history 285 | 286 | ### 2. Implement Data Security 287 | - **Mask PII**: Pseudonymize personally identifiable information 288 | - **Access controls**: Role-based and view-based ACLs 289 | - **Encryption**: Implement both in-transit and at-rest encryption 290 | - **Compliance**: Ensure GDPR and regulatory compliance 291 | 292 | ### 3. Build Reliability and Performance 293 | - **Use Delta Lake**: Brings database-like reliability to data lakes 294 | - **Implement ACID transactions**: Ensure data consistency 295 | - **Optimize partitioning**: Improve query performance 296 | - **Monitor data quality**: Regular validation and profiling 297 | 298 | ### 4. Establish Data Catalog 299 | - **Metadata management**: Track schema, location, and lineage 300 | - **Enable self-service**: Allow users to discover and understand data 301 | - **Document data sources**: Maintain comprehensive data documentation 302 | - **Version control**: Track data and schema changes 303 | 304 | ### 5. Lifecycle Management 305 | - **Automate tiering**: Move old data to cheaper storage tiers 306 | - **Retention policies**: Define and enforce data retention rules 307 | - **Archive management**: Efficient long-term data storage 308 | - **Cleanup procedures**: Remove obsolete or duplicate data 309 | 310 | ### 6. 
Monitoring and Governance 311 | - **Performance monitoring**: Track query performance and resource usage 312 | - **Cost optimization**: Monitor and optimize storage and compute costs 313 | - **Access auditing**: Log and review data access patterns 314 | - **Quality metrics**: Establish and monitor data quality indicators 315 | 316 | ## Conclusion 317 | 318 | Data lakes represent a fundamental shift in how organizations approach data storage and analytics. By providing a flexible, scalable foundation for raw and semi-structured data, they enable advanced analytics, machine learning, and real-time decision-making that wasn't possible with traditional data warehouses alone. 319 | 320 | ### Key Takeaways: 321 | 322 | 1. **Flexibility First**: Data lakes excel when you need to store diverse data types without predefined schemas 323 | 2. **Scale and Cost**: Cloud-native solutions provide virtually unlimited scalability at low cost 324 | 3. **Governance Critical**: Success depends on implementing strong metadata management and governance from the start 325 | 4. **Hybrid Approach**: Many organizations benefit from combining data lakes with data warehouses and lakehouses 326 | 5. **Technology Evolution**: The ecosystem continues to evolve with new tools addressing traditional data lake challenges 327 | 328 | ### When to Choose Data Lakes: 329 | 330 | Data lakes are ideal when your organization: 331 | - Handles complex, diverse, or large-scale data 332 | - Needs to enable faster experimentation and innovation 333 | - Wants to implement advanced analytics and AI/ML initiatives 334 | - Requires cost-effective long-term data storage 335 | - Operates in data-driven industries with rapidly changing requirements 336 | 337 | ### Success Factors: 338 | 339 | - **Start with governance**: Implement cataloging and security from day one 340 | - **Choose the right technology stack**: Align tools with team expertise and organizational needs 341 | - **Plan for growth**: Design architecture that scales with data volume and user needs 342 | - **Invest in training**: Ensure teams understand how to effectively use data lake capabilities 343 | - **Monitor and optimize**: Continuously improve performance, cost, and data quality 344 | 345 | Data lakes are not just a storage solution—they're a foundation for modern, data-driven organizations that want to unlock the full potential of their data assets while maintaining flexibility for future innovations and use cases. 346 | --------------------------------------------------------------------------------