` are called *tags*
91 | - You can access them like `soup.p`, `soup.find("a")`, etc.
92 |
93 | #### Text
94 |
95 | - Use `.get_text()` or `.text` to get the text inside a tag
96 |
97 | #### Attributes
98 |
99 | - Things like `href`, `class`, `id` are attributes of a tag
100 | - Access them with `tag['href']` or `tag.get('href')`
101 |
102 | #### Search methods
103 |
104 | - `.find()` → first matching element
105 | - `.find_all()` → all matching elements
106 | - `.select()` / `.select_one()` → CSS selector style
107 |
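The demo below uses `.find()`; for the CSS-selector style, here is a minimal, self-contained sketch (the snippet and URL are made up for illustration):

```python
from bs4 import BeautifulSoup

# Tiny throwaway snippet just to exercise the selector-style methods
snippet = '<div><p class="intro">Hi</p><a href="https://example.com">Go</a></div>'
soup = BeautifulSoup(snippet, "html.parser")

print(soup.select("p.intro"))        # list of every <p class="intro"> tag
print(soup.select_one("a")["href"])  # href attribute of the first <a> tag
```
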
108 | ---
109 |
110 | ### **5. Tiny “hello world” example**
111 |
112 | You can show this as the first demo:
113 |
114 | ```python
115 | from bs4 import BeautifulSoup
116 |
117 | html = """
118 | <html>
119 |   <body>
120 |     <h1>My Website</h1>
121 |     <p class="intro">Welcome to my site.</p>
122 |     <a href="https://example.com">Click here</a>
123 |   </body>
124 | </html>
125 | """
126 |
127 | # Create the soup object
128 | soup = BeautifulSoup(html, "html.parser")
129 |
130 | # Get the title (h1 text)
131 | title = soup.find("h1").get_text(strip=True)
132 |
133 | # Get the paragraph text
134 | intro = soup.find("p", class_="intro").get_text(strip=True)
135 |
136 | # Get the link and its URL
137 | link_tag = soup.find("a")
138 | link_text = link_tag.get_text(strip=True)
139 | link_url = link_tag.get("href")
140 |
141 | print("Title:", title)
142 | print("Intro:", intro)
143 | print("Link text:", link_text)
144 | print("Link URL:", link_url)
145 | ```
146 |
147 | **What this demo shows:**
148 |
149 | - How to create a `soup` object
150 | - How to find tags
151 | - How to get text and attributes
152 |
153 | ---
154 |
155 | ### **6. One-sentence summary for students**
156 |
157 | Beautiful Soup is a Python tool that makes it easy to read HTML and pull out just the data you need from web pages.
158 |
159 | Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#api-documentation
160 |
--------------------------------------------------------------------------------
/DATA-MODELING.md:
--------------------------------------------------------------------------------
1 | # **Introduction to Data Modeling**
2 |
3 | ### **What is a Model?**
4 |
5 | A **model** is an abstract structure that represents data and the relationships within it.
6 |
7 | ### **What is Data Modeling?**
8 |
9 | The process of designing how data will be **organized**, **stored**, and **accessed**.
10 |
11 | * It provides a **visual representation** (tables, rows, columns).
12 | * It acts as a **blueprint** for database design.
13 |
14 | ### **Purpose of Data Modeling**
15 |
16 | * It ensures **accuracy**, **consistency**, and **integrity** of data.
17 | * It optimizes **database performance**.
18 | * It facilitates **communication** between technical and business stakeholders.
19 |
20 | ---
21 |
22 | # **Types of Data Modeling**
23 |
24 | ## 1. Conceptual Data Modeling
25 |
26 | * This represents a high-level overview of business/domain data without going into details.
27 | * No technical details, no attributes or data types.
28 | * It focuses on **entities** and their relationships.
29 | * Example (Hospital Domain):
30 |
31 | * Entities: Patient, Doctor, Appointment.
32 |
33 | ---
34 |
35 | ## 2. Logical Data Modeling
36 |
37 | * This model describes **data elements and relationships** in detail (without considering physical storage).
38 | * It defines **attributes** for each entity.
39 | * Example:
40 |
41 | * Doctor: doctor\_id, name, specialization.
42 | * Appointment: start\_time, end\_time, doctor\_id (FK).
43 |
44 | ---
45 |
46 | ## 3. Physical Data Modeling
47 |
48 | * This model defines **how data is physically stored** in a database.
49 | * It includes:
50 |
51 | * **Data types** (e.g., INT, VARCHAR).
52 | * **Constraints** (e.g., PRIMARY KEY, FOREIGN KEY).
53 | * **Indexes** for performance.
54 |
55 | Example Schema:
56 |
57 | ### Doctor Table
58 |
59 | * doctor\_id (INT) – Primary Key
60 | * doctor\_name (VARCHAR)
61 |
62 | ### Customers Table
63 |
64 | * customer\_id (INT) – Primary Key
65 | * name (VARCHAR)
66 | * age (INT)
67 | * email (VARCHAR)
68 | * DOB (DATE)
69 | * phone (VARCHAR)
70 |
71 | ### Account Table
72 |
73 | * account\_id (INT) – Primary Key
74 | * balance (INT)
75 | * dr (INT) (Debit)
76 | * cr (INT) (Credit)
77 | * acc\_type (VARCHAR)
78 | * customer\_id (INT) – Foreign Key
79 |
80 | ### Branch Table
81 |
82 | * branch\_id (INT) – Primary Key
83 | * location (VARCHAR)
84 |
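A physical model is ultimately written down as DDL. Here is a minimal SQL sketch of the Customers and Account tables above (the column lengths and exact types are illustrative choices, not part of the model):

```sql
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    age         INT,
    email       VARCHAR(255),
    dob         DATE,
    phone       VARCHAR(20)
);

CREATE TABLE account (
    account_id  INT PRIMARY KEY,
    balance     INT,
    dr          INT,  -- debit
    cr          INT,  -- credit
    acc_type    VARCHAR(20),
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
```
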
85 | ---
86 |
87 | ## 4. Entity-Relationship Data Modeling
88 |
89 | * **Entity**: An object or concept representing data (e.g., Patient, Doctor, Appointment).
90 | * **Attributes**: Properties of an entity (e.g., patient\_id, doctor\_name).
91 | * **ERD (Entity-Relationship Diagram)**: Visual diagram showing entities and relationships.
92 |
93 | ### Relationship Types
94 |
95 | | Type | Example |
96 | | ------------ | ------------------------------------------------------------------------------ |
97 | | One-to-One | A Doctor has one doctor\_id, and each doctor\_id belongs to one Doctor. |
98 | | One-to-Many | A Doctor has many Patients, but a Patient has only one primary Doctor. |
99 | | Many-to-Many | A Patient can have many Appointments, and a Doctor can have many Appointments. |
100 |
101 | Detailed Examples:
102 |
103 | * **One-to-Many (1\:M):**
104 |
105 | * "A Doctor has many Patients, but a Patient has only one primary Doctor."
106 | * Implementation: Patients table has a foreign key (doctor\_id).
107 |
108 | * **Many-to-Many (M\:N):**
109 |
110 | * "A Patient can book many Appointments, and a Doctor can handle many Appointments."
111 | * Implementation: A junction table (Appointment) with patient\_id and doctor\_id as foreign keys.
112 |
113 | * **One-to-One (1:1):**
114 |
115 | * "A Doctor has exactly one unique doctor\_id."
116 | * Implementation: doctor\_id is both a primary key and unique.
117 |
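As an illustration, the 1:M and M:N implementations described above can be written as DDL roughly like this (a sketch; columns are trimmed to the keys):

```sql
-- 1:M: each patient references one primary doctor
CREATE TABLE doctors (
    doctor_id INT PRIMARY KEY,
    name      VARCHAR(100)
);

CREATE TABLE patients (
    patient_id INT PRIMARY KEY,
    name       VARCHAR(100),
    doctor_id  INT REFERENCES doctors (doctor_id)
);

-- M:N: the Appointment junction table holds both foreign keys
CREATE TABLE appointments (
    appointment_id INT PRIMARY KEY,
    patient_id     INT REFERENCES patients (patient_id),
    doctor_id      INT REFERENCES doctors (doctor_id),
    start_time     TIMESTAMP,
    end_time       TIMESTAMP
);
```
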
118 | ---
119 |
120 | ## 5. Dimensional Data Modeling
121 |
122 | * Dimensional Data Modeling is primarily used in **data warehouses** for analytical purposes.
123 | * It organizes data into **fact tables** and **dimension tables**.
124 |
125 | ### Key Components
126 |
127 | | Component | Description | Example |
128 | | --------------- | ----------------------------------------- | --------------------------------- |
129 | | Fact Table | Numerical/measurable data (metrics/KPIs). | sales\_amount, quantity\_sold |
130 | | Dimension Table | Descriptive context for facts. | customer\_name, product\_category |
131 |
132 | Definitions:
133 |
134 | * **Dimensions**: Describe business entities (name, age, product category).
135 | * **Measures**: Quantitative facts (e.g., number of products sold).
136 |
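A minimal star-schema sketch of these components (table and column names are illustrative):

```sql
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(100),
    age           INT
);

CREATE TABLE dim_product (
    product_key      INT PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50)
);

-- Fact table: measures plus foreign keys to the dimension tables
CREATE TABLE fact_sales (
    sale_id       INT PRIMARY KEY,
    customer_key  INT REFERENCES dim_customer (customer_key),
    product_key   INT REFERENCES dim_product (product_key),
    quantity_sold INT,
    sales_amount  DECIMAL(10, 2)
);
```
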
137 | ---
138 |
139 | # **Summary**
140 |
141 | * **Conceptual**: High-level overview (what data is important?).
142 | * **Logical**: Detailed attributes and relationships (how is data related?).
143 | * **Physical**: Technical implementation (how is data stored?).
144 | * **ER Modeling**: Graphical view of entities and relationships.
145 | * **Dimensional**: Optimized for analytics (facts and dimensions).
146 |
147 |
--------------------------------------------------------------------------------
/introduction-to-Kafka.md:
--------------------------------------------------------------------------------
1 | # Introduction to Data Streaming and Apache Kafka
2 |
3 | ## **What is Streaming Data?**
4 | **Streaming data** (also called **event stream processing**) is the **continuous flow of data** generated by various sources, processed in **real-time** to extract insights and trigger actions.
5 |
6 | ## **Characteristics of Real-Time Data Processing**
7 | 1. **Continuous flow** – Data is constantly generated with no "end."
8 | 2. **Real-time processing** – Instant analysis for timely insights (no batch delays).
9 | 3. **Event-driven architecture** – Systems react dynamically to individual events.
10 | 4. **Scalability & fault tolerance** – Handles high traffic and recovers from failures.
11 | 5. **Varied Data Sources** – Streaming data originates from sensors, logs, APIs, applications, mobile devices, and more.
12 |
13 | ---
14 |
15 | ## **Key Benefits of Real-Time Data Processing**
16 | - **Immediate Insights**: Analyze data as it’s generated.
17 | - **Instant Decision-Making**: Respond to events in real-time (e.g., fraud detection).
18 | - **Operational Efficiency**: Optimize workflows and reduce downtime.
19 | - **Enhanced User Experience**: Personalize experiences using live data.
20 |
21 | ### **Critical Use Cases**
22 | - **Fraud Detection**: Block suspicious transactions instantly.
23 | - **IoT Monitoring**: Track device health in real-time.
24 | - **Live Analytics**: Power dashboards with up-to-the-second data.
25 |
26 | ### **Discussion Questions**
27 | 1. Can you think of a real-time use case near you (e.g., mobile money, delivery apps)?
28 | 2. What happens when systems can’t process data in real-time?
29 |
30 | ---
31 |
32 | # **Introduction to Apache Kafka**
33 |
34 | ## **Key Concepts**
35 | - **Publish/Subscribe Model**:
36 | - Producers send/write (**publish**) messages.
37 | - Consumers receive/read (**subscribe**) messages.
38 | > **Asynchronous** means not happening at the same time or speed: the producer and consumer don't need to wait on each other.
39 |
40 | - **Common Use Cases**:
41 | - **Real-time analytics**: Analyze data as it's generated, instead of waiting for batch jobs or reports.
42 | - **Log collection**: Gather logs from multiple systems/services into a centralized location for monitoring, debugging, or auditing.
43 | - **Event sourcing**: Store state-changing events (like deposit $20, withdraw $30) rather than only storing the final state (balance = $50).
44 |
45 | ---
46 |
47 | ## **Kafka Architecture**
48 | | Component | Role |
49 | |--------------|----------------------------------------------------------------------|
50 | | **Producer** | Sends messages/events to a Kafka topic. |
51 | | **Consumer** | Reads messages/events from a topic. |
52 | | **Broker** | Kafka server storing/serving messages (clusters = multiple brokers). |
53 | | **ZooKeeper**/**KRaft** | Manages cluster state/metadata (KRaft replaces ZooKeeper in Kafka 4.0+). |
54 |
55 | ---
56 | ## **Event**:
57 | - An event records the fact that "something happened" in the real world or in your system.
58 | - It’s the fundamental unit of data in Kafka and may also be referred to as a record or message.
59 | - Events are immutable: once written, they are not updated.
60 | - When you produce (write) or consume (read) data in Kafka, you're interacting with events.
61 |
62 | #### Structure of an Event
63 | An event in Kafka typically contains the following components:
64 | - Key: Identifies the event (e.g., the user, transaction ID, or source). Used for partitioning logic.
65 | - Value: The actual data or payload (e.g., what happened).
66 | - Timestamp: Time when the event occurred or was written.
67 | - Headers (optional): Metadata about the event, such as content type or correlation ID.
68 |
69 | #### Example Event
70 | ```plaintext
71 | Event key: "Alice"
72 | Event value: "Made a payment of $200"
73 | Timestamp: 2025-06-27T08:45:30Z
74 | Headers: { "source": "mobile-app", "transaction-id": "TXN-4490" }
75 | ```
76 | This event could be sent to a topic like `payments` and later consumed by analytics or fraud detection services.
77 |
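Producing such an event from code maps the key, value, and headers directly onto the producer API. A sketch using the third-party `kafka-python` library (the broker address is an assumption; the topic is the `payments` example above):

```python
from kafka import KafkaProducer

# Assumes a broker reachable at localhost:9092 and a "payments" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")

producer.send(
    "payments",
    key=b"Alice",
    value=b"Made a payment of $200",
    headers=[("source", b"mobile-app"), ("transaction-id", b"TXN-4490")],
)
producer.flush()  # block until the event has actually been sent
```
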
78 | ## **Topic**:
79 | - A topic is a category/feed name to which messages/events are published (similar to a database table).
80 | - Topics are split into **partitions** for scalability/parallel processing.
81 |
82 | #### **Examples**
83 | - `orders` – E-commerce purchases.
84 | - `user-logins` – Authentication events.
85 | - `click-events` – Website/app interactions.
86 |
87 | ## **Partition**:
88 | - A topic can be split into multiple partitions, which enables scalability (more messages) and parallelism (faster processing).
89 | - Messages within a partition are strictly ordered.
90 |
91 | ## **Offset**:
92 | - A unique identifier (number) that Kafka assigns to each message within a partition.
93 | - It allows consumers to track where they left off in reading the stream of messages. (e.g., "read up to offset #5").
94 |
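A sketch of how a consumer sees partitions and offsets, again using `kafka-python` (broker, topic, and group id are assumptions):

```python
from kafka import KafkaConsumer

# Assumes a broker at localhost:9092 and a "payments" topic with events in it
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection",
    auto_offset_reset="earliest",  # start from the oldest retained offset
)

for message in consumer:
    # Each record carries the partition it came from and its offset within it
    print(message.partition, message.offset, message.key, message.value)
```
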
95 |
--------------------------------------------------------------------------------
/MySQLQueryExecutionPlans.md:
--------------------------------------------------------------------------------
1 | # MySQL Query Execution Plans: A Complete Guide
2 |
3 | ## 1. Understanding Query Execution Plans
4 |
5 | A **Query Execution Plan** shows you **how** MySQL decides to retrieve data — which indexes it will use, how many rows it will check, and the join strategy. You can get it using:
6 |
7 | ```sql
8 | EXPLAIN SELECT ...;
9 | ```
10 |
11 | For deeper insight with actual runtime statistics:
12 |
13 | ```sql
14 | EXPLAIN ANALYZE SELECT ...;
15 | ```
16 |
17 | ## 2. Example 1 – Basic Table Scan
18 |
19 | Let's query all customers.
20 |
21 | ```sql
22 | EXPLAIN SELECT * FROM customer_info;
23 | ```
24 |
25 | **Possible output:**
26 |
27 | | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
28 | |----|-------------|-------|------|---------------|-----|---------|-----|------|-------|
29 | | 1  | SIMPLE      | customer_info | ALL  | NULL          | NULL | NULL    | NULL | 1000 | NULL  |
30 |
31 | **Interpretation:**
32 | - **type = ALL** → full table scan (slow for large datasets)
33 | - No indexes used because there's no filtering
34 | - **Optimization:** Avoid `SELECT *` unless necessary. Use `WHERE` + indexed columns
35 |
36 | ## 3. Example 2 – Using an Index
37 |
38 | Suppose we query customers by `customer_id` (indexed as PRIMARY KEY).
39 |
40 | ```sql
41 | EXPLAIN SELECT full_name FROM customer_info WHERE customer_id = 10;
42 | ```
43 |
44 | **Possible output:**
45 |
46 | | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
47 | |----|-------------|-------|------|---------------|-----|---------|-----|------|-------|
48 | | 1 | SIMPLE | customer_info | const | PRIMARY | PRIMARY | 4 | const | 1 | NULL |
49 |
50 | **Interpretation:**
51 | - **type = const** → MySQL knows it will return at most one row (fast)
52 | - Using `PRIMARY` key (O(1) lookup)
53 | - **Optimization:** Always filter with indexed columns when possible
54 |
55 | ## 4. Example 3 – Index Usage in Products Table
56 |
57 | Query products by `customer_id` (foreign key).
58 |
59 | ```sql
60 | EXPLAIN SELECT product_name FROM products WHERE customer_id = 3;
61 | ```
62 |
63 | If `customer_id` is not indexed, **type** will be `ALL` (slow). **Solution:** Add index:
64 |
65 | ```sql
66 | CREATE INDEX idx_customer_id ON products(customer_id);
67 | ```
68 |
69 | After indexing, the execution plan might show:
70 |
71 | | type | possible_keys | key | rows | Extra |
72 | |------|---------------|-----|------|-------|
73 | | ref | idx_customer_id | idx_customer_id | 5 | Using where |
74 |
75 | ## 5. Example 4 – Join with Index
76 |
77 | Query sales with customer names.
78 |
79 | ```sql
80 | EXPLAIN
81 | SELECT s.sales_id, s.total_sales, c.full_name
82 | FROM sales s
83 | JOIN customer_info c ON s.customer_id = c.customer_id;
84 | ```
85 |
86 | If both `sales.customer_id` and `customer_info.customer_id` are indexed:
87 |
88 | | id | select_type | table | type | possible_keys | key | ref | rows | Extra |
89 | |----|-------------|-------|------|---------------|-----|-----|------|-------|
90 | | 1 | SIMPLE | c | ALL | PRIMARY | NULL | NULL | 1000 | Using where |
91 | | 1 | SIMPLE | s | ref | idx_customer_id | idx_customer_id | c.customer_id | 10 | NULL |
92 |
93 | **Optimization tips:**
94 | - Always index **join columns**
95 | - Make sure both sides of the join use the same data type
96 |
97 | ## 6. Example 5 – Filtering and Joining
98 |
99 | Retrieve all sales above 500 made by customers in "Nairobi".
100 |
101 | ```sql
102 | EXPLAIN
103 | SELECT s.sales_id, s.total_sales, c.full_name
104 | FROM sales s
105 | JOIN customer_info c ON s.customer_id = c.customer_id
106 | WHERE c.location = 'Nairobi' AND s.total_sales > 500;
107 | ```
108 |
109 | Possible bottlenecks:
110 | - If `location` isn't indexed → table scan on `customer_info`
111 | - **Solution:** Create index:
112 |
113 | ```sql
114 | CREATE INDEX idx_location ON customer_info(location);
115 | ```
116 |
117 | Execution plan should now show **ref** instead of **ALL** for `customer_info`.
118 |
119 | ## 7. Example 6 – Multi-table Join with Products
120 |
121 | ```sql
122 | EXPLAIN
123 | SELECT c.full_name, p.product_name, s.total_sales
124 | FROM customer_info c
125 | JOIN products p ON c.customer_id = p.customer_id
126 | JOIN sales s ON p.product_id = s.product_id
127 | WHERE s.total_sales > 1000;
128 | ```
129 |
130 | Optimization tips:
131 | - Index `products.customer_id`
132 | - Index `sales.product_id`
133 | - Filter early (`WHERE s.total_sales > 1000`) so MySQL processes fewer rows
134 |
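The `products(customer_id)` index was already created in Example 3; the other index suggested above, in the same style (the index name is illustrative):

```sql
CREATE INDEX idx_product_id ON sales(product_id);
```
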
135 | ## 8. Example 7 – Using `EXPLAIN ANALYZE`
136 |
137 | ```sql
138 | EXPLAIN ANALYZE
139 | SELECT c.full_name, p.product_name
140 | FROM customer_info c
141 | JOIN products p ON c.customer_id = p.customer_id
142 | WHERE p.price > 500;
143 | ```
144 |
145 | **Benefit:**
146 | - Shows **actual execution time** for each step
147 | - If you see that a join step takes much longer than expected → check indexes
148 |
149 | ## 9. Best Practices Recap
150 |
151 | ✅ Index frequently queried columns (especially in `WHERE`, `JOIN`, `ORDER BY`)
152 | ✅ Avoid `SELECT *` for performance
153 | ✅ Use `EXPLAIN` before and after schema changes
154 | ✅ Filter data as early as possible in your query
155 | ✅ Keep an eye on `rows` in EXPLAIN — smaller is better
156 |
--------------------------------------------------------------------------------
/AivenProjectVersionWeekOneProject.md:
--------------------------------------------------------------------------------
1 | # Project: Setting Up Aiven Cloud Storage and Connecting a PostgreSQL Database Using DBeaver
2 |
3 | ### **Objective**:
4 | The goal of this project is to set up a **managed PostgreSQL database** on **Aiven**, use their storage options, and connect to it using **DBeaver** to manage and query the data.
5 |
6 | ---
7 |
8 | ### **Steps to Complete the Project**:
9 |
10 | #### **1. Set Up Aiven Account and Create PostgreSQL Service**
11 | - **Create an Aiven account** if you don’t already have one: [Aiven Sign Up](https://aiven.io/).
12 | - **Create a PostgreSQL service** on Aiven:
13 | - Log into the Aiven console.
14 | - Click **Create Service**.
15 | - Select **PostgreSQL** from the list of available services.
16 | - Choose the cloud provider (AWS, Google Cloud, etc.) and the region.
17 | - Configure your service and choose the storage options provided by Aiven.
18 | - Aiven will manage your PostgreSQL setup automatically, including backups, monitoring, and scaling.
19 | - **Obtain connection details**:
20 | - Once the PostgreSQL service is created, note down the **hostname**, **port**, **username**, and **password**.
21 | - You'll use this information to connect from DBeaver.
22 |
23 | #### **2. Set Up Cloud Storage on Aiven**
24 | - **Aiven provides object storage** integrated with their managed services, so you don't need to manually set up AWS S3 or Azure Blob Storage. You can upload data directly to Aiven storage or integrate with other services.
25 | - **Upload a sample file** to Aiven storage:
26 | - Go to the **Aiven dashboard** and look for storage options related to your service.
27 | - Upload a file (like a CSV) that you want to interact with using your PostgreSQL database.
28 | - **Alternatively**, you can use an external storage service (e.g., AWS S3 or Azure Blob) to interact with Aiven if required. However, Aiven's managed storage service should work well for this project.
29 |
30 | #### **3. Install and Configure DBeaver**
31 | - **Download and install DBeaver** (SQL client for PostgreSQL):
32 | - Go to [DBeaver's official site](https://dbeaver.io/) and download the version suitable for your OS.
33 | - **Connect to Aiven PostgreSQL service**:
34 | - Open DBeaver.
35 | - Create a **new connection**:
36 | - Select **PostgreSQL**.
37 | - Fill in the connection details (hostname, port, username, password, and database name from Aiven).
38 | - Test the connection and make sure it's successful.
39 |
40 | #### **4. Set Up PostgreSQL Database and Create Tables**
41 | - Once connected via DBeaver, you can **create a new database** or use the existing one.
42 | - Example: Create a simple `products` table:
43 | ```sql
44 | CREATE TABLE products (
45 | id SERIAL PRIMARY KEY,
46 | name VARCHAR(100),
47 | price DECIMAL
48 | );
49 | ```
50 | - **Insert some data** into the `products` table:
51 | ```sql
52 | INSERT INTO products (name, price)
53 | VALUES ('Laptop', 1000), ('Smartphone', 700);
54 | ```
55 | - **Query the data**:
56 | ```sql
57 | SELECT * FROM products;
58 | ```
59 |
60 | #### **5. (Optional) Integrate Aiven Storage with PostgreSQL**
61 | - If you're using Aiven's managed storage, you can perform the following operations:
62 | - **Download data from Aiven storage** (if required), using Aiven's integration options or by connecting to storage buckets.
63 | - **Load data into PostgreSQL** (if you’ve uploaded a CSV):
64 | - You can use `COPY` commands in PostgreSQL or perform an import directly through DBeaver’s **Import Data** option.
65 | - Example `COPY` command:
66 | ```sql
67 | COPY products FROM '/path/to/your/file.csv' DELIMITER ',' CSV HEADER;
68 | ```
69 |
70 | #### **6. Perform Data Operations Using DBeaver**
71 | - Use DBeaver to interact with the **PostgreSQL database**.
72 | - **CRUD Operations**: Create, read, update, and delete data.
73 | - **Querying**: Run SQL queries and get results directly in DBeaver.
74 | - **Database Management**: Create new tables, define schemas, and more.
75 |
76 | ---
77 |
78 | ### **Deliverables**:
79 | 1. **Screenshots** of the Aiven dashboard with the PostgreSQL service and storage bucket setup.
80 | 2. **SQL scripts** for creating and inserting data into the `products` table.
81 | 3. **Python script** (optional) for uploading files to Aiven storage (if applicable).
82 | 4. **Connection details** and queries executed via DBeaver.
83 | 5. A brief report documenting the steps taken, cloud setup, and any challenges faced.
84 |
85 | ---
86 |
87 | ### **Skills Gained**:
88 | - Configuring and using **Aiven's managed PostgreSQL**.
89 | - **Uploading data** to managed cloud storage.
90 | - Using **DBeaver** to connect and query PostgreSQL.
91 | - Integrating **cloud storage** with your database system.
92 | - Performing **ETL operations** (optional if data is being uploaded).
93 |
94 | ---
95 |
96 | ### **Why Use Aiven for this Project?**
97 | - **Managed PostgreSQL**: Aiven handles your PostgreSQL installation, backups, scaling, and monitoring, so you can focus on the data engineering tasks.
98 | - **Storage Integration**: Easily manage cloud storage for your data and avoid manual setups of services like AWS S3 or Azure Blob.
99 | - **Simplified Setup**: Aiven offers a streamlined, unified experience for cloud services, databases, and storage.
100 |
101 | This updated version with **Aiven** simplifies your cloud storage and database setup while still providing the core hands-on experience in managing cloud databases and interacting with them via DBeaver.
102 |
--------------------------------------------------------------------------------
/WeekOneProject.md:
--------------------------------------------------------------------------------
1 | ### Project: Setting Up Cloud Storage and Connecting a Database with DBeaver
2 |
3 | #### Objective:
4 | The goal of this project is to set up a cloud storage service (AWS S3 or Azure Blob Storage), create a PostgreSQL database, and connect to it using DBeaver for managing and querying the data. This project will help you understand how to configure cloud storage, set up a relational database, and use a SQL client to interact with the database.
5 |
6 | ---
7 |
8 | ### Steps to Complete the Project:
9 |
10 | #### 1. Set Up Cloud Storage (AWS S3 or Azure Blob Storage)
11 |
12 | **Using AWS S3:**
13 | - Create an AWS account (if you don’t have one).
14 | - Go to the **S3 dashboard** and create a new **S3 bucket**.
15 | - Set a unique bucket name and choose a region.
16 | - Leave default settings for now.
17 | - Upload a sample file (e.g., a CSV file or any dataset) to your S3 bucket.
18 |
19 | **OR**
20 |
21 | **Using Azure Blob Storage:**
22 | - Create an Azure account (if you don’t have one).
23 | - Go to the **Azure Portal** and create a **Storage Account**.
24 | - Choose the appropriate region and resource group.
25 | - Once created, navigate to the Blob Storage section and create a **container**.
26 | - Upload a sample file (e.g., a CSV file) to your Azure Blob Storage.
27 |
28 | #### 2. Set Up PostgreSQL Database
29 |
30 | - Install PostgreSQL locally on your machine or use a cloud database provider like AWS RDS or Azure PostgreSQL.
31 | - For local installation:
32 | - **Windows**: Download the installer from the official PostgreSQL website.
33 | - **Mac**: Use Homebrew (`brew install postgresql`).
34 | - **Linux**: Use the package manager (`sudo apt-get install postgresql`).
35 |
36 | - Create a PostgreSQL database named `test_db` (or any other name).
37 | - Connect to the database using the `psql` terminal.
38 | - Create a simple table to store data (e.g., a table for storing basic product information):
39 | ```sql
40 | CREATE TABLE products (
41 | id SERIAL PRIMARY KEY,
42 | name VARCHAR(100),
43 | price DECIMAL
44 | );
45 | ```
46 |
47 | #### 3. Install and Configure DBeaver
48 |
49 | - Download and install **DBeaver** (a SQL client tool that connects to databases).
50 | - Go to [DBeaver website](https://dbeaver.io/) and download the version compatible with your operating system.
51 |
52 | - Open DBeaver and create a **new connection** to the PostgreSQL database:
53 | - Select **PostgreSQL** as the database type.
54 | - Enter the database connection details (host, port, username, password, and database name).
55 | - For local PostgreSQL installation, the default values are typically:
56 | - Host: `localhost`
57 | - Port: `5432`
58 | - Username: `postgres`
59 | - Password: Your PostgreSQL password
60 | - Database: `test_db`
61 |
62 | #### 4. Connect Cloud Storage with PostgreSQL
63 |
64 | - **(Optional) For AWS S3**: Use a tool like `boto3` (AWS SDK for Python) to interact with the files stored in your S3 bucket. You could upload a CSV file and load it into your PostgreSQL database using Python.
65 | - Example Python code using `boto3` to download a file from S3:
66 | ```python
67 | import boto3
68 |
69 | s3 = boto3.client('s3')
70 | bucket_name = 'your-bucket-name'
71 | file_key = 'your-file.csv'
72 | local_file_path = '/path/to/save/file.csv'
73 |
74 | s3.download_file(bucket_name, file_key, local_file_path)
75 | ```
76 |
77 | - **(Optional) For Azure Blob Storage**: Use the `azure-storage-blob` Python library to interact with Azure Blob Storage.
78 | - Example Python code to download a file from Azure Blob Storage:
79 | ```python
80 | from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
81 |
82 | connection_string = "your_connection_string"
83 | container_name = "your_container_name"
84 | blob_name = "your-file.csv"
85 | download_path = "/path/to/save/file.csv"
86 |
87 | blob_service_client = BlobServiceClient.from_connection_string(connection_string)
88 | container_client = blob_service_client.get_container_client(container_name)
89 | blob_client = container_client.get_blob_client(blob_name)
90 |
91 | with open(download_path, "wb") as download_file:
92 | download_file.write(blob_client.download_blob().readall())
93 | ```
94 |
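Once a CSV is on the local machine (downloaded from either cloud store), it can also be loaded into the `products` table from Python. A minimal sketch using `psycopg2` (connection details are placeholders, and the CSV is assumed to have `name,price` columns with a header row):

```python
import psycopg2

# Placeholder connection details: replace with your own
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="test_db",
    user="postgres", password="your_password",
)

with conn, conn.cursor() as cur, open("/path/to/save/file.csv") as f:
    # COPY ... FROM STDIN streams the local file through the client connection
    cur.copy_expert("COPY products (name, price) FROM STDIN WITH CSV HEADER", f)

conn.close()
```
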
95 | #### 5. Use DBeaver to Interact with PostgreSQL
96 |
97 | - Open **DBeaver** and connect to your **PostgreSQL database**.
98 | - Execute basic SQL queries such as:
99 | - Inserting data:
100 | ```sql
101 | INSERT INTO products (name, price) VALUES ('Laptop', 1000);
102 | ```
103 | - Querying the data:
104 | ```sql
105 | SELECT * FROM products;
106 | ```
107 | - You can now use DBeaver to perform other SQL operations like creating new tables, updating data, etc.
108 |
109 | #### 6. (Optional) Data Import from CSV
110 |
111 | If you’ve uploaded a CSV file to your cloud storage (S3 or Azure Blob), you can use DBeaver to import this file into your PostgreSQL database:
112 | - In DBeaver, right-click on the table where you want to import data and select **Import Data**.
113 | - Choose the CSV file from your local machine (after downloading from cloud storage).
114 | - Map the CSV columns to the corresponding table columns.
115 |
116 | ---
117 |
118 | ### Deliverables:
119 | 1. Screenshots of the cloud storage (AWS S3 or Azure Blob) with uploaded files.
120 | 2. PostgreSQL database schema (SQL script) for the `products` table.
121 | 3. A Python script for interacting with the cloud storage and PostgreSQL database (if applicable).
122 | 4. DBeaver connection details and queries performed on the database.
123 | 5. A brief report documenting the steps taken and any challenges faced.
124 |
125 | ---
126 |
127 | ### Skills Gained:
128 | - Configuring cloud storage (AWS S3 or Azure Blob Storage).
129 | - Setting up and connecting to a PostgreSQL database.
130 | - Using DBeaver as a SQL client for managing and querying data.
131 | - Integrating cloud storage with PostgreSQL (optional, but adds a real-world dimension).
132 |
133 | This is a simple and effective project that will help learners get hands-on experience with cloud services, databases, and SQL client tools while reinforcing key data engineering concepts.
134 |
--------------------------------------------------------------------------------
/Day3-WeekOneDayThreeClass.md:
--------------------------------------------------------------------------------
1 | ### Data Governance, Security, Compliance, and Access Control
2 |
3 | Data has become a critical asset in today’s world, driving decisions and fueling innovation. However, the value of data comes with the responsibility to manage it effectively, secure it from threats, ensure compliance with legal and regulatory standards, and control who can access it. Here’s an overview of these core principles:
4 |
5 | #### 1. Data Governance
6 | **Definition:**
7 | Data governance involves the management of data availability, usability, integrity, and security within an organization. It sets the framework for how data is handled and ensures it aligns with business objectives.
8 |
9 | **Key Components:**
10 | - **Data Ownership:** Clearly defining who is responsible for data.
11 | - **Data Quality:** Establishing standards to maintain accuracy and reliability.
12 | - **Policies and Procedures:** Creating rules for data usage and handling.
13 |
14 | **Benefits:**
15 | - Enhanced decision-making.
16 | - Compliance with regulations.
17 | - Improved data security.
18 |
19 | #### 2. Data Security
20 | **Definition:**
21 | Protecting data from unauthorized access, breaches, and theft.
22 |
23 | **Key Practices:**
24 | - **Encryption:** Securing data both at rest and in transit.
25 | - **Firewalls and Intrusion Detection:** Preventing unauthorized access to systems.
26 | - **Authentication and Authorization:** Ensuring only legitimate users can access sensitive data.
27 |
28 | **Emerging Threats:**
29 | - Ransomware attacks.
30 | - Phishing schemes targeting data storage systems.
31 |
32 | **Mitigation:**
33 | - Regular security audits.
34 | - Employee training.
35 | - Investment in robust security tools.
36 |
37 | #### 3. Compliance
38 | **Definition:**
39 | Ensuring data handling practices meet legal and regulatory requirements.
40 |
41 | **Major Regulations:**
42 | - **GDPR (General Data Protection Regulation):** European Union data privacy law.
43 | - **CCPA (California Consumer Privacy Act):** Data privacy law for California residents.
44 | - **HIPAA (Health Insurance Portability and Accountability Act):** U.S. law governing healthcare data.
45 |
46 | **Consequences of Non-Compliance:**
47 | - Fines.
48 | - Reputational damage.
49 | - Legal liabilities.
50 |
51 | **Steps to Achieve Compliance:**
52 | - Regular audits.
53 | - Documentation of data handling procedures.
54 | - Collaboration with legal and compliance experts.
55 |
56 | #### 4. Access Control
57 | **Definition:**
58 | Restricting access to data based on user roles and responsibilities.
59 |
60 | **Key Methods:**
61 | - **Role-Based Access Control (RBAC):** Permissions are assigned based on job functions.
62 | - **Least Privilege Principle:** Users are given the minimum level of access required to perform their tasks.
63 | - **Multi-Factor Authentication (MFA):** Adding layers of verification for secure access.
64 |
65 | **Tools:**
66 | - Identity and Access Management (IAM) solutions.
67 | - Audit trails to monitor access logs.
68 |
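In a relational database, RBAC and least privilege can be expressed directly in SQL. A minimal PostgreSQL sketch (the role and table names are illustrative):

```sql
-- A read-only role for reporting: it can query employees but not change them
CREATE ROLE reporting_user LOGIN PASSWORD 'change_me';
GRANT SELECT ON employees TO reporting_user;

-- Least privilege: explicitly keep write access away from the role
REVOKE INSERT, UPDATE, DELETE ON employees FROM reporting_user;
```
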
69 | ---
70 |
71 | ### Introduction to SQL for Data Engineering and PostgreSQL Setup
72 |
73 | SQL (Structured Query Language) is the backbone of data engineering, used to manipulate, query, and manage relational databases. PostgreSQL, a robust open-source database management system, is a popular choice for data engineering projects.
74 |
75 | #### 1. What is SQL?
76 | **Definition:**
77 | A language designed for interacting with relational databases.
78 |
79 | **Common SQL Operations:**
80 | - **SELECT:** Retrieve data from tables.
81 | - **INSERT:** Add new data.
82 | - **UPDATE:** Modify existing data.
83 | - **DELETE:** Remove data.
84 | - **JOIN:** Combine data from multiple tables.
85 |
86 | #### 2. Why SQL for Data Engineering?
87 | **Use Cases:**
88 | - **Data Transformation:** Clean, aggregate, and reshape data for analysis.
89 | - **Data Integration:** Combine data from multiple sources into a central repository.
90 | - **Data Management:** Create and maintain database schemas and indexes.
91 |
92 | **Efficiency:**
93 | SQL is optimized for high-performance queries, essential for big data workloads.
94 |
95 | #### 3. Introduction to PostgreSQL
96 | **Overview:**
97 | PostgreSQL is a powerful, feature-rich database system known for its reliability, scalability, and extensibility.
98 |
99 | **Features:**
100 | - **ACID compliance:** Reliable transactions.
101 | - **Support for JSON and array data types.**
102 | - **Advanced indexing options:** Like GiST and GIN.
103 | - **Built-in support for full-text search and stored procedures.**
104 |
105 | **Use Cases:**
106 | - Data warehouses.
107 | - Web applications.
108 | - Analytics.
109 |
110 | #### 4. Setting Up PostgreSQL
111 | **Installation:**
112 | - **On Linux:** `sudo apt install postgresql`
113 | - **On macOS:** `brew install postgresql`
114 | - **On Windows:** Use the official installer from the PostgreSQL website.
115 |
116 | **Basic Commands:**
117 | - Start the PostgreSQL server: `sudo service postgresql start`
118 | - Access the PostgreSQL shell: `psql`
119 |
120 | **Creating a Database:**
121 | ```sql
122 | CREATE DATABASE my_database;
123 | ```
124 |
125 | ### Connecting to the Database
126 |
127 | To connect to the PostgreSQL database, use the following command in your terminal:
128 |
129 | ```bash
130 | psql -d my_database
131 | ```
132 | ### Creating a Table
133 | To create a table named employees, use the following SQL command. This table includes an automatically incrementing id, the name of the employee, their role, and their salary:
134 |
135 | ```sql
136 | CREATE TABLE employees (
137 | id SERIAL PRIMARY KEY,
138 | name VARCHAR(100),
139 | role VARCHAR(50),
140 | salary NUMERIC
141 | );
142 | ```
143 |
144 | ### Inserting Data
145 | Add a record to the employees table using the following SQL command. This example inserts a new employee, "John Doe," with the role "Data Engineer" and a salary of 75,000:
146 |
147 |
148 | ```sql
149 | INSERT INTO employees (name, role, salary)
150 | VALUES ('John Doe', 'Data Engineer', 75000);
151 | ```
152 | ### Querying Data
153 | To retrieve all data from the employees table, use the SELECT command:
154 |
155 | ```sql
156 | SELECT * FROM employees;
157 | ```
158 | This command will display all rows and columns in the table.
159 |
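### Updating and Deleting Data
The UPDATE and DELETE operations listed earlier follow the same pattern; a minimal sketch on the same `employees` table:

```sql
-- Give John Doe a raise
UPDATE employees SET salary = 80000 WHERE name = 'John Doe';

-- Remove the record entirely
DELETE FROM employees WHERE name = 'John Doe';
```
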
160 | ### Conclusion
161 | Understanding data governance, security, compliance, and access control is essential for protecting organizational data and meeting regulatory standards. These principles help ensure data is used effectively, remains secure, and complies with legal requirements.
162 |
163 | At the same time, mastering SQL and PostgreSQL equips data engineers with powerful tools to build and manage data pipelines. SQL provides the foundation for querying and manipulating data, while PostgreSQL offers a robust platform for efficient storage and retrieval, enabling effective data analytics and decision-making.
164 |
165 |
--------------------------------------------------------------------------------
/Apache Kafka 101: Apache Kafka for Data Engineering Guide.md:
--------------------------------------------------------------------------------
1 | ### Kafka-cheat-sheet
2 |
3 | Apache Kafka® serves as an open-source distributed streaming platform. Similar to other distributed systems, Kafka boasts a complex architecture, which may pose a challenge for new developers. Setting up Kafka involves navigating a formidable command line interface and configuring numerous settings. In this guide, I will provide insights into architectural concepts and essential commands frequently used by developers to initiate their journey with Kafka.
4 |
5 | #### Key Concepts:
6 |
7 | - **Clusters**: Group of servers working together for three reasons: speed (low latency), durability, and scalability.
8 | - **Topic**: A named stream of records into which Kafka organizes data.
9 | - **Brokers**: Kafka server instances that store and replicate messages.
10 | - **Producers**: Applications that write data to Kafka topics.
11 | - **Consumers**: Applications that read data from Kafka topics.
12 | - **Partitions**: Divisions of a topic for scalability and parallelism.
13 | - **Connect**: Kafka Connect streams data between Kafka and external systems. The framework manages the Tasks; a Connector is only responsible for generating the set of Tasks and indicating to the framework when they need to be updated.
14 |
15 | The easiest way to run Kafka clusters is to use **Confluent Cloud**, a managed Kafka service from Confluent (the company founded by Kafka's original creators), which provides client libraries for writing producers and consumers as well as a schema registry.
16 |
17 | #### Summary
18 |
19 | 1. **Apache Kafka**: A distributed streaming platform.
20 |
21 | The Kafka CLI is a powerful tool. However, the user experience can be challenging if you don’t already know the exact command needed for your task. Below are the commonly used CLI commands to interact with Kafka:
22 |
23 | #### Start Zookeeper
24 | ```sh
25 | zookeeper-server-start config/zookeeper.properties
26 | ```
27 |
28 | #### Start Kafka Server
29 | ```sh
30 | kafka-server-start config/server.properties
31 | ```
32 |
33 | ### Kafka Topics
34 |
35 | - **List existing topics**
36 | ```sh
37 | bin/kafka-topics.sh --zookeeper localhost:2181 --list
38 | ```
39 |
40 | - **Describe a topic**
41 | ```sh
42 | bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic mytopic
43 | ```
44 |
45 | - **Purge a topic**
46 | ```sh
47 | bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --config retention.ms=1000
48 | ```
49 |
50 | or
51 |
52 | ```sh
53 | bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --delete-config retention.ms
54 | ```
55 |
56 | - **Delete a topic**
57 | ```sh
58 | bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic mytopic
59 | ```
60 |
61 | - **Get number of messages in a topic**
62 | ```sh
63 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'
64 | ```
65 |
66 | - **Get the earliest offset still in a topic**
67 | ```sh
68 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -2
69 | ```
70 |
71 | - **Get the latest offset still in a topic**
72 | ```sh
73 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic mytopic --time -1
74 | ```
75 |
76 | - **Consume messages with the console consumer**
77 | ```sh
78 | bin/kafka-console-consumer.sh --new-consumer --bootstrap-server localhost:9092 --topic mytopic --from-beginning
79 | ```
80 |
81 | - **Get the consumer offsets for a topic**
82 | ```sh
83 | bin/kafka-consumer-offset-checker.sh --zookeeper=localhost:2181 --topic=mytopic --group=my_consumer_group
84 | ```
85 |
86 | - **Read from `__consumer_offsets`**
87 | Add the following property to `config/consumer.properties`:
88 | ```sh
89 | exclude.internal.topics=false
90 | ```
91 |
92 | Then run:
93 | ```sh
94 | bin/kafka-console-consumer.sh --consumer.config config/consumer.properties --from-beginning --topic __consumer_offsets --zookeeper localhost:2181 --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter"
95 | ```
96 |
97 | ### Kafka Consumer Groups
98 |
99 | - **List the consumer groups known to Kafka**
100 | ```sh
101 | bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list # (old API)
102 | ```
103 | ```sh
104 | bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --list # (new API)
105 | ```
106 |
107 | - **View the details of a consumer group**
108 | ```sh
109 | bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group
110 | ```
111 |
112 | ### Kafkacat
113 |
114 | - **Getting the last five messages of a topic**
115 | ```sh
116 | kafkacat -C -b localhost:9092 -t mytopic -p 0 -o -5 -e
117 | ```
118 |
119 | ### Zookeeper
120 |
121 | - **Starting the Zookeeper Shell**
122 | ```sh
123 | bin/zookeeper-shell.sh localhost:2181
124 | ```
125 |
126 | ### Running Java Class
127 |
128 | - **Run `ConsumerOffsetCheck` when Kafka server is up, there is a topic + messages produced and consumed**
129 | ```sh
130 | bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --broker-info --zookeeper localhost:2181 --group test-consumer-group
131 | ```
132 |
133 | **Note:** `ConsumerOffsetChecker` has been removed in Kafka 1.0.0. Use `kafka-consumer-groups.sh` to get consumer group details:
134 | ```sh
135 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-38063
136 | ```
137 |
138 | ### Kafka Server & Topics
139 |
140 | - **Start Zookeeper**
141 | ```sh
142 | bin/zookeeper-server-start.sh config/zookeeper.properties
143 | ```
144 |
145 | - **Start Kafka brokers (Servers = cluster)**
146 | ```sh
147 | bin/kafka-server-start.sh config/server.properties
148 | ```
149 |
150 | - **Create a topic**
151 | ```sh
152 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
153 | ```
154 |
155 | - **List all topics**
156 | ```sh
157 | bin/kafka-topics.sh --list --zookeeper localhost:2181
158 | ```
159 |
160 | - **See topic details (partition, replication factor, etc.)**
161 | ```sh
162 | bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
163 | ```
164 |
165 | - **Change partition number of a topic (`--alter`)**
166 | ```sh
167 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic test --partitions 3
168 | ```
169 |
170 | ### Producer
171 |
172 | ```sh
173 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
174 | ```
175 |
176 | ### Consumer
177 |
178 | ```sh
179 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic test
180 | ```
181 |
182 | - **To consume only new messages**
183 | ```sh
184 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test
185 | ```
186 |
187 | ### Kafka Connect
188 |
189 | - **Standalone connectors (run in a single, local, dedicated process)**
190 | ```sh
191 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties
192 | ```
193 |
194 | ### Reference
195 |
196 | [Redpanda Kafka Tutorial](https://redpanda.com/guides/kafka-tutorial)
197 |
--------------------------------------------------------------------------------
/Apache Kafka 102: Apache Kafka for Data Engineering Guide.md:
--------------------------------------------------------------------------------
1 | ### **Apache Kafka Cheat Sheet**
2 |
3 | #### **Introduction**
4 | Apache Kafka® is an **open-source distributed event streaming platform** used for building **real-time data pipelines** and streaming applications. Kafka is **horizontally scalable**, **fault-tolerant**, and **highly durable**.
5 |
6 | This guide provides an in-depth look at Kafka's **architecture**, **core concepts**, and **commonly used commands** with detailed explanations and examples.
7 |
8 | ---
9 |
10 | #### **1. Key Concepts**
11 | ##### **1.1 Clusters**
12 | - A **Kafka Cluster** is a collection of **brokers (servers)** working together.
13 | - Provides **fault tolerance, scalability, and high throughput**.
14 | - Clusters handle **millions of messages per second** in distributed systems.
15 |
16 | ##### **1.2 Topics**
17 | - A **topic** is a logical channel where messages are **produced and consumed**.
18 | - Each topic is **split into partitions** for parallel processing.
19 | - Topics are **multi-subscriber**, meaning multiple consumers can read from them.
20 |
21 | **Rules for Naming Topics:**
22 | 1. Topic names should **only contain** letters (`a-z`, `A-Z`), numbers (`0-9`), dots (`.`), underscores (`_`), and hyphens (`-`).
23 | 2. Topic names should be **descriptive** and meaningful.
24 | 3. **Avoid special characters** like `@`, `#`, `!`, `*`, as Kafka does not support them.
25 |
26 | **Examples of Topic Names:**
27 | ```
28 | # Valid topic names
29 | customer_orders
30 | logs.application-errors
31 | user_activity
32 |
33 | # Invalid topic names (containing special characters)
34 | customer@orders # Invalid '@'
35 | logs#errors # Invalid '#'
36 | ```
37 |
38 | ---
39 |
40 | #### **2. Brokers**
41 | - A **broker** is a Kafka server that stores and serves messages.
42 | - Kafka brokers manage:
43 | - **Topic partitions**
44 | - **Message replication**
45 | - **Data storage & retrieval**
46 | - Brokers are part of a **Kafka cluster** and work together.
47 |
48 | ---
49 |
50 | #### **3. Producers**
51 | - Producers send (publish) messages to **Kafka topics**.
52 | - Messages are assigned to **partitions** based on:
53 | - **Round-robin (default)**
54 | - **Key-based partitioning** (Ensures messages with the same key go to the same partition)
55 |
56 | **Example: Writing messages to a Kafka topic**
57 | ```
58 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic customer_orders
59 | ```
60 | Type messages and press **Enter** to send them.
61 |
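To use key-based partitioning from the console producer, keys can be supplied with the `parse.key` and `key.separator` properties (a sketch; the `:` separator is an arbitrary choice):

```
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic customer_orders \
  --property "parse.key=true" --property "key.separator=:"
```

Each input line is then entered as `key:value`, and all messages with the same key land in the same partition.
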
62 | ---
63 |
64 | #### **4. Consumers**
65 | - Consumers **read messages** from Kafka topics.
66 | - Consumers belong to **consumer groups**, allowing **parallel processing**.
67 |
68 | **Example: Reading messages from a topic**
69 | ```
70 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --from-beginning
71 | ```
72 | This will print **all past and new messages** from the `customer_orders` topic.
73 |
74 | ---
75 |
76 | #### **5. Partitions**
77 | - Topics are split into **partitions** to enable **parallel consumption**.
78 | - Each partition is **ordered** and messages are assigned an **offset**.
79 | - Partitions allow **horizontal scaling**.
80 |
81 | **Example of a topic with 3 partitions:**
82 | ```
83 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic logs
84 | ```
85 |
86 | **Rules for Partitions:**
87 | 1. More partitions = **better parallelism**.
88 | 2. Cannot **reduce** partitions, only **increase** them.
89 | 3. Messages are assigned partitions **based on key hashing** or round-robin.
90 |
91 | ---
92 |
93 | #### **6. Kafka Connect**
94 | Kafka Connect is used to **stream data** between Kafka and **external data systems** like **databases, file systems, and cloud storage**.
95 |
96 | **Example: Running a Kafka Connect Worker**
97 | ```
98 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
99 | ```
100 |
101 | ---
102 |
103 | #### **7. Kafka CLI Commands**
104 | ##### **7.1 Starting Kafka Components**
105 | ```
106 | # Start Zookeeper
107 | zookeeper-server-start.sh config/zookeeper.properties
108 |
109 | # Start Kafka Server
110 | kafka-server-start.sh config/server.properties
111 | ```
112 |
113 | ---
114 |
115 | ##### **7.2 Managing Topics**
116 | ###### **List Topics**
117 | ```
118 | bin/kafka-topics.sh --zookeeper localhost:2181 --list
119 | ```
120 |
121 | ###### **Describe a Topic**
122 | ```
123 | bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic customer_orders
124 | ```
125 |
126 | ###### **Create a Topic (3 Examples)**
127 | ```
128 | # Create a topic with 1 partition and replication factor of 1
129 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic customer_orders
130 |
131 | # Create a topic with 3 partitions and replication factor of 2
132 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic logs
133 |
134 | # Create a topic for real-time analytics with 5 partitions
135 | bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 5 --topic analytics_stream
136 | ```
137 |
138 | ###### **Delete a Topic**
139 | ```
140 | bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic old_topic
141 | ```
142 |
143 | ###### **Increase Partitions**
144 | ```
145 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic logs --partitions 5
146 | ```
147 | **⚠️ Note:** Kafka **does not** allow **decreasing** partitions!
148 |
149 | ---
150 |
151 | ##### **7.3 Managing Messages**
152 | ###### **Find Number of Messages in a Topic**
153 | ```
154 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'
155 | ```
156 |
157 | ###### **Get Earliest and Latest Offsets**
158 | ```
159 | # Earliest offset (first message)
160 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -2
161 |
162 | # Latest offset (last message)
163 | bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic logs --time -1
164 | ```
165 |
166 | ---
167 |
168 | ##### **7.4 Consumer Groups**
169 | ###### **List Consumer Groups**
170 | ```
171 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
172 | ```
173 |
174 | ###### **Describe a Consumer Group**
175 | ```
176 | bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my_consumer_group
177 | ```
178 |
179 | ---
180 |
181 | ##### **7.5 Using Kafkacat**
182 | ###### **Read Last 5 Messages from a Topic**
183 | ```
184 | kafkacat -C -b localhost:9092 -t logs -p 0 -o -5 -e
185 | ```
186 |
187 | ---
188 |
189 | #### **8. Advanced Notes**
190 | 1. **Kafka Retention Policies**:
191 | - Kafka can retain messages **forever**, for a set **time period**, or until reaching a **size limit**.
192 | - Configure **log retention** with:
193 | ```
194 | bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic logs --config retention.ms=604800000 # 7 days
195 | ```
196 |
197 | 2. **Monitoring Kafka**:
198 | - Use Kafka's built-in tools or third-party **monitoring solutions** like **Confluent Control Center, Prometheus, or Grafana**.
199 |
200 | 3. **Kafka Streams API**:
201 | - Used for **real-time data processing** within Kafka itself.
202 | - Helps build **event-driven applications**.
203 |
204 | ---
205 |
206 | #### **9. References**
207 | For more details, check out:
208 | - [Apache Kafka Documentation](https://kafka.apache.org/documentation/)
209 | - [Redpanda Kafka Guide](https://redpanda.com/guides/kafka-tutorial)
210 |
--------------------------------------------------------------------------------
/SQL-Manual.md:
--------------------------------------------------------------------------------
1 | # Structured Query Language (SQL)
2 | **Course Manual – Version 1.2**
3 | © 2025
4 |
5 | ---
6 |
7 | ## Table of Contents
8 | 1. [Introduction](#introduction)
9 | 2. [Basic Queries](#basic-queries)
10 | 3. [Advanced Operators](#advanced-operators)
11 | 4. [Expressions](#expressions)
12 | 5. [Functions](#functions)
13 | 6. [Multi-Table Queries](#multi-table-queries)
14 | 7. [Queries Within Queries](#queries-within-queries)
15 | 8. [Maintaining Tables](#maintaining-tables)
16 | 9. [Defining Database Objects](#defining-database-objects)
17 | A. [Appendices](#appendices)
18 |
19 | ---
20 |
21 |
22 | # 1. Introduction
23 |
24 | ### Course Objectives
25 | By the end of the course you will be able to:
26 | - **Query** relational databases
27 | - **Maintain** relational databases
28 | - **Define** relational databases
29 |
30 | > “MAINTAIN – QUERY – RELATIONAL DATABASE – DEFINE”
31 |
32 | ---
33 |
34 | ### What is a Relational Database?
35 | - **Tables** (unique names, ≤ 18 chars, start with letter).
36 | - **Columns** (unique within table).
37 | - **Rows** (identified by data values, not position).
38 |
39 | ---
40 |
41 | ### What is SQL?
42 | - **Structured Query Language** – IBM (1974), ANSI (1986), ISO (1987).
43 | - **Six core statements**: `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`.
44 |
45 | | Category | Statements |
46 | |-----------------|---------------------|
47 | | **Query** | `SELECT` |
48 | | **Maintenance** | `INSERT`, `UPDATE`, `DELETE` |
49 | | **Definition** | `CREATE`, `DROP` |
50 |
51 | ---
52 |
53 |
54 | # 2. Basic Queries
55 |
56 | ### 2.1 Selecting All Columns & Rows
57 | ```sql
58 | SELECT *
59 | FROM countries;
60 | ```
61 |
62 | ### 2.2 Selecting Specific Columns
63 | ```sql
64 | SELECT title, job
65 | FROM jobs;
66 | ```
67 |
68 | ### 2.3 Selecting Specific Rows
69 | | Value Type | Format Example |
70 | |------------|----------------|
71 | | Numeric | `123`, `-45.67` |
72 | | String | `'Canada'` |
73 | | Date | `'1999-12-31'` |
74 |
75 | ```sql
76 | SELECT name, country
77 | FROM persons
78 | WHERE country = 'Canada';
79 | ```
80 |
81 | ### 2.4 Sorting Rows
82 | ```sql
83 | SELECT country, area
84 | FROM countries
85 | ORDER BY area DESC; -- largest first
86 | ```
87 |
88 | Multiple columns:
89 | ```sql
90 | ORDER BY language, pop DESC;
91 | ```
92 |
93 | ### 2.5 Eliminating Duplicate Rows
94 | ```sql
95 | SELECT DISTINCT job
96 | FROM persons;
97 | ```
98 |
99 | ---
100 |
101 |
102 | # 3. Advanced Operators
103 |
104 | | Operator | Meaning | Example |
105 | |----------|---------|---------|
106 | | `LIKE` | Pattern | `WHERE name LIKE 'Z%'` |
107 | | `AND` | Both true | `WHERE gnp < 3000 AND literacy < 40` |
108 | | `BETWEEN`| Inclusive range | `WHERE pop BETWEEN 100000 AND 200000` |
109 | | `OR` | Either true | `WHERE country = 'USA' OR country = 'Canada'` |
110 | | `IN` | List match | `WHERE language IN ('English', 'French')` |
111 | | `IS NULL`| Missing value | `WHERE gnp IS NULL` |
112 | | `NOT` | Negation | `WHERE NOT country = 'USA'` |
113 | | `( )` | Precedence | `WHERE (job='S' OR job='W') AND country='Italy'` |
114 |
115 | ---
116 |
117 |
118 | # 4. Expressions
119 |
120 | ### 4.1 Arithmetic Expressions
121 | ```sql
122 | SELECT country, pop/area AS density
123 | FROM countries;
124 | ```
125 |
126 | ### 4.2 Expressions in WHERE / ORDER BY
127 | ```sql
128 | SELECT *
129 | FROM countries
130 | WHERE pop/area > 1000
131 | ORDER BY pop/area DESC;
132 | ```
133 |
134 | ### 4.3 Column Aliases (`AS`)
135 | ```sql
136 | SELECT gnp*1000000/pop AS gpp
137 | FROM countries
139 | WHERE gnp*1000000/pop > 20000   -- most databases do not allow the alias itself in WHERE
139 | ORDER BY gpp DESC;
140 | ```
141 |
142 | ---
143 |
144 |
145 | # 5. Functions
146 |
147 | ### 5.1 Statistical Functions
148 | | Function | Purpose |
149 | |----------|---------|
150 | | `COUNT(*)` | Rows |
151 | | `COUNT(col)` | Non-NULL |
152 | | `SUM(col)` | Total |
153 | | `AVG(col)` | Average |
154 | | `MIN(col)` / `MAX(col)` | Min / Max |
155 |
156 | Grand totals:
157 | ```sql
158 | SELECT AVG(pop) AS avg_pop
159 | FROM countries
160 | WHERE language = 'English';
161 | ```
162 |
163 | ### 5.2 Grouping
164 | ```sql
165 | SELECT job, COUNT(*) AS total
166 | FROM persons
167 | GROUP BY job;
168 | ```
169 |
170 | ### 5.3 HAVING (post-group filter)
171 | ```sql
172 | SELECT language, AVG(literacy) AS avg_lit
173 | FROM countries
174 | GROUP BY language
175 | HAVING AVG(literacy) > 90;
176 | ```
177 |
178 | ---
179 |
180 |
181 | # 6. Multi-Table Queries
182 |
183 | ### 6.1 Joins
184 | ```sql
185 | SELECT p.name, j.title
186 | FROM persons p
187 | JOIN jobs j ON p.job = j.job;
188 | ```
189 |
190 | ### 6.2 Table Aliases
191 | Short-hand:
192 | ```sql
193 | SELECT c.country, a.budget
194 | FROM countries c
195 | JOIN armies a ON c.country = a.country;
196 | ```
197 |
198 | ### 6.3 Union
199 | Combine results (distinct):
200 | ```sql
201 | SELECT country FROM religions WHERE percent > 40
202 | UNION
203 | SELECT country FROM countries WHERE language = 'German';
204 | ```
205 |
206 | ---
207 |
208 |
209 | # 7. Queries Within Queries (Subqueries)
210 |
211 | | Type | Template | Example |
212 | |------|----------|---------|
213 | | **Single-valued** | `WHERE col = (SELECT …)` | `WHERE pop > (SELECT AVG(pop) FROM countries)` |
214 | | **Multi-valued** | `WHERE col IN (SELECT …)` | `WHERE country IN (SELECT country FROM religions WHERE percent > 95)` |
215 | | **Correlated** | Inner query references outer alias | `WHERE bdate = (SELECT MIN(bdate) FROM persons p2 WHERE p2.job = p1.job)` |
216 |
217 | ---
218 |
219 |
220 | # 8. Maintaining Tables
221 |
222 | | Action | Syntax | Example |
223 | |--------|--------|---------|
224 | | **Insert** | `INSERT INTO … VALUES …` | `INSERT INTO jobs(job,title) VALUES ('A','Author');` |
225 | | **Update** | `UPDATE … SET … WHERE …` | `UPDATE persons SET job='E' WHERE person=500;` |
226 | | **Delete** | `DELETE FROM … WHERE …` | `DELETE FROM persons WHERE person=500;` |
227 | | **Transaction** | `COMMIT;` or `ROLLBACK;` | `COMMIT` saves and `ROLLBACK` undoes all changes made since the last `COMMIT`. |
228 |
229 | ---
230 |
231 |
232 | # 9. Defining Database Objects
233 |
234 | ### 9.1 Tables
235 | ```sql
236 | CREATE TABLE theologians (
237 | name CHAR(20) PRIMARY KEY,
238 | bdate DATE,
239 | gender CHAR(6) NOT NULL CHECK (gender IN ('Male','Female')),
240 | country CHAR(20) REFERENCES countries(country)
241 | );
242 | ```
243 |
244 | ### 9.2 Indexes
245 | ```sql
246 | CREATE INDEX idx_gender_job ON persons(gender, job);
247 | DROP INDEX idx_gender_job;
248 | ```
249 |
250 | ### 9.3 Views
251 | ```sql
252 | CREATE VIEW iv AS
253 | SELECT * FROM persons WHERE country = 'Israel';
254 |
255 | SELECT * FROM iv;
256 | DROP VIEW iv;
257 | ```
258 |
259 | ---
260 |
261 |
262 | # A. Appendices
263 |
264 | ### Exercise Database Schema
265 | | Table | Key Columns (sample) |
266 | |---------|----------------------|
267 | | **persons** | person, name, bdate, gender, country, job |
268 | | **countries** | country, pop, area, gnp, language, literacy |
269 | | **armies** | country, budget, troops, tanks, ships, planes |
270 | | **jobs** | job, title |
271 | | **religions** | country, religion, percent |
272 |
273 | > All monetary values in millions; population in people; area in sq mi; literacy in %.
274 |
275 | ### Answers to Selected Exercises
276 | See the original PDF **pages 94-110** for the complete answer key.
277 |
278 | ---
279 |
280 | ## Syntax Summary Cheat-Sheet
281 | ```sql
282 | SELECT [DISTINCT] columns|functions
283 | FROM table [alias] [, ...]
284 | [WHERE conditions]
285 | [GROUP BY columns]
286 | [HAVING aggregate_conditions]
287 | [ORDER BY column|expr|position [ASC|DESC]];
288 |
289 | INSERT INTO table(col1,...) VALUES(val1,...);
290 | UPDATE table SET col=val [, ...] WHERE ...;
291 | DELETE FROM table WHERE ...;
292 | CREATE TABLE tbl (...);
293 | DROP TABLE tbl;
294 | COMMIT;
295 | ROLLBACK;
296 | ```
297 |
--------------------------------------------------------------------------------
/Apache Airflow 101 Guide.md:
--------------------------------------------------------------------------------
1 | ## **Apache Airflow Setup & Introduction (Multi-Component Mode)**
2 |
3 | #### **Understanding Workflow Orchestration**
4 |
5 | **Workflow orchestration** is the automated coordination and management of complex data workflows. Think of it as a conductor directing an orchestra - it ensures that different data processing tasks run in the correct order, at the right time, and handles failures gracefully.
6 |
7 | **Why Apache Airflow?**
8 | - **Dependency Management**: Automatically handles task dependencies
9 | - **Scheduling**: Run workflows on time-based or event-based triggers
10 | - **Monitoring**: Visual interface to track job progress and failures
11 | - **Scalability**: Handles complex workflows with hundreds of tasks
12 | - **Flexibility**: Python-based, extensible with custom operators
13 |
14 | ## Airflow Architecture Overview (Multi-Component Setup)
15 |
16 | ```
17 | ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
18 | │ Web Server │ │ Scheduler │ │ Executor │
19 | │ (Port │ │ (Triggers │ │(Runs Tasks) │
20 | │ 8080) │ │ DAGs) │ │ │
21 | └─────────────┘ └─────────────┘ └─────────────┘
22 | │ │ │
23 | └───────────────────┼───────────────────┘
24 | │
25 | ┌─────────────┐
26 | │ Metadata DB │
27 | │ (PostgreSQL)│
28 | └─────────────┘
29 | ```
30 |
31 | ### Step-by-Step Installation Guide (Multi-Component Mode)
32 |
33 | #### Setup with Separate Web Server and Scheduler
34 |
35 | ```bash
36 | # 1. Create a new directory for your Airflow project
37 | mkdir airflow-tutorial
38 | cd airflow-tutorial
39 |
40 | # 2. Create and activate virtual environment
41 | python -m venv airflow-env
42 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate
43 |
44 | # 3. Set Airflow home directory
45 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow
46 |
47 | # 4. Install Airflow (using constraints for compatibility)
48 | pip install apache-airflow==2.8.0 --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.9.txt
49 |
50 | # 5. Initialize the database
51 | airflow db init
52 |
53 | # 6. Create an admin user (replace with your details)
54 | airflow users create \
55 | --username admin \
56 | --firstname Admin \
57 | --lastname User \
58 | --role Admin \
59 | --email admin@example.com \
60 | --password admin123
61 | ```
62 |
63 | #### Running Web Server and Scheduler Separately
64 |
65 | You'll need **two terminal windows** for this setup:
66 |
67 | ##### Terminal 1: Start the Web Server
68 | ```bash
69 | # Activate virtual environment
70 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate
71 |
72 | # Set Airflow home
73 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow
74 |
75 | # Start the web server
76 | airflow webserver --port 8080
77 | ```
78 |
79 | ##### Terminal 2: Start the Scheduler
80 | ```bash
81 | # Open a new terminal window/tab
82 | cd airflow-tutorial
83 |
84 | # Activate virtual environment
85 | source airflow-env/bin/activate # On Windows: airflow-env\Scripts\activate
86 |
87 | # Set Airflow home
88 | export AIRFLOW_HOME=$(pwd)/airflow # On Windows: set AIRFLOW_HOME=%cd%\airflow
89 |
90 | # Start the scheduler
91 | airflow scheduler
92 | ```
93 |
94 | ### Why Use Separate Components?
95 |
96 | #### Benefits of Multi-Component Setup:
97 | - **Production-Ready**: Mirrors production deployment patterns
98 | - **Resource Management**: Each component can be scaled independently
99 | - **Monitoring**: Easier to monitor individual component performance
100 | - **Debugging**: Separate logs for web server and scheduler
101 | - **High Availability**: Can run multiple instances of each component
102 |
103 | #### Component Responsibilities:
104 |
105 | **Web Server:**
106 | - Serves the Airflow UI
107 | - Handles user authentication
108 | - Provides REST API endpoints
109 | - Displays DAG visualization and monitoring
110 |
111 | **Scheduler:**
112 | - Monitors DAG files for changes
113 | - Triggers task execution based on schedule
114 | - Manages task dependencies
115 | - Handles task retries and failures
116 |
117 | ### Accessing the Airflow UI
118 |
119 | 1. **Ensure both components are running**:
120 | - Web server should show: `Serving on http://0.0.0.0:8080`
121 | - Scheduler should show: `Starting the scheduler`
122 |
123 | 2. **Open your browser** and navigate to: `http://localhost:8080`
124 |
125 | 3. **Login credentials**:
126 | - Username: `admin`
127 | - Password: `admin123`
128 |
129 | ### Exploring the Airflow UI
130 |
131 | #### 1. DAGs View (Main Dashboard)
132 | - **What you'll see**: List of all available DAGs
133 | - **Key elements**:
134 | - DAG name and description
135 | - Recent runs (green = success, red = failed)
136 | - Schedule interval
137 | - Last run date
138 | - Toggle to pause/unpause DAGs
139 |
140 | #### 2. Tree View
141 | - **Purpose**: Shows DAG runs over time
142 | - **How to access**: Click on any DAG → Tree View tab
143 | - **What it shows**: Task instances arranged by execution date
144 |
145 | #### 3. Graph View
146 | - **Purpose**: Visual representation of DAG structure
147 | - **Shows**: Task dependencies and current status
148 | - **Color coding**:
149 | - Green: Success
150 | - Red: Failed
151 | - Yellow: Running
152 | - Light Blue: Queued
153 | - Gray: Not started
154 |
155 | #### 4. Code View
156 | - **Purpose**: Shows the Python code that defines the DAG
157 | - **Useful for**: Understanding DAG logic and debugging
158 |
159 | #### 5. Gantt Chart
160 | - **Purpose**: Shows task execution timeline
161 | - **Useful for**: Identifying bottlenecks and optimizing performance
162 |
163 | ### Key Concepts Explained Simply
164 |
165 | #### DAG (Directed Acyclic Graph)
166 | A workflow definition - like a recipe that tells Airflow what tasks to run and in what order.
167 |
168 | #### Tasks
169 | Individual units of work (like "download file", "process data", "send email").
170 |
171 | #### Operators
172 | Templates for tasks (PythonOperator, BashOperator, EmailOperator, etc.).
173 |
174 | #### Task Instance
175 | A specific execution of a task for a particular DAG run.
176 |
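To see how these pieces fit together, here is a minimal, illustrative DAG for the Airflow 2.x setup above. The file name, DAG id, schedule, and task logic are assumptions made for this example, not part of the course material.

```python
# dags/hello_dag.py -- a minimal sketch assuming Airflow 2.x as installed above.
# The DAG id, schedule, and task logic are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    # A task: one unit of work, here implemented with the PythonOperator.
    print("Hello from Airflow!")


with DAG(
    dag_id="hello_dag",              # the workflow definition (the DAG)
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # how often the scheduler triggers a run
    catchup=False,
) as dag:
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)
    print_date = BashOperator(task_id="print_date", bash_command="date")

    # Dependency: print_date only runs after say_hello succeeds.
    say_hello >> print_date
```

Drop a file like this into `$AIRFLOW_HOME/dags/`; after the scheduler's next scan it should appear in the DAGs view, and each daily run creates one task instance per task.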
177 | ### Monitoring Your Setup
178 |
179 | #### Checking Component Status:
180 |
181 | **Web Server Logs:**
182 | - Look for: `Serving on http://0.0.0.0:8080`
183 | - No error messages about port conflicts
184 |
185 | **Scheduler Logs:**
186 | - Look for: `Starting the scheduler`
187 | - Regular DAG processing messages
188 | - No database connection errors
189 |
190 | #### Health Check Commands:
191 | ```bash
192 | # Check if web server is responding
193 | curl http://localhost:8080/health
194 |
195 | # List DAGs (requires both components running)
196 | airflow dags list
197 |
198 | # Check scheduler status
199 | airflow jobs check --job-type SchedulerJob
200 | ```
201 |
202 | ### Assignment Solutions
203 |
204 | #### Part 1: Why Airflow is Useful in Data Engineering
205 |
206 | Apache Airflow is essential in data engineering because it provides automated workflow orchestration that eliminates manual intervention in complex data pipelines. Unlike traditional cron jobs or script-based scheduling, Airflow offers dependency management, failure handling, and retry mechanisms that ensure data workflows run reliably at scale. Its Python-based approach allows data engineers to define workflows as code, making them version-controlled, testable, and maintainable, while the web UI provides real-time monitoring and debugging capabilities that are crucial for managing production data pipelines. The separation of the web server and scheduler components allows for better resource allocation and mirrors production deployment patterns used in enterprise environments.
207 |
208 | #### Part 2: Screenshot Documentation
209 |
210 | **Expected Screenshot Elements:**
211 | - Airflow UI header with "Apache Airflow" logo
212 | - Navigation menu (DAGs, Browse, Admin, etc.)
213 | - DAGs list showing example DAGs
214 | - Status indicators (green/red circles)
215 | - URL showing `localhost:8080`
216 | - Both terminal windows showing web server and scheduler running
217 |
218 | **Troubleshooting Common Issues:**
219 |
220 | 1. **Port 8080 already in use**:
221 | ```bash
222 | # Kill process using port 8080
223 | sudo lsof -t -i:8080 | xargs sudo kill -9
224 | # Or use a different port
225 | airflow webserver --port 8081
226 | ```
227 |
228 | 2. **Scheduler not picking up DAGs**:
229 | - Ensure scheduler is running
230 | - Check DAG file syntax
231 | - Verify DAG is not paused
232 |
233 | 3. **Database lock errors**:
234 | - Stop all Airflow processes
235 | - Delete `airflow.db` file
236 | - Run `airflow db init` again
237 |
238 | 4. **Web server can't connect to database**:
239 |    - Ensure the database has been initialized (`airflow db init`) and that the scheduler is running
240 | - Check file permissions on `airflow.db`
241 |
242 | ### Success Checklist ✅
243 |
244 | - [ ] Airflow installed in virtual environment
245 | - [ ] Database initialized successfully
246 | - [ ] Admin user created
247 | - [ ] Web server running on port 8080
248 | - [ ] Scheduler running in separate terminal
249 | - [ ] Can access http://localhost:8080
250 | - [ ] Can login with admin credentials
251 | - [ ] Can see example DAGs in the interface
252 | - [ ] Both components showing healthy status in logs
253 | - [ ] Can navigate between different views (Tree, Graph, Code)
254 | - [ ] Screenshot saved showing successful multi-component setup
255 |
256 | ### Stopping Airflow Properly
257 |
258 | To stop Airflow cleanly:
259 | 1. **Stop the scheduler**: `Ctrl+C` in the scheduler terminal
260 | 2. **Stop the web server**: `Ctrl+C` in the web server terminal
261 | 3. **Deactivate virtual environment**: `deactivate`
262 |
263 | ### Next Steps Preview
264 | Tomorrow we'll create our first custom DAG and understand how to define tasks and dependencies using this multi-component setup!
265 |
--------------------------------------------------------------------------------
/ETL/ETL-ELT.md:
--------------------------------------------------------------------------------
1 | # Theory: Introduction to ETL/ELT Workflows
2 |
3 | ## Overview
4 |
5 | ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two fundamental paradigms for data integration workflows. These methodologies define how data moves from source systems to target systems, determining when and where data transformations occur in the pipeline.
6 |
7 | ## ETL (Extract, Transform, Load)
8 |
9 | ### Definition
10 | ETL is a traditional data integration approach where data is extracted from source systems, transformed in a separate processing layer, and then loaded into the target system.
11 |
12 | ### Workflow Process
13 |
14 | #### 1. Extract
15 | - **Purpose**: Retrieve data from various source systems
16 | - **Sources**: Databases, APIs, flat files, web services, applications
17 | - **Methods**:
18 | - Full extraction (complete dataset)
19 | - Incremental extraction (only changed data)
20 | - Delta extraction (changes since last extraction)
21 | - **Challenges**: Handling different data formats, connection issues, rate limits
22 |
23 | #### 2. Transform
24 | - **Purpose**: Convert raw data into a format suitable for analysis
25 | - **Operations**:
26 | - **Data Cleaning**: Remove duplicates, handle null values, correct errors
27 | - **Data Standardization**: Unify formats, units, naming conventions
28 | - **Data Validation**: Ensure data quality and integrity
29 | - **Data Enrichment**: Add calculated fields, lookup values
30 | - **Data Aggregation**: Summarize data at different granularities
31 | - **Data Type Conversion**: Convert between different data types
32 | - **Business Logic Application**: Apply domain-specific rules
33 |
34 | #### 3. Load
35 | - **Purpose**: Insert transformed data into target system
36 | - **Loading Strategies**:
37 | - **Full Load**: Replace entire dataset
38 | - **Incremental Load**: Add only new or changed records
39 | - **Upsert**: Insert new records, update existing ones
40 | - **Target Systems**: Data warehouses, databases, data marts
41 |
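To make the three steps concrete, here is a deliberately small sketch of an ETL flow in Python using pandas and SQLite. The file, column, and table names are assumptions for illustration; a production pipeline would add logging, error handling, and incremental logic.

```python
# Minimal ETL sketch (illustrative only): extract a CSV, transform with pandas,
# load the cleaned result into SQLite. All names are assumed for the example.
import sqlite3

import pandas as pd

# Extract: pull raw data from a source system (here, a flat file).
raw = pd.read_csv("sales_raw.csv")

# Transform: clean, convert types, and enrich *before* anything reaches the target.
clean = (
    raw.drop_duplicates()
       .dropna(subset=["order_id"])                                # data cleaning
       .assign(
           order_date=lambda d: pd.to_datetime(d["order_date"]),   # type conversion
           revenue=lambda d: d["quantity"] * d["unit_price"],      # enrichment
       )
)

# Load: only transformed, validated data is written to the target system.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="append", index=False)
```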
42 | ### ETL Characteristics
43 |
44 | **Advantages:**
45 | - **Data Quality**: Ensures clean, validated data enters target system
46 | - **Performance**: Target system optimized for queries, not transformations
47 | - **Security**: Sensitive data can be masked/encrypted during transformation
48 | - **Compliance**: Easier to implement data governance rules
49 | - **Predictable Structure**: Target schema is well-defined and stable
50 |
51 | **Disadvantages:**
52 | - **Processing Time**: Sequential process can be time-consuming
53 | - **Resource Intensive**: Requires dedicated transformation infrastructure
54 | - **Less Flexibility**: Schema changes require pipeline modifications
55 | - **Data Freshness**: Batch processing introduces latency
56 |
57 | ### When to Use ETL
58 | - **Complex Transformations**: Heavy business logic or data cleansing required
59 | - **Limited Target Resources**: Target system has limited processing power
60 | - **Strict Data Quality**: High data quality standards must be enforced
61 | - **Regulatory Compliance**: Data governance and audit trails essential
62 | - **Predictable Workloads**: Batch processing acceptable for use case
63 |
64 | ## ELT (Extract, Load, Transform)
65 |
66 | ### Definition
67 | ELT is a modern approach where raw data is loaded directly into the target system, and transformations are performed within that system using its computational resources.
68 |
69 | ### Workflow Process
70 |
71 | #### 1. Extract
72 | - **Same as ETL**: Retrieve data from source systems
73 | - **Minimal Processing**: Little to no transformation during extraction
74 | - **Faster Extraction**: Reduced complexity in extraction phase
75 |
76 | #### 2. Load
77 | - **Raw Data Loading**: Data loaded in its original format
78 | - **Target Systems**: Usually data lakes or cloud data warehouses
79 | - **Schema-on-Write vs Schema-on-Read**: Often uses schema-on-read approach
80 | - **Staging Areas**: May use intermediate staging for organization
81 |
82 | #### 3. Transform
83 | - **In-Target Processing**: Transformations occur within target system
84 | - **On-Demand**: Transformations can be applied as needed
85 | - **Multiple Views**: Same raw data can be transformed differently for various use cases
86 | - **Leverages Target Power**: Uses computational resources of target system
87 |
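For contrast with the ETL sketch above, here is the same toy example in ELT style: the raw file is loaded untouched, and the transformation runs as SQL inside the target. SQLite stands in for a cloud warehouse, and all names are assumed.

```python
# Minimal ELT sketch (illustrative only): load raw data as-is, then transform
# inside the target with SQL. SQLite stands in for a cloud warehouse here.
import sqlite3

import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Extract + Load: raw data lands in the target without transformation.
    pd.read_csv("sales_raw.csv").to_sql(
        "raw_sales", conn, if_exists="replace", index=False
    )

    # Transform: runs inside the target, on demand; the raw table stays untouched.
    conn.executescript("""
        DROP TABLE IF EXISTS sales_clean;
        CREATE TABLE sales_clean AS
        SELECT DISTINCT
               order_id,
               DATE(order_date)      AS order_date,
               quantity * unit_price AS revenue
        FROM raw_sales
        WHERE order_id IS NOT NULL;
    """)
```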
88 | ### ELT Characteristics
89 |
90 | **Advantages:**
91 | - **Faster Loading**: Raw data loads quickly without transformation overhead
92 | - **Scalability**: Leverages powerful cloud computing resources
93 | - **Flexibility**: Multiple transformation views from same raw data
94 | - **Data Preservation**: Original data remains unchanged
95 | - **Real-time Potential**: Enables near real-time data availability
96 | - **Cost-Effective**: Cloud warehouses optimized for large-scale processing
97 |
98 | **Disadvantages:**
99 | - **Target Resource Usage**: Transformation workload on target system
100 | - **Data Quality Risks**: Raw data may contain errors or inconsistencies
101 | - **Security Concerns**: Sensitive data stored in raw format
102 | - **Complex Queries**: End users may need advanced SQL skills
103 | - **Storage Costs**: Raw data requires more storage space
104 |
105 | ### When to Use ELT
106 | - **Big Data Scenarios**: Large volumes of diverse data types
107 | - **Cloud-Native Architecture**: Using cloud data warehouses (Snowflake, BigQuery, Redshift)
108 | - **Agile Analytics**: Rapid development and changing requirements
109 | - **Real-time Insights**: Near real-time data processing needed
110 | - **Data Lake Architecture**: Storing raw data for future unknown use cases
111 | - **Sufficient Target Resources**: Powerful target systems available
112 |
113 | ## Comparison Matrix
114 |
115 | | Aspect | ETL | ELT |
116 | |--------|-----|-----|
117 | | **Processing Location** | External transformation engine | Within target system |
118 | | **Data Quality** | High (pre-loading validation) | Variable (post-loading validation) |
119 | | **Time to Insight** | Higher latency | Lower latency |
120 | | **Flexibility** | Lower (predefined transformations) | Higher (on-demand transformations) |
121 | | **Resource Requirements** | Dedicated transformation infrastructure | Powerful target system |
122 | | **Data Storage** | Only transformed data stored | Raw + transformed data stored |
123 | | **Complexity** | Higher upfront complexity | Lower initial complexity |
124 | | **Maintenance** | More complex schema change management | Easier to adapt to changes |
125 | | **Cost Model** | Infrastructure + processing costs | Storage + compute costs |
126 |
127 | ## Modern Hybrid Approaches
128 |
129 | ### ELT with Pre-processing
130 | - Light transformations during extraction (data type conversion, basic cleaning)
131 | - Bulk of transformation occurs in target system
132 | - Balances benefits of both approaches
133 |
134 | ### Lambda Architecture
135 | - Combines batch (ETL) and stream (ELT) processing
136 | - Handles both historical and real-time data
137 | - Provides comprehensive data processing solution
138 |
139 | ### Medallion Architecture
140 | - **Bronze Layer**: Raw data (ELT approach)
141 | - **Silver Layer**: Cleaned and conformed data (ETL transformations)
142 | - **Gold Layer**: Business-ready data (ETL aggregations)
143 |
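As a rough sketch of how the three layers can be wired together in PySpark (paths, formats, and column names are assumptions; a real lakehouse would typically use Delta tables rather than plain Parquet):

```python
# Illustrative bronze -> silver -> gold flow. Paths and columns are assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw data landed as-is (ELT-style ingestion).
bronze = spark.read.json("lake/bronze/orders/")

# Silver: cleaned and conformed data.
silver = (
    bronze.dropDuplicates(["order_id"])
          .filter(F.col("order_id").isNotNull())
          .withColumn("order_date", F.to_date("order_date"))
)
silver.write.mode("overwrite").parquet("lake/silver/orders/")

# Gold: business-ready aggregates for reporting.
gold = silver.groupBy("order_date").agg(
    F.sum(F.col("quantity") * F.col("unit_price")).alias("daily_revenue")
)
gold.write.mode("overwrite").parquet("lake/gold/daily_revenue/")
```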
144 | ## Technology Considerations
145 |
146 | ### ETL-Optimized Tools
147 | - **Traditional ETL**: Informatica, IBM DataStage, Microsoft SSIS
148 | - **Modern ETL**: Apache Airflow, Talend, Apache NiFi
149 | - **Cloud ETL**: AWS Glue, Azure Data Factory, Google Dataflow
150 |
151 | ### ELT-Optimized Platforms
152 | - **Cloud Warehouses**: Snowflake, Google BigQuery, Amazon Redshift
153 | - **Data Lakes**: Apache Spark, Amazon S3 with Athena
154 | - **Stream Processing**: Apache Kafka, Amazon Kinesis
155 |
156 | ### Programming Approaches
157 | - **ETL**: Python with Pandas, Java with Apache Beam
158 | - **ELT**: SQL-based transformations, dbt (data build tool)
159 |
160 | ## Best Practices
161 |
162 | ### For ETL
163 | 1. **Design for Reusability**: Create modular transformation components
164 | 2. **Implement Error Handling**: Robust exception management and logging
165 | 3. **Optimize Performance**: Parallel processing and efficient algorithms
166 | 4. **Document Transformations**: Clear documentation of business rules
167 | 5. **Test Thoroughly**: Unit tests for transformation logic
168 |
169 | ### For ELT
170 | 1. **Data Governance**: Implement data lineage and quality monitoring
171 | 2. **Storage Optimization**: Use appropriate file formats (Parquet, Delta Lake)
172 | 3. **Query Optimization**: Leverage target system's optimization features
173 | 4. **Security Implementation**: Apply row-level security and column masking
174 | 5. **Cost Management**: Monitor and optimize compute and storage costs
175 |
176 | ## Future Trends
177 |
178 | ### Real-time Processing
179 | - Stream processing becoming standard
180 | - Event-driven architectures gaining popularity
181 | - CDC (Change Data Capture) integration
182 |
183 | ### DataOps Integration
184 | - CI/CD for data pipelines
185 | - Automated testing and deployment
186 | - Infrastructure as Code
187 |
188 | ### AI-Enhanced Processing
189 | - Automated data profiling and mapping
190 | - Intelligent error detection and correction
191 | - ML-powered transformation suggestions
192 |
193 | ## Conclusion
194 |
195 | The choice between ETL and ELT depends on specific use cases, infrastructure, data volumes, and business requirements. Modern data architectures often employ hybrid approaches, leveraging the strengths of both paradigms. Understanding these workflows is fundamental to designing effective data integration solutions that meet organizational needs while maintaining data quality, performance, and scalability.
196 |
197 | Key decision factors include:
198 | - Data volume and velocity requirements
199 | - Available infrastructure and resources
200 | - Data quality and governance needs
201 | - Time-to-insight requirements
202 | - Budget and cost considerations
203 | - Team skills and capabilities
204 |
205 | Both ETL and ELT remain relevant in today's data landscape, and the most successful data teams understand when and how to apply each approach effectively.
206 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## **LuxDevHQ Data Engineering Course Outline**
2 |
3 | This comprehensive course spans **4 months** (16 weeks) and equips learners with expertise in Python, SQL, Azure, AWS, Apache Airflow, Kafka, Spark, and more.
4 | - **Learning Days**: Monday to Thursday (theory and practice).
5 | - **Friday**: Job shadowing or peer projects.
6 | - **Saturday**: Hands-on lab sessions and project-based learning.
7 |
8 | ---
9 | ## Table of Contents
10 |
11 | 1. [Week 1](#week-1-onboarding-and-environment-setup)
12 | 2. [Week 2](#week-2-sql-essentials-for-data-engineering)
13 | 3. [Week 3](#week-3-introduction-to-data-pipelines)
14 | 4. [Week 4](#week-4-introduction-to-apache-airflow)
15 | 5. [Week 5](#week-5-data-warehousing-and-data-lakes)
16 | 6. [Week 6](#week-6-data-governance-and-security)
17 | 7. [Week 7](#week-7-real-time-data-processing-with-kafka)
18 | 8. [Week 8](#week-8-batch-vs-stream-processing)
19 | 9. [Week 9](#week-9-machine-learning-integration-in-data-pipelines)
20 | 10. [Week 10](#week-10-spark-and-pyspark-for-big-data)
21 | 11. [Week 11](#week-11-advanced-apache-airflow-techniques)
22 | 12. [Week 12](#week-12-data-lakes-and-delta-lake)
23 | 13. [Week 13](#week-13-batch-data-pipeline-development)
24 | 14. [Week 14](#week-14-real-time-data-pipeline-development)
25 | 15. [Week 15](#week-15-final-project-integration)
26 | 16. [Week 16](#week-16-capstone-project-presentation)
27 |
28 |
29 |
30 | ---
31 |
32 | ### **Month 1: Foundations of Data Engineering**
33 |
34 | #### Week 1: Onboarding and Environment Setup
35 | - **Monday**:
36 | - Onboarding, course overview, career pathways, tools introduction.
37 | - **Tuesday**:
38 | - Introduction to cloud computing (Azure and AWS).
39 | - **Wednesday**:
40 | - Data governance, security, compliance, and access control.
41 | - **Thursday**:
42 | - Introduction to SQL for data engineering and PostgreSQL setup.
43 | - **Friday**:
44 | - **Peer Project**: Environment setup challenges.
45 | - **Saturday (Lab)**:
46 | - **Mini Project**: Build a basic pipeline with PostgreSQL and Azure Blob Storage.
47 |
48 | ---
49 |
50 | #### Week 2: SQL Essentials for Data Engineering
51 | - **Monday**:
52 | - Core SQL concepts (`SELECT`, `WHERE`, `JOIN`, `GROUP BY`).
53 | - **Tuesday**:
54 |   - Advanced SQL techniques: recursive queries, window functions, views, stored procedures, subqueries, and CTEs.
55 | - **Wednesday**:
56 | - Query optimization and execution plans.
57 | - **Thursday**:
58 | - Data modeling: normalization, denormalization, and star schemas.
59 | - **Friday**:
60 | - **Job Shadowing**: Observe senior engineers writing and optimizing SQL queries.
61 | - **Saturday (Lab)**:
62 | - **Mini Project**: Create a star schema and analyze data using SQL.
63 |
64 | ---
65 |
66 | #### Week 3: Introduction to Data Pipelines
67 | - **Monday**:
68 | - Theory: Introduction to ETL/ELT workflows.
69 | - **Tuesday**:
70 | - Lab: Create a simple Python-based ETL pipeline for CSV data.
71 | - **Wednesday**:
72 | - Theory: Extract, transform, load (ETL) concepts and best practices.
73 | - **Thursday**:
74 | - Lab: Build a Python ETL pipeline for batch data processing.
75 | - **Friday**:
76 | - **Peer Project**: Collaborate to design a basic ETL workflow.
77 | - **Saturday (Lab)**:
78 | - **Mini Project**: Develop a simple ETL pipeline to process sales data.
79 |
80 | ---
81 |
82 | #### Week 4: Introduction to Apache Airflow
83 | - **Monday**:
84 | - Theory: Introduction to Apache Airflow, DAGs, and scheduling.
85 | - **Tuesday**:
86 | - Lab: Set up Apache Airflow and create a basic DAG.
87 | - **Wednesday**:
88 | - Theory: DAG best practices and scheduling in Airflow.
89 | - **Thursday**:
90 | - Lab: Integrate Airflow with PostgreSQL and Azure Blob Storage.
91 | - **Friday**:
92 | - **Job Shadowing**: Observe real-world Airflow pipelines.
93 | - **Saturday (Lab)**:
94 | - **Mini Project**: Automate an ETL pipeline with Airflow for batch data processing.
95 |
96 | ---
97 |
98 | ### **Month 2: Intermediate Tools and Concepts**
99 |
100 | #### Week 5: Data Warehousing and Data Lakes
101 | - **Monday**:
102 | - Theory: Introduction to data warehousing (OLAP vs. OLTP, partitioning, clustering).
103 | - **Tuesday**:
104 | - Lab: Work with Amazon Redshift and Snowflake for data warehousing.
105 | - **Wednesday**:
106 | - Theory: Data lakes and Lakehouse architecture.
107 | - **Thursday**:
108 | - Lab: Set up Delta Lake for raw and curated data.
109 | - **Friday**:
110 | - **Peer Project**: Implement a data warehouse model and data lake for sales data.
111 | - **Saturday (Lab)**:
112 | - **Mini Project**: Design and implement a basic Lakehouse architecture.
113 |
114 | ---
115 |
116 | #### Week 6: Data Governance and Security
117 | - **Monday**:
118 | - Theory: Data governance frameworks and data security principles.
119 | - **Tuesday**:
120 | - Lab: Use AWS Lake Formation for access control and security enforcement.
121 | - **Wednesday**:
122 | - Theory: Managing sensitive data and compliance (GDPR, HIPAA).
123 | - **Thursday**:
124 | - Lab: Implement security policies in S3 and Azure Blob Storage.
125 | - **Friday**:
126 | - **Job Shadowing**: Observe senior engineers applying governance policies.
127 | - **Saturday (Lab)**:
128 | - **Mini Project**: Secure data in the cloud using AWS and Azure.
129 |
130 | ---
131 |
132 | #### Week 7: Real-Time Data Processing with Kafka
133 | - **Monday**:
134 |   - Theory: [Introduction to Apache Kafka for real-time data streaming](/introduction-to-Kafka.md)
135 | - **Tuesday**:
136 | - Lab: [Set up a Kafka producer and consumer.](/Tuesday-Kafka-Lab.md)
137 | - **Wednesday**:
138 | - Theory: Kafka topics, partitions, and message brokers.
139 | - **Thursday**:
140 | - Lab: Integrate Kafka with PostgreSQL for real-time updates.
141 | - **Friday**:
142 | - **Peer Project**: Build a real-time Kafka pipeline for transactional data.
143 | - **Saturday (Lab)**:
144 | - **Mini Project**: Create a pipeline to stream e-commerce data with Kafka.
145 |
146 | [Apache Kafka 101](./Apache%20Kafka%20101%3A%20Apache%20Kafka%20for%20Data%20Engineering%20Guide.md)
147 |
148 | [Apache Kafka 102](/Apache%20Kafka%20102%3A%20Apache%20Kafka%20for%20Data%20Engineering%20Guide.md)
149 |
150 |
151 |
152 | ---
153 |
154 | #### Week 8: Batch vs. Stream Processing
155 | - **Monday**:
156 | - Theory: Introduction to batch vs. stream processing.
157 | - **Tuesday**:
158 | - Lab: Batch processing with PySpark.
159 | - **Wednesday**:
160 | - Theory: Combining batch and stream processing workflows.
161 | - **Thursday**:
162 | - Lab: Real-time processing with Apache Flink and Spark Streaming.
163 | - **Friday**:
164 | - **Job Shadowing**: Observe a real-time processing pipeline.
165 | - **Saturday (Lab)**:
166 | - **Mini Project**: Build a hybrid pipeline combining batch and real-time processing.
167 |
168 | ---
169 |
170 | ### **Month 3: Advanced Data Engineering**
171 |
172 | #### Week 9: Machine Learning Integration in Data Pipelines
173 | - **Monday**:
174 | - Theory: Overview of ML workflows in data engineering.
175 | - **Tuesday**:
176 | - Lab: Preprocess data for machine learning using Pandas and PySpark.
177 | - **Wednesday**:
178 | - Theory: Feature engineering and automated feature extraction.
179 | - **Thursday**:
180 | - Lab: Automate feature extraction using Apache Airflow.
181 | - **Friday**:
182 | - **Peer Project**: Build a simple pipeline that integrates ML models.
183 | - **Saturday (Lab)**:
184 | - **Mini Project**: Build an ML-powered recommendation system in a pipeline.
185 |
186 | ---
187 |
188 | #### Week 10: Spark and PySpark for Big Data
189 | - **Monday**:
190 | - Theory: Introduction to Apache Spark for big data processing.
191 | - **Tuesday**:
192 | - Lab: Set up Spark and PySpark for data analysis.
193 | - **Wednesday**:
194 | - Theory: Spark RDDs, DataFrames, Performance Optimization and SQL.
195 | - **Thursday**:
196 | - Lab: Analyze large datasets using Spark SQL.
197 | - **Friday**:
198 | - **Peer Project**: Build a PySpark pipeline for large-scale data processing.
199 | - **Saturday (Lab)**:
200 | - **Mini Project**: Analyze big data sets with Spark and PySpark.
201 |
202 | ---
203 |
204 | #### Week 11: Advanced Apache Airflow Techniques
205 | - **Monday**:
206 | - Theory: Advanced Airflow features (XCom, task dependencies).
207 | - **Tuesday**:
208 | - Lab: Implement dynamic DAGs and task dependencies in Airflow.
209 | - **Wednesday**:
210 | - Theory: Airflow scheduling, monitoring, and error handling.
211 | - **Thursday**:
212 | - Lab: Create complex DAGs for multi-step ETL pipelines.
213 | - **Friday**:
214 | - **Job Shadowing**: Observe advanced Airflow pipeline implementations.
215 | - **Saturday (Lab)**:
216 | - **Mini Project**: Design an advanced Airflow DAG for complex data workflows.
217 |
218 | ---
219 |
220 | #### Week 12: Data Lakes and Delta Lake
221 | - **Monday**:
222 | - Theory: Data lakes, Lakehouses, and Delta Lake architecture.
223 | - **Tuesday**:
224 | - Lab: Set up Delta Lake on AWS for data storage and management.
225 | - **Wednesday**:
226 | - Theory: Managing schema evolution in Delta Lake.
227 | - **Thursday**:
228 | - Lab: Implement batch and real-time data loading to Delta Lake.
229 | - **Friday**:
230 | - **Peer Project**: Design a Lakehouse architecture for an e-commerce platform.
231 | - **Saturday (Lab)**:
232 | - **Mini Project**: Implement a scalable Delta Lake architecture.
233 |
234 | ---
235 |
236 | ### **Month 4: Capstone Projects**
237 |
238 | #### Week 13: Batch Data Pipeline Development
239 | - **Monday to Thursday**:
240 | - **Design and Implementation**:
241 | - Build an end-to-end batch data pipeline for e-commerce sales analytics.
242 | - **Tools**: PySpark, SQL, PostgreSQL, Airflow, S3.
243 | - **Friday**:
244 | - **Peer Review**: Present progress and receive feedback.
245 | - **Saturday (Lab)**:
246 | - **Project Milestone**: Finalize and present batch pipeline results.
247 |
248 | ---
249 |
250 | #### Week 14: Real-Time Data Pipeline Development
251 | - **Monday to Thursday**:
252 | - **Design and Implementation**:
253 | - Build an end-to-end real-time data pipeline for IoT sensor monitoring.
254 | - **Tools**: Kafka, Spark Streaming, Flink, S3.
255 | - **Friday**:
256 | - **Peer Review**: Present progress and receive feedback.
257 | - **Saturday (Lab)**:
258 | - **Project Milestone**: Finalize and present real-time pipeline results.
259 |
260 | ---
261 |
262 | #### Week 15: Final Project Integration
263 | - **Monday to Thursday**:
264 | - **Design and Implementation**:
265 | - Integrate both batch and real-time pipelines for a comprehensive end-to-end solution.
266 | - **Tools**: Kafka, PySpark, Airflow, Delta Lake, PostgreSQL, and S3.
267 | - **Friday**:
268 | - **Job Shadowing**: Observe senior engineers integrating complex pipelines.
269 | - **Saturday (Lab)**:
270 | - **Project Milestone**: Showcase integrated solution for review.
271 |
272 | ---
273 |
274 | #### Week 16: Capstone Project Presentation
275 | - **Monday to Thursday**:
276 | - Final Presentation Preparation:
277 | - Polish, test, and document the final project.
278 | - **Friday**:
279 | - **Peer Review**: Present final projects to peers and receive feedback.
280 | - **Saturday (Lab)**:
281 | - **Capstone Presentation**: Showcase completed capstone projects to industry professionals and instructors.
282 |
--------------------------------------------------------------------------------
/Apache Spark.md:
--------------------------------------------------------------------------------
1 | # Apache Spark & PySpark: Learning Guide
2 |
3 | ## Understanding Apache Spark
4 |
5 | Apache Spark is an open-source distributed computing framework designed for processing large datasets across clusters of computers. Spark maintains data in memory between operations, which significantly reduces the time spent reading from and writing to disk storage. This in-memory processing capability makes Spark substantially faster than traditional disk-based processing systems.
6 |
7 | The framework provides a unified computing engine that handles multiple types of data processing workloads including batch processing (handling large chunks of data at once), stream processing (handling data as it arrives continuously), machine learning, and graph computation. Spark can scale from running on your laptop to running across thousands of computers in a data center.
8 |
9 | ## Core Spark Architecture
10 |
11 | Understanding Spark's architecture helps you comprehend how your data processing actually happens behind the scenes. The Spark runtime consists of several key components that work together to execute distributed computations.
12 |
13 | **Driver Program**: The main application process that runs your Spark code. The driver creates the SparkContext, converts your program into tasks, and coordinates the execution across the cluster.
14 |
15 | **Cluster Manager**: The external service responsible for acquiring resources and allocating them to Spark applications. Common cluster managers include YARN, Mesos, and Kubernetes.
16 |
17 | **Executors**: Worker processes that run on cluster nodes. Executors execute tasks assigned by the driver and store data for caching operations.
18 |
19 | **Tasks**: Individual units of work that executors perform on data partitions.
20 |
21 | **Partitions**: Logical divisions of your data that enable parallel processing across multiple executor cores.
22 |
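A quick way to see partitions in action is to check and change the partition count of a DataFrame. The sketch below is purely illustrative and assumes a local SparkSession:

```python
# Illustrative look at partitions on a local SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partitions-demo").getOrCreate()

df = spark.range(0, 1_000_000)        # a simple distributed dataset
print(df.rdd.getNumPartitions())      # how many partitions Spark chose

wider = df.repartition(8)             # more partitions -> more tasks can run in parallel
print(wider.rdd.getNumPartitions())

narrower = wider.coalesce(2)          # fewer partitions, without a full shuffle
print(narrower.rdd.getNumPartitions())
```

Each partition is processed by one task, so the partition count directly controls how much parallelism the executors can actually use.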
23 | ## Essential Spark Concepts
24 |
25 | **Transformations** are operations that create new datasets from existing ones. The key insight here is that transformations follow lazy evaluation, meaning Spark doesn't actually do the work immediately. Instead, it builds up a plan of what you want to do.
26 |
27 | Common transformations include filtering data (keeping only rows that meet certain criteria), mapping data (applying a function to transform each row), grouping data by certain columns, and joining different datasets together.
28 |
29 | **Actions** are operations that actually trigger Spark to execute all those planned transformations and give you results. Actions either return results to your driver program or save data to external storage.
30 |
31 | **RDDs (Resilient Distributed Datasets)** represent Spark's most fundamental way of thinking about data. RDDs are immutable (they never change once created), distributed across your cluster, and can be processed in parallel. They maintain lineage information, which means Spark remembers how each RDD was created so it can recreate it if something goes wrong.
32 |
33 | **DataFrames** are a higher-level way to work with data that's more familiar if you've used databases or tools like Excel. DataFrames organize data into named columns and provide powerful optimization through Spark's Catalyst optimizer, which automatically makes your queries run faster.
34 |
35 | **Spark SQL** lets you write familiar SQL queries against your DataFrames. This means you can use SELECT, WHERE, GROUP BY, and other SQL statements you might already know, while still getting all the benefits of distributed processing.
36 |
37 |
38 |
39 | ## Practical Example: Understanding the Basics
40 |
41 | Let's walk through a comprehensive example that demonstrates how these concepts work together. I'll explain each part as we go, so you can see both what the code does and why we're doing it that way.
42 |
43 | ```python
44 | from pyspark.sql import SparkSession
45 | from pyspark.sql.functions import col, avg, count, max, min, sum as spark_sum
46 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
47 |
48 | # Initialize Spark session - this creates your SparkContext
49 | # Think of this as starting up your distributed computing engine
50 | spark = SparkSession.builder \
51 | .appName("SparkFundamentals") \
52 | .config("spark.sql.adaptive.enabled", "true") \
53 | .getOrCreate()
54 | ```
55 |
56 | This first section creates our Spark session, which is our entry point to all Spark functionality. The appName helps you identify your application when multiple Spark jobs are running. The config setting enables adaptive query execution, which helps Spark automatically optimize your queries as they run.
57 |
58 | ```python
59 | # Create sample data representing employee records
60 | # In real scenarios, this data would come from files, databases, or APIs
61 | employee_data = [
62 | ("E001", "Alice Johnson", "Engineering", "Senior", 85000, 5, 92.5, "2019-03-15"),
63 | ("E002", "Bob Smith", "Marketing", "Manager", 72000, 3, 88.2, "2021-07-20"),
64 | ("E003", "Carol Davis", "Engineering", "Junior", 65000, 2, 85.7, "2022-01-10"),
65 | ("E004", "David Wilson", "Sales", "Senior", 78000, 4, 90.1, "2020-05-18"),
66 | ("E005", "Eva Brown", "Engineering", "Manager", 95000, 6, 94.3, "2018-11-22"),
67 | ("E006", "Frank Miller", "Marketing", "Junior", 58000, 1, 82.4, "2023-02-14"),
68 | ("E007", "Grace Lee", "Sales", "Senior", 82000, 4, 91.8, "2020-08-30"),
69 | ("E008", "Henry Taylor", "Engineering", "Senior", 88000, 5, 93.1, "2019-06-12"),
70 | ("E009", "Iris Chen", "Marketing", "Manager", 75000, 3, 89.6, "2021-10-05"),
71 | ("E010", "Jack Anderson", "Sales", "Junior", 62000, 2, 84.9, "2022-09-28")
72 | ]
73 |
74 | # Define schema explicitly for data quality and performance
75 | # This tells Spark exactly what type of data to expect in each column
76 | employee_schema = StructType([
77 | StructField("employee_id", StringType(), False), # False means this field cannot be null
78 | StructField("name", StringType(), False),
79 | StructField("department", StringType(), False),
80 | StructField("level", StringType(), False),
81 | StructField("salary", IntegerType(), False),
82 | StructField("years_experience", IntegerType(), False),
83 | StructField("performance_score", FloatType(), False),
84 | StructField("hire_date", StringType(), False)
85 | ])
86 | ```
87 |
88 | Here we're creating sample data and defining its structure. In real applications, your data would typically come from files, databases, or streaming sources. The schema definition is important because it tells Spark exactly what to expect, which enables better performance and catches data quality issues early.
89 |
90 | ```python
91 | # Create DataFrame - this represents your distributed dataset
92 | # Even though our data is small, Spark treats it as if it could be distributed across many machines
93 | df = spark.createDataFrame(employee_data, employee_schema)
94 |
95 | # Basic transformations and actions
96 | print("Dataset Overview:")
97 | df.show() # Action: displays data - this actually executes and shows results
98 | print(f"Total employees: {df.count()}") # Action: counts rows - another execution
99 | ```
100 |
101 | This is where we create our DataFrame, which is Spark's way of representing structured data. Even though our example uses small data that fits in memory, Spark handles it the same way it would handle terabytes of data spread across hundreds of machines.
102 |
103 | The show() and count() operations are actions, which means they trigger Spark to actually process the data and return results. Up until this point, Spark was just planning what to do.
104 |
105 | ```python
106 | # Transformation: filter high performers
107 | # This creates a new DataFrame but doesn't execute yet (lazy evaluation)
108 | high_performers = df.filter(col("performance_score") > 90.0)
109 | print(f"High performers (>90 score): {high_performers.count()}") # Now it executes
110 | ```
111 |
112 | This demonstrates the difference between transformations and actions. The filter operation creates a new DataFrame that represents "employees with performance scores above 90," but Spark doesn't actually do the filtering until we call count(), which is an action.
113 |
114 | ```python
115 | # Transformation and action: departmental analysis
116 | # This shows how to perform aggregations - very common in data processing
117 | dept_analysis = df.groupBy("department").agg(
118 | avg("salary").alias("avg_salary"), # Calculate average salary per department
119 | count("*").alias("employee_count"), # Count employees per department
120 | avg("performance_score").alias("avg_performance"), # Average performance per department
121 | max("years_experience").alias("max_experience") # Maximum experience per department
122 | )
123 |
124 | print("\nDepartmental Analysis:")
125 | dept_analysis.show() # Action: triggers execution of the entire aggregation
126 | ```
127 |
128 | This section shows aggregation, which is one of the most common patterns in data processing. We're grouping employees by department and calculating various statistics for each group. The alias() method gives friendly names to our calculated columns.
129 |
130 | ```python
131 | # Using Spark SQL - same functionality, different syntax
132 | # Some people prefer SQL syntax for complex queries
133 | df.createOrReplaceTempView("employees") # Creates a temporary SQL table
134 |
135 | print("\nSenior Employee Analysis (SQL):")
136 | spark.sql("""
137 | SELECT department,
138 | COUNT(*) as senior_count,
139 | AVG(salary) as avg_senior_salary,
140 | AVG(performance_score) as avg_senior_performance
141 | FROM employees
142 | WHERE level = 'Senior'
143 | GROUP BY department
144 | ORDER BY avg_senior_salary DESC
145 | """).show()
146 | ```
147 |
148 | This demonstrates that you can use SQL syntax to accomplish the same data processing tasks. Some people find SQL more intuitive for complex queries, especially when joining multiple tables or doing complex filtering and aggregation.
149 |
150 | ```python
151 | # Demonstrate caching for performance
152 | # This tells Spark to keep this DataFrame in memory for faster access
153 | df.cache() # Keeps data in memory for faster subsequent operations
154 | ```
155 |
156 | Caching is a performance optimization technique. When you cache a DataFrame, Spark stores it in memory across your cluster, so subsequent operations on that DataFrame don't need to recompute it from the original data source.
157 |
158 | ## Understanding the Concepts in Action
159 |
160 | Let's trace through what happens when you run this code to understand how the concepts work together:
161 |
162 | When you create the SparkSession, you're establishing your connection to Spark's distributed computing capabilities. Even if you're running on a single machine, Spark still uses the same distributed architecture internally.
163 |
164 | When you create the DataFrame, Spark doesn't immediately load or process the data. Instead, it creates a logical representation of what the data looks like and how it's structured. This is part of Spark's lazy evaluation strategy.
165 |
166 | When you call transformations like filter() or groupBy(), Spark adds these operations to its execution plan but still doesn't do any actual work. It's building up a recipe for how to process your data when the time comes.
167 |
168 | When you call an action like show() or count(), Spark finally executes the entire chain of transformations. It looks at all the operations you've requested, optimizes the execution plan, and then processes the data across your cluster.
169 |
170 | The caching operation tells Spark to store the results in memory after the first computation, so if you perform additional operations on the same DataFrame, it can reuse the cached data instead of recomputing everything from scratch.
171 |
172 |
173 |
174 | ---
175 |
176 | # Weather Data ETL Assignment
177 |
178 | ## Assignment Overview
179 |
180 | Build a data pipeline that extracts weather data from the OpenWeatherMap API, processes it using Apache Spark, and visualizes the results in Grafana.
181 |
182 | ## Requirements
183 |
184 | Extract weather data for at least 10 cities using the OpenWeatherMap API. Select any 10 columns from the API response that you find interesting or relevant.
185 |
186 | Transform the data using PySpark to prepare it for visualization. Apply data cleaning, type conversions, or calculations as needed.
187 |
188 | Store the processed data in any format that allows you to visualize it in Grafana.
189 |
190 | Create visualizations in Grafana that demonstrate your data processing results. Include screenshots of your panels.
191 |
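A possible starting point for the extract step is sketched below. It assumes OpenWeatherMap's current-weather endpoint and a free API key; verify the exact URL, parameters, and response fields against the API documentation, and choose your own ten columns.

```python
# Sketch of the extract step only (illustrative). Endpoint, parameters, and
# response fields are assumptions based on OpenWeatherMap's current-weather API.
import requests
from pyspark.sql import SparkSession

API_KEY = "YOUR_API_KEY"   # placeholder - use your own key
CITIES = ["Nairobi", "London", "Tokyo", "Lagos", "Berlin",
          "Mumbai", "Toronto", "Cairo", "Lima", "Sydney"]

def fetch_city(city: str) -> dict:
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": city, "appid": API_KEY, "units": "metric"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # Pick a handful of fields; swap in whichever ten columns interest you.
    return {
        "city": data["name"],
        "temp_c": data["main"]["temp"],
        "humidity": data["main"]["humidity"],
        "pressure": data["main"]["pressure"],
        "wind_speed": data["wind"]["speed"],
        "description": data["weather"][0]["description"],
    }

spark = SparkSession.builder.appName("weather-etl").getOrCreate()
weather_df = spark.createDataFrame([fetch_city(c) for c in CITIES])
weather_df.show()
```

From here, the PySpark transformations and the Grafana-friendly storage format are up to you.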
192 |
193 |
194 | ## Deliverables
195 |
196 | 1. **GitHub Repository**: Create a public repository containing your Python scripts and any configuration files
197 | 2. **Technical Article**: Write an article on Dev.to or Medium explaining your project, including:
198 | - Overview of your approach
199 | - Code explanations
200 | - Screenshots of your Grafana panels
201 | - Any challenges you encountered and how you solved them
202 |
--------------------------------------------------------------------------------
/GDPR & HIPAA Compliance Guide.md:
--------------------------------------------------------------------------------
1 | # GDPR & HIPAA Compliance Guide
2 |
3 | ## Overview
4 |
5 | ### Learning Objectives
6 | By the end of this study session, you should be able to:
7 | - Define GDPR and HIPAA and explain their purposes
8 | - Compare and contrast the two regulatory frameworks
9 | - Identify which regulation applies to different scenarios
10 | - Explain key compliance requirements for both frameworks
11 | - Describe implementation strategies and best practices
12 |
13 | ---
14 |
15 | ## Quick Reference Cards
16 |
17 | ### GDPR at a Glance
18 | - **Full Name:** General Data Protection Regulation
19 | - **Effective Date:** May 25, 2018
20 | - **Jurisdiction:** EU + anywhere processing EU citizen data
21 | - **Scope:** ALL personal data, ALL industries
22 | - **Max Penalty:** €20M or 4% global revenue
23 | - **Key Concept:** "Privacy by Design"
24 |
25 | ### HIPAA at a Glance
26 | - **Full Name:** Health Insurance Portability and Accountability Act
27 | - **Enacted:** 1996
28 | - **Jurisdiction:** United States only
29 | - **Scope:** Healthcare data (PHI) only
30 | - **Max Penalty:** $50,000 per violation
31 | - **Key Concept:** "Minimum Necessary Rule"
32 |
33 | ---
34 |
35 | ## Core Concepts & Definitions
36 |
37 | ### GDPR Key Terms
38 | | Term | Definition | Example |
39 | |------|------------|---------|
40 | | **Personal Data** | Any info relating to identifiable person | Name, email, IP address, location |
41 | | **Data Subject** | The individual whose data is processed | Patient, customer, employee |
42 | | **Data Controller** | Determines purposes/means of processing | Hospital, company, organization |
43 | | **Data Processor** | Processes data on behalf of controller | Cloud provider, software vendor |
44 | | **Special Category Data** | Sensitive personal data requiring extra protection | Health, biometric, genetic data |
45 |
46 | ### HIPAA Key Terms
47 | | Term | Definition | Example |
48 | |------|------------|---------|
49 | | **PHI** | Protected Health Information | Medical records, billing info |
50 | | **Covered Entity** | Must comply with HIPAA | Hospitals, doctors, health plans |
51 | | **Business Associate** | Works with covered entities, handles PHI | IT vendors, billing companies |
52 | | **ePHI** | Electronic Protected Health Information | Digital medical records |
53 | | **TPO** | Treatment, Payment, Operations | Core healthcare functions |
54 |
55 | ---
56 |
57 | ## Detailed Analysis
58 |
59 | ### Section 1: GDPR Deep Dive
60 |
61 | #### The 6 Lawful Bases for Processing (MEMORIZE!)
62 | 1. **Consent** - Clear, specific, informed agreement
63 | 2. **Contract** - Necessary for contract performance
64 | 3. **Legal Obligation** - Required by law
65 | 4. **Vital Interests** - Life or death situations
66 | 5. **Public Task** - Official/governmental functions
67 | 6. **Legitimate Interests** - Balancing test with individual rights
68 |
69 | #### GDPR Rights (The "Rights Menu")
70 | - **Right to be Informed** - Know what data is collected
71 | - **Right of Access** - See what data is held
72 | - **Right to Rectification** - Correct inaccurate data
73 | - **Right to Erasure** - "Right to be forgotten"
74 | - **Right to Restrict Processing** - Limit how data is used
75 | - **Right to Data Portability** - Move data between services
76 | - **Right to Object** - Say no to processing
77 | - **Rights Related to Automated Decision-Making** - Human review of AI decisions
78 |
79 | #### DPO Requirements (When Mandatory)
80 | - Public authorities (always)
81 | - Large-scale systematic monitoring
82 | - Large-scale special category data processing
83 |
84 | ### Section 2: HIPAA Deep Dive
85 |
86 | #### The 3 HIPAA Rules
87 | 1. **Privacy Rule** - Who can see PHI and when
88 | 2. **Security Rule** - How to protect ePHI technically
89 | 3. **Breach Notification Rule** - What to do when things go wrong
90 |
91 | #### HIPAA Safeguards (The Security Triangle)
92 | 1. **Administrative Safeguards**
93 | - Assign security officer
94 | - Create policies/procedures
95 | - Train workforce
96 | - Control access
97 |
98 | 2. **Physical Safeguards**
99 | - Lock facilities
100 | - Control workstation access
101 | - Secure devices/media
102 |
103 | 3. **Technical Safeguards**
104 | - Access controls
105 | - Audit controls
106 | - Data integrity
107 | - Transmission security
108 |
109 | ---
110 |
111 | ## Instructional Notes & Discussion Points
112 |
113 | ### Opening Discussion Questions
114 | 1. "Why do we need data protection laws?"
115 | - *Lead students to discuss privacy as fundamental right*
116 | 2. "What happens when your medical records are leaked vs. your shopping preferences?"
117 | - *Highlight different types of harm from data breaches*
118 |
119 | ### Interactive Teaching Activities
120 |
121 | #### Activity 1: Jurisdiction Quiz
122 | Present scenarios, students identify GDPR/HIPAA/Both/Neither:
123 | - US hospital treating EU tourist *(Both)*
124 | - EU company with US employees *(GDPR only)*
125 | - US pharmacy chain *(HIPAA only)*
126 | - Australian company, no EU/US connections *(Neither)*
127 |
128 | #### Activity 2: Data Classification Game
129 | Show different data types, students categorize:
130 | - Email address *(GDPR: Personal Data)*
131 | - X-ray image with patient name *(Both: Personal Data + PHI)*
132 | - Anonymous survey results *(Neither)*
133 | - Fitness tracker data *(GDPR: Personal Data)*
134 |
135 | ### Common Student Misconceptions
136 | ❌ **"GDPR only applies to EU companies"**
137 | ✅ **Correct:** Applies to ANY company processing EU citizen data
138 |
139 | ❌ **"HIPAA applies to all health data"**
140 | ✅ **Correct:** Only applies to covered entities and business associates
141 |
142 | ❌ **"Consent is always required"**
143 | ✅ **Correct:** GDPR has 6 legal bases; HIPAA allows TPO without consent
144 |
145 | ---
146 |
147 | ## Practice Exercises & Questions
148 |
149 | ### Quick Quiz Questions
150 |
151 | #### Multiple Choice
152 | 1. **GDPR applies to:**
153 | a) Only EU companies
154 | b) Any company processing EU citizen data
155 | c) Only healthcare companies
156 | d) Only large corporations
157 | *Answer: b*
158 |
159 | 2. **Under HIPAA, PHI can be shared without authorization for:**
160 | a) Marketing purposes
161 | b) Research studies
162 | c) Treatment, payment, operations
163 | d) Employee background checks
164 | *Answer: c*
165 |
166 | 3. **Maximum GDPR fine is:**
167 | a) €10 million
168 | b) €20 million or 4% global revenue
169 | c) €50 million
170 | d) $50,000 per violation
171 | *Answer: b*
172 |
173 | #### True/False
174 | - GDPR requires 72-hour breach notification *(True)*
175 | - HIPAA requires Data Protection Officer *(False - Security Officer)*
176 | - Both regulations require encryption *(True)*
177 | - GDPR allows indefinite data storage *(False)*
178 |
179 | ### Case Study Practice
180 |
181 | #### Case 1: The International Hospital
182 | **Scenario:** US hospital chain opens branch in Germany, treats both US and EU patients, uses cloud storage in Canada.
183 |
184 | **Questions:**
185 | 1. Which regulations apply?
186 | 2. What are the main compliance challenges?
187 | 3. How should they handle data transfers?
188 |
189 | #### Case 2: The Health App
190 | **Scenario:** Startup creates fitness app used by EU citizens, partners with US healthcare providers, stores data on AWS.
191 |
192 | **Questions:**
193 | 1. What type of data are they handling?
194 | 2. What legal basis could they use under GDPR?
195 | 3. Do they need a DPO?
196 |
197 | ---
198 |
199 | ## Exam Preparation
200 |
201 | ### Key Facts to Memorize
202 |
203 | #### GDPR Numbers
204 | - **72 hours** - breach notification to authority
205 | - **One month (~30 days)** - respond to data subject requests
206 | - **€20M or 4%** - maximum fine
207 | - **May 25, 2018** - effective date
208 |
209 | #### HIPAA Numbers
210 | - **1996** - year enacted
211 | - **60 days** - breach notification to individuals
212 | - **500 individuals** - threshold for immediate HHS notification
213 | - **$50,000** - maximum penalty per violation
214 |
215 | ### Memory Techniques
216 |
217 | #### GDPR Rights Acronym: "I AREPORT"
218 | - **I**nformed
219 | - **A**ccess
220 | - **R**ectification
221 | - **E**rasure
222 | - **R**estrict processing
223 | - **P**ortability
224 | - **O**bject
225 | - **R**elated to automated decision-making
226 | - **T**ransparency (bonus)
227 |
228 | #### HIPAA Safeguards: "APT"
229 | - **A**dministrative
230 | - **P**hysical
231 | - **T**echnical
232 |
233 | ---
234 |
235 | ## Comparison Tables for Quick Review
236 |
237 | ### Similarities & Differences Matrix
238 |
239 | | Aspect | GDPR | HIPAA | Same/Different |
240 | |--------|------|-------|----------------|
241 | | Geographic Scope | Global (EU data) | US only | Different |
242 | | Industry Scope | All industries | Healthcare only | Different |
243 | | Requires Encryption | Yes | Yes | Same |
244 | | Requires DPO/Security Officer | Yes (DPO) | Yes (Security Officer) | Same |
245 | | Breach Notification Timeline | 72 hours | 60 days | Different |
246 | | Right to Delete Data | Yes (Right to Erasure) | No (permanent records) | Different |
247 | | Consent Requirements | Strict | Flexible for TPO | Different |
248 |
249 | ### Penalty Comparison
250 |
251 | | Violation Level | GDPR | HIPAA |
252 | |----------------|------|-------|
253 | | **Minor** | Warning or €10M/2% | $100-$50,000 |
254 | | **Major** | €20M/4% global revenue | Up to $1.5M annually |
255 | | **Criminal** | Varies by country | Up to $250K + 10 years prison |
256 |
257 | ---
258 |
259 | ## Best Practices for Implementation
260 |
261 | ### For Instructors
262 |
263 | #### Making It Relevant
264 | - Use current breach examples (Equifax, Anthem, etc.)
265 | - Discuss social media privacy settings
266 | - Connect to students' personal experiences with healthcare
267 |
268 | #### Common Teaching Pitfalls to Avoid
269 | - Don't get lost in legal details - focus on practical application
270 | - Avoid presenting as "US vs EU" - many companies need both
271 | - Don't oversimplify consent - it's more complex than "just ask permission"
272 |
273 | #### Assessment Ideas
274 | - **Case study analysis** - real-world application
275 | - **Compliance checklist creation** - practical skills
276 | - **Risk scenario evaluation** - critical thinking
277 | - **Policy writing exercise** - hands-on experience
278 |
279 | ### Study Group Activities
280 | 1. **Mock DPA Investigation** - role-play compliance audit
281 | 2. **Breach Response Simulation** - practice incident response
282 | 3. **Privacy Notice Comparison** - analyze real company notices
283 | 4. **Compliance Cost Calculation** - estimate implementation costs
284 |
285 | ---
286 |
287 | ## Additional Resources
288 |
289 | ### Essential Reading
290 | - GDPR Official Text (Articles 5, 6, 7, 12-22, 25, 32-34)
291 | - HIPAA Privacy Rule Summary
292 | - ICO (UK) Guidance Documents
293 | - HHS HIPAA Security Rule Guidance
294 |
295 | ### Recommended Cases to Study
296 | - **Schrems II** (EU-US data transfers)
297 | - **Google Spain** (Right to be forgotten)
298 | - **Anthem Breach** (largest healthcare breach)
299 | - **Facebook-Cambridge Analytica** (consent and data sharing)
300 |
301 | ### Online Tools
302 | - ICO Self-Assessment Tool
303 | - HHS Security Risk Assessment Tool
304 | - GDPR Compliance Checkers
305 | - Breach Cost Calculators
306 |
307 | ---
308 |
309 | ## Final Assessment Checklist
310 |
311 | ### Before the Exam, Can You:
312 | - [ ] Explain when GDPR vs HIPAA applies?
313 | - [ ] List all 6 GDPR lawful bases?
314 | - [ ] Name all GDPR rights?
315 | - [ ] Describe the 3 HIPAA rules?
316 | - [ ] Compare breach notification requirements?
317 | - [ ] Explain DPO vs Security Officer roles?
318 | - [ ] Calculate potential penalty amounts?
319 | - [ ] Identify required security safeguards?
320 | - [ ] Distinguish between personal data and PHI?
321 | - [ ] Apply regulations to real-world scenarios?
322 |
323 | ### Red Flag Concepts (Review if Unclear)
324 | - Cross-border data transfers
325 | - Legitimate interests balancing test
326 | - Business associate agreements
327 | - Data processing vs data controlling
328 | - Special category data protections
329 | - Minimum necessary rule application
330 |
331 | ---
332 |
333 | ## Instructional Script Snippets
334 |
335 | ### Opening Hook
336 | *"Imagine your medical records, including mental health visits, appear in a Google search of your name. Or your location data shows you visiting a cancer clinic every Tuesday. This isn't science fiction - it's why we need GDPR and HIPAA."*
337 |
338 | ### Transition Between Topics
339 | *"Now that we understand what GDPR protects - all personal data - let's look at HIPAA's more focused approach to healthcare information..."*
340 |
341 | ### Concept Reinforcement
342 | *"Remember: GDPR is the speed limit everywhere you drive with EU citizens as passengers. HIPAA is the special rules only in the hospital parking lot."*
343 |
344 | ### Closing Summary
345 | *"Both regulations share the same goal: protecting people's most sensitive information. The difference is scope - GDPR casts a wide net globally, HIPAA goes deep in US healthcare. Master both, and you'll understand the future of privacy law."*
346 |
--------------------------------------------------------------------------------
/Change Data Capture.md:
--------------------------------------------------------------------------------
1 | # Change Data Capture (CDC) Learning Guide
2 |
3 | ## What is Change Data Capture?
4 |
5 | Change Data Capture is a powerful technique that tracks changes—specifically inserts, updates, and deletes—in a source database and streams them to a target system in real-time or near real-time. Think of CDC as a vigilant observer that watches your database and immediately reports any changes to other systems that need to stay synchronized.
6 |
7 | This approach ensures data consistency across different systems like data warehouses, caches, or analytics platforms without the need to process the entire dataset repeatedly. CDC serves as the backbone for data integration, real-time analytics, and maintaining up-to-date information across distributed systems.
8 |
9 | ## Why Use CDC?
10 |
11 | Understanding the benefits of CDC helps explain why it has become essential in modern data architectures:
12 |
13 | **Real-Time Data Synchronization**: Unlike traditional batch processing that updates target systems at scheduled intervals (perhaps once a day or hour), CDC updates target systems instantly as changes occur. This means your analytics dashboard can reflect customer purchases within seconds rather than waiting for the next batch job.
14 |
15 | **Exceptional Efficiency**: CDC processes only the data that has actually changed, dramatically reducing resource usage compared to full dataset transfers. Instead of copying an entire million-row table every hour, CDC might only transfer the dozen rows that actually changed.
16 |
17 | **Guaranteed Consistency**: CDC ensures that downstream systems accurately reflect changes in the source database. When a customer updates their address in your main application, that change propagates reliably to your data warehouse, recommendation engine, and reporting systems.
18 |
19 | ## How Does CDC Work?
20 |
21 | The magic of CDC lies in its ability to capture changes from a source database's write-ahead log (WAL), which is essentially the database's diary of all modifications. Every database maintains this log for recovery purposes, and CDC leverages this existing infrastructure.
22 |
23 | Here's how the process flows: when you make a change to your PostgreSQL database, that change first gets written to the WAL before being applied to the actual data files. CDC tools read these WAL entries and convert them into events that can be streamed to target systems. For example, tools like Debezium read these logs and stream changes to Apache Kafka, which then delivers them to targets such as Cassandra.
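
To make the event flow concrete, here is a minimal, dependency-free Python sketch of how a downstream consumer might apply CDC-style change events (inserts, updates, deletes) to an in-memory replica. The event shape loosely mirrors the Debezium payloads shown later in this guide ("op", "before", "after"); the field names and sample rows are illustrative assumptions, not real connector output.

```python
# Teaching sketch: apply CDC-style change events to an in-memory "target system".
replica = {}  # primary key -> latest version of the row

def apply_event(event):
    """Apply one change event to the replica."""
    op = event["op"]  # "c" = create/insert, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row                   # upsert the new/changed row
    elif op == "d":
        replica.pop(event["before"]["id"], None)   # remove the deleted row

events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "name": "Alice", "email": "alice@example.com"}},
    {"op": "u", "before": {"id": 1, "name": "Alice", "email": "alice@example.com"},
     "after": {"id": 1, "name": "Alice", "email": "alice@new-example.com"}},
]

for event in events:
    apply_event(event)

print(replica)  # {1: {'id': 1, 'name': 'Alice', 'email': 'alice@new-example.com'}}
```

Notice that only the changed rows travel through the pipeline; the replica still converges to the source state without ever copying the full table.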
24 |
25 | ## Debezium Connector (Source)
26 |
27 | Debezium represents one of the most popular open-source platforms for CDC. It captures changes from various databases including PostgreSQL, MySQL, and Oracle, then streams these changes to Apache Kafka. Think of Debezium as a translator that speaks both "database language" and "streaming language."
28 |
29 | ### How Debezium Works
30 |
31 | Debezium connectors act as careful monitors of a database's WAL, detecting every change that occurs. Each detected change gets converted into a structured event and sent to a designated Kafka topic. This process happens with remarkable precision:
32 |
33 | When you insert a new row into a PostgreSQL table, Debezium generates an "INSERT" event containing all the new data. For updates, Debezium creates both "before" and "after" events, showing you exactly what changed from the old values to the new ones. Delete operations generate "DELETE" events that capture what was removed.
34 |
35 | ### Simple Example: Debezium with PostgreSQL
36 |
37 | Let's walk through setting up Debezium with PostgreSQL using a practical example. Imagine you have a PostgreSQL table called `users` with columns for id, name, and email.
38 |
39 | #### Step 1: Enable WAL in PostgreSQL
40 |
41 | First, you need to configure PostgreSQL to use logical replication, which Debezium requires to read changes:
42 |
43 | ```bash
44 | # Edit the PostgreSQL configuration file
45 | sudo nano /etc/postgresql/14/main/postgresql.conf
46 | ```
47 |
48 | Update these settings in the configuration file:
49 |
50 | ```
51 | wal_level = logical
52 | max_wal_senders = 1
53 | max_replication_slots = 1
54 | ```
55 |
56 | After making these changes, restart PostgreSQL to apply them:
57 |
58 | ```bash
59 | sudo systemctl restart postgresql
60 | ```
61 |
62 | Next, grant the necessary replication permissions to your database user:
63 |
64 | ```bash
65 | psql -U postgres -c "ALTER USER myuser WITH REPLICATION;"
66 | ```
67 |
68 | #### Step 2: Set Up Kafka and Debezium
69 |
70 | Now you'll set up the streaming infrastructure. Download and extract Apache Kafka:
71 |
72 | ```bash
73 | wget https://downloads.apache.org/kafka/3.6.0/kafka_2.13-3.6.0.tgz
74 | tar -xzf kafka_2.13-3.6.0.tgz
75 | cd kafka_2.13-3.6.0
76 | ```
77 |
78 | Start Zookeeper and Kafka in separate terminals. Zookeeper manages Kafka's configuration, while Kafka handles the actual message streaming:
79 |
80 | ```bash
81 | # Terminal 1: Start Zookeeper
82 | bin/zookeeper-server-start.sh config/zookeeper.properties
83 |
84 | # Terminal 2: Start Kafka
85 | bin/kafka-server-start.sh config/server.properties
86 | ```
87 |
88 | Download and set up the Debezium PostgreSQL connector:
89 |
90 | ```bash
91 | mkdir -p /path/to/kafka/plugins
92 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.7.0.Final/debezium-connector-postgres-2.7.0.Final-plugin.tar.gz
93 | tar -xzf debezium-connector-postgres-2.7.0.Final-plugin.tar.gz -C /path/to/kafka/plugins
94 | ```
95 |
96 | Configure Kafka Connect to recognize your Debezium plugin:
97 |
98 | ```bash
99 | nano config/connect-distributed.properties
100 | ```
101 |
102 | Add this line to tell Kafka Connect where to find plugins:
103 |
104 | ```
105 | plugin.path=/path/to/kafka/plugins
106 | ```
107 |
108 | Start Kafka Connect in distributed mode:
109 |
110 | ```bash
111 | bin/connect-distributed.sh config/connect-distributed.properties
112 | ```
113 |
114 | Create a configuration file for your Debezium connector. This JSON file tells Debezium exactly how to connect to your PostgreSQL database and which tables to monitor:
115 |
116 | ```json
117 | {
118 | "name": "postgres-connector",
119 | "config": {
120 | "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
121 | "database.hostname": "localhost",
122 | "database.port": "5432",
123 | "database.user": "myuser",
124 | "database.password": "mypassword",
125 | "database.dbname": "mydb",
126 | "database.server.name": "server1",
127 | "table.include.list": "public.users",
128 | "plugin.name": "pgoutput"
129 | }
130 | }
131 | ```
132 |
133 | Deploy the connector to start monitoring your database:
134 |
135 | ```bash
136 | curl -X POST -H "Content-Type: application/json" --data @postgres-connector.json http://localhost:8083/connectors
137 | ```
138 |
139 | #### Step 3: Observe Changes in Action
140 |
141 | Now for the exciting part—watching CDC work in real-time. Create your users table and insert some data:
142 |
143 | ```sql
144 | psql -U myuser -d mydb -c "CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT, email TEXT);"
145 | psql -U myuser -d mydb -c "INSERT INTO users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');"
146 | ```
147 |
148 | Debezium captures this insert operation and creates an event in a Kafka topic named `server1.public.users`. You can view this event using Kafka's console consumer:
149 |
150 | ```bash
151 | bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic server1.public.users --from-beginning
152 | ```
153 |
154 | The event structure looks like this:
155 |
156 | ```json
157 | {
158 | "schema": { ... },
159 | "payload": {
160 | "before": null,
161 | "after": { "id": 1, "name": "Alice", "email": "alice@example.com" },
162 | "op": "c", // 'c' indicates create (insert)
163 | "ts_ms": 1697051234567
164 | }
165 | }
166 | ```
167 |
168 | Notice how the event includes both "before" and "after" states. For an insert, "before" is null since the row didn't exist previously. The "op" field indicates the operation type, and "ts_ms" provides a timestamp.
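
If you prefer to inspect these events programmatically rather than through the console consumer, a short Python sketch like the one below can subscribe to the topic and print just the payloads. It assumes the third-party `kafka-python` package (`pip install kafka-python`) and the topic name used in this example; adjust both to your setup.

```python
# Sketch: read Debezium change events from Kafka and print their payloads.
import json
from kafka import KafkaConsumer  # provided by the kafka-python package

consumer = KafkaConsumer(
    "server1.public.users",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the oldest available event
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m else None,
)

for message in consumer:
    if message.value is None:       # skip tombstone records
        continue
    payload = message.value.get("payload", {})
    print(payload.get("op"), payload.get("before"), "->", payload.get("after"))
```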
169 |
170 | ## Cassandra Sink Connector
171 |
172 | The Cassandra sink connector completes the CDC pipeline by reading data from Kafka topics and writing it to a Cassandra database. This connector excels at storing CDC events in a scalable, distributed NoSQL environment.
173 |
174 | ### How Cassandra Sink Connector Works
175 |
176 | The connector acts as a bridge between Kafka and Cassandra, reading events from Kafka topics and translating them into appropriate Cassandra operations. It handles the mapping between Kafka record structures and Cassandra table schemas, managing inserts, updates, and deletes automatically.
177 |
178 | ### Simple Example: Cassandra Sink Connector
179 |
180 | Let's continue our example by setting up Cassandra to receive the user events from our Debezium setup.
181 |
182 | #### Step 1: Set Up Cassandra
183 |
184 | Install and start Cassandra on Ubuntu:
185 |
186 | ```bash
187 | sudo apt update
188 | sudo apt install cassandra
189 | sudo systemctl start cassandra
190 | ```
191 |
192 | Verify that Cassandra is running properly:
193 |
194 | ```bash
195 | nodetool status
196 | ```
197 |
198 | Create a keyspace and table structure that matches your source data:
199 |
200 | ```bash
201 | cqlsh -e "CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
202 | cqlsh -e "CREATE TABLE mykeyspace.users (id int PRIMARY KEY, name text, email text);"
203 | ```
204 |
205 | #### Step 2: Configure Cassandra Sink Connector
206 |
207 | Download the Cassandra sink connector:
208 |
209 | ```bash
210 | wget https://d1i4a8756m6x7j.cloudfront.net/repo/7.5/confluent-kafka-connect-cassandra-1.5.0.tar.gz
211 | tar -xzf confluent-kafka-connect-cassandra-1.5.0.tar.gz -C /path/to/kafka/plugins
212 | ```
213 |
214 | Create a configuration file that tells the connector how to map Kafka events to Cassandra records:
215 |
216 | ```json
217 | {
218 | "name": "cassandra-sink",
219 | "config": {
220 | "connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
221 | "tasks.max": "1",
222 | "topics": "server1.public.users",
223 | "cassandra.contact.points": "localhost",
224 | "cassandra.keyspace": "mykeyspace",
225 | "cassandra.table.name": "users",
226 | "cassandra.kcql": "INSERT INTO users SELECT id, name, email FROM server1.public.users"
227 | }
228 | }
229 | ```
230 |
231 | Deploy the connector to start the data flow:
232 |
233 | ```bash
234 | curl -X POST -H "Content-Type: application/json" --data @cassandra-sink.json http://localhost:8083/connectors
235 | ```
236 |
237 | #### Step 3: Observe Data in Cassandra
238 |
239 | When Debezium sends user events to Kafka, the Cassandra sink connector automatically writes them to your Cassandra table. Query Cassandra to see the results:
240 |
241 | ```bash
242 | cqlsh -e "SELECT * FROM mykeyspace.users;"
243 | ```
244 |
245 | You should see the result: `id=1, name='Alice', email='alice@example.com'`.
246 |
247 | The beauty of this setup is that any changes you make to the PostgreSQL users table will automatically appear in Cassandra within seconds, maintaining perfect synchronization between your systems.
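
If you would rather verify the sink from Python instead of cqlsh, the DataStax driver can run the same query. This is a hedged sketch that assumes `pip install cassandra-driver` and the keyspace and table created above.

```python
# Sketch: read the replicated rows from Cassandra in Python.
from cassandra.cluster import Cluster  # provided by the cassandra-driver package

cluster = Cluster(["localhost"])        # contact point from this example
session = cluster.connect("mykeyspace")

for row in session.execute("SELECT id, name, email FROM users"):
    print(row.id, row.name, row.email)

cluster.shutdown()
```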
248 |
249 | ## Key Considerations
250 |
251 | When implementing CDC in production environments, several important factors require careful attention:
252 |
253 | **Performance Impact**: CDC does add some overhead to your source database since it needs to read and process the WAL continuously. Monitor WAL usage in PostgreSQL, especially in high-transaction systems where the log can grow quickly. Consider the additional I/O load and ensure your database server has adequate resources.
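
One practical way to keep an eye on that overhead is to watch how much WAL each replication slot is still retaining for its consumer. The sketch below does this from Python; it assumes `pip install psycopg2-binary`, PostgreSQL 10 or newer, the example credentials used earlier, and a role with permission to read `pg_replication_slots`.

```python
# Sketch: report how much WAL each replication slot is retaining.
import psycopg2  # provided by the psycopg2-binary package

conn = psycopg2.connect(dbname="mydb", user="myuser",
                        password="mypassword", host="localhost")
with conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
        FROM pg_replication_slots;
    """)
    for slot_name, retained_wal in cur.fetchall():
        print(slot_name, retained_wal)
conn.close()
```

A slot whose retained WAL keeps growing usually means the connector has stopped consuming and the disk will eventually fill up.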
254 |
255 | **Schema Evolution**: Debezium handles schema changes gracefully—when you add columns to a PostgreSQL table, Debezium automatically detects and includes them in future events. However, you must ensure that your target systems (like Cassandra tables) can accommodate these schema changes. Plan your schema evolution strategy carefully.
256 |
257 | **Scalability Considerations**: Cassandra's distributed architecture makes it excellent for handling high-volume CDC streams. Configure appropriate replication factors for reliability, and consider partitioning strategies that align with your query patterns and data access requirements.
258 |
259 | ## Assignment: Hands-On Project
260 |
261 | To solidify your understanding of CDC concepts, I encourage you to work through a practical implementation project. This hands-on experience will help you see how all the pieces fit together in a real-world scenario.
262 |
263 | **Project Repository**: [LuxDevHQ Data Engineering Project](https://github.com/LuxDevHQ/LuxDevHQ-Data-Engineering-Project)
264 |
265 | **Your Task**: Clone the repository and follow the comprehensive instructions to set up a complete CDC pipeline using Debezium and a sink connector. The project guides you through using Linux commands to set up PostgreSQL tables, make various changes (inserts, updates, deletes), and verify that these changes propagate correctly to the target system.
266 |
267 | ```bash
268 | git clone https://github.com/LuxDevHQ/LuxDevHQ-Data-Engineering-Project.git
269 | cd LuxDevHQ-Data-Engineering-Project
270 | ```
271 |
272 | Follow the project's README file for detailed setup and execution instructions. This project provides a practical environment where you can experiment with CDC concepts, helping you understand how Debezium and sink connectors work together in real-world scenarios.
273 |
274 | As you work through this project, pay attention to how the different components interact, observe the timing of data propagation, and experiment with different types of database changes to see how they're handled by the CDC pipeline.
275 |
--------------------------------------------------------------------------------
/PYTHON/chapter_1.md:
--------------------------------------------------------------------------------
1 | # Chapter 1: Introduction to Python and The Way of the Program
2 |
3 | ## What is Python?
4 |
5 | Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming. It is widely used in various domains such as web development, data analysis, artificial intelligence, scientific computing, and data engineering.
6 |
7 | ### Year Created
8 | Python was created in the late 1980s, and its first official version, Python 0.9.0, was released in February 1991. Python 2.0 was released in 2000, and Python 3.0, which is not backward-compatible with Python 2, was released in 2008.
9 |
10 | **Current stable Python version:** Python 3.13.2
11 |
12 | ### Important Resources
13 | - [W3Schools Python tutorial](https://www.w3schools.com/python/) (beginner friendly)
14 | - [Python official documentation](https://docs.python.org/)
15 | - [Programiz Python tutorial](https://www.programiz.com/python-programming)
16 | - [Pythontutorial.net](https://www.pythontutorial.net/)
17 |
18 | ## The Way of the Program
19 |
20 | The goal of learning Python is to teach you to think like a computer scientist. This way of thinking combines some of the best features of mathematics, engineering, and natural science.
21 |
22 | Like mathematicians, computer scientists use formal languages to denote ideas (specifically computations). Like engineers, they design things, assembling components into systems and evaluating tradeoffs among alternatives. Like scientists, they observe the behavior of complex systems, form hypotheses, and test predictions.
23 |
24 | The single most important skill for a computer scientist is **problem solving**. Problem solving means the ability to formulate problems, think creatively about solutions, and express a solution clearly and accurately. As it turns out, the process of learning to program is an excellent opportunity to practice problem-solving skills.
25 |
26 | ## 1.1 What is a Program?
27 |
28 | A program is a sequence of instructions that specifies how to perform a computation. The computation might be something mathematical, such as solving a system of equations or finding the roots of a polynomial, but it can also be a symbolic computation, such as searching and replacing text in a document or something graphical, like processing an image or playing a video.
29 |
30 | The details look different in different languages, but a few basic instructions appear in just about every language:
31 |
32 | - **input:** Get data from the keyboard, a file, the network, or some other device
33 | - **output:** Display data on the screen, save it in a file, send it over the network, etc.
34 | - **math:** Perform basic mathematical operations like addition and multiplication
35 | - **conditional execution:** Check for certain conditions and run the appropriate code
36 | - **repetition:** Perform some action repeatedly, usually with some variation
37 |
38 | Believe it or not, that's pretty much all there is to it. Every program you've ever used, no matter how complicated, is made up of instructions that look pretty much like these.
39 |
40 | You can think of programming as the process of breaking a large, complex task into smaller and smaller subtasks until the subtasks are simple enough to be performed with one of these basic instructions.
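
As a small illustration, here is a tiny script that uses all five kinds of instruction; the prompt text and the threshold of 100 are arbitrary choices for the example:

```python
# A tiny program that uses input, output, math, conditional execution, and repetition.
total = 0
for i in range(3):                             # repetition
    number = int(input("Enter a number: "))    # input
    total = total + number                     # math
if total > 100:                                # conditional execution
    print("That's a big total:", total)        # output
else:
    print("Total:", total)                     # output
```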
41 |
42 | ## 1.2 Running Python
43 |
44 | One of the challenges of getting started with Python is that you might have to install Python and related software on your computer. If you are familiar with your operating system, and especially if you are comfortable with the command-line interface, you will have no trouble installing Python. But for beginners, it can be painful to learn about system administration and programming at the same time.
45 |
46 | To avoid that problem, it's recommended that you start out running Python in a browser. Later, when you are comfortable with Python, you can install Python on your computer.
47 |
48 | There are a number of web pages you can use to run Python. Some popular options include:
49 | - [Replit](https://replit.com/)
50 | - [Python.org's online console](https://www.python.org/shell/)
51 | - [Trinket](https://trinket.io/python)
52 |
53 | ### Python Versions
54 |
55 | There are two major versions of Python: Python 2 and Python 3. They are very similar, so if you learn one, it is easy to switch to the other. However, Python 2 reached end-of-life on January 1, 2020, so all new projects should use Python 3.
56 |
57 | The Python interpreter is a program that reads and executes Python code. When you start the interpreter, you should see output like this:
58 |
59 | ```
60 | Python 3.13.2 (default, Jan 15 2025, 14:20:21)
61 | [GCC 11.2.0] on linux
62 | Type "help", "copyright", "credits" or "license" for more information.
63 | >>>
64 | ```
65 |
66 | The `>>>` is a prompt that indicates that the interpreter is ready for you to enter code. If you type a line of code and hit Enter, the interpreter displays the result:
67 |
68 | ```python
69 | >>> 1 + 1
70 | 2
71 | ```
72 |
73 | ## 1.3 The First Program
74 |
75 | Traditionally, the first program you write in a new language is called "Hello, World!" because all it does is display the words "Hello, World!". In Python, it looks like this:
76 |
77 | ```python
78 | >>> print('Hello, World!')
79 | Hello, World!
80 | ```
81 |
82 | This is an example of a print statement, although it doesn't actually print anything on paper. It displays a result on the screen. The quotation marks in the program mark the beginning and end of the text to be displayed; they don't appear in the result.
83 |
84 | The parentheses indicate that `print` is a function. We'll learn more about functions in later chapters.
85 |
86 | ## 1.4 Arithmetic Operators
87 |
88 | After "Hello, World!", the next step is arithmetic. Python provides operators, which are special symbols that represent computations like addition and multiplication.
89 |
90 | The operators `+`, `-`, and `*` perform addition, subtraction, and multiplication:
91 |
92 | ```python
93 | >>> 40 + 2
94 | 42
95 | >>> 43 - 1
96 | 42
97 | >>> 6 * 7
98 | 42
99 | ```
100 |
101 | The operator `/` performs division:
102 |
103 | ```python
104 | >>> 84 / 2
105 | 42.0
106 | ```
107 |
108 | Note that the result is `42.0` instead of `42` because division in Python 3 always returns a floating-point number.
109 |
110 | The operator `**` performs exponentiation (raising a number to a power):
111 |
112 | ```python
113 | >>> 6**2 + 6
114 | 42
115 | ```
116 |
117 | **Warning:** In some other languages, `^` is used for exponentiation, but in Python it is a bitwise operator called XOR:
118 |
119 | ```python
120 | >>> 6 ^ 2
121 | 4
122 | ```
123 |
124 | ## 1.5 Values and Types
125 |
126 | A value is one of the basic things a program works with, like a letter or a number. Some values we have seen so far are `2`, `42.0`, and `'Hello, World!'`.
127 |
128 | These values belong to different types:
129 | - `2` is an **integer**
130 | - `42.0` is a **floating-point number**
131 | - `'Hello, World!'` is a **string**
132 |
133 | If you are not sure what type a value has, the interpreter can tell you using the `type()` function:
134 |
135 | ```python
136 | >>> type(2)
137 | <class 'int'>
138 | >>> type(42.0)
139 | <class 'float'>
140 | >>> type('Hello, World!')
141 | <class 'str'>
142 | ```
143 |
144 | In these results, the word "class" is used in the sense of a category; a type is a category of values.
145 |
146 | - Integers belong to the type `int`
147 | - Strings belong to `str`
148 | - Floating-point numbers belong to `float`
149 |
150 | ### String vs Numbers
151 |
152 | Values like `'2'` and `'42.0'` look like numbers, but they are in quotation marks, so they are strings:
153 |
154 | ```python
155 | >>> type('2')
156 | <class 'str'>
157 | >>> type('42.0')
158 | <class 'str'>
159 | ```
160 |
161 | ## Python Basics
162 |
163 | ### Identifiers
164 |
165 | Identifiers are names used to identify variables, functions, classes, modules, or other objects. Rules for naming identifiers in Python:
166 |
167 | - Identifiers can be a combination of letters (a-z, A-Z), digits (0-9), and underscores (_)
168 | - Identifiers cannot start with a digit
169 | - Identifiers are case-sensitive (`myVar` and `myvar` are different)
170 | - Reserved keywords cannot be used as identifiers
171 |
172 | **Valid identifiers:**
173 | ```python
174 | my_variable
175 | _private_var
176 | variable1
177 | MyClass
178 | ```
179 |
180 | **Invalid identifiers:**
181 | ```python
182 | 1variable # Cannot start with digit
183 | my-variable # Hyphen not allowed
184 | class # Reserved keyword
185 | ```
186 |
187 | ### Keywords
188 |
189 | Keywords are reserved words in Python that have special meanings and cannot be used as identifiers. Some of the keywords in Python include:
190 |
191 | `if`, `else`, `elif`, `for`, `while`, `break`, `continue`, `def`, `return`, `lambda`, `class`, `import`, `from`, `try`, `except`, `finally`, `and`, `or`, `not`, `True`, `False`, `None`
192 |
193 | You can see all keywords by running:
194 | ```python
195 | import keyword
196 | print(keyword.kwlist)
197 | ```
198 |
199 | ### PEP 8 Rules
200 |
201 | PEP 8 is the official style guide for Python code. It provides conventions for writing readable and consistent code. Some key PEP 8 rules include the following (a short example appears after the list):
202 |
203 | - **Indentation:** Use 4 spaces per indentation level
204 | - **Line Length:** Limit all lines to a maximum of 79 characters (72 for docstrings/comments)
205 | - **Imports:** Imports should usually be on separate lines
206 | - **Whitespace:** Avoid extraneous whitespace in various situations
207 | - **Naming Conventions:**
208 | - Variables: `my_variable`
209 | - Functions: `my_function`
210 | - Classes: `MyClass`
211 | - Constants: `MY_CONSTANT`
212 |
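A brief illustration of these conventions in practice (the names here are invented for demonstration):

```python
# A small example that follows the PEP 8 conventions listed above.
MAX_READINGS = 3          # constant: UPPER_CASE_WITH_UNDERSCORES


class TemperatureSensor:  # class: CapWords
    """Store and convert a temperature reading."""

    def __init__(self, celsius):
        self.celsius = celsius          # variable: lowercase_with_underscores

    def to_fahrenheit(self):            # function/method: lowercase_with_underscores
        return self.celsius * 9 / 5 + 32


sensor = TemperatureSensor(21.5)
print(sensor.to_fahrenheit())           # 70.7
```

Note the four-space indentation and the blank lines separating top-level definitions, both of which PEP 8 prescribes.
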
213 | ## 1.6 Formal and Natural Languages
214 |
215 | Natural languages are the languages people speak, such as English, Spanish, and French. They were not designed by people (although people try to impose some order on them); they evolved naturally.
216 |
217 | Formal languages are languages that are designed by people for specific applications. For example, the notation that mathematicians use is a formal language that is particularly good at denoting relationships among numbers and symbols. Chemists use a formal language to represent the chemical structure of molecules. And most importantly:
218 |
219 | **Programming languages are formal languages that have been designed to express computations.**
220 |
221 | ### Key Differences
222 |
223 | Although formal and natural languages have many features in common—tokens, structure, and syntax—there are some differences:
224 |
225 | - **Ambiguity:** Natural languages are full of ambiguity, which people deal with by using contextual clues. Formal languages are designed to be nearly or completely unambiguous.
226 |
227 | - **Redundancy:** Natural languages employ lots of redundancy to reduce misunderstandings. Formal languages are less redundant and more concise.
228 |
229 | - **Literalness:** Natural languages are full of idiom and metaphor. Formal languages mean exactly what they say.
230 |
231 | ## 1.7 Debugging
232 |
233 | Programmers make mistakes. For whimsical reasons, programming errors are called **bugs** and the process of tracking them down is called **debugging**.
234 |
235 | Programming, and especially debugging, sometimes brings out strong emotions. If you are struggling with a difficult bug, you might feel angry, despondent, or embarrassed.
236 |
237 | ### Debugging Tips
238 |
239 | - Think of the computer as an employee with certain strengths (speed and precision) and weaknesses (lack of empathy and inability to grasp the big picture)
240 | - Find ways to take advantage of the strengths and mitigate the weaknesses
241 | - Use your emotions to engage with the problem, without letting reactions interfere with your ability to work effectively
242 | - Learning to debug is frustrating, but it's a valuable skill useful for many activities beyond programming
243 |
244 | ## Exercises
245 |
246 | ### Exercise 1.1
247 | Experiment with the "Hello, world!" program and try to make mistakes on purpose:
248 |
249 | 1. In a print statement, what happens if you leave out one of the parentheses, or both?
250 | 2. If you are trying to print a string, what happens if you leave out one of the quotation marks, or both?
251 | 3. You can use a minus sign to make a negative number like `-2`. What happens if you put a plus sign before a number? What about `2++2`?
252 | 4. In math notation, leading zeros are ok, as in `09`. What happens if you try this in Python? What about `011`?
253 | 5. What happens if you have two values with no operator between them?
254 |
255 | ### Exercise 1.2
256 | Start the Python interpreter and use it as a calculator:
257 |
258 | 1. How many seconds are there in 42 minutes 42 seconds?
259 | 2. How many miles are there in 10 kilometers? (Hint: there are 1.61 kilometers in a mile)
260 | 3. If you run a 10 kilometer race in 42 minutes 42 seconds, what is your average pace (time per mile in minutes and seconds)? What is your average speed in miles per hour?
261 |
262 | ## Glossary
263 |
264 | - **Problem solving:** The process of formulating a problem, finding a solution, and expressing it
265 | - **High-level language:** A programming language like Python that is designed to be easy for humans to read and write
266 | - **Interpreter:** A program that reads another program and executes it
267 | - **Program:** A set of instructions that specifies a computation
268 | - **Value:** One of the basic units of data, like a number or string, that a program manipulates
269 | - **Type:** A category of values (int, float, str)
270 | - **Bug:** An error in a program
271 | - **Debugging:** The process of finding and correcting bugs
272 | - **Identifier:** Names used to identify variables, functions, classes, or other objects
273 | - **Keyword:** Reserved words in Python with special meanings
274 | - **PEP 8:** The official style guide for Python code
--------------------------------------------------------------------------------
/IntroductiontoCloudComputing.md:
--------------------------------------------------------------------------------
1 | ### Introduction to Cloud Computing (Azure and AWS)
2 | **Duration**: 90 minutes
3 | **Audience**: Data Engineers
4 |
5 | ---
6 |
7 | ### Learning Objectives
8 | By the end of this session, participants will be able to:
9 | 1. Define cloud computing and its core principles.
10 | 2. Compare key features of Azure and AWS.
11 | 3. Navigate basic services relevant to data engineering in both platforms.
12 | 4. Set up and configure a basic cloud environment.
13 |
14 | ---
15 |
16 | ### Agenda
17 | 1. **What is Cloud Computing?** (10 minutes)
18 | - Definition and characteristics of cloud computing.
19 | - Cloud deployment models: Public, Private, Hybrid.
20 | - Service models: IaaS, PaaS, SaaS.
21 |
22 | 2. **Azure vs. AWS: Key Concepts** (15 minutes)
23 | - Overview of Azure and AWS platforms.
24 | - Comparison of service categories: Compute, Storage, Networking, and Databases.
25 | - Strengths and use cases for Azure and AWS.
26 |
27 | 3. **Core Services for Data Engineering** (20 minutes)
28 | - **Compute**:
29 | - Azure: Azure Virtual Machines, Azure Databricks.
30 | - AWS: EC2, EMR (Elastic MapReduce).
31 | - **Storage**:
32 | - Azure: Blob Storage, Data Lake Storage.
33 | - AWS: S3, Glacier.
34 | - **Databases**:
35 | - Azure: Azure SQL Database, Cosmos DB.
36 | - AWS: RDS, DynamoDB.
37 |
38 | 4. **Hands-On Lab: Setting Up a Cloud Environment** (30 minutes)
39 | - Create free accounts on Azure and AWS.
40 | - Configure basic cloud storage:
41 | - Azure Blob Storage.
42 | - AWS S3 bucket.
43 | - Upload and retrieve sample data.
44 |
45 | 5. **Q&A and Wrap-Up** (15 minutes)
46 | - Address participants' questions.
47 | - Discuss common challenges and best practices.
48 | - Share additional resources for continued learning.
49 |
50 | ---
51 |
52 | ### Detailed Session Plan
53 |
54 | #### What is Cloud Computing? (10 minutes)
55 | - **Definition**: Delivering computing services (e.g., servers, storage, databases, networking, software) over the internet.
56 | - **Characteristics**:
57 | - On-demand availability.
58 | - Scalability.
59 | - Pay-as-you-go pricing.
60 | - High availability and reliability.
61 | - **Service Models**:
62 | - **IaaS** (e.g., Virtual Machines): Full control over infrastructure.
63 | - **PaaS** (e.g., Azure Databricks): Managed environment for deploying applications.
64 | - **SaaS** (e.g., Office 365): Pre-built software accessed via the cloud.
65 |
66 | ---
67 |
68 | #### Azure vs. AWS: Key Concepts (15 minutes)
69 | - **Azure**:
70 | - Focus on hybrid cloud and enterprise solutions.
71 | - Tight integration with Microsoft tools (e.g., Power BI, Office 365).
72 | - **AWS**:
73 | - Largest cloud provider with a wide range of services.
74 | - Strong presence in startups and tech-first organizations.
75 |
76 | | Feature | Azure | AWS |
77 | |-----------------|---------------------------------|----------------------------------|
78 | | Compute | Azure VMs, Azure Kubernetes | EC2, Lambda, ECS |
79 | | Storage | Blob Storage, Data Lake | S3, Glacier |
80 | | Databases | Azure SQL, Cosmos DB | RDS, DynamoDB |
81 | | Analytics | Azure Synapse, Databricks | Redshift, EMR, Athena |
82 |
83 | ---
84 |
85 | #### Core Services for Data Engineering (20 minutes)
86 | **Compute Services**:
87 | - Azure:
88 | - **Azure Virtual Machines**: Scalable virtual servers.
89 | - **Azure Databricks**: Apache Spark-based analytics.
90 | - AWS:
91 | - **EC2**: Elastic Compute Cloud for scalable servers.
92 | - **EMR**: Managed Hadoop/Spark for big data processing.
93 |
94 | **Storage Services**:
95 | - Azure:
96 | - **Blob Storage**: Unstructured data storage.
97 | - **Data Lake Storage**: Analytics-optimized storage.
98 | - AWS:
99 | - **S3**: Highly available object storage.
100 | - **Glacier**: Long-term archival storage.
101 |
102 | **Database Services**:
103 | - Azure:
104 | - **Azure SQL Database**: Managed relational database.
105 | - **Cosmos DB**: Globally distributed, multi-model database.
106 | - AWS:
107 | - **RDS**: Managed relational databases.
108 | - **DynamoDB**: NoSQL database with high performance.
109 |
110 | ---
111 |
112 | #### Hands-On Lab: Setting Up a Cloud Environment (30 minutes)
113 |
114 | **Step 1**: Create Free Accounts
115 | 1. **Azure**:
116 | - Visit [Azure Free Account](https://azure.microsoft.com/free/).
117 | - Sign up with a Microsoft account.
118 | - Activate $200 free credit.
119 |
120 | 2. **AWS**:
121 | - Visit [AWS Free Tier](https://aws.amazon.com/free/).
122 | - Sign up with email and billing details.
123 | - Activate free-tier services.
124 |
125 | **Step 2**: Configure Basic Cloud Storage
126 | 1. **Azure Blob Storage**:
127 | - Navigate to **Storage Accounts** in Azure Portal.
128 | - Create a new storage account.
129 | - Upload a sample CSV file and view its properties.
130 |
131 | 2. **AWS S3 Bucket**:
132 | - Navigate to **S3** in AWS Console.
133 | - Create a new S3 bucket.
134 | - Upload a sample CSV file and view its properties.
135 |
136 | **Step 3**: Retrieve and Use Data
137 | - Use Python or CLI tools to retrieve the uploaded file (see the sketch after this list):
138 | - Azure: `azure-storage-blob` Python SDK.
139 | - AWS: `boto3` Python SDK.
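
The sketch below shows one way to download the sample file with each SDK. The bucket, container, blob, and connection-string values are placeholders you would replace with your own; it assumes `pip install boto3 azure-storage-blob` and that AWS credentials are already configured (for example via `aws configure`).

```python
# Sketch: retrieve the uploaded sample file from S3 and from Azure Blob Storage.
# All names below (bucket, container, blob, local file names) are placeholders.
import boto3
from azure.storage.blob import BlobServiceClient

# --- AWS S3 ---
s3 = boto3.client("s3")
s3.download_file("my-demo-bucket", "sample.csv", "sample_from_s3.csv")

# --- Azure Blob Storage ---
connection_string = "<your-storage-account-connection-string>"
blob_service = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service.get_blob_client(container="my-demo-container", blob="sample.csv")
with open("sample_from_azure.csv", "wb") as f:
    f.write(blob_client.download_blob().readall())
```

Uploading works the same way in reverse, with `s3.upload_file(...)` and `blob_client.upload_blob(...)`.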
140 |
141 | ---
142 |
143 | #### Q&A and Wrap-Up (15 minutes)
144 | - **Discussion Points**:
145 | - How to choose between Azure and AWS for specific use cases?
146 | - Best practices for managing costs in cloud platforms.
147 | - Common challenges faced by data engineers in cloud environments.
148 | - **Resources for Further Learning**:
149 | - Azure: Microsoft Learn - [Azure Fundamentals](https://learn.microsoft.com/en-us/azure/)
150 | - AWS: AWS Training - [AWS Fundamentals](https://aws.amazon.com/training/)
151 |
152 | ---
153 |
154 | ### Key Takeaways
155 | - Cloud computing provides scalable, cost-effective infrastructure and tools for data engineering.
156 | - Azure and AWS are the leading platforms, each with unique strengths.
157 | - Hands-on experience is crucial to understanding and leveraging cloud services.
158 |
159 |
160 |
161 | ### Bonus
162 | #### *Additional Notes and Tips for AWS Tools for Data Engineering*
163 |
164 | AWS provides a rich ecosystem of tools and services tailored for data engineering tasks. Below is an expanded overview, along with notes and tips to maximize their utility.
165 |
166 | ---
167 |
168 | #### **Storage Services**
169 | 1. **Amazon S3 (Simple Storage Service)**
170 | - **Purpose**: Scalable object storage for raw, processed, and archived data.
171 | - **Common Use Cases**:
172 | - Data lakes.
173 | - Backup and disaster recovery.
174 | - Hosting static files.
175 | - **Tips**:
176 | - Use **S3 Lifecycle Policies** to move infrequently accessed data to cheaper storage classes (e.g., Glacier, Intelligent-Tiering).
177 | - Enable **versioning** to maintain file history and prevent accidental data loss.
178 | - Use **S3 Select** to query and retrieve subsets of data from objects directly, reducing data transfer costs.
179 | - Encrypt sensitive data using **SSE (Server-Side Encryption)** or **client-side encryption**.
180 |
181 | 2. **AWS Glue Data Catalog**
182 | - **Purpose**: Centralized metadata repository for datasets stored in S3 or other sources.
183 | - **Common Use Cases**:
184 | - Schema management for data lakes.
185 | - Integration with Athena, Redshift Spectrum, and EMR.
186 | - **Tips**:
187 | - Use AWS Glue Crawlers to automate schema detection for datasets.
188 | - Ensure proper IAM roles are configured for Glue to access S3 buckets.
189 |
190 | ---
191 |
192 | #### **Compute Services**
193 | 1. **Amazon EC2 (Elastic Compute Cloud)**
194 | - **Purpose**: General-purpose virtual servers.
195 | - **Common Use Cases**:
196 | - Hosting custom data pipelines.
197 | - Running one-time ETL jobs or long-running services.
198 | - **Tips**:
199 | - Use **Spot Instances** for cost savings on workloads that tolerate interruptions.
200 | - Implement **auto-scaling** to handle varying workloads.
201 |
202 | 2. **AWS Lambda**
203 | - **Purpose**: Serverless compute for event-driven processing.
204 | - **Common Use Cases**:
205 | - Lightweight ETL transformations.
206 | - Real-time data processing (e.g., responding to events from S3 or Kinesis).
207 | - **Tips**:
208 | - Keep Lambda functions small and focused on single tasks.
209 | - Optimize performance by minimizing package size and reusing connections.
210 |
211 | 3. **AWS EMR (Elastic MapReduce)**
212 | - **Purpose**: Managed Hadoop and Spark framework for big data processing.
213 | - **Common Use Cases**:
214 | - Batch processing of large datasets.
215 | - Running machine learning models on large-scale data.
216 | - **Tips**:
217 | - Use **Spot Instances** with EMR to reduce costs.
218 | - Leverage **EMR File System (EMRFS)** for better integration with S3.
219 |
220 | ---
221 |
222 | #### **Database Services**
223 | 1. **Amazon Redshift**
224 | - **Purpose**: Managed data warehouse for OLAP workloads.
225 | - **Common Use Cases**:
226 | - Aggregating and analyzing large datasets.
227 | - Business intelligence and reporting.
228 | - **Tips**:
229 | - Use **Redshift Spectrum** to query data directly from S3 without loading it into Redshift.
230 | - Monitor and optimize queries using the **Query Monitoring Rules** feature.
231 | - Compress data using columnar storage formats like Parquet or ORC to improve query performance.
232 |
233 | 2. **Amazon DynamoDB**
234 | - **Purpose**: NoSQL database for key-value and document storage.
235 | - **Common Use Cases**:
236 | - Low-latency, high-throughput applications (e.g., user session storage).
237 | - Storing metadata or logs for data pipelines.
238 | - **Tips**:
239 | - Enable **DynamoDB Streams** for change data capture and event-driven workflows.
240 | - Use the **on-demand capacity mode** for unpredictable workloads to avoid over-provisioning.
241 |
242 | 3. **Amazon RDS (Relational Database Service)**
243 | - **Purpose**: Managed relational database with support for MySQL, PostgreSQL, Oracle, and SQL Server.
244 | - **Common Use Cases**:
245 | - Storing structured, transactional data.
246 | - Serving as a staging area for ETL workflows.
247 | - **Tips**:
248 | - Enable **Multi-AZ deployments** for high availability.
249 | - Use **Read Replicas** to offload read-heavy workloads.
250 |
251 | ---
252 |
253 | #### **Data Analytics Tools**
254 | 1. **Amazon Athena**
255 | - **Purpose**: Serverless query engine for analyzing data in S3 using SQL.
256 | - **Common Use Cases**:
257 | - Interactive exploration of data lakes.
258 | - Quick validation of ETL pipeline outputs.
259 | - **Tips**:
260 | - Use columnar formats like Parquet or ORC for faster queries.
261 | - Partition your data to reduce query costs.
262 |
263 | 2. **AWS Glue**
264 | - **Purpose**: Serverless ETL service.
265 | - **Common Use Cases**:
266 | - Cleaning and transforming datasets for downstream consumption.
267 | - Automating ETL workflows.
268 | - **Tips**:
269 | - Use **job bookmarks** to handle incremental data loads.
270 | - Test transformations locally using the AWS Glue Docker image.
271 |
272 | 3. **Amazon QuickSight**
273 | - **Purpose**: BI and data visualization tool.
274 | - **Common Use Cases**:
275 | - Creating dashboards for stakeholders.
276 | - Visualizing insights from Athena or Redshift queries.
277 | - **Tips**:
278 | - Leverage SPICE (Super-fast, Parallel, In-memory Calculation Engine) for faster dashboard performance.
279 |
280 | ---
281 |
282 | #### **Data Streaming Services**
283 | 1. **Amazon Kinesis**
284 | - **Purpose**: Platform for collecting, processing, and analyzing real-time data streams.
285 | - **Common Use Cases**:
286 | - IoT data ingestion.
287 | - Real-time log processing.
288 | - **Tips**:
289 | - Use **Kinesis Data Firehose** to automatically load streaming data into S3, Redshift, or Elasticsearch.
290 | - Monitor and scale Kinesis streams using **CloudWatch metrics**.
291 |
292 | 2. **AWS Managed Streaming for Apache Kafka (MSK)**
293 | - **Purpose**: Managed Apache Kafka service for real-time data processing.
294 | - **Common Use Cases**:
295 | - Message brokering between services in a pipeline.
296 | - Event-driven architectures.
297 | - **Tips**:
298 | - Use Kafka connectors for seamless integration with AWS services like S3 or DynamoDB.
299 | - Optimize partitions and replication settings to balance performance and fault tolerance.
300 |
301 | ---
302 |
303 | #### **Data Security and Governance Tools**
304 | 1. **AWS IAM (Identity and Access Management)**
305 | - **Purpose**: Manage user permissions and access to AWS resources.
306 | - **Tips**:
307 | - Apply the **principle of least privilege** when assigning roles.
308 | - Use **IAM Policies** to define resource-level access.
309 |
310 | 2. **AWS Lake Formation**
311 | - **Purpose**: Simplify data lake creation with built-in governance.
312 | - **Tips**:
313 | - Define granular access policies using Lake Formation permissions.
314 | - Integrate with Glue Data Catalog for seamless schema management.
315 |
316 | 3. **AWS CloudTrail**
317 | - **Purpose**: Track user activity and API usage across AWS.
318 | - **Tips**:
319 | - Enable CloudTrail logs for all accounts to improve auditability.
320 | - Store logs in an S3 bucket for long-term analysis.
321 |
322 | ---
323 |
324 | ### Additional Resources
325 | - [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/)
326 | - [AWS Big Data Blog](https://aws.amazon.com/big-data/)
327 | - [Hands-On Labs for AWS](https://www.qwiklabs.com/)
328 |
329 |
330 |
--------------------------------------------------------------------------------
/PYTHON/chapter_2.md:
--------------------------------------------------------------------------------
1 | # Chapter 2: Variables, Expressions and Statements
2 |
3 | One of the most powerful features of a programming language is the ability to manipulate variables. A variable is a name that refers to a value.
4 |
5 | ## 2.1 Assignment Statements
6 |
7 | An assignment statement creates a new variable and gives it a value:
8 |
9 | ```python
10 | >>> message = 'And now for something completely different'
11 | >>> n = 17
12 | >>> pi = 3.1415926535897932
13 | ```
14 |
15 | This example makes three assignments. The first assigns a string to a new variable named `message`; the second gives the integer 17 to `n`; the third assigns the (approximate) value of π to `pi`.
16 |
17 | A common way to represent variables on paper is to write the name with an arrow pointing to its value. This kind of figure is called a **state diagram** because it shows what state each of the variables is in (think of it as the variable's state of mind).
18 |
19 | ```
20 | message ───────────────────→ 'And now for something completely different'
21 | n ──────────────────────────→ 17
22 | pi ─────────────────────────→ 3.1415926535897932
23 | ```
24 |
25 | ## 2.2 Variable Names
26 |
27 | Programmers generally choose names for their variables that are meaningful—they document what the variable is used for.
28 |
29 | Variable names can be as long as you like. They can contain both letters and numbers, but they can't begin with a number. It is legal to use uppercase letters, but it is conventional to use only lower case for variable names.
30 |
31 | The underscore character, `_`, can appear in a name. It is often used in names with multiple words, such as `your_name` or `airspeed_of_unladen_swallow`.
32 |
33 | If you give a variable an illegal name, you get a syntax error:
34 |
35 | ```python
36 | >>> 76trombones = 'big parade'
37 | SyntaxError: invalid syntax
38 | >>> more@ = 1000000
39 | SyntaxError: invalid syntax
40 | >>> class = 'Advanced Theoretical Zymurgy'
41 | SyntaxError: invalid syntax
42 | ```
43 |
44 | - `76trombones` is illegal because it begins with a number
45 | - `more@` is illegal because it contains an illegal character, `@`
46 | - `class` is illegal because it's a Python keyword
47 |
48 | ### Python Keywords
49 |
50 | The interpreter uses keywords to recognize the structure of the program, and they cannot be used as variable names. Python 3 has these keywords:
51 |
52 | | | | | | |
53 | |---------|----------|---------|----------|-------|
54 | | False | class | finally | is | return|
55 | | None | continue | for | lambda | try |
56 | | True | def | from | nonlocal | while |
57 | | and | del | global | not | with |
58 | | as | elif | if | or | yield |
59 | | assert | else | import | pass | |
60 | | break | except | in | raise | |
61 |
62 | Recent Python versions also reserve `async` and `await`. You don't have to memorize this list. In most development environments, keywords are displayed in a different color; if you try to use one as a variable name, you'll know.
63 |
64 | ## 2.3 Expressions and Statements
65 |
66 | An **expression** is a combination of values, variables, and operators. A value all by itself is considered an expression, and so is a variable, so the following are all legal expressions:
67 |
68 | ```python
69 | >>> 42
70 | 42
71 | >>> n
72 | 17
73 | >>> n + 25
74 | 42
75 | ```
76 |
77 | When you type an expression at the prompt, the interpreter **evaluates** it, which means that it finds the value of the expression. In this example, `n` has the value 17 and `n + 25` has the value 42.
78 |
79 | A **statement** is a unit of code that has an effect, like creating a variable or displaying a value.
80 |
81 | ```python
82 | >>> n = 17
83 | >>> print(n)
84 | ```
85 |
86 | The first line is an assignment statement that gives a value to `n`. The second line is a print statement that displays the value of `n`.
87 |
88 | When you type a statement, the interpreter **executes** it, which means that it does whatever the statement says. In general, statements don't have values.
89 |
90 | ## 2.4 Script Mode
91 |
92 | So far we have run Python in **interactive mode**, which means that you interact directly with the interpreter. Interactive mode is a good way to get started, but if you are working with more than a few lines of code, it can be clumsy.
93 |
94 | The alternative is to save code in a file called a **script** and then run the interpreter in **script mode** to execute the script. By convention, Python scripts have names that end with `.py`.
95 |
96 | ### Differences Between Interactive and Script Mode
97 |
98 | Because Python provides both modes, you can test bits of code in interactive mode before you put them in a script. But there are differences between interactive mode and script mode that can be confusing.
99 |
100 | For example, if you are using Python as a calculator, you might type:
101 |
102 | ```python
103 | >>> miles = 26.2
104 | >>> miles * 1.61
105 | 42.182
106 | ```
107 |
108 | The first line assigns a value to `miles`, but it has no visible effect. The second line is an expression, so the interpreter evaluates it and displays the result. It turns out that a marathon is about 42 kilometers.
109 |
110 | But if you type the same code into a script and run it, you get no output at all. In script mode an expression, all by itself, has no visible effect. Python evaluates the expression, but it doesn't display the result. To display the result, you need a print statement like this:
111 |
112 | ```python
113 | miles = 26.2
114 | print(miles * 1.61)
115 | ```
116 |
117 | **Try this:** Type the following statements in the Python interpreter and see what they do:
118 | ```python
119 | 5
120 | x = 5
121 | x + 1
122 | ```
123 |
124 | Now put the same statements in a script and run it. What is the output? Modify the script by transforming each expression into a print statement and then run it again.
125 |
126 | ## 2.5 Order of Operations
127 |
128 | When an expression contains more than one operator, the order of evaluation depends on the **order of operations**. For mathematical operators, Python follows mathematical convention. The acronym **PEMDAS** is a useful way to remember the rules:
129 |
130 | 1. **Parentheses** have the highest precedence and can be used to force an expression to evaluate in the order you want. Since expressions in parentheses are evaluated first, `2 * (3-1)` is 4, and `(1+1)**(5-2)` is 8. You can also use parentheses to make an expression easier to read, as in `(minute * 100) / 60`, even if it doesn't change the result.
131 |
132 | 2. **Exponentiation** has the next highest precedence, so `1 + 2**3` is 9, not 27, and `2 * 3**2` is 18, not 36.
133 |
134 | 3. **Multiplication and Division** have higher precedence than Addition and Subtraction. So `2*3-1` is 5, not 4, and `6+4/2` is 8, not 5.
135 |
136 | 4. **Operators with the same precedence** are evaluated from left to right (except exponentiation). So in the expression `degrees / 2 * pi`, the division happens first and the result is multiplied by pi. To divide by 2π, you can use parentheses or write `degrees / 2 / pi`.
137 |
138 | **Tip:** If you can't tell the order of operations by looking at an expression, use parentheses to make it obvious.
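
These rules are easy to verify interactively; the short session below reproduces the examples from the list above:

```python
>>> 2 * (3 - 1)        # parentheses first
4
>>> (1 + 1) ** (5 - 2)
8
>>> 1 + 2**3           # exponentiation before addition
9
>>> 2 * 3**2           # exponentiation before multiplication
18
>>> 2*3 - 1            # multiplication before subtraction
5
>>> 6 + 4/2            # division before addition (true division returns a float)
8.0
```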
139 |
140 | ## 2.6 String Operations
141 |
142 | In general, you can't perform mathematical operations on strings, even if the strings look like numbers, so the following are illegal:
143 |
144 | ```python
145 | 'chinese'-'food' # Illegal
146 | 'eggs'/'easy' # Illegal
147 | 'third'*'a charm' # Illegal (wrong operand type)
148 | ```
149 |
150 | But there are two exceptions, `+` and `*`.
151 |
152 | ### String Concatenation
153 |
154 | The `+` operator performs **string concatenation**, which means it joins the strings by linking them end-to-end. For example:
155 |
156 | ```python
157 | >>> first = 'throat'
158 | >>> second = 'warbler'
159 | >>> first + second
160 | 'throatwarbler'
161 | ```
162 |
163 | ### String Repetition
164 |
165 | The `*` operator also works on strings; it performs **repetition**. For example, `'Spam'*3` is `'SpamSpamSpam'`. If one of the values is a string, the other has to be an integer.
166 |
167 | ```python
168 | >>> 'Spam' * 3
169 | 'SpamSpamSpam'
170 | >>> 4 * 'Na'
171 | 'NaNaNaNa'
172 | ```
173 |
174 | This use of `+` and `*` makes sense by analogy with addition and multiplication. Just as `4*3` is equivalent to `4+4+4`, we expect `'Spam'*3` to be the same as `'Spam'+'Spam'+'Spam'`, and it is.
175 |
176 | **Think about it:** There is a significant way in which string concatenation and repetition are different from integer addition and multiplication. Can you think of a property that addition has that string concatenation does not?
177 |
178 | ## 2.7 Comments
179 |
180 | As programs get bigger and more complicated, they get more difficult to read. Formal languages are dense, and it is often difficult to look at a piece of code and figure out what it is doing, or why.
181 |
182 | For this reason, it is a good idea to add notes to your programs to explain in natural language what the program is doing. These notes are called **comments**, and they start with the `#` symbol:
183 |
184 | ```python
185 | # compute the percentage of the hour that has elapsed
186 | percentage = (minute * 100) / 60
187 | ```
188 |
189 | In this case, the comment appears on a line by itself. You can also put comments at the end of a line:
190 |
191 | ```python
192 | percentage = (minute * 100) / 60 # percentage of an hour
193 | ```
194 |
195 | Everything from the `#` to the end of the line is ignored—it has no effect on the execution of the program.
196 |
197 | ### Writing Good Comments
198 |
199 | Comments are most useful when they document non-obvious features of the code. It is reasonable to assume that the reader can figure out what the code does; it is more useful to explain why.
200 |
201 | **Bad comment (redundant with the code and useless):**
202 | ```python
203 | v = 5 # assign 5 to v
204 | ```
205 |
206 | **Good comment (contains useful information not in the code):**
207 | ```python
208 | v = 5 # velocity in meters/second
209 | ```
210 |
211 | Good variable names can reduce the need for comments, but long names can make complex expressions hard to read, so there is a tradeoff.
212 |
213 | ## 2.8 Debugging
214 |
215 | Three kinds of errors can occur in a program: **syntax errors**, **runtime errors**, and **semantic errors**. It is useful to distinguish between them in order to track them down more quickly.
216 |
217 | ### Syntax Error
218 | "Syntax" refers to the structure of a program and the rules about that structure. For example, parentheses have to come in matching pairs, so `(1 + 2)` is legal, but `8)` is a syntax error.
219 |
220 | If there is a syntax error anywhere in your program, Python displays an error message and quits, and you will not be able to run the program. During the first few weeks of your programming career, you might spend a lot of time tracking down syntax errors. As you gain experience, you will make fewer errors and find them faster.
221 |
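For example, typing an unmatched parenthesis in interactive mode produces a message like the following (the exact wording varies between Python versions):

```python
>>> 8)
  File "<stdin>", line 1
    8)
     ^
SyntaxError: unmatched ')'
```
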
222 | ### Runtime Error
223 | The second type of error is a runtime error, so called because the error does not appear until after the program has started running. These errors are also called **exceptions** because they usually indicate that something exceptional (and bad) has happened.
224 |
225 | Runtime errors are rare in the simple programs you will see in the first few chapters, so it might be a while before you encounter one.
226 |
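For example, dividing by zero is perfectly legal syntax, so the program starts running, but Python raises an exception when it reaches the division:

```python
>>> minutes = 0
>>> 60 / minutes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
```
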
227 | ### Semantic Error
228 | The third type of error is "semantic", which means related to meaning. If there is a semantic error in your program, it will run without generating error messages, but it will not do the right thing. It will do something else. Specifically, it will do what you told it to do.
229 |
230 | Identifying semantic errors can be tricky because it requires you to work backward by looking at the output of the program and trying to figure out what it is doing.
231 |
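For example, the program below runs without complaint, but because of the order of operations from Section 2.5 it does not compute what the comment says it should; only the parenthesized version does what was intended:

```python
x = 10
y = 20

# Intended: the average of x and y (should be 15.0)
average = x + y / 2        # semantic error: computes x + (y / 2) = 20.0
print(average)

average = (x + y) / 2      # what was actually meant
print(average)             # 15.0
```
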
232 | ## Glossary
233 |
234 | - **variable:** A name that refers to a value
235 | - **assignment:** A statement that assigns a value to a variable
236 | - **state diagram:** A graphical representation of a set of variables and the values they refer to
237 | - **keyword:** A reserved word that is used to parse a program; you cannot use keywords like `if`, `def`, and `while` as variable names
238 | - **operand:** One of the values on which an operator operates
239 | - **expression:** A combination of variables, operators, and values that represents a single result
240 | - **evaluate:** To simplify an expression by performing the operations in order to yield a single value
241 | - **statement:** A section of code that represents a command or action. So far, the statements we have seen are assignments and print statements
242 | - **execute:** To run a statement and do what it says
243 | - **interactive mode:** A way of using the Python interpreter by typing code at the prompt
244 | - **script mode:** A way of using the Python interpreter to read code from a script and run it
245 | - **script:** A program stored in a file
246 | - **order of operations:** Rules governing the order in which expressions involving multiple operators and operands are evaluated
247 | - **concatenate:** To join two operands end-to-end
248 | - **comment:** Information in a program that is meant for other programmers (or anyone reading the source code) and has no effect on the execution of the program
249 | - **syntax error:** An error in a program that makes it impossible to parse (and therefore impossible to interpret)
250 | - **exception:** An error that is detected while the program is running
251 | - **semantics:** The meaning of a program
252 | - **semantic error:** An error in a program that makes it do something other than what the programmer intended
253 |
254 | ## Exercises
255 |
256 | ### Exercise 2.1
257 | Whenever you learn a new feature, you should try it out in interactive mode and make errors on purpose to see what goes wrong.
258 |
259 | - We've seen that `n = 42` is legal. What about `42 = n`?
260 | - How about `x = y = 1`?
261 | - In some languages every statement ends with a semi-colon, `;`. What happens if you put a semi-colon at the end of a Python statement?
262 | - What if you put a period at the end of a statement?
263 | - In math notation you can multiply x and y like this: xy. What happens if you try that in Python?
264 |
265 | ### Exercise 2.2
266 | Practice using the Python interpreter as a calculator:
267 |
268 | 1. The volume of a sphere with radius r is $\frac{4}{3}\pi r^3$. What is the volume of a sphere with radius 5?
269 |
270 | 2. Suppose the cover price of a book is $24.95, but bookstores get a 40% discount. Shipping costs $3 for the first copy and 75 cents for each additional copy. What is the total wholesale cost for 60 copies?
271 |
272 | 3. If I leave my house at 6:52 am and run 1 mile at an easy pace (8:15 per mile), then 3 miles at tempo (7:12 per mile) and 1 mile at easy pace again, what time do I get home for breakfast?
273 |
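If you are unsure how to start, here is one way the first part might be set up in the interpreter; the variable names are just one possible choice, and the result is left out so you can check it yourself:

```python
>>> import math
>>> radius = 5
>>> volume = 4 / 3 * math.pi * radius**3
>>> volume        # evaluate to see the answer
```
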
--------------------------------------------------------------------------------
/data_lake.md:
--------------------------------------------------------------------------------
1 | # Comprehensive Guide to Data Lakes
2 |
3 | ## Table of Contents
4 | 1. [What is a Data Lake?](#what-is-a-data-lake)
5 | 2. [Why Do You Need a Data Lake?](#why-do-you-need-a-data-lake)
6 | 3. [Core Characteristics](#core-characteristics)
7 | 4. [Data Lake vs Data Warehouse vs Data Lakehouse](#data-lake-vs-data-warehouse-vs-data-lakehouse)
8 | 5. [Essential Components of Data Lake Architecture](#essential-components-of-data-lake-architecture)
9 | 6. [Common Use Cases](#common-use-cases)
10 | 7. [Benefits and Challenges](#benefits-and-challenges)
11 | 8. [Popular Technologies](#popular-technologies)
12 | 9. [Best Practices](#best-practices)
13 | 10. [Conclusion](#conclusion)
14 |
15 | ## What is a Data Lake?
16 |
17 | A **data lake** is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. Unlike traditional databases or data warehouses, data lakes enable you to store data in its native, raw format without requiring a predefined schema.
18 |
19 | ### Key Definition Points:
20 | - **Centralized storage**: Single location for all organizational data
21 | - **Any scale**: From gigabytes to petabytes
22 | - **Raw format**: Data stored as-is, without preprocessing
23 | - **Schema-on-read**: Structure applied when data is accessed, not when stored
24 | - **Multi-format support**: Handles structured, semi-structured, and unstructured data
25 |
26 | ## Why Do You Need a Data Lake?
27 |
28 | Organizations that adopt modern data architectures, including data lakes, tend to report measurable gains in operational efficiency and revenue growth. Industry surveys indicate that more than half of enterprises have already implemented a data lake, and another 22% plan to do so within 36 months.
29 |
30 | ### Business Value:
31 | - **Faster decision-making**: Advanced analytics across diverse data sources
32 | - **Personalized experiences**: Comprehensive customer data analysis
33 | - **Operational optimization**: Predictive maintenance and efficiency improvements
34 | - **Competitive advantage**: Early identification of revenue opportunities
35 | - **Cost efficiency**: Leverage inexpensive object storage and open formats
36 |
37 | ## Core Characteristics
38 |
39 | ### 1. Scalability
40 | - Built to scale horizontally
41 | - Cloud-based object storage solutions (Amazon S3, Azure Data Lake Storage)
42 | - Growth from terabytes to petabytes without capacity concerns
43 |
44 | ### 2. Schema-on-Read
45 | - No predefined schema required at ingestion
46 | - Flexibility to apply different schemas based on use case
47 | - Structure determined during data retrieval or transformation
48 |
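A minimal sketch of schema-on-read, assuming a raw CSV file that was landed in the lake as-is (the path and column names are hypothetical): the structure is applied only when the file is read, and a different consumer could apply a different schema to the same file.

```python
import pandas as pd

# The file was ingested raw; no schema was declared at write time.
# The schema below is applied here, at read time.
events = pd.read_csv(
    "raw/events/2024-06-01.csv",        # hypothetical path in the lake
    dtype={"user_id": "string", "event_type": "category"},
    parse_dates=["event_time"],
)
print(events.dtypes)
```
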
49 | ### 3. Raw Data Storage
50 | - Retains all data types in native format
51 | - Preserves complete data potential for future analysis
52 | - No upfront data transformation requirements
53 |
54 | ### 4. Diverse Data Type Support
55 | - **Structured**: SQL tables, CSV files
56 | - **Semi-structured**: JSON, XML, logs
57 | - **Unstructured**: Images, videos, audio, documents, PDFs
58 |
59 | ## Data Lake vs Data Warehouse vs Data Lakehouse
60 |
61 | | Feature | Data Lake | Data Lakehouse | Data Warehouse |
62 | |---------|-----------|----------------|----------------|
63 | | **Data Types** | All types (structured, semi-structured, unstructured) | All types (structured, semi-structured, unstructured) | Structured data only |
64 | | **Cost** | $ (Low) | $ (Low) | $$$ (High) |
65 | | **Format** | Open format | Open format | Closed, proprietary |
66 | | **Scalability** | Scales at low cost, regardless of data type or volume | Scales at low cost, regardless of data type or volume | Scaling becomes increasingly expensive |
67 | | **Schema** | Schema-on-read | Schema-on-read with governance | Schema-on-write |
68 | | **Performance** | Variable, depends on compute engine | High performance | Optimized for fast SQL queries |
69 | | **Intended Users** | Data scientists | Data analysts, scientists, ML engineers | Data analysts |
70 | | **Reliability** | Low quality (data swamp risk) | High quality, reliable | High quality, reliable |
71 | | **Use Cases** | ML, big data, raw storage | Unified analytics, BI, ML | BI, structured analytics |
72 |
73 | ## Essential Components of Data Lake Architecture
74 |
75 | ### 1. Data Ingestion Layer
76 | Brings data from various sources into the data lake.
77 |
78 | **Ingestion Modes:**
79 | - **Batch ingestion**: Periodic loading (nightly, hourly)
80 | - **Stream ingestion**: Real-time data flows
81 | - **Hybrid ingestion**: Combination of batch and stream
82 |
83 | **Popular Tools:**
84 | - Apache Kafka, AWS Kinesis (streaming)
85 | - Apache NiFi, Flume, AWS Glue (batch ETL)
86 |
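As a hedged illustration of the batch mode above, this sketch lands a local extract in cloud object storage with boto3 (the bucket, key, and file names are hypothetical; streaming ingestion would instead flow through a tool such as Kafka or Kinesis):

```python
import boto3

s3 = boto3.client("s3")

# Batch ingestion: periodically land a raw extract, unmodified,
# under a date-partitioned prefix in the lake's raw zone.
s3.upload_file(
    Filename="exports/orders_2024-06-01.json",   # hypothetical local extract
    Bucket="acme-data-lake",                     # hypothetical bucket
    Key="raw/orders/ingest_date=2024-06-01/orders.json",
)
```
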
87 | ### 2. Storage Layer
88 | Built on cloud object storage for elastic scaling and cost efficiency.
89 |
90 | **Key Features:**
91 | - Durability and availability with automatic replication
92 | - Separation of storage and compute
93 | - Data tiering (hot, warm, cold storage)
94 |
95 | **Popular Options:**
96 | - Amazon S3
97 | - Azure Data Lake Storage
98 | - Google Cloud Storage
99 | - MinIO (on-premise)
100 |
101 | ### 3. Catalog and Metadata Management
102 | Prevents data lakes from becoming "data swamps" by maintaining organization.
103 |
104 | **Manages:**
105 | - Data schema and location
106 | - Partitioning information
107 | - Data lineage and versioning
108 | - Search and discovery capabilities
109 |
110 | **Tools:**
111 | - AWS Glue Data Catalog
112 | - Apache Hive Metastore
113 | - Apache Atlas
114 | - DataHub
115 |
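A small sketch of interacting with one such catalog, the AWS Glue Data Catalog, via boto3 (the database name is hypothetical, and real setups usually let a crawler or ETL job register tables rather than doing it by hand):

```python
import boto3

glue = boto3.client("glue")

# Register a logical database for the curated zone of the lake.
glue.create_database(DatabaseInput={"Name": "curated"})

# Discovery: list the tables the catalog already knows about and
# where their underlying data lives in object storage.
for table in glue.get_tables(DatabaseName="curated")["TableList"]:
    location = table.get("StorageDescriptor", {}).get("Location", "n/a")
    print(table["Name"], location)
```
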
116 | ### 4. Processing and Analytics Layer
117 | Transforms raw data into insights through various operations.
118 |
119 | **Capabilities:**
120 | - ETL/ELT pipelines
121 | - SQL querying
122 | - Machine learning pipelines
123 | - Real-time and batch processing
124 |
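A minimal PySpark sketch of the ETL and SQL capabilities above, assuming raw JSON already sits in the lake and the s3a connector is configured (paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Read raw JSON from the lake's raw zone (schema inferred at read time).
raw = spark.read.json("s3a://acme-data-lake/raw/orders/")

# Light transformation: keep completed orders and derive a date column.
curated = (
    raw.filter(F.col("status") == "completed")
       .withColumn("order_date", F.to_date("order_timestamp"))
)

# Write back as partitioned Parquet in the curated zone.
(curated.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3a://acme-data-lake/curated/orders/"))

# SQL querying over the same data.
curated.createOrReplaceTempView("orders")
spark.sql(
    "SELECT order_date, COUNT(*) AS n FROM orders GROUP BY order_date"
).show()
```
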
125 | ### 5. Security and Governance
126 | Protects sensitive data and ensures compliance.
127 |
128 | **Essential Features:**
129 | - Identity and Access Management (IAM)
130 | - Encryption (in-transit and at-rest)
131 | - Data masking and anonymization
132 | - Auditing and monitoring
133 |
134 | **Tools:**
135 | - AWS Lake Formation
136 | - Apache Ranger
137 | - Azure Purview
138 |
139 | ## Common Use Cases
140 |
141 | ### 1. Big Data Analytics
142 | - Historical and real-time data analysis
143 | - Cross-departmental analytics with single source of truth
144 | - Petabyte-scale dataset queries
145 | - Custom analytics on raw, unprocessed data
146 |
147 | ### 2. Machine Learning and AI
148 | - Multi-format training dataset storage
149 | - Raw data preservation for ML experimentation
150 | - Automated ML pipeline support
151 | - Enhanced model accuracy through comprehensive data access
152 |
153 | ### 3. Centralized Data Archiving
154 | - Long-term storage for compliance and auditing
155 | - Cost-effective historical data retention
156 | - Trend analysis and forecasting
157 | - Future ML model training preparation
158 |
159 | ### 4. Data Science Experimentation
160 | - Exploratory data analysis (EDA)
161 | - Hypothesis testing and prototyping
162 | - Unconstrained access to raw datasets
163 | - Innovation without data engineering dependencies
164 |
165 | ### 5. Improved Customer Interactions
166 | Combine data from multiple sources:
167 | - CRM platforms
168 | - Social media analytics
169 | - Marketing platforms with purchase history
170 | - Customer service interactions
171 |
172 | ### 6. R&D Innovation Support
173 | - Hypothesis testing and assumption refinement
174 | - Material selection for product design
175 | - Genomic research for medication development
176 | - Customer willingness-to-pay analysis
177 |
178 | ### 7. Operational Efficiency
179 | - IoT device data collection and analysis
180 | - Manufacturing process optimization
181 | - Predictive maintenance
182 | - Cost reduction and quality improvement
183 |
184 | ## Benefits and Challenges
185 |
186 | ### Benefits
187 |
188 | #### Flexibility and Scalability
189 | - No upfront schema requirements
190 | - Effortless scaling from gigabytes to petabytes
191 | - Cloud-native storage cost efficiency
192 | - Decoupled compute and storage architecture
193 |
194 | #### Comprehensive Data Support
195 | - All data types in single platform
196 | - Raw, unprocessed data preservation
197 | - Enhanced analytics capabilities
198 | - Cross-team collaboration improvement
199 |
200 | #### Cost and Performance
201 | - Significantly cheaper than traditional databases
202 | - Independent scaling of analytics workloads
203 | - Elimination of data silos
204 | - Improved decision-making through comprehensive analysis
205 |
206 | ### Challenges
207 |
208 | #### Data Quality and Organization
209 | - **Data swamp risk**: Without proper governance, becomes unusable
210 | - **Lack of structure**: Difficult to query and document
211 | - **Quality issues**: Poor data may go undetected
212 | - **Metadata gaps**: Users may not find or understand available data
213 |
214 | #### Governance and Security
215 | - **Complex governance**: Access control and compliance challenges
216 | - **Security concerns**: Protecting sensitive data in a flexible, loosely structured environment
217 | - **Performance issues**: Traditional query engines can be slow on large raw datasets
218 | - **Reliability problems**: Difficulty combining batch and streaming data
219 |
220 | ### Mitigation Strategies
221 |
222 | #### Governance and Organization
223 | - Implement comprehensive data catalogs
224 | - Use standardized naming and folder structures
225 | - Apply data validation and profiling tools
226 | - Automate lifecycle management policies
227 |
228 | #### Security and Performance
229 | - Robust access control and encryption
230 | - Role-based access management
231 | - Regular data quality monitoring
232 | - Performance optimization through proper partitioning
233 |
234 | ## Popular Technologies
235 |
236 | ### Cloud-Native Solutions
237 |
238 | #### Amazon Web Services (AWS)
239 | - **Amazon S3**: Scalable object storage
240 | - **AWS Lake Formation**: Permissions, cataloging, governance
241 | - **AWS Glue**: ETL and data cataloging
242 | - **Amazon Athena**: SQL queries on S3 data
243 |
244 | #### Microsoft Azure
245 | - **Azure Data Lake Storage**: Blob storage with a hierarchical namespace and HDFS-compatible access
246 | - **Azure Synapse Analytics**: Integrated analytics service
247 | - **Azure Purview**: Data governance and cataloging
248 |
249 | #### Google Cloud Platform (GCP)
250 | - **Google Cloud Storage**: Durable object storage
251 | - **BigQuery**: Data warehouse with lake capabilities
252 | - **Vertex AI**: Machine learning platform integration
253 |
254 | ### Open-Source Tools
255 |
256 | #### Storage and Processing
257 | - **Apache Hadoop**: Original distributed data framework
258 | - **Delta Lake**: ACID transactions and versioning for object storage
259 | - **Apache Iceberg**: Table format with atomic operations and time travel
260 | - **Presto**: Distributed SQL query engine
261 |
262 | #### Analytics and ML
263 | - **Apache Spark**: Distributed computing for big data processing
264 | - **Apache Kafka**: Real-time data streaming
265 | - **Jupyter Notebooks**: Interactive data analysis and experimentation
266 |
267 | ### Analytics Platform Integrations
268 |
269 | #### Data Platforms
270 | - **Databricks**: Collaborative workspace with Delta Lake support
271 | - **Snowflake**: Hybrid lakehouse capabilities
272 | - **Confluent**: Enterprise Kafka platform
273 |
274 | #### Business Intelligence
275 | - **Power BI**: Microsoft's business intelligence platform
276 | - **Tableau**: Data visualization and analytics
277 | - **Looker**: Modern BI and data platform
278 |
279 | ## Best Practices
280 |
281 | ### 1. Use Data Lake as Landing Zone
282 | - Store all data without transformation or aggregation
283 | - Preserve raw format for machine learning and lineage
284 | - Maintain complete data history
285 |
286 | ### 2. Implement Data Security
287 | - **Mask PII**: Pseudonymize personally identifiable information
288 | - **Access controls**: Role-based and view-based ACLs
289 | - **Encryption**: Implement both in-transit and at-rest encryption
290 | - **Compliance**: Ensure GDPR and regulatory compliance
291 |
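As a small, hedged sketch of the PII-masking idea (production systems typically rely on the governance tools listed earlier or a dedicated tokenization service rather than hand-rolled code):

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Example: mask an email address before it lands in a broadly shared zone.
# The salt is a hypothetical per-project secret kept outside the lake.
print(pseudonymize("jane.doe@example.com", salt="per-project-secret"))
```
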
292 | ### 3. Build Reliability and Performance
293 | - **Use Delta Lake**: Brings database-like reliability to data lakes
294 | - **Implement ACID transactions**: Ensure data consistency
295 | - **Optimize partitioning**: Improve query performance
296 | - **Monitor data quality**: Regular validation and profiling
297 |
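A hedged PySpark sketch of writing a curated dataset as Delta, assuming the delta-spark package is installed (paths are hypothetical; the two `config` lines are the standard Delta Lake session settings):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("orders-delta")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.parquet("s3a://acme-data-lake/curated/orders/")

# Writing as Delta adds a transaction log, which is what provides
# ACID guarantees and time travel on top of plain object storage.
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("order_date")
   .save("s3a://acme-data-lake/delta/orders/"))
```
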
298 | ### 4. Establish Data Catalog
299 | - **Metadata management**: Track schema, location, and lineage
300 | - **Enable self-service**: Allow users to discover and understand data
301 | - **Document data sources**: Maintain comprehensive data documentation
302 | - **Version control**: Track data and schema changes
303 |
304 | ### 5. Lifecycle Management
305 | - **Automate tiering**: Move old data to cheaper storage tiers
306 | - **Retention policies**: Define and enforce data retention rules
307 | - **Archive management**: Efficient long-term data storage
308 | - **Cleanup procedures**: Remove obsolete or duplicate data
309 |
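One hedged example of automating tiering and retention on S3 with boto3 (bucket name, prefix, and thresholds are hypothetical; Azure and GCS offer equivalent lifecycle policies):

```python
import boto3

s3 = boto3.client("s3")

# Move raw objects to cheaper storage after 90 days and expire them
# after five years; the storage class and retention periods should
# follow your own compliance requirements.
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```
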
310 | ### 6. Monitoring and Governance
311 | - **Performance monitoring**: Track query performance and resource usage
312 | - **Cost optimization**: Monitor and optimize storage and compute costs
313 | - **Access auditing**: Log and review data access patterns
314 | - **Quality metrics**: Establish and monitor data quality indicators
315 |
316 | ## Conclusion
317 |
318 | Data lakes represent a fundamental shift in how organizations approach data storage and analytics. By providing a flexible, scalable foundation for raw and semi-structured data, they enable advanced analytics, machine learning, and real-time decision-making that wasn't possible with traditional data warehouses alone.
319 |
320 | ### Key Takeaways:
321 |
322 | 1. **Flexibility First**: Data lakes excel when you need to store diverse data types without predefined schemas
323 | 2. **Scale and Cost**: Cloud-native solutions provide virtually unlimited scalability at low cost
324 | 3. **Governance Critical**: Success depends on implementing strong metadata management and governance from the start
325 | 4. **Hybrid Approach**: Many organizations benefit from combining data lakes with data warehouses and lakehouses
326 | 5. **Technology Evolution**: The ecosystem continues to evolve with new tools addressing traditional data lake challenges
327 |
328 | ### When to Choose Data Lakes:
329 |
330 | Data lakes are ideal when your organization:
331 | - Handles complex, diverse, or large-scale data
332 | - Needs to enable faster experimentation and innovation
333 | - Wants to implement advanced analytics and AI/ML initiatives
334 | - Requires cost-effective long-term data storage
335 | - Operates in data-driven industries with rapidly changing requirements
336 |
337 | ### Success Factors:
338 |
339 | - **Start with governance**: Implement cataloging and security from day one
340 | - **Choose the right technology stack**: Align tools with team expertise and organizational needs
341 | - **Plan for growth**: Design architecture that scales with data volume and user needs
342 | - **Invest in training**: Ensure teams understand how to effectively use data lake capabilities
343 | - **Monitor and optimize**: Continuously improve performance, cost, and data quality
344 |
345 | Data lakes are not just a storage solution—they're a foundation for modern, data-driven organizations that want to unlock the full potential of their data assets while maintaining flexibility for future innovations and use cases.
346 |
--------------------------------------------------------------------------------