├── LICENSE ├── 23_Handling Imbalanced Data └── 23_Handling Imbalanced Data.md ├── 07_Importing Data └── 07_Importing Data.md ├── 06_Data Frames and Tables └── 06_Data Frames and Tables.md ├── 03_Control Flow └── 03_Control Flow.md ├── 21_Clustering (K-Means) └── 21_Clustering (K-Means).md ├── 10_Data Visualization Basics └── 10_Data Visualization Basics.md ├── 16_Statistical Concepts └── 16_Statistical Concepts.md ├── 14_Working with APIs and JSON └── 14_Working with APIs and JSON.md ├── 22_Decision Trees └── 22_Decision Trees.md ├── 24_Feature Engineering └── 24_Feature Engineering.md ├── 17_Hypothesis Testing └── 17_Hypothesis Testing.md ├── 19_Linear Regression └── 19_Linear Regression.md ├── 15_Regular Expressions └── 15_Regular Expressions.md ├── 25_Model Evaluation and Metrics └── 25_Model Evaluation and Metrics.md ├── 29_Working with Big Data └── 29_Working with Big Data.md ├── 18_Basic Machine Learning Introduction └── 18_Basic Machine Learning Introduction.md ├── 12_SQL for Data Retrieval └── 12_SQL for Data Retrieval.md ├── 13_Time Series Analysis Introduction └── 13_Time Series Analysis Introduction.md ├── 08_Data Cleaning └── 08_Data Cleaning.md ├── 31_Deployment on Cloud Platform └── 31_Deployment on Cloud Platform.md ├── 02_Basics of the Language & Git Basics └── 02_Basics of the Language & Git Basics.md ├── 04_Functions and Modular Programming └── 04_Functions and Modular Programming.md ├── 09_Exploratory Data Analysis (EDA) └── 09_Exploratory Data Analysis (EDA).md ├── 30_Building a Data Science Pipeline └── 30_Building a Data Science Pipeline.md ├── 11_Advanced Data Visualization └── 11_Advanced Data Visualization.md ├── 05_Data Structures └── 05_Data Structures.md ├── 20_Logistic Regression └── 20_Logistic Regression.md ├── 28_Time Series Forecasting └── 28_Time Series Forecasting.md ├── 27_Natural Language Processing (NLP) └── 27_Natural Language Processing (NLP).md ├── 26_Advanced ML: Hyperparameter Tuning └── 26_Advanced ML: Hyperparameter Tuning.md └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Samarth Garge 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /23_Handling Imbalanced Data/23_Handling Imbalanced Data.md: -------------------------------------------------------------------------------- 1 | [<< Day 22](../22_Decision%20Trees/22_Decision%20Trees.md) | [Day 24 >>](../24_Feature%20Engineering/24_Feature%20Engineering.md) 2 | 3 | # Day 23: Imbalanced Data and Handling Techniques 4 | 5 | Welcome to **Day 23** of the 30 Days of Data Science series! 🎉 Today, we tackle the challenge of **Imbalanced Data** in machine learning. Imbalanced datasets can severely impact model performance. We’ll explore various techniques to handle imbalanced data and ensure better results in classification tasks. 🌟 6 | 7 | ## 📋 Table of Contents 8 | 9 | - [📊 Introduction to Imbalanced Data](#-introduction-to-imbalanced-data) 10 | - [📚 Understanding the Problem](#-understanding-the-problem) 11 | - [🔍 Why is Imbalanced Data a Challenge?](#-why-is-imbalanced-data-a-challenge) 12 | - [📈 Metrics for Imbalanced Data](#-metrics-for-imbalanced-data) 13 | - [🛠️ Techniques to Handle Imbalanced Data](#%EF%B8%8F-techniques-to-handle-imbalanced-data) 14 | - [⚖️ Class Weighting](#%EF%B8%8F-class-weighting) 15 | - [🔬 Synthetic Minority Oversampling Technique (SMOTE)](#-synthetic-minority-oversampling-technique-smote) 16 | - [📉 Undersampling the Majority Class](#-undersampling-the-majority-class) 17 | - [🔄 Resampling Methods](#-resampling-methods) 18 | - [📝 Practice Exercises](#-practice-exercises) 19 | - [📌 Summary](#-summary) 20 | 21 | 22 | 23 | ## 📊 Introduction to Imbalanced Data 24 | 25 | Imbalanced datasets occur when one class significantly outnumbers others in a classification task. For example, in fraud detection, the ratio of fraudulent to non-fraudulent transactions is often heavily skewed. 26 | 27 | ## 📚 Understanding the Problem 28 | 29 | ### 🔍 Why is Imbalanced Data a Challenge? 30 | 31 | - **Bias Towards Majority Class**: Models tend to predict the majority class more frequently. 32 | - **Poor Metric Representation**: Accuracy alone can be misleading in imbalanced datasets. 33 | 34 | ### 📈 Metrics for Imbalanced Data 35 | 36 | Key metrics to evaluate performance include: 37 | 38 | - **Precision**: 39 | $$\text{Precision} = \frac{TP}{TP + FP}$$ 40 | 41 | - **Recall**: 42 | $$\text{Recall} = \frac{TP}{TP + FN}$$ 43 | 44 | - **F1 Score**: 45 | $$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$ 46 | 47 | 48 | 49 | ## 🛠️ Techniques to Handle Imbalanced Data 50 | 51 | ### ⚖️ Class Weighting 52 | Assign higher weights to the minority class during model training to reduce imbalance effects. 53 | 54 | ```python 55 | from sklearn.ensemble import RandomForestClassifier 56 | from sklearn.metrics import classification_report 57 | 58 | # Assign class weights 59 | model = RandomForestClassifier(class_weight='balanced', random_state=42) 60 | model.fit(X_train, y_train) 61 | 62 | y_pred = model.predict(X_test) 63 | print(classification_report(y_test, y_pred)) 64 | ``` 65 | 66 | ### 🔬 Synthetic Minority Oversampling Technique (SMOTE) 67 | SMOTE creates synthetic samples for the minority class by interpolating between existing examples. 
68 | 69 | ```python 70 | from imblearn.over_sampling import SMOTE 71 | from sklearn.model_selection import train_test_split 72 | 73 | # Apply SMOTE 74 | smote = SMOTE(random_state=42) 75 | X_resampled, y_resampled = smote.fit_resample(X, y) 76 | 77 | # Split data 78 | X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42) 79 | ``` 80 | 81 | ### 📉 Undersampling the Majority Class 82 | Reduce the number of majority class examples to balance the dataset. 83 | 84 | ```python 85 | from imblearn.under_sampling import RandomUnderSampler 86 | 87 | # Apply undersampling 88 | undersampler = RandomUnderSampler(random_state=42) 89 | X_resampled, y_resampled = undersampler.fit_resample(X, y) 90 | ``` 91 | 92 | ### 🔄 Resampling Methods 93 | Combine oversampling and undersampling techniques for better results. 94 | 95 | ```python 96 | from imblearn.combine import SMOTEENN 97 | 98 | # Apply combined resampling 99 | smote_enn = SMOTEENN(random_state=42) 100 | X_resampled, y_resampled = smote_enn.fit_resample(X, y) 101 | ``` 102 | 103 | 104 | 105 | ## 📝 Practice Exercises 106 | 107 | 1. Load the `imbalanced-learn` package and experiment with SMOTE on a custom imbalanced dataset. 108 | 2. Compare model performance with and without class weighting on a highly imbalanced dataset. 109 | 3. Implement undersampling on the `imbalanced-learn` library and evaluate its impact on classification metrics. 110 | 111 | 112 | 113 | ## 📌 Summary 114 | 115 | In today’s lesson, we explored: 116 | 117 | - The challenges of imbalanced data. 118 | - Key metrics for evaluating models on imbalanced datasets. 119 | - Techniques to address class imbalance such as class weighting, SMOTE, and undersampling. 120 | 121 | By applying these techniques, you can improve the performance and fairness of your machine learning models. Keep practicing and experimenting! 🌟 122 | 123 | --- 124 | -------------------------------------------------------------------------------- /07_Importing Data/07_Importing Data.md: -------------------------------------------------------------------------------- 1 | [<< Day 6](../06_Data%20Frames%20and%20Tables/06_Data%20Frames%20and%20Tables.md) | [Day 8 >>](../08_Data%20Cleaning/08_Data%20Cleaning.md) 2 | # 📘 Day 7: Importing Data with Pandas 3 | 4 | Welcome to **Day 7** of the **30 Days of Data Science** series! Today, we focus on **Importing Data** using the Pandas library. Learning to import data from various formats like CSV, Excel, and JSON is foundational for data analysis. 5 | 6 | 7 | 8 | ## Table of Contents 9 | 10 | - [📘 Day 7: Importing Data with Pandas](#-day-7-importing-data-with-pandas) 11 | - [1️⃣ Introduction to Data Importing](#1️⃣-introduction-to-data-importing) 12 | - [2️⃣ Reading CSV Files](#2️⃣-reading-csv-files) 13 | - [Basic CSV Reading](#basic-csv-reading) 14 | - [Customizing the Read](#customizing-the-read) 15 | - [Efficient Reading of Large CSV Files](#efficient-reading-of-large-csv-files) 16 | - [3️⃣ Reading Excel Files](#3️⃣-reading-excel-files) 17 | - [Reading Specific Sheets](#reading-specific-sheets) 18 | - [Dealing with Missing Data in Excel Files](#dealing-with-missing-data-in-excel-files) 19 | - [4️⃣ Reading JSON Files](#4️⃣-reading-json-files) 20 | - [Handling Nested JSON](#handling-nested-json) 21 | - [🧠 Practice Exercises](#-practice-exercises) 22 | - [🌟 Summary](#-summary) 23 | 24 | 25 | 26 | 27 | ## 1️⃣ Introduction to Data Importing 28 | 29 | Data comes in various formats such as CSV, Excel, and JSON. 
Using Pandas, you can seamlessly convert these into **DataFrames** for manipulation. 30 | 31 | Install Pandas if you haven’t: 32 | 33 | ```bash 34 | pip install pandas 35 | ``` 36 | 37 | Import it in your Python script: 38 | 39 | ```python 40 | import pandas as pd 41 | ``` 42 | 43 | 44 | 45 | ## 2️⃣ Reading CSV Files 46 | 47 | ### Basic CSV Reading 48 | 49 | Use `pd.read_csv` to load a CSV file into a Pandas DataFrame. 50 | 51 | #### Example: 52 | 53 | ```python 54 | import pandas as pd 55 | 56 | # Reading a CSV file 57 | df = pd.read_csv("data.csv") 58 | print(df.head()) # Display the first 5 rows 59 | ``` 60 | 61 | 62 | 63 | ### Customizing the Read 64 | 65 | Modify how data is read by specifying parameters like column names, skipping rows, or handling missing data. 66 | 67 | #### Example: Rename Columns and Skip Rows 68 | 69 | ```python 70 | df = pd.read_csv("data.csv", skiprows=2, names=["ID", "Name", "Age", "City"]) 71 | print(df) 72 | ``` 73 | 74 | #### Example: Handle Missing Values 75 | 76 | ```python 77 | df = pd.read_csv("data.csv", na_values=["N/A", "Missing"]) 78 | print(df) 79 | ``` 80 | 81 | 82 | 83 | ### Efficient Reading of Large CSV Files 84 | 85 | When working with large files, optimize reading by using these techniques: 86 | 87 | #### Example: Load in Chunks 88 | 89 | ```python 90 | for chunk in pd.read_csv("large_data.csv", chunksize=1000): 91 | print(chunk.shape) 92 | ``` 93 | 94 | #### Example: Use Specific Columns 95 | 96 | ```python 97 | df = pd.read_csv("data.csv", usecols=["Name", "Age"]) 98 | print(df) 99 | ``` 100 | 101 | 102 | 103 | ## 3️⃣ Reading Excel Files 104 | 105 | Excel files are popular for structured data storage. Use `pd.read_excel` to import them. 106 | 107 | ### Reading Specific Sheets 108 | 109 | Specify the `sheet_name` parameter to target a particular sheet. 110 | 111 | #### Example: 112 | 113 | ```python 114 | df = pd.read_excel("data.xlsx", sheet_name="Sheet1") 115 | print(df.head()) 116 | ``` 117 | 118 | 119 | 120 | ### Dealing with Missing Data in Excel Files 121 | 122 | Handle empty cells by specifying values to treat as NaN. 123 | 124 | #### Example: 125 | 126 | ```python 127 | df = pd.read_excel("data.xlsx", na_values=["N/A", "Not Available"]) 128 | print(df) 129 | ``` 130 | 131 | 132 | 133 | ## 4️⃣ Reading JSON Files 134 | 135 | JSON files are lightweight and widely used in web applications. Use `pd.read_json` for flat files or `pd.json_normalize` for nested JSON structures. 136 | 137 | ### Handling Nested JSON 138 | 139 | For deeply nested JSON, normalize it for a tabular structure. 140 | 141 | #### Example: 142 | 143 | ```python 144 | import pandas as pd 145 | 146 | # Example JSON data 147 | data = { 148 | "Name": ["Alice", "Bob"], 149 | "Details": [{"Age": 25, "City": "New York"}, {"Age": 30, "City": "Chicago"}] 150 | } 151 | 152 | # Normalize JSON 153 | df = pd.json_normalize(data, record_path="Details", meta=["Name"]) 154 | print(df) 155 | ``` 156 | 157 | **Output**: 158 | 159 | ``` 160 | Age City Name 161 | 0 25 New York Alice 162 | 1 30 Chicago Bob 163 | ``` 164 | 165 | 166 | 167 | ## 🧠 Practice Exercises 168 | 169 | 1. Read a CSV file and calculate the average of a numeric column. 170 | 2. Extract data from a specific Excel sheet and filter rows based on a condition. 171 | 3. Parse a nested JSON file and convert it into a DataFrame. 172 | 173 | 174 | 175 | ## 🌟 Summary 176 | 177 | - CSV: Use `pd.read_csv` with customizable parameters. 178 | - Excel: Use `pd.read_excel` to read specific sheets or handle missing data. 
179 | - JSON: Use `pd.read_json` for flat files or `pd.json_normalize` for nested structures. 180 | 181 | --- 182 | 183 | 184 | -------------------------------------------------------------------------------- /06_Data Frames and Tables/06_Data Frames and Tables.md: -------------------------------------------------------------------------------- 1 | [<< Day 5](../05_Data%20Structures/05_Data%20Structures.md) | [Day 7 >>](../07_Importing%20Data/07_Importing%20Data.md) 2 | 3 | 4 | # 📘 Day 6: Dataframes and Tables with Pandas 5 | 6 | Welcome to **Day 6** of the **30 Days of Data Science** series! Today, we will explore **Dataframes and Tables** using the **Pandas** library. Pandas is a powerful Python library for data manipulation and analysis, widely used in data science and machine learning. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 6: Dataframes and Tables with Pandas](#-day-6-dataframes-and-tables-with-pandas) 13 | - [1️⃣ Introduction to Pandas 📊](#1️⃣-introduction-to-pandas-) 14 | - [2️⃣ Dataframes](#2️⃣-dataframes) 15 | - [Creating a DataFrame](#creating-a-dataframe) 16 | - [Accessing Data in a DataFrame](#accessing-data-in-a-dataframe) 17 | - [Adding and Removing Columns](#adding-and-removing-columns) 18 | - [3️⃣ Tables](#3️⃣-tables) 19 | - [Reading Data from a File](#reading-data-from-a-file) 20 | - [Sorting and Filtering Data](#sorting-and-filtering-data) 21 | - [🧠 Practice Exercises](#-practice-exercises) 22 | - [🌟 Summary](#-summary) 23 | 24 | 25 | 26 | 27 | ## 1️⃣ Introduction to Pandas 📊 28 | 29 | Pandas is a Python library designed to simplify data manipulation and analysis. It provides two primary data structures: 30 | 31 | - **Series**: A one-dimensional labeled array (similar to a list). 32 | - **DataFrame**: A two-dimensional labeled data structure (similar to a table). 33 | 34 | To use Pandas, you first need to install it. Run the following command in your terminal if you haven't already: 35 | 36 | ```bash 37 | pip install pandas 38 | ``` 39 | 40 | Then, import it in your Python code: 41 | 42 | ```python 43 | import pandas as pd 44 | ``` 45 | 46 | 47 | 48 | ## 2️⃣ Dataframes 49 | 50 | A **DataFrame** is a two-dimensional labeled data structure that resembles a table. It consists of rows and columns. 51 | 52 | ### Creating a DataFrame 53 | 54 | You can create a DataFrame from various data structures like lists, dictionaries, or even external files. 55 | 56 | #### From a Dictionary 57 | 58 | ```python 59 | import pandas as pd 60 | 61 | data = { 62 | "Name": ["Alice", "Bob", "Charlie"], 63 | "Age": [25, 30, 35], 64 | "City": ["New York", "San Francisco", "Los Angeles"] 65 | } 66 | 67 | df = pd.DataFrame(data) 68 | print(df) 69 | ``` 70 | 71 | **Output**: 72 | 73 | ``` 74 | Name Age City 75 | 0 Alice 25 New York 76 | 1 Bob 30 San Francisco 77 | 2 Charlie 35 Los Angeles 78 | ``` 79 | 80 | 81 | 82 | ### Accessing Data in a DataFrame 83 | 84 | You can access specific rows, columns, or elements in a DataFrame. 
85 | 86 | #### Accessing Columns 87 | 88 | ```python 89 | # Accessing a single column 90 | print(df["Name"]) 91 | 92 | # Accessing multiple columns 93 | print(df[["Name", "Age"]]) 94 | ``` 95 | 96 | #### Accessing Rows 97 | 98 | ```python 99 | # Accessing rows by index 100 | print(df.iloc[0]) # First row 101 | print(df.iloc[1:3]) # Second to third row 102 | ``` 103 | 104 | #### Accessing Specific Elements 105 | 106 | ```python 107 | # Accessing specific cell 108 | print(df.at[0, "Name"]) # Output: Alice 109 | ``` 110 | 111 | 112 | 113 | ### Adding and Removing Columns 114 | 115 | #### Adding a Column 116 | 117 | ```python 118 | df["Country"] = ["USA", "USA", "USA"] 119 | print(df) 120 | ``` 121 | 122 | #### Removing a Column 123 | 124 | ```python 125 | df.drop("Age", axis=1, inplace=True) 126 | print(df) 127 | ``` 128 | 129 | 130 | 131 | ## 3️⃣ Tables 132 | 133 | Tables in Pandas are often created or manipulated by reading data from external sources like CSV files, Excel files, or databases. 134 | 135 | ### Reading Data from a File 136 | 137 | #### Reading a CSV File 138 | 139 | ```python 140 | df = pd.read_csv("data.csv") 141 | print(df.head()) # Display the first 5 rows 142 | ``` 143 | 144 | #### Writing to a CSV File 145 | 146 | ```python 147 | df.to_csv("output.csv", index=False) 148 | ``` 149 | 150 | 151 | 152 | ### Sorting and Filtering Data 153 | 154 | #### Sorting Data 155 | 156 | ```python 157 | # Sorting by a single column 158 | df = df.sort_values("Age") 159 | print(df) 160 | 161 | # Sorting by multiple columns 162 | df = df.sort_values(["Age", "Name"], ascending=[True, False]) 163 | print(df) 164 | ``` 165 | 166 | #### Filtering Data 167 | 168 | ```python 169 | # Filtering rows where Age > 30 170 | filtered_df = df[df["Age"] > 30] 171 | print(filtered_df) 172 | ``` 173 | 174 | 175 | 176 | ## 🧠 Practice Exercises 177 | 178 | 1. Create a DataFrame with three columns: `Product`, `Price`, and `Quantity`. Add a new column `Total` by multiplying `Price` and `Quantity`. 179 | 2. Load a CSV file into a DataFrame and display the first 10 rows. 180 | 3. Sort a DataFrame by a specific column and filter rows based on a condition. 181 | 182 | 183 | 184 | ## 🌟 Summary 185 | 186 | - **DataFrames** are powerful data structures in Pandas for tabular data. 187 | - Pandas makes it easy to manipulate, analyze, and visualize data. 188 | - You can create DataFrames from dictionaries, lists, or external files. 189 | 190 | --- 191 | 192 | -------------------------------------------------------------------------------- /03_Control Flow/03_Control Flow.md: -------------------------------------------------------------------------------- 1 | [<< Day 2](../02_Basics%20of%20the%20Language%20%26%20Git%20Basics/02_Basics%20of%20the%20Language%20%26%20Git%20Basics.md) | [Day 4 >>](../04_Functions%20and%20Modular%20Programming/04_Functions%20and%20Modular%20Programming.md) 2 | 3 | # 📘 Day 3: If-Else and Loops in Python 4 | 5 | Welcome to Day 3 of the **30 Days of Data Science** series! Today, we will cover essential programming constructs—**If-Else Statements** and **Loops**—which are fundamental for controlling the flow of your Python programs. Let’s dive in! 
6 | 7 | 8 | 9 | ## Table of Contents 10 | - [📘 Day 3: If-Else and Loops in Python](#-day-3-if-else-and-loops-in-python) 11 | - [1️⃣ If-Else Statements 🧠](#1️⃣-if-else-statements-) 12 | - [Syntax](#syntax) 13 | - [Example: Simple If-Else Statement](#example-simple-if-else-statement) 14 | - [Example: Nested If-Else](#example-nested-if-else) 15 | - [Example: If-Elif-Else](#example-if-elif-else) 16 | - [2️⃣ Loops 🔁](#2️⃣-loops-) 17 | - [For Loop](#for-loop) 18 | - [Example: Using a For Loop](#example-using-a-for-loop) 19 | - [While Loop](#while-loop) 20 | - [Example: Using a While Loop](#example-using-a-while-loop) 21 | - [Break and Continue](#break-and-continue) 22 | - [Example: Break and Continue](#example-break-and-continue) 23 | - [🧠 Practice Exercises](#-practice-exercises) 24 | - [🌟 Summary](#-summary) 25 | 26 | 27 | 28 | 29 | ## 1️⃣ If-Else Statements 🧠 30 | 31 | ### Syntax 32 | ```python 33 | if condition: 34 | # Code block executed if the condition is True 35 | else: 36 | # Code block executed if the condition is False 37 | ``` 38 | 39 | 40 | 41 | ### Example: Simple If-Else Statement 42 | ```python 43 | age = 20 44 | if age >= 18: 45 | print("You are an adult!") 46 | else: 47 | print("You are a minor!") 48 | ``` 49 | **Output:** 50 | ```plaintext 51 | You are an adult! 52 | ``` 53 | 54 | 55 | 56 | ### Example: Nested If-Else 57 | ```python 58 | age = 16 59 | if age >= 18: 60 | print("You can vote!") 61 | else: 62 | if age >= 16: 63 | print("You are a teenager!") 64 | else: 65 | print("You are a child!") 66 | ``` 67 | **Output:** 68 | ```plaintext 69 | You are a teenager! 70 | ``` 71 | 72 | 73 | 74 | ### Example: If-Elif-Else 75 | ```python 76 | marks = 85 77 | if marks >= 90: 78 | print("Grade: A") 79 | elif marks >= 75: 80 | print("Grade: B") 81 | elif marks >= 50: 82 | print("Grade: C") 83 | else: 84 | print("Grade: F") 85 | ``` 86 | **Output:** 87 | ```plaintext 88 | Grade: B 89 | ``` 90 | 91 | 92 | 93 | ## 2️⃣ Loops 🔁 94 | 95 | Loops allow repetitive tasks to be performed efficiently. 96 | 97 | 98 | 99 | ### For Loop 100 | The **for** loop iterates over a sequence (like a list, tuple, or string). 101 | 102 | #### Syntax 103 | ```python 104 | for item in sequence: 105 | # Code block to execute for each item 106 | ``` 107 | 108 | 109 | 110 | ### Example: Using a For Loop 111 | ```python 112 | numbers = [1, 2, 3, 4, 5] 113 | for num in numbers: 114 | print(num) 115 | ``` 116 | **Output:** 117 | ```plaintext 118 | 1 119 | 2 120 | 3 121 | 4 122 | 5 123 | ``` 124 | 125 | 126 | 127 | ### While Loop 128 | The **while** loop executes a block of code as long as a condition is `True`. 129 | 130 | #### Syntax 131 | ```python 132 | while condition: 133 | # Code block to execute 134 | ``` 135 | 136 | 137 | 138 | ### Example: Using a While Loop 139 | ```python 140 | count = 0 141 | while count < 5: 142 | print(count) 143 | count += 1 144 | ``` 145 | **Output:** 146 | ```plaintext 147 | 0 148 | 1 149 | 2 150 | 3 151 | 4 152 | ``` 153 | 154 | 155 | 156 | ### Break and Continue 157 | 158 | - **Break**: Terminates the loop prematurely. 159 | - **Continue**: Skips the current iteration and moves to the next. 
160 | 161 | 162 | 163 | ### Example: Break and Continue 164 | ```python 165 | for num in range(1, 6): 166 | if num == 3: 167 | break # Exit loop when num is 3 168 | print(num) 169 | ``` 170 | **Output:** 171 | ```plaintext 172 | 1 173 | 2 174 | ``` 175 | 176 | ```python 177 | for num in range(1, 6): 178 | if num == 3: 179 | continue # Skip iteration when num is 3 180 | print(num) 181 | ``` 182 | **Output:** 183 | ```plaintext 184 | 1 185 | 2 186 | 4 187 | 5 188 | ``` 189 | 190 | 191 | 192 | ## 🧠 Practice Exercises 193 | 194 | ### If-Else Statements 195 | 1. Write a program that checks if a number is positive, negative, or zero. 196 | 2. Create a grade classifier using the if-elif-else structure. 197 | 198 | 199 | 200 | ### Loops 201 | 1. Write a program that prints all even numbers from 1 to 50 using a for loop. 202 | 2. Create a program that sums the numbers from 1 to 100 using a while loop. 203 | 3. Use break and continue in a loop to demonstrate their functionality. 204 | 205 | 206 | 207 | ## 🌟 Summary 208 | 209 | - **If-Else Statements** allow you to make decisions in your code. 210 | - **Loops** enable you to automate repetitive tasks efficiently. 211 | - **Break and Continue** give more control over loop execution. 212 | 213 | --- 214 | 215 | 216 | 217 | 218 | -------------------------------------------------------------------------------- /21_Clustering (K-Means)/21_Clustering (K-Means).md: -------------------------------------------------------------------------------- 1 | [<< Day 20](../20_Logistic%20Regression/20_Logistic%20Regression.md) | [Day 22 >>](../22_Decision%20Trees/22_Decision%20Trees.md) 2 | 3 | 4 | 5 | # 📘 Day 21: Clustering with KMeans in Scikit-learn 6 | 7 | Welcome to **Day 21** of the **30 Days of Data Science** series! Today, we explore **Clustering** using the **KMeans algorithm** with Python's Scikit-learn library. Clustering is a fundamental **unsupervised learning** technique used to group similar data points together. 8 | 9 | 10 | 11 | ## Table of Contents 12 | 13 | - [📘 Day 21: Clustering with KMeans in Scikit-learn](#-day-21-clustering-with-kmeans-in-scikit-learn) 14 | - [🔍 What is Clustering?](#-what-is-clustering) 15 | - [📌 The KMeans Algorithm](#-the-kmeans-algorithm) 16 | - [⚙️ Installing Required Libraries](#️-installing-required-libraries) 17 | - [🛠️ KMeans with Scikit-learn: Step-by-Step](#️-kmeans-with-scikit-learn-step-by-step) 18 | - [1️⃣ Data Preparation](#1️⃣-data-preparation) 19 | - [2️⃣ Applying KMeans](#2️⃣-applying-kmeans) 20 | - [3️⃣ Visualizing Clusters](#3️⃣-visualizing-clusters) 21 | - [🧠 Use Cases of Clustering](#-use-cases-of-clustering) 22 | - [🧪 Practice Exercises](#-practice-exercises) 23 | - [🌟 Summary](#-summary) 24 | 25 | 26 | 27 | ## 🔍 What is Clustering? 28 | 29 | Clustering is a type of **unsupervised learning** that involves grouping data into clusters based on their similarities. Unlike supervised learning, clustering does not use labeled data. It’s widely used in: 30 | 31 | - Customer segmentation 32 | - Document clustering 33 | - Image segmentation 34 | - Anomaly detection 35 | 36 | 37 | 38 | ## 📌 The KMeans Algorithm 39 | 40 | The **KMeans algorithm** works by: 41 | 42 | 1. Randomly initializing `K` cluster centroids. 43 | 2. Assigning each data point to the nearest centroid. 44 | 3. Updating centroids by calculating the mean of the assigned points. 45 | 4. Repeating steps 2 and 3 until convergence. 46 | 47 | KMeans tries to minimize the **within-cluster sum of squares (WCSS)** to ensure compact clusters. 
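For reference, the objective KMeans minimizes can be written as follows, where $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is that cluster's centroid:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

Scikit-learn exposes this value on a fitted model as the `inertia_` attribute.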
48 | 49 | 50 | 51 | ## ⚙️ Installing Required Libraries 52 | 53 | Before we proceed, ensure you have Scikit-learn installed: 54 | 55 | ```bash 56 | pip install scikit-learn matplotlib numpy 57 | ``` 58 | 59 | 60 | 61 | ## 🛠️ KMeans with Scikit-learn: Step-by-Step 62 | 63 | Let’s implement KMeans using Scikit-learn. 64 | 65 | 66 | 67 | ### 1️⃣ Data Preparation 68 | 69 | We’ll generate a sample dataset using Scikit-learn’s `make_blobs` function: 70 | 71 | ```python 72 | import numpy as np 73 | from sklearn.datasets import make_blobs 74 | import matplotlib.pyplot as plt 75 | 76 | # Generating synthetic data 77 | X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42) 78 | 79 | # Visualizing the data 80 | plt.scatter(X[:, 0], X[:, 1], s=50) 81 | plt.title("Sample Data for Clustering") 82 | plt.show() 83 | ``` 84 | 85 | **Explanation**: 86 | - `n_samples`: Number of data points. 87 | - `centers`: Number of clusters. 88 | - `cluster_std`: Spread of each cluster. 89 | 90 | 91 | 92 | ### 2️⃣ Applying KMeans 93 | 94 | Now, let’s apply the KMeans algorithm to cluster the data into 4 groups. 95 | 96 | ```python 97 | from sklearn.cluster import KMeans 98 | 99 | # Applying KMeans 100 | kmeans = KMeans(n_clusters=4, random_state=42) 101 | y_kmeans = kmeans.fit_predict(X) 102 | 103 | # Printing centroids 104 | print("Cluster Centers:") 105 | print(kmeans.cluster_centers_) 106 | ``` 107 | 108 | **Explanation**: 109 | - `n_clusters`: The number of clusters. 110 | - `fit_predict()`: Assigns each point to a cluster and returns labels. 111 | 112 | 113 | 114 | ### 3️⃣ Visualizing Clusters 115 | 116 | Let’s visualize the resulting clusters and centroids. 117 | 118 | ```python 119 | # Visualizing the clusters 120 | plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis') 121 | 122 | # Marking centroids 123 | centroids = kmeans.cluster_centers_ 124 | plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', marker='X') 125 | plt.title("Clusters and Centroids") 126 | plt.show() 127 | ``` 128 | 129 | **Output**: 130 | - Data points are colored based on their cluster. 131 | - Red `X` marks represent the centroids. 132 | 133 | 134 | 135 | ## 🧠 Use Cases of Clustering 136 | 137 | - **Market Segmentation**: Grouping customers based on purchasing behavior. 138 | - **Image Compression**: Reducing colors in an image using cluster centroids. 139 | - **Document Clustering**: Grouping similar text documents. 140 | - **Biological Analysis**: Grouping genes with similar expression patterns. 141 | 142 | 143 | 144 | ## 🧪 Practice Exercises 145 | 146 | 1. Apply KMeans to a custom dataset of your choice. 147 | 2. Experiment with different values of `n_clusters` and observe the results. 148 | 3. Explore the **Elbow Method** to determine the optimal number of clusters. 149 | 150 | 151 | 152 | ## 🌟 Summary 153 | 154 | - Clustering is an essential unsupervised learning technique. 155 | - KMeans groups data into clusters by minimizing WCSS. 156 | - Scikit-learn provides easy-to-use tools for implementing KMeans. 157 | - Visualizing clusters helps interpret results effectively. 
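As a starting point for exercise 3 above, here is a minimal sketch of the Elbow Method: fit KMeans for a range of `k` values, record the WCSS (`inertia_`) for each, and look for the "elbow" where adding clusters stops paying off. The dataset and the range of `k` below are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, similar to the example used earlier in this lesson
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Fit KMeans for k = 1..9 and record the WCSS (inertia) for each k
wcss = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# The "elbow" in this curve suggests a reasonable number of clusters
plt.plot(range(1, 10), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```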
158 | 159 | --- 160 | 161 | 162 | -------------------------------------------------------------------------------- /10_Data Visualization Basics/10_Data Visualization Basics.md: -------------------------------------------------------------------------------- 1 | [<< Day 9](../09_Exploratory%20Data%20Analysis%20(EDA)/09_Exploratory%20Data%20Analysis%20(EDA).md) | [Day 11 >>](../11_Advanced%20Data%20Visualization/11_Advanced%20Data%20Visualization.md) 2 | 3 | 4 | # 📘 Day 10: Data Visualization 5 | 6 | Welcome to **Day 10** of the **30 Days of Data Science** series! Today, we dive into **Data Visualization** using two powerful libraries: **Matplotlib** and **Seaborn**. These tools allow us to create insightful visualizations that help understand and communicate data effectively. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 10: Data Visualization](#-day-10-data-visualization) 13 | - [1️⃣ Introduction to Data Visualization](#1️⃣-introduction-to-data-visualization) 14 | - [2️⃣ Visualizing Data with Matplotlib](#2️⃣-visualizing-data-with-matplotlib) 15 | - [Line Plots](#line-plots) 16 | - [Bar Charts](#bar-charts) 17 | - [Scatter Plots](#scatter-plots) 18 | - [Customizing Plots](#customizing-plots) 19 | - [3️⃣ Visualizing Data with Seaborn](#3️⃣-visualizing-data-with-seaborn) 20 | - [Distribution Plots](#distribution-plots) 21 | - [Categorical Plots](#categorical-plots) 22 | - [Pair Plots](#pair-plots) 23 | - [Heatmaps](#heatmaps) 24 | - [🧠 Practice Exercises](#-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | 30 | ## 1️⃣ Introduction to Data Visualization 31 | 32 | **Data Visualization** is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns. 33 | 34 | We will use: 35 | 36 | - **Matplotlib**: A versatile library for creating basic plots. 37 | - **Seaborn**: Built on top of Matplotlib, it provides a high-level interface for creating attractive and informative statistical graphics. 38 | 39 | Install the libraries if needed: 40 | 41 | ```bash 42 | pip install matplotlib seaborn 43 | ``` 44 | 45 | 46 | 47 | ## 2️⃣ Visualizing Data with Matplotlib 48 | 49 | ### Line Plots 50 | 51 | Line plots are great for visualizing trends over time. 52 | 53 | **Example**: 54 | 55 | ```python 56 | import matplotlib.pyplot as plt 57 | 58 | x = [1, 2, 3, 4, 5] 59 | y = [2, 4, 6, 8, 10] 60 | 61 | plt.plot(x, y, label="y = 2x") 62 | plt.title("Line Plot Example") 63 | plt.xlabel("X-axis") 64 | plt.ylabel("Y-axis") 65 | plt.legend() 66 | plt.show() 67 | ``` 68 | 69 | 70 | 71 | ### Bar Charts 72 | 73 | Bar charts are ideal for comparing categories. 74 | 75 | **Example**: 76 | 77 | ```python 78 | categories = ["A", "B", "C", "D"] 79 | values = [3, 7, 8, 5] 80 | 81 | plt.bar(categories, values, color="skyblue") 82 | plt.title("Bar Chart Example") 83 | plt.xlabel("Categories") 84 | plt.ylabel("Values") 85 | plt.show() 86 | ``` 87 | 88 | 89 | 90 | ### Scatter Plots 91 | 92 | Scatter plots show relationships between two variables. 93 | 94 | **Example**: 95 | 96 | ```python 97 | x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11] 98 | y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78] 99 | 100 | plt.scatter(x, y, color="green") 101 | plt.title("Scatter Plot Example") 102 | plt.xlabel("X-axis") 103 | plt.ylabel("Y-axis") 104 | plt.show() 105 | ``` 106 | 107 | 108 | 109 | ### Customizing Plots 110 | 111 | You can customize colors, styles, and layouts. 
112 | 113 | **Example**: 114 | 115 | ```python 116 | plt.plot(x, y, linestyle="--", color="red", marker="o") 117 | plt.title("Customized Line Plot") 118 | plt.show() 119 | ``` 120 | 121 | 122 | 123 | ## 3️⃣ Visualizing Data with Seaborn 124 | 125 | ### Distribution Plots 126 | 127 | Distribution plots are used to visualize the distribution of a dataset. 128 | 129 | **Example**: 130 | 131 | ```python 132 | import seaborn as sns 133 | 134 | data = [10, 20, 20, 30, 30, 30, 40, 50] 135 | sns.histplot(data, kde=True, color="purple") 136 | plt.title("Distribution Plot Example") 137 | plt.show() 138 | ``` 139 | 140 | 141 | 142 | ### Categorical Plots 143 | 144 | Categorical plots help visualize relationships between categories. 145 | 146 | **Example (Box Plot)**: 147 | 148 | ```python 149 | tips = sns.load_dataset("tips") 150 | sns.boxplot(x="day", y="total_bill", data=tips) 151 | plt.title("Box Plot Example") 152 | plt.show() 153 | ``` 154 | 155 | 156 | 157 | ### Pair Plots 158 | 159 | Pair plots visualize pairwise relationships in a dataset. 160 | 161 | **Example**: 162 | 163 | ```python 164 | sns.pairplot(tips, hue="day") 165 | plt.show() 166 | ``` 167 | 168 | 169 | 170 | ### Heatmaps 171 | 172 | Heatmaps are used for showing correlations or matrices. 173 | 174 | **Example**: 175 | 176 | ```python 177 | correlation = tips.corr() 178 | sns.heatmap(correlation, annot=True, cmap="coolwarm") 179 | plt.title("Heatmap Example") 180 | plt.show() 181 | ``` 182 | 183 | 184 | 185 | ## 🧠 Practice Exercises 186 | 187 | 1. Create a bar chart showing the sales of different products. 188 | 2. Plot the distribution of ages using Seaborn. 189 | 3. Use a scatter plot to explore the relationship between two variables. 190 | 4. Visualize pairwise relationships in a custom dataset. 191 | 192 | 193 | 194 | ## 🌟 Summary 195 | 196 | - **Matplotlib**: Provides low-level control for creating visualizations. 197 | - **Seaborn**: Offers high-level abstractions for complex visualizations. 198 | - Common visualization types include line plots, bar charts, scatter plots, box plots, and heatmaps. 199 | 200 | --- 201 | 202 | 203 | -------------------------------------------------------------------------------- /16_Statistical Concepts/16_Statistical Concepts.md: -------------------------------------------------------------------------------- 1 | [<< Day 15](../15_Regular%20Expressions/15_Regular%20Expressions.md) | [Day 17 >>](../17_Hypothesis%20Testing/17_Hypothesis%20Testing.md) 2 | 3 | # *Day 16: Statistical Concepts of NumPy and SciPy* 📊📈 4 | 5 | ## *Table of Contents* 6 | - [Introduction to Statistical Concepts](#introduction-to-statistical-concepts) ✨ 7 | - [NumPy for Statistics](#numpy-for-statistics) 🔢 8 | - [Mean](#mean) 📐 9 | - [Median](#median) 🎯 10 | - [Mode](#mode) 🎲 11 | - [Standard Deviation](#standard-deviation) 📏 12 | - [Variance](#variance) 🔄 13 | - [Percentile](#percentile) 📊 14 | - [SciPy for Advanced Statistics](#scipy-for-advanced-statistics) ⚡ 15 | - [Probability Distributions ](#probability-distributions) 🎛️ 16 | - [Hypothesis Testing](#hypothesis-testing) 🔍 17 | - [Linear Regression](#linear-regression) 📉 18 | - [Practice Exercises](#practice-exercises) 📝 19 | - [Summary](#summary) 🚀 20 | 21 | 22 | 23 | ## Introduction to Statistical Concepts✨ 24 | 25 | Statistics is the foundation of data science. It helps us analyze data, identify patterns, and make predictions. Python libraries like *NumPy* and *SciPy* provide powerful tools for performing statistical operations. 
26 | 27 | In this lesson, we will: 28 | - Explore statistical functions in *NumPy*, such as mean, median, and standard deviation. 29 | - Dive into *SciPy* for advanced statistical analysis, including hypothesis testing and working with probability distributions. 30 | 31 | 32 | 33 | ## NumPy for Statistics🔢 34 | 35 | NumPy is a powerful numerical computing library. It includes essential statistical functions that operate on arrays. 36 | 37 | ### Mean📐 38 | The mean is the average of a dataset. 39 | 40 | *Example*: 41 | 42 | python 43 | import numpy as np 44 | 45 | data = [10, 20, 30, 40, 50] 46 | mean = np.mean(data) 47 | print(f"Mean: {mean}") 48 | 49 | 50 | *Output*: 51 | 52 | Mean: 30.0 53 | 54 | 55 | 56 | 57 | ### Median🎯 58 | 59 | *Output*: 60 | 61 | Median: 30.0 62 | 63 | 64 | 65 | 66 | ### Mode🎲 67 | NumPy doesn't have a direct function for mode, but we can use *SciPy*. 68 | 69 | *Example*: 70 | 71 | python 72 | from scipy import stats 73 | 74 | data = [1, 2, 2, 3, 4] 75 | mode = stats.mode(data) 76 | print(f"Mode: {mode.mode[0]}, Count: {mode.count[0]}") 77 | 78 | 79 | *Output*: 80 | 81 | Mode: 2, Count: 2 82 | 83 | 84 | 85 | 86 | ### Standard Deviation📏 87 | The standard deviation measures how spread out the data is. 88 | 89 | *Example*: 90 | 91 | python 92 | data = [10, 20, 30, 40, 50] 93 | std_dev = np.std(data) 94 | print(f"Standard Deviation: {std_dev}") 95 | 96 | 97 | *Output*: 98 | 99 | Standard Deviation: 14.142135623730951 100 | 101 | 102 | 103 | 104 | ### Variance🔄 105 | Variance is the square of the standard deviation. 106 | 107 | *Example*: 108 | 109 | python 110 | variance = np.var(data) 111 | print(f"Variance: {variance}") 112 | 113 | 114 | *Output*: 115 | 116 | Variance: 200.0 117 | 118 | 119 | 120 | 121 | ### Percentile📊 122 | Percentiles divide data into 100 equal parts. 123 | 124 | *Example*: 125 | 126 | python 127 | percentile = np.percentile(data, 50) # 50th percentile is the median 128 | print(f"50th Percentile: {percentile}") 129 | 130 | 131 | *Output*: 132 | 133 | 50th Percentile: 30.0 134 | 135 | 136 | 137 | 138 | ## SciPy for Advanced Statistics⚡ 139 | 140 | SciPy builds on NumPy and provides additional statistical capabilities. 141 | 142 | ### Probability Distributions 143 | SciPy supports numerous probability distributions, such as normal, binomial, and uniform. 144 | 145 | *Example: Normal Distribution* 146 | 147 | python 148 | from scipy.stats import norm 149 | 150 | # Generate random data 151 | data = norm.rvs(loc=0, scale=1, size=1000) 152 | 153 | # Compute PDF 154 | x = np.linspace(-3, 3, 100) 155 | pdf = norm.pdf(x) 156 | 157 | print(f"First 5 PDF values: {pdf[:5]}") 158 | 159 | 160 | 161 | 162 | ### Hypothesis Testing🔍 163 | Hypothesis testing is used to test assumptions about data. 164 | 165 | *Example: t-Test* 166 | 167 | python 168 | from scipy.stats import ttest_1samp 169 | 170 | data = [2.5, 3.0, 2.8, 3.2, 3.0] 171 | t_stat, p_value = ttest_1samp(data, 3) 172 | print(f"T-statistic: {t_stat}, P-value: {p_value}") 173 | 174 | 175 | *Output*: 176 | 177 | T-statistic: -1.0, P-value: 0.374 178 | 179 | 180 | 181 | 182 | ### Linear Regression📉 183 | Linear regression fits a line to a set of data points. 
184 | 185 | *Example*: 186 | 187 | python 188 | from scipy.stats import linregress 189 | 190 | x = [1, 2, 3, 4, 5] 191 | y = [2, 4, 5, 4, 5] 192 | 193 | slope, intercept, r_value, p_value, std_err = linregress(x, y) 194 | print(f"Slope: {slope}, Intercept: {intercept}") 195 | 196 | 197 | *Output*: 198 | 199 | Slope: 0.6, Intercept: 2.2 200 | 201 | 202 | 203 | 204 | ## Practice Exercises📝 205 | 206 | 1. Calculate the *mean, **median, and **mode* of a dataset using NumPy and SciPy. 207 | 2. Use the norm distribution from SciPy to generate and visualize random data. 208 | 3. Perform a *t-test* on a sample dataset. 209 | 4. Fit a *linear regression model* to a dataset using SciPy. 210 | 211 | 212 | 213 | ## Summary🚀 214 | 215 | - *NumPy* provides essential statistical functions, including *mean, **median, **variance, and **standard deviation*. 216 | - *SciPy* offers advanced tools for working with *probability distributions, **hypothesis testing, and **regression analysis*. 217 | - These libraries form the foundation for statistical analysis in Python. 218 | 219 | --- 220 | -------------------------------------------------------------------------------- /14_Working with APIs and JSON/14_Working with APIs and JSON.md: -------------------------------------------------------------------------------- 1 | [<< Day 13](../13_Time%20Series%20Analysis%20Introduction/13_Time%20Series%20Analysis%20Introduction.md) | [Day 15 >>](../15_Regular%20Expressions/15_Regular%20Expressions.md) 2 | 3 | # 📘 Day 14: Working with APIs and JSON in Python 4 | 5 | Welcome to Day 14 of the **30 Days of Data Science** series! Today, we focus on understanding and interacting with **APIs** and handling **JSON** data. APIs and JSON are crucial in data science for accessing and managing external data. 6 | 7 | 8 | 9 | ## Table of Contents 10 | 11 | - [📘 Day 14: Working with APIs and JSON in Python](#-day-14-working-with-apis-and-json-in-python) 12 | - [1️⃣ APIs 📡](#1️⃣-apis-) 13 | - [What is an API?](#what-is-an-api) 14 | - [Making HTTP Requests](#making-http-requests) 15 | - [Example: Fetching Data from an API](#example-fetching-data-from-an-api) 16 | - [2️⃣ JSON: JavaScript Object Notation 📦](#2️⃣-json-javascript-object-notation-) 17 | - [What is JSON?](#what-is-json) 18 | - [Working with JSON in Python](#working-with-json-in-python) 19 | - [Example: Parsing JSON Data](#example-parsing-json-data) 20 | - [Example: Writing JSON Data](#example-writing-json-data) 21 | - [🧠 Practice Exercises](#-practice-exercises) 22 | - [🌟 Summary](#-summary) 23 | 24 | 25 | ## 1️⃣ APIs 📡 26 | 27 | ### What is an API? 28 | 29 | An **API (Application Programming Interface)** allows two systems to communicate with each other. In data science, APIs are used to access real-time data, such as weather updates, stock prices, and social media feeds. 30 | 31 | APIs typically return data in JSON format, which is lightweight and easy to parse. 32 | 33 | 34 | 35 | ### Making HTTP Requests 36 | 37 | To interact with an API, you need to make **HTTP requests**. Python's `requests` library simplifies this process. 38 | 39 | #### HTTP Methods: 40 | 41 | - **GET**: Retrieve data from an API. 42 | - **POST**: Send data to an API. 43 | - **PUT**: Update data on an API. 44 | - **DELETE**: Delete data on an API. 45 | 46 | 47 | 48 | ### Example: Fetching Data from an API 49 | 50 | Let’s fetch weather data using a public API. 
51 | 52 | ```python 53 | import requests 54 | 55 | # API Endpoint 56 | url = "https://api.open-meteo.com/v1/forecast" 57 | params = { 58 | "latitude": 37.7749, # Latitude for San Francisco 59 | "longitude": -122.4194, # Longitude for San Francisco 60 | "hourly": "temperature_2m", 61 | } 62 | 63 | response = requests.get(url, params=params) 64 | 65 | if response.status_code == 200: 66 | data = response.json() 67 | print("Weather Data:", data) 68 | else: 69 | print(f"Failed to fetch data. Status code: {response.status_code}") 70 | ``` 71 | 72 | **Explanation:** 73 | 1. We use the `requests.get()` method to send a GET request. 74 | 2. The `params` dictionary contains query parameters required by the API. 75 | 3. The response is checked for a 200 status code (success) and parsed as JSON. 76 | 77 | **Output:** 78 | 79 | ```plaintext 80 | Weather Data: { ...JSON data... } 81 | ``` 82 | 83 | 84 | 85 | ## 2️⃣ JSON: JavaScript Object Notation 📦 86 | 87 | ### What is JSON? 88 | 89 | **JSON (JavaScript Object Notation)** is a lightweight data format often used to send and receive data through APIs. JSON data is structured as key-value pairs. 90 | 91 | #### Example of JSON: 92 | 93 | ```json 94 | { 95 | "name": "Alice", 96 | "age": 25, 97 | "skills": ["Python", "Data Science"] 98 | } 99 | ``` 100 | 101 | 102 | 103 | ### Working with JSON in Python 104 | 105 | Python provides the `json` module to parse and create JSON data. 106 | 107 | 108 | 109 | ### Example: Parsing JSON Data 110 | 111 | Let’s parse JSON data from a string. 112 | 113 | ```python 114 | import json 115 | 116 | # JSON string 117 | json_data = ''' 118 | { 119 | "name": "Alice", 120 | "age": 25, 121 | "skills": ["Python", "Data Science"] 122 | } 123 | ''' 124 | 125 | # Parse JSON string to Python dictionary 126 | data = json.loads(json_data) 127 | 128 | print(data["name"]) # Output: Alice 129 | print(data["skills"]) # Output: ['Python', 'Data Science'] 130 | ``` 131 | 132 | **Explanation:** 133 | 134 | 1. The `json.loads()` method converts a JSON string into a Python dictionary. 135 | 2. You can access JSON data using keys. 136 | 137 | 138 | 139 | ### Example: Writing JSON Data 140 | 141 | You can write Python objects into JSON format. 142 | 143 | ```python 144 | import json 145 | 146 | # Python dictionary 147 | data = { 148 | "name": "Bob", 149 | "age": 30, 150 | "skills": ["Machine Learning", "Deep Learning"] 151 | } 152 | 153 | # Convert Python dictionary to JSON string 154 | json_string = json.dumps(data, indent=4) 155 | 156 | print(json_string) 157 | ``` 158 | 159 | **Explanation:** 160 | 161 | 1. The `json.dumps()` method converts a Python object to a JSON string. 162 | 2. The `indent` parameter makes the output more readable. 163 | 164 | **Output:** 165 | 166 | ```json 167 | { 168 | "name": "Bob", 169 | "age": 30, 170 | "skills": ["Machine Learning", "Deep Learning"] 171 | } 172 | ``` 173 | 174 | 175 | 176 | ## 🧠 Practice Exercises 177 | 178 | 1. Use the `requests` module to fetch data from an API of your choice. 179 | 2. Parse the JSON response to extract specific information. 180 | 3. Create a Python dictionary and write it as a JSON file. 181 | 4. Explore Python's `json` module documentation for advanced features. 182 | 183 | 184 | 185 | ## 🌟 Summary 186 | 187 | - APIs enable communication between different systems, making real-time data accessible. 188 | - The `requests` module simplifies sending HTTP requests. 189 | - JSON is a lightweight and popular format for data exchange. 
190 | - Python's `json` module makes it easy to parse and create JSON data. 191 | 192 | --- 193 | 194 | 195 | -------------------------------------------------------------------------------- /22_Decision Trees/22_Decision Trees.md: -------------------------------------------------------------------------------- 1 | [<< Day 21](../21_Clustering%20(K-Means)/21_Clustering%20(K-Means).md) | [Day 23 >>](../23_Handling%20Imbalanced%20Data/23_Handling%20Imbalanced%20Data.md) 2 | 3 | # 📘 Day 22: Decision Trees 4 | 5 | Welcome to **Day 22** of the 30 Days of Data Science series! Today, we will dive into **Decision Trees**, an intuitive and powerful algorithm for classification and regression tasks. We will also explore the implementation of `DecisionTreeClassifier` from the scikit-learn library in Python. 6 | 7 | 8 | 9 | ## Table of Contents 10 | 11 | - [🌳 What is a Decision Tree?](#-what-is-a-decision-tree) 12 | - [🛠️ Decision Trees in scikit-learn](#️-decision-trees-in-scikit-learn) 13 | - [🧠 Key Concepts](#-key-concepts) 14 | - [1️⃣ Splitting Criteria](#1️⃣-splitting-criteria) 15 | - [2️⃣ Overfitting and Pruning](#2️⃣-overfitting-and-pruning) 16 | - [🔨 Implementation](#-implementation) 17 | - [Example: Classifying Iris Dataset](#example-classifying-iris-dataset) 18 | - [🌟 Advantages and Limitations](#-advantages-and-limitations) 19 | - [Pros](#pros) 20 | - [Cons](#cons) 21 | - [🔗 Further Reading](#-further-reading) 22 | - [📚 Exercises](#-exercises) 23 | - [📜 Summary](#-summary) 24 | 25 | 26 | 27 | ## 🌳 What is a Decision Tree? 28 | 29 | A **Decision Tree** is a tree-like structure used to represent decisions and their possible consequences. It consists of nodes that split data based on features, ultimately leading to predictions. Key components include: 30 | 31 | - **Root Node**: The initial node containing the entire dataset. 32 | - **Internal Nodes**: Decision points splitting data based on feature conditions. 33 | - **Leaf Nodes**: Endpoints that provide predictions. 34 | 35 | ### How it Works: 36 | 1. The algorithm selects a feature and a threshold to split the dataset. 37 | 2. This process repeats recursively, creating branches. 38 | 3. The tree stops splitting when it meets a predefined condition (e.g., maximum depth). 39 | 40 | 41 | 42 | ## 🛠️ Decision Trees in scikit-learn 43 | 44 | scikit-learn provides the `DecisionTreeClassifier` for classification tasks. It allows fine-tuning with parameters like `criterion`, `max_depth`, and `min_samples_split`. 45 | 46 | ### Installation 47 | To use scikit-learn, ensure it is installed: 48 | 49 | ```bash 50 | pip install scikit-learn 51 | ``` 52 | 53 | 54 | 55 | ## 🧠 Key Concepts 56 | 57 | ### 1️⃣ Splitting Criteria 58 | 59 | Decision trees evaluate splits based on impurity measures like: 60 | 61 | - **Gini Impurity** (default in scikit-learn): 62 | Gini = $1 - \sum_{i=1}^n p_i^2$ 63 | 64 | - **Entropy** (Information Gain): 65 | Entropy = $-\sum_{i=1}^n p_i \log_2(p_i)$ 66 | 67 | You can specify the criterion when initializing the classifier: 68 | 69 | ```python 70 | from sklearn.tree import DecisionTreeClassifier 71 | clf = DecisionTreeClassifier(criterion='entropy') 72 | ``` 73 | 74 | ### 2️⃣ Overfitting and Pruning 75 | 76 | - **Overfitting**: Trees grow too deep, capturing noise instead of patterns. 77 | - **Pruning**: Limits tree growth by setting constraints like `max_depth` or `min_samples_split`. 
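A minimal sketch of what these constraints look like in code (the parameter values below are arbitrary illustrations, and `ccp_alpha` additionally enables scikit-learn's cost-complexity post-pruning):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop the tree from growing too deep or splitting tiny nodes.
# Post-pruning: ccp_alpha trims branches whose complexity outweighs their benefit.
pruned_clf = DecisionTreeClassifier(
    max_depth=4,            # limit overall tree depth
    min_samples_split=10,   # a node needs at least 10 samples before it can split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    ccp_alpha=0.01,         # cost-complexity pruning strength (0 = no pruning)
    random_state=42,
)
```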
78 | 79 | 80 | 81 | ## 🔨 Implementation 82 | 83 | ### Example: Classifying Iris Dataset 84 | 85 | ```python 86 | # Import necessary libraries 87 | from sklearn.datasets import load_iris 88 | from sklearn.model_selection import train_test_split 89 | from sklearn.tree import DecisionTreeClassifier, plot_tree 90 | import matplotlib.pyplot as plt 91 | 92 | # Load Iris dataset 93 | iris = load_iris() 94 | X, y = iris.data, iris.target 95 | 96 | # Split data into training and testing sets 97 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 98 | 99 | # Initialize DecisionTreeClassifier 100 | clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42) 101 | 102 | # Train the classifier 103 | clf.fit(X_train, y_train) 104 | 105 | # Evaluate the classifier 106 | accuracy = clf.score(X_test, y_test) 107 | print(f"Accuracy: {accuracy * 100:.2f}%") 108 | 109 | # Visualize the decision tree 110 | plt.figure(figsize=(12, 8)) 111 | plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True) 112 | plt.show() 113 | ``` 114 | 115 | ### Output Explanation: 116 | - The tree visualization shows feature splits and class probabilities at each node. 117 | - The accuracy metric evaluates model performance on test data. 118 | 119 | 120 | 121 | ## 🌟 Advantages and Limitations 122 | 123 | ### Pros: 124 | - Easy to understand and interpret. 125 | - Requires little data preprocessing. 126 | - Handles both numerical and categorical data. 127 | 128 | ### Cons: 129 | - Prone to overfitting. 130 | - Sensitive to small data changes. 131 | - Not suitable for very large datasets. 132 | 133 | 134 | 135 | ## 🔗 Further Reading 136 | - [scikit-learn Documentation: Decision Trees](https://scikit-learn.org/stable/modules/tree.html) 137 | 138 | 139 | 140 | ## 📚 Exercises 141 | 142 | 1. Train a decision tree on a different dataset (e.g., Wine Dataset) and compare Gini and Entropy criteria. 143 | 2. Experiment with hyperparameters like `min_samples_leaf` and `max_features` to reduce overfitting. 144 | 3. Visualize decision boundaries using 2D features. 145 | 146 | 147 | 148 | ## 📜 Summary 149 | 150 | In this session, you have learned: 151 | - The structure and working of decision trees. 152 | - Splitting criteria like Gini Impurity and Entropy. 153 | - The importance of pruning to avoid overfitting. 154 | - Implementing a decision tree using scikit-learn. 155 | 156 | Decision trees are a fundamental building block in machine learning and serve as the basis for ensemble methods like Random Forests and Gradient Boosted Trees. Keep experimenting with datasets to gain a deeper understanding. 157 | 158 | --- 159 | 160 | 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /24_Feature Engineering/24_Feature Engineering.md: -------------------------------------------------------------------------------- 1 | [<< Day 23](../23_Handling%20Imbalanced%20Data/23_Handling%20Imbalanced%20Data.md) | [Day 25 >>](../25_Model%20Evaluation%20and%20Metrics/25_Model%20Evaluation%20and%20Metrics.md) 2 | 3 | # 📚 Day 24: Feature Engineering 4 | 5 | Welcome to **Day 24** of the 30 Days of Data Science series! 🎉 Today, we delve into the critical concept of **Feature Engineering**, a cornerstone of building effective machine learning models. We will explore techniques such as **Encoding**, **Scaling**, and **Feature Selection** to prepare data for modeling. 
🔧🎨 6 | 7 | 8 | 9 | ## 📌 Table of Contents 10 | 11 | - [ 🎯Introduction to Feature Engineering](#introduction-to-feature-engineering) 12 | - [ 🔧Encoding Techniques](#encoding-techniques) 13 | - [ 🔐One-Hot Encoding](#one-hot-encoding) 14 | - [ 🔑Label Encoding](#label-encoding) 15 | - [ 🏦Target Encoding](#target-encoding) 16 | - [ 📈Scaling Features](#scaling-features) 17 | - [ 💸Standardization](#standardization) 18 | - [ 🪙Normalization](#normalization) 19 | - [ 🎯Feature Selection](#feature-selection) 20 | - [ 🔬Filter Methods](#filter-methods) 21 | - [ 🧠Wrapper Methods](#wrapper-methods) 22 | - [ 📊Embedded Methods](#embedded-methods) 23 | - [ 🖋Practice Exercises](#practice-exercises) 24 | - [ 📌Summary](#summary) 25 | 26 | 27 | 28 | ## 🎯Introduction to Feature Engineering 29 | 30 | **Feature Engineering** is the process of transforming raw data into meaningful features that improve the performance of machine learning models. It includes: 31 | 32 | - Encoding categorical variables 33 | - Scaling numerical features 34 | - Selecting the most relevant features 35 | 36 | Effective feature engineering leads to: 37 | - Improved model accuracy 38 | - Faster convergence during training 39 | - Reduced overfitting 40 | 41 | 42 | 43 | ## 🔧Encoding Techniques 44 | 45 | ### 🔐One-Hot Encoding 46 | One-hot encoding converts categorical variables into binary vectors. 47 | 48 | #### Example: 49 | ```python 50 | import pandas as pd 51 | 52 | # Sample data 53 | data = {'Color': ['Red', 'Green', 'Blue']} 54 | df = pd.DataFrame(data) 55 | 56 | # Apply one-hot encoding 57 | encoded_df = pd.get_dummies(df, columns=['Color']) 58 | print(encoded_df) 59 | ``` 60 | **Output:** 61 | ``` 62 | Color_Blue Color_Green Color_Red 63 | 0 0 0 1 64 | 1 0 1 0 65 | 2 1 0 0 66 | ``` 67 | 68 | ### 🔑Label Encoding 69 | Label encoding assigns unique integers to each category. 70 | 71 | #### Example: 72 | ```python 73 | from sklearn.preprocessing import LabelEncoder 74 | 75 | # Sample data 76 | labels = ['Red', 'Green', 'Blue'] 77 | encoder = LabelEncoder() 78 | encoded_labels = encoder.fit_transform(labels) 79 | print(encoded_labels) 80 | ``` 81 | **Output:** 82 | ``` 83 | [2 1 0] 84 | ``` 85 | 86 | ### 🏦Target Encoding 87 | Target encoding maps categories to the mean of the target variable. 88 | 89 | #### Example: 90 | ```python 91 | import pandas as pd 92 | 93 | # Sample data 94 | data = {'Category': ['A', 'B', 'A', 'C'], 'Target': [1, 0, 1, 0]} 95 | df = pd.DataFrame(data) 96 | 97 | def target_encode(column, target): 98 | return column.map(target.groupby(column).mean()) 99 | 100 | df['Encoded_Category'] = target_encode(df['Category'], df['Target']) 101 | print(df) 102 | ``` 103 | 104 | 105 | 106 | ## 📈Scaling Features 107 | 108 | ### 💸Standardization 109 | Standardization scales features to have a mean of 0 and a standard deviation of 1. 110 | 111 | #### Example: 112 | ```python 113 | from sklearn.preprocessing import StandardScaler 114 | import numpy as np 115 | 116 | # Sample data 117 | X = np.array([[1, 2], [3, 4], [5, 6]]) 118 | scaler = StandardScaler() 119 | scaled_X = scaler.fit_transform(X) 120 | print(scaled_X) 121 | ``` 122 | 123 | ### 🪙Normalization 124 | Normalization scales features to a range of [0, 1]. 
125 | 126 | #### Example: 127 | ```python 128 | from sklearn.preprocessing import MinMaxScaler 129 | 130 | # Sample data 131 | scaler = MinMaxScaler() 132 | norm_X = scaler.fit_transform(X) 133 | print(norm_X) 134 | ``` 135 | 136 | 137 | 138 | ## 🎯Feature Selection 139 | 140 | ### 🔬Filter Methods 141 | Filter methods use statistical tests to score and select features. 142 | 143 | #### Example: 144 | ```python 145 | from sklearn.feature_selection import SelectKBest, chi2 146 | 147 | # Sample data 148 | X = [[10, 20, 30], [20, 30, 40], [30, 40, 50]] 149 | y = [1, 0, 1] 150 | selector = SelectKBest(chi2, k=2) 151 | selected_X = selector.fit_transform(X, y) 152 | print(selected_X) 153 | ``` 154 | 155 | ### 🧠Wrapper Methods 156 | Wrapper methods use a predictive model to evaluate feature subsets. 157 | 158 | #### Example: 159 | ```python 160 | from sklearn.feature_selection import RFE 161 | from sklearn.ensemble import RandomForestClassifier 162 | 163 | # Sample data 164 | estimator = RandomForestClassifier() 165 | rfe = RFE(estimator, n_features_to_select=2) 166 | rfe.fit(X, y) 167 | print(rfe.support_) 168 | ``` 169 | 170 | ### 📊Embedded Methods 171 | Embedded methods perform feature selection during model training (e.g., Lasso). 172 | 173 | #### Example: 174 | ```python 175 | from sklearn.linear_model import Lasso 176 | 177 | # Sample data 178 | lasso = Lasso(alpha=0.01) 179 | lasso.fit(X, y) 180 | print(lasso.coef_) 181 | ``` 182 | 183 | 184 | 185 | ## 🖋Practice Exercises 186 | 187 | 1. Implement one-hot encoding and label encoding on a dataset of your choice. 188 | 2. Experiment with scaling techniques and observe their impact on a logistic regression model. 189 | 3. Apply SelectKBest and RFE on a dataset to compare their feature selection results. 190 | 191 | 192 | 193 | ## 📌Summary 194 | 195 | Today, we covered: 196 | 197 | - Encoding techniques for categorical variables. 198 | - Scaling methods to normalize numerical features. 199 | - Feature selection approaches to identify important features. 200 | 201 | Feature engineering is an art and science that significantly impacts the success of machine learning models. Keep exploring and practicing these techniques! 🚀 202 | 203 | --- 204 | -------------------------------------------------------------------------------- /17_Hypothesis Testing/17_Hypothesis Testing.md: -------------------------------------------------------------------------------- 1 | [<< Day 16](../16_Statistical%20Concepts/16_Statistical%20Concepts.md) | [Day 18 >>](../18_Basic%20Machine%20Learning%20Introduction/18_Basic%20Machine%20Learning%20Introduction.md) 2 | 3 | 4 | # 📘 Day 17: Hypothesis Testing 5 | 6 | Welcome to Day 17 of the **30 Days of Data Science** series! Today, we delve into **Hypothesis Testing**, a fundamental concept in statistics, widely used to make data-driven decisions. This session will focus on **t-tests** and **chi-square tests**, two commonly used techniques for hypothesis testing. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 17: Hypothesis Testing](#-day-17-hypothesis-testing) 13 | - [📌 Topics Covered](#-topics-covered) 14 | - [1️⃣ What is Hypothesis Testing? 
🧐](#1️⃣-what-is-hypothesis-testing-) 15 | - [Null and Alternative Hypotheses](#null-and-alternative-hypotheses) 16 | - [Steps in Hypothesis Testing](#steps-in-hypothesis-testing) 17 | - [2️⃣ t-Test 🧮](#2️⃣-t-test-) 18 | - [What is a t-Test?](#what-is-a-t-test) 19 | - [Types of t-Tests](#types-of-t-tests) 20 | - [Example: One-Sample t-Test](#example-one-sample-t-test) 21 | - [Example: Two-Sample t-Test](#example-two-sample-t-test) 22 | - [3️⃣ Chi-Square Test 🔢](#3️⃣-chi-square-test-) 23 | - [What is a Chi-Square Test?](#what-is-a-chi-square-test) 24 | - [Example: Chi-Square Test for Independence](#example-chi-square-test-for-independence) 25 | - [🧠 Practice Exercises](#-practice-exercises) 26 | - [🌟 Summary](#-summary) 27 | 28 | 29 | 30 | 31 | ## 📌 Topics Covered 32 | 33 | - **Hypothesis Testing**: Basics, importance, and applications. 34 | - **t-Tests**: Types and examples (one-sample, two-sample). 35 | - **Chi-Square Test**: Concepts and practical applications. 36 | 37 | 38 | 39 | ## 1️⃣ What is Hypothesis Testing? 🧐 40 | 41 | Hypothesis Testing is a statistical method used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for the entire population. 42 | 43 | ### Null and Alternative Hypotheses 44 | 45 | - **Null Hypothesis (H₀)**: Assumes no effect or no difference in the population. 46 | - **Alternative Hypothesis (H₁)**: Assumes a significant effect or difference exists. 47 | 48 | Example: 49 | 50 | - H₀: The average height of students is 5.5 feet. 51 | - H₁: The average height of students is not 5.5 feet. 52 | 53 | 54 | 55 | ### Steps in Hypothesis Testing 56 | 57 | 1. **State the hypotheses**: Define H₀ and H₁. 58 | 2. **Choose a significance level (α)**: Commonly 0.05. 59 | 3. **Select the appropriate test**: t-test, chi-square, etc. 60 | 4. **Calculate the test statistic**: Using the chosen method. 61 | 5. **Make a decision**: Compare the p-value to α. 62 | - p-value ≤ α: Reject H₀ (evidence supports H₁). 63 | - p-value > α: Fail to reject H₀. 64 | 65 | 66 | 67 | ## 2️⃣ t-Test 🧮 68 | 69 | ### What is a t-Test? 70 | 71 | A **t-test** is used to compare means and determine if the differences are statistically significant. It assumes that the data is normally distributed. 72 | 73 | ### Types of t-Tests 74 | 75 | 1. **One-Sample t-Test**: Compares the sample mean to a known value. 76 | 2. **Two-Sample t-Test**: Compares the means of two independent groups. 77 | 3. **Paired t-Test**: Compares means of the same group at different times. 78 | 79 | 80 | 81 | ### Example: One-Sample t-Test 82 | 83 | ```python 84 | from scipy.stats import ttest_1samp 85 | import numpy as np 86 | 87 | # Sample data 88 | data = [12, 15, 14, 10, 13, 12, 14, 15, 11] 89 | pop_mean = 13 90 | 91 | # Perform t-test 92 | t_stat, p_value = ttest_1samp(data, pop_mean) 93 | 94 | print(f"T-statistic: {t_stat}") 95 | print(f"P-value: {p_value}") 96 | ``` 97 | 98 | **Output:** 99 | 100 | ```plaintext 101 | T-statistic: -1.024 102 | P-value: 0.340 103 | ``` 104 | 105 | - Since p-value > 0.05, we fail to reject H₀. 
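The paired t-test (the third type listed above) follows the same pattern. Below is a minimal sketch using `scipy.stats.ttest_rel` on made-up before/after scores for the same group; the numbers are purely illustrative.

```python
from scipy.stats import ttest_rel

# Scores for the same subjects before and after an intervention (illustrative data)
before = [72, 68, 75, 71, 69, 70]
after = [75, 70, 78, 72, 71, 74]

# Paired t-test checks whether the mean of the differences is zero
t_stat, p_value = ttest_rel(before, after)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
```

As with the other tests, a p-value ≤ 0.05 would indicate a significant before/after difference.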
106 |
107 |
108 |
109 | ### Example: Two-Sample t-Test
110 |
111 | ```python
112 | from scipy.stats import ttest_ind
113 |
114 | # Two independent groups
115 | group1 = [22, 24, 19, 23, 21]
116 | group2 = [30, 29, 34, 28, 27]
117 |
118 | # Perform t-test
119 | t_stat, p_value = ttest_ind(group1, group2)
120 |
121 | print(f"T-statistic: {t_stat}")
122 | print(f"P-value: {p_value}")
123 | ```
124 |
125 | **Output:**
126 |
127 | ```plaintext
128 | T-statistic: -5.259
129 | P-value: 0.0008
130 | ```
131 |
132 | - Since p-value ≤ 0.05, we reject H₀ and conclude there is a significant difference.
133 |
134 |
135 |
136 | ## 3️⃣ Chi-Square Test 🔢
137 |
138 | ### What is a Chi-Square Test?
139 |
140 | The **Chi-Square Test** determines whether there is a significant association between categorical variables.
141 |
142 | ### Example: Chi-Square Test for Independence
143 |
144 | ```python
145 | import numpy as np
146 | from scipy.stats import chi2_contingency
147 |
148 | # Contingency table
149 | data = np.array([[50, 30], [20, 100]])
150 |
151 | # Perform chi-square test
152 | chi2, p, dof, expected = chi2_contingency(data)
153 |
154 | print(f"Chi-Square Statistic: {chi2}")
155 | print(f"P-value: {p}")
156 | print(f"Degrees of Freedom: {dof}")
157 | print("Expected Frequencies:")
158 | print(expected)
159 | ```
160 |
161 | **Output:**
162 |
163 | ```plaintext
164 | Chi-Square Statistic: 42.33
165 | P-value: 7.7e-11
166 | Degrees of Freedom: 1
167 | Expected Frequencies:
168 | [[28. 52.]
169 |  [42. 78.]]
170 | ```
171 |
172 | - Since p-value ≤ 0.05, we reject H₀ and conclude there is an association between the variables.
173 |
174 |
175 |
176 | ## 🧠 Practice Exercises
177 |
178 | 1. Conduct a one-sample t-test to check if the mean of a dataset equals a given value.
179 | 2. Perform a two-sample t-test on two independent datasets.
180 | 3. Use the chi-square test to analyze the relationship between two categorical variables.
181 |
182 |
183 |
184 | ## 🌟 Summary
185 |
186 | - Hypothesis testing involves comparing data against a null hypothesis.
187 | - t-tests assess differences in means for one or two groups.
188 | - Chi-square tests analyze associations between categorical variables.
189 | - Interpretation of p-values is crucial to making decisions in hypothesis testing.
190 |
191 | ---
192 |
193 |
--------------------------------------------------------------------------------
/19_Linear Regression/19_Linear Regression.md:
--------------------------------------------------------------------------------
1 | [<< Day 18](../18_Basic%20Machine%20Learning%20Introduction/18_Basic%20Machine%20Learning%20Introduction.md) | [Day 20 >>](../20_Logistic%20Regression/20_Logistic%20Regression.md)
2 |
3 |
4 |
5 | # 📘 Day 19: Linear Regression with Scikit-learn
6 |
7 | Welcome to Day 19 of the **30 Days of Data Science** series! Today, we delve into **Linear Regression**, one of the most fundamental and widely used algorithms in supervised machine learning. We'll be using Python's **Scikit-learn** library to implement and analyze Linear Regression models. 
8 | 9 | 10 | 11 | ## Table of Contents 12 | 13 | - [📘 Day 19: Linear Regression with Scikit-learn](#-day-19-linear-regression-with-scikit-learn) 14 | - [📌 Topics Covered](#-topics-covered) 15 | - [1️⃣ What is Linear Regression?](#1️⃣-what-is-linear-regression) 16 | - [Equation of Linear Regression](#equation-of-linear-regression) 17 | - [Use Cases of Linear Regression](#use-cases-of-linear-regression) 18 | - [2️⃣ Linear Regression in Scikit-learn](#2️⃣-linear-regression-in-scikit-learn) 19 | - [Dataset Overview](#dataset-overview) 20 | - [Steps to Implement Linear Regression](#steps-to-implement-linear-regression) 21 | - [3️⃣ Code Implementation](#3️⃣-code-implementation) 22 | - [1. Importing Libraries](#1-importing-libraries) 23 | - [2. Loading and Exploring the Dataset](#2-loading-and-exploring-the-dataset) 24 | - [3. Preparing the Data](#3-preparing-the-data) 25 | - [4. Training the Model](#4-training-the-model) 26 | - [5. Making Predictions](#5-making-predictions) 27 | - [6. Model Evaluation](#6-model-evaluation) 28 | - [4️⃣ Practice Exercises](#4️⃣-practice-exercises) 29 | - [🌟 Summary](#-summary) 30 | 31 | 32 | 33 | 34 | ## 📌 Topics Covered 35 | 36 | - Introduction to Linear Regression and its applications. 37 | - How to implement Linear Regression using Scikit-learn. 38 | - Steps to preprocess data for Linear Regression. 39 | - Evaluating a Linear Regression model. 40 | 41 | 42 | 43 | ## 1️⃣ What is Linear Regression? 44 | 45 | Linear Regression is a supervised learning algorithm used to predict a target variable (dependent variable) based on one or more input variables (independent variables). The goal is to establish a linear relationship between the variables. 46 | 47 | 48 | 49 | ### Equation of Linear Regression 50 | 51 | The equation of a simple linear regression line is: 52 | 53 | **y = β₀ + β₁x + ε** 54 | 55 | - **y**: Predicted value (target). 56 | - **β₀**: Intercept. 57 | - **β₁**: Coefficient (slope). 58 | - **x**: Input variable (feature). 59 | - **ε**: Error term. 60 | 61 | For multiple linear regression: 62 | 63 | **y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε** 64 | 65 | 66 | 67 | ### Use Cases of Linear Regression 68 | 69 | - Predicting house prices. 70 | - Estimating sales figures. 71 | - Analyzing the impact of marketing spend. 72 | - Forecasting demand. 73 | 74 | 75 | 76 | ## 2️⃣ Linear Regression in Scikit-learn 77 | 78 | Scikit-learn provides a ready-to-use implementation of Linear Regression through the `LinearRegression` class. Let’s go step by step. 79 | 80 | 81 | 82 | ### Dataset Overview 83 | 84 | We’ll use a sample dataset from Scikit-learn’s `datasets` module or custom data to understand Linear Regression. 85 | 86 | 87 | 88 | ### Steps to Implement Linear Regression 89 | 90 | 1. Import required libraries. 91 | 2. Load and explore the dataset. 92 | 3. Split the dataset into training and testing sets. 93 | 4. Train the Linear Regression model using the training data. 94 | 5. Make predictions using the testing data. 95 | 6. Evaluate the model using metrics like Mean Squared Error (MSE) and R-squared. 96 | 97 | 98 | 99 | ## 3️⃣ Code Implementation 100 | 101 | ### 1. Importing Libraries 102 | 103 | ```python 104 | import numpy as np 105 | import pandas as pd 106 | from sklearn.model_selection import train_test_split 107 | from sklearn.linear_model import LinearRegression 108 | from sklearn.metrics import mean_squared_error, r2_score 109 | ``` 110 | 111 | 112 | 113 | ### 2. 
Loading and Exploring the Dataset 114 | 115 | We use Scikit-learn’s `load_diabetes` dataset as an example. 116 | 117 | ```python 118 | from sklearn.datasets import load_diabetes 119 | 120 | # Load dataset 121 | data = load_diabetes() 122 | df = pd.DataFrame(data.data, columns=data.feature_names) 123 | df['target'] = data.target 124 | 125 | # Explore the data 126 | print(df.head()) 127 | ``` 128 | 129 | 130 | 131 | ### 3. Preparing the Data 132 | 133 | ```python 134 | # Features and target variable 135 | X = df.drop(columns=['target']) 136 | y = df['target'] 137 | 138 | # Splitting data into train and test sets 139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 140 | ``` 141 | 142 | 143 | 144 | ### 4. Training the Model 145 | 146 | ```python 147 | # Initialize the model 148 | model = LinearRegression() 149 | 150 | # Train the model 151 | model.fit(X_train, y_train) 152 | print("Model trained successfully!") 153 | ``` 154 | 155 | 156 | 157 | ### 5. Making Predictions 158 | 159 | ```python 160 | # Predict on the test set 161 | y_pred = model.predict(X_test) 162 | 163 | # Display first five predictions 164 | print("Predictions:", y_pred[:5]) 165 | ``` 166 | 167 | 168 | 169 | ### 6. Model Evaluation 170 | 171 | ```python 172 | # Mean Squared Error (MSE) 173 | mse = mean_squared_error(y_test, y_pred) 174 | print(f"Mean Squared Error: {mse}") 175 | 176 | # R-squared Score 177 | r2 = r2_score(y_test, y_pred) 178 | print(f"R-squared Score: {r2}") 179 | ``` 180 | 181 | 182 | 183 | ## 4️⃣ Practice Exercises 184 | 185 | 1. Load a custom dataset and implement Linear Regression. 186 | 2. Try normalizing the data before training the model—does it improve the performance? 187 | 3. Evaluate your model using additional metrics like Mean Absolute Error (MAE). 188 | 189 | 190 | 191 | ## 🌟 Summary 192 | 193 | - Linear Regression models a linear relationship between input and output variables. 194 | - Scikit-learn provides an easy-to-use interface to train and evaluate regression models. 195 | - Key metrics like MSE and R-squared help assess model performance. 196 | 197 | --- 198 | 199 | 200 | -------------------------------------------------------------------------------- /15_Regular Expressions/15_Regular Expressions.md: -------------------------------------------------------------------------------- 1 | [<< Day 14](../14_Working%20with%20APIs%20and%20JSON/14_Working%20with%20APIs%20and%20JSON.md) | [Day 16 >>](../16_Statistical%20Concepts/16_Statistical%20Concepts.md) 2 | 3 | 4 | # 📘 Day 15: Regular Expressions Using `re` Module 5 | 6 | Welcome to Day 15 of the **30 Days of Data Science** series! Today, we dive into the fascinating world of **Regular Expressions (regex)** in Python. Regex is a powerful tool for pattern matching and text processing, enabling you to extract, validate, or modify data efficiently. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 15: Regular Expressions Using `re` Module](#-day-15-regular-expressions-using-re-module) 13 | - [1️⃣ Introduction to Regular Expressions](#1️⃣-introduction-to-regular-expressions) 14 | - [2️⃣ Using the `re` Module](#2️⃣-using-the-re-module) 15 | - [Importing the Module](#importing-the-module) 16 | - [Basic Pattern Matching](#basic-pattern-matching) 17 | - [3️⃣ Common Regex Patterns](#3️⃣-common-regex-patterns) 18 | - [4️⃣ Useful `re` Functions](#4️⃣-useful-re-functions) 19 | - [1. `re.match`](#1-rematch) 20 | - [2. `re.search`](#2-research) 21 | - [3. `re.findall`](#3-refindall) 22 | - [4. 
`re.sub`](#4-resub) 23 | - [5. `re.split`](#5-resplit) 24 | - [5️⃣ Practice Exercises](#5️⃣-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | ## 1️⃣ Introduction to Regular Expressions 30 | 31 | A **Regular Expression (regex)** is a sequence of characters defining a search pattern. Regex is commonly used for tasks such as: 32 | 33 | - Validating data (e.g., email addresses, phone numbers) 34 | - Extracting specific information from text (e.g., dates, URLs) 35 | - Text replacements (e.g., removing unwanted characters) 36 | 37 | 38 | 39 | ## 2️⃣ Using the `re` Module 40 | 41 | Python provides the built-in `re` module to work with regular expressions. 42 | 43 | 44 | 45 | ### Importing the Module 46 | 47 | Before using regex, you need to import the `re` module: 48 | 49 | ```python 50 | import re 51 | ``` 52 | 53 | 54 | 55 | ### Basic Pattern Matching 56 | 57 | Regex patterns are enclosed in raw strings (`r""`) to avoid conflicts with Python's escape sequences. 58 | 59 | ```python 60 | import re 61 | 62 | pattern = r"data" 63 | text = "I love data science." 64 | 65 | # Check if the pattern exists in the text 66 | if re.search(pattern, text): 67 | print("Pattern found!") 68 | else: 69 | print("Pattern not found.") 70 | ``` 71 | 72 | **Output:** 73 | 74 | ```plaintext 75 | Pattern found! 76 | ``` 77 | 78 | 79 | 80 | ## 3️⃣ Common Regex Patterns 81 | 82 | Here are some common patterns and their meanings: 83 | 84 | | Pattern | Description | Example Match | 85 | |--------------|--------------------------------------|-------------------| 86 | | `.` | Any character except newline | `a.c` matches `abc` | 87 | | `\d` | Any digit (0–9) | `\d` matches `3` | 88 | | `\w` | Any word character (a-z, A-Z, 0-9) | `\w` matches `a` | 89 | | `\s` | Any whitespace (space, tab, newline) | `\s` matches ` ` | 90 | | `^` | Start of a string | `^hello` matches `hello world` | 91 | | `$` | End of a string | `world$` matches `hello world` | 92 | | `*` | Zero or more repetitions | `ab*` matches `a`, `ab`, `abb` | 93 | | `+` | One or more repetitions | `ab+` matches `ab`, `abb` | 94 | | `?` | Zero or one repetition | `ab?` matches `a`, `ab` | 95 | | `{n,m}` | Between n and m repetitions | `a{2,4}` matches `aa`, `aaa`, `aaaa` | 96 | 97 | 98 | 99 | ## 4️⃣ Useful `re` Functions 100 | 101 | The `re` module provides several functions for pattern matching and manipulation: 102 | 103 | ### 1. `re.match` 104 | 105 | Checks if the pattern matches at the **beginning** of the string. 106 | 107 | ```python 108 | import re 109 | 110 | text = "data science is amazing" 111 | match = re.match(r"data", text) 112 | 113 | if match: 114 | print(f"Matched: {match.group()}") 115 | else: 116 | print("No match found.") 117 | ``` 118 | 119 | **Output:** 120 | 121 | ```plaintext 122 | Matched: data 123 | ``` 124 | 125 | 126 | 127 | ### 2. `re.search` 128 | 129 | Searches the entire string for a match. 130 | 131 | ```python 132 | import re 133 | 134 | text = "I love data science" 135 | search = re.search(r"data", text) 136 | 137 | if search: 138 | print(f"Found: {search.group()}") 139 | ``` 140 | 141 | **Output:** 142 | 143 | ```plaintext 144 | Found: data 145 | ``` 146 | 147 | 148 | 149 | ### 3. `re.findall` 150 | 151 | Returns all occurrences of the pattern in the string. 
152 | 153 | ```python 154 | import re 155 | 156 | text = "data science involves data and more data" 157 | matches = re.findall(r"data", text) 158 | print(f"Occurrences: {matches}") 159 | ``` 160 | 161 | **Output:** 162 | 163 | ```plaintext 164 | Occurrences: ['data', 'data', 'data'] 165 | ``` 166 | 167 | 168 | 169 | ### 4. `re.sub` 170 | 171 | Replaces occurrences of the pattern with a specified string. 172 | 173 | ```python 174 | import re 175 | 176 | text = "data science is amazing" 177 | result = re.sub(r"data", "AI", text) 178 | print(result) 179 | ``` 180 | 181 | **Output:** 182 | 183 | ```plaintext 184 | AI science is amazing 185 | ``` 186 | 187 | 188 | 189 | ### 5. `re.split` 190 | 191 | Splits the string by the pattern. 192 | 193 | ```python 194 | import re 195 | 196 | text = "data-science-is-fun" 197 | result = re.split(r"-", text) 198 | print(result) 199 | ``` 200 | 201 | **Output:** 202 | 203 | ```plaintext 204 | ['data', 'science', 'is', 'fun'] 205 | ``` 206 | 207 | 208 | 209 | ## 5️⃣ Practice Exercises 210 | 211 | 1. Validate an email address using regex. 212 | 2. Extract all numbers from a given text. 213 | 3. Replace all whitespace in a string with underscores. 214 | 4. Split a paragraph into sentences. 215 | 216 | 217 | 218 | ## 🌟 Summary 219 | 220 | - Regular expressions are powerful for pattern matching and text manipulation. 221 | - Python's `re` module provides various functions like `match`, `search`, `findall`, `sub`, and `split`. 222 | - Familiarize yourself with common regex patterns for effective text processing. 223 | 224 | --- 225 | 226 | 227 | 228 | 229 | -------------------------------------------------------------------------------- /25_Model Evaluation and Metrics/25_Model Evaluation and Metrics.md: -------------------------------------------------------------------------------- 1 | [<< Day 24](../24_Feature%20Engineering/24_Feature%20Engineering.md) | [Day 26 >>](../26_Advanced%20ML%3A%20Hyperparameter%20Tuning/26_Advanced%20ML%3A%20Hyperparameter%20Tuning.md) 2 | 3 | 4 | # 📊 Day 25 - Model Evaluation and Metrics 5 | 6 | Welcome to **Day 25** of the **30 Days of Data Science** series! 🎉 Today, we will dive into the critical topic of **Model Evaluation and Metrics**. Understanding how to evaluate the performance of a machine learning model is essential to ensure its effectiveness and reliability in real-world scenarios. Let's explore topics like **Confusion Matrix** and **ROC-AUC Curve** with hands-on examples! 🚀 7 | 8 | 9 | 10 | ## 📋 Table of Contents 11 | 12 | - [📊 Day 25 - Model Evaluation and Metrics](#-day-25---model-evaluation-and-metrics) 13 | - [📋 Table of Contents](#-table-of-contents) 14 | - [🔍 Introduction to Model Evaluation](#-introduction-to-model-evaluation) 15 | - [📉 Confusion Matrix](#-confusion-matrix) 16 | - [🔢 Key Metrics Derived from the Confusion Matrix](#-key-metrics-derived-from-the-confusion-matrix) 17 | - [📈 ROC and AUC](#-roc-and-auc) 18 | - [📚 Practice Exercises](#-practice-exercises) 19 | - [📜 Summary](#-summary) 20 | 21 | 22 | 23 | ## 🔍 Introduction to Model Evaluation 24 | 25 | Model evaluation is a crucial step in machine learning to ensure that the model generalizes well to unseen data. The goal is to evaluate both the predictive power and robustness of a model using appropriate metrics. 26 | 27 | Evaluation often involves splitting data into **training** and **test sets** or employing **cross-validation** techniques. 
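As a quick illustration of cross-validation, scikit-learn's `cross_val_score` evaluates a model on several train/test folds in one call. The snippet below is only a sketch; the random forest and the Iris data are placeholder choices for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation: each fold is held out once for evaluation
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```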
Some common evaluation metrics include: 28 | 29 | - **Accuracy**: Percentage of correctly classified instances. 30 | - **Precision, Recall, and F1-Score**: Useful for imbalanced datasets. 31 | - **ROC-AUC**: Measures a model's ability to distinguish between classes. 32 | 33 | In this section, we'll focus on two widely-used tools for model evaluation: **Confusion Matrix** and **ROC-AUC Curve**. 34 | 35 | 36 | 37 | ## 📉 Confusion Matrix 38 | 39 | A **Confusion Matrix** is a tabular representation of actual versus predicted classifications. It is used to visualize the performance of classification models. The matrix contains four key components: 40 | 41 | | | Predicted Positive | Predicted Negative | 42 | |----------------|--------------------|--------------------| 43 | | **Actual Positive** | True Positive (TP) | False Negative (FN) | 44 | | **Actual Negative** | False Positive (FP) | True Negative (TN) | 45 | 46 | ### 🔢 Key Metrics Derived from the Confusion Matrix 47 | 48 | 1. **Accuracy**: 49 | Accuracy = (TP + TN) / (TP + TN + FP + FN) 50 | 51 | 2. **Precision**: Measures the accuracy of positive predictions. 52 | Precision = TP / (TP + FP) 53 | 54 | 3. **Recall (Sensitivity)**: Measures the ability to detect positive samples. 55 | Recall = TP / (TP + FN) 56 | 57 | 4. **F1 Score**: Harmonic mean of Precision and Recall. 58 | F1 = 2 * (Precision * Recall) / (Precision + Recall) 59 | 60 | 61 | 62 | ### Example: Confusion Matrix in Scikit-Learn 63 | 64 | ```python 65 | from sklearn.metrics import confusion_matrix, classification_report 66 | from sklearn.model_selection import train_test_split 67 | from sklearn.ensemble import RandomForestClassifier 68 | from sklearn.datasets import load_iris 69 | 70 | # Load dataset and split 71 | data = load_iris() 72 | X, y = data.data, data.target 73 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 74 | 75 | # Train model 76 | model = RandomForestClassifier(random_state=42) 77 | model.fit(X_train, y_train) 78 | 79 | # Predictions 80 | y_pred = model.predict(X_test) 81 | 82 | # Confusion Matrix 83 | cm = confusion_matrix(y_test, y_pred) 84 | print("Confusion Matrix:\n", cm) 85 | 86 | # Classification Report 87 | print("\nClassification Report:\n", classification_report(y_test, y_pred)) 88 | ``` 89 | 90 | 91 | 92 | ## 📈 ROC and AUC 93 | 94 | The **Receiver Operating Characteristic (ROC) curve** and the **Area Under the Curve (AUC)** are used to evaluate the performance of binary classification models. The ROC curve plots: 95 | 96 | - **True Positive Rate (TPR)** (Sensitivity) on the Y-axis. 97 | - **False Positive Rate (FPR)** on the X-axis. 98 | 99 | The **AUC** provides a single scalar value to summarize the ROC curve. A model with an AUC of **1.0** indicates perfect classification, while **0.5** suggests random guessing. 
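For reference, the two rates plotted on the ROC curve come from the same confusion-matrix counts defined earlier:

- **TPR (Recall)** = TP / (TP + FN)
- **FPR** = FP / (FP + TN)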
100 | 101 | ### Example: Plotting ROC-AUC in Scikit-Learn 102 | 103 | ```python 104 | from sklearn.metrics import roc_curve, auc 105 | import matplotlib.pyplot as plt 106 | from sklearn.linear_model import LogisticRegression 107 | 108 | # Train binary classifier 109 | binary_model = LogisticRegression() 110 | binary_model.fit(X_train, y_train == 1) # Simplify to binary classification 111 | 112 | # Get predicted probabilities 113 | y_probs = binary_model.predict_proba(X_test)[:, 1] 114 | 115 | # Calculate ROC curve 116 | fpr, tpr, _ = roc_curve(y_test == 1, y_probs) 117 | roc_auc = auc(fpr, tpr) 118 | 119 | # Plot ROC Curve 120 | plt.figure(figsize=(8, 6)) 121 | plt.plot(fpr, tpr, color='blue', label=f"ROC Curve (AUC = {roc_auc:.2f})") 122 | plt.plot([0, 1], [0, 1], color='red', linestyle='--') 123 | plt.xlabel("False Positive Rate") 124 | plt.ylabel("True Positive Rate") 125 | plt.title("ROC Curve") 126 | plt.legend(loc="lower right") 127 | plt.show() 128 | ``` 129 | 130 | 131 | 132 | ## 📚 Practice Exercises 133 | 134 | 1. Load a dataset of your choice and split it into training and test sets. 135 | 2. Train a classification model and compute the confusion matrix. 136 | 3. Use scikit-learn to plot the ROC curve and calculate the AUC for a binary classification task. 137 | 138 | 139 | 140 | ## 📜 Summary 141 | 142 | In **Day 25**, we explored the importance of **Model Evaluation and Metrics** in machine learning. We delved into: 143 | 144 | - **Confusion Matrix**: A vital tool to understand classification results. 145 | - **Derived Metrics**: Accuracy, Precision, Recall, and F1 Score. 146 | - **ROC-AUC Curve**: For evaluating binary classifiers. 147 | 148 | Mastering these concepts will ensure you can effectively measure and improve the performance of machine learning models. Keep practicing, and we'll see you tomorrow for Day 26! 🚀 149 | 150 | --- 151 | -------------------------------------------------------------------------------- /29_Working with Big Data/29_Working with Big Data.md: -------------------------------------------------------------------------------- 1 | [<< Day 28](../28_Time%20Series%20Forecasting/28_Time%20Series%20Forecasting.md) | [Day 30 >>](../30_Building%20a%20Data%20Science%20Pipeline/30_Building%20a%20Data%20Science%20Pipeline.md) 2 | 3 | 4 | 5 | # 🗓️ Day 29: Working with Big Data 🚀 6 | 7 | Welcome to **Day 29** of the **30 Days of Data Science** series! Today, we delve into the exciting world of **Big Data** and learn about **PySpark Basics**, along with related topics such as **Partitioning in Big Data** and **Handling Missing Data**. 8 | 9 | 10 | 11 | ## 📚 Table of Contents 12 | - [🌟 Introduction to Big Data](#-introduction-to-big-data) 13 | - [🔥 What is Apache Spark?](#-what-is-apache-spark) 14 | - [🐍 Why PySpark?](#-why-pyspark) 15 | - [⚙️ Setting Up PySpark](#️-setting-up-pyspark) 16 | - [Installation](#installation) 17 | - [Setting Up Your Environment](#setting-up-your-environment) 18 | - [📝 PySpark Basics](#-pyspark-basics) 19 | - [Creating an RDD](#creating-an-rdd) 20 | - [Transformations and Actions](#transformations-and-actions) 21 | - [📁 Partitioning in Big Data](#-partitioning-in-big-data) 22 | - [📉 Handling Missing Data in Big Data](#-handling-missing-data-in-big-data) 23 | - [💡 Practice Exercise](#-practice-exercise) 24 | - [📜 Summary](#-summary) 25 | 26 | 27 | 28 | ## 🌟 Introduction to Big Data 29 | 30 | Big Data refers to data that is so large, fast, or complex that traditional data processing methods cannot efficiently process it. 
Key characteristics include: 31 | 32 | - **Volume**: Huge amounts of data. 33 | - **Velocity**: High speed at which data is generated. 34 | - **Variety**: Different forms like structured, unstructured, and semi-structured data. 35 | 36 | 37 | 38 | ## 🔥 What is Apache Spark? 39 | 40 | **Apache Spark** is an open-source, distributed computing system designed for fast and scalable processing of large datasets. Key features include: 41 | 42 | - **Speed**: Processes data 100x faster than Hadoop MapReduce. 43 | - **Ease of Use**: APIs in Python, Java, Scala, and R. 44 | - **Versatility**: Supports SQL, streaming, machine learning, and graph processing. 45 | 46 | 47 | 48 | ## 🐍 Why PySpark? 49 | 50 | PySpark is the Python API for Apache Spark. It allows Python developers to leverage Spark's distributed computing capabilities with Pythonic simplicity. 51 | 52 | - Easy to learn for Python developers. 53 | - Integrates seamlessly with Python libraries like Pandas and NumPy. 54 | 55 | 56 | 57 | ## ⚙️ Setting Up PySpark 58 | 59 | ### Installation 60 | 61 | To install PySpark, use pip: 62 | 63 | ```bash 64 | pip install pyspark 65 | ``` 66 | 67 | ### Setting Up Your Environment 68 | 69 | 1. Install Java Development Kit (JDK). Spark requires Java 8 or higher. 70 | 2. Verify the installation: 71 | 72 | ```bash 73 | java -version 74 | ``` 75 | 76 | 3. Launch PySpark from the terminal: 77 | 78 | ```bash 79 | pyspark 80 | ``` 81 | 82 | 83 | 84 | ## 📝 PySpark Basics 85 | 86 | ### Creating an RDD 87 | 88 | An **RDD (Resilient Distributed Dataset)** is the fundamental data structure in Spark. You can create an RDD in PySpark as follows: 89 | 90 | ```python 91 | from pyspark import SparkContext 92 | 93 | # Initialize SparkContext 94 | sc = SparkContext("local", "Day 29 Example") 95 | 96 | # Create an RDD 97 | data = [1, 2, 3, 4, 5] 98 | rdd = sc.parallelize(data) 99 | 100 | print("RDD Elements:", rdd.collect()) 101 | ``` 102 | 103 | ### Transformations and Actions 104 | 105 | - **Transformations** create a new RDD from an existing one. Examples: `map`, `filter`. 106 | - **Actions** perform operations and return results. Examples: `collect`, `count`. 107 | 108 | #### Example: Map and Filter 109 | 110 | ```python 111 | # Transformation: Map 112 | squared_rdd = rdd.map(lambda x: x ** 2) 113 | 114 | # Transformation: Filter 115 | filtered_rdd = squared_rdd.filter(lambda x: x > 10) 116 | 117 | # Action: Collect 118 | result = filtered_rdd.collect() 119 | print("Filtered Result:", result) 120 | ``` 121 | 122 | #### Example: Reduce 123 | 124 | ```python 125 | # Action: Reduce 126 | sum_result = rdd.reduce(lambda x, y: x + y) 127 | print("Sum of RDD Elements:", sum_result) 128 | ``` 129 | 130 | 131 | 132 | ## 📁 Partitioning in Big Data 133 | 134 | Partitioning refers to splitting data into smaller chunks to be processed in parallel. In PySpark, partitioning is essential for optimizing performance. 135 | 136 | ### Example: Partitioning Data 137 | 138 | ```python 139 | # Create an RDD with 4 partitions 140 | partitioned_rdd = sc.parallelize(data, 4) 141 | print("Number of Partitions:", partitioned_rdd.getNumPartitions()) 142 | ``` 143 | 144 | ### Repartitioning 145 | 146 | You can repartition an RDD to increase or decrease the number of partitions. 
147 | 148 | ```python 149 | # Repartitioning 150 | repartitioned_rdd = partitioned_rdd.repartition(2) 151 | print("New Number of Partitions:", repartitioned_rdd.getNumPartitions()) 152 | ``` 153 | 154 | 155 | 156 | ## 📉 Handling Missing Data in Big Data 157 | 158 | Big Data often contains missing or null values. PySpark provides tools to handle missing data efficiently. 159 | 160 | ### Example: Handling Null Values in a DataFrame 161 | 162 | ```python 163 | from pyspark.sql import SparkSession 164 | 165 | # Initialize SparkSession 166 | spark = SparkSession.builder.appName("MissingDataExample").getOrCreate() 167 | 168 | # Create a DataFrame with missing values 169 | data = [("Alice", 34), (None, 29), ("Bob", None)] 170 | columns = ["Name", "Age"] 171 | df = spark.createDataFrame(data, columns) 172 | 173 | # Drop rows with null values 174 | df_cleaned = df.dropna() 175 | df_cleaned.show() 176 | ``` 177 | 178 | ### Filling Missing Values 179 | 180 | ```python 181 | # Fill missing values with a default 182 | df_filled = df.fillna({"Name": "Unknown", "Age": 0}) 183 | df_filled.show() 184 | ``` 185 | 186 | 187 | 188 | ## 💡 Practice Exercise 189 | 190 | **Task**: Using PySpark, create an RDD and perform the following: 191 | 192 | 1. Partition the RDD into 3 partitions. 193 | 2. Apply a transformation to multiply each element by 10. 194 | 3. Filter the elements greater than 20. 195 | 4. Collect the results. 196 | 197 | 198 | 199 | ## 📜 Summary 200 | 201 | Today, we explored: 202 | 203 | - The fundamentals of **Big Data** and its challenges. 204 | - **PySpark Basics**, including RDD creation and transformations. 205 | - **Partitioning** for efficient data processing. 206 | - **Handling Missing Data** in PySpark. 207 | 208 | --- 209 | -------------------------------------------------------------------------------- /18_Basic Machine Learning Introduction/18_Basic Machine Learning Introduction.md: -------------------------------------------------------------------------------- 1 | [<< Day 17](../17_Hypothesis%20Testing/17_Hypothesis%20Testing.md) | [Day 19 >>](../19_Linear%20Regression/19_Linear%20Regression.md) 2 | 3 | 4 | # 📘 Day 18: Basic Machine Learning Introduction and Scikit-learn Basics 5 | 6 | Welcome to Day 18 of the **30 Days of Data Science** series! Today, we explore the basics of **Machine Learning** and an essential library for implementing ML models in Python: **Scikit-learn**. This session will set the foundation for understanding ML concepts and applying them in practice. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 18: Basic Machine Learning Introduction and Scikit-learn Basics](#-day-18-basic-machine-learning-introduction-and-scikit-learn-basics) 13 | - [📌 Topics Covered](#-topics-covered) 14 | - [1️⃣ What is Machine Learning?](#1️⃣-what-is-machine-learning) 15 | - [Types of Machine Learning](#types-of-machine-learning) 16 | - [2️⃣ Introduction to Scikit-learn](#2️⃣-introduction-to-scikit-learn) 17 | - [Installing Scikit-learn](#installing-scikit-learn) 18 | - [Scikit-learn Basics](#scikit-learn-basics) 19 | - [3️⃣ Example: Linear Regression with Scikit-learn](#3️⃣-example-linear-regression-with-scikit-learn) 20 | - [🧠 Practice Exercises](#-practice-exercises) 21 | - [🌟 Summary](#-summary) 22 | 23 | 24 | 25 | 26 | ## 📌 Topics Covered 27 | 28 | - What is Machine Learning? 29 | - Types of Machine Learning: Supervised, Unsupervised, and Reinforcement Learning. 30 | - Introduction to Scikit-learn, a machine learning library in Python. 
31 | - Example: Linear Regression using Scikit-learn. 32 | 33 | 34 | 35 | ## 1️⃣ What is Machine Learning? 36 | 37 | **Machine Learning (ML)** is a subset of artificial intelligence (AI) that enables systems to learn and improve from data without being explicitly programmed. 38 | 39 | ### Key Concepts: 40 | - **Data**: ML algorithms are trained using historical data. 41 | - **Model**: A mathematical representation of the problem to make predictions or decisions. 42 | - **Training**: The process of feeding data into the model to learn patterns. 43 | 44 | 45 | 46 | ### Types of Machine Learning 47 | 48 | 1. **Supervised Learning**: 49 | - Input data (features) and output labels (target) are provided. 50 | - Goal: Learn a mapping from input to output. 51 | - Examples: Regression, Classification. 52 | 53 | 2. **Unsupervised Learning**: 54 | - Only input data is provided, no output labels. 55 | - Goal: Discover hidden patterns or groupings. 56 | - Examples: Clustering, Dimensionality Reduction. 57 | 58 | 3. **Reinforcement Learning**: 59 | - Agents learn by interacting with the environment and receiving feedback (rewards or penalties). 60 | - Examples: Game playing, Robotics. 61 | 62 | 63 | 64 | ## 2️⃣ Introduction to Scikit-learn 65 | 66 | **Scikit-learn** is a Python library for implementing machine learning algorithms. It provides simple and efficient tools for predictive data analysis. 67 | 68 | ### Key Features: 69 | - Built-in algorithms for supervised and unsupervised learning. 70 | - Tools for model evaluation, preprocessing, and pipeline creation. 71 | - Compatible with other Python libraries like NumPy and pandas. 72 | 73 | 74 | 75 | ### Installing Scikit-learn 76 | 77 | Before using Scikit-learn, ensure it is installed in your environment. Use the following command: 78 | 79 | ```bash 80 | pip install scikit-learn 81 | ``` 82 | 83 | 84 | 85 | ### Scikit-learn Basics 86 | 87 | 1. **Loading a Dataset**: 88 | Scikit-learn comes with several built-in datasets. 89 | 90 | ```python 91 | from sklearn.datasets import load_iris 92 | 93 | iris = load_iris() 94 | print(iris.keys()) # Output: Keys like 'data', 'target', etc. 95 | ``` 96 | 97 | 2. **Splitting Data**: 98 | Use `train_test_split` to divide data into training and testing sets. 99 | 100 | ```python 101 | from sklearn.model_selection import train_test_split 102 | 103 | X_train, X_test, y_train, y_test = train_test_split( 104 | iris.data, iris.target, test_size=0.2, random_state=42 105 | ) 106 | ``` 107 | 108 | 3. **Training a Model**: 109 | Fit a model using the training data. 110 | 111 | ```python 112 | from sklearn.ensemble import RandomForestClassifier 113 | 114 | clf = RandomForestClassifier() 115 | clf.fit(X_train, y_train) 116 | ``` 117 | 118 | 4. **Making Predictions**: 119 | Use the trained model to make predictions. 120 | 121 | ```python 122 | predictions = clf.predict(X_test) 123 | print(predictions) 124 | ``` 125 | 126 | 5. **Evaluating a Model**: 127 | Measure accuracy or other metrics. 128 | 129 | ```python 130 | from sklearn.metrics import accuracy_score 131 | 132 | accuracy = accuracy_score(y_test, predictions) 133 | print(f"Accuracy: {accuracy}") 134 | ``` 135 | 136 | 137 | 138 | ## 3️⃣ Example: Linear Regression with Scikit-learn 139 | 140 | Let’s build a **Linear Regression** model to predict house prices. 
141 | 142 | ```python 143 | from sklearn.datasets import fetch_california_housing 144 | from sklearn.model_selection import train_test_split 145 | from sklearn.linear_model import LinearRegression 146 | from sklearn.metrics import mean_squared_error 147 | 148 | # Load the dataset 149 | data = fetch_california_housing() 150 | X, y = data.data, data.target 151 | 152 | # Split into training and testing sets 153 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 154 | 155 | # Create and train the model 156 | model = LinearRegression() 157 | model.fit(X_train, y_train) 158 | 159 | # Make predictions 160 | predictions = model.predict(X_test) 161 | 162 | # Evaluate the model 163 | mse = mean_squared_error(y_test, predictions) 164 | print(f"Mean Squared Error: {mse}") 165 | ``` 166 | 167 | **Output Example:** 168 | 169 | ```plaintext 170 | Mean Squared Error: 0.5401 171 | ``` 172 | 173 | 174 | 175 | ## 🧠 Practice Exercises 176 | 177 | 1. Use the `load_wine` dataset from Scikit-learn and train a Decision Tree Classifier. 178 | 2. Build a K-Means clustering model on synthetic data using Scikit-learn. 179 | 3. Experiment with different test sizes in the `train_test_split` function and observe the impact on performance. 180 | 181 | 182 | 183 | ## 🌟 Summary 184 | 185 | - Machine Learning enables systems to learn from data and make predictions. 186 | - Scikit-learn simplifies the implementation of ML algorithms with its tools and datasets. 187 | - Linear Regression is a basic but powerful algorithm to understand supervised learning. 188 | 189 | --- 190 | 191 | 192 | -------------------------------------------------------------------------------- /12_SQL for Data Retrieval/12_SQL for Data Retrieval.md: -------------------------------------------------------------------------------- 1 | [<< Day 11](../11_Advanced%20Data%20Visualization/11_Advanced%20Data%20Visualization.md) | [Day 13 >>](../13_Time%20Series%20Analysis%20Introduction/13_Time%20Series%20Analysis%20Introduction.md) 2 | 3 | 4 | # 📘 Day 12: SQL for Data Retrieval using SQLite3 and SQLAlchemy 5 | 6 | Welcome to Day 12 of the **30 Days of Data Science** series! Today, we focus on **SQL for data retrieval**—a crucial skill for any data scientist. We’ll explore how to use Python libraries **sqlite3** and **SQLAlchemy** to interact with databases and fetch meaningful insights. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 12: SQL for Data Retrieval using SQLite3 and SQLAlchemy](#-day-12-sql-for-data-retrieval-using-sqlite3-and-sqlalchemy) 13 | - [1️⃣ Introduction to SQL and Databases](#1️⃣-introduction-to-sql-and-databases) 14 | - [2️⃣ Using SQLite3 for Data Retrieval](#2️⃣-using-sqlite3-for-data-retrieval) 15 | - [Creating a Database](#creating-a-database) 16 | - [Inserting Data](#inserting-data) 17 | - [Querying Data](#querying-data) 18 | - [Example: Fetching Data with Conditions](#example-fetching-data-with-conditions) 19 | - [3️⃣ Using SQLAlchemy for Data Retrieval](#3️⃣-using-sqlalchemy-for-data-retrieval) 20 | - [Setting Up SQLAlchemy](#setting-up-sqlalchemy) 21 | - [Defining a Table Model](#defining-a-table-model) 22 | - [Adding and Querying Data](#adding-and-querying-data) 23 | - [🧠 Practice Exercises](#-practice-exercises) 24 | - [🌟 Summary](#-summary) 25 | 26 | 27 | 28 | 29 | ## 1️⃣ Introduction to SQL and Databases 30 | 31 | **SQL (Structured Query Language)** is used to interact with databases. It allows us to: 32 | - Store data in tables (rows and columns). 
33 | - Retrieve data using queries. 34 | - Filter, aggregate, and manipulate data. 35 | 36 | 37 | 38 | ## 2️⃣ Using SQLite3 for Data Retrieval 39 | 40 | SQLite3 is a lightweight database engine built into Python. 41 | 42 | 43 | 44 | ### Creating a Database 45 | 46 | You can create a new database or connect to an existing one using `sqlite3.connect()`. 47 | 48 | ```python 49 | import sqlite3 50 | 51 | # Connect to SQLite database (or create one) 52 | connection = sqlite3.connect("example.db") 53 | 54 | # Create a cursor object to execute SQL commands 55 | cursor = connection.cursor() 56 | 57 | # Create a table 58 | cursor.execute(""" 59 | CREATE TABLE IF NOT EXISTS employees ( 60 | id INTEGER PRIMARY KEY AUTOINCREMENT, 61 | name TEXT, 62 | age INTEGER, 63 | department TEXT 64 | ) 65 | """) 66 | 67 | # Commit changes and close the connection 68 | connection.commit() 69 | connection.close() 70 | ``` 71 | 72 | 73 | 74 | ### Inserting Data 75 | 76 | Add data to your database using `INSERT INTO`. 77 | 78 | ```python 79 | connection = sqlite3.connect("example.db") 80 | cursor = connection.cursor() 81 | 82 | # Insert data 83 | cursor.execute("INSERT INTO employees (name, age, department) VALUES (?, ?, ?)", 84 | ("Alice", 30, "HR")) 85 | cursor.execute("INSERT INTO employees (name, age, department) VALUES (?, ?, ?)", 86 | ("Bob", 25, "Engineering")) 87 | 88 | connection.commit() 89 | connection.close() 90 | ``` 91 | 92 | 93 | 94 | ### Querying Data 95 | 96 | Retrieve data using `SELECT` queries. 97 | 98 | ```python 99 | connection = sqlite3.connect("example.db") 100 | cursor = connection.cursor() 101 | 102 | # Query all employees 103 | cursor.execute("SELECT * FROM employees") 104 | rows = cursor.fetchall() 105 | 106 | for row in rows: 107 | print(row) 108 | 109 | connection.close() 110 | ``` 111 | 112 | **Output:** 113 | 114 | ```plaintext 115 | (1, 'Alice', 30, 'HR') 116 | (2, 'Bob', 25, 'Engineering') 117 | ``` 118 | 119 | 120 | 121 | ### Example: Fetching Data with Conditions 122 | 123 | ```python 124 | connection = sqlite3.connect("example.db") 125 | cursor = connection.cursor() 126 | 127 | # Query employees in the HR department 128 | cursor.execute("SELECT * FROM employees WHERE department = ?", ("HR",)) 129 | rows = cursor.fetchall() 130 | 131 | print(rows) # Output: [(1, 'Alice', 30, 'HR')] 132 | 133 | connection.close() 134 | ``` 135 | 136 | 137 | 138 | ## 3️⃣ Using SQLAlchemy for Data Retrieval 139 | 140 | SQLAlchemy simplifies database interactions with Python objects. 
141 | 142 | 143 | 144 | ### Setting Up SQLAlchemy 145 | 146 | Install SQLAlchemy: 147 | 148 | ```bash 149 | pip install sqlalchemy 150 | ``` 151 | 152 | Create a database connection and an engine: 153 | 154 | ```python 155 | from sqlalchemy import create_engine 156 | 157 | # Create a SQLite engine 158 | engine = create_engine("sqlite:///example.db") 159 | ``` 160 | 161 | 162 | 163 | ### Defining a Table Model 164 | 165 | Define tables using SQLAlchemy’s `declarative_base`: 166 | 167 | ```python 168 | from sqlalchemy.ext.declarative import declarative_base 169 | from sqlalchemy import Column, Integer, String 170 | 171 | Base = declarative_base() 172 | 173 | class Employee(Base): 174 | __tablename__ = 'employees' 175 | id = Column(Integer, primary_key=True, autoincrement=True) 176 | name = Column(String) 177 | age = Column(Integer) 178 | department = Column(String) 179 | ``` 180 | 181 | Create tables: 182 | 183 | ```python 184 | Base.metadata.create_all(engine) 185 | ``` 186 | 187 | 188 | 189 | ### Adding and Querying Data 190 | 191 | Insert data using a session: 192 | 193 | ```python 194 | from sqlalchemy.orm import sessionmaker 195 | 196 | Session = sessionmaker(bind=engine) 197 | session = Session() 198 | 199 | # Add an employee 200 | new_employee = Employee(name="Charlie", age=28, department="Finance") 201 | session.add(new_employee) 202 | session.commit() 203 | ``` 204 | 205 | Query data: 206 | 207 | ```python 208 | # Query all employees 209 | employees = session.query(Employee).all() 210 | 211 | for emp in employees: 212 | print(emp.name, emp.department) 213 | ``` 214 | 215 | **Output:** 216 | 217 | ```plaintext 218 | Alice HR 219 | Bob Engineering 220 | Charlie Finance 221 | ``` 222 | 223 | 224 | 225 | ## 🧠 Practice Exercises 226 | 227 | 1. Create a table for **products** with columns: `id`, `name`, `price`, and `quantity`. Populate it with data and fetch records where `price > 100`. 228 | 2. Use SQLAlchemy to define a `students` table. Add records and retrieve students aged above 20. 229 | 3. Write an SQL query to find the average age of employees in a department. 230 | 231 | 232 | 233 | ## 🌟 Summary 234 | 235 | - SQLite3 is a lightweight database engine suitable for small projects. 236 | - SQLAlchemy provides an abstraction layer for interacting with databases programmatically. 237 | - SQL queries like `SELECT`, `INSERT`, and `WHERE` are used for data retrieval and filtering. 238 | 239 | 240 | --- 241 | 242 | 243 | 244 | -------------------------------------------------------------------------------- /13_Time Series Analysis Introduction/13_Time Series Analysis Introduction.md: -------------------------------------------------------------------------------- 1 | [<< Day 12](../12_SQL%20for%20Data%20Retrieval/12_SQL%20for%20Data%20Retrieval.md) | [Day 14 >>](../14_Working%20with%20APIs%20and%20JSON/14_Working%20with%20APIs%20and%20JSON.md) 2 | 3 | 4 | # 📘 Day 13: Time Series Analysis in Python 5 | 6 | Welcome to Day 13 of the **30 Days of Data Science** series! Today, we’ll explore **Time Series Analysis**—a critical skill in data science for analyzing and predicting sequential data points over time. We'll leverage Python libraries like **pandas**, **datetime**, and **matplotlib** to understand and visualize time series data. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 13: Time Series Analysis in Python](#-day-13-time-series-analysis-in-python) 13 | - [1️⃣ What is Time Series Analysis? 
🕒](#1️⃣-what-is-time-series-analysis-) 14 | - [2️⃣ Working with Datetime in Python 📅](#2️⃣-working-with-datetime-in-python-) 15 | - [Datetime Basics](#datetime-basics) 16 | - [Parsing and Formatting Dates](#parsing-and-formatting-dates) 17 | - [Example: Date Arithmetic](#example-date-arithmetic) 18 | - [3️⃣ Time Series Analysis with pandas 📊](#3️⃣-time-series-analysis-with-pandas-) 19 | - [Creating a Time Series](#creating-a-time-series) 20 | - [Resampling and Aggregation](#resampling-and-aggregation) 21 | - [Handling Missing Data](#handling-missing-data) 22 | - [Rolling Statistics](#rolling-statistics) 23 | - [4️⃣ Visualizing Time Series Data with matplotlib 📈](#4️⃣-visualizing-time-series-data-with-matplotlib-) 24 | - [Line Plots](#line-plots) 25 | - [Highlighting Trends](#highlighting-trends) 26 | - [🧠 Practice Exercises](#-practice-exercises) 27 | - [🌟 Summary](#-summary) 28 | 29 | 30 | 31 | 32 | ## 1️⃣ What is Time Series Analysis? 🕒 33 | 34 | A **time series** is a sequence of data points indexed in time order. Examples include stock prices, weather data, and sensor readings. **Time Series Analysis** focuses on uncovering patterns, trends, and seasonality in data to make informed decisions or predictions. 35 | 36 | 37 | 38 | ## 2️⃣ Working with Datetime in Python 📅 39 | 40 | The **datetime** module provides tools for handling dates and times. 41 | 42 | 43 | 44 | ### Datetime Basics 45 | 46 | ```python 47 | from datetime import datetime 48 | 49 | # Current date and time 50 | now = datetime.now() 51 | print(f"Current datetime: {now}") 52 | 53 | # Creating specific dates 54 | custom_date = datetime(2022, 12, 25, 10, 30) 55 | print(f"Custom datetime: {custom_date}") 56 | ``` 57 | 58 | **Output:** 59 | 60 | ```plaintext 61 | Current datetime: 2024-11-20 14:30:45.123456 62 | Custom datetime: 2022-12-25 10:30:00 63 | ``` 64 | 65 | 66 | 67 | ### Parsing and Formatting Dates 68 | 69 | ```python 70 | # Parsing a string to datetime 71 | date_str = "2024-11-20" 72 | parsed_date = datetime.strptime(date_str, "%Y-%m-%d") 73 | print(f"Parsed date: {parsed_date}") 74 | 75 | # Formatting datetime to string 76 | formatted_date = parsed_date.strftime("%B %d, %Y") 77 | print(f"Formatted date: {formatted_date}") 78 | ``` 79 | 80 | **Output:** 81 | 82 | ```plaintext 83 | Parsed date: 2024-11-20 00:00:00 84 | Formatted date: November 20, 2024 85 | ``` 86 | 87 | 88 | 89 | ### Example: Date Arithmetic 90 | 91 | ```python 92 | from datetime import timedelta 93 | 94 | # Adding days 95 | future_date = now + timedelta(days=10) 96 | print(f"10 days from now: {future_date}") 97 | 98 | # Difference between dates 99 | diff = future_date - now 100 | print(f"Days between: {diff.days}") 101 | ``` 102 | 103 | **Output:** 104 | 105 | ```plaintext 106 | 10 days from now: 2024-11-30 ... 107 | Days between: 10 108 | ``` 109 | 110 | 111 | 112 | ## 3️⃣ Time Series Analysis with pandas 📊 113 | 114 | The **pandas** library provides robust tools for working with time series data. 115 | 116 | 117 | 118 | ### Creating a Time Series 119 | 120 | ```python 121 | import pandas as pd 122 | 123 | # Create a date range 124 | dates = pd.date_range(start="2024-01-01", periods=7, freq="D") 125 | 126 | # Create a DataFrame 127 | data = pd.DataFrame({"Date": dates, "Value": [10, 15, 20, 25, 30, 35, 40]}) 128 | data.set_index("Date", inplace=True) 129 | print(data) 130 | ``` 131 | 132 | **Output:** 133 | 134 | ```plaintext 135 | Value 136 | Date 137 | 2024-01-01 10 138 | 2024-01-02 15 139 | 2024-01-03 20 140 | ... 
141 | ``` 142 | 143 | 144 | 145 | ### Resampling and Aggregation 146 | 147 | ```python 148 | # Resample to weekly average 149 | weekly_avg = data.resample("W").mean() 150 | print(weekly_avg) 151 | ``` 152 | 153 | **Output:** 154 | 155 | ```plaintext 156 | Value 157 | Date 158 | 2024-01-07 22.5 159 | ``` 160 | 161 | 162 | 163 | ### Handling Missing Data 164 | 165 | ```python 166 | # Create data with missing values 167 | data.loc["2024-01-05"] = None 168 | 169 | # Fill missing values 170 | data_filled = data.fillna(method="ffill") 171 | print(data_filled) 172 | ``` 173 | 174 | **Output:** 175 | 176 | ```plaintext 177 | Value 178 | Date 179 | 2024-01-05 30.0 180 | ``` 181 | 182 | 183 | 184 | ### Rolling Statistics 185 | 186 | ```python 187 | # Calculate rolling mean 188 | data["Rolling Mean"] = data["Value"].rolling(window=3).mean() 189 | print(data) 190 | ``` 191 | 192 | **Output:** 193 | 194 | ```plaintext 195 | Value Rolling Mean 196 | Date 197 | 2024-01-03 20.0 15.0 198 | ``` 199 | 200 | 201 | 202 | ## 4️⃣ Visualizing Time Series Data with matplotlib 📈 203 | 204 | The **matplotlib** library helps in visualizing trends, seasonality, and anomalies in time series data. 205 | 206 | 207 | 208 | ### Line Plots 209 | 210 | ```python 211 | import matplotlib.pyplot as plt 212 | 213 | # Plot time series 214 | data["Value"].plot(title="Time Series Data") 215 | plt.show() 216 | ``` 217 | 218 | 219 | 220 | ### Highlighting Trends 221 | 222 | ```python 223 | # Plot with trend line 224 | plt.plot(data.index, data["Value"], label="Original Data") 225 | plt.plot(data.index, data["Rolling Mean"], label="Trend", linestyle="--") 226 | plt.legend() 227 | plt.show() 228 | ``` 229 | 230 | 231 | 232 | ## 🧠 Practice Exercises 233 | 234 | 1. Create a time series dataset of daily sales for a week and calculate the rolling average. 235 | 2. Visualize monthly stock prices using matplotlib and identify trends. 236 | 3. Use pandas to resample hourly temperature data into daily averages. 237 | 238 | 239 | 240 | ## 🌟 Summary 241 | 242 | - The **datetime** module helps manage and manipulate dates in Python. 243 | - **pandas** provides tools for time series operations, such as resampling, filling missing values, and calculating rolling statistics. 244 | - Use **matplotlib** to visualize and analyze time series data. 245 | 246 | --- 247 | 248 | 249 | -------------------------------------------------------------------------------- /08_Data Cleaning/08_Data Cleaning.md: -------------------------------------------------------------------------------- 1 | [<< Day 7](../07_Importing%20Data/07_Importing%20Data.md) | [Day 9 >>](../09_Exploratory%20Data%20Analysis%20(EDA)/09_Exploratory%20Data%20Analysis%20(EDA).md) 2 | 3 | 4 | # 📘 Day 8: Cleaning Data 5 | 6 | Welcome to **Day 8** of the **30 Days of Data Science** series! Today, we focus on **Cleaning Data**, which includes handling missing values and duplicates. Cleaning your data is a crucial step to ensure the reliability and accuracy of your analyses and models. 
7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 8: Cleaning Data](#-day-8-cleaning-data) 13 | - [1️⃣ Introduction to Data Cleaning](#1️⃣-introduction-to-data-cleaning) 14 | - [2️⃣ Handling Missing Values](#2️⃣-handling-missing-values) 15 | - [Identifying Missing Data](#identifying-missing-data) 16 | - [Removing Missing Values](#removing-missing-values) 17 | - [Imputing Missing Values](#imputing-missing-values) 18 | - [Replacing Missing Values with Interpolation](#replacing-missing-values-with-interpolation) 19 | - [Filling Missing Values with Forward/Backward Fill](#filling-missing-values-with-forwardbackward-fill) 20 | - [3️⃣ Handling Duplicate Values](#3️⃣-handling-duplicate-values) 21 | - [Identifying Duplicates](#identifying-duplicates) 22 | - [Removing Duplicates](#removing-duplicates) 23 | - [Keeping Specific Duplicates](#keeping-specific-duplicates) 24 | - [🧠 Practice Exercises](#-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | 30 | ## 1️⃣ Introduction to Data Cleaning 31 | 32 | In real-world datasets, it is common to encounter missing or duplicate data. Pandas provides a rich set of functions to handle such issues effectively. Cleaning data ensures that your dataset is: 33 | 34 | - Consistent 35 | - Reliable 36 | - Free from errors 37 | 38 | 39 | 40 | ## 2️⃣ Handling Missing Values 41 | 42 | Missing values can occur for various reasons, such as data entry errors, system limitations, or incomplete surveys. Pandas offers multiple functions to handle these situations. 43 | 44 | ### Identifying Missing Data 45 | 46 | You can identify missing data using functions like `isnull()` and `notnull()`. 47 | 48 | #### Example: 49 | 50 | ```python 51 | import pandas as pd 52 | 53 | # Sample dataset 54 | data = {'Name': ['Alice', 'Bob', None, 'David'], 55 | 'Age': [25, None, 30, 22], 56 | 'City': ['New York', 'Los Angeles', 'Chicago', None]} 57 | 58 | df = pd.DataFrame(data) 59 | 60 | # Check for missing values 61 | print(df.isnull()) # Returns a DataFrame with True for missing values 62 | print("\nNumber of missing values per column:") 63 | print(df.isnull().sum()) 64 | ``` 65 | 66 | 67 | 68 | ### Removing Missing Values 69 | 70 | Use `dropna()` to remove rows or columns with missing values. 71 | 72 | #### Example: 73 | 74 | ```python 75 | # Drop rows with missing values 76 | df_cleaned = df.dropna() 77 | print(df_cleaned) 78 | 79 | # Drop columns with missing values 80 | df_cleaned_cols = df.dropna(axis=1) 81 | print(df_cleaned_cols) 82 | ``` 83 | 84 | You can also customize the behavior of `dropna()` using parameters like: 85 | 86 | - `how='all'`: Removes rows/columns where all values are missing. 87 | - `thresh=n`: Retains rows/columns with at least `n` non-NA values. 88 | 89 | #### Example: 90 | 91 | ```python 92 | # Drop rows with all missing values 93 | df_cleaned = df.dropna(how='all') 94 | 95 | # Drop rows with at least 2 non-missing values 96 | df_cleaned_thresh = df.dropna(thresh=2) 97 | print(df_cleaned_thresh) 98 | ``` 99 | 100 | 101 | 102 | ### Imputing Missing Values 103 | 104 | Imputation fills missing values with appropriate values like mean, median, or mode. 
105 | 106 | #### Example: 107 | 108 | ```python 109 | # Fill missing numeric values with the mean 110 | df['Age'] = df['Age'].fillna(df['Age'].mean()) 111 | print(df) 112 | 113 | # Fill missing categorical values with a mode 114 | df['Name'] = df['Name'].fillna(df['Name'].mode()[0]) 115 | print(df) 116 | ``` 117 | 118 | 119 | 120 | ### Replacing Missing Values with Interpolation 121 | 122 | Interpolation estimates missing values using mathematical functions. 123 | 124 | #### Example: 125 | 126 | ```python 127 | # Interpolate missing values 128 | data = {'Value': [1, None, 3, None, 5]} 129 | df = pd.DataFrame(data) 130 | 131 | # Linear interpolation 132 | df['Value'] = df['Value'].interpolate(method='linear') 133 | print(df) 134 | ``` 135 | 136 | 137 | 138 | ### Filling Missing Values with Forward/Backward Fill 139 | 140 | Forward fill (`ffill`) and backward fill (`bfill`) propagate known values to fill gaps. 141 | 142 | #### Example: 143 | 144 | ```python 145 | # Forward fill 146 | df_ffill = df.fillna(method='ffill') 147 | 148 | # Backward fill 149 | df_bfill = df.fillna(method='bfill') 150 | ``` 151 | 152 | 153 | 154 | ## 3️⃣ Handling Duplicate Values 155 | 156 | Duplicate values can distort analyses and lead to redundant information. Use Pandas to identify and remove duplicates effectively. 157 | 158 | ### Identifying Duplicates 159 | 160 | The `duplicated()` method returns a boolean series indicating whether each row is a duplicate. 161 | 162 | #### Example: 163 | 164 | ```python 165 | data = {'Name': ['Alice', 'Bob', 'Alice', 'David'], 166 | 'Age': [25, 30, 25, 22]} 167 | 168 | df = pd.DataFrame(data) 169 | 170 | # Check for duplicates 171 | print(df.duplicated()) 172 | ``` 173 | 174 | 175 | 176 | ### Removing Duplicates 177 | 178 | The `drop_duplicates()` method removes duplicate rows from the dataset. 179 | 180 | #### Example: 181 | 182 | ```python 183 | # Remove duplicate rows 184 | df_no_duplicates = df.drop_duplicates() 185 | print(df_no_duplicates) 186 | ``` 187 | 188 | 189 | 190 | ### Keeping Specific Duplicates 191 | 192 | By default, `drop_duplicates()` retains the first occurrence. You can modify this behavior using the `keep` parameter. 193 | 194 | #### Example: 195 | 196 | ```python 197 | # Keep the last occurrence of duplicates 198 | df_last = df.drop_duplicates(keep='last') 199 | print(df_last) 200 | 201 | # Remove all occurrences of duplicates 202 | df_none = df.drop_duplicates(keep=False) 203 | print(df_none) 204 | ``` 205 | 206 | 207 | 208 | ## 🧠 Practice Exercises 209 | 210 | 1. Create a DataFrame with missing values and use various techniques to handle them. 211 | 2. Experiment with different interpolation methods like `polynomial` or `spline`. 212 | 3. Create a dataset with duplicate rows and test various `keep` options for `drop_duplicates()`. 213 | 214 | 215 | 216 | ## 🌟 Summary 217 | 218 | - **Missing Values**: 219 | - Identify using `isnull()`. 220 | - Remove using `dropna()` or fill using `fillna()`, `interpolate()`, `ffill`, and `bfill`. 221 | - **Duplicate Values**: 222 | - Identify using `duplicated()`. 223 | - Remove using `drop_duplicates()` with flexible `keep` options. 
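A short recap sketch that chains these steps on a small illustrative DataFrame (column names and values are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Bob', None],
                   'Age': [25, None, None, 22]})

# 1. Inspect missing values per column
print(df.isnull().sum())

# 2. Fill numeric gaps with the mean and categorical gaps with the mode
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])

# 3. Drop duplicate rows, keeping the first occurrence
df = df.drop_duplicates()
print(df)
```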
224 | 225 | --- 226 | 227 | 228 | -------------------------------------------------------------------------------- /31_Deployment on Cloud Platform/31_Deployment on Cloud Platform.md: -------------------------------------------------------------------------------- 1 | [<< Day 30](../30_Building%20a%20Data%20Science%20Pipeline/30_Building%20a%20Data%20Science%20Pipeline.md) | 2 | 3 | # 🎉 Bonus Day 31: Deployment on Cloud Platform 4 | 5 | Welcome to **Bonus Day 31** of the 30 Days of Data Science series! 🎉 Today, we’ll explore how to deploy your machine learning models or applications to **Cloud Platforms** using **Flask/FastAPI**. By the end, you'll understand the steps to deploy to **AWS**, **Azure**, and **GCP**. 6 | 7 | 8 | 9 | ## 📜 Table of Contents 10 | 11 | - [🎉 Bonus Day 31: Deployment on Cloud Platform](#-bonus-day-31-deployment-on-cloud-platform) 12 | - [📜 Table of Contents](#-table-of-contents) 13 | - [🌐 Introduction](#-introduction) 14 | - [🚀 Preparing the Application for Deployment](#-preparing-the-application-for-deployment) 15 | - [⚙️ Deploying with Flask/FastAPI](#%EF%B8%8F-deploying-with-flaskfastapi) 16 | - [Flask Example](#flask-example) 17 | - [FastAPI Example](#fastapi-example) 18 | - [☁️ Deployment on AWS](#%EF%B8%8F-deployment-on-aws) 19 | - [Steps to Deploy](#steps-to-deploy) 20 | - [☁️ Deployment on Azure](#%EF%B8%8F-deployment-on-azure) 21 | - [Steps to Deploy](#steps-to-deploy-1) 22 | - [☁️ Deployment on GCP](#%EF%B8%8F-deployment-on-gcp) 23 | - [Steps to Deploy](#steps-to-deploy-2) 24 | - [📝 Practice Exercise](#-practice-exercise) 25 | - [📚 Summary](#-summary) 26 | 27 | 28 | 29 | ## 🌐 Introduction 30 | 31 | Deployment is a crucial step in bringing your data science project to life. It allows others to interact with your model or application in real-time. We'll focus on deploying applications using **Flask** or **FastAPI** to popular cloud platforms like: 32 | 33 | - **AWS (Amazon Web Services)** 34 | - **Azure** 35 | - **GCP (Google Cloud Platform)** 36 | 37 | 38 | 39 | ## 🚀 Preparing the Application for Deployment 40 | 41 | ### Basic Folder Structure 42 | Ensure your project folder is structured properly: 43 | 44 | ```plaintext 45 | project/ 46 | |-- app.py # Main application script 47 | |-- model.pkl # Serialized ML model (if applicable) 48 | |-- templates/ 49 | | |-- index.html # Frontend files (if needed) 50 | |-- requirements.txt # Python dependencies 51 | |-- Dockerfile # (Optional) Docker configuration 52 | ``` 53 | 54 | ### Creating `requirements.txt` 55 | List all dependencies in a `requirements.txt` file: 56 | 57 | ```plaintext 58 | Flask==2.1.2 59 | pandas==1.5.3 60 | numpy==1.23.5 61 | scikit-learn==1.1.3 62 | ``` 63 | Generate automatically: 64 | 65 | ```bash 66 | pip freeze > requirements.txt 67 | ``` 68 | 69 | 70 | 71 | ## ⚙️ Deploying with Flask/FastAPI 72 | 73 | ### Flask Example 74 | 75 | Here’s a minimal **Flask** application: 76 | 77 | ```python 78 | from flask import Flask, request, jsonify 79 | 80 | app = Flask(__name__) 81 | 82 | @app.route("/", methods=["GET"]) 83 | def home(): 84 | return "Welcome to the Flask Deployment!" 
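# The /predict route below expects a JSON body such as {"values": [1, 2, 3]};
# the sum is only a stand-in for real model inference. Assuming the default
# Flask development server (port 5000), the endpoint can be exercised with:
#   curl -X POST http://127.0.0.1:5000/predict \
#        -H "Content-Type: application/json" \
#        -d '{"values": [1, 2, 3]}'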
85 | 86 | @app.route("/predict", methods=["POST"]) 87 | def predict(): 88 | data = request.json 89 | prediction = sum(data["values"]) # Example prediction logic 90 | return jsonify({"prediction": prediction}) 91 | 92 | if __name__ == "__main__": 93 | app.run(debug=True) 94 | ``` 95 | 96 | Run locally: 97 | 98 | ```bash 99 | python app.py 100 | ``` 101 | 102 | ### FastAPI Example 103 | 104 | Here’s a minimal **FastAPI** application: 105 | 106 | ```python 107 | from fastapi import FastAPI 108 | from pydantic import BaseModel 109 | 110 | app = FastAPI() 111 | 112 | class InputData(BaseModel): 113 | values: list[int] 114 | 115 | @app.get("/") 116 | def home(): 117 | return {"message": "Welcome to FastAPI Deployment!"} 118 | 119 | @app.post("/predict") 120 | def predict(data: InputData): 121 | prediction = sum(data.values) # Example prediction logic 122 | return {"prediction": prediction} 123 | 124 | if __name__ == "__main__": 125 | import uvicorn 126 | uvicorn.run(app, host="0.0.0.0", port=8000) 127 | ``` 128 | 129 | Run locally: 130 | 131 | ```bash 132 | uvicorn app:app --reload 133 | ``` 134 | 135 | 136 | 137 | ## ☁️ Deployment on AWS 138 | 139 | ### Steps to Deploy 140 | 141 | 1. **Set Up an EC2 Instance:** 142 | - Go to the [AWS Management Console](https://aws.amazon.com/). 143 | - Launch an EC2 instance with an Ubuntu AMI. 144 | 145 | 2. **Install Dependencies on EC2:** 146 | SSH into the instance and set up the environment: 147 | 148 | ```bash 149 | sudo apt update && sudo apt upgrade 150 | sudo apt install python3-pip 151 | pip3 install -r requirements.txt 152 | ``` 153 | 154 | 3. **Run the Application:** 155 | 156 | ```bash 157 | python3 app.py 158 | ``` 159 | 160 | 4. **Expose the Application:** 161 | - Open port 5000 (or your application port) in the AWS security group. 162 | - Access the app using the public IP of the EC2 instance. 163 | 164 | 165 | 166 | ## ☁️ Deployment on Azure 167 | 168 | ### Steps to Deploy 169 | 170 | 1. **Set Up an App Service:** 171 | - Go to the [Azure Portal](https://portal.azure.com/). 172 | - Create an App Service and select a Python runtime. 173 | 174 | 2. **Deploy Code:** 175 | - Zip your project folder and upload it through the Azure portal. 176 | 177 | 3. **Configure Startup Command:** 178 | Add the following startup command in the portal: 179 | 180 | ```bash 181 | gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app 182 | ``` 183 | 184 | 4. **Access the Application:** 185 | - Use the App Service URL provided by Azure. 186 | 187 | 188 | 189 | ## ☁️ Deployment on GCP 190 | 191 | ### Steps to Deploy 192 | 193 | 1. **Set Up a Google Cloud Project:** 194 | - Go to the [GCP Console](https://console.cloud.google.com/). 195 | - Enable the App Engine API. 196 | 197 | 2. **Create an `app.yaml` File:** 198 | 199 | ```yaml 200 | runtime: python39 201 | entrypoint: gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app 202 | ``` 203 | 204 | 3. **Deploy the Application:** 205 | 206 | ```bash 207 | gcloud app deploy 208 | ``` 209 | 210 | 4. **Access the Application:** 211 | - Use the provided GCP URL. 212 | 213 | 214 | 215 | ## 📝 Practice Exercise 216 | 217 | 1. Create a Flask/FastAPI application that: 218 | - Accepts an input text. 219 | - Returns the sentiment (positive/negative) using a pre-trained model. 220 | 221 | 2. Deploy this application to any cloud platform of your choice. 222 | 223 | 224 | 225 | ## 📚 Summary 226 | 227 | In this bonus session, we learned how to: 228 | 229 | - Prepare a Flask/FastAPI application for deployment. 
230 | - Deploy to **AWS EC2**, **Azure App Service**, and **Google Cloud Platform**. 231 | - Configure cloud platforms and expose services to the internet. 232 | 233 | 🎉 Congratulations on completing the **30 Days of Data Science** series with this bonus day! You’re now equipped to deploy your projects to the world. 🌎 234 | 235 | --- 236 | 237 | 238 | -------------------------------------------------------------------------------- /02_Basics of the Language & Git Basics/02_Basics of the Language & Git Basics.md: -------------------------------------------------------------------------------- 1 | # 📘 Day 2: Python Syntax, Variables, and Git Setup 2 | 3 | Welcome to Day 2 of the **30 Days of Data Science** challenge! Today, we’ll focus on learning the basics of Python syntax, understanding how variables work, and setting up Git for version control. 4 | 5 | 6 | [<< Day 1](../README.md#-day-1) | [Day 3 >>](../03_Control%20Flow/03_Control%20Flow.md) 7 | 8 | 9 | ## Table of Contents 10 | - [📘 Day 2: Python Syntax, Variables, and Git Setup](#-day-2-python-syntax-variables-and-git-setup) 11 | - [1️⃣ Python Syntax 🐍](#1️⃣-python-syntax-) 12 | - [Key Features](#key-features) 13 | - [Example: Basic Python Syntax](#example-basic-python-syntax) 14 | - [Common Python Errors](#common-python-errors) 15 | - [2️⃣ Variables in Python 🛠](#2️⃣-variables-in-python-) 16 | - [Declaring Variables](#declaring-variables) 17 | - [Rules for Variable Names](#rules-for-variable-names) 18 | - [Example: Different Data Types](#example-different-data-types) 19 | - [Updating Variables](#updating-variables) 20 | - [3️⃣ Git Setup and Basics 🌟](#3️⃣-git-setup-and-basics-) 21 | - [Installing Git](#installing-git) 22 | - [Configuring Git](#configuring-git) 23 | - [Initializing a Repository](#initializing-a-repository) 24 | - [Tracking Changes](#tracking-changes) 25 | - [Connecting to GitHub](#connecting-to-github) 26 | - [Example Workflow](#example-workflow) 27 | - [🧠 Practice Exercises](#-practice-exercises) 28 | - [Python Syntax](#python-syntax) 29 | - [Variables](#variables) 30 | - [Git](#git) 31 | - [🌟 Summary](#-summary) 32 | 33 | --- 34 | 35 | ### 1️⃣ Python Syntax 🐍 36 | 37 | Python syntax refers to the set of rules that defines the structure of a Python program. It’s known for being clean and easy to read. 38 | 39 | #### Key Features 40 | - Python uses **indentation** to define blocks of code (no curly braces). 41 | - Each statement is typically written on a new line. 42 | - Comments in Python start with `#`. 43 | 44 | #### Example: Basic Python Syntax 45 | ```python 46 | # This is a single-line comment 47 | print("Hello, Data Science!") # This prints a message to the console 48 | 49 | # Indentation defines a block of code 50 | if 5 > 2: 51 | print("Five is greater than two.") # Correct indentation 52 | 53 | # Incorrect indentation will raise an error 54 | # if 5 > 2: 55 | # print("This will throw an error") 56 | ``` 57 | 58 | #### Common Python Errors 59 | - **IndentationError**: Missing or incorrect indentation. 60 | - **SyntaxError**: Incorrect syntax, such as missing colons or parentheses. 61 | 62 | --- 63 | 64 | ### 2️⃣ Variables in Python 🛠 65 | 66 | Variables are used to store data in memory. You can assign any type of data to a variable without explicitly declaring its type. 
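Because Python is dynamically typed, the interpreter infers the type from the value you assign, and the same name can later point to a value of a different type — a quick illustration:

```python
x = 42
print(type(x))   # Output: <class 'int'>

# Re-assigning the same name to a different type is allowed
x = "forty-two"
print(type(x))   # Output: <class 'str'>
```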
67 | 68 | #### Declaring Variables 69 | ```python 70 | # Assigning values to variables 71 | name = "Alice" 72 | age = 25 73 | height = 5.7 74 | 75 | # Printing variable values 76 | print(name) # Output: Alice 77 | print(age) # Output: 25 78 | print(height) # Output: 5.7 79 | ``` 80 | 81 | #### Rules for Variable Names 82 | 1. Must start with a letter or underscore (`_`). 83 | 2. Cannot start with a number. 84 | 3. Can only contain alphanumeric characters and underscores. 85 | 4. Case-sensitive (`Name` and `name` are different). 86 | 87 | #### Example: Different Data Types 88 | ```python 89 | # String 90 | greeting = "Hello, World!" 91 | 92 | # Integer 93 | year = 2024 94 | 95 | # Float 96 | pi = 3.14159 97 | 98 | # Boolean 99 | is_active = True 100 | 101 | print(greeting) # Output: Hello, World! 102 | print(year) # Output: 2024 103 | print(pi) # Output: 3.14159 104 | print(is_active) # Output: True 105 | ``` 106 | 107 | #### Updating Variables 108 | You can update a variable's value or perform operations on it. 109 | ```python 110 | count = 10 111 | count += 1 # Increment by 1 112 | print(count) # Output: 11 113 | ``` 114 | 115 | --- 116 | 117 | ### 3️⃣ Git Setup and Basics 🌟 118 | 119 | Git is a version control system that tracks changes in your code. It’s essential for collaboration and maintaining a clean workflow in data science projects. 120 | 121 | #### Installing Git 122 | 1. Download and install Git from the [official website](https://git-scm.com/). 123 | 2. Check if Git is installed: 124 | ```bash 125 | git --version 126 | ``` 127 | 128 | #### Configuring Git 129 | Set your name and email address (required for commits): 130 | ```bash 131 | git config --global user.name "Your Name" 132 | git config --global user.email "your.email@example.com" 133 | ``` 134 | 135 | #### Initializing a Repository 136 | 1. Navigate to your project folder: 137 | ```bash 138 | cd 30DaysOfDataScience 139 | ``` 140 | 2. Initialize Git: 141 | ```bash 142 | git init 143 | ``` 144 | 145 | #### Tracking Changes 146 | - **Add files** to the staging area: 147 | ```bash 148 | git add filename.py # Add a specific file 149 | git add . # Add all files in the folder 150 | ``` 151 | - **Commit changes** with a message: 152 | ```bash 153 | git commit -m "Initial commit for Day 2" 154 | ``` 155 | 156 | #### Connecting to GitHub 157 | 1. Create a repository on GitHub. 158 | 2. Link your local repo to GitHub: 159 | ```bash 160 | git remote add origin https://github.com/SamarthGarge/30DaysOfDataScience.git 161 | ``` 162 | 3. Push changes to GitHub: 163 | ```bash 164 | git branch -M main 165 | git push -u origin main 166 | ``` 167 | 168 | #### Example Workflow 169 | ```bash 170 | # Make changes to your files 171 | git add . # Stage the changes 172 | git commit -m "Updated Python variables and examples" 173 | git push # Push to GitHub 174 | ``` 175 | 176 | --- 177 | 178 | ## 🧠 Practice Exercises 179 | 180 | ### Python Syntax 181 | 1. Write a Python program that prints: 182 | ``` 183 | Welcome to Data Science! 184 | Let's explore Python syntax together. 185 | ``` 186 | 2. Experiment with indentation by writing a small `if` statement. 187 | 188 | ### Variables 189 | 1. Create variables for: 190 | - Your name 191 | - Your age 192 | - Your favorite number 193 | 2. 
Print them in a sentence using string concatenation or f-strings: 194 | ```python 195 | # Example with f-strings 196 | name = "Alice" 197 | age = 25 198 | favorite_number = 7 199 | print(f"My name is {name}, I am {age} years old, and my favorite number is {favorite_number}.") 200 | ``` 201 | 202 | ### Git 203 | 1. Create a new file called `day2_practice.py`. 204 | 2. Add it to your repository and commit with the message: 205 | `"Added Day 2 practice file"`. 206 | 207 | --- 208 | 209 | ## 🌟 Summary 210 | 211 | Today, you: 212 | - Learned about Python’s syntax and how to avoid common errors. 213 | - Explored how to declare and use variables effectively. 214 | - Set up Git for version control and pushed your first changes to GitHub. 215 | 216 | 217 | -------------------------------------------------------------------------------- /04_Functions and Modular Programming/04_Functions and Modular Programming.md: -------------------------------------------------------------------------------- 1 | [<< Day 3](../03_Control%20Flow/03_Control%20Flow.md) | [Day 5 >>](../05_Data%20Structures/05_Data%20Structures.md) 2 | # 📘 Day 4: Functions and Modular Programming in Python 3 | 4 | Welcome to Day 4 of the **30 Days of Data Science** series! Today, we explore **functions**—one of the most critical tools for writing clean, reusable, and modular code. Functions are essential in breaking down complex problems into smaller, manageable pieces. 5 | 6 | 7 | 8 | ## Table of Contents 9 | 10 | - [📘 Day 4: Functions and Modular Programming in Python](#-day-4-functions-and-modular-programming-in-python) 11 | - [1️⃣ Functions in Python 📜](#1️⃣-functions-in-python-) 12 | - [Defining a Function](#defining-a-function) 13 | - [Example: Simple Function](#example-simple-function) 14 | - [Functions with Parameters](#functions-with-parameters) 15 | - [Example: Function with Multiple Parameters](#example-function-with-multiple-parameters) 16 | - [Default Arguments](#default-arguments) 17 | - [Return Statement](#return-statement) 18 | - [Example: Returning a Value](#example-returning-a-value) 19 | - [Calling Functions](#calling-functions) 20 | - [2️⃣ Modular Programming 🧩](#2️⃣-modular-programming-) 21 | - [What is Modular Programming?](#what-is-modular-programming) 22 | - [Creating and Importing Modules](#creating-and-importing-modules) 23 | - [Example: Custom Module](#example-custom-module) 24 | - [Built-in Modules](#built-in-modules) 25 | - [🧠 Practice Exercises](#-practice-exercises) 26 | - [🌟 Summary](#-summary) 27 | 28 | 29 | 30 | 31 | 32 | 33 | ## 1️⃣ Functions in Python 📜 34 | 35 | A **function** is a block of code designed to perform a specific task. It runs only when called and can accept inputs and return outputs. 36 | 37 | 38 | 39 | ### Defining a Function 40 | 41 | The `def` keyword is used to define a function in Python. 42 | 43 | #### Syntax: 44 | 45 | ```python 46 | def function_name(parameters): 47 | # Code block 48 | return value # Optional 49 | ``` 50 | 51 | 52 | 53 | ### Example: Simple Function 54 | 55 | ```python 56 | def greet(): 57 | print("Hello, Data Science Enthusiast!") 58 | 59 | greet() 60 | ``` 61 | 62 | **Output:** 63 | 64 | ```plaintext 65 | Hello, Data Science Enthusiast! 66 | ``` 67 | 68 | 69 | 70 | ### Functions with Parameters 71 | 72 | Functions can accept inputs (parameters) to customize their behavior. 73 | 74 | ```python 75 | def greet(name): 76 | print(f"Hello, {name}!") 77 | 78 | greet("Alice") 79 | greet("Bob") 80 | ``` 81 | 82 | **Output:** 83 | 84 | ```plaintext 85 | Hello, Alice! 
86 | Hello, Bob! 87 | ``` 88 | 89 | 90 | 91 | ### Example: Function with Multiple Parameters 92 | 93 | ```python 94 | def add_numbers(a, b): 95 | result = a + b 96 | print(f"The sum of {a} and {b} is {result}.") 97 | 98 | add_numbers(3, 5) 99 | ``` 100 | 101 | **Output:** 102 | 103 | ```plaintext 104 | The sum of 3 and 5 is 8. 105 | ``` 106 | 107 | 108 | 109 | ### Default Arguments 110 | 111 | Functions can have default values for parameters, making them optional. 112 | 113 | ```python 114 | def greet(name="Data Scientist"): 115 | print(f"Welcome, {name}!") 116 | 117 | greet() 118 | greet("Alice") 119 | ``` 120 | 121 | **Output:** 122 | 123 | ```plaintext 124 | Welcome, Data Scientist! 125 | Welcome, Alice! 126 | ``` 127 | 128 | 129 | 130 | ### Return Statement 131 | 132 | The `return` statement allows a function to output a value. 133 | 134 | 135 | 136 | ### Example: Returning a Value 137 | 138 | ```python 139 | def square(number): 140 | return number * number 141 | 142 | result = square(4) 143 | print(f"The square of 4 is {result}.") 144 | ``` 145 | 146 | **Output:** 147 | 148 | ```plaintext 149 | The square of 4 is 16. 150 | ``` 151 | 152 | 153 | 154 | ### Calling Functions 155 | 156 | A **function call** executes the code inside a function definition. 157 | 158 | #### Example 1: Calling a Simple Function 159 | 160 | ```python 161 | def say_hello(): 162 | print("Hello!") 163 | 164 | # Function call 165 | say_hello() 166 | ``` 167 | 168 | **Output:** 169 | 170 | ```plaintext 171 | Hello! 172 | ``` 173 | 174 | #### Example 2: Function Call with Parameters 175 | 176 | ```python 177 | def add(a, b): 178 | return a + b 179 | 180 | result = add(10, 20) # Function call 181 | print(f"The sum is {result}.") 182 | ``` 183 | 184 | **Output:** 185 | 186 | ```plaintext 187 | The sum is 30. 188 | ``` 189 | 190 | #### Example 3: Combining Function Calls 191 | 192 | You can call functions within other function calls. 193 | 194 | ```python 195 | def double(number): 196 | return number * 2 197 | 198 | def add_and_double(a, b): 199 | return double(a + b) 200 | 201 | result = add_and_double(3, 5) 202 | print(f"The result is {result}.") 203 | ``` 204 | 205 | **Output:** 206 | 207 | ```plaintext 208 | The result is 16. 209 | ``` 210 | 211 | 212 | 213 | ## 2️⃣ Modular Programming 🧩 214 | 215 | ### What is Modular Programming? 216 | 217 | Modular programming involves breaking a program into smaller, manageable parts or **modules**. It enhances readability, reusability, and maintainability. 218 | 219 | 220 | 221 | ### Creating and Importing Modules 222 | 223 | A **module** is a Python file containing functions, variables, and classes that can be reused in other files. 224 | 225 | 1. **Create a Module**: Save your Python file (e.g., `my_module.py`). 226 | 227 | ```python 228 | # my_module.py 229 | def greet(name): 230 | return f"Hello, {name}!" 231 | ``` 232 | 233 | 2. **Import the Module**: Use the `import` keyword in another file. 234 | 235 | ```python 236 | # main.py 237 | import my_module 238 | 239 | message = my_module.greet("Alice") 240 | print(message) 241 | ``` 242 | 243 | **Output:** 244 | 245 | ```plaintext 246 | Hello, Alice! 
247 | ``` 248 | 249 | 250 | 251 | ### Example: Custom Module 252 | 253 | Let’s create a custom math module: 254 | 255 | ```python 256 | # math_utils.py 257 | def add(a, b): 258 | return a + b 259 | 260 | def multiply(a, b): 261 | return a * b 262 | ``` 263 | 264 | ```python 265 | # main.py 266 | from math_utils import add, multiply 267 | 268 | print(add(3, 5)) # Output: 8 269 | print(multiply(3, 5)) # Output: 15 270 | ``` 271 | 272 | 273 | 274 | ### Built-in Modules 275 | 276 | Python includes many built-in modules like `math`, `os`, and `random`. 277 | 278 | #### Example: Using the `math` Module 279 | 280 | ```python 281 | import math 282 | 283 | result = math.sqrt(16) 284 | print(f"The square root of 16 is {result}.") 285 | ``` 286 | 287 | **Output:** 288 | 289 | ```plaintext 290 | The square root of 16 is 4.0. 291 | ``` 292 | 293 | 294 | 295 | ## 🧠 Practice Exercises 296 | 297 | 1. Write a function that checks if a number is odd or even. 298 | 2. Create a function that calculates the factorial of a number. 299 | 3. Develop a module with utility functions for basic arithmetic operations. 300 | 4. Explore and use the `random` module to generate random numbers. 301 | 302 | 303 | 304 | ## 🌟 Summary 305 | 306 | - Functions make your code reusable and organized. 307 | - Parameters allow functions to accept input, and `return` values provide output. 308 | - Calling functions runs the code defined within them. 309 | - Modular programming improves code readability and maintenance. 310 | - Python’s built-in modules provide powerful utilities. 311 | 312 | --- 313 | -------------------------------------------------------------------------------- /09_Exploratory Data Analysis (EDA)/09_Exploratory Data Analysis (EDA).md: -------------------------------------------------------------------------------- 1 | [<< Day 8](../08_Data%20Cleaning/08_Data%20Cleaning.md) | [Day 10 >>](../10_Data%20Visualization%20Basics/10_Data%20Visualization%20Basics.md) 2 | 3 | # 📘 Day 9: Exploratory Data Analysis (EDA) with Python 4 | 5 | 6 | 7 | ## Table of Contents 8 | 9 | - [📘 Day 9: Exploratory Data Analysis (EDA) with Python](#-day-9-exploratory-data-analysis-eda-with-python) 10 | - [1️⃣ Introduction to Exploratory Data Analysis (EDA)](#1️⃣-introduction-to-exploratory-data-analysis-eda) 11 | - [2️⃣ Data Overview](#2️⃣-data-overview) 12 | - [Dataset Structure](#dataset-structure) 13 | - [Variable Classification](#variable-classification) 14 | - [3️⃣ Measures of Central Tendency](#3️⃣-measures-of-central-tendency) 15 | - [4️⃣ Measures of Dispersion (Spread)](#4️⃣-measures-of-dispersion-spread) 16 | - [5️⃣ Distribution Analysis](#5️⃣-distribution-analysis) 17 | - [6️⃣ Quantiles and Percentiles](#6️⃣-quantiles-and-percentiles) 18 | - [7️⃣ Categorical Data Analysis](#7️⃣-categorical-data-analysis) 19 | - [8️⃣ Outlier Detection](#8️⃣-outlier-detection) 20 | - [9️⃣ Visualizations for Descriptive Statistics](#9️⃣-visualizations-for-descriptive-statistics) 21 | - [1️⃣0️⃣ Correlation and Relationships](#1️⃣0️⃣-correlation-and-relationships) 22 | - [1️⃣1️⃣ Missing Values Analysis](#1️⃣1️⃣-missing-values-analysis) 23 | - [1️⃣2️⃣ Data Cleaning Insights](#1️⃣2️⃣-data-cleaning-insights) 24 | - [🧠 Practice Exercises](#-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | ## 1️⃣ Introduction to Exploratory Data Analysis (EDA) 30 | 31 | Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. 
It helps in understanding the data structure, spotting patterns, detecting anomalies, and deriving actionable insights. 32 | 33 | 34 | 35 | ## 2️⃣ Data Overview 36 | 37 | ### Dataset Structure 38 | 39 | - **Understanding Rows and Columns**: 40 | Use the `.shape` method to identify the number of rows and columns. 41 | 42 | ```python 43 | import pandas as pd 44 | 45 | # Sample dataset 46 | data = pd.read_csv('sample_data.csv') 47 | print("Shape of dataset:", data.shape) 48 | ``` 49 | 50 | - **Inspecting Data**: Use `.head()`, `.info()`, and `.describe()` for initial exploration. 51 | 52 | ```python 53 | # Display first 5 rows 54 | print(data.head()) 55 | 56 | # Dataset information 57 | print(data.info()) 58 | 59 | # Summary statistics for numerical columns 60 | print(data.describe()) 61 | ``` 62 | 63 | ### Variable Classification 64 | 65 | - **Numerical Variables**: 66 | - Continuous: e.g., height, weight. 67 | - Discrete: e.g., number of children. 68 | 69 | - **Categorical Variables**: 70 | - Nominal: No inherent order (e.g., gender, color). 71 | - Ordinal: Ordered categories (e.g., education level). 72 | 73 | - **Date/Time**: Useful for time-series analysis. 74 | 75 | 76 | 77 | ## 3️⃣ Measures of Central Tendency 78 | 79 | ### Mean 80 | 81 | ```python 82 | mean_value = data['column_name'].mean() 83 | print("Mean:", mean_value) 84 | ``` 85 | 86 | ### Median 87 | 88 | ```python 89 | median_value = data['column_name'].median() 90 | print("Median:", median_value) 91 | ``` 92 | 93 | ### Mode 94 | 95 | ```python 96 | mode_value = data['column_name'].mode() 97 | print("Mode:", mode_value) 98 | ``` 99 | 100 | 101 | 102 | ## 4️⃣ Measures of Dispersion (Spread) 103 | 104 | ### Range 105 | 106 | ```python 107 | range_value = data['column_name'].max() - data['column_name'].min() 108 | print("Range:", range_value) 109 | ``` 110 | 111 | ### Variance and Standard Deviation 112 | 113 | ```python 114 | variance = data['column_name'].var() 115 | std_dev = data['column_name'].std() 116 | 117 | print("Variance:", variance) 118 | print("Standard Deviation:", std_dev) 119 | ``` 120 | 121 | ### Interquartile Range (IQR) 122 | 123 | ```python 124 | q1 = data['column_name'].quantile(0.25) 125 | q3 = data['column_name'].quantile(0.75) 126 | iqr = q3 - q1 127 | 128 | print("IQR:", iqr) 129 | ``` 130 | 131 | 132 | 133 | ## 5️⃣ Distribution Analysis 134 | 135 | ### Skewness and Kurtosis 136 | 137 | ```python 138 | skewness = data['column_name'].skew() 139 | kurtosis = data['column_name'].kurt() 140 | 141 | print("Skewness:", skewness) 142 | print("Kurtosis:", kurtosis) 143 | ``` 144 | 145 | 146 | 147 | ## 6️⃣ Quantiles and Percentiles 148 | 149 | ```python 150 | # Quartiles 151 | q1 = data['column_name'].quantile(0.25) 152 | q2 = data['column_name'].quantile(0.50) # Median 153 | q3 = data['column_name'].quantile(0.75) 154 | 155 | print("Quartiles:", q1, q2, q3) 156 | 157 | # Percentile 158 | percentile_90 = data['column_name'].quantile(0.90) 159 | print("90th Percentile:", percentile_90) 160 | ``` 161 | 162 | 163 | 164 | ## 7️⃣ Categorical Data Analysis 165 | 166 | ### Frequency Counts 167 | 168 | ```python 169 | print(data['categorical_column'].value_counts()) 170 | ``` 171 | 172 | ### Cross-Tabulation 173 | 174 | ```python 175 | pd.crosstab(data['column1'], data['column2']) 176 | ``` 177 | 178 | 179 | 180 | ## 8️⃣ Outlier Detection 181 | 182 | ### Using IQR 183 | 184 | ```python 185 | outliers = data[(data['column_name'] < (q1 - 1.5 * iqr)) | 186 | (data['column_name'] > (q3 + 1.5 * iqr))] 187 | print(outliers) 188 
| ``` 189 | 190 | ### Z-Score Method 191 | 192 | ```python 193 | from scipy.stats import zscore 194 | 195 | data['z_score'] = zscore(data['column_name']) 196 | outliers = data[data['z_score'].abs() > 3] 197 | print(outliers) 198 | ``` 199 | 200 | 201 | 202 | ## 9️⃣ Visualizations for Descriptive Statistics 203 | 204 | ### Numerical Data 205 | 206 | ```python 207 | import matplotlib.pyplot as plt 208 | import seaborn as sns 209 | 210 | # Histogram 211 | data['column_name'].hist() 212 | 213 | # Boxplot 214 | sns.boxplot(x=data['column_name']) 215 | ``` 216 | 217 | ### Categorical Data 218 | 219 | ```python 220 | # Bar Chart 221 | data['categorical_column'].value_counts().plot(kind='bar') 222 | ``` 223 | 224 | 225 | 226 | ## 1️⃣0️⃣ Correlation and Relationships 227 | 228 | ### Correlation Coefficient 229 | 230 | ```python 231 | correlation = data.corr() 232 | print(correlation) 233 | ``` 234 | 235 | ### Scatter Plot 236 | 237 | ```python 238 | sns.scatterplot(x='column1', y='column2', data=data) 239 | ``` 240 | 241 | 242 | 243 | ## 1️⃣1️⃣ Missing Values Analysis 244 | 245 | ```python 246 | # Proportion of missing values 247 | missing = data.isnull().mean() 248 | print(missing) 249 | 250 | # Impute missing values 251 | data['column_name'].fillna(data['column_name'].mean(), inplace=True) 252 | ``` 253 | 254 | 255 | 256 | ## 1️⃣2️⃣ Data Cleaning Insights 257 | 258 | - Identify inconsistencies like out-of-range values or incorrect types. 259 | - Remove duplicates. 260 | 261 | ```python 262 | # Remove duplicates 263 | data = data.drop_duplicates() 264 | ``` 265 | 266 | 267 | 268 | ## 🧠 Practice Exercises 269 | 270 | 1. Load a dataset of your choice and summarize its structure. 271 | 2. Compute measures of central tendency and dispersion for a numerical column. 272 | 3. Identify and visualize outliers using boxplots. 273 | 4. Analyze correlations between numerical variables and plot a heatmap. 274 | 5. Handle missing values using different imputation techniques. 275 | 276 | 277 | 278 | ## 🌟 Summary 279 | 280 | Exploratory Data Analysis (EDA) helps in gaining a comprehensive understanding of datasets by summarizing their structure, detecting outliers, analyzing distributions, and visualizing relationships. Mastering EDA is a critical step for preparing data for advanced analytics and machine learning. 281 | 282 | --- 283 | -------------------------------------------------------------------------------- /30_Building a Data Science Pipeline/30_Building a Data Science Pipeline.md: -------------------------------------------------------------------------------- 1 | [<< Day 29](../29_Working%20with%20Big%20Data/29_Working%20with%20Big%20Data.md) | [Day 31 >>](../31_Deployment%20on%20Cloud%20Platform/31_Deployment%20on%20Cloud%20Platform.md) 2 | 3 | # 🎉 Day 30: Building a Data Science Pipeline 4 | 5 | Welcome to the final day of the **30 Days of Data Science** challenge! 🎊 Today, we will bring together all the skills you've learned by exploring how to build a **Data Science Pipeline**. Pipelines are essential for automating workflows, ensuring reproducibility, and deploying machine learning models effectively. 
6 | 7 | ## 📚 Table of Contents 8 | - [🎉 Day 30: Building a Data Science Pipeline](#-day-30-building-a-data-science-pipeline) 9 | - [Introduction](#introduction) 10 | - [🔧 Sklearn Pipelines](#-sklearn-pipelines) 11 | - [📦 Serialization with Joblib](#-serialization-with-joblib) 12 | - [🛠️ Feature Engineering Pipelines](#️-feature-engineering-pipelines) 13 | - [🔄 Handling Data Preprocessing](#-handling-data-preprocessing) 14 | - [📊 Model Evaluation and Cross-Validation](#-model-evaluation-and-cross-validation) 15 | - [🚀 Deployment Best Practices](#-deployment-best-practices) 16 | - [🧪 Testing and Validation](#-testing-and-validation) 17 | - [📝 Practice Exercise](#-practice-exercise) 18 | - [📖 Summary](#-summary) 19 | 20 | ## Introduction 21 | A **Data Science Pipeline** is a structured sequence of steps that automates the flow of data from raw input to a final machine learning model or output. Pipelines make your projects scalable, reproducible, and easier to maintain. Today, we will focus on building efficient pipelines using Python and related libraries. 22 | 23 | ## 🔧 Sklearn Pipelines 24 | `Pipeline` in `sklearn` allows you to automate machine learning workflows. It helps chain preprocessing steps, feature transformations, and model training into a single object. 25 | 26 | ### Example: 27 | ```python 28 | from sklearn.pipeline import Pipeline 29 | from sklearn.preprocessing import StandardScaler 30 | from sklearn.ensemble import RandomForestClassifier 31 | 32 | # Define pipeline 33 | pipeline = Pipeline([ 34 | ('scaler', StandardScaler()), 35 | ('classifier', RandomForestClassifier()) 36 | ]) 37 | 38 | # Fit and predict 39 | pipeline.fit(X_train, y_train) 40 | predictions = pipeline.predict(X_test) 41 | ``` 42 | 43 | ### Benefits of Sklearn Pipelines: 44 | - Automates repetitive tasks 45 | - Reduces code duplication 46 | - Ensures consistency during training and testing 47 | 48 | ## 📦 Serialization with Joblib 49 | Serialization is essential for saving your trained models and reusing them later. `joblib` is a powerful library for saving and loading Python objects, especially models. 50 | 51 | ### Example: 52 | ```python 53 | from joblib import dump, load 54 | 55 | # Save the pipeline 56 | dump(pipeline, 'model_pipeline.joblib') 57 | 58 | # Load the pipeline 59 | loaded_pipeline = load('model_pipeline.joblib') 60 | ``` 61 | 62 | ## 🛠️ Feature Engineering Pipelines 63 | Feature engineering involves creating meaningful input features for your model. Pipelines can include custom feature transformers. 64 | 65 | ### Example of a Custom Transformer: 66 | ```python 67 | from sklearn.base import BaseEstimator, TransformerMixin 68 | 69 | class CustomTransformer(BaseEstimator, TransformerMixin): 70 | def fit(self, X, y=None): 71 | return self 72 | 73 | def transform(self, X): 74 | # Custom transformation logic 75 | return X + 1 76 | 77 | # Add custom transformer to pipeline 78 | pipeline = Pipeline([ 79 | ('custom_transformer', CustomTransformer()), 80 | ('classifier', RandomForestClassifier()) 81 | ]) 82 | ``` 83 | 84 | ## 🔄 Handling Data Preprocessing 85 | Preprocessing ensures that your raw data is transformed into a format suitable for modeling. Steps like missing value imputation, encoding, and scaling can be incorporated into the pipeline. 
86 | 87 | ### Example: 88 | ```python 89 | from sklearn.compose import ColumnTransformer 90 | from sklearn.impute import SimpleImputer 91 | from sklearn.preprocessing import OneHotEncoder 92 | 93 | # Define preprocessing steps 94 | preprocessor = ColumnTransformer([ 95 | ('num', SimpleImputer(strategy='mean'), numerical_columns), 96 | ('cat', OneHotEncoder(), categorical_columns) 97 | ]) 98 | 99 | # Add preprocessing to pipeline 100 | pipeline = Pipeline([ 101 | ('preprocessor', preprocessor), 102 | ('classifier', RandomForestClassifier()) 103 | ]) 104 | ``` 105 | 106 | ## 📊 Model Evaluation and Cross-Validation 107 | Cross-validation is crucial for evaluating the performance of your pipeline. It helps ensure that the model generalizes well to unseen data. 108 | 109 | ### Example: 110 | ```python 111 | from sklearn.model_selection import cross_val_score 112 | 113 | # Cross-validation 114 | scores = cross_val_score(pipeline, X, y, cv=5) 115 | print("Cross-validation scores:", scores) 116 | ``` 117 | 118 | ### Key Metrics: 119 | - **Accuracy:** Overall correctness of the model. 120 | - **Precision:** Focus on false positives. 121 | - **Recall:** Focus on false negatives. 122 | - **F1 Score:** Harmonic mean of precision and recall. 123 | 124 | ## 🚀 Deployment Best Practices 125 | Once your pipeline is ready, deployment becomes the next critical step. Here are some best practices: 126 | 127 | 1. **Serialization:** Save your model and preprocessing steps using `joblib`. 128 | 2. **Environment Consistency:** Use tools like Docker to ensure that your development and production environments are identical. 129 | 3. **Monitoring and Logging:** Implement monitoring to track model performance post-deployment. 130 | 4. **Versioning:** Keep track of model versions for rollback and debugging purposes. 131 | 132 | ### Example: 133 | ```python 134 | import joblib 135 | import os 136 | 137 | # Save model 138 | joblib.dump(pipeline, 'pipeline_v1.joblib') 139 | 140 | # Check saved file 141 | if os.path.exists('pipeline_v1.joblib'): 142 | print("Pipeline saved successfully!") 143 | ``` 144 | 145 | ## 🧪 Testing and Validation 146 | Testing your pipeline ensures that it generalizes well to unseen data. Use cross-validation and performance metrics to evaluate the pipeline. 147 | 148 | ### Example: 149 | ```python 150 | from sklearn.model_selection import train_test_split 151 | from sklearn.metrics import classification_report 152 | 153 | # Split the data 154 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 155 | 156 | # Fit the pipeline 157 | pipeline.fit(X_train, y_train) 158 | 159 | # Evaluate 160 | predictions = pipeline.predict(X_test) 161 | print(classification_report(y_test, predictions)) 162 | ``` 163 | 164 | ## 📝 Practice Exercise 165 | **Task:** Build a data science pipeline that includes: 166 | 1. Data preprocessing (handle missing values and scaling) 167 | 2. Feature selection or engineering 168 | 3. Model training with a classifier of your choice 169 | 4. Save the pipeline using `joblib` 170 | 171 | **Dataset:** Use any dataset of your choice (e.g., Titanic dataset). 172 | 173 | **Steps:** 174 | 1. Load the dataset. 175 | 2. Define preprocessing and modeling steps. 176 | 3. Use `Pipeline` to combine all steps. 177 | 4. Evaluate the pipeline using cross-validation. 178 | 5. Save and reload the pipeline. 179 | 180 | ## 📖 Summary 181 | In this final day, you learned how to build robust Data Science Pipelines using tools like `sklearn` and `joblib`. 
We explored pipeline creation, feature engineering, cross-validation, and deployment best practices. Pipelines simplify workflows, improve reproducibility, and make your projects production-ready. Congratulations on completing the 30 Days of Data Science challenge! 🎉 182 | 183 | --- 184 | -------------------------------------------------------------------------------- /11_Advanced Data Visualization/11_Advanced Data Visualization.md: -------------------------------------------------------------------------------- 1 | [<< Day 10](../10_Data%20Visualization%20Basics/10_Data%20Visualization%20Basics.md) | [Day 12 >>](../12_SQL%20for%20Data%20Retrieval/12_SQL%20for%20Data%20Retrieval.md) 2 | 3 | # 📘 Day 11: Advanced Data Visualization with Plotly and Advanced Matplotlib 4 | 5 | Welcome to **Day 11** of the **30 Days of Data Science** series! Today, we dive into advanced data visualization techniques using **Plotly** and advanced features of **Matplotlib**. These tools enable you to create highly interactive and visually appealing visualizations, perfect for storytelling and analyzing complex datasets. 6 | 7 | 8 | 9 | ## Table of Contents 10 | 11 | - [📘 Day 11: Advanced Data Visualization with Plotly and Advanced Matplotlib](#-day-11-advanced-data-visualization-with-plotly-and-advanced-matplotlib) 12 | - [1️⃣ Introduction to Plotly 📊](#1️⃣-introduction-to-plotly-) 13 | - [Why Use Plotly?](#why-use-plotly) 14 | - [Installing Plotly](#installing-plotly) 15 | - [Creating a Basic Plotly Chart](#creating-a-basic-plotly-chart) 16 | - [Interactive Features in Plotly](#interactive-features-in-plotly) 17 | - [2️⃣ Advanced Features of Plotly 🌟](#2️⃣-advanced-features-of-plotly-) 18 | - [Creating Subplots](#creating-subplots) 19 | - [Using Plotly Express](#using-plotly-express) 20 | - [Example: Advanced Interactive Dashboard](#example-advanced-interactive-dashboard) 21 | - [3️⃣ Advanced Matplotlib 📐](#3️⃣-advanced-matplotlib-) 22 | - [Customizing Matplotlib Visualizations](#customizing-matplotlib-visualizations) 23 | - [Using Styles and Themes](#using-styles-and-themes) 24 | - [Creating Complex Plots with Matplotlib](#creating-complex-plots-with-matplotlib) 25 | - [3D Plotting with Matplotlib](#3d-plotting-with-matplotlib) 26 | - [🧠 Practice Exercises](#-practice-exercises) 27 | - [🌟 Summary](#-summary) 28 | 29 | 30 | 31 | 32 | ## 1️⃣ Introduction to Plotly 📊 33 | 34 | **Plotly** is a Python library that allows you to create interactive, web-based visualizations. It is well-suited for creating complex and dynamic plots. 35 | 36 | 37 | 38 | ### Why Use Plotly? 39 | 40 | 1. **Interactive Visualizations**: Enables zooming, panning, and hover effects. 41 | 2. **Web-Ready**: Integrates well with web applications. 42 | 3. **Rich Ecosystem**: Includes support for 2D and 3D plots, dashboards, and more. 43 | 44 | 45 | 46 | ### Installing Plotly 47 | 48 | Use the following command to install Plotly: 49 | 50 | ```bash 51 | pip install plotly 52 | ``` 53 | 54 | 55 | ### Creating a Basic Plotly Chart 56 | 57 | Creating a simple line plot 58 | ```bash 59 | import plotly.graph_objects as go 60 | 61 | fig = go.Figure() 62 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[10, 20, 30], mode='lines+markers', name='Line Plot')) 63 | fig.update_layout(title="Basic Plotly Line Chart", xaxis_title="X-axis", yaxis_title="Y-axis") 64 | fig.show() 65 | ``` 66 | 67 | ### Interactive Features in Plotly 68 | 69 | Plotly charts support interactivity by default. Hovering over points displays tooltips, and you can zoom in or pan the chart. 
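Beyond the defaults, hover behaviour can be customised. A small sketch, assuming Plotly is installed — `hovertemplate` controls the tooltip text, and `hovermode="x unified"` shows one combined tooltip per x value:

```python
import plotly.graph_objects as go

fig = go.Figure(go.Scatter(x=[1, 2, 3], y=[10, 20, 30], mode='lines+markers'))

# Customise the tooltip text; <extra></extra> hides the secondary trace label
fig.update_traces(hovertemplate="x=%{x}<br>y=%{y}<extra></extra>")

# Show a single unified tooltip for all traces at a given x position
fig.update_layout(hovermode="x unified", title="Custom Hover Behaviour")
fig.show()
```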
70 | 71 | ## 2️⃣ Advanced Features of Plotly 🌟 72 | 73 | ### Creating Subplots 74 | 75 | Plotly allows you to create multiple subplots within a single figure. 76 | 77 | ```bash 78 | from plotly.subplots import make_subplots 79 | import plotly.graph_objects as go 80 | 81 | # Creating subplots 82 | fig = make_subplots(rows=1, cols=2, subplot_titles=("Plot 1", "Plot 2")) 83 | 84 | # Adding traces to subplots 85 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6], mode='lines', name='Line 1'), row=1, col=1) 86 | fig.add_trace(go.Bar(x=["A", "B", "C"], y=[7, 8, 9], name='Bar Chart'), row=1, col=2) 87 | 88 | fig.update_layout(title="Subplots Example") 89 | fig.show() 90 | ``` 91 | 92 | ### Using Plotly Express 93 | 94 | Plotly Express simplifies the creation of visualizations with concise syntax. 95 | 96 | ```bash 97 | import plotly.express as px 98 | 99 | # Example: Scatter plot 100 | df = px.data.iris() # Built-in dataset 101 | fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", title="Iris Dataset Scatter Plot") 102 | fig.show() 103 | ``` 104 | 105 | ### Example: Advanced Interactive Dashboard 106 | 107 | ```bash 108 | import plotly.graph_objects as go 109 | from plotly.subplots import make_subplots 110 | 111 | # Creating a dashboard layout 112 | fig = make_subplots(rows=2, cols=2, subplot_titles=("Line", "Bar", "Pie", "Scatter")) 113 | 114 | # Adding various plots 115 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[10, 20, 30], mode='lines', name='Line'), row=1, col=1) 116 | fig.add_trace(go.Bar(x=["A", "B", "C"], y=[5, 10, 15], name='Bar'), row=1, col=2) 117 | fig.add_trace(go.Pie(labels=["A", "B", "C"], values=[10, 20, 30], name='Pie'), row=2, col=1) 118 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[7, 8, 9], mode='markers', name='Scatter'), row=2, col=2) 119 | 120 | fig.update_layout(title="Advanced Interactive Dashboard") 121 | fig.show() 122 | ``` 123 | 124 | ## 3️⃣ Advanced Matplotlib 📐 125 | 126 | Matplotlib offers extensive customization options for static visualizations. Let’s explore its advanced features. 127 | 128 | ### Customizing Matplotlib Visualizations 129 | 130 | ```bash 131 | import matplotlib.pyplot as plt 132 | 133 | # Customizing plot styles 134 | plt.figure(figsize=(8, 6)) 135 | plt.plot([1, 2, 3], [4, 5, 6], color='purple', linestyle='--', linewidth=2, marker='o', label='Line Plot') 136 | plt.title("Customized Matplotlib Plot", fontsize=16) 137 | plt.xlabel("X-axis Label", fontsize=12) 138 | plt.ylabel("Y-axis Label", fontsize=12) 139 | plt.legend() 140 | plt.grid(True) 141 | plt.show() 142 | ``` 143 | 144 | ### Using Styles and Themes 145 | 146 | Matplotlib comes with built-in styles. You can activate them using `plt.style.use()`. 
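To see which style sheets ship with your Matplotlib installation before picking one, you can print the available names:

```python
import matplotlib.pyplot as plt

# List the built-in style sheets bundled with the installed version
print(plt.style.available)
```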
147 | 148 | ```bash 149 | import matplotlib.pyplot as plt 150 | 151 | # Applying a style 152 | plt.style.use('ggplot') 153 | 154 | # Creating a styled plot 155 | plt.plot([1, 2, 3], [2, 4, 6], label='Styled Line', color='blue') 156 | plt.title("Using Matplotlib Styles") 157 | plt.legend() 158 | plt.show() 159 | ``` 160 | 161 | ### Creating Complex Plots with Matplotlib 162 | 163 | ```bash 164 | import matplotlib.pyplot as plt 165 | import numpy as np 166 | 167 | x = np.linspace(0, 10, 100) 168 | y = np.sin(x) 169 | 170 | plt.figure(figsize=(10, 6)) 171 | plt.plot(x, y, label='Sine Wave', color='green') 172 | plt.fill_between(x, y, alpha=0.3, color='green') 173 | plt.title("Sine Wave with Fill") 174 | plt.xlabel("X-axis") 175 | plt.ylabel("Y-axis") 176 | plt.legend() 177 | plt.show() 178 | ``` 179 | 180 | ### 3D Plotting with Matplotlib 181 | 182 | ```bash 183 | from mpl_toolkits.mplot3d import Axes3D 184 | import matplotlib.pyplot as plt 185 | import numpy as np 186 | 187 | fig = plt.figure(figsize=(10, 7)) 188 | ax = fig.add_subplot(111, projection='3d') 189 | 190 | # Generating data 191 | x = np.linspace(-5, 5, 100) 192 | y = np.linspace(-5, 5, 100) 193 | X, Y = np.meshgrid(x, y) 194 | Z = np.sin(np.sqrt(X**2 + Y**2)) 195 | 196 | # Creating a 3D surface plot 197 | ax.plot_surface(X, Y, Z, cmap='viridis') 198 | ax.set_title("3D Surface Plot") 199 | plt.show() 200 | ``` 201 | 202 | ## 🧠 Practice Exercises 203 | 204 | 1. Create an interactive bar chart using Plotly. 205 | 2. Generate subplots combining line and scatter plots. 206 | 3. Use Matplotlib to create a heatmap. 207 | 4. Explore the use of seaborn with advanced Matplotlib customizations. 208 | 209 | 210 | 211 | ## 🌟 Summary 212 | 213 | - Plotly is excellent for creating interactive, web-based visualizations. 214 | - Matplotlib offers flexibility and control for static visualizations. 215 | - Subplots, styles, and 3D plotting enhance your ability to tell a story with data. 216 | 217 | --- 218 | -------------------------------------------------------------------------------- /05_Data Structures/05_Data Structures.md: -------------------------------------------------------------------------------- 1 | [<< Day 4](../04_Functions%20and%20Modular%20Programming/04_Functions%20and%20Modular%20Programming.md) | [Day 6 >>](../06_Data%20Frames%20and%20Tables/06_Data%20Frames%20and%20Tables.md) 2 | # 📘 Day 5: Data Structures in Python 3 | 4 | Welcome to **Day 5** of the **30 Days of Data Science** series! Today, we will explore **data structures** in Python, focusing on three important types: 5 | 6 | - **Lists** 7 | - **Tuples** 8 | - **Dictionaries** 9 | 10 | Understanding these data structures is fundamental for data manipulation and organization in Python. 
11 | 12 | 13 | 14 | ## Table of Contents 15 | 16 | - [📘 Day 5: Data Structures in Python](#-day-5-data-structures-in-python) 17 | - [1️⃣ Lists in Python 📋](#1️⃣-lists-in-python-) 18 | - [Creating a List](#creating-a-list) 19 | - [Accessing List Elements](#accessing-list-elements) 20 | - [Adding Elements to a List](#adding-elements-to-a-list) 21 | - [Removing Elements from a List](#removing-elements-from-a-list) 22 | - [List Slicing](#list-slicing) 23 | - [Common List Methods](#common-list-methods) 24 | - [2️⃣ Tuples in Python 🔗](#2️⃣-tuples-in-python-) 25 | - [Creating a Tuple](#creating-a-tuple) 26 | - [Accessing Tuple Elements](#accessing-tuple-elements) 27 | - [Immutability of Tuples](#immutability-of-tuples) 28 | - [Common Tuple Methods](#common-tuple-methods) 29 | - [3️⃣ Dictionaries in Python 📖](#3️⃣-dictionaries-in-python-) 30 | - [Creating a Dictionary](#creating-a-dictionary) 31 | - [Accessing Dictionary Values](#accessing-dictionary-values) 32 | - [Adding and Updating Key-Value Pairs](#adding-and-updating-key-value-pairs) 33 | - [Removing Key-Value Pairs](#removing-key-value-pairs) 34 | - [Common Dictionary Methods](#common-dictionary-methods) 35 | - [🧠 Practice Exercises](#-practice-exercises) 36 | - [🌟 Summary](#-summary) 37 | 38 | 39 | 40 | 41 | 42 | ## 1️⃣ Lists in Python 📋 43 | 44 | ### What is a List? 45 | 46 | A **list** is a mutable (modifiable) collection of ordered elements. Lists can store elements of different data types, such as integers, strings, floats, or even other lists. 47 | 48 | 49 | 50 | ### Creating a List 51 | 52 | ```python 53 | # Creating a list of numbers 54 | numbers = [1, 2, 3, 4, 5] 55 | 56 | # Creating a mixed data type list 57 | mixed_list = [1, "apple", 3.14, True] 58 | ``` 59 | 60 | 61 | 62 | ### Accessing List Elements 63 | 64 | You can access elements in a list using **indexing** (zero-based). 65 | 66 | ```python 67 | fruits = ["apple", "banana", "cherry"] 68 | 69 | # Accessing the first element 70 | print(fruits[0]) # Output: apple 71 | 72 | # Accessing the last element 73 | print(fruits[-1]) # Output: cherry 74 | ``` 75 | 76 | 77 | 78 | ### Adding Elements to a List 79 | 80 | Use the `append()` method to add an element to the end or the `insert()` method to add at a specific position. 81 | 82 | ```python 83 | fruits = ["apple", "banana"] 84 | 85 | # Adding an element at the end 86 | fruits.append("cherry") 87 | print(fruits) # Output: ['apple', 'banana', 'cherry'] 88 | 89 | # Inserting an element at a specific position 90 | fruits.insert(1, "orange") 91 | print(fruits) # Output: ['apple', 'orange', 'banana', 'cherry'] 92 | ``` 93 | 94 | 95 | 96 | ### Removing Elements from a List 97 | 98 | Use `remove()` or `pop()` to delete elements. 99 | 100 | ```python 101 | fruits = ["apple", "banana", "cherry"] 102 | 103 | # Removing by value 104 | fruits.remove("banana") 105 | print(fruits) # Output: ['apple', 'cherry'] 106 | 107 | # Removing by index 108 | fruits.pop(1) 109 | print(fruits) # Output: ['apple'] 110 | ``` 111 | 112 | 113 | 114 | ### List Slicing 115 | 116 | Slicing allows you to access a subset of elements. 
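The general form is `my_list[start:stop:step]`; omitted values default to the beginning of the list, its end, and a step of 1. The basic forms appear in the example below — here is a short sketch of the step and negative-index variants:

```python
numbers = [1, 2, 3, 4, 5]

# Every second element
print(numbers[::2])    # Output: [1, 3, 5]

# Last three elements
print(numbers[-3:])    # Output: [3, 4, 5]

# Reversed copy of the list
print(numbers[::-1])   # Output: [5, 4, 3, 2, 1]
```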
117 | 118 | ```python 119 | numbers = [1, 2, 3, 4, 5] 120 | 121 | # Getting the first three elements 122 | print(numbers[:3]) # Output: [1, 2, 3] 123 | 124 | # Getting elements from index 2 to the end 125 | print(numbers[2:]) # Output: [3, 4, 5] 126 | ``` 127 | 128 | 129 | 130 | ### Common List Methods 131 | 132 | Here are some commonly used list methods: 133 | 134 | ```python 135 | numbers = [1, 2, 3] 136 | 137 | # Adding an element 138 | numbers.append(4) 139 | 140 | # Counting occurrences 141 | print(numbers.count(2)) # Output: 1 142 | 143 | # Sorting the list 144 | numbers.sort() 145 | print(numbers) # Output: [1, 2, 3, 4] 146 | ``` 147 | 148 | 149 | 150 | ## 2️⃣ Tuples in Python 🔗 151 | 152 | ### What is a Tuple? 153 | 154 | A **tuple** is an immutable (unchangeable) collection of ordered elements. Tuples are often used to group related data. 155 | 156 | 157 | 158 | ### Creating a Tuple 159 | 160 | ```python 161 | # Creating a tuple of strings 162 | fruits = ("apple", "banana", "cherry") 163 | ``` 164 | 165 | 166 | 167 | ### Accessing Tuple Elements 168 | 169 | Similar to lists, you can access tuple elements using indexing. 170 | 171 | ```python 172 | fruits = ("apple", "banana", "cherry") 173 | 174 | print(fruits[0]) # Output: apple 175 | ``` 176 | 177 | 178 | 179 | ### Immutability of Tuples 180 | 181 | Tuples cannot be changed after creation. Attempting to modify a tuple results in an error. 182 | 183 | ```python 184 | fruits = ("apple", "banana", "cherry") 185 | 186 | # This will raise an error 187 | fruits[1] = "orange" 188 | ``` 189 | 190 | 191 | 192 | ### Common Tuple Methods 193 | 194 | ```python 195 | fruits = ("apple", "banana", "cherry") 196 | 197 | # Getting the index of an element 198 | print(fruits.index("banana")) # Output: 1 199 | 200 | # Counting occurrences 201 | print(fruits.count("cherry")) # Output: 1 202 | ``` 203 | 204 | 205 | 206 | ## 3️⃣ Dictionaries in Python 📖 207 | 208 | ### What is a Dictionary? 209 | 210 | A **dictionary** is a mutable collection of key-value pairs. Keys must be unique and immutable, while values can be of any type. 211 | 212 | 213 | 214 | ### Creating a Dictionary 215 | 216 | ```python 217 | # Creating a dictionary 218 | person = { 219 | "name": "Alice", 220 | "age": 25, 221 | "city": "New York" 222 | } 223 | ``` 224 | 225 | 226 | 227 | ### Accessing Dictionary Values 228 | 229 | You can access values using keys. 230 | 231 | ```python 232 | person = {"name": "Alice", "age": 25} 233 | 234 | print(person["name"]) # Output: Alice 235 | ``` 236 | 237 | 238 | 239 | ### Adding and Updating Key-Value Pairs 240 | 241 | ```python 242 | person = {"name": "Alice", "age": 25} 243 | 244 | # Adding a new key-value pair 245 | person["city"] = "New York" 246 | 247 | # Updating an existing key 248 | person["age"] = 26 249 | ``` 250 | 251 | 252 | 253 | ### Removing Key-Value Pairs 254 | 255 | Use the `del` keyword or `pop()` method. 256 | 257 | ```python 258 | person = {"name": "Alice", "age": 25} 259 | 260 | # Removing a key-value pair 261 | del person["age"] 262 | 263 | # Using pop() 264 | person.pop("name") 265 | ``` 266 | 267 | 268 | 269 | ### Common Dictionary Methods 270 | 271 | ```python 272 | person = {"name": "Alice", "age": 25} 273 | 274 | # Getting all keys 275 | print(person.keys()) # Output: dict_keys(['name', 'age']) 276 | 277 | # Getting all values 278 | print(person.values()) # Output: dict_values(['Alice', 25]) 279 | ``` 280 | 281 | 282 | 283 | ## 🧠 Practice Exercises 284 | 285 | 1. 
Create a list of your favorite movies and print the last one using negative indexing. 286 | 2. Create a tuple of three numbers and calculate their sum. 287 | 3. Create a dictionary to store information about a book (title, author, year), and add the publisher's name. 288 | 289 | 290 | 291 | ## 🌟 Summary 292 | 293 | - **Lists** are mutable and ordered collections. 294 | - **Tuples** are immutable and ordered collections. 295 | - **Dictionaries** store data as key-value pairs and are mutable. 296 | 297 | --- 298 | 299 | 300 | -------------------------------------------------------------------------------- /20_Logistic Regression/20_Logistic Regression.md: -------------------------------------------------------------------------------- 1 | [<< Day 19](../19_Linear%20Regression/19_Linear%20Regression.md) | [Day 21 >>](../21_Clustering%20(K-Means)/21_Clustering%20(K-Means).md) 2 | 3 | 4 | # 📘 Day 20: Logistic Regression with scikit-learn 5 | 6 | Welcome to Day 20 of the **30 Days of Data Science** series! Today, we will focus on **Logistic Regression**, one of the most commonly used classification algorithms in machine learning. By the end of this guide, you will understand the basics of Logistic Regression and how to implement it in Python using the `scikit-learn` library. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 20: Logistic Regression with scikit-learn](#-day-20-logistic-regression-with-scikit-learn) 13 | - [📌 Topics Covered](#-topics-covered) 14 | - [1️⃣ What is Logistic Regression?](#1️⃣-what-is-logistic-regression) 15 | - [2️⃣ Logistic Regression vs Linear Regression](#2️⃣-logistic-regression-vs-linear-regression) 16 | - [3️⃣ Sigmoid Function](#3️⃣-sigmoid-function) 17 | - [4️⃣ Implementing Logistic Regression in scikit-learn](#4️⃣-implementing-logistic-regression-in-scikit-learn) 18 | - [Dataset Overview](#dataset-overview) 19 | - [Steps to Implement Logistic Regression](#steps-to-implement-logistic-regression) 20 | - [Code Example](#code-example) 21 | - [5️⃣ Evaluating the Model](#5️⃣-evaluating-the-model) 22 | - [Metrics](#metrics) 23 | - [Confusion Matrix](#confusion-matrix) 24 | - [ROC Curve and AUC](#roc-curve-and-auc) 25 | - [🧠 Practice Exercises](#-practice-exercises) 26 | - [🌟 Summary](#-summary) 27 | 28 | 29 | 30 | 31 | ## 📌 Topics Covered 32 | 33 | - **Understanding Logistic Regression**: Overview and applications. 34 | - **Mathematics Behind Logistic Regression**: Sigmoid function and decision boundaries. 35 | - **Implementing Logistic Regression in Python**: Using scikit-learn. 36 | - **Model Evaluation**: Metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. 37 | 38 | 39 | 40 | ## 1️⃣ What is Logistic Regression? 41 | 42 | Logistic Regression is a **supervised learning algorithm** used for binary classification problems. Despite its name, it is not a regression algorithm but a **classification technique** that predicts discrete outcomes (e.g., Yes/No, True/False, 0/1). 43 | 44 | **Applications**: 45 | - Predicting whether an email is spam or not. 46 | - Identifying if a transaction is fraudulent. 47 | - Classifying if a tumor is malignant or benign. 
48 | 49 | 50 | 51 | ## 2️⃣ Logistic Regression vs Linear Regression 52 | 53 | | Feature | Logistic Regression | Linear Regression | 54 | |-------------------------|-------------------------|-------------------------| 55 | | **Goal** | Classification | Regression (continuous output) | 56 | | **Output Range** | [0, 1] (probabilities) | (-∞, +∞) | 57 | | **Function** | Sigmoid Function | Linear Equation | 58 | 59 | 60 | 61 | ## 3️⃣ Sigmoid Function 62 | 63 | The **sigmoid function** maps any real-valued number into the range [0, 1], making it ideal for predicting probabilities. 64 | 65 | ### Formula: 66 | 67 | σ(z) = 1 / (1 + e^(-z)) 68 | 69 | Where: 70 | - **z = wᵀX + b** (linear combination of weights and inputs). 71 | 72 | ### Sigmoid Plot: 73 | 74 | - Input **z = 0**: Output is 0.5. 75 | - As **z → +∞**, Output approaches 1. 76 | - As **z → -∞**, Output approaches 0. 77 | 78 | 79 | 80 | ## 4️⃣ Implementing Logistic Regression in scikit-learn 81 | 82 | We will use the `scikit-learn` library to implement Logistic Regression. 83 | 84 | ### Dataset Overview 85 | 86 | We will use the **Breast Cancer dataset** from `scikit-learn`. It is a binary classification dataset where the goal is to classify tumors as malignant (0) or benign (1). 87 | 88 | 89 | 90 | ### Steps to Implement Logistic Regression 91 | 92 | 1. **Import Libraries**: 93 | - `numpy`, `pandas`, `matplotlib`, and `scikit-learn`. 94 | 2. **Load Dataset**: 95 | - Use `sklearn.datasets.load_breast_cancer`. 96 | 3. **Preprocess Data**: 97 | - Split data into training and testing sets. 98 | 4. **Train the Model**: 99 | - Use `LogisticRegression` from `sklearn.linear_model`. 100 | 5. **Evaluate the Model**: 101 | - Use metrics like accuracy and confusion matrix. 102 | 103 | 104 | 105 | ### Code Example 106 | 107 | ```python 108 | # Importing necessary libraries 109 | import numpy as np 110 | import pandas as pd 111 | from sklearn.datasets import load_breast_cancer 112 | from sklearn.model_selection import train_test_split 113 | from sklearn.linear_model import LogisticRegression 114 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 115 | 116 | # Load the dataset 117 | data = load_breast_cancer() 118 | X = data.data 119 | y = data.target 120 | 121 | # Split into training and testing datasets 122 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 123 | 124 | # Initialize the model 125 | model = LogisticRegression(max_iter=10000) 126 | 127 | # Train the model 128 | model.fit(X_train, y_train) 129 | 130 | # Make predictions 131 | y_pred = model.predict(X_test) 132 | 133 | # Evaluate the model 134 | accuracy = accuracy_score(y_test, y_pred) 135 | print(f"Accuracy: {accuracy * 100:.2f}%") 136 | print("Classification Report:") 137 | print(classification_report(y_test, y_pred)) 138 | ``` 139 | 140 | **Output**: 141 | 142 | ```plaintext 143 | Accuracy: 95.32% 144 | Classification Report: 145 | precision recall f1-score support 146 | 147 | 0 0.96 0.94 0.95 63 148 | 1 0.95 0.97 0.96 108 149 | 150 | accuracy 0.95 171 151 | macro avg 0.95 0.95 0.95 171 152 | weighted avg 0.95 0.95 0.95 171 153 | ``` 154 | 155 | 156 | 157 | ## 5️⃣ Evaluating the Model 158 | 159 | ### Metrics 160 | 161 | - **Accuracy**: Proportion of correctly classified samples. 162 | - **Precision**: Ratio of correctly predicted positive observations to total predicted positives. 163 | - **Recall**: Ratio of correctly predicted positives to all actual positives. 
164 | - **F1-Score**: Harmonic mean of precision and recall. 165 | 166 | 167 | 168 | ### Confusion Matrix 169 | 170 | A **confusion matrix** shows the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). 171 | 172 | ```python 173 | # Confusion Matrix 174 | import seaborn as sns 175 | import matplotlib.pyplot as plt 176 | 177 | cm = confusion_matrix(y_test, y_pred) 178 | sns.heatmap(cm, annot=True, fmt="d", cmap="Blues") 179 | plt.xlabel("Predicted") 180 | plt.ylabel("Actual") 181 | plt.title("Confusion Matrix") 182 | plt.show() 183 | ``` 184 | 185 | 186 | 187 | ### ROC Curve and AUC 188 | 189 | The **ROC curve** is a plot of True Positive Rate vs False Positive Rate, and **AUC** measures the area under this curve. 190 | 191 | ```python 192 | from sklearn.metrics import roc_curve, roc_auc_score 193 | 194 | y_proba = model.predict_proba(X_test)[:, 1] 195 | fpr, tpr, _ = roc_curve(y_test, y_proba) 196 | auc = roc_auc_score(y_test, y_proba) 197 | 198 | plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}") 199 | plt.xlabel("False Positive Rate") 200 | plt.ylabel("True Positive Rate") 201 | plt.title("ROC Curve") 202 | plt.legend() 203 | plt.show() 204 | ``` 205 | 206 | 207 | 208 | ## 🧠 Practice Exercises 209 | 210 | 1. Train a Logistic Regression model on the **Iris dataset** for binary classification. 211 | 2. Experiment with the **`C` parameter** in `LogisticRegression` to observe its effect on regularization. 212 | 3. Evaluate the model using **Precision-Recall curves**. 213 | 214 | 215 | 216 | ## 🌟 Summary 217 | 218 | - Logistic Regression is a classification algorithm suitable for binary outcomes. 219 | - The sigmoid function maps linear predictions to probabilities. 220 | - scikit-learn provides a straightforward implementation of Logistic Regression. 221 | - Evaluation metrics and visualizations help assess the model's performance. 222 | 223 | --- 224 | 225 | 226 | -------------------------------------------------------------------------------- /28_Time Series Forecasting/28_Time Series Forecasting.md: -------------------------------------------------------------------------------- 1 | [<< Day 27](../27_Natural%20Language%20Processing%20(NLP)/27_Natural%20Language%20Processing%20(NLP).md) | [Day 29 >>](../29_Working%20with%20Big%20Data/29_Working%20with%20Big%20Data.md) 2 | 3 | 4 | # 📅 Day 28: Time Series Forecasting ⏳📊 5 | 6 | Welcome to **Day 28** of the **30 Days of Data Science** series! 🎉 Today, we will explore **Time Series Forecasting**, one of the most critical techniques in data science used for analyzing sequential data over time. We'll cover key concepts and popular models like **ARIMA** and **Prophet**. By the end of this lesson, you’ll have the tools to forecast future trends and patterns effectively. 
7 | 8 | 9 | 10 | ## 🌟 Table of Contents 11 | - [📚 Introduction to Time Series Forecasting](#-introduction-to-time-series-forecasting) 12 | - [📈 Understanding Time Series Data](#-understanding-time-series-data) 13 | - [Components of Time Series](#components-of-time-series) 14 | - [🔮 ARIMA (AutoRegressive Integrated Moving Average)](#-arima-autoregressive-integrated-moving-average) 15 | - [Steps for ARIMA Modeling](#steps-for-arima-modeling) 16 | - [Python Example of ARIMA](#python-example-of-arima) 17 | - [📜 Seasonal Decomposition of Time Series (STL)](#-seasonal-decomposition-of-time-series-stl) 18 | - [📦 SARIMA (Seasonal ARIMA)](#-sarima-seasonal-arima) 19 | - [🌍 Prophet: Time Series Forecasting Made Easy](#-prophet-time-series-forecasting-made-easy) 20 | - [Features of Prophet](#features-of-prophet) 21 | - [Python Example of Prophet](#python-example-of-prophet) 22 | - [🧠 LSTM (Long Short-Term Memory Networks)](#-lstm-long-short-term-memory-networks) 23 | - [✍️ Practice Exercise](#%EF%B8%8F-practice-exercise) 24 | - [📝 Summary](#-summary) 25 | 26 | 27 | 28 | ## 📚 Introduction to Time Series Forecasting 29 | 30 | Time series forecasting predicts future values based on previously observed data. It is widely used in areas like: 31 | - **Finance**: Stock price prediction 📈 32 | - **Weather Forecasting**: Temperature and rainfall prediction 🌧️ 33 | - **Retail**: Sales forecasting 🛒 34 | 35 | Forecasting allows businesses and researchers to plan effectively and make informed decisions. 36 | 37 | 38 | 39 | ## 📈 Understanding Time Series Data 40 | 41 | A **time series** is a sequence of data points collected or recorded at regular time intervals. 42 | 43 | ### Components of Time Series 44 | 1. **Trend**: Overall upward or downward movement over time. 45 | 2. **Seasonality**: Regular patterns that repeat over a fixed period. 46 | 3. **Cyclic Patterns**: Long-term fluctuations not tied to seasonality. 47 | 4. **Noise**: Random variations or outliers in data. 48 | 49 | ### Example: Time Series Plot 50 | ```python 51 | import pandas as pd 52 | import matplotlib.pyplot as plt 53 | 54 | # Sample data 55 | data = { 56 | 'Date': pd.date_range(start='2023-01-01', periods=12, freq='M'), 57 | 'Sales': [200, 220, 250, 270, 300, 350, 400, 420, 450, 470, 500, 550] 58 | } 59 | df = pd.DataFrame(data) 60 | 61 | # Plot 62 | plt.plot(df['Date'], df['Sales'], marker='o', linestyle='-') 63 | plt.title("Monthly Sales Data") 64 | plt.xlabel("Date") 65 | plt.ylabel("Sales") 66 | plt.grid() 67 | plt.show() 68 | ``` 69 | 70 | 71 | 72 | ## 🔮 ARIMA (AutoRegressive Integrated Moving Average) 73 | 74 | ### What is ARIMA? 75 | ARIMA is a statistical modeling technique for analyzing and forecasting time series data. It combines three components: 76 | - **AR (AutoRegressive)**: Uses past values. 77 | - **I (Integrated)**: Differencing the data to make it stationary. 78 | - **MA (Moving Average)**: Uses past forecast errors. 79 | 80 | ### Steps for ARIMA Modeling 81 | 1. **Visualize the Data**: Plot the series and check for trends, seasonality, and stationarity. 82 | 2. **Stationarity Test**: Use tests like the Augmented Dickey-Fuller (ADF) test. 83 | 3. **Differencing**: Transform non-stationary data to stationary. 84 | 4. **Parameter Selection**: Use `p`, `d`, `q` to define the ARIMA model. 85 | 5. **Model Training**: Fit the ARIMA model to your data. 86 | 6. **Forecasting**: Predict future values. 
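Step 2 above mentions the Augmented Dickey-Fuller (ADF) test, which the example below does not show. Here is a minimal, self-contained sketch using `statsmodels` (the sample series is purely illustrative):

```python
from statsmodels.tsa.stattools import adfuller

# Illustrative monthly values; in practice, pass your own series
series = [112, 118, 132, 129, 121, 135, 148, 145, 140, 155, 164, 170]

# Null hypothesis of the ADF test: the series has a unit root (i.e., it is non-stationary)
adf_stat, p_value, *_ = adfuller(series, maxlag=2)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")

# A large p-value (e.g., > 0.05) suggests differencing the series (the "I" in ARIMA) before fitting
```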
87 | 88 | ### Python Example of ARIMA 89 | ```python 90 | from statsmodels.tsa.arima.model import ARIMA 91 | import pandas as pd 92 | import matplotlib.pyplot as plt 93 | 94 | # Example data 95 | data = [112, 118, 132, 129, 121, 135, 148, 145, 140, 155, 164, 170] 96 | df = pd.DataFrame(data, columns=['Sales']) 97 | 98 | # Fit ARIMA model 99 | model = ARIMA(df['Sales'], order=(1, 1, 1)) 100 | model_fit = model.fit() 101 | 102 | # Summary of the model 103 | print(model_fit.summary()) 104 | 105 | # Forecast future values 106 | forecast = model_fit.forecast(steps=5) 107 | print("Forecasted Values:", forecast) 108 | ``` 109 | 110 | 111 | ## 📜 Seasonal Decomposition of Time Series (STL) 112 | 113 | Seasonal Decomposition of Time Series (STL) splits the data into **trend**, **seasonal**, and **residual** components. 114 | 115 | ### Example: STL Decomposition 116 | 117 | ```python 118 | from statsmodels.tsa.seasonal import STL 119 | import pandas as pd 120 | import matplotlib.pyplot as plt 121 | 122 | # Sample time series data 123 | data = [112, 118, 132, 129, 121, 135, 148, 136, 119, 104, 118, 115] 124 | df = pd.DataFrame(data, columns=['value']) 125 | 126 | # STL decomposition 127 | stl = STL(df['value'], period=12) 128 | result = stl.fit() 129 | 130 | # Plot components 131 | result.plot() 132 | plt.show() 133 | ``` 134 | 135 | 136 | 137 | ## 📦 SARIMA (Seasonal ARIMA) 138 | 139 | **SARIMA** extends ARIMA by incorporating seasonality. 140 | 141 | The model is defined by parameters `(p, d, q) x (P, D, Q, s)` where: 142 | - `(p, d, q)` are ARIMA parameters. 143 | - `(P, D, Q, s)` are seasonal parameters. 144 | 145 | ### Example: SARIMA 146 | 147 | ```python 148 | from statsmodels.tsa.statespace.sarimax import SARIMAX 149 | 150 | # Fit SARIMA model 151 | model = SARIMAX(df['value'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)) 152 | sarima_result = model.fit() 153 | 154 | # Forecast 155 | forecast = sarima_result.forecast(steps=5) 156 | print("SARIMA Forecast:", forecast) 157 | ``` 158 | 159 | 160 | 161 | ## 🌍 Prophet: Time Series Forecasting Made Easy 162 | 163 | ### What is Prophet? 164 | Prophet is an open-source library developed by Facebook for time series forecasting. It is highly flexible, easy to use, and handles missing data, holidays, and seasonal patterns effectively. 165 | 166 | ### Features of Prophet 167 | - Handles **seasonality** and **holiday effects**. 168 | - Robust to **missing data**. 169 | - Requires minimal tuning. 170 | 171 | ### Python Example of Prophet 172 | ```python 173 | from prophet import Prophet 174 | import pandas as pd 175 | import matplotlib.pyplot as plt 176 | 177 | # Create example data 178 | data = { 179 | 'ds': pd.date_range(start='2023-01-01', periods=12, freq='M'), 180 | 'y': [200, 220, 250, 270, 300, 350, 400, 420, 450, 470, 500, 550] 181 | } 182 | df = pd.DataFrame(data) 183 | 184 | # Fit Prophet model 185 | model = Prophet() 186 | model.fit(df) 187 | 188 | # Create future dataframe 189 | future = model.make_future_dataframe(periods=6, freq='M') 190 | 191 | # Forecast 192 | forecast = model.predict(future) 193 | 194 | # Plot results 195 | fig = model.plot(forecast) 196 | plt.show() 197 | ``` 198 | 199 | 200 | ## 🧠 LSTM (Long Short-Term Memory Networks) 201 | 202 | LSTMs are a type of **recurrent neural network (RNN)** capable of learning long-term dependencies. 
203 | 204 | ### Example: LSTM Model 205 | 206 | ```python 207 | import numpy as np 208 | from keras.models import Sequential 209 | from keras.layers import LSTM, Dense 210 | 211 | # Sample data 212 | data = np.array([112, 118, 132, 129, 121, 135, 148, 136, 119, 104, 118, 115]) 213 | X = np.array([data[i:i + 3] for i in range(len(data) - 3)]).reshape(-1, 3, 1) # Features: sliding windows of the 3 previous values 214 | y = data[3:] # Labels: the value that follows each window 215 | 216 | # Define LSTM model 217 | model = Sequential() 218 | model.add(LSTM(50, activation='relu', input_shape=(3, 1))) 219 | model.add(Dense(1)) 220 | model.compile(optimizer='adam', loss='mse') 221 | 222 | # Train 223 | model.fit(X, y, epochs=200, verbose=0) 224 | 225 | # Forecast the next value from the most recent window of 3 observations 226 | forecast = model.predict(data[-3:].reshape(1, 3, 1)) 227 | print("LSTM Forecast:", forecast) 228 | ``` 229 | 230 | 231 | 232 | ## ✍️ Practice Exercise 233 | 234 | Try the following: 235 | 1. Load a **time series dataset** of your choice (e.g., stock prices, weather data). 236 | 2. Preprocess the data to handle missing values. 237 | 3. Train an **ARIMA** model and forecast future values. 238 | 4. Compare the performance of **ARIMA** and **Prophet** on the same dataset. 239 | 240 | 241 | 242 | ## 📝 Summary 243 | 244 | In this lesson, we covered the fundamentals of **Time Series Forecasting**, explored **ARIMA**, and demonstrated the use of **Prophet** for efficient predictions. Forecasting is a powerful tool for uncovering trends and patterns in sequential data. Mastering these techniques will empower you to tackle real-world problems in diverse domains. 245 | 246 | --- 247 | 248 | 249 | 250 | -------------------------------------------------------------------------------- /27_Natural Language Processing (NLP)/27_Natural Language Processing (NLP).md: -------------------------------------------------------------------------------- 1 | [<< Day 26](../26_Advanced%20ML%3A%20Hyperparameter%20Tuning/26_Advanced%20ML%3A%20Hyperparameter%20Tuning.md) | [Day 28 >>](../28_Time%20Series%20Forecasting/28_Time%20Series%20Forecasting.md) 2 | 3 | 4 | # 🌟 Day 27: Natural Language Processing (NLP) 5 | 6 | Welcome to **Day 27** of the 30 Days of Data Science series! Today, we dive into the fascinating world of **Natural Language Processing (NLP)**. NLP bridges the gap between human language and computers, allowing machines to understand, process, and generate human text. By the end of this lesson, you will have a solid understanding of the following key topics: 7 | 8 | - **NLTK** 9 | - **spaCy** 10 | - **Hugging Face** 11 | - **Topic Modeling with Gensim** 12 | - **Text Summarization** 13 | - **Word Embeddings with Word2Vec and GloVe** 14 | 15 | 16 | 17 | ## 📖 Table of Contents 18 | 19 | - [🌟 Day 27: Natural Language Processing (NLP)](#-day-27-natural-language-processing-nlp) 20 | - [📖 Table of Contents](#-table-of-contents) 21 | - [🔍 What is Natural Language Processing?](#-what-is-natural-language-processing) 22 | - [📚 NLTK: The Natural Language Toolkit](#-nltk-the-natural-language-toolkit) 23 | - [1. Tokenization](#1-tokenization) 24 | - [2. Stopword Removal](#2-stopword-removal) 25 | - [3. Stemming and Lemmatization](#3-stemming-and-lemmatization) 26 | - [💡 spaCy: Industrial-Strength NLP](#-spacy-industrial-strength-nlp) 27 | - [1. Named Entity Recognition (NER)](#1-named-entity-recognition-ner) 28 | - [2. Part-of-Speech (POS) Tagging](#2-part-of-speech-pos-tagging) 29 | - [🤗 Hugging Face: Transformers for NLP](#-hugging-face-transformers-for-nlp) 30 | - [1. Sentiment Analysis](#1-sentiment-analysis) 31 | - [2. 
Text Generation](#2-text-generation) 32 | - [🧠 Topic Modeling with Gensim](#-topic-modeling-with-gensim) 33 | - [📝 Text Summarization](#-text-summarization) 34 | - [📖 Word Embeddings with Word2Vec and GloVe](#-word-embeddings-with-word2vec-and-glove) 35 | - [📓 Practice Exercises](#-practice-exercises) 36 | - [📜 Summary](#-summary) 37 | 38 | 39 | 40 | 41 | ## 🔍 What is Natural Language Processing? 42 | 43 | **Natural Language Processing (NLP)** is a field within Artificial Intelligence that focuses on enabling machines to understand and interact with human language. It has wide applications, including: 44 | 45 | - **Text Classification**: Spam detection, sentiment analysis. 46 | - **Machine Translation**: Translating text between languages (e.g., Google Translate). 47 | - **Named Entity Recognition (NER)**: Identifying entities like names, dates, and locations in text. 48 | - **Question Answering**: Building systems like ChatGPT. 49 | 50 | 51 | 52 | ## 📚 NLTK: The Natural Language Toolkit 53 | 54 | **NLTK** is a powerful Python library for working with text data. It provides tools for tokenization, stemming, lemmatization, and more. Let’s explore some common functionalities. 55 | 56 | ### 1. Tokenization 57 | 58 | Tokenization is the process of breaking text into smaller components, such as words or sentences. 59 | 60 | ```python 61 | import nltk 62 | from nltk.tokenize import word_tokenize, sent_tokenize 63 | 64 | nltk.download('punkt') 65 | 66 | text = "Natural Language Processing is fascinating. Let's learn more!" 67 | 68 | # Word Tokenization 69 | words = word_tokenize(text) 70 | print("Word Tokens:", words) 71 | 72 | # Sentence Tokenization 73 | sentences = sent_tokenize(text) 74 | print("Sentence Tokens:", sentences) 75 | ``` 76 | 77 | ### 2. Stopword Removal 78 | 79 | Stopwords are common words (e.g., "is", "the") that are often removed in text preprocessing. 80 | 81 | ```python 82 | from nltk.corpus import stopwords 83 | 84 | nltk.download('stopwords') 85 | 86 | stop_words = set(stopwords.words('english')) 87 | filtered_words = [word for word in words if word.lower() not in stop_words] 88 | 89 | print("Filtered Words:", filtered_words) 90 | ``` 91 | 92 | ### 3. Stemming and Lemmatization 93 | 94 | - **Stemming** reduces words to their root form (e.g., "running" -> "run"). 95 | - **Lemmatization** maps words to their base dictionary form (e.g., "better" -> "good"). 96 | 97 | ```python 98 | from nltk.stem import PorterStemmer 99 | from nltk.stem import WordNetLemmatizer 100 | 101 | nltk.download('wordnet') 102 | 103 | stemmer = PorterStemmer() 104 | lemmatizer = WordNetLemmatizer() 105 | 106 | word = "running" 107 | print("Stemmed:", stemmer.stem(word)) 108 | print("Lemmatized:", lemmatizer.lemmatize(word, pos='v')) 109 | ``` 110 | 111 | 112 | 113 | ## 💡 spaCy: Industrial-Strength NLP 114 | 115 | **spaCy** is an efficient library designed for large-scale NLP tasks. It supports features like Named Entity Recognition (NER), Part-of-Speech tagging, and dependency parsing. 116 | 117 | ### 1. Named Entity Recognition (NER) 118 | 119 | NER identifies entities such as names, dates, and locations in text. 120 | 121 | ```python 122 | import spacy 123 | 124 | nlp = spacy.load("en_core_web_sm") 125 | text = "Apple was founded by Steve Jobs in Cupertino, California." 126 | 127 | doc = nlp(text) 128 | for ent in doc.ents: 129 | print(ent.text, ent.label_) 130 | ``` 131 | 132 | ### 2. 
Part-of-Speech (POS) Tagging 133 | 134 | POS tagging assigns grammatical tags (e.g., noun, verb) to words in a sentence. 135 | 136 | ```python 137 | for token in doc: 138 | print(token.text, token.pos_) 139 | ``` 140 | 141 | 142 | 143 | ## 🤗 Hugging Face: Transformers for NLP 144 | 145 | Hugging Face provides state-of-the-art NLP models, including BERT and GPT, through the `transformers` library. 146 | 147 | ### 1. Sentiment Analysis 148 | 149 | Use a pre-trained model to classify the sentiment of a given text. 150 | 151 | ```python 152 | from transformers import pipeline 153 | 154 | classifier = pipeline("sentiment-analysis") 155 | result = classifier("I love learning NLP!") 156 | print(result) 157 | ``` 158 | 159 | ### 2. Text Generation 160 | 161 | Generate text using a language model like GPT-2. 162 | 163 | ```python 164 | from transformers import pipeline 165 | 166 | generator = pipeline("text-generation", model="gpt2") 167 | result = generator("Natural Language Processing is", max_length=30, num_return_sequences=1) 168 | print(result[0]['generated_text']) 169 | ``` 170 | 171 | 172 | 173 | ## 🧠 Topic Modeling with Gensim 174 | 175 | **Topic Modeling** is the task of identifying abstract topics within a collection of documents. The `gensim` library provides tools for Latent Dirichlet Allocation (LDA), a popular topic modeling technique. 176 | 177 | ```python 178 | from gensim import corpora, models 179 | from gensim.models.ldamodel import LdaModel 180 | 181 | # Sample data 182 | documents = ["I love data science", "Data science is the future", "NLP is fascinating"] 183 | 184 | # Preprocessing 185 | tokenized_docs = [doc.split() for doc in documents] 186 | dictionary = corpora.Dictionary(tokenized_docs) 187 | corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs] 188 | 189 | # LDA Model 190 | lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary) 191 | for idx, topic in lda_model.print_topics(-1): 192 | print(f"Topic {idx}: {topic}") 193 | ``` 194 | 195 | 196 | 197 | ## 📝 Text Summarization 198 | 199 | Text summarization condenses a large text into a shorter version while retaining the main points. You can use Hugging Face for extractive summarization. 200 | 201 | ```python 202 | from transformers import pipeline 203 | 204 | summarizer = pipeline("summarization") 205 | text = "Natural Language Processing has a variety of applications, including text summarization. Summarization aims to condense long texts." 206 | 207 | summary = summarizer(text, max_length=50, min_length=25, do_sample=False) 208 | print(summary[0]['summary_text']) 209 | ``` 210 | 211 | 212 | 213 | ## 📖 Word Embeddings with Word2Vec and GloVe 214 | 215 | Word embeddings are dense vector representations of words. Libraries like `gensim` support Word2Vec, while pre-trained GloVe embeddings are available for direct use. 216 | 217 | ### Word2Vec Example 218 | 219 | ```python 220 | from gensim.models import Word2Vec 221 | 222 | sentences = [["I", "love", "NLP"], ["Word2Vec", "is", "useful"]] 223 | model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4) 224 | 225 | # Get vector for a word 226 | vector = model.wv["NLP"] 227 | print(vector) 228 | ``` 229 | 230 | 231 | 232 | ## 📓 Practice Exercises 233 | 234 | 1. Tokenize the following text using NLTK: 235 | ``` 236 | "The quick brown fox jumps over the lazy dog." 237 | ``` 238 | Count the number of tokens and remove stopwords. 239 | 240 | 2. 
Use spaCy to extract entities from: 241 | ``` 242 | "Tesla's stock price soared after Elon Musk's announcement in 2023." 243 | ``` 244 | 245 | 3. Use Hugging Face's sentiment analysis pipeline to analyze: 246 | ``` 247 | "The movie was a masterpiece, but the ending was disappointing." 248 | ``` 249 | 250 | 4. Generate a short text completion starting with: 251 | ``` 252 | "Data Science is the future of" 253 | ``` 254 | 255 | 5. Use Gensim's LDA model to find topics from the following documents: 256 | ``` 257 | ["Artificial intelligence is transforming industries", "Machine learning is a subset of AI", "NLP is a key AI application"] 258 | ``` 259 | 260 | 6. Summarize the following text: 261 | ``` 262 | "Machine learning is a branch of artificial intelligence that focuses on building systems capable of learning and improving from experience without being explicitly programmed." 263 | ``` 264 | 265 | 266 | 267 | ## 📜 Summary 268 | 269 | Today, you learned about the foundational tools and techniques for NLP: 270 | 271 | - **NLTK**: Preprocessing text with tokenization, stopword removal, and lemmatization. 272 | - **spaCy**: Performing advanced tasks like NER and POS tagging. 273 | - **Hugging Face**: Leveraging pre-trained models for sentiment analysis and text generation. 274 | - **Gensim**: Topic modeling with LDA. 275 | - **Summarization**: Condensing text into shorter forms. 276 | - **Word Embeddings**: Representing words as dense vectors with Word2Vec and GloVe. 277 | 278 | NLP is a powerful field with applications in numerous domains. Keep practicing and explore these libraries further to master them! 279 | 280 | --- 281 | 282 | 283 | -------------------------------------------------------------------------------- /26_Advanced ML: Hyperparameter Tuning/26_Advanced ML: Hyperparameter Tuning.md: -------------------------------------------------------------------------------- 1 | [<< Day 25](../25_Model%20Evaluation%20and%20Metrics/25_Model%20Evaluation%20and%20Metrics.md) | [Day 27 >>](../27_Natural%20Language%20Processing%20(NLP)/27_Natural%20Language%20Processing%20(NLP).md) 2 | 3 | # 🚀 **Day 26: Advanced ML - Hyperparameter Tuning** 4 | 5 | Welcome to **Day 26** of the **30 Days of Data Science** series! 🎉 Today, we delve into **Hyperparameter Tuning**, focusing on two powerful techniques: **GridSearchCV** and **RandomizedSearchCV**. Additionally, we will explore advanced topics like **Bayesian Optimization**, **Optuna**, and hyperparameter tuning for neural networks. These methods are essential for improving model performance and selecting the best parameters for machine learning models. Let's dive in! 
🔍 6 | 7 | 8 | 9 | ## 📚 **Table of Contents** 10 | 11 | - [📚 Introduction to Hyperparameter Tuning](#-introduction-to-hyperparameter-tuning) 12 | - [⚙️ GridSearchCV](#️-gridsearchcv) 13 | - [Advantages](#advantages) 14 | - [Disadvantages](#disadvantages) 15 | - [Implementation Example](#implementation-example) 16 | - [🎲 RandomizedSearchCV](#-randomizedsearchcv) 17 | - [Advantages](#advantages-1) 18 | - [Disadvantages](#disadvantages-1) 19 | - [Implementation Example](#implementation-example-1) 20 | - [🌟 Bayesian Optimization](#-bayesian-optimization) 21 | - [Implementation Example](#implementation-example-2) 22 | - [🌟 Optuna](#-optuna) 23 | - [Implementation Example](#implementation-example-3) 24 | - [🌟 Hyperparameter Tuning for Neural Networks](#-hyperparameter-tuning-for-neural-networks) 25 | - [Example with Keras Tuner](#example-with-keras-tuner) 26 | - [💡 Best Practices](#-best-practices) 27 | - [🛠️ Practice Exercise](#️-practice-exercise) 28 | - [📜 Summary](#-summary) 29 | 30 | 31 | 32 | ## 📚 **Introduction to Hyperparameter Tuning** 33 | 34 | Hyperparameters are parameters that are not learned from the data during the training process but are instead set manually before the training begins. Examples include the learning rate, number of estimators, or maximum depth in a decision tree. 35 | 36 | **Why Tune Hyperparameters?** 37 | 38 | - Improves model performance 🎯 39 | - Prevents overfitting or underfitting 🔧 40 | - Helps in identifying the best configuration for your model 🏆 41 | 42 | 43 | 44 | ## ⚙️ **GridSearchCV** 45 | 46 | GridSearchCV is an exhaustive search technique that evaluates all possible combinations of hyperparameter values. 47 | 48 | ### Advantages 49 | - Guarantees finding the best combination of parameters 🎯 50 | - Straightforward to implement 🛠️ 51 | 52 | ### Disadvantages 53 | - Computationally expensive ⏳ 54 | - May not be feasible with large datasets or too many parameters ⚠️ 55 | 56 | ### Implementation Example 57 | 58 | Here's how to use GridSearchCV in Python: 59 | 60 | ```python 61 | from sklearn.model_selection import GridSearchCV 62 | from sklearn.ensemble import RandomForestClassifier 63 | from sklearn.datasets import load_iris 64 | from sklearn.model_selection import train_test_split 65 | 66 | # Load dataset 67 | iris = load_iris() 68 | X, y = iris.data, iris.target 69 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 70 | 71 | # Define model and parameter grid 72 | model = RandomForestClassifier(random_state=42) 73 | param_grid = { 74 | 'n_estimators': [10, 50, 100], 75 | 'max_depth': [None, 10, 20, 30], 76 | 'min_samples_split': [2, 5, 10] 77 | } 78 | 79 | # Perform GridSearchCV 80 | grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy') 81 | grid_search.fit(X_train, y_train) 82 | 83 | # Best parameters and accuracy 84 | print("Best Parameters:", grid_search.best_params_) 85 | print("Best Score:", grid_search.best_score_) 86 | ``` 87 | 88 | 89 | 90 | ## 🎲 **RandomizedSearchCV** 91 | 92 | RandomizedSearchCV searches a subset of the hyperparameter space by sampling a fixed number of parameter combinations. 
93 | 94 | ### Advantages 95 | - Faster than GridSearchCV 🚀 96 | - Can provide similar results with fewer computations 🧠 97 | 98 | ### Disadvantages 99 | - May not explore all possible parameter combinations ⚠️ 100 | - Results may vary depending on random sampling 🎲 101 | 102 | ### Implementation Example 103 | 104 | Here's how to use RandomizedSearchCV in Python: 105 | 106 | ```python 107 | from sklearn.model_selection import RandomizedSearchCV 108 | from sklearn.ensemble import RandomForestClassifier 109 | from sklearn.datasets import load_iris 110 | from sklearn.model_selection import train_test_split 111 | from scipy.stats import randint 112 | 113 | # Load dataset 114 | iris = load_iris() 115 | X, y = iris.data, iris.target 116 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 117 | 118 | # Define model and parameter distributions 119 | model = RandomForestClassifier(random_state=42) 120 | param_distributions = { 121 | 'n_estimators': randint(10, 200), 122 | 'max_depth': [None, 10, 20, 30], 123 | 'min_samples_split': randint(2, 20) 124 | } 125 | 126 | # Perform RandomizedSearchCV 127 | random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, n_iter=50, cv=5, scoring='accuracy', random_state=42) 128 | random_search.fit(X_train, y_train) 129 | 130 | # Best parameters and accuracy 131 | print("Best Parameters:", random_search.best_params_) 132 | print("Best Score:", random_search.best_score_) 133 | ``` 134 | 135 | 136 | 137 | ## 🌟 **Bayesian Optimization** 138 | 139 | Bayesian Optimization is an advanced method for hyperparameter tuning that uses probabilistic models to estimate the performance of different hyperparameter settings. It is especially useful when the search space is vast, and evaluations are expensive. 140 | 141 | ### Implementation Example 142 | 143 | ```python 144 | from skopt import BayesSearchCV 145 | from sklearn.ensemble import RandomForestClassifier 146 | from sklearn.datasets import load_iris 147 | from sklearn.model_selection import train_test_split 148 | 149 | # Load dataset 150 | iris = load_iris() 151 | X, y = iris.data, iris.target 152 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 153 | 154 | # Define model and parameter search space 155 | model = RandomForestClassifier(random_state=42) 156 | search_space = { 157 | 'n_estimators': (10, 200), 158 | 'max_depth': (1, 30), 159 | 'min_samples_split': (2, 20) 160 | } 161 | 162 | # Bayesian Optimization with skopt 163 | bayes_search = BayesSearchCV(estimator=model, search_spaces=search_space, n_iter=50, cv=5, scoring='accuracy', random_state=42) 164 | bayes_search.fit(X_train, y_train) 165 | 166 | # Best parameters and accuracy 167 | print("Best Parameters:", bayes_search.best_params_) 168 | print("Best Score:", bayes_search.best_score_) 169 | ``` 170 | 171 | 172 | 173 | ## 🌟 **Optuna** 174 | 175 | Optuna is an open-source library designed for hyperparameter optimization. It features an automatic search space pruning mechanism that speeds up the optimization process. 
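The pruning mentioned above refers to stopping unpromising trials early; it is not used in the basic example that follows. Here is a separate minimal sketch of the pattern, using a toy objective rather than a real model:

```python
import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    # Stand-in for an iterative training loop: report an intermediate score each "epoch"
    for step in range(10):
        intermediate_score = -((x - 2) ** 2) - (10 - step)
        trial.report(intermediate_score, step)
        # Ask the pruner whether this trial looks unpromising compared to earlier ones
        if trial.should_prune():
            raise optuna.TrialPruned()
    return -((x - 2) ** 2)

# MedianPruner stops trials whose intermediate scores fall below the median of completed trials
study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner(n_warmup_steps=3))
study.optimize(objective, n_trials=20)
print("Best parameters:", study.best_params)
```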
176 | 177 | ### Implementation Example 178 | 179 | ```python 180 | import optuna 181 | from sklearn.ensemble import RandomForestClassifier 182 | from sklearn.datasets import load_iris 183 | from sklearn.model_selection import train_test_split, cross_val_score 184 | 185 | # Load dataset 186 | iris = load_iris() 187 | X, y = iris.data, iris.target 188 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 189 | 190 | # Define objective function 191 | def objective(trial): 192 | n_estimators = trial.suggest_int('n_estimators', 10, 200) 193 | max_depth = trial.suggest_int('max_depth', 1, 30) 194 | min_samples_split = trial.suggest_int('min_samples_split', 2, 20) 195 | 196 | model = RandomForestClassifier( 197 | n_estimators=n_estimators, 198 | max_depth=max_depth, 199 | min_samples_split=min_samples_split, 200 | random_state=42 201 | ) 202 | return cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean() 203 | 204 | # Hyperparameter optimization with Optuna 205 | study = optuna.create_study(direction='maximize') 206 | study.optimize(objective, n_trials=50) 207 | 208 | # Best parameters and accuracy 209 | print("Best Parameters:", study.best_params) 210 | print("Best Score:", study.best_value) 211 | ``` 212 | 213 | 214 | 215 | ## 🌟 **Hyperparameter Tuning for Neural Networks** 216 | 217 | Tuning hyperparameters for neural networks often involves searching for the best combination of learning rates, optimizers, batch sizes, and number of layers. 218 | 219 | ### Example with Keras Tuner 220 | 221 | ```python 222 | from tensorflow import keras 223 | from keras_tuner import RandomSearch 224 | 225 | # Define the model 226 | def build_model(hp): 227 | model = keras.Sequential() 228 | model.add(keras.layers.Dense(units=hp.Int('units', min_value=32, max_value=512, step=32), activation='relu')) 229 | model.add(keras.layers.Dense(3, activation='softmax')) 230 | model.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])), 231 | loss='sparse_categorical_crossentropy', 232 | metrics=['accuracy']) 233 | return model 234 | 235 | # Load dataset 236 | iris = load_iris() 237 | X, y = iris.data, iris.target 238 | 239 | # Hyperparameter optimization with Keras Tuner 240 | tuner = RandomSearch( 241 | build_model, 242 | objective='val_accuracy', 243 | max_trials=5, 244 | executions_per_trial=3, 245 | directory='my_dir', 246 | project_name='intro_to_kt' 247 | ) 248 | 249 | tuner.search(X, y, epochs=10, validation_split=0.2) 250 | ``` 251 | 252 | 253 | 254 | ## 💡 **Best Practices** 255 | 256 | 1. **Start with RandomizedSearchCV** for quick insights. 257 | 2. Use **GridSearchCV** after narrowing down the hyperparameter space. 258 | 3. Utilize techniques like **cross-validation** to avoid overfitting. 259 | 4. Parallelize the search process using multiple CPUs or GPUs. ⚡ 260 | 5. Evaluate results with metrics relevant to your problem, such as precision, recall, or F1-score. 📊 261 | 262 | 263 | 264 | ## 🛠️ **Practice Exercise** 265 | 266 | Use the dataset of your choice and apply **RandomizedSearchCV** to tune the hyperparameters of a Support Vector Machine (SVM) classifier. 267 | 268 | 1. Load a dataset (e.g., `load_digits()` from scikit-learn). 269 | 2. Define a parameter distribution for the SVM. 270 | 3. Use RandomizedSearchCV to find the best parameters. 271 | 4. Evaluate the tuned model on a test set. 
272 | 273 | Example starting code: 274 | 275 | ```python 276 | from sklearn.datasets import load_digits 277 | from sklearn.svm import SVC 278 | from sklearn.model_selection import RandomizedSearchCV, train_test_split 279 | from scipy.stats import uniform 280 | 281 | # Load dataset 282 | X, y = load_digits(return_X_y=True) 283 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 284 | 285 | # Define model and parameter distributions 286 | svc = SVC() 287 | param_distributions = { 288 | 'C': uniform(0.1, 10), 289 | 'kernel': ['linear', 'rbf', 'poly'], 290 | 'gamma': uniform(0.01, 1) 291 | } 292 | 293 | # RandomizedSearchCV 294 | random_search = RandomizedSearchCV(svc, param_distributions, n_iter=50, cv=5, random_state=42) 295 | random_search.fit(X_train, y_train) 296 | 297 | # Best parameters and accuracy 298 | print("Best Parameters:", random_search.best_params_) 299 | print("Test Accuracy:", random_search.score(X_test, y_test)) 300 | ``` 301 | 302 | 303 | 304 | ## 📜 **Summary** 305 | 306 | Today, we explored various techniques for hyperparameter tuning: **GridSearchCV**, **RandomizedSearchCV**, **Bayesian Optimization**, **Optuna**, and hyperparameter tuning for neural networks. Each method has its unique advantages and applications, making them essential tools for optimizing machine learning models. Practice these methods to enhance your machine learning models! 🚀 307 | 308 | --- 309 | 310 | 311 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 👨‍🔬 30 Days of Data Science 2 | 3 | | **Day** | **Topic** | **Topics Covered** | 4 | |---------|------------------------------------------|--------------------------------------| 5 | | **01** | [Introduction to Data Science](README.md#-day-1)| Setting up Python, Jupyter Notebook | 6 | | **02** | [Basics of the Language + Git Basics](./02_Basics%20of%20the%20Language%20%26%20Git%20Basics/02_Basics%20of%20the%20Language%20%26%20Git%20Basics.md)| Python syntax, variables, Git setup | 7 | | **03** | [Control Flow](./03_Control%20Flow/03_Control%20Flow.md)| If-else, loops | 8 | | **04** | [Functions and Modular Programming](./04_Functions%20and%20Modular%20Programming/04_Functions%20and%20Modular%20Programming.md)| Defining & calling functions | 9 | | **05** | [Data Structures](./05_Data%20Structures/05_Data%20Structures.md)| Lists, tuples, dictionaries | 10 | | **06** | [Data Frames and Tables](./06_Data%20Frames%20and%20Tables/06_Data%20Frames%20and%20Tables.md)| pandas DataFrame | 11 | | **07** | [Importing Data](./07_Importing%20Data/07_Importing%20Data.md)| Reading CSV, Excel, JSON files | 12 | | **08** | [Data Cleaning](./08_Data%20Cleaning/08_Data%20Cleaning.md)| Handling missing values, duplicates | 13 | | **09** | [Exploratory Data Analysis (EDA)](./09_Exploratory%20Data%20Analysis%20(EDA)/09_Exploratory%20Data%20Analysis%20(EDA).md) | Descriptive statistics | 14 | | **10** | [Data Visualization Basics](./10_Data%20Visualization%20Basics/10_Data%20Visualization%20Basics.md) | matplotlib, seaborn | 15 | | **11** | [Advanced Data Visualization](./11_Advanced%20Data%20Visualization/11_Advanced%20Data%20Visualization.md)| Plotly, advanced matplotlib | 16 | | **12** | [SQL for Data Retrieval](./12_SQL%20for%20Data%20Retrieval/12_SQL%20for%20Data%20Retrieval.md)| sqlite3, SQLAlchemy | 17 | | **13** | [Time Series Analysis 
Introduction](./13_Time%20Series%20Analysis%20Introduction/13_Time%20Series%20Analysis%20Introduction.md)| pandas datetime, matplotlib | 18 | | **14** | [Working with APIs and JSON](./14_Working%20with%20APIs%20and%20JSON/14_Working%20with%20APIs%20and%20JSON.md)| requests, JSON module | 19 | | **15** | [Regular Expressions](./15_Regular%20Expressions/15_Regular%20Expressions.md)| re module | 20 | | **16** | [Statistical Concepts](./16_Statistical%20Concepts/16_Statistical%20Concepts.md)| Scipy, NumPy | 21 | | **17** | [Hypothesis Testing](./17_Hypothesis%20Testing/17_Hypothesis%20Testing.md)| t-test, chi-square | 22 | | **18** | [Basic Machine Learning Introduction](./18_Basic%20Machine%20Learning%20Introduction/18_Basic%20Machine%20Learning%20Introduction.md)| scikit-learn basics | 23 | | **19** | [Linear Regression](./19_Linear%20Regression/19_Linear%20Regression.md)| LinearRegression in scikit-learn | 24 | | **20** | [Logistic Regression](./20_Logistic%20Regression/20_Logistic%20Regression.md)| LogisticRegression in scikit-learn | 25 | | **21** | [Clustering (K-Means)](./21_Clustering%20(K-Means)/21_Clustering%20(K-Means).md)| KMeans in scikit-learn | 26 | | **22** | [Decision Trees](./22_Decision%20Trees/22_Decision%20Trees.md)| DecisionTreeClassifier in scikit-learn | 27 | | **23** | [Handling Imbalanced Data](./23_Handling%20Imbalanced%20Data/23_Handling%20Imbalanced%20Data.md)| SMOTE, class weighting | 28 | | **24** | [Feature Engineering](./24_Feature%20Engineering/24_Feature%20Engineering.md)| Encoding, scaling, feature selection | 29 | | **25** | [Model Evaluation and Metrics](./25_Model%20Evaluation%20and%20Metrics/25_Model%20Evaluation%20and%20Metrics.md)| Confusion matrix, ROC-AUC | 30 | | **26** | [Advanced ML: Hyperparameter Tuning](./26_Advanced%20ML%3A%20Hyperparameter%20Tuning/26_Advanced%20ML%3A%20Hyperparameter%20Tuning.md)| GridSearchCV, RandomizedSearchCV | 31 | | **27** | [Natural Language Processing (NLP)](./27_Natural%20Language%20Processing%20(NLP)/27_Natural%20Language%20Processing%20(NLP).md)| NLTK, spaCy, Hugging Face | 32 | | **28** | [Time Series Forecasting](./28_Time%20Series%20Forecasting/28_Time%20Series%20Forecasting.md)| ARIMA, Prophet | 33 | | **29** | [Working with Big Data](./29_Working%20with%20Big%20Data/29_Working%20with%20Big%20Data.md) | PySpark basics | 34 | | **30** | [Building a Data Science Pipeline](./30_Building%20a%20Data%20Science%20Pipeline/30_Building%20a%20Data%20Science%20Pipeline.md)| sklearn pipeline, joblib | 35 | | **31** | [Deployment on Cloud Platform](./31_Deployment%20on%20Cloud%20Platform/31_Deployment%20on%20Cloud%20Platform.md)| Deploy with Flask/FastAPI to AWS, Azure, or GCP | 36 | 37 | 38 | - [👨‍🔬 30 Days Of Data Science](#-30-days-of-data-science) 39 | - [📘 Day 1](#-day-1) 40 | - [Welcome](#welcome) 41 | - [Introduction](#introduction) 42 | - [Why Learn Data Science ?](#why-learn-data-science) 43 | - [Setting Up Your Environment](#setting-up-your-environment) 44 | - [Installing Python](#installing-python) 45 | - [Python Shell](#python-shell) 46 | - [Installing Visual Studio Code](#installing-visual-studio-code) 47 | - [Installing Jupyter Notebook](#installing-jupyter-notebook) 48 | 49 | 50 | # 📘 Day 1 51 | 52 | ## Welcome 53 | 54 | **Congratulations** on deciding to participate in a _30 Days of Data Science_ challenge! In this challenge, you will dive into the essential concepts of data science, from foundational programming skills to data analysis, visualization, and machine learning. 
55 | 56 | 57 | 58 | ## Introduction 59 | 60 | Data Science is an interdisciplinary field that uses programming, mathematics, and domain knowledge to extract insights from structured and unstructured data. Python is one of the most popular tools in data science due to its versatility, ease of use, and robust ecosystem of libraries. This challenge is designed to help you build a strong foundation in Python while applying it to practical data science tasks. The topics are distributed over 30 days, with clear explanations, real-world examples, and hands-on exercises. 61 | 62 | This challenge is suitable for beginners as well as professionals looking to strengthen their data science skills. It may take 30 to 100 days to complete, depending on your pace. 63 | 64 | 65 | 66 | 67 | ## Why Learn Data Science? 68 | 69 | Data Science is revolutionizing industries by enabling data-driven decision-making. It combines programming, statistics, and domain expertise to solve complex problems. Python has become the go-to language in the data science community due to its simplicity and extensive library support for tasks like data cleaning, visualization, and modeling. Whether you aim to work in business analytics, artificial intelligence, or research, data science skills will open up endless possibilities. 70 | 71 | ### Setting Up Your Environment 72 | 73 | ## Installing Python 74 | 75 | To start coding in Python, you need to install it on your computer. Visit the [official Python website](https://www.python.org/) to download the latest version. 76 | - **Windows users**: Download Python by clicking the appropriate button. 77 | - **macOS users**: Follow similar steps to install Python for Mac. 78 | 79 | To confirm the installation, open your terminal or command prompt and type: 80 | 81 | ```shell 82 | python --version 83 | ``` 84 | 85 | You should see the installed version, which should be Python 3.6 or above. For example: 86 | 87 | ```shell 88 | Python 3.12.4 89 | ``` 90 | 91 | If the command displays the Python version, you are ready to proceed. 92 | 93 | ## Python Shell 94 | 95 | Python is an interpreted language, meaning you can execute code line by line. Python comes with an interactive shell, which allows you to write and test Python commands directly. To open the shell, type the following command in your terminal: 96 | 97 | ```shell 98 | python 99 | ``` 100 | 101 | Once the shell is open, you can start entering Python commands after the `>>>` prompt. For example, typing `2 + 3` will output `5`. To exit the shell, type `exit()`. 102 | 103 | If you enter an invalid command, Python will provide an error message, helping you debug and learn. Debugging is the process of identifying and fixing errors in your code. You will encounter common error types such as `SyntaxError`, `NameError`, and `TypeError` throughout this challenge. Understanding these errors is crucial for becoming a proficient programmer. 104 | 105 | ## Installing Visual Studio Code 106 | 107 | While the Python shell is great for quick tests, real-world data science projects require robust code editors. For this challenge, we recommend using [Visual Studio Code](https://code.visualstudio.com/), a popular and lightweight editor. Feel free to use other editors if you prefer. 108 | 109 | To start, download and install Visual Studio Code. Once installed, create a folder named `30DaysOfDataScience` on your computer and open it using Visual Studio Code. Inside the folder, create a new file, such as `helloworld.py`, to write your first Python script. 
This will serve as the workspace for your projects throughout the challenge. 110 | 111 | #### Exploring the Editor 112 | 113 | Visual Studio Code offers many features to enhance productivity, including debugging tools, extensions, and an intuitive interface. Spend some time familiarizing yourself with its layout and shortcuts. 114 | 115 | ## Installing Jupyter Notebook 116 | 117 | In addition to Visual Studio Code, another essential tool for data science is **Jupyter Notebook**. It is an interactive web-based environment where you can write and execute Python code, visualize data, and document your analysis all in one place. Jupyter Notebook is widely used in the data science community because it simplifies exploratory data analysis and data visualization. 118 | 119 | ### Installing Jupyter Notebook 120 | 121 | To install Jupyter Notebook, you'll first need to install `pip`, the Python package manager, which should already be available if you've installed Python. Open your terminal or command prompt and type: 122 | 123 | ```shell 124 | pip install notebook 125 | ``` 126 | 127 | Once the installation is complete, you can launch Jupyter Notebook by typing: 128 | 129 | ```shell 130 | jupyter notebook 131 | ``` 132 | 133 | This command will open Jupyter Notebook in your default web browser. You will see an interface that allows you to create and organize notebooks in different folders. 134 | 135 | #### Using Jupyter Notebook 136 | 137 | To create a new notebook: 138 | 139 | 1. Navigate to the folder where you'd like to save your notebooks. 140 | 2. Click **New** (on the top-right corner) and select **Python 3 (ipykernel)**. 141 | 142 | A new notebook will open where you can write Python code in individual cells. Press **Shift + Enter** to execute the code in a cell. You can also add explanatory text using Markdown cells to make your analysis more readable. 143 | 144 | Here is a simple example to get started: 145 | 146 | 1. Create a new notebook and name it `Day1_Basics.ipynb`. 147 | 2. Write the following code in a cell and execute it: 148 | 149 | ```python 150 | # This is your first code in Jupyter Notebook 151 | print("Hello, Data Science!") 152 | ``` 153 | 154 | You should see the output below the cell: 155 | 156 | ``` 157 | Hello, Data Science! 158 | ``` 159 | 160 | #### Installing JupyterLab (Optional) 161 | 162 | If you'd like a more modern interface with enhanced features, you can use **JupyterLab**, an upgraded version of Jupyter Notebook. Install it using: 163 | 164 | ```shell 165 | pip install jupyterlab 166 | ``` 167 | 168 | Launch it by typing: 169 | 170 | ```shell 171 | jupyter lab 172 | ``` 173 | 174 | #### Integration with Visual Studio Code 175 | 176 | If you prefer to work within Visual Studio Code but want the interactivity of Jupyter Notebook, you can install the **Jupyter extension** in Visual Studio Code: 177 | 178 | 1. Open Visual Studio Code and go to the Extensions Marketplace (the square icon on the sidebar). 179 | 2. Search for "Jupyter" and install the extension. 180 | 3. Open a `.ipynb` file, or create one using the command palette (`Ctrl + Shift + P` or `Cmd + Shift + P` on Mac) and selecting `Jupyter: Create New Blank Notebook`. 181 | 182 | Now you can use Jupyter notebooks directly within Visual Studio Code! 
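Whichever interface you choose, a quick way to confirm your environment is ready for the rest of the challenge is to check that the core libraries import. This assumes you have already installed them (for example with `pip install numpy pandas matplotlib`):

```python
# Run this in a notebook cell (or as a script) to verify the core data science stack
import sys
import numpy as np
import pandas as pd
import matplotlib

print("Python:", sys.version.split()[0])
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("matplotlib:", matplotlib.__version__)
```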
183 | 184 | 185 | 186 | 187 | [Day 2 >>](./02_Basics%20of%20the%20Language%20%26%20Git%20Basics/02_Basics%20of%20the%20Language%20%26%20Git%20Basics.md) 188 | 189 | 190 | 191 | 192 | --------------------------------------------------------------------------------