├── LICENSE ├── 23_Handling Imbalanced Data └── 23_Handling Imbalanced Data.md ├── 07_Importing Data └── 07_Importing Data.md ├── 06_Data Frames and Tables └── 06_Data Frames and Tables.md ├── 03_Control Flow └── 03_Control Flow.md ├── 21_Clustering (K-Means) └── 21_Clustering (K-Means).md ├── 10_Data Visualization Basics └── 10_Data Visualization Basics.md ├── 16_Statistical Concepts └── 16_Statistical Concepts.md ├── 14_Working with APIs and JSON └── 14_Working with APIs and JSON.md ├── 22_Decision Trees └── 22_Decision Trees.md ├── 24_Feature Engineering └── 24_Feature Engineering.md ├── 17_Hypothesis Testing └── 17_Hypothesis Testing.md ├── 19_Linear Regression └── 19_Linear Regression.md ├── 15_Regular Expressions └── 15_Regular Expressions.md ├── 25_Model Evaluation and Metrics └── 25_Model Evaluation and Metrics.md ├── 29_Working with Big Data └── 29_Working with Big Data.md ├── 18_Basic Machine Learning Introduction └── 18_Basic Machine Learning Introduction.md ├── 12_SQL for Data Retrieval └── 12_SQL for Data Retrieval.md ├── 13_Time Series Analysis Introduction └── 13_Time Series Analysis Introduction.md ├── 08_Data Cleaning └── 08_Data Cleaning.md ├── 31_Deployment on Cloud Platform └── 31_Deployment on Cloud Platform.md ├── 02_Basics of the Language & Git Basics └── 02_Basics of the Language & Git Basics.md ├── 04_Functions and Modular Programming └── 04_Functions and Modular Programming.md ├── 09_Exploratory Data Analysis (EDA) └── 09_Exploratory Data Analysis (EDA).md ├── 30_Building a Data Science Pipeline └── 30_Building a Data Science Pipeline.md ├── 11_Advanced Data Visualization └── 11_Advanced Data Visualization.md ├── 05_Data Structures └── 05_Data Structures.md ├── 20_Logistic Regression └── 20_Logistic Regression.md ├── 28_Time Series Forecasting └── 28_Time Series Forecasting.md ├── 27_Natural Language Processing (NLP) └── 27_Natural Language Processing (NLP).md ├── 26_Advanced ML: Hyperparameter Tuning └── 26_Advanced ML: Hyperparameter Tuning.md └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Samarth Garge 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /23_Handling Imbalanced Data/23_Handling Imbalanced Data.md: -------------------------------------------------------------------------------- 1 | [<< Day 22](../22_Decision%20Trees/22_Decision%20Trees.md) | [Day 24 >>](../24_Feature%20Engineering/24_Feature%20Engineering.md) 2 | 3 | # Day 23: Imbalanced Data and Handling Techniques 4 | 5 | Welcome to **Day 23** of the 30 Days of Data Science series! 🎉 Today, we tackle the challenge of **Imbalanced Data** in machine learning. Imbalanced datasets can severely impact model performance. We’ll explore various techniques to handle imbalanced data and ensure better results in classification tasks. 🌟 6 | 7 | ## 📋 Table of Contents 8 | 9 | - [📊 Introduction to Imbalanced Data](#-introduction-to-imbalanced-data) 10 | - [📚 Understanding the Problem](#-understanding-the-problem) 11 | - [🔍 Why is Imbalanced Data a Challenge?](#-why-is-imbalanced-data-a-challenge) 12 | - [📈 Metrics for Imbalanced Data](#-metrics-for-imbalanced-data) 13 | - [🛠️ Techniques to Handle Imbalanced Data](#%EF%B8%8F-techniques-to-handle-imbalanced-data) 14 | - [⚖️ Class Weighting](#%EF%B8%8F-class-weighting) 15 | - [🔬 Synthetic Minority Oversampling Technique (SMOTE)](#-synthetic-minority-oversampling-technique-smote) 16 | - [📉 Undersampling the Majority Class](#-undersampling-the-majority-class) 17 | - [🔄 Resampling Methods](#-resampling-methods) 18 | - [📝 Practice Exercises](#-practice-exercises) 19 | - [📌 Summary](#-summary) 20 | 21 | 22 | 23 | ## 📊 Introduction to Imbalanced Data 24 | 25 | Imbalanced datasets occur when one class significantly outnumbers others in a classification task. For example, in fraud detection, the ratio of fraudulent to non-fraudulent transactions is often heavily skewed. 26 | 27 | ## 📚 Understanding the Problem 28 | 29 | ### 🔍 Why is Imbalanced Data a Challenge? 30 | 31 | - **Bias Towards Majority Class**: Models tend to predict the majority class more frequently. 32 | - **Poor Metric Representation**: Accuracy alone can be misleading in imbalanced datasets. 33 | 34 | ### 📈 Metrics for Imbalanced Data 35 | 36 | Key metrics to evaluate performance include: 37 | 38 | - **Precision**: 39 | $$\text{Precision} = \frac{TP}{TP + FP}$$ 40 | 41 | - **Recall**: 42 | $$\text{Recall} = \frac{TP}{TP + FN}$$ 43 | 44 | - **F1 Score**: 45 | $$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$ 46 | 47 | 48 | 49 | ## 🛠️ Techniques to Handle Imbalanced Data 50 | 51 | ### ⚖️ Class Weighting 52 | Assign higher weights to the minority class during model training to reduce imbalance effects. 53 | 54 | ```python 55 | from sklearn.ensemble import RandomForestClassifier 56 | from sklearn.metrics import classification_report 57 | 58 | # Assign class weights 59 | model = RandomForestClassifier(class_weight='balanced', random_state=42) 60 | model.fit(X_train, y_train) 61 | 62 | y_pred = model.predict(X_test) 63 | print(classification_report(y_test, y_pred)) 64 | ``` 65 | 66 | ### 🔬 Synthetic Minority Oversampling Technique (SMOTE) 67 | SMOTE creates synthetic samples for the minority class by interpolating between existing examples. 
68 | 69 | ```python 70 | from imblearn.over_sampling import SMOTE 71 | from sklearn.model_selection import train_test_split 72 | 73 | # Apply SMOTE 74 | smote = SMOTE(random_state=42) 75 | X_resampled, y_resampled = smote.fit_resample(X, y) 76 | 77 | # Split data 78 | X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42) 79 | ``` 80 | 81 | ### 📉 Undersampling the Majority Class 82 | Reduce the number of majority class examples to balance the dataset. 83 | 84 | ```python 85 | from imblearn.under_sampling import RandomUnderSampler 86 | 87 | # Apply undersampling 88 | undersampler = RandomUnderSampler(random_state=42) 89 | X_resampled, y_resampled = undersampler.fit_resample(X, y) 90 | ``` 91 | 92 | ### 🔄 Resampling Methods 93 | Combine oversampling and undersampling techniques for better results. 94 | 95 | ```python 96 | from imblearn.combine import SMOTEENN 97 | 98 | # Apply combined resampling 99 | smote_enn = SMOTEENN(random_state=42) 100 | X_resampled, y_resampled = smote_enn.fit_resample(X, y) 101 | ``` 102 | 103 | 104 | 105 | ## 📝 Practice Exercises 106 | 107 | 1. Load the `imbalanced-learn` package and experiment with SMOTE on a custom imbalanced dataset. 108 | 2. Compare model performance with and without class weighting on a highly imbalanced dataset. 109 | 3. Implement undersampling on the `imbalanced-learn` library and evaluate its impact on classification metrics. 110 | 111 | 112 | 113 | ## 📌 Summary 114 | 115 | In today’s lesson, we explored: 116 | 117 | - The challenges of imbalanced data. 118 | - Key metrics for evaluating models on imbalanced datasets. 119 | - Techniques to address class imbalance such as class weighting, SMOTE, and undersampling. 120 | 121 | By applying these techniques, you can improve the performance and fairness of your machine learning models. Keep practicing and experimenting! 🌟 122 | 123 | --- 124 | -------------------------------------------------------------------------------- /07_Importing Data/07_Importing Data.md: -------------------------------------------------------------------------------- 1 | [<< Day 6](../06_Data%20Frames%20and%20Tables/06_Data%20Frames%20and%20Tables.md) | [Day 8 >>](../08_Data%20Cleaning/08_Data%20Cleaning.md) 2 | # 📘 Day 7: Importing Data with Pandas 3 | 4 | Welcome to **Day 7** of the **30 Days of Data Science** series! Today, we focus on **Importing Data** using the Pandas library. Learning to import data from various formats like CSV, Excel, and JSON is foundational for data analysis. 5 | 6 | 7 | 8 | ## Table of Contents 9 | 10 | - [📘 Day 7: Importing Data with Pandas](#-day-7-importing-data-with-pandas) 11 | - [1️⃣ Introduction to Data Importing](#1️⃣-introduction-to-data-importing) 12 | - [2️⃣ Reading CSV Files](#2️⃣-reading-csv-files) 13 | - [Basic CSV Reading](#basic-csv-reading) 14 | - [Customizing the Read](#customizing-the-read) 15 | - [Efficient Reading of Large CSV Files](#efficient-reading-of-large-csv-files) 16 | - [3️⃣ Reading Excel Files](#3️⃣-reading-excel-files) 17 | - [Reading Specific Sheets](#reading-specific-sheets) 18 | - [Dealing with Missing Data in Excel Files](#dealing-with-missing-data-in-excel-files) 19 | - [4️⃣ Reading JSON Files](#4️⃣-reading-json-files) 20 | - [Handling Nested JSON](#handling-nested-json) 21 | - [🧠 Practice Exercises](#-practice-exercises) 22 | - [🌟 Summary](#-summary) 23 | 24 | 25 | 26 | 27 | ## 1️⃣ Introduction to Data Importing 28 | 29 | Data comes in various formats such as CSV, Excel, and JSON. 
Using Pandas, you can seamlessly convert these into **DataFrames** for manipulation. 30 | 31 | Install Pandas if you haven’t: 32 | 33 | ```bash 34 | pip install pandas 35 | ``` 36 | 37 | Import it in your Python script: 38 | 39 | ```python 40 | import pandas as pd 41 | ``` 42 | 43 | 44 | 45 | ## 2️⃣ Reading CSV Files 46 | 47 | ### Basic CSV Reading 48 | 49 | Use `pd.read_csv` to load a CSV file into a Pandas DataFrame. 50 | 51 | #### Example: 52 | 53 | ```python 54 | import pandas as pd 55 | 56 | # Reading a CSV file 57 | df = pd.read_csv("data.csv") 58 | print(df.head()) # Display the first 5 rows 59 | ``` 60 | 61 | 62 | 63 | ### Customizing the Read 64 | 65 | Modify how data is read by specifying parameters like column names, skipping rows, or handling missing data. 66 | 67 | #### Example: Rename Columns and Skip Rows 68 | 69 | ```python 70 | df = pd.read_csv("data.csv", skiprows=2, names=["ID", "Name", "Age", "City"]) 71 | print(df) 72 | ``` 73 | 74 | #### Example: Handle Missing Values 75 | 76 | ```python 77 | df = pd.read_csv("data.csv", na_values=["N/A", "Missing"]) 78 | print(df) 79 | ``` 80 | 81 | 82 | 83 | ### Efficient Reading of Large CSV Files 84 | 85 | When working with large files, optimize reading by using these techniques: 86 | 87 | #### Example: Load in Chunks 88 | 89 | ```python 90 | for chunk in pd.read_csv("large_data.csv", chunksize=1000): 91 | print(chunk.shape) 92 | ``` 93 | 94 | #### Example: Use Specific Columns 95 | 96 | ```python 97 | df = pd.read_csv("data.csv", usecols=["Name", "Age"]) 98 | print(df) 99 | ``` 100 | 101 | 102 | 103 | ## 3️⃣ Reading Excel Files 104 | 105 | Excel files are popular for structured data storage. Use `pd.read_excel` to import them. 106 | 107 | ### Reading Specific Sheets 108 | 109 | Specify the `sheet_name` parameter to target a particular sheet. 110 | 111 | #### Example: 112 | 113 | ```python 114 | df = pd.read_excel("data.xlsx", sheet_name="Sheet1") 115 | print(df.head()) 116 | ``` 117 | 118 | 119 | 120 | ### Dealing with Missing Data in Excel Files 121 | 122 | Handle empty cells by specifying values to treat as NaN. 123 | 124 | #### Example: 125 | 126 | ```python 127 | df = pd.read_excel("data.xlsx", na_values=["N/A", "Not Available"]) 128 | print(df) 129 | ``` 130 | 131 | 132 | 133 | ## 4️⃣ Reading JSON Files 134 | 135 | JSON files are lightweight and widely used in web applications. Use `pd.read_json` for flat files or `pd.json_normalize` for nested JSON structures. 136 | 137 | ### Handling Nested JSON 138 | 139 | For deeply nested JSON, normalize it for a tabular structure. 140 | 141 | #### Example: 142 | 143 | ```python 144 | import pandas as pd 145 | 146 | # Example JSON data 147 | data = { 148 | "Name": ["Alice", "Bob"], 149 | "Details": [{"Age": 25, "City": "New York"}, {"Age": 30, "City": "Chicago"}] 150 | } 151 | 152 | # Normalize JSON 153 | df = pd.json_normalize(data, record_path="Details", meta=["Name"]) 154 | print(df) 155 | ``` 156 | 157 | **Output**: 158 | 159 | ``` 160 | Age City Name 161 | 0 25 New York Alice 162 | 1 30 Chicago Bob 163 | ``` 164 | 165 | 166 | 167 | ## 🧠 Practice Exercises 168 | 169 | 1. Read a CSV file and calculate the average of a numeric column. 170 | 2. Extract data from a specific Excel sheet and filter rows based on a condition. 171 | 3. Parse a nested JSON file and convert it into a DataFrame. 172 | 173 | 174 | 175 | ## 🌟 Summary 176 | 177 | - CSV: Use `pd.read_csv` with customizable parameters. 178 | - Excel: Use `pd.read_excel` to read specific sheets or handle missing data. 
179 | - JSON: Use `pd.read_json` for flat files or `pd.json_normalize` for nested structures. 180 | 181 | --- 182 | 183 | 184 | -------------------------------------------------------------------------------- /06_Data Frames and Tables/06_Data Frames and Tables.md: -------------------------------------------------------------------------------- 1 | [<< Day 5](../05_Data%20Structures/05_Data%20Structures.md) | [Day 7 >>](../07_Importing%20Data/07_Importing%20Data.md) 2 | 3 | 4 | # 📘 Day 6: Dataframes and Tables with Pandas 5 | 6 | Welcome to **Day 6** of the **30 Days of Data Science** series! Today, we will explore **Dataframes and Tables** using the **Pandas** library. Pandas is a powerful Python library for data manipulation and analysis, widely used in data science and machine learning. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 6: Dataframes and Tables with Pandas](#-day-6-dataframes-and-tables-with-pandas) 13 | - [1️⃣ Introduction to Pandas 📊](#1️⃣-introduction-to-pandas-) 14 | - [2️⃣ Dataframes](#2️⃣-dataframes) 15 | - [Creating a DataFrame](#creating-a-dataframe) 16 | - [Accessing Data in a DataFrame](#accessing-data-in-a-dataframe) 17 | - [Adding and Removing Columns](#adding-and-removing-columns) 18 | - [3️⃣ Tables](#3️⃣-tables) 19 | - [Reading Data from a File](#reading-data-from-a-file) 20 | - [Sorting and Filtering Data](#sorting-and-filtering-data) 21 | - [🧠 Practice Exercises](#-practice-exercises) 22 | - [🌟 Summary](#-summary) 23 | 24 | 25 | 26 | 27 | ## 1️⃣ Introduction to Pandas 📊 28 | 29 | Pandas is a Python library designed to simplify data manipulation and analysis. It provides two primary data structures: 30 | 31 | - **Series**: A one-dimensional labeled array (similar to a list). 32 | - **DataFrame**: A two-dimensional labeled data structure (similar to a table). 33 | 34 | To use Pandas, you first need to install it. Run the following command in your terminal if you haven't already: 35 | 36 | ```bash 37 | pip install pandas 38 | ``` 39 | 40 | Then, import it in your Python code: 41 | 42 | ```python 43 | import pandas as pd 44 | ``` 45 | 46 | 47 | 48 | ## 2️⃣ Dataframes 49 | 50 | A **DataFrame** is a two-dimensional labeled data structure that resembles a table. It consists of rows and columns. 51 | 52 | ### Creating a DataFrame 53 | 54 | You can create a DataFrame from various data structures like lists, dictionaries, or even external files. 55 | 56 | #### From a Dictionary 57 | 58 | ```python 59 | import pandas as pd 60 | 61 | data = { 62 | "Name": ["Alice", "Bob", "Charlie"], 63 | "Age": [25, 30, 35], 64 | "City": ["New York", "San Francisco", "Los Angeles"] 65 | } 66 | 67 | df = pd.DataFrame(data) 68 | print(df) 69 | ``` 70 | 71 | **Output**: 72 | 73 | ``` 74 | Name Age City 75 | 0 Alice 25 New York 76 | 1 Bob 30 San Francisco 77 | 2 Charlie 35 Los Angeles 78 | ``` 79 | 80 | 81 | 82 | ### Accessing Data in a DataFrame 83 | 84 | You can access specific rows, columns, or elements in a DataFrame. 
85 | 86 | #### Accessing Columns 87 | 88 | ```python 89 | # Accessing a single column 90 | print(df["Name"]) 91 | 92 | # Accessing multiple columns 93 | print(df[["Name", "Age"]]) 94 | ``` 95 | 96 | #### Accessing Rows 97 | 98 | ```python 99 | # Accessing rows by index 100 | print(df.iloc[0]) # First row 101 | print(df.iloc[1:3]) # Second to third row 102 | ``` 103 | 104 | #### Accessing Specific Elements 105 | 106 | ```python 107 | # Accessing specific cell 108 | print(df.at[0, "Name"]) # Output: Alice 109 | ``` 110 | 111 | 112 | 113 | ### Adding and Removing Columns 114 | 115 | #### Adding a Column 116 | 117 | ```python 118 | df["Country"] = ["USA", "USA", "USA"] 119 | print(df) 120 | ``` 121 | 122 | #### Removing a Column 123 | 124 | ```python 125 | df.drop("Age", axis=1, inplace=True) 126 | print(df) 127 | ``` 128 | 129 | 130 | 131 | ## 3️⃣ Tables 132 | 133 | Tables in Pandas are often created or manipulated by reading data from external sources like CSV files, Excel files, or databases. 134 | 135 | ### Reading Data from a File 136 | 137 | #### Reading a CSV File 138 | 139 | ```python 140 | df = pd.read_csv("data.csv") 141 | print(df.head()) # Display the first 5 rows 142 | ``` 143 | 144 | #### Writing to a CSV File 145 | 146 | ```python 147 | df.to_csv("output.csv", index=False) 148 | ``` 149 | 150 | 151 | 152 | ### Sorting and Filtering Data 153 | 154 | #### Sorting Data 155 | 156 | ```python 157 | # Sorting by a single column 158 | df = df.sort_values("Age") 159 | print(df) 160 | 161 | # Sorting by multiple columns 162 | df = df.sort_values(["Age", "Name"], ascending=[True, False]) 163 | print(df) 164 | ``` 165 | 166 | #### Filtering Data 167 | 168 | ```python 169 | # Filtering rows where Age > 30 170 | filtered_df = df[df["Age"] > 30] 171 | print(filtered_df) 172 | ``` 173 | 174 | 175 | 176 | ## 🧠 Practice Exercises 177 | 178 | 1. Create a DataFrame with three columns: `Product`, `Price`, and `Quantity`. Add a new column `Total` by multiplying `Price` and `Quantity`. 179 | 2. Load a CSV file into a DataFrame and display the first 10 rows. 180 | 3. Sort a DataFrame by a specific column and filter rows based on a condition. 181 | 182 | 183 | 184 | ## 🌟 Summary 185 | 186 | - **DataFrames** are powerful data structures in Pandas for tabular data. 187 | - Pandas makes it easy to manipulate, analyze, and visualize data. 188 | - You can create DataFrames from dictionaries, lists, or external files. 189 | 190 | --- 191 | 192 | -------------------------------------------------------------------------------- /03_Control Flow/03_Control Flow.md: -------------------------------------------------------------------------------- 1 | [<< Day 2](../02_Basics%20of%20the%20Language%20%26%20Git%20Basics/02_Basics%20of%20the%20Language%20%26%20Git%20Basics.md) | [Day 4 >>](../04_Functions%20and%20Modular%20Programming/04_Functions%20and%20Modular%20Programming.md) 2 | 3 | # 📘 Day 3: If-Else and Loops in Python 4 | 5 | Welcome to Day 3 of the **30 Days of Data Science** series! Today, we will cover essential programming constructs—**If-Else Statements** and **Loops**—which are fundamental for controlling the flow of your Python programs. Let’s dive in! 
6 | 7 | 8 | 9 | ## Table of Contents 10 | - [📘 Day 3: If-Else and Loops in Python](#-day-3-if-else-and-loops-in-python) 11 | - [1️⃣ If-Else Statements 🧠](#1️⃣-if-else-statements-) 12 | - [Syntax](#syntax) 13 | - [Example: Simple If-Else Statement](#example-simple-if-else-statement) 14 | - [Example: Nested If-Else](#example-nested-if-else) 15 | - [Example: If-Elif-Else](#example-if-elif-else) 16 | - [2️⃣ Loops 🔁](#2️⃣-loops-) 17 | - [For Loop](#for-loop) 18 | - [Example: Using a For Loop](#example-using-a-for-loop) 19 | - [While Loop](#while-loop) 20 | - [Example: Using a While Loop](#example-using-a-while-loop) 21 | - [Break and Continue](#break-and-continue) 22 | - [Example: Break and Continue](#example-break-and-continue) 23 | - [🧠 Practice Exercises](#-practice-exercises) 24 | - [🌟 Summary](#-summary) 25 | 26 | 27 | 28 | 29 | ## 1️⃣ If-Else Statements 🧠 30 | 31 | ### Syntax 32 | ```python 33 | if condition: 34 | # Code block executed if the condition is True 35 | else: 36 | # Code block executed if the condition is False 37 | ``` 38 | 39 | 40 | 41 | ### Example: Simple If-Else Statement 42 | ```python 43 | age = 20 44 | if age >= 18: 45 | print("You are an adult!") 46 | else: 47 | print("You are a minor!") 48 | ``` 49 | **Output:** 50 | ```plaintext 51 | You are an adult! 52 | ``` 53 | 54 | 55 | 56 | ### Example: Nested If-Else 57 | ```python 58 | age = 16 59 | if age >= 18: 60 | print("You can vote!") 61 | else: 62 | if age >= 16: 63 | print("You are a teenager!") 64 | else: 65 | print("You are a child!") 66 | ``` 67 | **Output:** 68 | ```plaintext 69 | You are a teenager! 70 | ``` 71 | 72 | 73 | 74 | ### Example: If-Elif-Else 75 | ```python 76 | marks = 85 77 | if marks >= 90: 78 | print("Grade: A") 79 | elif marks >= 75: 80 | print("Grade: B") 81 | elif marks >= 50: 82 | print("Grade: C") 83 | else: 84 | print("Grade: F") 85 | ``` 86 | **Output:** 87 | ```plaintext 88 | Grade: B 89 | ``` 90 | 91 | 92 | 93 | ## 2️⃣ Loops 🔁 94 | 95 | Loops allow repetitive tasks to be performed efficiently. 96 | 97 | 98 | 99 | ### For Loop 100 | The **for** loop iterates over a sequence (like a list, tuple, or string). 101 | 102 | #### Syntax 103 | ```python 104 | for item in sequence: 105 | # Code block to execute for each item 106 | ``` 107 | 108 | 109 | 110 | ### Example: Using a For Loop 111 | ```python 112 | numbers = [1, 2, 3, 4, 5] 113 | for num in numbers: 114 | print(num) 115 | ``` 116 | **Output:** 117 | ```plaintext 118 | 1 119 | 2 120 | 3 121 | 4 122 | 5 123 | ``` 124 | 125 | 126 | 127 | ### While Loop 128 | The **while** loop executes a block of code as long as a condition is `True`. 129 | 130 | #### Syntax 131 | ```python 132 | while condition: 133 | # Code block to execute 134 | ``` 135 | 136 | 137 | 138 | ### Example: Using a While Loop 139 | ```python 140 | count = 0 141 | while count < 5: 142 | print(count) 143 | count += 1 144 | ``` 145 | **Output:** 146 | ```plaintext 147 | 0 148 | 1 149 | 2 150 | 3 151 | 4 152 | ``` 153 | 154 | 155 | 156 | ### Break and Continue 157 | 158 | - **Break**: Terminates the loop prematurely. 159 | - **Continue**: Skips the current iteration and moves to the next. 
160 | 161 | 162 | 163 | ### Example: Break and Continue 164 | ```python 165 | for num in range(1, 6): 166 | if num == 3: 167 | break # Exit loop when num is 3 168 | print(num) 169 | ``` 170 | **Output:** 171 | ```plaintext 172 | 1 173 | 2 174 | ``` 175 | 176 | ```python 177 | for num in range(1, 6): 178 | if num == 3: 179 | continue # Skip iteration when num is 3 180 | print(num) 181 | ``` 182 | **Output:** 183 | ```plaintext 184 | 1 185 | 2 186 | 4 187 | 5 188 | ``` 189 | 190 | 191 | 192 | ## 🧠 Practice Exercises 193 | 194 | ### If-Else Statements 195 | 1. Write a program that checks if a number is positive, negative, or zero. 196 | 2. Create a grade classifier using the if-elif-else structure. 197 | 198 | 199 | 200 | ### Loops 201 | 1. Write a program that prints all even numbers from 1 to 50 using a for loop. 202 | 2. Create a program that sums the numbers from 1 to 100 using a while loop. 203 | 3. Use break and continue in a loop to demonstrate their functionality. 204 | 205 | 206 | 207 | ## 🌟 Summary 208 | 209 | - **If-Else Statements** allow you to make decisions in your code. 210 | - **Loops** enable you to automate repetitive tasks efficiently. 211 | - **Break and Continue** give more control over loop execution. 212 | 213 | --- 214 | 215 | 216 | 217 | 218 | -------------------------------------------------------------------------------- /21_Clustering (K-Means)/21_Clustering (K-Means).md: -------------------------------------------------------------------------------- 1 | [<< Day 20](../20_Logistic%20Regression/20_Logistic%20Regression.md) | [Day 22 >>](../22_Decision%20Trees/22_Decision%20Trees.md) 2 | 3 | 4 | 5 | # 📘 Day 21: Clustering with KMeans in Scikit-learn 6 | 7 | Welcome to **Day 21** of the **30 Days of Data Science** series! Today, we explore **Clustering** using the **KMeans algorithm** with Python's Scikit-learn library. Clustering is a fundamental **unsupervised learning** technique used to group similar data points together. 8 | 9 | 10 | 11 | ## Table of Contents 12 | 13 | - [📘 Day 21: Clustering with KMeans in Scikit-learn](#-day-21-clustering-with-kmeans-in-scikit-learn) 14 | - [🔍 What is Clustering?](#-what-is-clustering) 15 | - [📌 The KMeans Algorithm](#-the-kmeans-algorithm) 16 | - [⚙️ Installing Required Libraries](#️-installing-required-libraries) 17 | - [🛠️ KMeans with Scikit-learn: Step-by-Step](#️-kmeans-with-scikit-learn-step-by-step) 18 | - [1️⃣ Data Preparation](#1️⃣-data-preparation) 19 | - [2️⃣ Applying KMeans](#2️⃣-applying-kmeans) 20 | - [3️⃣ Visualizing Clusters](#3️⃣-visualizing-clusters) 21 | - [🧠 Use Cases of Clustering](#-use-cases-of-clustering) 22 | - [🧪 Practice Exercises](#-practice-exercises) 23 | - [🌟 Summary](#-summary) 24 | 25 | 26 | 27 | ## 🔍 What is Clustering? 28 | 29 | Clustering is a type of **unsupervised learning** that involves grouping data into clusters based on their similarities. Unlike supervised learning, clustering does not use labeled data. It’s widely used in: 30 | 31 | - Customer segmentation 32 | - Document clustering 33 | - Image segmentation 34 | - Anomaly detection 35 | 36 | 37 | 38 | ## 📌 The KMeans Algorithm 39 | 40 | The **KMeans algorithm** works by: 41 | 42 | 1. Randomly initializing `K` cluster centroids. 43 | 2. Assigning each data point to the nearest centroid. 44 | 3. Updating centroids by calculating the mean of the assigned points. 45 | 4. Repeating steps 2 and 3 until convergence. 46 | 47 | KMeans tries to minimize the **within-cluster sum of squares (WCSS)** to ensure compact clusters. 
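For reference, the objective KMeans minimizes can be written as follows, where $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is that cluster's centroid:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

Scikit-learn exposes this value on a fitted model as the `inertia_` attribute.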
48 | 49 | 50 | 51 | ## ⚙️ Installing Required Libraries 52 | 53 | Before we proceed, ensure you have Scikit-learn installed: 54 | 55 | ```bash 56 | pip install scikit-learn matplotlib numpy 57 | ``` 58 | 59 | 60 | 61 | ## 🛠️ KMeans with Scikit-learn: Step-by-Step 62 | 63 | Let’s implement KMeans using Scikit-learn. 64 | 65 | 66 | 67 | ### 1️⃣ Data Preparation 68 | 69 | We’ll generate a sample dataset using Scikit-learn’s `make_blobs` function: 70 | 71 | ```python 72 | import numpy as np 73 | from sklearn.datasets import make_blobs 74 | import matplotlib.pyplot as plt 75 | 76 | # Generating synthetic data 77 | X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42) 78 | 79 | # Visualizing the data 80 | plt.scatter(X[:, 0], X[:, 1], s=50) 81 | plt.title("Sample Data for Clustering") 82 | plt.show() 83 | ``` 84 | 85 | **Explanation**: 86 | - `n_samples`: Number of data points. 87 | - `centers`: Number of clusters. 88 | - `cluster_std`: Spread of each cluster. 89 | 90 | 91 | 92 | ### 2️⃣ Applying KMeans 93 | 94 | Now, let’s apply the KMeans algorithm to cluster the data into 4 groups. 95 | 96 | ```python 97 | from sklearn.cluster import KMeans 98 | 99 | # Applying KMeans 100 | kmeans = KMeans(n_clusters=4, random_state=42) 101 | y_kmeans = kmeans.fit_predict(X) 102 | 103 | # Printing centroids 104 | print("Cluster Centers:") 105 | print(kmeans.cluster_centers_) 106 | ``` 107 | 108 | **Explanation**: 109 | - `n_clusters`: The number of clusters. 110 | - `fit_predict()`: Assigns each point to a cluster and returns labels. 111 | 112 | 113 | 114 | ### 3️⃣ Visualizing Clusters 115 | 116 | Let’s visualize the resulting clusters and centroids. 117 | 118 | ```python 119 | # Visualizing the clusters 120 | plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis') 121 | 122 | # Marking centroids 123 | centroids = kmeans.cluster_centers_ 124 | plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', marker='X') 125 | plt.title("Clusters and Centroids") 126 | plt.show() 127 | ``` 128 | 129 | **Output**: 130 | - Data points are colored based on their cluster. 131 | - Red `X` marks represent the centroids. 132 | 133 | 134 | 135 | ## 🧠 Use Cases of Clustering 136 | 137 | - **Market Segmentation**: Grouping customers based on purchasing behavior. 138 | - **Image Compression**: Reducing colors in an image using cluster centroids. 139 | - **Document Clustering**: Grouping similar text documents. 140 | - **Biological Analysis**: Grouping genes with similar expression patterns. 141 | 142 | 143 | 144 | ## 🧪 Practice Exercises 145 | 146 | 1. Apply KMeans to a custom dataset of your choice. 147 | 2. Experiment with different values of `n_clusters` and observe the results. 148 | 3. Explore the **Elbow Method** to determine the optimal number of clusters. 149 | 150 | 151 | 152 | ## 🌟 Summary 153 | 154 | - Clustering is an essential unsupervised learning technique. 155 | - KMeans groups data into clusters by minimizing WCSS. 156 | - Scikit-learn provides easy-to-use tools for implementing KMeans. 157 | - Visualizing clusters helps interpret results effectively. 
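As a starting point for exercise 3 above, here is a minimal sketch of the Elbow Method: fit KMeans for a range of `k` values, record the WCSS (`inertia_`) for each, and look for the "elbow" where adding clusters stops paying off. The dataset and the range of `k` below are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, similar to the example used earlier in this lesson
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Fit KMeans for k = 1..9 and record the WCSS (inertia) for each k
wcss = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# The "elbow" in this curve suggests a reasonable number of clusters
plt.plot(range(1, 10), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```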
158 | 159 | --- 160 | 161 | 162 | -------------------------------------------------------------------------------- /10_Data Visualization Basics/10_Data Visualization Basics.md: -------------------------------------------------------------------------------- 1 | [<< Day 9](../09_Exploratory%20Data%20Analysis%20(EDA)/09_Exploratory%20Data%20Analysis%20(EDA).md) | [Day 11 >>](../11_Advanced%20Data%20Visualization/11_Advanced%20Data%20Visualization.md) 2 | 3 | 4 | # 📘 Day 10: Data Visualization 5 | 6 | Welcome to **Day 10** of the **30 Days of Data Science** series! Today, we dive into **Data Visualization** using two powerful libraries: **Matplotlib** and **Seaborn**. These tools allow us to create insightful visualizations that help understand and communicate data effectively. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 10: Data Visualization](#-day-10-data-visualization) 13 | - [1️⃣ Introduction to Data Visualization](#1️⃣-introduction-to-data-visualization) 14 | - [2️⃣ Visualizing Data with Matplotlib](#2️⃣-visualizing-data-with-matplotlib) 15 | - [Line Plots](#line-plots) 16 | - [Bar Charts](#bar-charts) 17 | - [Scatter Plots](#scatter-plots) 18 | - [Customizing Plots](#customizing-plots) 19 | - [3️⃣ Visualizing Data with Seaborn](#3️⃣-visualizing-data-with-seaborn) 20 | - [Distribution Plots](#distribution-plots) 21 | - [Categorical Plots](#categorical-plots) 22 | - [Pair Plots](#pair-plots) 23 | - [Heatmaps](#heatmaps) 24 | - [🧠 Practice Exercises](#-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | 30 | ## 1️⃣ Introduction to Data Visualization 31 | 32 | **Data Visualization** is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns. 33 | 34 | We will use: 35 | 36 | - **Matplotlib**: A versatile library for creating basic plots. 37 | - **Seaborn**: Built on top of Matplotlib, it provides a high-level interface for creating attractive and informative statistical graphics. 38 | 39 | Install the libraries if needed: 40 | 41 | ```bash 42 | pip install matplotlib seaborn 43 | ``` 44 | 45 | 46 | 47 | ## 2️⃣ Visualizing Data with Matplotlib 48 | 49 | ### Line Plots 50 | 51 | Line plots are great for visualizing trends over time. 52 | 53 | **Example**: 54 | 55 | ```python 56 | import matplotlib.pyplot as plt 57 | 58 | x = [1, 2, 3, 4, 5] 59 | y = [2, 4, 6, 8, 10] 60 | 61 | plt.plot(x, y, label="y = 2x") 62 | plt.title("Line Plot Example") 63 | plt.xlabel("X-axis") 64 | plt.ylabel("Y-axis") 65 | plt.legend() 66 | plt.show() 67 | ``` 68 | 69 | 70 | 71 | ### Bar Charts 72 | 73 | Bar charts are ideal for comparing categories. 74 | 75 | **Example**: 76 | 77 | ```python 78 | categories = ["A", "B", "C", "D"] 79 | values = [3, 7, 8, 5] 80 | 81 | plt.bar(categories, values, color="skyblue") 82 | plt.title("Bar Chart Example") 83 | plt.xlabel("Categories") 84 | plt.ylabel("Values") 85 | plt.show() 86 | ``` 87 | 88 | 89 | 90 | ### Scatter Plots 91 | 92 | Scatter plots show relationships between two variables. 93 | 94 | **Example**: 95 | 96 | ```python 97 | x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11] 98 | y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78] 99 | 100 | plt.scatter(x, y, color="green") 101 | plt.title("Scatter Plot Example") 102 | plt.xlabel("X-axis") 103 | plt.ylabel("Y-axis") 104 | plt.show() 105 | ``` 106 | 107 | 108 | 109 | ### Customizing Plots 110 | 111 | You can customize colors, styles, and layouts. 
112 | 113 | **Example**: 114 | 115 | ```python 116 | plt.plot(x, y, linestyle="--", color="red", marker="o") 117 | plt.title("Customized Line Plot") 118 | plt.show() 119 | ``` 120 | 121 | 122 | 123 | ## 3️⃣ Visualizing Data with Seaborn 124 | 125 | ### Distribution Plots 126 | 127 | Distribution plots are used to visualize the distribution of a dataset. 128 | 129 | **Example**: 130 | 131 | ```python 132 | import seaborn as sns 133 | 134 | data = [10, 20, 20, 30, 30, 30, 40, 50] 135 | sns.histplot(data, kde=True, color="purple") 136 | plt.title("Distribution Plot Example") 137 | plt.show() 138 | ``` 139 | 140 | 141 | 142 | ### Categorical Plots 143 | 144 | Categorical plots help visualize relationships between categories. 145 | 146 | **Example (Box Plot)**: 147 | 148 | ```python 149 | tips = sns.load_dataset("tips") 150 | sns.boxplot(x="day", y="total_bill", data=tips) 151 | plt.title("Box Plot Example") 152 | plt.show() 153 | ``` 154 | 155 | 156 | 157 | ### Pair Plots 158 | 159 | Pair plots visualize pairwise relationships in a dataset. 160 | 161 | **Example**: 162 | 163 | ```python 164 | sns.pairplot(tips, hue="day") 165 | plt.show() 166 | ``` 167 | 168 | 169 | 170 | ### Heatmaps 171 | 172 | Heatmaps are used for showing correlations or matrices. 173 | 174 | **Example**: 175 | 176 | ```python 177 | correlation = tips.corr() 178 | sns.heatmap(correlation, annot=True, cmap="coolwarm") 179 | plt.title("Heatmap Example") 180 | plt.show() 181 | ``` 182 | 183 | 184 | 185 | ## 🧠 Practice Exercises 186 | 187 | 1. Create a bar chart showing the sales of different products. 188 | 2. Plot the distribution of ages using Seaborn. 189 | 3. Use a scatter plot to explore the relationship between two variables. 190 | 4. Visualize pairwise relationships in a custom dataset. 191 | 192 | 193 | 194 | ## 🌟 Summary 195 | 196 | - **Matplotlib**: Provides low-level control for creating visualizations. 197 | - **Seaborn**: Offers high-level abstractions for complex visualizations. 198 | - Common visualization types include line plots, bar charts, scatter plots, box plots, and heatmaps. 199 | 200 | --- 201 | 202 | 203 | -------------------------------------------------------------------------------- /16_Statistical Concepts/16_Statistical Concepts.md: -------------------------------------------------------------------------------- 1 | [<< Day 15](../15_Regular%20Expressions/15_Regular%20Expressions.md) | [Day 17 >>](../17_Hypothesis%20Testing/17_Hypothesis%20Testing.md) 2 | 3 | # *Day 16: Statistical Concepts of NumPy and SciPy* 📊📈 4 | 5 | ## *Table of Contents* 6 | - [Introduction to Statistical Concepts](#introduction-to-statistical-concepts) ✨ 7 | - [NumPy for Statistics](#numpy-for-statistics) 🔢 8 | - [Mean](#mean) 📐 9 | - [Median](#median) 🎯 10 | - [Mode](#mode) 🎲 11 | - [Standard Deviation](#standard-deviation) 📏 12 | - [Variance](#variance) 🔄 13 | - [Percentile](#percentile) 📊 14 | - [SciPy for Advanced Statistics](#scipy-for-advanced-statistics) ⚡ 15 | - [Probability Distributions ](#probability-distributions) 🎛️ 16 | - [Hypothesis Testing](#hypothesis-testing) 🔍 17 | - [Linear Regression](#linear-regression) 📉 18 | - [Practice Exercises](#practice-exercises) 📝 19 | - [Summary](#summary) 🚀 20 | 21 | 22 | 23 | ## Introduction to Statistical Concepts✨ 24 | 25 | Statistics is the foundation of data science. It helps us analyze data, identify patterns, and make predictions. Python libraries like *NumPy* and *SciPy* provide powerful tools for performing statistical operations. 
26 | 27 | In this lesson, we will: 28 | - Explore statistical functions in *NumPy*, such as mean, median, and standard deviation. 29 | - Dive into *SciPy* for advanced statistical analysis, including hypothesis testing and working with probability distributions. 30 | 31 | 32 | 33 | ## NumPy for Statistics🔢 34 | 35 | NumPy is a powerful numerical computing library. It includes essential statistical functions that operate on arrays. 36 | 37 | ### Mean📐 38 | The mean is the average of a dataset. 39 | 40 | *Example*: 41 | 42 | python 43 | import numpy as np 44 | 45 | data = [10, 20, 30, 40, 50] 46 | mean = np.mean(data) 47 | print(f"Mean: {mean}") 48 | 49 | 50 | *Output*: 51 | 52 | Mean: 30.0 53 | 54 | 55 | 56 | 57 | ### Median🎯 58 | 59 | *Output*: 60 | 61 | Median: 30.0 62 | 63 | 64 | 65 | 66 | ### Mode🎲 67 | NumPy doesn't have a direct function for mode, but we can use *SciPy*. 68 | 69 | *Example*: 70 | 71 | python 72 | from scipy import stats 73 | 74 | data = [1, 2, 2, 3, 4] 75 | mode = stats.mode(data) 76 | print(f"Mode: {mode.mode[0]}, Count: {mode.count[0]}") 77 | 78 | 79 | *Output*: 80 | 81 | Mode: 2, Count: 2 82 | 83 | 84 | 85 | 86 | ### Standard Deviation📏 87 | The standard deviation measures how spread out the data is. 88 | 89 | *Example*: 90 | 91 | python 92 | data = [10, 20, 30, 40, 50] 93 | std_dev = np.std(data) 94 | print(f"Standard Deviation: {std_dev}") 95 | 96 | 97 | *Output*: 98 | 99 | Standard Deviation: 14.142135623730951 100 | 101 | 102 | 103 | 104 | ### Variance🔄 105 | Variance is the square of the standard deviation. 106 | 107 | *Example*: 108 | 109 | python 110 | variance = np.var(data) 111 | print(f"Variance: {variance}") 112 | 113 | 114 | *Output*: 115 | 116 | Variance: 200.0 117 | 118 | 119 | 120 | 121 | ### Percentile📊 122 | Percentiles divide data into 100 equal parts. 123 | 124 | *Example*: 125 | 126 | python 127 | percentile = np.percentile(data, 50) # 50th percentile is the median 128 | print(f"50th Percentile: {percentile}") 129 | 130 | 131 | *Output*: 132 | 133 | 50th Percentile: 30.0 134 | 135 | 136 | 137 | 138 | ## SciPy for Advanced Statistics⚡ 139 | 140 | SciPy builds on NumPy and provides additional statistical capabilities. 141 | 142 | ### Probability Distributions 143 | SciPy supports numerous probability distributions, such as normal, binomial, and uniform. 144 | 145 | *Example: Normal Distribution* 146 | 147 | python 148 | from scipy.stats import norm 149 | 150 | # Generate random data 151 | data = norm.rvs(loc=0, scale=1, size=1000) 152 | 153 | # Compute PDF 154 | x = np.linspace(-3, 3, 100) 155 | pdf = norm.pdf(x) 156 | 157 | print(f"First 5 PDF values: {pdf[:5]}") 158 | 159 | 160 | 161 | 162 | ### Hypothesis Testing🔍 163 | Hypothesis testing is used to test assumptions about data. 164 | 165 | *Example: t-Test* 166 | 167 | python 168 | from scipy.stats import ttest_1samp 169 | 170 | data = [2.5, 3.0, 2.8, 3.2, 3.0] 171 | t_stat, p_value = ttest_1samp(data, 3) 172 | print(f"T-statistic: {t_stat}, P-value: {p_value}") 173 | 174 | 175 | *Output*: 176 | 177 | T-statistic: -1.0, P-value: 0.374 178 | 179 | 180 | 181 | 182 | ### Linear Regression📉 183 | Linear regression fits a line to a set of data points. 
184 | 185 | *Example*: 186 | 187 | python 188 | from scipy.stats import linregress 189 | 190 | x = [1, 2, 3, 4, 5] 191 | y = [2, 4, 5, 4, 5] 192 | 193 | slope, intercept, r_value, p_value, std_err = linregress(x, y) 194 | print(f"Slope: {slope}, Intercept: {intercept}") 195 | 196 | 197 | *Output*: 198 | 199 | Slope: 0.6, Intercept: 2.2 200 | 201 | 202 | 203 | 204 | ## Practice Exercises📝 205 | 206 | 1. Calculate the *mean, **median, and **mode* of a dataset using NumPy and SciPy. 207 | 2. Use the norm distribution from SciPy to generate and visualize random data. 208 | 3. Perform a *t-test* on a sample dataset. 209 | 4. Fit a *linear regression model* to a dataset using SciPy. 210 | 211 | 212 | 213 | ## Summary🚀 214 | 215 | - *NumPy* provides essential statistical functions, including *mean, **median, **variance, and **standard deviation*. 216 | - *SciPy* offers advanced tools for working with *probability distributions, **hypothesis testing, and **regression analysis*. 217 | - These libraries form the foundation for statistical analysis in Python. 218 | 219 | --- 220 | -------------------------------------------------------------------------------- /14_Working with APIs and JSON/14_Working with APIs and JSON.md: -------------------------------------------------------------------------------- 1 | [<< Day 13](../13_Time%20Series%20Analysis%20Introduction/13_Time%20Series%20Analysis%20Introduction.md) | [Day 15 >>](../15_Regular%20Expressions/15_Regular%20Expressions.md) 2 | 3 | # 📘 Day 14: Working with APIs and JSON in Python 4 | 5 | Welcome to Day 14 of the **30 Days of Data Science** series! Today, we focus on understanding and interacting with **APIs** and handling **JSON** data. APIs and JSON are crucial in data science for accessing and managing external data. 6 | 7 | 8 | 9 | ## Table of Contents 10 | 11 | - [📘 Day 14: Working with APIs and JSON in Python](#-day-14-working-with-apis-and-json-in-python) 12 | - [1️⃣ APIs 📡](#1️⃣-apis-) 13 | - [What is an API?](#what-is-an-api) 14 | - [Making HTTP Requests](#making-http-requests) 15 | - [Example: Fetching Data from an API](#example-fetching-data-from-an-api) 16 | - [2️⃣ JSON: JavaScript Object Notation 📦](#2️⃣-json-javascript-object-notation-) 17 | - [What is JSON?](#what-is-json) 18 | - [Working with JSON in Python](#working-with-json-in-python) 19 | - [Example: Parsing JSON Data](#example-parsing-json-data) 20 | - [Example: Writing JSON Data](#example-writing-json-data) 21 | - [🧠 Practice Exercises](#-practice-exercises) 22 | - [🌟 Summary](#-summary) 23 | 24 | 25 | ## 1️⃣ APIs 📡 26 | 27 | ### What is an API? 28 | 29 | An **API (Application Programming Interface)** allows two systems to communicate with each other. In data science, APIs are used to access real-time data, such as weather updates, stock prices, and social media feeds. 30 | 31 | APIs typically return data in JSON format, which is lightweight and easy to parse. 32 | 33 | 34 | 35 | ### Making HTTP Requests 36 | 37 | To interact with an API, you need to make **HTTP requests**. Python's `requests` library simplifies this process. 38 | 39 | #### HTTP Methods: 40 | 41 | - **GET**: Retrieve data from an API. 42 | - **POST**: Send data to an API. 43 | - **PUT**: Update data on an API. 44 | - **DELETE**: Delete data on an API. 45 | 46 | 47 | 48 | ### Example: Fetching Data from an API 49 | 50 | Let’s fetch weather data using a public API. 
51 | 52 | ```python 53 | import requests 54 | 55 | # API Endpoint 56 | url = "https://api.open-meteo.com/v1/forecast" 57 | params = { 58 | "latitude": 37.7749, # Latitude for San Francisco 59 | "longitude": -122.4194, # Longitude for San Francisco 60 | "hourly": "temperature_2m", 61 | } 62 | 63 | response = requests.get(url, params=params) 64 | 65 | if response.status_code == 200: 66 | data = response.json() 67 | print("Weather Data:", data) 68 | else: 69 | print(f"Failed to fetch data. Status code: {response.status_code}") 70 | ``` 71 | 72 | **Explanation:** 73 | 1. We use the `requests.get()` method to send a GET request. 74 | 2. The `params` dictionary contains query parameters required by the API. 75 | 3. The response is checked for a 200 status code (success) and parsed as JSON. 76 | 77 | **Output:** 78 | 79 | ```plaintext 80 | Weather Data: { ...JSON data... } 81 | ``` 82 | 83 | 84 | 85 | ## 2️⃣ JSON: JavaScript Object Notation 📦 86 | 87 | ### What is JSON? 88 | 89 | **JSON (JavaScript Object Notation)** is a lightweight data format often used to send and receive data through APIs. JSON data is structured as key-value pairs. 90 | 91 | #### Example of JSON: 92 | 93 | ```json 94 | { 95 | "name": "Alice", 96 | "age": 25, 97 | "skills": ["Python", "Data Science"] 98 | } 99 | ``` 100 | 101 | 102 | 103 | ### Working with JSON in Python 104 | 105 | Python provides the `json` module to parse and create JSON data. 106 | 107 | 108 | 109 | ### Example: Parsing JSON Data 110 | 111 | Let’s parse JSON data from a string. 112 | 113 | ```python 114 | import json 115 | 116 | # JSON string 117 | json_data = ''' 118 | { 119 | "name": "Alice", 120 | "age": 25, 121 | "skills": ["Python", "Data Science"] 122 | } 123 | ''' 124 | 125 | # Parse JSON string to Python dictionary 126 | data = json.loads(json_data) 127 | 128 | print(data["name"]) # Output: Alice 129 | print(data["skills"]) # Output: ['Python', 'Data Science'] 130 | ``` 131 | 132 | **Explanation:** 133 | 134 | 1. The `json.loads()` method converts a JSON string into a Python dictionary. 135 | 2. You can access JSON data using keys. 136 | 137 | 138 | 139 | ### Example: Writing JSON Data 140 | 141 | You can write Python objects into JSON format. 142 | 143 | ```python 144 | import json 145 | 146 | # Python dictionary 147 | data = { 148 | "name": "Bob", 149 | "age": 30, 150 | "skills": ["Machine Learning", "Deep Learning"] 151 | } 152 | 153 | # Convert Python dictionary to JSON string 154 | json_string = json.dumps(data, indent=4) 155 | 156 | print(json_string) 157 | ``` 158 | 159 | **Explanation:** 160 | 161 | 1. The `json.dumps()` method converts a Python object to a JSON string. 162 | 2. The `indent` parameter makes the output more readable. 163 | 164 | **Output:** 165 | 166 | ```json 167 | { 168 | "name": "Bob", 169 | "age": 30, 170 | "skills": ["Machine Learning", "Deep Learning"] 171 | } 172 | ``` 173 | 174 | 175 | 176 | ## 🧠 Practice Exercises 177 | 178 | 1. Use the `requests` module to fetch data from an API of your choice. 179 | 2. Parse the JSON response to extract specific information. 180 | 3. Create a Python dictionary and write it as a JSON file. 181 | 4. Explore Python's `json` module documentation for advanced features. 182 | 183 | 184 | 185 | ## 🌟 Summary 186 | 187 | - APIs enable communication between different systems, making real-time data accessible. 188 | - The `requests` module simplifies sending HTTP requests. 189 | - JSON is a lightweight and popular format for data exchange. 
190 | - Python's `json` module makes it easy to parse and create JSON data. 191 | 192 | --- 193 | 194 | 195 | -------------------------------------------------------------------------------- /22_Decision Trees/22_Decision Trees.md: -------------------------------------------------------------------------------- 1 | [<< Day 21](../21_Clustering%20(K-Means)/21_Clustering%20(K-Means).md) | [Day 23 >>](../23_Handling%20Imbalanced%20Data/23_Handling%20Imbalanced%20Data.md) 2 | 3 | # 📘 Day 22: Decision Trees 4 | 5 | Welcome to **Day 22** of the 30 Days of Data Science series! Today, we will dive into **Decision Trees**, an intuitive and powerful algorithm for classification and regression tasks. We will also explore the implementation of `DecisionTreeClassifier` from the scikit-learn library in Python. 6 | 7 | 8 | 9 | ## Table of Contents 10 | 11 | - [🌳 What is a Decision Tree?](#-what-is-a-decision-tree) 12 | - [🛠️ Decision Trees in scikit-learn](#️-decision-trees-in-scikit-learn) 13 | - [🧠 Key Concepts](#-key-concepts) 14 | - [1️⃣ Splitting Criteria](#1️⃣-splitting-criteria) 15 | - [2️⃣ Overfitting and Pruning](#2️⃣-overfitting-and-pruning) 16 | - [🔨 Implementation](#-implementation) 17 | - [Example: Classifying Iris Dataset](#example-classifying-iris-dataset) 18 | - [🌟 Advantages and Limitations](#-advantages-and-limitations) 19 | - [Pros](#pros) 20 | - [Cons](#cons) 21 | - [🔗 Further Reading](#-further-reading) 22 | - [📚 Exercises](#-exercises) 23 | - [📜 Summary](#-summary) 24 | 25 | 26 | 27 | ## 🌳 What is a Decision Tree? 28 | 29 | A **Decision Tree** is a tree-like structure used to represent decisions and their possible consequences. It consists of nodes that split data based on features, ultimately leading to predictions. Key components include: 30 | 31 | - **Root Node**: The initial node containing the entire dataset. 32 | - **Internal Nodes**: Decision points splitting data based on feature conditions. 33 | - **Leaf Nodes**: Endpoints that provide predictions. 34 | 35 | ### How it Works: 36 | 1. The algorithm selects a feature and a threshold to split the dataset. 37 | 2. This process repeats recursively, creating branches. 38 | 3. The tree stops splitting when it meets a predefined condition (e.g., maximum depth). 39 | 40 | 41 | 42 | ## 🛠️ Decision Trees in scikit-learn 43 | 44 | scikit-learn provides the `DecisionTreeClassifier` for classification tasks. It allows fine-tuning with parameters like `criterion`, `max_depth`, and `min_samples_split`. 45 | 46 | ### Installation 47 | To use scikit-learn, ensure it is installed: 48 | 49 | ```bash 50 | pip install scikit-learn 51 | ``` 52 | 53 | 54 | 55 | ## 🧠 Key Concepts 56 | 57 | ### 1️⃣ Splitting Criteria 58 | 59 | Decision trees evaluate splits based on impurity measures like: 60 | 61 | - **Gini Impurity** (default in scikit-learn): 62 | Gini = $1 - \sum_{i=1}^n p_i^2$ 63 | 64 | - **Entropy** (Information Gain): 65 | Entropy = $-\sum_{i=1}^n p_i \log_2(p_i)$ 66 | 67 | You can specify the criterion when initializing the classifier: 68 | 69 | ```python 70 | from sklearn.tree import DecisionTreeClassifier 71 | clf = DecisionTreeClassifier(criterion='entropy') 72 | ``` 73 | 74 | ### 2️⃣ Overfitting and Pruning 75 | 76 | - **Overfitting**: Trees grow too deep, capturing noise instead of patterns. 77 | - **Pruning**: Limits tree growth by setting constraints like `max_depth` or `min_samples_split`. 
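A minimal sketch of what these constraints look like in code (the parameter values below are arbitrary illustrations, and `ccp_alpha` additionally enables scikit-learn's cost-complexity post-pruning):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop the tree from growing too deep or splitting tiny nodes.
# Post-pruning: ccp_alpha trims branches whose complexity outweighs their benefit.
pruned_clf = DecisionTreeClassifier(
    max_depth=4,            # limit overall tree depth
    min_samples_split=10,   # a node needs at least 10 samples before it can split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    ccp_alpha=0.01,         # cost-complexity pruning strength (0 = no pruning)
    random_state=42,
)
```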
78 | 79 | 80 | 81 | ## 🔨 Implementation 82 | 83 | ### Example: Classifying Iris Dataset 84 | 85 | ```python 86 | # Import necessary libraries 87 | from sklearn.datasets import load_iris 88 | from sklearn.model_selection import train_test_split 89 | from sklearn.tree import DecisionTreeClassifier, plot_tree 90 | import matplotlib.pyplot as plt 91 | 92 | # Load Iris dataset 93 | iris = load_iris() 94 | X, y = iris.data, iris.target 95 | 96 | # Split data into training and testing sets 97 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 98 | 99 | # Initialize DecisionTreeClassifier 100 | clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42) 101 | 102 | # Train the classifier 103 | clf.fit(X_train, y_train) 104 | 105 | # Evaluate the classifier 106 | accuracy = clf.score(X_test, y_test) 107 | print(f"Accuracy: {accuracy * 100:.2f}%") 108 | 109 | # Visualize the decision tree 110 | plt.figure(figsize=(12, 8)) 111 | plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True) 112 | plt.show() 113 | ``` 114 | 115 | ### Output Explanation: 116 | - The tree visualization shows feature splits and class probabilities at each node. 117 | - The accuracy metric evaluates model performance on test data. 118 | 119 | 120 | 121 | ## 🌟 Advantages and Limitations 122 | 123 | ### Pros: 124 | - Easy to understand and interpret. 125 | - Requires little data preprocessing. 126 | - Handles both numerical and categorical data. 127 | 128 | ### Cons: 129 | - Prone to overfitting. 130 | - Sensitive to small data changes. 131 | - Not suitable for very large datasets. 132 | 133 | 134 | 135 | ## 🔗 Further Reading 136 | - [scikit-learn Documentation: Decision Trees](https://scikit-learn.org/stable/modules/tree.html) 137 | 138 | 139 | 140 | ## 📚 Exercises 141 | 142 | 1. Train a decision tree on a different dataset (e.g., Wine Dataset) and compare Gini and Entropy criteria. 143 | 2. Experiment with hyperparameters like `min_samples_leaf` and `max_features` to reduce overfitting. 144 | 3. Visualize decision boundaries using 2D features. 145 | 146 | 147 | 148 | ## 📜 Summary 149 | 150 | In this session, you have learned: 151 | - The structure and working of decision trees. 152 | - Splitting criteria like Gini Impurity and Entropy. 153 | - The importance of pruning to avoid overfitting. 154 | - Implementing a decision tree using scikit-learn. 155 | 156 | Decision trees are a fundamental building block in machine learning and serve as the basis for ensemble methods like Random Forests and Gradient Boosted Trees. Keep experimenting with datasets to gain a deeper understanding. 157 | 158 | --- 159 | 160 | 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /24_Feature Engineering/24_Feature Engineering.md: -------------------------------------------------------------------------------- 1 | [<< Day 23](../23_Handling%20Imbalanced%20Data/23_Handling%20Imbalanced%20Data.md) | [Day 25 >>](../25_Model%20Evaluation%20and%20Metrics/25_Model%20Evaluation%20and%20Metrics.md) 2 | 3 | # 📚 Day 24: Feature Engineering 4 | 5 | Welcome to **Day 24** of the 30 Days of Data Science series! 🎉 Today, we delve into the critical concept of **Feature Engineering**, a cornerstone of building effective machine learning models. We will explore techniques such as **Encoding**, **Scaling**, and **Feature Selection** to prepare data for modeling. 
🔧🎨 6 | 7 | 8 | 9 | ## 📌 Table of Contents 10 | 11 | - [ 🎯Introduction to Feature Engineering](#introduction-to-feature-engineering) 12 | - [ 🔧Encoding Techniques](#encoding-techniques) 13 | - [ 🔐One-Hot Encoding](#one-hot-encoding) 14 | - [ 🔑Label Encoding](#label-encoding) 15 | - [ 🏦Target Encoding](#target-encoding) 16 | - [ 📈Scaling Features](#scaling-features) 17 | - [ 💸Standardization](#standardization) 18 | - [ 🪙Normalization](#normalization) 19 | - [ 🎯Feature Selection](#feature-selection) 20 | - [ 🔬Filter Methods](#filter-methods) 21 | - [ 🧠Wrapper Methods](#wrapper-methods) 22 | - [ 📊Embedded Methods](#embedded-methods) 23 | - [ 🖋Practice Exercises](#practice-exercises) 24 | - [ 📌Summary](#summary) 25 | 26 | 27 | 28 | ## 🎯Introduction to Feature Engineering 29 | 30 | **Feature Engineering** is the process of transforming raw data into meaningful features that improve the performance of machine learning models. It includes: 31 | 32 | - Encoding categorical variables 33 | - Scaling numerical features 34 | - Selecting the most relevant features 35 | 36 | Effective feature engineering leads to: 37 | - Improved model accuracy 38 | - Faster convergence during training 39 | - Reduced overfitting 40 | 41 | 42 | 43 | ## 🔧Encoding Techniques 44 | 45 | ### 🔐One-Hot Encoding 46 | One-hot encoding converts categorical variables into binary vectors. 47 | 48 | #### Example: 49 | ```python 50 | import pandas as pd 51 | 52 | # Sample data 53 | data = {'Color': ['Red', 'Green', 'Blue']} 54 | df = pd.DataFrame(data) 55 | 56 | # Apply one-hot encoding 57 | encoded_df = pd.get_dummies(df, columns=['Color']) 58 | print(encoded_df) 59 | ``` 60 | **Output:** 61 | ``` 62 | Color_Blue Color_Green Color_Red 63 | 0 0 0 1 64 | 1 0 1 0 65 | 2 1 0 0 66 | ``` 67 | 68 | ### 🔑Label Encoding 69 | Label encoding assigns unique integers to each category. 70 | 71 | #### Example: 72 | ```python 73 | from sklearn.preprocessing import LabelEncoder 74 | 75 | # Sample data 76 | labels = ['Red', 'Green', 'Blue'] 77 | encoder = LabelEncoder() 78 | encoded_labels = encoder.fit_transform(labels) 79 | print(encoded_labels) 80 | ``` 81 | **Output:** 82 | ``` 83 | [2 1 0] 84 | ``` 85 | 86 | ### 🏦Target Encoding 87 | Target encoding maps categories to the mean of the target variable. 88 | 89 | #### Example: 90 | ```python 91 | import pandas as pd 92 | 93 | # Sample data 94 | data = {'Category': ['A', 'B', 'A', 'C'], 'Target': [1, 0, 1, 0]} 95 | df = pd.DataFrame(data) 96 | 97 | def target_encode(column, target): 98 | return column.map(target.groupby(column).mean()) 99 | 100 | df['Encoded_Category'] = target_encode(df['Category'], df['Target']) 101 | print(df) 102 | ``` 103 | 104 | 105 | 106 | ## 📈Scaling Features 107 | 108 | ### 💸Standardization 109 | Standardization scales features to have a mean of 0 and a standard deviation of 1. 110 | 111 | #### Example: 112 | ```python 113 | from sklearn.preprocessing import StandardScaler 114 | import numpy as np 115 | 116 | # Sample data 117 | X = np.array([[1, 2], [3, 4], [5, 6]]) 118 | scaler = StandardScaler() 119 | scaled_X = scaler.fit_transform(X) 120 | print(scaled_X) 121 | ``` 122 | 123 | ### 🪙Normalization 124 | Normalization scales features to a range of [0, 1]. 
125 | 126 | #### Example: 127 | ```python 128 | from sklearn.preprocessing import MinMaxScaler 129 | 130 | # Sample data 131 | scaler = MinMaxScaler() 132 | norm_X = scaler.fit_transform(X) 133 | print(norm_X) 134 | ``` 135 | 136 | 137 | 138 | ## 🎯Feature Selection 139 | 140 | ### 🔬Filter Methods 141 | Filter methods use statistical tests to score and select features. 142 | 143 | #### Example: 144 | ```python 145 | from sklearn.feature_selection import SelectKBest, chi2 146 | 147 | # Sample data 148 | X = [[10, 20, 30], [20, 30, 40], [30, 40, 50]] 149 | y = [1, 0, 1] 150 | selector = SelectKBest(chi2, k=2) 151 | selected_X = selector.fit_transform(X, y) 152 | print(selected_X) 153 | ``` 154 | 155 | ### 🧠Wrapper Methods 156 | Wrapper methods use a predictive model to evaluate feature subsets. 157 | 158 | #### Example: 159 | ```python 160 | from sklearn.feature_selection import RFE 161 | from sklearn.ensemble import RandomForestClassifier 162 | 163 | # Sample data 164 | estimator = RandomForestClassifier() 165 | rfe = RFE(estimator, n_features_to_select=2) 166 | rfe.fit(X, y) 167 | print(rfe.support_) 168 | ``` 169 | 170 | ### 📊Embedded Methods 171 | Embedded methods perform feature selection during model training (e.g., Lasso). 172 | 173 | #### Example: 174 | ```python 175 | from sklearn.linear_model import Lasso 176 | 177 | # Sample data 178 | lasso = Lasso(alpha=0.01) 179 | lasso.fit(X, y) 180 | print(lasso.coef_) 181 | ``` 182 | 183 | 184 | 185 | ## 🖋Practice Exercises 186 | 187 | 1. Implement one-hot encoding and label encoding on a dataset of your choice. 188 | 2. Experiment with scaling techniques and observe their impact on a logistic regression model. 189 | 3. Apply SelectKBest and RFE on a dataset to compare their feature selection results. 190 | 191 | 192 | 193 | ## 📌Summary 194 | 195 | Today, we covered: 196 | 197 | - Encoding techniques for categorical variables. 198 | - Scaling methods to normalize numerical features. 199 | - Feature selection approaches to identify important features. 200 | 201 | Feature engineering is an art and science that significantly impacts the success of machine learning models. Keep exploring and practicing these techniques! 🚀 202 | 203 | --- 204 | -------------------------------------------------------------------------------- /17_Hypothesis Testing/17_Hypothesis Testing.md: -------------------------------------------------------------------------------- 1 | [<< Day 16](../16_Statistical%20Concepts/16_Statistical%20Concepts.md) | [Day 18 >>](../18_Basic%20Machine%20Learning%20Introduction/18_Basic%20Machine%20Learning%20Introduction.md) 2 | 3 | 4 | # 📘 Day 17: Hypothesis Testing 5 | 6 | Welcome to Day 17 of the **30 Days of Data Science** series! Today, we delve into **Hypothesis Testing**, a fundamental concept in statistics, widely used to make data-driven decisions. This session will focus on **t-tests** and **chi-square tests**, two commonly used techniques for hypothesis testing. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 17: Hypothesis Testing](#-day-17-hypothesis-testing) 13 | - [📌 Topics Covered](#-topics-covered) 14 | - [1️⃣ What is Hypothesis Testing? 
🧐](#1️⃣-what-is-hypothesis-testing-) 15 | - [Null and Alternative Hypotheses](#null-and-alternative-hypotheses) 16 | - [Steps in Hypothesis Testing](#steps-in-hypothesis-testing) 17 | - [2️⃣ t-Test 🧮](#2️⃣-t-test-) 18 | - [What is a t-Test?](#what-is-a-t-test) 19 | - [Types of t-Tests](#types-of-t-tests) 20 | - [Example: One-Sample t-Test](#example-one-sample-t-test) 21 | - [Example: Two-Sample t-Test](#example-two-sample-t-test) 22 | - [3️⃣ Chi-Square Test 🔢](#3️⃣-chi-square-test-) 23 | - [What is a Chi-Square Test?](#what-is-a-chi-square-test) 24 | - [Example: Chi-Square Test for Independence](#example-chi-square-test-for-independence) 25 | - [🧠 Practice Exercises](#-practice-exercises) 26 | - [🌟 Summary](#-summary) 27 | 28 | 29 | 30 | 31 | ## 📌 Topics Covered 32 | 33 | - **Hypothesis Testing**: Basics, importance, and applications. 34 | - **t-Tests**: Types and examples (one-sample, two-sample). 35 | - **Chi-Square Test**: Concepts and practical applications. 36 | 37 | 38 | 39 | ## 1️⃣ What is Hypothesis Testing? 🧐 40 | 41 | Hypothesis Testing is a statistical method used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for the entire population. 42 | 43 | ### Null and Alternative Hypotheses 44 | 45 | - **Null Hypothesis (H₀)**: Assumes no effect or no difference in the population. 46 | - **Alternative Hypothesis (H₁)**: Assumes a significant effect or difference exists. 47 | 48 | Example: 49 | 50 | - H₀: The average height of students is 5.5 feet. 51 | - H₁: The average height of students is not 5.5 feet. 52 | 53 | 54 | 55 | ### Steps in Hypothesis Testing 56 | 57 | 1. **State the hypotheses**: Define H₀ and H₁. 58 | 2. **Choose a significance level (α)**: Commonly 0.05. 59 | 3. **Select the appropriate test**: t-test, chi-square, etc. 60 | 4. **Calculate the test statistic**: Using the chosen method. 61 | 5. **Make a decision**: Compare the p-value to α. 62 | - p-value ≤ α: Reject H₀ (evidence supports H₁). 63 | - p-value > α: Fail to reject H₀. 64 | 65 | 66 | 67 | ## 2️⃣ t-Test 🧮 68 | 69 | ### What is a t-Test? 70 | 71 | A **t-test** is used to compare means and determine if the differences are statistically significant. It assumes that the data is normally distributed. 72 | 73 | ### Types of t-Tests 74 | 75 | 1. **One-Sample t-Test**: Compares the sample mean to a known value. 76 | 2. **Two-Sample t-Test**: Compares the means of two independent groups. 77 | 3. **Paired t-Test**: Compares means of the same group at different times. 78 | 79 | 80 | 81 | ### Example: One-Sample t-Test 82 | 83 | ```python 84 | from scipy.stats import ttest_1samp 85 | import numpy as np 86 | 87 | # Sample data 88 | data = [12, 15, 14, 10, 13, 12, 14, 15, 11] 89 | pop_mean = 13 90 | 91 | # Perform t-test 92 | t_stat, p_value = ttest_1samp(data, pop_mean) 93 | 94 | print(f"T-statistic: {t_stat}") 95 | print(f"P-value: {p_value}") 96 | ``` 97 | 98 | **Output:** 99 | 100 | ```plaintext 101 | T-statistic: -1.024 102 | P-value: 0.340 103 | ``` 104 | 105 | - Since p-value > 0.05, we fail to reject H₀. 
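The paired t-test (the third type listed above) follows the same pattern. Below is a minimal sketch using `scipy.stats.ttest_rel` on made-up before/after scores for the same group; the numbers are purely illustrative.

```python
from scipy.stats import ttest_rel

# Scores for the same subjects before and after an intervention (illustrative data)
before = [72, 68, 75, 71, 69, 70]
after = [75, 70, 78, 72, 71, 74]

# Paired t-test checks whether the mean of the differences is zero
t_stat, p_value = ttest_rel(before, after)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
```

As with the other tests, a p-value ≤ 0.05 would indicate a significant before/after difference.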
106 |
107 |
108 |
109 | ### Example: Two-Sample t-Test
110 |
111 | ```python
112 | from scipy.stats import ttest_ind
113 |
114 | # Two independent groups
115 | group1 = [22, 24, 19, 23, 21]
116 | group2 = [30, 29, 34, 28, 27]
117 |
118 | # Perform t-test
119 | t_stat, p_value = ttest_ind(group1, group2)
120 |
121 | print(f"T-statistic: {t_stat}")
122 | print(f"P-value: {p_value}")
123 | ```
124 |
125 | **Output:**
126 |
127 | ```plaintext
128 | T-statistic: -5.259
129 | P-value: 0.0008
130 | ```
131 |
132 | - Since p-value ≤ 0.05, we reject H₀ and conclude there is a significant difference.
133 |
134 |
135 |
136 | ## 3️⃣ Chi-Square Test 🔢
137 |
138 | ### What is a Chi-Square Test?
139 |
140 | The **Chi-Square Test** determines whether there is a significant association between categorical variables.
141 |
142 | ### Example: Chi-Square Test for Independence
143 |
144 | ```python
145 | import numpy as np
146 | from scipy.stats import chi2_contingency
147 |
148 | # Contingency table
149 | data = np.array([[50, 30], [20, 100]])
150 |
151 | # Perform chi-square test
152 | chi2, p, dof, expected = chi2_contingency(data)
153 |
154 | print(f"Chi-Square Statistic: {chi2}")
155 | print(f"P-value: {p}")
156 | print(f"Degrees of Freedom: {dof}")
157 | print("Expected Frequencies:")
158 | print(expected)
159 | ```
160 |
161 | **Output:**
162 |
163 | ```plaintext
164 | Chi-Square Statistic: 42.33
165 | P-value: 7.7e-11
166 | Degrees of Freedom: 1
167 | Expected Frequencies:
168 | [[28. 52.]
169 |  [42. 78.]]
170 | ```
171 |
172 | - Since p-value ≤ 0.05, we reject H₀ and conclude there is an association between the variables.
173 |
174 |
175 |
176 | ## 🧠 Practice Exercises
177 |
178 | 1. Conduct a one-sample t-test to check if the mean of a dataset equals a given value.
179 | 2. Perform a two-sample t-test on two independent datasets.
180 | 3. Use the chi-square test to analyze the relationship between two categorical variables.
181 |
182 |
183 |
184 | ## 🌟 Summary
185 |
186 | - Hypothesis testing involves comparing data against a null hypothesis.
187 | - t-tests assess differences in means for one or two groups.
188 | - Chi-square tests analyze associations between categorical variables.
189 | - Interpretation of p-values is crucial to making decisions in hypothesis testing.
190 |
191 | ---
192 |
193 |
--------------------------------------------------------------------------------
/19_Linear Regression/19_Linear Regression.md:
--------------------------------------------------------------------------------
1 | [<< Day 18](../18_Basic%20Machine%20Learning%20Introduction/18_Basic%20Machine%20Learning%20Introduction.md) | [Day 20 >>](../20_Logistic%20Regression/20_Logistic%20Regression.md)
2 |
3 |
4 |
5 | # 📘 Day 19: Linear Regression with Scikit-learn
6 |
7 | Welcome to Day 19 of the **30 Days of Data Science** series! Today, we delve into **Linear Regression**, one of the most fundamental and widely used algorithms in supervised machine learning. We'll be using Python's **Scikit-learn** library to implement and analyze Linear Regression models. 
8 | 9 | 10 | 11 | ## Table of Contents 12 | 13 | - [📘 Day 19: Linear Regression with Scikit-learn](#-day-19-linear-regression-with-scikit-learn) 14 | - [📌 Topics Covered](#-topics-covered) 15 | - [1️⃣ What is Linear Regression?](#1️⃣-what-is-linear-regression) 16 | - [Equation of Linear Regression](#equation-of-linear-regression) 17 | - [Use Cases of Linear Regression](#use-cases-of-linear-regression) 18 | - [2️⃣ Linear Regression in Scikit-learn](#2️⃣-linear-regression-in-scikit-learn) 19 | - [Dataset Overview](#dataset-overview) 20 | - [Steps to Implement Linear Regression](#steps-to-implement-linear-regression) 21 | - [3️⃣ Code Implementation](#3️⃣-code-implementation) 22 | - [1. Importing Libraries](#1-importing-libraries) 23 | - [2. Loading and Exploring the Dataset](#2-loading-and-exploring-the-dataset) 24 | - [3. Preparing the Data](#3-preparing-the-data) 25 | - [4. Training the Model](#4-training-the-model) 26 | - [5. Making Predictions](#5-making-predictions) 27 | - [6. Model Evaluation](#6-model-evaluation) 28 | - [4️⃣ Practice Exercises](#4️⃣-practice-exercises) 29 | - [🌟 Summary](#-summary) 30 | 31 | 32 | 33 | 34 | ## 📌 Topics Covered 35 | 36 | - Introduction to Linear Regression and its applications. 37 | - How to implement Linear Regression using Scikit-learn. 38 | - Steps to preprocess data for Linear Regression. 39 | - Evaluating a Linear Regression model. 40 | 41 | 42 | 43 | ## 1️⃣ What is Linear Regression? 44 | 45 | Linear Regression is a supervised learning algorithm used to predict a target variable (dependent variable) based on one or more input variables (independent variables). The goal is to establish a linear relationship between the variables. 46 | 47 | 48 | 49 | ### Equation of Linear Regression 50 | 51 | The equation of a simple linear regression line is: 52 | 53 | **y = β₀ + β₁x + ε** 54 | 55 | - **y**: Predicted value (target). 56 | - **β₀**: Intercept. 57 | - **β₁**: Coefficient (slope). 58 | - **x**: Input variable (feature). 59 | - **ε**: Error term. 60 | 61 | For multiple linear regression: 62 | 63 | **y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε** 64 | 65 | 66 | 67 | ### Use Cases of Linear Regression 68 | 69 | - Predicting house prices. 70 | - Estimating sales figures. 71 | - Analyzing the impact of marketing spend. 72 | - Forecasting demand. 73 | 74 | 75 | 76 | ## 2️⃣ Linear Regression in Scikit-learn 77 | 78 | Scikit-learn provides a ready-to-use implementation of Linear Regression through the `LinearRegression` class. Let’s go step by step. 79 | 80 | 81 | 82 | ### Dataset Overview 83 | 84 | We’ll use a sample dataset from Scikit-learn’s `datasets` module or custom data to understand Linear Regression. 85 | 86 | 87 | 88 | ### Steps to Implement Linear Regression 89 | 90 | 1. Import required libraries. 91 | 2. Load and explore the dataset. 92 | 3. Split the dataset into training and testing sets. 93 | 4. Train the Linear Regression model using the training data. 94 | 5. Make predictions using the testing data. 95 | 6. Evaluate the model using metrics like Mean Squared Error (MSE) and R-squared. 96 | 97 | 98 | 99 | ## 3️⃣ Code Implementation 100 | 101 | ### 1. Importing Libraries 102 | 103 | ```python 104 | import numpy as np 105 | import pandas as pd 106 | from sklearn.model_selection import train_test_split 107 | from sklearn.linear_model import LinearRegression 108 | from sklearn.metrics import mean_squared_error, r2_score 109 | ``` 110 | 111 | 112 | 113 | ### 2. 
Loading and Exploring the Dataset 114 | 115 | We use Scikit-learn’s `load_diabetes` dataset as an example. 116 | 117 | ```python 118 | from sklearn.datasets import load_diabetes 119 | 120 | # Load dataset 121 | data = load_diabetes() 122 | df = pd.DataFrame(data.data, columns=data.feature_names) 123 | df['target'] = data.target 124 | 125 | # Explore the data 126 | print(df.head()) 127 | ``` 128 | 129 | 130 | 131 | ### 3. Preparing the Data 132 | 133 | ```python 134 | # Features and target variable 135 | X = df.drop(columns=['target']) 136 | y = df['target'] 137 | 138 | # Splitting data into train and test sets 139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 140 | ``` 141 | 142 | 143 | 144 | ### 4. Training the Model 145 | 146 | ```python 147 | # Initialize the model 148 | model = LinearRegression() 149 | 150 | # Train the model 151 | model.fit(X_train, y_train) 152 | print("Model trained successfully!") 153 | ``` 154 | 155 | 156 | 157 | ### 5. Making Predictions 158 | 159 | ```python 160 | # Predict on the test set 161 | y_pred = model.predict(X_test) 162 | 163 | # Display first five predictions 164 | print("Predictions:", y_pred[:5]) 165 | ``` 166 | 167 | 168 | 169 | ### 6. Model Evaluation 170 | 171 | ```python 172 | # Mean Squared Error (MSE) 173 | mse = mean_squared_error(y_test, y_pred) 174 | print(f"Mean Squared Error: {mse}") 175 | 176 | # R-squared Score 177 | r2 = r2_score(y_test, y_pred) 178 | print(f"R-squared Score: {r2}") 179 | ``` 180 | 181 | 182 | 183 | ## 4️⃣ Practice Exercises 184 | 185 | 1. Load a custom dataset and implement Linear Regression. 186 | 2. Try normalizing the data before training the model—does it improve the performance? 187 | 3. Evaluate your model using additional metrics like Mean Absolute Error (MAE). 188 | 189 | 190 | 191 | ## 🌟 Summary 192 | 193 | - Linear Regression models a linear relationship between input and output variables. 194 | - Scikit-learn provides an easy-to-use interface to train and evaluate regression models. 195 | - Key metrics like MSE and R-squared help assess model performance. 196 | 197 | --- 198 | 199 | 200 | -------------------------------------------------------------------------------- /15_Regular Expressions/15_Regular Expressions.md: -------------------------------------------------------------------------------- 1 | [<< Day 14](../14_Working%20with%20APIs%20and%20JSON/14_Working%20with%20APIs%20and%20JSON.md) | [Day 16 >>](../16_Statistical%20Concepts/16_Statistical%20Concepts.md) 2 | 3 | 4 | # 📘 Day 15: Regular Expressions Using `re` Module 5 | 6 | Welcome to Day 15 of the **30 Days of Data Science** series! Today, we dive into the fascinating world of **Regular Expressions (regex)** in Python. Regex is a powerful tool for pattern matching and text processing, enabling you to extract, validate, or modify data efficiently. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 15: Regular Expressions Using `re` Module](#-day-15-regular-expressions-using-re-module) 13 | - [1️⃣ Introduction to Regular Expressions](#1️⃣-introduction-to-regular-expressions) 14 | - [2️⃣ Using the `re` Module](#2️⃣-using-the-re-module) 15 | - [Importing the Module](#importing-the-module) 16 | - [Basic Pattern Matching](#basic-pattern-matching) 17 | - [3️⃣ Common Regex Patterns](#3️⃣-common-regex-patterns) 18 | - [4️⃣ Useful `re` Functions](#4️⃣-useful-re-functions) 19 | - [1. `re.match`](#1-rematch) 20 | - [2. `re.search`](#2-research) 21 | - [3. `re.findall`](#3-refindall) 22 | - [4. 
`re.sub`](#4-resub) 23 | - [5. `re.split`](#5-resplit) 24 | - [5️⃣ Practice Exercises](#5️⃣-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | ## 1️⃣ Introduction to Regular Expressions 30 | 31 | A **Regular Expression (regex)** is a sequence of characters defining a search pattern. Regex is commonly used for tasks such as: 32 | 33 | - Validating data (e.g., email addresses, phone numbers) 34 | - Extracting specific information from text (e.g., dates, URLs) 35 | - Text replacements (e.g., removing unwanted characters) 36 | 37 | 38 | 39 | ## 2️⃣ Using the `re` Module 40 | 41 | Python provides the built-in `re` module to work with regular expressions. 42 | 43 | 44 | 45 | ### Importing the Module 46 | 47 | Before using regex, you need to import the `re` module: 48 | 49 | ```python 50 | import re 51 | ``` 52 | 53 | 54 | 55 | ### Basic Pattern Matching 56 | 57 | Regex patterns are enclosed in raw strings (`r""`) to avoid conflicts with Python's escape sequences. 58 | 59 | ```python 60 | import re 61 | 62 | pattern = r"data" 63 | text = "I love data science." 64 | 65 | # Check if the pattern exists in the text 66 | if re.search(pattern, text): 67 | print("Pattern found!") 68 | else: 69 | print("Pattern not found.") 70 | ``` 71 | 72 | **Output:** 73 | 74 | ```plaintext 75 | Pattern found! 76 | ``` 77 | 78 | 79 | 80 | ## 3️⃣ Common Regex Patterns 81 | 82 | Here are some common patterns and their meanings: 83 | 84 | | Pattern | Description | Example Match | 85 | |--------------|--------------------------------------|-------------------| 86 | | `.` | Any character except newline | `a.c` matches `abc` | 87 | | `\d` | Any digit (0–9) | `\d` matches `3` | 88 | | `\w` | Any word character (a-z, A-Z, 0-9) | `\w` matches `a` | 89 | | `\s` | Any whitespace (space, tab, newline) | `\s` matches ` ` | 90 | | `^` | Start of a string | `^hello` matches `hello world` | 91 | | `$` | End of a string | `world$` matches `hello world` | 92 | | `*` | Zero or more repetitions | `ab*` matches `a`, `ab`, `abb` | 93 | | `+` | One or more repetitions | `ab+` matches `ab`, `abb` | 94 | | `?` | Zero or one repetition | `ab?` matches `a`, `ab` | 95 | | `{n,m}` | Between n and m repetitions | `a{2,4}` matches `aa`, `aaa`, `aaaa` | 96 | 97 | 98 | 99 | ## 4️⃣ Useful `re` Functions 100 | 101 | The `re` module provides several functions for pattern matching and manipulation: 102 | 103 | ### 1. `re.match` 104 | 105 | Checks if the pattern matches at the **beginning** of the string. 106 | 107 | ```python 108 | import re 109 | 110 | text = "data science is amazing" 111 | match = re.match(r"data", text) 112 | 113 | if match: 114 | print(f"Matched: {match.group()}") 115 | else: 116 | print("No match found.") 117 | ``` 118 | 119 | **Output:** 120 | 121 | ```plaintext 122 | Matched: data 123 | ``` 124 | 125 | 126 | 127 | ### 2. `re.search` 128 | 129 | Searches the entire string for a match. 130 | 131 | ```python 132 | import re 133 | 134 | text = "I love data science" 135 | search = re.search(r"data", text) 136 | 137 | if search: 138 | print(f"Found: {search.group()}") 139 | ``` 140 | 141 | **Output:** 142 | 143 | ```plaintext 144 | Found: data 145 | ``` 146 | 147 | 148 | 149 | ### 3. `re.findall` 150 | 151 | Returns all occurrences of the pattern in the string. 
152 | 153 | ```python 154 | import re 155 | 156 | text = "data science involves data and more data" 157 | matches = re.findall(r"data", text) 158 | print(f"Occurrences: {matches}") 159 | ``` 160 | 161 | **Output:** 162 | 163 | ```plaintext 164 | Occurrences: ['data', 'data', 'data'] 165 | ``` 166 | 167 | 168 | 169 | ### 4. `re.sub` 170 | 171 | Replaces occurrences of the pattern with a specified string. 172 | 173 | ```python 174 | import re 175 | 176 | text = "data science is amazing" 177 | result = re.sub(r"data", "AI", text) 178 | print(result) 179 | ``` 180 | 181 | **Output:** 182 | 183 | ```plaintext 184 | AI science is amazing 185 | ``` 186 | 187 | 188 | 189 | ### 5. `re.split` 190 | 191 | Splits the string by the pattern. 192 | 193 | ```python 194 | import re 195 | 196 | text = "data-science-is-fun" 197 | result = re.split(r"-", text) 198 | print(result) 199 | ``` 200 | 201 | **Output:** 202 | 203 | ```plaintext 204 | ['data', 'science', 'is', 'fun'] 205 | ``` 206 | 207 | 208 | 209 | ## 5️⃣ Practice Exercises 210 | 211 | 1. Validate an email address using regex. 212 | 2. Extract all numbers from a given text. 213 | 3. Replace all whitespace in a string with underscores. 214 | 4. Split a paragraph into sentences. 215 | 216 | 217 | 218 | ## 🌟 Summary 219 | 220 | - Regular expressions are powerful for pattern matching and text manipulation. 221 | - Python's `re` module provides various functions like `match`, `search`, `findall`, `sub`, and `split`. 222 | - Familiarize yourself with common regex patterns for effective text processing. 223 | 224 | --- 225 | 226 | 227 | 228 | 229 | -------------------------------------------------------------------------------- /25_Model Evaluation and Metrics/25_Model Evaluation and Metrics.md: -------------------------------------------------------------------------------- 1 | [<< Day 24](../24_Feature%20Engineering/24_Feature%20Engineering.md) | [Day 26 >>](../26_Advanced%20ML%3A%20Hyperparameter%20Tuning/26_Advanced%20ML%3A%20Hyperparameter%20Tuning.md) 2 | 3 | 4 | # 📊 Day 25 - Model Evaluation and Metrics 5 | 6 | Welcome to **Day 25** of the **30 Days of Data Science** series! 🎉 Today, we will dive into the critical topic of **Model Evaluation and Metrics**. Understanding how to evaluate the performance of a machine learning model is essential to ensure its effectiveness and reliability in real-world scenarios. Let's explore topics like **Confusion Matrix** and **ROC-AUC Curve** with hands-on examples! 🚀 7 | 8 | 9 | 10 | ## 📋 Table of Contents 11 | 12 | - [📊 Day 25 - Model Evaluation and Metrics](#-day-25---model-evaluation-and-metrics) 13 | - [📋 Table of Contents](#-table-of-contents) 14 | - [🔍 Introduction to Model Evaluation](#-introduction-to-model-evaluation) 15 | - [📉 Confusion Matrix](#-confusion-matrix) 16 | - [🔢 Key Metrics Derived from the Confusion Matrix](#-key-metrics-derived-from-the-confusion-matrix) 17 | - [📈 ROC and AUC](#-roc-and-auc) 18 | - [📚 Practice Exercises](#-practice-exercises) 19 | - [📜 Summary](#-summary) 20 | 21 | 22 | 23 | ## 🔍 Introduction to Model Evaluation 24 | 25 | Model evaluation is a crucial step in machine learning to ensure that the model generalizes well to unseen data. The goal is to evaluate both the predictive power and robustness of a model using appropriate metrics. 26 | 27 | Evaluation often involves splitting data into **training** and **test sets** or employing **cross-validation** techniques. 
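As a quick illustration of cross-validation, scikit-learn's `cross_val_score` evaluates a model on several train/test folds in one call. The snippet below is only a sketch; the random forest and the Iris data are placeholder choices for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation: each fold is held out once for evaluation
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```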
Some common evaluation metrics include: 28 | 29 | - **Accuracy**: Percentage of correctly classified instances. 30 | - **Precision, Recall, and F1-Score**: Useful for imbalanced datasets. 31 | - **ROC-AUC**: Measures a model's ability to distinguish between classes. 32 | 33 | In this section, we'll focus on two widely-used tools for model evaluation: **Confusion Matrix** and **ROC-AUC Curve**. 34 | 35 | 36 | 37 | ## 📉 Confusion Matrix 38 | 39 | A **Confusion Matrix** is a tabular representation of actual versus predicted classifications. It is used to visualize the performance of classification models. The matrix contains four key components: 40 | 41 | | | Predicted Positive | Predicted Negative | 42 | |----------------|--------------------|--------------------| 43 | | **Actual Positive** | True Positive (TP) | False Negative (FN) | 44 | | **Actual Negative** | False Positive (FP) | True Negative (TN) | 45 | 46 | ### 🔢 Key Metrics Derived from the Confusion Matrix 47 | 48 | 1. **Accuracy**: 49 | Accuracy = (TP + TN) / (TP + TN + FP + FN) 50 | 51 | 2. **Precision**: Measures the accuracy of positive predictions. 52 | Precision = TP / (TP + FP) 53 | 54 | 3. **Recall (Sensitivity)**: Measures the ability to detect positive samples. 55 | Recall = TP / (TP + FN) 56 | 57 | 4. **F1 Score**: Harmonic mean of Precision and Recall. 58 | F1 = 2 * (Precision * Recall) / (Precision + Recall) 59 | 60 | 61 | 62 | ### Example: Confusion Matrix in Scikit-Learn 63 | 64 | ```python 65 | from sklearn.metrics import confusion_matrix, classification_report 66 | from sklearn.model_selection import train_test_split 67 | from sklearn.ensemble import RandomForestClassifier 68 | from sklearn.datasets import load_iris 69 | 70 | # Load dataset and split 71 | data = load_iris() 72 | X, y = data.data, data.target 73 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 74 | 75 | # Train model 76 | model = RandomForestClassifier(random_state=42) 77 | model.fit(X_train, y_train) 78 | 79 | # Predictions 80 | y_pred = model.predict(X_test) 81 | 82 | # Confusion Matrix 83 | cm = confusion_matrix(y_test, y_pred) 84 | print("Confusion Matrix:\n", cm) 85 | 86 | # Classification Report 87 | print("\nClassification Report:\n", classification_report(y_test, y_pred)) 88 | ``` 89 | 90 | 91 | 92 | ## 📈 ROC and AUC 93 | 94 | The **Receiver Operating Characteristic (ROC) curve** and the **Area Under the Curve (AUC)** are used to evaluate the performance of binary classification models. The ROC curve plots: 95 | 96 | - **True Positive Rate (TPR)** (Sensitivity) on the Y-axis. 97 | - **False Positive Rate (FPR)** on the X-axis. 98 | 99 | The **AUC** provides a single scalar value to summarize the ROC curve. A model with an AUC of **1.0** indicates perfect classification, while **0.5** suggests random guessing. 
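For reference, the two rates plotted on the ROC curve come from the same confusion-matrix counts defined earlier:

- **TPR (Recall)** = TP / (TP + FN)
- **FPR** = FP / (FP + TN)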
100 | 101 | ### Example: Plotting ROC-AUC in Scikit-Learn 102 | 103 | ```python 104 | from sklearn.metrics import roc_curve, auc 105 | import matplotlib.pyplot as plt 106 | from sklearn.linear_model import LogisticRegression 107 | 108 | # Train binary classifier 109 | binary_model = LogisticRegression() 110 | binary_model.fit(X_train, y_train == 1) # Simplify to binary classification 111 | 112 | # Get predicted probabilities 113 | y_probs = binary_model.predict_proba(X_test)[:, 1] 114 | 115 | # Calculate ROC curve 116 | fpr, tpr, _ = roc_curve(y_test == 1, y_probs) 117 | roc_auc = auc(fpr, tpr) 118 | 119 | # Plot ROC Curve 120 | plt.figure(figsize=(8, 6)) 121 | plt.plot(fpr, tpr, color='blue', label=f"ROC Curve (AUC = {roc_auc:.2f})") 122 | plt.plot([0, 1], [0, 1], color='red', linestyle='--') 123 | plt.xlabel("False Positive Rate") 124 | plt.ylabel("True Positive Rate") 125 | plt.title("ROC Curve") 126 | plt.legend(loc="lower right") 127 | plt.show() 128 | ``` 129 | 130 | 131 | 132 | ## 📚 Practice Exercises 133 | 134 | 1. Load a dataset of your choice and split it into training and test sets. 135 | 2. Train a classification model and compute the confusion matrix. 136 | 3. Use scikit-learn to plot the ROC curve and calculate the AUC for a binary classification task. 137 | 138 | 139 | 140 | ## 📜 Summary 141 | 142 | In **Day 25**, we explored the importance of **Model Evaluation and Metrics** in machine learning. We delved into: 143 | 144 | - **Confusion Matrix**: A vital tool to understand classification results. 145 | - **Derived Metrics**: Accuracy, Precision, Recall, and F1 Score. 146 | - **ROC-AUC Curve**: For evaluating binary classifiers. 147 | 148 | Mastering these concepts will ensure you can effectively measure and improve the performance of machine learning models. Keep practicing, and we'll see you tomorrow for Day 26! 🚀 149 | 150 | --- 151 | -------------------------------------------------------------------------------- /29_Working with Big Data/29_Working with Big Data.md: -------------------------------------------------------------------------------- 1 | [<< Day 28](../28_Time%20Series%20Forecasting/28_Time%20Series%20Forecasting.md) | [Day 30 >>](../30_Building%20a%20Data%20Science%20Pipeline/30_Building%20a%20Data%20Science%20Pipeline.md) 2 | 3 | 4 | 5 | # 🗓️ Day 29: Working with Big Data 🚀 6 | 7 | Welcome to **Day 29** of the **30 Days of Data Science** series! Today, we delve into the exciting world of **Big Data** and learn about **PySpark Basics**, along with related topics such as **Partitioning in Big Data** and **Handling Missing Data**. 8 | 9 | 10 | 11 | ## 📚 Table of Contents 12 | - [🌟 Introduction to Big Data](#-introduction-to-big-data) 13 | - [🔥 What is Apache Spark?](#-what-is-apache-spark) 14 | - [🐍 Why PySpark?](#-why-pyspark) 15 | - [⚙️ Setting Up PySpark](#️-setting-up-pyspark) 16 | - [Installation](#installation) 17 | - [Setting Up Your Environment](#setting-up-your-environment) 18 | - [📝 PySpark Basics](#-pyspark-basics) 19 | - [Creating an RDD](#creating-an-rdd) 20 | - [Transformations and Actions](#transformations-and-actions) 21 | - [📁 Partitioning in Big Data](#-partitioning-in-big-data) 22 | - [📉 Handling Missing Data in Big Data](#-handling-missing-data-in-big-data) 23 | - [💡 Practice Exercise](#-practice-exercise) 24 | - [📜 Summary](#-summary) 25 | 26 | 27 | 28 | ## 🌟 Introduction to Big Data 29 | 30 | Big Data refers to data that is so large, fast, or complex that traditional data processing methods cannot efficiently process it. 
Key characteristics include: 31 | 32 | - **Volume**: Huge amounts of data. 33 | - **Velocity**: High speed at which data is generated. 34 | - **Variety**: Different forms like structured, unstructured, and semi-structured data. 35 | 36 | 37 | 38 | ## 🔥 What is Apache Spark? 39 | 40 | **Apache Spark** is an open-source, distributed computing system designed for fast and scalable processing of large datasets. Key features include: 41 | 42 | - **Speed**: Processes data 100x faster than Hadoop MapReduce. 43 | - **Ease of Use**: APIs in Python, Java, Scala, and R. 44 | - **Versatility**: Supports SQL, streaming, machine learning, and graph processing. 45 | 46 | 47 | 48 | ## 🐍 Why PySpark? 49 | 50 | PySpark is the Python API for Apache Spark. It allows Python developers to leverage Spark's distributed computing capabilities with Pythonic simplicity. 51 | 52 | - Easy to learn for Python developers. 53 | - Integrates seamlessly with Python libraries like Pandas and NumPy. 54 | 55 | 56 | 57 | ## ⚙️ Setting Up PySpark 58 | 59 | ### Installation 60 | 61 | To install PySpark, use pip: 62 | 63 | ```bash 64 | pip install pyspark 65 | ``` 66 | 67 | ### Setting Up Your Environment 68 | 69 | 1. Install Java Development Kit (JDK). Spark requires Java 8 or higher. 70 | 2. Verify the installation: 71 | 72 | ```bash 73 | java -version 74 | ``` 75 | 76 | 3. Launch PySpark from the terminal: 77 | 78 | ```bash 79 | pyspark 80 | ``` 81 | 82 | 83 | 84 | ## 📝 PySpark Basics 85 | 86 | ### Creating an RDD 87 | 88 | An **RDD (Resilient Distributed Dataset)** is the fundamental data structure in Spark. You can create an RDD in PySpark as follows: 89 | 90 | ```python 91 | from pyspark import SparkContext 92 | 93 | # Initialize SparkContext 94 | sc = SparkContext("local", "Day 29 Example") 95 | 96 | # Create an RDD 97 | data = [1, 2, 3, 4, 5] 98 | rdd = sc.parallelize(data) 99 | 100 | print("RDD Elements:", rdd.collect()) 101 | ``` 102 | 103 | ### Transformations and Actions 104 | 105 | - **Transformations** create a new RDD from an existing one. Examples: `map`, `filter`. 106 | - **Actions** perform operations and return results. Examples: `collect`, `count`. 107 | 108 | #### Example: Map and Filter 109 | 110 | ```python 111 | # Transformation: Map 112 | squared_rdd = rdd.map(lambda x: x ** 2) 113 | 114 | # Transformation: Filter 115 | filtered_rdd = squared_rdd.filter(lambda x: x > 10) 116 | 117 | # Action: Collect 118 | result = filtered_rdd.collect() 119 | print("Filtered Result:", result) 120 | ``` 121 | 122 | #### Example: Reduce 123 | 124 | ```python 125 | # Action: Reduce 126 | sum_result = rdd.reduce(lambda x, y: x + y) 127 | print("Sum of RDD Elements:", sum_result) 128 | ``` 129 | 130 | 131 | 132 | ## 📁 Partitioning in Big Data 133 | 134 | Partitioning refers to splitting data into smaller chunks to be processed in parallel. In PySpark, partitioning is essential for optimizing performance. 135 | 136 | ### Example: Partitioning Data 137 | 138 | ```python 139 | # Create an RDD with 4 partitions 140 | partitioned_rdd = sc.parallelize(data, 4) 141 | print("Number of Partitions:", partitioned_rdd.getNumPartitions()) 142 | ``` 143 | 144 | ### Repartitioning 145 | 146 | You can repartition an RDD to increase or decrease the number of partitions. 
147 | 148 | ```python 149 | # Repartitioning 150 | repartitioned_rdd = partitioned_rdd.repartition(2) 151 | print("New Number of Partitions:", repartitioned_rdd.getNumPartitions()) 152 | ``` 153 | 154 | 155 | 156 | ## 📉 Handling Missing Data in Big Data 157 | 158 | Big Data often contains missing or null values. PySpark provides tools to handle missing data efficiently. 159 | 160 | ### Example: Handling Null Values in a DataFrame 161 | 162 | ```python 163 | from pyspark.sql import SparkSession 164 | 165 | # Initialize SparkSession 166 | spark = SparkSession.builder.appName("MissingDataExample").getOrCreate() 167 | 168 | # Create a DataFrame with missing values 169 | data = [("Alice", 34), (None, 29), ("Bob", None)] 170 | columns = ["Name", "Age"] 171 | df = spark.createDataFrame(data, columns) 172 | 173 | # Drop rows with null values 174 | df_cleaned = df.dropna() 175 | df_cleaned.show() 176 | ``` 177 | 178 | ### Filling Missing Values 179 | 180 | ```python 181 | # Fill missing values with a default 182 | df_filled = df.fillna({"Name": "Unknown", "Age": 0}) 183 | df_filled.show() 184 | ``` 185 | 186 | 187 | 188 | ## 💡 Practice Exercise 189 | 190 | **Task**: Using PySpark, create an RDD and perform the following: 191 | 192 | 1. Partition the RDD into 3 partitions. 193 | 2. Apply a transformation to multiply each element by 10. 194 | 3. Filter the elements greater than 20. 195 | 4. Collect the results. 196 | 197 | 198 | 199 | ## 📜 Summary 200 | 201 | Today, we explored: 202 | 203 | - The fundamentals of **Big Data** and its challenges. 204 | - **PySpark Basics**, including RDD creation and transformations. 205 | - **Partitioning** for efficient data processing. 206 | - **Handling Missing Data** in PySpark. 207 | 208 | --- 209 | -------------------------------------------------------------------------------- /18_Basic Machine Learning Introduction/18_Basic Machine Learning Introduction.md: -------------------------------------------------------------------------------- 1 | [<< Day 17](../17_Hypothesis%20Testing/17_Hypothesis%20Testing.md) | [Day 19 >>](../19_Linear%20Regression/19_Linear%20Regression.md) 2 | 3 | 4 | # 📘 Day 18: Basic Machine Learning Introduction and Scikit-learn Basics 5 | 6 | Welcome to Day 18 of the **30 Days of Data Science** series! Today, we explore the basics of **Machine Learning** and an essential library for implementing ML models in Python: **Scikit-learn**. This session will set the foundation for understanding ML concepts and applying them in practice. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 18: Basic Machine Learning Introduction and Scikit-learn Basics](#-day-18-basic-machine-learning-introduction-and-scikit-learn-basics) 13 | - [📌 Topics Covered](#-topics-covered) 14 | - [1️⃣ What is Machine Learning?](#1️⃣-what-is-machine-learning) 15 | - [Types of Machine Learning](#types-of-machine-learning) 16 | - [2️⃣ Introduction to Scikit-learn](#2️⃣-introduction-to-scikit-learn) 17 | - [Installing Scikit-learn](#installing-scikit-learn) 18 | - [Scikit-learn Basics](#scikit-learn-basics) 19 | - [3️⃣ Example: Linear Regression with Scikit-learn](#3️⃣-example-linear-regression-with-scikit-learn) 20 | - [🧠 Practice Exercises](#-practice-exercises) 21 | - [🌟 Summary](#-summary) 22 | 23 | 24 | 25 | 26 | ## 📌 Topics Covered 27 | 28 | - What is Machine Learning? 29 | - Types of Machine Learning: Supervised, Unsupervised, and Reinforcement Learning. 30 | - Introduction to Scikit-learn, a machine learning library in Python. 
31 | - Example: Linear Regression using Scikit-learn. 32 | 33 | 34 | 35 | ## 1️⃣ What is Machine Learning? 36 | 37 | **Machine Learning (ML)** is a subset of artificial intelligence (AI) that enables systems to learn and improve from data without being explicitly programmed. 38 | 39 | ### Key Concepts: 40 | - **Data**: ML algorithms are trained using historical data. 41 | - **Model**: A mathematical representation of the problem to make predictions or decisions. 42 | - **Training**: The process of feeding data into the model to learn patterns. 43 | 44 | 45 | 46 | ### Types of Machine Learning 47 | 48 | 1. **Supervised Learning**: 49 | - Input data (features) and output labels (target) are provided. 50 | - Goal: Learn a mapping from input to output. 51 | - Examples: Regression, Classification. 52 | 53 | 2. **Unsupervised Learning**: 54 | - Only input data is provided, no output labels. 55 | - Goal: Discover hidden patterns or groupings. 56 | - Examples: Clustering, Dimensionality Reduction. 57 | 58 | 3. **Reinforcement Learning**: 59 | - Agents learn by interacting with the environment and receiving feedback (rewards or penalties). 60 | - Examples: Game playing, Robotics. 61 | 62 | 63 | 64 | ## 2️⃣ Introduction to Scikit-learn 65 | 66 | **Scikit-learn** is a Python library for implementing machine learning algorithms. It provides simple and efficient tools for predictive data analysis. 67 | 68 | ### Key Features: 69 | - Built-in algorithms for supervised and unsupervised learning. 70 | - Tools for model evaluation, preprocessing, and pipeline creation. 71 | - Compatible with other Python libraries like NumPy and pandas. 72 | 73 | 74 | 75 | ### Installing Scikit-learn 76 | 77 | Before using Scikit-learn, ensure it is installed in your environment. Use the following command: 78 | 79 | ```bash 80 | pip install scikit-learn 81 | ``` 82 | 83 | 84 | 85 | ### Scikit-learn Basics 86 | 87 | 1. **Loading a Dataset**: 88 | Scikit-learn comes with several built-in datasets. 89 | 90 | ```python 91 | from sklearn.datasets import load_iris 92 | 93 | iris = load_iris() 94 | print(iris.keys()) # Output: Keys like 'data', 'target', etc. 95 | ``` 96 | 97 | 2. **Splitting Data**: 98 | Use `train_test_split` to divide data into training and testing sets. 99 | 100 | ```python 101 | from sklearn.model_selection import train_test_split 102 | 103 | X_train, X_test, y_train, y_test = train_test_split( 104 | iris.data, iris.target, test_size=0.2, random_state=42 105 | ) 106 | ``` 107 | 108 | 3. **Training a Model**: 109 | Fit a model using the training data. 110 | 111 | ```python 112 | from sklearn.ensemble import RandomForestClassifier 113 | 114 | clf = RandomForestClassifier() 115 | clf.fit(X_train, y_train) 116 | ``` 117 | 118 | 4. **Making Predictions**: 119 | Use the trained model to make predictions. 120 | 121 | ```python 122 | predictions = clf.predict(X_test) 123 | print(predictions) 124 | ``` 125 | 126 | 5. **Evaluating a Model**: 127 | Measure accuracy or other metrics. 128 | 129 | ```python 130 | from sklearn.metrics import accuracy_score 131 | 132 | accuracy = accuracy_score(y_test, predictions) 133 | print(f"Accuracy: {accuracy}") 134 | ``` 135 | 136 | 137 | 138 | ## 3️⃣ Example: Linear Regression with Scikit-learn 139 | 140 | Let’s build a **Linear Regression** model to predict house prices. 
141 | 142 | ```python 143 | from sklearn.datasets import fetch_california_housing 144 | from sklearn.model_selection import train_test_split 145 | from sklearn.linear_model import LinearRegression 146 | from sklearn.metrics import mean_squared_error 147 | 148 | # Load the dataset 149 | data = fetch_california_housing() 150 | X, y = data.data, data.target 151 | 152 | # Split into training and testing sets 153 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 154 | 155 | # Create and train the model 156 | model = LinearRegression() 157 | model.fit(X_train, y_train) 158 | 159 | # Make predictions 160 | predictions = model.predict(X_test) 161 | 162 | # Evaluate the model 163 | mse = mean_squared_error(y_test, predictions) 164 | print(f"Mean Squared Error: {mse}") 165 | ``` 166 | 167 | **Output Example:** 168 | 169 | ```plaintext 170 | Mean Squared Error: 0.5401 171 | ``` 172 | 173 | 174 | 175 | ## 🧠 Practice Exercises 176 | 177 | 1. Use the `load_wine` dataset from Scikit-learn and train a Decision Tree Classifier. 178 | 2. Build a K-Means clustering model on synthetic data using Scikit-learn. 179 | 3. Experiment with different test sizes in the `train_test_split` function and observe the impact on performance. 180 | 181 | 182 | 183 | ## 🌟 Summary 184 | 185 | - Machine Learning enables systems to learn from data and make predictions. 186 | - Scikit-learn simplifies the implementation of ML algorithms with its tools and datasets. 187 | - Linear Regression is a basic but powerful algorithm to understand supervised learning. 188 | 189 | --- 190 | 191 | 192 | -------------------------------------------------------------------------------- /12_SQL for Data Retrieval/12_SQL for Data Retrieval.md: -------------------------------------------------------------------------------- 1 | [<< Day 11](../11_Advanced%20Data%20Visualization/11_Advanced%20Data%20Visualization.md) | [Day 13 >>](../13_Time%20Series%20Analysis%20Introduction/13_Time%20Series%20Analysis%20Introduction.md) 2 | 3 | 4 | # 📘 Day 12: SQL for Data Retrieval using SQLite3 and SQLAlchemy 5 | 6 | Welcome to Day 12 of the **30 Days of Data Science** series! Today, we focus on **SQL for data retrieval**—a crucial skill for any data scientist. We’ll explore how to use Python libraries **sqlite3** and **SQLAlchemy** to interact with databases and fetch meaningful insights. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 12: SQL for Data Retrieval using SQLite3 and SQLAlchemy](#-day-12-sql-for-data-retrieval-using-sqlite3-and-sqlalchemy) 13 | - [1️⃣ Introduction to SQL and Databases](#1️⃣-introduction-to-sql-and-databases) 14 | - [2️⃣ Using SQLite3 for Data Retrieval](#2️⃣-using-sqlite3-for-data-retrieval) 15 | - [Creating a Database](#creating-a-database) 16 | - [Inserting Data](#inserting-data) 17 | - [Querying Data](#querying-data) 18 | - [Example: Fetching Data with Conditions](#example-fetching-data-with-conditions) 19 | - [3️⃣ Using SQLAlchemy for Data Retrieval](#3️⃣-using-sqlalchemy-for-data-retrieval) 20 | - [Setting Up SQLAlchemy](#setting-up-sqlalchemy) 21 | - [Defining a Table Model](#defining-a-table-model) 22 | - [Adding and Querying Data](#adding-and-querying-data) 23 | - [🧠 Practice Exercises](#-practice-exercises) 24 | - [🌟 Summary](#-summary) 25 | 26 | 27 | 28 | 29 | ## 1️⃣ Introduction to SQL and Databases 30 | 31 | **SQL (Structured Query Language)** is used to interact with databases. It allows us to: 32 | - Store data in tables (rows and columns). 
33 | - Retrieve data using queries. 34 | - Filter, aggregate, and manipulate data. 35 | 36 | 37 | 38 | ## 2️⃣ Using SQLite3 for Data Retrieval 39 | 40 | SQLite3 is a lightweight database engine built into Python. 41 | 42 | 43 | 44 | ### Creating a Database 45 | 46 | You can create a new database or connect to an existing one using `sqlite3.connect()`. 47 | 48 | ```python 49 | import sqlite3 50 | 51 | # Connect to SQLite database (or create one) 52 | connection = sqlite3.connect("example.db") 53 | 54 | # Create a cursor object to execute SQL commands 55 | cursor = connection.cursor() 56 | 57 | # Create a table 58 | cursor.execute(""" 59 | CREATE TABLE IF NOT EXISTS employees ( 60 | id INTEGER PRIMARY KEY AUTOINCREMENT, 61 | name TEXT, 62 | age INTEGER, 63 | department TEXT 64 | ) 65 | """) 66 | 67 | # Commit changes and close the connection 68 | connection.commit() 69 | connection.close() 70 | ``` 71 | 72 | 73 | 74 | ### Inserting Data 75 | 76 | Add data to your database using `INSERT INTO`. 77 | 78 | ```python 79 | connection = sqlite3.connect("example.db") 80 | cursor = connection.cursor() 81 | 82 | # Insert data 83 | cursor.execute("INSERT INTO employees (name, age, department) VALUES (?, ?, ?)", 84 | ("Alice", 30, "HR")) 85 | cursor.execute("INSERT INTO employees (name, age, department) VALUES (?, ?, ?)", 86 | ("Bob", 25, "Engineering")) 87 | 88 | connection.commit() 89 | connection.close() 90 | ``` 91 | 92 | 93 | 94 | ### Querying Data 95 | 96 | Retrieve data using `SELECT` queries. 97 | 98 | ```python 99 | connection = sqlite3.connect("example.db") 100 | cursor = connection.cursor() 101 | 102 | # Query all employees 103 | cursor.execute("SELECT * FROM employees") 104 | rows = cursor.fetchall() 105 | 106 | for row in rows: 107 | print(row) 108 | 109 | connection.close() 110 | ``` 111 | 112 | **Output:** 113 | 114 | ```plaintext 115 | (1, 'Alice', 30, 'HR') 116 | (2, 'Bob', 25, 'Engineering') 117 | ``` 118 | 119 | 120 | 121 | ### Example: Fetching Data with Conditions 122 | 123 | ```python 124 | connection = sqlite3.connect("example.db") 125 | cursor = connection.cursor() 126 | 127 | # Query employees in the HR department 128 | cursor.execute("SELECT * FROM employees WHERE department = ?", ("HR",)) 129 | rows = cursor.fetchall() 130 | 131 | print(rows) # Output: [(1, 'Alice', 30, 'HR')] 132 | 133 | connection.close() 134 | ``` 135 | 136 | 137 | 138 | ## 3️⃣ Using SQLAlchemy for Data Retrieval 139 | 140 | SQLAlchemy simplifies database interactions with Python objects. 
141 | 142 | 143 | 144 | ### Setting Up SQLAlchemy 145 | 146 | Install SQLAlchemy: 147 | 148 | ```bash 149 | pip install sqlalchemy 150 | ``` 151 | 152 | Create a database connection and an engine: 153 | 154 | ```python 155 | from sqlalchemy import create_engine 156 | 157 | # Create a SQLite engine 158 | engine = create_engine("sqlite:///example.db") 159 | ``` 160 | 161 | 162 | 163 | ### Defining a Table Model 164 | 165 | Define tables using SQLAlchemy’s `declarative_base`: 166 | 167 | ```python 168 | from sqlalchemy.ext.declarative import declarative_base 169 | from sqlalchemy import Column, Integer, String 170 | 171 | Base = declarative_base() 172 | 173 | class Employee(Base): 174 | __tablename__ = 'employees' 175 | id = Column(Integer, primary_key=True, autoincrement=True) 176 | name = Column(String) 177 | age = Column(Integer) 178 | department = Column(String) 179 | ``` 180 | 181 | Create tables: 182 | 183 | ```python 184 | Base.metadata.create_all(engine) 185 | ``` 186 | 187 | 188 | 189 | ### Adding and Querying Data 190 | 191 | Insert data using a session: 192 | 193 | ```python 194 | from sqlalchemy.orm import sessionmaker 195 | 196 | Session = sessionmaker(bind=engine) 197 | session = Session() 198 | 199 | # Add an employee 200 | new_employee = Employee(name="Charlie", age=28, department="Finance") 201 | session.add(new_employee) 202 | session.commit() 203 | ``` 204 | 205 | Query data: 206 | 207 | ```python 208 | # Query all employees 209 | employees = session.query(Employee).all() 210 | 211 | for emp in employees: 212 | print(emp.name, emp.department) 213 | ``` 214 | 215 | **Output:** 216 | 217 | ```plaintext 218 | Alice HR 219 | Bob Engineering 220 | Charlie Finance 221 | ``` 222 | 223 | 224 | 225 | ## 🧠 Practice Exercises 226 | 227 | 1. Create a table for **products** with columns: `id`, `name`, `price`, and `quantity`. Populate it with data and fetch records where `price > 100`. 228 | 2. Use SQLAlchemy to define a `students` table. Add records and retrieve students aged above 20. 229 | 3. Write an SQL query to find the average age of employees in a department. 230 | 231 | 232 | 233 | ## 🌟 Summary 234 | 235 | - SQLite3 is a lightweight database engine suitable for small projects. 236 | - SQLAlchemy provides an abstraction layer for interacting with databases programmatically. 237 | - SQL queries like `SELECT`, `INSERT`, and `WHERE` are used for data retrieval and filtering. 238 | 239 | 240 | --- 241 | 242 | 243 | 244 | -------------------------------------------------------------------------------- /13_Time Series Analysis Introduction/13_Time Series Analysis Introduction.md: -------------------------------------------------------------------------------- 1 | [<< Day 12](../12_SQL%20for%20Data%20Retrieval/12_SQL%20for%20Data%20Retrieval.md) | [Day 14 >>](../14_Working%20with%20APIs%20and%20JSON/14_Working%20with%20APIs%20and%20JSON.md) 2 | 3 | 4 | # 📘 Day 13: Time Series Analysis in Python 5 | 6 | Welcome to Day 13 of the **30 Days of Data Science** series! Today, we’ll explore **Time Series Analysis**—a critical skill in data science for analyzing and predicting sequential data points over time. We'll leverage Python libraries like **pandas**, **datetime**, and **matplotlib** to understand and visualize time series data. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 13: Time Series Analysis in Python](#-day-13-time-series-analysis-in-python) 13 | - [1️⃣ What is Time Series Analysis? 
🕒](#1️⃣-what-is-time-series-analysis-) 14 | - [2️⃣ Working with Datetime in Python 📅](#2️⃣-working-with-datetime-in-python-) 15 | - [Datetime Basics](#datetime-basics) 16 | - [Parsing and Formatting Dates](#parsing-and-formatting-dates) 17 | - [Example: Date Arithmetic](#example-date-arithmetic) 18 | - [3️⃣ Time Series Analysis with pandas 📊](#3️⃣-time-series-analysis-with-pandas-) 19 | - [Creating a Time Series](#creating-a-time-series) 20 | - [Resampling and Aggregation](#resampling-and-aggregation) 21 | - [Handling Missing Data](#handling-missing-data) 22 | - [Rolling Statistics](#rolling-statistics) 23 | - [4️⃣ Visualizing Time Series Data with matplotlib 📈](#4️⃣-visualizing-time-series-data-with-matplotlib-) 24 | - [Line Plots](#line-plots) 25 | - [Highlighting Trends](#highlighting-trends) 26 | - [🧠 Practice Exercises](#-practice-exercises) 27 | - [🌟 Summary](#-summary) 28 | 29 | 30 | 31 | 32 | ## 1️⃣ What is Time Series Analysis? 🕒 33 | 34 | A **time series** is a sequence of data points indexed in time order. Examples include stock prices, weather data, and sensor readings. **Time Series Analysis** focuses on uncovering patterns, trends, and seasonality in data to make informed decisions or predictions. 35 | 36 | 37 | 38 | ## 2️⃣ Working with Datetime in Python 📅 39 | 40 | The **datetime** module provides tools for handling dates and times. 41 | 42 | 43 | 44 | ### Datetime Basics 45 | 46 | ```python 47 | from datetime import datetime 48 | 49 | # Current date and time 50 | now = datetime.now() 51 | print(f"Current datetime: {now}") 52 | 53 | # Creating specific dates 54 | custom_date = datetime(2022, 12, 25, 10, 30) 55 | print(f"Custom datetime: {custom_date}") 56 | ``` 57 | 58 | **Output:** 59 | 60 | ```plaintext 61 | Current datetime: 2024-11-20 14:30:45.123456 62 | Custom datetime: 2022-12-25 10:30:00 63 | ``` 64 | 65 | 66 | 67 | ### Parsing and Formatting Dates 68 | 69 | ```python 70 | # Parsing a string to datetime 71 | date_str = "2024-11-20" 72 | parsed_date = datetime.strptime(date_str, "%Y-%m-%d") 73 | print(f"Parsed date: {parsed_date}") 74 | 75 | # Formatting datetime to string 76 | formatted_date = parsed_date.strftime("%B %d, %Y") 77 | print(f"Formatted date: {formatted_date}") 78 | ``` 79 | 80 | **Output:** 81 | 82 | ```plaintext 83 | Parsed date: 2024-11-20 00:00:00 84 | Formatted date: November 20, 2024 85 | ``` 86 | 87 | 88 | 89 | ### Example: Date Arithmetic 90 | 91 | ```python 92 | from datetime import timedelta 93 | 94 | # Adding days 95 | future_date = now + timedelta(days=10) 96 | print(f"10 days from now: {future_date}") 97 | 98 | # Difference between dates 99 | diff = future_date - now 100 | print(f"Days between: {diff.days}") 101 | ``` 102 | 103 | **Output:** 104 | 105 | ```plaintext 106 | 10 days from now: 2024-11-30 ... 107 | Days between: 10 108 | ``` 109 | 110 | 111 | 112 | ## 3️⃣ Time Series Analysis with pandas 📊 113 | 114 | The **pandas** library provides robust tools for working with time series data. 115 | 116 | 117 | 118 | ### Creating a Time Series 119 | 120 | ```python 121 | import pandas as pd 122 | 123 | # Create a date range 124 | dates = pd.date_range(start="2024-01-01", periods=7, freq="D") 125 | 126 | # Create a DataFrame 127 | data = pd.DataFrame({"Date": dates, "Value": [10, 15, 20, 25, 30, 35, 40]}) 128 | data.set_index("Date", inplace=True) 129 | print(data) 130 | ``` 131 | 132 | **Output:** 133 | 134 | ```plaintext 135 | Value 136 | Date 137 | 2024-01-01 10 138 | 2024-01-02 15 139 | 2024-01-03 20 140 | ... 
141 | ``` 142 | 143 | 144 | 145 | ### Resampling and Aggregation 146 | 147 | ```python 148 | # Resample to weekly average 149 | weekly_avg = data.resample("W").mean() 150 | print(weekly_avg) 151 | ``` 152 | 153 | **Output:** 154 | 155 | ```plaintext 156 | Value 157 | Date 158 | 2024-01-07 22.5 159 | ``` 160 | 161 | 162 | 163 | ### Handling Missing Data 164 | 165 | ```python 166 | # Create data with missing values 167 | data.loc["2024-01-05"] = None 168 | 169 | # Fill missing values 170 | data_filled = data.fillna(method="ffill") 171 | print(data_filled) 172 | ``` 173 | 174 | **Output:** 175 | 176 | ```plaintext 177 | Value 178 | Date 179 | 2024-01-05 30.0 180 | ``` 181 | 182 | 183 | 184 | ### Rolling Statistics 185 | 186 | ```python 187 | # Calculate rolling mean 188 | data["Rolling Mean"] = data["Value"].rolling(window=3).mean() 189 | print(data) 190 | ``` 191 | 192 | **Output:** 193 | 194 | ```plaintext 195 | Value Rolling Mean 196 | Date 197 | 2024-01-03 20.0 15.0 198 | ``` 199 | 200 | 201 | 202 | ## 4️⃣ Visualizing Time Series Data with matplotlib 📈 203 | 204 | The **matplotlib** library helps in visualizing trends, seasonality, and anomalies in time series data. 205 | 206 | 207 | 208 | ### Line Plots 209 | 210 | ```python 211 | import matplotlib.pyplot as plt 212 | 213 | # Plot time series 214 | data["Value"].plot(title="Time Series Data") 215 | plt.show() 216 | ``` 217 | 218 | 219 | 220 | ### Highlighting Trends 221 | 222 | ```python 223 | # Plot with trend line 224 | plt.plot(data.index, data["Value"], label="Original Data") 225 | plt.plot(data.index, data["Rolling Mean"], label="Trend", linestyle="--") 226 | plt.legend() 227 | plt.show() 228 | ``` 229 | 230 | 231 | 232 | ## 🧠 Practice Exercises 233 | 234 | 1. Create a time series dataset of daily sales for a week and calculate the rolling average. 235 | 2. Visualize monthly stock prices using matplotlib and identify trends. 236 | 3. Use pandas to resample hourly temperature data into daily averages. 237 | 238 | 239 | 240 | ## 🌟 Summary 241 | 242 | - The **datetime** module helps manage and manipulate dates in Python. 243 | - **pandas** provides tools for time series operations, such as resampling, filling missing values, and calculating rolling statistics. 244 | - Use **matplotlib** to visualize and analyze time series data. 245 | 246 | --- 247 | 248 | 249 | -------------------------------------------------------------------------------- /08_Data Cleaning/08_Data Cleaning.md: -------------------------------------------------------------------------------- 1 | [<< Day 7](../07_Importing%20Data/07_Importing%20Data.md) | [Day 9 >>](../09_Exploratory%20Data%20Analysis%20(EDA)/09_Exploratory%20Data%20Analysis%20(EDA).md) 2 | 3 | 4 | # 📘 Day 8: Cleaning Data 5 | 6 | Welcome to **Day 8** of the **30 Days of Data Science** series! Today, we focus on **Cleaning Data**, which includes handling missing values and duplicates. Cleaning your data is a crucial step to ensure the reliability and accuracy of your analyses and models. 
7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 8: Cleaning Data](#-day-8-cleaning-data) 13 | - [1️⃣ Introduction to Data Cleaning](#1️⃣-introduction-to-data-cleaning) 14 | - [2️⃣ Handling Missing Values](#2️⃣-handling-missing-values) 15 | - [Identifying Missing Data](#identifying-missing-data) 16 | - [Removing Missing Values](#removing-missing-values) 17 | - [Imputing Missing Values](#imputing-missing-values) 18 | - [Replacing Missing Values with Interpolation](#replacing-missing-values-with-interpolation) 19 | - [Filling Missing Values with Forward/Backward Fill](#filling-missing-values-with-forwardbackward-fill) 20 | - [3️⃣ Handling Duplicate Values](#3️⃣-handling-duplicate-values) 21 | - [Identifying Duplicates](#identifying-duplicates) 22 | - [Removing Duplicates](#removing-duplicates) 23 | - [Keeping Specific Duplicates](#keeping-specific-duplicates) 24 | - [🧠 Practice Exercises](#-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | 30 | ## 1️⃣ Introduction to Data Cleaning 31 | 32 | In real-world datasets, it is common to encounter missing or duplicate data. Pandas provides a rich set of functions to handle such issues effectively. Cleaning data ensures that your dataset is: 33 | 34 | - Consistent 35 | - Reliable 36 | - Free from errors 37 | 38 | 39 | 40 | ## 2️⃣ Handling Missing Values 41 | 42 | Missing values can occur for various reasons, such as data entry errors, system limitations, or incomplete surveys. Pandas offers multiple functions to handle these situations. 43 | 44 | ### Identifying Missing Data 45 | 46 | You can identify missing data using functions like `isnull()` and `notnull()`. 47 | 48 | #### Example: 49 | 50 | ```python 51 | import pandas as pd 52 | 53 | # Sample dataset 54 | data = {'Name': ['Alice', 'Bob', None, 'David'], 55 | 'Age': [25, None, 30, 22], 56 | 'City': ['New York', 'Los Angeles', 'Chicago', None]} 57 | 58 | df = pd.DataFrame(data) 59 | 60 | # Check for missing values 61 | print(df.isnull()) # Returns a DataFrame with True for missing values 62 | print("\nNumber of missing values per column:") 63 | print(df.isnull().sum()) 64 | ``` 65 | 66 | 67 | 68 | ### Removing Missing Values 69 | 70 | Use `dropna()` to remove rows or columns with missing values. 71 | 72 | #### Example: 73 | 74 | ```python 75 | # Drop rows with missing values 76 | df_cleaned = df.dropna() 77 | print(df_cleaned) 78 | 79 | # Drop columns with missing values 80 | df_cleaned_cols = df.dropna(axis=1) 81 | print(df_cleaned_cols) 82 | ``` 83 | 84 | You can also customize the behavior of `dropna()` using parameters like: 85 | 86 | - `how='all'`: Removes rows/columns where all values are missing. 87 | - `thresh=n`: Retains rows/columns with at least `n` non-NA values. 88 | 89 | #### Example: 90 | 91 | ```python 92 | # Drop rows with all missing values 93 | df_cleaned = df.dropna(how='all') 94 | 95 | # Drop rows with at least 2 non-missing values 96 | df_cleaned_thresh = df.dropna(thresh=2) 97 | print(df_cleaned_thresh) 98 | ``` 99 | 100 | 101 | 102 | ### Imputing Missing Values 103 | 104 | Imputation fills missing values with appropriate values like mean, median, or mode. 
105 | 106 | #### Example: 107 | 108 | ```python 109 | # Fill missing numeric values with the mean 110 | df['Age'] = df['Age'].fillna(df['Age'].mean()) 111 | print(df) 112 | 113 | # Fill missing categorical values with a mode 114 | df['Name'] = df['Name'].fillna(df['Name'].mode()[0]) 115 | print(df) 116 | ``` 117 | 118 | 119 | 120 | ### Replacing Missing Values with Interpolation 121 | 122 | Interpolation estimates missing values using mathematical functions. 123 | 124 | #### Example: 125 | 126 | ```python 127 | # Interpolate missing values 128 | data = {'Value': [1, None, 3, None, 5]} 129 | df = pd.DataFrame(data) 130 | 131 | # Linear interpolation 132 | df['Value'] = df['Value'].interpolate(method='linear') 133 | print(df) 134 | ``` 135 | 136 | 137 | 138 | ### Filling Missing Values with Forward/Backward Fill 139 | 140 | Forward fill (`ffill`) and backward fill (`bfill`) propagate known values to fill gaps. 141 | 142 | #### Example: 143 | 144 | ```python 145 | # Forward fill 146 | df_ffill = df.fillna(method='ffill') 147 | 148 | # Backward fill 149 | df_bfill = df.fillna(method='bfill') 150 | ``` 151 | 152 | 153 | 154 | ## 3️⃣ Handling Duplicate Values 155 | 156 | Duplicate values can distort analyses and lead to redundant information. Use Pandas to identify and remove duplicates effectively. 157 | 158 | ### Identifying Duplicates 159 | 160 | The `duplicated()` method returns a boolean series indicating whether each row is a duplicate. 161 | 162 | #### Example: 163 | 164 | ```python 165 | data = {'Name': ['Alice', 'Bob', 'Alice', 'David'], 166 | 'Age': [25, 30, 25, 22]} 167 | 168 | df = pd.DataFrame(data) 169 | 170 | # Check for duplicates 171 | print(df.duplicated()) 172 | ``` 173 | 174 | 175 | 176 | ### Removing Duplicates 177 | 178 | The `drop_duplicates()` method removes duplicate rows from the dataset. 179 | 180 | #### Example: 181 | 182 | ```python 183 | # Remove duplicate rows 184 | df_no_duplicates = df.drop_duplicates() 185 | print(df_no_duplicates) 186 | ``` 187 | 188 | 189 | 190 | ### Keeping Specific Duplicates 191 | 192 | By default, `drop_duplicates()` retains the first occurrence. You can modify this behavior using the `keep` parameter. 193 | 194 | #### Example: 195 | 196 | ```python 197 | # Keep the last occurrence of duplicates 198 | df_last = df.drop_duplicates(keep='last') 199 | print(df_last) 200 | 201 | # Remove all occurrences of duplicates 202 | df_none = df.drop_duplicates(keep=False) 203 | print(df_none) 204 | ``` 205 | 206 | 207 | 208 | ## 🧠 Practice Exercises 209 | 210 | 1. Create a DataFrame with missing values and use various techniques to handle them. 211 | 2. Experiment with different interpolation methods like `polynomial` or `spline`. 212 | 3. Create a dataset with duplicate rows and test various `keep` options for `drop_duplicates()`. 213 | 214 | 215 | 216 | ## 🌟 Summary 217 | 218 | - **Missing Values**: 219 | - Identify using `isnull()`. 220 | - Remove using `dropna()` or fill using `fillna()`, `interpolate()`, `ffill`, and `bfill`. 221 | - **Duplicate Values**: 222 | - Identify using `duplicated()`. 223 | - Remove using `drop_duplicates()` with flexible `keep` options. 
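A short recap sketch that chains these steps on a small illustrative DataFrame (column names and values are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Bob', None],
                   'Age': [25, None, None, 22]})

# 1. Inspect missing values per column
print(df.isnull().sum())

# 2. Fill numeric gaps with the mean and categorical gaps with the mode
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])

# 3. Drop duplicate rows, keeping the first occurrence
df = df.drop_duplicates()
print(df)
```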
224 | 225 | --- 226 | 227 | 228 | -------------------------------------------------------------------------------- /31_Deployment on Cloud Platform/31_Deployment on Cloud Platform.md: -------------------------------------------------------------------------------- 1 | [<< Day 30](../30_Building%20a%20Data%20Science%20Pipeline/30_Building%20a%20Data%20Science%20Pipeline.md) | 2 | 3 | # 🎉 Bonus Day 31: Deployment on Cloud Platform 4 | 5 | Welcome to **Bonus Day 31** of the 30 Days of Data Science series! 🎉 Today, we’ll explore how to deploy your machine learning models or applications to **Cloud Platforms** using **Flask/FastAPI**. By the end, you'll understand the steps to deploy to **AWS**, **Azure**, and **GCP**. 6 | 7 | 8 | 9 | ## 📜 Table of Contents 10 | 11 | - [🎉 Bonus Day 31: Deployment on Cloud Platform](#-bonus-day-31-deployment-on-cloud-platform) 12 | - [📜 Table of Contents](#-table-of-contents) 13 | - [🌐 Introduction](#-introduction) 14 | - [🚀 Preparing the Application for Deployment](#-preparing-the-application-for-deployment) 15 | - [⚙️ Deploying with Flask/FastAPI](#%EF%B8%8F-deploying-with-flaskfastapi) 16 | - [Flask Example](#flask-example) 17 | - [FastAPI Example](#fastapi-example) 18 | - [☁️ Deployment on AWS](#%EF%B8%8F-deployment-on-aws) 19 | - [Steps to Deploy](#steps-to-deploy) 20 | - [☁️ Deployment on Azure](#%EF%B8%8F-deployment-on-azure) 21 | - [Steps to Deploy](#steps-to-deploy-1) 22 | - [☁️ Deployment on GCP](#%EF%B8%8F-deployment-on-gcp) 23 | - [Steps to Deploy](#steps-to-deploy-2) 24 | - [📝 Practice Exercise](#-practice-exercise) 25 | - [📚 Summary](#-summary) 26 | 27 | 28 | 29 | ## 🌐 Introduction 30 | 31 | Deployment is a crucial step in bringing your data science project to life. It allows others to interact with your model or application in real-time. We'll focus on deploying applications using **Flask** or **FastAPI** to popular cloud platforms like: 32 | 33 | - **AWS (Amazon Web Services)** 34 | - **Azure** 35 | - **GCP (Google Cloud Platform)** 36 | 37 | 38 | 39 | ## 🚀 Preparing the Application for Deployment 40 | 41 | ### Basic Folder Structure 42 | Ensure your project folder is structured properly: 43 | 44 | ```plaintext 45 | project/ 46 | |-- app.py # Main application script 47 | |-- model.pkl # Serialized ML model (if applicable) 48 | |-- templates/ 49 | | |-- index.html # Frontend files (if needed) 50 | |-- requirements.txt # Python dependencies 51 | |-- Dockerfile # (Optional) Docker configuration 52 | ``` 53 | 54 | ### Creating `requirements.txt` 55 | List all dependencies in a `requirements.txt` file: 56 | 57 | ```plaintext 58 | Flask==2.1.2 59 | pandas==1.5.3 60 | numpy==1.23.5 61 | scikit-learn==1.1.3 62 | ``` 63 | Generate automatically: 64 | 65 | ```bash 66 | pip freeze > requirements.txt 67 | ``` 68 | 69 | 70 | 71 | ## ⚙️ Deploying with Flask/FastAPI 72 | 73 | ### Flask Example 74 | 75 | Here’s a minimal **Flask** application: 76 | 77 | ```python 78 | from flask import Flask, request, jsonify 79 | 80 | app = Flask(__name__) 81 | 82 | @app.route("/", methods=["GET"]) 83 | def home(): 84 | return "Welcome to the Flask Deployment!" 
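# The /predict route below expects a JSON body such as {"values": [1, 2, 3]};
# the sum is only a stand-in for real model inference. Assuming the default
# Flask development server (port 5000), the endpoint can be exercised with:
#   curl -X POST http://127.0.0.1:5000/predict \
#        -H "Content-Type: application/json" \
#        -d '{"values": [1, 2, 3]}'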
85 | 86 | @app.route("/predict", methods=["POST"]) 87 | def predict(): 88 | data = request.json 89 | prediction = sum(data["values"]) # Example prediction logic 90 | return jsonify({"prediction": prediction}) 91 | 92 | if __name__ == "__main__": 93 | app.run(debug=True) 94 | ``` 95 | 96 | Run locally: 97 | 98 | ```bash 99 | python app.py 100 | ``` 101 | 102 | ### FastAPI Example 103 | 104 | Here’s a minimal **FastAPI** application: 105 | 106 | ```python 107 | from fastapi import FastAPI 108 | from pydantic import BaseModel 109 | 110 | app = FastAPI() 111 | 112 | class InputData(BaseModel): 113 | values: list[int] 114 | 115 | @app.get("/") 116 | def home(): 117 | return {"message": "Welcome to FastAPI Deployment!"} 118 | 119 | @app.post("/predict") 120 | def predict(data: InputData): 121 | prediction = sum(data.values) # Example prediction logic 122 | return {"prediction": prediction} 123 | 124 | if __name__ == "__main__": 125 | import uvicorn 126 | uvicorn.run(app, host="0.0.0.0", port=8000) 127 | ``` 128 | 129 | Run locally: 130 | 131 | ```bash 132 | uvicorn app:app --reload 133 | ``` 134 | 135 | 136 | 137 | ## ☁️ Deployment on AWS 138 | 139 | ### Steps to Deploy 140 | 141 | 1. **Set Up an EC2 Instance:** 142 | - Go to the [AWS Management Console](https://aws.amazon.com/). 143 | - Launch an EC2 instance with an Ubuntu AMI. 144 | 145 | 2. **Install Dependencies on EC2:** 146 | SSH into the instance and set up the environment: 147 | 148 | ```bash 149 | sudo apt update && sudo apt upgrade 150 | sudo apt install python3-pip 151 | pip3 install -r requirements.txt 152 | ``` 153 | 154 | 3. **Run the Application:** 155 | 156 | ```bash 157 | python3 app.py 158 | ``` 159 | 160 | 4. **Expose the Application:** 161 | - Open port 5000 (or your application port) in the AWS security group. 162 | - Access the app using the public IP of the EC2 instance. 163 | 164 | 165 | 166 | ## ☁️ Deployment on Azure 167 | 168 | ### Steps to Deploy 169 | 170 | 1. **Set Up an App Service:** 171 | - Go to the [Azure Portal](https://portal.azure.com/). 172 | - Create an App Service and select a Python runtime. 173 | 174 | 2. **Deploy Code:** 175 | - Zip your project folder and upload it through the Azure portal. 176 | 177 | 3. **Configure Startup Command:** 178 | Add the following startup command in the portal: 179 | 180 | ```bash 181 | gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app 182 | ``` 183 | 184 | 4. **Access the Application:** 185 | - Use the App Service URL provided by Azure. 186 | 187 | 188 | 189 | ## ☁️ Deployment on GCP 190 | 191 | ### Steps to Deploy 192 | 193 | 1. **Set Up a Google Cloud Project:** 194 | - Go to the [GCP Console](https://console.cloud.google.com/). 195 | - Enable the App Engine API. 196 | 197 | 2. **Create an `app.yaml` File:** 198 | 199 | ```yaml 200 | runtime: python39 201 | entrypoint: gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app 202 | ``` 203 | 204 | 3. **Deploy the Application:** 205 | 206 | ```bash 207 | gcloud app deploy 208 | ``` 209 | 210 | 4. **Access the Application:** 211 | - Use the provided GCP URL. 212 | 213 | 214 | 215 | ## 📝 Practice Exercise 216 | 217 | 1. Create a Flask/FastAPI application that: 218 | - Accepts an input text. 219 | - Returns the sentiment (positive/negative) using a pre-trained model. 220 | 221 | 2. Deploy this application to any cloud platform of your choice. 222 | 223 | 224 | 225 | ## 📚 Summary 226 | 227 | In this bonus session, we learned how to: 228 | 229 | - Prepare a Flask/FastAPI application for deployment. 
230 | - Deploy to **AWS EC2**, **Azure App Service**, and **Google Cloud Platform**. 231 | - Configure cloud platforms and expose services to the internet. 232 | 233 | 🎉 Congratulations on completing the **30 Days of Data Science** series with this bonus day! You’re now equipped to deploy your projects to the world. 🌎 234 | 235 | --- 236 | 237 | 238 | -------------------------------------------------------------------------------- /02_Basics of the Language & Git Basics/02_Basics of the Language & Git Basics.md: -------------------------------------------------------------------------------- 1 | # 📘 Day 2: Python Syntax, Variables, and Git Setup 2 | 3 | Welcome to Day 2 of the **30 Days of Data Science** challenge! Today, we’ll focus on learning the basics of Python syntax, understanding how variables work, and setting up Git for version control. 4 | 5 | 6 | [<< Day 1](../README.md#-day-1) | [Day 3 >>](../03_Control%20Flow/03_Control%20Flow.md) 7 | 8 | 9 | ## Table of Contents 10 | - [📘 Day 2: Python Syntax, Variables, and Git Setup](#-day-2-python-syntax-variables-and-git-setup) 11 | - [1️⃣ Python Syntax 🐍](#1️⃣-python-syntax-) 12 | - [Key Features](#key-features) 13 | - [Example: Basic Python Syntax](#example-basic-python-syntax) 14 | - [Common Python Errors](#common-python-errors) 15 | - [2️⃣ Variables in Python 🛠](#2️⃣-variables-in-python-) 16 | - [Declaring Variables](#declaring-variables) 17 | - [Rules for Variable Names](#rules-for-variable-names) 18 | - [Example: Different Data Types](#example-different-data-types) 19 | - [Updating Variables](#updating-variables) 20 | - [3️⃣ Git Setup and Basics 🌟](#3️⃣-git-setup-and-basics-) 21 | - [Installing Git](#installing-git) 22 | - [Configuring Git](#configuring-git) 23 | - [Initializing a Repository](#initializing-a-repository) 24 | - [Tracking Changes](#tracking-changes) 25 | - [Connecting to GitHub](#connecting-to-github) 26 | - [Example Workflow](#example-workflow) 27 | - [🧠 Practice Exercises](#-practice-exercises) 28 | - [Python Syntax](#python-syntax) 29 | - [Variables](#variables) 30 | - [Git](#git) 31 | - [🌟 Summary](#-summary) 32 | 33 | --- 34 | 35 | ### 1️⃣ Python Syntax 🐍 36 | 37 | Python syntax refers to the set of rules that defines the structure of a Python program. It’s known for being clean and easy to read. 38 | 39 | #### Key Features 40 | - Python uses **indentation** to define blocks of code (no curly braces). 41 | - Each statement is typically written on a new line. 42 | - Comments in Python start with `#`. 43 | 44 | #### Example: Basic Python Syntax 45 | ```python 46 | # This is a single-line comment 47 | print("Hello, Data Science!") # This prints a message to the console 48 | 49 | # Indentation defines a block of code 50 | if 5 > 2: 51 | print("Five is greater than two.") # Correct indentation 52 | 53 | # Incorrect indentation will raise an error 54 | # if 5 > 2: 55 | # print("This will throw an error") 56 | ``` 57 | 58 | #### Common Python Errors 59 | - **IndentationError**: Missing or incorrect indentation. 60 | - **SyntaxError**: Incorrect syntax, such as missing colons or parentheses. 61 | 62 | --- 63 | 64 | ### 2️⃣ Variables in Python 🛠 65 | 66 | Variables are used to store data in memory. You can assign any type of data to a variable without explicitly declaring its type. 
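Because Python is dynamically typed, the interpreter infers the type from the value you assign, and the same name can later point to a value of a different type — a quick illustration:

```python
x = 42
print(type(x))   # Output: <class 'int'>

# Re-assigning the same name to a different type is allowed
x = "forty-two"
print(type(x))   # Output: <class 'str'>
```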
67 | 68 | #### Declaring Variables 69 | ```python 70 | # Assigning values to variables 71 | name = "Alice" 72 | age = 25 73 | height = 5.7 74 | 75 | # Printing variable values 76 | print(name) # Output: Alice 77 | print(age) # Output: 25 78 | print(height) # Output: 5.7 79 | ``` 80 | 81 | #### Rules for Variable Names 82 | 1. Must start with a letter or underscore (`_`). 83 | 2. Cannot start with a number. 84 | 3. Can only contain alphanumeric characters and underscores. 85 | 4. Case-sensitive (`Name` and `name` are different). 86 | 87 | #### Example: Different Data Types 88 | ```python 89 | # String 90 | greeting = "Hello, World!" 91 | 92 | # Integer 93 | year = 2024 94 | 95 | # Float 96 | pi = 3.14159 97 | 98 | # Boolean 99 | is_active = True 100 | 101 | print(greeting) # Output: Hello, World! 102 | print(year) # Output: 2024 103 | print(pi) # Output: 3.14159 104 | print(is_active) # Output: True 105 | ``` 106 | 107 | #### Updating Variables 108 | You can update a variable's value or perform operations on it. 109 | ```python 110 | count = 10 111 | count += 1 # Increment by 1 112 | print(count) # Output: 11 113 | ``` 114 | 115 | --- 116 | 117 | ### 3️⃣ Git Setup and Basics 🌟 118 | 119 | Git is a version control system that tracks changes in your code. It’s essential for collaboration and maintaining a clean workflow in data science projects. 120 | 121 | #### Installing Git 122 | 1. Download and install Git from the [official website](https://git-scm.com/). 123 | 2. Check if Git is installed: 124 | ```bash 125 | git --version 126 | ``` 127 | 128 | #### Configuring Git 129 | Set your name and email address (required for commits): 130 | ```bash 131 | git config --global user.name "Your Name" 132 | git config --global user.email "your.email@example.com" 133 | ``` 134 | 135 | #### Initializing a Repository 136 | 1. Navigate to your project folder: 137 | ```bash 138 | cd 30DaysOfDataScience 139 | ``` 140 | 2. Initialize Git: 141 | ```bash 142 | git init 143 | ``` 144 | 145 | #### Tracking Changes 146 | - **Add files** to the staging area: 147 | ```bash 148 | git add filename.py # Add a specific file 149 | git add . # Add all files in the folder 150 | ``` 151 | - **Commit changes** with a message: 152 | ```bash 153 | git commit -m "Initial commit for Day 2" 154 | ``` 155 | 156 | #### Connecting to GitHub 157 | 1. Create a repository on GitHub. 158 | 2. Link your local repo to GitHub: 159 | ```bash 160 | git remote add origin https://github.com/SamarthGarge/30DaysOfDataScience.git 161 | ``` 162 | 3. Push changes to GitHub: 163 | ```bash 164 | git branch -M main 165 | git push -u origin main 166 | ``` 167 | 168 | #### Example Workflow 169 | ```bash 170 | # Make changes to your files 171 | git add . # Stage the changes 172 | git commit -m "Updated Python variables and examples" 173 | git push # Push to GitHub 174 | ``` 175 | 176 | --- 177 | 178 | ## 🧠 Practice Exercises 179 | 180 | ### Python Syntax 181 | 1. Write a Python program that prints: 182 | ``` 183 | Welcome to Data Science! 184 | Let's explore Python syntax together. 185 | ``` 186 | 2. Experiment with indentation by writing a small `if` statement. 187 | 188 | ### Variables 189 | 1. Create variables for: 190 | - Your name 191 | - Your age 192 | - Your favorite number 193 | 2. 
Print them in a sentence using string concatenation or f-strings: 194 | ```python 195 | # Example with f-strings 196 | name = "Alice" 197 | age = 25 198 | favorite_number = 7 199 | print(f"My name is {name}, I am {age} years old, and my favorite number is {favorite_number}.") 200 | ``` 201 | 202 | ### Git 203 | 1. Create a new file called `day2_practice.py`. 204 | 2. Add it to your repository and commit with the message: 205 | `"Added Day 2 practice file"`. 206 | 207 | --- 208 | 209 | ## 🌟 Summary 210 | 211 | Today, you: 212 | - Learned about Python’s syntax and how to avoid common errors. 213 | - Explored how to declare and use variables effectively. 214 | - Set up Git for version control and pushed your first changes to GitHub. 215 | 216 | 217 | -------------------------------------------------------------------------------- /04_Functions and Modular Programming/04_Functions and Modular Programming.md: -------------------------------------------------------------------------------- 1 | [<< Day 3](../03_Control%20Flow/03_Control%20Flow.md) | [Day 5 >>](../05_Data%20Structures/05_Data%20Structures.md) 2 | # 📘 Day 4: Functions and Modular Programming in Python 3 | 4 | Welcome to Day 4 of the **30 Days of Data Science** series! Today, we explore **functions**—one of the most critical tools for writing clean, reusable, and modular code. Functions are essential in breaking down complex problems into smaller, manageable pieces. 5 | 6 | 7 | 8 | ## Table of Contents 9 | 10 | - [📘 Day 4: Functions and Modular Programming in Python](#-day-4-functions-and-modular-programming-in-python) 11 | - [1️⃣ Functions in Python 📜](#1️⃣-functions-in-python-) 12 | - [Defining a Function](#defining-a-function) 13 | - [Example: Simple Function](#example-simple-function) 14 | - [Functions with Parameters](#functions-with-parameters) 15 | - [Example: Function with Multiple Parameters](#example-function-with-multiple-parameters) 16 | - [Default Arguments](#default-arguments) 17 | - [Return Statement](#return-statement) 18 | - [Example: Returning a Value](#example-returning-a-value) 19 | - [Calling Functions](#calling-functions) 20 | - [2️⃣ Modular Programming 🧩](#2️⃣-modular-programming-) 21 | - [What is Modular Programming?](#what-is-modular-programming) 22 | - [Creating and Importing Modules](#creating-and-importing-modules) 23 | - [Example: Custom Module](#example-custom-module) 24 | - [Built-in Modules](#built-in-modules) 25 | - [🧠 Practice Exercises](#-practice-exercises) 26 | - [🌟 Summary](#-summary) 27 | 28 | 29 | 30 | 31 | 32 | 33 | ## 1️⃣ Functions in Python 📜 34 | 35 | A **function** is a block of code designed to perform a specific task. It runs only when called and can accept inputs and return outputs. 36 | 37 | 38 | 39 | ### Defining a Function 40 | 41 | The `def` keyword is used to define a function in Python. 42 | 43 | #### Syntax: 44 | 45 | ```python 46 | def function_name(parameters): 47 | # Code block 48 | return value # Optional 49 | ``` 50 | 51 | 52 | 53 | ### Example: Simple Function 54 | 55 | ```python 56 | def greet(): 57 | print("Hello, Data Science Enthusiast!") 58 | 59 | greet() 60 | ``` 61 | 62 | **Output:** 63 | 64 | ```plaintext 65 | Hello, Data Science Enthusiast! 66 | ``` 67 | 68 | 69 | 70 | ### Functions with Parameters 71 | 72 | Functions can accept inputs (parameters) to customize their behavior. 73 | 74 | ```python 75 | def greet(name): 76 | print(f"Hello, {name}!") 77 | 78 | greet("Alice") 79 | greet("Bob") 80 | ``` 81 | 82 | **Output:** 83 | 84 | ```plaintext 85 | Hello, Alice! 
86 | Hello, Bob! 87 | ``` 88 | 89 | 90 | 91 | ### Example: Function with Multiple Parameters 92 | 93 | ```python 94 | def add_numbers(a, b): 95 | result = a + b 96 | print(f"The sum of {a} and {b} is {result}.") 97 | 98 | add_numbers(3, 5) 99 | ``` 100 | 101 | **Output:** 102 | 103 | ```plaintext 104 | The sum of 3 and 5 is 8. 105 | ``` 106 | 107 | 108 | 109 | ### Default Arguments 110 | 111 | Functions can have default values for parameters, making them optional. 112 | 113 | ```python 114 | def greet(name="Data Scientist"): 115 | print(f"Welcome, {name}!") 116 | 117 | greet() 118 | greet("Alice") 119 | ``` 120 | 121 | **Output:** 122 | 123 | ```plaintext 124 | Welcome, Data Scientist! 125 | Welcome, Alice! 126 | ``` 127 | 128 | 129 | 130 | ### Return Statement 131 | 132 | The `return` statement allows a function to output a value. 133 | 134 | 135 | 136 | ### Example: Returning a Value 137 | 138 | ```python 139 | def square(number): 140 | return number * number 141 | 142 | result = square(4) 143 | print(f"The square of 4 is {result}.") 144 | ``` 145 | 146 | **Output:** 147 | 148 | ```plaintext 149 | The square of 4 is 16. 150 | ``` 151 | 152 | 153 | 154 | ### Calling Functions 155 | 156 | A **function call** executes the code inside a function definition. 157 | 158 | #### Example 1: Calling a Simple Function 159 | 160 | ```python 161 | def say_hello(): 162 | print("Hello!") 163 | 164 | # Function call 165 | say_hello() 166 | ``` 167 | 168 | **Output:** 169 | 170 | ```plaintext 171 | Hello! 172 | ``` 173 | 174 | #### Example 2: Function Call with Parameters 175 | 176 | ```python 177 | def add(a, b): 178 | return a + b 179 | 180 | result = add(10, 20) # Function call 181 | print(f"The sum is {result}.") 182 | ``` 183 | 184 | **Output:** 185 | 186 | ```plaintext 187 | The sum is 30. 188 | ``` 189 | 190 | #### Example 3: Combining Function Calls 191 | 192 | You can call functions within other function calls. 193 | 194 | ```python 195 | def double(number): 196 | return number * 2 197 | 198 | def add_and_double(a, b): 199 | return double(a + b) 200 | 201 | result = add_and_double(3, 5) 202 | print(f"The result is {result}.") 203 | ``` 204 | 205 | **Output:** 206 | 207 | ```plaintext 208 | The result is 16. 209 | ``` 210 | 211 | 212 | 213 | ## 2️⃣ Modular Programming 🧩 214 | 215 | ### What is Modular Programming? 216 | 217 | Modular programming involves breaking a program into smaller, manageable parts or **modules**. It enhances readability, reusability, and maintainability. 218 | 219 | 220 | 221 | ### Creating and Importing Modules 222 | 223 | A **module** is a Python file containing functions, variables, and classes that can be reused in other files. 224 | 225 | 1. **Create a Module**: Save your Python file (e.g., `my_module.py`). 226 | 227 | ```python 228 | # my_module.py 229 | def greet(name): 230 | return f"Hello, {name}!" 231 | ``` 232 | 233 | 2. **Import the Module**: Use the `import` keyword in another file. 234 | 235 | ```python 236 | # main.py 237 | import my_module 238 | 239 | message = my_module.greet("Alice") 240 | print(message) 241 | ``` 242 | 243 | **Output:** 244 | 245 | ```plaintext 246 | Hello, Alice! 
247 | ``` 248 | 249 | 250 | 251 | ### Example: Custom Module 252 | 253 | Let’s create a custom math module: 254 | 255 | ```python 256 | # math_utils.py 257 | def add(a, b): 258 | return a + b 259 | 260 | def multiply(a, b): 261 | return a * b 262 | ``` 263 | 264 | ```python 265 | # main.py 266 | from math_utils import add, multiply 267 | 268 | print(add(3, 5)) # Output: 8 269 | print(multiply(3, 5)) # Output: 15 270 | ``` 271 | 272 | 273 | 274 | ### Built-in Modules 275 | 276 | Python includes many built-in modules like `math`, `os`, and `random`. 277 | 278 | #### Example: Using the `math` Module 279 | 280 | ```python 281 | import math 282 | 283 | result = math.sqrt(16) 284 | print(f"The square root of 16 is {result}.") 285 | ``` 286 | 287 | **Output:** 288 | 289 | ```plaintext 290 | The square root of 16 is 4.0. 291 | ``` 292 | 293 | 294 | 295 | ## 🧠 Practice Exercises 296 | 297 | 1. Write a function that checks if a number is odd or even. 298 | 2. Create a function that calculates the factorial of a number. 299 | 3. Develop a module with utility functions for basic arithmetic operations. 300 | 4. Explore and use the `random` module to generate random numbers. 301 | 302 | 303 | 304 | ## 🌟 Summary 305 | 306 | - Functions make your code reusable and organized. 307 | - Parameters allow functions to accept input, and `return` values provide output. 308 | - Calling functions runs the code defined within them. 309 | - Modular programming improves code readability and maintenance. 310 | - Python’s built-in modules provide powerful utilities. 311 | 312 | --- 313 | -------------------------------------------------------------------------------- /09_Exploratory Data Analysis (EDA)/09_Exploratory Data Analysis (EDA).md: -------------------------------------------------------------------------------- 1 | [<< Day 8](../08_Data%20Cleaning/08_Data%20Cleaning.md) | [Day 10 >>](../10_Data%20Visualization%20Basics/10_Data%20Visualization%20Basics.md) 2 | 3 | # 📘 Day 9: Exploratory Data Analysis (EDA) with Python 4 | 5 | 6 | 7 | ## Table of Contents 8 | 9 | - [📘 Day 9: Exploratory Data Analysis (EDA) with Python](#-day-9-exploratory-data-analysis-eda-with-python) 10 | - [1️⃣ Introduction to Exploratory Data Analysis (EDA)](#1️⃣-introduction-to-exploratory-data-analysis-eda) 11 | - [2️⃣ Data Overview](#2️⃣-data-overview) 12 | - [Dataset Structure](#dataset-structure) 13 | - [Variable Classification](#variable-classification) 14 | - [3️⃣ Measures of Central Tendency](#3️⃣-measures-of-central-tendency) 15 | - [4️⃣ Measures of Dispersion (Spread)](#4️⃣-measures-of-dispersion-spread) 16 | - [5️⃣ Distribution Analysis](#5️⃣-distribution-analysis) 17 | - [6️⃣ Quantiles and Percentiles](#6️⃣-quantiles-and-percentiles) 18 | - [7️⃣ Categorical Data Analysis](#7️⃣-categorical-data-analysis) 19 | - [8️⃣ Outlier Detection](#8️⃣-outlier-detection) 20 | - [9️⃣ Visualizations for Descriptive Statistics](#9️⃣-visualizations-for-descriptive-statistics) 21 | - [1️⃣0️⃣ Correlation and Relationships](#1️⃣0️⃣-correlation-and-relationships) 22 | - [1️⃣1️⃣ Missing Values Analysis](#1️⃣1️⃣-missing-values-analysis) 23 | - [1️⃣2️⃣ Data Cleaning Insights](#1️⃣2️⃣-data-cleaning-insights) 24 | - [🧠 Practice Exercises](#-practice-exercises) 25 | - [🌟 Summary](#-summary) 26 | 27 | 28 | 29 | ## 1️⃣ Introduction to Exploratory Data Analysis (EDA) 30 | 31 | Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. 
It helps in understanding the data structure, spotting patterns, detecting anomalies, and deriving actionable insights. 32 | 33 | 34 | 35 | ## 2️⃣ Data Overview 36 | 37 | ### Dataset Structure 38 | 39 | - **Understanding Rows and Columns**: 40 | Use the `.shape` method to identify the number of rows and columns. 41 | 42 | ```python 43 | import pandas as pd 44 | 45 | # Sample dataset 46 | data = pd.read_csv('sample_data.csv') 47 | print("Shape of dataset:", data.shape) 48 | ``` 49 | 50 | - **Inspecting Data**: Use `.head()`, `.info()`, and `.describe()` for initial exploration. 51 | 52 | ```python 53 | # Display first 5 rows 54 | print(data.head()) 55 | 56 | # Dataset information 57 | print(data.info()) 58 | 59 | # Summary statistics for numerical columns 60 | print(data.describe()) 61 | ``` 62 | 63 | ### Variable Classification 64 | 65 | - **Numerical Variables**: 66 | - Continuous: e.g., height, weight. 67 | - Discrete: e.g., number of children. 68 | 69 | - **Categorical Variables**: 70 | - Nominal: No inherent order (e.g., gender, color). 71 | - Ordinal: Ordered categories (e.g., education level). 72 | 73 | - **Date/Time**: Useful for time-series analysis. 74 | 75 | 76 | 77 | ## 3️⃣ Measures of Central Tendency 78 | 79 | ### Mean 80 | 81 | ```python 82 | mean_value = data['column_name'].mean() 83 | print("Mean:", mean_value) 84 | ``` 85 | 86 | ### Median 87 | 88 | ```python 89 | median_value = data['column_name'].median() 90 | print("Median:", median_value) 91 | ``` 92 | 93 | ### Mode 94 | 95 | ```python 96 | mode_value = data['column_name'].mode() 97 | print("Mode:", mode_value) 98 | ``` 99 | 100 | 101 | 102 | ## 4️⃣ Measures of Dispersion (Spread) 103 | 104 | ### Range 105 | 106 | ```python 107 | range_value = data['column_name'].max() - data['column_name'].min() 108 | print("Range:", range_value) 109 | ``` 110 | 111 | ### Variance and Standard Deviation 112 | 113 | ```python 114 | variance = data['column_name'].var() 115 | std_dev = data['column_name'].std() 116 | 117 | print("Variance:", variance) 118 | print("Standard Deviation:", std_dev) 119 | ``` 120 | 121 | ### Interquartile Range (IQR) 122 | 123 | ```python 124 | q1 = data['column_name'].quantile(0.25) 125 | q3 = data['column_name'].quantile(0.75) 126 | iqr = q3 - q1 127 | 128 | print("IQR:", iqr) 129 | ``` 130 | 131 | 132 | 133 | ## 5️⃣ Distribution Analysis 134 | 135 | ### Skewness and Kurtosis 136 | 137 | ```python 138 | skewness = data['column_name'].skew() 139 | kurtosis = data['column_name'].kurt() 140 | 141 | print("Skewness:", skewness) 142 | print("Kurtosis:", kurtosis) 143 | ``` 144 | 145 | 146 | 147 | ## 6️⃣ Quantiles and Percentiles 148 | 149 | ```python 150 | # Quartiles 151 | q1 = data['column_name'].quantile(0.25) 152 | q2 = data['column_name'].quantile(0.50) # Median 153 | q3 = data['column_name'].quantile(0.75) 154 | 155 | print("Quartiles:", q1, q2, q3) 156 | 157 | # Percentile 158 | percentile_90 = data['column_name'].quantile(0.90) 159 | print("90th Percentile:", percentile_90) 160 | ``` 161 | 162 | 163 | 164 | ## 7️⃣ Categorical Data Analysis 165 | 166 | ### Frequency Counts 167 | 168 | ```python 169 | print(data['categorical_column'].value_counts()) 170 | ``` 171 | 172 | ### Cross-Tabulation 173 | 174 | ```python 175 | pd.crosstab(data['column1'], data['column2']) 176 | ``` 177 | 178 | 179 | 180 | ## 8️⃣ Outlier Detection 181 | 182 | ### Using IQR 183 | 184 | ```python 185 | outliers = data[(data['column_name'] < (q1 - 1.5 * iqr)) | 186 | (data['column_name'] > (q3 + 1.5 * iqr))] 187 | print(outliers) 188 
| ``` 189 | 190 | ### Z-Score Method 191 | 192 | ```python 193 | from scipy.stats import zscore 194 | 195 | data['z_score'] = zscore(data['column_name']) 196 | outliers = data[data['z_score'].abs() > 3] 197 | print(outliers) 198 | ``` 199 | 200 | 201 | 202 | ## 9️⃣ Visualizations for Descriptive Statistics 203 | 204 | ### Numerical Data 205 | 206 | ```python 207 | import matplotlib.pyplot as plt 208 | import seaborn as sns 209 | 210 | # Histogram 211 | data['column_name'].hist() 212 | 213 | # Boxplot 214 | sns.boxplot(x=data['column_name']) 215 | ``` 216 | 217 | ### Categorical Data 218 | 219 | ```python 220 | # Bar Chart 221 | data['categorical_column'].value_counts().plot(kind='bar') 222 | ``` 223 | 224 | 225 | 226 | ## 1️⃣0️⃣ Correlation and Relationships 227 | 228 | ### Correlation Coefficient 229 | 230 | ```python 231 | correlation = data.corr() 232 | print(correlation) 233 | ``` 234 | 235 | ### Scatter Plot 236 | 237 | ```python 238 | sns.scatterplot(x='column1', y='column2', data=data) 239 | ``` 240 | 241 | 242 | 243 | ## 1️⃣1️⃣ Missing Values Analysis 244 | 245 | ```python 246 | # Proportion of missing values 247 | missing = data.isnull().mean() 248 | print(missing) 249 | 250 | # Impute missing values 251 | data['column_name'].fillna(data['column_name'].mean(), inplace=True) 252 | ``` 253 | 254 | 255 | 256 | ## 1️⃣2️⃣ Data Cleaning Insights 257 | 258 | - Identify inconsistencies like out-of-range values or incorrect types. 259 | - Remove duplicates. 260 | 261 | ```python 262 | # Remove duplicates 263 | data = data.drop_duplicates() 264 | ``` 265 | 266 | 267 | 268 | ## 🧠 Practice Exercises 269 | 270 | 1. Load a dataset of your choice and summarize its structure. 271 | 2. Compute measures of central tendency and dispersion for a numerical column. 272 | 3. Identify and visualize outliers using boxplots. 273 | 4. Analyze correlations between numerical variables and plot a heatmap. 274 | 5. Handle missing values using different imputation techniques. 275 | 276 | 277 | 278 | ## 🌟 Summary 279 | 280 | Exploratory Data Analysis (EDA) helps in gaining a comprehensive understanding of datasets by summarizing their structure, detecting outliers, analyzing distributions, and visualizing relationships. Mastering EDA is a critical step for preparing data for advanced analytics and machine learning. 281 | 282 | --- 283 | -------------------------------------------------------------------------------- /30_Building a Data Science Pipeline/30_Building a Data Science Pipeline.md: -------------------------------------------------------------------------------- 1 | [<< Day 29](../29_Working%20with%20Big%20Data/29_Working%20with%20Big%20Data.md) | [Day 31 >>](../31_Deployment%20on%20Cloud%20Platform/31_Deployment%20on%20Cloud%20Platform.md) 2 | 3 | # 🎉 Day 30: Building a Data Science Pipeline 4 | 5 | Welcome to the final day of the **30 Days of Data Science** challenge! 🎊 Today, we will bring together all the skills you've learned by exploring how to build a **Data Science Pipeline**. Pipelines are essential for automating workflows, ensuring reproducibility, and deploying machine learning models effectively. 
6 | 7 | ## 📚 Table of Contents 8 | - [🎉 Day 30: Building a Data Science Pipeline](#-day-30-building-a-data-science-pipeline) 9 | - [Introduction](#introduction) 10 | - [🔧 Sklearn Pipelines](#-sklearn-pipelines) 11 | - [📦 Serialization with Joblib](#-serialization-with-joblib) 12 | - [🛠️ Feature Engineering Pipelines](#️-feature-engineering-pipelines) 13 | - [🔄 Handling Data Preprocessing](#-handling-data-preprocessing) 14 | - [📊 Model Evaluation and Cross-Validation](#-model-evaluation-and-cross-validation) 15 | - [🚀 Deployment Best Practices](#-deployment-best-practices) 16 | - [🧪 Testing and Validation](#-testing-and-validation) 17 | - [📝 Practice Exercise](#-practice-exercise) 18 | - [📖 Summary](#-summary) 19 | 20 | ## Introduction 21 | A **Data Science Pipeline** is a structured sequence of steps that automates the flow of data from raw input to a final machine learning model or output. Pipelines make your projects scalable, reproducible, and easier to maintain. Today, we will focus on building efficient pipelines using Python and related libraries. 22 | 23 | ## 🔧 Sklearn Pipelines 24 | `Pipeline` in `sklearn` allows you to automate machine learning workflows. It helps chain preprocessing steps, feature transformations, and model training into a single object. 25 | 26 | ### Example: 27 | ```python 28 | from sklearn.pipeline import Pipeline 29 | from sklearn.preprocessing import StandardScaler 30 | from sklearn.ensemble import RandomForestClassifier 31 | 32 | # Define pipeline 33 | pipeline = Pipeline([ 34 | ('scaler', StandardScaler()), 35 | ('classifier', RandomForestClassifier()) 36 | ]) 37 | 38 | # Fit and predict 39 | pipeline.fit(X_train, y_train) 40 | predictions = pipeline.predict(X_test) 41 | ``` 42 | 43 | ### Benefits of Sklearn Pipelines: 44 | - Automates repetitive tasks 45 | - Reduces code duplication 46 | - Ensures consistency during training and testing 47 | 48 | ## 📦 Serialization with Joblib 49 | Serialization is essential for saving your trained models and reusing them later. `joblib` is a powerful library for saving and loading Python objects, especially models. 50 | 51 | ### Example: 52 | ```python 53 | from joblib import dump, load 54 | 55 | # Save the pipeline 56 | dump(pipeline, 'model_pipeline.joblib') 57 | 58 | # Load the pipeline 59 | loaded_pipeline = load('model_pipeline.joblib') 60 | ``` 61 | 62 | ## 🛠️ Feature Engineering Pipelines 63 | Feature engineering involves creating meaningful input features for your model. Pipelines can include custom feature transformers. 64 | 65 | ### Example of a Custom Transformer: 66 | ```python 67 | from sklearn.base import BaseEstimator, TransformerMixin 68 | 69 | class CustomTransformer(BaseEstimator, TransformerMixin): 70 | def fit(self, X, y=None): 71 | return self 72 | 73 | def transform(self, X): 74 | # Custom transformation logic 75 | return X + 1 76 | 77 | # Add custom transformer to pipeline 78 | pipeline = Pipeline([ 79 | ('custom_transformer', CustomTransformer()), 80 | ('classifier', RandomForestClassifier()) 81 | ]) 82 | ``` 83 | 84 | ## 🔄 Handling Data Preprocessing 85 | Preprocessing ensures that your raw data is transformed into a format suitable for modeling. Steps like missing value imputation, encoding, and scaling can be incorporated into the pipeline. 
86 | 87 | ### Example: 88 | ```python 89 | from sklearn.compose import ColumnTransformer 90 | from sklearn.impute import SimpleImputer 91 | from sklearn.preprocessing import OneHotEncoder 92 | 93 | # Define preprocessing steps 94 | preprocessor = ColumnTransformer([ 95 | ('num', SimpleImputer(strategy='mean'), numerical_columns), 96 | ('cat', OneHotEncoder(), categorical_columns) 97 | ]) 98 | 99 | # Add preprocessing to pipeline 100 | pipeline = Pipeline([ 101 | ('preprocessor', preprocessor), 102 | ('classifier', RandomForestClassifier()) 103 | ]) 104 | ``` 105 | 106 | ## 📊 Model Evaluation and Cross-Validation 107 | Cross-validation is crucial for evaluating the performance of your pipeline. It helps ensure that the model generalizes well to unseen data. 108 | 109 | ### Example: 110 | ```python 111 | from sklearn.model_selection import cross_val_score 112 | 113 | # Cross-validation 114 | scores = cross_val_score(pipeline, X, y, cv=5) 115 | print("Cross-validation scores:", scores) 116 | ``` 117 | 118 | ### Key Metrics: 119 | - **Accuracy:** Overall correctness of the model. 120 | - **Precision:** Focus on false positives. 121 | - **Recall:** Focus on false negatives. 122 | - **F1 Score:** Harmonic mean of precision and recall. 123 | 124 | ## 🚀 Deployment Best Practices 125 | Once your pipeline is ready, deployment becomes the next critical step. Here are some best practices: 126 | 127 | 1. **Serialization:** Save your model and preprocessing steps using `joblib`. 128 | 2. **Environment Consistency:** Use tools like Docker to ensure that your development and production environments are identical. 129 | 3. **Monitoring and Logging:** Implement monitoring to track model performance post-deployment. 130 | 4. **Versioning:** Keep track of model versions for rollback and debugging purposes. 131 | 132 | ### Example: 133 | ```python 134 | import joblib 135 | import os 136 | 137 | # Save model 138 | joblib.dump(pipeline, 'pipeline_v1.joblib') 139 | 140 | # Check saved file 141 | if os.path.exists('pipeline_v1.joblib'): 142 | print("Pipeline saved successfully!") 143 | ``` 144 | 145 | ## 🧪 Testing and Validation 146 | Testing your pipeline ensures that it generalizes well to unseen data. Use cross-validation and performance metrics to evaluate the pipeline. 147 | 148 | ### Example: 149 | ```python 150 | from sklearn.model_selection import train_test_split 151 | from sklearn.metrics import classification_report 152 | 153 | # Split the data 154 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 155 | 156 | # Fit the pipeline 157 | pipeline.fit(X_train, y_train) 158 | 159 | # Evaluate 160 | predictions = pipeline.predict(X_test) 161 | print(classification_report(y_test, predictions)) 162 | ``` 163 | 164 | ## 📝 Practice Exercise 165 | **Task:** Build a data science pipeline that includes: 166 | 1. Data preprocessing (handle missing values and scaling) 167 | 2. Feature selection or engineering 168 | 3. Model training with a classifier of your choice 169 | 4. Save the pipeline using `joblib` 170 | 171 | **Dataset:** Use any dataset of your choice (e.g., Titanic dataset). 172 | 173 | **Steps:** 174 | 1. Load the dataset. 175 | 2. Define preprocessing and modeling steps. 176 | 3. Use `Pipeline` to combine all steps. 177 | 4. Evaluate the pipeline using cross-validation. 178 | 5. Save and reload the pipeline. 179 | 180 | ## 📖 Summary 181 | In this final day, you learned how to build robust Data Science Pipelines using tools like `sklearn` and `joblib`. 
We explored pipeline creation, feature engineering, cross-validation, and deployment best practices. Pipelines simplify workflows, improve reproducibility, and make your projects production-ready. Congratulations on completing the 30 Days of Data Science challenge! 🎉 182 | 183 | --- 184 | -------------------------------------------------------------------------------- /11_Advanced Data Visualization/11_Advanced Data Visualization.md: -------------------------------------------------------------------------------- 1 | [<< Day 10](../10_Data%20Visualization%20Basics/10_Data%20Visualization%20Basics.md) | [Day 12 >>](../12_SQL%20for%20Data%20Retrieval/12_SQL%20for%20Data%20Retrieval.md) 2 | 3 | # 📘 Day 11: Advanced Data Visualization with Plotly and Advanced Matplotlib 4 | 5 | Welcome to **Day 11** of the **30 Days of Data Science** series! Today, we dive into advanced data visualization techniques using **Plotly** and advanced features of **Matplotlib**. These tools enable you to create highly interactive and visually appealing visualizations, perfect for storytelling and analyzing complex datasets. 6 | 7 | 8 | 9 | ## Table of Contents 10 | 11 | - [📘 Day 11: Advanced Data Visualization with Plotly and Advanced Matplotlib](#-day-11-advanced-data-visualization-with-plotly-and-advanced-matplotlib) 12 | - [1️⃣ Introduction to Plotly 📊](#1️⃣-introduction-to-plotly-) 13 | - [Why Use Plotly?](#why-use-plotly) 14 | - [Installing Plotly](#installing-plotly) 15 | - [Creating a Basic Plotly Chart](#creating-a-basic-plotly-chart) 16 | - [Interactive Features in Plotly](#interactive-features-in-plotly) 17 | - [2️⃣ Advanced Features of Plotly 🌟](#2️⃣-advanced-features-of-plotly-) 18 | - [Creating Subplots](#creating-subplots) 19 | - [Using Plotly Express](#using-plotly-express) 20 | - [Example: Advanced Interactive Dashboard](#example-advanced-interactive-dashboard) 21 | - [3️⃣ Advanced Matplotlib 📐](#3️⃣-advanced-matplotlib-) 22 | - [Customizing Matplotlib Visualizations](#customizing-matplotlib-visualizations) 23 | - [Using Styles and Themes](#using-styles-and-themes) 24 | - [Creating Complex Plots with Matplotlib](#creating-complex-plots-with-matplotlib) 25 | - [3D Plotting with Matplotlib](#3d-plotting-with-matplotlib) 26 | - [🧠 Practice Exercises](#-practice-exercises) 27 | - [🌟 Summary](#-summary) 28 | 29 | 30 | 31 | 32 | ## 1️⃣ Introduction to Plotly 📊 33 | 34 | **Plotly** is a Python library that allows you to create interactive, web-based visualizations. It is well-suited for creating complex and dynamic plots. 35 | 36 | 37 | 38 | ### Why Use Plotly? 39 | 40 | 1. **Interactive Visualizations**: Enables zooming, panning, and hover effects. 41 | 2. **Web-Ready**: Integrates well with web applications. 42 | 3. **Rich Ecosystem**: Includes support for 2D and 3D plots, dashboards, and more. 43 | 44 | 45 | 46 | ### Installing Plotly 47 | 48 | Use the following command to install Plotly: 49 | 50 | ```bash 51 | pip install plotly 52 | ``` 53 | 54 | 55 | ### Creating a Basic Plotly Chart 56 | 57 | Creating a simple line plot 58 | ```bash 59 | import plotly.graph_objects as go 60 | 61 | fig = go.Figure() 62 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[10, 20, 30], mode='lines+markers', name='Line Plot')) 63 | fig.update_layout(title="Basic Plotly Line Chart", xaxis_title="X-axis", yaxis_title="Y-axis") 64 | fig.show() 65 | ``` 66 | 67 | ### Interactive Features in Plotly 68 | 69 | Plotly charts support interactivity by default. Hovering over points displays tooltips, and you can zoom in or pan the chart. 
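Beyond the defaults, hover behaviour can be customised. A small sketch, assuming Plotly is installed — `hovertemplate` controls the tooltip text, and `hovermode="x unified"` shows one combined tooltip per x value:

```python
import plotly.graph_objects as go

fig = go.Figure(go.Scatter(x=[1, 2, 3], y=[10, 20, 30], mode='lines+markers'))

# Customise the tooltip text; <extra></extra> hides the secondary trace label
fig.update_traces(hovertemplate="x=%{x}<br>y=%{y}<extra></extra>")

# Show a single unified tooltip for all traces at a given x position
fig.update_layout(hovermode="x unified", title="Custom Hover Behaviour")
fig.show()
```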
70 | 71 | ## 2️⃣ Advanced Features of Plotly 🌟 72 | 73 | ### Creating Subplots 74 | 75 | Plotly allows you to create multiple subplots within a single figure. 76 | 77 | ```bash 78 | from plotly.subplots import make_subplots 79 | import plotly.graph_objects as go 80 | 81 | # Creating subplots 82 | fig = make_subplots(rows=1, cols=2, subplot_titles=("Plot 1", "Plot 2")) 83 | 84 | # Adding traces to subplots 85 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6], mode='lines', name='Line 1'), row=1, col=1) 86 | fig.add_trace(go.Bar(x=["A", "B", "C"], y=[7, 8, 9], name='Bar Chart'), row=1, col=2) 87 | 88 | fig.update_layout(title="Subplots Example") 89 | fig.show() 90 | ``` 91 | 92 | ### Using Plotly Express 93 | 94 | Plotly Express simplifies the creation of visualizations with concise syntax. 95 | 96 | ```bash 97 | import plotly.express as px 98 | 99 | # Example: Scatter plot 100 | df = px.data.iris() # Built-in dataset 101 | fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", title="Iris Dataset Scatter Plot") 102 | fig.show() 103 | ``` 104 | 105 | ### Example: Advanced Interactive Dashboard 106 | 107 | ```bash 108 | import plotly.graph_objects as go 109 | from plotly.subplots import make_subplots 110 | 111 | # Creating a dashboard layout 112 | fig = make_subplots(rows=2, cols=2, subplot_titles=("Line", "Bar", "Pie", "Scatter")) 113 | 114 | # Adding various plots 115 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[10, 20, 30], mode='lines', name='Line'), row=1, col=1) 116 | fig.add_trace(go.Bar(x=["A", "B", "C"], y=[5, 10, 15], name='Bar'), row=1, col=2) 117 | fig.add_trace(go.Pie(labels=["A", "B", "C"], values=[10, 20, 30], name='Pie'), row=2, col=1) 118 | fig.add_trace(go.Scatter(x=[1, 2, 3], y=[7, 8, 9], mode='markers', name='Scatter'), row=2, col=2) 119 | 120 | fig.update_layout(title="Advanced Interactive Dashboard") 121 | fig.show() 122 | ``` 123 | 124 | ## 3️⃣ Advanced Matplotlib 📐 125 | 126 | Matplotlib offers extensive customization options for static visualizations. Let’s explore its advanced features. 127 | 128 | ### Customizing Matplotlib Visualizations 129 | 130 | ```bash 131 | import matplotlib.pyplot as plt 132 | 133 | # Customizing plot styles 134 | plt.figure(figsize=(8, 6)) 135 | plt.plot([1, 2, 3], [4, 5, 6], color='purple', linestyle='--', linewidth=2, marker='o', label='Line Plot') 136 | plt.title("Customized Matplotlib Plot", fontsize=16) 137 | plt.xlabel("X-axis Label", fontsize=12) 138 | plt.ylabel("Y-axis Label", fontsize=12) 139 | plt.legend() 140 | plt.grid(True) 141 | plt.show() 142 | ``` 143 | 144 | ### Using Styles and Themes 145 | 146 | Matplotlib comes with built-in styles. You can activate them using `plt.style.use()`. 
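To see which style sheets ship with your Matplotlib installation before picking one, you can print the available names:

```python
import matplotlib.pyplot as plt

# List the built-in style sheets bundled with the installed version
print(plt.style.available)
```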
147 | 148 | ```bash 149 | import matplotlib.pyplot as plt 150 | 151 | # Applying a style 152 | plt.style.use('ggplot') 153 | 154 | # Creating a styled plot 155 | plt.plot([1, 2, 3], [2, 4, 6], label='Styled Line', color='blue') 156 | plt.title("Using Matplotlib Styles") 157 | plt.legend() 158 | plt.show() 159 | ``` 160 | 161 | ### Creating Complex Plots with Matplotlib 162 | 163 | ```bash 164 | import matplotlib.pyplot as plt 165 | import numpy as np 166 | 167 | x = np.linspace(0, 10, 100) 168 | y = np.sin(x) 169 | 170 | plt.figure(figsize=(10, 6)) 171 | plt.plot(x, y, label='Sine Wave', color='green') 172 | plt.fill_between(x, y, alpha=0.3, color='green') 173 | plt.title("Sine Wave with Fill") 174 | plt.xlabel("X-axis") 175 | plt.ylabel("Y-axis") 176 | plt.legend() 177 | plt.show() 178 | ``` 179 | 180 | ### 3D Plotting with Matplotlib 181 | 182 | ```bash 183 | from mpl_toolkits.mplot3d import Axes3D 184 | import matplotlib.pyplot as plt 185 | import numpy as np 186 | 187 | fig = plt.figure(figsize=(10, 7)) 188 | ax = fig.add_subplot(111, projection='3d') 189 | 190 | # Generating data 191 | x = np.linspace(-5, 5, 100) 192 | y = np.linspace(-5, 5, 100) 193 | X, Y = np.meshgrid(x, y) 194 | Z = np.sin(np.sqrt(X**2 + Y**2)) 195 | 196 | # Creating a 3D surface plot 197 | ax.plot_surface(X, Y, Z, cmap='viridis') 198 | ax.set_title("3D Surface Plot") 199 | plt.show() 200 | ``` 201 | 202 | ## 🧠 Practice Exercises 203 | 204 | 1. Create an interactive bar chart using Plotly. 205 | 2. Generate subplots combining line and scatter plots. 206 | 3. Use Matplotlib to create a heatmap. 207 | 4. Explore the use of seaborn with advanced Matplotlib customizations. 208 | 209 | 210 | 211 | ## 🌟 Summary 212 | 213 | - Plotly is excellent for creating interactive, web-based visualizations. 214 | - Matplotlib offers flexibility and control for static visualizations. 215 | - Subplots, styles, and 3D plotting enhance your ability to tell a story with data. 216 | 217 | --- 218 | -------------------------------------------------------------------------------- /05_Data Structures/05_Data Structures.md: -------------------------------------------------------------------------------- 1 | [<< Day 4](../04_Functions%20and%20Modular%20Programming/04_Functions%20and%20Modular%20Programming.md) | [Day 6 >>](../06_Data%20Frames%20and%20Tables/06_Data%20Frames%20and%20Tables.md) 2 | # 📘 Day 5: Data Structures in Python 3 | 4 | Welcome to **Day 5** of the **30 Days of Data Science** series! Today, we will explore **data structures** in Python, focusing on three important types: 5 | 6 | - **Lists** 7 | - **Tuples** 8 | - **Dictionaries** 9 | 10 | Understanding these data structures is fundamental for data manipulation and organization in Python. 
11 | 12 | 13 | 14 | ## Table of Contents 15 | 16 | - [📘 Day 5: Data Structures in Python](#-day-5-data-structures-in-python) 17 | - [1️⃣ Lists in Python 📋](#1️⃣-lists-in-python-) 18 | - [Creating a List](#creating-a-list) 19 | - [Accessing List Elements](#accessing-list-elements) 20 | - [Adding Elements to a List](#adding-elements-to-a-list) 21 | - [Removing Elements from a List](#removing-elements-from-a-list) 22 | - [List Slicing](#list-slicing) 23 | - [Common List Methods](#common-list-methods) 24 | - [2️⃣ Tuples in Python 🔗](#2️⃣-tuples-in-python-) 25 | - [Creating a Tuple](#creating-a-tuple) 26 | - [Accessing Tuple Elements](#accessing-tuple-elements) 27 | - [Immutability of Tuples](#immutability-of-tuples) 28 | - [Common Tuple Methods](#common-tuple-methods) 29 | - [3️⃣ Dictionaries in Python 📖](#3️⃣-dictionaries-in-python-) 30 | - [Creating a Dictionary](#creating-a-dictionary) 31 | - [Accessing Dictionary Values](#accessing-dictionary-values) 32 | - [Adding and Updating Key-Value Pairs](#adding-and-updating-key-value-pairs) 33 | - [Removing Key-Value Pairs](#removing-key-value-pairs) 34 | - [Common Dictionary Methods](#common-dictionary-methods) 35 | - [🧠 Practice Exercises](#-practice-exercises) 36 | - [🌟 Summary](#-summary) 37 | 38 | 39 | 40 | 41 | 42 | ## 1️⃣ Lists in Python 📋 43 | 44 | ### What is a List? 45 | 46 | A **list** is a mutable (modifiable) collection of ordered elements. Lists can store elements of different data types, such as integers, strings, floats, or even other lists. 47 | 48 | 49 | 50 | ### Creating a List 51 | 52 | ```python 53 | # Creating a list of numbers 54 | numbers = [1, 2, 3, 4, 5] 55 | 56 | # Creating a mixed data type list 57 | mixed_list = [1, "apple", 3.14, True] 58 | ``` 59 | 60 | 61 | 62 | ### Accessing List Elements 63 | 64 | You can access elements in a list using **indexing** (zero-based). 65 | 66 | ```python 67 | fruits = ["apple", "banana", "cherry"] 68 | 69 | # Accessing the first element 70 | print(fruits[0]) # Output: apple 71 | 72 | # Accessing the last element 73 | print(fruits[-1]) # Output: cherry 74 | ``` 75 | 76 | 77 | 78 | ### Adding Elements to a List 79 | 80 | Use the `append()` method to add an element to the end or the `insert()` method to add at a specific position. 81 | 82 | ```python 83 | fruits = ["apple", "banana"] 84 | 85 | # Adding an element at the end 86 | fruits.append("cherry") 87 | print(fruits) # Output: ['apple', 'banana', 'cherry'] 88 | 89 | # Inserting an element at a specific position 90 | fruits.insert(1, "orange") 91 | print(fruits) # Output: ['apple', 'orange', 'banana', 'cherry'] 92 | ``` 93 | 94 | 95 | 96 | ### Removing Elements from a List 97 | 98 | Use `remove()` or `pop()` to delete elements. 99 | 100 | ```python 101 | fruits = ["apple", "banana", "cherry"] 102 | 103 | # Removing by value 104 | fruits.remove("banana") 105 | print(fruits) # Output: ['apple', 'cherry'] 106 | 107 | # Removing by index 108 | fruits.pop(1) 109 | print(fruits) # Output: ['apple'] 110 | ``` 111 | 112 | 113 | 114 | ### List Slicing 115 | 116 | Slicing allows you to access a subset of elements. 
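The general form is `my_list[start:stop:step]`; omitted values default to the beginning of the list, its end, and a step of 1. The basic forms appear in the example below — here is a short sketch of the step and negative-index variants:

```python
numbers = [1, 2, 3, 4, 5]

# Every second element
print(numbers[::2])    # Output: [1, 3, 5]

# Last three elements
print(numbers[-3:])    # Output: [3, 4, 5]

# Reversed copy of the list
print(numbers[::-1])   # Output: [5, 4, 3, 2, 1]
```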
117 | 118 | ```python 119 | numbers = [1, 2, 3, 4, 5] 120 | 121 | # Getting the first three elements 122 | print(numbers[:3]) # Output: [1, 2, 3] 123 | 124 | # Getting elements from index 2 to the end 125 | print(numbers[2:]) # Output: [3, 4, 5] 126 | ``` 127 | 128 | 129 | 130 | ### Common List Methods 131 | 132 | Here are some commonly used list methods: 133 | 134 | ```python 135 | numbers = [1, 2, 3] 136 | 137 | # Adding an element 138 | numbers.append(4) 139 | 140 | # Counting occurrences 141 | print(numbers.count(2)) # Output: 1 142 | 143 | # Sorting the list 144 | numbers.sort() 145 | print(numbers) # Output: [1, 2, 3, 4] 146 | ``` 147 | 148 | 149 | 150 | ## 2️⃣ Tuples in Python 🔗 151 | 152 | ### What is a Tuple? 153 | 154 | A **tuple** is an immutable (unchangeable) collection of ordered elements. Tuples are often used to group related data. 155 | 156 | 157 | 158 | ### Creating a Tuple 159 | 160 | ```python 161 | # Creating a tuple of strings 162 | fruits = ("apple", "banana", "cherry") 163 | ``` 164 | 165 | 166 | 167 | ### Accessing Tuple Elements 168 | 169 | Similar to lists, you can access tuple elements using indexing. 170 | 171 | ```python 172 | fruits = ("apple", "banana", "cherry") 173 | 174 | print(fruits[0]) # Output: apple 175 | ``` 176 | 177 | 178 | 179 | ### Immutability of Tuples 180 | 181 | Tuples cannot be changed after creation. Attempting to modify a tuple results in an error. 182 | 183 | ```python 184 | fruits = ("apple", "banana", "cherry") 185 | 186 | # This will raise an error 187 | fruits[1] = "orange" 188 | ``` 189 | 190 | 191 | 192 | ### Common Tuple Methods 193 | 194 | ```python 195 | fruits = ("apple", "banana", "cherry") 196 | 197 | # Getting the index of an element 198 | print(fruits.index("banana")) # Output: 1 199 | 200 | # Counting occurrences 201 | print(fruits.count("cherry")) # Output: 1 202 | ``` 203 | 204 | 205 | 206 | ## 3️⃣ Dictionaries in Python 📖 207 | 208 | ### What is a Dictionary? 209 | 210 | A **dictionary** is a mutable collection of key-value pairs. Keys must be unique and immutable, while values can be of any type. 211 | 212 | 213 | 214 | ### Creating a Dictionary 215 | 216 | ```python 217 | # Creating a dictionary 218 | person = { 219 | "name": "Alice", 220 | "age": 25, 221 | "city": "New York" 222 | } 223 | ``` 224 | 225 | 226 | 227 | ### Accessing Dictionary Values 228 | 229 | You can access values using keys. 230 | 231 | ```python 232 | person = {"name": "Alice", "age": 25} 233 | 234 | print(person["name"]) # Output: Alice 235 | ``` 236 | 237 | 238 | 239 | ### Adding and Updating Key-Value Pairs 240 | 241 | ```python 242 | person = {"name": "Alice", "age": 25} 243 | 244 | # Adding a new key-value pair 245 | person["city"] = "New York" 246 | 247 | # Updating an existing key 248 | person["age"] = 26 249 | ``` 250 | 251 | 252 | 253 | ### Removing Key-Value Pairs 254 | 255 | Use the `del` keyword or `pop()` method. 256 | 257 | ```python 258 | person = {"name": "Alice", "age": 25} 259 | 260 | # Removing a key-value pair 261 | del person["age"] 262 | 263 | # Using pop() 264 | person.pop("name") 265 | ``` 266 | 267 | 268 | 269 | ### Common Dictionary Methods 270 | 271 | ```python 272 | person = {"name": "Alice", "age": 25} 273 | 274 | # Getting all keys 275 | print(person.keys()) # Output: dict_keys(['name', 'age']) 276 | 277 | # Getting all values 278 | print(person.values()) # Output: dict_values(['Alice', 25]) 279 | ``` 280 | 281 | 282 | 283 | ## 🧠 Practice Exercises 284 | 285 | 1. 
Create a list of your favorite movies and print the last one using negative indexing. 286 | 2. Create a tuple of three numbers and calculate their sum. 287 | 3. Create a dictionary to store information about a book (title, author, year), and add the publisher's name. 288 | 289 | 290 | 291 | ## 🌟 Summary 292 | 293 | - **Lists** are mutable and ordered collections. 294 | - **Tuples** are immutable and ordered collections. 295 | - **Dictionaries** store data as key-value pairs and are mutable. 296 | 297 | --- 298 | 299 | 300 | -------------------------------------------------------------------------------- /20_Logistic Regression/20_Logistic Regression.md: -------------------------------------------------------------------------------- 1 | [<< Day 19](../19_Linear%20Regression/19_Linear%20Regression.md) | [Day 21 >>](../21_Clustering%20(K-Means)/21_Clustering%20(K-Means).md) 2 | 3 | 4 | # 📘 Day 20: Logistic Regression with scikit-learn 5 | 6 | Welcome to Day 20 of the **30 Days of Data Science** series! Today, we will focus on **Logistic Regression**, one of the most commonly used classification algorithms in machine learning. By the end of this guide, you will understand the basics of Logistic Regression and how to implement it in Python using the `scikit-learn` library. 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [📘 Day 20: Logistic Regression with scikit-learn](#-day-20-logistic-regression-with-scikit-learn) 13 | - [📌 Topics Covered](#-topics-covered) 14 | - [1️⃣ What is Logistic Regression?](#1️⃣-what-is-logistic-regression) 15 | - [2️⃣ Logistic Regression vs Linear Regression](#2️⃣-logistic-regression-vs-linear-regression) 16 | - [3️⃣ Sigmoid Function](#3️⃣-sigmoid-function) 17 | - [4️⃣ Implementing Logistic Regression in scikit-learn](#4️⃣-implementing-logistic-regression-in-scikit-learn) 18 | - [Dataset Overview](#dataset-overview) 19 | - [Steps to Implement Logistic Regression](#steps-to-implement-logistic-regression) 20 | - [Code Example](#code-example) 21 | - [5️⃣ Evaluating the Model](#5️⃣-evaluating-the-model) 22 | - [Metrics](#metrics) 23 | - [Confusion Matrix](#confusion-matrix) 24 | - [ROC Curve and AUC](#roc-curve-and-auc) 25 | - [🧠 Practice Exercises](#-practice-exercises) 26 | - [🌟 Summary](#-summary) 27 | 28 | 29 | 30 | 31 | ## 📌 Topics Covered 32 | 33 | - **Understanding Logistic Regression**: Overview and applications. 34 | - **Mathematics Behind Logistic Regression**: Sigmoid function and decision boundaries. 35 | - **Implementing Logistic Regression in Python**: Using scikit-learn. 36 | - **Model Evaluation**: Metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. 37 | 38 | 39 | 40 | ## 1️⃣ What is Logistic Regression? 41 | 42 | Logistic Regression is a **supervised learning algorithm** used for binary classification problems. Despite its name, it is not a regression algorithm but a **classification technique** that predicts discrete outcomes (e.g., Yes/No, True/False, 0/1). 43 | 44 | **Applications**: 45 | - Predicting whether an email is spam or not. 46 | - Identifying if a transaction is fraudulent. 47 | - Classifying if a tumor is malignant or benign. 
48 | 49 | 50 | 51 | ## 2️⃣ Logistic Regression vs Linear Regression 52 | 53 | | Feature | Logistic Regression | Linear Regression | 54 | |-------------------------|-------------------------|-------------------------| 55 | | **Goal** | Classification | Regression (continuous output) | 56 | | **Output Range** | [0, 1] (probabilities) | (-∞, +∞) | 57 | | **Function** | Sigmoid Function | Linear Equation | 58 | 59 | 60 | 61 | ## 3️⃣ Sigmoid Function 62 | 63 | The **sigmoid function** maps any real-valued number into the range [0, 1], making it ideal for predicting probabilities. 64 | 65 | ### Formula: 66 | 67 | σ(z) = 1 / (1 + e^(-z)) 68 | 69 | Where: 70 | - **z = wᵀX + b** (linear combination of weights and inputs). 71 | 72 | ### Sigmoid Plot: 73 | 74 | - Input **z = 0**: Output is 0.5. 75 | - As **z → +∞**, Output approaches 1. 76 | - As **z → -∞**, Output approaches 0. 77 | 78 | 79 | 80 | ## 4️⃣ Implementing Logistic Regression in scikit-learn 81 | 82 | We will use the `scikit-learn` library to implement Logistic Regression. 83 | 84 | ### Dataset Overview 85 | 86 | We will use the **Breast Cancer dataset** from `scikit-learn`. It is a binary classification dataset where the goal is to classify tumors as malignant (0) or benign (1). 87 | 88 | 89 | 90 | ### Steps to Implement Logistic Regression 91 | 92 | 1. **Import Libraries**: 93 | - `numpy`, `pandas`, `matplotlib`, and `scikit-learn`. 94 | 2. **Load Dataset**: 95 | - Use `sklearn.datasets.load_breast_cancer`. 96 | 3. **Preprocess Data**: 97 | - Split data into training and testing sets. 98 | 4. **Train the Model**: 99 | - Use `LogisticRegression` from `sklearn.linear_model`. 100 | 5. **Evaluate the Model**: 101 | - Use metrics like accuracy and confusion matrix. 102 | 103 | 104 | 105 | ### Code Example 106 | 107 | ```python 108 | # Importing necessary libraries 109 | import numpy as np 110 | import pandas as pd 111 | from sklearn.datasets import load_breast_cancer 112 | from sklearn.model_selection import train_test_split 113 | from sklearn.linear_model import LogisticRegression 114 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 115 | 116 | # Load the dataset 117 | data = load_breast_cancer() 118 | X = data.data 119 | y = data.target 120 | 121 | # Split into training and testing datasets 122 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 123 | 124 | # Initialize the model 125 | model = LogisticRegression(max_iter=10000) 126 | 127 | # Train the model 128 | model.fit(X_train, y_train) 129 | 130 | # Make predictions 131 | y_pred = model.predict(X_test) 132 | 133 | # Evaluate the model 134 | accuracy = accuracy_score(y_test, y_pred) 135 | print(f"Accuracy: {accuracy * 100:.2f}%") 136 | print("Classification Report:") 137 | print(classification_report(y_test, y_pred)) 138 | ``` 139 | 140 | **Output**: 141 | 142 | ```plaintext 143 | Accuracy: 95.32% 144 | Classification Report: 145 | precision recall f1-score support 146 | 147 | 0 0.96 0.94 0.95 63 148 | 1 0.95 0.97 0.96 108 149 | 150 | accuracy 0.95 171 151 | macro avg 0.95 0.95 0.95 171 152 | weighted avg 0.95 0.95 0.95 171 153 | ``` 154 | 155 | 156 | 157 | ## 5️⃣ Evaluating the Model 158 | 159 | ### Metrics 160 | 161 | - **Accuracy**: Proportion of correctly classified samples. 162 | - **Precision**: Ratio of correctly predicted positive observations to total predicted positives. 163 | - **Recall**: Ratio of correctly predicted positives to all actual positives. 
164 | - **F1-Score**: Harmonic mean of precision and recall. 165 | 166 | 167 | 168 | ### Confusion Matrix 169 | 170 | A **confusion matrix** shows the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). 171 | 172 | ```python 173 | # Confusion Matrix 174 | import seaborn as sns 175 | import matplotlib.pyplot as plt 176 | 177 | cm = confusion_matrix(y_test, y_pred) 178 | sns.heatmap(cm, annot=True, fmt="d", cmap="Blues") 179 | plt.xlabel("Predicted") 180 | plt.ylabel("Actual") 181 | plt.title("Confusion Matrix") 182 | plt.show() 183 | ``` 184 | 185 | 186 | 187 | ### ROC Curve and AUC 188 | 189 | The **ROC curve** is a plot of True Positive Rate vs False Positive Rate, and **AUC** measures the area under this curve. 190 | 191 | ```python 192 | from sklearn.metrics import roc_curve, roc_auc_score 193 | 194 | y_proba = model.predict_proba(X_test)[:, 1] 195 | fpr, tpr, _ = roc_curve(y_test, y_proba) 196 | auc = roc_auc_score(y_test, y_proba) 197 | 198 | plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}") 199 | plt.xlabel("False Positive Rate") 200 | plt.ylabel("True Positive Rate") 201 | plt.title("ROC Curve") 202 | plt.legend() 203 | plt.show() 204 | ``` 205 | 206 | 207 | 208 | ## 🧠 Practice Exercises 209 | 210 | 1. Train a Logistic Regression model on the **Iris dataset** for binary classification. 211 | 2. Experiment with the **`C` parameter** in `LogisticRegression` to observe its effect on regularization. 212 | 3. Evaluate the model using **Precision-Recall curves**. 213 | 214 | 215 | 216 | ## 🌟 Summary 217 | 218 | - Logistic Regression is a classification algorithm suitable for binary outcomes. 219 | - The sigmoid function maps linear predictions to probabilities. 220 | - scikit-learn provides a straightforward implementation of Logistic Regression. 221 | - Evaluation metrics and visualizations help assess the model's performance. 222 | 223 | --- 224 | 225 | 226 | -------------------------------------------------------------------------------- /28_Time Series Forecasting/28_Time Series Forecasting.md: -------------------------------------------------------------------------------- 1 | [<< Day 27](../27_Natural%20Language%20Processing%20(NLP)/27_Natural%20Language%20Processing%20(NLP).md) | [Day 29 >>](../29_Working%20with%20Big%20Data/29_Working%20with%20Big%20Data.md) 2 | 3 | 4 | # 📅 Day 28: Time Series Forecasting ⏳📊 5 | 6 | Welcome to **Day 28** of the **30 Days of Data Science** series! 🎉 Today, we will explore **Time Series Forecasting**, one of the most critical techniques in data science used for analyzing sequential data over time. We'll cover key concepts and popular models like **ARIMA** and **Prophet**. By the end of this lesson, you’ll have the tools to forecast future trends and patterns effectively. 
7 | 8 | 9 | 10 | ## 🌟 Table of Contents 11 | - [📚 Introduction to Time Series Forecasting](#-introduction-to-time-series-forecasting) 12 | - [📈 Understanding Time Series Data](#-understanding-time-series-data) 13 | - [Components of Time Series](#components-of-time-series) 14 | - [🔮 ARIMA (AutoRegressive Integrated Moving Average)](#-arima-autoregressive-integrated-moving-average) 15 | - [Steps for ARIMA Modeling](#steps-for-arima-modeling) 16 | - [Python Example of ARIMA](#python-example-of-arima) 17 | - [📜 Seasonal Decomposition of Time Series (STL)](#-seasonal-decomposition-of-time-series-stl) 18 | - [📦 SARIMA (Seasonal ARIMA)](#-sarima-seasonal-arima) 19 | - [🌍 Prophet: Time Series Forecasting Made Easy](#-prophet-time-series-forecasting-made-easy) 20 | - [Features of Prophet](#features-of-prophet) 21 | - [Python Example of Prophet](#python-example-of-prophet) 22 | - [🧠 LSTM (Long Short-Term Memory Networks)](#-lstm-long-short-term-memory-networks) 23 | - [✍️ Practice Exercise](#%EF%B8%8F-practice-exercise) 24 | - [📝 Summary](#-summary) 25 | 26 | 27 | 28 | ## 📚 Introduction to Time Series Forecasting 29 | 30 | Time series forecasting predicts future values based on previously observed data. It is widely used in areas like: 31 | - **Finance**: Stock price prediction 📈 32 | - **Weather Forecasting**: Temperature and rainfall prediction 🌧️ 33 | - **Retail**: Sales forecasting 🛒 34 | 35 | Forecasting allows businesses and researchers to plan effectively and make informed decisions. 36 | 37 | 38 | 39 | ## 📈 Understanding Time Series Data 40 | 41 | A **time series** is a sequence of data points collected or recorded at regular time intervals. 42 | 43 | ### Components of Time Series 44 | 1. **Trend**: Overall upward or downward movement over time. 45 | 2. **Seasonality**: Regular patterns that repeat over a fixed period. 46 | 3. **Cyclic Patterns**: Long-term fluctuations not tied to seasonality. 47 | 4. **Noise**: Random variations or outliers in data. 48 | 49 | ### Example: Time Series Plot 50 | ```python 51 | import pandas as pd 52 | import matplotlib.pyplot as plt 53 | 54 | # Sample data 55 | data = { 56 | 'Date': pd.date_range(start='2023-01-01', periods=12, freq='M'), 57 | 'Sales': [200, 220, 250, 270, 300, 350, 400, 420, 450, 470, 500, 550] 58 | } 59 | df = pd.DataFrame(data) 60 | 61 | # Plot 62 | plt.plot(df['Date'], df['Sales'], marker='o', linestyle='-') 63 | plt.title("Monthly Sales Data") 64 | plt.xlabel("Date") 65 | plt.ylabel("Sales") 66 | plt.grid() 67 | plt.show() 68 | ``` 69 | 70 | 71 | 72 | ## 🔮 ARIMA (AutoRegressive Integrated Moving Average) 73 | 74 | ### What is ARIMA? 75 | ARIMA is a statistical modeling technique for analyzing and forecasting time series data. It combines three components: 76 | - **AR (AutoRegressive)**: Uses past values. 77 | - **I (Integrated)**: Differencing the data to make it stationary. 78 | - **MA (Moving Average)**: Uses past forecast errors. 79 | 80 | ### Steps for ARIMA Modeling 81 | 1. **Visualize the Data**: Plot the series and check for trends, seasonality, and stationarity. 82 | 2. **Stationarity Test**: Use tests like the Augmented Dickey-Fuller (ADF) test. 83 | 3. **Differencing**: Transform non-stationary data to stationary. 84 | 4. **Parameter Selection**: Use `p`, `d`, `q` to define the ARIMA model. 85 | 5. **Model Training**: Fit the ARIMA model to your data. 86 | 6. **Forecasting**: Predict future values. 
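Step 2 above mentions the Augmented Dickey-Fuller (ADF) test, which the example below does not show. Here is a minimal, self-contained sketch using `statsmodels` (the sample series is purely illustrative):

```python
from statsmodels.tsa.stattools import adfuller

# Illustrative monthly values; in practice, pass your own series
series = [112, 118, 132, 129, 121, 135, 148, 145, 140, 155, 164, 170]

# Null hypothesis of the ADF test: the series has a unit root (i.e., it is non-stationary)
adf_stat, p_value, *_ = adfuller(series, maxlag=2)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")

# A large p-value (e.g., > 0.05) suggests differencing the series (the "I" in ARIMA) before fitting
```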
87 | 88 | ### Python Example of ARIMA 89 | ```python 90 | from statsmodels.tsa.arima.model import ARIMA 91 | import pandas as pd 92 | import matplotlib.pyplot as plt 93 | 94 | # Example data 95 | data = [112, 118, 132, 129, 121, 135, 148, 145, 140, 155, 164, 170] 96 | df = pd.DataFrame(data, columns=['Sales']) 97 | 98 | # Fit ARIMA model 99 | model = ARIMA(df['Sales'], order=(1, 1, 1)) 100 | model_fit = model.fit() 101 | 102 | # Summary of the model 103 | print(model_fit.summary()) 104 | 105 | # Forecast future values 106 | forecast = model_fit.forecast(steps=5) 107 | print("Forecasted Values:", forecast) 108 | ``` 109 | 110 | 111 | ## 📜 Seasonal Decomposition of Time Series (STL) 112 | 113 | Seasonal Decomposition of Time Series (STL) splits the data into **trend**, **seasonal**, and **residual** components. 114 | 115 | ### Example: STL Decomposition 116 | 117 | ```python 118 | from statsmodels.tsa.seasonal import STL 119 | import pandas as pd 120 | import matplotlib.pyplot as plt 121 | 122 | # Sample time series data 123 | data = [112, 118, 132, 129, 121, 135, 148, 136, 119, 104, 118, 115] 124 | df = pd.DataFrame(data, columns=['value']) 125 | 126 | # STL decomposition 127 | stl = STL(df['value'], period=12) 128 | result = stl.fit() 129 | 130 | # Plot components 131 | result.plot() 132 | plt.show() 133 | ``` 134 | 135 | 136 | 137 | ## 📦 SARIMA (Seasonal ARIMA) 138 | 139 | **SARIMA** extends ARIMA by incorporating seasonality. 140 | 141 | The model is defined by parameters `(p, d, q) x (P, D, Q, s)` where: 142 | - `(p, d, q)` are ARIMA parameters. 143 | - `(P, D, Q, s)` are seasonal parameters. 144 | 145 | ### Example: SARIMA 146 | 147 | ```python 148 | from statsmodels.tsa.statespace.sarimax import SARIMAX 149 | 150 | # Fit SARIMA model 151 | model = SARIMAX(df['value'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)) 152 | sarima_result = model.fit() 153 | 154 | # Forecast 155 | forecast = sarima_result.forecast(steps=5) 156 | print("SARIMA Forecast:", forecast) 157 | ``` 158 | 159 | 160 | 161 | ## 🌍 Prophet: Time Series Forecasting Made Easy 162 | 163 | ### What is Prophet? 164 | Prophet is an open-source library developed by Facebook for time series forecasting. It is highly flexible, easy to use, and handles missing data, holidays, and seasonal patterns effectively. 165 | 166 | ### Features of Prophet 167 | - Handles **seasonality** and **holiday effects**. 168 | - Robust to **missing data**. 169 | - Requires minimal tuning. 170 | 171 | ### Python Example of Prophet 172 | ```python 173 | from prophet import Prophet 174 | import pandas as pd 175 | import matplotlib.pyplot as plt 176 | 177 | # Create example data 178 | data = { 179 | 'ds': pd.date_range(start='2023-01-01', periods=12, freq='M'), 180 | 'y': [200, 220, 250, 270, 300, 350, 400, 420, 450, 470, 500, 550] 181 | } 182 | df = pd.DataFrame(data) 183 | 184 | # Fit Prophet model 185 | model = Prophet() 186 | model.fit(df) 187 | 188 | # Create future dataframe 189 | future = model.make_future_dataframe(periods=6, freq='M') 190 | 191 | # Forecast 192 | forecast = model.predict(future) 193 | 194 | # Plot results 195 | fig = model.plot(forecast) 196 | plt.show() 197 | ``` 198 | 199 | 200 | ## 🧠 LSTM (Long Short-Term Memory Networks) 201 | 202 | LSTMs are a type of **recurrent neural network (RNN)** capable of learning long-term dependencies. 
203 | 204 | ### Example: LSTM Model 205 | 206 | ```python 207 | import numpy as np 208 | from keras.models import Sequential 209 | from keras.layers import LSTM, Dense 210 | 211 | # Sample data 212 | data = np.array([112, 118, 132, 129, 121, 135, 148, 136, 119, 104, 118, 115]) 213 | X = np.array([data[i:i + 3] for i in range(len(data) - 3)]).reshape(-1, 3, 1) # Features: sliding windows of the 3 previous values 214 | y = data[3:] # Labels: the value that follows each window 215 | 216 | # Define LSTM model 217 | model = Sequential() 218 | model.add(LSTM(50, activation='relu', input_shape=(3, 1))) 219 | model.add(Dense(1)) 220 | model.compile(optimizer='adam', loss='mse') 221 | 222 | # Train 223 | model.fit(X, y, epochs=200, verbose=0) 224 | 225 | # Forecast the next value from the most recent window of 3 observations 226 | forecast = model.predict(data[-3:].reshape(1, 3, 1)) 227 | print("LSTM Forecast:", forecast) 228 | ``` 229 | 230 | 231 | 232 | ## ✍️ Practice Exercise 233 | 234 | Try the following: 235 | 1. Load a **time series dataset** of your choice (e.g., stock prices, weather data). 236 | 2. Preprocess the data to handle missing values. 237 | 3. Train an **ARIMA** model and forecast future values. 238 | 4. Compare the performance of **ARIMA** and **Prophet** on the same dataset. 239 | 240 | 241 | 242 | ## 📝 Summary 243 | 244 | In this lesson, we covered the fundamentals of **Time Series Forecasting**, explored **ARIMA**, and demonstrated the use of **Prophet** for efficient predictions. Forecasting is a powerful tool for uncovering trends and patterns in sequential data. Mastering these techniques will empower you to tackle real-world problems in diverse domains. 245 | 246 | --- 247 | 248 | 249 | 250 | -------------------------------------------------------------------------------- /27_Natural Language Processing (NLP)/27_Natural Language Processing (NLP).md: -------------------------------------------------------------------------------- 1 | [<< Day 26](../26_Advanced%20ML%3A%20Hyperparameter%20Tuning/26_Advanced%20ML%3A%20Hyperparameter%20Tuning.md) | [Day 28 >>](../28_Time%20Series%20Forecasting/28_Time%20Series%20Forecasting.md) 2 | 3 | 4 | # 🌟 Day 27: Natural Language Processing (NLP) 5 | 6 | Welcome to **Day 27** of the 30 Days of Data Science series! Today, we dive into the fascinating world of **Natural Language Processing (NLP)**. NLP bridges the gap between human language and computers, allowing machines to understand, process, and generate human text. By the end of this lesson, you will have a solid understanding of the following key topics: 7 | 8 | - **NLTK** 9 | - **spaCy** 10 | - **Hugging Face** 11 | - **Topic Modeling with Gensim** 12 | - **Text Summarization** 13 | - **Word Embeddings with Word2Vec and GloVe** 14 | 15 | 16 | 17 | ## 📖 Table of Contents 18 | 19 | - [🌟 Day 27: Natural Language Processing (NLP)](#-day-27-natural-language-processing-nlp) 20 | - [📖 Table of Contents](#-table-of-contents) 21 | - [🔍 What is Natural Language Processing?](#-what-is-natural-language-processing) 22 | - [📚 NLTK: The Natural Language Toolkit](#-nltk-the-natural-language-toolkit) 23 | - [1. Tokenization](#1-tokenization) 24 | - [2. Stopword Removal](#2-stopword-removal) 25 | - [3. Stemming and Lemmatization](#3-stemming-and-lemmatization) 26 | - [💡 spaCy: Industrial-Strength NLP](#-spacy-industrial-strength-nlp) 27 | - [1. Named Entity Recognition (NER)](#1-named-entity-recognition-ner) 28 | - [2. Part-of-Speech (POS) Tagging](#2-part-of-speech-pos-tagging) 29 | - [🤗 Hugging Face: Transformers for NLP](#-hugging-face-transformers-for-nlp) 30 | - [1. Sentiment Analysis](#1-sentiment-analysis) 31 | - [2. 
Text Generation](#2-text-generation) 32 | - [🧠 Topic Modeling with Gensim](#-topic-modeling-with-gensim) 33 | - [📝 Text Summarization](#-text-summarization) 34 | - [📖 Word Embeddings with Word2Vec and GloVe](#-word-embeddings-with-word2vec-and-glove) 35 | - [📓 Practice Exercises](#-practice-exercises) 36 | - [📜 Summary](#-summary) 37 | 38 | 39 | 40 | 41 | ## 🔍 What is Natural Language Processing? 42 | 43 | **Natural Language Processing (NLP)** is a field within Artificial Intelligence that focuses on enabling machines to understand and interact with human language. It has wide applications, including: 44 | 45 | - **Text Classification**: Spam detection, sentiment analysis. 46 | - **Machine Translation**: Translating text between languages (e.g., Google Translate). 47 | - **Named Entity Recognition (NER)**: Identifying entities like names, dates, and locations in text. 48 | - **Question Answering**: Building systems like ChatGPT. 49 | 50 | 51 | 52 | ## 📚 NLTK: The Natural Language Toolkit 53 | 54 | **NLTK** is a powerful Python library for working with text data. It provides tools for tokenization, stemming, lemmatization, and more. Let’s explore some common functionalities. 55 | 56 | ### 1. Tokenization 57 | 58 | Tokenization is the process of breaking text into smaller components, such as words or sentences. 59 | 60 | ```python 61 | import nltk 62 | from nltk.tokenize import word_tokenize, sent_tokenize 63 | 64 | nltk.download('punkt') 65 | 66 | text = "Natural Language Processing is fascinating. Let's learn more!" 67 | 68 | # Word Tokenization 69 | words = word_tokenize(text) 70 | print("Word Tokens:", words) 71 | 72 | # Sentence Tokenization 73 | sentences = sent_tokenize(text) 74 | print("Sentence Tokens:", sentences) 75 | ``` 76 | 77 | ### 2. Stopword Removal 78 | 79 | Stopwords are common words (e.g., "is", "the") that are often removed in text preprocessing. 80 | 81 | ```python 82 | from nltk.corpus import stopwords 83 | 84 | nltk.download('stopwords') 85 | 86 | stop_words = set(stopwords.words('english')) 87 | filtered_words = [word for word in words if word.lower() not in stop_words] 88 | 89 | print("Filtered Words:", filtered_words) 90 | ``` 91 | 92 | ### 3. Stemming and Lemmatization 93 | 94 | - **Stemming** reduces words to their root form (e.g., "running" -> "run"). 95 | - **Lemmatization** maps words to their base dictionary form (e.g., "better" -> "good"). 96 | 97 | ```python 98 | from nltk.stem import PorterStemmer 99 | from nltk.stem import WordNetLemmatizer 100 | 101 | nltk.download('wordnet') 102 | 103 | stemmer = PorterStemmer() 104 | lemmatizer = WordNetLemmatizer() 105 | 106 | word = "running" 107 | print("Stemmed:", stemmer.stem(word)) 108 | print("Lemmatized:", lemmatizer.lemmatize(word, pos='v')) 109 | ``` 110 | 111 | 112 | 113 | ## 💡 spaCy: Industrial-Strength NLP 114 | 115 | **spaCy** is an efficient library designed for large-scale NLP tasks. It supports features like Named Entity Recognition (NER), Part-of-Speech tagging, and dependency parsing. 116 | 117 | ### 1. Named Entity Recognition (NER) 118 | 119 | NER identifies entities such as names, dates, and locations in text. 120 | 121 | ```python 122 | import spacy 123 | 124 | nlp = spacy.load("en_core_web_sm") 125 | text = "Apple was founded by Steve Jobs in Cupertino, California." 126 | 127 | doc = nlp(text) 128 | for ent in doc.ents: 129 | print(ent.text, ent.label_) 130 | ``` 131 | 132 | ### 2. 
Part-of-Speech (POS) Tagging 133 | 134 | POS tagging assigns grammatical tags (e.g., noun, verb) to words in a sentence. 135 | 136 | ```python 137 | for token in doc: 138 | print(token.text, token.pos_) 139 | ``` 140 | 141 | 142 | 143 | ## 🤗 Hugging Face: Transformers for NLP 144 | 145 | Hugging Face provides state-of-the-art NLP models, including BERT and GPT, through the `transformers` library. 146 | 147 | ### 1. Sentiment Analysis 148 | 149 | Use a pre-trained model to classify the sentiment of a given text. 150 | 151 | ```python 152 | from transformers import pipeline 153 | 154 | classifier = pipeline("sentiment-analysis") 155 | result = classifier("I love learning NLP!") 156 | print(result) 157 | ``` 158 | 159 | ### 2. Text Generation 160 | 161 | Generate text using a language model like GPT-2. 162 | 163 | ```python 164 | from transformers import pipeline 165 | 166 | generator = pipeline("text-generation", model="gpt2") 167 | result = generator("Natural Language Processing is", max_length=30, num_return_sequences=1) 168 | print(result[0]['generated_text']) 169 | ``` 170 | 171 | 172 | 173 | ## 🧠 Topic Modeling with Gensim 174 | 175 | **Topic Modeling** is the task of identifying abstract topics within a collection of documents. The `gensim` library provides tools for Latent Dirichlet Allocation (LDA), a popular topic modeling technique. 176 | 177 | ```python 178 | from gensim import corpora, models 179 | from gensim.models.ldamodel import LdaModel 180 | 181 | # Sample data 182 | documents = ["I love data science", "Data science is the future", "NLP is fascinating"] 183 | 184 | # Preprocessing 185 | tokenized_docs = [doc.split() for doc in documents] 186 | dictionary = corpora.Dictionary(tokenized_docs) 187 | corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs] 188 | 189 | # LDA Model 190 | lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary) 191 | for idx, topic in lda_model.print_topics(-1): 192 | print(f"Topic {idx}: {topic}") 193 | ``` 194 | 195 | 196 | 197 | ## 📝 Text Summarization 198 | 199 | Text summarization condenses a large text into a shorter version while retaining the main points. You can use Hugging Face for extractive summarization. 200 | 201 | ```python 202 | from transformers import pipeline 203 | 204 | summarizer = pipeline("summarization") 205 | text = "Natural Language Processing has a variety of applications, including text summarization. Summarization aims to condense long texts." 206 | 207 | summary = summarizer(text, max_length=50, min_length=25, do_sample=False) 208 | print(summary[0]['summary_text']) 209 | ``` 210 | 211 | 212 | 213 | ## 📖 Word Embeddings with Word2Vec and GloVe 214 | 215 | Word embeddings are dense vector representations of words. Libraries like `gensim` support Word2Vec, while pre-trained GloVe embeddings are available for direct use. 216 | 217 | ### Word2Vec Example 218 | 219 | ```python 220 | from gensim.models import Word2Vec 221 | 222 | sentences = [["I", "love", "NLP"], ["Word2Vec", "is", "useful"]] 223 | model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4) 224 | 225 | # Get vector for a word 226 | vector = model.wv["NLP"] 227 | print(vector) 228 | ``` 229 | 230 | 231 | 232 | ## 📓 Practice Exercises 233 | 234 | 1. Tokenize the following text using NLTK: 235 | ``` 236 | "The quick brown fox jumps over the lazy dog." 237 | ``` 238 | Count the number of tokens and remove stopwords. 239 | 240 | 2. 
Use spaCy to extract entities from: 241 | ``` 242 | "Tesla's stock price soared after Elon Musk's announcement in 2023." 243 | ``` 244 | 245 | 3. Use Hugging Face's sentiment analysis pipeline to analyze: 246 | ``` 247 | "The movie was a masterpiece, but the ending was disappointing." 248 | ``` 249 | 250 | 4. Generate a short text completion starting with: 251 | ``` 252 | "Data Science is the future of" 253 | ``` 254 | 255 | 5. Use Gensim's LDA model to find topics from the following documents: 256 | ``` 257 | ["Artificial intelligence is transforming industries", "Machine learning is a subset of AI", "NLP is a key AI application"] 258 | ``` 259 | 260 | 6. Summarize the following text: 261 | ``` 262 | "Machine learning is a branch of artificial intelligence that focuses on building systems capable of learning and improving from experience without being explicitly programmed." 263 | ``` 264 | 265 | 266 | 267 | ## 📜 Summary 268 | 269 | Today, you learned about the foundational tools and techniques for NLP: 270 | 271 | - **NLTK**: Preprocessing text with tokenization, stopword removal, and lemmatization. 272 | - **spaCy**: Performing advanced tasks like NER and POS tagging. 273 | - **Hugging Face**: Leveraging pre-trained models for sentiment analysis and text generation. 274 | - **Gensim**: Topic modeling with LDA. 275 | - **Summarization**: Condensing text into shorter forms. 276 | - **Word Embeddings**: Representing words as dense vectors with Word2Vec and GloVe. 277 | 278 | NLP is a powerful field with applications in numerous domains. Keep practicing and explore these libraries further to master them! 279 | 280 | --- 281 | 282 | 283 | -------------------------------------------------------------------------------- /26_Advanced ML: Hyperparameter Tuning/26_Advanced ML: Hyperparameter Tuning.md: -------------------------------------------------------------------------------- 1 | [<< Day 25](../25_Model%20Evaluation%20and%20Metrics/25_Model%20Evaluation%20and%20Metrics.md) | [Day 27 >>](../27_Natural%20Language%20Processing%20(NLP)/27_Natural%20Language%20Processing%20(NLP).md) 2 | 3 | # 🚀 **Day 26: Advanced ML - Hyperparameter Tuning** 4 | 5 | Welcome to **Day 26** of the **30 Days of Data Science** series! 🎉 Today, we delve into **Hyperparameter Tuning**, focusing on two powerful techniques: **GridSearchCV** and **RandomizedSearchCV**. Additionally, we will explore advanced topics like **Bayesian Optimization**, **Optuna**, and hyperparameter tuning for neural networks. These methods are essential for improving model performance and selecting the best parameters for machine learning models. Let's dive in! 
🔍 6 | 7 | 8 | 9 | ## 📚 **Table of Contents** 10 | 11 | - [📚 Introduction to Hyperparameter Tuning](#-introduction-to-hyperparameter-tuning) 12 | - [⚙️ GridSearchCV](#️-gridsearchcv) 13 | - [Advantages](#advantages) 14 | - [Disadvantages](#disadvantages) 15 | - [Implementation Example](#implementation-example) 16 | - [🎲 RandomizedSearchCV](#-randomizedsearchcv) 17 | - [Advantages](#advantages-1) 18 | - [Disadvantages](#disadvantages-1) 19 | - [Implementation Example](#implementation-example-1) 20 | - [🌟 Bayesian Optimization](#-bayesian-optimization) 21 | - [Implementation Example](#implementation-example-2) 22 | - [🌟 Optuna](#-optuna) 23 | - [Implementation Example](#implementation-example-3) 24 | - [🌟 Hyperparameter Tuning for Neural Networks](#-hyperparameter-tuning-for-neural-networks) 25 | - [Example with Keras Tuner](#example-with-keras-tuner) 26 | - [💡 Best Practices](#-best-practices) 27 | - [🛠️ Practice Exercise](#️-practice-exercise) 28 | - [📜 Summary](#-summary) 29 | 30 | 31 | 32 | ## 📚 **Introduction to Hyperparameter Tuning** 33 | 34 | Hyperparameters are parameters that are not learned from the data during the training process but are instead set manually before the training begins. Examples include the learning rate, number of estimators, or maximum depth in a decision tree. 35 | 36 | **Why Tune Hyperparameters?** 37 | 38 | - Improves model performance 🎯 39 | - Prevents overfitting or underfitting 🔧 40 | - Helps in identifying the best configuration for your model 🏆 41 | 42 | 43 | 44 | ## ⚙️ **GridSearchCV** 45 | 46 | GridSearchCV is an exhaustive search technique that evaluates all possible combinations of hyperparameter values. 47 | 48 | ### Advantages 49 | - Guarantees finding the best combination of parameters 🎯 50 | - Straightforward to implement 🛠️ 51 | 52 | ### Disadvantages 53 | - Computationally expensive ⏳ 54 | - May not be feasible with large datasets or too many parameters ⚠️ 55 | 56 | ### Implementation Example 57 | 58 | Here's how to use GridSearchCV in Python: 59 | 60 | ```python 61 | from sklearn.model_selection import GridSearchCV 62 | from sklearn.ensemble import RandomForestClassifier 63 | from sklearn.datasets import load_iris 64 | from sklearn.model_selection import train_test_split 65 | 66 | # Load dataset 67 | iris = load_iris() 68 | X, y = iris.data, iris.target 69 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 70 | 71 | # Define model and parameter grid 72 | model = RandomForestClassifier(random_state=42) 73 | param_grid = { 74 | 'n_estimators': [10, 50, 100], 75 | 'max_depth': [None, 10, 20, 30], 76 | 'min_samples_split': [2, 5, 10] 77 | } 78 | 79 | # Perform GridSearchCV 80 | grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy') 81 | grid_search.fit(X_train, y_train) 82 | 83 | # Best parameters and accuracy 84 | print("Best Parameters:", grid_search.best_params_) 85 | print("Best Score:", grid_search.best_score_) 86 | ``` 87 | 88 | 89 | 90 | ## 🎲 **RandomizedSearchCV** 91 | 92 | RandomizedSearchCV searches a subset of the hyperparameter space by sampling a fixed number of parameter combinations. 
93 | 94 | ### Advantages 95 | - Faster than GridSearchCV 🚀 96 | - Can provide similar results with fewer computations 🧠 97 | 98 | ### Disadvantages 99 | - May not explore all possible parameter combinations ⚠️ 100 | - Results may vary depending on random sampling 🎲 101 | 102 | ### Implementation Example 103 | 104 | Here's how to use RandomizedSearchCV in Python: 105 | 106 | ```python 107 | from sklearn.model_selection import RandomizedSearchCV 108 | from sklearn.ensemble import RandomForestClassifier 109 | from sklearn.datasets import load_iris 110 | from sklearn.model_selection import train_test_split 111 | from scipy.stats import randint 112 | 113 | # Load dataset 114 | iris = load_iris() 115 | X, y = iris.data, iris.target 116 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 117 | 118 | # Define model and parameter distributions 119 | model = RandomForestClassifier(random_state=42) 120 | param_distributions = { 121 | 'n_estimators': randint(10, 200), 122 | 'max_depth': [None, 10, 20, 30], 123 | 'min_samples_split': randint(2, 20) 124 | } 125 | 126 | # Perform RandomizedSearchCV 127 | random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, n_iter=50, cv=5, scoring='accuracy', random_state=42) 128 | random_search.fit(X_train, y_train) 129 | 130 | # Best parameters and accuracy 131 | print("Best Parameters:", random_search.best_params_) 132 | print("Best Score:", random_search.best_score_) 133 | ``` 134 | 135 | 136 | 137 | ## 🌟 **Bayesian Optimization** 138 | 139 | Bayesian Optimization is an advanced method for hyperparameter tuning that uses probabilistic models to estimate the performance of different hyperparameter settings. It is especially useful when the search space is vast, and evaluations are expensive. 140 | 141 | ### Implementation Example 142 | 143 | ```python 144 | from skopt import BayesSearchCV 145 | from sklearn.ensemble import RandomForestClassifier 146 | from sklearn.datasets import load_iris 147 | from sklearn.model_selection import train_test_split 148 | 149 | # Load dataset 150 | iris = load_iris() 151 | X, y = iris.data, iris.target 152 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 153 | 154 | # Define model and parameter search space 155 | model = RandomForestClassifier(random_state=42) 156 | search_space = { 157 | 'n_estimators': (10, 200), 158 | 'max_depth': (1, 30), 159 | 'min_samples_split': (2, 20) 160 | } 161 | 162 | # Bayesian Optimization with skopt 163 | bayes_search = BayesSearchCV(estimator=model, search_spaces=search_space, n_iter=50, cv=5, scoring='accuracy', random_state=42) 164 | bayes_search.fit(X_train, y_train) 165 | 166 | # Best parameters and accuracy 167 | print("Best Parameters:", bayes_search.best_params_) 168 | print("Best Score:", bayes_search.best_score_) 169 | ``` 170 | 171 | 172 | 173 | ## 🌟 **Optuna** 174 | 175 | Optuna is an open-source library designed for hyperparameter optimization. It features an automatic search space pruning mechanism that speeds up the optimization process. 
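The pruning mentioned above refers to stopping unpromising trials early; it is not used in the basic example that follows. Here is a separate minimal sketch of the pattern, using a toy objective rather than a real model:

```python
import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    # Stand-in for an iterative training loop: report an intermediate score each "epoch"
    for step in range(10):
        intermediate_score = -((x - 2) ** 2) - (10 - step)
        trial.report(intermediate_score, step)
        # Ask the pruner whether this trial looks unpromising compared to earlier ones
        if trial.should_prune():
            raise optuna.TrialPruned()
    return -((x - 2) ** 2)

# MedianPruner stops trials whose intermediate scores fall below the median of completed trials
study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner(n_warmup_steps=3))
study.optimize(objective, n_trials=20)
print("Best parameters:", study.best_params)
```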
176 | 177 | ### Implementation Example 178 | 179 | ```python 180 | import optuna 181 | from sklearn.ensemble import RandomForestClassifier 182 | from sklearn.datasets import load_iris 183 | from sklearn.model_selection import train_test_split, cross_val_score 184 | 185 | # Load dataset 186 | iris = load_iris() 187 | X, y = iris.data, iris.target 188 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 189 | 190 | # Define objective function 191 | def objective(trial): 192 | n_estimators = trial.suggest_int('n_estimators', 10, 200) 193 | max_depth = trial.suggest_int('max_depth', 1, 30) 194 | min_samples_split = trial.suggest_int('min_samples_split', 2, 20) 195 | 196 | model = RandomForestClassifier( 197 | n_estimators=n_estimators, 198 | max_depth=max_depth, 199 | min_samples_split=min_samples_split, 200 | random_state=42 201 | ) 202 | return cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean() 203 | 204 | # Hyperparameter optimization with Optuna 205 | study = optuna.create_study(direction='maximize') 206 | study.optimize(objective, n_trials=50) 207 | 208 | # Best parameters and accuracy 209 | print("Best Parameters:", study.best_params) 210 | print("Best Score:", study.best_value) 211 | ``` 212 | 213 | 214 | 215 | ## 🌟 **Hyperparameter Tuning for Neural Networks** 216 | 217 | Tuning hyperparameters for neural networks often involves searching for the best combination of learning rates, optimizers, batch sizes, and number of layers. 218 | 219 | ### Example with Keras Tuner 220 | 221 | ```python 222 | from tensorflow import keras 223 | from keras_tuner import RandomSearch 224 | 225 | # Define the model 226 | def build_model(hp): 227 | model = keras.Sequential() 228 | model.add(keras.layers.Dense(units=hp.Int('units', min_value=32, max_value=512, step=32), activation='relu')) 229 | model.add(keras.layers.Dense(3, activation='softmax')) 230 | model.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])), 231 | loss='sparse_categorical_crossentropy', 232 | metrics=['accuracy']) 233 | return model 234 | 235 | # Load dataset 236 | iris = load_iris() 237 | X, y = iris.data, iris.target 238 | 239 | # Hyperparameter optimization with Keras Tuner 240 | tuner = RandomSearch( 241 | build_model, 242 | objective='val_accuracy', 243 | max_trials=5, 244 | executions_per_trial=3, 245 | directory='my_dir', 246 | project_name='intro_to_kt' 247 | ) 248 | 249 | tuner.search(X, y, epochs=10, validation_split=0.2) 250 | ``` 251 | 252 | 253 | 254 | ## 💡 **Best Practices** 255 | 256 | 1. **Start with RandomizedSearchCV** for quick insights. 257 | 2. Use **GridSearchCV** after narrowing down the hyperparameter space. 258 | 3. Utilize techniques like **cross-validation** to avoid overfitting. 259 | 4. Parallelize the search process using multiple CPUs or GPUs. ⚡ 260 | 5. Evaluate results with metrics relevant to your problem, such as precision, recall, or F1-score. 📊 261 | 262 | 263 | 264 | ## 🛠️ **Practice Exercise** 265 | 266 | Use the dataset of your choice and apply **RandomizedSearchCV** to tune the hyperparameters of a Support Vector Machine (SVM) classifier. 267 | 268 | 1. Load a dataset (e.g., `load_digits()` from scikit-learn). 269 | 2. Define a parameter distribution for the SVM. 270 | 3. Use RandomizedSearchCV to find the best parameters. 271 | 4. Evaluate the tuned model on a test set. 
272 | 273 | Example starting code: 274 | 275 | ```python 276 | from sklearn.datasets import load_digits 277 | from sklearn.svm import SVC 278 | from sklearn.model_selection import RandomizedSearchCV, train_test_split 279 | from scipy.stats import uniform 280 | 281 | # Load dataset 282 | X, y = load_digits(return_X_y=True) 283 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 284 | 285 | # Define model and parameter distributions 286 | svc = SVC() 287 | param_distributions = { 288 | 'C': uniform(0.1, 10), 289 | 'kernel': ['linear', 'rbf', 'poly'], 290 | 'gamma': uniform(0.01, 1) 291 | } 292 | 293 | # RandomizedSearchCV 294 | random_search = RandomizedSearchCV(svc, param_distributions, n_iter=50, cv=5, random_state=42) 295 | random_search.fit(X_train, y_train) 296 | 297 | # Best parameters and accuracy 298 | print("Best Parameters:", random_search.best_params_) 299 | print("Test Accuracy:", random_search.score(X_test, y_test)) 300 | ``` 301 | 302 | 303 | 304 | ## 📜 **Summary** 305 | 306 | Today, we explored various techniques for hyperparameter tuning: **GridSearchCV**, **RandomizedSearchCV**, **Bayesian Optimization**, **Optuna**, and hyperparameter tuning for neural networks. Each method has its unique advantages and applications, making them essential tools for optimizing machine learning models. Practice these methods to enhance your machine learning models! 🚀 307 | 308 | --- 309 | 310 | 311 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 👨‍🔬 30 Days of Data Science 2 | 3 | | **Day** | **Topic** | **Topics Covered** | 4 | |---------|------------------------------------------|--------------------------------------| 5 | | **01** | [Introduction to Data Science](README.md#-day-1)| Setting up Python, Jupyter Notebook | 6 | | **02** | [Basics of the Language + Git Basics](./02_Basics%20of%20the%20Language%20%26%20Git%20Basics/02_Basics%20of%20the%20Language%20%26%20Git%20Basics.md)| Python syntax, variables, Git setup | 7 | | **03** | [Control Flow](./03_Control%20Flow/03_Control%20Flow.md)| If-else, loops | 8 | | **04** | [Functions and Modular Programming](./04_Functions%20and%20Modular%20Programming/04_Functions%20and%20Modular%20Programming.md)| Defining & calling functions | 9 | | **05** | [Data Structures](./05_Data%20Structures/05_Data%20Structures.md)| Lists, tuples, dictionaries | 10 | | **06** | [Data Frames and Tables](./06_Data%20Frames%20and%20Tables/06_Data%20Frames%20and%20Tables.md)| pandas DataFrame | 11 | | **07** | [Importing Data](./07_Importing%20Data/07_Importing%20Data.md)| Reading CSV, Excel, JSON files | 12 | | **08** | [Data Cleaning](./08_Data%20Cleaning/08_Data%20Cleaning.md)| Handling missing values, duplicates | 13 | | **09** | [Exploratory Data Analysis (EDA)](./09_Exploratory%20Data%20Analysis%20(EDA)/09_Exploratory%20Data%20Analysis%20(EDA).md) | Descriptive statistics | 14 | | **10** | [Data Visualization Basics](./10_Data%20Visualization%20Basics/10_Data%20Visualization%20Basics.md) | matplotlib, seaborn | 15 | | **11** | [Advanced Data Visualization](./11_Advanced%20Data%20Visualization/11_Advanced%20Data%20Visualization.md)| Plotly, advanced matplotlib | 16 | | **12** | [SQL for Data Retrieval](./12_SQL%20for%20Data%20Retrieval/12_SQL%20for%20Data%20Retrieval.md)| sqlite3, SQLAlchemy | 17 | | **13** | [Time Series Analysis 
Introduction](./13_Time%20Series%20Analysis%20Introduction/13_Time%20Series%20Analysis%20Introduction.md)| pandas datetime, matplotlib | 18 | | **14** | [Working with APIs and JSON](./14_Working%20with%20APIs%20and%20JSON/14_Working%20with%20APIs%20and%20JSON.md)| requests, JSON module | 19 | | **15** | [Regular Expressions](./15_Regular%20Expressions/15_Regular%20Expressions.md)| re module | 20 | | **16** | [Statistical Concepts](./16_Statistical%20Concepts/16_Statistical%20Concepts.md)| Scipy, NumPy | 21 | | **17** | [Hypothesis Testing](./17_Hypothesis%20Testing/17_Hypothesis%20Testing.md)| t-test, chi-square | 22 | | **18** | [Basic Machine Learning Introduction](./18_Basic%20Machine%20Learning%20Introduction/18_Basic%20Machine%20Learning%20Introduction.md)| scikit-learn basics | 23 | | **19** | [Linear Regression](./19_Linear%20Regression/19_Linear%20Regression.md)| LinearRegression in scikit-learn | 24 | | **20** | [Logistic Regression](./20_Logistic%20Regression/20_Logistic%20Regression.md)| LogisticRegression in scikit-learn | 25 | | **21** | [Clustering (K-Means)](./21_Clustering%20(K-Means)/21_Clustering%20(K-Means).md)| KMeans in scikit-learn | 26 | | **22** | [Decision Trees](./22_Decision%20Trees/22_Decision%20Trees.md)| DecisionTreeClassifier in scikit-learn | 27 | | **23** | [Handling Imbalanced Data](./23_Handling%20Imbalanced%20Data/23_Handling%20Imbalanced%20Data.md)| SMOTE, class weighting | 28 | | **24** | [Feature Engineering](./24_Feature%20Engineering/24_Feature%20Engineering.md)| Encoding, scaling, feature selection | 29 | | **25** | [Model Evaluation and Metrics](./25_Model%20Evaluation%20and%20Metrics/25_Model%20Evaluation%20and%20Metrics.md)| Confusion matrix, ROC-AUC | 30 | | **26** | [Advanced ML: Hyperparameter Tuning](./26_Advanced%20ML%3A%20Hyperparameter%20Tuning/26_Advanced%20ML%3A%20Hyperparameter%20Tuning.md)| GridSearchCV, RandomizedSearchCV | 31 | | **27** | [Natural Language Processing (NLP)](./27_Natural%20Language%20Processing%20(NLP)/27_Natural%20Language%20Processing%20(NLP).md)| NLTK, spaCy, Hugging Face | 32 | | **28** | [Time Series Forecasting](./28_Time%20Series%20Forecasting/28_Time%20Series%20Forecasting.md)| ARIMA, Prophet | 33 | | **29** | [Working with Big Data](./29_Working%20with%20Big%20Data/29_Working%20with%20Big%20Data.md) | PySpark basics | 34 | | **30** | [Building a Data Science Pipeline](./30_Building%20a%20Data%20Science%20Pipeline/30_Building%20a%20Data%20Science%20Pipeline.md)| sklearn pipeline, joblib | 35 | | **31** | [Deployment on Cloud Platform](./31_Deployment%20on%20Cloud%20Platform/31_Deployment%20on%20Cloud%20Platform.md)| Deploy with Flask/FastAPI to AWS, Azure, or GCP | 36 | 37 | 38 | - [👨‍🔬 30 Days Of Data Science](#-30-days-of-data-science) 39 | - [📘 Day 1](#-day-1) 40 | - [Welcome](#welcome) 41 | - [Introduction](#introduction) 42 | - [Why Learn Data Science ?](#why-learn-data-science) 43 | - [Setting Up Your Environment](#setting-up-your-environment) 44 | - [Installing Python](#installing-python) 45 | - [Python Shell](#python-shell) 46 | - [Installing Visual Studio Code](#installing-visual-studio-code) 47 | - [Installing Jupyter Notebook](#installing-jupyter-notebook) 48 | 49 | 50 | # 📘 Day 1 51 | 52 | ## Welcome 53 | 54 | **Congratulations** on deciding to participate in a _30 Days of Data Science_ challenge! In this challenge, you will dive into the essential concepts of data science, from foundational programming skills to data analysis, visualization, and machine learning. 
55 | 56 | 57 | 58 | ## Introduction 59 | 60 | Data Science is an interdisciplinary field that uses programming, mathematics, and domain knowledge to extract insights from structured and unstructured data. Python is one of the most popular tools in data science due to its versatility, ease of use, and robust ecosystem of libraries. This challenge is designed to help you build a strong foundation in Python while applying it to practical data science tasks. The topics are distributed over 30 days, with clear explanations, real-world examples, and hands-on exercises. 61 | 62 | This challenge is suitable for beginners as well as professionals looking to strengthen their data science skills. It may take 30 to 100 days to complete, depending on your pace. 63 | 64 | 65 | 66 | 67 | ## Why Learn Data Science? 68 | 69 | Data Science is revolutionizing industries by enabling data-driven decision-making. It combines programming, statistics, and domain expertise to solve complex problems. Python has become the go-to language in the data science community due to its simplicity and extensive library support for tasks like data cleaning, visualization, and modeling. Whether you aim to work in business analytics, artificial intelligence, or research, data science skills will open up endless possibilities. 70 | 71 | ### Setting Up Your Environment 72 | 73 | ## Installing Python 74 | 75 | To start coding in Python, you need to install it on your computer. Visit the [official Python website](https://www.python.org/) to download the latest version. 76 | - **Windows users**: Download Python by clicking the appropriate button. 77 | - **macOS users**: Follow similar steps to install Python for Mac. 78 | 79 | To confirm the installation, open your terminal or command prompt and type: 80 | 81 | ```shell 82 | python --version 83 | ``` 84 | 85 | You should see the installed version, which should be Python 3.6 or above. For example: 86 | 87 | ```shell 88 | Python 3.12.4 89 | ``` 90 | 91 | If the command displays the Python version, you are ready to proceed. 92 | 93 | ## Python Shell 94 | 95 | Python is an interpreted language, meaning you can execute code line by line. Python comes with an interactive shell, which allows you to write and test Python commands directly. To open the shell, type the following command in your terminal: 96 | 97 | ```shell 98 | python 99 | ``` 100 | 101 | Once the shell is open, you can start entering Python commands after the `>>>` prompt. For example, typing `2 + 3` will output `5`. To exit the shell, type `exit()`. 102 | 103 | If you enter an invalid command, Python will provide an error message, helping you debug and learn. Debugging is the process of identifying and fixing errors in your code. You will encounter common error types such as `SyntaxError`, `NameError`, and `TypeError` throughout this challenge. Understanding these errors is crucial for becoming a proficient programmer. 104 | 105 | ## Installing Visual Studio Code 106 | 107 | While the Python shell is great for quick tests, real-world data science projects require robust code editors. For this challenge, we recommend using [Visual Studio Code](https://code.visualstudio.com/), a popular and lightweight editor. Feel free to use other editors if you prefer. 108 | 109 | To start, download and install Visual Studio Code. Once installed, create a folder named `30DaysOfDataScience` on your computer and open it using Visual Studio Code. Inside the folder, create a new file, such as `helloworld.py`, to write your first Python script. 
This will serve as the workspace for your projects throughout the challenge. 110 | 111 | #### Exploring the Editor 112 | 113 | Visual Studio Code offers many features to enhance productivity, including debugging tools, extensions, and an intuitive interface. Spend some time familiarizing yourself with its layout and shortcuts. 114 | 115 | ## Installing Jupyter Notebook 116 | 117 | In addition to Visual Studio Code, another essential tool for data science is **Jupyter Notebook**. It is an interactive web-based environment where you can write and execute Python code, visualize data, and document your analysis all in one place. Jupyter Notebook is widely used in the data science community because it simplifies exploratory data analysis and data visualization. 118 | 119 | ### Installing Jupyter Notebook 120 | 121 | To install Jupyter Notebook, you'll first need to install `pip`, the Python package manager, which should already be available if you've installed Python. Open your terminal or command prompt and type: 122 | 123 | ```shell 124 | pip install notebook 125 | ``` 126 | 127 | Once the installation is complete, you can launch Jupyter Notebook by typing: 128 | 129 | ```shell 130 | jupyter notebook 131 | ``` 132 | 133 | This command will open Jupyter Notebook in your default web browser. You will see an interface that allows you to create and organize notebooks in different folders. 134 | 135 | #### Using Jupyter Notebook 136 | 137 | To create a new notebook: 138 | 139 | 1. Navigate to the folder where you'd like to save your notebooks. 140 | 2. Click **New** (on the top-right corner) and select **Python 3 (ipykernel)**. 141 | 142 | A new notebook will open where you can write Python code in individual cells. Press **Shift + Enter** to execute the code in a cell. You can also add explanatory text using Markdown cells to make your analysis more readable. 143 | 144 | Here is a simple example to get started: 145 | 146 | 1. Create a new notebook and name it `Day1_Basics.ipynb`. 147 | 2. Write the following code in a cell and execute it: 148 | 149 | ```python 150 | # This is your first code in Jupyter Notebook 151 | print("Hello, Data Science!") 152 | ``` 153 | 154 | You should see the output below the cell: 155 | 156 | ``` 157 | Hello, Data Science! 158 | ``` 159 | 160 | #### Installing JupyterLab (Optional) 161 | 162 | If you'd like a more modern interface with enhanced features, you can use **JupyterLab**, an upgraded version of Jupyter Notebook. Install it using: 163 | 164 | ```shell 165 | pip install jupyterlab 166 | ``` 167 | 168 | Launch it by typing: 169 | 170 | ```shell 171 | jupyter lab 172 | ``` 173 | 174 | #### Integration with Visual Studio Code 175 | 176 | If you prefer to work within Visual Studio Code but want the interactivity of Jupyter Notebook, you can install the **Jupyter extension** in Visual Studio Code: 177 | 178 | 1. Open Visual Studio Code and go to the Extensions Marketplace (the square icon on the sidebar). 179 | 2. Search for "Jupyter" and install the extension. 180 | 3. Open a `.ipynb` file, or create one using the command palette (`Ctrl + Shift + P` or `Cmd + Shift + P` on Mac) and selecting `Jupyter: Create New Blank Notebook`. 181 | 182 | Now you can use Jupyter notebooks directly within Visual Studio Code! 
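Whichever interface you choose, a quick way to confirm your environment is ready for the rest of the challenge is to check that the core libraries import. This assumes you have already installed them (for example with `pip install numpy pandas matplotlib`):

```python
# Run this in a notebook cell (or as a script) to verify the core data science stack
import sys
import numpy as np
import pandas as pd
import matplotlib

print("Python:", sys.version.split()[0])
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("matplotlib:", matplotlib.__version__)
```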
183 | 184 | 185 | 186 | 187 | [Day 2 >>](./02_Basics%20of%20the%20Language%20%26%20Git%20Basics/02_Basics%20of%20the%20Language%20%26%20Git%20Basics.md) 188 | 189 | 190 | 191 | 192 | --------------------------------------------------------------------------------