├── README.md
├── 1. Label Encoder.ipynb
├── Min Max Scaling with Python.ipynb
├── 4. Ordinal Encoder.ipynb
├── 3. Binary Encoder.ipynb
├── Standard Scaling using Python.ipynb
├── 2. One Hot Encoding.ipynb
└── Data Leakage in Machine Learning.ipynb


/README.md:
--------------------------------------------------------------------------------
  1 | # Feature Engineering
  2 | 
  3 | **Feature Engineering** is the process of transforming raw data into meaningful features that help machine learning models learn better patterns and make more accurate predictions.
  4 | 
  5 | In simple terms, it is about **creating the right inputs** for your model: the more informative the features, the better the model tends to perform.
  6 | 
  7 | ---
  8 | 
  9 | ## Key Steps in Feature Engineering
 10 | 
 11 | ### 1. Feature Creation
 12 | Generate new features from existing ones to add useful information.
 13 | 
 14 | **Examples:**
 15 | - From `Date`, create `Day`, `Month`, or `Is_Weekend`
 16 | - From `Price` and `Quantity`, create `Total_Sales = Price × Quantity`
 17 | - From `Address`, extract `City` or `Postal_Code`
 18 | 
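The three examples above can be sketched with pandas as follows. This is a minimal illustration, not a fixed recipe: the sample rows are made up, and the address pattern assumes a five-digit postal code followed by the city name, which will not hold for every address format.

```python
import pandas as pd

# Hypothetical sample data; column names follow the examples above.
df = pd.DataFrame({
    "Date": ["2024-05-10", "2024-05-11"],
    "Price": [10.0, 4.5],
    "Quantity": [3, 2],
    "Address": ["Hauptstr. 1, 10115 Berlin", "Marienplatz 8, 80331 Munich"],
})

# Date parts: day of month, month, and a weekend flag (Monday=0 ... Sunday=6).
dates = pd.to_datetime(df["Date"])
df["Day"] = dates.dt.day
df["Month"] = dates.dt.month
df["Is_Weekend"] = dates.dt.dayofweek >= 5

# Combine two existing columns into a new one.
df["Total_Sales"] = df["Price"] * df["Quantity"]

# Assumption: the address ends with "<5-digit postal code> <city>".
parts = df["Address"].str.extract(r"(?P<Postal_Code>\d{5})\s+(?P<City>.+)$")
df = pd.concat([df, parts], axis=1)

print(df[["Day", "Month", "Is_Weekend", "Total_Sales", "Postal_Code", "City"]])
```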
 19 | ---
 20 | 
 21 | ### 2. Feature Transformation
 22 | Modify existing features to improve learning and model performance.
 23 | 
 24 | **Common techniques:**
 25 | - **Scaling:** Normalize or standardize numerical features  
 26 |   *(e.g., Min-Max scaling or StandardScaler)*
 27 | - **Encoding:** Convert categorical data into numeric form  
 28 |   *(e.g., One-Hot Encoding or Label Encoding)*
 29 | - **Log Transformation:** Handle skewed data distributions
 30 | - **Binning:** Group continuous values into discrete intervals
 31 | 
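Scaling and encoding are shown in the Python section further down, so the sketch below covers only the other two techniques: a log transformation and binning. The `Income` and `Age` columns and the bin edges are illustrative assumptions, not values taken from this repository.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Income has one very large value (right-skewed).
df = pd.DataFrame({
    "Income": [20_000, 35_000, 50_000, 1_000_000],
    "Age": [22, 37, 58, 71],
})

# log1p computes log(1 + x), so zeros are safe; it compresses the long right tail.
df["Income_Log"] = np.log1p(df["Income"])

# Bin ages into labelled intervals (0, 30], (30, 60], (60, 120]; edges are illustrative.
df["Age_Group"] = pd.cut(
    df["Age"], bins=[0, 30, 60, 120], labels=["young", "middle", "senior"]
)

print(df)
```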
 32 | ---
 33 | 
 34 | ### 3. Feature Selection
 35 | Choose only the most important features that influence predictions.
 36 | 
 37 | **Methods include:**
 38 | - Correlation analysis  
 39 | - Mutual information  
 40 | - Feature importance from tree-based models  
 41 | - Recursive Feature Elimination (RFE)
 42 | 
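A rough sketch of the four methods above using scikit-learn on a synthetic dataset. The dataset, the estimators (`RandomForestClassifier`, `LogisticRegression`), and the choice of keeping three features are assumptions made for illustration. Real projects would normally run the selection on training data only, to avoid leaking information from the test set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, 3 of which carry signal (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# 1. Correlation of each feature with the target.
correlations = pd.DataFrame(X).corrwith(pd.Series(y))

# 2. Mutual information between each feature and the target.
mi_scores = mutual_info_classif(X, y, random_state=0)

# 3. Feature importance from a tree-based model.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

# 4. Recursive Feature Elimination: repeatedly drop the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

print("Correlation:", correlations.round(2).tolist())
print("Mutual information:", mi_scores.round(2))
print("Forest importances:", importances.round(2))
print("Kept by RFE:", rfe.support_)
```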
 43 | ---
 44 | 
 45 | ### 4. Handling Missing Data
 46 | Fill or remove missing values to ensure clean input for training.
 47 | 
 48 | **Strategies:**
 49 | - Replace with mean/median/mode  
 50 | - Forward/Backward fill  
 51 | - Drop rows/columns with too many missing values
 52 | 
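The strategies above might look like this in pandas. The data and the 50% threshold for dropping a column are assumptions chosen for the example. Which strategy fits best depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in every column.
df = pd.DataFrame({
    "Age": [28, np.nan, 35, 40],
    "City": ["Berlin", "Munich", None, "Berlin"],
    "Temp": [20.0, np.nan, np.nan, 23.0],
    "Notes": [None, None, None, "ok"],
})

# Drop columns where more than half of the values are missing (here: Notes).
df = df.loc[:, df.isna().mean() <= 0.5].copy()

# Numeric column: replace missing values with the median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: replace missing values with the most frequent value.
df["City"] = df["City"].fillna(df["City"].mode()[0])

# Ordered data: carry the last known value forward, then fill any leading gaps backward.
df["Temp"] = df["Temp"].ffill().bfill()

print(df)
```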
 53 | ---
 54 | 
 55 | ## Example
 56 | 
 57 | **Raw dataset:**
 58 | 
 59 | | Date | Age | Salary | City |
 60 | |------|-----|---------|------|
 61 | | 2024-05-10 | 28 | 50000 | Berlin |
 62 | | 2024-05-11 | 35 | 62000 | Munich |
 63 | 
 64 | **After feature creation and one-hot encoding (before scaling):**
 65 | 
 66 | | Day | Month | Age | Salary | City_Berlin | City_Munich |
 67 | |-----|-------|-----|---------|--------------|--------------|
 68 | | 10 | 5 | 28 | 50000 | 1 | 0 |
 69 | | 11 | 5 | 35 | 62000 | 0 | 1 |
 70 | 
 71 | ---
 72 | 
 73 | ## Why It Matters
 74 | Well-chosen features can:
 75 | - Improve the **accuracy** of ML models  
 76 | - Reduce **overfitting** by removing noisy or redundant inputs  
 77 | - Make models easier to **interpret**  
 78 | - Speed up **training and inference** when fewer features are needed
 79 | 
 80 | ---
 81 | 
 82 | ## Python
 83 | 
 84 | ```python
 85 | import pandas as pd
 86 | from sklearn.preprocessing import OneHotEncoder, StandardScaler
 87 | 
 88 | # Sample Data
 89 | data = {
 90 |     "Date": ["2024-05-10", "2024-05-11"],
 91 |     "Age": [28, 35],
 92 |     "Salary": [50000, 62000],
 93 |     "City": ["Berlin", "Munich"]
 94 | }
 95 | df = pd.DataFrame(data)
 96 | 
 97 | # Feature Creation
 98 | df["Date"] = pd.to_datetime(df["Date"])
 99 | df["Day"] = df["Date"].dt.day
100 | df["Month"] = df["Date"].dt.month
101 | 
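102 | # Feature Transformation: one-hot encode the City column
103 | # (sparse_output needs scikit-learn >= 1.2; older releases use sparse=False)
104 | encoder = OneHotEncoder(sparse_output=False)
105 | city_encoded = encoder.fit_transform(df[["City"]])
106 | encoded_df = pd.DataFrame(city_encoded, columns=encoder.get_feature_names_out(["City"]), index=df.index)
107 | # Drop the raw columns that the new features replace
108 | df = pd.concat([df.drop(columns=["Date", "City"]), encoded_df], axis=1)
109 | 
110 | # Feature Scaling: standardize the numeric columns to zero mean and unit variance
111 | scaler = StandardScaler()
112 | df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])
113 | 
114 | # Final Result
115 | print(df)
116 | ```
117 | 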


--------------------------------------------------------------------------------
/1. Label Encoder.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "21070538",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# 1. Label Encoder"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "id": "5122882e",
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "classes = ['ClassA', 'ClassB', 'ClassC', 'ClassD']\n",
 19 |     "\n",
 20 |     "instances = ['ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB']"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "code",
 25 |    "execution_count": 2,
 26 |    "id": "0790c5cf",
 27 |    "metadata": {},
 28 |    "outputs": [
 29 |     {
 30 |      "name": "stdout",
 31 |      "output_type": "stream",
 32 |      "text": [
 33 |       "Encoded labels: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]\n"
 34 |      ]
 35 |     }
 36 |    ],
 37 |    "source": [
 38 |     "label_to_int = {label: index for index, label in enumerate(classes)} #60 Days of Python ; Day 25\n",
 39 |     "encoded_labels = [label_to_int[label] for label in instances]\n",
 40 |     "\n",
 41 |     "print(\"Encoded labels:\", encoded_labels)"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": 3,
 47 |    "id": "9bdaf4f3",
 48 |    "metadata": {},
 49 |    "outputs": [
 50 |     {
 51 |      "name": "stdout",
 52 |      "output_type": "stream",
 53 |      "text": [
 54 |       "Encoded labels: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]\n",
 55 |       "Decoded labels: ['ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB']\n"
 56 |      ]
 57 |     }
 58 |    ],
 59 |    "source": [
 60 |     "int_to_label = {index: label for label, index in label_to_int.items()}\n",
 61 |     "decoded_labels = [int_to_label[index] for index in encoded_labels]\n",
 62 |     "\n",
 63 |     "print(\"Encoded labels:\", encoded_labels)\n",
 64 |     "print(\"Decoded labels:\", decoded_labels)"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "markdown",
 69 |    "id": "63ee1325",
 70 |    "metadata": {},
 71 |    "source": [
 72 |     "# Sklearn - Label Encoder"
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "code",
 77 |    "execution_count": 4,
 78 |    "id": "60a20a81",
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "from sklearn.preprocessing import LabelEncoder"
 83 |    ]
 84 |   },
 85 |   {
 86 |    "cell_type": "code",
 87 |    "execution_count": 5,
 88 |    "id": "ad2970f2",
 89 |    "metadata": {},
 90 |    "outputs": [
 91 |     {
 92 |      "name": "stdout",
 93 |      "output_type": "stream",
 94 |      "text": [
 95 |       "Encoded labels: [0 1 2 3 0 1 2 3 0 1]\n"
 96 |      ]
 97 |     }
 98 |    ],
 99 |    "source": [
100 |     "label_encoder = LabelEncoder()\n",
101 |     "encoded_labels = label_encoder.fit_transform(instances)\n",
102 |     "\n",
103 |     "print(\"Encoded labels:\", encoded_labels)"
104 |    ]
105 |   },
106 |   {
107 |    "cell_type": "code",
108 |    "execution_count": 6,
109 |    "id": "5fd9dabe",
110 |    "metadata": {},
111 |    "outputs": [
112 |     {
113 |      "name": "stdout",
114 |      "output_type": "stream",
115 |      "text": [
116 |       "Encoded labels: [0 1 2 3 0 1 2 3 0 1]\n",
117 |       "Original labels: ['ClassA' 'ClassB' 'ClassC' 'ClassD' 'ClassA' 'ClassB' 'ClassC' 'ClassD'\n",
118 |       " 'ClassA' 'ClassB']\n"
119 |      ]
120 |     }
121 |    ],
122 |    "source": [
123 |     "original_labels = label_encoder.inverse_transform(encoded_labels)\n",
124 |     "\n",
125 |     "print(\"Encoded labels:\", encoded_labels)\n",
126 |     "print(\"Original labels:\", original_labels)"
127 |    ]
128 |   },
129 |   {
130 |    "cell_type": "code",
131 |    "execution_count": null,
132 |    "id": "e932d2b3",
133 |    "metadata": {},
134 |    "outputs": [],
135 |    "source": []
136 |   }
137 |  ],
138 |  "metadata": {
139 |   "kernelspec": {
140 |    "display_name": "Python 3 (ipykernel)",
141 |    "language": "python",
142 |    "name": "python3"
143 |   },
144 |   "language_info": {
145 |    "codemirror_mode": {
146 |     "name": "ipython",
147 |     "version": 3
148 |    },
149 |    "file_extension": ".py",
150 |    "mimetype": "text/x-python",
151 |    "name": "python",
152 |    "nbconvert_exporter": "python",
153 |    "pygments_lexer": "ipython3",
154 |    "version": "3.9.13"
155 |   }
156 |  },
157 |  "nbformat": 4,
158 |  "nbformat_minor": 5
159 | }
160 | 


--------------------------------------------------------------------------------
/Min Max Scaling with Python.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "fc529592",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# Raw Code"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "id": "1c33a4de",
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "def min_max_scaling(data):\n",
 19 |     "    min_val = min(data)\n",
 20 |     "    max_val = max(data)\n",
 21 |     "    scaled_data = [(x - min_val) / (max_val - min_val) if max_val != min_val else 0.0 for x in data]  # constant input maps to 0.0 instead of dividing by zero\n",
 22 |     "    return scaled_data"
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "code",
 27 |    "execution_count": 2,
 28 |    "id": "53cba510",
 29 |    "metadata": {},
 30 |    "outputs": [
 31 |     {
 32 |      "name": "stdout",
 33 |      "output_type": "stream",
 34 |      "text": [
 35 |       "Original Data:  [1, 20, 30, 4, 5]\n",
 36 |       "Scaled data (raw): [0.0, 0.6551724137931034, 1.0, 0.10344827586206896, 0.13793103448275862]\n"
 37 |      ]
 38 |     }
 39 |    ],
 40 |    "source": [
 41 |     "data = [1, 20, 30, 4, 5]\n",
 42 |     "scaled_data = min_max_scaling(data)\n",
 43 |     "print('Original Data: ', data)\n",
 44 |     "print(\"Scaled data (raw):\", scaled_data)"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "markdown",
 49 |    "id": "c9912be9",
 50 |    "metadata": {},
 51 |    "source": [
 52 |     "# Using scikit-learn"
 53 |    ]
 54 |   },
 55 |   {
 56 |    "cell_type": "code",
 57 |    "execution_count": 3,
 58 |    "id": "e2326ee0",
 59 |    "metadata": {},
 60 |    "outputs": [],
 61 |    "source": [
 62 |     "import pandas as pd\n",
 63 |     "from sklearn.preprocessing import MinMaxScaler"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "code",
 68 |    "execution_count": 4,
 69 |    "id": "6c2eab99",
 70 |    "metadata": {},
 71 |    "outputs": [
 72 |     {
 73 |      "data": {
 74 |       "text/html": [
 75 |        "<div>\n",
 76 |        "<style scoped>\n",
 77 |        "    .dataframe tbody tr th:only-of-type {\n",
 78 |        "        vertical-align: middle;\n",
 79 |        "    }\n",
 80 |        "\n",
 81 |        "    .dataframe tbody tr th {\n",
 82 |        "        vertical-align: top;\n",
 83 |        "    }\n",
 84 |        "\n",
 85 |        "    .dataframe thead th {\n",
 86 |        "        text-align: right;\n",
 87 |        "    }\n",
 88 |        "</style>\n",
 89 |        "<table border=\"1\" class=\"dataframe\">\n",
 90 |        "  <thead>\n",
 91 |        "    <tr style=\"text-align: right;\">\n",
 92 |        "      <th></th>\n",
 93 |        "      <th>Feature1</th>\n",
 94 |        "      <th>Feature2</th>\n",
 95 |        "    </tr>\n",
 96 |        "  </thead>\n",
 97 |        "  <tbody>\n",
 98 |        "    <tr>\n",
 99 |        "      <th>0</th>\n",
100 |        "      <td>1</td>\n",
101 |        "      <td>6</td>\n",
102 |        "    </tr>\n",
103 |        "    <tr>\n",
104 |        "      <th>1</th>\n",
105 |        "      <td>5</td>\n",
106 |        "      <td>7</td>\n",
107 |        "    </tr>\n",
108 |        "    <tr>\n",
109 |        "      <th>2</th>\n",
110 |        "      <td>10</td>\n",
111 |        "      <td>8</td>\n",
112 |        "    </tr>\n",
113 |        "    <tr>\n",
114 |        "      <th>3</th>\n",
115 |        "      <td>4</td>\n",
116 |        "      <td>19</td>\n",
117 |        "    </tr>\n",
118 |        "    <tr>\n",
119 |        "      <th>4</th>\n",
120 |        "      <td>5</td>\n",
121 |        "      <td>10</td>\n",
122 |        "    </tr>\n",
123 |        "  </tbody>\n",
124 |        "</table>\n",
125 |        "</div>"
126 |       ],
127 |       "text/plain": [
128 |        "   Feature1  Feature2\n",
129 |        "0         1         6\n",
130 |        "1         5         7\n",
131 |        "2        10         8\n",
132 |        "3         4        19\n",
133 |        "4         5        10"
134 |       ]
135 |      },
136 |      "execution_count": 4,
137 |      "metadata": {},
138 |      "output_type": "execute_result"
139 |     }
140 |    ],
141 |    "source": [
142 |     "data = {'Feature1': [1, 5, 10, 4, 5],\n",
143 |     "        'Feature2': [6, 7, 8, 19, 10]}\n",
144 |     "\n",
145 |     "df = pd.DataFrame(data)\n",
146 |     "df.head()"
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "code",
151 |    "execution_count": 5,
152 |    "id": "03a59524",
153 |    "metadata": {},
154 |    "outputs": [],
155 |    "source": [
156 |     "scaler = MinMaxScaler()\n",
157 |     "scaled_data = scaler.fit_transform(df)\n",
158 |     "scaled_df = pd.DataFrame(scaled_data, columns=df.columns)"
159 |    ]
160 |   },
161 |   {
162 |    "cell_type": "code",
163 |    "execution_count": 6,
164 |    "id": "b32cc8d6",
165 |    "metadata": {},
166 |    "outputs": [
167 |     {
168 |      "name": "stdout",
169 |      "output_type": "stream",
170 |      "text": [
171 |       "Original DataFrame:\n",
172 |       "   Feature1  Feature2\n",
173 |       "0         1         6\n",
174 |       "1         5         7\n",
175 |       "2        10         8\n",
176 |       "3         4        19\n",
177 |       "4         5        10\n",
178 |       "\n",
179 |       "Scaled DataFrame:\n",
180 |       "   Feature1  Feature2\n",
181 |       "0  0.000000  0.000000\n",
182 |       "1  0.444444  0.076923\n",
183 |       "2  1.000000  0.153846\n",
184 |       "3  0.333333  1.000000\n",
185 |       "4  0.444444  0.307692\n"
186 |      ]
187 |     }
188 |    ],
189 |    "source": [
190 |     "print(\"Original DataFrame:\")\n",
191 |     "print(df)\n",
192 |     "print(\"\\nScaled DataFrame:\")\n",
193 |     "print(scaled_df)"
194 |    ]
195 |   },
196 |   {
197 |    "cell_type": "code",
198 |    "execution_count": null,
199 |    "id": "0e898642",
200 |    "metadata": {},
201 |    "outputs": [],
202 |    "source": []
203 |   }
204 |  ],
205 |  "metadata": {
206 |   "kernelspec": {
207 |    "display_name": "Python 3 (ipykernel)",
208 |    "language": "python",
209 |    "name": "python3"
210 |   },
211 |   "language_info": {
212 |    "codemirror_mode": {
213 |     "name": "ipython",
214 |     "version": 3
215 |    },
216 |    "file_extension": ".py",
217 |    "mimetype": "text/x-python",
218 |    "name": "python",
219 |    "nbconvert_exporter": "python",
220 |    "pygments_lexer": "ipython3",
221 |    "version": "3.9.13"
222 |   }
223 |  },
224 |  "nbformat": 4,
225 |  "nbformat_minor": 5
226 | }
227 | 


--------------------------------------------------------------------------------
/4. Ordinal Encoder.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "id": "b78c029d",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "import pandas as pd\n",
 11 |     "from sklearn.preprocessing import OrdinalEncoder"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "code",
 16 |    "execution_count": 2,
 17 |    "id": "c146e9da",
 18 |    "metadata": {},
 19 |    "outputs": [],
 20 |    "source": [
 21 |     "data = [\n",
 22 |     "    ['good'], ['bad'], ['excellent'], ['average'], \n",
 23 |     "    ['good'], ['average'], ['excellent'], ['bad'], \n",
 24 |     "    ['average'], ['good']\n",
 25 |     "]"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 3,
 31 |    "id": "3ad7f2c7",
 32 |    "metadata": {},
 33 |    "outputs": [
 34 |     {
 35 |      "data": {
 36 |       "text/html": [
 37 |        "<div>\n",
 38 |        "<style scoped>\n",
 39 |        "    .dataframe tbody tr th:only-of-type {\n",
 40 |        "        vertical-align: middle;\n",
 41 |        "    }\n",
 42 |        "\n",
 43 |        "    .dataframe tbody tr th {\n",
 44 |        "        vertical-align: top;\n",
 45 |        "    }\n",
 46 |        "\n",
 47 |        "    .dataframe thead th {\n",
 48 |        "        text-align: right;\n",
 49 |        "    }\n",
 50 |        "</style>\n",
 51 |        "<table border=\"1\" class=\"dataframe\">\n",
 52 |        "  <thead>\n",
 53 |        "    <tr style=\"text-align: right;\">\n",
 54 |        "      <th></th>\n",
 55 |        "      <th>reviews</th>\n",
 56 |        "    </tr>\n",
 57 |        "  </thead>\n",
 58 |        "  <tbody>\n",
 59 |        "    <tr>\n",
 60 |        "      <th>0</th>\n",
 61 |        "      <td>good</td>\n",
 62 |        "    </tr>\n",
 63 |        "    <tr>\n",
 64 |        "      <th>1</th>\n",
 65 |        "      <td>bad</td>\n",
 66 |        "    </tr>\n",
 67 |        "    <tr>\n",
 68 |        "      <th>2</th>\n",
 69 |        "      <td>excellent</td>\n",
 70 |        "    </tr>\n",
 71 |        "    <tr>\n",
 72 |        "      <th>3</th>\n",
 73 |        "      <td>average</td>\n",
 74 |        "    </tr>\n",
 75 |        "    <tr>\n",
 76 |        "      <th>4</th>\n",
 77 |        "      <td>good</td>\n",
 78 |        "    </tr>\n",
 79 |        "  </tbody>\n",
 80 |        "</table>\n",
 81 |        "</div>"
 82 |       ],
 83 |       "text/plain": [
 84 |        "     reviews\n",
 85 |        "0       good\n",
 86 |        "1        bad\n",
 87 |        "2  excellent\n",
 88 |        "3    average\n",
 89 |        "4       good"
 90 |       ]
 91 |      },
 92 |      "execution_count": 3,
 93 |      "metadata": {},
 94 |      "output_type": "execute_result"
 95 |     }
 96 |    ],
 97 |    "source": [
 98 |     "data = pd.DataFrame(data=data, columns=['reviews'])\n",
 99 |     "data.head()"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": 4,
105 |    "id": "14b89f09",
106 |    "metadata": {},
107 |    "outputs": [
108 |     {
109 |      "data": {
110 |       "text/plain": [
111 |        "(10, 1)"
112 |       ]
113 |      },
114 |      "execution_count": 4,
115 |      "metadata": {},
116 |      "output_type": "execute_result"
117 |     }
118 |    ],
119 |    "source": [
120 |     "data.shape"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": 5,
126 |    "id": "3e569b99",
127 |    "metadata": {},
128 |    "outputs": [],
129 |    "source": [
130 |     "categories = [['bad', 'average', 'good', 'excellent']]"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": 6,
136 |    "id": "eee41b3c",
137 |    "metadata": {},
138 |    "outputs": [
139 |     {
140 |      "data": {
141 |       "text/plain": [
142 |        "[['bad', 'average', 'good', 'excellent']]"
143 |       ]
144 |      },
145 |      "execution_count": 6,
146 |      "metadata": {},
147 |      "output_type": "execute_result"
148 |     }
149 |    ],
150 |    "source": [
151 |     "categories"
152 |    ]
153 |   },
154 |   {
155 |    "cell_type": "code",
156 |    "execution_count": 7,
157 |    "id": "b3a1d73e",
158 |    "metadata": {},
159 |    "outputs": [],
160 |    "source": [
161 |     "encoder = OrdinalEncoder(categories=categories)"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": 8,
167 |    "id": "74286b79",
168 |    "metadata": {},
169 |    "outputs": [
170 |     {
171 |      "data": {
172 |       "text/plain": [
173 |        "array([[2.],\n",
174 |        "       [0.],\n",
175 |        "       [3.],\n",
176 |        "       [1.],\n",
177 |        "       [2.],\n",
178 |        "       [1.],\n",
179 |        "       [3.],\n",
180 |        "       [0.],\n",
181 |        "       [1.],\n",
182 |        "       [2.]])"
183 |       ]
184 |      },
185 |      "execution_count": 8,
186 |      "metadata": {},
187 |      "output_type": "execute_result"
188 |     }
189 |    ],
190 |    "source": [
191 |     "encoded_data = encoder.fit_transform(data)\n",
192 |     "encoded_data"
193 |    ]
194 |   },
195 |   {
196 |    "cell_type": "code",
197 |    "execution_count": 9,
198 |    "id": "b4fa6545",
199 |    "metadata": {},
200 |    "outputs": [
201 |     {
202 |      "data": {
203 |       "text/plain": [
204 |        "array([['good'],\n",
205 |        "       ['bad'],\n",
206 |        "       ['excellent'],\n",
207 |        "       ['average'],\n",
208 |        "       ['good'],\n",
209 |        "       ['average'],\n",
210 |        "       ['excellent'],\n",
211 |        "       ['bad'],\n",
212 |        "       ['average'],\n",
213 |        "       ['good']], dtype=object)"
214 |       ]
215 |      },
216 |      "execution_count": 9,
217 |      "metadata": {},
218 |      "output_type": "execute_result"
219 |     }
220 |    ],
221 |    "source": [
222 |     "decoded_data = encoder.inverse_transform(encoded_data)\n",
223 |     "decoded_data"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "code",
228 |    "execution_count": null,
229 |    "id": "525a8f59",
230 |    "metadata": {},
231 |    "outputs": [],
232 |    "source": []
233 |   }
234 |  ],
235 |  "metadata": {
236 |   "kernelspec": {
237 |    "display_name": "Python 3 (ipykernel)",
238 |    "language": "python",
239 |    "name": "python3"
240 |   },
241 |   "language_info": {
242 |    "codemirror_mode": {
243 |     "name": "ipython",
244 |     "version": 3
245 |    },
246 |    "file_extension": ".py",
247 |    "mimetype": "text/x-python",
248 |    "name": "python",
249 |    "nbconvert_exporter": "python",
250 |    "pygments_lexer": "ipython3",
251 |    "version": "3.9.13"
252 |   }
253 |  },
254 |  "nbformat": 4,
255 |  "nbformat_minor": 5
256 | }
257 | 


--------------------------------------------------------------------------------
/3. Binary Encoder.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "id": "adfe09d4",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "import pandas as pd\n",
 11 |     "import category_encoders as ce"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "code",
 16 |    "execution_count": 2,
 17 |    "id": "56986103",
 18 |    "metadata": {},
 19 |    "outputs": [],
 20 |    "source": [
 21 |     "data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']}\n",
 22 |     "df = pd.DataFrame(data)"
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "code",
 27 |    "execution_count": 3,
 28 |    "id": "b5813023",
 29 |    "metadata": {},
 30 |    "outputs": [
 31 |     {
 32 |      "data": {
 33 |       "text/html": [
 34 |        "<div>\n",
 35 |        "<style scoped>\n",
 36 |        "    .dataframe tbody tr th:only-of-type {\n",
 37 |        "        vertical-align: middle;\n",
 38 |        "    }\n",
 39 |        "\n",
 40 |        "    .dataframe tbody tr th {\n",
 41 |        "        vertical-align: top;\n",
 42 |        "    }\n",
 43 |        "\n",
 44 |        "    .dataframe thead th {\n",
 45 |        "        text-align: right;\n",
 46 |        "    }\n",
 47 |        "</style>\n",
 48 |        "<table border=\"1\" class=\"dataframe\">\n",
 49 |        "  <thead>\n",
 50 |        "    <tr style=\"text-align: right;\">\n",
 51 |        "      <th></th>\n",
 52 |        "      <th>Category</th>\n",
 53 |        "    </tr>\n",
 54 |        "  </thead>\n",
 55 |        "  <tbody>\n",
 56 |        "    <tr>\n",
 57 |        "      <th>0</th>\n",
 58 |        "      <td>A</td>\n",
 59 |        "    </tr>\n",
 60 |        "    <tr>\n",
 61 |        "      <th>1</th>\n",
 62 |        "      <td>B</td>\n",
 63 |        "    </tr>\n",
 64 |        "    <tr>\n",
 65 |        "      <th>2</th>\n",
 66 |        "      <td>C</td>\n",
 67 |        "    </tr>\n",
 68 |        "    <tr>\n",
 69 |        "      <th>3</th>\n",
 70 |        "      <td>A</td>\n",
 71 |        "    </tr>\n",
 72 |        "    <tr>\n",
 73 |        "      <th>4</th>\n",
 74 |        "      <td>B</td>\n",
 75 |        "    </tr>\n",
 76 |        "  </tbody>\n",
 77 |        "</table>\n",
 78 |        "</div>"
 79 |       ],
 80 |       "text/plain": [
 81 |        "  Category\n",
 82 |        "0        A\n",
 83 |        "1        B\n",
 84 |        "2        C\n",
 85 |        "3        A\n",
 86 |        "4        B"
 87 |       ]
 88 |      },
 89 |      "execution_count": 3,
 90 |      "metadata": {},
 91 |      "output_type": "execute_result"
 92 |     }
 93 |    ],
 94 |    "source": [
 95 |     "df.head()"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": 4,
101 |    "id": "9390c11f",
102 |    "metadata": {},
103 |    "outputs": [
104 |     {
105 |      "data": {
106 |       "text/plain": [
107 |        "(9, 1)"
108 |       ]
109 |      },
110 |      "execution_count": 4,
111 |      "metadata": {},
112 |      "output_type": "execute_result"
113 |     }
114 |    ],
115 |    "source": [
116 |     "df.shape"
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "code",
121 |    "execution_count": 5,
122 |    "id": "ec81a1c2",
123 |    "metadata": {},
124 |    "outputs": [],
125 |    "source": [
126 |     "encoder = ce.BinaryEncoder(cols=['Category'], return_df=True)"
127 |    ]
128 |   },
129 |   {
130 |    "cell_type": "code",
131 |    "execution_count": 6,
132 |    "id": "d5fe35de",
133 |    "metadata": {},
134 |    "outputs": [
135 |     {
136 |      "data": {
137 |       "text/html": [
138 |        "<div>\n",
139 |        "<style scoped>\n",
140 |        "    .dataframe tbody tr th:only-of-type {\n",
141 |        "        vertical-align: middle;\n",
142 |        "    }\n",
143 |        "\n",
144 |        "    .dataframe tbody tr th {\n",
145 |        "        vertical-align: top;\n",
146 |        "    }\n",
147 |        "\n",
148 |        "    .dataframe thead th {\n",
149 |        "        text-align: right;\n",
150 |        "    }\n",
151 |        "</style>\n",
152 |        "<table border=\"1\" class=\"dataframe\">\n",
153 |        "  <thead>\n",
154 |        "    <tr style=\"text-align: right;\">\n",
155 |        "      <th></th>\n",
156 |        "      <th>Category_0</th>\n",
157 |        "      <th>Category_1</th>\n",
158 |        "    </tr>\n",
159 |        "  </thead>\n",
160 |        "  <tbody>\n",
161 |        "    <tr>\n",
162 |        "      <th>0</th>\n",
163 |        "      <td>0</td>\n",
164 |        "      <td>1</td>\n",
165 |        "    </tr>\n",
166 |        "    <tr>\n",
167 |        "      <th>1</th>\n",
168 |        "      <td>1</td>\n",
169 |        "      <td>0</td>\n",
170 |        "    </tr>\n",
171 |        "    <tr>\n",
172 |        "      <th>2</th>\n",
173 |        "      <td>1</td>\n",
174 |        "      <td>1</td>\n",
175 |        "    </tr>\n",
176 |        "    <tr>\n",
177 |        "      <th>3</th>\n",
178 |        "      <td>0</td>\n",
179 |        "      <td>1</td>\n",
180 |        "    </tr>\n",
181 |        "    <tr>\n",
182 |        "      <th>4</th>\n",
183 |        "      <td>1</td>\n",
184 |        "      <td>0</td>\n",
185 |        "    </tr>\n",
186 |        "    <tr>\n",
187 |        "      <th>5</th>\n",
188 |        "      <td>1</td>\n",
189 |        "      <td>1</td>\n",
190 |        "    </tr>\n",
191 |        "    <tr>\n",
192 |        "      <th>6</th>\n",
193 |        "      <td>0</td>\n",
194 |        "      <td>1</td>\n",
195 |        "    </tr>\n",
196 |        "    <tr>\n",
197 |        "      <th>7</th>\n",
198 |        "      <td>1</td>\n",
199 |        "      <td>0</td>\n",
200 |        "    </tr>\n",
201 |        "    <tr>\n",
202 |        "      <th>8</th>\n",
203 |        "      <td>1</td>\n",
204 |        "      <td>1</td>\n",
205 |        "    </tr>\n",
206 |        "  </tbody>\n",
207 |        "</table>\n",
208 |        "</div>"
209 |       ],
210 |       "text/plain": [
211 |        "   Category_0  Category_1\n",
212 |        "0           0           1\n",
213 |        "1           1           0\n",
214 |        "2           1           1\n",
215 |        "3           0           1\n",
216 |        "4           1           0\n",
217 |        "5           1           1\n",
218 |        "6           0           1\n",
219 |        "7           1           0\n",
220 |        "8           1           1"
221 |       ]
222 |      },
223 |      "execution_count": 6,
224 |      "metadata": {},
225 |      "output_type": "execute_result"
226 |     }
227 |    ],
228 |    "source": [
229 |     "df_binary_encoded = encoder.fit_transform(df)\n",
230 |     "df_binary_encoded"
231 |    ]
232 |   },
233 |   {
234 |    "cell_type": "code",
235 |    "execution_count": null,
236 |    "id": "201a7fa5",
237 |    "metadata": {},
238 |    "outputs": [],
239 |    "source": []
240 |   }
241 |  ],
242 |  "metadata": {
243 |   "kernelspec": {
244 |    "display_name": "Python 3 (ipykernel)",
245 |    "language": "python",
246 |    "name": "python3"
247 |   },
248 |   "language_info": {
249 |    "codemirror_mode": {
250 |     "name": "ipython",
251 |     "version": 3
252 |    },
253 |    "file_extension": ".py",
254 |    "mimetype": "text/x-python",
255 |    "name": "python",
256 |    "nbconvert_exporter": "python",
257 |    "pygments_lexer": "ipython3",
258 |    "version": "3.9.13"
259 |   }
260 |  },
261 |  "nbformat": 4,
262 |  "nbformat_minor": 5
263 | }
264 | 


--------------------------------------------------------------------------------
/Standard Scaling using Python.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "id": "20c48bc4",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "import pandas as pd"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 2,
 16 |    "id": "cf6184fa",
 17 |    "metadata": {},
 18 |    "outputs": [
 19 |     {
 20 |      "data": {
 21 |       "text/html": [
 22 |        "<div>\n",
 23 |        "<style scoped>\n",
 24 |        "    .dataframe tbody tr th:only-of-type {\n",
 25 |        "        vertical-align: middle;\n",
 26 |        "    }\n",
 27 |        "\n",
 28 |        "    .dataframe tbody tr th {\n",
 29 |        "        vertical-align: top;\n",
 30 |        "    }\n",
 31 |        "\n",
 32 |        "    .dataframe thead th {\n",
 33 |        "        text-align: right;\n",
 34 |        "    }\n",
 35 |        "</style>\n",
 36 |        "<table border=\"1\" class=\"dataframe\">\n",
 37 |        "  <thead>\n",
 38 |        "    <tr style=\"text-align: right;\">\n",
 39 |        "      <th></th>\n",
 40 |        "      <th>Feature 1</th>\n",
 41 |        "      <th>Feature 2</th>\n",
 42 |        "    </tr>\n",
 43 |        "  </thead>\n",
 44 |        "  <tbody>\n",
 45 |        "    <tr>\n",
 46 |        "      <th>0</th>\n",
 47 |        "      <td>1</td>\n",
 48 |        "      <td>6</td>\n",
 49 |        "    </tr>\n",
 50 |        "    <tr>\n",
 51 |        "      <th>1</th>\n",
 52 |        "      <td>20</td>\n",
 53 |        "      <td>7</td>\n",
 54 |        "    </tr>\n",
 55 |        "    <tr>\n",
 56 |        "      <th>2</th>\n",
 57 |        "      <td>3</td>\n",
 58 |        "      <td>18</td>\n",
 59 |        "    </tr>\n",
 60 |        "    <tr>\n",
 61 |        "      <th>3</th>\n",
 62 |        "      <td>40</td>\n",
 63 |        "      <td>19</td>\n",
 64 |        "    </tr>\n",
 65 |        "    <tr>\n",
 66 |        "      <th>4</th>\n",
 67 |        "      <td>5</td>\n",
 68 |        "      <td>10</td>\n",
 69 |        "    </tr>\n",
 70 |        "  </tbody>\n",
 71 |        "</table>\n",
 72 |        "</div>"
 73 |       ],
 74 |       "text/plain": [
 75 |        "   Feature 1  Feature 2\n",
 76 |        "0          1          6\n",
 77 |        "1         20          7\n",
 78 |        "2          3         18\n",
 79 |        "3         40         19\n",
 80 |        "4          5         10"
 81 |       ]
 82 |      },
 83 |      "execution_count": 2,
 84 |      "metadata": {},
 85 |      "output_type": "execute_result"
 86 |     }
 87 |    ],
 88 |    "source": [
 89 |     "data = {'Feature 1': [1, 20, 3, 40, 5],\n",
 90 |     "        'Feature 2': [6, 7, 18, 19, 10]}\n",
 91 |     "\n",
 92 |     "df = pd.DataFrame(data)\n",
 93 |     "df.head()"
 94 |    ]
 95 |   },
 96 |   {
 97 |    "cell_type": "code",
 98 |    "execution_count": 3,
 99 |    "id": "8fa7404f",
100 |    "metadata": {},
101 |    "outputs": [],
102 |    "source": [
103 |     "def standardize_data(data):\n",
104 |     "    mean = data.mean()\n",
105 |     "    std_dev = data.std()  # pandas default is ddof=1 (sample std); StandardScaler uses ddof=0\n",
106 |     "    standardized_data_raw = (data - mean) / std_dev\n",
107 |     "    return standardized_data_raw"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "code",
112 |    "execution_count": 4,
113 |    "id": "f2c02d59",
114 |    "metadata": {},
115 |    "outputs": [],
116 |    "source": [
117 |     "standardized_df_raw = df.apply(standardize_data)"
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "code",
122 |    "execution_count": 5,
123 |    "id": "f22c9db0",
124 |    "metadata": {},
125 |    "outputs": [
126 |     {
127 |      "name": "stdout",
128 |      "output_type": "stream",
129 |      "text": [
130 |       "Original DataFrame:\n",
131 |       "   Feature 1  Feature 2\n",
132 |       "0          1          6\n",
133 |       "1         20          7\n",
134 |       "2          3         18\n",
135 |       "3         40         19\n",
136 |       "4          5         10\n",
137 |       "\n",
138 |       "Standardized DataFrame:\n",
139 |       "   Feature 1  Feature 2\n",
140 |       "0  -0.777975  -0.979796\n",
141 |       "1   0.376832  -0.816497\n",
142 |       "2  -0.656417   0.979796\n",
143 |       "3   1.592418   1.143095\n",
144 |       "4  -0.534858  -0.326599\n"
145 |      ]
146 |     }
147 |    ],
148 |    "source": [
149 |     "print(\"Original DataFrame:\")\n",
150 |     "print(df)\n",
151 |     "print(\"\\nStandardized DataFrame:\")\n",
152 |     "print(standardized_df_raw)"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "code",
157 |    "execution_count": 6,
158 |    "id": "3fb91ce8",
159 |    "metadata": {},
160 |    "outputs": [
161 |     {
162 |      "data": {
163 |       "text/html": [
164 |        "<div>\n",
165 |        "<style scoped>\n",
166 |        "    .dataframe tbody tr th:only-of-type {\n",
167 |        "        vertical-align: middle;\n",
168 |        "    }\n",
169 |        "\n",
170 |        "    .dataframe tbody tr th {\n",
171 |        "        vertical-align: top;\n",
172 |        "    }\n",
173 |        "\n",
174 |        "    .dataframe thead th {\n",
175 |        "        text-align: right;\n",
176 |        "    }\n",
177 |        "</style>\n",
178 |        "<table border=\"1\" class=\"dataframe\">\n",
179 |        "  <thead>\n",
180 |        "    <tr style=\"text-align: right;\">\n",
181 |        "      <th></th>\n",
182 |        "      <th>Feature 1</th>\n",
183 |        "      <th>Feature 2</th>\n",
184 |        "    </tr>\n",
185 |        "  </thead>\n",
186 |        "  <tbody>\n",
187 |        "    <tr>\n",
188 |        "      <th>0</th>\n",
189 |        "      <td>1</td>\n",
190 |        "      <td>6</td>\n",
191 |        "    </tr>\n",
192 |        "    <tr>\n",
193 |        "      <th>1</th>\n",
194 |        "      <td>20</td>\n",
195 |        "      <td>7</td>\n",
196 |        "    </tr>\n",
197 |        "    <tr>\n",
198 |        "      <th>2</th>\n",
199 |        "      <td>3</td>\n",
200 |        "      <td>18</td>\n",
201 |        "    </tr>\n",
202 |        "    <tr>\n",
203 |        "      <th>3</th>\n",
204 |        "      <td>40</td>\n",
205 |        "      <td>19</td>\n",
206 |        "    </tr>\n",
207 |        "    <tr>\n",
208 |        "      <th>4</th>\n",
209 |        "      <td>5</td>\n",
210 |        "      <td>10</td>\n",
211 |        "    </tr>\n",
212 |        "  </tbody>\n",
213 |        "</table>\n",
214 |        "</div>"
215 |       ],
216 |       "text/plain": [
217 |        "   Feature 1  Feature 2\n",
218 |        "0          1          6\n",
219 |        "1         20          7\n",
220 |        "2          3         18\n",
221 |        "3         40         19\n",
222 |        "4          5         10"
223 |       ]
224 |      },
225 |      "execution_count": 6,
226 |      "metadata": {},
227 |      "output_type": "execute_result"
228 |     }
229 |    ],
230 |    "source": [
231 |     "df.head()"
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "code",
236 |    "execution_count": 7,
237 |    "id": "dcfedb7b",
238 |    "metadata": {},
239 |    "outputs": [
240 |     {
241 |      "data": {
242 |       "text/html": [
243 |        "<div>\n",
244 |        "<style scoped>\n",
245 |        "    .dataframe tbody tr th:only-of-type {\n",
246 |        "        vertical-align: middle;\n",
247 |        "    }\n",
248 |        "\n",
249 |        "    .dataframe tbody tr th {\n",
250 |        "        vertical-align: top;\n",
251 |        "    }\n",
252 |        "\n",
253 |        "    .dataframe thead th {\n",
254 |        "        text-align: right;\n",
255 |        "    }\n",
256 |        "</style>\n",
257 |        "<table border=\"1\" class=\"dataframe\">\n",
258 |        "  <thead>\n",
259 |        "    <tr style=\"text-align: right;\">\n",
260 |        "      <th></th>\n",
261 |        "      <th>Feature 1</th>\n",
262 |        "      <th>Feature 2</th>\n",
263 |        "    </tr>\n",
264 |        "  </thead>\n",
265 |        "  <tbody>\n",
266 |        "    <tr>\n",
267 |        "      <th>0</th>\n",
268 |        "      <td>-0.777975</td>\n",
269 |        "      <td>-0.979796</td>\n",
270 |        "    </tr>\n",
271 |        "    <tr>\n",
272 |        "      <th>1</th>\n",
273 |        "      <td>0.376832</td>\n",
274 |        "      <td>-0.816497</td>\n",
275 |        "    </tr>\n",
276 |        "    <tr>\n",
277 |        "      <th>2</th>\n",
278 |        "      <td>-0.656417</td>\n",
279 |        "      <td>0.979796</td>\n",
280 |        "    </tr>\n",
281 |        "    <tr>\n",
282 |        "      <th>3</th>\n",
283 |        "      <td>1.592418</td>\n",
284 |        "      <td>1.143095</td>\n",
285 |        "    </tr>\n",
286 |        "    <tr>\n",
287 |        "      <th>4</th>\n",
288 |        "      <td>-0.534858</td>\n",
289 |        "      <td>-0.326599</td>\n",
290 |        "    </tr>\n",
291 |        "  </tbody>\n",
292 |        "</table>\n",
293 |        "</div>"
294 |       ],
295 |       "text/plain": [
296 |        "   Feature 1  Feature 2\n",
297 |        "0  -0.777975  -0.979796\n",
298 |        "1   0.376832  -0.816497\n",
299 |        "2  -0.656417   0.979796\n",
300 |        "3   1.592418   1.143095\n",
301 |        "4  -0.534858  -0.326599"
302 |       ]
303 |      },
304 |      "execution_count": 7,
305 |      "metadata": {},
306 |      "output_type": "execute_result"
307 |     }
308 |    ],
309 |    "source": [
310 |     "standardized_df_raw.head()"
311 |    ]
312 |   },
313 |   {
314 |    "cell_type": "code",
315 |    "execution_count": 8,
316 |    "id": "daaa1204",
317 |    "metadata": {},
318 |    "outputs": [
319 |     {
320 |      "data": {
321 |       "text/plain": [
322 |        "1.0"
323 |       ]
324 |      },
325 |      "execution_count": 8,
326 |      "metadata": {},
327 |      "output_type": "execute_result"
328 |     }
329 |    ],
330 |    "source": [
331 |     "standardized_df_raw['Feature 1'].mean() + 1"
332 |    ]
333 |   },
334 |   {
335 |    "cell_type": "code",
336 |    "execution_count": 9,
337 |    "id": "56277f06",
338 |    "metadata": {},
339 |    "outputs": [
340 |     {
341 |      "data": {
342 |       "text/plain": [
343 |        "0.9999999999999999"
344 |       ]
345 |      },
346 |      "execution_count": 9,
347 |      "metadata": {},
348 |      "output_type": "execute_result"
349 |     }
350 |    ],
351 |    "source": [
352 |     "standardized_df_raw['Feature 1'].std()"
353 |    ]
354 |   },
355 |   {
356 |    "cell_type": "markdown",
357 |    "id": "690b8116",
358 |    "metadata": {},
359 |    "source": [
360 |     "# Sklearn"
361 |    ]
362 |   },
363 |   {
364 |    "cell_type": "code",
365 |    "execution_count": 10,
366 |    "id": "11c15507",
367 |    "metadata": {},
368 |    "outputs": [],
369 |    "source": [
370 |     "from sklearn.preprocessing import StandardScaler"
371 |    ]
372 |   },
373 |   {
374 |    "cell_type": "code",
375 |    "execution_count": 11,
376 |    "id": "6f37ac57",
377 |    "metadata": {},
378 |    "outputs": [
379 |     {
380 |      "data": {
381 |       "text/html": [
382 |        "<div>\n",
383 |        "<style scoped>\n",
384 |        "    .dataframe tbody tr th:only-of-type {\n",
385 |        "        vertical-align: middle;\n",
386 |        "    }\n",
387 |        "\n",
388 |        "    .dataframe tbody tr th {\n",
389 |        "        vertical-align: top;\n",
390 |        "    }\n",
391 |        "\n",
392 |        "    .dataframe thead th {\n",
393 |        "        text-align: right;\n",
394 |        "    }\n",
395 |        "</style>\n",
396 |        "<table border=\"1\" class=\"dataframe\">\n",
397 |        "  <thead>\n",
398 |        "    <tr style=\"text-align: right;\">\n",
399 |        "      <th></th>\n",
400 |        "      <th>Feature 1</th>\n",
401 |        "      <th>Feature 2</th>\n",
402 |        "    </tr>\n",
403 |        "  </thead>\n",
404 |        "  <tbody>\n",
405 |        "    <tr>\n",
406 |        "      <th>0</th>\n",
407 |        "      <td>1</td>\n",
408 |        "      <td>6</td>\n",
409 |        "    </tr>\n",
410 |        "    <tr>\n",
411 |        "      <th>1</th>\n",
412 |        "      <td>20</td>\n",
413 |        "      <td>7</td>\n",
414 |        "    </tr>\n",
415 |        "    <tr>\n",
416 |        "      <th>2</th>\n",
417 |        "      <td>3</td>\n",
418 |        "      <td>18</td>\n",
419 |        "    </tr>\n",
420 |        "    <tr>\n",
421 |        "      <th>3</th>\n",
422 |        "      <td>40</td>\n",
423 |        "      <td>19</td>\n",
424 |        "    </tr>\n",
425 |        "    <tr>\n",
426 |        "      <th>4</th>\n",
427 |        "      <td>5</td>\n",
428 |        "      <td>10</td>\n",
429 |        "    </tr>\n",
430 |        "  </tbody>\n",
431 |        "</table>\n",
432 |        "</div>"
433 |       ],
434 |       "text/plain": [
435 |        "   Feature 1  Feature 2\n",
436 |        "0          1          6\n",
437 |        "1         20          7\n",
438 |        "2          3         18\n",
439 |        "3         40         19\n",
440 |        "4          5         10"
441 |       ]
442 |      },
443 |      "execution_count": 11,
444 |      "metadata": {},
445 |      "output_type": "execute_result"
446 |     }
447 |    ],
448 |    "source": [
449 |     "df"
450 |    ]
451 |   },
452 |   {
453 |    "cell_type": "code",
454 |    "execution_count": 12,
455 |    "id": "ea267703",
456 |    "metadata": {},
457 |    "outputs": [],
458 |    "source": [
459 |     "scaler = StandardScaler()\n",
460 |     "standardized_data = scaler.fit_transform(df)\n",
461 |     "standardized_df = pd.DataFrame(standardized_data, columns=df.columns)"
462 |    ]
463 |   },
464 |   {
465 |    "cell_type": "code",
466 |    "execution_count": 13,
467 |    "id": "c68cbfde",
468 |    "metadata": {},
469 |    "outputs": [
470 |     {
471 |      "name": "stdout",
472 |      "output_type": "stream",
473 |      "text": [
474 |       "Original DataFrame:\n",
475 |       "   Feature 1  Feature 2\n",
476 |       "0          1          6\n",
477 |       "1         20          7\n",
478 |       "2          3         18\n",
479 |       "3         40         19\n",
480 |       "4          5         10\n",
481 |       "\n",
482 |       "Standardized DataFrame: raw code\n",
483 |       "   Feature 1  Feature 2\n",
484 |       "0  -0.777975  -0.979796\n",
485 |       "1   0.376832  -0.816497\n",
486 |       "2  -0.656417   0.979796\n",
487 |       "3   1.592418   1.143095\n",
488 |       "4  -0.534858  -0.326599\n",
489 |       "\n",
490 |       "Standardized DataFrame: sklearn\n",
491 |       "   Feature 1  Feature 2\n",
492 |       "0  -0.869803  -1.095445\n",
493 |       "1   0.421311  -0.912871\n",
494 |       "2  -0.733896   1.095445\n",
495 |       "3   1.780378   1.278019\n",
496 |       "4  -0.597989  -0.365148\n"
497 |      ]
498 |     }
499 |    ],
500 |    "source": [
501 |     "print(\"Original DataFrame:\")\n",
502 |     "print(df)\n",
503 |     "\n",
504 |     "print(\"\\nStandardized DataFrame: raw code\")\n",
505 |     "print(standardized_df_raw)\n",
506 |     "\n",
507 |     "print(\"\\nStandardized DataFrame: sklearn\")\n",
508 |     "print(standardized_df)"
509 |    ]
510 |   },
511 |   {
512 |    "cell_type": "code",
513 |    "execution_count": null,
514 |    "id": "f764d2ec",
515 |    "metadata": {},
516 |    "outputs": [],
517 |    "source": []
518 |   },
519 |   {
520 |    "cell_type": "code",
521 |    "execution_count": null,
522 |    "id": "b1af06be",
523 |    "metadata": {},
524 |    "outputs": [],
525 |    "source": []
526 |   }
527 |  ],
528 |  "metadata": {
529 |   "kernelspec": {
530 |    "display_name": "Python 3 (ipykernel)",
531 |    "language": "python",
532 |    "name": "python3"
533 |   },
534 |   "language_info": {
535 |    "codemirror_mode": {
536 |     "name": "ipython",
537 |     "version": 3
538 |    },
539 |    "file_extension": ".py",
540 |    "mimetype": "text/x-python",
541 |    "name": "python",
542 |    "nbconvert_exporter": "python",
543 |    "pygments_lexer": "ipython3",
544 |    "version": "3.9.13"
545 |   }
546 |  },
547 |  "nbformat": 4,
548 |  "nbformat_minor": 5
549 | }
550 | 
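The "raw code" and "sklearn" results printed in the notebook above differ because pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The following is a minimal sketch of that difference using only the standard library; the helper `standardize` is a hypothetical name introduced for illustration and is not part of the notebook.

```python
import math
import statistics

feature_1 = [1, 20, 3, 40, 5]  # 'Feature 1' column from the notebook


def standardize(values, ddof):
    """Return z-scores using a standard deviation with the given ddof (0 or 1)."""
    mean = statistics.fmean(values)
    # pstdev divides by n (ddof=0); stdev divides by n - 1 (ddof=1)
    std = statistics.pstdev(values) if ddof == 0 else statistics.stdev(values)
    return [(v - mean) / std for v in values]


z_sample = standardize(feature_1, ddof=1)      # same convention as pandas' default std()
z_population = standardize(feature_1, ddof=0)  # same convention as StandardScaler

# The two results differ only by the constant factor sqrt(n / (n - 1))
n = len(feature_1)
ratio = z_population[0] / z_sample[0]

print([round(z, 6) for z in z_sample])
print([round(z, 6) for z in z_population])
print(math.isclose(ratio, math.sqrt(n / (n - 1))))
```

In the notebook itself, calling `data.std(ddof=0)` inside `standardize_data` should bring the hand-written result in line with the `StandardScaler` output.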


--------------------------------------------------------------------------------
/2. One Hot Encoding.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "id": "79afcd9b",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "import pandas as pd"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 2,
 16 |    "id": "a31b20f7",
 17 |    "metadata": {},
 18 |    "outputs": [],
 19 |    "source": [
 20 |     "data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']}"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "code",
 25 |    "execution_count": 3,
 26 |    "id": "6c33a2f7",
 27 |    "metadata": {},
 28 |    "outputs": [
 29 |     {
 30 |      "data": {
 31 |       "text/html": [
 32 |        "<div>\n",
 33 |        "<style scoped>\n",
 34 |        "    .dataframe tbody tr th:only-of-type {\n",
 35 |        "        vertical-align: middle;\n",
 36 |        "    }\n",
 37 |        "\n",
 38 |        "    .dataframe tbody tr th {\n",
 39 |        "        vertical-align: top;\n",
 40 |        "    }\n",
 41 |        "\n",
 42 |        "    .dataframe thead th {\n",
 43 |        "        text-align: right;\n",
 44 |        "    }\n",
 45 |        "</style>\n",
 46 |        "<table border=\"1\" class=\"dataframe\">\n",
 47 |        "  <thead>\n",
 48 |        "    <tr style=\"text-align: right;\">\n",
 49 |        "      <th></th>\n",
 50 |        "      <th>Category</th>\n",
 51 |        "    </tr>\n",
 52 |        "  </thead>\n",
 53 |        "  <tbody>\n",
 54 |        "    <tr>\n",
 55 |        "      <th>0</th>\n",
 56 |        "      <td>A</td>\n",
 57 |        "    </tr>\n",
 58 |        "    <tr>\n",
 59 |        "      <th>1</th>\n",
 60 |        "      <td>B</td>\n",
 61 |        "    </tr>\n",
 62 |        "    <tr>\n",
 63 |        "      <th>2</th>\n",
 64 |        "      <td>C</td>\n",
 65 |        "    </tr>\n",
 66 |        "    <tr>\n",
 67 |        "      <th>3</th>\n",
 68 |        "      <td>A</td>\n",
 69 |        "    </tr>\n",
 70 |        "    <tr>\n",
 71 |        "      <th>4</th>\n",
 72 |        "      <td>B</td>\n",
 73 |        "    </tr>\n",
 74 |        "  </tbody>\n",
 75 |        "</table>\n",
 76 |        "</div>"
 77 |       ],
 78 |       "text/plain": [
 79 |        "  Category\n",
 80 |        "0        A\n",
 81 |        "1        B\n",
 82 |        "2        C\n",
 83 |        "3        A\n",
 84 |        "4        B"
 85 |       ]
 86 |      },
 87 |      "execution_count": 3,
 88 |      "metadata": {},
 89 |      "output_type": "execute_result"
 90 |     }
 91 |    ],
 92 |    "source": [
 93 |     "df = pd.DataFrame(data)\n",
 94 |     "df.head()"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": 4,
100 |    "id": "d5674ea5",
101 |    "metadata": {},
102 |    "outputs": [
103 |     {
104 |      "data": {
105 |       "text/html": [
106 |        "<div>\n",
107 |        "<style scoped>\n",
108 |        "    .dataframe tbody tr th:only-of-type {\n",
109 |        "        vertical-align: middle;\n",
110 |        "    }\n",
111 |        "\n",
112 |        "    .dataframe tbody tr th {\n",
113 |        "        vertical-align: top;\n",
114 |        "    }\n",
115 |        "\n",
116 |        "    .dataframe thead th {\n",
117 |        "        text-align: right;\n",
118 |        "    }\n",
119 |        "</style>\n",
120 |        "<table border=\"1\" class=\"dataframe\">\n",
121 |        "  <thead>\n",
122 |        "    <tr style=\"text-align: right;\">\n",
123 |        "      <th></th>\n",
124 |        "      <th>Category_A</th>\n",
125 |        "      <th>Category_B</th>\n",
126 |        "      <th>Category_C</th>\n",
127 |        "    </tr>\n",
128 |        "  </thead>\n",
129 |        "  <tbody>\n",
130 |        "    <tr>\n",
131 |        "      <th>0</th>\n",
132 |        "      <td>1</td>\n",
133 |        "      <td>0</td>\n",
134 |        "      <td>0</td>\n",
135 |        "    </tr>\n",
136 |        "    <tr>\n",
137 |        "      <th>1</th>\n",
138 |        "      <td>0</td>\n",
139 |        "      <td>1</td>\n",
140 |        "      <td>0</td>\n",
141 |        "    </tr>\n",
142 |        "    <tr>\n",
143 |        "      <th>2</th>\n",
144 |        "      <td>0</td>\n",
145 |        "      <td>0</td>\n",
146 |        "      <td>1</td>\n",
147 |        "    </tr>\n",
148 |        "    <tr>\n",
149 |        "      <th>3</th>\n",
150 |        "      <td>1</td>\n",
151 |        "      <td>0</td>\n",
152 |        "      <td>0</td>\n",
153 |        "    </tr>\n",
154 |        "    <tr>\n",
155 |        "      <th>4</th>\n",
156 |        "      <td>0</td>\n",
157 |        "      <td>1</td>\n",
158 |        "      <td>0</td>\n",
159 |        "    </tr>\n",
160 |        "    <tr>\n",
161 |        "      <th>5</th>\n",
162 |        "      <td>0</td>\n",
163 |        "      <td>0</td>\n",
164 |        "      <td>1</td>\n",
165 |        "    </tr>\n",
166 |        "    <tr>\n",
167 |        "      <th>6</th>\n",
168 |        "      <td>1</td>\n",
169 |        "      <td>0</td>\n",
170 |        "      <td>0</td>\n",
171 |        "    </tr>\n",
172 |        "    <tr>\n",
173 |        "      <th>7</th>\n",
174 |        "      <td>0</td>\n",
175 |        "      <td>1</td>\n",
176 |        "      <td>0</td>\n",
177 |        "    </tr>\n",
178 |        "    <tr>\n",
179 |        "      <th>8</th>\n",
180 |        "      <td>0</td>\n",
181 |        "      <td>0</td>\n",
182 |        "      <td>1</td>\n",
183 |        "    </tr>\n",
184 |        "  </tbody>\n",
185 |        "</table>\n",
186 |        "</div>"
187 |       ],
188 |       "text/plain": [
189 |        "   Category_A  Category_B  Category_C\n",
190 |        "0           1           0           0\n",
191 |        "1           0           1           0\n",
192 |        "2           0           0           1\n",
193 |        "3           1           0           0\n",
194 |        "4           0           1           0\n",
195 |        "5           0           0           1\n",
196 |        "6           1           0           0\n",
197 |        "7           0           1           0\n",
198 |        "8           0           0           1"
199 |       ]
200 |      },
201 |      "execution_count": 4,
202 |      "metadata": {},
203 |      "output_type": "execute_result"
204 |     }
205 |    ],
206 |    "source": [
207 |     "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'])\n",
208 |     "one_hot_encoded_df"
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": 5,
214 |    "id": "e16d27ac",
215 |    "metadata": {},
216 |    "outputs": [
217 |     {
218 |      "data": {
219 |       "text/html": [
220 |        "<div>\n",
221 |        "<style scoped>\n",
222 |        "    .dataframe tbody tr th:only-of-type {\n",
223 |        "        vertical-align: middle;\n",
224 |        "    }\n",
225 |        "\n",
226 |        "    .dataframe tbody tr th {\n",
227 |        "        vertical-align: top;\n",
228 |        "    }\n",
229 |        "\n",
230 |        "    .dataframe thead th {\n",
231 |        "        text-align: right;\n",
232 |        "    }\n",
233 |        "</style>\n",
234 |        "<table border=\"1\" class=\"dataframe\">\n",
235 |        "  <thead>\n",
236 |        "    <tr style=\"text-align: right;\">\n",
237 |        "      <th></th>\n",
238 |        "      <th>Dummy_A</th>\n",
239 |        "      <th>Dummy_B</th>\n",
240 |        "      <th>Dummy_C</th>\n",
241 |        "    </tr>\n",
242 |        "  </thead>\n",
243 |        "  <tbody>\n",
244 |        "    <tr>\n",
245 |        "      <th>0</th>\n",
246 |        "      <td>1</td>\n",
247 |        "      <td>0</td>\n",
248 |        "      <td>0</td>\n",
249 |        "    </tr>\n",
250 |        "    <tr>\n",
251 |        "      <th>1</th>\n",
252 |        "      <td>0</td>\n",
253 |        "      <td>1</td>\n",
254 |        "      <td>0</td>\n",
255 |        "    </tr>\n",
256 |        "    <tr>\n",
257 |        "      <th>2</th>\n",
258 |        "      <td>0</td>\n",
259 |        "      <td>0</td>\n",
260 |        "      <td>1</td>\n",
261 |        "    </tr>\n",
262 |        "    <tr>\n",
263 |        "      <th>3</th>\n",
264 |        "      <td>1</td>\n",
265 |        "      <td>0</td>\n",
266 |        "      <td>0</td>\n",
267 |        "    </tr>\n",
268 |        "    <tr>\n",
269 |        "      <th>4</th>\n",
270 |        "      <td>0</td>\n",
271 |        "      <td>1</td>\n",
272 |        "      <td>0</td>\n",
273 |        "    </tr>\n",
274 |        "    <tr>\n",
275 |        "      <th>5</th>\n",
276 |        "      <td>0</td>\n",
277 |        "      <td>0</td>\n",
278 |        "      <td>1</td>\n",
279 |        "    </tr>\n",
280 |        "    <tr>\n",
281 |        "      <th>6</th>\n",
282 |        "      <td>1</td>\n",
283 |        "      <td>0</td>\n",
284 |        "      <td>0</td>\n",
285 |        "    </tr>\n",
286 |        "    <tr>\n",
287 |        "      <th>7</th>\n",
288 |        "      <td>0</td>\n",
289 |        "      <td>1</td>\n",
290 |        "      <td>0</td>\n",
291 |        "    </tr>\n",
292 |        "    <tr>\n",
293 |        "      <th>8</th>\n",
294 |        "      <td>0</td>\n",
295 |        "      <td>0</td>\n",
296 |        "      <td>1</td>\n",
297 |        "    </tr>\n",
298 |        "  </tbody>\n",
299 |        "</table>\n",
300 |        "</div>"
301 |       ],
302 |       "text/plain": [
303 |        "   Dummy_A  Dummy_B  Dummy_C\n",
304 |        "0        1        0        0\n",
305 |        "1        0        1        0\n",
306 |        "2        0        0        1\n",
307 |        "3        1        0        0\n",
308 |        "4        0        1        0\n",
309 |        "5        0        0        1\n",
310 |        "6        1        0        0\n",
311 |        "7        0        1        0\n",
312 |        "8        0        0        1"
313 |       ]
314 |      },
315 |      "execution_count": 5,
316 |      "metadata": {},
317 |      "output_type": "execute_result"
318 |     }
319 |    ],
320 |    "source": [
321 |     "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'], prefix='Dummy')\n",
322 |     "one_hot_encoded_df"
323 |    ]
324 |   },
325 |   {
326 |    "cell_type": "code",
327 |    "execution_count": 6,
328 |    "id": "2e8cfc1d",
329 |    "metadata": {},
330 |    "outputs": [
331 |     {
332 |      "data": {
333 |       "text/html": [
334 |        "<div>\n",
335 |        "<style scoped>\n",
336 |        "    .dataframe tbody tr th:only-of-type {\n",
337 |        "        vertical-align: middle;\n",
338 |        "    }\n",
339 |        "\n",
340 |        "    .dataframe tbody tr th {\n",
341 |        "        vertical-align: top;\n",
342 |        "    }\n",
343 |        "\n",
344 |        "    .dataframe thead th {\n",
345 |        "        text-align: right;\n",
346 |        "    }\n",
347 |        "</style>\n",
348 |        "<table border=\"1\" class=\"dataframe\">\n",
349 |        "  <thead>\n",
350 |        "    <tr style=\"text-align: right;\">\n",
351 |        "      <th></th>\n",
352 |        "      <th>Dummy_B</th>\n",
353 |        "      <th>Dummy_C</th>\n",
354 |        "    </tr>\n",
355 |        "  </thead>\n",
356 |        "  <tbody>\n",
357 |        "    <tr>\n",
358 |        "      <th>0</th>\n",
359 |        "      <td>0</td>\n",
360 |        "      <td>0</td>\n",
361 |        "    </tr>\n",
362 |        "    <tr>\n",
363 |        "      <th>1</th>\n",
364 |        "      <td>1</td>\n",
365 |        "      <td>0</td>\n",
366 |        "    </tr>\n",
367 |        "    <tr>\n",
368 |        "      <th>2</th>\n",
369 |        "      <td>0</td>\n",
370 |        "      <td>1</td>\n",
371 |        "    </tr>\n",
372 |        "    <tr>\n",
373 |        "      <th>3</th>\n",
374 |        "      <td>0</td>\n",
375 |        "      <td>0</td>\n",
376 |        "    </tr>\n",
377 |        "    <tr>\n",
378 |        "      <th>4</th>\n",
379 |        "      <td>1</td>\n",
380 |        "      <td>0</td>\n",
381 |        "    </tr>\n",
382 |        "    <tr>\n",
383 |        "      <th>5</th>\n",
384 |        "      <td>0</td>\n",
385 |        "      <td>1</td>\n",
386 |        "    </tr>\n",
387 |        "    <tr>\n",
388 |        "      <th>6</th>\n",
389 |        "      <td>0</td>\n",
390 |        "      <td>0</td>\n",
391 |        "    </tr>\n",
392 |        "    <tr>\n",
393 |        "      <th>7</th>\n",
394 |        "      <td>1</td>\n",
395 |        "      <td>0</td>\n",
396 |        "    </tr>\n",
397 |        "    <tr>\n",
398 |        "      <th>8</th>\n",
399 |        "      <td>0</td>\n",
400 |        "      <td>1</td>\n",
401 |        "    </tr>\n",
402 |        "  </tbody>\n",
403 |        "</table>\n",
404 |        "</div>"
405 |       ],
406 |       "text/plain": [
407 |        "   Dummy_B  Dummy_C\n",
408 |        "0        0        0\n",
409 |        "1        1        0\n",
410 |        "2        0        1\n",
411 |        "3        0        0\n",
412 |        "4        1        0\n",
413 |        "5        0        1\n",
414 |        "6        0        0\n",
415 |        "7        1        0\n",
416 |        "8        0        1"
417 |       ]
418 |      },
419 |      "execution_count": 6,
420 |      "metadata": {},
421 |      "output_type": "execute_result"
422 |     }
423 |    ],
424 |    "source": [
425 |     "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'], prefix='Dummy', drop_first=True)\n",
426 |     "one_hot_encoded_df"
427 |    ]
428 |   },
429 |   {
430 |    "cell_type": "code",
431 |    "execution_count": 7,
432 |    "id": "287f4418",
433 |    "metadata": {},
434 |    "outputs": [
435 |     {
436 |      "data": {
437 |       "text/html": [
438 |        "<div>\n",
439 |        "<style scoped>\n",
440 |        "    .dataframe tbody tr th:only-of-type {\n",
441 |        "        vertical-align: middle;\n",
442 |        "    }\n",
443 |        "\n",
444 |        "    .dataframe tbody tr th {\n",
445 |        "        vertical-align: top;\n",
446 |        "    }\n",
447 |        "\n",
448 |        "    .dataframe thead th {\n",
449 |        "        text-align: right;\n",
450 |        "    }\n",
451 |        "</style>\n",
452 |        "<table border=\"1\" class=\"dataframe\">\n",
453 |        "  <thead>\n",
454 |        "    <tr style=\"text-align: right;\">\n",
455 |        "      <th></th>\n",
456 |        "      <th>Category</th>\n",
457 |        "    </tr>\n",
458 |        "  </thead>\n",
459 |        "  <tbody>\n",
460 |        "    <tr>\n",
461 |        "      <th>0</th>\n",
462 |        "      <td>A</td>\n",
463 |        "    </tr>\n",
464 |        "    <tr>\n",
465 |        "      <th>1</th>\n",
466 |        "      <td>B</td>\n",
467 |        "    </tr>\n",
468 |        "    <tr>\n",
469 |        "      <th>2</th>\n",
470 |        "      <td>C</td>\n",
471 |        "    </tr>\n",
472 |        "    <tr>\n",
473 |        "      <th>3</th>\n",
474 |        "      <td>A</td>\n",
475 |        "    </tr>\n",
476 |        "    <tr>\n",
477 |        "      <th>4</th>\n",
478 |        "      <td>B</td>\n",
479 |        "    </tr>\n",
480 |        "  </tbody>\n",
481 |        "</table>\n",
482 |        "</div>"
483 |       ],
484 |       "text/plain": [
485 |        "  Category\n",
486 |        "0        A\n",
487 |        "1        B\n",
488 |        "2        C\n",
489 |        "3        A\n",
490 |        "4        B"
491 |       ]
492 |      },
493 |      "execution_count": 7,
494 |      "metadata": {},
495 |      "output_type": "execute_result"
496 |     }
497 |    ],
498 |    "source": [
499 |     "df.head()"
500 |    ]
501 |   },
502 |   {
503 |    "cell_type": "code",
504 |    "execution_count": null,
505 |    "id": "cb821849",
506 |    "metadata": {},
507 |    "outputs": [],
508 |    "source": []
509 |   }
510 |  ],
511 |  "metadata": {
512 |   "kernelspec": {
513 |    "display_name": "Python 3 (ipykernel)",
514 |    "language": "python",
515 |    "name": "python3"
516 |   },
517 |   "language_info": {
518 |    "codemirror_mode": {
519 |     "name": "ipython",
520 |     "version": 3
521 |    },
522 |    "file_extension": ".py",
523 |    "mimetype": "text/x-python",
524 |    "name": "python",
525 |    "nbconvert_exporter": "python",
526 |    "pygments_lexer": "ipython3",
527 |    "version": "3.9.13"
528 |   }
529 |  },
530 |  "nbformat": 4,
531 |  "nbformat_minor": 5
532 | }
533 | 
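To make the effect of `prefix` and `drop_first` in the notebook above more concrete, here is a minimal pure-Python sketch of one-hot encoding. The function `one_hot` is a hypothetical helper written for illustration; it is not how pandas implements `get_dummies`, and it assumes the categories can be sorted.

```python
def one_hot(values, prefix, drop_first=False):
    """Return (column_names, rows) of 0/1 indicators for a list of categories."""
    categories = sorted(set(values))
    if drop_first:
        # Drop the first level: a row of all zeros then stands for that category
        categories = categories[1:]
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return columns, rows


data = ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']

columns, rows = one_hot(data, prefix='Dummy')
cols_dropped, rows_dropped = one_hot(data, prefix='Dummy', drop_first=True)

print(columns, rows[:3])
print(cols_dropped, rows_dropped[:3])
```

Dropping one level loses no information, because the remaining indicator columns already determine the dropped one. This is often done to avoid perfectly collinear columns in linear models.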


--------------------------------------------------------------------------------
/Data Leakage in Machine Learning.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "4476c65b",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "[Watch Full Video on Data Leakage](https://youtu.be/UELHcSU_Dpg)"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "markdown",
 13 |    "id": "a5ffa0fa",
 14 |    "metadata": {},
 15 |    "source": [
 16 |     "`Data Leakage` (also called information leakage) happens when information that should be unavailable during model training accidentally influences the model. The model “sees” or “learns from” data it should not have access to, such as test data or future data. This produces `unrealistically high scores during training/testing` but poor performance on truly unseen data."
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "id": "c1fddf31",
 22 |    "metadata": {},
 23 |    "source": [
 24 |     "# ⚠️ Common Causes\n",
 25 |     "\n",
 26 |     "1. Fitting preprocessing steps (scaling, encoding, imputing) on the full dataset before the train-test split\n",
 27 |     "\n",
 28 |     "2. Using target values to create features\n",
 29 |     "\n",
 30 |     "3. Mixing future data with past data in time series problems"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "code",
 35 |    "execution_count": 1,
 36 |    "id": "9a889af3",
 37 |    "metadata": {},
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "from sklearn.datasets import make_regression\n",
 41 |     "from sklearn.preprocessing import StandardScaler\n",
 42 |     "from sklearn.linear_model import LinearRegression\n",
 43 |     "from sklearn.model_selection import train_test_split\n",
 44 |     "from sklearn.metrics import r2_score"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": 2,
 50 |    "id": "6075bcb4",
 51 |    "metadata": {},
 52 |    "outputs": [],
 53 |    "source": [
 54 |     "x, y = make_regression(n_samples=500, n_features=3, noise=50, random_state=42)"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "id": "94f28c3f",
 60 |    "metadata": {},
 61 |    "source": [
 62 |     "# ❌ WRONG WAY (Data Leakage)"
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "code",
 67 |    "execution_count": 3,
 68 |    "id": "b51eba29",
 69 |    "metadata": {},
 70 |    "outputs": [],
 71 |    "source": [
 72 |     "scaler = StandardScaler()\n",
 73 |     "X_scaled_leak = scaler.fit_transform(x)        # Scaler sees all data (train + test)\n",
 74 |     "X_train_leak, X_test_leak, y_train, y_test = train_test_split(X_scaled_leak, y, test_size=0.2, random_state=42)"
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "code",
 79 |    "execution_count": 4,
 80 |    "id": "8d9b5523",
 81 |    "metadata": {},
 82 |    "outputs": [],
 83 |    "source": [
 84 |     "model_leak = LinearRegression()\n",
 85 |     "model_leak.fit(X_train_leak, y_train)\n",
 86 |     "pred_leak = model_leak.predict(X_test_leak)"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "code",
 91 |    "execution_count": 5,
 92 |    "id": "c95748e0",
 93 |    "metadata": {},
 94 |    "outputs": [
 95 |     {
 96 |      "name": "stdout",
 97 |      "output_type": "stream",
 98 |      "text": [
 99 |       "With Data Leakage (scaled before split): 0.7945396288418394\n"
100 |      ]
101 |     }
102 |    ],
103 |    "source": [
104 |     "print(\"With Data Leakage (scaled before split):\", r2_score(y_test, pred_leak))"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "markdown",
109 |    "id": "8e20b3dd",
110 |    "metadata": {},
111 |    "source": [
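112 |     "The model still learns, but part of what it learns comes from `leaked information`: here the scaler’s mean and standard deviation were computed using the test rows as well.\n",
113 |     "\n",
114 |     "That’s why leaky pipelines can look better in testing than they will in production. How much the score is inflated depends on the model and the transformation: ordinary least squares with an intercept is unaffected by shifting and rescaling features, so the effect is negligible for this linear regression, but it can be large for scale-sensitive models or for features built from the target.\n",
115 |     "\n",
116 |     "Always isolate your training data, and fit transformations or create features only on the training set, never on the whole dataset."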
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "markdown",
121 |    "id": "74e551e9",
122 |    "metadata": {},
123 |    "source": [
124 |     "# ✅ CORRECT WAY (No Leakage)"
125 |    ]
126 |   },
127 |   {
128 |    "cell_type": "code",
129 |    "execution_count": 6,
130 |    "id": "ddd01c92",
131 |    "metadata": {},
132 |    "outputs": [],
133 |    "source": [
134 |     "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)"
135 |    ]
136 |   },
137 |   {
138 |    "cell_type": "code",
139 |    "execution_count": 7,
140 |    "id": "52df4004",
141 |    "metadata": {},
142 |    "outputs": [
143 |     {
144 |      "data": {
146 |       "text/plain": [
147 |        "<IPython.core.display.Image object>"
148 |       ]
149 |      },
150 |      "execution_count": 7,
151 |      "metadata": {},
152 |      "output_type": "execute_result"
153 |     }
154 |    ],
155 |    "source": [
156 |     "from IPython.display import Image\n",
157 |     "Image(filename='std.png')"
158 |    ]
159 |   },
160 |   {
161 |    "cell_type": "code",
162 |    "execution_count": 8,
163 |    "id": "e6d45b2c",
164 |    "metadata": {},
165 |    "outputs": [],
166 |    "source": [
167 |     "scaler = StandardScaler()\n",
168 |     "X_train_scaled = scaler.fit_transform(X_train)   # fit only on training"
169 |    ]
170 |   },
171 |   {
172 |    "cell_type": "code",
173 |    "execution_count": 9,
174 |    "id": "2a366b98",
175 |    "metadata": {},
176 |    "outputs": [],
177 |    "source": [
178 |     "X_test_scaled = scaler.transform(X_test)         # transform only: reuse the mean/std learned from the training set"
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "code",
183 |    "execution_count": 10,
184 |    "id": "0d2f8328",
185 |    "metadata": {},
186 |    "outputs": [],
187 |    "source": [
188 |     "model_clean = LinearRegression()\n",
189 |     "model_clean.fit(X_train_scaled, y_train)\n",
190 |     "pred_clean = model_clean.predict(X_test_scaled)"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "code",
195 |    "execution_count": 11,
196 |    "id": "6b19fa48",
197 |    "metadata": {},
198 |    "outputs": [
199 |     {
200 |      "name": "stdout",
201 |      "output_type": "stream",
202 |      "text": [
203 |       "Without Data Leakage: 0.7945396288418394\n"
204 |      ]
205 |     }
206 |    ],
207 |    "source": [
208 |     "print(\"Without Data Leakage:\", r2_score(y_test, pred_clean))"
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": null,
214 |    "id": "9186420a",
215 |    "metadata": {},
216 |    "outputs": [],
217 |    "source": []
218 |   }
219 |  ],
220 |  "metadata": {
221 |   "kernelspec": {
222 |    "display_name": "Python 3 (ipykernel)",
223 |    "language": "python",
224 |    "name": "python3"
225 |   },
226 |   "language_info": {
227 |    "codemirror_mode": {
228 |     "name": "ipython",
229 |     "version": 3
230 |    },
231 |    "file_extension": ".py",
232 |    "mimetype": "text/x-python",
233 |    "name": "python",
234 |    "nbconvert_exporter": "python",
235 |    "pygments_lexer": "ipython3",
236 |    "version": "3.9.13"
237 |   }
238 |  },
239 |  "nbformat": 4,
240 |  "nbformat_minor": 5
241 | }
242 | 


--------------------------------------------------------------------------------