├── README.md
├── 1. Label Encoder.ipynb
├── Min Max Scaling with Python.ipynb
├── 4. Ordinal Encoder.ipynb
├── 3. Binary Encoder.ipynb
├── Standard Scaling using Python.ipynb
├── 2. One Hot Encoding.ipynb
└── Data Leakage in Machine Learning.ipynb

/README.md:
--------------------------------------------------------------------------------
# Feature Engineering

**Feature Engineering** is the process of transforming raw data into meaningful features that help machine learning models learn better patterns and make more accurate predictions.

In simple terms, it's about **creating the right inputs** for your model: the more informative the features, the better the model’s performance tends to be.

---

## Key Steps in Feature Engineering

### 1. Feature Creation
Generate new features from existing ones to add useful information.

**Examples:**
- From `Date`, create `Day`, `Month`, or `Is_Weekend`
- From `Price` and `Quantity`, create `Total_Sales = Price × Quantity`
- From `Address`, extract `City` or `Postal_Code`

---

### 2. Feature Transformation
Modify existing features to improve learning and model performance.

**Common techniques:**
- **Scaling:** Normalize or standardize numerical features
  *(e.g., Min-Max scaling or StandardScaler)*
- **Encoding:** Convert categorical data into numeric form
  *(e.g., One-Hot Encoding or Label Encoding)*
- **Log Transformation:** Reduce the skew of heavily skewed distributions
- **Binning:** Group continuous values into discrete intervals

---

### 3. Feature Selection
Keep only the features that contribute most to the predictions.

**Methods include:**
- Correlation analysis
- Mutual information
- Feature importance from tree-based models
- Recursive Feature Elimination (RFE)

---

### 4. Handling Missing Data
Fill or remove missing values to ensure clean input for training.

**Strategies:**
- Replace with mean/median/mode
- Forward/Backward fill
- Drop rows/columns with too many missing values

---

## Example

**Raw dataset:**

| Date       | Age | Salary | City   |
|------------|-----|--------|--------|
| 2024-05-10 | 28  | 50000  | Berlin |
| 2024-05-11 | 35  | 62000  | Munich |

**After feature engineering:**

| Day | Month | Age | Salary | City_Berlin | City_Munich |
|-----|-------|-----|--------|-------------|-------------|
| 10  | 5     | 28  | 50000  | 1           | 0           |
| 11  | 5     | 35  | 62000  | 0           | 1           |

---

## Why It Matters
Done well, feature engineering can:
- Improve the **accuracy** of ML models
- Reduce **overfitting**
- Enhance **interpretability**
- Speed up **training and inference**

---

## Python

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample data
data = {
    "Date": ["2024-05-10", "2024-05-11"],
    "Age": [28, 35],
    "Salary": [50000, 62000],
    "City": ["Berlin", "Munich"]
}
df = pd.DataFrame(data)

# Feature creation: derive day and month from the date
df["Date"] = pd.to_datetime(df["Date"])
df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month

# Feature transformation: one-hot encode the city
# (the sparse_output argument requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
city_encoded = encoder.fit_transform(df[["City"]])
encoded_df = pd.DataFrame(city_encoded, columns=encoder.get_feature_names_out(["City"]))
df = pd.concat([df, encoded_df], axis=1)

# Feature scaling: standardize the numeric columns (zero mean, unit variance)
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])

# Final result
print(df)
```

--------------------------------------------------------------------------------
/1. 
Label Encoder.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "21070538", 6 | "metadata": {}, 7 | "source": [ 8 | "# 1. Label Encoder" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "5122882e", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "classes = ['ClassA', 'ClassB', 'ClassC', 'ClassD']\n", 19 | "\n", 20 | "instances = ['ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB']" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "id": "0790c5cf", 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stdout", 31 | "output_type": "stream", 32 | "text": [ 33 | "Encoded labels: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]\n" 34 | ] 35 | } 36 | ], 37 | "source": [ 38 | "label_to_int = {label: index for index, label in enumerate(classes)} #60 Days of Python ; Day 25\n", 39 | "encoded_labels = [label_to_int[label] for label in instances]\n", 40 | "\n", 41 | "print(\"Encoded labels:\", encoded_labels)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "id": "9bdaf4f3", 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "name": "stdout", 52 | "output_type": "stream", 53 | "text": [ 54 | "Encoded labels: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]\n", 55 | "Decoded labels: ['ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB']\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "int_to_label = {index: label for label, index in label_to_int.items()}\n", 61 | "decoded_labels = [int_to_label[index] for index in encoded_labels]\n", 62 | "\n", 63 | "print(\"Encoded labels:\", encoded_labels)\n", 64 | "print(\"Decoded labels:\", decoded_labels)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "id": "63ee1325", 70 | "metadata": {}, 71 | "source": [ 72 | "# Sklearn - Label Encoder" 73 | ] 74 | 
}, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 4, 78 | "id": "60a20a81", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "from sklearn.preprocessing import LabelEncoder" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 5, 88 | "id": "ad2970f2", 89 | "metadata": {}, 90 | "outputs": [ 91 | { 92 | "name": "stdout", 93 | "output_type": "stream", 94 | "text": [ 95 | "Encoded labels: [0 1 2 3 0 1 2 3 0 1]\n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "label_encoder = LabelEncoder()\n", 101 | "encoded_labels = label_encoder.fit_transform(instances)\n", 102 | "\n", 103 | "print(\"Encoded labels:\", encoded_labels)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 6, 109 | "id": "5fd9dabe", 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "Encoded labels: [0 1 2 3 0 1 2 3 0 1]\n", 117 | "Original labels: ['ClassA' 'ClassB' 'ClassC' 'ClassD' 'ClassA' 'ClassB' 'ClassC' 'ClassD'\n", 118 | " 'ClassA' 'ClassB']\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "original_labels = label_encoder.inverse_transform(encoded_labels)\n", 124 | "\n", 125 | "print(\"Encoded labels:\", encoded_labels)\n", 126 | "print(\"Original labels:\", original_labels)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "id": "e932d2b3", 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [] 136 | } 137 | ], 138 | "metadata": { 139 | "kernelspec": { 140 | "display_name": "Python 3 (ipykernel)", 141 | "language": "python", 142 | "name": "python3" 143 | }, 144 | "language_info": { 145 | "codemirror_mode": { 146 | "name": "ipython", 147 | "version": 3 148 | }, 149 | "file_extension": ".py", 150 | "mimetype": "text/x-python", 151 | "name": "python", 152 | "nbconvert_exporter": "python", 153 | "pygments_lexer": "ipython3", 154 | "version": "3.9.13" 155 | } 156 | }, 157 | "nbformat": 4, 158 | 
"nbformat_minor": 5 159 | } 160 | -------------------------------------------------------------------------------- /Min Max Scaling with Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "fc529592", 6 | "metadata": {}, 7 | "source": [ 8 | "# Raw Code" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "1c33a4de", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "def min_max_scaling(data):\n", 19 | " min_val = min(data)\n", 20 | " max_val = max(data)\n", 21 | " scaled_data = [(x - min_val) / (max_val - min_val) for x in data]\n", 22 | " return scaled_data" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "id": "53cba510", 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "name": "stdout", 33 | "output_type": "stream", 34 | "text": [ 35 | "Original Data: [1, 20, 30, 4, 5]\n", 36 | "Scaled data (raw): [0.0, 0.6551724137931034, 1.0, 0.10344827586206896, 0.13793103448275862]\n" 37 | ] 38 | } 39 | ], 40 | "source": [ 41 | "data = [1, 20, 30, 4, 5]\n", 42 | "scaled_data = min_max_scaling(data)\n", 43 | "print('Original Data: ', data)\n", 44 | "print(\"Scaled data (raw):\", scaled_data)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "id": "c9912be9", 50 | "metadata": {}, 51 | "source": [ 52 | "# using Sklearn" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 3, 58 | "id": "e2326ee0", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "import pandas as pd\n", 63 | "from sklearn.preprocessing import MinMaxScaler" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "id": "6c2eab99", 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "data": { 74 | "text/html": [ 75 | "
" 126 | ], 127 | "text/plain": [ 128 | " Feature1 Feature2\n", 129 | "0 1 6\n", 130 | "1 5 7\n", 131 | "2 10 8\n", 132 | "3 4 19\n", 133 | "4 5 10" 134 | ] 135 | }, 136 | "execution_count": 4, 137 | "metadata": {}, 138 | "output_type": "execute_result" 139 | } 140 | ], 141 | "source": [ 142 | "data = {'Feature1': [1, 5, 10, 4, 5],\n", 143 | " 'Feature2': [6, 7, 8, 19, 10]}\n", 144 | "\n", 145 | "df = pd.DataFrame(data)\n", 146 | "df.head()" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 5, 152 | "id": "03a59524", 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "scaler = MinMaxScaler()\n", 157 | "scaled_data = scaler.fit_transform(df)\n", 158 | "scaled_df = pd.DataFrame(scaled_data, columns=df.columns)" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 6, 164 | "id": "b32cc8d6", 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "name": "stdout", 169 | "output_type": "stream", 170 | "text": [ 171 | "Original DataFrame:\n", 172 | " Feature1 Feature2\n", 173 | "0 1 6\n", 174 | "1 5 7\n", 175 | "2 10 8\n", 176 | "3 4 19\n", 177 | "4 5 10\n", 178 | "\n", 179 | "Scaled DataFrame:\n", 180 | " Feature1 Feature2\n", 181 | "0 0.000000 0.000000\n", 182 | "1 0.444444 0.076923\n", 183 | "2 1.000000 0.153846\n", 184 | "3 0.333333 1.000000\n", 185 | "4 0.444444 0.307692\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "print(\"Original DataFrame:\")\n", 191 | "print(df)\n", 192 | "print(\"\\nScaled DataFrame:\")\n", 193 | "print(scaled_df)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "id": "0e898642", 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [] 203 | } 204 | ], 205 | "metadata": { 206 | "kernelspec": { 207 | "display_name": "Python 3 (ipykernel)", 208 | "language": "python", 209 | "name": "python3" 210 | }, 211 | "language_info": { 212 | "codemirror_mode": { 213 | "name": "ipython", 214 | "version": 3 215 | }, 216 | "file_extension": 
".py", 217 | "mimetype": "text/x-python", 218 | "name": "python", 219 | "nbconvert_exporter": "python", 220 | "pygments_lexer": "ipython3", 221 | "version": "3.9.13" 222 | } 223 | }, 224 | "nbformat": 4, 225 | "nbformat_minor": 5 226 | } 227 | -------------------------------------------------------------------------------- /4. Ordinal Encoder.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "b78c029d", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import pandas as pd\n", 11 | "from sklearn.preprocessing import OrdinalEncoder" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "id": "c146e9da", 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "data = [\n", 22 | " ['good'], ['bad'], ['excellent'], ['average'], \n", 23 | " ['good'], ['average'], ['excellent'], ['bad'], \n", 24 | " ['average'], ['good']\n", 25 | "]" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "id": "3ad7f2c7", 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "data": { 36 | "text/html": [ 37 | "
" 82 | ], 83 | "text/plain": [ 84 | " reviews\n", 85 | "0 good\n", 86 | "1 bad\n", 87 | "2 excellent\n", 88 | "3 average\n", 89 | "4 good" 90 | ] 91 | }, 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "data = pd.DataFrame(data=data, columns=['reviews'])\n", 99 | "data.head()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 4, 105 | "id": "14b89f09", 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "(10, 1)" 112 | ] 113 | }, 114 | "execution_count": 4, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "data.shape" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 5, 126 | "id": "3e569b99", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "categories = [['bad', 'average', 'good', 'excellent']]" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 6, 136 | "id": "eee41b3c", 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/plain": [ 142 | "[['bad', 'average', 'good', 'excellent']]" 143 | ] 144 | }, 145 | "execution_count": 6, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "categories" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 7, 157 | "id": "b3a1d73e", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "encoder = OrdinalEncoder(categories=categories)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 8, 167 | "id": "74286b79", 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "array([[2.],\n", 174 | " [0.],\n", 175 | " [3.],\n", 176 | " [1.],\n", 177 | " [2.],\n", 178 | " [1.],\n", 179 | " [3.],\n", 180 | " [0.],\n", 181 | " [1.],\n", 182 | " [2.]])" 183 | ] 184 | }, 185 | "execution_count": 8, 186 | 
"metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "encoded_data = encoder.fit_transform(data)\n", 192 | "encoded_data" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 9, 198 | "id": "b4fa6545", 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "data": { 203 | "text/plain": [ 204 | "array([['good'],\n", 205 | " ['bad'],\n", 206 | " ['excellent'],\n", 207 | " ['average'],\n", 208 | " ['good'],\n", 209 | " ['average'],\n", 210 | " ['excellent'],\n", 211 | " ['bad'],\n", 212 | " ['average'],\n", 213 | " ['good']], dtype=object)" 214 | ] 215 | }, 216 | "execution_count": 9, 217 | "metadata": {}, 218 | "output_type": "execute_result" 219 | } 220 | ], 221 | "source": [ 222 | "decoded_data = encoder.inverse_transform(encoded_data)\n", 223 | "decoded_data" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "id": "525a8f59", 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [] 233 | } 234 | ], 235 | "metadata": { 236 | "kernelspec": { 237 | "display_name": "Python 3 (ipykernel)", 238 | "language": "python", 239 | "name": "python3" 240 | }, 241 | "language_info": { 242 | "codemirror_mode": { 243 | "name": "ipython", 244 | "version": 3 245 | }, 246 | "file_extension": ".py", 247 | "mimetype": "text/x-python", 248 | "name": "python", 249 | "nbconvert_exporter": "python", 250 | "pygments_lexer": "ipython3", 251 | "version": "3.9.13" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 5 256 | } 257 | -------------------------------------------------------------------------------- /3. 
Binary Encoder.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "adfe09d4", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import pandas as pd\n", 11 | "import category_encoders as ce" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "id": "56986103", 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']}\n", 22 | "df = pd.DataFrame(data)" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 3, 28 | "id": "b5813023", 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/html": [ 34 | "
" 79 | ], 80 | "text/plain": [ 81 | " Category\n", 82 | "0 A\n", 83 | "1 B\n", 84 | "2 C\n", 85 | "3 A\n", 86 | "4 B" 87 | ] 88 | }, 89 | "execution_count": 3, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "df.head()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 4, 101 | "id": "9390c11f", 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "data": { 106 | "text/plain": [ 107 | "(9, 1)" 108 | ] 109 | }, 110 | "execution_count": 4, 111 | "metadata": {}, 112 | "output_type": "execute_result" 113 | } 114 | ], 115 | "source": [ 116 | "df.shape" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 5, 122 | "id": "ec81a1c2", 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "encoder = ce.BinaryEncoder(cols=['Category'], return_df=True)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 6, 132 | "id": "d5fe35de", 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/html": [ 138 | "
" 209 | ], 210 | "text/plain": [ 211 | " Category_0 Category_1\n", 212 | "0 0 1\n", 213 | "1 1 0\n", 214 | "2 1 1\n", 215 | "3 0 1\n", 216 | "4 1 0\n", 217 | "5 1 1\n", 218 | "6 0 1\n", 219 | "7 1 0\n", 220 | "8 1 1" 221 | ] 222 | }, 223 | "execution_count": 6, 224 | "metadata": {}, 225 | "output_type": "execute_result" 226 | } 227 | ], 228 | "source": [ 229 | "df_binary_encoded = encoder.fit_transform(df)\n", 230 | "df_binary_encoded" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "id": "201a7fa5", 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [] 240 | } 241 | ], 242 | "metadata": { 243 | "kernelspec": { 244 | "display_name": "Python 3 (ipykernel)", 245 | "language": "python", 246 | "name": "python3" 247 | }, 248 | "language_info": { 249 | "codemirror_mode": { 250 | "name": "ipython", 251 | "version": 3 252 | }, 253 | "file_extension": ".py", 254 | "mimetype": "text/x-python", 255 | "name": "python", 256 | "nbconvert_exporter": "python", 257 | "pygments_lexer": "ipython3", 258 | "version": "3.9.13" 259 | } 260 | }, 261 | "nbformat": 4, 262 | "nbformat_minor": 5 263 | } 264 | -------------------------------------------------------------------------------- /Standard Scaling using Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "20c48bc4", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import pandas as pd" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "id": "cf6184fa", 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/html": [ 22 | "
" 73 | ], 74 | "text/plain": [ 75 | " Feature 1 Feature 2\n", 76 | "0 1 6\n", 77 | "1 20 7\n", 78 | "2 3 18\n", 79 | "3 40 19\n", 80 | "4 5 10" 81 | ] 82 | }, 83 | "execution_count": 2, 84 | "metadata": {}, 85 | "output_type": "execute_result" 86 | } 87 | ], 88 | "source": [ 89 | "data = {'Feature 1': [1, 20, 3, 40, 5],\n", 90 | " 'Feature 2': [6, 7, 18, 19, 10]}\n", 91 | "\n", 92 | "df = pd.DataFrame(data)\n", 93 | "df.head()" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 3, 99 | "id": "8fa7404f", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "def standardize_data(data):\n", 104 | " mean = data.mean()\n", 105 | " std_dev = data.std() # pandas .std() defaults to ddof=1 (sample std); sklearn's StandardScaler uses ddof=0\n", 106 | " standardized_data_raw = (data - mean) / std_dev\n", 107 | " return standardized_data_raw" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 4, 113 | "id": "f2c02d59", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "standardized_df_raw = df.apply(standardize_data)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 5, 123 | "id": "f22c9db0", 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "Original DataFrame:\n", 131 | " Feature 1 Feature 2\n", 132 | "0 1 6\n", 133 | "1 20 7\n", 134 | "2 3 18\n", 135 | "3 40 19\n", 136 | "4 5 10\n", 137 | "\n", 138 | "Standardized DataFrame:\n", 139 | " Feature 1 Feature 2\n", 140 | "0 -0.777975 -0.979796\n", 141 | "1 0.376832 -0.816497\n", 142 | "2 -0.656417 0.979796\n", 143 | "3 1.592418 1.143095\n", 144 | "4 -0.534858 -0.326599\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "print(\"Original DataFrame:\")\n", 150 | "print(df)\n", 151 | "print(\"\\nStandardized DataFrame:\")\n", 152 | "print(standardized_df_raw)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 6, 158 | "id": "3fb91ce8", 159 | "metadata": {}, 160 | "outputs": [ 
161 | { 162 | "data": { 163 | "text/html": [ 164 | "
" 215 | ], 216 | "text/plain": [ 217 | " Feature 1 Feature 2\n", 218 | "0 1 6\n", 219 | "1 20 7\n", 220 | "2 3 18\n", 221 | "3 40 19\n", 222 | "4 5 10" 223 | ] 224 | }, 225 | "execution_count": 6, 226 | "metadata": {}, 227 | "output_type": "execute_result" 228 | } 229 | ], 230 | "source": [ 231 | "df.head()" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 7, 237 | "id": "dcfedb7b", 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "data": { 242 | "text/html": [ 243 | "
" 294 | ], 295 | "text/plain": [ 296 | " Feature 1 Feature 2\n", 297 | "0 -0.777975 -0.979796\n", 298 | "1 0.376832 -0.816497\n", 299 | "2 -0.656417 0.979796\n", 300 | "3 1.592418 1.143095\n", 301 | "4 -0.534858 -0.326599" 302 | ] 303 | }, 304 | "execution_count": 7, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "standardized_df_raw.head()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 8, 316 | "id": "daaa1204", 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "data": { 321 | "text/plain": [ 322 | "1.0" 323 | ] 324 | }, 325 | "execution_count": 8, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "standardized_df_raw['Feature 1'].mean() + 1" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 9, 337 | "id": "56277f06", 338 | "metadata": {}, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "0.9999999999999999" 344 | ] 345 | }, 346 | "execution_count": 9, 347 | "metadata": {}, 348 | "output_type": "execute_result" 349 | } 350 | ], 351 | "source": [ 352 | "standardized_df_raw['Feature 1'].std()" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "id": "690b8116", 358 | "metadata": {}, 359 | "source": [ 360 | "# Sklearn" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": 10, 366 | "id": "11c15507", 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "from sklearn.preprocessing import StandardScaler" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 11, 376 | "id": "6f37ac57", 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "data": { 381 | "text/html": [ 382 | "
" 433 | ], 434 | "text/plain": [ 435 | " Feature 1 Feature 2\n", 436 | "0 1 6\n", 437 | "1 20 7\n", 438 | "2 3 18\n", 439 | "3 40 19\n", 440 | "4 5 10" 441 | ] 442 | }, 443 | "execution_count": 11, 444 | "metadata": {}, 445 | "output_type": "execute_result" 446 | } 447 | ], 448 | "source": [ 449 | "df" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": 12, 455 | "id": "ea267703", 456 | "metadata": {}, 457 | "outputs": [], 458 | "source": [ 459 | "scaler = StandardScaler()\n", 460 | "standardized_data = scaler.fit_transform(df)\n", 461 | "standardized_df = pd.DataFrame(standardized_data, columns=df.columns)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 13, 467 | "id": "c68cbfde", 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "name": "stdout", 472 | "output_type": "stream", 473 | "text": [ 474 | "Original DataFrame:\n", 475 | " Feature 1 Feature 2\n", 476 | "0 1 6\n", 477 | "1 20 7\n", 478 | "2 3 18\n", 479 | "3 40 19\n", 480 | "4 5 10\n", 481 | "\n", 482 | "Standardized DataFrame: raw code\n", 483 | " Feature 1 Feature 2\n", 484 | "0 -0.777975 -0.979796\n", 485 | "1 0.376832 -0.816497\n", 486 | "2 -0.656417 0.979796\n", 487 | "3 1.592418 1.143095\n", 488 | "4 -0.534858 -0.326599\n", 489 | "\n", 490 | "Standardized DataFrame: sklearn\n", 491 | " Feature 1 Feature 2\n", 492 | "0 -0.869803 -1.095445\n", 493 | "1 0.421311 -0.912871\n", 494 | "2 -0.733896 1.095445\n", 495 | "3 1.780378 1.278019\n", 496 | "4 -0.597989 -0.365148\n" 497 | ] 498 | } 499 | ], 500 | "source": [ 501 | "print(\"Original DataFrame:\")\n", 502 | "print(df)\n", 503 | "\n", 504 | "print(\"\\nStandardized DataFrame: raw code\")\n", 505 | "print(standardized_df_raw)\n", 506 | "\n", 507 | "print(\"\\nStandardized DataFrame: sklearn\")\n", 508 | "print(standardized_df)" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": null, 514 | "id": "f764d2ec", 515 | "metadata": {}, 516 | "outputs": [], 517 | "source": [] 
518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": null, 522 | "id": "b1af06be", 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [] 526 | } 527 | ], 528 | "metadata": { 529 | "kernelspec": { 530 | "display_name": "Python 3 (ipykernel)", 531 | "language": "python", 532 | "name": "python3" 533 | }, 534 | "language_info": { 535 | "codemirror_mode": { 536 | "name": "ipython", 537 | "version": 3 538 | }, 539 | "file_extension": ".py", 540 | "mimetype": "text/x-python", 541 | "name": "python", 542 | "nbconvert_exporter": "python", 543 | "pygments_lexer": "ipython3", 544 | "version": "3.9.13" 545 | } 546 | }, 547 | "nbformat": 4, 548 | "nbformat_minor": 5 549 | } 550 | -------------------------------------------------------------------------------- /2. One Hot Encoding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "79afcd9b", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import pandas as pd" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "id": "a31b20f7", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']}" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "id": "6c33a2f7", 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
" 77 | ], 78 | "text/plain": [ 79 | " Category\n", 80 | "0 A\n", 81 | "1 B\n", 82 | "2 C\n", 83 | "3 A\n", 84 | "4 B" 85 | ] 86 | }, 87 | "execution_count": 3, 88 | "metadata": {}, 89 | "output_type": "execute_result" 90 | } 91 | ], 92 | "source": [ 93 | "df = pd.DataFrame(data)\n", 94 | "df.head()" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 4, 100 | "id": "d5674ea5", 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "data": { 105 | "text/html": [ 106 | "
" 187 | ], 188 | "text/plain": [ 189 | " Category_A Category_B Category_C\n", 190 | "0 1 0 0\n", 191 | "1 0 1 0\n", 192 | "2 0 0 1\n", 193 | "3 1 0 0\n", 194 | "4 0 1 0\n", 195 | "5 0 0 1\n", 196 | "6 1 0 0\n", 197 | "7 0 1 0\n", 198 | "8 0 0 1" 199 | ] 200 | }, 201 | "execution_count": 4, 202 | "metadata": {}, 203 | "output_type": "execute_result" 204 | } 205 | ], 206 | "source": [ 207 | "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'])\n", 208 | "one_hot_encoded_df" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 5, 214 | "id": "e16d27ac", 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "data": { 219 | "text/html": [ 220 | "
" 301 | ], 302 | "text/plain": [ 303 | " Dummy_A Dummy_B Dummy_C\n", 304 | "0 1 0 0\n", 305 | "1 0 1 0\n", 306 | "2 0 0 1\n", 307 | "3 1 0 0\n", 308 | "4 0 1 0\n", 309 | "5 0 0 1\n", 310 | "6 1 0 0\n", 311 | "7 0 1 0\n", 312 | "8 0 0 1" 313 | ] 314 | }, 315 | "execution_count": 5, 316 | "metadata": {}, 317 | "output_type": "execute_result" 318 | } 319 | ], 320 | "source": [ 321 | "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'], prefix='Dummy')\n", 322 | "one_hot_encoded_df" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 6, 328 | "id": "2e8cfc1d", 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "data": { 333 | "text/html": [ 334 | "
" 405 | ], 406 | "text/plain": [ 407 | " Dummy_B Dummy_C\n", 408 | "0 0 0\n", 409 | "1 1 0\n", 410 | "2 0 1\n", 411 | "3 0 0\n", 412 | "4 1 0\n", 413 | "5 0 1\n", 414 | "6 0 0\n", 415 | "7 1 0\n", 416 | "8 0 1" 417 | ] 418 | }, 419 | "execution_count": 6, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'], prefix='Dummy',drop_first=True )\n", 426 | "one_hot_encoded_df" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 7, 432 | "id": "287f4418", 433 | "metadata": {}, 434 | "outputs": [ 435 | { 436 | "data": { 437 | "text/html": [ 438 | "
" 483 | ], 484 | "text/plain": [ 485 | " Category\n", 486 | "0 A\n", 487 | "1 B\n", 488 | "2 C\n", 489 | "3 A\n", 490 | "4 B" 491 | ] 492 | }, 493 | "execution_count": 7, 494 | "metadata": {}, 495 | "output_type": "execute_result" 496 | } 497 | ], 498 | "source": [ 499 | "df.head()" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "id": "cb821849", 506 | "metadata": {}, 507 | "outputs": [], 508 | "source": [] 509 | } 510 | ], 511 | "metadata": { 512 | "kernelspec": { 513 | "display_name": "Python 3 (ipykernel)", 514 | "language": "python", 515 | "name": "python3" 516 | }, 517 | "language_info": { 518 | "codemirror_mode": { 519 | "name": "ipython", 520 | "version": 3 521 | }, 522 | "file_extension": ".py", 523 | "mimetype": "text/x-python", 524 | "name": "python", 525 | "nbconvert_exporter": "python", 526 | "pygments_lexer": "ipython3", 527 | "version": "3.9.13" 528 | } 529 | }, 530 | "nbformat": 4, 531 | "nbformat_minor": 5 532 | } 533 | -------------------------------------------------------------------------------- /Data Leakage in Machine Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4476c65b", 6 | "metadata": {}, 7 | "source": [ 8 | "[Watch Full Video on Data Leakage](https://youtu.be/UELHcSU_Dpg)" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "a5ffa0fa", 14 | "metadata": {}, 15 | "source": [ 16 | "`Data Leakage` (also called information leakage) happens when information that should be unavailable during model training accidentally influences the model. Our model “sees” or “learns from” data it shouldn’t have access to (like test data or future data). This causes `unrealistically high accuracy during training/testing`, but poor performance on unseen data." 
 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "c1fddf31", 22 | "metadata": {}, 23 | "source": [ 24 | "# ⚠️ Common Causes\n", 25 | "\n", 26 | "1. Fitting preprocessing steps (scaling, encoding, imputing) on the full dataset before the train-test split\n", 27 | "\n", 28 | "2. Using target values to create features\n", 29 | "\n", 30 | "3. Letting future observations inform predictions about the past in time series problems" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "id": "9a889af3", 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "from sklearn.datasets import make_regression\n", 41 | "from sklearn.preprocessing import StandardScaler\n", 42 | "from sklearn.linear_model import LinearRegression\n", 43 | "from sklearn.model_selection import train_test_split\n", 44 | "from sklearn.metrics import r2_score" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "id": "6075bcb4", 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "x, y = make_regression(n_samples=500, n_features=3, noise=50, random_state=42)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "94f28c3f", 60 | "metadata": {}, 61 | "source": [ 62 | "# ❌ WRONG WAY (Data Leakage)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 3, 68 | "id": "b51eba29", 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "scaler = StandardScaler()\n", 73 | "X_scaled_leak = scaler.fit_transform(x) # Scaler sees all data (train + test)\n", 74 | "X_train_leak, X_test_leak, y_train, y_test = train_test_split(X_scaled_leak, y, test_size=0.2, random_state=42)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 4, 80 | "id": "8d9b5523", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "model_leak = LinearRegression()\n", 85 | "model_leak.fit(X_train_leak, y_train)\n", 86 | "pred_leak = model_leak.predict(X_test_leak)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 5, 92 | "id": 
"c95748e0", 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "With Data Leakage (scaled before split): 0.7945396288418394\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "print(\"With Data Leakage (scaled before split):\", r2_score(y_test, pred_leak))" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "id": "8e20b3dd", 110 | "metadata": {}, 111 | "source": [ 112 | "The model really learns — but it’s learning from `cheated information`.\n", 113 | "\n", 114 | "That’s why leaky models look smart in testing, but `collapse` in production.\n", 115 | "\n", 116 | "Always isolate your training data, and fit transformations or create features only on training — never on the whole dataset." 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "id": "74e551e9", 122 | "metadata": {}, 123 | "source": [ 124 | "# ✅ CORRECT WAY (No Leakage)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 6, 130 | "id": "ddd01c92", 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 7, 140 | "id": "52df4004", 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAZgAAAEjBAMAAAD9GArQAAAAMFBMVEX///8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv3aB7AAAAD3RSTlMAMmZ2VBAi3c2Ju++rRJljPUTMAAAACXBIWXMAAA7EAAAOxAGVKw4bAAAbGUlEQVR4Ae1df2xkR33/7k97fd61A1KrEoSXQhORVnghldI0orsloKoC1atKKLlcOG+uatMIgR0FCk2i2BRIwkGx2zR/XALZBYkkSkLODQVKCHijKs0PwtmCIhQ48FOipE05nU1ycHfx3b1+vjNv5s17+/b5rb3r9bo7kvfNfL8z3/l+Zt7MmzfvvY+JiF73h2VK4xgS0ucFK68ZDZZ3Tfrbn33LwcobQqvP2K8F6qt2LVDeNWH834hi1z8TXv/eYDCpKsAM3dq07P2VpqrOKKZKsHsFO3tN8wrSwWBoqka08mpAOWEraxcCVJ0U3c7G4+zsu5tXk2oCZrJGNPh3AeWkrdsqAaoOilKnhfEb8fvO5tWEgQksFWIrMH97hNlXhJ278PuV5hZbBhNiq3ktW9ZkzwkTPyVKSViBFlsFE2YrsIL2CNPrws5VRMNtBBNmqz1+B1tZtFiepthS+8CE2gp2oz3S4pmyMLRqI+QB69CxlyC466b83i/eIlRHj9whTjNH8783LT/3P7jyZ599uTBZo/j9Xycqnnz4+DEbua+750iFyLF10ecKsBQ/cuxOosx9j6QOPfki0nQZSzsThuxT7xCWE7JnnoNrJaLLF99s0RUfg2b4eko9/hoiSmP/V2F8mujJUbp6rEbZy3CduXsU+XDMzlPiBLJKWz9ahNupLxD9pETJy155E6U+XobSXsBPh8KSbX8PvjgOpCy49Rski39ClGR8n4MDQwCjNdW/oJVZGvwGdIdraHmAeAHxiVmi/TguVZQtGi8Q7bOwwoCdxKuITOSJ6BJIOxXSf2Tbr8Fh2ZpxzG4x/NH49TjlgCHDLZ1ERGuKt1KKqFqCfKIGHcA8Aax2BYUsojH8Ob08AbcPE8JqheInR0msGDjdwfCXS/a0ciBjo6JT+OMziU4STa7hKFApTXEWkhSnhHMMpobO416sIvPIrLJFAJM9AzGN1QEGxxE22ulQRV/I1kx9CXVxveN1/GDeHq/hyGC0pmhBIhZAGsy8k2/vsnTY7ZmcGIkjaxRnVCML+Ol0iGMmchyg9NsO8bVnHA4ymCULR7nQdDTFAiTDfCpqMMjBcxnCgUNVOOyCGeCzlPacpTivmzoMJiV9mCspB+KfeC9j0GDm5pEQYJSmmIdkkH0zwAwKp+nCT5fZYRfMyFnON/DKtoBJlrgyPrsdBw4XRIcEgFGaQDDFOpsZwvgJArNnm8DAd4QpS4CxaJhb3OwZfZppjQDjO81i62WKjdISulGBsUQT7RE9NnJie3qmBucBJi/A1GlsDal1+rDumWINAj7NtEaASUBinGYDOJsSozGe4kYW0svcy3UBhi+lkP16e8D8miujiVFc1dgB9h2Lz9drMFMLUMfgutYIMHKBOobcPDVT1SLKVQTCqYVYQdrik5evULgbrW0PmHNcGT0Jp3CC1alax7Rr0+s0mByPYPZSawQYWl2GvFiTYLJ8Yg6WMzz9jq/F8tIWg6G5MmRLo9sDZh0VUgwep07ximVlAdPuYfq5BkOfgP5aXEa1RoIZmEUXzuCHe2bPr5BpjASmZ9ayy9KWAPOhOpZsyOJOzZ1bziR/fawCRy04U13OVShzjlIvjZdepFU4EeOLx5X3UvJx+8euRkwJlDpYofePnS0TdwcP/NiMONv2HziRK0tbojPTN5fpfXmchNxrY3xS22ivzoTkfPLjT//xI2w8ceMP8Hv08y+XM4+OPmjb0/FFe32Z6KIjz6dtu6I0c7b9Uc4ef+h4fsr+zfCi/b20/dCzTy1i3ZB96siddPeXHVvft9ctzvfMS5i0F+2
T9Pu2/V0ssQtcvBMhjd3My98VaLkMKVaU/gAxa1oPwtbmirZeWb9EvwX6LdBvgX4L9Fug3wL9Fui3QL8F+i3Qb4F+C/RboOst8Nxdf33Hl0wv/gx7O5Yp6J14tpR4OPWo6e/teKAzbwp6J76f4q+kvmb6i43EiWVT0DvxP6Wc3N1VLmexmbiqEj135H1NGuKn7ra9ji3ENRI7oD2HQzgsB8gHRMCW2OAsOW8W9SIc3wABtiTOtB4Nq2UqGK7jOX9uLWMIeij6Ah5LpU0weM6xb7o3wcROx4vipQzd/jMFumv2zTrZU5E7/iVxaN7wOHXm6UMfeDpvSHZSNPbgPxJ51iuh3u3swT9h24W9tVAApjK3ZqZ2WDz12Fvfdv+3ypG9GqhFzrr9GbMlPAz7WPR6J82ZLXqx7ct5fkXWFZNrFvHbrHaVt5m+2/LYrcqD83T4LSXqteNPQ0bMiNFZZnSnTgPJekjr9xqYPw/B0muqeL7XPA7x963QDVdEhiizWYil7quSJfhwrePHNT0+m51/cJnS4t2S7jfsVj1IfTJ56q2rta2a2Rnlhyo0bp/aGb5s2Ysf4n3z9fktm+kb6GILpH+J5Uy9iw60s+qJr1zyynsq7bTYPVv4Cq9LX5l1APOARXRzB+x2xeR4heixrtTcgUrvh008zNgd4dt4R5bfbN4VAW/+D9Z3BRKA+BTRz0Jur3sL5k8oq/c9esvzAG+zL78cIO2L+i3Qb4F+C/RboN8C/RbotRaIH/uHXnPZ769LuxRf5U9JezoI2qX0KGPI9DwYQbuUzDMYZsbYBSGRZxC7BMzQbgIzuZvALO0iMG+0exfMEgiaJm27RkX7NNMu5fiBPnZb4q8qaiXCPC2Ilu6k+HE8SkPYe+QZRtxA63RZgbXdC+nDn8EX9Nj0yp4cFbRLJD/3j7+qqJXgmyJaepwE1UHsEUoz69JzYDcombROnfs4PmIDjc3igTK/nfxJZz52wJy28L51HnIOkmjp0ookCxqbx3vldZe8SdM6dY62QHix8c8AgAiWlrokXVA9sz4q+T6EBUm09PeIM4vETTjmfuWSNylaJ5G1qz8ZDJBhPLiM5b1gPNRKkmiJybUOj1LyHI5gyWigdYK4yyEGINeBxgtcEZKoxDnNzsAvppARQRItrSEOag/BkRc745I3jdehYEqQ7oeZMr0DNCTD8ESwrqgxg7QL5rRKAcyeM4cOHbrnC5D4aJ0g6XaoFqg+Mk14tBwRjKRfQnZF3qQ4qrqNBPVPzabnwTX3YnQwkmMK46fgZULaAWAGfpOpZE9T3QCTQ6vrE4tddFN8mmHK4KDJm3ZQzyRexXD5VjYP99SYgcB1n/12UzwB8GyGoMmbdhCY9MnfwRc8b8Jk5oARk4HrPvvtpgAmy7M25Vzyph0Ehub+EwuaY+yg6JmJeZ7ZXPdZ4aYAhuaWIXqXS960k8AUp0G2JAa1ADM5yx8ruO43gtmHV53TdZe8SYPp+nIGy6wCTpo19lmwYGXO0ktOVFArscIlWprJA8jHR+kXyNJA69T1hSY8LWMxMw+XmXYJh0seWjaplSDRREspsDdZ6LaDz6IBFHmTS+vUORYm1Lal4KVWclLAHRRYLHIEKfuyfgv0W6DfAv0W6LdAvwX6LdBvgX4LdKIFEnOvweyDNrZ/d0HILVZwD9TCt6s7GXOuiJslD23ATvZ2A98OTC7gfm50g1w7RI1vhkLDBcPYYMBWXC+E7Fwl3E0ri7353gCT+ezSRmBopkJ/FQ54x2ir4WDSeSpa1CvEGgaYGX6864ZpbvB4hTADWBz1het8afqIX7D9aQPMYfudvyvCxfdgA8oWHztgtGAGsBr9Sn/VL8vW/ZJtTxtg9tmuOx9ctc+xLwCTPRV0mRmssNoTLvWkupEwwGT5FQIdiuJ/1BxAeuZAwGXmP3RGHRko6WiXIgYYqtoV14v0TAmJC/BXPO5KVSwrRpRKyWN2wZve/pQJZtCeNRy4ooaEhb9JccI
hYoSBZSOhok+rSLeOJpi0zU/cVEjzyxIW/uS/B1FiebzQm5SpsXKQdBtlJhh8DG02+AuYmetwJRvwEdcNQS4OFIKk2yjzgBm2v2FU/UHKzZxkdDVDKKMx7rWGkJhuEG2vwAMmtci3YhFCJtDtWEAXRjDWviweMDRmRztTBqxADzr4nmv2EdSY+WZgtVq4OqqjiCTkdd8UBcYn80KcfeCWcuKBT+ssB3Ws7ZFBbqjJ0K6P32efuteseE48jDYlgfEV2QRvLw/MPs9v5DjhlyrS/qN4s2RirSXD5pImpGCxzMq4hUejVtwuqZxSrFJtPa4wjhZZNJK2eh0l1BXJHXoln5dlvntzwoTuIyVp27E6C1MzBWkvKvPMql2RBUJ/vy+0/4wLqoNksM6SlRL/diTMzePBq60aSxPPnKd4tNxbF3tNOzDgWdJosS9yn0ijnQacCWwf4niHpSTknfjhy3lSrH2DrQeC8S5pggsS3aYU3vllKq/k7T5mGUcu4lXQrXzcvA9wxd6YBuOdXzoHJsOnszoNvL6EpcbqYVpHJ08zJLzzS+dOswFeAntPgwhupv7dk0kObCXSm2xyAoBYzS8yR+cmgEmeZHEazIuKos5m4r/TStf4Vw5sJ+1usila5xSPy98jepvcWO/c1DzB1/7VWZqVrgTxaAXs8YOAtlkwNtnU1TFpl4mmKbn8lCikxM0sbF6+ijeZU7aVlD0TbKdhjz8eOvz1qlQsZ+KP0tA61gF5ytCNwn7nljMzZ8EnMVcYHg3GIaQNe/zhw1+DGcmj+J51esO3iN5A9OG0bIOOLTRT9sly+mtj1tEQLOTf4/cNf39RDWbAgmrwX9O3ji8nn0A0I2/Wvo5oR0LS3n/w4XLyoBVm3b/H7xv+amArExqMuDlLPfVQJXnws6zcU+ffzt2cRbpc+vf43dU8fHu7HtjsKQcNRmx2SJn4XRHTRudumyNdLi3vHr/zb3kdL68XAzv+HRkeMcEwJYoZHhT84oMFU9bOeJTLZdq3x79iGR7kFvTAVlLdM3ShEsnjOXE1Gyt7pe1LjS9sbMu3x5/2XP35YaYzsJUlF4xvE/BEooI8ndsEXKrD/AYh593jHzA3mq7imwc5sLUVF0xyWgs5ctGX8BOL0HyeUtETi/mN8wKMucdvDv+9+N+muNta9hhxwdD2bpzrVbrHHW/Cu8evh3/qb37437bYqsXADp4AaKDiNYXUpQ2SbRV49/hXjLs1RPm7LjmwtU9Gz6Q/o6VOJFv3S7Y3baE6d4//QWcSdg6zUMqBjYgMqxhGKnxERdRxv4p06Wih3qA9fu2OGNgq1bDJphQ74th0j39HeNeaE033+Fsz08/db4F+C/RboN8C/99a4Bpj1bER9vR5wTl88lZMBhvcpLRq1yKXzNjBO+s+eSsmI9cdKWOqGh0M7Q0G45MHmnSZrrRbASKt22Rkyg/mmuaG0k3A+OQNJmFRMF0py5LxyiNSqq0dJ2u+8u/2pY1kqgkYn7zBJEwIpitlKpnnmEekVFs7NtT8zub2fE7rjD55g0md0YlIxiu/tA3phpq/0tyoz2md0SdvMKkzOpGhvF/SprS/5jA+dp/T2gOf3G9S51MR550NlWzf0V+zYDRpYt7ntM7lk/tN6nwqspRXsTYffTXHlkIeuvic1p745D6TOpuKOIxXKrm543X3HKlg41FwS73IJrLPvlxwa7744kNX0yrvuuTxrcWhY/i8nu66Kb/3i7eUOfPRI3cIpx1N7r7vZu/hB3pKznkMk4KyCuRWeFi6x7aZ6UoZdRivpIj+4PjTMO/Wk4j2qCM7Twk8ulTcUjD+5ChdPVZDhMOVo5QCpZGzIfacn2pq+HpKPf4a8jmaZPHs5UwfpOXCiDYZk5RVGX70l5opSaYrp6jkVZKiKy3KftSktBrB87UIgbd3+G1+yS2Vx0TPO6yHa/jhwLMYdrMlmJQFN3kzTFNNfQ5eDQGM1mROiGccSo68CNqkQ1lFi9yrz+P
5H97O0EWLec4rRN9DZAxJXU/iBtZtGMYtlMOf5Jaq4QlKifCIGRER8CCQBhSY+DlsB+OPFNVUBn1KSYDRmsSpCpVJy4UN1+RNSIOyiqp5RGqStEMXLbJQ8HgMrSGSQ6upelgRJVRRcGSWwSA3mIFS4u0SvfZg14cVmAaqqUmulZctWpM4BQlpOSdck4qyiqamJTkEk3booi6YlTpKgQ2KxpGP2LGIYe+ywECSW2qB4jwCSIOZw4BPO6RhjVRT4zXkZTAp3sPnWhOMnrScE65JRVlFg+idXEV2gy7qglnNczG06ngdx2jDhUsgHDhUXUCNpxEdWXAocjSYD9m3vBcKtSPuo5pasqCTC0pHk0BbYxC6cqQk6w5Masoq7iLwwUhuGMVf5YKZKUFHi8ukeVM4HSVc+OkyMGgwgwzK7ZkUZuUnNBg/1RS/wiXBKI1E7crZmDapKau41d8DjeCGUUVdMPzSBt5DybcMZghNGQKG6EeHMQycnvFTTblOK004GJ4vRMCJVEdEgFFF/WAWWwezhMY1wehzQtZawnhAFvbRaqSa0qeTJqGSYLRcGNEmFWUV5s/ZNHcqg9FFASYnRYuoFWOm1GrPxHjyGllIL6sxk/BMAKlZqAenBZh6I9VUsQY1j5mxNUTW6cNOF2o5pOhWZVJRVmG2B3UXNAxGFwUYjCMWLRW4GK6s44w4+gQgKppaiBUUmLQoO1Zjc+gVfm8isYBLKuG8YB+hf72uZGoB6hh81RrZM1oOtSiCX5hUlFU8Qnn8C8910aJgvGIwEzXosjDbIhjBaDS+FssrMLS6DEtcAwex9E/UmM8QYKp1uGHT63QlglaT20NrJBgtF0Zck3Ns+11sd53HvwCji04IxisGM8hNOIzpu0UwWe6IZ9ay+jSjgVm05Qx+OKRYPVii1CledqygI4YP08/dSj4B9bWYILRGgiElZxs4qWYdkw5lFURL06xgz3XRScF4xaL0Y9CtFAwwEZczVYv2HziRK2tuqdTBCr1/7GyZK0PPYF4+jkh1GRe5RqqpK++l5OP2j12NMyyUXBghbdKhrIJ0xWIVnxfaqMF4de0TlOT3OVbrOIt5VRpxoZl96siddPeXNbcUTqSHjuenxMNVgLH2Hzm2DLOJG3+A36Off7mceXTUpZqii448n7btitIM2vbJEjJqOccNkw5lFXq7ArFkunKMSsYrh/zqomcfRga3nszXkb2VkOLMskM2Lsb5RAFvVogCpN48LaWa1NOSjX7mfgv0W6DfAv0W6LdAvwX6LdBvgX4L9FugQy0w4n3hdJtTbQa1q8BMVNrcOt0092Q3K2933diQ2jUhiPejZ8HlFnrW9UbH91iNsp6VjC33rOuNjqtPfBs1PSjhLeLdEmL6wVg4ooDvocMLbKz9xX1f3jhTSzkyaxGzN3wPHbFc02zxearON9VuSjFQi1is4XvoiOWaZrsa74mcbardlGIq7xQbqoSX938PHZ47gvaf8KgHT0faGYplac1lMGhi3f89dJNs0cUPLlO6hRc5ohg+KDMZDAZNSvm/h26SLbp4KU+x9oJxX32sVsL98H0PHZ45ijbW9tMszo9VRTDAzHjvNqdZn8434zy9TpaP/pu2VN4hPM9tYxiaVcYMMEE0ofFKE87TRppQZbHp8ajSyIekKrXl40hBmTDAmJxaBk1oMOepeNCpjEQ7JpzLS+rRaPmj5nLvmQ0wQTShObyxEMh5GvC184aV4+1HDsM1cWjbj3vPbIAJogk9gCqDOE+zYkS16M/PZP67Wiy2UXb3ntkEE0ATegEsBXGe+hgCNqpP6ofFeYaPX9v6lNu4ZzbBeDm1NE2o+z206/OFbjR6TFJYHsWbhdHLbJzTuGc2wTTShJIFY0HfQ9+wcSUBObgUz9DizaIA/aZExj2zB4yfJrQp52kwTeiGvlyCHPtuv/32b2+Ys4UMxj2zB4yfJrTp99CZzYx/vPo1itewcWkOecG/BRRO1l+6RTxgtkoT6po
NjrmXt2C9K32jWI7UXEHz2GOuymQwwDtEdrSVxmReWPDThLpmjZjBVjpYM+Rh0avEympdzH5h+aBz75kbGAy2RhMaWK/BVmpMPIFZlTC9+Di9Dy+zRQkh98zmkibElLwdils+mtDAEgZbaUKvbwNzauEVGFQxfhcwQgi5Z94aTWhg3StrEK/WWJeMdhLTEp9fi5gtRAim0fuqo50qOZGAwyZpQgMsKVF1FjHJJpiNtick3t+kuZKy0EgKipcS1Yyo7plVZvO4SZrQwbppxBOfQzM7bKXuYPXk8Cfka+/yjWm/TqXHzzgx557ZuBfjM0EG75JGSf3H25RAEV7tK0jJ5RerwC/eisAv2DtspQ5lndI0O06Js1G+mN8kT/aw+EIFzeT0UCAY2gpNKE4nbVS1nMlWGhGM+HpFvLzaBArR317lYHXvmYPyjtWDpD7ZfSotB7ZKBR1NttKIp1m1DkMZvILdPNwbt3H2Irj3zAGZN+AJdEp8X5X00oQqqXk02UojTgCr7OcevVMYOJul7WlRS+iiwscT6B3YesdQPUNQNKGm9764yVYacWrGy/nG12dEQaSgNCOvWe49s69eToo38l25GthC4u4YqvnQoQn1UyO65eEUj+ZVyVYa8aLJGx6p75RNK43xVTnyn9Saxj3+MJpQY8dwZRQ2XJpQxXmqDRsRk6004nJmaAHfaX7TsBEUXZFjyr1npoY9/vDhr5fYI3nYd2lCFedpUKUmW2nEhWb6ZsreuEHH0IiYm417ZvLv8W8w/DWYAQt+uzShivM0AIyHrXRPISBHgCjzwOeXA8Qe0ZCYm82+9u/x+4a/pzQSGoy4OTNoQn3UiEY5D1sp35y1K8i52bhnJv8ev8kTCJpQ/8DWYPw0oXvqzXz0sJXybXO7gpybx4we9O3xO1/UOPVd35zz1E8TumLY9HrrYSv9lFe3tdQMf+x8t2HD8nKeejaBcTrywA6mPLzQMIKo5Dz1ymRKLd44tcltkCCzkIm5+WZXmfbu8TfQhPoHtj7NyLcJ6KNGdGug8QU3ITcB3fTWYjw3m+sj3x5/I02ob2C7YHw0oScSlSaeLdVdxflutA0xnpvNe+Yc9uOM/2tmDn9JE+ob2C4YH02ohxrR4yg+R9XB2TjX6a1FeG4eqLs2AMbY49fD36UJ9Q1sA8xAxTUTFrvNVapHGq5kS7G4bdFUyTXh3eNf0fciIsKTRXPO00aaUNdsk9jRJvJNinluVmtENuHd4w+gCfUNbKNnWv/ngClrk143K4a52blnFjnYfNAevy7uG9jeHUOdq0uR6ivqnlk4YOE3aI9fe+cZ2A07hjpbdyIrr5n3zHjsgxkg4l5WdxwOq3XEHprV+qZ7/DrHzo4M2ROFne1hC97F7cVKC9l3dta0oP/e2T5G927GuGeOXmqH5qz27NwV0KBMT7BrQtQthU4D/j98EXT3W/8TagAAAABJRU5ErkJggg==\n", 146 | "text/plain": [ 147 | "" 148 | ] 149 | }, 150 | "execution_count": 7, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "from IPython.display import Image\n", 157 | "Image(filename='std.png')" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 8, 163 | "id": "e6d45b2c", 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "scaler = StandardScaler()\n", 168 | "X_train_scaled = scaler.fit_transform(X_train) # fit only on 
training" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 9, 174 | "id": "2a366b98", 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "X_test_scaled = scaler.transform(X_test) # transform test separately" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 10, 184 | "id": "0d2f8328", 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "model_clean = LinearRegression()\n", 189 | "model_clean.fit(X_train_scaled, y_train)\n", 190 | "pred_clean = model_clean.predict(X_test_scaled)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 11, 196 | "id": "6b19fa48", 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "name": "stdout", 201 | "output_type": "stream", 202 | "text": [ 203 | "Without Data Leakage: 0.7945396288418394\n" 204 | ] 205 | } 206 | ], 207 | "source": [ 208 | "print(\"Without Data Leakage:\", r2_score(y_test, pred_clean))" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "id": "9186420a", 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [] 218 | } 219 | ], 220 | "metadata": { 221 | "kernelspec": { 222 | "display_name": "Python 3 (ipykernel)", 223 | "language": "python", 224 | "name": "python3" 225 | }, 226 | "language_info": { 227 | "codemirror_mode": { 228 | "name": "ipython", 229 | "version": 3 230 | }, 231 | "file_extension": ".py", 232 | "mimetype": "text/x-python", 233 | "name": "python", 234 | "nbconvert_exporter": "python", 235 | "pygments_lexer": "ipython3", 236 | "version": "3.9.13" 237 | } 238 | }, 239 | "nbformat": 4, 240 | "nbformat_minor": 5 241 | } 242 | --------------------------------------------------------------------------------
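A note on the Data Leakage notebook above: both runs print the same R² (0.7945396288418394). That is expected for this particular model, because ordinary least squares with an intercept is unaffected by per-feature shifting and rescaling, so fitting the scaler on all rows cannot change its predictions. The sketch below repeats that check on the notebook's synthetic data and uses scikit-learn's `make_pipeline` so that the scaler is fitted on the training rows only. The names `leaky_scaler`, `leaky_model`, `clean_model`, and `same_predictions` are illustrative assumptions, not part of the original notebook. Preprocessing leakage is more likely to show up in the score with scale-sensitive or regularised models, or when target values feed into features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split settings as the notebook.
X, y = make_regression(n_samples=500, n_features=3, noise=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Leaky variant: the scaler is fitted on every row (train + test).
leaky_scaler = StandardScaler().fit(X)
leaky_model = LinearRegression().fit(leaky_scaler.transform(X_train), y_train)
pred_leaky = leaky_model.predict(leaky_scaler.transform(X_test))

# Clean variant: the pipeline fits the scaler on the training rows only
# and reuses those statistics when transforming the test rows.
clean_model = make_pipeline(StandardScaler(), LinearRegression())
clean_model.fit(X_train, y_train)
pred_clean = clean_model.predict(X_test)

# For plain linear regression the two sets of predictions agree up to
# floating-point error, which is why the notebook's two scores coincide.
same_predictions = np.allclose(pred_leaky, pred_clean)
print("Predictions match:", same_predictions)
```

Wrapping the preprocessing in a pipeline also keeps it leak-free under cross-validation, since the scaler is refitted inside each training fold (for example when the pipeline is passed to `cross_val_score`).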