├── README.md
├── 1. Label Encoder.ipynb
├── Min Max Scaling with Python.ipynb
├── 4. Ordinal Encoder.ipynb
├── 3. Binary Encoder.ipynb
├── Standard Scaling using Python.ipynb
├── 2. One Hot Encoding.ipynb
└── Data Leakage in Machine Learning.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # Feature Engineering
2 |
3 | **Feature Engineering** is the process of transforming raw data into meaningful features that help machine learning models learn better patterns and make more accurate predictions.
4 |
5 | In simple terms, it's about **creating the right inputs** for your model — the smarter the features, the better the model’s performance.
6 |
7 | ---
8 |
9 | ## Key Steps in Feature Engineering
10 |
11 | ### 1. Feature Creation
12 | Generate new features from existing ones to add useful information.
13 |
14 | **Examples:**
15 | - From `Date`, create `Day`, `Month`, or `Is_Weekend`
16 | - From `Price` and `Quantity`, create `Total_Sales = Price × Quantity`
17 | - From `Address`, extract `City` or `Postal_Code`
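
The three examples above can be sketched with pandas as follows. This is a minimal illustration, not a fixed recipe: the `orders` data, the address layout (`<street>, <5-digit postal code> <city>`), and the regular expression are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data; column names follow the examples above.
orders = pd.DataFrame({
    "Date": ["2024-05-10", "2024-05-11", "2024-05-13"],
    "Price": [10.0, 4.5, 20.0],
    "Quantity": [3, 10, 1],
    "Address": [
        "Hauptstr. 1, 10115 Berlin",
        "Marienplatz 8, 80331 Munich",
        "Elbchaussee 5, 22765 Hamburg",
    ],
})

# Date-derived features
orders["Date"] = pd.to_datetime(orders["Date"])
orders["Day"] = orders["Date"].dt.day
orders["Month"] = orders["Date"].dt.month
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 mark the weekend
orders["Is_Weekend"] = orders["Date"].dt.dayofweek >= 5

# Combined feature
orders["Total_Sales"] = orders["Price"] * orders["Quantity"]

# Text extraction; the pattern assumes the address layout described above
parts = orders["Address"].str.extract(
    r",\s*(?P<Postal_Code>\d{5})\s+(?P<City>.+)$"
)
orders = pd.concat([orders, parts], axis=1)

print(orders[["Day", "Month", "Is_Weekend", "Total_Sales", "Postal_Code", "City"]])
```

Real addresses are rarely this regular, so a dedicated parser is usually a safer choice than a single regular expression.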
18 |
19 | ---
20 |
21 | ### 2. Feature Transformation
22 | Modify existing features to improve learning and model performance.
23 |
24 | **Common techniques:**
25 | - **Scaling:** Normalize or standardize numerical features
26 | *(e.g., Min-Max scaling or StandardScaler)*
27 | - **Encoding:** Convert categorical data into numeric form
28 | *(e.g., One-Hot Encoding or Label Encoding)*
29 | - **Log Transformation:** Handle skewed data distributions
30 | - **Binning:** Group continuous values into discrete intervals
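
Log transformation and binning from the list above might look like the sketch below. The income and age values, the bin edges, and the group labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values
income = pd.Series([20_000, 35_000, 50_000, 120_000, 1_000_000], name="Income")

# log1p computes log(1 + x), so it stays defined when x is 0
log_income = np.log1p(income)

# Binning with hand-picked edges (an assumption for this example).
# right=False makes the intervals [0, 18), [18, 65), [65, 120).
ages = pd.Series([5, 17, 18, 42, 70], name="Age")
age_group = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
    right=False,
)

print(pd.DataFrame({"Income": income, "Log_Income": log_income.round(3)}))
print(pd.DataFrame({"Age": ages, "Age_Group": age_group}))
```

`pd.qcut` is an alternative when bins should hold roughly equal numbers of rows instead of covering fixed ranges.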
31 |
32 | ---
33 |
34 | ### 3. Feature Selection
35 | Choose only the most important features that influence predictions.
36 |
37 | **Methods include:**
38 | - Correlation analysis
39 | - Mutual information
40 | - Feature importance from tree-based models
41 | - Recursive Feature Elimination (RFE)
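
As a rough sketch, three of the methods above can be tried with scikit-learn on synthetic data. The dataset from `make_classification`, the choice of `LogisticRegression` as the RFE estimator, and keeping three features are all assumptions for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, 3 of them informative
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Correlation analysis: absolute linear correlation of each feature with the target
corr = X.corrwith(pd.Series(y)).abs()

# Mutual information: also picks up non-linear dependence
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Recursive Feature Elimination: repeatedly drops the weakest feature
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
selected = X.columns[rfe.support_].tolist()

print(pd.DataFrame({"abs_corr": corr, "mutual_info": mi}))
print("RFE kept:", selected)
```

The methods can disagree, so it is worth comparing the rankings and validating the final choice on held-out data.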
42 |
43 | ---
44 |
45 | ### 4. Handling Missing Data
46 | Fill or remove missing values to ensure clean input for training.
47 |
48 | **Strategies:**
49 | - Replace with mean/median/mode
50 | - Forward/Backward fill
51 | - Drop rows/columns with too many missing values
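
The strategies above could be combined as in this minimal sketch. The `raw` data and the 50% threshold for dropping a column are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
raw = pd.DataFrame({
    "Age": [28, np.nan, 35, 41],
    "City": ["Berlin", "Munich", None, "Berlin"],
    "Temp": [20.5, np.nan, np.nan, 23.0],
    "Notes": [None, None, None, "vip"],
})

# Drop columns where more than half of the values are missing
clean = raw.loc[:, raw.isna().mean() <= 0.5].copy()

# Numeric column: fill with the median
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Categorical column: fill with the most frequent value
clean["City"] = clean["City"].fillna(clean["City"].mode()[0])

# Ordered data: forward fill, then backward fill for any leading gaps
clean["Temp"] = clean["Temp"].ffill().bfill()

print(clean)
```

When a model is trained on a train/test split, the fill values should be computed on the training portion only, so that information from the test set does not leak into training.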
52 |
53 | ---
54 |
55 | ## Example
56 |
57 | **Raw dataset:**
58 |
59 | | Date | Age | Salary | City |
60 | |------|-----|---------|------|
61 | | 2024-05-10 | 28 | 50000 | Berlin |
62 | | 2024-05-11 | 35 | 62000 | Munich |
63 |
64 | **After feature engineering:**
65 |
66 | | Day | Month | Age | Salary | City_Berlin | City_Munich |
67 | |-----|-------|-----|---------|--------------|--------------|
68 | | 10 | 5 | 28 | 50000 | 1 | 0 |
69 | | 11 | 5 | 35 | 62000 | 0 | 1 |
70 |
71 | ---
72 |
73 | ## Why It Matters
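74 | Done well, feature engineering can:
75 | - Improve the **accuracy** of ML models by exposing patterns the raw data hides
76 | - Reduce **overfitting** by removing noisy or redundant features
77 | - Enhance **interpretability** by giving each feature a clear meaning
78 | - Speed up **training and inference** when fewer, more informative features are used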
79 |
80 | ---
81 |
82 | ## Python
83 |
84 | ```python
85 | import pandas as pd
86 | from sklearn.preprocessing import OneHotEncoder, StandardScaler
87 |
88 | # Sample Data
89 | data = {
90 | "Date": ["2024-05-10", "2024-05-11"],
91 | "Age": [28, 35],
92 | "Salary": [50000, 62000],
93 | "City": ["Berlin", "Munich"]
94 | }
95 | df = pd.DataFrame(data)
96 |
97 | # Feature Creation
98 | df["Date"] = pd.to_datetime(df["Date"])
99 | df["Day"] = df["Date"].dt.day
100 | df["Month"] = df["Date"].dt.month
101 |
102 | # Feature Transformation: one-hot encode City (sparse_output requires scikit-learn >= 1.2)
103 | encoder = OneHotEncoder(sparse_output=False)
104 | city_encoded = encoder.fit_transform(df[["City"]])
105 | encoded_df = pd.DataFrame(city_encoded, columns=encoder.get_feature_names_out(["City"]))
106 | df = pd.concat([df, encoded_df], axis=1)
107 |
108 | # Feature Scaling: standardize Age and Salary to zero mean and unit variance
109 | scaler = StandardScaler()
110 | df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])
111 |
112 | # Final Result
113 | print(df)
114 | ```
--------------------------------------------------------------------------------
/1. Label Encoder.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "21070538",
6 | "metadata": {},
7 | "source": [
8 | "# 1. Label Encoder"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "5122882e",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "classes = ['ClassA', 'ClassB', 'ClassC', 'ClassD']\n",
19 | "\n",
20 | "instances = ['ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB']"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "id": "0790c5cf",
27 | "metadata": {},
28 | "outputs": [
29 | {
30 | "name": "stdout",
31 | "output_type": "stream",
32 | "text": [
33 | "Encoded labels: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]\n"
34 | ]
35 | }
36 | ],
37 | "source": [
38 | "label_to_int = {label: index for index, label in enumerate(classes)} #60 Days of Python ; Day 25\n",
39 | "encoded_labels = [label_to_int[label] for label in instances]\n",
40 | "\n",
41 | "print(\"Encoded labels:\", encoded_labels)"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 3,
47 | "id": "9bdaf4f3",
48 | "metadata": {},
49 | "outputs": [
50 | {
51 | "name": "stdout",
52 | "output_type": "stream",
53 | "text": [
54 | "Encoded labels: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]\n",
55 | "Decoded labels: ['ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB', 'ClassC', 'ClassD', 'ClassA', 'ClassB']\n"
56 | ]
57 | }
58 | ],
59 | "source": [
60 | "int_to_label = {index: label for label, index in label_to_int.items()}\n",
61 | "decoded_labels = [int_to_label[index] for index in encoded_labels]\n",
62 | "\n",
63 | "print(\"Encoded labels:\", encoded_labels)\n",
64 | "print(\"Decoded labels:\", decoded_labels)"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "id": "63ee1325",
70 | "metadata": {},
71 | "source": [
72 | "# Sklearn - Label Encoder"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 4,
78 | "id": "60a20a81",
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "from sklearn.preprocessing import LabelEncoder"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 5,
88 | "id": "ad2970f2",
89 | "metadata": {},
90 | "outputs": [
91 | {
92 | "name": "stdout",
93 | "output_type": "stream",
94 | "text": [
95 | "Encoded labels: [0 1 2 3 0 1 2 3 0 1]\n"
96 | ]
97 | }
98 | ],
99 | "source": [
100 | "label_encoder = LabelEncoder()\n",
101 | "encoded_labels = label_encoder.fit_transform(instances)\n",
102 | "\n",
103 | "print(\"Encoded labels:\", encoded_labels)"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 6,
109 | "id": "5fd9dabe",
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | "Encoded labels: [0 1 2 3 0 1 2 3 0 1]\n",
117 | "Original labels: ['ClassA' 'ClassB' 'ClassC' 'ClassD' 'ClassA' 'ClassB' 'ClassC' 'ClassD'\n",
118 | " 'ClassA' 'ClassB']\n"
119 | ]
120 | }
121 | ],
122 | "source": [
123 | "original_labels = label_encoder.inverse_transform(encoded_labels)\n",
124 | "\n",
125 | "print(\"Encoded labels:\", encoded_labels)\n",
126 | "print(\"Original labels:\", original_labels)"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "id": "e932d2b3",
133 | "metadata": {},
134 | "outputs": [],
135 | "source": []
136 | }
137 | ],
138 | "metadata": {
139 | "kernelspec": {
140 | "display_name": "Python 3 (ipykernel)",
141 | "language": "python",
142 | "name": "python3"
143 | },
144 | "language_info": {
145 | "codemirror_mode": {
146 | "name": "ipython",
147 | "version": 3
148 | },
149 | "file_extension": ".py",
150 | "mimetype": "text/x-python",
151 | "name": "python",
152 | "nbconvert_exporter": "python",
153 | "pygments_lexer": "ipython3",
154 | "version": "3.9.13"
155 | }
156 | },
157 | "nbformat": 4,
158 | "nbformat_minor": 5
159 | }
160 |
--------------------------------------------------------------------------------
/Min Max Scaling with Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "fc529592",
6 | "metadata": {},
7 | "source": [
8 | "# Raw Code"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "1c33a4de",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "def min_max_scaling(data):\n",
19 | " min_val = min(data)\n",
20 | " max_val = max(data)\n",
21 | " value_range = (max_val - min_val) or 1 # constant input: avoid division by zero\n",
22 | " return [(x - min_val) / value_range for x in data]"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "id": "53cba510",
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "name": "stdout",
33 | "output_type": "stream",
34 | "text": [
35 | "Original Data: [1, 20, 30, 4, 5]\n",
36 | "Scaled data (raw): [0.0, 0.6551724137931034, 1.0, 0.10344827586206896, 0.13793103448275862]\n"
37 | ]
38 | }
39 | ],
40 | "source": [
41 | "data = [1, 20, 30, 4, 5]\n",
42 | "scaled_data = min_max_scaling(data)\n",
43 | "print('Original Data: ', data)\n",
44 | "print(\"Scaled data (raw):\", scaled_data)"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "id": "c9912be9",
50 | "metadata": {},
51 | "source": [
52 | "# using Sklearn"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 3,
58 | "id": "e2326ee0",
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "import pandas as pd\n",
63 | "from sklearn.preprocessing import MinMaxScaler"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 4,
69 | "id": "6c2eab99",
70 | "metadata": {},
71 | "outputs": [
72 | {
73 | "data": {
127 | "text/plain": [
128 | " Feature1 Feature2\n",
129 | "0 1 6\n",
130 | "1 5 7\n",
131 | "2 10 8\n",
132 | "3 4 19\n",
133 | "4 5 10"
134 | ]
135 | },
136 | "execution_count": 4,
137 | "metadata": {},
138 | "output_type": "execute_result"
139 | }
140 | ],
141 | "source": [
142 | "data = {'Feature1': [1, 5, 10, 4, 5],\n",
143 | " 'Feature2': [6, 7, 8, 19, 10]}\n",
144 | "\n",
145 | "df = pd.DataFrame(data)\n",
146 | "df.head()"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 5,
152 | "id": "03a59524",
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "scaler = MinMaxScaler()\n",
157 | "scaled_data = scaler.fit_transform(df)\n",
158 | "scaled_df = pd.DataFrame(scaled_data, columns=df.columns)"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 6,
164 | "id": "b32cc8d6",
165 | "metadata": {},
166 | "outputs": [
167 | {
168 | "name": "stdout",
169 | "output_type": "stream",
170 | "text": [
171 | "Original DataFrame:\n",
172 | " Feature1 Feature2\n",
173 | "0 1 6\n",
174 | "1 5 7\n",
175 | "2 10 8\n",
176 | "3 4 19\n",
177 | "4 5 10\n",
178 | "\n",
179 | "Scaled DataFrame:\n",
180 | " Feature1 Feature2\n",
181 | "0 0.000000 0.000000\n",
182 | "1 0.444444 0.076923\n",
183 | "2 1.000000 0.153846\n",
184 | "3 0.333333 1.000000\n",
185 | "4 0.444444 0.307692\n"
186 | ]
187 | }
188 | ],
189 | "source": [
190 | "print(\"Original DataFrame:\")\n",
191 | "print(df)\n",
192 | "print(\"\\nScaled DataFrame:\")\n",
193 | "print(scaled_df)"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "id": "0e898642",
200 | "metadata": {},
201 | "outputs": [],
202 | "source": []
203 | }
204 | ],
205 | "metadata": {
206 | "kernelspec": {
207 | "display_name": "Python 3 (ipykernel)",
208 | "language": "python",
209 | "name": "python3"
210 | },
211 | "language_info": {
212 | "codemirror_mode": {
213 | "name": "ipython",
214 | "version": 3
215 | },
216 | "file_extension": ".py",
217 | "mimetype": "text/x-python",
218 | "name": "python",
219 | "nbconvert_exporter": "python",
220 | "pygments_lexer": "ipython3",
221 | "version": "3.9.13"
222 | }
223 | },
224 | "nbformat": 4,
225 | "nbformat_minor": 5
226 | }
227 |
--------------------------------------------------------------------------------
/4. Ordinal Encoder.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "b78c029d",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import pandas as pd\n",
11 | "from sklearn.preprocessing import OrdinalEncoder"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 2,
17 | "id": "c146e9da",
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "data = [\n",
22 | " ['good'], ['bad'], ['excellent'], ['average'], \n",
23 | " ['good'], ['average'], ['excellent'], ['bad'], \n",
24 | " ['average'], ['good']\n",
25 | "]"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 3,
31 | "id": "3ad7f2c7",
32 | "metadata": {},
33 | "outputs": [
34 | {
35 | "data": {
83 | "text/plain": [
84 | " reviews\n",
85 | "0 good\n",
86 | "1 bad\n",
87 | "2 excellent\n",
88 | "3 average\n",
89 | "4 good"
90 | ]
91 | },
92 | "execution_count": 3,
93 | "metadata": {},
94 | "output_type": "execute_result"
95 | }
96 | ],
97 | "source": [
98 | "data = pd.DataFrame(data=data, columns=['reviews'])\n",
99 | "data.head()"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 4,
105 | "id": "14b89f09",
106 | "metadata": {},
107 | "outputs": [
108 | {
109 | "data": {
110 | "text/plain": [
111 | "(10, 1)"
112 | ]
113 | },
114 | "execution_count": 4,
115 | "metadata": {},
116 | "output_type": "execute_result"
117 | }
118 | ],
119 | "source": [
120 | "data.shape"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "id": "3e569b99",
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "categories = [['bad', 'average', 'good', 'excellent']]"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 6,
136 | "id": "eee41b3c",
137 | "metadata": {},
138 | "outputs": [
139 | {
140 | "data": {
141 | "text/plain": [
142 | "[['bad', 'average', 'good', 'excellent']]"
143 | ]
144 | },
145 | "execution_count": 6,
146 | "metadata": {},
147 | "output_type": "execute_result"
148 | }
149 | ],
150 | "source": [
151 | "categories"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 7,
157 | "id": "b3a1d73e",
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "encoder = OrdinalEncoder(categories=categories)"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 8,
167 | "id": "74286b79",
168 | "metadata": {},
169 | "outputs": [
170 | {
171 | "data": {
172 | "text/plain": [
173 | "array([[2.],\n",
174 | " [0.],\n",
175 | " [3.],\n",
176 | " [1.],\n",
177 | " [2.],\n",
178 | " [1.],\n",
179 | " [3.],\n",
180 | " [0.],\n",
181 | " [1.],\n",
182 | " [2.]])"
183 | ]
184 | },
185 | "execution_count": 8,
186 | "metadata": {},
187 | "output_type": "execute_result"
188 | }
189 | ],
190 | "source": [
191 | "encoded_data = encoder.fit_transform(data)\n",
192 | "encoded_data"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 9,
198 | "id": "b4fa6545",
199 | "metadata": {},
200 | "outputs": [
201 | {
202 | "data": {
203 | "text/plain": [
204 | "array([['good'],\n",
205 | " ['bad'],\n",
206 | " ['excellent'],\n",
207 | " ['average'],\n",
208 | " ['good'],\n",
209 | " ['average'],\n",
210 | " ['excellent'],\n",
211 | " ['bad'],\n",
212 | " ['average'],\n",
213 | " ['good']], dtype=object)"
214 | ]
215 | },
216 | "execution_count": 9,
217 | "metadata": {},
218 | "output_type": "execute_result"
219 | }
220 | ],
221 | "source": [
222 | "decoded_data = encoder.inverse_transform(encoded_data)\n",
223 | "decoded_data"
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "id": "525a8f59",
230 | "metadata": {},
231 | "outputs": [],
232 | "source": []
233 | }
234 | ],
235 | "metadata": {
236 | "kernelspec": {
237 | "display_name": "Python 3 (ipykernel)",
238 | "language": "python",
239 | "name": "python3"
240 | },
241 | "language_info": {
242 | "codemirror_mode": {
243 | "name": "ipython",
244 | "version": 3
245 | },
246 | "file_extension": ".py",
247 | "mimetype": "text/x-python",
248 | "name": "python",
249 | "nbconvert_exporter": "python",
250 | "pygments_lexer": "ipython3",
251 | "version": "3.9.13"
252 | }
253 | },
254 | "nbformat": 4,
255 | "nbformat_minor": 5
256 | }
257 |
--------------------------------------------------------------------------------
/3. Binary Encoder.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "adfe09d4",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import pandas as pd\n",
11 | "import category_encoders as ce"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 2,
17 | "id": "56986103",
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']}\n",
22 | "df = pd.DataFrame(data)"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 3,
28 | "id": "b5813023",
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
80 | "text/plain": [
81 | " Category\n",
82 | "0 A\n",
83 | "1 B\n",
84 | "2 C\n",
85 | "3 A\n",
86 | "4 B"
87 | ]
88 | },
89 | "execution_count": 3,
90 | "metadata": {},
91 | "output_type": "execute_result"
92 | }
93 | ],
94 | "source": [
95 | "df.head()"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 4,
101 | "id": "9390c11f",
102 | "metadata": {},
103 | "outputs": [
104 | {
105 | "data": {
106 | "text/plain": [
107 | "(9, 1)"
108 | ]
109 | },
110 | "execution_count": 4,
111 | "metadata": {},
112 | "output_type": "execute_result"
113 | }
114 | ],
115 | "source": [
116 | "df.shape"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 5,
122 | "id": "ec81a1c2",
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "encoder = ce.BinaryEncoder(cols=['Category'], return_df=True)"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 6,
132 | "id": "d5fe35de",
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "data": {
210 | "text/plain": [
211 | " Category_0 Category_1\n",
212 | "0 0 1\n",
213 | "1 1 0\n",
214 | "2 1 1\n",
215 | "3 0 1\n",
216 | "4 1 0\n",
217 | "5 1 1\n",
218 | "6 0 1\n",
219 | "7 1 0\n",
220 | "8 1 1"
221 | ]
222 | },
223 | "execution_count": 6,
224 | "metadata": {},
225 | "output_type": "execute_result"
226 | }
227 | ],
228 | "source": [
229 | "df_binary_encoded = encoder.fit_transform(df)\n",
230 | "df_binary_encoded"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "id": "201a7fa5",
237 | "metadata": {},
238 | "outputs": [],
239 | "source": []
240 | }
241 | ],
242 | "metadata": {
243 | "kernelspec": {
244 | "display_name": "Python 3 (ipykernel)",
245 | "language": "python",
246 | "name": "python3"
247 | },
248 | "language_info": {
249 | "codemirror_mode": {
250 | "name": "ipython",
251 | "version": 3
252 | },
253 | "file_extension": ".py",
254 | "mimetype": "text/x-python",
255 | "name": "python",
256 | "nbconvert_exporter": "python",
257 | "pygments_lexer": "ipython3",
258 | "version": "3.9.13"
259 | }
260 | },
261 | "nbformat": 4,
262 | "nbformat_minor": 5
263 | }
264 |
--------------------------------------------------------------------------------
/Standard Scaling using Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "20c48bc4",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import pandas as pd"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 2,
16 | "id": "cf6184fa",
17 | "metadata": {},
18 | "outputs": [
19 | {
20 | "data": {
74 | "text/plain": [
75 | " Feature 1 Feature 2\n",
76 | "0 1 6\n",
77 | "1 20 7\n",
78 | "2 3 18\n",
79 | "3 40 19\n",
80 | "4 5 10"
81 | ]
82 | },
83 | "execution_count": 2,
84 | "metadata": {},
85 | "output_type": "execute_result"
86 | }
87 | ],
88 | "source": [
89 | "data = {'Feature 1': [1, 20, 3, 40, 5],\n",
90 | " 'Feature 2': [6, 7, 18, 19, 10]}\n",
91 | "\n",
92 | "df = pd.DataFrame(data)\n",
93 | "df.head()"
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": 3,
99 | "id": "8fa7404f",
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
103 | "def standardize_data(data):\n",
104 | " mean = data.mean()\n",
105 | " std_dev = data.std() # pandas default is ddof=1 (sample std); sklearn's StandardScaler uses ddof=0\n",
106 | " standardized_data_raw = (data - mean) / std_dev\n",
107 | " return standardized_data_raw"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 4,
113 | "id": "f2c02d59",
114 | "metadata": {},
115 | "outputs": [],
116 | "source": [
117 | "standardized_df_raw = df.apply(standardize_data)"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 5,
123 | "id": "f22c9db0",
124 | "metadata": {},
125 | "outputs": [
126 | {
127 | "name": "stdout",
128 | "output_type": "stream",
129 | "text": [
130 | "Original DataFrame:\n",
131 | " Feature 1 Feature 2\n",
132 | "0 1 6\n",
133 | "1 20 7\n",
134 | "2 3 18\n",
135 | "3 40 19\n",
136 | "4 5 10\n",
137 | "\n",
138 | "Standardized DataFrame:\n",
139 | " Feature 1 Feature 2\n",
140 | "0 -0.777975 -0.979796\n",
141 | "1 0.376832 -0.816497\n",
142 | "2 -0.656417 0.979796\n",
143 | "3 1.592418 1.143095\n",
144 | "4 -0.534858 -0.326599\n"
145 | ]
146 | }
147 | ],
148 | "source": [
149 | "print(\"Original DataFrame:\")\n",
150 | "print(df)\n",
151 | "print(\"\\nStandardized DataFrame:\")\n",
152 | "print(standardized_df_raw)"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 6,
158 | "id": "3fb91ce8",
159 | "metadata": {},
160 | "outputs": [
161 | {
162 | "data": {
216 | "text/plain": [
217 | " Feature 1 Feature 2\n",
218 | "0 1 6\n",
219 | "1 20 7\n",
220 | "2 3 18\n",
221 | "3 40 19\n",
222 | "4 5 10"
223 | ]
224 | },
225 | "execution_count": 6,
226 | "metadata": {},
227 | "output_type": "execute_result"
228 | }
229 | ],
230 | "source": [
231 | "df.head()"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 7,
237 | "id": "dcfedb7b",
238 | "metadata": {},
239 | "outputs": [
240 | {
241 | "data": {
295 | "text/plain": [
296 | " Feature 1 Feature 2\n",
297 | "0 -0.777975 -0.979796\n",
298 | "1 0.376832 -0.816497\n",
299 | "2 -0.656417 0.979796\n",
300 | "3 1.592418 1.143095\n",
301 | "4 -0.534858 -0.326599"
302 | ]
303 | },
304 | "execution_count": 7,
305 | "metadata": {},
306 | "output_type": "execute_result"
307 | }
308 | ],
309 | "source": [
310 | "standardized_df_raw.head()"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": 8,
316 | "id": "daaa1204",
317 | "metadata": {},
318 | "outputs": [
319 | {
320 | "data": {
321 | "text/plain": [
322 | "1.0"
323 | ]
324 | },
325 | "execution_count": 8,
326 | "metadata": {},
327 | "output_type": "execute_result"
328 | }
329 | ],
330 | "source": [
331 | "standardized_df_raw['Feature 1'].mean() + 1"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 9,
337 | "id": "56277f06",
338 | "metadata": {},
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "0.9999999999999999"
344 | ]
345 | },
346 | "execution_count": 9,
347 | "metadata": {},
348 | "output_type": "execute_result"
349 | }
350 | ],
351 | "source": [
352 | "standardized_df_raw['Feature 1'].std()"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "id": "690b8116",
358 | "metadata": {},
359 | "source": [
360 | "# Sklearn"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 10,
366 | "id": "11c15507",
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "from sklearn.preprocessing import StandardScaler"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": 11,
376 | "id": "6f37ac57",
377 | "metadata": {},
378 | "outputs": [
379 | {
380 | "data": {
434 | "text/plain": [
435 | " Feature 1 Feature 2\n",
436 | "0 1 6\n",
437 | "1 20 7\n",
438 | "2 3 18\n",
439 | "3 40 19\n",
440 | "4 5 10"
441 | ]
442 | },
443 | "execution_count": 11,
444 | "metadata": {},
445 | "output_type": "execute_result"
446 | }
447 | ],
448 | "source": [
449 | "df"
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": 12,
455 | "id": "ea267703",
456 | "metadata": {},
457 | "outputs": [],
458 | "source": [
459 | "scaler = StandardScaler()\n",
460 | "standardized_data = scaler.fit_transform(df)\n",
461 | "standardized_df = pd.DataFrame(standardized_data, columns=df.columns)"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 13,
467 | "id": "c68cbfde",
468 | "metadata": {},
469 | "outputs": [
470 | {
471 | "name": "stdout",
472 | "output_type": "stream",
473 | "text": [
474 | "Original DataFrame:\n",
475 | " Feature 1 Feature 2\n",
476 | "0 1 6\n",
477 | "1 20 7\n",
478 | "2 3 18\n",
479 | "3 40 19\n",
480 | "4 5 10\n",
481 | "\n",
482 | "Standardized DataFrame: raw code\n",
483 | " Feature 1 Feature 2\n",
484 | "0 -0.777975 -0.979796\n",
485 | "1 0.376832 -0.816497\n",
486 | "2 -0.656417 0.979796\n",
487 | "3 1.592418 1.143095\n",
488 | "4 -0.534858 -0.326599\n",
489 | "\n",
490 | "Standardized DataFrame: sklearn\n",
491 | " Feature 1 Feature 2\n",
492 | "0 -0.869803 -1.095445\n",
493 | "1 0.421311 -0.912871\n",
494 | "2 -0.733896 1.095445\n",
495 | "3 1.780378 1.278019\n",
496 | "4 -0.597989 -0.365148\n"
497 | ]
498 | }
499 | ],
500 | "source": [
501 | "print(\"Original DataFrame:\")\n",
502 | "print(df)\n",
503 | "\n",
504 | "print(\"\\nStandardized DataFrame: raw code\")\n",
505 | "print(standardized_df_raw)\n",
506 | "\n",
507 | "print(\"\\nStandardized DataFrame: sklearn\")\n",
508 | "print(standardized_df)"
509 | ]
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": null,
514 | "id": "f764d2ec",
515 | "metadata": {},
516 | "outputs": [],
517 | "source": []
518 | },
519 | {
520 | "cell_type": "code",
521 | "execution_count": null,
522 | "id": "b1af06be",
523 | "metadata": {},
524 | "outputs": [],
525 | "source": []
526 | }
527 | ],
528 | "metadata": {
529 | "kernelspec": {
530 | "display_name": "Python 3 (ipykernel)",
531 | "language": "python",
532 | "name": "python3"
533 | },
534 | "language_info": {
535 | "codemirror_mode": {
536 | "name": "ipython",
537 | "version": 3
538 | },
539 | "file_extension": ".py",
540 | "mimetype": "text/x-python",
541 | "name": "python",
542 | "nbconvert_exporter": "python",
543 | "pygments_lexer": "ipython3",
544 | "version": "3.9.13"
545 | }
546 | },
547 | "nbformat": 4,
548 | "nbformat_minor": 5
549 | }
550 |
--------------------------------------------------------------------------------
/2. One Hot Encoding.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "79afcd9b",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import pandas as pd"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 2,
16 | "id": "a31b20f7",
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']}"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 3,
26 | "id": "6c33a2f7",
27 | "metadata": {},
28 | "outputs": [
29 | {
30 | "data": {
78 | "text/plain": [
79 | " Category\n",
80 | "0 A\n",
81 | "1 B\n",
82 | "2 C\n",
83 | "3 A\n",
84 | "4 B"
85 | ]
86 | },
87 | "execution_count": 3,
88 | "metadata": {},
89 | "output_type": "execute_result"
90 | }
91 | ],
92 | "source": [
93 | "df = pd.DataFrame(data)\n",
94 | "df.head()"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 4,
100 | "id": "d5674ea5",
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "data": {
188 | "text/plain": [
189 | " Category_A Category_B Category_C\n",
190 | "0 1 0 0\n",
191 | "1 0 1 0\n",
192 | "2 0 0 1\n",
193 | "3 1 0 0\n",
194 | "4 0 1 0\n",
195 | "5 0 0 1\n",
196 | "6 1 0 0\n",
197 | "7 0 1 0\n",
198 | "8 0 0 1"
199 | ]
200 | },
201 | "execution_count": 4,
202 | "metadata": {},
203 | "output_type": "execute_result"
204 | }
205 | ],
206 | "source": [
207 | "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'])\n",
208 | "one_hot_encoded_df"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": 5,
214 | "id": "e16d27ac",
215 | "metadata": {},
216 | "outputs": [
217 | {
218 | "data": {
302 | "text/plain": [
303 | " Dummy_A Dummy_B Dummy_C\n",
304 | "0 1 0 0\n",
305 | "1 0 1 0\n",
306 | "2 0 0 1\n",
307 | "3 1 0 0\n",
308 | "4 0 1 0\n",
309 | "5 0 0 1\n",
310 | "6 1 0 0\n",
311 | "7 0 1 0\n",
312 | "8 0 0 1"
313 | ]
314 | },
315 | "execution_count": 5,
316 | "metadata": {},
317 | "output_type": "execute_result"
318 | }
319 | ],
320 | "source": [
321 | "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'], prefix='Dummy')\n",
322 | "one_hot_encoded_df"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 6,
328 | "id": "2e8cfc1d",
329 | "metadata": {},
330 | "outputs": [
331 | {
332 | "data": {
406 | "text/plain": [
407 | " Dummy_B Dummy_C\n",
408 | "0 0 0\n",
409 | "1 1 0\n",
410 | "2 0 1\n",
411 | "3 0 0\n",
412 | "4 1 0\n",
413 | "5 0 1\n",
414 | "6 0 0\n",
415 | "7 1 0\n",
416 | "8 0 1"
417 | ]
418 | },
419 | "execution_count": 6,
420 | "metadata": {},
421 | "output_type": "execute_result"
422 | }
423 | ],
424 | "source": [
425 | "one_hot_encoded_df = pd.get_dummies(df, columns=['Category'], prefix='Dummy', drop_first=True)\n",
426 | "one_hot_encoded_df"
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": 7,
432 | "id": "287f4418",
433 | "metadata": {},
434 | "outputs": [
435 | {
436 | "data": {
484 | "text/plain": [
485 | " Category\n",
486 | "0 A\n",
487 | "1 B\n",
488 | "2 C\n",
489 | "3 A\n",
490 | "4 B"
491 | ]
492 | },
493 | "execution_count": 7,
494 | "metadata": {},
495 | "output_type": "execute_result"
496 | }
497 | ],
498 | "source": [
499 | "df.head()"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": null,
505 | "id": "cb821849",
506 | "metadata": {},
507 | "outputs": [],
508 | "source": []
509 | }
510 | ],
511 | "metadata": {
512 | "kernelspec": {
513 | "display_name": "Python 3 (ipykernel)",
514 | "language": "python",
515 | "name": "python3"
516 | },
517 | "language_info": {
518 | "codemirror_mode": {
519 | "name": "ipython",
520 | "version": 3
521 | },
522 | "file_extension": ".py",
523 | "mimetype": "text/x-python",
524 | "name": "python",
525 | "nbconvert_exporter": "python",
526 | "pygments_lexer": "ipython3",
527 | "version": "3.9.13"
528 | }
529 | },
530 | "nbformat": 4,
531 | "nbformat_minor": 5
532 | }
533 |
--------------------------------------------------------------------------------
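Note on `2. One Hot Encoding.ipynb`: with `drop_first=True` the `Dummy_A` column disappears, yet no information is lost, because every row of the full encoding contains exactly one 1. The sketch below re-implements the idea in plain Python to make that visible. `one_hot` is a hypothetical helper written for illustration; it is not how pandas implements `get_dummies`.

```python
# Same values as the notebook's Category column.
categories = ["A", "B", "C", "A", "B", "C", "A", "B", "C"]

# Distinct levels in sorted order: ['A', 'B', 'C'].
levels = sorted(set(categories))


def one_hot(values, levels, drop_first=False):
    """Return one dict of 0/1 indicator columns per input value."""
    kept = levels[1:] if drop_first else levels
    return [
        {f"Dummy_{level}": int(value == level) for level in kept}
        for value in values
    ]


full = one_hot(categories, levels)
reduced = one_hot(categories, levels, drop_first=True)

# Each full row sums to 1, so the dropped column equals 1 minus the kept ones.
recovered_a = [1 - sum(row.values()) for row in reduced]
print(recovered_a == [row["Dummy_A"] for row in full])
```

Dropping one level this way is commonly done for linear models with an intercept, where the full set of indicator columns would be perfectly collinear.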
/Data Leakage in Machine Learning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "4476c65b",
6 | "metadata": {},
7 | "source": [
8 | "[Watch Full Video on Data Leakage](https://youtu.be/UELHcSU_Dpg)"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "a5ffa0fa",
14 | "metadata": {},
15 | "source": [
16 | "`Data Leakage` (also called information leakage) happens when information that should be unavailable during model training accidentally influences the model. Our model “sees” or “learns from” data it shouldn’t have access to (like test data or future data). This causes `unrealistically high accuracy during training/testing`, but poor performance on unseen data."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "id": "c1fddf31",
22 | "metadata": {},
23 | "source": [
24 | "# ⚠️ Common Causes\n",
25 | "\n",
26 | "1. Fitting preprocessing steps (scaling, encoding, imputing) on the full dataset before the train-test split\n",
27 | "\n",
28 | "2. Using target values to create features\n",
29 | "\n",
30 | "3. Mixing future data with past data in time series problems"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 1,
36 | "id": "9a889af3",
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "from sklearn.datasets import make_regression\n",
41 | "from sklearn.preprocessing import StandardScaler\n",
42 | "from sklearn.linear_model import LinearRegression\n",
43 | "from sklearn.model_selection import train_test_split\n",
44 | "from sklearn.metrics import r2_score"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 2,
50 | "id": "6075bcb4",
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "x, y = make_regression(n_samples=500, n_features=3, noise=50, random_state=42)"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "id": "94f28c3f",
60 | "metadata": {},
61 | "source": [
62 | "# ❌ WRONG WAY (Data Leakage)"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 3,
68 | "id": "b51eba29",
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "scaler = StandardScaler()\n",
73 | "X_scaled_leak = scaler.fit_transform(x) # Scaler sees all data (train + test)\n",
74 | "X_train_leak, X_test_leak, y_train, y_test = train_test_split(X_scaled_leak, y, test_size=0.2, random_state=42)"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 4,
80 | "id": "8d9b5523",
81 | "metadata": {},
82 | "outputs": [],
83 | "source": [
84 | "model_leak = LinearRegression()\n",
85 | "model_leak.fit(X_train_leak, y_train)\n",
86 | "pred_leak = model_leak.predict(X_test_leak)"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 5,
92 | "id": "c95748e0",
93 | "metadata": {},
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "With Data Leakage (scaled before split): 0.7945396288418394\n"
100 | ]
101 | }
102 | ],
103 | "source": [
104 | "print(\"With Data Leakage (scaled before split):\", r2_score(y_test, pred_leak))"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "id": "8e20b3dd",
110 | "metadata": {},
111 | "source": [
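112 | "The model still learns, but the scaler's mean and standard deviation were computed with help from the test rows, so the test score is no longer a clean estimate of performance on unseen data.\n",
113 | "\n",
114 | "That is why leaky pipelines can look better in testing than they turn out to be in production. In this particular example the effect is invisible: ordinary least squares with an intercept gives the same predictions under any invertible affine rescaling of the features, so the leaky and leak-free runs in this notebook report the same R² (about 0.7945). The gap shows up with scale-sensitive models (regularized regression, k-NN, SVMs) and with steps such as imputation or target-based encoding.\n",
115 | "\n",
116 | "Always isolate your training data: fit transformations and create features on the training set only, never on the whole dataset."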
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "id": "74e551e9",
122 | "metadata": {},
123 | "source": [
124 | "# ✅ CORRECT WAY (No Leakage)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 6,
130 | "id": "ddd01c92",
131 | "metadata": {},
132 | "outputs": [],
133 | "source": [
134 | "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 7,
140 | "id": "52df4004",
141 | "metadata": {},
142 | "outputs": [
143 | {
144 | "data": {
146 | "text/plain": [
147 | "<IPython.core.display.Image object>"
148 | ]
149 | },
150 | "execution_count": 7,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "from IPython.display import Image\n",
157 | "Image(filename='std.png')"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 8,
163 | "id": "e6d45b2c",
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "scaler = StandardScaler()\n",
168 | "X_train_scaled = scaler.fit_transform(X_train) # fit only on training"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 9,
174 | "id": "2a366b98",
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "X_test_scaled = scaler.transform(X_test) # reuse the training mean/std; do not refit on test"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 10,
184 | "id": "0d2f8328",
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "model_clean = LinearRegression()\n",
189 | "model_clean.fit(X_train_scaled, y_train)\n",
190 | "pred_clean = model_clean.predict(X_test_scaled)"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 11,
196 | "id": "6b19fa48",
197 | "metadata": {},
198 | "outputs": [
199 | {
200 | "name": "stdout",
201 | "output_type": "stream",
202 | "text": [
203 | "Without Data Leakage: 0.7945396288418394\n"
204 | ]
205 | }
206 | ],
207 | "source": [
208 | "print(\"Without Data Leakage:\", r2_score(y_test, pred_clean))"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "id": "9186420a",
215 | "metadata": {},
216 | "outputs": [],
217 | "source": []
218 | }
219 | ],
220 | "metadata": {
221 | "kernelspec": {
222 | "display_name": "Python 3 (ipykernel)",
223 | "language": "python",
224 | "name": "python3"
225 | },
226 | "language_info": {
227 | "codemirror_mode": {
228 | "name": "ipython",
229 | "version": 3
230 | },
231 | "file_extension": ".py",
232 | "mimetype": "text/x-python",
233 | "name": "python",
234 | "nbconvert_exporter": "python",
235 | "pygments_lexer": "ipython3",
236 | "version": "3.9.13"
237 | }
238 | },
239 | "nbformat": 4,
240 | "nbformat_minor": 5
241 | }
242 |
--------------------------------------------------------------------------------
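Note on `Data Leakage in Machine Learning.ipynb`: the difference between the two workflows is which rows the scaler's statistics are computed from. The sketch below shows this on a single made-up feature with a toy scaler. `SimpleStandardScaler` and the data are hypothetical stand-ins chosen for illustration; this is not scikit-learn code.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Toy single-feature scaler that mimics the fit/transform pattern."""

    def fit(self, values):
        # Learn the statistics from whatever rows are passed in.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored statistics without recomputing them.
        return [(v - self.mean_) / self.scale_ for v in values]


# Made-up data: the last value plays the role of an unusual test row.
values = [2.0, 4.0, 6.0, 8.0, 100.0]
train, test = values[:4], values[4:]

leaky = SimpleStandardScaler().fit(values)  # statistics include the test row
clean = SimpleStandardScaler().fit(train)   # statistics come from training only

print("leaky mean:", leaky.mean_, "clean mean:", clean.mean_)
print("test row, leaky scaling:", leaky.transform(test))
print("test row, clean scaling:", clean.transform(test))
```

With the leaky fit, the test row has already pulled the mean and standard deviation toward itself, so it looks far less extreme than it would to a model that had only ever seen the training rows.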