├── README.md
└── Project Code.py

/README.md:
--------------------------------------------------------------------------------
# Predicting-Purchase-Intent-Ecommerce

### Project Overview
This project predicts purchase intent for an e-commerce platform using machine learning models. The goal is to analyze user behavior and predict whether a user will make a purchase based on various features, including interaction duration, page views, and other user characteristics.

### Techniques Used:
1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Random Forest Classifier

### Data Description
The dataset used in this project is the Online Shoppers Intention Dataset from the UCI Machine Learning Repository. It contains information about user sessions on an e-commerce website, with the target variable being whether or not a user made a purchase.

### Key Features:
1. Administrative Duration: Time spent on administrative pages.
2. ProductRelated Duration: Time spent on product-related pages.
3. Informational Duration: Time spent on informational pages.
4. Bounce Rate: Percentage of sessions where users leave after viewing only one page.
5. Exit Rate: Percentage of sessions that end on a given page.
6. Revenue: Target variable indicating whether the user made a purchase (0 = No, 1 = Yes).

### Objectives
Objective 1: Analyze the influence of page interaction durations on user purchase behavior.

Objective 2: Evaluate the effect of bounce rate and exit rate on conversion likelihood across sessions.

Objective 3: Compare purchasing behavior across months and special days to identify seasonal trends.

Objective 4: Investigate the relationship between visitor type and purchasing behavior.

Objective 5: Examine regional and traffic type variations in user purchase behavior.

### Data Preprocessing
Before applying machine learning models, the following preprocessing steps were performed:
1. Removed duplicate rows from the dataset.
2. Filtered out sessions where all durations (administrative, informational, product-related) were zero.
3. Handled missing values by imputing the column mean for numeric features and the mode for categorical features.
4. Encoded categorical variables (VisitorType and Month) using one-hot encoding.
5. Scaled the features using MinMaxScaler to normalize the dataset and improve model performance.

### Machine Learning Models
The project applies three machine learning models to predict purchase intent:

1. Logistic Regression: A binary classification model used to predict the likelihood of a user making a purchase.
2. K-Nearest Neighbors (KNN): A non-parametric method that classifies a data point based on its closest neighbors.
3. Random Forest Classifier: An ensemble learning method that combines multiple decision trees to predict the target variable.

Each model is evaluated using accuracy and confusion matrix metrics.

### Model Evaluation
The models were evaluated on their accuracy using the test set:

1. Logistic Regression: ~80% accuracy
2. KNN Classifier: ~84% accuracy
3. Random Forest: ~86% accuracy

Additionally, confusion matrices were used to assess performance and identify misclassifications.
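For reference, accuracy can be read directly off a confusion matrix, and because purchases are the minority class in this dataset, purchase-class recall is worth checking alongside accuracy. A minimal sketch (the matrix values below are illustrative placeholders, not results from this project):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true labels, columns = predictions
conf_mat = np.array([[900, 60],    # true No Purchase: 900 correct, 60 false alarms
                     [ 80, 160]])  # true Purchase: 80 missed, 160 caught

tn, fp, fn, tp = conf_mat.ravel()
accuracy = (tp + tn) / conf_mat.sum()  # share of all sessions classified correctly
recall = tp / (tp + fn)                # share of actual purchases the model catches
print(f"Accuracy: {accuracy:.3f}, Purchase recall: {recall:.3f}")
```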
### Contributing
Feel free to fork this repository, make improvements, and submit pull requests. Contributions, suggestions, and feedback are always welcome!
--------------------------------------------------------------------------------

/Project Code.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier


# Load Dataset (raw string avoids backslash-escape issues in the Windows path)
df = pd.read_csv(r"E:\online_shoppers_intention.csv")
print(df.head())

# Remove Duplicate Rows
df.drop_duplicates(inplace=True)

# Remove sessions where all durations are 0 (not useful for behavior analysis)
df = df[~((df['Administrative_Duration'] == 0) &
          (df['Informational_Duration'] == 0) &
          (df['ProductRelated_Duration'] == 0))].copy()

# Fill Missing Values (none in this dataset, but safe to keep)
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype in ['int64', 'float64']:
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

# Encode Booleans + Target
df['Revenue'] = df['Revenue'].astype(int)
df['Weekend'] = df['Weekend'].astype(int)

# Encode Categorical Columns
df = pd.get_dummies(df, columns=['Month', 'VisitorType'], drop_first=True)

# Create New Features (optional but useful for EDA/insights)
df['Total_Duration'] = df['Administrative_Duration'] + df['Informational_Duration'] + df['ProductRelated_Duration']
df['Total_Pages'] = df['Administrative'] + df['Informational'] + df['ProductRelated']

# Feature Scaling with MinMaxScaler
cols_to_scale = ['Administrative_Duration', 'Informational_Duration', 'ProductRelated_Duration',
                 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Total_Duration']
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Final Info Check
df.info()
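
# Quick class-balance check (added sketch): Revenue is imbalanced, with
# purchases the minority class, which is why class_weight='balanced' is used
# for the models below where available.
print(df['Revenue'].value_counts(normalize=True))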
################################### OBJECTIVES ##################################

# Objective 1: Analyze the influence of page interaction durations on user purchase behavior.
# Violin plots of page interaction durations vs Revenue show the spread and median of each duration.

# Ensure Revenue is numeric
df['Revenue'] = df['Revenue'].astype(int)
# Set a clean style
sns.set(style="whitegrid")

# 1. ProductRelated_Duration vs Revenue
plt.figure(figsize=(8, 5))
sns.violinplot(x='Revenue', y='ProductRelated_Duration', data=df, palette='pastel')
plt.title('ProductRelated Duration vs Revenue')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('ProductRelated Duration')
plt.tight_layout()
plt.show()

# 2. Administrative_Duration vs Revenue
plt.figure(figsize=(8, 5))
sns.violinplot(x='Revenue', y='Administrative_Duration', data=df, palette='Set2')
plt.title('Administrative Duration vs Revenue')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('Administrative Duration')
plt.tight_layout()
plt.show()

# 3. Informational_Duration vs Revenue
plt.figure(figsize=(8, 5))
sns.violinplot(x='Revenue', y='Informational_Duration', data=df, palette='coolwarm')
plt.title('Informational Duration vs Revenue')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('Informational Duration')
plt.tight_layout()
plt.show()


# Objective 2: Evaluate the effect of bounce rate and exit rate on conversion likelihood across sessions.
# Scatterplot: BounceRates vs ExitRates colored by Revenue
sns.scatterplot(x='BounceRates', y='ExitRates', hue='Revenue', data=df)
plt.title('Bounce Rate vs Exit Rate by Purchase')
plt.xlabel('Bounce Rate')
plt.ylabel('Exit Rate')
plt.legend(title='Purchase Made')
plt.grid(True)
plt.show()


# Objective 3: Compare purchasing behavior across months and special days to identify seasonal shopping trends.

print(df.columns)
# Reload the original dataset (Month was one-hot encoded in df)
original_df = pd.read_csv(r"E:\online_shoppers_intention.csv")

# Define correct month order (the dataset spells June out in full)
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Filter only rows where a purchase was made
purchased = original_df[original_df['Revenue'] == True]

# Plot purchases by month
sns.countplot(x='Month', data=purchased, order=month_order)
plt.title('Purchases by Month')
plt.xlabel('Month')
plt.ylabel('Number of Purchases')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Filter only rows where no purchase was made
not_purchased = original_df[original_df['Revenue'] == False]

# Plot non-purchases by month
sns.countplot(x='Month', data=not_purchased, order=month_order)
plt.title('Non-Purchases by Month')
plt.xlabel('Month')
plt.ylabel('Number of Non-Purchases')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Prepare purchase counts per month
monthly_trend = original_df[original_df['Revenue'] == True]['Month'].value_counts().reindex(month_order).fillna(0)

# Line graph for trend
monthly_trend.plot(kind='line', marker='o', linestyle='-', color='green', figsize=(10, 5))
plt.title('Monthly Purchase Trend')
plt.xlabel('Month')
plt.ylabel('Number of Purchases')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

# Special Day: Purchase vs No Purchase
plt.figure(figsize=(6, 4))
sns.barplot(x='Revenue', y='SpecialDay', data=df, palette='pastel')
plt.title('Average SpecialDay Proximity by Purchase Outcome')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('Mean SpecialDay Score')
plt.tight_layout()
plt.show()
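
# Added sketch: raw counts are skewed by traffic volume, so a per-month
# conversion rate (purchases / sessions) is a useful companion view of
# seasonality. Months absent from the data simply plot as gaps.
monthly_conversion = original_df.groupby('Month')['Revenue'].mean().reindex(month_order)
monthly_conversion.plot(kind='bar', color='teal', figsize=(10, 5))
plt.title('Conversion Rate by Month')
plt.xlabel('Month')
plt.ylabel('Share of Sessions Ending in Purchase')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()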
# Objective 4: Investigate the relationship between visitor type and purchasing behavior.
# Pie Chart: Visitor Type Distribution

# Count of each visitor type
visitor_counts = original_df['VisitorType'].value_counts()

# Plot pie chart
plt.pie(visitor_counts, labels=visitor_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Visitor Type Distribution')
plt.axis('equal')  # Ensures the pie is a circle
plt.tight_layout()
plt.show()

# Countplot: Visitor Type vs Revenue (Purchasing Behavior)
sns.countplot(x='VisitorType', hue='Revenue', data=original_df)
plt.title('Visitor Type vs Purchase Behavior')
plt.xlabel('Visitor Type')
plt.ylabel('Number of Sessions')
plt.legend(title='Purchase Made')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()


# Objective 5: Investigate regional and traffic type variations in user purchase behavior.
# Create a pivot table of average conversion rate by Region and TrafficType
heat_data = pd.crosstab(df['Region'], df['TrafficType'], values=df['Revenue'], aggfunc='mean').fillna(0)

# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(heat_data, annot=True, cmap='YlGnBu', fmt=".2f")
plt.title('Conversion Rate by Region and Traffic Type')
plt.xlabel('Traffic Type')
plt.ylabel('Region')
plt.tight_layout()
plt.show()



################################## ML MODELS #######################################

# 1. Ensure Revenue is encoded as integer (already done during preprocessing; harmless to repeat)
df['Revenue'] = df['Revenue'].astype(int)

# 2. 'Month' and 'VisitorType' were already one-hot encoded during preprocessing,
#    so no further encoding is needed here (re-running get_dummies on the encoded
#    frame would raise a KeyError).

# 3. Define features and target
X = df.drop('Revenue', axis=1)  # All input features
y = df['Revenue']               # Target label

# 4. Train-Test Split (80-20) with stratification to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 5. Feature Scaling with MinMaxScaler (fit on the training split only to avoid leakage)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
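
# For reference (added sketch): MinMaxScaler maps each feature to [0, 1] via
# x_scaled = (x - x_min) / (x_max - x_min), with x_min and x_max computed on
# the training split. A manual check against one column with nonzero spread:
col = X_train['ProductRelated'].astype(float)
manual = (col - col.min()) / (col.max() - col.min())
print(manual.describe())  # min 0, max 1, matching the scaler's output for this column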
#################################### Logistic Regression ###########################################

# 6. Initialize Logistic Regression with class_weight='balanced' to handle class imbalance
log_model = LogisticRegression(max_iter=1000, class_weight='balanced')

# 7. Train the model
log_model.fit(X_train_scaled, y_train)

# 8. Make predictions
y_pred = log_model.predict(X_test_scaled)

# 9. Evaluate the model
acc = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)

print(f"Logistic Regression Accuracy: {acc:.4f}")
print("Logistic Regression Train Accuracy:", log_model.score(X_train_scaled, y_train))
print("Logistic Regression Test Accuracy :", log_model.score(X_test_scaled, y_test))
print("Confusion Matrix:\n", conf_mat)

# Plot Confusion Matrix for Logistic Regression
plt.figure(figsize=(6, 4))
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Purchase (0)', 'Purchase (1)'],
            yticklabels=['No Purchase (0)', 'Purchase (1)'])
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()

######################################### KNN MODEL ###############################################

# Initialize and train the KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)  # n_neighbors is worth experimenting with
knn_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_knn = knn_model.predict(X_test_scaled)

# Calculate accuracy
acc_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {acc_knn:.4f}")

print("KNN Train Accuracy:", knn_model.score(X_train_scaled, y_train))
print("KNN Test Accuracy :", knn_model.score(X_test_scaled, y_test))

# Confusion Matrix
conf_mat_knn = confusion_matrix(y_test, y_pred_knn)
print("KNN Confusion Matrix:\n", conf_mat_knn)

# Plot Confusion Matrix for KNN
plt.figure(figsize=(6, 4))
sns.heatmap(conf_mat_knn, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Purchase (0)', 'Purchase (1)'],
            yticklabels=['No Purchase (0)', 'Purchase (1)'])
plt.title('Confusion Matrix - KNN Classifier')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()

############################# RANDOM FOREST ##############################################

# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')  # class_weight='balanced' to handle imbalance
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Train vs Test scores
print("Train:", rf_model.score(X_train_scaled, y_train))
print("Test :", rf_model.score(X_test_scaled, y_test))

# Calculate accuracy
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {acc_rf:.4f}")

# Confusion Matrix
conf_mat_rf = confusion_matrix(y_test, y_pred_rf)
print("Random Forest Confusion Matrix:\n", conf_mat_rf)
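
# Added sketch: the gap between train and test accuracy above points to
# overfitting; a quick 5-fold cross-validation gives a more stable estimate
# of generalization before tuning.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(f"RF 5-fold CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")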
############## THE DEFAULT RANDOM FOREST SHOWED SIGNS OF OVERFITTING ##############
############## FINE-TUNING THE RANDOM FOREST TO REDUCE OVERFITTING   ##############

# Initialize and train an optimized Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=100,         # Number of trees
    max_depth=10,             # Limit depth to avoid overfitting
    min_samples_split=5,      # Minimum samples required to split a node
    min_samples_leaf=4,       # Minimum samples required in each leaf node
    class_weight='balanced',  # Handle class imbalance
    random_state=42
)

rf_model.fit(X_train_scaled, y_train)  # Train the model

# Make predictions and evaluate the model
y_pred_rf = rf_model.predict(X_test_scaled)

# Accuracy
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {acc_rf:.4f}")

# Confusion Matrix
conf_mat_rf = confusion_matrix(y_test, y_pred_rf)
print("Random Forest Confusion Matrix:\n", conf_mat_rf)

# Train vs Test scores (to check overfitting)
print("Train Accuracy:", rf_model.score(X_train_scaled, y_train))
print("Test Accuracy :", rf_model.score(X_test_scaled, y_test))

# Plot Confusion Matrix for Random Forest
plt.figure(figsize=(6, 4))
sns.heatmap(conf_mat_rf, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Purchase (0)', 'Purchase (1)'],
            yticklabels=['No Purchase (0)', 'Purchase (1)'])
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()
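
# Added sketch: the fitted Random Forest exposes impurity-based feature
# importances, which indicate which inputs drive the purchase predictions.
importances = pd.Series(rf_model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
--------------------------------------------------------------------------------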