├── README.md
└── Project Code.py

/README.md:
--------------------------------------------------------------------------------
# Predicting-Purchase-Intent-Ecommerce

### Project Overview
This project predicts purchase intent for an e-commerce platform using machine learning models. The goal is to analyze user behavior and predict whether a user will make a purchase based on various features, including interaction duration, page views, and other user characteristics.

### Techniques Used:
1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Random Forest Classifier

### Data Description
The dataset used in this project is the Online Shoppers Intention Dataset from the UCI Machine Learning Repository. It contains information about user sessions on an e-commerce website, with the target variable being whether or not a user made a purchase.

### Key Features:
1. Administrative Duration: Time spent on administrative pages.
2. ProductRelated Duration: Time spent on product-related pages.
3. Informational Duration: Time spent on informational pages.
4. Bounce Rate: Percentage of sessions where users leave after viewing only one page.
5. Exit Rate: Percentage of sessions that end on a given page.
6. Revenue: Target variable indicating whether the user made a purchase (0 = No, 1 = Yes).

### Objectives
Objective 1: Analyze the influence of page interaction durations on user purchase behavior.

Objective 2: Evaluate the effect of bounce rate and exit rate on conversion likelihood across sessions.

Objective 3: Compare purchasing behavior across months and special days to identify seasonal trends.

Objective 4: Investigate the relationship between visitor type and purchasing behavior.

Objective 5: Examine regional and traffic type variations in user purchase behavior.

### Data Preprocessing
Before applying machine learning models, the following preprocessing steps were performed:
1. Removed duplicate rows from the dataset.
2. Filtered out sessions where all durations (administrative, informational, product-related) were zero.
3. Handled missing values by imputing the column mean for numeric features and the mode for categorical features.
4. Encoded categorical variables (VisitorType and Month) using one-hot encoding.
5. Scaled the features using MinMaxScaler to normalize the dataset and improve model performance.

### Machine Learning Models
The project applies three machine learning models to predict purchase intent:

1. Logistic Regression: A binary classification model used to predict the likelihood of a user making a purchase.
2. K-Nearest Neighbors (KNN): A non-parametric method that classifies a data point based on its closest neighbors.
3. Random Forest Classifier: An ensemble learning method that combines multiple decision trees to predict the target variable.

Each model is evaluated using accuracy and confusion matrix metrics.

### Model Evaluation
The models were evaluated on their accuracy using the test set:

1. Logistic Regression: ~80% accuracy
2. KNN Classifier: ~84% accuracy
3. Random Forest: ~86% accuracy

Additionally, confusion matrices were used to assess performance and identify misclassifications.
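For reference, accuracy can be read directly off a confusion matrix, and because purchases are the minority class in this dataset, purchase-class recall is worth checking alongside accuracy. A minimal sketch (the matrix values below are illustrative placeholders, not results from this project):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true labels, columns = predictions
conf_mat = np.array([[900, 60],    # true No Purchase: 900 correct, 60 false alarms
                     [ 80, 160]])  # true Purchase: 80 missed, 160 caught

tn, fp, fn, tp = conf_mat.ravel()
accuracy = (tp + tn) / conf_mat.sum()  # share of all sessions classified correctly
recall = tp / (tp + fn)                # share of actual purchases the model catches
print(f"Accuracy: {accuracy:.3f}, Purchase recall: {recall:.3f}")
```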
### Contributing
Feel free to fork this repository, make improvements, and submit pull requests. Contributions, suggestions, and feedback are always welcome!
--------------------------------------------------------------------------------

/Project Code.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier


# Load Dataset (raw string avoids backslash-escape issues in the Windows path)
df = pd.read_csv(r"E:\online_shoppers_intention.csv")
print(df.head())

# Remove Duplicate Rows
df.drop_duplicates(inplace=True)

# Remove sessions where all durations are 0 (not useful for behavior analysis)
df = df[~((df['Administrative_Duration'] == 0) &
          (df['Informational_Duration'] == 0) &
          (df['ProductRelated_Duration'] == 0))].copy()

# Fill Missing Values (none in this dataset, but safe to keep)
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype in ['int64', 'float64']:
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

# Encode Booleans + Target
df['Revenue'] = df['Revenue'].astype(int)
df['Weekend'] = df['Weekend'].astype(int)

# Encode Categorical Columns
df = pd.get_dummies(df, columns=['Month', 'VisitorType'], drop_first=True)

# Create New Features (optional but useful for EDA/insights)
df['Total_Duration'] = df['Administrative_Duration'] + df['Informational_Duration'] + df['ProductRelated_Duration']
df['Total_Pages'] = df['Administrative'] + df['Informational'] + df['ProductRelated']

# Feature Scaling with MinMaxScaler
cols_to_scale = ['Administrative_Duration', 'Informational_Duration', 'ProductRelated_Duration',
                 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Total_Duration']
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Final Info Check
df.info()
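
# Quick class-balance check (added sketch): Revenue is imbalanced, with
# purchases the minority class, which is why class_weight='balanced' is used
# for the models below where available.
print(df['Revenue'].value_counts(normalize=True))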
################################### OBJECTIVES ##################################

# Objective 1: Analyze the influence of page interaction durations on user purchase behavior.
# Violin plots of page interaction durations vs Revenue show the spread and median of each duration.

# Ensure Revenue is numeric
df['Revenue'] = df['Revenue'].astype(int)
# Set a clean style
sns.set(style="whitegrid")

# 1. ProductRelated_Duration vs Revenue
plt.figure(figsize=(8, 5))
sns.violinplot(x='Revenue', y='ProductRelated_Duration', data=df, palette='pastel')
plt.title('ProductRelated Duration vs Revenue')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('ProductRelated Duration')
plt.tight_layout()
plt.show()

# 2. Administrative_Duration vs Revenue
plt.figure(figsize=(8, 5))
sns.violinplot(x='Revenue', y='Administrative_Duration', data=df, palette='Set2')
plt.title('Administrative Duration vs Revenue')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('Administrative Duration')
plt.tight_layout()
plt.show()

# 3. Informational_Duration vs Revenue
plt.figure(figsize=(8, 5))
sns.violinplot(x='Revenue', y='Informational_Duration', data=df, palette='coolwarm')
plt.title('Informational Duration vs Revenue')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('Informational Duration')
plt.tight_layout()
plt.show()


# Objective 2: Evaluate the effect of bounce rate and exit rate on conversion likelihood across sessions.
# Scatterplot: BounceRates vs ExitRates colored by Revenue
sns.scatterplot(x='BounceRates', y='ExitRates', hue='Revenue', data=df)
plt.title('Bounce Rate vs Exit Rate by Purchase')
plt.xlabel('Bounce Rate')
plt.ylabel('Exit Rate')
plt.legend(title='Purchase Made')
plt.grid(True)
plt.show()


# Objective 3: Compare purchasing behavior across months and special days to identify seasonal shopping trends.

print(df.columns)
# Reload the original dataset (Month was one-hot encoded in df)
original_df = pd.read_csv(r"E:\online_shoppers_intention.csv")

# Define correct month order (the dataset spells June out in full)
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Filter only rows where a purchase was made
purchased = original_df[original_df['Revenue'] == True]

# Plot purchases by month
sns.countplot(x='Month', data=purchased, order=month_order)
plt.title('Purchases by Month')
plt.xlabel('Month')
plt.ylabel('Number of Purchases')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Filter only rows where no purchase was made
not_purchased = original_df[original_df['Revenue'] == False]

# Plot non-purchases by month
sns.countplot(x='Month', data=not_purchased, order=month_order)
plt.title('Non-Purchases by Month')
plt.xlabel('Month')
plt.ylabel('Number of Non-Purchases')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Prepare purchase counts per month
monthly_trend = original_df[original_df['Revenue'] == True]['Month'].value_counts().reindex(month_order).fillna(0)

# Line graph for trend
monthly_trend.plot(kind='line', marker='o', linestyle='-', color='green', figsize=(10, 5))
plt.title('Monthly Purchase Trend')
plt.xlabel('Month')
plt.ylabel('Number of Purchases')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

# Special Day: Purchase vs No Purchase
plt.figure(figsize=(6, 4))
sns.barplot(x='Revenue', y='SpecialDay', data=df, palette='pastel')
plt.title('Average SpecialDay Proximity by Purchase Outcome')
plt.xlabel('Purchase Made (0 = No, 1 = Yes)')
plt.ylabel('Mean SpecialDay Score')
plt.tight_layout()
plt.show()
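
# Added sketch: raw counts are skewed by traffic volume, so a per-month
# conversion rate (purchases / sessions) is a useful companion view of
# seasonality. Months absent from the data simply plot as gaps.
monthly_conversion = original_df.groupby('Month')['Revenue'].mean().reindex(month_order)
monthly_conversion.plot(kind='bar', color='teal', figsize=(10, 5))
plt.title('Conversion Rate by Month')
plt.xlabel('Month')
plt.ylabel('Share of Sessions Ending in Purchase')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()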
# Objective 4: Investigate the relationship between visitor type and purchasing behavior.
# Pie Chart: Visitor Type Distribution

# Count of each visitor type
visitor_counts = original_df['VisitorType'].value_counts()

# Plot pie chart
plt.pie(visitor_counts, labels=visitor_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Visitor Type Distribution')
plt.axis('equal')  # Ensures the pie is a circle
plt.tight_layout()
plt.show()

# Countplot: Visitor Type vs Revenue (Purchasing Behavior)
sns.countplot(x='VisitorType', hue='Revenue', data=original_df)
plt.title('Visitor Type vs Purchase Behavior')
plt.xlabel('Visitor Type')
plt.ylabel('Number of Sessions')
plt.legend(title='Purchase Made')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()


# Objective 5: Investigate regional and traffic type variations in user purchase behavior.
# Create a pivot table of average conversion rate by Region and TrafficType
heat_data = pd.crosstab(df['Region'], df['TrafficType'], values=df['Revenue'], aggfunc='mean').fillna(0)

# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(heat_data, annot=True, cmap='YlGnBu', fmt=".2f")
plt.title('Conversion Rate by Region and Traffic Type')
plt.xlabel('Traffic Type')
plt.ylabel('Region')
plt.tight_layout()
plt.show()



################################## ML MODELS #######################################

# 1. Ensure Revenue is encoded as integer (already done during preprocessing; harmless to repeat)
df['Revenue'] = df['Revenue'].astype(int)

# 2. 'Month' and 'VisitorType' were already one-hot encoded during preprocessing,
#    so no further encoding is needed here (re-running get_dummies on the encoded
#    frame would raise a KeyError).

# 3. Define features and target
X = df.drop('Revenue', axis=1)  # All input features
y = df['Revenue']               # Target label

# 4. Train-Test Split (80-20) with stratification to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 5. Feature Scaling with MinMaxScaler (fit on the training split only to avoid leakage)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
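
# For reference (added sketch): MinMaxScaler maps each feature to [0, 1] via
# x_scaled = (x - x_min) / (x_max - x_min), with x_min and x_max computed on
# the training split. A manual check against one column with nonzero spread:
col = X_train['ProductRelated'].astype(float)
manual = (col - col.min()) / (col.max() - col.min())
print(manual.describe())  # min 0, max 1, matching the scaler's output for this column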
#################################### Logistic Regression ###########################################

# 6. Initialize Logistic Regression with class_weight='balanced' to handle class imbalance
log_model = LogisticRegression(max_iter=1000, class_weight='balanced')

# 7. Train the model
log_model.fit(X_train_scaled, y_train)

# 8. Make predictions
y_pred = log_model.predict(X_test_scaled)

# 9. Evaluate the model
acc = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)

print(f"Logistic Regression Accuracy: {acc:.4f}")
print("Logistic Regression Train Accuracy:", log_model.score(X_train_scaled, y_train))
print("Logistic Regression Test Accuracy :", log_model.score(X_test_scaled, y_test))
print("Confusion Matrix:\n", conf_mat)

# Plot Confusion Matrix for Logistic Regression
plt.figure(figsize=(6, 4))
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Purchase (0)', 'Purchase (1)'],
            yticklabels=['No Purchase (0)', 'Purchase (1)'])
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()

######################################### KNN MODEL ###############################################

# Initialize and train the KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)  # n_neighbors is worth experimenting with
knn_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_knn = knn_model.predict(X_test_scaled)

# Calculate accuracy
acc_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {acc_knn:.4f}")

print("KNN Train Accuracy:", knn_model.score(X_train_scaled, y_train))
print("KNN Test Accuracy :", knn_model.score(X_test_scaled, y_test))

# Confusion Matrix
conf_mat_knn = confusion_matrix(y_test, y_pred_knn)
print("KNN Confusion Matrix:\n", conf_mat_knn)

# Plot Confusion Matrix for KNN
plt.figure(figsize=(6, 4))
sns.heatmap(conf_mat_knn, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Purchase (0)', 'Purchase (1)'],
            yticklabels=['No Purchase (0)', 'Purchase (1)'])
plt.title('Confusion Matrix - KNN Classifier')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()

############################# RANDOM FOREST ##############################################

# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')  # class_weight='balanced' to handle imbalance
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Train vs Test scores
print("Train:", rf_model.score(X_train_scaled, y_train))
print("Test :", rf_model.score(X_test_scaled, y_test))

# Calculate accuracy
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {acc_rf:.4f}")

# Confusion Matrix
conf_mat_rf = confusion_matrix(y_test, y_pred_rf)
print("Random Forest Confusion Matrix:\n", conf_mat_rf)
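
# Added sketch: the gap between train and test accuracy above points to
# overfitting; a quick 5-fold cross-validation gives a more stable estimate
# of generalization before tuning.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(f"RF 5-fold CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")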
############## THE DEFAULT RANDOM FOREST SHOWED SIGNS OF OVERFITTING ##############
############## FINE-TUNING THE RANDOM FOREST TO REDUCE OVERFITTING   ##############

# Initialize and train an optimized Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=100,         # Number of trees
    max_depth=10,             # Limit depth to avoid overfitting
    min_samples_split=5,      # Minimum samples required to split a node
    min_samples_leaf=4,       # Minimum samples required in each leaf node
    class_weight='balanced',  # Handle class imbalance
    random_state=42
)

rf_model.fit(X_train_scaled, y_train)  # Train the model

# Make predictions and evaluate the model
y_pred_rf = rf_model.predict(X_test_scaled)

# Accuracy
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {acc_rf:.4f}")

# Confusion Matrix
conf_mat_rf = confusion_matrix(y_test, y_pred_rf)
print("Random Forest Confusion Matrix:\n", conf_mat_rf)

# Train vs Test scores (to check overfitting)
print("Train Accuracy:", rf_model.score(X_train_scaled, y_train))
print("Test Accuracy :", rf_model.score(X_test_scaled, y_test))

# Plot Confusion Matrix for Random Forest
plt.figure(figsize=(6, 4))
sns.heatmap(conf_mat_rf, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Purchase (0)', 'Purchase (1)'],
            yticklabels=['No Purchase (0)', 'Purchase (1)'])
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()
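
# Added sketch: the fitted Random Forest exposes impurity-based feature
# importances, which indicate which inputs drive the purchase predictions.
importances = pd.Series(rf_model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
--------------------------------------------------------------------------------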