├── INT375 Report.pdf
├── README.md
└── customersegmentation.py


/INT375 Report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shivakumar9052/CustomerSigmentation/HEAD/INT375 Report.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Project Summary

Understanding customer behavior is critical for any business. In this project, I performed:

- 📊 **Exploratory Data Analysis (EDA)** to explore trends and transaction patterns
- 📦 **RFM analysis** to assign behavior-based scores to customers
- 🎯 **Segmentation** into categories like Champions, Loyal, At Risk, Lost
- 📈 **Visualizations** to present insights clearly and drive actionable strategies

---

## 🧰 Tools & Libraries Used

- Python 3.11
- Pandas, NumPy for data manipulation
- Seaborn, Matplotlib for visualization
- scikit-learn for K-Means clustering
- Jupyter Notebook for development

---

## 📂 Dataset

**Source:** [UCI Repository - Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/online+retail)

The dataset contains over 541,000 transactions from a UK-based online retailer between 2010 and 2011.

## ✅ Key Features of the Project

### 🔍 Exploratory Data Analysis (EDA)

- Total spending and frequency per customer
- Canceled orders and basket size analysis
- Revenue trends over time and by country

### 📊 RFM (Recency, Frequency, Monetary) Analysis

- Recency: Days since last purchase
- Frequency: Number of purchases
- Monetary: Total amount spent
- Quantile-based scoring system (1 to 5)
- Customer segmentation based on combined RFM scores

## 📌 How to Run This Project

1. Clone the repository
   ```
   git clone https://github.com/your-username/customer-segmentation-rfm.git
   cd customer-segmentation-rfm
   ```

2. Install dependencies
   ```
   pip install -r requirements.txt
   ```

3. Run the Jupyter Notebook
   ```
   jupyter notebook CustomerSegmentation.ipynb
   ```

## Output Highlights

- Segmented over 4,000 customers
- Identified high-value customers for targeted campaigns
- Visual reports for business and marketing decisions

## 📚 References

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html#k-means)
- [Seaborn Documentation](https://seaborn.pydata.org/)
- [Matplotlib Documentation](https://matplotlib.org/stable/index.html)

--------------------------------------------------------------------------------
/customersegmentation.py:
--------------------------------------------------------------------------------
import warnings
from functools import reduce

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=FutureWarning)

# Bare expressions such as df.head() only render output in a notebook;
# wrap them in print() when running this file as a script.

raw_df = pd.read_excel('/Users/mshivakumar/Programming Files/INT375DSToolbox/Online Retail.xlsx')

df = raw_df.copy()
df.head()
df.info()

# ---- Cleaning and derived columns ----
df = df.dropna(subset=['CustomerID'])  # remove rows without a customer ID
df['CustomerID'] = df['CustomerID'].astype(int)
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']  # amount spent on each line item
df['IsCanceled'] = df['InvoiceNo'].astype(str).str.startswith('C')  # canceled invoices start with 'C'
df.info()

# ---- Total spending per customer ----
customer_spending = df.groupby('CustomerID')['TotalPrice'].sum().reset_index().rename(columns={'TotalPrice': 'TotalSpent'})
customer_spending

# ---- Number of distinct invoices per customer ----
orders_per_customer = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index().rename(columns={'InvoiceNo': 'TotalOrders'})
orders_per_customer

plt.figure(figsize=(10,6))
ax = sns.histplot(orders_per_customer['TotalOrders'], bins=30, kde=False, color='mediumseagreen')
plt.yscale('log')

# Label each non-empty bar with its count
for rect in ax.patches:
    height = rect.get_height()
    if height > 0:
        ax.text(
            rect.get_x() + rect.get_width()/2,
            height,
            f"{int(height)}",
            ha='center',
            va='bottom',
            fontsize=8
        )

plt.title('Distribution of Orders per Customer')
plt.xlabel('Total Orders')
plt.ylabel('Number of Customers (Log Scale)')
plt.tight_layout()
plt.show()

# ---- Basket size per order ----
df_clean = df[~df['InvoiceNo'].astype(str).str.startswith('C')]  # DataFrame without canceled invoices
basket_size = df_clean.groupby('InvoiceNo')['Quantity'].sum().reset_index().rename(columns={'Quantity': 'BasketSize'})
basket_size

# ---- Top 10 customers by spending ----
# nlargest already returns the rows sorted by TotalSpent in descending order.
top_customers = customer_spending.nlargest(10, 'TotalSpent').copy()
# Use string IDs so the x values and the `order` argument have the same type.
top_customers['CustomerID'] = top_customers['CustomerID'].astype(str)

plt.figure(figsize=(10,6))
ax = sns.barplot(
    data=top_customers,
    x='CustomerID',
    y='TotalSpent',
    palette='magma',
    order=top_customers['CustomerID'].tolist()
)

for bar in ax.patches:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, height, f'£{height:,.0f}', ha='center', va='bottom', fontsize=9)

plt.title('Top 10 Customers by Total Spending')
plt.xlabel('Customer ID')
plt.ylabel('Total Spending (£)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# ---- Canceled orders per customer ----
canceled_orders = df[df['IsCanceled']].groupby('CustomerID')['InvoiceNo'].nunique().reset_index().rename(columns={'InvoiceNo': 'CanceledOrders'})
canceled_orders

plt.figure(figsize=(10,6))
ax = sns.histplot(basket_size['BasketSize'], bins=50, color='skyblue')
plt.yscale('log')

for rect in ax.patches:
    height = rect.get_height()
    if height > 0:
        ax.text(
            rect.get_x() + rect.get_width()/2,
            height,
            f"{int(height)}",
            ha='center',
            va='bottom',
            fontsize=8)

plt.title('Distribution of Basket Size per Order (Log Scale)')
plt.xlabel('Number of Items per Order')
plt.ylabel('Number of Orders (Log Scale)')
plt.tight_layout()
plt.show()

# ---- Combine the per-customer tables ----
dfs = [customer_spending, orders_per_customer, canceled_orders]
customer_behavior = reduce(lambda left, right: pd.merge(left, right, on='CustomerID', how='outer'), dfs)

# Customers with no canceled invoices are missing after the outer merge; count them as 0
customer_behavior['CanceledOrders'] = customer_behavior['CanceledOrders'].fillna(0).astype(int)

customer_behavior = customer_behavior.dropna()

customer_behavior.head()

plt.figure(figsize=(10,6))
ax = sns.histplot(customer_behavior['CanceledOrders'], bins=20, kde=False, color='salmon')
plt.yscale('log')

# Leave headroom above the tallest bar for its label
max_height = max(rect.get_height() for rect in ax.patches)
plt.ylim(1, max_height * 1.5)

for rect in ax.patches:
    height = rect.get_height()
    if height > 0:
        ax.text(
            rect.get_x() + rect.get_width()/2,
            height,
            f"{int(height)}",
            ha='center',
            va='bottom',
            fontsize=9)

plt.title('Distribution of Canceled Orders per Customer (Log Scale)')
plt.xlabel('Number of Canceled Orders')
plt.ylabel('Number of Customers (Log Scale)')
plt.tight_layout()
plt.show()

# ---- Monthly revenue ----
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')

monthly_revenue = df.groupby('InvoiceMonth')['TotalPrice'].sum().reset_index()
monthly_revenue['InvoiceMonth'] = monthly_revenue['InvoiceMonth'].astype(str)

plt.figure(figsize=(14,6))
sns.lineplot(data=monthly_revenue, x='InvoiceMonth', y='TotalPrice', marker='o', color='darkblue')
plt.title('Monthly Revenue Over Time')
plt.xlabel('Month')
plt.ylabel('Revenue (£)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# ---- Revenue by country ----
country_revenue = df.groupby('Country')['TotalPrice'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=country_revenue.values, y=country_revenue.index, palette='crest')
plt.title('Top 10 Countries by Revenue')
plt.xlabel('Total Revenue (£)')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

country_avg_spending = df.groupby('Country').agg({'TotalPrice': 'sum', 'InvoiceNo': 'nunique'})
country_avg_spending['AvgSpendingPerOrder'] = country_avg_spending['TotalPrice'] / country_avg_spending['InvoiceNo']
top_avg_spending = country_avg_spending.sort_values('AvgSpendingPerOrder', ascending=False).head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_avg_spending['AvgSpendingPerOrder'], y=top_avg_spending.index, palette='flare')
plt.title('Top 10 Countries by Avg Spending per Order')
plt.xlabel('Avg Spending per Order (£)')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

# ---- RFM analysis ----
# Restart from the raw data. Canceled invoices are not filtered out here,
# so they are included in both Frequency and Monetary.
df = raw_df.copy()
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']

# Day after the last invoice, so the most recent customer has Recency = 1
reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,  # Recency: less is better
    'InvoiceNo': 'nunique',                                    # Frequency: more is better
    'TotalPrice': 'sum'                                        # Monetary: more is better
}).reset_index()
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

# Quintile scores; Recency labels are reversed because a smaller value is better.
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5, 4, 3, 2, 1])
# Rank first so that heavily tied Frequency values still split into five bins.
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1, 2, 3, 4, 5])

rfm['RFM_Segment'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
# qcut returns categorical columns; cast to int before summing them.
rfm['RFM_Score'] = rfm[['R_Score', 'F_Score', 'M_Score']].astype(int).sum(axis=1)
rfm.head()

plt.figure(figsize=(11,6))
sns.histplot(rfm['RFM_Score'], bins=10, kde=True, color='dodgerblue')
plt.title('Distribution of RFM Scores')
plt.xlabel('RFM Score')
plt.ylabel('Number of Customers')
plt.xticks(range(rfm['RFM_Score'].min(), rfm['RFM_Score'].max() + 1))
plt.yticks(range(0, 900, 100))
plt.tight_layout()
plt.show()

def segment_customer(row):
    """Map one customer's R, F and combined RFM scores to a named segment."""
    r = int(row['R_Score'])
    f = int(row['F_Score'])
    score = row['RFM_Score']

    # Rules are checked in order; the first match wins.
    if score >= 13:
        return 'Champions'
    elif score >= 10:
        return 'Loyal'
    elif r >= 4 and f >= 3:
        return 'Potential Loyalist'
    elif r >= 3 and f <= 2:
        return 'Need Attention'
    elif r <= 2 and f >= 3:
        return 'At Risk'
    elif r == 1 and f == 1:
        return 'Lost'
    else:
        return 'Others'

rfm['Segment'] = rfm.apply(segment_customer, axis=1)

segment_counts = rfm['Segment'].value_counts().reset_index()
segment_counts.columns = ['Segment', 'Count']

plt.figure(figsize=(10, 6))
sns.barplot(data=segment_counts, x='Segment', y='Count', palette='Set2')
plt.title('Customer Segments Count')
plt.xlabel('Customer Segment')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# ---- K-Means: elbow method on standardized RFM values ----
rfm_for_clustering = rfm[['Recency', 'Frequency', 'Monetary']]

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_for_clustering)

inertia = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    kmeans.fit(rfm_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K_range, inertia, marker='o', linestyle='--', color='teal')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.title('Elbow Method - Optimal k')
plt.xticks(K_range)
plt.grid(True)
plt.tight_layout()
plt.show()

--------------------------------------------------------------------------------
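The rule cascade in `segment_customer` can be explored without loading the dataset. The sketch below is not part of the repository: `segment` and `coverage` are hypothetical names introduced here for illustration, and the function assumes plain integer scores from 1 to 5 rather than a DataFrame row. It mirrors the same if/elif order and counts how many of the 125 possible (R, F, M) score combinations fall into each segment.

```python
from collections import Counter
from itertools import product


def segment(r: int, f: int, m: int) -> str:
    """Hypothetical stand-alone mirror of segment_customer for integer scores 1..5."""
    score = r + f + m  # same combined score as RFM_Score
    if score >= 13:
        return "Champions"
    elif score >= 10:
        return "Loyal"
    elif r >= 4 and f >= 3:
        return "Potential Loyalist"
    elif r >= 3 and f <= 2:
        return "Need Attention"
    elif r <= 2 and f >= 3:
        return "At Risk"
    elif r == 1 and f == 1:
        return "Lost"
    return "Others"


# Enumerate every (R, F, M) combination and tally the resulting segments.
coverage = Counter(segment(r, f, m) for r, f, m in product(range(1, 6), repeat=3))

for name, n_combos in coverage.most_common():
    print(f"{name:<20}{n_combos:>4} of 125 score combinations")
```

Because the rules are evaluated in order, some combinations fall through to `Others`: for example, `segment(3, 3, 1)` has a combined score of 7 and matches none of the R/F conditions. A check like this may help when adjusting the thresholds.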