├── INT375 Report.pdf
├── README.md
└── customersegmentation.py


/INT375 Report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shivakumar9052/CustomerSigmentation/HEAD/INT375 Report.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Project Summary
 2 | 
 3 | Understanding customer behavior is critical for any business. In this project, I performed:
 4 | 
 5 | - 📊 **Exploratory Data Analysis (EDA)** to explore trends and transaction patterns
 6 | - 📦 **RFM analysis** to assign behavior-based scores to customers
 7 | - 🎯 **Segmentation** into categories like Champions, Loyal, At Risk, Lost
 8 | - 📈 **Visualizations** to present insights clearly and drive actionable strategies
 9 | 
10 | ---
11 | 
12 | ## 🧰 Tools & Libraries Used
13 | 
14 | - Python 3.11
15 | - Pandas, NumPy for data manipulation
16 | - Seaborn, Matplotlib for visualization
17 | - Jupyter Notebook for development
18 | 
19 | ---
20 | 
21 | ## 📂 Dataset
22 | 
23 | **Source:** [UCI Repository - Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/online+retail)
24 | 
25 | The dataset contains 541,909 transaction line items from a UK-based online retailer, recorded between December 2010 and December 2011.
26 | 
27 | ## ✅ Key Features of the Project
28 | 
29 | ## 🔍 Exploratory Data Analysis (EDA)
30 | 
31 | - Total spending and frequency per customer
32 | - Cancelled orders and basket size analysis
33 | - Revenue trends over time and by country
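
The per-customer metrics listed above can be sketched on a handful of rows. This is a minimal illustration, not the project's full pipeline: the transactions are made-up values, and only the column names (`InvoiceNo`, `CustomerID`, `Quantity`, `UnitPrice`) and the "invoice number starts with `C`" cancellation convention are taken from `customersegmentation.py`.

```python
import pandas as pd

# Toy transactions (made-up values) using the Online Retail column names.
toy = pd.DataFrame({
    "InvoiceNo":  ["536365", "536365", "536366", "C536367", "536368"],
    "CustomerID": [17850, 17850, 17850, 13047, 13047],
    "Quantity":   [6, 2, 4, -1, 10],
    "UnitPrice":  [2.5, 5.0, 1.0, 3.0, 0.5],
})

# Amount spent per line item; cancellations carry negative quantities.
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]
# Canceled invoices are marked by an invoice number starting with "C".
toy["IsCanceled"] = toy["InvoiceNo"].str.startswith("C")

# Total spending and number of distinct invoices per customer.
summary = toy.groupby("CustomerID").agg(
    TotalSpent=("TotalPrice", "sum"),
    TotalOrders=("InvoiceNo", "nunique"),
)

# Distinct canceled invoices per customer (0 for customers with none).
canceled = toy[toy["IsCanceled"]].groupby("CustomerID")["InvoiceNo"].nunique()
summary["CanceledOrders"] = canceled.reindex(summary.index, fill_value=0)

# Basket size: total items per non-canceled invoice.
basket_size = toy[~toy["IsCanceled"]].groupby("InvoiceNo")["Quantity"].sum()

print(summary)
print(basket_size)
```

Note that `TotalOrders` counts canceled invoices too, which matches how the script counts orders; filter on `IsCanceled` first if you want completed orders only.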
34 | 
35 | ## 📊 RFM (Recency, Frequency, Monetary) Analysis
36 | 
37 | - Recency: Days since last purchase
38 | - Frequency: Number of purchases
39 | - Monetary: Total amount spent
40 | - Quantile-based scoring system (1 to 5)
41 | - Customer segmentation based on combined RFM scores
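
As a rough sketch of the scoring and segmentation steps above, the example below scores ten hypothetical customers. The RFM values are invented for illustration; the quintile scoring with `pd.qcut` and the segment thresholds mirror the rules in `customersegmentation.py`, and the helper name `segment_customer` is reused from that script.

```python
import pandas as pd

# Hypothetical per-customer RFM values (made up for illustration).
rfm = pd.DataFrame({
    "CustomerID": range(1, 11),
    "Recency":   [1, 5, 10, 20, 30, 45, 60, 90, 180, 365],  # days since last purchase
    "Frequency": [12, 9, 7, 6, 5, 4, 3, 2, 1, 1],            # distinct invoices
    "Monetary":  [5000, 3200, 2100, 1500, 900, 700, 400, 250, 120, 40],
})

# Lower recency is better, so the labels run from 5 down to 1.
rfm["R_Score"] = pd.qcut(rfm["Recency"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
# Ranking first breaks ties, which keeps the quantile bin edges unique.
rfm["F_Score"] = pd.qcut(
    rfm["Frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]
).astype(int)
rfm["M_Score"] = pd.qcut(rfm["Monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Combined score ranges from 3 (worst) to 15 (best).
rfm["RFM_Score"] = rfm[["R_Score", "F_Score", "M_Score"]].sum(axis=1)


def segment_customer(row):
    """Map one customer's scores to a segment label (rules checked top-down)."""
    r, f, score = row["R_Score"], row["F_Score"], row["RFM_Score"]
    if score >= 13:
        return "Champions"
    if score >= 10:
        return "Loyal"
    if r >= 4 and f >= 3:
        return "Potential Loyalist"
    if r >= 3 and f <= 2:
        return "Need Attention"
    if r <= 2 and f >= 3:
        return "At Risk"
    if r == 1 and f == 1:
        return "Lost"
    return "Others"


rfm["Segment"] = rfm.apply(segment_customer, axis=1)
print(rfm[["CustomerID", "R_Score", "F_Score", "M_Score", "RFM_Score", "Segment"]])
```

Because the rules are evaluated in order, a customer with a combined score of 10 or more is always labelled Champions or Loyal before the recency and frequency rules are consulted.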
42 | 
43 | ## 📌 How to Run This Project
44 | 
45 | 1. Clone the repository  
46 |    ```
47 |    git clone https://github.com/shivakumar9052/CustomerSigmentation.git
48 |    cd CustomerSigmentation
49 |    ```
50 | 
51 | 2. Install dependencies  
52 |    ```
53 |    pip install numpy pandas matplotlib seaborn scikit-learn openpyxl
54 |    ```
55 | 
56 | 3. Run the analysis script (first point the dataset path in `customersegmentation.py` to your copy of `Online Retail.xlsx`)  
57 |    ```
58 |    python customersegmentation.py
59 |    ```
60 | 
61 | ## Output Highlights
62 | 
63 | - Segmented over 4,000 customers
64 | - Identified high-value customers for targeted campaigns
65 | - Visual reports for business and marketing decisions
66 | 
67 | ## 📚 References
68 | - [Pandas Documentation](https://pandas.pydata.org/docs/)  
69 | - [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html#k-means)  
70 | - [Seaborn Documentation](https://seaborn.pydata.org/)  
71 | - [Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
72 | 


--------------------------------------------------------------------------------
/customersegmentation.py:
--------------------------------------------------------------------------------
  1 | 
  2 | import numpy as np
  3 | import pandas as pd
  4 | import matplotlib.pyplot as plt
  5 | import seaborn as sns
  6 | from sklearn.cluster import KMeans
  7 | import warnings
  8 | warnings.filterwarnings("ignore", category=FutureWarning)
  9 | 
 10 | raw_df = pd.read_excel('Online Retail.xlsx')  # adjust the path to your local copy of the dataset
 11 | 
 12 | df = raw_df.copy()
 13 | df.head()
 14 | df.info()
 15 | 
 16 | df = df.dropna(subset = ['CustomerID'])   # remove rows without customer ID
 17 | df['CustomerID'] = df['CustomerID'].astype(int)
 18 | df['TotalPrice'] = df['Quantity'] * df['UnitPrice']  # amount spent on each line item
 19 | df['IsCanceled'] = df['InvoiceNo'].astype(str).str.startswith('C')
 20 | df.info()
 21 | 
 22 | customer_spending = df.groupby('CustomerID')['TotalPrice'].sum().reset_index().rename(columns={'TotalPrice': 'TotalSpent'})
 23 | customer_spending
 24 | 
 25 | orders_per_customer = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index().rename(columns={'InvoiceNo': 'TotalOrders'})
 26 | orders_per_customer
 27 | 
 28 | plt.figure(figsize=(10,6))
 29 | ax = sns.histplot(orders_per_customer['TotalOrders'], bins=30, kde=False, color='mediumseagreen')
 30 | plt.yscale('log')
 31 | 
 32 | 
 33 | for rect in ax.patches:
 34 |     height = rect.get_height()
 35 |     if height > 0:
 36 |         ax.text(
 37 |             rect.get_x() + rect.get_width()/2,
 38 |             height,
 39 |             f"{int(height)}",
 40 |             ha='center',
 41 |             va='bottom',
 42 |             fontsize=8
 43 |         )
 44 | 
 45 | plt.title('Distribution of Orders per Customer')
 46 | plt.xlabel('Total Orders')
 47 | plt.ylabel('Number of Customers (Log Scale)')
 48 | plt.tight_layout()
 49 | plt.show()
 50 | 
 51 | df_clean = df[~df['IsCanceled']]  # DataFrame without canceled orders
 52 | basket_size = df_clean.groupby('InvoiceNo')['Quantity'].sum().reset_index().rename(columns={'Quantity': 'BasketSize'})
 53 | basket_size
 54 | 
 55 | top_customers = customer_spending.nlargest(10, 'TotalSpent').astype({'CustomerID': str})  # IDs as category labels
 56 | plt.figure(figsize=(10,6))
 57 | ax = sns.barplot(
 58 |     data=top_customers,
 59 |     x='CustomerID',
 60 |     y='TotalSpent',
 61 |     palette='magma',
 62 |     order=top_customers['CustomerID'].tolist()  # nlargest already returns rows in descending order
 63 | )
 64 | 
 65 | for bar in ax.patches:
 66 |     height = bar.get_height()
 67 |     ax.text(bar.get_x() + bar.get_width()/2, height, f'£{height:,.0f}', ha='center', va='bottom', fontsize=9)
 68 | 
 69 | plt.title('Top 10 Customers by Total Spending')
 70 | plt.xlabel('Customer ID')
 71 | plt.ylabel('Total Spending (£)')
 72 | plt.xticks(rotation=45)
 73 | plt.tight_layout()
 74 | plt.show()
 75 | 
 76 | canceled_orders = df[df['IsCanceled']].groupby('CustomerID')['InvoiceNo'].nunique().reset_index().rename(columns={'InvoiceNo': 'CanceledOrders'})
 77 | canceled_orders
 78 | 
 79 | plt.figure(figsize=(10,6))
 80 | ax = sns.histplot(basket_size['BasketSize'], bins=50, color='skyblue')
 81 | plt.yscale('log')
 82 | 
 83 | for rect in ax.patches:
 84 |     height = rect.get_height()
 85 |     if height > 0:
 86 |         ax.text(
 87 |             rect.get_x() + rect.get_width()/2,
 88 |             height,
 89 |             f"{int(height)}",
 90 |             ha='center',
 91 |             va='bottom',
 92 |             fontsize=8)
 93 | 
 94 | plt.title('Distribution of Basket Size per Order (Log Scale)')
 95 | plt.xlabel('Number of Items per Order')
 96 | plt.ylabel('Number of Orders (Log Scale)')
 97 | plt.tight_layout()
 98 | plt.show()
 99 | 
100 | from functools import reduce
101 | dfs = [customer_spending, orders_per_customer, canceled_orders]
102 | customer_behavior = reduce(lambda left, right: pd.merge(left, right, on='CustomerID', how='outer'), dfs)
103 | 
104 | customer_behavior['CanceledOrders'] = customer_behavior['CanceledOrders'].fillna(0).astype(int)
105 | 
106 | customer_behavior = customer_behavior.dropna()
107 | 
108 | customer_behavior.head()
109 | 
110 | plt.figure(figsize=(10,6))
111 | ax = sns.histplot(customer_behavior['CanceledOrders'], bins=20, kde=False, color='salmon')
112 | plt.yscale('log')
113 | 
114 | max_height = max([rect.get_height() for rect in ax.patches])
115 | plt.ylim(1, max_height * 1.5)
116 | 
117 | for rect in ax.patches:
118 |     height = rect.get_height()
119 |     if height > 0:
120 |         plt.text(
121 |             rect.get_x() + rect.get_width()/2,
122 |             height,
123 |             f"{int(height)}",
124 |             ha='center',
125 |             va='bottom',
126 |             fontsize=9)
127 | 
128 | plt.title('Distribution of Canceled Orders per Customer (Log Scale)')
129 | plt.xlabel('Number of Canceled Orders')
130 | plt.ylabel('Number of Customers (Log Scale)')
131 | plt.tight_layout()
132 | plt.show()
133 | 
134 | # TotalPrice already exists on df; derive the invoice month for the revenue trend
135 | df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')
136 | 
137 | monthly_revenue = df.groupby('InvoiceMonth')['TotalPrice'].sum().reset_index()
138 | monthly_revenue['InvoiceMonth'] = monthly_revenue['InvoiceMonth'].astype(str)
139 | 
140 | plt.figure(figsize=(14,6))
141 | sns.lineplot(data=monthly_revenue, x='InvoiceMonth', y='TotalPrice', marker='o', color='darkblue')
142 | plt.title('Monthly Revenue Over Time')
143 | plt.xlabel('Month')
144 | plt.ylabel('Revenue (£)')
145 | plt.xticks(rotation=45)
146 | plt.tight_layout()
147 | plt.show()
148 | 
149 | country_revenue = df.groupby('Country')['TotalPrice'].sum().sort_values(ascending=False).head(10)
150 | 
151 | plt.figure(figsize=(12,6))
152 | sns.barplot(x=country_revenue.values, y=country_revenue.index, palette='crest')
153 | plt.title('Top 10 Countries by Revenue')
154 | plt.xlabel('Total Revenue (£)')
155 | plt.ylabel('Country')
156 | plt.tight_layout()
157 | plt.show()
158 | 
159 | country_avg_spending = df.groupby('Country').agg({'TotalPrice': 'sum', 'InvoiceNo': 'nunique'})
160 | country_avg_spending['AvgSpendingPerOrder'] = country_avg_spending['TotalPrice'] / country_avg_spending['InvoiceNo']
161 | top_avg_spending = country_avg_spending.sort_values('AvgSpendingPerOrder', ascending=False).head(10)
162 | 
163 | plt.figure(figsize=(12,6))
164 | sns.barplot(x=top_avg_spending['AvgSpendingPerOrder'], y=top_avg_spending.index, palette='flare')
165 | plt.title('Top 10 Countries by Avg Spending per Order')
166 | plt.xlabel('Avg Spending per Order (£)')
167 | plt.ylabel('Country')
168 | plt.tight_layout()
169 | plt.show()
170 | 
171 | df = raw_df.copy()
172 | df = df.dropna(subset=['CustomerID']).astype({'CustomerID': int})  # keep customer IDs as integers
173 | df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
174 | 
175 | reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
176 | 
177 | rfm = df.groupby('CustomerID').agg({
178 |     'InvoiceDate': lambda x: (reference_date - x.max()).days, # Recency : less is better
179 |     'InvoiceNo': 'nunique', # Frequency : more is better
180 |     'TotalPrice': 'sum' # Monetary : more is better
181 |     }).reset_index()
182 | rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
183 | 
184 | rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels = [5, 4, 3, 2, 1])  # lower recency earns a higher score
185 | rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method ='first'), 5, labels = [1, 2, 3, 4, 5])  # ranking breaks ties so bin edges stay unique
186 | rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels = [1, 2, 3, 4, 5])
187 | 
188 | rfm['RFM_Segment'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
189 | rfm['RFM_Score'] = rfm[['R_Score', 'F_Score', 'M_Score']].astype(int).sum(axis = 1)  # cast categorical scores before summing
190 | rfm.head()
191 | 
192 | plt.figure(figsize=(11,6))
193 | sns.histplot(rfm['RFM_Score'], bins=10, kde = True, color='dodgerblue')
194 | plt.title('Distribution of RFM Scores')
195 | plt.xlabel('RFM Score')
196 | plt.ylabel('Number of Customers')
197 | plt.xticks(range(rfm['RFM_Score'].min(), rfm['RFM_Score'].max() + 1))
198 | plt.yticks(range(0, 900, 100))
199 | plt.tight_layout()
200 | plt.show()
201 | 
202 | def segment_customer(row):
203 |     r = int(row['R_Score'])
204 |     f = int(row['F_Score'])
205 |     score = row['RFM_Score']
206 | 
207 |     if score >= 13:
208 |         return 'Champions'
209 |     elif score >= 10:
210 |         return 'Loyal'
211 |     elif r >= 4 and f >= 3:
212 |         return 'Potential Loyalist'
213 |     elif r >= 3 and f <= 2:
214 |         return 'Need Attention'
215 |     elif r <= 2 and f >= 3:
216 |         return 'At Risk'
217 |     elif r == 1 and f == 1:
218 |         return 'Lost'
219 |     else:
220 |         return 'Others'
221 | 
222 | rfm['Segment'] = rfm.apply(segment_customer, axis=1)
223 | 
224 | segment_counts = rfm['Segment'].value_counts().reset_index()
225 | segment_counts.columns = ['Segment', 'Count']
226 | 
227 | plt.figure(figsize=(10, 6))
228 | sns.barplot(data=segment_counts, x='Segment', y='Count', palette='Set2')
229 | plt.title('Customer Segments Count')
230 | plt.xlabel('Customer Segment')
231 | plt.ylabel('Number of Customers')
232 | plt.xticks(rotation=45)
233 | plt.tight_layout()
234 | plt.show()
235 | 
236 | from sklearn.preprocessing import StandardScaler
237 | from sklearn.metrics import silhouette_score
238 | 
239 | 
240 | rfm_for_clustering = rfm[['Recency', 'Frequency', 'Monetary']]
241 | 
242 | scaler = StandardScaler()
243 | rfm_scaled = scaler.fit_transform(rfm_for_clustering)
244 | 
245 | 
246 | inertia = []  
247 | K_range = range(2, 11)
248 | 
249 | for k in K_range:
250 |     kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
251 |     kmeans.fit(rfm_scaled)
252 |     inertia.append(kmeans.inertia_)
253 | 
254 | plt.figure(figsize=(10, 6))
255 | plt.plot(K_range, inertia, marker='o', linestyle='--', color='teal')
256 | plt.xlabel('Number of Clusters (k)')
257 | plt.ylabel('Inertia (Sum of Squared Distances)')
258 | plt.title('Elbow Method - Optimal k')
259 | plt.xticks(K_range)
260 | plt.grid(True)
261 | plt.tight_layout()
262 | plt.show()
263 | 


--------------------------------------------------------------------------------