├── requirements.txt
├── .gitignore
├── README.md
├── mall customers.csv
└── app.py
/requirements.txt:
--------------------------------------------------------------------------------
1 | streamlit
2 | pandas
3 | numpy
4 | matplotlib
5 | scikit-learn
6 | kneed
7 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rashakil-ds/cluster-analysis-with-kmeans/main/.gitignore
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Cluster Analysis with K-Means (Interactive Streamlit App)
2 |
3 | ---
4 |
5 | ### Click -> [APP Link](https://cluster-analyst.streamlit.app/)
6 |
7 | ---
8 |
9 | This project is an interactive implementation of **K-Means clustering** built using **Python and Streamlit**. It demonstrates how a traditional machine learning notebook can be transformed into a **usable, real-world analytics application**. The app is designed for exploratory data analysis and customer segmentation tasks, allowing users to experiment with clustering parameters and visually understand clustering behavior.
10 |
11 | ---
12 |
13 | Most clustering examples remain limited to Jupyter notebooks. This project focuses on converting clustering logic into a **user-friendly application** that can be used by analysts, students, and non-technical users without modifying code.
14 | The emphasis is on:
15 | - Interactivity
16 | - Explainability
17 | - Practical usability
18 |
19 | ---
20 |
21 | ## What This Application Does
22 |
23 | The application allows users to perform K-Means clustering with the following capabilities:
24 |
25 | - Load a dataset or use a built-in sample dataset
26 | - Select numeric features dynamically
27 | - Explore different values of the number of clusters
28 | - Automatically detect the optimal number of clusters
29 | - Visualize clusters and centroids
30 | - Download clustered results for further analysis
31 |
32 | ---
33 |
34 | ## Key Features
35 |
36 | ### Dataset Handling
37 | - Upload any CSV file
38 | - Use a built-in synthetic sample dataset resembling customer income and spending behavior
39 | - Preview the dataset directly in the interface
40 | - Automatic detection of numeric columns
41 |
42 | ### Feature Selection
43 | - Select any two numeric features for clustering
44 | - No hardcoded column names
45 | - Works with different datasets without modification
46 |
47 | ### Model Configuration
48 | - Interactive slider to control the number of clusters
49 | - Optional feature standardization using [`StandardScaler`](https://scikit-learn.org/0.22/modules/generated/sklearn.preprocessing.StandardScaler.html)
50 | - Configurable random state for reproducibility
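As a rough illustration of why the standardization option matters, the sketch below clusters two features on very different scales. The toy data, the choice of three clusters, and `n_init=10` are assumptions made for this example only; the pattern of scaling, fitting, and mapping centroids back with `inverse_transform` mirrors what `app.py` does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: income in dollars (large scale) and a 1-100 score.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50_000, 15_000, size=200),
    rng.uniform(1, 100, size=200),
])

# After scaling, each column has mean 0 and unit variance, so neither
# feature dominates the Euclidean distances that K-Means relies on.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_scaled)

# Map the centroids back to the original units for display.
centers_orig = scaler.inverse_transform(km.cluster_centers_)
print(centers_orig.round(1))
```

Without scaling, the income column would account for almost all of the distance between points in this toy data, and the score would barely influence the clusters.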
51 |
52 | ### Elbow and Knee Analysis
53 | The application provides two complementary visual tools to select the number of clusters:
54 |
55 | - **Elbow Curve**
56 |   - Displays inertia versus the number of clusters
57 |   - Helps identify diminishing returns as clusters increase
58 |
59 | - **Knee Plot**
60 |   - Automatically suggests a number of clusters by locating the knee (elbow) of the inertia curve with `KneeLocator` from the `kneed` library
61 |   - Displays the recommended cluster count
62 |   - Allows applying the optimal value directly to the model
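To give an intuition for what "finding the knee" of a decreasing inertia curve means, here is a minimal sketch of one simple heuristic: pick the point that lies farthest below the straight line joining the first and last points of the curve. This is a simplified stand-in and not the algorithm that `kneed` implements; the function name `knee_by_chord_distance` and the inertia values are made up for the example.

```python
import numpy as np

def knee_by_chord_distance(ks, inertias):
    """Return the k whose point lies farthest below the chord joining the curve's endpoints."""
    x = np.asarray(ks, dtype=float)
    y = np.asarray(inertias, dtype=float)
    # Normalize both axes to [0, 1] so the result does not depend on units.
    x_n = (x - x.min()) / (x.max() - x.min())
    y_n = (y - y.min()) / (y.max() - y.min())
    # For a decreasing curve the normalized chord runs from (0, 1) to (1, 0): y = 1 - x.
    gap = (1.0 - x_n) - y_n
    return int(x[np.argmax(gap)])

ks = [2, 3, 4, 5, 6, 7, 8]
inertias = [1000.0, 600.0, 300.0, 250.0, 220.0, 200.0, 190.0]  # made-up values
print(knee_by_chord_distance(ks, inertias))  # prints 4
```

In the made-up curve, inertia drops sharply up to k = 4 and flattens afterwards, which is the "diminishing returns" pattern the elbow chart is meant to reveal.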
63 |
64 | ### Cluster Visualization
65 | - Scatter plot of clustered data
66 | - Each cluster plotted in its own color
67 | - Centroids displayed explicitly
68 | - Dynamic updates when parameters change
69 |
70 | ### Cluster Insights
71 | - Table showing centroid values for each cluster
72 | - Table showing the number of samples per cluster
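The two tables can also be reproduced from the exported data. The sketch below assumes a small hypothetical clustered frame with a `km_cluster` label column (the column name the app uses) and derives per-cluster means and sizes with pandas. For a converged K-Means fit, each centroid is the mean of its members, so the group means should match the centroid table; the app itself reads the centroids from the fitted model instead.

```python
import pandas as pd

# Hypothetical clustered output: two features plus the label column.
df_out = pd.DataFrame({
    "income": [15, 16, 17, 78, 80, 82],
    "score": [40, 42, 38, 75, 80, 85],
    "km_cluster": [0, 0, 0, 1, 1, 1],
})

# Per-cluster feature means (equal to the centroids once K-Means has converged).
centroids = df_out.groupby("km_cluster")[["income", "score"]].mean()

# Number of samples assigned to each cluster, ordered by cluster label.
counts = df_out["km_cluster"].value_counts().sort_index()

print(centroids)
print(counts)
```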
73 |
74 | ### Export Functionality
75 | - Download the final clustered dataset as a CSV file
76 | - Cluster labels included in a `km_cluster` column for downstream analysis
77 |
78 | ---
79 |
80 | ## Technologies Used
81 |
82 | - Python
83 | - Streamlit
84 | - Pandas
85 | - NumPy
86 | - Matplotlib
87 | - Scikit-learn
88 | - kneed
89 |
90 | ---
91 |
92 | ## Project Structure
93 |     .
94 |     ├── app.py
95 |     ├── requirements.txt
96 |     ├── mall customers.csv
97 |     ├── Market Basket Analysis using K-Means Cluster Algorithm.ipynb
98 |     ├── README.md
99 |     └── .gitignore
100 |
101 | ---
102 |
103 | ## How to Run the Application
104 |
105 | ### Clone the Repository
106 |     git clone https://github.com/rashakil-ds/cluster-analysis-with-kmeans.git
107 |     cd cluster-analysis-with-kmeans
108 |
109 | ### Install Dependencies
110 |     pip install -r requirements.txt
111 |
112 | ### Run the App
113 |     streamlit run app.py
114 |
115 | ---
116 |
117 | ## Design Decisions
118 |
119 | - The K-Means model is trained dynamically inside the application
120 | - The model is not saved intentionally, as clustering depends on user-selected parameters and data
121 | - Streamlit session state stores the slider value (`k_value`), so the "Use optimal k" button can update the slider through a callback
122 | - The application prioritizes clarity and explainability over complexity
123 |
124 | ---
125 |
126 | ## Possible Extensions
127 |
128 | - Add silhouette score or Davies–Bouldin index for cluster evaluation
129 | - Support clustering with more than two features
130 | - Deploy the application on Streamlit Cloud
131 | - Add authentication for multi-user environments
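As a starting point for the first extension, the following sketch scores several values of k with scikit-learn's `silhouette_score` and `davies_bouldin_score`. It is not part of the current app; the toy blobs, the range of k, and `n_init=10` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(42)
# Hypothetical data: three well-separated 2-D blobs.
X = np.vstack([
    rng.normal([20, 30], 3, size=(50, 2)),
    rng.normal([80, 80], 3, size=(50, 2)),
    rng.normal([50, 10], 3, size=(50, 2)),
])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    # Silhouette: higher is better. Davies-Bouldin: lower is better.
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```

Unlike inertia, which keeps decreasing as k grows, both metrics can be compared directly across values of k, so they could complement the elbow and knee plots rather than replace them.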
132 |
133 | ---
134 |
135 | ## Developed By
136 |
137 | Rashedul Alam
138 | [LinkedIn](https://www.linkedin.com/in/kmrashedulalam/)
139 | [GitHub](https://github.com/rashakil-ds)
140 |
--------------------------------------------------------------------------------
/mall customers.csv:
--------------------------------------------------------------------------------
1 | CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
2 | 1,Male,19,15,39
3 | 2,Male,21,15,81
4 | 3,Female,20,16,6
5 | 4,Female,23,16,77
6 | 5,Female,31,17,40
7 | 6,Female,22,17,76
8 | 7,Female,35,18,6
9 | 8,Female,23,18,94
10 | 9,Male,64,19,3
11 | 10,Female,30,19,72
12 | 11,Male,67,19,14
13 | 12,Female,35,19,99
14 | 13,Female,58,20,15
15 | 14,Female,24,20,77
16 | 15,Male,37,20,13
17 | 16,Male,22,20,79
18 | 17,Female,35,21,35
19 | 18,Male,20,21,66
20 | 19,Male,52,23,29
21 | 20,Female,35,23,98
22 | 21,Male,35,24,35
23 | 22,Male,25,24,73
24 | 23,Female,46,25,5
25 | 24,Male,31,25,73
26 | 25,Female,54,28,14
27 | 26,Male,29,28,82
28 | 27,Female,45,28,32
29 | 28,Male,35,28,61
30 | 29,Female,40,29,31
31 | 30,Female,23,29,87
32 | 31,Male,60,30,4
33 | 32,Female,21,30,73
34 | 33,Male,53,33,4
35 | 34,Male,18,33,92
36 | 35,Female,49,33,14
37 | 36,Female,21,33,81
38 | 37,Female,42,34,17
39 | 38,Female,30,34,73
40 | 39,Female,36,37,26
41 | 40,Female,20,37,75
42 | 41,Female,65,38,35
43 | 42,Male,24,38,92
44 | 43,Male,48,39,36
45 | 44,Female,31,39,61
46 | 45,Female,49,39,28
47 | 46,Female,24,39,65
48 | 47,Female,50,40,55
49 | 48,Female,27,40,47
50 | 49,Female,29,40,42
51 | 50,Female,31,40,42
52 | 51,Female,49,42,52
53 | 52,Male,33,42,60
54 | 53,Female,31,43,54
55 | 54,Male,59,43,60
56 | 55,Female,50,43,45
57 | 56,Male,47,43,41
58 | 57,Female,51,44,50
59 | 58,Male,69,44,46
60 | 59,Female,27,46,51
61 | 60,Male,53,46,46
62 | 61,Male,70,46,56
63 | 62,Male,19,46,55
64 | 63,Female,67,47,52
65 | 64,Female,54,47,59
66 | 65,Male,63,48,51
67 | 66,Male,18,48,59
68 | 67,Female,43,48,50
69 | 68,Female,68,48,48
70 | 69,Male,19,48,59
71 | 70,Female,32,48,47
72 | 71,Male,70,49,55
73 | 72,Female,47,49,42
74 | 73,Female,60,50,49
75 | 74,Female,60,50,56
76 | 75,Male,59,54,47
77 | 76,Male,26,54,54
78 | 77,Female,45,54,53
79 | 78,Male,40,54,48
80 | 79,Female,23,54,52
81 | 80,Female,49,54,42
82 | 81,Male,57,54,51
83 | 82,Male,38,54,55
84 | 83,Male,67,54,41
85 | 84,Female,46,54,44
86 | 85,Female,21,54,57
87 | 86,Male,48,54,46
88 | 87,Female,55,57,58
89 | 88,Female,22,57,55
90 | 89,Female,34,58,60
91 | 90,Female,50,58,46
92 | 91,Female,68,59,55
93 | 92,Male,18,59,41
94 | 93,Male,48,60,49
95 | 94,Female,40,60,40
96 | 95,Female,32,60,42
97 | 96,Male,24,60,52
98 | 97,Female,47,60,47
99 | 98,Female,27,60,50
100 | 99,Male,48,61,42
101 | 100,Male,20,61,49
102 | 101,Female,23,62,41
103 | 102,Female,49,62,48
104 | 103,Male,67,62,59
105 | 104,Male,26,62,55
106 | 105,Male,49,62,56
107 | 106,Female,21,62,42
108 | 107,Female,66,63,50
109 | 108,Male,54,63,46
110 | 109,Male,68,63,43
111 | 110,Male,66,63,48
112 | 111,Male,65,63,52
113 | 112,Female,19,63,54
114 | 113,Female,38,64,42
115 | 114,Male,19,64,46
116 | 115,Female,18,65,48
117 | 116,Female,19,65,50
118 | 117,Female,63,65,43
119 | 118,Female,49,65,59
120 | 119,Female,51,67,43
121 | 120,Female,50,67,57
122 | 121,Male,27,67,56
123 | 122,Female,38,67,40
124 | 123,Female,40,69,58
125 | 124,Male,39,69,91
126 | 125,Female,23,70,29
127 | 126,Female,31,70,77
128 | 127,Male,43,71,35
129 | 128,Male,40,71,95
130 | 129,Male,59,71,11
131 | 130,Male,38,71,75
132 | 131,Male,47,71,9
133 | 132,Male,39,71,75
134 | 133,Female,25,72,34
135 | 134,Female,31,72,71
136 | 135,Male,20,73,5
137 | 136,Female,29,73,88
138 | 137,Female,44,73,7
139 | 138,Male,32,73,73
140 | 139,Male,19,74,10
141 | 140,Female,35,74,72
142 | 141,Female,57,75,5
143 | 142,Male,32,75,93
144 | 143,Female,28,76,40
145 | 144,Female,32,76,87
146 | 145,Male,25,77,12
147 | 146,Male,28,77,97
148 | 147,Male,48,77,36
149 | 148,Female,32,77,74
150 | 149,Female,34,78,22
151 | 150,Male,34,78,90
152 | 151,Male,43,78,17
153 | 152,Male,39,78,88
154 | 153,Female,44,78,20
155 | 154,Female,38,78,76
156 | 155,Female,47,78,16
157 | 156,Female,27,78,89
158 | 157,Male,37,78,1
159 | 158,Female,30,78,78
160 | 159,Male,34,78,1
161 | 160,Female,30,78,73
162 | 161,Female,56,79,35
163 | 162,Female,29,79,83
164 | 163,Male,19,81,5
165 | 164,Female,31,81,93
166 | 165,Male,50,85,26
167 | 166,Female,36,85,75
168 | 167,Male,42,86,20
169 | 168,Female,33,86,95
170 | 169,Female,36,87,27
171 | 170,Male,32,87,63
172 | 171,Male,40,87,13
173 | 172,Male,28,87,75
174 | 173,Male,36,87,10
175 | 174,Male,36,87,92
176 | 175,Female,52,88,13
177 | 176,Female,30,88,86
178 | 177,Male,58,88,15
179 | 178,Male,27,88,69
180 | 179,Male,59,93,14
181 | 180,Male,35,93,90
182 | 181,Female,37,97,32
183 | 182,Female,32,97,86
184 | 183,Male,46,98,15
185 | 184,Female,29,98,88
186 | 185,Female,41,99,39
187 | 186,Male,30,99,97
188 | 187,Female,54,101,24
189 | 188,Male,28,101,68
190 | 189,Female,41,103,17
191 | 190,Female,36,103,85
192 | 191,Female,34,103,23
193 | 192,Female,32,103,69
194 | 193,Male,33,113,8
195 | 194,Female,38,113,91
196 | 195,Female,47,120,16
197 | 196,Female,35,120,79
198 | 197,Female,45,126,28
199 | 198,Male,32,126,74
200 | 199,Male,32,137,18
201 | 200,Male,30,137,83
202 |
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | # @Cluster Analysis Project
2 | import streamlit as st
3 | import pandas as pd
4 | import numpy as np
5 | from matplotlib import pyplot as plt
6 | from sklearn.cluster import KMeans
7 | from sklearn.preprocessing import StandardScaler
8 | from kneed import KneeLocator
9 |
10 | st.set_page_config(page_title="Cluster Analysis", layout="wide")
11 |
12 | # callback: write the chosen k into session state so the slider (key="k_value") follows it
13 | def set_k(val: int):
14 |     st.session_state["k_value"] = int(val)
15 |
16 | st.title("Cluster Analysis")
17 | st.caption("Cluster on two features (e.g., income & score) with an interactive number of clusters (k) and visualization.")
18 |
19 | # data section
20 | st.sidebar.header("1) Data")
21 | uploaded = st.sidebar.file_uploader("Upload CSV", type=["csv"])
22 | use_sample = st.sidebar.checkbox("Use sample dataset (Mall Customers-like)", value=(uploaded is None))
23 |
24 | @st.cache_data
25 | def load_sample():
26 |     rng = np.random.default_rng(42)
27 |     c1 = rng.normal([20, 30], [5, 8], size=(60, 2))
28 |     c2 = rng.normal([80, 80], [8, 10], size=(60, 2))
29 |     c3 = rng.normal([50, 50], [6, 6], size=(80, 2))
30 |     c4 = rng.normal([25, 75], [6, 8], size=(50, 2))
31 |     c5 = rng.normal([85, 20], [7, 7], size=(50, 2))
32 |     X = np.vstack([c1, c2, c3, c4, c5])
33 |     df = pd.DataFrame(X, columns=["income", "score"])
34 |     df["age"] = rng.integers(18, 60, size=len(df))
35 |     return df
36 |
37 | if uploaded is not None and not use_sample:
38 |     df = pd.read_csv(uploaded)
39 | else:
40 |     df = load_sample()
41 |
42 | # preview section
43 | show_preview = st.sidebar.checkbox("Show dataset preview", value=True)
44 | if show_preview:
45 |     with st.expander("Preview", expanded=True):
46 |         st.dataframe(df.head(30), use_container_width=True, height=320)
47 |
48 | # feature selection
49 | st.sidebar.header("2) Features")
50 | numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
51 |
52 | if len(numeric_cols) < 2:
53 |     st.error("Need at least 2 numeric columns to run KMeans.")
54 |     st.stop()
55 |
56 | default_x = "score" if "score" in numeric_cols else numeric_cols[0]
57 | default_y = "income" if "income" in numeric_cols else numeric_cols[1]
58 |
59 | x_col = st.sidebar.selectbox("X-axis feature", numeric_cols, index=numeric_cols.index(default_x))
60 | y_options = [c for c in numeric_cols if c != x_col]
61 | y_col = st.sidebar.selectbox(
62 |     "Y-axis feature",
63 |     y_options,
64 |     index=y_options.index(default_y) if default_y in y_options else 0
65 | )
66 |
67 | # model settings
68 | st.sidebar.header("3) Model Settings")
69 | show_elbow = st.sidebar.checkbox("Show elbow (inertia) chart", value=True)
70 | scale = st.sidebar.checkbox("Standardize features (recommended)", value=False)
71 | random_state = st.sidebar.number_input("random_state", value=42, step=1)
72 |
73 | # slider state (dynamic)
74 | if "k_value" not in st.session_state:
75 |     st.session_state["k_value"] = 2  # small default
76 |
77 | k = st.sidebar.slider(
78 |     "Number of clusters (k)",
79 |     min_value=2,
80 |     max_value=10,
81 |     step=1,
82 |     key="k_value"
83 | )
84 |
85 | df = df.dropna(subset=[x_col, y_col])  # prepare features: KMeans rejects rows with NaN
86 | X = df[[x_col, y_col]].copy()
87 | if scale:
88 |     scaler = StandardScaler()
89 |     X_scaled = scaler.fit_transform(X)
90 | else:
91 |     X_scaled = X.values
92 |
93 | # elbow + knee section
94 | optimal_k = None
95 | if show_elbow:
96 |     st.subheader("Elbow Chart (Inertia vs Clusters)")
97 |
98 |     col_left, col_right = st.columns([2, 1])
99 |
100 |     ks = list(range(2, min(11, len(df))))
101 |     inertias = []
102 |     for kk in ks:
103 |         km_tmp = KMeans(n_clusters=kk, random_state=random_state, n_init="auto")
104 |         km_tmp.fit(X_scaled)
105 |         inertias.append(km_tmp.inertia_)
106 |
107 |     knee = KneeLocator(ks, inertias, curve="convex", direction="decreasing")
108 |     optimal_k = knee.knee
109 |
110 |     # left plot (normal elbow)
111 |     with col_left:
112 |         fig1, ax1 = plt.subplots(figsize=(7, 4), dpi=120)
113 |         ax1.plot(ks, inertias, marker="o")
114 |         ax1.set_xlabel("k")
115 |         ax1.set_ylabel("Inertia")
116 |         ax1.set_title("Elbow Curve")
117 |         st.pyplot(fig1, use_container_width=True)
118 |
119 |     # right plot (knee style) + auto set button
120 |     with col_right:
121 |         st.markdown("### Knee Point")
122 |
123 |         fig2, ax2 = plt.subplots(figsize=(4, 4), dpi=120)
124 |         ax2.plot(ks, inertias, label="data")
125 |         ax2.set_xlabel("Clusters (k)")
126 |         ax2.set_ylabel("Inertia")
127 |         ax2.set_title("Knee Plot")
128 |
129 |         if optimal_k is not None:
130 |             ax2.axvline(optimal_k, linestyle="--", label="knee/elbow")
131 |             ax2.legend(loc="best")
132 |             st.metric("Optimal k", int(optimal_k))
133 |
134 |             st.button(
135 |                 "Use optimal k",
136 |                 on_click=set_k,
137 |                 args=(int(optimal_k),),
138 |                 key="use_optimal_k_btn"
139 |             )
140 |         else:
141 |             ax2.legend(loc="best")
142 |             st.warning("No clear elbow found.")
143 |
144 |         st.pyplot(fig2, use_container_width=True)
145 |
146 | # fit final model
147 | km = KMeans(n_clusters=int(st.session_state["k_value"]), random_state=random_state, n_init="auto")
148 | labels = km.fit_predict(X_scaled)
149 |
150 | df_out = df.copy()
151 | df_out["km_cluster"] = labels
152 |
153 | # centroids in original scale
154 | centers = km.cluster_centers_
155 | centers_orig = scaler.inverse_transform(centers) if scale else centers
156 |
157 | # cluster results
158 | st.subheader("Cluster Result")
159 | c1, c2 = st.columns([2, 1])
160 |
161 | with c1:
162 |     fig, ax = plt.subplots(figsize=(8, 6), dpi=120)
163 |     for cl in sorted(df_out["km_cluster"].unique()):
164 |         subset = df_out[df_out["km_cluster"] == cl]
165 |         ax.scatter(subset[x_col], subset[y_col], s=30, label=f"cluster_{cl+1}")
166 |
167 |     ax.scatter(centers_orig[:, 0], centers_orig[:, 1], marker="*", s=250, label="centroid")
168 |     ax.set_xlabel(x_col.upper())
169 |     ax.set_ylabel(y_col.upper())
170 |     ax.legend()
171 |     st.pyplot(fig, use_container_width=True)
172 |
173 | with c2:
174 |     st.markdown("### Centroids")
175 |     cent_df = pd.DataFrame(centers_orig, columns=[x_col, y_col])
176 |     cent_df.index = [f"cluster_{i+1}" for i in range(int(st.session_state["k_value"]))]
177 |     st.dataframe(cent_df, use_container_width=True, height=220)
178 |
179 |     st.markdown("### Cluster Counts")
180 |     counts = df_out["km_cluster"].value_counts().sort_index().rename_axis("cluster").to_frame("count")
181 |     st.dataframe(counts, use_container_width=True, height=220)
182 |
183 | # download section
184 | st.subheader("Download clustered data")
185 | csv = df_out.to_csv(index=False).encode("utf-8")
186 | st.download_button("Download CSV", csv, "clustered_output.csv", "text/csv")
187 |
188 | # footer
189 | st.markdown("---")
190 | st.caption("Developed by @Rashedul Alam")
191 |
--------------------------------------------------------------------------------