├── requirements.txt
├── .gitignore
├── README.md
├── mall customers.csv
└── app.py

/requirements.txt:
--------------------------------------------------------------------------------
streamlit
pandas
numpy
matplotlib
scikit-learn
kneed

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rashakil-ds/cluster-analysis-with-kmeans/main/.gitignore

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Cluster Analysis with K-Means (Interactive Streamlit App)

---

### Click -> [APP Link](https://cluster-analyst.streamlit.app/)

---

This project is an interactive implementation of **K-Means clustering** built using **Python and Streamlit**. It demonstrates how a traditional machine learning notebook can be transformed into a **usable, real-world analytics application**. The app is designed for exploratory data analysis and customer segmentation tasks, allowing users to experiment with clustering parameters and visually understand clustering behavior.

---

Most clustering examples remain limited to Jupyter notebooks. This project focuses on converting clustering logic into a **user-friendly application** that can be used by analysts, students, and non-technical users without modifying code.
The emphasis is on:
- Interactivity
- Explainability
- Practical usability

---

## What This Application Does

The application allows users to perform K-Means clustering with the following capabilities:

- Load a dataset or use a built-in sample dataset
- Select numeric features dynamically
- Explore different values of the number of clusters
- Automatically detect the optimal number of clusters
- Visualize clusters and centroids
- Download clustered results for further analysis

---

## Key Features

### Dataset Handling
- Upload any CSV file
- Use a sample dataset resembling customer income and spending behavior
- Preview the dataset directly in the interface
- Automatic detection of numeric columns

### Feature Selection
- Select any `two numeric features` for clustering
- No hardcoded column names
- Works with different datasets without modification

### Model Configuration
- Interactive slider to control the number of clusters
- Optional feature standardization using [`StandardScaler`](https://scikit-learn.org/0.22/modules/generated/sklearn.preprocessing.StandardScaler.html)
- Configurable random state for reproducibility

### Elbow and Knee Analysis
The application provides two complementary visual tools to select the number of clusters:

- **Elbow Curve**
  - Displays inertia versus the number of clusters
  - Helps identify diminishing returns as clusters increase

- **Knee Plot**
  - Automatically detects the optimal number of clusters using the knee (elbow) detection algorithm
  - Displays the recommended cluster count
  - Allows applying the optimal value directly to the model

### Cluster Visualization
- Scatter plot of clustered data
- Clear separation of clusters
- Centroids displayed explicitly
- Dynamic updates when parameters change

### Cluster Insights
- Table showing centroid values for each cluster
- Table showing the number of samples per cluster

### Export Functionality
- Download the final clustered dataset as a CSV file
- Cluster labels included for downstream analysis

---

## Technologies Used

- Python
- Streamlit
- Pandas
- NumPy
- Matplotlib
- Scikit-learn
- kneed

---

## Project Structure
.
├── app.py
├── requirements.txt
├── mall customers.csv
├── Market Basket Analysis using K-Means Cluster Algorithm.ipynb
├── README.md
└── .gitignore

---

## How to Run the Application

### Clone the Repository
git clone https://github.com/rashakil-ds/cluster-analysis-with-kmeans.git
cd cluster-analysis-with-kmeans

### Install Dependencies
pip install -r requirements.txt

### Run the App
streamlit run app.py

---

## Design Decisions

- The K-Means model is trained dynamically inside the application
- The model is not saved intentionally, as clustering depends on user-selected parameters and data
- Streamlit session state is used to manage interactive behavior correctly
- The application prioritizes clarity and explainability over complexity

---

## Possible Extensions

- Add silhouette score or Davies–Bouldin index for cluster evaluation
- Support clustering with more than two features
- Deploy the application on Streamlit Cloud
- Add authentication for multi-user environments

---

## Developed By

Rashedul Alam
[LinkedIn](https://www.linkedin.com/in/kmrashedulalam/)
[GitHub](https://github.com/rashakil-ds)

--------------------------------------------------------------------------------
/mall customers.csv:
--------------------------------------------------------------------------------
CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
6,Female,22,17,76
7,Female,35,18,6
8,Female,23,18,94
9,Male,64,19,3
10,Female,30,19,72
11,Male,67,19,14
12,Female,35,19,99
13,Female,58,20,15
14,Female,24,20,77
15,Male,37,20,13
16,Male,22,20,79
17,Female,35,21,35
18,Male,20,21,66
19,Male,52,23,29
20,Female,35,23,98
21,Male,35,24,35
22,Male,25,24,73
23,Female,46,25,5
24,Male,31,25,73
25,Female,54,28,14
26,Male,29,28,82
27,Female,45,28,32
28,Male,35,28,61
29,Female,40,29,31
30,Female,23,29,87
31,Male,60,30,4
32,Female,21,30,73
33,Male,53,33,4
34,Male,18,33,92
35,Female,49,33,14
36,Female,21,33,81
37,Female,42,34,17
38,Female,30,34,73
39,Female,36,37,26
40,Female,20,37,75
41,Female,65,38,35
42,Male,24,38,92
43,Male,48,39,36
44,Female,31,39,61
45,Female,49,39,28
46,Female,24,39,65
47,Female,50,40,55
48,Female,27,40,47
49,Female,29,40,42
50,Female,31,40,42
51,Female,49,42,52
52,Male,33,42,60
53,Female,31,43,54
54,Male,59,43,60
55,Female,50,43,45
56,Male,47,43,41
57,Female,51,44,50
58,Male,69,44,46
59,Female,27,46,51
60,Male,53,46,46
61,Male,70,46,56
62,Male,19,46,55
63,Female,67,47,52
64,Female,54,47,59
65,Male,63,48,51
66,Male,18,48,59
67,Female,43,48,50
68,Female,68,48,48
69,Male,19,48,59
70,Female,32,48,47
71,Male,70,49,55
72,Female,47,49,42
73,Female,60,50,49
74,Female,60,50,56
75,Male,59,54,47
76,Male,26,54,54
77,Female,45,54,53
78,Male,40,54,48
79,Female,23,54,52
80,Female,49,54,42
81,Male,57,54,51
82,Male,38,54,55
83,Male,67,54,41
84,Female,46,54,44
85,Female,21,54,57
86,Male,48,54,46
87,Female,55,57,58
88,Female,22,57,55
89,Female,34,58,60
90,Female,50,58,46
91,Female,68,59,55
92,Male,18,59,41
93,Male,48,60,49
94,Female,40,60,40
95,Female,32,60,42
96,Male,24,60,52
97,Female,47,60,47
98,Female,27,60,50
99,Male,48,61,42
100,Male,20,61,49
101,Female,23,62,41
102,Female,49,62,48
103,Male,67,62,59
104,Male,26,62,55
105,Male,49,62,56
106,Female,21,62,42
107,Female,66,63,50
108,Male,54,63,46
109,Male,68,63,43
110,Male,66,63,48
111,Male,65,63,52
112,Female,19,63,54
113,Female,38,64,42
114,Male,19,64,46
115,Female,18,65,48
116,Female,19,65,50
117,Female,63,65,43
118,Female,49,65,59
119,Female,51,67,43
120,Female,50,67,57
121,Male,27,67,56
122,Female,38,67,40
123,Female,40,69,58
124,Male,39,69,91
125,Female,23,70,29
126,Female,31,70,77
127,Male,43,71,35
128,Male,40,71,95
129,Male,59,71,11
130,Male,38,71,75
131,Male,47,71,9
132,Male,39,71,75
133,Female,25,72,34
134,Female,31,72,71
135,Male,20,73,5
136,Female,29,73,88
137,Female,44,73,7
138,Male,32,73,73
139,Male,19,74,10
140,Female,35,74,72
141,Female,57,75,5
142,Male,32,75,93
143,Female,28,76,40
144,Female,32,76,87
145,Male,25,77,12
146,Male,28,77,97
147,Male,48,77,36
148,Female,32,77,74
149,Female,34,78,22
150,Male,34,78,90
151,Male,43,78,17
152,Male,39,78,88
153,Female,44,78,20
154,Female,38,78,76
155,Female,47,78,16
156,Female,27,78,89
157,Male,37,78,1
158,Female,30,78,78
159,Male,34,78,1
160,Female,30,78,73
161,Female,56,79,35
162,Female,29,79,83
163,Male,19,81,5
164,Female,31,81,93
165,Male,50,85,26
166,Female,36,85,75
167,Male,42,86,20
168,Female,33,86,95
169,Female,36,87,27
170,Male,32,87,63
171,Male,40,87,13
172,Male,28,87,75
173,Male,36,87,10
174,Male,36,87,92
175,Female,52,88,13
176,Female,30,88,86
177,Male,58,88,15
178,Male,27,88,69
179,Male,59,93,14
180,Male,35,93,90
181,Female,37,97,32
182,Female,32,97,86
183,Male,46,98,15
184,Female,29,98,88
185,Female,41,99,39
186,Male,30,99,97
187,Female,54,101,24
188,Male,28,101,68
189,Female,41,103,17
190,Female,36,103,85
191,Female,34,103,23
192,Female,32,103,69
193,Male,33,113,8
194,Female,38,113,91
195,Female,47,120,16
196,Female,35,120,79
197,Female,45,126,28
198,Male,32,126,74
199,Male,32,137,18
200,Male,30,137,83

--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
# @Cluster Analysis Project
import streamlit as st
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from kneed import KneeLocator

st.set_page_config(page_title="Cluster Analysis", layout="wide")

# small helper: button callback that writes the chosen k into the slider's state
def set_k(val: int):
    st.session_state["k_value"] = int(val)

st.title("Cluster Analysis")
st.caption("Clustering on two features (e.g., income & score) with interactive clusters(k) and visualization.")

# data section
st.sidebar.header("1) Data")
uploaded = st.sidebar.file_uploader("Upload CSV", type=["csv"])
use_sample = st.sidebar.checkbox("Use sample dataset (Mall Customers-like)", value=(uploaded is None))

@st.cache_data
def load_sample():
    # synthetic data: five Gaussian blobs in (income, score) plus a random age column
    rng = np.random.default_rng(42)
    c1 = rng.normal([20, 30], [5, 8], size=(60, 2))
    c2 = rng.normal([80, 80], [8, 10], size=(60, 2))
    c3 = rng.normal([50, 50], [6, 6], size=(80, 2))
    c4 = rng.normal([25, 75], [6, 8], size=(50, 2))
    c5 = rng.normal([85, 20], [7, 7], size=(50, 2))
    X = np.vstack([c1, c2, c3, c4, c5])
    df = pd.DataFrame(X, columns=["income", "score"])
    df["age"] = rng.integers(18, 60, size=len(df))
    return df

if uploaded is not None and not use_sample:
    df = pd.read_csv(uploaded)
else:
    df = load_sample()

# preview section
show_preview = st.sidebar.checkbox("Show dataset preview", value=True)
if show_preview:
    with st.expander("Preview", expanded=True):
        st.dataframe(df.head(30), use_container_width=True, height=320)

# feature selection
st.sidebar.header("2) Features")
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

if len(numeric_cols) < 2:
    st.error("Need at least 2 numeric columns to run KMeans.")
    st.stop()

default_x = "score" if "score" in numeric_cols else numeric_cols[0]
default_y = "income" if "income" in numeric_cols else numeric_cols[1]

x_col = st.sidebar.selectbox("X-axis feature", numeric_cols, index=numeric_cols.index(default_x))
y_options = [c for c in numeric_cols if c != x_col]
y_col = st.sidebar.selectbox(
    "Y-axis feature",
    y_options,
    index=y_options.index(default_y) if default_y in y_options else 0
)

# model settings
st.sidebar.header("3) Model Settings")
show_elbow = st.sidebar.checkbox("Show elbow (inertia) chart", value=True)
scale = st.sidebar.checkbox("Standardize features (recommended)", value=False)
# scikit-learn rejects negative seeds, so the input is bounded below at 0
random_state = int(st.sidebar.number_input("random_state", min_value=0, value=42, step=1))

# slider state (dynamic)
if "k_value" not in st.session_state:
    st.session_state["k_value"] = 2  # small default

k = st.sidebar.slider(
    "Number of clusters (k)",
    min_value=2,
    max_value=10,
    step=1,
    key="k_value"
)

# prepare features
# KMeans cannot handle missing values, so rows lacking either selected feature
# are dropped here (and are therefore absent from the downloaded output)
df = df.dropna(subset=[x_col, y_col]).reset_index(drop=True)
if len(df) < 10:
    # the slider allows k up to 10, and KMeans needs at least k samples
    st.error(f"Need at least 10 rows with values in both selected features (found {len(df)}).")
    st.stop()

X = df[[x_col, y_col]].copy()
if scale:
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
else:
    X_scaled = X.values

# elbow + knee section
optimal_k = None
if show_elbow:
    st.subheader("Elbow Chart (Inertia vs Clusters)")

    col_left, col_right = st.columns([2, 1])

    # fit one model per candidate k and record its inertia
    ks = list(range(2, min(11, len(df))))
    inertias = []
    for kk in ks:
        km_tmp = KMeans(n_clusters=kk, random_state=random_state, n_init="auto")
        km_tmp.fit(X_scaled)
        inertias.append(km_tmp.inertia_)

    # knee.knee is None when no knee is detected
    knee = KneeLocator(ks, inertias, curve="convex", direction="decreasing")
    optimal_k = knee.knee

    # left plot (normal elbow)
    with col_left:
        fig1, ax1 = plt.subplots(figsize=(7, 4), dpi=120)
        ax1.plot(ks, inertias, marker="o")
        ax1.set_xlabel("k")
        ax1.set_ylabel("Inertia")
        ax1.set_title("Elbow Curve")
        st.pyplot(fig1, use_container_width=True)

    # right plot (knee style) + auto set button
    with col_right:
        st.markdown("### Knee Point")

        fig2, ax2 = plt.subplots(figsize=(4, 4), dpi=120)
        ax2.plot(ks, inertias, label="data")
        ax2.set_xlabel("Clusters (k)")
        ax2.set_ylabel("Inertia")
        ax2.set_title("Knee Plot")

        if optimal_k is not None:
            ax2.axvline(optimal_k, linestyle="--", label="knee/elbow")
            ax2.legend(loc="best")
            st.metric("Optimal k", int(optimal_k))

            st.button(
                "Use optimal k",
                on_click=set_k,
                args=(int(optimal_k),),
                key="use_optimal_k_btn"
            )
        else:
            ax2.legend(loc="best")
            st.warning("No clear elbow found.")

        st.pyplot(fig2, use_container_width=True)

# fit final model
km = KMeans(n_clusters=int(st.session_state["k_value"]), random_state=random_state, n_init="auto")
labels = km.fit_predict(X_scaled)

df_out = df.copy()
df_out["km_cluster"] = labels

# centroids in original scale
centers = km.cluster_centers_
centers_orig = scaler.inverse_transform(centers) if scale else centers

# cluster results
st.subheader("Cluster Result")
c1, c2 = st.columns([2, 1])

with c1:
    fig, ax = plt.subplots(figsize=(8, 6), dpi=120)
    for cl in sorted(df_out["km_cluster"].unique()):
        subset = df_out[df_out["km_cluster"] == cl]
        ax.scatter(subset[x_col], subset[y_col], s=30, label=f"cluster_{cl+1}")

    ax.scatter(centers_orig[:, 0], centers_orig[:, 1], marker="*", s=250, label="centroid")
    ax.set_xlabel(x_col.upper())
    ax.set_ylabel(y_col.upper())
    ax.legend()
    st.pyplot(fig, use_container_width=True)

with c2:
    st.markdown("### Centroids")
    cent_df = pd.DataFrame(centers_orig, columns=[x_col, y_col])
    cent_df.index = [f"cluster_{i+1}" for i in range(int(st.session_state["k_value"]))]
    st.dataframe(cent_df, use_container_width=True, height=220)

    st.markdown("### Cluster Counts")
    counts = df_out["km_cluster"].value_counts().sort_index().rename_axis("cluster").to_frame("count")
    st.dataframe(counts, use_container_width=True, height=220)

# download section
st.subheader("Download clustered data")
csv = df_out.to_csv(index=False).encode("utf-8")
st.download_button("Download CSV", csv, "clustered_output.csv", "text/csv")

# footer
st.markdown("---")
st.caption("Developed by @Rashedul Alam")

--------------------------------------------------------------------------------
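
app.py delegates knee detection to `KneeLocator` from kneed. As a rough illustration of the idea behind picking an elbow on an inertia curve, the sketch below uses a simplified stand-in heuristic: it normalizes the curve and picks the point farthest below the straight line joining its endpoints. This is not the algorithm kneed implements, and the function name `find_elbow` and the sample inertia values are hypothetical, introduced only for this example. It assumes `ks` is increasing and the inertias are decreasing.

```python
from typing import Optional, Sequence


def find_elbow(ks: Sequence[int], inertias: Sequence[float]) -> Optional[int]:
    """Simplified elbow heuristic (hypothetical helper, not kneed's algorithm).

    Assumes ks is increasing and inertias is a decreasing, roughly convex curve.
    Returns the k farthest below the chord joining the curve's endpoints,
    or None when no point lies below that chord.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        return None

    x_min, x_max = ks[0], ks[-1]
    y_min, y_max = min(inertias), max(inertias)
    if x_max == x_min or y_max == y_min:
        return None  # degenerate curve: nothing to detect

    best_k, best_gap = None, 0.0
    for k, inertia in zip(ks, inertias):
        # rescale both axes to [0, 1] so neither dominates the comparison
        x = (k - x_min) / (x_max - x_min)
        y = (inertia - y_min) / (y_max - y_min)
        # after rescaling, a decreasing curve's chord is x + y = 1;
        # the gap measures how far the point sits below it
        gap = 1.0 - x - y
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# made-up inertia values for k = 2..10, for illustration only
example_ks = list(range(2, 11))
example_inertias = [1000, 600, 300, 120, 100, 90, 82, 76, 71]
print(find_elbow(example_ks, example_inertias))
```

Like the `optimal_k is not None` check in app.py, callers of this sketch should handle a `None` result, which it returns for perfectly linear or flat curves.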