├── Reports ├── 18MIS1042_1010.pdf ├── j.compbiomed.2011.11.010.pdf ├── Customer Segmentation Report.pdf ├── Approaches_to_Clustering_in_Customer_Segmentation.pdf └── MALLCUSTOMERSEGMENTATIONUSINGCLUSTERINGALGORITHM.pdf ├── Dataset └── Mall_Customers.csv ├── R Code └── Implementation In R.rmd └── README.md /Reports/18MIS1042_1010.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NelakurthiSudheer/Mall-Customers-Segmentation/HEAD/Reports/18MIS1042_1010.pdf -------------------------------------------------------------------------------- /Reports/j.compbiomed.2011.11.010.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NelakurthiSudheer/Mall-Customers-Segmentation/HEAD/Reports/j.compbiomed.2011.11.010.pdf -------------------------------------------------------------------------------- /Reports/Customer Segmentation Report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NelakurthiSudheer/Mall-Customers-Segmentation/HEAD/Reports/Customer Segmentation Report.pdf -------------------------------------------------------------------------------- /Reports/Approaches_to_Clustering_in_Customer_Segmentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NelakurthiSudheer/Mall-Customers-Segmentation/HEAD/Reports/Approaches_to_Clustering_in_Customer_Segmentation.pdf -------------------------------------------------------------------------------- /Reports/MALLCUSTOMERSEGMENTATIONUSINGCLUSTERINGALGORITHM.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NelakurthiSudheer/Mall-Customers-Segmentation/HEAD/Reports/MALLCUSTOMERSEGMENTATIONUSINGCLUSTERINGALGORITHM.pdf -------------------------------------------------------------------------------- /Dataset/Mall_Customers.csv: -------------------------------------------------------------------------------- 1 | CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100) 2 | 1,Male,19,15,39 3 | 2,Male,21,15,81 4 | 3,Female,20,16,6 5 | 4,Female,23,16,77 6 | 5,Female,31,17,40 7 | 6,Female,22,17,76 8 | 7,Female,35,18,6 9 | 8,Female,23,18,94 10 | 9,Male,64,19,3 11 | 10,Female,30,19,72 12 | 11,Male,67,19,14 13 | 12,Female,35,19,99 14 | 13,Female,58,20,15 15 | 14,Female,24,20,77 16 | 15,Male,37,20,13 17 | 16,Male,22,20,79 18 | 17,Female,35,21,35 19 | 18,Male,20,21,66 20 | 19,Male,52,23,29 21 | 20,Female,35,23,98 22 | 21,Male,35,24,35 23 | 22,Male,25,24,73 24 | 23,Female,46,25,5 25 | 24,Male,31,25,73 26 | 25,Female,54,28,14 27 | 26,Male,29,28,82 28 | 27,Female,45,28,32 29 | 28,Male,35,28,61 30 | 29,Female,40,29,31 31 | 30,Female,23,29,87 32 | 31,Male,60,30,4 33 | 32,Female,21,30,73 34 | 33,Male,53,33,4 35 | 34,Male,18,33,92 36 | 35,Female,49,33,14 37 | 36,Female,21,33,81 38 | 37,Female,42,34,17 39 | 38,Female,30,34,73 40 | 39,Female,36,37,26 41 | 40,Female,20,37,75 42 | 41,Female,65,38,35 43 | 42,Male,24,38,92 44 | 43,Male,48,39,36 45 | 44,Female,31,39,61 46 | 45,Female,49,39,28 47 | 46,Female,24,39,65 48 | 47,Female,50,40,55 49 | 48,Female,27,40,47 50 | 49,Female,29,40,42 51 | 50,Female,31,40,42 52 | 51,Female,49,42,52 53 | 52,Male,33,42,60 54 | 53,Female,31,43,54 55 | 54,Male,59,43,60 56 | 55,Female,50,43,45 57 | 56,Male,47,43,41 58 | 57,Female,51,44,50 59 | 58,Male,69,44,46 60 | 59,Female,27,46,51 61 | 
60,Male,53,46,46 62 | 61,Male,70,46,56 63 | 62,Male,19,46,55 64 | 63,Female,67,47,52 65 | 64,Female,54,47,59 66 | 65,Male,63,48,51 67 | 66,Male,18,48,59 68 | 67,Female,43,48,50 69 | 68,Female,68,48,48 70 | 69,Male,19,48,59 71 | 70,Female,32,48,47 72 | 71,Male,70,49,55 73 | 72,Female,47,49,42 74 | 73,Female,60,50,49 75 | 74,Female,60,50,56 76 | 75,Male,59,54,47 77 | 76,Male,26,54,54 78 | 77,Female,45,54,53 79 | 78,Male,40,54,48 80 | 79,Female,23,54,52 81 | 80,Female,49,54,42 82 | 81,Male,57,54,51 83 | 82,Male,38,54,55 84 | 83,Male,67,54,41 85 | 84,Female,46,54,44 86 | 85,Female,21,54,57 87 | 86,Male,48,54,46 88 | 87,Female,55,57,58 89 | 88,Female,22,57,55 90 | 89,Female,34,58,60 91 | 90,Female,50,58,46 92 | 91,Female,68,59,55 93 | 92,Male,18,59,41 94 | 93,Male,48,60,49 95 | 94,Female,40,60,40 96 | 95,Female,32,60,42 97 | 96,Male,24,60,52 98 | 97,Female,47,60,47 99 | 98,Female,27,60,50 100 | 99,Male,48,61,42 101 | 100,Male,20,61,49 102 | 101,Female,23,62,41 103 | 102,Female,49,62,48 104 | 103,Male,67,62,59 105 | 104,Male,26,62,55 106 | 105,Male,49,62,56 107 | 106,Female,21,62,42 108 | 107,Female,66,63,50 109 | 108,Male,54,63,46 110 | 109,Male,68,63,43 111 | 110,Male,66,63,48 112 | 111,Male,65,63,52 113 | 112,Female,19,63,54 114 | 113,Female,38,64,42 115 | 114,Male,19,64,46 116 | 115,Female,18,65,48 117 | 116,Female,19,65,50 118 | 117,Female,63,65,43 119 | 118,Female,49,65,59 120 | 119,Female,51,67,43 121 | 120,Female,50,67,57 122 | 121,Male,27,67,56 123 | 122,Female,38,67,40 124 | 123,Female,40,69,58 125 | 124,Male,39,69,91 126 | 125,Female,23,70,29 127 | 126,Female,31,70,77 128 | 127,Male,43,71,35 129 | 128,Male,40,71,95 130 | 129,Male,59,71,11 131 | 130,Male,38,71,75 132 | 131,Male,47,71,9 133 | 132,Male,39,71,75 134 | 133,Female,25,72,34 135 | 134,Female,31,72,71 136 | 135,Male,20,73,5 137 | 136,Female,29,73,88 138 | 137,Female,44,73,7 139 | 138,Male,32,73,73 140 | 139,Male,19,74,10 141 | 140,Female,35,74,72 142 | 141,Female,57,75,5 143 | 142,Male,32,75,93 144 | 143,Female,28,76,40 145 | 144,Female,32,76,87 146 | 145,Male,25,77,12 147 | 146,Male,28,77,97 148 | 147,Male,48,77,36 149 | 148,Female,32,77,74 150 | 149,Female,34,78,22 151 | 150,Male,34,78,90 152 | 151,Male,43,78,17 153 | 152,Male,39,78,88 154 | 153,Female,44,78,20 155 | 154,Female,38,78,76 156 | 155,Female,47,78,16 157 | 156,Female,27,78,89 158 | 157,Male,37,78,1 159 | 158,Female,30,78,78 160 | 159,Male,34,78,1 161 | 160,Female,30,78,73 162 | 161,Female,56,79,35 163 | 162,Female,29,79,83 164 | 163,Male,19,81,5 165 | 164,Female,31,81,93 166 | 165,Male,50,85,26 167 | 166,Female,36,85,75 168 | 167,Male,42,86,20 169 | 168,Female,33,86,95 170 | 169,Female,36,87,27 171 | 170,Male,32,87,63 172 | 171,Male,40,87,13 173 | 172,Male,28,87,75 174 | 173,Male,36,87,10 175 | 174,Male,36,87,92 176 | 175,Female,52,88,13 177 | 176,Female,30,88,86 178 | 177,Male,58,88,15 179 | 178,Male,27,88,69 180 | 179,Male,59,93,14 181 | 180,Male,35,93,90 182 | 181,Female,37,97,32 183 | 182,Female,32,97,86 184 | 183,Male,46,98,15 185 | 184,Female,29,98,88 186 | 185,Female,41,99,39 187 | 186,Male,30,99,97 188 | 187,Female,54,101,24 189 | 188,Male,28,101,68 190 | 189,Female,41,103,17 191 | 190,Female,36,103,85 192 | 191,Female,34,103,23 193 | 192,Female,32,103,69 194 | 193,Male,33,113,8 195 | 194,Female,38,113,91 196 | 195,Female,47,120,16 197 | 196,Female,35,120,79 198 | 197,Female,45,126,28 199 | 198,Male,32,126,74 200 | 199,Male,32,137,18 201 | 200,Male,30,137,83 202 | -------------------------------------------------------------------------------- /R 
Code/Implementation In R.rmd:
--------------------------------------------------------------------------------
---
title: "Implementation In R"
---

```{r}
# Read the dataset (stored under Dataset/ in this repository; adjust the path if needed)
customer_data=read.csv("Mall_Customers.csv")
summary(customer_data)
```

```{r}
names(customer_data)
```

```{r}
head(customer_data)
summary(customer_data$Age)
```

```{r}
sd(customer_data$Age)
summary(customer_data$Annual.Income..k..)
sd(customer_data$Annual.Income..k..)
summary(customer_data$Spending.Score..1.100.)
```

```{r}
sd(customer_data$Spending.Score..1.100.)
```

#### Data Visualization

#### Gender Visualization using a Bar Plot

```{r}
a=table(customer_data$Gender)
barplot(a,main="Using BarPlot to display Gender Comparison",
        ylab="Count",
        xlab="Gender",
        col=rainbow(2),
        legend=rownames(a))
```
From the above bar plot, we observe that the number of females is higher than the number of males.

#### Pie Chart
Let us visualize a pie chart to observe the ratio of the male and female distribution.

```{r}
pct=round(a/sum(a)*100)
lbs=paste(c("Female","Male"),pct,"%",sep=" ")
library(plotrix)
pie3D(a,labels=lbs,
      main="Pie Chart Depicting Ratio of Female and Male")
```
The percentage of females is 56%, whereas the percentage of males in the customer dataset is 44%.

#### Age Visualization

```{r}
summary(customer_data$Age)
```

#### Age Visualization using a Histogram

```{r}
hist(customer_data$Age,
     col="blue",
     main="Histogram to Show Count of Age Class",
     xlab="Age Class",
     ylab="Frequency",
     labels=TRUE)
```

#### Age Visualization using a Boxplot

```{r}
boxplot(customer_data$Age,
        col="#ff0066",
        main="Boxplot for Descriptive Analysis of Age")
```
Most customers are between 30 and 35 years of age. The minimum customer age is 18, whereas the maximum age is 70.

#### Annual Income Analysis

```{r}
summary(customer_data$Annual.Income..k..)
hist(customer_data$Annual.Income..k..,
     col="#660033",
     main="Histogram for Annual Income",
     xlab="Annual Income Class",
     ylab="Frequency",
     labels=TRUE)
```

#### Density Plot

```{r}
plot(density(customer_data$Annual.Income..k..),
     col="yellow",
     main="Density Plot for Annual Income",
     xlab="Annual Income Class",
     ylab="Density")
polygon(density(customer_data$Annual.Income..k..),
        col="#ccff66")
```
The minimum annual income of the customers is 15 and the maximum income is 137. People earning around 70 have the highest frequency count in the histogram distribution. The average income of all the customers is 60.56.

#### Analyzing Spending Score of the Customers

```{r}
summary(customer_data$Spending.Score..1.100.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 50.00 50.20 73.00 99.00
boxplot(customer_data$Spending.Score..1.100.,
        horizontal=TRUE,
        col="#990000",
        main="BoxPlot for Descriptive Analysis of Spending Score")
```

#### Histogram

```{r}
hist(customer_data$Spending.Score..1.100.,
     main="Histogram for Spending Score",
     xlab="Spending Score Class",
     ylab="Frequency",
     col="#6600cc",
     labels=TRUE)
```
The minimum spending score is 1, the maximum is 99, and the average is 50.20. From the histogram, we conclude that customers with a spending score between 40 and 50 form the largest class.

#### Determining the Optimal Number of Clusters

#### 1) Elbow Method

```{r}
library(purrr)
set.seed(123)
# function to calculate total intra-cluster sum of square
iss <- function(k) {
  kmeans(customer_data[,3:5],k,iter.max=100,nstart=100,algorithm="Lloyd")$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
     type="b", pch = 19, frame = FALSE,
     xlab="Number of clusters K",
     ylab="Total intra-clusters sum of squares")
```
4 appears to be an appropriate number of clusters, since it sits at the bend in the elbow plot.

#### 2) Average Silhouette Method

```{r}
library(cluster)
library(gridExtra)
library(grid)
k2<-kmeans(customer_data[,3:5],2,iter.max=100,nstart=50,algorithm="Lloyd")
s2<-plot(silhouette(k2$cluster,dist(customer_data[,3:5],"euclidean")))
```

#### When k=3

```{r}
k3<-kmeans(customer_data[,3:5],3,iter.max=100,nstart=50,algorithm="Lloyd")
s3<-plot(silhouette(k3$cluster,dist(customer_data[,3:5],"euclidean")))
```

#### When k=4
```{r}
k4<-kmeans(customer_data[,3:5],4,iter.max=100,nstart=50,algorithm="Lloyd")
s4<-plot(silhouette(k4$cluster,dist(customer_data[,3:5],"euclidean")))
```
#### When k=5

```{r}
k5<-kmeans(customer_data[,3:5],5,iter.max=100,nstart=50,algorithm="Lloyd")
s5<-plot(silhouette(k5$cluster,dist(customer_data[,3:5],"euclidean")))
```

#### When k=6
```{r}
k6<-kmeans(customer_data[,3:5],6,iter.max=100,nstart=50,algorithm="Lloyd")
s6<-plot(silhouette(k6$cluster,dist(customer_data[,3:5],"euclidean")))
```
#### When k=7
```{r}
k7<-kmeans(customer_data[,3:5],7,iter.max=100,nstart=50,algorithm="Lloyd")
s7<-plot(silhouette(k7$cluster,dist(customer_data[,3:5],"euclidean")))
```
#### When k=8

```{r}
k8<-kmeans(customer_data[,3:5],8,iter.max=100,nstart=50,algorithm="Lloyd")
s8<-plot(silhouette(k8$cluster,dist(customer_data[,3:5],"euclidean")))
```
#### When k=9

```{r}
k9<-kmeans(customer_data[,3:5],9,iter.max=100,nstart=50,algorithm="Lloyd")
s9<-plot(silhouette(k9$cluster,dist(customer_data[,3:5],"euclidean")))
```

#### When k=10
```{r}
k10<-kmeans(customer_data[,3:5],10,iter.max=100,nstart=50,algorithm="Lloyd")
s10<-plot(silhouette(k10$cluster,dist(customer_data[,3:5],"euclidean")))
```
Now, we use the fviz_nbclust() function to determine and visualize the optimal number of clusters:

```{r}
library(NbClust)
library(factoextra)
fviz_nbclust(customer_data[,3:5], kmeans, method = "silhouette")
```

#### 3) Gap Statistic Method
To compute the gap statistic, we use the clusGap() function, which provides the gap statistic as well as its standard error for a given output.

```{r}
set.seed(125)
stat_gap <- clusGap(customer_data[,3:5], FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
fviz_gap_stat(stat_gap)
```

#### k=6 as the Optimal Number of Clusters
Taking the three methods together, we proceed with k = 6.

```{r}
k6<-kmeans(customer_data[,3:5],6,iter.max=100,nstart=50,algorithm="Lloyd")
k6
```
#### Visualizing the Clustering Results using the First Two Principal Components

```{r}
pcclust=prcomp(customer_data[,3:5],scale=FALSE) #principal component analysis
summary(pcclust)
pcclust$rotation[,1:2]
```

```{r}
library(ggplot2)
set.seed(1)
ggplot(customer_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100.)) +
  geom_point(aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name=" ",
                       breaks=c("1", "2", "3", "4", "5", "6"),
                       labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")) +
  ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")
```

```{r}
ggplot(customer_data, aes(x = Spending.Score..1.100., y = Age)) +
  geom_point(aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name=" ",
                       breaks=c("1", "2", "3", "4", "5", "6"),
                       labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")) +
  ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")
```

```{r}
# Assign one rainbow color to each cluster label
kCols=function(vec){
  cols=rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}
digCluster<-k6$cluster; dignm<-as.character(digCluster) # K-means clusters
plot(pcclust$x[,1:2], col=kCols(digCluster), pch=19, xlab="PC1", ylab="PC2")
legend("bottomleft",unique(dignm),fill=unique(kCols(digCluster)))
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
### Overview
Customer Segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify distinct segments of customers, allowing them to target their potential user base. In this machine learning project, we will make use of [K-means clustering](https://data-flair.training/blogs/k-means-clustering-tutorial/), the essential algorithm for clustering unlabeled datasets. Before going ahead with this project, let us learn what customer segmentation actually is.<br>
![seg](https://user-images.githubusercontent.com/90209933/147927776-948a9af0-18bb-49ac-bbc0-30efd2790649.png)

### What is Customer Segmentation
Customer Segmentation is the process of dividing a customer base into several groups of individuals that are similar in ways relevant to marketing, such as gender, age, interests, and spending habits.<br>

Companies that deploy customer segmentation operate under the notion that every customer has different requirements and requires a specific marketing effort to be addressed appropriately. Companies aim to gain a deeper understanding of the customers they are targeting, so their approach has to be specific and tailored to the requirements of each individual customer. Furthermore, through the data collected, companies can gain a deeper understanding of customer preferences and discover valuable segments that would reap them maximum profit. This way, they can strategize their marketing more efficiently and minimize the risk to their investment.<br>

The technique of customer segmentation depends on several key differentiators that divide customers into targetable groups. Data related to demographics, geography, economic status, and behavioral patterns play a crucial role in determining the company's direction when addressing the various segments.<br>
### What is the K-Means Algorithm
When using the k-means clustering algorithm, the first step is to indicate the number of clusters (k) that we wish to produce in the final output. The algorithm starts by randomly selecting k objects from the dataset to serve as the initial centers of the clusters. These selected objects are the cluster means, also known as centroids. Then, each remaining object is assigned to its closest centroid, where "closest" is defined by the Euclidean distance between the object and the cluster mean. We refer to this step as "cluster assignment". When the assignment is complete, the algorithm calculates the new mean value of each cluster in the data. After the centers have been recalculated, every observation is checked to see whether it is now closer to a different cluster, and the objects are reassigned using the updated cluster means. This repeats over several iterations until the cluster assignments stop changing, that is, until the clusters in the current iteration are the same as those obtained in the previous iteration.<br>
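
To make the assignment and update steps concrete, here is a minimal sketch of one Lloyd iteration in base R. It is illustrative only: the names `lloyd_step`, `X`, and `centers` are hypothetical, and in practice the built-in `kmeans()` function performs these steps for you.
```
# One Lloyd iteration: assign each point to its nearest centroid,
# then move each centroid to the mean of its assigned points.
# X: numeric matrix of observations; centers: k x p matrix of current centroids.
lloyd_step <- function(X, centers) {
  # Assignment step: squared Euclidean distance from every point to every centroid
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(X, 2, centers[j, ])^2)
  })
  cluster <- max.col(-d2)  # index of the nearest centroid for each row
  # Update step: recompute each centroid as the mean of its assigned points
  new_centers <- t(sapply(seq_len(nrow(centers)), function(j) {
    colMeans(X[cluster == j, , drop = FALSE])
  }))
  list(cluster = cluster, centers = new_centers)
}
```
Repeating this step until `cluster` no longer changes yields the converged k-means solution described above.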

### Dataset
The dataset is acquired from Kaggle and the link is given below:

https://www.kaggle.com/nelakurthisudheer/mall-customer-segmentation

The dataset consists of the following five features of 200 customers:

- CustomerID: Unique ID assigned to the customer

- Gender: Gender of the customer

- Age: Age of the customer

- Annual Income (k$): Annual income of the customer

- Spending Score (1-100): Score assigned by the mall based on customer behavior and spending nature.


### Steps for implementation
- Import all necessary packages (these are the ones loaded in the R Markdown implementation)
```
library(plotrix)     # 3D pie chart
library(purrr)       # map_dbl for the elbow method
library(cluster)     # silhouette, clusGap
library(gridExtra)
library(grid)
library(NbClust)
library(factoextra)  # fviz_nbclust, fviz_gap_stat
library(ggplot2)
```

- Data Exploration
```
customer_data=read.csv("/home/dataflair/Mall_Customers.csv")
str(customer_data)
names(customer_data)

head(customer_data)
summary(customer_data$Age)
```
- Statistical Analysis
```
sd(customer_data$Age)
summary(customer_data$Annual.Income..k..)
sd(customer_data$Annual.Income..k..)
summary(customer_data$Spending.Score..1.100.)
```
- Visualizations

```
Bar Plot

a=table(customer_data$Gender)
barplot(a,main="Using BarPlot to display Gender Comparison",
        ylab="Count",
        xlab="Gender",
        col=rainbow(2),
        legend=rownames(a))
```

```
Pie Chart

pct=round(a/sum(a)*100)
lbs=paste(c("Female","Male"),pct,"%",sep=" ")
library(plotrix)
pie3D(a,labels=lbs,
      main="Pie Chart Depicting Ratio of Female and Male")
```

```
Histogram

hist(customer_data$Age,
     col="blue",
     main="Histogram to Show Count of Age Class",
     xlab="Age Class",
     ylab="Frequency",
     labels=TRUE)
```

```
Box Plot

boxplot(customer_data$Age,
        col="#ff0066",
        main="Boxplot for Descriptive Analysis of Age")
```
- Analysis

```
Analyzing the annual income of the customers through a histogram

summary(customer_data$Annual.Income..k..)
hist(customer_data$Annual.Income..k..,
     col="#660033",
     main="Histogram for Annual Income",
     xlab="Annual Income Class",
     ylab="Frequency",
     labels=TRUE)
```

```
Density Plot

plot(density(customer_data$Annual.Income..k..),
     col="yellow",
     main="Density Plot for Annual Income",
     xlab="Annual Income Class",
     ylab="Density")
polygon(density(customer_data$Annual.Income..k..),
        col="#ccff66")
```
```
Analyzing the spending score of the customers with the help of a boxplot

summary(customer_data$Spending.Score..1.100.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 50.00 50.20 73.00 99.00

boxplot(customer_data$Spending.Score..1.100.,
        horizontal=TRUE,
        col="#990000",
        main="BoxPlot for Descriptive Analysis of Spending Score")
```
### K-means Algorithm
- We specify the number of clusters (k) that we need to create.
- The algorithm selects k objects at random from the dataset. These objects serve as the initial cluster centers, or means.
- Each observation is assigned to its closest centroid, based on the Euclidean distance between the object and the centroid.
- Each of the k cluster centroids is updated by calculating the new mean of all the data points currently in the cluster. The kth cluster's centroid is a vector of length p containing the means of all variables for the observations in that cluster, where p denotes the number of variables.
- Iterative minimization of the total within-cluster sum of squares: the assignments stop changing once they converge or once we reach the maximum number of iterations (R's kmeans() uses a default of 10), as shown in the call sketched below.<br>
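
In R, all of these steps are handled by a single call to the built-in `kmeans()` function; for instance, the final model used in `Implementation In R.rmd` clusters on columns 3 to 5 (Age, Annual Income, Spending Score):
```
# Lloyd's algorithm with 6 clusters on Age, Annual Income and Spending Score
k6 <- kmeans(customer_data[,3:5], centers=6,
             iter.max=100, nstart=50, algorithm="Lloyd")
k6$centers   # final centroids (mean Age, Income, Score per cluster)
k6$cluster   # cluster label assigned to each of the 200 customers
```
Setting `nstart=50` reruns the algorithm from 50 random initializations and keeps the best result, which guards against poor random starting centroids.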
### Determining Optimal Clusters
While working with clusters, you need to specify the number of clusters to use, and ideally you would like that number to be optimal. Three popular methods help determine it –<br>

- Elbow method
The main goal behind cluster partitioning methods like k-means is to define the clusters such that the total intra-cluster variation is minimized:<br>

minimize( sum W(C_k) ),  k = 1…K<br>

where C_k represents the kth cluster and W(C_k) denotes the intra-cluster variation. By measuring the total intra-cluster variation, one can evaluate the compactness of the clustering boundary. We can then proceed to define the optimal clusters as follows –<br>

First, we run the clustering algorithm for several values of k, varying k from 1 to 10 clusters, and calculate the total intra-cluster sum of squares (iss) for each. We then plot iss against the number of clusters k. The location of a bend, or knee, in this plot indicates the appropriate number of clusters for our model.<br>
```
library(purrr)
set.seed(123)
# function to calculate total intra-cluster sum of square
iss <- function(k) {
  kmeans(customer_data[,3:5],k,iter.max=100,nstart=100,algorithm="Lloyd")$tot.withinss
}

k.values <- 1:10

iss_values <- map_dbl(k.values, iss)

plot(k.values, iss_values,
     type="b", pch = 19, frame = FALSE,
     xlab="Number of clusters K",
     ylab="Total intra-clusters sum of squares")
```
![K-Means-Elbow-graph-in-R](https://user-images.githubusercontent.com/90209933/147944801-616cf62b-cfdb-4504-aeb7-93e14767ea99.png)
From the above graph, we conclude that 4 is an appropriate number of clusters, since it appears at the bend in the elbow plot.

- Average Silhouette method
With the help of the average silhouette method, we can measure the quality of our clustering operation and determine how well each data object lies within its cluster. A high average silhouette width indicates good clustering. The average silhouette method computes the mean silhouette of the observations for different values of k; the optimal number of clusters is the k that maximizes the average silhouette over the candidate values.<br>
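
Concretely, the silhouette width of a single observation i compares its mean distance a(i) to the other members of its own cluster with its mean distance b(i) to the members of the nearest neighboring cluster (this is the standard definition, included here for reference):
```
s(i) = (b(i) - a(i)) / max(a(i), b(i)),   -1 <= s(i) <= 1
```
Values near 1 mean the observation lies well within its cluster, while values near -1 suggest it was assigned to the wrong cluster; the mean of s(i) over all observations is the quantity the method maximizes.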

Using the silhouette() function from the cluster package, we can compute the average silhouette width for a kmeans result. Here, the optimal number of clusters is the one with the highest average silhouette width.
```
library(cluster)
library(gridExtra)
library(grid)


k2<-kmeans(customer_data[,3:5],2,iter.max=100,nstart=50,algorithm="Lloyd")
s2<-plot(silhouette(k2$cluster,dist(customer_data[,3:5],"euclidean")))
```
![np-function-graph-in-data-science-clustering](https://user-images.githubusercontent.com/90209933/147944967-a179c0b4-8f77-4f23-bd62-dbbe44b05285.png)

- Gap statistic
In 2001, researchers at Stanford University – R. Tibshirani, G. Walther and T. Hastie – published the gap statistic method. It can be applied to any clustering method, such as k-means or hierarchical clustering. Using the gap statistic, one can compare the total intra-cluster variation for different values of k against its expected value under a null reference distribution of the data. The reference datasets are produced with Monte Carlo simulations: for each variable in the dataset, we compute its range from min(x_i) to max(x_i) and sample values uniformly from that interval.<br>
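
In formula form (the definition from the Tibshirani et al. paper), with W_k the total intra-cluster variation for k clusters and the expectation estimated from B Monte Carlo reference samples:
```
Gap_n(k) = E*_n[ log(W_k) ] - log(W_k)
         ~ (1/B) * sum_b log(W_kb*) - log(W_k)
```
The suggested rule is then to pick the smallest k whose gap value is within one standard error of the gap at k+1.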

To compute the gap statistic we can use the clusGap() function, which provides the gap statistic as well as its standard error for a given output.
```
library(cluster)     # clusGap
library(factoextra)  # fviz_gap_stat

set.seed(125)
stat_gap <- clusGap(customer_data[,3:5], FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
fviz_gap_stat(stat_gap)
```
![fviz_gap_stat-function-graph-in-ml](https://user-images.githubusercontent.com/90209933/147945091-24ba725d-f494-4f65-9389-6eefd2fadb2b.png)<br>

Comparing these three methods, we settle on k = 6 as the optimal number of clusters and visualize the resulting segments on the first two principal components.
![PCA-Cluster-Graph-in-ML-1](https://user-images.githubusercontent.com/90209933/147945560-dbc083a0-997c-4730-aab7-d6bcc82b40fd.png)
![PCA-Cluster-Graph-in-data-science](https://user-images.githubusercontent.com/90209933/147945716-2f0c5519-fb87-4953-8827-e090e37d07ee.png)
From the above segmented graph:<br>
- Cluster 4 and 1 – These two clusters consist of customers with a medium PCA1 and a medium PCA2 score.

- Cluster 6 – This cluster represents customers having a high PCA2 and a low PCA1.

- Cluster 5 – In this cluster, there are customers with a medium PCA1 and a low PCA2 score.

- Cluster 3 – This cluster comprises customers with a high PCA1 and a high PCA2.

- Cluster 2 – This cluster comprises customers with a high PCA2 and a medium PCA1.<br>
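
For reference, the two graphs above are produced by projecting the clustered customers onto the first two principal components; the following is a condensed sketch of the plotting code from `Implementation In R.rmd`, assuming `k6` is the fitted 6-cluster k-means model from earlier:
```
# Project the three features onto the first two principal components
# and color each customer by its k-means cluster label.
pcclust <- prcomp(customer_data[,3:5], scale.=FALSE)
plot(pcclust$x[,1:2],
     col=rainbow(6)[k6$cluster], pch=19,
     xlab="PC1", ylab="PC2")
legend("bottomleft", legend=paste("Cluster", 1:6), fill=rainbow(6))
```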

With the help of clustering, we can understand the variables much better, prompting us to make informed decisions. By identifying customer segments, companies can release products and services that target customers based on several parameters like income, age, and spending patterns. Furthermore, more complex patterns like product reviews can be taken into consideration for better segmentation.
--------------------------------------------------------------------------------