├── LICENSE
├── README.md
├── kmeans.vba
└── kmeans.xlsm


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 bquanttrading, asmquant, gpolic
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # k-means
 2 | 
 3 | K-means is an algorithm for cluster analysis (clustering). It partitions a set of data into groups (clusters) of related records.
 4 | K-means clustering is useful for Data Mining and Business Intelligence.
 5 | 
 6 | Here is k-means in plain English:
 7 | https://hackerbits.com/data/k-means-data-mining-algorithm/
 8 | 
 9 | This script is based on the work of bquanttrading. His blog on market modelling and market analytics:
10 | 
11 | https://asmquantmacro.com
12 | 
13 | 
14 | # What does it do
15 | 
16 | k-means classifies each record in your data, placing it into a group (cluster). You do not need to specify the properties of each group; k-means determines the groups itself. You usually do need to provide the number of groups that you want in the output.
17 | 
18 | The records in the same cluster are similar to each other. Records in different clusters are dissimilar.
19 | 
20 | Each row of your Excel data should be a record (observation) with one or more features. Each column is a feature of the observation.
21 | 
22 | As an example, here is a data set with the height and weight of 25,000 children in Hong Kong: http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html
23 | 
24 | Each row in the data represents a person. Each column is a feature of the person.
25 | 
26 | Currently the script works _only_ with numerical data.
27 | 
28 | 
29 | 
30 | # How does it work
31 | 
32 | * Enter your data in a new Excel worksheet
33 | * Enter the name of the worksheet in cell C4, and the range of the data in C5
34 | * Enter the worksheet where the output should be placed in C6 (you can use the one where your data is)
35 | * Enter the top cell of the output column in C7
36 | * Enter the number of groups in your data in C8
37 | * Click the button to start
38 | * Check the Result
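
The same settings can also be filled in from code. The sketch below is only an illustration: the named ranges (`InputSheet`, `InputRange`, `OutputSheet`, `OutputRange`, `Clusters`, `MaximumIterations`) and the `Start` sheet are the ones kmeans.vba reads, and presumably they point to the cells listed above. The sheet name `Data`, the addresses `A2:D151` and `F2`, the iteration limit and the helper name `RunKMeansExample` are placeholders that are not part of the workbook. It also assumes the macro can be started with `Application.Run "kmeans"`.

```vba
' Hypothetical helper: fills in the settings that kmeans.vba reads, then runs it.
Sub RunKMeansExample()
    With ActiveWorkbook.Worksheets("Start")
        .Range("InputSheet").Value = "Data"        ' placeholder: worksheet holding the records
        .Range("InputRange").Value = "A2:D151"     ' placeholder: numeric cells only, no header row
        .Range("OutputSheet").Value = "Data"       ' may be the same sheet as the input
        .Range("OutputRange").Value = "F2"         ' placeholder: top cell of the cluster-number column
        .Range("Clusters").Value = 3               ' number of groups to look for
        .Range("MaximumIterations").Value = 100    ' placeholder limit
    End With

    ' Assumption: the macro is reachable under the name "kmeans"
    Application.Run "kmeans"
End Sub
```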
39 | 
40 | If you do not know the number of clusters/groups contained in your data, try different values, for example 1 up to 10.
41 | Execute the script several times and observe the GAP figure.
42 | The number of clusters at which GAP reaches its maximum value is a good choice for this data set.
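
For reference, the GAP figure can be read off kmeans.vba as follows. The symbols are notation introduced here, not names used by the script: $n$ is the number of records, $d$ the number of columns, $K$ the number of clusters, $x_i$ a record and $\mu_{c(i)}$ the center of the cluster it was assigned to. Logarithms are natural logarithms.

$$
D = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert_2, \qquad
W_K = \frac{D}{2n}, \qquad
\mathrm{GAP}(K) = \ln\frac{n\,d}{12} - \frac{2}{d}\ln K - \ln W_K
$$

$D$ is the value reported as "Distance". The first two terms of $\mathrm{GAP}(K)$ are a closed-form expression that the script uses for the expected value of $\ln W_K$; it appears to stand in for the estimate that the paper obtains from reference data sets.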
43 | 
44 | As an example, when the number of clusters is varied for the IRIS data set, GAP reaches its maximum at 3 clusters.
45 | 	
46 | The original paper that describes the GAP calculation: https://web.stanford.edu/~hastie/Papers/gap.pdf
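
Trying several cluster counts by hand can be automated. The following is a minimal sketch, not part of the workbook: `ScanClusterCounts` and `MAX_K` are hypothetical names, while the `Clusters` named range and cell C17 (where kmeans.vba writes GAP) are taken from the script. It assumes the macro can be started with `Application.Run "kmeans"` and that the data has at least `MAX_K` records.

```vba
' Hypothetical helper: runs k-means for k = 1..MAX_K and reports the k with the largest GAP.
Sub ScanClusterCounts()
    Const MAX_K As Long = 10
    Dim startSht As Worksheet
    Set startSht = ActiveWorkbook.Worksheets("Start")

    Dim k As Long, gap As Double
    Dim bestK As Long, bestGap As Double

    For k = 1 To MAX_K
        startSht.Range("Clusters").Value = k
        Application.Run "kmeans"                 ' assumption: macro is reachable by this name
        gap = startSht.Range("C17").Value        ' GAP written by kmeans.vba

        Debug.Print "k = " & k, "GAP = " & gap   ' log to the Immediate window

        If k = 1 Or gap > bestGap Then
            bestK = k
            bestGap = gap
        End If
    Next k

    ' Leave the workbook showing a run with the best k found
    startSht.Range("Clusters").Value = bestK
    Application.Run "kmeans"
    MsgBox "Largest GAP at k = " & bestK
End Sub
```

Each run starts from different initial centroids, so the GAP for a given k can vary slightly between scans.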
47 | 	
48 | # The results
49 | 
50 | The result is a number assigned to each record that indicates the group/cluster the record belongs to.
51 | 
52 | The Result sheet lists the number of records in each cluster, along with the cluster centers.
53 | 
54 | 
55 | # Performance
56 | 
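57 | The "Distance" value is the sum of the distances from each record to the center of its cluster. A lower value means tighter clusters, so of two runs with the same number of clusters, the one with the smaller "Distance" is the better result.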
58 | 
59 | Execute the algorithm several times to find the best results.
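
Repeating the run and keeping the best result can be sketched as below. This is an assumption-laden illustration rather than part of the workbook: `BestOfRestarts` and `RESTARTS` are hypothetical names, while the named ranges and cell C16 (where kmeans.vba writes "Distance") come from the script. It again assumes the macro can be started with `Application.Run "kmeans"`.

```vba
' Hypothetical helper: runs k-means several times and keeps the labels with the lowest Distance.
Sub BestOfRestarts()
    Const RESTARTS As Long = 10
    Dim startSht As Worksheet
    Set startSht = ActiveWorkbook.Worksheets("Start")

    ' Locate the output column the same way kmeans.vba does
    Dim inSht As String, inRng As String, outSht As String, outCell As String
    inSht = startSht.Range("InputSheet").Value
    inRng = startSht.Range("InputRange").Value
    outSht = startSht.Range("OutputSheet").Value
    outCell = startSht.Range("OutputRange").Value

    Dim nRecords As Long
    nRecords = Worksheets(inSht).Range(inRng).Rows.Count

    Dim outRng As Range
    Set outRng = Worksheets(outSht).Range(outCell).Resize(nRecords, 1)

    Dim i As Long, distance As Double
    Dim bestDistance As Double, bestLabels As Variant

    For i = 1 To RESTARTS
        Application.Run "kmeans"                  ' assumption: macro is reachable by this name
        distance = startSht.Range("C16").Value    ' Distance written by kmeans.vba
        If i = 1 Or distance < bestDistance Then
            bestDistance = distance
            bestLabels = outRng.Value             ' remember the cluster numbers of this run
        End If
    Next i

    ' Restore the best labels and their Distance
    outRng.Value = bestLabels
    startSht.Range("C16").Value = bestDistance
End Sub
```

Only the cluster numbers and the "Distance" cell are restored; the Result sheet and the GAP cell still show the last run.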
60 | 
61 | The script stops when the clusters have converged (no record changes cluster between two passes) or when the maximum number of iterations is reached, whichever comes first. You can increase the maximum number of iterations if the script stops before it converges.
62 | 
63 | Unfortunately, Excel VBA runs on a single thread, so the script cannot take advantage of multiple CPU cores.
64 | 
65 | # Why is this different?
66 | 
67 | The script calculates the initial centroids using the _k-means++_ algorithm. You do not have to provide the initial centroids.
68 | It also provides an indication of the number of groups contained in the data, using the GAP calculation.
69 | 
70 | # More info
71 | 
72 | This script implements the k-means++ algorithm of David Arthur and Sergei Vassilvitskii, which chooses the initial centroids.
73 | https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf
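
The core of k-means++ is that each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen. The following standalone sketch shows only that weighted draw. `PickWeightedIndex` and `DemoPickWeightedIndex` are hypothetical names and are not functions of kmeans.vba, which performs the same scan inline.

```vba
' Returns an index of weights, chosen with probability proportional
' to the weight stored at that index.
Function PickWeightedIndex(weights() As Double) As Long
    Dim i As Long
    Dim total As Double, running As Double, threshold As Double

    For i = LBound(weights) To UBound(weights)
        total = total + weights(i)
    Next i

    threshold = Rnd * total               ' Rnd returns a value in [0, 1)
    PickWeightedIndex = UBound(weights)   ' fallback for rounding or all-zero weights

    ' Walk the cumulative sum until it passes the threshold
    For i = LBound(weights) To UBound(weights)
        running = running + weights(i)
        If running > threshold Then
            PickWeightedIndex = i
            Exit For
        End If
    Next i
End Function

Sub DemoPickWeightedIndex()
    Dim d2(1 To 3) As Double
    ' Squared distances to the nearest chosen centroid (made-up values)
    d2(1) = 0#: d2(2) = 1#: d2(3) = 9#
    Randomize
    ' Index 3 is drawn about nine times out of ten; index 1 has weight 0 and is never drawn
    Debug.Print PickWeightedIndex(d2)
End Sub
```

A record that is already a centroid has squared distance 0, so it cannot be drawn again unless every remaining weight is 0.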
74 | 
75 | The example dataset provided in kmeans.xlsm is _IRIS_ from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/53/iris
76 | 
77 | 
78 | 


--------------------------------------------------------------------------------
/kmeans.vba:
--------------------------------------------------------------------------------
  1 | Option Base 1
  2 | Option Explicit
  3 | 
  4 | Public Sub kmeans()
  5 |     Dim wkSheet As Worksheet
  6 |     Set wkSheet = ActiveWorkbook.Worksheets("Start")
  7 | 
  8 |     Dim MaximumIterations As Integer: MaximumIterations = wkSheet.Range("MaximumIterations").Value
  9 |     Dim DataSht As String: DataSht = wkSheet.Range("InputSheet").Value
 10 |     Dim DataRange As String: DataRange = wkSheet.Range("InputRange").Value
 11 |     Dim DataRecords As Variant: DataRecords = Worksheets(DataSht).Range(DataRange)
 12 |     Dim NUMBER_OF_RECORDS As Integer: NUMBER_OF_RECORDS = UBound(DataRecords, 1)
 13 |     Dim NUMCLUSTERS As Integer: NUMCLUSTERS = wkSheet.Range("Clusters").Value
 14 |     Dim ClusterIndexes As Variant, Centroids As Variant, InitialCentroidsCalc As Variant
 15 |     Dim ClustersUpdated As Integer, counter As Integer: counter = 1
 16 |     Dim StartTime As Double
 17 |     Randomize                       ' seed Rnd so the random first centroid differs between sessions
 18 |     StartTime = Timer
 19 |     Application.StatusBar = "   [ Initialize. ]"
 20 |     
 21 |     ' initialize centroids with kmeans++ method
 22 |     InitialCentroidsCalc = ComputeInitialCentroidsCalc(DataRecords, NUMCLUSTERS)
 23 | 
 24 |     Application.StatusBar = "   [ Start..     ]"
 25 |     'Application.ScreenUpdating = False
 26 |     
 27 |     ' First pass. Assign each record (observation) to an initial cluster. ClusterIndexes is updated
 28 |     ClustersUpdated = FindClosestCentroid(DataRecords, InitialCentroidsCalc, ClusterIndexes)
 29 |     
 30 |     '  The result returned from FindClosestCentroid is not relevant right now
 31 |     ClustersUpdated = 1
 32 |     
 33 |     ' We will process k-means until it converges or MaximumIterations is reached
 34 |     While counter <= MaximumIterations And ClustersUpdated > 0
 35 |         Application.StatusBar = "   [ Pass: " + CStr(counter) + "     ]"
 36 |         
 37 |         ' calculate new centroids for each cluster
 38 |         Centroids = ComputeCentroids(DataRecords, ClusterIndexes, NUMCLUSTERS)
 39 |         
 40 |         ' assign each record in a cluster based on the new centroids
 41 |         ClustersUpdated = FindClosestCentroid(DataRecords, Centroids, ClusterIndexes)
 42 |         counter = counter + 1
 43 |     Wend
 44 |     
 45 |     Application.StatusBar = "   Completed after " + CStr(counter - 1) + " iterations"
 46 |     'Application.ScreenUpdating = True
 47 |     
 48 |     ' show the clusters assigned in the output sheet/range
 49 |     Dim ClusterOutputSht As String: ClusterOutputSht = wkSheet.Range("OutputSheet").Value
 50 |     Dim ClusterOutputRange As String: ClusterOutputRange = wkSheet.Range("OutputRange").Value
 51 |     Worksheets(ClusterOutputSht).Range(ClusterOutputRange).Resize(NUMBER_OF_RECORDS, 1).Value = WorksheetFunction.Transpose(ClusterIndexes)
 52 |     
 53 |     Call ShowResult(DataRecords, ClusterIndexes, Centroids, NUMCLUSTERS)
 54 |     
 55 |     ' show more results
 56 |     Dim Distance As Double, ExpO As Double, Wk As Double
 57 |     
 58 |     Distance = CalculateDistances(DataRecords, Centroids, ClusterIndexes)
 59 |     ExpO = CalculateExpectation(DataRecords, NUMCLUSTERS)
 60 |     Wk = (1 / (2 * NUMBER_OF_RECORDS)) * Distance
 61 |     
 62 |     wkSheet.Range("C16").Value = Distance
 63 |     wkSheet.Range("C17").Value = ExpO - Log(Wk)
 64 |     'wkSheet.Range("C18").Value = ExpO
 65 |     'wkSheet.Range("C19").Value = Wk
 66 |     
 67 |     'MsgBox "Time elapsed " & Round(Timer - StartTime, 2) & " seconds", vbInformation
 68 | End Sub
 69 | 
 70 | 
 71 | Function CalculateDistances(ByRef DataRecords As Variant, ByRef Centroids As Variant, ByRef Cluster_Indexes As Variant) As Variant
 72 |     Dim NUMBER_OF_RECORDS As Integer: NUMBER_OF_RECORDS = UBound(DataRecords, 1)
 73 |     Dim NUMBER_OF_COLUMNS As Integer: NUMBER_OF_COLUMNS = UBound(DataRecords, 2)
 74 |     Dim NUMCLUSTERS As Integer: NUMCLUSTERS = UBound(Centroids, 1)
 75 |     Dim DistanceInCluster() As Variant:   ReDim DistanceInCluster(NUMCLUSTERS)
 76 |     Dim clusterCounter As Integer, recordCounter As Integer, recordsInCluster As Integer
 77 |     Dim DistanceSum As Double: DistanceSum = 0
 78 |     
 79 |     For clusterCounter = 1 To NUMCLUSTERS
 80 |             
 81 |             recordsInCluster = 0
 82 |             For recordCounter = 1 To NUMBER_OF_RECORDS
 83 |             
 84 |                 If Cluster_Indexes(recordCounter) = clusterCounter Then
 85 |                     DistanceInCluster(clusterCounter) = DistanceInCluster(clusterCounter) + _
 86 |                         EuclideanDistance(Application.Index(Centroids, clusterCounter, 0), Application.Index(DataRecords, recordCounter, 0), NUMBER_OF_COLUMNS)
 87 |                     recordsInCluster = recordsInCluster + 1
 88 |                 End If
 89 |                 
 90 |             Next recordCounter
 91 |             
 92 |             'DistanceSum = DistanceSum + Sqr(DistanceInCluster(clusterCounter) / recordsInCluster)
 93 |             DistanceSum = DistanceSum + DistanceInCluster(clusterCounter)
 94 |     Next clusterCounter
 95 |     
 96 |     CalculateDistances = DistanceSum
 97 | End Function
 98 | 
 99 | 
100 | Function CalculateExpectation(ByRef DataRecords As Variant, NUMCLUSTERS As Integer) As Double
101 |     Dim NUMBER_OF_RECORDS As Integer: NUMBER_OF_RECORDS = UBound(DataRecords, 1)
102 |     Dim NUMBER_OF_COLUMNS As Integer: NUMBER_OF_COLUMNS = UBound(DataRecords, 2)
103 |     
104 |     CalculateExpectation = Log((NUMBER_OF_RECORDS * NUMBER_OF_COLUMNS) / 12) - ((2 / NUMBER_OF_COLUMNS) * Log(NUMCLUSTERS))
105 | End Function
106 | 
107 | 
108 | ' Select initial centroids
109 | '
110 | Function ComputeInitialCentroidsCalc(ByRef DataRecords As Variant, NUMCLUSTERS As Integer) As Variant
111 | 
112 |     Dim NUMBER_OF_RECORDS As Integer: NUMBER_OF_RECORDS = UBound(DataRecords, 1)
113 |     Dim NUMBER_OF_COLUMNS As Integer: NUMBER_OF_COLUMNS = UBound(DataRecords, 2)
114 |     Dim Taken() As Variant: ReDim Taken(NUMBER_OF_RECORDS)
115 |     
116 |     Dim InitialCentroidsCalc As Variant: ReDim InitialCentroidsCalc(NUMCLUSTERS, NUMBER_OF_COLUMNS) As Variant
117 |     Dim minDistSquared As Variant: ReDim minDistSquared(NUMBER_OF_RECORDS)
118 |     Dim counter As Integer, CentroidsFound As Integer, FirstCentroidIndex As Integer
119 |     Dim dist As Double
120 |     Dim preventLoop As Boolean: preventLoop = True
121 |     Dim FirstCentroid As Variant: ReDim FirstCentroid(NUMBER_OF_COLUMNS)
122 |    
123 |     
124 |     FirstCentroidIndex = Int(Rnd * NUMBER_OF_RECORDS) + 1         ' The first centroid is random !
125 |     
126 | ' Optional variation on the standard kmeans++ algorithm (disabled): choose the first centroid from the mean values, not by random selection
127 | ' First Centroid - Choose the record that is closest to the mean
128 | ' ------------------------------------------------------------------
129 | '    Dim colCounter As Integer
130 | '    For colCounter = 1 To NUMBER_OF_COLUMNS
131 | '        For counter = 1 To NUMBER_OF_RECORDS
132 | '            FirstCentroid(colCounter) = FirstCentroid(colCounter) + DataRecords(counter, colCounter)
133 | '        Next counter
134 | '        FirstCentroid(colCounter) = FirstCentroid(colCounter) / NUMBER_OF_RECORDS  ' find the mean
135 | '    Next colCounter
136 | '
137 | '    Dim MinimumDistance As Double: MinimumDistance = 99999999
138 | '    Dim MinRecord As Variant
139 | '    Dim recordNumber As Integer
140 | '    For recordNumber = 1 To NUMBER_OF_RECORDS          ' calculate distance to all records and select the record closer to the mean
141 | '        dist = EuclideanDistance(Application.Index(DataRecords, recordNumber, 0), FirstCentroid, NUMBER_OF_COLUMNS)
142 | '        If dist < MinimumDistance Then
143 | '            FirstCentroidIndex = recordNumber            ' the record with lowest distance to the means will be 1st centroid
144 | '            MinimumDistance = dist
145 | '        End If
146 | '    Next recordNumber                            ' check with next data record
147 | ' ------------------------------------------------------------------
148 |     
149 |     For counter = 1 To NUMBER_OF_COLUMNS
150 |         ' put this data record in FirstCentroid
151 |         FirstCentroid(counter) = DataRecords(FirstCentroidIndex, counter)
152 |         
153 |         ' and put it also in the array of results
154 |         InitialCentroidsCalc(1, counter) = FirstCentroid(counter)
155 |     Next counter
156 |     
157 |     ' mark point as Taken. We have one cluster center
158 |     Taken(FirstCentroidIndex) = 1
159 |     CentroidsFound = 1
160 |     
161 |     For counter = 1 To NUMBER_OF_RECORDS
162 |     
163 |         If Not counter = FirstCentroidIndex Then
164 |             dist = EuclideanDistance(FirstCentroid, Application.Index(DataRecords, counter, 0), NUMBER_OF_COLUMNS)
165 |             minDistSquared(counter) = dist * dist
166 |         End If
167 |         
168 |     Next counter
169 | 
170 |     ' main loop
171 |     Do While CentroidsFound < NUMCLUSTERS And preventLoop = True
172 |         
173 |             ' sum all the squared distances of the points not already taken
174 |             Dim distSqSum As Double: distSqSum = 0
175 |             For counter = 1 To NUMBER_OF_RECORDS
176 |             
177 |                 If Not Taken(counter) = 1 Then
178 |                 distSqSum = distSqSum + minDistSquared(counter)
179 |                 End If
180 |                 
181 |             Next counter
182 |         
183 |             ' add one new point. each point is chosen with probability proportional to D(x)^2
184 |             Dim R As Double
185 |             R = Rnd * distSqSum
186 |         
187 |             ' the index of the next point to be added as cluster center
188 |             Dim nextpoint As Integer
189 |             nextpoint = -1
190 |             
191 |     
192 |              ' scan through the dist squared distances until sum > R
193 |             Dim sum As Double: sum = 0
194 |             For counter = 1 To NUMBER_OF_RECORDS
195 |             
196 |                 If Not Taken(counter) = 1 Then
197 |                     sum = sum + minDistSquared(counter)
198 |                     
199 |                     If sum > R Then
200 |                         nextpoint = counter
201 |                         Exit For
202 |                     End If
203 |                     
204 |                 End If
205 |                 
206 |             Next counter
207 |             
208 |             ' if a new point was not found yet, just pick the last available data record
209 |             If nextpoint = -1 Then
210 |                 For counter = NUMBER_OF_RECORDS To 1 Step -1
211 | 
212 |                     If Not Taken(counter) = 1 Then
213 |                         nextpoint = counter
214 |                         Exit For            ' stop at the last available record
215 |                     End If
216 |                 Next counter
217 |             End If
218 |             
219 |             If nextpoint >= 0 Then
220 |             
221 |                 ' we found the next cluster center! Mark the data record as Taken
222 |                 CentroidsFound = CentroidsFound + 1
223 |                 Taken(nextpoint) = 1
224 |                 
225 |                 ' copy the data in the array to our result
226 |                 For counter = 1 To NUMBER_OF_COLUMNS
227 |                     InitialCentroidsCalc(CentroidsFound, counter) = DataRecords(nextpoint, counter)
228 |                 Next counter
229 |                 
230 |                 ' need to find more centroids. we will adjust the minSqDistance
231 |                 If CentroidsFound < NUMCLUSTERS Then
232 |                 
233 |                     For counter = 1 To NUMBER_OF_RECORDS
234 |                     
235 |                         If Not Taken(counter) = 1 Then
236 |                         
237 |                             ' find the distance to the new centroid
238 |                             Dim dista As Double, distSquared As Double
239 |                             
240 |                             dista = EuclideanDistance(Application.Index(InitialCentroidsCalc, CentroidsFound, 0), Application.Index(DataRecords, counter, 0), NUMBER_OF_COLUMNS)
241 |                             distSquared = dista * dista
242 |                             
243 |                             ' if the distance to the new centroid is lower than the previous, then use it
244 |                             If distSquared < minDistSquared(counter) Then
245 |                                 minDistSquared(counter) = distSquared
246 |                             End If
247 |                         End If
248 |                         
249 |                     Next counter
250 |                     
251 |                 End If
252 |             
253 |             Else                        ' there is no cluster center found
254 |                 preventLoop = False     ' make sure that the while loop can terminate
255 |             End If
256 |     Loop
257 | 
258 |     ComputeInitialCentroidsCalc = InitialCentroidsCalc
259 | End Function
260 |     
261 | 
262 | Public Function EuclideanDistance(X As Variant, Y As Variant, NumberOfObservations As Integer) As Double
263 |     Dim counter As Integer
264 |     Dim RunningSumSqr As Double: RunningSumSqr = 0
265 |     
266 |     For counter = 1 To NumberOfObservations
267 |         RunningSumSqr = RunningSumSqr + ((X(counter) - Y(counter)) ^ 2)
268 |     Next counter
269 |     
270 |     EuclideanDistance = Sqr(RunningSumSqr)
271 | End Function
272 | 
273 | 
274 | 
275 | ' For each record in Data Records, find the closest Centroid (cluster)
276 | ' The result is calculated and placed in Cluster_Indexes()
277 | ' This number is the cluster where we placed the record. This is more efficient than creating new Arrays with Clusters
278 | '
279 | Public Function FindClosestCentroid(ByRef DataRecords As Variant, ByRef Centroids As Variant, ByRef Cluster_Indexes As Variant) As Integer
280 |     Dim NUMCLUSTERS As Integer: NUMCLUSTERS = UBound(Centroids, 1)
281 |     Dim NUMBER_OF_COLUMNS As Integer: NUMBER_OF_COLUMNS = UBound(Centroids, 2)
282 |     Dim NUMBER_OF_RECORDS As Integer: NUMBER_OF_RECORDS = UBound(DataRecords, 1)
283 |     Dim idx() As Variant: ReDim idx(NUMBER_OF_RECORDS) As Variant
284 |     Dim recordsCounter As Integer, clusterCounter As Integer
285 |     Dim changeCounter As Integer: changeCounter = 0
286 | 
287 |     For recordsCounter = 1 To NUMBER_OF_RECORDS
288 |     
289 |             Dim MinimumDistance As Double: MinimumDistance = 99999999
290 |             Dim MinCluster As Integer
291 |             Dim dist As Double: dist = 0
292 |             
293 |             ' calculate distance to all centroids and assign to the minimum distance cluster
294 |             For clusterCounter = 1 To NUMCLUSTERS
295 |                 dist = EuclideanDistance(Application.Index(DataRecords, recordsCounter, 0), Application.Index(Centroids, clusterCounter, 0), NUMBER_OF_COLUMNS)
296 |                 If dist < MinimumDistance Then
297 |                 
298 |                      ' this record will be assigned to cluster MinCluster when we find the min distance
299 |                     MinCluster = clusterCounter
300 |                     MinimumDistance = dist
301 |                 End If
302 |             Next clusterCounter
303 |             
304 |             ' change the cluster index to the closest cluster
305 |             idx(recordsCounter) = MinCluster
306 |             
307 |             ' During the first run Cluster Indexes is Empty
308 |             If Not (IsEmpty(Cluster_Indexes)) Then
309 |                 
310 |                 ' If the old cluster index is not the same as the new one
311 |                 If Not (Cluster_Indexes(recordsCounter) = idx(recordsCounter)) Then
312 |                 
313 |                     ' indicate that a change occurred
314 |                     changeCounter = changeCounter + 1
315 |                 End If
316 |                 
317 |             End If
318 |         
319 |     Next recordsCounter                ' next record
320 |     
321 |     FindClosestCentroid = changeCounter
322 |     
323 |     ' update the clusters
324 |     Cluster_Indexes = idx()
325 | End Function
326 | 
327 | 
328 | 
329 | ' Show the results in the Result sheet
330 | '
331 | Public Sub ShowResult(ByRef DataRecords As Variant, ByRef Cluster_Indexes As Variant, ByRef Centroids, NUMCLUSTERS As Integer)
332 |     Dim resultSheet As Worksheet
333 |     Dim lRowLast As Integer, lColLast As Integer, counter As Integer
334 |     Dim Rng As Range
335 |     Dim ClusterObjects() As Variant: ReDim ClusterObjects(NUMCLUSTERS) As Variant
336 |     Dim NUMBER_OF_RECORDS As Integer: NUMBER_OF_RECORDS = UBound(DataRecords, 1)
337 |     
338 |     Set resultSheet = ActiveWorkbook.Worksheets("Result")
339 | 
340 |     
341 |     ' clear the old data in Result sheet
342 |     With resultSheet
343 |         lRowLast = .UsedRange.Row + .UsedRange.Rows.Count - 1
344 |         lColLast = .UsedRange.Column + .UsedRange.Columns.Count - 1
345 |         Set Rng = .Range(.Range("B4"), .Cells(lRowLast, lColLast))
346 |     End With
347 |     Rng.ClearContents
348 |     
349 |     ' initialize Cluster object count
350 |     For counter = 1 To NUMCLUSTERS
351 |         ClusterObjects(counter) = 0
352 |         resultSheet.Cells(4, 1 + counter).Value = counter
353 |     Next counter
354 | 
355 |     ' for every record in this cluster, increase the counter
356 |     For counter = 1 To NUMBER_OF_RECORDS
357 |         ClusterObjects(Cluster_Indexes(counter)) = ClusterObjects(Cluster_Indexes(counter)) + 1
358 |     Next counter
359 | 
360 |     ' Show the record count per cluster and the final centroids in the results
361 |     resultSheet.Range("B5").Resize(1, NUMCLUSTERS).Value = ClusterObjects
362 |     resultSheet.Range("B9").Resize(UBound(Centroids, 1), UBound(Centroids, 2)).Value = Centroids
363 |     
364 | End Sub
365 | 
366 | 
367 | ' This will sum all the records in a cluster, and average the values. The calculated averages will form the new Centroids
368 | '
369 | Public Function ComputeCentroids(DataRecords As Variant, ClusterIdx As Variant, Number_Of_Clusters As Integer) As Variant
370 |     Dim NUMBER_OF_RECORDS As Integer: NUMBER_OF_RECORDS = UBound(DataRecords, 1)
371 |     Dim NUMBER_OF_FEATURES As Integer: NUMBER_OF_FEATURES = UBound(DataRecords, 2)
372 |     Dim clusterNumber As Integer, columnNumber As Integer, recordNumber As Integer, counter As Integer
373 |     Dim tempSum() As Variant: ReDim tempSum(Number_Of_Clusters, NUMBER_OF_FEATURES) As Variant
374 |     Dim Centroids() As Variant: ReDim Centroids(Number_Of_Clusters, NUMBER_OF_FEATURES) As Variant
375 |     
376 |     For clusterNumber = 1 To Number_Of_Clusters
377 |     
378 |             For columnNumber = 1 To NUMBER_OF_FEATURES
379 |             
380 |                     counter = 0
381 |                     For recordNumber = 1 To NUMBER_OF_RECORDS
382 |                         If ClusterIdx(recordNumber) = clusterNumber Then
383 |                             
384 |                             ' if this record is part of the cluster then add
385 |                             Centroids(clusterNumber, columnNumber) = Centroids(clusterNumber, columnNumber) + DataRecords(recordNumber, columnNumber)
386 |                             counter = counter + 1
387 |                         End If
388 |                     Next recordNumber
389 |                     
390 |                     If counter > 0 Then
391 |                         
392 |                         ' compute the new centroid averaging all records in the cluster
393 |                         Centroids(clusterNumber, columnNumber) = Centroids(clusterNumber, columnNumber) / counter
394 |                     Else
395 |                         Centroids(clusterNumber, columnNumber) = 0
396 |                     End If
397 |                     
398 |             Next columnNumber
399 |             
400 |     Next clusterNumber
401 |     
402 |     ComputeCentroids = Centroids
403 | End Function
404 | 
405 | 
406 | 
407 | 
408 | 


--------------------------------------------------------------------------------
/kmeans.xlsm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gpolic/kmeans-excel/99282c3334806aed801e2038e8a6f23b2c9ef65d/kmeans.xlsm


--------------------------------------------------------------------------------