├── .github
│   └── FUNDING.yml
├── README.md
├── data.csv
└── kMeansClustering.py


/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 | ko_fi: corvasto
3 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Simple k-Means Clustering - Python
 2 | Simple k-means clustering (centroid-based) using Python
 3 | 
 4 | ### Code Requirements
 5 | Python 3.5 <br />
 6 | Numpy 1.11.0
 7 | 
 8 | ### Description
 9 | k-Means clustering is one of the most popular clustering methods in data mining and unsupervised machine learning.
10 | This is a simple demonstration of centroid-based k-Means clustering. Each data point is assigned to its nearest centroid, and that centroid is then moved halfway toward the point. For now, the code runs a fixed number of iterations and does not use a stopping or convergence criterion.
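
The description above notes that the code has no stopping or convergence criterion. As a rough sketch of what one could look like, the example below uses a batch variant of k-means (each centroid is recomputed as the mean of its assigned points) and stops once no centroid moves by more than a small tolerance. The function name `k_means` and the parameters `max_iterations` and `tolerance` are hypothetical and are not part of this repository.

```python
import numpy as np


def k_means(data_points, centroids, max_iterations=100, tolerance=1e-6):
    """Batch k-means sketch that stops when the centroids stop moving.

    `max_iterations` and `tolerance` are illustrative assumptions.
    """
    data_points = np.asarray(data_points, dtype=float)
    centroids = np.array(centroids, dtype=float)  # copy, so the input is not modified
    labels = np.zeros(len(data_points), dtype=int)
    iterations_run = 0

    for _ in range(max_iterations):
        iterations_run += 1

        # Distance from every point to every centroid, shape (n_points, k).
        differences = data_points[:, None, :] - centroids[None, :, :]
        distances = np.linalg.norm(differences, axis=2)
        labels = np.argmin(distances, axis=1)

        # Recompute each centroid as the mean of its members.
        # A centroid with no members keeps its previous position.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = data_points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # Convergence check: largest distance any centroid moved in this pass.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift <= tolerance:
            break

    return labels, centroids, iterations_run


if __name__ == "__main__":
    # A few points taken from data.csv, with the starting centroids
    # used in create_centroids().
    points = np.array([[15, 16], [16, 18.5], [43, 43],
                       [44.43, 45.212], [99, 99.22], [100.32, 98.123]])
    start = [[5.0, 0.0], [45.0, 70.0], [50.0, 90.0]]
    labels, final_centroids, iterations_run = k_means(points, start)
    print(labels, final_centroids, iterations_run)
```

Comparing centroid movement against a tolerance is only one option. Stopping when the labels no longer change between passes would work as well.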
11 | 
12 | Initialize the centroids (their number and positions) in the function **`create_centroids()`**.
13 | Note that the algorithm may end up with a suboptimal solution if the initial centroids are chosen badly.
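
To make the note on badly chosen centroids more concrete, here is a small sketch of two helpers: one picks `k` distinct data points as starting centroids (a common alternative to hand-written positions), and one computes the within-cluster sum of squares so that different centroid sets can be compared. Both function names and the `seed` parameter are hypothetical and are not part of this repository.

```python
import numpy as np


def init_centroids_from_data(data_points, k, seed=None):
    """Pick k distinct data points as initial centroids (illustrative helper)."""
    data_points = np.asarray(data_points, dtype=float)
    rng = np.random.RandomState(seed)
    # replace=False guarantees k different rows are selected.
    indices = rng.choice(len(data_points), size=k, replace=False)
    return data_points[indices].copy()


def within_cluster_sum_of_squares(data_points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    data_points = np.asarray(data_points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    differences = data_points[:, None, :] - centroids[None, :, :]
    squared_distances = np.sum(differences ** 2, axis=2)  # shape (n_points, k)
    return float(np.sum(squared_distances.min(axis=1)))


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
    well_placed = np.array([[0.0, 1.0], [10.0, 1.0]])
    badly_placed = np.array([[5.0, 0.0], [5.0, 2.0]])
    # A lower value means the centroids fit the data more tightly.
    print(within_cluster_sum_of_squares(points, well_placed))
    print(within_cluster_sum_of_squares(points, badly_placed))
    print(init_centroids_from_data(points, k=2, seed=0))
```

Running the clustering from several different starting sets and keeping the result with the lowest sum of squares is one simple way to reduce the impact of a poor initialization.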
14 | 
15 | The output of this code is the list of data points with their cluster number/label, followed by the final centroid positions.
16 | 
17 | 
18 | 


--------------------------------------------------------------------------------
/data.csv:
--------------------------------------------------------------------------------
 1 | 15, 16
 2 | 16, 18.5
 3 | 17, 20.2
 4 | 16.4, 17.12
 5 | 17.23, 18.12
 6 | 43, 43
 7 | 44.43, 45.212
 8 | 45.8, 54.23
 9 | 46.313, 43.123
10 | 50.21, 46.3
11 | 99, 99.22
12 | 100.32, 98.123
13 | 100.32, 97.423
14 | 102, 93.23
15 | 102.23, 94.23
16 | 


--------------------------------------------------------------------------------
/kMeansClustering.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import os
 3 | 
 4 | def compute_euclidean_distance(point, centroid):
 5 |     return np.sqrt(np.sum((point - centroid)**2))
 6 | 
 7 | def assign_label_cluster(distance, data_point, centroids):
 8 |     index_of_minimum = min(distance, key=distance.get)
 9 |     return [index_of_minimum, data_point, centroids[index_of_minimum]]
10 | 
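11 | def compute_new_centroids(cluster_label, centroids):  # cluster_label is the data point, centroids is its current centroid
12 |     return np.array(cluster_label + centroids)/2  # midpoint of the point and the centroid (per-point update)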
13 | 
14 | def iterate_k_means(data_points, centroids, total_iteration):
15 |     label = []
16 |     cluster_label = []
17 |     total_points = len(data_points)
18 |     k = len(centroids)
19 |     
20 |     for iteration in range(0, total_iteration):
21 |         for index_point in range(0, total_points):
22 |             distance = {}
23 |             for index_centroid in range(0, k):
24 |                 distance[index_centroid] = compute_euclidean_distance(data_points[index_point], centroids[index_centroid])
25 |             label = assign_label_cluster(distance, data_points[index_point], centroids)
26 |             centroids[label[0]] = compute_new_centroids(label[1], centroids[label[0]])
27 | 
28 |             if iteration == (total_iteration - 1):
29 |                 cluster_label.append(label)
30 | 
31 |     return [cluster_label, centroids]
32 | 
33 | def print_label_data(result):
34 |     print("Result of k-Means Clustering: \n")
35 |     for data in result[0]:
36 |         print("data point: {}".format(data[1]))
37 |         print("cluster number: {} \n".format(data[0]))
38 |     print("Last centroids position: \n {}".format(result[1]))
39 | 
40 | def create_centroids():
41 |     centroids = []
42 |     centroids.append([5.0, 0.0])
43 |     centroids.append([45.0, 70.0])
44 |     centroids.append([50.0, 90.0])
45 |     return np.array(centroids)
46 | 
47 | if __name__ == "__main__":
48 |     filename = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data.csv")  # portable path; "\data.csv" only works on Windows
49 |     data_points = np.genfromtxt(filename, delimiter=",")
50 |     centroids = create_centroids()
51 |     total_iteration = 100
52 |     
53 |     [cluster_label, new_centroids] = iterate_k_means(data_points, centroids, total_iteration)
54 |     print_label_data([cluster_label, new_centroids])
55 |     print()


--------------------------------------------------------------------------------