├── .github
    └── FUNDING.yml
├── README.md
├── data.csv
└── kMeansClustering.py


/.github/FUNDING.yml:
--------------------------------------------------------------------------------
# These are supported funding model platforms
ko_fi: corvasto


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Simple k-Means Clustering - Python
Simple k-means clustering (centroid-based) using Python

### Code Requirements
Python 3.5
NumPy 1.11.0

### Description
k-Means clustering is one of the most popular clustering methods in data mining and unsupervised machine learning.
This repository is a simple demonstration of centroid-based k-means clustering. For now, the code runs a fixed number of iterations and does not use a stopping or convergence criterion.

Set the number and initial positions of the centroids in the function **`create_centroids()`**.
Note that the algorithm may settle on a suboptimal solution if the initial centroids are chosen badly.

The output is each data point with its cluster number (label), followed by the final centroid positions.



--------------------------------------------------------------------------------
/data.csv:
--------------------------------------------------------------------------------
15, 16
16, 18.5
17, 20.2
16.4, 17.12
17.23, 18.12
43, 43
44.43, 45.212
45.8, 54.23
46.313, 43.123
50.21, 46.3
99, 99.22
100.32, 98.123
100.32, 97.423
102, 93.23
102.23, 94.23


--------------------------------------------------------------------------------
/kMeansClustering.py:
--------------------------------------------------------------------------------
import numpy as np
import os

def compute_euclidean_distance(point, centroid):
    # Straight-line distance between a data point and a centroid.
    return np.sqrt(np.sum((point - centroid)**2))

def assign_label_cluster(distance, data_point, centroids):
    # `distance` maps centroid index -> distance; pick the nearest centroid.
    index_of_minimum = min(distance, key=distance.get)
    return [index_of_minimum, data_point, centroids[index_of_minimum]]

def compute_new_centroids(cluster_label, centroids):
    # `cluster_label` is the data point just assigned and `centroids` is its
    # nearest centroid. The centroid moves to the midpoint of the two, which
    # is a per-point update rather than the mean of all assigned points.
    return np.array(cluster_label + centroids)/2

def iterate_k_means(data_points, centroids, total_iteration):
    label = []
    cluster_label = []
    total_points = len(data_points)
    k = len(centroids)

    for iteration in range(total_iteration):
        for index_point in range(total_points):
            # Distance from the current point to every centroid.
            distance = {}
            for index_centroid in range(k):
                distance[index_centroid] = compute_euclidean_distance(data_points[index_point], centroids[index_centroid])
            label = assign_label_cluster(distance, data_points[index_point], centroids)
            # Update the nearest centroid in place, right after the assignment.
            centroids[label[0]] = compute_new_centroids(label[1], centroids[label[0]])

            # Record the labels only during the final pass.
            if iteration == (total_iteration - 1):
                cluster_label.append(label)

    return [cluster_label, centroids]

def print_label_data(result):
    print("Result of k-Means Clustering: \n")
    for data in result[0]:
        print("data point: {}".format(data[1]))
        print("cluster number: {} \n".format(data[0]))
    print("Last centroids position: \n {}".format(result[1]))

def create_centroids():
    # Initial centroid positions; edit this list to change k or the starting points.
    centroids = []
    centroids.append([5.0, 0.0])
    centroids.append([45.0, 70.0])
    centroids.append([50.0, 90.0])
    return np.array(centroids)

if __name__ == "__main__":
    # Locate data.csv next to this script, independent of OS and working directory.
    filename = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data.csv")
    data_points = np.genfromtxt(filename, delimiter=",")
    centroids = create_centroids()
    total_iteration = 100

    [cluster_label, new_centroids] = iterate_k_means(data_points, centroids, total_iteration)
    print_label_data([cluster_label, new_centroids])
    print()

--------------------------------------------------------------------------------
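
The README notes that the script runs a fixed number of iterations without a stopping or convergence criterion. Below is a minimal sketch of one way such a criterion could look. It assumes the standard batch update (each centroid moves to the mean of its assigned points) rather than the per-point midpoint update used in `kMeansClustering.py`. The function name `k_means_with_tolerance` and the parameters `max_iterations` and `tolerance` are hypothetical and not part of the repository.

```python
import numpy as np


def k_means_with_tolerance(data_points, centroids, max_iterations=100, tolerance=1e-6):
    """Batch k-means that stops once no centroid moves more than `tolerance`.

    Hypothetical helper for illustration; not part of the repository.
    """
    data_points = np.asarray(data_points, dtype=float)
    centroids = np.array(centroids, dtype=float)  # copy, so the caller's array is untouched
    labels = np.zeros(len(data_points), dtype=int)
    iterations_run = 0

    for _ in range(max_iterations):
        iterations_run += 1

        # Distance from every point to every centroid, shape (n_points, k).
        differences = data_points[:, None, :] - centroids[None, :, :]
        distances = np.linalg.norm(differences, axis=2)
        labels = np.argmin(distances, axis=1)

        # Move each centroid to the mean of its points; empty clusters stay put.
        new_centroids = centroids.copy()
        for index_centroid in range(len(centroids)):
            members = data_points[labels == index_centroid]
            if len(members) > 0:
                new_centroids[index_centroid] = members.mean(axis=0)

        # Largest distance any centroid moved in this pass.
        shift = np.max(np.linalg.norm(new_centroids - centroids, axis=1))
        centroids = new_centroids
        if shift <= tolerance:
            break

    return labels, centroids, iterations_run


# A few rows from data.csv and the starting centroids from create_centroids().
sample_points = np.array([
    [15.0, 16.0], [16.0, 18.5],
    [43.0, 43.0], [44.43, 45.212],
    [99.0, 99.22], [100.32, 98.123],
])
start_centroids = np.array([[5.0, 0.0], [45.0, 70.0], [50.0, 90.0]])

labels, final_centroids, iterations_run = k_means_with_tolerance(sample_points, start_centroids)
print(labels, final_centroids, iterations_run)
```

With a tolerance check like this, `total_iteration` becomes an upper bound rather than the exact number of passes, so well-separated data such as `data.csv` should stop after only a few iterations.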