├── README.md
├── main.py
└── sequences_str.csv

/README.md:
--------------------------------------------------------------------------------
## Clustering sequences using similarity measures in Python

Implementation of k-means clustering with the following similarity measures to choose from when evaluating the similarity of given sequences:
- Euclidean distance
- Damerau-Levenshtein edit distance
- Dynamic Time Warping.


### 1. Input
The expected input is a set of sequences represented in string format, i.e. a csv file where each row represents a sequence and each column holds a single item of that sequence.

| col1 | col2 | col3 | ... |
|------|------|------|-----|
| glass | bowl | spoon | ... |

### 2. Implementation
#### 2.1 K-means clustering
K-means sets initial (random) centroids, calculates their distance to all datapoints and assigns each datapoint to the nearest cluster. Centroids are then updated in relation to the datapoints assigned to the respective cluster (minimum distance to all datapoints) and compared to the old centroid values; the centroids keep updating until the distance between all old and new centroids is zero (i.e. none of them changed in the previous iteration).
To make different distance measures usable with k-means, `k_means` takes the preferred distance function as a parameter (dist_fun), alongside the number of clusters (k) and the preprocessed data (data).

#### 2.2 Damerau-Levenshtein edit distance
Damerau-Levenshtein distance measures the dissimilarity of two strings by counting the minimum number of edit steps needed to transform one string into the other and returns this count as the distance.
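For intuition, the edit steps counted are insertions, deletions, substitutions, and transpositions of adjacent items. A minimal self-contained sketch (illustrative only, not the implementation used in this repo):

```python
def dl_distance(s1, s2):
    # Table d[(i, j)] = edit distance between s1[:i] and s2[:j].
    d = {(i, 0): i for i in range(len(s1) + 1)}
    d.update({(0, j): j for j in range(len(s2) + 1)})
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[(i, j)] = min(d[(i - 1, j)] + 1,         # deletion
                            d[(i, j - 1)] + 1,         # insertion
                            d[(i - 1, j - 1)] + cost)  # substitution / match
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[(i, j)] = min(d[(i, j)], d[(i - 2, j - 2)] + 1)  # transposition
    return d[(len(s1), len(s2))]

print(dl_distance('spoon', 'sopon'))  # 1: one transposition ('po' <-> 'op')
print(dl_distance('milk', 'silk'))    # 1: one substitution
```

Because the table is indexed, the same recurrence works equally on strings and on lists of items.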
In order to use Damerau-Levenshtein distance inside k-means, which works number-based (for calculating means), there is a wrapper function (levenshtein_on_numbers): each numeric datapoint is converted into a character, the resulting character sequences are compared with the edit distance, and the conversion is reversed afterwards.

#### 2.3 Dynamic Time Warping
Takes differing sequence lengths and non-linear similarities into account when comparing two sequences.

#### 2.4 Evaluation methods
##### 2.4.1 Cluster counter
In order to find the optimal k, the algorithm is run repeatedly with a larger than expected number of clusters. After each run the sizes of the clusters are inspected and all non-empty clusters are counted. The output is a vector counting how often each number of non-empty clusters occurred. E.g. [7, 69, 21, 2, 1] tells us that in 100 runs with k=5, 7 times only one cluster was filled with data, 69 times two clusters were filled with data, etc.

How to use:

```python
number_of_clusters = [0, 0, 0, 0, 0]
for i in range(100):
    clusters, centroids = k_means(5, data, dtw_distance)
    count = 0
    for cluster in clusters:
        if len(cluster) > 0:
            count = count + 1
    number_of_clusters[count - 1] = number_of_clusters[count - 1] + 1
print(number_of_clusters)
```

##### 2.4.2 Elbow method for optimal number of clusters (k)
The elbow method looks at the percentage of variance explained as a function of the number of clusters. The optimal number of clusters should be chosen so that adding another cluster doesn't result in much better modeling of the data (indicated by the angle, the "elbow", in the graph).

How to use:
```python
max_len = max_dim(data)
for dataset in data:
    if len(dataset) < max_len:
        for i in range(max_len - len(dataset)):
            dataset.append(0)

sum_dists = []
for k in range(1, 16):
    clusters, centroids = k_means(k, data, euclidean_distance)
    sum_dist = []
    for i in range(len(clusters)):
        cluster = clusters[i]
        centroid = centroids[i]
        for j in range(len(cluster)):
            sum_dist.append(euclidean_distance(cluster[j], centroid) ** 2)
    sum_dists.append(sum(sum_dist))

plt.plot(range(1, 16), sum_dists, 'bx-')
plt.xlabel('k')
plt.ylabel('sum dist')
plt.title('Elbow Method for optimal k')
plt.show()
```


### 3. Usage examples

```python
# K-means with Damerau-Levenshtein distance
data = read_data('sequences_str.csv')
datapoint2num, num2datapoint = create_dicts(data)
convert_with_dictionary(data, datapoint2num)
clusters, centroids = k_means(5, data, levenshtein_on_numbers)
for i in range(5):
    convert_with_dictionary(clusters[i], num2datapoint)
    print('====================================')
    print('Cluster ' + str(i) + ': ')
    for j in range(len(clusters[i])):
        print(clusters[i][j])
    plot_data(clusters[i])
```
```python
# K-means with dynamic time warping
data = read_data('sequences_str.csv')
datapoint2num, num2datapoint = create_dicts(data)
convert_with_dictionary(data, datapoint2num)
clusters, centroids = k_means(5, data, dtw_distance)
for i in range(5):
    convert_with_dictionary(clusters[i], num2datapoint)
    print('====================================')
    print('Cluster ' + str(i) + ': ')
    for j in range(len(clusters[i])):
        print(clusters[i][j])
    plot_data(clusters[i])
print(centroids)
```
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import csv
import matplotlib.pyplot as plt
import math
import random
import string


def read_data(filename):
    with open(filename, 'rt') as csvfile:
        data = list(csv.reader(csvfile))
    data.pop(0)  # drop the header row
    return data


def max_dim(data, use_chars=False):
    ''' get maximum number of dimensions '''
    max_len = 0
    if use_chars:
        for dataset in data:
            if len(dataset[0]) > max_len:
                max_len = len(dataset[0])
    else:
        for dataset in data:
            if len(dataset) > max_len:
                max_len = len(dataset)
    return max_len


def convert_with_dictionary(data, dictionary):
    ''' convert data in place: num2word or word2num '''
    for dataset in data:
        for i in range(len(dataset)):
            dataset[i] = dictionary[dataset[i]]


def get_unique_words(data):
    ''' create a list of unique words '''
    unique_words = set()
    for dataset in data:
        for datapoint in dataset:
            unique_words.add(datapoint)
    return list(unique_words)


def create_dicts(data, use_chars=False):
    ''' create dictionaries for conversion (num<->datapoint or char<->datapoint) '''
    unique_words = get_unique_words(data)

    if use_chars:
        datapoint2char = {}
        char2datapoint = {}
        char = string.ascii_lowercase[0]
        for datapoint in unique_words:
            datapoint2char[datapoint] = char
            char2datapoint[char] = datapoint
            char = chr(ord(char) + 1)
        return datapoint2char, char2datapoint
    else:
        datapoint2num = {}
        num2datapoint = {}
        num = 97  # ord('a'), so the numeric and character encodings line up
        for datapoint in unique_words:
            datapoint2num[datapoint] = num
            num2datapoint[num] = datapoint
            num = num + 1
        return datapoint2num, num2datapoint


def join_chars(data):
    ''' join chars to str '''
    for i in range(len(data)):
        data[i] = [''.join(data[i])]
    return data

def split_chars(data):
    ''' split str to chars '''
    for i in range(len(data)):
        data[i] = list(data[i])
    return data


def plot_data(data):
    for dataset in data:
        plt.plot(dataset)
    plt.show()


def euclidean_distance(a, b):
    ''' euclidean distance; the shorter sequence is implicitly padded with zeros '''
    dist = 0
    if len(a) < len(b):
        a, b = b, a
    for i in range(len(a)):
        if i < len(b):
            dist = dist + (a[i] - b[i]) ** 2
        else:
            dist = dist + a[i] ** 2
    return math.sqrt(dist)


def levenshtein_on_numbers(dataset1, dataset2):
    ''' damerau-levenshtein distance on numeric datasets (converts them to characters and back) '''
    datapoint2char, char2datapoint = create_dicts([dataset1, dataset2], use_chars=True)
    convert_with_dictionary([dataset1], datapoint2char)
    convert_with_dictionary([dataset2], datapoint2char)

    # d_levenshtein_distance only indexes its arguments, so it works directly
    # on the lists of characters; no joining into strings is needed.
    distance = d_levenshtein_distance(dataset1, dataset2)

    convert_with_dictionary([dataset1], char2datapoint)
    convert_with_dictionary([dataset2], char2datapoint)

    return distance


def d_levenshtein_distance(str1, str2):
    ''' damerau-levenshtein distance (works on strings as well as lists of characters) '''
    d = {}
    for i in range(len(str1) + 1):
        d[(i, 0)] = i
    for j in range(len(str2) + 1):
        d[(0, j)] = j

    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i-1] == str2[j-1]:
                subst_or_equal = d[(i-1, j-1)]
            else:
                subst_or_equal = d[(i-1, j-1)] + 1

            deletion = d[(i-1, j)] + 1
            insertion = d[(i, j-1)] + 1

            if (i >= 2 and j >= 2) and (str1[i-1] == str2[j-2] and str1[i-2] == str2[j-1]):
                switch = d[(i-2, j-2)] + 1
                d[(i, j)] = min(subst_or_equal, deletion, insertion, switch)
            else:
                d[(i, j)] = min(subst_or_equal, deletion, insertion)

    return d[(len(str1), len(str2))]


def dtw_distance(dataset1, dataset2):
    ''' dynamic time warping '''
    dtw = {}
    for i in range(len(dataset1)):
        dtw[(i, -1)] = float('inf')
    for j in range(len(dataset2)):
        dtw[(-1, j)] = float('inf')
    dtw[(-1, -1)] = 0

    for i in range(len(dataset1)):
        for j in range(len(dataset2)):
            dist = (dataset1[i] - dataset2[j]) ** 2
            dtw[(i, j)] = dist + min(dtw[(i-1, j)], dtw[(i, j-1)], dtw[(i-1, j-1)])

    return math.sqrt(dtw[(len(dataset1) - 1, len(dataset2) - 1)])


def k_means(k, data, dist_fun):
    ''' k-means with a configurable number of clusters and distance function '''
    centroids = []
    old_centroids = []
    clusters = [[] for i in range(k)]
    delta_centroid_sum = 0
    dataset_dim = max_dim(data)
    min_value = 97
    max_value = max([datapoint for dataset in data for datapoint in dataset])
    zeros = [0 for i in range(dataset_dim)]

    for cluster in range(k):
        randoms = [random.randint(min_value, max_value) for i in range(dataset_dim)]
        old_centroids.append(zeros[:])
        centroids.append(randoms)
        delta_centroid_sum = delta_centroid_sum + dist_fun(zeros, randoms)

    while delta_centroid_sum != 0:
        # reassign every dataset to its nearest centroid
        # (the list must be rebuilt each iteration, otherwise stale
        # assignments from the first pass would be reused forever)
        cluster_for_dataset = []
        for dataset in data:
            cluster_distances = []
            for cluster in range(k):
                cluster_distances.append(dist_fun(dataset, centroids[cluster]))
            cluster_for_dataset.append(cluster_distances.index(min(cluster_distances)))
        delta_centroid_sum = 0

        for cluster in range(k):
            cluster_members = []
            for i in range(len(data)):
                if cluster == cluster_for_dataset[i]:
                    cluster_members.append(data[i])

            old_centroids[cluster] = centroids[cluster]
            datapoint_means = [0 for i in range(dataset_dim)]
            cluster_member_count = len(cluster_members)

            for dataset in cluster_members:
                for i in range(len(dataset)):
                    datapoint_means[i] = datapoint_means[i] + dataset[i] / cluster_member_count

            centroids[cluster] = datapoint_means
            clusters[cluster] = cluster_members
            delta_centroid_sum = delta_centroid_sum + dist_fun(old_centroids[cluster], centroids[cluster])

    return clusters, centroids


def main():
    ### find optimal k, elbow method
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # max_len = max_dim(data)
    #
    # for dataset in data:
    #     if len(dataset) < max_len:
    #         for i in range(max_len - len(dataset)):
    #             dataset.append(0)
    #
    # sum_dists = []
    # for k in range(1, 16):
    #     clusters, centroids = k_means(k, data, dtw_distance)
    #     sum_dist = []
    #     for i in range(len(clusters)):
    #         cluster = clusters[i]
    #         centroid = centroids[i]
    #         for j in range(len(cluster)):
    #             sum_dist.append(euclidean_distance(cluster[j], centroid) ** 2)
    #     sum_dists.append(sum(sum_dist))
    #
    # plt.plot(range(1, 16), sum_dists, 'bx-')
    # plt.xlabel('k')
    # plt.ylabel('sum dist')
    # plt.title('Elbow Method for optimal k')
    # plt.show()

    ### find optimal k, cluster counter
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # number_of_clusters = [0, 0, 0, 0, 0]
    # for i in range(100):
    #     clusters, centroids = k_means(5, data, levenshtein_on_numbers)
    #     count = 0
    #     for cluster in clusters:
    #         if len(cluster) > 0:
    #             count = count + 1
    #     number_of_clusters[count - 1] = number_of_clusters[count - 1] + 1
    # print(number_of_clusters)

    ### plot
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # plot_data(data)

    ### k-means with dtw on numbers
    data = read_data('sequences_str.csv')
    datapoint2num, num2datapoint = create_dicts(data)
    convert_with_dictionary(data, datapoint2num)
    clusters, centroids = k_means(5, data, dtw_distance)
    for i in range(5):
        convert_with_dictionary(clusters[i], num2datapoint)
        print('====================================')
        print('Cluster ' + str(i) + ': ')
        for j in range(len(clusters[i])):
            print(clusters[i][j])
        plot_data(clusters[i])
    print(centroids)

    ### levenshtein on strings
    # data = read_data('sequences_str.csv')
    ## data = [dataset[2:] for dataset in data]
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # clusters, centroids = k_means(5, data, levenshtein_on_numbers)
    # for i in range(5):
    #     convert_with_dictionary(clusters[i], num2datapoint)
    #     print('====================================')
    #     print('Cluster ' + str(i) + ': ')
    #     for j in range(len(clusters[i])):
    #         print(clusters[i][j])
    #     plot_data(clusters[i])

    ### k_means with euclidean distance on numbers
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # clusters, centroids = k_means(5, data, euclidean_distance)
    # for i in range(5):
    #     convert_with_dictionary(clusters[i], num2datapoint)
    #     print('====================================')
    #     print('Cluster ' + str(i) + ': ')
    #     for j in range(len(clusters[i])):
    #         print(clusters[i][j])
    #     plot_data(clusters[i])


if __name__ == '__main__':
    main()
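
# --- Illustrative sanity check (standalone sketch, safe to delete) ----------
# Walks the DTW recurrence used in dtw_distance above on a toy pair of
# sequences, duplicated here with plain loops so it can be read in isolation.
# Since [1, 2, 2, 3] merely repeats one element of [1, 2, 3], warping aligns
# the two sequences perfectly and the distance comes out as 0.
toy_a = [1, 2, 3]
toy_b = [1, 2, 2, 3]
toy_dtw = {(-1, -1): 0.0}
for i in range(len(toy_a)):
    toy_dtw[(i, -1)] = float('inf')
for j in range(len(toy_b)):
    toy_dtw[(-1, j)] = float('inf')
for i in range(len(toy_a)):
    for j in range(len(toy_b)):
        cost = (toy_a[i] - toy_b[j]) ** 2
        toy_dtw[(i, j)] = cost + min(toy_dtw[(i - 1, j)],
                                     toy_dtw[(i, j - 1)],
                                     toy_dtw[(i - 1, j - 1)])
print(math.sqrt(toy_dtw[(len(toy_a) - 1, len(toy_b) - 1)]))  # 0.0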
--------------------------------------------------------------------------------
/sequences_str.csv:
--------------------------------------------------------------------------------
item1,item2,item3,item4,item5,item6,item7,item8,item9
juice,milk,milk,juice,spoon,cereal,bowl,,
milk,juice,buttermilk,milk,juice,cereal,spoon,glass,bowl
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,cereal,bowl,,
juice,milk,cereal,spoon,bowl,glass,,,
juice,milk,spoon,spoon,glass,cereal,bowl,,
milk,juice,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,cereal,bowl,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
cereal,spoon,glass,bowl,juice,milk,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,cereal,bowl,bowl,glass,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,cocoa,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
buttermilk,juice,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
milk,juice,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
milk,juice,spoon,cereal,glass,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,spoon,cereal,bowl,milk,glass,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,cereal,spoon,glass,glass,bowl,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
--------------------------------------------------------------------------------