├── README.md
├── main.py
└── sequences_str.csv

/README.md:
--------------------------------------------------------------------------------
## Clustering sequences using similarity measures in Python

Implementation of k-means clustering with the following similarity measures to choose from when evaluating the similarity of given sequences:
- Euclidean distance
- Damerau-Levenshtein edit distance
- Dynamic Time Warping.


### 1. Input
The expected input is a set of sequences represented in string format, i.e. a csv file where each row represents a sequence and each column holds a single item of that sequence.

| col1 | col2 | col3 | ... |
|------|------|------|-----|
| glass | bowl | spoon | ... |

### 2. Implementation
#### 2.1 K-means clustering
K-means sets initial (random) centroids, calculates their distance to all datapoints and assigns each datapoint to the nearest cluster. Centroids are then updated in relation to the datapoints assigned to the respective cluster (minimum distance to all datapoints) and compared to the old centroid values; the centroids keep updating until the distance between all old and new centroids is zero (i.e. none of them changed in the previous iteration).
To make different distance measures usable with k-means, `k_means` takes the preferred distance function as a parameter (dist_fun), alongside the number of clusters (k) and the preprocessed data (data).

#### 2.2 Damerau-Levenshtein edit distance
Damerau-Levenshtein distance measures the dissimilarity of two strings by counting the minimum number of edit steps needed to transform one string into the other and returns this count as the distance.
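For intuition, the edit steps counted are insertions, deletions, substitutions, and transpositions of adjacent items. A minimal self-contained sketch (illustrative only, not the implementation used in this repo):

```python
def dl_distance(s1, s2):
    # Table d[(i, j)] = edit distance between s1[:i] and s2[:j].
    d = {(i, 0): i for i in range(len(s1) + 1)}
    d.update({(0, j): j for j in range(len(s2) + 1)})
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[(i, j)] = min(d[(i - 1, j)] + 1,         # deletion
                            d[(i, j - 1)] + 1,         # insertion
                            d[(i - 1, j - 1)] + cost)  # substitution / match
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[(i, j)] = min(d[(i, j)], d[(i - 2, j - 2)] + 1)  # transposition
    return d[(len(s1), len(s2))]

print(dl_distance('spoon', 'sopon'))  # 1: one transposition ('po' <-> 'op')
print(dl_distance('milk', 'silk'))    # 1: one substitution
```

Because the table is indexed, the same recurrence works equally on strings and on lists of items.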
In order to use Damerau-Levenshtein distance inside k-means, which works number-based (for calculating means), there is a wrapper function (levenshtein_on_numbers): each numeric datapoint is converted into a character, the resulting character sequences are compared with the edit distance, and the conversion is reversed afterwards.

#### 2.3 Dynamic Time Warping
Takes differing sequence lengths and non-linear similarities into account when comparing two sequences.

#### 2.4 Evaluation methods
##### 2.4.1 Cluster counter
In order to find the optimal k, the algorithm is run repeatedly with a larger than expected number of clusters. After each run the sizes of the clusters are inspected and all non-empty clusters are counted. The output is a vector counting how often each number of non-empty clusters occurred. E.g. [7, 69, 21, 2, 1] tells us that in 100 runs with k=5, 7 times only one cluster was filled with data, 69 times two clusters were filled with data, etc.

How to use:

```python
number_of_clusters = [0, 0, 0, 0, 0]
for i in range(100):
    clusters, centroids = k_means(5, data, dtw_distance)
    count = 0
    for cluster in clusters:
        if len(cluster) > 0:
            count = count + 1
    number_of_clusters[count - 1] = number_of_clusters[count - 1] + 1
print(number_of_clusters)
```

##### 2.4.2 Elbow method for optimal number of clusters (k)
The elbow method looks at the percentage of variance explained as a function of the number of clusters. The optimal number of clusters should be chosen so that adding another cluster doesn't result in much better modeling of the data (indicated by the angle, the "elbow", in the graph).

How to use:
```python
max_len = max_dim(data)
for dataset in data:
    if len(dataset) < max_len:
        for i in range(max_len - len(dataset)):
            dataset.append(0)

sum_dists = []
for k in range(1, 16):
    clusters, centroids = k_means(k, data, euclidean_distance)
    sum_dist = []
    for i in range(len(clusters)):
        cluster = clusters[i]
        centroid = centroids[i]
        for j in range(len(cluster)):
            sum_dist.append(euclidean_distance(cluster[j], centroid) ** 2)
    sum_dists.append(sum(sum_dist))

plt.plot(range(1, 16), sum_dists, 'bx-')
plt.xlabel('k')
plt.ylabel('sum dist')
plt.title('Elbow Method for optimal k')
plt.show()
```


### 3. Usage examples

```python
# K-means with Damerau-Levenshtein distance
data = read_data('sequences_str.csv')
datapoint2num, num2datapoint = create_dicts(data)
convert_with_dictionary(data, datapoint2num)
clusters, centroids = k_means(5, data, levenshtein_on_numbers)
for i in range(5):
    convert_with_dictionary(clusters[i], num2datapoint)
    print('====================================')
    print('Cluster ' + str(i) + ': ')
    for j in range(len(clusters[i])):
        print(clusters[i][j])
    plot_data(clusters[i])
```
```python
# K-means with dynamic time warping
data = read_data('sequences_str.csv')
datapoint2num, num2datapoint = create_dicts(data)
convert_with_dictionary(data, datapoint2num)
clusters, centroids = k_means(5, data, dtw_distance)
for i in range(5):
    convert_with_dictionary(clusters[i], num2datapoint)
    print('====================================')
    print('Cluster ' + str(i) + ': ')
    for j in range(len(clusters[i])):
        print(clusters[i][j])
    plot_data(clusters[i])
print(centroids)
```
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import csv
import matplotlib.pyplot as plt
import math
import random
import string


def read_data(filename):
    with open(filename, 'rt') as csvfile:
        data = list(csv.reader(csvfile))
    data.pop(0)  # drop the header row
    return data


def max_dim(data, use_chars=False):
    ''' get maximum number of dimensions '''
    max_len = 0
    if use_chars:
        for dataset in data:
            if len(dataset[0]) > max_len:
                max_len = len(dataset[0])
    else:
        for dataset in data:
            if len(dataset) > max_len:
                max_len = len(dataset)
    return max_len


def convert_with_dictionary(data, dictionary):
    ''' convert data in place: num2word or word2num '''
    for dataset in data:
        for i in range(len(dataset)):
            dataset[i] = dictionary[dataset[i]]


def get_unique_words(data):
    ''' create a list of unique words '''
    unique_words = set()
    for dataset in data:
        for datapoint in dataset:
            unique_words.add(datapoint)
    return list(unique_words)


def create_dicts(data, use_chars=False):
    ''' create dictionaries for conversion (num<->datapoint or char<->datapoint) '''
    unique_words = get_unique_words(data)

    if use_chars:
        datapoint2char = {}
        char2datapoint = {}
        char = string.ascii_lowercase[0]
        for datapoint in unique_words:
            datapoint2char[datapoint] = char
            char2datapoint[char] = datapoint
            char = chr(ord(char) + 1)
        return datapoint2char, char2datapoint
    else:
        datapoint2num = {}
        num2datapoint = {}
        num = 97  # ord('a'), so the numeric and character encodings line up
        for datapoint in unique_words:
            datapoint2num[datapoint] = num
            num2datapoint[num] = datapoint
            num = num + 1
        return datapoint2num, num2datapoint


def join_chars(data):
    ''' join chars to str '''
    for i in range(len(data)):
        data[i] = [''.join(data[i])]
    return data

def split_chars(data):
    ''' split str to chars '''
    for i in range(len(data)):
        data[i] = list(data[i])
    return data


def plot_data(data):
    for dataset in data:
        plt.plot(dataset)
    plt.show()


def euclidean_distance(a, b):
    ''' euclidean distance; the shorter sequence is implicitly padded with zeros '''
    dist = 0
    if len(a) < len(b):
        a, b = b, a
    for i in range(len(a)):
        if i < len(b):
            dist = dist + (a[i] - b[i]) ** 2
        else:
            dist = dist + a[i] ** 2
    return math.sqrt(dist)


def levenshtein_on_numbers(dataset1, dataset2):
    ''' damerau-levenshtein distance on numeric datasets (converts them to characters and back) '''
    datapoint2char, char2datapoint = create_dicts([dataset1, dataset2], use_chars=True)
    convert_with_dictionary([dataset1], datapoint2char)
    convert_with_dictionary([dataset2], datapoint2char)

    # d_levenshtein_distance only indexes its arguments, so it works directly
    # on the lists of characters; no joining into strings is needed.
    distance = d_levenshtein_distance(dataset1, dataset2)

    convert_with_dictionary([dataset1], char2datapoint)
    convert_with_dictionary([dataset2], char2datapoint)

    return distance


def d_levenshtein_distance(str1, str2):
    ''' damerau-levenshtein distance (works on strings as well as lists of characters) '''
    d = {}
    for i in range(len(str1) + 1):
        d[(i, 0)] = i
    for j in range(len(str2) + 1):
        d[(0, j)] = j

    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i-1] == str2[j-1]:
                subst_or_equal = d[(i-1, j-1)]
            else:
                subst_or_equal = d[(i-1, j-1)] + 1

            deletion = d[(i-1, j)] + 1
            insertion = d[(i, j-1)] + 1

            if (i >= 2 and j >= 2) and (str1[i-1] == str2[j-2] and str1[i-2] == str2[j-1]):
                switch = d[(i-2, j-2)] + 1
                d[(i, j)] = min(subst_or_equal, deletion, insertion, switch)
            else:
                d[(i, j)] = min(subst_or_equal, deletion, insertion)

    return d[(len(str1), len(str2))]


def dtw_distance(dataset1, dataset2):
    ''' dynamic time warping '''
    dtw = {}
    for i in range(len(dataset1)):
        dtw[(i, -1)] = float('inf')
    for j in range(len(dataset2)):
        dtw[(-1, j)] = float('inf')
    dtw[(-1, -1)] = 0

    for i in range(len(dataset1)):
        for j in range(len(dataset2)):
            dist = (dataset1[i] - dataset2[j]) ** 2
            dtw[(i, j)] = dist + min(dtw[(i-1, j)], dtw[(i, j-1)], dtw[(i-1, j-1)])

    return math.sqrt(dtw[(len(dataset1) - 1, len(dataset2) - 1)])


def k_means(k, data, dist_fun):
    ''' k-means with a configurable number of clusters and distance function '''
    centroids = []
    old_centroids = []
    clusters = [[] for i in range(k)]
    delta_centroid_sum = 0
    dataset_dim = max_dim(data)
    min_value = 97
    max_value = max([datapoint for dataset in data for datapoint in dataset])
    zeros = [0 for i in range(dataset_dim)]

    for cluster in range(k):
        randoms = [random.randint(min_value, max_value) for i in range(dataset_dim)]
        old_centroids.append(zeros[:])
        centroids.append(randoms)
        delta_centroid_sum = delta_centroid_sum + dist_fun(zeros, randoms)

    while delta_centroid_sum != 0:
        # reassign every dataset to its nearest centroid
        # (the list must be rebuilt each iteration, otherwise stale
        # assignments from the first pass would be reused forever)
        cluster_for_dataset = []
        for dataset in data:
            cluster_distances = []
            for cluster in range(k):
                cluster_distances.append(dist_fun(dataset, centroids[cluster]))
            cluster_for_dataset.append(cluster_distances.index(min(cluster_distances)))
        delta_centroid_sum = 0

        for cluster in range(k):
            cluster_members = []
            for i in range(len(data)):
                if cluster == cluster_for_dataset[i]:
                    cluster_members.append(data[i])

            old_centroids[cluster] = centroids[cluster]
            datapoint_means = [0 for i in range(dataset_dim)]
            cluster_member_count = len(cluster_members)

            for dataset in cluster_members:
                for i in range(len(dataset)):
                    datapoint_means[i] = datapoint_means[i] + dataset[i] / cluster_member_count

            centroids[cluster] = datapoint_means
            clusters[cluster] = cluster_members
            delta_centroid_sum = delta_centroid_sum + dist_fun(old_centroids[cluster], centroids[cluster])

    return clusters, centroids


def main():
    ### find optimal k, elbow method
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # max_len = max_dim(data)
    #
    # for dataset in data:
    #     if len(dataset) < max_len:
    #         for i in range(max_len - len(dataset)):
    #             dataset.append(0)
    #
    # sum_dists = []
    # for k in range(1, 16):
    #     clusters, centroids = k_means(k, data, dtw_distance)
    #     sum_dist = []
    #     for i in range(len(clusters)):
    #         cluster = clusters[i]
    #         centroid = centroids[i]
    #         for j in range(len(cluster)):
    #             sum_dist.append(euclidean_distance(cluster[j], centroid) ** 2)
    #     sum_dists.append(sum(sum_dist))
    #
    # plt.plot(range(1, 16), sum_dists, 'bx-')
    # plt.xlabel('k')
    # plt.ylabel('sum dist')
    # plt.title('Elbow Method for optimal k')
    # plt.show()

    ### find optimal k, cluster counter
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # number_of_clusters = [0, 0, 0, 0, 0]
    # for i in range(100):
    #     clusters, centroids = k_means(5, data, levenshtein_on_numbers)
    #     count = 0
    #     for cluster in clusters:
    #         if len(cluster) > 0:
    #             count = count + 1
    #     number_of_clusters[count - 1] = number_of_clusters[count - 1] + 1
    # print(number_of_clusters)

    ### plot
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # plot_data(data)

    ### k-means with dtw on numbers
    data = read_data('sequences_str.csv')
    datapoint2num, num2datapoint = create_dicts(data)
    convert_with_dictionary(data, datapoint2num)
    clusters, centroids = k_means(5, data, dtw_distance)
    for i in range(5):
        convert_with_dictionary(clusters[i], num2datapoint)
        print('====================================')
        print('Cluster ' + str(i) + ': ')
        for j in range(len(clusters[i])):
            print(clusters[i][j])
        plot_data(clusters[i])
    print(centroids)

    ### levenshtein on strings
    # data = read_data('sequences_str.csv')
    ## data = [dataset[2:] for dataset in data]
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # clusters, centroids = k_means(5, data, levenshtein_on_numbers)
    # for i in range(5):
    #     convert_with_dictionary(clusters[i], num2datapoint)
    #     print('====================================')
    #     print('Cluster ' + str(i) + ': ')
    #     for j in range(len(clusters[i])):
    #         print(clusters[i][j])
    #     plot_data(clusters[i])

    ### k_means with euclidean distance on numbers
    # data = read_data('sequences_str.csv')
    # datapoint2num, num2datapoint = create_dicts(data)
    # convert_with_dictionary(data, datapoint2num)
    # clusters, centroids = k_means(5, data, euclidean_distance)
    # for i in range(5):
    #     convert_with_dictionary(clusters[i], num2datapoint)
    #     print('====================================')
    #     print('Cluster ' + str(i) + ': ')
    #     for j in range(len(clusters[i])):
    #         print(clusters[i][j])
    #     plot_data(clusters[i])


if __name__ == '__main__':
    main()
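
# --- Illustrative sanity check (standalone sketch, safe to delete) ----------
# Walks the DTW recurrence used in dtw_distance above on a toy pair of
# sequences, duplicated here with plain loops so it can be read in isolation.
# Since [1, 2, 2, 3] merely repeats one element of [1, 2, 3], warping aligns
# the two sequences perfectly and the distance comes out as 0.
toy_a = [1, 2, 3]
toy_b = [1, 2, 2, 3]
toy_dtw = {(-1, -1): 0.0}
for i in range(len(toy_a)):
    toy_dtw[(i, -1)] = float('inf')
for j in range(len(toy_b)):
    toy_dtw[(-1, j)] = float('inf')
for i in range(len(toy_a)):
    for j in range(len(toy_b)):
        cost = (toy_a[i] - toy_b[j]) ** 2
        toy_dtw[(i, j)] = cost + min(toy_dtw[(i - 1, j)],
                                     toy_dtw[(i, j - 1)],
                                     toy_dtw[(i - 1, j - 1)])
print(math.sqrt(toy_dtw[(len(toy_a) - 1, len(toy_b) - 1)]))  # 0.0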
--------------------------------------------------------------------------------
/sequences_str.csv:
--------------------------------------------------------------------------------
item1,item2,item3,item4,item5,item6,item7,item8,item9
juice,milk,milk,juice,spoon,cereal,bowl,,
milk,juice,buttermilk,milk,juice,cereal,spoon,glass,bowl
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,cereal,bowl,,
juice,milk,cereal,spoon,bowl,glass,,,
juice,milk,spoon,spoon,glass,cereal,bowl,,
milk,juice,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,cereal,bowl,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
cereal,spoon,glass,bowl,juice,milk,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,cereal,bowl,bowl,glass,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,cocoa,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
buttermilk,juice,spoon,glass,cereal,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
milk,juice,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
milk,juice,spoon,cereal,glass,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
juice,milk,spoon,cereal,bowl,milk,glass,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,cereal,spoon,glass,glass,bowl,,
juice,milk,spoon,cereal,glass,bowl,,,
juice,milk,spoon,glass,cereal,bowl,,,
juice,milk,cereal,spoon,glass,bowl,,,
--------------------------------------------------------------------------------