├── README.md
├── STDBSCAN.py
└── main_stdbscan.py

/README.md:
--------------------------------------------------------------------------------
# ST-DBSCAN

ST-DBSCAN is a density-based clustering algorithm that takes into account both spatial and non-spatial attributes of the points. Like DBSCAN, it can identify clusters of arbitrary shape and does not require the number of clusters to be specified in advance. The non-spatial attribute can be any attribute that is not a coordinate in space (e.g., color, time, temperature). Thus, ST-DBSCAN groups points that are spatially near each other and that have similar non-spatial attributes.

Unlike DBSCAN, which requires only two parameters, ST-DBSCAN requires four: Eps1, Eps2, MinPts, and Delta-Epsilon. Eps1 is the spatial threshold, the maximum distance at which a point can be assigned to a cluster. Eps2 is the maximum difference between non-spatial attributes. MinPts is the minimum number of neighbors a point needs to be a core point. Delta-Epsilon is the maximum difference between the average attribute value of a cluster and the attribute value of a new point that is about to be inserted into that cluster. Its purpose is to split clusters that are spatially close to each other but differ substantially in the non-spatial attribute. Delta-Epsilon is not implemented in this version.

The implementation consists of two functions. The main one, `ST_DBSCAN`, builds the clusters iteratively. It calls `retrieve_neighbors`, which returns the points that are neighbors of a given point. In this implementation the non-spatial attribute is time (the `date_time` column): Eps1 is given in meters and Eps2 in minutes.
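
The threshold tests described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code in `STDBSCAN.py`: the helper names `haversine_m`, `is_neighbor` and `within_delta_epsilon` are hypothetical, the haversine formula stands in for geopy's `great_circle`, and the Delta-Epsilon check assumes a numeric attribute (e.g., temperature) since this version does not implement it.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_neighbor(p, q, eps1_m, eps2):
    """True when q is within Eps1 (meters) and Eps2 (a timedelta) of p."""
    close_in_time = abs(p["date_time"] - q["date_time"]) <= eps2
    close_in_space = haversine_m(
        p["latitude"], p["longitude"], q["latitude"], q["longitude"]
    ) <= eps1_m
    return close_in_time and close_in_space


def within_delta_epsilon(cluster_values, new_value, delta_eps):
    """Hypothetical Delta-Epsilon test: compare new_value with the cluster mean."""
    cluster_mean = sum(cluster_values) / len(cluster_values)
    return abs(cluster_mean - new_value) <= delta_eps


t0 = datetime(2020, 1, 1, 12, 0, 0)
p = {"latitude": 0.0, "longitude": 0.0, "date_time": t0}
q = {"latitude": 0.0, "longitude": 0.0005, "date_time": t0 + timedelta(seconds=30)}
r = {"latitude": 0.0, "longitude": 0.002, "date_time": t0}

print(is_neighbor(p, q, eps1_m=100, eps2=timedelta(minutes=1)))
print(is_neighbor(p, r, eps1_m=100, eps2=timedelta(minutes=1)))
print(within_delta_epsilon([20.0, 21.0, 22.0], 21.5, delta_eps=1.0))
```

With thresholds of 100 m and 1 minute, `q` (about 56 m and 30 s away) qualifies as a neighbor of `p`, while `r` (about 222 m away) does not. A point is a core point when the number of such neighbors reaches MinPts.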

--------------------------------------------------------------------------------
/STDBSCAN.py:
--------------------------------------------------------------------------------
from datetime import timedelta

from geopy.distance import great_circle


def ST_DBSCAN(df, spatial_threshold, temporal_threshold, min_neighbors):
    """
    INPUTS:
        df = {o1, o2, ..., on} Set of objects (columns 'latitude', 'longitude', 'date_time')
        spatial_threshold = Maximum geographical coordinate (spatial) distance value, in meters
        temporal_threshold = Maximum non-spatial (temporal) distance value, in minutes
        min_neighbors = Minimum number of points within Eps1 and Eps2 distance
    OUTPUT:
        df with an added 'cluster' column: a cluster label (1..k) per point, or -1 for noise
    """
    cluster_label = 0
    NOISE = -1
    UNMARKED = 777777
    stack = []

    # initialize each point as unmarked
    df['cluster'] = UNMARKED

    # for each point in the database
    for index, point in df.iterrows():
        # read the label from df itself: the row yielded by iterrows() is a copy
        if df.at[index, 'cluster'] == UNMARKED:
            neighborhood = retrieve_neighbors(index, df, spatial_threshold, temporal_threshold)

            if len(neighborhood) < min_neighbors:
                df.at[index, 'cluster'] = NOISE

            else:  # found a core point
                cluster_label = cluster_label + 1
                df.at[index, 'cluster'] = cluster_label  # assign a label to the core point

                for neig_index in neighborhood:  # assign the core's label to its neighborhood
                    df.at[neig_index, 'cluster'] = cluster_label
                    stack.append(neig_index)  # append the neighborhood to the stack

                while len(stack) > 0:  # find new neighbors from the core point's neighborhood
                    current_point_index = stack.pop()
                    new_neighborhood = retrieve_neighbors(current_point_index, df, spatial_threshold, temporal_threshold)

                    if len(new_neighborhood) >= min_neighbors:  # current_point is a new core
                        for neig_index in new_neighborhood:
                            neig_cluster = df.at[neig_index, 'cluster']
                            # only unmarked points are added; noise and already
                            # clustered points keep their current label
                            if neig_cluster == UNMARKED:
                                # TODO: verify cluster average before adding the new point
                                df.at[neig_index, 'cluster'] = cluster_label
                                stack.append(neig_index)
    return df


def retrieve_neighbors(index_center, df, spatial_threshold, temporal_threshold):
    neighborhood = []

    center_point = df.loc[index_center]

    # filter by time
    min_time = center_point['date_time'] - timedelta(minutes=temporal_threshold)
    max_time = center_point['date_time'] + timedelta(minutes=temporal_threshold)
    candidates = df[(df['date_time'] >= min_time) & (df['date_time'] <= max_time)]

    # filter by distance
    for index, point in candidates.iterrows():
        if index != index_center:
            distance = great_circle(
                (center_point['latitude'], center_point['longitude']),
                (point['latitude'], point['longitude'])).meters
            if distance <= spatial_threshold:
                neighborhood.append(index)

    return neighborhood

--------------------------------------------------------------------------------
/main_stdbscan.py:
--------------------------------------------------------------------------------
import pandas as pd
from sys import argv
import STDBSCAN

csv_path = argv[1]

# df_table must have the columns: 'latitude', 'longitude' and 'date_time'
# 'date_time' is parsed as datetime so the timedelta arithmetic in STDBSCAN works
df_table = pd.read_csv(csv_path, parse_dates=['date_time'])
print(df_table)

spatial_threshold = 100  # meters
temporal_threshold = 1  # minutes
min_neighbors = 2
df_clustering = STDBSCAN.ST_DBSCAN(df_table, spatial_threshold, temporal_threshold, min_neighbors)
print(df_clustering)

--------------------------------------------------------------------------------