├── README.md
├── STDBSCAN.py
└── main_stdbscan.py


/README.md:
--------------------------------------------------------------------------------
1 | # ST-DBSCAN 
2 | 
3 | ST-DBSCAN is a density-based clustering algorithm that takes into account both the spatial and the non-spatial attributes of the points. Like DBSCAN, it can identify clusters of arbitrary shape and does not require the number of clusters to be fixed in advance. The non-spatial attribute can be any attribute that is not a coordinate in space (e.g., color, time, temperature). Thus, ST-DBSCAN groups points that are spatially near each other and that have similar non-spatial attributes.
4 | 
5 | Unlike DBSCAN, which requires only two parameters, ST-DBSCAN requires four: Eps1, Eps2, MinPts, and Delta-Epsilon. Eps1 is the spatial distance threshold, the maximum distance at which a point can be assigned to a cluster. Eps2 is the maximum difference between non-spatial attributes. MinPts is the minimum number of neighbors a point needs to be a core point. Delta-Epsilon is the maximum difference between the average attribute value of a cluster and the attribute value of a new point that is about to be inserted into that cluster. Its role is to split clusters that are close to each other spatially but differ considerably in the non-spatial attribute. Delta-Epsilon is not implemented in this version. In the code, Eps1, Eps2, and MinPts correspond to `spatial_threshold` (meters), `temporal_threshold` (minutes), and `min_neighbors`.
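
Since the fourth parameter (the cluster-average check) is not implemented in this version, the following is only a minimal sketch of what such a check could look like. It assumes a numeric non-spatial attribute such as temperature; the function name `within_delta_epsilon` and its arguments are hypothetical and do not exist in this repository.

```python
from statistics import mean


def within_delta_epsilon(cluster_values, new_value, delta_epsilon):
    """Return True if new_value is close enough to the cluster average.

    cluster_values: attribute values of the points already in the cluster
    new_value:      attribute value of the candidate point
    delta_epsilon:  maximum allowed difference from the cluster average
    """
    if not cluster_values:
        # Assumption: an empty cluster accepts any first point.
        return True
    return abs(mean(cluster_values) - new_value) <= delta_epsilon


cluster_temperatures = [20.0, 22.0]  # the average is 21.0
print(within_delta_epsilon(cluster_temperatures, 21.5, 1.0))  # True
print(within_delta_epsilon(cluster_temperatures, 25.0, 1.0))  # False
```

A check like this would run just before a neighbor is labeled during cluster expansion, which is where the `TODO` comment in `STDBSCAN.py` sits.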
6 | 
7 | The implementation consists of two functions. The main one, `ST_DBSCAN`, creates clusters iteratively. It relies on `retrieve_neighbors`, which returns the points that are neighbors of a given point.
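
The interplay of the two functions described above can be sketched without any third-party dependency. This is an illustration under stated assumptions, not the repository's code: points are plain `(latitude, longitude, datetime)` tuples instead of DataFrame rows, the haversine formula is swapped in for geopy's `great_circle`, and `EARTH_RADIUS_M`, `haversine_m`, and `st_dbscan` are names made up for this example.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6371009.0  # assumed mean Earth radius, in meters
NOISE = -1
UNMARKED = None  # the repository uses a sentinel integer instead


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def retrieve_neighbors(center, points, spatial_threshold, temporal_threshold):
    """Indices of points within both thresholds of points[center], excluding it."""
    lat_c, lon_c, time_c = points[center]
    max_gap = timedelta(minutes=temporal_threshold)
    neighbors = []
    for i, (lat, lon, time) in enumerate(points):
        if i == center or abs(time - time_c) > max_gap:
            continue  # skip the center itself and points too far apart in time
        if haversine_m(lat_c, lon_c, lat, lon) <= spatial_threshold:
            neighbors.append(i)
    return neighbors


def st_dbscan(points, spatial_threshold, temporal_threshold, min_neighbors):
    """Return one label per point: 1, 2, ... for clusters, -1 for noise."""
    labels = [UNMARKED] * len(points)
    cluster_label = 0
    for i in range(len(points)):
        if labels[i] is not UNMARKED:
            continue
        neighborhood = retrieve_neighbors(i, points, spatial_threshold, temporal_threshold)
        if len(neighborhood) < min_neighbors:
            labels[i] = NOISE
            continue
        # Found a core point: start a new cluster and seed the stack.
        cluster_label += 1
        labels[i] = cluster_label
        stack = []
        for j in neighborhood:
            labels[j] = cluster_label
            stack.append(j)
        # Grow the cluster through every core point reachable from the seed.
        while stack:
            current = stack.pop()
            new_neighbors = retrieve_neighbors(current, points, spatial_threshold, temporal_threshold)
            if len(new_neighbors) >= min_neighbors:
                for j in new_neighbors:
                    if labels[j] is UNMARKED:
                        labels[j] = cluster_label
                        stack.append(j)
    return labels


t0 = datetime(2020, 1, 1, 12, 0, 0)
points = [
    (0.0000, 0.0, t0),
    (0.0003, 0.0, t0 + timedelta(seconds=20)),
    (0.0006, 0.0, t0 + timedelta(seconds=40)),
    (0.0003, 0.0, t0 + timedelta(minutes=30)),  # close in space, far in time
    (1.0000, 1.0, t0),                          # close in time, far in space
]
labels = st_dbscan(points, spatial_threshold=100, temporal_threshold=1, min_neighbors=2)
print(labels)  # [1, 1, 1, -1, -1]
```

The first three points lie within 100 meters and one minute of each other and form one cluster. The fourth fails only the temporal test and the fifth only the spatial test, so both end up as noise.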
8 | 


--------------------------------------------------------------------------------
/STDBSCAN.py:
--------------------------------------------------------------------------------
 1 | import math
 2 | from datetime import timedelta
 3 | from geopy.distance import great_circle
 4 | """
 5 | INPUTS:
 6 |     df = {o1,o2,...,on} Set of objects (DataFrame with 'latitude', 'longitude' and 'date_time' columns)
 7 |     spatial_threshold = Maximum spatial (great-circle) distance, in meters (Eps1)
 8 |     temporal_threshold = Maximum temporal distance, in minutes (Eps2)
 9 |     min_neighbors = Minimum number of neighbors within both thresholds (MinPts)
10 | OUTPUT:
11 |     df with an added 'cluster' column: cluster labels start at 1, noise is -1
12 | """
13 | def ST_DBSCAN(df, spatial_threshold, temporal_threshold, min_neighbors):
14 |     cluster_label = 0
15 |     NOISE = -1
16 |     UNMARKED = 777777
17 |     stack = []
18 | 
19 |     # initialize each point with unmarked
20 |     df['cluster'] = UNMARKED
21 |     
22 |     # for each point in database
23 |     for index, point in df.iterrows():
24 |         if df.at[index, 'cluster'] == UNMARKED:
25 |             neighborhood = retrieve_neighbors(index, df, spatial_threshold, temporal_threshold)
26 |             
27 |             if len(neighborhood) < min_neighbors:
28 |                 df.at[index, 'cluster'] = NOISE
29 | 
30 |             else: # found a core point
31 |                 cluster_label = cluster_label + 1
32 |                 df.at[index, 'cluster'] = cluster_label # assign a label to core point
33 | 
34 |                 for neig_index in neighborhood: # assign core's label to its neighborhood
35 |                     df.at[neig_index, 'cluster'] = cluster_label
36 |                     stack.append(neig_index) # append neighborhood to stack
37 |                 
38 |                 while len(stack) > 0: # find new neighbors from core point neighborhood
39 |                     current_point_index = stack.pop()
40 |                     new_neighborhood = retrieve_neighbors(current_point_index, df, spatial_threshold, temporal_threshold)
41 |                     
42 |                     if len(new_neighborhood) >= min_neighbors: # current_point is a new core
43 |                         for neig_index in new_neighborhood:
44 |                             neig_cluster = df.at[neig_index, 'cluster']
45 |                             if neig_cluster == UNMARKED: # noise and already-clustered points are left unchanged
46 |                                 # TODO: verify cluster average before add new point
47 |                                 df.at[neig_index, 'cluster'] = cluster_label
48 |                                 stack.append(neig_index)
49 |     return df
50 | 
51 | 
52 | def retrieve_neighbors(index_center, df, spatial_threshold, temporal_threshold):
53 |     neighborhood = []
54 | 
55 |     center_point = df.loc[index_center]
56 | 
57 |     # filter by time: keep points within +/- temporal_threshold minutes of the center
58 |     min_time = center_point['date_time'] - timedelta(minutes = temporal_threshold)
59 |     max_time = center_point['date_time'] + timedelta(minutes = temporal_threshold)
60 |     df = df[(df['date_time'] >= min_time) & (df['date_time'] <= max_time)]
61 | 
62 |     # filter by distance: keep points within spatial_threshold meters of the center
63 |     for index, point in df.iterrows():
64 |         if index != index_center:
65 |             distance = great_circle((center_point['latitude'], center_point['longitude']), (point['latitude'], point['longitude'])).meters
66 |             if distance <= spatial_threshold:
67 |                 neighborhood.append(index)
68 | 
69 |     return neighborhood


--------------------------------------------------------------------------------
/main_stdbscan.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | from sys import argv
 3 | import STDBSCAN
 4 | 
 5 | csv_path = argv[1]
 6 | 
 7 | # df_table must have the columns: 'latitude', 'longitude' and 'date_time'
 8 | df_table = pd.read_csv(csv_path, parse_dates=['date_time']) # parse timestamps so timedelta arithmetic works
 9 | print(df_table)
10 | 
11 | spatial_threshold = 100 # meters
12 | temporal_threshold = 1  # minutes
13 | min_neighbors = 2
14 | df_clustering = STDBSCAN.ST_DBSCAN(df_table, spatial_threshold, temporal_threshold, min_neighbors)
15 | print(df_clustering)


--------------------------------------------------------------------------------