├── output.txt
├── input_test.txt
├── input.txt
├── .gitignore
├── README.md
└── communities.py


/output.txt:
--------------------------------------------------------------------------------
[1,2,3]
[4,5,6]
[7]
[8]
[9,10,11]
[12,13,14]

--------------------------------------------------------------------------------
/input_test.txt:
--------------------------------------------------------------------------------
1 2
1 3
2 3
2 4
4 5
4 6
4 7
6 7
5 6

--------------------------------------------------------------------------------
/input.txt:
--------------------------------------------------------------------------------
1 2
1 3
2 1
2 3
3 1
3 2
3 7
4 5
4 6
5 4
5 6
6 4
6 5
6 7
7 3
7 6
7 8
8 7
8 9
8 12
9 8
9 11
9 10
10 9
10 11
11 9
11 10
12 8
12 14
12 13
13 12
13 14
14 13
14 12

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Community-Detection
Implement a community detection algorithm using divisive hierarchical clustering (the Girvan-Newman algorithm).

## Overview

Implement a community detection algorithm using divisive hierarchical clustering (the Girvan-Newman algorithm). It makes use of two Python libraries, networkx and community. networkx is a Python library that can be installed on your machine. The assignment requires the betweenness function from networkx and the modularity function from community. The matplotlib library is used for plotting the communities.

## Main Steps

- Read the input file into a graph using networkx.
- Use the betweenness of the edges as a measure to break communities into smaller communities in divisive clustering.
- The result should be the set of communities that has the highest modularity.
- Use the modularity function in the community API.
- Use the betweenness function from networkx.
- After the best set of communities is obtained, use matplotlib and networkx functions to plot the communities with different colors.

## Execution Details

The Python code should take two parameters: the input file containing the graph and the output image showing the community structure.
For example:

    python communities.py input.txt image.png

## Input Parameters

### input.txt
This file contains the representation of the graph. All graphs tested will be undirected graphs. Each line in the input file is of the format:
1 2
where 1 and 2 are the nodes and each line represents an edge between the two nodes. The nodes are separated by one space.

### image.png
This will be a visualization of the communities detected by your algorithm. You should represent each community in a unique color. Each node should carry a label showing its node number from the input file.

## Output
The Python code should output the communities in the form of a dictionary to standard output (the console). Each community should be an array representing the nodes in that community. In each array, the nodes should be sorted lexicographically. For example:

    [1,2,3,4]
    [5,6,7,8]
    [9]
    [10]

These 4 arrays represent the 4 communities.

--------------------------------------------------------------------------------
/communities.py:
--------------------------------------------------------------------------------
"""
Author: Lingzhe Teng
Date: Nov. 15, 2015

"""

"""
Executing code:
    python communities.py input.txt image.png

"""

"""
Change log:

- Nov. 26
    1. saving image based on the input image name instead of show graph

- Nov. 28
    1.
fix bug for plotting community with random colors

"""

from functools import reduce  # builtin in Python 2, must be imported in Python 3
from operator import mul
import networkx as nx
import os
import sys
import community
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.colors as mpcolors
import matplotlib.cm as mpcm
import numpy as np


""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
""" 3rd-party Libraries """
""" """
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

"""
## networkx(betweenness)
Home: https://networkx.github.io

## community(modularity)
Home: http://perso.crans.org/aynaud/communities/
API: http://perso.crans.org/aynaud/communities/api.html

## matplotlib
Home: http://matplotlib.org/index.html
Doc: http://matplotlib.org/examples/index.html
Ref: https://gist.github.com/shobhit/3236373

"""

class Communities:
    def __init__(self, ipt_txt, ipt_png):
        self.ipt_txt = ipt_txt
        self.ipt_png = ipt_png
        self.graph = None

    def initialize(self):
        if not os.path.isfile(self.ipt_txt):
            self.quit(self.ipt_txt + " doesn't exist or it's not a file")

        # initialize 3rd-party libraries
        self.graph = nx.Graph()
        # load data
        self.load_txt(self.ipt_txt)


    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    """ Main Functions """
    """ """
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    def find_best_partition(self):
        G = self.graph.copy()
        modularity = 0.0
        removed_edges = []
        partition = {}
        while 1:
            # find the edge(s) with the highest betweenness in the current graph
            betweenness = self.calculte_betweenness(G)
            max_betweenness_edges = self.get_max_betweenness_edges(betweenness)
            # stop when removing them would leave no edges at all
            if len(G.edges()) == len(max_betweenness_edges):
                break

            G.remove_edges_from(max_betweenness_edges)
            components = nx.connected_components(G)
            idx = 0
            tmp_partition = {}
            for component in components:
                for inner in list(component):
                    tmp_partition.setdefault(inner, idx)
                idx += 1
            cur_mod = community.modularity(tmp_partition, G)

            if cur_mod < modularity:
                G.add_edges_from(max_betweenness_edges)
                break
            else:
                partition = tmp_partition
                removed_edges.extend(max_betweenness_edges)
                modularity = cur_mod
        return partition, G, removed_edges

    def get_max_betweenness_edges(self, betweenness):
        max_betweenness_edges = []
        max_betweenness = max(betweenness.items(), key=lambda x: x[1])
        for (k, v) in betweenness.items():
            if v == max_betweenness[1]:
                max_betweenness_edges.append(k)
        return max_betweenness_edges

    def calculte_betweenness(self, G, bonus=True):
        """
        Calculate Betweenness
        input:
            - G: graph
            - bonus: True if use my own betweenness calculator.
              (bonus=True by default)

        """
        if bonus:
            betweenness = self.my_betweenness_calculation(G)
        else:
            betweenness = nx.edge_betweenness_centrality(G, k=None, normalized=True, weight=None, seed=None)
        return betweenness

    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    """ Bonus Functions """
    """ """
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    def build_level(self, G, root):
        """
        Build level for graph

        input:
            - G: networkx graph
            - root: root node

        output:
            - levels: nodes in each level
            - predecessors: predecessors for each node
            - successors: successors for each node

        """
        levels = {}
        predecessors = {}
        successors = {}

        cur_level_nodes = [root]    # initialize start point
        nodes = []                  # store nodes that have been accessed
        level_idx = 0               # track level index
        while cur_level_nodes:      # if have nodes for a level, continue process
            nodes.extend(cur_level_nodes)   # add nodes that are inside new level into nodes list
            levels.setdefault(level_idx, cur_level_nodes)   # set nodes for current level
            next_level_nodes = []   # prepare nodes for next level

            # find node in next level
            for node in cur_level_nodes:
                nei_nodes = G.neighbors(node)   # all neighbors for the node in current level

                # find neighbor nodes in the next level
                for nei_node in nei_nodes:
                    if nei_node not in nodes:   # nodes in the next level must not be accessed
                        predecessors.setdefault(nei_node, [])   # initialize predecessors dictionary, use a list to store all predecessors
                        predecessors[nei_node].append(node)
                        successors.setdefault(node, [])     # initialize successors dictionary, use a list to store all successors
                        successors[node].append(nei_node)

                        if nei_node not in next_level_nodes:
                            # avoid adding the same node twice
                            next_level_nodes.append(nei_node)
            cur_level_nodes = next_level_nodes
            level_idx += 1
        return levels, predecessors, successors

    def calculate_credits(self, G, levels, predecessors, successors, nodes_nsp):
        """
        Calculate credits for nodes and edges

        """
        nodes_credit = {}
        edges_credit = {}

        # loop, from bottom to top, not including the zero level
        for lvl_idx in range(len(levels)-1, 0, -1):
            lvl_nodes = levels[lvl_idx]     # get nodes in the level

            # calculate for each node in current level
            for lvl_node in lvl_nodes:
                nodes_credit.setdefault(lvl_node, 1.)   # set default credit for the node, 1
                if lvl_node in successors:      # if it is not a leaf node
                    # Each node that is not a leaf gets credit = 1 + sum of credits of the DAG edges from that node to level below
                    for successor in successors[lvl_node]:
                        nodes_credit[lvl_node] += edges_credit[(successor, lvl_node)]

                node_predecessors = predecessors[lvl_node]  # get predecessors of the node in current level
                total_nodes_nsp = .0    # total number of shortest paths for predecessors of the node in current level

                # sum up for total_nodes_nsp
                for predecessor in node_predecessors:
                    total_nodes_nsp += nodes_nsp[predecessor]

                # again, calculate the weight of each predecessor, and assign credit to the corresponding edge
                for predecessor in node_predecessors:
                    predecessor_weight = nodes_nsp[predecessor]/total_nodes_nsp     # calculate weight of predecessor
                    edges_credit.setdefault((lvl_node, predecessor), nodes_credit[lvl_node]*predecessor_weight)     # bottom-up edge
        return nodes_credit, edges_credit


    def my_betweenness_calculation(self, G, normalized=False):
        """
        Main Bonus Function to calculate betweenness

        """
        graph_nodes = G.nodes()
        edge_contributions = {}
        components = list(nx.connected_components(G))   # connected components for current graph

        # calculate for each node
        for node in graph_nodes:
            component = None    # the community current node belongs to
            for com in components:
                if node in list(com):
                    component = list(com)
            nodes_nsp = {}      # number of shortest paths
            node_levels, predecessors, successors = self.build_level(G, node)   # build levels for calculation

            # calculate shortest paths for each node (including current node)
            for other_node in component:
                shortest_paths = nx.all_shortest_paths(G, source=node, target=other_node)
                nodes_nsp[other_node] = len(list(shortest_paths))

            # calculate credits for nodes and edges (Only use "edges_credit" actually)
            nodes_credit, edges_credit = self.calculate_credits(G, node_levels, predecessors, successors, nodes_nsp)

            # sort tuple (key value of edges_credit), and sum up for edge_contributions
            for (k, v) in edges_credit.items():
                k = sorted(k, reverse=False)
                edge_contributions_key = (k[0], k[1])
                edge_contributions.setdefault(edge_contributions_key, 0)
                edge_contributions[edge_contributions_key] += v

        # divide by 2 to get true betweenness
        for (k, v) in edge_contributions.items():
            edge_contributions[k] = v/2

        # normalize
        if normalized:
            max_edge_contribution = max(edge_contributions.values())
            for (k, v) in edge_contributions.items():
                edge_contributions[k] = v/max_edge_contribution
        return edge_contributions

    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    """ Plot Method """
    """ """
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    def plot_graph(self, part_graph, removed_edges):
        G = self.graph
        G_part = part_graph
        exist_edges = part_graph.edges(data=True)
        pos = \
            nx.spring_layout(G, k=0.1, iterations=50, scale=1.3)

        # nodes
        coms = nx.connected_components(part_graph)
        for com in coms:
            nodes = list(com)
            # derive a repeatable seed from the community; np.random.seed only accepts values below 2**32
            np.random.seed( (len(nodes)*sum(nodes)*reduce(mul, nodes, 1)*min(nodes)*max(nodes)) % (2**32) )
            colors = np.random.rand(4 if len(nodes)<4 else len(nodes)+1)
            nx.draw_networkx_nodes(G,pos,nodelist=nodes,node_size=500, node_color=colors)

        # edges
        nx.draw_networkx_edges(G,pos,edgelist=exist_edges, width=2,alpha=1,edge_color='k')
        nx.draw_networkx_edges(G,pos,edgelist=removed_edges, width=2, edge_color='k')    #, style='dashed')

        # labels
        nx.draw_networkx_labels(G,pos,font_size=12,font_family='sans-serif')

        plt.axis('off')
        plt.savefig(self.ipt_png)
        # plt.show()

    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    """ Help Method """
    """ """
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

    def load_txt(self, ipt_txt):
        # 'r' instead of 'rU': the 'U' flag is no longer accepted by recent Python 3 releases
        with open(ipt_txt, 'r') as input_data:
            for line in input_data:
                line = line.strip('\n')
                line = line.split(" ")

                if len(line) != 2:
                    self.quit("invalid edge format in " + ipt_txt)

                ending_node_a = int(line[0])
                ending_node_b = int(line[1])
                self.graph.add_edge(ending_node_a, ending_node_b, weight=0.6, len=3.0)

    def display_graph(self):
        G = self.graph
        print("+-------------------------------------------------+")
        print("|                  Display Graph                  |")
        print("+-------------------------------------------------+")

        print("Number of Nodes: %d" % G.number_of_nodes())
        print("Number of Edges: %d" % G.number_of_edges())
        print("Nodes: \n%s" % (list(G.nodes()),))
        print("Edges: \n%s" % (list(G.edges()),))

    def partition_to_community(self, partition):
        result = {}
        for (k, v) in partition.items():
            result.setdefault(v,
                              [])
            result[v].append(k)
        return list(result.values())

    def display(self, partition):
        comms = self.partition_to_community(partition)
        for comm in comms:
            comm.sort()     # actually duplicate process here
            print(comm)

    def quit(self, err_desc):
        raise SystemExit('\n'+ "PROGRAM EXIT: " + err_desc + ', please check your input' + '\n')

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
""" Main Method """
""" """
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
if __name__ == '__main__':

    ipt_txt = sys.argv[1]
    ipt_png = sys.argv[2]

    c = Communities(ipt_txt, ipt_png)
    c.initialize()
    partition, part_graph, removed_edges = c.find_best_partition()
    c.display(partition)
    c.plot_graph(part_graph, removed_edges)

--------------------------------------------------------------------------------
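
The bonus functions in communities.py compute edge betweenness by building BFS levels from every node and passing credit back up the shortest-path DAG. The following is a minimal, dependency-free sketch of that same idea, useful as a cross-check on small graphs. It is not the repository's code: the function name `edge_betweenness` and the `adjacency` dictionary format are assumptions made for this illustration, and the sample graph is the edge list from input_test.txt.

```python
from collections import defaultdict, deque


def edge_betweenness(adjacency):
    """Unnormalized edge betweenness of an undirected, unweighted graph.

    `adjacency` (hypothetical format) maps each node to a list of neighbours.
    Returns a dict keyed by sorted (u, v) tuples.
    """
    betweenness = defaultdict(float)
    for root in adjacency:
        # BFS from root: record distances, shortest-path counts and DAG parents.
        distance = {root: 0}
        paths = defaultdict(float)
        paths[root] = 1.0
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    parents[neighbour].append(node)
        # Walk back from the deepest nodes. Each node starts with credit 1 and
        # splits its credit among parents in proportion to their path counts.
        credit = {node: 1.0 for node in order}
        for node in reversed(order):
            for parent in parents[node]:
                share = credit[node] * paths[parent] / paths[node]
                betweenness[tuple(sorted((parent, node)))] += share
                credit[parent] += share
    # Every pair of endpoints was counted once from each side, so halve.
    return {edge: value / 2.0 for edge, value in betweenness.items()}


# Edge list taken from input_test.txt.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5), (4, 6), (4, 7), (6, 7), (5, 6)]
adjacency = defaultdict(list)
for a, b in edges:
    adjacency[a].append(b)
    adjacency[b].append(a)

scores = edge_betweenness(adjacency)
best = max(scores, key=scores.get)
print(best, scores[best])  # (2, 4) 12.0
```

Edge (2, 4) is the only link between nodes {1, 2, 3} and nodes {4, 5, 6, 7}, so all 3 x 4 = 12 cross pairs route through it, which is why it would be the first edge removed in a Girvan-Newman pass.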