├── output.txt
├── input_test.txt
├── input.txt
├── .gitignore
├── README.md
└── communities.py


/output.txt:
--------------------------------------------------------------------------------
1 | [1,2,3]
2 | [4,5,6]
3 | [7]
4 | [8]
5 | [9,10,11]
6 | [12,13,14]


--------------------------------------------------------------------------------
/input_test.txt:
--------------------------------------------------------------------------------
1 | 1 2
2 | 1 3
3 | 2 3
4 | 2 4
5 | 4 5
6 | 4 6
7 | 4 7
8 | 6 7
9 | 5 6


--------------------------------------------------------------------------------
/input.txt:
--------------------------------------------------------------------------------
 1 | 1 2
 2 | 1 3
 3 | 2 1
 4 | 2 3
 5 | 3 1
 6 | 3 2
 7 | 3 7
 8 | 4 5
 9 | 4 6
10 | 5 4
11 | 5 6
12 | 6 4
13 | 6 5
14 | 6 7
15 | 7 3
16 | 7 6
17 | 7 8
18 | 8 7
19 | 8 9
20 | 8 12
21 | 9 8
22 | 9 11
23 | 9 10
24 | 10 9
25 | 10 11
26 | 11 9
27 | 11 10
28 | 12 8
29 | 12 14
30 | 12 13
31 | 13 12
32 | 13 14
33 | 14 13
34 | 14 12


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | 
 5 | # C extensions
 6 | *.so
 7 | 
 8 | # Distribution / packaging
 9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | *.egg-info/
23 | .installed.cfg
24 | *.egg
25 | 
26 | # PyInstaller
27 | #  Usually these files are written by a python script from a template
28 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
29 | *.manifest
30 | *.spec
31 | 
32 | # Installer logs
33 | pip-log.txt
34 | pip-delete-this-directory.txt
35 | 
36 | # Unit test / coverage reports
37 | htmlcov/
38 | .tox/
39 | .coverage
40 | .coverage.*
41 | .cache
42 | nosetests.xml
43 | coverage.xml
44 | *,cover
45 | 
46 | # Translations
47 | *.mo
48 | *.pot
49 | 
50 | # Django stuff:
51 | *.log
52 | 
53 | # Sphinx documentation
54 | docs/_build/
55 | 
56 | # PyBuilder
57 | target/
58 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Community-Detection
 2 | Implement a community detection algorithm using a divisive hierarchical clustering (Girvan-Newman algorithm)
 3 | 
 4 | ## Overview
 5 | 
 6 | The assignment detects communities in a graph by divisive hierarchical clustering (the Girvan-Newman algorithm). It uses two Python libraries, networkx and community, both of which must be installed on your machine (the community module is distributed as the python-louvain package). The assignment requires the betweenness function from networkx and the modularity function from community. The matplotlib library is used to plot the communities.
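
To give some intuition for what the modularity function measures, here is a minimal pure-Python sketch of modularity for an unweighted, undirected graph without duplicate edges. The function name `modularity_of` and the edge-list representation are assumptions made for illustration only; the assignment itself relies on the modularity function of the community library.

```python
from collections import defaultdict


def modularity_of(edges, partition):
    """Modularity Q of `partition` (node -> community id) on an undirected edge list.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")

    internal_edges = defaultdict(int)   # L_c
    degree_sum = defaultdict(int)       # d_c
    for u, v in edges:
        degree_sum[partition[u]] += 1
        degree_sum[partition[v]] += 1
        if partition[u] == partition[v]:
            internal_edges[partition[u]] += 1

    return sum(internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by the single edge 3-4 (hypothetical example graph).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
print(round(modularity_of(edges, partition), 4))   # 0.3571
```

Putting every node into a single community gives a modularity of 0 under this definition, which is why a split is only worth keeping when it raises the score.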
 7 | 
 8 | ## Main Steps
 9 | 
10 | - Read the input file into a graph using Networkx.
11 | - Use the betweenness of the edges as a measure to break communities into smaller
12 |   communities in divisive clustering.
13 | - The result should be the set of communities that have the highest modularity.
14 | - Use the modularity function in the community API
15 | - Use the betweenness function from networkx
16 | - After the best set of communities is obtained, use matplotlib and networkx functions to plot the communities with different colors.
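
The edge betweenness used in the steps above can be sketched without any library. The version below is a minimal illustration that assumes an unweighted, undirected graph given as a dictionary of neighbour sets; the name `edge_betweenness` and this representation are hypothetical, and the assignment itself expects the networkx betweenness function.

```python
from collections import defaultdict, deque


def edge_betweenness(adjacency):
    """Unnormalized edge betweenness for an undirected, unweighted graph.

    `adjacency` maps each node to the set of its neighbours.
    Returns a dict keyed by (smaller node, larger node).
    """
    betweenness = defaultdict(float)
    for source in adjacency:
        # Breadth-first search: count shortest paths and record predecessors.
        distance = {source: 0}
        path_count = defaultdict(int)
        path_count[source] = 1
        predecessors = defaultdict(list)
        visit_order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            visit_order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    path_count[neighbour] += path_count[node]
                    predecessors[neighbour].append(node)

        # Walk back from the farthest nodes, splitting each node's credit
        # among its predecessors in proportion to their shortest-path counts.
        credit = defaultdict(float)
        for node in reversed(visit_order):
            for pred in predecessors[node]:
                share = (1.0 + credit[node]) * path_count[pred] / path_count[node]
                betweenness[tuple(sorted((pred, node)))] += share
                credit[pred] += share

    # Every pair of endpoints was counted once from each end.
    return {edge: value / 2.0 for edge, value in betweenness.items()}


# Two triangles joined by the single edge 3-4 (hypothetical example graph).
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
             4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
scores = edge_betweenness(adjacency)
print(max(scores, key=scores.get))   # (3, 4)
```

The bridge between the two triangles carries all nine shortest paths that cross from one side to the other, so it is the first edge a divisive step would remove.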
17 | 
18 | ## Execution Details
19 | 
20 | The Python code should take two parameters: the input file containing the graph and the output image showing the community structure. For example:
21 | `python communities.py input.txt image.png`
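
One way to read the two parameters is with the standard argparse module. This is only a sketch: the function name `parse_arguments` and the argument names `input_file` and `output_image` are assumptions, not part of the assignment.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the two positional parameters described above."""
    parser = argparse.ArgumentParser(
        description="Detect communities in a graph and plot them.")
    parser.add_argument("input_file", help="text file with one edge per line")
    parser.add_argument("output_image", help="path of the image to write")
    return parser.parse_args(argv)


# Passing a list makes the function easy to try without a real command line.
args = parse_arguments(["input.txt", "image.png"])
print(args.input_file, args.output_image)   # input.txt image.png
```

Calling `parse_arguments()` with no argument reads `sys.argv`, and argparse prints a usage message when a parameter is missing.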
22 | 
23 | ## Input Parameters
24 | 
25 | ### input.txt 
26 | This file consists of the representation of the graph. All graphs tested will be undirected graphs. Each line in the input file is of the format:
27 | 1 2
28 | where 1 and 2 are the nodes and each line represents an edge between the two nodes. The nodes are separated by one space.
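
A minimal sketch of reading this format is shown below. The helper name `parse_edges` is hypothetical. Because the graph is undirected, the sketch treats `1 2` and `2 1` as the same edge, which matters for files such as input.txt that list every edge in both directions.

```python
def parse_edges(lines):
    """Turn lines such as "1 2" into a sorted list of unique undirected edges."""
    edges = set()
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue    # ignore blank lines
        if len(fields) != 2:
            raise ValueError("line %d: expected two node ids, got %r"
                             % (line_number, line))
        a, b = int(fields[0]), int(fields[1])
        edges.add((min(a, b), max(a, b)))   # "1 2" and "2 1" are the same edge
    return sorted(edges)


print(parse_edges(["1 2", "1 3", "2 1", "2 3"]))   # [(1, 2), (1, 3), (2, 3)]
```

The same function accepts an open file object, since iterating over a file yields its lines.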
29 | 
30 | ### image.png
31 | This will be a visualization of the communities detected by your algorithm. You should represent each community in a unique color. Each node should carry a label with its node number from the input file.
32 | 
33 | ## Output 
34 | The Python code should output the communities in the form of a dictionary to standard output (the console). Each community should be an array representing nodes in that community. In each array, the nodes should be sorted lexicographically. For example:
35 | 
36 |         [1,2,3,4] 
37 |         [5,6,7,8] 
38 |         [9]
39 |         [10]
40 | 
41 | These 4 arrays represent the 4 communities.
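
The sketch below shows one way to produce output in this shape from a node-to-community mapping. The helper name `communities_from_partition` is hypothetical, and the sketch assumes that nodes are sorted in ascending numeric order, which is what the example above and output.txt show.

```python
from collections import defaultdict


def communities_from_partition(partition):
    """Group nodes by community id and sort the nodes inside each community."""
    groups = defaultdict(list)
    for node, community_id in partition.items():
        groups[community_id].append(node)
    # Sort each community, then order the communities by their smallest node.
    return sorted(sorted(nodes) for nodes in groups.values())


partition = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 3}
for nodes in communities_from_partition(partition):
    print("[" + ",".join(str(node) for node in nodes) + "]")
# [1,2,3,4]
# [5,6,7,8]
# [9]
# [10]
```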
42 | 
43 | 
44 | 
45 | 
46 | 
47 | 
48 | 
49 | 
50 | 


--------------------------------------------------------------------------------
/communities.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Author: Lingzhe Teng
  3 | Date: Nov. 15, 2015
  4 | 
  5 | """
  6 | 
  7 | """
  8 | Executing code: 
  9 | python communities.py input.txt image.png
 10 | 
 11 | """
 12 | 
 13 | """
 14 | Change log:
 15 | 
 16 | - Nov. 26
 17 | 1. saving image based on the input image name instead of show graph
 18 | 
 19 | - Nov. 28
 20 | 1. fix bug for plotting community with random colors
 21 | 
 22 | """
 23 | from functools import reduce    # reduce() is a builtin only in Python 2
 24 | from operator import mul
 25 | import networkx as nx
 26 | import os
 27 | import sys
 28 | import community
 29 | import matplotlib.pyplot as plt
 30 | import matplotlib.image as mpimg
 31 | import matplotlib.colors as mpcolors
 32 | import matplotlib.cm as mpcm
 33 | import numpy as np
 34 | 
 35 | 
 36 | """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 37 | """                            3rd-party Libraries                               """
 38 | """                                                                              """    
 39 | """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 40 | 
 41 | """
 42 | ## networkx(betweenness)
 43 | Home: https://networkx.github.io
 44 | 
 45 | ## community(modularity)
 46 | Home: http://perso.crans.org/aynaud/communities/
 47 | API: http://perso.crans.org/aynaud/communities/api.html
 48 | 
 49 | ## matplotlib
 50 | Home: http://matplotlib.org/index.html
 51 | Doc: http://matplotlib.org/examples/index.html
 52 | Ref: https://gist.github.com/shobhit/3236373
 53 | 
 54 | """
 55 | 
 56 | class Communities:
 57 |     def __init__(self, ipt_txt, ipt_png):
 58 |         self.ipt_txt = ipt_txt
 59 |         self.ipt_png = ipt_png
 60 |         self.graph = None
 61 | 
 62 |     def initialize(self):
 63 |         if not os.path.isfile(self.ipt_txt):
 64 |             self.quit(self.ipt_txt + " doesn't exist or it's not a file")
 65 | 
 66 |         # initialize 3rd-party libraries
 67 |         self.graph = nx.Graph()
 68 |         # load data
 69 |         self.load_txt(self.ipt_txt)
 70 | 
 71 |         
 72 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 73 |     """                               Main Functions                                 """
 74 |     """                                                                              """    
 75 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 76 |     def find_best_partition(self):
 77 |         G = self.graph.copy()
 78 |         modularity = 0.0
 79 |         removed_edges = []
 80 |         partition = {}
 81 |         while 1:
 82 |             betweenness = self.calculte_betweenness(G)
 83 |             max_betweenness_edges = self.get_max_betweenness_edges(betweenness)
 84 |             if len(G.edges()) == len(max_betweenness_edges):
 85 |                 break
 86 | 
 87 |             G.remove_edges_from(max_betweenness_edges)  
 88 |             components = nx.connected_components(G)
 89 |             idx = 0
 90 |             tmp_partition = {}
 91 |             for component in components:
 92 |                 for inner in list(component):
 93 |                     tmp_partition.setdefault(inner, idx)
 94 |                 idx += 1
 95 |             cur_mod = community.modularity(tmp_partition, self.graph)   # score the split against the original graph, not the pruned copy
 96 | 
 97 |             if cur_mod < modularity:
 98 |                 G.add_edges_from(max_betweenness_edges)
 99 |                 break
100 |             else:
101 |                 partition = tmp_partition
102 |             removed_edges.extend(max_betweenness_edges)
103 |             modularity = cur_mod
104 |         return partition, G, removed_edges
105 | 
106 |     def get_max_betweenness_edges(self, betweenness):
107 |         max_betweenness_edges = []
108 |         max_betweenness = max(betweenness.items(), key=lambda x: x[1])
109 |         for (k, v) in betweenness.items():
110 |             if v == max_betweenness[1]:
111 |                 max_betweenness_edges.append(k)
112 |         return max_betweenness_edges
113 | 
114 |     def calculte_betweenness(self, G, bonus=True):
115 |         """
116 |         Calculate Betweenness
117 |         input:
118 |         - G: graph
119 |         - bonus: True if use my own betweenness calculator. (bonus=True by default)
120 | 
121 |         """
122 |         if bonus:
123 |             betweenness = self.my_betweenness_calculation(G)
124 |         else:
125 |             betweenness = nx.edge_betweenness_centrality(G, k=None, normalized=True, weight=None, seed=None)
126 |         return betweenness
127 | 
128 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
129 |     """                               Bonus Functions                                """
130 |     """                                                                              """    
131 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
132 |     def build_level(self, G, root):
133 |         """ 
134 |         Build level for graph
135 | 
136 |         input:
137 |         - G: networkx graph
138 |         - root: root node
139 | 
140 |         output:
141 |         - levels: nodes in each level
142 |         - predecessors: predecessors for each node
143 |         - successors: successors for each node
144 |  
145 |         """
146 |         levels = {}
147 |         predecessors = {}
148 |         successors = {}
149 | 
150 |         cur_level_nodes = [root]    # initialize start point
151 |         nodes = []  # store nodes that have been accessed
152 |         level_idx = 0       # track level index
153 |         while cur_level_nodes:  # if have nodes for a level, continue process
154 |             nodes.extend(cur_level_nodes)   # add nodes that are inside new level into nodes list
155 |             levels.setdefault(level_idx, cur_level_nodes)   # set nodes for current level
156 |             next_level_nodes = []   # prepare nodes for next level
157 | 
158 |             # find node in next level
159 |             for node in cur_level_nodes:
160 |                 nei_nodes = G.neighbors(node)   # all neighbors for the node in current level
161 |                 
162 |                 # find neighbor nodes in the next level
163 |                 for nei_node in nei_nodes:
164 |                     if nei_node not in nodes:   # nodes in the next level must not be accessed
165 |                         predecessors.setdefault(nei_node, [])   # initialize predecessors dictionary, use a list to store all predecessors
166 |                         predecessors[nei_node].append(node) 
167 |                         successors.setdefault(node, [])     # initialize successors dictionary, use a list to store all successors
168 |                         successors[node].append(nei_node)
169 | 
170 |                         if nei_node not in next_level_nodes:    # avoid add same node twice
171 |                             next_level_nodes.append(nei_node)
172 |             cur_level_nodes = next_level_nodes
173 |             level_idx += 1
174 |         return levels, predecessors, successors
175 | 
176 |     def calculate_credits(self, G, levels, predecessors, successors, nodes_nsp):
177 |         """
178 |         Calculate credits for nodes and edges
179 | 
180 |         """
181 |         nodes_credit = {}
182 |         edges_credit = {}
183 | 
184 |         # loop, from bottom to top, not including the zero level
185 |         for lvl_idx in range(len(levels)-1, 0, -1):
186 |             lvl_nodes = levels[lvl_idx]     # get nodes in the level
187 | 
188 |             # calculate for each node in current level
189 |             for lvl_node in lvl_nodes:
190 |                 nodes_credit.setdefault(lvl_node, 1.)   # set default credit for the node, 1
191 |                 if lvl_node in successors:        # if it is not a leaf node
192 |                     # Each node that is not a leaf gets credit = 1 + sum of credits of the DAG edges from that node to level below
193 |                     for successor in successors[lvl_node]:
194 |                         nodes_credit[lvl_node] += edges_credit[(successor, lvl_node)]
195 | 
196 |                 node_predecessors = predecessors[lvl_node]  #  get predecessors of the node in current level
197 |                 total_nodes_nsp = .0    # total number of shortest paths for predecessors of the node in current level
198 |                 
199 |                 # sum up for total_nodes_nsp
200 |                 for predecessor in node_predecessors:
201 |                     total_nodes_nsp += nodes_nsp[predecessor]
202 | 
203 |                 # then compute the weight of each predecessor and assign credit to the corresponding edge
204 |                 for predecessor in node_predecessors:
205 |                     predecessor_weight = nodes_nsp[predecessor]/total_nodes_nsp     # calculate weight of predecessor
206 |                     edges_credit.setdefault((lvl_node, predecessor), nodes_credit[lvl_node]*predecessor_weight)         # bottom-up edge
207 |         return nodes_credit, edges_credit
208 | 
209 | 
210 |     def my_betweenness_calculation(self, G, normalized=False):
211 |         """
212 |         Main Bonus Function to calculate betweenness
213 | 
214 |         """
215 |         graph_nodes = G.nodes()
216 |         edge_contributions = {}
217 |         components = list(nx.connected_components(G))   # connected components for current graph
218 | 
219 |         # calculate for each node
220 |         for node in graph_nodes:
221 |             component = None    # the community current node belongs to
222 |             for com in components: 
223 |                 if node in list(com):
224 |                     component = list(com)
225 |             nodes_nsp = {}  # number of shortest paths
226 |             node_levels, predecessors, successors = self.build_level(G, node)   # build levels for calculation
227 | 
228 |             # calculate shortest paths for each node (including current node)
229 |             for other_node in component:
230 |                 shortest_paths = nx.all_shortest_paths(G, source=node,target=other_node)
231 |                 nodes_nsp[other_node] = len(list(shortest_paths))
232 | 
233 |             # calculate credits for nodes and edges (Only use "edges_credit" actually)
234 |             nodes_credit, edges_credit = self.calculate_credits(G, node_levels, predecessors, successors, nodes_nsp)
235 | 
236 |             # sort tuple (key value of edges_credit), and sum up for edge_contributions
237 |             for (k, v) in edges_credit.items():
238 |                 k = sorted(k, reverse=False)
239 |                 edge_contributions_key = (k[0], k[1])
240 |                 edge_contributions.setdefault(edge_contributions_key, 0)
241 |                 edge_contributions[edge_contributions_key] += v
242 |            
243 |         # divide by 2 to get true betweenness
244 |         for (k, v) in edge_contributions.items():
245 |             edge_contributions[k] = v/2
246 | 
247 |         # normalize
248 |         if normalized:
249 |             max_edge_contribution = max(edge_contributions.values())
250 |             for (k, v) in edge_contributions.items():
251 |                 edge_contributions[k] = v/max_edge_contribution
252 |         return edge_contributions
253 | 
254 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
255 |     """                               Plot Method                                    """
256 |     """                                                                              """    
257 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
258 |     def plot_graph(self, part_graph, removed_edges):
259 |         G = self.graph
260 |         G_part = part_graph
261 |         exist_edges = part_graph.edges(data=True)
262 |         pos = nx.spring_layout(G, k=0.1, iterations=50, scale=1.3)
263 |         
264 |         # nodes
265 |         coms = nx.connected_components(part_graph)
266 |         for com in coms:
267 |             nodes = list(com) 
268 |             np.random.seed((len(nodes)*sum(nodes)*reduce(mul, nodes, 1)*min(nodes)*max(nodes)) % (2**32))   # deterministic per community; the seed must fit in 32 bits
269 |             color = mpcolors.rgb2hex(np.random.rand(3))   # one random RGB color shared by the whole community
270 |             nx.draw_networkx_nodes(G,pos,nodelist=nodes,node_size=500, node_color=color)
271 | 
272 |         # edges
273 |         nx.draw_networkx_edges(G,pos,edgelist=exist_edges, width=2,alpha=1,edge_color='k')
274 |         nx.draw_networkx_edges(G,pos,edgelist=removed_edges, width=2, edge_color='k')   #, style='dashed')
275 | 
276 |         # labels
277 |         nx.draw_networkx_labels(G,pos,font_size=12,font_family='sans-serif')
278 | 
279 |         plt.axis('off')
280 |         plt.savefig(self.ipt_png)
281 |         # plt.show()
282 | 
283 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
284 |     """                               Help Method                                    """
285 |     """                                                                              """    
286 |     """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
287 | 
288 |     def load_txt(self, ipt_txt):
289 |         with open(ipt_txt, 'r') as input_data:      # the file is closed automatically
290 |             for line in input_data:
291 |                 fields = line.split()   # split on whitespace, dropping the newline
292 |                 if not fields:          # skip blank lines
293 |                     continue
294 | 
295 |                 if len(fields) != 2:
296 |                     self.quit("invalid edge format in " + ipt_txt)
297 | 
298 |                 ending_node_a = int(fields[0])
299 |                 ending_node_b = int(fields[1])
300 |                 self.graph.add_edge(ending_node_a, ending_node_b, weight=0.6, len=3.0)
301 | 
302 |     def display_graph(self):
303 |         G = self.graph
304 |         print("+-------------------------------------------------+")
305 |         print("|                  Display Graph                  |")
306 |         print("+-------------------------------------------------+")
307 | 
308 |         print("Number of Nodes: %d" % G.number_of_nodes())
309 |         print("Number of Edges: %d" % G.number_of_edges())
310 |         print("Nodes: \n%s" % list(G.nodes()))
311 |         print("Edges: \n%s" % list(G.edges()))
312 | 
313 |     def partition_to_community(self, partition):
314 |         result = {}
315 |         for (k, v) in partition.items():
316 |             result.setdefault(v, [])
317 |             result[v].append(k)
318 |         return result.values()
319 | 
320 |     def display(self, partition):
321 |         comms = self.partition_to_community(partition)
322 |         for comm in comms:
323 |             comm.sort() # sort node ids in ascending order
324 |             print(comm)
325 | 
326 |     def quit(self, err_desc):
327 |         raise SystemExit('\n'+ "PROGRAM EXIT: " + err_desc + ', please check your input' + '\n')
328 | 
329 | """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
330 | """                               Main Method                                    """
331 | """                                                                              """    
332 | """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
333 | if __name__ == '__main__':
334 |     if len(sys.argv) != 3:
335 |         raise SystemExit("Usage: python communities.py input.txt image.png")
336 | 
337 |     ipt_txt, ipt_png = sys.argv[1], sys.argv[2]
338 |     c = Communities(ipt_txt, ipt_png)
339 |     c.initialize()
340 |     partition, part_graph, removed_edges = c.find_best_partition()
341 |     c.display(partition)
342 |     c.plot_graph(part_graph, removed_edges)
343 | 
344 | 


--------------------------------------------------------------------------------