├── .github
│   └── FUNDING.yml
├── .gitignore
├── LICENSE
├── README.md
├── disparity.py
└── requirements.txt

/.github/FUNDING.yml:
--------------------------------------------------------------------------------
github: ceteri

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
g.json
h.json

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018-2021 derwen.ai

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# disparity_filter

Implements a **disparity filter** in Python, using graphs in
[NetworkX](https://networkx.github.io/),
based on *multiscale backbone networks*:

"Extracting the multiscale backbone of complex weighted networks"
M. Ángeles Serrano, Marián Boguña, Alessandro Vespignani
https://arxiv.org/pdf/0904.2389.pdf

> The disparity filter exploits local heterogeneity and local correlations among weights to extract the network backbone by considering the relevant edges at all the scales present in the system. The methodology preserves an edge whenever its intensity is statistically not compatible with respect to a null hypothesis of uniform randomness for at least one of the two nodes the edge is incident to, which ensures that small nodes in terms of strength are not neglected. As a result, the disparity filter reduces the number of edges in the original network significantly keeping, at the same time, almost all the weight and a large fraction of nodes. As well, this filter preserves the cut-off of the degree distribution, the form of the weight distribution, and the clustering coefficient.

This project is similar to, albeit providing different features than:

* https://github.com/aekpalakorn/python-backbone-network


## Implementation Details

If you are new to *multiscale backbone* analysis, think of this as
analogous to *centrality* calculated on the edges of a graph rather
than its nodes. In other words, consider this as a "dual" of the
problem typically faced in social networks. By managing cuts through a
process of iterating between measures of *centrality* and *disparity*
respectively, one can scale a large, noisy graph into something more
amenable for working with ontologies -- especially as a way to clean up
input for neural networks.

The code expects each *node* to have a required `label` attribute,
which is a string unique among all of the nodes in the graph. Each
*edge* is expected to have a `weight` attribute, a decimal in the
range of `[0.0, 1.0]` which represents the relative weight of that
edge's relationship.

After calculating the disparity metrics, each node gets assigned a
`strength` attribute, which is the sum of its outbound edges'
weights. Each edge gets assigned the following attributes
(illustrated in the sketch below):

* `norm_weight`: the ratio `edge[weight] / source_node[strength]`
* `alpha`: disparity *alpha* metric
* `alpha_ptile`: percentile for *alpha*, compared across the graph

One important distinction is that this implementation comes from work
in NLP and ontology, where graphs tend to become relatively "noisy"
and there are many graphs generated through automation which need to
be filtered. NLP applications have tended to reuse graph techniques
from social graph analysis, such as *connected components*,
*centrality*, and cuts based on the relative *degree* of nodes -- while
applications which combine NLP plus ontology tend to need information
based on the edges.

In particular, this implementation focuses on directed graphs, and
uses quantile analysis to adjust graph cuts.
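
To make the expected input format and the computed attributes
concrete, here is a minimal sketch on a toy graph (not part of the
original codebase; it assumes `disparity.py` is importable from the
working directory):

```python
import networkx as nx

from disparity import disparity_filter

# build a toy directed graph: each node carries a unique `label`,
# each edge carries a `weight` in the range [0.0, 1.0]
graph = nx.DiGraph()

for name in ("a", "b", "c", "d"):
    graph.add_node(name, label=name)

graph.add_edge("a", "b", weight=0.9)
graph.add_edge("a", "c", weight=0.3)
graph.add_edge("a", "d", weight=0.1)
graph.add_edge("b", "c", weight=0.6)

# annotates nodes with `strength` and edges with `norm_weight`,
# `alpha`, and `alpha_ptile`, and returns the list of alpha values
alpha_measures = disparity_filter(graph)

for src, dst, attrs in graph.edges(data=True):
    print(f"{src}->{dst}  norm_weight={attrs['norm_weight']:.3f}  "
          f"alpha={attrs['alpha']:.4f}  alpha_ptile={attrs['alpha_ptile']:.2f}")
```

For a node of degree `k` and a normalized edge weight `p`, the value
computed by `get_disparity_significance()` reduces to the paper's
closed form `alpha = (1 - p)**(k - 1)`.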
The original paper showed how to make cuts using the raw *alpha*
values, which depended on manual (human) decisions. However, that is
less than ideal for applications in machine learning, where more
automation is typically required. Use of quantiles allows for a form
of "normalization" for threshold values, so that cuts can be
performed more consistently when automated.

This implementation also integrates support for working with
*neighborhood attention sets* (NES) and other mechanisms for working
with semantics and ontologies.


## Getting Started

```
python3 -m venv venv
source venv/bin/activate

python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt
```

## Example

Run the default `main()` function:
```
python3 disparity.py
```

This will:

1. generate a random graph (using a seed) of 100 nodes, each with up to 10 outbound edges
2. calculate the significance (*alpha*) for the disparity filter
3. calculate quantiles for *alpha*
4. cut edges below the 50th percentile (median) for *alpha*
5. cut nodes with degree < 2

```
graph: 100 nodes 489 edges

    ptile   alpha
    0.00    0.0000
    0.10    0.0305
    0.20    0.0624
    0.30    0.1027
    0.40    0.1512
    0.50    0.2159
    0.60    0.3222
    0.70    0.4821
    0.80    0.7102
    0.90    0.9998

filter: percentile 0.50, min alpha 0.2159, min degree 2

graph: 89 nodes 235 edges
```

In practice, adjust those thresholds as needed before making a cut on
a graph. This mechanism provides a "dial" to adjust the scale of the
multiscale backbone of the graph. A sketch of applying these steps to
your own data appears at the end of this README.


## Contributors

Please use the `Issues` section to ask questions or report any problems.
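
## Using your own graph

As a rough sketch of that workflow on your own data, using the
functions defined in `disparity.py` (the file names `my_graph.json`
and `my_backbone.json` below are placeholders, and `disparity.py` is
assumed to be importable from the working directory):

```python
from disparity import (
    calc_alpha_ptile, cut_graph, describe_graph,
    disparity_filter, load_graph, save_graph,
)

# load a node-link JSON graph: nodes carry `label`, edges carry `weight`
graph = load_graph("my_graph.json")
describe_graph(graph)

# annotate the graph with the disparity metrics, then review the quantiles
alpha_measures = disparity_filter(graph)
quantiles, num_quant = calc_alpha_ptile(alpha_measures)

# cut more aggressively than the default: keep only edges at or above
# the 80th percentile for alpha, then drop nodes left with degree < 2
cut_graph(graph, min_alpha_ptile=0.8, min_degree=2)

describe_graph(graph)
save_graph(graph, "my_backbone.json")
```

Note that `cut_graph()` modifies the graph in place rather than
returning a filtered copy.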

--------------------------------------------------------------------------------
/disparity.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# encoding: utf-8

from networkx.readwrite import json_graph
from scipy.stats import percentileofscore
from traceback import format_exception
import cProfile
import io
import json
import networkx as nx
import numpy as np
import pandas as pd
import pstats
import random
import sys

DEBUG = False # True


######################################################################
## disparity filter for extracting the multiscale backbone of
## complex weighted networks

def get_nes (graph, label):
    """
    find the neighborhood attention set (NES) for the given label
    """
    for node_id in graph.nodes():
        node = graph.nodes[node_id]

        if node["label"].lower() == label:
            return set([node_id]).union(set([id for id in graph.neighbors(node_id)]))


def disparity_integral (x, k):
    """
    calculate the definite integral for the PDF in the disparity filter
    """
    assert x != 1.0, "x == 1.0"
    assert k != 1.0, "k == 1.0"
    return ((1.0 - x)**k) / ((k - 1.0) * (x - 1.0))


def get_disparity_significance (norm_weight, degree):
    """
    calculate the significance (alpha) for the disparity filter
    """
    return 1.0 - ((degree - 1.0) * (disparity_integral(norm_weight, degree) - disparity_integral(0.0, degree)))


def disparity_filter (graph):
    """
    implements a disparity filter, based on multiscale backbone networks
    https://arxiv.org/pdf/0904.2389.pdf
    """
    alpha_measures = []

    for node_id in graph.nodes():
        node = graph.nodes[node_id]
        degree = graph.degree(node_id)
        strength = 0.0

        for id0, id1 in graph.edges(nbunch=[node_id]):
            edge = graph[id0][id1]
            strength += edge["weight"]

        node["strength"] = strength

        for id0, id1 in graph.edges(nbunch=[node_id]):
            edge = graph[id0][id1]

            norm_weight = edge["weight"] / strength
            edge["norm_weight"] = norm_weight

            if degree > 1:
                try:
                    if norm_weight == 1.0:
                        norm_weight -= 0.0001

                    alpha = get_disparity_significance(norm_weight, degree)
                except AssertionError:
                    report_error("disparity {}".format(repr(node)), fatal=True)

                edge["alpha"] = alpha
                alpha_measures.append(alpha)
            else:
                edge["alpha"] = 0.0

    for id0, id1 in graph.edges():
        edge = graph[id0][id1]
        edge["alpha_ptile"] = percentileofscore(alpha_measures, edge["alpha"]) / 100.0

    return alpha_measures


######################################################################
## related metrics

def calc_centrality (graph, min_degree=1):
    """
    to conserve compute costs, ignore centrality for nodes below `min_degree`
    """
    sub_graph = graph.copy()
    sub_graph.remove_nodes_from([ n for n, d in list(graph.degree) if d < min_degree ])

    centrality = nx.betweenness_centrality(sub_graph, weight="weight")
    #centrality = nx.closeness_centrality(sub_graph, distance="distance")

    return centrality


def calc_quantiles (metrics, num):
    """
    calculate `num` quantiles for the given list
    """
    global DEBUG

    bins = np.linspace(0, 1, num=num, endpoint=True)
    s = pd.Series(metrics)
    q = s.quantile(bins, interpolation="nearest")

    try:
        dig = np.digitize(metrics, q) - 1
    except ValueError as e:
        print("ValueError:", str(e), metrics, s, q, bins)
        sys.exit(-1)

    quantiles = []

    for idx, q_hi in q.items():
        quantiles.append(q_hi)

        if DEBUG:
            print(idx, q_hi)

    return quantiles


def calc_alpha_ptile (alpha_measures, show=True):
    """
    calculate the quantiles used to define a threshold alpha cutoff
    """
    quantiles = calc_quantiles(alpha_measures, num=10)
    num_quant = len(quantiles)

    if show:
        print("\tptile\talpha")

        for i in range(num_quant):
            percentile = i / float(num_quant)
            print("\t{:0.2f}\t{:0.4f}".format(percentile, quantiles[i]))

    return quantiles, num_quant


def cut_graph (graph, min_alpha_ptile=0.5, min_degree=2):
    """
    apply the disparity filter to cut the given graph
    """
    filtered_set = set([])

    for id0, id1 in graph.edges():
        edge = graph[id0][id1]

        if edge["alpha_ptile"] < min_alpha_ptile:
            filtered_set.add((id0, id1))

    for id0, id1 in filtered_set:
        graph.remove_edge(id0, id1)

    filtered_set = set([])

    for node_id in graph.nodes():
        node = graph.nodes[node_id]

        if graph.degree(node_id) < min_degree:
            filtered_set.add(node_id)

    for node_id in filtered_set:
        graph.remove_node(node_id)


######################################################################
## profiling utilities

def start_profiling ():
    """start profiling"""
    pr = cProfile.Profile()
    pr.enable()

    return pr


def stop_profiling (pr):
    """stop profiling and report"""
    pr.disable()

    s = io.StringIO()
    sortby = "cumulative"
    ps = pstats.Stats(pr, stream=s).sort_stats(sortby)

    ps.print_stats()
    print(s.getvalue())


def report_error (cause_string, logger=None, fatal=False):
    """
    TODO: errors should go to logger, and not be fatal
    """
    etype, value, tb = sys.exc_info()
    error_str = "{} {}".format(cause_string, str(format_exception(etype, value, tb, 3)))

    if logger:
        logger.info(error_str)
    else:
        print(error_str)

    if fatal:
        sys.exit(-1)


######################################################################
## graph serialization

def load_graph (graph_path):
    """
    load a graph from JSON
    """
    with open(graph_path) as f:
        data = json.load(f)
        graph = json_graph.node_link_graph(data, directed=True)
        return graph


def save_graph (graph, graph_path):
    """
    save a graph as JSON
    """
    with open(graph_path, "w") as f:
        data = json_graph.node_link_data(graph)
        json.dump(data, f)


######################################################################
## testing

def random_graph (n, k, seed=0):
    """
    populate a random graph (with an optional seed) with `n` nodes and
    up to `k` edges for each node
    """
    graph = nx.DiGraph()
    random.seed(seed)

    for node_id in range(n):
        graph.add_node(node_id, label=str(node_id))

    for node_id in range(n):
        population = set(range(n)) - set([node_id])

        for neighbor in random.sample(sorted(population), random.randint(0, k)):
            weight = random.random()
            graph.add_edge(node_id, neighbor, weight=weight)

    return graph


def describe_graph (graph, min_degree=1, show_centrality=False):
    """
    describe a graph
    """
    print("\ngraph: {} nodes {} edges\n".format(len(graph.nodes()), len(graph.edges())))

    if show_centrality:
        print(calc_centrality(graph, min_degree))


def main (n=100, k=10, min_alpha_ptile=0.5, min_degree=2):
    # generate a random graph (from seed, always the same)
    graph = random_graph(n, k)

    save_graph(graph, "g.json")
    describe_graph(graph, min_degree)

    # calculate the multiscale backbone metrics
    alpha_measures = disparity_filter(graph)
    quantiles, num_quant = calc_alpha_ptile(alpha_measures)
    alpha_cutoff = quantiles[round(num_quant * min_alpha_ptile)]

    print("\nfilter: percentile {:0.2f}, min alpha {:0.4f}, min degree {}".format(
        min_alpha_ptile, alpha_cutoff, min_degree
    ))

    # apply the filter to cut the graph
    cut_graph(graph, min_alpha_ptile, min_degree)

    save_graph(graph, "h.json")
    describe_graph(graph, min_degree)


######################################################################
## main entry point

if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
networkx >= 2.6
numpy >= 1.22  # not directly required, pinned by Snyk to avoid a vulnerability
pandas >= 1.3
scipy >= 1.7

--------------------------------------------------------------------------------