├── .gitignore
├── LICENSE.txt
├── README.md
├── __init__.py
├── clusters-repeat.py
├── clusters-single.py
├── ebc.py
├── ebc2d.py
├── matrix.py
├── resources
│   ├── matrix-ebc-paper-dense-3d.tsv
│   ├── matrix-ebc-paper-dense-4d.tsv
│   ├── matrix-ebc-paper-dense.tsv
│   ├── matrix-ebc-paper-sparse-3d.tsv
│   ├── matrix-ebc-paper-sparse-4d.tsv
│   ├── matrix-ebc-paper-sparse.tsv
│   ├── matrix-itcc-3d-3clust.tsv
│   ├── matrix-itcc-paper-3clust.tsv
│   ├── matrix-itcc-paper-orig-letters.tsv
│   └── matrix-itcc-paper-orig.tsv
└── tests
    ├── sample-matrix-file.txt
    ├── test_benchmark_ebc.py
    ├── test_benchmark_ebc2d.py
    ├── test_clusters.py
    ├── test_ebc.py
    ├── test_matrix.py
    └── test_sanitycheck.py

/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | temp/
3 | __pycache__/
4 | *.pyc
5 | .DS_Store
6 | *.out
7 | *.sh
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright (c) 2015 Bethany Percha, Russ B. Altman, Yuhao Zhang
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Ensemble Biclustering for Classification
2 | ==============
3 | 
4 | A Python implementation of the Ensemble Biclustering for Classification (EBC) algorithm. Despite the "biclustering" in its name, EBC is a co-clustering algorithm that lets you co-cluster very large, sparse N-dimensional matrices. For details and examples of using EBC, please refer to [this paper](http://www.ncbi.nlm.nih.gov/pubmed/26219079).
5 | 
6 | ## Files
7 | 
8 | - `ebc.py`: an implementation of the EBC algorithm using a sparse matrix.
9 | - `ebc2d.py`: a vectorized implementation of the EBC algorithm using a (2D) dense numpy array.
10 | - `matrix.py`: an N-dimensional sparse matrix implementation, backed by a Python dict, that supports basic get/set/sum operations.
11 | - `resources/`: a directory of example datasets that EBC can be run on.
12 | 
13 | ## Usage
14 | 
15 | #### General Usage
16 | 
17 | Using EBC is easy. First you need a sparse matrix, constructed as a SparseMatrix instance as defined in `matrix.py`. You can use the `read_data()` method built into `matrix.py` to construct the sparse matrix.
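For example, a minimal sketch (the file name and axis sizes here are placeholders — substitute your own; each row of the input file is expected to hold one feature per dimension followed by a value):

    from matrix import SparseMatrix

    data = [line.strip().split("\t") for line in open("my-data.tsv")]  # [feature1, ..., feature_dim, value] per row
    sparse_matrix = SparseMatrix([14052, 7272])  # the size of each axis
    sparse_matrix.read_data(data)
    sparse_matrix.normalize()  # EBC expects the matrix elements to sum to 1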
18 | 
19 | Once you have the sparse matrix, import the `EBC` class (`from ebc import EBC`) and run:
20 | 
21 |     ebc = EBC(sparse_matrix, n_clusters=[30, 125], max_iterations=10)
22 |     cXY, objective, n_iter = ebc.run()
23 | 
24 | The returned `cXY` contains the cluster assignments along each axis as a list of lists, `objective` is the final objective function value, and `n_iter` is the number of iterations the algorithm ran before converging.
25 | 
26 | #### Efficiency Considerations
27 | 
28 | In short, `EBC` is built for highly sparse multi-dimensional matrices, while `EBC2D` is a vectorized version of `EBC` built for less sparse 2D matrices. The running time of `EBC` grows linearly with the number of non-zero elements, while the running time of `EBC2D` depends only on the overall matrix size.
29 | 
30 | - If your input matrix is highly sparse, or you need support for an N-dimensional matrix with N > 2, use the `EBC` class in `ebc.py`.
31 | - If your input matrix is not very sparse (e.g. 5% density) and can fit into memory with `numpy`, you can use the `EBC2D` class in `ebc2d.py` (see the sketch at the end of this README). Note that `EBC2D` only supports 2D matrices, since a large multi-dimensional dense matrix can hardly fit into memory.
32 | 
33 | To give a better sense of the efficiency: the `EBC` implementation runs in ~50 seconds on our 14052 x 7272 example matrix (which is >99% sparse) on a MacBook, and in this case it is faster than `EBC2D`. However, in our testing, on a 5000 x 5000 matrix of 95% sparsity, `EBC2D` achieved a 5x speedup over `EBC`, and this speedup grows as the sparsity decreases.
34 | 
35 | #### Dependencies
36 | 
37 | To run this implementation of the EBC algorithm, you need to have `numpy` installed.
38 | 
39 | ## References
40 | 
41 | - Percha, Bethany, and Russ B. Altman. "Learning the Structure of Biomedical Relationships from Unstructured Text." PLoS Computational Biology 11.7 (2015).
42 | - Dhillon, Inderjit S., Subramanyam Mallela, and Dharmendra S. Modha. "Information-Theoretic Co-clustering." Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003.
43 | 
44 | ## Questions?
45 | 
46 | This code was written by Beth Percha and Yuhao Zhang. We welcome any questions or comments, and would appreciate it if you let us know about any substantial modifications or improvements you make. You can reach us at blpercha@stanford.edu.
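#### Appendix: EBC2D Usage

A minimal `EBC2D` sketch, referenced from the Efficiency Considerations section above and mirroring `tests/test_benchmark_ebc2d.py`. Here `data` is assumed to be a list of `[row_feature, column_feature, value]` triples, the same format `SparseMatrix.read_data()` accepts in the 2D case:

    import ebc2d
    from ebc2d import EBC2D

    matrix = ebc2d.get_matrix_from_data(data)  # builds and normalizes a dense numpy array
    ebc = EBC2D(matrix, n_clusters=[30, 125], max_iterations=10)
    cXY, objective, n_iter = ebc.run()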
47 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | __all__ = ["ebc", "matrix"] -------------------------------------------------------------------------------- /clusters-repeat.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import sys 3 | 4 | from ebc import EBC 5 | from matrix import SparseMatrix 6 | 7 | 8 | def main(): 9 | data_file = sys.argv[1] 10 | ebc_cols = [int(e) for e in sys.argv[2].split(",")] 11 | K = [int(e) for e in sys.argv[3].split(",")] 12 | N_runs = int(sys.argv[4]) 13 | output_file = sys.argv[5] 14 | jitter_max = float(sys.argv[6]) 15 | max_iterations_ebc = int(sys.argv[7]) 16 | entity_cols = [int(e) for e in sys.argv[8].split(",")] 17 | object_toler = float(sys.argv[9]) 18 | 19 | # get original data 20 | raw_data = [line.split("\t") for line in open(data_file, "r")] 21 | data = [[d[i] for i in ebc_cols] for d in raw_data] 22 | data_dimensions = len(data[0]) - 1 23 | 24 | # get axis length for each dimension 25 | N = [] 26 | for dim in range(data_dimensions): 27 | N.append(len(set([d[dim] for d in data]))) 28 | print(N) 29 | 30 | # set up matrix 31 | M = SparseMatrix(N) 32 | M.read_data(data) 33 | M.normalize() 34 | 35 | # set up entity map to ids 36 | entity_map = defaultdict(tuple) 37 | for d in raw_data: 38 | entity = tuple([d[i] for i in entity_cols]) 39 | entity_ids = tuple([M.feature_ids[ebc_cols.index(i)][d[i]] for i in entity_cols]) 40 | entity_map[entity_ids] = entity 41 | 42 | # figure out which ebc columns the entity columns correspond to 43 | entity_column_indices = [] 44 | for c in ebc_cols: 45 | if c in entity_cols: 46 | entity_column_indices.append(ebc_cols.index(c)) 47 | 48 | # run EBC and get entity cluster assignments 49 | ebc_M = EBC(M, K, max_iterations_ebc, jitter_max, object_toler) 50 | clusters = defaultdict(list) 51 | for t in range(N_runs): 52 | print "run ", t 53 | cXY_M, objective_M, it_M = ebc_M.run() 54 | for e1 in entity_map.keys(): 55 | c1_i = tuple([cXY_M[i][e1[i]] for i in entity_column_indices]) 56 | clusters[e1].append(c1_i) 57 | 58 | # print assignments 59 | writer = open(output_file, "w") 60 | for k in clusters: 61 | e1_name = entity_map[k] 62 | writer.write(",".join([str(e) for e in k]) + "\t" + 63 | ",".join([e for e in e1_name]) + "\t" + "\t".join([",".join([str(f) for f in e]) 64 | for e in clusters[k]]) + "\n") 65 | writer.flush() 66 | writer.close() 67 | 68 | 69 | if __name__ == "__main__": 70 | main() 71 | -------------------------------------------------------------------------------- /clusters-single.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | from numpy import mean, std 4 | 5 | from ebc import EBC 6 | from matrix import SparseMatrix 7 | 8 | 9 | def compareRandom(num_trials, tensor_dimensions, matrix_data, cluster_dimensions, 10 | maxit_ebc, jitter_max_ebc, objective_tolerance): 11 | deltas = [] 12 | objectives_M = [] 13 | objectives_Mr = [] 14 | iterations_M = [] 15 | iterations_Mr = [] 16 | noconverge_M = 0 17 | noconverge_Mr = 0 18 | for j in range(num_trials): 19 | print "Trial ", j 20 | 21 | M = SparseMatrix(tensor_dimensions) 22 | M.read_data(matrix_data) 23 | Mr = M.shuffle() # could also be M.shuffle_old() 24 | 25 | M.normalize() 26 | 27 | ebc_M = EBC(M, cluster_dimensions, maxit_ebc, jitter_max_ebc, objective_tolerance) 28 | cXY_M, 
objective_M, it_M = ebc_M.run() 29 | if it_M == maxit_ebc: 30 | noconverge_M += 1 31 | else: 32 | iterations_M.append(it_M) 33 | 34 | Mr.normalize() 35 | 36 | ebc_Mr = EBC(Mr, cluster_dimensions, maxit_ebc, jitter_max_ebc, objective_tolerance) 37 | cXY_Mr, objective_Mr, it_Mr = ebc_Mr.run() 38 | if it_Mr == maxit_ebc: 39 | noconverge_Mr += 1 40 | else: 41 | iterations_Mr.append(it_Mr) 42 | 43 | objectives_M.append(objective_M) 44 | objectives_Mr.append(objective_Mr) 45 | deltas.append(objective_M - objective_Mr) 46 | return deltas, objectives_M, objectives_Mr, iterations_M, iterations_Mr, noconverge_M, noconverge_Mr 47 | 48 | 49 | def main(): 50 | data_file = sys.argv[1] 51 | cols = [int(e) for e in sys.argv[2].split(",")] 52 | K = [int(e) for e in sys.argv[3].split(",")] 53 | N_trials = int(sys.argv[4]) 54 | output_file = sys.argv[5] 55 | jitter_max = float(sys.argv[6]) 56 | max_iterations_ebc = int(sys.argv[7]) 57 | object_tol = float(sys.argv[8]) 58 | 59 | # get original data 60 | raw_data = [line.split("\t") for line in open(data_file, "r")] 61 | data = [[d[i] for i in cols] for d in raw_data] 62 | data_dimensions = len(data[0]) - 1 63 | 64 | # get axis length for each dimension 65 | N = [] 66 | for dim in range(data_dimensions): 67 | N.append(len(set([d[dim] for d in data]))) 68 | print(N) 69 | 70 | D_1, obj_orig, obj_rand, it_orig, it_rand, noconv_orig, noconv_rand = compareRandom(num_trials=N_trials, 71 | tensor_dimensions=N, 72 | matrix_data=data, 73 | cluster_dimensions=K, 74 | maxit_ebc=max_iterations_ebc, 75 | jitter_max_ebc=jitter_max, 76 | objective_tolerance=object_tol) 77 | 78 | # write final result to combined file (other processes also write to this file) 79 | output_stream = open(output_file, "a") 80 | output_stream.write("\t".join([str(e) for e in K]) + "\t" + str(mean(D_1)) + "\t" + str(std(D_1)) + 81 | "\t" + str(mean(obj_orig)) + "\t" + str(mean(obj_rand)) + 82 | "\t" + str(mean(it_orig)) + "\t" + str(mean(it_rand)) + 83 | "\t" + str(noconv_orig) + "\t" + str(noconv_rand) + "\n") 84 | output_stream.flush() 85 | output_stream.close() 86 | 87 | 88 | if __name__ == "__main__": 89 | main() 90 | -------------------------------------------------------------------------------- /ebc.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import random 3 | import sys 4 | import math 5 | 6 | import numpy as np 7 | from numpy.ma import divide, outer, sqrt 8 | from numpy.random.mtrand import random_sample 9 | 10 | from matrix import SparseMatrix 11 | 12 | INFINITE = 1e10 13 | 14 | 15 | class EBC: 16 | def __init__(self, matrix, n_clusters, max_iterations=10, jitter_max=1e-10, objective_tolerance=0.01): 17 | """ To initialize an EBC object. 
18 | 
19 |         Args:
20 |             matrix: an instance of SparseMatrix that represents the original distribution
21 |             n_clusters: a list giving the number of clusters along each dimension
22 |             max_iterations: the maximum number of iterations before we stop; defaults to 10
23 |             jitter_max: a small random value added to cluster-assignment scores to break ties; defaults to 1e-10
24 |             objective_tolerance: the threshold on the difference between two successive objective values below which we stop
25 | 
26 |         Return:
27 |             None
28 |         """
29 |         if not isinstance(matrix, SparseMatrix):
30 |             raise Exception("Matrix argument to EBC is not an instance of SparseMatrix.")
31 | 
32 |         # check to ensure matrix is a probability distribution
33 |         np.testing.assert_approx_equal(matrix.sum(), 1.0, significant=7,
34 |                                        err_msg='Matrix elements do not sum to 1. Please normalize your matrix.')
35 | 
36 |         self.pXY = matrix  # the joint probability distribution, e.g. p(X,Y) - the original sparse, multidimensional matrix
37 |         self.K = n_clusters  # numbers of clusters along each dimension (len(K) = dim)
38 |         self.dim = self.pXY.dim  # overall dimension of the matrix
39 |         self.max_it = max_iterations
40 | 
41 |         self.cXY = None  # list of lists: cluster assignments along each dimension, e.g. [C(X), C(Y), ...]
42 |         self.pX = None  # marginal probabilities
43 | 
44 |         self.qXhatYhat = None  # the approximate probability distribution after clustering, e.g. q(X',Y'); needs to be a SparseMatrix
45 |         self.qXhat = None  # the marginal cluster distributions in a list, e.g. [q(X'), q(Y'), ...]
46 |         self.qXxHat = None  # the distributions conditioned on the clustering, in a list, e.g. [q(X|X'), q(Y|Y'), ...]
47 | 
48 |         self.jitter_max = jitter_max  # amount to add to cluster assignment scores to break ties
49 |         self.objective_tolerance = objective_tolerance  # the threshold for us to stop
50 | 
51 |     def run(self, assigned_clusters=None, verbose=True):
52 |         """ To run the EBC algorithm.
53 | 
54 |         Args:
55 |             assigned_clusters: an optional list of lists representing the initial cluster assignments along each dimension.
56 | 
57 |         Return:
58 |             cXY: a list of lists of cluster assignments along each dimension, e.g. [C(X), C(Y), ...]
59 |             objective: the final objective value
60 |             max_it: the number of iterations that the algorithm has run
61 |         """
62 |         if verbose: print "Running EBC on a %d-d sparse matrix with size %s ..." % (self.dim, str(self.pXY.N))
63 |         # Step 1: initialization steps
64 |         self.pX = self.calculate_marginals(self.pXY)
65 |         if assigned_clusters:
66 |             if verbose: print "Using specified clusters, with cluster number on each axis: %s ..." % self.K
67 |             self.cXY = assigned_clusters
68 |         else:
69 |             if verbose: print "Randomly initializing clusters, with cluster number on each axis: %s ..."
% self.K 70 | self.cXY = self.initialize_cluster_centers(self.pXY, self.K) 71 | 72 | # Step 2: calculate cluster joint and marginal distributions 73 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cXY, self.K, self.pXY) 74 | self.qXhat = self.calculate_marginals(self.qXhatYhat) # the cluster marginals along each axis 75 | self.qXxHat = self.calculate_conditionals(self.cXY, self.pXY.N, self.pX, self.qXhat) 76 | 77 | # Step 3: iterate through dimensions, recalculating distributions 78 | last_objective = objective = INFINITE 79 | for t in xrange(self.max_it): 80 | if verbose: sys.stdout.write("--> Running iteration %d " % (t + 1)); sys.stdout.flush() 81 | for axis in xrange(self.dim): 82 | self.cXY[axis] = self.compute_clusters(self.pXY, self.qXhatYhat, self.qXhat, self.qXxHat, self.cXY, 83 | axis) 84 | self.ensure_correct_number_clusters(self.cXY[axis], self.K[axis]) # check to ensure correct K 85 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cXY, self.K, self.pXY) 86 | self.qXhat = self.calculate_marginals(self.qXhatYhat) 87 | self.qXxHat = self.calculate_conditionals(self.cXY, self.pXY.N, self.pX, self.qXhat) 88 | if verbose: sys.stdout.write("."); sys.stdout.flush() 89 | objective = self.calculate_objective() 90 | if verbose: sys.stdout.write(" objective value = %f\n" % (objective)) 91 | if abs(objective - last_objective) < self.objective_tolerance: 92 | if verbose: print "EBC finished in %d iterations, with final objective value %.4f" % (t + 1, objective) 93 | return self.cXY, objective, t + 1 94 | last_objective = objective 95 | if verbose: print "EBC finished in %d iterations, with final objective value %.4f" % (self.max_it, objective) 96 | return self.cXY, objective, self.max_it # hit max iterations - just return current assignments 97 | 98 | def compute_clusters(self, pXY, qXhatYhat, qXhat, qXxhat, cXY, axis): 99 | """ Compute the best cluster assignment along a single axis, given all the distributions and clusters on other axes. 
100 | 101 | Args: 102 | pXY: the original probability distribution matrix 103 | qXhatYhat: the joint distribution over the clusters 104 | qXhat: the marginal distributions of qXhatYhat 105 | qXxhat: the distribution conditioned on the clustering in a list 106 | cXY: current cluster assignments along each dimension 107 | axis: the axis (dimension) over which clusters are being computed 108 | 109 | Return: 110 | Best cluster assignment along a single axis as a list 111 | """ 112 | if not isinstance(pXY, SparseMatrix) or not isinstance(qXhatYhat, SparseMatrix): 113 | raise Exception("Arguments to compute_clusters not an instance of SparseMatrix.") 114 | # To assign clusters, we calculate argmin_xhat D(p(Y,Z|x) || q(Y,Z|xhat)), 115 | # where D(P|Q) = \sum_i P_i log (P_i / Q_i) 116 | dPQ = np.zeros(shape=(pXY.N[axis], qXhatYhat.N[axis])) 117 | # iterate though all non-zero elements; here we are making use of the sparsity to reduce computation 118 | for coords, p_i in pXY.nonzero_elements.iteritems(): 119 | coord_this_axis = coords[axis] 120 | px = self.pX[axis][coord_this_axis] 121 | p_i = 1 if px == 0 else p_i / px # calculate p(y|x) = p(x,y)/p(x), but we should be careful if px == 0 122 | current_cluster_assignments = [cXY[i][coords[i]] for i in 123 | xrange(self.dim)] # cluster assignments on each axis 124 | for xhat in xrange(self.K[axis]): 125 | current_cluster_assignments[axis] = xhat # temporarily assign dth dimension to this xhat 126 | current_qXhatYhat = qXhatYhat.get(tuple(current_cluster_assignments)) 127 | current_qXhat = qXhat[axis][xhat] 128 | q_i = 1.0 129 | if current_qXhatYhat == 0 and current_qXhat == 0: 130 | q_i = 0 # Here we define 0/0=0 131 | else: 132 | q_i *= current_qXhatYhat / current_qXhat 133 | for i in xrange(self.dim): 134 | if i == axis: continue 135 | q_i *= qXxhat[i][coords[i]] 136 | if q_i == 0: # this can definitely happen if cluster joint distribution has zero element 137 | dPQ[coord_this_axis, xhat] = INFINITE 138 | else: 139 | dPQ[coord_this_axis, xhat] += p_i * math.log(p_i / q_i) 140 | 141 | # add random jitter to break ties 142 | dPQ += self.jitter_max * random_sample(dPQ.shape) 143 | return list(dPQ.argmin(1)) # return the closest cluster assignment under KL-divergence 144 | 145 | def calculate_marginals(self, pXY): 146 | """ Calculate the marginal probabilities given a joint distribution. 147 | 148 | Args: 149 | pXY: sparse matrix over which marginals are calculated, e.g. P(X,Y) 150 | 151 | Return: 152 | marginals: the calculated marginal probabilities, e.g. P(X) 153 | """ 154 | if not isinstance(pXY, SparseMatrix): 155 | raise Exception("Illegal argument to marginal calculation: " + str(pXY)) 156 | marginals = [[0] * Ni for Ni in pXY.N] 157 | for d in pXY.nonzero_elements: 158 | for i in xrange(len(d)): 159 | marginals[i][d[i]] += pXY.nonzero_elements[d] 160 | return marginals 161 | 162 | def calculate_joint_cluster_distribution(self, cXY, K, pXY): 163 | """ Calculate the joint cluster distribution q(X',Y') using the current prob distribution and 164 | cluster assignments. 
(Here we use X' to denote X_hat.)
165 | 
166 |         Args:
167 |             cXY: current cluster assignments for each axis
168 |             K: numbers of clusters along each axis
169 |             pXY: original probability distribution matrix
170 | 
171 |         Return:
172 |             qXhatYhat: the joint cluster distribution
173 |         """
174 |         if not isinstance(pXY, SparseMatrix):
175 |             raise Exception("Matrix argument to calculate_joint_cluster_distribution not an instance of SparseMatrix.")
176 |         qXhatYhat = SparseMatrix(K)  # joint distribution over clusters
177 |         for coords in pXY.nonzero_elements:
178 |             # find the coordinates of the cluster for this element
179 |             cluster_coords = []
180 |             for i in xrange(len(coords)):
181 |                 cluster_coords.append(cXY[i][coords[i]])
182 |             qXhatYhat.add_value(tuple(cluster_coords), pXY.nonzero_elements[coords])
183 |         return qXhatYhat
184 | 
185 |     def calculate_conditionals(self, cXY, N, pX, qXhat):
186 |         """ Calculate the conditional marginal distributions given the clustering, i.e. q(X|X').
187 | 
188 |         Args:
189 |             cXY: current cluster assignments
190 |             N: lengths of each dimension in the original data matrix
191 |             pX: marginal distributions over the original data matrix
192 |             qXhat: marginal distributions over the cluster joint distribution
193 | 
194 |         Return:
195 |             conditional_distributions: one distribution per axis, each a list holding the probability for the i-th row/column of that axis.
196 |         """
197 |         conditional_distributions = [[0] * Ni for Ni in N]
198 |         for i in xrange(len(cXY)):
199 |             cluster_assignments_this_dimension = cXY[i]
200 |             for j in xrange(len(cluster_assignments_this_dimension)):
201 |                 cluster = cluster_assignments_this_dimension[j]
202 |                 if pX[i][j] == 0 and qXhat[i][cluster] == 0:
203 |                     conditional_distributions[i][j] = 0
204 |                 else:
205 |                     conditional_distributions[i][j] = pX[i][j] / qXhat[i][cluster]
206 |         return conditional_distributions
207 | 
208 |     def initialize_cluster_centers(self, pXY, K):
209 |         """ Initialize the cluster assignments along each axis by first selecting K centers,
210 |             and then mapping each row to its closest center under cosine similarity.
211 | 
212 |         Args:
213 |             pXY: original data matrix
214 |             K: numbers of clusters desired in each dimension
215 | 
216 |         Return:
217 |             new_C: a list of lists giving, for each axis, the cluster id assigned to each index on that axis.
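        Note (describing the computation below): each row x is scored against each
        center x_hat by the cosine similarity
            score(x, x_hat) = <p(x, .), q(x_hat, .)> / (||p(x, .)|| * ||q(x_hat, .)||),
        accumulated over nonzero elements only; a small random jitter breaks ties.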
218 | """ 219 | if not isinstance(pXY, SparseMatrix): 220 | raise Exception("Matrix argument to initialize_cluster_centers is not an instance of SparseMatrix.") 221 | new_C = [[-1] * Ni for Ni in pXY.N] 222 | 223 | for axis in xrange(len(K)): # loop over each dimension 224 | # choose cluster centers 225 | axis_length = pXY.N[axis] 226 | center_indices = random.sample(xrange(axis_length), K[axis]) 227 | cluster_ids = {} 228 | for i in xrange(K[axis]): # assign identifiers to clusters 229 | center_index = center_indices[i] 230 | cluster_ids[center_index] = i 231 | centers = defaultdict(lambda: defaultdict(float)) # all nonzero indices for each center 232 | for coords in pXY.nonzero_elements: 233 | coord_this_axis = coords[axis] 234 | if coord_this_axis in cluster_ids: # is a center 235 | reduced_coords = tuple( 236 | [coords[i] for i in xrange(len(coords)) if i != axis]) # coords without the current axis 237 | centers[cluster_ids[coord_this_axis]][reduced_coords] = pXY.nonzero_elements[ 238 | coords] # (cluster_id, other coords) -> value 239 | 240 | # assign rows to clusters 241 | scores = np.zeros(shape=(pXY.N[axis], K[axis])) # scores: axis_size x cluster_number 242 | denoms_P = np.zeros(shape=(pXY.N[axis])) 243 | denoms_Q = np.zeros(shape=(K[axis])) 244 | for coords in pXY.nonzero_elements: 245 | coord_this_axis = coords[axis] 246 | if coord_this_axis in center_indices: 247 | continue # don't reassign cluster centers, please 248 | reduced_coords = tuple([coords[i] for i in xrange(len(coords)) if i != axis]) 249 | for cluster_index in cluster_ids: 250 | xhat = cluster_ids[cluster_index] # need cluster ID, not the axis index 251 | if reduced_coords in centers[xhat]: # overlapping point 252 | P_i = pXY.nonzero_elements[coords] 253 | Q_i = centers[xhat][reduced_coords] 254 | scores[coords[axis]][xhat] += P_i * Q_i # now doing based on cosine similarity 255 | denoms_P[coords[axis]] += P_i * P_i # magnitude of this slice of original matrix 256 | denoms_Q[xhat] += Q_i * Q_i # magnitude of cluster centers 257 | 258 | # normalize scores 259 | scores = divide(scores, outer(sqrt(denoms_P), sqrt(denoms_Q))) 260 | scores[scores == 0] = -1.0 261 | 262 | # add random jitter to scores to handle tie-breaking 263 | scores += self.jitter_max * random_sample(scores.shape) 264 | new_cXYi = list(scores.argmax(1)) # this needs to be argmax because cosine similarity 265 | 266 | # make sure to assign the cluster centers to themselves 267 | for center_index in cluster_ids: 268 | new_cXYi[center_index] = cluster_ids[center_index] 269 | 270 | # ensure numbers of clusters are correct 271 | self.ensure_correct_number_clusters(new_cXYi, K[axis]) 272 | new_C[axis] = new_cXYi 273 | return new_C 274 | 275 | def ensure_correct_number_clusters(self, cXYi, expected_K): 276 | """ To ensure a cluster assignment actually has the expected total number of clusters. 277 | 278 | Args: 279 | cXYi: the input cluster assignment 280 | expected_K: expected number of clusters on this axis 281 | 282 | Return: 283 | None. The assignment will be changed in place in cXYi. 
284 | """ 285 | clusters_represented = set() 286 | for c in cXYi: 287 | clusters_represented.add(c) 288 | if len(clusters_represented) == expected_K: 289 | return 290 | for c in xrange(expected_K): 291 | if c not in clusters_represented: 292 | index_to_change = random.randint(0, len(cXYi) - 1) 293 | cXYi[index_to_change] = c 294 | self.ensure_correct_number_clusters(cXYi, expected_K) 295 | 296 | def calculate_objective(self): 297 | """ Calculate the value of the objective function given the current cluster assignments. 298 | 299 | Return: 300 | objective: the objective function value 301 | """ 302 | objective = 0.0 303 | for d in self.pXY.nonzero_elements: 304 | pXY_i = self.pXY.nonzero_elements[d] 305 | qXY_i = self.get_element_approximate_distribution(d) 306 | if qXY_i == 0: 307 | print(pXY_i) 308 | objective += pXY_i * math.log(pXY_i / qXY_i) 309 | return objective 310 | 311 | def get_element_approximate_distribution(self, coords): 312 | """ Get the distribution approximated by q(X,Y). """ 313 | clusters = [self.cXY[i][coords[i]] for i in xrange(len(coords))] 314 | element = self.qXhatYhat.get(tuple(clusters)) 315 | for i in xrange(len(coords)): 316 | element *= self.qXxHat[i][coords[i]] 317 | return element 318 | 319 | 320 | def main(): 321 | """ An example run of EBC. """ 322 | with open("resources/matrix-ebc-paper-sparse.tsv", "r") as f: 323 | data = [] 324 | for line in f: 325 | sl = line.split("\t") 326 | if len(sl) < 5: # headers 327 | continue 328 | data.append([sl[0], sl[2], float(sl[4])]) 329 | 330 | matrix = SparseMatrix([14052, 7272]) 331 | matrix.read_data(data) 332 | matrix.normalize() 333 | ebc = EBC(matrix, [30, 125], 10, 1e-10, 0.01) 334 | cXY, objective, it = ebc.run() 335 | 336 | 337 | if __name__ == "__main__": 338 | main() 339 | -------------------------------------------------------------------------------- /ebc2d.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import random 3 | import sys 4 | 5 | import numpy as np 6 | 7 | 8 | class EBC2D: 9 | def __init__(self, matrix, n_clusters, max_iterations=10, jitter_max=1e-10, objective_tolerance=0.01): 10 | """ To initialize an EBC object. 11 | 12 | Args: 13 | matrix: a instance of numpy ndarray that represents the original distribution 14 | n_clusters: a list of number of clusters along each dimension 15 | max_iterations: the maximum number of iterations before we stop, default to be 10 16 | jitter_max: a small amount of value to add to probabilities to break ties when doing cluster assignments, default to be 1e-10 17 | objective_tolerance: the threshold of difference between two successive objective values for us to stop 18 | 19 | Return: 20 | None 21 | """ 22 | if not isinstance(matrix, np.ndarray): 23 | raise Exception("Matrix argument to EBC2D needs to be a numpy ndarray.") 24 | 25 | np.testing.assert_approx_equal(matrix.sum(), 1.0, significant=3, err_msg= \ 26 | 'Matrix elements does not sum to 1. 
Please normalize your matrix.') 27 | 28 | self.pXY = matrix # p(X,Y) 29 | self.N = self.pXY.shape 30 | self.cX = np.zeros(self.N[0], dtype=np.int) # cluster assignment along X axis, C(X) 31 | self.cY = np.zeros(self.N[1], dtype=np.int) # C(Y) 32 | self.K = n_clusters # number of clusters along x and y axis 33 | self.max_iters = max_iterations 34 | 35 | self.pX = np.empty(self.N[0]) # marginal probabilities, p(X) 36 | self.pY = np.empty(self.N[1]) # marginal probabilities, p(Y) 37 | 38 | self.qXhatYhat = np.zeros(tuple(self.K), 39 | dtype=np.float32) # the approx probability distribution after clustering 40 | self.qXhat = np.zeros(self.K[0]) # q(X') 41 | self.qYhat = np.zeros(self.K[1]) # q(Y') 42 | self.qX_xhat = np.zeros(self.N[0]) # q(X|X') 43 | self.qY_yhat = np.zeros(self.N[1]) # q(Y|Y') 44 | 45 | self.jitter_max = jitter_max 46 | self.objective_tolerance = objective_tolerance 47 | 48 | def run(self, assigned_clusters=None, verbose=True): 49 | """ To run the ebc algorithm. 50 | 51 | Args: 52 | assigned_clusters: an optional list of list representing the initial assignment of clusters along each dimension. 53 | 54 | Return: 55 | cXY: a list of lists of cluster assignments along each dimension e.g. [C(X), C(Y), ...] 56 | objective: the final objective value 57 | max_it: the number of iterations that the algorithm has run 58 | """ 59 | if verbose: print "Running EBC2D on a 2-d matrix with size %s ..." % str(self.pXY.shape) 60 | # Step 1: initialization steps 61 | self.pX, self.pY = self.calculate_marginals(self.pXY) 62 | if assigned_clusters and len(assigned_clusters) == 2: 63 | if verbose: print "Using specified clusters, with cluster number on each axis: %s ..." % self.K 64 | self.cX = np.asarray(assigned_clusters[0]) 65 | self.cY = np.asarray(assigned_clusters[1]) 66 | else: 67 | if verbose: print "Randomly initializing clusters, with cluster number on each axis: %s ..." 
% self.K 68 | self.cX, self.cY = self.initialize_cluster_centers(self.pXY, self.K) 69 | 70 | # Step 2: calculate cluster joint and marginal distributions 71 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cX, self.cY, self.K, self.pXY) 72 | self.qXhat, self.qYhat = self.calculate_marginals(self.qXhatYhat) 73 | self.qX_xhat, self.qY_yhat = self.calculate_conditionals(self.cX, self.cY, self.pX, self.pY, self.qXhat, 74 | self.qYhat) 75 | 76 | # Step 3: iterate, recalculating distributions 77 | last_objective = objective = 1e10 78 | for it in range(self.max_iters): 79 | if verbose: sys.stdout.write("--> Running iteration %d " % (it + 1)); sys.stdout.flush() 80 | # compute row/column clusters 81 | for axis in range(2): 82 | if axis == 0: 83 | self.cX = self.compute_row_clusters(self.pXY, self.qXhatYhat, self.qXhat, self.qY_yhat, self.cY) 84 | else: 85 | self.cY = self.compute_col_clusters(self.pXY, self.qXhatYhat, self.qYhat, self.qX_xhat, self.cX) 86 | if verbose: sys.stdout.write("."); sys.stdout.flush() 87 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cX, self.cY, self.K, self.pXY) 88 | self.qXhat, self.qYhat = self.calculate_marginals(self.qXhatYhat) 89 | self.qX_xhat, self.qY_yhat = self.calculate_conditionals(self.cX, self.cY, self.pX, self.pY, self.qXhat, 90 | self.qYhat) 91 | 92 | objective = self.calculate_objective() 93 | if verbose: sys.stdout.write(" objective value = %f\n" % (objective)) 94 | if abs(objective - last_objective) < self.objective_tolerance: 95 | if verbose: print "EBC2D finished in %d iterations, with final objective value %.4f" % ( 96 | it + 1, objective) 97 | return [self.cX.tolist(), self.cY.tolist()], objective, it + 1 98 | last_objective = objective 99 | if verbose: print "EBC2D finished in %d iterations, with final objective value %.4f" % ( 100 | self.max_iters, objective) 101 | return [self.cX.tolist(), self.cY.tolist()], objective, self.max_iters 102 | 103 | def compute_row_clusters(self, pXY, qXhatYhat, qXhat, qY_yhat, cY): 104 | """ Compute the best row cluster assignment, given all the distributions and clusters on y axis. 
105 | 106 | Args: 107 | pXY: the original probability distribution matrix 108 | qXhatYhat: the joint distribution over the clusters 109 | qXhat: the marginal distributions of qXhatYhat 110 | qY_yhat: the distribution conditioned on the y axis clustering in a list 111 | cY: current cluster assignments along y axis 112 | 113 | Return: 114 | Best row cluster assignment as a list 115 | """ 116 | nrow, nc_row = pXY.shape[0], qXhat.shape[0] 117 | dPQ = np.empty((nrow, nc_row)) 118 | # Step 1: generate q(y|x'): |x'| x |y| 119 | # - first expand q(x'y') to |x'| x |y| using clustering information cY 120 | expanded_qXhatYhat = np.nan_to_num((qXhatYhat.T / qXhat).T[:, cY]) 121 | qY_xhat = expanded_qXhatYhat * qY_yhat 122 | # Step 2: loop through all clusters 123 | for i in range(nc_row): 124 | for j in range(nrow): 125 | pXY_row = pXY[j, :] 126 | with np.errstate(divide='ignore', invalid='ignore'): 127 | log_matrix = np.log(pXY_row / qY_xhat[i, :]) 128 | log_matrix[log_matrix == -np.inf] = 0 129 | log_matrix = np.nan_to_num(log_matrix) 130 | dPQ[j, i] = pXY_row.dot(log_matrix.T) 131 | 132 | dPQ += self.jitter_max * np.random.mtrand.random_sample(dPQ.shape) 133 | C = dPQ.argmin(1) 134 | self.ensure_correct_number_clusters(C, nc_row) 135 | return C 136 | 137 | def compute_col_clusters(self, pXY, qXhatYhat, qYhat, qX_xhat, cX): 138 | """ Compute the best column cluster assignment, given all the distributions and clusters on x axis. """ 139 | ncol, nc_col = pXY.shape[1], qYhat.shape[0] 140 | dPQ = np.empty((nc_col, ncol)) 141 | expanded_qXhatYhat = np.nan_to_num((qXhatYhat / qYhat)[cX, :]) 142 | qX_yhat = expanded_qXhatYhat.T * qX_xhat 143 | for i in range(nc_col): 144 | for j in range(ncol): 145 | pXY_col = pXY[:, j].T 146 | with np.errstate(divide='ignore', invalid='ignore'): 147 | log_matrix = np.log(pXY_col / qX_yhat[i, :]) 148 | log_matrix[log_matrix == -np.inf] = 0 149 | log_matrix = np.nan_to_num(log_matrix) 150 | dPQ[i, j] = pXY_col.dot(log_matrix.T) 151 | dPQ += self.jitter_max * np.random.mtrand.random_sample(dPQ.shape) 152 | C = dPQ.argmin(0) 153 | self.ensure_correct_number_clusters(C, nc_col) 154 | return C 155 | 156 | def calculate_marginals(self, pXY): 157 | """ Calculate the marginal probabilities given a joint distribution. 158 | 159 | Args: 160 | pXY: distribution over which marginals are calculated. 161 | """ 162 | pX = pXY.sum(1) # sum along the y dimension, note that the dimension index should be reverse 163 | pY = pXY.sum(0) # sum along the x dimension 164 | return np.squeeze(np.asarray(pX)), np.squeeze(np.asarray(pY)) # return a numpy array 165 | 166 | def calculate_conditionals(self, cX, cY, pX, pY, qXhat, qYhat): 167 | """ Calculate the conditional marginal distributions given the clustering distribution, i.e. q(X|X'). 168 | 169 | Args: 170 | cX, cY: current cluster assignments 171 | pX, pY: marginal distributions over original data matrix 172 | qXhat, qYhat: marginal distributions over cluster joint distribution 173 | 174 | Return: 175 | qX_xhat, qY_yhat: conditional marginal distributions for each axis. 
176 | """ 177 | with np.errstate(divide='ignore', invalid='ignore'): 178 | qX_xhat = pX / qXhat[cX] # qX_xhat is a N[0]-size vector here 179 | qY_yhat = pY / qYhat[cY] # note that it could be problematic if cX is not a int vector 180 | qX_xhat[qX_xhat == np.inf] = 0 181 | qY_yhat[qY_yhat == np.inf] = 0 182 | return qX_xhat, qY_yhat # want a Nx1 array-like matrix 183 | 184 | def calculate_joint_cluster_distribution(self, cX, cY, K, pXY): 185 | """ Calculate the joint cluster distribution q(X',Y') = p(X',Y') using the current prob distribution and 186 | cluster assignments. (Here we use X' to denote X_hat) 187 | 188 | Args: 189 | cX, cY: current cluster assignments for each axis 190 | K: numbers of clusters along each axis 191 | pXY: original probability distribution matrix 192 | 193 | Return: 194 | qXhatYhat: the joint cluster distribution 195 | """ 196 | nc_row, nc_col = K # num of clusters along row and col 197 | qXhatYhat = np.zeros(K) 198 | # itm_matrix = csr_matrix((nc_row, pXY.shape[1])) # nc_row * col sparse intermidiate matrix 199 | itm_matrix = np.empty((nc_row, pXY.shape[1])) 200 | for i in range(nc_row): 201 | itm_matrix[i, :] = pXY[np.where(cX == i)[0], :].sum(0) 202 | for i in range(nc_col): 203 | qXhatYhat[:, i] = itm_matrix[:, np.where(cY == i)[0]].sum(1).flatten() 204 | return qXhatYhat 205 | 206 | def initialize_cluster_centers(self, pXY, K): 207 | """ Initializes the cluster assignments along each axis, by first selecting k centers, 208 | and then map each row to its closet center under cosine similarity. 209 | 210 | Args: 211 | pXY: original data matrix 212 | K: numbers of clusters desired in each dimension 213 | 214 | Return: 215 | cX, cY: a list of cluster id that the current index in the current axis is assigned to. 216 | """ 217 | # For x axis 218 | centers = pXY[random.sample(range(pXY.shape[0]), K[0]), :] # randomly select clustering centers 219 | cX = self.assign_clusters(pXY, centers, axis=0) 220 | self.ensure_correct_number_clusters(cX, K[0]) 221 | # For y axis 222 | centers = pXY[:, random.sample(range(pXY.shape[1]), K[1])] # randomly select clustering centers 223 | cY = self.assign_clusters(pXY, centers, axis=1) 224 | self.ensure_correct_number_clusters(cY, K[1]) 225 | return cX, cY # return a numpy array 226 | 227 | def assign_clusters(self, pXY, centers, axis): 228 | """ Assign each row/col to clusters given cluster centers on this axis. """ 229 | cluster_num = centers.shape[axis] 230 | scores = np.zeros(shape=(pXY.shape[axis], cluster_num)) 231 | for i in range(cluster_num): 232 | if axis == 0: 233 | center = centers[i, :] 234 | score_i = pXY * center 235 | else: 236 | center = centers[:, i] 237 | score_i = pXY.T * center 238 | score_i = score_i.sum(1) # calculate u.v 239 | center_length = np.linalg.norm(center) # get |v| 240 | score_i = score_i / center_length # get u.v/|v| 241 | scores[:, i] = score_i.flatten() 242 | scores += self.jitter_max * np.random.mtrand.random_sample(scores.shape) 243 | C = np.argmax(scores, 1) 244 | return C 245 | 246 | def calculate_objective(self): 247 | """ Calculate the KL-divergence between p(X,Y) and q(X,Y). 248 | Here q(x,y) can be written as p(x',y')*p(x|x')*p(y|y'). 
""" 249 | # Here I cannot vectorize the computation 250 | objective = .0 251 | x_indices, y_indices = np.nonzero(self.pXY) 252 | # compute values for all useful elements in qXY 253 | for i in range(len(x_indices)): 254 | x_idx, y_idx = x_indices[i], y_indices[i] 255 | v = self.pXY[x_idx, y_idx] 256 | c_x, c_y = self.cX[x_idx], self.cY[y_idx] 257 | v_qXY = self.qX_xhat[x_idx] * self.qY_yhat[y_idx] * self.qXhatYhat[c_x, c_y] 258 | objective += v * np.log(v / v_qXY) 259 | return objective 260 | 261 | def ensure_correct_number_clusters(self, C, expected_K): 262 | """ To ensure a cluster assignment actually has the expected total number of clusters. 263 | 264 | Args: 265 | cXYi: the input cluster assignment 266 | expected_K: expected number of clusters on this axis 267 | 268 | Return: 269 | None. The assignment will be changed in place. 270 | """ 271 | clusters_unique = np.unique(C) 272 | num_clusters = clusters_unique.shape[0] 273 | if num_clusters == expected_K: 274 | return 275 | for c in range(expected_K): 276 | if num_clusters < c + 1 or clusters_unique[c] != c: # no element assigned to c 277 | idx = random.randint(0, C.shape[0] - 1) 278 | C[idx] = c 279 | self.ensure_correct_number_clusters(C, expected_K) 280 | 281 | 282 | def get_matrix_from_data(data): 283 | """ Read the data from a list and construct a scipy sparse dok_matrix. If 'data' is not a list, simply return. 284 | 285 | Each element of the data list should be a list, and should have the following form: 286 | [feature1, feature2, ..., feature dim, value] 287 | """ 288 | feature_ids = defaultdict(lambda: defaultdict(int)) 289 | for d in data: 290 | location = [] 291 | for i in range(len(d) - 1): 292 | f_i = d[i] 293 | if f_i not in feature_ids[i]: 294 | feature_ids[i][f_i] = len(feature_ids[i]) # new index is size of dict 295 | location.append(feature_ids[i][f_i]) 296 | nrow = len(feature_ids[0]) 297 | ncol = len(feature_ids[1]) 298 | m = np.zeros((nrow, ncol), dtype=np.float32) 299 | for d in data: 300 | r = feature_ids[0][d[0]] 301 | c = feature_ids[1][d[1]] 302 | value = float(d[2]) 303 | if value != 0.0: 304 | m[r, c] = value 305 | # normalize the matrix 306 | m = m / m.sum() 307 | return m 308 | -------------------------------------------------------------------------------- /matrix.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | from operator import itemgetter 3 | from random import shuffle 4 | 5 | 6 | class SparseMatrix: 7 | """ An implementation of sparse matrix that is used by the ITCC and EBC algorithm. """ 8 | 9 | def __init__(self, N): 10 | """ Initialize the sparse matrix. 11 | 12 | Args: 13 | N: the size of the matrix on each axis in a list-like data structure 14 | """ 15 | self.dim = len(N) # dimensionality of matrix 16 | self.nonzero_elements = {} 17 | self.N = N 18 | # feature_ids should be a map from feature name to the corresponding index. 19 | # For example, in a 2D matrix, each feature corresponds to a specific row or column. 20 | self.feature_ids = defaultdict(lambda: defaultdict(int)) 21 | 22 | def read_data(self, data): 23 | """ Read the data from a list and populate the matrix. If 'data' is not a list, simply return. 
24 | 25 | Args: 26 | data: each element of the data list should be a list, and should have the following form: 27 | [feature1, feature2, ..., feature dim, value] 28 | """ 29 | if not isinstance(data, list): # we expect a list of data points 30 | return 31 | for d in data: 32 | location = [] 33 | for i in range(len(d) - 1): 34 | f_i = d[i] 35 | if f_i not in self.feature_ids[i]: 36 | self.feature_ids[i][f_i] = len(self.feature_ids[i]) # new index is size of dict 37 | location.append(self.feature_ids[i][f_i]) 38 | value = float(d[len(d) - 1]) 39 | if value != 0.0: 40 | self.nonzero_elements[tuple(location)] = value 41 | 42 | def get(self, coordinates): 43 | """ Get an element of the sparse matrix. 44 | 45 | Args: 46 | coordinates: indices of the element as a tuple 47 | 48 | Return: 49 | the element value 50 | """ 51 | if coordinates in self.nonzero_elements: 52 | return self.nonzero_elements[coordinates] 53 | return 0.0 54 | 55 | def set(self, coordinates, value): 56 | """ Set the value for an element in the sparse matrix. 57 | 58 | Args: 59 | coordinates: indices of the element as a tuple 60 | value: the element value 61 | """ 62 | self.nonzero_elements[coordinates] = value 63 | 64 | def add_value(self, coordinates, added_value): 65 | """ Add a specific value to an element in the sparse matrix. 66 | 67 | Args: 68 | coordinates: indices of the element as a tuple 69 | added_value: the value to add 70 | """ 71 | if coordinates in self.nonzero_elements: 72 | self.nonzero_elements[coordinates] += added_value 73 | else: 74 | self.nonzero_elements[coordinates] = added_value 75 | 76 | def sum(self): 77 | """ Get the sum of all sparse matrix elements. 78 | 79 | Return: 80 | the sum value 81 | """ 82 | sum_values = 0.0 83 | for v in self.nonzero_elements.values(): 84 | sum_values += v 85 | return sum_values 86 | 87 | def normalize(self): 88 | """ Normalize the sparse matrix such that the elements in the matrix sum up to 1. """ 89 | sum_values = self.sum() 90 | for d in self.nonzero_elements: 91 | self.nonzero_elements[d] /= sum_values 92 | 93 | def shuffle(self): 94 | """ Randomly shuffle the nonzero elements in the original matrix, and return a new matrix with the elements shuffled. 95 | 96 | Return: 97 | a new sparse matrix with all the elements shuffled 98 | """ 99 | self_shuffled = SparseMatrix(self.N) 100 | indices = [] 101 | # Get all the indices of nonzero elements. 
indices is a list of 'dim' lists, each being a list of indices for a specific dimension 102 | for i in range(self.dim): 103 | indices.append([e[i] for e in self.nonzero_elements]) 104 | for i in range(self.dim): 105 | shuffle(indices[i]) 106 | values = [self.nonzero_elements[e] for e in self.nonzero_elements] 107 | shuffle(values) 108 | for j in range(len(self.nonzero_elements)): 109 | self_shuffled.add_value(tuple([indices[i][j] for i in range(self.dim)]), values[j]) 110 | return self_shuffled 111 | 112 | def __str__(self): 113 | value_list = sorted(self.nonzero_elements.items(), key=itemgetter(0), reverse=False) 114 | return "\n".join(["\t".join([str(e) for e in v[0]]) + "\t" + str(v[1]) for v in value_list]) 115 | -------------------------------------------------------------------------------- /resources/matrix-itcc-3d-3clust.tsv: -------------------------------------------------------------------------------- 1 | 0 0 0 1.0 2 | 0 0 1 1.0 3 | 0 1 0 1.0 4 | 0 1 1 1.0 5 | 1 0 0 1.0 6 | 1 0 1 1.0 7 | 1 1 0 1.0 8 | 1 1 1 1.0 9 | 2 2 2 1.0 10 | 2 2 3 1.0 11 | 2 3 2 1.0 12 | 3 2 2 1.0 13 | 2 3 3 1.0 14 | 3 3 2 1.0 15 | 3 2 3 1.0 16 | 3 3 3 1.0 17 | 4 4 4 1.0 18 | 4 4 5 1.0 19 | 4 5 4 1.0 20 | 4 5 5 1.0 21 | 5 4 4 1.0 22 | 5 4 5 1.0 23 | 5 5 4 1.0 24 | 5 5 5 1.0 -------------------------------------------------------------------------------- /resources/matrix-itcc-paper-3clust.tsv: -------------------------------------------------------------------------------- 1 | 0 0 0.083 2 | 0 1 0.083 3 | 0 2 0.00 4 | 0 3 0.00 5 | 0 4 0.00 6 | 0 5 0.00 7 | 1 0 0.083 8 | 1 1 0.083 9 | 1 2 0.00 10 | 1 3 0.00 11 | 1 4 0.00 12 | 1 5 0.00 13 | 2 0 0.00 14 | 2 1 0.00 15 | 2 2 0.083 16 | 2 3 0.083 17 | 2 4 0.00 18 | 2 5 0.00 19 | 3 0 0.00 20 | 3 1 0.00 21 | 3 2 0.083 22 | 3 3 0.083 23 | 3 4 0.00 24 | 3 5 0.00 25 | 4 0 0.00 26 | 4 1 0.00 27 | 4 2 0.00 28 | 4 3 0.00 29 | 4 4 0.083 30 | 4 5 0.083 31 | 5 0 0.00 32 | 5 1 0.00 33 | 5 2 0.00 34 | 5 3 0.00 35 | 5 4 0.083 36 | 5 5 0.083 -------------------------------------------------------------------------------- /resources/matrix-itcc-paper-orig-letters.tsv: -------------------------------------------------------------------------------- 1 | A 0 0.05 2 | A 1 0.05 3 | A 2 0.05 4 | A 3 0.00 5 | A 4 0.00 6 | A 5 0.00 7 | B 0 0.05 8 | B 1 0.05 9 | B 2 0.05 10 | B 3 0.00 11 | B 4 0.00 12 | B 5 0.00 13 | C 0 0.00 14 | C 1 0.00 15 | C 2 0.00 16 | C 3 0.05 17 | C 4 0.05 18 | C 5 0.05 19 | D 0 0.00 20 | D 1 0.00 21 | D 2 0.00 22 | D 3 0.05 23 | D 4 0.05 24 | D 5 0.05 25 | E 0 0.04 26 | E 1 0.04 27 | E 2 0.00 28 | E 3 0.04 29 | E 4 0.04 30 | E 5 0.04 31 | F 0 0.04 32 | F 1 0.04 33 | F 2 0.04 34 | F 3 0.00 35 | F 4 0.04 36 | F 5 0.04 -------------------------------------------------------------------------------- /resources/matrix-itcc-paper-orig.tsv: -------------------------------------------------------------------------------- 1 | 0 0 0.05 2 | 0 1 0.05 3 | 0 2 0.05 4 | 0 3 0.00 5 | 0 4 0.00 6 | 0 5 0.00 7 | 1 0 0.05 8 | 1 1 0.05 9 | 1 2 0.05 10 | 1 3 0.00 11 | 1 4 0.00 12 | 1 5 0.00 13 | 2 0 0.00 14 | 2 1 0.00 15 | 2 2 0.00 16 | 2 3 0.05 17 | 2 4 0.05 18 | 2 5 0.05 19 | 3 0 0.00 20 | 3 1 0.00 21 | 3 2 0.00 22 | 3 3 0.05 23 | 3 4 0.05 24 | 3 5 0.05 25 | 4 0 0.04 26 | 4 1 0.04 27 | 4 2 0.00 28 | 4 3 0.04 29 | 4 4 0.04 30 | 4 5 0.04 31 | 5 0 0.04 32 | 5 1 0.04 33 | 5 2 0.04 34 | 5 3 0.00 35 | 5 4 0.04 36 | 5 5 0.04 -------------------------------------------------------------------------------- /tests/sample-matrix-file.txt: 
-------------------------------------------------------------------------------- 1 | patient Val30Met START_ENTITY|nmod|FAP 2 2 | patient Val30Met START_ENTITY|nmod|END_ENTITY 2 3 | patient Val30Met FAP|compound|END_ENTITY 2 4 | mice R92Q mice|nummod|END_ENTITY 3 5 | mice R92Q mutation|appos|END_ENTITY 2 6 | mice R92Q START_ENTITY|nummod|END_ENTITY 7 7 | mice R91W START_ENTITY|nummod|END_ENTITY 2 8 | mice R90W homozygous|nsubj|START_ENTITY 2 9 | mice R90W +|compound|END_ENTITY 2 10 | mice R90W expression|nmod|END_ENTITY 2 -------------------------------------------------------------------------------- /tests/test_benchmark_ebc.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from matrix import SparseMatrix 4 | from ebc import EBC 5 | 6 | 7 | class TestBenchmarkEBC(unittest.TestCase): 8 | """ Benchmark the EBC code as a unittest, using the sparse matrix data. """ 9 | 10 | def setUp(self): 11 | with open("resources/matrix-ebc-paper-sparse.tsv", "r") as f: 12 | data = [] 13 | for line in f: 14 | sl = line.split("\t") 15 | if len(sl) < 5: # headers 16 | continue 17 | data.append([sl[0], sl[2], float(sl[4])]) 18 | 19 | self.matrix = SparseMatrix([14052, 7272]) 20 | self.matrix.read_data(data) 21 | self.matrix.normalize() 22 | 23 | def testEbcOnSparseMatrix(self): 24 | ebc = EBC(self.matrix, [30, 125], 10, 1e-10, 0.01) 25 | cXY, objective, it = ebc.run() 26 | print "objective: ", objective 27 | print "iterations: ", it 28 | self.assertEquals(len(ebc.pXY.nonzero_elements), 29456) 29 | self.assertEquals(len(set(ebc.cXY[0])), 30) 30 | self.assertEquals(len(set(ebc.cXY[1])), 125) 31 | -------------------------------------------------------------------------------- /tests/test_benchmark_ebc2d.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | import ebc2d 4 | from ebc2d import EBC2D 5 | 6 | 7 | class TestBenchmarkEBC2D(unittest.TestCase): 8 | """ Benchmark the EBC2D code as a unittest, using the sparse matrix data. 
""" 9 | 10 | def setUp(self): 11 | with open("resources/matrix-ebc-paper-sparse.tsv", "r") as f: 12 | data = [] 13 | for line in f: 14 | sl = line.split("\t") 15 | if len(sl) < 5: # headers 16 | continue 17 | data.append([sl[0], sl[2], float(sl[4])]) 18 | 19 | self.matrix = ebc2d.get_matrix_from_data(data) 20 | 21 | def testEbc2dOnSparseMatrix(self): 22 | ebc = EBC2D(self.matrix, [30, 125], 10, 1e-10, 0.01) 23 | cXY, objective, it = ebc.run() 24 | print "objective: ", objective 25 | print "iterations: ", it 26 | # self.assertEquals(len(ebc.pXY.nonzero[0]), 29456) 27 | self.assertEquals(len(set(cXY[0])), 30) 28 | self.assertEquals(len(set(cXY[1])), 125) 29 | -------------------------------------------------------------------------------- /tests/test_clusters.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | from ebc import EBC 3 | from matrix import SparseMatrix 4 | 5 | 6 | class TestMatrix(unittest.TestCase): 7 | def setUp(self): 8 | data = [[0, 0, 0, 1.0], 9 | [0, 0, 1, 1.0], 10 | [0, 1, 0, 1.0], 11 | [0, 1, 1, 1.0], 12 | [1, 0, 0, 1.0], 13 | [1, 0, 1, 1.0], 14 | [1, 1, 0, 1.0], 15 | [1, 1, 1, 1.0], 16 | [2, 2, 2, 1.0], 17 | [2, 2, 3, 1.0], 18 | [2, 3, 2, 1.0], 19 | [3, 2, 2, 1.0], 20 | [2, 3, 3, 1.0], 21 | [3, 3, 2, 1.0], 22 | [3, 2, 3, 1.0], 23 | [3, 3, 3, 1.0], 24 | [4, 4, 4, 1.0], 25 | [4, 4, 5, 1.0], 26 | [4, 5, 4, 1.0], 27 | [4, 5, 5, 1.0], 28 | [5, 4, 4, 1.0], 29 | [5, 4, 5, 1.0], 30 | [5, 5, 4, 1.0], 31 | [5, 5, 5, 1.0]] 32 | matrix = SparseMatrix([6, 6, 6]) 33 | matrix.read_data(data) 34 | matrix.normalize() 35 | 36 | ebc = EBC(matrix, [3, 3, 3], 10, 1e-10) 37 | assigned_C = [[0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]] 38 | cXY, objective = ebc.run(assigned_C) 39 | self.assertEquals(cXY, assigned_C) 40 | self.assertAlmostEqual(objective, 0.0) 41 | cXY, objective = ebc.run() # random initialization 42 | self.assertAlmostEqual(objective, 0.0) -------------------------------------------------------------------------------- /tests/test_ebc.py: -------------------------------------------------------------------------------- 1 | from operator import itemgetter 2 | import unittest 3 | 4 | from ebc import EBC 5 | from matrix import SparseMatrix 6 | 7 | 8 | class TestEbc(unittest.TestCase): 9 | def setUp(self): 10 | self.data = [["0", "0", 0.05], 11 | ["0", "1", 0.05], 12 | ["0", "2", 0.05], 13 | ["0", "3", 0.00], 14 | ["0", "4", 0.00], 15 | ["0", "5", 0.00], 16 | ["1", "0", 0.05], 17 | ["1", "1", 0.05], 18 | ["1", "2", 0.05], 19 | ["1", "3", 0.00], 20 | ["1", "4", 0.00], 21 | ["1", "5", 0.00], 22 | ["2", "0", 0.00], 23 | ["2", "1", 0.00], 24 | ["2", "2", 0.00], 25 | ["2", "3", 0.05], 26 | ["2", "4", 0.05], 27 | ["2", "5", 0.05], 28 | ["3", "0", 0.00], 29 | ["3", "1", 0.00], 30 | ["3", "2", 0.00], 31 | ["3", "3", 0.05], 32 | ["3", "4", 0.05], 33 | ["3", "5", 0.05], 34 | ["4", "0", 0.04], 35 | ["4", "1", 0.04], 36 | ["4", "2", 0.00], 37 | ["4", "3", 0.04], 38 | ["4", "4", 0.04], 39 | ["4", "5", 0.04], 40 | ["5", "0", 0.04], 41 | ["5", "1", 0.04], 42 | ["5", "2", 0.04], 43 | ["5", "3", 0.00], 44 | ["5", "4", 0.04], 45 | ["5", "5", 0.04]] 46 | self.matrix = SparseMatrix([6, 6]) 47 | self.matrix.read_data(self.data) 48 | 49 | def testDataLoad(self): 50 | self.assertEquals(sorted(self.matrix.nonzero_elements.items(), key=itemgetter(0)), 51 | [((0, 0), 0.05), ((0, 1), 0.05), ((0, 2), 0.05), ((1, 0), 0.05), ((1, 1), 0.05), 52 | ((1, 2), 0.05), ((2, 3), 0.05), ((2, 4), 0.05), ((2, 5), 0.05), ((3, 3), 0.05), 53 | ((3, 4), 
0.05), ((3, 5), 0.05), ((4, 0), 0.04), ((4, 1), 0.04), ((4, 3), 0.04), 54 | ((4, 4), 0.04), ((4, 5), 0.04), ((5, 0), 0.04), ((5, 1), 0.04), ((5, 2), 0.04), 55 | ((5, 4), 0.04), ((5, 5), 0.04)]) 56 | 57 | def testOldMatrix(self): 58 | with open("resources/matrix-ebc-paper-dense.tsv", "r") as f: 59 | data = [] 60 | for line in f: 61 | sl = line.split("\t") 62 | if len(sl) < 5: # headers 63 | continue 64 | data.append([sl[0], sl[2], float(sl[4])]) 65 | 66 | matrix = SparseMatrix([3514, 1232]) 67 | matrix.read_data(data) 68 | matrix.normalize() 69 | ebc = EBC(matrix, [30, 125], 10, 1e-10, 0.01) 70 | cXY, objective, it = ebc.run() 71 | print "objective: ", objective 72 | print "iterations: ", it 73 | self.assertEquals(len(ebc.pXY.nonzero_elements), 10007) 74 | self.assertEquals(len(set(ebc.cXY[0])), 30) 75 | self.assertEquals(len(set(ebc.cXY[1])), 125) 76 | 77 | def testOldMatrix3d(self): 78 | with open("resources/matrix-ebc-paper-dense-3d.tsv", "r") as f: 79 | data = [] 80 | for line in f: 81 | sl = line.split("\t") 82 | data.append([sl[0], sl[1], sl[2], float(sl[3])]) 83 | 84 | matrix = SparseMatrix([756, 996, 1232]) 85 | matrix.read_data(data) 86 | matrix.normalize() 87 | ebc = EBC(matrix, [30, 30, 10], 100, 1e-10, 0.01) 88 | cXY, objective, it = ebc.run() 89 | print "objective: ", objective 90 | print "iterations: ", it 91 | self.assertEquals(len(ebc.pXY.nonzero_elements), 10007) 92 | self.assertEquals(len(set(ebc.cXY[0])), 30) 93 | self.assertEquals(len(set(ebc.cXY[1])), 30) 94 | self.assertEquals(len(set(ebc.cXY[2])), 10) 95 | 96 | def test3DMatrix(self): 97 | data = [[0, 0, 0, 1.0], 98 | [0, 0, 1, 1.0], 99 | [0, 1, 0, 1.0], 100 | [0, 1, 1, 1.0], 101 | [1, 0, 0, 1.0], 102 | [1, 0, 1, 1.0], 103 | [1, 1, 0, 1.0], 104 | [1, 1, 1, 1.0], 105 | [2, 2, 2, 1.0], 106 | [2, 2, 3, 1.0], 107 | [2, 3, 2, 1.0], 108 | [3, 2, 2, 1.0], 109 | [2, 3, 3, 1.0], 110 | [3, 3, 2, 1.0], 111 | [3, 2, 3, 1.0], 112 | [3, 3, 3, 1.0], 113 | [4, 4, 4, 1.0], 114 | [4, 4, 5, 1.0], 115 | [4, 5, 4, 1.0], 116 | [4, 5, 5, 1.0], 117 | [5, 4, 4, 1.0], 118 | [5, 4, 5, 1.0], 119 | [5, 5, 4, 1.0], 120 | [5, 5, 5, 1.0]] 121 | matrix = SparseMatrix([6, 6, 6]) 122 | matrix.read_data(data) 123 | matrix.normalize() 124 | ebc = EBC(matrix, [3, 3, 3], 10, 1e-10, 0.01) 125 | assigned_C = [[0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]] 126 | cXY, objective, it = ebc.run(assigned_C) 127 | self.assertEquals(cXY, assigned_C) 128 | self.assertAlmostEqual(objective, 0.0) 129 | self.assertEquals(it, 1) 130 | 131 | for i in range(100): 132 | cXY, objective, it = ebc.run() # random initialization 133 | print cXY, objective, it 134 | -------------------------------------------------------------------------------- /tests/test_matrix.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from matrix import SparseMatrix 4 | 5 | 6 | class TestMatrix(unittest.TestCase): 7 | def setUp(self): 8 | self.data = [l.split('\t') for l in open('tests/sample-matrix-file.txt', 'r').readlines()] 9 | self.matrix = SparseMatrix([2, 4, 9]) 10 | self.matrix.read_data(self.data) 11 | 12 | def testMatrixInit(self): 13 | self.assertEquals(self.matrix.nonzero_elements[(1, 3, 7)], 2.0) 14 | self.assertEquals(self.matrix.nonzero_elements[(0, 0, 0)], 2.0) 15 | self.assertEquals(self.matrix.nonzero_elements[(0, 0, 2)], 2.0) 16 | self.assertEquals(self.matrix.nonzero_elements[(1, 1, 5)], 7.0) 17 | self.assertEquals(self.matrix.nonzero_elements[(1, 1, 3)], 3.0) 18 | 
        self.assertEquals(self.matrix.nonzero_elements[(1, 3, 7)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(0, 0, 0)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(0, 0, 2)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 1, 5)], 7.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 1, 3)], 3.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 3, 6)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 3, 8)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(0, 0, 1)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 1, 4)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 2, 5)], 2.0)
        self.assertEquals(len(self.matrix.nonzero_elements), 10)
        self.assertEquals(self.matrix.feature_ids[0], {'mice': 1, 'patient': 0})
        self.assertEquals(self.matrix.feature_ids[1], {'R92Q': 1, 'R91W': 2, 'Val30Met': 0, 'R90W': 3})
        self.assertEquals(self.matrix.feature_ids[2], {'START_ENTITY|nmod|END_ENTITY': 1,
                                                       'START_ENTITY|nummod|END_ENTITY': 5,
                                                       'FAP|compound|END_ENTITY': 2,
                                                       'expression|nmod|END_ENTITY': 8,
                                                       '+|compound|END_ENTITY': 7,
                                                       'mice|nummod|END_ENTITY': 3,
                                                       'homozygous|nsubj|START_ENTITY': 6,
                                                       'mutation|appos|END_ENTITY': 4,
                                                       'START_ENTITY|nmod|FAP': 0})

    def testShuffle(self):
        shuffled_matrix = self.matrix.shuffle()
        # shuffling permutes locations but must preserve the number of
        # nonzero elements and the set of stored values
        self.assertEquals(len(shuffled_matrix.nonzero_elements), len(self.matrix.nonzero_elements))
        self.assertEquals(set(shuffled_matrix.nonzero_elements.values()), set(self.matrix.nonzero_elements.values()))
        print("shuffled matrix elements: ", shuffled_matrix.nonzero_elements)
--------------------------------------------------------------------------------
/tests/test_sanitycheck.py:
--------------------------------------------------------------------------------
import unittest

import numpy as np

from matrix import SparseMatrix
from ebc import EBC
import ebc2d
from ebc2d import EBC2D


class TestSanityCheck(unittest.TestCase):
    """ Do a sanity check for the EBC code, using the data from the original ITCC paper.
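
    The values asserted below are the entries of the approximating joint
    distribution q(x, y) = p(x-hat, y-hat) * p(x | x-hat) * p(y | y-hat)
    (Dhillon et al., 2003), reconstructed here from a fixed cluster
    assignment so that every entry can be checked against the paper.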
""" 13 | 14 | def setUp(self): 15 | with open("resources/matrix-itcc-paper-orig.tsv", "r") as f: 16 | data = [l.split('\t') for l in f] 17 | 18 | self.matrix = SparseMatrix([6, 6]) 19 | self.matrix.read_data(data) 20 | self.matrix.normalize() 21 | 22 | def cartesian(self, arrays, out=None): 23 | arrays = [np.asarray(x) for x in arrays] 24 | dtype = arrays[0].dtype 25 | 26 | n = np.prod([x.size for x in arrays]) 27 | if out is None: 28 | out = np.zeros([n, len(arrays)], dtype=dtype) 29 | 30 | m = n / arrays[0].size 31 | out[:, 0] = np.repeat(arrays[0], m) 32 | if arrays[1:]: 33 | self.cartesian(arrays[1:], out=out[0:m, 1:]) 34 | for j in xrange(1, arrays[0].size): 35 | out[j * m:(j + 1) * m, 1:] = out[0:m, 1:] 36 | return out 37 | 38 | def testEbcOnSparseMatrix(self): 39 | ebc = EBC(self.matrix, [3, 2], 10, 1e-10, 0.01) 40 | cXY, objective, it = ebc.run(verbose=False) 41 | print "--> ebc" 42 | print "objective: ", objective 43 | print "iterations: ", it 44 | 45 | ebc = EBC(self.matrix, [3, 2], 10, 1e-10, 0.01) 46 | ebc.run(assigned_clusters=[[2, 0, 1, 1, 2, 2], [0, 0, 1, 0, 1, 1]], verbose=False) 47 | indices = [range(N_d) for N_d in ebc.pXY.N] 48 | index_list = self.cartesian(indices) 49 | approx_distribution = {} 50 | for location in index_list: 51 | q = 1.0 52 | c_location = [] 53 | for i in range(len(location)): 54 | c_i = ebc.cXY[i][location[i]] 55 | c_location.append(c_i) 56 | q *= ebc.qXxHat[i][location[i]] 57 | q *= ebc.qXhatYhat.get(tuple(c_location)) 58 | approx_distribution[tuple(location)] = q 59 | 60 | self.assertAlmostEquals(approx_distribution[(0, 0)], 0.054) 61 | self.assertAlmostEquals(approx_distribution[(0, 1)], 0.054) 62 | self.assertAlmostEquals(approx_distribution[(0, 2)], 0.042) 63 | self.assertAlmostEquals(approx_distribution[(0, 3)], 0.0) 64 | self.assertAlmostEquals(approx_distribution[(0, 4)], 0.0) 65 | self.assertAlmostEquals(approx_distribution[(0, 5)], 0.0) 66 | self.assertAlmostEquals(approx_distribution[(1, 0)], 0.054) 67 | self.assertAlmostEquals(approx_distribution[(1, 1)], 0.054) 68 | self.assertAlmostEquals(approx_distribution[(1, 2)], 0.042) 69 | self.assertAlmostEquals(approx_distribution[(1, 3)], 0.0) 70 | self.assertAlmostEquals(approx_distribution[(1, 4)], 0.0) 71 | self.assertAlmostEquals(approx_distribution[(1, 5)], 0.0) 72 | self.assertAlmostEquals(approx_distribution[(2, 0)], 0.0) 73 | self.assertAlmostEquals(approx_distribution[(2, 1)], 0.0) 74 | self.assertAlmostEquals(approx_distribution[(2, 2)], 0.0) 75 | self.assertAlmostEquals(approx_distribution[(2, 3)], 0.042) 76 | self.assertAlmostEquals(approx_distribution[(2, 4)], 0.054) 77 | self.assertAlmostEquals(approx_distribution[(2, 5)], 0.054) 78 | self.assertAlmostEquals(approx_distribution[(3, 0)], 0.0) 79 | self.assertAlmostEquals(approx_distribution[(3, 1)], 0.0) 80 | self.assertAlmostEquals(approx_distribution[(3, 2)], 0.0) 81 | self.assertAlmostEquals(approx_distribution[(3, 3)], 0.042) 82 | self.assertAlmostEquals(approx_distribution[(3, 4)], 0.054) 83 | self.assertAlmostEquals(approx_distribution[(3, 5)], 0.054) 84 | self.assertAlmostEquals(approx_distribution[(4, 0)], 0.036) 85 | self.assertAlmostEquals(approx_distribution[(4, 1)], 0.036) 86 | self.assertAlmostEquals(approx_distribution[(4, 2)], 0.028) 87 | self.assertAlmostEquals(approx_distribution[(4, 3)], 0.028) 88 | self.assertAlmostEquals(approx_distribution[(4, 4)], 0.036) 89 | self.assertAlmostEquals(approx_distribution[(4, 5)], 0.036) 90 | self.assertAlmostEquals(approx_distribution[(5, 0)], 0.036) 91 | 
    def testEbc2dOnSparseMatrix(self):
        with open("resources/matrix-itcc-paper-orig.tsv", "r") as f:
            data = [l.split('\t') for l in f]
        m = ebc2d.get_matrix_from_data(data)
        # run without assigned clusters
        ebc = EBC2D(m, [3, 2], 10, 1e-10, 0.01)
        cXY, objective, it = ebc.run(verbose=False)
        print("--> ebc2d")
        print("objective: ", objective)
        print("iterations: ", it)

        # run with assigned clusters
        ebc = EBC2D(m, [3, 2], 10, 1e-10, 0.01)
        cXY, objective, it = ebc.run(assigned_clusters=[[2, 0, 1, 1, 2, 2], [0, 0, 1, 0, 1, 1]], verbose=False)
        indices = [range(N_d) for N_d in ebc.pXY.shape]
        index_list = self.cartesian(indices)
        approx_distribution = {}
        qX_xhat = [ebc.qX_xhat, ebc.qY_yhat]  # per-axis conditionals q(x | x-hat), q(y | y-hat)
        for location in index_list:
            q = 1.0
            c_location = []
            for i in range(len(location)):
                c_i = cXY[i][location[i]]
                c_location.append(c_i)
                q *= qX_xhat[i][location[i]]
            q *= ebc.qXhatYhat[c_location[0], c_location[1]]
            approx_distribution[tuple(location)] = q

        self.assertAlmostEquals(approx_distribution[(0, 0)], 0.054)
        self.assertAlmostEquals(approx_distribution[(0, 1)], 0.054)
        self.assertAlmostEquals(approx_distribution[(0, 2)], 0.042)
        self.assertAlmostEquals(approx_distribution[(0, 3)], 0.0)
        self.assertAlmostEquals(approx_distribution[(0, 4)], 0.0)
        self.assertAlmostEquals(approx_distribution[(0, 5)], 0.0)
        self.assertAlmostEquals(approx_distribution[(1, 0)], 0.054)
        self.assertAlmostEquals(approx_distribution[(1, 1)], 0.054)
        self.assertAlmostEquals(approx_distribution[(1, 2)], 0.042)
        self.assertAlmostEquals(approx_distribution[(1, 3)], 0.0)
        self.assertAlmostEquals(approx_distribution[(1, 4)], 0.0)
        self.assertAlmostEquals(approx_distribution[(1, 5)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 0)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 1)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 2)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 3)], 0.042)
        self.assertAlmostEquals(approx_distribution[(2, 4)], 0.054)
        self.assertAlmostEquals(approx_distribution[(2, 5)], 0.054)
        self.assertAlmostEquals(approx_distribution[(3, 0)], 0.0)
        self.assertAlmostEquals(approx_distribution[(3, 1)], 0.0)
        self.assertAlmostEquals(approx_distribution[(3, 2)], 0.0)
        self.assertAlmostEquals(approx_distribution[(3, 3)], 0.042)
        self.assertAlmostEquals(approx_distribution[(3, 4)], 0.054)
        self.assertAlmostEquals(approx_distribution[(3, 5)], 0.054)
        self.assertAlmostEquals(approx_distribution[(4, 0)], 0.036)
        self.assertAlmostEquals(approx_distribution[(4, 1)], 0.036)
        self.assertAlmostEquals(approx_distribution[(4, 2)], 0.028)
        self.assertAlmostEquals(approx_distribution[(4, 3)], 0.028)
        self.assertAlmostEquals(approx_distribution[(4, 4)], 0.036)
        self.assertAlmostEquals(approx_distribution[(4, 5)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 0)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 1)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 2)], 0.028)
        self.assertAlmostEquals(approx_distribution[(5, 3)], 0.028)
        self.assertAlmostEquals(approx_distribution[(5, 4)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 5)], 0.036)
--------------------------------------------------------------------------------