├── .gitignore
├── LICENSE.txt
├── README.md
├── __init__.py
├── clusters-repeat.py
├── clusters-single.py
├── ebc.py
├── ebc2d.py
├── matrix.py
├── resources
│   ├── matrix-ebc-paper-dense-3d.tsv
│   ├── matrix-ebc-paper-dense-4d.tsv
│   ├── matrix-ebc-paper-dense.tsv
│   ├── matrix-ebc-paper-sparse-3d.tsv
│   ├── matrix-ebc-paper-sparse-4d.tsv
│   ├── matrix-ebc-paper-sparse.tsv
│   ├── matrix-itcc-3d-3clust.tsv
│   ├── matrix-itcc-paper-3clust.tsv
│   ├── matrix-itcc-paper-orig-letters.tsv
│   └── matrix-itcc-paper-orig.tsv
└── tests
    ├── sample-matrix-file.txt
    ├── test_benchmark_ebc.py
    ├── test_benchmark_ebc2d.py
    ├── test_clusters.py
    ├── test_ebc.py
    ├── test_matrix.py
    └── test_sanitycheck.py

/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | temp/
3 | __pycache__/
4 | *.pyc
5 | .DS_Store
6 | *.out
7 | *.sh
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright (c) 2015 Bethany Percha, Russ B. Altman, Yuhao Zhang
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Ensemble Biclustering for Classification
2 | ==============
3 | 
4 | A Python implementation of the Ensemble Biclustering for Classification (EBC) algorithm. Despite the "biclustering" in its name, EBC is a co-clustering algorithm that lets you co-cluster very large, sparse N-dimensional matrices. For details and examples of using EBC, please refer to [this paper](http://www.ncbi.nlm.nih.gov/pubmed/26219079).
5 | 
6 | ## Files
7 | 
8 | - `ebc.py`: an implementation of the EBC algorithm using a sparse matrix.
9 | - `ebc2d.py`: a vectorized implementation of the EBC algorithm using a (2D) dense numpy array.
10 | - `matrix.py`: an N-dimensional sparse matrix implementation, backed by a Python dict, that supports basic get/set/sum operations.
11 | - `resources/`: a directory of example datasets that EBC can be run on.
12 | 
13 | ## Usage
14 | 
15 | #### General Usage
16 | 
17 | Using EBC is easy. First you need a sparse matrix, constructed as a SparseMatrix instance as defined in `matrix.py`. You can use the `read_data()` method built into `matrix.py` to construct the sparse matrix.
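For example, a minimal sketch (the file name and axis sizes here are placeholders — substitute your own; each row of the input file is expected to hold one feature per dimension followed by a value):

    from matrix import SparseMatrix

    data = [line.strip().split("\t") for line in open("my-data.tsv")]  # [feature1, ..., feature_dim, value] per row
    sparse_matrix = SparseMatrix([14052, 7272])  # the size of each axis
    sparse_matrix.read_data(data)
    sparse_matrix.normalize()  # EBC expects the matrix elements to sum to 1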
18 | 
19 | Once you have the sparse matrix, import the `EBC` class (`from ebc import EBC`) and run:
20 | 
21 |     ebc = EBC(sparse_matrix, n_clusters=[30, 125], max_iterations=10)
22 |     cXY, objective, n_iter = ebc.run()
23 | 
24 | The returned `cXY` contains the cluster assignments along each axis as a list of lists, `objective` is the final objective function value, and `n_iter` is the number of iterations the algorithm ran before converging.
25 | 
26 | #### Efficiency Considerations
27 | 
28 | In short, `EBC` is built for highly sparse multi-dimensional matrices, while `EBC2D` is a vectorized version of `EBC` built for less sparse 2D matrices. The running time of `EBC` grows linearly with the number of non-zero elements, while the running time of `EBC2D` depends only on the overall matrix size.
29 | 
30 | - If your input matrix is highly sparse, or you need support for an N-dimensional matrix with N > 2, use the `EBC` class in `ebc.py`.
31 | - If your input matrix is not very sparse (e.g. 5% density) and can fit into memory with `numpy`, you can use the `EBC2D` class in `ebc2d.py` (see the sketch at the end of this README). Note that `EBC2D` only supports 2D matrices, since a large multi-dimensional dense matrix can hardly fit into memory.
32 | 
33 | To give a better sense of the efficiency: the `EBC` implementation runs in ~50 seconds on our 14052 x 7272 example matrix (which is >99% sparse) on a MacBook, and in this case it is faster than `EBC2D`. However, in our testing, on a 5000 x 5000 matrix of 95% sparsity, `EBC2D` achieved a 5x speedup over `EBC`, and this speedup grows as the sparsity decreases.
34 | 
35 | #### Dependencies
36 | 
37 | To run this implementation of the EBC algorithm, you need to have `numpy` installed.
38 | 
39 | ## References
40 | 
41 | - Percha, Bethany, and Russ B. Altman. "Learning the Structure of Biomedical Relationships from Unstructured Text." PLoS Computational Biology 11.7 (2015).
42 | - Dhillon, Inderjit S., Subramanyam Mallela, and Dharmendra S. Modha. "Information-Theoretic Co-clustering." Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003.
43 | 
44 | ## Questions?
45 | 
46 | This code was written by Beth Percha and Yuhao Zhang. We welcome any questions or comments, and would appreciate it if you let us know about any substantial modifications or improvements you make. You can reach us at blpercha@stanford.edu.
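#### Appendix: EBC2D Usage

A minimal `EBC2D` sketch, referenced from the Efficiency Considerations section above and mirroring `tests/test_benchmark_ebc2d.py`. Here `data` is assumed to be a list of `[row_feature, column_feature, value]` triples, the same format `SparseMatrix.read_data()` accepts in the 2D case:

    import ebc2d
    from ebc2d import EBC2D

    matrix = ebc2d.get_matrix_from_data(data)  # builds and normalizes a dense numpy array
    ebc = EBC2D(matrix, n_clusters=[30, 125], max_iterations=10)
    cXY, objective, n_iter = ebc.run()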
47 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | __all__ = ["ebc", "matrix"] -------------------------------------------------------------------------------- /clusters-repeat.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import sys 3 | 4 | from ebc import EBC 5 | from matrix import SparseMatrix 6 | 7 | 8 | def main(): 9 | data_file = sys.argv[1] 10 | ebc_cols = [int(e) for e in sys.argv[2].split(",")] 11 | K = [int(e) for e in sys.argv[3].split(",")] 12 | N_runs = int(sys.argv[4]) 13 | output_file = sys.argv[5] 14 | jitter_max = float(sys.argv[6]) 15 | max_iterations_ebc = int(sys.argv[7]) 16 | entity_cols = [int(e) for e in sys.argv[8].split(",")] 17 | object_toler = float(sys.argv[9]) 18 | 19 | # get original data 20 | raw_data = [line.split("\t") for line in open(data_file, "r")] 21 | data = [[d[i] for i in ebc_cols] for d in raw_data] 22 | data_dimensions = len(data[0]) - 1 23 | 24 | # get axis length for each dimension 25 | N = [] 26 | for dim in range(data_dimensions): 27 | N.append(len(set([d[dim] for d in data]))) 28 | print(N) 29 | 30 | # set up matrix 31 | M = SparseMatrix(N) 32 | M.read_data(data) 33 | M.normalize() 34 | 35 | # set up entity map to ids 36 | entity_map = defaultdict(tuple) 37 | for d in raw_data: 38 | entity = tuple([d[i] for i in entity_cols]) 39 | entity_ids = tuple([M.feature_ids[ebc_cols.index(i)][d[i]] for i in entity_cols]) 40 | entity_map[entity_ids] = entity 41 | 42 | # figure out which ebc columns the entity columns correspond to 43 | entity_column_indices = [] 44 | for c in ebc_cols: 45 | if c in entity_cols: 46 | entity_column_indices.append(ebc_cols.index(c)) 47 | 48 | # run EBC and get entity cluster assignments 49 | ebc_M = EBC(M, K, max_iterations_ebc, jitter_max, object_toler) 50 | clusters = defaultdict(list) 51 | for t in range(N_runs): 52 | print "run ", t 53 | cXY_M, objective_M, it_M = ebc_M.run() 54 | for e1 in entity_map.keys(): 55 | c1_i = tuple([cXY_M[i][e1[i]] for i in entity_column_indices]) 56 | clusters[e1].append(c1_i) 57 | 58 | # print assignments 59 | writer = open(output_file, "w") 60 | for k in clusters: 61 | e1_name = entity_map[k] 62 | writer.write(",".join([str(e) for e in k]) + "\t" + 63 | ",".join([e for e in e1_name]) + "\t" + "\t".join([",".join([str(f) for f in e]) 64 | for e in clusters[k]]) + "\n") 65 | writer.flush() 66 | writer.close() 67 | 68 | 69 | if __name__ == "__main__": 70 | main() 71 | -------------------------------------------------------------------------------- /clusters-single.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | from numpy import mean, std 4 | 5 | from ebc import EBC 6 | from matrix import SparseMatrix 7 | 8 | 9 | def compareRandom(num_trials, tensor_dimensions, matrix_data, cluster_dimensions, 10 | maxit_ebc, jitter_max_ebc, objective_tolerance): 11 | deltas = [] 12 | objectives_M = [] 13 | objectives_Mr = [] 14 | iterations_M = [] 15 | iterations_Mr = [] 16 | noconverge_M = 0 17 | noconverge_Mr = 0 18 | for j in range(num_trials): 19 | print "Trial ", j 20 | 21 | M = SparseMatrix(tensor_dimensions) 22 | M.read_data(matrix_data) 23 | Mr = M.shuffle() # could also be M.shuffle_old() 24 | 25 | M.normalize() 26 | 27 | ebc_M = EBC(M, cluster_dimensions, maxit_ebc, jitter_max_ebc, objective_tolerance) 28 | cXY_M, 
objective_M, it_M = ebc_M.run() 29 | if it_M == maxit_ebc: 30 | noconverge_M += 1 31 | else: 32 | iterations_M.append(it_M) 33 | 34 | Mr.normalize() 35 | 36 | ebc_Mr = EBC(Mr, cluster_dimensions, maxit_ebc, jitter_max_ebc, objective_tolerance) 37 | cXY_Mr, objective_Mr, it_Mr = ebc_Mr.run() 38 | if it_Mr == maxit_ebc: 39 | noconverge_Mr += 1 40 | else: 41 | iterations_Mr.append(it_Mr) 42 | 43 | objectives_M.append(objective_M) 44 | objectives_Mr.append(objective_Mr) 45 | deltas.append(objective_M - objective_Mr) 46 | return deltas, objectives_M, objectives_Mr, iterations_M, iterations_Mr, noconverge_M, noconverge_Mr 47 | 48 | 49 | def main(): 50 | data_file = sys.argv[1] 51 | cols = [int(e) for e in sys.argv[2].split(",")] 52 | K = [int(e) for e in sys.argv[3].split(",")] 53 | N_trials = int(sys.argv[4]) 54 | output_file = sys.argv[5] 55 | jitter_max = float(sys.argv[6]) 56 | max_iterations_ebc = int(sys.argv[7]) 57 | object_tol = float(sys.argv[8]) 58 | 59 | # get original data 60 | raw_data = [line.split("\t") for line in open(data_file, "r")] 61 | data = [[d[i] for i in cols] for d in raw_data] 62 | data_dimensions = len(data[0]) - 1 63 | 64 | # get axis length for each dimension 65 | N = [] 66 | for dim in range(data_dimensions): 67 | N.append(len(set([d[dim] for d in data]))) 68 | print(N) 69 | 70 | D_1, obj_orig, obj_rand, it_orig, it_rand, noconv_orig, noconv_rand = compareRandom(num_trials=N_trials, 71 | tensor_dimensions=N, 72 | matrix_data=data, 73 | cluster_dimensions=K, 74 | maxit_ebc=max_iterations_ebc, 75 | jitter_max_ebc=jitter_max, 76 | objective_tolerance=object_tol) 77 | 78 | # write final result to combined file (other processes also write to this file) 79 | output_stream = open(output_file, "a") 80 | output_stream.write("\t".join([str(e) for e in K]) + "\t" + str(mean(D_1)) + "\t" + str(std(D_1)) + 81 | "\t" + str(mean(obj_orig)) + "\t" + str(mean(obj_rand)) + 82 | "\t" + str(mean(it_orig)) + "\t" + str(mean(it_rand)) + 83 | "\t" + str(noconv_orig) + "\t" + str(noconv_rand) + "\n") 84 | output_stream.flush() 85 | output_stream.close() 86 | 87 | 88 | if __name__ == "__main__": 89 | main() 90 | -------------------------------------------------------------------------------- /ebc.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import random 3 | import sys 4 | import math 5 | 6 | import numpy as np 7 | from numpy.ma import divide, outer, sqrt 8 | from numpy.random.mtrand import random_sample 9 | 10 | from matrix import SparseMatrix 11 | 12 | INFINITE = 1e10 13 | 14 | 15 | class EBC: 16 | def __init__(self, matrix, n_clusters, max_iterations=10, jitter_max=1e-10, objective_tolerance=0.01): 17 | """ To initialize an EBC object. 
18 | 
19 |         Args:
20 |             matrix: an instance of SparseMatrix that represents the original distribution
21 |             n_clusters: a list giving the number of clusters along each dimension
22 |             max_iterations: the maximum number of iterations before we stop; defaults to 10
23 |             jitter_max: a small random value added to cluster-assignment scores to break ties; defaults to 1e-10
24 |             objective_tolerance: the threshold on the difference between two successive objective values below which we stop
25 | 
26 |         Return:
27 |             None
28 |         """
29 |         if not isinstance(matrix, SparseMatrix):
30 |             raise Exception("Matrix argument to EBC is not an instance of SparseMatrix.")
31 | 
32 |         # check to ensure matrix is a probability distribution
33 |         np.testing.assert_approx_equal(matrix.sum(), 1.0, significant=7,
34 |                                        err_msg='Matrix elements do not sum to 1. Please normalize your matrix.')
35 | 
36 |         self.pXY = matrix  # the joint probability distribution, e.g. p(X,Y) - the original sparse, multidimensional matrix
37 |         self.K = n_clusters  # numbers of clusters along each dimension (len(K) = dim)
38 |         self.dim = self.pXY.dim  # overall dimension of the matrix
39 |         self.max_it = max_iterations
40 | 
41 |         self.cXY = None  # list of lists: cluster assignments along each dimension, e.g. [C(X), C(Y), ...]
42 |         self.pX = None  # marginal probabilities
43 | 
44 |         self.qXhatYhat = None  # the approximate probability distribution after clustering, e.g. q(X',Y'); needs to be a SparseMatrix
45 |         self.qXhat = None  # the marginal cluster distributions in a list, e.g. [q(X'), q(Y'), ...]
46 |         self.qXxHat = None  # the distributions conditioned on the clustering, in a list, e.g. [q(X|X'), q(Y|Y'), ...]
47 | 
48 |         self.jitter_max = jitter_max  # amount to add to cluster assignment scores to break ties
49 |         self.objective_tolerance = objective_tolerance  # the threshold for us to stop
50 | 
51 |     def run(self, assigned_clusters=None, verbose=True):
52 |         """ To run the EBC algorithm.
53 | 
54 |         Args:
55 |             assigned_clusters: an optional list of lists representing the initial cluster assignments along each dimension.
56 | 
57 |         Return:
58 |             cXY: a list of lists of cluster assignments along each dimension, e.g. [C(X), C(Y), ...]
59 |             objective: the final objective value
60 |             max_it: the number of iterations that the algorithm has run
61 |         """
62 |         if verbose: print "Running EBC on a %d-d sparse matrix with size %s ..." % (self.dim, str(self.pXY.N))
63 |         # Step 1: initialization steps
64 |         self.pX = self.calculate_marginals(self.pXY)
65 |         if assigned_clusters:
66 |             if verbose: print "Using specified clusters, with cluster number on each axis: %s ..." % self.K
67 |             self.cXY = assigned_clusters
68 |         else:
69 |             if verbose: print "Randomly initializing clusters, with cluster number on each axis: %s ..."
% self.K 70 | self.cXY = self.initialize_cluster_centers(self.pXY, self.K) 71 | 72 | # Step 2: calculate cluster joint and marginal distributions 73 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cXY, self.K, self.pXY) 74 | self.qXhat = self.calculate_marginals(self.qXhatYhat) # the cluster marginals along each axis 75 | self.qXxHat = self.calculate_conditionals(self.cXY, self.pXY.N, self.pX, self.qXhat) 76 | 77 | # Step 3: iterate through dimensions, recalculating distributions 78 | last_objective = objective = INFINITE 79 | for t in xrange(self.max_it): 80 | if verbose: sys.stdout.write("--> Running iteration %d " % (t + 1)); sys.stdout.flush() 81 | for axis in xrange(self.dim): 82 | self.cXY[axis] = self.compute_clusters(self.pXY, self.qXhatYhat, self.qXhat, self.qXxHat, self.cXY, 83 | axis) 84 | self.ensure_correct_number_clusters(self.cXY[axis], self.K[axis]) # check to ensure correct K 85 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cXY, self.K, self.pXY) 86 | self.qXhat = self.calculate_marginals(self.qXhatYhat) 87 | self.qXxHat = self.calculate_conditionals(self.cXY, self.pXY.N, self.pX, self.qXhat) 88 | if verbose: sys.stdout.write("."); sys.stdout.flush() 89 | objective = self.calculate_objective() 90 | if verbose: sys.stdout.write(" objective value = %f\n" % (objective)) 91 | if abs(objective - last_objective) < self.objective_tolerance: 92 | if verbose: print "EBC finished in %d iterations, with final objective value %.4f" % (t + 1, objective) 93 | return self.cXY, objective, t + 1 94 | last_objective = objective 95 | if verbose: print "EBC finished in %d iterations, with final objective value %.4f" % (self.max_it, objective) 96 | return self.cXY, objective, self.max_it # hit max iterations - just return current assignments 97 | 98 | def compute_clusters(self, pXY, qXhatYhat, qXhat, qXxhat, cXY, axis): 99 | """ Compute the best cluster assignment along a single axis, given all the distributions and clusters on other axes. 
100 | 101 | Args: 102 | pXY: the original probability distribution matrix 103 | qXhatYhat: the joint distribution over the clusters 104 | qXhat: the marginal distributions of qXhatYhat 105 | qXxhat: the distribution conditioned on the clustering in a list 106 | cXY: current cluster assignments along each dimension 107 | axis: the axis (dimension) over which clusters are being computed 108 | 109 | Return: 110 | Best cluster assignment along a single axis as a list 111 | """ 112 | if not isinstance(pXY, SparseMatrix) or not isinstance(qXhatYhat, SparseMatrix): 113 | raise Exception("Arguments to compute_clusters not an instance of SparseMatrix.") 114 | # To assign clusters, we calculate argmin_xhat D(p(Y,Z|x) || q(Y,Z|xhat)), 115 | # where D(P|Q) = \sum_i P_i log (P_i / Q_i) 116 | dPQ = np.zeros(shape=(pXY.N[axis], qXhatYhat.N[axis])) 117 | # iterate though all non-zero elements; here we are making use of the sparsity to reduce computation 118 | for coords, p_i in pXY.nonzero_elements.iteritems(): 119 | coord_this_axis = coords[axis] 120 | px = self.pX[axis][coord_this_axis] 121 | p_i = 1 if px == 0 else p_i / px # calculate p(y|x) = p(x,y)/p(x), but we should be careful if px == 0 122 | current_cluster_assignments = [cXY[i][coords[i]] for i in 123 | xrange(self.dim)] # cluster assignments on each axis 124 | for xhat in xrange(self.K[axis]): 125 | current_cluster_assignments[axis] = xhat # temporarily assign dth dimension to this xhat 126 | current_qXhatYhat = qXhatYhat.get(tuple(current_cluster_assignments)) 127 | current_qXhat = qXhat[axis][xhat] 128 | q_i = 1.0 129 | if current_qXhatYhat == 0 and current_qXhat == 0: 130 | q_i = 0 # Here we define 0/0=0 131 | else: 132 | q_i *= current_qXhatYhat / current_qXhat 133 | for i in xrange(self.dim): 134 | if i == axis: continue 135 | q_i *= qXxhat[i][coords[i]] 136 | if q_i == 0: # this can definitely happen if cluster joint distribution has zero element 137 | dPQ[coord_this_axis, xhat] = INFINITE 138 | else: 139 | dPQ[coord_this_axis, xhat] += p_i * math.log(p_i / q_i) 140 | 141 | # add random jitter to break ties 142 | dPQ += self.jitter_max * random_sample(dPQ.shape) 143 | return list(dPQ.argmin(1)) # return the closest cluster assignment under KL-divergence 144 | 145 | def calculate_marginals(self, pXY): 146 | """ Calculate the marginal probabilities given a joint distribution. 147 | 148 | Args: 149 | pXY: sparse matrix over which marginals are calculated, e.g. P(X,Y) 150 | 151 | Return: 152 | marginals: the calculated marginal probabilities, e.g. P(X) 153 | """ 154 | if not isinstance(pXY, SparseMatrix): 155 | raise Exception("Illegal argument to marginal calculation: " + str(pXY)) 156 | marginals = [[0] * Ni for Ni in pXY.N] 157 | for d in pXY.nonzero_elements: 158 | for i in xrange(len(d)): 159 | marginals[i][d[i]] += pXY.nonzero_elements[d] 160 | return marginals 161 | 162 | def calculate_joint_cluster_distribution(self, cXY, K, pXY): 163 | """ Calculate the joint cluster distribution q(X',Y') using the current prob distribution and 164 | cluster assignments. 
(Here we use X' to denote X_hat.)
165 | 
166 |         Args:
167 |             cXY: current cluster assignments for each axis
168 |             K: numbers of clusters along each axis
169 |             pXY: original probability distribution matrix
170 | 
171 |         Return:
172 |             qXhatYhat: the joint cluster distribution
173 |         """
174 |         if not isinstance(pXY, SparseMatrix):
175 |             raise Exception("Matrix argument to calculate_joint_cluster_distribution not an instance of SparseMatrix.")
176 |         qXhatYhat = SparseMatrix(K)  # joint distribution over clusters
177 |         for coords in pXY.nonzero_elements:
178 |             # find the coordinates of the cluster for this element
179 |             cluster_coords = []
180 |             for i in xrange(len(coords)):
181 |                 cluster_coords.append(cXY[i][coords[i]])
182 |             qXhatYhat.add_value(tuple(cluster_coords), pXY.nonzero_elements[coords])
183 |         return qXhatYhat
184 | 
185 |     def calculate_conditionals(self, cXY, N, pX, qXhat):
186 |         """ Calculate the conditional marginal distributions given the clustering, i.e. q(X|X').
187 | 
188 |         Args:
189 |             cXY: current cluster assignments
190 |             N: lengths of each dimension in the original data matrix
191 |             pX: marginal distributions over the original data matrix
192 |             qXhat: marginal distributions over the cluster joint distribution
193 | 
194 |         Return:
195 |             conditional_distributions: one distribution per axis, each a list holding the probability for the i-th row/column of that axis.
196 |         """
197 |         conditional_distributions = [[0] * Ni for Ni in N]
198 |         for i in xrange(len(cXY)):
199 |             cluster_assignments_this_dimension = cXY[i]
200 |             for j in xrange(len(cluster_assignments_this_dimension)):
201 |                 cluster = cluster_assignments_this_dimension[j]
202 |                 if pX[i][j] == 0 and qXhat[i][cluster] == 0:
203 |                     conditional_distributions[i][j] = 0
204 |                 else:
205 |                     conditional_distributions[i][j] = pX[i][j] / qXhat[i][cluster]
206 |         return conditional_distributions
207 | 
208 |     def initialize_cluster_centers(self, pXY, K):
209 |         """ Initialize the cluster assignments along each axis by first selecting K centers,
210 |             and then mapping each row to its closest center under cosine similarity.
211 | 
212 |         Args:
213 |             pXY: original data matrix
214 |             K: numbers of clusters desired in each dimension
215 | 
216 |         Return:
217 |             new_C: a list of lists giving, for each axis, the cluster id assigned to each index on that axis.
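        Note (describing the computation below): each row x is scored against each
        center x_hat by the cosine similarity
            score(x, x_hat) = <p(x, .), q(x_hat, .)> / (||p(x, .)|| * ||q(x_hat, .)||),
        accumulated over nonzero elements only; a small random jitter breaks ties.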
218 | """ 219 | if not isinstance(pXY, SparseMatrix): 220 | raise Exception("Matrix argument to initialize_cluster_centers is not an instance of SparseMatrix.") 221 | new_C = [[-1] * Ni for Ni in pXY.N] 222 | 223 | for axis in xrange(len(K)): # loop over each dimension 224 | # choose cluster centers 225 | axis_length = pXY.N[axis] 226 | center_indices = random.sample(xrange(axis_length), K[axis]) 227 | cluster_ids = {} 228 | for i in xrange(K[axis]): # assign identifiers to clusters 229 | center_index = center_indices[i] 230 | cluster_ids[center_index] = i 231 | centers = defaultdict(lambda: defaultdict(float)) # all nonzero indices for each center 232 | for coords in pXY.nonzero_elements: 233 | coord_this_axis = coords[axis] 234 | if coord_this_axis in cluster_ids: # is a center 235 | reduced_coords = tuple( 236 | [coords[i] for i in xrange(len(coords)) if i != axis]) # coords without the current axis 237 | centers[cluster_ids[coord_this_axis]][reduced_coords] = pXY.nonzero_elements[ 238 | coords] # (cluster_id, other coords) -> value 239 | 240 | # assign rows to clusters 241 | scores = np.zeros(shape=(pXY.N[axis], K[axis])) # scores: axis_size x cluster_number 242 | denoms_P = np.zeros(shape=(pXY.N[axis])) 243 | denoms_Q = np.zeros(shape=(K[axis])) 244 | for coords in pXY.nonzero_elements: 245 | coord_this_axis = coords[axis] 246 | if coord_this_axis in center_indices: 247 | continue # don't reassign cluster centers, please 248 | reduced_coords = tuple([coords[i] for i in xrange(len(coords)) if i != axis]) 249 | for cluster_index in cluster_ids: 250 | xhat = cluster_ids[cluster_index] # need cluster ID, not the axis index 251 | if reduced_coords in centers[xhat]: # overlapping point 252 | P_i = pXY.nonzero_elements[coords] 253 | Q_i = centers[xhat][reduced_coords] 254 | scores[coords[axis]][xhat] += P_i * Q_i # now doing based on cosine similarity 255 | denoms_P[coords[axis]] += P_i * P_i # magnitude of this slice of original matrix 256 | denoms_Q[xhat] += Q_i * Q_i # magnitude of cluster centers 257 | 258 | # normalize scores 259 | scores = divide(scores, outer(sqrt(denoms_P), sqrt(denoms_Q))) 260 | scores[scores == 0] = -1.0 261 | 262 | # add random jitter to scores to handle tie-breaking 263 | scores += self.jitter_max * random_sample(scores.shape) 264 | new_cXYi = list(scores.argmax(1)) # this needs to be argmax because cosine similarity 265 | 266 | # make sure to assign the cluster centers to themselves 267 | for center_index in cluster_ids: 268 | new_cXYi[center_index] = cluster_ids[center_index] 269 | 270 | # ensure numbers of clusters are correct 271 | self.ensure_correct_number_clusters(new_cXYi, K[axis]) 272 | new_C[axis] = new_cXYi 273 | return new_C 274 | 275 | def ensure_correct_number_clusters(self, cXYi, expected_K): 276 | """ To ensure a cluster assignment actually has the expected total number of clusters. 277 | 278 | Args: 279 | cXYi: the input cluster assignment 280 | expected_K: expected number of clusters on this axis 281 | 282 | Return: 283 | None. The assignment will be changed in place in cXYi. 
284 | """ 285 | clusters_represented = set() 286 | for c in cXYi: 287 | clusters_represented.add(c) 288 | if len(clusters_represented) == expected_K: 289 | return 290 | for c in xrange(expected_K): 291 | if c not in clusters_represented: 292 | index_to_change = random.randint(0, len(cXYi) - 1) 293 | cXYi[index_to_change] = c 294 | self.ensure_correct_number_clusters(cXYi, expected_K) 295 | 296 | def calculate_objective(self): 297 | """ Calculate the value of the objective function given the current cluster assignments. 298 | 299 | Return: 300 | objective: the objective function value 301 | """ 302 | objective = 0.0 303 | for d in self.pXY.nonzero_elements: 304 | pXY_i = self.pXY.nonzero_elements[d] 305 | qXY_i = self.get_element_approximate_distribution(d) 306 | if qXY_i == 0: 307 | print(pXY_i) 308 | objective += pXY_i * math.log(pXY_i / qXY_i) 309 | return objective 310 | 311 | def get_element_approximate_distribution(self, coords): 312 | """ Get the distribution approximated by q(X,Y). """ 313 | clusters = [self.cXY[i][coords[i]] for i in xrange(len(coords))] 314 | element = self.qXhatYhat.get(tuple(clusters)) 315 | for i in xrange(len(coords)): 316 | element *= self.qXxHat[i][coords[i]] 317 | return element 318 | 319 | 320 | def main(): 321 | """ An example run of EBC. """ 322 | with open("resources/matrix-ebc-paper-sparse.tsv", "r") as f: 323 | data = [] 324 | for line in f: 325 | sl = line.split("\t") 326 | if len(sl) < 5: # headers 327 | continue 328 | data.append([sl[0], sl[2], float(sl[4])]) 329 | 330 | matrix = SparseMatrix([14052, 7272]) 331 | matrix.read_data(data) 332 | matrix.normalize() 333 | ebc = EBC(matrix, [30, 125], 10, 1e-10, 0.01) 334 | cXY, objective, it = ebc.run() 335 | 336 | 337 | if __name__ == "__main__": 338 | main() 339 | -------------------------------------------------------------------------------- /ebc2d.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import random 3 | import sys 4 | 5 | import numpy as np 6 | 7 | 8 | class EBC2D: 9 | def __init__(self, matrix, n_clusters, max_iterations=10, jitter_max=1e-10, objective_tolerance=0.01): 10 | """ To initialize an EBC object. 11 | 12 | Args: 13 | matrix: a instance of numpy ndarray that represents the original distribution 14 | n_clusters: a list of number of clusters along each dimension 15 | max_iterations: the maximum number of iterations before we stop, default to be 10 16 | jitter_max: a small amount of value to add to probabilities to break ties when doing cluster assignments, default to be 1e-10 17 | objective_tolerance: the threshold of difference between two successive objective values for us to stop 18 | 19 | Return: 20 | None 21 | """ 22 | if not isinstance(matrix, np.ndarray): 23 | raise Exception("Matrix argument to EBC2D needs to be a numpy ndarray.") 24 | 25 | np.testing.assert_approx_equal(matrix.sum(), 1.0, significant=3, err_msg= \ 26 | 'Matrix elements does not sum to 1. 
Please normalize your matrix.') 27 | 28 | self.pXY = matrix # p(X,Y) 29 | self.N = self.pXY.shape 30 | self.cX = np.zeros(self.N[0], dtype=np.int) # cluster assignment along X axis, C(X) 31 | self.cY = np.zeros(self.N[1], dtype=np.int) # C(Y) 32 | self.K = n_clusters # number of clusters along x and y axis 33 | self.max_iters = max_iterations 34 | 35 | self.pX = np.empty(self.N[0]) # marginal probabilities, p(X) 36 | self.pY = np.empty(self.N[1]) # marginal probabilities, p(Y) 37 | 38 | self.qXhatYhat = np.zeros(tuple(self.K), 39 | dtype=np.float32) # the approx probability distribution after clustering 40 | self.qXhat = np.zeros(self.K[0]) # q(X') 41 | self.qYhat = np.zeros(self.K[1]) # q(Y') 42 | self.qX_xhat = np.zeros(self.N[0]) # q(X|X') 43 | self.qY_yhat = np.zeros(self.N[1]) # q(Y|Y') 44 | 45 | self.jitter_max = jitter_max 46 | self.objective_tolerance = objective_tolerance 47 | 48 | def run(self, assigned_clusters=None, verbose=True): 49 | """ To run the ebc algorithm. 50 | 51 | Args: 52 | assigned_clusters: an optional list of list representing the initial assignment of clusters along each dimension. 53 | 54 | Return: 55 | cXY: a list of lists of cluster assignments along each dimension e.g. [C(X), C(Y), ...] 56 | objective: the final objective value 57 | max_it: the number of iterations that the algorithm has run 58 | """ 59 | if verbose: print "Running EBC2D on a 2-d matrix with size %s ..." % str(self.pXY.shape) 60 | # Step 1: initialization steps 61 | self.pX, self.pY = self.calculate_marginals(self.pXY) 62 | if assigned_clusters and len(assigned_clusters) == 2: 63 | if verbose: print "Using specified clusters, with cluster number on each axis: %s ..." % self.K 64 | self.cX = np.asarray(assigned_clusters[0]) 65 | self.cY = np.asarray(assigned_clusters[1]) 66 | else: 67 | if verbose: print "Randomly initializing clusters, with cluster number on each axis: %s ..." 
% self.K 68 | self.cX, self.cY = self.initialize_cluster_centers(self.pXY, self.K) 69 | 70 | # Step 2: calculate cluster joint and marginal distributions 71 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cX, self.cY, self.K, self.pXY) 72 | self.qXhat, self.qYhat = self.calculate_marginals(self.qXhatYhat) 73 | self.qX_xhat, self.qY_yhat = self.calculate_conditionals(self.cX, self.cY, self.pX, self.pY, self.qXhat, 74 | self.qYhat) 75 | 76 | # Step 3: iterate, recalculating distributions 77 | last_objective = objective = 1e10 78 | for it in range(self.max_iters): 79 | if verbose: sys.stdout.write("--> Running iteration %d " % (it + 1)); sys.stdout.flush() 80 | # compute row/column clusters 81 | for axis in range(2): 82 | if axis == 0: 83 | self.cX = self.compute_row_clusters(self.pXY, self.qXhatYhat, self.qXhat, self.qY_yhat, self.cY) 84 | else: 85 | self.cY = self.compute_col_clusters(self.pXY, self.qXhatYhat, self.qYhat, self.qX_xhat, self.cX) 86 | if verbose: sys.stdout.write("."); sys.stdout.flush() 87 | self.qXhatYhat = self.calculate_joint_cluster_distribution(self.cX, self.cY, self.K, self.pXY) 88 | self.qXhat, self.qYhat = self.calculate_marginals(self.qXhatYhat) 89 | self.qX_xhat, self.qY_yhat = self.calculate_conditionals(self.cX, self.cY, self.pX, self.pY, self.qXhat, 90 | self.qYhat) 91 | 92 | objective = self.calculate_objective() 93 | if verbose: sys.stdout.write(" objective value = %f\n" % (objective)) 94 | if abs(objective - last_objective) < self.objective_tolerance: 95 | if verbose: print "EBC2D finished in %d iterations, with final objective value %.4f" % ( 96 | it + 1, objective) 97 | return [self.cX.tolist(), self.cY.tolist()], objective, it + 1 98 | last_objective = objective 99 | if verbose: print "EBC2D finished in %d iterations, with final objective value %.4f" % ( 100 | self.max_iters, objective) 101 | return [self.cX.tolist(), self.cY.tolist()], objective, self.max_iters 102 | 103 | def compute_row_clusters(self, pXY, qXhatYhat, qXhat, qY_yhat, cY): 104 | """ Compute the best row cluster assignment, given all the distributions and clusters on y axis. 
105 | 106 | Args: 107 | pXY: the original probability distribution matrix 108 | qXhatYhat: the joint distribution over the clusters 109 | qXhat: the marginal distributions of qXhatYhat 110 | qY_yhat: the distribution conditioned on the y axis clustering in a list 111 | cY: current cluster assignments along y axis 112 | 113 | Return: 114 | Best row cluster assignment as a list 115 | """ 116 | nrow, nc_row = pXY.shape[0], qXhat.shape[0] 117 | dPQ = np.empty((nrow, nc_row)) 118 | # Step 1: generate q(y|x'): |x'| x |y| 119 | # - first expand q(x'y') to |x'| x |y| using clustering information cY 120 | expanded_qXhatYhat = np.nan_to_num((qXhatYhat.T / qXhat).T[:, cY]) 121 | qY_xhat = expanded_qXhatYhat * qY_yhat 122 | # Step 2: loop through all clusters 123 | for i in range(nc_row): 124 | for j in range(nrow): 125 | pXY_row = pXY[j, :] 126 | with np.errstate(divide='ignore', invalid='ignore'): 127 | log_matrix = np.log(pXY_row / qY_xhat[i, :]) 128 | log_matrix[log_matrix == -np.inf] = 0 129 | log_matrix = np.nan_to_num(log_matrix) 130 | dPQ[j, i] = pXY_row.dot(log_matrix.T) 131 | 132 | dPQ += self.jitter_max * np.random.mtrand.random_sample(dPQ.shape) 133 | C = dPQ.argmin(1) 134 | self.ensure_correct_number_clusters(C, nc_row) 135 | return C 136 | 137 | def compute_col_clusters(self, pXY, qXhatYhat, qYhat, qX_xhat, cX): 138 | """ Compute the best column cluster assignment, given all the distributions and clusters on x axis. """ 139 | ncol, nc_col = pXY.shape[1], qYhat.shape[0] 140 | dPQ = np.empty((nc_col, ncol)) 141 | expanded_qXhatYhat = np.nan_to_num((qXhatYhat / qYhat)[cX, :]) 142 | qX_yhat = expanded_qXhatYhat.T * qX_xhat 143 | for i in range(nc_col): 144 | for j in range(ncol): 145 | pXY_col = pXY[:, j].T 146 | with np.errstate(divide='ignore', invalid='ignore'): 147 | log_matrix = np.log(pXY_col / qX_yhat[i, :]) 148 | log_matrix[log_matrix == -np.inf] = 0 149 | log_matrix = np.nan_to_num(log_matrix) 150 | dPQ[i, j] = pXY_col.dot(log_matrix.T) 151 | dPQ += self.jitter_max * np.random.mtrand.random_sample(dPQ.shape) 152 | C = dPQ.argmin(0) 153 | self.ensure_correct_number_clusters(C, nc_col) 154 | return C 155 | 156 | def calculate_marginals(self, pXY): 157 | """ Calculate the marginal probabilities given a joint distribution. 158 | 159 | Args: 160 | pXY: distribution over which marginals are calculated. 161 | """ 162 | pX = pXY.sum(1) # sum along the y dimension, note that the dimension index should be reverse 163 | pY = pXY.sum(0) # sum along the x dimension 164 | return np.squeeze(np.asarray(pX)), np.squeeze(np.asarray(pY)) # return a numpy array 165 | 166 | def calculate_conditionals(self, cX, cY, pX, pY, qXhat, qYhat): 167 | """ Calculate the conditional marginal distributions given the clustering distribution, i.e. q(X|X'). 168 | 169 | Args: 170 | cX, cY: current cluster assignments 171 | pX, pY: marginal distributions over original data matrix 172 | qXhat, qYhat: marginal distributions over cluster joint distribution 173 | 174 | Return: 175 | qX_xhat, qY_yhat: conditional marginal distributions for each axis. 
176 | """ 177 | with np.errstate(divide='ignore', invalid='ignore'): 178 | qX_xhat = pX / qXhat[cX] # qX_xhat is a N[0]-size vector here 179 | qY_yhat = pY / qYhat[cY] # note that it could be problematic if cX is not a int vector 180 | qX_xhat[qX_xhat == np.inf] = 0 181 | qY_yhat[qY_yhat == np.inf] = 0 182 | return qX_xhat, qY_yhat # want a Nx1 array-like matrix 183 | 184 | def calculate_joint_cluster_distribution(self, cX, cY, K, pXY): 185 | """ Calculate the joint cluster distribution q(X',Y') = p(X',Y') using the current prob distribution and 186 | cluster assignments. (Here we use X' to denote X_hat) 187 | 188 | Args: 189 | cX, cY: current cluster assignments for each axis 190 | K: numbers of clusters along each axis 191 | pXY: original probability distribution matrix 192 | 193 | Return: 194 | qXhatYhat: the joint cluster distribution 195 | """ 196 | nc_row, nc_col = K # num of clusters along row and col 197 | qXhatYhat = np.zeros(K) 198 | # itm_matrix = csr_matrix((nc_row, pXY.shape[1])) # nc_row * col sparse intermidiate matrix 199 | itm_matrix = np.empty((nc_row, pXY.shape[1])) 200 | for i in range(nc_row): 201 | itm_matrix[i, :] = pXY[np.where(cX == i)[0], :].sum(0) 202 | for i in range(nc_col): 203 | qXhatYhat[:, i] = itm_matrix[:, np.where(cY == i)[0]].sum(1).flatten() 204 | return qXhatYhat 205 | 206 | def initialize_cluster_centers(self, pXY, K): 207 | """ Initializes the cluster assignments along each axis, by first selecting k centers, 208 | and then map each row to its closet center under cosine similarity. 209 | 210 | Args: 211 | pXY: original data matrix 212 | K: numbers of clusters desired in each dimension 213 | 214 | Return: 215 | cX, cY: a list of cluster id that the current index in the current axis is assigned to. 216 | """ 217 | # For x axis 218 | centers = pXY[random.sample(range(pXY.shape[0]), K[0]), :] # randomly select clustering centers 219 | cX = self.assign_clusters(pXY, centers, axis=0) 220 | self.ensure_correct_number_clusters(cX, K[0]) 221 | # For y axis 222 | centers = pXY[:, random.sample(range(pXY.shape[1]), K[1])] # randomly select clustering centers 223 | cY = self.assign_clusters(pXY, centers, axis=1) 224 | self.ensure_correct_number_clusters(cY, K[1]) 225 | return cX, cY # return a numpy array 226 | 227 | def assign_clusters(self, pXY, centers, axis): 228 | """ Assign each row/col to clusters given cluster centers on this axis. """ 229 | cluster_num = centers.shape[axis] 230 | scores = np.zeros(shape=(pXY.shape[axis], cluster_num)) 231 | for i in range(cluster_num): 232 | if axis == 0: 233 | center = centers[i, :] 234 | score_i = pXY * center 235 | else: 236 | center = centers[:, i] 237 | score_i = pXY.T * center 238 | score_i = score_i.sum(1) # calculate u.v 239 | center_length = np.linalg.norm(center) # get |v| 240 | score_i = score_i / center_length # get u.v/|v| 241 | scores[:, i] = score_i.flatten() 242 | scores += self.jitter_max * np.random.mtrand.random_sample(scores.shape) 243 | C = np.argmax(scores, 1) 244 | return C 245 | 246 | def calculate_objective(self): 247 | """ Calculate the KL-divergence between p(X,Y) and q(X,Y). 248 | Here q(x,y) can be written as p(x',y')*p(x|x')*p(y|y'). 
""" 249 | # Here I cannot vectorize the computation 250 | objective = .0 251 | x_indices, y_indices = np.nonzero(self.pXY) 252 | # compute values for all useful elements in qXY 253 | for i in range(len(x_indices)): 254 | x_idx, y_idx = x_indices[i], y_indices[i] 255 | v = self.pXY[x_idx, y_idx] 256 | c_x, c_y = self.cX[x_idx], self.cY[y_idx] 257 | v_qXY = self.qX_xhat[x_idx] * self.qY_yhat[y_idx] * self.qXhatYhat[c_x, c_y] 258 | objective += v * np.log(v / v_qXY) 259 | return objective 260 | 261 | def ensure_correct_number_clusters(self, C, expected_K): 262 | """ To ensure a cluster assignment actually has the expected total number of clusters. 263 | 264 | Args: 265 | cXYi: the input cluster assignment 266 | expected_K: expected number of clusters on this axis 267 | 268 | Return: 269 | None. The assignment will be changed in place. 270 | """ 271 | clusters_unique = np.unique(C) 272 | num_clusters = clusters_unique.shape[0] 273 | if num_clusters == expected_K: 274 | return 275 | for c in range(expected_K): 276 | if num_clusters < c + 1 or clusters_unique[c] != c: # no element assigned to c 277 | idx = random.randint(0, C.shape[0] - 1) 278 | C[idx] = c 279 | self.ensure_correct_number_clusters(C, expected_K) 280 | 281 | 282 | def get_matrix_from_data(data): 283 | """ Read the data from a list and construct a scipy sparse dok_matrix. If 'data' is not a list, simply return. 284 | 285 | Each element of the data list should be a list, and should have the following form: 286 | [feature1, feature2, ..., feature dim, value] 287 | """ 288 | feature_ids = defaultdict(lambda: defaultdict(int)) 289 | for d in data: 290 | location = [] 291 | for i in range(len(d) - 1): 292 | f_i = d[i] 293 | if f_i not in feature_ids[i]: 294 | feature_ids[i][f_i] = len(feature_ids[i]) # new index is size of dict 295 | location.append(feature_ids[i][f_i]) 296 | nrow = len(feature_ids[0]) 297 | ncol = len(feature_ids[1]) 298 | m = np.zeros((nrow, ncol), dtype=np.float32) 299 | for d in data: 300 | r = feature_ids[0][d[0]] 301 | c = feature_ids[1][d[1]] 302 | value = float(d[2]) 303 | if value != 0.0: 304 | m[r, c] = value 305 | # normalize the matrix 306 | m = m / m.sum() 307 | return m 308 | -------------------------------------------------------------------------------- /matrix.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | from operator import itemgetter 3 | from random import shuffle 4 | 5 | 6 | class SparseMatrix: 7 | """ An implementation of sparse matrix that is used by the ITCC and EBC algorithm. """ 8 | 9 | def __init__(self, N): 10 | """ Initialize the sparse matrix. 11 | 12 | Args: 13 | N: the size of the matrix on each axis in a list-like data structure 14 | """ 15 | self.dim = len(N) # dimensionality of matrix 16 | self.nonzero_elements = {} 17 | self.N = N 18 | # feature_ids should be a map from feature name to the corresponding index. 19 | # For example, in a 2D matrix, each feature corresponds to a specific row or column. 20 | self.feature_ids = defaultdict(lambda: defaultdict(int)) 21 | 22 | def read_data(self, data): 23 | """ Read the data from a list and populate the matrix. If 'data' is not a list, simply return. 
24 | 25 | Args: 26 | data: each element of the data list should be a list, and should have the following form: 27 | [feature1, feature2, ..., feature dim, value] 28 | """ 29 | if not isinstance(data, list): # we expect a list of data points 30 | return 31 | for d in data: 32 | location = [] 33 | for i in range(len(d) - 1): 34 | f_i = d[i] 35 | if f_i not in self.feature_ids[i]: 36 | self.feature_ids[i][f_i] = len(self.feature_ids[i]) # new index is size of dict 37 | location.append(self.feature_ids[i][f_i]) 38 | value = float(d[len(d) - 1]) 39 | if value != 0.0: 40 | self.nonzero_elements[tuple(location)] = value 41 | 42 | def get(self, coordinates): 43 | """ Get an element of the sparse matrix. 44 | 45 | Args: 46 | coordinates: indices of the element as a tuple 47 | 48 | Return: 49 | the element value 50 | """ 51 | if coordinates in self.nonzero_elements: 52 | return self.nonzero_elements[coordinates] 53 | return 0.0 54 | 55 | def set(self, coordinates, value): 56 | """ Set the value for an element in the sparse matrix. 57 | 58 | Args: 59 | coordinates: indices of the element as a tuple 60 | value: the element value 61 | """ 62 | self.nonzero_elements[coordinates] = value 63 | 64 | def add_value(self, coordinates, added_value): 65 | """ Add a specific value to an element in the sparse matrix. 66 | 67 | Args: 68 | coordinates: indices of the element as a tuple 69 | added_value: the value to add 70 | """ 71 | if coordinates in self.nonzero_elements: 72 | self.nonzero_elements[coordinates] += added_value 73 | else: 74 | self.nonzero_elements[coordinates] = added_value 75 | 76 | def sum(self): 77 | """ Get the sum of all sparse matrix elements. 78 | 79 | Return: 80 | the sum value 81 | """ 82 | sum_values = 0.0 83 | for v in self.nonzero_elements.values(): 84 | sum_values += v 85 | return sum_values 86 | 87 | def normalize(self): 88 | """ Normalize the sparse matrix such that the elements in the matrix sum up to 1. """ 89 | sum_values = self.sum() 90 | for d in self.nonzero_elements: 91 | self.nonzero_elements[d] /= sum_values 92 | 93 | def shuffle(self): 94 | """ Randomly shuffle the nonzero elements in the original matrix, and return a new matrix with the elements shuffled. 95 | 96 | Return: 97 | a new sparse matrix with all the elements shuffled 98 | """ 99 | self_shuffled = SparseMatrix(self.N) 100 | indices = [] 101 | # Get all the indices of nonzero elements. 
indices is a list of 'dim' lists, each being a list of indices for a specific dimension 102 | for i in range(self.dim): 103 | indices.append([e[i] for e in self.nonzero_elements]) 104 | for i in range(self.dim): 105 | shuffle(indices[i]) 106 | values = [self.nonzero_elements[e] for e in self.nonzero_elements] 107 | shuffle(values) 108 | for j in range(len(self.nonzero_elements)): 109 | self_shuffled.add_value(tuple([indices[i][j] for i in range(self.dim)]), values[j]) 110 | return self_shuffled 111 | 112 | def __str__(self): 113 | value_list = sorted(self.nonzero_elements.items(), key=itemgetter(0), reverse=False) 114 | return "\n".join(["\t".join([str(e) for e in v[0]]) + "\t" + str(v[1]) for v in value_list]) 115 | -------------------------------------------------------------------------------- /resources/matrix-itcc-3d-3clust.tsv: -------------------------------------------------------------------------------- 1 | 0 0 0 1.0 2 | 0 0 1 1.0 3 | 0 1 0 1.0 4 | 0 1 1 1.0 5 | 1 0 0 1.0 6 | 1 0 1 1.0 7 | 1 1 0 1.0 8 | 1 1 1 1.0 9 | 2 2 2 1.0 10 | 2 2 3 1.0 11 | 2 3 2 1.0 12 | 3 2 2 1.0 13 | 2 3 3 1.0 14 | 3 3 2 1.0 15 | 3 2 3 1.0 16 | 3 3 3 1.0 17 | 4 4 4 1.0 18 | 4 4 5 1.0 19 | 4 5 4 1.0 20 | 4 5 5 1.0 21 | 5 4 4 1.0 22 | 5 4 5 1.0 23 | 5 5 4 1.0 24 | 5 5 5 1.0 -------------------------------------------------------------------------------- /resources/matrix-itcc-paper-3clust.tsv: -------------------------------------------------------------------------------- 1 | 0 0 0.083 2 | 0 1 0.083 3 | 0 2 0.00 4 | 0 3 0.00 5 | 0 4 0.00 6 | 0 5 0.00 7 | 1 0 0.083 8 | 1 1 0.083 9 | 1 2 0.00 10 | 1 3 0.00 11 | 1 4 0.00 12 | 1 5 0.00 13 | 2 0 0.00 14 | 2 1 0.00 15 | 2 2 0.083 16 | 2 3 0.083 17 | 2 4 0.00 18 | 2 5 0.00 19 | 3 0 0.00 20 | 3 1 0.00 21 | 3 2 0.083 22 | 3 3 0.083 23 | 3 4 0.00 24 | 3 5 0.00 25 | 4 0 0.00 26 | 4 1 0.00 27 | 4 2 0.00 28 | 4 3 0.00 29 | 4 4 0.083 30 | 4 5 0.083 31 | 5 0 0.00 32 | 5 1 0.00 33 | 5 2 0.00 34 | 5 3 0.00 35 | 5 4 0.083 36 | 5 5 0.083 -------------------------------------------------------------------------------- /resources/matrix-itcc-paper-orig-letters.tsv: -------------------------------------------------------------------------------- 1 | A 0 0.05 2 | A 1 0.05 3 | A 2 0.05 4 | A 3 0.00 5 | A 4 0.00 6 | A 5 0.00 7 | B 0 0.05 8 | B 1 0.05 9 | B 2 0.05 10 | B 3 0.00 11 | B 4 0.00 12 | B 5 0.00 13 | C 0 0.00 14 | C 1 0.00 15 | C 2 0.00 16 | C 3 0.05 17 | C 4 0.05 18 | C 5 0.05 19 | D 0 0.00 20 | D 1 0.00 21 | D 2 0.00 22 | D 3 0.05 23 | D 4 0.05 24 | D 5 0.05 25 | E 0 0.04 26 | E 1 0.04 27 | E 2 0.00 28 | E 3 0.04 29 | E 4 0.04 30 | E 5 0.04 31 | F 0 0.04 32 | F 1 0.04 33 | F 2 0.04 34 | F 3 0.00 35 | F 4 0.04 36 | F 5 0.04 -------------------------------------------------------------------------------- /resources/matrix-itcc-paper-orig.tsv: -------------------------------------------------------------------------------- 1 | 0 0 0.05 2 | 0 1 0.05 3 | 0 2 0.05 4 | 0 3 0.00 5 | 0 4 0.00 6 | 0 5 0.00 7 | 1 0 0.05 8 | 1 1 0.05 9 | 1 2 0.05 10 | 1 3 0.00 11 | 1 4 0.00 12 | 1 5 0.00 13 | 2 0 0.00 14 | 2 1 0.00 15 | 2 2 0.00 16 | 2 3 0.05 17 | 2 4 0.05 18 | 2 5 0.05 19 | 3 0 0.00 20 | 3 1 0.00 21 | 3 2 0.00 22 | 3 3 0.05 23 | 3 4 0.05 24 | 3 5 0.05 25 | 4 0 0.04 26 | 4 1 0.04 27 | 4 2 0.00 28 | 4 3 0.04 29 | 4 4 0.04 30 | 4 5 0.04 31 | 5 0 0.04 32 | 5 1 0.04 33 | 5 2 0.04 34 | 5 3 0.00 35 | 5 4 0.04 36 | 5 5 0.04 -------------------------------------------------------------------------------- /tests/sample-matrix-file.txt: 
-------------------------------------------------------------------------------- 1 | patient Val30Met START_ENTITY|nmod|FAP 2 2 | patient Val30Met START_ENTITY|nmod|END_ENTITY 2 3 | patient Val30Met FAP|compound|END_ENTITY 2 4 | mice R92Q mice|nummod|END_ENTITY 3 5 | mice R92Q mutation|appos|END_ENTITY 2 6 | mice R92Q START_ENTITY|nummod|END_ENTITY 7 7 | mice R91W START_ENTITY|nummod|END_ENTITY 2 8 | mice R90W homozygous|nsubj|START_ENTITY 2 9 | mice R90W +|compound|END_ENTITY 2 10 | mice R90W expression|nmod|END_ENTITY 2 -------------------------------------------------------------------------------- /tests/test_benchmark_ebc.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from matrix import SparseMatrix 4 | from ebc import EBC 5 | 6 | 7 | class TestBenchmarkEBC(unittest.TestCase): 8 | """ Benchmark the EBC code as a unittest, using the sparse matrix data. """ 9 | 10 | def setUp(self): 11 | with open("resources/matrix-ebc-paper-sparse.tsv", "r") as f: 12 | data = [] 13 | for line in f: 14 | sl = line.split("\t") 15 | if len(sl) < 5: # headers 16 | continue 17 | data.append([sl[0], sl[2], float(sl[4])]) 18 | 19 | self.matrix = SparseMatrix([14052, 7272]) 20 | self.matrix.read_data(data) 21 | self.matrix.normalize() 22 | 23 | def testEbcOnSparseMatrix(self): 24 | ebc = EBC(self.matrix, [30, 125], 10, 1e-10, 0.01) 25 | cXY, objective, it = ebc.run() 26 | print "objective: ", objective 27 | print "iterations: ", it 28 | self.assertEquals(len(ebc.pXY.nonzero_elements), 29456) 29 | self.assertEquals(len(set(ebc.cXY[0])), 30) 30 | self.assertEquals(len(set(ebc.cXY[1])), 125) 31 | -------------------------------------------------------------------------------- /tests/test_benchmark_ebc2d.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | import ebc2d 4 | from ebc2d import EBC2D 5 | 6 | 7 | class TestBenchmarkEBC2D(unittest.TestCase): 8 | """ Benchmark the EBC2D code as a unittest, using the sparse matrix data. 
""" 9 | 10 | def setUp(self): 11 | with open("resources/matrix-ebc-paper-sparse.tsv", "r") as f: 12 | data = [] 13 | for line in f: 14 | sl = line.split("\t") 15 | if len(sl) < 5: # headers 16 | continue 17 | data.append([sl[0], sl[2], float(sl[4])]) 18 | 19 | self.matrix = ebc2d.get_matrix_from_data(data) 20 | 21 | def testEbc2dOnSparseMatrix(self): 22 | ebc = EBC2D(self.matrix, [30, 125], 10, 1e-10, 0.01) 23 | cXY, objective, it = ebc.run() 24 | print "objective: ", objective 25 | print "iterations: ", it 26 | # self.assertEquals(len(ebc.pXY.nonzero[0]), 29456) 27 | self.assertEquals(len(set(cXY[0])), 30) 28 | self.assertEquals(len(set(cXY[1])), 125) 29 | -------------------------------------------------------------------------------- /tests/test_clusters.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | from ebc import EBC 3 | from matrix import SparseMatrix 4 | 5 | 6 | class TestMatrix(unittest.TestCase): 7 | def setUp(self): 8 | data = [[0, 0, 0, 1.0], 9 | [0, 0, 1, 1.0], 10 | [0, 1, 0, 1.0], 11 | [0, 1, 1, 1.0], 12 | [1, 0, 0, 1.0], 13 | [1, 0, 1, 1.0], 14 | [1, 1, 0, 1.0], 15 | [1, 1, 1, 1.0], 16 | [2, 2, 2, 1.0], 17 | [2, 2, 3, 1.0], 18 | [2, 3, 2, 1.0], 19 | [3, 2, 2, 1.0], 20 | [2, 3, 3, 1.0], 21 | [3, 3, 2, 1.0], 22 | [3, 2, 3, 1.0], 23 | [3, 3, 3, 1.0], 24 | [4, 4, 4, 1.0], 25 | [4, 4, 5, 1.0], 26 | [4, 5, 4, 1.0], 27 | [4, 5, 5, 1.0], 28 | [5, 4, 4, 1.0], 29 | [5, 4, 5, 1.0], 30 | [5, 5, 4, 1.0], 31 | [5, 5, 5, 1.0]] 32 | matrix = SparseMatrix([6, 6, 6]) 33 | matrix.read_data(data) 34 | matrix.normalize() 35 | 36 | ebc = EBC(matrix, [3, 3, 3], 10, 1e-10) 37 | assigned_C = [[0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]] 38 | cXY, objective = ebc.run(assigned_C) 39 | self.assertEquals(cXY, assigned_C) 40 | self.assertAlmostEqual(objective, 0.0) 41 | cXY, objective = ebc.run() # random initialization 42 | self.assertAlmostEqual(objective, 0.0) -------------------------------------------------------------------------------- /tests/test_ebc.py: -------------------------------------------------------------------------------- 1 | from operator import itemgetter 2 | import unittest 3 | 4 | from ebc import EBC 5 | from matrix import SparseMatrix 6 | 7 | 8 | class TestEbc(unittest.TestCase): 9 | def setUp(self): 10 | self.data = [["0", "0", 0.05], 11 | ["0", "1", 0.05], 12 | ["0", "2", 0.05], 13 | ["0", "3", 0.00], 14 | ["0", "4", 0.00], 15 | ["0", "5", 0.00], 16 | ["1", "0", 0.05], 17 | ["1", "1", 0.05], 18 | ["1", "2", 0.05], 19 | ["1", "3", 0.00], 20 | ["1", "4", 0.00], 21 | ["1", "5", 0.00], 22 | ["2", "0", 0.00], 23 | ["2", "1", 0.00], 24 | ["2", "2", 0.00], 25 | ["2", "3", 0.05], 26 | ["2", "4", 0.05], 27 | ["2", "5", 0.05], 28 | ["3", "0", 0.00], 29 | ["3", "1", 0.00], 30 | ["3", "2", 0.00], 31 | ["3", "3", 0.05], 32 | ["3", "4", 0.05], 33 | ["3", "5", 0.05], 34 | ["4", "0", 0.04], 35 | ["4", "1", 0.04], 36 | ["4", "2", 0.00], 37 | ["4", "3", 0.04], 38 | ["4", "4", 0.04], 39 | ["4", "5", 0.04], 40 | ["5", "0", 0.04], 41 | ["5", "1", 0.04], 42 | ["5", "2", 0.04], 43 | ["5", "3", 0.00], 44 | ["5", "4", 0.04], 45 | ["5", "5", 0.04]] 46 | self.matrix = SparseMatrix([6, 6]) 47 | self.matrix.read_data(self.data) 48 | 49 | def testDataLoad(self): 50 | self.assertEquals(sorted(self.matrix.nonzero_elements.items(), key=itemgetter(0)), 51 | [((0, 0), 0.05), ((0, 1), 0.05), ((0, 2), 0.05), ((1, 0), 0.05), ((1, 1), 0.05), 52 | ((1, 2), 0.05), ((2, 3), 0.05), ((2, 4), 0.05), ((2, 5), 0.05), ((3, 3), 0.05), 53 | ((3, 4), 
0.05), ((3, 5), 0.05), ((4, 0), 0.04), ((4, 1), 0.04), ((4, 3), 0.04), 54 | ((4, 4), 0.04), ((4, 5), 0.04), ((5, 0), 0.04), ((5, 1), 0.04), ((5, 2), 0.04), 55 | ((5, 4), 0.04), ((5, 5), 0.04)]) 56 | 57 | def testOldMatrix(self): 58 | with open("resources/matrix-ebc-paper-dense.tsv", "r") as f: 59 | data = [] 60 | for line in f: 61 | sl = line.split("\t") 62 | if len(sl) < 5: # headers 63 | continue 64 | data.append([sl[0], sl[2], float(sl[4])]) 65 | 66 | matrix = SparseMatrix([3514, 1232]) 67 | matrix.read_data(data) 68 | matrix.normalize() 69 | ebc = EBC(matrix, [30, 125], 10, 1e-10, 0.01) 70 | cXY, objective, it = ebc.run() 71 | print "objective: ", objective 72 | print "iterations: ", it 73 | self.assertEquals(len(ebc.pXY.nonzero_elements), 10007) 74 | self.assertEquals(len(set(ebc.cXY[0])), 30) 75 | self.assertEquals(len(set(ebc.cXY[1])), 125) 76 | 77 | def testOldMatrix3d(self): 78 | with open("resources/matrix-ebc-paper-dense-3d.tsv", "r") as f: 79 | data = [] 80 | for line in f: 81 | sl = line.split("\t") 82 | data.append([sl[0], sl[1], sl[2], float(sl[3])]) 83 | 84 | matrix = SparseMatrix([756, 996, 1232]) 85 | matrix.read_data(data) 86 | matrix.normalize() 87 | ebc = EBC(matrix, [30, 30, 10], 100, 1e-10, 0.01) 88 | cXY, objective, it = ebc.run() 89 | print "objective: ", objective 90 | print "iterations: ", it 91 | self.assertEquals(len(ebc.pXY.nonzero_elements), 10007) 92 | self.assertEquals(len(set(ebc.cXY[0])), 30) 93 | self.assertEquals(len(set(ebc.cXY[1])), 30) 94 | self.assertEquals(len(set(ebc.cXY[2])), 10) 95 | 96 | def test3DMatrix(self): 97 | data = [[0, 0, 0, 1.0], 98 | [0, 0, 1, 1.0], 99 | [0, 1, 0, 1.0], 100 | [0, 1, 1, 1.0], 101 | [1, 0, 0, 1.0], 102 | [1, 0, 1, 1.0], 103 | [1, 1, 0, 1.0], 104 | [1, 1, 1, 1.0], 105 | [2, 2, 2, 1.0], 106 | [2, 2, 3, 1.0], 107 | [2, 3, 2, 1.0], 108 | [3, 2, 2, 1.0], 109 | [2, 3, 3, 1.0], 110 | [3, 3, 2, 1.0], 111 | [3, 2, 3, 1.0], 112 | [3, 3, 3, 1.0], 113 | [4, 4, 4, 1.0], 114 | [4, 4, 5, 1.0], 115 | [4, 5, 4, 1.0], 116 | [4, 5, 5, 1.0], 117 | [5, 4, 4, 1.0], 118 | [5, 4, 5, 1.0], 119 | [5, 5, 4, 1.0], 120 | [5, 5, 5, 1.0]] 121 | matrix = SparseMatrix([6, 6, 6]) 122 | matrix.read_data(data) 123 | matrix.normalize() 124 | ebc = EBC(matrix, [3, 3, 3], 10, 1e-10, 0.01) 125 | assigned_C = [[0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]] 126 | cXY, objective, it = ebc.run(assigned_C) 127 | self.assertEquals(cXY, assigned_C) 128 | self.assertAlmostEqual(objective, 0.0) 129 | self.assertEquals(it, 1) 130 | 131 | for i in range(100): 132 | cXY, objective, it = ebc.run() # random initialization 133 | print cXY, objective, it 134 | -------------------------------------------------------------------------------- /tests/test_matrix.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from matrix import SparseMatrix 4 | 5 | 6 | class TestMatrix(unittest.TestCase): 7 | def setUp(self): 8 | self.data = [l.split('\t') for l in open('tests/sample-matrix-file.txt', 'r').readlines()] 9 | self.matrix = SparseMatrix([2, 4, 9]) 10 | self.matrix.read_data(self.data) 11 | 12 | def testMatrixInit(self): 13 | self.assertEquals(self.matrix.nonzero_elements[(1, 3, 7)], 2.0) 14 | self.assertEquals(self.matrix.nonzero_elements[(0, 0, 0)], 2.0) 15 | self.assertEquals(self.matrix.nonzero_elements[(0, 0, 2)], 2.0) 16 | self.assertEquals(self.matrix.nonzero_elements[(1, 1, 5)], 7.0) 17 | self.assertEquals(self.matrix.nonzero_elements[(1, 1, 3)], 3.0) 18 | 
        self.assertEquals(self.matrix.nonzero_elements[(1, 3, 7)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(0, 0, 0)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(0, 0, 2)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 1, 5)], 7.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 1, 3)], 3.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 3, 6)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 3, 8)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(0, 0, 1)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 1, 4)], 2.0)
        self.assertEquals(self.matrix.nonzero_elements[(1, 2, 5)], 2.0)
        self.assertEquals(len(self.matrix.nonzero_elements), 10)
        self.assertEquals(self.matrix.feature_ids[0], {'mice': 1, 'patient': 0})
        self.assertEquals(self.matrix.feature_ids[1], {'R92Q': 1, 'R91W': 2, 'Val30Met': 0, 'R90W': 3})
        self.assertEquals(self.matrix.feature_ids[2], {'START_ENTITY|nmod|END_ENTITY': 1,
                                                       'START_ENTITY|nummod|END_ENTITY': 5,
                                                       'FAP|compound|END_ENTITY': 2,
                                                       'expression|nmod|END_ENTITY': 8,
                                                       '+|compound|END_ENTITY': 7,
                                                       'mice|nummod|END_ENTITY': 3,
                                                       'homozygous|nsubj|START_ENTITY': 6,
                                                       'mutation|appos|END_ENTITY': 4,
                                                       'START_ENTITY|nmod|FAP': 0})

    def testShuffle(self):
        shuffled_matrix = self.matrix.shuffle()
        # shuffling permutes locations but must preserve the number of
        # nonzero elements and the set of stored values
        self.assertEquals(len(shuffled_matrix.nonzero_elements), len(self.matrix.nonzero_elements))
        self.assertEquals(set(shuffled_matrix.nonzero_elements.values()), set(self.matrix.nonzero_elements.values()))
        print("shuffled matrix elements: ", shuffled_matrix.nonzero_elements)
--------------------------------------------------------------------------------
/tests/test_sanitycheck.py:
--------------------------------------------------------------------------------
import unittest

import numpy as np

from matrix import SparseMatrix
from ebc import EBC
import ebc2d
from ebc2d import EBC2D


class TestSanityCheck(unittest.TestCase):
    """ Do a sanity check for the EBC code, using the data from the original ITCC paper.
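
    The values asserted below are the entries of the approximating joint
    distribution q(x, y) = p(x-hat, y-hat) * p(x | x-hat) * p(y | y-hat)
    (Dhillon et al., 2003), reconstructed here from a fixed cluster
    assignment so that every entry can be checked against the paper.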
""" 13 | 14 | def setUp(self): 15 | with open("resources/matrix-itcc-paper-orig.tsv", "r") as f: 16 | data = [l.split('\t') for l in f] 17 | 18 | self.matrix = SparseMatrix([6, 6]) 19 | self.matrix.read_data(data) 20 | self.matrix.normalize() 21 | 22 | def cartesian(self, arrays, out=None): 23 | arrays = [np.asarray(x) for x in arrays] 24 | dtype = arrays[0].dtype 25 | 26 | n = np.prod([x.size for x in arrays]) 27 | if out is None: 28 | out = np.zeros([n, len(arrays)], dtype=dtype) 29 | 30 | m = n / arrays[0].size 31 | out[:, 0] = np.repeat(arrays[0], m) 32 | if arrays[1:]: 33 | self.cartesian(arrays[1:], out=out[0:m, 1:]) 34 | for j in xrange(1, arrays[0].size): 35 | out[j * m:(j + 1) * m, 1:] = out[0:m, 1:] 36 | return out 37 | 38 | def testEbcOnSparseMatrix(self): 39 | ebc = EBC(self.matrix, [3, 2], 10, 1e-10, 0.01) 40 | cXY, objective, it = ebc.run(verbose=False) 41 | print "--> ebc" 42 | print "objective: ", objective 43 | print "iterations: ", it 44 | 45 | ebc = EBC(self.matrix, [3, 2], 10, 1e-10, 0.01) 46 | ebc.run(assigned_clusters=[[2, 0, 1, 1, 2, 2], [0, 0, 1, 0, 1, 1]], verbose=False) 47 | indices = [range(N_d) for N_d in ebc.pXY.N] 48 | index_list = self.cartesian(indices) 49 | approx_distribution = {} 50 | for location in index_list: 51 | q = 1.0 52 | c_location = [] 53 | for i in range(len(location)): 54 | c_i = ebc.cXY[i][location[i]] 55 | c_location.append(c_i) 56 | q *= ebc.qXxHat[i][location[i]] 57 | q *= ebc.qXhatYhat.get(tuple(c_location)) 58 | approx_distribution[tuple(location)] = q 59 | 60 | self.assertAlmostEquals(approx_distribution[(0, 0)], 0.054) 61 | self.assertAlmostEquals(approx_distribution[(0, 1)], 0.054) 62 | self.assertAlmostEquals(approx_distribution[(0, 2)], 0.042) 63 | self.assertAlmostEquals(approx_distribution[(0, 3)], 0.0) 64 | self.assertAlmostEquals(approx_distribution[(0, 4)], 0.0) 65 | self.assertAlmostEquals(approx_distribution[(0, 5)], 0.0) 66 | self.assertAlmostEquals(approx_distribution[(1, 0)], 0.054) 67 | self.assertAlmostEquals(approx_distribution[(1, 1)], 0.054) 68 | self.assertAlmostEquals(approx_distribution[(1, 2)], 0.042) 69 | self.assertAlmostEquals(approx_distribution[(1, 3)], 0.0) 70 | self.assertAlmostEquals(approx_distribution[(1, 4)], 0.0) 71 | self.assertAlmostEquals(approx_distribution[(1, 5)], 0.0) 72 | self.assertAlmostEquals(approx_distribution[(2, 0)], 0.0) 73 | self.assertAlmostEquals(approx_distribution[(2, 1)], 0.0) 74 | self.assertAlmostEquals(approx_distribution[(2, 2)], 0.0) 75 | self.assertAlmostEquals(approx_distribution[(2, 3)], 0.042) 76 | self.assertAlmostEquals(approx_distribution[(2, 4)], 0.054) 77 | self.assertAlmostEquals(approx_distribution[(2, 5)], 0.054) 78 | self.assertAlmostEquals(approx_distribution[(3, 0)], 0.0) 79 | self.assertAlmostEquals(approx_distribution[(3, 1)], 0.0) 80 | self.assertAlmostEquals(approx_distribution[(3, 2)], 0.0) 81 | self.assertAlmostEquals(approx_distribution[(3, 3)], 0.042) 82 | self.assertAlmostEquals(approx_distribution[(3, 4)], 0.054) 83 | self.assertAlmostEquals(approx_distribution[(3, 5)], 0.054) 84 | self.assertAlmostEquals(approx_distribution[(4, 0)], 0.036) 85 | self.assertAlmostEquals(approx_distribution[(4, 1)], 0.036) 86 | self.assertAlmostEquals(approx_distribution[(4, 2)], 0.028) 87 | self.assertAlmostEquals(approx_distribution[(4, 3)], 0.028) 88 | self.assertAlmostEquals(approx_distribution[(4, 4)], 0.036) 89 | self.assertAlmostEquals(approx_distribution[(4, 5)], 0.036) 90 | self.assertAlmostEquals(approx_distribution[(5, 0)], 0.036) 91 | 
    def testEbc2dOnSparseMatrix(self):
        with open("resources/matrix-itcc-paper-orig.tsv", "r") as f:
            data = [l.split('\t') for l in f]
        m = ebc2d.get_matrix_from_data(data)
        # run without assigned clusters
        ebc = EBC2D(m, [3, 2], 10, 1e-10, 0.01)
        cXY, objective, it = ebc.run(verbose=False)
        print("--> ebc2d")
        print("objective: ", objective)
        print("iterations: ", it)

        # run with assigned clusters
        ebc = EBC2D(m, [3, 2], 10, 1e-10, 0.01)
        cXY, objective, it = ebc.run(assigned_clusters=[[2, 0, 1, 1, 2, 2], [0, 0, 1, 0, 1, 1]], verbose=False)
        indices = [range(N_d) for N_d in ebc.pXY.shape]
        index_list = self.cartesian(indices)
        approx_distribution = {}
        qX_xhat = [ebc.qX_xhat, ebc.qY_yhat]  # per-axis conditionals q(x | x-hat), q(y | y-hat)
        for location in index_list:
            q = 1.0
            c_location = []
            for i in range(len(location)):
                c_i = cXY[i][location[i]]
                c_location.append(c_i)
                q *= qX_xhat[i][location[i]]
            q *= ebc.qXhatYhat[c_location[0], c_location[1]]
            approx_distribution[tuple(location)] = q

        self.assertAlmostEquals(approx_distribution[(0, 0)], 0.054)
        self.assertAlmostEquals(approx_distribution[(0, 1)], 0.054)
        self.assertAlmostEquals(approx_distribution[(0, 2)], 0.042)
        self.assertAlmostEquals(approx_distribution[(0, 3)], 0.0)
        self.assertAlmostEquals(approx_distribution[(0, 4)], 0.0)
        self.assertAlmostEquals(approx_distribution[(0, 5)], 0.0)
        self.assertAlmostEquals(approx_distribution[(1, 0)], 0.054)
        self.assertAlmostEquals(approx_distribution[(1, 1)], 0.054)
        self.assertAlmostEquals(approx_distribution[(1, 2)], 0.042)
        self.assertAlmostEquals(approx_distribution[(1, 3)], 0.0)
        self.assertAlmostEquals(approx_distribution[(1, 4)], 0.0)
        self.assertAlmostEquals(approx_distribution[(1, 5)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 0)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 1)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 2)], 0.0)
        self.assertAlmostEquals(approx_distribution[(2, 3)], 0.042)
        self.assertAlmostEquals(approx_distribution[(2, 4)], 0.054)
        self.assertAlmostEquals(approx_distribution[(2, 5)], 0.054)
        self.assertAlmostEquals(approx_distribution[(3, 0)], 0.0)
        self.assertAlmostEquals(approx_distribution[(3, 1)], 0.0)
        self.assertAlmostEquals(approx_distribution[(3, 2)], 0.0)
        self.assertAlmostEquals(approx_distribution[(3, 3)], 0.042)
        self.assertAlmostEquals(approx_distribution[(3, 4)], 0.054)
        self.assertAlmostEquals(approx_distribution[(3, 5)], 0.054)
        self.assertAlmostEquals(approx_distribution[(4, 0)], 0.036)
        self.assertAlmostEquals(approx_distribution[(4, 1)], 0.036)
        self.assertAlmostEquals(approx_distribution[(4, 2)], 0.028)
        self.assertAlmostEquals(approx_distribution[(4, 3)], 0.028)
        self.assertAlmostEquals(approx_distribution[(4, 4)], 0.036)
        self.assertAlmostEquals(approx_distribution[(4, 5)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 0)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 1)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 2)], 0.028)
        self.assertAlmostEquals(approx_distribution[(5, 3)], 0.028)
        self.assertAlmostEquals(approx_distribution[(5, 4)], 0.036)
        self.assertAlmostEquals(approx_distribution[(5, 5)], 0.036)
--------------------------------------------------------------------------------