├── .gitignore
├── README.rst
├── defiNETti.py
├── example
│   ├── genetic_map_GRCh37_chr1_truncated.txt
│   ├── run_example.py
│   └── simulator_example.py
└── setup.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | *.pyc
3 | 
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | 
2 | ================
3 | defiNETti
4 | ================
5 | 
6 | defiNETti is a program for performing Bayesian inference on exchangeable
7 | (or permutation-invariant) data via deep learning. In particular, it is
8 | well-suited for population genetic applications.
9 | 
10 | .. contents:: :depth: 2
11 | 
12 | Installation instructions
13 | =========================
14 | Prerequisites:
15 | 
16 | 1. A scientific distribution of Python 2.7 or 3, e.g. `Anaconda <http://continuum.io/downloads>`_ or `Enthought Canopy <https://www.enthought.com/products/canopy/>`_
17 | 2. Alternatively, a custom installation of pip and the SciPy stack
18 | 
19 | (Optional) Create a virtual environment to store the dependencies.
20 | For Python 2::
21 | 
22 |     $ pip install virtualenv
23 |     $ cd my_project_folder
24 |     $ virtualenv my_project
25 | 
26 | For Python 3::
27 | 
28 |     $ python3 -m venv my_project_folder
29 | 
30 | To activate the virtual environment::
31 | 
32 |     $ source my_project/bin/activate
33 | 
34 | To install, type the following in the top-level directory of defiNETti (where ``setup.py`` lives)::
35 | 
36 |     $ pip install .
37 | 
38 | 
39 | Simulation
40 | ===========
41 | The simulator is a function that returns a single datapoint as a tuple ``(data, label)``:
42 | 
43 | 1. ``data`` - A numpy array (e.g. a genotype matrix, image, or point cloud) of 2 or 3 dimensions.
44 | 2. ``label`` - A 1-dimensional numpy array associated with the particular data array.
45 | 
46 | Simulator Example
47 | -----------------
48 | 
49 | An example simulator for inferring Gaussian parameters::
50 | 
51 |     import numpy as np
52 | 
53 |     def simulator_gaussian():
54 |         mu = np.random.beta(5, 10)
55 |         sigma = np.random.uniform(6, 10)
56 |         data = np.random.normal(mu, sigma, (100, 1))
57 |         label = np.array([mu, sigma])
58 | 
59 |         return (data, label)
60 | 
61 | A more detailed population genetics-specific example is shown in ``example/``. Note that, in principle, the simulator could sample randomly from a fixed dataset if no generative model is available.
62 | 
63 | 
64 | Neural Network
65 | ==============
66 | The neural network in this program is built from the following types of layers:
67 | 
68 | 1. Convolutional Layers - The syntax is ``('conv', <#width>, <#output depth>)``. Note that the height of the image patches is assumed to be 1 to enforce exchangeability.
69 | 2. Fully-connected Layers - The syntax is ``('fc', <#nodes>)``.
70 | 3. Matrix Multiply Layers - The last layer of the ``h_net`` neural network for regression tasks. The syntax is ``('matmul',)``.
71 | 4. Softmax Layers - The last layer of the ``h_net`` neural network for classification tasks. The syntax is ``('softmax',)``.
72 | 
73 | For more information on these layer types see http://cs231n.github.io/convolutional-networks/. Multiple layers are combined in a list, with the first element corresponding to the first layer and so on, e.g. ``[("fc",1024), ("fc",1024), ('softmax',)]``.
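
As a concrete illustration, the following specifications are all valid ``phi_net``, ``g``, and ``h_net`` arguments for ``train()`` (described in the Training section below). The first set mirrors the defaults in ``defiNETti.py``, the second mirrors ``example/run_example.py``; the layer sizes are illustrative rather than prescriptive::

    # Fully-connected classification architecture (the train() defaults)
    phi_net = [("fc", 1024), ("fc", 1024)]          # applied to every row independently
    g       = ("max",)                              # permutation-invariant pooling over rows
    h_net   = [("fc", 512), ("softmax",)]           # softmax last layer => classification

    # Convolutional architecture used in example/run_example.py
    phi_net = [("conv", 5, 32), ("conv", 5, 64)]    # 1 x 5 patches with 32 and 64 output channels
    g       = ("top_k", 1)                          # keep the largest activation per feature
    h_net   = [("fc", 128), ("fc", 128), ("softmax",)]

    # For a regression task, end h_net with ('matmul',) and train with loss="l2"
    h_net   = [("fc", 512), ("matmul",)]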
74 | 
75 | 
76 | 
77 | Training
78 | =========
79 | The train command trains an exchangeable neural network using simulation-on-the-fly. The exchangeable neural network learns a permutation-invariant mapping :math:`f` from the data :math:`X = (x_1, x_2, \ldots, x_n)` to the label, or to the posterior over the label, :math:`\theta`. In order to ensure permutation invariance, the function is decomposed as:
80 | 
81 | .. math::
82 | 
83 |     f(X) = (h \circ g)(\Phi(x_1), \Phi(x_2), \ldots , \Phi(x_n))
84 | 
85 | - :math:`\Phi` - a function parameterized by a neural network that is applied to each row of the input data.
86 | - :math:`g` - a permutation-invariant function (e.g. max, sort, or moments).
87 | - :math:`h` - a function parameterized by a neural network that is applied to the exchangeable feature representation :math:`g(\Phi(x_1), \Phi(x_2), \ldots , \Phi(x_n))`.
88 | 
89 | For regression tasks, the output is an estimate of the label :math:`\hat{\theta} = f(X)`. For classification tasks, the output is a posterior over labels :math:`\mathcal{P}_{\theta} = f(X)`.
90 | 
91 | Arguments
92 | ---------
93 | - ``input_shape`` - A tuple specifying the shape of the data output by ``simulator`` (i.e. :math:`X`). For example, if the data is a ``50 x 50`` image, the shape should be ``(50,50)``. Note that permutation invariance is always enforced along the first dimension, and only 2- or 3-dimensional data are supported.
94 | - ``output_shape`` - A tuple specifying the shape of the label output by ``simulator`` (i.e. :math:`\theta`). For example, if the label is a length-5 continuous vector, the shape should be ``(5,)``. If the label is a discrete variable, the size of the dimension is the number of classes. Note that currently only classification of a single label and 1-dimensional labels are supported.
95 | - ``simulator`` - A function which returns tuples of ``(data, label)`` as described above.
96 | - ``phi_net`` - A neural network parameterizing :math:`\Phi` shown above. The input syntax is described in the Neural Network section above.
97 | - ``g`` - An operation parameterizing the permutation-invariant function :math:`g` shown above. The supported options are ``('max',)``, ``('sort',)``, ``('top_k', <#k>)``, or ``('moments', <#m1>, <#m2>, ...)``.
98 | - ``h_net`` - A neural network parameterizing :math:`h` shown above. The input syntax is the same as for ``phi_net``.
99 | - ``network_function`` - A function of tensorflow operations specifying a custom network. If present, ``phi_net``, ``g``, and ``h_net`` are ignored.
100 | - ``loss`` - The loss function used to train the network: ``"cross-ent"`` for cross-entropy loss, ``"l2"`` for l2 loss, or a user-defined tensorflow function.
101 | - ``accuracy`` - The metric used to report accuracy: ``"classification"`` for 0-1 accuracy, ``None`` to use the loss function as the accuracy, or a user-defined tensorflow function.
102 | - ``num_batches`` - The number of iterations (mini-batches) of training to perform.
103 | - ``batch_size`` - The number of training examples in each batch.
104 | - ``queue_capacity`` - The number of training examples to hold in the queue at once.
105 | - ``verbosity`` - Print the loss and accuracy every ``verbosity`` iterations.
106 | - ``training_threads`` - The number of threads dedicated to training the network.
107 | - ``sim_threads`` - The number of threads dedicated to simulating data.
108 | - ``save_path`` - The base filename under which to save the neural network. If ``None``, the weights are not saved.
109 | - ``training_summary`` - The filename to which a summary of the training procedure is saved. Each line of the file has the format ``<batch count> <loss value> <accuracy>``, written every ``verbosity`` batches. If ``None``, no summary file is created.
110 | - ``logfile`` - Log extra training information to this file. If ``logfile='.'``, logs to STDERR.
111 | 
112 | Note: To include distances in the 3-dimensional use case, the distance vector can simply be padded with a 1 in the second dimension.
113 | Note: Simulators are supplied as ordinary Python callables, as described in the Simulation section above.
114 | Note: Per-batch loss and accuracy values for training curves can be obtained via ``training_summary``.
115 | 
116 | Testing
117 | ========
118 | The test command takes data and a trained neural network and outputs predictions.
119 | 
120 | Arguments
121 | ---------
122 | - ``data`` - A list of numpy arrays on which to run the neural network. The shape of each numpy array should be the same as ``input_shape`` in ``train()``.
123 | - ``model_path`` - Path to the basename under which the network was stored; should be the same as ``save_path`` in ``train()``.
124 | - ``threads`` - Number of threads used for the tensorflow operations.
125 | 
126 | Output
127 | ------
128 | - ``output`` - A numpy array containing the network output for each input. Its dimensions are ``(<number of inputs>, <output dimension>)``.
129 | 
130 | Population Genetic Example
131 | ==========================
132 | A population genetics-specific example can be found in ``example/``. Note that ``msprime`` version 0.4.0 is needed to run this example. This is a simplified version of the experiments in the paper.
133 | 
134 | Quick Start
135 | -----------
136 | To run the example (for Python 3, use ``python3`` instead of ``python``)::
137 | 
138 |     $ cd example
139 |     $ python run_example.py
140 | 
141 | The expected accuracy after the first few hundred batches should be around 80-90%, with a slow, steady increase after that. With 5 threads each for simulation and training, training should take roughly half an hour per 1000 batches. In the paper, we used compute resources that allocated 24 threads each to simulation and training.
142 | 
143 | Additional Details
144 | ------------------
145 | - For inference purposes, we recommend running for around 20000 batches, or until there is clear convergence.
146 | - The speed of the method depends on the number of CPU cores available for simulation. We recommend experimenting with the number of threads dedicated to simulation and training to find the optimal split (make sure the total matches the number of cores available).
147 | - Distances are normalized to be on the order of 0 to 1 for optimization purposes.
148 | - More SNPs than necessary are simulated and then truncated so that the hotspot region is centered.
149 | - A prior over rates is generated from the HapMap recombination map. In the paper, we use windows of the fine-scale recombination map rather than the flat rates used in the example.
150 | - When dealing with missing data, it may be helpful to copy the missingness patterns of the real data.
151 | 
--------------------------------------------------------------------------------
/defiNETti.py:
--------------------------------------------------------------------------------
1 | """Train and test neural networks using exchangeable nets and simulation-on-the-fly
2 | 
3 | If you use this software, please cite:
4 | "A likelihood-free inference framework for population genetic data using exchangeable neural networks"
5 | by Chan, J., Perrone, V., Spence, J.P., Jenkins, P.A., Mathieson, S., and Song, Y.S.
6 | """ 7 | from __future__ import division 8 | import logging,threading,os 9 | import numpy as np 10 | import tensorflow as tf 11 | 12 | 13 | def _dimension_match(shape1, shape2): 14 | if len(shape1) != len(shape2): 15 | return False 16 | for d1,d2 in zip(shape1,shape2): 17 | if d1 != d2: 18 | return False 19 | return True 20 | 21 | def _total_dim(dims): 22 | if type(dims) == int: 23 | return dims 24 | to_return = 1 25 | for d in dims: 26 | to_return *= d 27 | return to_return 28 | 29 | class _DimensionException(Exception): 30 | pass 31 | 32 | 33 | def _layer_to_weights(prev_dim, layer): 34 | if layer[0] == "fc": 35 | prev_dim = _total_dim(prev_dim) 36 | return tf.Variable(tf.truncated_normal((prev_dim,layer[1]),stddev=0.05)), tf.Variable(tf.constant(0.01,shape=[layer[1]])), (layer[1],1) 37 | if layer[0] == "conv": 38 | if type(prev_dim) == int: 39 | conv_shape = (1,layer[1],1,layer[2]) 40 | else: 41 | conv_shape = (1,layer[1],prev_dim[1],layer[2]) 42 | new_dim = (prev_dim[0] - layer[1] + 1, layer[2]) 43 | return tf.Variable(tf.truncated_normal(conv_shape,stddev=0.05)), tf.Variable(tf.constant(0.01,shape=[layer[2]])), new_dim 44 | raise Exception("phi and h layers must be either fully connected ('fc', <#nodes>) or convolutional ('conv', <#width>, <#output depth>)") 45 | 46 | 47 | 48 | 49 | def _g_to_dimension(n, layer): 50 | if layer[0] == "max": 51 | return 1 52 | if layer[0] == "moments": 53 | return len(layer) - 1 54 | if layer[0] == "sort": 55 | return n 56 | if layer[0] == "top_k": 57 | assert layer[1] <= n 58 | assert layer[1] >= 1 59 | return layer[1] 60 | raise Exception("g layer must be either ('max',), ('sort',), ('top_k', ), or ('moments', , , ...)") 61 | 62 | 63 | 64 | 65 | 66 | def _layer(x, layer, weights, bias): 67 | if layer[0] == "fc": 68 | flat_shape = x.shape.as_list() 69 | flat_shape = [-1, flat_shape[1], flat_shape[2]*flat_shape[3]] 70 | x = tf.reshape(x,shape=flat_shape) 71 | return tf.expand_dims(tf.nn.relu(tf.tensordot(x, weights,axes=[[2],[0]]) + bias),-1) 72 | if layer[0] == "conv": 73 | return tf.nn.relu(tf.nn.conv2d(x, weights, strides=[1, 1, 1, 1], padding='VALID') + bias) 74 | if layer[0] == "matmul": 75 | flat_shape = x.shape.as_list() 76 | flat_shape = [-1, flat_shape[1] * flat_shape[2] * flat_shape[3]] 77 | x = tf.reshape(x, shape=flat_shape) 78 | return tf.matmul(x, weights) + bias 79 | if layer[0] == "softmax": 80 | flat_shape = x.shape.as_list() 81 | flat_shape = [-1, flat_shape[1] * flat_shape[2] * flat_shape[3]] 82 | x = tf.reshape(x, shape=flat_shape) 83 | return tf.nn.log_softmax(tf.matmul(x, weights) + bias) 84 | raise Exception("An error was encountered while building the network") 85 | 86 | 87 | 88 | def _symmetric_layer(x, layer): 89 | if layer[0] == "moments": 90 | fake_x = tf.where(tf.equal(x,0),x+1,x) #set zeros to be one, they will be ignored on the next like 91 | x = tf.concat([tf.reduce_mean(tf.where(tf.equal(x,0),x,fake_x**m),axis=1,keepdims=True) for m in layer[1:]],axis=1) 92 | flat_shape = x.shape.as_list() 93 | flat_shape = [-1, 1, flat_shape[2], flat_shape[1]*flat_shape[3]] 94 | return tf.reshape(x,shape=flat_shape) 95 | 96 | x = tf.transpose(x,perm=[0,3,2,1]) 97 | if layer[0] == "max": 98 | k = 1 99 | elif layer[0] == "sort": 100 | k = x.shape.as_list()[-1] 101 | elif layer[0] == "top_k": 102 | k = layer[1] 103 | else: 104 | raise Exception("g layer must be either ('max',), ('sort',), ('top_k', ), or ('moments', , , ...)") 105 | if k > 1: 106 | x = tf.nn.top_k(x, k=k, sorted=True)[0] 107 | elif k == 1: 108 | x = tf.reduce_max(x, 3, 
keepdims = True) 109 | else: 110 | raise Exception("k should not be < 0.") 111 | flat_shape = x.shape.as_list() 112 | flat_shape = [-1, 1, flat_shape[2], flat_shape[1]*flat_shape[3]] 113 | return tf.reshape(x,shape=flat_shape) 114 | 115 | 116 | 117 | def _build_network(phi_net, g, h_net, input_shape, output_shape): 118 | 119 | phi_weights = [] 120 | phi_bias = [] 121 | h_weights = [] 122 | h_bias = [] 123 | 124 | #get weights for pre-exchangeable part 125 | if len(input_shape) == 2: 126 | prev_dim = (input_shape[1],1) 127 | else: 128 | prev_dim = (input_shape[1], input_shape[2]) 129 | for layer in phi_net: 130 | this_w, this_b, prev_dim = _layer_to_weights(prev_dim, layer) 131 | phi_weights.append(this_w) 132 | phi_bias.append(this_b) 133 | 134 | #get weights for post-exchangeable part 135 | prev_dim = (prev_dim[0], prev_dim[1]*_g_to_dimension(input_shape[0], g)) 136 | for layer in h_net[:-1]: 137 | this_w, this_b, prev_dim = _layer_to_weights(prev_dim, layer) 138 | h_weights.append(this_w) 139 | h_bias.append(this_b) 140 | if h_net[-1][0] != "matmul" and h_net[-1][0] != "softmax": 141 | raise Exception("The final layer of the network must either be matmul or softmax") 142 | h_weights.append(tf.Variable(tf.truncated_normal((_total_dim(prev_dim),output_shape[0]),stddev=0.05))) 143 | h_bias.append(tf.Variable(tf.constant(0.01,shape=[output_shape[0]]))) 144 | 145 | #define function implied by network 146 | def network_function(x): 147 | if len(x.shape.as_list()) == 3: 148 | to_return = tf.expand_dims(x,-1) 149 | else: 150 | to_return = x 151 | for idx,layer in enumerate(phi_net): 152 | to_return = _layer(to_return, layer, phi_weights[idx], phi_bias[idx]) 153 | to_return = _symmetric_layer(to_return, g) 154 | for idx,layer in enumerate(h_net): 155 | to_return = _layer(to_return, layer, h_weights[idx], h_bias[idx]) 156 | return to_return 157 | 158 | return network_function 159 | 160 | def train(input_shape, output_shape, simulator, phi_net = [("fc",1024),("fc",1024)], g = ("max",), h_net = [("fc",512),("softmax",)], 161 | network_function = None, loss = "cross-ent", accuracy="classification", num_batches = 20000, batch_size=50, 162 | queue_capacity=250, verbosity=100, training_threads=1, 163 | sim_threads=1, save_path=None, training_summary = None, logfile = "."): 164 | """Train an exchangeable neural network using simulation-on-the-fly 165 | 166 | Args: 167 | input_shape: a tuple specifying the shape of the data output by 168 | output_shape: a tuple specifying the shape of the label output by 169 | simulator: a function that returns (data, label) tuples, where data is a and label is an shaped numpy array 170 | phi_net: a list of layers to perform before the symmetric function, see manual for more details 171 | g: a tuple specifying the exchangeable function to use, see manual for more details 172 | h_net: a list of layers to perform after the symmetric function, see manual for more details 173 | network_function: a function of tensorflow operations specifying the neural net (if present ignores phi_net, g, and h_net) 174 | loss: Either "cross-ent" for cross-entropy loss or "l2" for l2-loss or a user-defined tensorflow function 175 | accuracy: Either "classification", None for using loss function as accuracy, or a user-defined tensorflow function 176 | num_batches: number of mini-batches for training 177 | batch_size: number of training examples used per mini-batch 178 | queue_capacity: number of training examples to be held in queue 179 | verbosity: print every k iterations 180 | 
training_threads: number of threads used for neural network operations
181 | sim_threads: number of threads used to simulate data
182 | save_path: file base name to save neural network, if None do not save network
183 | training_summary: A text file containing the <# batch count> <# loss value> <# Accuracy> per line. The number of batches saved is determined by verbosity. If None, then nothing is saved.
184 | logfile: Log extra info to logfile. If logfile='.', logs to STDERR.
185 | Returns:
186 | None
187 | """
188 | if logfile == ".":
189 | logging.basicConfig(level=logging.INFO)
190 | elif logfile is not None:
191 | logging.basicConfig(filename=logfile, level=logging.INFO)
192 | 
193 | assert len(input_shape) == 2 or len(input_shape) == 3
194 | assert len(output_shape) == 1
195 | 
196 | #set up a queue that will simulate data on separate threads while the network trains
197 | #adapted from https://indico.io/blog/tensorflow-data-input-part2-extensions/
198 | def safe_simulate():
199 | try:
200 | result = simulator()
201 | if not isinstance(result, tuple): raise TypeError("Simulator did not produce a tuple")
202 | x,y = result
203 | if not isinstance(x, np.ndarray): raise TypeError("Simulator did not produce a numpy array as input")
204 | if not isinstance(y, np.ndarray): raise TypeError("Simulator did not produce a numpy array for the label")
205 | if not _dimension_match(x.shape,input_shape): raise _DimensionException("input")
206 | if not _dimension_match(y.shape,output_shape): raise _DimensionException("output")
207 | return result
208 | except _DimensionException as e:
209 | logging.warning("Dimension mismatch between simulations and " + str(e) + " size")
210 | except Exception as e:
211 | logging.warning("The provided simulator produced data that was not recognized by defiNETti.
The exception encountered was " + str(e)) 212 | 213 | def simulation_iterator(): 214 | while True: 215 | x,y = safe_simulate() 216 | yield x,y 217 | class SimulationRunner(object): 218 | def __init__(self): 219 | self.X = tf.placeholder(dtype=tf.float32, shape=input_shape) 220 | self.Y = tf.placeholder(dtype=tf.float32, shape=output_shape) 221 | self.queue = tf.FIFOQueue(shapes=[input_shape,output_shape], 222 | dtypes=[tf.float32, tf.float32], 223 | capacity=queue_capacity) 224 | self.enqueue_op = self.queue.enqueue([self.X,self.Y]) 225 | self.run = True 226 | def get_inputs(self, batch_size): 227 | x_batch, y_batch = self.queue.dequeue_many(batch_size) 228 | return x_batch, y_batch 229 | def thread_main(self, sess): 230 | for x,y in simulation_iterator(): 231 | if self.run: 232 | sess.run(self.enqueue_op, feed_dict={self.X:x, self.Y:y}) 233 | else: 234 | break 235 | def start_threads(self, sess, n_threads): 236 | threads = [] 237 | for n in range(n_threads): 238 | t = threading.Thread(target=self.thread_main, args=(sess,)) 239 | t.daemon = True 240 | t.start() 241 | threads.append(t) 242 | return threads 243 | 244 | #construct network 245 | simulator_thread = SimulationRunner() 246 | x_, y_ = simulator_thread.get_inputs(batch_size) 247 | if network_function is None: 248 | function = _build_network(phi_net, g, h_net, input_shape, output_shape) 249 | else: 250 | function = network_function 251 | 252 | #define loss and accuracy 253 | y_pred = function(x_) 254 | if loss == "cross-ent": 255 | loss_func = tf.reduce_mean(-tf.reduce_sum(y_ * (y_pred), axis=[1])) 256 | elif loss == "l2": 257 | loss_func = tf.reduce_mean(tf.reduce_sum((y_-y_pred)**2,axis=[1])) 258 | else: 259 | loss_func = loss(y_pred, y_) 260 | if accuracy is None: 261 | acc_func = loss_func 262 | elif accuracy == "classification": 263 | acc_func = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y_,1), tf.argmax(y_pred,1)),tf.float32)) 264 | else: 265 | acc_func = accuracy(y_pred, y_) 266 | train_step = tf.train.AdamOptimizer(.001).minimize(loss_func) 267 | 268 | #save network information for testing purposes 269 | input_dims = tf.constant(list(input_shape), dtype=tf.float32, name="input_dims") 270 | x_test = tf.placeholder(dtype=tf.float32, shape=[None] + list(input_shape), name="test_input") 271 | if loss == "cross-ent": 272 | test_function = tf.exp(function(x_test), name="learned_function") 273 | else: 274 | test_function = tf.identity(function(x_test), name="learned_function") 275 | 276 | 277 | saver = tf.train.Saver() #for saving the weights later 278 | 279 | #run training 280 | if training_summary is not None: 281 | summary = np.zeros((int(np.ceil(float(num_batches) / verbosity)),3)) 282 | with tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=training_threads)) as sess: 283 | init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()) 284 | sess.run(init_op) 285 | coord = tf.train.Coordinator() 286 | simulator_thread.start_threads(sess,n_threads=sim_threads) 287 | op_threads = tf.train.start_queue_runners(sess=sess, coord=coord) 288 | for batch_count in range(num_batches): 289 | _,loss_val,acc = sess.run([train_step,loss_func,acc_func]) 290 | if batch_count % verbosity == 0: 291 | logging.info('Batch {} complete'.format(batch_count)) 292 | logging.info('Loss value on current batch = {}'.format(loss_val)) 293 | logging.info('Accuracy on current batch = {}'.format(acc)) 294 | if training_summary is not None: 295 | summary[int(batch_count / verbosity),:] = [batch_count, loss_val, acc] 296 | 
297 | simulator_thread.run = False 298 | 299 | #close all threads 300 | coord.request_stop() 301 | coord.join(op_threads) 302 | 303 | if save_path is not None: 304 | saver.save(sess, os.path.expanduser(save_path)) 305 | 306 | np.savetxt(training_summary, summary) 307 | 308 | return 309 | 310 | 311 | 312 | 313 | def test(data, model_path, threads = 1): 314 | """Use a pre-trained neural network to analyze data 315 | 316 | Args: 317 | data: a list of numpy arrays on which to run the neural network 318 | model_path: path to the basename where the network is stored, should be same as save_path in train() 319 | threads: number of threads used for tensorflow operations 320 | Returns: 321 | output: a numpy array containing the network output for each input 322 | """ 323 | with tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=threads)) as sess: 324 | #Load the trained model 325 | loader = tf.train.import_meta_graph(os.path.expanduser(model_path) + ".meta") 326 | loader.restore(sess, tf.train.latest_checkpoint(os.path.dirname(model_path))) 327 | graph = tf.get_default_graph() 328 | 329 | #check that dimensions match for all inputs 330 | input_dims = sess.run(graph.get_tensor_by_name("input_dims:0")) 331 | for d in data: 332 | for idx,dim in enumerate(d.shape[1:]): 333 | if dim != input_dims[idx+1]: 334 | raise Exception("Training and Testing dimension differ") 335 | 336 | #run network 337 | test_input = graph.get_tensor_by_name("test_input:0") 338 | learned_function = graph.get_tensor_by_name("learned_function:0") 339 | outputs = sess.run(learned_function,{test_input:data}) 340 | 341 | return outputs 342 | -------------------------------------------------------------------------------- /example/genetic_map_GRCh37_chr1_truncated.txt: -------------------------------------------------------------------------------- 1 | Chromosome Position(bp) Rate(cM/Mb) Map(cM) 2 | chr1 55550 2.981822 0.000000 3 | chr1 82571 2.082414 0.080572 4 | chr1 88169 2.081358 0.092229 5 | chr1 254996 3.354927 0.439456 6 | chr1 564598 2.887498 1.478148 7 | chr1 564621 2.885864 1.478214 8 | chr1 565433 2.883892 1.480558 9 | chr1 568322 2.887570 1.488889 10 | chr1 568527 2.895420 1.489481 11 | chr1 721290 2.655176 1.931794 12 | chr1 723819 2.669992 1.938509 13 | chr1 728242 2.671779 1.950319 14 | chr1 729948 2.675202 1.954877 15 | chr1 739010 2.677693 1.979119 16 | chr1 740857 2.684310 1.984065 17 | chr1 750235 1.595600 2.009238 18 | chr1 752566 1.205854 2.012958 19 | chr1 753269 1.205507 2.013806 20 | chr1 753541 0.493648 2.014133 21 | chr1 754745 0.399609 2.014728 22 | chr1 765948 0.494327 2.019205 23 | chr1 767038 0.493138 2.019743 24 | chr1 767070 0.513499 2.019759 25 | chr1 768448 0.455582 2.020467 26 | chr1 769551 0.456521 2.020969 27 | chr1 771521 0.458418 2.021869 28 | chr1 774047 0.462218 2.023027 29 | chr1 775659 0.457037 2.023772 30 | chr1 776546 0.453806 2.024177 31 | chr1 777122 0.453593 2.024438 32 | chr1 777513 0.454370 2.024616 33 | chr1 779322 0.453268 2.025438 34 | chr1 780027 0.453415 2.025757 35 | chr1 780785 0.453231 2.026101 36 | chr1 784023 0.453169 2.027569 37 | chr1 784904 0.457151 2.027968 38 | chr1 785050 2.366093 2.028035 39 | chr1 785989 2.904551 2.030256 40 | chr1 792480 3.122720 2.049110 41 | chr1 798026 4.588229 2.066428 42 | chr1 798801 4.578148 2.069984 43 | chr1 798959 4.370906 2.070708 44 | chr1 799463 4.344183 2.072911 45 | chr1 800007 4.287599 2.075274 46 | chr1 808312 4.180096 2.110882 47 | chr1 846864 3.187313 2.272033 48 | chr1 882033 3.162755 2.384128 49 | chr1 888659 2.747543 
2.405084 50 | chr1 957898 0.952858 2.595322 51 | chr1 962210 0.517933 2.599430 52 | chr1 982573 0.521736 2.609977 53 | chr1 990380 0.513886 2.614050 54 | chr1 998501 0.513754 2.618223 55 | chr1 998874 0.510871 2.618415 56 | chr1 1003629 0.496544 2.620844 57 | chr1 1005806 0.407540 2.621925 58 | chr1 1011006 0.393293 2.624044 59 | chr1 1017170 1.250116 2.626469 60 | chr1 1017197 1.889336 2.626502 61 | chr1 1017216 2.173130 2.626538 62 | chr1 1017587 1.717554 2.627345 63 | chr1 1017598 1.341542 2.627363 64 | chr1 1018562 1.320611 2.628657 65 | chr1 1018704 1.310429 2.628844 66 | chr1 1018966 1.297303 2.629188 67 | chr1 1021346 1.851467 2.632275 68 | chr1 1021408 1.911398 2.632390 69 | chr1 1021415 2.349187 2.632403 70 | chr1 1021583 2.439198 2.632798 71 | chr1 1021658 2.635598 2.632981 72 | chr1 1021695 2.917768 2.633078 73 | chr1 1022037 3.303601 2.634076 74 | chr1 1025301 3.324303 2.644859 75 | chr1 1026707 3.304892 2.649533 76 | chr1 1030374 2.573938 2.661652 77 | chr1 1030565 0.660922 2.662144 78 | chr1 1031060 0.614689 2.662471 79 | chr1 1031540 0.280745 2.662766 80 | chr1 1033994 0.280671 2.663455 81 | chr1 1033999 0.280597 2.663456 82 | chr1 1034953 0.280571 2.663724 83 | chr1 1036959 0.261500 2.664287 84 | chr1 1040026 0.261278 2.665089 85 | chr1 1046164 0.261441 2.666693 86 | chr1 1048955 0.262259 2.667422 87 | chr1 1049950 0.263129 2.667683 88 | chr1 1052946 0.263101 2.668472 89 | chr1 1053452 0.262919 2.668605 90 | chr1 1053570 0.263541 2.668636 91 | chr1 1060235 0.579117 2.670392 92 | chr1 1060608 0.939437 2.670608 93 | chr1 1061115 1.029868 2.671085 94 | chr1 1061152 1.097144 2.671123 95 | chr1 1061166 1.164451 2.671138 96 | chr1 1062015 1.949239 2.672127 97 | chr1 1062638 3.111017 2.673341 98 | chr1 1063145 3.111201 2.674918 99 | chr1 1064535 3.117384 2.679243 100 | chr1 1064979 3.112779 2.680627 101 | -------------------------------------------------------------------------------- /example/run_example.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | sys.path.append(os.path.relpath("..")) 4 | from defiNETti import * 5 | import simulator_example 6 | 7 | # Train the network -- Use small num_batches for example purposes. 8 | prior_rates, weight_prob = simulator_example.parse_hapmap_empirical_prior(['genetic_map_GRCh37_chr1_truncated.txt']) 9 | print('Training the network...') 10 | train((198,24,2), (2,), lambda: simulator_example.simulate_data(prior_rates, weight_prob), 11 | phi_net = [('conv', 5, 32),('conv', 5, 64)], 12 | g = ('top_k',1), 13 | h_net = [("fc",128),("fc",128),("softmax",)], 14 | num_batches = 2000, 15 | verbosity = 100, 16 | training_threads=5, 17 | sim_threads=5, 18 | save_path='./example_weights', 19 | training_summary='./summary.txt') 20 | 21 | 22 | # Test the network 23 | print("Testing the network...") 24 | data = [] 25 | for i in range(500): 26 | data_i, _ = simulator_example.simulate_data(prior_rates, weight_prob) # Typically you should feed in your real data here instead of simulated. 
27 |     data.append(data_i)
28 | output = test(data, model_path='./example_weights', threads = 5)
29 | np.savetxt('./test_predictions.txt',output)
30 | 
--------------------------------------------------------------------------------
/example/simulator_example.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import sys
5 | import numpy as np
6 | import msprime
7 | 
8 | # Data shape
9 | HEIGHT = 198 # number of haplotypes, CEU = 198, YRI = 216
10 | WIDTH = 24 # number of seg sites, should also be divisible by 4
11 | NUM_CLASSES = 2
12 | 
13 | # Simulation parameters
14 | NE = 1.0e4
15 | MU = 1.1e-8
16 | R = 1.25e-8
17 | theta = 0.00044
18 | rho = .0005
19 | hap_len = 52000 #heuristic to get enough SNPs
20 | hot_spot_len = 2000
21 | lower_intensity = 10
22 | upper_intensity = 100
23 | GEN_IN_YRS = 29
24 | 
25 | 
26 | def parse_hapmap_empirical_prior(files):
27 |     print("Parsing empirical prior...")
28 |     assert(len(files) == 1)
29 |     print(files)
30 |     mat = np.loadtxt(files[0], skiprows = 1, usecols=(1,2))
31 |     print(mat.shape)
32 |     mat[:,1] = mat[:,1]*(1.e-8)
33 |     mat = mat[mat[:,1] != 0.0, :] # remove 0s
34 |     weights = mat[1:,0] - mat[:-1,0]
35 |     prior_rates = mat[:-1,1]
36 |     prob = weights / np.sum(weights)
37 |     print("Done parsing prior")
38 | 
39 |     return prior_rates, prob
40 | 
41 | 
42 | def draw_background_rate_from_prior(prior_rates, prob):
43 |     return np.random.choice(prior_rates, p=prob)
44 | 
45 | def simulate_data(prior_rates, weight_prob):
46 |     hot_spot = np.random.randint(0,2)
47 | 
48 |     # Use flat empirical maps
49 |     background = draw_background_rate_from_prior(prior_rates, weight_prob)
50 |     if hot_spot >= 1:
51 |         heat = np.random.uniform(lower_intensity,upper_intensity)*max(R/background,1.)
52 |         reco_map = msprime.RecombinationMap([0,(hap_len - hot_spot_len)/2,
53 |             (hap_len + hot_spot_len)/2, hap_len],[background,background*heat,background,0])
54 |     else:
55 |         reco_map = msprime.RecombinationMap([0,hap_len],[background,0])
56 | 
57 | 
58 |     # Simulate
59 |     tree_sequence = msprime.simulate(sample_size=HEIGHT, Ne = NE, recombination_map=reco_map,
60 |                                      mutation_rate=MU)
61 |     seg_sites = tree_sequence.get_num_mutations()
62 | 
63 |     # Make sure that we simulated enough seg sites for this example. Should not resimulate very often
64 |     if seg_sites < WIDTH:
65 |         print("Resimulating0...",file=sys.stderr)
66 |         print(seg_sites)
67 |         return simulate_data(prior_rates, weight_prob)
68 | 
69 | 
70 |     # center around the hotspot
71 |     mut_pos = [mut[0] for mut in tree_sequence.mutations()] # chromosome positions
72 |     assert(len(mut_pos) == seg_sites)
73 |     before_hs = [x < 0.5*hap_len for x in mut_pos].count(True)
74 |     after_hs = [x >= 0.5*hap_len for x in mut_pos].count(True)
75 |     hots = [x >= (hap_len-hot_spot_len)*0.5 and x < (hap_len+hot_spot_len)*0.5 for x in mut_pos].count(True)
76 |     assert(before_hs + after_hs == seg_sites)
77 | 
78 | 
79 |     # get distances and image
80 |     if hots >= WIDTH:
81 |         print("Hot Length: ", hots)
82 |         print("Resimulating1.5...",file=sys.stderr)
83 |         return simulate_data(prior_rates, weight_prob)
84 |     # Check if there are enough SNPs left and right of the hotspot
85 |     if before_hs < WIDTH//2:
86 |         print("Resimulating2...",file=sys.stderr)
87 |         print(before_hs)
88 |         return simulate_data(prior_rates, weight_prob)
89 |     if after_hs <= WIDTH//2:
90 |         print("Resimulating3...",file=sys.stderr)
91 |         print(after_hs)
92 |         return simulate_data(prior_rates, weight_prob)
93 |     distances = np.array([mut_pos[i+1]-mut_pos[i] for i in range(before_hs-WIDTH//2,before_hs+WIDTH//2)],dtype=float)[:-1] * 4*NE*MU
94 |     distances = np.vstack([np.copy(distances) for k in range(HEIGHT)])
95 |     image = np.array([np.copy(variant.genotypes) if variant.genotypes.sum() < HEIGHT/2 else np.copy(1-variant.genotypes) for variant in tree_sequence.variants()]).transpose()[:,before_hs-WIDTH//2:before_hs+WIDTH//2].reshape([HEIGHT,WIDTH])
96 | 
97 | 
98 | 
99 |     assert(image.shape[1] == distances.shape[1] + 1)
100 |     assert(image.shape[0] == distances.shape[0])
101 |     combined = np.empty((image.shape[0], image.shape[1], 2), dtype=distances.dtype)
102 |     combined[:,:,0] = image
103 |     combined[:,:-1,1] = distances
104 |     combined[:,-1,1] = 0.0
105 | 
106 |     label = np.zeros(2)
107 |     label[hot_spot] = 1
108 |     return (combined,label)
109 | 
110 | # prior_rates, weight_prob = parse_hapmap_empirical_prior(['genetic_map_GRCh37_chr1_truncated.txt'])
111 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | from setuptools import setup
4 | 
5 | setup(name='defiNETti',
6 |       version='1.0.0',
7 |       description='A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks',
8 |       author='Jeffrey Chan, Valerio Perrone, Jeffrey P. Spence, Paul A. Jenkins, Sara Mathieson, Yun S. Song',
9 |       author_email='chanjed@berkeley.edu, v.Perrone@warwick.ac.uk, spence.jeffrey@berkeley.edu, p.Jenkins@warwick.ac.uk, smathie1@swarthmore.edu, yss@berkeley.edu',
10 |       py_modules=['defiNETti'],
11 |       install_requires=['numpy>=1.14.2','tensorflow==1.6.0', 'msprime==0.4.0'],
12 |       )
13 | 
--------------------------------------------------------------------------------
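
As a closing illustration, here is a minimal sketch of the regression workflow that combines the Gaussian simulator from the README's Simulation section with ``train()`` and ``test()``. The layer sizes, number of batches, and file paths below are illustrative choices only::

    import numpy as np
    from defiNETti import train, test

    def simulator_gaussian():
        # prior draws for the mean and standard deviation, as in the README example
        mu = np.random.beta(5, 10)
        sigma = np.random.uniform(6, 10)
        data = np.random.normal(mu, sigma, (100, 1))   # 100 exchangeable observations
        label = np.array([mu, sigma])
        return (data, label)

    # Regression: h_net ends in ('matmul',) and the loss is "l2".
    train((100, 1), (2,), simulator_gaussian,
          phi_net=[("fc", 64), ("fc", 64)],
          g=("max",),
          h_net=[("fc", 32), ("matmul",)],
          loss="l2",
          accuracy=None,
          num_batches=500,
          batch_size=50,
          verbosity=50,
          sim_threads=2,
          save_path="./gaussian_weights",
          training_summary="./gaussian_summary.txt")

    # Estimate (mu, sigma) for newly simulated datasets.
    new_data = [simulator_gaussian()[0] for _ in range(10)]
    estimates = test(new_data, model_path="./gaussian_weights", threads=1)
    print(estimates.shape)   # (10, 2): one (mu, sigma) estimate per input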