├── .gitignore
├── README.rst
├── defiNETti.py
├── example
│   ├── genetic_map_GRCh37_chr1_truncated.txt
│   ├── run_example.py
│   └── simulator_example.py
└── setup.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | *.pyc
3 | 
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | 
2 | ================
3 | defiNETti
4 | ================
5 | 
6 | defiNETti is a program for performing Bayesian inference on exchangeable
7 | (or permutation-invariant) data via deep learning. In particular, it is
8 | well-suited for population genetic applications.
9 | 
10 | .. contents:: :depth: 2
11 | 
12 | Installation instructions
13 | =========================
14 | Prerequisites:
15 | 
16 | 1. A scientific distribution of Python 2.7 or 3, e.g. `Anaconda <http://continuum.io/downloads>`_ or `Enthought Canopy <https://www.enthought.com/products/canopy/>`_
17 | 2. Alternatively, a custom installation of pip and the SciPy stack
18 | 
19 | (Optional) Create a virtual environment to store the dependencies.
20 | For Python 2::
21 | 
22 |     $ pip install virtualenv
23 |     $ cd my_project_folder
24 |     $ virtualenv my_project
25 | 
26 | For Python 3::
27 | 
28 |     $ python3 -m venv my_project_folder
29 | 
30 | To activate the virtual environment::
31 | 
32 |     $ source my_project/bin/activate
33 | 
34 | To install, type the following in the top-level directory of defiNETti (where ``setup.py`` lives)::
35 | 
36 |     $ pip install .
37 | 
38 | 
39 | Simulation
40 | ===========
41 | The simulator is a function that returns a single datapoint as a tuple ``(data, label)``:
42 | 
43 | 1. ``data`` - A numpy array (e.g. a genotype matrix, image, or point cloud) of 2 or 3 dimensions.
44 | 2. ``label`` - A 1-dimensional numpy array associated with the particular data array.
45 | 
46 | Simulator Example
47 | -----------------
48 | 
49 | An example simulator for inferring Gaussian parameters::
50 | 
51 |     import numpy as np
52 | 
53 |     def simulator_gaussian():
54 |         mu = np.random.beta(5, 10)
55 |         sigma = np.random.uniform(6, 10)
56 |         data = np.random.normal(mu, sigma, (100, 1))
57 |         label = np.array([mu, sigma])
58 | 
59 |         return (data, label)
60 | 
61 | A more detailed population genetics-specific example is shown in ``example/``. Note that, in principle, the simulator could sample randomly from a fixed dataset if no generative model is available.
62 | 
63 | 
64 | Neural Network
65 | ==============
66 | The neural network in this program is built from the following types of layers:
67 | 
68 | 1. Convolutional Layers - The syntax is ``('conv', <#width>, <#output depth>)``. Note that the height of the image patches is assumed to be 1 to enforce exchangeability.
69 | 2. Fully-connected Layers - The syntax is ``('fc', <#nodes>)``.
70 | 3. Matrix Multiply Layers - The last layer of the ``h_net`` neural network for regression tasks. The syntax is ``('matmul',)``.
71 | 4. Softmax Layers - The last layer of the ``h_net`` neural network for classification tasks. The syntax is ``('softmax',)``.
72 | 
73 | For more information on these layer types see http://cs231n.github.io/convolutional-networks/. Multiple layers are combined in a list, with the first element corresponding to the first layer and so on, e.g. ``[("fc",1024), ("fc",1024), ('softmax',)]``.
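
As a concrete illustration, the following specifications are all valid ``phi_net``, ``g``, and ``h_net`` arguments for ``train()`` (described in the Training section below). The first set mirrors the defaults in ``defiNETti.py``, the second mirrors ``example/run_example.py``; the layer sizes are illustrative rather than prescriptive::

    # Fully-connected classification architecture (the train() defaults)
    phi_net = [("fc", 1024), ("fc", 1024)]          # applied to every row independently
    g       = ("max",)                              # permutation-invariant pooling over rows
    h_net   = [("fc", 512), ("softmax",)]           # softmax last layer => classification

    # Convolutional architecture used in example/run_example.py
    phi_net = [("conv", 5, 32), ("conv", 5, 64)]    # 1 x 5 patches with 32 and 64 output channels
    g       = ("top_k", 1)                          # keep the largest activation per feature
    h_net   = [("fc", 128), ("fc", 128), ("softmax",)]

    # For a regression task, end h_net with ('matmul',) and train with loss="l2"
    h_net   = [("fc", 512), ("matmul",)]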
74 | 
75 | 
76 | 
77 | Training
78 | =========
79 | The train command trains an exchangeable neural network using simulation-on-the-fly. The exchangeable neural network learns a permutation-invariant mapping :math:`f` from the data :math:`X = (x_1, x_2, \ldots, x_n)` to the label, or to the posterior over the label, :math:`\theta`. In order to ensure permutation invariance, the function is decomposed as:
80 | 
81 | .. math::
82 | 
83 |     f(X) = (h \circ g)(\Phi(x_1), \Phi(x_2), \ldots , \Phi(x_n))
84 | 
85 | - :math:`\Phi` - a function parameterized by a neural network that is applied to each row of the input data.
86 | - :math:`g` - a permutation-invariant function (e.g. max, sort, or moments).
87 | - :math:`h` - a function parameterized by a neural network that is applied to the exchangeable feature representation :math:`g(\Phi(x_1), \Phi(x_2), \ldots , \Phi(x_n))`.
88 | 
89 | For regression tasks, the output is an estimate of the label :math:`\hat{\theta} = f(X)`. For classification tasks, the output is a posterior over labels :math:`\mathcal{P}_{\theta} = f(X)`.
90 | 
91 | Arguments
92 | ---------
93 | - ``input_shape`` - A tuple specifying the shape of the data output by ``simulator`` (i.e. :math:`X`). For example, if the data is a ``50 x 50`` image, the shape should be ``(50,50)``. Note that permutation invariance is always enforced along the first dimension, and only 2- or 3-dimensional data are supported.
94 | - ``output_shape`` - A tuple specifying the shape of the label output by ``simulator`` (i.e. :math:`\theta`). For example, if the label is a length-5 continuous vector, the shape should be ``(5,)``. If the label is a discrete variable, the size of the dimension is the number of classes. Note that currently only classification of a single label and 1-dimensional labels are supported.
95 | - ``simulator`` - A function which returns tuples of ``(data, label)`` as described above.
96 | - ``phi_net`` - A neural network parameterizing :math:`\Phi` shown above. The input syntax is described in the Neural Network section above.
97 | - ``g`` - An operation parameterizing the permutation-invariant function :math:`g` shown above. The supported options are ``('max',)``, ``('sort',)``, ``('top_k', <#k>)``, or ``('moments', <#m1>, <#m2>, ...)``.
98 | - ``h_net`` - A neural network parameterizing :math:`h` shown above. The input syntax is the same as for ``phi_net``.
99 | - ``network_function`` - A function of tensorflow operations specifying a custom network. If present, ``phi_net``, ``g``, and ``h_net`` are ignored.
100 | - ``loss`` - The loss function used to train the network: ``"cross-ent"`` for cross-entropy loss, ``"l2"`` for l2 loss, or a user-defined tensorflow function.
101 | - ``accuracy`` - The metric used to report accuracy: ``"classification"`` for 0-1 accuracy, ``None`` to use the loss function as the accuracy, or a user-defined tensorflow function.
102 | - ``num_batches`` - The number of iterations (mini-batches) of training to perform.
103 | - ``batch_size`` - The number of training examples in each batch.
104 | - ``queue_capacity`` - The number of training examples to hold in the queue at once.
105 | - ``verbosity`` - Print the loss and accuracy every ``verbosity`` iterations.
106 | - ``training_threads`` - The number of threads dedicated to training the network.
107 | - ``sim_threads`` - The number of threads dedicated to simulating data.
108 | - ``save_path`` - The base filename under which to save the neural network. If ``None``, the weights are not saved.
109 | - ``training_summary`` - The filename to which a summary of the training procedure is saved. Each line of the file has the format ``<batch count> <loss value> <accuracy>``, written every ``verbosity`` batches. If ``None``, no summary file is created.
110 | - ``logfile`` - Log extra training information to this file. If ``logfile='.'``, logs to STDERR.
111 | 
112 | Note: To include distances in the 3-dimensional use case, the distance vector can simply be padded with a 1 in the second dimension.
113 | Note: Simulators are supplied as ordinary Python callables, as described in the Simulation section above.
114 | Note: Per-batch loss and accuracy values for training curves can be obtained via ``training_summary``.
115 | 
116 | Testing
117 | ========
118 | The test command takes data and a trained neural network and outputs predictions.
119 | 
120 | Arguments
121 | ---------
122 | - ``data`` - A list of numpy arrays on which to run the neural network. The shape of each numpy array should be the same as ``input_shape`` in ``train()``.
123 | - ``model_path`` - Path to the basename under which the network was stored; should be the same as ``save_path`` in ``train()``.
124 | - ``threads`` - Number of threads used for the tensorflow operations.
125 | 
126 | Output
127 | ------
128 | - ``output`` - A numpy array containing the network output for each input. Its dimensions are ``(<number of inputs>, <output dimension>)``.
129 | 
130 | Population Genetic Example
131 | ==========================
132 | A population genetics-specific example can be found in ``example/``. Note that ``msprime`` version 0.4.0 is needed to run this example. This is a simplified version of the experiments in the paper.
133 | 
134 | Quick Start
135 | -----------
136 | To run the example (for Python 3, use ``python3`` instead of ``python``)::
137 | 
138 |     $ cd example
139 |     $ python run_example.py
140 | 
141 | The expected accuracy after the first few hundred batches should be around 80-90%, with a slow, steady increase after that. With 5 threads each for simulation and training, training should take roughly half an hour per 1000 batches. In the paper, we used compute resources that allocated 24 threads each to simulation and training.
142 | 
143 | Additional Details
144 | ------------------
145 | - For inference purposes, we recommend running for around 20000 batches, or until there is clear convergence.
146 | - The speed of the method depends on the number of CPU cores available for simulation. We recommend experimenting with the number of threads dedicated to simulation and training to find the optimal split (make sure the total matches the number of cores available).
147 | - Distances are normalized to be on the order of 0 to 1 for optimization purposes.
148 | - More SNPs than necessary are simulated and then truncated so that the hotspot region is centered.
149 | - A prior over rates is generated from the HapMap recombination map. In the paper, we use windows of the fine-scale recombination map rather than the flat rates used in the example.
150 | - When dealing with missing data, it may be helpful to copy the missingness patterns of the real data.
151 | 
--------------------------------------------------------------------------------
/defiNETti.py:
--------------------------------------------------------------------------------
1 | """Train and test neural networks using exchangeable nets and simulation-on-the-fly
2 | 
3 | If you use this software, please cite:
4 | "A likelihood-free inference framework for population genetic data using exchangeable neural networks"
5 | by Chan, J., Perrone, V., Spence, J.P., Jenkins, P.A., Mathieson, S., and Song, Y.S.
6 | """ 7 | from __future__ import division 8 | import logging,threading,os 9 | import numpy as np 10 | import tensorflow as tf 11 | 12 | 13 | def _dimension_match(shape1, shape2): 14 | if len(shape1) != len(shape2): 15 | return False 16 | for d1,d2 in zip(shape1,shape2): 17 | if d1 != d2: 18 | return False 19 | return True 20 | 21 | def _total_dim(dims): 22 | if type(dims) == int: 23 | return dims 24 | to_return = 1 25 | for d in dims: 26 | to_return *= d 27 | return to_return 28 | 29 | class _DimensionException(Exception): 30 | pass 31 | 32 | 33 | def _layer_to_weights(prev_dim, layer): 34 | if layer[0] == "fc": 35 | prev_dim = _total_dim(prev_dim) 36 | return tf.Variable(tf.truncated_normal((prev_dim,layer[1]),stddev=0.05)), tf.Variable(tf.constant(0.01,shape=[layer[1]])), (layer[1],1) 37 | if layer[0] == "conv": 38 | if type(prev_dim) == int: 39 | conv_shape = (1,layer[1],1,layer[2]) 40 | else: 41 | conv_shape = (1,layer[1],prev_dim[1],layer[2]) 42 | new_dim = (prev_dim[0] - layer[1] + 1, layer[2]) 43 | return tf.Variable(tf.truncated_normal(conv_shape,stddev=0.05)), tf.Variable(tf.constant(0.01,shape=[layer[2]])), new_dim 44 | raise Exception("phi and h layers must be either fully connected ('fc', <#nodes>) or convolutional ('conv', <#width>, <#output depth>)") 45 | 46 | 47 | 48 | 49 | def _g_to_dimension(n, layer): 50 | if layer[0] == "max": 51 | return 1 52 | if layer[0] == "moments": 53 | return len(layer) - 1 54 | if layer[0] == "sort": 55 | return n 56 | if layer[0] == "top_k": 57 | assert layer[1] <= n 58 | assert layer[1] >= 1 59 | return layer[1] 60 | raise Exception("g layer must be either ('max',), ('sort',), ('top_k', ), or ('moments', , , ...)") 61 | 62 | 63 | 64 | 65 | 66 | def _layer(x, layer, weights, bias): 67 | if layer[0] == "fc": 68 | flat_shape = x.shape.as_list() 69 | flat_shape = [-1, flat_shape[1], flat_shape[2]*flat_shape[3]] 70 | x = tf.reshape(x,shape=flat_shape) 71 | return tf.expand_dims(tf.nn.relu(tf.tensordot(x, weights,axes=[[2],[0]]) + bias),-1) 72 | if layer[0] == "conv": 73 | return tf.nn.relu(tf.nn.conv2d(x, weights, strides=[1, 1, 1, 1], padding='VALID') + bias) 74 | if layer[0] == "matmul": 75 | flat_shape = x.shape.as_list() 76 | flat_shape = [-1, flat_shape[1] * flat_shape[2] * flat_shape[3]] 77 | x = tf.reshape(x, shape=flat_shape) 78 | return tf.matmul(x, weights) + bias 79 | if layer[0] == "softmax": 80 | flat_shape = x.shape.as_list() 81 | flat_shape = [-1, flat_shape[1] * flat_shape[2] * flat_shape[3]] 82 | x = tf.reshape(x, shape=flat_shape) 83 | return tf.nn.log_softmax(tf.matmul(x, weights) + bias) 84 | raise Exception("An error was encountered while building the network") 85 | 86 | 87 | 88 | def _symmetric_layer(x, layer): 89 | if layer[0] == "moments": 90 | fake_x = tf.where(tf.equal(x,0),x+1,x) #set zeros to be one, they will be ignored on the next like 91 | x = tf.concat([tf.reduce_mean(tf.where(tf.equal(x,0),x,fake_x**m),axis=1,keepdims=True) for m in layer[1:]],axis=1) 92 | flat_shape = x.shape.as_list() 93 | flat_shape = [-1, 1, flat_shape[2], flat_shape[1]*flat_shape[3]] 94 | return tf.reshape(x,shape=flat_shape) 95 | 96 | x = tf.transpose(x,perm=[0,3,2,1]) 97 | if layer[0] == "max": 98 | k = 1 99 | elif layer[0] == "sort": 100 | k = x.shape.as_list()[-1] 101 | elif layer[0] == "top_k": 102 | k = layer[1] 103 | else: 104 | raise Exception("g layer must be either ('max',), ('sort',), ('top_k', ), or ('moments', , , ...)") 105 | if k > 1: 106 | x = tf.nn.top_k(x, k=k, sorted=True)[0] 107 | elif k == 1: 108 | x = tf.reduce_max(x, 3, 
keepdims = True) 109 | else: 110 | raise Exception("k should not be < 0.") 111 | flat_shape = x.shape.as_list() 112 | flat_shape = [-1, 1, flat_shape[2], flat_shape[1]*flat_shape[3]] 113 | return tf.reshape(x,shape=flat_shape) 114 | 115 | 116 | 117 | def _build_network(phi_net, g, h_net, input_shape, output_shape): 118 | 119 | phi_weights = [] 120 | phi_bias = [] 121 | h_weights = [] 122 | h_bias = [] 123 | 124 | #get weights for pre-exchangeable part 125 | if len(input_shape) == 2: 126 | prev_dim = (input_shape[1],1) 127 | else: 128 | prev_dim = (input_shape[1], input_shape[2]) 129 | for layer in phi_net: 130 | this_w, this_b, prev_dim = _layer_to_weights(prev_dim, layer) 131 | phi_weights.append(this_w) 132 | phi_bias.append(this_b) 133 | 134 | #get weights for post-exchangeable part 135 | prev_dim = (prev_dim[0], prev_dim[1]*_g_to_dimension(input_shape[0], g)) 136 | for layer in h_net[:-1]: 137 | this_w, this_b, prev_dim = _layer_to_weights(prev_dim, layer) 138 | h_weights.append(this_w) 139 | h_bias.append(this_b) 140 | if h_net[-1][0] != "matmul" and h_net[-1][0] != "softmax": 141 | raise Exception("The final layer of the network must either be matmul or softmax") 142 | h_weights.append(tf.Variable(tf.truncated_normal((_total_dim(prev_dim),output_shape[0]),stddev=0.05))) 143 | h_bias.append(tf.Variable(tf.constant(0.01,shape=[output_shape[0]]))) 144 | 145 | #define function implied by network 146 | def network_function(x): 147 | if len(x.shape.as_list()) == 3: 148 | to_return = tf.expand_dims(x,-1) 149 | else: 150 | to_return = x 151 | for idx,layer in enumerate(phi_net): 152 | to_return = _layer(to_return, layer, phi_weights[idx], phi_bias[idx]) 153 | to_return = _symmetric_layer(to_return, g) 154 | for idx,layer in enumerate(h_net): 155 | to_return = _layer(to_return, layer, h_weights[idx], h_bias[idx]) 156 | return to_return 157 | 158 | return network_function 159 | 160 | def train(input_shape, output_shape, simulator, phi_net = [("fc",1024),("fc",1024)], g = ("max",), h_net = [("fc",512),("softmax",)], 161 | network_function = None, loss = "cross-ent", accuracy="classification", num_batches = 20000, batch_size=50, 162 | queue_capacity=250, verbosity=100, training_threads=1, 163 | sim_threads=1, save_path=None, training_summary = None, logfile = "."): 164 | """Train an exchangeable neural network using simulation-on-the-fly 165 | 166 | Args: 167 | input_shape: a tuple specifying the shape of the data output by 168 | output_shape: a tuple specifying the shape of the label output by 169 | simulator: a function that returns (data, label) tuples, where data is a and label is an shaped numpy array 170 | phi_net: a list of layers to perform before the symmetric function, see manual for more details 171 | g: a tuple specifying the exchangeable function to use, see manual for more details 172 | h_net: a list of layers to perform after the symmetric function, see manual for more details 173 | network_function: a function of tensorflow operations specifying the neural net (if present ignores phi_net, g, and h_net) 174 | loss: Either "cross-ent" for cross-entropy loss or "l2" for l2-loss or a user-defined tensorflow function 175 | accuracy: Either "classification", None for using loss function as accuracy, or a user-defined tensorflow function 176 | num_batches: number of mini-batches for training 177 | batch_size: number of training examples used per mini-batch 178 | queue_capacity: number of training examples to be held in queue 179 | verbosity: print every k iterations 180 | 
training_threads: number of threads used for neural network operations
181 | sim_threads: number of threads used to simulate data
182 | save_path: file base name to save neural network, if None do not save network
183 | training_summary: A text file containing the <# batch count> <# loss value> <# Accuracy> per line. The number of batches saved is determined by verbosity. If None, then nothing is saved.
184 | logfile: Log extra info to logfile. If logfile='.', logs to STDERR.
185 | Returns:
186 | None
187 | """
188 | if logfile == ".":
189 | logging.basicConfig(level=logging.INFO)
190 | elif logfile is not None:
191 | logging.basicConfig(filename=logfile, level=logging.INFO)
192 | 
193 | assert len(input_shape) == 2 or len(input_shape) == 3
194 | assert len(output_shape) == 1
195 | 
196 | #set up a queue that will simulate data on separate threads while the network trains
197 | #adapted from https://indico.io/blog/tensorflow-data-input-part2-extensions/
198 | def safe_simulate():
199 | try:
200 | result = simulator()
201 | if not isinstance(result, tuple): raise TypeError("Simulator did not produce a tuple")
202 | x,y = result
203 | if not isinstance(x, np.ndarray): raise TypeError("Simulator did not produce a numpy array as input")
204 | if not isinstance(y, np.ndarray): raise TypeError("Simulator did not produce a numpy array for the label")
205 | if not _dimension_match(x.shape,input_shape): raise _DimensionException("input")
206 | if not _dimension_match(y.shape,output_shape): raise _DimensionException("output")
207 | return result
208 | except _DimensionException as e:
209 | logging.warning("Dimension mismatch between simulations and " + str(e) + " size")
210 | except Exception as e:
211 | logging.warning("The provided simulator produced data that was not recognized by defiNETti.
The exception encountered was " + str(e)) 212 | 213 | def simulation_iterator(): 214 | while True: 215 | x,y = safe_simulate() 216 | yield x,y 217 | class SimulationRunner(object): 218 | def __init__(self): 219 | self.X = tf.placeholder(dtype=tf.float32, shape=input_shape) 220 | self.Y = tf.placeholder(dtype=tf.float32, shape=output_shape) 221 | self.queue = tf.FIFOQueue(shapes=[input_shape,output_shape], 222 | dtypes=[tf.float32, tf.float32], 223 | capacity=queue_capacity) 224 | self.enqueue_op = self.queue.enqueue([self.X,self.Y]) 225 | self.run = True 226 | def get_inputs(self, batch_size): 227 | x_batch, y_batch = self.queue.dequeue_many(batch_size) 228 | return x_batch, y_batch 229 | def thread_main(self, sess): 230 | for x,y in simulation_iterator(): 231 | if self.run: 232 | sess.run(self.enqueue_op, feed_dict={self.X:x, self.Y:y}) 233 | else: 234 | break 235 | def start_threads(self, sess, n_threads): 236 | threads = [] 237 | for n in range(n_threads): 238 | t = threading.Thread(target=self.thread_main, args=(sess,)) 239 | t.daemon = True 240 | t.start() 241 | threads.append(t) 242 | return threads 243 | 244 | #construct network 245 | simulator_thread = SimulationRunner() 246 | x_, y_ = simulator_thread.get_inputs(batch_size) 247 | if network_function is None: 248 | function = _build_network(phi_net, g, h_net, input_shape, output_shape) 249 | else: 250 | function = network_function 251 | 252 | #define loss and accuracy 253 | y_pred = function(x_) 254 | if loss == "cross-ent": 255 | loss_func = tf.reduce_mean(-tf.reduce_sum(y_ * (y_pred), axis=[1])) 256 | elif loss == "l2": 257 | loss_func = tf.reduce_mean(tf.reduce_sum((y_-y_pred)**2,axis=[1])) 258 | else: 259 | loss_func = loss(y_pred, y_) 260 | if accuracy is None: 261 | acc_func = loss_func 262 | elif accuracy == "classification": 263 | acc_func = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y_,1), tf.argmax(y_pred,1)),tf.float32)) 264 | else: 265 | acc_func = accuracy(y_pred, y_) 266 | train_step = tf.train.AdamOptimizer(.001).minimize(loss_func) 267 | 268 | #save network information for testing purposes 269 | input_dims = tf.constant(list(input_shape), dtype=tf.float32, name="input_dims") 270 | x_test = tf.placeholder(dtype=tf.float32, shape=[None] + list(input_shape), name="test_input") 271 | if loss == "cross-ent": 272 | test_function = tf.exp(function(x_test), name="learned_function") 273 | else: 274 | test_function = tf.identity(function(x_test), name="learned_function") 275 | 276 | 277 | saver = tf.train.Saver() #for saving the weights later 278 | 279 | #run training 280 | if training_summary is not None: 281 | summary = np.zeros((int(np.ceil(float(num_batches) / verbosity)),3)) 282 | with tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=training_threads)) as sess: 283 | init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()) 284 | sess.run(init_op) 285 | coord = tf.train.Coordinator() 286 | simulator_thread.start_threads(sess,n_threads=sim_threads) 287 | op_threads = tf.train.start_queue_runners(sess=sess, coord=coord) 288 | for batch_count in range(num_batches): 289 | _,loss_val,acc = sess.run([train_step,loss_func,acc_func]) 290 | if batch_count % verbosity == 0: 291 | logging.info('Batch {} complete'.format(batch_count)) 292 | logging.info('Loss value on current batch = {}'.format(loss_val)) 293 | logging.info('Accuracy on current batch = {}'.format(acc)) 294 | if training_summary is not None: 295 | summary[int(batch_count / verbosity),:] = [batch_count, loss_val, acc] 296 | 
297 | simulator_thread.run = False 298 | 299 | #close all threads 300 | coord.request_stop() 301 | coord.join(op_threads) 302 | 303 | if save_path is not None: 304 | saver.save(sess, os.path.expanduser(save_path)) 305 | 306 | np.savetxt(training_summary, summary) 307 | 308 | return 309 | 310 | 311 | 312 | 313 | def test(data, model_path, threads = 1): 314 | """Use a pre-trained neural network to analyze data 315 | 316 | Args: 317 | data: a list of numpy arrays on which to run the neural network 318 | model_path: path to the basename where the network is stored, should be same as save_path in train() 319 | threads: number of threads used for tensorflow operations 320 | Returns: 321 | output: a numpy array containing the network output for each input 322 | """ 323 | with tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=threads)) as sess: 324 | #Load the trained model 325 | loader = tf.train.import_meta_graph(os.path.expanduser(model_path) + ".meta") 326 | loader.restore(sess, tf.train.latest_checkpoint(os.path.dirname(model_path))) 327 | graph = tf.get_default_graph() 328 | 329 | #check that dimensions match for all inputs 330 | input_dims = sess.run(graph.get_tensor_by_name("input_dims:0")) 331 | for d in data: 332 | for idx,dim in enumerate(d.shape[1:]): 333 | if dim != input_dims[idx+1]: 334 | raise Exception("Training and Testing dimension differ") 335 | 336 | #run network 337 | test_input = graph.get_tensor_by_name("test_input:0") 338 | learned_function = graph.get_tensor_by_name("learned_function:0") 339 | outputs = sess.run(learned_function,{test_input:data}) 340 | 341 | return outputs 342 | -------------------------------------------------------------------------------- /example/genetic_map_GRCh37_chr1_truncated.txt: -------------------------------------------------------------------------------- 1 | Chromosome Position(bp) Rate(cM/Mb) Map(cM) 2 | chr1 55550 2.981822 0.000000 3 | chr1 82571 2.082414 0.080572 4 | chr1 88169 2.081358 0.092229 5 | chr1 254996 3.354927 0.439456 6 | chr1 564598 2.887498 1.478148 7 | chr1 564621 2.885864 1.478214 8 | chr1 565433 2.883892 1.480558 9 | chr1 568322 2.887570 1.488889 10 | chr1 568527 2.895420 1.489481 11 | chr1 721290 2.655176 1.931794 12 | chr1 723819 2.669992 1.938509 13 | chr1 728242 2.671779 1.950319 14 | chr1 729948 2.675202 1.954877 15 | chr1 739010 2.677693 1.979119 16 | chr1 740857 2.684310 1.984065 17 | chr1 750235 1.595600 2.009238 18 | chr1 752566 1.205854 2.012958 19 | chr1 753269 1.205507 2.013806 20 | chr1 753541 0.493648 2.014133 21 | chr1 754745 0.399609 2.014728 22 | chr1 765948 0.494327 2.019205 23 | chr1 767038 0.493138 2.019743 24 | chr1 767070 0.513499 2.019759 25 | chr1 768448 0.455582 2.020467 26 | chr1 769551 0.456521 2.020969 27 | chr1 771521 0.458418 2.021869 28 | chr1 774047 0.462218 2.023027 29 | chr1 775659 0.457037 2.023772 30 | chr1 776546 0.453806 2.024177 31 | chr1 777122 0.453593 2.024438 32 | chr1 777513 0.454370 2.024616 33 | chr1 779322 0.453268 2.025438 34 | chr1 780027 0.453415 2.025757 35 | chr1 780785 0.453231 2.026101 36 | chr1 784023 0.453169 2.027569 37 | chr1 784904 0.457151 2.027968 38 | chr1 785050 2.366093 2.028035 39 | chr1 785989 2.904551 2.030256 40 | chr1 792480 3.122720 2.049110 41 | chr1 798026 4.588229 2.066428 42 | chr1 798801 4.578148 2.069984 43 | chr1 798959 4.370906 2.070708 44 | chr1 799463 4.344183 2.072911 45 | chr1 800007 4.287599 2.075274 46 | chr1 808312 4.180096 2.110882 47 | chr1 846864 3.187313 2.272033 48 | chr1 882033 3.162755 2.384128 49 | chr1 888659 2.747543 
2.405084 50 | chr1 957898 0.952858 2.595322 51 | chr1 962210 0.517933 2.599430 52 | chr1 982573 0.521736 2.609977 53 | chr1 990380 0.513886 2.614050 54 | chr1 998501 0.513754 2.618223 55 | chr1 998874 0.510871 2.618415 56 | chr1 1003629 0.496544 2.620844 57 | chr1 1005806 0.407540 2.621925 58 | chr1 1011006 0.393293 2.624044 59 | chr1 1017170 1.250116 2.626469 60 | chr1 1017197 1.889336 2.626502 61 | chr1 1017216 2.173130 2.626538 62 | chr1 1017587 1.717554 2.627345 63 | chr1 1017598 1.341542 2.627363 64 | chr1 1018562 1.320611 2.628657 65 | chr1 1018704 1.310429 2.628844 66 | chr1 1018966 1.297303 2.629188 67 | chr1 1021346 1.851467 2.632275 68 | chr1 1021408 1.911398 2.632390 69 | chr1 1021415 2.349187 2.632403 70 | chr1 1021583 2.439198 2.632798 71 | chr1 1021658 2.635598 2.632981 72 | chr1 1021695 2.917768 2.633078 73 | chr1 1022037 3.303601 2.634076 74 | chr1 1025301 3.324303 2.644859 75 | chr1 1026707 3.304892 2.649533 76 | chr1 1030374 2.573938 2.661652 77 | chr1 1030565 0.660922 2.662144 78 | chr1 1031060 0.614689 2.662471 79 | chr1 1031540 0.280745 2.662766 80 | chr1 1033994 0.280671 2.663455 81 | chr1 1033999 0.280597 2.663456 82 | chr1 1034953 0.280571 2.663724 83 | chr1 1036959 0.261500 2.664287 84 | chr1 1040026 0.261278 2.665089 85 | chr1 1046164 0.261441 2.666693 86 | chr1 1048955 0.262259 2.667422 87 | chr1 1049950 0.263129 2.667683 88 | chr1 1052946 0.263101 2.668472 89 | chr1 1053452 0.262919 2.668605 90 | chr1 1053570 0.263541 2.668636 91 | chr1 1060235 0.579117 2.670392 92 | chr1 1060608 0.939437 2.670608 93 | chr1 1061115 1.029868 2.671085 94 | chr1 1061152 1.097144 2.671123 95 | chr1 1061166 1.164451 2.671138 96 | chr1 1062015 1.949239 2.672127 97 | chr1 1062638 3.111017 2.673341 98 | chr1 1063145 3.111201 2.674918 99 | chr1 1064535 3.117384 2.679243 100 | chr1 1064979 3.112779 2.680627 101 | -------------------------------------------------------------------------------- /example/run_example.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | sys.path.append(os.path.relpath("..")) 4 | from defiNETti import * 5 | import simulator_example 6 | 7 | # Train the network -- Use small num_batches for example purposes. 8 | prior_rates, weight_prob = simulator_example.parse_hapmap_empirical_prior(['genetic_map_GRCh37_chr1_truncated.txt']) 9 | print('Training the network...') 10 | train((198,24,2), (2,), lambda: simulator_example.simulate_data(prior_rates, weight_prob), 11 | phi_net = [('conv', 5, 32),('conv', 5, 64)], 12 | g = ('top_k',1), 13 | h_net = [("fc",128),("fc",128),("softmax",)], 14 | num_batches = 2000, 15 | verbosity = 100, 16 | training_threads=5, 17 | sim_threads=5, 18 | save_path='./example_weights', 19 | training_summary='./summary.txt') 20 | 21 | 22 | # Test the network 23 | print("Testing the network...") 24 | data = [] 25 | for i in range(500): 26 | data_i, _ = simulator_example.simulate_data(prior_rates, weight_prob) # Typically you should feed in your real data here instead of simulated. 
27 |     data.append(data_i)
28 | output = test(data, model_path='./example_weights', threads = 5)
29 | np.savetxt('./test_predictions.txt',output)
30 | 
--------------------------------------------------------------------------------
/example/simulator_example.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import sys
5 | import numpy as np
6 | import msprime
7 | 
8 | # Data shape
9 | HEIGHT = 198 # number of haplotypes, CEU = 198, YRI = 216
10 | WIDTH = 24 # number of seg sites, should also be divisible by 4
11 | NUM_CLASSES = 2
12 | 
13 | # Simulation parameters
14 | NE = 1.0e4
15 | MU = 1.1e-8
16 | R = 1.25e-8
17 | theta = 0.00044
18 | rho = .0005
19 | hap_len = 52000 #heuristic to get enough SNPs
20 | hot_spot_len = 2000
21 | lower_intensity = 10
22 | upper_intensity = 100
23 | GEN_IN_YRS = 29
24 | 
25 | 
26 | def parse_hapmap_empirical_prior(files):
27 |     print("Parsing empirical prior...")
28 |     assert(len(files) == 1)
29 |     print(files)
30 |     mat = np.loadtxt(files[0], skiprows = 1, usecols=(1,2))
31 |     print(mat.shape)
32 |     mat[:,1] = mat[:,1]*(1.e-8)
33 |     mat = mat[mat[:,1] != 0.0, :] # remove 0s
34 |     weights = mat[1:,0] - mat[:-1,0]
35 |     prior_rates = mat[:-1,1]
36 |     prob = weights / np.sum(weights)
37 |     print("Done parsing prior")
38 | 
39 |     return prior_rates, prob
40 | 
41 | 
42 | def draw_background_rate_from_prior(prior_rates, prob):
43 |     return np.random.choice(prior_rates, p=prob)
44 | 
45 | def simulate_data(prior_rates, weight_prob):
46 |     hot_spot = np.random.randint(0,2)
47 | 
48 |     # Use flat empirical maps
49 |     background = draw_background_rate_from_prior(prior_rates, weight_prob)
50 |     if hot_spot >= 1:
51 |         heat = np.random.uniform(lower_intensity,upper_intensity)*max(R/background,1.)
52 |         reco_map = msprime.RecombinationMap([0,(hap_len - hot_spot_len)/2,
53 |             (hap_len + hot_spot_len)/2, hap_len],[background,background*heat,background,0])
54 |     else:
55 |         reco_map = msprime.RecombinationMap([0,hap_len],[background,0])
56 | 
57 | 
58 |     # Simulate
59 |     tree_sequence = msprime.simulate(sample_size=HEIGHT, Ne = NE, recombination_map=reco_map,
60 |                                      mutation_rate=MU)
61 |     seg_sites = tree_sequence.get_num_mutations()
62 | 
63 |     # Make sure that we simulated enough seg sites for this example. Should not resimulate very often
64 |     if seg_sites < WIDTH:
65 |         print("Resimulating0...",file=sys.stderr)
66 |         print(seg_sites)
67 |         return simulate_data(prior_rates, weight_prob)
68 | 
69 | 
70 |     # center around the hotspot
71 |     mut_pos = [mut[0] for mut in tree_sequence.mutations()] # chromosome positions
72 |     assert(len(mut_pos) == seg_sites)
73 |     before_hs = [x < 0.5*hap_len for x in mut_pos].count(True)
74 |     after_hs = [x >= 0.5*hap_len for x in mut_pos].count(True)
75 |     hots = [x >= (hap_len-hot_spot_len)*0.5 and x < (hap_len+hot_spot_len)*0.5 for x in mut_pos].count(True)
76 |     assert(before_hs + after_hs == seg_sites)
77 | 
78 | 
79 |     # get distances and image
80 |     if hots >= WIDTH:
81 |         print("Hot Length: ", hots)
82 |         print("Resimulating1.5...",file=sys.stderr)
83 |         return simulate_data(prior_rates, weight_prob)
84 |     # Check if there are enough SNPs left and right of the hotspot
85 |     if before_hs < WIDTH//2:
86 |         print("Resimulating2...",file=sys.stderr)
87 |         print(before_hs)
88 |         return simulate_data(prior_rates, weight_prob)
89 |     if after_hs <= WIDTH//2:
90 |         print("Resimulating3...",file=sys.stderr)
91 |         print(after_hs)
92 |         return simulate_data(prior_rates, weight_prob)
93 |     distances = np.array([mut_pos[i+1]-mut_pos[i] for i in range(before_hs-WIDTH//2,before_hs+WIDTH//2)],dtype=float)[:-1] * 4*NE*MU
94 |     distances = np.vstack([np.copy(distances) for k in range(HEIGHT)])
95 |     image = np.array([np.copy(variant.genotypes) if variant.genotypes.sum() < HEIGHT/2 else np.copy(1-variant.genotypes) for variant in tree_sequence.variants()]).transpose()[:,before_hs-WIDTH//2:before_hs+WIDTH//2].reshape([HEIGHT,WIDTH])
96 | 
97 | 
98 | 
99 |     assert(image.shape[1] == distances.shape[1] + 1)
100 |     assert(image.shape[0] == distances.shape[0])
101 |     combined = np.empty((image.shape[0], image.shape[1], 2), dtype=distances.dtype)
102 |     combined[:,:,0] = image
103 |     combined[:,:-1,1] = distances
104 |     combined[:,-1,1] = 0.0
105 | 
106 |     label = np.zeros(2)
107 |     label[hot_spot] = 1
108 |     return (combined,label)
109 | 
110 | # prior_rates, weight_prob = parse_hapmap_empirical_prior(['genetic_map_GRCh37_chr1_truncated.txt'])
111 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | from setuptools import setup
4 | 
5 | setup(name='defiNETti',
6 |       version='1.0.0',
7 |       description='A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks',
8 |       author='Jeffrey Chan, Valerio Perrone, Jeffrey P. Spence, Paul A. Jenkins, Sara Mathieson, Yun S. Song',
9 |       author_email='chanjed@berkeley.edu, v.Perrone@warwick.ac.uk, spence.jeffrey@berkeley.edu, p.Jenkins@warwick.ac.uk, smathie1@swarthmore.edu, yss@berkeley.edu',
10 |       py_modules=['defiNETti'],
11 |       install_requires=['numpy>=1.14.2','tensorflow==1.6.0', 'msprime==0.4.0'],
12 |       )
13 | 
--------------------------------------------------------------------------------
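
As a closing illustration, here is a minimal sketch of the regression workflow that combines the Gaussian simulator from the README's Simulation section with ``train()`` and ``test()``. The layer sizes, number of batches, and file paths below are illustrative choices only::

    import numpy as np
    from defiNETti import train, test

    def simulator_gaussian():
        # prior draws for the mean and standard deviation, as in the README example
        mu = np.random.beta(5, 10)
        sigma = np.random.uniform(6, 10)
        data = np.random.normal(mu, sigma, (100, 1))   # 100 exchangeable observations
        label = np.array([mu, sigma])
        return (data, label)

    # Regression: h_net ends in ('matmul',) and the loss is "l2".
    train((100, 1), (2,), simulator_gaussian,
          phi_net=[("fc", 64), ("fc", 64)],
          g=("max",),
          h_net=[("fc", 32), ("matmul",)],
          loss="l2",
          accuracy=None,
          num_batches=500,
          batch_size=50,
          verbosity=50,
          sim_threads=2,
          save_path="./gaussian_weights",
          training_summary="./gaussian_summary.txt")

    # Estimate (mu, sigma) for newly simulated datasets.
    new_data = [simulator_gaussian()[0] for _ in range(10)]
    estimates = test(new_data, model_path="./gaussian_weights", threads=1)
    print(estimates.shape)   # (10, 2): one (mu, sigma) estimate per input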