├── CarDEC.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   ├── requires.txt
│   └── top_level.txt
├── CarDEC
│   ├── CarDEC_API.py
│   ├── CarDEC_MainModel.py
│   ├── CarDEC_SAE.py
│   ├── CarDEC_count_decoder.py
│   ├── CarDEC_dataloaders.py
│   ├── CarDEC_layers.py
│   ├── CarDEC_optimization.py
│   ├── CarDEC_utils.py
│   ├── __init__.py
│   └── __pycache__
│       ├── CarDEC_API.cpython-37.pyc
│       ├── CarDEC_MainModel.cpython-37.pyc
│       ├── CarDEC_SAE.cpython-37.pyc
│       ├── CarDEC_count_decoder.cpython-37.pyc
│       ├── CarDEC_dataloaders.cpython-37.pyc
│       ├── CarDEC_layers.cpython-37.pyc
│       ├── CarDEC_optimization.cpython-37.pyc
│       ├── CarDEC_utils.cpython-37.pyc
│       └── __init__.cpython-37.pyc
├── LICENSE.rtf
├── README.md
├── build
│   └── lib
│       └── CarDEC
│           ├── CarDEC_API.py
│           ├── CarDEC_MainModel.py
│           ├── CarDEC_SAE.py
│           ├── CarDEC_count_decoder.py
│           ├── CarDEC_dataloaders.py
│           ├── CarDEC_layers.py
│           ├── CarDEC_optimization.py
│           ├── CarDEC_utils.py
│           └── __init__.py
├── dist
│   ├── cardec-1.0.3-py3-none-any.whl
│   └── cardec-1.0.3.tar.gz
└── setup.py

/CarDEC.egg-info/PKG-INFO:
--------------------------------------------------------------------------------
1 | Metadata-Version: 2.1
2 | Name: cardec
3 | Version: 1.0.3
4 | Summary: A deep learning method for joint batch correction, denoising, and clustering of single-cell RNA-seq data.
5 | Home-page: https://github.com/jlakkis/CarDEC
6 | Author: Justin Lakkis
7 | Author-email: jlakks@gmail.com
8 | License: UNKNOWN
9 | Description: # CarDEC
10 |
11 | CarDEC (**C**ount **a**dapted **r**egularized **D**eep **E**mbedded **C**lustering) is a joint deep learning computational tool that is useful for analyses of single-cell RNA-seq data. CarDEC can be used to:
12 |
13 | 1. Correct for batch effect in the full gene expression space, allowing the investigator to remove batch effect from downstream analyses like pseudotime analysis and coexpression analysis. Batch correction is also possible in a low-dimensional embedding space.
14 | 2. Denoise gene expression.
15 | 3. Cluster cells.
16 |
17 | ## Reproducibility
18 |
19 | We described and introduced CarDEC in our [methodological paper](https://www.biorxiv.org/content/10.1101/2020.09.23.310003v1). To find code to reproduce the results we generated in that paper, please visit this separate [github repository](https://github.com/jlakkis/CarDEC_Codes), which provides all code (including that for other methods) necessary to reproduce our results.
20 |
21 | ## Installation
22 |
23 | The recommended installation procedure is as follows.
24 |
25 | 1. Install [Anaconda](https://www.anaconda.com/products/individual) if you do not already have it.
26 | 2. Create a conda environment, and then activate it as follows in terminal.
27 |
28 | ```
29 | $ conda create -n cardecenv
30 | $ conda activate cardecenv
31 | ```
32 |
33 | 3. Install an appropriate version of python.
34 |
35 | ```
36 | $ conda install python==3.7
37 | ```
38 |
39 | 4. Install nb_conda_kernels so that you can change python kernels in jupyter notebook.
40 |
41 | ```
42 | $ conda install nb_conda_kernels
43 | ```
44 |
45 | 5. Finally, install CarDEC.
46 |
47 | ```
48 | $ pip install CarDEC
49 | ```
50 |
51 | Now, to use CarDEC, always make sure you activate the environment in terminal first ("conda activate cardecenv"), and then run jupyter notebook. When you create a notebook to run CarDEC, make sure the active kernel is switched to "cardecenv".
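The snippet below sketches how the API calls defined in `CarDEC/CarDEC_API.py` fit together once the package is installed. It is only a sketch: the input file, `batch_key`, and `n_clusters` values are placeholders to adapt to your own data, and it assumes `CarDEC_API` is exposed by the package's `__init__.py`.

```
import scanpy as sc
from CarDEC import CarDEC_API

# Load an AnnData object of counts (placeholder path).
adata = sc.read_h5ad("my_dataset.h5ad")

# Preprocess and flag the 2000 most variable genes as HVGs; batch_key should
# name a column of adata.obs, or be None if all cells come from one batch.
cardec = CarDEC_API(adata, preprocess = True, batch_key = "batch", n_high_var = 2000, LVG = True)

# Build the model; n_clusters must be provided.
cardec.build_model(n_clusters = 10)

# Batch correction and denoising on the normalized scale; outputs (embeddings,
# cluster memberships, denoised layer) are added to cardec.dataset.
cardec.make_inference()

# Optional: denoise on the count scale with the negative binomial count decoder.
cardec.model_counts()
```

For a complete walkthrough, see the tutorial notebook linked in the Usage section below.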
52 |
53 | ## Usage
54 |
55 | A [tutorial jupyter notebook](https://drive.google.com/drive/folders/19VVOoq4XSdDFRZDou-VbTMyV2Na9z53O?usp=sharing), together with a dataset, is publicly downloadable.
56 |
57 | ## Software Requirements
58 |
59 | - Python >= 3.7
60 | - TensorFlow >= 2.0.1, <= 2.3.1
61 | - scikit-learn == 0.22.2.post1
62 | - scanpy == 1.5.1
63 | - louvain == 0.6.1
64 | - pandas == 1.0.1
65 | - scipy == 1.4.1
66 |
67 | ## Troubleshooting
68 |
69 | Installation on MacOS should be smooth. If installing on Windows Subsystem for Linux (WSL), the user must properly configure their g++ compiler to ensure that the louvain package can be built during installation. If the compiler is not properly configured, the user may encounter a deprecation error similar to the following.
70 |
71 | "DEPRECATION: Could not build wheels for louvain which do not use PEP 517. pip will fall back to legacy 'setup.py install' for these. pip 21.0 will remove support for this functionality. A possible replacement is to fix the wheel build issue reported above."
72 |
73 | To fix this error, try installing the libxml2-dev package.
74 | Platform: UNKNOWN
75 | Classifier: Programming Language :: Python :: 3
76 | Classifier: License :: OSI Approved :: MIT License
77 | Classifier: Operating System :: OS Independent
78 | Requires-Python: >=3.7
79 | Description-Content-Type: text/markdown
80 |
--------------------------------------------------------------------------------
/CarDEC.egg-info/SOURCES.txt:
--------------------------------------------------------------------------------
1 | LICENSE.rtf
2 | README.md
3 | setup.py
4 | CarDEC/CarDEC_API.py
5 | CarDEC/CarDEC_MainModel.py
6 | CarDEC/CarDEC_SAE.py
7 | CarDEC/CarDEC_count_decoder.py
8 | CarDEC/CarDEC_dataloaders.py
9 | CarDEC/CarDEC_layers.py
10 | CarDEC/CarDEC_optimization.py
11 | CarDEC/CarDEC_utils.py
12 | CarDEC/__init__.py
13 | cardec.egg-info/PKG-INFO
14 | cardec.egg-info/SOURCES.txt
15 | cardec.egg-info/dependency_links.txt
16 | cardec.egg-info/requires.txt
17 | cardec.egg-info/top_level.txt
--------------------------------------------------------------------------------
/CarDEC.egg-info/dependency_links.txt:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/CarDEC.egg-info/requires.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.18.1
2 | pandas>=1.0.1
3 | scipy>=1.4.1
4 | tensorflow<=2.3.1,>=2.0.1
5 | scikit-learn>=0.22.2.post1
6 | scanpy>=1.5.1
7 | louvain>=0.6.1
--------------------------------------------------------------------------------
/CarDEC.egg-info/top_level.txt:
--------------------------------------------------------------------------------
1 | CarDEC
2 |
--------------------------------------------------------------------------------
/CarDEC/CarDEC_API.py:
--------------------------------------------------------------------------------
1 | from .CarDEC_utils import normalize_scanpy
2 | from .CarDEC_MainModel import CarDEC_Model
3 | from .CarDEC_count_decoder import count_model
4 |
5 | import tensorflow as tf
6 | from tensorflow.keras.optimizers import Adam
7 | import numpy as np
8 | from pandas import DataFrame
9 |
10 | import os
11 |
12 | class CarDEC_API:
13 |     def __init__(self, adata, preprocess=True, weights_dir = "CarDEC Weights", batch_key = None, n_high_var = 2000, LVG = True,
14 |                  normalize_samples = True, log_normalize = True, normalize_features = True):
15 |         """ Main CarDEC API the user can use to conduct batch correction and denoising experiments.
16 |
17 |
18 |         Arguments:
19 |         ------------------------------------------------------------------
20 |         - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes.
21 |         - preprocess: `bool`, If True, then preprocess the data.
22 |         - weights_dir: `str`, the path in which to save the weights of the CarDEC model.
23 |         - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch.
24 |         - n_high_var: `int`, integer specifying the number of genes to be identified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable.
25 |         - LVG: `bool`, If True, also model LVGs. Otherwise, only model HVGs.
26 |         - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell.
27 |         - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count.
28 |         - normalize_features: `bool`, If True, z-score normalize each gene's expression.
29 |         """
30 |
31 |         if n_high_var is None:
32 |             n_high_var = None
33 |             LVG = False
34 |
35 |         self.weights_dir = weights_dir
36 |         self.LVG = LVG
37 |
38 |         self.norm_args = (batch_key, n_high_var, LVG, normalize_samples, log_normalize, normalize_features)
39 |
40 |         if preprocess:
41 |             self.dataset = normalize_scanpy(adata, *self.norm_args)
42 |         else:
43 |             assert 'Variance Type' in adata.var.keys()
44 |             assert 'normalized input' in adata.layers
45 |             self.dataset = adata
46 |
47 |         self.loaded = False
48 |         self.count_loaded = False
49 |
50 |     def build_model(self, load_fullmodel = True, dims = [128, 32], LVG_dims = [128, 32], tol = 0.005, n_clusters = None,
51 |                     random_seed = 201809, louvain_seed = 0, n_neighbors = 15, pretrain_epochs = 2000, batch_size_pretrain = 64,
52 |                     act = 'relu', actincenter = "tanh", ae_lr = 1e-04, ae_decay_factor = 1/3, ae_patience_LR = 3,
53 |                     ae_patience_ES = 9, clust_weight = 1., load_encoder_weights = True):
54 |         """ Initializes the main CarDEC model.
55 |
56 |
57 |         Arguments:
58 |         ------------------------------------------------------------------
59 |         - load_fullmodel: `bool`, If True, the API will try to load the weights for the full model from the weight directory.
60 |         - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers.
61 |         - LVG_dims: `list`, the number of output features for each layer of the LVG encoder. The length of the list determines the number of layers.
62 |         - tol: `float`, stop criterion; the clustering procedure will be stopped once the fraction of cells that change cluster assignment between iterations falls below tol.
63 |         - n_clusters: `int`, The number of clusters into which cells will be grouped.
64 |         - random_seed: `int`, The seed used for random weight initialization.
65 |         - louvain_seed: `int`, The seed used for louvain clustering initialization.
66 |         - n_neighbors: `int`, The number of neighbors used for building the graph needed for louvain clustering.
67 |         - pretrain_epochs: `int`, The maximum number of epochs for pretraining the HVG autoencoder.
In practice, early stopping criteria should stop training much earlier. 68 | - batch_size_pretrain: `int`, The batch size used for pretraining the HVG autoencoder. 69 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 70 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 71 | - ae_lr: `float`, The learning rate for pretraining the HVG autoencoder. 72 | - ae_decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 73 | - ae_patience_LR: `int`, the number of epochs which the validation loss is allowed to increase before learning rate is decayed when pretraining the autoencoder. 74 | - ae_patience_ES: `int`, the number of epochs which the validation loss is allowed to increase before training is halted when pretraining the autoencoder. 75 | - clust_weight: `float`, a number between 0 and 2 qhich balances the clustering and reconstruction losses. 76 | - load_encoder_weights: `bool`, If True, the API will try to load the weights for the HVG encoder from the weight directory. 77 | """ 78 | 79 | assert n_clusters is not None 80 | 81 | if 'normalized input' not in list(self.dataset.layers): 82 | self.dataset = normalize_scanpy(self.dataset, *self.norm_args) 83 | 84 | p = sum(self.dataset.var["Variance Type"] == 'HVG') 85 | self.dims = [p] + dims 86 | 87 | if self.LVG: 88 | LVG_p = sum(self.dataset.var["Variance Type"] == 'LVG') 89 | self.LVG_dims = [LVG_p] + LVG_dims 90 | else: 91 | self.LVG_dims = None 92 | 93 | self.load_fullmodel = load_fullmodel 94 | self.weights_exist = os.path.isfile("./" + self.weights_dir + "/tuned_CarDECweights.index") 95 | 96 | set_centroids = not (self.load_fullmodel and self.weights_exist) 97 | 98 | self.model = CarDEC_Model(self.dataset, self.dims, self.LVG_dims, tol, n_clusters, random_seed, louvain_seed, 99 | n_neighbors, pretrain_epochs, batch_size_pretrain, ae_decay_factor, 100 | ae_patience_LR, ae_patience_ES, act, actincenter, ae_lr, 101 | clust_weight, load_encoder_weights, set_centroids, self.weights_dir) 102 | 103 | def make_inference(self, batch_size = 64, val_split = 0.1, lr = 1e-04, decay_factor = 1/3, 104 | iteration_patience_LR = 3, iteration_patience_ES = 6, maxiter = 1e3, epochs_fit = 1, 105 | optimizer = Adam(), printperiter = None, denoise_all = True, denoise_list = None): 106 | """ This class method makes inference on the data (batch correction + denoising) with the main CarDEC model 107 | 108 | 109 | Arguments: 110 | ------------------------------------------------------------------ 111 | - batch_size: `int`, The batch size used for training the full model. 112 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 113 | - lr: `float`, The learning rate for training the full model. 114 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 115 | - iteration_patience_LR: `int`, The number of iterations tolerated before decaying the learning rate during which the number of cells that change assignment is less than tol. 116 | - iteration_patience_ES: `int`, The number of iterations tolerated before stopping training during which the number of cells that change assignment is less than tol. 117 | - maxiter: `int`, The maximum number of iterations allowed to train the full model. In practice, the model will halt training long before hitting this limit. 
118 | - epochs_fit: `int`, The number of epochs during which to fine-tune weights, before updating the target distribution. 119 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 120 | - printperiter: `int`, Optional integer argument. If specified, denoised values will be returned every printperiter epochs, so that the user can evaluate the progress of denoising as training continues. 121 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 122 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 123 | 124 | Returns: 125 | ------------------------------------------------------------------ 126 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression provided only for listed cells. If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 127 | """ 128 | 129 | if denoise_list is not None: 130 | denoise_all = False 131 | 132 | if not self.loaded: 133 | if self.load_fullmodel and self.weights_exist: 134 | self.dataset = self.model.reload_model(self.dataset, batch_size, denoise_all) 135 | 136 | elif not self.weights_exist: 137 | print("CarDEC Model Weights not detected. Training full model.\n") 138 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 139 | iteration_patience_LR, iteration_patience_ES, maxiter, 140 | epochs_fit, optimizer, printperiter, denoise_all) 141 | 142 | else: 143 | print("Training full model.\n") 144 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 145 | iteration_patience_LR, iteration_patience_ES, 146 | maxiter, epochs_fit, optimizer, printperiter, denoise_all) 147 | 148 | 149 | self.loaded = True 150 | 151 | elif denoise_all: 152 | self.dataset = self.model.make_outputs(self.dataset, batch_size, True) 153 | 154 | if denoise_list is not None: 155 | denoise_list = list(denoise_list) 156 | indices = [x in denoise_list for x in self.dataset.obs.index] 157 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 158 | denoised.index = self.dataset.obs.index[indices] 159 | denoised.columns = self.dataset.var.index 160 | 161 | 162 | if self.LVG: 163 | hvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"][indices]) 164 | lvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["LVG embedding"][indices]) 165 | 166 | input_ds = tf.data.Dataset.zip((hvg_ds, lvg_ds)) 167 | input_ds = input_ds.batch(batch_size) 168 | 169 | start = 0 170 | for x in input_ds: 171 | denoised_batch = {'HVG_denoised': self.model.decoder(x[0]), 'LVG_denoised': self.model.decoderLVG(x[1])} 172 | q_batch = self.model.clustering_layer(x[0]) 173 | end = start + q_batch.shape[0] 174 | 175 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'HVG')[0]] = denoised_batch['HVG_denoised'].numpy() 176 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'LVG')[0]] = denoised_batch['LVG_denoised'].numpy() 177 | 178 | start = end 179 | 180 | else: 181 | input_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"]) 182 | 183 | input_ds = input_ds.batch(batch_size) 184 | 185 | start = 0 186 | 187 | for x in input_ds: 188 | denoised_batch = {'HVG_denoised': self.model.decoder(x)} 189 | q_batch = self.model.clustering_layer(x) 
190 | end = start + q_batch.shape[0] 191 | 192 | denoised.iloc[start:end] = denoised_batch['HVG_denoised'].numpy() 193 | 194 | start = end 195 | 196 | return denoised 197 | 198 | print(" ") 199 | 200 | def model_counts(self, load_weights = True, act = 'relu', random_seed = 201809, 201 | optimizer = Adam(), keep_dispersion = False, num_epochs = 2000, batch_size_count = 64, 202 | val_split = 0.1, lr = 1e-03, decay_factor = 1/3, patience_LR = 3, patience_ES = 9, 203 | denoise_all = True, denoise_list = None): 204 | """ This class method makes inference on the data on the count scale. 205 | 206 | 207 | Arguments: 208 | ------------------------------------------------------------------ 209 | - load_weights: `bool`, If true, the API will attempt to load the weights for the count model. 210 | - act: `str`, A string specifying the activation function for intermediate layers of the count models. 211 | - random_seed: `int`, A seed used for weight initialization. 212 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 213 | - keep_dispersion: `bool`, If True, the gene, cell dispersions will be returned as well. 214 | - num_epochs: `int`, The maximum number of epochs allowed to train each count model. In practice, the model will halt 215 | training long before hitting this limit. 216 | - batch_size_count: `int`, The batch size used for training the count models. 217 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 218 | - lr: `float`, The learning rate for training the count models. 219 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 220 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss does not decrease. 221 | - patience_ES: `int`, The number of iterations tolerated before stopping training during which the validation loss does not decrease. 222 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 223 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 224 | 225 | Returns: 226 | ------------------------------------------------------------------ 227 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression on the count scale provided only for listed cells. If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 228 | - denoised_dispersion: `pd.DataFrame`, (Optional) If denoise_list was specified and "keep_dispersion" was set to True, then this will be an array of dispersions from the fitted negative binomial model provided only for listed cells. If denoise_all was instead specified as False, but "keep_dispersion" was still True then dispersions for all cells will be added as a layer to adata. 
229 | """ 230 | 231 | if denoise_list is not None: 232 | denoise_all = False 233 | 234 | if not self.count_loaded: 235 | weights_dir = os.path.join(self.weights_dir, 'count weights') 236 | weight_files_exist = os.path.isfile(weights_dir + "/countmodel_weights_HVG Count.index") 237 | if self.LVG: 238 | weight_files_exist = weight_files_exist and os.path.isfile(weights_dir + "/countmodel_weights_LVG Count.index") 239 | 240 | init_args = (act, random_seed, self.model.splitseed, optimizer, weights_dir) 241 | train_args = (num_epochs, batch_size_count, val_split, lr, decay_factor, patience_LR, patience_ES) 242 | 243 | self.nbmodel = count_model(self.dims, *init_args, n_features = self.dims[-1], mode = 'HVG') 244 | 245 | if load_weights and weight_files_exist: 246 | print("Weight files for count models detected, loading weights.") 247 | self.nbmodel.load_model() 248 | 249 | elif load_weights: 250 | print("Weight files for count models not detected. Training HVG count model.\n") 251 | self.nbmodel.train(self.dataset, *train_args) 252 | 253 | else: 254 | print("Training HVG count model.\n") 255 | self.nbmodel.train(self.dataset, *train_args) 256 | 257 | if self.LVG: 258 | self.nbmodel_lvg = count_model(self.LVG_dims, *init_args, 259 | n_features = self.dims[-1] + self.LVG_dims[-1], mode = 'LVG') 260 | 261 | if load_weights and weight_files_exist: 262 | self.nbmodel_lvg.load_model() 263 | print("Count model weights loaded successfully.") 264 | 265 | elif load_weights: 266 | print("\n \n \n") 267 | print("Training LVG count model.\n") 268 | self.nbmodel_lvg.train(self.dataset, *train_args) 269 | 270 | else: 271 | print("\n \n \n") 272 | print("Training LVG count model.\n") 273 | self.nbmodel_lvg.train(self.dataset, *train_args) 274 | 275 | self.count_loaded = True 276 | 277 | if denoise_all: 278 | self.nbmodel.denoise(self.dataset, keep_dispersion, batch_size_count) 279 | if self.LVG: 280 | self.nbmodel_lvg.denoise(self.dataset, keep_dispersion, batch_size_count) 281 | 282 | elif denoise_list is not None: 283 | denoise_list = list(denoise_list) 284 | indices = [x in denoise_list for x in self.dataset.obs.index] 285 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 286 | denoised.index = self.dataset.obs.index[indices] 287 | denoised.columns = self.dataset.var.index 288 | if keep_dispersion: 289 | denoised_dispersion = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 290 | denoised_dispersion.index = self.dataset.obs.index[indices] 291 | denoised_dispersion.columns = self.dataset.var.index 292 | 293 | input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['embedding'][indices]) 294 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 295 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 296 | input_ds = input_ds.batch(batch_size_count) 297 | 298 | type_indices = np.where(self.dataset.var['Variance Type'] == 'HVG')[0] 299 | 300 | if not keep_dispersion: 301 | start = 0 302 | for x in input_ds: 303 | end = start + x[0].shape[0] 304 | denoised.iloc[start:end, type_indices] = self.nbmodel(*x)[0].numpy() 305 | start = end 306 | 307 | else: 308 | start = 0 309 | for x in input_ds: 310 | end = start + x[0].shape[0] 311 | batch_output = self.nbmodel(*x) 312 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 313 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 314 | start = end 315 | 316 | if self.LVG: 317 | 
input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['LVG embedding'][indices]) 318 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 319 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 320 | input_ds = input_ds.batch(batch_size_count) 321 | 322 | type_indices = np.where(self.dataset.var['Variance Type'] == 'LVG')[0] 323 | 324 | if not keep_dispersion: 325 | start = 0 326 | for x in input_ds: 327 | end = start + x[0].shape[0] 328 | denoised.iloc[start:end, type_indices] = self.nbmodel_lvg(*x)[0].numpy() 329 | start = end 330 | 331 | else: 332 | start = 0 333 | for x in input_ds: 334 | end = start + x[0].shape[0] 335 | batch_output = self.nbmodel_lvg(*x) 336 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 337 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 338 | start = end 339 | 340 | if not keep_dispersion: 341 | return denoised 342 | else: 343 | return denoised, denoised_dispersion 344 | 345 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_MainModel.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_SAE import SAE 2 | from .CarDEC_utils import build_dir, find_resolution 3 | from .CarDEC_layers import ClusteringLayer 4 | from .CarDEC_optimization import grad_MainModel as grad, total_loss, MSEloss 5 | from .CarDEC_dataloaders import simpleloader, dataloader, tupleloader 6 | 7 | import tensorflow as tf 8 | from tensorflow.keras import Model, Sequential 9 | from tensorflow.keras.layers import Dense, concatenate 10 | from tensorflow.keras.optimizers import Adam 11 | from tensorflow.keras.backend import set_floatx 12 | 13 | from sklearn.cluster import KMeans 14 | 15 | import scanpy as sc 16 | from anndata import AnnData 17 | import pandas as pd 18 | 19 | import random 20 | import numpy as np 21 | from math import ceil 22 | 23 | import os 24 | from copy import deepcopy 25 | from time import time 26 | 27 | set_floatx('float32') 28 | 29 | class CarDEC_Model(Model): 30 | def __init__(self, adata, dims, LVG_dims = None, tol = 0.005, n_clusters = None, random_seed = 201809, 31 | louvain_seed = 0, n_neighbors = 15, pretrain_epochs = 300, batch_size = 64, decay_factor = 1/3, 32 | patience_LR = 3, patience_ES = 9, act = 'relu', actincenter = "tanh", ae_lr = 1e-04, clust_weight = 1., 33 | load_encoder_weights = True, set_centroids = True, weights_dir = "CarDEC Weights"): 34 | super(CarDEC_Model, self).__init__() 35 | """ This class creates the TensorFlow CarDEC model architecture. 36 | 37 | 38 | Arguments: 39 | ------------------------------------------------------------------ 40 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 41 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 42 | - LVG_dims: `list`, the number of output features for each layer of the LVG encoder. The length of the list determines the number of layers. 43 | - tol: `float`, stop criterion, clustering procedure will be stopped when the difference ratio between the current iteration and last iteration larger than tol. 44 | - n_clusters: `int`, The number of clusters into which cells will be grouped. 45 | - random_seed: `int`, The seed used for random weight intialization. 
46 | - louvain_seed: `int`, The seed used for louvain clustering intialization. 47 | - n_neighbors: `int`, The number of neighbors used for building the graph needed for louvain clustering. 48 | - pretrain_epochs: `int`, The maximum number of epochs for pretraining the HVG autoencoder. In practice, early stopping criteria should stop training much earlier. 49 | - batch_size: `int`, The batch size used for pretraining the HVG autoencoder. 50 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 51 | - patience_LR: `int`, the number of epochs which the validation loss is allowed to increase before learning rate is decayed when pretraining the autoencoder. 52 | - patience_ES: `int`, the number of epochs which the validation loss is allowed to increase before training is halted when pretraining the autoencoder. 53 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 54 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 55 | - ae_lr: `float`, The learning rate for pretraining the HVG autoencoder. 56 | - clust_weight: `float`, a number between 0 and 2 qhich balances the clustering and reconstruction losses. 57 | - load_encoder_weights: `bool`, If True, the API will try to load the weights for the HVG encoder from the weight directory. 58 | - set_centroids: `bool`, If True, intialize the centroids by running Louvain's algorithm. 59 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 60 | ------------------------------------------------------------------ 61 | """ 62 | 63 | assert clust_weight <= 2. and clust_weight>=0. 64 | 65 | tf.keras.backend.clear_session() 66 | 67 | self.dims = dims 68 | self.LVG_dims = LVG_dims 69 | self.tol = tol 70 | self.input_dim = dims[0] # for clustering layer 71 | self.n_stacks = len(self.dims) - 1 72 | self.n_neighbors = n_neighbors 73 | self.batch_size = batch_size 74 | self.random_seed = random_seed 75 | self.activation = act 76 | self.actincenter = actincenter 77 | self.load_encoder_weights = load_encoder_weights 78 | self.clust_weight = clust_weight 79 | self.weights_dir = weights_dir 80 | self.preclust_embedding = None 81 | 82 | # set random seed 83 | random.seed(random_seed) 84 | np.random.seed(random_seed) 85 | tf.random.set_seed(random_seed) 86 | self.splitseed = round(abs(10000*np.random.randn())) 87 | 88 | # build the autoencoder 89 | self.sae = SAE(dims = self.dims, act = self.activation, actincenter = self.actincenter, 90 | random_seed = random_seed, splitseed = self.splitseed, init="glorot_uniform", optimizer = Adam(), 91 | weights_dir = weights_dir) 92 | 93 | build_dir(self.weights_dir) 94 | 95 | decoder_seed = round(100 * abs(np.random.normal())) 96 | if load_encoder_weights: 97 | if os.path.isfile("./" + self.weights_dir + "/pretrained_autoencoder_weights.index"): 98 | print("Pretrain weight index file detected, loading weights.") 99 | self.sae.load_autoencoder() 100 | print("Pretrained high variance autoencoder weights initialized.") 101 | else: 102 | print("Pretrain weight index file not detected, pretraining autoencoder weights.\n") 103 | self.sae.train(adata, lr = ae_lr, num_epochs = pretrain_epochs, 104 | batch_size = batch_size, decay_factor = decay_factor, 105 | patience_LR = patience_LR, patience_ES = patience_ES) 106 | self.sae.load_autoencoder() 107 | else: 108 | print("Pre-training high variance autoencoder.\n") 109 | 
self.sae.train(adata, lr = ae_lr, num_epochs = pretrain_epochs, 110 | batch_size = batch_size, decay_factor = decay_factor, 111 | patience_LR = patience_LR, patience_ES = patience_ES) 112 | self.sae.load_autoencoder() 113 | 114 | features = self.sae.embed(adata) 115 | self.preclust_emb = deepcopy(features) 116 | self.preclust_denoised = self.sae.denoise(adata, batch_size) 117 | 118 | if not set_centroids: 119 | self.init_centroid = np.zeros((n_clusters, self.dims[-1]), dtype = 'float32') 120 | self.n_clusters = n_clusters 121 | self.init_pred = np.zeros((adata.shape[0], dims[-1])) 122 | 123 | elif louvain_seed is None: 124 | print("\nInitializing cluster centroids using K-Means") 125 | 126 | kmeans = KMeans(n_clusters=n_clusters, n_init = 20) 127 | Y_pred_init = kmeans.fit_predict(features) 128 | 129 | self.init_pred = deepcopy(Y_pred_init) 130 | self.n_clusters = n_clusters 131 | self.init_centroid = kmeans.cluster_centers_ 132 | 133 | else: 134 | print("\nInitializing cluster centroids using the louvain method.") 135 | 136 | n_cells = features.shape[0] 137 | 138 | if n_cells > 10**5: 139 | subset = np.random.choice(range(n_cells), 10**5, replace = False) 140 | adata0 = AnnData(features[subset]) 141 | else: 142 | adata0 = AnnData(features) 143 | 144 | sc.pp.neighbors(adata0, n_neighbors = self.n_neighbors, use_rep="X") 145 | self.resolution = find_resolution(adata0, n_clusters, louvain_seed) 146 | adata0 = sc.tl.louvain(adata0, resolution = self.resolution, random_state = louvain_seed, copy = True) 147 | 148 | Y_pred_init = adata0.obs['louvain'] 149 | self.init_pred = np.asarray(Y_pred_init, dtype=int) 150 | 151 | features = pd.DataFrame(adata0.X, index = np.arange(0, adata0.shape[0])) 152 | Group = pd.Series(self.init_pred, index = np.arange(0, adata0.shape[0]), name="Group") 153 | Mergefeature = pd.concat([features, Group],axis=1) 154 | 155 | self.init_centroid = np.asarray(Mergefeature.groupby("Group").mean()) 156 | self.n_clusters = self.init_centroid.shape[0] 157 | 158 | print("\n " + str(self.n_clusters) + " clusters detected. \n") 159 | 160 | self.encoder = self.sae.encoder 161 | self.decoder = self.sae.decoder 162 | 163 | if LVG_dims is not None: 164 | n_stacks = len(dims) - 1 165 | 166 | LVG_encoder_layers = [] 167 | 168 | for i in range(n_stacks-1): 169 | LVG_encoder_layers.append(Dense(LVG_dims[i + 1], kernel_initializer = 'glorot_uniform', activation = self.activation, name='encoder%d' % i)) 170 | 171 | LVG_encoder_layers.append(Dense(LVG_dims[-1], kernel_initializer = 'glorot_uniform', activation = self.actincenter, name='embedding')) 172 | self.encoderLVG = Sequential(LVG_encoder_layers, name = 'encoderLVG') 173 | 174 | if LVG_dims is not None: 175 | decoder_layers = [] 176 | for i in range(self.n_stacks - 1, 0, -1): 177 | decoder_layers.append(Dense(self.LVG_dims[i], kernel_initializer = 'glorot_uniform', 178 | activation = self.activation, name='decoderLVG%d' % (i-1))) 179 | 180 | decoder_layers.append(Dense(self.LVG_dims[0], activation = 'linear', name='outputLVG')) 181 | self.decoderLVG = Sequential(decoder_layers, name = 'decoderLVG') 182 | 183 | self.clustering_layer = ClusteringLayer(centroids = self.init_centroid, name = 'clustering') 184 | 185 | del self.sae 186 | 187 | self.construct() 188 | 189 | def construct(self, summarize = True): 190 | """ This class method fully initalizes the TensorFlow model. 
191 | 192 | 193 | Arguments: 194 | ------------------------------------------------------------------ 195 | - summarize: `bool`, If True, then print a summary of the model architecture. 196 | """ 197 | 198 | x = [tf.zeros(shape = (1, self.dims[0]), dtype=float), None] 199 | if self.LVG_dims is not None: 200 | x[1] = tf.zeros(shape = (1, self.LVG_dims[0]), dtype=float) 201 | 202 | out = self(*x) 203 | 204 | if summarize: 205 | print("\n-----------------------CarDEC Architecture-----------------------\n") 206 | self.summary() 207 | 208 | print("\n--------------------Encoder Sub-Architecture--------------------\n") 209 | self.encoder.summary() 210 | 211 | print("\n------------------Base Decoder Sub-Architecture------------------\n") 212 | self.decoder.summary() 213 | 214 | if self.LVG_dims is not None: 215 | print("\n------------------LVG Encoder Sub-Architecture------------------\n") 216 | self.encoderLVG.summary() 217 | 218 | print("\n----------------LVG Base Decoder Sub-Architecture----------------\n") 219 | self.decoderLVG.summary() 220 | 221 | def call(self, hvg, lvg, denoise = True): 222 | """ This is the forward pass of the model. 223 | 224 | 225 | ***Inputs*** 226 | - hvg: `tf.Tensor`, an input tensor of shape (n_obs, n_HVG). 227 | - lvg: `tf.Tensor`, (Optional) an input tensor of shape (n_obs, n_LVG). 228 | - denoise: `bool`, (Optional) If True, return denoised expression values for each cell. 229 | 230 | ***Outputs*** 231 | - denoised_output: `dict`, (Optional) Dictionary containing denoised tensors. 232 | - cluster_output: `tf.Tensor`, a tensor of cell cluster membership probabilities of shape (n_obs, m). 233 | """ 234 | 235 | hvg = self.encoder(hvg) 236 | 237 | cluster_output = self.clustering_layer(hvg) 238 | 239 | if not denoise: 240 | return cluster_output 241 | 242 | HVG_denoised_output = self.decoder(hvg) 243 | denoised_output = {'HVG_denoised': HVG_denoised_output} 244 | 245 | if self.LVG_dims is not None: 246 | lvg = self.encoderLVG(lvg) 247 | z = concatenate([hvg, lvg], axis=1) 248 | 249 | LVG_denoised_output = self.decoderLVG(z) 250 | 251 | denoised_output['LVG_denoised'] = LVG_denoised_output 252 | 253 | return denoised_output, cluster_output 254 | 255 | @staticmethod 256 | def target_distribution(q): 257 | """ Updates target distribution cluster assignment probabilities given CarDEC output. 258 | 259 | 260 | Arguments: 261 | ------------------------------------------------------------------ 262 | - q: `tf.Tensor`, a tensor of shape (b, m) identifying the probability that each of b cells is in each of the m clusters. Obtained as output from CarDEC. 263 | 264 | Returns: 265 | ------------------------------------------------------------------ 266 | - p: `tf.Tensor`, a tensor of shape (b, m) identifying the pseudo-label probability that each of b cells is in each of the m clusters. 267 | """ 268 | 269 | weight = q ** 2 / np.sum(q, axis = 0) 270 | p = weight.T / np.sum(weight, axis = 1) 271 | return p.T 272 | 273 | def make_generators(self, adata, val_split, batch_size): 274 | """ This class method creates training and validation data generators for the current input data and pseudo labels. 275 | 276 | 277 | Arguments: 278 | ------------------------------------------------------------------ 279 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 280 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 
281 | - batch_size: `int`, The batch size used for training the full model. 282 | - p: `tf.Tensor`, a tensor of shape (b, m) identifying the pseudo-label probability that each of b cells is in each of the m clusters. 283 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between iterations to ensure the same cells are always used for validation. 284 | - newseed: `int`, The seed that is set after splitting cells between training and validation. Should be different every iteration so that stochastic operations other than splitting cells between training and validation vary between epochs. 285 | 286 | Returns: 287 | ------------------------------------------------------------------ 288 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 289 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 290 | """ 291 | 292 | if self.LVG_dims is None: 293 | hvg_input = adata.layers["normalized input"] 294 | hvg_target = adata.layers["normalized input"] 295 | lvg_input = None 296 | lvg_target = None 297 | else: 298 | hvg_input = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'] 299 | hvg_target = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'] 300 | lvg_input = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'LVG'] 301 | lvg_target = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'LVG'] 302 | 303 | return dataloader(hvg_input, hvg_target, lvg_input, lvg_target, val_split, batch_size, self.splitseed) 304 | 305 | def train_loop(self, train_dataset): 306 | """ This class method runs the training loop. 307 | 308 | 309 | Arguments: 310 | ------------------------------------------------------------------ 311 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 312 | 313 | Returns: 314 | ------------------------------------------------------------------ 315 | - epoch_loss_avg: `float`, The mean training loss for the iteration. 316 | """ 317 | 318 | epoch_loss_avg = tf.keras.metrics.Mean() 319 | 320 | for inputs, target, LVG_target, batch_p in train_dataset(val = False): 321 | loss_value, grads = grad(self, inputs, target, batch_p, total_loss = total_loss, 322 | LVG_target = LVG_target, aeloss_fun = MSEloss, 323 | clust_weight = self.clust_weight) 324 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 325 | epoch_loss_avg(loss_value) 326 | 327 | return epoch_loss_avg.result() 328 | 329 | def validation_loop(self, val_dataset): 330 | """ This class method runs the validation loop. 331 | 332 | 333 | Arguments: 334 | ------------------------------------------------------------------ 335 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 
336 | 337 | Returns: 338 | ------------------------------------------------------------------ 339 | - epoch_loss_avg: `float`, The mean validation loss for the iteration (reconstruction + clustering loss) 340 | - epoch_aeloss_avg_val: `float`, The mean validation reconstruction loss for the iteration 341 | """ 342 | 343 | epoch_loss_avg_val = tf.keras.metrics.Mean() 344 | epoch_aeloss_avg_val = tf.keras.metrics.Mean() 345 | 346 | for inputs, target, LVG_target, batch_p in val_dataset(val = True): 347 | denoised_output, cluster_output = self(*inputs) 348 | loss_value, aeloss = total_loss(target, denoised_output, batch_p, cluster_output, 349 | LVG_target = LVG_target, aeloss_fun = MSEloss, clust_weight = self.clust_weight) 350 | epoch_loss_avg_val(loss_value) 351 | epoch_aeloss_avg_val(aeloss) 352 | 353 | return epoch_loss_avg_val.result(), epoch_aeloss_avg_val.result() 354 | 355 | def package_output(self, adata, init_pred, preclust_denoised, preclust_emb): 356 | """ This class adds some quantities to the adata object. 357 | 358 | 359 | Arguments: 360 | ------------------------------------------------------------------ 361 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 362 | - init_pred: `np.ndarray`, the array of initial cluster assignments for each cells, of shape (n_obs,). 363 | - preclust_denoised: `np.ndarray`, This is the array of feature zscores denoised with the pretrained autoencoder of shape (n_obs, n_vars). 364 | - preclust_emb: `np.ndarray`, This is the latent embedding from the pretrained autoencoder of shape (n_obs, n_embedding). 365 | """ 366 | 367 | adata.obsm['precluster denoised'] = preclust_denoised 368 | adata.obsm['precluster embedding'] = preclust_emb 369 | if adata.shape[0] == init_pred.shape[0]: 370 | adata.obsm['initial assignments'] = init_pred 371 | 372 | def embed(self, adata, batch_size): 373 | """ This class method can be used to compute the low-dimension embedding for HVG features. 374 | 375 | 376 | Arguments: 377 | ------------------------------------------------------------------ 378 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 379 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 380 | 381 | Returns: 382 | ------------------------------------------------------------------ 383 | - embedding: `np.ndarray`, Array of shape (n_obs, p_embedding) containing the HVG embedding for every cell in the dataset. 384 | """ 385 | 386 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 387 | 388 | embedding = np.zeros((adata.shape[0], self.dims[-1]), dtype = 'float32') 389 | start = 0 390 | 391 | for x in input_ds: 392 | end = start + x.shape[0] 393 | embedding[start:end] = self.encoder(x).numpy() 394 | start = end 395 | 396 | return embedding 397 | 398 | def embed_LVG(self, adata, batch_size): 399 | """ This class method can be used to compute the low-dimension embedding for LVG features. 400 | 401 | 402 | Arguments: 403 | ------------------------------------------------------------------ 404 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 405 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 
406 | 407 | Returns: 408 | ------------------------------------------------------------------ 409 | - embedding: `np.ndarray`, Array of shape (n_obs, n_embedding) containing the LVG embedding for every cell in the dataset. 410 | """ 411 | 412 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'LVG'], batch_size) 413 | 414 | LVG_embedded = np.zeros((adata.shape[0], self.LVG_dims[-1]), dtype = 'float32') 415 | start = 0 416 | 417 | for x in input_ds: 418 | end = start + x.shape[0] 419 | LVG_embedded[start:end] = self.encoderLVG(x).numpy() 420 | start = end 421 | 422 | return np.concatenate((adata.obsm['embedding'], LVG_embedded), axis = 1) 423 | 424 | def make_outputs(self, adata, batch_size, denoise = True): 425 | """ This class method can be used to pack all relvant outputs into the adata object after training. 426 | 427 | 428 | Arguments: 429 | ------------------------------------------------------------------ 430 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 431 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 432 | - denoise: `bool`, Whether to provide denoised expression values for all cells. 433 | """ 434 | 435 | if not denoise: 436 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 437 | adata.obsm["cluster memberships"] = np.zeros((adata.shape[0], self.n_clusters), dtype = 'float32') 438 | 439 | start = 0 440 | for x in input_ds: 441 | q_batch = self(x, None, False) 442 | end = start + q_batch.shape[0] 443 | adata.obsm["cluster memberships"][start:end] = q_batch.numpy() 444 | 445 | start = end 446 | 447 | 448 | elif self.LVG_dims is not None: 449 | if not ('embedding' in list(adata.obsm) and 'LVG embedding' in list(adata.obsm)): 450 | adata.obsm['embedding'] = self.embed(adata, batch_size) 451 | adata.obsm['LVG embedding'] = self.embed_LVG(adata, batch_size) 452 | input_ds = tupleloader(adata.obsm["embedding"], adata.obsm["LVG embedding"], batch_size = batch_size) 453 | 454 | adata.obsm["cluster memberships"] = np.zeros((adata.shape[0], self.n_clusters), dtype = 'float32') 455 | adata.layers["denoised"] = np.zeros(adata.shape, dtype = 'float32') 456 | 457 | start = 0 458 | for input_ in input_ds: 459 | denoised_batch = {'HVG_denoised': self.decoder(input_[0]), 'LVG_denoised': self.decoderLVG(input_[1])} 460 | q_batch = self.clustering_layer(input_[0]) 461 | end = start + q_batch.shape[0] 462 | 463 | adata.obsm["cluster memberships"][start:end] = q_batch.numpy() 464 | adata.layers["denoised"][start:end, adata.var['Variance Type'] == 'HVG'] = denoised_batch['HVG_denoised'].numpy() 465 | adata.layers["denoised"][start:end, adata.var['Variance Type'] == 'LVG'] = denoised_batch['LVG_denoised'].numpy() 466 | 467 | start = end 468 | 469 | else: 470 | if not ('embedding' in list(adata.obsm)): 471 | adata.obsm['embedding'] = self.embed(adata, batch_size) 472 | input_ds = simpleloader(adata.obsm["embedding"], batch_size) 473 | 474 | adata.obsm["cluster memberships"] = np.zeros((adata.shape[0], self.n_clusters), dtype = 'float32') 475 | adata.layers["denoised"] = np.zeros(adata.shape, dtype = 'float32') 476 | 477 | start = 0 478 | 479 | for input_ in input_ds: 480 | denoised_batch = {'HVG_denoised': self.decoder(input_)} 481 | q_batch = self.clustering_layer(input_) 482 | 483 | end = start + q_batch.shape[0] 484 | 485 | adata.obsm["cluster memberships"][start:end] = q_batch.numpy() 486 | 
adata.layers["denoised"][start:end] = denoised_batch['HVG_denoised'].numpy() 487 | 488 | start = end 489 | 490 | def train(self, adata, batch_size = 64, val_split = 0.1, lr = 1e-04, decay_factor = 1/3, 491 | iteration_patience_LR = 3, iteration_patience_ES = 6, 492 | maxiter = 1e3, epochs_fit = 1, optimizer = Adam(), printperiter = None, denoise = True): 493 | """ This class method can be used to train the main CarDEC model 494 | 495 | 496 | Arguments: 497 | ------------------------------------------------------------------ 498 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 499 | - batch_size: `int`, The batch size used for training the full model. 500 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 501 | - lr: `float`, The learning rate for training the full model. 502 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 503 | - iteration_patience_LR: `int`, The number of iterations tolerated before decaying the learning rate during which the number of cells that change assignment is less than tol. 504 | - iteration_patience_ES: `int`, The number of iterations tolerated before stopping training during which the number of cells that change assignment is less than tol. 505 | - maxiter: `int`, The maximum number of iterations allowed to train the full model. In practice, the model will halt training long before hitting this limit. 506 | - epochs_fit: `int`, The number of epochs during which to fine-tune weights, before updating the target distribution. 507 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 508 | - printperiter: `int`, Optional integer argument. If specified, denoised values will be returned every printperiter epochs, so that the user can evaluate the progress of denoising as training continues. 509 | - denoise: `bool`, If True, then denoised expression values are provided for all cells. 510 | 511 | Returns: 512 | ------------------------------------------------------------------ 513 | - adata: `anndata.AnnData`, The updated annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. Depending on the arguments of the train call, some outputs will be added to adata. 514 | """ 515 | 516 | total_start = time() 517 | seedlist = list(1000*np.random.randn(int(maxiter))) 518 | seedlist = [abs(int(x)) for x in seedlist] 519 | 520 | self.optimizer = optimizer 521 | self.optimizer.lr = lr 522 | 523 | # Begin deep clustering 524 | y_pred_last = np.ones((adata.shape[0],), dtype = int) * -1. 
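# The loop below implements the self-training ("deep clustering") procedure: each
# outer iteration (i) sharpens the current soft memberships q stored in
# adata.obsm['cluster memberships'] into a target distribution p via
# target_distribution(), i.e.
#     p_ij = (q_ij**2 / f_j) / sum_k(q_ik**2 / f_k),  where f_j = sum_i q_ij,
# (ii) fits the network for `epochs_fit` epochs against p plus the reconstruction
# loss, (iii) recomputes q and the hard cluster assignments, and (iv) stops once
# the validation reconstruction loss has stopped improving and the fraction of
# cells that changed assignment (delta_label) drops below self.tol.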
525 |
526 |         min_delta = np.inf
527 |         current_aeloss_val = np.inf
528 |         delta_patience_ES = 0
529 |         delta_patience_LR = 0
530 |         delta_stop = False
531 |
532 |         dataset = self.make_generators(adata, val_split = 0.1, batch_size = batch_size)
533 |
534 |         self.make_outputs(adata, batch_size, denoise = printperiter is not None)
535 |
536 |         for ite in range(int(maxiter)):
537 |
538 |             p = self.target_distribution(adata.obsm['cluster memberships'])
539 |
540 |             dataset.update_p(p)
541 |
542 |             best_loss = np.inf
543 |             iter_start = time()
544 |
545 |             for epoch in range(epochs_fit):
546 |                 current_loss_train = self.train_loop(dataset)
547 |                 current_loss_val, current_aeloss_val = self.validation_loop(dataset)
548 |
549 |             self.make_outputs(adata, batch_size, denoise = printperiter is not None)
550 |
551 |             y_pred = np.argmax(adata.obsm['cluster memberships'], axis = 1)
552 |
553 |             if printperiter is not None:
554 |                 if ite % printperiter == 0 and ite > 0:
555 |                     denoising_filename = os.path.join(self.weights_dir, 'intermediate_denoising', 'denoised' + str(ite))
556 |                     outfile = open(denoising_filename,'wb')
557 |                     pickle.dump(adata.layers["denoised"][:, adata.var['Variance Type'] == 'HVG'], outfile)  # note: pickle must be imported at the module level for this optional feature
558 |                     outfile.close()
559 |
560 |                     if self.LVG_dims is not None:
561 |                         denoising_filename = os.path.join(self.weights_dir, 'intermediate_denoising', 'denoisedLVG' + str(ite))
562 |                         outfile = open(denoising_filename,'wb')
563 |                         pickle.dump(adata.layers["denoised"][:, adata.var['Variance Type'] == 'LVG'], outfile)
564 |                         outfile.close()
565 |
566 |             # check stop criterion
567 |             delta_label = np.sum(y_pred != y_pred_last).astype(np.float32) / y_pred.shape[0]
568 |             y_pred_last = deepcopy(y_pred)
569 |
570 |             current_aeloss_val = current_aeloss_val.numpy()
571 |             current_clustloss_val = (current_loss_val.numpy() - (1 - self.clust_weight) * current_aeloss_val)/self.clust_weight
572 |             print("Iter {:03d} Loss: [Training: {:.3f}, Validation Cluster: {:.3f}, Validation AE: {:.3f}], Label Change: {:.3f}, Time: {:.1f} s".format(ite, current_loss_train.numpy(), current_clustloss_val, current_aeloss_val, delta_label, time() - iter_start))
573 |
574 |             if current_aeloss_val + 10**(-3) < min_delta:
575 |                 min_delta = current_aeloss_val
576 |                 delta_patience_ES = 0
577 |                 delta_patience_LR = 0
578 |
579 |             if delta_patience_ES >= iteration_patience_ES:
580 |                 delta_stop = True
581 |
582 |             if delta_patience_LR >= iteration_patience_LR:
583 |                 self.optimizer.lr = self.optimizer.lr * decay_factor
584 |                 delta_patience_LR = 0
585 |                 print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy()))
586 |
587 |             delta_patience_ES = delta_patience_ES + 1
588 |             delta_patience_LR = delta_patience_LR + 1
589 |
590 |             if delta_stop and delta_label < self.tol:
591 |                 print('\nAutoencoder_loss ', current_aeloss_val, 'not improving.')
592 |                 print('Proportion of Labels Changed: ', delta_label, ' is less than tolerance of ', self.tol)
593 |                 print('\nReached tolerance threshold.
Stop training.') 594 | break 595 | 596 | 597 | y0 = pd.Series(y_pred, dtype='category') 598 | y0.cat.categories = range(0, len(y0.cat.categories)) 599 | print("\nThe final cluster assignments are:") 600 | x = y0.value_counts() 601 | print(x.sort_index(ascending=True)) 602 | 603 | adata.obsm['embedding'] = self.embed(adata, batch_size) 604 | if self.LVG_dims is not None: 605 | adata.obsm['LVG embedding'] = self.embed_LVG(adata, batch_size) 606 | 607 | del adata.layers['normalized input'] 608 | 609 | if denoise: 610 | self.make_outputs(adata, batch_size, denoise = True) 611 | 612 | self.save_weights("./" + self.weights_dir + "/tuned_CarDECweights", save_format='tf') 613 | 614 | print("\nTotal Runtime is " + str(time() - total_start)) 615 | 616 | print("\nThe CarDEC model is now making inference on the data matrix.") 617 | 618 | self.package_output(adata, self.init_pred, self.preclust_denoised, self.preclust_emb) 619 | 620 | print("Inference completed, results added.") 621 | 622 | return adata 623 | 624 | def reload_model(self, adata = None, batch_size = 64, denoise = True): 625 | """ This class method can be used to load the model's saved weights and redo inference. 626 | 627 | 628 | Arguments: 629 | ------------------------------------------------------------------ 630 | - adata: `anndata.AnnData`, (Optional) The annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. If left as None, model weights will be reloaded but inference will not be made. 631 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 632 | - denoise: `bool`, Whether to provide denoised expression values for all cells. 633 | 634 | Returns: 635 | ------------------------------------------------------------------ 636 | - adata: `anndata.AnnData`, (Optional) The annotated data matrix of shape (n_obs, n_vars). If an adata object was provided as input, the adata object will be returned with inference outputs added. 
637 | """ 638 | 639 | if os.path.isfile("./" + self.weights_dir + "/tuned_CarDECweights.index"): 640 | print("Weight index file detected, loading weights.") 641 | self.load_weights("./" + self.weights_dir + "/tuned_CarDECweights").expect_partial() 642 | print("CarDEC Model weights loaded successfully.") 643 | 644 | if adata is not None: 645 | print("\nThe CarDEC model is now making inference on the data matrix.") 646 | 647 | adata.obsm['embedding'] = self.embed(adata, batch_size) 648 | if self.LVG_dims is not None: 649 | adata.obsm['LVG embedding'] = self.embed_LVG(adata, batch_size) 650 | 651 | del adata.layers['normalized input'] 652 | 653 | if denoise: 654 | self.make_outputs(adata, batch_size, True) 655 | 656 | self.package_output(adata, self.init_pred, self.preclust_denoised, self.preclust_emb) 657 | 658 | print("Inference completed, results returned.") 659 | 660 | return adata 661 | 662 | else: 663 | print("\nWeight index file not detected, please call CarDEC_Model.train to learn the weights\n") 664 | 665 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_SAE.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, MSEloss 2 | from .CarDEC_dataloaders import simpleloader, aeloader 3 | 4 | import tensorflow as tf 5 | from tensorflow.keras import Model, Sequential 6 | from tensorflow.keras.layers import Dense, concatenate 7 | from tensorflow.keras.optimizers import Adam 8 | from tensorflow.keras.backend import set_floatx 9 | from time import time 10 | 11 | import random 12 | import numpy as np 13 | from scipy.stats import zscore 14 | import os 15 | 16 | 17 | set_floatx('float32') 18 | 19 | 20 | class SAE(Model): 21 | def __init__(self, dims, act = 'relu', actincenter = "tanh", 22 | random_seed = 201809, splitseed = 215, init = "glorot_uniform", optimizer = Adam(), 23 | weights_dir = 'CarDEC Weights'): 24 | """ This class method initializes the SAE model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 30 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 31 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 32 | - random_seed: `int`, The seed used for random weight intialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between iterations to ensure the same cells are always used for validation. 34 | - init: `str`, The weight initialization strategy for the autoencoder. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 
37 | """ 38 | 39 | super(SAE, self).__init__() 40 | 41 | tf.keras.backend.clear_session() 42 | 43 | self.weights_dir = weights_dir 44 | 45 | self.dims = dims 46 | self.n_stacks = len(dims) - 1 47 | self.init = init 48 | self.optimizer = optimizer 49 | self.random_seed = random_seed 50 | self.splitseed = splitseed 51 | 52 | self.activation = act 53 | self.actincenter = actincenter #hidden layer activation function 54 | 55 | #set random seed 56 | random.seed(random_seed) 57 | np.random.seed(random_seed) 58 | tf.random.set_seed(random_seed) 59 | 60 | encoder_layers = [] 61 | for i in range(self.n_stacks-1): 62 | encoder_layers.append(Dense(self.dims[i + 1], kernel_initializer = self.init, activation = self.activation, name='encoder_%d' % i)) 63 | 64 | encoder_layers.append(Dense(self.dims[-1], kernel_initializer=self.init, activation=self.actincenter, name='embedding')) 65 | self.encoder = Sequential(encoder_layers, name = 'encoder') 66 | 67 | decoder_layers = [] 68 | for i in range(self.n_stacks - 1, 0, -1): 69 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 70 | , name = 'decoder%d' % (i-1))) 71 | 72 | decoder_layers.append(Dense(self.dims[0], activation = 'linear', name='output')) 73 | 74 | self.decoder = Sequential(decoder_layers, name = 'decoder') 75 | 76 | self.construct() 77 | 78 | def call(self, x): 79 | """ This is the forward pass of the model. 80 | 81 | 82 | ***Inputs*** 83 | - x: `tf.Tensor`, an input tensor of shape (n_obs, p_HVG). 84 | 85 | ***Outputs*** 86 | - output: `tf.Tensor`, A (n_obs, p_HVG) tensor of denoised HVG expression. 87 | """ 88 | 89 | c = self.encoder(x) 90 | 91 | output = self.decoder(c) 92 | 93 | return output 94 | 95 | def load_encoder(self, random_seed = 2312): 96 | """ This class method can be used to load the encoder weights, while randomly reinitializing the decoder weights. 97 | 98 | 99 | Arguments: 100 | ------------------------------------------------------------------ 101 | - random_seed: `int`, Seed for reinitializing the decoder. 102 | """ 103 | 104 | tf.keras.backend.clear_session() 105 | 106 | #set random seed 107 | random.seed(random_seed) 108 | np.random.seed(random_seed) 109 | tf.random.set_seed(random_seed) 110 | 111 | self.encoder.load_weights("./" + self.weights_dir + "/pretrained_encoder_weights").expect_partial() 112 | 113 | decoder_layers = [] 114 | for i in range(self.n_stacks - 1, 0, -1): 115 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 116 | , name='decoder%d' % (i-1))) 117 | self.decoder_base = Sequential(decoder_layers, name = 'decoderbase') 118 | 119 | self.output_layer = Dense(self.dims[0], activation = 'linear', name='output') 120 | 121 | self.construct(summarize = False) 122 | 123 | def load_autoencoder(self, ): 124 | """ This class method can be used to load the full model's weights.""" 125 | 126 | tf.keras.backend.clear_session() 127 | 128 | self.load_weights("./" + self.weights_dir + "/pretrained_autoencoder_weights").expect_partial() 129 | 130 | def construct(self, summarize = False): 131 | """ This class method fully initalizes the TensorFlow model. 132 | 133 | 134 | Arguments: 135 | ------------------------------------------------------------------ 136 | - summarize: `bool`, If True, then print a summary of the model architecture. 
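Example (an illustrative sketch; `sae` denotes an already-initialized SAE instance):

    sae.construct(summarize = True)   # builds the network via a dummy forward pass and prints the architecture summaries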
137 | """ 138 | 139 | x = tf.zeros(shape = (1, self.dims[0]), dtype=float) 140 | out = self(x) 141 | 142 | if summarize: 143 | print("----------Autoencoder Architecture----------") 144 | self.summary() 145 | 146 | print("\n----------Encoder Sub-Architecture----------") 147 | self.encoder.summary() 148 | 149 | print("\n----------Base Decoder Sub-Architecture----------") 150 | self.decoder.summary() 151 | 152 | def denoise(self, adata, batch_size = 64): 153 | """ This class method can be used to denoise gene expression for each cell. 154 | 155 | 156 | Arguments: 157 | ------------------------------------------------------------------ 158 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 159 | - batch_size: `int`, The batch size used for computing denoised expression. 160 | 161 | Returns: 162 | ------------------------------------------------------------------ 163 | - output: `np.ndarray`, Numpy array of denoised expression of shape (n_obs, n_vars) 164 | """ 165 | 166 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 167 | 168 | output = np.zeros((adata.shape[0], self.dims[0]), dtype = 'float32') 169 | start = 0 170 | 171 | for x in input_ds: 172 | end = start + x.shape[0] 173 | output[start:end] = self(x).numpy() 174 | start = end 175 | 176 | return output 177 | 178 | def embed(self, adata, batch_size = 64): 179 | """ This class method can be used to compute the low-dimension embedding for HVG features. 180 | 181 | 182 | Arguments: 183 | ------------------------------------------------------------------ 184 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 185 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 186 | 187 | Returns: 188 | ------------------------------------------------------------------ 189 | - embedding: `np.ndarray`, Array of shape (n_obs, n_vars) containing the cell HVG embeddings. 190 | """ 191 | 192 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 193 | 194 | embedding = np.zeros((adata.shape[0], self.dims[-1]), dtype = 'float32') 195 | 196 | start = 0 197 | for x in input_ds: 198 | end = start + x.shape[0] 199 | embedding[start:end] = self.encoder(x).numpy() 200 | start = end 201 | 202 | return embedding 203 | 204 | def makegenerators(self, adata, val_split, batch_size, splitseed): 205 | """ This class method creates training and validation data generators for the current input data. 206 | 207 | 208 | Arguments: 209 | ------------------------------------------------------------------ 210 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - batch_size: `int`, The batch size used for training the model. 213 | - splitseed: `int`, The seed used to split cells between training and validation. 214 | 215 | Returns: 216 | ------------------------------------------------------------------ 217 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 218 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 
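Example (an illustrative sketch; `sae` is an initialized SAE instance and `adata` carries the 'normalized input' layer and 'Variance Type' annotation):

    dataset = sae.makegenerators(adata, val_split = 0.1, batch_size = 64, splitseed = 215)
    for x, target in dataset(val = False):   # training minibatches
        pass
    for x, target in dataset(val = True):    # held-out validation minibatches
        pass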
219 | """ 220 | 221 | return aeloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], val_frac = val_split, batch_size = batch_size, splitseed = splitseed) 222 | 223 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 224 | patience_LR = 3, patience_ES = 9, save_fullmodel = True): 225 | """ This class method can be used to train the SAE. 226 | 227 | 228 | Arguments: 229 | ------------------------------------------------------------------ 230 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 231 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt training long before hitting this limit. 232 | - batch_size: `int`, The batch size used for training the full model. 233 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 234 | - lr: `float`, The learning rate for training the full model. 235 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 236 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss fails to decrease. 237 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to decrease. 238 | - save_fullmodel: `bool`, If True, save the full model's weights, not just the encoder. 239 | """ 240 | 241 | tf.keras.backend.clear_session() 242 | 243 | dataset = self.makegenerators(adata, val_split = 0.1, batch_size = batch_size, splitseed = self.splitseed) 244 | 245 | counter_LR = 0 246 | counter_ES = 0 247 | best_loss = np.inf 248 | 249 | self.optimizer.lr = lr 250 | 251 | total_start = time() 252 | for epoch in range(num_epochs): 253 | epoch_start = time() 254 | 255 | epoch_loss_avg = tf.keras.metrics.Mean() 256 | epoch_loss_avg_val = tf.keras.metrics.Mean() 257 | 258 | # Training loop - using batches of batch_size 259 | for x, target in dataset(val = False): 260 | loss_value, grads = grad(self, x, target, MSEloss) 261 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 262 | epoch_loss_avg(loss_value) # Add current batch loss 263 | 264 | # Validation Loop 265 | for x, target in dataset(val = True): 266 | output = self(x) 267 | loss_value = MSEloss(target, output) 268 | epoch_loss_avg_val(loss_value) 269 | 270 | current_loss_val = epoch_loss_avg_val.result() 271 | 272 | epoch_time = round(time() - epoch_start, 1) 273 | 274 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 275 | 276 | if(current_loss_val + 10**(-3) < best_loss): 277 | counter_LR = 0 278 | counter_ES = 0 279 | best_loss = current_loss_val 280 | else: 281 | counter_LR = counter_LR + 1 282 | counter_ES = counter_ES + 1 283 | 284 | if patience_ES <= counter_ES: 285 | break 286 | 287 | if patience_LR <= counter_LR: 288 | self.optimizer.lr = self.optimizer.lr * decay_factor 289 | counter_LR = 0 290 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 291 | 292 | # End epoch 293 | 294 | total_time = round(time() - total_start, 2) 295 | 296 | if not os.path.isdir("./" + self.weights_dir): 297 | os.mkdir("./" + self.weights_dir) 298 | 299 | self.save_weights("./" + 
self.weights_dir + "/pretrained_autoencoder_weights", save_format='tf') 300 | self.encoder.save_weights("./" + self.weights_dir + "/pretrained_encoder_weights", save_format='tf') 301 | 302 | print('\nTraining Completed') 303 | print("Total training time: " + str(total_time) + " seconds") 304 | 305 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_count_decoder.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, NBloss 2 | from .CarDEC_utils import build_dir 3 | from .CarDEC_dataloaders import countloader, tupleloader 4 | 5 | import tensorflow as tf 6 | from tensorflow.keras import Model, Sequential 7 | from tensorflow.keras.layers import Dense, concatenate, Lambda 8 | from tensorflow.keras.optimizers import Adam 9 | from tensorflow.keras.backend import exp as tf_exp, set_floatx 10 | from time import time 11 | 12 | import random 13 | import numpy as np 14 | from scipy.stats import zscore 15 | import os 16 | 17 | 18 | set_floatx('float32') 19 | 20 | 21 | class count_model(Model): 22 | def __init__(self, dims, act = 'relu', random_seed = 201809, splitseed = 215, optimizer = Adam(), 23 | weights_dir = 'CarDEC Count Weights', n_features = 32, mode = 'HVG'): 24 | """ This class method initializes the count model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the model. The length of the list determines the 30 | number of layers. 31 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 32 | - random_seed: `int`, The seed used for random weight initialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 34 | iterations to ensure the same cells are always used for validation. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the count model. 37 | - n_features: `int`, the number of input features. 38 | - mode: `str`, String identifying whether HVGs or LVGs are being modeled.
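Example (an illustrative sketch; the layer widths are hypothetical, and `adata` is assumed to already carry raw counts in adata.X, the obsm['embedding'] array, and obs['size factors'] produced by the main CarDEC workflow, with n_features matching the width of that embedding):

    from CarDEC.CarDEC_count_decoder import count_model

    n_HVG = sum(adata.var['Variance Type'] == 'HVG')
    nb_decoder = count_model(dims = [n_HVG, 128, 32], n_features = 32, mode = 'HVG')
    nb_decoder.train(adata)
    nb_decoder.denoise(adata, keep_dispersion = True)   # adds 'denoised counts' and 'dispersion' layers to adata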
39 | """ 40 | 41 | super(count_model, self).__init__() 42 | 43 | tf.keras.backend.clear_session() 44 | 45 | self.mode = mode 46 | self.name_ = mode + " Count" 47 | 48 | if mode == 'HVG': 49 | self.embed_name = 'embedding' 50 | else: 51 | self.embed_name = 'LVG embedding' 52 | 53 | self.weights_dir = weights_dir 54 | 55 | self.dims = dims 56 | n_stacks = len(dims) - 1 57 | 58 | self.optimizer = optimizer 59 | self.random_seed = random_seed 60 | self.splitseed = splitseed 61 | 62 | random.seed(random_seed) 63 | np.random.seed(random_seed) 64 | tf.random.set_seed(random_seed) 65 | 66 | self.activation = act 67 | self.MeanAct = lambda x: tf.clip_by_value(tf_exp(x), 1e-5, 1e6) 68 | self.DispAct = lambda x: tf.clip_by_value(tf.nn.softplus(x), 1e-4, 1e4) 69 | 70 | model_layers = [] 71 | for i in range(n_stacks - 1, 0, -1): 72 | model_layers.append(Dense(dims[i], kernel_initializer = "glorot_uniform", activation = self.activation 73 | , name='base%d' % (i-1))) 74 | self.base = Sequential(model_layers, name = 'base') 75 | 76 | self.mean_layer = Dense(dims[0], activation = self.MeanAct, name='mean') 77 | self.disp_layer = Dense(dims[0], activation = self.DispAct, name='dispersion') 78 | 79 | self.rescale = Lambda(lambda l: tf.matmul(tf.linalg.diag(l[0]), l[1]), name = 'sf scaling') 80 | 81 | build_dir(self.weights_dir) 82 | 83 | self.construct(n_features, self.name_) 84 | 85 | def call(self, x, s): 86 | """ This is the forward pass of the model. 87 | 88 | 89 | ***Inputs*** 90 | - x: `tf.Tensor`, an input tensor of shape (b, p) 91 | - s: `tf.Tensor`, and input tensor of shape (b, ) containing the size factor for each cell 92 | 93 | ***Outputs*** 94 | - mean: `tf.Tensor`, A (b, p_gene) tensor of negative binomial means for each cell, gene. 95 | - disp: `tf.Tensor`, A (b, p_gene) tensor of negative binomial dispersions for each cell, gene. 96 | """ 97 | 98 | x = self.base(x) 99 | 100 | disp = self.disp_layer(x) 101 | mean = self.mean_layer(x) 102 | mean = self.rescale([s, mean]) 103 | 104 | return mean, disp 105 | 106 | def load_model(self, ): 107 | """ This class method can be used to load the model's weights.""" 108 | 109 | tf.keras.backend.clear_session() 110 | 111 | self.load_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_)).expect_partial() 112 | 113 | def construct(self, n_features, name, summarize = False): 114 | """ This class method fully initalizes the TensorFlow model. 115 | 116 | 117 | Arguments: 118 | ------------------------------------------------------------------ 119 | - n_features: `int`, the number of input features. 120 | - name: `str`, Model name (to distinguish HVG and LVG models). 121 | - summarize: `bool`, If True, then print a summary of the model architecture. 122 | """ 123 | 124 | x = [tf.zeros(shape = (1, n_features), dtype='float32'), tf.ones(shape = (1,), dtype='float32')] 125 | out = self(*x) 126 | 127 | if summarize: 128 | print("----------Count Model " + name + " Architecture----------") 129 | self.summary() 130 | 131 | print("\n----------Base Sub-Architecture----------") 132 | self.base.summary() 133 | 134 | def denoise(self, adata, keep_dispersion = False, batch_size = 64): 135 | """ This class method can be used to denoise gene expression for each cell on the count scale. 136 | 137 | 138 | Arguments: 139 | ------------------------------------------------------------------ 140 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 141 | to cells and columns to genes. 
142 | - keep_dispersion: `bool`, If True, also return the dispersion for each gene, cell (added as a layer to adata). 143 | - batch_size: `int`, The batch size used for computing denoised expression. 144 | 145 | Returns: 146 | ------------------------------------------------------------------ 147 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Negative binomial means (and optionally 148 | dispersions) added as layers. 149 | """ 150 | 151 | input_ds = tupleloader(adata.obsm[self.embed_name], adata.obs['size factors'], batch_size = batch_size) 152 | 153 | if "denoised counts" not in list(adata.layers): 154 | adata.layers["denoised counts"] = np.zeros(adata.shape, dtype = 'float32') 155 | 156 | type_indices = adata.var['Variance Type'] == self.mode 157 | 158 | if not keep_dispersion: 159 | start = 0 160 | for x in input_ds: 161 | end = start + x[0].shape[0] 162 | adata.layers["denoised counts"][start:end, type_indices] = self(*x)[0].numpy() 163 | start = end 164 | 165 | else: 166 | if "dispersion" not in list(adata.layers): 167 | adata.layers["dispersion"] = np.zeros(adata.shape, dtype = 'float32') 168 | 169 | start = 0 170 | for x in input_ds: 171 | end = start + x[0].shape[0] 172 | batch_output = self(*x) 173 | adata.layers["denoised counts"][start:end, type_indices] = batch_output[0].numpy() 174 | adata.layers["dispersion"][start:end, type_indices] = batch_output[1].numpy() 175 | start = end 176 | 177 | def makegenerators(self, adata, val_split, batch_size, splitseed): 178 | """ This class method creates training and validation data generators for the current input data. 179 | 180 | 181 | Arguments: 182 | ------------------------------------------------------------------ 183 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond 184 | to cells and columns to genes. 185 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 186 | - batch_size: `int`, The batch size used for training the model. 187 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 188 | iterations to ensure the same cells are always used for validation. 189 | 190 | Returns: 191 | ------------------------------------------------------------------ 192 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 193 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 194 | """ 195 | 196 | return countloader(adata.obsm[self.embed_name], adata.X[:, adata.var['Variance Type'] == self.mode], adata.obs['size factors'], 197 | val_split, batch_size, splitseed) 198 | 199 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 200 | patience_LR = 3, patience_ES = 9): 201 | """ This class method can be used to train the count model. 202 | 203 | 204 | Arguments: 205 | ------------------------------------------------------------------ 206 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 207 | to cells and columns to genes. 208 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt 209 | training long before hitting this limit. 210 | - batch_size: `int`, The batch size used for training the full model. 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - lr: `float`, The learning rate for training the full model.
213 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not 214 | decreasing. 215 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the 216 | validation loss fails to decrease. 217 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to 218 | decrease. 219 | """ 220 | 221 | tf.keras.backend.clear_session() 222 | 223 | loss = NBloss 224 | 225 | dataset = self.makegenerators(adata, val_split = 0.1, batch_size = batch_size, splitseed = self.splitseed) 226 | 227 | counter_LR = 0 228 | counter_ES = 0 229 | best_loss = np.inf 230 | 231 | self.optimizer.lr = lr 232 | 233 | total_start = time() 234 | 235 | for epoch in range(num_epochs): 236 | epoch_start = time() 237 | 238 | epoch_loss_avg = tf.keras.metrics.Mean() 239 | epoch_loss_avg_val = tf.keras.metrics.Mean() 240 | 241 | # Training loop - using batches of batch_size 242 | for x, target in dataset(val = False): 243 | loss_value, grads = grad(self, x, target, loss) 244 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 245 | epoch_loss_avg(loss_value) # Add current batch loss 246 | 247 | # Validation Loop 248 | for x, target in dataset(val = True): 249 | output = self(*x) 250 | loss_value = loss(target, output) 251 | epoch_loss_avg_val(loss_value) 252 | 253 | current_loss_val = epoch_loss_avg_val.result() 254 | 255 | epoch_time = round(time() - epoch_start, 1) 256 | 257 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 258 | 259 | if(current_loss_val + 10**(-3) < best_loss): 260 | counter_LR = 0 261 | counter_ES = 0 262 | best_loss = current_loss_val 263 | else: 264 | counter_LR = counter_LR + 1 265 | counter_ES = counter_ES + 1 266 | 267 | if patience_ES <= counter_ES: 268 | break 269 | 270 | if patience_LR <= counter_LR: 271 | self.optimizer.lr = self.optimizer.lr * decay_factor 272 | counter_LR = 0 273 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 274 | 275 | # End epoch 276 | 277 | total_time = round(time() - total_start, 2) 278 | 279 | if not os.path.isdir("./" + self.weights_dir): 280 | os.mkdir("./" + self.weights_dir) 281 | 282 | self.save_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_), save_format='tf') 283 | 284 | print('\nTraining Completed') 285 | print("Total training time: " + str(total_time) + " seconds") 286 | 287 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_dataloaders.py: -------------------------------------------------------------------------------- 1 | from tensorflow import convert_to_tensor as tensor 2 | from numpy import setdiff1d 3 | from numpy.random import choice, seed 4 | 5 | class batch_sampler(object): 6 | def __init__(self, array, val_frac, batch_size, splitseed): 7 | seed(splitseed) 8 | self.val_indices = choice(range(len(array)), round(val_frac * len(array)), False) 9 | self.train_indices = setdiff1d(range(len(array)), self.val_indices) 10 | self.batch_size = batch_size 11 | 12 | def __iter__(self): 13 | batch = [] 14 | 15 | if self.val: 16 | for idx in self.val_indices: 17 | batch.append(idx) 18 | 19 | if len(batch) == self.batch_size: 20 | yield batch 21 | batch = [] 22 | 23 | else: 24 | train_idx = choice(self.train_indices, len(self.train_indices), False) 25 | 26 | for idx 
in train_idx: 27 | batch.append(idx) 28 | 29 | if len(batch) == self.batch_size: 30 | yield batch 31 | batch = [] 32 | 33 | if batch: 34 | yield batch 35 | 36 | def __call__(self, val): 37 | self.val = val 38 | return self 39 | 40 | class simpleloader(object): 41 | def __init__(self, array, batch_size): 42 | self.array = array 43 | self.batch_size = batch_size 44 | 45 | def __iter__(self): 46 | batch = [] 47 | 48 | for idx in range(len(self.array)): 49 | batch.append(idx) 50 | 51 | if len(batch) == self.batch_size: 52 | yield tensor(self.array[batch].copy()) 53 | batch = [] 54 | 55 | if batch: 56 | yield self.array[batch].copy() 57 | 58 | class tupleloader(object): 59 | def __init__(self, *arrays, batch_size): 60 | self.arrays = arrays 61 | self.batch_size = batch_size 62 | 63 | def __iter__(self): 64 | batch = [] 65 | 66 | for idx in range(len(self.arrays[0])): 67 | batch.append(idx) 68 | 69 | if len(batch) == self.batch_size: 70 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 71 | batch = [] 72 | 73 | if batch: 74 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 75 | 76 | class aeloader(object): 77 | def __init__(self, *arrays, val_frac, batch_size, splitseed): 78 | self.arrays = arrays 79 | self.batch_size = batch_size 80 | self.sampler = batch_sampler(arrays[0], val_frac, batch_size, splitseed) 81 | 82 | def __iter__(self): 83 | for idxs in self.sampler(self.val): 84 | yield [tensor(arr[idxs].copy()) for arr in self.arrays] 85 | 86 | def __call__(self, val): 87 | self.val = val 88 | return self 89 | 90 | class countloader(object): 91 | def __init__(self, embedding, target, sizefactor, val_frac, batch_size, splitseed): 92 | self.sampler = batch_sampler(embedding, val_frac, batch_size, splitseed) 93 | self.embedding = embedding 94 | self.target = target 95 | self.sizefactor = sizefactor 96 | 97 | def __iter__(self): 98 | for idxs in self.sampler(self.val): 99 | yield (tensor(self.embedding[idxs].copy()), tensor(self.sizefactor[idxs].copy())), tensor(self.target[idxs].copy()) 100 | 101 | def __call__(self, val): 102 | self.val = val 103 | return self 104 | 105 | class dataloader(object): 106 | def __init__(self, hvg_input, hvg_target, lvg_input = None, lvg_target = None, val_frac = 0.1, batch_size = 128, splitseed = 0): 107 | self.sampler = batch_sampler(hvg_input, val_frac, batch_size, splitseed) 108 | self.hvg_input = hvg_input 109 | self.hvg_target = hvg_target 110 | self.lvg_input = lvg_input 111 | self.lvg_target = lvg_target 112 | 113 | def __iter__(self): 114 | for idxs in self.sampler(self.val): 115 | hvg_input = tensor(self.hvg_input[idxs].copy()) 116 | hvg_target = tensor(self.hvg_target[idxs].copy()) 117 | p_target = tensor(self.p_target[idxs].copy()) 118 | 119 | if (self.lvg_input is not None) and (self.lvg_target is not None): 120 | lvg_input = tensor(self.lvg_input[idxs].copy()) 121 | lvg_target = tensor(self.lvg_target[idxs].copy()) 122 | else: 123 | lvg_input = None 124 | lvg_target = None 125 | 126 | yield [hvg_input, lvg_input], hvg_target, lvg_target, p_target 127 | 128 | def __call__(self, val): 129 | self.val = val 130 | return self 131 | 132 | def update_p(self, new_p_target): 133 | self.p_target = new_p_target -------------------------------------------------------------------------------- /CarDEC/CarDEC_layers.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Layer 3 | 4 | class ClusteringLayer(Layer): 5 | def __init__(self, centroids = None, 
n_clusters = None, n_features = None, alpha=1.0, **kwargs): 6 | """ The clustering layer predicts the a cell's class membership probability for each cell. 7 | 8 | 9 | Arguments: 10 | ------------------------------------------------------------------ 11 | - centroids: `tf.Tensor`, Initial cluster ceontroids after pretraining the model. 12 | - n_clusters: `int`, Number of clusters. 13 | - n_features: `int`, The number of features of the bottleneck embedding space that the centroids live in. 14 | - alpha: parameter in Student's t-distribution. Default to 1.0. 15 | """ 16 | 17 | super(ClusteringLayer, self).__init__(**kwargs) 18 | self.alpha = alpha 19 | self.initial_centroids = centroids 20 | 21 | if centroids is not None: 22 | n_clusters, n_features = centroids.shape 23 | 24 | self.n_features, self.n_clusters = n_features, n_clusters 25 | 26 | assert self.n_clusters is not None 27 | assert self.n_features is not None 28 | 29 | def build(self, input_shape): 30 | """ This class method builds the layer fully once it receives an input tensor. 31 | 32 | 33 | Arguments: 34 | ------------------------------------------------------------------ 35 | - input_shape: `list`, A list specifying the shape of the input tensor. 36 | """ 37 | 38 | assert len(input_shape) == 2 39 | 40 | self.centroids = self.add_weight(name = 'clusters', shape = (self.n_clusters, self.n_features), initializer = 'glorot_uniform') 41 | if self.initial_centroids is not None: 42 | self.set_weights([self.initial_centroids]) 43 | del self.initial_centroids 44 | 45 | self.built = True 46 | 47 | def call(self, x, **kwargs): 48 | """ Forward pass of the clustering layer, 49 | 50 | 51 | ***Inputs***: 52 | - x: `tf.Tensor`, the embedding tensor of shape = (n_obs, n_var) 53 | 54 | ***Returns***: 55 | - q: `tf.Tensor`, student's t-distribution, or soft labels for each sample of shape = (n_obs, n_clusters) 56 | """ 57 | 58 | q = 1.0 / (1.0 + (tf.reduce_sum(tf.square(tf.expand_dims(x, axis = 1) - self.centroids), axis = 2) / self.alpha)) 59 | q = q**((self.alpha + 1.0) / 2.0) 60 | q = q / tf.reduce_sum(q, axis = 1, keepdims = True) 61 | 62 | return q 63 | 64 | def compute_output_shape(self, input_shape): 65 | """ This method infers the output shape from the input shape. 66 | 67 | 68 | Arguments: 69 | ------------------------------------------------------------------ 70 | - input_shape: `list`, A list specifying the shape of the input tensor. 71 | 72 | Returns: 73 | ------------------------------------------------------------------ 74 | - output_shape: `list`, A tuple specifying the shape of the output for the minibatch (n_obs, n_clusters) 75 | """ 76 | 77 | assert input_shape and len(input_shape) == 2 78 | return input_shape[0], self.n_clusters -------------------------------------------------------------------------------- /CarDEC/CarDEC_optimization.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import tensorflow as tf 4 | from tensorflow.keras.losses import KLD, MSE 5 | 6 | 7 | def grad_MainModel(model, input_, target, target_p, total_loss, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 8 | """Function to do a backprop update to the main CarDEC model for a minibatch. 9 | 10 | 11 | Arguments: 12 | ------------------------------------------------------------------ 13 | - model: `tensorflow.keras.Model`, The main CarDEC model. 14 | - input_: `list`, A list containing the input HVG and (optionally) LVG expression tensors of the minibatch for the CarDEC model. 
15 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 16 | - target_p: `tf.Tensor`, Tensor containing cluster membership probability targets for the minibatch. 17 | - total_loss: `function`, Function to compute the loss for the main CarDEC model for a minibatch. 18 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 19 | - aeloss_fun: `function`, Function to compute reconstruction loss. 20 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 21 | 22 | Returns: 23 | ------------------------------------------------------------------ 24 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 25 | - gradients: `a list of Tensors`: Gradients to update the model weights. 26 | """ 27 | 28 | with tf.GradientTape() as tape: 29 | denoised_output, cluster_output = model(*input_) 30 | loss_value, aeloss = total_loss(target, denoised_output, target_p, cluster_output, 31 | LVG_target, aeloss_fun, clust_weight) 32 | 33 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 34 | 35 | 36 | def grad_reconstruction(model, input_, target, loss): 37 | """Function to compute gradient update for pretrained autoencoder only. 38 | 39 | 40 | Arguments: 41 | ------------------------------------------------------------------ 42 | - model: `tensorflow.keras.Model`, The main CarDEC model. 43 | - input_: `list`, A list containing the input HVG expression tensor of the minibatch for the CarDEC model. 44 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 45 | - loss: `function`, Function to compute reconstruction loss. 46 | 47 | Returns: 48 | ------------------------------------------------------------------ 49 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 50 | - gradients: `a list of Tensors`: Gradients to update the model weights. 51 | """ 52 | 53 | if type(input_) != tuple: 54 | input_ = (input_, ) 55 | 56 | with tf.GradientTape() as tape: 57 | output = model(*input_) 58 | loss_value = loss(target, output) 59 | 60 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 61 | 62 | 63 | def total_loss(target, denoised_output, p, cluster_output_q, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 64 | """Function to compute the loss for the main CarDEC model for a minibatch. 65 | 66 | 67 | Arguments: 68 | ------------------------------------------------------------------ 69 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 70 | - denoised_output: `dict`, Dictionary containing the output tensors from the CarDEC main model's forward pass. 71 | - p: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing cluster membership probability targets for the minibatch. 72 | - cluster_output_q: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing predicted cluster membership probabilities 73 | for each cell. 74 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 75 | - aeloss_fun: `function`, Function to compute reconstruction loss. 76 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 77 | 78 | Returns: 79 | ------------------------------------------------------------------ 80 | - net_loss: `tf.Tensor`, The loss computed for the minibatch. 
81 | - aeloss: `tf.Tensor`, The reconstruction loss computed for the minibatch. 82 | """ 83 | 84 | if aeloss_fun is not None: 85 | 86 | aeloss_HVG = aeloss_fun(target, denoised_output['HVG_denoised']) 87 | if LVG_target is not None: 88 | aeloss_LVG = aeloss_fun(LVG_target, denoised_output['LVG_denoised']) 89 | aeloss = 0.5*(aeloss_LVG + aeloss_HVG) 90 | else: 91 | aeloss = 1. * aeloss_HVG 92 | else: 93 | aeloss = 0. 94 | 95 | net_loss = clust_weight * tf.reduce_mean(KLD(p, cluster_output_q)) + (2. - clust_weight) * aeloss 96 | 97 | return net_loss, aeloss 98 | 99 | 100 | def MSEloss(netinput, netoutput): 101 | """Function to compute the MSEloss for the reconstruction loss of a minibatch. 102 | 103 | 104 | Arguments: 105 | ------------------------------------------------------------------ 106 | - netinput: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells. 107 | - netoutput: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 108 | 109 | Returns: 110 | ------------------------------------------------------------------ 111 | - mse_loss: `tf.Tensor`, The loss computed for the minibatch, averaged over genes and cells. 112 | """ 113 | 114 | return tf.math.reduce_mean(MSE(netinput, netoutput)) 115 | 116 | 117 | def NBloss(count, output, eps = 1e-10, mean = True): 118 | """Function to compute the negative binomial reconstruction loss of a minibatch. 119 | 120 | 121 | Arguments: 122 | ------------------------------------------------------------------ 123 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original 124 | counts). 125 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 126 | - eps: `float`, A small number introduced for computational stability 127 | - mean: `bool`, If True, average negative binomial loss over genes and cells 128 | 129 | Returns: 130 | ------------------------------------------------------------------ 131 | - nbloss: `tf.Tensor`, The loss computed for the minibatch. If mean was True, it has shape (n_obs, n_var). Otherwise, it has shape (1,). 132 | """ 133 | 134 | count = tf.cast(count, tf.float32) 135 | mu = tf.cast(output[0], tf.float32) 136 | 137 | theta = tf.minimum(output[1], 1e6) 138 | 139 | t1 = tf.math.lgamma(theta + eps) + tf.math.lgamma(count + 1.0) - tf.math.lgamma(count + theta + eps) 140 | t2 = (theta + count) * tf.math.log(1.0 + (mu/(theta+eps))) + (count * (tf.math.log(theta + eps) - tf.math.log(mu + eps))) 141 | 142 | final = _nan2inf(t1 + t2) 143 | 144 | if mean: 145 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 146 | 147 | return final 148 | 149 | 150 | def ZINBloss(count, output, eps = 1e-10): 151 | """Function to compute the negative binomial reconstruction loss of a minibatch. 152 | 153 | 154 | Arguments: 155 | ------------------------------------------------------------------ 156 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original counts). 157 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 158 | - eps: `float`, A small number introduced for computational stability 159 | 160 | Returns: 161 | ------------------------------------------------------------------ 162 | - zinbloss: `tf.Tensor`, The loss computed for the minibatch. Has shape (1,). 
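Example (an illustrative sketch; count, mu, theta, and pi are hypothetical float32 tensors of shape (n_obs, n_vars), with pi holding zero-inflation probabilities in [0, 1]):

    zinb = ZINBloss(count, (mu, theta, pi))   # scalar loss for the minibatch
    nb = NBloss(count, (mu, theta))           # negative binomial loss without zero inflation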
163 | """ 164 | 165 | mu = output[0] 166 | theta = output[1] 167 | pi = output[2] 168 | 169 | NB = NBloss(count, output, eps = eps, mean = False) - tf.math.log(1.0 - pi + eps) 170 | 171 | count = tf.cast(count, tf.float32) 172 | mu = tf.cast(mu, tf.float32) 173 | 174 | theta = tf.math.minimum(theta, 1e6) 175 | 176 | zero_nb = tf.math.pow(theta/(theta + mu + eps), theta) 177 | zero_case = -tf.math.log(pi + ((1.0- pi) * zero_nb) + eps) 178 | final = tf.where(tf.less(count, 1e-8), zero_case, NB) 179 | 180 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 181 | 182 | return final 183 | 184 | 185 | def _nan2inf(x): 186 | """Function to replace nan entries in a Tensor with infinities. 187 | 188 | 189 | Arguments: 190 | ------------------------------------------------------------------ 191 | - x: `tf.Tensor`, Tensor of arbitrary shape. 192 | 193 | Returns: 194 | ------------------------------------------------------------------ 195 | - x': `tf.Tensor`, Tensor x with nan entries replaced by infinity. 196 | """ 197 | 198 | return tf.where(tf.math.is_nan(x), tf.zeros_like(x) + np.inf, x) 199 | 200 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | from scipy.sparse import issparse 4 | 5 | import scanpy as sc 6 | from anndata import AnnData 7 | 8 | 9 | def normalize_scanpy(adata, batch_key = None, n_high_var = 1000, LVG = True, 10 | normalize_samples = True, log_normalize = True, 11 | normalize_features = True): 12 | """ This function preprocesses the raw count data. 13 | 14 | 15 | Arguments: 16 | ------------------------------------------------------------------ 17 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 18 | - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch. 19 | - n_high_var: `int`, integer specifying the number of genes to be idntified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable. 20 | - LVG: `bool`, Whether to retain and preprocess LVGs. 21 | - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell. 22 | - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count. 23 | - normalize_features: `bool`, If True, z-score normalize each gene's expression. 24 | 25 | Returns: 26 | ------------------------------------------------------------------ 27 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Contains preprocessed data. 
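Example (an illustrative sketch; assumes raw counts are loaded into an AnnData object and, if several batches are present, that adata.obs has a column named 'batch'):

    from CarDEC.CarDEC_utils import normalize_scanpy

    adata = normalize_scanpy(adata, batch_key = 'batch', n_high_var = 2000)
    adata.layers['normalized input']   # scaled expression used as model input
    adata.var['Variance Type']         # per-gene 'HVG' / 'LVG' designation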
28 | """ 29 | 30 | n, p = adata.shape 31 | sparsemode = issparse(adata.X) 32 | 33 | if batch_key is not None: 34 | batch = list(adata.obs[batch_key]) 35 | batch = convert_vector_to_encoding(batch) 36 | batch = np.asarray(batch) 37 | batch = batch.astype('float32') 38 | else: 39 | batch = np.ones((n,), dtype = 'float32') 40 | norm_by_batch = False 41 | 42 | sc.pp.filter_genes(adata, min_counts=1) 43 | sc.pp.filter_cells(adata, min_counts=1) 44 | 45 | count = adata.X.copy() 46 | 47 | if normalize_samples: 48 | out = sc.pp.normalize_total(adata, inplace = False) 49 | obs_ = adata.obs 50 | var_ = adata.var 51 | adata = None 52 | adata = AnnData(out['X']) 53 | adata.obs = obs_ 54 | adata.var = var_ 55 | 56 | size_factors = out['norm_factor'] / np.median(out['norm_factor']) 57 | out = None 58 | else: 59 | size_factors = np.ones((adata.shape[0], )) 60 | 61 | if not log_normalize: 62 | adata_ = adata.copy() 63 | 64 | sc.pp.log1p(adata) 65 | 66 | if n_high_var is not None: 67 | sc.pp.highly_variable_genes(adata, inplace = True, min_mean = 0.0125, max_mean = 3, min_disp = 0.5, 68 | n_bins = 20, n_top_genes = n_high_var, batch_key = batch_key) 69 | 70 | hvg = adata.var['highly_variable'].values 71 | 72 | if not log_normalize: 73 | adata = adata_.copy() 74 | 75 | else: 76 | hvg = [True] * adata.shape[1] 77 | 78 | if normalize_features: 79 | batch_list = np.unique(batch) 80 | 81 | if sparsemode: 82 | adata.X = adata.X.toarray() 83 | 84 | for batch_ in batch_list: 85 | indices = [x == batch_ for x in batch] 86 | sub_adata = adata[indices] 87 | 88 | sc.pp.scale(sub_adata) 89 | adata[indices] = sub_adata.X 90 | 91 | adata.layers["normalized input"] = adata.X 92 | adata.X = count 93 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 94 | 95 | else: 96 | if sparsemode: 97 | adata.layers["normalized input"] = adata.X.toarray() 98 | else: 99 | adata.layers["normalized input"] = adata.X 100 | 101 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 102 | 103 | if n_high_var is not None: 104 | del_keys = ['dispersions', 'dispersions_norm', 'highly_variable', 'highly_variable_intersection', 'highly_variable_nbatches', 'means'] 105 | del_keys = [x for x in del_keys if x in adata.var.keys()] 106 | adata.var = adata.var.drop(del_keys, axis = 1) 107 | 108 | y = np.unique(batch) 109 | num_batch = len(y) 110 | 111 | adata.obs['size factors'] = size_factors.astype('float32') 112 | adata.obs['batch'] = batch 113 | adata.uns['num_batch'] = num_batch 114 | 115 | if sparsemode: 116 | adata.X = adata.X.toarray() 117 | 118 | if not LVG: 119 | adata = adata[:, adata.var['Variance Type'] == 'HVG'] 120 | 121 | return adata 122 | 123 | 124 | def build_dir(dir_path): 125 | """ This function builds a directory if it does not exist. 126 | 127 | 128 | Arguments: 129 | ------------------------------------------------------------------ 130 | - dir_path: `str`, The directory to build. E.g. if dir_path = 'folder1/folder2/folder3', then this function will creates directory if folder1 if it does not already exist. Then it creates folder1/folder2 if folder2 does not exist in folder1. Then it creates folder1/folder2/folder3 if folder3 does not exist in folder2. 
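Example (an illustrative sketch; the path is hypothetical):

    build_dir('CarDEC Weights/count model')   # creates 'CarDEC Weights' first if needed, then 'CarDEC Weights/count model'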
131 | """ 132 | 133 | subdirs = [dir_path] 134 | substring = dir_path 135 | 136 | while substring != '': 137 | splt_dir = os.path.split(substring) 138 | substring = splt_dir[0] 139 | subdirs.append(substring) 140 | 141 | subdirs.pop() 142 | subdirs = [x for x in subdirs if os.path.basename(x) != '..'] 143 | 144 | n = len(subdirs) 145 | subdirs = [subdirs[n - 1 - x] for x in range(n)] 146 | 147 | for dir_ in subdirs: 148 | if not os.path.isdir(dir_): 149 | os.mkdir(dir_) 150 | 151 | 152 | def convert_string_to_encoding(string, vector_key): 153 | """A function to convert a string to a numeric encoding. 154 | 155 | 156 | Arguments: 157 | ------------------------------------------------------------------ 158 | - string: `str`, The specific string to convert to a numeric encoding. 159 | - vector_key: `np.ndarray`, Array of all possible values of string. 160 | 161 | Returns: 162 | ------------------------------------------------------------------ 163 | - encoding: `int`, The integer encoding of string. 164 | """ 165 | 166 | return np.argwhere(vector_key == string)[0][0] 167 | 168 | 169 | def convert_vector_to_encoding(vector): 170 | """A function to convert a vector of strings to a dense numeric encoding. 171 | 172 | 173 | Arguments: 174 | ------------------------------------------------------------------ 175 | - vector: `array_like`, The vector of strings to encode. 176 | 177 | Returns: 178 | ------------------------------------------------------------------ 179 | - vector_num: `list`, A list containing the dense numeric encoding. 180 | """ 181 | 182 | vector_key = np.unique(vector) 183 | vector_strings = list(vector) 184 | vector_num = [convert_string_to_encoding(string, vector_key) for string in vector_strings] 185 | 186 | return vector_num 187 | 188 | 189 | def find_resolution(adata_, n_clusters, random): 190 | """A function to find the louvain resolution tjat corresponds to a prespecified number of clusters, if it exists. 191 | 192 | 193 | Arguments: 194 | ------------------------------------------------------------------ 195 | - adata_: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to low dimension features. 196 | - n_clusters: `int`, Number of clusters. 197 | - random: `int`, The random seed. 198 | 199 | Returns: 200 | ------------------------------------------------------------------ 201 | - resolution: `float`, The resolution that gives n_clusters after running louvain's clustering algorithm. 202 | """ 203 | 204 | obtained_clusters = -1 205 | iteration = 0 206 | resolutions = [0., 1000.] 
207 | 208 | while obtained_clusters != n_clusters and iteration < 50: 209 | current_res = sum(resolutions)/2 210 | adata = sc.tl.louvain(adata_, resolution = current_res, random_state = random, copy = True) 211 | labels = adata.obs['louvain'] 212 | obtained_clusters = len(np.unique(labels)) 213 | 214 | if obtained_clusters < n_clusters: 215 | resolutions[0] = current_res 216 | else: 217 | resolutions[1] = current_res 218 | 219 | iteration = iteration + 1 220 | 221 | return current_res 222 | 223 | -------------------------------------------------------------------------------- /CarDEC/__init__.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_API import CarDEC_API -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_API.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_API.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_MainModel.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_MainModel.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_SAE.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_SAE.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_count_decoder.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_count_decoder.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_dataloaders.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_dataloaders.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_layers.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_layers.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_optimization.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_optimization.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_utils.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_utils.cpython-37.pyc 
-------------------------------------------------------------------------------- /CarDEC/__pycache__/__init__.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- /LICENSE.rtf: -------------------------------------------------------------------------------- 1 | {\rtf1\ansi\ansicpg1252\cocoartf2511 2 | \cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fnil\fcharset0 Monaco;} 3 | {\colortbl;\red255\green255\blue255;\red74\green70\blue67;\red255\green255\blue255;} 4 | {\*\expandedcolortbl;;\cssrgb\c36078\c34510\c33333;\cssrgb\c100000\c100000\c100000;} 5 | \margl1440\margr1440\vieww10800\viewh8400\viewkind0 6 | \deftab720 7 | \pard\pardeftab720\sl380\partightenfactor0 8 | 9 | \f0\fs28 \cf2 \cb3 \expnd0\expndtw0\kerning0 10 | \outl0\strokewidth0 \strokec2 MIT License\ 11 | \ 12 | Copyright (c) 2020 Justin Lakkis\ 13 | \ 14 | Permission is hereby granted, free of charge, to any person obtaining a copy\ 15 | of this software and associated documentation files (the "Software"), to deal\ 16 | in the Software without restriction, including without limitation the rights\ 17 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ 18 | copies of the Software, and to permit persons to whom the Software is\ 19 | furnished to do so, subject to the following conditions:\ 20 | \ 21 | The above copyright notice and this permission notice shall be included in all\ 22 | copies or substantial portions of the Software.\ 23 | \ 24 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\ 25 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\ 26 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\ 27 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\ 28 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\ 29 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\ 30 | SOFTWARE.} -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CarDEC 2 | 3 | CarDEC (**C**ount **a**dapted **r**egularized **D**eep **E**mbedded **C**lustering) is a joint deep learning computational tool that is useful for analyses of single-cell RNA-seq data. CarDEC can be used to: 4 | 5 | 1. Correct for batch effect in the full gene expression space, allowing the investigator to remove batch effect from downstream analyses like psuedotime analysis and coexpression analysis. Batch correction is also possible in a low-dimensional embedding space. 6 | 2. Denoise gene expression. 7 | 3. Cluster cells. 8 | 9 | ## Reproducibility 10 | 11 | We described and introduced CarDEC in our [methodological paper](https://www.biorxiv.org/content/10.1101/2020.09.23.310003v1). To find code to reproduce the results we generated in that paper, please visit this separate [github repository](https://github.com/jlakkis/CarDEC_Codes), which provides all code (including that for other methods) necessary to reproduce our results. 12 | 13 | ## Installation 14 | 15 | Recomended installation procedure is as follows. 16 | 17 | 1. Install [Anaconda](https://www.anaconda.com/products/individual) if you do not already have it. 18 | 2. 
Create a conda environment, and then activate it as follows in terminal. 19 | 20 | ``` 21 | $ conda create -n cardecenv 22 | $ conda activate cardecenv 23 | ``` 24 | 25 | 3. Install an appropriate version of python. 26 | 27 | ``` 28 | $ conda install python==3.7 29 | ``` 30 | 31 | 4. Install nb_conda_kernels so that you can change python kernels in jupyter notebook. 32 | 33 | ``` 34 | $ conda install nb_conda_kernels 35 | ``` 36 | 37 | 5. Finally, install CarDEC. 38 | 39 | ``` 40 | $ pip install CarDEC 41 | ``` 42 | 43 | Now, to use CarDEC, always make sure you activate the environment in terminal first ("conda activate cardecenv"). And then run jupyter notebook. When you create a notebook to run CarDEC, make sure the active kernel is switched to "cardecenv" 44 | 45 | ## Usage 46 | 47 | A [tutorial jupyter notebook](https://drive.google.com/drive/folders/19VVOoq4XSdDFRZDou-VbTMyV2Na9z53O?usp=sharing), together with a dataset, is publicly downloadable. 48 | 49 | ## Software Requirements 50 | 51 | - Python >= 3.7 52 | - TensorFlow >= 2.0.1, <= 2.3.1 53 | - scikit-learn == 0.22.2.post1 54 | - scanpy == 1.5.1 55 | - louvain == 0.6.1 56 | - pandas == 1.0.1 57 | - scipy == 1.4.1 58 | 59 | ## Trouble shooting 60 | 61 | Installation on MacOS should be smooth. If installing on Windows Subsystem for Linux (WSL), the user must properly configure their g++ compiler to ensure that the louvain package can be built during installation. If the compiler is not properly configured, the user may encounter a following deprecation error similar to the following. 62 | 63 | "DEPRECATION: Could not build wheels for louvain which do not use PEP 517. pip will fall back to legacy 'setup.py install' for these. pip 21.0 will remove support for this functionality. A possible replacement is to fix the wheel build issue reported above." 64 | 65 | To fix this error, try to install the libxml2-dev package. -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_API.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_utils import normalize_scanpy 2 | from .CarDEC_MainModel import CarDEC_Model 3 | from .CarDEC_count_decoder import count_model 4 | 5 | import tensorflow as tf 6 | from tensorflow.keras.optimizers import Adam 7 | import numpy as np 8 | from pandas import DataFrame 9 | 10 | import os 11 | 12 | class CarDEC_API: 13 | def __init__(self, adata, preprocess=True, weights_dir = "CarDEC Weights", batch_key = None, n_high_var = 2000, LVG = True, 14 | normalize_samples = True, log_normalize = True, normalize_features = True): 15 | """ Main CarDEC API the user can use to conduct batch correction and denoising experiments. 16 | 17 | 18 | Arguments: 19 | ------------------------------------------------------------------ 20 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 21 | - preprocess: `bool`, If True, then preprocess the data. 22 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 23 | - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch. 24 | - n_high_var: `int`, integer specifying the number of genes to be idntified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable. 
25 | - LVG: `bool`, If True, also model LVGs. Otherwise, only model HVGs. 26 | - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell. 27 | - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count. 28 | - normalize_features: `bool`, If True, z-score normalize each gene's expression. 29 | """ 30 | 31 | if n_high_var is None: 32 | n_high_var = None 33 | LVG = False 34 | 35 | self.weights_dir = weights_dir 36 | self.LVG = LVG 37 | 38 | self.norm_args = (batch_key, n_high_var, LVG, normalize_samples, log_normalize, normalize_features) 39 | 40 | if preprocess: 41 | self.dataset = normalize_scanpy(adata, *self.norm_args) 42 | else: 43 | assert 'Variance Type' in adata.var.keys() 44 | assert 'normalized input' in adata.layers 45 | self.dataset = adata 46 | 47 | self.loaded = False 48 | self.count_loaded = False 49 | 50 | def build_model(self, load_fullmodel = True, dims = [128, 32], LVG_dims = [128, 32], tol = 0.005, n_clusters = None, 51 | random_seed = 201809, louvain_seed = 0, n_neighbors = 15, pretrain_epochs = 2000, batch_size_pretrain = 64, 52 | act = 'relu', actincenter = "tanh", ae_lr = 1e-04, ae_decay_factor = 1/3, ae_patience_LR = 3, 53 | ae_patience_ES = 9, clust_weight = 1., load_encoder_weights = True): 54 | """ Initializes the main CarDEC model. 55 | 56 | 57 | Arguments: 58 | ------------------------------------------------------------------ 59 | - load_fullmodel: `bool`, If True, the API will try to load the weights for the full model from the weight directory. 60 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 61 | - LVG_dims: `list`, the number of output features for each layer of the LVG encoder. The length of the list determines the number of layers. 62 | - tol: `float`, stop criterion, clustering procedure will be stopped when the difference ratio between the current iteration and last iteration larger than tol. 63 | - n_clusters: `int`, The number of clusters into which cells will be grouped. 64 | - random_seed: `int`, The seed used for random weight intialization. 65 | - louvain_seed: `int`, The seed used for louvain clustering intialization. 66 | - n_neighbors: `int`, The number of neighbors used for building the graph needed for louvain clustering. 67 | - pretrain_epochs: `int`, The maximum number of epochs for pretraining the HVG autoencoder. In practice, early stopping criteria should stop training much earlier. 68 | - batch_size_pretrain: `int`, The batch size used for pretraining the HVG autoencoder. 69 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 70 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 71 | - ae_lr: `float`, The learning rate for pretraining the HVG autoencoder. 72 | - ae_decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 73 | - ae_patience_LR: `int`, the number of epochs which the validation loss is allowed to increase before learning rate is decayed when pretraining the autoencoder. 74 | - ae_patience_ES: `int`, the number of epochs which the validation loss is allowed to increase before training is halted when pretraining the autoencoder. 
75 | - clust_weight: `float`, a number between 0 and 2 which balances the clustering and reconstruction losses. 76 | - load_encoder_weights: `bool`, If True, the API will try to load the weights for the HVG encoder from the weight directory. 77 | """ 78 | 79 | assert n_clusters is not None 80 | 81 | if 'normalized input' not in list(self.dataset.layers): 82 | self.dataset = normalize_scanpy(self.dataset, *self.norm_args) 83 | 84 | p = sum(self.dataset.var["Variance Type"] == 'HVG') 85 | self.dims = [p] + dims 86 | 87 | if self.LVG: 88 | LVG_p = sum(self.dataset.var["Variance Type"] == 'LVG') 89 | self.LVG_dims = [LVG_p] + LVG_dims 90 | else: 91 | self.LVG_dims = None 92 | 93 | self.load_fullmodel = load_fullmodel 94 | self.weights_exist = os.path.isfile("./" + self.weights_dir + "/tuned_CarDECweights.index") 95 | 96 | set_centroids = not (self.load_fullmodel and self.weights_exist) 97 | 98 | self.model = CarDEC_Model(self.dataset, self.dims, self.LVG_dims, tol, n_clusters, random_seed, louvain_seed, 99 | n_neighbors, pretrain_epochs, batch_size_pretrain, ae_decay_factor, 100 | ae_patience_LR, ae_patience_ES, act, actincenter, ae_lr, 101 | clust_weight, load_encoder_weights, set_centroids, self.weights_dir) 102 | 103 | def make_inference(self, batch_size = 64, val_split = 0.1, lr = 1e-04, decay_factor = 1/3, 104 | iteration_patience_LR = 3, iteration_patience_ES = 6, maxiter = 1e3, epochs_fit = 1, 105 | optimizer = Adam(), printperiter = None, denoise_all = True, denoise_list = None): 106 | """ This class method makes inference on the data (batch correction + denoising) with the main CarDEC model. 107 | 108 | 109 | Arguments: 110 | ------------------------------------------------------------------ 111 | - batch_size: `int`, The batch size used for training the full model. 112 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 113 | - lr: `float`, The learning rate for training the full model. 114 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 115 | - iteration_patience_LR: `int`, The number of iterations tolerated before decaying the learning rate during which the number of cells that change assignment is less than tol. 116 | - iteration_patience_ES: `int`, The number of iterations tolerated before stopping training during which the number of cells that change assignment is less than tol. 117 | - maxiter: `int`, The maximum number of iterations allowed to train the full model. In practice, the model will halt training long before hitting this limit. 118 | - epochs_fit: `int`, The number of epochs during which to fine-tune weights, before updating the target distribution. 119 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 120 | - printperiter: `int`, Optional integer argument. If specified, denoised values will be returned every printperiter epochs, so that the user can evaluate the progress of denoising as training continues. 121 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 122 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 123 | 124 | Returns: 125 | ------------------------------------------------------------------ 126 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression provided only for listed cells.
If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 127 | """ 128 | 129 | if denoise_list is not None: 130 | denoise_all = False 131 | 132 | if not self.loaded: 133 | if self.load_fullmodel and self.weights_exist: 134 | self.dataset = self.model.reload_model(self.dataset, batch_size, denoise_all) 135 | 136 | elif not self.weights_exist: 137 | print("CarDEC Model Weights not detected. Training full model.\n") 138 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 139 | iteration_patience_LR, iteration_patience_ES, maxiter, 140 | epochs_fit, optimizer, printperiter, denoise_all) 141 | 142 | else: 143 | print("Training full model.\n") 144 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 145 | iteration_patience_LR, iteration_patience_ES, 146 | maxiter, epochs_fit, optimizer, printperiter, denoise_all) 147 | 148 | 149 | self.loaded = True 150 | 151 | elif denoise_all: 152 | self.dataset = self.model.make_outputs(self.dataset, batch_size, True) 153 | 154 | if denoise_list is not None: 155 | denoise_list = list(denoise_list) 156 | indices = [x in denoise_list for x in self.dataset.obs.index] 157 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 158 | denoised.index = self.dataset.obs.index[indices] 159 | denoised.columns = self.dataset.var.index 160 | 161 | 162 | if self.LVG: 163 | hvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"][indices]) 164 | lvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["LVG embedding"][indices]) 165 | 166 | input_ds = tf.data.Dataset.zip((hvg_ds, lvg_ds)) 167 | input_ds = input_ds.batch(batch_size) 168 | 169 | start = 0 170 | for x in input_ds: 171 | denoised_batch = {'HVG_denoised': self.model.decoder(x[0]), 'LVG_denoised': self.model.decoderLVG(x[1])} 172 | q_batch = self.model.clustering_layer(x[0]) 173 | end = start + q_batch.shape[0] 174 | 175 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'HVG')[0]] = denoised_batch['HVG_denoised'].numpy() 176 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'LVG')[0]] = denoised_batch['LVG_denoised'].numpy() 177 | 178 | start = end 179 | 180 | else: 181 | input_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"]) 182 | 183 | input_ds = input_ds.batch(batch_size) 184 | 185 | start = 0 186 | 187 | for x in input_ds: 188 | denoised_batch = {'HVG_denoised': self.model.decoder(x)} 189 | q_batch = self.model.clustering_layer(x) 190 | end = start + q_batch.shape[0] 191 | 192 | denoised.iloc[start:end] = denoised_batch['HVG_denoised'].numpy() 193 | 194 | start = end 195 | 196 | return denoised 197 | 198 | print(" ") 199 | 200 | def model_counts(self, load_weights = True, act = 'relu', random_seed = 201809, 201 | optimizer = Adam(), keep_dispersion = False, num_epochs = 2000, batch_size_count = 64, 202 | val_split = 0.1, lr = 1e-03, decay_factor = 1/3, patience_LR = 3, patience_ES = 9, 203 | denoise_all = True, denoise_list = None): 204 | """ This class method makes inference on the data on the count scale. 205 | 206 | 207 | Arguments: 208 | ------------------------------------------------------------------ 209 | - load_weights: `bool`, If true, the API will attempt to load the weights for the count model. 210 | - act: `str`, A string specifying the activation function for intermediate layers of the count models. 
211 | - random_seed: `int`, A seed used for weight initialization. 212 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 213 | - keep_dispersion: `bool`, If True, the gene, cell dispersions will be returned as well. 214 | - num_epochs: `int`, The maximum number of epochs allowed to train each count model. In practice, the model will halt 215 | training long before hitting this limit. 216 | - batch_size_count: `int`, The batch size used for training the count models. 217 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 218 | - lr: `float`, The learning rate for training the count models. 219 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 220 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss does not decrease. 221 | - patience_ES: `int`, The number of iterations tolerated before stopping training during which the validation loss does not decrease. 222 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 223 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 224 | 225 | Returns: 226 | ------------------------------------------------------------------ 227 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression on the count scale provided only for listed cells. If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 228 | - denoised_dispersion: `pd.DataFrame`, (Optional) If denoise_list was specified and "keep_dispersion" was set to True, then this will be an array of dispersions from the fitted negative binomial model provided only for listed cells. If denoise_all was instead specified as False, but "keep_dispersion" was still True then dispersions for all cells will be added as a layer to adata. 229 | """ 230 | 231 | if denoise_list is not None: 232 | denoise_all = False 233 | 234 | if not self.count_loaded: 235 | weights_dir = os.path.join(self.weights_dir, 'count weights') 236 | weight_files_exist = os.path.isfile(weights_dir + "/countmodel_weights_HVG Count.index") 237 | if self.LVG: 238 | weight_files_exist = weight_files_exist and os.path.isfile(weights_dir + "/countmodel_weights_LVG Count.index") 239 | 240 | init_args = (act, random_seed, self.model.splitseed, optimizer, weights_dir) 241 | train_args = (num_epochs, batch_size_count, val_split, lr, decay_factor, patience_LR, patience_ES) 242 | 243 | self.nbmodel = count_model(self.dims, *init_args, n_features = self.dims[-1], mode = 'HVG') 244 | 245 | if load_weights and weight_files_exist: 246 | print("Weight files for count models detected, loading weights.") 247 | self.nbmodel.load_model() 248 | 249 | elif load_weights: 250 | print("Weight files for count models not detected. 
Training HVG count model.\n") 251 | self.nbmodel.train(self.dataset, *train_args) 252 | 253 | else: 254 | print("Training HVG count model.\n") 255 | self.nbmodel.train(self.dataset, *train_args) 256 | 257 | if self.LVG: 258 | self.nbmodel_lvg = count_model(self.LVG_dims, *init_args, 259 | n_features = self.dims[-1] + self.LVG_dims[-1], mode = 'LVG') 260 | 261 | if load_weights and weight_files_exist: 262 | self.nbmodel_lvg.load_model() 263 | print("Count model weights loaded successfully.") 264 | 265 | elif load_weights: 266 | print("\n \n \n") 267 | print("Training LVG count model.\n") 268 | self.nbmodel_lvg.train(self.dataset, *train_args) 269 | 270 | else: 271 | print("\n \n \n") 272 | print("Training LVG count model.\n") 273 | self.nbmodel_lvg.train(self.dataset, *train_args) 274 | 275 | self.count_loaded = True 276 | 277 | if denoise_all: 278 | self.nbmodel.denoise(self.dataset, keep_dispersion, batch_size_count) 279 | if self.LVG: 280 | self.nbmodel_lvg.denoise(self.dataset, keep_dispersion, batch_size_count) 281 | 282 | elif denoise_list is not None: 283 | denoise_list = list(denoise_list) 284 | indices = [x in denoise_list for x in self.dataset.obs.index] 285 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 286 | denoised.index = self.dataset.obs.index[indices] 287 | denoised.columns = self.dataset.var.index 288 | if keep_dispersion: 289 | denoised_dispersion = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 290 | denoised_dispersion.index = self.dataset.obs.index[indices] 291 | denoised_dispersion.columns = self.dataset.var.index 292 | 293 | input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['embedding'][indices]) 294 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 295 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 296 | input_ds = input_ds.batch(batch_size_count) 297 | 298 | type_indices = np.where(self.dataset.var['Variance Type'] == 'HVG')[0] 299 | 300 | if not keep_dispersion: 301 | start = 0 302 | for x in input_ds: 303 | end = start + x[0].shape[0] 304 | denoised.iloc[start:end, type_indices] = self.nbmodel(*x)[0].numpy() 305 | start = end 306 | 307 | else: 308 | start = 0 309 | for x in input_ds: 310 | end = start + x[0].shape[0] 311 | batch_output = self.nbmodel(*x) 312 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 313 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 314 | start = end 315 | 316 | if self.LVG: 317 | input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['LVG embedding'][indices]) 318 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 319 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 320 | input_ds = input_ds.batch(batch_size_count) 321 | 322 | type_indices = np.where(self.dataset.var['Variance Type'] == 'LVG')[0] 323 | 324 | if not keep_dispersion: 325 | start = 0 326 | for x in input_ds: 327 | end = start + x[0].shape[0] 328 | denoised.iloc[start:end, type_indices] = self.nbmodel_lvg(*x)[0].numpy() 329 | start = end 330 | 331 | else: 332 | start = 0 333 | for x in input_ds: 334 | end = start + x[0].shape[0] 335 | batch_output = self.nbmodel_lvg(*x) 336 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 337 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 338 | start = end 339 | 340 | if not keep_dispersion: 341 | return 
denoised 342 | else: 343 | return denoised, denoised_dispersion 344 | 345 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_SAE.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, MSEloss 2 | from .CarDEC_dataloaders import simpleloader, aeloader 3 | 4 | import tensorflow as tf 5 | from tensorflow.keras import Model, Sequential 6 | from tensorflow.keras.layers import Dense, concatenate 7 | from tensorflow.keras.optimizers import Adam 8 | from tensorflow.keras.backend import set_floatx 9 | from time import time 10 | 11 | import random 12 | import numpy as np 13 | from scipy.stats import zscore 14 | import os 15 | 16 | 17 | set_floatx('float32') 18 | 19 | 20 | class SAE(Model): 21 | def __init__(self, dims, act = 'relu', actincenter = "tanh", 22 | random_seed = 201809, splitseed = 215, init = "glorot_uniform", optimizer = Adam(), 23 | weights_dir = 'CarDEC Weights'): 24 | """ This class method initializes the SAE model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 30 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 31 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 32 | - random_seed: `int`, The seed used for random weight intialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between iterations to ensure the same cells are always used for validation. 34 | - init: `str`, The weight initialization strategy for the autoencoder. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 37 | """ 38 | 39 | super(SAE, self).__init__() 40 | 41 | tf.keras.backend.clear_session() 42 | 43 | self.weights_dir = weights_dir 44 | 45 | self.dims = dims 46 | self.n_stacks = len(dims) - 1 47 | self.init = init 48 | self.optimizer = optimizer 49 | self.random_seed = random_seed 50 | self.splitseed = splitseed 51 | 52 | self.activation = act 53 | self.actincenter = actincenter #hidden layer activation function 54 | 55 | #set random seed 56 | random.seed(random_seed) 57 | np.random.seed(random_seed) 58 | tf.random.set_seed(random_seed) 59 | 60 | encoder_layers = [] 61 | for i in range(self.n_stacks-1): 62 | encoder_layers.append(Dense(self.dims[i + 1], kernel_initializer = self.init, activation = self.activation, name='encoder_%d' % i)) 63 | 64 | encoder_layers.append(Dense(self.dims[-1], kernel_initializer=self.init, activation=self.actincenter, name='embedding')) 65 | self.encoder = Sequential(encoder_layers, name = 'encoder') 66 | 67 | decoder_layers = [] 68 | for i in range(self.n_stacks - 1, 0, -1): 69 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 70 | , name = 'decoder%d' % (i-1))) 71 | 72 | decoder_layers.append(Dense(self.dims[0], activation = 'linear', name='output')) 73 | 74 | self.decoder = Sequential(decoder_layers, name = 'decoder') 75 | 76 | self.construct() 77 | 78 | def call(self, x): 79 | """ This is the forward pass of the model. 
80 | 81 | 82 | ***Inputs*** 83 | - x: `tf.Tensor`, an input tensor of shape (n_obs, p_HVG). 84 | 85 | ***Outputs*** 86 | - output: `tf.Tensor`, A (n_obs, p_HVG) tensor of denoised HVG expression. 87 | """ 88 | 89 | c = self.encoder(x) 90 | 91 | output = self.decoder(c) 92 | 93 | return output 94 | 95 | def load_encoder(self, random_seed = 2312): 96 | """ This class method can be used to load the encoder weights, while randomly reinitializing the decoder weights. 97 | 98 | 99 | Arguments: 100 | ------------------------------------------------------------------ 101 | - random_seed: `int`, Seed for reinitializing the decoder. 102 | """ 103 | 104 | tf.keras.backend.clear_session() 105 | 106 | #set random seed 107 | random.seed(random_seed) 108 | np.random.seed(random_seed) 109 | tf.random.set_seed(random_seed) 110 | 111 | self.encoder.load_weights("./" + self.weights_dir + "/pretrained_encoder_weights").expect_partial() 112 | 113 | decoder_layers = [] 114 | for i in range(self.n_stacks - 1, 0, -1): 115 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 116 | , name='decoder%d' % (i-1))) 117 | self.decoder_base = Sequential(decoder_layers, name = 'decoderbase') 118 | 119 | self.output_layer = Dense(self.dims[0], activation = 'linear', name='output') 120 | 121 | self.construct(summarize = False) 122 | 123 | def load_autoencoder(self, ): 124 | """ This class method can be used to load the full model's weights.""" 125 | 126 | tf.keras.backend.clear_session() 127 | 128 | self.load_weights("./" + self.weights_dir + "/pretrained_autoencoder_weights").expect_partial() 129 | 130 | def construct(self, summarize = False): 131 | """ This class method fully initalizes the TensorFlow model. 132 | 133 | 134 | Arguments: 135 | ------------------------------------------------------------------ 136 | - summarize: `bool`, If True, then print a summary of the model architecture. 137 | """ 138 | 139 | x = tf.zeros(shape = (1, self.dims[0]), dtype=float) 140 | out = self(x) 141 | 142 | if summarize: 143 | print("----------Autoencoder Architecture----------") 144 | self.summary() 145 | 146 | print("\n----------Encoder Sub-Architecture----------") 147 | self.encoder.summary() 148 | 149 | print("\n----------Base Decoder Sub-Architecture----------") 150 | self.decoder.summary() 151 | 152 | def denoise(self, adata, batch_size = 64): 153 | """ This class method can be used to denoise gene expression for each cell. 154 | 155 | 156 | Arguments: 157 | ------------------------------------------------------------------ 158 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 159 | - batch_size: `int`, The batch size used for computing denoised expression. 160 | 161 | Returns: 162 | ------------------------------------------------------------------ 163 | - output: `np.ndarray`, Numpy array of denoised expression of shape (n_obs, n_vars) 164 | """ 165 | 166 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 167 | 168 | output = np.zeros((adata.shape[0], self.dims[0]), dtype = 'float32') 169 | start = 0 170 | 171 | for x in input_ds: 172 | end = start + x.shape[0] 173 | output[start:end] = self(x).numpy() 174 | start = end 175 | 176 | return output 177 | 178 | def embed(self, adata, batch_size = 64): 179 | """ This class method can be used to compute the low-dimension embedding for HVG features. 
180 | 181 | 182 | Arguments: 183 | ------------------------------------------------------------------ 184 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 185 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 186 | 187 | Returns: 188 | ------------------------------------------------------------------ 189 | - embedding: `np.ndarray`, Array of shape (n_obs, n_vars) containing the cell HVG embeddings. 190 | """ 191 | 192 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 193 | 194 | embedding = np.zeros((adata.shape[0], self.dims[-1]), dtype = 'float32') 195 | 196 | start = 0 197 | for x in input_ds: 198 | end = start + x.shape[0] 199 | embedding[start:end] = self.encoder(x).numpy() 200 | start = end 201 | 202 | return embedding 203 | 204 | def makegenerators(self, adata, val_split, batch_size, splitseed): 205 | """ This class method creates training and validation data generators for the current input data. 206 | 207 | 208 | Arguments: 209 | ------------------------------------------------------------------ 210 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - batch_size: `int`, The batch size used for training the model. 213 | - splitseed: `int`, The seed used to split cells between training and validation. 214 | 215 | Returns: 216 | ------------------------------------------------------------------ 217 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 218 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 219 | """ 220 | 221 | return aeloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], val_frac = val_split, batch_size = batch_size, splitseed = splitseed) 222 | 223 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 224 | patience_LR = 3, patience_ES = 9, save_fullmodel = True): 225 | """ This class method can be used to train the SAE. 226 | 227 | 228 | Arguments: 229 | ------------------------------------------------------------------ 230 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 231 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt training long before hitting this limit. 232 | - batch_size: `int`, The batch size used for training the full model. 233 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 234 | - lr: `float`, The learning rate for training the full model. 235 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 236 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss fails to decrease. 237 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to decrease. 238 | - save_fullmodel: `bool`, If True, save the full model's weights, not just the encoder. 
239 | """ 240 | 241 | tf.keras.backend.clear_session() 242 | 243 | dataset = self.makegenerators(adata, val_split = 0.1, batch_size = batch_size, splitseed = self.splitseed) 244 | 245 | counter_LR = 0 246 | counter_ES = 0 247 | best_loss = np.inf 248 | 249 | self.optimizer.lr = lr 250 | 251 | total_start = time() 252 | for epoch in range(num_epochs): 253 | epoch_start = time() 254 | 255 | epoch_loss_avg = tf.keras.metrics.Mean() 256 | epoch_loss_avg_val = tf.keras.metrics.Mean() 257 | 258 | # Training loop - using batches of batch_size 259 | for x, target in dataset(val = False): 260 | loss_value, grads = grad(self, x, target, MSEloss) 261 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 262 | epoch_loss_avg(loss_value) # Add current batch loss 263 | 264 | # Validation Loop 265 | for x, target in dataset(val = True): 266 | output = self(x) 267 | loss_value = MSEloss(target, output) 268 | epoch_loss_avg_val(loss_value) 269 | 270 | current_loss_val = epoch_loss_avg_val.result() 271 | 272 | epoch_time = round(time() - epoch_start, 1) 273 | 274 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 275 | 276 | if(current_loss_val + 10**(-3) < best_loss): 277 | counter_LR = 0 278 | counter_ES = 0 279 | best_loss = current_loss_val 280 | else: 281 | counter_LR = counter_LR + 1 282 | counter_ES = counter_ES + 1 283 | 284 | if patience_ES <= counter_ES: 285 | break 286 | 287 | if patience_LR <= counter_LR: 288 | self.optimizer.lr = self.optimizer.lr * decay_factor 289 | counter_LR = 0 290 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 291 | 292 | # End epoch 293 | 294 | total_time = round(time() - total_start, 2) 295 | 296 | if not os.path.isdir("./" + self.weights_dir): 297 | os.mkdir("./" + self.weights_dir) 298 | 299 | self.save_weights("./" + self.weights_dir + "/pretrained_autoencoder_weights", save_format='tf') 300 | self.encoder.save_weights("./" + self.weights_dir + "/pretrained_encoder_weights", save_format='tf') 301 | 302 | print('\nTraining Completed') 303 | print("Total training time: " + str(total_time) + " seconds") 304 | 305 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_count_decoder.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, NBloss 2 | from .CarDEC_utils import build_dir 3 | from .CarDEC_dataloaders import countloader, tupleloader 4 | 5 | import tensorflow as tf 6 | from tensorflow.keras import Model, Sequential 7 | from tensorflow.keras.layers import Dense, concatenate, Lambda 8 | from tensorflow.keras.optimizers import Adam 9 | from tensorflow.keras.backend import exp as tf_exp, set_floatx 10 | from time import time 11 | 12 | import random 13 | import numpy as np 14 | from scipy.stats import zscore 15 | import os 16 | 17 | 18 | set_floatx('float32') 19 | 20 | 21 | class count_model(Model): 22 | def __init__(self, dims, act = 'relu', random_seed = 201809, splitseed = 215, optimizer = Adam(), 23 | weights_dir = 'CarDEC Count Weights', n_features = 32, mode = 'HVG'): 24 | """ This class method initializes the count model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the model. 
The length of the list determines the 30 | number of layers. 31 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 32 | - random_seed: `int`, The seed used for random weight intialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 34 | iterations to ensure the same cells are always used for validation. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 37 | - n_features: `int`, the number of input features. 38 | - mode: `str`, String identifying whether HVGs or LVGs are being modeled. 39 | """ 40 | 41 | super(count_model, self).__init__() 42 | 43 | tf.keras.backend.clear_session() 44 | 45 | self.mode = mode 46 | self.name_ = mode + " Count" 47 | 48 | if mode == 'HVG': 49 | self.embed_name = 'embedding' 50 | else: 51 | self.embed_name = 'LVG embedding' 52 | 53 | self.weights_dir = weights_dir 54 | 55 | self.dims = dims 56 | n_stacks = len(dims) - 1 57 | 58 | self.optimizer = optimizer 59 | self.random_seed = random_seed 60 | self.splitseed = splitseed 61 | 62 | random.seed(random_seed) 63 | np.random.seed(random_seed) 64 | tf.random.set_seed(random_seed) 65 | 66 | self.activation = act 67 | self.MeanAct = lambda x: tf.clip_by_value(tf_exp(x), 1e-5, 1e6) 68 | self.DispAct = lambda x: tf.clip_by_value(tf.nn.softplus(x), 1e-4, 1e4) 69 | 70 | model_layers = [] 71 | for i in range(n_stacks - 1, 0, -1): 72 | model_layers.append(Dense(dims[i], kernel_initializer = "glorot_uniform", activation = self.activation 73 | , name='base%d' % (i-1))) 74 | self.base = Sequential(model_layers, name = 'base') 75 | 76 | self.mean_layer = Dense(dims[0], activation = self.MeanAct, name='mean') 77 | self.disp_layer = Dense(dims[0], activation = self.DispAct, name='dispersion') 78 | 79 | self.rescale = Lambda(lambda l: tf.matmul(tf.linalg.diag(l[0]), l[1]), name = 'sf scaling') 80 | 81 | build_dir(self.weights_dir) 82 | 83 | self.construct(n_features, self.name_) 84 | 85 | def call(self, x, s): 86 | """ This is the forward pass of the model. 87 | 88 | 89 | ***Inputs*** 90 | - x: `tf.Tensor`, an input tensor of shape (b, p) 91 | - s: `tf.Tensor`, and input tensor of shape (b, ) containing the size factor for each cell 92 | 93 | ***Outputs*** 94 | - mean: `tf.Tensor`, A (b, p_gene) tensor of negative binomial means for each cell, gene. 95 | - disp: `tf.Tensor`, A (b, p_gene) tensor of negative binomial dispersions for each cell, gene. 96 | """ 97 | 98 | x = self.base(x) 99 | 100 | disp = self.disp_layer(x) 101 | mean = self.mean_layer(x) 102 | mean = self.rescale([s, mean]) 103 | 104 | return mean, disp 105 | 106 | def load_model(self, ): 107 | """ This class method can be used to load the model's weights.""" 108 | 109 | tf.keras.backend.clear_session() 110 | 111 | self.load_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_)).expect_partial() 112 | 113 | def construct(self, n_features, name, summarize = False): 114 | """ This class method fully initalizes the TensorFlow model. 115 | 116 | 117 | Arguments: 118 | ------------------------------------------------------------------ 119 | - n_features: `int`, the number of input features. 120 | - name: `str`, Model name (to distinguish HVG and LVG models). 121 | - summarize: `bool`, If True, then print a summary of the model architecture. 
122 | """ 123 | 124 | x = [tf.zeros(shape = (1, n_features), dtype='float32'), tf.ones(shape = (1,), dtype='float32')] 125 | out = self(*x) 126 | 127 | if summarize: 128 | print("----------Count Model " + name + " Architecture----------") 129 | self.summary() 130 | 131 | print("\n----------Base Sub-Architecture----------") 132 | self.base.summary() 133 | 134 | def denoise(self, adata, keep_dispersion = False, batch_size = 64): 135 | """ This class method can be used to denoise gene expression for each cell on the count scale. 136 | 137 | 138 | Arguments: 139 | ------------------------------------------------------------------ 140 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 141 | to cells and columns to genes. 142 | - keep_dispersion: `bool`, If True, also return the dispersion for each gene, cell (added as a layer to adata)/ 143 | - batch_size: `int`, The batch size used for computing denoised expression. 144 | 145 | Returns: 146 | ------------------------------------------------------------------ 147 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Negative binomial means (and optionally 148 | dispersions) added as layers. 149 | """ 150 | 151 | input_ds = tupleloader(adata.obsm[self.embed_name], adata.obs['size factors'], batch_size = batch_size) 152 | 153 | if "denoised counts" not in list(adata.layers): 154 | adata.layers["denoised counts"] = np.zeros(adata.shape, dtype = 'float32') 155 | 156 | type_indices = adata.var['Variance Type'] == self.mode 157 | 158 | if not keep_dispersion: 159 | start = 0 160 | for x in input_ds: 161 | end = start + x[0].shape[0] 162 | adata.layers["denoised counts"][start:end, type_indices] = self(*x)[0].numpy() 163 | start = end 164 | 165 | else: 166 | if "dispersion" not in list(adata.layers): 167 | adata.layers["dispersion"] = np.zeros(adata.shape, dtype = 'float32') 168 | 169 | start = 0 170 | for x in input_ds: 171 | end = start + x[0].shape[0] 172 | batch_output = self(*x) 173 | adata.layers["denoised counts"][start:end, type_indices] = batch_output[0].numpy() 174 | adata.layers["dispersion"][start:end, type_indices] = batch_output[1].numpy() 175 | start = end 176 | 177 | def makegenerators(self, adata, val_split, batch_size, splitseed): 178 | """ This class method creates training and validation data generators for the current input data. 179 | 180 | 181 | Arguments: 182 | ------------------------------------------------------------------ 183 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond 184 | to cells and columns to genes. 185 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 186 | - batch_size: `int`, The batch size used for training the model. 187 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 188 | iterations to ensure the same cells are always used for validation. 189 | 190 | Returns: 191 | ------------------------------------------------------------------ 192 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 193 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 
194 | """ 195 | 196 | return countloader(adata.obsm[self.embed_name], adata.X[:, adata.var['Variance Type'] == self.mode], adata.obs['size factors'], 197 | val_split, batch_size, splitseed) 198 | 199 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 200 | patience_LR = 3, patience_ES = 9): 201 | """ This class method can be used to train the count model. 202 | 203 | 204 | Arguments: 205 | ------------------------------------------------------------------ 206 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 207 | to cells and columns to genes. 208 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt 209 | training long before hitting this limit. 210 | - batch_size: `int`, The batch size used for training the full model. 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - lr: `float`, The learning rate for training the full model. 213 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not 214 | decreasing. 215 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the 216 | validation loss fails to decrease. 217 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to 218 | decrease. 219 | """ 220 | 221 | tf.keras.backend.clear_session() 222 | 223 | loss = NBloss 224 | 225 | dataset = self.makegenerators(adata, val_split = val_split, batch_size = batch_size, splitseed = self.splitseed) 226 | 227 | counter_LR = 0 228 | counter_ES = 0 229 | best_loss = np.inf 230 | 231 | self.optimizer.lr = lr 232 | 233 | total_start = time() 234 | 235 | for epoch in range(num_epochs): 236 | epoch_start = time() 237 | 238 | epoch_loss_avg = tf.keras.metrics.Mean() 239 | epoch_loss_avg_val = tf.keras.metrics.Mean() 240 | 241 | # Training loop - using batches of batch_size 242 | for x, target in dataset(val = False): 243 | loss_value, grads = grad(self, x, target, loss) 244 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 245 | epoch_loss_avg(loss_value) # Add current batch loss 246 | 247 | # Validation Loop 248 | for x, target in dataset(val = True): 249 | output = self(*x) 250 | loss_value = loss(target, output) 251 | epoch_loss_avg_val(loss_value) 252 | 253 | current_loss_val = epoch_loss_avg_val.result() 254 | 255 | epoch_time = round(time() - epoch_start, 1) 256 | 257 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 258 | 259 | if(current_loss_val + 10**(-3) < best_loss): 260 | counter_LR = 0 261 | counter_ES = 0 262 | best_loss = current_loss_val 263 | else: 264 | counter_LR = counter_LR + 1 265 | counter_ES = counter_ES + 1 266 | 267 | if patience_ES <= counter_ES: 268 | break 269 | 270 | if patience_LR <= counter_LR: 271 | self.optimizer.lr = self.optimizer.lr * decay_factor 272 | counter_LR = 0 273 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 274 | 275 | # End epoch 276 | 277 | total_time = round(time() - total_start, 2) 278 | 279 | if not os.path.isdir("./" + self.weights_dir): 280 | os.mkdir("./" + self.weights_dir) 281 | 282 | self.save_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_), save_format='tf') 283 |
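# --- Illustrative note (not part of the original CarDEC source): the loop above uses a
# patience scheme in which the validation loss must improve by more than 1e-3 to reset the
# counters; after `patience_LR` stalled epochs the learning rate is multiplied by
# `decay_factor`, and after `patience_ES` stalled epochs training halts. A hypothetical call
# sequence for this class, assuming `adata` already carries the 'embedding' obsm array,
# 'size factors' obs column, 'Variance Type' var column, and raw counts in adata.X
# (the leading 2000 below is a placeholder for the number of HVGs), might look like:
#
#     nb = count_model([2000, 128, 32], act = 'relu', n_features = 32, mode = 'HVG')
#     nb.train(adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-3)
#     nb.denoise(adata, keep_dispersion = True)  # adds 'denoised counts' and 'dispersion' layers
# ---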
284 | print('\nTraining Completed') 285 | print("Total training time: " + str(total_time) + " seconds") 286 | 287 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_dataloaders.py: -------------------------------------------------------------------------------- 1 | from tensorflow import convert_to_tensor as tensor 2 | from numpy import setdiff1d 3 | from numpy.random import choice, seed 4 | 5 | class batch_sampler(object): 6 | def __init__(self, array, val_frac, batch_size, splitseed): 7 | seed(splitseed) 8 | self.val_indices = choice(range(len(array)), round(val_frac * len(array)), False) 9 | self.train_indices = setdiff1d(range(len(array)), self.val_indices) 10 | self.batch_size = batch_size 11 | 12 | def __iter__(self): 13 | batch = [] 14 | 15 | if self.val: 16 | for idx in self.val_indices: 17 | batch.append(idx) 18 | 19 | if len(batch) == self.batch_size: 20 | yield batch 21 | batch = [] 22 | 23 | else: 24 | train_idx = choice(self.train_indices, len(self.train_indices), False) 25 | 26 | for idx in train_idx: 27 | batch.append(idx) 28 | 29 | if len(batch) == self.batch_size: 30 | yield batch 31 | batch = [] 32 | 33 | if batch: 34 | yield batch 35 | 36 | def __call__(self, val): 37 | self.val = val 38 | return self 39 | 40 | class simpleloader(object): 41 | def __init__(self, array, batch_size): 42 | self.array = array 43 | self.batch_size = batch_size 44 | 45 | def __iter__(self): 46 | batch = [] 47 | 48 | for idx in range(len(self.array)): 49 | batch.append(idx) 50 | 51 | if len(batch) == self.batch_size: 52 | yield tensor(self.array[batch].copy()) 53 | batch = [] 54 | 55 | if batch: 56 | yield self.array[batch].copy() 57 | 58 | class tupleloader(object): 59 | def __init__(self, *arrays, batch_size): 60 | self.arrays = arrays 61 | self.batch_size = batch_size 62 | 63 | def __iter__(self): 64 | batch = [] 65 | 66 | for idx in range(len(self.arrays[0])): 67 | batch.append(idx) 68 | 69 | if len(batch) == self.batch_size: 70 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 71 | batch = [] 72 | 73 | if batch: 74 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 75 | 76 | class aeloader(object): 77 | def __init__(self, *arrays, val_frac, batch_size, splitseed): 78 | self.arrays = arrays 79 | self.batch_size = batch_size 80 | self.sampler = batch_sampler(arrays[0], val_frac, batch_size, splitseed) 81 | 82 | def __iter__(self): 83 | for idxs in self.sampler(self.val): 84 | yield [tensor(arr[idxs].copy()) for arr in self.arrays] 85 | 86 | def __call__(self, val): 87 | self.val = val 88 | return self 89 | 90 | class countloader(object): 91 | def __init__(self, embedding, target, sizefactor, val_frac, batch_size, splitseed): 92 | self.sampler = batch_sampler(embedding, val_frac, batch_size, splitseed) 93 | self.embedding = embedding 94 | self.target = target 95 | self.sizefactor = sizefactor 96 | 97 | def __iter__(self): 98 | for idxs in self.sampler(self.val): 99 | yield (tensor(self.embedding[idxs].copy()), tensor(self.sizefactor[idxs].copy())), tensor(self.target[idxs].copy()) 100 | 101 | def __call__(self, val): 102 | self.val = val 103 | return self 104 | 105 | class dataloader(object): 106 | def __init__(self, hvg_input, hvg_target, lvg_input = None, lvg_target = None, val_frac = 0.1, batch_size = 128, splitseed = 0): 107 | self.sampler = batch_sampler(hvg_input, val_frac, batch_size, splitseed) 108 | self.hvg_input = hvg_input 109 | self.hvg_target = hvg_target 110 | self.lvg_input = lvg_input 111 | 
self.lvg_target = lvg_target 112 | 113 | def __iter__(self): 114 | for idxs in self.sampler(self.val): 115 | hvg_input = tensor(self.hvg_input[idxs].copy()) 116 | hvg_target = tensor(self.hvg_target[idxs].copy()) 117 | p_target = tensor(self.p_target[idxs].copy()) 118 | 119 | if (self.lvg_input is not None) and (self.lvg_target is not None): 120 | lvg_input = tensor(self.lvg_input[idxs].copy()) 121 | lvg_target = tensor(self.lvg_target[idxs].copy()) 122 | else: 123 | lvg_input = None 124 | lvg_target = None 125 | 126 | yield [hvg_input, lvg_input], hvg_target, lvg_target, p_target 127 | 128 | def __call__(self, val): 129 | self.val = val 130 | return self 131 | 132 | def update_p(self, new_p_target): 133 | self.p_target = new_p_target -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_layers.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Layer 3 | 4 | class ClusteringLayer(Layer): 5 | def __init__(self, centroids = None, n_clusters = None, n_features = None, alpha=1.0, **kwargs): 6 | """ The clustering layer predicts the a cell's class membership probability for each cell. 7 | 8 | 9 | Arguments: 10 | ------------------------------------------------------------------ 11 | - centroids: `tf.Tensor`, Initial cluster ceontroids after pretraining the model. 12 | - n_clusters: `int`, Number of clusters. 13 | - n_features: `int`, The number of features of the bottleneck embedding space that the centroids live in. 14 | - alpha: parameter in Student's t-distribution. Default to 1.0. 15 | """ 16 | 17 | super(ClusteringLayer, self).__init__(**kwargs) 18 | self.alpha = alpha 19 | self.initial_centroids = centroids 20 | 21 | if centroids is not None: 22 | n_clusters, n_features = centroids.shape 23 | 24 | self.n_features, self.n_clusters = n_features, n_clusters 25 | 26 | assert self.n_clusters is not None 27 | assert self.n_features is not None 28 | 29 | def build(self, input_shape): 30 | """ This class method builds the layer fully once it receives an input tensor. 31 | 32 | 33 | Arguments: 34 | ------------------------------------------------------------------ 35 | - input_shape: `list`, A list specifying the shape of the input tensor. 36 | """ 37 | 38 | assert len(input_shape) == 2 39 | 40 | self.centroids = self.add_weight(name = 'clusters', shape = (self.n_clusters, self.n_features), initializer = 'glorot_uniform') 41 | if self.initial_centroids is not None: 42 | self.set_weights([self.initial_centroids]) 43 | del self.initial_centroids 44 | 45 | self.built = True 46 | 47 | def call(self, x, **kwargs): 48 | """ Forward pass of the clustering layer, 49 | 50 | 51 | ***Inputs***: 52 | - x: `tf.Tensor`, the embedding tensor of shape = (n_obs, n_var) 53 | 54 | ***Returns***: 55 | - q: `tf.Tensor`, student's t-distribution, or soft labels for each sample of shape = (n_obs, n_clusters) 56 | """ 57 | 58 | q = 1.0 / (1.0 + (tf.reduce_sum(tf.square(tf.expand_dims(x, axis = 1) - self.centroids), axis = 2) / self.alpha)) 59 | q = q**((self.alpha + 1.0) / 2.0) 60 | q = q / tf.reduce_sum(q, axis = 1, keepdims = True) 61 | 62 | return q 63 | 64 | def compute_output_shape(self, input_shape): 65 | """ This method infers the output shape from the input shape. 66 | 67 | 68 | Arguments: 69 | ------------------------------------------------------------------ 70 | - input_shape: `list`, A list specifying the shape of the input tensor. 
71 | 72 | Returns: 73 | ------------------------------------------------------------------ 74 | - output_shape: `list`, A tuple specifying the shape of the output for the minibatch (n_obs, n_clusters) 75 | """ 76 | 77 | assert input_shape and len(input_shape) == 2 78 | return input_shape[0], self.n_clusters -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_optimization.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import tensorflow as tf 4 | from tensorflow.keras.losses import KLD, MSE 5 | 6 | 7 | def grad_MainModel(model, input_, target, target_p, total_loss, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 8 | """Function to do a backprop update to the main CarDEC model for a minibatch. 9 | 10 | 11 | Arguments: 12 | ------------------------------------------------------------------ 13 | - model: `tensorflow.keras.Model`, The main CarDEC model. 14 | - input_: `list`, A list containing the input HVG and (optionally) LVG expression tensors of the minibatch for the CarDEC model. 15 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 16 | - target_p: `tf.Tensor`, Tensor containing cluster membership probability targets for the minibatch. 17 | - total_loss: `function`, Function to compute the loss for the main CarDEC model for a minibatch. 18 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 19 | - aeloss_fun: `function`, Function to compute reconstruction loss. 20 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 21 | 22 | Returns: 23 | ------------------------------------------------------------------ 24 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 25 | - gradients: `a list of Tensors`: Gradients to update the model weights. 26 | """ 27 | 28 | with tf.GradientTape() as tape: 29 | denoised_output, cluster_output = model(*input_) 30 | loss_value, aeloss = total_loss(target, denoised_output, target_p, cluster_output, 31 | LVG_target, aeloss_fun, clust_weight) 32 | 33 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 34 | 35 | 36 | def grad_reconstruction(model, input_, target, loss): 37 | """Function to compute gradient update for pretrained autoencoder only. 38 | 39 | 40 | Arguments: 41 | ------------------------------------------------------------------ 42 | - model: `tensorflow.keras.Model`, The main CarDEC model. 43 | - input_: `list`, A list containing the input HVG expression tensor of the minibatch for the CarDEC model. 44 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 45 | - loss: `function`, Function to compute reconstruction loss. 46 | 47 | Returns: 48 | ------------------------------------------------------------------ 49 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 50 | - gradients: `a list of Tensors`: Gradients to update the model weights. 
51 | """ 52 | 53 | if type(input_) != tuple: 54 | input_ = (input_, ) 55 | 56 | with tf.GradientTape() as tape: 57 | output = model(*input_) 58 | loss_value = loss(target, output) 59 | 60 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 61 | 62 | 63 | def total_loss(target, denoised_output, p, cluster_output_q, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 64 | """Function to compute the loss for the main CarDEC model for a minibatch. 65 | 66 | 67 | Arguments: 68 | ------------------------------------------------------------------ 69 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 70 | - denoised_output: `dict`, Dictionary containing the output tensors from the CarDEC main model's forward pass. 71 | - p: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing cluster membership probability targets for the minibatch. 72 | - cluster_output_q: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing predicted cluster membership probabilities 73 | for each cell. 74 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 75 | - aeloss_fun: `function`, Function to compute reconstruction loss. 76 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 77 | 78 | Returns: 79 | ------------------------------------------------------------------ 80 | - net_loss: `tf.Tensor`, The loss computed for the minibatch. 81 | - aeloss: `tf.Tensor`, The reconstruction loss computed for the minibatch. 82 | """ 83 | 84 | if aeloss_fun is not None: 85 | 86 | aeloss_HVG = aeloss_fun(target, denoised_output['HVG_denoised']) 87 | if LVG_target is not None: 88 | aeloss_LVG = aeloss_fun(LVG_target, denoised_output['LVG_denoised']) 89 | aeloss = 0.5*(aeloss_LVG + aeloss_HVG) 90 | else: 91 | aeloss = 1. * aeloss_HVG 92 | else: 93 | aeloss = 0. 94 | 95 | net_loss = clust_weight * tf.reduce_mean(KLD(p, cluster_output_q)) + (2. - clust_weight) * aeloss 96 | 97 | return net_loss, aeloss 98 | 99 | 100 | def MSEloss(netinput, netoutput): 101 | """Function to compute the MSEloss for the reconstruction loss of a minibatch. 102 | 103 | 104 | Arguments: 105 | ------------------------------------------------------------------ 106 | - netinput: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells. 107 | - netoutput: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 108 | 109 | Returns: 110 | ------------------------------------------------------------------ 111 | - mse_loss: `tf.Tensor`, The loss computed for the minibatch, averaged over genes and cells. 112 | """ 113 | 114 | return tf.math.reduce_mean(MSE(netinput, netoutput)) 115 | 116 | 117 | def NBloss(count, output, eps = 1e-10, mean = True): 118 | """Function to compute the negative binomial reconstruction loss of a minibatch. 119 | 120 | 121 | Arguments: 122 | ------------------------------------------------------------------ 123 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original 124 | counts). 125 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 
126 | - eps: `float`, A small number introduced for computational stability 127 | - mean: `bool`, If True, average negative binomial loss over genes and cells 128 | 129 | Returns: 130 | ------------------------------------------------------------------ 131 | - nbloss: `tf.Tensor`, The loss computed for the minibatch. If mean was True, it has shape (n_obs, n_var). Otherwise, it has shape (1,). 132 | """ 133 | 134 | count = tf.cast(count, tf.float32) 135 | mu = tf.cast(output[0], tf.float32) 136 | 137 | theta = tf.minimum(output[1], 1e6) 138 | 139 | t1 = tf.math.lgamma(theta + eps) + tf.math.lgamma(count + 1.0) - tf.math.lgamma(count + theta + eps) 140 | t2 = (theta + count) * tf.math.log(1.0 + (mu/(theta+eps))) + (count * (tf.math.log(theta + eps) - tf.math.log(mu + eps))) 141 | 142 | final = _nan2inf(t1 + t2) 143 | 144 | if mean: 145 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 146 | 147 | return final 148 | 149 | 150 | def ZINBloss(count, output, eps = 1e-10): 151 | """Function to compute the negative binomial reconstruction loss of a minibatch. 152 | 153 | 154 | Arguments: 155 | ------------------------------------------------------------------ 156 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original counts). 157 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 158 | - eps: `float`, A small number introduced for computational stability 159 | 160 | Returns: 161 | ------------------------------------------------------------------ 162 | - zinbloss: `tf.Tensor`, The loss computed for the minibatch. Has shape (1,). 163 | """ 164 | 165 | mu = output[0] 166 | theta = output[1] 167 | pi = output[2] 168 | 169 | NB = NBloss(count, output, eps = eps, mean = False) - tf.math.log(1.0 - pi + eps) 170 | 171 | count = tf.cast(count, tf.float32) 172 | mu = tf.cast(mu, tf.float32) 173 | 174 | theta = tf.math.minimum(theta, 1e6) 175 | 176 | zero_nb = tf.math.pow(theta/(theta + mu + eps), theta) 177 | zero_case = -tf.math.log(pi + ((1.0- pi) * zero_nb) + eps) 178 | final = tf.where(tf.less(count, 1e-8), zero_case, NB) 179 | 180 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 181 | 182 | return final 183 | 184 | 185 | def _nan2inf(x): 186 | """Function to replace nan entries in a Tensor with infinities. 187 | 188 | 189 | Arguments: 190 | ------------------------------------------------------------------ 191 | - x: `tf.Tensor`, Tensor of arbitrary shape. 192 | 193 | Returns: 194 | ------------------------------------------------------------------ 195 | - x': `tf.Tensor`, Tensor x with nan entries replaced by infinity. 196 | """ 197 | 198 | return tf.where(tf.math.is_nan(x), tf.zeros_like(x) + np.inf, x) 199 | 200 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | from scipy.sparse import issparse 4 | 5 | import scanpy as sc 6 | from anndata import AnnData 7 | 8 | 9 | def normalize_scanpy(adata, batch_key = None, n_high_var = 1000, LVG = True, 10 | normalize_samples = True, log_normalize = True, 11 | normalize_features = True): 12 | """ This function preprocesses the raw count data. 13 | 14 | 15 | Arguments: 16 | ------------------------------------------------------------------ 17 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). 
Rows correspond to cells and columns to genes. 18 | - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch. 19 | - n_high_var: `int`, integer specifying the number of genes to be idntified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable. 20 | - LVG: `bool`, Whether to retain and preprocess LVGs. 21 | - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell. 22 | - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count. 23 | - normalize_features: `bool`, If True, z-score normalize each gene's expression. 24 | 25 | Returns: 26 | ------------------------------------------------------------------ 27 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Contains preprocessed data. 28 | """ 29 | 30 | n, p = adata.shape 31 | sparsemode = issparse(adata.X) 32 | 33 | if batch_key is not None: 34 | batch = list(adata.obs[batch_key]) 35 | batch = convert_vector_to_encoding(batch) 36 | batch = np.asarray(batch) 37 | batch = batch.astype('float32') 38 | else: 39 | batch = np.ones((n,), dtype = 'float32') 40 | norm_by_batch = False 41 | 42 | sc.pp.filter_genes(adata, min_counts=1) 43 | sc.pp.filter_cells(adata, min_counts=1) 44 | 45 | count = adata.X.copy() 46 | 47 | if normalize_samples: 48 | out = sc.pp.normalize_total(adata, inplace = False) 49 | obs_ = adata.obs 50 | var_ = adata.var 51 | adata = None 52 | adata = AnnData(out['X']) 53 | adata.obs = obs_ 54 | adata.var = var_ 55 | 56 | size_factors = out['norm_factor'] / np.median(out['norm_factor']) 57 | out = None 58 | else: 59 | size_factors = np.ones((adata.shape[0], )) 60 | 61 | if not log_normalize: 62 | adata_ = adata.copy() 63 | 64 | sc.pp.log1p(adata) 65 | 66 | if n_high_var is not None: 67 | sc.pp.highly_variable_genes(adata, inplace = True, min_mean = 0.0125, max_mean = 3, min_disp = 0.5, 68 | n_bins = 20, n_top_genes = n_high_var, batch_key = batch_key) 69 | 70 | hvg = adata.var['highly_variable'].values 71 | 72 | if not log_normalize: 73 | adata = adata_.copy() 74 | 75 | else: 76 | hvg = [True] * adata.shape[1] 77 | 78 | if normalize_features: 79 | batch_list = np.unique(batch) 80 | 81 | if sparsemode: 82 | adata.X = adata.X.toarray() 83 | 84 | for batch_ in batch_list: 85 | indices = [x == batch_ for x in batch] 86 | sub_adata = adata[indices] 87 | 88 | sc.pp.scale(sub_adata) 89 | adata[indices] = sub_adata.X 90 | 91 | adata.layers["normalized input"] = adata.X 92 | adata.X = count 93 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 94 | 95 | else: 96 | if sparsemode: 97 | adata.layers["normalized input"] = adata.X.toarray() 98 | else: 99 | adata.layers["normalized input"] = adata.X 100 | 101 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 102 | 103 | if n_high_var is not None: 104 | del_keys = ['dispersions', 'dispersions_norm', 'highly_variable', 'highly_variable_intersection', 'highly_variable_nbatches', 'means'] 105 | del_keys = [x for x in del_keys if x in adata.var.keys()] 106 | adata.var = adata.var.drop(del_keys, axis = 1) 107 | 108 | y = np.unique(batch) 109 | num_batch = len(y) 110 | 111 | adata.obs['size factors'] = size_factors.astype('float32') 112 | adata.obs['batch'] = batch 113 | 
113 |     adata.uns['num_batch'] = num_batch
114 | 
115 |     if sparsemode:
116 |         adata.X = adata.X.toarray()
117 | 
118 |     if not LVG:
119 |         adata = adata[:, adata.var['Variance Type'] == 'HVG']
120 | 
121 |     return adata
122 | 
123 | 
124 | def build_dir(dir_path):
125 |     """ This function builds a directory if it does not exist.
126 | 
127 | 
128 |     Arguments:
129 |     ------------------------------------------------------------------
130 |     - dir_path: `str`, The directory to build. E.g. if dir_path = 'folder1/folder2/folder3', then this function will create the directory folder1 if it does not already exist. Then it creates folder1/folder2 if folder2 does not exist in folder1. Then it creates folder1/folder2/folder3 if folder3 does not exist in folder2.
131 |     """
132 | 
133 |     subdirs = [dir_path]
134 |     substring = dir_path
135 | 
136 |     while substring != '':
137 |         splt_dir = os.path.split(substring)
138 |         substring = splt_dir[0]
139 |         subdirs.append(substring)
140 | 
141 |     subdirs.pop()
142 |     subdirs = [x for x in subdirs if os.path.basename(x) != '..']
143 | 
144 |     n = len(subdirs)
145 |     subdirs = [subdirs[n - 1 - x] for x in range(n)]  # reorder from outermost to innermost directory
146 | 
147 |     for dir_ in subdirs:
148 |         if not os.path.isdir(dir_):
149 |             os.mkdir(dir_)
150 | 
151 | 
152 | def convert_string_to_encoding(string, vector_key):
153 |     """A function to convert a string to a numeric encoding.
154 | 
155 | 
156 |     Arguments:
157 |     ------------------------------------------------------------------
158 |     - string: `str`, The specific string to convert to a numeric encoding.
159 |     - vector_key: `np.ndarray`, Array of all possible values of string.
160 | 
161 |     Returns:
162 |     ------------------------------------------------------------------
163 |     - encoding: `int`, The integer encoding of string.
164 |     """
165 | 
166 |     return np.argwhere(vector_key == string)[0][0]
167 | 
168 | 
169 | def convert_vector_to_encoding(vector):
170 |     """A function to convert a vector of strings to a dense numeric encoding.
171 | 
172 | 
173 |     Arguments:
174 |     ------------------------------------------------------------------
175 |     - vector: `array_like`, The vector of strings to encode.
176 | 
177 |     Returns:
178 |     ------------------------------------------------------------------
179 |     - vector_num: `list`, A list containing the dense numeric encoding.
180 |     """
181 | 
182 |     vector_key = np.unique(vector)
183 |     vector_strings = list(vector)
184 |     vector_num = [convert_string_to_encoding(string, vector_key) for string in vector_strings]
185 | 
186 |     return vector_num
187 | 
188 | 
189 | def find_resolution(adata_, n_clusters, random):
190 |     """A function to find the Louvain resolution that corresponds to a prespecified number of clusters, if it exists.
191 | 
192 | 
193 |     Arguments:
194 |     ------------------------------------------------------------------
195 |     - adata_: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to low-dimensional features.
196 |     - n_clusters: `int`, Number of clusters.
197 |     - random: `int`, The random seed.
198 | 
199 |     Returns:
200 |     ------------------------------------------------------------------
201 |     - resolution: `float`, The resolution that gives n_clusters after running the Louvain clustering algorithm.
202 |     """
203 | 
204 |     obtained_clusters = -1
205 |     iteration = 0
206 |     resolutions = [0., 1000.]
207 | 
208 |     while obtained_clusters != n_clusters and iteration < 50:  # bisection search over the resolution interval
209 |         current_res = sum(resolutions)/2
210 |         adata = sc.tl.louvain(adata_, resolution = current_res, random_state = random, copy = True)
211 |         labels = adata.obs['louvain']
212 |         obtained_clusters = len(np.unique(labels))
213 | 
214 |         if obtained_clusters < n_clusters:
215 |             resolutions[0] = current_res
216 |         else:
217 |             resolutions[1] = current_res
218 | 
219 |         iteration = iteration + 1
220 | 
221 |     return current_res
222 | 
223 | 
--------------------------------------------------------------------------------
/build/lib/CarDEC/__init__.py:
--------------------------------------------------------------------------------
1 | from .CarDEC_API import CarDEC_API
--------------------------------------------------------------------------------
/dist/cardec-1.0.3-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/dist/cardec-1.0.3-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/cardec-1.0.3.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/dist/cardec-1.0.3.tar.gz
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 | 
4 | # In[ ]:
5 | 
6 | import setuptools
7 | 
8 | with open("README.md", "r") as fh:
9 |     long_description = fh.read()
10 | 
11 | setuptools.setup(
12 |     name="cardec",
13 |     version="1.0.3",
14 |     author="Justin Lakkis",
15 |     author_email="jlakks@gmail.com",
16 |     description="A deep learning method for joint batch correction, denoising, and clustering of single-cell RNA-seq data.",
17 |     long_description=long_description,
18 |     long_description_content_type="text/markdown",
19 |     url="https://github.com/jlakkis/CarDEC",
20 |     packages=setuptools.find_packages(),
21 |     classifiers=[
22 |         "Programming Language :: Python :: 3",
23 |         "License :: OSI Approved :: MIT License",
24 |         "Operating System :: OS Independent",
25 |     ],
26 |     install_requires=['numpy>=1.18.1', 'pandas>=1.0.1', 'scipy>=1.4.1', 'tensorflow>=2.0.1,<=2.3.1', 'scikit-learn>=0.22.2.post1', 'scanpy>=1.5.1', 'louvain>=0.6.1'],
27 |     python_requires='>=3.7',
28 | )
--------------------------------------------------------------------------------
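
For orientation, here is a minimal sketch (not part of the repository) of how the loss functions listed above (`NBloss` and `ZINBloss`) can be exercised on toy tensors in eager TensorFlow 2; the tensor values and shapes below are invented purely for illustration.

```
import tensorflow as tf

# NBloss, ZINBloss, and _nan2inf defined as in the listing above.

count = tf.constant([[0., 3., 1.], [2., 0., 5.]])        # raw counts, shape (n_obs, n_var)
mu    = tf.constant([[0.5, 2.8, 1.2], [1.9, 0.4, 4.6]])  # reconstructed mean
theta = tf.fill([2, 3], 10.0)                            # dispersion
pi    = tf.fill([2, 3], 0.1)                             # dropout (zero-inflation) probability

nb_per_entry = NBloss(count, (mu, theta), eps = 1e-10, mean = False)  # per-entry loss, shape (2, 3)
nb_mean = NBloss(count, (mu, theta), eps = 1e-10, mean = True)        # scalar, averaged over genes and cells
zinb_mean = ZINBloss(count, (mu, theta, pi))                          # scalar ZINB loss

print(float(nb_mean), float(zinb_mean))
```

Since `NBloss` only indexes `output[0]` and `output[1]`, a plain `(mu, theta)` tuple suffices for the NB case, while `ZINBloss` additionally reads the dropout probability from `output[2]`.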