├── CarDEC.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   ├── requires.txt
│   └── top_level.txt
├── CarDEC
│   ├── CarDEC_API.py
│   ├── CarDEC_MainModel.py
│   ├── CarDEC_SAE.py
│   ├── CarDEC_count_decoder.py
│   ├── CarDEC_dataloaders.py
│   ├── CarDEC_layers.py
│   ├── CarDEC_optimization.py
│   ├── CarDEC_utils.py
│   ├── __init__.py
│   └── __pycache__
│       ├── CarDEC_API.cpython-37.pyc
│       ├── CarDEC_MainModel.cpython-37.pyc
│       ├── CarDEC_SAE.cpython-37.pyc
│       ├── CarDEC_count_decoder.cpython-37.pyc
│       ├── CarDEC_dataloaders.cpython-37.pyc
│       ├── CarDEC_layers.cpython-37.pyc
│       ├── CarDEC_optimization.cpython-37.pyc
│       ├── CarDEC_utils.cpython-37.pyc
│       └── __init__.cpython-37.pyc
├── LICENSE.rtf
├── README.md
├── build
│   └── lib
│       └── CarDEC
│           ├── CarDEC_API.py
│           ├── CarDEC_MainModel.py
│           ├── CarDEC_SAE.py
│           ├── CarDEC_count_decoder.py
│           ├── CarDEC_dataloaders.py
│           ├── CarDEC_layers.py
│           ├── CarDEC_optimization.py
│           ├── CarDEC_utils.py
│           └── __init__.py
├── dist
│   ├── cardec-1.0.3-py3-none-any.whl
│   └── cardec-1.0.3.tar.gz
└── setup.py

/CarDEC.egg-info/PKG-INFO:
--------------------------------------------------------------------------------
1 | Metadata-Version: 2.1
2 | Name: cardec
3 | Version: 1.0.3
4 | Summary: A deep learning method for joint batch correction, denoising, and clustering of single-cell RNA-seq data.
5 | Home-page: https://github.com/jlakkis/CarDEC
6 | Author: Justin Lakkis
7 | Author-email: jlakks@gmail.com
8 | License: UNKNOWN
9 | Description: # CarDEC
10 |
11 | CarDEC (**C**ount **a**dapted **r**egularized **D**eep **E**mbedded **C**lustering) is a joint deep learning computational tool that is useful for analyses of single-cell RNA-seq data. CarDEC can be used to:
12 |
13 | 1. Correct for batch effect in the full gene expression space, allowing the investigator to remove batch effect from downstream analyses like pseudotime analysis and coexpression analysis. Batch correction is also possible in a low-dimensional embedding space.
14 | 2. Denoise gene expression.
15 | 3. Cluster cells.
16 |
17 | ## Reproducibility
18 |
19 | We described and introduced CarDEC in our [methodological paper](https://www.biorxiv.org/content/10.1101/2020.09.23.310003v1). To find code to reproduce the results we generated in that paper, please visit this separate [github repository](https://github.com/jlakkis/CarDEC_Codes), which provides all code (including that for other methods) necessary to reproduce our results.
20 |
21 | ## Installation
22 |
23 | The recommended installation procedure is as follows.
24 |
25 | 1. Install [Anaconda](https://www.anaconda.com/products/individual) if you do not already have it.
26 | 2. Create a conda environment, and then activate it as follows in terminal.
27 |
28 | ```
29 | $ conda create -n cardecenv
30 | $ conda activate cardecenv
31 | ```
32 |
33 | 3. Install an appropriate version of python.
34 |
35 | ```
36 | $ conda install python==3.7
37 | ```
38 |
39 | 4. Install nb_conda_kernels so that you can change python kernels in jupyter notebook.
40 |
41 | ```
42 | $ conda install nb_conda_kernels
43 | ```
44 |
45 | 5. Finally, install CarDEC.
46 |
47 | ```
48 | $ pip install CarDEC
49 | ```
50 |
51 | Now, to use CarDEC, always make sure you activate the environment in terminal first ("conda activate cardecenv"), and then run jupyter notebook. When you create a notebook to run CarDEC, make sure the active kernel is switched to "cardecenv".
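The snippet below sketches how the API calls defined in `CarDEC/CarDEC_API.py` fit together once the package is installed. It is only a sketch: the input file, `batch_key`, and `n_clusters` values are placeholders to adapt to your own data, and it assumes `CarDEC_API` is exposed by the package's `__init__.py`.

```
import scanpy as sc
from CarDEC import CarDEC_API

# Load an AnnData object of counts (placeholder path).
adata = sc.read_h5ad("my_dataset.h5ad")

# Preprocess and flag the 2000 most variable genes as HVGs; batch_key should
# name a column of adata.obs, or be None if all cells come from one batch.
cardec = CarDEC_API(adata, preprocess = True, batch_key = "batch", n_high_var = 2000, LVG = True)

# Build the model; n_clusters must be provided.
cardec.build_model(n_clusters = 10)

# Batch correction and denoising on the normalized scale; outputs (embeddings,
# cluster memberships, denoised layer) are added to cardec.dataset.
cardec.make_inference()

# Optional: denoise on the count scale with the negative binomial count decoder.
cardec.model_counts()
```

For a complete walkthrough, see the tutorial notebook linked in the Usage section below.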
52 |
53 | ## Usage
54 |
55 | A [tutorial jupyter notebook](https://drive.google.com/drive/folders/19VVOoq4XSdDFRZDou-VbTMyV2Na9z53O?usp=sharing), together with a dataset, is publicly downloadable.
56 |
57 | ## Software Requirements
58 |
59 | - Python >= 3.7
60 | - TensorFlow >= 2.0.1, <= 2.3.1
61 | - scikit-learn == 0.22.2.post1
62 | - scanpy == 1.5.1
63 | - louvain == 0.6.1
64 | - pandas == 1.0.1
65 | - scipy == 1.4.1
66 |
67 | ## Troubleshooting
68 |
69 | Installation on MacOS should be smooth. If installing on Windows Subsystem for Linux (WSL), the user must properly configure their g++ compiler to ensure that the louvain package can be built during installation. If the compiler is not properly configured, the user may encounter a deprecation error similar to the following.
70 |
71 | "DEPRECATION: Could not build wheels for louvain which do not use PEP 517. pip will fall back to legacy 'setup.py install' for these. pip 21.0 will remove support for this functionality. A possible replacement is to fix the wheel build issue reported above."
72 |
73 | To fix this error, try installing the libxml2-dev package.
74 | Platform: UNKNOWN
75 | Classifier: Programming Language :: Python :: 3
76 | Classifier: License :: OSI Approved :: MIT License
77 | Classifier: Operating System :: OS Independent
78 | Requires-Python: >=3.7
79 | Description-Content-Type: text/markdown
80 |
--------------------------------------------------------------------------------
/CarDEC.egg-info/SOURCES.txt:
--------------------------------------------------------------------------------
1 | LICENSE.rtf
2 | README.md
3 | setup.py
4 | CarDEC/CarDEC_API.py
5 | CarDEC/CarDEC_MainModel.py
6 | CarDEC/CarDEC_SAE.py
7 | CarDEC/CarDEC_count_decoder.py
8 | CarDEC/CarDEC_dataloaders.py
9 | CarDEC/CarDEC_layers.py
10 | CarDEC/CarDEC_optimization.py
11 | CarDEC/CarDEC_utils.py
12 | CarDEC/__init__.py
13 | cardec.egg-info/PKG-INFO
14 | cardec.egg-info/SOURCES.txt
15 | cardec.egg-info/dependency_links.txt
16 | cardec.egg-info/requires.txt
17 | cardec.egg-info/top_level.txt
--------------------------------------------------------------------------------
/CarDEC.egg-info/dependency_links.txt:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/CarDEC.egg-info/requires.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.18.1
2 | pandas>=1.0.1
3 | scipy>=1.4.1
4 | tensorflow<=2.3.1,>=2.0.1
5 | scikit-learn>=0.22.2.post1
6 | scanpy>=1.5.1
7 | louvain>=0.6.1
--------------------------------------------------------------------------------
/CarDEC.egg-info/top_level.txt:
--------------------------------------------------------------------------------
1 | CarDEC
2 |
--------------------------------------------------------------------------------
/CarDEC/CarDEC_API.py:
--------------------------------------------------------------------------------
1 | from .CarDEC_utils import normalize_scanpy
2 | from .CarDEC_MainModel import CarDEC_Model
3 | from .CarDEC_count_decoder import count_model
4 |
5 | import tensorflow as tf
6 | from tensorflow.keras.optimizers import Adam
7 | import numpy as np
8 | from pandas import DataFrame
9 |
10 | import os
11 |
12 | class CarDEC_API:
13 |     def __init__(self, adata, preprocess=True, weights_dir = "CarDEC Weights", batch_key = None, n_high_var = 2000, LVG = True,
14 |                  normalize_samples = True, log_normalize = True, normalize_features = True):
15 |         """ Main CarDEC API the user can use to conduct batch correction and denoising experiments.
16 |
17 |
18 |         Arguments:
19 |         ------------------------------------------------------------------
20 |         - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes.
21 |         - preprocess: `bool`, If True, then preprocess the data.
22 |         - weights_dir: `str`, the path in which to save the weights of the CarDEC model.
23 |         - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch.
24 |         - n_high_var: `int`, integer specifying the number of genes to be identified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable.
25 |         - LVG: `bool`, If True, also model LVGs. Otherwise, only model HVGs.
26 |         - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell.
27 |         - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count.
28 |         - normalize_features: `bool`, If True, z-score normalize each gene's expression.
29 |         """
30 |
31 |         if n_high_var is None:
32 |             n_high_var = None
33 |             LVG = False
34 |
35 |         self.weights_dir = weights_dir
36 |         self.LVG = LVG
37 |
38 |         self.norm_args = (batch_key, n_high_var, LVG, normalize_samples, log_normalize, normalize_features)
39 |
40 |         if preprocess:
41 |             self.dataset = normalize_scanpy(adata, *self.norm_args)
42 |         else:
43 |             assert 'Variance Type' in adata.var.keys()
44 |             assert 'normalized input' in adata.layers
45 |             self.dataset = adata
46 |
47 |         self.loaded = False
48 |         self.count_loaded = False
49 |
50 |     def build_model(self, load_fullmodel = True, dims = [128, 32], LVG_dims = [128, 32], tol = 0.005, n_clusters = None,
51 |                     random_seed = 201809, louvain_seed = 0, n_neighbors = 15, pretrain_epochs = 2000, batch_size_pretrain = 64,
52 |                     act = 'relu', actincenter = "tanh", ae_lr = 1e-04, ae_decay_factor = 1/3, ae_patience_LR = 3,
53 |                     ae_patience_ES = 9, clust_weight = 1., load_encoder_weights = True):
54 |         """ Initializes the main CarDEC model.
55 |
56 |
57 |         Arguments:
58 |         ------------------------------------------------------------------
59 |         - load_fullmodel: `bool`, If True, the API will try to load the weights for the full model from the weight directory.
60 |         - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers.
61 |         - LVG_dims: `list`, the number of output features for each layer of the LVG encoder. The length of the list determines the number of layers.
62 |         - tol: `float`, stop criterion; the clustering procedure will be stopped once the fraction of cells that change cluster assignment between iterations falls below tol.
63 |         - n_clusters: `int`, The number of clusters into which cells will be grouped.
64 |         - random_seed: `int`, The seed used for random weight initialization.
65 |         - louvain_seed: `int`, The seed used for louvain clustering initialization.
66 |         - n_neighbors: `int`, The number of neighbors used for building the graph needed for louvain clustering.
67 |         - pretrain_epochs: `int`, The maximum number of epochs for pretraining the HVG autoencoder.
In practice, early stopping criteria should stop training much earlier. 68 | - batch_size_pretrain: `int`, The batch size used for pretraining the HVG autoencoder. 69 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 70 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 71 | - ae_lr: `float`, The learning rate for pretraining the HVG autoencoder. 72 | - ae_decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 73 | - ae_patience_LR: `int`, the number of epochs which the validation loss is allowed to increase before learning rate is decayed when pretraining the autoencoder. 74 | - ae_patience_ES: `int`, the number of epochs which the validation loss is allowed to increase before training is halted when pretraining the autoencoder. 75 | - clust_weight: `float`, a number between 0 and 2 qhich balances the clustering and reconstruction losses. 76 | - load_encoder_weights: `bool`, If True, the API will try to load the weights for the HVG encoder from the weight directory. 77 | """ 78 | 79 | assert n_clusters is not None 80 | 81 | if 'normalized input' not in list(self.dataset.layers): 82 | self.dataset = normalize_scanpy(self.dataset, *self.norm_args) 83 | 84 | p = sum(self.dataset.var["Variance Type"] == 'HVG') 85 | self.dims = [p] + dims 86 | 87 | if self.LVG: 88 | LVG_p = sum(self.dataset.var["Variance Type"] == 'LVG') 89 | self.LVG_dims = [LVG_p] + LVG_dims 90 | else: 91 | self.LVG_dims = None 92 | 93 | self.load_fullmodel = load_fullmodel 94 | self.weights_exist = os.path.isfile("./" + self.weights_dir + "/tuned_CarDECweights.index") 95 | 96 | set_centroids = not (self.load_fullmodel and self.weights_exist) 97 | 98 | self.model = CarDEC_Model(self.dataset, self.dims, self.LVG_dims, tol, n_clusters, random_seed, louvain_seed, 99 | n_neighbors, pretrain_epochs, batch_size_pretrain, ae_decay_factor, 100 | ae_patience_LR, ae_patience_ES, act, actincenter, ae_lr, 101 | clust_weight, load_encoder_weights, set_centroids, self.weights_dir) 102 | 103 | def make_inference(self, batch_size = 64, val_split = 0.1, lr = 1e-04, decay_factor = 1/3, 104 | iteration_patience_LR = 3, iteration_patience_ES = 6, maxiter = 1e3, epochs_fit = 1, 105 | optimizer = Adam(), printperiter = None, denoise_all = True, denoise_list = None): 106 | """ This class method makes inference on the data (batch correction + denoising) with the main CarDEC model 107 | 108 | 109 | Arguments: 110 | ------------------------------------------------------------------ 111 | - batch_size: `int`, The batch size used for training the full model. 112 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 113 | - lr: `float`, The learning rate for training the full model. 114 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 115 | - iteration_patience_LR: `int`, The number of iterations tolerated before decaying the learning rate during which the number of cells that change assignment is less than tol. 116 | - iteration_patience_ES: `int`, The number of iterations tolerated before stopping training during which the number of cells that change assignment is less than tol. 117 | - maxiter: `int`, The maximum number of iterations allowed to train the full model. In practice, the model will halt training long before hitting this limit. 
118 | - epochs_fit: `int`, The number of epochs during which to fine-tune weights, before updating the target distribution. 119 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 120 | - printperiter: `int`, Optional integer argument. If specified, denoised values will be returned every printperiter epochs, so that the user can evaluate the progress of denoising as training continues. 121 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 122 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 123 | 124 | Returns: 125 | ------------------------------------------------------------------ 126 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression provided only for listed cells. If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 127 | """ 128 | 129 | if denoise_list is not None: 130 | denoise_all = False 131 | 132 | if not self.loaded: 133 | if self.load_fullmodel and self.weights_exist: 134 | self.dataset = self.model.reload_model(self.dataset, batch_size, denoise_all) 135 | 136 | elif not self.weights_exist: 137 | print("CarDEC Model Weights not detected. Training full model.\n") 138 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 139 | iteration_patience_LR, iteration_patience_ES, maxiter, 140 | epochs_fit, optimizer, printperiter, denoise_all) 141 | 142 | else: 143 | print("Training full model.\n") 144 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 145 | iteration_patience_LR, iteration_patience_ES, 146 | maxiter, epochs_fit, optimizer, printperiter, denoise_all) 147 | 148 | 149 | self.loaded = True 150 | 151 | elif denoise_all: 152 | self.dataset = self.model.make_outputs(self.dataset, batch_size, True) 153 | 154 | if denoise_list is not None: 155 | denoise_list = list(denoise_list) 156 | indices = [x in denoise_list for x in self.dataset.obs.index] 157 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 158 | denoised.index = self.dataset.obs.index[indices] 159 | denoised.columns = self.dataset.var.index 160 | 161 | 162 | if self.LVG: 163 | hvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"][indices]) 164 | lvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["LVG embedding"][indices]) 165 | 166 | input_ds = tf.data.Dataset.zip((hvg_ds, lvg_ds)) 167 | input_ds = input_ds.batch(batch_size) 168 | 169 | start = 0 170 | for x in input_ds: 171 | denoised_batch = {'HVG_denoised': self.model.decoder(x[0]), 'LVG_denoised': self.model.decoderLVG(x[1])} 172 | q_batch = self.model.clustering_layer(x[0]) 173 | end = start + q_batch.shape[0] 174 | 175 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'HVG')[0]] = denoised_batch['HVG_denoised'].numpy() 176 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'LVG')[0]] = denoised_batch['LVG_denoised'].numpy() 177 | 178 | start = end 179 | 180 | else: 181 | input_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"]) 182 | 183 | input_ds = input_ds.batch(batch_size) 184 | 185 | start = 0 186 | 187 | for x in input_ds: 188 | denoised_batch = {'HVG_denoised': self.model.decoder(x)} 189 | q_batch = self.model.clustering_layer(x) 
190 | end = start + q_batch.shape[0] 191 | 192 | denoised.iloc[start:end] = denoised_batch['HVG_denoised'].numpy() 193 | 194 | start = end 195 | 196 | return denoised 197 | 198 | print(" ") 199 | 200 | def model_counts(self, load_weights = True, act = 'relu', random_seed = 201809, 201 | optimizer = Adam(), keep_dispersion = False, num_epochs = 2000, batch_size_count = 64, 202 | val_split = 0.1, lr = 1e-03, decay_factor = 1/3, patience_LR = 3, patience_ES = 9, 203 | denoise_all = True, denoise_list = None): 204 | """ This class method makes inference on the data on the count scale. 205 | 206 | 207 | Arguments: 208 | ------------------------------------------------------------------ 209 | - load_weights: `bool`, If true, the API will attempt to load the weights for the count model. 210 | - act: `str`, A string specifying the activation function for intermediate layers of the count models. 211 | - random_seed: `int`, A seed used for weight initialization. 212 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 213 | - keep_dispersion: `bool`, If True, the gene, cell dispersions will be returned as well. 214 | - num_epochs: `int`, The maximum number of epochs allowed to train each count model. In practice, the model will halt 215 | training long before hitting this limit. 216 | - batch_size_count: `int`, The batch size used for training the count models. 217 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 218 | - lr: `float`, The learning rate for training the count models. 219 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 220 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss does not decrease. 221 | - patience_ES: `int`, The number of iterations tolerated before stopping training during which the validation loss does not decrease. 222 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 223 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 224 | 225 | Returns: 226 | ------------------------------------------------------------------ 227 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression on the count scale provided only for listed cells. If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 228 | - denoised_dispersion: `pd.DataFrame`, (Optional) If denoise_list was specified and "keep_dispersion" was set to True, then this will be an array of dispersions from the fitted negative binomial model provided only for listed cells. If denoise_all was instead specified as False, but "keep_dispersion" was still True then dispersions for all cells will be added as a layer to adata. 
229 | """ 230 | 231 | if denoise_list is not None: 232 | denoise_all = False 233 | 234 | if not self.count_loaded: 235 | weights_dir = os.path.join(self.weights_dir, 'count weights') 236 | weight_files_exist = os.path.isfile(weights_dir + "/countmodel_weights_HVG Count.index") 237 | if self.LVG: 238 | weight_files_exist = weight_files_exist and os.path.isfile(weights_dir + "/countmodel_weights_LVG Count.index") 239 | 240 | init_args = (act, random_seed, self.model.splitseed, optimizer, weights_dir) 241 | train_args = (num_epochs, batch_size_count, val_split, lr, decay_factor, patience_LR, patience_ES) 242 | 243 | self.nbmodel = count_model(self.dims, *init_args, n_features = self.dims[-1], mode = 'HVG') 244 | 245 | if load_weights and weight_files_exist: 246 | print("Weight files for count models detected, loading weights.") 247 | self.nbmodel.load_model() 248 | 249 | elif load_weights: 250 | print("Weight files for count models not detected. Training HVG count model.\n") 251 | self.nbmodel.train(self.dataset, *train_args) 252 | 253 | else: 254 | print("Training HVG count model.\n") 255 | self.nbmodel.train(self.dataset, *train_args) 256 | 257 | if self.LVG: 258 | self.nbmodel_lvg = count_model(self.LVG_dims, *init_args, 259 | n_features = self.dims[-1] + self.LVG_dims[-1], mode = 'LVG') 260 | 261 | if load_weights and weight_files_exist: 262 | self.nbmodel_lvg.load_model() 263 | print("Count model weights loaded successfully.") 264 | 265 | elif load_weights: 266 | print("\n \n \n") 267 | print("Training LVG count model.\n") 268 | self.nbmodel_lvg.train(self.dataset, *train_args) 269 | 270 | else: 271 | print("\n \n \n") 272 | print("Training LVG count model.\n") 273 | self.nbmodel_lvg.train(self.dataset, *train_args) 274 | 275 | self.count_loaded = True 276 | 277 | if denoise_all: 278 | self.nbmodel.denoise(self.dataset, keep_dispersion, batch_size_count) 279 | if self.LVG: 280 | self.nbmodel_lvg.denoise(self.dataset, keep_dispersion, batch_size_count) 281 | 282 | elif denoise_list is not None: 283 | denoise_list = list(denoise_list) 284 | indices = [x in denoise_list for x in self.dataset.obs.index] 285 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 286 | denoised.index = self.dataset.obs.index[indices] 287 | denoised.columns = self.dataset.var.index 288 | if keep_dispersion: 289 | denoised_dispersion = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 290 | denoised_dispersion.index = self.dataset.obs.index[indices] 291 | denoised_dispersion.columns = self.dataset.var.index 292 | 293 | input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['embedding'][indices]) 294 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 295 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 296 | input_ds = input_ds.batch(batch_size_count) 297 | 298 | type_indices = np.where(self.dataset.var['Variance Type'] == 'HVG')[0] 299 | 300 | if not keep_dispersion: 301 | start = 0 302 | for x in input_ds: 303 | end = start + x[0].shape[0] 304 | denoised.iloc[start:end, type_indices] = self.nbmodel(*x)[0].numpy() 305 | start = end 306 | 307 | else: 308 | start = 0 309 | for x in input_ds: 310 | end = start + x[0].shape[0] 311 | batch_output = self.nbmodel(*x) 312 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 313 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 314 | start = end 315 | 316 | if self.LVG: 317 | 
input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['LVG embedding'][indices]) 318 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 319 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 320 | input_ds = input_ds.batch(batch_size_count) 321 | 322 | type_indices = np.where(self.dataset.var['Variance Type'] == 'LVG')[0] 323 | 324 | if not keep_dispersion: 325 | start = 0 326 | for x in input_ds: 327 | end = start + x[0].shape[0] 328 | denoised.iloc[start:end, type_indices] = self.nbmodel_lvg(*x)[0].numpy() 329 | start = end 330 | 331 | else: 332 | start = 0 333 | for x in input_ds: 334 | end = start + x[0].shape[0] 335 | batch_output = self.nbmodel_lvg(*x) 336 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 337 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 338 | start = end 339 | 340 | if not keep_dispersion: 341 | return denoised 342 | else: 343 | return denoised, denoised_dispersion 344 | 345 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_MainModel.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_SAE import SAE 2 | from .CarDEC_utils import build_dir, find_resolution 3 | from .CarDEC_layers import ClusteringLayer 4 | from .CarDEC_optimization import grad_MainModel as grad, total_loss, MSEloss 5 | from .CarDEC_dataloaders import simpleloader, dataloader, tupleloader 6 | 7 | import tensorflow as tf 8 | from tensorflow.keras import Model, Sequential 9 | from tensorflow.keras.layers import Dense, concatenate 10 | from tensorflow.keras.optimizers import Adam 11 | from tensorflow.keras.backend import set_floatx 12 | 13 | from sklearn.cluster import KMeans 14 | 15 | import scanpy as sc 16 | from anndata import AnnData 17 | import pandas as pd 18 | 19 | import random 20 | import numpy as np 21 | from math import ceil 22 | 23 | import os 24 | from copy import deepcopy 25 | from time import time 26 | 27 | set_floatx('float32') 28 | 29 | class CarDEC_Model(Model): 30 | def __init__(self, adata, dims, LVG_dims = None, tol = 0.005, n_clusters = None, random_seed = 201809, 31 | louvain_seed = 0, n_neighbors = 15, pretrain_epochs = 300, batch_size = 64, decay_factor = 1/3, 32 | patience_LR = 3, patience_ES = 9, act = 'relu', actincenter = "tanh", ae_lr = 1e-04, clust_weight = 1., 33 | load_encoder_weights = True, set_centroids = True, weights_dir = "CarDEC Weights"): 34 | super(CarDEC_Model, self).__init__() 35 | """ This class creates the TensorFlow CarDEC model architecture. 36 | 37 | 38 | Arguments: 39 | ------------------------------------------------------------------ 40 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 41 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 42 | - LVG_dims: `list`, the number of output features for each layer of the LVG encoder. The length of the list determines the number of layers. 43 | - tol: `float`, stop criterion, clustering procedure will be stopped when the difference ratio between the current iteration and last iteration larger than tol. 44 | - n_clusters: `int`, The number of clusters into which cells will be grouped. 45 | - random_seed: `int`, The seed used for random weight intialization. 
46 | - louvain_seed: `int`, The seed used for louvain clustering intialization. 47 | - n_neighbors: `int`, The number of neighbors used for building the graph needed for louvain clustering. 48 | - pretrain_epochs: `int`, The maximum number of epochs for pretraining the HVG autoencoder. In practice, early stopping criteria should stop training much earlier. 49 | - batch_size: `int`, The batch size used for pretraining the HVG autoencoder. 50 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 51 | - patience_LR: `int`, the number of epochs which the validation loss is allowed to increase before learning rate is decayed when pretraining the autoencoder. 52 | - patience_ES: `int`, the number of epochs which the validation loss is allowed to increase before training is halted when pretraining the autoencoder. 53 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 54 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 55 | - ae_lr: `float`, The learning rate for pretraining the HVG autoencoder. 56 | - clust_weight: `float`, a number between 0 and 2 qhich balances the clustering and reconstruction losses. 57 | - load_encoder_weights: `bool`, If True, the API will try to load the weights for the HVG encoder from the weight directory. 58 | - set_centroids: `bool`, If True, intialize the centroids by running Louvain's algorithm. 59 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 60 | ------------------------------------------------------------------ 61 | """ 62 | 63 | assert clust_weight <= 2. and clust_weight>=0. 64 | 65 | tf.keras.backend.clear_session() 66 | 67 | self.dims = dims 68 | self.LVG_dims = LVG_dims 69 | self.tol = tol 70 | self.input_dim = dims[0] # for clustering layer 71 | self.n_stacks = len(self.dims) - 1 72 | self.n_neighbors = n_neighbors 73 | self.batch_size = batch_size 74 | self.random_seed = random_seed 75 | self.activation = act 76 | self.actincenter = actincenter 77 | self.load_encoder_weights = load_encoder_weights 78 | self.clust_weight = clust_weight 79 | self.weights_dir = weights_dir 80 | self.preclust_embedding = None 81 | 82 | # set random seed 83 | random.seed(random_seed) 84 | np.random.seed(random_seed) 85 | tf.random.set_seed(random_seed) 86 | self.splitseed = round(abs(10000*np.random.randn())) 87 | 88 | # build the autoencoder 89 | self.sae = SAE(dims = self.dims, act = self.activation, actincenter = self.actincenter, 90 | random_seed = random_seed, splitseed = self.splitseed, init="glorot_uniform", optimizer = Adam(), 91 | weights_dir = weights_dir) 92 | 93 | build_dir(self.weights_dir) 94 | 95 | decoder_seed = round(100 * abs(np.random.normal())) 96 | if load_encoder_weights: 97 | if os.path.isfile("./" + self.weights_dir + "/pretrained_autoencoder_weights.index"): 98 | print("Pretrain weight index file detected, loading weights.") 99 | self.sae.load_autoencoder() 100 | print("Pretrained high variance autoencoder weights initialized.") 101 | else: 102 | print("Pretrain weight index file not detected, pretraining autoencoder weights.\n") 103 | self.sae.train(adata, lr = ae_lr, num_epochs = pretrain_epochs, 104 | batch_size = batch_size, decay_factor = decay_factor, 105 | patience_LR = patience_LR, patience_ES = patience_ES) 106 | self.sae.load_autoencoder() 107 | else: 108 | print("Pre-training high variance autoencoder.\n") 109 | 
self.sae.train(adata, lr = ae_lr, num_epochs = pretrain_epochs, 110 | batch_size = batch_size, decay_factor = decay_factor, 111 | patience_LR = patience_LR, patience_ES = patience_ES) 112 | self.sae.load_autoencoder() 113 | 114 | features = self.sae.embed(adata) 115 | self.preclust_emb = deepcopy(features) 116 | self.preclust_denoised = self.sae.denoise(adata, batch_size) 117 | 118 | if not set_centroids: 119 | self.init_centroid = np.zeros((n_clusters, self.dims[-1]), dtype = 'float32') 120 | self.n_clusters = n_clusters 121 | self.init_pred = np.zeros((adata.shape[0], dims[-1])) 122 | 123 | elif louvain_seed is None: 124 | print("\nInitializing cluster centroids using K-Means") 125 | 126 | kmeans = KMeans(n_clusters=n_clusters, n_init = 20) 127 | Y_pred_init = kmeans.fit_predict(features) 128 | 129 | self.init_pred = deepcopy(Y_pred_init) 130 | self.n_clusters = n_clusters 131 | self.init_centroid = kmeans.cluster_centers_ 132 | 133 | else: 134 | print("\nInitializing cluster centroids using the louvain method.") 135 | 136 | n_cells = features.shape[0] 137 | 138 | if n_cells > 10**5: 139 | subset = np.random.choice(range(n_cells), 10**5, replace = False) 140 | adata0 = AnnData(features[subset]) 141 | else: 142 | adata0 = AnnData(features) 143 | 144 | sc.pp.neighbors(adata0, n_neighbors = self.n_neighbors, use_rep="X") 145 | self.resolution = find_resolution(adata0, n_clusters, louvain_seed) 146 | adata0 = sc.tl.louvain(adata0, resolution = self.resolution, random_state = louvain_seed, copy = True) 147 | 148 | Y_pred_init = adata0.obs['louvain'] 149 | self.init_pred = np.asarray(Y_pred_init, dtype=int) 150 | 151 | features = pd.DataFrame(adata0.X, index = np.arange(0, adata0.shape[0])) 152 | Group = pd.Series(self.init_pred, index = np.arange(0, adata0.shape[0]), name="Group") 153 | Mergefeature = pd.concat([features, Group],axis=1) 154 | 155 | self.init_centroid = np.asarray(Mergefeature.groupby("Group").mean()) 156 | self.n_clusters = self.init_centroid.shape[0] 157 | 158 | print("\n " + str(self.n_clusters) + " clusters detected. \n") 159 | 160 | self.encoder = self.sae.encoder 161 | self.decoder = self.sae.decoder 162 | 163 | if LVG_dims is not None: 164 | n_stacks = len(dims) - 1 165 | 166 | LVG_encoder_layers = [] 167 | 168 | for i in range(n_stacks-1): 169 | LVG_encoder_layers.append(Dense(LVG_dims[i + 1], kernel_initializer = 'glorot_uniform', activation = self.activation, name='encoder%d' % i)) 170 | 171 | LVG_encoder_layers.append(Dense(LVG_dims[-1], kernel_initializer = 'glorot_uniform', activation = self.actincenter, name='embedding')) 172 | self.encoderLVG = Sequential(LVG_encoder_layers, name = 'encoderLVG') 173 | 174 | if LVG_dims is not None: 175 | decoder_layers = [] 176 | for i in range(self.n_stacks - 1, 0, -1): 177 | decoder_layers.append(Dense(self.LVG_dims[i], kernel_initializer = 'glorot_uniform', 178 | activation = self.activation, name='decoderLVG%d' % (i-1))) 179 | 180 | decoder_layers.append(Dense(self.LVG_dims[0], activation = 'linear', name='outputLVG')) 181 | self.decoderLVG = Sequential(decoder_layers, name = 'decoderLVG') 182 | 183 | self.clustering_layer = ClusteringLayer(centroids = self.init_centroid, name = 'clustering') 184 | 185 | del self.sae 186 | 187 | self.construct() 188 | 189 | def construct(self, summarize = True): 190 | """ This class method fully initalizes the TensorFlow model. 
191 | 192 | 193 | Arguments: 194 | ------------------------------------------------------------------ 195 | - summarize: `bool`, If True, then print a summary of the model architecture. 196 | """ 197 | 198 | x = [tf.zeros(shape = (1, self.dims[0]), dtype=float), None] 199 | if self.LVG_dims is not None: 200 | x[1] = tf.zeros(shape = (1, self.LVG_dims[0]), dtype=float) 201 | 202 | out = self(*x) 203 | 204 | if summarize: 205 | print("\n-----------------------CarDEC Architecture-----------------------\n") 206 | self.summary() 207 | 208 | print("\n--------------------Encoder Sub-Architecture--------------------\n") 209 | self.encoder.summary() 210 | 211 | print("\n------------------Base Decoder Sub-Architecture------------------\n") 212 | self.decoder.summary() 213 | 214 | if self.LVG_dims is not None: 215 | print("\n------------------LVG Encoder Sub-Architecture------------------\n") 216 | self.encoderLVG.summary() 217 | 218 | print("\n----------------LVG Base Decoder Sub-Architecture----------------\n") 219 | self.decoderLVG.summary() 220 | 221 | def call(self, hvg, lvg, denoise = True): 222 | """ This is the forward pass of the model. 223 | 224 | 225 | ***Inputs*** 226 | - hvg: `tf.Tensor`, an input tensor of shape (n_obs, n_HVG). 227 | - lvg: `tf.Tensor`, (Optional) an input tensor of shape (n_obs, n_LVG). 228 | - denoise: `bool`, (Optional) If True, return denoised expression values for each cell. 229 | 230 | ***Outputs*** 231 | - denoised_output: `dict`, (Optional) Dictionary containing denoised tensors. 232 | - cluster_output: `tf.Tensor`, a tensor of cell cluster membership probabilities of shape (n_obs, m). 233 | """ 234 | 235 | hvg = self.encoder(hvg) 236 | 237 | cluster_output = self.clustering_layer(hvg) 238 | 239 | if not denoise: 240 | return cluster_output 241 | 242 | HVG_denoised_output = self.decoder(hvg) 243 | denoised_output = {'HVG_denoised': HVG_denoised_output} 244 | 245 | if self.LVG_dims is not None: 246 | lvg = self.encoderLVG(lvg) 247 | z = concatenate([hvg, lvg], axis=1) 248 | 249 | LVG_denoised_output = self.decoderLVG(z) 250 | 251 | denoised_output['LVG_denoised'] = LVG_denoised_output 252 | 253 | return denoised_output, cluster_output 254 | 255 | @staticmethod 256 | def target_distribution(q): 257 | """ Updates target distribution cluster assignment probabilities given CarDEC output. 258 | 259 | 260 | Arguments: 261 | ------------------------------------------------------------------ 262 | - q: `tf.Tensor`, a tensor of shape (b, m) identifying the probability that each of b cells is in each of the m clusters. Obtained as output from CarDEC. 263 | 264 | Returns: 265 | ------------------------------------------------------------------ 266 | - p: `tf.Tensor`, a tensor of shape (b, m) identifying the pseudo-label probability that each of b cells is in each of the m clusters. 267 | """ 268 | 269 | weight = q ** 2 / np.sum(q, axis = 0) 270 | p = weight.T / np.sum(weight, axis = 1) 271 | return p.T 272 | 273 | def make_generators(self, adata, val_split, batch_size): 274 | """ This class method creates training and validation data generators for the current input data and pseudo labels. 275 | 276 | 277 | Arguments: 278 | ------------------------------------------------------------------ 279 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 280 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 
281 | - batch_size: `int`, The batch size used for training the full model. 282 | - p: `tf.Tensor`, a tensor of shape (b, m) identifying the pseudo-label probability that each of b cells is in each of the m clusters. 283 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between iterations to ensure the same cells are always used for validation. 284 | - newseed: `int`, The seed that is set after splitting cells between training and validation. Should be different every iteration so that stochastic operations other than splitting cells between training and validation vary between epochs. 285 | 286 | Returns: 287 | ------------------------------------------------------------------ 288 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 289 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 290 | """ 291 | 292 | if self.LVG_dims is None: 293 | hvg_input = adata.layers["normalized input"] 294 | hvg_target = adata.layers["normalized input"] 295 | lvg_input = None 296 | lvg_target = None 297 | else: 298 | hvg_input = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'] 299 | hvg_target = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'] 300 | lvg_input = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'LVG'] 301 | lvg_target = adata.layers["normalized input"][:, adata.var['Variance Type'] == 'LVG'] 302 | 303 | return dataloader(hvg_input, hvg_target, lvg_input, lvg_target, val_split, batch_size, self.splitseed) 304 | 305 | def train_loop(self, train_dataset): 306 | """ This class method runs the training loop. 307 | 308 | 309 | Arguments: 310 | ------------------------------------------------------------------ 311 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 312 | 313 | Returns: 314 | ------------------------------------------------------------------ 315 | - epoch_loss_avg: `float`, The mean training loss for the iteration. 316 | """ 317 | 318 | epoch_loss_avg = tf.keras.metrics.Mean() 319 | 320 | for inputs, target, LVG_target, batch_p in train_dataset(val = False): 321 | loss_value, grads = grad(self, inputs, target, batch_p, total_loss = total_loss, 322 | LVG_target = LVG_target, aeloss_fun = MSEloss, 323 | clust_weight = self.clust_weight) 324 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 325 | epoch_loss_avg(loss_value) 326 | 327 | return epoch_loss_avg.result() 328 | 329 | def validation_loop(self, val_dataset): 330 | """ This class method runs the validation loop. 331 | 332 | 333 | Arguments: 334 | ------------------------------------------------------------------ 335 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 
336 | 337 | Returns: 338 | ------------------------------------------------------------------ 339 | - epoch_loss_avg: `float`, The mean validation loss for the iteration (reconstruction + clustering loss) 340 | - epoch_aeloss_avg_val: `float`, The mean validation reconstruction loss for the iteration 341 | """ 342 | 343 | epoch_loss_avg_val = tf.keras.metrics.Mean() 344 | epoch_aeloss_avg_val = tf.keras.metrics.Mean() 345 | 346 | for inputs, target, LVG_target, batch_p in val_dataset(val = True): 347 | denoised_output, cluster_output = self(*inputs) 348 | loss_value, aeloss = total_loss(target, denoised_output, batch_p, cluster_output, 349 | LVG_target = LVG_target, aeloss_fun = MSEloss, clust_weight = self.clust_weight) 350 | epoch_loss_avg_val(loss_value) 351 | epoch_aeloss_avg_val(aeloss) 352 | 353 | return epoch_loss_avg_val.result(), epoch_aeloss_avg_val.result() 354 | 355 | def package_output(self, adata, init_pred, preclust_denoised, preclust_emb): 356 | """ This class adds some quantities to the adata object. 357 | 358 | 359 | Arguments: 360 | ------------------------------------------------------------------ 361 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 362 | - init_pred: `np.ndarray`, the array of initial cluster assignments for each cells, of shape (n_obs,). 363 | - preclust_denoised: `np.ndarray`, This is the array of feature zscores denoised with the pretrained autoencoder of shape (n_obs, n_vars). 364 | - preclust_emb: `np.ndarray`, This is the latent embedding from the pretrained autoencoder of shape (n_obs, n_embedding). 365 | """ 366 | 367 | adata.obsm['precluster denoised'] = preclust_denoised 368 | adata.obsm['precluster embedding'] = preclust_emb 369 | if adata.shape[0] == init_pred.shape[0]: 370 | adata.obsm['initial assignments'] = init_pred 371 | 372 | def embed(self, adata, batch_size): 373 | """ This class method can be used to compute the low-dimension embedding for HVG features. 374 | 375 | 376 | Arguments: 377 | ------------------------------------------------------------------ 378 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 379 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 380 | 381 | Returns: 382 | ------------------------------------------------------------------ 383 | - embedding: `np.ndarray`, Array of shape (n_obs, p_embedding) containing the HVG embedding for every cell in the dataset. 384 | """ 385 | 386 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 387 | 388 | embedding = np.zeros((adata.shape[0], self.dims[-1]), dtype = 'float32') 389 | start = 0 390 | 391 | for x in input_ds: 392 | end = start + x.shape[0] 393 | embedding[start:end] = self.encoder(x).numpy() 394 | start = end 395 | 396 | return embedding 397 | 398 | def embed_LVG(self, adata, batch_size): 399 | """ This class method can be used to compute the low-dimension embedding for LVG features. 400 | 401 | 402 | Arguments: 403 | ------------------------------------------------------------------ 404 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 405 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 
406 | 407 | Returns: 408 | ------------------------------------------------------------------ 409 | - embedding: `np.ndarray`, Array of shape (n_obs, n_embedding) containing the LVG embedding for every cell in the dataset. 410 | """ 411 | 412 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'LVG'], batch_size) 413 | 414 | LVG_embedded = np.zeros((adata.shape[0], self.LVG_dims[-1]), dtype = 'float32') 415 | start = 0 416 | 417 | for x in input_ds: 418 | end = start + x.shape[0] 419 | LVG_embedded[start:end] = self.encoderLVG(x).numpy() 420 | start = end 421 | 422 | return np.concatenate((adata.obsm['embedding'], LVG_embedded), axis = 1) 423 | 424 | def make_outputs(self, adata, batch_size, denoise = True): 425 | """ This class method can be used to pack all relvant outputs into the adata object after training. 426 | 427 | 428 | Arguments: 429 | ------------------------------------------------------------------ 430 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 431 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 432 | - denoise: `bool`, Whether to provide denoised expression values for all cells. 433 | """ 434 | 435 | if not denoise: 436 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 437 | adata.obsm["cluster memberships"] = np.zeros((adata.shape[0], self.n_clusters), dtype = 'float32') 438 | 439 | start = 0 440 | for x in input_ds: 441 | q_batch = self(x, None, False) 442 | end = start + q_batch.shape[0] 443 | adata.obsm["cluster memberships"][start:end] = q_batch.numpy() 444 | 445 | start = end 446 | 447 | 448 | elif self.LVG_dims is not None: 449 | if not ('embedding' in list(adata.obsm) and 'LVG embedding' in list(adata.obsm)): 450 | adata.obsm['embedding'] = self.embed(adata, batch_size) 451 | adata.obsm['LVG embedding'] = self.embed_LVG(adata, batch_size) 452 | input_ds = tupleloader(adata.obsm["embedding"], adata.obsm["LVG embedding"], batch_size = batch_size) 453 | 454 | adata.obsm["cluster memberships"] = np.zeros((adata.shape[0], self.n_clusters), dtype = 'float32') 455 | adata.layers["denoised"] = np.zeros(adata.shape, dtype = 'float32') 456 | 457 | start = 0 458 | for input_ in input_ds: 459 | denoised_batch = {'HVG_denoised': self.decoder(input_[0]), 'LVG_denoised': self.decoderLVG(input_[1])} 460 | q_batch = self.clustering_layer(input_[0]) 461 | end = start + q_batch.shape[0] 462 | 463 | adata.obsm["cluster memberships"][start:end] = q_batch.numpy() 464 | adata.layers["denoised"][start:end, adata.var['Variance Type'] == 'HVG'] = denoised_batch['HVG_denoised'].numpy() 465 | adata.layers["denoised"][start:end, adata.var['Variance Type'] == 'LVG'] = denoised_batch['LVG_denoised'].numpy() 466 | 467 | start = end 468 | 469 | else: 470 | if not ('embedding' in list(adata.obsm)): 471 | adata.obsm['embedding'] = self.embed(adata, batch_size) 472 | input_ds = simpleloader(adata.obsm["embedding"], batch_size) 473 | 474 | adata.obsm["cluster memberships"] = np.zeros((adata.shape[0], self.n_clusters), dtype = 'float32') 475 | adata.layers["denoised"] = np.zeros(adata.shape, dtype = 'float32') 476 | 477 | start = 0 478 | 479 | for input_ in input_ds: 480 | denoised_batch = {'HVG_denoised': self.decoder(input_)} 481 | q_batch = self.clustering_layer(input_) 482 | 483 | end = start + q_batch.shape[0] 484 | 485 | adata.obsm["cluster memberships"][start:end] = q_batch.numpy() 486 | 
adata.layers["denoised"][start:end] = denoised_batch['HVG_denoised'].numpy() 487 | 488 | start = end 489 | 490 | def train(self, adata, batch_size = 64, val_split = 0.1, lr = 1e-04, decay_factor = 1/3, 491 | iteration_patience_LR = 3, iteration_patience_ES = 6, 492 | maxiter = 1e3, epochs_fit = 1, optimizer = Adam(), printperiter = None, denoise = True): 493 | """ This class method can be used to train the main CarDEC model 494 | 495 | 496 | Arguments: 497 | ------------------------------------------------------------------ 498 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 499 | - batch_size: `int`, The batch size used for training the full model. 500 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 501 | - lr: `float`, The learning rate for training the full model. 502 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 503 | - iteration_patience_LR: `int`, The number of iterations tolerated before decaying the learning rate during which the number of cells that change assignment is less than tol. 504 | - iteration_patience_ES: `int`, The number of iterations tolerated before stopping training during which the number of cells that change assignment is less than tol. 505 | - maxiter: `int`, The maximum number of iterations allowed to train the full model. In practice, the model will halt training long before hitting this limit. 506 | - epochs_fit: `int`, The number of epochs during which to fine-tune weights, before updating the target distribution. 507 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 508 | - printperiter: `int`, Optional integer argument. If specified, denoised values will be returned every printperiter epochs, so that the user can evaluate the progress of denoising as training continues. 509 | - denoise: `bool`, If True, then denoised expression values are provided for all cells. 510 | 511 | Returns: 512 | ------------------------------------------------------------------ 513 | - adata: `anndata.AnnData`, The updated annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. Depending on the arguments of the train call, some outputs will be added to adata. 514 | """ 515 | 516 | total_start = time() 517 | seedlist = list(1000*np.random.randn(int(maxiter))) 518 | seedlist = [abs(int(x)) for x in seedlist] 519 | 520 | self.optimizer = optimizer 521 | self.optimizer.lr = lr 522 | 523 | # Begin deep clustering 524 | y_pred_last = np.ones((adata.shape[0],), dtype = int) * -1. 
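# The loop below implements the self-training ("deep clustering") procedure: each
# outer iteration (i) sharpens the current soft memberships q stored in
# adata.obsm['cluster memberships'] into a target distribution p via
# target_distribution(), i.e.
#     p_ij = (q_ij**2 / f_j) / sum_k(q_ik**2 / f_k),  where f_j = sum_i q_ij,
# (ii) fits the network for `epochs_fit` epochs against p plus the reconstruction
# loss, (iii) recomputes q and the hard cluster assignments, and (iv) stops once
# the validation reconstruction loss has stopped improving and the fraction of
# cells that changed assignment (delta_label) drops below self.tol.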
525 |
526 |         min_delta = np.inf
527 |         current_aeloss_val = np.inf
528 |         delta_patience_ES = 0
529 |         delta_patience_LR = 0
530 |         delta_stop = False
531 |
532 |         dataset = self.make_generators(adata, val_split = 0.1, batch_size = batch_size)
533 |
534 |         self.make_outputs(adata, batch_size, denoise = printperiter is not None)
535 |
536 |         for ite in range(int(maxiter)):
537 |
538 |             p = self.target_distribution(adata.obsm['cluster memberships'])
539 |
540 |             dataset.update_p(p)
541 |
542 |             best_loss = np.inf
543 |             iter_start = time()
544 |
545 |             for epoch in range(epochs_fit):
546 |                 current_loss_train = self.train_loop(dataset)
547 |                 current_loss_val, current_aeloss_val = self.validation_loop(dataset)
548 |
549 |             self.make_outputs(adata, batch_size, denoise = printperiter is not None)
550 |
551 |             y_pred = np.argmax(adata.obsm['cluster memberships'], axis = 1)
552 |
553 |             if printperiter is not None:
554 |                 if ite % printperiter == 0 and ite > 0:
555 |                     denoising_filename = os.path.join(self.weights_dir, 'intermediate_denoising', 'denoised' + str(ite))
556 |                     outfile = open(denoising_filename,'wb')
557 |                     pickle.dump(adata.layers["denoised"][:, adata.var['Variance Type'] == 'HVG'], outfile)  # note: pickle must be imported at the module level for this optional feature
558 |                     outfile.close()
559 |
560 |                     if self.LVG_dims is not None:
561 |                         denoising_filename = os.path.join(self.weights_dir, 'intermediate_denoising', 'denoisedLVG' + str(ite))
562 |                         outfile = open(denoising_filename,'wb')
563 |                         pickle.dump(adata.layers["denoised"][:, adata.var['Variance Type'] == 'LVG'], outfile)
564 |                         outfile.close()
565 |
566 |             # check stop criterion
567 |             delta_label = np.sum(y_pred != y_pred_last).astype(np.float32) / y_pred.shape[0]
568 |             y_pred_last = deepcopy(y_pred)
569 |
570 |             current_aeloss_val = current_aeloss_val.numpy()
571 |             current_clustloss_val = (current_loss_val.numpy() - (1 - self.clust_weight) * current_aeloss_val)/self.clust_weight
572 |             print("Iter {:03d} Loss: [Training: {:.3f}, Validation Cluster: {:.3f}, Validation AE: {:.3f}], Label Change: {:.3f}, Time: {:.1f} s".format(ite, current_loss_train.numpy(), current_clustloss_val, current_aeloss_val, delta_label, time() - iter_start))
573 |
574 |             if current_aeloss_val + 10**(-3) < min_delta:
575 |                 min_delta = current_aeloss_val
576 |                 delta_patience_ES = 0
577 |                 delta_patience_LR = 0
578 |
579 |             if delta_patience_ES >= iteration_patience_ES:
580 |                 delta_stop = True
581 |
582 |             if delta_patience_LR >= iteration_patience_LR:
583 |                 self.optimizer.lr = self.optimizer.lr * decay_factor
584 |                 delta_patience_LR = 0
585 |                 print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy()))
586 |
587 |             delta_patience_ES = delta_patience_ES + 1
588 |             delta_patience_LR = delta_patience_LR + 1
589 |
590 |             if delta_stop and delta_label < self.tol:
591 |                 print('\nAutoencoder_loss ', current_aeloss_val, 'not improving.')
592 |                 print('Proportion of Labels Changed: ', delta_label, ' is less than tolerance of ', self.tol)
593 |                 print('\nReached tolerance threshold.
Stop training.') 594 | break 595 | 596 | 597 | y0 = pd.Series(y_pred, dtype='category') 598 | y0.cat.categories = range(0, len(y0.cat.categories)) 599 | print("\nThe final cluster assignments are:") 600 | x = y0.value_counts() 601 | print(x.sort_index(ascending=True)) 602 | 603 | adata.obsm['embedding'] = self.embed(adata, batch_size) 604 | if self.LVG_dims is not None: 605 | adata.obsm['LVG embedding'] = self.embed_LVG(adata, batch_size) 606 | 607 | del adata.layers['normalized input'] 608 | 609 | if denoise: 610 | self.make_outputs(adata, batch_size, denoise = True) 611 | 612 | self.save_weights("./" + self.weights_dir + "/tuned_CarDECweights", save_format='tf') 613 | 614 | print("\nTotal Runtime is " + str(time() - total_start)) 615 | 616 | print("\nThe CarDEC model is now making inference on the data matrix.") 617 | 618 | self.package_output(adata, self.init_pred, self.preclust_denoised, self.preclust_emb) 619 | 620 | print("Inference completed, results added.") 621 | 622 | return adata 623 | 624 | def reload_model(self, adata = None, batch_size = 64, denoise = True): 625 | """ This class method can be used to load the model's saved weights and redo inference. 626 | 627 | 628 | Arguments: 629 | ------------------------------------------------------------------ 630 | - adata: `anndata.AnnData`, (Optional) The annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. If left as None, model weights will be reloaded but inference will not be made. 631 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 632 | - denoise: `bool`, Whether to provide denoised expression values for all cells. 633 | 634 | Returns: 635 | ------------------------------------------------------------------ 636 | - adata: `anndata.AnnData`, (Optional) The annotated data matrix of shape (n_obs, n_vars). If an adata object was provided as input, the adata object will be returned with inference outputs added. 
637 | """ 638 | 639 | if os.path.isfile("./" + self.weights_dir + "/tuned_CarDECweights.index"): 640 | print("Weight index file detected, loading weights.") 641 | self.load_weights("./" + self.weights_dir + "/tuned_CarDECweights").expect_partial() 642 | print("CarDEC Model weights loaded successfully.") 643 | 644 | if adata is not None: 645 | print("\nThe CarDEC model is now making inference on the data matrix.") 646 | 647 | adata.obsm['embedding'] = self.embed(adata, batch_size) 648 | if self.LVG_dims is not None: 649 | adata.obsm['LVG embedding'] = self.embed_LVG(adata, batch_size) 650 | 651 | del adata.layers['normalized input'] 652 | 653 | if denoise: 654 | self.make_outputs(adata, batch_size, True) 655 | 656 | self.package_output(adata, self.init_pred, self.preclust_denoised, self.preclust_emb) 657 | 658 | print("Inference completed, results returned.") 659 | 660 | return adata 661 | 662 | else: 663 | print("\nWeight index file not detected, please call CarDEC_Model.train to learn the weights\n") 664 | 665 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_SAE.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, MSEloss 2 | from .CarDEC_dataloaders import simpleloader, aeloader 3 | 4 | import tensorflow as tf 5 | from tensorflow.keras import Model, Sequential 6 | from tensorflow.keras.layers import Dense, concatenate 7 | from tensorflow.keras.optimizers import Adam 8 | from tensorflow.keras.backend import set_floatx 9 | from time import time 10 | 11 | import random 12 | import numpy as np 13 | from scipy.stats import zscore 14 | import os 15 | 16 | 17 | set_floatx('float32') 18 | 19 | 20 | class SAE(Model): 21 | def __init__(self, dims, act = 'relu', actincenter = "tanh", 22 | random_seed = 201809, splitseed = 215, init = "glorot_uniform", optimizer = Adam(), 23 | weights_dir = 'CarDEC Weights'): 24 | """ This class method initializes the SAE model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 30 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 31 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 32 | - random_seed: `int`, The seed used for random weight intialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between iterations to ensure the same cells are always used for validation. 34 | - init: `str`, The weight initialization strategy for the autoencoder. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 
37 | """ 38 | 39 | super(SAE, self).__init__() 40 | 41 | tf.keras.backend.clear_session() 42 | 43 | self.weights_dir = weights_dir 44 | 45 | self.dims = dims 46 | self.n_stacks = len(dims) - 1 47 | self.init = init 48 | self.optimizer = optimizer 49 | self.random_seed = random_seed 50 | self.splitseed = splitseed 51 | 52 | self.activation = act 53 | self.actincenter = actincenter #hidden layer activation function 54 | 55 | #set random seed 56 | random.seed(random_seed) 57 | np.random.seed(random_seed) 58 | tf.random.set_seed(random_seed) 59 | 60 | encoder_layers = [] 61 | for i in range(self.n_stacks-1): 62 | encoder_layers.append(Dense(self.dims[i + 1], kernel_initializer = self.init, activation = self.activation, name='encoder_%d' % i)) 63 | 64 | encoder_layers.append(Dense(self.dims[-1], kernel_initializer=self.init, activation=self.actincenter, name='embedding')) 65 | self.encoder = Sequential(encoder_layers, name = 'encoder') 66 | 67 | decoder_layers = [] 68 | for i in range(self.n_stacks - 1, 0, -1): 69 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 70 | , name = 'decoder%d' % (i-1))) 71 | 72 | decoder_layers.append(Dense(self.dims[0], activation = 'linear', name='output')) 73 | 74 | self.decoder = Sequential(decoder_layers, name = 'decoder') 75 | 76 | self.construct() 77 | 78 | def call(self, x): 79 | """ This is the forward pass of the model. 80 | 81 | 82 | ***Inputs*** 83 | - x: `tf.Tensor`, an input tensor of shape (n_obs, p_HVG). 84 | 85 | ***Outputs*** 86 | - output: `tf.Tensor`, A (n_obs, p_HVG) tensor of denoised HVG expression. 87 | """ 88 | 89 | c = self.encoder(x) 90 | 91 | output = self.decoder(c) 92 | 93 | return output 94 | 95 | def load_encoder(self, random_seed = 2312): 96 | """ This class method can be used to load the encoder weights, while randomly reinitializing the decoder weights. 97 | 98 | 99 | Arguments: 100 | ------------------------------------------------------------------ 101 | - random_seed: `int`, Seed for reinitializing the decoder. 102 | """ 103 | 104 | tf.keras.backend.clear_session() 105 | 106 | #set random seed 107 | random.seed(random_seed) 108 | np.random.seed(random_seed) 109 | tf.random.set_seed(random_seed) 110 | 111 | self.encoder.load_weights("./" + self.weights_dir + "/pretrained_encoder_weights").expect_partial() 112 | 113 | decoder_layers = [] 114 | for i in range(self.n_stacks - 1, 0, -1): 115 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 116 | , name='decoder%d' % (i-1))) 117 | self.decoder_base = Sequential(decoder_layers, name = 'decoderbase') 118 | 119 | self.output_layer = Dense(self.dims[0], activation = 'linear', name='output') 120 | 121 | self.construct(summarize = False) 122 | 123 | def load_autoencoder(self, ): 124 | """ This class method can be used to load the full model's weights.""" 125 | 126 | tf.keras.backend.clear_session() 127 | 128 | self.load_weights("./" + self.weights_dir + "/pretrained_autoencoder_weights").expect_partial() 129 | 130 | def construct(self, summarize = False): 131 | """ This class method fully initalizes the TensorFlow model. 132 | 133 | 134 | Arguments: 135 | ------------------------------------------------------------------ 136 | - summarize: `bool`, If True, then print a summary of the model architecture. 
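Example (an illustrative sketch; `sae` denotes an already-initialized SAE instance):

    sae.construct(summarize = True)   # builds the network via a dummy forward pass and prints the architecture summaries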
137 | """ 138 | 139 | x = tf.zeros(shape = (1, self.dims[0]), dtype=float) 140 | out = self(x) 141 | 142 | if summarize: 143 | print("----------Autoencoder Architecture----------") 144 | self.summary() 145 | 146 | print("\n----------Encoder Sub-Architecture----------") 147 | self.encoder.summary() 148 | 149 | print("\n----------Base Decoder Sub-Architecture----------") 150 | self.decoder.summary() 151 | 152 | def denoise(self, adata, batch_size = 64): 153 | """ This class method can be used to denoise gene expression for each cell. 154 | 155 | 156 | Arguments: 157 | ------------------------------------------------------------------ 158 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 159 | - batch_size: `int`, The batch size used for computing denoised expression. 160 | 161 | Returns: 162 | ------------------------------------------------------------------ 163 | - output: `np.ndarray`, Numpy array of denoised expression of shape (n_obs, n_vars) 164 | """ 165 | 166 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 167 | 168 | output = np.zeros((adata.shape[0], self.dims[0]), dtype = 'float32') 169 | start = 0 170 | 171 | for x in input_ds: 172 | end = start + x.shape[0] 173 | output[start:end] = self(x).numpy() 174 | start = end 175 | 176 | return output 177 | 178 | def embed(self, adata, batch_size = 64): 179 | """ This class method can be used to compute the low-dimension embedding for HVG features. 180 | 181 | 182 | Arguments: 183 | ------------------------------------------------------------------ 184 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 185 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 186 | 187 | Returns: 188 | ------------------------------------------------------------------ 189 | - embedding: `np.ndarray`, Array of shape (n_obs, n_vars) containing the cell HVG embeddings. 190 | """ 191 | 192 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 193 | 194 | embedding = np.zeros((adata.shape[0], self.dims[-1]), dtype = 'float32') 195 | 196 | start = 0 197 | for x in input_ds: 198 | end = start + x.shape[0] 199 | embedding[start:end] = self.encoder(x).numpy() 200 | start = end 201 | 202 | return embedding 203 | 204 | def makegenerators(self, adata, val_split, batch_size, splitseed): 205 | """ This class method creates training and validation data generators for the current input data. 206 | 207 | 208 | Arguments: 209 | ------------------------------------------------------------------ 210 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - batch_size: `int`, The batch size used for training the model. 213 | - splitseed: `int`, The seed used to split cells between training and validation. 214 | 215 | Returns: 216 | ------------------------------------------------------------------ 217 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 218 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 
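Example (an illustrative sketch; `sae` is an initialized SAE instance and `adata` carries the 'normalized input' layer and 'Variance Type' annotation):

    dataset = sae.makegenerators(adata, val_split = 0.1, batch_size = 64, splitseed = 215)
    for x, target in dataset(val = False):   # training minibatches
        pass
    for x, target in dataset(val = True):    # held-out validation minibatches
        pass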
219 | """ 220 | 221 | return aeloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], val_frac = val_split, batch_size = batch_size, splitseed = splitseed) 222 | 223 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 224 | patience_LR = 3, patience_ES = 9, save_fullmodel = True): 225 | """ This class method can be used to train the SAE. 226 | 227 | 228 | Arguments: 229 | ------------------------------------------------------------------ 230 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 231 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt training long before hitting this limit. 232 | - batch_size: `int`, The batch size used for training the full model. 233 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 234 | - lr: `float`, The learning rate for training the full model. 235 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 236 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss fails to decrease. 237 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to decrease. 238 | - save_fullmodel: `bool`, If True, save the full model's weights, not just the encoder. 239 | """ 240 | 241 | tf.keras.backend.clear_session() 242 | 243 | dataset = self.makegenerators(adata, val_split = 0.1, batch_size = batch_size, splitseed = self.splitseed) 244 | 245 | counter_LR = 0 246 | counter_ES = 0 247 | best_loss = np.inf 248 | 249 | self.optimizer.lr = lr 250 | 251 | total_start = time() 252 | for epoch in range(num_epochs): 253 | epoch_start = time() 254 | 255 | epoch_loss_avg = tf.keras.metrics.Mean() 256 | epoch_loss_avg_val = tf.keras.metrics.Mean() 257 | 258 | # Training loop - using batches of batch_size 259 | for x, target in dataset(val = False): 260 | loss_value, grads = grad(self, x, target, MSEloss) 261 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 262 | epoch_loss_avg(loss_value) # Add current batch loss 263 | 264 | # Validation Loop 265 | for x, target in dataset(val = True): 266 | output = self(x) 267 | loss_value = MSEloss(target, output) 268 | epoch_loss_avg_val(loss_value) 269 | 270 | current_loss_val = epoch_loss_avg_val.result() 271 | 272 | epoch_time = round(time() - epoch_start, 1) 273 | 274 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 275 | 276 | if(current_loss_val + 10**(-3) < best_loss): 277 | counter_LR = 0 278 | counter_ES = 0 279 | best_loss = current_loss_val 280 | else: 281 | counter_LR = counter_LR + 1 282 | counter_ES = counter_ES + 1 283 | 284 | if patience_ES <= counter_ES: 285 | break 286 | 287 | if patience_LR <= counter_LR: 288 | self.optimizer.lr = self.optimizer.lr * decay_factor 289 | counter_LR = 0 290 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 291 | 292 | # End epoch 293 | 294 | total_time = round(time() - total_start, 2) 295 | 296 | if not os.path.isdir("./" + self.weights_dir): 297 | os.mkdir("./" + self.weights_dir) 298 | 299 | self.save_weights("./" + 
self.weights_dir + "/pretrained_autoencoder_weights", save_format='tf') 300 | self.encoder.save_weights("./" + self.weights_dir + "/pretrained_encoder_weights", save_format='tf') 301 | 302 | print('\nTraining Completed') 303 | print("Total training time: " + str(total_time) + " seconds") 304 | 305 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_count_decoder.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, NBloss 2 | from .CarDEC_utils import build_dir 3 | from .CarDEC_dataloaders import countloader, tupleloader 4 | 5 | import tensorflow as tf 6 | from tensorflow.keras import Model, Sequential 7 | from tensorflow.keras.layers import Dense, concatenate, Lambda 8 | from tensorflow.keras.optimizers import Adam 9 | from tensorflow.keras.backend import exp as tf_exp, set_floatx 10 | from time import time 11 | 12 | import random 13 | import numpy as np 14 | from scipy.stats import zscore 15 | import os 16 | 17 | 18 | set_floatx('float32') 19 | 20 | 21 | class count_model(Model): 22 | def __init__(self, dims, act = 'relu', random_seed = 201809, splitseed = 215, optimizer = Adam(), 23 | weights_dir = 'CarDEC Count Weights', n_features = 32, mode = 'HVG'): 24 | """ This class method initializes the count model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the model. The length of the list determines the 30 | number of layers. 31 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 32 | - random_seed: `int`, The seed used for random weight initialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 34 | iterations to ensure the same cells are always used for validation. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the count model. 37 | - n_features: `int`, the number of input features. 38 | - mode: `str`, String identifying whether HVGs or LVGs are being modeled.
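Example (an illustrative sketch; the layer widths are hypothetical, and `adata` is assumed to already carry raw counts in adata.X, the obsm['embedding'] array, and obs['size factors'] produced by the main CarDEC workflow, with n_features matching the width of that embedding):

    from CarDEC.CarDEC_count_decoder import count_model

    n_HVG = sum(adata.var['Variance Type'] == 'HVG')
    nb_decoder = count_model(dims = [n_HVG, 128, 32], n_features = 32, mode = 'HVG')
    nb_decoder.train(adata)
    nb_decoder.denoise(adata, keep_dispersion = True)   # adds 'denoised counts' and 'dispersion' layers to adata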
39 | """ 40 | 41 | super(count_model, self).__init__() 42 | 43 | tf.keras.backend.clear_session() 44 | 45 | self.mode = mode 46 | self.name_ = mode + " Count" 47 | 48 | if mode == 'HVG': 49 | self.embed_name = 'embedding' 50 | else: 51 | self.embed_name = 'LVG embedding' 52 | 53 | self.weights_dir = weights_dir 54 | 55 | self.dims = dims 56 | n_stacks = len(dims) - 1 57 | 58 | self.optimizer = optimizer 59 | self.random_seed = random_seed 60 | self.splitseed = splitseed 61 | 62 | random.seed(random_seed) 63 | np.random.seed(random_seed) 64 | tf.random.set_seed(random_seed) 65 | 66 | self.activation = act 67 | self.MeanAct = lambda x: tf.clip_by_value(tf_exp(x), 1e-5, 1e6) 68 | self.DispAct = lambda x: tf.clip_by_value(tf.nn.softplus(x), 1e-4, 1e4) 69 | 70 | model_layers = [] 71 | for i in range(n_stacks - 1, 0, -1): 72 | model_layers.append(Dense(dims[i], kernel_initializer = "glorot_uniform", activation = self.activation 73 | , name='base%d' % (i-1))) 74 | self.base = Sequential(model_layers, name = 'base') 75 | 76 | self.mean_layer = Dense(dims[0], activation = self.MeanAct, name='mean') 77 | self.disp_layer = Dense(dims[0], activation = self.DispAct, name='dispersion') 78 | 79 | self.rescale = Lambda(lambda l: tf.matmul(tf.linalg.diag(l[0]), l[1]), name = 'sf scaling') 80 | 81 | build_dir(self.weights_dir) 82 | 83 | self.construct(n_features, self.name_) 84 | 85 | def call(self, x, s): 86 | """ This is the forward pass of the model. 87 | 88 | 89 | ***Inputs*** 90 | - x: `tf.Tensor`, an input tensor of shape (b, p) 91 | - s: `tf.Tensor`, and input tensor of shape (b, ) containing the size factor for each cell 92 | 93 | ***Outputs*** 94 | - mean: `tf.Tensor`, A (b, p_gene) tensor of negative binomial means for each cell, gene. 95 | - disp: `tf.Tensor`, A (b, p_gene) tensor of negative binomial dispersions for each cell, gene. 96 | """ 97 | 98 | x = self.base(x) 99 | 100 | disp = self.disp_layer(x) 101 | mean = self.mean_layer(x) 102 | mean = self.rescale([s, mean]) 103 | 104 | return mean, disp 105 | 106 | def load_model(self, ): 107 | """ This class method can be used to load the model's weights.""" 108 | 109 | tf.keras.backend.clear_session() 110 | 111 | self.load_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_)).expect_partial() 112 | 113 | def construct(self, n_features, name, summarize = False): 114 | """ This class method fully initalizes the TensorFlow model. 115 | 116 | 117 | Arguments: 118 | ------------------------------------------------------------------ 119 | - n_features: `int`, the number of input features. 120 | - name: `str`, Model name (to distinguish HVG and LVG models). 121 | - summarize: `bool`, If True, then print a summary of the model architecture. 122 | """ 123 | 124 | x = [tf.zeros(shape = (1, n_features), dtype='float32'), tf.ones(shape = (1,), dtype='float32')] 125 | out = self(*x) 126 | 127 | if summarize: 128 | print("----------Count Model " + name + " Architecture----------") 129 | self.summary() 130 | 131 | print("\n----------Base Sub-Architecture----------") 132 | self.base.summary() 133 | 134 | def denoise(self, adata, keep_dispersion = False, batch_size = 64): 135 | """ This class method can be used to denoise gene expression for each cell on the count scale. 136 | 137 | 138 | Arguments: 139 | ------------------------------------------------------------------ 140 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 141 | to cells and columns to genes. 
142 | - keep_dispersion: `bool`, If True, also return the dispersion for each gene, cell (added as a layer to adata). 143 | - batch_size: `int`, The batch size used for computing denoised expression. 144 | 145 | Returns: 146 | ------------------------------------------------------------------ 147 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Negative binomial means (and optionally 148 | dispersions) added as layers. 149 | """ 150 | 151 | input_ds = tupleloader(adata.obsm[self.embed_name], adata.obs['size factors'], batch_size = batch_size) 152 | 153 | if "denoised counts" not in list(adata.layers): 154 | adata.layers["denoised counts"] = np.zeros(adata.shape, dtype = 'float32') 155 | 156 | type_indices = adata.var['Variance Type'] == self.mode 157 | 158 | if not keep_dispersion: 159 | start = 0 160 | for x in input_ds: 161 | end = start + x[0].shape[0] 162 | adata.layers["denoised counts"][start:end, type_indices] = self(*x)[0].numpy() 163 | start = end 164 | 165 | else: 166 | if "dispersion" not in list(adata.layers): 167 | adata.layers["dispersion"] = np.zeros(adata.shape, dtype = 'float32') 168 | 169 | start = 0 170 | for x in input_ds: 171 | end = start + x[0].shape[0] 172 | batch_output = self(*x) 173 | adata.layers["denoised counts"][start:end, type_indices] = batch_output[0].numpy() 174 | adata.layers["dispersion"][start:end, type_indices] = batch_output[1].numpy() 175 | start = end 176 | 177 | def makegenerators(self, adata, val_split, batch_size, splitseed): 178 | """ This class method creates training and validation data generators for the current input data. 179 | 180 | 181 | Arguments: 182 | ------------------------------------------------------------------ 183 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond 184 | to cells and columns to genes. 185 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 186 | - batch_size: `int`, The batch size used for training the model. 187 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 188 | iterations to ensure the same cells are always used for validation. 189 | 190 | Returns: 191 | ------------------------------------------------------------------ 192 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 193 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 194 | """ 195 | 196 | return countloader(adata.obsm[self.embed_name], adata.X[:, adata.var['Variance Type'] == self.mode], adata.obs['size factors'], 197 | val_split, batch_size, splitseed) 198 | 199 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 200 | patience_LR = 3, patience_ES = 9): 201 | """ This class method can be used to train the count model. 202 | 203 | 204 | Arguments: 205 | ------------------------------------------------------------------ 206 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 207 | to cells and columns to genes. 208 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt 209 | training long before hitting this limit. 210 | - batch_size: `int`, The batch size used for training the full model. 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - lr: `float`, The learning rate for training the full model.
213 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not 214 | decreasing. 215 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the 216 | validation loss fails to decrease. 217 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to 218 | decrease. 219 | """ 220 | 221 | tf.keras.backend.clear_session() 222 | 223 | loss = NBloss 224 | 225 | dataset = self.makegenerators(adata, val_split = 0.1, batch_size = batch_size, splitseed = self.splitseed) 226 | 227 | counter_LR = 0 228 | counter_ES = 0 229 | best_loss = np.inf 230 | 231 | self.optimizer.lr = lr 232 | 233 | total_start = time() 234 | 235 | for epoch in range(num_epochs): 236 | epoch_start = time() 237 | 238 | epoch_loss_avg = tf.keras.metrics.Mean() 239 | epoch_loss_avg_val = tf.keras.metrics.Mean() 240 | 241 | # Training loop - using batches of batch_size 242 | for x, target in dataset(val = False): 243 | loss_value, grads = grad(self, x, target, loss) 244 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 245 | epoch_loss_avg(loss_value) # Add current batch loss 246 | 247 | # Validation Loop 248 | for x, target in dataset(val = True): 249 | output = self(*x) 250 | loss_value = loss(target, output) 251 | epoch_loss_avg_val(loss_value) 252 | 253 | current_loss_val = epoch_loss_avg_val.result() 254 | 255 | epoch_time = round(time() - epoch_start, 1) 256 | 257 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 258 | 259 | if(current_loss_val + 10**(-3) < best_loss): 260 | counter_LR = 0 261 | counter_ES = 0 262 | best_loss = current_loss_val 263 | else: 264 | counter_LR = counter_LR + 1 265 | counter_ES = counter_ES + 1 266 | 267 | if patience_ES <= counter_ES: 268 | break 269 | 270 | if patience_LR <= counter_LR: 271 | self.optimizer.lr = self.optimizer.lr * decay_factor 272 | counter_LR = 0 273 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 274 | 275 | # End epoch 276 | 277 | total_time = round(time() - total_start, 2) 278 | 279 | if not os.path.isdir("./" + self.weights_dir): 280 | os.mkdir("./" + self.weights_dir) 281 | 282 | self.save_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_), save_format='tf') 283 | 284 | print('\nTraining Completed') 285 | print("Total training time: " + str(total_time) + " seconds") 286 | 287 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_dataloaders.py: -------------------------------------------------------------------------------- 1 | from tensorflow import convert_to_tensor as tensor 2 | from numpy import setdiff1d 3 | from numpy.random import choice, seed 4 | 5 | class batch_sampler(object): 6 | def __init__(self, array, val_frac, batch_size, splitseed): 7 | seed(splitseed) 8 | self.val_indices = choice(range(len(array)), round(val_frac * len(array)), False) 9 | self.train_indices = setdiff1d(range(len(array)), self.val_indices) 10 | self.batch_size = batch_size 11 | 12 | def __iter__(self): 13 | batch = [] 14 | 15 | if self.val: 16 | for idx in self.val_indices: 17 | batch.append(idx) 18 | 19 | if len(batch) == self.batch_size: 20 | yield batch 21 | batch = [] 22 | 23 | else: 24 | train_idx = choice(self.train_indices, len(self.train_indices), False) 25 | 26 | for idx 
in train_idx: 27 | batch.append(idx) 28 | 29 | if len(batch) == self.batch_size: 30 | yield batch 31 | batch = [] 32 | 33 | if batch: 34 | yield batch 35 | 36 | def __call__(self, val): 37 | self.val = val 38 | return self 39 | 40 | class simpleloader(object): 41 | def __init__(self, array, batch_size): 42 | self.array = array 43 | self.batch_size = batch_size 44 | 45 | def __iter__(self): 46 | batch = [] 47 | 48 | for idx in range(len(self.array)): 49 | batch.append(idx) 50 | 51 | if len(batch) == self.batch_size: 52 | yield tensor(self.array[batch].copy()) 53 | batch = [] 54 | 55 | if batch: 56 | yield self.array[batch].copy() 57 | 58 | class tupleloader(object): 59 | def __init__(self, *arrays, batch_size): 60 | self.arrays = arrays 61 | self.batch_size = batch_size 62 | 63 | def __iter__(self): 64 | batch = [] 65 | 66 | for idx in range(len(self.arrays[0])): 67 | batch.append(idx) 68 | 69 | if len(batch) == self.batch_size: 70 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 71 | batch = [] 72 | 73 | if batch: 74 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 75 | 76 | class aeloader(object): 77 | def __init__(self, *arrays, val_frac, batch_size, splitseed): 78 | self.arrays = arrays 79 | self.batch_size = batch_size 80 | self.sampler = batch_sampler(arrays[0], val_frac, batch_size, splitseed) 81 | 82 | def __iter__(self): 83 | for idxs in self.sampler(self.val): 84 | yield [tensor(arr[idxs].copy()) for arr in self.arrays] 85 | 86 | def __call__(self, val): 87 | self.val = val 88 | return self 89 | 90 | class countloader(object): 91 | def __init__(self, embedding, target, sizefactor, val_frac, batch_size, splitseed): 92 | self.sampler = batch_sampler(embedding, val_frac, batch_size, splitseed) 93 | self.embedding = embedding 94 | self.target = target 95 | self.sizefactor = sizefactor 96 | 97 | def __iter__(self): 98 | for idxs in self.sampler(self.val): 99 | yield (tensor(self.embedding[idxs].copy()), tensor(self.sizefactor[idxs].copy())), tensor(self.target[idxs].copy()) 100 | 101 | def __call__(self, val): 102 | self.val = val 103 | return self 104 | 105 | class dataloader(object): 106 | def __init__(self, hvg_input, hvg_target, lvg_input = None, lvg_target = None, val_frac = 0.1, batch_size = 128, splitseed = 0): 107 | self.sampler = batch_sampler(hvg_input, val_frac, batch_size, splitseed) 108 | self.hvg_input = hvg_input 109 | self.hvg_target = hvg_target 110 | self.lvg_input = lvg_input 111 | self.lvg_target = lvg_target 112 | 113 | def __iter__(self): 114 | for idxs in self.sampler(self.val): 115 | hvg_input = tensor(self.hvg_input[idxs].copy()) 116 | hvg_target = tensor(self.hvg_target[idxs].copy()) 117 | p_target = tensor(self.p_target[idxs].copy()) 118 | 119 | if (self.lvg_input is not None) and (self.lvg_target is not None): 120 | lvg_input = tensor(self.lvg_input[idxs].copy()) 121 | lvg_target = tensor(self.lvg_target[idxs].copy()) 122 | else: 123 | lvg_input = None 124 | lvg_target = None 125 | 126 | yield [hvg_input, lvg_input], hvg_target, lvg_target, p_target 127 | 128 | def __call__(self, val): 129 | self.val = val 130 | return self 131 | 132 | def update_p(self, new_p_target): 133 | self.p_target = new_p_target -------------------------------------------------------------------------------- /CarDEC/CarDEC_layers.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Layer 3 | 4 | class ClusteringLayer(Layer): 5 | def __init__(self, centroids = None, 
n_clusters = None, n_features = None, alpha=1.0, **kwargs): 6 | """ The clustering layer predicts the a cell's class membership probability for each cell. 7 | 8 | 9 | Arguments: 10 | ------------------------------------------------------------------ 11 | - centroids: `tf.Tensor`, Initial cluster ceontroids after pretraining the model. 12 | - n_clusters: `int`, Number of clusters. 13 | - n_features: `int`, The number of features of the bottleneck embedding space that the centroids live in. 14 | - alpha: parameter in Student's t-distribution. Default to 1.0. 15 | """ 16 | 17 | super(ClusteringLayer, self).__init__(**kwargs) 18 | self.alpha = alpha 19 | self.initial_centroids = centroids 20 | 21 | if centroids is not None: 22 | n_clusters, n_features = centroids.shape 23 | 24 | self.n_features, self.n_clusters = n_features, n_clusters 25 | 26 | assert self.n_clusters is not None 27 | assert self.n_features is not None 28 | 29 | def build(self, input_shape): 30 | """ This class method builds the layer fully once it receives an input tensor. 31 | 32 | 33 | Arguments: 34 | ------------------------------------------------------------------ 35 | - input_shape: `list`, A list specifying the shape of the input tensor. 36 | """ 37 | 38 | assert len(input_shape) == 2 39 | 40 | self.centroids = self.add_weight(name = 'clusters', shape = (self.n_clusters, self.n_features), initializer = 'glorot_uniform') 41 | if self.initial_centroids is not None: 42 | self.set_weights([self.initial_centroids]) 43 | del self.initial_centroids 44 | 45 | self.built = True 46 | 47 | def call(self, x, **kwargs): 48 | """ Forward pass of the clustering layer, 49 | 50 | 51 | ***Inputs***: 52 | - x: `tf.Tensor`, the embedding tensor of shape = (n_obs, n_var) 53 | 54 | ***Returns***: 55 | - q: `tf.Tensor`, student's t-distribution, or soft labels for each sample of shape = (n_obs, n_clusters) 56 | """ 57 | 58 | q = 1.0 / (1.0 + (tf.reduce_sum(tf.square(tf.expand_dims(x, axis = 1) - self.centroids), axis = 2) / self.alpha)) 59 | q = q**((self.alpha + 1.0) / 2.0) 60 | q = q / tf.reduce_sum(q, axis = 1, keepdims = True) 61 | 62 | return q 63 | 64 | def compute_output_shape(self, input_shape): 65 | """ This method infers the output shape from the input shape. 66 | 67 | 68 | Arguments: 69 | ------------------------------------------------------------------ 70 | - input_shape: `list`, A list specifying the shape of the input tensor. 71 | 72 | Returns: 73 | ------------------------------------------------------------------ 74 | - output_shape: `list`, A tuple specifying the shape of the output for the minibatch (n_obs, n_clusters) 75 | """ 76 | 77 | assert input_shape and len(input_shape) == 2 78 | return input_shape[0], self.n_clusters -------------------------------------------------------------------------------- /CarDEC/CarDEC_optimization.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import tensorflow as tf 4 | from tensorflow.keras.losses import KLD, MSE 5 | 6 | 7 | def grad_MainModel(model, input_, target, target_p, total_loss, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 8 | """Function to do a backprop update to the main CarDEC model for a minibatch. 9 | 10 | 11 | Arguments: 12 | ------------------------------------------------------------------ 13 | - model: `tensorflow.keras.Model`, The main CarDEC model. 14 | - input_: `list`, A list containing the input HVG and (optionally) LVG expression tensors of the minibatch for the CarDEC model. 
15 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 16 | - target_p: `tf.Tensor`, Tensor containing cluster membership probability targets for the minibatch. 17 | - total_loss: `function`, Function to compute the loss for the main CarDEC model for a minibatch. 18 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 19 | - aeloss_fun: `function`, Function to compute reconstruction loss. 20 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 21 | 22 | Returns: 23 | ------------------------------------------------------------------ 24 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 25 | - gradients: `a list of Tensors`: Gradients to update the model weights. 26 | """ 27 | 28 | with tf.GradientTape() as tape: 29 | denoised_output, cluster_output = model(*input_) 30 | loss_value, aeloss = total_loss(target, denoised_output, target_p, cluster_output, 31 | LVG_target, aeloss_fun, clust_weight) 32 | 33 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 34 | 35 | 36 | def grad_reconstruction(model, input_, target, loss): 37 | """Function to compute gradient update for pretrained autoencoder only. 38 | 39 | 40 | Arguments: 41 | ------------------------------------------------------------------ 42 | - model: `tensorflow.keras.Model`, The main CarDEC model. 43 | - input_: `list`, A list containing the input HVG expression tensor of the minibatch for the CarDEC model. 44 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 45 | - loss: `function`, Function to compute reconstruction loss. 46 | 47 | Returns: 48 | ------------------------------------------------------------------ 49 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 50 | - gradients: `a list of Tensors`: Gradients to update the model weights. 51 | """ 52 | 53 | if type(input_) != tuple: 54 | input_ = (input_, ) 55 | 56 | with tf.GradientTape() as tape: 57 | output = model(*input_) 58 | loss_value = loss(target, output) 59 | 60 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 61 | 62 | 63 | def total_loss(target, denoised_output, p, cluster_output_q, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 64 | """Function to compute the loss for the main CarDEC model for a minibatch. 65 | 66 | 67 | Arguments: 68 | ------------------------------------------------------------------ 69 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 70 | - denoised_output: `dict`, Dictionary containing the output tensors from the CarDEC main model's forward pass. 71 | - p: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing cluster membership probability targets for the minibatch. 72 | - cluster_output_q: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing predicted cluster membership probabilities 73 | for each cell. 74 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 75 | - aeloss_fun: `function`, Function to compute reconstruction loss. 76 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 77 | 78 | Returns: 79 | ------------------------------------------------------------------ 80 | - net_loss: `tf.Tensor`, The loss computed for the minibatch. 
81 | - aeloss: `tf.Tensor`, The reconstruction loss computed for the minibatch. 82 | """ 83 | 84 | if aeloss_fun is not None: 85 | 86 | aeloss_HVG = aeloss_fun(target, denoised_output['HVG_denoised']) 87 | if LVG_target is not None: 88 | aeloss_LVG = aeloss_fun(LVG_target, denoised_output['LVG_denoised']) 89 | aeloss = 0.5*(aeloss_LVG + aeloss_HVG) 90 | else: 91 | aeloss = 1. * aeloss_HVG 92 | else: 93 | aeloss = 0. 94 | 95 | net_loss = clust_weight * tf.reduce_mean(KLD(p, cluster_output_q)) + (2. - clust_weight) * aeloss 96 | 97 | return net_loss, aeloss 98 | 99 | 100 | def MSEloss(netinput, netoutput): 101 | """Function to compute the MSEloss for the reconstruction loss of a minibatch. 102 | 103 | 104 | Arguments: 105 | ------------------------------------------------------------------ 106 | - netinput: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells. 107 | - netoutput: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 108 | 109 | Returns: 110 | ------------------------------------------------------------------ 111 | - mse_loss: `tf.Tensor`, The loss computed for the minibatch, averaged over genes and cells. 112 | """ 113 | 114 | return tf.math.reduce_mean(MSE(netinput, netoutput)) 115 | 116 | 117 | def NBloss(count, output, eps = 1e-10, mean = True): 118 | """Function to compute the negative binomial reconstruction loss of a minibatch. 119 | 120 | 121 | Arguments: 122 | ------------------------------------------------------------------ 123 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original 124 | counts). 125 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 126 | - eps: `float`, A small number introduced for computational stability 127 | - mean: `bool`, If True, average negative binomial loss over genes and cells 128 | 129 | Returns: 130 | ------------------------------------------------------------------ 131 | - nbloss: `tf.Tensor`, The loss computed for the minibatch. If mean was True, it has shape (n_obs, n_var). Otherwise, it has shape (1,). 132 | """ 133 | 134 | count = tf.cast(count, tf.float32) 135 | mu = tf.cast(output[0], tf.float32) 136 | 137 | theta = tf.minimum(output[1], 1e6) 138 | 139 | t1 = tf.math.lgamma(theta + eps) + tf.math.lgamma(count + 1.0) - tf.math.lgamma(count + theta + eps) 140 | t2 = (theta + count) * tf.math.log(1.0 + (mu/(theta+eps))) + (count * (tf.math.log(theta + eps) - tf.math.log(mu + eps))) 141 | 142 | final = _nan2inf(t1 + t2) 143 | 144 | if mean: 145 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 146 | 147 | return final 148 | 149 | 150 | def ZINBloss(count, output, eps = 1e-10): 151 | """Function to compute the negative binomial reconstruction loss of a minibatch. 152 | 153 | 154 | Arguments: 155 | ------------------------------------------------------------------ 156 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original counts). 157 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 158 | - eps: `float`, A small number introduced for computational stability 159 | 160 | Returns: 161 | ------------------------------------------------------------------ 162 | - zinbloss: `tf.Tensor`, The loss computed for the minibatch. Has shape (1,). 
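Example (an illustrative sketch; count, mu, theta, and pi are hypothetical float32 tensors of shape (n_obs, n_vars), with pi holding zero-inflation probabilities in [0, 1]):

    zinb = ZINBloss(count, (mu, theta, pi))   # scalar loss for the minibatch
    nb = NBloss(count, (mu, theta))           # negative binomial loss without zero inflation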
163 | """ 164 | 165 | mu = output[0] 166 | theta = output[1] 167 | pi = output[2] 168 | 169 | NB = NBloss(count, output, eps = eps, mean = False) - tf.math.log(1.0 - pi + eps) 170 | 171 | count = tf.cast(count, tf.float32) 172 | mu = tf.cast(mu, tf.float32) 173 | 174 | theta = tf.math.minimum(theta, 1e6) 175 | 176 | zero_nb = tf.math.pow(theta/(theta + mu + eps), theta) 177 | zero_case = -tf.math.log(pi + ((1.0- pi) * zero_nb) + eps) 178 | final = tf.where(tf.less(count, 1e-8), zero_case, NB) 179 | 180 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 181 | 182 | return final 183 | 184 | 185 | def _nan2inf(x): 186 | """Function to replace nan entries in a Tensor with infinities. 187 | 188 | 189 | Arguments: 190 | ------------------------------------------------------------------ 191 | - x: `tf.Tensor`, Tensor of arbitrary shape. 192 | 193 | Returns: 194 | ------------------------------------------------------------------ 195 | - x': `tf.Tensor`, Tensor x with nan entries replaced by infinity. 196 | """ 197 | 198 | return tf.where(tf.math.is_nan(x), tf.zeros_like(x) + np.inf, x) 199 | 200 | -------------------------------------------------------------------------------- /CarDEC/CarDEC_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | from scipy.sparse import issparse 4 | 5 | import scanpy as sc 6 | from anndata import AnnData 7 | 8 | 9 | def normalize_scanpy(adata, batch_key = None, n_high_var = 1000, LVG = True, 10 | normalize_samples = True, log_normalize = True, 11 | normalize_features = True): 12 | """ This function preprocesses the raw count data. 13 | 14 | 15 | Arguments: 16 | ------------------------------------------------------------------ 17 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 18 | - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch. 19 | - n_high_var: `int`, integer specifying the number of genes to be idntified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable. 20 | - LVG: `bool`, Whether to retain and preprocess LVGs. 21 | - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell. 22 | - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count. 23 | - normalize_features: `bool`, If True, z-score normalize each gene's expression. 24 | 25 | Returns: 26 | ------------------------------------------------------------------ 27 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Contains preprocessed data. 
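Example (an illustrative sketch; assumes raw counts are loaded into an AnnData object and, if several batches are present, that adata.obs has a column named 'batch'):

    from CarDEC.CarDEC_utils import normalize_scanpy

    adata = normalize_scanpy(adata, batch_key = 'batch', n_high_var = 2000)
    adata.layers['normalized input']   # scaled expression used as model input
    adata.var['Variance Type']         # per-gene 'HVG' / 'LVG' designation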
28 | """ 29 | 30 | n, p = adata.shape 31 | sparsemode = issparse(adata.X) 32 | 33 | if batch_key is not None: 34 | batch = list(adata.obs[batch_key]) 35 | batch = convert_vector_to_encoding(batch) 36 | batch = np.asarray(batch) 37 | batch = batch.astype('float32') 38 | else: 39 | batch = np.ones((n,), dtype = 'float32') 40 | norm_by_batch = False 41 | 42 | sc.pp.filter_genes(adata, min_counts=1) 43 | sc.pp.filter_cells(adata, min_counts=1) 44 | 45 | count = adata.X.copy() 46 | 47 | if normalize_samples: 48 | out = sc.pp.normalize_total(adata, inplace = False) 49 | obs_ = adata.obs 50 | var_ = adata.var 51 | adata = None 52 | adata = AnnData(out['X']) 53 | adata.obs = obs_ 54 | adata.var = var_ 55 | 56 | size_factors = out['norm_factor'] / np.median(out['norm_factor']) 57 | out = None 58 | else: 59 | size_factors = np.ones((adata.shape[0], )) 60 | 61 | if not log_normalize: 62 | adata_ = adata.copy() 63 | 64 | sc.pp.log1p(adata) 65 | 66 | if n_high_var is not None: 67 | sc.pp.highly_variable_genes(adata, inplace = True, min_mean = 0.0125, max_mean = 3, min_disp = 0.5, 68 | n_bins = 20, n_top_genes = n_high_var, batch_key = batch_key) 69 | 70 | hvg = adata.var['highly_variable'].values 71 | 72 | if not log_normalize: 73 | adata = adata_.copy() 74 | 75 | else: 76 | hvg = [True] * adata.shape[1] 77 | 78 | if normalize_features: 79 | batch_list = np.unique(batch) 80 | 81 | if sparsemode: 82 | adata.X = adata.X.toarray() 83 | 84 | for batch_ in batch_list: 85 | indices = [x == batch_ for x in batch] 86 | sub_adata = adata[indices] 87 | 88 | sc.pp.scale(sub_adata) 89 | adata[indices] = sub_adata.X 90 | 91 | adata.layers["normalized input"] = adata.X 92 | adata.X = count 93 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 94 | 95 | else: 96 | if sparsemode: 97 | adata.layers["normalized input"] = adata.X.toarray() 98 | else: 99 | adata.layers["normalized input"] = adata.X 100 | 101 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 102 | 103 | if n_high_var is not None: 104 | del_keys = ['dispersions', 'dispersions_norm', 'highly_variable', 'highly_variable_intersection', 'highly_variable_nbatches', 'means'] 105 | del_keys = [x for x in del_keys if x in adata.var.keys()] 106 | adata.var = adata.var.drop(del_keys, axis = 1) 107 | 108 | y = np.unique(batch) 109 | num_batch = len(y) 110 | 111 | adata.obs['size factors'] = size_factors.astype('float32') 112 | adata.obs['batch'] = batch 113 | adata.uns['num_batch'] = num_batch 114 | 115 | if sparsemode: 116 | adata.X = adata.X.toarray() 117 | 118 | if not LVG: 119 | adata = adata[:, adata.var['Variance Type'] == 'HVG'] 120 | 121 | return adata 122 | 123 | 124 | def build_dir(dir_path): 125 | """ This function builds a directory if it does not exist. 126 | 127 | 128 | Arguments: 129 | ------------------------------------------------------------------ 130 | - dir_path: `str`, The directory to build. E.g. if dir_path = 'folder1/folder2/folder3', then this function will creates directory if folder1 if it does not already exist. Then it creates folder1/folder2 if folder2 does not exist in folder1. Then it creates folder1/folder2/folder3 if folder3 does not exist in folder2. 
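Example (an illustrative sketch; the path is hypothetical):

    build_dir('CarDEC Weights/count model')   # creates 'CarDEC Weights' first if needed, then 'CarDEC Weights/count model'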
131 | """ 132 | 133 | subdirs = [dir_path] 134 | substring = dir_path 135 | 136 | while substring != '': 137 | splt_dir = os.path.split(substring) 138 | substring = splt_dir[0] 139 | subdirs.append(substring) 140 | 141 | subdirs.pop() 142 | subdirs = [x for x in subdirs if os.path.basename(x) != '..'] 143 | 144 | n = len(subdirs) 145 | subdirs = [subdirs[n - 1 - x] for x in range(n)] 146 | 147 | for dir_ in subdirs: 148 | if not os.path.isdir(dir_): 149 | os.mkdir(dir_) 150 | 151 | 152 | def convert_string_to_encoding(string, vector_key): 153 | """A function to convert a string to a numeric encoding. 154 | 155 | 156 | Arguments: 157 | ------------------------------------------------------------------ 158 | - string: `str`, The specific string to convert to a numeric encoding. 159 | - vector_key: `np.ndarray`, Array of all possible values of string. 160 | 161 | Returns: 162 | ------------------------------------------------------------------ 163 | - encoding: `int`, The integer encoding of string. 164 | """ 165 | 166 | return np.argwhere(vector_key == string)[0][0] 167 | 168 | 169 | def convert_vector_to_encoding(vector): 170 | """A function to convert a vector of strings to a dense numeric encoding. 171 | 172 | 173 | Arguments: 174 | ------------------------------------------------------------------ 175 | - vector: `array_like`, The vector of strings to encode. 176 | 177 | Returns: 178 | ------------------------------------------------------------------ 179 | - vector_num: `list`, A list containing the dense numeric encoding. 180 | """ 181 | 182 | vector_key = np.unique(vector) 183 | vector_strings = list(vector) 184 | vector_num = [convert_string_to_encoding(string, vector_key) for string in vector_strings] 185 | 186 | return vector_num 187 | 188 | 189 | def find_resolution(adata_, n_clusters, random): 190 | """A function to find the louvain resolution tjat corresponds to a prespecified number of clusters, if it exists. 191 | 192 | 193 | Arguments: 194 | ------------------------------------------------------------------ 195 | - adata_: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to low dimension features. 196 | - n_clusters: `int`, Number of clusters. 197 | - random: `int`, The random seed. 198 | 199 | Returns: 200 | ------------------------------------------------------------------ 201 | - resolution: `float`, The resolution that gives n_clusters after running louvain's clustering algorithm. 202 | """ 203 | 204 | obtained_clusters = -1 205 | iteration = 0 206 | resolutions = [0., 1000.] 
207 | 208 | while obtained_clusters != n_clusters and iteration < 50: 209 | current_res = sum(resolutions)/2 210 | adata = sc.tl.louvain(adata_, resolution = current_res, random_state = random, copy = True) 211 | labels = adata.obs['louvain'] 212 | obtained_clusters = len(np.unique(labels)) 213 | 214 | if obtained_clusters < n_clusters: 215 | resolutions[0] = current_res 216 | else: 217 | resolutions[1] = current_res 218 | 219 | iteration = iteration + 1 220 | 221 | return current_res 222 | 223 | -------------------------------------------------------------------------------- /CarDEC/__init__.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_API import CarDEC_API -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_API.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_API.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_MainModel.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_MainModel.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_SAE.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_SAE.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_count_decoder.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_count_decoder.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_dataloaders.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_dataloaders.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_layers.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_layers.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_optimization.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_optimization.cpython-37.pyc -------------------------------------------------------------------------------- /CarDEC/__pycache__/CarDEC_utils.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/CarDEC_utils.cpython-37.pyc 
-------------------------------------------------------------------------------- /CarDEC/__pycache__/__init__.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/CarDEC/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- /LICENSE.rtf: -------------------------------------------------------------------------------- 1 | {\rtf1\ansi\ansicpg1252\cocoartf2511 2 | \cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fnil\fcharset0 Monaco;} 3 | {\colortbl;\red255\green255\blue255;\red74\green70\blue67;\red255\green255\blue255;} 4 | {\*\expandedcolortbl;;\cssrgb\c36078\c34510\c33333;\cssrgb\c100000\c100000\c100000;} 5 | \margl1440\margr1440\vieww10800\viewh8400\viewkind0 6 | \deftab720 7 | \pard\pardeftab720\sl380\partightenfactor0 8 | 9 | \f0\fs28 \cf2 \cb3 \expnd0\expndtw0\kerning0 10 | \outl0\strokewidth0 \strokec2 MIT License\ 11 | \ 12 | Copyright (c) 2020 Justin Lakkis\ 13 | \ 14 | Permission is hereby granted, free of charge, to any person obtaining a copy\ 15 | of this software and associated documentation files (the "Software"), to deal\ 16 | in the Software without restriction, including without limitation the rights\ 17 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ 18 | copies of the Software, and to permit persons to whom the Software is\ 19 | furnished to do so, subject to the following conditions:\ 20 | \ 21 | The above copyright notice and this permission notice shall be included in all\ 22 | copies or substantial portions of the Software.\ 23 | \ 24 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\ 25 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\ 26 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\ 27 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\ 28 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\ 29 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\ 30 | SOFTWARE.} -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CarDEC 2 | 3 | CarDEC (**C**ount **a**dapted **r**egularized **D**eep **E**mbedded **C**lustering) is a joint deep learning computational tool that is useful for analyses of single-cell RNA-seq data. CarDEC can be used to: 4 | 5 | 1. Correct for batch effect in the full gene expression space, allowing the investigator to remove batch effect from downstream analyses like psuedotime analysis and coexpression analysis. Batch correction is also possible in a low-dimensional embedding space. 6 | 2. Denoise gene expression. 7 | 3. Cluster cells. 8 | 9 | ## Reproducibility 10 | 11 | We described and introduced CarDEC in our [methodological paper](https://www.biorxiv.org/content/10.1101/2020.09.23.310003v1). To find code to reproduce the results we generated in that paper, please visit this separate [github repository](https://github.com/jlakkis/CarDEC_Codes), which provides all code (including that for other methods) necessary to reproduce our results. 12 | 13 | ## Installation 14 | 15 | Recomended installation procedure is as follows. 16 | 17 | 1. Install [Anaconda](https://www.anaconda.com/products/individual) if you do not already have it. 18 | 2. 
Create a conda environment, and then activate it as follows in terminal. 19 | 20 | ``` 21 | $ conda create -n cardecenv 22 | $ conda activate cardecenv 23 | ``` 24 | 25 | 3. Install an appropriate version of python. 26 | 27 | ``` 28 | $ conda install python==3.7 29 | ``` 30 | 31 | 4. Install nb_conda_kernels so that you can change python kernels in jupyter notebook. 32 | 33 | ``` 34 | $ conda install nb_conda_kernels 35 | ``` 36 | 37 | 5. Finally, install CarDEC. 38 | 39 | ``` 40 | $ pip install CarDEC 41 | ``` 42 | 43 | Now, to use CarDEC, always make sure you activate the environment in terminal first ("conda activate cardecenv"). And then run jupyter notebook. When you create a notebook to run CarDEC, make sure the active kernel is switched to "cardecenv" 44 | 45 | ## Usage 46 | 47 | A [tutorial jupyter notebook](https://drive.google.com/drive/folders/19VVOoq4XSdDFRZDou-VbTMyV2Na9z53O?usp=sharing), together with a dataset, is publicly downloadable. 48 | 49 | ## Software Requirements 50 | 51 | - Python >= 3.7 52 | - TensorFlow >= 2.0.1, <= 2.3.1 53 | - scikit-learn == 0.22.2.post1 54 | - scanpy == 1.5.1 55 | - louvain == 0.6.1 56 | - pandas == 1.0.1 57 | - scipy == 1.4.1 58 | 59 | ## Trouble shooting 60 | 61 | Installation on MacOS should be smooth. If installing on Windows Subsystem for Linux (WSL), the user must properly configure their g++ compiler to ensure that the louvain package can be built during installation. If the compiler is not properly configured, the user may encounter a following deprecation error similar to the following. 62 | 63 | "DEPRECATION: Could not build wheels for louvain which do not use PEP 517. pip will fall back to legacy 'setup.py install' for these. pip 21.0 will remove support for this functionality. A possible replacement is to fix the wheel build issue reported above." 64 | 65 | To fix this error, try to install the libxml2-dev package. -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_API.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_utils import normalize_scanpy 2 | from .CarDEC_MainModel import CarDEC_Model 3 | from .CarDEC_count_decoder import count_model 4 | 5 | import tensorflow as tf 6 | from tensorflow.keras.optimizers import Adam 7 | import numpy as np 8 | from pandas import DataFrame 9 | 10 | import os 11 | 12 | class CarDEC_API: 13 | def __init__(self, adata, preprocess=True, weights_dir = "CarDEC Weights", batch_key = None, n_high_var = 2000, LVG = True, 14 | normalize_samples = True, log_normalize = True, normalize_features = True): 15 | """ Main CarDEC API the user can use to conduct batch correction and denoising experiments. 16 | 17 | 18 | Arguments: 19 | ------------------------------------------------------------------ 20 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to genes. 21 | - preprocess: `bool`, If True, then preprocess the data. 22 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 23 | - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch. 24 | - n_high_var: `int`, integer specifying the number of genes to be idntified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable. 
25 | - LVG: `bool`, If True, also model LVGs. Otherwise, only model HVGs. 26 | - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell. 27 | - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count. 28 | - normalize_features: `bool`, If True, z-score normalize each gene's expression. 29 | """ 30 | 31 | if n_high_var is None: 32 | n_high_var = None 33 | LVG = False 34 | 35 | self.weights_dir = weights_dir 36 | self.LVG = LVG 37 | 38 | self.norm_args = (batch_key, n_high_var, LVG, normalize_samples, log_normalize, normalize_features) 39 | 40 | if preprocess: 41 | self.dataset = normalize_scanpy(adata, *self.norm_args) 42 | else: 43 | assert 'Variance Type' in adata.var.keys() 44 | assert 'normalized input' in adata.layers 45 | self.dataset = adata 46 | 47 | self.loaded = False 48 | self.count_loaded = False 49 | 50 | def build_model(self, load_fullmodel = True, dims = [128, 32], LVG_dims = [128, 32], tol = 0.005, n_clusters = None, 51 | random_seed = 201809, louvain_seed = 0, n_neighbors = 15, pretrain_epochs = 2000, batch_size_pretrain = 64, 52 | act = 'relu', actincenter = "tanh", ae_lr = 1e-04, ae_decay_factor = 1/3, ae_patience_LR = 3, 53 | ae_patience_ES = 9, clust_weight = 1., load_encoder_weights = True): 54 | """ Initializes the main CarDEC model. 55 | 56 | 57 | Arguments: 58 | ------------------------------------------------------------------ 59 | - load_fullmodel: `bool`, If True, the API will try to load the weights for the full model from the weight directory. 60 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 61 | - LVG_dims: `list`, the number of output features for each layer of the LVG encoder. The length of the list determines the number of layers. 62 | - tol: `float`, stop criterion, clustering procedure will be stopped when the difference ratio between the current iteration and last iteration larger than tol. 63 | - n_clusters: `int`, The number of clusters into which cells will be grouped. 64 | - random_seed: `int`, The seed used for random weight intialization. 65 | - louvain_seed: `int`, The seed used for louvain clustering intialization. 66 | - n_neighbors: `int`, The number of neighbors used for building the graph needed for louvain clustering. 67 | - pretrain_epochs: `int`, The maximum number of epochs for pretraining the HVG autoencoder. In practice, early stopping criteria should stop training much earlier. 68 | - batch_size_pretrain: `int`, The batch size used for pretraining the HVG autoencoder. 69 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 70 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 71 | - ae_lr: `float`, The learning rate for pretraining the HVG autoencoder. 72 | - ae_decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 73 | - ae_patience_LR: `int`, the number of epochs which the validation loss is allowed to increase before learning rate is decayed when pretraining the autoencoder. 74 | - ae_patience_ES: `int`, the number of epochs which the validation loss is allowed to increase before training is halted when pretraining the autoencoder. 
75 | - clust_weight: `float`, a number between 0 and 2 which balances the clustering and reconstruction losses. 76 | - load_encoder_weights: `bool`, If True, the API will try to load the weights for the HVG encoder from the weight directory. 77 | """ 78 | 79 | assert n_clusters is not None 80 | 81 | if 'normalized input' not in list(self.dataset.layers): 82 | self.dataset = normalize_scanpy(self.dataset, *self.norm_args) 83 | 84 | p = sum(self.dataset.var["Variance Type"] == 'HVG') 85 | self.dims = [p] + dims 86 | 87 | if self.LVG: 88 | LVG_p = sum(self.dataset.var["Variance Type"] == 'LVG') 89 | self.LVG_dims = [LVG_p] + LVG_dims 90 | else: 91 | self.LVG_dims = None 92 | 93 | self.load_fullmodel = load_fullmodel 94 | self.weights_exist = os.path.isfile("./" + self.weights_dir + "/tuned_CarDECweights.index") 95 | 96 | set_centroids = not (self.load_fullmodel and self.weights_exist) 97 | 98 | self.model = CarDEC_Model(self.dataset, self.dims, self.LVG_dims, tol, n_clusters, random_seed, louvain_seed, 99 | n_neighbors, pretrain_epochs, batch_size_pretrain, ae_decay_factor, 100 | ae_patience_LR, ae_patience_ES, act, actincenter, ae_lr, 101 | clust_weight, load_encoder_weights, set_centroids, self.weights_dir) 102 | 103 | def make_inference(self, batch_size = 64, val_split = 0.1, lr = 1e-04, decay_factor = 1/3, 104 | iteration_patience_LR = 3, iteration_patience_ES = 6, maxiter = 1e3, epochs_fit = 1, 105 | optimizer = Adam(), printperiter = None, denoise_all = True, denoise_list = None): 106 | """ This class method makes inference on the data (batch correction + denoising) with the main CarDEC model. 107 | 108 | 109 | Arguments: 110 | ------------------------------------------------------------------ 111 | - batch_size: `int`, The batch size used for training the full model. 112 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 113 | - lr: `float`, The learning rate for training the full model. 114 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 115 | - iteration_patience_LR: `int`, The number of iterations tolerated before decaying the learning rate during which the number of cells that change assignment is less than tol. 116 | - iteration_patience_ES: `int`, The number of iterations tolerated before stopping training during which the number of cells that change assignment is less than tol. 117 | - maxiter: `int`, The maximum number of iterations allowed to train the full model. In practice, the model will halt training long before hitting this limit. 118 | - epochs_fit: `int`, The number of epochs during which to fine-tune weights, before updating the target distribution. 119 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 120 | - printperiter: `int`, Optional integer argument. If specified, denoised values will be returned every printperiter epochs, so that the user can evaluate the progress of denoising as training continues. 121 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 122 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 123 | 124 | Returns: 125 | ------------------------------------------------------------------ 126 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression provided only for listed cells.
If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 127 | """ 128 | 129 | if denoise_list is not None: 130 | denoise_all = False 131 | 132 | if not self.loaded: 133 | if self.load_fullmodel and self.weights_exist: 134 | self.dataset = self.model.reload_model(self.dataset, batch_size, denoise_all) 135 | 136 | elif not self.weights_exist: 137 | print("CarDEC Model Weights not detected. Training full model.\n") 138 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 139 | iteration_patience_LR, iteration_patience_ES, maxiter, 140 | epochs_fit, optimizer, printperiter, denoise_all) 141 | 142 | else: 143 | print("Training full model.\n") 144 | self.dataset = self.model.train(self.dataset, batch_size, val_split, lr, decay_factor, 145 | iteration_patience_LR, iteration_patience_ES, 146 | maxiter, epochs_fit, optimizer, printperiter, denoise_all) 147 | 148 | 149 | self.loaded = True 150 | 151 | elif denoise_all: 152 | self.dataset = self.model.make_outputs(self.dataset, batch_size, True) 153 | 154 | if denoise_list is not None: 155 | denoise_list = list(denoise_list) 156 | indices = [x in denoise_list for x in self.dataset.obs.index] 157 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 158 | denoised.index = self.dataset.obs.index[indices] 159 | denoised.columns = self.dataset.var.index 160 | 161 | 162 | if self.LVG: 163 | hvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"][indices]) 164 | lvg_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["LVG embedding"][indices]) 165 | 166 | input_ds = tf.data.Dataset.zip((hvg_ds, lvg_ds)) 167 | input_ds = input_ds.batch(batch_size) 168 | 169 | start = 0 170 | for x in input_ds: 171 | denoised_batch = {'HVG_denoised': self.model.decoder(x[0]), 'LVG_denoised': self.model.decoderLVG(x[1])} 172 | q_batch = self.model.clustering_layer(x[0]) 173 | end = start + q_batch.shape[0] 174 | 175 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'HVG')[0]] = denoised_batch['HVG_denoised'].numpy() 176 | denoised.iloc[start:end, np.where(self.dataset.var['Variance Type'] == 'LVG')[0]] = denoised_batch['LVG_denoised'].numpy() 177 | 178 | start = end 179 | 180 | else: 181 | input_ds = tf.data.Dataset.from_tensor_slices(self.dataset.obsm["embedding"]) 182 | 183 | input_ds = input_ds.batch(batch_size) 184 | 185 | start = 0 186 | 187 | for x in input_ds: 188 | denoised_batch = {'HVG_denoised': self.model.decoder(x)} 189 | q_batch = self.model.clustering_layer(x) 190 | end = start + q_batch.shape[0] 191 | 192 | denoised.iloc[start:end] = denoised_batch['HVG_denoised'].numpy() 193 | 194 | start = end 195 | 196 | return denoised 197 | 198 | print(" ") 199 | 200 | def model_counts(self, load_weights = True, act = 'relu', random_seed = 201809, 201 | optimizer = Adam(), keep_dispersion = False, num_epochs = 2000, batch_size_count = 64, 202 | val_split = 0.1, lr = 1e-03, decay_factor = 1/3, patience_LR = 3, patience_ES = 9, 203 | denoise_all = True, denoise_list = None): 204 | """ This class method makes inference on the data on the count scale. 205 | 206 | 207 | Arguments: 208 | ------------------------------------------------------------------ 209 | - load_weights: `bool`, If true, the API will attempt to load the weights for the count model. 210 | - act: `str`, A string specifying the activation function for intermediate layers of the count models. 
211 | - random_seed: `int`, A seed used for weight initialization. 212 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 213 | - keep_dispersion: `bool`, If True, the gene, cell dispersions will be returned as well. 214 | - num_epochs: `int`, The maximum number of epochs allowed to train each count model. In practice, the model will halt 215 | training long before hitting this limit. 216 | - batch_size_count: `int`, The batch size used for training the count models. 217 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 218 | - lr: `float`, The learning rate for training the count models. 219 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 220 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss does not decrease. 221 | - patience_ES: `int`, The number of iterations tolerated before stopping training during which the validation loss does not decrease. 222 | - denoise_all: `bool`, If True, then denoised expression values are provided for all cells. 223 | - denoise_list: `list`, An optional list of cell names (as strings). If provided, denoised values will be computed only for cells in this list. 224 | 225 | Returns: 226 | ------------------------------------------------------------------ 227 | - denoised: `pd.DataFrame`, (Optional) If denoise_list was specified, then this will be an array of denoised expression on the count scale provided only for listed cells. If denoise_all was instead specified as True, then denoised expression for all cells will be added as a layer to adata. 228 | - denoised_dispersion: `pd.DataFrame`, (Optional) If denoise_list was specified and "keep_dispersion" was set to True, then this will be an array of dispersions from the fitted negative binomial model provided only for listed cells. If denoise_all was instead specified as False, but "keep_dispersion" was still True then dispersions for all cells will be added as a layer to adata. 229 | """ 230 | 231 | if denoise_list is not None: 232 | denoise_all = False 233 | 234 | if not self.count_loaded: 235 | weights_dir = os.path.join(self.weights_dir, 'count weights') 236 | weight_files_exist = os.path.isfile(weights_dir + "/countmodel_weights_HVG Count.index") 237 | if self.LVG: 238 | weight_files_exist = weight_files_exist and os.path.isfile(weights_dir + "/countmodel_weights_LVG Count.index") 239 | 240 | init_args = (act, random_seed, self.model.splitseed, optimizer, weights_dir) 241 | train_args = (num_epochs, batch_size_count, val_split, lr, decay_factor, patience_LR, patience_ES) 242 | 243 | self.nbmodel = count_model(self.dims, *init_args, n_features = self.dims[-1], mode = 'HVG') 244 | 245 | if load_weights and weight_files_exist: 246 | print("Weight files for count models detected, loading weights.") 247 | self.nbmodel.load_model() 248 | 249 | elif load_weights: 250 | print("Weight files for count models not detected. 
Training HVG count model.\n") 251 | self.nbmodel.train(self.dataset, *train_args) 252 | 253 | else: 254 | print("Training HVG count model.\n") 255 | self.nbmodel.train(self.dataset, *train_args) 256 | 257 | if self.LVG: 258 | self.nbmodel_lvg = count_model(self.LVG_dims, *init_args, 259 | n_features = self.dims[-1] + self.LVG_dims[-1], mode = 'LVG') 260 | 261 | if load_weights and weight_files_exist: 262 | self.nbmodel_lvg.load_model() 263 | print("Count model weights loaded successfully.") 264 | 265 | elif load_weights: 266 | print("\n \n \n") 267 | print("Training LVG count model.\n") 268 | self.nbmodel_lvg.train(self.dataset, *train_args) 269 | 270 | else: 271 | print("\n \n \n") 272 | print("Training LVG count model.\n") 273 | self.nbmodel_lvg.train(self.dataset, *train_args) 274 | 275 | self.count_loaded = True 276 | 277 | if denoise_all: 278 | self.nbmodel.denoise(self.dataset, keep_dispersion, batch_size_count) 279 | if self.LVG: 280 | self.nbmodel_lvg.denoise(self.dataset, keep_dispersion, batch_size_count) 281 | 282 | elif denoise_list is not None: 283 | denoise_list = list(denoise_list) 284 | indices = [x in denoise_list for x in self.dataset.obs.index] 285 | denoised = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 286 | denoised.index = self.dataset.obs.index[indices] 287 | denoised.columns = self.dataset.var.index 288 | if keep_dispersion: 289 | denoised_dispersion = DataFrame(np.zeros((len(denoise_list), self.dataset.shape[1]), dtype = 'float32')) 290 | denoised_dispersion.index = self.dataset.obs.index[indices] 291 | denoised_dispersion.columns = self.dataset.var.index 292 | 293 | input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['embedding'][indices]) 294 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 295 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 296 | input_ds = input_ds.batch(batch_size_count) 297 | 298 | type_indices = np.where(self.dataset.var['Variance Type'] == 'HVG')[0] 299 | 300 | if not keep_dispersion: 301 | start = 0 302 | for x in input_ds: 303 | end = start + x[0].shape[0] 304 | denoised.iloc[start:end, type_indices] = self.nbmodel(*x)[0].numpy() 305 | start = end 306 | 307 | else: 308 | start = 0 309 | for x in input_ds: 310 | end = start + x[0].shape[0] 311 | batch_output = self.nbmodel(*x) 312 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 313 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 314 | start = end 315 | 316 | if self.LVG: 317 | input_ds_embed = tf.data.Dataset.from_tensor_slices(self.dataset.obsm['LVG embedding'][indices]) 318 | input_ds_sf = tf.data.Dataset.from_tensor_slices(self.dataset.obs['size factors'][indices]) 319 | input_ds = tf.data.Dataset.zip((input_ds_embed, input_ds_sf)) 320 | input_ds = input_ds.batch(batch_size_count) 321 | 322 | type_indices = np.where(self.dataset.var['Variance Type'] == 'LVG')[0] 323 | 324 | if not keep_dispersion: 325 | start = 0 326 | for x in input_ds: 327 | end = start + x[0].shape[0] 328 | denoised.iloc[start:end, type_indices] = self.nbmodel_lvg(*x)[0].numpy() 329 | start = end 330 | 331 | else: 332 | start = 0 333 | for x in input_ds: 334 | end = start + x[0].shape[0] 335 | batch_output = self.nbmodel_lvg(*x) 336 | denoised.iloc[start:end, type_indices] = batch_output[0].numpy() 337 | denoised_dispersion.iloc[start:end, type_indices] = batch_output[1].numpy() 338 | start = end 339 | 340 | if not keep_dispersion: 341 | return 
denoised 342 | else: 343 | return denoised, denoised_dispersion 344 | 345 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_SAE.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, MSEloss 2 | from .CarDEC_dataloaders import simpleloader, aeloader 3 | 4 | import tensorflow as tf 5 | from tensorflow.keras import Model, Sequential 6 | from tensorflow.keras.layers import Dense, concatenate 7 | from tensorflow.keras.optimizers import Adam 8 | from tensorflow.keras.backend import set_floatx 9 | from time import time 10 | 11 | import random 12 | import numpy as np 13 | from scipy.stats import zscore 14 | import os 15 | 16 | 17 | set_floatx('float32') 18 | 19 | 20 | class SAE(Model): 21 | def __init__(self, dims, act = 'relu', actincenter = "tanh", 22 | random_seed = 201809, splitseed = 215, init = "glorot_uniform", optimizer = Adam(), 23 | weights_dir = 'CarDEC Weights'): 24 | """ This class method initializes the SAE model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the HVG encoder. The length of the list determines the number of layers. 30 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 31 | - actincenter: `str`, The activation function used for the bottleneck layer of CarDEC. 32 | - random_seed: `int`, The seed used for random weight intialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between iterations to ensure the same cells are always used for validation. 34 | - init: `str`, The weight initialization strategy for the autoencoder. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 37 | """ 38 | 39 | super(SAE, self).__init__() 40 | 41 | tf.keras.backend.clear_session() 42 | 43 | self.weights_dir = weights_dir 44 | 45 | self.dims = dims 46 | self.n_stacks = len(dims) - 1 47 | self.init = init 48 | self.optimizer = optimizer 49 | self.random_seed = random_seed 50 | self.splitseed = splitseed 51 | 52 | self.activation = act 53 | self.actincenter = actincenter #hidden layer activation function 54 | 55 | #set random seed 56 | random.seed(random_seed) 57 | np.random.seed(random_seed) 58 | tf.random.set_seed(random_seed) 59 | 60 | encoder_layers = [] 61 | for i in range(self.n_stacks-1): 62 | encoder_layers.append(Dense(self.dims[i + 1], kernel_initializer = self.init, activation = self.activation, name='encoder_%d' % i)) 63 | 64 | encoder_layers.append(Dense(self.dims[-1], kernel_initializer=self.init, activation=self.actincenter, name='embedding')) 65 | self.encoder = Sequential(encoder_layers, name = 'encoder') 66 | 67 | decoder_layers = [] 68 | for i in range(self.n_stacks - 1, 0, -1): 69 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 70 | , name = 'decoder%d' % (i-1))) 71 | 72 | decoder_layers.append(Dense(self.dims[0], activation = 'linear', name='output')) 73 | 74 | self.decoder = Sequential(decoder_layers, name = 'decoder') 75 | 76 | self.construct() 77 | 78 | def call(self, x): 79 | """ This is the forward pass of the model. 
80 | 81 | 82 | ***Inputs*** 83 | - x: `tf.Tensor`, an input tensor of shape (n_obs, p_HVG). 84 | 85 | ***Outputs*** 86 | - output: `tf.Tensor`, A (n_obs, p_HVG) tensor of denoised HVG expression. 87 | """ 88 | 89 | c = self.encoder(x) 90 | 91 | output = self.decoder(c) 92 | 93 | return output 94 | 95 | def load_encoder(self, random_seed = 2312): 96 | """ This class method can be used to load the encoder weights, while randomly reinitializing the decoder weights. 97 | 98 | 99 | Arguments: 100 | ------------------------------------------------------------------ 101 | - random_seed: `int`, Seed for reinitializing the decoder. 102 | """ 103 | 104 | tf.keras.backend.clear_session() 105 | 106 | #set random seed 107 | random.seed(random_seed) 108 | np.random.seed(random_seed) 109 | tf.random.set_seed(random_seed) 110 | 111 | self.encoder.load_weights("./" + self.weights_dir + "/pretrained_encoder_weights").expect_partial() 112 | 113 | decoder_layers = [] 114 | for i in range(self.n_stacks - 1, 0, -1): 115 | decoder_layers.append(Dense(self.dims[i], kernel_initializer = self.init, activation = self.activation 116 | , name='decoder%d' % (i-1))) 117 | self.decoder_base = Sequential(decoder_layers, name = 'decoderbase') 118 | 119 | self.output_layer = Dense(self.dims[0], activation = 'linear', name='output') 120 | 121 | self.construct(summarize = False) 122 | 123 | def load_autoencoder(self, ): 124 | """ This class method can be used to load the full model's weights.""" 125 | 126 | tf.keras.backend.clear_session() 127 | 128 | self.load_weights("./" + self.weights_dir + "/pretrained_autoencoder_weights").expect_partial() 129 | 130 | def construct(self, summarize = False): 131 | """ This class method fully initalizes the TensorFlow model. 132 | 133 | 134 | Arguments: 135 | ------------------------------------------------------------------ 136 | - summarize: `bool`, If True, then print a summary of the model architecture. 137 | """ 138 | 139 | x = tf.zeros(shape = (1, self.dims[0]), dtype=float) 140 | out = self(x) 141 | 142 | if summarize: 143 | print("----------Autoencoder Architecture----------") 144 | self.summary() 145 | 146 | print("\n----------Encoder Sub-Architecture----------") 147 | self.encoder.summary() 148 | 149 | print("\n----------Base Decoder Sub-Architecture----------") 150 | self.decoder.summary() 151 | 152 | def denoise(self, adata, batch_size = 64): 153 | """ This class method can be used to denoise gene expression for each cell. 154 | 155 | 156 | Arguments: 157 | ------------------------------------------------------------------ 158 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 159 | - batch_size: `int`, The batch size used for computing denoised expression. 160 | 161 | Returns: 162 | ------------------------------------------------------------------ 163 | - output: `np.ndarray`, Numpy array of denoised expression of shape (n_obs, n_vars) 164 | """ 165 | 166 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 167 | 168 | output = np.zeros((adata.shape[0], self.dims[0]), dtype = 'float32') 169 | start = 0 170 | 171 | for x in input_ds: 172 | end = start + x.shape[0] 173 | output[start:end] = self(x).numpy() 174 | start = end 175 | 176 | return output 177 | 178 | def embed(self, adata, batch_size = 64): 179 | """ This class method can be used to compute the low-dimension embedding for HVG features. 
180 | 181 | 182 | Arguments: 183 | ------------------------------------------------------------------ 184 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 185 | - batch_size: `int`, The batch size for filling the array of low dimension embeddings. 186 | 187 | Returns: 188 | ------------------------------------------------------------------ 189 | - embedding: `np.ndarray`, Array of shape (n_obs, n_vars) containing the cell HVG embeddings. 190 | """ 191 | 192 | input_ds = simpleloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], batch_size) 193 | 194 | embedding = np.zeros((adata.shape[0], self.dims[-1]), dtype = 'float32') 195 | 196 | start = 0 197 | for x in input_ds: 198 | end = start + x.shape[0] 199 | embedding[start:end] = self.encoder(x).numpy() 200 | start = end 201 | 202 | return embedding 203 | 204 | def makegenerators(self, adata, val_split, batch_size, splitseed): 205 | """ This class method creates training and validation data generators for the current input data. 206 | 207 | 208 | Arguments: 209 | ------------------------------------------------------------------ 210 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - batch_size: `int`, The batch size used for training the model. 213 | - splitseed: `int`, The seed used to split cells between training and validation. 214 | 215 | Returns: 216 | ------------------------------------------------------------------ 217 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 218 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 219 | """ 220 | 221 | return aeloader(adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], adata.layers["normalized input"][:, adata.var['Variance Type'] == 'HVG'], val_frac = val_split, batch_size = batch_size, splitseed = splitseed) 222 | 223 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 224 | patience_LR = 3, patience_ES = 9, save_fullmodel = True): 225 | """ This class method can be used to train the SAE. 226 | 227 | 228 | Arguments: 229 | ------------------------------------------------------------------ 230 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). 231 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt training long before hitting this limit. 232 | - batch_size: `int`, The batch size used for training the full model. 233 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 234 | - lr: `float`, The learning rate for training the full model. 235 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not decreasing. 236 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the validation loss fails to decrease. 237 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to decrease. 238 | - save_fullmodel: `bool`, If True, save the full model's weights, not just the encoder. 
239 | """ 240 | 241 | tf.keras.backend.clear_session() 242 | 243 | dataset = self.makegenerators(adata, val_split = 0.1, batch_size = batch_size, splitseed = self.splitseed) 244 | 245 | counter_LR = 0 246 | counter_ES = 0 247 | best_loss = np.inf 248 | 249 | self.optimizer.lr = lr 250 | 251 | total_start = time() 252 | for epoch in range(num_epochs): 253 | epoch_start = time() 254 | 255 | epoch_loss_avg = tf.keras.metrics.Mean() 256 | epoch_loss_avg_val = tf.keras.metrics.Mean() 257 | 258 | # Training loop - using batches of batch_size 259 | for x, target in dataset(val = False): 260 | loss_value, grads = grad(self, x, target, MSEloss) 261 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 262 | epoch_loss_avg(loss_value) # Add current batch loss 263 | 264 | # Validation Loop 265 | for x, target in dataset(val = True): 266 | output = self(x) 267 | loss_value = MSEloss(target, output) 268 | epoch_loss_avg_val(loss_value) 269 | 270 | current_loss_val = epoch_loss_avg_val.result() 271 | 272 | epoch_time = round(time() - epoch_start, 1) 273 | 274 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 275 | 276 | if(current_loss_val + 10**(-3) < best_loss): 277 | counter_LR = 0 278 | counter_ES = 0 279 | best_loss = current_loss_val 280 | else: 281 | counter_LR = counter_LR + 1 282 | counter_ES = counter_ES + 1 283 | 284 | if patience_ES <= counter_ES: 285 | break 286 | 287 | if patience_LR <= counter_LR: 288 | self.optimizer.lr = self.optimizer.lr * decay_factor 289 | counter_LR = 0 290 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 291 | 292 | # End epoch 293 | 294 | total_time = round(time() - total_start, 2) 295 | 296 | if not os.path.isdir("./" + self.weights_dir): 297 | os.mkdir("./" + self.weights_dir) 298 | 299 | self.save_weights("./" + self.weights_dir + "/pretrained_autoencoder_weights", save_format='tf') 300 | self.encoder.save_weights("./" + self.weights_dir + "/pretrained_encoder_weights", save_format='tf') 301 | 302 | print('\nTraining Completed') 303 | print("Total training time: " + str(total_time) + " seconds") 304 | 305 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_count_decoder.py: -------------------------------------------------------------------------------- 1 | from .CarDEC_optimization import grad_reconstruction as grad, NBloss 2 | from .CarDEC_utils import build_dir 3 | from .CarDEC_dataloaders import countloader, tupleloader 4 | 5 | import tensorflow as tf 6 | from tensorflow.keras import Model, Sequential 7 | from tensorflow.keras.layers import Dense, concatenate, Lambda 8 | from tensorflow.keras.optimizers import Adam 9 | from tensorflow.keras.backend import exp as tf_exp, set_floatx 10 | from time import time 11 | 12 | import random 13 | import numpy as np 14 | from scipy.stats import zscore 15 | import os 16 | 17 | 18 | set_floatx('float32') 19 | 20 | 21 | class count_model(Model): 22 | def __init__(self, dims, act = 'relu', random_seed = 201809, splitseed = 215, optimizer = Adam(), 23 | weights_dir = 'CarDEC Count Weights', n_features = 32, mode = 'HVG'): 24 | """ This class method initializes the count model. 25 | 26 | 27 | Arguments: 28 | ------------------------------------------------------------------ 29 | - dims: `list`, the number of output features for each layer of the model. 
The length of the list determines the 30 | number of layers. 31 | - act: `str`, The activation function used for the intermediate layers of CarDEC, other than the bottleneck layer. 32 | - random_seed: `int`, The seed used for random weight intialization. 33 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 34 | iterations to ensure the same cells are always used for validation. 35 | - optimizer: `tensorflow.python.keras.optimizer_v2`, An instance of a TensorFlow optimizer. 36 | - weights_dir: `str`, the path in which to save the weights of the CarDEC model. 37 | - n_features: `int`, the number of input features. 38 | - mode: `str`, String identifying whether HVGs or LVGs are being modeled. 39 | """ 40 | 41 | super(count_model, self).__init__() 42 | 43 | tf.keras.backend.clear_session() 44 | 45 | self.mode = mode 46 | self.name_ = mode + " Count" 47 | 48 | if mode == 'HVG': 49 | self.embed_name = 'embedding' 50 | else: 51 | self.embed_name = 'LVG embedding' 52 | 53 | self.weights_dir = weights_dir 54 | 55 | self.dims = dims 56 | n_stacks = len(dims) - 1 57 | 58 | self.optimizer = optimizer 59 | self.random_seed = random_seed 60 | self.splitseed = splitseed 61 | 62 | random.seed(random_seed) 63 | np.random.seed(random_seed) 64 | tf.random.set_seed(random_seed) 65 | 66 | self.activation = act 67 | self.MeanAct = lambda x: tf.clip_by_value(tf_exp(x), 1e-5, 1e6) 68 | self.DispAct = lambda x: tf.clip_by_value(tf.nn.softplus(x), 1e-4, 1e4) 69 | 70 | model_layers = [] 71 | for i in range(n_stacks - 1, 0, -1): 72 | model_layers.append(Dense(dims[i], kernel_initializer = "glorot_uniform", activation = self.activation 73 | , name='base%d' % (i-1))) 74 | self.base = Sequential(model_layers, name = 'base') 75 | 76 | self.mean_layer = Dense(dims[0], activation = self.MeanAct, name='mean') 77 | self.disp_layer = Dense(dims[0], activation = self.DispAct, name='dispersion') 78 | 79 | self.rescale = Lambda(lambda l: tf.matmul(tf.linalg.diag(l[0]), l[1]), name = 'sf scaling') 80 | 81 | build_dir(self.weights_dir) 82 | 83 | self.construct(n_features, self.name_) 84 | 85 | def call(self, x, s): 86 | """ This is the forward pass of the model. 87 | 88 | 89 | ***Inputs*** 90 | - x: `tf.Tensor`, an input tensor of shape (b, p) 91 | - s: `tf.Tensor`, and input tensor of shape (b, ) containing the size factor for each cell 92 | 93 | ***Outputs*** 94 | - mean: `tf.Tensor`, A (b, p_gene) tensor of negative binomial means for each cell, gene. 95 | - disp: `tf.Tensor`, A (b, p_gene) tensor of negative binomial dispersions for each cell, gene. 96 | """ 97 | 98 | x = self.base(x) 99 | 100 | disp = self.disp_layer(x) 101 | mean = self.mean_layer(x) 102 | mean = self.rescale([s, mean]) 103 | 104 | return mean, disp 105 | 106 | def load_model(self, ): 107 | """ This class method can be used to load the model's weights.""" 108 | 109 | tf.keras.backend.clear_session() 110 | 111 | self.load_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_)).expect_partial() 112 | 113 | def construct(self, n_features, name, summarize = False): 114 | """ This class method fully initalizes the TensorFlow model. 115 | 116 | 117 | Arguments: 118 | ------------------------------------------------------------------ 119 | - n_features: `int`, the number of input features. 120 | - name: `str`, Model name (to distinguish HVG and LVG models). 121 | - summarize: `bool`, If True, then print a summary of the model architecture. 
122 | """ 123 | 124 | x = [tf.zeros(shape = (1, n_features), dtype='float32'), tf.ones(shape = (1,), dtype='float32')] 125 | out = self(*x) 126 | 127 | if summarize: 128 | print("----------Count Model " + name + " Architecture----------") 129 | self.summary() 130 | 131 | print("\n----------Base Sub-Architecture----------") 132 | self.base.summary() 133 | 134 | def denoise(self, adata, keep_dispersion = False, batch_size = 64): 135 | """ This class method can be used to denoise gene expression for each cell on the count scale. 136 | 137 | 138 | Arguments: 139 | ------------------------------------------------------------------ 140 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 141 | to cells and columns to genes. 142 | - keep_dispersion: `bool`, If True, also return the dispersion for each gene, cell (added as a layer to adata)/ 143 | - batch_size: `int`, The batch size used for computing denoised expression. 144 | 145 | Returns: 146 | ------------------------------------------------------------------ 147 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Negative binomial means (and optionally 148 | dispersions) added as layers. 149 | """ 150 | 151 | input_ds = tupleloader(adata.obsm[self.embed_name], adata.obs['size factors'], batch_size = batch_size) 152 | 153 | if "denoised counts" not in list(adata.layers): 154 | adata.layers["denoised counts"] = np.zeros(adata.shape, dtype = 'float32') 155 | 156 | type_indices = adata.var['Variance Type'] == self.mode 157 | 158 | if not keep_dispersion: 159 | start = 0 160 | for x in input_ds: 161 | end = start + x[0].shape[0] 162 | adata.layers["denoised counts"][start:end, type_indices] = self(*x)[0].numpy() 163 | start = end 164 | 165 | else: 166 | if "dispersion" not in list(adata.layers): 167 | adata.layers["dispersion"] = np.zeros(adata.shape, dtype = 'float32') 168 | 169 | start = 0 170 | for x in input_ds: 171 | end = start + x[0].shape[0] 172 | batch_output = self(*x) 173 | adata.layers["denoised counts"][start:end, type_indices] = batch_output[0].numpy() 174 | adata.layers["dispersion"][start:end, type_indices] = batch_output[1].numpy() 175 | start = end 176 | 177 | def makegenerators(self, adata, val_split, batch_size, splitseed): 178 | """ This class method creates training and validation data generators for the current input data. 179 | 180 | 181 | Arguments: 182 | ------------------------------------------------------------------ 183 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond 184 | to cells and columns to genes. 185 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 186 | - batch_size: `int`, The batch size used for training the model. 187 | - splitseed: `int`, The seed used to split cells between training and validation. Should be consistent between 188 | iterations to ensure the same cells are always used for validation. 189 | 190 | Returns: 191 | ------------------------------------------------------------------ 192 | - train_dataset: `tf.data.Dataset`, Dataset that returns training examples. 193 | - val_dataset: `tf.data.Dataset`, Dataset that returns validation examples. 
194 | """ 195 | 196 | return countloader(adata.obsm[self.embed_name], adata.X[:, adata.var['Variance Type'] == self.mode], adata.obs['size factors'], 197 | val_split, batch_size, splitseed) 198 | 199 | def train(self, adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-03, decay_factor = 1/3, 200 | patience_LR = 3, patience_ES = 9): 201 | """ This class method can be used to train the count model. 202 | 203 | 204 | Arguments: 205 | ------------------------------------------------------------------ 206 | - adata: `anndata.AnnData`, The annotated data matrix of shape (n_obs, n_vars). Rows correspond 207 | to cells and columns to genes. 208 | - num_epochs: `int`, The maximum number of epochs allowed to train the full model. In practice, the model will halt 209 | training long before hitting this limit. 210 | - batch_size: `int`, The batch size used for training the full model. 211 | - val_split: `float`, The fraction of cells to be reserved for validation during this step. 212 | - lr: `float`, The learning rate for training the full model. 213 | - decay_factor: `float`, The multiplicative factor by which to decay the learning rate when validation loss is not 214 | decreasing. 215 | - patience_LR: `int`, The number of epochs tolerated before decaying the learning rate during which the 216 | validation loss fails to decrease. 217 | - patience_ES: `int`, The number of epochs tolerated before stopping training during which the validation loss fails to 218 | decrease. 219 | """ 220 | 221 | tf.keras.backend.clear_session() 222 | 223 | loss = NBloss 224 | 225 | dataset = self.makegenerators(adata, val_split = val_split, batch_size = batch_size, splitseed = self.splitseed) 226 | 227 | counter_LR = 0 228 | counter_ES = 0 229 | best_loss = np.inf 230 | 231 | self.optimizer.lr = lr 232 | 233 | total_start = time() 234 | 235 | for epoch in range(num_epochs): 236 | epoch_start = time() 237 | 238 | epoch_loss_avg = tf.keras.metrics.Mean() 239 | epoch_loss_avg_val = tf.keras.metrics.Mean() 240 | 241 | # Training loop - using batches of batch_size 242 | for x, target in dataset(val = False): 243 | loss_value, grads = grad(self, x, target, loss) 244 | self.optimizer.apply_gradients(zip(grads, self.trainable_variables)) 245 | epoch_loss_avg(loss_value) # Add current batch loss 246 | 247 | # Validation Loop 248 | for x, target in dataset(val = True): 249 | output = self(*x) 250 | loss_value = loss(target, output) 251 | epoch_loss_avg_val(loss_value) 252 | 253 | current_loss_val = epoch_loss_avg_val.result() 254 | 255 | epoch_time = round(time() - epoch_start, 1) 256 | 257 | print("Epoch {:03d}: Training Loss: {:.3f}, Validation Loss: {:.3f}, Time: {:.1f} s".format(epoch, epoch_loss_avg.result().numpy(), epoch_loss_avg_val.result().numpy(), epoch_time)) 258 | 259 | if(current_loss_val + 10**(-3) < best_loss): 260 | counter_LR = 0 261 | counter_ES = 0 262 | best_loss = current_loss_val 263 | else: 264 | counter_LR = counter_LR + 1 265 | counter_ES = counter_ES + 1 266 | 267 | if patience_ES <= counter_ES: 268 | break 269 | 270 | if patience_LR <= counter_LR: 271 | self.optimizer.lr = self.optimizer.lr * decay_factor 272 | counter_LR = 0 273 | print("\nDecaying Learning Rate to: " + str(self.optimizer.lr.numpy())) 274 | 275 | # End epoch 276 | 277 | total_time = round(time() - total_start, 2) 278 | 279 | if not os.path.isdir("./" + self.weights_dir): 280 | os.mkdir("./" + self.weights_dir) 281 | 282 | self.save_weights(os.path.join(self.weights_dir, "countmodel_weights_" + self.name_), save_format='tf') 283 |
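# --- Illustrative note (not part of the original CarDEC source): the loop above uses a
# patience scheme in which the validation loss must improve by more than 1e-3 to reset the
# counters; after `patience_LR` stalled epochs the learning rate is multiplied by
# `decay_factor`, and after `patience_ES` stalled epochs training halts. A hypothetical call
# sequence for this class, assuming `adata` already carries the 'embedding' obsm array,
# 'size factors' obs column, 'Variance Type' var column, and raw counts in adata.X
# (the leading 2000 below is a placeholder for the number of HVGs), might look like:
#
#     nb = count_model([2000, 128, 32], act = 'relu', n_features = 32, mode = 'HVG')
#     nb.train(adata, num_epochs = 2000, batch_size = 64, val_split = 0.1, lr = 1e-3)
#     nb.denoise(adata, keep_dispersion = True)  # adds 'denoised counts' and 'dispersion' layers
# ---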
284 | print('\nTraining Completed') 285 | print("Total training time: " + str(total_time) + " seconds") 286 | 287 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_dataloaders.py: -------------------------------------------------------------------------------- 1 | from tensorflow import convert_to_tensor as tensor 2 | from numpy import setdiff1d 3 | from numpy.random import choice, seed 4 | 5 | class batch_sampler(object): 6 | def __init__(self, array, val_frac, batch_size, splitseed): 7 | seed(splitseed) 8 | self.val_indices = choice(range(len(array)), round(val_frac * len(array)), False) 9 | self.train_indices = setdiff1d(range(len(array)), self.val_indices) 10 | self.batch_size = batch_size 11 | 12 | def __iter__(self): 13 | batch = [] 14 | 15 | if self.val: 16 | for idx in self.val_indices: 17 | batch.append(idx) 18 | 19 | if len(batch) == self.batch_size: 20 | yield batch 21 | batch = [] 22 | 23 | else: 24 | train_idx = choice(self.train_indices, len(self.train_indices), False) 25 | 26 | for idx in train_idx: 27 | batch.append(idx) 28 | 29 | if len(batch) == self.batch_size: 30 | yield batch 31 | batch = [] 32 | 33 | if batch: 34 | yield batch 35 | 36 | def __call__(self, val): 37 | self.val = val 38 | return self 39 | 40 | class simpleloader(object): 41 | def __init__(self, array, batch_size): 42 | self.array = array 43 | self.batch_size = batch_size 44 | 45 | def __iter__(self): 46 | batch = [] 47 | 48 | for idx in range(len(self.array)): 49 | batch.append(idx) 50 | 51 | if len(batch) == self.batch_size: 52 | yield tensor(self.array[batch].copy()) 53 | batch = [] 54 | 55 | if batch: 56 | yield self.array[batch].copy() 57 | 58 | class tupleloader(object): 59 | def __init__(self, *arrays, batch_size): 60 | self.arrays = arrays 61 | self.batch_size = batch_size 62 | 63 | def __iter__(self): 64 | batch = [] 65 | 66 | for idx in range(len(self.arrays[0])): 67 | batch.append(idx) 68 | 69 | if len(batch) == self.batch_size: 70 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 71 | batch = [] 72 | 73 | if batch: 74 | yield [tensor(arr[batch].copy()) for arr in self.arrays] 75 | 76 | class aeloader(object): 77 | def __init__(self, *arrays, val_frac, batch_size, splitseed): 78 | self.arrays = arrays 79 | self.batch_size = batch_size 80 | self.sampler = batch_sampler(arrays[0], val_frac, batch_size, splitseed) 81 | 82 | def __iter__(self): 83 | for idxs in self.sampler(self.val): 84 | yield [tensor(arr[idxs].copy()) for arr in self.arrays] 85 | 86 | def __call__(self, val): 87 | self.val = val 88 | return self 89 | 90 | class countloader(object): 91 | def __init__(self, embedding, target, sizefactor, val_frac, batch_size, splitseed): 92 | self.sampler = batch_sampler(embedding, val_frac, batch_size, splitseed) 93 | self.embedding = embedding 94 | self.target = target 95 | self.sizefactor = sizefactor 96 | 97 | def __iter__(self): 98 | for idxs in self.sampler(self.val): 99 | yield (tensor(self.embedding[idxs].copy()), tensor(self.sizefactor[idxs].copy())), tensor(self.target[idxs].copy()) 100 | 101 | def __call__(self, val): 102 | self.val = val 103 | return self 104 | 105 | class dataloader(object): 106 | def __init__(self, hvg_input, hvg_target, lvg_input = None, lvg_target = None, val_frac = 0.1, batch_size = 128, splitseed = 0): 107 | self.sampler = batch_sampler(hvg_input, val_frac, batch_size, splitseed) 108 | self.hvg_input = hvg_input 109 | self.hvg_target = hvg_target 110 | self.lvg_input = lvg_input 111 | 
self.lvg_target = lvg_target 112 | 113 | def __iter__(self): 114 | for idxs in self.sampler(self.val): 115 | hvg_input = tensor(self.hvg_input[idxs].copy()) 116 | hvg_target = tensor(self.hvg_target[idxs].copy()) 117 | p_target = tensor(self.p_target[idxs].copy()) 118 | 119 | if (self.lvg_input is not None) and (self.lvg_target is not None): 120 | lvg_input = tensor(self.lvg_input[idxs].copy()) 121 | lvg_target = tensor(self.lvg_target[idxs].copy()) 122 | else: 123 | lvg_input = None 124 | lvg_target = None 125 | 126 | yield [hvg_input, lvg_input], hvg_target, lvg_target, p_target 127 | 128 | def __call__(self, val): 129 | self.val = val 130 | return self 131 | 132 | def update_p(self, new_p_target): 133 | self.p_target = new_p_target -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_layers.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Layer 3 | 4 | class ClusteringLayer(Layer): 5 | def __init__(self, centroids = None, n_clusters = None, n_features = None, alpha=1.0, **kwargs): 6 | """ The clustering layer predicts the a cell's class membership probability for each cell. 7 | 8 | 9 | Arguments: 10 | ------------------------------------------------------------------ 11 | - centroids: `tf.Tensor`, Initial cluster ceontroids after pretraining the model. 12 | - n_clusters: `int`, Number of clusters. 13 | - n_features: `int`, The number of features of the bottleneck embedding space that the centroids live in. 14 | - alpha: parameter in Student's t-distribution. Default to 1.0. 15 | """ 16 | 17 | super(ClusteringLayer, self).__init__(**kwargs) 18 | self.alpha = alpha 19 | self.initial_centroids = centroids 20 | 21 | if centroids is not None: 22 | n_clusters, n_features = centroids.shape 23 | 24 | self.n_features, self.n_clusters = n_features, n_clusters 25 | 26 | assert self.n_clusters is not None 27 | assert self.n_features is not None 28 | 29 | def build(self, input_shape): 30 | """ This class method builds the layer fully once it receives an input tensor. 31 | 32 | 33 | Arguments: 34 | ------------------------------------------------------------------ 35 | - input_shape: `list`, A list specifying the shape of the input tensor. 36 | """ 37 | 38 | assert len(input_shape) == 2 39 | 40 | self.centroids = self.add_weight(name = 'clusters', shape = (self.n_clusters, self.n_features), initializer = 'glorot_uniform') 41 | if self.initial_centroids is not None: 42 | self.set_weights([self.initial_centroids]) 43 | del self.initial_centroids 44 | 45 | self.built = True 46 | 47 | def call(self, x, **kwargs): 48 | """ Forward pass of the clustering layer, 49 | 50 | 51 | ***Inputs***: 52 | - x: `tf.Tensor`, the embedding tensor of shape = (n_obs, n_var) 53 | 54 | ***Returns***: 55 | - q: `tf.Tensor`, student's t-distribution, or soft labels for each sample of shape = (n_obs, n_clusters) 56 | """ 57 | 58 | q = 1.0 / (1.0 + (tf.reduce_sum(tf.square(tf.expand_dims(x, axis = 1) - self.centroids), axis = 2) / self.alpha)) 59 | q = q**((self.alpha + 1.0) / 2.0) 60 | q = q / tf.reduce_sum(q, axis = 1, keepdims = True) 61 | 62 | return q 63 | 64 | def compute_output_shape(self, input_shape): 65 | """ This method infers the output shape from the input shape. 66 | 67 | 68 | Arguments: 69 | ------------------------------------------------------------------ 70 | - input_shape: `list`, A list specifying the shape of the input tensor. 
71 | 72 | Returns: 73 | ------------------------------------------------------------------ 74 | - output_shape: `list`, A tuple specifying the shape of the output for the minibatch (n_obs, n_clusters) 75 | """ 76 | 77 | assert input_shape and len(input_shape) == 2 78 | return input_shape[0], self.n_clusters -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_optimization.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import tensorflow as tf 4 | from tensorflow.keras.losses import KLD, MSE 5 | 6 | 7 | def grad_MainModel(model, input_, target, target_p, total_loss, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 8 | """Function to do a backprop update to the main CarDEC model for a minibatch. 9 | 10 | 11 | Arguments: 12 | ------------------------------------------------------------------ 13 | - model: `tensorflow.keras.Model`, The main CarDEC model. 14 | - input_: `list`, A list containing the input HVG and (optionally) LVG expression tensors of the minibatch for the CarDEC model. 15 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 16 | - target_p: `tf.Tensor`, Tensor containing cluster membership probability targets for the minibatch. 17 | - total_loss: `function`, Function to compute the loss for the main CarDEC model for a minibatch. 18 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 19 | - aeloss_fun: `function`, Function to compute reconstruction loss. 20 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 21 | 22 | Returns: 23 | ------------------------------------------------------------------ 24 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 25 | - gradients: `a list of Tensors`: Gradients to update the model weights. 26 | """ 27 | 28 | with tf.GradientTape() as tape: 29 | denoised_output, cluster_output = model(*input_) 30 | loss_value, aeloss = total_loss(target, denoised_output, target_p, cluster_output, 31 | LVG_target, aeloss_fun, clust_weight) 32 | 33 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 34 | 35 | 36 | def grad_reconstruction(model, input_, target, loss): 37 | """Function to compute gradient update for pretrained autoencoder only. 38 | 39 | 40 | Arguments: 41 | ------------------------------------------------------------------ 42 | - model: `tensorflow.keras.Model`, The main CarDEC model. 43 | - input_: `list`, A list containing the input HVG expression tensor of the minibatch for the CarDEC model. 44 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 45 | - loss: `function`, Function to compute reconstruction loss. 46 | 47 | Returns: 48 | ------------------------------------------------------------------ 49 | - loss_value: `tf.Tensor`: The loss computed for the minibatch. 50 | - gradients: `a list of Tensors`: Gradients to update the model weights. 
51 | """ 52 | 53 | if type(input_) != tuple: 54 | input_ = (input_, ) 55 | 56 | with tf.GradientTape() as tape: 57 | output = model(*input_) 58 | loss_value = loss(target, output) 59 | 60 | return loss_value, tape.gradient(loss_value, model.trainable_variables) 61 | 62 | 63 | def total_loss(target, denoised_output, p, cluster_output_q, LVG_target = None, aeloss_fun = None, clust_weight = 1.): 64 | """Function to compute the loss for the main CarDEC model for a minibatch. 65 | 66 | 67 | Arguments: 68 | ------------------------------------------------------------------ 69 | - target: `tf.Tensor`, Tensor containing the reconstruction target of the minibatch for the HVGs. 70 | - denoised_output: `dict`, Dictionary containing the output tensors from the CarDEC main model's forward pass. 71 | - p: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing cluster membership probability targets for the minibatch. 72 | - cluster_output_q: `tf.Tensor`, Tensor of shape (n_obs, n_cluster) containing predicted cluster membership probabilities 73 | for each cell. 74 | - LVG_target: `tf.Tensor` (Optional), Tensor containing the reconstruction target of the minibatch for the LVGs. 75 | - aeloss_fun: `function`, Function to compute reconstruction loss. 76 | - clust_weight: `float`, A float between 0 and 2 balancing clustering and reconstruction losses. 77 | 78 | Returns: 79 | ------------------------------------------------------------------ 80 | - net_loss: `tf.Tensor`, The loss computed for the minibatch. 81 | - aeloss: `tf.Tensor`, The reconstruction loss computed for the minibatch. 82 | """ 83 | 84 | if aeloss_fun is not None: 85 | 86 | aeloss_HVG = aeloss_fun(target, denoised_output['HVG_denoised']) 87 | if LVG_target is not None: 88 | aeloss_LVG = aeloss_fun(LVG_target, denoised_output['LVG_denoised']) 89 | aeloss = 0.5*(aeloss_LVG + aeloss_HVG) 90 | else: 91 | aeloss = 1. * aeloss_HVG 92 | else: 93 | aeloss = 0. 94 | 95 | net_loss = clust_weight * tf.reduce_mean(KLD(p, cluster_output_q)) + (2. - clust_weight) * aeloss 96 | 97 | return net_loss, aeloss 98 | 99 | 100 | def MSEloss(netinput, netoutput): 101 | """Function to compute the MSEloss for the reconstruction loss of a minibatch. 102 | 103 | 104 | Arguments: 105 | ------------------------------------------------------------------ 106 | - netinput: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells. 107 | - netoutput: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 108 | 109 | Returns: 110 | ------------------------------------------------------------------ 111 | - mse_loss: `tf.Tensor`, The loss computed for the minibatch, averaged over genes and cells. 112 | """ 113 | 114 | return tf.math.reduce_mean(MSE(netinput, netoutput)) 115 | 116 | 117 | def NBloss(count, output, eps = 1e-10, mean = True): 118 | """Function to compute the negative binomial reconstruction loss of a minibatch. 119 | 120 | 121 | Arguments: 122 | ------------------------------------------------------------------ 123 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original 124 | counts). 125 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 
126 | - eps: `float`, A small number introduced for computational stability 127 | - mean: `bool`, If True, average negative binomial loss over genes and cells 128 | 129 | Returns: 130 | ------------------------------------------------------------------ 131 | - nbloss: `tf.Tensor`, The loss computed for the minibatch. If mean was True, it has shape (n_obs, n_var). Otherwise, it has shape (1,). 132 | """ 133 | 134 | count = tf.cast(count, tf.float32) 135 | mu = tf.cast(output[0], tf.float32) 136 | 137 | theta = tf.minimum(output[1], 1e6) 138 | 139 | t1 = tf.math.lgamma(theta + eps) + tf.math.lgamma(count + 1.0) - tf.math.lgamma(count + theta + eps) 140 | t2 = (theta + count) * tf.math.log(1.0 + (mu/(theta+eps))) + (count * (tf.math.log(theta + eps) - tf.math.log(mu + eps))) 141 | 142 | final = _nan2inf(t1 + t2) 143 | 144 | if mean: 145 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 146 | 147 | return final 148 | 149 | 150 | def ZINBloss(count, output, eps = 1e-10): 151 | """Function to compute the negative binomial reconstruction loss of a minibatch. 152 | 153 | 154 | Arguments: 155 | ------------------------------------------------------------------ 156 | - count: `tf.Tensor`, Tensor containing the network reconstruction target of the minibatch for the cells (the original counts). 157 | - output: `tf.Tensor`, Tensor containing the reconstructed target of the minibatch for the cells. 158 | - eps: `float`, A small number introduced for computational stability 159 | 160 | Returns: 161 | ------------------------------------------------------------------ 162 | - zinbloss: `tf.Tensor`, The loss computed for the minibatch. Has shape (1,). 163 | """ 164 | 165 | mu = output[0] 166 | theta = output[1] 167 | pi = output[2] 168 | 169 | NB = NBloss(count, output, eps = eps, mean = False) - tf.math.log(1.0 - pi + eps) 170 | 171 | count = tf.cast(count, tf.float32) 172 | mu = tf.cast(mu, tf.float32) 173 | 174 | theta = tf.math.minimum(theta, 1e6) 175 | 176 | zero_nb = tf.math.pow(theta/(theta + mu + eps), theta) 177 | zero_case = -tf.math.log(pi + ((1.0- pi) * zero_nb) + eps) 178 | final = tf.where(tf.less(count, 1e-8), zero_case, NB) 179 | 180 | final = tf.reduce_sum(final)/final.shape[0]/final.shape[1] 181 | 182 | return final 183 | 184 | 185 | def _nan2inf(x): 186 | """Function to replace nan entries in a Tensor with infinities. 187 | 188 | 189 | Arguments: 190 | ------------------------------------------------------------------ 191 | - x: `tf.Tensor`, Tensor of arbitrary shape. 192 | 193 | Returns: 194 | ------------------------------------------------------------------ 195 | - x': `tf.Tensor`, Tensor x with nan entries replaced by infinity. 196 | """ 197 | 198 | return tf.where(tf.math.is_nan(x), tf.zeros_like(x) + np.inf, x) 199 | 200 | -------------------------------------------------------------------------------- /build/lib/CarDEC/CarDEC_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | from scipy.sparse import issparse 4 | 5 | import scanpy as sc 6 | from anndata import AnnData 7 | 8 | 9 | def normalize_scanpy(adata, batch_key = None, n_high_var = 1000, LVG = True, 10 | normalize_samples = True, log_normalize = True, 11 | normalize_features = True): 12 | """ This function preprocesses the raw count data. 13 | 14 | 15 | Arguments: 16 | ------------------------------------------------------------------ 17 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). 
Rows correspond to cells and columns to genes. 18 | - batch_key: `str`, string specifying the name of the column in the observation dataframe which identifies the batch of each cell. If this is left as None, then all cells are assumed to be from one batch. 19 | - n_high_var: `int`, integer specifying the number of genes to be idntified as highly variable. E.g. if n_high_var = 2000, then the 2000 genes with the highest variance are designated as highly variable. 20 | - LVG: `bool`, Whether to retain and preprocess LVGs. 21 | - normalize_samples: `bool`, If True, normalize expression of each gene in each cell by the sum of expression counts in that cell. 22 | - log_normalize: `bool`, If True, log transform expression. I.e., compute log(expression + 1) for each gene, cell expression count. 23 | - normalize_features: `bool`, If True, z-score normalize each gene's expression. 24 | 25 | Returns: 26 | ------------------------------------------------------------------ 27 | - adata: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Contains preprocessed data. 28 | """ 29 | 30 | n, p = adata.shape 31 | sparsemode = issparse(adata.X) 32 | 33 | if batch_key is not None: 34 | batch = list(adata.obs[batch_key]) 35 | batch = convert_vector_to_encoding(batch) 36 | batch = np.asarray(batch) 37 | batch = batch.astype('float32') 38 | else: 39 | batch = np.ones((n,), dtype = 'float32') 40 | norm_by_batch = False 41 | 42 | sc.pp.filter_genes(adata, min_counts=1) 43 | sc.pp.filter_cells(adata, min_counts=1) 44 | 45 | count = adata.X.copy() 46 | 47 | if normalize_samples: 48 | out = sc.pp.normalize_total(adata, inplace = False) 49 | obs_ = adata.obs 50 | var_ = adata.var 51 | adata = None 52 | adata = AnnData(out['X']) 53 | adata.obs = obs_ 54 | adata.var = var_ 55 | 56 | size_factors = out['norm_factor'] / np.median(out['norm_factor']) 57 | out = None 58 | else: 59 | size_factors = np.ones((adata.shape[0], )) 60 | 61 | if not log_normalize: 62 | adata_ = adata.copy() 63 | 64 | sc.pp.log1p(adata) 65 | 66 | if n_high_var is not None: 67 | sc.pp.highly_variable_genes(adata, inplace = True, min_mean = 0.0125, max_mean = 3, min_disp = 0.5, 68 | n_bins = 20, n_top_genes = n_high_var, batch_key = batch_key) 69 | 70 | hvg = adata.var['highly_variable'].values 71 | 72 | if not log_normalize: 73 | adata = adata_.copy() 74 | 75 | else: 76 | hvg = [True] * adata.shape[1] 77 | 78 | if normalize_features: 79 | batch_list = np.unique(batch) 80 | 81 | if sparsemode: 82 | adata.X = adata.X.toarray() 83 | 84 | for batch_ in batch_list: 85 | indices = [x == batch_ for x in batch] 86 | sub_adata = adata[indices] 87 | 88 | sc.pp.scale(sub_adata) 89 | adata[indices] = sub_adata.X 90 | 91 | adata.layers["normalized input"] = adata.X 92 | adata.X = count 93 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 94 | 95 | else: 96 | if sparsemode: 97 | adata.layers["normalized input"] = adata.X.toarray() 98 | else: 99 | adata.layers["normalized input"] = adata.X 100 | 101 | adata.var['Variance Type'] = [['LVG', 'HVG'][int(x)] for x in hvg] 102 | 103 | if n_high_var is not None: 104 | del_keys = ['dispersions', 'dispersions_norm', 'highly_variable', 'highly_variable_intersection', 'highly_variable_nbatches', 'means'] 105 | del_keys = [x for x in del_keys if x in adata.var.keys()] 106 | adata.var = adata.var.drop(del_keys, axis = 1) 107 | 108 | y = np.unique(batch) 109 | num_batch = len(y) 110 | 111 | adata.obs['size factors'] = size_factors.astype('float32') 112 | adata.obs['batch'] = batch 113 | 
113 |     adata.uns['num_batch'] = num_batch
114 | 
115 |     if sparsemode:
116 |         adata.X = adata.X.toarray()
117 | 
118 |     if not LVG:
119 |         adata = adata[:, adata.var['Variance Type'] == 'HVG']
120 | 
121 |     return adata
122 | 
123 | 
124 | def build_dir(dir_path):
125 |     """ This function builds a directory if it does not exist.
126 | 
127 | 
128 |     Arguments:
129 |     ------------------------------------------------------------------
130 |     - dir_path: `str`, The directory to build. E.g. if dir_path = 'folder1/folder2/folder3', then this function will create the directory folder1 if it does not already exist. Then it creates folder1/folder2 if folder2 does not exist in folder1. Then it creates folder1/folder2/folder3 if folder3 does not exist in folder2.
131 |     """
132 | 
133 |     subdirs = [dir_path]
134 |     substring = dir_path
135 | 
136 |     while substring != '':
137 |         splt_dir = os.path.split(substring)
138 |         substring = splt_dir[0]
139 |         subdirs.append(substring)
140 | 
141 |     subdirs.pop()
142 |     subdirs = [x for x in subdirs if os.path.basename(x) != '..']
143 | 
144 |     n = len(subdirs)
145 |     subdirs = [subdirs[n - 1 - x] for x in range(n)]  # reorder from outermost to innermost directory
146 | 
147 |     for dir_ in subdirs:
148 |         if not os.path.isdir(dir_):
149 |             os.mkdir(dir_)
150 | 
151 | 
152 | def convert_string_to_encoding(string, vector_key):
153 |     """A function to convert a string to a numeric encoding.
154 | 
155 | 
156 |     Arguments:
157 |     ------------------------------------------------------------------
158 |     - string: `str`, The specific string to convert to a numeric encoding.
159 |     - vector_key: `np.ndarray`, Array of all possible values of string.
160 | 
161 |     Returns:
162 |     ------------------------------------------------------------------
163 |     - encoding: `int`, The integer encoding of string.
164 |     """
165 | 
166 |     return np.argwhere(vector_key == string)[0][0]
167 | 
168 | 
169 | def convert_vector_to_encoding(vector):
170 |     """A function to convert a vector of strings to a dense numeric encoding.
171 | 
172 | 
173 |     Arguments:
174 |     ------------------------------------------------------------------
175 |     - vector: `array_like`, The vector of strings to encode.
176 | 
177 |     Returns:
178 |     ------------------------------------------------------------------
179 |     - vector_num: `list`, A list containing the dense numeric encoding.
180 |     """
181 | 
182 |     vector_key = np.unique(vector)
183 |     vector_strings = list(vector)
184 |     vector_num = [convert_string_to_encoding(string, vector_key) for string in vector_strings]
185 | 
186 |     return vector_num
187 | 
188 | 
189 | def find_resolution(adata_, n_clusters, random):
190 |     """A function to find the Louvain resolution that corresponds to a prespecified number of clusters, if it exists.
191 | 
192 | 
193 |     Arguments:
194 |     ------------------------------------------------------------------
195 |     - adata_: `anndata.AnnData`, the annotated data matrix of shape (n_obs, n_vars). Rows correspond to cells and columns to low-dimensional features.
196 |     - n_clusters: `int`, Number of clusters.
197 |     - random: `int`, The random seed.
198 | 
199 |     Returns:
200 |     ------------------------------------------------------------------
201 |     - resolution: `float`, The resolution that gives n_clusters after running the Louvain clustering algorithm.
202 |     """
203 | 
204 |     obtained_clusters = -1
205 |     iteration = 0
206 |     resolutions = [0., 1000.]
207 | 
208 |     while obtained_clusters != n_clusters and iteration < 50:  # bisection search over the resolution interval
209 |         current_res = sum(resolutions)/2
210 |         adata = sc.tl.louvain(adata_, resolution = current_res, random_state = random, copy = True)
211 |         labels = adata.obs['louvain']
212 |         obtained_clusters = len(np.unique(labels))
213 | 
214 |         if obtained_clusters < n_clusters:
215 |             resolutions[0] = current_res
216 |         else:
217 |             resolutions[1] = current_res
218 | 
219 |         iteration = iteration + 1
220 | 
221 |     return current_res
222 | 
223 | 
--------------------------------------------------------------------------------
/build/lib/CarDEC/__init__.py:
--------------------------------------------------------------------------------
1 | from .CarDEC_API import CarDEC_API
--------------------------------------------------------------------------------
/dist/cardec-1.0.3-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/dist/cardec-1.0.3-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/cardec-1.0.3.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jlakkis/CarDEC/7e00f05e637febd0006728a1112702c203e7f7dc/dist/cardec-1.0.3.tar.gz
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 | 
4 | # In[ ]:
5 | 
6 | import setuptools
7 | 
8 | with open("README.md", "r") as fh:
9 |     long_description = fh.read()
10 | 
11 | setuptools.setup(
12 |     name="cardec",
13 |     version="1.0.3",
14 |     author="Justin Lakkis",
15 |     author_email="jlakks@gmail.com",
16 |     description="A deep learning method for joint batch correction, denoising, and clustering of single-cell RNA-seq data.",
17 |     long_description=long_description,
18 |     long_description_content_type="text/markdown",
19 |     url="https://github.com/jlakkis/CarDEC",
20 |     packages=setuptools.find_packages(),
21 |     classifiers=[
22 |         "Programming Language :: Python :: 3",
23 |         "License :: OSI Approved :: MIT License",
24 |         "Operating System :: OS Independent",
25 |     ],
26 |     install_requires=['numpy>=1.18.1', 'pandas>=1.0.1', 'scipy>=1.4.1', 'tensorflow>=2.0.1,<=2.3.1', 'scikit-learn>=0.22.2.post1', 'scanpy>=1.5.1', 'louvain>=0.6.1'],
27 |     python_requires='>=3.7',
28 | )
--------------------------------------------------------------------------------
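
For orientation, here is a minimal sketch (not part of the repository) of how the loss functions listed above (`NBloss` and `ZINBloss`) can be exercised on toy tensors in eager TensorFlow 2; the tensor values and shapes below are invented purely for illustration.

```
import tensorflow as tf

# NBloss, ZINBloss, and _nan2inf defined as in the listing above.

count = tf.constant([[0., 3., 1.], [2., 0., 5.]])        # raw counts, shape (n_obs, n_var)
mu    = tf.constant([[0.5, 2.8, 1.2], [1.9, 0.4, 4.6]])  # reconstructed mean
theta = tf.fill([2, 3], 10.0)                            # dispersion
pi    = tf.fill([2, 3], 0.1)                             # dropout (zero-inflation) probability

nb_per_entry = NBloss(count, (mu, theta), eps = 1e-10, mean = False)  # per-entry loss, shape (2, 3)
nb_mean = NBloss(count, (mu, theta), eps = 1e-10, mean = True)        # scalar, averaged over genes and cells
zinb_mean = ZINBloss(count, (mu, theta, pi))                          # scalar ZINB loss

print(float(nb_mean), float(zinb_mean))
```

Since `NBloss` only indexes `output[0]` and `output[1]`, a plain `(mu, theta)` tuple suffices for the NB case, while `ZINBloss` additionally reads the dropout probability from `output[2]`.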