├── .gitignore ├── LICENSE.txt ├── README.md ├── cvae ├── __init__.py ├── cvae.py └── lib │ ├── __init__.py │ ├── data_reader.py │ ├── data_reader_array.py │ ├── functions.py │ └── model_iaf.py ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | test_tf.py 2 | test_mnist.py 3 | 4 | # Python virtual environment 5 | venv/ 6 | env/ 7 | 8 | # Python package build files 9 | *.egg-info/ 10 | dist/ 11 | build/ 12 | 13 | # Temporary files 14 | temp/ 15 | *.pyc 16 | __pycache__/ 17 | 18 | # IDE specific files 19 | .vscode/ 20 | .idea/ 21 | *.swp 22 | .DS_Store 23 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Max Frenzel 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CompressionVAE 2 | 3 | Data embedding API based on the Variational Autoencoder (VAE), originally proposed by Kingma and Welling https://arxiv.org/abs/1312.6114. 4 | 5 | This tool, implemented in TensorFlow (originally built with TF1.x, but updated to TF2.x through compatibility mode), is designed to work similar to familiar dimensionality reduction methods such as scikit-learn's t-SNE or UMAP, but also go beyond their capabilities in some notable ways, making full use of the VAE as a generative model. 6 | 7 | While I decided to call the tool itself CompressionVAE, or CVAE for short, I mainly chose this to give it a unique name. 8 | In practice, it is based on a standard VAE, with the (optional) addition of Inverse Autoregressive Flow (IAF) layers to allow for more flexible posterior distributions. 9 | For details on the IAF layers, I refer you to the original paper: https://arxiv.org/pdf/1606.04934.pdf. 10 | 11 | CompressionVAE has **several unique advantages** over the common manifold learning methods like t-SNE and UMAP: 12 | * Rather than just a transformation of the training data, it provides a **reversible and deterministic function**, mapping from data space to embedding space. 13 | * Due to the reversibility of the mapping, the model can be used to **generate new data from arbitrary latent variables**. 
This also makes the embeddings highly suitable as **intermediary representations for downstream tasks**.
14 | * Once a model is trained, it can be reused to transform new data, making it **suitable for use in live settings**.
15 | * Like UMAP, CVAE is **fast and scales much better to large datasets and to high-dimensional input and latent spaces**.
16 | * The neural network architecture and training parameters are **highly customisable** through the simple API, allowing more advanced users to tailor the system to their needs.
17 | * VAEs have a **very strong theoretical foundation**, and the learned latent spaces have many desirable properties. There is also extensive literature on different variants, and CVAE can easily be extended to keep up with new research advances.
18 | 
19 | ## Installing CompressionVAE
20 | 
21 | CompressionVAE is distributed through PyPI under the name `cvae` (https://pypi.org/project/cvae/). To install the latest version, simply run
22 | ```
23 | pip install cvae
24 | ```
25 | Alternatively, to install CompressionVAE locally, clone this repository and run the following command from the CompressionVAE root directory.
26 | ```
27 | pip install -e .
28 | ```
29 | 
30 | ## Basic Use Case
31 | 
32 | To use CVAE to learn an embedding function, we first need to import the cvae library.
33 | ```
34 | from cvae import cvae
35 | ```
36 | 
37 | When creating a CompressionVAE object for a new model, it needs to be provided with a training dataset.
38 | For small datasets that fit in memory, we can directly follow the sklearn convention. Let's look at this case first and take MNIST as an example.
39 | 
40 | First, load the MNIST data. (Note: this example requires scikit-learn, which is not installed with CVAE. You might have to install it first by running `pip install scikit-learn`.)
41 | ```
42 | from sklearn.datasets import fetch_openml
43 | mnist = fetch_openml('mnist_784', version=1, cache=True)
44 | X = mnist.data
45 | ```
46 | 
47 | ### Initializing CVAE
48 | Now we can create a CompressionVAE object/model based on this data. The minimal code to do this is
49 | ```
50 | embedder = cvae.CompressionVAE(X)
51 | ```
52 | By default, this creates a model with a two-dimensional latent space, splits the data X randomly into 90% train and 10% validation data, applies feature normalization, and tries to match the model architecture to the input and latent feature dimensions.
53 | It also saves the model in a temporary directory, which gets overwritten the next time you create a new CVAE object there.
54 | 
55 | We will look at customising all this later, but for now let's move on to training.
56 | 
57 | ### Training CVAE
58 | Once a CVAE object is initialised and associated with data, we can train the embedder using its `train` method. This works similarly to t-SNE's or UMAP's `fit` method.
59 | In the simplest case, we just run
60 | ```
61 | embedder.train()
62 | ```
63 | This will train the model, applying automatic learning rate scheduling based on the validation data loss, and stop either when the model converges or after 50k training steps.
64 | We can also stop the training process early with a KeyboardInterrupt (ctrl-c, or 'Interrupt Kernel' in a Jupyter notebook). The model will be saved at this point.
65 | 
66 | It is also possible to stop training and then re-start with different parameters (see more details below).
67 | 
68 | One note/warning: At the moment, the model can be quite sensitive to initialization (in some rare cases even giving NaN losses). Re-initializing and re-training the model can help if a training run did not give satisfactory results.
69 | 
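As an example of re-starting with different parameters, the following call continues training at a lower, fixed learning rate (a sketch; the values are arbitrary, and `overwrite=True` is required once a model already has a checkpoint):
```
embedder.train(learning_rate=1e-4,
               num_steps=10000,
               lr_scheduling=False,
               overwrite=True)
```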
70 | ### Embedding data
71 | Once we have a trained model (well, technically even before training, but the results would be random), we can use CVAE to compress data, embedding it into the latent space.
72 | To do this, we use CVAE's `embed` method.
73 | 
74 | To embed the entire MNIST data:
75 | ```
76 | z = embedder.embed(X)
77 | ```
78 | But note that, unlike with t-SNE or UMAP, this data does not have to be the same as the training data. It can be new and previously unseen data.
79 | 
80 | ### Visualising the embedding
81 | For two-dimensional latent spaces, CVAE comes with a built-in visualization method, `visualize`. It provides a two-dimensional plot of the embeddings, including class information if available.
82 | 
83 | To visualize the MNIST embeddings and color them by their respective class, we can run
84 | ```
85 | embedder.visualize(z, labels=[int(label) for label in mnist.target])
86 | ```
87 | We could also pass the string labels `mnist.target` directly to `labels`, but in that case they would not necessarily be ordered from 0 to 9.
88 | Optionally, if we pass `labels` as a list of integers like above, we can also pass the `categories` parameter, a list of strings associating names with the labels. In the case of MNIST this is irrelevant since the labels and class names are the same.
89 | By default, the `visualize` method simply displays the plot. By setting the `filename` parameter, we can alternatively save the plot to a file.
90 | 
91 | ### Generating data
92 | Finally, we can use CVAE as a generative model, generating data by decoding arbitrary latent vectors with the `decode` method.
93 | If we simply want to 'undo' our MNIST embedding and try to re-create the input data, we can run our embeddings `z` through the `decode` method.
94 | ```
95 | X_reconstructed = embedder.decode(z)
96 | ```
97 | As a more interesting example, we can use this for data interpolation. Let's say we want to create the data that's halfway between the first and the second MNIST datapoint (a '5' and a '0' respectively).
98 | We can achieve this with the following code:
99 | ```
100 | import numpy as np
101 | # Combine the two examples and add batch dimension
102 | z_interp = np.expand_dims(0.5*z[0] + 0.5*z[1], axis=0)
103 | # Decode the new latent vector.
104 | X_interp = embedder.decode(z_interp)
105 | ```
106 | 
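To inspect the interpolated sample, we can reshape the decoded vector back into a 28 by 28 image and display it, for example with matplotlib (a sketch that assumes the MNIST image shape):
```
import matplotlib.pyplot as plt

# X_interp has shape (1, 784); reshape the single row into an image and plot it.
plt.imshow(X_interp[0].reshape(28, 28), cmap='Greys_r')
plt.show()
```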
107 | #### Visualizing the latent space
108 | In the case of image data, such as MNIST, CVAE also has a method that allows us to quickly visualize the latent space as seen through the decoder.
109 | To plot a 20 by 20 grid of reconstructed images, spanning the latent space region [-4, 4] in both x and y, we can run
110 | ```
111 | embedder.visualize_latent_grid(xy_range=(-4.0, 4.0),
112 |                                grid_size=20,
113 |                                shape=(28, 28))
114 | ```
115 | 
116 | ## Advanced Use Cases
117 | The example above shows the simplest usage of CVAE. However, if desired, a user can take much more control over the system and customize the model and training processes.
118 | 
119 | ### Customizing the model
120 | In the previous example, we initialised a CompressionVAE with default parameters. If we want more control, we might want to initialise it the following way:
121 | ```
122 | embedder = cvae.CompressionVAE(X,
123 |                                train_valid_split=0.99,
124 |                                dim_latent=16,
125 |                                iaf_flow_length=10,
126 |                                cells_encoder=[512, 256, 128],
127 |                                initializer='lecun_normal',
128 |                                batch_size=32,
129 |                                batch_size_test=128,
130 |                                logdir='~/mnist_16d',
131 |                                feature_normalization=False,
132 |                                tb_logging=True)
133 | ```
134 | `train_valid_split` controls the random split into training and validation data. Here 99% of X is used for training, and only 1% is reserved for validation.
135 | 
136 | Alternatively, to get more control over the data, the user can also provide `X_valid` as an input. In this case, `train_valid_split` is ignored and the model uses `X` for training and `X_valid` for validation.
137 | 
138 | `dim_latent` specifies the dimensionality of the latent space.
139 | 
140 | `iaf_flow_length` controls how many IAF flow layers the model has.
141 | 
142 | `cells_encoder` determines the number, as well as the size, of the encoder's fully connected layers. In the case above, we have three layers with 512, 256, and 128 units respectively. The decoder uses the mirrored version of this.
143 | If this parameter is not set, CVAE creates a two-layer network with sizes adjusted to the input dimension and latent dimension. The logic behind this is very handwavy and arbitrary for now, and I generally recommend setting this manually.
144 | 
145 | `initializer` controls how the model weights are initialized, with options being `orthogonal` (default), `truncated_normal`, and `lecun_normal`.
146 | 
147 | `batch_size` and `batch_size_test` determine the batch sizes used for training and testing, respectively.
148 | 
149 | `logdir` specifies the path to the model, and also acts as the model name. The default, `'temp'`, gets overwritten every time it is used, but other model names can be used to save and restore models for later use or even to continue training.
150 | 
151 | `feature_normalization` tells CVAE whether it should internally apply feature normalization (zero mean, unit variance, based on the training data) or not. If True, the normalisation factors are stored with the model and get applied to any future data.
152 | 
153 | `tb_logging` determines whether the model writes summaries for TensorBoard during the training process.
154 | 
155 | ### Customizing the training process
156 | In the simple example above, we called the `train` method without any parameters. A more advanced call might look like
157 | ```
158 | embedder.train(learning_rate=1e-4,
159 |                num_steps=2000,
160 |                dropout_keep_prob=0.6,
161 |                test_every=50,
162 |                lr_scheduling=False)
163 | ```
164 | `learning_rate` sets the initial learning rate of training.
165 | 
166 | `num_steps` controls the maximum number of training steps before stopping.
167 | 
168 | `dropout_keep_prob` determines the keep probability for dropout in the fully connected layers.
169 | 
170 | `test_every` sets the frequency of test steps.
171 | 
172 | `lr_scheduling` controls whether learning rate scheduling is applied. If `False`, training continues at `learning_rate` until `num_steps` is reached.
173 | 
174 | For more arguments, for example those controlling the learning rate scheduler and the convergence criteria, check the method definition.
175 | 
176 | ### Using large datasets
177 | 
178 | As an alternative to providing the input data `X` as a single numpy array, as done with t-SNE and UMAP, CVAE also allows for much larger datasets that do not fit into a single array.
179 | 
180 | To prepare such a dataset, create a new directory, e.g. `'~/my_dataset'`, and save the training data as one npy file per example in this directory.
181 | 
182 | (Note: the data can also be saved in nested sub-directories, for example one directory per category. CVAE will look through the entire directory tree for npy files.)
183 | 
184 | When initialising a model based on this kind of data, pass the root directory of the dataset as `X`, e.g.
185 | ```
186 | embedder = cvae.CompressionVAE(X='~/my_dataset')
187 | ```
188 | Initialising will take slightly longer than if `X` is passed as an array, even for the same number of data points. But this method scales in principle to arbitrarily large datasets, and only loads one batch at a time during training.
189 | 
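As an illustration, we could create such a dataset directory from an in-memory array by saving each row as its own npy file (a sketch; the directory path and the file naming scheme are arbitrary):
```
import os
import numpy as np

data_dir = os.path.expanduser('~/my_dataset')
os.makedirs(data_dir, exist_ok=True)

# One feature vector per file; CVAE infers the dimensionality from the first npy file it finds.
for i, x in enumerate(np.asarray(X, dtype=np.float32)):
    np.save(os.path.join(data_dir, f'sample_{i}.npy'), x)
```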
190 | ### Restarting an existing model
191 | 
192 | If a CompressionVAE object is initialized with `logdir='temp'`, it always starts from a new, untrained model, possibly overwriting any previous temp model.
193 | However, if a different `logdir` is chosen, the model persists and can be reloaded.
194 | 
195 | If CompressionVAE is initialized with a `logdir` that already exists and contains parameter and checkpoint files of a previous model, it attempts to restore that model's checkpoints.
196 | In this case, any model-specific input parameters (e.g. `dim_latent` and `cells_encoder`) are ignored in favor of the original model's parameters.
197 | 
198 | A restored model can be used straight away to embed or generate data, but it is also possible to continue the training process, picking up from the most recent checkpoint.
199 | 
--------------------------------------------------------------------------------
/cvae/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/maxfrenzel/CompressionVAE/9d6b52359b885a03797be41f6d5baa17925d83ef/cvae/__init__.py
--------------------------------------------------------------------------------
/cvae/cvae.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | import random
4 | import json
5 | import time
6 | import shutil
7 | 
8 | import numpy as np
9 | import tensorflow.compat.v1 as tf
10 | tf.disable_v2_behavior()
11 | try:
12 |     tf.logging.set_verbosity(tf.logging.ERROR)
13 | except:
14 |     pass
15 | 
16 | # import matplotlib as mpl
17 | # mpl.use('TkAgg')
18 | import matplotlib.pyplot as plt
19 | import matplotlib.cm as cm
20 | 
21 | import cvae.lib.data_reader_array as dra
22 | import cvae.lib.data_reader as dr
23 | import cvae.lib.model_iaf as model
24 | import cvae.lib.functions as fun
25 | 
26 | 
27 | # Save model to checkpoint
28 | def save(saver, sess, logdir, step):
29 |     model_name = 'model.ckpt'
30 |     checkpoint_path = os.path.join(logdir, model_name)
31 |     print('Storing checkpoint to {} ...'.format(logdir), end="")
32 |     sys.stdout.flush()
33 | 
34 |     if not os.path.exists(logdir):
35 |         os.makedirs(logdir)
36 | 
37 |     saver.save(sess, checkpoint_path, global_step=step)
38 |     print(' Done.')
39 | 
40 | 
41 | # Load model from checkpoint
42 | def load(saver, sess, logdir):
43 |     print("Trying to restore saved checkpoints from {} ...".format(logdir),
44 |           end="")
45 | 
46 |     ckpt = tf.train.get_checkpoint_state(logdir)
47 |     if ckpt:
48 |         print(" Checkpoint found: {}".format(ckpt.model_checkpoint_path))
49 |         global_step = int(ckpt.model_checkpoint_path
50 |                           .split('/')[-1]
51 |                           .split('-')[-1])
52 |         print(" Global step was: {}".format(global_step))
53 |         print(" 
Restoring...", end="") 54 | saver.restore(sess, ckpt.model_checkpoint_path) 55 | print(" Done.") 56 | return global_step 57 | else: 58 | print(" No checkpoint found.") 59 | return None 60 | 61 | 62 | class CompressionVAE(object): 63 | """ 64 | Variational Autoencoder (VAE) for vector compression/dimensionality reduction. 65 | 66 | Parameters 67 | ---------- 68 | X : array, shape (n_samples, n_features) 69 | Training data for the VAE. 70 | Alternatively, X can be the path to a root-directory containing npy files (potentially nested), each 71 | representing a single feature vector. This allows for handling of datasets that are too large to fit 72 | in memory. 73 | Can be None (default) only if a model with this name has previously been trained. Otherwise None will 74 | raise an exception. 75 | 76 | X_valid : array, shape (n__valid_samples, n_features), optional (default: None) 77 | Validation data. If not provided, X is split into training and validation data 78 | 79 | train_valid_split : float, optional (default: 0.9) 80 | Specifies in what ratio to split X into training and validation data (after randomizing the data). 81 | Ignored if X_valid provided. 82 | 83 | dim_latent : int, optional (default: 2) 84 | Dimension of latent space (i.e. number of features of embeddings) 85 | 86 | iaf_flow_length : int, optional (default: 5) 87 | Number of IAF Flow layers to use in the model. 88 | For details, see https://arxiv.org/abs/1606.04934. 89 | 90 | cells_encoder : list of int, optional (default: None) 91 | The length of this list determines the number of layers of the encoder and decoder, and the values 92 | determine the number of units per layer (reversed order for decoder). 93 | If None, this is automatically chosen based on number of features and latent dimension. 94 | 95 | initializer : string, optional (default: 'orthogonal') 96 | Initializer to use for weights of model. 97 | 98 | batch_size : int, optional (default: 64) 99 | Batch size to use for training. 100 | 101 | batch_size_test : int, optional (default: 64) 102 | Batch size to use for testing. 103 | 104 | logdir : string, optional (default: 'temp') 105 | Location for where to save the model and other related files. Can also be used to restart from an already 106 | trained model. 107 | If 'temp' (default), any previously stored data is deleted and model/data are initialised from scratch. 108 | 109 | feature_normalization : bool, optional (default: True) 110 | If True (default), normalization of all data is applied internally, based on training data statistics. 111 | 112 | tb_logging : bool, optional (default: False) 113 | If True, create tensorboard summaries with loss data etc. 
114 | """ 115 | 116 | def __init__(self, 117 | X=None, 118 | X_valid=None, 119 | train_valid_split=0.9, 120 | dim_latent=2, 121 | iaf_flow_length=5, 122 | cells_encoder=None, 123 | initializer='orthogonal', 124 | batch_size=64, 125 | batch_size_test=64, 126 | logdir='temp', 127 | feature_normalization=True, 128 | tb_logging=False): 129 | 130 | self.dim_latent = dim_latent 131 | self.iaf_flow_length = iaf_flow_length 132 | self.cells_encoder = cells_encoder 133 | self.initializer = initializer 134 | self.batch_size = batch_size 135 | self.batch_size_test = batch_size_test 136 | self.logdir = os.path.abspath(logdir) 137 | self.feature_normalization = feature_normalization 138 | self.tb_logging = tb_logging 139 | 140 | self.trained_once_this_session = False 141 | 142 | # --- Check for existing model --- 143 | 144 | # Set flag to indicate that the model has not been trained yet 145 | self.is_trained = False 146 | 147 | # If using temporary model directory (default), delete any previously stored models 148 | if logdir == 'temp' and os.path.exists(self.logdir): 149 | shutil.rmtree(self.logdir) 150 | 151 | # Check if a model with the same name already exists 152 | # If no, create directory 153 | if not os.path.exists(self.logdir): 154 | os.makedirs(self.logdir) 155 | 156 | # Do checkpoint, parameter, and norm files exist? 157 | self.has_checkpoint = os.path.exists(f'{self.logdir}/checkpoint') 158 | self.has_params = os.path.exists(f'{self.logdir}/params.json') 159 | self.has_norm = os.path.exists(f'{self.logdir}/norm.pkl') 160 | self.has_dataset_file = os.path.exists(f'{self.logdir}/data_train.pkl') 161 | self.has_data = False 162 | 163 | # --- Prepare data --- 164 | 165 | # Check if data is provided as array or as directory. 166 | self.dataset_type = None 167 | 168 | if type(X) == str: 169 | self.dataset_type = 'string' 170 | 171 | if self.has_dataset_file: 172 | print('This model has already been associated with a dataset from a directory. To create a new ' 173 | 'dataset, delete the data_train.pkl and data_valid.pkl files in the model directory.') 174 | _, _, self.dim_feature = dr.load_dataset_file(f'{self.logdir}/data_train.pkl') 175 | else: 176 | print(f'Preparing train and validation datasets from feature directory {X}.') 177 | self.dim_feature = fun.prepare_dataset(data_dir=os.path.abspath(X), 178 | logdir=self.logdir, 179 | train_ratio=train_valid_split) 180 | self.has_dataset_file = True 181 | 182 | self.X = None 183 | self.X_valid = None 184 | self.has_data = True 185 | 186 | elif type(X) == np.ndarray: 187 | self.dataset_type = 'array' 188 | self.dim_feature = X.shape[1] 189 | # Split data into train and validation or use provided validation data 190 | if X_valid is not None: 191 | assert X_valid.shape[1] == self.dim_feature, "Train and validation data has different feature dimensions!" 
192 | self.X = X.astype(np.float32) 193 | self.X_valid = X_valid.astype(np.float32) 194 | else: 195 | # Randomize data 196 | num_data = len(X) 197 | indices = list(range(num_data)) 198 | random.shuffle(indices) 199 | # Split data (and ensure it's float) 200 | split_index = int(train_valid_split * num_data) 201 | train_indices = indices[:split_index] 202 | valid_indices = indices[split_index:] 203 | self.X = X[train_indices].astype(np.float32) 204 | self.X_valid = X[valid_indices].astype(np.float32) 205 | self.has_data = True 206 | 207 | # elif X is None: 208 | # self.X = None 209 | # self.X_valid = None 210 | # if self.has_dataset_file: 211 | # print(f'Reloading dataset file {self.logdir}/data_train.pkl from previous instance of this model.') 212 | # _, _, self.dim_feature = dr.load_dataset_file(f'{self.logdir}/data_train.pkl') 213 | # self.has_data = True 214 | # else: 215 | # if self.has_checkpoint: 216 | # self.has_data = False 217 | # else: 218 | # raise Exception( 219 | # 'Model needs to be initialised with X provided as numpy array or path to directory.') 220 | else: 221 | raise Exception('Unsupported input type for X. Needs to be numpy array or string with path to directory ' 222 | 'containing npy files. ') 223 | 224 | # --- Prepare parameter file --- 225 | 226 | # If parameter file for this model already exists, load it. Otherwise create one. 227 | if self.has_params: 228 | print(f'Existing parameter file found for model {self.logdir}.\n' 229 | f'Loading stored parameters. Some input parameters might be ignored.') 230 | with open(f'{self.logdir}/params.json', 'r') as f: 231 | self.param = json.load(f) 232 | # Set dim_feature if not previously known 233 | if X is None and not self.has_dataset_file: 234 | self.dim_feature = self.param['dim_feature'] 235 | else: 236 | # If not given, determine model structure 237 | # NOTE: The reasoning here is a bit arbitrary/dodgy, should probably put some more thought into this 238 | # and improve it. 239 | # TODO: This does not give very good results yet... 
240 | if cells_encoder is None: 241 | # Get all the powers of two between latent dim and feature dim 242 | smallest_power = int(2 ** (self.dim_latent - 1).bit_length()) 243 | largest_power = int(2 ** self.dim_feature.bit_length() / 2) 244 | powers_of_two = [smallest_power] 245 | while powers_of_two[-1] <= largest_power: 246 | powers_of_two.append(powers_of_two[-1]*2) 247 | 248 | # By default, use two layers, one with largest power of two, the second roughly half-way between 249 | # input and output dimension 250 | l2_index = int(len(powers_of_two) / 2) 251 | try: 252 | model_layers = [largest_power, 253 | powers_of_two[l2_index+1]] 254 | except: 255 | model_layers = [largest_power, 256 | int(largest_power/2)] 257 | 258 | else: 259 | model_layers = cells_encoder 260 | 261 | # Number of hidden cells is smaller of the last layer size or 64 262 | cells_hidden = min(model_layers[-1], 64) 263 | 264 | if self.dataset_type == 'string': 265 | dataset_file = f'{self.logdir}/data_train.pkl' 266 | dataset_file_valid = f'{self.logdir}/data_valid.pkl' 267 | else: 268 | dataset_file = None 269 | dataset_file_valid = None 270 | 271 | self.param = { 272 | "dataset_file": dataset_file, 273 | "dataset_file_valid": dataset_file_valid, 274 | "dim_latent": self.dim_latent, 275 | "dim_feature": self.dim_feature, 276 | "cells_encoder": model_layers, 277 | "cells_hidden": cells_hidden, 278 | "iaf_flow_length": self.iaf_flow_length, 279 | "dim_autoregressive_nl": cells_hidden, 280 | "initial_s_offset": 1.0, 281 | "feature_normalization": self.feature_normalization 282 | } 283 | 284 | # Write to json for future re-use of this model 285 | with open(f'{self.logdir}/params.json', 'w') as outfile: 286 | json.dump(self.param, outfile, indent=2) 287 | 288 | # --- Set up VAE model --- 289 | self.graph = tf.Graph() 290 | 291 | with self.graph.as_default(): 292 | 293 | # Create coordinator. 294 | self.coord = tf.train.Coordinator() 295 | 296 | # Set up batchers. 
297 | with tf.name_scope('create_inputs'): 298 | if self.dataset_type == 'string': 299 | self.reader = dr.DataReader(self.param['dataset_file'], 300 | self.param, 301 | f'{self.logdir}/params.json', 302 | self.coord, 303 | self.logdir) 304 | self.test_batcher = dr.Batcher(self.param['dataset_file_valid'], 305 | self.param, 306 | f'{self.logdir}/params.json', 307 | self.logdir) 308 | else: 309 | self.reader = dra.DataReader(self.X, self.feature_normalization, self.coord, self.logdir) 310 | self.test_batcher = dra.Batcher(self.X_valid, self.feature_normalization, self.logdir) 311 | self.train_batch = self.reader.dequeue_feature(self.batch_size) 312 | 313 | # Get normalisation data 314 | if self.feature_normalization: 315 | self.mean = self.test_batcher.mean 316 | self.norm = self.test_batcher.norm 317 | 318 | num_test_data = self.test_batcher.num_data 319 | self.test_batches_full = int(self.test_batcher.num_data / self.batch_size_test) 320 | self.test_batch_last = num_test_data - (self.test_batches_full * self.batch_size_test) 321 | 322 | # Placeholder for test features 323 | self.test_feature_placeholder = tf.placeholder_with_default( 324 | input=tf.zeros([self.batch_size, self.dim_feature], dtype=tf.float32), 325 | shape=[None, self.dim_feature]) 326 | 327 | # Placeholder for dropout 328 | self.dropout_placeholder = tf.placeholder_with_default(input=tf.cast(1.0, dtype=tf.float32), shape=(), 329 | name="KeepProb") 330 | 331 | # Placeholder for learning rate 332 | self.lr_placeholder = tf.placeholder_with_default(input=tf.cast(1e-4, dtype=tf.float32), shape=(), 333 | name="LearningRate") 334 | 335 | print('Creating model.') 336 | self.net = model.VAEModel(self.param, 337 | self.batch_size, 338 | input_dim=self.dim_feature, 339 | keep_prob=self.dropout_placeholder, 340 | initializer=self.initializer) 341 | print('Model created.') 342 | 343 | self.embeddings = self.net.embed(self.test_feature_placeholder) 344 | 345 | print('Setting up loss.') 346 | self.loss = self.net.loss(self.train_batch) 347 | self.loss_test = self.net.loss(self.test_feature_placeholder, test=True) 348 | print('Loss set up.') 349 | 350 | optimizer = tf.train.AdamOptimizer(learning_rate=self.lr_placeholder, 351 | epsilon=1e-4) 352 | trainable = tf.trainable_variables() 353 | # for var in trainable: 354 | # print(var) 355 | self.optim = optimizer.minimize(self.loss, var_list=trainable) 356 | 357 | # Set up logging for TensorBoard. 358 | if self.tb_logging: 359 | self.writer = tf.summary.FileWriter(self.logdir) 360 | self.writer.add_graph(tf.get_default_graph()) 361 | run_metadata = tf.RunMetadata() 362 | self.summaries = tf.summary.merge_all() 363 | 364 | # Set up session 365 | print('Setting up session.') 366 | config = tf.ConfigProto(log_device_placement=False) 367 | config.gpu_options.allow_growth = True 368 | self.sess = tf.Session(config=config) 369 | init = tf.global_variables_initializer() 370 | self.sess.run(init) 371 | print('Session set up.') 372 | 373 | # Saver for storing checkpoints of the model. 374 | self.saver = tf.train.Saver(var_list=tf.trainable_variables(), max_to_keep=2) 375 | 376 | # Try to load model 377 | try: 378 | self.saved_global_step = load(self.saver, self.sess, self.logdir) 379 | if self.saved_global_step is None: 380 | # The first training step will be saved_global_step + 1, 381 | # therefore we put -1 here for new or overwritten trainings. 382 | self.saved_global_step = -1 383 | print(f'No model found to restore. 
Initialising new model.') 384 | else: 385 | print(f'Restored trained model from step {self.saved_global_step}.') 386 | except: 387 | print("Something went wrong while restoring checkpoint.") 388 | raise 389 | 390 | def train(self, 391 | learning_rate=1e-3, 392 | num_steps=int(5e4), 393 | dropout_keep_prob=0.75, 394 | overwrite=False, 395 | test_every=50, 396 | lr_scheduling=True, 397 | lr_scheduling_steps=5, 398 | lr_scheduling_factor=5, 399 | lr_scheduling_min=1e-5, 400 | checkpoint_every=2000): 401 | """ 402 | Train the model 403 | 404 | Parameters 405 | ---------- 406 | learning_rate : float, optional (default: 1e-3) 407 | Learning rate for training. If lr_scheduling is True, this is the initial learning rate. 408 | 409 | num_steps : int, optional (default: 5e4) 410 | Maximum number of training steps before stopping. 411 | 412 | dropout_keep_prob : float, optional (default: 0.75) 413 | Keep probability to use for dropout in encoder/decoder layers. 414 | 415 | overwrite : bool, optional (default: False) 416 | If False, does not allow for overwriting existing model data. 417 | Safety measure to prevent accidentally overwriting previously saved datasets/normalization values, and 418 | unintentional training continuation. 419 | 420 | test_every : int, optional (default: 50) 421 | A test step is performed after every test_every training steps. 422 | 423 | lr_scheduling : bool, optional (default: True) 424 | If True, learning rate scheduling is applied, automatically decreasing the learning rate when the test loss 425 | does not decrease any further for lr_scheduling_steps test steps. Once lr_scheduling_min is reached, 426 | assume model has converged and stop training. 427 | 428 | lr_scheduling_steps : int, optional (default: 5) 429 | If lr_scheduling is True, decrease learning rate after lr_scheduling_steps test steps without decrease 430 | in test loss. 431 | 432 | lr_scheduling_factor : int, optional (default: 5) 433 | Factor by which to decrease learning rate if lr_scheduling is True. 434 | 435 | lr_scheduling_min : int, optional (default: 50) 436 | Minimum learning rate. If lr_scheduling is True, training finishes once learning rate drops below this 437 | value. 438 | 439 | checkpoint_every : int, optional (default: 2000) 440 | Save the model after every checkpoint_every steps. 441 | """ 442 | 443 | assert self.has_data, "Model is not associated with any data yet. " \ 444 | "Recreate CompressionVAE object for this model with X!" 445 | 446 | lr = learning_rate 447 | 448 | # Check if model already exists 449 | if self.has_checkpoint and self.has_params: 450 | print(f'Found existing model {self.logdir}.') 451 | self.is_trained = True 452 | 453 | # If model is trained and overwrite is False, stop here 454 | if not overwrite: 455 | print('To continue training this model, set overwrite=True. To train a new model, ' 456 | 'specify a different logdir or use default "temp" directory.') 457 | return self 458 | else: 459 | print('Continuing model training.') 460 | 461 | with self.graph.as_default(): 462 | 463 | if self.trained_once_this_session is False: 464 | print('Starting queues.') 465 | threads = tf.train.start_queue_runners(sess=self.sess, coord=self.coord) 466 | self.reader.start_threads(self.sess) 467 | print('Reader threads started.') 468 | self.trained_once_this_session = True 469 | 470 | last_saved_step = self.saved_global_step 471 | 472 | test_loss_history = [] 473 | 474 | # Start training; If user interrupts, make sure model gets saved. 
475 | try: 476 | for step in range(self.saved_global_step + 1, num_steps): 477 | start_time = time.time() 478 | 479 | epoch = self.reader.get_epoch(self.batch_size, step) 480 | 481 | # Run the actual optimization step 482 | if self.tb_logging: 483 | summary, loss_value, _ = self.sess.run([self.summaries, self.loss, self.optim], 484 | feed_dict={self.dropout_placeholder: dropout_keep_prob, 485 | self.lr_placeholder: lr}) 486 | self.writer.add_summary(summary, step) 487 | else: 488 | loss_value, _ = self.sess.run([self.loss, self.optim], 489 | feed_dict={self.dropout_placeholder: dropout_keep_prob, 490 | self.lr_placeholder: lr}) 491 | 492 | # Test step 493 | if step % test_every == 0: 494 | 495 | test_losses = [] 496 | 497 | for step_test in range(self.test_batches_full + 1): 498 | 499 | if step_test == self.test_batches_full: 500 | test_batch_size = self.test_batch_last 501 | else: 502 | test_batch_size = self.batch_size_test 503 | 504 | test_features = self.test_batcher.next_batch(test_batch_size) 505 | 506 | loss_value_test = self.sess.run([self.loss_test], 507 | feed_dict={self.test_feature_placeholder: test_features, 508 | self.dropout_placeholder: 1.0}) 509 | 510 | test_losses.append(loss_value_test) 511 | 512 | mean_test_loss = np.mean(test_losses) 513 | test_loss_history.append(mean_test_loss) 514 | 515 | if self.tb_logging: 516 | _summary = tf.Summary() 517 | _summary.value.add(tag='test/test_loss', simple_value=mean_test_loss) 518 | _summary.value.add(tag='test/test_loss_per_feat', 519 | simple_value=mean_test_loss / self.reader.dimension) 520 | self.writer.add_summary(_summary, step) 521 | 522 | duration = (time.time() - start_time) / test_every 523 | print('step {:d}; epoch {:.2f} - loss = {:.3f}, test_loss = {:.3f}, lr = {:.5f}, ({:.3f} sec/step)' 524 | .format(step, epoch, loss_value, mean_test_loss, lr, duration)) 525 | 526 | # Learning rate scheduling. 527 | if lr_scheduling and len(test_loss_history) >= lr_scheduling_steps: 528 | if test_loss_history[-lr_scheduling_steps] < min( 529 | test_loss_history[-lr_scheduling_steps + 1:]): 530 | lr /= lr_scheduling_factor 531 | print(f'No improvement on validation data for {lr_scheduling_steps} test steps. ' 532 | f'Decreasing learning rate by factor {lr_scheduling_factor}') 533 | 534 | # Check if training should be stopped 535 | if lr <= lr_scheduling_min: 536 | print(f'Reached learning rate threshold of {lr_scheduling_min}. ' 537 | f'Stopping.') 538 | break 539 | 540 | if step % checkpoint_every == 0: 541 | save(self.saver, self.sess, self.logdir, step) 542 | last_saved_step = step 543 | 544 | if step == num_steps - 1: 545 | print(f'Reached training step limit of {num_steps} steps. ' 546 | f'Stopping.') 547 | 548 | except KeyboardInterrupt: 549 | print() 550 | finally: 551 | self.is_trained = True 552 | self.has_checkpoint = True 553 | self.saved_global_step = step 554 | 555 | if step > last_saved_step: 556 | save(self.saver, self.sess, self.logdir, step) 557 | # self.coord.request_stop() 558 | # self.coord.join(threads) 559 | 560 | return self 561 | 562 | def embed(self, 563 | X, 564 | batch_size=64): 565 | """ 566 | Embed data into the latent space of a trained model 567 | 568 | Parameters 569 | ---------- 570 | X : array, shape (n_samples, n_features) 571 | Data to embed. 572 | 573 | batch_size : int, optional (default: 64) 574 | Batch size for processing input data. 575 | 576 | Returns 577 | ------- 578 | z : array, shape (n_samples, dim_latent) 579 | Embedding of the input data in latent space. 
580 | """ 581 | 582 | X = X.astype(np.float32) 583 | 584 | num_data = X.shape[0] 585 | num_batches_full = int(num_data / batch_size) 586 | batch_last = num_data - (num_batches_full * batch_size) 587 | if batch_last > 0: 588 | num_batches = num_batches_full + 1 589 | else: 590 | num_batches = num_batches_full 591 | 592 | embs = [] 593 | 594 | for k in range(num_batches): 595 | 596 | if k == num_batches_full: 597 | input_batch = X[k * batch_size:] 598 | else: 599 | input_batch = X[k * batch_size: (k + 1) * batch_size] 600 | 601 | # Normalize 602 | if self.feature_normalization: 603 | input_batch -= self.mean 604 | input_batch = np.divide(input_batch, self.norm, out=np.zeros_like(input_batch), where=self.norm != 0) 605 | 606 | emb = self.sess.run([self.embeddings], 607 | feed_dict={self.test_feature_placeholder: input_batch}) 608 | 609 | embs.append(emb[0]) 610 | 611 | # Concatenate 612 | z = np.concatenate(embs, axis=0) 613 | 614 | return z 615 | 616 | def decode(self, 617 | z): 618 | """ 619 | Decode latent vectors from latent space of a trained model 620 | 621 | Parameters 622 | ---------- 623 | z : array, shape (n_samples, dim_latent) 624 | Latent vectors to decode. 625 | 626 | Returns 627 | ------- 628 | X : array, shape (n_samples, n_features) 629 | Reconstruction of the data from latent code. 630 | """ 631 | 632 | recon = self.net.decode(np.float32(z)) 633 | reconstruction = self.sess.run(recon) 634 | 635 | # Reverse data normalisation 636 | if self.feature_normalization: 637 | reconstruction = np.multiply(reconstruction, self.norm) 638 | reconstruction += self.mean 639 | 640 | X = reconstruction 641 | 642 | return X 643 | 644 | def visualize(self, 645 | z, 646 | labels=None, 647 | categories=None, 648 | filename=None): 649 | """ 650 | For 2d embeddings, visualize latent space. 651 | 652 | Parameters 653 | ---------- 654 | z : array, shape (n_samples, 2) 655 | 2D latent vectors to visualize. 656 | 657 | labels: array or list, shape (n_samples), optional (default: None) 658 | Label indices or strings for each embedding. If strings, categories parameter is ignored. 659 | 660 | categories: list of string, optional (default: None) 661 | Category names for indices in labels. 662 | 663 | filename: string, optional (default: None) 664 | If filename is given, save visualization to file. Otherwise display directly. 665 | 666 | """ 667 | 668 | assert z.shape[1] == 2, "Visualization only available for 2D embeddings." 
669 | 670 | fig, ax = plt.subplots(1, 1, figsize=(12, 10), facecolor='w', edgecolor='k') 671 | if labels is None: 672 | s = ax.scatter(z[:, 0], z[:, 1], s=7) 673 | else: 674 | # Check if labels are provided as indices or strings 675 | if type(labels[0]) == int: 676 | pass 677 | elif type(labels[0]) == str: 678 | # Find unique categories and convert string labels to indices 679 | categories = list(set(labels)) 680 | str_to_int = {cat: k for k, cat in enumerate(categories)} 681 | labels = [str_to_int[label] for label in labels] 682 | else: 683 | raise Exception('Label needs to be list of integer or string labels.') 684 | 685 | cmap = plt.get_cmap('jet', np.max(labels) - np.min(labels) + 1) 686 | s = ax.scatter(z[:, 0], z[:, 1], s=7, c=labels, cmap=cmap, vmin=np.min(labels) - .5, 687 | vmax=np.max(labels) + .5) 688 | cax = plt.colorbar(s, ticks=np.arange(np.min(labels), np.max(labels) + 1)) 689 | if categories is not None: 690 | cax.ax.set_yticklabels(categories) 691 | 692 | if filename is not None: 693 | plt.savefig(filename) 694 | else: 695 | plt.show() 696 | 697 | def visualize_latent_grid(self, 698 | xy_range=(-4.0, 4.0), 699 | grid_size=10, 700 | shape=(28, 28), 701 | clip=(0, 255), 702 | figsize=(12, 12), 703 | filename=None): 704 | """ 705 | Visualize latent space by scanning over a grid, decoding, and plotting as image. 706 | Note: This assumes that the data is image data with a single channel, and currently only works for 707 | two-dimensional latent spaces. 708 | 709 | Parameters 710 | ---------- 711 | xy_range : (float, float), optional (default: (-4.0, 4.0)) 712 | Range in the x and y directions over which to scan. 713 | 714 | grid_size: int, optional (default: 10) 715 | Number of cells along x and y directions. 716 | 717 | shape: (int, int), optional (default: (28, 28)) 718 | Original shape of the image data, used to reshape the vectors to 2d images. 719 | 720 | clip: (float, float), optional (default: (0, 255)) 721 | Before displaying the image, clip the decoded data in this range. 722 | 723 | figsize: (float, float), optional (default: (12.0, 12.0)) 724 | 725 | filename: string, optional (default: None) 726 | If filename is given, save visualization to file. Otherwise display directly. 727 | 728 | """ 729 | 730 | assert self.dim_latent == 2, "visualize_latent_grid only implemented for 2d latent spaces." 
731 | 732 | xy_extent = xy_range[1] - xy_range[0] 733 | step_size = xy_extent / grid_size 734 | 735 | # Create grid of latent variables 736 | z_list = [] 737 | for k in range(grid_size): 738 | for j in range(grid_size): 739 | z_list.append([xy_range[0] + (0.5 + k) * step_size, 740 | xy_range[0] + (0.5 + j) * step_size]) 741 | 742 | z_array = np.array(z_list) 743 | 744 | # Decode 745 | x_array = self.decode(z_array) 746 | 747 | # Arrange into image grid 748 | image = [] 749 | for k in range(grid_size): 750 | row = [] 751 | for j in range(grid_size): 752 | index = k * grid_size + j 753 | row.insert(0, np.reshape(x_array[index], shape)) 754 | image.append(np.concatenate(row)) 755 | 756 | # Concatenate into image 757 | image = np.concatenate(image, axis=1) 758 | 759 | # Apply clipping 760 | if clip is not None: 761 | image = np.clip(image, clip[0], clip[1]) 762 | 763 | # Plotting 764 | fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w', edgecolor='k') 765 | plt.imshow(image, cmap='Greys_r', extent=[xy_range[0], xy_range[1], xy_range[0], xy_range[1]]) 766 | 767 | if filename is not None: 768 | plt.savefig(filename) 769 | else: 770 | plt.show() 771 | -------------------------------------------------------------------------------- /cvae/lib/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/maxfrenzel/CompressionVAE/9d6b52359b885a03797be41f6d5baa17925d83ef/cvae/lib/__init__.py -------------------------------------------------------------------------------- /cvae/lib/data_reader.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import random 3 | import tensorflow as tf 4 | import numpy as np 5 | import joblib 6 | import os 7 | from tqdm import tqdm 8 | 9 | 10 | # Check or compute features 11 | def normalize(ids, file_paths, logdir): 12 | 13 | # Find normalisation factors 14 | norm_file = f'{logdir}/norm.pkl' 15 | if not os.path.isfile(norm_file): 16 | 17 | print('Calculating normalisation factors.') 18 | 19 | feat_list = [] 20 | 21 | for k, id_val in enumerate(tqdm(ids)): 22 | 23 | feat_list.append(np.load(file_paths[id_val])) 24 | 25 | feat_array = np.stack(feat_list) 26 | 27 | mean = np.mean(feat_array, axis=0) 28 | max_val = np.max(feat_array, axis=0) 29 | min_val = np.min(feat_array, axis=0) 30 | var = np.var(feat_array, axis=0) 31 | 32 | # Normalize by standard deviation 33 | norm = np.sqrt(var) 34 | 35 | norm_dict = {'mean': mean, 36 | 'norm': norm, 37 | 'min_val': min_val, 38 | 'max_val': max_val} 39 | 40 | joblib.dump(norm_dict, norm_file) 41 | 42 | print('Normalisation factors calculated.') 43 | 44 | else: 45 | print('Normalisation factors already stored.') 46 | 47 | 48 | def load_norm(norm_file): 49 | norm_dict = joblib.load(norm_file) 50 | mean = norm_dict['mean'] 51 | norm = norm_dict['norm'] 52 | 53 | return mean, norm 54 | 55 | 56 | def return_data(ids, file_paths, logdir, normalize=True, randomize=True): 57 | 58 | # Shuffle tha data 59 | randomized_data = ids[:] 60 | if randomize: 61 | random.shuffle(randomized_data) 62 | 63 | # If desired, load normalisation 64 | if normalize: 65 | norm_file = f'{logdir}/norm.pkl' 66 | mean, norm = load_norm(norm_file) 67 | 68 | # Loop through data 69 | for id_val in randomized_data: 70 | 71 | # Load features and annotations and extract correct slices 72 | features = np.load(file_paths[id_val]) 73 | 74 | # Normalise 75 | if normalize: 76 | features -= mean 77 | # Can occasionally have features with zero 
variance, set those values to zero 78 | features = np.divide(features, norm, out=np.zeros_like(features), where=norm != 0) 79 | 80 | yield features 81 | 82 | 83 | class DataReader(object): 84 | def __init__(self, 85 | dataset_file, 86 | params, 87 | param_file, 88 | coord, 89 | logdir, 90 | queue_size=128): 91 | 92 | self.params = params 93 | self.param_file = param_file 94 | self.ids, self.file_paths, self.dimension = load_dataset_file(dataset_file) 95 | self.coord = coord 96 | self.logdir = logdir 97 | self.threads = [] 98 | 99 | self.num_data = len(self.ids) 100 | print('Total amount of data: ', self.num_data) 101 | print("Input feature dimension: ", self.dimension) 102 | 103 | # Make sure normalization factors have been calculated 104 | if self.params['feature_normalization']: 105 | normalize(self.ids, self.file_paths, self.logdir) 106 | 107 | self.feature_placeholder = tf.placeholder(dtype=tf.float32, shape=None) 108 | self.feature_queue = tf.PaddingFIFOQueue(queue_size, 109 | ['float32'], 110 | shapes=[[self.dimension]]) 111 | self.feature_enqueue = self.feature_queue.enqueue([self.feature_placeholder]) 112 | 113 | def dequeue_feature(self, num_elements): 114 | output = self.feature_queue.dequeue_many(num_elements) 115 | return output 116 | 117 | def thread_main(self, sess): 118 | stop = False 119 | # Go through the dataset multiple times 120 | while not stop: 121 | iterator = return_data(self.ids, self.file_paths, 122 | logdir=self.logdir, 123 | normalize=self.params['feature_normalization']) 124 | count = 0 125 | for feature in iterator: 126 | if self.coord.should_stop(): 127 | stop = True 128 | break 129 | 130 | sess.run(self.feature_enqueue, 131 | feed_dict={self.feature_placeholder: feature}) 132 | 133 | count += 1 134 | 135 | def start_threads(self, sess, n_threads=1): 136 | for _ in range(n_threads): 137 | thread = threading.Thread(target=self.thread_main, args=(sess,)) 138 | thread.daemon = True # Thread will close when parent quits. 
139 | thread.start() 140 | self.threads.append(thread) 141 | return self.threads 142 | 143 | def get_epoch(self, batch_size, step): 144 | return (batch_size * step) / self.num_data 145 | 146 | 147 | class Batcher(object): 148 | def __init__(self, 149 | dataset_file, 150 | params, 151 | param_file, 152 | logdir, 153 | shuffle=False): 154 | 155 | self.params = params 156 | self.param_file = param_file 157 | self.ids, self.file_paths, self.dimension = load_dataset_file(dataset_file) 158 | self.logdir = logdir 159 | self.shuffle = shuffle 160 | 161 | if self.shuffle: 162 | np.random.shuffle(self.ids) 163 | 164 | self.num_data = len(self.ids) 165 | print('Total amount of data: ', self.num_data) 166 | 167 | self.index = 0 168 | 169 | if self.params['feature_normalization']: 170 | self.mean, self.norm = load_norm(f'{self.logdir}/norm.pkl') 171 | 172 | def get_epoch(self, batch_size, step): 173 | return (batch_size * step) / self.num_data 174 | 175 | def next_batch(self, batch_size): 176 | 177 | feature_list = [] 178 | truth_list = [] 179 | 180 | data_iterator = return_data(self.ids, self.file_paths, 181 | logdir=self.logdir, 182 | normalize=self.params['feature_normalization'], 183 | randomize=False) 184 | 185 | for k in range(batch_size): 186 | 187 | # Return features from generator, possibly recreating it if it's empty 188 | try: 189 | features = next(data_iterator) 190 | except: 191 | # Recreate the generator 192 | data_iterator = return_data(self.ids, self.file_paths, 193 | logdir=self.logdir, 194 | normalize=self.params['feature_normalization'], 195 | randomize=False) 196 | features = next(data_iterator) 197 | 198 | feature_list.append(np.float32(np.expand_dims(features, axis=0))) 199 | 200 | self.index += 1 201 | if self.index == self.num_data: 202 | self.index = 0 203 | 204 | if self.shuffle: 205 | np.random.shuffle(self.ids) 206 | 207 | feature_batch = np.concatenate(feature_list, axis=0) 208 | 209 | return feature_batch 210 | 211 | 212 | def load_dataset_file(filename): 213 | 214 | print('Loading dataset.') 215 | 216 | dataset = joblib.load(filename) 217 | 218 | dimension = dataset['dimension'] 219 | file_paths = dataset['file_paths'] 220 | ids = dataset['ids'] 221 | 222 | return ids, file_paths, dimension 223 | -------------------------------------------------------------------------------- /cvae/lib/data_reader_array.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import random 3 | import tensorflow as tf 4 | import numpy as np 5 | import joblib 6 | import os 7 | import copy 8 | from tqdm import tqdm 9 | 10 | 11 | # Check or compute features 12 | def normalize(feat_array, logdir): 13 | 14 | # Find normalisation factors 15 | norm_file = f'{logdir}/norm.pkl' 16 | if not os.path.isfile(norm_file): 17 | 18 | print('Calculating normalisation factors.') 19 | 20 | mean = np.mean(feat_array, axis=0) 21 | max_val = np.max(feat_array, axis=0) 22 | min_val = np.min(feat_array, axis=0) 23 | var = np.var(feat_array, axis=0) 24 | 25 | # Normalize by standard deviation 26 | norm = np.sqrt(var) 27 | 28 | norm_dict = {'mean': mean, 29 | 'norm': norm, 30 | 'min_val': min_val, 31 | 'max_val': max_val} 32 | 33 | joblib.dump(norm_dict, norm_file) 34 | 35 | print('Normalisation factors calculated.') 36 | 37 | else: 38 | print('Normalisation factors already stored.') 39 | 40 | 41 | def load_norm(norm_file): 42 | norm_dict = joblib.load(norm_file) 43 | mean = norm_dict['mean'] 44 | norm = norm_dict['norm'] 45 | 46 | return mean, norm 47 | 48 | 49 | 
def return_data(feat_array, logdir, normalize=True, randomize=True): 50 | 51 | # Shuffle tha data 52 | randomized_indices = list(range(len(feat_array))) 53 | if randomize: 54 | random.shuffle(randomized_indices) 55 | 56 | # If desired, load normalisation 57 | if normalize: 58 | norm_file = f'{logdir}/norm.pkl' 59 | mean, norm = load_norm(norm_file) 60 | 61 | # Loop through data 62 | for id_val in randomized_indices: 63 | 64 | # Load features and annotations and extract correct slices 65 | features = copy.copy(feat_array[id_val]) 66 | 67 | # Normalise 68 | if normalize: 69 | features -= mean 70 | # Can occasionally have feature dimensions with zero variance, set those values to zero 71 | features = np.divide(features, norm, out=np.zeros_like(features), where=norm != 0) 72 | 73 | yield features 74 | 75 | 76 | class DataReader(object): 77 | def __init__(self, 78 | feat_array, 79 | feature_normalization, 80 | coord, 81 | logdir, 82 | queue_size=128): 83 | 84 | self.feat_array = feat_array 85 | self.normalize = feature_normalization 86 | self.num_data = feat_array.shape[0] 87 | self.dimension = feat_array.shape[1] 88 | self.coord = coord 89 | self.logdir = logdir 90 | self.threads = [] 91 | 92 | print('Total amount of data: ', self.num_data) 93 | print("Input feature dimension: ", self.dimension) 94 | 95 | # Make sure normalization factors have been calculated 96 | if self.normalize: 97 | normalize(self.feat_array, self.logdir) 98 | 99 | self.feature_placeholder = tf.compat.v1.placeholder(dtype=tf.float32, shape=None) 100 | self.feature_queue = tf.compat.v1.PaddingFIFOQueue(queue_size, 101 | ['float32'], 102 | shapes=[[self.dimension]]) 103 | self.feature_enqueue = self.feature_queue.enqueue([self.feature_placeholder]) 104 | 105 | def dequeue_feature(self, num_elements): 106 | output = self.feature_queue.dequeue_many(num_elements) 107 | return output 108 | 109 | def thread_main(self, sess): 110 | stop = False 111 | # Go through the dataset multiple times 112 | while not stop: 113 | iterator = return_data(self.feat_array, 114 | logdir=self.logdir, 115 | normalize=self.normalize) 116 | count = 0 117 | for feature in iterator: 118 | if self.coord.should_stop(): 119 | stop = True 120 | break 121 | 122 | sess.run(self.feature_enqueue, 123 | feed_dict={self.feature_placeholder: feature}) 124 | 125 | count += 1 126 | 127 | def start_threads(self, sess, n_threads=1): 128 | for _ in range(n_threads): 129 | thread = threading.Thread(target=self.thread_main, args=(sess,)) 130 | thread.daemon = True # Thread will close when parent quits. 
131 | thread.start() 132 | self.threads.append(thread) 133 | return self.threads 134 | 135 | def get_epoch(self, batch_size, step): 136 | return (batch_size * step) / self.num_data 137 | 138 | 139 | class Batcher(object): 140 | def __init__(self, 141 | feat_array, 142 | feature_normalization, 143 | logdir, 144 | shuffle=False): 145 | 146 | self.feat_array = feat_array 147 | self.normalize = feature_normalization 148 | self.logdir = logdir 149 | self.shuffle = shuffle 150 | self.randomized_indices = list(range(len(feat_array))) 151 | 152 | if self.shuffle: 153 | np.random.shuffle(self.randomized_indices) 154 | 155 | self.num_data = len(self.randomized_indices) 156 | print('Total amount of data: ', self.num_data) 157 | 158 | self.index = 0 159 | 160 | if self.normalize: 161 | self.mean, self.norm = load_norm(f'{self.logdir}/norm.pkl') 162 | 163 | def get_epoch(self, batch_size, step): 164 | return (batch_size * step) / self.num_data 165 | 166 | def next_batch(self, batch_size): 167 | 168 | feature_list = [] 169 | 170 | data_iterator = return_data(self.feat_array, 171 | logdir=self.logdir, 172 | normalize=self.normalize, 173 | randomize=False) 174 | 175 | for k in range(batch_size): 176 | 177 | # Return features from generator, possibly recreating it if it's empty 178 | try: 179 | features = next(data_iterator) 180 | except: 181 | # Recreate the generator 182 | data_iterator = return_data(self.feat_array, 183 | logdir=self.logdir, 184 | normalize=self.normalize, 185 | randomize=False) 186 | features = next(data_iterator) 187 | 188 | feature_list.append(np.float32(np.expand_dims(features, axis=0))) 189 | 190 | self.index += 1 191 | if self.index == self.num_data: 192 | self.index = 0 193 | 194 | if self.shuffle: 195 | np.random.shuffle(self.ids) 196 | 197 | feature_batch = np.concatenate(feature_list, axis=0) 198 | 199 | return feature_batch 200 | 201 | 202 | def load_dataset_file(filename): 203 | 204 | print('Loading dataset.') 205 | 206 | dataset = joblib.load(filename) 207 | 208 | dimension = dataset['dimension'] 209 | file_paths = dataset['file_paths'] 210 | ids = dataset['ids'] 211 | 212 | return ids, file_paths, dimension 213 | -------------------------------------------------------------------------------- /cvae/lib/functions.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import random 4 | import joblib 5 | import numpy as np 6 | 7 | def get_data_subset(ids, full_data): 8 | 9 | file_paths_full = full_data['file_paths'] 10 | 11 | file_paths = dict() 12 | 13 | for id in ids: 14 | file_paths[id] = file_paths_full[id] 15 | 16 | datasubset = { 17 | 'ids': ids, 18 | 'file_paths': file_paths, 19 | 'dimension': full_data['dimension'] 20 | } 21 | 22 | return datasubset 23 | 24 | 25 | def prepare_dataset(data_dir, 26 | logdir, 27 | train_ratio=0.9): 28 | 29 | # Get all paths of numpy files 30 | feature_files = [] 31 | 32 | for dirName, subdirList, fileList in os.walk(data_dir, topdown=False): 33 | for fname in fileList: 34 | if os.path.splitext(fname)[1] in ['.npy']: 35 | feature_files.append('%s/%s' % (dirName, fname)) 36 | 37 | print(f'Total number of feature vectors found: {len(feature_files)}. Building dataset.') 38 | 39 | # Build dataset 40 | ids = [] 41 | file_paths = dict() 42 | 43 | for path in feature_files: 44 | # Find unique ID . 
45 |         id = os.path.splitext(os.path.basename(path))[0]
46 |         while id in ids:
47 |             id += 'x'
48 | 
49 |         file_paths[id] = path
50 |         ids.append(id)
51 | 
52 |     # Get dimensionality (assumed to be the same for all feature files)
53 |     dimension = np.load(feature_files[0]).shape[0]
54 |     print(f'Dimensionality of dataset: {dimension}.')
55 | 
56 |     dataset = {
57 |         'ids': ids,
58 |         'file_paths': file_paths,
59 |         'dimension': dimension
60 |     }
61 | 
62 |     # Train/valid split
63 |     split_index = int(train_ratio * len(ids))
64 | 
65 |     for k in range(10):
66 |         random.shuffle(ids)
67 | 
68 |     ids_train = ids[:split_index]
69 |     ids_valid = ids[split_index:]
70 | 
71 |     print(f'Splitting {len(ids)} samples into {len(ids_train)} training and {len(ids_valid)} validation samples.')
72 | 
73 |     dataset_train = get_data_subset(ids_train, dataset)
74 |     dataset_valid = get_data_subset(ids_valid, dataset)
75 | 
76 |     print('Saving dataset files.')
77 | 
78 |     if not os.path.exists(logdir):
79 |         os.makedirs(logdir)
80 | 
81 |     joblib.dump(dataset_train, f'{logdir}/data_train.pkl')
82 |     joblib.dump(dataset_valid, f'{logdir}/data_valid.pkl')
83 | 
84 |     print('Done.')
85 | 
86 |     return dimension
87 | 
--------------------------------------------------------------------------------
/cvae/lib/model_iaf.py:
--------------------------------------------------------------------------------
1 | import tensorflow.compat.v1 as tf
2 | tf.disable_v2_behavior()
3 | 
4 | 
5 | def create_variable(name, shape, initializer_type=None):
6 |     """Create a weight variable with the specified name and shape,
7 |     and initialize it with the specified initializer."""
8 |     if initializer_type == 'truncated_normal':
9 |         initializer = tf.initializers.truncated_normal()
10 |     elif initializer_type == 'lecun_normal':
11 |         initializer = tf.initializers.lecun_normal()
12 |     elif initializer_type == 'orthogonal':
13 |         initializer = tf.initializers.orthogonal()
14 |     else:
15 |         print('No initializer type provided or provided type unknown. Defaulting to orthogonal.')
16 |         initializer = tf.initializers.orthogonal()
17 |     variable = tf.Variable(initializer(shape=shape), name=name)
18 |     return variable
19 | 
20 | 
21 | def create_bias_variable(name, shape):
22 |     """Create a bias variable with the specified name and shape and initialize it."""
23 |     initializer = tf.constant_initializer(value=0.001, dtype=tf.float32)
24 |     return tf.Variable(initializer(shape=shape), name=name)
25 | 
26 | 
27 | # KL divergence between the posterior (with autoregressive flow) and the prior
28 | def kl_divergence(sigma, epsilon, z_K, param, batch_mean=True):
29 |     # Log-density of the posterior base sample (up to constants; the log-sigma terms are added below)
30 |     log_q_z0 = -0.5 * tf.square(epsilon)
31 | 
32 |     # Negative log-density of the standard normal prior (the Gaussian constants cancel)
33 |     log_p_zK = 0.5 * tf.square(z_K)
34 | 
35 |     # Log-determinant terms from the initial std and each flow layer
36 |     flow_loss = 0
37 |     for l in range(param['iaf_flow_length'] + 1):
38 |         # Make sure it can't take log(0) or log(neg)
39 |         flow_loss -= tf.log(sigma[l] + 1e-10)
40 | 
41 |     kl_divs = tf.identity(log_q_z0 + flow_loss + log_p_zK)
42 |     kl_divs_reduced = tf.reduce_sum(kl_divs, axis=1)
43 | 
44 |     if batch_mean:
45 |         return tf.reduce_mean(kl_divs, axis=0), tf.reduce_mean(kl_divs_reduced)
46 |     else:
47 |         return kl_divs, kl_divs_reduced
48 | 
49 | 
50 | class VAEModel(object):
51 | 
52 |     def __init__(self,
53 |                  param,
54 |                  batch_size,
55 |                  input_dim,
56 |                  activation=tf.nn.relu,
57 |                  activation_nf=tf.nn.relu,
58 |                  keep_prob=1.0,
59 |                  encode=False,
60 |                  initializer='orthogonal'):
61 | 
62 |         self.input_dim = input_dim
63 |         self.param = param
64 |         self.batch_size = batch_size
65 |         self.activation = activation
66 |         self.activation_nf = activation_nf
67 |         self.encode = encode
68 |         self.cells_enc = self.param['cells_encoder']
69 |         self.layers_enc = len(param['cells_encoder'])
70 |         self.cells_dec = self.cells_enc[::-1]
71 |         self.layers_dec = self.layers_enc
72 |         self.cells_hidden = self.param['cells_hidden']
73 |         self.dim_latent = param['dim_latent']
74 |         self.keep_prob = keep_prob
75 |         self.initializer = initializer
76 |         self.variables = self._create_variables()
77 | 
78 |     def _create_variables(self):
79 |         """This function creates all variables used by the network.
80 | This allows us to share them between multiple calls to the loss 81 | function and generation function.""" 82 | 83 | var = dict() 84 | 85 | with tf.variable_scope('VAE'): 86 | 87 | with tf.variable_scope("Encoder"): 88 | 89 | var['encoder_stack'] = list() 90 | with tf.variable_scope('encoder_stack'): 91 | 92 | for l, num_units in enumerate(self.cells_enc): 93 | 94 | with tf.variable_scope('layer{}'.format(l)): 95 | 96 | layer = dict() 97 | 98 | if l == 0: 99 | units_in = self.input_dim 100 | else: 101 | units_in = self.cells_enc[l - 1] 102 | 103 | units_out = num_units 104 | 105 | layer['W'] = create_variable("W", 106 | shape=[units_in, units_out], 107 | initializer_type=self.initializer) 108 | layer['b'] = create_bias_variable("b", 109 | shape=[1, units_out]) 110 | 111 | var['encoder_stack'].append(layer) 112 | 113 | with tf.variable_scope('fully_connected'): 114 | 115 | layer = dict() 116 | 117 | num_cells_hidden = self.cells_hidden 118 | 119 | layer['W_z0'] = create_variable("W_z0", 120 | shape=[self.cells_enc[-1], 2 * num_cells_hidden], 121 | initializer_type=self.initializer) 122 | layer['b_z0'] = create_bias_variable("b_z0", 123 | shape=[1, 2 * num_cells_hidden]) 124 | 125 | layer['W_mu'] = create_variable("W_mu", 126 | shape=[self.cells_hidden, self.param['dim_latent']], 127 | initializer_type=self.initializer) 128 | layer['W_logvar'] = create_variable("W_logvar", 129 | shape=[self.cells_hidden, self.param['dim_latent']], 130 | initializer_type=self.initializer) 131 | layer['b_mu'] = create_bias_variable("b_mu", 132 | shape=[1, self.param['dim_latent']]) 133 | layer['b_logvar'] = create_bias_variable("b_logvar", 134 | shape=[1, self.param['dim_latent']]) 135 | 136 | var['encoder_fc'] = layer 137 | 138 | with tf.variable_scope("IAF"): 139 | 140 | var['iaf_flows'] = list() 141 | for l in range(self.param['iaf_flow_length']): 142 | 143 | with tf.variable_scope('layer{}'.format(l)): 144 | 145 | layer = dict() 146 | 147 | # Hidden state 148 | layer['W_flow'] = create_variable("W_flow", 149 | shape=[self.cells_enc[-1], self.dim_latent], 150 | initializer_type=self.initializer) 151 | layer['b_flow'] = create_bias_variable("b_flow", 152 | shape=[1, self.dim_latent]) 153 | 154 | flow_variables = list() 155 | # Flow parameters from hidden state (m and s parameters for IAF) 156 | for j in range(self.dim_latent): 157 | with tf.variable_scope('flow_layer{}'.format(j)): 158 | 159 | flow_layer = dict() 160 | 161 | # Set correct dimensionality 162 | units_to_hidden_iaf = self.param['dim_autoregressive_nl'] 163 | 164 | flow_layer['W_flow_params_nl'] = create_variable("W_flow_params_nl", 165 | shape=[self.dim_latent + j, 166 | units_to_hidden_iaf], 167 | initializer_type=self.initializer) 168 | flow_layer['b_flow_params_nl'] = create_bias_variable("b_flow_params_nl", 169 | shape=[1, units_to_hidden_iaf]) 170 | 171 | flow_layer['W_flow_params'] = create_variable("W_flow_params", 172 | shape=[units_to_hidden_iaf, 173 | 2], 174 | initializer_type=self.initializer) 175 | flow_layer['b_flow_params'] = create_bias_variable("b_flow_params", 176 | shape=[1, 2]) 177 | 178 | flow_variables.append(flow_layer) 179 | 180 | layer['flow_vars'] = flow_variables 181 | 182 | var['iaf_flows'].append(layer) 183 | 184 | with tf.variable_scope("Decoder"): 185 | 186 | var['decoder_stack'] = list() 187 | with tf.variable_scope('deconv_stack'): 188 | 189 | for l, num_units in enumerate(self.cells_dec): 190 | 191 | with tf.variable_scope('layer{}'.format(l)): 192 | 193 | layer = dict() 194 | 195 | if l == 0: 196 | 
units_in = self.dim_latent 197 | else: 198 | units_in = self.cells_dec[l - 1] 199 | 200 | units_out = num_units 201 | 202 | layer['W'] = create_variable("W", 203 | shape=[units_in, units_out], 204 | initializer_type=self.initializer) 205 | layer['b'] = create_bias_variable("b", 206 | shape=[1, units_out]) 207 | 208 | var['decoder_stack'].append(layer) 209 | 210 | with tf.variable_scope('fully_connected'): 211 | layer = dict() 212 | 213 | layer['W_mu'] = create_variable("W_mu", 214 | shape=[self.cells_dec[-1], self.input_dim], 215 | initializer_type=self.initializer) 216 | # layer['W_logvar'] = create_variable("W_logvar", 217 | # shape=[self.cells_dec[-1], self.input_dim]) 218 | layer['b_mu'] = create_bias_variable("b_mu", 219 | shape=[1, self.input_dim]) 220 | # layer['b_logvar'] = create_bias_variable("b_logvar", 221 | # shape=[1, self.input_dim]) 222 | 223 | var['decoder_fc'] = layer 224 | 225 | return var 226 | 227 | def _create_network(self, input_batch, encode=False): 228 | 229 | # ----------------------------------- 230 | # Encoder 231 | 232 | # Remove redundant dimension (weird thing to get PaddingFIFOQueue to work) 233 | # input_batch = tf.squeeze(input_batch) 234 | 235 | # Do encoder calculation 236 | encoder_hidden = input_batch 237 | # print('Encoder hidden state 0: ', encoder_hidden) 238 | for l in range(self.layers_enc): 239 | encoder_hidden = tf.nn.dropout(self.activation(tf.matmul(encoder_hidden, 240 | self.variables['encoder_stack'][l]['W']) 241 | + self.variables['encoder_stack'][l]['b']), 242 | keep_prob=self.keep_prob) 243 | 244 | # print(f'Encoder hidden state {l}: ', encoder_hidden) 245 | 246 | # encoder_hidden = tf.reshape(encoder_hidden, [-1, self.conv_out_units]) 247 | 248 | # Additional non-linearity between encoder hidden state and prediction of mu_0,sigma_0 249 | mu_logvar_hidden = tf.nn.dropout(self.activation(tf.matmul(encoder_hidden, 250 | self.variables['encoder_fc']['W_z0']) 251 | + self.variables['encoder_fc']['b_z0']), 252 | keep_prob=self.keep_prob) 253 | 254 | # Split into parts for mean and variance 255 | mu_hidden, logvar_hidden = tf.split(mu_logvar_hidden, num_or_size_splits=2, axis=1) 256 | 257 | # Final linear layer to calculate mean and variance 258 | encoder_mu = tf.add(tf.matmul(mu_hidden, self.variables['encoder_fc']['W_mu']), 259 | self.variables['encoder_fc']['b_mu'], name='ZMu') 260 | encoder_logvar = tf.add(tf.matmul(logvar_hidden, self.variables['encoder_fc']['W_logvar']), 261 | self.variables['encoder_fc']['b_logvar'], name='ZLogVar') 262 | 263 | # Convert log variance into standard deviation 264 | encoder_std = tf.exp(0.5 * encoder_logvar) 265 | 266 | # Sample epsilon 267 | epsilon = tf.random_normal(tf.shape(encoder_std), name='epsilon') 268 | 269 | if encode: 270 | z0 = tf.identity(encoder_mu, name='LatentZ0') 271 | else: 272 | z0 = tf.identity(tf.add(encoder_mu, tf.multiply(encoder_std, epsilon), 273 | name='LatentZ0')) 274 | 275 | # ----------------------------------- 276 | # Latent flow 277 | 278 | # Lists to store the latent variables and the flow parameters 279 | nf_z = [z0] 280 | nf_sigma = [encoder_std] 281 | 282 | # Do calculations for each flow layer 283 | for l in range(self.param['iaf_flow_length']): 284 | 285 | W_flow = self.variables['iaf_flows'][l]['W_flow'] 286 | b_flow = self.variables['iaf_flows'][l]['b_flow'] 287 | 288 | nf_hidden = self.activation_nf(tf.matmul(encoder_hidden, W_flow) + b_flow) 289 | 290 | # Autoregressive calculation 291 | m_list = self.dim_latent * [None] 292 | s_list = self.dim_latent * [None] 293 | 
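            # Clarifying note on the loop below: each latent dimension j gets its own (m_j, s_j),
            # computed from the flow hidden state concatenated with the first j components of the
            # previous latent sample, which keeps the transform autoregressive (triangular Jacobian).
            # After the loop, the update applied is
            #     sigma = sigmoid(initial_s_offset + s)
            #     z_new = sigma * z_old + (1 - sigma) * m
            # whose log-determinant is the sum of log(sigma) terms accumulated in kl_divergence above.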
294 | for j, flow_vars in enumerate(self.variables['iaf_flows'][l]['flow_vars']): 295 | 296 | # Go through computation one variable at a time 297 | if j == 0: 298 | hidden_autoregressive = nf_hidden 299 | else: 300 | z_slice = tf.slice(nf_z[-1], [0, 0], [-1, j]) 301 | hidden_autoregressive = tf.concat(axis=1, values=[nf_hidden, z_slice]) 302 | 303 | W_flow_params_nl = flow_vars['W_flow_params_nl'] 304 | b_flow_params_nl = flow_vars['b_flow_params_nl'] 305 | W_flow_params = flow_vars['W_flow_params'] 306 | b_flow_params = flow_vars['b_flow_params'] 307 | 308 | # Non-linearity at current autoregressive step 309 | nf_hidden_nl = self.activation_nf(tf.matmul(hidden_autoregressive, 310 | W_flow_params_nl) + b_flow_params_nl) 311 | 312 | # Calculate parameters for normalizing flow as linear transform 313 | ms = tf.matmul(nf_hidden_nl, W_flow_params) + b_flow_params 314 | 315 | # Split into individual components 316 | # m_list[j], s_list[j] = tf.split_v(value=ms, 317 | # size_splits=[1,1], 318 | # split_dim=1) 319 | m_list[j], s_list[j] = tf.split(value=ms, 320 | num_or_size_splits=[1, 1], 321 | axis=1) 322 | 323 | # Concatenate autoregressively computed variables 324 | # Add offset to s to make sure it starts out positive 325 | # (could have also initialised the bias term to 1) 326 | # Guarantees that flow initially small 327 | m = tf.concat(axis=1, values=m_list) 328 | s = self.param['initial_s_offset'] + tf.concat(axis=1, values=s_list) 329 | 330 | # Calculate sigma ("update gate value") from s 331 | sigma = tf.nn.sigmoid(s) 332 | nf_sigma.append(sigma) 333 | 334 | # Perform normalizing flow 335 | z_current = tf.multiply(sigma, nf_z[-1]) + tf.multiply((1 - sigma), m) 336 | 337 | # Invert order of variables to alternate dependence of autoregressive structure 338 | z_current = tf.reverse(z_current, axis=[1], name='LatentZ%d' % (l + 1)) 339 | 340 | # Add to list of latent variables 341 | nf_z.append(z_current) 342 | 343 | z = tf.identity(nf_z[-1], name="LatentZ") 344 | 345 | # ----------------------------------- 346 | # Decoder 347 | 348 | # Fully connected 349 | decoder_hidden = z 350 | 351 | for l in range(self.layers_dec): 352 | # print(decoder_hidden) 353 | decoder_hidden = tf.nn.dropout(self.activation(tf.matmul(decoder_hidden, 354 | self.variables['decoder_stack'][l]['W']) 355 | + self.variables['decoder_stack'][l]['b']), 356 | keep_prob=self.keep_prob) 357 | decoder_hidden = self.activation(decoder_hidden) 358 | 359 | # Split into mu and logvar parts 360 | # decoder_hidden_mu, decoder_hidden_logvar = tf.split(decoder_hidden, num_or_size_splits=2, axis=1) 361 | 362 | # Final layer 363 | decoder_mu = tf.add(tf.matmul(decoder_hidden, self.variables['decoder_fc']['W_mu']), 364 | self.variables['decoder_fc']['b_mu'], 365 | name='XMu') 366 | # decoder_logvar = tf.add(tf.matmul(decoder_hidden_logvar, self.variables['decoder_fc']['W_logvar']), 367 | # self.variables['decoder_fc']['b_logvar'], 368 | # name='XLogVar') 369 | # 370 | # # Add clipping to avoid zero division 371 | # decoder_logvar = tf.clip_by_value(decoder_logvar, 372 | # clip_value_min=-8.0, 373 | # clip_value_max=+8.0) 374 | 375 | # Set decoder variance as fixed hyperparameter for stability; common assumption in Gaussian decoders 376 | decoder_logvar = tf.zeros_like(decoder_mu) 377 | 378 | # return decoder_output, encoder_hidden, encoder_logvar, encoder_std 379 | return decoder_mu, decoder_logvar, encoder_mu, encoder_logvar, encoder_std, epsilon, z, z0, nf_sigma 380 | 381 | def decode(self, z): 382 | 383 | decoder_hidden = z 384 | 
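        # Clarifying note: decode() rebuilds only the decoder path using the same weights created in
        # _create_variables, so a latent vector (for example one chosen by hand) can be mapped back to
        # data space without running the encoder. Dropout still uses self.keep_prob, which defaults to
        # 1.0, i.e. it is disabled unless the caller sets it otherwise.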
385 | for l in range(self.layers_dec): 386 | # print(decoder_hidden) 387 | decoder_hidden = tf.nn.dropout(self.activation(tf.matmul(decoder_hidden, 388 | self.variables['decoder_stack'][l]['W']) 389 | + self.variables['decoder_stack'][l]['b']), 390 | keep_prob=self.keep_prob) 391 | decoder_hidden = self.activation(decoder_hidden) 392 | 393 | decoder_mu = tf.add(tf.matmul(decoder_hidden, self.variables['decoder_fc']['W_mu']), 394 | self.variables['decoder_fc']['b_mu'], 395 | name='XMu') 396 | 397 | return decoder_mu 398 | 399 | def input_identity(self, input_batch): 400 | 401 | # return tf.matmul(input_batch, self.variables['encoder_stack'][0]['W']) 402 | 403 | return self.variables['encoder_stack'][0]['W'] 404 | 405 | def loss(self, 406 | input_batch, 407 | name='vae', 408 | beta=1.0, 409 | test=False): 410 | 411 | with tf.name_scope(name): 412 | 413 | # Run computation 414 | decoder_mu, decoder_logvar, encoder_mu, encoder_logvar, encoder_std, epsilon, z, z0, nf_sigma = self._create_network(input_batch) 415 | 416 | # print("Output size: ", decoder_mu) 417 | 418 | # KL-Divergence loss 419 | _, div = kl_divergence(nf_sigma, epsilon, z, self.param, batch_mean=False) 420 | loss_latent = tf.identity(div, name='LossLatent') 421 | 422 | # Reconstruction loss assuming Gaussian output distribution 423 | decoder_var = tf.exp(decoder_logvar) 424 | loss_reconstruction = tf.identity(0.5 * tf.reduce_sum(tf.math.divide(tf.square(input_batch - decoder_mu), 425 | decoder_var) 426 | + decoder_logvar, axis=1), 427 | name='LossReconstruction') 428 | 429 | # Small penalty to prevent z0 values from going to infinity 430 | z0_boundary = 10.0 * tf.ones_like(z0) 431 | z0_for_penalty = tf.maximum(z0_boundary, tf.abs(z0)) 432 | z0_large = tf.reduce_mean(tf.square(z0_for_penalty - z0_boundary), axis=1) 433 | 434 | loss = tf.reduce_mean(loss_reconstruction + beta*loss_latent, name='Loss') 435 | 436 | if not test: 437 | tf.summary.scalar('loss_total', loss) 438 | tf.summary.scalar('loss_rec_per_feat', tf.reduce_mean(loss_reconstruction)/self.input_dim) 439 | tf.summary.scalar('loss_kl_per_dim', tf.reduce_mean(loss_latent)/self.dim_latent) 440 | tf.summary.scalar('beta', beta) 441 | 442 | return loss 443 | 444 | def embed(self, input_batch): 445 | 446 | # Run computation 447 | _, _, _, _, _, _, z, _, _ = self._create_network(input_batch, encode=True) 448 | 449 | return z 450 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | # This includes the license file(s) in the wheel. 3 | # https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file 4 | license_files = LICENSE.txt 5 | 6 | [bdist_wheel] 7 | # This flag says to generate wheels that support both Python 2 and Python 8 | # 3. If your code will not run unchanged on both Python 2 and 3, you will 9 | # need to generate separate wheels for each Python version that you 10 | # support. Removing this line (or setting universal to 0) will prevent 11 | # bdist_wheel from trying to make a universal wheel. 
For more see: 12 | # https://packaging.python.org/guides/distributing-packages-using-setuptools/#wheels 13 | universal=0 -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | import platform 3 | 4 | with open("README.md", "r") as fh: 5 | long_description = fh.read() 6 | 7 | # Determine the correct TensorFlow package based on architecture 8 | if platform.machine() == 'arm64': # Apple Silicon 9 | tensorflow_packages = [ 10 | 'tensorflow-macos==2.8.0', 11 | 'tensorflow-metal==0.4.0' 12 | ] 13 | else: # Intel/AMD 14 | tensorflow_packages = ['tensorflow>=2.9.0,<2.10.0'] 15 | 16 | setuptools.setup( 17 | name="cvae", 18 | version="0.2.0", 19 | author="Max Frenzel", 20 | author_email="maxfrenzel+cvae@gmail.com", 21 | description="CompressionVAE: General purpose dimensionality reduction and manifold learning tool based on " 22 | "Variational Autoencoder.", 23 | long_description=long_description, 24 | long_description_content_type="text/markdown", 25 | url="https://github.com/maxfrenzel/CompressionVAE", 26 | packages=setuptools.find_packages(), 27 | classifiers=[ 28 | "Programming Language :: Python :: 3", 29 | "License :: OSI Approved :: MIT License", 30 | "Operating System :: OS Independent", 31 | ], 32 | python_requires='>=3.6', 33 | install_requires=[ 34 | 'numpy>=1.16.5,<1.23.0', 35 | 'matplotlib>=3.3.0,<4.0.0', 36 | 'joblib>=1.0.0,<2.0.0', 37 | 'tqdm>=4.50.0,<5.0.0', 38 | 'pandas>=1.3.0,<2.0.0' 39 | ] + tensorflow_packages, 40 | extras_require={ 41 | 'test': ['scikit-learn>=1.0.0'] 42 | }, 43 | keywords='vae variational autoencoder manifold dimensionality reduction compression tensorflow' 44 | ) --------------------------------------------------------------------------------
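
A minimal usage sketch (not part of the repository) of the `Batcher` class shown above, assuming it lives in `cvae/lib/data_reader_array.py` as the package layout suggests. Feature normalization is disabled so no `norm.pkl` statistics file is required, which also means the `logdir` argument is only a placeholder here.
```
import numpy as np
from cvae.lib import data_reader_array as dra

# Toy feature array: 1000 samples with 64 dimensions each
X = np.random.rand(1000, 64).astype(np.float32)

# With feature_normalization=False, no normalisation statistics are loaded from logdir
batcher = dra.Batcher(X, feature_normalization=False, logdir='temp', shuffle=True)

batch = batcher.next_batch(32)
print(batch.shape)                     # -> (32, 64)
print(batcher.get_epoch(32, step=10))  # fraction of an epoch seen after 10 steps of batch size 32
```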