├── .gitignore ├── LICENSE.txt ├── README.md ├── cvae ├── __init__.py ├── cvae.py └── lib │ ├── __init__.py │ ├── data_reader.py │ ├── data_reader_array.py │ ├── functions.py │ └── model_iaf.py ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | test_tf.py 2 | test_mnist.py 3 | 4 | # Python virtual environment 5 | venv/ 6 | env/ 7 | 8 | # Python package build files 9 | *.egg-info/ 10 | dist/ 11 | build/ 12 | 13 | # Temporary files 14 | temp/ 15 | *.pyc 16 | __pycache__/ 17 | 18 | # IDE specific files 19 | .vscode/ 20 | .idea/ 21 | *.swp 22 | .DS_Store 23 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Max Frenzel 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CompressionVAE 2 | 3 | Data embedding API based on the Variational Autoencoder (VAE), originally proposed by Kingma and Welling https://arxiv.org/abs/1312.6114. 4 | 5 | This tool, implemented in TensorFlow (originally built with TF1.x, but updated to TF2.x through compatibility mode), is designed to work similar to familiar dimensionality reduction methods such as scikit-learn's t-SNE or UMAP, but also go beyond their capabilities in some notable ways, making full use of the VAE as a generative model. 6 | 7 | While I decided to call the tool itself CompressionVAE, or CVAE for short, I mainly chose this to give it a unique name. 8 | In practice, it is based on a standard VAE, with the (optional) addition of Inverse Autoregressive Flow (IAF) layers to allow for more flexible posterior distributions. 9 | For details on the IAF layers, I refer you to the original paper: https://arxiv.org/pdf/1606.04934.pdf. 10 | 11 | CompressionVAE has **several unique advantages** over the common manifold learning methods like t-SNE and UMAP: 12 | * Rather than just a transformation of the training data, it provides a **reversible and deterministic function**, mapping from data space to embedding space. 13 | * Due to the reversibility of the mapping, the model can be used to **generate new data from arbitrary latent variables**. 
This also makes the embeddings highly suitable as **intermediary representations for downstream tasks**.
14 | * Once a model is trained, it can be reused to transform new data, making it **suitable for use in live settings**.
15 | * Like UMAP, CVAE is **fast and scales much better to large datasets and to high-dimensional input and latent spaces**.
16 | * The neural network architecture and training parameters are **highly customisable** through the simple API, allowing more advanced users to tailor the system to their needs.
17 | * VAEs have a **very strong theoretical foundation**, and the learned latent spaces have many desirable properties. There is also extensive literature on different variants, and CVAE can easily be extended to keep up with new research advances.
18 | 
19 | ## Installing CompressionVAE
20 | 
21 | CompressionVAE is distributed through PyPI under the name `cvae` (https://pypi.org/project/cvae/). To install the latest version, simply run
22 | ```
23 | pip install cvae
24 | ```
25 | Alternatively, to install CompressionVAE locally, clone this repository and run the following command from the CompressionVAE root directory.
26 | ```
27 | pip install -e .
28 | ```
29 | 
30 | ## Basic Use Case
31 | 
32 | To use CVAE to learn an embedding function, we first need to import the cvae library.
33 | ```
34 | from cvae import cvae
35 | ```
36 | 
37 | When creating a CompressionVAE object for a new model, it needs to be provided with a training dataset.
38 | For small datasets that fit in memory, we can directly follow the sklearn convention. Let's look at this case first and take MNIST as an example.
39 | 
40 | First, load the MNIST data. (Note: this example requires scikit-learn, which is not installed with CVAE. You might have to install it first by running `pip install scikit-learn`.)
41 | ```
42 | from sklearn.datasets import fetch_openml
43 | mnist = fetch_openml('mnist_784', version=1, cache=True)
44 | X = mnist.data
45 | ```
46 | 
47 | ### Initializing CVAE
48 | Now we can create a CompressionVAE object/model based on this data. The minimal code to do this is
49 | ```
50 | embedder = cvae.CompressionVAE(X)
51 | ```
52 | By default, this creates a model with a two-dimensional latent space, splits the data X randomly into 90% train and 10% validation data, applies feature normalization, and tries to match the model architecture to the input and latent feature dimensions.
53 | It also saves the model in a temporary directory, which gets overwritten the next time you create a new CVAE object there.
54 | 
55 | We will look at customising all this later, but for now let's move on to training.
56 | 
57 | ### Training CVAE
58 | Once a CVAE object is initialised and associated with data, we can train the embedder using its `train` method. This works similarly to t-SNE's or UMAP's `fit` method.
59 | In the simplest case, we just run
60 | ```
61 | embedder.train()
62 | ```
63 | This will train the model, applying automatic learning rate scheduling based on the validation data loss, and stop either when the model converges or after 50k training steps.
64 | We can also stop the training process early with a KeyboardInterrupt (ctrl-c, or 'Interrupt Kernel' in a Jupyter notebook). The model will be saved at this point.
65 | 
66 | It is also possible to stop training and then re-start with different parameters (see more details below).
67 | 
68 | One note/warning: At the moment, the model can be quite sensitive to initialization (in some rare cases even giving NaN losses). Re-initializing and re-training the model can help if a training run did not give satisfactory results.
69 | 
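As an example of re-starting with different parameters, the following call continues training at a lower, fixed learning rate (a sketch; the values are arbitrary, and `overwrite=True` is required once a model already has a checkpoint):
```
embedder.train(learning_rate=1e-4,
               num_steps=10000,
               lr_scheduling=False,
               overwrite=True)
```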
70 | ### Embedding data
71 | Once we have a trained model (well, technically even before training, but the results would be random), we can use CVAE to compress data, embedding it into the latent space.
72 | To do this, we use CVAE's `embed` method.
73 | 
74 | To embed the entire MNIST data:
75 | ```
76 | z = embedder.embed(X)
77 | ```
78 | But note that, unlike with t-SNE or UMAP, this data does not have to be the same as the training data. It can be new and previously unseen data.
79 | 
80 | ### Visualising the embedding
81 | For two-dimensional latent spaces, CVAE comes with a built-in visualization method, `visualize`. It provides a two-dimensional plot of the embeddings, including class information if available.
82 | 
83 | To visualize the MNIST embeddings and color them by their respective class, we can run
84 | ```
85 | embedder.visualize(z, labels=[int(label) for label in mnist.target])
86 | ```
87 | We could also pass the string labels `mnist.target` directly to `labels`, but in that case they would not necessarily be ordered from 0 to 9.
88 | Optionally, if we pass `labels` as a list of integers like above, we can also pass the `categories` parameter, a list of strings associating names with the labels. In the case of MNIST this is irrelevant since the labels and class names are the same.
89 | By default, the `visualize` method simply displays the plot. By setting the `filename` parameter, we can alternatively save the plot to a file.
90 | 
91 | ### Generating data
92 | Finally, we can use CVAE as a generative model, generating data by decoding arbitrary latent vectors with the `decode` method.
93 | If we simply want to 'undo' our MNIST embedding and try to re-create the input data, we can run our embeddings `z` through the `decode` method.
94 | ```
95 | X_reconstructed = embedder.decode(z)
96 | ```
97 | As a more interesting example, we can use this for data interpolation. Let's say we want to create the data that's halfway between the first and the second MNIST datapoint (a '5' and a '0' respectively).
98 | We can achieve this with the following code:
99 | ```
100 | import numpy as np
101 | # Combine the two examples and add batch dimension
102 | z_interp = np.expand_dims(0.5*z[0] + 0.5*z[1], axis=0)
103 | # Decode the new latent vector.
104 | X_interp = embedder.decode(z_interp)
105 | ```
106 | 
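To inspect the interpolated sample, we can reshape the decoded vector back into a 28 by 28 image and display it, for example with matplotlib (a sketch that assumes the MNIST image shape):
```
import matplotlib.pyplot as plt

# X_interp has shape (1, 784); reshape the single row into an image and plot it.
plt.imshow(X_interp[0].reshape(28, 28), cmap='Greys_r')
plt.show()
```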
107 | #### Visualizing the latent space
108 | In the case of image data, such as MNIST, CVAE also has a method that allows us to quickly visualize the latent space as seen through the decoder.
109 | To plot a 20 by 20 grid of reconstructed images, spanning the latent space region [-4, 4] in both x and y, we can run
110 | ```
111 | embedder.visualize_latent_grid(xy_range=(-4.0, 4.0),
112 |                                grid_size=20,
113 |                                shape=(28, 28))
114 | ```
115 | 
116 | ## Advanced Use Cases
117 | The example above shows the simplest usage of CVAE. However, if desired, a user can take much more control over the system and customize the model and training processes.
118 | 
119 | ### Customizing the model
120 | In the previous example, we initialised a CompressionVAE with default parameters. If we want more control, we might want to initialise it the following way:
121 | ```
122 | embedder = cvae.CompressionVAE(X,
123 |                                train_valid_split=0.99,
124 |                                dim_latent=16,
125 |                                iaf_flow_length=10,
126 |                                cells_encoder=[512, 256, 128],
127 |                                initializer='lecun_normal',
128 |                                batch_size=32,
129 |                                batch_size_test=128,
130 |                                logdir='~/mnist_16d',
131 |                                feature_normalization=False,
132 |                                tb_logging=True)
133 | ```
134 | `train_valid_split` controls the random split into training and validation data. Here 99% of X is used for training, and only 1% is reserved for validation.
135 | 
136 | Alternatively, to get more control over the data, the user can also provide `X_valid` as an input. In this case, `train_valid_split` is ignored and the model uses `X` for training and `X_valid` for validation.
137 | 
138 | `dim_latent` specifies the dimensionality of the latent space.
139 | 
140 | `iaf_flow_length` controls how many IAF flow layers the model has.
141 | 
142 | `cells_encoder` determines the number, as well as the size, of the encoder's fully connected layers. In the case above, we have three layers with 512, 256, and 128 units respectively. The decoder uses the mirrored version of this.
143 | If this parameter is not set, CVAE creates a two-layer network with sizes adjusted to the input dimension and latent dimension. The logic behind this is very handwavy and arbitrary for now, and I generally recommend setting this manually.
144 | 
145 | `initializer` controls how the model weights are initialized, with options being `orthogonal` (default), `truncated_normal`, and `lecun_normal`.
146 | 
147 | `batch_size` and `batch_size_test` determine the batch sizes used for training and testing, respectively.
148 | 
149 | `logdir` specifies the path to the model, and also acts as the model name. The default, `'temp'`, gets overwritten every time it is used, but other model names can be used to save and restore models for later use or even to continue training.
150 | 
151 | `feature_normalization` tells CVAE whether it should internally apply feature normalization (zero mean, unit variance, based on the training data) or not. If True, the normalisation factors are stored with the model and get applied to any future data.
152 | 
153 | `tb_logging` determines whether the model writes summaries for TensorBoard during the training process.
154 | 
155 | ### Customizing the training process
156 | In the simple example above, we called the `train` method without any parameters. A more advanced call might look like
157 | ```
158 | embedder.train(learning_rate=1e-4,
159 |                num_steps=2000,
160 |                dropout_keep_prob=0.6,
161 |                test_every=50,
162 |                lr_scheduling=False)
163 | ```
164 | `learning_rate` sets the initial learning rate of training.
165 | 
166 | `num_steps` controls the maximum number of training steps before stopping.
167 | 
168 | `dropout_keep_prob` determines the keep probability for dropout in the fully connected layers.
169 | 
170 | `test_every` sets the frequency of test steps.
171 | 
172 | `lr_scheduling` controls whether learning rate scheduling is applied. If `False`, training continues at `learning_rate` until `num_steps` is reached.
173 | 
174 | For more arguments, for example those controlling the learning rate scheduler and the convergence criteria, check the method definition.
175 | 
176 | ### Using large datasets
177 | 
178 | As an alternative to providing the input data `X` as a single numpy array, as done with t-SNE and UMAP, CVAE also allows for much larger datasets that do not fit into a single array.
179 | 
180 | To prepare such a dataset, create a new directory, e.g. `'~/my_dataset'`, and save the training data as one npy file per example in this directory.
181 | 
182 | (Note: the data can also be saved in nested sub-directories, for example one directory per category. CVAE will look through the entire directory tree for npy files.)
183 | 
184 | When initialising a model based on this kind of data, pass the root directory of the dataset as `X`, e.g.
185 | ```
186 | embedder = cvae.CompressionVAE(X='~/my_dataset')
187 | ```
188 | Initialising will take slightly longer than if `X` is passed as an array, even for the same number of data points. But this method scales in principle to arbitrarily large datasets, and only loads one batch at a time during training.
189 | 
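As an illustration, we could create such a dataset directory from an in-memory array by saving each row as its own npy file (a sketch; the directory path and the file naming scheme are arbitrary):
```
import os
import numpy as np

data_dir = os.path.expanduser('~/my_dataset')
os.makedirs(data_dir, exist_ok=True)

# One feature vector per file; CVAE infers the dimensionality from the first npy file it finds.
for i, x in enumerate(np.asarray(X, dtype=np.float32)):
    np.save(os.path.join(data_dir, f'sample_{i}.npy'), x)
```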
190 | ### Restarting an existing model
191 | 
192 | If a CompressionVAE object is initialized with `logdir='temp'`, it always starts from a new, untrained model, possibly overwriting any previous temp model.
193 | However, if a different `logdir` is chosen, the model persists and can be reloaded.
194 | 
195 | If CompressionVAE is initialized with a `logdir` that already exists and contains parameter and checkpoint files of a previous model, it attempts to restore that model's checkpoints.
196 | In this case, any model-specific input parameters (e.g. `dim_latent` and `cells_encoder`) are ignored in favor of the original model's parameters.
197 | 
198 | A restored model can be used straight away to embed or generate data, but it is also possible to continue the training process, picking up from the most recent checkpoint.
199 | 
--------------------------------------------------------------------------------
/cvae/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/maxfrenzel/CompressionVAE/9d6b52359b885a03797be41f6d5baa17925d83ef/cvae/__init__.py
--------------------------------------------------------------------------------
/cvae/cvae.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | import random
4 | import json
5 | import time
6 | import shutil
7 | 
8 | import numpy as np
9 | import tensorflow.compat.v1 as tf
10 | tf.disable_v2_behavior()
11 | try:
12 |     tf.logging.set_verbosity(tf.logging.ERROR)
13 | except:
14 |     pass
15 | 
16 | # import matplotlib as mpl
17 | # mpl.use('TkAgg')
18 | import matplotlib.pyplot as plt
19 | import matplotlib.cm as cm
20 | 
21 | import cvae.lib.data_reader_array as dra
22 | import cvae.lib.data_reader as dr
23 | import cvae.lib.model_iaf as model
24 | import cvae.lib.functions as fun
25 | 
26 | 
27 | # Save model to checkpoint
28 | def save(saver, sess, logdir, step):
29 |     model_name = 'model.ckpt'
30 |     checkpoint_path = os.path.join(logdir, model_name)
31 |     print('Storing checkpoint to {} ...'.format(logdir), end="")
32 |     sys.stdout.flush()
33 | 
34 |     if not os.path.exists(logdir):
35 |         os.makedirs(logdir)
36 | 
37 |     saver.save(sess, checkpoint_path, global_step=step)
38 |     print(' Done.')
39 | 
40 | 
41 | # Load model from checkpoint
42 | def load(saver, sess, logdir):
43 |     print("Trying to restore saved checkpoints from {} ...".format(logdir),
44 |           end="")
45 | 
46 |     ckpt = tf.train.get_checkpoint_state(logdir)
47 |     if ckpt:
48 |         print(" Checkpoint found: {}".format(ckpt.model_checkpoint_path))
49 |         global_step = int(ckpt.model_checkpoint_path
50 |                           .split('/')[-1]
51 |                           .split('-')[-1])
52 |         print(" Global step was: {}".format(global_step))
53 |         print(" 
Restoring...", end="") 54 | saver.restore(sess, ckpt.model_checkpoint_path) 55 | print(" Done.") 56 | return global_step 57 | else: 58 | print(" No checkpoint found.") 59 | return None 60 | 61 | 62 | class CompressionVAE(object): 63 | """ 64 | Variational Autoencoder (VAE) for vector compression/dimensionality reduction. 65 | 66 | Parameters 67 | ---------- 68 | X : array, shape (n_samples, n_features) 69 | Training data for the VAE. 70 | Alternatively, X can be the path to a root-directory containing npy files (potentially nested), each 71 | representing a single feature vector. This allows for handling of datasets that are too large to fit 72 | in memory. 73 | Can be None (default) only if a model with this name has previously been trained. Otherwise None will 74 | raise an exception. 75 | 76 | X_valid : array, shape (n__valid_samples, n_features), optional (default: None) 77 | Validation data. If not provided, X is split into training and validation data 78 | 79 | train_valid_split : float, optional (default: 0.9) 80 | Specifies in what ratio to split X into training and validation data (after randomizing the data). 81 | Ignored if X_valid provided. 82 | 83 | dim_latent : int, optional (default: 2) 84 | Dimension of latent space (i.e. number of features of embeddings) 85 | 86 | iaf_flow_length : int, optional (default: 5) 87 | Number of IAF Flow layers to use in the model. 88 | For details, see https://arxiv.org/abs/1606.04934. 89 | 90 | cells_encoder : list of int, optional (default: None) 91 | The length of this list determines the number of layers of the encoder and decoder, and the values 92 | determine the number of units per layer (reversed order for decoder). 93 | If None, this is automatically chosen based on number of features and latent dimension. 94 | 95 | initializer : string, optional (default: 'orthogonal') 96 | Initializer to use for weights of model. 97 | 98 | batch_size : int, optional (default: 64) 99 | Batch size to use for training. 100 | 101 | batch_size_test : int, optional (default: 64) 102 | Batch size to use for testing. 103 | 104 | logdir : string, optional (default: 'temp') 105 | Location for where to save the model and other related files. Can also be used to restart from an already 106 | trained model. 107 | If 'temp' (default), any previously stored data is deleted and model/data are initialised from scratch. 108 | 109 | feature_normalization : bool, optional (default: True) 110 | If True (default), normalization of all data is applied internally, based on training data statistics. 111 | 112 | tb_logging : bool, optional (default: False) 113 | If True, create tensorboard summaries with loss data etc. 
114 | """ 115 | 116 | def __init__(self, 117 | X=None, 118 | X_valid=None, 119 | train_valid_split=0.9, 120 | dim_latent=2, 121 | iaf_flow_length=5, 122 | cells_encoder=None, 123 | initializer='orthogonal', 124 | batch_size=64, 125 | batch_size_test=64, 126 | logdir='temp', 127 | feature_normalization=True, 128 | tb_logging=False): 129 | 130 | self.dim_latent = dim_latent 131 | self.iaf_flow_length = iaf_flow_length 132 | self.cells_encoder = cells_encoder 133 | self.initializer = initializer 134 | self.batch_size = batch_size 135 | self.batch_size_test = batch_size_test 136 | self.logdir = os.path.abspath(logdir) 137 | self.feature_normalization = feature_normalization 138 | self.tb_logging = tb_logging 139 | 140 | self.trained_once_this_session = False 141 | 142 | # --- Check for existing model --- 143 | 144 | # Set flag to indicate that the model has not been trained yet 145 | self.is_trained = False 146 | 147 | # If using temporary model directory (default), delete any previously stored models 148 | if logdir == 'temp' and os.path.exists(self.logdir): 149 | shutil.rmtree(self.logdir) 150 | 151 | # Check if a model with the same name already exists 152 | # If no, create directory 153 | if not os.path.exists(self.logdir): 154 | os.makedirs(self.logdir) 155 | 156 | # Do checkpoint, parameter, and norm files exist? 157 | self.has_checkpoint = os.path.exists(f'{self.logdir}/checkpoint') 158 | self.has_params = os.path.exists(f'{self.logdir}/params.json') 159 | self.has_norm = os.path.exists(f'{self.logdir}/norm.pkl') 160 | self.has_dataset_file = os.path.exists(f'{self.logdir}/data_train.pkl') 161 | self.has_data = False 162 | 163 | # --- Prepare data --- 164 | 165 | # Check if data is provided as array or as directory. 166 | self.dataset_type = None 167 | 168 | if type(X) == str: 169 | self.dataset_type = 'string' 170 | 171 | if self.has_dataset_file: 172 | print('This model has already been associated with a dataset from a directory. To create a new ' 173 | 'dataset, delete the data_train.pkl and data_valid.pkl files in the model directory.') 174 | _, _, self.dim_feature = dr.load_dataset_file(f'{self.logdir}/data_train.pkl') 175 | else: 176 | print(f'Preparing train and validation datasets from feature directory {X}.') 177 | self.dim_feature = fun.prepare_dataset(data_dir=os.path.abspath(X), 178 | logdir=self.logdir, 179 | train_ratio=train_valid_split) 180 | self.has_dataset_file = True 181 | 182 | self.X = None 183 | self.X_valid = None 184 | self.has_data = True 185 | 186 | elif type(X) == np.ndarray: 187 | self.dataset_type = 'array' 188 | self.dim_feature = X.shape[1] 189 | # Split data into train and validation or use provided validation data 190 | if X_valid is not None: 191 | assert X_valid.shape[1] == self.dim_feature, "Train and validation data has different feature dimensions!" 
192 | self.X = X.astype(np.float32) 193 | self.X_valid = X_valid.astype(np.float32) 194 | else: 195 | # Randomize data 196 | num_data = len(X) 197 | indices = list(range(num_data)) 198 | random.shuffle(indices) 199 | # Split data (and ensure it's float) 200 | split_index = int(train_valid_split * num_data) 201 | train_indices = indices[:split_index] 202 | valid_indices = indices[split_index:] 203 | self.X = X[train_indices].astype(np.float32) 204 | self.X_valid = X[valid_indices].astype(np.float32) 205 | self.has_data = True 206 | 207 | # elif X is None: 208 | # self.X = None 209 | # self.X_valid = None 210 | # if self.has_dataset_file: 211 | # print(f'Reloading dataset file {self.logdir}/data_train.pkl from previous instance of this model.') 212 | # _, _, self.dim_feature = dr.load_dataset_file(f'{self.logdir}/data_train.pkl') 213 | # self.has_data = True 214 | # else: 215 | # if self.has_checkpoint: 216 | # self.has_data = False 217 | # else: 218 | # raise Exception( 219 | # 'Model needs to be initialised with X provided as numpy array or path to directory.') 220 | else: 221 | raise Exception('Unsupported input type for X. Needs to be numpy array or string with path to directory ' 222 | 'containing npy files. ') 223 | 224 | # --- Prepare parameter file --- 225 | 226 | # If parameter file for this model already exists, load it. Otherwise create one. 227 | if self.has_params: 228 | print(f'Existing parameter file found for model {self.logdir}.\n' 229 | f'Loading stored parameters. Some input parameters might be ignored.') 230 | with open(f'{self.logdir}/params.json', 'r') as f: 231 | self.param = json.load(f) 232 | # Set dim_feature if not previously known 233 | if X is None and not self.has_dataset_file: 234 | self.dim_feature = self.param['dim_feature'] 235 | else: 236 | # If not given, determine model structure 237 | # NOTE: The reasoning here is a bit arbitrary/dodgy, should probably put some more thought into this 238 | # and improve it. 239 | # TODO: This does not give very good results yet... 
240 | if cells_encoder is None: 241 | # Get all the powers of two between latent dim and feature dim 242 | smallest_power = int(2 ** (self.dim_latent - 1).bit_length()) 243 | largest_power = int(2 ** self.dim_feature.bit_length() / 2) 244 | powers_of_two = [smallest_power] 245 | while powers_of_two[-1] <= largest_power: 246 | powers_of_two.append(powers_of_two[-1]*2) 247 | 248 | # By default, use two layers, one with largest power of two, the second roughly half-way between 249 | # input and output dimension 250 | l2_index = int(len(powers_of_two) / 2) 251 | try: 252 | model_layers = [largest_power, 253 | powers_of_two[l2_index+1]] 254 | except: 255 | model_layers = [largest_power, 256 | int(largest_power/2)] 257 | 258 | else: 259 | model_layers = cells_encoder 260 | 261 | # Number of hidden cells is smaller of the last layer size or 64 262 | cells_hidden = min(model_layers[-1], 64) 263 | 264 | if self.dataset_type == 'string': 265 | dataset_file = f'{self.logdir}/data_train.pkl' 266 | dataset_file_valid = f'{self.logdir}/data_valid.pkl' 267 | else: 268 | dataset_file = None 269 | dataset_file_valid = None 270 | 271 | self.param = { 272 | "dataset_file": dataset_file, 273 | "dataset_file_valid": dataset_file_valid, 274 | "dim_latent": self.dim_latent, 275 | "dim_feature": self.dim_feature, 276 | "cells_encoder": model_layers, 277 | "cells_hidden": cells_hidden, 278 | "iaf_flow_length": self.iaf_flow_length, 279 | "dim_autoregressive_nl": cells_hidden, 280 | "initial_s_offset": 1.0, 281 | "feature_normalization": self.feature_normalization 282 | } 283 | 284 | # Write to json for future re-use of this model 285 | with open(f'{self.logdir}/params.json', 'w') as outfile: 286 | json.dump(self.param, outfile, indent=2) 287 | 288 | # --- Set up VAE model --- 289 | self.graph = tf.Graph() 290 | 291 | with self.graph.as_default(): 292 | 293 | # Create coordinator. 294 | self.coord = tf.train.Coordinator() 295 | 296 | # Set up batchers. 
297 | with tf.name_scope('create_inputs'): 298 | if self.dataset_type == 'string': 299 | self.reader = dr.DataReader(self.param['dataset_file'], 300 | self.param, 301 | f'{self.logdir}/params.json', 302 | self.coord, 303 | self.logdir) 304 | self.test_batcher = dr.Batcher(self.param['dataset_file_valid'], 305 | self.param, 306 | f'{self.logdir}/params.json', 307 | self.logdir) 308 | else: 309 | self.reader = dra.DataReader(self.X, self.feature_normalization, self.coord, self.logdir) 310 | self.test_batcher = dra.Batcher(self.X_valid, self.feature_normalization, self.logdir) 311 | self.train_batch = self.reader.dequeue_feature(self.batch_size) 312 | 313 | # Get normalisation data 314 | if self.feature_normalization: 315 | self.mean = self.test_batcher.mean 316 | self.norm = self.test_batcher.norm 317 | 318 | num_test_data = self.test_batcher.num_data 319 | self.test_batches_full = int(self.test_batcher.num_data / self.batch_size_test) 320 | self.test_batch_last = num_test_data - (self.test_batches_full * self.batch_size_test) 321 | 322 | # Placeholder for test features 323 | self.test_feature_placeholder = tf.placeholder_with_default( 324 | input=tf.zeros([self.batch_size, self.dim_feature], dtype=tf.float32), 325 | shape=[None, self.dim_feature]) 326 | 327 | # Placeholder for dropout 328 | self.dropout_placeholder = tf.placeholder_with_default(input=tf.cast(1.0, dtype=tf.float32), shape=(), 329 | name="KeepProb") 330 | 331 | # Placeholder for learning rate 332 | self.lr_placeholder = tf.placeholder_with_default(input=tf.cast(1e-4, dtype=tf.float32), shape=(), 333 | name="LearningRate") 334 | 335 | print('Creating model.') 336 | self.net = model.VAEModel(self.param, 337 | self.batch_size, 338 | input_dim=self.dim_feature, 339 | keep_prob=self.dropout_placeholder, 340 | initializer=self.initializer) 341 | print('Model created.') 342 | 343 | self.embeddings = self.net.embed(self.test_feature_placeholder) 344 | 345 | print('Setting up loss.') 346 | self.loss = self.net.loss(self.train_batch) 347 | self.loss_test = self.net.loss(self.test_feature_placeholder, test=True) 348 | print('Loss set up.') 349 | 350 | optimizer = tf.train.AdamOptimizer(learning_rate=self.lr_placeholder, 351 | epsilon=1e-4) 352 | trainable = tf.trainable_variables() 353 | # for var in trainable: 354 | # print(var) 355 | self.optim = optimizer.minimize(self.loss, var_list=trainable) 356 | 357 | # Set up logging for TensorBoard. 358 | if self.tb_logging: 359 | self.writer = tf.summary.FileWriter(self.logdir) 360 | self.writer.add_graph(tf.get_default_graph()) 361 | run_metadata = tf.RunMetadata() 362 | self.summaries = tf.summary.merge_all() 363 | 364 | # Set up session 365 | print('Setting up session.') 366 | config = tf.ConfigProto(log_device_placement=False) 367 | config.gpu_options.allow_growth = True 368 | self.sess = tf.Session(config=config) 369 | init = tf.global_variables_initializer() 370 | self.sess.run(init) 371 | print('Session set up.') 372 | 373 | # Saver for storing checkpoints of the model. 374 | self.saver = tf.train.Saver(var_list=tf.trainable_variables(), max_to_keep=2) 375 | 376 | # Try to load model 377 | try: 378 | self.saved_global_step = load(self.saver, self.sess, self.logdir) 379 | if self.saved_global_step is None: 380 | # The first training step will be saved_global_step + 1, 381 | # therefore we put -1 here for new or overwritten trainings. 382 | self.saved_global_step = -1 383 | print(f'No model found to restore. 
Initialising new model.') 384 | else: 385 | print(f'Restored trained model from step {self.saved_global_step}.') 386 | except: 387 | print("Something went wrong while restoring checkpoint.") 388 | raise 389 | 390 | def train(self, 391 | learning_rate=1e-3, 392 | num_steps=int(5e4), 393 | dropout_keep_prob=0.75, 394 | overwrite=False, 395 | test_every=50, 396 | lr_scheduling=True, 397 | lr_scheduling_steps=5, 398 | lr_scheduling_factor=5, 399 | lr_scheduling_min=1e-5, 400 | checkpoint_every=2000): 401 | """ 402 | Train the model 403 | 404 | Parameters 405 | ---------- 406 | learning_rate : float, optional (default: 1e-3) 407 | Learning rate for training. If lr_scheduling is True, this is the initial learning rate. 408 | 409 | num_steps : int, optional (default: 5e4) 410 | Maximum number of training steps before stopping. 411 | 412 | dropout_keep_prob : float, optional (default: 0.75) 413 | Keep probability to use for dropout in encoder/decoder layers. 414 | 415 | overwrite : bool, optional (default: False) 416 | If False, does not allow for overwriting existing model data. 417 | Safety measure to prevent accidentally overwriting previously saved datasets/normalization values, and 418 | unintentional training continuation. 419 | 420 | test_every : int, optional (default: 50) 421 | A test step is performed after every test_every training steps. 422 | 423 | lr_scheduling : bool, optional (default: True) 424 | If True, learning rate scheduling is applied, automatically decreasing the learning rate when the test loss 425 | does not decrease any further for lr_scheduling_steps test steps. Once lr_scheduling_min is reached, 426 | assume model has converged and stop training. 427 | 428 | lr_scheduling_steps : int, optional (default: 5) 429 | If lr_scheduling is True, decrease learning rate after lr_scheduling_steps test steps without decrease 430 | in test loss. 431 | 432 | lr_scheduling_factor : int, optional (default: 5) 433 | Factor by which to decrease learning rate if lr_scheduling is True. 434 | 435 | lr_scheduling_min : int, optional (default: 50) 436 | Minimum learning rate. If lr_scheduling is True, training finishes once learning rate drops below this 437 | value. 438 | 439 | checkpoint_every : int, optional (default: 2000) 440 | Save the model after every checkpoint_every steps. 441 | """ 442 | 443 | assert self.has_data, "Model is not associated with any data yet. " \ 444 | "Recreate CompressionVAE object for this model with X!" 445 | 446 | lr = learning_rate 447 | 448 | # Check if model already exists 449 | if self.has_checkpoint and self.has_params: 450 | print(f'Found existing model {self.logdir}.') 451 | self.is_trained = True 452 | 453 | # If model is trained and overwrite is False, stop here 454 | if not overwrite: 455 | print('To continue training this model, set overwrite=True. To train a new model, ' 456 | 'specify a different logdir or use default "temp" directory.') 457 | return self 458 | else: 459 | print('Continuing model training.') 460 | 461 | with self.graph.as_default(): 462 | 463 | if self.trained_once_this_session is False: 464 | print('Starting queues.') 465 | threads = tf.train.start_queue_runners(sess=self.sess, coord=self.coord) 466 | self.reader.start_threads(self.sess) 467 | print('Reader threads started.') 468 | self.trained_once_this_session = True 469 | 470 | last_saved_step = self.saved_global_step 471 | 472 | test_loss_history = [] 473 | 474 | # Start training; If user interrupts, make sure model gets saved. 
475 | try: 476 | for step in range(self.saved_global_step + 1, num_steps): 477 | start_time = time.time() 478 | 479 | epoch = self.reader.get_epoch(self.batch_size, step) 480 | 481 | # Run the actual optimization step 482 | if self.tb_logging: 483 | summary, loss_value, _ = self.sess.run([self.summaries, self.loss, self.optim], 484 | feed_dict={self.dropout_placeholder: dropout_keep_prob, 485 | self.lr_placeholder: lr}) 486 | self.writer.add_summary(summary, step) 487 | else: 488 | loss_value, _ = self.sess.run([self.loss, self.optim], 489 | feed_dict={self.dropout_placeholder: dropout_keep_prob, 490 | self.lr_placeholder: lr}) 491 | 492 | # Test step 493 | if step % test_every == 0: 494 | 495 | test_losses = [] 496 | 497 | for step_test in range(self.test_batches_full + 1): 498 | 499 | if step_test == self.test_batches_full: 500 | test_batch_size = self.test_batch_last 501 | else: 502 | test_batch_size = self.batch_size_test 503 | 504 | test_features = self.test_batcher.next_batch(test_batch_size) 505 | 506 | loss_value_test = self.sess.run([self.loss_test], 507 | feed_dict={self.test_feature_placeholder: test_features, 508 | self.dropout_placeholder: 1.0}) 509 | 510 | test_losses.append(loss_value_test) 511 | 512 | mean_test_loss = np.mean(test_losses) 513 | test_loss_history.append(mean_test_loss) 514 | 515 | if self.tb_logging: 516 | _summary = tf.Summary() 517 | _summary.value.add(tag='test/test_loss', simple_value=mean_test_loss) 518 | _summary.value.add(tag='test/test_loss_per_feat', 519 | simple_value=mean_test_loss / self.reader.dimension) 520 | self.writer.add_summary(_summary, step) 521 | 522 | duration = (time.time() - start_time) / test_every 523 | print('step {:d}; epoch {:.2f} - loss = {:.3f}, test_loss = {:.3f}, lr = {:.5f}, ({:.3f} sec/step)' 524 | .format(step, epoch, loss_value, mean_test_loss, lr, duration)) 525 | 526 | # Learning rate scheduling. 527 | if lr_scheduling and len(test_loss_history) >= lr_scheduling_steps: 528 | if test_loss_history[-lr_scheduling_steps] < min( 529 | test_loss_history[-lr_scheduling_steps + 1:]): 530 | lr /= lr_scheduling_factor 531 | print(f'No improvement on validation data for {lr_scheduling_steps} test steps. ' 532 | f'Decreasing learning rate by factor {lr_scheduling_factor}') 533 | 534 | # Check if training should be stopped 535 | if lr <= lr_scheduling_min: 536 | print(f'Reached learning rate threshold of {lr_scheduling_min}. ' 537 | f'Stopping.') 538 | break 539 | 540 | if step % checkpoint_every == 0: 541 | save(self.saver, self.sess, self.logdir, step) 542 | last_saved_step = step 543 | 544 | if step == num_steps - 1: 545 | print(f'Reached training step limit of {num_steps} steps. ' 546 | f'Stopping.') 547 | 548 | except KeyboardInterrupt: 549 | print() 550 | finally: 551 | self.is_trained = True 552 | self.has_checkpoint = True 553 | self.saved_global_step = step 554 | 555 | if step > last_saved_step: 556 | save(self.saver, self.sess, self.logdir, step) 557 | # self.coord.request_stop() 558 | # self.coord.join(threads) 559 | 560 | return self 561 | 562 | def embed(self, 563 | X, 564 | batch_size=64): 565 | """ 566 | Embed data into the latent space of a trained model 567 | 568 | Parameters 569 | ---------- 570 | X : array, shape (n_samples, n_features) 571 | Data to embed. 572 | 573 | batch_size : int, optional (default: 64) 574 | Batch size for processing input data. 575 | 576 | Returns 577 | ------- 578 | z : array, shape (n_samples, dim_latent) 579 | Embedding of the input data in latent space. 
580 | """ 581 | 582 | X = X.astype(np.float32) 583 | 584 | num_data = X.shape[0] 585 | num_batches_full = int(num_data / batch_size) 586 | batch_last = num_data - (num_batches_full * batch_size) 587 | if batch_last > 0: 588 | num_batches = num_batches_full + 1 589 | else: 590 | num_batches = num_batches_full 591 | 592 | embs = [] 593 | 594 | for k in range(num_batches): 595 | 596 | if k == num_batches_full: 597 | input_batch = X[k * batch_size:] 598 | else: 599 | input_batch = X[k * batch_size: (k + 1) * batch_size] 600 | 601 | # Normalize 602 | if self.feature_normalization: 603 | input_batch -= self.mean 604 | input_batch = np.divide(input_batch, self.norm, out=np.zeros_like(input_batch), where=self.norm != 0) 605 | 606 | emb = self.sess.run([self.embeddings], 607 | feed_dict={self.test_feature_placeholder: input_batch}) 608 | 609 | embs.append(emb[0]) 610 | 611 | # Concatenate 612 | z = np.concatenate(embs, axis=0) 613 | 614 | return z 615 | 616 | def decode(self, 617 | z): 618 | """ 619 | Decode latent vectors from latent space of a trained model 620 | 621 | Parameters 622 | ---------- 623 | z : array, shape (n_samples, dim_latent) 624 | Latent vectors to decode. 625 | 626 | Returns 627 | ------- 628 | X : array, shape (n_samples, n_features) 629 | Reconstruction of the data from latent code. 630 | """ 631 | 632 | recon = self.net.decode(np.float32(z)) 633 | reconstruction = self.sess.run(recon) 634 | 635 | # Reverse data normalisation 636 | if self.feature_normalization: 637 | reconstruction = np.multiply(reconstruction, self.norm) 638 | reconstruction += self.mean 639 | 640 | X = reconstruction 641 | 642 | return X 643 | 644 | def visualize(self, 645 | z, 646 | labels=None, 647 | categories=None, 648 | filename=None): 649 | """ 650 | For 2d embeddings, visualize latent space. 651 | 652 | Parameters 653 | ---------- 654 | z : array, shape (n_samples, 2) 655 | 2D latent vectors to visualize. 656 | 657 | labels: array or list, shape (n_samples), optional (default: None) 658 | Label indices or strings for each embedding. If strings, categories parameter is ignored. 659 | 660 | categories: list of string, optional (default: None) 661 | Category names for indices in labels. 662 | 663 | filename: string, optional (default: None) 664 | If filename is given, save visualization to file. Otherwise display directly. 665 | 666 | """ 667 | 668 | assert z.shape[1] == 2, "Visualization only available for 2D embeddings." 
669 | 670 | fig, ax = plt.subplots(1, 1, figsize=(12, 10), facecolor='w', edgecolor='k') 671 | if labels is None: 672 | s = ax.scatter(z[:, 0], z[:, 1], s=7) 673 | else: 674 | # Check if labels are provided as indices or strings 675 | if type(labels[0]) == int: 676 | pass 677 | elif type(labels[0]) == str: 678 | # Find unique categories and convert string labels to indices 679 | categories = list(set(labels)) 680 | str_to_int = {cat: k for k, cat in enumerate(categories)} 681 | labels = [str_to_int[label] for label in labels] 682 | else: 683 | raise Exception('Label needs to be list of integer or string labels.') 684 | 685 | cmap = plt.get_cmap('jet', np.max(labels) - np.min(labels) + 1) 686 | s = ax.scatter(z[:, 0], z[:, 1], s=7, c=labels, cmap=cmap, vmin=np.min(labels) - .5, 687 | vmax=np.max(labels) + .5) 688 | cax = plt.colorbar(s, ticks=np.arange(np.min(labels), np.max(labels) + 1)) 689 | if categories is not None: 690 | cax.ax.set_yticklabels(categories) 691 | 692 | if filename is not None: 693 | plt.savefig(filename) 694 | else: 695 | plt.show() 696 | 697 | def visualize_latent_grid(self, 698 | xy_range=(-4.0, 4.0), 699 | grid_size=10, 700 | shape=(28, 28), 701 | clip=(0, 255), 702 | figsize=(12, 12), 703 | filename=None): 704 | """ 705 | Visualize latent space by scanning over a grid, decoding, and plotting as image. 706 | Note: This assumes that the data is image data with a single channel, and currently only works for 707 | two-dimensional latent spaces. 708 | 709 | Parameters 710 | ---------- 711 | xy_range : (float, float), optional (default: (-4.0, 4.0)) 712 | Range in the x and y directions over which to scan. 713 | 714 | grid_size: int, optional (default: 10) 715 | Number of cells along x and y directions. 716 | 717 | shape: (int, int), optional (default: (28, 28)) 718 | Original shape of the image data, used to reshape the vectors to 2d images. 719 | 720 | clip: (float, float), optional (default: (0, 255)) 721 | Before displaying the image, clip the decoded data in this range. 722 | 723 | figsize: (float, float), optional (default: (12.0, 12.0)) 724 | 725 | filename: string, optional (default: None) 726 | If filename is given, save visualization to file. Otherwise display directly. 727 | 728 | """ 729 | 730 | assert self.dim_latent == 2, "visualize_latent_grid only implemented for 2d latent spaces." 
731 | 732 | xy_extent = xy_range[1] - xy_range[0] 733 | step_size = xy_extent / grid_size 734 | 735 | # Create grid of latent variables 736 | z_list = [] 737 | for k in range(grid_size): 738 | for j in range(grid_size): 739 | z_list.append([xy_range[0] + (0.5 + k) * step_size, 740 | xy_range[0] + (0.5 + j) * step_size]) 741 | 742 | z_array = np.array(z_list) 743 | 744 | # Decode 745 | x_array = self.decode(z_array) 746 | 747 | # Arrange into image grid 748 | image = [] 749 | for k in range(grid_size): 750 | row = [] 751 | for j in range(grid_size): 752 | index = k * grid_size + j 753 | row.insert(0, np.reshape(x_array[index], shape)) 754 | image.append(np.concatenate(row)) 755 | 756 | # Concatenate into image 757 | image = np.concatenate(image, axis=1) 758 | 759 | # Apply clipping 760 | if clip is not None: 761 | image = np.clip(image, clip[0], clip[1]) 762 | 763 | # Plotting 764 | fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w', edgecolor='k') 765 | plt.imshow(image, cmap='Greys_r', extent=[xy_range[0], xy_range[1], xy_range[0], xy_range[1]]) 766 | 767 | if filename is not None: 768 | plt.savefig(filename) 769 | else: 770 | plt.show() 771 | -------------------------------------------------------------------------------- /cvae/lib/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/maxfrenzel/CompressionVAE/9d6b52359b885a03797be41f6d5baa17925d83ef/cvae/lib/__init__.py -------------------------------------------------------------------------------- /cvae/lib/data_reader.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import random 3 | import tensorflow as tf 4 | import numpy as np 5 | import joblib 6 | import os 7 | from tqdm import tqdm 8 | 9 | 10 | # Check or compute features 11 | def normalize(ids, file_paths, logdir): 12 | 13 | # Find normalisation factors 14 | norm_file = f'{logdir}/norm.pkl' 15 | if not os.path.isfile(norm_file): 16 | 17 | print('Calculating normalisation factors.') 18 | 19 | feat_list = [] 20 | 21 | for k, id_val in enumerate(tqdm(ids)): 22 | 23 | feat_list.append(np.load(file_paths[id_val])) 24 | 25 | feat_array = np.stack(feat_list) 26 | 27 | mean = np.mean(feat_array, axis=0) 28 | max_val = np.max(feat_array, axis=0) 29 | min_val = np.min(feat_array, axis=0) 30 | var = np.var(feat_array, axis=0) 31 | 32 | # Normalize by standard deviation 33 | norm = np.sqrt(var) 34 | 35 | norm_dict = {'mean': mean, 36 | 'norm': norm, 37 | 'min_val': min_val, 38 | 'max_val': max_val} 39 | 40 | joblib.dump(norm_dict, norm_file) 41 | 42 | print('Normalisation factors calculated.') 43 | 44 | else: 45 | print('Normalisation factors already stored.') 46 | 47 | 48 | def load_norm(norm_file): 49 | norm_dict = joblib.load(norm_file) 50 | mean = norm_dict['mean'] 51 | norm = norm_dict['norm'] 52 | 53 | return mean, norm 54 | 55 | 56 | def return_data(ids, file_paths, logdir, normalize=True, randomize=True): 57 | 58 | # Shuffle tha data 59 | randomized_data = ids[:] 60 | if randomize: 61 | random.shuffle(randomized_data) 62 | 63 | # If desired, load normalisation 64 | if normalize: 65 | norm_file = f'{logdir}/norm.pkl' 66 | mean, norm = load_norm(norm_file) 67 | 68 | # Loop through data 69 | for id_val in randomized_data: 70 | 71 | # Load features and annotations and extract correct slices 72 | features = np.load(file_paths[id_val]) 73 | 74 | # Normalise 75 | if normalize: 76 | features -= mean 77 | # Can occasionally have features with zero 
variance, set those values to zero 78 | features = np.divide(features, norm, out=np.zeros_like(features), where=norm != 0) 79 | 80 | yield features 81 | 82 | 83 | class DataReader(object): 84 | def __init__(self, 85 | dataset_file, 86 | params, 87 | param_file, 88 | coord, 89 | logdir, 90 | queue_size=128): 91 | 92 | self.params = params 93 | self.param_file = param_file 94 | self.ids, self.file_paths, self.dimension = load_dataset_file(dataset_file) 95 | self.coord = coord 96 | self.logdir = logdir 97 | self.threads = [] 98 | 99 | self.num_data = len(self.ids) 100 | print('Total amount of data: ', self.num_data) 101 | print("Input feature dimension: ", self.dimension) 102 | 103 | # Make sure normalization factors have been calculated 104 | if self.params['feature_normalization']: 105 | normalize(self.ids, self.file_paths, self.logdir) 106 | 107 | self.feature_placeholder = tf.placeholder(dtype=tf.float32, shape=None) 108 | self.feature_queue = tf.PaddingFIFOQueue(queue_size, 109 | ['float32'], 110 | shapes=[[self.dimension]]) 111 | self.feature_enqueue = self.feature_queue.enqueue([self.feature_placeholder]) 112 | 113 | def dequeue_feature(self, num_elements): 114 | output = self.feature_queue.dequeue_many(num_elements) 115 | return output 116 | 117 | def thread_main(self, sess): 118 | stop = False 119 | # Go through the dataset multiple times 120 | while not stop: 121 | iterator = return_data(self.ids, self.file_paths, 122 | logdir=self.logdir, 123 | normalize=self.params['feature_normalization']) 124 | count = 0 125 | for feature in iterator: 126 | if self.coord.should_stop(): 127 | stop = True 128 | break 129 | 130 | sess.run(self.feature_enqueue, 131 | feed_dict={self.feature_placeholder: feature}) 132 | 133 | count += 1 134 | 135 | def start_threads(self, sess, n_threads=1): 136 | for _ in range(n_threads): 137 | thread = threading.Thread(target=self.thread_main, args=(sess,)) 138 | thread.daemon = True # Thread will close when parent quits. 
139 | thread.start() 140 | self.threads.append(thread) 141 | return self.threads 142 | 143 | def get_epoch(self, batch_size, step): 144 | return (batch_size * step) / self.num_data 145 | 146 | 147 | class Batcher(object): 148 | def __init__(self, 149 | dataset_file, 150 | params, 151 | param_file, 152 | logdir, 153 | shuffle=False): 154 | 155 | self.params = params 156 | self.param_file = param_file 157 | self.ids, self.file_paths, self.dimension = load_dataset_file(dataset_file) 158 | self.logdir = logdir 159 | self.shuffle = shuffle 160 | 161 | if self.shuffle: 162 | np.random.shuffle(self.ids) 163 | 164 | self.num_data = len(self.ids) 165 | print('Total amount of data: ', self.num_data) 166 | 167 | self.index = 0 168 | 169 | if self.params['feature_normalization']: 170 | self.mean, self.norm = load_norm(f'{self.logdir}/norm.pkl') 171 | 172 | def get_epoch(self, batch_size, step): 173 | return (batch_size * step) / self.num_data 174 | 175 | def next_batch(self, batch_size): 176 | 177 | feature_list = [] 178 | truth_list = [] 179 | 180 | data_iterator = return_data(self.ids, self.file_paths, 181 | logdir=self.logdir, 182 | normalize=self.params['feature_normalization'], 183 | randomize=False) 184 | 185 | for k in range(batch_size): 186 | 187 | # Return features from generator, possibly recreating it if it's empty 188 | try: 189 | features = next(data_iterator) 190 | except: 191 | # Recreate the generator 192 | data_iterator = return_data(self.ids, self.file_paths, 193 | logdir=self.logdir, 194 | normalize=self.params['feature_normalization'], 195 | randomize=False) 196 | features = next(data_iterator) 197 | 198 | feature_list.append(np.float32(np.expand_dims(features, axis=0))) 199 | 200 | self.index += 1 201 | if self.index == self.num_data: 202 | self.index = 0 203 | 204 | if self.shuffle: 205 | np.random.shuffle(self.ids) 206 | 207 | feature_batch = np.concatenate(feature_list, axis=0) 208 | 209 | return feature_batch 210 | 211 | 212 | def load_dataset_file(filename): 213 | 214 | print('Loading dataset.') 215 | 216 | dataset = joblib.load(filename) 217 | 218 | dimension = dataset['dimension'] 219 | file_paths = dataset['file_paths'] 220 | ids = dataset['ids'] 221 | 222 | return ids, file_paths, dimension 223 | -------------------------------------------------------------------------------- /cvae/lib/data_reader_array.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import random 3 | import tensorflow as tf 4 | import numpy as np 5 | import joblib 6 | import os 7 | import copy 8 | from tqdm import tqdm 9 | 10 | 11 | # Check or compute features 12 | def normalize(feat_array, logdir): 13 | 14 | # Find normalisation factors 15 | norm_file = f'{logdir}/norm.pkl' 16 | if not os.path.isfile(norm_file): 17 | 18 | print('Calculating normalisation factors.') 19 | 20 | mean = np.mean(feat_array, axis=0) 21 | max_val = np.max(feat_array, axis=0) 22 | min_val = np.min(feat_array, axis=0) 23 | var = np.var(feat_array, axis=0) 24 | 25 | # Normalize by standard deviation 26 | norm = np.sqrt(var) 27 | 28 | norm_dict = {'mean': mean, 29 | 'norm': norm, 30 | 'min_val': min_val, 31 | 'max_val': max_val} 32 | 33 | joblib.dump(norm_dict, norm_file) 34 | 35 | print('Normalisation factors calculated.') 36 | 37 | else: 38 | print('Normalisation factors already stored.') 39 | 40 | 41 | def load_norm(norm_file): 42 | norm_dict = joblib.load(norm_file) 43 | mean = norm_dict['mean'] 44 | norm = norm_dict['norm'] 45 | 46 | return mean, norm 47 | 48 | 49 | 
def return_data(feat_array, logdir, normalize=True, randomize=True): 50 | 51 | # Shuffle tha data 52 | randomized_indices = list(range(len(feat_array))) 53 | if randomize: 54 | random.shuffle(randomized_indices) 55 | 56 | # If desired, load normalisation 57 | if normalize: 58 | norm_file = f'{logdir}/norm.pkl' 59 | mean, norm = load_norm(norm_file) 60 | 61 | # Loop through data 62 | for id_val in randomized_indices: 63 | 64 | # Load features and annotations and extract correct slices 65 | features = copy.copy(feat_array[id_val]) 66 | 67 | # Normalise 68 | if normalize: 69 | features -= mean 70 | # Can occasionally have feature dimensions with zero variance, set those values to zero 71 | features = np.divide(features, norm, out=np.zeros_like(features), where=norm != 0) 72 | 73 | yield features 74 | 75 | 76 | class DataReader(object): 77 | def __init__(self, 78 | feat_array, 79 | feature_normalization, 80 | coord, 81 | logdir, 82 | queue_size=128): 83 | 84 | self.feat_array = feat_array 85 | self.normalize = feature_normalization 86 | self.num_data = feat_array.shape[0] 87 | self.dimension = feat_array.shape[1] 88 | self.coord = coord 89 | self.logdir = logdir 90 | self.threads = [] 91 | 92 | print('Total amount of data: ', self.num_data) 93 | print("Input feature dimension: ", self.dimension) 94 | 95 | # Make sure normalization factors have been calculated 96 | if self.normalize: 97 | normalize(self.feat_array, self.logdir) 98 | 99 | self.feature_placeholder = tf.compat.v1.placeholder(dtype=tf.float32, shape=None) 100 | self.feature_queue = tf.compat.v1.PaddingFIFOQueue(queue_size, 101 | ['float32'], 102 | shapes=[[self.dimension]]) 103 | self.feature_enqueue = self.feature_queue.enqueue([self.feature_placeholder]) 104 | 105 | def dequeue_feature(self, num_elements): 106 | output = self.feature_queue.dequeue_many(num_elements) 107 | return output 108 | 109 | def thread_main(self, sess): 110 | stop = False 111 | # Go through the dataset multiple times 112 | while not stop: 113 | iterator = return_data(self.feat_array, 114 | logdir=self.logdir, 115 | normalize=self.normalize) 116 | count = 0 117 | for feature in iterator: 118 | if self.coord.should_stop(): 119 | stop = True 120 | break 121 | 122 | sess.run(self.feature_enqueue, 123 | feed_dict={self.feature_placeholder: feature}) 124 | 125 | count += 1 126 | 127 | def start_threads(self, sess, n_threads=1): 128 | for _ in range(n_threads): 129 | thread = threading.Thread(target=self.thread_main, args=(sess,)) 130 | thread.daemon = True # Thread will close when parent quits. 
131 | thread.start() 132 | self.threads.append(thread) 133 | return self.threads 134 | 135 | def get_epoch(self, batch_size, step): 136 | return (batch_size * step) / self.num_data 137 | 138 | 139 | class Batcher(object): 140 | def __init__(self, 141 | feat_array, 142 | feature_normalization, 143 | logdir, 144 | shuffle=False): 145 | 146 | self.feat_array = feat_array 147 | self.normalize = feature_normalization 148 | self.logdir = logdir 149 | self.shuffle = shuffle 150 | self.randomized_indices = list(range(len(feat_array))) 151 | 152 | if self.shuffle: 153 | np.random.shuffle(self.randomized_indices) 154 | 155 | self.num_data = len(self.randomized_indices) 156 | print('Total amount of data: ', self.num_data) 157 | 158 | self.index = 0 159 | 160 | if self.normalize: 161 | self.mean, self.norm = load_norm(f'{self.logdir}/norm.pkl') 162 | 163 | def get_epoch(self, batch_size, step): 164 | return (batch_size * step) / self.num_data 165 | 166 | def next_batch(self, batch_size): 167 | 168 | feature_list = [] 169 | 170 | data_iterator = return_data(self.feat_array, 171 | logdir=self.logdir, 172 | normalize=self.normalize, 173 | randomize=False) 174 | 175 | for k in range(batch_size): 176 | 177 | # Return features from generator, possibly recreating it if it's empty 178 | try: 179 | features = next(data_iterator) 180 | except: 181 | # Recreate the generator 182 | data_iterator = return_data(self.feat_array, 183 | logdir=self.logdir, 184 | normalize=self.normalize, 185 | randomize=False) 186 | features = next(data_iterator) 187 | 188 | feature_list.append(np.float32(np.expand_dims(features, axis=0))) 189 | 190 | self.index += 1 191 | if self.index == self.num_data: 192 | self.index = 0 193 | 194 | if self.shuffle: 195 | np.random.shuffle(self.ids) 196 | 197 | feature_batch = np.concatenate(feature_list, axis=0) 198 | 199 | return feature_batch 200 | 201 | 202 | def load_dataset_file(filename): 203 | 204 | print('Loading dataset.') 205 | 206 | dataset = joblib.load(filename) 207 | 208 | dimension = dataset['dimension'] 209 | file_paths = dataset['file_paths'] 210 | ids = dataset['ids'] 211 | 212 | return ids, file_paths, dimension 213 | -------------------------------------------------------------------------------- /cvae/lib/functions.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import random 4 | import joblib 5 | import numpy as np 6 | 7 | def get_data_subset(ids, full_data): 8 | 9 | file_paths_full = full_data['file_paths'] 10 | 11 | file_paths = dict() 12 | 13 | for id in ids: 14 | file_paths[id] = file_paths_full[id] 15 | 16 | datasubset = { 17 | 'ids': ids, 18 | 'file_paths': file_paths, 19 | 'dimension': full_data['dimension'] 20 | } 21 | 22 | return datasubset 23 | 24 | 25 | def prepare_dataset(data_dir, 26 | logdir, 27 | train_ratio=0.9): 28 | 29 | # Get all paths of numpy files 30 | feature_files = [] 31 | 32 | for dirName, subdirList, fileList in os.walk(data_dir, topdown=False): 33 | for fname in fileList: 34 | if os.path.splitext(fname)[1] in ['.npy']: 35 | feature_files.append('%s/%s' % (dirName, fname)) 36 | 37 | print(f'Total number of feature vectors found: {len(feature_files)}. Building dataset.') 38 | 39 | # Build dataset 40 | ids = [] 41 | file_paths = dict() 42 | 43 | for path in feature_files: 44 | # Find unique ID . 
45 |         id = os.path.splitext(os.path.basename(path))[0]
46 |         while id in ids:
47 |             id += 'x'
48 | 
49 |         file_paths[id] = path
50 |         ids.append(id)
51 | 
52 |     # Get dimensionality (assumed to be the same for all feature files)
53 |     dimension = np.load(feature_files[0]).shape[0]
54 |     print(f'Dimensionality of dataset: {dimension}.')
55 | 
56 |     dataset = {
57 |         'ids': ids,
58 |         'file_paths': file_paths,
59 |         'dimension': dimension
60 |     }
61 | 
62 |     # Train/valid split
63 |     split_index = int(train_ratio * len(ids))
64 | 
65 |     for k in range(10):
66 |         random.shuffle(ids)
67 | 
68 |     ids_train = ids[:split_index]
69 |     ids_valid = ids[split_index:]
70 | 
71 |     print(f'Splitting {len(ids)} samples into {len(ids_train)} training and {len(ids_valid)} validation samples.')
72 | 
73 |     dataset_train = get_data_subset(ids_train, dataset)
74 |     dataset_valid = get_data_subset(ids_valid, dataset)
75 | 
76 |     print('Saving dataset files.')
77 | 
78 |     if not os.path.exists(logdir):
79 |         os.makedirs(logdir)
80 | 
81 |     joblib.dump(dataset_train, f'{logdir}/data_train.pkl')
82 |     joblib.dump(dataset_valid, f'{logdir}/data_valid.pkl')
83 | 
84 |     print('Done.')
85 | 
86 |     return dimension
87 | 
--------------------------------------------------------------------------------
/cvae/lib/model_iaf.py:
--------------------------------------------------------------------------------
1 | import tensorflow.compat.v1 as tf
2 | tf.disable_v2_behavior()
3 | 
4 | 
5 | def create_variable(name, shape, initializer_type=None):
6 |     """Create a weight variable with the specified name and shape,
7 |     and initialize it with the specified initializer."""
8 |     if initializer_type == 'truncated_normal':
9 |         initializer = tf.initializers.truncated_normal()
10 |     elif initializer_type == 'lecun_normal':
11 |         initializer = tf.initializers.lecun_normal()
12 |     elif initializer_type == 'orthogonal':
13 |         initializer = tf.initializers.orthogonal()
14 |     else:
15 |         print('No initializer type provided or provided type unknown. Defaulting to orthogonal.')
16 |         initializer = tf.initializers.orthogonal()
17 |     variable = tf.Variable(initializer(shape=shape), name=name)
18 |     return variable
19 | 
20 | 
21 | def create_bias_variable(name, shape):
22 |     """Create a bias variable with the specified name and shape and initialize it."""
23 |     initializer = tf.constant_initializer(value=0.001, dtype=tf.float32)
24 |     return tf.Variable(initializer(shape=shape), name=name)
25 | 
26 | 
27 | # KL divergence between the posterior (with autoregressive flow) and the prior
28 | def kl_divergence(sigma, epsilon, z_K, param, batch_mean=True):
29 |     # Log-density of the posterior base sample (up to constants; the log-sigma terms are added below)
30 |     log_q_z0 = -0.5 * tf.square(epsilon)
31 | 
32 |     # Negative log-density of the standard normal prior (the Gaussian constants cancel)
33 |     log_p_zK = 0.5 * tf.square(z_K)
34 | 
35 |     # Log-determinant terms from the initial std and each flow layer
36 |     flow_loss = 0
37 |     for l in range(param['iaf_flow_length'] + 1):
38 |         # Make sure it can't take log(0) or log(neg)
39 |         flow_loss -= tf.log(sigma[l] + 1e-10)
40 | 
41 |     kl_divs = tf.identity(log_q_z0 + flow_loss + log_p_zK)
42 |     kl_divs_reduced = tf.reduce_sum(kl_divs, axis=1)
43 | 
44 |     if batch_mean:
45 |         return tf.reduce_mean(kl_divs, axis=0), tf.reduce_mean(kl_divs_reduced)
46 |     else:
47 |         return kl_divs, kl_divs_reduced
48 | 
49 | 
50 | class VAEModel(object):
51 | 
52 |     def __init__(self,
53 |                  param,
54 |                  batch_size,
55 |                  input_dim,
56 |                  activation=tf.nn.relu,
57 |                  activation_nf=tf.nn.relu,
58 |                  keep_prob=1.0,
59 |                  encode=False,
60 |                  initializer='orthogonal'):
61 | 
62 |         self.input_dim = input_dim
63 |         self.param = param
64 |         self.batch_size = batch_size
65 |         self.activation = activation
66 |         self.activation_nf = activation_nf
67 |         self.encode = encode
68 |         self.cells_enc = self.param['cells_encoder']
69 |         self.layers_enc = len(param['cells_encoder'])
70 |         self.cells_dec = self.cells_enc[::-1]
71 |         self.layers_dec = self.layers_enc
72 |         self.cells_hidden = self.param['cells_hidden']
73 |         self.dim_latent = param['dim_latent']
74 |         self.keep_prob = keep_prob
75 |         self.initializer = initializer
76 |         self.variables = self._create_variables()
77 | 
78 |     def _create_variables(self):
79 |         """This function creates all variables used by the network.
80 | This allows us to share them between multiple calls to the loss 81 | function and generation function.""" 82 | 83 | var = dict() 84 | 85 | with tf.variable_scope('VAE'): 86 | 87 | with tf.variable_scope("Encoder"): 88 | 89 | var['encoder_stack'] = list() 90 | with tf.variable_scope('encoder_stack'): 91 | 92 | for l, num_units in enumerate(self.cells_enc): 93 | 94 | with tf.variable_scope('layer{}'.format(l)): 95 | 96 | layer = dict() 97 | 98 | if l == 0: 99 | units_in = self.input_dim 100 | else: 101 | units_in = self.cells_enc[l - 1] 102 | 103 | units_out = num_units 104 | 105 | layer['W'] = create_variable("W", 106 | shape=[units_in, units_out], 107 | initializer_type=self.initializer) 108 | layer['b'] = create_bias_variable("b", 109 | shape=[1, units_out]) 110 | 111 | var['encoder_stack'].append(layer) 112 | 113 | with tf.variable_scope('fully_connected'): 114 | 115 | layer = dict() 116 | 117 | num_cells_hidden = self.cells_hidden 118 | 119 | layer['W_z0'] = create_variable("W_z0", 120 | shape=[self.cells_enc[-1], 2 * num_cells_hidden], 121 | initializer_type=self.initializer) 122 | layer['b_z0'] = create_bias_variable("b_z0", 123 | shape=[1, 2 * num_cells_hidden]) 124 | 125 | layer['W_mu'] = create_variable("W_mu", 126 | shape=[self.cells_hidden, self.param['dim_latent']], 127 | initializer_type=self.initializer) 128 | layer['W_logvar'] = create_variable("W_logvar", 129 | shape=[self.cells_hidden, self.param['dim_latent']], 130 | initializer_type=self.initializer) 131 | layer['b_mu'] = create_bias_variable("b_mu", 132 | shape=[1, self.param['dim_latent']]) 133 | layer['b_logvar'] = create_bias_variable("b_logvar", 134 | shape=[1, self.param['dim_latent']]) 135 | 136 | var['encoder_fc'] = layer 137 | 138 | with tf.variable_scope("IAF"): 139 | 140 | var['iaf_flows'] = list() 141 | for l in range(self.param['iaf_flow_length']): 142 | 143 | with tf.variable_scope('layer{}'.format(l)): 144 | 145 | layer = dict() 146 | 147 | # Hidden state 148 | layer['W_flow'] = create_variable("W_flow", 149 | shape=[self.cells_enc[-1], self.dim_latent], 150 | initializer_type=self.initializer) 151 | layer['b_flow'] = create_bias_variable("b_flow", 152 | shape=[1, self.dim_latent]) 153 | 154 | flow_variables = list() 155 | # Flow parameters from hidden state (m and s parameters for IAF) 156 | for j in range(self.dim_latent): 157 | with tf.variable_scope('flow_layer{}'.format(j)): 158 | 159 | flow_layer = dict() 160 | 161 | # Set correct dimensionality 162 | units_to_hidden_iaf = self.param['dim_autoregressive_nl'] 163 | 164 | flow_layer['W_flow_params_nl'] = create_variable("W_flow_params_nl", 165 | shape=[self.dim_latent + j, 166 | units_to_hidden_iaf], 167 | initializer_type=self.initializer) 168 | flow_layer['b_flow_params_nl'] = create_bias_variable("b_flow_params_nl", 169 | shape=[1, units_to_hidden_iaf]) 170 | 171 | flow_layer['W_flow_params'] = create_variable("W_flow_params", 172 | shape=[units_to_hidden_iaf, 173 | 2], 174 | initializer_type=self.initializer) 175 | flow_layer['b_flow_params'] = create_bias_variable("b_flow_params", 176 | shape=[1, 2]) 177 | 178 | flow_variables.append(flow_layer) 179 | 180 | layer['flow_vars'] = flow_variables 181 | 182 | var['iaf_flows'].append(layer) 183 | 184 | with tf.variable_scope("Decoder"): 185 | 186 | var['decoder_stack'] = list() 187 | with tf.variable_scope('deconv_stack'): 188 | 189 | for l, num_units in enumerate(self.cells_dec): 190 | 191 | with tf.variable_scope('layer{}'.format(l)): 192 | 193 | layer = dict() 194 | 195 | if l == 0: 196 | 
units_in = self.dim_latent 197 | else: 198 | units_in = self.cells_dec[l - 1] 199 | 200 | units_out = num_units 201 | 202 | layer['W'] = create_variable("W", 203 | shape=[units_in, units_out], 204 | initializer_type=self.initializer) 205 | layer['b'] = create_bias_variable("b", 206 | shape=[1, units_out]) 207 | 208 | var['decoder_stack'].append(layer) 209 | 210 | with tf.variable_scope('fully_connected'): 211 | layer = dict() 212 | 213 | layer['W_mu'] = create_variable("W_mu", 214 | shape=[self.cells_dec[-1], self.input_dim], 215 | initializer_type=self.initializer) 216 | # layer['W_logvar'] = create_variable("W_logvar", 217 | # shape=[self.cells_dec[-1], self.input_dim]) 218 | layer['b_mu'] = create_bias_variable("b_mu", 219 | shape=[1, self.input_dim]) 220 | # layer['b_logvar'] = create_bias_variable("b_logvar", 221 | # shape=[1, self.input_dim]) 222 | 223 | var['decoder_fc'] = layer 224 | 225 | return var 226 | 227 | def _create_network(self, input_batch, encode=False): 228 | 229 | # ----------------------------------- 230 | # Encoder 231 | 232 | # Remove redundant dimension (weird thing to get PaddingFIFOQueue to work) 233 | # input_batch = tf.squeeze(input_batch) 234 | 235 | # Do encoder calculation 236 | encoder_hidden = input_batch 237 | # print('Encoder hidden state 0: ', encoder_hidden) 238 | for l in range(self.layers_enc): 239 | encoder_hidden = tf.nn.dropout(self.activation(tf.matmul(encoder_hidden, 240 | self.variables['encoder_stack'][l]['W']) 241 | + self.variables['encoder_stack'][l]['b']), 242 | keep_prob=self.keep_prob) 243 | 244 | # print(f'Encoder hidden state {l}: ', encoder_hidden) 245 | 246 | # encoder_hidden = tf.reshape(encoder_hidden, [-1, self.conv_out_units]) 247 | 248 | # Additional non-linearity between encoder hidden state and prediction of mu_0,sigma_0 249 | mu_logvar_hidden = tf.nn.dropout(self.activation(tf.matmul(encoder_hidden, 250 | self.variables['encoder_fc']['W_z0']) 251 | + self.variables['encoder_fc']['b_z0']), 252 | keep_prob=self.keep_prob) 253 | 254 | # Split into parts for mean and variance 255 | mu_hidden, logvar_hidden = tf.split(mu_logvar_hidden, num_or_size_splits=2, axis=1) 256 | 257 | # Final linear layer to calculate mean and variance 258 | encoder_mu = tf.add(tf.matmul(mu_hidden, self.variables['encoder_fc']['W_mu']), 259 | self.variables['encoder_fc']['b_mu'], name='ZMu') 260 | encoder_logvar = tf.add(tf.matmul(logvar_hidden, self.variables['encoder_fc']['W_logvar']), 261 | self.variables['encoder_fc']['b_logvar'], name='ZLogVar') 262 | 263 | # Convert log variance into standard deviation 264 | encoder_std = tf.exp(0.5 * encoder_logvar) 265 | 266 | # Sample epsilon 267 | epsilon = tf.random_normal(tf.shape(encoder_std), name='epsilon') 268 | 269 | if encode: 270 | z0 = tf.identity(encoder_mu, name='LatentZ0') 271 | else: 272 | z0 = tf.identity(tf.add(encoder_mu, tf.multiply(encoder_std, epsilon), 273 | name='LatentZ0')) 274 | 275 | # ----------------------------------- 276 | # Latent flow 277 | 278 | # Lists to store the latent variables and the flow parameters 279 | nf_z = [z0] 280 | nf_sigma = [encoder_std] 281 | 282 | # Do calculations for each flow layer 283 | for l in range(self.param['iaf_flow_length']): 284 | 285 | W_flow = self.variables['iaf_flows'][l]['W_flow'] 286 | b_flow = self.variables['iaf_flows'][l]['b_flow'] 287 | 288 | nf_hidden = self.activation_nf(tf.matmul(encoder_hidden, W_flow) + b_flow) 289 | 290 | # Autoregressive calculation 291 | m_list = self.dim_latent * [None] 292 | s_list = self.dim_latent * [None] 293 | 
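            # Clarifying note on the loop below: each latent dimension j gets its own (m_j, s_j),
            # computed from the flow hidden state concatenated with the first j components of the
            # previous latent sample, which keeps the transform autoregressive (triangular Jacobian).
            # After the loop, the update applied is
            #     sigma = sigmoid(initial_s_offset + s)
            #     z_new = sigma * z_old + (1 - sigma) * m
            # whose log-determinant is the sum of log(sigma) terms accumulated in kl_divergence above.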
294 | for j, flow_vars in enumerate(self.variables['iaf_flows'][l]['flow_vars']): 295 | 296 | # Go through computation one variable at a time 297 | if j == 0: 298 | hidden_autoregressive = nf_hidden 299 | else: 300 | z_slice = tf.slice(nf_z[-1], [0, 0], [-1, j]) 301 | hidden_autoregressive = tf.concat(axis=1, values=[nf_hidden, z_slice]) 302 | 303 | W_flow_params_nl = flow_vars['W_flow_params_nl'] 304 | b_flow_params_nl = flow_vars['b_flow_params_nl'] 305 | W_flow_params = flow_vars['W_flow_params'] 306 | b_flow_params = flow_vars['b_flow_params'] 307 | 308 | # Non-linearity at current autoregressive step 309 | nf_hidden_nl = self.activation_nf(tf.matmul(hidden_autoregressive, 310 | W_flow_params_nl) + b_flow_params_nl) 311 | 312 | # Calculate parameters for normalizing flow as linear transform 313 | ms = tf.matmul(nf_hidden_nl, W_flow_params) + b_flow_params 314 | 315 | # Split into individual components 316 | # m_list[j], s_list[j] = tf.split_v(value=ms, 317 | # size_splits=[1,1], 318 | # split_dim=1) 319 | m_list[j], s_list[j] = tf.split(value=ms, 320 | num_or_size_splits=[1, 1], 321 | axis=1) 322 | 323 | # Concatenate autoregressively computed variables 324 | # Add offset to s to make sure it starts out positive 325 | # (could have also initialised the bias term to 1) 326 | # Guarantees that flow initially small 327 | m = tf.concat(axis=1, values=m_list) 328 | s = self.param['initial_s_offset'] + tf.concat(axis=1, values=s_list) 329 | 330 | # Calculate sigma ("update gate value") from s 331 | sigma = tf.nn.sigmoid(s) 332 | nf_sigma.append(sigma) 333 | 334 | # Perform normalizing flow 335 | z_current = tf.multiply(sigma, nf_z[-1]) + tf.multiply((1 - sigma), m) 336 | 337 | # Invert order of variables to alternate dependence of autoregressive structure 338 | z_current = tf.reverse(z_current, axis=[1], name='LatentZ%d' % (l + 1)) 339 | 340 | # Add to list of latent variables 341 | nf_z.append(z_current) 342 | 343 | z = tf.identity(nf_z[-1], name="LatentZ") 344 | 345 | # ----------------------------------- 346 | # Decoder 347 | 348 | # Fully connected 349 | decoder_hidden = z 350 | 351 | for l in range(self.layers_dec): 352 | # print(decoder_hidden) 353 | decoder_hidden = tf.nn.dropout(self.activation(tf.matmul(decoder_hidden, 354 | self.variables['decoder_stack'][l]['W']) 355 | + self.variables['decoder_stack'][l]['b']), 356 | keep_prob=self.keep_prob) 357 | decoder_hidden = self.activation(decoder_hidden) 358 | 359 | # Split into mu and logvar parts 360 | # decoder_hidden_mu, decoder_hidden_logvar = tf.split(decoder_hidden, num_or_size_splits=2, axis=1) 361 | 362 | # Final layer 363 | decoder_mu = tf.add(tf.matmul(decoder_hidden, self.variables['decoder_fc']['W_mu']), 364 | self.variables['decoder_fc']['b_mu'], 365 | name='XMu') 366 | # decoder_logvar = tf.add(tf.matmul(decoder_hidden_logvar, self.variables['decoder_fc']['W_logvar']), 367 | # self.variables['decoder_fc']['b_logvar'], 368 | # name='XLogVar') 369 | # 370 | # # Add clipping to avoid zero division 371 | # decoder_logvar = tf.clip_by_value(decoder_logvar, 372 | # clip_value_min=-8.0, 373 | # clip_value_max=+8.0) 374 | 375 | # Set decoder variance as fixed hyperparameter for stability; common assumption in Gaussian decoders 376 | decoder_logvar = tf.zeros_like(decoder_mu) 377 | 378 | # return decoder_output, encoder_hidden, encoder_logvar, encoder_std 379 | return decoder_mu, decoder_logvar, encoder_mu, encoder_logvar, encoder_std, epsilon, z, z0, nf_sigma 380 | 381 | def decode(self, z): 382 | 383 | decoder_hidden = z 384 | 
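        # Clarifying note: decode() rebuilds only the decoder path using the same weights created in
        # _create_variables, so a latent vector (for example one chosen by hand) can be mapped back to
        # data space without running the encoder. Dropout still uses self.keep_prob, which defaults to
        # 1.0, i.e. it is disabled unless the caller sets it otherwise.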
385 | for l in range(self.layers_dec): 386 | # print(decoder_hidden) 387 | decoder_hidden = tf.nn.dropout(self.activation(tf.matmul(decoder_hidden, 388 | self.variables['decoder_stack'][l]['W']) 389 | + self.variables['decoder_stack'][l]['b']), 390 | keep_prob=self.keep_prob) 391 | decoder_hidden = self.activation(decoder_hidden) 392 | 393 | decoder_mu = tf.add(tf.matmul(decoder_hidden, self.variables['decoder_fc']['W_mu']), 394 | self.variables['decoder_fc']['b_mu'], 395 | name='XMu') 396 | 397 | return decoder_mu 398 | 399 | def input_identity(self, input_batch): 400 | 401 | # return tf.matmul(input_batch, self.variables['encoder_stack'][0]['W']) 402 | 403 | return self.variables['encoder_stack'][0]['W'] 404 | 405 | def loss(self, 406 | input_batch, 407 | name='vae', 408 | beta=1.0, 409 | test=False): 410 | 411 | with tf.name_scope(name): 412 | 413 | # Run computation 414 | decoder_mu, decoder_logvar, encoder_mu, encoder_logvar, encoder_std, epsilon, z, z0, nf_sigma = self._create_network(input_batch) 415 | 416 | # print("Output size: ", decoder_mu) 417 | 418 | # KL-Divergence loss 419 | _, div = kl_divergence(nf_sigma, epsilon, z, self.param, batch_mean=False) 420 | loss_latent = tf.identity(div, name='LossLatent') 421 | 422 | # Reconstruction loss assuming Gaussian output distribution 423 | decoder_var = tf.exp(decoder_logvar) 424 | loss_reconstruction = tf.identity(0.5 * tf.reduce_sum(tf.math.divide(tf.square(input_batch - decoder_mu), 425 | decoder_var) 426 | + decoder_logvar, axis=1), 427 | name='LossReconstruction') 428 | 429 | # Small penalty to prevent z0 values from going to infinity 430 | z0_boundary = 10.0 * tf.ones_like(z0) 431 | z0_for_penalty = tf.maximum(z0_boundary, tf.abs(z0)) 432 | z0_large = tf.reduce_mean(tf.square(z0_for_penalty - z0_boundary), axis=1) 433 | 434 | loss = tf.reduce_mean(loss_reconstruction + beta*loss_latent, name='Loss') 435 | 436 | if not test: 437 | tf.summary.scalar('loss_total', loss) 438 | tf.summary.scalar('loss_rec_per_feat', tf.reduce_mean(loss_reconstruction)/self.input_dim) 439 | tf.summary.scalar('loss_kl_per_dim', tf.reduce_mean(loss_latent)/self.dim_latent) 440 | tf.summary.scalar('beta', beta) 441 | 442 | return loss 443 | 444 | def embed(self, input_batch): 445 | 446 | # Run computation 447 | _, _, _, _, _, _, z, _, _ = self._create_network(input_batch, encode=True) 448 | 449 | return z 450 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | # This includes the license file(s) in the wheel. 3 | # https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file 4 | license_files = LICENSE.txt 5 | 6 | [bdist_wheel] 7 | # This flag says to generate wheels that support both Python 2 and Python 8 | # 3. If your code will not run unchanged on both Python 2 and 3, you will 9 | # need to generate separate wheels for each Python version that you 10 | # support. Removing this line (or setting universal to 0) will prevent 11 | # bdist_wheel from trying to make a universal wheel. 
For more see: 12 | # https://packaging.python.org/guides/distributing-packages-using-setuptools/#wheels 13 | universal=0 -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | import platform 3 | 4 | with open("README.md", "r") as fh: 5 | long_description = fh.read() 6 | 7 | # Determine the correct TensorFlow package based on architecture 8 | if platform.machine() == 'arm64': # Apple Silicon 9 | tensorflow_packages = [ 10 | 'tensorflow-macos==2.8.0', 11 | 'tensorflow-metal==0.4.0' 12 | ] 13 | else: # Intel/AMD 14 | tensorflow_packages = ['tensorflow>=2.9.0,<2.10.0'] 15 | 16 | setuptools.setup( 17 | name="cvae", 18 | version="0.2.0", 19 | author="Max Frenzel", 20 | author_email="maxfrenzel+cvae@gmail.com", 21 | description="CompressionVAE: General purpose dimensionality reduction and manifold learning tool based on " 22 | "Variational Autoencoder.", 23 | long_description=long_description, 24 | long_description_content_type="text/markdown", 25 | url="https://github.com/maxfrenzel/CompressionVAE", 26 | packages=setuptools.find_packages(), 27 | classifiers=[ 28 | "Programming Language :: Python :: 3", 29 | "License :: OSI Approved :: MIT License", 30 | "Operating System :: OS Independent", 31 | ], 32 | python_requires='>=3.6', 33 | install_requires=[ 34 | 'numpy>=1.16.5,<1.23.0', 35 | 'matplotlib>=3.3.0,<4.0.0', 36 | 'joblib>=1.0.0,<2.0.0', 37 | 'tqdm>=4.50.0,<5.0.0', 38 | 'pandas>=1.3.0,<2.0.0' 39 | ] + tensorflow_packages, 40 | extras_require={ 41 | 'test': ['scikit-learn>=1.0.0'] 42 | }, 43 | keywords='vae variational autoencoder manifold dimensionality reduction compression tensorflow' 44 | ) --------------------------------------------------------------------------------
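
A minimal usage sketch (not part of the repository) of the `Batcher` class shown above, assuming it lives in `cvae/lib/data_reader_array.py` as the package layout suggests. Feature normalization is disabled so no `norm.pkl` statistics file is required, which also means the `logdir` argument is only a placeholder here.
```
import numpy as np
from cvae.lib import data_reader_array as dra

# Toy feature array: 1000 samples with 64 dimensions each
X = np.random.rand(1000, 64).astype(np.float32)

# With feature_normalization=False, no normalisation statistics are loaded from logdir
batcher = dra.Batcher(X, feature_normalization=False, logdir='temp', shuffle=True)

batch = batcher.next_batch(32)
print(batch.shape)                     # -> (32, 64)
print(batcher.get_epoch(32, step=10))  # fraction of an epoch seen after 10 steps of batch size 32
```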