├── .gitignore ├── DCGAN-MNIST.ipynb ├── DCGAN-face-creation.ipynb ├── README.md ├── VAE.ipynb └── style-transfer.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | *.mat 2 | .ipynb_checkpoints 3 | -------------------------------------------------------------------------------- /DCGAN-MNIST.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Teaching a Deep Convolutional Generative Adversarial Network (DCGAN) to draw MNIST characters\n", 8 | "\n", 9 | "In the last tutorial, we learnt using tensorflow for designing a Variational Autoencoder (VAE) that could draw MNIST characters. Most of the created digits looked nice. There was only one drawback -- some of the created images looked a bit fuzzy. The VAE was trained with the _mean squared error_ loss function. However, it's quite difficult to encode exact character edge locations, which leads to the network being unsure about those edges. And does it really matter if the edge of a character starts a few pixels more to the left or right? Not really.\n", 10 | "In this article, we will see how we can train a network that does not depend on the mean squared error or any related loss functions--instead, it will learn all by itself what a real image should look like.\n", 11 | "\n", 12 | "## Deep Convolutional Generative Adversarial Networks\n", 13 | "Another network architecture for learning to generate new content is the DCGAN. Like the VAE, our DCGAN consists of two parts:\n", 14 | "* The _discriminator_ learns how to distinguish fake from real objects of the type we'd like to create\n", 15 | "* The _generator_ creates new content and tries to fool the discriminator\n", 16 | "\n", 17 | "There is a HackerNoon article by Chanchana Sornsoontorn that explains the concept in more detail and describes some creative projects DCGANs have been applied to. One of these projects is the generation of MNIST characters. Let's try to use python and tensorflow for the same purpose." 
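Formally, this game can be summarized by the standard GAN objective (stated here only as a reference):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

In the code below, the generator does not literally minimize $\log(1 - D(G(z)))$; it instead minimizes $-\log D(G(z))$ (the common non-saturating variant), which tends to give stronger gradients early in training.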
18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 33, 23 | "metadata": { 24 | "collapsed": true 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import tensorflow as tf\n", 29 | "import numpy as np\n", 30 | "import matplotlib.pyplot as plt\n", 31 | "%matplotlib inline" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 72, 37 | "metadata": { 38 | "collapsed": true 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "# Code by Parag Mital (github.com/pkmital/CADL)\n", 43 | "def montage(images):\n", 44 | " if isinstance(images, list):\n", 45 | " images = np.array(images)\n", 46 | " img_h = images.shape[1]\n", 47 | " img_w = images.shape[2]\n", 48 | " n_plots = int(np.ceil(np.sqrt(images.shape[0])))\n", 49 | " m = np.ones((images.shape[1] * n_plots + n_plots + 1, images.shape[2] * n_plots + n_plots + 1)) * 0.5\n", 50 | " for i in range(n_plots):\n", 51 | " for j in range(n_plots):\n", 52 | " this_filter = i * n_plots + j\n", 53 | " if this_filter < images.shape[0]:\n", 54 | " this_img = images[this_filter]\n", 55 | " m[1 + i + i * img_h:1 + i + (i + 1) * img_h,\n", 56 | " 1 + j + j * img_w:1 + j + (j + 1) * img_w] = this_img\n", 57 | " return m" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## Setting up the basics\n", 65 | "Like in the last tutorial, we use tensorflow's own method for accessing batches of MNIST characters. We set our batch size to be 64. Our generator will take noise as input. The number of these inputs is being set to 100. Batch normalization considerably improved the training of this network. For tensorflow to apply batch normalization, we need to let it know whether we are in training mode. The variable _keep_prob_ will be used by our dropout layers, which we introduce for more stable learning outcomes.\n", 66 | "_lrelu_ defines the popular leaky ReLU, that hopefully will be supported by future versions of tensorflow! I firstly tried to apply standard ReLUs to this network, but this lead to the well-known _dying ReLU problem_, and I received generated images that looked like artwork by Kazimir Malevich--I just got black squares. \n", 67 | "\n", 68 | "Then, we define a function _binary_cross_entropy_, which we will use later, when computing losses." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": true 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "from tensorflow.examples.tutorials.mnist import input_data\n", 80 | "mnist = input_data.read_data_sets('MNIST_data')\n", 81 | "tf.reset_default_graph()\n", 82 | "batch_size = 64\n", 83 | "n_noise = 64\n", 84 | "\n", 85 | "X_in = tf.placeholder(dtype=tf.float32, shape=[None, 28, 28], name='X')\n", 86 | "noise = tf.placeholder(dtype=tf.float32, shape=[None, n_noise])\n", 87 | "\n", 88 | "keep_prob = tf.placeholder(dtype=tf.float32, name='keep_prob')\n", 89 | "is_training = tf.placeholder(dtype=tf.bool, name='is_training')\n", 90 | "\n", 91 | "def lrelu(x):\n", 92 | " return tf.maximum(x, tf.multiply(x, 0.2))\n", 93 | "\n", 94 | "def binary_cross_entropy(x, z):\n", 95 | " eps = 1e-12\n", 96 | " return (-(x * tf.log(z + eps) + (1. - x) * tf.log(1. - z + eps)))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## The discriminator\n", 104 | "Now, we can define the discriminator. It looks similar to the encoder part of our VAE. 
As input, it takes real or fake MNIST digits (28 x 28 pixel grayscale images) and applies a series of convolutions. Finally, we use a sigmoid to make sure our output can be interpreted as the probability that the input image is a real MNIST character." 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 79, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "Extracting MNIST_data/train-images-idx3-ubyte.gz\n", 117 | "Extracting MNIST_data/train-labels-idx1-ubyte.gz\n", 118 | "Extracting MNIST_data/t10k-images-idx3-ubyte.gz\n", 119 | "Extracting MNIST_data/t10k-labels-idx1-ubyte.gz\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "def discriminator(img_in, reuse=None, keep_prob=keep_prob):\n", 125 | " activation = lrelu\n", 126 | " with tf.variable_scope(\"discriminator\", reuse=reuse):\n", 127 | " x = tf.reshape(img_in, shape=[-1, 28, 28, 1])\n", 128 | " x = tf.layers.conv2d(x, kernel_size=5, filters=64, strides=2, padding='same', activation=activation)\n", 129 | " x = tf.layers.dropout(x, keep_prob)\n", 130 | " x = tf.layers.conv2d(x, kernel_size=5, filters=64, strides=1, padding='same', activation=activation)\n", 131 | " x = tf.layers.dropout(x, keep_prob)\n", 132 | " x = tf.layers.conv2d(x, kernel_size=5, filters=64, strides=1, padding='same', activation=activation)\n", 133 | " x = tf.layers.dropout(x, keep_prob)\n", 134 | " x = tf.contrib.layers.flatten(x)\n", 135 | " x = tf.layers.dense(x, units=128, activation=activation)\n", 136 | " x = tf.layers.dense(x, units=1, activation=tf.nn.sigmoid)\n", 137 | " return x" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## The generator\n", 145 | "The generator--just like the decoder part in our VAE--takes noise and tries to learn how to transform this noise into digits. To this end, it applies several transpose convolutions. At first, I didn't apply batch normalization to the generator, and its learning seemed to be really inefficient. After applying batch normalization layers, learning improved considerably. Also, I initially had a much larger dense layer accepting the generator input. This led to the generator always creating the same output, no matter what the input noise was. Tuning the generator honestly took quite some effort!" 
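Before looking at the code, it helps to quickly check the spatial dimensions. Assuming _same_ padding, a transposed convolution with stride _s_ multiplies the spatial size by _s_, so the reshaped and resized 7 x 7 feature map ends up as 28 x 28 -- a small sanity-check sketch (not part of the model itself):

```python
# Quick sanity check of the generator's output size (illustrative sketch only).
# With padding='same', conv2d_transpose multiplies height/width by its stride.
size = 7                     # after tf.image.resize_images(x, size=[7, 7])
for stride in [2, 2, 1, 1]:  # strides of the four transposed convolutions below
    size *= stride
print(size)                  # 28 -> matches the 28 x 28 MNIST format
```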
146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "def generator(z, keep_prob=keep_prob, is_training=is_training):\n", 157 | " activation = lrelu\n", 158 | " momentum = 0.99\n", 159 | " with tf.variable_scope(\"generator\", reuse=None):\n", 160 | " x = z\n", 161 | " d1 = 4\n", 162 | " d2 = 1\n", 163 | " x = tf.layers.dense(x, units=d1 * d1 * d2, activation=activation)\n", 164 | " x = tf.layers.dropout(x, keep_prob) \n", 165 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum) \n", 166 | " x = tf.reshape(x, shape=[-1, d1, d1, d2])\n", 167 | " x = tf.image.resize_images(x, size=[7, 7])\n", 168 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=64, strides=2, padding='same', activation=activation)\n", 169 | " x = tf.layers.dropout(x, keep_prob)\n", 170 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum)\n", 171 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=64, strides=2, padding='same', activation=activation)\n", 172 | " x = tf.layers.dropout(x, keep_prob)\n", 173 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum)\n", 174 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=64, strides=1, padding='same', activation=activation)\n", 175 | " x = tf.layers.dropout(x, keep_prob)\n", 176 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum)\n", 177 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=1, strides=1, padding='same', activation=tf.nn.sigmoid)\n", 178 | " return x " 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## Loss functions and optimizers\n", 186 | "Now, we wire both parts together, like we did for the encoder and the decoder of our VAE in the last tutorial.\n", 187 | "However, we have to create two objects of our discriminator\n", 188 | "* The first object receives the real images\n", 189 | "* The second object receives the fake images\n", 190 | "\n", 191 | "_reuse_ of the second object is set to _True_ so both objects share their variables. We need both instances for computing two types of losses:\n", 192 | "* when receiving real images, the discriminator should learn to compute high values (near _1_), meaning that it is confident the input images are real\n", 193 | "* when receiving fake images, it should compute low values (near _0_), meaning it is confident the input images are not real\n", 194 | "\n", 195 | "To accomplish this, we use the _binary cross entropy_ function defined earlier. The generator tries to achieve the opposite goal, it tries to make the discriminator assign high values to fake images.\n", 196 | "\n", 197 | "Now, we also apply some regularization. We create two distinct optimizers, one for the discriminator, one for the generator. We have to define which variables we allow these optimizers to modify, otherwise the generator's optimizer could just mess up the discriminator's variables and vice-versa.\n", 198 | "\n", 199 | "We have to provide the __update_ops__ to our optimizers when applying batch normalization--take a look at the tensorflow documentation for more information on this topic." 
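Written out (using the _binary_cross_entropy_ helper from above, and writing $D(x)$ for the discriminator output on real images and $D(G(z))$ for its output on generated images), the losses computed in the next cell are

$$\mathcal{L}_D = -\tfrac{1}{2}\,\mathbb{E}\big[\log D(x) + \log\big(1 - D(G(z))\big)\big], \qquad \mathcal{L}_G = -\,\mathbb{E}\big[\log D(G(z))\big]$$

with a small $\epsilon$ inside the logarithms for numerical stability.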
200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 80, 205 | "metadata": { 206 | "collapsed": true 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "g = generator(noise, keep_prob, is_training)\n", 211 | "d_real = discriminator(X_in)\n", 212 | "d_fake = discriminator(g, reuse=True)\n", 213 | "\n", 214 | "vars_g = [var for var in tf.trainable_variables() if var.name.startswith(\"generator\")]\n", 215 | "vars_d = [var for var in tf.trainable_variables() if var.name.startswith(\"discriminator\")]\n", 216 | "\n", 217 | "\n", 218 | "d_reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-6), vars_d)\n", 219 | "g_reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-6), vars_g)\n", 220 | "\n", 221 | "loss_d_real = binary_cross_entropy(tf.ones_like(d_real), d_real)\n", 222 | "loss_d_fake = binary_cross_entropy(tf.zeros_like(d_fake), d_fake)\n", 223 | "loss_g = tf.reduce_mean(binary_cross_entropy(tf.ones_like(d_fake), d_fake))\n", 224 | "loss_d = tf.reduce_mean(0.5 * (loss_d_real + loss_d_fake))\n", 225 | "\n", 226 | "update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)\n", 227 | "with tf.control_dependencies(update_ops):\n", 228 | " optimizer_d = tf.train.RMSPropOptimizer(learning_rate=0.00015).minimize(loss_d + d_reg, var_list=vars_d)\n", 229 | " optimizer_g = tf.train.RMSPropOptimizer(learning_rate=0.00015).minimize(loss_g + g_reg, var_list=vars_g)\n", 230 | " \n", 231 | " \n", 232 | "sess = tf.Session()\n", 233 | "sess.run(tf.global_variables_initializer())" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "## Training the DCGAN\n", 241 | "Finally, the fun part begins--let's train our network! \n", 242 | "We feed random values to our generator, which will learn to create digits out of this noise. We also take care that neither the generator nor the discriminator becomes too strong--otherwise, this would inhibit the learning of the other part and could even stop the network from learning anything at all (I unfortunately have made this experience)." 
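The balancing is done with a simple heuristic based on the two loss values. As a minimal illustration of the rule used in the loop below (the numbers are made up):

```python
# Illustration of the loss-balancing rule used in the training loop (made-up values).
d_ls, g_ls = 1.2, 0.5               # current discriminator / generator losses
train_g = not (g_ls * 1.5 < d_ls)   # generator is far ahead -> skip its update
train_d = not (d_ls * 2 < g_ls)     # discriminator is far ahead -> skip its update
print(train_g, train_d)             # False True: only the discriminator trains this step
```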
243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "collapsed": true 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "for i in range(60000):\n", 254 | " train_d = True\n", 255 | " train_g = True\n", 256 | " keep_prob_train = 0.6 # 0.5\n", 257 | " \n", 258 | " \n", 259 | " n = np.random.uniform(0.0, 1.0, [batch_size, n_noise]).astype(np.float32) \n", 260 | " batch = [np.reshape(b, [28, 28]) for b in mnist.train.next_batch(batch_size=batch_size)[0]] \n", 261 | " \n", 262 | " d_real_ls, d_fake_ls, g_ls, d_ls = sess.run([loss_d_real, loss_d_fake, loss_g, loss_d], feed_dict={X_in: batch, noise: n, keep_prob: keep_prob_train, is_training:True})\n", 263 | " \n", 264 | " d_real_ls = np.mean(d_real_ls)\n", 265 | " d_fake_ls = np.mean(d_fake_ls)\n", 266 | " g_ls = g_ls\n", 267 | " d_ls = d_ls\n", 268 | " \n", 269 | " if g_ls * 1.5 < d_ls:\n", 270 | " train_g = False\n", 271 | " pass\n", 272 | " if d_ls * 2 < g_ls:\n", 273 | " train_d = False\n", 274 | " pass\n", 275 | " \n", 276 | " if train_d:\n", 277 | " sess.run(optimizer_d, feed_dict={noise: n, X_in: batch, keep_prob: keep_prob_train, is_training:True})\n", 278 | " \n", 279 | " \n", 280 | " if train_g:\n", 281 | " sess.run(optimizer_g, feed_dict={noise: n, keep_prob: keep_prob_train, is_training:True})\n", 282 | " \n", 283 | " \n", 284 | " if not i % 50:\n", 285 | " print (i, d_ls, g_ls, d_real_ls, d_fake_ls)\n", 286 | " if not train_g:\n", 287 | " print(\"not training generator\")\n", 288 | " if not train_d:\n", 289 | " print(\"not training discriminator\")\n", 290 | " gen_img = sess.run(g, feed_dict = {noise: n, keep_prob: 1.0, is_training:False})\n", 291 | " imgs = [img[:,:,0] for img in gen_img]\n", 292 | " m = montage(imgs)\n", 293 | " gen_img = m\n", 294 | " plt.axis('off')\n", 295 | " plt.imshow(gen_img, cmap='gray')\n", 296 | " plt.show()" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": { 302 | "collapsed": true 303 | }, 304 | "source": [ 305 | "## Results\n", 306 | "Take a look at the pictures drawn by our generator--they look more realistic than the pictures drawn by the VAE, which looked more fuzzy at their edges. Training however took much longer than training the other model.\n", 307 | "\n", 308 | "In conclusion, training the DCGAN took me much longer than training the VAE. Maybe fine-tuning the architecture could speed up the network's learning. Nonetheless, it's a real advantage that we are not dependent on loss functions based on pixel positions, making the results look less fuzzy. This is especially important when creating more complex data--e.g. pictures of human faces. So, just be a little patient--then everything is possible in the world of deep learning!" 
309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": { 315 | "collapsed": true 316 | }, 317 | "outputs": [], 318 | "source": [] 319 | } 320 | ], 321 | "metadata": { 322 | "anaconda-cloud": {}, 323 | "kernelspec": { 324 | "display_name": "Python 3", 325 | "language": "python", 326 | "name": "python3" 327 | }, 328 | "language_info": { 329 | "codemirror_mode": { 330 | "name": "ipython", 331 | "version": 3 332 | }, 333 | "file_extension": ".py", 334 | "mimetype": "text/x-python", 335 | "name": "python", 336 | "nbconvert_exporter": "python", 337 | "pygments_lexer": "ipython3", 338 | "version": "3.6.1" 339 | } 340 | }, 341 | "nbformat": 4, 342 | "nbformat_minor": 1 343 | } 344 | -------------------------------------------------------------------------------- /DCGAN-face-creation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Training a DCGAN to draw human faces\n", 8 | "\n", 9 | "This notebook does not contain much documentation text. If you are wondering about the DCGAN code shown below, please take a look at the code of a DCGAN for MNIST creation. The architecture of this network is basically the same.\n", 10 | "\n", 11 | "## Examples created\n", 12 | "See the _examples_ directory, the _lfw_ images have been created by this network.\n", 13 | "\n", 14 | "\n", 15 | "## What to consider\n", 16 | "If you want to train this model yourself, please make sure you have a decent GPU--the example images were created after running the model on a Tesla K80 for several hours.\n", 17 | "\n", 18 | "\n", 19 | "\n", 20 | "## Downloading the LFW (Labeled Faces in the Wild) data" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "url = \"http://vis-www.cs.umass.edu/lfw/lfw.tgz\"\n", 32 | "filename = \"lfw.tgz\"\n", 33 | "directory = \"imgs\"\n", 34 | "new_dir = \"new_imgs\"\n", 35 | "import urllib\n", 36 | "import tarfile\n", 37 | "import os\n", 38 | "import tarfile\n", 39 | "import numpy as np\n", 40 | "import matplotlib.pyplot as plt\n", 41 | "from matplotlib.image import imread\n", 42 | "from scipy.misc import imresize, imsave\n", 43 | "import tensorflow as tf\n", 44 | "%matplotlib inline" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Saving the LFW files to a directory" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "collapsed": true 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "if not os.path.isdir(directory):\n", 63 | " if not os.path.isfile(filename):\n", 64 | " urllib.urlretrieve (url, filename)\n", 65 | " tar = tarfile.open(filename, \"r:gz\")\n", 66 | " tar.extractall(path=directory)\n", 67 | " tar.close()" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Modifying the images (reducing their size)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "collapsed": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "filepaths = []\n", 86 | "for dir_, _, files in os.walk(directory):\n", 87 | " for fileName in files:\n", 88 | " relDir = os.path.relpath(dir_, directory)\n", 89 | " relFile = os.path.join(relDir, fileName)\n", 90 | " filepaths.append(directory + \"/\" + relFile)\n", 91 | " \n", 
92 | "for i, fp in enumerate(filepaths):\n", 93 | " img = imread(fp) #/ 255.0\n", 94 | " img = imresize(img, (40, 40))\n", 95 | " imsave(new_dir + \"/\" + str(i) + \".png\", img) " 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "filepaths_new = []\n", 107 | "for dir_, _, files in os.walk(new_dir):\n", 108 | " for fileName in files:\n", 109 | " if not fileName.endswith(\".png\"):\n", 110 | " continue\n", 111 | " relDir = os.path.relpath(dir_, directory)\n", 112 | " relFile = os.path.join(relDir, fileName)\n", 113 | " filepaths_new.append(directory + \"/\" + relFile)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "## Definition of a method to access 40 x 40 x 3 face images" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "collapsed": true 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "def next_batch(num=64, data=filepaths_new):\n", 132 | " idx = np.arange(0 , len(data))\n", 133 | " np.random.shuffle(idx)\n", 134 | " idx = idx[:num]\n", 135 | " data_shuffle = [imread(data[i]) for i in idx]\n", 136 | "\n", 137 | " shuffled = np.asarray(data_shuffle)\n", 138 | " \n", 139 | " return np.asarray(data_shuffle)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## Code for creating montages (by Parag Mital)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "# Code by Parag Mital (https://github.com/pkmital/CADL/)\n", 158 | "def montage(images): \n", 159 | " if isinstance(images, list):\n", 160 | " images = np.array(images)\n", 161 | " img_h = images.shape[1]\n", 162 | " img_w = images.shape[2]\n", 163 | " n_plots = int(np.ceil(np.sqrt(images.shape[0])))\n", 164 | " if len(images.shape) == 4 and images.shape[3] == 3:\n", 165 | " m = np.ones(\n", 166 | " (images.shape[1] * n_plots + n_plots + 1,\n", 167 | " images.shape[2] * n_plots + n_plots + 1, 3)) * 0.5\n", 168 | " elif len(images.shape) == 4 and images.shape[3] == 1:\n", 169 | " m = np.ones(\n", 170 | " (images.shape[1] * n_plots + n_plots + 1,\n", 171 | " images.shape[2] * n_plots + n_plots + 1, 1)) * 0.5\n", 172 | " elif len(images.shape) == 3:\n", 173 | " m = np.ones(\n", 174 | " (images.shape[1] * n_plots + n_plots + 1,\n", 175 | " images.shape[2] * n_plots + n_plots + 1)) * 0.5\n", 176 | " else:\n", 177 | " raise ValueError('Could not parse image shape of {}'.format(\n", 178 | " images.shape))\n", 179 | " for i in range(n_plots):\n", 180 | " for j in range(n_plots):\n", 181 | " this_filter = i * n_plots + j\n", 182 | " if this_filter < images.shape[0]:\n", 183 | " this_img = images[this_filter]\n", 184 | " m[1 + i + i * img_h:1 + i + (i + 1) * img_h,\n", 185 | " 1 + j + j * img_w:1 + j + (j + 1) * img_w] = this_img\n", 186 | " return m" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "## Definition of the neural network" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": { 200 | "collapsed": true 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "tf.reset_default_graph()\n", 205 | "batch_size = 64\n", 206 | "n_noise = 64\n", 207 | "\n", 208 | "X_in = tf.placeholder(dtype=tf.float32, shape=[None, 40, 40, 
3], name='X')\n", 209 | "noise = tf.placeholder(dtype=tf.float32, shape=[None, n_noise])\n", 210 | "\n", 211 | "keep_prob = tf.placeholder(dtype=tf.float32, name='keep_prob')\n", 212 | "is_training = tf.placeholder(dtype=tf.bool, name='is_training')\n", 213 | "\n", 214 | "def lrelu(x):\n", 215 | " return tf.maximum(x, tf.multiply(x, 0.2))\n", 216 | "\n", 217 | "def binary_cross_entropy(x, z):\n", 218 | " eps = 1e-12\n", 219 | " return (-(x * tf.log(z + eps) + (1. - x) * tf.log(1. - z + eps)))\n", 220 | "\n", 221 | "def discriminator(img_in, reuse=None, keep_prob=keep_prob):\n", 222 | " activation = lrelu\n", 223 | " with tf.variable_scope(\"discriminator\", reuse=reuse):\n", 224 | " x = tf.reshape(img_in, shape=[-1, 40, 40, 3])\n", 225 | " x = tf.layers.conv2d(x, kernel_size=5, filters=256, strides=2, padding='same', activation=activation)\n", 226 | " x = tf.layers.dropout(x, keep_prob)\n", 227 | " x = tf.layers.conv2d(x, kernel_size=5, filters=128, strides=1, padding='same', activation=activation)\n", 228 | " x = tf.layers.dropout(x, keep_prob)\n", 229 | " x = tf.layers.conv2d(x, kernel_size=5, filters=64, strides=1, padding='same', activation=activation)\n", 230 | " x = tf.layers.dropout(x, keep_prob)\n", 231 | " x = tf.contrib.layers.flatten(x)\n", 232 | " x = tf.layers.dense(x, units=128, activation=activation)\n", 233 | " x = tf.layers.dense(x, units=1, activation=tf.nn.sigmoid)\n", 234 | " return x\n", 235 | " \n", 236 | "def generator(z, keep_prob=keep_prob, is_training=is_training):\n", 237 | " activation = lrelu\n", 238 | " momentum = 0.9\n", 239 | " with tf.variable_scope(\"generator\", reuse=None):\n", 240 | " x = z\n", 241 | " \n", 242 | " d1 = 4#3\n", 243 | " d2 = 3\n", 244 | " \n", 245 | " x = tf.layers.dense(x, units=d1 * d1 * d2, activation=activation)\n", 246 | " x = tf.layers.dropout(x, keep_prob) \n", 247 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum) \n", 248 | " \n", 249 | " x = tf.reshape(x, shape=[-1, d1, d1, d2])\n", 250 | " x = tf.image.resize_images(x, size=[10, 10])\n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=256, strides=2, padding='same', activation=activation)\n", 255 | " x = tf.layers.dropout(x, keep_prob)\n", 256 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum)\n", 257 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=128, strides=2, padding='same', activation=activation)\n", 258 | " x = tf.layers.dropout(x, keep_prob)\n", 259 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum)\n", 260 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=64, strides=1, padding='same', activation=activation)\n", 261 | " x = tf.layers.dropout(x, keep_prob)\n", 262 | " x = tf.contrib.layers.batch_norm(x, is_training=is_training, decay=momentum)\n", 263 | " x = tf.layers.conv2d_transpose(x, kernel_size=5, filters=3, strides=1, padding='same', activation=tf.nn.sigmoid)\n", 264 | " return x " 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "## Losses and optimizers" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": { 278 | "collapsed": true 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "g = generator(noise, keep_prob, is_training)\n", 283 | "print(g)\n", 284 | "d_real = discriminator(X_in)\n", 285 | "d_fake = discriminator(g, reuse=True)\n", 286 | "\n", 287 | "vars_g = [var for var in 
tf.trainable_variables() if var.name.startswith(\"generator\")]\n", 288 | "vars_d = [var for var in tf.trainable_variables() if var.name.startswith(\"discriminator\")]\n", 289 | "\n", 290 | "\n", 291 | "d_reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-6), vars_d)\n", 292 | "g_reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-6), vars_g)\n", 293 | "\n", 294 | "loss_d_real = binary_cross_entropy(tf.ones_like(d_real), d_real)\n", 295 | "loss_d_fake = binary_cross_entropy(tf.zeros_like(d_fake), d_fake)\n", 296 | "loss_g = tf.reduce_mean(binary_cross_entropy(tf.ones_like(d_fake), d_fake))\n", 297 | "\n", 298 | "loss_d = tf.reduce_mean(0.5 * (loss_d_real + loss_d_fake))\n", 299 | "\n", 300 | "update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)\n", 301 | "with tf.control_dependencies(update_ops):\n", 302 | " optimizer_d = tf.train.RMSPropOptimizer(learning_rate=0.0001).minimize(loss_d + d_reg, var_list=vars_d)\n", 303 | " optimizer_g = tf.train.RMSPropOptimizer(learning_rate=0.0002).minimize(loss_g + g_reg, var_list=vars_g)\n", 304 | "sess = tf.Session()\n", 305 | "sess.run(tf.global_variables_initializer())" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "## Training the network" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": { 319 | "collapsed": true 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "for i in range(60000):\n", 324 | " train_d = True\n", 325 | " train_g = True\n", 326 | " keep_prob_train = 0.6 # 0.5\n", 327 | " \n", 328 | " \n", 329 | " n = np.random.uniform(0.0, 1.0, [batch_size, n_noise]).astype(np.float32) \n", 330 | " batch = [b for b in next_batch(num=batch_size)] \n", 331 | " \n", 332 | " d_real_ls, d_fake_ls, g_ls, d_ls = sess.run([loss_d_real, loss_d_fake, loss_g, loss_d], feed_dict={X_in: batch, noise: n, keep_prob: keep_prob_train, is_training:True})\n", 333 | " \n", 334 | " d_fake_ls_init = d_fake_ls\n", 335 | " \n", 336 | " d_real_ls = np.mean(d_real_ls)\n", 337 | " d_fake_ls = np.mean(d_fake_ls)\n", 338 | " g_ls = g_ls\n", 339 | " d_ls = d_ls\n", 340 | " \n", 341 | " if g_ls * 1.35 < d_ls:\n", 342 | " train_g = False\n", 343 | " pass\n", 344 | " if d_ls * 1.35 < g_ls:\n", 345 | " train_d = False\n", 346 | " pass\n", 347 | " \n", 348 | " if train_d:\n", 349 | " sess.run(optimizer_d, feed_dict={noise: n, X_in: batch, keep_prob: keep_prob_train, is_training:True})\n", 350 | " \n", 351 | " \n", 352 | " if train_g:\n", 353 | " sess.run(optimizer_g, feed_dict={noise: n, keep_prob: keep_prob_train, is_training:True})\n", 354 | " \n", 355 | " \n", 356 | " if not i % 10:\n", 357 | " print (i, d_ls, g_ls)\n", 358 | " if not train_g:\n", 359 | " print(\"not training generator\")\n", 360 | " if not train_d:\n", 361 | " print(\"not training discriminator\")\n", 362 | " gen_imgs = sess.run(g, feed_dict = {noise: n, keep_prob: 1.0, is_training:False})\n", 363 | " imgs = [img[:,:,:] for img in gen_imgs]\n", 364 | " m = montage(imgs)\n", 365 | " #m = imgs[0]\n", 366 | " plt.axis('off')\n", 367 | " plt.imshow(m, cmap='gray')\n", 368 | " plt.show()" 369 | ] 370 | } 371 | ], 372 | "metadata": { 373 | "kernelspec": { 374 | "display_name": "Python 3", 375 | "language": "python", 376 | "name": "python3" 377 | }, 378 | "language_info": { 379 | "codemirror_mode": { 380 | "name": "ipython", 381 | "version": 3 382 | }, 383 | "file_extension": ".py", 384 | "mimetype": "text/x-python", 385 | "name": "python", 386 | 
"nbconvert_exporter": "python", 387 | "pygments_lexer": "ipython3", 388 | "version": "3.6.1" 389 | } 390 | }, 391 | "nbformat": 4, 392 | "nbformat_minor": 2 393 | } 394 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Learning with Python 2 | 3 | Here I will put some projects that I create while learning how to apply deep neural networks. 4 | Current projects: 5 | * Variational Autoencoder for creating MNIST characters (notebook with explanations) 6 | * Deep Convolutional Generative Adversarial Network (DCGAN) for the same purpose (notebook with explanations) 7 | * DCGAN for creating human faces 8 | * Style Transfer 9 | -------------------------------------------------------------------------------- /VAE.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Teaching a Variational Autoencoder (VAE) to draw MNIST characters\n", 8 | "\n", 9 | "Autoencoders are a type of neural network that can be used to learn efficient codings of input data. \n", 10 | "Given some inputs, the network firstly applies a series of transformations that map the input data into a lower dimensional space. This part of the network is called the _encoder_. Then, the network uses the encoded data to try and recreate the inputs. This part of the network is the _decoder_. Using the encoder, we can later compress data of the type that is understood by the network. However, autoencoders are rarely used for this purpose, as usually there exist hand-crafted algorithms (like _jpg_-compression) that are more efficient. Instead, autoencoders have repeatedly been applied to perform denoising tasks. Then, the encoder receives pictures that have been tampered with noise, and it learns how to reconstruct the original images.\n", 11 | "\n", 12 | "\n", 13 | "## Variational Autoencoders put simply\n", 14 | "But there exists a much more interesting application for autoencoders. This application is called the _variational autoencoder_. Using variational autoencoders, it's not only possible to compress data -- it's also possible to generate new objects of the type the autoencoder has seen before.\n", 15 | "\n", 16 | "Using a general autoencoder, we don't know anything about the coding that's been generated by our network. We could take a look at and compare different encoded objects, but it's unlikely that we'll be able to understand what's going on. This means that we won't be able to use our decoder for creating new objects -- we simply don't know what the inputs should look like.\n", 17 | "\n", 18 | "Using a variational autoencoder, we take the opposite approach instead. We will not try to make guesses concerning the distribution that's being followed by the latent vectors. We simply tell our network what we want this distribution to look like. Usually, we will constrain the network to produce latent vectors having entries that follow the unit normal distribution. Then, when trying to generate data, we can simply sample some values from this distribution, feed them to the decoder, and the decoder will return us completely new objects that appear just like the objects our network has been trained with.\n", 19 | "\n", 20 | "Let's see how this can be done using python and tensorflow. We are going to teach our network how to draw MNIST characters." 
21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## First steps -- Loading the training data\n", 28 | "Firstly, we perform some basic imports. Tensorflow has a quite handy function that allows us to easily access the MNIST data set." 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "import tensorflow as tf\n", 38 | "import numpy as np\n", 39 | "import matplotlib.pyplot as plt\n", 40 | "%matplotlib inline\n", 41 | "\n", 42 | "from tensorflow.examples.tutorials.mnist import input_data\n", 43 | "mnist = input_data.read_data_sets('MNIST_data')" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## Defining our input and output data\n", 51 | "MNIST images have a dimension of 28 * 28 pixels with one color channel. Our inputs _X_in_ will be batches of MNIST characters, while our network will learn to reconstruct them and output them in a placeholder _Y_, which thus has the same dimensions. _Y_flat_ will be used later, when computing losses. _keep_prob_ will be used when applying dropouts as a means of regularization. During training, it will have a value of 0.8. When generating new data, we won't apply dropout, so the value will be 1. The function _lrelu_ is being defined as tensorflow unfortunately doesn't come up with a predefined leaky ReLU." 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "tf.reset_default_graph()\n", 61 | "\n", 62 | "batch_size = 64\n", 63 | "\n", 64 | "X_in = tf.placeholder(dtype=tf.float32, shape=[None, 28, 28], name='X')\n", 65 | "Y = tf.placeholder(dtype=tf.float32, shape=[None, 28, 28], name='Y')\n", 66 | "Y_flat = tf.reshape(Y, shape=[-1, 28 * 28])\n", 67 | "keep_prob = tf.placeholder(dtype=tf.float32, shape=(), name='keep_prob')\n", 68 | "\n", 69 | "dec_in_channels = 1\n", 70 | "n_latent = 8\n", 71 | "\n", 72 | "reshaped_dim = [-1, 7, 7, dec_in_channels]\n", 73 | "inputs_decoder = 49 * dec_in_channels / 2\n", 74 | "\n", 75 | "\n", 76 | "def lrelu(x, alpha=0.3):\n", 77 | " return tf.maximum(x, tf.multiply(x, alpha))" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Defining the encoder\n", 85 | "As our inputs are images, it's most reasonable to apply some convolutional transformations to them. What's most noteworthy is the fact that we are creating two vectors in our encoder, as the encoder is supposed to create objects following a Gaussian Distribution:\n", 86 | "* A vector of means\n", 87 | "* A vector of standard deviations\n", 88 | "\n", 89 | "You will see later how we \"force\" the encoder to make sure it really creates values following a Normal Distribution. The returned values that will be fed to the decoder are the _z_-values. We will need the mean and standard deviation of our distributions later, when computing losses. 
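The sampling step below uses the well-known _reparameterization trick_: writing _sd_ for the predicted log standard deviation (as in the code), the latent sample is

$$z = \mu + e^{\,\mathrm{sd}} \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

so the randomness is isolated in $\epsilon$ and gradients can still flow into $\mu$ and _sd_.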
" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "def encoder(X_in, keep_prob):\n", 99 | " activation = lrelu\n", 100 | " with tf.variable_scope(\"encoder\", reuse=None):\n", 101 | " X = tf.reshape(X_in, shape=[-1, 28, 28, 1])\n", 102 | " x = tf.layers.conv2d(X, filters=64, kernel_size=4, strides=2, padding='same', activation=activation)\n", 103 | " x = tf.nn.dropout(x, keep_prob)\n", 104 | " x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=2, padding='same', activation=activation)\n", 105 | " x = tf.nn.dropout(x, keep_prob)\n", 106 | " x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=1, padding='same', activation=activation)\n", 107 | " x = tf.nn.dropout(x, keep_prob)\n", 108 | " x = tf.contrib.layers.flatten(x)\n", 109 | " mn = tf.layers.dense(x, units=n_latent)\n", 110 | " sd = 0.5 * tf.layers.dense(x, units=n_latent) \n", 111 | " epsilon = tf.random_normal(tf.stack([tf.shape(x)[0], n_latent])) \n", 112 | " z = mn + tf.multiply(epsilon, tf.exp(sd))\n", 113 | " \n", 114 | " return z, mn, sd" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Defining the decoder\n", 122 | "The decoder does not care about whether the input values are sampled from some specific distribution that has been defined by us. It simply will try to reconstruct the input images. To this end, we use a series of transpose convolutions." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "def decoder(sampled_z, keep_prob):\n", 132 | " with tf.variable_scope(\"decoder\", reuse=None):\n", 133 | " x = tf.layers.dense(sampled_z, units=inputs_decoder, activation=lrelu)\n", 134 | " x = tf.layers.dense(x, units=inputs_decoder * 2 + 1, activation=lrelu)\n", 135 | " x = tf.reshape(x, reshaped_dim)\n", 136 | " x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=2, padding='same', activation=tf.nn.relu)\n", 137 | " x = tf.nn.dropout(x, keep_prob)\n", 138 | " x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=1, padding='same', activation=tf.nn.relu)\n", 139 | " x = tf.nn.dropout(x, keep_prob)\n", 140 | " x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=1, padding='same', activation=tf.nn.relu)\n", 141 | " \n", 142 | " x = tf.contrib.layers.flatten(x)\n", 143 | " x = tf.layers.dense(x, units=28*28, activation=tf.nn.sigmoid)\n", 144 | " img = tf.reshape(x, shape=[-1, 28, 28])\n", 145 | " return img" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "Now, we'll wire together both parts:" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "sampled, mn, sd = encoder(X_in, keep_prob)\n", 162 | "dec = decoder(sampled, keep_prob)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Computing losses and enforcing a Gaussian latent distribution\n", 170 | "For computing the image reconstruction loss, we simply use squared difference (which could lead to images sometimes looking a bit fuzzy). This loss is combined with the _Kullback-Leibler divergence_, which makes sure our latent values will be sampled from a normal distribution. For more on this topic, please take a look a Jaan Altosaar's great article on VAEs. 
" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "unreshaped = tf.reshape(dec, [-1, 28*28])\n", 180 | "img_loss = tf.reduce_sum(tf.squared_difference(unreshaped, Y_flat), 1)\n", 181 | "latent_loss = -0.5 * tf.reduce_sum(1.0 + 2.0 * sd - tf.square(mn) - tf.exp(2.0 * sd), 1)\n", 182 | "loss = tf.reduce_mean(img_loss + latent_loss)\n", 183 | "optimizer = tf.train.AdamOptimizer(0.0005).minimize(loss)\n", 184 | "sess = tf.Session()\n", 185 | "sess.run(tf.global_variables_initializer())" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": { 191 | "collapsed": true 192 | }, 193 | "source": [ 194 | "## Training the network\n", 195 | "Now, we can finally train our VAE! Every 200 steps, we'll take a look at what the current reconstructions look like. After having processed about 2000 batches, most reconstructions will look reasonable." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "for i in range(30000):\n", 205 | " batch = [np.reshape(b, [28, 28]) for b in mnist.train.next_batch(batch_size=batch_size)[0]]\n", 206 | " sess.run(optimizer, feed_dict = {X_in: batch, Y: batch, keep_prob: 0.8})\n", 207 | " \n", 208 | " if not i % 200:\n", 209 | " ls, d, i_ls, d_ls, mu, sigm = sess.run([loss, dec, img_loss, latent_loss, mn, sd], feed_dict = {X_in: batch, Y: batch, keep_prob: 1.0})\n", 210 | " plt.imshow(np.reshape(batch[0], [28, 28]), cmap='gray')\n", 211 | " plt.show()\n", 212 | " plt.imshow(d[0], cmap='gray')\n", 213 | " plt.show()\n", 214 | " print(i, ls, np.mean(i_ls), np.mean(d_ls))" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "## Generating new data\n", 222 | "The most awesome part is that we are now able to create new characters. To this end, we simply sample values from a unit normal distribution and feed them to our decoder. Most of the created characters look just like they've been written by humans. " 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "randoms = [np.random.normal(0, 1, n_latent) for _ in range(10)]\n", 232 | "imgs = sess.run(dec, feed_dict = {sampled: randoms, keep_prob: 1.0})\n", 233 | "imgs = [np.reshape(imgs[i], [28, 28]) for i in range(len(imgs))]\n", 234 | "\n", 235 | "for img in imgs:\n", 236 | " plt.figure(figsize=(1,1))\n", 237 | " plt.axis('off')\n", 238 | " plt.imshow(img, cmap='gray')" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": { 244 | "collapsed": true 245 | }, 246 | "source": [ 247 | "## Conclusion\n", 248 | "Now, this obviously is a relatively simple example of an application of VAEs. But just think about what could be possible! Neural networks could learn to compose music. They could automatically create illustrations for books, games etc. 
With a bit of creativity, VAEs will open up space for some awesome projects " 249 | ] 250 | } 251 | ], 252 | "metadata": { 253 | "anaconda-cloud": {}, 254 | "kernelspec": { 255 | "display_name": "Python 3", 256 | "language": "python", 257 | "name": "python3" 258 | }, 259 | "language_info": { 260 | "codemirror_mode": { 261 | "name": "ipython", 262 | "version": 3 263 | }, 264 | "file_extension": ".py", 265 | "mimetype": "text/x-python", 266 | "name": "python", 267 | "nbconvert_exporter": "python", 268 | "pygments_lexer": "ipython3", 269 | "version": "3.6.4" 270 | } 271 | }, 272 | "nbformat": 4, 273 | "nbformat_minor": 1 274 | } 275 | -------------------------------------------------------------------------------- /style-transfer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Style Transfer with Deep Learning using the VGG-19 network" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import os\n", 17 | "from six.moves import urllib\n", 18 | "from scipy.io import loadmat\n", 19 | "import tensorflow as tf\n", 20 | "import numpy as np\n", 21 | "import matplotlib.pyplot as plt\n", 22 | "from scipy.misc import imresize\n", 23 | "%matplotlib inline\n", 24 | "\n", 25 | "\n", 26 | "def download_hook(count, block_size, total_size):\n", 27 | " if count % 20 == 0 or count * block_size == total_size:\n", 28 | " percentage = 100.0 * count * block_size / total_size\n", 29 | " barstring = [\"=\" for _ in range(int(percentage / 2.0))] + [\">\"] + [\".\" for _ in range(50 - int(percentage / 2.0))]\n", 30 | " barstring = \"[\" + \"\".join(barstring) + \"]\"\n", 31 | " outstring = '%02.02f%% (%02.02f of %02.02f MB)\\t\\t' + barstring\n", 32 | " print(outstring % (percentage, count * block_size / 1024.0 / 1024.0, total_size / 1024.0 / 1024.0), end='\\r')" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "The VGG-19 model is quite large, so be a little patient." 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "path = \"http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat\"\n", 49 | "fname = \"vgg-19.mat\"\n", 50 | "if not os.path.exists(fname):\n", 51 | " print(\"Downloading ...\")\n", 52 | " filepath, _ = urllib.request.urlretrieve(path, filename=fname, reporthook=download_hook)\n", 53 | " print(\"Done.\")\n", 54 | "\n", 55 | " \n", 56 | "if not os.path.exists(\"content.jpg\"): \n", 57 | " urllib.request.urlretrieve(\"\", filename=\"content.jpg\") # Attribution: Gage Skidmore\n", 58 | " urllib.request.urlretrieve(\"\", filename=\"style.jpg\") " 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "Find a description (in the form of python code) of the loaded model at\n", 66 | "https://github.com/chiphuyen/stanford-tensorflow-tutorials/blob/master/assignments/style_transfer/vgg_model.py" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "original_layers = loadmat(fname)[\"layers\"][0]\n", 76 | "original_layers.shape" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "The downloaded file contains the VGG-19 model, consisting of 43 trained layers. 
You can access the necessary information as follows:\n", 84 | "* The name of layer `i` can be found by accessing `original_layers[i][0][0][0][0]`\n", 85 | "* The layer weight matrix can be found by accessing `original_layers[i][0][0][2][0][0]`\n", 86 | "* The bias can be found by accessing `original_layers[i][0][0][2][0][1]`" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "def get_layer_name(i):\n", 96 | " return original_layers[i][0][0][0][0]\n", 97 | "\n", 98 | "def get_layer_weights(i):\n", 99 | " return original_layers[i][0][0][2][0][0]\n", 100 | "\n", 101 | "def get_layer_bias(i):\n", 102 | " return original_layers[i][0][0][2][0][1]\n", 103 | " \n", 104 | "def get_layer_params(i):\n", 105 | " return (get_layer_weights(i), get_layer_bias(i))\n", 106 | "\n", 107 | "layer_names = [get_layer_name(i) for i in range(len(original_layers))]\n", 108 | "\n", 109 | "def get_layer_by_name(name):\n", 110 | " return layer_names.index(name)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "Let's get an intuition of what the VGG-19 network looks like:" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "print(\", \".join(layer_names))\n", 127 | "conv_layers = [ln for ln in layer_names if ln.startswith(\"conv\")]\n", 128 | "pool_layers = [ln for ln in layer_names if ln.startswith(\"pool\")]\n", 129 | "\n", 130 | "print(conv_layers)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "The network consists of a series of convolutional layers making sense of the input content; finally, the network makes a decision about what is being depicted in the input image by moving the content through some fully connected layers followed by a softmax function.\n", 138 | "\n", 139 | "We now have an understanding of what the VGG-19 network consists of, and we are also able to access its weights and biases. What we need to do next is move the model to tensorflow. We only need to rebuild those parts that we're going to use later on--that is, we can ignore the fully connected layers, as they are only needed for making guesses about the kind of object depicted in the input image. Everything that's really interesting happens in the convolutional layers.\n", 140 | "\n", 141 | "This means: We are not trying to rebuild the complete model. We will take the convolutional layers that are needed to reason about the image style and image content, and we will only work with those layers." 
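As a quick check that the indexing above is right, the helpers can be used to inspect a single layer -- a small sketch (the exact shapes depend on the downloaded _.mat_ file; for `conv1_1` I would expect 3 x 3 kernels over 3 input channels and 64 filters):

```python
# Peek at one layer to verify the nested indexing (illustrative only).
idx = get_layer_by_name('conv1_1')
W, b = get_layer_params(idx)
print(get_layer_name(idx), W.shape, b.shape)  # expected: conv1_1, roughly (3, 3, 3, 64) and a 64-entry bias
```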
142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "def create_activated_convlayer(prev, i):\n", 151 | " layer_index = get_layer_by_name(conv_layers[i])\n", 152 | " W, b = get_layer_params(layer_index)\n", 153 | " W = tf.constant(W)\n", 154 | " b = tf.constant(np.reshape(b, (b.size)))\n", 155 | " conv = tf.nn.conv2d(prev, filter=W, strides=[1,1,1,1], padding='SAME') + b\n", 156 | " return tf.nn.relu(conv)\n", 157 | "\n", 158 | "def create_pool_layer(prev):\n", 159 | " return tf.nn.avg_pool(prev, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')\n", 160 | "\n", 161 | "def get_next_convlayer(i, prev):\n", 162 | " next_i = i + 1\n", 163 | " return (next_i, create_activated_convlayer(prev, i))\n", 164 | "\n", 165 | "def get_next_convlayer_name(i):\n", 166 | " return conv_layers[i]\n", 167 | "\n", 168 | "def get_last_convlayer_name(i):\n", 169 | " return conv_layers[i - 1]" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "_\"For image synthesis we found that replacing the\n", 177 | "max-pooling operation by average pooling improves the gradient flow and one obtains slightly\n", 178 | "more appealing results, which is why the images shown were generated with average pooling.\"_\n", 179 | "\n", 180 | "Let's keep this in mind and rebuild the model parts that we're going to need." 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "tf.reset_default_graph()\n", 190 | "\n", 191 | "model = {}\n", 192 | "\n", 193 | "content = plt.imread(\"content.jpg\") \n", 194 | "style = plt.imread(\"style.jpg\") \n", 195 | "\n", 196 | "\n", 197 | "scale_down = 5\n", 198 | "\n", 199 | "\n", 200 | "height = content.shape[0] // scale_down\n", 201 | "width = content.shape[1] // scale_down\n", 202 | "index = 0\n", 203 | "\n", 204 | "model['in'] = tf.Variable(np.zeros((1, height, width, 3)), dtype=tf.float32)\n", 205 | "index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model['in'])\n", 206 | "index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model[get_last_convlayer_name(index)])\n", 207 | "model['avgpool1'] = create_pool_layer(model[get_last_convlayer_name(index)])\n", 208 | "index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model['avgpool1'])\n", 209 | "index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model[get_last_convlayer_name(index)])\n", 210 | "model['avgpool2'] = create_pool_layer(model[get_last_convlayer_name(index)])\n", 211 | "index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model['avgpool2'])\n", 212 | "for i in range(3):\n", 213 | " index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model[get_last_convlayer_name(index)])\n", 214 | "model['avgpool3'] = create_pool_layer(model[get_last_convlayer_name(index)])\n", 215 | "index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model['avgpool3'])\n", 216 | "for i in range(3):\n", 217 | " index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model[get_last_convlayer_name(index)])\n", 218 | "model['avgpool4'] = create_pool_layer(model[get_last_convlayer_name(index)])\n", 219 | "index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model['avgpool4'])\n", 220 | "for i in range(3):\n", 
221 | " index, model[get_next_convlayer_name(index - 1)] = get_next_convlayer(index, model[get_last_convlayer_name(index)])\n", 222 | "model['avgpool5'] = create_pool_layer(model[get_last_convlayer_name(index)])" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "model" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "For working with input images, the image's mean accross each color channel has to be subtracted, so our images have the same format like those images the VGG-19 model was trained with." 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "#mr = np.mean(content[:,:,0])\n", 248 | "#mg = np.mean(content[:,:,1])\n", 249 | "#mb = np.mean(content[:,:,2])\n", 250 | "# --- (Not sure whether to use the fixed numbers or the calculated means actually) ---\n", 251 | "#means = np.reshape([mr, mg, mb], (1,1,3))\n", 252 | "means = np.reshape([116.779, 123.68, 103.939], (1,1,3))\n", 253 | "\n", 254 | "def preprocess_image(img_in):\n", 255 | " img = img_in.astype(\"float32\")\n", 256 | " img = imresize(img, (height, width))\n", 257 | " \n", 258 | " img = img - means\n", 259 | " img = img[np.newaxis]\n", 260 | " return means, img\n", 261 | " \n", 262 | "def unprocess_image(img_in):\n", 263 | " img = img_in\n", 264 | " img = img[0]\n", 265 | " img = img + means\n", 266 | " img = np.clip(img, 0, 255).astype('uint8')\n", 267 | " return img\n", 268 | " \n", 269 | " \n", 270 | "means_c, processed_content = preprocess_image(content)\n", 271 | "means_s, processed_style = preprocess_image(style)\n", 272 | "unprocessed = unprocess_image(processed_content)\n", 273 | "unprocessed_style = unprocess_image(processed_style)\n", 274 | "\n", 275 | "plt.figure(figsize=(10, 10))\n", 276 | "plt.axis('off')\n", 277 | "plt.imshow(unprocessed.astype('uint8'))\n", 278 | "plt.show()\n", 279 | "plt.figure(figsize=(10, 10))\n", 280 | "plt.axis('off')\n", 281 | "plt.imshow(unprocessed_style.astype('uint8'))\n", 282 | "plt.show()" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "_\"For the images […] we matched the content representation on layer ‘conv4 2’ and the\n", 290 | "style representations on layers ‘conv1 1’, ‘conv2 1’, ‘conv3 1’, ‘conv4 1’ and ‘conv5 1’\"_\n", 291 | "\n", 292 | "\n", 293 | "\n", 294 | "Higher convolutional layers are good at capturing the overall content of an image. The lowest layers catch simple features (like horizontal and vertical edges or curves), whereas the layers in between capture features of medium complexity (like noses and eyes or more complex style patterns). Hence, we can take a high-level layer and see how it reacts when receiving the content image. The activation values can be interpreted as a representation of the content we would like to preserve. To get a representation of the style we would like to apply, we look how different layers react when receiving the style image as an input. 
\n", 295 | "\n", 296 | "\n", 297 | "\n", 298 | "_You may want to play around with the layers and weights; different choices will lead to differing results._" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "content_layer = 'conv4_2'\n", 308 | "style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']\n", 309 | "style_weights = [.8, .8, .8, .8, .8]\n", 310 | "\n", 311 | "\n", 312 | "sess = tf.InteractiveSession()\n", 313 | "sess.run(tf.global_variables_initializer())\n", 314 | "\n", 315 | "content_features = sess.run(model[content_layer], feed_dict={model['in']: processed_content})\n", 316 | "style_features = sess.run([model[sl] for sl in style_layers], feed_dict={model['in']: processed_style})" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "For the content loss, we will basically compare the current activations of the content layer with the saved values it produced when receiving the actual content image. The style loss is a bit more complicated and involves the calculation of _gram matrices_. Take a look at the original paper if you are more interested in this topic.\n", 324 | "\n", 325 | "https://arxiv.org/pdf/1508.06576.pdf\n", 326 | "\n", 327 | "We also calculate the _variation loss_. This will be added to the total loss and make sure that neighboring pixels are relatively similar, so as to avoid clutter." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "def content_loss(p):\n", 337 | " size = np.prod(content_features.shape[1:])\n", 338 | " return (1 / (2.0 * size)) * tf.reduce_sum(tf.pow((p - content_features), 2)) \n", 339 | " \n", 340 | "\n", 341 | "def gram_matrix(features, n, m):\n", 342 | " features_t = tf.reshape(f, (m, n))\n", 343 | " return tf.matmul(tf.transpose(features_t), features_t)\n", 344 | "\n", 345 | "def style_loss(a, x):\n", 346 | " n = a.shape[3]\n", 347 | " m = a.shape[1] * a.shape[2]\n", 348 | " a_matrix = gram_matrix(a, n, m)\n", 349 | " g_matrix = gram_matrix(x, n, m)\n", 350 | " return (1 / (4 * n**2 * m**2)) * tf.reduce_sum(tf.pow(g_matrix - a_matrix, 2))\n", 351 | "\n", 352 | "def var_loss(x):\n", 353 | " h, w = x.get_shape().as_list()[1], x.get_shape().as_list()[2]\n", 354 | " dx = tf.square(x[:, :h - 1, :w - 1, :] - x[:, :h - 1, 1:, :])\n", 355 | " dy = tf.square(x[:, :h - 1, :w - 1, :] - x[:, 1:, :w - 1, :])\n", 356 | " return tf.reduce_sum(tf.pow(dx + dy, 1.25))" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "The $\\alpha$-, $\\beta$- and $\\gamma$- values are used to balance the style, content and variation loss. If your results are not satisfying, these are some values you might want to adjust." 
364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "e = [style_loss(sf, model[ln]) for sf, ln in zip(style_features, style_layers)]\n", 373 | "styleloss = sum([style_weights[l] * e[l] for l in range(len(style_layers))])\n", 374 | "contentloss = content_loss(model[content_layer])\n", 375 | "varloss = var_loss(model['in'])\n", 376 | "\n", 377 | "alpha = 1\n", 378 | "beta = 100\n", 379 | "gamma = 0.1\n", 380 | "\n", 381 | "total_loss = alpha * contentloss + beta * styleloss + gamma * varloss" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "The authors of the original paper proposed to assign white noise to the input of our network and let the network transform this noise into a combined representation of the desired content and style. However, we can make the task of restoring the image content easier by inputting a noisy representation of our content image to the network. " 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "noise_ratio = 0.7\n", 398 | "content_ratio = 1. - noise_ratio\n", 399 | "noise = np.random.uniform(-15, 15, processed_content.shape)\n", 400 | "input_image = (processed_content * content_ratio) + noise_ratio * noise\n", 401 | "\n", 402 | "unp = unprocess_image(input_image)\n", 403 | "plt.imshow(unp.astype(\"uint8\"))\n", 404 | "plt.show()" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "It was possible to train the network using a large learning rate, but you might want to adjust these settings." 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "optimizer = tf.train.AdamOptimizer(1).minimize(total_loss)\n", 421 | "sess = tf.InteractiveSession()\n", 422 | "sess.run(tf.global_variables_initializer())\n", 423 | "sess.run(model['in'].assign(input_image))\n", 424 | "pass" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "## Training the Network\n", 432 | "\n", 433 | "It takes about 100 iterations until the created image will look somewhat like a merge of the content image with the desired style. Then, the output slowly becomes more visually appealing. I stopped training before the loss had converged, as the outputs mostly looked neat enough much earlier and the training would otherwise take quite long. I run the training steps on a Tesla K80 GPU for about 15-20 minutes for most content/style combinations until I was satisfied with the results." 
434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "for i in range(1000):\n", 443 | " if i % 50 == 0:\n", 444 | " m_in = sess.run(model['in'])\n", 445 | " plt.imshow(unprocess_image(m_in).astype(\"uint8\"))\n", 446 | " plt.show()\n", 447 | " \n", 448 | " _, ls = sess.run([optimizer, total_loss])\n", 449 | " \n", 450 | " if i % 10 == 0: \n", 451 | " print(i, ls)" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [] 460 | } 461 | ], 462 | "metadata": { 463 | "kernelspec": { 464 | "display_name": "Python 3", 465 | "language": "python", 466 | "name": "python3" 467 | }, 468 | "language_info": { 469 | "codemirror_mode": { 470 | "name": "ipython", 471 | "version": 3 472 | }, 473 | "file_extension": ".py", 474 | "mimetype": "text/x-python", 475 | "name": "python", 476 | "nbconvert_exporter": "python", 477 | "pygments_lexer": "ipython3", 478 | "version": "3.6.4" 479 | } 480 | }, 481 | "nbformat": 4, 482 | "nbformat_minor": 2 483 | } 484 | --------------------------------------------------------------------------------