├── README.md ├── assets ├── background-card-chip.jpg ├── blur-blurry-close-up-167259.jpg ├── echo_logo.png ├── latex_brelu.png ├── latex_silu.png ├── latex_softexp.png ├── plot_brelu.png ├── plot_silu.png └── plot_softexp.pnd.png ├── custom_activations_example.py ├── extending-pytorch-with-custom-activation-functions.ipynb ├── gain └── activation_gain.py ├── in-place-operations-in-pytorch.ipynb └── landscapes ├── cifar-10-shake-shake-landscapes.ipynb ├── create_landscape.py ├── imrelu.png ├── imsmish.png └── imswish.png /README.md: -------------------------------------------------------------------------------- 1 | ![image](https://github.com/Lexie88rus/Activation-functions-examples-pytorch/raw/master/assets/blur-blurry-close-up-167259.jpg) 2 | 3 | # Custom Activation Functions Examples for PyTorch 4 | Repository containing the article with examples of custom activation functions for Pytorch and scripts used in the article. 5 | See the [article on Medium](https://towardsdatascience.com/extending-pytorch-with-custom-activation-functions-2d8b065ef2fa) and a [kernel on Kaggle](https://www.kaggle.com/aleksandradeis/extending-pytorch-with-custom-activation-functions). 6 | 7 | See also the [article about the in-place activations in PyTorch](https://medium.com/p/in-place-operations-in-pytorch-f91d493e970e?source=email-9f0981e41a86--writer.postDistributed&sk=d44e1786ba9cadee76dcdf87e150a5af). 8 | 9 | ## Introduction 10 | Today deep learning is going viral and is applied to a variety of machine learning problems such as image recognition, speech recognition, machine translation, and others. There is a wide range of highly customizable neural network architectures, which can suit almost any problem when given enough data. Each neural network should be elaborated to suit the given problem well enough. 
You have to fine-tune the hyperparameters of the network (the learning rate, dropout coefficients, weight decay, and many others) as well as the number of hidden layers and the number of units in each layer. __Choosing the right activation function for each layer is also crucial and may have a significant impact on metric scores and the training speed of the model.__ 11 | 12 | ## References 13 | * My original [article on Medium](https://towardsdatascience.com/extending-pytorch-with-custom-activation-functions-2d8b065ef2fa) 14 | * My related [kernel on Kaggle](https://www.kaggle.com/aleksandradeis/extending-pytorch-with-custom-activation-functions) 15 | * [Activation functions wiki](https://en.wikipedia.org/wiki/Activation_function) 16 | * [Tutorial](https://pytorch.org/docs/master/notes/extending.html) on extending PyTorch. 17 | * My [article on Medium](https://medium.com/p/in-place-operations-in-pytorch-f91d493e970e?source=email-9f0981e41a86--writer.postDistributed&sk=d44e1786ba9cadee76dcdf87e150a5af) explaining in-place operations in PyTorch.
18 | -------------------------------------------------------------------------------- /assets/background-card-chip.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/background-card-chip.jpg -------------------------------------------------------------------------------- /assets/blur-blurry-close-up-167259.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/blur-blurry-close-up-167259.jpg -------------------------------------------------------------------------------- /assets/echo_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/echo_logo.png -------------------------------------------------------------------------------- /assets/latex_brelu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/latex_brelu.png -------------------------------------------------------------------------------- /assets/latex_silu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/latex_silu.png -------------------------------------------------------------------------------- /assets/latex_softexp.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/latex_softexp.png -------------------------------------------------------------------------------- /assets/plot_brelu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/plot_brelu.png -------------------------------------------------------------------------------- /assets/plot_silu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/plot_silu.png -------------------------------------------------------------------------------- /assets/plot_softexp.pnd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/assets/plot_softexp.pnd.png -------------------------------------------------------------------------------- /custom_activations_example.py: -------------------------------------------------------------------------------- 1 | # Imports 2 | 3 | # Import basic libraries 4 | import numpy as np # linear algebra 5 | import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv) 6 | from collections import OrderedDict 7 | 8 | # Import PyTorch 9 | import torch # import main library 10 | from torch.autograd import Variable 11 | import torch.nn as nn # import modules 12 | from torch.autograd import Function # import Function to create custom activations 13 | from torch.nn.parameter import Parameter # import Parameter to create custom activations with learnable parameters 14 | from torch import optim # import optimizers for demonstrations 15 | import torch.nn.functional as F # import torch functions 16 | from torchvision import datasets, transforms # import transformations to use for demo 17 | 18 | # helper function to train a model 19 | def train_model(model, trainloader): 20 | ''' 21 | Function trains the model and prints out the training log. 22 | INPUT: 23 | model - initialized PyTorch model ready for training. 24 | trainloader - PyTorch dataloader for training data. 25 | ''' 26 | #setup training 27 | 28 | #define loss function 29 | criterion = nn.NLLLoss() 30 | #define learning rate 31 | learning_rate = 0.003 32 | #define number of epochs 33 | epochs = 5 34 | #initialize optimizer 35 | optimizer = optim.Adam(model.parameters(), lr=learning_rate) 36 | 37 | #run training and print out the loss to make sure that we are actually fitting to the training set 38 | print('Training the model. 
Make sure that loss decreases after each epoch.\n') 39 | for e in range(epochs): 40 | running_loss = 0 41 | for images, labels in trainloader: 42 | images = images.view(images.shape[0], -1) 43 | log_ps = model(images) 44 | loss = criterion(log_ps, labels) 45 | 46 | optimizer.zero_grad() 47 | loss.backward() 48 | optimizer.step() 49 | 50 | running_loss += loss.item() 51 | else: 52 | # print out the loss to make sure it is decreasing 53 | print(f"Training loss: {running_loss}") 54 | 55 | # simply define a silu function 56 | def silu(input): 57 | ''' 58 | Applies the Sigmoid Linear Unit (SiLU) function element-wise: 59 | 60 | SiLU(x) = x * sigmoid(x) 61 | ''' 62 | return input * torch.sigmoid(input) # use torch.sigmoid so that the implementation relies on efficient built-in PyTorch functions 63 | 64 | # create a class wrapper from PyTorch nn.Module, so 65 | # the function now can be easily used in models 66 | class SiLU(nn.Module): 67 | ''' 68 | Applies the Sigmoid Linear Unit (SiLU) function element-wise: 69 | 70 | SiLU(x) = x * sigmoid(x) 71 | 72 | Shape: 73 | - Input: (N, *) where * means any number of additional 74 | dimensions 75 | - Output: (N, *), same shape as the input 76 | 77 | References: 78 | - Related paper: 79 | https://arxiv.org/pdf/1606.08415.pdf 80 | 81 | Examples: 82 | >>> m = SiLU() 83 | >>> input = torch.randn(2) 84 | >>> output = m(input) 85 | 86 | ''' 87 | def __init__(self): 88 | ''' 89 | Init method. 90 | ''' 91 | super().__init__() # init the base class 92 | 93 | def forward(self, input): 94 | ''' 95 | Forward pass of the function. 
96 | ''' 97 | return silu(input) # simply apply already implemented SiLU 98 | 99 | # create class for basic fully-connected deep neural network 100 | class ClassifierSiLU(nn.Module): 101 | ''' 102 | Demo classifier model class to demonstrate SiLU 103 | ''' 104 | def __init__(self): 105 | super().__init__() 106 | 107 | # initialize layers 108 | self.fc1 = nn.Linear(784, 256) 109 | self.fc2 = nn.Linear(256, 128) 110 | self.fc3 = nn.Linear(128, 64) 111 | self.fc4 = nn.Linear(64, 10) 112 | 113 | def forward(self, x): 114 | # make sure the input tensor is flattened 115 | x = x.view(x.shape[0], -1) 116 | 117 | # apply silu function 118 | x = silu(self.fc1(x)) 119 | 120 | # apply silu function 121 | x = silu(self.fc2(x)) 122 | 123 | # apply silu function 124 | x = silu(self.fc3(x)) 125 | 126 | x = F.log_softmax(self.fc4(x), dim=1) 127 | 128 | return x 129 | 130 | # Implementation of Soft Exponential activation function 131 | class soft_exponential(nn.Module): 132 | ''' 133 | Implementation of soft exponential activation. 134 | 135 | Shape: 136 | - Input: (N, *) where * means, any number of additional 137 | dimensions 138 | - Output: (N, *), same shape as the input 139 | 140 | Parameters: 141 | - alpha - trainable parameter 142 | 143 | References: 144 | - See related paper: 145 | https://arxiv.org/pdf/1602.01321.pdf 146 | 147 | Examples: 148 | >>> a1 = soft_exponential(256) 149 | >>> x = torch.randn(256) 150 | >>> x = a1(x) 151 | ''' 152 | def __init__(self, in_features, alpha = None): 153 | ''' 154 | Initialization. 
155 | INPUT: 156 | - in_features: shape of the input 157 | - alpha: trainable parameter 158 | alpha is initialized with zero value by default 159 | ''' 160 | super(soft_exponential,self).__init__() 161 | self.in_features = in_features 162 | 163 | # initialize alpha 164 | if alpha is None: 165 | self.alpha = Parameter(torch.tensor(0.0)) # create a tensor out of alpha 166 | else: 167 | self.alpha = Parameter(torch.tensor(alpha)) # create a tensor out of alpha 168 | 169 | self.alpha.requires_grad = True # make sure alpha is trainable (Parameter sets requires_grad=True by default) 170 | 171 | def forward(self, x): 172 | ''' 173 | Forward pass of the function. 174 | Applies the function to the input elementwise. 175 | ''' 176 | if (self.alpha == 0.0): 177 | return x 178 | 179 | if (self.alpha < 0.0): 180 | return - torch.log(1 - self.alpha * (x + self.alpha)) / self.alpha 181 | 182 | if (self.alpha > 0.0): 183 | return (torch.exp(self.alpha * x) - 1)/ self.alpha + self.alpha 184 | 185 | # create class for basic fully-connected deep neural network 186 | class ClassifierSExp(nn.Module): 187 | ''' 188 | Basic fully-connected network to test Soft Exponential activation. 
189 | ''' 190 | def __init__(self): 191 | super().__init__() 192 | 193 | # initialize layers 194 | self.fc1 = nn.Linear(784, 256) 195 | self.fc2 = nn.Linear(256, 128) 196 | self.fc3 = nn.Linear(128, 64) 197 | self.fc4 = nn.Linear(64, 10) 198 | 199 | # initialize Soft Exponential activation 200 | self.a1 = soft_exponential(256) 201 | self.a2 = soft_exponential(128) 202 | self.a3 = soft_exponential(64) 203 | 204 | def forward(self, x): 205 | # make sure the input tensor is flattened 206 | x = x.view(x.shape[0], -1) 207 | 208 | # apply Soft Exponential unit 209 | x = self.a1(self.fc1(x)) 210 | x = self.a2(self.fc2(x)) 211 | x = self.a3(self.fc3(x)) 212 | x = F.log_softmax(self.fc4(x), dim=1) 213 | 214 | return x 215 | 216 | # Implementation of BReLU activation function with custom backward step 217 | class brelu(Function): 218 | ''' 219 | Implementation of BReLU activation function. 220 | 221 | Shape: 222 | - Input: (N, *) where * means, any number of additional 223 | dimensions 224 | - Output: (N, *), same shape as the input 225 | 226 | References: 227 | - See BReLU paper: 228 | https://arxiv.org/pdf/1709.04054.pdf 229 | 230 | Examples: 231 | >>> brelu_activation = brelu.apply 232 | >>> t = torch.randn((5,5), dtype=torch.float, requires_grad = True) 233 | >>> t = brelu_activation(t) 234 | ''' 235 | #both forward and backward are @staticmethods 236 | @staticmethod 237 | def forward(ctx, input): 238 | """ 239 | In the forward pass we receive a Tensor containing the input and return 240 | a Tensor containing the output. ctx is a context object that can be used 241 | to stash information for backward computation. You can cache arbitrary 242 | objects for use in the backward pass using the ctx.save_for_backward method. 
243 | """ 244 | ctx.save_for_backward(input) # save input for backward pass 245 | 246 | # get lists of odd and even indices 247 | input_shape = input.shape[0] 248 | even_indices = [i for i in range(0, input_shape, 2)] 249 | odd_indices = [i for i in range(1, input_shape, 2)] 250 | 251 | # clone the input tensor 252 | output = input.clone() 253 | 254 | # apply ReLU to elements where i mod 2 == 0 255 | output[even_indices] = output[even_indices].clamp(min=0) 256 | 257 | # apply inversed ReLU to inversed elements where i mod 2 != 0 258 | output[odd_indices] = 0 - output[odd_indices] # reverse elements with odd indices 259 | output[odd_indices] = - output[odd_indices].clamp(min = 0) # apply reversed ReLU 260 | 261 | return output 262 | 263 | @staticmethod 264 | def backward(ctx, grad_output): 265 | """ 266 | In the backward pass we receive a Tensor containing the gradient of the loss 267 | with respect to the output, and we need to compute the gradient of the loss 268 | with respect to the input. 
269 | """ 270 | grad_input = None # set output to None 271 | 272 | input, = ctx.saved_tensors # restore input from context 273 | 274 | # check that input requires grad 275 | # if not requires grad we will return None to speed up computation 276 | if ctx.needs_input_grad[0]: 277 | grad_input = grad_output.clone() 278 | 279 | # get lists of odd and even indices 280 | input_shape = input.shape[0] 281 | even_indices = [i for i in range(0, input_shape, 2)] 282 | odd_indices = [i for i in range(1, input_shape, 2)] 283 | 284 | # set grad_input for even_indices 285 | grad_input[even_indices] = (input[even_indices] >= 0).float() * grad_input[even_indices] 286 | 287 | # set grad_input for odd_indices 288 | grad_input[odd_indices] = (input[odd_indices] < 0).float() * grad_input[odd_indices] 289 | 290 | return grad_input 291 | 292 | # Simple model to demonstrate BReLU 293 | class ClassifierBReLU(nn.Module): 294 | ''' 295 | Simple fully-connected classifier model to demonstrate BReLU activation. 296 | ''' 297 | def __init__(self): 298 | super(ClassifierBReLU, self).__init__() 299 | 300 | # initialize layers 301 | self.fc1 = nn.Linear(784, 256) 302 | self.fc2 = nn.Linear(256, 128) 303 | self.fc3 = nn.Linear(128, 64) 304 | self.fc4 = nn.Linear(64, 10) 305 | 306 | # create shortcuts for BReLU 307 | self.a1 = brelu.apply 308 | self.a2 = brelu.apply 309 | self.a3 = brelu.apply 310 | 311 | def forward(self, x): 312 | # make sure the input tensor is flattened 313 | x = x.view(x.shape[0], -1) 314 | 315 | # apply BReLU 316 | x = self.a1(self.fc1(x)) 317 | x = self.a2(self.fc2(x)) 318 | x = self.a3(self.fc3(x)) 319 | x = F.log_softmax(self.fc4(x), dim=1) 320 | 321 | return x 322 | 323 | def main(): 324 | print('Loading the Fashion MNIST dataset.\n') 325 | 326 | # Define a transform 327 | transform = transforms.Compose([transforms.ToTensor()]) 328 | 329 | # Download and load the training data for Fashion MNIST 330 | trainset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', 
download=True, train=True, transform=transform) 331 | trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True) 332 | 333 | # 1. SiLU demonstration with model created with Sequential 334 | # use SiLU with model created with Sequential 335 | 336 | # initialize activation function 337 | activation_function = SiLU() 338 | 339 | # Initialize the model using nn.Sequential 340 | model = nn.Sequential(OrderedDict([ 341 | ('fc1', nn.Linear(784, 256)), 342 | ('activation1', activation_function), # use SiLU 343 | ('fc2', nn.Linear(256, 128)), 344 | ('bn2', nn.BatchNorm1d(num_features=128)), 345 | ('activation2', activation_function), # use SiLU 346 | ('dropout', nn.Dropout(0.3)), 347 | ('fc3', nn.Linear(128, 64)), 348 | ('bn3', nn.BatchNorm1d(num_features=64)), 349 | ('activation3', activation_function), # use SiLU 350 | ('logits', nn.Linear(64, 10)), 351 | ('logsoftmax', nn.LogSoftmax(dim=1))])) 352 | 353 | # Run training 354 | print('Training model with SiLU activation.\n') 355 | train_model(model, trainloader) 356 | 357 | # 2. SiLU demonstration of a model defined as an nn.Module 358 | 359 | # Create demo model 360 | model = ClassifierSiLU() 361 | 362 | # Run training 363 | print('Training model with SiLU activation.\n') 364 | train_model(model, trainloader) 365 | 366 | # 3. Soft Exponential function demonstration (with trainable parameter alpha) 367 | 368 | # Create demo model 369 | model = ClassifierSExp() 370 | 371 | # Run training 372 | print('Training model with Soft Exponential activation.\n') 373 | train_model(model, trainloader) 374 | 375 | # 4. 
BReLU activation function demonstration (with custom backward step) 376 | 377 | # Create demo model 378 | model = ClassifierBReLU() 379 | 380 | # Run training 381 | print('Training model with BReLU activation.\n') 382 | train_model(model, trainloader) 383 | 384 | if __name__ == '__main__': 385 | main() 386 | -------------------------------------------------------------------------------- /extending-pytorch-with-custom-activation-functions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 8 | "_kg_hide-input": true, 9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" 10 | }, 11 | "outputs": [], 12 | "source": [ 13 | "# Imports\n", 14 | "\n", 15 | "# Import basic libraries\n", 16 | "import numpy as np # linear algebra\n", 17 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 18 | "%matplotlib inline\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "plt.style.use('seaborn-whitegrid')\n", 21 | "from collections import OrderedDict\n", 22 | "from PIL import Image\n", 23 | "\n", 24 | "# Import PyTorch\n", 25 | "import torch # import main library\n", 26 | "from torch.autograd import Variable\n", 27 | "import torch.nn as nn # import modules\n", 28 | "from torch.autograd import Function # import Function to create custom activations\n", 29 | "from torch.nn.parameter import Parameter # import Parameter to create custom activations with learnable parameters\n", 30 | "from torch import optim # import optimizers for demonstrations\n", 31 | "import torch.nn.functional as F # import torch functions\n", 32 | "from torchvision import transforms # import transformations to use for demo\n", 33 | "from torch.utils.data import Dataset, DataLoader " 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | 
"![image](https://github.com/Lexie88rus/Activation-functions-examples-pytorch/raw/master/assets/blur-blurry-close-up-167259.jpg)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": { 46 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 47 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", 48 | "collapsed": true 49 | }, 50 | "source": [ 51 | "# Extending PyTorch with Custom Activation Functions" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "## Introduction\n", 59 | "Today deep learning is viral and applied to a variety of machine learning problems such as image recognition, speech recognition, machine translation, etc. There is a wide range of highly customizable neural network architectures, which can suit almost any problem when given enough data. Each neural network should be customized to suit the given problem well enough. You have to fine tune the hyperparameters for the network for each task (the learning rate, dropout coefficients, weight decay, etc.) as well as number of hidden layers, number of units in layers. __Choosing the right activation function for each layer is also crucial and may have a significant impact on learning speed.__" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "## Activation Functions\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "The [activation function](https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/) is an essential building block for every neural network. 
We can choose from a huge list of popular activation functions, which are already implemented in Deep Learning frameworks, like [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), [Sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function), [Tanh](https://en.wikipedia.org/wiki/Hyperbolic_function) and many others.\n", 74 | "\n", 75 | "But to create a state-of-the-art model, customized particularly for your task, you may need to use a custom activation function, which is not yet implemented in the Deep Learning framework you are using. Activation functions can be roughly classified into the following groups by complexity:\n", 76 | "\n", 77 | "1. Simple activation functions like [SiLU](https://arxiv.org/pdf/1606.08415.pdf), [Inverse square root unit (ISRU)](https://arxiv.org/pdf/1710.09967.pdf). These functions can be easily implemented in any Deep Learning framework.\n", 78 | "2. Activation functions with __trainable parameters__ like [SoftExponential](https://arxiv.org/pdf/1602.01321.pdf) or [S-shaped rectified linear activation unit (SReLU)](https://arxiv.org/pdf/1512.07030.pdf). \n", 79 | "3. Activation functions, which are not differentiable at some points and require __custom implementation of backward step__, for example [Bipolar rectified linear unit (BReLU)](https://arxiv.org/pdf/1709.04054.pdf).\n", 80 | "\n", 81 | "In this kernel I will try to cover implementation and demo examples for all of these types of functions using the [Fashion MNIST dataset](https://www.kaggle.com/zalando-research/fashionmnist)." 
82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "## Setting Up The Demo\n", 89 | "In this section I will prepare everything for the demonstration:\n", 90 | "* Load Fashion MNIST dataset from PyTorch,\n", 91 | "* Introduce transformations for Fashion MNIST images using PyTorch,\n", 92 | "* Prepare model training procedure.\n", 93 | "\n", 94 | "If you are familiar with PyTorch basics, just skip this part and go straight to implementation of the activation functions." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### Introduce Transformations" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "The most efficient way to transform the input data is to use built-in PyTorch transformations:" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 2, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Define a transform\n", 118 | "transform = transforms.Compose([transforms.ToTensor()])" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### Load the Data" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "To load the data I used standard Dataset and DataLoader classes from PyTorch and [FashionMNIST class code from this kernel](https://www.kaggle.com/arturlacerda/pytorch-conditional-gan):" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 3, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "class FashionMNIST(Dataset):\n", 142 | " '''\n", 143 | " Dataset class to load Fashion MNIST data from csv.\n", 144 | " Code from original kernel:\n", 145 | " https://www.kaggle.com/arturlacerda/pytorch-conditional-gan\n", 146 | " '''\n", 147 | " def __init__(self, transform=None):\n", 148 | " self.transform = transform\n", 149 | " fashion_df = 
pd.read_csv('../input/fashion-mnist_train.csv')\n", 150 | " self.labels = fashion_df.label.values\n", 151 | " self.images = fashion_df.iloc[:, 1:].values.astype('uint8').reshape(-1, 28, 28)\n", 152 | "\n", 153 | " def __len__(self):\n", 154 | " return len(self.images)\n", 155 | "\n", 156 | " def __getitem__(self, idx):\n", 157 | " label = self.labels[idx]\n", 158 | " img = Image.fromarray(self.images[idx])\n", 159 | " \n", 160 | " if self.transform:\n", 161 | " img = self.transform(img)\n", 162 | "\n", 163 | " return img, label\n", 164 | "\n", 165 | "# Load the training data for Fashion MNIST\n", 166 | "trainset = FashionMNIST(transform=transform)\n", 167 | "# Define the dataloader\n", 168 | "trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "### Setup Training Procedure\n", 176 | "I wrote a small training procedure, which runs 5 training epochs and prints the loss for each epoch, so we make sure that we are fitting the training set:" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 4, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "def train_model(model):\n", 186 | " '''\n", 187 | " Function trains the model and prints out the training log.\n", 188 | " '''\n", 189 | " #setup training\n", 190 | " \n", 191 | " #define loss function\n", 192 | " criterion = nn.NLLLoss()\n", 193 | " #define learning rate\n", 194 | " learning_rate = 0.003\n", 195 | " #define number of epochs\n", 196 | " epochs = 5\n", 197 | " #initialize optimizer\n", 198 | " optimizer = optim.Adam(model.parameters(), lr=learning_rate)\n", 199 | "\n", 200 | " #run training and print out the loss to make sure that we are actually fitting to the training set\n", 201 | " print('Training the model. 
Make sure that loss decreases after each epoch.\\n')\n", 202 | " for e in range(epochs):\n", 203 | " running_loss = 0\n", 204 | " for images, labels in trainloader:\n", 205 | " images = images.view(images.shape[0], -1)\n", 206 | " log_ps = model(images)\n", 207 | " loss = criterion(log_ps, labels)\n", 208 | "\n", 209 | " optimizer.zero_grad()\n", 210 | " loss.backward()\n", 211 | " optimizer.step()\n", 212 | "\n", 213 | " running_loss += loss.item()\n", 214 | " else:\n", 215 | " # print out the loss to make sure it is decreasing\n", 216 | " print(f\"Training loss: {running_loss}\")" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "## Implementing Simple Activation Functions\n", 224 | "The most simple activation functions\n", 225 | "* are differentiable and don't need the manual implementation of the backward step,\n", 226 | "* don't have any trainable parameters, all their parameters are set in advance.\n", 227 | "\n", 228 | "One of the examples of such simple functions is Sigmoid Linear Unit or just [SiLU](https://arxiv.org/pdf/1606.08415.pdf) also known as Swish-1:\n", 229 | "\n", 230 | "$$SiLU(x) = x * \\sigma(x) = x * \\frac{1}{1 + e^{-x}}$$\n", 231 | "\n", 232 | "Plot:" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 5, 238 | "metadata": { 239 | "_kg_hide-input": true 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "image/png": "\n", 245 | "text/plain": [ 246 | "
" 247 | ] 248 | }, 249 | "metadata": { 250 | "needs_background": "light" 251 | }, 252 | "output_type": "display_data" 253 | } 254 | ], 255 | "source": [ 256 | "def sigmoid(x):\n", 257 | " return 1 / (1 + np.exp(-x))\n", 258 | "\n", 259 | "fig = plt.figure(figsize=(10,7))\n", 260 | "ax = plt.axes()\n", 261 | "\n", 262 | "plt.title(\"SiLU\")\n", 263 | "\n", 264 | "x = np.linspace(-10, 10, 1000)\n", 265 | "ax.plot(x, x * sigmoid(x), '-g');" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "The implementation of SiLU:" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 6, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "# simply define a silu function\n", 282 | "def silu(input):\n", 283 | " '''\n", 284 | " Applies the Sigmoid Linear Unit (SiLU) function element-wise:\n", 285 | "\n", 286 | " SiLU(x) = x * sigmoid(x)\n", 287 | " '''\n", 288 | " return input * torch.sigmoid(input) # use torch.sigmoid to make sure that we created the most efficient implemetation based on builtin PyTorch functions\n", 289 | "\n", 290 | "# create a class wrapper from PyTorch nn.Module, so\n", 291 | "# the function now can be easily used in models\n", 292 | "class SiLU(nn.Module):\n", 293 | " '''\n", 294 | " Applies the Sigmoid Linear Unit (SiLU) function element-wise:\n", 295 | " \n", 296 | " SiLU(x) = x * sigmoid(x)\n", 297 | "\n", 298 | " Shape:\n", 299 | " - Input: (N, *) where * means, any number of additional\n", 300 | " dimensions\n", 301 | " - Output: (N, *), same shape as the input\n", 302 | "\n", 303 | " References:\n", 304 | " - Related paper:\n", 305 | " https://arxiv.org/pdf/1606.08415.pdf\n", 306 | "\n", 307 | " Examples:\n", 308 | " >>> m = silu()\n", 309 | " >>> input = torch.randn(2)\n", 310 | " >>> output = m(input)\n", 311 | "\n", 312 | " '''\n", 313 | " def __init__(self):\n", 314 | " '''\n", 315 | " Init method.\n", 316 | " '''\n", 317 | " super().__init__() # init the base 
class\n", 318 | "\n", 319 | " def forward(self, input):\n", 320 | " '''\n", 321 | " Forward pass of the function.\n", 322 | " '''\n", 323 | " return silu(input) # simply apply already implemented SiLU" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "Now it's time for a small demo _(don't forget to enable GPU in kernel settings to make training faster)_" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "Here is a small example of building a model with nn.Sequential and our custom SiLU class:" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 7, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | "Training the model. Make sure that loss decreases after each epoch.\n", 350 | "\n", 351 | "Training loss: 478.8589185029268\n", 352 | "Training loss: 350.0746013522148\n", 353 | "Training loss: 319.25493428111076\n", 354 | "Training loss: 294.4485657066107\n", 355 | "Training loss: 280.9470457062125\n" 356 | ] 357 | } 358 | ], 359 | "source": [ 360 | "# use SiLU with model created with Sequential\n", 361 | "\n", 362 | "# initialize activation function\n", 363 | "activation_function = SiLU()\n", 364 | "\n", 365 | "# Initialize the model using nn.Sequential\n", 366 | "model = nn.Sequential(OrderedDict([\n", 367 | " ('fc1', nn.Linear(784, 256)),\n", 368 | " ('activation1', activation_function), # use SiLU\n", 369 | " ('fc2', nn.Linear(256, 128)),\n", 370 | " ('bn2', nn.BatchNorm1d(num_features=128)),\n", 371 | " ('activation2', activation_function), # use SiLU\n", 372 | " ('dropout', nn.Dropout(0.3)),\n", 373 | " ('fc3', nn.Linear(128, 64)),\n", 374 | " ('bn3', nn.BatchNorm1d(num_features=64)),\n", 375 | " ('activation3', activation_function), # use SiLU\n", 376 | " ('logits', nn.Linear(64, 10)),\n", 377 | " ('logsoftmax', nn.LogSoftmax(dim=1))]))\n", 378 | "\n", 379 | "# 
Run training\n", 380 | "train_model(model)" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "We can also use silu function in model class as follows:" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 8, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "name": "stdout", 397 | "output_type": "stream", 398 | "text": [ 399 | "Training the model. Make sure that loss decreases after each epoch.\n", 400 | "\n", 401 | "Training loss: 466.8241208344698\n", 402 | "Training loss: 349.4934533312917\n", 403 | "Training loss: 313.43123760819435\n", 404 | "Training loss: 293.65521355718374\n", 405 | "Training loss: 279.42011239379644\n" 406 | ] 407 | } 408 | ], 409 | "source": [ 410 | "# create class for basic fully-connected deep neural network\n", 411 | "class ClassifierSiLU(nn.Module):\n", 412 | " '''\n", 413 | " Demo classifier model class to demonstrate SiLU\n", 414 | " '''\n", 415 | " def __init__(self):\n", 416 | " super().__init__()\n", 417 | "\n", 418 | " # initialize layers\n", 419 | " self.fc1 = nn.Linear(784, 256)\n", 420 | " self.fc2 = nn.Linear(256, 128)\n", 421 | " self.fc3 = nn.Linear(128, 64)\n", 422 | " self.fc4 = nn.Linear(64, 10)\n", 423 | "\n", 424 | " def forward(self, x):\n", 425 | " # make sure the input tensor is flattened\n", 426 | " x = x.view(x.shape[0], -1)\n", 427 | "\n", 428 | " # apply silu function\n", 429 | " x = silu(self.fc1(x))\n", 430 | "\n", 431 | " # apply silu function\n", 432 | " x = silu(self.fc2(x))\n", 433 | " \n", 434 | " # apply silu function\n", 435 | " x = silu(self.fc3(x))\n", 436 | " \n", 437 | " x = F.log_softmax(self.fc4(x), dim=1)\n", 438 | "\n", 439 | " return x\n", 440 | "\n", 441 | "# Create demo model\n", 442 | "model = ClassifierSiLU()\n", 443 | " \n", 444 | "# Run training\n", 445 | "train_model(model)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "## Implement Activation 
Function with Learnable Parameters\n", 453 | "\n", 454 | "There are lots of activation functions with parameters that can be trained with gradient descent along with the rest of the model. A good example is the [SoftExponential](https://arxiv.org/pdf/1602.01321.pdf) function:\n", 455 | "\n", 456 | "$$SoftExponential(x, \\alpha) = \\left\\{\\begin{matrix} - \\frac{log(1 - \\alpha(x + \\alpha))}{\\alpha}, \\alpha < 0\\\\ x, \\alpha = 0\\\\ \\frac{e^{\\alpha * x} - 1}{\\alpha} + \\alpha, \\alpha > 0 \\end{matrix}\\right.$$" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "Plot (image from Wikipedia):" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "![soft exponential plot](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Activation_soft_exponential.svg/2880px-Activation_soft_exponential.svg.png)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 9, 476 | "metadata": { 477 | "_kg_hide-input": true 478 | }, 479 | "outputs": [ 480 | { 481 | "name": "stderr", 482 | "output_type": "stream", 483 | "text": [ 484 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in log\n", 485 | " \n" 486 | ] 487 | }, 488 | { 489 | "data": { 490 | "image/png": "\n", 491 | "text/plain": [ 492 | "
" 493 | ] 494 | }, 495 | "metadata": { 496 | "needs_background": "light" 497 | }, 498 | "output_type": "display_data" 499 | } 500 | ], 501 | "source": [ 502 | "# Soft Exponential\n", 503 | "def soft_exponential_func(x, alpha):\n", 504 | " if alpha == 0.0:\n", 505 | " return x\n", 506 | " if alpha < 0.0:\n", 507 | " return - np.log(1 - alpha * (x + alpha)) / alpha\n", 508 | " if alpha > 0.0:\n", 509 | " return (np.exp(alpha * x) - 1)/ alpha + alpha\n", 510 | "\n", 511 | "fig = plt.figure(figsize=(10,7))\n", 512 | "ax = plt.axes()\n", 513 | "\n", 514 | "plt.title(\"Soft Exponential\")\n", 515 | "\n", 516 | "x = np.linspace(-5, 5, 1000)\n", 517 | "ax.plot(x, soft_exponential_func(x, -1.0), '-g', label = 'Soft Exponential, alpha = -1.0', linestyle = 'dashed');\n", 518 | "ax.plot(x, soft_exponential_func(x, -0.5), '-g', label = 'Soft Exponential, alpha = -0.5');\n", 519 | "ax.plot(x, soft_exponential_func(x, 0), '-b', label = 'Soft Exponential, alpha = 0');\n", 520 | "ax.plot(x, soft_exponential_func(x, 0.5), '-r', label = 'Soft Exponential, alpha = 0.5');\n", 521 | "\n", 522 | "plt.legend();" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "To implement an activation function with trainable parameters we have to:\n", 530 | "* derive a class from nn.Module and make alpha one of its members,\n", 531 | "* wrap alpha as a Parameter and set requiresGrad to True.\n", 532 | "\n", 533 | "See an example:" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 10, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "class soft_exponential(nn.Module):\n", 543 | " '''\n", 544 | " Implementation of soft exponential activation.\n", 545 | "\n", 546 | " Shape:\n", 547 | " - Input: (N, *) where * means, any number of additional\n", 548 | " dimensions\n", 549 | " - Output: (N, *), same shape as the input\n", 550 | "\n", 551 | " Parameters:\n", 552 | " - alpha - trainable parameter\n", 553 | "\n", 
554 | " References:\n", 555 | " - See related paper:\n", 556 | " https://arxiv.org/pdf/1602.01321.pdf\n", 557 | "\n", 558 | " Examples:\n", 559 | " >>> a1 = soft_exponential(256)\n", 560 | " >>> x = torch.randn(256)\n", 561 | " >>> x = a1(x)\n", 562 | " '''\n", 563 | " def __init__(self, in_features, alpha = None):\n", 564 | " '''\n", 565 | " Initialization.\n", 566 | " INPUT:\n", 567 | " - in_features: shape of the input\n", 568 | " - alpha: trainable parameter\n", 569 | " alpha is initialized with zero value by default\n", 570 | " '''\n", 571 | " super(soft_exponential,self).__init__()\n", 572 | " self.in_features = in_features\n", 573 | "\n", 574 | " # initialize alpha\n", 575 | " if alpha is None:\n", 576 | " self.alpha = Parameter(torch.tensor(0.0)) # default: alpha starts at zero\n", 577 | " else:\n", 578 | " self.alpha = Parameter(torch.tensor(alpha)) # create a tensor out of alpha\n", 579 | " \n", 580 | " self.alpha.requires_grad = True # make alpha trainable (already the default for Parameter)\n", 581 | "\n", 582 | " def forward(self, x):\n", 583 | " '''\n", 584 | " Forward pass of the function.\n", 585 | " Applies the function to the input elementwise.\n", 586 | " '''\n", 587 | " if (self.alpha == 0.0):\n", 588 | " return x\n", 589 | "\n", 590 | " if (self.alpha < 0.0):\n", 591 | " return - torch.log(1 - self.alpha * (x + self.alpha)) / self.alpha\n", 592 | "\n", 593 | " if (self.alpha > 0.0):\n", 594 | " return (torch.exp(self.alpha * x) - 1)/ self.alpha + self.alpha" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "Let's make a small demo: create a simple model that uses the Soft Exponential activation:" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 11, 607 | "metadata": {}, 608 | "outputs": [ 609 | { 610 | "name": "stdout", 611 | "output_type": "stream", 612 | "text": [ 613 | "Training the model. 
Make sure that loss decreases after each epoch.\n", 614 | "\n", 615 | "Training loss: 558.482414022088\n", 616 | "Training loss: 472.3926571011543\n", 617 | "Training loss: 453.55578088760376\n", 618 | "Training loss: 438.11303447186947\n", 619 | "Training loss: 430.5142505913973\n" 620 | ] 621 | } 622 | ], 623 | "source": [ 624 | "# create class for basic fully-connected deep neural network\n", 625 | "class ClassifierSExp(nn.Module):\n", 626 | " '''\n", 627 | " Basic fully-connected network to test Soft Exponential activation.\n", 628 | " '''\n", 629 | " def __init__(self):\n", 630 | " super().__init__()\n", 631 | "\n", 632 | " # initialize layers\n", 633 | " self.fc1 = nn.Linear(784, 256)\n", 634 | " self.fc2 = nn.Linear(256, 128)\n", 635 | " self.fc3 = nn.Linear(128, 64)\n", 636 | " self.fc4 = nn.Linear(64, 10)\n", 637 | "\n", 638 | " # initialize Soft Exponential activation\n", 639 | " self.a1 = soft_exponential(256)\n", 640 | " self.a2 = soft_exponential(128)\n", 641 | " self.a3 = soft_exponential(64)\n", 642 | "\n", 643 | " def forward(self, x):\n", 644 | " # make sure the input tensor is flattened\n", 645 | " x = x.view(x.shape[0], -1)\n", 646 | "\n", 647 | " # apply Soft Exponential unit\n", 648 | " x = self.a1(self.fc1(x))\n", 649 | " x = self.a2(self.fc2(x))\n", 650 | " x = self.a3(self.fc3(x))\n", 651 | " x = F.log_softmax(self.fc4(x), dim=1)\n", 652 | "\n", 653 | " return x\n", 654 | " \n", 655 | "model = ClassifierSExp()\n", 656 | "train_model(model)" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": {}, 662 | "source": [ 663 | "## Implement Activation Function with Custom Backward Step\n", 664 | "The perfect example of an activation function, which needs implementation of a custom backward step is [BReLU](https://arxiv.org/pdf/1709.04054.pdf) (Bipolar Rectified Linear Activation Unit):\n", 665 | "\n", 666 | "$$BReLU(x_i) = \\left\\{\\begin{matrix} f(x_i), i \\mod 2 = 0\\\\ - f(-x_i), i \\mod 2 \\neq 0 
\\end{matrix}\\right.$$\n", 667 | "\n", 668 | "This function is not differentiable at 0, so automatic gradient computation may fail.\n", 669 | "\n", 670 | "Plot:" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 12, 676 | "metadata": { 677 | "_kg_hide-input": true 678 | }, 679 | "outputs": [ 680 | { 681 | "data": { 682 | "image/png": "\n", 683 | "text/plain": [ 684 | "
" 685 | ] 686 | }, 687 | "metadata": { 688 | "needs_background": "light" 689 | }, 690 | "output_type": "display_data" 691 | } 692 | ], 693 | "source": [ 694 | "# ReLU function\n", 695 | "def relu(x):\n", 696 | " return (x >= 0) * x\n", 697 | "# inversed ReLU\n", 698 | "def inv_relu(x):\n", 699 | " return - relu(- x)\n", 700 | "\n", 701 | "fig = plt.figure(figsize=(10,7))\n", 702 | "ax = plt.axes()\n", 703 | "\n", 704 | "plt.title(\"BReLU\")\n", 705 | "\n", 706 | "x = np.linspace(-5, 5, 1000)\n", 707 | "ax.plot(x, relu(x), '-g', label = 'BReLU, xi mod 2 = 0');\n", 708 | "ax.plot(x, inv_relu(x), '-b', label = 'BReLU, xi mod 2 != 0', linestyle='dashed');\n", 709 | "\n", 710 | "plt.legend();" 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "To impement custom activation function with backward step we should:\n", 718 | "* create a class which, inherits Function from torch.autograd,\n", 719 | "* override static forward and backward methods. Forward method just applies the function to the input. 
Backward method should compute the gradient of the loss function with respect to the input given the gradient of the loss function with respect to the output.\n", 720 | "\n", 721 | "Let's see an example for BReLU:" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": 13, 727 | "metadata": {}, 728 | "outputs": [], 729 | "source": [ 730 | "class brelu(Function):\n", 731 | " '''\n", 732 | " Implementation of BReLU activation function.\n", 733 | "\n", 734 | " Shape:\n", 735 | " - Input: (N, *) where * means, any number of additional\n", 736 | " dimensions\n", 737 | " - Output: (N, *), same shape as the input\n", 738 | "\n", 739 | " References:\n", 740 | " - See BReLU paper:\n", 741 | " https://arxiv.org/pdf/1709.04054.pdf\n", 742 | "\n", 743 | " Examples:\n", 744 | " >>> brelu_activation = brelu.apply\n", 745 | " >>> t = torch.randn((5,5), dtype=torch.float, requires_grad = True)\n", 746 | " >>> t = brelu_activation(t)\n", 747 | " '''\n", 748 | " #both forward and backward are @staticmethods\n", 749 | " @staticmethod\n", 750 | " def forward(ctx, input):\n", 751 | " \"\"\"\n", 752 | " In the forward pass we receive a Tensor containing the input and return\n", 753 | " a Tensor containing the output. ctx is a context object that can be used\n", 754 | " to stash information for backward computation. 
You can cache arbitrary\n", 755 | " objects for use in the backward pass using the ctx.save_for_backward method.\n", 756 | " \"\"\"\n", 757 | " ctx.save_for_backward(input) # save input for backward pass\n", 758 | "\n", 759 | " # get lists of odd and even indices\n", 760 | " input_shape = input.shape[0]\n", 761 | " even_indices = [i for i in range(0, input_shape, 2)]\n", 762 | " odd_indices = [i for i in range(1, input_shape, 2)]\n", 763 | "\n", 764 | " # clone the input tensor\n", 765 | " output = input.clone()\n", 766 | "\n", 767 | " # apply ReLU to elements where i mod 2 == 0\n", 768 | " output[even_indices] = output[even_indices].clamp(min=0)\n", 769 | "\n", 770 | " # apply inversed ReLU to inversed elements where i mod 2 != 0\n", 771 | " output[odd_indices] = 0 - output[odd_indices] # reverse elements with odd indices\n", 772 | " output[odd_indices] = - output[odd_indices].clamp(min = 0) # apply reversed ReLU\n", 773 | "\n", 774 | " return output\n", 775 | "\n", 776 | " @staticmethod\n", 777 | " def backward(ctx, grad_output):\n", 778 | " \"\"\"\n", 779 | " In the backward pass we receive a Tensor containing the gradient of the loss\n", 780 | " with respect to the output, and we need to compute the gradient of the loss\n", 781 | " with respect to the input.\n", 782 | " \"\"\"\n", 783 | " grad_input = None # set output to None\n", 784 | "\n", 785 | " input, = ctx.saved_tensors # restore input from context\n", 786 | "\n", 787 | " # check that input requires grad\n", 788 | " # if not requires grad we will return None to speed up computation\n", 789 | " if ctx.needs_input_grad[0]:\n", 790 | " grad_input = grad_output.clone()\n", 791 | "\n", 792 | " # get lists of odd and even indices\n", 793 | " input_shape = input.shape[0]\n", 794 | " even_indices = [i for i in range(0, input_shape, 2)]\n", 795 | " odd_indices = [i for i in range(1, input_shape, 2)]\n", 796 | "\n", 797 | " # set grad_input for even_indices\n", 798 | " grad_input[even_indices] = 
(input[even_indices] >= 0).float() * grad_input[even_indices]\n", 799 | "\n", 800 | " # set grad_input for odd_indices\n", 801 | " grad_input[odd_indices] = (input[odd_indices] < 0).float() * grad_input[odd_indices]\n", 802 | "\n", 803 | " return grad_input" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": {}, 809 | "source": [ 810 | "Create a simple classifier model for a demonstration and run training:" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 14, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "name": "stdout", 820 | "output_type": "stream", 821 | "text": [ 822 | "Training the model. Make sure that loss decreases after each epoch.\n", 823 | "\n", 824 | "Training loss: 585.9589667767286\n", 825 | "Training loss: 442.55006262660027\n", 826 | "Training loss: 414.7568979859352\n", 827 | "Training loss: 390.45185139775276\n", 828 | "Training loss: 372.7816904038191\n" 829 | ] 830 | } 831 | ], 832 | "source": [ 833 | "class ClassifierBReLU(nn.Module):\n", 834 | " '''\n", 835 | " Simple fully-connected classifier model to demonstrate BReLU activation.\n", 836 | " '''\n", 837 | " def __init__(self):\n", 838 | " super(ClassifierBReLU, self).__init__()\n", 839 | "\n", 840 | " # initialize layers\n", 841 | " self.fc1 = nn.Linear(784, 256)\n", 842 | " self.fc2 = nn.Linear(256, 128)\n", 843 | " self.fc3 = nn.Linear(128, 64)\n", 844 | " self.fc4 = nn.Linear(64, 10)\n", 845 | "\n", 846 | " # create shortcuts for BReLU\n", 847 | " self.a1 = brelu.apply\n", 848 | " self.a2 = brelu.apply\n", 849 | " self.a3 = brelu.apply\n", 850 | "\n", 851 | " def forward(self, x):\n", 852 | " # make sure the input tensor is flattened\n", 853 | " x = x.view(x.shape[0], -1)\n", 854 | "\n", 855 | " # apply BReLU\n", 856 | " x = self.a1(self.fc1(x))\n", 857 | " x = self.a2(self.fc2(x))\n", 858 | " x = self.a3(self.fc3(x))\n", 859 | " x = F.log_softmax(self.fc4(x), dim=1)\n", 860 | " \n", 861 | " return x\n", 862 | " \n", 
863 | "model = ClassifierBReLU()\n", 864 | "train_model(model)" 865 | ] 866 | }, 867 | { 868 | "cell_type": "markdown", 869 | "metadata": {}, 870 | "source": [ 871 | "## Conclusion\n", 872 | "In this tutorial I demonstrated:\n", 873 | "* How to create a simple custom activation function,\n", 874 | "* How to create an activation function with learnable parameters, which can be trained using gradient descent,\n", 875 | "* How to create an activation function with custom backward step." 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "metadata": {}, 881 | "source": [ 882 | "## Improvement\n", 883 | "While building a lot of custom activation functions, I noticed, that they often consume much more GPU memory. Creation of inplace implementations of custom activations using PyTorch inplace methods will improve this situation." 884 | ] 885 | }, 886 | { 887 | "cell_type": "markdown", 888 | "metadata": {}, 889 | "source": [ 890 | "## Additional References\n", 891 | "Links to the additional resources and further reading:\n", 892 | "1. [Activation functions wiki page](https://en.wikipedia.org/wiki/Activation_function)\n", 893 | "2. [Tutorial on extending PyTorch](https://pytorch.org/docs/master/notes/extending.html)\n", 894 | "3. [Implementation of Maxout in PyTorch](https://github.com/Usama113/Maxout-PyTorch/blob/master/Maxout.ipynb)\n", 895 | "4. [PyTorch Comprehensive Overview](https://medium.com/@layog/a-comprehensive-overview-of-pytorch-7f70b061963f)\n", 896 | "5. 
[PyTorch and Fashion MNIST Kernel](https://www.kaggle.com/arturlacerda/pytorch-conditional-gan) I copied some code to load the data" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": {}, 902 | "source": [ 903 | "## PS\n", 904 | "![echo logo](https://github.com/Lexie88rus/Activation-functions-examples-pytorch/blob/master/assets/echo_logo.png?raw=true)\n", 905 | "\n", 906 | "I participate in implementation of a __Echo package__ with mathematical backend for neural networks, which can be used with most popular existing packages (TensorFlow, Keras and [PyTorch](https://pytorch.org/)). We have done a lot for activation functions for PyTorch so far. Here is a [link to a repository on GitHub](https://github.com/digantamisra98/Echo/tree/Dev-adeis), __I will highly appreciate your feedback__ on that." 907 | ] 908 | } 909 | ], 910 | "metadata": { 911 | "kernelspec": { 912 | "display_name": "Python 3", 913 | "language": "python", 914 | "name": "python3" 915 | }, 916 | "language_info": { 917 | "codemirror_mode": { 918 | "name": "ipython", 919 | "version": 3 920 | }, 921 | "file_extension": ".py", 922 | "mimetype": "text/x-python", 923 | "name": "python", 924 | "nbconvert_exporter": "python", 925 | "pygments_lexer": "ipython3", 926 | "version": "3.6.6" 927 | } 928 | }, 929 | "nbformat": 4, 930 | "nbformat_minor": 1 931 | } 932 | -------------------------------------------------------------------------------- /gain/activation_gain.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import sys 4 | 5 | # see pytorch forum topic: 6 | # https://discuss.pytorch.org/t/calculate-gain-tanh/20854 7 | 8 | # 1.5212 looks like a good gain value for SiLU 9 | 10 | def silu(input): 11 | ''' 12 | Applies the Sigmoid Linear Unit (SiLU) function element-wise: 13 | SiLU(x) = x * sigmoid(x) 14 | ''' 15 | return input * torch.sigmoid(input) 16 | 17 | a = torch.randn(1000,1000, 
requires_grad=True) 18 | b = a 19 | print (f"in: {a.std().item():.4f}") 20 | for i in range(100): 21 | l = torch.nn.Linear(1000,1000, bias=False) 22 | torch.nn.init.xavier_normal_(l.weight, float(sys.argv[2])) 23 | #b = getattr(F, sys.argv[1])(l(b)) 24 | b = silu(l(b)) 25 | 26 | if i % 10 == 0: 27 | print (f"out: {b.std().item():.4f}", end=" ") 28 | a.grad = None 29 | b.sum().backward(retain_graph=True) 30 | print (f"grad: {a.grad.abs().mean().item():.4f}") 31 | -------------------------------------------------------------------------------- /in-place-operations-in-pytorch.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 8 | "_kg_hide-input": true, 9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" 10 | }, 11 | "outputs": [], 12 | "source": [ 13 | "# Imports\n", 14 | "\n", 15 | "# Import basic libraries\n", 16 | "import numpy as np # linear algebra\n", 17 | "import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n", 18 | "%matplotlib inline\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "from collections import OrderedDict\n", 21 | "from PIL import Image\n", 22 | "\n", 23 | "# Import PyTorch\n", 24 | "import torch # import main library\n", 25 | "from torch.autograd import Variable\n", 26 | "import torch.nn as nn # import modules\n", 27 | "from torch import optim # import optimizers for demonstrations\n", 28 | "import torch.nn.functional as F # import torch functions\n", 29 | "from torchvision import transforms # import transformations to use for demo\n", 30 | "from torch.utils.data import Dataset, DataLoader " 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "![image](https://github.com/Lexie88rus/Activation-functions-examples-pytorch/raw/master/assets/background-card-chip.jpg)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "[_Photo by Fancycrave.com from Pexels_](https://www.pexels.com/photo/green-ram-card-collection-825262/)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 51 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", 52 | "collapsed": true 53 | }, 54 | "source": [ 55 | "# In-Place Operations in PyTorch\n", 56 | "_What are they and why avoid them_" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "## Introduction\n" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "Today's advanced deep neural networks have millions of parameters (for example, see the comparison in [this paper](https://arxiv.org/pdf/1905.11946.pdf)), and trying to train them on free GPUs such as those offered by Kaggle or Google Colab often leads to running out of GPU memory. 
There are several simple ways to reduce the GPU memory occupied by the model, for example:\n", 71 | "* Consider changing the architecture of the model or using a type of model with fewer parameters (for example, choose [DenseNet](https://arxiv.org/pdf/1608.06993.pdf)-121 over DenseNet-169). This approach can affect the model's performance metrics.\n", 72 | "* Reduce the batch size or manually set the number of data loader workers. In this case it will take longer for the model to train.\n", 73 | "\n", 74 | "Using in-place operations in neural networks may help to avoid the downsides of the approaches mentioned above while saving some GPU memory. However, it is strongly __not recommended to use in-place operations__ for several reasons.\n", 75 | "\n", 76 | "In this kernel I would like to:\n", 77 | "* Describe what in-place operations are and demonstrate how they might help to save GPU memory.\n", 78 | "* Explain why we should avoid in-place operations or use them with great caution." 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## In-place Operations\n", 86 | "`In-place operation is an operation that changes directly the content of a given linear algebra, vector, matrices(Tensor) without making a copy. The operators which helps to do the operation is called in-place operator.` See the [tutorial](https://www.tutorialspoint.com/inplace-operator-in-python) on in-place operations in Python.\n", 87 | "\n", 88 | "As the definition says, in-place operations don't make a copy of the input, which is why they can help to reduce memory usage when operating on high-dimensional data." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "I would like to run a simple model on [Fashion MNIST dataset](https://www.kaggle.com/zalando-research/fashionmnist) to demonstrate how in-place operations help to consume less GPU memory. 
In this demonstration I use a simple fully-connected deep neural network with four linear layers and [ReLU](https://pytorch.org/docs/stable/nn.html#relu) activation after each hidden layer." 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "## Setting Up the Demo\n", 103 | "In this section I will prepare everything for the demonstration:\n", 104 | "* Load the Fashion MNIST dataset with PyTorch,\n", 105 | "* Introduce transformations for Fashion MNIST images using PyTorch,\n", 106 | "* Prepare the model training procedure.\n", 107 | "\n", 108 | "If you are familiar with PyTorch basics, just skip this part and go straight to the rest of the kernel." 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Introduce Transformations" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "The most efficient way to transform the input data is to use built-in PyTorch transformations:" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 2, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "# Define a transform\n", 132 | "transform = transforms.Compose([transforms.ToTensor()])" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "### Load the Data" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "To load the data I used the standard Dataset and DataLoader classes from PyTorch and [FashionMNIST class code from this kernel](https://www.kaggle.com/arturlacerda/pytorch-conditional-gan):" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 3, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "class FashionMNIST(Dataset):\n", 156 | " '''\n", 157 | " Dataset class to load Fashion MNIST data from csv.\n", 158 | " Code from original kernel:\n", 159 | " 
https://www.kaggle.com/arturlacerda/pytorch-conditional-gan\n", 160 | " '''\n", 161 | " def __init__(self, transform=None):\n", 162 | " self.transform = transform\n", 163 | " fashion_df = pd.read_csv('../input/fashion-mnist_train.csv')\n", 164 | " self.labels = fashion_df.label.values\n", 165 | " self.images = fashion_df.iloc[:, 1:].values.astype('uint8').reshape(-1, 28, 28)\n", 166 | "\n", 167 | " def __len__(self):\n", 168 | " return len(self.images)\n", 169 | "\n", 170 | " def __getitem__(self, idx):\n", 171 | " label = self.labels[idx]\n", 172 | " img = Image.fromarray(self.images[idx])\n", 173 | " \n", 174 | " if self.transform:\n", 175 | " img = self.transform(img)\n", 176 | "\n", 177 | " return img, label\n", 178 | "\n", 179 | "# Load the training data for Fashion MNIST\n", 180 | "trainset = FashionMNIST(transform=transform)\n", 181 | "# Define the dataloader\n", 182 | "trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "### Setup Training Procedure\n", 190 | "I wrote a small training procedure, which runs 5 training epochs and prints the loss for each epoch:" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 4, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "def train_model(model, device):\n", 200 | " '''\n", 201 | " Function trains the model and prints out the training log.\n", 202 | " '''\n", 203 | " #setup training\n", 204 | " \n", 205 | " #define loss function\n", 206 | " criterion = nn.NLLLoss()\n", 207 | " #define learning rate\n", 208 | " learning_rate = 0.003\n", 209 | " #define number of epochs\n", 210 | " epochs = 5\n", 211 | " #initialize optimizer\n", 212 | " optimizer = optim.Adam(model.parameters(), lr=learning_rate)\n", 213 | " \n", 214 | " model.to(device)\n", 215 | "\n", 216 | " #run training and print out the loss to make sure that we are actually fitting to the 
training set\n", 217 | " print('Training the model \\n')\n", 218 | " for e in range(epochs):\n", 219 | " running_loss = 0\n", 220 | " for images, labels in trainloader:\n", 221 | " \n", 222 | " images, labels = images.to(device), labels.to(device)\n", 223 | " images = images.view(images.shape[0], -1)\n", 224 | " log_ps = model(images)\n", 225 | " loss = criterion(log_ps, labels)\n", 226 | "\n", 227 | " optimizer.zero_grad()\n", 228 | " loss.backward()\n", 229 | " optimizer.step()\n", 230 | "\n", 231 | " running_loss += loss.item()\n", 232 | " else:\n", 233 | " # print out the loss to make sure it is decreasing\n", 234 | " print(f\"Training loss: {running_loss}\")" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### Define the Model\n", 242 | "\n", 243 | "PyTorch provides us with in-place implementation of ReLU activation function. I will run consequently training with in-place ReLU implementation and with vanilla ReLU." 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 5, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "# create class for basic fully-connected deep neural network\n", 253 | "class Classifier(nn.Module):\n", 254 | " '''\n", 255 | " Demo classifier model class to demonstrate in-place operations\n", 256 | " '''\n", 257 | " def __init__(self, inplace = False):\n", 258 | " super().__init__()\n", 259 | "\n", 260 | " # initialize layers\n", 261 | " self.fc1 = nn.Linear(784, 256)\n", 262 | " self.fc2 = nn.Linear(256, 128)\n", 263 | " self.fc3 = nn.Linear(128, 64)\n", 264 | " self.fc4 = nn.Linear(64, 10)\n", 265 | " \n", 266 | " self.relu = nn.ReLU(inplace = inplace) # pass inplace as parameter to ReLU\n", 267 | "\n", 268 | " def forward(self, x):\n", 269 | " # make sure the input tensor is flattened\n", 270 | " x = x.view(x.shape[0], -1)\n", 271 | "\n", 272 | " # apply activation function\n", 273 | " x = self.relu(self.fc1(x))\n", 274 | "\n", 275 | " # apply 
activation function\n", 276 | " x = self.relu(self.fc2(x))\n", 277 | " \n", 278 | " # apply activation function\n", 279 | " x = self.relu(self.fc3(x))\n", 280 | " \n", 281 | " x = F.log_softmax(self.fc4(x), dim=1)\n", 282 | "\n", 283 | " return x" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "## Compare Memory Usage for In-place and Vanilla Operations\n" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "Let's compare memory usage for one single call of ReLU activation function:" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 6, 303 | "metadata": {}, 304 | "outputs": [ 305 | { 306 | "data": { 307 | "text/plain": [ 308 | "device(type='cuda', index=0)" 309 | ] 310 | }, 311 | "execution_count": 6, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "# empty caches and setup the device\n", 318 | "torch.cuda.empty_cache()\n", 319 | "\n", 320 | "device = torch.device('cuda:0' if torch.cuda.is_available() else \"cpu\")\n", 321 | "device" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 7, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "def get_memory_allocated(device, inplace = False):\n", 331 | " '''\n", 332 | " Function measures allocated memory before and after the ReLU function call.\n", 333 | " '''\n", 334 | " \n", 335 | " # Create a large tensor\n", 336 | " t = torch.randn(10000, 10000, device=device)\n", 337 | " \n", 338 | " # Measure allocated memory\n", 339 | " torch.cuda.synchronize()\n", 340 | " start_max_memory = torch.cuda.max_memory_allocated() / 1024**2\n", 341 | " start_memory = torch.cuda.memory_allocated() / 1024**2\n", 342 | " \n", 343 | " # Call in-place or normal ReLU\n", 344 | " if inplace:\n", 345 | " F.relu_(t)\n", 346 | " else:\n", 347 | " output = F.relu(t)\n", 348 | " \n", 349 | " # Measure allocated memory after 
the call\n", 350 | " torch.cuda.synchronize()\n", 351 | " end_max_memory = torch.cuda.max_memory_allocated() / 1024**2\n", 352 | " end_memory = torch.cuda.memory_allocated() / 1024**2\n", 353 | " \n", 354 | " # Return amount of memory allocated for ReLU call\n", 355 | " return end_memory - start_memory, end_max_memory - start_max_memory" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "Run out of place ReLU:" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 8, 368 | "metadata": {}, 369 | "outputs": [ 370 | { 371 | "name": "stdout", 372 | "output_type": "stream", 373 | "text": [ 374 | "Allocated memory: 382.0\n", 375 | "Allocated max memory: 382.0\n" 376 | ] 377 | } 378 | ], 379 | "source": [ 380 | "memory_allocated, max_memory_allocated = get_memory_allocated(device, inplace = False)\n", 381 | "print('Allocated memory: {}'.format(memory_allocated))\n", 382 | "print('Allocated max memory: {}'.format(max_memory_allocated))" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "Run in-place ReLU:" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 9, 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "name": "stdout", 399 | "output_type": "stream", 400 | "text": [ 401 | "Allocated memory: 0.0\n", 402 | "Allocated max memory: 0.0\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "memory_allocated_inplace, max_memory_allocated_inplace = get_memory_allocated(device, inplace = True)\n", 408 | "print('Allocated memory: {}'.format(memory_allocated_inplace))\n", 409 | "print('Allocated max memory: {}'.format(max_memory_allocated_inplace))" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 | "Now let's do the same while training a simple classifier.\n", 417 | "Run training with vanilla ReLU:" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | 
"execution_count": 10, 423 | "metadata": {}, 424 | "outputs": [ 425 | { 426 | "name": "stdout", 427 | "output_type": "stream", 428 | "text": [ 429 | "Training the model \n", 430 | "\n", 431 | "Training loss: 490.5504989773035\n", 432 | "Training loss: 361.1345275044441\n", 433 | "Training loss: 329.05726308375597\n", 434 | "Training loss: 306.97832968086004\n", 435 | "Training loss: 292.6471059694886\n" 436 | ] 437 | } 438 | ], 439 | "source": [ 440 | "# initialize classifier\n", 441 | "model = Classifier(inplace = False)\n", 442 | "\n", 443 | "# measure allocated memory\n", 444 | "torch.cuda.synchronize()\n", 445 | "start_max_memory = torch.cuda.max_memory_allocated() / 1024**2\n", 446 | "start_memory = torch.cuda.memory_allocated() / 1024**2\n", 447 | "\n", 448 | "# train the classifier\n", 449 | "train_model(model, device)\n", 450 | "\n", 451 | "# measure allocated memory after training\n", 452 | "torch.cuda.synchronize()\n", 453 | "end_max_memory = torch.cuda.max_memory_allocated() / 1024**2\n", 454 | "end_memory = torch.cuda.memory_allocated() / 1024**2" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": 11, 460 | "metadata": {}, 461 | "outputs": [ 462 | { 463 | "name": "stdout", 464 | "output_type": "stream", 465 | "text": [ 466 | "Allocated memory: 1.853515625\n", 467 | "Allocated max memory: 0.0\n" 468 | ] 469 | } 470 | ], 471 | "source": [ 472 | "print('Allocated memory: {}'.format(end_memory - start_memory))\n", 473 | "print('Allocated max memory: {}'.format(end_max_memory - start_max_memory))" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "Run training with in-place ReLU:" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 12, 486 | "metadata": {}, 487 | "outputs": [ 488 | { 489 | "name": "stdout", 490 | "output_type": "stream", 491 | "text": [ 492 | "Training the model \n", 493 | "\n", 494 | "Training loss: 485.5531446188688\n", 495 | "Training 
loss: 359.61066341400146\n", 496 | "Training loss: 329.1772850751877\n", 497 | "Training loss: 307.14213905483484\n", 498 | "Training loss: 292.3229675516486\n" 499 | ] 500 | } 501 | ], 502 | "source": [ 503 | "# initialize model with in-place ReLU\n", 504 | "model = Classifier(inplace = True)\n", 505 | "\n", 506 | "# measure allocated memory\n", 507 | "torch.cuda.synchronize()\n", 508 | "start_max_memory = torch.cuda.max_memory_allocated() / 1024**2\n", 509 | "start_memory = torch.cuda.memory_allocated() / 1024**2\n", 510 | "\n", 511 | "# train the classifier with in-place ReLU\n", 512 | "train_model(model, device)\n", 513 | "\n", 514 | "# measure allocated memory after training\n", 515 | "torch.cuda.synchronize()\n", 516 | "end_max_memory = torch.cuda.max_memory_allocated() / 1024**2\n", 517 | "end_memory = torch.cuda.memory_allocated() / 1024**2" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 13, 523 | "metadata": {}, 524 | "outputs": [ 525 | { 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "Allocated memory: 1.853515625\n", 530 | "Allocated max memory: 0.0\n" 531 | ] 532 | } 533 | ], 534 | "source": [ 535 | "print('Allocated memory: {}'.format(end_memory - start_memory))\n", 536 | "print('Allocated max memory: {}'.format(end_max_memory - start_max_memory))" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "For a single ReLU call, the in-place version allocates no extra GPU memory, while the out-of-place call allocates about 382 MiB for its output. In the small training run above, both variants report the same numbers. Either way, we should be __extremely cautious when using in-place operations and check the results twice__. The next section shows why."
544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "## Downsides of In-place Operations" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "The major downside of in-place operations is that __they might overwrite values required to compute gradients__, which breaks the training procedure of the model. This is what [the official PyTorch autograd documentation](https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd) says:\n", 558 | "> Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.\n", 559 | "\n", 560 | "> There are two main reasons that limit the applicability of in-place operations:\n", 561 | "\n", 562 | "> 1. In-place operations can potentially overwrite values required to compute gradients.\n", 563 | "> 2. Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations, require changing the creator of all inputs to the Function representing this operation. This can be tricky, especially if there are many Tensors that reference the same storage (e.g. created by indexing or transposing), and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other Tensor." 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "Another reason to be careful with in-place operations is that implementing them correctly is tricky.
That is why I would __recommend using PyTorch's standard in-place operations__ (like `torch.tanh_` or `torch.sigmoid_`) instead of implementing one manually.\n", 571 | "\n", 572 | "Let's look at the [SiLU](https://arxiv.org/pdf/1606.08415.pdf) (or Swish-1) activation function as an example. This is the normal implementation of SiLU:" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 14, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "def silu(input):\n", 582 | " '''\n", 583 | " Normal implementation of SiLU activation function\n", 584 | " https://arxiv.org/pdf/1606.08415.pdf\n", 585 | " '''\n", 586 | " return input * torch.sigmoid(input)" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "Let's try to implement an in-place SiLU using the in-place function `torch.sigmoid_`:" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 15, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [ 602 | "def silu_inplace_1(input):\n", 603 | " '''\n", 604 | " Incorrect implementation of in-place SiLU activation function\n", 605 | " https://arxiv.org/pdf/1606.08415.pdf\n", 606 | " '''\n", 607 | " return input * torch.sigmoid_(input) # THIS IS INCORRECT!!!" 608 | ] 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "metadata": {}, 613 | "source": [ 614 | "The code above __incorrectly__ implements in-place SiLU.
We can verify that:" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 16, 620 | "metadata": {}, 621 | "outputs": [ 622 | { 623 | "name": "stdout", 624 | "output_type": "stream", 625 | "text": [ 626 | "Original SiLU: tensor([ 0.0796, -0.2744, -0.2598])\n", 627 | "In-place SiLU: tensor([0.5370, 0.2512, 0.2897])\n" 628 | ] 629 | } 630 | ], 631 | "source": [ 632 | "t = torch.randn(3)\n", 633 | "\n", 634 | "# print result of original SiLU\n", 635 | "print(\"Original SiLU: {}\".format(silu(t)))\n", 636 | "\n", 637 | "# change the value of t with in-place function\n", 638 | "silu_inplace_1(t)\n", 639 | "print(\"In-place SiLU: {}\".format(t))" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "metadata": {}, 645 | "source": [ 646 | "The function `silu_inplace_1` in fact returns `sigmoid(input) * sigmoid(input)`, and it leaves `t` overwritten with `sigmoid(t)`, which is the tensor printed above." 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "A working in-place implementation of SiLU using `torch.sigmoid_` could look like this:" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": 17, 659 | "metadata": {}, 660 | "outputs": [], 661 | "source": [ 662 | "def silu_inplace_2(input):\n", 663 | " '''\n", 664 | " Example of implementation of in-place SiLU activation function using torch.sigmoid_\n", 665 | " https://arxiv.org/pdf/1606.08415.pdf\n", 666 | " '''\n", 667 | " result = input.clone()\n", 668 | " torch.sigmoid_(input)\n", 669 | " input *= result\n", 670 | " return input" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 18, 676 | "metadata": {}, 677 | "outputs": [ 678 | { 679 | "name": "stdout", 680 | "output_type": "stream", 681 | "text": [ 682 | "Original SiLU: tensor([ 0.7774, -0.2767, 0.2967])\n", 683 | "In-place SiLU #2: tensor([ 0.7774, -0.2767, 0.2967])\n" 684 | ] 685 | } 686 | ], 687 | "source": [ 688 | "t = torch.randn(3)\n", 689 | "\n", 690 | "# print result of
original SiLU\n", 691 | "print(\"Original SiLU: {}\".format(silu(t)))\n", 692 | "\n", 693 | "# change the value of t with in-place function\n", 694 | "silu_inplace_2(t)\n", 695 | "print(\"In-place SiLU #2: {}\".format(t))" 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "metadata": {}, 701 | "source": [ 702 | "This small example demonstrates why we should be extremely careful and check the results twice when using in-place operations." 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "## Conclusion\n", 710 | "In this article: \n", 711 | "* I described in-place operations and their purpose, and demonstrated how they can help to __consume less GPU memory__.\n", 712 | "* I described the major __downsides of in-place operations__. One should be very careful about using them and check the result twice." 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": {}, 718 | "source": [ 719 | "## Additional References\n", 720 | "Links to additional resources and further reading:\n", 721 | "\n", 722 | "1. [PyTorch Autograd documentation](https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd)" 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": {}, 728 | "source": [ 729 | "## PS\n", 730 | "![echo logo](https://github.com/Lexie88rus/Activation-functions-examples-pytorch/blob/master/assets/echo_logo.png?raw=true)\n", 731 | "\n", 732 | "I take part in the implementation of the __Echo package__, a mathematical backend for neural networks that can be used with the most popular existing packages (TensorFlow, Keras and [PyTorch](https://pytorch.org/)). We have done a lot for PyTorch and Keras so far. Here is a [link to a repository on GitHub](https://github.com/digantamisra98/Echo/tree/Dev-adeis), and __I would highly appreciate your feedback__ on it."
733 | ] 734 | } 735 | ], 736 | "metadata": { 737 | "kernelspec": { 738 | "display_name": "Python 3", 739 | "language": "python", 740 | "name": "python3" 741 | }, 742 | "language_info": { 743 | "codemirror_mode": { 744 | "name": "ipython", 745 | "version": 3 746 | }, 747 | "file_extension": ".py", 748 | "mimetype": "text/x-python", 749 | "name": "python", 750 | "nbconvert_exporter": "python", 751 | "pygments_lexer": "ipython3", 752 | "version": "3.6.6" 753 | } 754 | }, 755 | "nbformat": 4, 756 | "nbformat_minor": 1 757 | } 758 | -------------------------------------------------------------------------------- /landscapes/create_landscape.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch import optim 4 | import torch.nn.functional as F 5 | 6 | from collections import OrderedDict 7 | import numpy as np 8 | 9 | from PIL import Image 10 | 11 | from sklearn.preprocessing import MinMaxScaler 12 | 13 | # import matplotlib for visualization 14 | from matplotlib.pyplot import imshow 15 | import matplotlib.pyplot as plt 16 | 17 | def fswish(input, beta=1.25): 18 | return input * torch.sigmoid(beta * input) 19 | 20 | class swish(nn.Module): 21 | def __init__(self, beta = 1.25): 22 | ''' 23 | Init method. 24 | ''' 25 | super().__init__() 26 | self.beta = beta 27 | 28 | 29 | def forward(self, input): 30 | ''' 31 | Forward pass of the function. 32 | ''' 33 | return fswish(input, self.beta) 34 | 35 | def fmish(input): 36 | return input * torch.tanh(F.softplus(input)) 37 | 38 | class mish(nn.Module): 39 | def __init__(self): 40 | ''' 41 | Init method. 42 | ''' 43 | super().__init__() 44 | 45 | def forward(self, input): 46 | ''' 47 | Forward pass of the function. 
48 | ''' 49 | return fmish(input) 50 | 51 | def build_model(activation_function): 52 | return nn.Sequential(OrderedDict([ 53 | ('fc1', nn.Linear(2, 64)), 54 | ('activation1', activation_function), # use custom activation function 55 | ('fc2', nn.Linear(64, 32)), 56 | ('activation2', activation_function), 57 | ('fc3', nn.Linear(32, 16)), 58 | ('activation3', activation_function), 59 | ('fc4', nn.Linear(16, 1)), 60 | ('activation4', activation_function)])) 61 | 62 | def convert_to_PIL(img, width, height): 63 | img_r = img.reshape(height, width) 64 | 65 | pil_img = Image.new('RGB', (width, height), 'white') # PIL sizes are (width, height) 66 | pixels = pil_img.load() 67 | 68 | for i in range(0, height): 69 | for j in range(0, width): 70 | pixels[j, i] = (int(img_r[i, j]),) * 3 # same integer value for R, G and B 71 | 72 | return pil_img 73 | 74 | def main(): 75 | model1 = build_model(nn.ReLU()) 76 | model2 = build_model(swish(beta = 1)) 77 | model3 = build_model(mish()) 78 | 79 | x = np.linspace(0.0, 10.0, num=100) 80 | y = np.linspace(0.0, 10.0, num=100) 81 | 82 | grid = [torch.tensor([xi, yi], dtype=torch.float32) for xi in x for yi in y] # match the float32 model weights 83 | 84 | np_img_relu = np.array([model1(point).detach().numpy() for point in grid]).reshape(100, 100) 85 | np_img_swish = np.array([model2(point).detach().numpy() for point in grid]).reshape(100, 100) 86 | np_img_mish = np.array([model3(point).detach().numpy() for point in grid]).reshape(100, 100) 87 | 88 | scaler = MinMaxScaler(feature_range=(0, 255)) 89 | np_img_relu = scaler.fit_transform(np_img_relu) 90 | np_img_swish = scaler.fit_transform(np_img_swish) 91 | np_img_mish = scaler.fit_transform(np_img_mish) 92 | 93 | image_relu = convert_to_PIL(np_img_relu, 100, 100) 94 | image_swish = convert_to_PIL(np_img_swish, 100, 100) 95 | image_mish = convert_to_PIL(np_img_mish, 100, 100) 96 | 97 | image_relu.save('relu.png') 98 | image_swish.save('swish.png') 99 | image_mish.save('mish.png') 100 | 101 | plt.imsave('imrelu.png', np_img_relu) 102 | plt.imsave('imswish.png', np_img_swish) 103 | plt.imsave('imsmish.png',
np_img_mish) 104 | 105 | return 106 | 107 | 108 | if __name__ == '__main__': 109 | main() 110 | -------------------------------------------------------------------------------- /landscapes/imrelu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/landscapes/imrelu.png -------------------------------------------------------------------------------- /landscapes/imsmish.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/landscapes/imsmish.png -------------------------------------------------------------------------------- /landscapes/imswish.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lexie88rus/Activation-functions-examples-pytorch/769ac4c23ac57c9d244f25e4ab2b96b07f1e8826/landscapes/imswish.png --------------------------------------------------------------------------------
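
The aliasing pitfall behind `silu_inplace_1` in the in-place operations notebook above can be reproduced without PyTorch. The sketch below is an illustration only: plain Python lists stand in for tensors, and the helper names (`sigmoid_`, `silu_inplace_wrong`, `silu_inplace_fixed`) are hypothetical stand-ins for `torch.sigmoid_`, `silu_inplace_1`, and `silu_inplace_2`; none of them are part of the repository.

```python
import math


def sigmoid_(values):
    """In-place sigmoid on a list: overwrites every element and returns the same list."""
    for i, v in enumerate(values):
        values[i] = 1.0 / (1.0 + math.exp(-v))
    return values


def silu(values):
    """Out-of-place SiLU: builds a new list and leaves the input untouched."""
    return [v / (1.0 + math.exp(-v)) for v in values]


def silu_inplace_wrong(values):
    """Mirrors silu_inplace_1: `values` already holds sigmoid(x) when the product is taken."""
    gate = sigmoid_(values)  # `gate` and `values` are the same list object
    return [v * g for v, g in zip(values, gate)]


def silu_inplace_fixed(values):
    """Mirrors silu_inplace_2: copy x first, then overwrite the input with x * sigmoid(x)."""
    original = list(values)  # plays the role of input.clone()
    sigmoid_(values)
    for i, x_i in enumerate(original):
        values[i] *= x_i
    return values


x = [0.5, -1.0, 2.0]
expected = silu(x)

wrong_input = list(x)
wrong = silu_inplace_wrong(wrong_input)

fixed_input = list(x)
fixed = silu_inplace_fixed(fixed_input)

print("reference SiLU:", expected)
print("wrong variant :", wrong)
print("fixed variant :", fixed)
```

The wrong variant returns the squared sigmoid because both operands of the product refer to the same, already overwritten buffer. The fixed variant avoids that by copying the input first, which also means it still needs a second buffer while it runs.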