├── .gitignore ├── README.md ├── diffGrad.py ├── diffGrad_Regression_Loss.ipynb ├── diffGrad_analytical.ipynb └── diffGrad_v2.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # [diffGrad: An Optimization Method for Convolutional Neural Networks](https://ieeexplore.ieee.org/document/8939562) 2 | 3 | The PyTorch implementation of diffGrad can be found in [torch-optimizer](https://pypi.org/project/torch-optimizer/#diffgrad) and can easily be used by following. 5 | 6 | ## How to use 7 | 8 |
pip install torch-optimizer
 9 | 
10 | import torch_optimizer as optimizer
11 | 
12 | # model = ...
13 | optimizer = optimizer.DiffGrad(
14 |     model.parameters(),
15 |     lr= 1e-3,
16 |     betas=(0.9, 0.999),
17 |     eps=1e-8,
18 |     weight_decay=0,
19 | )
20 | optimizer.step()
21 | 
22 | 23 | ## Issues 24 | 25 | It is recommended to use diffGrad_v2.py which fixes [an issue](https://github.com/shivram1987/diffGrad/issues/2) in diffGrad.py. 27 | 28 | It is also recommended to refer [arXiv version](https://arxiv.org/abs/1909.11015) for the updated results. 30 | 31 | ## Abstract 32 | 33 | Stochastic Gradient Decent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which function has the steepest rate of change. The main problem with basic SGD is to change by equal sized steps for all parameters, irrespective of gradient behavior. Hence, an efficient way of deep network optimization is to make adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods such as AdaGrad, AdaDelta, RMSProp and Adam. These methods rely on the square roots of exponential moving averages of squared past gradients. Thus, these methods do not take the advantage of local change in gradients. In this paper, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter in such a way that it should have a larger step size for faster gradient changing parameters and lower step size for lower gradient changing parameters. The convergence analysis is done using the regret bound approach of online learning framework. Rigorous analysis is made in this paper over three synthetic complex non-convex functions. The image categorization experiments are also conducted over the CIFAR10 and CIFAR100 datasets to observe the performance of diffGrad with respect to the state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. The residual unit (ResNet) based Convolutional Neural Networks (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. Also, we showed that diffGrad performs uniformly well on network using different activation functions. 34 | 35 | ## Citation 36 | 37 | 38 | If you use this code in your research, please cite as: 39 | 40 | @article{dubey2019diffgrad, 41 | title={diffGrad: An Optimization Method for Convolutional Neural Networks}, 42 | author={Dubey, Shiv Ram and Chakraborty, Soumendu and Roy, Swalpa Kumar and Mukherjee, Snehasis and Singh, Satish Kumar and Chaudhuri, Bidyut Baran}, 43 | journal={IEEE Transactions on Neural Networks and Learning Systems}, 44 | volume={31}, 45 | no={11}, 46 | pp.={4500 - 4511}, 47 | year={2020}, 48 | publisher={IEEE} 49 | } 50 | 51 | ## Acknowledgement 52 | 53 | All experiments are perfomed using following framework: https://github.com/kuangliu/pytorch-cifar 54 | 55 | 56 | ## License 57 | 58 | Copyright (©2019): Shiv Ram Dubey, Indian Institute of Information Technology, Sri City, Chittoor, A.P., India. Released under the MIT License. See [LICENSE](LICENSE) for details. 59 | -------------------------------------------------------------------------------- /diffGrad.py: -------------------------------------------------------------------------------- 1 | # This diffGrad implementation has a bug. Use diffGrad_v2.py. 2 | import math 3 | import torch 4 | from torch.optim.optimizer import Optimizer 5 | import numpy as np 6 | import torch.nn as nn 7 | #import torch.optim as Optimizer 8 | 9 | class diffgrad(Optimizer): 10 | r"""Implements diffGrad algorithm. It is modified from the pytorch implementation of Adam. 11 | 12 | It has been proposed in `diffGrad: An Optimization Method for Convolutional Neural Networks`_. 13 | 14 | Arguments: 15 | params (iterable): iterable of parameters to optimize or dicts defining 16 | parameter groups 17 | lr (float, optional): learning rate (default: 1e-3) 18 | betas (Tuple[float, float], optional): coefficients used for computing 19 | running averages of gradient and its square (default: (0.9, 0.999)) 20 | eps (float, optional): term added to the denominator to improve 21 | numerical stability (default: 1e-8) 22 | weight_decay (float, optional): weight decay (L2 penalty) (default: 0) 23 | amsgrad (boolean, optional): whether to use the AMSGrad variant of this 24 | algorithm from the paper `On the Convergence of Adam and Beyond`_ 25 | (default: False) 26 | 27 | .. _diffGrad: An Optimization Method for Convolutional Neural Networks: 28 | https://arxiv.org/abs/1909.11015 29 | .. _Adam\: A Method for Stochastic Optimization: 30 | https://arxiv.org/abs/1412.6980 31 | .. _On the Convergence of Adam and Beyond: 32 | https://openreview.net/forum?id=ryQu7f-RZ 33 | """ 34 | 35 | def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0): 36 | if not 0.0 <= lr: 37 | raise ValueError("Invalid learning rate: {}".format(lr)) 38 | if not 0.0 <= eps: 39 | raise ValueError("Invalid epsilon value: {}".format(eps)) 40 | if not 0.0 <= betas[0] < 1.0: 41 | raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0])) 42 | if not 0.0 <= betas[1] < 1.0: 43 | raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1])) 44 | defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay) 45 | super(diffgrad, self).__init__(params, defaults) 46 | 47 | def __setstate__(self, state): 48 | super(diffgrad, self).__setstate__(state) 49 | 50 | def step(self, closure=None): 51 | """Performs a single optimization step. 52 | 53 | Arguments: 54 | closure (callable, optional): A closure that reevaluates the model 55 | and returns the loss. 56 | """ 57 | loss = None 58 | if closure is not None: 59 | loss = closure() 60 | 61 | for group in self.param_groups: 62 | for p in group['params']: 63 | if p.grad is None: 64 | continue 65 | grad = p.grad.data 66 | if grad.is_sparse: 67 | raise RuntimeError('diffGrad does not support sparse gradients, please consider SparseAdam instead') 68 | 69 | state = self.state[p] 70 | 71 | # State initialization 72 | if len(state) == 0: 73 | state['step'] = 0 74 | # Exponential moving average of gradient values 75 | state['exp_avg'] = torch.zeros_like(p.data) 76 | # Exponential moving average of squared gradient values 77 | state['exp_avg_sq'] = torch.zeros_like(p.data) 78 | # Previous gradient 79 | state['previous_grad'] = torch.zeros_like(p.data) 80 | 81 | exp_avg, exp_avg_sq, previous_grad = state['exp_avg'], state['exp_avg_sq'], state['previous_grad'] 82 | beta1, beta2 = group['betas'] 83 | 84 | state['step'] += 1 85 | 86 | if group['weight_decay'] != 0: 87 | grad.add_(group['weight_decay'], p.data) 88 | 89 | # Decay the first and second moment running average coefficient 90 | exp_avg.mul_(beta1).add_(1 - beta1, grad) 91 | exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) 92 | denom = exp_avg_sq.sqrt().add_(group['eps']) 93 | 94 | bias_correction1 = 1 - beta1 ** state['step'] 95 | bias_correction2 = 1 - beta2 ** state['step'] 96 | 97 | # compute diffgrad coefficient (dfc) 98 | diff = abs(previous_grad - grad) 99 | dfc = 1. / (1. + torch.exp(-diff)) 100 | state['previous_grad'] = grad # used in paper but has the bug that previous grad is overwritten with grad and diff becomes always zero. Fixed in the next line. 101 | 102 | # update momentum with dfc 103 | exp_avg1 = exp_avg * dfc 104 | 105 | step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1 106 | 107 | p.data.addcdiv_(-step_size, exp_avg1, denom) 108 | 109 | return loss 110 | -------------------------------------------------------------------------------- /diffGrad_v2.py: -------------------------------------------------------------------------------- 1 | # Fixes a bug in original diffGrad code 2 | import math 3 | import torch 4 | from torch.optim.optimizer import Optimizer 5 | import numpy as np 6 | import torch.nn as nn 7 | #import torch.optim as Optimizer 8 | 9 | class diffgrad(Optimizer): 10 | r"""Implements diffGrad algorithm. It is modified from the pytorch implementation of Adam. 11 | It has been proposed in `diffGrad: An Optimization Method for Convolutional Neural Networks`_. 12 | Arguments: 13 | params (iterable): iterable of parameters to optimize or dicts defining 14 | parameter groups 15 | lr (float, optional): learning rate (default: 1e-3) 16 | betas (Tuple[float, float], optional): coefficients used for computing 17 | running averages of gradient and its square (default: (0.9, 0.999)) 18 | eps (float, optional): term added to the denominator to improve 19 | numerical stability (default: 1e-8) 20 | weight_decay (float, optional): weight decay (L2 penalty) (default: 0) 21 | amsgrad (boolean, optional): whether to use the AMSGrad variant of this 22 | algorithm from the paper `On the Convergence of Adam and Beyond`_ 23 | (default: False) 24 | .. _diffGrad: An Optimization Method for Convolutional Neural Networks: 25 | https://arxiv.org/abs/1909.11015 26 | .. _Adam\: A Method for Stochastic Optimization: 27 | https://arxiv.org/abs/1412.6980 28 | .. _On the Convergence of Adam and Beyond: 29 | https://openreview.net/forum?id=ryQu7f-RZ 30 | """ 31 | 32 | def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0): 33 | if not 0.0 <= lr: 34 | raise ValueError("Invalid learning rate: {}".format(lr)) 35 | if not 0.0 <= eps: 36 | raise ValueError("Invalid epsilon value: {}".format(eps)) 37 | if not 0.0 <= betas[0] < 1.0: 38 | raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0])) 39 | if not 0.0 <= betas[1] < 1.0: 40 | raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1])) 41 | defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay) 42 | super(diffgrad, self).__init__(params, defaults) 43 | 44 | def __setstate__(self, state): 45 | super(diffgrad, self).__setstate__(state) 46 | 47 | def step(self, closure=None): 48 | """Performs a single optimization step. 49 | Arguments: 50 | closure (callable, optional): A closure that reevaluates the model 51 | and returns the loss. 52 | """ 53 | loss = None 54 | if closure is not None: 55 | loss = closure() 56 | 57 | for group in self.param_groups: 58 | for p in group['params']: 59 | if p.grad is None: 60 | continue 61 | grad = p.grad.data 62 | if grad.is_sparse: 63 | raise RuntimeError('diffGrad does not support sparse gradients, please consider SparseAdam instead') 64 | 65 | state = self.state[p] 66 | 67 | # State initialization 68 | if len(state) == 0: 69 | state['step'] = 0 70 | # Exponential moving average of gradient values 71 | state['exp_avg'] = torch.zeros_like(p.data) 72 | # Exponential moving average of squared gradient values 73 | state['exp_avg_sq'] = torch.zeros_like(p.data) 74 | # Previous gradient 75 | state['previous_grad'] = torch.zeros_like(p.data) 76 | 77 | exp_avg, exp_avg_sq, previous_grad = state['exp_avg'], state['exp_avg_sq'], state['previous_grad'] 78 | beta1, beta2 = group['betas'] 79 | 80 | state['step'] += 1 81 | 82 | if group['weight_decay'] != 0: 83 | grad.add_(group['weight_decay'], p.data) 84 | 85 | # Decay the first and second moment running average coefficient 86 | exp_avg.mul_(beta1).add_(1 - beta1, grad) 87 | exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) 88 | denom = exp_avg_sq.sqrt().add_(group['eps']) 89 | 90 | bias_correction1 = 1 - beta1 ** state['step'] 91 | bias_correction2 = 1 - beta2 ** state['step'] 92 | 93 | # compute diffgrad coefficient (dfc) 94 | diff = abs(previous_grad - grad) 95 | dfc = 1. / (1. + torch.exp(-diff)) 96 | #state['previous_grad'] = grad %used in paper but has the bug that previous grad is overwritten with grad and diff becomes always zero. Fixed in the next line. 97 | state['previous_grad'] = grad.clone() 98 | 99 | # update momentum with dfc 100 | exp_avg1 = exp_avg * dfc 101 | 102 | step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1 103 | 104 | p.data.addcdiv_(-step_size, exp_avg1, denom) 105 | 106 | return loss 107 | --------------------------------------------------------------------------------