├── .gitignore
├── LICENSE
├── README.md
├── images
│   └── rectified_adam.png
├── rectified_adam.py
└── tf_rectified_adam.py


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | *.egg-info/
 24 | .installed.cfg
 25 | *.egg
 26 | MANIFEST
 27 | 
 28 | # PyInstaller
 29 | #  Usually these files are written by a python script from a template
 30 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 31 | *.manifest
 32 | *.spec
 33 | 
 34 | # Installer logs
 35 | pip-log.txt
 36 | pip-delete-this-directory.txt
 37 | 
 38 | # Unit test / coverage reports
 39 | htmlcov/
 40 | .tox/
 41 | .coverage
 42 | .coverage.*
 43 | .cache
 44 | nosetests.xml
 45 | coverage.xml
 46 | *.cover
 47 | .hypothesis/
 48 | .pytest_cache/
 49 | 
 50 | # Translations
 51 | *.mo
 52 | *.pot
 53 | 
 54 | # Django stuff:
 55 | *.log
 56 | local_settings.py
 57 | db.sqlite3
 58 | 
 59 | # Flask stuff:
 60 | instance/
 61 | .webassets-cache
 62 | 
 63 | # Scrapy stuff:
 64 | .scrapy
 65 | 
 66 | # Sphinx documentation
 67 | docs/_build/
 68 | 
 69 | # PyBuilder
 70 | target/
 71 | 
 72 | # Jupyter Notebook
 73 | .ipynb_checkpoints
 74 | 
 75 | # pyenv
 76 | .python-version
 77 | 
 78 | # celery beat schedule file
 79 | celerybeat-schedule
 80 | 
 81 | # SageMath parsed files
 82 | *.sage.py
 83 | 
 84 | # Environments
 85 | .env
 86 | .venv
 87 | env/
 88 | venv/
 89 | ENV/
 90 | env.bak/
 91 | venv.bak/
 92 | 
 93 | # Spyder project settings
 94 | .spyderproject
 95 | .spyproject
 96 | 
 97 | # Rope project settings
 98 | .ropeproject
 99 | 
100 | # mkdocs documentation
101 | /site
102 | 
103 | # mypy
104 | .mypy_cache/
105 | 
106 | # logs
107 | logs/
108 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2019 Somshubra Majumdar
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Rectified Adam for Keras
 2 | 
 3 | Keras port of [Rectified Adam](https://github.com/LiyuanLucasLiu/RAdam), from the paper [On the Variance of the Adaptive Learning Rate and Beyond.](https://arxiv.org/abs/1908.03265)
 4 | 
 5 | # Rectified Adam
 6 | 
 7 | <img src="https://github.com/titu1994/keras_rectified_adam/blob/master/images/rectified_adam.png?raw=true" height=100% width=100%>
 8 | 
 9 | The image above is from the paper. One of the paper's contributions is the observation that Adam with warmup tends to perform better than Adam without it. Without warmup, the adaptive learning rate has a large variance during the initial iterations, because the second-moment estimate is built from only a few gradients. This variance causes updates to overshoot minima, which leads to poor optima.
10 | 
11 | Warmup, on the other hand, means training with a very low learning rate for the first few epochs to offset this large variance. However, the degree of warmup (how long it lasts and what learning rate to use) requires an extensive hyperparameter search, which is usually costly.
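
To make the tuning burden concrete, here is a minimal sketch of a linear warmup schedule in plain Python. It is not part of this repository; the function name `warmup_lr` and the parameters `base_lr` and `warmup_steps` are hypothetical, and warmup is counted in steps rather than epochs for simplicity. The two keyword arguments are the kind of values that would have to be searched over.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linearly ramp the learning rate up to base_lr (hypothetical helper)."""
    if warmup_steps <= 0:
        return base_lr
    # step is 0-indexed; the ramp reaches base_lr at step == warmup_steps - 1.
    scale = min(1.0, (step + 1) / warmup_steps)
    return base_lr * scale


# Print the learning rate at a few points of the schedule.
for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step))
```

Both the length of the ramp and the target learning rate interact with the model and the batch size, which is why the search tends to be expensive.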
12 | 
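13 | Rectified Adam therefore proposes a dynamic variance reduction algorithm that rectifies the adaptive learning rate at each step instead of relying on a hand-tuned warmup schedule.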
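
The rectification can be sketched in plain Python, using the same formulas as `rectified_adam.py` in this repository. The function names below are hypothetical, and the threshold of 5 is taken from the repository code. The bias-correction factor `(1 - beta_2 ** t)` that the repository folds into the same square root is left out here, so that only the rectification factor itself is shown.

```python
import math


def sma_length(t, beta_2=0.999):
    """Approximate simple-moving-average length N_sma at step t (1-indexed)."""
    beta2_t = beta_2 ** t
    n_sma_max = 2.0 / (1.0 - beta_2) - 1.0
    return n_sma_max - 2.0 * t * beta2_t / (1.0 - beta2_t)


def rectification(t, beta_2=0.999, threshold=5.0):
    """Return the rectification factor, or None while N_sma <= threshold.

    None means the adaptive (second-moment) term is skipped and the update
    falls back to a momentum-only step, as in `lt_path` in the repository.
    """
    n_sma_max = 2.0 / (1.0 - beta_2) - 1.0
    n_sma = sma_length(t, beta_2)
    if n_sma <= threshold:
        return None
    return math.sqrt((n_sma - 4.0) / (n_sma_max - 4.0)
                     * (n_sma - 2.0) / n_sma
                     * n_sma_max / (n_sma_max - 2.0))


# Show how the factor grows from a small value toward 1 as training proceeds.
for t in (1, 5, 6, 100, 10000):
    print(t, round(sma_length(t), 3), rectification(t))
```

With the default `beta_2`, the factor is unavailable for the first few steps, starts small once it switches on, and approaches 1 as `t` grows. This acts like a warmup that is derived from `beta_2` rather than tuned by hand.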
14 | 
15 | ## Usage
16 | 
17 | Add the `rectified_adam.py` script to your project and import it. `RectifiedAdam` can be used as a drop-in replacement for the `Adam` optimizer.
18 | 
19 | Note: currently only the basic Rectified Adam is supported, not the EMA buffered variant, because Keras cannot index the current timestep
20 | when in graph mode. This will probably be fixed in TensorFlow 2.0 eager execution mode.
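
For context, the buffering attempted by the commented-out `EMARectifiedAdam` class in `rectified_adam.py` can be sketched in plain Python, where the step number is an ordinary integer. The class name `StepSizeBuffer` and its `computations` counter are hypothetical and exist only for illustration. The slot layout `[step, N_sma, step_size]` and the 10 slots follow the commented-out code.

```python
import math


class StepSizeBuffer:
    """Hypothetical cache of (N_sma, step_size), keyed by the step number."""

    def __init__(self, slots=10, lr=1e-3, beta_1=0.9, beta_2=0.999):
        # Each slot holds [step, N_sma, step_size].
        self.slots = [[None, None, None] for _ in range(slots)]
        self.lr, self.beta_1, self.beta_2 = lr, beta_1, beta_2
        self.computations = 0  # counts cache misses, for illustration only

    def lookup(self, t):
        """Return (N_sma, step_size) for the 1-indexed integer step t."""
        entry = self.slots[t % len(self.slots)]
        if entry[0] == t:
            return entry[1], entry[2]  # already computed for this step

        self.computations += 1
        beta2_t = self.beta_2 ** t
        n_sma_max = 2.0 / (1.0 - self.beta_2) - 1.0
        n_sma = n_sma_max - 2.0 * t * beta2_t / (1.0 - beta2_t)
        if n_sma > 5:
            step_size = self.lr * math.sqrt(
                (1.0 - beta2_t) * (n_sma - 4.0) / (n_sma_max - 4.0)
                * (n_sma - 2.0) / n_sma * n_sma_max / (n_sma_max - 2.0)
            ) / (1.0 - self.beta_1 ** t)
        else:
            step_size = self.lr / (1.0 - self.beta_1 ** t)
        entry[0], entry[1], entry[2] = t, n_sma, step_size
        return n_sma, step_size


buffer = StepSizeBuffer()
for _ in range(3):  # e.g. three parameter tensors updated at the same step
    buffer.lookup(7)
print(buffer.computations)
```

The list index `t % len(self.slots)` has to be a Python integer. In graph mode the timestep is a symbolic tensor, so this lookup cannot be expressed the same way, which is the limitation the note above describes.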
21 | 
22 | ```python
23 | from rectified_adam import RectifiedAdam
24 | 
25 | optm = RectifiedAdam(lr=1e-3)
26 | ```
27 | 
28 | 
29 | # Requirements
30 | - Keras 2.2.4+ and TensorFlow 1.12+ (only the TF backend is supported for now).
31 | - Numpy
32 | 


--------------------------------------------------------------------------------
/images/rectified_adam.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/titu1994/keras_rectified_adam/3f24787b2ef8fdd650136db38ecea06b0c693f07/images/rectified_adam.png


--------------------------------------------------------------------------------
/rectified_adam.py:
--------------------------------------------------------------------------------
  1 | from keras import backend as K
  2 | from keras.optimizers import Optimizer
  3 | 
  4 | 
  5 | # Ported from https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam.py
  6 | class RectifiedAdam(Optimizer):
  7 |     """RectifiedAdam optimizer.
  8 | 
  9 |     Default parameters follow those provided in the original paper.
 10 | 
 11 |     # Arguments
 12 |         lr: float >= 0. Learning rate.
 13 |         beta_1: float, 0 < beta < 1. Generally close to 1. Decay rate of
 14 |             the first-moment (mean) estimate.
 15 |         beta_2: float, 0 < beta < 1. Generally close to 1. Decay rate of
 16 |             the second-moment (uncentered variance) estimate.
 17 |         epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
 18 |         decay: float >= 0. Learning rate decay over each update.
 19 |         weight_decay: float >= 0. Weight decay coefficient. When non-zero,
 20 |             each update first subtracts `weight_decay * lr * p` from `p`.
 21 |             A value of 0 disables weight decay.
 22 | 
 23 |     # References
 24 |         - [On the Variance of the Adaptive Learning Rate and Beyond]
 25 |           (https://arxiv.org/abs/1908.03265)
 26 |         - [Adam - A Method for Stochastic Optimization]
 27 |           (https://arxiv.org/abs/1412.6980v8)
 28 |         - [On the Convergence of Adam and Beyond]
 29 |           (https://openreview.net/forum?id=ryQu7f-RZ)
 30 |     """
 31 | 
 32 |     def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999,
 33 |                  epsilon=None, decay=0., weight_decay=0.0, **kwargs):
 34 |         super(RectifiedAdam, self).__init__(**kwargs)
 35 | 
 36 |         with K.name_scope(self.__class__.__name__):
 37 |             self.iterations = K.variable(0, dtype='int64', name='iterations')
 38 |             self.lr = K.variable(lr, name='lr')
 39 |             self.beta_1 = K.variable(beta_1, name='beta_1')
 40 |             self.beta_2 = K.variable(beta_2, name='beta_2')
 41 |             self.decay = K.variable(decay, name='decay')
 42 | 
 43 |         if epsilon is None:
 44 |             epsilon = K.epsilon()
 45 |         self.epsilon = epsilon
 46 |         self.initial_decay = decay
 47 | 
 48 |         self.weight_decay = float(weight_decay)
 49 | 
 50 |     def get_updates(self, loss, params):
 51 |         grads = self.get_gradients(loss, params)
 52 |         self.updates = [K.update_add(self.iterations, 1)]
 53 | 
 54 |         lr = self.lr
 55 |         if self.initial_decay > 0:
 56 |             lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
 57 |                                                       K.dtype(self.decay))))
 58 | 
 59 |         t = K.cast(self.iterations, K.floatx()) + 1
 60 | 
 61 |         ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
 62 |         vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
 63 |         self.weights = [self.iterations] + ms + vs
 64 | 
 65 |         for p, g, m, v in zip(params, grads, ms, vs):
 66 |             m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
 67 |             v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
 68 | 
 69 |             beta2_t = self.beta_2 ** t
 70 |             N_sma_max = 2 / (1 - self.beta_2) - 1
 71 |             N_sma = N_sma_max - 2 * t * beta2_t / (1 - beta2_t)
 72 | 
 73 |             # apply weight decay
 74 |             if self.weight_decay != 0.:
 75 |                 p_wd = p - self.weight_decay * lr * p
 76 |             else:
 77 |                 p_wd = None
 78 | 
 79 |             if p_wd is None:
 80 |                 p_ = p
 81 |             else:
 82 |                 p_ = p_wd
 83 | 
 84 |             def gt_path():
 85 |                 step_size = lr * K.sqrt(
 86 |                     (1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max /
 87 |                     (N_sma_max - 2)) / (1 - self.beta_1 ** t)
 88 | 
 89 |                 denom = K.sqrt(v_t) + self.epsilon
 90 |                 p_t = p_ - step_size * (m_t / denom)
 91 | 
 92 |                 return p_t
 93 | 
 94 |             def lt_path():
 95 |                 step_size = lr / (1 - self.beta_1 ** t)
 96 |                 p_t = p_ - step_size * m_t
 97 | 
 98 |                 return p_t
 99 | 
100 |             p_t = K.switch(N_sma > 5, gt_path, lt_path)
101 | 
102 |             self.updates.append(K.update(m, m_t))
103 |             self.updates.append(K.update(v, v_t))
104 |             new_p = p_t
105 | 
106 |             # Apply constraints.
107 |             if getattr(p, 'constraint', None) is not None:
108 |                 new_p = p.constraint(new_p)
109 | 
110 |             self.updates.append(K.update(p, new_p))
111 |         return self.updates
112 | 
113 |     def get_config(self):
114 |         config = {'lr': float(K.get_value(self.lr)),
115 |                   'beta_1': float(K.get_value(self.beta_1)),
116 |                   'beta_2': float(K.get_value(self.beta_2)),
117 |                   'decay': float(K.get_value(self.decay)),
118 |                   'epsilon': self.epsilon,
119 |                   'weight_decay': self.weight_decay}
120 |         base_config = super(RectifiedAdam, self).get_config()
121 |         return dict(list(base_config.items()) + list(config.items()))
122 | 
123 | 
124 | # Ported from https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam.py
125 | # class EMARectifiedAdam(Optimizer):
126 | #     """EMARectifiedAdam optimizer.
127 | #
128 | #     Default parameters follow those provided in the original paper.
129 | #
130 | #     # Arguments
131 | #         lr: float >= 0. Learning rate.
132 | #         beta_1: float, 0 < beta < 1. Generally close to 1.
133 | #         beta_2: float, 0 < beta < 1. Generally close to 1.
134 | #         epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
135 | #         decay: float >= 0. Learning rate decay over each update.
136 | #         weight_decay: float >= 0. Weight decay coefficient. A value
137 | #             of 0 disables weight decay.
138 | #         buffer: list of `[step, N_sma, step_size]` entries that caches
139 | #             the step size per timestep. If `None`, 10 empty entries
140 | #             are created.
141 | #
142 | #     # References
143 | #         - [On the Variance of the Adaptive Learning Rate and Beyond]
144 | #           (https://arxiv.org/abs/1908.03265)
145 | #         - [Adam - A Method for Stochastic Optimization]
146 | #           (https://arxiv.org/abs/1412.6980v8)
147 | #         - [On the Convergence of Adam and Beyond]
148 | #           (https://openreview.net/forum?id=ryQu7f-RZ)
149 | #     """
150 | #
151 | #     def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999,
152 | #                  epsilon=None, decay=0., weight_decay=0.0,
153 | #                  buffer=None, **kwargs):
154 | #         super(EMARectifiedAdam, self).__init__(**kwargs)
155 | #
156 | #         with K.name_scope(self.__class__.__name__):
157 | #             self.iterations = K.variable(0, dtype='int64', name='iterations')
158 | #             self.lr = K.variable(lr, name='lr')
159 | #             self.beta_1 = K.variable(beta_1, name='beta_1')
160 | #             self.beta_2 = K.variable(beta_2, name='beta_2')
161 | #             self.decay = K.variable(decay, name='decay')
162 | #
163 | #         if epsilon is None:
164 | #             epsilon = K.epsilon()
165 | #         self.epsilon = epsilon
166 | #         self.initial_decay = decay
167 | #
168 | #         self.weight_decay = float(weight_decay)
169 | #
170 | #         if buffer is None:
171 | #             self.buffer = [[None, None, None]
172 | #                            for _ in range(10)]
173 | #         else:
174 | #             self.buffer = buffer
175 | #
176 | #     def get_updates(self, loss, params):
177 | #         grads = self.get_gradients(loss, params)
178 | #         self.updates = [K.update_add(self.iterations, 1)]
179 | #
180 | #         lr = self.lr
181 | #         if self.initial_decay > 0:
182 | #             lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
183 | #                                                       K.dtype(self.decay))))
184 | #
185 | #         t = K.cast(self.iterations, K.floatx()) + 1
186 | #         t_int = K.cast(self.iterations, 'int64') + 1
187 | #         buffered = self.buffer[(t_int % 10)]
188 | #
189 | #         ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
190 | #         vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
191 | #         self.weights = [self.iterations] + ms + vs
192 | #
193 | #         for p, g, m, v in zip(params, grads, ms, vs):
194 | #             m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
195 | #             v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
196 | #
197 | #             if t_int == buffered[0]:
198 | #
199 | #                 N_sma, step_size = buffered[1], buffered[2]
200 | #
201 | #             else:
202 | #                 buffered[0] = t_int
203 | #
204 | #                 beta2_t = self.beta_2 ** t
205 | #                 N_sma_max = 2 / (1 - self.beta_2) - 1
206 | #                 N_sma = N_sma_max - 2 * t * beta2_t / (1 - beta2_t)
207 | #                 buffered[1] = N_sma
208 | #
209 | #                 def gt_step_path():
210 | #                     step_size = lr * K.sqrt(
211 | #                         (1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max /
212 | #                         (N_sma_max - 2)) / (1 - self.beta_1 ** t)
213 | #
214 | #                     return step_size
215 | #
216 | #                 def lt_step_path():
217 | #                     step_size = lr / (1 - self.beta_1 ** t)
218 | #
219 | #                     return step_size
220 | #
221 | #                 step_size = K.switch(N_sma > 5, gt_step_path, lt_step_path)
222 | #                 buffered[2] = step_size
223 | #
224 | #             # apply weight decay
225 | #             if self.weight_decay != 0.:
226 | #                 p_wd = p - self.weight_decay * lr * p
227 | #             else:
228 | #                 p_wd = None
229 | #
230 | #             if p_wd is None:
231 | #                 p_ = p
232 | #             else:
233 | #                 p_ = p_wd
234 | #
235 | #             def gt_sma_path():
236 | #                 denom = K.sqrt(v_t) + self.epsilon
237 | #                 p_t = p_ - step_size * (m_t / denom)
238 | #
239 | #                 return p_t
240 | #
241 | #             def lt_sma_path():
242 | #                 p_t = p_ - step_size * m_t
243 | #
244 | #                 return p_t
245 | #
246 | #             p_t = K.switch(N_sma > 4, gt_sma_path, lt_sma_path)
247 | #
248 | #             self.updates.append(K.update(m, m_t))
249 | #             self.updates.append(K.update(v, v_t))
250 | #             new_p = p_t
251 | #
252 | #             # Apply constraints.
253 | #             if getattr(p, 'constraint', None) is not None:
254 | #                 new_p = p.constraint(new_p)
255 | #
256 | #             self.updates.append(K.update(p, new_p))
257 | #         return self.updates
258 | #
259 | #     def get_config(self):
260 | #         buffer = [[]]
261 | #
262 | #         config = {'lr': float(K.get_value(self.lr)),
263 | #                   'beta_1': float(K.get_value(self.beta_1)),
264 | #                   'beta_2': float(K.get_value(self.beta_2)),
265 | #                   'decay': float(K.get_value(self.decay)),
266 | #                   'epsilon': self.epsilon,
267 | #                   'weight_decay': self.weight_decay,
268 | #                   'buffer': self.buffer}
269 | #         base_config = super(EMARectifiedAdam, self).get_config()
270 | #         return dict(list(base_config.items()) + list(config.items()))
271 | 
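
As a reading aid for `RectifiedAdam.get_updates` in this file, the update for a single scalar parameter can be sketched in plain Python. This is an illustrative sketch, not code from the repository: the function name `radam_step` is hypothetical, the learning-rate `decay` schedule and constraints are omitted, and the default `epsilon=1e-7` assumes the usual Keras backend default.

```python
import math


def radam_step(p, g, m, v, t, lr=1e-3, beta_1=0.9, beta_2=0.999,
               epsilon=1e-7, weight_decay=0.0):
    """One scalar update; returns the new (p, m, v). t is 1-indexed."""
    # Exponential moving averages of the gradient and the squared gradient.
    m_t = beta_1 * m + (1.0 - beta_1) * g
    v_t = beta_2 * v + (1.0 - beta_2) * g * g

    beta2_t = beta_2 ** t
    n_sma_max = 2.0 / (1.0 - beta_2) - 1.0
    n_sma = n_sma_max - 2.0 * t * beta2_t / (1.0 - beta2_t)

    # Weight decay is applied to the parameter itself, not to the gradient.
    p_ = p - weight_decay * lr * p if weight_decay != 0.0 else p

    if n_sma > 5:
        # Rectified adaptive step (the gt_path branch).
        step_size = lr * math.sqrt(
            (1.0 - beta2_t) * (n_sma - 4.0) / (n_sma_max - 4.0)
            * (n_sma - 2.0) / n_sma * n_sma_max / (n_sma_max - 2.0)
        ) / (1.0 - beta_1 ** t)
        p_t = p_ - step_size * m_t / (math.sqrt(v_t) + epsilon)
    else:
        # Momentum-only step while the variance estimate is unreliable
        # (the lt_path branch).
        step_size = lr / (1.0 - beta_1 ** t)
        p_t = p_ - step_size * m_t
    return p_t, m_t, v_t


# Minimize f(p) = p ** 2, whose gradient is 2 * p, for a few steps.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    p, m, v = radam_step(p, 2.0 * p, m, v, t)
print(p)
```

Comparing the outputs of a sketch like this against the Keras optimizer on a one-parameter model is one way to sanity-check a port, provided the same `epsilon` and step indexing are used.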


--------------------------------------------------------------------------------
/tf_rectified_adam.py:
--------------------------------------------------------------------------------
 1 | import tensorflow as tf
 2 | from tensorflow.python.keras.optimizer_v2.optimizer_v2 import OptimizerV2
 3 | 
 4 | 
 5 | class RectifiedAdam(OptimizerV2):
 6 | 
 7 |     def __init__(self,
 8 |                  learning_rate=0.001,
 9 |                  beta_1=0.9,
10 |                  beta_2=0.999,
11 |                  epsilon=None,
12 |                  weight_decay=0.0,
13 |                  name='RectifiedAdam', **kwargs):
14 |         super(RectifiedAdam, self).__init__(name, **kwargs)
15 | 
16 |         self._set_hyper('learning_rate', kwargs.get('lr', learning_rate))
17 |         self._set_hyper('beta_1', beta_1)
18 |         self._set_hyper('beta_2', beta_2)
19 |         self._set_hyper('decay', self._initial_decay)
20 |         self.epsilon = epsilon or tf.keras.backend.epsilon()
21 |         self.weight_decay = weight_decay
22 | 
23 |     def _create_slots(self, var_list):
24 |         for var in var_list:
25 |             self.add_slot(var, 'm')
26 |         for var in var_list:
27 |             self.add_slot(var, 'v')
28 | 
29 |     def _resource_apply_dense(self, grad, var):
30 |         var_dtype = var.dtype.base_dtype
31 |         lr_t = self._decayed_lr(var_dtype)
32 |         m = self.get_slot(var, 'm')
33 |         v = self.get_slot(var, 'v')
34 |         beta_1_t = self._get_hyper('beta_1', var_dtype)
35 |         beta_2_t = self._get_hyper('beta_2', var_dtype)
36 |         epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
37 |         t = tf.cast(self.iterations + 1, var_dtype)
38 | 
39 |         m_t = (beta_1_t * m) + (1. - beta_1_t) * grad
40 |         v_t = (beta_2_t * v) + (1. - beta_2_t) * tf.square(grad)
41 | 
42 |         beta2_t = beta_2_t ** t
43 |         N_sma_max = 2 / (1 - beta_2_t) - 1
44 |         N_sma = N_sma_max - 2 * t * beta2_t / (1 - beta2_t)
45 | 
46 |         # apply weight decay
47 |         if self.weight_decay != 0.:
48 |             p_wd = var - self.weight_decay * lr_t * var
49 |         else:
50 |             p_wd = None
51 | 
52 |         if p_wd is None:
53 |             p_ = var
54 |         else:
55 |             p_ = p_wd
56 | 
57 |         def gt_path():
58 |             step_size = lr_t * tf.sqrt(
59 |                 (1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max /
60 |                 (N_sma_max - 2)) / (1 - beta_1_t ** t)
61 | 
62 |             denom = tf.sqrt(v_t) + epsilon_t
63 |             p_t = p_ - step_size * (m_t / denom)
64 | 
65 |             return p_t
66 | 
67 |         def lt_path():
68 |             step_size = lr_t / (1 - beta_1_t ** t)
69 |             p_t = p_ - step_size * m_t
70 | 
71 |             return p_t
72 | 
73 |         p_t = tf.cond(N_sma > 5, gt_path, lt_path)
74 | 
75 |         m_t = tf.compat.v1.assign(m, m_t)
76 |         v_t = tf.compat.v1.assign(v, v_t)
77 | 
78 |         with tf.control_dependencies([m_t, v_t]):
79 |             param_update = tf.compat.v1.assign(var, p_t)
80 |             return tf.group(*[param_update, m_t, v_t])
81 | 
82 |     def _resource_apply_sparse(self, grad, handle, indices):
83 |         raise NotImplementedError("Sparse data is not supported yet")
84 | 
85 |     def get_config(self):
86 |         config = super(RectifiedAdam, self).get_config()
87 |         config.update({
88 |             'learning_rate': self._serialize_hyperparameter('learning_rate'),
89 |             'decay': self._serialize_hyperparameter('decay'),
90 |             'beta_1': self._serialize_hyperparameter('beta_1'),
91 |             'beta_2': self._serialize_hyperparameter('beta_2'),
92 |             'epsilon': self.epsilon,
93 |             'weight_decay': self.weight_decay,
94 |         })
95 |         return config
96 | 


--------------------------------------------------------------------------------