├── README.md
├── doc
│   ├── 1705.08292.pdf
│   ├── ON LARGE-BATCH TRAINING FOR DEEP LEARNING.pdf
│   ├── compstat-2010.pdf
│   └── dl-optimization.pdf
├── md_pic
│   ├── 20160909001936276.gif
│   ├── 20170521223752521.png
│   ├── 20170521224015304.png
│   ├── 20170605150229100.png
│   ├── 搜狗截图20180101204317.png
│   ├── 搜狗截图20180101205314.png
│   ├── 搜狗截图20180101211507.png
│   ├── 搜狗截图20180101212512.png
│   ├── 搜狗截图20180101212816.png
│   ├── 搜狗截图20180101212904.png
│   ├── 搜狗截图20180101214020.png
│   └── 搜狗截图20180101214206.png
└── python
    └── optimize.py

/README.md:
--------------------------------------------------------------------------------
# Abstract
> Optimization algorithms minimize (or maximize) the loss function E(x) by improving the way the network is trained.
---
# Local optima
> Local minima and saddle points. In neural networks, a key challenge of minimizing the non-convex error function is avoiding getting trapped in one of its many suboptimal local minima. In practice, the difficulty usually comes not from local minima but from saddle points, i.e. points where one dimension slopes upward and another slopes downward. Saddle points are typically surrounded by a plateau of nearly constant error, which makes them hard for SGD to escape, because the gradient is close to zero in every dimension.

![formula](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/20160909001936276.gif)

# Batch optimization
> 1. It is hard to choose a suitable learning rate. Too small a learning rate makes convergence very slow, while too large a learning rate can prevent convergence, causing the loss to oscillate around the minimum or even diverge.
2. Moreover, a single learning rate is not appropriate for all parameter updates. If the training data is sparse and feature frequencies differ widely, the parameters should not all be updated by the same amount; rarely occurring features should receive larger updates.

## Stochastic gradient descent
> Updates the parameters for every training example, so each individual update is fast.
The frequent updates give the parameter trajectory high variance, and the loss fluctuates with varying intensity. This is actually helpful, because it lets the search jump to new and possibly better local minima, whereas standard gradient descent only converges to a single local optimum.
The drawback of SGD is that the frequent updates and fluctuations make it hard to settle exactly at the minimum, and the fluctuations cause overshooting.

## Batch gradient descent
> Traditional batch gradient descent computes the gradient over the entire dataset but performs only one update per pass, so it is slow and hard to control on large datasets and may even exhaust memory.
The speed of the weight updates is governed by the learning rate η; on convex error surfaces it converges to the global optimum, while on non-convex surfaces it may converge to a local optimum.
Another problem with standard batch gradient descent is that it does redundant work on large datasets, recomputing gradients for very similar examples before each update.

## Mini-batch gradient descent
> Mini-batch gradient descent computes each update on a small batch of samples, combining the stability of batch gradient descent with the speed of SGD.
GPUs are said to perform better with batch sizes that are powers of two, so 16, 32, 64, 128, ... often works better than multiples of 10 or 100.
The batch size is also bounded by available GPU memory. After all samples have been processed in one epoch, the sample order can be shuffled before the next epoch.

# First-order optimization
> Uses the gradient of each parameter to minimize or maximize the loss function E(x).

# Learning rate optimization
> Two broad approaches: adjusting a single global learning rate, or adjusting the learning rate of each individual parameter.
- Learning rate decay
> Reduce the learning rate every few epochs: typically halve it roughly every 5 epochs, or divide it by 10 every 20 epochs.
Linear decay, e.g. halve the learning rate every 5 epochs.
Exponential decay, e.g. multiply the learning rate by 0.1 every 5 epochs.
> Exponentially weighted moving average

![formula](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101204317.png)
> Bias-corrected exponentially weighted average

![corrected formula](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101205314.png)

- Momentum gradient descent (Momentum)
> Adds inertia to the updates: each step is an exponentially decaying accumulation of past gradients.

![formula](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/20170521223752521.png)

![algorithm](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/20170521224015304.png)

- NAG (Nesterov Momentum)
> Momentum combines the accumulated previous descent direction with the gradient at the current point. NAG instead first takes a small step along the accumulated historical direction, then evaluates the "lookahead" gradient at that position and merges it with the accumulated direction.

![formula](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101211507.png)
- AdaGrad
> The learning rate η changes at every iteration according to the history of the gradients.
> For each parameter, the squared gradients of all past iterations are accumulated and the square root is taken; the base learning rate divided by this quantity gives the dynamically scaled step.

![formula](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101212512.png)
> The effective learning rate is monotonically decreasing, so it becomes very small late in training.
A global initial learning rate still has to be set by hand.
The adjustment is often too aggressive and can end learning prematurely.
It keeps a separate learning rate for each parameter, which improves performance on problems with sparse gradients (e.g. natural language and computer vision tasks).
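The per-parameter update rules discussed above can be written in a few lines of NumPy. This is only a rough sketch with illustrative names (`theta`, `v`, `cache`, `grad_fn` are made up for this example), not the implementation used in `python/optimize.py`:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    # classical momentum: accumulate an exponentially decaying sum of past gradients
    v = beta * v - lr * grad
    return theta + v, v

def nag_step(theta, v, grad_fn, lr=0.01, beta=0.9):
    # Nesterov momentum: evaluate the gradient at the lookahead point theta + beta * v
    g = grad_fn(theta + beta * v)
    v = beta * v - lr * g
    return theta + v, v

def adagrad_step(theta, cache, grad, lr=0.01, eps=1e-8):
    # AdaGrad: per-parameter step size scaled by the accumulated squared gradients
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# toy usage: minimize f(x) = x^2 starting from x = 5 with AdaGrad
theta, cache = np.array([5.0]), np.zeros(1)
for _ in range(200):
    grad = 2 * theta  # gradient of x^2
    theta, cache = adagrad_step(theta, cache, grad, lr=0.5)
```

The optimizer state (`v` or `cache`) has the same shape as the parameters and is carried across iterations; RMSprop and AdaDelta below replace AdaGrad's ever-growing sum in `cache` with an exponentially decaying average.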
- Rprop
- RMSprop
> Never formally published (it was proposed by Hinton in his Coursera course); it adapts a per-parameter learning rate by dividing by a moving average of recent gradient magnitudes (the root of an exponentially decaying average of squared gradients).

![formula](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101214206.png)
- Adam
> Combines Momentum and RMSprop.

# Second-order optimization
> Uses second derivatives (the Hessian) to minimize or maximize the loss function. Because computing second derivatives is expensive, these methods are not widely used.
If the first derivative is estimated poorly, the estimate of the second derivative carries an even larger error, which is fatal for these algorithms. For second-order methods, the speedup gained from a smaller batch is far outweighed by the performance loss caused by the extra noise, so second-order methods usually require large batches.
## Newton's method
## Quasi-Newton methods
> Newton's method has a drawback: the Hessian is a dense matrix, and with many parameters it is far too expensive to compute and store; quasi-Newton methods approximate it instead.

![optimizer comparison](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101212904.png)
- AdaDelta
> Restricts the accumulation to a window of size w, so only the gradients of recent steps are used (in practice an exponentially decaying average), which avoids AdaGrad's monotonically shrinking learning rate.

![optimizer comparison](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101212816.png)
![optimizer comparison](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/搜狗截图20180101214020.png)
- Hessian
## Others
- Conjugate gradient method
- Heuristic optimization
> Simulated annealing, genetic algorithms, ant colony optimization, and particle swarm optimization.

# [Code demo](https://github.com/gdyshi/ml_optimize.git)
# Summary
> By comparison, [the paper](doc/1705.08292.pdf) (Wilson et al., "The Marginal Value of Adaptive Gradient Methods in Machine Learning") reaches the following conclusion: adaptive optimization algorithms usually generalize worse than SGD (often much worse), even though they tend to perform better during training, so they should be used with care.

![optimizer comparison](https://raw.githubusercontent.com/gdyshi/ml_optimize/master/md_pic/20170605150229100.png)

> Adam is the recommended default; it usually works better than RMSProp. SGD + Nesterov Momentum is also worth trying.
---
References
- [An overview of neural network optimization algorithms: from gradient descent to Adam](http://www.sohu.com/a/149921578_610300)
- [A survey of neural network optimization algorithms](http://blog.csdn.net/young_gy/article/details/72633202)
- [On Batch_Size in deep learning](http://blog.csdn.net/ycheng_sjtu/article/details/49804041)
- [How does batch size affect training in deep learning?](https://www.zhihu.com/question/32673260)
- [Adaptive learning rates: AdaDelta](https://www.cnblogs.com/neopenx/p/4768388.html)
- [A comparison of optimization algorithms for convolutional neural networks](http://shuokay.com/2016/06/11/optimization/)
- [Optimization methods for deep learning](http://blog.csdn.net/BVL10101111/article/details/72614711)
- [Tutorial | The Adam optimization algorithm](http://www.sohu.com/a/156495506_465975)
- [[Math] Common optimization methods](https://www.cnblogs.com/maybe2030/p/4751804.html)
--------------------------------------------------------------------------------
/doc/1705.08292.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/doc/1705.08292.pdf
--------------------------------------------------------------------------------
/doc/ON LARGE-BATCH TRAINING FOR DEEP LEARNING.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/doc/ON LARGE-BATCH TRAINING FOR DEEP LEARNING.pdf
--------------------------------------------------------------------------------
/doc/compstat-2010.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/doc/compstat-2010.pdf
--------------------------------------------------------------------------------
/doc/dl-optimization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/doc/dl-optimization.pdf
--------------------------------------------------------------------------------
/md_pic/20160909001936276.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/20160909001936276.gif
--------------------------------------------------------------------------------
/md_pic/20170521223752521.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/20170521223752521.png
--------------------------------------------------------------------------------
/md_pic/20170521224015304.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/20170521224015304.png
--------------------------------------------------------------------------------
/md_pic/20170605150229100.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/20170605150229100.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101204317.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101204317.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101205314.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101205314.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101211507.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101211507.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101212512.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101212512.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101212816.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101212816.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101212904.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101212904.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101214020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101214020.png
--------------------------------------------------------------------------------
/md_pic/搜狗截图20180101214206.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gdyshi/ml_optimize/2489da6658c2f1b37a92a1f95b224aba7f8a386e/md_pic/搜狗截图20180101214206.png
--------------------------------------------------------------------------------
/python/optimize.py:
--------------------------------------------------------------------------------
# Copyright 2017 gdyshi. All Rights Reserved.
# github: https://github.com/gdyshi
# ==============================================================================

import argparse
import sys

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

FLAGS = None

# fully connected network: 784 inputs, three hidden layers, 10 outputs
layers = [784, 270, 90, 30, 10]
TRAINING_STEPS = 20000

# optimizer selectors
LR_NONE = 'none'
LR_MOMENTUM = 'momentum'
LR_NAG = 'nag'
LR_ADAGRAD = 'adagrad'
LR_RMSPROP = 'RMSprop'
LR_ADAM = 'Adam'
LR_ADADELTA = 'adadelta'

# learning-rate decay selectors (plain gradient descent with a decayed learning rate)
LR_DECAY_LINER = 'LINER'  # linear (inverse-time) decay
LR_DECAY_EXP = 'EXP'  # exponential decay
LR_DECAY_NATURAL_EXP = 'NATURAL_EXP'  # natural exponential decay (base e)
# LR_DECAY_polynomial = 'polynomial'  # polynomial decay

# suggested initial learning rates:
#   default: 0.02
#   LINER / EXP / NATURAL_EXP: 0.2
#   ADAM: 0.001
learning_rate = 0.02
batch_size = 50  # mini-batch size

OPTIMIZE_METHORD = LR_ADAM


def accuracy(y_pred, y_real):
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_real, 1))
    acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    return acc


def main(_):
    # Import data
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

    # Create the model
    x = tf.placeholder(tf.float32, [None, 784])
    for i in range(0, len(layers) - 1):
        # input of this layer: the placeholder for the first layer, otherwise the previous output
        X = x if i == 0 else y

        node_in = layers[i]
        node_out = layers[i + 1]
        W = tf.Variable(np.random.randn(node_in, node_out).astype('float32') / (np.sqrt(node_in)))
        b = tf.Variable(np.random.randn(node_out).astype('float32'))
        z = tf.matmul(X, W) + b
        y = tf.nn.tanh(z)

    # Define loss and optimizer
    y_ = tf.placeholder(tf.float32, [None, 10])
    loss = tf.reduce_mean(tf.norm(y_ - y, axis=1) ** 2) / 2  # mean squared error
    global_step = tf.Variable(0, trainable=False)

    # learning-rate decay schedules (used with plain gradient descent)
    if OPTIMIZE_METHORD == LR_DECAY_EXP:
        # decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
        decayed_learning_rate = tf.train.exponential_decay(learning_rate, global_step, 1000, 0.96, staircase=True)
    elif OPTIMIZE_METHORD == LR_DECAY_NATURAL_EXP:
        # decayed_learning_rate = learning_rate * exp(-decay_rate * global_step)
        decayed_learning_rate = tf.train.natural_exp_decay(learning_rate, global_step, 1000, 0.96, staircase=True)
    elif OPTIMIZE_METHORD == LR_DECAY_LINER:
        # decayed_learning_rate = learning_rate / (1 + decay_rate * t)
        decayed_learning_rate = tf.train.inverse_time_decay(learning_rate, global_step, 1000, 0.96, staircase=True)
    # elif OPTIMIZE_METHORD == LR_DECAY_polynomial:
    #     # global_step = min(global_step, decay_steps)
    #     # decayed_learning_rate = (learning_rate - end_learning_rate) *
    #     #                         (1 - global_step / decay_steps) ^ (power) +
    #     #                         end_learning_rate
    #     decayed_learning_rate = tf.train.polynomial_decay(learning_rate, global_step, 100, 50)
    else:
        decayed_learning_rate = learning_rate

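    # Optimizer selection: the constants above map onto TensorFlow 1.x optimizers,
    # mirroring the methods discussed in the README:
    #   LR_MOMENTUM  -> tf.train.MomentumOptimizer (momentum = 0.5)
    #   LR_NAG       -> tf.train.MomentumOptimizer with use_nesterov=True
    #   LR_ADAGRAD   -> tf.train.AdagradOptimizer
    #   LR_RMSPROP   -> tf.train.RMSPropOptimizer
    #   LR_ADAM      -> tf.train.AdamOptimizer
    #   LR_ADADELTA  -> tf.train.AdadeltaOptimizer
    # Any other value (including the decay selectors above) falls back to plain
    # tf.train.GradientDescentOptimizer with decayed_learning_rate.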
    if OPTIMIZE_METHORD == LR_MOMENTUM:
        train_step = tf.train.MomentumOptimizer(decayed_learning_rate, 0.5).minimize(loss,
                                                                                      global_step=global_step)
    elif OPTIMIZE_METHORD == LR_NAG:
        train_step = tf.train.MomentumOptimizer(decayed_learning_rate, 0.5, use_nesterov=True).minimize(loss,
                                                                                                         global_step=global_step)
    elif OPTIMIZE_METHORD == LR_ADAGRAD:
        train_step = tf.train.AdagradOptimizer(decayed_learning_rate).minimize(loss, global_step=global_step)
    elif OPTIMIZE_METHORD == LR_RMSPROP:
        train_step = tf.train.RMSPropOptimizer(decayed_learning_rate).minimize(loss, global_step=global_step)
    elif OPTIMIZE_METHORD == LR_ADAM:
        train_step = tf.train.AdamOptimizer(decayed_learning_rate).minimize(loss, global_step=global_step)
    elif OPTIMIZE_METHORD == LR_ADADELTA:
        train_step = tf.train.AdadeltaOptimizer(decayed_learning_rate).minimize(loss, global_step=global_step)
    else:
        train_step = tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(loss, global_step=global_step)

    acc = accuracy(y, y_)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(TRAINING_STEPS):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            if i % 1000 == 0:
                valid_acc = acc.eval(feed_dict={x: mnist.validation.images,
                                                y_: mnist.validation.labels})
                print("After %d training step(s), accuracy on validation is %g." % (i, valid_acc))
            train_step.run(feed_dict={x: batch_xs, y_: batch_ys})
        test_acc = acc.eval(feed_dict={x: mnist.test.images,
                                       y_: mnist.test.labels})
        print("After %d training step(s), accuracy on test is %g." % (TRAINING_STEPS, test_acc))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str, default=r'E:\data\mnist',
                        help='Directory for storing input data')
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
--------------------------------------------------------------------------------