├── .DS_Store ├── 1_Getting_Started_入门.md ├── 2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md ├── 3_Multilayer_Perceptron_多层感知机.md ├── 4_Convoltional_Neural_Networks_LeNet_卷积神经网络.md ├── 5_Denoising_Autoencoders_降噪自动编码.md ├── 6_Stacked_Denoising_Autoencoders_层叠降噪自动编码机.md ├── 7_Restricted_Boltzmann_Machine_受限波尔兹曼机.md ├── README.md └── images ├── .DS_Store ├── 1_0-1_loss_1.png ├── 1_0-1_loss_2.png ├── 1_0-1_loss_3.png ├── 1_l1_l2_regularization_1.png ├── 1_l1_l2_regularization_2.png ├── 1_negative_log_likelihod_1.png ├── 1_negative_log_likelihod_2.png ├── 2_defining_a_loss_function_1.png ├── 2_the_model_1.png ├── 2_the_model_2.png ├── 3_from_lr_to_mlp_1.png ├── 3_the_model_1.png ├── 3_the_model_2.png ├── 3_the_model_3.png ├── 3wolfmoon.jpg ├── 4_conv_operator_1.png ├── 4_detail_notation_1.png ├── 4_detail_notation_2.png ├── 4_detail_notation_3.png ├── 4_full_model_1.png ├── 4_sparse_con_1.png ├── 5_autoencoders_1.png ├── 5_autoencoders_2.png ├── 5_autoencoders_3.png ├── 5_autoencoders_4.png ├── 5_running_code_1.png ├── 5_running_code_2.png ├── 6_sda_1.png ├── 7_ebm_1.png ├── 7_ebm_2.png ├── 7_ebm_3.png ├── 7_ebm_4.png ├── 7_ebm_hidden_units_1.png ├── 7_ebm_hidden_units_2.png ├── 7_ebm_hidden_units_3.png ├── 7_ebm_hidden_units_4.png ├── 7_ebm_hidden_units_5.png ├── 7_implementation_1.png ├── 7_proxies_likelihood_1.png ├── 7_rbm_1.png ├── 7_rbm_2.png ├── 7_rbm_3.png ├── 7_rbm_4.png ├── 7_rbm_binary_units_1.png ├── 7_rbm_binary_units_2.png ├── 7_rbm_binary_units_3.png ├── 7_sampling_1.png ├── 7_sampling_2.png └── 7_update_e_b_u_1.png /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/.DS_Store -------------------------------------------------------------------------------- /1_Getting_Started_入门.md: -------------------------------------------------------------------------------- 1 | 入门(Getting Started) 2 | ====================== 3 | 4 | 这个教程并不是为了巩固研究生或者本科生的机器学习课程,但我们确实对一些重要的概念(和公式)做了的快速的概述,来确保我们在谈论同个概念。同时,你也需要去下载数据集,以便可以跑未来课程的样例代码。 5 | 6 | ### 下载 7 | 在每一个学习算法的页面,你都需要去下载相关的文件。加入你想要一次下载所有的文件,你可以克隆本教程的git仓库。 8 | 9 | git clone git://github.com/lisa-lab/DeepLearningTutorials.git 10 | 11 | ### 数据集 12 | #### MNIST数据集(mnist.pkl.gz) 13 | 14 | [MNIST](http://yann.lecun.com/exdb/mnist)是一个包含60000个训练样例和10000个测试样例的手写数字图像的数据集。在许多论文,包括本教程,都将60000个训练样例分为50000个样例的训练集和10000个样例的验证集(为了超参数,例如学习率、模型尺寸等等)。所有的数字图像都被归一化和中心化为28*28的像素,256位图的灰度图。 15 | 为了方便在Python中的使用,我们对数据集进行了处理。你可以在这里[下载](http://deeplearning.net/data/mnist/mnist.pkl.gz)。这个文件被表示为包含3个lists的tuple:训练集、验证集和测试集。每个lists都是都是两个list的组合,一个list是有numpy的1维array表示的784(28*28)维的0~1(0是黑,1是白)的float值,另一个list是0~9的图像标签。下面的代码显示了如何去加载这个数据集。 16 | 17 | ```Python 18 | import cPickle, gzip, numpy 19 | # Load the dataset 20 | f = gzip.open('mnist.pkl.gz', 'rb') 21 | train_set, valid_set, test_set = cPickle.load(f) 22 | f.close() 23 | ``` 24 | 25 | 当我们使用这个数据集的时候,通常将它分割维几个minibatch。我们建议你将数据集储存为共享变量(shared variables),通过minibatch的索引(一个固定的被告知的batch的尺寸)来存取它们。使用共享变量的原因是为了使用GPU。因为往GPUX显存中复制数据是一个巨大的开销。如果不使用共享变量,GPU代码的运行效率将不会比CPU代码快。如果你将自己的数据定义为共享变量,当共享变量被构建的时候,你就给了Theano在一次请求中将整个数据复制到GPU上的可能。之后,GPU就可以通过共享变量的slice(切片)来存取任何一个minibatch,而不必再从CPU上拷贝数据。同时,因为数据向量(实数)和标签(整数)通常是不同属性的,测试集、验证集和训练集是不同目的的,所以我们建议通过不同的共享变量来储存(这就产生了6个不同的共享变量)。 26 | 由于现在的数据再一个变量里面,一个minibatch被定义为这个变量的一个切片。通过指定它的索引和它的尺寸,可以更加自然的来定义一个minibatch。下面的代码展示了如何去存取数据和如何存取一个minibatch。 27 | 28 | ```Python 29 | def shared_dataset(data_xy): 30 | """ Function that loads the dataset into shared 
variables 31 | 32 | The reason we store our dataset in shared variables is to allow 33 | Theano to copy it into the GPU memory (when code is run on GPU). 34 | Since copying data into the GPU is slow, copying a minibatch everytime 35 | is needed (the default behaviour if the data is not in a shared 36 | variable) would lead to a large decrease in performance. 37 | """ 38 | data_x, data_y = data_xy 39 | shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX)) 40 | shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX)) 41 | # When storing data on the GPU it has to be stored as floats 42 | # therefore we will store the labels as ``floatX`` as well 43 | # (``shared_y`` does exactly that). But during our computations 44 | # we need them as ints (we use labels as index, and if they are 45 | # floats it doesn't make sense) therefore instead of returning 46 | # ``shared_y`` we will have to cast it to int. This little hack 47 | # lets us get around this issue 48 | return shared_x, T.cast(shared_y, 'int32') 49 | ``` 50 | 51 | 这个数据以float的形式被存储在GPU上(`dtype`被定义为`theano.confug.floatX`)。然后再将标签转换为int型。 52 | 如果你再GPU上跑代码,并且数据集太大,可能导致内存崩溃。在这个时候,你就应当把数据存储为共享变量。你可以将数据储存为一个充分小的大块(几个minibatch)在一个共享变量里面,然后在训练的时候使用它。一旦你使用了这个大块,更新它储存的值。这将最小化CPU和GPU的内存交换。 53 | 54 | 55 | ### 标记 56 | #### 数据集标记 57 | 我们定义数据集为D,包括3个部分,D_train,D_valid,D_test三个集合。D内每个索引都是一个(x,y)对。 58 | #### 数学约定 59 | * W:大写字母表示矩阵(除非特殊说明) 60 | * W(i,j):矩阵内(i,j)点的数据 61 | * W(i.):矩阵的一行 62 | * W(.j):矩阵的一列 63 | * b:小些字母表示向量(除非特殊说明) 64 | * b(i):向量内的(i)点的数据 65 | #### 符号和缩略语表 66 | * D:输入维度的数目 67 | * D_h(i):第i层个隐层的输入单元数目 68 | * L:标签的数目 69 | * NLL:负对数似然函数 70 | * theta:给定模型的参数集合 71 | 72 | #### Python命名空间 73 | ```Python 74 | import theano 75 | import theano.tensor as T 76 | import numpy 77 | ``` 78 | 79 | ### 深度学习的监督优化入门 80 | #### 学习一个分类器 81 | ##### 0-1损失函数 82 | ![0-1_loss_1](/images/1_0-1_loss_1.png) 83 | 84 | ![0-1_loss_2](/images/1_0-1_loss_2.png) 85 | 86 | ![0-1_loss_3](/images/1_0-1_loss_3.png) 87 | 88 | ```Python 89 | # zero_one_loss is a Theano variable representing a symbolic 90 | # expression of the zero one loss ; to get the actual value this 91 | # symbolic expression has to be compiled into a Theano function (see 92 | # the Theano tutorial for more details) 93 | zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y)) 94 | ``` 95 | 96 | ##### 负对数似然损失函数 97 | 由于0-1损失函数不可微分,在大型模型中对它优化会造成巨大开销。因此我们通过最大化给定数据标签的似然函数来训练模型。 98 | 99 | ![nll_1](/images/1_negative_log_likelihod_1.png) 100 | 101 | ![nll_2](/images/1_negative_log_likelihod_2.png) 102 | 103 | 由于我们通常说最小化损失函数,所以我们给对数似然函数添加负号,来使得我们可以求解最小化负对数似然损失函数。 104 | 105 | ```Python 106 | # NLL is a symbolic variable ; to get the actual value of NLL, this symbolic 107 | # expression has to be compiled into a Theano function (see the Theano 108 | # tutorial for more details) 109 | NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y]) 110 | # note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)]. 111 | # Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the 112 | # elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this 113 | # syntax to retrieve the log-probability of the correct labels, y. 114 | ``` 115 | 116 | #### 随机梯度下降 117 | 什么是普通的梯度下降?梯度下降是一个简单的算法,利用负梯度方向来决定每次迭代的新的搜索方向,使得每次迭代能使待优化的目标函数逐步减小。 118 | 伪代码如下所示。 119 | 120 | ```Python 121 | # GRADIENT DESCENT 122 | 123 | while True: 124 | loss = f(params) 125 | d_loss_wrt_params = ... 
# compute gradient 126 | params -= learning_rate * d_loss_wrt_params 127 | if : 128 | return params 129 | ``` 130 | 131 | 随机梯度下降则是普通梯度下降的优化。通过使用一部分样本来优化梯度代替所有样本优化梯度,从而得以更快逼近结果。下面的代码,我们一次只用一个样本来计算梯度。 132 | 133 | ```Python 134 | # STOCHASTIC GRADIENT DESCENT 135 | for (x_i,y_i) in training_set: 136 | # imagine an infinite generator 137 | # that may repeat examples (if there is only a finite training set) 138 | loss = f(params, x_i, y_i) 139 | d_loss_wrt_params = ... # compute gradient 140 | params -= learning_rate * d_loss_wrt_params 141 | if : 142 | return params 143 | ``` 144 | 145 | 我们不止一次的在深度学习中提及这个变体——“minibatches”。Minibatch随机梯度下降区别与随机梯度下降,在每次梯度估计时使用一个minibatch的数据。这个技术减小了每次梯度估计时的方差,也适合现代电脑的分层内存构架。 146 | 147 | ```Python 148 | for (x_batch,y_batch) in train_batches: 149 | # imagine an infinite generator 150 | # that may repeat examples 151 | loss = f(params, x_batch, y_batch) 152 | d_loss_wrt_params = ... # compute gradient using theano 153 | params -= learning_rate * d_loss_wrt_params 154 | if : 155 | return params 156 | ``` 157 | 158 | 在选择minibatch的尺寸B时中有个权衡。当尺寸比较大时,在梯度估计时就要花费更多时间计算方差;当尺寸比较小的时候呢,就要进行更多的迭代,也更容易波动。因而尺寸的选择要结合模型、数据集、硬件结构等,从1到几百不等。 159 | 伪代码如下。 160 | 161 | ```Python 162 | # Minibatch Stochastic Gradient Descent 163 | 164 | # assume loss is a symbolic description of the loss function given 165 | # the symbolic variables params (shared variable), x_batch, y_batch; 166 | 167 | # compute gradient of loss with respect to params 168 | d_loss_wrt_params = T.grad(loss, params) 169 | 170 | # compile the MSGD step into a theano function 171 | updates = [(params, params - learning_rate * d_loss_wrt_params)] 172 | MSGD = theano.function([x_batch,y_batch], loss, updates=updates) 173 | 174 | for (x_batch, y_batch) in train_batches: 175 | # here x_batch and y_batch are elements of train_batches and 176 | # therefore numpy arrays; function MSGD also updates the params 177 | print('Current loss is ', MSGD(x_batch, y_batch)) 178 | if stopping_condition_is_met: 179 | return params 180 | ``` 181 | 182 | #### 正则化 183 | 正则化是为了防止在MSGD训练过程中出现过拟合。为了应对过拟合,我们提出了几个方法:L1/L2正则化和early-stopping。 184 | ##### L1/L2正则化 185 | L1/L2正则化就是在损失函数中添加额外的项,用以惩罚一定的参数结构。对于L2正则化,又被称为“权制递减(weight decay)”。 186 | 187 | ![l1_l2_regularization_1](/images/1_l1_l2_regularization_1.png) 188 | 189 | ![l1_l2_regularization_2](/images/1_l1_l2_regularization_2.png) 190 | 191 | 192 | 原则上来说,增加一个正则项,可以平滑神经网络的网络映射(通过惩罚大的参数值,可以减少网络模型的非线性参数数)。因而最小化这个和,就可以寻找到与训练数据最贴合同时范化性更好的模型。更具奥卡姆剃刀原则,最好的模型总是最简单的。 193 | 当然,事实上,简单模型并不一定意味着好的泛化。但从经验上看,这个正则化方案可以提高神经网络的泛化能力,尤其是对于小数据集而言。下面的代码我们分别给两个正则项一个对应的权重。 194 | 195 | ```Python 196 | # symbolic Theano variable that represents the L1 regularization term 197 | L1 = T.sum(abs(param)) 198 | 199 | # symbolic Theano variable that represents the squared L2 term 200 | L2_sqr = T.sum(param ** 2) 201 | 202 | # the loss 203 | loss = NLL + lambda_1 * L1 + lambda_2 * L2 204 | ``` 205 | 206 | ##### Early-stopping 207 | Early-stopping通过监控模型在验证集上的表现来应对过拟合。验证集是一个我们从未在梯度下降中使用,也不在测试集的数据集合,它被认为是为了测试数据的一个表达。当在验证集上,模型的表现不再提高,或者表现更差,那么启发式算法应该放弃继续优化。 208 | 在选择何时终止优化方面,主要基于主观判断和一些启发式的方法,但在这个教程里,我们使用一个几何级数增加的patience量的策略。 209 | 210 | ```Python 211 | # early-stopping parameters 212 | patience = 5000 # look as this many examples regardless 213 | patience_increase = 2 # wait this much longer when a new best is 214 | # found 215 | improvement_threshold = 0.995 # a relative improvement of this much is 216 | # considered significant 217 | validation_frequency = min(n_train_batches, patience/2) 218 | # go through this many 219 | # 
minibatches before checking the network 220 | # on the validation set; in this case we 221 | # check every epoch 222 | 223 | best_params = None 224 | best_validation_loss = numpy.inf 225 | test_score = 0. 226 | start_time = time.clock() 227 | 228 | done_looping = False 229 | epoch = 0 230 | while (epoch < n_epochs) and (not done_looping): 231 | # Report "1" for first epoch, "n_epochs" for last epoch 232 | epoch = epoch + 1 233 | for minibatch_index in xrange(n_train_batches): 234 | 235 | d_loss_wrt_params = ... # compute gradient 236 | params -= learning_rate * d_loss_wrt_params # gradient descent 237 | 238 | # iteration number. We want it to start at 0. 239 | iter = (epoch - 1) * n_train_batches + minibatch_index 240 | # note that if we do `iter % validation_frequency` it will be 241 | # true for iter = 0 which we do not want. We want it true for 242 | # iter = validation_frequency - 1. 243 | if (iter + 1) % validation_frequency == 0: 244 | 245 | this_validation_loss = ... # compute zero-one loss on validation set 246 | 247 | if this_validation_loss < best_validation_loss: 248 | 249 | # improve patience if loss improvement is good enough 250 | if this_validation_loss < best_validation_loss * improvement_threshold: 251 | 252 | patience = max(patience, iter * patience_increase) 253 | best_params = copy.deepcopy(params) 254 | best_validation_loss = this_validation_loss 255 | 256 | if patience <= iter: 257 | done_looping = True 258 | break 259 | 260 | # POSTCONDITION: 261 | # best_params refers to the best out-of-sample parameters observed during the optimization 262 | ``` 263 | 264 | 如果过训练数据的batch批次。 265 | 这个`validation_frequency`应该要比`patience`更小。这个代码应该至少检查了两次,在使用`patience`之前。这就是我们使用这个等式`validation_frequency = min( value, patience/2.`的原因。 266 | 这个算法可能会有更好的表现,当我们通过统计显著性的测试来代替简单的比较来决定是否增加patient。 267 | 268 | #### 测试 269 | 我们依据在验证集上表现最好的参数作为模型的参数,去在测试集上进行测试。 270 | 271 | #### 总结 272 | 这是对优化章节的总结。Early-stopping技术需要我们将数据分割为训练集、验证集、测试集。测试集使用minibatch的随机梯度下降来对目标函数进行逼近。同时引入L1/L2正则项来应对过拟合。 273 | 274 | ### Theano/Python技巧 275 | #### 载入和保存模型 276 | 当你做实验的时候,用梯度下降算法可能要好几个小时去发现一个最优解。你可能在发现解的时候,想要保存这些权值。你也可能想要保存搜索进程中当前最优化的解。 277 | 278 | ##### 使用Pickle在共享变量中储存numpy的ndarrays 279 | ```Python 280 | >>> import cPickle 281 | >>> save_file = open('path', 'wb') # this will overwrite current contents 282 | >>> cPickle.dump(w.get_value(borrow=True), save_file, -1) # the -1 is for HIGHEST_PROTOCOL 283 | >>> cPickle.dump(v.get_value(borrow=True), save_file, -1) # .. and it triggers much more efficient 284 | >>> cPickle.dump(u.get_value(borrow=True), save_file, -1) # .. 
storage than numpy's default 285 | >>> save_file.close() 286 | ``` 287 | ```Python 288 | >>> save_file = open('path') 289 | >>> w.set_value(cPickle.load(save_file), borrow=True) 290 | >>> v.set_value(cPickle.load(save_file), borrow=True) 291 | >>> u.set_value(cPickle.load(save_file), borrow=True) 292 | ``` 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | -------------------------------------------------------------------------------- /2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md: -------------------------------------------------------------------------------- 1 | 使用逻辑回归进行MNIST分类(Classifying MNIST using Logistic Regressing) 2 | ============================= 3 | 本节假定读者属性了下面的Theano概念:[共享变量(shared variable)](http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables), [基本数学算子(basic arithmetic ops)](http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars), [Theano的进阶(T.grad)](http://deeplearning.net/software/theano/tutorial/examples.html#computing-gradients), [floatX(默认为float64)](http://deeplearning.net/software/theano/library/config.html#config.floatX)。假如你想要在你的GPU上跑你的代码,你也需要看[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 4 | 5 | 本节的所有代码可以在[这里](http://deeplearning.net/tutorial/code/logistic_sgd.py)下载。 6 | 7 | 在这一节,我们将展示Theano如何实现最基本的分类器:逻辑回归分类器。我们以模型的快速入门开始,复习(refresher)和巩固(anchor)数学负号,也展示了数学表达式如何映射到Theano图中。 8 | 9 | ## 模型 10 | 逻辑回归模型是一个线性概率模型。它由一个权值矩阵W和偏置向量b参数化。分类通过将输入向量提交到一组超平面,每个超平面对应一个类。输入向量和超平面的距离是这个输入属于该类的一个概率量化。 11 | 在给定模型下,输入x,输出为y的概率,可以用如下公式表示 12 | 13 |
![probability](/images/2_the_model_1.png)
14 | 15 |
![y_prediction](/images/2_the_model_2.png)
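如果上面两张公式图片无法显示,可以参考本文后面完整代码 docstring 中的数学表达式,大致如下(用 LaTeX 重写,仅作示意):

```latex
P(Y=i \mid x, W, b) = \mathrm{softmax}_i(W x + b)
                    = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}

y_{pred} = \operatorname{argmax}_i \, P(Y=i \mid x, W, b)
```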
16 | 17 | Theano代码如下。 18 | 19 | ```Python 20 | # initialize with 0 the weights W as a matrix of shape (n_in, n_out) 21 | self.W = theano.shared( 22 | value=numpy.zeros( 23 | (n_in, n_out), 24 | dtype=theano.config.floatX 25 | ), 26 | name='W', 27 | borrow=True 28 | ) 29 | # initialize the baises b as a vector of n_out 0s 30 | self.b = theano.shared( 31 | value=numpy.zeros( 32 | (n_out,), 33 | dtype=theano.config.floatX 34 | ), 35 | name='b', 36 | borrow=True 37 | ) 38 | 39 | # symbolic expression for computing the matrix of class-membership 40 | # probabilities 41 | # Where: 42 | # W is a matrix where column-k represent the separation hyper plain for 43 | # class-k 44 | # x is a matrix where row-j represents input training sample-j 45 | # b is a vector where element-k represent the free parameter of hyper 46 | # plain-k 47 | self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) 48 | 49 | # symbolic description of how to compute prediction as class whose 50 | # probability is maximal 51 | self.y_pred = T.argmax(self.p_y_given_x, axis=1) 52 | ``` 53 | 54 | 由于模型的参数需要不断的存取和修正,所以我们把W和b定义为共享变量。这个dot(点乘)和softmax运算用以计算这个P(Y|x,W,b)。这个结果`p_y_given_x`(probability)是一个vector类型的概率向量。 55 | 为了获得实际的模型预测,我们使用`T_argmax`操作,来返回`p_y_given_x`的最大值对应的y。 56 | 如果想要获得完整的Theano算子,看[算子列表](http://deeplearning.net/software/theano/library/tensor/basic.html#basic-tensor-functionality) 57 | 58 | ## 定义一个损失函数 59 | 学习优化模型参数需要最小化一个损失参数。在多分类的逻辑回归中,很显然是使用负对数似然函数作为损失函数。似然函数和损失函数定义如下: 60 | 61 |
![loss_function](/images/2_defining_a_loss_function_1.png)
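同样,若公式图片无法显示,可参考后面代码 docstring 中的表达式。对数似然函数 L 与负对数似然损失 ℓ 大致如下(代码中实际返回的是对 minibatch 取平均的形式,即再乘以 1/|D|,以减小学习率对 batch 大小的依赖):

```latex
\mathcal{L}(\theta=\{W,b\}, \mathcal{D}) = \sum_{i} \log P(Y=y^{(i)} \mid x^{(i)}, W, b)

\ell(\theta=\{W,b\}, \mathcal{D}) = - \mathcal{L}(\theta=\{W,b\}, \mathcal{D})
```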
62 | 63 | 虽然整本书都致力于探讨最小化话题,但梯度下降是迄今为止最简单的最小化非线性函数的方法。在这个教程中,我们使用minibatch随机梯度下降算法。可以看[随机梯度下降](http://deeplearning.net/tutorial/gettingstarted.html#opt-sgd)来获得更多细节。 64 | 下面的代码定义了一个对给定的minibatch的损失函数。 65 | 66 | ```Python 67 | # y.shape[0] is (symbolically) the number of rows in y, i.e., 68 | # number of examples (call it n) in the minibatch 69 | # T.arange(y.shape[0]) is a symbolic vector which will contain 70 | # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of 71 | # Log-Probabilities (call it LP) with one row per example and 72 | # one column per class LP[T.arange(y.shape[0]),y] is a vector 73 | # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ..., 74 | # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is 75 | # the mean (across minibatch examples) of the elements in v, 76 | # i.e., the mean log-likelihood across the minibatch. 77 | return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]) 78 | ``` 79 | 在这里我们使用错误的平均来表示损失函数,以减少minibatch尺寸对我们的影响。 80 | 81 | ## 创建一个逻辑回归类 82 | 现在,我们要定义一个`逻辑回归`的类,来概括逻辑回归的基本行为。代码已经是我们之前涵盖的了,不再进行过多解释。 83 | 84 | ```Python 85 | class LogisticRegression(object): 86 | """Multi-class Logistic Regression Class 87 | 88 | The logistic regression is fully described by a weight matrix :math:`W` 89 | and bias vector :math:`b`. Classification is done by projecting data 90 | points onto a set of hyperplanes, the distance to which is used to 91 | determine a class membership probability. 92 | """ 93 | 94 | def __init__(self, input, n_in, n_out): 95 | """ Initialize the parameters of the logistic regression 96 | 97 | :type input: theano.tensor.TensorType 98 | :param input: symbolic variable that describes the input of the 99 | architecture (one minibatch) 100 | 101 | :type n_in: int 102 | :param n_in: number of input units, the dimension of the space in 103 | which the datapoints lie 104 | 105 | :type n_out: int 106 | :param n_out: number of output units, the dimension of the space in 107 | which the labels lie 108 | 109 | """ 110 | # start-snippet-1 111 | # initialize with 0 the weights W as a matrix of shape (n_in, n_out) 112 | self.W = theano.shared( 113 | value=numpy.zeros( 114 | (n_in, n_out), 115 | dtype=theano.config.floatX 116 | ), 117 | name='W', 118 | borrow=True 119 | ) 120 | # initialize the baises b as a vector of n_out 0s 121 | self.b = theano.shared( 122 | value=numpy.zeros( 123 | (n_out,), 124 | dtype=theano.config.floatX 125 | ), 126 | name='b', 127 | borrow=True 128 | ) 129 | 130 | # symbolic expression for computing the matrix of class-membership 131 | # probabilities 132 | # Where: 133 | # W is a matrix where column-k represent the separation hyper plain for 134 | # class-k 135 | # x is a matrix where row-j represents input training sample-j 136 | # b is a vector where element-k represent the free parameter of hyper 137 | # plain-k 138 | self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) 139 | 140 | # symbolic description of how to compute prediction as class whose 141 | # probability is maximal 142 | self.y_pred = T.argmax(self.p_y_given_x, axis=1) 143 | # end-snippet-1 144 | 145 | # parameters of the model 146 | self.params = [self.W, self.b] 147 | 148 | def negative_log_likelihood(self, y): 149 | """Return the mean of the negative log-likelihood of the prediction 150 | of this model under a given target distribution. 151 | 152 | .. 
math:: 153 | 154 | \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) = 155 | \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} 156 | \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\ 157 | \ell (\theta=\{W,b\}, \mathcal{D}) 158 | 159 | :type y: theano.tensor.TensorType 160 | :param y: corresponds to a vector that gives for each example the 161 | correct label 162 | 163 | Note: we use the mean instead of the sum so that 164 | the learning rate is less dependent on the batch size 165 | """ 166 | # start-snippet-2 167 | # y.shape[0] is (symbolically) the number of rows in y, i.e., 168 | # number of examples (call it n) in the minibatch 169 | # T.arange(y.shape[0]) is a symbolic vector which will contain 170 | # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of 171 | # Log-Probabilities (call it LP) with one row per example and 172 | # one column per class LP[T.arange(y.shape[0]),y] is a vector 173 | # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ..., 174 | # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is 175 | # the mean (across minibatch examples) of the elements in v, 176 | # i.e., the mean log-likelihood across the minibatch. 177 | return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]) 178 | # end-snippet-2 179 | 180 | def errors(self, y): 181 | """Return a float representing the number of errors in the minibatch 182 | over the total number of examples of the minibatch ; zero one 183 | loss over the size of the minibatch 184 | 185 | :type y: theano.tensor.TensorType 186 | :param y: corresponds to a vector that gives for each example the 187 | correct label 188 | """ 189 | 190 | # check if y has same dimension of y_pred 191 | if y.ndim != self.y_pred.ndim: 192 | raise TypeError( 193 | 'y should have the same shape as self.y_pred', 194 | ('y', y.type, 'y_pred', self.y_pred.type) 195 | ) 196 | # check if y is of the correct datatype 197 | if y.dtype.startswith('int'): 198 | # the T.neq operator returns a vector of 0s and 1s, where 1 199 | # represents a mistake in prediction 200 | return T.mean(T.neq(self.y_pred, y)) 201 | else: 202 | raise NotImplementedError() 203 | ``` 204 | 我们通过如下代码来实例化这个类。 205 | 206 | ```Pyhton 207 | # generate symbolic variables for input (x and y represent a 208 | # minibatch) 209 | x = T.matrix('x') # data, presented as rasterized images 210 | y = T.ivector('y') # labels, presented as 1D vector of [int] labels 211 | 212 | # construct the logistic regression class 213 | # Each MNIST image has size 28*28 214 | classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10) 215 | 216 | ``` 217 | 需要注意的是,输入向量x,和其相关的标签y都是定义在`LogisticRegression`实体外的。这个类需要将输入数据作为`__init__`函数的参数。这在将这些类的实例连接起来构建深网络方面非常有用。一层的输出可以作为下一层的输入。 218 | 最后,我们定义了一个`cost`变量来最小化。 219 | 220 | ```Python 221 | # the cost we minimize during training is the negative log likelihood of 222 | # the model in symbolic format 223 | cost = classifier.negative_log_likelihood(y) 224 | ``` 225 | 226 | ## 学习模型 227 | 在实现MSGD的许多语言中,需要通过手动求解损失函数对每个参数的梯度(微分)来实现。 228 | 在Theano中呢,这是非常简单的。它自动微分,并且使用了一定的数学转换来提高数学稳定性。 229 | 230 | ```Pyhton 231 | g_W = T.grad(cost=cost, wrt=classifier.W) 232 | g_b = T.grad(cost=cost, wrt=classifier.b) 233 | ``` 234 | 这个函数`train_model`可以被定义如下。 235 | ```Python 236 | # specify how to update the parameters of the model as a list of 237 | # (variable, update expression) pairs. 
238 | updates = [(classifier.W, classifier.W - learning_rate * g_W), 239 | (classifier.b, classifier.b - learning_rate * g_b)] 240 | 241 | # compiling a Theano function `train_model` that returns the cost, but in 242 | # the same time updates the parameter of the model based on the rules 243 | # defined in `updates` 244 | train_model = theano.function( 245 | inputs=[index], 246 | outputs=cost, 247 | updates=updates, 248 | givens={ 249 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 250 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 251 | } 252 | ) 253 | ``` 254 | `update`是一个list,用以更新每一步的参数。`given`是一个字典,用以表示象征变量,和你在该步中表示的数据。这个`train_model`定义如下: 255 | * 输入是minibatch的`index`,batch的大小之前已经固定,以此被定义为x,以及其相关的y。 256 | * 返回是该`index`下与x,y相关的`cost`/损失函数。 257 | * 每一次函数调用,它都先用index对应的训练集的切片来更新x,y。然后计算该minibatch下的cost,以及申请`update`操作。 258 | 每次`train_model(inedx)`被调用,它都计算并返回该minibatch的cost,当然这也是MSGD的一步。整个学习算法因循环了数据集所有样例。 259 | 260 | ## 训练模型 261 | 在之前论述中所说,我们对分类错误的样本感兴趣(不仅仅是可能性)。因此模型中增加了一个额外的实例方法,来纪录每个minibatch中的错误分类样例数。 262 | 263 | ```Python 264 | def errors(self, y): 265 | """Return a float representing the number of errors in the minibatch 266 | over the total number of examples of the minibatch ; zero one 267 | loss over the size of the minibatch 268 | 269 | :type y: theano.tensor.TensorType 270 | :param y: corresponds to a vector that gives for each example the 271 | correct label 272 | """ 273 | 274 | # check if y has same dimension of y_pred 275 | if y.ndim != self.y_pred.ndim: 276 | raise TypeError( 277 | 'y should have the same shape as self.y_pred', 278 | ('y', y.type, 'y_pred', self.y_pred.type) 279 | ) 280 | # check if y is of the correct datatype 281 | if y.dtype.startswith('int'): 282 | # the T.neq operator returns a vector of 0s and 1s, where 1 283 | # represents a mistake in prediction 284 | return T.mean(T.neq(self.y_pred, y)) 285 | else: 286 | raise NotImplementedError() 287 | ``` 288 | 我们创建了`test_model`函数,然后也创建了`validate_model`来调用去修正这个值。当然`validate_model`是early-stopping的关键。它们都是来统计minibatch中分类错误的样例数。 289 | 290 | ```Python 291 | # compiling a Theano function that computes the mistakes that are made by 292 | # the model on a minibatch 293 | test_model = theano.function( 294 | inputs=[index], 295 | outputs=classifier.errors(y), 296 | givens={ 297 | x: test_set_x[index * batch_size: (index + 1) * batch_size], 298 | y: test_set_y[index * batch_size: (index + 1) * batch_size] 299 | } 300 | ) 301 | 302 | validate_model = theano.function( 303 | inputs=[index], 304 | outputs=classifier.errors(y), 305 | givens={ 306 | x: valid_set_x[index * batch_size: (index + 1) * batch_size], 307 | y: valid_set_y[index * batch_size: (index + 1) * batch_size] 308 | } 309 | ) 310 | ``` 311 | ## 把它们组合起来 312 | 最后的代码如下。 313 | ```Python 314 | """ 315 | This tutorial introduces logistic regression using Theano and stochastic 316 | gradient descent. 317 | 318 | Logistic regression is a probabilistic, linear classifier. It is parametrized 319 | by a weight matrix :math:`W` and a bias vector :math:`b`. Classification is 320 | done by projecting data points onto a set of hyperplanes, the distance to 321 | which is used to determine a class membership probability. 322 | 323 | Mathematically, this can be written as: 324 | 325 | .. math:: 326 | P(Y=i|x, W,b) &= softmax_i(W x + b) \\ 327 | &= \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}} 328 | 329 | 330 | The output of the model or prediction is then done by taking the argmax of 331 | the vector whose i'th element is P(Y=i|x). 332 | 333 | .. 
math:: 334 | 335 | y_{pred} = argmax_i P(Y=i|x,W,b) 336 | 337 | 338 | This tutorial presents a stochastic gradient descent optimization method 339 | suitable for large datasets. 340 | 341 | 342 | References: 343 | 344 | - textbooks: "Pattern Recognition and Machine Learning" - 345 | Christopher M. Bishop, section 4.3.2 346 | 347 | """ 348 | __docformat__ = 'restructedtext en' 349 | 350 | import cPickle 351 | import gzip 352 | import os 353 | import sys 354 | import time 355 | 356 | import numpy 357 | 358 | import theano 359 | import theano.tensor as T 360 | 361 | 362 | class LogisticRegression(object): 363 | """Multi-class Logistic Regression Class 364 | 365 | The logistic regression is fully described by a weight matrix :math:`W` 366 | and bias vector :math:`b`. Classification is done by projecting data 367 | points onto a set of hyperplanes, the distance to which is used to 368 | determine a class membership probability. 369 | """ 370 | 371 | def __init__(self, input, n_in, n_out): 372 | """ Initialize the parameters of the logistic regression 373 | 374 | :type input: theano.tensor.TensorType 375 | :param input: symbolic variable that describes the input of the 376 | architecture (one minibatch) 377 | 378 | :type n_in: int 379 | :param n_in: number of input units, the dimension of the space in 380 | which the datapoints lie 381 | 382 | :type n_out: int 383 | :param n_out: number of output units, the dimension of the space in 384 | which the labels lie 385 | 386 | """ 387 | # start-snippet-1 388 | # initialize with 0 the weights W as a matrix of shape (n_in, n_out) 389 | self.W = theano.shared( 390 | value=numpy.zeros( 391 | (n_in, n_out), 392 | dtype=theano.config.floatX 393 | ), 394 | name='W', 395 | borrow=True 396 | ) 397 | # initialize the baises b as a vector of n_out 0s 398 | self.b = theano.shared( 399 | value=numpy.zeros( 400 | (n_out,), 401 | dtype=theano.config.floatX 402 | ), 403 | name='b', 404 | borrow=True 405 | ) 406 | 407 | # symbolic expression for computing the matrix of class-membership 408 | # probabilities 409 | # Where: 410 | # W is a matrix where column-k represent the separation hyper plain for 411 | # class-k 412 | # x is a matrix where row-j represents input training sample-j 413 | # b is a vector where element-k represent the free parameter of hyper 414 | # plain-k 415 | self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) 416 | 417 | # symbolic description of how to compute prediction as class whose 418 | # probability is maximal 419 | self.y_pred = T.argmax(self.p_y_given_x, axis=1) 420 | # end-snippet-1 421 | 422 | # parameters of the model 423 | self.params = [self.W, self.b] 424 | 425 | def negative_log_likelihood(self, y): 426 | """Return the mean of the negative log-likelihood of the prediction 427 | of this model under a given target distribution. 428 | 429 | .. 
math:: 430 | 431 | \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) = 432 | \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} 433 | \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\ 434 | \ell (\theta=\{W,b\}, \mathcal{D}) 435 | 436 | :type y: theano.tensor.TensorType 437 | :param y: corresponds to a vector that gives for each example the 438 | correct label 439 | 440 | Note: we use the mean instead of the sum so that 441 | the learning rate is less dependent on the batch size 442 | """ 443 | # start-snippet-2 444 | # y.shape[0] is (symbolically) the number of rows in y, i.e., 445 | # number of examples (call it n) in the minibatch 446 | # T.arange(y.shape[0]) is a symbolic vector which will contain 447 | # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of 448 | # Log-Probabilities (call it LP) with one row per example and 449 | # one column per class LP[T.arange(y.shape[0]),y] is a vector 450 | # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ..., 451 | # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is 452 | # the mean (across minibatch examples) of the elements in v, 453 | # i.e., the mean log-likelihood across the minibatch. 454 | return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]) 455 | # end-snippet-2 456 | 457 | def errors(self, y): 458 | """Return a float representing the number of errors in the minibatch 459 | over the total number of examples of the minibatch ; zero one 460 | loss over the size of the minibatch 461 | 462 | :type y: theano.tensor.TensorType 463 | :param y: corresponds to a vector that gives for each example the 464 | correct label 465 | """ 466 | 467 | # check if y has same dimension of y_pred 468 | if y.ndim != self.y_pred.ndim: 469 | raise TypeError( 470 | 'y should have the same shape as self.y_pred', 471 | ('y', y.type, 'y_pred', self.y_pred.type) 472 | ) 473 | # check if y is of the correct datatype 474 | if y.dtype.startswith('int'): 475 | # the T.neq operator returns a vector of 0s and 1s, where 1 476 | # represents a mistake in prediction 477 | return T.mean(T.neq(self.y_pred, y)) 478 | else: 479 | raise NotImplementedError() 480 | 481 | 482 | def load_data(dataset): 483 | ''' Loads the dataset 484 | 485 | :type dataset: string 486 | :param dataset: the path to the dataset (here MNIST) 487 | ''' 488 | 489 | ############# 490 | # LOAD DATA # 491 | ############# 492 | 493 | # Download the MNIST dataset if it is not present 494 | data_dir, data_file = os.path.split(dataset) 495 | if data_dir == "" and not os.path.isfile(dataset): 496 | # Check if dataset is in the data directory. 497 | new_path = os.path.join( 498 | os.path.split(__file__)[0], 499 | "..", 500 | "data", 501 | dataset 502 | ) 503 | if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz': 504 | dataset = new_path 505 | 506 | if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz': 507 | import urllib 508 | origin = ( 509 | 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz' 510 | ) 511 | print 'Downloading data from %s' % origin 512 | urllib.urlretrieve(origin, dataset) 513 | 514 | print '... loading data' 515 | 516 | # Load the dataset 517 | f = gzip.open(dataset, 'rb') 518 | train_set, valid_set, test_set = cPickle.load(f) 519 | f.close() 520 | #train_set, valid_set, test_set format: tuple(input, target) 521 | #input is an numpy.ndarray of 2 dimensions (a matrix) 522 | #witch row's correspond to an example. target is a 523 | #numpy.ndarray of 1 dimensions (vector)) that have the same length as 524 | #the number of rows in the input. 
It should give the target 525 | #target to the example with the same index in the input. 526 | 527 | def shared_dataset(data_xy, borrow=True): 528 | """ Function that loads the dataset into shared variables 529 | 530 | The reason we store our dataset in shared variables is to allow 531 | Theano to copy it into the GPU memory (when code is run on GPU). 532 | Since copying data into the GPU is slow, copying a minibatch everytime 533 | is needed (the default behaviour if the data is not in a shared 534 | variable) would lead to a large decrease in performance. 535 | """ 536 | data_x, data_y = data_xy 537 | shared_x = theano.shared(numpy.asarray(data_x, 538 | dtype=theano.config.floatX), 539 | borrow=borrow) 540 | shared_y = theano.shared(numpy.asarray(data_y, 541 | dtype=theano.config.floatX), 542 | borrow=borrow) 543 | # When storing data on the GPU it has to be stored as floats 544 | # therefore we will store the labels as ``floatX`` as well 545 | # (``shared_y`` does exactly that). But during our computations 546 | # we need them as ints (we use labels as index, and if they are 547 | # floats it doesn't make sense) therefore instead of returning 548 | # ``shared_y`` we will have to cast it to int. This little hack 549 | # lets ous get around this issue 550 | return shared_x, T.cast(shared_y, 'int32') 551 | 552 | test_set_x, test_set_y = shared_dataset(test_set) 553 | valid_set_x, valid_set_y = shared_dataset(valid_set) 554 | train_set_x, train_set_y = shared_dataset(train_set) 555 | 556 | rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y), 557 | (test_set_x, test_set_y)] 558 | return rval 559 | 560 | 561 | def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000, 562 | dataset='mnist.pkl.gz', 563 | batch_size=600): 564 | """ 565 | Demonstrate stochastic gradient descent optimization of a log-linear 566 | model 567 | 568 | This is demonstrated on MNIST. 569 | 570 | :type learning_rate: float 571 | :param learning_rate: learning rate used (factor for the stochastic 572 | gradient) 573 | 574 | :type n_epochs: int 575 | :param n_epochs: maximal number of epochs to run the optimizer 576 | 577 | :type dataset: string 578 | :param dataset: the path of the MNIST dataset file from 579 | http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz 580 | 581 | """ 582 | datasets = load_data(dataset) 583 | 584 | train_set_x, train_set_y = datasets[0] 585 | valid_set_x, valid_set_y = datasets[1] 586 | test_set_x, test_set_y = datasets[2] 587 | 588 | # compute number of minibatches for training, validation and testing 589 | n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size 590 | n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size 591 | n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size 592 | 593 | ###################### 594 | # BUILD ACTUAL MODEL # 595 | ###################### 596 | print '... 
building the model' 597 | 598 | # allocate symbolic variables for the data 599 | index = T.lscalar() # index to a [mini]batch 600 | 601 | # generate symbolic variables for input (x and y represent a 602 | # minibatch) 603 | x = T.matrix('x') # data, presented as rasterized images 604 | y = T.ivector('y') # labels, presented as 1D vector of [int] labels 605 | 606 | # construct the logistic regression class 607 | # Each MNIST image has size 28*28 608 | classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10) 609 | 610 | # the cost we minimize during training is the negative log likelihood of 611 | # the model in symbolic format 612 | cost = classifier.negative_log_likelihood(y) 613 | 614 | # compiling a Theano function that computes the mistakes that are made by 615 | # the model on a minibatch 616 | test_model = theano.function( 617 | inputs=[index], 618 | outputs=classifier.errors(y), 619 | givens={ 620 | x: test_set_x[index * batch_size: (index + 1) * batch_size], 621 | y: test_set_y[index * batch_size: (index + 1) * batch_size] 622 | } 623 | ) 624 | 625 | validate_model = theano.function( 626 | inputs=[index], 627 | outputs=classifier.errors(y), 628 | givens={ 629 | x: valid_set_x[index * batch_size: (index + 1) * batch_size], 630 | y: valid_set_y[index * batch_size: (index + 1) * batch_size] 631 | } 632 | ) 633 | 634 | # compute the gradient of cost with respect to theta = (W,b) 635 | g_W = T.grad(cost=cost, wrt=classifier.W) 636 | g_b = T.grad(cost=cost, wrt=classifier.b) 637 | 638 | # start-snippet-3 639 | # specify how to update the parameters of the model as a list of 640 | # (variable, update expression) pairs. 641 | updates = [(classifier.W, classifier.W - learning_rate * g_W), 642 | (classifier.b, classifier.b - learning_rate * g_b)] 643 | 644 | # compiling a Theano function `train_model` that returns the cost, but in 645 | # the same time updates the parameter of the model based on the rules 646 | # defined in `updates` 647 | train_model = theano.function( 648 | inputs=[index], 649 | outputs=cost, 650 | updates=updates, 651 | givens={ 652 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 653 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 654 | } 655 | ) 656 | # end-snippet-3 657 | 658 | ############### 659 | # TRAIN MODEL # 660 | ############### 661 | print '... training the model' 662 | # early-stopping parameters 663 | patience = 5000 # look as this many examples regardless 664 | patience_increase = 2 # wait this much longer when a new best is 665 | # found 666 | improvement_threshold = 0.995 # a relative improvement of this much is 667 | # considered significant 668 | validation_frequency = min(n_train_batches, patience / 2) 669 | # go through this many 670 | # minibatche before checking the network 671 | # on the validation set; in this case we 672 | # check every epoch 673 | 674 | best_validation_loss = numpy.inf 675 | test_score = 0. 
676 | start_time = time.clock() 677 | 678 | done_looping = False 679 | epoch = 0 680 | while (epoch < n_epochs) and (not done_looping): 681 | epoch = epoch + 1 682 | for minibatch_index in xrange(n_train_batches): 683 | 684 | minibatch_avg_cost = train_model(minibatch_index) 685 | # iteration number 686 | iter = (epoch - 1) * n_train_batches + minibatch_index 687 | 688 | if (iter + 1) % validation_frequency == 0: 689 | # compute zero-one loss on validation set 690 | validation_losses = [validate_model(i) 691 | for i in xrange(n_valid_batches)] 692 | this_validation_loss = numpy.mean(validation_losses) 693 | 694 | print( 695 | 'epoch %i, minibatch %i/%i, validation error %f %%' % 696 | ( 697 | epoch, 698 | minibatch_index + 1, 699 | n_train_batches, 700 | this_validation_loss * 100. 701 | ) 702 | ) 703 | 704 | # if we got the best validation score until now 705 | if this_validation_loss < best_validation_loss: 706 | #improve patience if loss improvement is good enough 707 | if this_validation_loss < best_validation_loss * \ 708 | improvement_threshold: 709 | patience = max(patience, iter * patience_increase) 710 | 711 | best_validation_loss = this_validation_loss 712 | # test it on the test set 713 | 714 | test_losses = [test_model(i) 715 | for i in xrange(n_test_batches)] 716 | test_score = numpy.mean(test_losses) 717 | 718 | print( 719 | ( 720 | ' epoch %i, minibatch %i/%i, test error of' 721 | ' best model %f %%' 722 | ) % 723 | ( 724 | epoch, 725 | minibatch_index + 1, 726 | n_train_batches, 727 | test_score * 100. 728 | ) 729 | ) 730 | 731 | if patience <= iter: 732 | done_looping = True 733 | break 734 | 735 | end_time = time.clock() 736 | print( 737 | ( 738 | 'Optimization complete with best validation score of %f %%,' 739 | 'with test performance %f %%' 740 | ) 741 | % (best_validation_loss * 100., test_score * 100.) 742 | ) 743 | print 'The code run for %d epochs, with %f epochs/sec' % ( 744 | epoch, 1. * epoch / (end_time - start_time)) 745 | print >> sys.stderr, ('The code for file ' + 746 | os.path.split(__file__)[1] + 747 | ' ran for %.1fs' % ((end_time - start_time))) 748 | 749 | if __name__ == '__main__': 750 | sgd_optimization_mnist() 751 | ``` 752 | 这个输出将是如下的格式 753 | ``` 754 | ... 755 | epoch 72, minibatch 83/83, validation error 7.510417 % 756 | epoch 72, minibatch 83/83, test error of best model 7.510417 % 757 | epoch 73, minibatch 83/83, validation error 7.500000 % 758 | epoch 73, minibatch 83/83, test error of best model 7.489583 % 759 | Optimization complete with best validation score of 7.500000 %,with test performance 7.489583 % 760 | The code run for 74 epochs, with 1.936983 epochs/sec 761 | ``` 762 | 在`Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 Ghz`上,这个代码的速度是1.936 epochs/sec然后跑75 epochs,得到测试错误率为7.489%。在GPU上为10.0 epochs/sec在这个实例中我们定义为batch的大小为600。 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | 788 | 789 | 790 | -------------------------------------------------------------------------------- /3_Multilayer_Perceptron_多层感知机.md: -------------------------------------------------------------------------------- 1 | 多层感知机(Multilayer Perceptron) 2 | ================================== 3 | 4 | 在本节中,假设你已经了解了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)。同时本节的所有代码可以在[这里](http://deeplearning.net/tutorial/code/mlp.py)下载. 
5 | 6 | 下一个我们将在Theano中使用的结构是单隐层的多层感知机(MLP)。MLP可以被看作一个逻辑回归分类器。这个中间层被称为隐藏层。一个单隐层对于MLP成为通用近似器是有效的。然而在后面,我们将讲述使用多个隐藏层的好处,例如深度学习的前提。这个课程介绍了[MLP,反向误差传导,如何训练MLPs](http://www.iro.umontreal.ca/~pift6266/H10/notes/mlp.html)。 7 | 8 | ## 模型 9 | 一个多层感知机(或者说人工神经网络——ANN),在只有一个隐藏层时可以被表示为如下的图: 10 | 11 | ![mlp_model_1](/images/3_the_model_1.png) 12 | 13 | 事实上,一个单隐藏层的MLP是一个如下的函数![mlp_model_2](/images/3_the_model_2.png),其中x是输入向量的维度,L是输出向量的维度。我们用下面的公式来表示MLP模型: 14 | 15 | ![mlp_model_3](/images/3_the_model_3.png) 16 | 17 | 其中b_1,W_1是输出层到隐藏层的偏置向量和权值矩阵,s是该层的激活函数。而b_2,W_2是隐藏层到输出层的偏置向量和权值矩阵,G是该层的激活函数。通常选择s为sigmoid函数,G为softmax函数。 18 | 在训练MLP模型的参数时,我们使用minibatch的随机梯度下降,在获得梯度后使用反向误差传导算法来实现参数的训练。由于Theano提供自动的微分,我们不需要在这个教程里面谈及这个方面。 19 | 20 | ## 从逻辑回归到多层感知机 21 | 本教程将专注于单隐藏层的MLP。我们以隐藏层的类的实现开始,如果要构建一个MLP,只需要在此基础上添加一个逻辑回归就好。 22 | 23 | ```Python 24 | class HiddenLayer(object): 25 | def __init__(self, rng, input, n_in, n_out, W=None, b=None, 26 | activation=T.tanh): 27 | """ 28 | Typical hidden layer of a MLP: units are fully-connected and have 29 | sigmoidal activation function. Weight matrix W is of shape (n_in,n_out) 30 | and the bias vector b is of shape (n_out,). 31 | 32 | NOTE : The nonlinearity used here is tanh 33 | 34 | Hidden unit activation is given by: tanh(dot(input,W) + b) 35 | 36 | :type rng: numpy.random.RandomState 37 | :param rng: a random number generator used to initialize weights 38 | 39 | :type input: theano.tensor.dmatrix 40 | :param input: a symbolic tensor of shape (n_examples, n_in) 41 | 42 | :type n_in: int 43 | :param n_in: dimensionality of input 44 | 45 | :type n_out: int 46 | :param n_out: number of hidden units 47 | 48 | :type activation: theano.Op or function 49 | :param activation: Non linearity to be applied in the hidden 50 | layer 51 | """ 52 | self.input = input 53 | ``` 54 | 一个隐藏层的权值初始化,应当从基于激活函数的均匀间隔中均匀采样。对于sigmoid函数而言,这个间隔是![interval](/images/3_from_lr_to_mlp_1.png)。其中fan_in是第(i-1)层的单元数目,fan_out是第(i)层单元的数目,[结论出自这里](http://deeplearning.net/tutorial/references.html#xavier10)。 55 | 这样的初始化,保证了在训练的早期,每个神经元都可以工作在它激活函数的控制范围内,从而使得信息可以更简单的前向传导(从输入到输出的激活)和后向传导(从输出到输入的梯度)。 56 | 57 | ```Python 58 | # `W` is initialized with `W_values` which is uniformely sampled 59 | # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden)) 60 | # for tanh activation function 61 | # the output of uniform if converted using asarray to dtype 62 | # theano.config.floatX so that the code is runable on GPU 63 | # Note : optimal initialization of weights is dependent on the 64 | # activation function used (among other things). 65 | # For example, results presented in [Xavier10] suggest that you 66 | # should use 4 times larger initial weights for sigmoid 67 | # compared to tanh 68 | # We have no info for other function, so we use the same as 69 | # tanh. 70 | if W is None: 71 | W_values = numpy.asarray( 72 | rng.uniform( 73 | low=-numpy.sqrt(6. / (n_in + n_out)), 74 | high=numpy.sqrt(6. 
/ (n_in + n_out)), 75 | size=(n_in, n_out) 76 | ), 77 | dtype=theano.config.floatX 78 | ) 79 | if activation == theano.tensor.nnet.sigmoid: 80 | W_values *= 4 81 | 82 | W = theano.shared(value=W_values, name='W', borrow=True) 83 | 84 | if b is None: 85 | b_values = numpy.zeros((n_out,), dtype=theano.config.floatX) 86 | b = theano.shared(value=b_values, name='b', borrow=True) 87 | 88 | self.W = W 89 | self.b = b 90 | ``` 91 | 注意,我们通过会将一个给定的非线性函数作为隐藏层的激活函数。默认是`tanh`函数,当然很多时候你可能需要其他函数。 92 | 93 | ```Python 94 | lin_output = T.dot(input, self.W) + self.b 95 | self.output = ( 96 | lin_output if activation is None 97 | else activation(lin_output) 98 | ) 99 | ``` 100 | 如果你已经阅读了上面的隐藏层输出和[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)。那么你可以看下面的MLP类的实现了。 101 | 102 | ```Python 103 | class MLP(object): 104 | """Multi-Layer Perceptron Class 105 | 106 | A multilayer perceptron is a feedforward artificial neural network model 107 | that has one layer or more of hidden units and nonlinear activations. 108 | Intermediate layers usually have as activation function tanh or the 109 | sigmoid function (defined here by a ``HiddenLayer`` class) while the 110 | top layer is a softamx layer (defined here by a ``LogisticRegression`` 111 | class). 112 | """ 113 | 114 | def __init__(self, rng, input, n_in, n_hidden, n_out): 115 | """Initialize the parameters for the multilayer perceptron 116 | 117 | :type rng: numpy.random.RandomState 118 | :param rng: a random number generator used to initialize weights 119 | 120 | :type input: theano.tensor.TensorType 121 | :param input: symbolic variable that describes the input of the 122 | architecture (one minibatch) 123 | 124 | :type n_in: int 125 | :param n_in: number of input units, the dimension of the space in 126 | which the datapoints lie 127 | 128 | :type n_hidden: int 129 | :param n_hidden: number of hidden units 130 | 131 | :type n_out: int 132 | :param n_out: number of output units, the dimension of the space in 133 | which the labels lie 134 | 135 | """ 136 | 137 | # Since we are dealing with a one hidden layer MLP, this will translate 138 | # into a HiddenLayer with a tanh activation function connected to the 139 | # LogisticRegression layer; the activation function can be replaced by 140 | # sigmoid or any other nonlinear function 141 | self.hiddenLayer = HiddenLayer( 142 | rng=rng, 143 | input=input, 144 | n_in=n_in, 145 | n_out=n_hidden, 146 | activation=T.tanh 147 | ) 148 | 149 | # The logistic regression layer gets as input the hidden units 150 | # of the hidden layer 151 | self.logRegressionLayer = LogisticRegression( 152 | input=self.hiddenLayer.output, 153 | n_in=n_hidden, 154 | n_out=n_out 155 | ) 156 | ``` 157 | 在本节中,我们也使用L1/L2正则化([L1/L2正则化](http://deeplearning.net/tutorial/gettingstarted.html#l1-l2-regularization))。所以我们需要去计算W_1和W_2矩阵的L1正则和L2平方正则。 158 | 159 | ```Python 160 | # L1 norm ; one regularization option is to enforce L1 norm to 161 | # be small 162 | self.L1 = ( 163 | abs(self.hiddenLayer.W).sum() 164 | + abs(self.logRegressionLayer.W).sum() 165 | ) 166 | 167 | # square of L2 norm ; one regularization option is to enforce 168 | # square of L2 norm to be small 169 | self.L2_sqr = ( 170 | (self.hiddenLayer.W ** 2).sum() 171 | + (self.logRegressionLayer.W ** 2).sum() 172 | ) 173 | 174 | # negative log likelihood of the MLP is given by the negative 175 | # log likelihood of the output of the model, computed in the 176 | # logistic regression layer 177 | 
self.negative_log_likelihood = ( 178 | self.logRegressionLayer.negative_log_likelihood 179 | ) 180 | # same holds for the function computing the number of errors 181 | self.errors = self.logRegressionLayer.errors 182 | 183 | # the parameters of the model are the parameters of the two layer it is 184 | # made out of 185 | self.params = self.hiddenLayer.params + self.logRegressionLayer.params 186 | ``` 187 | 在此之前,我们使用minibatch的随机梯度下降来训练这个模型。不同的是,我们现在在`cost`函数里面添加了正则项。`L1_reg`和`L2_reg`可以控制权值矩阵的正则化。计算新cost的代码如下: 188 | 189 | ```Python 190 | # the cost we minimize during training is the negative log likelihood of 191 | # the model plus the regularization terms (L1 and L2); cost is expressed 192 | # here symbolically 193 | cost = ( 194 | classifier.negative_log_likelihood(y) 195 | + L1_reg * classifier.L1 196 | + L2_reg * classifier.L2_sqr 197 | ) 198 | ``` 199 | 我们使用梯度来更新模型参数,这基本和逻辑回归里面的一样。我们从模型的`params`中获取参数列表,然后分析它,并计算每一步的梯度。 200 | 201 | ```Python 202 | # compute the gradient of cost with respect to theta (sotred in params) 203 | # the resulting gradients will be stored in a list gparams 204 | gparams = [T.grad(cost, param) for param in classifier.params] 205 | 206 | # specify how to update the parameters of the model as a list of 207 | # (variable, update expression) pairs 208 | 209 | # given two list the zip A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4] of 210 | # same length, zip generates a list C of same size, where each element 211 | # is a pair formed from the two lists : 212 | # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)] 213 | updates = [ 214 | (param, param - learning_rate * gparam) 215 | for param, gparam in zip(classifier.params, gparams) 216 | ] 217 | 218 | # compiling a Theano function `train_model` that returns the cost, but 219 | # in the same time updates the parameter of the model based on the rules 220 | # defined in `updates` 221 | train_model = theano.function( 222 | inputs=[index], 223 | outputs=cost, 224 | updates=updates, 225 | givens={ 226 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 227 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 228 | } 229 | ) 230 | ``` 231 | 232 | ## 把它组合起来 233 | 已经解释了所有的基本该概念,下面的代码就是一个完整的MLP类。 234 | 235 | ```Python 236 | """ 237 | This tutorial introduces the multilayer perceptron using Theano. 238 | 239 | A multilayer perceptron is a logistic regressor where 240 | instead of feeding the input to the logistic regression you insert a 241 | intermediate layer, called the hidden layer, that has a nonlinear 242 | activation function (usually tanh or sigmoid) . One can use many such 243 | hidden layers making the architecture deep. The tutorial will also tackle 244 | the problem of MNIST digit classification. 245 | 246 | .. math:: 247 | 248 | f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))), 249 | 250 | References: 251 | 252 | - textbooks: "Pattern Recognition and Machine Learning" - 253 | Christopher M. Bishop, section 5 254 | 255 | """ 256 | __docformat__ = 'restructedtext en' 257 | 258 | 259 | import os 260 | import sys 261 | import time 262 | 263 | import numpy 264 | 265 | import theano 266 | import theano.tensor as T 267 | 268 | 269 | from logistic_sgd import LogisticRegression, load_data 270 | 271 | 272 | # start-snippet-1 273 | class HiddenLayer(object): 274 | def __init__(self, rng, input, n_in, n_out, W=None, b=None, 275 | activation=T.tanh): 276 | """ 277 | Typical hidden layer of a MLP: units are fully-connected and have 278 | sigmoidal activation function. 
Weight matrix W is of shape (n_in,n_out) 279 | and the bias vector b is of shape (n_out,). 280 | 281 | NOTE : The nonlinearity used here is tanh 282 | 283 | Hidden unit activation is given by: tanh(dot(input,W) + b) 284 | 285 | :type rng: numpy.random.RandomState 286 | :param rng: a random number generator used to initialize weights 287 | 288 | :type input: theano.tensor.dmatrix 289 | :param input: a symbolic tensor of shape (n_examples, n_in) 290 | 291 | :type n_in: int 292 | :param n_in: dimensionality of input 293 | 294 | :type n_out: int 295 | :param n_out: number of hidden units 296 | 297 | :type activation: theano.Op or function 298 | :param activation: Non linearity to be applied in the hidden 299 | layer 300 | """ 301 | self.input = input 302 | # end-snippet-1 303 | 304 | # `W` is initialized with `W_values` which is uniformely sampled 305 | # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden)) 306 | # for tanh activation function 307 | # the output of uniform if converted using asarray to dtype 308 | # theano.config.floatX so that the code is runable on GPU 309 | # Note : optimal initialization of weights is dependent on the 310 | # activation function used (among other things). 311 | # For example, results presented in [Xavier10] suggest that you 312 | # should use 4 times larger initial weights for sigmoid 313 | # compared to tanh 314 | # We have no info for other function, so we use the same as 315 | # tanh. 316 | if W is None: 317 | W_values = numpy.asarray( 318 | rng.uniform( 319 | low=-numpy.sqrt(6. / (n_in + n_out)), 320 | high=numpy.sqrt(6. / (n_in + n_out)), 321 | size=(n_in, n_out) 322 | ), 323 | dtype=theano.config.floatX 324 | ) 325 | if activation == theano.tensor.nnet.sigmoid: 326 | W_values *= 4 327 | 328 | W = theano.shared(value=W_values, name='W', borrow=True) 329 | 330 | if b is None: 331 | b_values = numpy.zeros((n_out,), dtype=theano.config.floatX) 332 | b = theano.shared(value=b_values, name='b', borrow=True) 333 | 334 | self.W = W 335 | self.b = b 336 | 337 | lin_output = T.dot(input, self.W) + self.b 338 | self.output = ( 339 | lin_output if activation is None 340 | else activation(lin_output) 341 | ) 342 | # parameters of the model 343 | self.params = [self.W, self.b] 344 | 345 | 346 | # start-snippet-2 347 | class MLP(object): 348 | """Multi-Layer Perceptron Class 349 | 350 | A multilayer perceptron is a feedforward artificial neural network model 351 | that has one layer or more of hidden units and nonlinear activations. 352 | Intermediate layers usually have as activation function tanh or the 353 | sigmoid function (defined here by a ``HiddenLayer`` class) while the 354 | top layer is a softamx layer (defined here by a ``LogisticRegression`` 355 | class). 
356 | """ 357 | 358 | def __init__(self, rng, input, n_in, n_hidden, n_out): 359 | """Initialize the parameters for the multilayer perceptron 360 | 361 | :type rng: numpy.random.RandomState 362 | :param rng: a random number generator used to initialize weights 363 | 364 | :type input: theano.tensor.TensorType 365 | :param input: symbolic variable that describes the input of the 366 | architecture (one minibatch) 367 | 368 | :type n_in: int 369 | :param n_in: number of input units, the dimension of the space in 370 | which the datapoints lie 371 | 372 | :type n_hidden: int 373 | :param n_hidden: number of hidden units 374 | 375 | :type n_out: int 376 | :param n_out: number of output units, the dimension of the space in 377 | which the labels lie 378 | 379 | """ 380 | 381 | # Since we are dealing with a one hidden layer MLP, this will translate 382 | # into a HiddenLayer with a tanh activation function connected to the 383 | # LogisticRegression layer; the activation function can be replaced by 384 | # sigmoid or any other nonlinear function 385 | self.hiddenLayer = HiddenLayer( 386 | rng=rng, 387 | input=input, 388 | n_in=n_in, 389 | n_out=n_hidden, 390 | activation=T.tanh 391 | ) 392 | 393 | # The logistic regression layer gets as input the hidden units 394 | # of the hidden layer 395 | self.logRegressionLayer = LogisticRegression( 396 | input=self.hiddenLayer.output, 397 | n_in=n_hidden, 398 | n_out=n_out 399 | ) 400 | # end-snippet-2 start-snippet-3 401 | # L1 norm ; one regularization option is to enforce L1 norm to 402 | # be small 403 | self.L1 = ( 404 | abs(self.hiddenLayer.W).sum() 405 | + abs(self.logRegressionLayer.W).sum() 406 | ) 407 | 408 | # square of L2 norm ; one regularization option is to enforce 409 | # square of L2 norm to be small 410 | self.L2_sqr = ( 411 | (self.hiddenLayer.W ** 2).sum() 412 | + (self.logRegressionLayer.W ** 2).sum() 413 | ) 414 | 415 | # negative log likelihood of the MLP is given by the negative 416 | # log likelihood of the output of the model, computed in the 417 | # logistic regression layer 418 | self.negative_log_likelihood = ( 419 | self.logRegressionLayer.negative_log_likelihood 420 | ) 421 | # same holds for the function computing the number of errors 422 | self.errors = self.logRegressionLayer.errors 423 | 424 | # the parameters of the model are the parameters of the two layer it is 425 | # made out of 426 | self.params = self.hiddenLayer.params + self.logRegressionLayer.params 427 | # end-snippet-3 428 | 429 | 430 | def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000, 431 | dataset='mnist.pkl.gz', batch_size=20, n_hidden=500): 432 | """ 433 | Demonstrate stochastic gradient descent optimization for a multilayer 434 | perceptron 435 | 436 | This is demonstrated on MNIST. 
437 | 438 | :type learning_rate: float 439 | :param learning_rate: learning rate used (factor for the stochastic 440 | gradient 441 | 442 | :type L1_reg: float 443 | :param L1_reg: L1-norm's weight when added to the cost (see 444 | regularization) 445 | 446 | :type L2_reg: float 447 | :param L2_reg: L2-norm's weight when added to the cost (see 448 | regularization) 449 | 450 | :type n_epochs: int 451 | :param n_epochs: maximal number of epochs to run the optimizer 452 | 453 | :type dataset: string 454 | :param dataset: the path of the MNIST dataset file from 455 | http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz 456 | 457 | 458 | """ 459 | datasets = load_data(dataset) 460 | 461 | train_set_x, train_set_y = datasets[0] 462 | valid_set_x, valid_set_y = datasets[1] 463 | test_set_x, test_set_y = datasets[2] 464 | 465 | # compute number of minibatches for training, validation and testing 466 | n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size 467 | n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size 468 | n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size 469 | 470 | ###################### 471 | # BUILD ACTUAL MODEL # 472 | ###################### 473 | print '... building the model' 474 | 475 | # allocate symbolic variables for the data 476 | index = T.lscalar() # index to a [mini]batch 477 | x = T.matrix('x') # the data is presented as rasterized images 478 | y = T.ivector('y') # the labels are presented as 1D vector of 479 | # [int] labels 480 | 481 | rng = numpy.random.RandomState(1234) 482 | 483 | # construct the MLP class 484 | classifier = MLP( 485 | rng=rng, 486 | input=x, 487 | n_in=28 * 28, 488 | n_hidden=n_hidden, 489 | n_out=10 490 | ) 491 | 492 | # start-snippet-4 493 | # the cost we minimize during training is the negative log likelihood of 494 | # the model plus the regularization terms (L1 and L2); cost is expressed 495 | # here symbolically 496 | cost = ( 497 | classifier.negative_log_likelihood(y) 498 | + L1_reg * classifier.L1 499 | + L2_reg * classifier.L2_sqr 500 | ) 501 | # end-snippet-4 502 | 503 | # compiling a Theano function that computes the mistakes that are made 504 | # by the model on a minibatch 505 | test_model = theano.function( 506 | inputs=[index], 507 | outputs=classifier.errors(y), 508 | givens={ 509 | x: test_set_x[index * batch_size:(index + 1) * batch_size], 510 | y: test_set_y[index * batch_size:(index + 1) * batch_size] 511 | } 512 | ) 513 | 514 | validate_model = theano.function( 515 | inputs=[index], 516 | outputs=classifier.errors(y), 517 | givens={ 518 | x: valid_set_x[index * batch_size:(index + 1) * batch_size], 519 | y: valid_set_y[index * batch_size:(index + 1) * batch_size] 520 | } 521 | ) 522 | 523 | # start-snippet-5 524 | # compute the gradient of cost with respect to theta (sotred in params) 525 | # the resulting gradients will be stored in a list gparams 526 | gparams = [T.grad(cost, param) for param in classifier.params] 527 | 528 | # specify how to update the parameters of the model as a list of 529 | # (variable, update expression) pairs 530 | 531 | # given two list the zip A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4] of 532 | # same length, zip generates a list C of same size, where each element 533 | # is a pair formed from the two lists : 534 | # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)] 535 | updates = [ 536 | (param, param - learning_rate * gparam) 537 | for param, gparam in zip(classifier.params, gparams) 538 | ] 539 | 540 | # compiling a 
Theano function `train_model` that returns the cost, but 541 | # in the same time updates the parameter of the model based on the rules 542 | # defined in `updates` 543 | train_model = theano.function( 544 | inputs=[index], 545 | outputs=cost, 546 | updates=updates, 547 | givens={ 548 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 549 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 550 | } 551 | ) 552 | # end-snippet-5 553 | 554 | ############### 555 | # TRAIN MODEL # 556 | ############### 557 | print '... training' 558 | 559 | # early-stopping parameters 560 | patience = 10000 # look as this many examples regardless 561 | patience_increase = 2 # wait this much longer when a new best is 562 | # found 563 | improvement_threshold = 0.995 # a relative improvement of this much is 564 | # considered significant 565 | validation_frequency = min(n_train_batches, patience / 2) 566 | # go through this many 567 | # minibatche before checking the network 568 | # on the validation set; in this case we 569 | # check every epoch 570 | 571 | best_validation_loss = numpy.inf 572 | best_iter = 0 573 | test_score = 0. 574 | start_time = time.clock() 575 | 576 | epoch = 0 577 | done_looping = False 578 | 579 | while (epoch < n_epochs) and (not done_looping): 580 | epoch = epoch + 1 581 | for minibatch_index in xrange(n_train_batches): 582 | 583 | minibatch_avg_cost = train_model(minibatch_index) 584 | # iteration number 585 | iter = (epoch - 1) * n_train_batches + minibatch_index 586 | 587 | if (iter + 1) % validation_frequency == 0: 588 | # compute zero-one loss on validation set 589 | validation_losses = [validate_model(i) for i 590 | in xrange(n_valid_batches)] 591 | this_validation_loss = numpy.mean(validation_losses) 592 | 593 | print( 594 | 'epoch %i, minibatch %i/%i, validation error %f %%' % 595 | ( 596 | epoch, 597 | minibatch_index + 1, 598 | n_train_batches, 599 | this_validation_loss * 100. 600 | ) 601 | ) 602 | 603 | # if we got the best validation score until now 604 | if this_validation_loss < best_validation_loss: 605 | #improve patience if loss improvement is good enough 606 | if ( 607 | this_validation_loss < best_validation_loss * 608 | improvement_threshold 609 | ): 610 | patience = max(patience, iter * patience_increase) 611 | 612 | best_validation_loss = this_validation_loss 613 | best_iter = iter 614 | 615 | # test it on the test set 616 | test_losses = [test_model(i) for i 617 | in xrange(n_test_batches)] 618 | test_score = numpy.mean(test_losses) 619 | 620 | print((' epoch %i, minibatch %i/%i, test error of ' 621 | 'best model %f %%') % 622 | (epoch, minibatch_index + 1, n_train_batches, 623 | test_score * 100.)) 624 | 625 | if patience <= iter: 626 | done_looping = True 627 | break 628 | 629 | end_time = time.clock() 630 | print(('Optimization complete. Best validation score of %f %% ' 631 | 'obtained at iteration %i, with test performance %f %%') % 632 | (best_validation_loss * 100., best_iter + 1, test_score * 100.)) 633 | print >> sys.stderr, ('The code for file ' + 634 | os.path.split(__file__)[1] + 635 | ' ran for %.2fm' % ((end_time - start_time) / 60.)) 636 | 637 | 638 | if __name__ == '__main__': 639 | test_mlp() 640 | ``` 641 | 预计将会得到这样的输出: 642 | ``` 643 | Optimization complete. 
Best validation score of 1.690000 % obtained at iteration 2070000, with test performance 1.650000 % 644 | The code for file mlp.py ran for 97.34m 645 | ``` 646 | 在一台Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz的机器上,这个代码跑了10.3 epoch/minute然后花了828 epochs得到了1.65%的测试错误率。 647 | 读者也可以在[这个页面](http://yann.lecun.com/exdb/mnist)查看MNIST的识别结果。 648 | 649 | 650 | ## 训练MLPs的技巧 651 | 在上面的代码中国,有一些是不能进行梯度下降来优化的。严格意义上将,发现最优的超参集合是不可能的任务。第一,我们不能独立的优化每一个参数。第二,我们不能很容易的求解所有参数的梯度(有些是离散的值,有些是实数)。第三,这个优化问题是非凸的,容易陷入局部最优。 652 | 好消息是,过去25年,研究者发明了一些在神经网络中选择超参数的方法和规则。你可以在LeCun等人的[Efficient BackPro](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)中阅读,这是一个好的综述。这里,我们将总结下我们的代码中用到的几个重要的方法和技术。 653 | 654 | ### 非线性 655 | 最常见的就是`sigmoid`和`tanh`函数。在[第4.4节](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)中解释的,非线性是关于原点对称的,它倾向去输出0均值的输出(这是被期望的属性)。根据我们的经验,tanh(双曲函数)拥有更好的收敛性。 656 | ### 权值初始化 657 | 在初始化权值的时候,我们一般需要它们在0附近,要足够小(在激活函数的近似线性区域可以获得最大的梯度)。另一个特性,尤其对深度网络而言,是可以减小层与层之间的激活函数的方差和反向传导梯度的方差。这就可以让信息更好的向下和向上的传导,减少层间差异。数学推倒,请看[Xavier10](http://deeplearning.net/tutorial/references.html#xavier10)。 658 | ### 学习率 659 | 有许多文献专注在好的学习速率的选择上。最简单的方案就是选择一个固定速率。经验法则:尝试对数间隔的值(0.1,001,。。),然后缩小(对数)网络搜索的范围(你获得最低验证错误的区域)。 660 | 随着时间的推移减小学习速率有时候也是一个好主意。一个简单的方法是使用这个公式:u/(1+d*t),u是初始速率(可以使用上面讲的网格搜索选择),d是减小常量,用以控制学习速率,可以设为0.001或者更小,t是迭代次数或者时间。 661 | [4.7节](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)讲述了网络中每个参数学习速率选择的方法,然后基于分类错误率自适应的选择它们。 662 | ### 隐藏节点数 663 | 这个超参数是非常基于数据集的。模糊的来说就是,输入分布越复杂,去模拟它的网络就需要更大的容量,那么隐藏单元的数目就要更大。事实上,一个层的权值矩阵就是可以直接度量的(输入维度*输出维度)。 664 | 除非我们去使用正则选项(early-stopping或L1/L2惩罚),隐藏节点数和泛化表现的分布图,将呈现U型(即隐藏节点越多,在后期并不能提高泛化性)。 665 | ### 正则化参数 666 | 典型的方法是使用L1/L2正则化,同时lambda设为0.01,0.001等。尽管在我们之前提及的框架里面,它并没有显著提高性能,但它仍然是一个值得探讨的方法。 667 | 668 | 669 | 670 | 671 | 672 | 673 | -------------------------------------------------------------------------------- /4_Convoltional_Neural_Networks_LeNet_卷积神经网络.md: -------------------------------------------------------------------------------- 1 | 卷积神经网络(Convolutional Neural Networks LeNet) 2 | ================================================== 3 | 在这一节假设读者已经阅读了之前的两章[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。 4 | 5 | 如果你想要在GPU上跑样例,你需要一个好的GPU。至少是1GB显存的。当显示器连接在显卡上时,你可能需要更大的显存。 6 | 7 | 当GPU连接在显示器上时,每次GPU的调用会有几秒钟的限制。这是必须的,因为GPUs不能在计算的时候,同时被用于显示器。如果没有这个限制,屏幕会长时间不动,就像死机一样。这个例子会说中中等质量的GPUs。当GPU不连接显示器时,没有这个时间限制。你可以通过降低batch的大小来出来延时问题。 8 | 9 | 本节的所有代码,可以在[这里](http://deeplearning.net/tutorial/code/convolutional_mlp.py)下载,还有[3狼月亮图](https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/doc/images/3wolfmoon.jpg)。 10 | 11 | ## 动机 12 | 卷积神经网络是多层感知机的生物灵感变种。从Hubel和Wiesel先前对猫的视觉皮层的研究,我们知道视皮层中含有细胞的复杂分布。这些细胞只对小的视觉子区域敏感,称为`感受野`。这些子区域平铺来覆盖整个视场。这些细胞表现为输入图像空间的局部滤波器,非常适合检测自然图像中的强空间局部相关性。 13 | 14 | 此外,两类基础细胞类型被定义:`简单细胞`使用它们的感受野,最大限度的响应特定的棱状图案。`复杂细胞`有更大的感受野,可以局部不变的确定图案精确位置。动物视觉皮层是现存的最强大的视觉处理系统,很显然,我们需要去模仿它的行为。因此,许多类神经模型在文献中出现,包括[NeoCognitron](http://deeplearning.net/tutorial/references.html#fukushima),[HMAX](http://deeplearning.net/tutorial/references.html#serre07)和[LeNet-5](http://deeplearning.net/tutorial/references.html#lecun98),这是本教程需要着重讲解的。 15 | 16 | ## 稀疏连接 17 | 卷积神经网络通过在相邻层的神经元之间实施局部连接模式来检测局部空间相关性。换句话说就是,第m层的隐藏单元的输入来自第m-1层单元的子集,单元拥有空间上的感受野连接。我们可以通过如下的图来表示: 18 | 19 | ![sparse_connectivity](/images/4_sparse_con_1.png) 20 | 21 | 
想象一下,第m-1层是输入视网膜。在上图总,第m层的单元有宽度为3的对输入视网膜的感受野,因此它们只连接视网膜层中3个相邻的神经元。第m层的单元与下一层有相似的连接。我们说,感受野连接于下一层的数目也是3,但是感受野连接于输入的则更大(5)。每个单元对视网膜上于自己感受野相异的地方都是不会有响应的。这个结构保证了被学习的滤波器对一个空间局部输入图案有最强的响应。 22 | 23 | 然而,就像上面展示的,将这些层叠加起来去形成(非线性)滤波器,就可以变得越来越全局化。举例而言,第m+1层的单元可以编码一个宽度为5的非线性特征。 24 | 25 | ## 权值共享 26 | 此外,在CNNs中,每一只滤波器共享同一组权值,这样该滤波器就可以形成一个特征映射(feaature map)。梯度下降算法在小改动后可以学习这种共享参数。这个被共享权值的梯度就是被共享的参数的梯度的简单求和。 27 | 28 | 复制单元使得特征可以无视其在视觉野中的位置而被检测到。此外,权值共享增加了学习效率,减少了需要被学习的自由参数的数目。这样的设定,使得CNNs在视觉问题上有更好的泛化性。 29 | 30 | ## 细节和注解 31 | 一个特征映射是由一个函数在整个图像的某一子区域重复使用来获得的,换句话说,就是通过线性滤波器来卷积输入图像,加上偏置后,再输入到非线性函数。如果我们定义第k个特征映射是为h_k,滤波器有W_k,b_k定义,则特征映射可以被表现为如下形式: 32 | 33 | ![h_k(i,j)](/images/4_detail_notation_1.png) 34 | 35 | 其中对于2维卷积有如下定义: 36 | 37 | ![2-D_conv](/images/4_detail_notation_2.png) 38 | 39 | 为了形成数据更丰富的表达,隐藏层有多层特征映射组成{h_k,k=0..K}。一个隐层的权值矩阵W可以用一个4维张量来表示,包含了每个目标特征映射、源目标特征映射、源水平位置,源垂直位置的元素。偏置b则是一个向量,纪录每个目标特征映射的元素。我们可以用如下的图来表示: 40 | 41 | ![cnn_layer](/images/4_detail_notation_3.png) 42 | 43 | 上图显示了一个CNN的两层,第m-1层包含4个特征映射,第m层包含2个特征映射(h_0和h_1)。h_0和h_1中红蓝色区域的像素(输出值)由第m-1层中2*2的感受野计算而言(相同颜色区域)。注意,感受野包含了所有的4个输入特征映射。W_0,W_1,h_0,h_1因此都是3维权值张量。第一个维度指定输入特征映射,剩下两个表示参考坐标。 44 | 45 | 把它们都放一起就是,W_k_l(i,j),表示第m层的第k个特征映射,在第m-1层的l个特征映射的(i,j)参考坐标的连接权值。 46 | 47 | 48 | ## 卷积操作 49 | 卷积操作是Theano实现卷积层的主要消耗。卷积操作通过`theano.tensor.signal.conv2d`,它包括两个输入符号: 50 | 51 | * 与输入的minibatch有关的4维张量,尺寸包括如下:[mini-batch的大小,输入特征映射的数目,图像高度,图像宽度]。 52 | * 与权值矩阵W相关的4维张量,尺寸包括如下:[m层特征映射的数目,m-1层特征映射的数目,滤波器高度,滤波器宽度]。 53 | 54 | 下面的Theano代码实现了类似图1的卷积层。输入包括大小为120*160的3个特征映射(1一个RGB彩图)。我们使用2个大小为9*9感受野的卷积滤波器。 55 | 56 | ```Python 57 | from theano.tensor.nnet import conv 58 | rng = numpy.random.RandomState(23455) 59 | 60 | # instantiate 4D tensor for input 61 | input = T.tensor4(name='input') 62 | 63 | # initialize shared variable for weights. 64 | w_shp = (2, 3, 9, 9) 65 | w_bound = numpy.sqrt(3 * 9 * 9) 66 | W = theano.shared( numpy.asarray( 67 | rng.uniform( 68 | low=-1.0 / w_bound, 69 | high=1.0 / w_bound, 70 | size=w_shp), 71 | dtype=input.dtype), name ='W') 72 | 73 | # initialize shared variable for bias (1D tensor) with random values 74 | # IMPORTANT: biases are usually initialized to zero. However in this 75 | # particular application, we simply apply the convolutional layer to 76 | # an image without learning the parameters. We therefore initialize 77 | # them to random values to "simulate" learning. 78 | b_shp = (2,) 79 | b = theano.shared(numpy.asarray( 80 | rng.uniform(low=-.5, high=.5, size=b_shp), 81 | dtype=input.dtype), name ='b') 82 | 83 | # build symbolic expression that computes the convolution of input with filters in w 84 | conv_out = conv.conv2d(input, W) 85 | 86 | # build symbolic expression to add bias and apply activation function, i.e. produce neural net layer output 87 | # A few words on ``dimshuffle`` : 88 | # ``dimshuffle`` is a powerful tool in reshaping a tensor; 89 | # what it allows you to do is to shuffle dimension around 90 | # but also to insert new ones along which the tensor will be 91 | # broadcastable; 92 | # dimshuffle('x', 2, 'x', 0, 1) 93 | # This will work on 3d tensors with no broadcastable 94 | # dimensions. The first dimension will be broadcastable, 95 | # then we will have the third dimension of the input tensor as 96 | # the second of the resulting tensor, etc. If the tensor has 97 | # shape (20, 30, 40), the resulting tensor will have dimensions 98 | # (1, 40, 1, 20, 30). 
(AxBxC tensor is mapped to 1xCx1xAxB tensor) 99 | # More examples: 100 | # dimshuffle('x') -> make a 0d (scalar) into a 1d vector 101 | # dimshuffle(0, 1) -> identity 102 | # dimshuffle(1, 0) -> inverts the first and second dimensions 103 | # dimshuffle('x', 0) -> make a row out of a 1d vector (N to 1xN) 104 | # dimshuffle(0, 'x') -> make a column out of a 1d vector (N to Nx1) 105 | # dimshuffle(2, 0, 1) -> AxBxC to CxAxB 106 | # dimshuffle(0, 'x', 1) -> AxB to Ax1xB 107 | # dimshuffle(1, 'x', 0) -> AxB to Bx1xA 108 | output = T.nnet.sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x')) 109 | 110 | # create theano function to compute filtered images 111 | f = theano.function([input], output) 112 | ``` 113 | 让我们让它变得有趣点... 114 | ```Python 115 | import numpy 116 | import pylab 117 | from PIL import Image 118 | 119 | # open random image of dimensions 639x516 120 | img = Image.open(open('/images/3wolfmoon.jpg')) 121 | img = numpy.asarray(img, dtype='float64') / 256. 122 | 123 | # put image in 4D tensor of shape (1, 3, height, width) 124 | img_ = img.swapaxes(0, 2).swapaxes(1, 2).reshape(1, 3, 639, 516) 125 | filtered_img = f(img_) 126 | 127 | # plot original image and first and second components of output 128 | pylab.subplot(1, 3, 1); pylab.axis('off'); pylab.imshow(img) 129 | pylab.gray(); 130 | # recall that the convOp output (filtered image) is actually a "minibatch", 131 | # of size 1 here, so we take index 0 in the first dimension: 132 | pylab.subplot(1, 3, 2); pylab.axis('off'); pylab.imshow(filtered_img[0, 0, :, :]) 133 | pylab.subplot(1, 3, 3); pylab.axis('off'); pylab.imshow(filtered_img[0, 1, :, :]) 134 | pylab.show() 135 | ``` 136 | 将产生这样的输出: 137 | 138 | ![3wolf](/images/4_conv_operator_1.png) 139 | 140 | 注意,一个随机的初始化滤波器表现得很像一个特征检测器。 141 | 142 | 注意我们使用了与MLP相同得权值初始化方案。权值在一个范围为[-1/fan-in, 1/fan-in]的均匀分布中随机取样,fan-in是一个隐单元的输入数。对MLP,它是下一层单元的数目。对CNNs,我不得不需要去考虑到输入特征映射的数目和感受野的大小。 143 | 144 | ## 最大池化 145 | 卷积神经网络另一个重大的概念是最大池化,一个非线性的降采样形式。最大池化就是将输入图像分割为一系列不重叠的矩阵,然后对每个子区域,输出最大值。 146 | 147 | 最大池化在视觉中是有用的,由如下2个原因: 148 | * 通过消除非最大值,减少了更上层的计算量 149 | * 提供了一种平移不变性。想象一下,一个最大池化层级联在一个卷积层。这里有8个方向,一个输入图像可以通过单个像素平移。假如说最大池化是2*2的区域,8个可能的方向中有3个可能会产生相同的输出(3/8)。当池化层为3*3时,概率增加到5/8。 150 | 151 | 最大池化在Theano中通过`theano.tensor.signal.downsample.max_pool_2d`。这个函数被设计为可以接受N维的张量和一个缩减因子,然后对张量的最后2维执行最大池化。 152 | 153 | ```Python 154 | from theano.tensor.signal import downsample 155 | 156 | input = T.dtensor4('input') 157 | maxpool_shape = (2, 2) 158 | pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=True) 159 | f = theano.function([input],pool_out) 160 | 161 | invals = numpy.random.RandomState(1).rand(3, 2, 5, 5) 162 | print 'With ignore_border set to True:' 163 | print 'invals[0, 0, :, :] =\n', invals[0, 0, :, :] 164 | print 'output[0, 0, :, :] =\n', f(invals)[0, 0, :, :] 165 | 166 | pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=False) 167 | f = theano.function([input],pool_out) 168 | print 'With ignore_border set to False:' 169 | print 'invals[1, 0, :, :] =\n ', invals[1, 0, :, :] 170 | print 'output[1, 0, :, :] =\n ', f(invals)[1, 0, :, :] 171 | ``` 172 | 将会产生下面的输出: 173 | ``` 174 | With ignore_border set to True: 175 | invals[0, 0, :, :] = 176 | [[ 4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01 1.46755891e-01] 177 | [ 9.23385948e-02 1.86260211e-01 3.45560727e-01 3.96767474e-01 5.38816734e-01] 178 | [ 4.19194514e-01 6.85219500e-01 2.04452250e-01 8.78117436e-01 2.73875932e-02] 179 | [ 6.70467510e-01 4.17304802e-01 5.58689828e-01 1.40386939e-01 1.98101489e-01] 180 | [ 
8.00744569e-01 9.68261576e-01 3.13424178e-01 6.92322616e-01 8.76389152e-01]] 181 | output[0, 0, :, :] = 182 | [[ 0.72032449 0.39676747] 183 | [ 0.6852195 0.87811744]] 184 | 185 | With ignore_border set to False: 186 | invals[1, 0, :, :] = 187 | [[ 0.01936696 0.67883553 0.21162812 0.26554666 0.49157316] 188 | [ 0.05336255 0.57411761 0.14672857 0.58930554 0.69975836] 189 | [ 0.10233443 0.41405599 0.69440016 0.41417927 0.04995346] 190 | [ 0.53589641 0.66379465 0.51488911 0.94459476 0.58655504] 191 | [ 0.90340192 0.1374747 0.13927635 0.80739129 0.39767684]] 192 | output[1, 0, :, :] = 193 | [[ 0.67883553 0.58930554 0.69975836] 194 | [ 0.66379465 0.94459476 0.58655504] 195 | [ 0.90340192 0.80739129 0.39767684]] 196 | ``` 197 | 注意,与其他Theano代码相比,`max_pool_2d`操作有点特殊。它需要缩减因子`ds`(长度维2的tuple,班汉图像长度和宽度的缩减因子)在图构建的时候被告知。这在未来可能会发生改变。 198 | 199 | 200 | ## 整个模型 201 | 稀疏性、卷积层和最大池化时LeNet系列模型的核心。而准确的模型细节有很大的差异,下图显示了一个LeNet模型。 202 | 203 | ![full_model](/images/4_full_model_1.png) 204 | 205 | 模型比较低的层是由卷积和最大池化层的交替来构建的,较高的层则是全连接的传统MLP(隐藏层+逻辑回归)。第一个全连接层的输入是前一层(the layer below)的特征映射的集合。 206 | 207 | 从上图的实现来看,较低层的操作都是建立在4维张量上的。然后它需要被压缩为2维的特征映射,来适应之后的MLP实现。 208 | 209 | ###将它组合起来 210 | 现在我们已经知道了在Theano中实现LeNet的所有方法。那我们先实现一个`LeNetConvPoolLayer`类,它是{卷积+最大池化}层。 211 | 212 | ```Python 213 | class LeNetConvPoolLayer(object): 214 | """Pool Layer of a convolutional network """ 215 | 216 | def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)): 217 | """ 218 | Allocate a LeNetConvPoolLayer with shared variable internal parameters. 219 | 220 | :type rng: numpy.random.RandomState 221 | :param rng: a random number generator used to initialize weights 222 | 223 | :type input: theano.tensor.dtensor4 224 | :param input: symbolic image tensor, of shape image_shape 225 | 226 | :type filter_shape: tuple or list of length 4 227 | :param filter_shape: (number of filters, num input feature maps, 228 | filter height, filter width) 229 | 230 | :type image_shape: tuple or list of length 4 231 | :param image_shape: (batch size, num input feature maps, 232 | image height, image width) 233 | 234 | :type poolsize: tuple or list of length 2 235 | :param poolsize: the downsampling (pooling) factor (#rows, #cols) 236 | """ 237 | 238 | assert image_shape[1] == filter_shape[1] 239 | self.input = input 240 | 241 | # there are "num input feature maps * filter height * filter width" 242 | # inputs to each hidden unit 243 | fan_in = numpy.prod(filter_shape[1:]) 244 | # each unit in the lower layer receives a gradient from: 245 | # "num output feature maps * filter height * filter width" / 246 | # pooling size 247 | fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) / 248 | numpy.prod(poolsize)) 249 | # initialize weights with random weights 250 | W_bound = numpy.sqrt(6. 
/ (fan_in + fan_out)) 251 | self.W = theano.shared( 252 | numpy.asarray( 253 | rng.uniform(low=-W_bound, high=W_bound, size=filter_shape), 254 | dtype=theano.config.floatX 255 | ), 256 | borrow=True 257 | ) 258 | 259 | # the bias is a 1D tensor -- one bias per output feature map 260 | b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX) 261 | self.b = theano.shared(value=b_values, borrow=True) 262 | 263 | # convolve input feature maps with filters 264 | conv_out = conv.conv2d( 265 | input=input, 266 | filters=self.W, 267 | filter_shape=filter_shape, 268 | image_shape=image_shape 269 | ) 270 | 271 | # downsample each feature map individually, using maxpooling 272 | pooled_out = downsample.max_pool_2d( 273 | input=conv_out, 274 | ds=poolsize, 275 | ignore_border=True 276 | ) 277 | 278 | # add the bias term. Since the bias is a vector (1D array), we first 279 | # reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will 280 | # thus be broadcasted across mini-batches and feature map 281 | # width & height 282 | self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x')) 283 | 284 | # store parameters of this layer 285 | self.params = [self.W, self.b] 286 | ``` 287 | 注意,当初始化权重值的时候,fan-in是由感受野和输入特征映射数决定的。 288 | 289 | 最后,我们通过使用在[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)中定义的`LogisticRegression`类,和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)中的`HiddenLayer`类来实例化我们的网络: 290 | 291 | ```Python 292 | x = T.matrix('x') # the data is presented as rasterized images 293 | y = T.ivector('y') # the labels are presented as 1D vector of 294 | # [int] labels 295 | 296 | ###################### 297 | # BUILD ACTUAL MODEL # 298 | ###################### 299 | print '... building the model' 300 | 301 | # Reshape matrix of rasterized images of shape (batch_size, 28 * 28) 302 | # to a 4D tensor, compatible with our LeNetConvPoolLayer 303 | # (28, 28) is the size of MNIST images. 304 | layer0_input = x.reshape((batch_size, 1, 28, 28)) 305 | 306 | # Construct the first convolutional pooling layer: 307 | # filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24) 308 | # maxpooling reduces this further to (24/2, 24/2) = (12, 12) 309 | # 4D output tensor is thus of shape (batch_size, nkerns[0], 12, 12) 310 | layer0 = LeNetConvPoolLayer( 311 | rng, 312 | input=layer0_input, 313 | image_shape=(batch_size, 1, 28, 28), 314 | filter_shape=(nkerns[0], 1, 5, 5), 315 | poolsize=(2, 2) 316 | ) 317 | 318 | # Construct the second convolutional pooling layer 319 | # filtering reduces the image size to (12-5+1, 12-5+1) = (8, 8) 320 | # maxpooling reduces this further to (8/2, 8/2) = (4, 4) 321 | # 4D output tensor is thus of shape (nkerns[0], nkerns[1], 4, 4) 322 | layer1 = LeNetConvPoolLayer( 323 | rng, 324 | input=layer0.output, 325 | image_shape=(batch_size, nkerns[0], 12, 12), 326 | filter_shape=(nkerns[1], nkerns[0], 5, 5), 327 | poolsize=(2, 2) 328 | ) 329 | 330 | # the HiddenLayer being fully-connected, it operates on 2D matrices of 331 | # shape (batch_size, num_pixels) (i.e matrix of rasterized images). 332 | # This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4), 333 | # or (500, 50 * 4 * 4) = (500, 800) with the default values. 
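    # flatten(2) keeps the leading (batch) dimension and collapses the remaining
    # (nkerns[1], 4, 4) dimensions into a single vector per example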
334 | layer2_input = layer1.output.flatten(2) 335 | 336 | # construct a fully-connected sigmoidal layer 337 | layer2 = HiddenLayer( 338 | rng, 339 | input=layer2_input, 340 | n_in=nkerns[1] * 4 * 4, 341 | n_out=500, 342 | activation=T.tanh 343 | ) 344 | 345 | # classify the values of the fully-connected sigmoidal layer 346 | layer3 = LogisticRegression(input=layer2.output, n_in=500, n_out=10) 347 | 348 | # the cost we minimize during training is the NLL of the model 349 | cost = layer3.negative_log_likelihood(y) 350 | 351 | # create a function to compute the mistakes that are made by the model 352 | test_model = theano.function( 353 | [index], 354 | layer3.errors(y), 355 | givens={ 356 | x: test_set_x[index * batch_size: (index + 1) * batch_size], 357 | y: test_set_y[index * batch_size: (index + 1) * batch_size] 358 | } 359 | ) 360 | 361 | validate_model = theano.function( 362 | [index], 363 | layer3.errors(y), 364 | givens={ 365 | x: valid_set_x[index * batch_size: (index + 1) * batch_size], 366 | y: valid_set_y[index * batch_size: (index + 1) * batch_size] 367 | } 368 | ) 369 | 370 | # create a list of all model parameters to be fit by gradient descent 371 | params = layer3.params + layer2.params + layer1.params + layer0.params 372 | 373 | # create a list of gradients for all model parameters 374 | grads = T.grad(cost, params) 375 | 376 | # train_model is a function that updates the model parameters by 377 | # SGD Since this model has many parameters, it would be tedious to 378 | # manually create an update rule for each model parameter. We thus 379 | # create the updates list by automatically looping over all 380 | # (params[i], grads[i]) pairs. 381 | updates = [ 382 | (param_i, param_i - learning_rate * grad_i) 383 | for param_i, grad_i in zip(params, grads) 384 | ] 385 | 386 | train_model = theano.function( 387 | [index], 388 | cost, 389 | updates=updates, 390 | givens={ 391 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 392 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 393 | } 394 | ) 395 | ``` 396 | 我们把进行实际训练和early-stopping代码取出了。因为它和MLP中是一样的。有兴趣的读者,可以阅读教程开头的源代码。 397 | 398 | ## 运行代码 399 | 在一台Core i7-2600K CPU clocked at 3.40GHz上,我们使用floatX=float32,获得如下的输出: 400 | 401 | ``` 402 | Optimization complete. 403 | Best validation score of 0.910000 % obtained at iteration 17800,with test 404 | performance 0.920000 % 405 | The code for file convolutional_mlp.py ran for 380.28m 406 | ``` 407 | 在GeForce GTX 285上,我们获得了如下: 408 | 409 | ``` 410 | Optimization complete. 411 | Best validation score of 0.910000 % obtained at iteration 15500,with test 412 | performance 0.930000 % 413 | The code for file convolutional_mlp.py ran for 46.76m 414 | ``` 415 | 在GeForce GTX 480上,获得如下: 416 | 417 | ``` 418 | Optimization complete. 
419 | Best validation score of 0.910000 % obtained at iteration 16400,with test 420 | performance 0.930000 % 421 | The code for file convolutional_mlp.py ran for 32.52m 422 | ``` 423 | 可以观察到不同实验下验证误差和测试误差的不同,这是由不同硬件的取整结构不同造成的。可以忽略。 424 | 425 | ## 技巧 426 | ### 超参的选择 427 | 卷积神经网络的训练相比与标准的MLP是相当困难的,因为它添加了更多的超参数。当我们在应用学习率和正则化的规则下,下面的方法也需要在优化CNNs被考虑: 428 | 429 | #### 滤波器的数量 430 | 当选择每层滤波器数量的时候,需要记住计算单卷积层的活性比传统的MLP会更加昂贵。 431 | 432 | 假设第l-1层包含K_(l-1)个特征映射和M*N个像素点(例如,位置数乘以特征映射数),然后第l层有K_(l)个滤波器,尺寸为m*n。那么计算一个特征映射(在(M-m)*(N-n)个像素位置应用每个m*n大小的滤波器)将消耗(M-m)*(N-n)*m*n*K_(l-1)的计算量。然后总共要计算K_l次。如果不是所有的特征只与前一层的所有特征相连,那么事情就变得更加复杂啦。 433 | 434 | 对标准MLP而言,第l层如果有K_l个神经元,那计算量只有K_(k-1)*K_l。因此,CNNs中滤波器的数量通常比MLPs中隐单元的数量小很多,通常是基于特征映射的尺寸(输入图像的尺寸和滤波器的形状)。 435 | 436 | 因为特征映射的尺寸会随着深度的增加而减小,靠近输入层的层将趋向于有更少的滤波器,而更高的层有更多的滤波器。事实上,为了平衡每一层的计算量,特征数和图像位置数的乘积在层的传递过程中都是基本一致的。为了保护输入信息,我们需要保证总的激活数量(特征映射数*像素位置数)在层间传递的时候是至于减少(当然我们在做监督学习的时候当然是希望它减小的)。特征映射的数量直接控制整个容量,同时它依赖于可用样例的数目和任务的复杂度。 437 | 438 | 439 | #### 滤波器的尺寸 440 | 通常在每个文献中滤波器的尺寸都有很大的不同,它常常是基于数据库的。MNIST在第一层的最好结果是5*5层滤波器。当自然图像(每维有几百个像素)趋向于使用更大的滤波器,例如12*12,15*15。 441 | 442 | 因此这个技巧事实上是去寻找正确等级的“粒度”,以便对给定的数据集去形成合适范围内的抽象。 443 | 444 | #### 最大池化的尺寸 445 | 经典的是2*2,或者没有最大池化。非常大的图可以在较低的层使用4*4的池化。但是需要记住的是,池化在通过16个因子减少信号维度的同时,也可能导致信号细节的大量丢失。 446 | 447 | #### 技巧 448 | 假如你想要在新的数据集上采用这个模型,下面的一些小技巧可能能让你获得更好的结果: 449 | * 白化(whitening)数据(例如,使用主成分分析) 450 | * 衰减每次迭代的学习速率。 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | -------------------------------------------------------------------------------- /5_Denoising_Autoencoders_降噪自动编码.md: -------------------------------------------------------------------------------- 1 | 降噪自动编码机(Denoising Autoencoders) 2 | ==================================== 3 | 4 | 这一节假设读者一节阅读了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md),[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。如果你需要在GPU上跑代码,你也需要阅读[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 5 | 6 | 本节所有的代码都可用在[这里](http://deeplearning.net/tutorial/code/dA.py)下载。 7 | 8 | 降噪自动编码机(denoising Autoencoders)是经典自动编码机的扩展。它在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中作为深度网络的一个构建块被介绍。我们通过简短的[自动编码机](http://deeplearning.net/tutorial/dA.html#autoencoders)来开始本教程。 9 | 10 | ## 自动编码机 11 | 在[Bengio09](http://deeplearning.net/tutorial/references.html#bengio09)的第4.6节中,有自动编码机的简介。一个自动编码机,由d维的[0,1]之间的输入向量x,通过第一层映射(使用一个编码器)来获得隐藏的d‘维度的[0,1]的输出表达y。通过如下的决定性映射: 12 | 13 | ![y_mapping](/images/5_autoencoders_1.png) 14 | 15 | 这里s是一个非线性的函数,例如sigmoid。这个潜在的表达y,或者码,被映射回一个重构机z,来重构x。这个映射通过下面的简单转换来实现: 16 | 17 | ![z_mapping](/images/5_autoencoders_2.png) 18 | 19 | (这里撇号不代表矩阵转置)当y被给定时,z被看着是对x的预测。可选的,这个权重矩阵W‘的逆映射可用被约束为正向映射的转置:![tanspose](/images/5_autoencoders_3.png),这被称为捆绑权重。这个模型的所有参数(W,b,b‘,或者不使用捆绑权重W’)通过优化最小平均重构误差来实现训练。 20 | 21 | 重建误差可以通过许多方法来度量,基于输入的分布假设。这个传统的平方误差是L(x,z)=||x-z||^2,可以被使用。假如这个输入是通过比特向量或者比特概率向量来表述,重构`交叉熵`([cross-entropy)可以被表示如下: 22 | 23 | ![cross-entropy](/images/5_autoencoders_4.png) 24 | 25 | 我们希望这样,这个编码y是一个分布式表达,可以捕捉数据中主要因子变化的位置。这就类似与主成分凸出,金额也捕捉数据中主要因子的变化。事实上,假如这里有一个线性隐藏层(这个编码),并且平均平方误差准则被用以训练这个网络,然后这k个隐藏单元学习去凸出这个输入,在该数据的前k个主成分的范围中。假如这个隐藏层是非线性的,这个自动编码机表现的是与PCA不同的,它有着捕捉输入分布的多模态方面的能力。从PCA开始变得更加重要,当我们考虑堆叠混合编码机(stacking multiple encoders,在[Hinton06](http://deeplearning.net/tutorial/references.html#hinton06)中被用以构建一个深度自动编码机)。 26 | 27 | 因为y是视为x的有损压缩(lossy 
compression),它不可能对所有的x有好的压缩。优化可以使得训练样本有好的压缩,同时也希望对别的输入有好的压缩,但是不是对于任意输入。这里有一个对自动编码机的概括定义:它对于与训练样本有相似分布的测试样本有较低的重建误差,但对于随机的输入会有较高的重构误差。 28 | 29 | 我们希望通过Theano中来实现自动编码机,作为一个类的形式,它可以在未来去构建一个层叠自动编码机。这个第一步是去创建自动编码机的共享变量参数(W,b,b‘)。 30 | 31 | ```Python 32 | class dA(object): 33 | """Denoising Auto-Encoder class (dA) 34 | 35 | A denoising autoencoders tries to reconstruct the input from a corrupted 36 | version of it by projecting it first in a latent space and reprojecting 37 | it afterwards back in the input space. Please refer to Vincent et al.,2008 38 | for more details. If x is the input then equation (1) computes a partially 39 | destroyed version of x by means of a stochastic mapping q_D. Equation (2) 40 | computes the projection of the input into the latent space. Equation (3) 41 | computes the reconstruction of the input, while equation (4) computes the 42 | reconstruction error. 43 | 44 | .. math:: 45 | 46 | \tilde{x} ~ q_D(\tilde{x}|x) (1) 47 | 48 | y = s(W \tilde{x} + b) (2) 49 | 50 | x = s(W' y + b') (3) 51 | 52 | L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)] (4) 53 | 54 | """ 55 | 56 | def __init__( 57 | self, 58 | numpy_rng, 59 | theano_rng=None, 60 | input=None, 61 | n_visible=784, 62 | n_hidden=500, 63 | W=None, 64 | bhid=None, 65 | bvis=None 66 | ): 67 | """ 68 | Initialize the dA class by specifying the number of visible units (the 69 | dimension d of the input ), the number of hidden units ( the dimension 70 | d' of the latent or hidden space ) and the corruption level. The 71 | constructor also receives symbolic variables for the input, weights and 72 | bias. Such a symbolic variables are useful when, for example the input 73 | is the result of some computations, or when weights are shared between 74 | the dA and an MLP layer. When dealing with SdAs this always happens, 75 | the dA on layer 2 gets as input the output of the dA on layer 1, 76 | and the weights of the dA are used in the second stage of training 77 | to construct an MLP. 
78 | 79 | :type numpy_rng: numpy.random.RandomState 80 | :param numpy_rng: number random generator used to generate weights 81 | 82 | :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams 83 | :param theano_rng: Theano random generator; if None is given one is 84 | generated based on a seed drawn from `rng` 85 | 86 | :type input: theano.tensor.TensorType 87 | :param input: a symbolic description of the input or None for 88 | standalone dA 89 | 90 | :type n_visible: int 91 | :param n_visible: number of visible units 92 | 93 | :type n_hidden: int 94 | :param n_hidden: number of hidden units 95 | 96 | :type W: theano.tensor.TensorType 97 | :param W: Theano variable pointing to a set of weights that should be 98 | shared belong the dA and another architecture; if dA should 99 | be standalone set this to None 100 | 101 | :type bhid: theano.tensor.TensorType 102 | :param bhid: Theano variable pointing to a set of biases values (for 103 | hidden units) that should be shared belong dA and another 104 | architecture; if dA should be standalone set this to None 105 | 106 | :type bvis: theano.tensor.TensorType 107 | :param bvis: Theano variable pointing to a set of biases values (for 108 | visible units) that should be shared belong dA and another 109 | architecture; if dA should be standalone set this to None 110 | 111 | 112 | """ 113 | self.n_visible = n_visible 114 | self.n_hidden = n_hidden 115 | 116 | # create a Theano random generator that gives symbolic random values 117 | if not theano_rng: 118 | theano_rng = RandomStreams(numpy_rng.randint(2 ** 30)) 119 | 120 | # note : W' was written as `W_prime` and b' as `b_prime` 121 | if not W: 122 | # W is initialized with `initial_W` which is uniformely sampled 123 | # from -4*sqrt(6./(n_visible+n_hidden)) and 124 | # 4*sqrt(6./(n_hidden+n_visible))the output of uniform if 125 | # converted using asarray to dtype 126 | # theano.config.floatX so that the code is runable on GPU 127 | initial_W = numpy.asarray( 128 | numpy_rng.uniform( 129 | low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), 130 | high=4 * numpy.sqrt(6. 
/ (n_hidden + n_visible)), 131 | size=(n_visible, n_hidden) 132 | ), 133 | dtype=theano.config.floatX 134 | ) 135 | W = theano.shared(value=initial_W, name='W', borrow=True) 136 | 137 | if not bvis: 138 | bvis = theano.shared( 139 | value=numpy.zeros( 140 | n_visible, 141 | dtype=theano.config.floatX 142 | ), 143 | borrow=True 144 | ) 145 | 146 | if not bhid: 147 | bhid = theano.shared( 148 | value=numpy.zeros( 149 | n_hidden, 150 | dtype=theano.config.floatX 151 | ), 152 | name='b', 153 | borrow=True 154 | ) 155 | 156 | self.W = W 157 | # b corresponds to the bias of the hidden 158 | self.b = bhid 159 | # b_prime corresponds to the bias of the visible 160 | self.b_prime = bvis 161 | # tied weights, therefore W_prime is W transpose 162 | self.W_prime = self.W.T 163 | self.theano_rng = theano_rng 164 | # if no input is given, generate a variable representing the input 165 | if input is None: 166 | # we use a matrix because we expect a minibatch of several 167 | # examples, each example being a row 168 | self.x = T.dmatrix(name='input') 169 | else: 170 | self.x = input 171 | 172 | self.params = [self.W, self.b, self.b_prime] 173 | ``` 174 | 注意,我们将`input`作为一个参数来传递给自动编码机。我们可以串联自动编码机来实现深度网络:第k层的输出(y)可以变成第k+1层的输入。 175 | 176 | 现在,我们可以预计去重构信号的潜在表达的计算量。 177 | 178 | ```Python 179 | def get_hidden_values(self, input): 180 | """ Computes the values of the hidden layer """ 181 | return T.nnet.sigmoid(T.dot(input, self.W) + self.b) 182 | 183 | def get_reconstructed_input(self, hidden): 184 | """Computes the reconstructed input given the values of the 185 | hidden layer 186 | 187 | """ 188 | return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime) 189 | ``` 190 | 然后我们通过这些函数可以计算一个随机梯度下降步骤的cost和更新。 191 | 192 | ```Python 193 | def get_cost_updates(self, corruption_level, learning_rate): 194 | """ This function computes the cost and the updates for one trainng 195 | step of the dA """ 196 | 197 | tilde_x = self.get_corrupted_input(self.x, corruption_level) 198 | y = self.get_hidden_values(tilde_x) 199 | z = self.get_reconstructed_input(y) 200 | # note : we sum over the size of a datapoint; if we are using 201 | # minibatches, L will be a vector, with one entry per 202 | # example in minibatch 203 | L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1) 204 | # note : L is now a vector, where each element is the 205 | # cross-entropy cost of the reconstruction of the 206 | # corresponding example of the minibatch. 
We need to 207 | # compute the average of all these to get the cost of 208 | # the minibatch 209 | cost = T.mean(L) 210 | 211 | # compute the gradients of the cost of the `dA` with respect 212 | # to its parameters 213 | gparams = T.grad(cost, self.params) 214 | # generate the list of updates 215 | updates = [ 216 | (param, param - learning_rate * gparam) 217 | for param, gparam in zip(self.params, gparams) 218 | ] 219 | 220 | return (cost, updates) 221 | ``` 222 | 我们现在可以定义一个函数来实现重复的更新参数W,b,b‘,直到这个重构消耗大约是最小的。 223 | 224 | ```Python 225 | da = dA( 226 | numpy_rng=rng, 227 | theano_rng=theano_rng, 228 | input=x, 229 | n_visible=28 * 28, 230 | n_hidden=500 231 | ) 232 | 233 | cost, updates = da.get_cost_updates( 234 | corruption_level=0., 235 | learning_rate=learning_rate 236 | ) 237 | 238 | train_da = theano.function( 239 | [index], 240 | cost, 241 | updates=updates, 242 | givens={ 243 | x: train_set_x[index * batch_size: (index + 1) * batch_size] 244 | } 245 | ) 246 | ``` 247 | 假设除了最小化重构误差外没有别的的限制,一个有n个输入和n(或者更大)维编码学习能力的自动编码机定义函数,将倾向于去映射出它的输入副本。这种自动编码机将无法从训练样本的分布中区分任何测试样例。(无效编码机:输出与输入完全相同)。 248 | 249 | 让人惊讶地,在[Bengio07](http://deeplearning.net/tutorial/references.html#bengio07)的实验指出,在实践中,当通过随机梯度下降训练时,有比输入更多的隐藏单元(称为超完备)的非线性的自动编码机可以产生有效的表达。(这里,有效指的是编码作为网络的输入获得了更低的分类误差)。 250 | 251 | 一个简单的解释是,使用early-stopping的随机梯度下降是与参数的L2正则化相似的。为了去实现对连续性输入有更好的重建,包含非线性隐藏单元的单隐藏层的自动编码机需要非常小的权值在第一(编码)层,以使得将非线性隐藏单元进入他们的线性区域(参考sigmoid函数),然后在第二(解码)层有更大的权值。对于二进制输入,非常大的权值也需要彻底的最小化重构误差。因为隐性的或者显性的正则化将使得获得大权值的解决方案变得困难,这个最优化算法发现在训练样本中表现好的编码。这意味着,表达是利用训练集的统计规律来实现的,而不仅仅是复制输入。 252 | 253 | 这里有其他方法,使得一个有比输入有更多隐藏单元的自动编码机,去避免只学习它本身,而是在输入的隐藏表达中捕捉到有用的东西。一个是添加稀疏性(迫使许多隐单元是0或者接近0)。稀疏性已经被很成功的发挥了[Ranzato07](http://deeplearning.net/tutorial/references.html#ranzato07)[Lee08](http://deeplearning.net/tutorial/references.html#lee08)。另一个是,在输入到重建过程中,增加从输入到重建的转换中的随机性。这个技术在受限玻尔兹曼机中被使用(Restricted Boltzmann Machines,在后面的章节中讨论),还有降噪自动编码机,在后面讨论。 254 | 255 | ## 降噪自动编码机 256 | 257 | 降噪自动编码机的思想是很简单饿。为了迫使隐藏层去发现更加鲁棒性的特征,避免它只是去简单的学习定义,我们训练自动编码机去重建被破坏的输入版本的数据。 258 | 259 | 这个降噪自动编码机是自动编码机的随机版本。直观上讲,一个降噪自动编码机做2件事情:尝试对输入进行编码(保护输入信息),然后尝试去消除输入中的随机差错产生的影响。后者可以通过捕捉输入间的统计相关性来实现。降噪自动编码机可以从不到的角度来理解(流行学习角度,随机操作角度,自下而上的信息论角度,自上而下的生成模型角度),所有的这些在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中被解释。在[Bengio09](http://deeplearning.net/tutorial/references.html#bengio09)的第7.2节有自动编码机的综述。 260 | 261 | 在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中,随机差错进程随机的设定部分(也可以是一半)输入为0。因此降噪自动编码机尝试去从未被污染的值中去预测被污染的(丢失)的值,通过随机的选择丢失模式下的仔鸡。注意如何能预测从剩下的变量的任意子集是一个充分条件,去完全捕获一组变量之间的联合分布(这是Gibbs采样工作)。 262 | 263 | 从自动编码机的类转换为降噪自动编码机,我们需要去增加一个随机误差步骤去应用到输入中。这个输入可以通过许多方法来污染,但在这个教程中,我们将支持以输入的随机性来腐化原始数据,使它趋向于0。代码如下: 264 | 265 | ```Python 266 | from theano.tensor.shared_randomstreams import RandomStreams 267 | 268 | def get_corrupted_input(self, input, corruption_level): 269 | """ This function keeps ``1-corruption_level`` entries of the inputs the same 270 | and zero-out randomly selected subset of size ``coruption_level`` 271 | Note : first argument of theano.rng.binomial is the shape(size) of 272 | random numbers that it should produce 273 | second argument is the number of trials 274 | third argument is the probability of success of any trial 275 | 276 | this will produce an array of 0s and 1s where 1 has a probability of 277 | 1 - ``corruption_level`` and 0 with ``corruption_level`` 278 | """ 279 | return self.theano_rng.binomial(size=input.shape, n=1, p=1 - corruption_level) * input 280 | ``` 281 | 
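为了更直观地理解上面的差错（corruption）过程，下面给出一个简短的示意片段（它不属于教程源码，其中的变量名仅为演示而假设）：当差错等级为0.3时，大约有30%的输入元素会被随机置零，其余元素保持不变。

```Python
import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

x = T.matrix('x')
theano_rng = RandomStreams(numpy.random.RandomState(1234).randint(2 ** 30))
corruption_level = 0.3

# 与 get_corrupted_input 相同的掩码方式：每个元素以 1 - corruption_level 的概率保留原值
corrupted = theano_rng.binomial(size=x.shape, n=1, p=1 - corruption_level,
                                dtype=theano.config.floatX) * x
corrupt_fn = theano.function([x], corrupted)

batch = numpy.ones((4, 10), dtype=theano.config.floatX)
print corrupt_fn(batch)                  # 每行大约有 3 个元素被置为 0
print (corrupt_fn(batch) == 0.).mean()   # 置零比例应接近 corruption_level
```

可以看到，这一步只负责“破坏”输入；重建的目标仍然是未被破坏的 self.x，这正是降噪自动编码机与普通自动编码机的区别所在。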
在层叠自动编码机类([层叠自动编码机](http://deeplearning.net/tutorial/SdA.html#stacked-autoencoders))中,`dA`类中的权值不得不和相应的sigmoid层共享。因为这个原因,dA的构建也将Theano变量指向了共享参数。假如这些参数被设置为`None`,新的参数会被构建。 282 | 283 | 最后的降噪自动编码机类就变成了这样: 284 | 285 | ```Python 286 | class dA(object): 287 | """Denoising Auto-Encoder class (dA) 288 | 289 | A denoising autoencoders tries to reconstruct the input from a corrupted 290 | version of it by projecting it first in a latent space and reprojecting 291 | it afterwards back in the input space. Please refer to Vincent et al.,2008 292 | for more details. If x is the input then equation (1) computes a partially 293 | destroyed version of x by means of a stochastic mapping q_D. Equation (2) 294 | computes the projection of the input into the latent space. Equation (3) 295 | computes the reconstruction of the input, while equation (4) computes the 296 | reconstruction error. 297 | 298 | .. math:: 299 | 300 | \tilde{x} ~ q_D(\tilde{x}|x) (1) 301 | 302 | y = s(W \tilde{x} + b) (2) 303 | 304 | x = s(W' y + b') (3) 305 | 306 | L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)] (4) 307 | 308 | """ 309 | 310 | def __init__(self, numpy_rng, theano_rng=None, input=None, n_visible=784, n_hidden=500, 311 | W=None, bhid=None, bvis=None): 312 | """ 313 | Initialize the dA class by specifying the number of visible units (the 314 | dimension d of the input ), the number of hidden units ( the dimension 315 | d' of the latent or hidden space ) and the corruption level. The 316 | constructor also receives symbolic variables for the input, weights and 317 | bias. Such a symbolic variables are useful when, for example the input is 318 | the result of some computations, or when weights are shared between the 319 | dA and an MLP layer. When dealing with SdAs this always happens, 320 | the dA on layer 2 gets as input the output of the dA on layer 1, 321 | and the weights of the dA are used in the second stage of training 322 | to construct an MLP. 
323 | 324 | :type numpy_rng: numpy.random.RandomState 325 | :param numpy_rng: number random generator used to generate weights 326 | 327 | :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams 328 | :param theano_rng: Theano random generator; if None is given one is generated 329 | based on a seed drawn from `rng` 330 | 331 | :type input: theano.tensor.TensorType 332 | :paran input: a symbolic description of the input or None for standalone 333 | dA 334 | 335 | :type n_visible: int 336 | :param n_visible: number of visible units 337 | 338 | :type n_hidden: int 339 | :param n_hidden: number of hidden units 340 | 341 | :type W: theano.tensor.TensorType 342 | :param W: Theano variable pointing to a set of weights that should be 343 | shared belong the dA and another architecture; if dA should 344 | be standalone set this to None 345 | 346 | :type bhid: theano.tensor.TensorType 347 | :param bhid: Theano variable pointing to a set of biases values (for 348 | hidden units) that should be shared belong dA and another 349 | architecture; if dA should be standalone set this to None 350 | 351 | :type bvis: theano.tensor.TensorType 352 | :param bvis: Theano variable pointing to a set of biases values (for 353 | visible units) that should be shared belong dA and another 354 | architecture; if dA should be standalone set this to None 355 | 356 | 357 | """ 358 | self.n_visible = n_visible 359 | self.n_hidden = n_hidden 360 | 361 | # create a Theano random generator that gives symbolic random values 362 | if not theano_rng : 363 | theano_rng = RandomStreams(rng.randint(2 ** 30)) 364 | 365 | # note : W' was written as `W_prime` and b' as `b_prime` 366 | if not W: 367 | # W is initialized with `initial_W` which is uniformely sampled 368 | # from -4.*sqrt(6./(n_visible+n_hidden)) and 4.*sqrt(6./(n_hidden+n_visible)) 369 | # the output of uniform if converted using asarray to dtype 370 | # theano.config.floatX so that the code is runable on GPU 371 | initial_W = numpy.asarray(numpy_rng.uniform( 372 | low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), 373 | high=4 * numpy.sqrt(6. 
/ (n_hidden + n_visible)), 374 | size=(n_visible, n_hidden)), dtype=theano.config.floatX) 375 | W = theano.shared(value=initial_W, name='W') 376 | 377 | if not bvis: 378 | bvis = theano.shared(value = numpy.zeros(n_visible, 379 | dtype=theano.config.floatX), name='bvis') 380 | 381 | if not bhid: 382 | bhid = theano.shared(value=numpy.zeros(n_hidden, 383 | dtype=theano.config.floatX), name='bhid') 384 | 385 | self.W = W 386 | # b corresponds to the bias of the hidden 387 | self.b = bhid 388 | # b_prime corresponds to the bias of the visible 389 | self.b_prime = bvis 390 | # tied weights, therefore W_prime is W transpose 391 | self.W_prime = self.W.T 392 | self.theano_rng = theano_rng 393 | # if no input is given, generate a variable representing the input 394 | if input == None: 395 | # we use a matrix because we expect a minibatch of several examples, 396 | # each example being a row 397 | self.x = T.dmatrix(name='input') 398 | else: 399 | self.x = input 400 | 401 | self.params = [self.W, self.b, self.b_prime] 402 | 403 | def get_corrupted_input(self, input, corruption_level): 404 | """ This function keeps ``1-corruption_level`` entries of the inputs the same 405 | and zero-out randomly selected subset of size ``coruption_level`` 406 | Note : first argument of theano.rng.binomial is the shape(size) of 407 | random numbers that it should produce 408 | second argument is the number of trials 409 | third argument is the probability of success of any trial 410 | 411 | this will produce an array of 0s and 1s where 1 has a probability of 412 | 1 - ``corruption_level`` and 0 with ``corruption_level`` 413 | """ 414 | return self.theano_rng.binomial(size=input.shape, n=1, p=1 - corruption_level) * input 415 | 416 | 417 | def get_hidden_values(self, input): 418 | """ Computes the values of the hidden layer """ 419 | return T.nnet.sigmoid(T.dot(input, self.W) + self.b) 420 | 421 | def get_reconstructed_input(self, hidden ): 422 | """ Computes the reconstructed input given the values of the hidden layer """ 423 | return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime) 424 | 425 | def get_cost_updates(self, corruption_level, learning_rate): 426 | """ This function computes the cost and the updates for one trainng 427 | step of the dA """ 428 | 429 | tilde_x = self.get_corrupted_input(self.x, corruption_level) 430 | y = self.get_hidden_values( tilde_x) 431 | z = self.get_reconstructed_input(y) 432 | # note : we sum over the size of a datapoint; if we are using minibatches, 433 | # L will be a vector, with one entry per example in minibatch 434 | L = -T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1 ) 435 | # note : L is now a vector, where each element is the cross-entropy cost 436 | # of the reconstruction of the corresponding example of the 437 | # minibatch. 
We need to compute the average of all these to get 438 | # the cost of the minibatch 439 | cost = T.mean(L) 440 | 441 | # compute the gradients of the cost of the `dA` with respect 442 | # to its parameters 443 | gparams = T.grad(cost, self.params) 444 | # generate the list of updates 445 | updates = [] 446 | for param, gparam in zip(self.params, gparams): 447 | updates.append((param, param - learning_rate * gparam)) 448 | 449 | return (cost, updates) 450 | ``` 451 | 452 | 453 | ## 将它组合起来 454 | 455 | 现在去构建一个`dA`类和训练它变得很简单了。 456 | 457 | ```Python 458 | # allocate symbolic variables for the data 459 | index = T.lscalar() # index to a [mini]batch 460 | x = T.matrix('x') # the data is presented as rasterized images 461 | 462 | ###################### 463 | # BUILDING THE MODEL # 464 | ###################### 465 | 466 | rng = numpy.random.RandomState(123) 467 | theano_rng = RandomStreams(rng.randint(2 ** 30)) 468 | 469 | da = dA(numpy_rng=rng, theano_rng=theano_rng, input=x, 470 | n_visible=28 * 28, n_hidden=500) 471 | 472 | cost, updates = da.get_cost_updates(corruption_level=0.2, 473 | learning_rate=learning_rate) 474 | 475 | 476 | train_da = theano.function([index], cost, updates=updates, 477 | givens = {x: train_set_x[index * batch_size: (index + 1) * batch_size]}) 478 | 479 | start_time = time.clock() 480 | 481 | ############ 482 | # TRAINING # 483 | ############ 484 | 485 | # go through training epochs 486 | for epoch in xrange(training_epochs): 487 | # go through trainng set 488 | c = [] 489 | for batch_index in xrange(n_train_batches): 490 | c.append(train_da(batch_index)) 491 | 492 | print 'Training epoch %d, cost ' % epoch, numpy.mean(c) 493 | 494 | end_time = time.clock 495 | 496 | training_time = (end_time - start_time) 497 | 498 | print ('Training took %f minutes' % (pretraining_time / 60.)) 499 | ``` 500 | 501 | 为了了解网络学习了什么东西,我们将会描述出滤波器(通过权值矩阵来定义)。记住,事实上它没有提供完整的情况,因为我们忽略了偏置,并且画出的权值被乘以了常数(权值被转换到了0-1之间)。 502 | 503 | 去画出我们的滤波器,我们需要`title_raster_images`(看[Plotting Samples and Filters](http://deeplearning.net/tutorial/utilities.html#how-to-plot)),所以我们强烈建议读者去了解它。当然,也在PIL(python image library)的帮助下,下面行的代码将把滤波器保存为图像: 504 | 505 | ```Python 506 | image = Image.fromarray(tile_raster_images(X=da.W.get_value(borrow=True).T, 507 | img_shape=(28, 28), tile_shape=(10, 10), 508 | tile_spacing=(1, 1))) 509 | image.save('filters_corruption_30.png') 510 | ``` 511 | 512 | ## 运行这个代码 513 | 514 | 当我们不使用任何噪声的时候,获得的滤波器如下: 515 | 516 | ![filter_not_nosie](/images/5_running_code_1.png) 517 | 518 | 有30%噪声的时候,滤波器如下: 519 | 520 | ![filter_with_nosie](/images/5_running_code_2.png) 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | -------------------------------------------------------------------------------- /6_Stacked_Denoising_Autoencoders_层叠降噪自动编码机.md: -------------------------------------------------------------------------------- 1 | 层叠降噪自动编码机(Stacked Denoising Autoencoders (SdA)) 2 | ========================================================= 3 | 4 | 在这一节,我们假设读者已经了解了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。如果你需要在GPU上进行运算,你还需要了解[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 5 | 6 | 本节的所有代码可以在[这里](http://deeplearning.net/tutorial/code/SdA.py)下载。 7 
| 8 | 层叠降噪自动编码机(Stacked Denoising Autoencoder,SdA)是层叠自动编码机([Bengio07](http://deeplearning.net/tutorial/references.html#bengio07))的一个扩展,在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中被介绍。 9 | 10 | 这个教程建立在前一个[降噪自动编码机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/5_Denoising_Autoencoders_降噪自动编码.md)之上。我们建议,对于没有自动编码机经验的人应该阅读上述章节。 11 | 12 | ###层叠自动编码机 13 | 降噪自动编码机可以被叠加起来形成一个深度网络,通过反馈前一层的降噪自动编码机的潜在表达(输出编码)作为当前层的输入。这个非监督的预学习结构一次只能学习一个层。每一层都被作为一个降噪自动编码机以最小化重构误差来进行训练。当前k个层被训练完了,我们可以进行k+1层的训练,因此此时我们才可以计算前一层的编码和潜在表达。当所有的层都被训练了,整个网络进行第二阶段训练,称为微调(fine-tuning)。这里,我们考虑监督微调,当我们需要最小化一个监督任务的预测误差吧。为此我们现在网络的顶端添加一个逻辑回归层(使输出层的编码更加精确)。然后我们像训练多层感知器一样训练整个网络。这里,我们考虑每个自动编码的机的编码模块。这个阶段是有监督的,因为我们在训练的时候使用了目标类别(更多细节请看[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)) 14 | 15 | ![SdA](/images/6_sda_1.png) 16 | 17 | 这在Theano里面,使用之前定义的降噪自动编码机,可以轻易的被实现。我们可以将层叠降噪自动编码机看作两部分,一个是自动编码机链表,另一个是一个多层感知机。在预训练阶段,我们使用了第一部分,例如我们将模型看作一系列的自动编码机,然后分别训练每一个自动编码机。在第二阶段,我们使用第二部分。这个两个部分通过分享参数来实现连接。 18 | 19 | ```Python 20 | class SdA(object): 21 | """Stacked denoising auto-encoder class (SdA) 22 | 23 | A stacked denoising autoencoder model is obtained by stacking several 24 | dAs. The hidden layer of the dA at layer `i` becomes the input of 25 | the dA at layer `i+1`. The first layer dA gets as input the input of 26 | the SdA, and the hidden layer of the last dA represents the output. 27 | Note that after pretraining, the SdA is dealt with as a normal MLP, 28 | the dAs are only used to initialize the weights. 29 | """ 30 | 31 | def __init__( 32 | self, 33 | numpy_rng, 34 | theano_rng=None, 35 | n_ins=784, 36 | hidden_layers_sizes=[500, 500], 37 | n_outs=10, 38 | corruption_levels=[0.1, 0.1] 39 | ): 40 | """ This class is made to support a variable number of layers. 
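        The number of layers is given by the length of ``hidden_layers_sizes``,
        and each entry also fixes the size of the corresponding dA.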
41 | 42 | :type numpy_rng: numpy.random.RandomState 43 | :param numpy_rng: numpy random number generator used to draw initial 44 | weights 45 | 46 | :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams 47 | :param theano_rng: Theano random generator; if None is given one is 48 | generated based on a seed drawn from `rng` 49 | 50 | :type n_ins: int 51 | :param n_ins: dimension of the input to the sdA 52 | 53 | :type n_layers_sizes: list of ints 54 | :param n_layers_sizes: intermediate layers size, must contain 55 | at least one value 56 | 57 | :type n_outs: int 58 | :param n_outs: dimension of the output of the network 59 | 60 | :type corruption_levels: list of float 61 | :param corruption_levels: amount of corruption to use for each 62 | layer 63 | """ 64 | 65 | self.sigmoid_layers = [] 66 | self.dA_layers = [] 67 | self.params = [] 68 | self.n_layers = len(hidden_layers_sizes) 69 | 70 | assert self.n_layers > 0 71 | 72 | if not theano_rng: 73 | theano_rng = RandomStreams(numpy_rng.randint(2 ** 30)) 74 | # allocate symbolic variables for the data 75 | self.x = T.matrix('x') # the data is presented as rasterized images 76 | self.y = T.ivector('y') # the labels are presented as 1D vector of 77 | # [int] labels 78 | ``` 79 | `self.sigmoid_layers`将会储存多层感知机的sigmoid层,`self.dA_layers`将会储存连接多层感知机层的降噪自动编码机。 80 | 81 | 下一步,我们构建`n_layers`个sigmoid层(我们使用在[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)中介绍的`HiddenLayer`类,唯一的更改是将原本的非线性函数`tanh`换成了logistic函数s=1/(1+exp(-x)))和`n_layers`个降噪自动编码机,当然`n_layers`就是我们模型的深度。我们连接sigmoid函数,使得他们形成一个MLP,构建每一个自动编码机和他们对应的sigmoid层,去共享编码部分的权值矩阵和偏执 82 | 83 | ```Python 84 | for i in xrange(self.n_layers): 85 | # construct the sigmoidal layer 86 | 87 | # the size of the input is either the number of hidden units of 88 | # the layer below or the input size if we are on the first layer 89 | if i == 0: 90 | input_size = n_ins 91 | else: 92 | input_size = hidden_layers_sizes[i - 1] 93 | 94 | # the input to this layer is either the activation of the hidden 95 | # layer below or the input of the SdA if you are on the first 96 | # layer 97 | if i == 0: 98 | layer_input = self.x 99 | else: 100 | layer_input = self.sigmoid_layers[-1].output 101 | 102 | sigmoid_layer = HiddenLayer(rng=numpy_rng, 103 | input=layer_input, 104 | n_in=input_size, 105 | n_out=hidden_layers_sizes[i], 106 | activation=T.nnet.sigmoid) 107 | # add the layer to our list of layers 108 | self.sigmoid_layers.append(sigmoid_layer) 109 | # its arguably a philosophical question... 
110 | # but we are going to only declare that the parameters of the 111 | # sigmoid_layers are parameters of the StackedDAA 112 | # the visible biases in the dA are parameters of those 113 | # dA, but not the SdA 114 | self.params.extend(sigmoid_layer.params) 115 | 116 | # Construct a denoising autoencoder that shared weights with this 117 | # layer 118 | dA_layer = dA(numpy_rng=numpy_rng, 119 | theano_rng=theano_rng, 120 | input=layer_input, 121 | n_visible=input_size, 122 | n_hidden=hidden_layers_sizes[i], 123 | W=sigmoid_layer.W, 124 | bhid=sigmoid_layer.b) 125 | self.dA_layers.append(dA_layer) 126 | ``` 127 | 128 | 现在,我们需要在sigmoid层的上方添加逻辑层,所以我们将有一个MLP。我们将使用在[使用逻辑回归进MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)的`LogisticRegression`类。 129 | 130 | ```Python 131 | # We now need to add a logistic layer on top of the MLP 132 | self.logLayer = LogisticRegression( 133 | input=self.sigmoid_layers[-1].output, 134 | n_in=hidden_layers_sizes[-1], 135 | n_out=n_outs 136 | ) 137 | 138 | self.params.extend(self.logLayer.params) 139 | # construct a function that implements one step of finetunining 140 | 141 | # compute the cost for second phase of training, 142 | # defined as the negative log likelihood 143 | self.finetune_cost = self.logLayer.negative_log_likelihood(self.y) 144 | # compute the gradients with respect to the model parameters 145 | # symbolic variable that points to the number of errors made on the 146 | # minibatch given by self.x and self.y 147 | self.errors = self.logLayer.errors(self.y) 148 | ``` 149 | 这个类也提供一个方法去产生与不同层相关的降噪自动编码机的训练函数。它们以list的形式返回,第i个元素就是一个实现训练第i层的`dA`的函数。 150 | 151 | ```Python 152 | def pretraining_functions(self, train_set_x, batch_size): 153 | ''' Generates a list of functions, each of them implementing one 154 | step in trainnig the dA corresponding to the layer with same index. 155 | The function will require as input the minibatch index, and to train 156 | a dA you just need to iterate, calling the corresponding function on 157 | all minibatch indexes. 
158 | 159 | :type train_set_x: theano.tensor.TensorType 160 | :param train_set_x: Shared variable that contains all datapoints used 161 | for training the dA 162 | 163 | :type batch_size: int 164 | :param batch_size: size of a [mini]batch 165 | 166 | :type learning_rate: float 167 | :param learning_rate: learning rate used during training for any of 168 | the dA layers 169 | ''' 170 | 171 | # index to a [mini]batch 172 | index = T.lscalar('index') # index to a minibatch 173 | ``` 174 | 为了有能力在训练时,改变差错等级或者训练速率。我们用一个Theano变量来联系它们。 175 | 176 | ```Python 177 | corruption_level = T.scalar('corruption') # % of corruption to use 178 | learning_rate = T.scalar('lr') # learning rate to use 179 | # begining of a batch, given `index` 180 | batch_begin = index * batch_size 181 | # ending of a batch given `index` 182 | batch_end = batch_begin + batch_size 183 | 184 | pretrain_fns = [] 185 | for dA in self.dA_layers: 186 | # get the cost and the updates list 187 | cost, updates = dA.get_cost_updates(corruption_level, 188 | learning_rate) 189 | # compile the theano function 190 | fn = theano.function( 191 | inputs=[ 192 | index, 193 | theano.Param(corruption_level, default=0.2), 194 | theano.Param(learning_rate, default=0.1) 195 | ], 196 | outputs=cost, 197 | updates=updates, 198 | givens={ 199 | self.x: train_set_x[batch_begin: batch_end] 200 | } 201 | ) 202 | # append `fn` to the list of functions 203 | pretrain_fns.append(fn) 204 | 205 | return pretrain_fns 206 | ``` 207 | 现在任何一个`pretrain_fns[i]`函数,可以将`index`,`corruption`——差错等级,`lr`——学习速率作为参数。注意,这些参数的名字是Theano变量的名字,而不是Python变量的名字(`learning_rate`或者`corruption_level`)。在使用Theano时,注意这一点。 208 | 209 | 以相同的方式(fashion),我们创建了一个方法用于在微调(fine-tuning)时需要的构建函数(`train_model`,`validate_model`,`test_model`函数)。 210 | 211 | ```Python 212 | def build_finetune_functions(self, datasets, batch_size, learning_rate): 213 | '''Generates a function `train` that implements one step of 214 | finetuning, a function `validate` that computes the error on 215 | a batch from the validation set, and a function `test` that 216 | computes the error on a batch from the testing set 217 | 218 | :type datasets: list of pairs of theano.tensor.TensorType 219 | :param datasets: It is a list that contain all the datasets; 220 | the has to contain three pairs, `train`, 221 | `valid`, `test` in this order, where each pair 222 | is formed of two Theano variables, one for the 223 | datapoints, the other for the labels 224 | 225 | :type batch_size: int 226 | :param batch_size: size of a minibatch 227 | 228 | :type learning_rate: float 229 | :param learning_rate: learning rate used during finetune stage 230 | ''' 231 | 232 | (train_set_x, train_set_y) = datasets[0] 233 | (valid_set_x, valid_set_y) = datasets[1] 234 | (test_set_x, test_set_y) = datasets[2] 235 | 236 | # compute number of minibatches for training, validation and testing 237 | n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] 238 | n_valid_batches /= batch_size 239 | n_test_batches = test_set_x.get_value(borrow=True).shape[0] 240 | n_test_batches /= batch_size 241 | 242 | index = T.lscalar('index') # index to a [mini]batch 243 | 244 | # compute the gradients with respect to the model parameters 245 | gparams = T.grad(self.finetune_cost, self.params) 246 | 247 | # compute list of fine-tuning updates 248 | updates = [ 249 | (param, param - gparam * learning_rate) 250 | for param, gparam in zip(self.params, gparams) 251 | ] 252 | 253 | train_fn = theano.function( 254 | inputs=[index], 255 | outputs=self.finetune_cost, 256 | 
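            # one SGD step on all parameters (every hidden layer plus the logistic
            # regression layer); `givens` below picks the minibatch selected by `index`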
updates=updates, 257 | givens={ 258 | self.x: train_set_x[ 259 | index * batch_size: (index + 1) * batch_size 260 | ], 261 | self.y: train_set_y[ 262 | index * batch_size: (index + 1) * batch_size 263 | ] 264 | }, 265 | name='train' 266 | ) 267 | 268 | test_score_i = theano.function( 269 | [index], 270 | self.errors, 271 | givens={ 272 | self.x: test_set_x[ 273 | index * batch_size: (index + 1) * batch_size 274 | ], 275 | self.y: test_set_y[ 276 | index * batch_size: (index + 1) * batch_size 277 | ] 278 | }, 279 | name='test' 280 | ) 281 | 282 | valid_score_i = theano.function( 283 | [index], 284 | self.errors, 285 | givens={ 286 | self.x: valid_set_x[ 287 | index * batch_size: (index + 1) * batch_size 288 | ], 289 | self.y: valid_set_y[ 290 | index * batch_size: (index + 1) * batch_size 291 | ] 292 | }, 293 | name='valid' 294 | ) 295 | 296 | # Create a function that scans the entire validation set 297 | def valid_score(): 298 | return [valid_score_i(i) for i in xrange(n_valid_batches)] 299 | 300 | # Create a function that scans the entire test set 301 | def test_score(): 302 | return [test_score_i(i) for i in xrange(n_test_batches)] 303 | 304 | return train_fn, valid_score, test_score 305 | ``` 306 | 注意,这里返回的`valid_score`和`test_score`并不是Theano函数,而是Python函数,在整个验证集和测试集循环,以产生这些集合的损失数的list。 307 | 308 | ###将它组合起来 309 | 310 | 下面的几行代码去构建层叠自动编码机: 311 | ```Python 312 | numpy_rng = numpy.random.RandomState(89677) 313 | print '... building the model' 314 | # construct the stacked denoising autoencoder class 315 | sda = SdA( 316 | numpy_rng=numpy_rng, 317 | n_ins=28 * 28, 318 | hidden_layers_sizes=[1000, 1000, 1000], 319 | n_outs=10 320 | ) 321 | ``` 322 | 在训练这个网络时,有两个阶段,一层是预训练,之后是微调。 323 | 324 | 对于预训练阶段,我们将循环网络中的所有层。对于每一层,我们将使用编译的theano函数来实现SGD(随机梯度下降),以实现权值优化,来见效每层的重构损失。这个函数将在训练集中被应用,并且是以`pretraining_epochs`中给出的固定次数的迭代。 325 | 326 | ```Python 327 | ######################### 328 | # PRETRAINING THE MODEL # 329 | ######################### 330 | print '... getting the pretraining functions' 331 | pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x, 332 | batch_size=batch_size) 333 | 334 | print '... 
pre-training the model' 335 | start_time = time.clock() 336 | ## Pre-train layer-wise 337 | corruption_levels = [.1, .2, .3] 338 | for i in xrange(sda.n_layers): 339 | # go through pretraining epochs 340 | for epoch in xrange(pretraining_epochs): 341 | # go through the training set 342 | c = [] 343 | for batch_index in xrange(n_train_batches): 344 | c.append(pretraining_fns[i](index=batch_index, 345 | corruption=corruption_levels[i], 346 | lr=pretrain_lr)) 347 | print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch), 348 | print numpy.mean(c) 349 | 350 | end_time = time.clock() 351 | 352 | print >> sys.stderr, ('The pretraining code for file ' + 353 | os.path.split(__file__)[1] + 354 | ' ran for %.2fm' % ((end_time - start_time) / 60.)) 355 | ``` 356 | 这个微调(fine-tuning)循环和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)中的非常类似,唯一的不同是我们将使用在`build_funetune_functions`中给定的新函数。 357 | 358 | ###运行这个代码 359 | 默认情况下,这个代码以块数目为1,每一层循环15次来进行预训练预训练。错差等级(corruption level)在第一层被设为0.1,第二层被设为0.2,第三层被设为0.3。预训练的学习速率为0.001,微调学习速率为0.1。预训练花了585.01分钟,平均每层13分钟。微调在36次迭代,444.2分钟后完成。平均每层迭代12.34分钟。最后的验证得分为1.39%,测试得分为1.3%。所有的结果都是在Intel Xeon E5430 @ 2.66GHz CPU,GotoBLAS下得出。 360 | 361 | ###技巧 362 | 这里有一个方法去提高代码的运行速度(假定你有足够的可用内存),是去计算这个网络(直到第k-1层时)如何转换你的数据。换句话说,你通过训练你的第一个dA层来开始。一旦它被训练,你就可以为每一个数据节点计算隐单元的值然后将它们储存为一个新的数据集,以便你在第2层中训练dA。一旦你训练完第2层的dA,你以相同的方式计算第三层的数据。现在你可以明白,在这个时候,这个dAs被分开训练了,它们仅仅提供(一对一的)对输入的非线性转换。一旦所有的dAs被训练,你就可以开始微调整个模型了。 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | -------------------------------------------------------------------------------- /7_Restricted_Boltzmann_Machine_受限波尔兹曼机.md: -------------------------------------------------------------------------------- 1 | 受限波尔兹曼机(Restricted Boltzmann Machines) 2 | ============================================== 3 | 4 | 在这一章节,我们假设读者已经阅读了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。当然,假如你要使用GPU来运行代码,你还需要阅读[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 5 | 6 | 本节的所有代码都可以在[这里](http://deeplearning.net/tutorial/code/rbm.py)下载。 7 | 8 | ## 基于能量模型(Energy-Based Models) 9 | 基于能量的模型(EBM)把我们所关心变量的各种组合和一个标量能量联系在一起。训练模型的过程就是不断改变标量能量的过程,使其能量函数的形状满足期望的形状。比如,如果一个变量组合被认为是合理的,它同时也具有较小的能量。基于能量的概率模型通过能量函数来定义概率分布: 10 | 11 | ![energy_fun](/images/7_ebm_1.png) 12 | 13 | 其中归一化因子Z被称为分割函数: 14 | 15 | ![Z_fun](/images/7_ebm_2.png) 16 | 17 | 基于能量的模型可以利用使用梯度下降或随机梯度下降的方法来学习,具体而言,就是以先验(训练集)的负对数似然函数作为损失函数,就像在逻辑回归中我们定义的那样, 18 | 19 | ![loss_fun](/images/7_ebm_3.png) 20 | 21 | 其中随机梯度为![gradient](/images/7_ebm_4.png),其中theta为模型的参数。 22 | 23 | ### 包含隐藏单元的EBMs 24 | 25 | 在很多情况下,我们无法观察到x样本的全部分布,或者我们需要引进一些没有观察到的变量,以增加模型的表达能力。因而我们考虑将模型分为2部分,一个可见部分(x的观察分布)和一个隐藏部分h,这样得到的就是包含隐含变量的EBM: 26 | 27 | ![ebm_with_hidden_unit](/images/7_ebm_hidden_units_1.png) 28 | 29 | 同时我们受物理启发定义了自由能量(free energy): 30 | 31 | ![free_energy](/images/7_ebm_hidden_units_2.png) 32 | 33 | 然后我们可以写成如下公式: 34 | 35 | ![ebm_with_hidden_units_2](/images/7_ebm_hidden_units_3.png) 36 | 37 | 数据的服对数似然函数梯度就有如下有趣的形式: 38 | 39 | ![gradient_rbm_h](/images/7_ebm_hidden_units_4.png) 40 | 41 | 推倒公式如下: 42 | 43 | ![gradient_rbm_h_2](/images/7_ebm_hidden_units_5.png) 44 | 45 | 需要注意的是上述的梯度包含2个项,包括正相位和负相位。正和负的术语不指公式中的每个项的符号,而是反映其对模型所定义的概率密度的影响。第一项增加训练数据的概率(通过减少相关的自由能量),而第二项减小模型产生的样本的概率。 46 | 47 | 
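这种“正相位减去负相位”的结构,在后文的RBM实现中会直接对应为代码:用训练样本上自由能量的均值减去模型样本上自由能量的均值,再交给`T.grad`求梯度。下面是一个极简的示意(只为说明梯度的结构,并非教程源码;其中`free_energy`、`x_data`、`x_model`、`params`均为假定已在别处定义的符号,后文的`RBM`类会给出具体实现):

```Python
import theano.tensor as T

# x_data : a minibatch of training examples (positive phase)
# x_model: a batch of samples drawn from the model (negative phase),
#          obtained e.g. by Gibbs sampling as explained later
# free_energy(v): symbolic free energy; params: list of model parameters
cost = T.mean(free_energy(x_data)) - T.mean(free_energy(x_model))
# the sampling that produced x_model must not be differentiated through,
# so it is passed as a constant; the gradient of `cost` then approximates
# the gradient of the negative log-likelihood
gparams = T.grad(cost, params, consider_constant=[x_model])
```
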
通常我们很难精确计算这个梯度,因为式中第一项涉及到可见单元与隐含单元的联合分布,由于归一化因子Z(θ)的存在,该分布很难获取。 我们只能通过一些采样方法(如Gibbs采样)获取其近似值,其具体方法将在后文中详述。 48 | 49 | 50 | ## 受限波尔兹曼机(RBM) 51 | 52 | 波尔兹曼机是对数线性马尔可夫随机场(MRF)的一种特殊形式,例如这个能量函数在它的自由参数下是线性的。为了使得它们能更强力的表达复杂分布(从受限的参数设定到一个非参数设定),我们认为一些变量是不可见的(被称为隐藏)。通过拥有更多隐藏变量(也称之为隐藏单元),我们可以增加波尔兹曼机的模型容量。受限波尔兹曼机限制波尔兹曼机可视层和隐藏层的层内连接。RBM模型可以由下图描述: 53 | 54 | ![rbm_graph](/images/7_rbm_1.png) 55 | 56 | RBM的能量函数可以被定义如下: 57 | 58 | ![rbm_energy_fun](/images/7_rbm_2.png) 59 | 60 | 其中’表示转置,b,c,W为模型的参数,b,c分别为可见层和隐含层的偏置,W为可见层与隐含层的链接权重。 61 | 62 | 自由能量为如下形式: 63 | 64 | ![free_energy_rbm](/images/7_rbm_3.png) 65 | 66 | 由于RBM的特殊结构,可视层和隐藏层层间单元是相互独立的。利用这个特性,我们定义如下: 67 | 68 | ![prob_rbm](/images/7_rbm_4.png) 69 | 70 | 71 | ### 二进制单元的RBMs 72 | 在使用二进制单元(v和h都属于{0,1})的普通研究情况时,概率版的普通神经激活函数表示如下: 73 | 74 | ![activation_fun](/images/7_rbm_binary_units_1.png) 75 | 76 | ![activation_fun2](/images/7_rbm_binary_units_2.png) 77 | 78 | 二进制单元RBMs的自由能力为: 79 | 80 | ![free_energy_binary](images/7_rbm_binary_units_1.png) 81 | 82 | 83 | ### 二进制单元的更新公式 84 | 85 | 我们可以获得如下的一个二进制单元RBM的对数似然梯度: 86 | 87 | ![equations](/images/7_update_e_b_u_1.png) 88 | 89 | 这个公式的更多细节推倒,读者可以阅读[这一页](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DBNEquations),或者[Learning Deep Architectures for AI](http://www.iro.umontreal.ca/%7Elisa/publications2/index.php/publications/show/239)的第五节。在这里,我们将不使用这些等式,而是通过Theano的`T.grad`来获取梯度。 90 | 91 | ## 在RBM中进行采样 92 | 93 | p(x)的样本可以通过运行马尔可夫链的汇聚、Gibbs采样的过渡来得到。 94 | 95 | 由N个随机变量(S=(S1,S2,...Sn))的联合分布的Gibbs采样,可以通过N个采样子步骤来实现,形式如Si~p(Si|S-i),其中S-i表示集合S中除Si的N-1个随机变量。 96 | 97 | 我们可以从X的一个任意状态(比如[x1(0),x2(0),…,xK(0)])开始,利用上述条件 分布,迭代的对其分量依次采样,随着采样次数的增加,随机变量[x1(n),x2(n),…,xK(n)]的概率分布将以n的几何级数的速度收敛于X的联合 概率分布P(X)。也就是说,我们可以在未知联合概率分布的条件下对其进行采样。 98 | 99 | 对于RBMs来说,S包含了可视和隐藏单元的集合。然而,由于它们的条件独立性,可以执行块Gibbs抽样。在这个设定中,可视单元被采样,同时给出隐藏单元的固定值,同样的,隐藏单元也是如此: 100 | 101 | ![h_v](/images/7_sampling_1.png) 102 | 103 | 这里,h(n)表示马尔可夫链中第n布的隐藏单元的集合。这意味着,h(n+1)根据概率`simg(W‘v(n)+ci)`来随机地被选为0/1。类似地v(n+1)也是如此。这个过程可以通过下面地图来展现: 104 | 105 | ![gibbs_sampling](/images/7_sampling_2.png) 106 | 107 | 当t趋向于无穷时,(v(t),h(t))将越加逼近正确样本的概率分布p(v,h)。 108 | 109 | 在这个理论里面,每个参数在学习进程中的更新都需要运行这样几个链来趋近。毫无疑问这将耗费很大的计算量。一些新的算法已经被提出来,以有效的学习p(v,h)中的样本情况。 110 | 111 | 112 | ## 对比散度算法(CD-k) 113 | 114 | 对比散度算法,是一种成功的用于求解对数似然函数关于未知参数梯度的近似的方法。它使用两个技巧来技术采样过程: 115 | * 因为我们希望p(v)=p_train(v)(数据的真实、底层分布),所以我们使用一个训练样本来初始化马尔可夫链(例如,从一个被预计接近于p的分布,所以这个链已经开始去收敛这个最终的分布p)。 116 | * 对比梯度不需要等待链的收敛。样本在k步Gibbs采样后就可以获得。在实际中,k=1时就可以获得惊人的好的效果。 117 | 118 | 119 | ### 持续的对比散度 120 | 121 | 持续的对比散度[Tieleman08](http://deeplearning.net/tutorial/references.html#tieleman08)使用了另外一种近似方法来从p(v,h)中采样。它建立在一个拥有持续状态的单马尔可夫链上(例如,不是对每个可视样例都重启链)。对每一次参数更新,我们通过简单的运行这个链k步来获得新的样本。然后保存链的状态以便后续的更新。 122 | 123 | 一般直觉的是,如果参数的更新是足够小相比链的混合率,那么马尔科夫链应该能够“赶上”模型的变化。 124 | 125 | ## 实现 126 | 127 | ![RBM_impl](/images/7_implementation_1.png) 128 | 129 | 我们构建一个RBM类。这个网络的参数既可以通过构造函数初始化,也可以作为参数进行传入。当RBM被用于构建一个深度网络时,这个选项——权重矩阵和隐藏层偏置与一个MLP网络的相应的S形层共享,就是非常有用的。 130 | 131 | ```Python 132 | class RBM(object): 133 | """Restricted Boltzmann Machine (RBM) """ 134 | def __init__( 135 | self, 136 | input=None, 137 | n_visible=784, 138 | n_hidden=500, 139 | W=None, 140 | hbias=None, 141 | vbias=None, 142 | numpy_rng=None, 143 | theano_rng=None 144 | ): 145 | """ 146 | RBM constructor. Defines the parameters of the model along with 147 | basic operations for inferring hidden from visible (and vice-versa), 148 | as well as for performing CD updates. 149 | 150 | :param input: None for standalone RBMs or symbolic variable if RBM is 151 | part of a larger graph. 
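        (for instance the output of the sigmoid layer below it when the RBM
        is used as a building block of a deeper network, in which case W and
        hbias are typically shared with that layer)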
152 | 153 | :param n_visible: number of visible units 154 | 155 | :param n_hidden: number of hidden units 156 | 157 | :param W: None for standalone RBMs or symbolic variable pointing to a 158 | shared weight matrix in case RBM is part of a DBN network; in a DBN, 159 | the weights are shared between RBMs and layers of a MLP 160 | 161 | :param hbias: None for standalone RBMs or symbolic variable pointing 162 | to a shared hidden units bias vector in case RBM is part of a 163 | different network 164 | 165 | :param vbias: None for standalone RBMs or a symbolic variable 166 | pointing to a shared visible units bias 167 | """ 168 | 169 | self.n_visible = n_visible 170 | self.n_hidden = n_hidden 171 | 172 | if numpy_rng is None: 173 | # create a number generator 174 | numpy_rng = numpy.random.RandomState(1234) 175 | 176 | if theano_rng is None: 177 | theano_rng = RandomStreams(numpy_rng.randint(2 ** 30)) 178 | 179 | if W is None: 180 | # W is initialized with `initial_W` which is uniformely 181 | # sampled from -4*sqrt(6./(n_visible+n_hidden)) and 182 | # 4*sqrt(6./(n_hidden+n_visible)) the output of uniform if 183 | # converted using asarray to dtype theano.config.floatX so 184 | # that the code is runable on GPU 185 | initial_W = numpy.asarray( 186 | numpy_rng.uniform( 187 | low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), 188 | high=4 * numpy.sqrt(6. / (n_hidden + n_visible)), 189 | size=(n_visible, n_hidden) 190 | ), 191 | dtype=theano.config.floatX 192 | ) 193 | # theano shared variables for weights and biases 194 | W = theano.shared(value=initial_W, name='W', borrow=True) 195 | 196 | if hbias is None: 197 | # create shared variable for hidden units bias 198 | hbias = theano.shared( 199 | value=numpy.zeros( 200 | n_hidden, 201 | dtype=theano.config.floatX 202 | ), 203 | name='hbias', 204 | borrow=True 205 | ) 206 | 207 | if vbias is None: 208 | # create shared variable for visible units bias 209 | vbias = theano.shared( 210 | value=numpy.zeros( 211 | n_visible, 212 | dtype=theano.config.floatX 213 | ), 214 | name='vbias', 215 | borrow=True 216 | ) 217 | 218 | # initialize input layer for standalone RBM or layer0 of DBN 219 | self.input = input 220 | if not input: 221 | self.input = T.matrix('input') 222 | 223 | self.W = W 224 | self.hbias = hbias 225 | self.vbias = vbias 226 | self.theano_rng = theano_rng 227 | # **** WARNING: It is not a good idea to put things in this list 228 | # other than shared variables created in this function. 229 | self.params = [self.W, self.hbias, self.vbias] 230 | ``` 231 | 下一步就是去定义函数来构建S图。代码如下: 232 | 233 | ```Python 234 | def propup(self, vis): 235 | '''This function propagates the visible units activation upwards to 236 | the hidden units 237 | 238 | Note that we return also the pre-sigmoid activation of the 239 | layer. 
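        (i.e. the value of T.dot(vis, self.W) + self.hbias before the
        sigmoid nonlinearity is applied.)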
As it will turn out later, due to how Theano deals with 240 | optimizations, this symbolic variable will be needed to write 241 | down a more stable computational graph (see details in the 242 | reconstruction cost function) 243 | 244 | ''' 245 | pre_sigmoid_activation = T.dot(vis, self.W) + self.hbias 246 | return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)] 247 | ``` 248 | 249 | ```Python 250 | def sample_h_given_v(self, v0_sample): 251 | ''' This function infers state of hidden units given visible units ''' 252 | # compute the activation of the hidden units given a sample of 253 | # the visibles 254 | pre_sigmoid_h1, h1_mean = self.propup(v0_sample) 255 | # get a sample of the hiddens given their activation 256 | # Note that theano_rng.binomial returns a symbolic sample of dtype 257 | # int64 by default. If we want to keep our computations in floatX 258 | # for the GPU we need to specify to return the dtype floatX 259 | h1_sample = self.theano_rng.binomial(size=h1_mean.shape, 260 | n=1, p=h1_mean, 261 | dtype=theano.config.floatX) 262 | return [pre_sigmoid_h1, h1_mean, h1_sample] 263 | ``` 264 | 265 | ```Python 266 | def propdown(self, hid): 267 | '''This function propagates the hidden units activation downwards to 268 | the visible units 269 | 270 | Note that we return also the pre_sigmoid_activation of the 271 | layer. As it will turn out later, due to how Theano deals with 272 | optimizations, this symbolic variable will be needed to write 273 | down a more stable computational graph (see details in the 274 | reconstruction cost function) 275 | 276 | ''' 277 | pre_sigmoid_activation = T.dot(hid, self.W.T) + self.vbias 278 | return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)] 279 | ``` 280 | 281 | ```Python 282 | def sample_v_given_h(self, h0_sample): 283 | ''' This function infers state of visible units given hidden units ''' 284 | # compute the activation of the visible given the hidden sample 285 | pre_sigmoid_v1, v1_mean = self.propdown(h0_sample) 286 | # get a sample of the visible given their activation 287 | # Note that theano_rng.binomial returns a symbolic sample of dtype 288 | # int64 by default. 
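        # Every visible unit is drawn independently as a 0/1 value, with
        # success probability given by the corresponding entry of v1_mean.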
If we want to keep our computations in floatX 289 | # for the GPU we need to specify to return the dtype floatX 290 | v1_sample = self.theano_rng.binomial(size=v1_mean.shape, 291 | n=1, p=v1_mean, 292 | dtype=theano.config.floatX) 293 | return [pre_sigmoid_v1, v1_mean, v1_sample] 294 | ``` 295 | 现在,我们可以使用这些函数来定义一个Gibbs采样步骤的符号图。我们定义如下两个函数: 296 | * `gibbs_vhv`表示从可视单元中开始的Gibbs采样的步骤。我们将可以看到这对于从RBM中采样是非常有用的。 297 | * `gibbs_hvh`表示从隐藏单元中开始Gibbs采样的步骤。这个函数再实现CD和PCD更新中是非常有用的。 298 | 代码如下: 299 | 300 | ```Python 301 | def gibbs_hvh(self, h0_sample): 302 | ''' This function implements one step of Gibbs sampling, 303 | starting from the hidden state''' 304 | pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h0_sample) 305 | pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v1_sample) 306 | return [pre_sigmoid_v1, v1_mean, v1_sample, 307 | pre_sigmoid_h1, h1_mean, h1_sample] 308 | ``` 309 | 310 | ```Python 311 | def gibbs_vhv(self, v0_sample): 312 | ''' This function implements one step of Gibbs sampling, 313 | starting from the visible state''' 314 | pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v0_sample) 315 | pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h1_sample) 316 | return [pre_sigmoid_h1, h1_mean, h1_sample, 317 | pre_sigmoid_v1, v1_mean, v1_sample] 318 | 319 | # start-snippet-2 320 | ``` 321 | 322 | 这个类还有一个函数去计算模型的自由能量,以便去计算参数的梯度。注意我们也会返回pre-sigmoid。 323 | 324 | ```Python 325 | def free_energy(self, v_sample): 326 | ''' Function to compute the free energy ''' 327 | wx_b = T.dot(v_sample, self.W) + self.hbias 328 | vbias_term = T.dot(v_sample, self.vbias) 329 | hidden_term = T.sum(T.log(1 + T.exp(wx_b)), axis=1) 330 | return -hidden_term - vbias_term 331 | ``` 332 | 我们随后添加一个`get_cost_update`方法,目的是产生CD-k和PCD-k的更新的象征性梯度。 333 | 334 | ```Python 335 | def get_cost_updates(self, lr=0.1, persistent=None, k=1): 336 | """This functions implements one step of CD-k or PCD-k 337 | 338 | :param lr: learning rate used to train the RBM 339 | 340 | :param persistent: None for CD. For PCD, shared variable 341 | containing old state of Gibbs chain. This must be a shared 342 | variable of size (batch size, number of hidden units). 343 | 344 | :param k: number of Gibbs steps to do in CD-k/PCD-k 345 | 346 | Returns a proxy for the cost and the updates dictionary. The 347 | dictionary contains the update rules for weights and biases but 348 | also an update of the shared variable used to store the persistent 349 | chain, if one is used. 350 | 351 | """ 352 | 353 | # compute positive phase 354 | pre_sigmoid_ph, ph_mean, ph_sample = self.sample_h_given_v(self.input) 355 | 356 | # decide how to initialize persistent chain: 357 | # for CD, we use the newly generate hidden sample 358 | # for PCD, we initialize from the old state of the chain 359 | if persistent is None: 360 | chain_start = ph_sample 361 | else: 362 | chain_start = persistent 363 | ``` 364 | 注意`get_cost_update`作为参数被变量化为`persistent`。这允许我们去使用相同的代码来实现CD和PCD。为了使用PCD,`persistent`需要被关联到一个共享变量,它包含前一次迭代的Gibbs链的状态。 365 | 366 | 假如`persistent`为`None`,则我们使用正相位时产生的隐藏样本来初始化Gibbs链,以此实现CD。当我们已经建立了这个链的开始点的时候,我们就可以计算这个Gibbs链的终点的样本,以及我们需要的去获得梯度的样本。为了获得这些,我们使用Theano提供的`sacn`操作,我们建议读者去阅读这个[链接](http://deeplearning.net/software/theano/library/scan.html)。 367 | 368 | ```Python 369 | # perform actual negative phase 370 | # in order to implement CD-k/PCD-k we need to scan over the 371 | # function that implements one gibbs step k times. 
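        # gibbs_hvh returns six outputs per step; `outputs_info` below marks
        # only the last one (the hidden sample) as the recurrent state fed
        # back into the next step, while scan stacks every output over the
        # k steps, so that e.g. nv_samples[-1] is the visible sample at the
        # end of the chain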
372 | # Read Theano tutorial on scan for more information : 373 | # http://deeplearning.net/software/theano/library/scan.html 374 | # the scan will return the entire Gibbs chain 375 | ( 376 | [ 377 | pre_sigmoid_nvs, 378 | nv_means, 379 | nv_samples, 380 | pre_sigmoid_nhs, 381 | nh_means, 382 | nh_samples 383 | ], 384 | updates 385 | ) = theano.scan( 386 | self.gibbs_hvh, 387 | # the None are place holders, saying that 388 | # chain_start is the initial state corresponding to the 389 | # 6th output 390 | outputs_info=[None, None, None, None, None, chain_start], 391 | n_steps=k 392 | ) 393 | ``` 394 | 当你已经产生了这个链,我们在链的末尾的样例获得负相位的自由能量。注意,这个`chain_end`是模型参数项中的一个的象征性的Theano变量,当我们简单的求解`T.grad`的时候,这个函数将通过Gibbs链来得到这个梯度。这不是我们想要的(它会搞乱我们的梯度),因此我们需要指示`T.grad`,`chain_end`是一个常量。我们通过`T.grad`的`consider_constant`来做这个事情。 395 | 396 | ```Python 397 | # determine gradients on RBM parameters 398 | # note that we only need the sample at the end of the chain 399 | chain_end = nv_samples[-1] 400 | 401 | cost = T.mean(self.free_energy(self.input)) - T.mean( 402 | self.free_energy(chain_end)) 403 | # We must not compute the gradient through the gibbs sampling 404 | gparams = T.grad(cost, self.params, consider_constant=[chain_end]) 405 | ``` 406 | 最后,我们增加由`scan`返回的更新字典(包含了随机状态的`theano_rng`更新规则)来获取参数更新。在PCD例子中,也需要更新包含Gibbs链状态的共享变量。 407 | 408 | ```Python 409 | # constructs the update dictionary 410 | for gparam, param in zip(gparams, self.params): 411 | # make sure that the learning rate is of the right dtype 412 | updates[param] = param - gparam * T.cast( 413 | lr, 414 | dtype=theano.config.floatX 415 | ) 416 | if persistent: 417 | # Note that this works only if persistent is a shared variable 418 | updates[persistent] = nh_samples[-1] 419 | # pseudo-likelihood is a better proxy for PCD 420 | monitoring_cost = self.get_pseudo_likelihood_cost(updates) 421 | else: 422 | # reconstruction cross-entropy is a better proxy for CD 423 | monitoring_cost = self.get_reconstruction_cost(updates, 424 | pre_sigmoid_nvs[-1]) 425 | 426 | return monitoring_cost, updates 427 | ``` 428 | 429 | ## 进展跟踪 430 | 431 | RBMs的训练是特别困难的。由于归一化函数Z,我们无法在训练的时候估计对数似然函数log(P(x))。因而我们没有直接可以度量超参数优化与否的方法。 432 | 433 | 而下面的几个选项对用户是有用的。 434 | 435 | ### 负样本的检查 436 | 437 | 在训练中获得的负样本是可以可视化的。在训练进程中,我们知道由RBM定义的模型不断逼近真实分布,p_train(x)。负样例就可以视为训练集中的样本。显而易见的,坏的超参数将在这种方式下被丢弃。 438 | 439 | ### 滤波器的可视化跟踪 440 | 441 | 由模型训练的滤波器是可以可视化的。我们可以将每个单元的权值以灰度图的方式展示。滤波器应该选出数据中强的特征。对于任意的数据集,这个滤波器都是不确定的。例如,训练MNIST,滤波器就表现的像“stroke”检测器,而训练自然图像的稀疏编码的时候,则像Gabor滤波器。 442 | 443 | ### 似然估计的替代 444 | 445 | 此外,更加容易处理的函数可以被用于做似然估计的替代。当我们使用PCD来训练RBM的时候,可以使用伪似然估计来替代。伪似然估计(Pseudo-likeihood,PL)更加简于计算,因为它假设所有的比特都是相互独立的,因此有: 446 | 447 | ![PL](/images/7_proxies_likelihood_1.png) 448 | 449 | 450 | 451 | 452 | ## 主循环 453 | 454 | 455 | 456 | ## 结果 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Deep Learning Tutorial in Chinese 2 | ================================= 3 | 深度学习教程中文版 4 | ================================= 5 | 6 | This is a `Chinese tutorial` which is translated from [DeepLearning 0.1 documentation](http://deeplearning.net/tutorial/contents.html#). And in this tutorial, all algorithms and models are coded by Python and [Theano](http://deeplearning.net/software/theano/index.html). Theano is a famous third-party library, and allows coder to use GPU or CPU to run his Python code. 
7 | 8 | 9 | 10 | 这是一个翻译自[深度学习0.1文档](http://deeplearning.net/tutorial/contents.html)的`中文教程`。在这个教程里面所有的算法和模型都是通过Pyhton和[Theano](http://deeplearning.net/software/theano/index.html)实现的。Theano是一个著名的第三方库,允许程序员使用GPU或者CPU去运行他的Python代码。 11 | 12 | 13 | ## 内容/Contents 14 | 15 | * [入门(Getting Started)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/1_Getting_Started_入门.md) 16 | * [使用逻辑回归进行MNIST分类(Classifying MNIST digits using Logistic Regression)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md) 17 | * [多层感知机(Multilayer Perceptron)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md) 18 | * [卷积神经网络(Convolutional Neural Networks(LeNet))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/4_Convoltional_Neural_Networks_LeNet_卷积神经网络.md) 19 | * [降噪自动编码机器(Denoising Autoencoders(dA))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/5_Denoising_Autoencoders_降噪自动编码.md) 20 | * [层叠自动编码机(Stcaked Denoising Autoencoders(SdA))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/6_Stacked_Denoising_Autoencoders_层叠降噪自动编码机.md) 21 | * [受限波尔兹曼机(Restricted Boltzmann Machines(RBM))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/7_Restricted_Boltzmann_Machine_受限波尔兹曼机.md) 22 | * Deep Belif Networks 23 | * Hybrid Monte-Carlo Sampling 24 | * Recurrent Neural Networks with Word Embeddings 25 | * Modeling and generating sequences of polyphonic music the RNN-RBM 26 | * Miscellaneous 27 | 28 | 29 | ## 版权/Copyright 30 | #### 作者/Author 31 | [Theano Development Team](http://deeplearning.net/tutorial/LICENSE.html), LISA lab, University of Montreal 32 | #### 翻译者/Translator 33 | [Lifeng Hua](https://github.com/Syndrome777), Zhejiang University 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /images/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/.DS_Store -------------------------------------------------------------------------------- /images/1_0-1_loss_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_0-1_loss_1.png -------------------------------------------------------------------------------- /images/1_0-1_loss_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_0-1_loss_2.png -------------------------------------------------------------------------------- /images/1_0-1_loss_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_0-1_loss_3.png -------------------------------------------------------------------------------- /images/1_l1_l2_regularization_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_l1_l2_regularization_1.png -------------------------------------------------------------------------------- 
/images/1_l1_l2_regularization_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_l1_l2_regularization_2.png -------------------------------------------------------------------------------- /images/1_negative_log_likelihod_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_negative_log_likelihod_1.png -------------------------------------------------------------------------------- /images/1_negative_log_likelihod_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_negative_log_likelihod_2.png -------------------------------------------------------------------------------- /images/2_defining_a_loss_function_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/2_defining_a_loss_function_1.png -------------------------------------------------------------------------------- /images/2_the_model_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/2_the_model_1.png -------------------------------------------------------------------------------- /images/2_the_model_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/2_the_model_2.png -------------------------------------------------------------------------------- /images/3_from_lr_to_mlp_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_from_lr_to_mlp_1.png -------------------------------------------------------------------------------- /images/3_the_model_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_the_model_1.png -------------------------------------------------------------------------------- /images/3_the_model_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_the_model_2.png -------------------------------------------------------------------------------- /images/3_the_model_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_the_model_3.png -------------------------------------------------------------------------------- /images/3wolfmoon.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3wolfmoon.jpg -------------------------------------------------------------------------------- /images/4_conv_operator_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_conv_operator_1.png -------------------------------------------------------------------------------- /images/4_detail_notation_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_detail_notation_1.png -------------------------------------------------------------------------------- /images/4_detail_notation_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_detail_notation_2.png -------------------------------------------------------------------------------- /images/4_detail_notation_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_detail_notation_3.png -------------------------------------------------------------------------------- /images/4_full_model_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_full_model_1.png -------------------------------------------------------------------------------- /images/4_sparse_con_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_sparse_con_1.png -------------------------------------------------------------------------------- /images/5_autoencoders_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_1.png -------------------------------------------------------------------------------- /images/5_autoencoders_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_2.png -------------------------------------------------------------------------------- /images/5_autoencoders_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_3.png -------------------------------------------------------------------------------- /images/5_autoencoders_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_4.png -------------------------------------------------------------------------------- /images/5_running_code_1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_running_code_1.png -------------------------------------------------------------------------------- /images/5_running_code_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_running_code_2.png -------------------------------------------------------------------------------- /images/6_sda_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/6_sda_1.png -------------------------------------------------------------------------------- /images/7_ebm_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_1.png -------------------------------------------------------------------------------- /images/7_ebm_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_2.png -------------------------------------------------------------------------------- /images/7_ebm_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_3.png -------------------------------------------------------------------------------- /images/7_ebm_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_4.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_1.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_2.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_3.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_4.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_5.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_5.png -------------------------------------------------------------------------------- /images/7_implementation_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_implementation_1.png -------------------------------------------------------------------------------- /images/7_proxies_likelihood_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_proxies_likelihood_1.png -------------------------------------------------------------------------------- /images/7_rbm_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_1.png -------------------------------------------------------------------------------- /images/7_rbm_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_2.png -------------------------------------------------------------------------------- /images/7_rbm_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_3.png -------------------------------------------------------------------------------- /images/7_rbm_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_4.png -------------------------------------------------------------------------------- /images/7_rbm_binary_units_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_binary_units_1.png -------------------------------------------------------------------------------- /images/7_rbm_binary_units_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_binary_units_2.png -------------------------------------------------------------------------------- /images/7_rbm_binary_units_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_binary_units_3.png -------------------------------------------------------------------------------- /images/7_sampling_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_sampling_1.png -------------------------------------------------------------------------------- 
/images/7_sampling_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_sampling_2.png -------------------------------------------------------------------------------- /images/7_update_e_b_u_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_update_e_b_u_1.png --------------------------------------------------------------------------------