├── .DS_Store ├── 1_Getting_Started_入门.md ├── 2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md ├── 3_Multilayer_Perceptron_多层感知机.md ├── 4_Convoltional_Neural_Networks_LeNet_卷积神经网络.md ├── 5_Denoising_Autoencoders_降噪自动编码.md ├── 6_Stacked_Denoising_Autoencoders_层叠降噪自动编码机.md ├── 7_Restricted_Boltzmann_Machine_受限波尔兹曼机.md ├── README.md └── images ├── .DS_Store ├── 1_0-1_loss_1.png ├── 1_0-1_loss_2.png ├── 1_0-1_loss_3.png ├── 1_l1_l2_regularization_1.png ├── 1_l1_l2_regularization_2.png ├── 1_negative_log_likelihod_1.png ├── 1_negative_log_likelihod_2.png ├── 2_defining_a_loss_function_1.png ├── 2_the_model_1.png ├── 2_the_model_2.png ├── 3_from_lr_to_mlp_1.png ├── 3_the_model_1.png ├── 3_the_model_2.png ├── 3_the_model_3.png ├── 3wolfmoon.jpg ├── 4_conv_operator_1.png ├── 4_detail_notation_1.png ├── 4_detail_notation_2.png ├── 4_detail_notation_3.png ├── 4_full_model_1.png ├── 4_sparse_con_1.png ├── 5_autoencoders_1.png ├── 5_autoencoders_2.png ├── 5_autoencoders_3.png ├── 5_autoencoders_4.png ├── 5_running_code_1.png ├── 5_running_code_2.png ├── 6_sda_1.png ├── 7_ebm_1.png ├── 7_ebm_2.png ├── 7_ebm_3.png ├── 7_ebm_4.png ├── 7_ebm_hidden_units_1.png ├── 7_ebm_hidden_units_2.png ├── 7_ebm_hidden_units_3.png ├── 7_ebm_hidden_units_4.png ├── 7_ebm_hidden_units_5.png ├── 7_implementation_1.png ├── 7_proxies_likelihood_1.png ├── 7_rbm_1.png ├── 7_rbm_2.png ├── 7_rbm_3.png ├── 7_rbm_4.png ├── 7_rbm_binary_units_1.png ├── 7_rbm_binary_units_2.png ├── 7_rbm_binary_units_3.png ├── 7_sampling_1.png ├── 7_sampling_2.png └── 7_update_e_b_u_1.png /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/.DS_Store -------------------------------------------------------------------------------- /1_Getting_Started_入门.md: -------------------------------------------------------------------------------- 1 | 入门(Getting Started) 2 | ====================== 3 | 4 | 这个教程并不是为了巩固研究生或者本科生的机器学习课程,但我们确实对一些重要的概念(和公式)做了的快速的概述,来确保我们在谈论同个概念。同时,你也需要去下载数据集,以便可以跑未来课程的样例代码。 5 | 6 | ### 下载 7 | 在每一个学习算法的页面,你都需要去下载相关的文件。加入你想要一次下载所有的文件,你可以克隆本教程的git仓库。 8 | 9 | git clone git://github.com/lisa-lab/DeepLearningTutorials.git 10 | 11 | ### 数据集 12 | #### MNIST数据集(mnist.pkl.gz) 13 | 14 | [MNIST](http://yann.lecun.com/exdb/mnist)是一个包含60000个训练样例和10000个测试样例的手写数字图像的数据集。在许多论文,包括本教程,都将60000个训练样例分为50000个样例的训练集和10000个样例的验证集(为了超参数,例如学习率、模型尺寸等等)。所有的数字图像都被归一化和中心化为28*28的像素,256位图的灰度图。 15 | 为了方便在Python中的使用,我们对数据集进行了处理。你可以在这里[下载](http://deeplearning.net/data/mnist/mnist.pkl.gz)。这个文件被表示为包含3个lists的tuple:训练集、验证集和测试集。每个lists都是都是两个list的组合,一个list是有numpy的1维array表示的784(28*28)维的0~1(0是黑,1是白)的float值,另一个list是0~9的图像标签。下面的代码显示了如何去加载这个数据集。 16 | 17 | ```Python 18 | import cPickle, gzip, numpy 19 | # Load the dataset 20 | f = gzip.open('mnist.pkl.gz', 'rb') 21 | train_set, valid_set, test_set = cPickle.load(f) 22 | f.close() 23 | ``` 24 | 25 | 当我们使用这个数据集的时候,通常将它分割维几个minibatch。我们建议你将数据集储存为共享变量(shared variables),通过minibatch的索引(一个固定的被告知的batch的尺寸)来存取它们。使用共享变量的原因是为了使用GPU。因为往GPUX显存中复制数据是一个巨大的开销。如果不使用共享变量,GPU代码的运行效率将不会比CPU代码快。如果你将自己的数据定义为共享变量,当共享变量被构建的时候,你就给了Theano在一次请求中将整个数据复制到GPU上的可能。之后,GPU就可以通过共享变量的slice(切片)来存取任何一个minibatch,而不必再从CPU上拷贝数据。同时,因为数据向量(实数)和标签(整数)通常是不同属性的,测试集、验证集和训练集是不同目的的,所以我们建议通过不同的共享变量来储存(这就产生了6个不同的共享变量)。 26 | 由于现在的数据再一个变量里面,一个minibatch被定义为这个变量的一个切片。通过指定它的索引和它的尺寸,可以更加自然的来定义一个minibatch。下面的代码展示了如何去存取数据和如何存取一个minibatch。 27 | 28 | ```Python 29 | def shared_dataset(data_xy): 30 | """ Function that loads the dataset into shared 
variables 31 | 32 | The reason we store our dataset in shared variables is to allow 33 | Theano to copy it into the GPU memory (when code is run on GPU). 34 | Since copying data into the GPU is slow, copying a minibatch everytime 35 | is needed (the default behaviour if the data is not in a shared 36 | variable) would lead to a large decrease in performance. 37 | """ 38 | data_x, data_y = data_xy 39 | shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX)) 40 | shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX)) 41 | # When storing data on the GPU it has to be stored as floats 42 | # therefore we will store the labels as ``floatX`` as well 43 | # (``shared_y`` does exactly that). But during our computations 44 | # we need them as ints (we use labels as index, and if they are 45 | # floats it doesn't make sense) therefore instead of returning 46 | # ``shared_y`` we will have to cast it to int. This little hack 47 | # lets us get around this issue 48 | return shared_x, T.cast(shared_y, 'int32') 49 | ``` 50 | 51 | 这个数据以float的形式被存储在GPU上(`dtype`被定义为`theano.confug.floatX`)。然后再将标签转换为int型。 52 | 如果你再GPU上跑代码,并且数据集太大,可能导致内存崩溃。在这个时候,你就应当把数据存储为共享变量。你可以将数据储存为一个充分小的大块(几个minibatch)在一个共享变量里面,然后在训练的时候使用它。一旦你使用了这个大块,更新它储存的值。这将最小化CPU和GPU的内存交换。 53 | 54 | 55 | ### 标记 56 | #### 数据集标记 57 | 我们定义数据集为D,包括3个部分,D_train,D_valid,D_test三个集合。D内每个索引都是一个(x,y)对。 58 | #### 数学约定 59 | * W:大写字母表示矩阵(除非特殊说明) 60 | * W(i,j):矩阵内(i,j)点的数据 61 | * W(i.):矩阵的一行 62 | * W(.j):矩阵的一列 63 | * b:小些字母表示向量(除非特殊说明) 64 | * b(i):向量内的(i)点的数据 65 | #### 符号和缩略语表 66 | * D:输入维度的数目 67 | * D_h(i):第i层个隐层的输入单元数目 68 | * L:标签的数目 69 | * NLL:负对数似然函数 70 | * theta:给定模型的参数集合 71 | 72 | #### Python命名空间 73 | ```Python 74 | import theano 75 | import theano.tensor as T 76 | import numpy 77 | ``` 78 | 79 | ### 深度学习的监督优化入门 80 | #### 学习一个分类器 81 | ##### 0-1损失函数 82 | ![0-1_loss_1](/images/1_0-1_loss_1.png) 83 | 84 | ![0-1_loss_2](/images/1_0-1_loss_2.png) 85 | 86 | ![0-1_loss_3](/images/1_0-1_loss_3.png) 87 | 88 | ```Python 89 | # zero_one_loss is a Theano variable representing a symbolic 90 | # expression of the zero one loss ; to get the actual value this 91 | # symbolic expression has to be compiled into a Theano function (see 92 | # the Theano tutorial for more details) 93 | zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y)) 94 | ``` 95 | 96 | ##### 负对数似然损失函数 97 | 由于0-1损失函数不可微分,在大型模型中对它优化会造成巨大开销。因此我们通过最大化给定数据标签的似然函数来训练模型。 98 | 99 | ![nll_1](/images/1_negative_log_likelihod_1.png) 100 | 101 | ![nll_2](/images/1_negative_log_likelihod_2.png) 102 | 103 | 由于我们通常说最小化损失函数,所以我们给对数似然函数添加负号,来使得我们可以求解最小化负对数似然损失函数。 104 | 105 | ```Python 106 | # NLL is a symbolic variable ; to get the actual value of NLL, this symbolic 107 | # expression has to be compiled into a Theano function (see the Theano 108 | # tutorial for more details) 109 | NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y]) 110 | # note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)]. 111 | # Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the 112 | # elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this 113 | # syntax to retrieve the log-probability of the correct labels, y. 114 | ``` 115 | 116 | #### 随机梯度下降 117 | 什么是普通的梯度下降?梯度下降是一个简单的算法,利用负梯度方向来决定每次迭代的新的搜索方向,使得每次迭代能使待优化的目标函数逐步减小。 118 | 伪代码如下所示。 119 | 120 | ```Python 121 | # GRADIENT DESCENT 122 | 123 | while True: 124 | loss = f(params) 125 | d_loss_wrt_params = ... 
# compute gradient 126 | params -= learning_rate * d_loss_wrt_params 127 | if : 128 | return params 129 | ``` 130 | 131 | 随机梯度下降则是普通梯度下降的优化。通过使用一部分样本来优化梯度代替所有样本优化梯度,从而得以更快逼近结果。下面的代码,我们一次只用一个样本来计算梯度。 132 | 133 | ```Python 134 | # STOCHASTIC GRADIENT DESCENT 135 | for (x_i,y_i) in training_set: 136 | # imagine an infinite generator 137 | # that may repeat examples (if there is only a finite training set) 138 | loss = f(params, x_i, y_i) 139 | d_loss_wrt_params = ... # compute gradient 140 | params -= learning_rate * d_loss_wrt_params 141 | if : 142 | return params 143 | ``` 144 | 145 | 我们不止一次的在深度学习中提及这个变体——“minibatches”。Minibatch随机梯度下降区别与随机梯度下降,在每次梯度估计时使用一个minibatch的数据。这个技术减小了每次梯度估计时的方差,也适合现代电脑的分层内存构架。 146 | 147 | ```Python 148 | for (x_batch,y_batch) in train_batches: 149 | # imagine an infinite generator 150 | # that may repeat examples 151 | loss = f(params, x_batch, y_batch) 152 | d_loss_wrt_params = ... # compute gradient using theano 153 | params -= learning_rate * d_loss_wrt_params 154 | if : 155 | return params 156 | ``` 157 | 158 | 在选择minibatch的尺寸B时中有个权衡。当尺寸比较大时,在梯度估计时就要花费更多时间计算方差;当尺寸比较小的时候呢,就要进行更多的迭代,也更容易波动。因而尺寸的选择要结合模型、数据集、硬件结构等,从1到几百不等。 159 | 伪代码如下。 160 | 161 | ```Python 162 | # Minibatch Stochastic Gradient Descent 163 | 164 | # assume loss is a symbolic description of the loss function given 165 | # the symbolic variables params (shared variable), x_batch, y_batch; 166 | 167 | # compute gradient of loss with respect to params 168 | d_loss_wrt_params = T.grad(loss, params) 169 | 170 | # compile the MSGD step into a theano function 171 | updates = [(params, params - learning_rate * d_loss_wrt_params)] 172 | MSGD = theano.function([x_batch,y_batch], loss, updates=updates) 173 | 174 | for (x_batch, y_batch) in train_batches: 175 | # here x_batch and y_batch are elements of train_batches and 176 | # therefore numpy arrays; function MSGD also updates the params 177 | print('Current loss is ', MSGD(x_batch, y_batch)) 178 | if stopping_condition_is_met: 179 | return params 180 | ``` 181 | 182 | #### 正则化 183 | 正则化是为了防止在MSGD训练过程中出现过拟合。为了应对过拟合,我们提出了几个方法:L1/L2正则化和early-stopping。 184 | ##### L1/L2正则化 185 | L1/L2正则化就是在损失函数中添加额外的项,用以惩罚一定的参数结构。对于L2正则化,又被称为“权制递减(weight decay)”。 186 | 187 | ![l1_l2_regularization_1](/images/1_l1_l2_regularization_1.png) 188 | 189 | ![l1_l2_regularization_2](/images/1_l1_l2_regularization_2.png) 190 | 191 | 192 | 原则上来说,增加一个正则项,可以平滑神经网络的网络映射(通过惩罚大的参数值,可以减少网络模型的非线性参数数)。因而最小化这个和,就可以寻找到与训练数据最贴合同时范化性更好的模型。更具奥卡姆剃刀原则,最好的模型总是最简单的。 193 | 当然,事实上,简单模型并不一定意味着好的泛化。但从经验上看,这个正则化方案可以提高神经网络的泛化能力,尤其是对于小数据集而言。下面的代码我们分别给两个正则项一个对应的权重。 194 | 195 | ```Python 196 | # symbolic Theano variable that represents the L1 regularization term 197 | L1 = T.sum(abs(param)) 198 | 199 | # symbolic Theano variable that represents the squared L2 term 200 | L2_sqr = T.sum(param ** 2) 201 | 202 | # the loss 203 | loss = NLL + lambda_1 * L1 + lambda_2 * L2 204 | ``` 205 | 206 | ##### Early-stopping 207 | Early-stopping通过监控模型在验证集上的表现来应对过拟合。验证集是一个我们从未在梯度下降中使用,也不在测试集的数据集合,它被认为是为了测试数据的一个表达。当在验证集上,模型的表现不再提高,或者表现更差,那么启发式算法应该放弃继续优化。 208 | 在选择何时终止优化方面,主要基于主观判断和一些启发式的方法,但在这个教程里,我们使用一个几何级数增加的patience量的策略。 209 | 210 | ```Python 211 | # early-stopping parameters 212 | patience = 5000 # look as this many examples regardless 213 | patience_increase = 2 # wait this much longer when a new best is 214 | # found 215 | improvement_threshold = 0.995 # a relative improvement of this much is 216 | # considered significant 217 | validation_frequency = min(n_train_batches, patience/2) 218 | # go through this many 219 | # 
minibatches before checking the network 220 | # on the validation set; in this case we 221 | # check every epoch 222 | 223 | best_params = None 224 | best_validation_loss = numpy.inf 225 | test_score = 0. 226 | start_time = time.clock() 227 | 228 | done_looping = False 229 | epoch = 0 230 | while (epoch < n_epochs) and (not done_looping): 231 | # Report "1" for first epoch, "n_epochs" for last epoch 232 | epoch = epoch + 1 233 | for minibatch_index in xrange(n_train_batches): 234 | 235 | d_loss_wrt_params = ... # compute gradient 236 | params -= learning_rate * d_loss_wrt_params # gradient descent 237 | 238 | # iteration number. We want it to start at 0. 239 | iter = (epoch - 1) * n_train_batches + minibatch_index 240 | # note that if we do `iter % validation_frequency` it will be 241 | # true for iter = 0 which we do not want. We want it true for 242 | # iter = validation_frequency - 1. 243 | if (iter + 1) % validation_frequency == 0: 244 | 245 | this_validation_loss = ... # compute zero-one loss on validation set 246 | 247 | if this_validation_loss < best_validation_loss: 248 | 249 | # improve patience if loss improvement is good enough 250 | if this_validation_loss < best_validation_loss * improvement_threshold: 251 | 252 | patience = max(patience, iter * patience_increase) 253 | best_params = copy.deepcopy(params) 254 | best_validation_loss = this_validation_loss 255 | 256 | if patience <= iter: 257 | done_looping = True 258 | break 259 | 260 | # POSTCONDITION: 261 | # best_params refers to the best out-of-sample parameters observed during the optimization 262 | ``` 263 | 264 | 如果过训练数据的batch批次。 265 | 这个`validation_frequency`应该要比`patience`更小。这个代码应该至少检查了两次,在使用`patience`之前。这就是我们使用这个等式`validation_frequency = min( value, patience/2.`的原因。 266 | 这个算法可能会有更好的表现,当我们通过统计显著性的测试来代替简单的比较来决定是否增加patient。 267 | 268 | #### 测试 269 | 我们依据在验证集上表现最好的参数作为模型的参数,去在测试集上进行测试。 270 | 271 | #### 总结 272 | 这是对优化章节的总结。Early-stopping技术需要我们将数据分割为训练集、验证集、测试集。测试集使用minibatch的随机梯度下降来对目标函数进行逼近。同时引入L1/L2正则项来应对过拟合。 273 | 274 | ### Theano/Python技巧 275 | #### 载入和保存模型 276 | 当你做实验的时候,用梯度下降算法可能要好几个小时去发现一个最优解。你可能在发现解的时候,想要保存这些权值。你也可能想要保存搜索进程中当前最优化的解。 277 | 278 | ##### 使用Pickle在共享变量中储存numpy的ndarrays 279 | ```Python 280 | >>> import cPickle 281 | >>> save_file = open('path', 'wb') # this will overwrite current contents 282 | >>> cPickle.dump(w.get_value(borrow=True), save_file, -1) # the -1 is for HIGHEST_PROTOCOL 283 | >>> cPickle.dump(v.get_value(borrow=True), save_file, -1) # .. and it triggers much more efficient 284 | >>> cPickle.dump(u.get_value(borrow=True), save_file, -1) # .. 
storage than numpy's default 285 | >>> save_file.close() 286 | ``` 287 | ```Python 288 | >>> save_file = open('path') 289 | >>> w.set_value(cPickle.load(save_file), borrow=True) 290 | >>> v.set_value(cPickle.load(save_file), borrow=True) 291 | >>> u.set_value(cPickle.load(save_file), borrow=True) 292 | ``` 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | -------------------------------------------------------------------------------- /2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md: -------------------------------------------------------------------------------- 1 | 使用逻辑回归进行MNIST分类(Classifying MNIST using Logistic Regressing) 2 | ============================= 3 | 本节假定读者属性了下面的Theano概念:[共享变量(shared variable)](http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables), [基本数学算子(basic arithmetic ops)](http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars), [Theano的进阶(T.grad)](http://deeplearning.net/software/theano/tutorial/examples.html#computing-gradients), [floatX(默认为float64)](http://deeplearning.net/software/theano/library/config.html#config.floatX)。假如你想要在你的GPU上跑你的代码,你也需要看[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 4 | 5 | 本节的所有代码可以在[这里](http://deeplearning.net/tutorial/code/logistic_sgd.py)下载。 6 | 7 | 在这一节,我们将展示Theano如何实现最基本的分类器:逻辑回归分类器。我们以模型的快速入门开始,复习(refresher)和巩固(anchor)数学负号,也展示了数学表达式如何映射到Theano图中。 8 | 9 | ## 模型 10 | 逻辑回归模型是一个线性概率模型。它由一个权值矩阵W和偏置向量b参数化。分类通过将输入向量提交到一组超平面,每个超平面对应一个类。输入向量和超平面的距离是这个输入属于该类的一个概率量化。 11 | 在给定模型下,输入x,输出为y的概率,可以用如下公式表示 12 | 13 |
![probability](/images/2_the_model_1.png)
14 | 15 |
![y_prediction](/images/2_the_model_2.png)
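如果上面两张公式图片无法显示,可以参考本文后面完整代码 docstring 中的数学表达式,大致如下(用 LaTeX 重写,仅作示意):

```latex
P(Y=i \mid x, W, b) = \mathrm{softmax}_i(W x + b)
                    = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}

y_{pred} = \operatorname{argmax}_i \, P(Y=i \mid x, W, b)
```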
16 | 17 | Theano代码如下。 18 | 19 | ```Python 20 | # initialize with 0 the weights W as a matrix of shape (n_in, n_out) 21 | self.W = theano.shared( 22 | value=numpy.zeros( 23 | (n_in, n_out), 24 | dtype=theano.config.floatX 25 | ), 26 | name='W', 27 | borrow=True 28 | ) 29 | # initialize the baises b as a vector of n_out 0s 30 | self.b = theano.shared( 31 | value=numpy.zeros( 32 | (n_out,), 33 | dtype=theano.config.floatX 34 | ), 35 | name='b', 36 | borrow=True 37 | ) 38 | 39 | # symbolic expression for computing the matrix of class-membership 40 | # probabilities 41 | # Where: 42 | # W is a matrix where column-k represent the separation hyper plain for 43 | # class-k 44 | # x is a matrix where row-j represents input training sample-j 45 | # b is a vector where element-k represent the free parameter of hyper 46 | # plain-k 47 | self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) 48 | 49 | # symbolic description of how to compute prediction as class whose 50 | # probability is maximal 51 | self.y_pred = T.argmax(self.p_y_given_x, axis=1) 52 | ``` 53 | 54 | 由于模型的参数需要不断的存取和修正,所以我们把W和b定义为共享变量。这个dot(点乘)和softmax运算用以计算这个P(Y|x,W,b)。这个结果`p_y_given_x`(probability)是一个vector类型的概率向量。 55 | 为了获得实际的模型预测,我们使用`T_argmax`操作,来返回`p_y_given_x`的最大值对应的y。 56 | 如果想要获得完整的Theano算子,看[算子列表](http://deeplearning.net/software/theano/library/tensor/basic.html#basic-tensor-functionality) 57 | 58 | ## 定义一个损失函数 59 | 学习优化模型参数需要最小化一个损失参数。在多分类的逻辑回归中,很显然是使用负对数似然函数作为损失函数。似然函数和损失函数定义如下: 60 | 61 |
![loss_function](/images/2_defining_a_loss_function_1.png)
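同样,若公式图片无法显示,可参考后面代码 docstring 中的表达式。对数似然函数 L 与负对数似然损失 ℓ 大致如下(代码中实际返回的是对 minibatch 取平均的形式,即再乘以 1/|D|,以减小学习率对 batch 大小的依赖):

```latex
\mathcal{L}(\theta=\{W,b\}, \mathcal{D}) = \sum_{i} \log P(Y=y^{(i)} \mid x^{(i)}, W, b)

\ell(\theta=\{W,b\}, \mathcal{D}) = - \mathcal{L}(\theta=\{W,b\}, \mathcal{D})
```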
62 | 63 | 虽然整本书都致力于探讨最小化话题,但梯度下降是迄今为止最简单的最小化非线性函数的方法。在这个教程中,我们使用minibatch随机梯度下降算法。可以看[随机梯度下降](http://deeplearning.net/tutorial/gettingstarted.html#opt-sgd)来获得更多细节。 64 | 下面的代码定义了一个对给定的minibatch的损失函数。 65 | 66 | ```Python 67 | # y.shape[0] is (symbolically) the number of rows in y, i.e., 68 | # number of examples (call it n) in the minibatch 69 | # T.arange(y.shape[0]) is a symbolic vector which will contain 70 | # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of 71 | # Log-Probabilities (call it LP) with one row per example and 72 | # one column per class LP[T.arange(y.shape[0]),y] is a vector 73 | # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ..., 74 | # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is 75 | # the mean (across minibatch examples) of the elements in v, 76 | # i.e., the mean log-likelihood across the minibatch. 77 | return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]) 78 | ``` 79 | 在这里我们使用错误的平均来表示损失函数,以减少minibatch尺寸对我们的影响。 80 | 81 | ## 创建一个逻辑回归类 82 | 现在,我们要定义一个`逻辑回归`的类,来概括逻辑回归的基本行为。代码已经是我们之前涵盖的了,不再进行过多解释。 83 | 84 | ```Python 85 | class LogisticRegression(object): 86 | """Multi-class Logistic Regression Class 87 | 88 | The logistic regression is fully described by a weight matrix :math:`W` 89 | and bias vector :math:`b`. Classification is done by projecting data 90 | points onto a set of hyperplanes, the distance to which is used to 91 | determine a class membership probability. 92 | """ 93 | 94 | def __init__(self, input, n_in, n_out): 95 | """ Initialize the parameters of the logistic regression 96 | 97 | :type input: theano.tensor.TensorType 98 | :param input: symbolic variable that describes the input of the 99 | architecture (one minibatch) 100 | 101 | :type n_in: int 102 | :param n_in: number of input units, the dimension of the space in 103 | which the datapoints lie 104 | 105 | :type n_out: int 106 | :param n_out: number of output units, the dimension of the space in 107 | which the labels lie 108 | 109 | """ 110 | # start-snippet-1 111 | # initialize with 0 the weights W as a matrix of shape (n_in, n_out) 112 | self.W = theano.shared( 113 | value=numpy.zeros( 114 | (n_in, n_out), 115 | dtype=theano.config.floatX 116 | ), 117 | name='W', 118 | borrow=True 119 | ) 120 | # initialize the baises b as a vector of n_out 0s 121 | self.b = theano.shared( 122 | value=numpy.zeros( 123 | (n_out,), 124 | dtype=theano.config.floatX 125 | ), 126 | name='b', 127 | borrow=True 128 | ) 129 | 130 | # symbolic expression for computing the matrix of class-membership 131 | # probabilities 132 | # Where: 133 | # W is a matrix where column-k represent the separation hyper plain for 134 | # class-k 135 | # x is a matrix where row-j represents input training sample-j 136 | # b is a vector where element-k represent the free parameter of hyper 137 | # plain-k 138 | self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) 139 | 140 | # symbolic description of how to compute prediction as class whose 141 | # probability is maximal 142 | self.y_pred = T.argmax(self.p_y_given_x, axis=1) 143 | # end-snippet-1 144 | 145 | # parameters of the model 146 | self.params = [self.W, self.b] 147 | 148 | def negative_log_likelihood(self, y): 149 | """Return the mean of the negative log-likelihood of the prediction 150 | of this model under a given target distribution. 151 | 152 | .. 
math:: 153 | 154 | \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) = 155 | \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} 156 | \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\ 157 | \ell (\theta=\{W,b\}, \mathcal{D}) 158 | 159 | :type y: theano.tensor.TensorType 160 | :param y: corresponds to a vector that gives for each example the 161 | correct label 162 | 163 | Note: we use the mean instead of the sum so that 164 | the learning rate is less dependent on the batch size 165 | """ 166 | # start-snippet-2 167 | # y.shape[0] is (symbolically) the number of rows in y, i.e., 168 | # number of examples (call it n) in the minibatch 169 | # T.arange(y.shape[0]) is a symbolic vector which will contain 170 | # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of 171 | # Log-Probabilities (call it LP) with one row per example and 172 | # one column per class LP[T.arange(y.shape[0]),y] is a vector 173 | # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ..., 174 | # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is 175 | # the mean (across minibatch examples) of the elements in v, 176 | # i.e., the mean log-likelihood across the minibatch. 177 | return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]) 178 | # end-snippet-2 179 | 180 | def errors(self, y): 181 | """Return a float representing the number of errors in the minibatch 182 | over the total number of examples of the minibatch ; zero one 183 | loss over the size of the minibatch 184 | 185 | :type y: theano.tensor.TensorType 186 | :param y: corresponds to a vector that gives for each example the 187 | correct label 188 | """ 189 | 190 | # check if y has same dimension of y_pred 191 | if y.ndim != self.y_pred.ndim: 192 | raise TypeError( 193 | 'y should have the same shape as self.y_pred', 194 | ('y', y.type, 'y_pred', self.y_pred.type) 195 | ) 196 | # check if y is of the correct datatype 197 | if y.dtype.startswith('int'): 198 | # the T.neq operator returns a vector of 0s and 1s, where 1 199 | # represents a mistake in prediction 200 | return T.mean(T.neq(self.y_pred, y)) 201 | else: 202 | raise NotImplementedError() 203 | ``` 204 | 我们通过如下代码来实例化这个类。 205 | 206 | ```Pyhton 207 | # generate symbolic variables for input (x and y represent a 208 | # minibatch) 209 | x = T.matrix('x') # data, presented as rasterized images 210 | y = T.ivector('y') # labels, presented as 1D vector of [int] labels 211 | 212 | # construct the logistic regression class 213 | # Each MNIST image has size 28*28 214 | classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10) 215 | 216 | ``` 217 | 需要注意的是,输入向量x,和其相关的标签y都是定义在`LogisticRegression`实体外的。这个类需要将输入数据作为`__init__`函数的参数。这在将这些类的实例连接起来构建深网络方面非常有用。一层的输出可以作为下一层的输入。 218 | 最后,我们定义了一个`cost`变量来最小化。 219 | 220 | ```Python 221 | # the cost we minimize during training is the negative log likelihood of 222 | # the model in symbolic format 223 | cost = classifier.negative_log_likelihood(y) 224 | ``` 225 | 226 | ## 学习模型 227 | 在实现MSGD的许多语言中,需要通过手动求解损失函数对每个参数的梯度(微分)来实现。 228 | 在Theano中呢,这是非常简单的。它自动微分,并且使用了一定的数学转换来提高数学稳定性。 229 | 230 | ```Pyhton 231 | g_W = T.grad(cost=cost, wrt=classifier.W) 232 | g_b = T.grad(cost=cost, wrt=classifier.b) 233 | ``` 234 | 这个函数`train_model`可以被定义如下。 235 | ```Python 236 | # specify how to update the parameters of the model as a list of 237 | # (variable, update expression) pairs. 
238 | updates = [(classifier.W, classifier.W - learning_rate * g_W), 239 | (classifier.b, classifier.b - learning_rate * g_b)] 240 | 241 | # compiling a Theano function `train_model` that returns the cost, but in 242 | # the same time updates the parameter of the model based on the rules 243 | # defined in `updates` 244 | train_model = theano.function( 245 | inputs=[index], 246 | outputs=cost, 247 | updates=updates, 248 | givens={ 249 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 250 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 251 | } 252 | ) 253 | ``` 254 | `update`是一个list,用以更新每一步的参数。`given`是一个字典,用以表示象征变量,和你在该步中表示的数据。这个`train_model`定义如下: 255 | * 输入是minibatch的`index`,batch的大小之前已经固定,以此被定义为x,以及其相关的y。 256 | * 返回是该`index`下与x,y相关的`cost`/损失函数。 257 | * 每一次函数调用,它都先用index对应的训练集的切片来更新x,y。然后计算该minibatch下的cost,以及申请`update`操作。 258 | 每次`train_model(inedx)`被调用,它都计算并返回该minibatch的cost,当然这也是MSGD的一步。整个学习算法因循环了数据集所有样例。 259 | 260 | ## 训练模型 261 | 在之前论述中所说,我们对分类错误的样本感兴趣(不仅仅是可能性)。因此模型中增加了一个额外的实例方法,来纪录每个minibatch中的错误分类样例数。 262 | 263 | ```Python 264 | def errors(self, y): 265 | """Return a float representing the number of errors in the minibatch 266 | over the total number of examples of the minibatch ; zero one 267 | loss over the size of the minibatch 268 | 269 | :type y: theano.tensor.TensorType 270 | :param y: corresponds to a vector that gives for each example the 271 | correct label 272 | """ 273 | 274 | # check if y has same dimension of y_pred 275 | if y.ndim != self.y_pred.ndim: 276 | raise TypeError( 277 | 'y should have the same shape as self.y_pred', 278 | ('y', y.type, 'y_pred', self.y_pred.type) 279 | ) 280 | # check if y is of the correct datatype 281 | if y.dtype.startswith('int'): 282 | # the T.neq operator returns a vector of 0s and 1s, where 1 283 | # represents a mistake in prediction 284 | return T.mean(T.neq(self.y_pred, y)) 285 | else: 286 | raise NotImplementedError() 287 | ``` 288 | 我们创建了`test_model`函数,然后也创建了`validate_model`来调用去修正这个值。当然`validate_model`是early-stopping的关键。它们都是来统计minibatch中分类错误的样例数。 289 | 290 | ```Python 291 | # compiling a Theano function that computes the mistakes that are made by 292 | # the model on a minibatch 293 | test_model = theano.function( 294 | inputs=[index], 295 | outputs=classifier.errors(y), 296 | givens={ 297 | x: test_set_x[index * batch_size: (index + 1) * batch_size], 298 | y: test_set_y[index * batch_size: (index + 1) * batch_size] 299 | } 300 | ) 301 | 302 | validate_model = theano.function( 303 | inputs=[index], 304 | outputs=classifier.errors(y), 305 | givens={ 306 | x: valid_set_x[index * batch_size: (index + 1) * batch_size], 307 | y: valid_set_y[index * batch_size: (index + 1) * batch_size] 308 | } 309 | ) 310 | ``` 311 | ## 把它们组合起来 312 | 最后的代码如下。 313 | ```Python 314 | """ 315 | This tutorial introduces logistic regression using Theano and stochastic 316 | gradient descent. 317 | 318 | Logistic regression is a probabilistic, linear classifier. It is parametrized 319 | by a weight matrix :math:`W` and a bias vector :math:`b`. Classification is 320 | done by projecting data points onto a set of hyperplanes, the distance to 321 | which is used to determine a class membership probability. 322 | 323 | Mathematically, this can be written as: 324 | 325 | .. math:: 326 | P(Y=i|x, W,b) &= softmax_i(W x + b) \\ 327 | &= \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}} 328 | 329 | 330 | The output of the model or prediction is then done by taking the argmax of 331 | the vector whose i'th element is P(Y=i|x). 332 | 333 | .. 
math:: 334 | 335 | y_{pred} = argmax_i P(Y=i|x,W,b) 336 | 337 | 338 | This tutorial presents a stochastic gradient descent optimization method 339 | suitable for large datasets. 340 | 341 | 342 | References: 343 | 344 | - textbooks: "Pattern Recognition and Machine Learning" - 345 | Christopher M. Bishop, section 4.3.2 346 | 347 | """ 348 | __docformat__ = 'restructedtext en' 349 | 350 | import cPickle 351 | import gzip 352 | import os 353 | import sys 354 | import time 355 | 356 | import numpy 357 | 358 | import theano 359 | import theano.tensor as T 360 | 361 | 362 | class LogisticRegression(object): 363 | """Multi-class Logistic Regression Class 364 | 365 | The logistic regression is fully described by a weight matrix :math:`W` 366 | and bias vector :math:`b`. Classification is done by projecting data 367 | points onto a set of hyperplanes, the distance to which is used to 368 | determine a class membership probability. 369 | """ 370 | 371 | def __init__(self, input, n_in, n_out): 372 | """ Initialize the parameters of the logistic regression 373 | 374 | :type input: theano.tensor.TensorType 375 | :param input: symbolic variable that describes the input of the 376 | architecture (one minibatch) 377 | 378 | :type n_in: int 379 | :param n_in: number of input units, the dimension of the space in 380 | which the datapoints lie 381 | 382 | :type n_out: int 383 | :param n_out: number of output units, the dimension of the space in 384 | which the labels lie 385 | 386 | """ 387 | # start-snippet-1 388 | # initialize with 0 the weights W as a matrix of shape (n_in, n_out) 389 | self.W = theano.shared( 390 | value=numpy.zeros( 391 | (n_in, n_out), 392 | dtype=theano.config.floatX 393 | ), 394 | name='W', 395 | borrow=True 396 | ) 397 | # initialize the baises b as a vector of n_out 0s 398 | self.b = theano.shared( 399 | value=numpy.zeros( 400 | (n_out,), 401 | dtype=theano.config.floatX 402 | ), 403 | name='b', 404 | borrow=True 405 | ) 406 | 407 | # symbolic expression for computing the matrix of class-membership 408 | # probabilities 409 | # Where: 410 | # W is a matrix where column-k represent the separation hyper plain for 411 | # class-k 412 | # x is a matrix where row-j represents input training sample-j 413 | # b is a vector where element-k represent the free parameter of hyper 414 | # plain-k 415 | self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) 416 | 417 | # symbolic description of how to compute prediction as class whose 418 | # probability is maximal 419 | self.y_pred = T.argmax(self.p_y_given_x, axis=1) 420 | # end-snippet-1 421 | 422 | # parameters of the model 423 | self.params = [self.W, self.b] 424 | 425 | def negative_log_likelihood(self, y): 426 | """Return the mean of the negative log-likelihood of the prediction 427 | of this model under a given target distribution. 428 | 429 | .. 
math:: 430 | 431 | \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) = 432 | \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} 433 | \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\ 434 | \ell (\theta=\{W,b\}, \mathcal{D}) 435 | 436 | :type y: theano.tensor.TensorType 437 | :param y: corresponds to a vector that gives for each example the 438 | correct label 439 | 440 | Note: we use the mean instead of the sum so that 441 | the learning rate is less dependent on the batch size 442 | """ 443 | # start-snippet-2 444 | # y.shape[0] is (symbolically) the number of rows in y, i.e., 445 | # number of examples (call it n) in the minibatch 446 | # T.arange(y.shape[0]) is a symbolic vector which will contain 447 | # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of 448 | # Log-Probabilities (call it LP) with one row per example and 449 | # one column per class LP[T.arange(y.shape[0]),y] is a vector 450 | # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ..., 451 | # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is 452 | # the mean (across minibatch examples) of the elements in v, 453 | # i.e., the mean log-likelihood across the minibatch. 454 | return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]) 455 | # end-snippet-2 456 | 457 | def errors(self, y): 458 | """Return a float representing the number of errors in the minibatch 459 | over the total number of examples of the minibatch ; zero one 460 | loss over the size of the minibatch 461 | 462 | :type y: theano.tensor.TensorType 463 | :param y: corresponds to a vector that gives for each example the 464 | correct label 465 | """ 466 | 467 | # check if y has same dimension of y_pred 468 | if y.ndim != self.y_pred.ndim: 469 | raise TypeError( 470 | 'y should have the same shape as self.y_pred', 471 | ('y', y.type, 'y_pred', self.y_pred.type) 472 | ) 473 | # check if y is of the correct datatype 474 | if y.dtype.startswith('int'): 475 | # the T.neq operator returns a vector of 0s and 1s, where 1 476 | # represents a mistake in prediction 477 | return T.mean(T.neq(self.y_pred, y)) 478 | else: 479 | raise NotImplementedError() 480 | 481 | 482 | def load_data(dataset): 483 | ''' Loads the dataset 484 | 485 | :type dataset: string 486 | :param dataset: the path to the dataset (here MNIST) 487 | ''' 488 | 489 | ############# 490 | # LOAD DATA # 491 | ############# 492 | 493 | # Download the MNIST dataset if it is not present 494 | data_dir, data_file = os.path.split(dataset) 495 | if data_dir == "" and not os.path.isfile(dataset): 496 | # Check if dataset is in the data directory. 497 | new_path = os.path.join( 498 | os.path.split(__file__)[0], 499 | "..", 500 | "data", 501 | dataset 502 | ) 503 | if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz': 504 | dataset = new_path 505 | 506 | if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz': 507 | import urllib 508 | origin = ( 509 | 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz' 510 | ) 511 | print 'Downloading data from %s' % origin 512 | urllib.urlretrieve(origin, dataset) 513 | 514 | print '... loading data' 515 | 516 | # Load the dataset 517 | f = gzip.open(dataset, 'rb') 518 | train_set, valid_set, test_set = cPickle.load(f) 519 | f.close() 520 | #train_set, valid_set, test_set format: tuple(input, target) 521 | #input is an numpy.ndarray of 2 dimensions (a matrix) 522 | #witch row's correspond to an example. target is a 523 | #numpy.ndarray of 1 dimensions (vector)) that have the same length as 524 | #the number of rows in the input. 
It should give the target 525 | #target to the example with the same index in the input. 526 | 527 | def shared_dataset(data_xy, borrow=True): 528 | """ Function that loads the dataset into shared variables 529 | 530 | The reason we store our dataset in shared variables is to allow 531 | Theano to copy it into the GPU memory (when code is run on GPU). 532 | Since copying data into the GPU is slow, copying a minibatch everytime 533 | is needed (the default behaviour if the data is not in a shared 534 | variable) would lead to a large decrease in performance. 535 | """ 536 | data_x, data_y = data_xy 537 | shared_x = theano.shared(numpy.asarray(data_x, 538 | dtype=theano.config.floatX), 539 | borrow=borrow) 540 | shared_y = theano.shared(numpy.asarray(data_y, 541 | dtype=theano.config.floatX), 542 | borrow=borrow) 543 | # When storing data on the GPU it has to be stored as floats 544 | # therefore we will store the labels as ``floatX`` as well 545 | # (``shared_y`` does exactly that). But during our computations 546 | # we need them as ints (we use labels as index, and if they are 547 | # floats it doesn't make sense) therefore instead of returning 548 | # ``shared_y`` we will have to cast it to int. This little hack 549 | # lets ous get around this issue 550 | return shared_x, T.cast(shared_y, 'int32') 551 | 552 | test_set_x, test_set_y = shared_dataset(test_set) 553 | valid_set_x, valid_set_y = shared_dataset(valid_set) 554 | train_set_x, train_set_y = shared_dataset(train_set) 555 | 556 | rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y), 557 | (test_set_x, test_set_y)] 558 | return rval 559 | 560 | 561 | def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000, 562 | dataset='mnist.pkl.gz', 563 | batch_size=600): 564 | """ 565 | Demonstrate stochastic gradient descent optimization of a log-linear 566 | model 567 | 568 | This is demonstrated on MNIST. 569 | 570 | :type learning_rate: float 571 | :param learning_rate: learning rate used (factor for the stochastic 572 | gradient) 573 | 574 | :type n_epochs: int 575 | :param n_epochs: maximal number of epochs to run the optimizer 576 | 577 | :type dataset: string 578 | :param dataset: the path of the MNIST dataset file from 579 | http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz 580 | 581 | """ 582 | datasets = load_data(dataset) 583 | 584 | train_set_x, train_set_y = datasets[0] 585 | valid_set_x, valid_set_y = datasets[1] 586 | test_set_x, test_set_y = datasets[2] 587 | 588 | # compute number of minibatches for training, validation and testing 589 | n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size 590 | n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size 591 | n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size 592 | 593 | ###################### 594 | # BUILD ACTUAL MODEL # 595 | ###################### 596 | print '... 
building the model' 597 | 598 | # allocate symbolic variables for the data 599 | index = T.lscalar() # index to a [mini]batch 600 | 601 | # generate symbolic variables for input (x and y represent a 602 | # minibatch) 603 | x = T.matrix('x') # data, presented as rasterized images 604 | y = T.ivector('y') # labels, presented as 1D vector of [int] labels 605 | 606 | # construct the logistic regression class 607 | # Each MNIST image has size 28*28 608 | classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10) 609 | 610 | # the cost we minimize during training is the negative log likelihood of 611 | # the model in symbolic format 612 | cost = classifier.negative_log_likelihood(y) 613 | 614 | # compiling a Theano function that computes the mistakes that are made by 615 | # the model on a minibatch 616 | test_model = theano.function( 617 | inputs=[index], 618 | outputs=classifier.errors(y), 619 | givens={ 620 | x: test_set_x[index * batch_size: (index + 1) * batch_size], 621 | y: test_set_y[index * batch_size: (index + 1) * batch_size] 622 | } 623 | ) 624 | 625 | validate_model = theano.function( 626 | inputs=[index], 627 | outputs=classifier.errors(y), 628 | givens={ 629 | x: valid_set_x[index * batch_size: (index + 1) * batch_size], 630 | y: valid_set_y[index * batch_size: (index + 1) * batch_size] 631 | } 632 | ) 633 | 634 | # compute the gradient of cost with respect to theta = (W,b) 635 | g_W = T.grad(cost=cost, wrt=classifier.W) 636 | g_b = T.grad(cost=cost, wrt=classifier.b) 637 | 638 | # start-snippet-3 639 | # specify how to update the parameters of the model as a list of 640 | # (variable, update expression) pairs. 641 | updates = [(classifier.W, classifier.W - learning_rate * g_W), 642 | (classifier.b, classifier.b - learning_rate * g_b)] 643 | 644 | # compiling a Theano function `train_model` that returns the cost, but in 645 | # the same time updates the parameter of the model based on the rules 646 | # defined in `updates` 647 | train_model = theano.function( 648 | inputs=[index], 649 | outputs=cost, 650 | updates=updates, 651 | givens={ 652 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 653 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 654 | } 655 | ) 656 | # end-snippet-3 657 | 658 | ############### 659 | # TRAIN MODEL # 660 | ############### 661 | print '... training the model' 662 | # early-stopping parameters 663 | patience = 5000 # look as this many examples regardless 664 | patience_increase = 2 # wait this much longer when a new best is 665 | # found 666 | improvement_threshold = 0.995 # a relative improvement of this much is 667 | # considered significant 668 | validation_frequency = min(n_train_batches, patience / 2) 669 | # go through this many 670 | # minibatche before checking the network 671 | # on the validation set; in this case we 672 | # check every epoch 673 | 674 | best_validation_loss = numpy.inf 675 | test_score = 0. 
676 | start_time = time.clock() 677 | 678 | done_looping = False 679 | epoch = 0 680 | while (epoch < n_epochs) and (not done_looping): 681 | epoch = epoch + 1 682 | for minibatch_index in xrange(n_train_batches): 683 | 684 | minibatch_avg_cost = train_model(minibatch_index) 685 | # iteration number 686 | iter = (epoch - 1) * n_train_batches + minibatch_index 687 | 688 | if (iter + 1) % validation_frequency == 0: 689 | # compute zero-one loss on validation set 690 | validation_losses = [validate_model(i) 691 | for i in xrange(n_valid_batches)] 692 | this_validation_loss = numpy.mean(validation_losses) 693 | 694 | print( 695 | 'epoch %i, minibatch %i/%i, validation error %f %%' % 696 | ( 697 | epoch, 698 | minibatch_index + 1, 699 | n_train_batches, 700 | this_validation_loss * 100. 701 | ) 702 | ) 703 | 704 | # if we got the best validation score until now 705 | if this_validation_loss < best_validation_loss: 706 | #improve patience if loss improvement is good enough 707 | if this_validation_loss < best_validation_loss * \ 708 | improvement_threshold: 709 | patience = max(patience, iter * patience_increase) 710 | 711 | best_validation_loss = this_validation_loss 712 | # test it on the test set 713 | 714 | test_losses = [test_model(i) 715 | for i in xrange(n_test_batches)] 716 | test_score = numpy.mean(test_losses) 717 | 718 | print( 719 | ( 720 | ' epoch %i, minibatch %i/%i, test error of' 721 | ' best model %f %%' 722 | ) % 723 | ( 724 | epoch, 725 | minibatch_index + 1, 726 | n_train_batches, 727 | test_score * 100. 728 | ) 729 | ) 730 | 731 | if patience <= iter: 732 | done_looping = True 733 | break 734 | 735 | end_time = time.clock() 736 | print( 737 | ( 738 | 'Optimization complete with best validation score of %f %%,' 739 | 'with test performance %f %%' 740 | ) 741 | % (best_validation_loss * 100., test_score * 100.) 742 | ) 743 | print 'The code run for %d epochs, with %f epochs/sec' % ( 744 | epoch, 1. * epoch / (end_time - start_time)) 745 | print >> sys.stderr, ('The code for file ' + 746 | os.path.split(__file__)[1] + 747 | ' ran for %.1fs' % ((end_time - start_time))) 748 | 749 | if __name__ == '__main__': 750 | sgd_optimization_mnist() 751 | ``` 752 | 这个输出将是如下的格式 753 | ``` 754 | ... 755 | epoch 72, minibatch 83/83, validation error 7.510417 % 756 | epoch 72, minibatch 83/83, test error of best model 7.510417 % 757 | epoch 73, minibatch 83/83, validation error 7.500000 % 758 | epoch 73, minibatch 83/83, test error of best model 7.489583 % 759 | Optimization complete with best validation score of 7.500000 %,with test performance 7.489583 % 760 | The code run for 74 epochs, with 1.936983 epochs/sec 761 | ``` 762 | 在`Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 Ghz`上,这个代码的速度是1.936 epochs/sec然后跑75 epochs,得到测试错误率为7.489%。在GPU上为10.0 epochs/sec在这个实例中我们定义为batch的大小为600。 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | 788 | 789 | 790 | -------------------------------------------------------------------------------- /3_Multilayer_Perceptron_多层感知机.md: -------------------------------------------------------------------------------- 1 | 多层感知机(Multilayer Perceptron) 2 | ================================== 3 | 4 | 在本节中,假设你已经了解了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)。同时本节的所有代码可以在[这里](http://deeplearning.net/tutorial/code/mlp.py)下载. 
5 | 6 | 下一个我们将在Theano中使用的结构是单隐层的多层感知机(MLP)。MLP可以被看作一个逻辑回归分类器。这个中间层被称为隐藏层。一个单隐层对于MLP成为通用近似器是有效的。然而在后面,我们将讲述使用多个隐藏层的好处,例如深度学习的前提。这个课程介绍了[MLP,反向误差传导,如何训练MLPs](http://www.iro.umontreal.ca/~pift6266/H10/notes/mlp.html)。 7 | 8 | ## 模型 9 | 一个多层感知机(或者说人工神经网络——ANN),在只有一个隐藏层时可以被表示为如下的图: 10 | 11 | ![mlp_model_1](/images/3_the_model_1.png) 12 | 13 | 事实上,一个单隐藏层的MLP是一个如下的函数![mlp_model_2](/images/3_the_model_2.png),其中x是输入向量的维度,L是输出向量的维度。我们用下面的公式来表示MLP模型: 14 | 15 | ![mlp_model_3](/images/3_the_model_3.png) 16 | 17 | 其中b_1,W_1是输出层到隐藏层的偏置向量和权值矩阵,s是该层的激活函数。而b_2,W_2是隐藏层到输出层的偏置向量和权值矩阵,G是该层的激活函数。通常选择s为sigmoid函数,G为softmax函数。 18 | 在训练MLP模型的参数时,我们使用minibatch的随机梯度下降,在获得梯度后使用反向误差传导算法来实现参数的训练。由于Theano提供自动的微分,我们不需要在这个教程里面谈及这个方面。 19 | 20 | ## 从逻辑回归到多层感知机 21 | 本教程将专注于单隐藏层的MLP。我们以隐藏层的类的实现开始,如果要构建一个MLP,只需要在此基础上添加一个逻辑回归就好。 22 | 23 | ```Python 24 | class HiddenLayer(object): 25 | def __init__(self, rng, input, n_in, n_out, W=None, b=None, 26 | activation=T.tanh): 27 | """ 28 | Typical hidden layer of a MLP: units are fully-connected and have 29 | sigmoidal activation function. Weight matrix W is of shape (n_in,n_out) 30 | and the bias vector b is of shape (n_out,). 31 | 32 | NOTE : The nonlinearity used here is tanh 33 | 34 | Hidden unit activation is given by: tanh(dot(input,W) + b) 35 | 36 | :type rng: numpy.random.RandomState 37 | :param rng: a random number generator used to initialize weights 38 | 39 | :type input: theano.tensor.dmatrix 40 | :param input: a symbolic tensor of shape (n_examples, n_in) 41 | 42 | :type n_in: int 43 | :param n_in: dimensionality of input 44 | 45 | :type n_out: int 46 | :param n_out: number of hidden units 47 | 48 | :type activation: theano.Op or function 49 | :param activation: Non linearity to be applied in the hidden 50 | layer 51 | """ 52 | self.input = input 53 | ``` 54 | 一个隐藏层的权值初始化,应当从基于激活函数的均匀间隔中均匀采样。对于sigmoid函数而言,这个间隔是![interval](/images/3_from_lr_to_mlp_1.png)。其中fan_in是第(i-1)层的单元数目,fan_out是第(i)层单元的数目,[结论出自这里](http://deeplearning.net/tutorial/references.html#xavier10)。 55 | 这样的初始化,保证了在训练的早期,每个神经元都可以工作在它激活函数的控制范围内,从而使得信息可以更简单的前向传导(从输入到输出的激活)和后向传导(从输出到输入的梯度)。 56 | 57 | ```Python 58 | # `W` is initialized with `W_values` which is uniformely sampled 59 | # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden)) 60 | # for tanh activation function 61 | # the output of uniform if converted using asarray to dtype 62 | # theano.config.floatX so that the code is runable on GPU 63 | # Note : optimal initialization of weights is dependent on the 64 | # activation function used (among other things). 65 | # For example, results presented in [Xavier10] suggest that you 66 | # should use 4 times larger initial weights for sigmoid 67 | # compared to tanh 68 | # We have no info for other function, so we use the same as 69 | # tanh. 70 | if W is None: 71 | W_values = numpy.asarray( 72 | rng.uniform( 73 | low=-numpy.sqrt(6. / (n_in + n_out)), 74 | high=numpy.sqrt(6. 
/ (n_in + n_out)), 75 | size=(n_in, n_out) 76 | ), 77 | dtype=theano.config.floatX 78 | ) 79 | if activation == theano.tensor.nnet.sigmoid: 80 | W_values *= 4 81 | 82 | W = theano.shared(value=W_values, name='W', borrow=True) 83 | 84 | if b is None: 85 | b_values = numpy.zeros((n_out,), dtype=theano.config.floatX) 86 | b = theano.shared(value=b_values, name='b', borrow=True) 87 | 88 | self.W = W 89 | self.b = b 90 | ``` 91 | 注意,我们通过会将一个给定的非线性函数作为隐藏层的激活函数。默认是`tanh`函数,当然很多时候你可能需要其他函数。 92 | 93 | ```Python 94 | lin_output = T.dot(input, self.W) + self.b 95 | self.output = ( 96 | lin_output if activation is None 97 | else activation(lin_output) 98 | ) 99 | ``` 100 | 如果你已经阅读了上面的隐藏层输出和[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)。那么你可以看下面的MLP类的实现了。 101 | 102 | ```Python 103 | class MLP(object): 104 | """Multi-Layer Perceptron Class 105 | 106 | A multilayer perceptron is a feedforward artificial neural network model 107 | that has one layer or more of hidden units and nonlinear activations. 108 | Intermediate layers usually have as activation function tanh or the 109 | sigmoid function (defined here by a ``HiddenLayer`` class) while the 110 | top layer is a softamx layer (defined here by a ``LogisticRegression`` 111 | class). 112 | """ 113 | 114 | def __init__(self, rng, input, n_in, n_hidden, n_out): 115 | """Initialize the parameters for the multilayer perceptron 116 | 117 | :type rng: numpy.random.RandomState 118 | :param rng: a random number generator used to initialize weights 119 | 120 | :type input: theano.tensor.TensorType 121 | :param input: symbolic variable that describes the input of the 122 | architecture (one minibatch) 123 | 124 | :type n_in: int 125 | :param n_in: number of input units, the dimension of the space in 126 | which the datapoints lie 127 | 128 | :type n_hidden: int 129 | :param n_hidden: number of hidden units 130 | 131 | :type n_out: int 132 | :param n_out: number of output units, the dimension of the space in 133 | which the labels lie 134 | 135 | """ 136 | 137 | # Since we are dealing with a one hidden layer MLP, this will translate 138 | # into a HiddenLayer with a tanh activation function connected to the 139 | # LogisticRegression layer; the activation function can be replaced by 140 | # sigmoid or any other nonlinear function 141 | self.hiddenLayer = HiddenLayer( 142 | rng=rng, 143 | input=input, 144 | n_in=n_in, 145 | n_out=n_hidden, 146 | activation=T.tanh 147 | ) 148 | 149 | # The logistic regression layer gets as input the hidden units 150 | # of the hidden layer 151 | self.logRegressionLayer = LogisticRegression( 152 | input=self.hiddenLayer.output, 153 | n_in=n_hidden, 154 | n_out=n_out 155 | ) 156 | ``` 157 | 在本节中,我们也使用L1/L2正则化([L1/L2正则化](http://deeplearning.net/tutorial/gettingstarted.html#l1-l2-regularization))。所以我们需要去计算W_1和W_2矩阵的L1正则和L2平方正则。 158 | 159 | ```Python 160 | # L1 norm ; one regularization option is to enforce L1 norm to 161 | # be small 162 | self.L1 = ( 163 | abs(self.hiddenLayer.W).sum() 164 | + abs(self.logRegressionLayer.W).sum() 165 | ) 166 | 167 | # square of L2 norm ; one regularization option is to enforce 168 | # square of L2 norm to be small 169 | self.L2_sqr = ( 170 | (self.hiddenLayer.W ** 2).sum() 171 | + (self.logRegressionLayer.W ** 2).sum() 172 | ) 173 | 174 | # negative log likelihood of the MLP is given by the negative 175 | # log likelihood of the output of the model, computed in the 176 | # logistic regression layer 177 | 
self.negative_log_likelihood = ( 178 | self.logRegressionLayer.negative_log_likelihood 179 | ) 180 | # same holds for the function computing the number of errors 181 | self.errors = self.logRegressionLayer.errors 182 | 183 | # the parameters of the model are the parameters of the two layer it is 184 | # made out of 185 | self.params = self.hiddenLayer.params + self.logRegressionLayer.params 186 | ``` 187 | 在此之前,我们使用minibatch的随机梯度下降来训练这个模型。不同的是,我们现在在`cost`函数里面添加了正则项。`L1_reg`和`L2_reg`可以控制权值矩阵的正则化。计算新cost的代码如下: 188 | 189 | ```Python 190 | # the cost we minimize during training is the negative log likelihood of 191 | # the model plus the regularization terms (L1 and L2); cost is expressed 192 | # here symbolically 193 | cost = ( 194 | classifier.negative_log_likelihood(y) 195 | + L1_reg * classifier.L1 196 | + L2_reg * classifier.L2_sqr 197 | ) 198 | ``` 199 | 我们使用梯度来更新模型参数,这基本和逻辑回归里面的一样。我们从模型的`params`中获取参数列表,然后分析它,并计算每一步的梯度。 200 | 201 | ```Python 202 | # compute the gradient of cost with respect to theta (sotred in params) 203 | # the resulting gradients will be stored in a list gparams 204 | gparams = [T.grad(cost, param) for param in classifier.params] 205 | 206 | # specify how to update the parameters of the model as a list of 207 | # (variable, update expression) pairs 208 | 209 | # given two list the zip A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4] of 210 | # same length, zip generates a list C of same size, where each element 211 | # is a pair formed from the two lists : 212 | # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)] 213 | updates = [ 214 | (param, param - learning_rate * gparam) 215 | for param, gparam in zip(classifier.params, gparams) 216 | ] 217 | 218 | # compiling a Theano function `train_model` that returns the cost, but 219 | # in the same time updates the parameter of the model based on the rules 220 | # defined in `updates` 221 | train_model = theano.function( 222 | inputs=[index], 223 | outputs=cost, 224 | updates=updates, 225 | givens={ 226 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 227 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 228 | } 229 | ) 230 | ``` 231 | 232 | ## 把它组合起来 233 | 已经解释了所有的基本该概念,下面的代码就是一个完整的MLP类。 234 | 235 | ```Python 236 | """ 237 | This tutorial introduces the multilayer perceptron using Theano. 238 | 239 | A multilayer perceptron is a logistic regressor where 240 | instead of feeding the input to the logistic regression you insert a 241 | intermediate layer, called the hidden layer, that has a nonlinear 242 | activation function (usually tanh or sigmoid) . One can use many such 243 | hidden layers making the architecture deep. The tutorial will also tackle 244 | the problem of MNIST digit classification. 245 | 246 | .. math:: 247 | 248 | f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))), 249 | 250 | References: 251 | 252 | - textbooks: "Pattern Recognition and Machine Learning" - 253 | Christopher M. Bishop, section 5 254 | 255 | """ 256 | __docformat__ = 'restructedtext en' 257 | 258 | 259 | import os 260 | import sys 261 | import time 262 | 263 | import numpy 264 | 265 | import theano 266 | import theano.tensor as T 267 | 268 | 269 | from logistic_sgd import LogisticRegression, load_data 270 | 271 | 272 | # start-snippet-1 273 | class HiddenLayer(object): 274 | def __init__(self, rng, input, n_in, n_out, W=None, b=None, 275 | activation=T.tanh): 276 | """ 277 | Typical hidden layer of a MLP: units are fully-connected and have 278 | sigmoidal activation function. 
Weight matrix W is of shape (n_in,n_out) 279 | and the bias vector b is of shape (n_out,). 280 | 281 | NOTE : The nonlinearity used here is tanh 282 | 283 | Hidden unit activation is given by: tanh(dot(input,W) + b) 284 | 285 | :type rng: numpy.random.RandomState 286 | :param rng: a random number generator used to initialize weights 287 | 288 | :type input: theano.tensor.dmatrix 289 | :param input: a symbolic tensor of shape (n_examples, n_in) 290 | 291 | :type n_in: int 292 | :param n_in: dimensionality of input 293 | 294 | :type n_out: int 295 | :param n_out: number of hidden units 296 | 297 | :type activation: theano.Op or function 298 | :param activation: Non linearity to be applied in the hidden 299 | layer 300 | """ 301 | self.input = input 302 | # end-snippet-1 303 | 304 | # `W` is initialized with `W_values` which is uniformely sampled 305 | # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden)) 306 | # for tanh activation function 307 | # the output of uniform if converted using asarray to dtype 308 | # theano.config.floatX so that the code is runable on GPU 309 | # Note : optimal initialization of weights is dependent on the 310 | # activation function used (among other things). 311 | # For example, results presented in [Xavier10] suggest that you 312 | # should use 4 times larger initial weights for sigmoid 313 | # compared to tanh 314 | # We have no info for other function, so we use the same as 315 | # tanh. 316 | if W is None: 317 | W_values = numpy.asarray( 318 | rng.uniform( 319 | low=-numpy.sqrt(6. / (n_in + n_out)), 320 | high=numpy.sqrt(6. / (n_in + n_out)), 321 | size=(n_in, n_out) 322 | ), 323 | dtype=theano.config.floatX 324 | ) 325 | if activation == theano.tensor.nnet.sigmoid: 326 | W_values *= 4 327 | 328 | W = theano.shared(value=W_values, name='W', borrow=True) 329 | 330 | if b is None: 331 | b_values = numpy.zeros((n_out,), dtype=theano.config.floatX) 332 | b = theano.shared(value=b_values, name='b', borrow=True) 333 | 334 | self.W = W 335 | self.b = b 336 | 337 | lin_output = T.dot(input, self.W) + self.b 338 | self.output = ( 339 | lin_output if activation is None 340 | else activation(lin_output) 341 | ) 342 | # parameters of the model 343 | self.params = [self.W, self.b] 344 | 345 | 346 | # start-snippet-2 347 | class MLP(object): 348 | """Multi-Layer Perceptron Class 349 | 350 | A multilayer perceptron is a feedforward artificial neural network model 351 | that has one layer or more of hidden units and nonlinear activations. 352 | Intermediate layers usually have as activation function tanh or the 353 | sigmoid function (defined here by a ``HiddenLayer`` class) while the 354 | top layer is a softamx layer (defined here by a ``LogisticRegression`` 355 | class). 
356 | """ 357 | 358 | def __init__(self, rng, input, n_in, n_hidden, n_out): 359 | """Initialize the parameters for the multilayer perceptron 360 | 361 | :type rng: numpy.random.RandomState 362 | :param rng: a random number generator used to initialize weights 363 | 364 | :type input: theano.tensor.TensorType 365 | :param input: symbolic variable that describes the input of the 366 | architecture (one minibatch) 367 | 368 | :type n_in: int 369 | :param n_in: number of input units, the dimension of the space in 370 | which the datapoints lie 371 | 372 | :type n_hidden: int 373 | :param n_hidden: number of hidden units 374 | 375 | :type n_out: int 376 | :param n_out: number of output units, the dimension of the space in 377 | which the labels lie 378 | 379 | """ 380 | 381 | # Since we are dealing with a one hidden layer MLP, this will translate 382 | # into a HiddenLayer with a tanh activation function connected to the 383 | # LogisticRegression layer; the activation function can be replaced by 384 | # sigmoid or any other nonlinear function 385 | self.hiddenLayer = HiddenLayer( 386 | rng=rng, 387 | input=input, 388 | n_in=n_in, 389 | n_out=n_hidden, 390 | activation=T.tanh 391 | ) 392 | 393 | # The logistic regression layer gets as input the hidden units 394 | # of the hidden layer 395 | self.logRegressionLayer = LogisticRegression( 396 | input=self.hiddenLayer.output, 397 | n_in=n_hidden, 398 | n_out=n_out 399 | ) 400 | # end-snippet-2 start-snippet-3 401 | # L1 norm ; one regularization option is to enforce L1 norm to 402 | # be small 403 | self.L1 = ( 404 | abs(self.hiddenLayer.W).sum() 405 | + abs(self.logRegressionLayer.W).sum() 406 | ) 407 | 408 | # square of L2 norm ; one regularization option is to enforce 409 | # square of L2 norm to be small 410 | self.L2_sqr = ( 411 | (self.hiddenLayer.W ** 2).sum() 412 | + (self.logRegressionLayer.W ** 2).sum() 413 | ) 414 | 415 | # negative log likelihood of the MLP is given by the negative 416 | # log likelihood of the output of the model, computed in the 417 | # logistic regression layer 418 | self.negative_log_likelihood = ( 419 | self.logRegressionLayer.negative_log_likelihood 420 | ) 421 | # same holds for the function computing the number of errors 422 | self.errors = self.logRegressionLayer.errors 423 | 424 | # the parameters of the model are the parameters of the two layer it is 425 | # made out of 426 | self.params = self.hiddenLayer.params + self.logRegressionLayer.params 427 | # end-snippet-3 428 | 429 | 430 | def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000, 431 | dataset='mnist.pkl.gz', batch_size=20, n_hidden=500): 432 | """ 433 | Demonstrate stochastic gradient descent optimization for a multilayer 434 | perceptron 435 | 436 | This is demonstrated on MNIST. 
437 | 438 | :type learning_rate: float 439 | :param learning_rate: learning rate used (factor for the stochastic 440 | gradient 441 | 442 | :type L1_reg: float 443 | :param L1_reg: L1-norm's weight when added to the cost (see 444 | regularization) 445 | 446 | :type L2_reg: float 447 | :param L2_reg: L2-norm's weight when added to the cost (see 448 | regularization) 449 | 450 | :type n_epochs: int 451 | :param n_epochs: maximal number of epochs to run the optimizer 452 | 453 | :type dataset: string 454 | :param dataset: the path of the MNIST dataset file from 455 | http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz 456 | 457 | 458 | """ 459 | datasets = load_data(dataset) 460 | 461 | train_set_x, train_set_y = datasets[0] 462 | valid_set_x, valid_set_y = datasets[1] 463 | test_set_x, test_set_y = datasets[2] 464 | 465 | # compute number of minibatches for training, validation and testing 466 | n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size 467 | n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size 468 | n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size 469 | 470 | ###################### 471 | # BUILD ACTUAL MODEL # 472 | ###################### 473 | print '... building the model' 474 | 475 | # allocate symbolic variables for the data 476 | index = T.lscalar() # index to a [mini]batch 477 | x = T.matrix('x') # the data is presented as rasterized images 478 | y = T.ivector('y') # the labels are presented as 1D vector of 479 | # [int] labels 480 | 481 | rng = numpy.random.RandomState(1234) 482 | 483 | # construct the MLP class 484 | classifier = MLP( 485 | rng=rng, 486 | input=x, 487 | n_in=28 * 28, 488 | n_hidden=n_hidden, 489 | n_out=10 490 | ) 491 | 492 | # start-snippet-4 493 | # the cost we minimize during training is the negative log likelihood of 494 | # the model plus the regularization terms (L1 and L2); cost is expressed 495 | # here symbolically 496 | cost = ( 497 | classifier.negative_log_likelihood(y) 498 | + L1_reg * classifier.L1 499 | + L2_reg * classifier.L2_sqr 500 | ) 501 | # end-snippet-4 502 | 503 | # compiling a Theano function that computes the mistakes that are made 504 | # by the model on a minibatch 505 | test_model = theano.function( 506 | inputs=[index], 507 | outputs=classifier.errors(y), 508 | givens={ 509 | x: test_set_x[index * batch_size:(index + 1) * batch_size], 510 | y: test_set_y[index * batch_size:(index + 1) * batch_size] 511 | } 512 | ) 513 | 514 | validate_model = theano.function( 515 | inputs=[index], 516 | outputs=classifier.errors(y), 517 | givens={ 518 | x: valid_set_x[index * batch_size:(index + 1) * batch_size], 519 | y: valid_set_y[index * batch_size:(index + 1) * batch_size] 520 | } 521 | ) 522 | 523 | # start-snippet-5 524 | # compute the gradient of cost with respect to theta (sotred in params) 525 | # the resulting gradients will be stored in a list gparams 526 | gparams = [T.grad(cost, param) for param in classifier.params] 527 | 528 | # specify how to update the parameters of the model as a list of 529 | # (variable, update expression) pairs 530 | 531 | # given two list the zip A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4] of 532 | # same length, zip generates a list C of same size, where each element 533 | # is a pair formed from the two lists : 534 | # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)] 535 | updates = [ 536 | (param, param - learning_rate * gparam) 537 | for param, gparam in zip(classifier.params, gparams) 538 | ] 539 | 540 | # compiling a 
Theano function `train_model` that returns the cost, but 541 | # in the same time updates the parameter of the model based on the rules 542 | # defined in `updates` 543 | train_model = theano.function( 544 | inputs=[index], 545 | outputs=cost, 546 | updates=updates, 547 | givens={ 548 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 549 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 550 | } 551 | ) 552 | # end-snippet-5 553 | 554 | ############### 555 | # TRAIN MODEL # 556 | ############### 557 | print '... training' 558 | 559 | # early-stopping parameters 560 | patience = 10000 # look as this many examples regardless 561 | patience_increase = 2 # wait this much longer when a new best is 562 | # found 563 | improvement_threshold = 0.995 # a relative improvement of this much is 564 | # considered significant 565 | validation_frequency = min(n_train_batches, patience / 2) 566 | # go through this many 567 | # minibatche before checking the network 568 | # on the validation set; in this case we 569 | # check every epoch 570 | 571 | best_validation_loss = numpy.inf 572 | best_iter = 0 573 | test_score = 0. 574 | start_time = time.clock() 575 | 576 | epoch = 0 577 | done_looping = False 578 | 579 | while (epoch < n_epochs) and (not done_looping): 580 | epoch = epoch + 1 581 | for minibatch_index in xrange(n_train_batches): 582 | 583 | minibatch_avg_cost = train_model(minibatch_index) 584 | # iteration number 585 | iter = (epoch - 1) * n_train_batches + minibatch_index 586 | 587 | if (iter + 1) % validation_frequency == 0: 588 | # compute zero-one loss on validation set 589 | validation_losses = [validate_model(i) for i 590 | in xrange(n_valid_batches)] 591 | this_validation_loss = numpy.mean(validation_losses) 592 | 593 | print( 594 | 'epoch %i, minibatch %i/%i, validation error %f %%' % 595 | ( 596 | epoch, 597 | minibatch_index + 1, 598 | n_train_batches, 599 | this_validation_loss * 100. 600 | ) 601 | ) 602 | 603 | # if we got the best validation score until now 604 | if this_validation_loss < best_validation_loss: 605 | #improve patience if loss improvement is good enough 606 | if ( 607 | this_validation_loss < best_validation_loss * 608 | improvement_threshold 609 | ): 610 | patience = max(patience, iter * patience_increase) 611 | 612 | best_validation_loss = this_validation_loss 613 | best_iter = iter 614 | 615 | # test it on the test set 616 | test_losses = [test_model(i) for i 617 | in xrange(n_test_batches)] 618 | test_score = numpy.mean(test_losses) 619 | 620 | print((' epoch %i, minibatch %i/%i, test error of ' 621 | 'best model %f %%') % 622 | (epoch, minibatch_index + 1, n_train_batches, 623 | test_score * 100.)) 624 | 625 | if patience <= iter: 626 | done_looping = True 627 | break 628 | 629 | end_time = time.clock() 630 | print(('Optimization complete. Best validation score of %f %% ' 631 | 'obtained at iteration %i, with test performance %f %%') % 632 | (best_validation_loss * 100., best_iter + 1, test_score * 100.)) 633 | print >> sys.stderr, ('The code for file ' + 634 | os.path.split(__file__)[1] + 635 | ' ran for %.2fm' % ((end_time - start_time) / 60.)) 636 | 637 | 638 | if __name__ == '__main__': 639 | test_mlp() 640 | ``` 641 | 预计将会得到这样的输出: 642 | ``` 643 | Optimization complete. 
Best validation score of 1.690000 % obtained at iteration 2070000, with test performance 1.650000 % 644 | The code for file mlp.py ran for 97.34m 645 | ``` 646 | 在一台Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz的机器上,这个代码跑了10.3 epoch/minute然后花了828 epochs得到了1.65%的测试错误率。 647 | 读者也可以在[这个页面](http://yann.lecun.com/exdb/mnist)查看MNIST的识别结果。 648 | 649 | 650 | ## 训练MLPs的技巧 651 | 在上面的代码中国,有一些是不能进行梯度下降来优化的。严格意义上将,发现最优的超参集合是不可能的任务。第一,我们不能独立的优化每一个参数。第二,我们不能很容易的求解所有参数的梯度(有些是离散的值,有些是实数)。第三,这个优化问题是非凸的,容易陷入局部最优。 652 | 好消息是,过去25年,研究者发明了一些在神经网络中选择超参数的方法和规则。你可以在LeCun等人的[Efficient BackPro](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)中阅读,这是一个好的综述。这里,我们将总结下我们的代码中用到的几个重要的方法和技术。 653 | 654 | ### 非线性 655 | 最常见的就是`sigmoid`和`tanh`函数。在[第4.4节](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)中解释的,非线性是关于原点对称的,它倾向去输出0均值的输出(这是被期望的属性)。根据我们的经验,tanh(双曲函数)拥有更好的收敛性。 656 | ### 权值初始化 657 | 在初始化权值的时候,我们一般需要它们在0附近,要足够小(在激活函数的近似线性区域可以获得最大的梯度)。另一个特性,尤其对深度网络而言,是可以减小层与层之间的激活函数的方差和反向传导梯度的方差。这就可以让信息更好的向下和向上的传导,减少层间差异。数学推倒,请看[Xavier10](http://deeplearning.net/tutorial/references.html#xavier10)。 658 | ### 学习率 659 | 有许多文献专注在好的学习速率的选择上。最简单的方案就是选择一个固定速率。经验法则:尝试对数间隔的值(0.1,001,。。),然后缩小(对数)网络搜索的范围(你获得最低验证错误的区域)。 660 | 随着时间的推移减小学习速率有时候也是一个好主意。一个简单的方法是使用这个公式:u/(1+d*t),u是初始速率(可以使用上面讲的网格搜索选择),d是减小常量,用以控制学习速率,可以设为0.001或者更小,t是迭代次数或者时间。 661 | [4.7节](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)讲述了网络中每个参数学习速率选择的方法,然后基于分类错误率自适应的选择它们。 662 | ### 隐藏节点数 663 | 这个超参数是非常基于数据集的。模糊的来说就是,输入分布越复杂,去模拟它的网络就需要更大的容量,那么隐藏单元的数目就要更大。事实上,一个层的权值矩阵就是可以直接度量的(输入维度*输出维度)。 664 | 除非我们去使用正则选项(early-stopping或L1/L2惩罚),隐藏节点数和泛化表现的分布图,将呈现U型(即隐藏节点越多,在后期并不能提高泛化性)。 665 | ### 正则化参数 666 | 典型的方法是使用L1/L2正则化,同时lambda设为0.01,0.001等。尽管在我们之前提及的框架里面,它并没有显著提高性能,但它仍然是一个值得探讨的方法。 667 | 668 | 669 | 670 | 671 | 672 | 673 | -------------------------------------------------------------------------------- /4_Convoltional_Neural_Networks_LeNet_卷积神经网络.md: -------------------------------------------------------------------------------- 1 | 卷积神经网络(Convolutional Neural Networks LeNet) 2 | ================================================== 3 | 在这一节假设读者已经阅读了之前的两章[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。 4 | 5 | 如果你想要在GPU上跑样例,你需要一个好的GPU。至少是1GB显存的。当显示器连接在显卡上时,你可能需要更大的显存。 6 | 7 | 当GPU连接在显示器上时,每次GPU的调用会有几秒钟的限制。这是必须的,因为GPUs不能在计算的时候,同时被用于显示器。如果没有这个限制,屏幕会长时间不动,就像死机一样。这个例子会说中中等质量的GPUs。当GPU不连接显示器时,没有这个时间限制。你可以通过降低batch的大小来出来延时问题。 8 | 9 | 本节的所有代码,可以在[这里](http://deeplearning.net/tutorial/code/convolutional_mlp.py)下载,还有[3狼月亮图](https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/doc/images/3wolfmoon.jpg)。 10 | 11 | ## 动机 12 | 卷积神经网络是多层感知机的生物灵感变种。从Hubel和Wiesel先前对猫的视觉皮层的研究,我们知道视皮层中含有细胞的复杂分布。这些细胞只对小的视觉子区域敏感,称为`感受野`。这些子区域平铺来覆盖整个视场。这些细胞表现为输入图像空间的局部滤波器,非常适合检测自然图像中的强空间局部相关性。 13 | 14 | 此外,两类基础细胞类型被定义:`简单细胞`使用它们的感受野,最大限度的响应特定的棱状图案。`复杂细胞`有更大的感受野,可以局部不变的确定图案精确位置。动物视觉皮层是现存的最强大的视觉处理系统,很显然,我们需要去模仿它的行为。因此,许多类神经模型在文献中出现,包括[NeoCognitron](http://deeplearning.net/tutorial/references.html#fukushima),[HMAX](http://deeplearning.net/tutorial/references.html#serre07)和[LeNet-5](http://deeplearning.net/tutorial/references.html#lecun98),这是本教程需要着重讲解的。 15 | 16 | ## 稀疏连接 17 | 卷积神经网络通过在相邻层的神经元之间实施局部连接模式来检测局部空间相关性。换句话说就是,第m层的隐藏单元的输入来自第m-1层单元的子集,单元拥有空间上的感受野连接。我们可以通过如下的图来表示: 18 | 19 | ![sparse_connectivity](/images/4_sparse_con_1.png) 20 | 21 | 
想象一下,第m-1层是输入视网膜。在上图总,第m层的单元有宽度为3的对输入视网膜的感受野,因此它们只连接视网膜层中3个相邻的神经元。第m层的单元与下一层有相似的连接。我们说,感受野连接于下一层的数目也是3,但是感受野连接于输入的则更大(5)。每个单元对视网膜上于自己感受野相异的地方都是不会有响应的。这个结构保证了被学习的滤波器对一个空间局部输入图案有最强的响应。 22 | 23 | 然而,就像上面展示的,将这些层叠加起来去形成(非线性)滤波器,就可以变得越来越全局化。举例而言,第m+1层的单元可以编码一个宽度为5的非线性特征。 24 | 25 | ## 权值共享 26 | 此外,在CNNs中,每一只滤波器共享同一组权值,这样该滤波器就可以形成一个特征映射(feaature map)。梯度下降算法在小改动后可以学习这种共享参数。这个被共享权值的梯度就是被共享的参数的梯度的简单求和。 27 | 28 | 复制单元使得特征可以无视其在视觉野中的位置而被检测到。此外,权值共享增加了学习效率,减少了需要被学习的自由参数的数目。这样的设定,使得CNNs在视觉问题上有更好的泛化性。 29 | 30 | ## 细节和注解 31 | 一个特征映射是由一个函数在整个图像的某一子区域重复使用来获得的,换句话说,就是通过线性滤波器来卷积输入图像,加上偏置后,再输入到非线性函数。如果我们定义第k个特征映射是为h_k,滤波器有W_k,b_k定义,则特征映射可以被表现为如下形式: 32 | 33 | ![h_k(i,j)](/images/4_detail_notation_1.png) 34 | 35 | 其中对于2维卷积有如下定义: 36 | 37 | ![2-D_conv](/images/4_detail_notation_2.png) 38 | 39 | 为了形成数据更丰富的表达,隐藏层有多层特征映射组成{h_k,k=0..K}。一个隐层的权值矩阵W可以用一个4维张量来表示,包含了每个目标特征映射、源目标特征映射、源水平位置,源垂直位置的元素。偏置b则是一个向量,纪录每个目标特征映射的元素。我们可以用如下的图来表示: 40 | 41 | ![cnn_layer](/images/4_detail_notation_3.png) 42 | 43 | 上图显示了一个CNN的两层,第m-1层包含4个特征映射,第m层包含2个特征映射(h_0和h_1)。h_0和h_1中红蓝色区域的像素(输出值)由第m-1层中2*2的感受野计算而言(相同颜色区域)。注意,感受野包含了所有的4个输入特征映射。W_0,W_1,h_0,h_1因此都是3维权值张量。第一个维度指定输入特征映射,剩下两个表示参考坐标。 44 | 45 | 把它们都放一起就是,W_k_l(i,j),表示第m层的第k个特征映射,在第m-1层的l个特征映射的(i,j)参考坐标的连接权值。 46 | 47 | 48 | ## 卷积操作 49 | 卷积操作是Theano实现卷积层的主要消耗。卷积操作通过`theano.tensor.signal.conv2d`,它包括两个输入符号: 50 | 51 | * 与输入的minibatch有关的4维张量,尺寸包括如下:[mini-batch的大小,输入特征映射的数目,图像高度,图像宽度]。 52 | * 与权值矩阵W相关的4维张量,尺寸包括如下:[m层特征映射的数目,m-1层特征映射的数目,滤波器高度,滤波器宽度]。 53 | 54 | 下面的Theano代码实现了类似图1的卷积层。输入包括大小为120*160的3个特征映射(1一个RGB彩图)。我们使用2个大小为9*9感受野的卷积滤波器。 55 | 56 | ```Python 57 | from theano.tensor.nnet import conv 58 | rng = numpy.random.RandomState(23455) 59 | 60 | # instantiate 4D tensor for input 61 | input = T.tensor4(name='input') 62 | 63 | # initialize shared variable for weights. 64 | w_shp = (2, 3, 9, 9) 65 | w_bound = numpy.sqrt(3 * 9 * 9) 66 | W = theano.shared( numpy.asarray( 67 | rng.uniform( 68 | low=-1.0 / w_bound, 69 | high=1.0 / w_bound, 70 | size=w_shp), 71 | dtype=input.dtype), name ='W') 72 | 73 | # initialize shared variable for bias (1D tensor) with random values 74 | # IMPORTANT: biases are usually initialized to zero. However in this 75 | # particular application, we simply apply the convolutional layer to 76 | # an image without learning the parameters. We therefore initialize 77 | # them to random values to "simulate" learning. 78 | b_shp = (2,) 79 | b = theano.shared(numpy.asarray( 80 | rng.uniform(low=-.5, high=.5, size=b_shp), 81 | dtype=input.dtype), name ='b') 82 | 83 | # build symbolic expression that computes the convolution of input with filters in w 84 | conv_out = conv.conv2d(input, W) 85 | 86 | # build symbolic expression to add bias and apply activation function, i.e. produce neural net layer output 87 | # A few words on ``dimshuffle`` : 88 | # ``dimshuffle`` is a powerful tool in reshaping a tensor; 89 | # what it allows you to do is to shuffle dimension around 90 | # but also to insert new ones along which the tensor will be 91 | # broadcastable; 92 | # dimshuffle('x', 2, 'x', 0, 1) 93 | # This will work on 3d tensors with no broadcastable 94 | # dimensions. The first dimension will be broadcastable, 95 | # then we will have the third dimension of the input tensor as 96 | # the second of the resulting tensor, etc. If the tensor has 97 | # shape (20, 30, 40), the resulting tensor will have dimensions 98 | # (1, 40, 1, 20, 30). 
(AxBxC tensor is mapped to 1xCx1xAxB tensor) 99 | # More examples: 100 | # dimshuffle('x') -> make a 0d (scalar) into a 1d vector 101 | # dimshuffle(0, 1) -> identity 102 | # dimshuffle(1, 0) -> inverts the first and second dimensions 103 | # dimshuffle('x', 0) -> make a row out of a 1d vector (N to 1xN) 104 | # dimshuffle(0, 'x') -> make a column out of a 1d vector (N to Nx1) 105 | # dimshuffle(2, 0, 1) -> AxBxC to CxAxB 106 | # dimshuffle(0, 'x', 1) -> AxB to Ax1xB 107 | # dimshuffle(1, 'x', 0) -> AxB to Bx1xA 108 | output = T.nnet.sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x')) 109 | 110 | # create theano function to compute filtered images 111 | f = theano.function([input], output) 112 | ``` 113 | 让我们让它变得有趣点... 114 | ```Python 115 | import numpy 116 | import pylab 117 | from PIL import Image 118 | 119 | # open random image of dimensions 639x516 120 | img = Image.open(open('/images/3wolfmoon.jpg')) 121 | img = numpy.asarray(img, dtype='float64') / 256. 122 | 123 | # put image in 4D tensor of shape (1, 3, height, width) 124 | img_ = img.swapaxes(0, 2).swapaxes(1, 2).reshape(1, 3, 639, 516) 125 | filtered_img = f(img_) 126 | 127 | # plot original image and first and second components of output 128 | pylab.subplot(1, 3, 1); pylab.axis('off'); pylab.imshow(img) 129 | pylab.gray(); 130 | # recall that the convOp output (filtered image) is actually a "minibatch", 131 | # of size 1 here, so we take index 0 in the first dimension: 132 | pylab.subplot(1, 3, 2); pylab.axis('off'); pylab.imshow(filtered_img[0, 0, :, :]) 133 | pylab.subplot(1, 3, 3); pylab.axis('off'); pylab.imshow(filtered_img[0, 1, :, :]) 134 | pylab.show() 135 | ``` 136 | 将产生这样的输出: 137 | 138 | ![3wolf](/images/4_conv_operator_1.png) 139 | 140 | 注意,一个随机的初始化滤波器表现得很像一个特征检测器。 141 | 142 | 注意我们使用了与MLP相同得权值初始化方案。权值在一个范围为[-1/fan-in, 1/fan-in]的均匀分布中随机取样,fan-in是一个隐单元的输入数。对MLP,它是下一层单元的数目。对CNNs,我不得不需要去考虑到输入特征映射的数目和感受野的大小。 143 | 144 | ## 最大池化 145 | 卷积神经网络另一个重大的概念是最大池化,一个非线性的降采样形式。最大池化就是将输入图像分割为一系列不重叠的矩阵,然后对每个子区域,输出最大值。 146 | 147 | 最大池化在视觉中是有用的,由如下2个原因: 148 | * 通过消除非最大值,减少了更上层的计算量 149 | * 提供了一种平移不变性。想象一下,一个最大池化层级联在一个卷积层。这里有8个方向,一个输入图像可以通过单个像素平移。假如说最大池化是2*2的区域,8个可能的方向中有3个可能会产生相同的输出(3/8)。当池化层为3*3时,概率增加到5/8。 150 | 151 | 最大池化在Theano中通过`theano.tensor.signal.downsample.max_pool_2d`。这个函数被设计为可以接受N维的张量和一个缩减因子,然后对张量的最后2维执行最大池化。 152 | 153 | ```Python 154 | from theano.tensor.signal import downsample 155 | 156 | input = T.dtensor4('input') 157 | maxpool_shape = (2, 2) 158 | pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=True) 159 | f = theano.function([input],pool_out) 160 | 161 | invals = numpy.random.RandomState(1).rand(3, 2, 5, 5) 162 | print 'With ignore_border set to True:' 163 | print 'invals[0, 0, :, :] =\n', invals[0, 0, :, :] 164 | print 'output[0, 0, :, :] =\n', f(invals)[0, 0, :, :] 165 | 166 | pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=False) 167 | f = theano.function([input],pool_out) 168 | print 'With ignore_border set to False:' 169 | print 'invals[1, 0, :, :] =\n ', invals[1, 0, :, :] 170 | print 'output[1, 0, :, :] =\n ', f(invals)[1, 0, :, :] 171 | ``` 172 | 将会产生下面的输出: 173 | ``` 174 | With ignore_border set to True: 175 | invals[0, 0, :, :] = 176 | [[ 4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01 1.46755891e-01] 177 | [ 9.23385948e-02 1.86260211e-01 3.45560727e-01 3.96767474e-01 5.38816734e-01] 178 | [ 4.19194514e-01 6.85219500e-01 2.04452250e-01 8.78117436e-01 2.73875932e-02] 179 | [ 6.70467510e-01 4.17304802e-01 5.58689828e-01 1.40386939e-01 1.98101489e-01] 180 | [ 
8.00744569e-01 9.68261576e-01 3.13424178e-01 6.92322616e-01 8.76389152e-01]] 181 | output[0, 0, :, :] = 182 | [[ 0.72032449 0.39676747] 183 | [ 0.6852195 0.87811744]] 184 | 185 | With ignore_border set to False: 186 | invals[1, 0, :, :] = 187 | [[ 0.01936696 0.67883553 0.21162812 0.26554666 0.49157316] 188 | [ 0.05336255 0.57411761 0.14672857 0.58930554 0.69975836] 189 | [ 0.10233443 0.41405599 0.69440016 0.41417927 0.04995346] 190 | [ 0.53589641 0.66379465 0.51488911 0.94459476 0.58655504] 191 | [ 0.90340192 0.1374747 0.13927635 0.80739129 0.39767684]] 192 | output[1, 0, :, :] = 193 | [[ 0.67883553 0.58930554 0.69975836] 194 | [ 0.66379465 0.94459476 0.58655504] 195 | [ 0.90340192 0.80739129 0.39767684]] 196 | ``` 197 | 注意,与其他Theano代码相比,`max_pool_2d`操作有点特殊。它需要缩减因子`ds`(长度维2的tuple,班汉图像长度和宽度的缩减因子)在图构建的时候被告知。这在未来可能会发生改变。 198 | 199 | 200 | ## 整个模型 201 | 稀疏性、卷积层和最大池化时LeNet系列模型的核心。而准确的模型细节有很大的差异,下图显示了一个LeNet模型。 202 | 203 | ![full_model](/images/4_full_model_1.png) 204 | 205 | 模型比较低的层是由卷积和最大池化层的交替来构建的,较高的层则是全连接的传统MLP(隐藏层+逻辑回归)。第一个全连接层的输入是前一层(the layer below)的特征映射的集合。 206 | 207 | 从上图的实现来看,较低层的操作都是建立在4维张量上的。然后它需要被压缩为2维的特征映射,来适应之后的MLP实现。 208 | 209 | ###将它组合起来 210 | 现在我们已经知道了在Theano中实现LeNet的所有方法。那我们先实现一个`LeNetConvPoolLayer`类,它是{卷积+最大池化}层。 211 | 212 | ```Python 213 | class LeNetConvPoolLayer(object): 214 | """Pool Layer of a convolutional network """ 215 | 216 | def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)): 217 | """ 218 | Allocate a LeNetConvPoolLayer with shared variable internal parameters. 219 | 220 | :type rng: numpy.random.RandomState 221 | :param rng: a random number generator used to initialize weights 222 | 223 | :type input: theano.tensor.dtensor4 224 | :param input: symbolic image tensor, of shape image_shape 225 | 226 | :type filter_shape: tuple or list of length 4 227 | :param filter_shape: (number of filters, num input feature maps, 228 | filter height, filter width) 229 | 230 | :type image_shape: tuple or list of length 4 231 | :param image_shape: (batch size, num input feature maps, 232 | image height, image width) 233 | 234 | :type poolsize: tuple or list of length 2 235 | :param poolsize: the downsampling (pooling) factor (#rows, #cols) 236 | """ 237 | 238 | assert image_shape[1] == filter_shape[1] 239 | self.input = input 240 | 241 | # there are "num input feature maps * filter height * filter width" 242 | # inputs to each hidden unit 243 | fan_in = numpy.prod(filter_shape[1:]) 244 | # each unit in the lower layer receives a gradient from: 245 | # "num output feature maps * filter height * filter width" / 246 | # pooling size 247 | fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) / 248 | numpy.prod(poolsize)) 249 | # initialize weights with random weights 250 | W_bound = numpy.sqrt(6. 
/ (fan_in + fan_out)) 251 | self.W = theano.shared( 252 | numpy.asarray( 253 | rng.uniform(low=-W_bound, high=W_bound, size=filter_shape), 254 | dtype=theano.config.floatX 255 | ), 256 | borrow=True 257 | ) 258 | 259 | # the bias is a 1D tensor -- one bias per output feature map 260 | b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX) 261 | self.b = theano.shared(value=b_values, borrow=True) 262 | 263 | # convolve input feature maps with filters 264 | conv_out = conv.conv2d( 265 | input=input, 266 | filters=self.W, 267 | filter_shape=filter_shape, 268 | image_shape=image_shape 269 | ) 270 | 271 | # downsample each feature map individually, using maxpooling 272 | pooled_out = downsample.max_pool_2d( 273 | input=conv_out, 274 | ds=poolsize, 275 | ignore_border=True 276 | ) 277 | 278 | # add the bias term. Since the bias is a vector (1D array), we first 279 | # reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will 280 | # thus be broadcasted across mini-batches and feature map 281 | # width & height 282 | self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x')) 283 | 284 | # store parameters of this layer 285 | self.params = [self.W, self.b] 286 | ``` 287 | 注意,当初始化权重值的时候,fan-in是由感受野和输入特征映射数决定的。 288 | 289 | 最后,我们通过使用在[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)中定义的`LogisticRegression`类,和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)中的`HiddenLayer`类来实例化我们的网络: 290 | 291 | ```Python 292 | x = T.matrix('x') # the data is presented as rasterized images 293 | y = T.ivector('y') # the labels are presented as 1D vector of 294 | # [int] labels 295 | 296 | ###################### 297 | # BUILD ACTUAL MODEL # 298 | ###################### 299 | print '... building the model' 300 | 301 | # Reshape matrix of rasterized images of shape (batch_size, 28 * 28) 302 | # to a 4D tensor, compatible with our LeNetConvPoolLayer 303 | # (28, 28) is the size of MNIST images. 304 | layer0_input = x.reshape((batch_size, 1, 28, 28)) 305 | 306 | # Construct the first convolutional pooling layer: 307 | # filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24) 308 | # maxpooling reduces this further to (24/2, 24/2) = (12, 12) 309 | # 4D output tensor is thus of shape (batch_size, nkerns[0], 12, 12) 310 | layer0 = LeNetConvPoolLayer( 311 | rng, 312 | input=layer0_input, 313 | image_shape=(batch_size, 1, 28, 28), 314 | filter_shape=(nkerns[0], 1, 5, 5), 315 | poolsize=(2, 2) 316 | ) 317 | 318 | # Construct the second convolutional pooling layer 319 | # filtering reduces the image size to (12-5+1, 12-5+1) = (8, 8) 320 | # maxpooling reduces this further to (8/2, 8/2) = (4, 4) 321 | # 4D output tensor is thus of shape (nkerns[0], nkerns[1], 4, 4) 322 | layer1 = LeNetConvPoolLayer( 323 | rng, 324 | input=layer0.output, 325 | image_shape=(batch_size, nkerns[0], 12, 12), 326 | filter_shape=(nkerns[1], nkerns[0], 5, 5), 327 | poolsize=(2, 2) 328 | ) 329 | 330 | # the HiddenLayer being fully-connected, it operates on 2D matrices of 331 | # shape (batch_size, num_pixels) (i.e matrix of rasterized images). 332 | # This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4), 333 | # or (500, 50 * 4 * 4) = (500, 800) with the default values. 
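    # flatten(2) keeps the leading (batch) dimension and collapses the remaining
    # (nkerns[1], 4, 4) dimensions into a single vector per example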
334 | layer2_input = layer1.output.flatten(2) 335 | 336 | # construct a fully-connected sigmoidal layer 337 | layer2 = HiddenLayer( 338 | rng, 339 | input=layer2_input, 340 | n_in=nkerns[1] * 4 * 4, 341 | n_out=500, 342 | activation=T.tanh 343 | ) 344 | 345 | # classify the values of the fully-connected sigmoidal layer 346 | layer3 = LogisticRegression(input=layer2.output, n_in=500, n_out=10) 347 | 348 | # the cost we minimize during training is the NLL of the model 349 | cost = layer3.negative_log_likelihood(y) 350 | 351 | # create a function to compute the mistakes that are made by the model 352 | test_model = theano.function( 353 | [index], 354 | layer3.errors(y), 355 | givens={ 356 | x: test_set_x[index * batch_size: (index + 1) * batch_size], 357 | y: test_set_y[index * batch_size: (index + 1) * batch_size] 358 | } 359 | ) 360 | 361 | validate_model = theano.function( 362 | [index], 363 | layer3.errors(y), 364 | givens={ 365 | x: valid_set_x[index * batch_size: (index + 1) * batch_size], 366 | y: valid_set_y[index * batch_size: (index + 1) * batch_size] 367 | } 368 | ) 369 | 370 | # create a list of all model parameters to be fit by gradient descent 371 | params = layer3.params + layer2.params + layer1.params + layer0.params 372 | 373 | # create a list of gradients for all model parameters 374 | grads = T.grad(cost, params) 375 | 376 | # train_model is a function that updates the model parameters by 377 | # SGD Since this model has many parameters, it would be tedious to 378 | # manually create an update rule for each model parameter. We thus 379 | # create the updates list by automatically looping over all 380 | # (params[i], grads[i]) pairs. 381 | updates = [ 382 | (param_i, param_i - learning_rate * grad_i) 383 | for param_i, grad_i in zip(params, grads) 384 | ] 385 | 386 | train_model = theano.function( 387 | [index], 388 | cost, 389 | updates=updates, 390 | givens={ 391 | x: train_set_x[index * batch_size: (index + 1) * batch_size], 392 | y: train_set_y[index * batch_size: (index + 1) * batch_size] 393 | } 394 | ) 395 | ``` 396 | 我们把进行实际训练和early-stopping代码取出了。因为它和MLP中是一样的。有兴趣的读者,可以阅读教程开头的源代码。 397 | 398 | ## 运行代码 399 | 在一台Core i7-2600K CPU clocked at 3.40GHz上,我们使用floatX=float32,获得如下的输出: 400 | 401 | ``` 402 | Optimization complete. 403 | Best validation score of 0.910000 % obtained at iteration 17800,with test 404 | performance 0.920000 % 405 | The code for file convolutional_mlp.py ran for 380.28m 406 | ``` 407 | 在GeForce GTX 285上,我们获得了如下: 408 | 409 | ``` 410 | Optimization complete. 411 | Best validation score of 0.910000 % obtained at iteration 15500,with test 412 | performance 0.930000 % 413 | The code for file convolutional_mlp.py ran for 46.76m 414 | ``` 415 | 在GeForce GTX 480上,获得如下: 416 | 417 | ``` 418 | Optimization complete. 
419 | Best validation score of 0.910000 % obtained at iteration 16400,with test 420 | performance 0.930000 % 421 | The code for file convolutional_mlp.py ran for 32.52m 422 | ``` 423 | 可以观察到不同实验下验证误差和测试误差的不同,这是由不同硬件的取整结构不同造成的。可以忽略。 424 | 425 | ## 技巧 426 | ### 超参的选择 427 | 卷积神经网络的训练相比与标准的MLP是相当困难的,因为它添加了更多的超参数。当我们在应用学习率和正则化的规则下,下面的方法也需要在优化CNNs被考虑: 428 | 429 | #### 滤波器的数量 430 | 当选择每层滤波器数量的时候,需要记住计算单卷积层的活性比传统的MLP会更加昂贵。 431 | 432 | 假设第l-1层包含K_(l-1)个特征映射和M*N个像素点(例如,位置数乘以特征映射数),然后第l层有K_(l)个滤波器,尺寸为m*n。那么计算一个特征映射(在(M-m)*(N-n)个像素位置应用每个m*n大小的滤波器)将消耗(M-m)*(N-n)*m*n*K_(l-1)的计算量。然后总共要计算K_l次。如果不是所有的特征只与前一层的所有特征相连,那么事情就变得更加复杂啦。 433 | 434 | 对标准MLP而言,第l层如果有K_l个神经元,那计算量只有K_(k-1)*K_l。因此,CNNs中滤波器的数量通常比MLPs中隐单元的数量小很多,通常是基于特征映射的尺寸(输入图像的尺寸和滤波器的形状)。 435 | 436 | 因为特征映射的尺寸会随着深度的增加而减小,靠近输入层的层将趋向于有更少的滤波器,而更高的层有更多的滤波器。事实上,为了平衡每一层的计算量,特征数和图像位置数的乘积在层的传递过程中都是基本一致的。为了保护输入信息,我们需要保证总的激活数量(特征映射数*像素位置数)在层间传递的时候是至于减少(当然我们在做监督学习的时候当然是希望它减小的)。特征映射的数量直接控制整个容量,同时它依赖于可用样例的数目和任务的复杂度。 437 | 438 | 439 | #### 滤波器的尺寸 440 | 通常在每个文献中滤波器的尺寸都有很大的不同,它常常是基于数据库的。MNIST在第一层的最好结果是5*5层滤波器。当自然图像(每维有几百个像素)趋向于使用更大的滤波器,例如12*12,15*15。 441 | 442 | 因此这个技巧事实上是去寻找正确等级的“粒度”,以便对给定的数据集去形成合适范围内的抽象。 443 | 444 | #### 最大池化的尺寸 445 | 经典的是2*2,或者没有最大池化。非常大的图可以在较低的层使用4*4的池化。但是需要记住的是,池化在通过16个因子减少信号维度的同时,也可能导致信号细节的大量丢失。 446 | 447 | #### 技巧 448 | 假如你想要在新的数据集上采用这个模型,下面的一些小技巧可能能让你获得更好的结果: 449 | * 白化(whitening)数据(例如,使用主成分分析) 450 | * 衰减每次迭代的学习速率。 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | -------------------------------------------------------------------------------- /5_Denoising_Autoencoders_降噪自动编码.md: -------------------------------------------------------------------------------- 1 | 降噪自动编码机(Denoising Autoencoders) 2 | ==================================== 3 | 4 | 这一节假设读者一节阅读了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md),[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。如果你需要在GPU上跑代码,你也需要阅读[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 5 | 6 | 本节所有的代码都可用在[这里](http://deeplearning.net/tutorial/code/dA.py)下载。 7 | 8 | 降噪自动编码机(denoising Autoencoders)是经典自动编码机的扩展。它在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中作为深度网络的一个构建块被介绍。我们通过简短的[自动编码机](http://deeplearning.net/tutorial/dA.html#autoencoders)来开始本教程。 9 | 10 | ## 自动编码机 11 | 在[Bengio09](http://deeplearning.net/tutorial/references.html#bengio09)的第4.6节中,有自动编码机的简介。一个自动编码机,由d维的[0,1]之间的输入向量x,通过第一层映射(使用一个编码器)来获得隐藏的d‘维度的[0,1]的输出表达y。通过如下的决定性映射: 12 | 13 | ![y_mapping](/images/5_autoencoders_1.png) 14 | 15 | 这里s是一个非线性的函数,例如sigmoid。这个潜在的表达y,或者码,被映射回一个重构机z,来重构x。这个映射通过下面的简单转换来实现: 16 | 17 | ![z_mapping](/images/5_autoencoders_2.png) 18 | 19 | (这里撇号不代表矩阵转置)当y被给定时,z被看着是对x的预测。可选的,这个权重矩阵W‘的逆映射可用被约束为正向映射的转置:![tanspose](/images/5_autoencoders_3.png),这被称为捆绑权重。这个模型的所有参数(W,b,b‘,或者不使用捆绑权重W’)通过优化最小平均重构误差来实现训练。 20 | 21 | 重建误差可以通过许多方法来度量,基于输入的分布假设。这个传统的平方误差是L(x,z)=||x-z||^2,可以被使用。假如这个输入是通过比特向量或者比特概率向量来表述,重构`交叉熵`([cross-entropy)可以被表示如下: 22 | 23 | ![cross-entropy](/images/5_autoencoders_4.png) 24 | 25 | 我们希望这样,这个编码y是一个分布式表达,可以捕捉数据中主要因子变化的位置。这就类似与主成分凸出,金额也捕捉数据中主要因子的变化。事实上,假如这里有一个线性隐藏层(这个编码),并且平均平方误差准则被用以训练这个网络,然后这k个隐藏单元学习去凸出这个输入,在该数据的前k个主成分的范围中。假如这个隐藏层是非线性的,这个自动编码机表现的是与PCA不同的,它有着捕捉输入分布的多模态方面的能力。从PCA开始变得更加重要,当我们考虑堆叠混合编码机(stacking multiple encoders,在[Hinton06](http://deeplearning.net/tutorial/references.html#hinton06)中被用以构建一个深度自动编码机)。 26 | 27 | 因为y是视为x的有损压缩(lossy 
compression),它不可能对所有的x有好的压缩。优化可以使得训练样本有好的压缩,同时也希望对别的输入有好的压缩,但是不是对于任意输入。这里有一个对自动编码机的概括定义:它对于与训练样本有相似分布的测试样本有较低的重建误差,但对于随机的输入会有较高的重构误差。 28 | 29 | 我们希望通过Theano中来实现自动编码机,作为一个类的形式,它可以在未来去构建一个层叠自动编码机。这个第一步是去创建自动编码机的共享变量参数(W,b,b‘)。 30 | 31 | ```Python 32 | class dA(object): 33 | """Denoising Auto-Encoder class (dA) 34 | 35 | A denoising autoencoders tries to reconstruct the input from a corrupted 36 | version of it by projecting it first in a latent space and reprojecting 37 | it afterwards back in the input space. Please refer to Vincent et al.,2008 38 | for more details. If x is the input then equation (1) computes a partially 39 | destroyed version of x by means of a stochastic mapping q_D. Equation (2) 40 | computes the projection of the input into the latent space. Equation (3) 41 | computes the reconstruction of the input, while equation (4) computes the 42 | reconstruction error. 43 | 44 | .. math:: 45 | 46 | \tilde{x} ~ q_D(\tilde{x}|x) (1) 47 | 48 | y = s(W \tilde{x} + b) (2) 49 | 50 | x = s(W' y + b') (3) 51 | 52 | L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)] (4) 53 | 54 | """ 55 | 56 | def __init__( 57 | self, 58 | numpy_rng, 59 | theano_rng=None, 60 | input=None, 61 | n_visible=784, 62 | n_hidden=500, 63 | W=None, 64 | bhid=None, 65 | bvis=None 66 | ): 67 | """ 68 | Initialize the dA class by specifying the number of visible units (the 69 | dimension d of the input ), the number of hidden units ( the dimension 70 | d' of the latent or hidden space ) and the corruption level. The 71 | constructor also receives symbolic variables for the input, weights and 72 | bias. Such a symbolic variables are useful when, for example the input 73 | is the result of some computations, or when weights are shared between 74 | the dA and an MLP layer. When dealing with SdAs this always happens, 75 | the dA on layer 2 gets as input the output of the dA on layer 1, 76 | and the weights of the dA are used in the second stage of training 77 | to construct an MLP. 
78 | 79 | :type numpy_rng: numpy.random.RandomState 80 | :param numpy_rng: number random generator used to generate weights 81 | 82 | :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams 83 | :param theano_rng: Theano random generator; if None is given one is 84 | generated based on a seed drawn from `rng` 85 | 86 | :type input: theano.tensor.TensorType 87 | :param input: a symbolic description of the input or None for 88 | standalone dA 89 | 90 | :type n_visible: int 91 | :param n_visible: number of visible units 92 | 93 | :type n_hidden: int 94 | :param n_hidden: number of hidden units 95 | 96 | :type W: theano.tensor.TensorType 97 | :param W: Theano variable pointing to a set of weights that should be 98 | shared belong the dA and another architecture; if dA should 99 | be standalone set this to None 100 | 101 | :type bhid: theano.tensor.TensorType 102 | :param bhid: Theano variable pointing to a set of biases values (for 103 | hidden units) that should be shared belong dA and another 104 | architecture; if dA should be standalone set this to None 105 | 106 | :type bvis: theano.tensor.TensorType 107 | :param bvis: Theano variable pointing to a set of biases values (for 108 | visible units) that should be shared belong dA and another 109 | architecture; if dA should be standalone set this to None 110 | 111 | 112 | """ 113 | self.n_visible = n_visible 114 | self.n_hidden = n_hidden 115 | 116 | # create a Theano random generator that gives symbolic random values 117 | if not theano_rng: 118 | theano_rng = RandomStreams(numpy_rng.randint(2 ** 30)) 119 | 120 | # note : W' was written as `W_prime` and b' as `b_prime` 121 | if not W: 122 | # W is initialized with `initial_W` which is uniformely sampled 123 | # from -4*sqrt(6./(n_visible+n_hidden)) and 124 | # 4*sqrt(6./(n_hidden+n_visible))the output of uniform if 125 | # converted using asarray to dtype 126 | # theano.config.floatX so that the code is runable on GPU 127 | initial_W = numpy.asarray( 128 | numpy_rng.uniform( 129 | low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), 130 | high=4 * numpy.sqrt(6. 
/ (n_hidden + n_visible)), 131 | size=(n_visible, n_hidden) 132 | ), 133 | dtype=theano.config.floatX 134 | ) 135 | W = theano.shared(value=initial_W, name='W', borrow=True) 136 | 137 | if not bvis: 138 | bvis = theano.shared( 139 | value=numpy.zeros( 140 | n_visible, 141 | dtype=theano.config.floatX 142 | ), 143 | borrow=True 144 | ) 145 | 146 | if not bhid: 147 | bhid = theano.shared( 148 | value=numpy.zeros( 149 | n_hidden, 150 | dtype=theano.config.floatX 151 | ), 152 | name='b', 153 | borrow=True 154 | ) 155 | 156 | self.W = W 157 | # b corresponds to the bias of the hidden 158 | self.b = bhid 159 | # b_prime corresponds to the bias of the visible 160 | self.b_prime = bvis 161 | # tied weights, therefore W_prime is W transpose 162 | self.W_prime = self.W.T 163 | self.theano_rng = theano_rng 164 | # if no input is given, generate a variable representing the input 165 | if input is None: 166 | # we use a matrix because we expect a minibatch of several 167 | # examples, each example being a row 168 | self.x = T.dmatrix(name='input') 169 | else: 170 | self.x = input 171 | 172 | self.params = [self.W, self.b, self.b_prime] 173 | ``` 174 | 注意,我们将`input`作为一个参数来传递给自动编码机。我们可以串联自动编码机来实现深度网络:第k层的输出(y)可以变成第k+1层的输入。 175 | 176 | 现在,我们可以预计去重构信号的潜在表达的计算量。 177 | 178 | ```Python 179 | def get_hidden_values(self, input): 180 | """ Computes the values of the hidden layer """ 181 | return T.nnet.sigmoid(T.dot(input, self.W) + self.b) 182 | 183 | def get_reconstructed_input(self, hidden): 184 | """Computes the reconstructed input given the values of the 185 | hidden layer 186 | 187 | """ 188 | return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime) 189 | ``` 190 | 然后我们通过这些函数可以计算一个随机梯度下降步骤的cost和更新。 191 | 192 | ```Python 193 | def get_cost_updates(self, corruption_level, learning_rate): 194 | """ This function computes the cost and the updates for one trainng 195 | step of the dA """ 196 | 197 | tilde_x = self.get_corrupted_input(self.x, corruption_level) 198 | y = self.get_hidden_values(tilde_x) 199 | z = self.get_reconstructed_input(y) 200 | # note : we sum over the size of a datapoint; if we are using 201 | # minibatches, L will be a vector, with one entry per 202 | # example in minibatch 203 | L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1) 204 | # note : L is now a vector, where each element is the 205 | # cross-entropy cost of the reconstruction of the 206 | # corresponding example of the minibatch. 
We need to 207 | # compute the average of all these to get the cost of 208 | # the minibatch 209 | cost = T.mean(L) 210 | 211 | # compute the gradients of the cost of the `dA` with respect 212 | # to its parameters 213 | gparams = T.grad(cost, self.params) 214 | # generate the list of updates 215 | updates = [ 216 | (param, param - learning_rate * gparam) 217 | for param, gparam in zip(self.params, gparams) 218 | ] 219 | 220 | return (cost, updates) 221 | ``` 222 | 我们现在可以定义一个函数来实现重复的更新参数W,b,b‘,直到这个重构消耗大约是最小的。 223 | 224 | ```Python 225 | da = dA( 226 | numpy_rng=rng, 227 | theano_rng=theano_rng, 228 | input=x, 229 | n_visible=28 * 28, 230 | n_hidden=500 231 | ) 232 | 233 | cost, updates = da.get_cost_updates( 234 | corruption_level=0., 235 | learning_rate=learning_rate 236 | ) 237 | 238 | train_da = theano.function( 239 | [index], 240 | cost, 241 | updates=updates, 242 | givens={ 243 | x: train_set_x[index * batch_size: (index + 1) * batch_size] 244 | } 245 | ) 246 | ``` 247 | 假设除了最小化重构误差外没有别的的限制,一个有n个输入和n(或者更大)维编码学习能力的自动编码机定义函数,将倾向于去映射出它的输入副本。这种自动编码机将无法从训练样本的分布中区分任何测试样例。(无效编码机:输出与输入完全相同)。 248 | 249 | 让人惊讶地,在[Bengio07](http://deeplearning.net/tutorial/references.html#bengio07)的实验指出,在实践中,当通过随机梯度下降训练时,有比输入更多的隐藏单元(称为超完备)的非线性的自动编码机可以产生有效的表达。(这里,有效指的是编码作为网络的输入获得了更低的分类误差)。 250 | 251 | 一个简单的解释是,使用early-stopping的随机梯度下降是与参数的L2正则化相似的。为了去实现对连续性输入有更好的重建,包含非线性隐藏单元的单隐藏层的自动编码机需要非常小的权值在第一(编码)层,以使得将非线性隐藏单元进入他们的线性区域(参考sigmoid函数),然后在第二(解码)层有更大的权值。对于二进制输入,非常大的权值也需要彻底的最小化重构误差。因为隐性的或者显性的正则化将使得获得大权值的解决方案变得困难,这个最优化算法发现在训练样本中表现好的编码。这意味着,表达是利用训练集的统计规律来实现的,而不仅仅是复制输入。 252 | 253 | 这里有其他方法,使得一个有比输入有更多隐藏单元的自动编码机,去避免只学习它本身,而是在输入的隐藏表达中捕捉到有用的东西。一个是添加稀疏性(迫使许多隐单元是0或者接近0)。稀疏性已经被很成功的发挥了[Ranzato07](http://deeplearning.net/tutorial/references.html#ranzato07)[Lee08](http://deeplearning.net/tutorial/references.html#lee08)。另一个是,在输入到重建过程中,增加从输入到重建的转换中的随机性。这个技术在受限玻尔兹曼机中被使用(Restricted Boltzmann Machines,在后面的章节中讨论),还有降噪自动编码机,在后面讨论。 254 | 255 | ## 降噪自动编码机 256 | 257 | 降噪自动编码机的思想是很简单饿。为了迫使隐藏层去发现更加鲁棒性的特征,避免它只是去简单的学习定义,我们训练自动编码机去重建被破坏的输入版本的数据。 258 | 259 | 这个降噪自动编码机是自动编码机的随机版本。直观上讲,一个降噪自动编码机做2件事情:尝试对输入进行编码(保护输入信息),然后尝试去消除输入中的随机差错产生的影响。后者可以通过捕捉输入间的统计相关性来实现。降噪自动编码机可以从不到的角度来理解(流行学习角度,随机操作角度,自下而上的信息论角度,自上而下的生成模型角度),所有的这些在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中被解释。在[Bengio09](http://deeplearning.net/tutorial/references.html#bengio09)的第7.2节有自动编码机的综述。 260 | 261 | 在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中,随机差错进程随机的设定部分(也可以是一半)输入为0。因此降噪自动编码机尝试去从未被污染的值中去预测被污染的(丢失)的值,通过随机的选择丢失模式下的仔鸡。注意如何能预测从剩下的变量的任意子集是一个充分条件,去完全捕获一组变量之间的联合分布(这是Gibbs采样工作)。 262 | 263 | 从自动编码机的类转换为降噪自动编码机,我们需要去增加一个随机误差步骤去应用到输入中。这个输入可以通过许多方法来污染,但在这个教程中,我们将支持以输入的随机性来腐化原始数据,使它趋向于0。代码如下: 264 | 265 | ```Python 266 | from theano.tensor.shared_randomstreams import RandomStreams 267 | 268 | def get_corrupted_input(self, input, corruption_level): 269 | """ This function keeps ``1-corruption_level`` entries of the inputs the same 270 | and zero-out randomly selected subset of size ``coruption_level`` 271 | Note : first argument of theano.rng.binomial is the shape(size) of 272 | random numbers that it should produce 273 | second argument is the number of trials 274 | third argument is the probability of success of any trial 275 | 276 | this will produce an array of 0s and 1s where 1 has a probability of 277 | 1 - ``corruption_level`` and 0 with ``corruption_level`` 278 | """ 279 | return self.theano_rng.binomial(size=input.shape, n=1, p=1 - corruption_level) * input 280 | ``` 281 | 
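为了更直观地理解上面的差错（corruption）过程，下面给出一个简短的示意片段（它不属于教程源码，其中的变量名仅为演示而假设）：当差错等级为0.3时，大约有30%的输入元素会被随机置零，其余元素保持不变。

```Python
import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

x = T.matrix('x')
theano_rng = RandomStreams(numpy.random.RandomState(1234).randint(2 ** 30))
corruption_level = 0.3

# 与 get_corrupted_input 相同的掩码方式：每个元素以 1 - corruption_level 的概率保留原值
corrupted = theano_rng.binomial(size=x.shape, n=1, p=1 - corruption_level,
                                dtype=theano.config.floatX) * x
corrupt_fn = theano.function([x], corrupted)

batch = numpy.ones((4, 10), dtype=theano.config.floatX)
print corrupt_fn(batch)                  # 每行大约有 3 个元素被置为 0
print (corrupt_fn(batch) == 0.).mean()   # 置零比例应接近 corruption_level
```

可以看到，这一步只负责“破坏”输入；重建的目标仍然是未被破坏的 self.x，这正是降噪自动编码机与普通自动编码机的区别所在。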
在层叠自动编码机类([层叠自动编码机](http://deeplearning.net/tutorial/SdA.html#stacked-autoencoders))中,`dA`类中的权值不得不和相应的sigmoid层共享。因为这个原因,dA的构建也将Theano变量指向了共享参数。假如这些参数被设置为`None`,新的参数会被构建。 282 | 283 | 最后的降噪自动编码机类就变成了这样: 284 | 285 | ```Python 286 | class dA(object): 287 | """Denoising Auto-Encoder class (dA) 288 | 289 | A denoising autoencoders tries to reconstruct the input from a corrupted 290 | version of it by projecting it first in a latent space and reprojecting 291 | it afterwards back in the input space. Please refer to Vincent et al.,2008 292 | for more details. If x is the input then equation (1) computes a partially 293 | destroyed version of x by means of a stochastic mapping q_D. Equation (2) 294 | computes the projection of the input into the latent space. Equation (3) 295 | computes the reconstruction of the input, while equation (4) computes the 296 | reconstruction error. 297 | 298 | .. math:: 299 | 300 | \tilde{x} ~ q_D(\tilde{x}|x) (1) 301 | 302 | y = s(W \tilde{x} + b) (2) 303 | 304 | x = s(W' y + b') (3) 305 | 306 | L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)] (4) 307 | 308 | """ 309 | 310 | def __init__(self, numpy_rng, theano_rng=None, input=None, n_visible=784, n_hidden=500, 311 | W=None, bhid=None, bvis=None): 312 | """ 313 | Initialize the dA class by specifying the number of visible units (the 314 | dimension d of the input ), the number of hidden units ( the dimension 315 | d' of the latent or hidden space ) and the corruption level. The 316 | constructor also receives symbolic variables for the input, weights and 317 | bias. Such a symbolic variables are useful when, for example the input is 318 | the result of some computations, or when weights are shared between the 319 | dA and an MLP layer. When dealing with SdAs this always happens, 320 | the dA on layer 2 gets as input the output of the dA on layer 1, 321 | and the weights of the dA are used in the second stage of training 322 | to construct an MLP. 
323 | 324 | :type numpy_rng: numpy.random.RandomState 325 | :param numpy_rng: number random generator used to generate weights 326 | 327 | :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams 328 | :param theano_rng: Theano random generator; if None is given one is generated 329 | based on a seed drawn from `rng` 330 | 331 | :type input: theano.tensor.TensorType 332 | :paran input: a symbolic description of the input or None for standalone 333 | dA 334 | 335 | :type n_visible: int 336 | :param n_visible: number of visible units 337 | 338 | :type n_hidden: int 339 | :param n_hidden: number of hidden units 340 | 341 | :type W: theano.tensor.TensorType 342 | :param W: Theano variable pointing to a set of weights that should be 343 | shared belong the dA and another architecture; if dA should 344 | be standalone set this to None 345 | 346 | :type bhid: theano.tensor.TensorType 347 | :param bhid: Theano variable pointing to a set of biases values (for 348 | hidden units) that should be shared belong dA and another 349 | architecture; if dA should be standalone set this to None 350 | 351 | :type bvis: theano.tensor.TensorType 352 | :param bvis: Theano variable pointing to a set of biases values (for 353 | visible units) that should be shared belong dA and another 354 | architecture; if dA should be standalone set this to None 355 | 356 | 357 | """ 358 | self.n_visible = n_visible 359 | self.n_hidden = n_hidden 360 | 361 | # create a Theano random generator that gives symbolic random values 362 | if not theano_rng : 363 | theano_rng = RandomStreams(rng.randint(2 ** 30)) 364 | 365 | # note : W' was written as `W_prime` and b' as `b_prime` 366 | if not W: 367 | # W is initialized with `initial_W` which is uniformely sampled 368 | # from -4.*sqrt(6./(n_visible+n_hidden)) and 4.*sqrt(6./(n_hidden+n_visible)) 369 | # the output of uniform if converted using asarray to dtype 370 | # theano.config.floatX so that the code is runable on GPU 371 | initial_W = numpy.asarray(numpy_rng.uniform( 372 | low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), 373 | high=4 * numpy.sqrt(6. 
/ (n_hidden + n_visible)), 374 | size=(n_visible, n_hidden)), dtype=theano.config.floatX) 375 | W = theano.shared(value=initial_W, name='W') 376 | 377 | if not bvis: 378 | bvis = theano.shared(value = numpy.zeros(n_visible, 379 | dtype=theano.config.floatX), name='bvis') 380 | 381 | if not bhid: 382 | bhid = theano.shared(value=numpy.zeros(n_hidden, 383 | dtype=theano.config.floatX), name='bhid') 384 | 385 | self.W = W 386 | # b corresponds to the bias of the hidden 387 | self.b = bhid 388 | # b_prime corresponds to the bias of the visible 389 | self.b_prime = bvis 390 | # tied weights, therefore W_prime is W transpose 391 | self.W_prime = self.W.T 392 | self.theano_rng = theano_rng 393 | # if no input is given, generate a variable representing the input 394 | if input == None: 395 | # we use a matrix because we expect a minibatch of several examples, 396 | # each example being a row 397 | self.x = T.dmatrix(name='input') 398 | else: 399 | self.x = input 400 | 401 | self.params = [self.W, self.b, self.b_prime] 402 | 403 | def get_corrupted_input(self, input, corruption_level): 404 | """ This function keeps ``1-corruption_level`` entries of the inputs the same 405 | and zero-out randomly selected subset of size ``coruption_level`` 406 | Note : first argument of theano.rng.binomial is the shape(size) of 407 | random numbers that it should produce 408 | second argument is the number of trials 409 | third argument is the probability of success of any trial 410 | 411 | this will produce an array of 0s and 1s where 1 has a probability of 412 | 1 - ``corruption_level`` and 0 with ``corruption_level`` 413 | """ 414 | return self.theano_rng.binomial(size=input.shape, n=1, p=1 - corruption_level) * input 415 | 416 | 417 | def get_hidden_values(self, input): 418 | """ Computes the values of the hidden layer """ 419 | return T.nnet.sigmoid(T.dot(input, self.W) + self.b) 420 | 421 | def get_reconstructed_input(self, hidden ): 422 | """ Computes the reconstructed input given the values of the hidden layer """ 423 | return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime) 424 | 425 | def get_cost_updates(self, corruption_level, learning_rate): 426 | """ This function computes the cost and the updates for one trainng 427 | step of the dA """ 428 | 429 | tilde_x = self.get_corrupted_input(self.x, corruption_level) 430 | y = self.get_hidden_values( tilde_x) 431 | z = self.get_reconstructed_input(y) 432 | # note : we sum over the size of a datapoint; if we are using minibatches, 433 | # L will be a vector, with one entry per example in minibatch 434 | L = -T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1 ) 435 | # note : L is now a vector, where each element is the cross-entropy cost 436 | # of the reconstruction of the corresponding example of the 437 | # minibatch. 
We need to compute the average of all these to get 438 | # the cost of the minibatch 439 | cost = T.mean(L) 440 | 441 | # compute the gradients of the cost of the `dA` with respect 442 | # to its parameters 443 | gparams = T.grad(cost, self.params) 444 | # generate the list of updates 445 | updates = [] 446 | for param, gparam in zip(self.params, gparams): 447 | updates.append((param, param - learning_rate * gparam)) 448 | 449 | return (cost, updates) 450 | ``` 451 | 452 | 453 | ## 将它组合起来 454 | 455 | 现在去构建一个`dA`类和训练它变得很简单了。 456 | 457 | ```Python 458 | # allocate symbolic variables for the data 459 | index = T.lscalar() # index to a [mini]batch 460 | x = T.matrix('x') # the data is presented as rasterized images 461 | 462 | ###################### 463 | # BUILDING THE MODEL # 464 | ###################### 465 | 466 | rng = numpy.random.RandomState(123) 467 | theano_rng = RandomStreams(rng.randint(2 ** 30)) 468 | 469 | da = dA(numpy_rng=rng, theano_rng=theano_rng, input=x, 470 | n_visible=28 * 28, n_hidden=500) 471 | 472 | cost, updates = da.get_cost_updates(corruption_level=0.2, 473 | learning_rate=learning_rate) 474 | 475 | 476 | train_da = theano.function([index], cost, updates=updates, 477 | givens = {x: train_set_x[index * batch_size: (index + 1) * batch_size]}) 478 | 479 | start_time = time.clock() 480 | 481 | ############ 482 | # TRAINING # 483 | ############ 484 | 485 | # go through training epochs 486 | for epoch in xrange(training_epochs): 487 | # go through trainng set 488 | c = [] 489 | for batch_index in xrange(n_train_batches): 490 | c.append(train_da(batch_index)) 491 | 492 | print 'Training epoch %d, cost ' % epoch, numpy.mean(c) 493 | 494 | end_time = time.clock 495 | 496 | training_time = (end_time - start_time) 497 | 498 | print ('Training took %f minutes' % (pretraining_time / 60.)) 499 | ``` 500 | 501 | 为了了解网络学习了什么东西,我们将会描述出滤波器(通过权值矩阵来定义)。记住,事实上它没有提供完整的情况,因为我们忽略了偏置,并且画出的权值被乘以了常数(权值被转换到了0-1之间)。 502 | 503 | 去画出我们的滤波器,我们需要`title_raster_images`(看[Plotting Samples and Filters](http://deeplearning.net/tutorial/utilities.html#how-to-plot)),所以我们强烈建议读者去了解它。当然,也在PIL(python image library)的帮助下,下面行的代码将把滤波器保存为图像: 504 | 505 | ```Python 506 | image = Image.fromarray(tile_raster_images(X=da.W.get_value(borrow=True).T, 507 | img_shape=(28, 28), tile_shape=(10, 10), 508 | tile_spacing=(1, 1))) 509 | image.save('filters_corruption_30.png') 510 | ``` 511 | 512 | ## 运行这个代码 513 | 514 | 当我们不使用任何噪声的时候,获得的滤波器如下: 515 | 516 | ![filter_not_nosie](/images/5_running_code_1.png) 517 | 518 | 有30%噪声的时候,滤波器如下: 519 | 520 | ![filter_with_nosie](/images/5_running_code_2.png) 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | -------------------------------------------------------------------------------- /6_Stacked_Denoising_Autoencoders_层叠降噪自动编码机.md: -------------------------------------------------------------------------------- 1 | 层叠降噪自动编码机(Stacked Denoising Autoencoders (SdA)) 2 | ========================================================= 3 | 4 | 在这一节,我们假设读者已经了解了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。如果你需要在GPU上进行运算,你还需要了解[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 5 | 6 | 本节的所有代码可以在[这里](http://deeplearning.net/tutorial/code/SdA.py)下载。 7 
| 8 | 层叠降噪自动编码机(Stacked Denoising Autoencoder,SdA)是层叠自动编码机([Bengio07](http://deeplearning.net/tutorial/references.html#bengio07))的一个扩展,在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中被介绍。 9 | 10 | 这个教程建立在前一个[降噪自动编码机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/5_Denoising_Autoencoders_降噪自动编码.md)之上。我们建议,对于没有自动编码机经验的人应该阅读上述章节。 11 | 12 | ###层叠自动编码机 13 | 降噪自动编码机可以被叠加起来形成一个深度网络,通过反馈前一层的降噪自动编码机的潜在表达(输出编码)作为当前层的输入。这个非监督的预学习结构一次只能学习一个层。每一层都被作为一个降噪自动编码机以最小化重构误差来进行训练。当前k个层被训练完了,我们可以进行k+1层的训练,因此此时我们才可以计算前一层的编码和潜在表达。当所有的层都被训练了,整个网络进行第二阶段训练,称为微调(fine-tuning)。这里,我们考虑监督微调,当我们需要最小化一个监督任务的预测误差吧。为此我们现在网络的顶端添加一个逻辑回归层(使输出层的编码更加精确)。然后我们像训练多层感知器一样训练整个网络。这里,我们考虑每个自动编码的机的编码模块。这个阶段是有监督的,因为我们在训练的时候使用了目标类别(更多细节请看[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)) 14 | 15 | ![SdA](/images/6_sda_1.png) 16 | 17 | 这在Theano里面,使用之前定义的降噪自动编码机,可以轻易的被实现。我们可以将层叠降噪自动编码机看作两部分,一个是自动编码机链表,另一个是一个多层感知机。在预训练阶段,我们使用了第一部分,例如我们将模型看作一系列的自动编码机,然后分别训练每一个自动编码机。在第二阶段,我们使用第二部分。这个两个部分通过分享参数来实现连接。 18 | 19 | ```Python 20 | class SdA(object): 21 | """Stacked denoising auto-encoder class (SdA) 22 | 23 | A stacked denoising autoencoder model is obtained by stacking several 24 | dAs. The hidden layer of the dA at layer `i` becomes the input of 25 | the dA at layer `i+1`. The first layer dA gets as input the input of 26 | the SdA, and the hidden layer of the last dA represents the output. 27 | Note that after pretraining, the SdA is dealt with as a normal MLP, 28 | the dAs are only used to initialize the weights. 29 | """ 30 | 31 | def __init__( 32 | self, 33 | numpy_rng, 34 | theano_rng=None, 35 | n_ins=784, 36 | hidden_layers_sizes=[500, 500], 37 | n_outs=10, 38 | corruption_levels=[0.1, 0.1] 39 | ): 40 | """ This class is made to support a variable number of layers. 
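        The number of layers is given by the length of ``hidden_layers_sizes``,
        and each entry also fixes the size of the corresponding dA.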
41 | 42 | :type numpy_rng: numpy.random.RandomState 43 | :param numpy_rng: numpy random number generator used to draw initial 44 | weights 45 | 46 | :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams 47 | :param theano_rng: Theano random generator; if None is given one is 48 | generated based on a seed drawn from `rng` 49 | 50 | :type n_ins: int 51 | :param n_ins: dimension of the input to the sdA 52 | 53 | :type n_layers_sizes: list of ints 54 | :param n_layers_sizes: intermediate layers size, must contain 55 | at least one value 56 | 57 | :type n_outs: int 58 | :param n_outs: dimension of the output of the network 59 | 60 | :type corruption_levels: list of float 61 | :param corruption_levels: amount of corruption to use for each 62 | layer 63 | """ 64 | 65 | self.sigmoid_layers = [] 66 | self.dA_layers = [] 67 | self.params = [] 68 | self.n_layers = len(hidden_layers_sizes) 69 | 70 | assert self.n_layers > 0 71 | 72 | if not theano_rng: 73 | theano_rng = RandomStreams(numpy_rng.randint(2 ** 30)) 74 | # allocate symbolic variables for the data 75 | self.x = T.matrix('x') # the data is presented as rasterized images 76 | self.y = T.ivector('y') # the labels are presented as 1D vector of 77 | # [int] labels 78 | ``` 79 | `self.sigmoid_layers`将会储存多层感知机的sigmoid层,`self.dA_layers`将会储存连接多层感知机层的降噪自动编码机。 80 | 81 | 下一步,我们构建`n_layers`个sigmoid层(我们使用在[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)中介绍的`HiddenLayer`类,唯一的更改是将原本的非线性函数`tanh`换成了logistic函数s=1/(1+exp(-x)))和`n_layers`个降噪自动编码机,当然`n_layers`就是我们模型的深度。我们连接sigmoid函数,使得他们形成一个MLP,构建每一个自动编码机和他们对应的sigmoid层,去共享编码部分的权值矩阵和偏执 82 | 83 | ```Python 84 | for i in xrange(self.n_layers): 85 | # construct the sigmoidal layer 86 | 87 | # the size of the input is either the number of hidden units of 88 | # the layer below or the input size if we are on the first layer 89 | if i == 0: 90 | input_size = n_ins 91 | else: 92 | input_size = hidden_layers_sizes[i - 1] 93 | 94 | # the input to this layer is either the activation of the hidden 95 | # layer below or the input of the SdA if you are on the first 96 | # layer 97 | if i == 0: 98 | layer_input = self.x 99 | else: 100 | layer_input = self.sigmoid_layers[-1].output 101 | 102 | sigmoid_layer = HiddenLayer(rng=numpy_rng, 103 | input=layer_input, 104 | n_in=input_size, 105 | n_out=hidden_layers_sizes[i], 106 | activation=T.nnet.sigmoid) 107 | # add the layer to our list of layers 108 | self.sigmoid_layers.append(sigmoid_layer) 109 | # its arguably a philosophical question... 
110 | # but we are going to only declare that the parameters of the 111 | # sigmoid_layers are parameters of the StackedDAA 112 | # the visible biases in the dA are parameters of those 113 | # dA, but not the SdA 114 | self.params.extend(sigmoid_layer.params) 115 | 116 | # Construct a denoising autoencoder that shared weights with this 117 | # layer 118 | dA_layer = dA(numpy_rng=numpy_rng, 119 | theano_rng=theano_rng, 120 | input=layer_input, 121 | n_visible=input_size, 122 | n_hidden=hidden_layers_sizes[i], 123 | W=sigmoid_layer.W, 124 | bhid=sigmoid_layer.b) 125 | self.dA_layers.append(dA_layer) 126 | ``` 127 | 128 | 现在,我们需要在sigmoid层的上方添加逻辑层,所以我们将有一个MLP。我们将使用在[使用逻辑回归进MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)的`LogisticRegression`类。 129 | 130 | ```Python 131 | # We now need to add a logistic layer on top of the MLP 132 | self.logLayer = LogisticRegression( 133 | input=self.sigmoid_layers[-1].output, 134 | n_in=hidden_layers_sizes[-1], 135 | n_out=n_outs 136 | ) 137 | 138 | self.params.extend(self.logLayer.params) 139 | # construct a function that implements one step of finetunining 140 | 141 | # compute the cost for second phase of training, 142 | # defined as the negative log likelihood 143 | self.finetune_cost = self.logLayer.negative_log_likelihood(self.y) 144 | # compute the gradients with respect to the model parameters 145 | # symbolic variable that points to the number of errors made on the 146 | # minibatch given by self.x and self.y 147 | self.errors = self.logLayer.errors(self.y) 148 | ``` 149 | 这个类也提供一个方法去产生与不同层相关的降噪自动编码机的训练函数。它们以list的形式返回,第i个元素就是一个实现训练第i层的`dA`的函数。 150 | 151 | ```Python 152 | def pretraining_functions(self, train_set_x, batch_size): 153 | ''' Generates a list of functions, each of them implementing one 154 | step in trainnig the dA corresponding to the layer with same index. 155 | The function will require as input the minibatch index, and to train 156 | a dA you just need to iterate, calling the corresponding function on 157 | all minibatch indexes. 
158 | 159 | :type train_set_x: theano.tensor.TensorType 160 | :param train_set_x: Shared variable that contains all datapoints used 161 | for training the dA 162 | 163 | :type batch_size: int 164 | :param batch_size: size of a [mini]batch 165 | 166 | :type learning_rate: float 167 | :param learning_rate: learning rate used during training for any of 168 | the dA layers 169 | ''' 170 | 171 | # index to a [mini]batch 172 | index = T.lscalar('index') # index to a minibatch 173 | ``` 174 | 为了有能力在训练时,改变差错等级或者训练速率。我们用一个Theano变量来联系它们。 175 | 176 | ```Python 177 | corruption_level = T.scalar('corruption') # % of corruption to use 178 | learning_rate = T.scalar('lr') # learning rate to use 179 | # begining of a batch, given `index` 180 | batch_begin = index * batch_size 181 | # ending of a batch given `index` 182 | batch_end = batch_begin + batch_size 183 | 184 | pretrain_fns = [] 185 | for dA in self.dA_layers: 186 | # get the cost and the updates list 187 | cost, updates = dA.get_cost_updates(corruption_level, 188 | learning_rate) 189 | # compile the theano function 190 | fn = theano.function( 191 | inputs=[ 192 | index, 193 | theano.Param(corruption_level, default=0.2), 194 | theano.Param(learning_rate, default=0.1) 195 | ], 196 | outputs=cost, 197 | updates=updates, 198 | givens={ 199 | self.x: train_set_x[batch_begin: batch_end] 200 | } 201 | ) 202 | # append `fn` to the list of functions 203 | pretrain_fns.append(fn) 204 | 205 | return pretrain_fns 206 | ``` 207 | 现在任何一个`pretrain_fns[i]`函数,可以将`index`,`corruption`——差错等级,`lr`——学习速率作为参数。注意,这些参数的名字是Theano变量的名字,而不是Python变量的名字(`learning_rate`或者`corruption_level`)。在使用Theano时,注意这一点。 208 | 209 | 以相同的方式(fashion),我们创建了一个方法用于在微调(fine-tuning)时需要的构建函数(`train_model`,`validate_model`,`test_model`函数)。 210 | 211 | ```Python 212 | def build_finetune_functions(self, datasets, batch_size, learning_rate): 213 | '''Generates a function `train` that implements one step of 214 | finetuning, a function `validate` that computes the error on 215 | a batch from the validation set, and a function `test` that 216 | computes the error on a batch from the testing set 217 | 218 | :type datasets: list of pairs of theano.tensor.TensorType 219 | :param datasets: It is a list that contain all the datasets; 220 | the has to contain three pairs, `train`, 221 | `valid`, `test` in this order, where each pair 222 | is formed of two Theano variables, one for the 223 | datapoints, the other for the labels 224 | 225 | :type batch_size: int 226 | :param batch_size: size of a minibatch 227 | 228 | :type learning_rate: float 229 | :param learning_rate: learning rate used during finetune stage 230 | ''' 231 | 232 | (train_set_x, train_set_y) = datasets[0] 233 | (valid_set_x, valid_set_y) = datasets[1] 234 | (test_set_x, test_set_y) = datasets[2] 235 | 236 | # compute number of minibatches for training, validation and testing 237 | n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] 238 | n_valid_batches /= batch_size 239 | n_test_batches = test_set_x.get_value(borrow=True).shape[0] 240 | n_test_batches /= batch_size 241 | 242 | index = T.lscalar('index') # index to a [mini]batch 243 | 244 | # compute the gradients with respect to the model parameters 245 | gparams = T.grad(self.finetune_cost, self.params) 246 | 247 | # compute list of fine-tuning updates 248 | updates = [ 249 | (param, param - gparam * learning_rate) 250 | for param, gparam in zip(self.params, gparams) 251 | ] 252 | 253 | train_fn = theano.function( 254 | inputs=[index], 255 | outputs=self.finetune_cost, 256 | 
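            # one SGD step on all parameters (every hidden layer plus the logistic
            # regression layer); `givens` below picks the minibatch selected by `index`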
updates=updates, 257 | givens={ 258 | self.x: train_set_x[ 259 | index * batch_size: (index + 1) * batch_size 260 | ], 261 | self.y: train_set_y[ 262 | index * batch_size: (index + 1) * batch_size 263 | ] 264 | }, 265 | name='train' 266 | ) 267 | 268 | test_score_i = theano.function( 269 | [index], 270 | self.errors, 271 | givens={ 272 | self.x: test_set_x[ 273 | index * batch_size: (index + 1) * batch_size 274 | ], 275 | self.y: test_set_y[ 276 | index * batch_size: (index + 1) * batch_size 277 | ] 278 | }, 279 | name='test' 280 | ) 281 | 282 | valid_score_i = theano.function( 283 | [index], 284 | self.errors, 285 | givens={ 286 | self.x: valid_set_x[ 287 | index * batch_size: (index + 1) * batch_size 288 | ], 289 | self.y: valid_set_y[ 290 | index * batch_size: (index + 1) * batch_size 291 | ] 292 | }, 293 | name='valid' 294 | ) 295 | 296 | # Create a function that scans the entire validation set 297 | def valid_score(): 298 | return [valid_score_i(i) for i in xrange(n_valid_batches)] 299 | 300 | # Create a function that scans the entire test set 301 | def test_score(): 302 | return [test_score_i(i) for i in xrange(n_test_batches)] 303 | 304 | return train_fn, valid_score, test_score 305 | ``` 306 | 注意,这里返回的`valid_score`和`test_score`并不是Theano函数,而是Python函数,在整个验证集和测试集循环,以产生这些集合的损失数的list。 307 | 308 | ###将它组合起来 309 | 310 | 下面的几行代码去构建层叠自动编码机: 311 | ```Python 312 | numpy_rng = numpy.random.RandomState(89677) 313 | print '... building the model' 314 | # construct the stacked denoising autoencoder class 315 | sda = SdA( 316 | numpy_rng=numpy_rng, 317 | n_ins=28 * 28, 318 | hidden_layers_sizes=[1000, 1000, 1000], 319 | n_outs=10 320 | ) 321 | ``` 322 | 在训练这个网络时,有两个阶段,一层是预训练,之后是微调。 323 | 324 | 对于预训练阶段,我们将循环网络中的所有层。对于每一层,我们将使用编译的theano函数来实现SGD(随机梯度下降),以实现权值优化,来见效每层的重构损失。这个函数将在训练集中被应用,并且是以`pretraining_epochs`中给出的固定次数的迭代。 325 | 326 | ```Python 327 | ######################### 328 | # PRETRAINING THE MODEL # 329 | ######################### 330 | print '... getting the pretraining functions' 331 | pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x, 332 | batch_size=batch_size) 333 | 334 | print '... 
pre-training the model' 335 | start_time = time.clock() 336 | ## Pre-train layer-wise 337 | corruption_levels = [.1, .2, .3] 338 | for i in xrange(sda.n_layers): 339 | # go through pretraining epochs 340 | for epoch in xrange(pretraining_epochs): 341 | # go through the training set 342 | c = [] 343 | for batch_index in xrange(n_train_batches): 344 | c.append(pretraining_fns[i](index=batch_index, 345 | corruption=corruption_levels[i], 346 | lr=pretrain_lr)) 347 | print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch), 348 | print numpy.mean(c) 349 | 350 | end_time = time.clock() 351 | 352 | print >> sys.stderr, ('The pretraining code for file ' + 353 | os.path.split(__file__)[1] + 354 | ' ran for %.2fm' % ((end_time - start_time) / 60.)) 355 | ``` 356 | 这个微调(fine-tuning)循环和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)中的非常类似,唯一的不同是我们将使用在`build_funetune_functions`中给定的新函数。 357 | 358 | ###运行这个代码 359 | 默认情况下,这个代码以块数目为1,每一层循环15次来进行预训练预训练。错差等级(corruption level)在第一层被设为0.1,第二层被设为0.2,第三层被设为0.3。预训练的学习速率为0.001,微调学习速率为0.1。预训练花了585.01分钟,平均每层13分钟。微调在36次迭代,444.2分钟后完成。平均每层迭代12.34分钟。最后的验证得分为1.39%,测试得分为1.3%。所有的结果都是在Intel Xeon E5430 @ 2.66GHz CPU,GotoBLAS下得出。 360 | 361 | ###技巧 362 | 这里有一个方法去提高代码的运行速度(假定你有足够的可用内存),是去计算这个网络(直到第k-1层时)如何转换你的数据。换句话说,你通过训练你的第一个dA层来开始。一旦它被训练,你就可以为每一个数据节点计算隐单元的值然后将它们储存为一个新的数据集,以便你在第2层中训练dA。一旦你训练完第2层的dA,你以相同的方式计算第三层的数据。现在你可以明白,在这个时候,这个dAs被分开训练了,它们仅仅提供(一对一的)对输入的非线性转换。一旦所有的dAs被训练,你就可以开始微调整个模型了。 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | -------------------------------------------------------------------------------- /7_Restricted_Boltzmann_Machine_受限波尔兹曼机.md: -------------------------------------------------------------------------------- 1 | 受限波尔兹曼机(Restricted Boltzmann Machines) 2 | ============================================== 3 | 4 | 在这一章节,我们假设读者已经阅读了[使用逻辑回归进行MNIST分类](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md)和[多层感知机](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md)。当然,假如你要使用GPU来运行代码,你还需要阅读[GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html)。 5 | 6 | 本节的所有代码都可以在[这里](http://deeplearning.net/tutorial/code/rbm.py)下载。 7 | 8 | ## 基于能量模型(Energy-Based Models) 9 | 基于能量的模型(EBM)把我们所关心变量的各种组合和一个标量能量联系在一起。训练模型的过程就是不断改变标量能量的过程,使其能量函数的形状满足期望的形状。比如,如果一个变量组合被认为是合理的,它同时也具有较小的能量。基于能量的概率模型通过能量函数来定义概率分布: 10 | 11 | ![energy_fun](/images/7_ebm_1.png) 12 | 13 | 其中归一化因子Z被称为分割函数: 14 | 15 | ![Z_fun](/images/7_ebm_2.png) 16 | 17 | 基于能量的模型可以利用使用梯度下降或随机梯度下降的方法来学习,具体而言,就是以先验(训练集)的负对数似然函数作为损失函数,就像在逻辑回归中我们定义的那样, 18 | 19 | ![loss_fun](/images/7_ebm_3.png) 20 | 21 | 其中随机梯度为![gradient](/images/7_ebm_4.png),其中theta为模型的参数。 22 | 23 | ### 包含隐藏单元的EBMs 24 | 25 | 在很多情况下,我们无法观察到x样本的全部分布,或者我们需要引进一些没有观察到的变量,以增加模型的表达能力。因而我们考虑将模型分为2部分,一个可见部分(x的观察分布)和一个隐藏部分h,这样得到的就是包含隐含变量的EBM: 26 | 27 | ![ebm_with_hidden_unit](/images/7_ebm_hidden_units_1.png) 28 | 29 | 同时我们受物理启发定义了自由能量(free energy): 30 | 31 | ![free_energy](/images/7_ebm_hidden_units_2.png) 32 | 33 | 然后我们可以写成如下公式: 34 | 35 | ![ebm_with_hidden_units_2](/images/7_ebm_hidden_units_3.png) 36 | 37 | 数据的服对数似然函数梯度就有如下有趣的形式: 38 | 39 | ![gradient_rbm_h](/images/7_ebm_hidden_units_4.png) 40 | 41 | 推倒公式如下: 42 | 43 | ![gradient_rbm_h_2](/images/7_ebm_hidden_units_5.png) 44 | 45 | 需要注意的是上述的梯度包含2个项,包括正相位和负相位。正和负的术语不指公式中的每个项的符号,而是反映其对模型所定义的概率密度的影响。第一项增加训练数据的概率(通过减少相关的自由能量),而第二项减小模型产生的样本的概率。 46 | 47 | 
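这种“正相位减去负相位”的结构,在后文的RBM实现中会直接对应为代码:用训练样本上自由能量的均值减去模型样本上自由能量的均值,再交给`T.grad`求梯度。下面是一个极简的示意(只为说明梯度的结构,并非教程源码;其中`free_energy`、`x_data`、`x_model`、`params`均为假定已在别处定义的符号,后文的`RBM`类会给出具体实现):

```Python
import theano.tensor as T

# x_data : a minibatch of training examples (positive phase)
# x_model: a batch of samples drawn from the model (negative phase),
#          obtained e.g. by Gibbs sampling as explained later
# free_energy(v): symbolic free energy; params: list of model parameters
cost = T.mean(free_energy(x_data)) - T.mean(free_energy(x_model))
# the sampling that produced x_model must not be differentiated through,
# so it is passed as a constant; the gradient of `cost` then approximates
# the gradient of the negative log-likelihood
gparams = T.grad(cost, params, consider_constant=[x_model])
```
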
通常我们很难精确计算这个梯度,因为式中第一项涉及到可见单元与隐含单元的联合分布,由于归一化因子Z(θ)的存在,该分布很难获取。 我们只能通过一些采样方法(如Gibbs采样)获取其近似值,其具体方法将在后文中详述。 48 | 49 | 50 | ## 受限波尔兹曼机(RBM) 51 | 52 | 波尔兹曼机是对数线性马尔可夫随机场(MRF)的一种特殊形式,例如这个能量函数在它的自由参数下是线性的。为了使得它们能更强力的表达复杂分布(从受限的参数设定到一个非参数设定),我们认为一些变量是不可见的(被称为隐藏)。通过拥有更多隐藏变量(也称之为隐藏单元),我们可以增加波尔兹曼机的模型容量。受限波尔兹曼机限制波尔兹曼机可视层和隐藏层的层内连接。RBM模型可以由下图描述: 53 | 54 | ![rbm_graph](/images/7_rbm_1.png) 55 | 56 | RBM的能量函数可以被定义如下: 57 | 58 | ![rbm_energy_fun](/images/7_rbm_2.png) 59 | 60 | 其中’表示转置,b,c,W为模型的参数,b,c分别为可见层和隐含层的偏置,W为可见层与隐含层的链接权重。 61 | 62 | 自由能量为如下形式: 63 | 64 | ![free_energy_rbm](/images/7_rbm_3.png) 65 | 66 | 由于RBM的特殊结构,可视层和隐藏层层间单元是相互独立的。利用这个特性,我们定义如下: 67 | 68 | ![prob_rbm](/images/7_rbm_4.png) 69 | 70 | 71 | ### 二进制单元的RBMs 72 | 在使用二进制单元(v和h都属于{0,1})的普通研究情况时,概率版的普通神经激活函数表示如下: 73 | 74 | ![activation_fun](/images/7_rbm_binary_units_1.png) 75 | 76 | ![activation_fun2](/images/7_rbm_binary_units_2.png) 77 | 78 | 二进制单元RBMs的自由能力为: 79 | 80 | ![free_energy_binary](images/7_rbm_binary_units_1.png) 81 | 82 | 83 | ### 二进制单元的更新公式 84 | 85 | 我们可以获得如下的一个二进制单元RBM的对数似然梯度: 86 | 87 | ![equations](/images/7_update_e_b_u_1.png) 88 | 89 | 这个公式的更多细节推倒,读者可以阅读[这一页](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DBNEquations),或者[Learning Deep Architectures for AI](http://www.iro.umontreal.ca/%7Elisa/publications2/index.php/publications/show/239)的第五节。在这里,我们将不使用这些等式,而是通过Theano的`T.grad`来获取梯度。 90 | 91 | ## 在RBM中进行采样 92 | 93 | p(x)的样本可以通过运行马尔可夫链的汇聚、Gibbs采样的过渡来得到。 94 | 95 | 由N个随机变量(S=(S1,S2,...Sn))的联合分布的Gibbs采样,可以通过N个采样子步骤来实现,形式如Si~p(Si|S-i),其中S-i表示集合S中除Si的N-1个随机变量。 96 | 97 | 我们可以从X的一个任意状态(比如[x1(0),x2(0),…,xK(0)])开始,利用上述条件 分布,迭代的对其分量依次采样,随着采样次数的增加,随机变量[x1(n),x2(n),…,xK(n)]的概率分布将以n的几何级数的速度收敛于X的联合 概率分布P(X)。也就是说,我们可以在未知联合概率分布的条件下对其进行采样。 98 | 99 | 对于RBMs来说,S包含了可视和隐藏单元的集合。然而,由于它们的条件独立性,可以执行块Gibbs抽样。在这个设定中,可视单元被采样,同时给出隐藏单元的固定值,同样的,隐藏单元也是如此: 100 | 101 | ![h_v](/images/7_sampling_1.png) 102 | 103 | 这里,h(n)表示马尔可夫链中第n布的隐藏单元的集合。这意味着,h(n+1)根据概率`simg(W‘v(n)+ci)`来随机地被选为0/1。类似地v(n+1)也是如此。这个过程可以通过下面地图来展现: 104 | 105 | ![gibbs_sampling](/images/7_sampling_2.png) 106 | 107 | 当t趋向于无穷时,(v(t),h(t))将越加逼近正确样本的概率分布p(v,h)。 108 | 109 | 在这个理论里面,每个参数在学习进程中的更新都需要运行这样几个链来趋近。毫无疑问这将耗费很大的计算量。一些新的算法已经被提出来,以有效的学习p(v,h)中的样本情况。 110 | 111 | 112 | ## 对比散度算法(CD-k) 113 | 114 | 对比散度算法,是一种成功的用于求解对数似然函数关于未知参数梯度的近似的方法。它使用两个技巧来技术采样过程: 115 | * 因为我们希望p(v)=p_train(v)(数据的真实、底层分布),所以我们使用一个训练样本来初始化马尔可夫链(例如,从一个被预计接近于p的分布,所以这个链已经开始去收敛这个最终的分布p)。 116 | * 对比梯度不需要等待链的收敛。样本在k步Gibbs采样后就可以获得。在实际中,k=1时就可以获得惊人的好的效果。 117 | 118 | 119 | ### 持续的对比散度 120 | 121 | 持续的对比散度[Tieleman08](http://deeplearning.net/tutorial/references.html#tieleman08)使用了另外一种近似方法来从p(v,h)中采样。它建立在一个拥有持续状态的单马尔可夫链上(例如,不是对每个可视样例都重启链)。对每一次参数更新,我们通过简单的运行这个链k步来获得新的样本。然后保存链的状态以便后续的更新。 122 | 123 | 一般直觉的是,如果参数的更新是足够小相比链的混合率,那么马尔科夫链应该能够“赶上”模型的变化。 124 | 125 | ## 实现 126 | 127 | ![RBM_impl](/images/7_implementation_1.png) 128 | 129 | 我们构建一个RBM类。这个网络的参数既可以通过构造函数初始化,也可以作为参数进行传入。当RBM被用于构建一个深度网络时,这个选项——权重矩阵和隐藏层偏置与一个MLP网络的相应的S形层共享,就是非常有用的。 130 | 131 | ```Python 132 | class RBM(object): 133 | """Restricted Boltzmann Machine (RBM) """ 134 | def __init__( 135 | self, 136 | input=None, 137 | n_visible=784, 138 | n_hidden=500, 139 | W=None, 140 | hbias=None, 141 | vbias=None, 142 | numpy_rng=None, 143 | theano_rng=None 144 | ): 145 | """ 146 | RBM constructor. Defines the parameters of the model along with 147 | basic operations for inferring hidden from visible (and vice-versa), 148 | as well as for performing CD updates. 149 | 150 | :param input: None for standalone RBMs or symbolic variable if RBM is 151 | part of a larger graph. 
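        (for instance the output of the sigmoid layer below it when the RBM
        is used as a building block of a deeper network, in which case W and
        hbias are typically shared with that layer)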
152 | 153 | :param n_visible: number of visible units 154 | 155 | :param n_hidden: number of hidden units 156 | 157 | :param W: None for standalone RBMs or symbolic variable pointing to a 158 | shared weight matrix in case RBM is part of a DBN network; in a DBN, 159 | the weights are shared between RBMs and layers of a MLP 160 | 161 | :param hbias: None for standalone RBMs or symbolic variable pointing 162 | to a shared hidden units bias vector in case RBM is part of a 163 | different network 164 | 165 | :param vbias: None for standalone RBMs or a symbolic variable 166 | pointing to a shared visible units bias 167 | """ 168 | 169 | self.n_visible = n_visible 170 | self.n_hidden = n_hidden 171 | 172 | if numpy_rng is None: 173 | # create a number generator 174 | numpy_rng = numpy.random.RandomState(1234) 175 | 176 | if theano_rng is None: 177 | theano_rng = RandomStreams(numpy_rng.randint(2 ** 30)) 178 | 179 | if W is None: 180 | # W is initialized with `initial_W` which is uniformely 181 | # sampled from -4*sqrt(6./(n_visible+n_hidden)) and 182 | # 4*sqrt(6./(n_hidden+n_visible)) the output of uniform if 183 | # converted using asarray to dtype theano.config.floatX so 184 | # that the code is runable on GPU 185 | initial_W = numpy.asarray( 186 | numpy_rng.uniform( 187 | low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), 188 | high=4 * numpy.sqrt(6. / (n_hidden + n_visible)), 189 | size=(n_visible, n_hidden) 190 | ), 191 | dtype=theano.config.floatX 192 | ) 193 | # theano shared variables for weights and biases 194 | W = theano.shared(value=initial_W, name='W', borrow=True) 195 | 196 | if hbias is None: 197 | # create shared variable for hidden units bias 198 | hbias = theano.shared( 199 | value=numpy.zeros( 200 | n_hidden, 201 | dtype=theano.config.floatX 202 | ), 203 | name='hbias', 204 | borrow=True 205 | ) 206 | 207 | if vbias is None: 208 | # create shared variable for visible units bias 209 | vbias = theano.shared( 210 | value=numpy.zeros( 211 | n_visible, 212 | dtype=theano.config.floatX 213 | ), 214 | name='vbias', 215 | borrow=True 216 | ) 217 | 218 | # initialize input layer for standalone RBM or layer0 of DBN 219 | self.input = input 220 | if not input: 221 | self.input = T.matrix('input') 222 | 223 | self.W = W 224 | self.hbias = hbias 225 | self.vbias = vbias 226 | self.theano_rng = theano_rng 227 | # **** WARNING: It is not a good idea to put things in this list 228 | # other than shared variables created in this function. 229 | self.params = [self.W, self.hbias, self.vbias] 230 | ``` 231 | 下一步就是去定义函数来构建S图。代码如下: 232 | 233 | ```Python 234 | def propup(self, vis): 235 | '''This function propagates the visible units activation upwards to 236 | the hidden units 237 | 238 | Note that we return also the pre-sigmoid activation of the 239 | layer. 
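        (i.e. the value of T.dot(vis, self.W) + self.hbias before the
        sigmoid nonlinearity is applied.)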
As it will turn out later, due to how Theano deals with 240 | optimizations, this symbolic variable will be needed to write 241 | down a more stable computational graph (see details in the 242 | reconstruction cost function) 243 | 244 | ''' 245 | pre_sigmoid_activation = T.dot(vis, self.W) + self.hbias 246 | return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)] 247 | ``` 248 | 249 | ```Python 250 | def sample_h_given_v(self, v0_sample): 251 | ''' This function infers state of hidden units given visible units ''' 252 | # compute the activation of the hidden units given a sample of 253 | # the visibles 254 | pre_sigmoid_h1, h1_mean = self.propup(v0_sample) 255 | # get a sample of the hiddens given their activation 256 | # Note that theano_rng.binomial returns a symbolic sample of dtype 257 | # int64 by default. If we want to keep our computations in floatX 258 | # for the GPU we need to specify to return the dtype floatX 259 | h1_sample = self.theano_rng.binomial(size=h1_mean.shape, 260 | n=1, p=h1_mean, 261 | dtype=theano.config.floatX) 262 | return [pre_sigmoid_h1, h1_mean, h1_sample] 263 | ``` 264 | 265 | ```Python 266 | def propdown(self, hid): 267 | '''This function propagates the hidden units activation downwards to 268 | the visible units 269 | 270 | Note that we return also the pre_sigmoid_activation of the 271 | layer. As it will turn out later, due to how Theano deals with 272 | optimizations, this symbolic variable will be needed to write 273 | down a more stable computational graph (see details in the 274 | reconstruction cost function) 275 | 276 | ''' 277 | pre_sigmoid_activation = T.dot(hid, self.W.T) + self.vbias 278 | return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)] 279 | ``` 280 | 281 | ```Python 282 | def sample_v_given_h(self, h0_sample): 283 | ''' This function infers state of visible units given hidden units ''' 284 | # compute the activation of the visible given the hidden sample 285 | pre_sigmoid_v1, v1_mean = self.propdown(h0_sample) 286 | # get a sample of the visible given their activation 287 | # Note that theano_rng.binomial returns a symbolic sample of dtype 288 | # int64 by default. 
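        # Every visible unit is drawn independently as a 0/1 value, with
        # success probability given by the corresponding entry of v1_mean.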
If we want to keep our computations in floatX 289 | # for the GPU we need to specify to return the dtype floatX 290 | v1_sample = self.theano_rng.binomial(size=v1_mean.shape, 291 | n=1, p=v1_mean, 292 | dtype=theano.config.floatX) 293 | return [pre_sigmoid_v1, v1_mean, v1_sample] 294 | ``` 295 | 现在,我们可以使用这些函数来定义一个Gibbs采样步骤的符号图。我们定义如下两个函数: 296 | * `gibbs_vhv`表示从可视单元中开始的Gibbs采样的步骤。我们将可以看到这对于从RBM中采样是非常有用的。 297 | * `gibbs_hvh`表示从隐藏单元中开始Gibbs采样的步骤。这个函数再实现CD和PCD更新中是非常有用的。 298 | 代码如下: 299 | 300 | ```Python 301 | def gibbs_hvh(self, h0_sample): 302 | ''' This function implements one step of Gibbs sampling, 303 | starting from the hidden state''' 304 | pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h0_sample) 305 | pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v1_sample) 306 | return [pre_sigmoid_v1, v1_mean, v1_sample, 307 | pre_sigmoid_h1, h1_mean, h1_sample] 308 | ``` 309 | 310 | ```Python 311 | def gibbs_vhv(self, v0_sample): 312 | ''' This function implements one step of Gibbs sampling, 313 | starting from the visible state''' 314 | pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v0_sample) 315 | pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h1_sample) 316 | return [pre_sigmoid_h1, h1_mean, h1_sample, 317 | pre_sigmoid_v1, v1_mean, v1_sample] 318 | 319 | # start-snippet-2 320 | ``` 321 | 322 | 这个类还有一个函数去计算模型的自由能量,以便去计算参数的梯度。注意我们也会返回pre-sigmoid。 323 | 324 | ```Python 325 | def free_energy(self, v_sample): 326 | ''' Function to compute the free energy ''' 327 | wx_b = T.dot(v_sample, self.W) + self.hbias 328 | vbias_term = T.dot(v_sample, self.vbias) 329 | hidden_term = T.sum(T.log(1 + T.exp(wx_b)), axis=1) 330 | return -hidden_term - vbias_term 331 | ``` 332 | 我们随后添加一个`get_cost_update`方法,目的是产生CD-k和PCD-k的更新的象征性梯度。 333 | 334 | ```Python 335 | def get_cost_updates(self, lr=0.1, persistent=None, k=1): 336 | """This functions implements one step of CD-k or PCD-k 337 | 338 | :param lr: learning rate used to train the RBM 339 | 340 | :param persistent: None for CD. For PCD, shared variable 341 | containing old state of Gibbs chain. This must be a shared 342 | variable of size (batch size, number of hidden units). 343 | 344 | :param k: number of Gibbs steps to do in CD-k/PCD-k 345 | 346 | Returns a proxy for the cost and the updates dictionary. The 347 | dictionary contains the update rules for weights and biases but 348 | also an update of the shared variable used to store the persistent 349 | chain, if one is used. 350 | 351 | """ 352 | 353 | # compute positive phase 354 | pre_sigmoid_ph, ph_mean, ph_sample = self.sample_h_given_v(self.input) 355 | 356 | # decide how to initialize persistent chain: 357 | # for CD, we use the newly generate hidden sample 358 | # for PCD, we initialize from the old state of the chain 359 | if persistent is None: 360 | chain_start = ph_sample 361 | else: 362 | chain_start = persistent 363 | ``` 364 | 注意`get_cost_update`作为参数被变量化为`persistent`。这允许我们去使用相同的代码来实现CD和PCD。为了使用PCD,`persistent`需要被关联到一个共享变量,它包含前一次迭代的Gibbs链的状态。 365 | 366 | 假如`persistent`为`None`,则我们使用正相位时产生的隐藏样本来初始化Gibbs链,以此实现CD。当我们已经建立了这个链的开始点的时候,我们就可以计算这个Gibbs链的终点的样本,以及我们需要的去获得梯度的样本。为了获得这些,我们使用Theano提供的`sacn`操作,我们建议读者去阅读这个[链接](http://deeplearning.net/software/theano/library/scan.html)。 367 | 368 | ```Python 369 | # perform actual negative phase 370 | # in order to implement CD-k/PCD-k we need to scan over the 371 | # function that implements one gibbs step k times. 
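        # gibbs_hvh returns six outputs per step; `outputs_info` below marks
        # only the last one (the hidden sample) as the recurrent state fed
        # back into the next step, while scan stacks every output over the
        # k steps, so that e.g. nv_samples[-1] is the visible sample at the
        # end of the chain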
372 | # Read Theano tutorial on scan for more information : 373 | # http://deeplearning.net/software/theano/library/scan.html 374 | # the scan will return the entire Gibbs chain 375 | ( 376 | [ 377 | pre_sigmoid_nvs, 378 | nv_means, 379 | nv_samples, 380 | pre_sigmoid_nhs, 381 | nh_means, 382 | nh_samples 383 | ], 384 | updates 385 | ) = theano.scan( 386 | self.gibbs_hvh, 387 | # the None are place holders, saying that 388 | # chain_start is the initial state corresponding to the 389 | # 6th output 390 | outputs_info=[None, None, None, None, None, chain_start], 391 | n_steps=k 392 | ) 393 | ``` 394 | 当你已经产生了这个链,我们在链的末尾的样例获得负相位的自由能量。注意,这个`chain_end`是模型参数项中的一个的象征性的Theano变量,当我们简单的求解`T.grad`的时候,这个函数将通过Gibbs链来得到这个梯度。这不是我们想要的(它会搞乱我们的梯度),因此我们需要指示`T.grad`,`chain_end`是一个常量。我们通过`T.grad`的`consider_constant`来做这个事情。 395 | 396 | ```Python 397 | # determine gradients on RBM parameters 398 | # note that we only need the sample at the end of the chain 399 | chain_end = nv_samples[-1] 400 | 401 | cost = T.mean(self.free_energy(self.input)) - T.mean( 402 | self.free_energy(chain_end)) 403 | # We must not compute the gradient through the gibbs sampling 404 | gparams = T.grad(cost, self.params, consider_constant=[chain_end]) 405 | ``` 406 | 最后,我们增加由`scan`返回的更新字典(包含了随机状态的`theano_rng`更新规则)来获取参数更新。在PCD例子中,也需要更新包含Gibbs链状态的共享变量。 407 | 408 | ```Python 409 | # constructs the update dictionary 410 | for gparam, param in zip(gparams, self.params): 411 | # make sure that the learning rate is of the right dtype 412 | updates[param] = param - gparam * T.cast( 413 | lr, 414 | dtype=theano.config.floatX 415 | ) 416 | if persistent: 417 | # Note that this works only if persistent is a shared variable 418 | updates[persistent] = nh_samples[-1] 419 | # pseudo-likelihood is a better proxy for PCD 420 | monitoring_cost = self.get_pseudo_likelihood_cost(updates) 421 | else: 422 | # reconstruction cross-entropy is a better proxy for CD 423 | monitoring_cost = self.get_reconstruction_cost(updates, 424 | pre_sigmoid_nvs[-1]) 425 | 426 | return monitoring_cost, updates 427 | ``` 428 | 429 | ## 进展跟踪 430 | 431 | RBMs的训练是特别困难的。由于归一化函数Z,我们无法在训练的时候估计对数似然函数log(P(x))。因而我们没有直接可以度量超参数优化与否的方法。 432 | 433 | 而下面的几个选项对用户是有用的。 434 | 435 | ### 负样本的检查 436 | 437 | 在训练中获得的负样本是可以可视化的。在训练进程中,我们知道由RBM定义的模型不断逼近真实分布,p_train(x)。负样例就可以视为训练集中的样本。显而易见的,坏的超参数将在这种方式下被丢弃。 438 | 439 | ### 滤波器的可视化跟踪 440 | 441 | 由模型训练的滤波器是可以可视化的。我们可以将每个单元的权值以灰度图的方式展示。滤波器应该选出数据中强的特征。对于任意的数据集,这个滤波器都是不确定的。例如,训练MNIST,滤波器就表现的像“stroke”检测器,而训练自然图像的稀疏编码的时候,则像Gabor滤波器。 442 | 443 | ### 似然估计的替代 444 | 445 | 此外,更加容易处理的函数可以被用于做似然估计的替代。当我们使用PCD来训练RBM的时候,可以使用伪似然估计来替代。伪似然估计(Pseudo-likeihood,PL)更加简于计算,因为它假设所有的比特都是相互独立的,因此有: 446 | 447 | ![PL](/images/7_proxies_likelihood_1.png) 448 | 449 | 450 | 451 | 452 | ## 主循环 453 | 454 | 455 | 456 | ## 结果 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Deep Learning Tutorial in Chinese 2 | ================================= 3 | 深度学习教程中文版 4 | ================================= 5 | 6 | This is a `Chinese tutorial` which is translated from [DeepLearning 0.1 documentation](http://deeplearning.net/tutorial/contents.html#). And in this tutorial, all algorithms and models are coded by Python and [Theano](http://deeplearning.net/software/theano/index.html). Theano is a famous third-party library, and allows coder to use GPU or CPU to run his Python code. 
7 | 8 | 9 | 10 | 这是一个翻译自[深度学习0.1文档](http://deeplearning.net/tutorial/contents.html)的`中文教程`。在这个教程里面所有的算法和模型都是通过Pyhton和[Theano](http://deeplearning.net/software/theano/index.html)实现的。Theano是一个著名的第三方库,允许程序员使用GPU或者CPU去运行他的Python代码。 11 | 12 | 13 | ## 内容/Contents 14 | 15 | * [入门(Getting Started)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/1_Getting_Started_入门.md) 16 | * [使用逻辑回归进行MNIST分类(Classifying MNIST digits using Logistic Regression)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md) 17 | * [多层感知机(Multilayer Perceptron)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md) 18 | * [卷积神经网络(Convolutional Neural Networks(LeNet))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/4_Convoltional_Neural_Networks_LeNet_卷积神经网络.md) 19 | * [降噪自动编码机器(Denoising Autoencoders(dA))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/5_Denoising_Autoencoders_降噪自动编码.md) 20 | * [层叠自动编码机(Stcaked Denoising Autoencoders(SdA))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/6_Stacked_Denoising_Autoencoders_层叠降噪自动编码机.md) 21 | * [受限波尔兹曼机(Restricted Boltzmann Machines(RBM))](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/7_Restricted_Boltzmann_Machine_受限波尔兹曼机.md) 22 | * Deep Belif Networks 23 | * Hybrid Monte-Carlo Sampling 24 | * Recurrent Neural Networks with Word Embeddings 25 | * Modeling and generating sequences of polyphonic music the RNN-RBM 26 | * Miscellaneous 27 | 28 | 29 | ## 版权/Copyright 30 | #### 作者/Author 31 | [Theano Development Team](http://deeplearning.net/tutorial/LICENSE.html), LISA lab, University of Montreal 32 | #### 翻译者/Translator 33 | [Lifeng Hua](https://github.com/Syndrome777), Zhejiang University 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /images/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/.DS_Store -------------------------------------------------------------------------------- /images/1_0-1_loss_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_0-1_loss_1.png -------------------------------------------------------------------------------- /images/1_0-1_loss_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_0-1_loss_2.png -------------------------------------------------------------------------------- /images/1_0-1_loss_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_0-1_loss_3.png -------------------------------------------------------------------------------- /images/1_l1_l2_regularization_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_l1_l2_regularization_1.png -------------------------------------------------------------------------------- 
/images/1_l1_l2_regularization_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_l1_l2_regularization_2.png -------------------------------------------------------------------------------- /images/1_negative_log_likelihod_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_negative_log_likelihod_1.png -------------------------------------------------------------------------------- /images/1_negative_log_likelihod_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/1_negative_log_likelihod_2.png -------------------------------------------------------------------------------- /images/2_defining_a_loss_function_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/2_defining_a_loss_function_1.png -------------------------------------------------------------------------------- /images/2_the_model_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/2_the_model_1.png -------------------------------------------------------------------------------- /images/2_the_model_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/2_the_model_2.png -------------------------------------------------------------------------------- /images/3_from_lr_to_mlp_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_from_lr_to_mlp_1.png -------------------------------------------------------------------------------- /images/3_the_model_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_the_model_1.png -------------------------------------------------------------------------------- /images/3_the_model_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_the_model_2.png -------------------------------------------------------------------------------- /images/3_the_model_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3_the_model_3.png -------------------------------------------------------------------------------- /images/3wolfmoon.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/3wolfmoon.jpg -------------------------------------------------------------------------------- /images/4_conv_operator_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_conv_operator_1.png -------------------------------------------------------------------------------- /images/4_detail_notation_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_detail_notation_1.png -------------------------------------------------------------------------------- /images/4_detail_notation_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_detail_notation_2.png -------------------------------------------------------------------------------- /images/4_detail_notation_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_detail_notation_3.png -------------------------------------------------------------------------------- /images/4_full_model_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_full_model_1.png -------------------------------------------------------------------------------- /images/4_sparse_con_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/4_sparse_con_1.png -------------------------------------------------------------------------------- /images/5_autoencoders_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_1.png -------------------------------------------------------------------------------- /images/5_autoencoders_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_2.png -------------------------------------------------------------------------------- /images/5_autoencoders_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_3.png -------------------------------------------------------------------------------- /images/5_autoencoders_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_autoencoders_4.png -------------------------------------------------------------------------------- /images/5_running_code_1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_running_code_1.png -------------------------------------------------------------------------------- /images/5_running_code_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/5_running_code_2.png -------------------------------------------------------------------------------- /images/6_sda_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/6_sda_1.png -------------------------------------------------------------------------------- /images/7_ebm_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_1.png -------------------------------------------------------------------------------- /images/7_ebm_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_2.png -------------------------------------------------------------------------------- /images/7_ebm_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_3.png -------------------------------------------------------------------------------- /images/7_ebm_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_4.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_1.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_2.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_3.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_4.png -------------------------------------------------------------------------------- /images/7_ebm_hidden_units_5.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_ebm_hidden_units_5.png -------------------------------------------------------------------------------- /images/7_implementation_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_implementation_1.png -------------------------------------------------------------------------------- /images/7_proxies_likelihood_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_proxies_likelihood_1.png -------------------------------------------------------------------------------- /images/7_rbm_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_1.png -------------------------------------------------------------------------------- /images/7_rbm_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_2.png -------------------------------------------------------------------------------- /images/7_rbm_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_3.png -------------------------------------------------------------------------------- /images/7_rbm_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_4.png -------------------------------------------------------------------------------- /images/7_rbm_binary_units_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_binary_units_1.png -------------------------------------------------------------------------------- /images/7_rbm_binary_units_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_binary_units_2.png -------------------------------------------------------------------------------- /images/7_rbm_binary_units_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_rbm_binary_units_3.png -------------------------------------------------------------------------------- /images/7_sampling_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_sampling_1.png -------------------------------------------------------------------------------- 
/images/7_sampling_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_sampling_2.png -------------------------------------------------------------------------------- /images/7_update_e_b_u_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Syndrome777/DeepLearningTutorial/c399fa239dbd3d0921ee4edcd30c552eb03e72b4/images/7_update_e_b_u_1.png --------------------------------------------------------------------------------