├── 10Model
    ├── Model.md
    ├── Model.pdf
    └── NNFS_version10.py
├── 11Evaluate
    ├── Evaluate.md
    ├── Evaluate.pdf
    └── NNFS_version11.py
├── 12Dataset
    ├── Dataset.md
    ├── Dataset.pdf
    └── NNFS_version12.py
├── 13SaveandLoadModel
    ├── NNFS_version13.py
    ├── SaveandLoadModel.md
    └── SaveandLoadModel.pdf
├── 1Dense Layer
    ├── Dense Layer.md
    ├── Dense Layer.pdf
    └── NNFS_version1.py
├── 2Activation
    ├── Activation.md
    ├── Activation.pdf
    └── NNFS_version2.py
├── 3Loss
    ├── Loss.md
    ├── Loss.pdf
    └── NNFS_version3.py
├── 4Backpropagation
    ├── Backpropagation.md
    ├── Backpropagation.pdf
    └── NNFS_version4.py
├── 5combineLossandActivation
    ├── NNFS_version5.py
    ├── combineLossandActivation.md
    └── combineLossandActivation.pdf
├── 6Optimizer
    ├── NNFS_version6.py
    ├── Optimizer.md
    └── Optimizer.pdf
├── 7L1andL2Regularization
    ├── L1andL2Regularization.md
    ├── L1andL2Regularization.pdf
    └── NNFS_version7.py
├── 8Dropout
    ├── Dropout.md
    ├── Dropout.pdf
    └── NNFS_version8.py
├── 9Regression
    ├── NNFS_version9.py
    ├── Regression.md
    └── Regression.pdf
└── README.md


/10Model/Model.md:
--------------------------------------------------------------------------------
  1 | # Model
  2 | 
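  3 | ## 1. Overview
  4 | 
  5 | Earlier versions built each model by writing a lot of code and editing fairly large blocks of it. The Model class wraps the forward pass, the backward pass, training, and validation, turning the model itself into an object, which matters especially once we want to save and load it for later prediction tasks. Using this object removes much of the repeated code and makes it easier to work with the existing code base and to build new models.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### 1. Create Layer_Input
 10 | 
 11 | We want to run these operations in a loop over the layers, and each layer needs to know its previous and next layer. To keep the code uniform, a Layer_Input layer is placed before the first Dense layer as the input layer; it has no weights or biases. It only holds the training data and, when iterating over the layers, serves solely as the "previous layer" of the first layer.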
 12 | 
 13 | ```py
 14 | class Layer_Input:
 15 |     def __init__(self):
 16 |         pass
 17 | 
 18 |     def forward(self, input):
 19 |         self.output = input
 20 | ```
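
The benefit of the input layer can be sketched as follows. This is a minimal illustration, assuming each layer exposes `pre`, `output`, and `forward` as in this chapter; `Double` is a hypothetical stand-in for a real layer and is not part of the original code.

```python
import numpy as np

class Layer_Input:
    def forward(self, input):
        self.output = input

class Double:
    """Hypothetical stand-in layer: multiplies its input by 2."""
    def forward(self, input):
        self.output = input * 2

input_layer = Layer_Input()
layers = [Double(), Double()]

# Link each layer to its predecessor; the first one points at the input layer
previous = input_layer
for layer in layers:
    layer.pre = previous
    previous = layer

# Forward pass: the first layer needs no special case
input_layer.forward(np.array([[1.0, 2.0]]))
for layer in layers:
    layer.forward(layer.pre.output)

result = layers[-1].output
print(result)
```

Because the first layer reads `layer.pre.output` like every other layer, the loop body stays identical for all layers.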
 21 | 
 22 | 
 23 | 
 24 | 
 25 | 
 26 | ### 2. Modify Loss
 27 | 
 28 | ```py
 29 | class Loss:
 30 |     def __init__(self):
 31 |         pass
 32 | 
 33 |     # The loss must know which layers hold trainable parameters so they can be regularized
 34 |     def save_trainable_layer(self, trainable_layer):
 35 |         self.trainable_layer = trainable_layer
 36 | 
 37 |     # Always compute the loss through calculate()
 38 |     def calculate(self, y_pred, y_true, add_regular_loss=False):
 39 |         # Each loss function subclasses Loss and implements its own forward()
 40 |         data_loss = np.mean(self.forward(y_pred, y_true))
 41 | 
 42 |         # With the regularization code in place, the regularization loss can be computed
 43 |         # Earlier versions called regularization_loss(layer);
 44 |         # this version has self.trainable_layer, which leads straight to the Dense layers (the ones with parameters)
 45 |         regularization_loss = self.regularization_loss()
 46 |         if not add_regular_loss:
 47 |             # When evaluating the model only data_loss matters
 48 |             regularization_loss = 0
 49 |         # Note: the losses are not stored as attributes; they are returned directly
 50 |         return data_loss, regularization_loss
 51 | 
 52 |     def regularization_loss(self):
 53 |         # Defaults to 0
 54 |         regularization_loss = 0
 55 |         for layer in self.trainable_layer:
 56 |             # L1 penalty, if enabled
 57 |             if layer.weight_L1 > 0:
 58 |                 regularization_loss += layer.weight_L1 * np.sum(np.abs(layer.weight))
 59 |             if layer.bias_L1 > 0:
 60 |                 regularization_loss += layer.bias_L1 * np.sum(np.abs(layer.bias))
 61 |             # L2 penalty, if enabled
 62 |             if layer.weight_L2 > 0:
 63 |                 regularization_loss += layer.weight_L2 * np.sum(layer.weight ** 2)
 64 |             if layer.bias_L2 > 0:
 65 |                 regularization_loss += layer.bias_L2 * np.sum(layer.bias ** 2)
 66 | 
 67 |         return regularization_loss
 68 | ```
 69 | 
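 70 | > This version adds save_trainable_layer and modifies regularization_loss and calculate. Calling save_trainable_layer creates the self.trainable_layer attribute, which holds the Dense layers (the ones with learnable parameters). With add_regular_loss=False, calculate returns regularization_loss = 0, so evaluation on the test set only looks at data_loss.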
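
As a worked example of the regularization term, the sketch below applies the same four penalties to one tiny layer. `TinyLayer` is a hypothetical stand-in carrying only the attributes the loss reads, and the numbers are made up for illustration.

```python
import numpy as np

class TinyLayer:
    """Hypothetical stand-in for Layer_Dense with only the attributes the loss reads."""
    def __init__(self):
        self.weight = np.array([[1.0, -2.0], [0.5, 0.0]])
        self.bias = np.array([[0.5, -0.5]])
        self.weight_L1 = 0.1
        self.bias_L1 = 0.0   # disabled, so this term is skipped
        self.weight_L2 = 0.01
        self.bias_L2 = 0.01

def regularization_loss(trainable_layer):
    total = 0.0
    for layer in trainable_layer:
        if layer.weight_L1 > 0:
            total += layer.weight_L1 * np.sum(np.abs(layer.weight))
        if layer.bias_L1 > 0:
            total += layer.bias_L1 * np.sum(np.abs(layer.bias))
        if layer.weight_L2 > 0:
            total += layer.weight_L2 * np.sum(layer.weight ** 2)
        if layer.bias_L2 > 0:
            total += layer.bias_L2 * np.sum(layer.bias ** 2)
    return total

# 0.1 * 3.5 (L1 weights) + 0.01 * 5.25 (L2 weights) + 0.01 * 0.5 (L2 biases)
reg = regularization_loss([TinyLayer()])
print(reg)
```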
 71 | 
 72 | ### 3. Modify Activation
 73 | 
 74 | ```py
 75 | class Activation_Softmax:
 76 |     def prediction(self, output):
 77 |         return np.argmax(output, axis=1, keepdims=True)
 78 |     
 79 | class Activation_Sigmoid:
 80 |     def prediction(self, output):
 81 |         # output > 0.5 yields a boolean array;
 82 |         # multiplying by 1 turns it into 0/1 integers
 83 |         return ( output > 0.5 ) * 1
 84 |     
 85 | class Activation_Linear:
 86 |     def prediction(self, output):
 87 |         return output
 88 | ```
 89 | 
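 90 | > ReLU is left unchanged because it is not used as an output layer. The prediction method returns the predicted class (classification) or the output value (regression).
 91 | >
 92 | > **Note: the softmax prediction must keep its 2-D shape (keepdims=True), because the prediction is compared as a matrix when computing accuracy.**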
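
The shape difference can be checked directly. A small sketch with made-up outputs; note that the `keepdims` argument of `np.argmax` requires NumPy 1.22 or newer.

```python
import numpy as np

softmax_output = np.array([[0.1, 0.7, 0.2],
                           [0.8, 0.1, 0.1]])

flat = np.argmax(softmax_output, axis=1)                    # shape (2,)
column = np.argmax(softmax_output, axis=1, keepdims=True)   # shape (2, 1)

# Sigmoid case: threshold at 0.5, then turn booleans into 0/1 integers
sigmoid_output = np.array([[0.3], [0.9]])
binary = (sigmoid_output > 0.5) * 1

print(flat.shape, column.shape, binary.ravel())
```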
 93 | 
 94 | ### 4. Modify Dropout
 95 | 
 96 | ```py
 97 |     def forward(self, input, drop_on=True):
 98 |         self.input = input
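 99 |         # Generate a random 0/1 mask
100 |         # A 1 is drawn with probability self.rate (the keep probability), so divide by self.rate to compensate for the dropped values
101 |         if not drop_on:
102 |             # With dropout switched off, the output equals the input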
103 |             self.output = self.input
104 |             return
105 |         
106 |         self.mask = np.random.binomial(1, self.rate, size=self.input.shape) / self.rate
107 |         self.output = self.input * self.mask
108 | ```
109 | 
110 | > forward gains a drop_on=True parameter; during testing the forward pass is run with dropout switched off.
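
The effect of the switch can be sketched on a constant input. The constructor below is an assumption (the original `__init__` is not shown in this section): it stores the keep probability `1 - rate` in `self.rate`, which is what the division in `forward` implies.

```python
import numpy as np

class Dropout:
    def __init__(self, rate):
        # Assumption: self.rate holds the keep probability (1 - drop rate)
        self.rate = 1 - rate

    def forward(self, input, drop_on=True):
        self.input = input
        if not drop_on:
            # Evaluation: pass the input through unchanged
            self.output = self.input
            return
        # Training: zero some units and rescale the rest by 1 / keep probability
        self.mask = np.random.binomial(1, self.rate, size=self.input.shape) / self.rate
        self.output = self.input * self.mask

np.random.seed(0)
layer = Dropout(0.5)
x = np.ones((4, 1000))

layer.forward(x)                    # training mode
train_output = layer.output.copy()
train_mean = train_output.mean()    # stays close to 1 thanks to the rescaling

layer.forward(x, drop_on=False)     # evaluation mode
eval_equal = np.array_equal(layer.output, x)
print(train_mean, eval_equal)
```

Rescaling during training keeps the expected activation unchanged, which is why evaluation can simply skip the mask.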
111 | 
112 | ### 5. Create Accuracy
113 | 
114 | #### **Parent class**
115 | 
116 | This class computes accuracy from the predictions.
117 | 
118 | ```py
119 | class Accuracy:
120 |     # Compute the accuracy
121 |     def calculate(self, prediction, y_true):
122 |         # Element-wise comparison result, provided by the subclass
123 |         comparison = self.compare(prediction, y_true)
124 | 
125 |         # Accuracy is the mean of the comparison result
126 |         accuracy = np.mean(comparison)
127 | 
128 |         return accuracy
129 | ```
130 | 
131 | #### **Subclass 1**
132 | 
133 | ```python
134 | class Accuracy_Regresion(Accuracy):
135 |     def __init__(self):
136 |         # Precision threshold; regression has no exact notion of "correct",
137 |         # so a tolerance has to be defined first
138 |         self.precision = None
139 | 
140 |     def compare(self, prediction, y_true):
141 |         if self.precision is None:
142 |             self.precision = np.std(y_true) / 250
143 |         return np.abs(prediction - y_true) < self.precision
144 | ```
145 | 
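146 | > **np.std(y_true) / 250 is used as the precision threshold** when computing accuracy for regression: a prediction counts as correct if it lies within that distance of the target.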
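
A small worked example of this threshold, using made-up targets and predictions:

```python
import numpy as np

y_true = np.array([[0.0], [1.0], [2.0], [3.0]])
prediction = np.array([[0.001], [1.5], [2.0], [2.999]])

# Tolerance: standard deviation of the targets divided by 250
precision = np.std(y_true) / 250            # about 0.0045 for these targets
correct = np.abs(prediction - y_true) < precision
accuracy = np.mean(correct)                 # the second sample misses by 0.5

print(precision, accuracy)
```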
147 | 
148 | #### **Subclass 2**
149 | 
150 | ```python
151 | class Accuracy_Classification(Accuracy):
152 |     def __init__(self):
153 |         pass
154 | 
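155 |     def compare(self, prediction, y_true):
156 |         # One-hot encoded labels
157 |         if len(y_true.shape) == 2:
158 |             # Convert to a single class index per sample
159 |             y_true = np.argmax(y_true, axis=1)  # now a 1-D vector; keepdims=True would keep it 2-D
160 |         # Note: prediction is an (a, 1) matrix while y_true is a vector of length a.
161 |         # Comparing them directly broadcasts to an (a, a) matrix by mistake.
162 |         # This matters a lot: I spent a day hunting a bug in the model code
163 |         # before finding that only the accuracy calculation was wrong.
164 |         y_true = y_true.reshape(-1, 1)
165 |         compare = (prediction == y_true) * 1
166 |         return compare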
167 | ```
168 | 
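169 | > Computes accuracy for classification. **Note: the result depends on the shape of y_true; currently only one-hot labels and 1-D label vectors are accepted.**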
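
The broadcasting pitfall described in the code comments can be reproduced in a few lines. The arrays below are made up for illustration.

```python
import numpy as np

prediction = np.array([[0], [1], [2], [1]])   # shape (4, 1), as returned by prediction()
y_true = np.array([0, 1, 1, 1])               # shape (4,)

# Wrong: (4, 1) == (4,) broadcasts to a (4, 4) matrix of all pairwise comparisons
wrong = prediction == y_true
wrong_accuracy = np.mean(wrong)

# Right: reshape the labels to a column first, giving an element-wise (4, 1) result
right = prediction == y_true.reshape(-1, 1)
right_accuracy = np.mean(right)

print(wrong.shape, wrong_accuracy, right.shape, right_accuracy)
```

No error is raised in the wrong case, so the only symptom is an implausible accuracy value.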
170 | 
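171 | ### 6. Create Model
172 | 
173 | The Model class wraps the forward pass, the backward pass, training, and validation so that the whole workflow is convenient to run.
174 | 
175 | #### **Implementation**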
176 | 
177 | ```python
178 | class Model():
179 |     def __init__(self):
180 |         # Holds the model's layers in order
181 |         self.layer = []
182 |         # Start as None; finalize() checks whether the model ends in softmax + categorical cross-entropy or sigmoid + binary cross-entropy
183 |         self.softmax_categoricalCrossentropy = None
184 |         self.sigmoid_binaryCrossentropy = None
185 | 
186 |     # Append a layer
187 |     def add(self, layer):
188 |         self.layer.append(layer)
189 | 
190 |     # Set the loss, the optimizer, and the accuracy object
191 |     # Every parameter after the bare * must be passed by keyword, not by position
192 |     def set(self, *, loss, optimizer, accuracy):
193 |         self.loss = loss
194 |         self.optimizer = optimizer
195 |         self.accuracy = accuracy
196 | 
197 |     # Train the model
198 |     # epochs: number of training epochs
199 |     # print_every: print once every this many epochs
200 |     def train(self, X, y, *, epochs=1, print_every=1, vaildation_data=None):
201 |         # Note: vaildation_data must be a tuple (X, y)
202 |         for epoch in range(1, epochs+1):
203 |             # Forward pass
204 |             output = self.forward(X)
205 |             # Compute the loss
206 |             data_loss, regularization_loss = self.loss.calculate(output, y, add_regular_loss=True)
207 |             # Total loss
208 |             loss = data_loss + regularization_loss
209 |             # Predicted value or predicted class
210 |             prediction = self.output_layer.prediction(output)
211 |             # Compute the accuracy
212 |             accuracy = self.accuracy.calculate(prediction, y)
213 | 
214 |             # Backward pass
215 |             self.backward(output, y)
216 | 
217 |             # Let the optimizer update the parameters
218 |             self.optimizer.pre_update_param()
219 |             for layer in self.trainable_layer:
220 |                 self.optimizer.update_param(layer)
221 |             self.optimizer.post_update_param()
222 | 
223 |             # Print progress
224 |             if not epoch % print_every:
225 |                 print(f'epoch: {epoch}, ' +
226 |                     f'acc: {accuracy:.3f}, ' +
227 |                     f'loss: {loss:.3f} (' +
228 |                     f'data_loss: {data_loss:.3f}, ' +
229 |                     f'reg_loss: {regularization_loss:.3f}), ' +
230 |                     f'lr: {self.optimizer.current_learning_rate}')
231 | 
232 |         if vaildation_data:
233 |             X_val, y_val = vaildation_data
234 |             # Output of the output layer (dropout switched off)
235 |             output = self.forward(X_val, False)
236 |             # Compute the loss
237 |             data_loss, regularization_loss = self.loss.calculate(output, y_val)
238 |             loss = data_loss + regularization_loss
239 |             # Predicted class or predicted value
240 |             prediction = self.output_layer.prediction(output)
241 |             # Compute the accuracy
242 |             accuracy = self.accuracy.calculate(prediction, y_val)
243 |             # Print the validation result
244 |             print(f'validation, ' +
245 |                 f'acc: {accuracy:.3f}, ' +
246 |                 f'loss: {loss:.3f}')
247 |             # plt.plot(X_val, y_val)
248 |             # plt.plot(X_val, output)
249 |             # plt.show()
250 | 
251 |     ## Finalize the model structure in this method:
252 |     # 1. Link each layer to its previous and next layer
253 |     # 2. Collect the Dense layers
254 |     # 3. Pass the Dense layers to the loss object so it can compute the regularization loss
255 |     # 4. Check for softmax + categorical cross-entropy or sigmoid + binary cross-entropy
256 |     def finalize(self):
257 |         # Create the input layer
258 |         self.input_layer = Layer_Input()
259 |         # Number of layers, not counting the input layer or the loss
260 |         layer_num = len(self.layer)
261 |         # Holds the Dense layers (the ones with learnable parameters)
262 |         self.trainable_layer = []
263 | 
264 |         # Link the layers in a loop
265 |         for i in range(layer_num):
266 |             if i == 0:
267 |                 # First layer: its previous layer is input_layer
268 |                 self.layer[i].pre = self.input_layer
269 |                 self.layer[i].next = self.layer[i + 1]
270 |             elif i == layer_num-1:
271 |                 # Last layer: its next "layer" is the loss
272 |                 self.layer[i].pre = self.layer[i - 1]
273 |                 self.layer[i].next = self.loss
274 |                 # Remember the output activation as a Model attribute
275 |                 self.output_layer = self.layer[i]
276 |             else:
277 |                 self.layer[i].pre = self.layer[i-1]
278 |                 self.layer[i].next = self.layer[i+1]
279 | 
280 |             if hasattr(self.layer[i], 'weight'):
281 |                 # A layer with a 'weight' attribute is a Dense layer,
282 |                 # so it is trainable
283 |                 self.trainable_layer.append(self.layer[i])
284 | 
285 |         # Hand the Dense layers to the loss object
286 |         self.loss.save_trainable_layer(self.trainable_layer)
287 |         # Check for softmax + categorical cross-entropy or sigmoid + binary cross-entropy
288 |         if isinstance(self.layer[-1], Activation_Softmax) and \
289 |                 isinstance(self.loss, Loss_CategoricalCrossentropy):
290 |             self.softmax_categoricalCrossentropy = Activation_Softmax_Loss_CategoricalCrossentropy()
291 | 
292 |         if isinstance(self.layer[-1], Activation_Sigmoid) and \
293 |                 isinstance(self.loss, Loss_BinaryCrossentropy):
294 |             self.sigmoid_binaryCrossentropy = Activation_Sigmoid_Loss_BinaryCrossentropy()
295 | 
296 |     # Forward pass
297 |     # Called from train (training calls several methods; forward is one of them)
298 |     def forward(self, input, dropout=True):
299 |         self.input_layer.forward(input)
300 |         for layer in self.layer:
301 |             if isinstance(layer,Dropout) and (not dropout):
302 |                 layer.forward(layer.pre.output,dropout)
303 |             else:
304 |                 layer.forward(layer.pre.output)
305 | 
306 |         # layer is now the last layer, i.e. the output activation
307 |         return layer.output
308 | 
309 |     def backward(self, output, y_true):
310 |         if self.softmax_categoricalCrossentropy:
311 |             self.softmax_categoricalCrossentropy.backward(output, y_true)
312 |             # The last layer is softmax; its backward is not called to get dinput,
313 |             # because softmax_categoricalCrossentropy has already computed it
314 |             self.layer[-1].dinput = self.softmax_categoricalCrossentropy.dinput
315 |             # Note: this loop skips the last layer (softmax)
316 |             for layer in reversed(self.layer[:-1]):
317 |                 layer.backward(layer.next.dinput)
318 |             return
319 |         if self.sigmoid_binaryCrossentropy:
320 |             self.sigmoid_binaryCrossentropy.backward(output, y_true)
321 |             # The last layer is sigmoid; its backward is not called to get dinput,
322 |             # because sigmoid_binaryCrossentropy has already computed it
323 |             self.layer[-1].dinput = self.sigmoid_binaryCrossentropy.dinput
324 |             # Note: this loop skips the last layer (sigmoid)
325 |             for layer in reversed(self.layer[:-1]):
326 |                 layer.backward(layer.next.dinput)
327 |             return
328 | 
329 |         self.loss.backward(output, y_true)
330 |         # Note: this iterates over self.layer, not self.trainable_layer
331 |         for layer in reversed(self.layer):
332 |             layer.backward(layer.next.dinput)
333 | ```
334 | 
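335 | > A short summary of what each method does:
336 | 
337 | add(): appends a layer to the model.
338 | 
339 | set(): sets the loss, the optimizer, and the accuracy object.
340 | 
341 | forward(): the forward pass, called from train.
342 | 
343 | backward(): the backward pass, called from train.
344 | 
345 | finalize(): fixes the model structure (links the layers, collects the Dense layers, and detects the combined activation/loss cases).
346 | 
347 | train(): trains the model by calling the other methods; its vaildation_data parameter supplies data for testing.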
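
backward() can skip the final softmax layer because, for softmax followed by categorical cross-entropy, the gradient with respect to the layer's input has the closed form (softmax output minus one-hot target) divided by the number of samples. The sketch below checks that standard result numerically. It is an independent illustration with made-up values and hypothetical helper names, not the code of Activation_Softmax_Loss_CategoricalCrossentropy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(z, y):
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(len(y)), y]))

z = np.array([[1.0, 2.0, 0.5],
              [0.1, -1.0, 3.0]])   # inputs to the softmax layer
y = np.array([1, 2])               # class indices

# Closed-form gradient: (softmax - one_hot) / number of samples
analytic = softmax(z)
analytic[np.arange(len(y)), y] -= 1
analytic /= len(y)

# Central finite differences as an independent check
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        step = np.zeros_like(z)
        step[i, j] = eps
        numeric[i, j] = (mean_cross_entropy(z + step, y)
                         - mean_cross_entropy(z - step, y)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```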
348 | 
349 | #### **Example 1**
350 | 
351 | ```python
352 | # Generate 1000 data points in total
353 | X, y = sine_data()
354 | X_test = X[::2]
355 | y_test = y[::2]
356 | X = X[1::2]
357 | y = y[1::2]
358 | 
359 | model = Model()
360 | model.add(Layer_Dense(1,64))
361 | model.add(Activation_ReLu())
362 | model.add(Layer_Dense(64,64))
363 | model.add(Activation_ReLu())
364 | model.add(Layer_Dense(64,1))
365 | model.add(Activation_Linear())
366 | 
367 | # Remember the parentheses: loss=Loss_MeanSquaredError would not work,
368 | # because it passes the class itself instead of an instance
369 | model.set(loss=Loss_MeanSquaredError(),
370 |           optimizer=Optimizer_Adam(learning_rate=0.005, decay=1e-3),
371 |           accuracy=Accuracy_Regresion())
372 | 
373 | model.finalize()
374 | 
375 | model.train(X, y, epochs=10000, print_every=100)
376 | ```
377 | 
378 | ![image-20230812213059469](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308122130547.png)
379 | 
380 | ![image-20230812213113252](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308122131305.png)
381 | 
382 | > The model works very well on the regression problem.
383 | 
384 | #### **Example 2**
385 | 
386 | ```python
387 | nnfs.init()  # sets the random seed to 0 by default, so every run uses the same data
388 | X, y = spiral_data(samples=1000, classes=2)
389 | X_test, y_test = spiral_data(samples=100, classes=2)
390 | 
391 | model = Model()
392 | model.add(Layer_Dense(2,64,weight_L2=5e-4,bias_L2=5e-4))#,weight_L2=5e-4,bias_L2=5e-4
393 | model.add(Activation_ReLu())
394 | 
395 | model.add(Layer_Dense(64,1))
396 | model.add(Activation_Sigmoid())
397 | model.set(loss=Loss_BinaryCrossentropy(),
398 |           optimizer=Optimizer_Adam(decay=5e-7),
399 |           accuracy=Accuracy_Classification())
400 | 
401 | model.finalize()
402 | 
403 | model.train(X,y,vaildation_data=(X_test,y_test),epochs=10000,print_every=100)
404 | ```
405 | 
406 | ![image-20230813180428010](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308131804107.png)
407 | 
408 | #### **Example 3**
409 | 
410 | ```python
411 | nnfs.init()  # sets the random seed to 0 by default, so every run uses the same data
412 | X, y = spiral_data(samples=1000, classes=3)
413 | X_test, y_test = spiral_data(samples=100, classes=3)
414 | 
415 | 
416 | model = Model()
417 | model.add(Layer_Dense(2,512,weight_L2=5e-4,bias_L2=5e-4))#,weight_L2=5e-4,bias_L2=5e-4
418 | model.add(Activation_ReLu())
419 | 
420 | 
421 | model.add(Dropout(0.1))
422 | model.add(Layer_Dense(512,3))
423 | model.add(Activation_Softmax())
424 | model.set(loss=Loss_CategoricalCrossentropy(),
425 |           optimizer=Optimizer_Adam(learning_rate=0.05, decay=5e-5),
426 |           accuracy=Accuracy_Classification())
427 | 
428 | model.finalize()
429 | 
430 | model.train(X,y,vaildation_data=(X_test,y_test),epochs=10000,print_every=100)
431 | ```
432 | 
433 | ![image-20230813200825846](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308132008923.png)
434 | 
435 | ![image-20230813200848626](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308132008669.png)
436 | 
437 | > For the 3-class problem, training with the same settings gives results similar to the book's, but the benefit of the dropout layer does not show.
438 | 
439 | #### **Example 4**
440 | 
441 | ```python
442 | X, y = spiral_data(samples=1000, classes=3)
443 | X_test, y_test = spiral_data(samples=100, classes=3)
444 | # print(X[:5])
445 | # print(X_test[:5])
446 | 
447 | model = Model()
448 | model.add(Layer_Dense(2,64,weight_L2=5e-4,bias_L2=5e-4))#,weight_L2=5e-4,bias_L2=5e-4
449 | model.add(Activation_ReLu())
450 | 
451 | model.add(Layer_Dense(64,3))
452 | model.add(Activation_Softmax())
453 | model.set(loss=Loss_CategoricalCrossentropy(),
454 |           optimizer=Optimizer_Adam(decay=5e-7),
455 |           accuracy=Accuracy_Classification())
456 | ```
457 | 
458 | ![image-20230813202042320](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308132020371.png)
459 | 
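460 | > The same 3-class problem, but fewer neurons and no dropout give a better result. Whether to pick a simple structure without dropout or a more complex one with dropout therefore depends on the actual situation.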
461 | 


--------------------------------------------------------------------------------
/10Model/Model.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/10Model/Model.pdf


--------------------------------------------------------------------------------
/11Evaluate/Evaluate.md:
--------------------------------------------------------------------------------
  1 | # Evaluate
  2 | 
  3 | ## 1. Overview
  4 | 
  5 | This part adds a method for testing the model, including batch processing.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### **1. Modify Loss**
 10 | 
 11 | ```py
 12 | class Loss:
 13 |     # Always compute the loss through calculate()
 14 |     def calculate(self, y_pred, y_true, *, add_regular_loss=False):
 15 |         # Each loss function subclasses Loss and implements its own forward(), which returns per-sample losses
 16 |         sample_loss = self.forward(y_pred, y_true)
 17 |         data_loss = np.mean(sample_loss)
 18 |         # With batches, accumulate the summed loss and the number of samples seen so far
 19 |         self.cumulate_dataloss += np.sum(sample_loss)
 20 |         self.cumulate_num += len(sample_loss)
 21 | 
 22 |         # With the regularization code in place, the regularization loss can be computed
 23 |         # Earlier versions called regularization_loss(layer);
 24 |         # this version has self.trainable_layer, which leads straight to the Dense layers (the ones with parameters)
 25 |         regularization_loss = self.regularization_loss()
 26 |         if not add_regular_loss:
 27 |             # When evaluating the model only data_loss matters
 28 |             regularization_loss = 0
 29 |         # Note: the losses are not stored as attributes; they are returned directly
 30 |         return data_loss, regularization_loss
 31 | 
 32 |     def calculate_cumulate(self, *, add_regularization=False):
 33 |         # Mean data loss over every sample accumulated since the last clean_cumulate()
 34 |         data_loss = self.cumulate_dataloss / self.cumulate_num
 35 |         regularization_loss = 0
 36 |         if add_regularization:
 37 |             regularization_loss = self.regularization_loss()
 38 |         return data_loss, regularization_loss
 40 |     def clean_cumulate(self):
 41 |         self.cumulate_dataloss = 0
 42 |         self.cumulate_num = 0
 43 | ```
 44 | 
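 45 | > With batches added, the summed loss and the number of samples seen so far must be accumulated, so the attributes self.cumulate_dataloss and self.cumulate_num are added, together with a method that resets them to 0.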
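
Accumulating the sum and the sample count, rather than averaging the per-batch means, matters when the last batch is smaller than the others. A minimal sketch with made-up per-sample losses:

```python
import numpy as np

sample_loss = np.array([1.0, 1.0, 1.0, 1.0, 5.0])
batches = [sample_loss[:4], sample_loss[4:]]   # batch sizes 4 and 1

# Averaging the batch means gives the small batch too much weight
mean_of_means = np.mean([batch.mean() for batch in batches])

# Accumulating sums and counts reproduces the true per-sample mean
cumulate_dataloss = 0.0
cumulate_num = 0
for batch in batches:
    cumulate_dataloss += np.sum(batch)
    cumulate_num += len(batch)
epoch_loss = cumulate_dataloss / cumulate_num

print(mean_of_means, epoch_loss, sample_loss.mean())
```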
 46 | 
 47 | ### **2. Modify Accuracy**
 48 | 
 49 | ```py
 50 | class Accuracy:
 51 |     # Compute the accuracy
 52 |     def calculate(self, prediction, y_true):
 53 |         # Element-wise comparison result
 54 |         comparison = self.compare(prediction, y_true)
 55 |         # Accuracy of this batch
 56 |         accuracy = np.mean(comparison)
 57 |         # Accumulate the number of correct predictions (the attribute reuses the name from Loss) and the sample count
 58 |         self.cumulate_dataloss += np.sum(comparison)
 59 |         self.cumulate_num += len(comparison)
 60 | 
 61 |         return accuracy
 62 | 
 63 |     def calculate_cumulate(self):
 64 |         # Mean accuracy over all accumulated samples
 65 |         accuracy = self.cumulate_dataloss / self.cumulate_num
 66 |         return accuracy
 67 |     
 68 |     def clean_cumulate(self):
 69 |         self.cumulate_dataloss = 0
 70 |         self.cumulate_num = 0
 71 | ```
 72 | 
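 73 | > With batches added, the number of correct predictions and the number of samples seen so far must be accumulated, so the attributes self.cumulate_dataloss (which here holds the count of correct predictions) and self.cumulate_num are added, together with a method that resets them to 0.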
 74 | 
 75 | ### **3. Modify Model**
 76 | 
 77 | ```py
 78 | class Model():
 79 |     def evaluate(self, X_val, y_val, *, batch_size=None):
 80 |         # One batch by default
 81 |         validation_step = 1
 82 |         if batch_size is not None:
 83 |             validation_step = len(X_val) // batch_size
 84 |             if validation_step * batch_size < len(X_val):  # there is a remainder
 85 |                 validation_step += 1
 86 |         # Reset the accumulators
 87 |         self.loss.clean_cumulate()
 88 |         self.accuracy.clean_cumulate()
 89 | 
 90 |         for step in range(validation_step):
 91 |             # No batch size set: use the whole set
 92 |             if not batch_size:
 93 |                 X_batch = X_val
 94 |                 y_batch = y_val
 95 |             else:  # Conveniently, a slice whose end exceeds the length of X simply stops at the last element.
 96 |                 X_batch = X_val[step * batch_size:(step + 1) * batch_size]
 97 |                 y_batch = y_val[step * batch_size:(step + 1) * batch_size]
 98 | 
 99 |             # Output of the output layer (dropout switched off)
100 |             output = self.forward(X_batch, False)
101 |             # Compute the loss
102 |             data_loss, regularization_loss = self.loss.calculate(output, y_batch)
103 |             loss = data_loss + regularization_loss
104 |             # Predicted class or predicted value
105 |             prediction = self.output_layer.prediction(output)
106 |             # Compute the accuracy
107 |             accuracy = self.accuracy.calculate(prediction, y_batch)
108 |         # Mean accuracy and loss over all batches
109 |         validation_accuracy = self.accuracy.calculate_cumulate()
110 |         validation_data_loss, validation_regularization_loss = self.loss.calculate_cumulate()
111 |         validation_loss = validation_regularization_loss + validation_data_loss
112 |         # Report the average performance on the test set
113 |         print(f'validation, ' +
114 |               f'acc: {validation_accuracy:.3f}, ' +
115 |               f'loss: {validation_loss:.3f}')
116 |         # plt.plot(X_val, y_val)
117 |         # plt.plot(X_val, output)
118 |         # plt.show()
119 | 
120 | 
121 |     # Train the model
122 |     # epochs: number of training epochs
123 |     # print_every: print once every this many steps
124 |     def train(self, X, y, *, epochs=1, print_every=1, batch_size=None, validation_data=None):
125 |         # By default the dataset is a single batch
126 |         train_step = 1
127 | 
128 |         # Non-default case
129 |         if batch_size is not None:
130 |             train_step = len(X) // batch_size
131 |             if train_step * batch_size < len(X): # there is a remainder
132 |                 train_step += 1
133 | 
134 |         # Note: validation_data must be a tuple (X, y)
135 |         for epoch in range(1, epochs+1):
136 |             print(f'epoch:{epoch}')
137 |             # Reset the accumulators
138 |             self.loss.clean_cumulate()
139 |             self.accuracy.clean_cumulate()
140 | 
141 |             for step in range(train_step):
142 |                 # No batch size set: use the whole set
143 |                 if not batch_size:
144 |                     X_batch = X
145 |                     y_batch = y
146 |                 else: # Conveniently, a slice whose end exceeds the length of X simply stops at the last element.
147 |                     X_batch = X[step*batch_size:(step+1)*batch_size]
148 |                     y_batch = y[step*batch_size:(step+1)*batch_size]
149 | 
150 |                 # Forward pass
151 |                 output = self.forward(X_batch)
152 |                 # Compute the loss
153 |                 data_loss, regularization_loss = self.loss.calculate(output, y_batch, add_regular_loss=True)
154 |                 # Total loss
155 |                 loss = data_loss + regularization_loss
156 |                 # Predicted value or predicted class
157 |                 prediction = self.output_layer.prediction(output)
158 |                 # Compute the accuracy
159 |                 accuracy = self.accuracy.calculate(prediction, y_batch)
160 | 
161 |                 # Backward pass
162 |                 self.backward(output, y_batch)
163 | 
164 |                 # Let the optimizer update the parameters
165 |                 self.optimizer.pre_update_param()
166 |                 for layer in self.trainable_layer:
167 |                     self.optimizer.update_param(layer)
168 |                 self.optimizer.post_update_param()
169 | 
170 |                 # Each step prints the actual values of that batch
171 |                 if not step % print_every or step == train_step - 1:
172 |                     print(f'step: {step}, ' +
173 |                         f'acc: {accuracy:.3f}, ' +
174 |                         f'loss: {loss:.3f} (' +
175 |                         f'data_loss: {data_loss:.3f}, ' +
176 |                         f'reg_loss: {regularization_loss:.3f}), ' +
177 |                         f'lr: {self.optimizer.current_learning_rate}')
178 | 
179 |             # Per-epoch output: the averages over the whole epoch
180 |             epoch_data_loss, epoch_regularization_loss = \
181 |                 self.loss.calculate_cumulate(add_regularization=True)
182 |             epoch_loss = epoch_regularization_loss + epoch_data_loss
183 |             epoch_accuracy = self.accuracy.calculate_cumulate()
184 |             # Print the averages of this epoch
185 |             print(f'training {epoch}, ' +
186 |                 f'acc: {epoch_accuracy:.3f}, ' +
187 |                 f'loss: {epoch_loss:.3f} (' +
188 |                 f'data_loss: {epoch_data_loss:.3f}, ' +
189 |                 f'reg_loss: {epoch_regularization_loss:.3f}), ' +
190 |                 f'lr: {self.optimizer.current_learning_rate}')
191 | 
192 | 
193 |             if validation_data is not None:
194 |                 self.evaluate(*validation_data,batch_size=batch_size)
195 | ```
196 | 
197 | > A new evaluate method is added to the Model class; call evaluate to test the model's performance.
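
The step count and slicing used by both evaluate and train can be isolated in a small generator. This is a sketch for illustration; `iterate_batches` is a hypothetical helper, not part of the original Model class.

```python
import numpy as np

def iterate_batches(X, y, batch_size=None):
    """Yield (X_batch, y_batch) pairs; one batch holding everything if batch_size is None."""
    steps = 1
    if batch_size is not None:
        steps = len(X) // batch_size
        if steps * batch_size < len(X):   # there is a remainder
            steps += 1
    for step in range(steps):
        if batch_size is None:
            yield X, y
        else:
            start = step * batch_size
            # Slicing past the end is safe: the last batch is simply shorter
            yield X[start:start + batch_size], y[start:start + batch_size]

X = np.arange(10).reshape(10, 1)
y = np.arange(10)

sizes = [len(X_batch) for X_batch, _ in iterate_batches(X, y, batch_size=4)]
whole = [len(X_batch) for X_batch, _ in iterate_batches(X, y)]
print(sizes, whole)
```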
198 | 
199 | ![image-20230814093754126](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308140938221.png)
200 | 


--------------------------------------------------------------------------------
/11Evaluate/Evaluate.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/11Evaluate/Evaluate.pdf


--------------------------------------------------------------------------------
/12Dataset/Dataset.md:
--------------------------------------------------------------------------------
  1 | # Dataset
  2 | 
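  3 | ## 1. Overview
  4 | 
  5 | Fashion MNIST is a collection of 28x28 images with 60,000 training samples and 10,000 test samples, covering 10 kinds of clothing items such as shoes, boots, shirts, and bags.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### **Download the data**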
 10 | 
 11 | ```py
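 12 | import os
 13 | import urllib.request
 14 | from zipfile import ZipFile
 15 | 
 16 | URL = 'https://nnfs.io/datasets/fashion_mnist_images.zip'  # dataset download URL
 17 | FILE = 'fashion_mnist_images.zip'  # where the archive is saved
 18 | FOLDER = 'fashion_mnist_images'  # where the archive is extracted
 19 | # Download into FILE in the current folder only if it is not already there
 20 | if not os.path.isfile(FILE):
 21 |     print(f'Downloading {URL} and saving as {FILE}...')
 22 |     urllib.request.urlretrieve(URL, FILE)
 23 | 
 24 |     print('Unzipping images...')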
 25 |     with ZipFile(FILE) as zip_images:
 26 |         zip_images.extractall(FOLDER)
 27 |         
 28 | # image_data = cv2.imread('fashion_mnist_images/train/7/0002.png',cv2.IMREAD_UNCHANGED)
 29 | # np.set_printoptions(linewidth=200)
 30 | # print(image_data)
 31 | #
 32 | # plt.imshow(image_data, cmap='gray')
 33 | # plt.show()
 34 | ```
 35 | 
 36 | ![image-20230814141508650](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308141415725.png)
 37 | 
 38 | > A shoe
 39 | 
 40 | ### **Load the data**
 41 | 
 42 | ```py
 43 | # Load the MNIST dataset
 44 | def load_mnist_dataset(dataset, path):
 45 | 
 46 |     # Takes the dataset name and its path
 47 |     # List the class folders
 48 |     labels = os.listdir(os.path.join(path, dataset))
 49 | 
 50 |     X = []
 51 |     y = []
 52 | 
 53 |     # Open each class folder
 54 |     for label in labels:
 55 |         # Loop over every file in it
 56 |         for file in os.listdir(os.path.join(path, dataset, label)):
 57 |             # Read the image
 58 |             image = cv2.imread(os.path.join(path, dataset, label, file), cv2.IMREAD_UNCHANGED)
 59 | 
 60 |             # Append to the lists
 61 |             X.append(image)
 62 |             y.append(label)
 63 | 
 64 |     return np.array(X), np.array(y).astype('uint8')
 65 | 
 66 | # Build the dataset; calls load_mnist_dataset internally
 67 | def create_data_mnist(path):
 68 | 
 69 |     # Load the training and test sets
 70 |     X, y = load_mnist_dataset('train', path)
 71 |     X_test, y_test = load_mnist_dataset('test', path)
 72 | 
 73 |     return X, y, X_test, y_test
 74 | ```
 75 | 
 76 | ### **Preprocess the data**
 77 | 
 78 | ```py
 79 | def data_preprocess():
 80 |     X, y, X_test, y_test = create_data_mnist('D:/python_workplace/pycharm/workplace/NNFS_py38_NNFS/fashion_mnist_images')
 81 | 
 82 |     # Scale the pixel values to the range [-1, 1], which helps training
 83 |     X = (X.astype(np.float32) - 127.5) / 127.5
 84 |     X_test = (X_test.astype(np.float32) - 127.5) / 127.5
 85 | 
 86 |     # The model is fully connected, so each 2-D image is flattened to 1-D
 87 |     X = X.reshape(X.shape[0],-1)
 88 |     X_test = X_test.reshape(X_test.shape[0], -1)
 89 | 
 90 |     # Shuffle the data
 91 |     key = np.array(range(X.shape[0]))
 92 |     np.random.shuffle(key)
 93 |     X = X[key]
 94 |     y = y[key]
 95 | 
 96 |     return X, y, X_test, y_test
 97 | ```
 98 | 
 99 | > Preprocessing consists of scaling, flattening the 2-D images to 1-D, and shuffling the data.
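
The three steps can be tried on synthetic data without downloading anything. In this sketch the random images and the trick of writing each label into pixel (0, 0) are assumptions made only so that the image/label pairing can be checked after shuffling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 6 random 28x28 uint8 images
X = rng.integers(0, 256, size=(6, 28, 28), dtype=np.uint8)
y = np.arange(6, dtype=np.uint8)
X[:, 0, 0] = y   # tag each image with its label so the pairing can be verified

# 1. Scale from [0, 255] to [-1, 1]
X_scaled = (X.astype(np.float32) - 127.5) / 127.5
# 2. Flatten each image to a vector of 784 values
X_flat = X_scaled.reshape(X_scaled.shape[0], -1)
# 3. Shuffle samples and labels with the same index permutation
keys = np.arange(X_flat.shape[0])
rng.shuffle(keys)
X_shuffled = X_flat[keys]
y_shuffled = y[keys]

# Recover the tag from the first pixel to confirm images and labels still match
recovered = np.rint(X_shuffled[:, 0] * 127.5 + 127.5).astype(np.uint8)
print(X_shuffled.shape, np.array_equal(recovered, y_shuffled))
```

Indexing both arrays with one shared permutation is what keeps each image attached to its label.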
100 | 
101 | ### **Example**
102 | 
103 | ```py
104 | X, y, X_test, y_test = data_preprocess()
105 | model = Model()
106 | model.add(Layer_Dense(X.shape[1], 64, weight_L2=5e-4,bias_L2=5e-4))
107 | model.add(Activation_ReLu())
108 | model.add(Layer_Dense(64, 64))
109 | model.add(Activation_ReLu())
110 | model.add(Layer_Dense(64, 10))
111 | model.add(Activation_Softmax())
112 | model.set(loss=Loss_CategoricalCrossentropy(),
113 |           optimizer=Optimizer_Adam(decay=5e-7),
114 |           accuracy=Accuracy_Classification())
115 | 
116 | model.finalize()
117 | 
118 | model.train(X, y, batch_size=100, validation_data=(X_test, y_test), epochs=5, print_every=10)
119 | model.evaluate(X_test, y_test, batch_size=10)
120 | # Returns the probability of each class
121 | confidence = model.predict(X_test[95:105])
122 | prediction = model.output_layer.prediction(confidence)
123 | 
124 | print('predicted classes:', prediction)
125 | print('ground truth:', y_test[95:105])
126 | ```
127 | 
128 | ![image-20230814150101484](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308141501545.png)
129 | 
130 | > The model performs very well.


--------------------------------------------------------------------------------
/12Dataset/Dataset.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/12Dataset/Dataset.pdf


--------------------------------------------------------------------------------
/13SaveandLoadModel/SaveandLoadModel.md:
--------------------------------------------------------------------------------
  1 | # Save and Load Model
  2 | 
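  3 | ## 1. Overview
  4 | 
  5 | This part implements two ways of saving and loading a model: saving only the parameters, and saving the entire model object.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### 1. Saving the parameters
 10 | 
 11 | #### **Modifying Layer_Dense**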
 12 | 
 13 | ```py
 14 | class Layer_Dense:
 15 |     def get_paramter(self):
 16 |         return self.weight, self.bias
 17 | 
 18 |     def set_paramter(self, weight, bias):
 19 |         self.weight = weight
 20 |         self.bias = bias
 21 | ```
 22 | 
 23 | > The get_paramter and set_paramter methods are added so that a dense layer can return and load its parameters.
 24 | 
 25 | #### **Modifying Model**
 26 | 
 27 | ```py
 28 | class Model():
 29 |     def get_paramter(self):
 30 |         paramter = []
 31 |         for layer in self.trainable_layer:
 32 |             paramter.append(layer.get_paramter())
 33 |         return paramter
 34 | 
 35 |     def set_paramter(self, paramter):
 36 |         for paramter_set, layer in zip(paramter, self.trainable_layer):
 37 |             layer.set_paramter(*paramter_set)
 38 |             
 39 |     def save_paramter(self, path):
 40 |         with open(path, 'wb') as f:
 41 |             pickle.dump(self.get_paramter(), f)
 42 |             
 43 |     def load_paramter(self, path):
 44 |         with open(path, 'rb') as f:
 45 |             self.set_paramter(pickle.load(f))
 46 | ```
 47 | 
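 48 | > The get_paramter, set_paramter, save_paramter and load_paramter methods are added so that the whole model can return, set, save and load its parameters. save_paramter and load_paramter use the standard-library pickle module, so the file must contain import pickle.
 49 | 
 50 | #### **Example 1**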
 51 | 
 52 | ```py
 53 | X, y, X_test, y_test = data_preprocess()
 54 | print(X.shape, X_test.shape)
 55 | 
 56 | 
 57 | model = Model()
 58 | model.add(Layer_Dense(X.shape[1], 64, weight_L2=5e-4, bias_L2=5e-4))
 59 | model.add(Activation_ReLu())
 60 | model.add(Layer_Dense(64, 64))
 61 | model.add(Activation_ReLu())
 62 | model.add(Layer_Dense(64, 10))
 63 | model.add(Activation_Softmax())
 64 | model.set(loss=Loss_CategoricalCrossentropy(),
 65 |           optimizer=Optimizer_Adam(decay=5e-7),
 66 |           accuracy=Accuracy_Classification())
 67 | 
 68 | 
 69 | model.finalize()
 70 | 
 71 | model.train(X, y, batch_size=100, validation_data=(X_test, y_test), epochs=5, print_every=10)
 72 | model.evaluate(X_test, y_test, batch_size=10)
 73 | # Return the probability of each class
 74 | confidence = model.predict(X_test[95:105])
 75 | prediction = model.output_layer.prediction(confidence)
 76 | 
 77 | print('预测分类：', prediction)
 78 | print('ground truth：', y_test[95:105])
 79 | 
 80 | 
 81 | ###############################################################
 82 | # Rebuild a fresh model
 83 | # Get the parameters
 84 | paramter = model.get_paramter()
 85 | 
 86 | model = Model()
 87 | model.add(Layer_Dense(X.shape[1], 64, weight_L2=5e-4, bias_L2=5e-4))
 88 | model.add(Activation_ReLu())
 89 | model.add(Layer_Dense(64, 64))
 90 | model.add(Activation_ReLu())
 91 | model.add(Layer_Dense(64, 10))
 92 | model.add(Activation_Softmax())
 93 | model.set(loss=Loss_CategoricalCrossentropy(),
 94 |           optimizer=Optimizer_Adam(decay=5e-7),
 95 |           accuracy=Accuracy_Classification())
 96 | 
 97 | model.finalize()
 98 | 
 99 | # Load the parameters
100 | model.set_paramter(paramter)
101 | 
102 | model.evaluate(X_test, y_test, batch_size=10)
103 | ##################################################################
104 | ```
105 | 
106 | #### **Example 2**
107 | 
108 | ```py
109 | model.save_paramter('Mode_paramter.para')
110 | model.load_paramter('Mode_paramter.para')
111 | ```
112 | 
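113 | > The parameters are saved to the file Mode_paramter.para and then loaded back from that file. (The file extension is arbitrary.)
114 | 
115 | ![image-20230814174709195](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308141747265.png)
116 | 
117 | > The first model and the reloaded model perform identically on the test set.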
118 | 
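The save/load round trip can be checked without training a model. This is a minimal sketch, assuming two hypothetical layers with random weights: the list `paramter` mirrors the list of `(weight, bias)` tuples that `Model.get_paramter` returns, and the file name `demo.para` and the temporary directory are illustrative choices.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical parameters for two dense layers: a list of (weight, bias) tuples.
rng = np.random.default_rng(0)
paramter = [
    (rng.standard_normal((4, 3)), np.zeros(3)),
    (rng.standard_normal((3, 2)), np.zeros(2)),
]

path = os.path.join(tempfile.mkdtemp(), 'demo.para')

# Save: serialize the list in binary write mode.
with open(path, 'wb') as f:
    pickle.dump(paramter, f)

# Load: deserialize in binary read mode.
with open(path, 'rb') as f:
    restored = pickle.load(f)

# Every weight and bias should come back unchanged.
same = all(
    np.array_equal(w0, w1) and np.array_equal(b0, b1)
    for (w0, b0), (w1, b1) in zip(paramter, restored)
)
print(same)
```

Because `pickle.load` can execute arbitrary code while deserializing, parameter files should only be loaded from trusted sources.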
119 | ### 2. Saving the entire model
120 | 
121 | ```py
122 | class Model():
123 |     def save_Model(self, path):
124 |         model = copy.deepcopy(self)
125 | 
126 |         # Remove state that is not needed after training,
127 |         # which shrinks the model file and speeds up saving and loading
128 |         model.loss.clean_cumulate()
129 |         model.accuracy.clean_cumulate()
130 |         model.input_layer.__dict__.pop('output', None)
131 |         model.loss.__dict__.pop('dinput',None)
132 | 
133 |         for layer in model.layer:
134 |             for property in ['input','output','dinput','dweight','dbias']:
135 |                 layer.__dict__.pop(property, None)
136 | 
137 |         with open(path, 'wb') as f:
138 |             pickle.dump(model,f)
139 | 
140 |     # load_Model can be called without first instantiating a Model object
141 |     @staticmethod
142 |     def load_Model(path):
143 |         with open(path, 'rb') as f:
144 |             model = pickle.load(f)
145 |         return model
146 | ```
147 | 
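148 | > The save_Model and load_Model methods are added. load_Model uses the @staticmethod decorator, which lets a method be called on the class itself rather than on an instance, so it takes no self parameter (note its absence from the function definition). Here it is used to create a model object directly from a file without instantiating a Model first: the method opens the file at the given path in binary read mode and uses pickle to deserialize the saved model. These methods need import copy and import pickle.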
149 | 
150 | ![image-20230814182028668](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308141820744.png)
151 | 
152 | > The first model and the reloaded model perform identically on the test set.


--------------------------------------------------------------------------------
/13SaveandLoadModel/SaveandLoadModel.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/13SaveandLoadModel/SaveandLoadModel.pdf


--------------------------------------------------------------------------------
/1Dense Layer/Dense Layer.md:
--------------------------------------------------------------------------------
 1 | # Dense Layer
 2 | 
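 3 | ## 1. Overview
 4 | 
 5 | This part builds the Dense Layer class (also known as a fully connected layer). For now it contains only the forward method, i.e. forward propagation; the remaining functionality is added in later parts.
 6 | 
 7 | ## 2. Code
 8 | 
 9 | ### 1. Generating the data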
10 | 
11 | ~~~python
12 | import numpy as np
13 | from nnfs.datasets import spiral_data
14 | import matplotlib.pyplot as plt
15 | 
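16 | # Generate the data
17 | X, y = spiral_data(samples=100, classes=3)
18 | # Check the shapes of the data
19 | print(X.shape,y.shape)
20 | # Scatter plot coloured by the values in y, using the 'brg' colormap
21 | # Note that the Matplotlib built-in colormap is named 'brg', not the more familiar 'rgb'
22 | plt.scatter(X[:,0],X[:,1],c=y,cmap='brg')
23 | # Show the figure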
24 | plt.show()
25 | ~~~
26 | 
27 | ![image-20230806222228216](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308071220418.png)
28 | 
29 | > X has shape (300, 2) and y has shape (300,).
30 | 
31 | ![image-20230806221844756](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308071218164.png)
32 | 
33 | > The data form a spiral with three classes.
34 | 
35 | ### 2. The Dense Layer class
36 | 
37 | ~~~py
38 | class Layer_Dense:
39 |     def __init__(self, n_input, n_neuron):
40 |         # Initialise the weights from a normal distribution
41 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
42 |         # Initialise the biases to 0
43 |         self.bias = np.zeros(n_neuron)
44 | 
45 |     def forward(self, input):
46 |         self.output = np.dot(input, self.weight) + self.bias
47 | ~~~
48 | 
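The shapes involved in the forward pass can be checked on a tiny batch. This is a minimal sketch, assuming a hypothetical batch of 5 samples with 2 features; `inputs`, `weight` and `bias` are standalone arrays with the same shapes that `Layer_Dense(2, 3)` would hold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 5 samples with 2 features each.
inputs = rng.standard_normal((5, 2))
# Same shapes as Layer_Dense(2, 3): weight is (n_input, n_neuron), bias is (n_neuron,).
weight = 0.01 * rng.standard_normal((2, 3))
bias = np.zeros(3)

# (5, 2) dot (2, 3) -> (5, 3); the bias of shape (3,) is broadcast across the 5 rows.
output = np.dot(inputs, weight) + bias

# The same result computed one sample at a time.
row_by_row = np.stack([x @ weight + bias for x in inputs])

print(output.shape)
print(np.allclose(output, row_by_row))
```

Storing the weights as `(n_input, n_neuron)` means a whole batch goes through the layer in one `np.dot` call, with no transpose needed.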
49 | ### 3. Example
50 | 
51 | ~~~py
52 | # Build a Dense layer with three neurons
53 | dense = Layer_Dense(2,3)
54 | # Forward pass
55 | dense.forward(X)
56 | # Print the result
57 | print(dense.output[:5])
58 | ~~~
59 | 
60 | ![image-20230807111700067](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308071222988.png)
61 | 
62 | 


--------------------------------------------------------------------------------
/1Dense Layer/Dense Layer.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/1Dense Layer/Dense Layer.pdf


--------------------------------------------------------------------------------
/1Dense Layer/NNFS_version1.py:
--------------------------------------------------------------------------------
 1 | """
 2 | 作者：黄欣
 3 | 日期：2023年08月06日
 4 | """
 5 | import numpy as np
 6 | from nnfs.datasets import spiral_data
 7 | import matplotlib.pyplot as plt
 8 | 
 9 | # Generate the data
10 | X, y = spiral_data(samples=100, classes=3)
11 | # Check the shapes of the data
12 | print(X.shape,y.shape)
13 | # # Scatter plot coloured by the values in y, using the 'brg' colormap
14 | # # Note that the Matplotlib built-in colormap is named 'brg', not the more familiar 'rgb'
15 | # plt.scatter(X[:,0],X[:,1],c=y,cmap='brg')
16 | # # Show the figure
17 | # plt.show()
18 | 
19 | class Layer_Dense:
20 |     def __init__(self, n_input, n_neuron):
21 |         # Initialise the weights from a normal distribution
22 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
23 |         # Initialise the biases to 0
24 |         self.bias = np.zeros(n_neuron)
25 | 
26 |     def forward(self, input):
27 |         self.output = np.dot(input, self.weight) + self.bias
28 | 
29 | # Build a Dense layer with three neurons
30 | dense = Layer_Dense(2,3)
31 | # Forward pass
32 | dense.forward(X)
33 | # Print the result
34 | print(dense.output[:5])


--------------------------------------------------------------------------------
/2Activation/Activation.md:
--------------------------------------------------------------------------------
  1 | # Activation
  2 | 
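  3 | ## 1. Overview
  4 | 
  5 | This part implements commonly used activation functions: ReLu (Rectified Linear Unit), Softmax and Sigmoid. Only the forward method is implemented here; backpropagation is added later.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### 1. Sigmoid
 10 | 
 11 | 1. Formula
 12 | 
 13 |    Here $z_{i,j}$ is the input to the activation function and $\sigma_{i,j}$ is a single output value. The index $i$ denotes the current sample and the index $j$ the current output within that sample. $\sigma_{i,j}$ can be read as the confidence for the $j$-th binary pair, for example the confidence of the "dog" class in cat-vs-dog classification. A model may of course classify several such pairs at once, e.g. tall/short and fat/thin. Sigmoid is used for binary classification.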
 14 |    $$
 15 |    \sigma_{i,j}=\frac{1}{1+e^{-z_{i,j}}}
 16 |    $$
 17 | 
 18 | 2. Implementation
 19 | 
 20 |    ~~~py
 21 |    class Activation_Sigmoid:
 22 |        def __init__(self):
 23 |            pass
 24 |    
 25 |        def forward(self, input):
 26 |            # input has shape (n, 1): n samples enter the activation, one value per sample,
 27 |            # so the preceding layer must be Layer_Dense(n_input, 1)
 28 |            self.output = 1 / ( 1 + np.exp(-input) )
 29 |    ~~~
 30 | 
 31 |    
 32 | 
 33 | 3. Example
 34 | 
 35 |    ~~~py
 36 |    # Generate the data
 37 |    X, y = spiral_data(samples=100, classes=2)
 38 |    # Build a Dense layer with one neuron
 39 |    dense = Layer_Dense(2,1)
 40 |    # Build the Sigmoid activation function
 41 |    activation1 = Activation_Sigmoid()
 42 |    
 43 |    # Forward pass
 44 |    dense.forward(X)
 45 |    activation1.forward(dense.output)
 46 |    # Print the result
 47 |    print(activation1.output[:5])
 48 |    ~~~
 49 | 
 50 |    ![image-20230807190723399](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308071907435.png)
 51 | 
 52 | > The output has shape (n, 1) and gives the confidence of the binary classification for each sample.
 53 | 
 54 | ### 2. ReLu
 55 | 
 56 | 1. Formula
 57 |    $$
 58 |    y=\begin{cases}x,x > 0\\0,x \le 0\end{cases}
 59 |    $$
 60 |    
 61 | 2. Implementation
 62 | 
 63 |    ~~~python
 64 |    class Activation_ReLu:
 65 |        def __init__(self):
 66 |            pass
 67 |    
 68 |        def forward(self,input):
 69 |            self.output = np.maximum(0,input)
 70 |    ~~~
 71 | 
 72 |    
 73 | 
 74 | 3. Example
 75 | 
 76 |    ~~~python
 77 |    # Generate the data
 78 |    X, y = spiral_data(samples=100, classes=3)
 79 |    # Build a Dense layer with three neurons
 80 |    dense = Layer_Dense(2,3)
 81 |    # Build the ReLu activation function
 82 |    activation1 = Activation_ReLu()
 83 |    
 84 |    # Forward pass
 85 |    dense.forward(X)
 86 |    activation1.forward(dense.output)
 87 |    # Print the result
 88 |    print(activation1.output[:5])
 89 |    ~~~
 90 | 
 91 |    ![image-20230807193322759](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308071933795.png)
 92 | 
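 93 | ### 3. Softmax
 94 | 
 95 | The Softmax function converts a vector of $L$ real numbers into a probability distribution over $L$ possible outcomes. The index $i$ denotes the current sample, the index $j$ the current output within that sample, and $S_{i,j}$ the probability of the $j$-th outcome.
 96 | 
 97 |   1. Formula
 98 |      $$
 99 |      S_{i,j}=\frac{e^{z_{i,j}}}{\sum\limits_{l=1}^L{e^{z_{i,l}}}}
100 |      $$
101 |      
102 |   2. Implementation
103 | 
104 |      The exponentials in softmax grow very quickly, so large inputs can overflow to inf and turn the outputs into nan. To avoid this, the maximum input of each sample is subtracted before exponentiation. This does not change the result of softmax, because the numerator and the denominator are both divided by the same constant $e^{\max_l z_{i,l}}$. It does make every exponent at most 0, so each exponential lies in $(0, 1]$ and cannot overflow.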
105 | 
106 |      ~~~python
107 |      class Activation_Softmax:
108 |          def __init__(self):
109 |              pass
110 |      
111 |          def forward(self,input):
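112 |              # keepdims=True is required:
113 |              # without it, np.max(input, axis=1) returns a 1-D array of shape (n,) instead of a column of shape (n, 1),
114 |              # whose length generally does not match the row length of input,
115 |              # so it cannot be broadcast
116 |              # Subtract the row maximum before exponentiating so that the exponents are at most 0 and np.exp cannot overflow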
117 |              self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
118 |              self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
119 |      ~~~
120 | 
121 |      
122 | 
123 |   3. Example
124 | 
125 |      ```python
126 |      # Generate the data
127 |      X, y = spiral_data(samples=100, classes=3)
128 |      # Build a Dense layer with three neurons
129 |      dense = Layer_Dense(2,3)
130 |      # Build the Softmax activation function
131 |      activation1 = Activation_Softmax()
132 |      
133 |      # Forward pass
134 |      dense.forward(X)
135 |      activation1.forward(dense.output)
136 |      # Print the result
137 |      print(activation1.output[:5])
138 |      ```
139 | 
140 | ![image-20230807202215791](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308072022825.png)
141 | 
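The effect of subtracting the row maximum can be seen by comparing a naive softmax with the shifted one. This is a minimal sketch: `softmax_naive` and `softmax_stable` are hypothetical standalone functions written for this comparison, not methods of `Activation_Softmax`, and the input values are made up.

```python
import numpy as np

def softmax_naive(z):
    # Exponentiate the raw inputs; overflows for large values.
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_stable(z):
    # Shift each row by its maximum first, so every exponent is <= 0.
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# For moderate inputs both versions agree.
print(np.allclose(softmax_naive(small), softmax_stable(small)))

# For large inputs np.exp overflows to inf in the naive version, and inf / inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive_large = softmax_naive(large)
print(np.isnan(naive_large).any())

# Shifting all inputs by a constant does not change softmax,
# so `large` (= small + 999) yields the same distribution as `small`.
print(np.allclose(softmax_stable(large), softmax_stable(small)))
```

The shift is purely a numerical safeguard: it leaves the probabilities unchanged and only keeps the intermediate exponentials in a representable range.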
142 | 
143 | 
144 | 


--------------------------------------------------------------------------------
/2Activation/Activation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/2Activation/Activation.pdf


--------------------------------------------------------------------------------
/2Activation/NNFS_version2.py:
--------------------------------------------------------------------------------
 1 | """
 2 | 作者：黄欣
 3 | 日期：2023年08月07日
 4 | """
 5 | 
 6 | # 版本增加了Activation类，只实现了前向传播
 7 | 
 8 | import numpy as np
 9 | from nnfs.datasets import spiral_data
10 | import matplotlib.pyplot as plt
11 | 
12 | 
13 | class Layer_Dense:
14 |     def __init__(self, n_input, n_neuron):
15 |         # 用正态分布初始化权重
16 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
17 |         # 将bias(偏差)初始化为0
18 |         self.bias = np.zeros(n_neuron)
19 | 
20 |     def forward(self, input):
21 |         self.output = np.dot(input, self.weight) + self.bias
22 | 
23 | class Activation_Sigmoid:
24 |     def __init__(self):
25 |         pass
26 | 
27 |     def forward(self, input):
28 |         # input的大小是nx1，n是Activation输入的sample数量，每个sample只有一个维度。
29 |         # 所以前一个hidden layer必须是Layer_Dense(n, 1)
30 |         self.output = 1 / ( 1 + np.exp(-input) )
31 | 
32 | class Activation_ReLu:
33 |     def __init__(self):
34 |         pass
35 | 
36 |     def forward(self,input):
37 |         self.output = np.maximum(0,input)
38 | 
39 | class Activation_Softmax:
40 |     def __init__(self):
41 |         pass
42 | 
43 |     def forward(self,input):
44 |         # 要有keepdims=True参数设置
45 |         # 如没有设置，则np.max(input, axis=1)后的列向量会变成行向量，
46 |         # 而行向量长度不与input的每一行长度相同，
47 |         # 则无法广播
48 |         # 进行指数运算之前，从输入值中减去最大值，使输入值更小，从而避免指数运算产生过大的数字
49 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
50 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
51 | 
52 | # 生成数据
53 | X, y = spiral_data(samples=100, classes=3)
54 | # 构建一个含三个神经元的Dense层实例
55 | dense = Layer_Dense(2,3)
56 | # 构建Softmax激活函数
57 | activation1 = Activation_Softmax()
58 | 
59 | # 前向传播
60 | dense.forward(X)
61 | activation1.forward(dense.output)
62 | # 输出结果
63 | print(activation1.output[:5])
64 | 
65 | 
66 | 


--------------------------------------------------------------------------------
/3Loss/Loss.md:
--------------------------------------------------------------------------------
  1 | # Loss
  2 | 
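  3 | ## 1. Overview
  4 | 
  5 | This part implements the Loss parent class and the CategoricalCrossentropy class, which inherits from Loss. Only the forward method is implemented here; backpropagation is added later.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### 1. The Loss parent class
 10 | 
 11 |   1. Implementation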
 12 | 
 13 |      ```python
 14 |      class Loss:
 15 |          def __init__(self):
 16 |              pass
 17 |      
 18 |          # Losses are always computed by calling calculate
 19 |          def calculate(self, prediction, y):
 20 |              # Each loss function inherits from Loss and implements its own forward method.
 21 |              data_loss = np.mean( self.forward(prediction, y) )
 22 |              # Note: the loss is not stored as an attribute; it is returned directly
 23 |              return data_loss
 24 |      ```
 25 | 
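 26 | ### 2. The CategoricalCrossentropy class
 27 | 
 28 |   1. Formula
 29 |      $$
 30 |      L_i=-\sum\limits_jy_{i,j}\log(\hat{y}_{i,j})
 31 |      $$
 32 | 
 33 |      > When the predicted probabilities for classes A, B and C are 0.7, 0.1 and 0.2 and the true class is A, $L_i$ is computed as follows, where $i$ indicates that the loss is computed for the $i$-th sample.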
 34 | 
 35 |      ![](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308072055997.png)
 36 | 
 37 |   2. Implementation
 38 | 
 39 |      ```python
 40 |      class Loss_CategoricalCrossentropy(Loss):
 41 |          def __init__(self):
 42 |              pass
 43 |      
 44 |          def forward(self, y_pred, y_true):
 45 |              # Number of samples
 46 |              n_sample = len(y_true)
 47 |      
 48 |              # Clip at 1e-7 on the left to avoid log(0).
 49 |              # Clipping only on the left would shift the confidences slightly towards 1,
 50 |              # so the right bound is set symmetrically to 1 - 1e-7
 51 |              y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
 52 |      
 53 |              loss = - np.log(y_pred)
 54 |              if len(y_true.shape) == 2:# labels are one-hot encoded
 55 |                  loss = np.sum(loss * y_true,axis=1)
 56 |              elif len(y_true.shape) == 1:# labels are class indices
 57 |                  # Note that loss[:, y_true] is different: it would return a matrix
 58 |                  loss = loss[range(n_sample), y_true]
 59 |      
 60 |              return loss
 61 |      ```
 62 | 
 63 |   3. Example
 64 | 
 65 |      ```python
 66 |      # Generate the data
 67 |      X, y = spiral_data(samples=100, classes=3)
 68 |      # Build a Dense layer with three neurons
 69 |      dense1 = Layer_Dense(2,3)
 70 |      # Build the ReLu activation function
 71 |      activation1 = Activation_ReLu()
 72 |      # Build a Dense layer with 4 neurons
 73 |      dense2 = Layer_Dense(3,4)
 74 |      # Build the Softmax activation function
 75 |      activation2 = Activation_Softmax()
 76 |      # Build the loss function
 77 |      loss = Loss_CategoricalCrossentropy()
 78 |      
 79 |      # Forward pass
 80 |      dense1.forward(X)
 81 |      activation1.forward(dense1.output)
 82 |      dense2.forward(activation1.output)
 83 |      activation2.forward(dense2.output)
 84 |      dataloss = loss.calculate(activation2.output, y)
 85 |      
 86 |      # Print the result
 87 |      print('loss =',dataloss)
 88 |      
 89 |      # Compute the accuracy
 90 |      soft_output = activation2.output
 91 |      # Take the class with the highest confidence as the prediction
 92 |      prediction = np.argmax(soft_output,axis=1)
 93 |      # If y is one-hot encoded
 94 |      if len(y.shape) == 2:
 95 |          # convert it to class indices
 96 |          y = np.argmax(y,axis=1)
 97 |      
 98 |      accuracy = np.mean(prediction == y)
 99 |      print("accuracy =",accuracy)
100 |      ```
101 | 
102 | ![image-20230807220820346](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308072208379.png)
103 | 
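The worked example from the formula (probabilities 0.7, 0.1, 0.2 with true class A) can be reproduced numerically. This is a minimal sketch: the second row of `y_pred` is a made-up extra sample, and the names `y_sparse`, `y_onehot`, `loss_sparse` and `loss_onehot` are illustrative.

```python
import numpy as np

# Predicted probabilities for classes A, B, C; the first row is the example from the text.
y_pred = np.array([[0.7, 0.1, 0.2],
                   [0.1, 0.5, 0.4]])

# The same labels in two encodings: class indices and one-hot.
y_sparse = np.array([0, 1])
y_onehot = np.eye(3)[y_sparse]

clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

# Class-index labels: pick the predicted probability of the true class in each row.
loss_sparse = -np.log(clipped[np.arange(len(y_sparse)), y_sparse])
# One-hot labels: the sum keeps only the term of the true class.
loss_onehot = -np.sum(y_onehot * np.log(clipped), axis=1)

print(np.round(loss_sparse, 4))
print(np.allclose(loss_sparse, loss_onehot))
```

For the first sample the loss is $-\log(0.7) \approx 0.357$. Both label encodings give the same per-sample loss, which is why `forward` can branch on the shape of `y_true`.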
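104 | ### 3. Binary Cross-Entropy
105 | 
106 |   1. Formula
107 | 
108 |      ![image-20230808225158008](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308082251060.png)
109 | 
110 |      > A model can have several binary outputs, and unlike categorical cross-entropy each output does not carry a confidence for every class, so the loss computed for a single sample is a vector with one value per output. The main differences from CategoricalCrossentropy are:
111 |      >
112 |      > * In CategoricalCrossentropy the classes are mutually exclusive.
113 |      > * In Binary Cross-Entropy the two outcomes of one binary output are mutually exclusive, but different binary outputs are not.
114 |      > * For example, male/female are mutually exclusive and tall/short are mutually exclusive, but male/female and tall/short do not exclude each other.
115 | 
116 |   2. Implementation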
117 | 
118 |      ```py
119 |      class Loss_BinaryCrossentropy(Loss):
120 |          def __init__(self):
121 |              pass
122 |      
123 |          def forward(self, y_pred, y_true):
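124 |              # Pay special attention here; the book does not spell this out.
125 |              # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
126 |              # (n_sample,) and (n_sample, 1) broadcast against each other,
127 |              # so without the reshape below the loss would have shape (n_sample, n_sample).
128 |              # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
129 |              if len(y_true.shape) == 1: # y_true is a 1-D array
130 |                  y_true = y_true.reshape(-1,1)
131 |              # Clip at 1e-7 on the left to avoid log(0).
132 |              # Clipping only on the left would shift the confidences slightly towards 1,
133 |              # so the right bound is set symmetrically to 1 - 1e-7
134 |              y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
135 |              loss = -  np.log(y_pred) * y_true  - np.log(1 - y_pred) * (1 - y_true)
136 |              # This mean is taken over a different axis than the one in the parent's calculate:
137 |              # here it averages over the binary outputs of each sample,
138 |              # while calculate averages over the samples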
139 |              loss = np.mean(loss, axis=-1)
140 |              return loss
141 |      ```
142 | 
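The broadcasting pitfall mentioned in the comments is easy to reproduce. This is a minimal sketch with made-up predictions for four samples: `bce_terms` is a hypothetical helper that computes only the element-wise loss terms, without the reshape or the final mean.

```python
import numpy as np

y_pred = np.array([[0.9], [0.2], [0.7], [0.4]])  # shape (4, 1)
y_true = np.array([1, 0, 1, 0])                  # shape (4,)

def bce_terms(y_pred, y_true):
    # Element-wise binary cross-entropy terms.
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return -np.log(y_pred) * y_true - np.log(1 - y_pred) * (1 - y_true)

# Without the reshape, (4, 1) and (4,) broadcast to (4, 4):
# every prediction is paired with every label.
wrong = bce_terms(y_pred, y_true)
# After reshaping the labels to (4, 1) there is one value per sample.
right = bce_terms(y_pred, y_true.reshape(-1, 1))

print(wrong.shape)
print(right.shape)
```

NumPy raises no error in the `(4, 4)` case, so the mistake would only show up as a wrong loss value. That is why `forward` reshapes 1-D labels explicitly.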
143 | 
144 | 
145 | 
146 | 
147 | 


--------------------------------------------------------------------------------
/3Loss/Loss.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/3Loss/Loss.pdf


--------------------------------------------------------------------------------
/3Loss/NNFS_version3.py:
--------------------------------------------------------------------------------
  1 | """
  2 | 作者：黄欣
  3 | 日期：2023年08月07日
  4 | """
  5 | 
  6 | # 版本增加了Loss类和CategoricalCrossentropy类（继承了Loss类），只实现了前向传播
  7 | 
  8 | import numpy as np
  9 | from nnfs.datasets import spiral_data
 10 | import matplotlib.pyplot as plt
 11 | 
 12 | 
 13 | class Layer_Dense:
 14 |     def __init__(self, n_input, n_neuron):
 15 |         # 用正态分布初始化权重
 16 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
 17 |         # 将bias(偏差)初始化为0
 18 |         self.bias = np.zeros(n_neuron)
 19 | 
 20 |     def forward(self, input):
 21 |         self.output = np.dot(input, self.weight) + self.bias
 22 | 
 23 | class Activation_Sigmoid:
 24 |     def __init__(self):
 25 |         pass
 26 | 
 27 |     def forward(self, input):
 28 |         # input的大小是nx1，n是Activation输入的sample数量，每个sample只有一个维度。
 29 |         # 所以前一个hidden layer必须是Layer_Dense(n, 1)
 30 |         self.output = 1 / ( 1 + np.exp(-input) )
 31 | 
 32 | class Activation_ReLu:
 33 |     def __init__(self):
 34 |         pass
 35 | 
 36 |     def forward(self,input):
 37 |         self.output = np.maximum(0,input)
 38 | 
 39 | class Activation_Softmax:
 40 |     def __init__(self):
 41 |         pass
 42 | 
 43 |     def forward(self,input):
 44 |         # 要有keepdims=True参数设置
 45 |         # 如没有设置，则np.max(input, axis=1)后的列向量会变成行向量，
 46 |         # 而行向量长度不与input的每一行长度相同，
 47 |         # 则无法广播
 48 |         # 进行指数运算之前，从输入值中减去最大值，使输入值更小，从而避免指数运算产生过大的数字
 49 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
 50 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
 51 | 
 52 | class Loss:
 53 |     def __init__(self):
 54 |         pass
 55 | 
 56 |     # Losses are always computed by calling calculate
 57 |     def calculate(self, y_pred, y_true):
 58 |         # Each loss function inherits from Loss and implements its own forward method.
 59 |         data_loss = np.mean( self.forward(y_pred, y_true) )
 60 |         # Note: the loss is not stored as an attribute; it is returned directly
 61 |         return data_loss
 62 | 
 63 | class Loss_CategoricalCrossentropy(Loss):
 64 |     def __init__(self):
 65 |         pass
 66 | 
 67 |     def forward(self, y_pred, y_true):
 68 |         # 多少个样本
 69 |         n_sample = len(y_true)
 70 | 
 71 |         # 为了防止log(0)，所以以1e-7为左边界
 72 |         # 另一个问题是将置信度向1移动，即使是非常小的值，
 73 |         # 为了防止偏移，右边界为1 - 1e-7
 74 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
 75 | 
 76 |         loss = - np.log(y_pred)
 77 |         if len(y_true.shape) == 2:# 标签是onehot的编码
 78 |             loss = np.sum(loss * y_true,axis=1)
 79 |         elif len(y_true.shape) == 1:# 只有一个类别标签
 80 |             # 注意loss = loss[:, y_ture]是不一样的，这样会返回一个矩阵
 81 |             loss = loss[range(n_sample), y_true]
 82 | 
 83 |         # 这里不用求均值，父类中的calculate方法中求均值
 84 |         return loss
 85 | 
 86 | class Loss_BinaryCrossentropy(Loss):
 87 |     def __init__(self):
 88 |         pass
 89 | 
 90 |     def forward(self, y_pred, y_true):
 91 |         # 多少个样本
 92 |         n_sample = len(y_true)
 93 |         # 这里要特别注意，书上都没有写明
 94 |         # 当只有一对二进制类别时，y_pred大小为(n_sample,1),y_ture大小为(n_sample,)
 95 |         # (n_sample,)和(n_sample,1)一样都可以广播，只是(n_sample,)不能转置
 96 |         # 所以下面的loss大小会变成(n_sample,n_sample)
 97 |         # 当有二对二进制类别时，y_pred大小为(n_sample,2),y_ture大小为(n_sample,2)
 98 |         if len(y_true.shape) == 1: # y_true是个行向量
 99 |             y_true = y_true.reshape(-1,1)
100 |         # 为了防止log(0)，所以以1e-7为左边界
101 |         # 另一个问题是将置信度向1移动，即使是非常小的值，
102 |         # 为了防止偏移，右边界为1 - 1e-7
103 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
104 |         loss = -  np.log(y_pred) * y_true  - np.log(1 - y_pred) * (1 - y_true)
105 |         # 这里的求平均和父类中的calculate求平均的维度不同
106 |         # 这里是对多对的二进制求平均
107 |         # calculate中的求平均是对每个样本可平均
108 |         loss = np.mean(loss, axis=-1)
109 |         return loss
110 | 
111 | # 生成数据
112 | X, y = spiral_data(samples=100, classes=3)
113 | # 构建一个含三个神经元的Dense层实例
114 | dense1 = Layer_Dense(2,3)
115 | # 构建ReLu激活函数
116 | activation1 = Activation_ReLu()
117 | # 构建一个含4个神经元的Dense层实例
118 | dense2 = Layer_Dense(3,4)
119 | # 构建Softmax激活函数
120 | activation2 = Activation_Softmax()
121 | # 构建损失函数
122 | loss = Loss_CategoricalCrossentropy()
123 | 
124 | # 前向传播
125 | dense1.forward(X)
126 | activation1.forward(dense1.output)
127 | dense2.forward(activation1.output)
128 | activation2.forward(dense2.output)
129 | dataloss = loss.calculate(activation2.output, y)
130 | 
131 | # 输出结果
132 | print('loss =',dataloss)
133 | 
134 | # 计算正确率
135 | soft_output = activation2.output
136 | # 返回最大confidence的类别作为预测类别
137 | prediction = np.argmax(soft_output,axis=1)
138 | # 如果y是onehot编码
139 | if len(y.shape) == 2:
140 |     # 将其变为只有一个标签类别
141 |     y = np.argmax(y,axis=1)
142 | 
143 | accuracy = np.mean(prediction == y)
144 | print("accuracy =",accuracy)
145 | 
146 | 
147 | 


--------------------------------------------------------------------------------
/4Backpropagation/Backpropagation.md:
--------------------------------------------------------------------------------
  1 | # Backpropagation
  2 | 
  3 | ## 1. Overview
  4 | 
  5 | This part implements backpropagation for the Dense Layer, the activation functions and the loss.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### 1. Dense Layer
 10 | 
 11 | **Formula**
 12 | $$
 13 | y = wx+b
 14 | $$
 15 | 
 16 | $$
 17 | \frac{\partial y}{\partial w} = x
 18 | $$
 19 | 
 20 | $$
 21 | \frac{\partial y}{\partial x} = w
 22 | $$
 23 | 
 24 | $$
 25 | \frac{\partial y}{\partial b} = 1
 26 | $$
 27 | 
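 28 | > Here $x$ is the input vector, $w$ the weights, $b$ the bias and $y$ the output vector of the Dense Layer. $b$ and $w$ are already stored at initialisation, so the forward pass must also store $x$ as an attribute of the Dense Layer. **Note: like $w$, the $1$ here is a matrix, but of a different size.** The relevant code is as follows:
 29 | 
 30 | **Implementation**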
 31 | 
 32 | ```python
 33 | def forward(self, input):
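 34 |     # Because a backward method is being added:
 35 |     # the derivative of the Layer_Dense output with respect to its input is self.weight,
 36 |     # and the derivative of the Layer_Dense output with respect to self.weight is the input,
 37 |     # so forward has to store the input in self.input
 38 |     self.input = input # self.input is new compared with the previous code versions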
 39 |     self.output = np.dot(input, self.weight) + self.bias
 40 | ```
 41 | 
 42 | **Formula**
 43 | $$
 44 | loss = f(y)
 45 | $$
 46 | 
 47 | $$
 48 | \frac{\partial loss}{\partial y}= dvalue
 49 | $$
 50 | 
 51 | $$
 52 | \frac{\partial loss}{\partial w}=\frac{\partial loss}{\partial y}\frac{\partial y}{\partial w}=dvalue*\frac{\partial y}{\partial w}=dvalue*x
 53 | $$
 54 | 
 55 | $$
 56 | \frac{\partial loss}{\partial x}=\frac{\partial loss}{\partial y}\frac{\partial y}{\partial x}=dvalue*\frac{\partial y}{\partial x}=dvalue*w
 57 | $$
 58 | 
 59 | $$
 60 | \frac{\partial loss}{\partial b}=\frac{\partial loss}{\partial y}\frac{\partial y}{\partial b}=dvalue*\frac{\partial y}{\partial b}=dvalue*1
 61 | $$
 62 | 
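 63 | > Here dvalue is obtained from the backward pass of the following layer and is passed to this layer's backward method as an argument, so it is known inside this layer. The code only needs to supply $\frac{\partial y}{\partial w}$ and $\frac{\partial y}{\partial x}$, i.e. $x$ and $w$:
 64 | 
 65 | **Implementation**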
 66 | 
 67 | ```python
 68 | def backward(self, dvalue):
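 69 |     # dvalue is the derivative of the loss with respect to the input of the next layer (the Activation),
 70 |     # i.e. the derivative of the loss with respect to the output of this layer (Layer_Dense).
 71 |     # The chain rule is used here
 72 | 
 73 |     # First, this layer needs the derivative of the loss with respect to self.weight,
 74 |     # which gives the direction in which to optimise self.weight (the negative gradient direction)
 75 | 
 76 |     # self.dweight must have the same shape as self.weight so that the update w - lr * dw works directly.
 77 |     # Suppose input is a single sample of shape 1xa and weight has shape axb; then output has shape 1xb.
 78 |     # Because the loss is a scalar, dvalue = dloss/doutput has the shape of output (1xb),
 79 |     # so dweight has shape (1xa).T * (1xb) = axb, the same as weight.
 80 |     # Note: when input contains several samples (a matrix), dweight is the sum of the per-sample axb matrices.
 81 |     self.dweight = np.dot(self.input.T, dvalue)
 82 | 
 83 |     # Second, this layer needs the derivative of the loss with respect to self.input,
 84 |     # which is passed as the dvalue argument to the backward method of the preceding layer
 85 | 
 86 |     # Because the loss is a scalar, dinput has the shape of input (1xa),
 87 |     # dvalue = dloss/doutput has the shape of output (1xb),
 88 |     # and weight has shape axb,
 89 |     # so 1xa = (1xb) * (axb).T
 90 |     self.dinput = np.dot(dvalue, self.weight.T)
 91 | 
 92 |     # Like self.dinput, self.dbias can be computed with a matrix product,
 93 |     # self.dbias = np.dot( np.ones( ( 1, len(dvalue) ) ), dvalue )
 94 |     # but there is a faster and simpler way
 95 |     self.dbias = np.sum(dvalue, axis=0, keepdims=True)# keepdims=True is optional here, since summing over axis 0 still gives a row vector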
 96 | ```
 97 | 
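The gradient expressions can be verified against a numerical derivative. This is a minimal sketch, assuming a hypothetical scalar loss (the sum of squared layer outputs) chosen only because its derivative with respect to the output is easy to write down; `loss_fn`, `eps` and the array sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((4, 3))   # 4 samples, 3 inputs
w = rng.standard_normal((3, 2))   # 3 inputs, 2 neurons
b = np.zeros(2)

def loss_fn(x, w, b):
    # Hypothetical scalar loss: sum of squares of the layer output.
    return np.sum((x @ w + b) ** 2)

# Analytic gradients, using the same expressions as the backward method.
dvalue = 2 * (x @ w + b)          # d loss / d output
dweight = x.T @ dvalue            # same shape as w
dinput = dvalue @ w.T             # same shape as x
dbias = np.sum(dvalue, axis=0)    # same shape as b

# Numerical gradient with respect to one weight entry (central difference).
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[1, 0] += eps
w_minus[1, 0] -= eps
numeric = (loss_fn(x, w_plus, b) - loss_fn(x, w_minus, b)) / (2 * eps)

print(dweight.shape, dinput.shape, dbias.shape)
print(np.isclose(dweight[1, 0], numeric, rtol=1e-5, atol=1e-6))
```

Each gradient has the shape of the quantity it differentiates with respect to, which is what makes the update `w - lr * dw` possible without reshaping.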
 98 | ### 2. ReLu
 99 | 
100 | **Formula**
101 | 
102 | 
103 | $$
104 | y=\begin{cases}x,x > 0\\0,x \le 0\end{cases}
105 | $$
106 | 
107 | $$
108 | \frac{dy}{dx}=\begin{cases}1,x > 0 \\0,x \le 0\end{cases}
109 | $$
110 | 
111 | $$
112 | loss = f(y)
113 | $$
114 | 
115 | $$
116 | \frac{\partial loss}{\partial x}=\frac{\partial loss}{\partial y}\frac{\partial y}{\partial x}=dvalue*\frac{\partial y}{\partial x}=\begin{cases}dvalue,x > 0\\0,x \le 0\end{cases}
117 | $$
118 | 
119 | > **Viewed as a matrix, $\frac{\partial y}{\partial x}$ is a diagonal square matrix whose diagonal entries are 1 or 0 (the derivative at $x = 0$ is taken to be 0 by convention), but in practice it is not implemented with a matrix multiplication.**
120 | 
121 | **Implementation**
122 | 
123 | ```py
124 | def backward(self, dvalue):
125 |     # self.input and self.output have the same shape,
126 |     # so dinput, doutput and dvalue all have the same shape.
127 |     # A mask is faster than a matrix operation
128 |     self.dinput = dvalue.copy()
129 |     self.dinput[self.input <= 0] = 0
130 | ```
131 | 
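The mask-based backward pass can be checked on a small array. This is a minimal sketch with made-up inputs and upstream gradients; it assumes the convention that the derivative at exactly 0 is 0.

```python
import numpy as np

inputs = np.array([[1.5, -2.0, 0.0],
                   [-0.5, 3.0, 2.0]])
dvalue = np.array([[10.0, 20.0, 30.0],
                   [40.0, 50.0, 60.0]])

# Mask version: copy the upstream gradient and zero it wherever the input was not positive.
dinput = dvalue.copy()
dinput[inputs <= 0] = 0

# Equivalent element-wise product with the 0/1 derivative of ReLU.
same = np.array_equal(dinput, dvalue * (inputs > 0))

print(dinput)
print(same)
```

Copying `dvalue` before masking keeps the upstream gradient array untouched, in case another part of the code still needs it.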
132 | ### 3. Categorical Cross-Entropy loss
133 | 
134 | **Formula**
135 | $$
136 | L_i=-\sum\limits_jy_{i,j}\log(\hat y_{i,j})
137 | $$
138 | 
139 | > Here $L_i$ is the loss of a sample, $i$ indexes the $i$-th sample in the set, $j$ is the label index, $y$ the target value and $\hat y$ the predicted value.
140 | 
141 | ![image-20230808164836215](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308081648314.png)
142 | 
143 | **Implementation**
144 | 
145 | ```python
146 | def backward(self, y_pred, y_true):
147 |     n_sample = len(y_true)
148 |     if len(y_true.shape) == 2:  # labels are one-hot encoded
149 |         label = y_true
150 |     elif len(y_true.shape) == 1:  # labels are class indices
151 |         # Convert the labels to one-hot encoding
152 |         label = np.zeros((n_sample, len(y_pred[0])))
153 |         label[range(n_sample), y_true] = 1
154 |     self.dinput = - label / y_pred
155 |     # Divide by n_sample, so that summing the gradients over the samples during optimisation yields their mean
156 |     self.dinput = self.dinput / n_sample
157 | ```
158 | 
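159 | ### 4. Softmax
160 | 
161 | **Formula**
162 | 
163 | The Softmax function converts a vector of $L$ real numbers into a probability distribution over $L$ possible outcomes. The index $i$ denotes the current sample, the index $j$ the current output within that sample, and $S_{i,j}$ the probability of the $j$-th outcome.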
164 | $$
165 | S_{i,j}=\frac{e^{z_{i,j}}}{\sum\limits_{l=1}^L{e^{z_{i,l}}}}
166 | $$
167 | 
168 | $$
169 | \frac{\partial S_{i,j}}{\partial z_{i,k}}=\frac{\partial \frac{e^{z_{i,j}}}{\sum\limits_{l=1}^L{e^{z_{i,l}}}}}{\partial z_{i,k}}
170 | $$
171 | 
172 | > For $j=k$ the derivation is:
173 | 
174 | ![image-20230808180015739](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308081800813.png)
175 | 
176 | > 当$j\neq k$，推导如下：
177 | 
178 | ![image-20230808181505425](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308081815472.png)
179 | 
180 | > 综上有：
181 | 
182 | ![image-20230808181748009](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308081817058.png)
183 | 
184 | ![image-20230808181806618](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308081818656.png)
185 | 
186 | ![image-20230808181824173](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308081818210.png)
187 | 
188 | ![image-20230808181900366](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308081819405.png)
189 | $$
190 | \frac{\partial loss}{\partial z_{i,k}}=\sum\limits_{j=1}^L\frac{\partial loss}{\partial S_{i,j}}\frac{\partial S_{i,j}}{\partial z_{i,k}}
191 | $$
192 | 
193 | 
194 | **Implementation**
195 | 
196 | ```python
197 |     def backward(self, dvalue):
198 |         # For a single sample, input and output both have shape 1xa.
199 |         # loss is a scalar, so dinput and doutput (i.e. dvalue) also have shape 1xa,
200 |         # and the derivative of output with respect to input is an axa square matrix.
201 | 
202 |         # allocate an uninitialized array with the same shape as dvalue
203 |         self.dinput = np.empty_like(dvalue)
204 |         # loop over every sample (every row)
205 |         for each, (single_output, single_dvalue) in enumerate(zip(self.output, dvalue)):
206 |             # The per-sample gradient is (1xa) * (axa) = 1xa.
207 |             # single_output is a 1-D array of length a; reshape it into a 1xa matrix first,
208 |             # because a 1-D array has no transpose (.T returns the same array).
209 |             # np.dot treats a 1-D operand as a plain vector (no row/column orientation) and returns a 1-D result.
210 |             # With 2-D operands, np.dot requires matching inner dimensions and returns a 2-D array.
211 |             single_output = single_output.reshape(1, -1)
212 |             jacobian_matrix = np.diagflat(single_output) - np.dot(single_output.T,single_output)
213 |             # single_dvalue is a 1-D array, so np.dot(single_dvalue, jacobian_matrix) and
214 |             # np.dot(jacobian_matrix, single_dvalue) both return a 1-D array of length a.
215 |             # The two products sum over different axes of the Jacobian, but the softmax
216 |             # Jacobian is symmetric, so they produce the same values here
217 |             # and either order may be used.
218 |             self.dinput[each] = np.dot(jacobian_matrix, single_dvalue)
219 | ```
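
Because the Jacobian equals $\mathrm{diag}(S) - S^TS$, the product with dvalue reduces to $S \odot (dvalue - \sum_j dvalue_j S_j)$, which removes the per-sample loop. The following is a sketch under assumptions: the function names and the random test data are hypothetical, and the vectorized form is an alternative to the loop above, not a replacement taken from the source.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over each row.
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)


def softmax_backward_loop(output, dvalue):
    # Per-sample Jacobian, as in the implementation above.
    dinput = np.empty_like(dvalue)
    for i, (s, d) in enumerate(zip(output, dvalue)):
        s = s.reshape(1, -1)
        jacobian = np.diagflat(s) - np.dot(s.T, s)
        dinput[i] = np.dot(jacobian, d)
    return dinput


def softmax_backward_vectorized(output, dvalue):
    # diag(S) d - S (S . d) written element-wise for the whole batch.
    return output * (dvalue - np.sum(dvalue * output, axis=1, keepdims=True))


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))       # hypothetical pre-activations
dvalue = rng.normal(size=(4, 3))  # hypothetical upstream gradient
S = softmax(z)

# The Jacobian of one sample is symmetric, so J @ d equals d @ J.
s0 = S[0].reshape(1, -1)
J0 = np.diagflat(s0) - np.dot(s0.T, s0)
print(np.allclose(J0, J0.T))
print(np.allclose(softmax_backward_loop(S, dvalue), softmax_backward_vectorized(S, dvalue)))
```

The symmetry check is also why the order of the operands in `np.dot(jacobian_matrix, single_dvalue)` does not affect the result.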
220 | 
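221 | ### 5. Sigmoid
222 | 
223 | **Formula**
224 | $$
225 | \sigma_{i,j}=\frac{1}{1+e^{-z_{i,j}}}
226 | $$
227 | 
228 | 
229 | > Here $z_{i,j}$ is the input of the activation function and $\sigma_{i,j}$ is a single output value. The index $i$ denotes the current sample and the index $j$ denotes the current output within that sample. $\sigma_{i,j}$ can be read as the confidence for the $j$-th binary output, for example the confidence of the "dog" class in a cat-versus-dog classifier. A model may have several binary outputs at once, for example tall/short and heavy/slim. Sigmoid is used for binary classification.
230 | 
231 | ![image-20230808214620764](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308082146840.png)
232 | $$
233 | \frac{\partial \sigma_{i,j}}{\partial z_{i,k}}=\begin{cases}\sigma_{i,j}(1-\sigma_{i,j}), & j = k \\ 0, & j \neq k\end{cases}\quad\Rightarrow\quad\frac{\partial loss}{\partial z_{i,k}}=\sum\limits_j\frac{\partial loss}{\partial \sigma_{i,j}}\frac{\partial \sigma_{i,j}}{\partial z_{i,k}}=\frac{\partial loss}{\partial \sigma_{i,k}}\sigma_{i,k}(1-\sigma_{i,k})
234 | $$
235 | 
236 | > For a fixed $k$, $\frac{\partial loss}{\partial z_{i,k}}$ is a scalar; collecting all $k$, $\frac{\partial loss}{\partial z_{i,*}}$ is a row vector and $\frac{\partial \sigma_{i,*}}{\partial z_{i,*}}$ is a diagonal square matrix.
237 | >
238 | > A matrix product could be used here, but there is a simpler way, implemented as follows: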
239 | 
240 | **Implementation**
241 | 
242 | ```python
243 | def backward(self, dvalue):
244 |     # A matrix product would also work, but dinput, dvalue and output share the same shape,
245 |     # so an element-wise product is enough.
246 |     self.dinput = dvalue * self.output * ( 1 - self.output )
247 | ```
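
The element-wise rule can be compared against a finite-difference estimate of the Sigmoid derivative. This is a minimal sketch with assumed names (`sigmoid`, `sigmoid_backward`) and made-up inputs; the upstream gradient is set to ones, which corresponds to a hypothetical loss equal to the sum of all outputs.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_backward(output, dvalue):
    # Element-wise chain rule: dvalue * sigma * (1 - sigma).
    return dvalue * output * (1 - output)


# Hypothetical pre-activations: 2 samples, 3 binary outputs each.
z = np.array([[-2.0, 0.0, 1.5], [0.3, -0.7, 4.0]])
dvalue = np.ones_like(z)

analytic = sigmoid_backward(sigmoid(z), dvalue)

# Central finite differences; valid element-wise because the Jacobian is diagonal.
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```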
248 | 
249 | ### 6. Binary Cross-Entropy loss
250 | 
251 | **Formula**
252 | 
253 | ![image-20230808232215931](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308082322975.png)
254 | 
255 | > Here $j$ is the index of the $j$-th binary output.
256 | 
257 | ![](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308082319747.png)
258 | 
259 | > A model can have several binary outputs, so the losses computed on the individual outputs form a loss vector with one value per output. What is needed is a single loss per sample, so the losses of all outputs of one sample are averaged.
260 | 
261 | ![image-20230809130550709](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308091305753.png)
262 | 
263 | ![image-20230809130935346](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308091309417.png)
264 | 
265 | ![image-20230809131100610](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308091311670.png)
266 | 
267 | 
268 | 
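269 | **Implementation**
270 | 
271 | ```python
272 |     def backward(self, y_pred, y_true):
273 |         # number of samples
274 |         n_sample = len(y_true)
275 |         # number of binary outputs
276 |         n_output = len(y_pred[0])
277 |         # Pay special attention here; the book does not spell this out.
278 |         # With a single binary output, y_pred has shape (n_sample,1) but y_true has shape (n_sample,).
279 |         # Both shapes broadcast, but (n_sample,) is aligned with the last axis rather than the first,
280 |         # so without the reshape below dinput would have shape (n_sample,n_sample).
281 |         # With two binary outputs, y_pred and y_true both have shape (n_sample,2).
282 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
283 |             y_true = y_true.reshape(-1, 1)
284 |         # Note: BinaryCrossentropy always follows a Sigmoid,
285 |         # and Sigmoid outputs easily saturate to exactly 0 or 1, which would cause division by zero,
286 |         # so the lower bound is clipped to 1e-7.
287 |         # The upper bound is clipped to 1 - 1e-7, symmetrically,
288 |         # so that clipping does not bias the values towards 1.
289 |         y_pred_clip = np.clip(y_pred, 1e-7, 1 - 1e-7)
290 |         # Never write it as in the commented line below: the unary minus in -y_true is applied first,
291 |         # and if y_true is uint8, -1 wraps around to 255. This bug took a long time to find.
292 |         # self.dinput = (-y_true / y_pred_clip + (1 - y_true) / (1 - y_pred_clip)) / n_output
293 |         self.dinput = -(y_true / y_pred_clip - (1 - y_true) / (1 - y_pred_clip)) / n_output
294 |         # divide by n_sample so that summing over samples in the optimizer yields the mean gradient
295 |         self.dinput = self.dinput / n_sample
296 | ```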
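
The `uint8` pitfall mentioned in the comments is easy to reproduce in isolation. The snippet below is a small illustration with made-up labels and predictions; the names `wrong` and `right` are assumptions, and the division by `n_output` and `n_sample` is left out to keep the focus on the sign.

```python
import numpy as np

# Hypothetical labels stored as uint8 and predictions for a single binary output.
y_true = np.array([[1], [0], [1]], dtype=np.uint8)
y_pred = np.array([[0.9], [0.2], [0.6]])

# Negating an unsigned array wraps around: -1 becomes 255.
print(-y_true)

# Buggy form: the unary minus is applied to the uint8 array before the division.
wrong = -y_true / y_pred + (1 - y_true) / (1 - y_pred)

# Safe form: the division produces floats first, and the negation is applied afterwards.
right = -(y_true / y_pred - (1 - y_true) / (1 - y_pred))

print(wrong.ravel())
print(right.ravel())
```

For the positive samples the buggy form yields a large positive value instead of a small negative one, so the gradient points in the wrong direction. Casting the labels to a float or signed integer type before the computation would be another way to avoid the problem.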
297 | 


--------------------------------------------------------------------------------
/4Backpropagation/Backpropagation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/4Backpropagation/Backpropagation.pdf


--------------------------------------------------------------------------------
/4Backpropagation/NNFS_version4.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Author: 黄欣
  3 | Date: 2023-08-08
  4 | """
  5 | 
  6 | # This version adds backpropagation for the Dense Layer, the Activation Functions and the Loss.
  7 | 
  8 | import numpy as np
  9 | from nnfs.datasets import spiral_data
 10 | import matplotlib.pyplot as plt
 11 | 
 12 | 
 13 | class Layer_Dense:
 14 |     def __init__(self, n_input, n_neuron):
 15 |         # initialize the weights from a normal distribution
 16 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
 17 |         # initialize the bias to 0
 18 |         self.bias = np.zeros(n_neuron)
 19 | 
 20 |     def forward(self, input):
 21 |         # The backward method needs the input:
 22 |         # the partial derivative of the Layer_Dense output w.r.t. the input is self.weight,
 23 |         # and the partial derivative of the output w.r.t. self.weight is the input,
 24 |         # so forward stores the input as self.input.
 25 |         self.input = input
 26 |         self.output = np.dot(input, self.weight) + self.bias
 27 | 
 28 |     def backward(self, dvalue):
 29 |         # dvalue is the derivative of loss w.r.t. the input of the next layer (Activation),
 30 |         # i.e. the derivative of loss w.r.t. the output of this layer (Layer_Dense);
 31 |         # the chain rule is applied here.
 32 | 
 33 |         # First, compute the derivative of loss w.r.t. self.weight of this layer,
 34 |         # which gives the update direction for self.weight (negative gradient direction).
 35 | 
 36 |         # self.dweight must have the same shape as self.weight so that w - lr * dw works.
 37 |         # Suppose input holds a single sample of shape 1xa and weight has shape axb; then output is 1xb.
 38 |         # loss is a scalar, so dvalue = dloss/doutput has the shape of output (1xb),
 39 |         # and dweight has shape (1xa).T * (1xb) = axb, the same as weight.
 40 |         # Note: when input holds several samples (a matrix), dweight is the sum of the per-sample axb matrices.
 41 |         self.dweight = np.dot(self.input.T, dvalue)
 42 | 
 43 |         # Next, compute the derivative of loss w.r.t. self.input of this layer,
 44 |         # which is passed as dvalue to the backward method of the preceding layer.
 45 | 
 46 |         # loss is a scalar, so dinput has the shape of input (1xa),
 47 |         # dvalue = dloss/doutput has the shape of output (1xb),
 48 |         # and weight has shape axb,
 49 |         # so 1xa = (1xb) * (axb).T
 50 |         self.dinput = np.dot(dvalue, self.weight.T)
 51 | 
 52 |         # Like self.dinput, self.dbias could be computed with a matrix product:
 53 |         # self.dbias = np.dot( np.ones( (1, len(dvalue)) ), dvalue )
 54 |         # but summing over the sample axis is faster and simpler.
 55 |         self.dbias = np.sum(dvalue, axis=0, keepdims=True)# keepdims=True is optional here; it only changes the shape from (b,) to (1,b)
 56 | 
 57 | class Activation_Sigmoid:
 58 |     def __init__(self):
 59 |         pass
 60 | 
 61 |     def forward(self, input):
 62 |         self.input = input
 63 | 
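 64 |         # input has shape n x J: n is the number of samples fed to the activation and J the number of binary outputs.
 65 |         # For a single binary output (J = 1), the preceding layer is a Layer_Dense with one neuron.
 66 |         self.output = 1 / ( 1 + np.exp(-input) )
 67 | 
 68 |     def backward(self, dvalue):
 69 |         # A matrix product would also work, but dinput, dvalue and output share the same shape,
 70 |         # so an element-wise product is enough.
 71 |         self.dinput = dvalue * self.output * ( 1 - self.output )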
 72 | 
 73 | class Activation_ReLu:
 74 |     def __init__(self):
 75 |         pass
 76 | 
 77 |     def forward(self,input):
 78 |         self.input = input
 79 |         self.output = np.maximum(0,input)
 80 | 
 81 |     def backward(self, dvalue):
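 82 |         # self.input and self.output have the same shape,
 83 |         # so dinput, doutput and dvalue all have the same shape as well.
 84 |         # A boolean mask is faster than building and multiplying a Jacobian matrix.
 85 |         self.dinput = dvalue.copy()
 86 |         self.dinput[self.input <= 0] = 0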
 87 | 
 88 | class Activation_Softmax:
 89 |     def __init__(self):
 90 |         pass
 91 | 
 92 |     def forward(self,input):
 93 |         self.input = input
 94 | 
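 95 |         # keepdims=True is required:
 96 |         # without it, np.max(input, axis=1) returns a 1-D array of length n_sample instead of an (n_sample,1) column,
 97 |         # and that length generally does not match the length of each row of input,
 98 |         # so the subtraction cannot broadcast (or broadcasts along the wrong axis).
 99 |         # Subtracting the row maximum before exponentiating keeps the exponents <= 0 and avoids overflow.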
100 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
101 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
102 | 
103 |     def backward(self, dvalue):
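104 |         # For a single sample, input and output both have shape 1xa.
105 |         # loss is a scalar, so dinput and doutput (i.e. dvalue) also have shape 1xa,
106 |         # and the derivative of output with respect to input is an axa square matrix.
107 | 
108 |         # allocate an uninitialized array with the same shape as dvalue
109 |         self.dinput = np.empty_like(dvalue)
110 |         # loop over every sample (every row)
111 |         for each, (single_output, single_dvalue) in enumerate(zip(self.output, dvalue)):
112 |             # The per-sample gradient is (1xa) * (axa) = 1xa.
113 |             # single_output is a 1-D array of length a; reshape it into a 1xa matrix first,
114 |             # because a 1-D array has no transpose (.T returns the same array).
115 |             # np.dot treats a 1-D operand as a plain vector (no row/column orientation) and returns a 1-D result.
116 |             # With 2-D operands, np.dot requires matching inner dimensions and returns a 2-D array.
117 |             single_output = single_output.reshape(1, -1)
118 |             jacobian_matrix = np.diagflat(single_output) - np.dot(single_output.T,single_output)
119 |             # single_dvalue is a 1-D array, so np.dot(single_dvalue, jacobian_matrix) and
120 |             # np.dot(jacobian_matrix, single_dvalue) both return a 1-D array of length a.
121 |             # The two products sum over different axes of the Jacobian, but the softmax
122 |             # Jacobian is symmetric, so they produce the same values here
123 |             # and either order may be used.
124 |             self.dinput[each] = np.dot(jacobian_matrix, single_dvalue)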
125 | 
126 | 
127 | class Loss:
128 |     def __init__(self):
129 |         pass
130 | 
131 |     # all losses are computed through the calculate method
132 |     def calculate(self, y_pred, y_true):
133 |         # each concrete loss inherits from the Loss parent class and implements its own forward method
134 |         data_loss = np.mean( self.forward(y_pred, y_true) )
135 |         # note: the loss is not stored as an attribute; it is returned directly
136 |         return data_loss
137 | 
138 | class Loss_CategoricalCrossentropy(Loss):
139 |     def __init__(self):
140 |         pass
141 | 
142 |     def forward(self, y_pred, y_true):
143 |         # number of samples
144 |         n_sample = len(y_true)
145 | 
146 |         # clip at 1e-7 on the left to prevent log(0);
147 |         # clip at 1 - 1e-7 on the right, symmetrically,
148 |         # so that clipping does not bias the confidences towards 1
149 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
150 | 
151 |         loss = - np.log(y_pred)
152 |         if len(y_true.shape) == 2:# labels are one-hot encoded
153 |             loss = np.sum(loss * y_true,axis=1)
154 |         elif len(y_true.shape) == 1:# labels are sparse class indices
155 |             # note: loss[:, y_true] is different; it would return an (n_sample, n_sample) matrix
156 |             loss = loss[range(n_sample), y_true]
157 | 
158 |         # loss is a 1-D array with one entry per sample;
159 |         # the mean is not taken here but in the calculate method of the parent class
160 |         return loss
161 | 
162 |     def backward(self, y_pred, y_true):
163 |         n_sample = len(y_true)
164 |         if len(y_true.shape) == 2:  # labels are already one-hot encoded
165 |             label = y_true
166 |         elif len(y_true.shape) == 1:  # labels are sparse class indices
167 |             # convert the class indices to one-hot encoding
168 |             label = np.zeros((n_sample, len(y_pred[0])))
169 |             label[range(n_sample), y_true] = 1
170 |         self.dinput = - label / y_pred
171 |         # divide by n_sample so that summing over samples in the optimizer yields the mean gradient
172 |         self.dinput = self.dinput / n_sample
173 | 
174 | 
175 | class Loss_BinaryCrossentropy(Loss):
176 |     def __init__(self):
177 |         pass
178 | 
179 |     def forward(self, y_pred, y_true):
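180 |         # Pay special attention here; the book does not spell this out.
181 |         # With a single binary output, y_pred has shape (n_sample,1) but y_true has shape (n_sample,).
182 |         # Both shapes broadcast, but (n_sample,) is aligned with the last axis rather than the first,
183 |         # so without the reshape below loss would have shape (n_sample,n_sample).
184 |         # With two binary outputs, y_pred and y_true both have shape (n_sample,2).
185 |         if len(y_true.shape) == 1: # y_true is a 1-D array
186 |             y_true = y_true.reshape(-1,1)
187 |         # clip at 1e-7 on the left to prevent log(0);
188 |         # clip at 1 - 1e-7 on the right, symmetrically,
189 |         # so that clipping does not bias the confidences towards 1
190 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
191 |         loss = -  np.log(y_pred) * y_true  - np.log(1 - y_pred) * (1 - y_true)
192 |         # This mean is over a different axis than the mean in calculate of the parent class:
193 |         # here the losses of the binary outputs of one sample are averaged,
194 |         # while calculate averages over the samples.
195 |         loss = np.mean(loss, axis=-1)
196 |         return loss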
197 | 
198 |     def backward(self, y_pred, y_true):
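199 |         # number of samples
200 |         n_sample = len(y_true)
201 |         # number of binary outputs
202 |         n_output = len(y_pred[0])
203 |         # Pay special attention here; the book does not spell this out.
204 |         # With a single binary output, y_pred has shape (n_sample,1) but y_true has shape (n_sample,).
205 |         # Both shapes broadcast, but (n_sample,) is aligned with the last axis rather than the first,
206 |         # so without the reshape below dinput would have shape (n_sample,n_sample).
207 |         # With two binary outputs, y_pred and y_true both have shape (n_sample,2).
208 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
209 |             y_true = y_true.reshape(-1, 1)
210 |         # Note: BinaryCrossentropy always follows a Sigmoid,
211 |         # and Sigmoid outputs easily saturate to exactly 0 or 1, which would cause division by zero,
212 |         # so the lower bound is clipped to 1e-7.
213 |         # The upper bound is clipped to 1 - 1e-7, symmetrically,
214 |         # so that clipping does not bias the values towards 1.
215 |         y_pred_clip = np.clip(y_pred, 1e-7, 1 - 1e-7)
216 |         # Never write it as in the commented line below: the unary minus in -y_true is applied first,
217 |         # and if y_true is uint8, -1 wraps around to 255. This bug took a long time to find.
218 |         # self.dinput = (-y_true / y_pred_clip + (1 - y_true) / (1 - y_pred_clip)) / n_output
219 |         self.dinput = -(y_true / y_pred_clip - (1 - y_true) / (1 - y_pred_clip)) / n_output
220 |         # divide by n_sample so that summing over samples in the optimizer yields the mean gradient
221 |         self.dinput = self.dinput / n_sample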
222 | 
223 | ########################################
224 | # generate the data
225 | X, y = spiral_data(samples=100, classes=2)
226 | #########################################
227 | 
228 | ########################################################
229 | # build a Dense layer with 4 neurons
230 | dense1 = Layer_Dense(2,4)
231 | # build the ReLu activation
232 | activation1 = Activation_ReLu()
233 | # build a Dense layer with 1 neuron (a single binary output)
234 | dense2 = Layer_Dense(4,1)
235 | # build the Sigmoid activation
236 | activation2 = Activation_Sigmoid()
237 | # build the loss function
238 | loss = Loss_BinaryCrossentropy()
239 | 
240 | # forward pass
241 | dense1.forward(X)
242 | activation1.forward(dense1.output)
243 | dense2.forward(activation1.output)
244 | activation2.forward(dense2.output)
245 | dataloss = loss.calculate(activation2.output, y)
246 | 
247 | # backward pass
248 | loss.backward(activation2.output, y)
249 | activation2.backward(loss.dinput)
250 | dense2.backward(activation2.dinput)
251 | print(dense2.dinput[40:50])
252 | #print(activation2.dinput[40:50])
253 | 
254 | # print the result
255 | print('loss =',dataloss)
256 | 
257 | # compute the accuracy
258 | soft_output = activation2.output
259 | # the output has a single column, so argmax would always return 0; threshold the confidence at 0.5 instead
260 | prediction = (soft_output > 0.5).astype(int).ravel()
261 | # if y is one-hot encoded
262 | if len(y.shape) == 2:
263 |     # convert it to a single class label per sample
264 |     y = np.argmax(y,axis=1)
265 | 
266 | accuracy = np.mean(prediction == y)
267 | print("accuracy =",accuracy)
268 | #################################################
269 | 
270 | #############################################
271 | # build a Dense layer with 4 neurons
272 | dense1 = Layer_Dense(2,4)
273 | # build the ReLu activation
274 | activation1 = Activation_ReLu()
275 | # build a Dense layer with 2 neurons (one per class)
276 | dense2 = Layer_Dense(4,2)
277 | # build the Softmax activation
278 | activation2 = Activation_Softmax()
279 | # build the loss function
280 | loss = Loss_CategoricalCrossentropy()
281 | 
282 | # forward pass
283 | dense1.forward(X)
284 | activation1.forward(dense1.output)
285 | dense2.forward(activation1.output)
286 | activation2.forward(dense2.output)
287 | 
288 | dataloss = loss.calculate(activation2.output, y)
289 | 
290 | # backward pass
291 | loss.backward(activation2.output, y)
292 | activation2.backward(loss.dinput)
293 | dense2.backward(activation2.dinput)
294 | print(dense2.dinput[40:50])
295 | #print(activation2.dinput[40:50])
296 | 
297 | # print the result
298 | print('loss =',dataloss)
299 | 
300 | # compute the accuracy
301 | soft_output = activation2.output
302 | # the class with the highest confidence is the predicted class
303 | prediction = np.argmax(soft_output,axis=1)
304 | # if y is one-hot encoded
305 | if len(y.shape) == 2:
306 |     # convert it to a single class label per sample
307 |     y = np.argmax(y,axis=1)
308 | 
309 | accuracy = np.mean(prediction == y)
310 | print("accuracy =",accuracy)
311 | ########################################################
312 | 
313 | 
314 | 
315 | 
316 | 
317 | 
318 | 
319 | 


--------------------------------------------------------------------------------
/5combineLossandActivation/NNFS_version5.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Author: 黄欣
  3 | Date: 2023-08-08
  4 | """
  5 | from timeit import timeit
  6 | 
  7 | # This version adds the combined Categorical Cross-Entropy + Softmax class and the combined Binary Cross-Entropy + Sigmoid class.
  8 | 
  9 | import numpy as np
 10 | from nnfs.datasets import spiral_data
 11 | import matplotlib.pyplot as plt
 12 | 
 13 | 
 14 | class Layer_Dense:
 15 |     def __init__(self, n_input, n_neuron):
 16 |         # initialize the weights from a normal distribution
 17 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
 18 |         # initialize the bias to 0
 19 |         self.bias = np.zeros(n_neuron)
 20 | 
 21 |     def forward(self, input):
 22 |         # The backward method needs the input:
 23 |         # the partial derivative of the Layer_Dense output w.r.t. the input is self.weight,
 24 |         # and the partial derivative of the output w.r.t. self.weight is the input,
 25 |         # so forward stores the input as self.input.
 26 |         self.input = input
 27 |         self.output = np.dot(input, self.weight) + self.bias
 28 | 
 29 |     def backward(self, dvalue):
 30 |         # dvalue is the derivative of loss w.r.t. the input of the next layer (Activation),
 31 |         # i.e. the derivative of loss w.r.t. the output of this layer (Layer_Dense);
 32 |         # the chain rule is applied here.
 33 | 
 34 |         # First, compute the derivative of loss w.r.t. self.weight of this layer,
 35 |         # which gives the update direction for self.weight (negative gradient direction).
 36 | 
 37 |         # self.dweight must have the same shape as self.weight so that w - lr * dw works.
 38 |         # Suppose input holds a single sample of shape 1xa and weight has shape axb; then output is 1xb.
 39 |         # loss is a scalar, so dvalue = dloss/doutput has the shape of output (1xb),
 40 |         # and dweight has shape (1xa).T * (1xb) = axb, the same as weight.
 41 |         # Note: when input holds several samples (a matrix), dweight is the sum of the per-sample axb matrices.
 42 |         self.dweight = np.dot(self.input.T, dvalue)
 43 | 
 44 |         # Next, compute the derivative of loss w.r.t. self.input of this layer,
 45 |         # which is passed as dvalue to the backward method of the preceding layer.
 46 | 
 47 |         # loss is a scalar, so dinput has the shape of input (1xa),
 48 |         # dvalue = dloss/doutput has the shape of output (1xb),
 49 |         # and weight has shape axb,
 50 |         # so 1xa = (1xb) * (axb).T
 51 |         self.dinput = np.dot(dvalue, self.weight.T)
 52 | 
 53 |         # Like self.dinput, self.dbias could be computed with a matrix product:
 54 |         # self.dbias = np.dot( np.ones( (1, len(dvalue)) ), dvalue )
 55 |         # but summing over the sample axis is faster and simpler.
 56 |         self.dbias = np.sum(dvalue, axis=0, keepdims=True)# keepdims=True is optional here; it only changes the shape from (b,) to (1,b)
 57 | 
 58 | class Activation_Sigmoid:
 59 |     def __init__(self):
 60 |         pass
 61 | 
 62 |     def forward(self, input):
 63 |         self.input = input
 64 | 
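 65 |         # input has shape n x J: n is the number of samples fed to the activation and J the number of binary outputs.
 66 |         # For a single binary output (J = 1), the preceding layer is a Layer_Dense with one neuron.
 67 |         self.output = 1 / ( 1 + np.exp(- (self.input) ) )
 68 | 
 69 |     def backward(self, dvalue):
 70 |         # A matrix product would also work, but dinput, dvalue and output share the same shape,
 71 |         # so an element-wise product is enough.
 72 |         self.dinput = dvalue * self.output * ( 1 - self.output )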
 73 | 
 74 | class Activation_ReLu:
 75 |     def __init__(self):
 76 |         pass
 77 | 
 78 |     def forward(self,input):
 79 |         self.input = input
 80 |         self.output = np.maximum(0,input)
 81 | 
 82 |     def backward(self, dvalue):
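 83 |         # self.input and self.output have the same shape,
 84 |         # so dinput, doutput and dvalue all have the same shape as well.
 85 |         # A boolean mask is faster than building and multiplying a Jacobian matrix.
 86 |         self.dinput = dvalue.copy()
 87 |         self.dinput[self.input <= 0] = 0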
 88 | 
 89 | class Activation_Softmax:
 90 |     def __init__(self):
 91 |         pass
 92 | 
 93 |     def forward(self,input):
 94 |         self.input = input
 95 | 
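 96 |         # keepdims=True is required:
 97 |         # without it, np.max(input, axis=1) returns a 1-D array of length n_sample instead of an (n_sample,1) column,
 98 |         # and that length generally does not match the length of each row of input,
 99 |         # so the subtraction cannot broadcast (or broadcasts along the wrong axis).
100 |         # Subtracting the row maximum before exponentiating keeps the exponents <= 0 and avoids overflow.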
101 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
102 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
103 | 
104 |     def backward(self, dvalue):
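105 |         # For a single sample, input and output both have shape 1xa.
106 |         # loss is a scalar, so dinput and doutput (i.e. dvalue) also have shape 1xa,
107 |         # and the derivative of output with respect to input is an axa square matrix.
108 | 
109 |         # allocate an uninitialized array with the same shape as dvalue
110 |         self.dinput = np.empty_like(dvalue)
111 |         # loop over every sample (every row)
112 |         for each, (single_output, single_dvalue) in enumerate(zip(self.output, dvalue)):
113 |             # Both orders of the product below yield a dinput row of the same size.
114 |             # The per-sample gradient is (1xa) * (axa) = 1xa.
115 |             # single_output is a 1-D array of length a; reshape it into a 1xa matrix first,
116 |             # because a 1-D array has no transpose (.T returns the same array).
117 |             # np.dot treats a 1-D operand as a plain vector (no row/column orientation) and returns a 1-D result.
118 |             # With 2-D operands, np.dot requires matching inner dimensions and returns a 2-D array.
119 |             single_output = single_output.reshape(1, -1)
120 |             jacobian_matrix = np.diagflat(single_output) - np.dot(single_output.T, single_output)
121 |             # single_dvalue is a 1-D array, so np.dot(single_dvalue, jacobian_matrix) and
122 |             # np.dot(jacobian_matrix, single_dvalue) both return a 1-D array of length a.
123 |             # The two products sum over different axes of the Jacobian, but the softmax
124 |             # Jacobian is symmetric, so they produce the same values here
125 |             # and either order may be used.
126 |             self.dinput[each] = np.dot(jacobian_matrix, single_dvalue)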
127 | 
128 | 
129 | 
130 | class Loss:
131 |     def __init__(self):
132 |         pass
133 | 
134 |     # all losses are computed through the calculate method
135 |     def calculate(self, y_pred, y_true):
136 |         # each concrete loss inherits from the Loss parent class and implements its own forward method
137 |         data_loss = np.mean( self.forward(y_pred, y_true) )
138 |         # note: the loss is not stored as an attribute; it is returned directly
139 |         return data_loss
140 | 
141 | class Loss_CategoricalCrossentropy(Loss):
142 |     def __init__(self):
143 |         pass
144 | 
145 |     def forward(self, y_pred, y_true):
146 |         # number of samples
147 |         n_sample = len(y_true)
148 | 
149 |         # clip at 1e-7 on the left to prevent log(0);
150 |         # clip at 1 - 1e-7 on the right, symmetrically,
151 |         # so that clipping does not bias the confidences towards 1
152 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
153 | 
154 |         loss = - np.log(y_pred)
155 |         if len(y_true.shape) == 2:# labels are one-hot encoded
156 |             loss = np.sum(loss * y_true, axis=1)
157 |         elif len(y_true.shape) == 1:# labels are sparse class indices
158 |             # note: loss[:, y_true] is different; it would return an (n_sample, n_sample) matrix
159 |             loss = loss[range(n_sample), y_true]
160 | 
161 |         # loss is a 1-D array with one entry per sample;
162 |         # the mean is not taken here but in the calculate method of the parent class
163 |         return loss
164 | 
165 |     def backward(self, y_pred, y_true):
166 |         n_sample = len(y_true)
167 |         if len(y_true.shape) == 2:  # labels are already one-hot encoded
168 |             label = y_true
169 |         elif len(y_true.shape) == 1:  # labels are sparse class indices
170 |             # convert the class indices to one-hot encoding
171 |             label = np.zeros((n_sample, len(y_pred[0])))
172 |             label[range(n_sample), y_true] = 1
173 |         self.dinput = - label / y_pred
174 |         # divide by n_sample so that summing over samples in the optimizer yields the mean gradient
175 |         self.dinput = self.dinput / n_sample
176 | 
177 | 
178 | 
179 | 
180 | class Loss_BinaryCrossentropy(Loss):
181 |     def __init__(self):
182 |         pass
183 | 
184 |     def forward(self, y_pred, y_true):
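185 |         # number of samples
186 |         n_sample = len(y_true)
187 |         # Pay special attention here; the book does not spell this out.
188 |         # With a single binary output, y_pred has shape (n_sample,1) but y_true has shape (n_sample,).
189 |         # Both shapes broadcast, but (n_sample,) is aligned with the last axis rather than the first,
190 |         # so without the reshape below loss would have shape (n_sample,n_sample).
191 |         # With two binary outputs, y_pred and y_true both have shape (n_sample,2).
192 |         if len(y_true.shape) == 1: # y_true is a 1-D array
193 |             y_true = y_true.reshape(-1,1)
194 |         # clip at 1e-7 on the left to prevent log(0);
195 |         # clip at 1 - 1e-7 on the right, symmetrically,
196 |         # so that clipping does not bias the confidences towards 1
197 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
198 |         loss = -  np.log(y_pred) * y_true  - np.log(1 - y_pred) * (1 - y_true)
199 |         # This mean is over a different axis than the mean in calculate of the parent class:
200 |         # here the losses of the binary outputs of one sample are averaged,
201 |         # while calculate averages over the samples.
202 |         loss = np.mean(loss, axis=-1)
203 |         return loss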
204 | 
205 |     def backward(self, y_pred, y_true):
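206 |         # number of samples
207 |         n_sample = len(y_true)
208 |         # number of binary outputs
209 |         n_output = len(y_pred[0])
210 |         # Pay special attention here; the book does not spell this out.
211 |         # With a single binary output, y_pred has shape (n_sample,1) but y_true has shape (n_sample,).
212 |         # Both shapes broadcast, but (n_sample,) is aligned with the last axis rather than the first,
213 |         # so without the reshape below dinput would have shape (n_sample,n_sample).
214 |         # With two binary outputs, y_pred and y_true both have shape (n_sample,2).
215 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
216 |             y_true = y_true.reshape(-1, 1)
217 |         # Note: BinaryCrossentropy always follows a Sigmoid,
218 |         # and Sigmoid outputs easily saturate to exactly 0 or 1, which would cause division by zero,
219 |         # so the lower bound is clipped to 1e-7.
220 |         # The upper bound is clipped to 1 - 1e-7, symmetrically,
221 |         # so that clipping does not bias the values towards 1.
222 |         y_pred_clip = np.clip(y_pred, 1e-7, 1 - 1e-7)
223 |         # Never write it as in the commented line below: the unary minus in -y_true is applied first,
224 |         # and if y_true is uint8, -1 wraps around to 255. This bug took a long time to find.
225 |         # self.dinput = (-y_true / y_pred_clip + (1 - y_true) / (1 - y_pred_clip)) / n_output
226 |         self.dinput = -(y_true / y_pred_clip - (1 - y_true) / (1 - y_pred_clip)) / n_output
227 |         # divide by n_sample so that summing over samples in the optimizer yields the mean gradient
228 |         self.dinput = self.dinput / n_sample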
229 | 
230 | class Activation_Softmax_Loss_CategoricalCrossentropy():
231 |     def __init__(self):
232 |         self.activation = Activation_Softmax()
233 |         self.loss = Loss_CategoricalCrossentropy()
234 | 
235 |     # Note: Activation_Softmax_Loss_CategoricalCrossentropy computes the loss through forward,
236 |     # because it does not inherit from the Loss class
237 |     def forward(self, input, y_true):
238 |         self.activation.forward(input)
239 |         # the output attribute of this class is the output of Activation_Softmax()
240 |         self.output = self.activation.output
241 |         # this method returns the loss
242 |         return self.loss.calculate(self.output, y_true)
243 | 
244 |     # y_pred always equals self.output; it is kept as a parameter for consistency with the earlier code
245 |     def backward(self, y_pred, y_true):
246 |         # number of samples
247 |         n_sample = len(y_true)
248 |         if len(y_true.shape) == 2: # one-hot encoding
249 |             # apply the formula directly
250 |             self.dinput = y_pred - y_true
251 |         elif len(y_true.shape) == 1: # sparse class indices
252 |             self.dinput = y_pred.copy()
253 |             # subtract 1 at the y_true index of every row; the other entries subtract 0 (left unchanged)
254 |             self.dinput[range(n_sample), y_true] -= 1
255 |         # divide by n_sample so that summing over samples in the optimizer yields the mean gradient
256 |         self.dinput = self.dinput / n_sample
257 | 
258 | class Activation_Sigmoid_Loss_BinaryCrossentropy():
259 |     def __init__(self):
260 |         self.activation = Activation_Sigmoid()
261 |         self.loss = Loss_BinaryCrossentropy()
262 | 
263 |     def forward(self, input, y_true):
264 |         self.activation.forward(input)
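265 |         # the output of this class is the output of Sigmoid
266 |         self.output = self.activation.output
267 |         return self.loss.calculate(self.output, y_true)
268 | 
269 |     def backward(self, y_pred, y_true):
270 |         # number of samples
271 |         n_sample = len(y_pred)
272 |         # Pay special attention here; the book does not spell this out.
273 |         # With a single binary output, y_pred has shape (n_sample,1) but y_true has shape (n_sample,).
274 |         # Both shapes broadcast, but (n_sample,) is aligned with the last axis rather than the first,
275 |         # so without the reshape below dinput would have shape (n_sample,n_sample).
276 |         # With two binary outputs, y_pred and y_true both have shape (n_sample,2).
277 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
278 |             y_true = y_true.reshape(-1, 1)
279 |         # number of binary outputs
280 |         J = len(y_pred[0])
281 |         # each row of y_true holds J binary values: 1 for a positive example, 0 for a negative one
282 |         self.dinput = ( y_pred - y_true ) / J
283 | 
284 |         # the optimizer sums over all samples; divide by the sample count so the gradient does not scale with it
285 |         self.dinput /= n_sample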
286 | 
287 | 
288 | ##########################################
289 | # data
290 | X, y = spiral_data(samples=100,classes=2)
291 | 
292 | 
293 | # two Dense layers and one ReLu
294 | dense1 = Layer_Dense(2,4)
295 | dense2 = Layer_Dense(4,1)
296 | activation1 = Activation_ReLu()
297 | 
298 | # forward pass through the shared layers
299 | dense1.forward(X)
300 | activation1.forward(dense1.output)
301 | dense2.forward(activation1.output)
302 | sigmoid_in = dense2.output
303 | 
304 | # forward pass: combined class versus separate activation and loss
305 | ####
306 | ##
307 | sigmoid_loss = Activation_Sigmoid_Loss_BinaryCrossentropy()
308 | dataloss1 = sigmoid_loss.forward(sigmoid_in, y)
309 | ##
310 | activation2 = Activation_Sigmoid()
311 | loss = Loss_BinaryCrossentropy()
312 | activation2.forward(sigmoid_in)
313 | dataloss2 = loss.calculate(activation2.output, y)
314 | ##
315 | ####
316 | 
317 | # backward pass: combined class versus separate activation and loss
318 | ####
319 | ##
320 | sigmoid_loss.backward(sigmoid_loss.output, y)
321 | dinput1 = sigmoid_loss.dinput
322 | ##
323 | loss.backward(activation2.output, y)
324 | activation2.backward(loss.dinput)
325 | dinput2 = activation2.dinput
326 | 
327 | print('Gradients: combined loss and activation:')
328 | print(dataloss1)
329 | print(dinput1.shape)
330 | print(dinput1[50:55])
331 | 
332 | print('Gradients: separate loss and activation:')
333 | print(dataloss2)
334 | print(dinput2.shape)
335 | print(dinput2[50:55])
336 | ##
337 | ####
338 | 
339 | 
340 | 
341 | # def f1():
342 | #     sigmoid_loss.backward(sigmoid_loss.output, y)
343 | #     dinput1 = sigmoid_loss.dinput
344 | # def f2():
345 | #     loss.backward(activation2.output, y)
346 | #     activation2.backward(loss.dinput)
347 | #     dinput2 = activation2.dinput
348 | #
349 | # t1 = timeit(lambda: f1(), number=10000)
350 | # t2 = timeit(lambda: f2(), number=10000)
351 | # print(t2/t1)
352 | 
353 | 
354 | 
355 | 


--------------------------------------------------------------------------------
/5combineLossandActivation/combineLossandActivation.md:
--------------------------------------------------------------------------------
  1 | # combine the Loss and Activation Function
  2 | 
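  3 | ## 1. Content
  4 | 
  5 | The Categorical Cross-Entropy loss and the Softmax activation were implemented in the earlier sections, but the computation can be sped up further: when the derivatives of the two functions are combined, the resulting expression is simpler to implement and faster to evaluate. The Binary Cross-Entropy loss and Sigmoid can be combined in the same way.
  6 | 
  7 | ## 2. Categorical Cross-Entropy loss and Softmax activation
  8 | 
  9 | ### **Formula**
 10 | 
 11 | $$
 12 | L_i=-\sum\limits_j y_{i,j}\log(\hat y_{i,j})
 13 | $$
 14 | 
 15 | > The Softmax part of Backpropagation covered the computation of $\frac{\partial S_{i,j}}{\partial z_{i,k}}$, and $\hat y_{i,j}=S_{i,j}$, so: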
 16 | 
 17 | ![image-20230809092620943](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308090926027.png)
 18 | 
 19 | ![image-20230809092641430](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308090926463.png)
 20 | 
 21 | > The Categorical Cross-Entropy loss part of the Backpropagation section showed that:
 22 | 
 23 | ![image-20230809093259100](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308090932146.png)
 24 | 
 25 | > Putting these together:
 26 | 
 27 | ![image-20230809093342583](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308090933626.png)
 28 | 
 29 | > **Note: here $z$ is the input of Softmax and $L$ is the output of Categorical Cross-Entropy.**
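
The steps shown in the images can also be written out in one line. This is a sketch of the standard derivation, not taken from the original notes: it assumes $y_i$ is one-hot (so $\sum_j y_{i,j}=1$), and the Kronecker delta $\delta_{jk}$ is notation introduced here for the Softmax derivative $\frac{\partial S_{i,j}}{\partial z_{i,k}}=S_{i,j}(\delta_{jk}-S_{i,k})$.

$$
\frac{\partial L_i}{\partial z_{i,k}}=\sum\limits_j\frac{\partial L_i}{\partial \hat y_{i,j}}\frac{\partial \hat y_{i,j}}{\partial z_{i,k}}=\sum\limits_j\left(-\frac{y_{i,j}}{\hat y_{i,j}}\right)\hat y_{i,j}(\delta_{jk}-\hat y_{i,k})=-y_{i,k}+\hat y_{i,k}\sum\limits_j y_{i,j}=\hat y_{i,k}-y_{i,k}
$$

This is the `y_pred - y_true` expression used in the implementation, before the division by the number of samples.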
 30 | 
 31 | ### **Implementation**
 32 | 
 33 | ```python
 34 | class Activation_Softmax_Loss_CategoricalCrossentropy():
 35 |     def __init__(self):
 36 |         self.activation = Activation_Softmax()
 37 |         self.loss = Loss_CategoricalCrossentropy()
 38 | 
 39 |     # Note: Activation_Softmax_Loss_CategoricalCrossentropy computes the loss through forward,
 40 |     # because it does not inherit from the Loss class
 41 |     def forward(self, input, y_true):
 42 |         self.activation.forward(input)
 43 |         # The output attribute of this class is the output of Activation_Softmax()
 44 |         self.output = self.activation.output
 45 |         # The method returns the loss
 46 |         return self.loss.calculate(self.output, y_true)
 47 | 
 48 |     # y_pred always equals self.output; the parameter is kept for consistency with the earlier code
 49 |     def backward(self, y_pred, y_true):
 50 |         # Number of samples
 51 |         n_sample = len(y_true)
 52 |         if len(y_true.shape) == 2: # one-hot encoded labels
 53 |             # Apply the formula directly
 54 |             self.dinput = y_pred - y_true
 55 |         elif len(y_true.shape) == 1: # sparse class labels
 56 |             self.dinput = y_pred.copy()
 57 |             # Subtract 1 at the y_true class index of each row; the other entries are left unchanged
 58 |             self.dinput[range(n_sample), y_true] -= 1
 59 |         # Divide by n_sample, because the gradients are summed over the samples during optimization
 60 |         self.dinput = self.dinput / n_sample
 61 | ```
 62 | 
 63 | ### **Example**
 64 | 
 65 | ```python
 66 | ##########################################
 67 | softmax_outputs = np.array([[0.7, 0.1, 0.2],[0.1, 0.5, 0.4],[0.02, 0.9, 0.08]])
 68 | class_targets = np.array([0, 1, 1])
 69 | softmax_loss = Activation_Softmax_Loss_CategoricalCrossentropy()
 70 | softmax_loss.backward(softmax_outputs, class_targets)
 71 | dvalues1 = softmax_loss.dinput
 72 | 
 73 | activation = Activation_Softmax()
 74 | activation.output = softmax_outputs
 75 | loss = Loss_CategoricalCrossentropy()
 76 | loss.backward(softmax_outputs, class_targets)
 77 | activation.backward(loss.dinput)
 78 | dvalues2 = activation.dinput
 79 | 
 80 | print('Gradients: combined loss and activation:')
 81 | print(dvalues1)
 82 | print('Gradients: separate loss and activation:')
 83 | print(dvalues2)
 84 | ###################################################
 85 | ```
 86 | 
 87 | ![image-20230809123920168](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308091239219.png)
 88 | 
 89 | > Keeping the activation and the loss separate or combining them produces the same result.
 90 | 
 91 | ```python
 92 | def f1():
 93 |     softmax_loss = Activation_Softmax_Loss_CategoricalCrossentropy()
 94 |     softmax_loss.backward(softmax_outputs, class_targets)
 95 |     dvalues1 = softmax_loss.dinput
 96 | def f2():
 97 |     activation = Activation_Softmax()
 98 |     activation.output = softmax_outputs
 99 |     loss = Loss_CategoricalCrossentropy()
100 |     loss.backward(softmax_outputs, class_targets)
101 |     activation.backward(loss.dinput)
102 |     dvalues2 = activation.dinput
103 | 
104 | t1 = timeit(lambda: f1(), number=10000)
105 | t2 = timeit(lambda: f2(), number=10000)
106 | print(t2/t1)
107 | ```
108 | 
109 | ![image-20230809124516686](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308091245718.png)
110 | 
111 | > After 10000 repetitions of each implementation, the separate version takes nearly 4 times as long as the combined one.
112 | 
113 | ## III. Sigmoid and Binary Cross-Entropy Loss
114 | 
115 | > **This part is not in the book. I derived the formulas and wrote the implementation from my own understanding, so it is not guaranteed to be fully correct. I will publish corrections after further study.**
116 | 
117 | ### **Formula**
118 | 
119 | > From the derivative formulas of Sigmoid and Binary Cross-Entropy (the sample index $i$ is omitted):
120 | 
121 | $$
122 | \frac{\partial L}{\partial \hat y_j} = -\frac{1}{J}\left(\frac{y_j}{\hat y_j} - \frac{1-y_j}{1-\hat y_j}\right)
123 | $$
124 | 
125 | $$
126 | \frac{\partial \sigma_j}{\partial z_j} = \sigma_j(1-\sigma_j)
127 | $$
128 | 
129 | > The Sigmoid output $\sigma$ is the Binary Cross-Entropy input $\hat y$. In matrix form, $\frac{\partial L}{\partial z}$ and $\frac{\partial L}{\partial \hat y}$ are row vectors and $\frac{\partial \sigma}{\partial z}$ is a diagonal square matrix.
130 | 
131 | $$
132 | \frac{\partial L}{\partial z}=\frac{\partial L}{\partial \hat y}\frac{\partial \sigma}{\partial z}
133 | $$
134 | 
135 | > Computing each scalar entry gives:
136 | 
137 | $$
138 | \frac{\partial L}{\partial z_j}=\frac{\partial L}{\partial \hat y_j}\frac{\partial \sigma_j}{\partial z_j}=\frac{\partial L}{\partial \hat y_j}\frac{\partial \hat y_j}{\partial z_j}= -\frac{1}{J}\left(\frac{y_j}{\hat y_j} - \frac{1-y_j}{1-\hat y_j}\right)\hat y_j(1-\hat y_j)=\frac{\hat y_j-y_j}{J}
139 | $$
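
The final equality can be checked numerically. The sketch below is not part of the original notes: it assumes a single sample with $J=3$ outputs, and the helper names `sigmoid`, `bce_loss`, `analytic_grad` and `numeric_grad` are hypothetical, introduced only for this check. It compares $(\hat y_j-y_j)/J$ with a central finite difference of the loss.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(z, y):
    """Binary cross-entropy of sigmoid(z) for one sample, averaged over the J outputs."""
    y_hat = sigmoid(z)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def analytic_grad(z, y):
    # Combined gradient from the derivation: (y_hat - y) / J
    return (sigmoid(z) - y) / z.size


def numeric_grad(z, y, eps=1e-6):
    # Central finite difference, one output at a time
    grad = np.zeros_like(z)
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = eps
        grad[j] = (bce_loss(z + step, y) - bce_loss(z - step, y)) / (2 * eps)
    return grad


z = np.array([0.3, -1.2, 2.0])   # hypothetical Sigmoid inputs
y = np.array([1.0, 0.0, 1.0])    # hypothetical binary targets
max_diff = np.max(np.abs(analytic_grad(z, y) - numeric_grad(z, y)))
print(max_diff)  # expected to be tiny (limited by floating-point error)
```

If the two gradients agree to within floating-point error, the simplification to $(\hat y_j-y_j)/J$ holds for these values; this is a spot check, not a proof.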
140 | 
141 | ### **Implementation**
142 | 
143 | ```python
144 | class Activation_Sigmoid_Loss_BinaryCrossentropy():
145 |     def __init__(self):
146 |         self.activation = Activation_Sigmoid()
147 |         self.loss = Loss_BinaryCrossentropy()
148 | 
149 |     def forward(self, input, y_true):
150 |         self.activation.forward(input)
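151 |         # The output of this class is the output of Sigmoid
152 |         self.output = self.activation.output
153 |         return self.loss.calculate(self.output, y_true)
154 | 
155 |     def backward(self, y_pred, y_true):
156 |         # Number of samples
157 |         n_sample = len(y_pred)
158 |         # Pay special attention here; the book does not spell this out.
159 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
160 |         # Both shapes broadcast, but a 1-D (n_sample,) array cannot be transposed into a column,
161 |         # so without the reshape below the result would have shape (n_sample, n_sample).
162 |         # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
163 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
164 |             y_true = y_true.reshape(-1, 1)
165 |         # Number of binary outputs
166 |         J = len(y_pred[0])
167 |         # Each row of y_true holds J binary values: 1 marks a positive example, 0 a negative one.
168 |         self.dinput = ( y_pred - y_true ) / J
169 | 
170 |         # The optimizer sums over all samples; divide by the sample count so the gradient does not depend on it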
171 |         self.dinput /= n_sample
172 | ```
173 | 
174 | > **Read the comments carefully; they are important.**
175 | 
176 | ### **Example**
177 | 
178 | ```python
179 | ##########################################
180 | # Data
181 | X, y = spiral_data(samples=100,classes=2)
182 | print(X.shape)
183 | print(X[:5])
184 | 
185 | # Two Dense layers and one ReLU
186 | dense1 = Layer_Dense(2,4)
187 | dense2 = Layer_Dense(4,1)
188 | activation1 = Activation_ReLu()
189 | 
190 | # Forward pass
191 | dense1.forward(X)
192 | activation1.forward(dense1.output)
193 | dense2.forward(activation1.output)
194 | sigmoid_in = dense2.output
195 | print(sigmoid_in[:5])
196 | ####
197 | ##
198 | sigmoid_loss = Activation_Sigmoid_Loss_BinaryCrossentropy()
199 | dataloss1 = sigmoid_loss.forward(sigmoid_in, y)
200 | ##
201 | activation2 = Activation_Sigmoid()
202 | loss = Loss_BinaryCrossentropy()
203 | activation2.forward(sigmoid_in)
204 | dataloss2 = loss.calculate(activation2.output, y)
205 | ##
206 | ####
207 | 
208 | # Backward pass
209 | ####
210 | ##
211 | sigmoid_loss.backward(sigmoid_loss.output, y)
212 | dinput1 = sigmoid_loss.dinput
213 | ##
214 | loss.backward(activation2.output, y)
215 | activation2.backward(loss.dinput)
216 | dinput2 = activation2.dinput
217 | 
218 | print('Gradients: combined loss and activation:')
219 | print(dataloss1)
220 | print(dinput1.shape)
221 | print(dinput1[50:55])
222 | 
223 | print('Gradients: separate loss and activation:')
224 | print(dataloss2)
225 | print(dinput2.shape)
226 | print(dinput2[50:55])
227 | ```
228 | 
229 | 
230 | 
231 | ![](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308091744550.png)
232 | 
233 | > Both implementations produce the same loss and the same gradients.
234 | 
235 | ```python
236 | def f1():
237 |     sigmoid_loss.backward(sigmoid_loss.output, y)
238 |     dinput1 = sigmoid_loss.dinput
239 | def f2():
240 |     loss.backward(activation2.output, y)
241 |     activation2.backward(loss.dinput)
242 |     dinput2 = activation2.dinput
243 | 
244 | t1 = timeit(lambda: f1(), number=10000)
245 | t2 = timeit(lambda: f2(), number=10000)
246 | print(t2/t1)
247 | ```
248 | 
249 | ![image-20230809175814647](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308091758689.png)
250 | 
251 | > After 10000 repetitions of each, the separate implementation takes about 6 times as long as the combined one.
252 | 


--------------------------------------------------------------------------------
/5combineLossandActivation/combineLossandActivation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/5combineLossandActivation/combineLossandActivation.pdf


--------------------------------------------------------------------------------
/6Optimizer/NNFS_version6.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Author: 黄欣
  3 | Date: 2023-08-10
  4 | """
  5 | 
  6 | from timeit import timeit
  7 | 
  8 | # This version adds the Optimizer implementations.
  9 | 
 10 | import numpy as np
 11 | from nnfs.datasets import spiral_data
 12 | import matplotlib.pyplot as plt
 13 | 
 14 | 
 15 | class Layer_Dense:
 16 |     def __init__(self, n_input, n_neuron):
 17 |         # Initialize the weights from a normal distribution
 18 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
 19 |         # Initialize the bias to 0
 20 |         # self.bias = np.zeros(n_neuron)
 21 |         self.bias = np.zeros((1, n_neuron))
 22 | 
 23 |     def forward(self, input):
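 24 |         # Because a backward method is added:
 25 |         # the partial derivative of the Layer_Dense output with respect to the input is self.weight,
 26 |         # and the partial derivative of the Layer_Dense output with respect to self.weight is the input,
 27 |         # so forward must store the input in self.input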
 28 |         self.input = input
 29 |         self.output = np.dot(input, self.weight) + self.bias
 30 | 
 31 |     def backward(self, dvalue):
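 32 |         # dvalue is the derivative of the loss with respect to the input of the next layer (the Activation),
 33 |         # i.e. the derivative of the loss with respect to the output of this layer (Layer_Dense).
 34 |         # The chain rule is applied here.
 35 | 
 36 |         # First we need the derivative of the loss with respect to self.weight of this layer,
 37 |         # which gives the direction in which to optimize self.weight (the negative gradient direction).
 38 | 
 39 |         # self.dweight must have the same shape as self.weight so that w - lr * dw can be applied directly.
 40 |         # Suppose input holds a single sample of shape 1xa and weight has shape axb; then output has shape 1xb.
 41 |         # Because the loss is a scalar, dvalue = dloss/doutput has the shape of output (1xb),
 42 |         # so dweight has shape (1xa).T * (1xb) = axb, the same as weight.
 43 |         # Note: when input holds several samples (a matrix), dweight is the sum of the per-sample axb matrices.
 44 |         self.dweight = np.dot(self.input.T, dvalue)
 45 | 
 46 |         # Next we need the derivative of the loss with respect to self.input of this layer,
 47 |         # which becomes the dvalue argument of the next backward call (the preceding layer).
 48 | 
 49 |         # Because the loss is a scalar, dinput has the shape of input (1xa),
 50 |         # dvalue = dloss/doutput has the shape of output (1xb),
 51 |         # and weight has shape axb,
 52 |         # so 1xa = (1xb) * (axb).T
 53 |         self.dinput = np.dot(dvalue, self.weight.T)
 54 | 
 55 |         # Like self.dinput, self.dbias could be computed with a matrix product,
 56 |         # self.dbias = np.dot( dvalue, np.ones( ( len(self.bias), len(self.bias) ) ) )
 57 |         # but there is a faster and simpler way
 58 |         self.dbias = np.sum(dvalue, axis=0, keepdims=True)  # keepdims=True is optional here: summing over axis 0 still gives a row vector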
 59 | 
 60 | 
 61 | class Activation_Sigmoid:
 62 |     def __init__(self):
 63 |         pass
 64 | 
 65 |     def forward(self, input):
 66 |         self.input = input
 67 | 
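 68 |         # input has shape nx1, where n is the number of samples fed to the activation and each sample has one dimension,
 69 |         # so the preceding layer must be a Dense layer with a single output neuron, e.g. Layer_Dense(n_input, 1)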
 70 |         self.output = 1 / (1 + np.exp(- (self.input)))
 71 | 
 72 |     def backward(self, dvalue):
 73 |         # A matrix product could be used here, but dinput, dvalue and output have the same shape,
 74 |         # so an element-wise product is enough.
 75 |         self.dinput = dvalue * self.output * (1 - self.output)
 76 | 
 77 | 
 78 | class Activation_ReLu:
 79 |     def __init__(self):
 80 |         pass
 81 | 
 82 |     def forward(self, input):
 83 |         self.input = input
 84 |         self.output = np.maximum(0, input)
 85 | 
 86 |     def backward(self, dvalue):
 87 |         # self.input and self.output have the same shape,
 88 |         # so dinput, doutput and dvalue all have the same shape.
 89 |         # A mask is faster than a matrix product here.
 90 |         self.dinput = dvalue.copy()
 91 |         self.dinput[self.input <= 0] = 0  # zero gradient wherever the input was not positive
 92 | 
 93 | 
 94 | class Activation_Softmax:
 95 |     def __init__(self):
 96 |         pass
 97 | 
 98 |     def forward(self, input):
 99 |         self.input = input
100 | 
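101 |         # keepdims=True is required:
102 |         # without it, the column vector returned by np.max(input, axis=1) collapses into a 1-D array
103 |         # whose length does not match the length of each row of input,
104 |         # so it cannot be broadcast.
105 |         # Subtracting the row maximum before exponentiating keeps the inputs small and avoids overflow in exp.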
106 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
107 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
108 | 
109 |     def backward(self, dvalue):
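110 |         # input and output both have shape 1xa,
111 |         # the loss is a scalar, so dinput and doutput (i.e. dvalue) both have shape 1xa,
112 |         # and the derivative of output with respect to input is an axa square matrix.
113 | 
114 |         # Empty array of the same shape
115 |         self.dinput = np.empty_like(dvalue)
116 |         # Loop over each sample (each row)
117 |         for each, (single_output, single_dvalue) in enumerate(zip(self.output, dvalue)):
118 |             # Either order of the product gives a dinput of the same shape.
119 |             # Here (1xa) * (axa) = 1xa is a row vector.
120 |             # The 1-D vector of length a is first reshaped into a 1xa matrix,
121 |             # because transposing a 1-D vector has no effect (.T returns the same array).
122 |             # When np.dot receives a 1-D vector it orients it as needed and still returns a 1-D vector.
123 |             # When np.dot receives a 1xa matrix the shapes must match or an error is raised, and the result is a matrix.
124 |             single_output = single_output.reshape(1, -1)
125 |             jacobian_matrix = np.diagflat(single_output) - np.dot(single_output.T, single_output)
126 |             # single_dvalue is a 1-D vector, so np.dot orients it as needed;
127 |             # np.dot(single_dvalue, jacobian_matrix) and np.dot(jacobian_matrix, single_dvalue)
128 |             # both return a 1-D vector. The two products differ in general, but the softmax Jacobian is symmetric,
129 |             # so they give the same values here and np.dot(jacobian_matrix, single_dvalue) is correct as well.
130 |             # The 1-D result is written directly into row `each` of self.dinput.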
131 |             self.dinput[each] = np.dot(jacobian_matrix, single_dvalue)
132 | 
133 | 
134 | class Loss:
135 |     def __init__(self):
136 |         pass
137 | 
138 |     # Losses are always computed by calling calculate
139 |     def calculate(self, y_pred, y_true):
140 |         # Each loss function inherits from Loss and implements its own forward method.
141 |         data_loss = np.mean(self.forward(y_pred, y_true))
142 |         # Note: the loss is not stored as an attribute; it is returned directly
143 |         return data_loss
144 | 
145 | 
146 | class Loss_CategoricalCrossentropy(Loss):
147 |     def __init__(self):
148 |         pass
149 | 
150 |     def forward(self, y_pred, y_true):
151 |         # Number of samples
152 |         n_sample = len(y_true)
153 | 
154 |         # Clip at 1e-7 on the left to avoid log(0).
155 |         # Clipping only on the left would shift the confidences towards 1, even if only slightly,
156 |         # so the right boundary is 1 - 1e-7 to prevent that bias.
157 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
158 | 
159 |         loss = - np.log(y_pred)
160 |         if len(y_true.shape) == 2:  # one-hot encoded labels
161 |             loss = np.sum(loss * y_true, axis=1)
162 |         elif len(y_true.shape) == 1:  # sparse class labels
163 |             # Note: loss = loss[:, y_true] is not equivalent; it would return a matrix
164 |             loss = loss[range(n_sample), y_true]
165 | 
166 |         # loss is a 1-D array with one entry per sample;
167 |         # no mean is taken here because calculate in the parent class does that
168 |         return loss
169 | 
170 |     def backward(self, y_pred, y_true):
171 |         n_sample = len(y_true)
172 |         if len(y_true.shape) == 2:  # one-hot encoded labels
173 |             label = y_true
174 |         elif len(y_true.shape) == 1:  # sparse class labels
175 |             # Convert the labels to one-hot encoding
176 |             label = np.zeros((n_sample, len(y_pred[0])))
177 |             label[range(n_sample), y_true] = 1
178 |         self.dinput = - label / y_pred
179 |         # Divide by n_sample, because the gradients are summed over the samples during optimization
180 |         self.dinput = self.dinput / n_sample
181 | 
182 | 
183 | class Loss_BinaryCrossentropy(Loss):
184 |     def __init__(self):
185 |         pass
186 | 
187 |     def forward(self, y_pred, y_true):
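188 |         # Pay special attention here; the book does not spell this out.
189 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
190 |         # Both shapes broadcast, but a 1-D (n_sample,) array cannot be transposed into a column,
191 |         # so without the reshape below the loss would have shape (n_sample, n_sample).
192 |         # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
193 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
194 |             y_true = y_true.reshape(-1, 1)
195 |         # Clip at 1e-7 on the left to avoid log(0).
196 |         # Clipping only on the left would shift the confidences towards 1, even if only slightly,
197 |         # so the right boundary is 1 - 1e-7 to prevent that bias.
198 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
199 |         loss = -  np.log(y_pred) * y_true - np.log(1 - y_pred) * (1 - y_true)
200 |         # This mean is over a different axis than the mean in calculate of the parent class:
201 |         # here the mean is over the binary outputs of one sample,
202 |         # while calculate takes the mean over the samples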
203 |         loss = np.mean(loss, axis=-1)
204 |         return loss
205 | 
206 |     def backward(self, y_pred, y_true):
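207 |         # Number of samples
208 |         n_sample = len(y_true)
209 |         # Number of binary outputs
210 |         n_output = len(y_pred[0])
211 |         # Pay special attention here; the book does not spell this out.
212 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
213 |         # Both shapes broadcast, but a 1-D (n_sample,) array cannot be transposed into a column,
214 |         # so without the reshape below the gradient would have shape (n_sample, n_sample).
215 |         # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
216 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
217 |             y_true = y_true.reshape(-1, 1)
218 |         # Note: BinaryCrossentropy is always preceded by a Sigmoid,
219 |         # and Sigmoid easily produces outputs of exactly 0 or 1,
220 |         # so clip at 1e-7 on the left.
221 |         # Clipping only on the left would shift the confidences towards 1, even if only slightly,
222 |         # so the right boundary is 1 - 1e-7 to prevent that bias.
223 |         y_pred_clip = np.clip(y_pred, 1e-7, 1 - 1e-7)
224 |         # Never write it as below: the unary minus in -y_true is applied first, and y_true is uint8, so -1 => 255
225 |         # This bug took me a long time to find; take it seriously
226 |         # self.dinput = (-y_true / y_pred_clip + (1 - y_true) / (1 - y_pred_clip)) / n_output
227 |         self.dinput = -(y_true / y_pred_clip - (1 - y_true) / (1 - y_pred_clip)) / n_output
228 |         # Divide by n_sample, because the gradients are summed over the samples during optimization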
229 |         self.dinput = self.dinput / n_sample
230 | 
231 | 
232 | class Activation_Softmax_Loss_CategoricalCrossentropy():
233 |     def __init__(self):
234 |         self.activation = Activation_Softmax()
235 |         self.loss = Loss_CategoricalCrossentropy()
236 | 
237 |     # Note: Activation_Softmax_Loss_CategoricalCrossentropy computes the loss through forward,
238 |     # because it does not inherit from the Loss class
239 |     def forward(self, input, y_true):
240 |         self.activation.forward(input)
241 |         # The output attribute of this class is the output of Activation_Softmax()
242 |         self.output = self.activation.output
243 |         # The method returns the loss
244 |         return self.loss.calculate(self.output, y_true)
245 | 
246 |     # y_pred always equals self.output; the parameter is kept for consistency with the earlier code
247 |     def backward(self, y_pred, y_true):
248 |         # Number of samples
249 |         n_sample = len(y_true)
250 |         if len(y_true.shape) == 2:  # one-hot encoded labels
251 |             # Apply the formula directly
252 |             self.dinput = y_pred - y_true
253 |         elif len(y_true.shape) == 1:  # sparse class labels
254 |             self.dinput = y_pred.copy()
255 |             # Subtract 1 at the y_true class index of each row; the other entries are left unchanged
256 |             self.dinput[range(n_sample), y_true] -= 1
257 |         # Divide by n_sample, because the gradients are summed over the samples during optimization
258 |         self.dinput = self.dinput / n_sample
259 | 
260 | 
261 | class Activation_Sigmoid_Loss_BinaryCrossentropy():
262 |     def __init__(self):
263 |         self.activation = Activation_Sigmoid()
264 |         self.loss = Loss_BinaryCrossentropy()
265 | 
266 |     def forward(self, input, y_true):
267 |         self.activation.forward(input)
268 |         # The output of this class is the output of Sigmoid
269 |         self.output = self.activation.output
270 |         return self.loss.calculate(self.output, y_true)
271 | 
272 |     def backward(self, y_pred, y_true):
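273 |         # Number of samples
274 |         n_sample = len(y_pred)
275 |         # Pay special attention here; the book does not spell this out.
276 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
277 |         # Both shapes broadcast, but a 1-D (n_sample,) array cannot be transposed into a column,
278 |         # so without the reshape below the result would have shape (n_sample, n_sample).
279 |         # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
280 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
281 |             y_true = y_true.reshape(-1, 1)
282 |         # Number of binary outputs
283 |         J = len(y_pred[0])
284 |         # Each row of y_true holds J binary values: 1 marks a positive example, 0 a negative one.
285 |         self.dinput = (y_pred - y_true) / J
286 | 
287 |         # The optimizer sums over all samples; divide by the sample count so the gradient does not depend on it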
288 |         self.dinput /= n_sample
289 | 
290 | class Optimizer_SGD():
291 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
292 |     def __init__(self, learning_rate = 1.0, decay = 0, momentum=0):
293 |         self.learning_rate = learning_rate
294 |         self.decay = decay
295 |         self.current_learning_rate = learning_rate
296 |         self.iteration = 0
297 |         self.momentum = momentum
298 | 
299 |     def pre_update_param(self):
300 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result.
301 |         if self.decay:
302 |             self.current_learning_rate = self.learning_rate * \
303 |                                          (1 / (1 + self.decay * self.iteration))
304 | 
305 |     # Given a layer object, perform the most basic update
306 |     def update_param(self, layer):
307 | 
308 |         deta_weight = layer.dweight
309 |         deta_bias = layer.dbias
310 | 
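311 |         # If momentum is used
312 |         if self.momentum:
313 |             # If no momentum has been accumulated yet
314 |             if not hasattr(layer, "dweight_cumulate"):
315 |                 # Note: the attributes are added to the layer object,
316 |                 # which makes sense because the history belongs to the object it describes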
317 |                 layer.dweight_cumulate = np.zeros_like(layer.weight)
318 |                 layer.dbias_cumulate = np.zeros_like(layer.bias)
319 |             deta_weight += self.momentum * layer.dweight_cumulate
320 |             layer.dweight_cumulate = deta_weight
321 |             deta_bias += self.momentum * layer.dbias_cumulate
322 |             layer.dbias_cumulate = deta_bias
323 |         layer.weight -= self.current_learning_rate * deta_weight
324 |         # (64,) = (64,) + (1,64) >> (1,64)
325 |         # (64,) += (1,64) >> cannot broadcast
326 |         # (1, 64) = (64,) + (1,64) >> (1,64)
327 |         # (1, 64) += (64,) >> (1,64)
328 |         # so the Dense layer was changed:
329 |         # self.bias = np.zeros(n_neuron) => self.bias = np.zeros((1, n_neuron))
330 |         layer.bias -= self.current_learning_rate * deta_bias
331 | 
332 |     def post_update_param(self):
333 |         self.iteration += 1
334 | 
335 | class Optimizer_Adagrad():
336 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
337 |     def __init__(self, learning_rate = 1.0, decay = 0, epsilon = 1e-7):
338 |         self.learning_rate = learning_rate
339 |         self.decay = decay
340 |         self.current_learning_rate = learning_rate
341 |         self.iteration = 0
342 |         # Small constant that prevents division by zero
343 |         self.epsilon = epsilon
344 | 
345 | 
346 |     def pre_update_param(self):
347 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result.
348 |         if self.decay:
349 |             self.current_learning_rate = self.learning_rate * \
350 |                                          (1 / (1 + self.decay * self.iteration))
351 | 
352 |     # Update the parameters of the given layer object
353 |     def update_param(self, layer):
354 |         if not hasattr(layer, 'dweight_square_sum'):
355 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
356 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
357 |         layer.dweight_square_sum = layer.dweight_square_sum + layer.dweight ** 2
358 |         layer.dbias_square_sum = layer.dbias_square_sum + layer.dbias ** 2
359 |         layer.weight += -self.current_learning_rate * layer.dweight / \
360 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
361 |         layer.bias += -self.current_learning_rate * layer.dbias / \
362 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
363 | 
364 |     def post_update_param(self):
365 |         self.iteration += 1
366 | 
367 | class Optimizer_RMSprop():
368 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
369 |     def __init__(self, learning_rate = 0.001, decay = 0, epsilon = 1e-7, beta = 0.9):
370 |         # Note: the default learning_rate here is 0.001, not 1
371 |         self.learning_rate = learning_rate
372 |         self.decay = decay
373 |         self.current_learning_rate = learning_rate
374 |         self.iteration = 0
375 |         # Small constant that prevents division by zero
376 |         self.epsilon = epsilon
377 |         self.beta = beta
378 | 
379 |     def pre_update_param(self):
380 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result.
381 |         if self.decay:
382 |             self.current_learning_rate = self.learning_rate * \
383 |                                          (1 / (1 + self.decay * self.iteration))
384 | 
385 |     # Update the parameters of the given layer object
386 |     def update_param(self, layer):
387 |         if not hasattr(layer, 'dweight_square_sum'):
388 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
389 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
390 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
391 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
392 |         layer.weight += -self.current_learning_rate * layer.dweight / \
393 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
394 |         layer.bias += -self.current_learning_rate * layer.dbias / \
395 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
396 | 
397 |     def post_update_param(self):
398 |         self.iteration += 1
399 | 
400 | class Optimizer_Adam():
401 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
402 |     def __init__(self, learning_rate = 0.001, decay = 0, epsilon = 1e-7, momentum = 0.0,beta = 0.999):
403 |         # Note: the default learning_rate here is 0.001, not 1
404 |         self.learning_rate = learning_rate
405 |         self.decay = decay
406 |         self.current_learning_rate = learning_rate
407 |         self.iteration = 0
408 |         # Small constant that prevents division by zero
409 |         self.epsilon = epsilon
410 |         self.beta = beta
411 |         self.momentum = momentum
412 | 
413 |     def pre_update_param(self):
414 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result.
415 |         if self.decay:
416 |             self.current_learning_rate = self.learning_rate * \
417 |                                          (1 / (1 + self.decay * self.iteration))
418 | 
419 |     # Update the parameters of the given layer object
420 |     def update_param(self, layer):
421 |         if not hasattr(layer, 'dweight_square_sum') or not hasattr(layer, 'dweight_cumulate'):
422 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
423 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
424 |             layer.dweight_cumulate = np.zeros_like(layer.weight)
425 |             layer.dbias_cumulate = np.zeros_like(layer.bias)
426 |         # Momentum (running average of the gradients)
427 |         layer.dweight_cumulate = self.momentum * layer.dweight_cumulate + (1 - self.momentum) * layer.dweight
428 |         layer.dbias_cumulate = self.momentum * layer.dbias_cumulate + (1 - self.momentum) * layer.dbias
429 |         # Bias-corrected momentum
430 |         layer.dweight_cumulate_modified = layer.dweight_cumulate / (1 - self.momentum ** (self.iteration + 1))
431 |         layer.dbias_cumulate_modified = layer.dbias_cumulate / (1 - self.momentum ** (self.iteration + 1))
432 |         # Running average of the squared gradients
433 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
434 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
435 |         # Bias-corrected running average of the squared gradients
436 |         layer.dweight_square_sum_modified = layer.dweight_square_sum / (1 - self.beta ** (self.iteration + 1))
437 |         layer.dbias_square_sum_modified = layer.dbias_square_sum / (1 - self.beta ** (self.iteration + 1))
438 | 
439 |         layer.weight += -self.current_learning_rate * layer.dweight_cumulate_modified / \
440 |                         ( np.sqrt(layer.dweight_square_sum_modified) + self.epsilon )
441 |         layer.bias += -self.current_learning_rate * layer.dbias_cumulate_modified / \
442 |                         (np.sqrt(layer.dbias_square_sum_modified) + self.epsilon)
443 | 
444 |     def post_update_param(self):
445 |         self.iteration += 1
446 | 
447 | 
448 | # Dataset
449 | X, y = spiral_data(samples=100, classes=3)
450 | 
451 | # 2 inputs, 64 outputs
452 | dense1 = Layer_Dense(2, 64)
453 | activation1 = Activation_ReLu()
454 | # 64 inputs, 3 outputs
455 | dense2 = Layer_Dense(64, 3)
456 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
457 | 
458 | # Optimizer
459 | optimizer = Optimizer_Adam(learning_rate=0.05, decay=5e-7)
460 | 
461 | # Train for 10001 epochs (0 to 10000)
462 | for epoch in range(10001):
463 |     # Forward pass
464 |     dense1.forward(X)
465 |     activation1.forward(dense1.output)
466 |     dense2.forward(activation1.output)
467 |     loss = loss_activation.forward(dense2.output, y)
468 | 
469 |     # Class with the highest confidence
470 |     predictions = np.argmax(loss_activation.output, axis=1)
471 |     if len(y.shape) == 2: # one-hot encoded labels
472 |         # Convert to sparse class labels
473 |         y = np.argmax(y, axis=1)
474 |     accuracy = np.mean(predictions == y)
475 | 
476 |     if not epoch % 100:
477 |         print(f'epoch: {epoch}, ' +
478 |                 f'acc: {accuracy:.3f}, ' +
479 |                 f'loss: {loss:.3f}, '+
480 |                 f'lr: {optimizer.current_learning_rate}'
481 |                 )
482 | 
483 |     # Backward pass
484 |     loss_activation.backward(loss_activation.output, y)
485 |     dense2.backward(loss_activation.dinput)
486 |     activation1.backward(dense2.dinput)
487 |     dense1.backward(activation1.dinput)
488 | 
489 |     # Update the parameters
490 |     optimizer.pre_update_param()
491 |     optimizer.update_param(dense1)
492 |     optimizer.update_param(dense2)
493 |     optimizer.post_update_param()
494 | 


--------------------------------------------------------------------------------
/6Optimizer/Optimizer.md:
--------------------------------------------------------------------------------
  1 | # Optimizer
  2 | 
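  3 | ## I. Overview
  4 | 
  5 | This part implements several optimizers: Stochastic Gradient Descent (SGD), Batch Gradient Descent (BGD), Mini-batch Gradient Descent (MBGD), Momentum, AdaGrad, RMSProp and Adam. **An optimizer uses the gradient information to decide in which direction to descend and by how much.**
  6 | 
  7 | ## II. Optimizers
  8 | 
  9 | ### I. SGD
 10 | 
 11 | #### **Formula**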
 12 | 
 13 | ![image-20230810094803983](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308100948056.png)
 14 | 
 15 | #### **Implementation**
 16 | 
 17 | ```py
 18 | class Optimizer_SGD():
 19 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
 20 |     def __init__(self, learning_rate = 1.0):
 21 |         self.learning_rate = learning_rate
 22 | 
 23 |     # Given a layer object, perform the most basic update
 24 |     def update_param(self, layer):
 25 |         layer.weight += - self.learning_rate * layer.dweight
 26 |         # (64,) = (64,) + (1,64) >> (1,64)
 27 |         # (64,) += (1,64) >> cannot broadcast
 28 |         # (1, 64) = (64,) + (1,64) >> (1,64)
 29 |         # (1, 64) += (64,) >> (1,64)
 30 |         # so the Dense layer was changed:
 31 |         # self.bias = np.zeros(n_neuron) => self.bias = np.zeros((1, n_neuron))
 32 |         layer.bias += - self.learning_rate * layer.dbias
 33 | ```
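
The shape comments in `update_param` can be checked directly. The following is a minimal sketch that assumes only NumPy; the names `bias_1d`, `bias_2d` and `dbias` are illustrative and do not appear in the original code.

```python
import numpy as np

dbias = np.ones((1, 64))      # shape produced by np.sum(dvalue, axis=0, keepdims=True)
bias_1d = np.zeros(64)        # old initialisation: np.zeros(n_neuron)
bias_2d = np.zeros((1, 64))   # new initialisation: np.zeros((1, n_neuron))

# Out-of-place addition broadcasts to (1, 64) and allocates a new array.
out_of_place_shape = (bias_1d + dbias).shape

# In-place addition has to write a (1, 64) result into a (64,) array, which fails.
try:
    bias_1d += dbias
    in_place_failed = False
except ValueError:
    in_place_failed = True

# With a (1, 64) bias the in-place update accepts both gradient shapes.
bias_2d += dbias
bias_2d += np.ones(64)

print(out_of_place_shape, in_place_failed, bias_2d.shape)
```

This is why the in-place `+=` in the optimizer needs the bias to be stored with shape `(1, n_neuron)`.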
 34 | 
 35 | #### **Example**
 36 | 
 37 | ```python
 38 | # Dataset
 39 | X, y = spiral_data(samples=100, classes=3)
 40 | 
 41 | # 2 inputs, 64 outputs
 42 | dense1 = Layer_Dense(2, 64)
 43 | activation1 = Activation_ReLu()
 44 | # 64 inputs, 3 outputs
 45 | dense2 = Layer_Dense(64, 3)
 46 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
 47 | 
 48 | # Optimizer
 49 | optimizer = Optimizer_SGD()
 50 | 
 51 | # Train for 10001 epochs (0 to 10000)
 52 | for epoch in range(10001):
 53 |     # Forward pass
 54 |     dense1.forward(X)
 55 |     activation1.forward(dense1.output)
 56 |     dense2.forward(activation1.output)
 57 |     loss = loss_activation.forward(dense2.output, y)
 58 | 
 59 |     # Class with the highest confidence
 60 |     predictions = np.argmax(loss_activation.output, axis=1)
 61 |     if len(y.shape) == 2: # one-hot encoded labels
 62 |         # convert to a single class index per sample
 63 |         y = np.argmax(y, axis=1)
 64 |     accuracy = np.mean(predictions == y)
 65 | 
 66 |     if not epoch % 100:
 67 |         print(f'epoch: {epoch}, ' +
 68 |                 f'acc: {accuracy:.3f}, ' +
 69 |                   f'loss: {loss:.3f}')
 70 | 
 71 |     # Backward pass
 72 |     loss_activation.backward(loss_activation.output, y)
 73 |     dense2.backward(loss_activation.dinput)
 74 |     activation1.backward(dense2.dinput)
 75 |     dense1.backward(activation1.dinput)
 76 | 
 77 |     # Update the parameters
 78 |     optimizer.update_param(dense1)
 79 |     optimizer.update_param(dense2)
 80 | ```
 81 | 
 82 | ![image-20230810103616150](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101036186.png)
 83 | 
 84 | ![image-20230810103632601](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101036632.png)
 85 | 
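 86 | > Accuracy goes up and loss goes down.
 87 | 
 88 | #### **Formula: learning rate decay**
 89 | 
 90 | The purpose of learning rate decay is to reduce the learning rate gradually during training. With a fixed learning rate, the network can end up oscillating around a point away from the actual minimum. Shrinking the learning rate as training proceeds helps the network settle into a local minimum instead of oscillating. The learning rate is updated at every step: multiply the step count by the decay rate, add 1, take the reciprocal, and multiply the initial learning rate by the result. The further training has progressed, the larger the step count and the product, so the reciprocal, and with it the learning rate, becomes smaller. The added 1 ensures that the resulting learning rate never exceeds the initial one.
 91 | $$
 92 | r_c = \frac{r}{(1+decay \times t)}
 93 | $$
 94 | 
 95 | > $r$ is the initial learning rate, $r_c$ is the current learning rate, and $t$ is the number of optimizer steps taken so far (`self.iteration`). The examples here take one step per epoch, so $t$ equals the epoch count.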
 96 | 
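A few values of the schedule make the formula concrete. This is a small sketch; the helper name `decayed_learning_rate` is hypothetical and only mirrors the expression used in `pre_update_param`.

```python
def decayed_learning_rate(initial_lr, decay, step):
    """Return initial_lr / (1 + decay * step), the formula r_c = r / (1 + decay * t)."""
    return initial_lr / (1.0 + decay * step)


# Initial learning rate 1.0 and decay 1e-3, sampled at a few step counts.
schedule = {step: decayed_learning_rate(1.0, 1e-3, step) for step in (0, 1, 1000, 10000)}

for step, lr in schedule.items():
    print(f"step {step:>5}: lr = {lr:.6f}")
```

With these settings the learning rate is halved after 1000 steps and has dropped to 1/11 of its initial value after 10000 steps.
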
 97 | #### **Implementation**
 98 | 
 99 | ```py
100 | class Optimizer_SGD():
101 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
102 |     def __init__(self, learning_rate = 1.0, decay = 0):
103 |         self.learning_rate = learning_rate
104 |         self.decay = decay
105 |         self.current_learning_rate = learning_rate
106 |         self.iteration = 0
107 | 
108 |     def pre_update_param(self):
109 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result
110 |         if self.decay:
111 |             self.current_learning_rate = self.learning_rate * \
112 |                                          (1 / (1 + self.decay * self.iteration))
113 | 
114 |     # Given a layer object, perform the most basic update
115 |     def update_param(self, layer):
116 |         layer.weight += - self.current_learning_rate * layer.dweight
117 |         # bias = bias + dbias with shapes (64,) + (1,64) gives a new array of shape (1,64)
118 |         # bias += dbias with shapes (64,) += (1,64) cannot broadcast and raises an error
119 |         # (1,64) = (64,) + (1,64) also gives shape (1,64)
120 |         # (1,64) += (64,) works and stays (1,64)
121 |         # so Layer_Dense was changed:
122 |         # self.bias = np.zeros(n_neuron) => self.bias = np.zeros((1, n_neuron))
123 |         layer.bias += - self.current_learning_rate * layer.dbias
124 | 
125 |     def post_update_param(self):
126 |         self.iteration += 1
127 | ```
128 | 
129 | #### **Example**
130 | 
131 | ```py
132 | # Update the parameters
133 | optimizer.pre_update_param()
134 | optimizer.update_param(dense1)
135 | optimizer.update_param(dense2)
136 | optimizer.post_update_param()
137 | ```
138 | 
139 | ![image-20230810113742570](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101137606.png)
140 | 
141 | ### 2. Momentum
142 | 
143 | #### **Formula**
144 | 
145 | ![image-20230810132642666](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101326716.png)
146 | 
147 | > In the implementation, the factor $(1-\beta)$ is simply taken to be 1.
148 | 
149 | #### **Implementation**
150 | 
151 | > Momentum is not implemented as a separate optimizer here. It is built on top of SGD and enabled through the `momentum` parameter.
152 | 
153 | ```py
154 | class Optimizer_SGD():
155 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
156 |     def __init__(self, learning_rate = 1.0, decay = 0, momentum=0):
157 |         self.learning_rate = learning_rate
158 |         self.decay = decay
159 |         self.current_learning_rate = learning_rate
160 |         self.iteration = 0
161 |         self.momentum = momentum
162 | 
163 |     def pre_update_param(self):
164 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result
165 |         if self.decay:
166 |             self.current_learning_rate = self.learning_rate * \
167 |                                          (1 / (1 + self.decay * self.iteration))
168 | 
169 |     # Given a layer object, perform the update
170 |     def update_param(self, layer):
171 | 
172 |         delta_weight = layer.dweight
173 |         delta_bias = layer.dbias
174 | 
175 |         # If momentum is used
176 |         if self.momentum:
177 |             # If no momentum has been accumulated yet
178 |             if not hasattr(layer, "dweight_cumulate"):
179 |                 # Note: the attributes are added to the layer object,
180 |                 # because the history naturally belongs to the object it describes
181 |                 layer.dweight_cumulate = np.zeros_like(layer.weight)
182 |                 layer.dbias_cumulate = np.zeros_like(layer.bias)
183 |             delta_weight = delta_weight + self.momentum * layer.dweight_cumulate  # new array; += would also modify layer.dweight
184 |             layer.dweight_cumulate = delta_weight
185 |             delta_bias = delta_bias + self.momentum * layer.dbias_cumulate  # new array; += would also modify layer.dbias
186 |             layer.dbias_cumulate = delta_bias
187 |         layer.weight -= self.current_learning_rate * delta_weight
188 |         # bias = bias + dbias with shapes (64,) + (1,64) gives a new array of shape (1,64)
189 |         # bias += dbias with shapes (64,) += (1,64) cannot broadcast and raises an error
190 |         # (1,64) = (64,) + (1,64) also gives shape (1,64)
191 |         # (1,64) += (64,) works and stays (1,64)
192 |         # so Layer_Dense was changed:
193 |         # self.bias = np.zeros(n_neuron) => self.bias = np.zeros((1, n_neuron))
194 |         layer.bias -= self.current_learning_rate * delta_bias
195 | 
196 |     def post_update_param(self):
197 |         self.iteration += 1
198 | ```
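
To see the effect of the `momentum` term in isolation, the same accumulate-then-step rule can be run on a one-dimensional quadratic. This is a toy sketch: the function `minimize_quadratic`, the objective $f(w) = w^2$, and the chosen learning rate are illustrative assumptions, not part of the original notes.

```python
def minimize_quadratic(lr, momentum, steps, w=5.0):
    """Minimise f(w) = w**2 with the update rule used by Optimizer_SGD above."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w                           # df/dw
        velocity = grad + momentum * velocity    # (1 - beta) is taken as 1
        w -= lr * velocity
    return w


w_plain = minimize_quadratic(lr=0.01, momentum=0.0, steps=100)
w_momentum = minimize_quadratic(lr=0.01, momentum=0.8, steps=100)

print(f"without momentum: w = {w_plain:.4f}")
print(f"with momentum:    w = {w_momentum:.2e}")
```

On this toy problem and with these settings, the run with momentum ends much closer to the minimum at $w = 0$ after the same number of steps. Other objectives and learning rates can behave differently.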
199 | 
200 | #### **Example**
201 | 
202 | ```python
203 | # Dataset
204 | X, y = spiral_data(samples=100, classes=3)
205 | 
206 | # 2 inputs, 64 outputs
207 | dense1 = Layer_Dense(2, 64)
208 | activation1 = Activation_ReLu()
209 | # 64 inputs, 3 outputs
210 | dense2 = Layer_Dense(64, 3)
211 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
212 | 
213 | # Optimizer
214 | optimizer = Optimizer_SGD(decay=0.001, momentum=0.8)
215 | 
216 | # Train for 10001 epochs (0 to 10000)
217 | for epoch in range(10001):
218 |     # Forward pass
219 |     dense1.forward(X)
220 |     activation1.forward(dense1.output)
221 |     dense2.forward(activation1.output)
222 |     loss = loss_activation.forward(dense2.output, y)
223 | 
224 |     # Class with the highest confidence
225 |     predictions = np.argmax(loss_activation.output, axis=1)
226 |     if len(y.shape) == 2: # one-hot encoded labels
227 |         # convert to a single class index per sample
228 |         y = np.argmax(y, axis=1)
229 |     accuracy = np.mean(predictions == y)
230 | 
231 |     if not epoch % 100:
232 |         print(f'epoch: {epoch}, ' +
233 |                 f'acc: {accuracy:.3f}, ' +
234 |                 f'loss: {loss:.3f}, '+
235 |                 f'lr: {optimizer.current_learning_rate}'
236 |                 )
237 | 
238 |     # Backward pass
239 |     loss_activation.backward(loss_activation.output, y)
240 |     dense2.backward(loss_activation.dinput)
241 |     activation1.backward(dense2.dinput)
242 |     dense1.backward(activation1.dinput)
243 | 
244 |     # Update the parameters
245 |     optimizer.pre_update_param()
246 |     optimizer.update_param(dense1)
247 |     optimizer.update_param(dense2)
248 |     optimizer.post_update_param()
249 | ```
250 | 
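251 | > Here momentum=0.8 is used.
252 | 
253 | ![image-20230810135951807](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101359846.png)
254 | 
255 | > Accuracy rises to 95% and the loss falls to 0.16.
256 | 
257 | ### 3. AdaGrad
258 | 
259 | AdaGrad, short for adaptive gradient, sets a learning rate per parameter instead of one globally shared rate: **it computes an adaptive learning rate for each parameter**. The idea is to normalize the updates across features. During training some weights can grow significantly while others change very little. AdaGrad keeps a cache of accumulated squared gradients and divides each update by its square root. Because the cache only ever grows, dividing by it can also stall learning, since the updates become smaller and smaller over time. This is why the optimizer is not widely used outside a few specific applications. It is typically applied to sparse data (with very many features), where samples differ mainly in which features are present rather than in the magnitude of those features.
260 | 
261 | [“随机梯度下降、牛顿法、动量法、Nesterov、AdaGrad、RMSprop、Adam”，打包理解对梯度下降法的优化_哔哩哔哩_bilibili](https://www.bilibili.com/video/BV1r64y1s7fU/?spm_id_from=333.337.search-card.all.click&vd_source=464f67ea9f577a9f41a7cb8930f73ee5)
262 | 
263 | #### **Formula**
264 | 
265 | ![image-20230810140933381](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101409434.png)
266 | 
267 | #### **Implementation**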
268 | 
269 | ```python
270 | class Optimizer_Adagrad():
271 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
272 |     def __init__(self, learning_rate = 1.0, decay = 0, epsilon = 1e-7):
273 |         self.learning_rate = learning_rate
274 |         self.decay = decay
275 |         self.current_learning_rate = learning_rate
276 |         self.iteration = 0
277 |         # Very small value that prevents division by 0
278 |         self.epsilon = epsilon
279 | 
280 | 
281 |     def pre_update_param(self):
282 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result
283 |         if self.decay:
284 |             self.current_learning_rate = self.learning_rate * \
285 |                                          (1 / (1 + self.decay * self.iteration))
286 | 
287 |     # Given a layer object, update its parameters
288 |     def update_param(self, layer):
289 |         if not hasattr(layer, 'dweight_square_sum'):
290 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
291 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
292 |         layer.dweight_square_sum += layer.dweight ** 2
293 |         layer.dbias_square_sum += layer.dbias ** 2
294 |         layer.weight += -self.current_learning_rate * layer.dweight / \
295 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
296 |         layer.bias += -self.current_learning_rate * layer.dbias / \
297 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
298 | 
299 |     def post_update_param(self):
300 |         self.iteration += 1
301 | ```
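
The per-parameter scaling can be illustrated with two parameters whose gradients differ by a factor of 100. This sketch assumes constant gradients, which is an artificial setup chosen only to make the arithmetic easy to follow; `square_sum` plays the role of `dweight_square_sum` in the class above.

```python
import numpy as np

learning_rate, epsilon = 1.0, 1e-7
grads = np.array([10.0, 0.1])        # two parameters with very different gradient scales
square_sum = np.zeros_like(grads)    # accumulated squared gradients

steps = []
for _ in range(3):
    square_sum += grads ** 2
    steps.append(learning_rate * grads / (np.sqrt(square_sum) + epsilon))
steps = np.array(steps)              # shape (3, 2): one row per update

print(steps)
```

With constant gradients, both parameters receive almost the same step size, roughly $1/\sqrt{t}$ at update $t$, even though their raw gradients differ by two orders of magnitude. The steps also shrink monotonically, which is the stalling effect described above.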
302 | 
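303 | #### **Example**
304 | 
305 | ![image-20230810162705435](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101627474.png)
306 | 
307 | > AdaGrad does fairly well here, but not as well as SGD with momentum; the loss keeps falling throughout training. Interestingly, AdaGrad initially needed more epochs to reach results similar to SGD with momentum. This may be because AdaGrad's effective learning rate shrinks as squared gradients accumulate, which makes later updates very small, whereas SGD with momentum maintains its update speed and direction. AdaGrad does have its own strengths, such as handling sparse data and features of different scales.
308 | 
309 | ### 4. RMSProp
310 | 
311 | RMSProp (Root Mean Square Propagation) is similar to AdaGrad in that it also **computes an adaptive learning rate for each parameter**; it just computes it differently. The main idea is to store an exponentially decaying average of past squared gradients, which avoids AdaGrad's problem of the learning rate dropping too quickly.
312 | 
313 | #### **Formula**
314 | 
315 | ![image-20230810162309098](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101623143.png)
316 | 
317 | > Compared with AdaGrad, RMSProp adds a weighting coefficient. The new hyperparameter is $\beta$, the cache memory decay rate: the older a squared gradient is, the lower its weight. Because the cache no longer grows without bound, the adaptive updates stay large even when gradients are small, so the optimizer keeps moving (the step size shrinks slowly). A default learning rate of 1 is therefore too large and makes the model unstable immediately. A learning rate of about 0.001 is stable while still giving fast enough updates (it is also the default for this optimizer in some well-known machine learning frameworks). That value is used as the default from here on.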
318 | 
319 | #### **Implementation**
320 | 
321 | ```python
322 | class Optimizer_RMSprop():
323 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
324 |     def __init__(self, learning_rate = 0.001, decay = 0, epsilon = 1e-7, beta = 0.9):
325 |         # Note: the default learning_rate here is 0.001, not 1
326 |         self.learning_rate = learning_rate
327 |         self.decay = decay
328 |         self.current_learning_rate = learning_rate
329 |         self.iteration = 0
330 |         # Very small value that prevents division by 0
331 |         self.epsilon = epsilon
332 |         self.beta = beta
333 | 
334 |     def pre_update_param(self):
335 |         # Decay: multiply the step count by the decay rate, add 1, and divide the initial learning rate by the result
336 |         if self.decay:
337 |             self.current_learning_rate = self.learning_rate * \
338 |                                          (1 / (1 + self.decay * self.iteration))
339 | 
340 |     # Given a layer object, update its parameters
341 |     def update_param(self, layer):
342 |         if not hasattr(layer, 'dweight_square_sum'):
343 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
344 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
345 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
346 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
347 |         layer.weight += -self.current_learning_rate * layer.dweight / \
348 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
349 |         layer.bias += -self.current_learning_rate * layer.dbias / \
350 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
351 | 
352 |     def post_update_param(self):
353 |         self.iteration += 1
354 | ```
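
The difference between the two caches shows up clearly under a constant gradient. This is a small sketch with illustrative variable names; the constant gradient of 2.0 is an assumption made only to keep the numbers simple.

```python
grad, beta = 2.0, 0.9
adagrad_cache, rmsprop_cache = 0.0, 0.0

for _ in range(200):
    # AdaGrad: plain running sum of squared gradients, grows without bound
    adagrad_cache += grad ** 2
    # RMSProp: exponentially decaying average, settles near grad ** 2
    rmsprop_cache = beta * rmsprop_cache + (1 - beta) * grad ** 2

print(f"AdaGrad cache: {adagrad_cache:.1f}")
print(f"RMSProp cache: {rmsprop_cache:.6f}")
```

After 200 updates the AdaGrad cache has grown to 800 while the RMSProp cache stays close to 4, so RMSProp keeps dividing by about 2 where AdaGrad divides by about 28.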
355 | 
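356 | #### **Example**
357 | 
358 | ![image-20230810165813780](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101658832.png)
359 | 
360 | > The learning rate and the loss both change very slowly.
361 | 
362 | ```py
363 | # Optimizer
364 | optimizer = Optimizer_RMSprop(learning_rate=0.02, decay=1e-5,beta=0.999)
365 | ```
366 | 
367 | ![image-20230810170239993](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101702033.png)
368 | 
369 | > This optimizer is hard to tune: changing $\beta$ also changes $(1-\beta)$, and both affect the result.
370 | 
371 | ### 5. Adam
372 | 
373 | Adam (Adaptive Moment Estimation) is currently one of the most widely used optimizers. It builds on RMSProp and adds the momentum concept from SGD. Instead of applying the current gradient directly, it applies momentum as the SGD-with-momentum optimizer does, and then uses a cache, as RMSProp does, to apply an adaptive learning rate to each weight. This combines the strengths of SGD with momentum and RMSProp for faster, more stable training.
374 | 
375 | #### Formula
376 | 
377 | ![image-20230810170914039](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101709090.png)
378 | 
379 | > At the start of training the momentum and the cache are both initialized to 0, so their moving averages are biased toward 0 during the first steps, which slows training. Bias correction compensates for this: the momentum and the squared-gradient cache are each divided by a correction factor that starts small and grows toward 1 as training proceeds. Early on, dividing by the small factor scales the momentum and the cache up. As the factor approaches 1, the correction fades and the momentum and cache return to their uncorrected values.
380 | 
381 | $$
382 | 1-\beta^t
383 | $$
384 | 
385 | > This is the correction factor, where $t$ is the number of steps taken (one per epoch in these examples). It is small at the start, so dividing by it produces a larger value; this is the early speed-up that the correction is meant to provide.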
386 | 
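The size of the correction factor at a few step counts can be computed directly. This sketch assumes a momentum decay rate of 0.9 (the name `beta_1` and the value are illustrative) and a constant gradient of 1.0.

```python
beta_1 = 0.9

# Correction factor 1 - beta^t at a few step counts
corrections = {t: 1 - beta_1 ** t for t in (1, 2, 10, 100)}

# First step from a zero-initialised momentum with a gradient of 1.0
grad = 1.0
momentum = beta_1 * 0.0 + (1 - beta_1) * grad   # 0.1, biased toward 0
corrected = momentum / corrections[1]           # rescaled to the gradient's magnitude

for t, c in corrections.items():
    print(f"t = {t:>3}: 1 - beta^t = {c:.5f}")
print(f"raw momentum = {momentum:.3f}, corrected = {corrected:.3f}")
```

The factor rises from 0.1 at the first step toward 1, so the correction matters mostly during the first few dozen steps.
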
387 | #### **Implementation**
388 | 
389 | ```python
390 | # Dataset
391 | X, y = spiral_data(samples=100, classes=3)
392 | 
393 | # 2 inputs, 64 outputs
394 | dense1 = Layer_Dense(2, 64)
395 | activation1 = Activation_ReLu()
396 | # 64 inputs, 3 outputs
397 | dense2 = Layer_Dense(64, 3)
398 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
399 | 
400 | # Optimizer
401 | optimizer = Optimizer_Adam(learning_rate=0.05, decay=5e-7)
402 | 
403 | # Train for 10001 epochs (0 to 10000)
404 | for epoch in range(10001):
405 |     # Forward pass
406 |     dense1.forward(X)
407 |     activation1.forward(dense1.output)
408 |     dense2.forward(activation1.output)
409 |     loss = loss_activation.forward(dense2.output, y)
410 | 
411 |     # Class with the highest confidence
412 |     predictions = np.argmax(loss_activation.output, axis=1)
413 |     if len(y.shape) == 2: # one-hot encoded labels
414 |         # convert to a single class index per sample
415 |         y = np.argmax(y, axis=1)
416 |     accuracy = np.mean(predictions == y)
417 | 
418 |     if not epoch % 100:
419 |         print(f'epoch: {epoch}, ' +
420 |                 f'acc: {accuracy:.3f}, ' +
421 |                 f'loss: {loss:.3f}, '+
422 |                 f'lr: {optimizer.current_learning_rate}'
423 |                 )
424 | 
425 |     # Backward pass
426 |     loss_activation.backward(loss_activation.output, y)
427 |     dense2.backward(loss_activation.dinput)
428 |     activation1.backward(dense2.dinput)
429 |     dense1.backward(activation1.dinput)
430 | 
431 |     # Update the parameters
432 |     optimizer.pre_update_param()
433 |     optimizer.update_param(dense1)
434 |     optimizer.update_param(dense2)
435 |     optimizer.post_update_param()
436 | ```
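
The loop above calls `Optimizer_Adam`, but the class body is not shown in this section. The following is a minimal sketch of what such a class could look like, written to match the interface of the other optimizers here (`pre_update_param`, `update_param`, `post_update_param`). The default values, the names `beta_1` and `beta_2`, and the `_ToyLayer` helper are assumptions for illustration; the implementation in `NNFS_version6.py` may differ.

```python
import numpy as np


class Optimizer_Adam:
    # Sketch only: the defaults below are assumptions, not taken from the notes
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7,
                 beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.decay = decay
        self.current_learning_rate = learning_rate
        self.iteration = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1   # decay rate of the momentum
        self.beta_2 = beta_2   # decay rate of the squared-gradient cache

    def pre_update_param(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate / (1 + self.decay * self.iteration)

    def update_param(self, layer):
        # The history is stored on the layer, as in the other optimizers
        if not hasattr(layer, 'dweight_cumulate'):
            layer.dweight_cumulate = np.zeros_like(layer.weight)
            layer.dbias_cumulate = np.zeros_like(layer.bias)
        if not hasattr(layer, 'dweight_square_sum'):
            layer.dweight_square_sum = np.zeros_like(layer.weight)
            layer.dbias_square_sum = np.zeros_like(layer.bias)

        # Momentum: exponentially decaying average of the gradients
        layer.dweight_cumulate = self.beta_1 * layer.dweight_cumulate + (1 - self.beta_1) * layer.dweight
        layer.dbias_cumulate = self.beta_1 * layer.dbias_cumulate + (1 - self.beta_1) * layer.dbias
        # Cache: exponentially decaying average of the squared gradients
        layer.dweight_square_sum = self.beta_2 * layer.dweight_square_sum + (1 - self.beta_2) * layer.dweight ** 2
        layer.dbias_square_sum = self.beta_2 * layer.dbias_square_sum + (1 - self.beta_2) * layer.dbias ** 2

        # Bias correction; self.iteration starts at 0, so the step number is iteration + 1
        step = self.iteration + 1
        weight_momentum = layer.dweight_cumulate / (1 - self.beta_1 ** step)
        bias_momentum = layer.dbias_cumulate / (1 - self.beta_1 ** step)
        weight_cache = layer.dweight_square_sum / (1 - self.beta_2 ** step)
        bias_cache = layer.dbias_square_sum / (1 - self.beta_2 ** step)

        layer.weight += -self.current_learning_rate * weight_momentum / (np.sqrt(weight_cache) + self.epsilon)
        layer.bias += -self.current_learning_rate * bias_momentum / (np.sqrt(bias_cache) + self.epsilon)

    def post_update_param(self):
        self.iteration += 1


class _ToyLayer:
    # Hypothetical stand-in for Layer_Dense with fixed gradients
    def __init__(self):
        self.weight = np.zeros((2, 3))
        self.bias = np.zeros((1, 3))
        self.dweight = np.full((2, 3), 4.0)
        self.dbias = np.full((1, 3), -0.5)


layer = _ToyLayer()
optimizer = Optimizer_Adam(learning_rate=0.1)
optimizer.pre_update_param()
optimizer.update_param(layer)
optimizer.post_update_param()
print(layer.weight[0, 0], layer.bias[0, 0])
```

In this sketch the first update moves every parameter by approximately the learning rate in the direction opposite to its gradient, regardless of the gradient's magnitude, because the bias-corrected momentum and the square root of the bias-corrected cache have the same size on the first step.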
437 | 
438 | ![image-20230810195229652](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308101952696.png)
439 | 
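440 | > Although Adam performs best here and is usually one of the best optimizers, that is not always the case. Trying Adam first is usually a good idea, but other optimizers are worth trying too, especially when the results fall short of expectations. Sometimes plain SGD or SGD with momentum outperforms Adam; the reasons vary. Choosing hyperparameters such as the learning rate is covered later with training, but for SGD a common initial learning rate is 1.0, decaying to 0.1. For Adam, a good initial learning rate is 0.001 (1e-3), decaying to 0.0001 (1e-4). Different problems may need different values, but these are a reasonable starting point.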
441 | 


--------------------------------------------------------------------------------
/6Optimizer/Optimizer.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/6Optimizer/Optimizer.pdf


--------------------------------------------------------------------------------
/7L1andL2Regularization/L1andL2Regularization.md:
--------------------------------------------------------------------------------
  1 | # L1 and L2 Regularization
  2 | 
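  3 | ## 1. Contents
  4 | 
  5 | Because of its linear nature, L1 regularization penalizes small weights more than L2 regularization does, which makes the model insensitive to small inputs and responsive only to larger ones. This is why L1 regularization is rarely used alone; when it is used, it is usually combined with L2 regularization. These regularization functions push the sum of the weights and parameters toward 0, which can also help with exploding gradients (an unstable model whose weights may become very large).
  6 | 
  7 | ## 2. Forward pass
  8 | 
  9 | ### 1. Formula
 10 | 
 11 | $$
 12 | L_{1w}=\lambda\sum\limits_{k}|w_k|
 13 | $$
 14 | 
 15 | $$
 16 | L_{1b}=\lambda\sum\limits_{k}|b_k|
 17 | $$
 18 | 
 19 | $$
 20 | L_{2w}=\lambda\sum\limits_{k}w_k^2
 21 | $$
 22 | 
 23 | $$
 24 | L_{2b}=\lambda\sum\limits_{k}b_k^2
 25 | $$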
 26 | 
 27 | $$
 28 | Loss = dataloss + L_{1w}+ L_{1b}+L_{2w}+L_{2b}
 29 | $$
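
A small worked example of the two weight penalties. The weight matrix and the value of $\lambda$ (`lam`) below are made up for illustration.

```python
import numpy as np

weight = np.array([[0.5, -1.0],
                   [2.0,  0.0]])
lam = 0.01   # regularization strength (lambda)

# L1 penalty: lambda * sum of absolute values = 0.01 * (0.5 + 1.0 + 2.0 + 0.0)
l1_penalty = lam * np.sum(np.abs(weight))
# L2 penalty: lambda * sum of squares = 0.01 * (0.25 + 1.0 + 4.0 + 0.0)
l2_penalty = lam * np.sum(weight ** 2)

print(l1_penalty, l2_penalty)
```

The largest weight (2.0) contributes 4.0 of the 5.25 in the L2 sum but only 2.0 of the 3.5 in the L1 sum, which shows how L2 concentrates the penalty on large weights while L1 weighs small weights relatively more.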
 30 | 
 31 | ### 2. Implementation
 32 | 
 33 | ```python
 34 | class Layer_Dense:
 35 |     def __init__(self, n_input, n_neuron, weight_L1=0., weight_L2=0., bias_L1=0., bias_L2=0.):
 36 |         # Initialize the weights from a normal distribution
 37 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
 38 |         # Initialize the bias to 0
 39 |         # self.bias = np.zeros(n_neuron)
 40 |         self.bias = np.zeros((1, n_neuron))
 41 |         self.weight_L1 = weight_L1
 42 |         self.weight_L2 = weight_L2
 43 |         self.bias_L1 = bias_L1
 44 |         self.bias_L2 = bias_L2
 45 | ```
 46 | 
 47 | > weight_L1, weight_L2, bias_L1 and bias_L2 are used together with weight and bias, so they are stored as attributes of Layer_Dense. They default to 0 (no regularization), which is what the example below relies on when it calls `Layer_Dense(512, 3)`.
 48 | 
 49 | ```python
 50 | class Loss: 
 51 |     def regularization_loss(self,layer):
 52 |         # 默认为0
 53 |         regularization_loss = 0
 54 |         # 如果存在L1的loss
 55 |         if layer.weight_L1 > 0:
 56 |             regularization_loss += layer.weight_L1 * np.sum(np.abs(layer.weight))
 57 |         if layer.bias_L1 > 0:
 58 |             regularization_loss += layer.bias_L1 * np.sum(np.abs(layer.bias))
 59 |         # 如果存在L2的loss
 60 |         if layer.weight_L2 > 0:
 61 |             regularization_loss += layer.weight_L2 * np.sum(layer.weight ** 2)
 62 |         if layer.bias_L2 > 0:
 63 |             regularization_loss += layer.bias_L2 * np.sum(layer.bias ** 2)
 64 | 
 65 |         return regularization_loss
 66 | ```
 67 | 
 68 | > The Loss class needs a method that returns regularization_loss.
 69 | 
 70 | ## 3. Backward pass
 71 | 
 72 | ### 1. Formula
 73 | 
 74 | ![image-20230810204245784](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308102042851.png)
 75 | 
 76 | ![image-20230810204310332](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308102043366.png)
 77 | 
 78 | ### 2. Implementation
 79 | 
 80 | ```python
 81 | class Layer_Dense:
 82 |     def backward(self, dvalue):
 83 |         # dvalue is the derivative of the loss with respect to the input of the next layer (the Activation),
 84 |         # i.e. the derivative of the loss with respect to the output of this layer (Layer_Dense);
 85 |         # the chain rule is applied here
 86 | 
 87 |         # In this layer we want the derivative of the loss with respect to self.weight,
 88 |         # which gives the direction in which to optimize self.weight (the negative gradient direction)
 89 | 
 90 |         # self.dweight must have the same shape as self.weight so that w - lr * dw can be applied.
 91 |         # Suppose input is a single sample of shape 1xa and weight is axb; then output is 1xb.
 92 |         # Because the loss is a scalar, dvalue = dloss/doutput has the shape of output (1xb),
 93 |         # so dweight has shape (1xa).T * (1xb) = axb, the same as weight.
 94 |         # Note: when input holds several samples (a matrix), dweight is the sum of the per-sample axb matrices.
 95 |         self.dweight = np.dot(self.input.T, dvalue)
 96 | 
 97 |         # We also want the derivative of the loss with respect to self.input,
 98 |         # to pass as the dvalue argument to the backward method of the layer handled next in the backward pass.
 99 | 
100 |         # Because the loss is a scalar, dinput has the shape of input (1xa),
101 |         # dvalue = dloss/doutput has the shape of output (1xb),
102 |         # and weight is axb,
103 |         # so 1xa = (1xb) * (axb).T
104 |         self.dinput = np.dot(dvalue, self.weight.T)
105 | 
106 |         # Like self.dinput, self.dbias could be computed with a matrix product,
107 |         # self.dbias = np.dot(np.ones((1, len(dvalue))), dvalue)
108 |         # but there is a faster and simpler way
109 |         self.dbias = np.sum(dvalue, axis=0, keepdims=True)  # keepdims=True is optional here: summing over axis 0 gives a row vector that still broadcasts
110 | 
111 |         # Gradients of the regularization terms
112 |         if self.weight_L2 > 0:
113 |             self.dweight += 2 * self.weight_L2 * self.weight
114 |         if self.bias_L2 > 0:
115 |             self.dbias += 2 * self.bias_L2 * self.bias
116 |         if self.weight_L1 > 0:
117 |             dL = np.ones_like(self.weight)
118 |             dL[self.weight < 0] = -1
119 |             self.dweight += self.weight_L1 * dL
120 |         if self.bias_L1 > 0:
121 |             dL = np.ones_like(self.bias)
122 |             dL[self.bias < 0] = -1
123 |             self.dbias += self.bias_L1 * dL
124 |             
125 | ```
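
The regularization gradients can be checked numerically against the penalty they come from. This is a sketch under stated assumptions: the helper functions `penalty` and `penalty_grad` are hypothetical names that mirror `regularization_loss` and the gradient terms in `backward`, and the check uses central differences on a small random weight matrix.

```python
import numpy as np


def penalty(weight, l1, l2):
    """L1 + L2 penalty, as computed in Loss.regularization_loss."""
    return l1 * np.sum(np.abs(weight)) + l2 * np.sum(weight ** 2)


def penalty_grad(weight, l1, l2):
    """Analytic gradient, as added to dweight in Layer_Dense.backward."""
    sign = np.ones_like(weight)
    sign[weight < 0] = -1
    return l1 * sign + 2 * l2 * weight


rng = np.random.default_rng(0)
weight = rng.normal(size=(3, 2))
l1, l2, h = 1e-3, 5e-4, 1e-6

# Central finite differences, one weight at a time
numeric = np.zeros_like(weight)
for idx in np.ndindex(weight.shape):
    plus, minus = weight.copy(), weight.copy()
    plus[idx] += h
    minus[idx] -= h
    numeric[idx] = (penalty(plus, l1, l2) - penalty(minus, l1, l2)) / (2 * h)

max_error = np.max(np.abs(numeric - penalty_grad(weight, l1, l2)))
print(f"max abs difference: {max_error:.2e}")
```

The same kind of check applies to the bias terms: the L2 bias gradient must be `2 * bias_L2 * bias`, with the shape of `bias`, not of `weight`.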
126 | 
127 | ### 3. Example
128 | 
129 | ```python
130 | # Dataset: spiral_data returns samples * classes = 6000 points
131 | X, y = spiral_data(samples=2000, classes=3)
132 | keys = np.array(range(X.shape[0]))
133 | np.random.shuffle(keys)
134 | X = X[keys]
135 | y = y[keys]
136 | X_test = X[3000:]
137 | y_test = y[3000:]
138 | X = X[0:3000]
139 | y = y[0:3000]
140 | print(X.shape, X_test.shape)  # sanity check: training and test halves have the same shape
141 | 
142 | # 2 inputs, 512 outputs
143 | dense1 = Layer_Dense(2, 512, weight_L2=5e-4, bias_L2=5e-4)
144 | activation1 = Activation_ReLu()
145 | # 512 inputs, 3 outputs
146 | dense2 = Layer_Dense(512, 3)
147 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
148 | 
149 | # Optimizer
150 | optimizer = Optimizer_Adam(learning_rate=0.02, decay=5e-7)
151 | 
152 | # Train for 10001 epochs (0 to 10000)
153 | for epoch in range(10001):
154 |     # Forward pass
155 |     dense1.forward(X)
156 |     activation1.forward(dense1.output)
157 |     dense2.forward(activation1.output)
158 |     data_loss = loss_activation.forward(dense2.output, y)
159 |     regularization_loss = loss_activation.loss.regularization_loss(dense1) + loss_activation.loss.regularization_loss(dense2)
160 |     loss = data_loss + regularization_loss
161 |     # Class with the highest confidence
162 |     predictions = np.argmax(loss_activation.output, axis=1)
163 |     if len(y.shape) == 2: # one-hot encoded labels
164 |         # convert to a single class index per sample
165 |         y = np.argmax(y, axis=1)
166 |     accuracy = np.mean(predictions == y)
167 | 
168 |     if not epoch % 100:
169 |         print(f'epoch: {epoch}, ' +
170 |               f'acc: {accuracy:.3f}, ' +
171 |               f'loss: {loss:.3f} (' +
172 |               f'data_loss: {data_loss:.3f}, ' +
173 |               f'reg_loss: {regularization_loss:.3f}), ' +
174 |               f'lr: {optimizer.current_learning_rate}'
175 |                 )
176 | 
177 |     # Backward pass
178 |     loss_activation.backward(loss_activation.output, y)
179 |     dense2.backward(loss_activation.dinput)
180 |     activation1.backward(dense2.dinput)
181 |     dense1.backward(activation1.dinput)
182 | 
183 |     # Update the parameters
184 |     optimizer.pre_update_param()
185 |     optimizer.update_param(dense1)
186 |     optimizer.update_param(dense2)
187 |     optimizer.post_update_param()
188 | 
189 | 
190 | 
191 | # Create test dataset
192 | 
193 | # Perform a forward pass of our testing data through this layer
194 | dense1.forward(X_test)
195 | # Perform a forward pass through activation function
196 | # takes the output of first dense layer here
197 | activation1.forward(dense1.output)
198 | # Perform a forward pass through second Dense layer
199 | # takes outputs of activation function of first layer as inputs
200 | dense2.forward(activation1.output)
201 | # Perform a forward pass through the activation/loss function
202 | # takes the output of second dense layer here and returns loss
203 | loss = loss_activation.forward(dense2.output, y_test)
204 | # Calculate accuracy from output of activation2 and targets
205 | # calculate values along first axis
206 | predictions = np.argmax(loss_activation.output, axis=1)
207 | if len(y_test.shape) == 2:
208 |     y_test = np.argmax(y_test, axis=1)
209 | accuracy = np.mean(predictions==y_test)
210 | print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')
211 | ```
212 | 
213 | 
214 | 
215 | ![image-20230811110414803](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111104899.png)
216 | 
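217 | > With regularization added, the result is not good: accuracy on the validation set is lower than on the training set, which suggests the regularization is not having the intended effect. The code still needs to be checked for problems.
218 | 
219 | ![image-20230811112112840](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111121883.png)
220 | 
221 | > The image above shows the result without regularization. The final-epoch training accuracy is higher than with regularization, which indicates that the regularization terms do reduce overfitting to the training set, but performance on the test set did not improve.
222 | 
223 | ![image-20230811113110808](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111131854.png)
224 | 
225 | > The image above shows the training and test results given in the book for the same parameter settings. The lr values are identical, which indicates that the optimizer code is correct. The other training-set metrics are also similar, but the test-set results differ considerably.
226 | 
227 | ```py
228 | # 2 inputs, 256 outputs
229 | dense1 = Layer_Dense(2, 256, weight_L2=5e-4, bias_L2=5e-4)
230 | activation1 = Activation_ReLu()
231 | # 256 inputs, 128 outputs
232 | dense2 = Layer_Dense(256, 128, weight_L2=5e-4, bias_L2=5e-4)
233 | activation2 = Activation_ReLu()
234 | # 128 inputs, 3 outputs
235 | dense3 = Layer_Dense(128, 3)
236 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
237 | ```
238 | 
239 | > Increase the model complexity: three dense layers.
240 | 
241 | ![image-20230811120026937](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111200984.png)
242 | 
243 | > No great improvement.
244 | 
245 | ![image-20230811120931058](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111209104.png)
246 | 
247 | > The same three-layer structure, but without regularization. Training accuracy is higher and test accuracy is lower, which shows that the regularization terms do have an effect. Even so, the result from the book, where test accuracy exceeds training accuracy, could not be reproduced.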


--------------------------------------------------------------------------------
/7L1andL2Regularization/L1andL2Regularization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/7L1andL2Regularization/L1andL2Regularization.pdf


--------------------------------------------------------------------------------
/7L1andL2Regularization/NNFS_version7.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Author: 黄欣
  3 | Date: 2023-08-10
  4 | """
  5 | 
  6 | # This version adds the implementation of L1 and L2 regularization
  7 | 
  8 | import numpy as np
  9 | from nnfs.datasets import spiral_data
 10 | import matplotlib.pyplot as plt
 11 | 
 12 | 
 13 | class Layer_Dense:
 14 |     def __init__(self, n_input, n_neuron, weight_L1=0., weight_L2=0., bias_L1=0., bias_L2=0.):
 15 |         # Initialize the weights from a normal distribution
 16 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
 17 |         # Initialize the bias to 0
 18 |         # self.bias = np.zeros(n_neuron)
 19 |         self.bias = np.zeros((1, n_neuron))
 20 |         self.weight_L1 = weight_L1
 21 |         self.weight_L2 = weight_L2
 22 |         self.bias_L1 = bias_L1
 23 |         self.bias_L2 = bias_L2
 24 | 
 25 |     def forward(self, input):
 26 |         # Needed by the backward method:
 27 |         # the partial derivative of the Layer_Dense output with respect to input is self.weight,
 28 |         # and the partial derivative of the output with respect to self.weight is input,
 29 |         # so forward stores the input as self.input
 30 |         self.input = input
 31 |         self.output = np.dot(input, self.weight) + self.bias
 32 | 
 33 |     def backward(self, dvalue):
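 34 |         # dvalue is the derivative of the loss with respect to the input of the next layer (the Activation),
 35 |         # i.e. the derivative of the loss with respect to the output of this layer (Layer_Dense);
 36 |         # the chain rule is applied here
 37 | 
 38 |         # In this layer we want the derivative of the loss with respect to self.weight,
 39 |         # which gives the direction in which to optimize self.weight (the negative gradient direction)
 40 | 
 41 |         # self.dweight must have the same shape as self.weight so that w - lr * dw can be applied.
 42 |         # Suppose input is a single sample of shape 1xa and weight is axb; then output is 1xb.
 43 |         # Because the loss is a scalar, dvalue = dloss/doutput has the shape of output (1xb),
 44 |         # so dweight has shape (1xa).T * (1xb) = axb, the same as weight.
 45 |         # Note: when input holds several samples (a matrix), dweight is the sum of the per-sample axb matrices.
 46 |         self.dweight = np.dot(self.input.T, dvalue)
 47 | 
 48 |         # We also want the derivative of the loss with respect to self.input,
 49 |         # to pass as the dvalue argument to the backward method of the layer handled next in the backward pass.
 50 | 
 51 |         # Because the loss is a scalar, dinput has the shape of input (1xa),
 52 |         # dvalue = dloss/doutput has the shape of output (1xb),
 53 |         # and weight is axb,
 54 |         # so 1xa = (1xb) * (axb).T
 55 |         self.dinput = np.dot(dvalue, self.weight.T)
 56 | 
 57 |         # Like self.dinput, self.dbias could be computed with a matrix product,
 58 |         # self.dbias = np.dot(np.ones((1, len(dvalue))), dvalue)
 59 |         # but there is a faster and simpler way
 60 |         self.dbias = np.sum(dvalue, axis=0, keepdims=True)  # keepdims=True is optional here: summing over axis 0 gives a row vector that still broadcasts
 61 | 
 62 |         # Gradients of the regularization terms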
 63 |         if self.weight_L2 > 0:
 64 |             self.dweight += 2 * self.weight_L2 * self.weight
 65 |         if self.bias_L2 > 0:
 66 |             self.dbias += 2 * self.bias_L2 * self.bias
 67 |         if self.weight_L1 > 0:
 68 |             dL = np.ones_like(self.weight)
 69 |             dL[self.weight < 0] = -1
 70 |             self.dweight += self.weight_L1 * dL
 71 |         if self.bias_L1 > 0:
 72 |             dL = np.ones_like(self.bias)
 73 |             dL[self.bias < 0] = -1
 74 |             self.dbias += self.bias_L1 * dL
 75 | 
 76 | class Activation_Sigmoid:
 77 |     def __init__(self):
 78 |         pass
 79 | 
 80 |     def forward(self, input):
 81 |         self.input = input
 82 | 
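 83 |         # input has shape nx1, where n is the number of samples fed to the activation and each sample has a single dimension.
 84 |         # The preceding hidden layer must therefore be Layer_Dense(n, 1)
 85 |         self.output = 1 / (1 + np.exp(- (self.input)))
 86 | 
 87 |     def backward(self, dvalue):
 88 |         # A matrix computation would also work, but dinput, dvalue and output have the same shape,
 89 |         # so an element-wise product is enough.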
 90 |         self.dinput = dvalue * self.output * (1 - self.output)
 91 | 
 92 | 
 93 | class Activation_ReLu:
 94 |     def __init__(self):
 95 |         pass
 96 | 
 97 |     def forward(self, input):
 98 |         self.input = input
 99 |         self.output = np.maximum(0, input)
100 | 
101 |     def backward(self, dvalue):
102 |         # self.input and self.output have the same shape,
103 |         # so dinput, doutput and dvalue all have the same shape as well.
104 |         # A mask is faster than a matrix operation
105 |         self.dinput = dvalue.copy()
106 |         self.dinput[self.input < 0] = 0
107 | 
108 | 
109 | class Activation_Softmax:
110 |     def __init__(self):
111 |         pass
112 | 
113 |     def forward(self, input):
114 |         self.input = input
115 | 
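116 |         # keepdims=True is required:
117 |         # without it, np.max(input, axis=1) returns a 1-D array instead of a column vector,
118 |         # and its length does not match the length of each row of input,
119 |         # so it cannot be broadcast.
120 |         # Subtract the row maximum before exponentiating so that the inputs are smaller and exp does not overflow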
121 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
122 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
123 | 
124 |     def backward(self, dvalue):
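125 |         # input and output both have shape 1xa;
126 |         # the loss is a scalar, so dinput and doutput (i.e. dvalue) also both have shape 1xa,
127 |         # and the derivative of output with respect to input is an axa square matrix
128 | 
129 |         # Empty array of the same shape
130 |         self.dinput = np.empty_like(dvalue)
131 |         # Loop over each sample (each row)
132 |         for each, (single_output, single_dvalue) in enumerate(zip(self.output, dvalue)):
133 |             # Both ways of computing give a dinput of the same shape.
134 |             # Here it is (1xa) * (axa) = 1xa, a row vector.
135 |             # First turn the length-a vector into a 1xa matrix,
136 |             # because a 1-D vector has no transpose (.T leaves it unchanged).
137 |             # Given 1-D vectors, np.dot picks the orientation itself and returns a 1-D vector, so even a column result is shown as a row.
138 |             # Given a 1xa matrix, np.dot requires matching matrix shapes (otherwise it raises an error) and returns a matrix.
139 |             single_output = single_output.reshape(1, -1)
140 |             jacobian_matrix = np.diagflat(single_output) - np.dot(single_output.T, single_output)
141 |             # single_dvalue is a 1-D vector, so np.dot adapts its orientation.
142 |             # np.dot(single_dvalue, jacobian_matrix) and np.dot(jacobian_matrix, single_dvalue)
143 |             # both return a 1-D vector; they are different products, but the softmax Jacobian is symmetric, so the values agree.
144 |             # np.dot(jacobian_matrix, single_dvalue) is therefore also correct and directly gives a 1-D result,
145 |             # without np.dot having to turn a column vector into a row vector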
146 |             self.dinput[each] = np.dot(jacobian_matrix, single_dvalue)
147 | 
148 | 
149 | class Loss:
150 |     def __init__(self):
151 |         pass
152 | 
153 |     # Always compute the loss by calling calculate
154 |     def calculate(self, y_pred, y_true):
155 |         # Each loss function subclasses Loss and implements its own forward method.
156 |         data_loss = np.mean(self.forward(y_pred, y_true))
157 |         # Note: the loss is not stored as an attribute; it is returned directly.
158 |         return data_loss
159 | 
160 |     def regularization_loss(self, layer):
161 |         # Defaults to 0
162 |         regularization_loss = 0
163 |         # Add the L1 penalty if enabled
164 |         if layer.weight_L1 > 0:
165 |             regularization_loss += layer.weight_L1 * np.sum(np.abs(layer.weight))
166 |         if layer.bias_L1 > 0:
167 |             regularization_loss += layer.bias_L1 * np.sum(np.abs(layer.bias))
168 |         # Add the L2 penalty if enabled
169 |         if layer.weight_L2 > 0:
170 |             regularization_loss += layer.weight_L2 * np.sum(layer.weight ** 2)
171 |         if layer.bias_L2 > 0:
172 |             regularization_loss += layer.bias_L2 * np.sum(layer.bias ** 2)
173 | 
174 |         return regularization_loss
175 | 
176 | class Loss_CategoricalCrossentropy(Loss):
177 |     def __init__(self):
178 |         pass
179 | 
180 |     def forward(self, y_pred, y_true):
181 |         # Number of samples
182 |         n_sample = len(y_true)
183 | 
184 |         # Clip at 1e-7 on the left to avoid log(0).
185 |         # Clipping on one side only would shift the mean confidence slightly toward 1,
186 |         # so clip at 1 - 1e-7 on the right as well.
187 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
188 | 
189 |         loss = - np.log(y_pred)
190 |         if len(y_true.shape) == 2:  # one-hot labels
191 |             loss = np.sum(loss * y_true, axis=1)
192 |         elif len(y_true.shape) == 1:  # sparse class-index labels
193 |             # Note: loss[:, y_true] is different; it would return a matrix.
194 |             loss = loss[range(n_sample), y_true]
195 | 
196 |         # loss is a 1-D array with one entry per sample.
197 |         # No mean is taken here; calculate in the parent class does that.
198 |         return loss
199 | 
200 |     def backward(self, y_pred, y_true):
201 |         n_sample = len(y_true)
202 |         if len(y_true.shape) == 2:  # one-hot labels
203 |             label = y_true
204 |         elif len(y_true.shape) == 1:  # sparse class-index labels
205 |             # Convert the labels to one-hot encoding
206 |             label = np.zeros((n_sample, len(y_pred[0])))
207 |             label[range(n_sample), y_true] = 1
208 |         self.dinput = - label / y_pred
209 |         # Divide by n_sample because the optimizer sums the gradient over samples
210 |         self.dinput = self.dinput / n_sample
211 | 
212 | 
213 | class Loss_BinaryCrossentropy(Loss):
214 |     def __init__(self):
215 |         pass
216 | 
217 |     def forward(self, y_pred, y_true):
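218 |         # Take care here; the book does not spell this out.
219 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
220 |         # (n_sample,) broadcasts like (1, n_sample), not like (n_sample, 1), and cannot be transposed,
221 |         # so the loss below would get shape (n_sample, n_sample).
222 |         # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
223 |         if len(y_true.shape) == 1:  # y_true is 1-D
224 |             y_true = y_true.reshape(-1, 1)
225 |         # Clip at 1e-7 on the left to avoid log(0).
226 |         # Clipping on one side only would shift the mean confidence slightly toward 1,
227 |         # so clip at 1 - 1e-7 on the right as well.
228 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
229 |         loss = -  np.log(y_pred) * y_true - np.log(1 - y_pred) * (1 - y_true)
230 |         # This mean runs over a different axis than the one in calculate:
231 |         # here it averages over the binary outputs of each sample,
232 |         # while calculate averages over the samples.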
233 |         loss = np.mean(loss, axis=-1)
234 |         return loss
235 | 
236 |     def backward(self, y_pred, y_true):
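237 |         # Number of samples
238 |         n_sample = len(y_pred)
239 |         # Number of binary outputs
240 |         n_output = len(y_pred[0])
241 |         # Take care here; the book does not spell this out.
242 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
243 |         # (n_sample,) broadcasts like (1, n_sample), not like (n_sample, 1), and cannot be transposed,
244 |         # so the gradient below would get shape (n_sample, n_sample).
245 |         # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
246 |         if len(y_true.shape) == 1:  # y_true is 1-D
247 |             y_true = y_true.reshape(-1, 1)
248 |         # Note: BinaryCrossentropy always follows a Sigmoid,
249 |         # whose output easily saturates at 0 or 1,
250 |         # so clip at 1e-7 on the left to avoid division by zero.
251 |         # Clipping on one side only would shift the mean confidence slightly toward 1,
252 |         # so clip at 1 - 1e-7 on the right as well.
253 |         y_pred_clip = np.clip(y_pred, 1e-7, 1 - 1e-7)
254 |         # Never write it as below: the unary minus in -y_true binds first, and y_true is uint8, so -1 wraps to 255.
255 |         # This bug took a long time to find; take it seriously.
256 |         # self.dinput = (-y_true / y_pred_clip + (1 - y_true) / (1 - y_pred_clip)) / n_output
257 |         self.dinput = -(y_true / y_pred_clip - (1 - y_true) / (1 - y_pred_clip)) / n_output
258 |         # Divide by n_sample because the optimizer sums the gradient over samples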
259 |         self.dinput = self.dinput / n_sample
260 | 
261 | 
262 | class Activation_Softmax_Loss_CategoricalCrossentropy():
263 |     def __init__(self):
264 |         self.activation = Activation_Softmax()
265 |         self.loss = Loss_CategoricalCrossentropy()
266 | 
267 |     # Note: this class computes the loss by calling forward,
268 |     # because it does not inherit from Loss.
269 |     def forward(self, input, y_true):
270 |         self.activation.forward(input)
271 |         # The output attribute of this class is the output of Activation_Softmax()
272 |         self.output = self.activation.output
273 |         # The return value is the loss
274 |         return self.loss.calculate(self.output, y_true)
275 | 
276 |     # y_pred always equals self.output; the parameter is kept for consistency with the earlier code
277 |     def backward(self, y_pred, y_true):
278 |         # Number of samples
279 |         n_sample = len(y_true)
280 |         if len(y_true.shape) == 2:  # one-hot labels
281 |             # Apply the formula directly
282 |             self.dinput = y_pred - y_true
283 |         elif len(y_true.shape) == 1:  # sparse class-index labels
284 |             self.dinput = y_pred.copy()
285 |             # Subtract 1 at the true-class index of each row; the other entries are left unchanged
286 |             self.dinput[range(n_sample), y_true] -= 1
287 |         # Divide by n_sample because the optimizer sums the gradient over samples
288 |         self.dinput = self.dinput / n_sample
289 | 
290 | 
291 | class Activation_Sigmoid_Loss_BinaryCrossentropy():
292 |     def __init__(self):
293 |         self.activation = Activation_Sigmoid()
294 |         self.loss = Loss_BinaryCrossentropy()
295 | 
296 |     def forward(self, input, y_true):
297 |         self.activation.forward(input)
298 |         # The output of this class is the Sigmoid output
299 |         self.output = self.activation.output
300 |         return self.loss.calculate(self.output, y_true)
301 | 
302 |     def backward(self, y_pred, y_true):
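303 |         # Number of samples
304 |         n_sample = len(y_pred)
305 |         # Take care here; the book does not spell this out.
306 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
307 |         # (n_sample,) broadcasts like (1, n_sample), not like (n_sample, 1), and cannot be transposed,
308 |         # so the gradient below would get shape (n_sample, n_sample).
309 |         # With two binary outputs, y_pred and y_true both have shape (n_sample, 2).
310 |         if len(y_true.shape) == 1:  # y_true is 1-D
311 |             y_true = y_true.reshape(-1, 1)
312 |         # Number of binary outputs
313 |         J = len(y_pred[0])
314 |         # Each row of y_true holds J binary values: 1 for a positive, 0 for a negative.
315 |         self.dinput = (y_pred - y_true) / J
316 | 
317 |         # The optimizer sums over all samples; dividing by the sample count keeps the gradient independent of it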
318 |         self.dinput /= n_sample
319 | 
320 | class Optimizer_SGD():
321 |     # The initializer takes the hyperparameters, starting with the learning rate, and stores them as attributes
322 |     def __init__(self, learning_rate = 1.0, decay = 0., momentum=0):
323 |         self.learning_rate = learning_rate
324 |         self.decay = decay
325 |         self.current_learning_rate = learning_rate
326 |         self.iteration = 0
327 |         self.momentum = momentum
328 | 
329 |     def pre_update_param(self):
330 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
331 |         if self.decay:
332 |             self.current_learning_rate = self.learning_rate * \
333 |                                          (1 / (1 + self.decay * self.iteration))
334 | 
335 |     # Given a layer object, perform the most basic optimization step
336 |     def update_param(self, layer):
337 | 
338 |         delta_weight = layer.dweight
339 |         delta_bias = layer.dbias
340 | 
341 |         # If momentum is used
342 |         if self.momentum:
343 |             # If no momentum has been accumulated yet
344 |             if not hasattr(layer, "dweight_cumulate"):
345 |                 # Note: the attributes are added to the layer object,
346 |                 # since the history naturally belongs to the object it describes.
347 |                 layer.dweight_cumulate = np.zeros_like(layer.weight)
348 |                 layer.dbias_cumulate = np.zeros_like(layer.bias)
349 |             delta_weight += self.momentum * layer.dweight_cumulate
350 |             layer.dweight_cumulate = delta_weight
351 |             delta_bias += self.momentum * layer.dbias_cumulate
352 |             layer.dbias_cumulate = delta_bias
353 |         layer.weight -= self.current_learning_rate * delta_weight
354 |         # (64,) = (64,) + (1,64) >> (1,64)
355 |         # (64,) += (1,64) >> cannot broadcast
356 |         # (1, 64) = (64,) + (1,64) >> (1,64)
357 |         # (1, 64) += (64,) >> (1,64)
358 |         # This is why Layer_Dense was changed from
359 |         # self.bias = np.zeros(n_neuron) => self.bias = np.zeros((1, n_neuron))
360 |         layer.bias -= self.current_learning_rate * delta_bias
361 | 
362 |     def post_update_param(self):
363 |         self.iteration += 1
364 | 
365 | class Optimizer_Adagrad():
366 |     # The initializer takes the hyperparameters, starting with the learning rate, and stores them as attributes
367 |     def __init__(self, learning_rate = 1.0, decay = 0., epsilon = 1e-7):
368 |         self.learning_rate = learning_rate
369 |         self.decay = decay
370 |         self.current_learning_rate = learning_rate
371 |         self.iteration = 0
372 |         # Small constant to avoid division by zero
373 |         self.epsilon = epsilon
374 | 
375 | 
376 |     def pre_update_param(self):
377 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
378 |         if self.decay:
379 |             self.current_learning_rate = self.learning_rate * \
380 |                                          (1 / (1 + self.decay * self.iteration))
381 | 
382 |     # Update the parameters of the given layer object
383 |     def update_param(self, layer):
384 |         if not hasattr(layer, 'dweight_square_sum'):
385 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
386 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
387 |         layer.dweight_square_sum = layer.dweight_square_sum + layer.dweight ** 2
388 |         layer.dbias_square_sum = layer.dbias_square_sum + layer.dbias ** 2
389 |         layer.weight += -self.current_learning_rate * layer.dweight / \
390 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
391 |         layer.bias += -self.current_learning_rate * layer.dbias / \
392 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
393 | 
394 |     def post_update_param(self):
395 |         self.iteration += 1
396 | 
397 | class Optimizer_RMSprop():
398 |     # The initializer takes the hyperparameters, starting with the learning rate, and stores them as attributes
399 |     def __init__(self, learning_rate = 0.001, decay = 0., epsilon = 1e-7, beta = 0.9):
400 |         # Note: the default learning_rate here is 0.001, not 1
401 |         self.learning_rate = learning_rate
402 |         self.decay = decay
403 |         self.current_learning_rate = learning_rate
404 |         self.iteration = 0
405 |         # Small constant to avoid division by zero
406 |         self.epsilon = epsilon
407 |         self.beta = beta
408 | 
409 |     def pre_update_param(self):
410 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
411 |         if self.decay:
412 |             self.current_learning_rate = self.learning_rate * \
413 |                                          (1 / (1 + self.decay * self.iteration))
414 | 
415 |     # Update the parameters of the given layer object
416 |     def update_param(self, layer):
417 |         if not hasattr(layer, 'dweight_square_sum'):
418 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
419 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
420 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
421 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
422 |         layer.weight += -self.current_learning_rate * layer.dweight / \
423 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
424 |         layer.bias += -self.current_learning_rate * layer.dbias / \
425 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
426 | 
427 |     def post_update_param(self):
428 |         self.iteration += 1
429 | 
430 | class Optimizer_Adam():
431 |     # The initializer takes the hyperparameters, starting with the learning rate, and stores them as attributes
432 |     def __init__(self, learning_rate = 0.001, decay = 0., epsilon = 1e-7, momentum = 0.9,beta = 0.999):
433 |         # Note: the default learning_rate here is 0.001, not 1
434 |         self.learning_rate = learning_rate
435 |         self.decay = decay
436 |         self.current_learning_rate = learning_rate
437 |         self.iteration = 0
438 |         # Small constant to avoid division by zero
439 |         self.epsilon = epsilon
440 |         self.beta = beta
441 |         self.momentum = momentum
442 | 
443 |     def pre_update_param(self):
444 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
445 |         if self.decay:
446 |             self.current_learning_rate = self.learning_rate * \
447 |                                          (1 / (1 + self.decay * self.iteration))
448 | 
449 |     # Update the parameters of the given layer object
450 |     def update_param(self, layer):
451 |         if not hasattr(layer, 'dweight_square_sum') or not hasattr(layer, 'dweight_cumulate'):
452 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
453 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
454 |             layer.dweight_cumulate = np.zeros_like(layer.weight)
455 |             layer.dbias_cumulate = np.zeros_like(layer.bias)
456 |         # Momentum: running average of the gradients
457 |         layer.dweight_cumulate = self.momentum * layer.dweight_cumulate + (1 - self.momentum) * layer.dweight
458 |         layer.dbias_cumulate = self.momentum * layer.dbias_cumulate + (1 - self.momentum) * layer.dbias
459 |         # Bias-corrected momentum
460 |         layer.dweight_cumulate_modified = layer.dweight_cumulate / (1 - self.momentum ** (self.iteration + 1))
461 |         layer.dbias_cumulate_modified = layer.dbias_cumulate / (1 - self.momentum ** (self.iteration + 1))
462 |         # Running average of the squared gradients
463 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
464 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
465 |         # Bias-corrected running average of the squared gradients
466 |         layer.dweight_square_sum_modified = layer.dweight_square_sum / (1 - self.beta ** (self.iteration + 1))
467 |         layer.dbias_square_sum_modified = layer.dbias_square_sum / (1 - self.beta ** (self.iteration + 1))
468 | 
469 |         layer.weight += -self.current_learning_rate * layer.dweight_cumulate_modified / \
470 |                         ( np.sqrt(layer.dweight_square_sum_modified) + self.epsilon )
471 |         layer.bias += -self.current_learning_rate * layer.dbias_cumulate_modified / \
472 |                         (np.sqrt(layer.dbias_square_sum_modified) + self.epsilon)
473 | 
474 |     def post_update_param(self):
475 |         self.iteration += 1
476 | 
477 | 
478 | # Dataset: 3 classes x 2000 samples = 6000 points, shuffled and split 3000 / 3000 into train and test
479 | X, y = spiral_data(samples=2000, classes=3)
480 | keys = np.array(range(X.shape[0]))
481 | np.random.shuffle(keys)
482 | X = X[keys]
483 | y = y[keys]
484 | X_test = X[3000:]
485 | y_test = y[3000:]
486 | X = X[0:3000]
487 | y = y[0:3000]
488 | print(X-X_test)
489 | 
490 | # 2 inputs, 256 outputs
491 | dense1 = Layer_Dense(2, 256)#, weight_L2=5e-4, bias_L2=5e-4
492 | activation1 = Activation_ReLu()
493 | # 256 inputs, 128 outputs
494 | dense2 = Layer_Dense(256, 128)#, weight_L2=5e-4, bias_L2=5e-4
495 | activation2 = Activation_ReLu()
496 | # 128 inputs, 3 outputs
497 | dense3 = Layer_Dense(128, 3)
498 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
499 | 
500 | # Optimizer
501 | optimizer = Optimizer_Adam(learning_rate=0.02, decay=5e-7)
502 | 
503 | # Train for 10001 epochs (0 through 10000)
504 | for epoch in range(10001):
505 |     # Forward pass
506 |     dense1.forward(X)
507 |     activation1.forward(dense1.output)
508 |     dense2.forward(activation1.output)
509 |     activation2.forward(dense2.output)
510 |     dense3.forward(activation2.output)
511 |     data_loss = loss_activation.forward(dense3.output, y)
512 |     regularization_loss = loss_activation.loss.regularization_loss(dense1) +loss_activation.loss.regularization_loss(dense2)
513 |     loss = data_loss + regularization_loss
514 |     # Class with the highest confidence
515 |     predictions = np.argmax(loss_activation.output, axis=1)
516 |     if len(y.shape) == 2: # one-hot labels
517 |         # Convert to class indices
518 |         y = np.argmax(y, axis=1)
519 |     accuracy = np.mean(predictions == y)
520 | 
521 |     if not epoch % 100:
522 |         print(f'epoch: {epoch}, ' +
523 |               f'acc: {accuracy:.3f}, ' +
524 |               f'loss: {loss:.3f} (' +
525 |               f'data_loss: {data_loss:.3f}, ' +
526 |               f'reg_loss: {regularization_loss:.3f}), ' +
527 |               f'lr: {optimizer.current_learning_rate}'
528 |                 )
529 | 
530 |     # Backward pass
531 |     loss_activation.backward(loss_activation.output, y)
532 |     dense3.backward(loss_activation.dinput)
533 |     activation2.backward(dense3.dinput)
534 |     dense2.backward(activation2.dinput)
535 |     activation1.backward(dense2.dinput)
536 |     dense1.backward(activation1.dinput)
537 | 
538 |     # Update the parameters
539 |     optimizer.pre_update_param()
540 |     optimizer.update_param(dense1)
541 |     optimizer.update_param(dense2)
542 |     optimizer.update_param(dense3)
543 |     optimizer.post_update_param()
544 | 
545 | 
546 | 
547 | # Evaluate on the test split
548 | 
549 | # Perform a forward pass of our testing data through the first Dense layer
550 | dense1.forward(X_test)
551 | # Perform a forward pass through the activation function
552 | # takes the output of the first dense layer here
553 | activation1.forward(dense1.output)
554 | # Perform a forward pass through the second Dense layer
555 | # takes outputs of the activation function of the first layer as inputs
556 | dense2.forward(activation1.output)
557 | activation2.forward(dense2.output)
558 | dense3.forward(activation2.output)
559 | # Perform a forward pass through the activation/loss function
560 | # takes the output of the third dense layer here and returns loss
561 | loss = loss_activation.forward(dense3.output, y_test)
562 | # Calculate accuracy from the softmax output and targets
563 | # argmax along axis 1 picks the most confident class for each sample
564 | predictions = np.argmax(loss_activation.output, axis=1)
565 | if len(y_test.shape) == 2:
566 |     y_test = np.argmax(y_test, axis=1)
567 | accuracy = np.mean(predictions==y_test)
568 | print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')


--------------------------------------------------------------------------------
/8Dropout/Dropout.md:
--------------------------------------------------------------------------------
  1 | # Dropout
  2 | 
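  3 | ## 1. Overview
  4 | 
  5 | Another option for regularizing a neural network is to add a dropout layer, which disables some neurons while leaving the others unchanged. The idea is similar to regularization: it keeps the network from becoming too dependent on any single neuron, or from relying on one neuron entirely for a specific instance. Dropout works by randomly disabling neurons at a given rate during every forward pass, which forces the network to learn to make accurate predictions with only the random subset that remains. It also forces the model to use more neurons for the same purpose, which raises the chance of learning the underlying function that describes the data. For example, if half of the neurons are disabled in the current step and the other half in the next step, more neurons are forced to learn the data, because only part of them "see" the data and receive updates in a given pass. These alternating halves are only an example; a hyperparameter tells the dropout layer how many neurons to disable at random. **A dropout layer does not really disable neurons; it zeroes their outputs. In other words, dropout does not reduce the number of neurons used, nor does it make training twice as fast when half of the neurons are disabled.**
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### **Function**
 10 | 
 11 | The code uses np.random.binomial() to apply the dropout probability.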
 12 | 
 13 | ~~~py
 14 | np.random.binomial(2, 0.8, size=10) 
 15 | ~~~
 16 | 
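 17 | np.random.binomial is a NumPy function that draws samples from a binomial distribution. The binomial distribution is a discrete probability distribution that describes the number of successes in n independent yes/no trials, each with success probability p. The function takes three parameters: n, p and size. n is the number of trials, p is the success probability of each trial, and size is the output shape, i.e. how many samples to draw. For example, the code above draws 10 samples from a binomial distribution with n=2 and p=0.8.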
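
As a small illustration of the sampling described above, the sketch below draws from `np.random.binomial` with n=2 and with n=1. The seed, the variable names and the 0.8 probability are assumptions chosen for the example. With n=1 every draw is 0 or 1, which is the form a dropout mask needs.

```python
import numpy as np

# Fixed seed so the sketch is repeatable (the seed value is arbitrary).
np.random.seed(0)

# n=2: each draw counts the successes in two trials, so values lie in {0, 1, 2}.
two_trials = np.random.binomial(2, 0.8, size=10)

# n=1: each draw is a single yes/no trial, so values lie in {0, 1}.
keep_prob = 0.8  # illustrative probability of drawing a 1
mask = np.random.binomial(1, keep_prob, size=(4, 5))

print(two_trials)
print(mask)
# The fraction of ones approaches keep_prob as the number of draws grows.
print(np.random.binomial(1, keep_prob, size=100_000).mean())
```

Passing a tuple as `size` returns an array of that shape, so a mask can be drawn directly in the shape of a layer's output.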
 18 | 
 19 | ### **Formula**
 20 | 
 21 | ![image-20230811152120806](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111521840.png)
 22 | 
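 23 | > $Dr$ is the dropout function, $z$ is the input, and $q$ is the probability of dropping a connection. Each element is zeroed with probability $q$ and kept with probability $1-q$, so the kept elements are divided by $1-q$ to compensate for the values that were lost.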
 24 | 
 25 | $$
 26 | \frac{\partial loss}{\partial z}=\frac{\partial loss}{\partial Dr}\frac{\partial Dr}{\partial z}
 27 | $$
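
Written out per element, the two factors of the chain rule can be sketched as follows. The symbol $m_i$ for the 0/1 mask entry is an assumption introduced here for illustration; it is not taken from the figure above.

$$
Dr_i = \frac{m_i}{1-q} z_i, \qquad m_i \sim \mathrm{Bernoulli}(1-q)
$$

$$
\frac{\partial Dr_i}{\partial z_i} = \frac{m_i}{1-q}, \qquad \frac{\partial loss}{\partial z_i} = \frac{\partial loss}{\partial Dr_i} \cdot \frac{m_i}{1-q}
$$

Because $E[m_i] = 1-q$, the expected output $E[Dr_i]$ equals $z_i$, which is why the kept values are divided by $1-q$.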
 28 | 
 29 | ### **Implementation**
 30 | 
 31 | ```py
 32 | class Dropout():
 33 |     def __init__(self, rate):
 34 |         # rate is the drop probability; self.rate stores the keep probability 1 - rate
 35 |         self.rate = 1 - rate
 36 | 
 37 |     def forward(self, input):
 38 |         self.input = input
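 39 |         # Draw a 0/1 matrix in which each entry is 1 with probability self.rate
 40 |         # Only a fraction self.rate of the entries is kept, so divide by self.rate to compensate for the dropped values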
 41 |         self.mask = np.random.binomial(1, self.rate, size=self.input.shape) / self.rate
 42 |         self.output = self.input * self.mask
 43 | 
 44 |     def backward(self,dvalue):
 45 |         self.dinput = dvalue * self.mask
 46 | ```
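
To check that dividing by the keep probability preserves the expected activation, here is a minimal sketch. The class name `DropoutSketch`, the `training` flag and the test array are assumptions made for illustration; the original `Dropout` class has no such flag.

```python
import numpy as np


class DropoutSketch:
    """Inverted dropout; `training` is a hypothetical flag, not in the original class."""

    def __init__(self, rate):
        # rate is the drop probability; store the keep probability.
        self.keep = 1 - rate

    def forward(self, input, training=True):
        if not training:
            # At inference nothing is dropped and nothing is rescaled.
            self.mask = np.ones_like(input)
        else:
            # 0/1 mask, scaled so the expected output equals the input.
            self.mask = np.random.binomial(1, self.keep, size=input.shape) / self.keep
        self.output = input * self.mask

    def backward(self, dvalue):
        # The gradient passes only through kept units, with the same scaling.
        self.dinput = dvalue * self.mask


np.random.seed(0)  # arbitrary seed for repeatability
x = np.ones((1000, 100))
layer = DropoutSketch(0.1)

layer.forward(x)
train_mean = layer.output.mean()
print(train_mean)               # close to 1.0, the mean of x
print(np.unique(layer.output))  # the two values 0 and 1 / 0.9

layer.forward(x, training=False)
print(np.array_equal(layer.output, x))
```

Scaling during training means the evaluation pass can skip the layer, as the example in the next section does, without rescaling the activations.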
 47 | 
 48 | ### **Example**
 49 | 
 50 | ```python
 51 | # Dataset: 3 classes x 2000 samples = 6000 points, shuffled and split 3000 / 3000 into train and test
 52 | X, y = spiral_data(samples=2000, classes=3)
 53 | keys = np.array(range(X.shape[0]))
 54 | np.random.shuffle(keys)
 55 | X = X[keys]
 56 | y = y[keys]
 57 | X_test = X[3000:]
 58 | y_test = y[3000:]
 59 | X = X[0:3000]
 60 | y = y[0:3000]
 61 | print(X-X_test)
 62 | 
 63 | 
 64 | 
 65 | # 2 inputs, 512 outputs
 66 | dense1 = Layer_Dense(2, 512, weight_L2=5e-4, bias_L2=5e-4)
 67 | activation1 = Activation_ReLu()
 68 | 
 69 | dropout1 = Dropout(0.1)
 70 | 
 71 | # 512 inputs, 3 outputs
 72 | dense2 = Layer_Dense(512, 3)
 73 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
 74 | 
 75 | # Optimizer
 76 | optimizer = Optimizer_Adam(learning_rate=0.05, decay=5e-5)
 77 | 
 78 | # Train for 10001 epochs (0 through 10000)
 79 | for epoch in range(10001):
 80 |     # Forward pass
 81 |     dense1.forward(X)
 82 |     activation1.forward(dense1.output)
 83 |     dropout1.forward(activation1.output)
 84 |     dense2.forward(dropout1.output)
 85 |     data_loss = loss_activation.forward(dense2.output, y)
 86 |     regularization_loss = loss_activation.loss.regularization_loss(dense1) + loss_activation.loss.regularization_loss(dense2)
 87 |     loss = data_loss + regularization_loss
 88 | 
 89 |     # Class with the highest confidence
 90 |     predictions = np.argmax(loss_activation.output, axis=1)
 91 |     if len(y.shape) == 2: # one-hot labels
 92 |         # Convert to class indices
 93 |         y = np.argmax(y, axis=1)
 94 |     accuracy = np.mean(predictions == y)
 95 | 
 96 |     if not epoch % 100:
 97 |         print(f'epoch: {epoch}, ' +
 98 |               f'acc: {accuracy:.3f}, ' +
 99 |               f'loss: {loss:.3f} (' +
100 |               f'data_loss: {data_loss:.3f}, ' +
101 |               f'reg_loss: {regularization_loss:.3f}), ' +
102 |               f'lr: {optimizer.current_learning_rate}'
103 |               )
104 | 
105 |     # Backward pass
106 |     loss_activation.backward(loss_activation.output, y)
107 |     dense2.backward(loss_activation.dinput)
108 |     dropout1.backward(dense2.dinput)
109 |     activation1.backward(dropout1.dinput)
110 |     dense1.backward(activation1.dinput)
111 | 
112 |     # Update the parameters
113 |     optimizer.pre_update_param()
114 |     optimizer.update_param(dense1)
115 |     optimizer.update_param(dense2)
116 |     optimizer.post_update_param()
117 | 
118 | 
119 | 
120 | # Evaluate on the test split; the dropout layer is skipped at inference
121 | 
122 | # Perform a forward pass of our testing data through this layer
123 | dense1.forward(X_test)
124 | # Perform a forward pass through activation function
125 | # takes the output of first dense layer here
126 | activation1.forward(dense1.output)
127 | # Perform a forward pass through second Dense layer
128 | # takes outputs of activation function of first layer as inputs
129 | dense2.forward(activation1.output)
130 | 
131 | # Perform a forward pass through the activation/loss function
132 | # takes the output of second dense layer here and returns loss
133 | loss = loss_activation.forward(dense2.output, y_test)
134 | # Calculate accuracy from the softmax output and targets
135 | # argmax along axis 1 picks the most confident class for each sample
136 | predictions = np.argmax(loss_activation.output, axis=1)
137 | if len(y_test.shape) == 2:
138 |     y_test = np.argmax(y_test, axis=1)
139 | accuracy = np.mean(predictions==y_test)
140 | print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')
141 | ```
142 | 
143 | ![image-20230811155900960](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111559007.png)
144 | 
145 | ![image-20230811155947357](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111559395.png)
146 | 
147 | > The actual results are better than those in the book.
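
The shuffle-and-split step at the top of the example can be wrapped in a small helper. The sketch below is one way to do it; the function name `shuffle_split`, the `n_train` and `seed` parameters and the synthetic data are assumptions made for illustration and do not appear in the original code.

```python
import numpy as np


def shuffle_split(X, y, n_train, seed=None):
    """Shuffle X and y with one shared permutation, then split at n_train."""
    rng = np.random.default_rng(seed)
    keys = rng.permutation(X.shape[0])
    X, y = X[keys], y[keys]
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


# Synthetic stand-in for spiral_data(samples=2000, classes=3): 6000 rows.
X = np.arange(12000, dtype=float).reshape(6000, 2)
y = np.repeat(np.arange(3), 2000)

X_train, y_train, X_test, y_test = shuffle_split(X, y, n_train=3000, seed=0)
print(X_train.shape, X_test.shape)  # (3000, 2) (3000, 2)
```

Using a single permutation for both arrays keeps every sample paired with its label, which is what indexing `X` and `y` with the same `keys` achieves in the example.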


--------------------------------------------------------------------------------
/8Dropout/Dropout.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/8Dropout/Dropout.pdf


--------------------------------------------------------------------------------
/8Dropout/NNFS_version8.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Author: 黄欣
  3 | Date: 2023-08-11
  4 | """
  5 | 
  6 | # This version implements the dropout layer
  7 | 
  8 | import numpy as np
  9 | from nnfs.datasets import spiral_data
 10 | import matplotlib.pyplot as plt
 11 | 
 12 | 
 13 | class Layer_Dense:
 14 |     def __init__(self, n_input, n_neuron, weight_L1=0., weight_L2=0., bias_L1=0., bias_L2=0.):
 15 |         # Initialize the weights from a normal distribution
 16 |         self.weight = 0.01 * np.random.randn(n_input, n_neuron)
 17 |         # Initialize the bias to 0
 18 |         # self.bias = np.zeros(n_neuron)
 19 |         self.bias = np.zeros((1, n_neuron))
 20 |         self.weight_L1 = weight_L1
 21 |         self.weight_L2 = weight_L2
 22 |         self.bias_L1 = bias_L1
 23 |         self.bias_L2 = bias_L2
 24 | 
 25 |     def forward(self, input):
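 26 |         # The backward method needs the input:
 27 |         # the partial derivative of the Layer_Dense output with respect to input is self.weight,
 28 |         # and the partial derivative of the output with respect to self.weight is input,
 29 |         # so forward stores the input in self.input.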
 30 |         self.input = input
 31 |         self.output = np.dot(input, self.weight) + self.bias
 32 | 
 33 |     def backward(self, dvalue):
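 34 |         # dvalue is the derivative of the loss with respect to the input of the next layer (Activation),
 35 |         # i.e. the derivative of the loss with respect to the output of this layer (Layer_Dense).
 36 |         # The chain rule is applied here.
 37 | 
 38 |         # First, compute the derivative of the loss with respect to self.weight of this layer,
 39 |         # which gives the direction in which to optimize self.weight (the negative gradient direction).
 40 | 
 41 |         # self.dweight must have the same shape as self.weight so that w - lr * dw can be applied directly.
 42 |         # Suppose input holds one sample of shape 1 x a and weight has shape a x b, so output has shape 1 x b.
 43 |         # The loss is a scalar, so dvalue = dloss/doutput has the shape of output (1 x b),
 44 |         # and dweight has shape (1 x a).T * (1 x b) = a x b, the same as weight.
 45 |         # Note: with several samples (a matrix input), dweight is the sum of the per-sample a x b matrices.
 46 |         self.dweight = np.dot(self.input.T, dvalue)
 47 | 
 48 |         # Second, compute the derivative of the loss with respect to self.input of this layer,
 49 |         # which is passed as dvalue to the backward method of the layer that runs next in the backward pass.
 50 | 
 51 |         # The loss is a scalar, so dinput has the shape of input (1 x a),
 52 |         # dvalue = dloss/doutput has the shape of output (1 x b),
 53 |         # and weight has shape a x b,
 54 |         # so 1 x a = (1 x b) * (a x b).T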
 55 |         self.dinput = np.dot(dvalue, self.weight.T)
 56 | 
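 57 |         # Like self.dinput, self.dbias could be computed with a matrix product,
 58 |         # self.dbias = np.dot( dvalue, np.ones( ( len(self.bias), len(self.bias) ) ) )
 59 |         # but there is a faster and simpler way
 60 |         self.dbias = np.sum(dvalue, axis=0, keepdims=True)  # keepdims=True keeps the (1, n_neuron) shape of self.bias; without it the 1-D result would still broadcast
 61 | 
 62 |         # Gradient of the regularization terms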
 63 |         if self.weight_L2 > 0:
 64 |             self.dweight += 2 * self.weight_L2 * self.weight
 65 |         if self.bias_L2 > 0:
 66 |             self.dbias += 2 * self.bias_L2 * self.bias
 67 |         if self.weight_L1 > 0:
 68 |             dL = np.ones_like(self.weight)
 69 |             dL[self.weight < 0] = -1
 70 |             self.dweight += self.weight_L1 * dL
 71 |         if self.bias_L1 > 0:
 72 |             dL = np.ones_like(self.bias)
 73 |             dL[self.bias < 0] = -1
 74 |             self.dbias += self.bias_L1 * dL
 75 | 
 76 | class Activation_Sigmoid:
 77 |     def __init__(self):
 78 |         pass
 79 | 
 80 |     def forward(self, input):
 81 |         self.input = input
 82 | 
 83 |         # Sigmoid is applied element-wise, so input may have any shape.
 84 |         # For one binary output the preceding layer is Layer_Dense(k, 1), which gives an input of shape (n_sample, 1).
 85 |         self.output = 1 / (1 + np.exp(- (self.input)))
 86 | 
 87 |     def backward(self, dvalue):
 88 |         # A matrix product would also work, but dinput, dvalue and output share one shape,
 89 |         # so an element-wise product is enough.
 90 |         self.dinput = dvalue * self.output * (1 - self.output)
 91 | 
 92 | 
 93 | class Activation_ReLu:
 94 |     def __init__(self):
 95 |         pass
 96 | 
 97 |     def forward(self, input):
 98 |         self.input = input
 99 |         self.output = np.maximum(0, input)
100 | 
101 |     def backward(self, dvalue):
102 |         # self.input and self.output have the same shape,
103 |         # so dinput, doutput and dvalue do too.
104 |         # A mask is faster here than a matrix product.
105 |         self.dinput = dvalue.copy()
106 |         self.dinput[self.input < 0] = 0
107 | 
108 | 
109 | class Activation_Softmax:
110 |     def __init__(self):
111 |         pass
112 | 
113 |     def forward(self, input):
114 |         self.input = input
115 | 
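116 |         # keepdims=True is required here.
117 |         # Without it, np.max(input, axis=1) returns a 1-D array of shape (n_sample,)
118 |         # instead of a column of shape (n_sample, 1),
119 |         # which does not broadcast correctly against input.
120 |         # Subtracting the row maximum before exponentiating keeps the exponents non-positive and avoids overflow.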
121 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
122 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
123 | 
124 |     def backward(self, dvalue):
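125 |         # For a single sample, input and output both have shape 1 x a.
126 |         # The loss is a scalar, so dinput and doutput (i.e. dvalue) also have shape 1 x a,
127 |         # and the derivative of output with respect to input is an a x a square matrix.
128 | 
129 |         # Uninitialized array with the same shape as dvalue
130 |         self.dinput = np.empty_like(dvalue)
131 |         # Loop over the samples (one per row)
132 |         for each, (single_output, single_dvalue) in enumerate(zip(self.output, dvalue)):
133 |             # Either multiplication order below gives a dinput of the same shape.
134 |             # Conceptually this is (1 x a) * (a x a) = 1 x a, a row vector.
135 |             # First reshape the 1-D array of length a into a 1 x a matrix,
136 |             # because a 1-D array has no transpose (.T returns it unchanged).
137 |             # Given a 1-D array, np.dot treats it as a row or a column as needed and returns a 1-D array.
138 |             # Given a 1 x a matrix, np.dot requires matching shapes (otherwise it raises an error) and returns a matrix.
139 |             single_output = single_output.reshape(1, -1)
140 |             jacobian_matrix = np.diagflat(single_output) - np.dot(single_output.T, single_output)
141 |             # single_dvalue is 1-D, so np.dot treats it as a row or a column as needed:
142 |             # np.dot(single_dvalue, jacobian_matrix) and np.dot(jacobian_matrix, single_dvalue)
143 |             # both return a 1-D array of length a. The softmax Jacobian is symmetric,
144 |             # so the two products also hold the same values,
145 |             # and either order is correct here.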
146 |             self.dinput[each] = np.dot(jacobian_matrix, single_dvalue)
147 | 
148 | 
149 | class Loss:
150 |     def __init__(self):
151 |         pass
152 | 
153 |     # Always compute the loss by calling calculate
154 |     def calculate(self, y_pred, y_true):
155 |         # Each loss function subclasses Loss and implements its own forward method.
156 |         data_loss = np.mean(self.forward(y_pred, y_true))
157 |         # Note: the loss is not stored as an attribute; it is returned directly.
158 |         return data_loss
159 | 
160 |     def regularization_loss(self, layer):
161 |         # Defaults to 0
162 |         regularization_loss = 0
163 |         # Add the L1 penalty if it is enabled
164 |         if layer.weight_L1 > 0:
165 |             regularization_loss += layer.weight_L1 * np.sum(np.abs(layer.weight))
166 |         if layer.bias_L1 > 0:
167 |             regularization_loss += layer.bias_L1 * np.sum(np.abs(layer.bias))
168 |         # Add the L2 penalty if it is enabled
169 |         if layer.weight_L2 > 0:
170 |             regularization_loss += layer.weight_L2 * np.sum(layer.weight ** 2)
171 |         if layer.bias_L2 > 0:
172 |             regularization_loss += layer.bias_L2 * np.sum(layer.bias ** 2)
173 | 
174 |         return regularization_loss
175 | 
176 | class Loss_CategoricalCrossentropy(Loss):
177 |     def __init__(self):
178 |         pass
179 | 
180 |     def forward(self, y_pred, y_true):
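181 |         # Number of samples
182 |         n_sample = len(y_true)
183 | 
184 |         # Clip at 1e-7 on the left to avoid log(0).
185 |         # Clipping only on the left would shift the confidences toward 1, even if only by a tiny amount,
186 |         # so to prevent that shift the right bound is 1 - 1e-7
187 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
188 | 
189 |         loss = - np.log(y_pred)
190 |         if len(y_true.shape) == 2:  # one-hot encoded labels
191 |             loss = np.sum(loss * y_true, axis=1)
192 |         elif len(y_true.shape) == 1:  # sparse labels (one class index per sample)
193 |             # Note: loss[:, y_true] is not equivalent; it would return a matrix
194 |             loss = loss[range(n_sample), y_true]
195 | 
196 |         # loss is a 1-D array with one entry per sample.
197 |         # Do not average here; the parent class's calculate method takes the mean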
198 |         return loss
199 | 
200 |     def backward(self, y_pred, y_true):
201 |         n_sample = len(y_true)
202 |         if len(y_true.shape) == 2:  # one-hot encoded labels
203 |             label = y_true
204 |         elif len(y_true.shape) == 1:  # sparse labels (one class index per sample)
205 |             # Convert the labels to one-hot encoding
206 |             label = np.zeros((n_sample, len(y_pred[0])))
207 |             label[range(n_sample), y_true] = 1
208 |         self.dinput = - label / y_pred
209 |         # Divide each sample's gradient by n_sample, because the gradients are summed over the samples during optimization
210 |         self.dinput = self.dinput / n_sample
211 | 
212 | 
213 | class Loss_BinaryCrossentropy(Loss):
214 |     def __init__(self):
215 |         pass
216 | 
217 |     def forward(self, y_pred, y_true):
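218 |         # Pay special attention here; the book does not spell this out.
219 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
220 |         # (n_sample,) still broadcasts against (n_sample, 1), but a 1-D array cannot be transposed into a column,
221 |         # so the loss below would end up with shape (n_sample, n_sample).
222 |         # With two binary outputs, y_pred has shape (n_sample, 2) and y_true has shape (n_sample, 2).
223 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
224 |             y_true = y_true.reshape(-1, 1)
225 |         # Clip at 1e-7 on the left to avoid log(0).
226 |         # Clipping only on the left would shift the confidences toward 1, even if only by a tiny amount,
227 |         # so to prevent that shift the right bound is 1 - 1e-7
228 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
229 |         loss = -  np.log(y_pred) * y_true - np.log(1 - y_pred) * (1 - y_true)
230 |         # This mean is taken over a different axis than the one in the parent class's calculate:
231 |         # here it averages over the binary outputs of each sample,
232 |         # while calculate averages over the samples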
233 |         loss = np.mean(loss, axis=-1)
234 |         return loss
235 | 
236 |     def backward(self, y_pred, y_true):
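237 |         # Number of samples
238 |         n_sample = len(y_pred)
239 |         # Number of binary outputs
240 |         n_output = len(y_pred[0])
241 |         # Pay special attention here; the book does not spell this out.
242 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
243 |         # (n_sample,) still broadcasts against (n_sample, 1), but a 1-D array cannot be transposed into a column,
244 |         # so the gradient below would end up with shape (n_sample, n_sample).
245 |         # With two binary outputs, y_pred has shape (n_sample, 2) and y_true has shape (n_sample, 2).
246 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
247 |             y_true = y_true.reshape(-1, 1)
248 |         # Note: BinaryCrossentropy is always preceded by a Sigmoid activation,
249 |         # and Sigmoid easily produces outputs of exactly 0 or 1,
250 |         # so clip at 1e-7 on the left.
251 |         # Clipping only on the left would shift the confidences toward 1, even if only by a tiny amount,
252 |         # so to prevent that shift the right bound is 1 - 1e-7
253 |         y_pred_clip = np.clip(y_pred, 1e-7, 1 - 1e-7)
254 |         # Never write it as below: the unary minus in -y_true is applied first, and when y_true is uint8, -1 wraps around to 255.
255 |         # This bug took a long time to find, so take it seriously.
256 |         # self.dinput = (-y_true / y_pred_clip + (1 - y_true) / (1 - y_pred_clip)) / n_output
257 |         self.dinput = -(y_true / y_pred_clip - (1 - y_true) / (1 - y_pred_clip)) / n_output
258 |         # Divide each sample's gradient by n_sample, because the gradients are summed over the samples during optimization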
259 |         self.dinput = self.dinput / n_sample
260 | 
261 | 
262 | class Activation_Softmax_Loss_CategoricalCrossentropy():
263 |     def __init__(self):
264 |         self.activation = Activation_Softmax()
265 |         self.loss = Loss_CategoricalCrossentropy()
266 | 
267 |     # Note: Activation_Softmax_Loss_CategoricalCrossentropy computes the loss through forward,
268 |     # because it does not inherit from Loss
269 |     def forward(self, input, y_true):
270 |         self.activation.forward(input)
271 |         # The output attribute of this class is the Activation_Softmax output
272 |         self.output = self.activation.output
273 |         # The method returns the loss
274 |         return self.loss.calculate(self.output, y_true)
275 | 
276 |     # y_pred always equals self.output; the parameter is kept for consistency with the earlier code
277 |     def backward(self, y_pred, y_true):
278 |         # Number of samples
279 |         n_sample = len(y_true)
280 |         if len(y_true.shape) == 2:  # one-hot encoded labels
281 |             # Apply the formula directly
282 |             self.dinput = y_pred - y_true
283 |         elif len(y_true.shape) == 1:  # sparse labels (one class index per sample)
284 |             self.dinput = y_pred.copy()
285 |             # Subtract 1 at the y_true class index of each row; the other entries are left unchanged
286 |             self.dinput[range(n_sample), y_true] -= 1
287 |         # Divide each sample's gradient by n_sample, because the gradients are summed over the samples during optimization
288 |         self.dinput = self.dinput / n_sample
289 | 
290 | 
291 | class Activation_Sigmoid_Loss_BinaryCrossentropy():
292 |     def __init__(self):
293 |         self.activation = Activation_Sigmoid()
294 |         self.loss = Loss_BinaryCrossentropy()
295 | 
296 |     def forward(self, input, y_true):
297 |         self.activation.forward(input)
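298 |         # The output of this class is the Sigmoid output
299 |         self.output = self.activation.output
300 |         return self.loss.calculate(self.output, y_true)
301 | 
302 |     def backward(self, y_pred, y_true):
303 |         # Number of samples
304 |         n_sample = len(y_pred)
305 |         # Pay special attention here; the book does not spell this out.
306 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
307 |         # (n_sample,) still broadcasts against (n_sample, 1), but a 1-D array cannot be transposed into a column,
308 |         # so the gradient below would end up with shape (n_sample, n_sample).
309 |         # With two binary outputs, y_pred has shape (n_sample, 2) and y_true has shape (n_sample, 2).
310 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
311 |             y_true = y_true.reshape(-1, 1)
312 |         # Number of binary outputs
313 |         J = len(y_pred[0])
314 |         # Each row of y_true holds J binary values (1 or 0): 1 marks a positive example, 0 a negative one.
315 |         self.dinput = (y_pred - y_true) / J
316 | 
317 |         # The gradients of all samples are summed during optimization; divide by the sample count so the gradient does not depend on it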
318 |         self.dinput /= n_sample
319 | 
320 | class Optimizer_SGD():
321 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
322 |     def __init__(self, learning_rate = 1.0, decay = 0., momentum=0):
323 |         self.learning_rate = learning_rate
324 |         self.decay = decay
325 |         self.current_learning_rate = learning_rate
326 |         self.iteration = 0
327 |         self.momentum = momentum
328 | 
329 |     def pre_update_param(self):
330 |         # Decay: multiply the step count by the decay rate, then divide the initial learning rate by 1 plus that product
331 |         if self.decay:
332 |             self.current_learning_rate = self.learning_rate * \
333 |                                          (1 / (1 + self.decay * self.iteration))
334 | 
335 |     # Takes a layer object and performs the most basic update
336 |     def update_param(self, layer):
337 | 
338 |         deta_weight = layer.dweight
339 |         deta_bias = layer.dbias
340 | 
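341 |         # If momentum is used
342 |         if self.momentum:
343 |             # If no momentum has been accumulated yet
344 |             if not hasattr(layer, "dweight_cumulate"):
345 |                 # Note: the attributes are added to the layer object,
346 |                 # which makes sense, since the history belongs with the object it describes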
347 |                 layer.dweight_cumulate = np.zeros_like(layer.weight)
348 |                 layer.dbias_cumulate = np.zeros_like(layer.bias)
349 |             deta_weight += self.momentum * layer.dweight_cumulate
350 |             layer.dweight_cumulate = deta_weight
351 |             deta_bias += self.momentum * layer.dbias_cumulate
352 |             layer.dbias_cumulate = deta_bias
353 |         layer.weight -= self.current_learning_rate * deta_weight
354 |         # (64,) = (64,) + (1,64) >> (1,64)
355 |         # (64,) += (1,64) >> cannot broadcast
356 |         # (1, 64) = (64,) + (1,64) >> (1,64)
357 |         # (1, 64) += (64,) >> (1,64)
358 |         # So Layer_Dense was changed:
359 |         # self.bias = np.zeros(n_neuron) => self.bias = np.zeros((1, n_neuron))
360 |         layer.bias -= self.current_learning_rate * deta_bias
361 | 
362 |     def post_update_param(self):
363 |         self.iteration += 1
364 | 
365 | class Optimizer_Adagrad():
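366 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
367 |     def __init__(self, learning_rate = 1.0, decay = 0., epsilon = 1e-7):
368 |         self.learning_rate = learning_rate
369 |         self.decay = decay
370 |         self.current_learning_rate = learning_rate
371 |         self.iteration = 0
372 |         # Small constant that prevents division by 0
373 |         self.epsilon = epsilon
374 | 
375 | 
376 |     def pre_update_param(self):
377 |         # Decay: multiply the step count by the decay rate, then divide the initial learning rate by 1 plus that product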
378 |         if self.decay:
379 |             self.current_learning_rate = self.learning_rate * \
380 |                                          (1 / (1 + self.decay * self.iteration))
381 | 
382 |     # Takes a layer object
383 |     def update_param(self, layer):
384 |         if not hasattr(layer, 'dweight_square_sum'):
385 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
386 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
387 |         layer.dweight_square_sum = layer.dweight_square_sum + layer.dweight ** 2
388 |         layer.dbias_square_sum = layer.dbias_square_sum + layer.dbias ** 2
389 |         layer.weight += -self.current_learning_rate * layer.dweight / \
390 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
391 |         layer.bias += -self.current_learning_rate * layer.dbias / \
392 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
393 | 
394 |     def post_update_param(self):
395 |         self.iteration += 1
396 | 
397 | class Optimizer_RMSprop():
398 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
399 |     def __init__(self, learning_rate = 0.001, decay = 0., epsilon = 1e-7, beta = 0.9):
400 |         # Note: the default learning_rate here is 0.001, not 1
401 |         self.learning_rate = learning_rate
402 |         self.decay = decay
403 |         self.current_learning_rate = learning_rate
404 |         self.iteration = 0
405 |         # Small constant that prevents division by 0
406 |         self.epsilon = epsilon
407 |         self.beta = beta
408 | 
409 |     def pre_update_param(self):
410 |         # Decay: multiply the step count by the decay rate, then divide the initial learning rate by 1 plus that product
411 |         if self.decay:
412 |             self.current_learning_rate = self.learning_rate * \
413 |                                          (1 / (1 + self.decay * self.iteration))
414 | 
415 |     # Takes a layer object
416 |     def update_param(self, layer):
417 |         if not hasattr(layer, 'dweight_square_sum'):
418 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
419 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
420 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
421 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
422 |         layer.weight += -self.current_learning_rate * layer.dweight / \
423 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
424 |         layer.bias += -self.current_learning_rate * layer.dbias / \
425 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
426 | 
427 |     def post_update_param(self):
428 |         self.iteration += 1
429 | 
430 | class Optimizer_Adam():
431 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
432 |     def __init__(self, learning_rate = 0.001, decay = 0., epsilon = 1e-7, momentum = 0.9,beta = 0.999):
433 |         # Note: the default learning_rate here is 0.001, not 1
434 |         self.learning_rate = learning_rate
435 |         self.decay = decay
436 |         self.current_learning_rate = learning_rate
437 |         self.iteration = 0
438 |         # Small constant that prevents division by 0
439 |         self.epsilon = epsilon
440 |         self.beta = beta
441 |         self.momentum = momentum
442 | 
443 |     def pre_update_param(self):
444 |         # Decay: multiply the step count by the decay rate, then divide the initial learning rate by 1 plus that product
445 |         if self.decay:
446 |             self.current_learning_rate = self.learning_rate * \
447 |                                          (1 / (1 + self.decay * self.iteration))
448 | 
449 |     # Takes a layer object
450 |     def update_param(self, layer):
451 |         if not hasattr(layer, 'dweight_square_sum') or not hasattr(layer, 'dweight_cumulate'):
452 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
453 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
454 |             layer.dweight_cumulate = np.zeros_like(layer.weight)
455 |             layer.dbias_cumulate = np.zeros_like(layer.bias)
456 |         # Momentum (moving average of the gradients)
457 |         layer.dweight_cumulate = self.momentum * layer.dweight_cumulate + (1 - self.momentum) * layer.dweight
458 |         layer.dbias_cumulate = self.momentum * layer.dbias_cumulate + (1 - self.momentum) * layer.dbias
459 |         # Bias-corrected momentum
460 |         layer.dweight_cumulate_modified = layer.dweight_cumulate / (1 - self.momentum ** (self.iteration + 1))
461 |         layer.dbias_cumulate_modified = layer.dbias_cumulate / (1 - self.momentum ** (self.iteration + 1))
462 |         # Moving average of the squared gradients
463 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
464 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
465 |         # Bias-corrected moving average of the squared gradients
466 |         layer.dweight_square_sum_modified = layer.dweight_square_sum / (1 - self.beta ** (self.iteration + 1))
467 |         layer.dbias_square_sum_modified = layer.dbias_square_sum / (1 - self.beta ** (self.iteration + 1))
468 | 
469 |         layer.weight += -self.current_learning_rate * layer.dweight_cumulate_modified / \
470 |                         ( np.sqrt(layer.dweight_square_sum_modified) + self.epsilon )
471 |         layer.bias += -self.current_learning_rate * layer.dbias_cumulate_modified / \
472 |                         (np.sqrt(layer.dbias_square_sum_modified) + self.epsilon)
473 | 
474 |     def post_update_param(self):
475 |         self.iteration += 1
476 | 
477 | class Dropout():
478 |     def __init__(self, rate):
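479 |         # rate is the probability of dropping a connection; self.rate stores the keep probability
480 |         self.rate = 1 - rate
481 | 
482 |     def forward(self, input):
483 |         self.input = input
484 |         # Draw a 0/1 matrix with the keep probability.
485 |         # Only a fraction self.rate of the entries is 1, so divide by self.rate to compensate for the lost activations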
486 |         self.mask = np.random.binomial(1, self.rate, size=self.input.shape) / self.rate
487 |         self.output = self.input * self.mask
488 | 
489 |     def backward(self,dvalue):
490 |         self.dinput = dvalue * self.mask
491 | 
492 | # 2 inputs, 512 outputs
493 | dense1 = Layer_Dense(2, 512, weight_L2=5e-4, bias_L2=5e-4)
494 | activation1 = Activation_ReLu()
495 | 
496 | dropout1 = Dropout(0.1)
497 | 
498 | # 512 inputs, 3 outputs
499 | dense2 = Layer_Dense(512, 3)
500 | loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
501 | 
502 | # Optimizer
503 | optimizer = Optimizer_Adam(learning_rate=0.05, decay=5e-5)
504 | 
505 | # Dataset
506 | X, y = spiral_data(samples=2000, classes=3)
507 | keys = np.array(range(X.shape[0]))
508 | # Careful: np.random.shuffle shuffles keys in place and returns None,
509 | # so do not write keys = np.random.shuffle(keys)
510 | np.random.shuffle(keys)
511 | X = X[keys]
512 | y = y[keys]
513 | X_test = X[3000:]
514 | y_test = y[3000:]
515 | X = X[0:3000]
516 | y = y[0:3000]
517 | #print(X-X_test)
518 | print(X[:5])
519 | print(X_test[:5])
520 | 
521 | 
522 | 
523 | 
524 | # Train for 10001 epochs (0 through 10000)
525 | for epoch in range(10001):
526 |     # Forward pass
527 |     dense1.forward(X)
528 |     activation1.forward(dense1.output)
529 |     dropout1.forward(activation1.output)
530 |     dense2.forward(dropout1.output)
531 |     data_loss = loss_activation.forward(dense2.output, y)
532 |     regularization_loss = loss_activation.loss.regularization_loss(dense1) + loss_activation.loss.regularization_loss(dense2)
533 |     loss = data_loss + regularization_loss
534 | 
535 |     # Class with the highest confidence
536 |     predictions = np.argmax(loss_activation.output, axis=1)
537 |     if len(y.shape) == 2: # one-hot encoded labels
538 |         # Convert to class indices
539 |         y = np.argmax(y, axis=1)
540 |     accuracy = np.mean(predictions == y)
541 | 
542 |     if not epoch % 100:
543 |         print(f'epoch: {epoch}, ' +
544 |               f'acc: {accuracy:.3f}, ' +
545 |               f'loss: {loss:.3f} (' +
546 |               f'data_loss: {data_loss:.3f}, ' +
547 |               f'reg_loss: {regularization_loss:.3f}), ' +
548 |               f'lr: {optimizer.current_learning_rate}'
549 |               )
550 | 
551 |     # Backward pass
552 |     loss_activation.backward(loss_activation.output, y)
553 |     dense2.backward(loss_activation.dinput)
554 |     dropout1.backward(dense2.dinput)
555 |     activation1.backward(dropout1.dinput)
556 |     dense1.backward(activation1.dinput)
557 | 
558 |     # Update the parameters
559 |     optimizer.pre_update_param()
560 |     optimizer.update_param(dense1)
561 |     optimizer.update_param(dense2)
562 |     optimizer.post_update_param()
563 | 
564 | 
565 | 
566 | 
567 | 
568 | # Perform a forward pass of our testing data through this layer
569 | dense1.forward(X_test)
570 | # Perform a forward pass through activation function
571 | # takes the output of first dense layer here
572 | activation1.forward(dense1.output)
573 | # Perform a forward pass through second Dense layer
574 | # takes outputs of activation function of first layer as inputs
575 | dense2.forward(activation1.output)
576 | 
577 | # Perform a forward pass through the activation/loss function
578 | # takes the output of second dense layer here and returns loss
579 | loss = loss_activation.forward(dense2.output, y_test)
580 | # Calculate accuracy from output of activation2 and targets
581 | # calculate values along first axis
582 | predictions = np.argmax(loss_activation.output, axis=1)
583 | if len(y_test.shape) == 2:
584 |     y_test = np.argmax(y_test, axis=1)
585 | accuracy = np.mean(predictions==y_test)
586 | print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')
587 | 
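
The Dropout class in this file divides the 0/1 mask by the keep probability, and the validation pass skips the dropout layer entirely. A minimal standalone sketch of why that scaling makes the two passes comparable is given below; the helper `inverted_dropout`, its parameters `drop_rate` and `rng`, and the array sizes are illustrative assumptions and not part of the original code.

```py
import numpy as np


def inverted_dropout(activations, drop_rate, rng):
    """Hypothetical helper: zero a random subset of units and rescale the survivors."""
    keep_prob = 1.0 - drop_rate
    # 0/1 mask drawn with the keep probability, divided by keep_prob so that
    # the expected value of each activation is unchanged
    mask = rng.binomial(1, keep_prob, size=activations.shape) / keep_prob
    return activations * mask, mask


rng = np.random.default_rng(0)
activations = np.ones((1000, 512))
dropped, mask = inverted_dropout(activations, drop_rate=0.1, rng=rng)

# The mean stays close to the original mean of 1.0,
# while roughly drop_rate of the entries are zeroed
mean_after = dropped.mean()
zero_fraction = (dropped == 0).mean()
print(mean_after, zero_fraction)
```

Because the expected activation is preserved during training, the layer can simply be left out at inference time, which is what the validation code above does.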


--------------------------------------------------------------------------------
/9Regression/NNFS_version9.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Author: 黄欣
  3 | Date: 2023-08-11
  4 | """
  5 | 
  6 | # This version adds support for regression
  7 | 
  8 | import numpy as np
  9 | from nnfs.datasets import spiral_data
 10 | from nnfs.datasets import sine_data
 11 | import matplotlib.pyplot as plt
 12 | 
 13 | 
 14 | class Layer_Dense:
 15 |     def __init__(self, n_input, n_neuron, weight_L1=0., weight_L2=0., bias_L1=0., bias_L2=0.):
 16 |         # Initialize the weights from a normal distribution
 17 |         self.weight = 0.1 * np.random.randn(n_input, n_neuron)
 18 |         # Initialize the bias to 0
 19 |         # self.bias = np.zeros(n_neuron)
 20 |         self.bias = np.zeros((1, n_neuron))
 21 |         self.weight_L1 = weight_L1
 22 |         self.weight_L2 = weight_L2
 23 |         self.bias_L1 = bias_L1
 24 |         self.bias_L2 = bias_L2
 25 | 
 26 |     def forward(self, input):
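 27 |         # Needed by the backward method:
 28 |         # the partial derivative of the Layer_Dense output with respect to its input is self.weight,
 29 |         # and the partial derivative of the output with respect to self.weight is the input,
 30 |         # so forward stores the input as self.input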
 31 |         self.input = input
 32 |         self.output = np.dot(input, self.weight) + self.bias
 33 | 
 34 |     def backward(self, dvalue):
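 35 |         # dvalue is the derivative of the loss with respect to the input of the next layer (Activation),
 36 |         # which is also the derivative of the loss with respect to this layer's (Layer_Dense) output;
 37 |         # the chain rule is applied here
 38 | 
 39 |         # First, the derivative of the loss with respect to this layer's self.weight,
 40 |         # which gives the direction in which to optimize self.weight (negative gradient direction)
 41 | 
 42 |         # self.dweight must have the same shape as self.weight so that the update w - lr * dw works directly.
 43 |         # Suppose input holds a single sample of shape 1xa and weight has shape axb, so output has shape 1xb.
 44 |         # The loss is a scalar, so dvalue = dloss/doutput has the shape of output (1xb),
 45 |         # and dweight has shape (1xa).T * (1xb) = axb, the same as weight.
 46 |         # Note: when input holds several samples (a matrix), dweight is the sum of the per-sample axb matrices.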
 47 |         self.dweight = np.dot(self.input.T, dvalue)
 48 | 
 49 |         # Next, the derivative of the loss with respect to this layer's self.input,
 50 |         # which becomes the dvalue argument of the next backward call (the preceding layer)
 51 | 
 52 |         # The loss is a scalar, so dinput has the shape of input (1xa),
 53 |         # dvalue = dloss/doutput has the shape of output (1xb),
 54 |         # and weight has shape axb,
 55 |         # so 1xa = (1xb) * (axb).T
 56 |         self.dinput = np.dot(dvalue, self.weight.T)
 57 | 
 58 |         # Like self.dinput, self.dbias could be computed with a matrix product,
 59 |         # self.dbias = np.dot( dvalue, np.ones( ( len(self.bias), len(self.bias) ) ) )
 60 |         # but there is a faster and simpler way
 61 |         self.dbias = np.sum(dvalue, axis=0, keepdims=True)  # keepdims=True keeps the shape (1, n_neuron), matching self.bias
 62 | 
 63 |         # Gradients of the regularization terms
 64 |         if self.weight_L2 > 0:
 65 |             self.dweight += 2 * self.weight_L2 * self.weight
 66 |         if self.bias_L2 > 0:
 67 |             self.dbias += 2 * self.bias_L2 * self.bias
 68 |         if self.weight_L1 > 0:
 69 |             dL = np.ones_like(self.weight)
 70 |             dL[self.weight < 0] = -1
 71 |             self.dweight += self.weight_L1 * dL
 72 |         if self.bias_L1 > 0:
 73 |             dL = np.ones_like(self.bias)
 74 |             dL[self.bias < 0] = -1
 75 |             self.dbias += self.bias_L1 * dL
 76 | 
 77 | class Activation_Sigmoid:
 78 |     def __init__(self):
 79 |         pass
 80 | 
 81 |     def forward(self, input):
 82 |         self.input = input
 83 | 
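 84 |         # input has shape nx1: n samples are fed to the activation, each with a single dimension,
 85 |         # so the preceding layer must have a single output neuron, e.g. Layer_Dense(n_input, 1)
 86 |         self.output = 1 / (1 + np.exp(- (self.input)))
 87 | 
 88 |     def backward(self, dvalue):
 89 |         # A matrix product would also work, but dinput, dvalue and output have the same shape,
 90 |         # so an element-wise product is enough.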
 91 |         self.dinput = dvalue * self.output * (1 - self.output)
 92 | 
 93 | 
 94 | class Activation_ReLu:
 95 |     def __init__(self):
 96 |         pass
 97 | 
 98 |     def forward(self, input):
 99 |         self.input = input
100 |         self.output = np.maximum(0, input)
101 | 
102 |     def backward(self, dvalue):
103 |         # self.input and self.output have the same shape,
104 |         # so dinput, doutput and dvalue all have the same shape too.
105 |         # A mask is faster than a matrix product
106 |         self.dinput = dvalue.copy()
107 |         self.dinput[self.input < 0] = 0
108 | 
109 | 
110 | class Activation_Softmax:
111 |     def __init__(self):
112 |         pass
113 | 
114 |     def forward(self, input):
115 |         self.input = input
116 | 
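117 |         # keepdims=True is required here.
118 |         # Without it, np.max(input, axis=1) returns shape (n_sample,) instead of (n_sample, 1),
119 |         # and that length generally does not match the row length of input,
120 |         # so the subtraction cannot broadcast row by row.
121 |         # Subtracting the row maximum before exponentiating keeps the values small and avoids overflow in np.exp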
122 |         self.output = np.exp(input - np.max(input, axis=1, keepdims=True))
123 |         self.output = self.output / np.sum(self.output, axis=1, keepdims=True)
124 | 
125 |     def backward(self, dvalue):
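126 |         # For one sample, input and output both have shape 1xa.
127 |         # The loss is a scalar, so dinput and doutput (i.e. dvalue) also have shape 1xa,
128 |         # and the derivative of output with respect to input is an axa square matrix
129 | 
130 |         # Uninitialized array with the same shape as dvalue
131 |         self.dinput = np.empty_like(dvalue)
132 |         # Loop over each sample (each row)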
133 |         for each, (single_output, single_dvalue) in enumerate(zip(self.output, dvalue)):
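134 |             # Both multiplication orders give a dinput of the same size.
135 |             # Conceptually this is (1xa) * (axa) = 1xa, a row vector.
136 |             # First reshape the 1-D array of length a into a 1xa matrix,
137 |             # because a 1-D array has no transpose (.T returns it unchanged).
138 |             # Given a 1-D array, np.dot treats it as a row or a column as needed and returns a 1-D array.
139 |             # Given a 1xa matrix, np.dot requires the shapes to match (or raises an error) and returns a matrix.
140 |             single_output = single_output.reshape(1, -1)
141 |             jacobian_matrix = np.diagflat(single_output) - np.dot(single_output.T, single_output)
142 |             # single_dvalue is a 1-D array, so np.dot treats it as a row or a column as needed.
143 |             # np.dot(single_dvalue, jacobian_matrix) and np.dot(jacobian_matrix, single_dvalue)
144 |             # both return a 1-D array of length a, and because the softmax Jacobian
145 |             # is symmetric, the two orders also give the same values,
146 |             # so either form is correct here.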
147 |             self.dinput[each] = np.dot(jacobian_matrix, single_dvalue)
148 | 
149 | class Activation_Linear:
150 |     def __init__(self):
151 |         pass
152 | 
153 |     def forward(self, input):
154 |         self.input = input
155 |         self.output = self.input
156 | 
157 |     def backward(self, dvalue):
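158 |         # Note the difference from self.dinput = dvalue (no problem has been observed with that form so far):
159 |         # it would make dinput and dvalue refer to the same object, so any in-place change to dinput would also change the original dvalue.
160 |         # Applying an operation to dvalue, such as multiplying by 1, creates a new array and is equivalent to the code below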
161 |         self.dinput = dvalue.copy()
162 | 
163 | 
164 | class Loss:
165 |     def __init__(self):
166 |         pass
167 | 
168 |     # Losses are always computed through the calculate method
169 |     def calculate(self, y_pred, y_true):
170 |         # Each loss function inherits from Loss and implements its own forward method.
171 |         data_loss = np.mean(self.forward(y_pred, y_true))
172 |         # Note: the loss is not stored as an attribute; it is returned directly
173 |         return data_loss
174 | 
175 |     def regularization_loss(self, layer):
176 |         # Defaults to 0
177 |         regularization_loss = 0
178 |         # Add the L1 penalty if it is enabled
179 |         if layer.weight_L1 > 0:
180 |             regularization_loss += layer.weight_L1 * np.sum(np.abs(layer.weight))
181 |         if layer.bias_L1 > 0:
182 |             regularization_loss += layer.bias_L1 * np.sum(np.abs(layer.bias))
183 |         # Add the L2 penalty if it is enabled
184 |         if layer.weight_L2 > 0:
185 |             regularization_loss += layer.weight_L2 * np.sum(layer.weight ** 2)
186 |         if layer.bias_L2 > 0:
187 |             regularization_loss += layer.bias_L2 * np.sum(layer.bias ** 2)
188 | 
189 |         return regularization_loss
190 | 
191 | class Loss_CategoricalCrossentropy(Loss):
192 |     def __init__(self):
193 |         pass
194 | 
195 |     def forward(self, y_pred, y_true):
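196 |         # Number of samples
197 |         n_sample = len(y_true)
198 | 
199 |         # Clip at 1e-7 on the left to avoid log(0).
200 |         # Clipping only on the left would shift the confidences toward 1, even if only by a tiny amount,
201 |         # so to prevent that shift the right bound is 1 - 1e-7
202 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
203 | 
204 |         loss = - np.log(y_pred)
205 |         if len(y_true.shape) == 2:  # one-hot encoded labels
206 |             loss = np.sum(loss * y_true, axis=1)
207 |         elif len(y_true.shape) == 1:  # sparse labels (one class index per sample)
208 |             # Note: loss[:, y_true] is not equivalent; it would return a matrix
209 |             loss = loss[range(n_sample), y_true]
210 | 
211 |         # loss is a 1-D array with one entry per sample.
212 |         # Do not average here; the parent class's calculate method takes the mean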
213 |         return loss
214 | 
215 |     def backward(self, y_pred, y_true):
216 |         n_sample = len(y_true)
217 |         if len(y_true.shape) == 2:  # one-hot encoded labels
218 |             label = y_true
219 |         elif len(y_true.shape) == 1:  # sparse labels (one class index per sample)
220 |             # Convert the labels to one-hot encoding
221 |             label = np.zeros((n_sample, len(y_pred[0])))
222 |             label[range(n_sample), y_true] = 1
223 |         self.dinput = - label / y_pred
224 |         # Divide each sample's gradient by n_sample, because the gradients are summed over the samples during optimization
225 |         self.dinput = self.dinput / n_sample
226 | 
227 | 
228 | class Loss_BinaryCrossentropy(Loss):
229 |     def __init__(self):
230 |         pass
231 | 
232 |     def forward(self, y_pred, y_true):
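233 |         # Pay special attention here; the book does not spell this out.
234 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
235 |         # (n_sample,) still broadcasts against (n_sample, 1), but a 1-D array cannot be transposed into a column,
236 |         # so the loss below would end up with shape (n_sample, n_sample).
237 |         # With two binary outputs, y_pred has shape (n_sample, 2) and y_true has shape (n_sample, 2).
238 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
239 |             y_true = y_true.reshape(-1, 1)
240 |         # Clip at 1e-7 on the left to avoid log(0).
241 |         # Clipping only on the left would shift the confidences toward 1, even if only by a tiny amount,
242 |         # so to prevent that shift the right bound is 1 - 1e-7
243 |         y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
244 |         loss = -  np.log(y_pred) * y_true - np.log(1 - y_pred) * (1 - y_true)
245 |         # This mean is taken over a different axis than the one in the parent class's calculate:
246 |         # here it averages over the binary outputs of each sample,
247 |         # while calculate averages over the samples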
248 |         loss = np.mean(loss, axis=-1)
249 |         return loss
250 | 
251 |     def backward(self, y_pred, y_true):
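252 |         # Number of samples
253 |         n_sample = len(y_pred)
254 |         # Number of binary outputs
255 |         n_output = len(y_pred[0])
256 |         # Pay special attention here; the book does not spell this out.
257 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
258 |         # (n_sample,) still broadcasts against (n_sample, 1), but a 1-D array cannot be transposed into a column,
259 |         # so the gradient below would end up with shape (n_sample, n_sample).
260 |         # With two binary outputs, y_pred has shape (n_sample, 2) and y_true has shape (n_sample, 2).
261 |         if len(y_true.shape) == 1:  # y_true is a 1-D array
262 |             y_true = y_true.reshape(-1, 1)
263 |         # Note: BinaryCrossentropy is always preceded by a Sigmoid activation,
264 |         # and Sigmoid easily produces outputs of exactly 0 or 1,
265 |         # so clip at 1e-7 on the left.
266 |         # Clipping only on the left would shift the confidences toward 1, even if only by a tiny amount,
267 |         # so to prevent that shift the right bound is 1 - 1e-7
268 |         y_pred_clip = np.clip(y_pred, 1e-7, 1 - 1e-7)
269 |         # Never write it as below: the unary minus in -y_true is applied first, and when y_true is uint8, -1 wraps around to 255.
270 |         # This bug took a long time to find, so take it seriously.
271 |         # self.dinput = (-y_true / y_pred_clip + (1 - y_true) / (1 - y_pred_clip)) / n_output
272 |         self.dinput = -(y_true / y_pred_clip - (1 - y_true) / (1 - y_pred_clip)) / n_output
273 |         # Divide each sample's gradient by n_sample, because the gradients are summed over the samples during optimization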
274 |         self.dinput = self.dinput / n_sample
275 | 
276 | 
277 | class Activation_Softmax_Loss_CategoricalCrossentropy():
278 |     def __init__(self):
279 |         self.activation = Activation_Softmax()
280 |         self.loss = Loss_CategoricalCrossentropy()
281 | 
282 |     # Note: Activation_Softmax_Loss_CategoricalCrossentropy computes the loss through forward,
283 |     # because it does not inherit from Loss
284 |     def forward(self, input, y_true):
285 |         self.activation.forward(input)
286 |         # The output attribute of this class is the Activation_Softmax output
287 |         self.output = self.activation.output
288 |         # The method returns the loss
289 |         return self.loss.calculate(self.output, y_true)
290 | 
291 |     # y_pred always equals self.output; the parameter is kept for consistency with the earlier code
292 |     def backward(self, y_pred, y_true):
293 |         # Number of samples
294 |         n_sample = len(y_true)
295 |         if len(y_true.shape) == 2:  # one-hot encoded labels
296 |             # Apply the formula directly
297 |             self.dinput = y_pred - y_true
298 |         elif len(y_true.shape) == 1:  # sparse labels (one class index per sample)
299 |             self.dinput = y_pred.copy()
300 |             # Subtract 1 at the y_true class index of each row; the other entries are left unchanged
301 |             self.dinput[range(n_sample), y_true] -= 1
302 |         # Divide each sample's gradient by n_sample, because the gradients are summed over the samples during optimization
303 |         self.dinput = self.dinput / n_sample
304 | 
305 | 
306 | class Activation_Sigmoid_Loss_BinaryCrossentropy():
307 |     def __init__(self):
308 |         self.activation = Activation_Sigmoid()
309 |         self.loss = Loss_BinaryCrossentropy()
310 | 
311 |     def forward(self, input, y_true):
312 |         self.activation.forward(input)
313 |         # The class's output is the Sigmoid output
314 |         self.output = self.activation.output
315 |         return self.loss.calculate(self.output, y_true)
316 | 
317 |     def backward(self, y_pred, y_true):
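318 |         # Number of samples
319 |         n_sample = len(y_pred)
320 |         # Pay special attention here; the book does not spell this out.
321 |         # With a single binary output, y_pred has shape (n_sample, 1) and y_true has shape (n_sample,).
322 |         # Both (n_sample,) and (n_sample, 1) can be broadcast, but (n_sample,) cannot be transposed,
323 |         # so the difference below would end up with shape (n_sample, n_sample).
324 |         # With two binary outputs, y_pred has shape (n_sample, 2) and y_true has shape (n_sample, 2).
325 |         if len(y_true.shape) == 1:  # y_true is a 1-D vector
326 |             y_true = y_true.reshape(-1, 1)
327 |         # Number of binary outputs
328 |         J = len(y_pred[0])
329 |         # Each row of y_true holds J binary values: 1 marks a positive example, 0 a negative one.
330 |         self.dinput = (y_pred - y_true) / J
331 | 
332 |         # The optimizer sums over all samples; divide by the sample count so the gradient does not depend on it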
333 |         self.dinput /= n_sample
334 | 
335 | class Loss_MeanSquaredError(Loss):
336 |     def __init__(self):
337 |         pass
338 | 
339 |     def forward(self, y_pred, y_true):
340 |         # Mean over the output dimension, giving one loss value per sample
341 |         loss = np.mean( (y_pred - y_true) ** 2, axis=-1 )
342 |         return loss
343 | 
344 |     def backward(self, y_pred, y_true):
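345 |         # Number of samples
346 |         n_sample = len(y_pred)
347 |         # Number of outputs
348 |         n_output = len(y_true[0])
349 |         self.dinput = 2 / n_output * (y_pred - y_true)
350 |         # Pay close attention here; the earlier explanations were wrong.
351 |         # The calculate method of the Loss class has data_loss = np.mean( self.forward(prediction, y) ),
352 |         # which averages over the samples, i.e. divides by the sample count, so the derivative carries the same division by n_sample.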
353 |         self.dinput /= n_sample
354 | 
355 | class Optimizer_SGD():
356 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
357 |     def __init__(self, learning_rate = 1.0, decay = 0., momentum=0):
358 |         self.learning_rate = learning_rate
359 |         self.decay = decay
360 |         self.current_learning_rate = learning_rate
361 |         self.iteration = 0
362 |         self.momentum = momentum
363 | 
364 |     def pre_update_param(self):
365 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
366 |         if self.decay:
367 |             self.current_learning_rate = self.learning_rate * \
368 |                                          (1 / (1 + self.decay * self.iteration))
369 | 
370 |     # Takes a layer object and performs the basic update
371 |     def update_param(self, layer):
372 | 
373 |         deta_weight = layer.dweight
374 |         deta_bias = layer.dbias
375 | 
376 |         # If momentum is used
377 |         if self.momentum:
378 |             # If no momentum has been accumulated yet
379 |             if not hasattr(layer, "dweight_cumulate"):
380 |                 # Note: the attributes are added to the layer object itself.
381 |                 # This makes sense: the history has to be stored on the object it belongs to
382 |                 layer.dweight_cumulate = np.zeros_like(layer.weight)
383 |                 layer.dbias_cumulate = np.zeros_like(layer.bias)
384 |             deta_weight = deta_weight + self.momentum * layer.dweight_cumulate  # new array; layer.dweight is not modified in place
385 |             layer.dweight_cumulate = deta_weight
386 |             deta_bias = deta_bias + self.momentum * layer.dbias_cumulate  # new array; layer.dbias is not modified in place
387 |             layer.dbias_cumulate = deta_bias
388 |         layer.weight -= self.current_learning_rate * deta_weight
389 |         # (64,) = (64,) + (1,64) >> (1,64)
390 |         # (64,) += (1,64) >> cannot be broadcast in place
391 |         # (1, 64) = (64,) + (1,64) >> (1,64)
392 |         # (1, 64) += (64,) >> (1,64)
393 |         # So in the dense layer the bias was changed:
394 |         # self.bias = np.zeros(n_neuron) => self.bias = np.zeros((1, n_neuron))
395 |         layer.bias -= self.current_learning_rate * deta_bias
396 | 
397 |     def post_update_param(self):
398 |         self.iteration += 1
399 | 
400 | class Optimizer_Adagrad():
401 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
402 |     def __init__(self, learning_rate = 1.0, decay = 0., epsilon = 1e-7):
403 |         self.learning_rate = learning_rate
404 |         self.decay = decay
405 |         self.current_learning_rate = learning_rate
406 |         self.iteration = 0
407 |         # Small constant that prevents division by zero
408 |         self.epsilon = epsilon
409 | 
410 | 
411 |     def pre_update_param(self):
412 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
413 |         if self.decay:
414 |             self.current_learning_rate = self.learning_rate * \
415 |                                          (1 / (1 + self.decay * self.iteration))
416 | 
417 |     # Takes a layer object as its argument
418 |     def update_param(self, layer):
419 |         if not hasattr(layer, 'dweight_square_sum'):
420 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
421 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
422 |         layer.dweight_square_sum = layer.dweight_square_sum + layer.dweight ** 2
423 |         layer.dbias_square_sum = layer.dbias_square_sum + layer.dbias ** 2
424 |         layer.weight += -self.current_learning_rate * layer.dweight / \
425 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
426 |         layer.bias += -self.current_learning_rate * layer.dbias / \
427 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
428 | 
429 |     def post_update_param(self):
430 |         self.iteration += 1
431 | 
432 | class Optimizer_RMSprop():
433 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
434 |     def __init__(self, learning_rate = 0.001, decay = 0., epsilon = 1e-7, beta = 0.9):
435 |         # Note: the default learning_rate here is 0.001, not 1
436 |         self.learning_rate = learning_rate
437 |         self.decay = decay
438 |         self.current_learning_rate = learning_rate
439 |         self.iteration = 0
440 |         # Small constant that prevents division by zero
441 |         self.epsilon = epsilon
442 |         self.beta = beta
443 | 
444 |     def pre_update_param(self):
445 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
446 |         if self.decay:
447 |             self.current_learning_rate = self.learning_rate * \
448 |                                          (1 / (1 + self.decay * self.iteration))
449 | 
450 |     # Takes a layer object as its argument
451 |     def update_param(self, layer):
452 |         if not hasattr(layer, 'dweight_square_sum'):
453 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
454 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
455 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
456 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
457 |         layer.weight += -self.current_learning_rate * layer.dweight / \
458 |                         ( np.sqrt(layer.dweight_square_sum) + self.epsilon )
459 |         layer.bias += -self.current_learning_rate * layer.dbias / \
460 |                         (np.sqrt(layer.dbias_square_sum) + self.epsilon)
461 | 
462 |     def post_update_param(self):
463 |         self.iteration += 1
464 | 
465 | class Optimizer_Adam():
466 |     # The initializer receives the hyperparameters, starting with the learning rate, and stores them as attributes
467 |     def __init__(self, learning_rate = 0.001, decay = 0., epsilon = 1e-7, momentum = 0.9,beta = 0.999):
468 |         # Note: the default learning_rate here is 0.001, not 1
469 |         self.learning_rate = learning_rate
470 |         self.decay = decay
471 |         self.current_learning_rate = learning_rate
472 |         self.iteration = 0
473 |         # Small constant that prevents division by zero
474 |         self.epsilon = epsilon
475 |         self.beta = beta
476 |         self.momentum = momentum
477 | 
478 |     def pre_update_param(self):
479 |         # Decay multiplies the step count by the decay rate and divides the initial learning rate by 1 plus that product.
480 |         if self.decay:
481 |             self.current_learning_rate = self.learning_rate * \
482 |                                          (1 / (1 + self.decay * self.iteration))
483 | 
484 |     # Takes a layer object as its argument
485 |     def update_param(self, layer):
486 |         if not hasattr(layer, 'dweight_square_sum') or not hasattr(layer, 'dweight_cumulate'):
487 |             layer.dweight_square_sum = np.zeros_like(layer.weight)
488 |             layer.dbias_square_sum = np.zeros_like(layer.bias)
489 |             layer.dweight_cumulate = np.zeros_like(layer.weight)
490 |             layer.dbias_cumulate = np.zeros_like(layer.bias)
491 |         # Momentum
492 |         layer.dweight_cumulate = self.momentum * layer.dweight_cumulate + (1 - self.momentum) * layer.dweight
493 |         layer.dbias_cumulate = self.momentum * layer.dbias_cumulate + (1 - self.momentum) * layer.dbias
494 |         # Bias-corrected momentum
495 |         layer.dweight_cumulate_modified = layer.dweight_cumulate / (1 - self.momentum ** (self.iteration + 1))
496 |         layer.dbias_cumulate_modified = layer.dbias_cumulate / (1 - self.momentum ** (self.iteration + 1))
497 |         # Squared-gradient cache (exponential moving average of the squared gradients)
498 |         layer.dweight_square_sum = self.beta * layer.dweight_square_sum + (1 - self.beta) * layer.dweight ** 2
499 |         layer.dbias_square_sum = self.beta * layer.dbias_square_sum + (1 - self.beta) * layer.dbias ** 2
500 |         # Bias-corrected squared-gradient cache
501 |         layer.dweight_square_sum_modified = layer.dweight_square_sum / (1 - self.beta ** (self.iteration + 1))
502 |         layer.dbias_square_sum_modified = layer.dbias_square_sum / (1 - self.beta ** (self.iteration + 1))
503 | 
504 |         layer.weight += -self.current_learning_rate * layer.dweight_cumulate_modified / \
505 |                         ( np.sqrt(layer.dweight_square_sum_modified) + self.epsilon )
506 |         layer.bias += -self.current_learning_rate * layer.dbias_cumulate_modified / \
507 |                         (np.sqrt(layer.dbias_square_sum_modified) + self.epsilon)
508 | 
509 |     def post_update_param(self):
510 |         self.iteration += 1
511 | 
512 | class Dropout():
513 |     def __init__(self, rate):
514 |         # rate is the probability of dropping a connection; self.rate stores the keep probability
515 |         self.rate = 1 - rate
516 | 
517 |     def forward(self, input):
518 |         self.input = input
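519 |         # Generate a 0/1 mask by sampling with the keep probability
520 |         # Since a 1 occurs only with probability self.rate, divide by self.rate to compensate for the dropped values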
521 |         self.mask = np.random.binomial(1, self.rate, size=self.input.shape) / self.rate
522 |         self.output = self.input * self.mask
523 | 
524 |     def backward(self,dvalue):
525 |         self.dinput = dvalue * self.mask
526 | 
527 | 
528 | 
529 | # Generate the data: 1000 points in total
530 | X, y = sine_data()
531 | X_test = X[::2]
532 | y_test = y[::2]
533 | X = X[1::2]
534 | y = y[1::2]
535 | 
536 | 
537 | # Three-layer network
538 | dense1 = Layer_Dense(1, 64)
539 | activation1 = Activation_ReLu()
540 | dense2 = Layer_Dense(64, 64)# ,weight_L2=1e-4, bias_L2=1e-4
541 | activation2 = Activation_ReLu()
542 | dense3 = Layer_Dense(64, 1)
543 | activation3 = Activation_Linear()
544 | loss_function = Loss_MeanSquaredError()
545 | # Optimizer
546 | optimizer = Optimizer_Adam(learning_rate = 0.001)
547 | 
548 | # Accuracy precision threshold
549 | accuracy_precision = np.std(y) / 250
550 | 
551 | for epoch in range(10001):
552 |     # Forward pass
553 |     dense1.forward(X)
554 |     activation1.forward(dense1.output)
555 |     dense2.forward(activation1.output)
556 |     activation2.forward(dense2.output)
557 |     dense3.forward(activation2.output)
558 |     activation3.forward(dense3.output)
559 |     data_loss = loss_function.calculate(activation3.output, y)
560 | 
561 |     regularization_loss = \
562 |         loss_function.regularization_loss(dense1) + \
563 |         loss_function.regularization_loss(dense2) + \
564 |         loss_function.regularization_loss(dense3)
565 | 
566 |     loss = data_loss + regularization_loss
567 | 
568 |     # Compute the accuracy
569 |     predictions = activation3.output
570 |     accuracy = np.mean(np.absolute(predictions - y) <
571 |                        accuracy_precision)
572 | 
573 |     if not epoch % 100:
574 |         print(f'epoch: {epoch}, ' +
575 |             f'acc: {accuracy:.3f}, ' +
576 |             f'loss: {loss:.3f} (' +
577 |             f'data_loss: {data_loss:.3f}, ' +
578 |             f'reg_loss: {regularization_loss:.3f}), ' +
579 |             f'lr: {optimizer.current_learning_rate}')
580 |     # Backward pass
581 |     loss_function.backward(activation3.output, y)
582 |     activation3.backward(loss_function.dinput)
583 |     dense3.backward(activation3.dinput)
584 |     activation2.backward(dense3.dinput)
585 |     dense2.backward(activation2.dinput)
586 |     activation1.backward(dense2.dinput)
587 |     dense1.backward(activation1.dinput)
588 | 
589 |     # Update the weights
590 |     optimizer.pre_update_param()
591 |     optimizer.update_param(dense1)
592 |     optimizer.update_param(dense2)
593 |     optimizer.update_param(dense3)
594 |     optimizer.post_update_param()
595 | 
596 | # Test set
597 | 
598 | 
599 | dense1.forward(X_test)
600 | activation1.forward(dense1.output)
601 | dense2.forward(activation1.output)
602 | activation2.forward(dense2.output)
603 | dense3.forward(activation2.output)
604 | activation3.forward(dense3.output)
605 | 
606 | plt.plot(X_test, y_test)
607 | plt.plot(X_test, activation3.output)
608 | plt.show()


--------------------------------------------------------------------------------
/9Regression/Regression.md:
--------------------------------------------------------------------------------
  1 | # Regression 
  2 | 
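  3 | ## 1. Content
  4 | 
  5 | This part implements a model that can solve regression problems.
  6 | 
  7 | ## 2. Code
  8 | 
  9 | ### 1. Linear Activation
 10 | 
 11 | The linear activation function does not modify its input; it passes it straight to the output: $y=x$. For the backward pass, we already know that the derivative of $f(x)=x$ is 1. It is included only for completeness and clarity, so that the output layer's activation function is visible in the model definition code. In terms of computation it adds almost no processing time, at least not enough to noticeably affect training time.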
 12 | 
 13 | #### **Implementation**
 14 | 
 15 | ```python
 16 | class Activation_Linear:
 17 |     def __init__(self):
 18 |         pass
 19 | 
 20 |     def forward(self, input):
 21 |         self.input = input
 22 |         self.output = self.input
 23 | 
 24 |     def backward(self, dvalue):
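 25 |         # Note: do not write self.dinput = dvalue.
 26 |         # That would make dinput and dvalue refer to the same object, so any change to dinput would also affect the original dvalue object.
 27 |         # Applying an operation to dvalue, such as multiplying by 1, creates a new array and has the same effect as the code below.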
 28 |         self.dinput = dvalue.copy()
 29 | ```
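
The aliasing issue described in the comments above can be seen in isolation. The following is a minimal sketch using plain NumPy arrays; the variable names (`alias`, `copied`, `scaled`) are illustrative assumptions and not part of the project code.

```python
import numpy as np

dvalue = np.array([[1.0, 2.0, 3.0]])

# Plain assignment: both names refer to the same array object.
alias = dvalue
alias *= 10.0                      # the in-place change is visible through dvalue too
aliased_first = dvalue[0, 0]       # 10.0, the "upstream" gradient was overwritten

dvalue = np.array([[1.0, 2.0, 3.0]])

# copy() allocates an independent array, so in-place changes stay local.
copied = dvalue.copy()
copied *= 10.0

# Arithmetic such as "* 1" also allocates a new array.
scaled = dvalue * 1

shares_copy = np.shares_memory(dvalue, copied)
shares_scaled = np.shares_memory(dvalue, scaled)
print(aliased_first, dvalue[0, 0], shares_copy, shares_scaled)
```

In this sketch only the plain assignment lets a later in-place operation leak back into the gradient that was passed in, which is why `backward` stores a copy.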
 30 | 
 31 | ### 2. Mean Squared Error Loss
 32 | 
 33 | #### **Formulas**
 34 | 
 35 | ![image-20230811182916692](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111829742.png)
 36 | 
 37 | ![image-20230811182944117](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308111829163.png)
 38 | 
 39 | > The formulas are straightforward, so no further explanation is given.
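
As a sanity check, the gradient can be compared against finite differences. This sketch assumes the formulas in the images match the `Loss_MeanSquaredError` code in NNFS_version9.py, i.e. for sample $i$ with $J$ outputs, $L_i = \frac{1}{J}\sum_{j}(y_{i,j}-\hat{y}_{i,j})^2$ and $\frac{\partial L_i}{\partial \hat{y}_{i,j}} = -\frac{2}{J}(y_{i,j}-\hat{y}_{i,j})$, with an extra division by the number of samples because the reported loss is the mean over samples. The function names `mse_forward` and `mse_backward` are hypothetical stand-ins for the class methods.

```python
import numpy as np


def mse_forward(y_pred, y_true):
    # One loss value per sample: mean over the output dimension.
    return np.mean((y_pred - y_true) ** 2, axis=-1)


def mse_backward(y_pred, y_true):
    # Gradient of the sample-averaged loss with respect to y_pred.
    n_sample, n_output = y_pred.shape
    return 2.0 / n_output * (y_pred - y_true) / n_sample


rng = np.random.default_rng(0)
y_pred = rng.normal(size=(4, 3))
y_true = rng.normal(size=(4, 3))

analytic = mse_backward(y_pred, y_true)

# Central finite differences of the mean loss, one entry at a time.
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):
    for j in range(y_pred.shape[1]):
        plus = y_pred.copy()
        minus = y_pred.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        numeric[i, j] = (np.mean(mse_forward(plus, y_true))
                         - np.mean(mse_forward(minus, y_true))) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

If the division by the sample count were left out of `mse_backward`, the two results would differ by a factor of the number of samples.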
 40 | 
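 41 | ### 3. Measuring Accuracy in Regression Problems
 42 | 
 43 | With cross-entropy, we can count the number of matches (cases where the prediction equals the true target) and divide by the number of samples to measure the model's accuracy. In a regression model the prediction is a floating-point value, so we cannot simply check whether the output equals the true value: it almost never will, and even a tiny difference would make the accuracy 0. There is no perfect way to express accuracy for regression, but it is still useful to have some accuracy metric. For example, Keras, a popular deep learning framework, reports both accuracy and loss for regression models, and we will build our own accuracy metric as well. **Compute the standard deviation of the true target values and divide it by 250. This value can vary depending on the goal. The larger the divisor, the "stricter" the accuracy metric. 250 is the value chosen here.** A prediction then counts as correct when its absolute difference from the target is below this threshold.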
 44 | 
 45 | ~~~py
 46 | accuracy_precision = np.std(y) / 250
 47 | predictions = activation3.output
 48 | accuracy = np.mean(np.absolute(predictions - y) < accuracy_precision) 
 49 | ~~~
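
To make the metric concrete, here is a small worked example with made-up targets and predictions (the numbers are illustrative assumptions, not results from the model).

```python
import numpy as np

# Hypothetical targets and predictions, shaped (n_sample, 1) like sine_data output.
y = np.array([[0.0], [1.0], [2.0], [3.0]])
predictions = np.array([[0.001], [1.2], [1.999], [3.0]])

# Tolerance derived from the spread of the targets: std(y) / 250.
accuracy_precision = np.std(y) / 250

# A prediction is "correct" when its absolute error is below the tolerance.
accuracy = np.mean(np.absolute(predictions - y) < accuracy_precision)
print(accuracy_precision, accuracy)
```

Here `np.std(y)` is about 1.118, so the tolerance is about 0.0045. Three of the four predictions fall within it, which gives an accuracy of 0.75.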
 50 | 
 51 | #### **Example**
 52 | 
 53 | ```python
 54 | # Generate the data: 1000 points in total
 55 | X, y = sine_data()
 56 | X_test = X[500:]
 57 | y_test = y[500:]
 58 | X = X[0:500]
 59 | y = y[0:500]
 60 | 
 61 | # Three-layer network
 62 | dense1 = Layer_Dense(1, 64)
 63 | activation1 = Activation_ReLu()
 64 | dense2 = Layer_Dense(64, 64)# ,weight_L2=1e-4, bias_L2=1e-4
 65 | activation2 = Activation_ReLu()
 66 | dense3 = Layer_Dense(64, 1)
 67 | activation3 = Activation_Linear()
 68 | loss_function = Loss_MeanSquaredError()
 69 | # Optimizer
 70 | optimizer = Optimizer_Adam(learning_rate=0.01, decay=1e-3)
 71 | 
 72 | # Accuracy precision threshold
 73 | accuracy_precision = np.std(y) / 250
 74 | 
 75 | for epoch in range(10001):
 76 |     # Forward pass
 77 |     dense1.forward(X)
 78 |     activation1.forward(dense1.output)
 79 |     dense2.forward(activation1.output)
 80 |     activation2.forward(dense2.output)
 81 |     dense3.forward(activation2.output)
 82 |     activation3.forward(dense3.output)
 83 |     data_loss = loss_function.calculate(activation3.output, y)
 84 | 
 85 |     regularization_loss = \
 86 |         loss_function.regularization_loss(dense1) + \
 87 |         loss_function.regularization_loss(dense2) + \
 88 |         loss_function.regularization_loss(dense3)
 89 | 
 90 |     loss = data_loss + regularization_loss
 91 | 
 92 |     # Compute the accuracy
 93 |     predictions = activation3.output
 94 |     accuracy = np.mean(np.absolute(predictions - y) <
 95 |                        accuracy_precision)
 96 | 
 97 |     if not epoch % 100:
 98 |         print(f'epoch: {epoch}, ' +
 99 |             f'acc: {accuracy:.3f}, ' +
100 |             f'loss: {loss:.3f} (' +
101 |             f'data_loss: {data_loss:.3f}, ' +
102 |             f'reg_loss: {regularization_loss:.3f}), ' +
103 |             f'lr: {optimizer.current_learning_rate}')
104 |     # Backward pass
105 |     loss_function.backward(activation3.output, y)
106 |     activation3.backward(loss_function.dinput)
107 |     dense3.backward(activation3.dinput)
108 |     activation2.backward(dense3.dinput)
109 |     dense2.backward(activation2.dinput)
110 |     activation1.backward(dense2.dinput)
111 |     dense1.backward(activation1.dinput)
112 | 
113 |     # Update the weights
114 |     optimizer.pre_update_param()
115 |     optimizer.update_param(dense1)
116 |     optimizer.update_param(dense2)
117 |     optimizer.update_param(dense3)
118 |     optimizer.post_update_param()
119 | 
120 | # Test set (note: this regenerates the full dataset, replacing the X_test/y_test split made above)
121 | X_test, y_test = sine_data()
122 | 
123 | dense1.forward(X_test)
124 | activation1.forward(dense1.output)
125 | dense2.forward(activation1.output)
126 | activation2.forward(dense2.output)
127 | dense3.forward(activation2.output)
128 | activation3.forward(dense3.output)
129 | 
130 | plt.plot(X_test, y_test)
131 | plt.plot(X_test, activation3.output)
132 | plt.show()
133 | ```
134 | 
135 | **Parameter set 1**
136 | 
137 | ~~~py
138 | optimizer = Optimizer_Adam(learning_rate=0.01, decay=1e-3)
139 | ~~~
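
For reference, the `decay` argument controls how quickly the learning rate shrinks. The sketch below reproduces the schedule used in `pre_update_param` of the optimizer classes in NNFS_version9.py; the standalone function name `decayed_learning_rate` is a hypothetical helper introduced only for illustration.

```python
def decayed_learning_rate(learning_rate, decay, iteration):
    # Same formula as pre_update_param: lr = lr0 * 1 / (1 + decay * iteration)
    return learning_rate * (1.0 / (1.0 + decay * iteration))


# Learning rate at a few points of a 10001-epoch run with the settings above.
for iteration in (0, 1000, 10000):
    print(iteration, decayed_learning_rate(0.01, 1e-3, iteration))
```

With `learning_rate=0.01` and `decay=1e-3`, the rate is halved to 0.005 after 1000 steps and falls to roughly 0.0009 by the final epoch.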
140 | 
141 | 
142 | 
143 | ![image-20230812180329076](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121803164.png)
144 | 
145 | ![image-20230812180354505](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121803556.png)
146 | 
147 | > The orange line is the prediction and the blue line is the ground truth; the result matches the book.
148 | 
149 | **Parameter set 2**
150 | 
151 | ```py
152 | optimizer = Optimizer_Adam(learning_rate = 0.001)
153 | ```
154 | 
155 | ![image-20230812181150450](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121811490.png)
156 | 
157 | ![image-20230812181259817](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121812866.png)
158 | 
159 | > This also matches the result in the book.
160 | 
161 | 
162 | 
163 | ```python
164 | optimizer = Optimizer_Adam(learning_rate=0.005, decay=1e-3)
165 | ```
166 | 
167 | ![image-20230812180758284](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121807322.png)
168 | 
169 | ![image-20230812180811512](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121808560.png)
170 | 
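171 | > This result also matches the book. Of the three values tried, learning_rate=0.005 works best: neither 0.01 nor 0.001 trains well, while 0.005, which lies between them, does. In other words, only a narrow range of parameter values works.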
172 | 
173 | ```python
174 | # Initialize the weights from a normal distribution
175 | self.weight = 0.01 * np.random.randn(n_input, n_neuron)
176 | ```
177 | 
178 | **The code above is the weight initialization in Layer_Dense.**
179 | 
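180 | You can refer to the initialization used in Keras, which uses the Glorot uniform initializer, also known as the Xavier uniform initializer. It draws samples from a uniform distribution over $[-limit, limit]$, where $limit$ is $\sqrt{6 / (n_{input} + n_{output})}$, $n_{input}$ is the number of input units in the weight tensor, and $n_{output}$ is the number of output units in the weight tensor. In short, this method sets the initialization range from the numbers of input and output units of the weight tensor, which gives the network's weights a better starting scale.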
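
The Glorot rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under the formula stated in the text, not the Keras implementation; the function name `glorot_uniform` and the 64-by-64 layer size are assumptions chosen to match the hidden layer of the example.

```python
import numpy as np


def glorot_uniform(n_input, n_output, rng):
    # limit = sqrt(6 / (n_input + n_output)), samples drawn uniformly from [-limit, limit]
    limit = np.sqrt(6.0 / (n_input + n_output))
    return rng.uniform(-limit, limit, size=(n_input, n_output))


rng = np.random.default_rng(0)
weight = glorot_uniform(64, 64, rng)

limit = np.sqrt(6.0 / (64 + 64))
# A uniform distribution on [-limit, limit] has standard deviation limit / sqrt(3).
theoretical_std = limit / np.sqrt(3.0)
print(limit, theoretical_std, weight.std())
```

For a 64-by-64 layer the limit is about 0.217 and the theoretical standard deviation is 0.125, which is more than ten times the 0.01 scale used in the `Layer_Dense` initialization shown above.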
181 | 
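182 | In fact, we ran into a very similar problem here: changing how the weights are initialized takes the model from not learning at all to learning. Rather than adopting Glorot uniform initialization, we only tweak the weight initialization code in Layer_Dense: the factor that multiplies the samples drawn from the normal distribution is changed to 0.1.
183 | 
184 | **Parameter set 3**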
185 | 
186 | ~~~py
187 | self.weight = 0.1 * np.random.randn(n_input, n_neuron)
188 | ~~~
189 | 
190 | ~~~py
191 | optimizer = Optimizer_Adam(learning_rate = 0.001)
192 | ~~~
193 | 
194 | ![image-20230812185857352](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121858409.png)
195 | 
196 | ![image-20230812185922887](https://raw.githubusercontent.com/HX-1234/NoteImage/main/202308121859937.png)
197 | 
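198 | > After changing the factor that multiplies the normally distributed samples in the Dense layer's weight initialization to 0.1, training with the same optimizer parameters as before works well.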
199 | 


--------------------------------------------------------------------------------
/9Regression/Regression.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HX-1234/Neural-Networks-from-Scratch-in-Python/5026e7dc7442b0993dd21bf1db036c903122f133/9Regression/Regression.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
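 1 | # Summary of *Neural Networks from Scratch in Python*
 2 | 
 3 | ## 1. Overview
 4 | 
 5 | >This project is a summary written after reading *Neural Networks from Scratch in Python*. It uses Python (NumPy) to implement a fully connected neural network from scratch, provides all of the runnable code, and annotates every piece of code with comments (my own understanding). The project covers: fully connected layers, activation functions, loss functions, gradients, backpropagation, optimizers, regularization, dropout, dataset handling, model evaluation, saving and loading parameters, and prediction/inference.
 6 | 
 7 | ## 2. Environment
 8 | 
 9 | >1. Python 3.8
10 | >2. NumPy
11 | >3. matplotlib
12 | >4. nnfs (the package provided by Neural Networks from Scratch in Python, used to generate training and test data)
13 | 
14 | ## 3. Documents
15 | 
16 | The project follows the chapter order of Neural Networks from Scratch in Python, but it does not produce one document per chapter; depending on the complexity of the material, several chapters may be combined into a single document. Each document consists of the following:
17 | 
18 | >1. A .md file containing the main code with detailed comments, along with my own understanding, the related formula derivations, run results, flowcharts, and so on. **Note: if the LaTeX formulas do not render correctly, install the MathJax 3 Plugin for Github extension from the Chrome Web Store.**
19 | >2. A .pdf file with the same content as the .md file; it is the PDF version of the .md file.
20 | >3. A .py file containing the complete code, which can be run directly.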
21 | 


--------------------------------------------------------------------------------