├── README.md
├── figures
│   ├── bp.png
│   ├── computation-graph.png
│   ├── nn-from-scratch-dataset.png
│   └── nn-from-scratch-h3.png
├── gate.py
├── layer.py
├── main.py
├── mlnn.py
├── output.py
└── utils.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Implementing a Multiple Layer Neural Network from Scratch

This post is inspired by .

In this post, we will implement a multiple layer neural network from scratch. You can treat the number of layers and the dimension of each layer as parameters. For example, `[2, 3, 2]` represents a 2-dimensional input, one hidden layer with 3 units, and a 2-dimensional output (binary classification, using softmax as the output).

We won't derive all the math that's required, but I will try to give an intuitive explanation of what we are doing. I will also point to resources for you to read up on the details.

## Generating a dataset
Let's start by generating a dataset we can play with. Fortunately, [scikit-learn](http://scikit-learn.org/) has some useful dataset generators, so we don't need to write the code ourselves. We will go with the [make_moons](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html) function.

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
```
![](https://github.com/pangolulu/neural-network-from-scratch/raw/master/figures/nn-from-scratch-dataset.png)

The dataset we generated has two classes, plotted as red and blue points. Our goal is to train a machine learning classifier that predicts the correct class given the x- and y-coordinates. Note that the data is not linearly separable: we can't draw a straight line that separates the two classes. This means that linear classifiers, such as logistic regression, won't be able to fit the data unless you hand-engineer non-linear features (such as polynomials) that work well for the given dataset.

In fact, that's one of the major advantages of neural networks. You don't need to worry about feature engineering; the hidden layers of a neural network will learn features for you.

## Neural Network
### Neural Network Architecture
You can read this tutorial () to learn the basic concepts of neural networks, such as activation functions and feed-forward computation.

Because we want our network to output probabilities, the activation function for the output layer will be the [softmax](https://en.wikipedia.org/wiki/Softmax_function), which is simply a way to convert raw scores to probabilities. If you're familiar with the logistic function, you can think of softmax as its generalization to multiple classes.

When you choose softmax as the output, you can use the [cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression) (also known as negative log likelihood) as the loss function. More about loss functions can be found in .

### Learning the Parameters
Learning the parameters for our network means finding parameters (such as `W1`, `b1`, `W2`, `b2`) that minimize the error on our training data, as measured by the loss function.

We can use [gradient descent](http://cs231n.github.io/optimization-1/) to find the minimum, and I will implement the most vanilla version of gradient descent, also called batch gradient descent, with a fixed learning rate. Variations such as SGD (stochastic gradient descent) or minibatch gradient descent typically perform better in practice, so if you are serious you'll want to use one of these, and ideally you would also [decay the learning rate over time](http://cs231n.github.io/neural-networks-3/#anneal).
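In its plainest form, one batch gradient descent update just moves every parameter a small step against its gradient. Here is a minimal sketch of such a step; the names `params`, `grads` and `epsilon` are illustrative, not objects from this repository:

```python
# One vanilla (batch) gradient descent step: theta <- theta - epsilon * dtheta.
# `params` and `grads` are illustrative lists of numpy arrays, not repo objects.
def gradient_descent_step(params, grads, epsilon=0.01):
    return [p - epsilon * g for p, g in zip(params, grads)]
```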
The key to gradient descent is calculating the gradient of the loss function with respect to the parameters. One approach to doing this is [Back Propagation](https://en.wikipedia.org/wiki/Backpropagation). You can learn more about it from and .

### Implementation
We start from the computation graph of the neural network.
![](https://github.com/pangolulu/neural-network-from-scratch/raw/master/figures/computation-graph.png)

The computation graph contains three kinds of components (`gate`, `layer` and `output`): there are two kinds of gates (`multiply` and `add`), and we use a `tanh` layer and a `softmax` output.

`gate`, `layer` and `output` can all be seen as operation units of the computation graph, so each of them implements the local derivatives with respect to its inputs (we call this `backward`), and the chain rule is then applied along the computation graph. See the following figure for a nice explanation.
![](https://github.com/pangolulu/neural-network-from-scratch/raw/master/figures/bp.png)

**`gate.py`**
```python
import numpy as np

class MultiplyGate:
    def forward(self, W, X):
        # Z = X * W
        return np.dot(X, W)

    def backward(self, W, X, dZ):
        dW = np.dot(np.transpose(X), dZ)
        dX = np.dot(dZ, np.transpose(W))
        return dW, dX

class AddGate:
    def forward(self, X, b):
        # b is broadcast across the rows of X
        return X + b

    def backward(self, X, b, dZ):
        dX = dZ * np.ones_like(X)
        # sum the upstream gradient over the batch dimension for the bias
        db = np.dot(np.ones((1, dZ.shape[0]), dtype=np.float64), dZ)
        return db, dX
```

**`layer.py`**
```python
import numpy as np

class Sigmoid:
    def forward(self, X):
        return 1.0 / (1.0 + np.exp(-X))

    def backward(self, X, top_diff):
        output = self.forward(X)
        return (1.0 - output) * output * top_diff

class Tanh:
    def forward(self, X):
        return np.tanh(X)

    def backward(self, X, top_diff):
        output = self.forward(X)
        return (1.0 - np.square(output)) * top_diff
```

**`output.py`**
```python
import numpy as np

class Softmax:
    def predict(self, X):
        exp_scores = np.exp(X)
        return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    def loss(self, X, y):
        num_examples = X.shape[0]
        probs = self.predict(X)
        correct_logprobs = -np.log(probs[range(num_examples), y])
        data_loss = np.sum(correct_logprobs)
        return 1./num_examples * data_loss

    def diff(self, X, y):
        num_examples = X.shape[0]
        probs = self.predict(X)
        # gradient of the cross-entropy loss with respect to the raw scores
        probs[range(num_examples), y] -= 1
        return probs
```
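One practical note on `Softmax.predict`: `np.exp(X)` can overflow when the raw scores get large. A common and mathematically equivalent variant, not used in the repository code, subtracts the row-wise maximum before exponentiating:

```python
import numpy as np

# Numerically stable softmax: subtracting the row max leaves the result
# unchanged but keeps np.exp from overflowing on large scores.
# This is an optional refinement, not part of the repository's Softmax class.
def stable_softmax(X):
    shifted = X - np.max(X, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
```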
We can implement our neural network as a class `Model` and initialize the parameters in the `__init__` function. You can pass the parameter `layers_dim = [2, 3, 2]`, which represents a 2-dimensional input, one hidden layer with 3 units and a 2-dimensional output.
```python
class Model:
    def __init__(self, layers_dim):
        self.b = []
        self.W = []
        for i in range(len(layers_dim)-1):
            self.W.append(np.random.randn(layers_dim[i], layers_dim[i+1]) / np.sqrt(layers_dim[i]))
            self.b.append(np.random.randn(layers_dim[i+1]).reshape(1, layers_dim[i+1]))
```

First, let's implement the loss function we defined above. It is just a forward-propagation pass through our neural network. We use this to evaluate how well our model is doing:
```python
def calculate_loss(self, X, y):
    mulGate = MultiplyGate()
    addGate = AddGate()
    layer = Tanh()
    softmaxOutput = Softmax()

    input = X
    for i in range(len(self.W)):
        mul = mulGate.forward(self.W[i], input)
        add = addGate.forward(mul, self.b[i])
        input = layer.forward(add)

    return softmaxOutput.loss(input, y)
```
We also implement a helper function to calculate the output of the network. It does forward propagation as defined above and returns the class with the highest probability.
```python
def predict(self, X):
    mulGate = MultiplyGate()
    addGate = AddGate()
    layer = Tanh()
    softmaxOutput = Softmax()

    input = X
    for i in range(len(self.W)):
        mul = mulGate.forward(self.W[i], input)
        add = addGate.forward(mul, self.b[i])
        input = layer.forward(add)

    probs = softmaxOutput.predict(input)
    return np.argmax(probs, axis=1)
```
Finally, here comes the function that trains our neural network. It implements batch gradient descent using the backpropagation algorithm described above.

```python
def train(self, X, y, num_passes=20000, epsilon=0.01, reg_lambda=0.01, print_loss=False):
    mulGate = MultiplyGate()
    addGate = AddGate()
    layer = Tanh()
    softmaxOutput = Softmax()

    for epoch in range(num_passes):
        # Forward propagation
        input = X
        forward = [(None, None, input)]
        for i in range(len(self.W)):
            mul = mulGate.forward(self.W[i], input)
            add = addGate.forward(mul, self.b[i])
            input = layer.forward(add)
            forward.append((mul, add, input))

        # Back propagation
        dtanh = softmaxOutput.diff(forward[len(forward)-1][2], y)
        for i in range(len(forward)-1, 0, -1):
            dadd = layer.backward(forward[i][1], dtanh)
            db, dmul = addGate.backward(forward[i][0], self.b[i-1], dadd)
            dW, dtanh = mulGate.backward(self.W[i-1], forward[i-1][2], dmul)
            # Add regularization terms (the bias terms are not regularized)
            dW += reg_lambda * self.W[i-1]
            # Gradient descent parameter update
            self.b[i-1] += -epsilon * db
            self.W[i-1] += -epsilon * dW

        if print_loss and epoch % 1000 == 0:
            print("Loss after iteration %i: %f" % (epoch, self.calculate_loss(X, y)))
```
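One small caveat: `calculate_loss` reports only the data term, while `train` adds `reg_lambda * self.W[i-1]` to the weight gradients. If you want the printed loss to track the regularized objective as well, a helper along these lines (not part of the original code) would do it:

```python
import numpy as np

# Optional helper, not in the original code: report the loss including the
# L2 penalty that corresponds to the `reg_lambda * W` term added to dW in train().
def regularized_loss(model, X, y, reg_lambda=0.01):
    data_loss = model.calculate_loss(X, y)
    reg_loss = 0.5 * reg_lambda * sum(np.sum(np.square(W)) for W in model.W)
    return data_loss + reg_loss
```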
### A network with a hidden layer of size 3
Let's see what happens if we train a network with a hidden layer of size 3.

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model
import mlnn
from utils import plot_decision_boundary

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
plt.show()

layers_dim = [2, 3, 2]

model = mlnn.Model(layers_dim)
model.train(X, y, num_passes=20000, epsilon=0.01, reg_lambda=0.01, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: model.predict(x), X, y)
plt.title("Decision Boundary for hidden layer size 3")
plt.show()
```
![](https://github.com/pangolulu/neural-network-from-scratch/raw/master/figures/nn-from-scratch-h3.png)

This looks pretty good. Our neural network was able to find a decision boundary that successfully separates the classes.

The `plot_decision_boundary` function is adapted from .
```python
import matplotlib.pyplot as plt
import numpy as np

# Helper function to plot a decision boundary.
def plot_decision_boundary(pred_func, X, y):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
```
## Going further
1. Instead of batch gradient descent, use minibatch gradient descent to train the network. Minibatch gradient descent typically performs better in practice ([more info](http://cs231n.github.io/optimization-1/#gd)).
2. We used a fixed learning rate `epsilon` for gradient descent. Implement an annealing schedule for the gradient descent learning rate ([more info](http://cs231n.github.io/neural-networks-3/#anneal)).
3. We used a `tanh` activation function for our hidden layer. Experiment with other activation functions ([more info](http://cs231n.github.io/neural-networks-1/#actfun)).
4. Extend the network from two to three classes. You will need to generate an appropriate dataset for this (see the sketch after this list).
5. Try some other parameter update methods, like `Momentum update`, `Nesterov momentum`, `Adagrad`, `RMSprop` and `Adam` ([more info](http://cs231n.github.io/neural-networks-3/#update)).
6. Some other tricks for training neural networks can be found in and , like `dropout regularization`, `batch normalization`, `gradient checks` and `model ensembles`.
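As a concrete starting point for item 4, nothing in `Model` or `Softmax` is specific to two classes, so a sketch like the following should work. The dataset choice and hyperparameters are illustrative, not taken from the repository:

```python
# Possible starting point for a three-class experiment: make_blobs data and a
# [2, 4, 3] network. Dataset and hyperparameters here are illustrative only.
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
import mlnn
from utils import plot_decision_boundary

np.random.seed(0)
X, y = sklearn.datasets.make_blobs(n_samples=300, centers=3, cluster_std=1.2)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # keep inputs small so tanh does not saturate

model = mlnn.Model([2, 4, 3])              # 2-D inputs, 4 hidden units, 3 output classes
model.train(X, y, num_passes=20000, epsilon=0.01, reg_lambda=0.01, print_loss=True)

plot_decision_boundary(lambda x: model.predict(x), X, y)
plt.title("Decision boundary for 3 classes")
plt.show()
```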
## Some useful resources
1. 
2. 
3. 

--------------------------------------------------------------------------------
/figures/bp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gy910210/neural-network-from-scratch/a2c7dc6e76d458e52e579496e545e3c658d430de/figures/bp.png

--------------------------------------------------------------------------------
/figures/computation-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gy910210/neural-network-from-scratch/a2c7dc6e76d458e52e579496e545e3c658d430de/figures/computation-graph.png

--------------------------------------------------------------------------------
/figures/nn-from-scratch-dataset.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gy910210/neural-network-from-scratch/a2c7dc6e76d458e52e579496e545e3c658d430de/figures/nn-from-scratch-dataset.png

--------------------------------------------------------------------------------
/figures/nn-from-scratch-h3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gy910210/neural-network-from-scratch/a2c7dc6e76d458e52e579496e545e3c658d430de/figures/nn-from-scratch-h3.png

--------------------------------------------------------------------------------
/gate.py:
--------------------------------------------------------------------------------
import numpy as np

class MultiplyGate:
    def forward(self, W, X):
        return np.dot(X, W)

    def backward(self, W, X, dZ):
        dW = np.dot(np.transpose(X), dZ)
        dX = np.dot(dZ, np.transpose(W))
        return dW, dX

class AddGate:
    def forward(self, X, b):
        return X + b

    def backward(self, X, b, dZ):
        dX = dZ * np.ones_like(X)
        db = np.dot(np.ones((1, dZ.shape[0]), dtype=np.float64), dZ)
        return db, dX

--------------------------------------------------------------------------------
/layer.py:
--------------------------------------------------------------------------------
import numpy as np

class Sigmoid:
    def forward(self, X):
        return 1.0 / (1.0 + np.exp(-X))

    def backward(self, X, top_diff):
        output = self.forward(X)
        return (1.0 - output) * output * top_diff

class Tanh:
    def forward(self, X):
        return np.tanh(X)

    def backward(self, X, top_diff):
        output = self.forward(X)
        return (1.0 - np.square(output)) * top_diff

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model
import mlnn
from utils import plot_decision_boundary

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
plt.show()

layers_dim = [2, 3, 2]

model = mlnn.Model(layers_dim)
model.train(X, y, num_passes=20000, epsilon=0.01, reg_lambda=0.01, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: model.predict(x), X, y)
plt.title("Decision Boundary for hidden layer size 3")
plt.show()

--------------------------------------------------------------------------------
/mlnn.py:
--------------------------------------------------------------------------------
import numpy as np
from gate import MultiplyGate, AddGate
from output import Softmax
from layer import Tanh


class Model:
    def __init__(self, layers_dim):
        self.b = []
        self.W = []
        for i in range(len(layers_dim)-1):
            self.W.append(np.random.randn(layers_dim[i], layers_dim[i+1]) / np.sqrt(layers_dim[i]))
            self.b.append(np.random.randn(layers_dim[i+1]).reshape(1, layers_dim[i+1]))

    def calculate_loss(self, X, y):
        mulGate = MultiplyGate()
        addGate = AddGate()
        layer = Tanh()
        softmaxOutput = Softmax()

        input = X
        for i in range(len(self.W)):
            mul = mulGate.forward(self.W[i], input)
            add = addGate.forward(mul, self.b[i])
            input = layer.forward(add)

        return softmaxOutput.loss(input, y)

    def predict(self, X):
        mulGate = MultiplyGate()
        addGate = AddGate()
        layer = Tanh()
        softmaxOutput = Softmax()

        input = X
        for i in range(len(self.W)):
            mul = mulGate.forward(self.W[i], input)
            add = addGate.forward(mul, self.b[i])
            input = layer.forward(add)

        probs = softmaxOutput.predict(input)
        return np.argmax(probs, axis=1)

    def train(self, X, y, num_passes=20000, epsilon=0.01, reg_lambda=0.01, print_loss=False):
        mulGate = MultiplyGate()
        addGate = AddGate()
        layer = Tanh()
        softmaxOutput = Softmax()

        for epoch in range(num_passes):
            # Forward propagation
            input = X
            forward = [(None, None, input)]
            for i in range(len(self.W)):
                mul = mulGate.forward(self.W[i], input)
                add = addGate.forward(mul, self.b[i])
                input = layer.forward(add)
                forward.append((mul, add, input))

            # Back propagation
            dtanh = softmaxOutput.diff(forward[len(forward)-1][2], y)
            for i in range(len(forward)-1, 0, -1):
                dadd = layer.backward(forward[i][1], dtanh)
                db, dmul = addGate.backward(forward[i][0], self.b[i-1], dadd)
                dW, dtanh = mulGate.backward(self.W[i-1], forward[i-1][2], dmul)
                # Add regularization terms (the bias terms are not regularized)
                dW += reg_lambda * self.W[i-1]
                # Gradient descent parameter update
                self.b[i-1] += -epsilon * db
                self.W[i-1] += -epsilon * dW

            if print_loss and epoch % 1000 == 0:
                print("Loss after iteration %i: %f" % (epoch, self.calculate_loss(X, y)))

--------------------------------------------------------------------------------
/output.py:
--------------------------------------------------------------------------------
import numpy as np

class Softmax:
    def predict(self, X):
        exp_scores = np.exp(X)
        return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    def loss(self, X, y):
        num_examples = X.shape[0]
        probs = self.predict(X)
        correct_logprobs = -np.log(probs[range(num_examples), y])
        data_loss = np.sum(correct_logprobs)
        return 1./num_examples * data_loss

    def diff(self, X, y):
        num_examples = X.shape[0]
        probs = self.predict(X)
        probs[range(num_examples), y] -= 1
        return probs

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import numpy as np

# Helper function to plot a decision boundary.
def plot_decision_boundary(pred_func, X, y):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)

--------------------------------------------------------------------------------
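A quick way to gain confidence in the `backward` implementations above, and a warm-up for the gradient checks mentioned in the README, is to compare one analytic gradient entry against a centered finite difference. This snippet is a sanity-check sketch, not part of the repository:

```python
# Numerical sanity check for MultiplyGate.backward (not part of the repository).
# It compares one analytic gradient entry against a centered finite difference
# of the scalar f(W) = sum(forward(W, X) * dZ), whose gradient w.r.t. W is dW.
import numpy as np
from gate import MultiplyGate

np.random.seed(1)
X = np.random.randn(4, 3)      # batch of 4 examples, 3 features
W = np.random.randn(3, 2)      # weights mapping 3 features to 2 units
dZ = np.random.randn(4, 2)     # pretend upstream gradient
h = 1e-5

gate = MultiplyGate()
dW, dX = gate.backward(W, X, dZ)

i, j = 0, 1                    # check a single weight entry
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, j] += h
W_minus[i, j] -= h
numerical = (np.sum(gate.forward(W_plus, X) * dZ) -
             np.sum(gate.forward(W_minus, X) * dZ)) / (2 * h)

print("analytic: %.6f  numerical: %.6f" % (dW[i, j], numerical))
```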