├── FAQ.html
├── FAQ.md
├── README.md
├── cnn_numpy.py
├── cnn_numpy_sg.py
├── images
│   ├── stride-groups.png
│   └── strides.png
├── index.html
└── mnist.py

/FAQ.html:
--------------------------------------------------------------------------------
1 | cnn-numpy FAQ 2 | 3 |

cnn-numpy FAQ

4 |

Experiments with implementing a convolutional neural network (CNN) in NumPy.

5 |

Written by David Stein (david@djstein.com). See index.html for more information.

6 |

Also available at https://www.github.com/neuron-whisperer/cnn-numpy.

7 |

Questions About cnn-numpy

8 |

How do I run this code?

9 |

Let's get this out of the way:

* Make sure that you have Python 3 installed, along with the requests and NumPy packages:

    python3 -m pip install requests numpy

* Clone the repository, or just download mnist.py, cnn_numpy.py, and cnn_numpy_sg.py.

* Ensure that you have (a) a working internet connection or (b) a copy of the MNIST handwritten digits data set. Just download the first four files from Yann LeCun's MNIST page and save them (with the original filenames) in the same folder as the Python scripts. No need to unzip them.

* Run the script:

    python3 mnist.py cnn

This will train a simple CNN against the MNIST data set. You can replace cnn with flat to run a non-CNN fully-connected model instead. You can also append naive to use the naive CNN implementation, which runs much, much slower.

23 |

What's the point of this project?

24 |

For students who want to learn about convolutional neural networks, one of the best-known tutorials is Dr. Andrew Ng's "Building a Convolutional Neural Network Step By Step" assignment from the Coursera sequence on deep learning. However, the code from that example is poorly structured, incomplete, and in fact enormously inefficient - to the point where it is not feasible to see it learn anything. This project is a reimplementation of that code that is cleaner, complete, and adapted for much faster performance (roughly 1,000x) while still using only NumPy.

25 |

What should I use this code for?

26 |

As a study tool or reference for machine learning architectures, particularly CNNs.

27 |

As a demonstration or proof-of-concept, kind of like a machine with the cover removed. If you want to see a CNN in action - one that's largely transparent, where you can see and easily understand the details of the layers in action - this project provides some different architectures that successfully classify the MNIST data set.

28 |

As an experimental package. You can tweak the hyperparameters - the models, the training, the testing, etc. - and see how the models respond. You can easily build many kinds of models, tweak the code to add activation functions, optimizers, regularizers, normalizers, new layer types, etc., and train your models against the MNIST data set. It is meant to be simple, readable, and extensible.

29 |

What shouldn't I use this code for?

30 |

Anything of moderate-or-greater complexity, where training in a reasonable amount of time requires a GPU or a distributed architecture such as a computing cluster. TensorFlow and PyTorch will run circles around this code.

31 |

Background Questions

32 |

What's a CNN?

33 |

A convolutional neural network, which is a type of artificial neural network that is extremely useful for tasks like detecting shapes and objects in images.

34 |

What's NumPy?

35 |

NumPy is a library for Python that is useful for matrix operations - basically, calculations over huge batches of numbers. The NumPy library is highly optimized for lightning-fast operations - when used correctly, it can be thousands of times faster than plain old Python. Since machine learning requires lots of multiplications of weights with inputs, NumPy is ideal for the kinds of operations that are performed in machine learning.
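
For a sense of scale (this snippet is illustrative and not part of the repository), compare summing a million element-wise products with a plain Python loop versus a single vectorized NumPy call:

    import numpy as np, time

    a = np.random.random(1_000_000)
    b = np.random.random(1_000_000)

    start = time.time()
    total = 0.0
    for i in range(len(a)):       # plain Python: one tiny multiply-add at a time
        total += a[i] * b[i]
    print('loop:', time.time() - start, 'seconds')

    start = time.time()
    total = np.dot(a, b)          # NumPy: one vectorized call over the whole array
    print('numpy:', time.time() - start, 'seconds')

The same principle - replace many small Python-level operations with a few large array operations - is what separates the two CNN implementations in this project.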

36 |

What's Jupyter?

37 |

Jupyter is a neat way of documenting Python tutorials. A Jupyter "notebook" is a collection of blocks, each of which is either markdown text (like HTML but easier to write) or Python code that you can run in place, view, and modify.

38 |

What's the MNIST handwritten digits data set?

39 |

A standard data set of 70,000 images of handwritten digits (28x28, one channel) that is widely used for training simple machine learning networks.

40 |

What's "Building a Convolutional Neural Network Step By Step?"

41 |

An assignment in this Coursera sequence on machine learning taught by Dr. Andrew Ng. An outstanding course sequence for anyone who wants to get started in the field.

42 |

Who's Andrew Ng?

43 |

Dr. Andrew Ng is the founder of the Google Brain Deep Learning Project; former Chief Scientist at Baidu; current pioneer of a variety of AI-related projects; an adjunct professor at Stanford; and an all-around excellent instructor.

44 |

High-Level Project Details

45 |

How does this project differ from the Coursera example?

46 |

Several improvements:

* Cleaner code. The Coursera example relies on global functions and variables passed around in dictionary caches. This project repackages the functionality into simple Python classes.

* Consolidated code. The Coursera example presents the code in a set of Jupyter blocks with big chunks of exposition and unit tests in between. This project consolidates the model functionality into a small script that is easily studied.

* Complete code. The Coursera example lacks a flattening layer and weight updates, and details such as fully-connected layers, the softmax function, the categorical cross-entropy loss function, and the training regimen are presented only in other lessons. This project provides all of the functionality needed to train and use a variety of network architectures.

* Application. The Coursera example develops a CNN, but does not apply it to any problem. This project applies each model to the MNIST handwritten digits database.

* Optimization. This project includes an alternative implementation of the convolutional and pooling layers that runs equivalently, can easily be dropped into place, and runs approximately 1,000 times as fast as the naive implementation.

What Python packages does this code require?

60 |

Only NumPy and Requests (which is used to retrieve the data files from Yann LeCun's website).

61 |

How do I use it?

62 |

To run the MNIST database with a non-CNN (flat) machine learning model:

63 |
python3 mnist.py flat
 64 | 

To run the MNIST database with a CNN machine learning model with the stride groups implementation:

65 |
python3 mnist.py cnn
 66 | 

To run the MNIST database with a CNN machine learning model with the naive implementation:

67 |
python3 mnist.py cnn naive
 68 | 

How do I build my own network architecture?

69 |

The syntax is TensorFlow-like - just provide an array of layer objects:

70 |
    l1 = ConvLayer(32, 3)            # 32 3x3 filters, stride 1 (default), zero padding (default)
 71 |     l2 = PoolLayer_Max(2)            # 2x2 pooling
 72 |     l3 = FlatLayer()                 # flattening layer
 73 |     l4 = FCLayer_ReLU(100)           # 100-neuron dense layer, ReLU activation
 74 |     l5 = FCLayer_Softmax(10)         # 10-neuron dense layer, softmax (per-mini-batch) activation
 75 |     net = Network([l1, l2, l3, l4, l5])  # sequential network with all five layers
 76 | 

Wait - don't I have to specify the size of the input to each layer?

77 |

No. In this implementation, the weights are initialized lazily when training begins, so each layer figures out its parameter shapes from the dimensions of the first mini-batch that it sees.
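
For reference, this is the pattern used by the layer classes themselves - a condensed sketch (activation omitted) of FCLayer from cnn_numpy.py, using that class's shapes:

    import numpy as np

    class FCLayerSketch:
        def __init__(self, num_neurons):
            self.num_neurons = num_neurons
            self.W = None                                   # shape unknown until the first mini-batch

        def forward_propagate(self, X):
            if self.W is None:                              # first call: infer the input width
                self.W = np.random.random((self.num_neurons, X.shape[1] + 1)) * 0.0001
            X = np.hstack([X, np.ones((X.shape[0], 1))])    # append the always-on bias input
            return np.dot(X, self.W.T)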

78 |

How do I train and use the network?

79 |

Training on one mini-batch of inputs X and labels Y:

80 |
net.train(X, Y, learning_rate)        # returns (copy of layer set, CCE, CE)
 81 | 

Evaluation with input tensor X and a label set Y (e.g., test set):

82 |
net.evaluate(X, Y)                    # returns (output tensor, CCE, CE)
 83 | 

Prediction for input tensor X:

84 |
net.predict(X)                        # returns output tensor
 85 | 
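
Putting the pieces together, a minimal training loop looks like the sketch below. The shapes, the random stand-in data, and the hyperparameter values are illustrative assumptions (mnist.py does the real data loading and mini-batching), and the layer classes are assumed to be imported from cnn_numpy_sg (or cnn_numpy for the naive versions):

    import numpy as np
    from cnn_numpy_sg import (ConvLayer, PoolLayer_Max, FlatLayer,
                              FCLayer_ReLU, FCLayer_Softmax, Network)

    X_train = np.random.random((64, 28, 28, 1))          # stand-in for MNIST images
    Y_train = np.eye(10)[np.random.randint(0, 10, 64)]   # stand-in one-hot labels

    net = Network([ConvLayer(32, 3), PoolLayer_Max(2, 2), FlatLayer(),
                   FCLayer_ReLU(100), FCLayer_Softmax(10)])

    batch_size, learning_rate = 16, 0.01                 # illustrative hyperparameters
    for epoch in range(3):
        for i in range(0, len(X_train), batch_size):
            X, Y = X_train[i:i + batch_size], Y_train[i:i + batch_size]
            layers, cce, ce = net.train(X, Y, learning_rate)
        print('epoch', epoch, 'CCE', cce, 'classification error', ce)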

Project Theory and Mechanics

86 |

What are the variables used in the classes and algorithms?

* W and b are the weight vector and bias vector. (The FC layers don't have a separate bias vector: the bias is tacked onto the end of the weight vector, and the input tensor to the layer is augmented with a 1 so that the bias is always fully activated.)

* f_w is the size (width and height) of the filters. f_w = 2 specifies a 2x2 convolutional filter.

* n_f is the number of filters in the convolutional layer, which is also the number of output channels from the convolutional layer.

* n_m is the number of samples in the input tensor for the current mini-batch. It might vary from one mini-batch to the next (e.g., splitting a training set with 100 inputs into three mini-batches).

* n_c is the number of channels in the input to the layer. (It could be the number of color channels in an image for the lowest-level layer, or the number of filters in a preceding convolutional layer.)

* p and s are padding and stride. A padding of 0 is the same as the "valid" padding option in TensorFlow. If you want "same" padding, then choose a suitable value of p such that p = ((s - 1) * ih - s + fw) / 2 is an integer.

* ih and iw are the input height and width of each input to the layer, after padding.

* oh and ow are the output height and width of each output of the layer.

* Z is the product of the weights and the inputs, plus the bias. In most of the layer types, it is also the output of the layer.

* A is the activation of the layer, after applying an activation function (such as ReLU, sigmoid, or softmax) to Z.

* dZ is the gradient that the layer receives during backpropagation from the following layer (or, for the last layer, as calculated from the loss function) and applies to its weights. The FC layers actually receive dA, the gradient of the activation, and run it backward through the first derivative of the activation function to calculate dZ. The other layer types don't have an activation function, and so receive dZ directly.

* dW and db are the components of the gradient to be applied to the weights and biases, respectively, to perform gradient-descent optimization in pursuit of the objective function (categorical cross-entropy). PoolLayer and FlatLayer have no trainable parameters, so these steps are skipped.

* dA_prev is the gradient that passes backward through this layer to the previous layer so that it can also apply gradient-descent-based weight optimization. It is calculated by multiplying dZ by the weights (but not the biases, since they are constants that are factored out by the first derivative).

* cce is categorical cross-entropy - i.e., the cross-entropy between each output and the expected label, summed over the mini-batch.

* ce is classification error - the number of incorrect answers in the mini-batch.

Why is cnn_numpy.py so slow? Isn't it based on NumPy?

120 |

The heart of the convolutional and pooling layers in "Build a Convolutional Neural Network Step By Step" is a four-layer Python loop. Here is the one from the ConvLayer forward pass:

121 |
for i in range(self.n_m):
122 |     for h in range(self.oh):
123 |         for w in range(self.ow):
124 |             for f in range(self.n_f):
125 |                 self.Z[i, h, w, f] = np.sum(self.input[i, ih1:ih2, iw1:iw2, :] * self.W[:, :, :, f])
126 |                 self.Z += self.b[:, :, :, f]
127 | 

This four-layer loop is used in both the forward pass and the backward pass of the convolutional layer and the pooling layers.

128 |

Consider applying this implementation to the MNIST handwritten digits data set, which features 70,000 images of size 28x28 and only one channel. A basic CNN architecture might include a convolutional layer with 32 filters (n_f = 32) of size 3x3x1 and a stride of one (resulting in self.oh = self.ow = 26), followed by a pooling layer. A 95% train/test split results in a training set of 65,500 images.

129 |

For the forward pass of the convolutional layer for one epoch of the training data set, the innermost calculation of self.Z will run (n_m * oh * ow * n_f = 65,500 * 26 * 26 * 32 = 1,416,896,000) times.

130 |

That's about 1.4 billion executions of that innermost line - each one a small element-wise multiply-and-sum over a 3x3 window - for the forward pass only of one convolutional layer, for one epoch. The backpropagation pass executes the same loop, so that's another 1.4 billion per epoch.
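
A quick sanity check of that count:

    n_m, oh, ow, n_f = 65_500, 26, 26, 32      # training samples, output height/width, filters
    print(f'{n_m * oh * ow * n_f:,}')          # 1,416,896,000 executions of the innermost line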

131 |

While NumPy is a very fast library, nearly three billion tiny NumPy calls per epoch - each one carrying Python-level loop overhead - cannot be performed in anything resembling a reasonable amount of time. (The pooling layers are similarly inefficient, but on a much smaller scale.)

132 |

On my machine, the naive implementation of the architecture above trains at about two seconds per input sample - as in: roughly 130,000 seconds, or 36 hours, for one epoch. It is practically impossible to see it learn anything at that speed.

133 |

How is cnn_numpy_sg.py different from cnn_numpy.py?

134 |

The optimized implementation has identical FlatLayer, FCLayer, and Network classes. Both ConvLayer and PoolLayer replace the four-layer loop with a two-layer iteration over the stride groups. The number of stride groups is, at most, the area of one filter (i.e., a 3x3 filter has at most nine stride groups). For operations where the stride equals the filter size, the entire forward and backward passes require exactly one iteration of this loop - each of the forward pass and the backward pass is calculated, in its entirety, with one massive matrix multiplication.

135 |

What's a stride group?

136 |

Stride groups are a novel optimization technique for performing convolutional and pooling operations. The idea is to replace iteration, as much as possible, with a reshaping of the input and weight tensors so that each element of each filter is multiplied with a huge number of input elements in a single operation. So instead of iterating through the input tensors billions of times to perform small calculations, stride groups enable NumPy to iterate only a few times, in some cases only once, to calculate the forward and backward passes.

137 |

No, seriously, what's a stride group?

138 |

A simple example will help. Consider a 2x2 convolutional filter with four elements:

A B
C D
157 |

Consider the convolution of this filter over this 4x4 input (with a stride of 1):

 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16
194 |

In naive convolution, this calculation is performed by sliding the filter over the image, performing a 2x2 element-wise multiplication at each position, and adding up the results. cnn_numpy.py does this, but performing this small multiplication at every filter position of every image, over 65,500 input samples (a 95% split of the MNIST data set) and for every filter, is intensely slow.
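
For concreteness, here is that naive sliding-window computation for the 2x2 example (the filter values below are arbitrary stand-ins for A, B, C, and D):

    import numpy as np

    X = np.arange(1, 17).reshape(4, 4)        # the 4x4 input above
    F = np.array([[1.0, 2.0], [3.0, 4.0]])    # stand-in values for filter elements A, B, C, D

    out = np.zeros((3, 3))                    # a 2x2 filter at stride 1 fits 3x3 positions
    for h in range(3):
        for w in range(3):
            out[h, w] = np.sum(X[h:h + 2, w:w + 2] * F)   # element-wise multiply, then sum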

195 |

Instead, consider which elements above actually get multiplied by each element of the filter. Labeling the contribution that each filter element makes to the output (Z) like this:

E F
G H
214 |

...then each of these is calculated based on a multiplication of the filter with a subset of the pixels:

215 |
E = A * (1, 2, 3, 5, 6, 7, 9, 10, 11)
216 | F = B * (2, 3, 4, 6, 7, 8, 10, 11, 12)
217 | G = C * (5, 6, 7, 9, 10, 11, 13, 14, 15)
218 | H = D * (6, 7, 8, 10, 11, 12, 14, 15, 16)
219 | 

220 |

In order to break down this problem, consider that the filter strides over the input nine times:

221 |

Convolution with Strides

222 |

However, the iteration is still proportional to the size of the input, the number of filters, and the fine-grained nature of the stride. It will still be computationally inefficient to apply this design to data sets with larger images or networks with multiple layers.

223 |

We can do better. It would be great to grab as many of these pixels as we can during one pass through the input tensor - that's the sort of enormous calculation that NumPy does very well. One way to achieve this would be to reshape the array so that the elements that are multiplied against each filter element are grouped together. NumPy can do that, too.

224 |

Unfortunately, it's not quite as simple as reshaping the array - because, as you'll notice, some of those inputs are repeated. Element 2 in the input is multiplied by both filter elements A and B. Element 6 in the input is multiplied by all four filter elements. So simply reshaping the array won't be enough.

225 |

Instead, we can divide the positions of the filter over the input into subsets of non-overlapping filter positions. For each subset of filter positions, each input element is included at most once. A careful slicing and reshaping of the array could then group together all of the input elements for each filter element and perform the weight-vector-by-input-tensor multiplication (while excluding all of the input elements that aren't covered by any filter position in this subset).

226 |

For instance, this 2x2 convolution has four stride groups:

227 |

Convolution with Stride Groups

228 |

Thus, in the first stride group:

229 |
E = A * (1, 3, 9, 11)
230 | F = B * (2, 4, 10, 12)
231 | G = C * (5, 7, 13, 15)
232 | H = D * (6, 8, 14, 16)
233 | 

In the second stride group:

234 |
E = A * (2, 10)
235 | F = B * (3, 11)
236 | G = C * (6, 14)
237 | H = D * (7, 15)
238 | 

In the third stride group:

239 |
E = A * (5, 7)
240 | F = B * (6, 8)
241 | G = C * (9, 11)
242 | H = D * (10, 12)
243 | 

And in the fourth stride group:

244 |
E = A * (6)
245 | F = B * (7)
246 | G = C * (10)
247 | H = D * (11)
248 | 

Each stride group, then, is a collection of non-overlapping positions of the filter, and the reshaping gathers together, for each filter element, the entire set of input elements that it multiplies across all of those positions.
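
The grouping above can be checked mechanically. This sketch (illustrative, not the repository's code) enumerates the four stride groups for the 2x2, stride-1 example and prints, for each group, the input elements that align with filter element A; the other filter elements work the same way:

    import numpy as np

    fw = 2                                    # filter width (the stride is 1 in this example)
    X = np.arange(1, 17).reshape(4, 4)        # the 4x4 input above

    for t in range(fw):                       # row offset of the stride group
        for l in range(fw):                   # column offset of the stride group
            rows = (X.shape[0] - t) // fw * fw    # crop so the region tiles into 2x2 windows
            cols = (X.shape[1] - l) // fw * fw
            windows = X[t:t + rows, l:l + cols].reshape(rows // fw, fw, cols // fw, fw)
            # windows[:, a, :, b] holds every input element aligned with filter element (a, b)
            print((t, l), sorted(windows[:, 0, :, 0].ravel().tolist()))

Running it prints (0, 0) [1, 3, 9, 11], (0, 1) [2, 10], (1, 0) [5, 7], and (1, 1) [6] - the same sets listed for A (as E) above.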

249 |

Why are stride groups better?

250 |

Because each stride group allows a big slice of the array to be aligned and multiplied with the corresponding filter elements in one pass through the input tensor. This tremendously reduces the number of iterations - i.e., passes through the array - needed for each forward propagation / backpropagation operation. In fact, the number of stride groups is at most the area of the filter - e.g., at most nine iterations for a 3x3 filter - and typically fewer. If the stride matches the filter width, then the number of stride groups is one: forward propagation and backpropagation are each performed with a single, enormous matrix multiplication.

251 |

The number of stride groups is based on the filter size and the stride. It also depends on a modulo of the input width and height, but it is not proportional to the input width and height. And it is completely independent of the number of inputs, the number of channels, and the number of filters.
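
As a sketch of the payoff - and only a sketch, not the repository's exact code, which handles arbitrary strides and the backward pass as well - here is a forward pass for the special case where the stride equals the filter width, so that there is exactly one stride group and no Python-level loop over filter positions at all:

    import numpy as np

    def conv_forward_one_stride_group(X, W, b, fw):
        # X: (n_m, ih, iw, n_c), W: (fw, fw, n_c, n_f), b: (1, 1, 1, n_f); stride == fw
        n_m, ih, iw, n_c = X.shape
        oh, ow = ih // fw, iw // fw
        # Carve the input into non-overlapping fw x fw windows: (n_m, oh, ow, fw, fw, n_c).
        blocks = X[:, :oh * fw, :ow * fw, :].reshape(n_m, oh, fw, ow, fw, n_c)
        blocks = blocks.transpose(0, 1, 3, 2, 4, 5)
        # One big contraction of the window axes against the filters: (n_m, oh, ow, n_f).
        return np.tensordot(blocks, W, axes=([3, 4, 5], [0, 1, 2])) + b

    Z = conv_forward_one_stride_group(np.random.random((8, 28, 28, 1)),
                                      np.random.random((2, 2, 1, 32)) * 0.01,
                                      np.zeros((1, 1, 1, 32)), fw=2)   # Z.shape == (8, 14, 14, 32)

The ConvLayer and PoolLayer classes in cnn_numpy_sg.py do essentially this once per stride group, which is why the iteration count never grows with the input size, the number of samples, or the number of filters.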

252 |

Could cnn_numpy_sg.py be further optimized?

253 |

Certainly. I am a NumPy amateur, and my code is pretty clumsy. I presume that a lot of operations could be equivalently performed in a much more efficient manner.

254 |

Should I optimize it?

255 |

Probably not. This code will never compete with TensorFlow or PyTorch, and your time may be better spent on those. But if you're really itching to sharpen your NumPy or optimization skills, this might provide good practice.

256 |

I have questions...

257 |

Please feel free to contact me: david@djstein.com, or try posting on /r/deeplearning - they're pretty friendly.

258 |
259 |

More Projects

260 |
261 | 262 | 263 | -------------------------------------------------------------------------------- /FAQ.md: -------------------------------------------------------------------------------- 1 | # cnn-numpy FAQ 2 | 3 | Experiments with implementing a convolutional neural network (CNN) in NumPy. 4 | 5 | Written by David Stein (david@djstein.com). See [README.md](README.md) for more information. 6 | 7 | Also available at [https://www.djstein.com/cnn-numpy](https://www.djstein.com/cnn-numpy). 8 | 9 | ## Questions About cnn-numpy 10 | 11 | ### How do I run this code? 12 | 13 | Let's get this out of the way: 14 | 15 | * Make sure that you have [Python 3](https://www.python.org) installed with requests and NumPy: 16 | 17 | python3 -m pip install requests numpy 18 | 19 | * Clone the repository, or just download [`mnist.py`](mnist.py), [`cnn_numpy.py`](cnn_numpy.py), and [`cnn_numpy_sg.py`](cnn_numpy_sg.py). 20 | 21 | * Ensure that you have (a) a working internet connection or (b) a copy of the [MNIST handwritten digits data set](http://yann.lecun.com/exdb/mnist/). Just download the first four files on the page and save them (with the original filenames) in the same folder as the Python scripts. No need to unzip them. 22 | 23 | * Run the script: 24 | 25 | python3 mnist.py cnn 26 | 27 | This will train a simple CNN against the MNIST data set. You can replace `cnn` with `flat` to run a non-CNN fully-connected model instead. You can also append `naive` to use the naive CNN implementation, which runs much, much slower. 28 | 29 | ### What's the point of this project? 30 | 31 | For students who want to learn about convolutional neural networks, one of the best-known tutorials is Dr. Andrew Ng's "Building a Convolutional Neural Network Step By Step" assignment as part of the Coursera sequence on deep learning. However, the code from that example is poorly structured, incomplete, and in fact *enormously* inefficient - to the point where it is not feasible to see it learn anything. This project is a reimplementation of the code in that project that is cleaner, complete, and adapted for much faster performance (x1,000!) using NumPy. 32 | 33 | ### What should I use this code for? 34 | 35 | As a study tool or reference for machine learning architectures, particularly CNNs. 36 | 37 | As a demonstration or proof-of-concept, kind of like a machine with the cover removed. If you want to see a CNN in action - one that's largely transparent, where you can see and easily understand the details of the layers in action - this project provides some different architectures can successfully classify the MNIST data set. 38 | 39 | As an experimental package. You can tweak the hyperparameters: models, training, testing, etc. - and see how the models respond. You can easily build many kinds of models, tweak the code to add activation functions / optimizers / regularizers / normalizers / new layer types / etc., and train your models against the MNIST data set. It is meant to be simple, readable, and extensible. 40 | 41 | ### What *shouldn't* I use this code for? 42 | 43 | Anything of moderate-or-greater complexity, where training in a reasonable amount of time requires a GPU or a distributed architecture such as a computing cluster. TensorFlow and PyTorch will run circles around this code. 44 | 45 | ## Background Questions 46 | 47 | ### What's a CNN? 
48 | 49 | A [convolutional neural network](https://en.wikipedia.org/wiki/Convolutional_neural_network), which is a type of artificial neural network that is extremely useful for tasks like detecting shapes and objects in images. 50 | 51 | ### What's NumPy? 52 | 53 | [NumPy](https://en.wikipedia.org/wiki/NumPy) is a library for Python that is useful for matrix operations - basically, calculations over huge batches of numbers. The NumPy library is highly optimized for lightning-fast operations - when used correctly, it can be *thousands* of times faster than plain old Python. Since machine learning requires lots of multiplications of weights with inputs, NumPy is ideal for the kinds of operations that are performed in machine learning. 54 | 55 | ### What's Jupyter? 56 | 57 | [Jupyter](https://jupyter.org/) is a neat way of documenting Python tutorials. A Jupyter "notebook" is a collection of blocks, each of which is either markdown text (like HTML but easier to write) or Python code that you can run in place, view, and modify. 58 | 59 | ### What's the MNIST handwritten digits data set? 60 | 61 | [This data set](http://yann.lecun.com/exdb/mnist/) for training simple machine learning networks. 62 | 63 | ### What's "Building a Convolutional Neural Network Step By Step?" 64 | 65 | An assignment in [this Coursera sequence on machine learning](https://www.coursera.org/learn/convolutional-neural-networks?specialization=deep-learning#syllabus) taught by Dr. Andrew Ng. An outstanding course sequence for anyone who wants to get started in the field. 66 | 67 | ### Who's Andrew Ng? 68 | 69 | [Dr. Andrew Ng](https://en.wikipedia.org/wiki/Andrew_Ng) is the founder of the Google Brian Deep Learning Project; former Chief Scientist at Baidu; current pioneer of a variety of AI-related projects; an adjunct professor at Stanford; and an all-around excellent instructor. 70 | 71 | ## High-Level Project Details 72 | 73 | ### How does this project differ from the Coursera example? 74 | 75 | Several improvements: 76 | 77 | * Cleaner code. The Coursera example relies on global functions and variables passed around in dictionary caches. This project repackages the functionality into simple Python classes. 78 | 79 | * Consolidated code. The Coursera example presents the code in a set of Jupyter blocks with big chunks of exposition and unit tests in between. This project consolidates the model functionality into a small script that is easily studied. 80 | 81 | * Complete code. The Coursera example lacks a flattening layer and weight updates, and details such as fully-connected layers, weight updates, the softmax function, categorical cross-entropy loss function, and training regimen are presented only in other lessons. This project provides all of the functionality needed to train and use a variety of network architectures. 82 | 83 | * Application. The Coursera example develops a CNN, but does not apply it to any problem. This project provides an application of any model to the MNIST handwritten digits database. 84 | 85 | * Optimization. This project includes an alternative implementation of the convolutional and pooling layers that runs equivalently, that can be easily dropped into place, and that runs approximately 1,000 times as fast as the naive implementation. 86 | 87 | ### What Python packages does this code require? 88 | 89 | Only NumPy and Requests (which is used to retrieve the data files from Yann LeCun's website). 90 | 91 | ### How do I use it? 
92 | 93 | To run the MNIST database with a non-CNN (flat) machine learning model: 94 | 95 | python3 mnist.py flat 96 | 97 | To run the MNIST database with a CNN machine learning model with the stride groups implementation: 98 | 99 | python3 mnist.py cnn 100 | 101 | To run the MNIST database with a CNN machine learning model with the naive implementation: 102 | 103 | python3 mnist.py cnn naive 104 | 105 | ### How do I build my own network architecture? 106 | 107 | The syntax is TensorFlow-like - just provide an array of layer objects: 108 | 109 | l1 = ConvLayer(32, 3) # 32 3x3 filters, stride 1 (default), zero padding (default) 110 | l2 = PoolLayer_Max(2) # 2x2 pooling 111 | l3 = FlatLayer() # flattening layer 112 | l4 = FCLayer_ReLU(100) # 100-neuron dense layer, ReLU activation 113 | l5 = FCLayer_Softmax(10) # 10-layer dense layer, softmax (per-mini-batch) activation 114 | net = Net([l1, l2, l3, l4, l5]) # sequential network with all five layers 115 | 116 | ### Wait - don't I have to specify the size of the input to each layer? 117 | 118 | No. In this implementation, all of the weights are initialized at the beginning of training, so each layer figures out its parameter set based on the dimensions of the first mini-batch. 119 | 120 | ### How do I train and use the network? 121 | 122 | Training using one mini-batch: 123 | 124 | net.train(mini_batch, learning_rate) # returns (copy of layer set, CCE, CE) 125 | 126 | Evaluation with input tensor X and a label set Y (*e.g.*, test set): 127 | 128 | net.evaluate(X, Y) # returns (output tensor, CCE, CE) 129 | 130 | Prediction for input tensor X: 131 | 132 | net.predict(X) # returns output tensor 133 | 134 | ## Project Theory and Mechanics 135 | 136 | ### What are the variables used in the classes and algorithms? 137 | 138 | * `W` and `b` are the weight vector and bias vector. (The FC layers don't have a separate bias vector: the bias is tacked onto the end of the weight vector, and the input tensor to the layer is augmented with a 1 so that the bias is always fully activated.) 139 | 140 | * `f_w` is the size (width and height) of the filters. `f_w = 2` specifies a 2x2 convolutional filter. 141 | 142 | * `n_f` is the number of filters in a the convolutional layer, which is also the number of output channels from the convolutional layer. 143 | 144 | * `n_m` is the number of samples in the input tensor for the current mini-batch. Might vary from one mini-batch to the next (e.g., splitting a training set with 100 inputs into three mini-batches). 145 | 146 | * `n_c` is the number of channels in the input to the layer. (Could be the number of color channels in an image for the lowest-level layer, or the number of filters in a preceding convolutional layer.) 147 | 148 | * `p` and `s` are padding and stride. A padding of 0 is the same as the "valid" padding option in TensorFlow. If you want "same" padding, then choose a suitable value of `p` such that `p = ((s - 1) * ih - s + fw)/2` is an integer. 149 | 150 | * `ih` and `iw` are the input height and width of each input to the layer, after padding. 151 | 152 | * `oh` and `ow` are the output height and width of each output of the layer, after padding. 153 | 154 | * `Z` is the product of the weights and biases. In most of the layer types, it is also the output of the layer. 155 | 156 | * `A` is the activation of the layer, after applying an activation function (such as ReLU, sigmoid, or softmax) to `Z`. 
157 | 158 | * `dZ` is the gradient that the layer receives and applies to its weights during backpropagation from the following layer (or as calculated from the loss function, for the last layer). The FCLayers actually receive `dA`, the gradient of the activation function, and then run it backward through a first derivative of the activation function to calculate `dZ`. The other layer types don't have an activation function, and so receive `dZ` directly. 159 | 160 | * `dW` and `db` are the components of the gradient to be applied to the weights and biases, respectively, to perform gradient-descent optimization in pursuit of the objective function (which is categorical cross-entropy). PoolLayer and FlatLayer have no trainable parameters, so these steps are skipped. 161 | 162 | * `dA_prev` is the gradient that passes backward through this layer to the previous layer so that it can also apply gradient-descent-based weight optimization. It is calculated by multiplying `dZ` by the weights (but not the biases, since they are constants that are factored out by the first derivative). 163 | 164 | * `cce` is categorical cross-entropy - *i.e.*, the sum of the vector distances between each output and the expected label, summed over the mini-batch. 165 | 166 | * `ce` is classification error - the number of incorrect answers in the mini-batch. 167 | 168 | ### Why is `cnn_numpy.py` so slow? Isn't it based on NumPy? 169 | 170 | The heart of the convolutional and pooling layers in "Build a Convolutional Neural Network Step By Step" is a four-layer Python loop. Here is the one from the ConvLayer forward pass: 171 | 172 | for i in range(self.n_m): 173 | for h in range(self.oh): 174 | for w in range(self.ow): 175 | for f in range(self.n_f): 176 | self.Z[i, h, w, f] = np.self.input[i, ih1:ih2, iw1:iw2, :] * self.W[:, :, :, f]) 177 | self.Z += self.b[:, :, :, f] 178 | 179 | This four-layer loop is used in both the forward pass and the backward pass of the convolutional layer and the pooling layers. 180 | 181 | Consider applying this implementation to the MNIST handwritten digits data set, which features 70,000 images of size 28x28 and only one channel. A basic CNN architecture might include a convolutional layer with 32 filters (`n_f = 32`) of size 3x3x1 and a stride of one (resulting in `self.oh = self.ow = 26`), followed by a pooling layer. A 95% train/test split results in a training set with 65,500 images. 182 | 183 | For the forward pass of the convolutional layer for one epoch of the training data set, the inner calculations of self.Z will run (`n_m * oh * ow * n_f = 65,500 * 26 * 26 * 32 = 14,168,960,000`) times. 184 | 185 | That's 14 *trillion* calculations, for the forward pass only of one convolutional layer, for *one epoch*. The backpropagation pass executes the same loop, so another 14 trillion calculations per epoch. 186 | 187 | While NumPy is a very fast library, it simply cannot perform 28 trillion operations in anything resembling a reasonable amount of time. (The pooling layers are similarly inefficient, but on a much smaller scale.) 188 | 189 | On my machine, the ConvLayer architecture above this architecture is able to train on about *two seconds per input sample* - as in: 130,000 seconds, or 36 hours, for one epoch. It is practically impossible to see it learning anything with this performance. 190 | 191 | ### How is `cnn_numpy_sg.py` different than `cnn_numpy.py`? 192 | 193 | The optimized implementations have an identical FlatLayer, FCLayer, and Network classes. 
Both ConvLayer and PoolLayer replace the four-layer loop with a two-layer iteration over the number of stride groups. This is, at most, the area of one filter (*i.e.*, a 3x3 filter has *at most* nine stride groups). For operations where the stride equals the filter size, the *entire* forward and backward passes require exactly one iteration of this loop - each of the forward pass and backward pass are calculated, in entirety, with one massive matrix multiplication. 194 | 195 | ### What's a stride group? 196 | 197 | Stride groups are a novel optimization technique for performing convolutional and pooling operations. The idea is to replace iteration, as much as possible, with a reshaping of the input and weight tensors so that each element in each filter is multiplied together with a huge number of elements in the input tensor. So instead of iterating through the input tensors *trillions* of times to perform small calculations, stride groups enable NumPy to iterate only a few times, *in some cases only once*, to calculate the forward and backward passes. 198 | 199 | ### No, seriously, what's a stride group? 200 | 201 | A simple example will help. Consider a 2x2 convolutional filter with four elements: 202 | 203 | | | | 204 | |---|---| 205 | | A | B | 206 | | C | D | 207 | 208 | Consider the convolution of this filter over this 4x4 input (with a stride of 1): 209 | 210 | | | | | | 211 | |----|----|----|----| 212 | | 1 | 2 | 3 | 4 | 213 | | 5 | 6 | 7 | 8 | 214 | | 9 | 10 | 11 | 12 | 215 | | 13 | 14 | 15 | 16 | 216 | 217 | In naive convolution, this calculation is performed by sliding the filter over the image, performing a 2x2 element-wise multiplication at each position, and adding up the results. `cnn_numpy.py` does this, but performing this small 2x2 multiplication over 65,500 input samples (a 95% set of the MNIST data set), where each image is 26x26 (so 24x24 positions for each image), and for each filter, is intensely slow. 218 | 219 | Instead, consider which elements above *actually* get multiplied by each element of the filter. If the output of the layer (`Z`) is this: 220 | 221 | | | | 222 | |---|---| 223 | | E | F | 224 | | G | H | 225 | 226 | ...then each of these is calculated based on a multiplication of the filter with a subset of the pixels: 227 | 228 | E = A * (1, 2, 3, 5, 6, 7, 9, 10, 11) 229 | F = B * (2, 3, 4, 6, 7, 8, 10, 11, 12) 230 | G = C * (5, 6, 7, 9, 10, 11, 13, 14, 15) 231 | H = D * (6, 7, 8, 10, 11, 12, 14, 15, 16) 232 | 233 | In order to break down this problem, consider that the filter strides over the input nine times: 234 | 235 | ![Convolution with Strides](images/strides.png) 236 | 237 | However, the iteration is still proportional to the size of the input, the number of filters, and the fine-grained nature of the stride. It will still be computationally inefficient to apply this design to data sets with larger images or networks with multiple layers. 238 | 239 | We can do better. It would be great to grab as many of these pixels as we can during one pass through the input tensor - that's the sort of enormous calculation that NumPy does very well. One way to achieve this would be to reshape the array so that the elements against which each filter element are grouped together. NumPy can do that, too. 240 | 241 | Unfortunately, it's not quite as simple as reshaping the array - because, as you'll notice, some of those inputs are repeated. Element 2 in the input is multiplied by both filter elements A and B. 
Element 6 in the input is multiplied by *all four* filter elements. So simply reshaping the array won't be enough. 242 | 243 | Instead, we can divide the positions of the filter over the input into *subsets* of *non-overlapping* filter positions. For each subset of filter positions, each input element is included *at most once*. A careful slicing and reshaping of the array could then group together all of the input elements for each filter element and perform the weight-vector-by-input-tensor multiplication (while excluding all of the input elements that aren't covered by any filter position in this subset). 244 | 245 | For instance, this 2x2 convolution has four stride groups: 246 | 247 | ![Convolution with Stride Groups](images/stride-groups.png) 248 | 249 | Thus, in the first stride group: 250 | 251 | E = A * (1, 3, 9, 11) 252 | F = B * (2, 4, 10, 12) 253 | G = C * (5, 7, 9, 11) 254 | H = D * (6, 8, 14, 16) 255 | 256 | In the second stride group: 257 | 258 | E = A * (2, 10) 259 | F = B * (3, 11) 260 | G = C * (6, 14) 261 | H = D * (7, 15) 262 | 263 | In the third stride group: 264 | 265 | E = A * (5, 7) 266 | F = B * (6, 8) 267 | G = C * (9, 11) 268 | H = D * (10, 12) 269 | 270 | And in the fourth stride group: 271 | 272 | E = A * (6) 273 | F = B * (7) 274 | G = C * (10) 275 | H = D * (11) 276 | 277 | Each stride group, then, is a collection of non-overlapping positions of the filter, where the entire set of input elements that are multiplied by each filter element in all of those positions are grouped together at the end of the array. 278 | 279 | ### Why are stride groups better? 280 | 281 | Because each stride group allows a big slice of the array to be aligned and multiplied with the corresponding filter elements on one pass through the input tensor. This tremendously reduces the number of stride groups that must be processed, *i.e.*, the number of iterations or passes through the array for each forward propagation / backpropagation operation. In fact, the number of stride groups is *at most* the area of the filter - e.g., *at most* nine iterations for a 3x3 filter - and typically fewer. If the stride matches the filter width, then the number of stride group is *one*: forward propagation and backpropagation are each performed with a single, enormous matrix multiplication. 282 | 283 | The number of stride groups is based on the filter size and the stride. It is also based on a *modulo* of the input widths and heights, but it is not *proportional to* the input widths and heights. And it is completely irrespective of the number of inputs, the number of channels, and the number of filters. 284 | 285 | ### Could `cnn_numpy_sg.py` be further optimized? 286 | 287 | Certainly. I am a NumPy amateur, and my code is pretty clumsy. I presume that a lot of operations could be equivalently performed in a much more efficient manner. 288 | 289 | ### Should I optimize it? 290 | 291 | Probably not. This code will never compete with TensorFlow or PyTorch. Your time may be better spent on those. But if you're really itching to sharpen your NumPy skills or optimization, this might provide good practice. 292 | 293 | ### I have questions... 294 | 295 | Please feel free to contact me: david@djstein.com , or try posting on [/r/deeplearning](https://www.reddit.com/r/deeplearning) - they're pretty friendly. 
296 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # cnn-numpy 2 | 3 | Experiments with implementing a convolutional neural network (CNN) in NumPy. 4 | 5 | Written by David Stein (david@djstein.com). See [FAQ.md](FAQ.md) for more information. 6 | 7 | Also available at [https://www.djstein.com/cnn-numpy](https://www.djstein.com/cnn-numpy). 8 | 9 | ## The Short Version 10 | 11 | The purpose of this project is to provide a fully working, NumPy-only implementation of a convolutional neural network, with a few goals: 12 | 13 | 1. Presenting the simplest, cleanest example of a fully implemented, working CNN as a minimal adaptation of Dr. Andrew Ng's example code. See: `cnn_numpy.py`. 14 | 15 | 2. Provide a working application of the above example to classify the MNIST database of handwritten digits. See: `mnist.py`. 16 | 17 | While these two efforts were successful, it happens that the example code runs *prohibitively* slowly - because it is designed for clarity and education, not actual use. It is not possible to train the model to do anything meaningful in a reasonable amount of time. Thus, the following objectives were added: 18 | 19 | 3. Provide an alternative implementation to the above example that runs equivalently, but makes much better use of NumPy and runs in a reasonable amount of time. (This redesign involves an apparently novel computational technique of **stride groups**, described below.) See: `cnn_numpy_sg.py`. 20 | 21 | ## Background 22 | 23 | Convolutional neural networks are the current primary model of deep learning for tasks such as image analysis. [Dr. Andrew Ng](https://en.wikipedia.org/wiki/Andrew_Ng), one of the contemporary experts in machine learning, offers [this excellent Coursera course on convolutional neural networks](https://www.coursera.org/learn/convolutional-neural-networks?specialization=deep-learning#syllabus) that includes an assignment entitled "Build a Convolutional Neural Network Step by Step." 24 | 25 | The assignment is presented as a Jupyter notebook with NumPy-based code and a detailed explanation of the calculations, including helpful illustrations. The code includes a convolutional layer, a MaxPool layer, and an AvgPool layer. The Jupyter notebook also includes unit tests to show that the individual functions work. 26 | 27 | However, the code in the Jupyter notebook has a few issues that prevent its use for anything beyond the confines of the assignment: 28 | 29 | 1. The code is a bit... primitive. It is structured as a set of global functions. Parameters are passed in a "cache" dictionary, with names identified by strings. Some variables are cached and then never used. Different variable names are sometimes used between a calling function and a called function. Tracing the interconnection of the functions is rather painful due to big chunks of explanatory text, unit tests, and output. 30 | 31 | 2. The implementation in the Jupyter notebook is not complete! The convolutional layers calculate dW/db, but do not update the filters. The implementations do not include a fully connected layer - these are developed in previous exercises - nor a flattening layer (which is trivial to implement, but its absence is frustrating). 32 | 33 | 3. The implementation in the Jupyter notebook is not applied to any kind of problem. It is simply presented as an example. 
The next exercise in the course sequence involves applying a CNN to the MNIST handwritten digits data set - and yet, *it does not use the CNN implementation from this assignment*; instead, it switches to a pure TensorFlow implementation. (To make matters worse, the TensorFlow implementation requires TensorFlow 1.x, which will of course not run in the current TensorFlow 2.x environment!) 34 | 35 | These limitations create two problems. 36 | 37 | **Problem #1:** While reviewing "Build a Convolutional Neural Network Step by Step," it is very difficult to take a step back and see a fully-realized CNN that uses this code, including training. It is also disappointing that the code is not shown to run against any kind of sample problem. Students cannot easily see it in action. Students cannot experiment with the structure or hyperparameters to see how performance may differ. And students cannot easily apply this code to other standard data sets to develop any kind of skill and confidence in this knowledge. 38 | 39 | (As an avid student of machine learning, I am continually irked by the gap between the skill set of theoretical knowledge, such as how models work, and the skill set of applied machine learning, such as the available platforms and the common theories of how to use them. Anything that can be done to bridge this gap with actual knowledge will, I believe, be helpful for promoting a comprehensive understanding of the field.) 40 | 41 | ## Basic Implementation 42 | 43 | Based on the above, I set out fill in the gaps. I sought to refactor, streamline, and just generally *normalize* all of Dr. Ng's code into a tight, cohesive CNN library - one that faithfully reproduces the algorithms from the example code, but in a much more readable, usable way. Also, I sought to apply this model to the MNIST database in order to show that it works. The result of this effort is `cnn_numpy.py` and `mnist.py`: 44 | 45 | `cnn_numpy.py` is a 180-line file that includes: 46 | 47 | * A ConvLayer class 48 | 49 | * A PoolLayer base class, with PoolLayer_Max and AvgPoolLayer_Avg subclasses 50 | 51 | * A FlatLayer class 52 | 53 | * An FCLayer base class, with FCLayer_ReLU, FCLayer_Sigmoid, and FCLayer_Softmax subclasses 54 | 55 | * A Network class with predict(), evaluate(), and train() functions, based on categorical cross-entropy and multiclass error, and a parameterized learning rate 56 | 57 | And `mnist.py` is an 80-line file that loads the MNIST handwritten digits data set from [Dr. Yann LeCun's website](http://yann.lecun.com/exdb/mnist/), unpacks it, and trains any combination of layers from `cnn_numpy.py` to perform multiclass classification. Students may freely experiment with different architectures by changing one line of code, like this non-CNN ("flat") model: 58 | 59 | net = Network([FlatLayer(), FCLayer_ReLU(100), FCLayer_Softmax(10)]) 60 | 61 | ...or this CNN model: 62 | 63 | net = Network([ConvLayer(32, 3), PoolLayer_Max(2, 2), FlatLayer(), FCLayer_ReLU(100), FCLayer_Softmax(10)]) 64 | 65 | This is a TensorFlow-like syntax, but unlike TensorFlow, `cnn_numpy.py` is completely transparent and readable. 66 | 67 | For convenience, `mnist.py` allows you to select either architecture with a suitable set of training parameters: 68 | 69 | python3 mnist.py flat naive 70 | python3 mnist.py cnn naive 71 | 72 | The flat architecture work great. 
On a typical machine, it can train from scratch with a 95% training split of the data set to reach a classification error under 10% within one minute, and under 5% in three minutes. 73 | 74 | The CNN architectures also runs... slowly. *Brutally* slowly. *Unusably* slowly. 75 | 76 | **Problem #2:** The code from "Build a Convolutional Neural Network Step by Step" is prohibitively slow. So slow that it is not realistically possible to apply it to a problem or to see it learn anything. Thus, it is difficult to see that it even works, let alone experiment with it. 77 | 78 | The ConvLayer and PoolLayer classes require a four-layer-deep loop in Python for each of the forward and backward passes. The loop iterates over: (1) The inputs in the input tensor, (2) the height of each input, (3) the width of each input, and (4) each filter in the convolutional layer. 79 | 80 | As a result of this deeply nested iteration, processing one epoch of the MNIST data set with a one-ConvLayer CNN with 32 filters would require *28 trillion NumPy calculations*. NumPy is indeed fast, but executing literally trillions of small NumPy calculations requires an absurdly long period of time. 81 | 82 | I suppose that these factors explain the hard transition in the CNN course sequence from the NumPy-based "Build a Convolutional Neural Network Step by Step" code to the TensorFlow-based code for the immediately following lesson. But it does feel like a bait-and-switch: "Here is some code... now that we've developed all of that, let's use none of it, and instead switch to a totally different platform." 83 | 84 | The challenge presented at this stage is: How can the examples of pooling and convolutional layers from "Build a Convolutional Neural Network Step by Step" be modified to use the same model, and only simple NumPy, but to make much better use of NumPy's array processing capabilities so that iteration can be vastly reduced? 85 | 86 | ## Stride Groups 87 | 88 | NumPy is capable of performing operations on large, multidimensional arrays at blazingly faster speeds than Python. NumPy also features sophisticated array indexing and slicing operations, as well as broadcasting, which permits element-wise multiplication between a multi-element axis of an array and a one-element axis of an array. However, the four-layer loop makes very poor use of these properties: it uses NumPy to perform element-wise multiplication. 89 | 90 | Each implementation in the example code requires iteration over (1) the number of training samples, (2) the output width and height of each sample (which are based on the square area of the input image), and (3) the number of filters. And yet, all of these properties are originally featured in arrays: the input to each layer, the weight matrix, etc. So it is quite feasible to operate on massive subsections of this array instead of lots of element-wise operations. 91 | 92 | One idea is to iterate over the elements of each filter rather than the elements of each image, since the filters are smaller. Theoretically, element-wise multiplication of each element of each filter would require (f_w * f_w * c) iterations - in the network above: (3 * 3 * 32) = 288 iterations, which is certainly an improvement. (But this methodology is not applicable to the pooling layers.) Still - we can do better. 93 | 94 | A much better idea - and the heart of the following technique - is to reorient the input into **stride groups**. 
A stride group is the set of input elements for all non-overlapping strides, assembled into a sub-matrix of elements and aligned with a corresponding element of the filter matrix with which those elements of the input are multiplied during convolution - like this: 95 | 96 | ![Convolution with Stride Groups](images/stride-groups.png) 97 | 98 | This approach has an extremely important advantage: The number of iterations is irrespective of the size of the input, the number of input channels, and the number of filters. 99 | 100 | In this approach, four total iterations are required whether the matrix is 4x4, or 5x5, or 100x100. Four total iterations are required whether the number of filters is 1, or 32, or 10,000. The number of iterations is strictly based on the filter width and the stride. Larger inputs and larger number of filters will increase the sizes of the matrix that are multiplied together in each iteration, but not the number of iterations. 101 | 102 | Moreover: if the filter width and the stride are equal - for example, a 2x2 filter with a stride of 2 - then only one shift group exists... and all of convolution is performed in a single matrix multiplication. Iteration is *entirely eliminated*. 103 | 104 | ## Stride Groups: Implementation 105 | 106 | The redesign of the neural network architecture using stride groups is presented in `cnn-numpy-sg.py`, and can be used by omitting the "naive" parameter from the command: 107 | 108 | python3 mnist.py cnn 109 | 110 | The redesign presents identical versions of `FlatLayer`, `FCLayer`, and `Network`. The implementations of `ConvLayer` and `PoolLayer` are different, and not as readable, but they are functionally equivalent and can be swapped into place. 111 | 112 | I will admit that I am a NumPy amateur. I am certain that the NumPy code could be rewritten to consolidate operations (such as broadcasting rather than repeating) and to make tremendously better use of the sophisticated capabilities of NumPy. 113 | 114 | And yet... this amateur implementation is approximately **1,000 times faster** than the naive implementation. 115 | 116 | A one-ConvLayer CNN using the stride groups implementation can complete an entire 100-minibatch epoch over 95% of the MNIST data set in a little over two minutes (vs. 18 hours for the example code). It is performant enough to observe and experiment with its learning capabilities on the MNIST data set. It exhibits a typical classification error under 30% *in one epoch*, that is, having seen each of the 65,500 samples only once. Again, continued training reduces the classification error below 3%. 117 | 118 | On the one hand, the fully-connected network performs better performance on this training set in a shorter period of time (*e.g.*, one minute). However, the fully-connected network requires 100 epochs, raising the prospect of overtraining. And of course, fully-connected network also scales poorly to larger data sets, since the number of weights of the first FC layer is (number of neurons * input size * number of channels). Fully-connected networks also fail to account for localized characteristics, so they will train more slowly and may be more susceptible to extraneous noise. 
119 | 120 | ## Conclusion 121 | 122 | This project demonstrates: 123 | 124 | * The capabilities of neural network architectures with an application for multi-class classification of the MNIST data set; 125 | 126 | * The validity of the example CNN architecture; 127 | 128 | * The computational power of NumPy when used well (and the importance of using it well!); and 129 | 130 | * An apparently novel computational technique for aligning the capabilities of NumPy with some typical neural network operations. 131 | -------------------------------------------------------------------------------- /cnn_numpy.py: -------------------------------------------------------------------------------- 1 | # cnn_numpy.py 2 | # Written by David Stein (david@djstein.com). See https://www.djstein.com/cnn-numpy/ for more info. 3 | # Source: https://github.com/neuron-whisperer/cnn-numpy 4 | 5 | # This code is an adaptation of Andrew Ng's convolutional neural network code: 6 | # https://www.coursera.org/learn/convolutional-neural-networks/programming/qO8ng/convolutional-model-step-by-step 7 | 8 | import math, numpy as np, random 9 | 10 | class ConvLayer: 11 | 12 | def __init__(self, num_filters, filter_width, stride = 1, padding = 0): 13 | self.fw = filter_width; self.n_f = num_filters; self.s = stride; self.p = padding 14 | self.W = None; self.b = None 15 | 16 | def forward_propagate(self, input): 17 | self.input = np.array(input) 18 | if self.p > 0: # pad input 19 | shape = ((0, 0), (self.p, self.p), (self.p, self.p), (0, 0)) 20 | self.input = np.pad(input, shape, mode='constant', constant_values = (0, 0)) 21 | self.W = np.random.random((self.fw, self.fw, self.input.shape[3], self.n_f)) * 0.01 22 | self.b = np.random.random((1, 1, 1, self.n_f)) * 0.01 23 | self.n_m = self.input.shape[0] # number of inputs 24 | self.ih = self.input.shape[1]; self.iw = self.input.shape[2] # input height and width 25 | self.oh = math.floor((self.ih - self.fw + 2 * self.p) / self.s) + 1 # output height 26 | self.ow = math.floor((self.iw - self.fw + 2 * self.p) / self.s) + 1 # output width 27 | self.Z = np.zeros((self.n_m, self.oh, self.ow, self.n_f)) 28 | for i in range(self.n_m): # iterate over inputs 29 | for h in range(self.oh): # iterate over output height 30 | ih1 = h * self.s; ih2 = ih1 + self.fw # calculate input window height coordinates 31 | for w in range(self.ow): # iterate over output width 32 | iw1 = w * self.s; iw2 = iw1 + self.fw # calculate input window width coordinates 33 | for f in range(self.n_f): # iterate over filters 34 | self.Z[i, h, w, f] = np.sum(self.input[i, ih1:ih2, iw1:iw2, :] * self.W[:, :, :, f]) 35 | self.Z += self.b[:, :, :, f] # calculate output 36 | return self.Z 37 | 38 | def backpropagate(self, dZ, learning_rate): 39 | dA_prev = np.zeros((self.n_m, self.ih, self.iw, self.n_f)) 40 | dW = np.zeros(self.W.shape); db = np.zeros(self.b.shape) 41 | for i in range(self.n_m): # iterate over inputs 42 | for h in range(self.oh): # iterate over output width 43 | ih1 = h * self.s; ih2 = ih1 + self.fw # calculate input window height coordinates 44 | for w in range(self.ow): # iterate over output width 45 | iw1 = w * self.s; iw2 = iw1 + self.fw # calculate input window width coordinates 46 | for f in range(self.n_f): # iterate over filters 47 | dA_prev[i, ih1:ih2, iw1:iw2, :] += self.W[:, :, :, f] * dZ[i, h, w, f] 48 | dW[:, :, :, f] += self.input[i, ih1:ih2, iw1:iw2, :] * dZ[i, h, w, f] 49 | db[:, :, :, f] += dZ[i, h, w, f] 50 | self.W -= dW * learning_rate; self.b -= db * learning_rate 51 | if self.p > 0: # 
remove padding 52 | dA_prev = dA_prev[:, self.p:-self.p, self.p:-self.p, :] 53 | return dA_prev 54 | 55 | class PoolLayer: 56 | 57 | def __init__(self, filter_width, stride = 1): 58 | self.fw = filter_width; self.s = stride 59 | 60 | def forward_propagate(self, input): 61 | self.input = input 62 | self.n_m = self.input.shape[0] # number of inputs 63 | self.ih = self.input.shape[1]; self.iw = self.input.shape[2] # input height and width 64 | self.oh = math.floor((self.ih - self.fw) / self.s) + 1 # output width 65 | self.ow = math.floor((self.iw - self.fw) / self.s) + 1 # output height 66 | self.n_f = self.input.shape[3] # output channels (same as input channels) 67 | self.Z = np.zeros((self.n_m, self.oh, self.ow, self.n_f)) 68 | for i in range(self.n_m): # iterate over inputs 69 | for h in range(self.oh): # iterate over output height 70 | ih1 = h * self.s; ih2 = ih1 + self.fw # calculate input window height coordinates 71 | for w in range(self.ow): # iterate over output width 72 | iw1 = w * self.s; iw2 = iw1 + self.fw # calculate input window width coordinates 73 | for f in range(self.n_f): # iterate over output channels 74 | self.Z[i, h, w, f] = self.pool(self.input[i, ih1:ih2, iw1:iw2, f]) 75 | return self.Z 76 | 77 | def backpropagate(self, dZ, learning_rate): 78 | dA_prev = np.zeros((self.n_m, self.ih, self.iw, self.n_f)) 79 | for i in range(self.n_m): # iterate over input images 80 | for h in range(self.oh): # iterate over output height 81 | ih1 = h * self.s; ih2 = ih1 + self.fw # calculate input window height coordinates 82 | for w in range(self.ow): # iterate over output width 83 | iw1 = w * self.s; iw2 = iw1 + self.fw # calculate input window width coordinates 84 | for f in range(self.n_f): # iterate over output channels 85 | slice = self.input[i, ih1:ih2, iw1:iw2, f] 86 | dA_prev[i, ih1:ih2, iw1:iw2, f] += self.gradient(slice, dZ[i, h, w, f]) 87 | return dA_prev 88 | 89 | class PoolLayer_Max(PoolLayer): 90 | 91 | def __init__(self, filter_width, stride = 1): 92 | self.pool = np.max 93 | self.gradient = lambda slice, dZ: dZ * (slice == np.max(slice)) 94 | super().__init__(filter_width, stride) 95 | 96 | class PoolLayer_Avg(PoolLayer): 97 | 98 | def __init__(self, filter_width, stride = 1): 99 | self.pool = np.mean 100 | self.gradient = lambda slice, dZ: np.ones((self.fw, self.fw)) * dZ / (self.fw ** 2) 101 | super().__init__(filter_width, stride) 102 | 103 | class FlatLayer: 104 | 105 | def forward_propagate(self, input): 106 | self.input_shape = input.shape 107 | return np.reshape(input, (input.shape[0], int(input.size / input.shape[0]))) 108 | 109 | def backpropagate(self, dZ, learning_rate): 110 | return np.reshape(dZ, self.input_shape) 111 | 112 | class FCLayer: 113 | 114 | def __init__(self, num_neurons): 115 | self.num_neurons = num_neurons; self.W = None 116 | 117 | def forward_propagate(self, input): 118 | if self.W is None: 119 | self.W = np.random.random((self.num_neurons, input.shape[1] + 1)) * 0.0001 120 | self.input = np.hstack([input, np.ones((input.shape[0], 1))]) # add bias inputs 121 | self.Z = np.dot(self.input, self.W.transpose()) 122 | return self.activate(self.Z) 123 | 124 | def backpropagate(self, dA, learning_rate): 125 | dZ = self.gradient(dA, self.Z) 126 | dW = np.dot(self.input.transpose(), dZ).transpose() / dA.shape[0] 127 | dA_prev = np.dot(dZ, self.W) 128 | dA_prev = np.delete(dA_prev, dA_prev.shape[1] - 1, 1) # remove bias inputs 129 | self.W = self.W - learning_rate * dW 130 | return dA_prev 131 | 132 | class FCLayer_ReLU(FCLayer): 133 | 134 | def 
__init__(self, num_neurons): 135 | self.activate = lambda Z: np.maximum(0.0, Z) 136 | self.gradient = lambda dA, Z: dA * (Z > 0.0) 137 | super().__init__(num_neurons) 138 | 139 | class FCLayer_Sigmoid(FCLayer): 140 | 141 | def __init__(self, num_neurons): 142 | self.activate = lambda Z: 1.0 / (1.0 + np.exp(-Z)) 143 | self.gradient = lambda dA, Z: dA / (1.0 + np.exp(-Z)) * (1.0 - (1.0 / (1.0 + np.exp(-Z)))) 144 | super().__init__(num_neurons) 145 | 146 | class FCLayer_Softmax(FCLayer): 147 | 148 | def __init__(self, num_neurons): 149 | self.activate = lambda Z: np.exp(1.0 / (1.0 + np.exp(-Z))) / np.expand_dims(np.sum(np.exp(1.0 / (1.0 + np.exp(-Z))), axis=1), 1) 150 | self.gradient = lambda dA, Z: dA / (1.0 + np.exp(-Z)) * (1.0 - (1.0 / (1.0 + np.exp(-Z)))) 151 | super().__init__(num_neurons) 152 | 153 | class Network: 154 | 155 | def __init__(self, layers = []): 156 | self.layers = layers 157 | 158 | def predict(self, X): 159 | A = np.array(X) 160 | for i in range(len(self.layers)): 161 | A = self.layers[i].forward_propagate(A) 162 | A = np.clip(A, 1e-15, None) # clip to avoid log(0) in CCE 163 | A += np.random.random(A.shape) * 0.00001 # small amount of noise to break ties 164 | return A 165 | 166 | def evaluate(self, X, Y): 167 | A = self.predict(X); Y = np.array(Y) 168 | cce = -np.sum(Y * np.log(A)) / A.shape[0] # categorical cross-entropy 169 | B = np.array(list(1.0 * (A[i] == np.max(A[i])) for i in range(A.shape[0]))) 170 | ce = np.sum(np.abs(B - Y)) / len(Y) / 2.0 # class error 171 | return (A, cce, ce) 172 | 173 | def train(self, X, Y, learning_rate): 174 | A, cce, ce = self.evaluate(X, Y) 175 | dA = A - Y 176 | for i in reversed(range(len(self.layers))): 177 | dA = self.layers[i].backpropagate(dA, learning_rate) 178 | return (np.copy(self.layers), cce, ce) 179 | -------------------------------------------------------------------------------- /cnn_numpy_sg.py: -------------------------------------------------------------------------------- 1 | # cnn_numpy_sg.py 2 | # Written by David Stein (david@djstein.com). See https://www.djstein.com/cnn-numpy/ for more info. 3 | # Source: https://github.com/neuron-whisperer/cnn-numpy 4 | 5 | # This code improves upon cnn_numpy.py by implementing ConvLayer and PoolLayers using stride groups. 
6 | 7 | import math, numpy as np, random 8 | 9 | class ConvLayer: 10 | 11 | def __init__(self, num_filters, filter_width, stride = 1, padding = 0): 12 | self.fw = filter_width; self.n_f = num_filters; self.s = stride; self.p = padding 13 | self.W = None; self.b = None 14 | 15 | def forward_propagate(self, input): 16 | input = np.array(input) 17 | if self.p > 0: # pad input 18 | shape = ((0, 0), (self.p, self.p), (self.p, self.p), (0, 0)) 19 | input = np.pad(input, shape, mode='constant', constant_values = (0, 0)) 20 | im, ih, iw, id = input.shape; s = self.s; fw = self.fw; f = self.n_f 21 | self.input_shape = input.shape 22 | if self.W is None: 23 | self.W = np.random.random((self.fw, self.fw, id, self.n_f)) * 0.1 24 | self.b = np.random.random((1, 1, 1, self.n_f)) * 0.01 25 | self.n_rows = math.ceil(min(fw, ih - fw + 1) / s) 26 | self.n_cols = math.ceil(min(fw, iw - fw + 1) / s) 27 | z_h = int(((ih - fw) / s) + 1); z_w = int(((iw - fw) / s) + 1) 28 | self.Z = np.empty((im, z_h, z_w, f)); self.input_blocks = [] 29 | for t in range(self.n_rows): 30 | self.input_blocks.append([]) 31 | b = ih - (ih - t) % fw 32 | cols = np.empty((im, int((b - t) / fw), z_w, f)) 33 | for i in range(self.n_cols): 34 | l = i * s; r = iw - (iw - l) % fw 35 | block = input[:, t:b, l:r, :] 36 | block = np.array(np.split(block, (r - l) / fw, 2)) 37 | block = np.array(np.split(block, (b - t) / fw, 2)) 38 | block = np.moveaxis(block, 2, 0) 39 | block = np.expand_dims(block, 6) 40 | self.input_blocks[t].append(block) 41 | block = block * self.W 42 | block = np.sum(block, 5) 43 | block = np.sum(block, 4) 44 | block = np.sum(block, 3) 45 | cols[:, :, i::self.n_cols, :] = block 46 | self.Z[:, t * s ::self.n_rows, :, :] = cols 47 | self.Z += self.b 48 | self.A = np.abs(self.Z) # ReLU activation 49 | return self.A 50 | 51 | def backpropagate(self, dZ, learning_rate): 52 | im, ih, iw, id = self.input_shape; s = self.s; fw = self.fw; f = self.n_f 53 | n_rows = self.n_rows; n_cols = self.n_cols 54 | dA_prev = np.zeros((im, ih, iw, id)) 55 | dW = np.zeros(self.W.shape); db = np.zeros(self.b.shape) 56 | for t in range(n_rows): 57 | row = dZ[:, t::n_rows, :, :] 58 | for l in range(n_cols): 59 | b = (ih - t * s) % fw; r = (iw - l * s) % fw # region of input and dZ for this block 60 | block = row[:, :, l * s::n_cols, :] # block = corresponding region of dA 61 | block = np.expand_dims(block, 3) # axis for channels 62 | block = np.expand_dims(block, 3) # axis for rows 63 | block = np.expand_dims(block, 3) # axis for columns 64 | dW_block = block * self.input_blocks[t][l] 65 | dW_block = np.sum(dW_block, 2) 66 | dW_block = np.sum(dW_block, 1) 67 | dW_block = np.sum(dW_block, 0) 68 | dW += dW_block 69 | db_block = np.sum(dW_block, 2, keepdims=True) 70 | db_block = np.sum(db_block, 1, keepdims=True) 71 | db_block = np.sum(db_block, 0, keepdims=True) 72 | db += db_block 73 | dA_prev_block = block * self.W 74 | dA_prev_block = np.sum(dA_prev_block, 6) 75 | dA_prev_block = np.reshape(dA_prev_block, (im, ih - b - t, iw - r - l, id)) 76 | dA_prev[:, t:ih - b, l:iw - r, :] += dA_prev_block 77 | self.W -= dW * learning_rate; self.b -= db * learning_rate 78 | if self.p > 0: # remove padding 79 | dA_prev = dA_prev[:, self.p:-self.p, self.p:-self.p, :] 80 | return dA_prev 81 | 82 | class PoolLayer: 83 | 84 | def __init__(self, filter_width, stride = 1): 85 | self.fw = filter_width; self.s = stride 86 | 87 | def forward_propagate(self, input): 88 | im, ih, iw, id = input.shape; fw = self.fw; s = self.s 89 | self.n_rows = math.ceil(min(fw, ih - fw 
+ 1) / s) 90 | self.n_cols = math.ceil(min(fw, iw - fw + 1) / s) 91 | z_h = int(((ih - fw) / s) + 1); z_w = int(((iw - fw) / s) + 1) 92 | self.Z = np.empty((im, z_h, z_w, id)); self.input = input 93 | for t in range(self.n_rows): 94 | b = ih - (ih - t) % fw 95 | Z_cols = np.empty((im, int((b - t) / fw), z_w, id)) 96 | for i in range(self.n_cols): 97 | l = i * s; r = iw - (iw - l) % fw 98 | block = input[:, t:b, l:r, :] 99 | block = np.array(np.split(block, (r - l) / fw, 2)) 100 | block = np.array(np.split(block, (b - t) / fw, 2)) 101 | block = self.pool(block, 4) 102 | block = self.pool(block, 3) 103 | block = np.moveaxis(block, 0, 2) 104 | block = np.moveaxis(block, 0, 2) 105 | Z_cols[:, :, i::self.n_cols, :] = block 106 | self.Z[:, t * s ::self.n_rows, :, :] = Z_cols 107 | return self.Z 108 | 109 | def assemble_block(self, block, t, b, l, r): 110 | ih = self.input.shape[1]; iw = self.input.shape[2] 111 | block = np.repeat(block, self.fw ** 2, 2) 112 | block = np.array(np.split(block, block.shape[2] / self.fw, 2)) 113 | block = np.moveaxis(block, 0, 2) 114 | block = np.array(np.split(block, block.shape[2] / self.fw, 2)) 115 | block = np.moveaxis(block, 0, 3) 116 | return np.reshape(block, (self.input.shape[0], ih - t - b, iw - l - r, self.input.shape[3])) 117 | 118 | class PoolLayer_Max(PoolLayer): 119 | 120 | def __init__(self, filter_width, stride = 1): 121 | self.pool = np.max 122 | super().__init__(filter_width, stride) 123 | 124 | def backpropagate(self, dZ, learning_rate): 125 | im, ih, iw, id = self.input.shape 126 | fw = self.fw; s = self.s; n_rows = self.n_rows; n_cols = self.n_cols 127 | dA_prev = np.zeros(self.input.shape) 128 | 129 | for t in range(n_rows): 130 | mask_row = self.Z[:, t::n_rows, :, :] 131 | row = dZ[:, t::self.n_rows, :, :] 132 | for l in range(self.n_cols): 133 | b = (ih - t * s) % fw; r = (iw - l * s) % fw 134 | mask = mask_row[:, :, l * s::n_cols, :] 135 | mask = self.assemble_block(mask, t, b, l, r) 136 | block = row[:, :, l * s::n_cols, :] 137 | block = self.assemble_block(block, t, b, l, r) 138 | mask = (self.input[:, t:ih - b, l:iw - r, :] == mask) 139 | dA_prev[:, t:ih - b, l:iw - r, :] += block * mask 140 | return dA_prev 141 | 142 | class PoolLayer_Avg(PoolLayer): 143 | 144 | def __init__(self, filter_width, stride = 1): 145 | self.pool = np.mean 146 | super().__init__(filter_width, stride) 147 | 148 | def backpropagate(self, dZ, learning_rate): 149 | im, ih, iw, id = self.input.shape 150 | fw = self.fw; s = self.s; n_rows = self.n_rows; n_cols = self.n_cols 151 | dA_prev = np.zeros(self.input.shape) 152 | 153 | for t in range(n_rows): 154 | row = dZ[:, t::n_rows, :, :] 155 | for l in range(n_cols): 156 | b = (ih - t * s) % fw; r = (iw - l * s) % fw 157 | block = row[:, :, l * s::n_cols, :] 158 | block = self.assemble_block(block, t, b, l, r) 159 | dA_prev[:, t:ih - b, l:iw - r, :] += block / (fw ** 2) 160 | return dA_prev 161 | 162 | class FlatLayer: 163 | 164 | def forward_propagate(self, input): 165 | self.input_shape = input.shape 166 | return np.reshape(input, (input.shape[0], int(input.size / input.shape[0]))) 167 | 168 | def backpropagate(self, dZ, learning_rate): 169 | return np.reshape(dZ, self.input_shape) 170 | 171 | class FCLayer: 172 | 173 | def __init__(self, num_neurons): 174 | self.num_neurons = num_neurons; self.W = None 175 | 176 | def forward_propagate(self, input): 177 | if self.W is None: 178 | self.W = np.random.random((self.num_neurons, input.shape[1] + 1)) * 0.0001 179 | self.input = np.hstack([input, np.ones((input.shape[0], 
1))]) # add bias inputs 180 | self.Z = np.dot(self.input, self.W.transpose()) 181 | return self.activate(self.Z) 182 | 183 | def backpropagate(self, dA, learning_rate): 184 | dZ = self.gradient(dA, self.Z) 185 | dW = np.dot(self.input.transpose(), dZ).transpose() / dA.shape[0] 186 | dA_prev = np.dot(dZ, self.W) 187 | dA_prev = np.delete(dA_prev, dA_prev.shape[1] - 1, 1) # remove bias inputs 188 | self.W = self.W - learning_rate * dW 189 | return dA_prev 190 | 191 | class FCLayer_ReLU(FCLayer): 192 | 193 | def __init__(self, num_neurons): 194 | self.activate = lambda Z: np.maximum(0.0, Z) 195 | self.gradient = lambda dA, Z: dA * (Z > 0.0) 196 | super().__init__(num_neurons) 197 | 198 | class FCLayer_Sigmoid(FCLayer): 199 | 200 | def __init__(self, num_neurons): 201 | self.activate = lambda Z: 1.0 / (1.0 + np.exp(-Z)) 202 | self.gradient = lambda dA, Z: dA / (1.0 + np.exp(-Z)) * (1.0 - (1.0 / (1.0 + np.exp(-Z)))) 203 | super().__init__(num_neurons) 204 | 205 | class FCLayer_Softmax(FCLayer): 206 | 207 | def __init__(self, num_neurons): 208 | self.activate = lambda Z: np.exp(1.0 / (1.0 + np.exp(-Z))) / np.expand_dims(np.sum(np.exp(1.0 / (1.0 + np.exp(-Z))), axis=1), 1) 209 | self.gradient = lambda dA, Z: dA / (1.0 + np.exp(-Z)) * (1.0 - (1.0 / (1.0 + np.exp(-Z)))) 210 | super().__init__(num_neurons) 211 | 212 | class Network: 213 | 214 | def __init__(self, layers = []): 215 | self.layers = layers 216 | 217 | def predict(self, X): 218 | A = np.array(X) 219 | for i in range(len(self.layers)): 220 | A = self.layers[i].forward_propagate(A) 221 | A = np.clip(A, 1e-15, None) # clip to avoid log(0) in CCE 222 | A += np.random.random(A.shape) * 0.00001 # small amount of noise to break ties 223 | return A 224 | 225 | def evaluate(self, X, Y): 226 | A = self.predict(X); Y = np.array(Y) 227 | cce = -np.sum(Y * np.log(A)) / A.shape[0] # categorical cross-entropy 228 | B = np.array(list(1.0 * (A[i] == np.max(A[i])) for i in range(A.shape[0]))) 229 | ce = np.sum(np.abs(B - Y)) / len(Y) / 2.0 # class error 230 | return (A, cce, ce) 231 | 232 | def train(self, X, Y, learning_rate): 233 | A, cce, ce = self.evaluate(X, Y) 234 | dA = A - Y 235 | for i in reversed(range(len(self.layers))): 236 | dA = self.layers[i].backpropagate(dA, learning_rate) 237 | return (np.copy(self.layers), cce, ce) 238 | -------------------------------------------------------------------------------- /images/stride-groups.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuron-whisperer/cnn-numpy/e712a44289dc78e4fea0b037db3a26c3b467c7a7/images/stride-groups.png -------------------------------------------------------------------------------- /images/strides.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuron-whisperer/cnn-numpy/e712a44289dc78e4fea0b037db3a26c3b467c7a7/images/strides.png -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | cnn-numpy 2 | 3 |

cnn-numpy

4 |

Experiments with implementing a convolutional neural network (CNN) in NumPy.

5 |

Written by David Stein (david@djstein.com). See FAQ.html for more information.

6 |

Also available at https://www.github.com/neuron-whisperer/cnn-numpy.

7 |

The Short Version

8 |

The purpose of this project is to provide a fully working, NumPy-only implementation of a convolutional neural network, with a few goals:

9 |
  1. Present the simplest, cleanest example of a fully implemented, working CNN as a minimal adaptation of Dr. Andrew Ng's example code. See: cnn_numpy.py.

  2. Provide a working application of the above example to classify the MNIST database of handwritten digits. See: mnist.py.
15 |

While these two efforts were successful, the example code runs prohibitively slowly - because it is designed for clarity and education, not actual use. It is not possible to train the model to do anything meaningful in a reasonable amount of time. Thus, the following objectives were added:

16 |
  1. Provide an alternative implementation of the above example that runs equivalently, but makes much better use of NumPy and runs in a reasonable amount of time. (This redesign involves an apparently novel computational technique of stride groups, described below.) See: cnn_numpy_sg.py.
19 |

Background

20 |

Convolutional neural networks are the current primary model of deep learning for tasks such as image analysis. Dr. Andrew Ng, one of the contemporary experts in machine learning, offers an excellent Coursera course on convolutional neural networks that includes an assignment entitled "Build a Convolutional Neural Network Step by Step."

21 |

The assignment is presented as a Jupyter notebook with NumPy-based code and a detailed explanation of the calculations, including helpful illustrations. The code includes a convolutional layer, a MaxPool layer, and an AvgPool layer. The Jupyter notebook also includes unit tests to show that the individual functions work.

22 |

However, the code in the Jupyter notebook has a few issues that prevent its use for anything beyond the confines of the assignment:

23 |
  1. The code is a bit... primitive. It is structured as a set of global functions. Parameters are passed in a "cache" dictionary, with names identified by strings. Some variables are cached and then never used. Different variable names are sometimes used between a calling function and a called function. Tracing the interconnection of the functions is rather painful due to big chunks of explanatory text, unit tests, and output.

  2. The implementation in the Jupyter notebook is not complete! The convolutional layers calculate dW/db, but do not update the filters. The implementations do not include a fully connected layer - these are developed in previous exercises - nor a flattening layer (which is trivial to implement, but its absence is frustrating).

  3. The implementation in the Jupyter notebook is not applied to any kind of problem. It is simply presented as an example. The next exercise in the course sequence involves applying a CNN to the MNIST handwritten digits data set - and yet, it does not use the CNN implementation from this assignment; instead, it switches to a pure TensorFlow implementation. (To make matters worse, the TensorFlow implementation requires TensorFlow 1.x, which will of course not run in the current TensorFlow 2.x environment!)
31 |

These limitations create two problems.

32 |

Problem #1: While reviewing "Build a Convolutional Neural Network Step by Step," it is very difficult to take a step back and see a fully-realized CNN that uses this code, including training. It is also disappointing that the code is not shown to run against any kind of sample problem. Students cannot easily see it in action. Students cannot experiment with the structure or hyperparameters to see how performance may differ. And students cannot easily apply this code to other standard data sets to develop any kind of skill and confidence in this knowledge.

33 |

(As an avid student of machine learning, I am continually irked by the gap between the skill set of theoretical knowledge, such as how models work, and the skill set of applied machine learning, such as the available platforms and the common theories of how to use them. Anything that can be done to bridge this gap with actual knowledge will, I believe, be helpful for promoting a comprehensive understanding of the field.)

34 |

Basic Implementation

35 |

Based on the above, I set out to fill in the gaps. I sought to refactor, streamline, and just generally normalize all of Dr. Ng's code into a tight, cohesive CNN library - one that faithfully reproduces the algorithms from the example code, but in a much more readable, usable way. Also, I sought to apply this model to the MNIST database in order to show that it works. The result of this effort is cnn_numpy.py and mnist.py:

36 |

cnn_numpy.py is a 180-line file that includes:

a ConvLayer class; PoolLayer_Max and PoolLayer_Avg pooling layers; a FlatLayer; fully connected layers with ReLU, sigmoid, and softmax activations (FCLayer_ReLU, FCLayer_Sigmoid, FCLayer_Softmax); and a Network class that chains the layers together for prediction, evaluation, and training.

And mnist.py is an 80-line file that loads the MNIST handwritten digits data set from Dr. Yann LeCun's website, unpacks it, and trains any combination of layers from cnn_numpy.py to perform multiclass classification. Students may freely experiment with different architectures by changing one line of code, like this non-CNN ("flat") model:

50 |
net = Network([FlatLayer(), FCLayer_ReLU(100), FCLayer_Softmax(10)])
51 | 

...or this CNN model:

52 |
net = Network([ConvLayer(32, 3), PoolLayer_Max(2, 2), FlatLayer(), FCLayer_ReLU(100), FCLayer_Softmax(10)])
53 | 

This is a TensorFlow-like syntax, but unlike TensorFlow, cnn_numpy.py is completely transparent and readable.
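For comparison only - and purely as a hedged sketch that is not part of this project - the same model expressed in TensorFlow 2.x's Keras API would look roughly like this (layer arguments chosen to mirror the example above):

import tensorflow as tf

keras_net = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),  # 32 filters, 3x3
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

The difference is that every line of the NumPy version can be opened up and inspected.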

54 |

For convenience, mnist.py allows you to select either architecture with a suitable set of training parameters:

55 |
python3 mnist.py flat naive
56 | python3 mnist.py cnn naive
57 | 

The flat architecture works great. On a typical machine, it can train from scratch with a 95% training split of the data set to reach a classification error under 10% within one minute, and under 5% in three minutes.

58 |

The CNN architecture also runs... slowly. Brutally slowly. Unusably slowly.

59 |

Problem #2: The code from "Build a Convolutional Neural Network Step by Step" is prohibitively slow. So slow that it is not realistically possible to apply it to a problem or to see it learn anything. Thus, it is difficult to see that it even works, let alone experiment with it.

60 |

The ConvLayer and PoolLayer classes require a four-layer-deep loop in Python for each of the forward and backward passes. The loop iterates over: (1) the inputs in the input tensor, (2) the height of each input, (3) the width of each input, and (4) each filter in the convolutional layer.
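As a rough, runnable illustration of that loop structure (toy shapes, paraphrasing the naive ConvLayer described above rather than quoting it):

import numpy as np

x = np.random.rand(4, 8, 8, 1)      # images: (batch, height, width, channels)
W = np.random.rand(3, 3, 1, 2)      # filters: (filter height, filter width, channels, num_filters)
b = np.zeros(2)
oh = ow = 8 - 3 + 1                 # output size for a stride of 1
Z = np.zeros((4, oh, ow, 2))
for i in range(4):                  # (1) every image in the batch
    for h in range(oh):             # (2) every output row
        for w in range(ow):         # (3) every output column
            for f in range(2):      # (4) every filter
                Z[i, h, w, f] = np.sum(x[i, h:h+3, w:w+3, :] * W[:, :, :, f]) + b[f]

Every pass through the innermost body is a separate, tiny NumPy call, which is why the total count explodes for realistic batch sizes and image dimensions.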

61 |

As a result of this deeply nested iteration, processing one epoch of the MNIST data set with a one-ConvLayer CNN with 32 filters would require 28 trillion NumPy calculations. NumPy is indeed fast, but executing literally trillions of small NumPy calculations requires an absurdly long period of time.

62 |

I suppose that these factors explain the hard transition in the CNN course sequence from the NumPy-based "Build a Convolutional Neural Network Step by Step" code to the TensorFlow-based code for the immediately following lesson. But it does feel like a bait-and-switch: "Here is some code... now that we've developed all of that, let's use none of it, and instead switch to a totally different platform."

63 |

The challenge presented at this stage: how can the pooling and convolutional layers from "Build a Convolutional Neural Network Step by Step" be modified - using the same model and only simple NumPy - to make much better use of NumPy's array processing capabilities, so that iteration is vastly reduced?

64 |

Stride Groups

65 |

NumPy is capable of performing operations on large, multidimensional arrays far faster than plain Python. NumPy also features sophisticated array indexing and slicing operations, as well as broadcasting, which permits element-wise multiplication between a multi-element axis of one array and a one-element axis of another. However, the four-layer loop makes very poor use of these capabilities: it uses NumPy only for tiny element-wise multiplications, one window at a time.
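As a small, self-contained illustration of that broadcasting pattern (the shapes below are invented for the example and this is not code from the project), a single multiplication can cover every image, every window, every channel, and every filter at once:

import numpy as np

windows = np.random.rand(64, 13, 13, 2, 2, 3, 1)  # (images, out_h, out_w, fh, fw, channels, 1)
filters = np.random.rand(2, 2, 3, 32)             # (fh, fw, channels, num_filters)
products = windows * filters                      # broadcasts to (64, 13, 13, 2, 2, 3, 32)
Z = products.sum(axis=(3, 4, 5))                  # (64, 13, 13, 32): one value per filter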

66 |

Each implementation in the example code requires iteration over (1) the number of training samples, (2) the output width and height of each sample (which are based on the square area of the input image), and (3) the number of filters. And yet, all of these quantities are already dimensions of the arrays involved: the input to each layer, the weight matrix, etc. So it is quite feasible to operate on large subsections of those arrays at once instead of performing lots of element-wise operations.

67 |

One idea is to iterate over the elements of each filter rather than the elements of each image, since the filters are smaller. Theoretically, element-wise multiplication of each element of each filter would require (f_w × f_w × c) iterations - in the network above: (3 × 3 × 32) = 288 iterations, which is certainly an improvement. (But this methodology is not applicable to the pooling layers.) Still - we can do better.
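For concreteness, that idea might look like the following sketch - illustrative only, not the project's code - where each iteration multiplies a whole shifted slice of every input image against one position of one filter; for the single-channel network above, that is 3 × 3 × 32 = 288 passes:

import numpy as np

def conv_by_filter_elements(x, W, stride=1):
    m, ih, iw, c = x.shape                         # x: (images, height, width, channels)
    fw, _, _, n_f = W.shape                        # W: (fw, fw, channels, num_filters)
    oh = (ih - fw) // stride + 1
    ow = (iw - fw) // stride + 1
    Z = np.zeros((m, oh, ow, n_f))
    for i in range(fw):                            # filter row
        for j in range(fw):                        # filter column
            for f in range(n_f):                   # filter index
                patch = x[:, i:i + oh * stride:stride, j:j + ow * stride:stride, :]
                Z[:, :, :, f] += np.sum(patch * W[i, j, :, f], axis=3)
    return Z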

68 |

A much better idea - and the heart of the following technique - is to reorient the input into stride groups. A stride group is the set of input elements for all non-overlapping strides, assembled into a sub-matrix of elements and aligned with a corresponding element of the filter matrix with which those elements of the input are multiplied during convolution - like this:

69 |

Convolution with Stride Groups

70 |

This approach has an extremely important advantage: the number of iterations is independent of the size of the input, the number of input channels, and the number of filters.

71 |

In this approach, four total iterations are required whether the input is 4x4, or 5x5, or 100x100. Four total iterations are required whether the number of filters is 1, or 32, or 10,000. The number of iterations depends strictly on the filter width and the stride. Larger inputs and larger numbers of filters increase the sizes of the matrices that are multiplied together in each iteration, but not the number of iterations.

72 |

Moreover: if the filter width and the stride are equal - for example, a 2x2 filter with a stride of 2 - then only one stride group exists... and all of the convolution is performed in a single matrix multiplication. Iteration is entirely eliminated.
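A minimal sketch of that special case, assuming the filter width equals the stride and divides the input size evenly (the helper below is illustrative, not the cnn_numpy_sg.py code): the input is carved into non-overlapping windows once, and a single tensor contraction produces every output for every filter.

import numpy as np

def conv_one_stride_group(x, W, b):
    m, ih, iw, c = x.shape                        # x: (images, height, width, channels)
    fw, _, _, n_f = W.shape                       # W: (fw, fw, channels, num_filters); stride == fw
    oh, ow = ih // fw, iw // fw                   # assumes ih and iw are multiples of fw
    blocks = x.reshape(m, oh, fw, ow, fw, c).transpose(0, 1, 3, 2, 4, 5)   # (m, oh, ow, fw, fw, c)
    return np.tensordot(blocks, W, axes=([3, 4, 5], [0, 1, 2])) + b        # (m, oh, ow, n_f)

x = np.random.rand(8, 28, 28, 1)                  # e.g., a mini-batch of MNIST-sized images
W = np.random.rand(2, 2, 1, 32) * 0.1
print(conv_one_stride_group(x, W, np.zeros(32)).shape)                     # (8, 14, 14, 32)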

73 |

Stride Groups: Implementation

74 |

The redesign of the neural network architecture using stride groups is presented in cnn_numpy_sg.py, and can be used by omitting the "naive" parameter from the command:

75 |
python3 mnist.py cnn
76 | 

The redesign presents identical versions of FlatLayer, FCLayer, and Network. The implementations of ConvLayer and PoolLayer are different, and not as readable, but they are functionally equivalent and can be swapped into place.
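Because the layer classes share names and constructor signatures across both files, swapping implementations is just a matter of which module is imported (a usage sketch mirroring mnist.py):

from cnn_numpy_sg import ConvLayer, PoolLayer_Max, FlatLayer, FCLayer_ReLU, FCLayer_Softmax, Network
# from cnn_numpy import ...          # the naive implementation accepts the same model definition

net = Network([ConvLayer(32, 3), PoolLayer_Max(2, 2), FlatLayer(), FCLayer_ReLU(100), FCLayer_Softmax(10)])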

77 |

I will admit that I am a NumPy amateur. I am certain that the NumPy code could be rewritten to consolidate operations (such as broadcasting rather than repeating) and to make tremendously better use of the sophisticated capabilities of NumPy.

78 |

And yet... this amateur implementation is approximately 1,000 times faster than the naive implementation.

79 |

A one-ConvLayer CNN using the stride groups implementation can complete an entire 100-minibatch epoch over 95% of the MNIST data set in a little over two minutes (vs. 18 hours for the example code). It is performant enough to observe and experiment with its learning capabilities on the MNIST data set. It exhibits a typical classification error under 30% after one epoch - that is, having seen each of the 65,500 samples only once - and continued training reduces the classification error below 3%.

80 |

On the one hand, the fully-connected network achieves better performance on this training set in a shorter period of time (e.g., one minute). However, the fully-connected network requires 100 epochs, raising the prospect of overtraining. And of course, the fully-connected network also scales poorly to larger data sets, since the number of weights of the first FC layer is (number of neurons × input size × number of channels). Fully-connected networks also fail to account for localized characteristics, so they will train more slowly and may be more susceptible to extraneous noise.
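For a concrete sense of that scaling - a back-of-envelope calculation assuming the FCLayer shown above (one bias weight per neuron) and raw 28x28x1 MNIST images:

num_neurons = 100
flattened_input = 28 * 28 * 1                     # FlatLayer output size for one MNIST image
print(num_neurons * (flattened_input + 1))        # 78500 weights in the first FC layer alone

Larger images or more channels inflate that count immediately, while a convolutional layer's weight count depends only on the filter size, the number of input channels, and the number of filters.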

81 |

Conclusion

82 |

This project demonstrates:

83 | 93 |
94 |

More Projects

95 |
96 | 97 | 98 | -------------------------------------------------------------------------------- /mnist.py: -------------------------------------------------------------------------------- 1 | # mnist.py 2 | # Written by David Stein (david@djstein.com). See https://www.djstein.com/cnn-numpy/ for more info. 3 | # Source: https://github.com/neuron-whisperer/cnn-numpy 4 | 5 | import gzip, math, numpy as np, os, random, requests, sys, time, warnings 6 | 7 | def load_mnist_database(): 8 | # read MNIST database according to format (http://yann.lecun.com/exdb/mnist/) 9 | path = os.path.dirname(os.path.realpath(__file__)); data_set = [] 10 | for name in ['train-images-idx3-ubyte.gz', 't10k-images-idx3-ubyte.gz', 11 | 'train-labels-idx1-ubyte.gz', 't10k-labels-idx1-ubyte.gz']: 12 | full_name = f'{path}/{name}' 13 | if not os.path.exists(full_name): 14 | r = requests.get(f'http://yann.lecun.com/exdb/mnist/{name}') 15 | with open(full_name, 'wb') as file: 16 | file.write(r.content) 17 | f = gzip.open(full_name) 18 | if 'images' in name: 19 | header = f.read(16) 20 | num_images = int.from_bytes(header[4:8], byteorder='big') 21 | image_size = int.from_bytes(header[8:12], byteorder='big') 22 | buf = f.read(image_size * image_size * num_images) 23 | data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32) 24 | data_set.append(data.reshape(num_images, image_size, image_size, 1)) 25 | else: 26 | header = f.read(8) 27 | num_labels = int.from_bytes(header[4:8], byteorder='big') 28 | buf = f.read(image_size * image_size * num_images) 29 | data_set.append(np.frombuffer(buf, dtype=np.uint8)) 30 | x = np.concatenate((data_set[0], data_set[1])); y = np.concatenate((data_set[2], data_set[3])) 31 | num_classes = np.max(y) + 1; y = np.eye(num_classes)[y] # one-hot-encoded labels 32 | return (x, y) 33 | 34 | def split_database(x, y, ratio): 35 | indices = list(range(len(x))); random.shuffle(indices) 36 | split = int(len(x) * ratio) 37 | train_x = list(x[i] for i in indices[:split]) 38 | test_x = list(x[i] for i in indices[split:]) 39 | train_y = list(y[i] for i in indices[:split]) 40 | test_y = list(y[i] for i in indices[split:]) 41 | return ((train_x, train_y), (test_x, test_y)) 42 | 43 | def train_mnist(net, learning_rate, num_epochs, mini_batches, split): 44 | x, y = load_mnist_database(); history = [] 45 | training_set, test_set = split_database(x, y, split) 46 | start_time = time.time() 47 | for e in range(num_epochs): 48 | for b in range(mini_batches): 49 | start = int(len(training_set[0]) * b / mini_batches) 50 | stop = int(len(training_set[0]) * (b + 1) / mini_batches) 51 | mb_x = training_set[0][start:stop]; mb_y = training_set[1][start:stop] 52 | history.append(net.train(mb_x, mb_y, learning_rate)) 53 | t = int(time.time() - start_time); t_str = '%2dm %2ds' % (int(t / 60), t % 60) 54 | _, cce, ce = history[-1] 55 | print(f'\rTime: {t_str} Epoch: {e+1:4}, {b+1:4}/{mini_batches} Mini-Batch Size: {len(mb_x)} CCE: {cce:0.4f} CE: {ce:0.4f}', end='') 56 | print('') 57 | cce, ce = net.run(test_set[0], test_set[1]) 58 | print(f'Test set: CCE: {cce:0.3f} CE: {ce:0.3f}') 59 | 60 | if __name__ == '__main__': 61 | 62 | use_cnn = (len(sys.argv) > 1 and sys.argv[1].lower() == 'cnn') 63 | use_flat = (len(sys.argv) > 1 and sys.argv[1].lower() == 'flat') 64 | if not (use_cnn or use_flat): 65 | print('Syntax: python3 mnist.py (cnn | flat) [sg]'); sys.exit(1) 66 | naive = (len(sys.argv) > 2 and sys.argv[2].lower() == 'naive') 67 | if naive: 68 | from cnn_numpy import * # naive implementation 69 | else: 70 | from 
cnn_numpy_sg import * # stride groups implementation 71 | 72 | with warnings.catch_warnings(): 73 | np.set_printoptions(threshold=sys.maxsize) 74 | warnings.simplefilter("ignore") # disable numpy overflow warnings 75 | net = [FlatLayer(), FCLayer_ReLU(100), FCLayer_Softmax(10)] 76 | learning_rate = 0.01; num_epochs = 100; mini_batches = 10 77 | if use_cnn: 78 | net[0:0] = [ConvLayer(32, 2), PoolLayer_Max(3, 3)] 79 | learning_rate = 0.001; num_epochs = 3; mini_batches = 6550 if naive else 500 80 | train_mnist(Network(net), learning_rate, num_epochs, mini_batches, split = 0.95) 81 | --------------------------------------------------------------------------------