├── autograd
│   ├── grads.png
│   ├── test.png
│   ├── grads2.png
│   ├── simple.png
│   ├── grads-final.png
│   └── index.md
├── dl-ocr-demo
│   ├── h.png
│   ├── boxed.png
│   ├── boxed-only.png
│   ├── cleaned.jpeg
│   ├── h-poster.png
│   ├── t-poster.png
│   └── index.md
├── dl-convolutional
│   ├── poster.png
│   ├── animation.mp4
│   ├── confusion-letters.png
│   ├── index.md
│   └── gelu.svg
├── hello-deep-learning
│   ├── diff.png
│   ├── boxed.png
│   ├── prod3.png
│   ├── seven.png
│   ├── three.png
│   ├── learning.mp4
│   └── index.md
├── first-learning
│   ├── random-image.png
│   ├── random-prod.png
│   ├── weights-anim.gif
│   ├── random-weights.png
│   └── index.md
├── hello-deep-learning-chapter1
│   ├── diff.png
│   ├── prod3.png
│   ├── prod7.png
│   ├── seven.png
│   ├── three.png
│   ├── sevens.png
│   ├── threes.png
│   ├── wrong-7-22.png
│   └── index.md
├── hyperparameters-inspection-adam
│   ├── sgd.gif
│   ├── sgd-complex-momentum.gif
│   ├── sgd-complex-no-momentum.gif
│   └── index.md
├── README.md
├── dropout-data-augmentation-weight-decay
│   ├── weight-decay-wait-evolution-scatter.png
│   └── index.md
├── LICENSE
├── dl-gru-lstm-dna
│   └── index.md
├── dl-and-now-what
│   └── index.md
├── dl-what-does-it-all-mean
│   └── index.md
├── hello-deep-learning-intro
│   └── index.md
└── handwritten-digits-sgd-batches
    └── index.md
/autograd/grads.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/grads.png
--------------------------------------------------------------------------------
/autograd/test.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/test.png
--------------------------------------------------------------------------------
/dl-ocr-demo/h.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/h.png
--------------------------------------------------------------------------------
/autograd/grads2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/grads2.png
--------------------------------------------------------------------------------
/autograd/simple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/simple.png
--------------------------------------------------------------------------------
/dl-ocr-demo/boxed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/boxed.png
--------------------------------------------------------------------------------
/autograd/grads-final.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/grads-final.png
--------------------------------------------------------------------------------
/dl-ocr-demo/boxed-only.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/boxed-only.png
--------------------------------------------------------------------------------
/dl-ocr-demo/cleaned.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/cleaned.jpeg
--------------------------------------------------------------------------------
/dl-ocr-demo/h-poster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/h-poster.png
--------------------------------------------------------------------------------
/dl-ocr-demo/t-poster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/t-poster.png
--------------------------------------------------------------------------------
/dl-convolutional/poster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/poster.png
--------------------------------------------------------------------------------
/hello-deep-learning/diff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/diff.png
--------------------------------------------------------------------------------
/dl-convolutional/animation.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/animation.mp4
--------------------------------------------------------------------------------
/first-learning/random-image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/random-image.png
--------------------------------------------------------------------------------
/first-learning/random-prod.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/random-prod.png
--------------------------------------------------------------------------------
/first-learning/weights-anim.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/weights-anim.gif
--------------------------------------------------------------------------------
/hello-deep-learning/boxed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/boxed.png
--------------------------------------------------------------------------------
/hello-deep-learning/prod3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/prod3.png
--------------------------------------------------------------------------------
/hello-deep-learning/seven.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/seven.png
--------------------------------------------------------------------------------
/hello-deep-learning/three.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/three.png
--------------------------------------------------------------------------------
/first-learning/random-weights.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/random-weights.png
--------------------------------------------------------------------------------
/hello-deep-learning/learning.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/learning.mp4
--------------------------------------------------------------------------------
/dl-convolutional/confusion-letters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/confusion-letters.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/diff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/diff.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/prod3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/prod3.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/prod7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/prod7.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/seven.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/seven.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/three.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/three.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/sevens.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/sevens.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/threes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/threes.png
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/sgd.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hyperparameters-inspection-adam/sgd.gif
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/wrong-7-22.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/wrong-7-22.png
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/sgd-complex-momentum.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hyperparameters-inspection-adam/sgd-complex-momentum.gif
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/sgd-complex-no-momentum.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hyperparameters-inspection-adam/sgd-complex-no-momentum.gif
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | This is the Markdown from which https://berthub.eu/articles/posts/hello-deep-learning is/will
2 | be populated. This allows everyone to contribute better wording or examples
3 | or graphs etc.
4 |
5 |
--------------------------------------------------------------------------------
/dropout-data-augmentation-weight-decay/weight-decay-wait-evolution-scatter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dropout-data-augmentation-weight-decay/weight-decay-wait-evolution-scatter.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 bert hubert
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/dl-gru-lstm-dna/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Gated Recurrent Unit / LSTM: Some language processing, DNA scanning"
3 | date: 2023-03-29T13:00:00+02:00
4 | draft: true
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts.
7 |
8 | Placeholder page. Will mostly be an homage to the essential [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).
9 |
10 | Includes a demo trained on my blog posts that writes pretty plausible sentences.
11 |
12 | For example: "Galileo Problems and Indonesia and Safety Tolar experiments and communications are available to investigate our manufacturers also do not specific and lives on the fact that we are going to be the same generation in a single time. And that's it. I can report that the communication provider the market is a previously been in a ton of work to learn about the world where one strand shows the protein expression making a satellite, and then just like winning operations (in Europe) taken everything about the same time. The reader I have left on a lot of research and will still start with the real thing. Explain the Internet is not the case. "
13 |
14 | The network constructs sentences like these character by character, which is quite impressive. It generates valid markdown links too, for example.
15 |
16 | Page will also include a demo of how Gated Recurrent Units can spot splice junctions in DNA.
17 |
18 | In [the next chapter](../dl-what-does-it-all-mean) you can find some philosophizing about what it all means, analogies to biology and what the future might hold.
19 |
--------------------------------------------------------------------------------
/dl-and-now-what/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Further reading & worthwhile projects"
3 | date: 2023-03-30T12:00:09+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dl-and-now-what/index.md)!
7 |
8 | After having completed this series of blog posts (well done!) you should have a good grounding in what deep learning is actually doing. However, this was of course only a small 20,000-word introduction, so there is a lot left to learn.
9 |
10 | Unfortunately, there is a lot of nonsense online. Either the explanations are sloppy or they are just plain wrong.
11 |
12 | Here is an as yet pretty short list of things I've found to be useful. I very much hope to hear from readers about their favorite books and sites. You can send [pull requests directly](https://github.com/berthubert/hello-dl-posts/blob/main/dl-and-now-what/index.md) or email me on bert@hubertnet.nl
13 |
14 | Sites:
15 | * The [PyTorch documentation](https://pytorch.org/docs/stable/index.html) is very useful, even if you are not using PyTorch. It describes pretty well how many layers work exactly.
16 | * [Andrej Karpathy](https://twitter.com/karpathy)'s [micrograd](https://github.com/karpathy/micrograd) Python autogradient implementation is a tiny work of art
17 | * Andrej Karpathy's post [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/), and also [this post](https://karpathy.github.io/2019/04/25/recipe/)
18 | * [FastAI](https://fast.ai)'s Jupyter notebooks.
19 |
20 | Projects:
21 | * [Whisper.cpp](https://github.com/ggerganov/whisper.cpp), by hero worker [Georgi Gerganov](https://ggerganov.com/). An open source self-contained C++ version of OpenAI's whisper speech recognition model. You can run this locally on very modest hardware and it is incredibly impressive. Because the source code is so small it is a great learning opportunity.
22 | * [Llama.cpp](https://github.com/ggerganov/llama.cpp), again by Georgi, a C++ version of Meta's Llama "small" large language model that can run on reasonable hardware. Uses quantisation to fit in normal amounts of memory. If prompted well, the Llama model shows ChatGPT-like capabilities.
23 |
24 |
--------------------------------------------------------------------------------
/dl-what-does-it-all-mean/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Deep Learning: What does it all mean?"
3 | date: 2023-03-30T12:00:10+02:00
4 | draft: true
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts.
7 |
8 | XXX ENTIRELY draft not done XXX
9 |
10 | In writing this series, I've experienced first hand the 'wow factor' of having your new neural network do impressive things. As part of my alarming lack of focus, I am also a very amateur biologist. I study DNA and evolution, as exhibited for example in [my Nature Scientific Data paper](https://www.nature.com/articles/s41597-022-01179-8).
11 |
12 | Microbial life can be completely recreated from its pretty small genome, which is typically a few million DNA letters long. Each DNA letter (A, C, G or T) carries 2 bits of information. A whole bacterium therefore can be regarded as having around a megabyte of parameters. Incidentally, this is of similar size to many interesting neural networks.
13 |
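To make the arithmetic explicit (assuming a genome of roughly four million letters):

{{< katex display >}} 4\cdot 10^6 \text{ letters} \times 2 \text{ bits/letter} = 8\cdot 10^6 \text{ bits} \approx 10^6 \text{ bytes} {{< /katex >}}
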
14 | Both bacteria and neural networks can evolve new functionality by changing random parameters. For bacteria, we can see this process in action at day long timescales. For example, under lab conditions, a bacterial strain can evolve resistance to an antibiotic within a week. Other more fundamental things take a lot longer, but still happen. For example, in the [E. coli long-term evolution experiment](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment), bacteria took around 33000 generations to evolve a way to live off citrate under aerobic (with oxygen) conditions.
15 |
16 | The similarity here is that both networks and life have millions (or billions) of parameters, and that through changes of these, there is a path towards great improvements.
17 |
18 | This is in stark contrast to traditional computer programs, where if you make a change, either nothing happens or your program crashes. There is no random walk imaginable that suddenly adds new features or higher performance to your work.
19 |
20 | Now, it is not evident that the gradient descent techniques from neural networks are guaranteed to find interesting minima. But from observation, they very often do. Similarly, life has clearly been extremely successful at achieving interesting goals by tweaking millions or billions of parameters.
21 |
22 | Traditional optimizers of simpler functions often get stuck at local minima. But it appears that if you create a solution where you can tweak not just a few parameters but millions of them, it is possible to have a fitness landscape where it is extremely hard to get stuck in a local minimum. Or in other words, even without heroics, your network can wind its way down to a very good optimum.
23 |
24 | The outrageous success of both life and neural networks appears to argue for this hypothesis.
25 |
26 | # Generative AI
27 | It has been fascinating to see the discussion around what ChatGPT and similar systems do. Are they intelligent? What does that question even mean? ChatGPT sounds unreasonably sure of itself at times, even when it is generating text that is dead wrong. To the people that use this to disparage AI, I ask, have you ever met any people?
28 |
29 |
--------------------------------------------------------------------------------
/hello-deep-learning/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning"
3 | date: 2023-03-30T11:59:00+02:00
4 | draft: false
5 | images: [boxed.png]
6 | ---
7 |
20 | A from scratch GPU-free introduction to modern machine learning. Many tutorials exist already of course, but this one aims to really explain what is going on, from the ground up. Also, we'll develop the demo until it is actually useful on **real life** data which you can supply yourself.
21 |
22 | Other documents start out from the (very impressive) PyTorch environment, or they attempt to math it up from first principles.
23 | Trying to understand deep learning via PyTorch is like trying to learn aerodynamics from flying an Airbus A380.
24 |
25 | Meanwhile the pure maths approach ("see it is easy, it is just a Jacobian matrix") is probably only suited for seasoned mathematicians.
26 |
27 | The goal of this tutorial is to develop modern neural networks entirely from scratch, but where we still end up with really impressive results.
28 |
29 | [Code is here](https://github.com/berthubert/hello-dl). Markdown for blogposts can [also be found on GitHub](https://github.com/berthubert/hello-dl-posts) so you can turn typos into pull requests (thanks, the first updates have arrived!).
30 |
31 | Chapters:
32 |
33 | * [Introduction](../hello-deep-learning-intro) (which you can skip if you want)
34 | * [Chapter 1: Linear combinations](../hello-deep-learning-chapter1)
35 | * [Chapter 2: Some actual learning, backward propagation](../first-learning)
36 | * [Chapter 3: Automatic differentiation](../autograd)
37 | * [Chapter 4: Recognizing handwritten digits using a multi-layer network: batch learning SGD](../handwritten-digits-sgd-batches)
38 | * [Chapter 5: Neural disappointments, convolutional networks, recognizing handwritten **letters**](../dl-convolutional/)
39 | * [Chapter 6: Inspecting and plotting what is going on, hyperparameters, momentum, ADAM](../hyperparameters-inspection-adam)
40 | * [Chapter 7: Dropout, data augmentation and weight decay, quantisation](../dropout-data-augmentation-weight-decay)
41 | * [Chapter 8: An actual 1700 line from scratch handwritten letter OCR program](../dl-ocr-demo)
42 | * Chapter 9: Gated Recurrent Unit / LSTM: Some language processing, DNA scanning
43 | * Chapter 10: Attention, transformers, how does this compare to ChatGPT?
44 | * [Chapter 11: Further reading & worthwhile projects](../dl-and-now-what)
45 | * Chapter 12: What does it all mean?
46 |
47 |
48 |
53 |
--------------------------------------------------------------------------------
/hello-deep-learning-intro/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Intro"
3 | date: 2023-03-30T12:00:00+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. Also, feel free to skip this intro and [head straight for chapter 1](../hello-deep-learning-chapter1) where the machine learning begins!
7 |
8 | Deep learning and 'generative AI' have now truly arrived. Whether this is a good thing very much remains to be seen. What is certain, however, is that these technologies will have a huge impact.
9 |
10 | Up to late 2022, I had unwisely derided the advances of deep learning as overhyped nonsense from people doing fake demos. Turned out this was only half false - many of the demos were indeed fake.
11 |
12 | But meanwhile, truly staggering things were happening, and I had ignored all of that. In hindsight, I wish I had read and believed Andrej Karpathy's incredibly important 2015 post [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/). The examples in there are self-contained proof something very remarkable had been discovered.
13 |
14 | For me this meant I had to catch up and figure out what was going on. What is this magical stuff really? Soon I found myself in a maze of confusing YouTube videos and Jupyter notebooks that showed me awesome things, but that did not address how all this magic worked. Also, quite often when trying to reproduce what I had seen, the magic did not actually work.
15 |
16 | To make up for my somewhat idiotic ignorance, I went back to first principles to emulate a bit of what Andrej Karpathy had achieved: I set out to build a self-contained, simple, but still impressive demo of the technologies involved, one that would really showcase this awesome new technology, including its pitfalls.
17 |
18 | The goal is to really start from the ground up. Many other projects will tell you how to use the impressive deep learning tooling that is now available. This project hopes to show you what this tooling is actually doing for you to make the magic happen. And not only show: we're going to start truly from scratch - this is not built on top of PyTorch or TensorFlow. It is built on top of plain C++.
19 |
20 | In the chapters of this 'Hello Deep Learning' project, we'll build several solutions that do actually impressive things. The first solution is a relatively small from scratch program that will learn how to recognize handwritten letters, and also perform this feat on actual real life data -- something many projects conveniently skip.
21 |
22 | Along the way we'll cover many of the latest deep learning techniques, and employ them in our little programs.
23 |
24 | In this project, the 'from scratch' part means that we'll only be depending on system libraries, [a logging library](https://berthub.eu/articles/posts/big-data-storage/), [a matrix library](https://en.wikipedia.org/wiki/Eigen_(C%2B%2B_library)) and [an image processing library](https://github.com/nothings/stb). It serves no educational purpose to develop any of these things as part of this series. Yet, we will spend time on what the matrix library is doing for us, and why you should not ever roll your own.
25 |
26 | I hope you'll enjoy this trip through the fascinating world of deep learning. It has been my personal way of making up for years of ignorance, and with some luck, this project will not only have been useful for me.
27 |
28 | Finally, all pages are [hosted on github](https://github.com/berthubert/hello-dl-posts) and I very much look forward to receiving your pull requests to fix my inevitable mistakes or fumbled explanations!
29 |
30 | Now, do head on to [Chapter 1: Linear combinations](../hello-deep-learning-chapter1).
31 |
--------------------------------------------------------------------------------
/first-learning/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: actually learning something"
3 | date: 2023-03-30T12:00:02+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/first-learning/index.md)!
7 |
8 | In this chapter we're going to take [the neural network we made earlier](../hello-deep-learning), but actually make it do some learning itself. And, oddly enough, this demonstration will again likely simultaneously make you wonder "is this all??" and also impress you by what even this trivial stuff can do.
9 |
10 | The first part of this chapter covers the theory, and shows no code. The second part explains the code that makes it all happen. You can skip or skim the second part if you want to focus on the ideas.
11 |
12 | ## The basics
13 | [Earlier we configured a linear combination neural layer](../hello-deep-learning), in which we used an element-wise multiplication to recognize if an image was a 3 or a seven:
14 |
15 | *(figure: the input image, element-wise multiplied (∗) by the weights matrix, equals (=) the product whose sum gives the score)*
37 | This network achieved impressive accuracy on the very clean and polished EMNIST testing data, but partly because we carefully configured the network by hand. It did no learning of its own.
38 |
39 | ## How about some actual learning
40 | The key to calculating the verdict if something is a 3 or a 7 is the *weights* matrix. We manually initialized that matrix in the previous chapter. In machine learning, it is customary to randomly initialize the parameters. [But to what](https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79)? In practice, libraries tend to pick values uniformly distributed between
41 | {{< katex inline >}}-1/\sqrt{N}{{< /katex >}} and {{< katex inline >}}1/\sqrt{N}{{< /katex >}}, where {{< katex inline >}}N{{< /katex >}} is the number of coefficients in the input matrix.
42 |
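As an aside, here is a minimal standalone sketch of what such an initialization looks like in plain C++; the `weights.randomize(1.0/sqrt(28*28))` call that appears in the code section further down does the equivalent for our `Tensor` class.

```C++
#include <cmath>
#include <random>
#include <vector>

// Fill a weights vector with values drawn uniformly from [-1/sqrt(N), 1/sqrt(N)],
// where N is the number of coefficients (for a 28x28 image, N = 784)
std::vector<float> uniformInit(unsigned int N)
{
  std::mt19937 gen(std::random_device{}());
  float limit = 1.0f / std::sqrt((float)N);
  std::uniform_real_distribution<float> dist(-limit, limit);

  std::vector<float> w(N);
  for(auto& v : w)
    v = dist(gen);
  return w;
}
```
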
43 | Such a randomly chosen matrix will of course not yet be of any use:
44 |
45 | *(figure: the input image ∗ the randomly initialized weights = the resulting product)*
58 | The result of this multiplication and subsequent summation is 0.529248, so our random weights got it wrong: this is actually a three, and the resulting score should have been negative.
59 |
60 | So, what to do? What's the simplest thing we could even do?
61 |
62 | Recall that the 'score' we are looking at is the sum of the element-wise product of the image pixels ({{< katex inline >}}p_n{{< /katex >}}) and the weights ({{< katex inline >}}w_n{{< /katex >}}). Or, concretely, this summation over 28*28=784 elements:
63 |
64 | {{< katex display >}}R=p_1w_1 + p_2w_2 + \cdots + p_{783}w_{783} + p_{784}w_{784}{{< /katex >}}
65 |
66 | Our current random weights delivered us an {{< katex inline >}}R{{< /katex >}} that was too high. We can't change the input image pixels, but we can simply decide to lower the various weights, as this will deliver a lower {{< katex inline >}}R{{< /katex >}}. So by how much should we lower them?
67 |
68 | There is no impact for this image if we lower {{< katex inline >}}w_1{{< /katex >}} since the first pixel {{< katex inline >}}p_1{{< /katex >}} is 0 (black). And in fact, we'll get the biggest impact if we lower parameters in places of bright pixels.
69 |
70 | In practice in neural networks, we often lower each {{< katex inline >}}w_n{{< /katex >}} by {{< katex inline >}}0.1p_n{{< /katex >}}. This is then called a 'learning rate of 0.1'. Note that this effectively means: make bigger changes where they matter more.
71 |
72 | We do this lowering (or raising) in the direction of the desired outcome. So if the network had looked at a seven and produced a negative output, we'd be doing this learning in the opposite direction by increasing the weight parameters by 0.1 of the value of the input pixel.
73 |
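Spelled out as code, one such learning step could look like this (a plain-array sketch; the `Tensor`-based version from the actual program appears in the code section below):

```C++
#include <array>

// One learning step on a single 28x28 image, as a plain-array sketch.
// For a seven the score should go up, for a three it should go down.
void learnStep(const std::array<float, 784>& pixels,
               std::array<float, 784>& weights, bool isSeven)
{
  const float learningRate = 0.1f;
  for(size_t i = 0; i < pixels.size(); ++i) {
    if(isSeven)
      weights[i] += learningRate * pixels[i];  // push the score up, most where pixels are bright
    else
      weights[i] -= learningRate * pixels[i];  // push the score down
  }
}
```
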
74 | Now, although this sounds ridiculously simple minded ("just twist the knobs so the score goes in the right direction"), let's give this a spin:
75 |
76 | ```
77 | $ ./37learn
78 | Have 240000 training images and 40000 validation images.
79 | 50.5375% correct
80 | 50% correct
81 | 81.8125% correct
82 | 86.675% correct
83 | 58.8% correct
84 | 85.65% correct
85 | 79.375% correct
86 | ...
87 | 98.025% correct
88 | ```
89 |
90 | Recall how our carefully hand-configured neural network managed to achieve 97.025%. Here is an animation (left) showing the evolution of the weights matrix, from its initial random form to something remarkably like what we hand-configured earlier (right):
91 |
92 | *(animation: the weights matrix evolving from its random start (left) into something much like the hand-configured weights (right))*
110 | It appears that even our astoundingly simplistic learning technique delivered a pretty good result.
111 |
112 | The process described above is called 'backpropagation', and it is at the absolute core of any neural network, including ChatGPT or any other mega impressive network you may have heard about. Continuing the theme from the previous chapter, it is confusing that a technique this simple and unimpressive might have such remarkable success.
113 |
114 | In the next chapter we'll talk more about this process, and the computational challenges that it brings across more complex networks.
115 |
116 | # The code
117 | Now for the real details. The code [can be found here](https://github.com/berthubert/hello-dl/blob/main/37learn.cc).
118 |
119 | ```C++
120 | Tensor weights(28,28);
121 | weights.randomize(1.0/sqrt(28*28));
122 |
123 | saveTensor(weights, "random-weights.png", 252);
124 |
125 | float bias=0;
126 | ```
127 |
128 | We start out by initializing a weights matrix to random numbers between {{< katex inline >}}-1/\sqrt{28*28}{{< /katex >}} and {{< katex inline >}}1/\sqrt{28*28}{{< /katex >}}. In the informal explanation above, I neglected to mention the *bias*, which is part of the score formula:
129 |
130 | {{< katex display >}} R =\sum{\mathit{image}\circ{}w} + b {{< /katex >}}
131 |
132 | Next up we need to set the *learning rate*:
133 | ```C++
134 | Tensor lr(28,28);
135 | lr.identity(0.01);
136 | ```
137 | The learning rate is what we need to multiply our image with to know how much to adjust the weights. Now, we'd love to just multiply the image by 0.01, but that is not how matrices work. If you want to multiply each coefficient of a matrix by a factor, you need to set up another matrix with that factor on all diagonal coefficients ('from the top left to the bottom right'). Our tensor class has an `identity()` method just for that purpose. [This Wikipedia page](https://en.wikipedia.org/wiki/Identity_matrix) may or may not be helpful.
138 |
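To illustrate the point with a tiny standalone example (plain Eigen here, not the demo's `Tensor` class):

```C++
#include <Eigen/Dense>
#include <iostream>

int main()
{
  Eigen::Matrix2f img;
  img << 0.0f, 0.5f,
         1.0f, 0.25f;

  // 0.01 on the diagonal, zero elsewhere
  Eigen::Matrix2f lr = 0.01f * Eigen::Matrix2f::Identity();

  // multiplying by this scaled identity matrix scales every coefficient by 0.01
  std::cout << img * lr << "\n";
}
```
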
139 | > Earlier on this page we mentioned 0.1 as a typical learning rate. Here I've chosen 0.01 since the network learns plenty fast already, and by slowing it down, the final results improve a bit. This is because the network can seek out optima somewhat more diligently. In a later chapter we'll read about learning rate schedulers that automate this process.
140 |
141 | Next up, let's do some learning:
142 |
143 | ```C++
144 | for(unsigned int n = 0 ; n < mn.num(); ++n) {
145 | int label = mn.getLabel(n);
146 | if(label != 3 && label != 7)
147 | continue;
148 |
149 | if(!(count % 4)) {
150 | if(doTest(mntest, weights, bias) > 98.0)
151 | break;
152 | }
153 | ```
154 |
155 | As earlier, this goes over all training samples of EMNIST. In addition, after every 4 images, we test our weights and bias against the validation database. If this `doTest` function returns that we got more than 98% of images correct, we leave the loop.
156 |
157 | Next:
158 |
159 | ```C++
160 | Tensor img(28,28);
161 | mn.pushImage(n, img);
162 | float res = (img.dot(weights).sum()(0,0)) + bias; // the calculation
163 | int verdict = res > 0 ? 7 : 3;
164 |
165 | if(label == 7) {
166 | if(res < 2.0) {
167 | weights.raw() = weights.raw() + img.raw() * lr.raw();
168 | bias += 0.01;
169 | }
170 | } else {
171 | if(res > -2.0) {
172 | weights.raw() = weights.raw() - img.raw() * lr.raw();
173 | bias -= 0.01;
174 | }
175 | }
176 | ++count;
177 | }
178 | ```
179 | This is where the actual learning happens. If we just fed the neural network a 7, and if the calculated score was less than 2, we increase all the weights by `lr` of the associated pixel value.
180 |
181 | Similarly, if we fed the network a 3, we lower all the weights, unless the score was already below -2.
182 |
183 | The reason we test against 2 or -2 is that otherwise the network would eventually move parameters all the way to infinity.
184 |
185 | In upcoming chapters we'll see how the use of [activation functions](https://en.wikipedia.org/wiki/Activation_function) replaces the need for such crude limits.
186 |
187 | As earlier, the somewhat ugly `.raw()` functions are necessary to prevent the slightly magic `Tensor` class from doing all kinds of work for us. In the next chapter, we're going to dive further into the theory of backpropagation, and that is when the `Tensor` class is going to shine.
188 |
189 | Rounding off this chapter, using a surprisingly [small amount of code](https://github.com/berthubert/hello-dl/blob/main/37learn.cc) we've been able to make a neural network learn how to distinguish between images of the digits 3 and 7. It should be noted that this is very clean data, and that by focusing on only 2 digits the task isn't that hard.
190 |
191 | Still, as earlier, it is somewhat disconcerting how effective these techniques are even when what we are doing appears to be trivial.
192 |
193 | In the next chapter, [we're going to learn all about automatic differentiation](../autograd).
194 |
195 |
--------------------------------------------------------------------------------
/dropout-data-augmentation-weight-decay/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Dropout, data augmentation, weight decay and quantisation"
3 | date: 2023-03-30T12:00:07+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dropout-data-augmentation-weight-decay/index.md)!
7 |
8 | In the previous chapter we found ways to speed up our character recognition learning by a factor of 20 by using a better optimizer, and a further factor of four by cleverly using threads in a 'shared nothing' architecture. We also learned how we can observe the development of parameters.
9 |
10 | So our convolutional network is now super fast, and performs well on training and validation sets. But is it robust? Or has it internalized too much of the training set details?
11 |
12 | Previously we found that the performance of our handwritten digit recognizer plummeted if we flipped a few pixels or moved the digit around slightly. And the reason behind that was that the linear combination behind that network was tied to actual pixel positions, and not to shapes.
13 |
14 | We now have a fancy and fast convolutional network that can theoretically do a lot better. Let's see how the network performs with slightly modified inputs:
15 |
16 | *(figure: accuracy on unmodified inputs versus inputs shifted by two pixels with five flipped pixels)*
23 | Well, that is disappointing. Although results are better than with the simple linear combination, we still take a significant hit when we move the image by only two pixels, and flip 5 random pixels from light to dark. No human being would be fooled by (or perhaps even notice) these changes.
24 |
25 | How is this possible? It turns out that a network can't learn from what it doesn't see. If all the inputs are centered exactly and have no noise, the network never learns to deal with off-center or corrupted inputs.
26 |
27 | In this post we'll go over several modern techniques to enhance performance, robustness and efficiency.
28 |
29 | # Data augmentation
30 | Through a technique called [data augmentation](https://en.wikipedia.org/wiki/Data_augmentation), we can shake up our training set, making sure our network is exposed to more variation. And lo, when we do that, training and validation again score similarly, and only slightly worse than on unmodified data:
31 |
32 | *(figure: training and validation accuracy with data augmentation enabled)*
39 | Data augmentation has several uses. It can make a network more robust by erasing hidden assumptions - even those you might not have been aware of. Hidden constant factors between training and validation sets are a major reason why networks that appear to do well, fail in the field. Because out there in the real world, samples aren't neatly centered and free from noise.
40 |
41 | In addition, if you are short on training data, you can augment it by creating modified versions of inputs. This in effect enlarges your training set. Possible modifications include skewing or rotating images, adding noise, making inputs slightly larger or smaller, or changing colors. It pays to try a lot of things - the more you try, the larger your chances are of creating a dataset that can only be learned by understanding the essence of the inputs.
42 |
43 | In the demo code, data augmentation is implemented [here](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc#L24). It moves the image around by -2 to +2 pixels, and flips the value of 5 random pixels.
44 |
45 | # Normalization
46 | The inverse of data augmentation might be called normalization. Many training sets were gathered using highly repeatable measurements. For example, faces were photographed under the same lighting, or scanned images were normalised to a certain average brightness, with similar standard deviations.
47 |
48 | You could undo such normalisation by brightening or dimming your inputs, and retraining. Or you could do the reverse and also normalise any inputs before running your network. Most networks, including our letter reading one, perform such normalisation. This is appropriate for any metric where you can objectively normalise. This is for example not the case for unskewing or rotating images back to their 'normal' state, because you don't know what that looks like.
49 |
50 | Our demos so far have been doing image normalization like this:
51 |
52 | ```C++
53 | d_model.img.normalize(0.172575, 0.25);
54 | ```
55 | This normalizes the mean pixel value to 0.172575 and the standard deviation to 0.25. So why these specific numbers? I applied a common machine learning trick: I picked them [from another model that works well](https://github.com/pytorch/examples/blob/main/cpp/mnist/mnist.cpp).
56 |
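As a sketch of what such a `normalize(mean, stdev)` call might do internally (the actual `Tensor` implementation may differ in its details):

```C++
#include <Eigen/Dense>
#include <cmath>

// Shift and scale all pixel values so they end up with the requested mean and standard deviation
void normalize(Eigen::MatrixXf& img, float targetMean, float targetStdev)
{
  float mean = img.mean();
  float stdev = std::sqrt((img.array() - mean).square().mean());

  img = ((img.array() - mean) / stdev * targetStdev + targetMean).matrix();
}
```
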
57 | # Dropout
58 | Another important technique to make networks generalise is called 'dropout'. By randomly zeroing out parts of the network, we force it to develop multiple pathways to determine how to classify an input. In addition, the network can't rely on accidental features to do classification since it won't always see those accidental features.
59 |
60 | Once the network is in production, we no longer perform the dropout, which gives a relative boost to performance. In some contexts, dropout is absolutely mandatory, but it does not do a lot for our letter recognizer. It does make learning harder:
61 |
62 | *(figure: training versus validation accuracy while learning with dropout enabled)*
68 | Note that here the validation clearly outperforms the training set, which is made harder by the dropout. Training also takes a lot longer, and in some cases does not converge. It does however lead to a network that should be immune against overtraining. Overtraining is easily recognized when performance on the training set is higher than on the validation set. Dropout reverses that.
69 |
70 | If dropout is set to 50%, on average 50% of values of a tensor will be set to zero. A little known fact is that [the other values are then doubled](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html). This means that the overall impact of this tensor is retained (on average).
71 |
72 | In our code, the implementation is an element-wise multiplication of a tensor with a tensor filled with zeros for blanked out positions, and the multiplication factor for the rest.
73 |
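A sketch of what building such a dropout mask could look like (an illustration, not the demo's exact code):

```C++
#include <Eigen/Dense>
#include <random>

// Build a dropout mask: each coefficient is 0 with probability 'rate',
// and 1/(1-rate) otherwise, so the expected overall contribution is unchanged.
// During training, multiply a layer's output element-wise with this mask.
Eigen::MatrixXf dropoutMask(int rows, int cols, float rate, std::mt19937& gen)
{
  std::bernoulli_distribution drop(rate);
  float keepScale = 1.0f / (1.0f - rate);

  Eigen::MatrixXf mask(rows, cols);
  for(int r = 0; r < rows; ++r)
    for(int c = 0; c < cols; ++c)
      mask(r, c) = drop(gen) ? 0.0f : keepScale;
  return mask;
}
```
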
74 | # Weight decay
75 | Not all parts of a network end up being used. But they are still there, and the parameters in those unused parts can have large numerical values. These parts could however become active for certain inputs that weren't observed during training, and then disrupt things, degrading the network's performance.
76 |
77 | It could therefore be useful to put a slight zero-ward pressure on parameter values that apparently have no impact. This has a deep analogy to what happens in biology, where genes that are not used tend to decay to a non-functional state, and in the case of microbial life, even get cut out of the genome.
78 |
79 | In neural networks, a surprisingly easy way to achieve a similar effect is by including the sum of all squared parameters in the loss function. Recall that the learning process effectively tries to minimise the loss to zero - by adding these squared values, there is an automatic zero-ward pressure. This works surprisingly well, and I think this is pretty meaningful.
80 |
81 | *(figure: histogram of squared parameter values before and after weight decay)*
87 | Here we take a previously trained model and then turn on weight decay. Note how the distribution shifts leftward. The move appears modest, but this is a logarithmic plot. Many parameters go down by a factor of 10 or more. Here is a cumulative distribution:
88 |
89 | *(figure: cumulative distribution of squared parameter values before and after weight decay)*
95 | Here we can see that initially 20,000 parameters had a squared value of less than 0.0001. After the weight decay process, this number goes up to 80,000.
96 |
97 | If we look at each parameter:
98 |
99 | *(figure: scatter plot of each parameter's squared value before versus after weight decay)*
105 | Here everything below the black line represents a decrease of a parameter's squared value. It can be seen that especially larger values are reduced by a large fraction.
106 |
107 | Essentially, after weight reduction we have a network that still functions well, but now effectively with a lot fewer parameters (if we remove tiny values). I find it pretty remarkable that we can achieve this just by adding the squared value of all parameters to the loss function. Such a simple mathematical operation, yet it gives us a simpler network.
108 |
109 | The implementation is near trivial:
110 | ```C++
111 | if(weightfact(0,0) != 0.0) {
112 | weightsloss = weightfact*(s.c1.SquaredWeightsSum() + s.c2.SquaredWeightsSum() + s.c3.SquaredWeightsSum() +
113 | s.fc1.SquaredWeightsSum() + s.fc2.SquaredWeightsSum() + s.fc3.SquaredWeightsSum());
114 |
115 | loss = modelloss + weightsloss;
116 | }
117 | else
118 | loss = modelloss;
119 | ```
120 |
121 | Here `weightfact` determines how heavily we weigh down on the squared weights. 0.02 appears to work well for our model.
122 |
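A one-line way to see why this pushes parameters toward zero: adding {{< katex inline >}}\lambda w^2{{< /katex >}} to the loss for each weight adds {{< katex inline >}}2\lambda w{{< /katex >}} to its gradient, so every update step shrinks the weight by an amount proportional to its own size:

{{< katex display >}} \frac{\partial}{\partial w}\left(\lambda w^2\right) = 2\lambda w \quad\Rightarrow\quad w \leftarrow w - \eta\cdot 2\lambda w {{< /katex >}}
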
123 | On a closing note, the number of parameters impacts how much memory and CPU/GPU a model requires to function. Currently, networks use gigantic amounts of electrical power, which is not sustainable. If we can use this technique to slim down networks, that would be very good.
124 |
125 | In addition, we might be able to understand better what is going on if we have fewer parameters to look at.
126 |
127 | # Quantisation
128 | From the histograms above, we can see that most parameter values cluster close together. In most networks, such parameters are stored as 32 bit single precision floating point numbers. But do we actually need all those 32 bits? Given how much we could drive down the parameter values with no impact on performance, it is clear we do not need to store very large numerical values.
129 |
130 | We can easily imagine a reduction to 16 bits working - this effectively only adds some noise to the network. And indeed, the industry is rapidly moving to 16 bits floating point. Even [processors](https://networkbuilders.intel.com/solutionslibrary/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide) and GPUs have gained native ability to perform operations on such half-precision floating point numbers.
131 |
132 | It turns out however that on large language model networks, one can go down to **4 bit precision** without appreciable loss of performance. Hero worker [Georgi Gerganov](https://ggerganov.com/) has implemented such quantisation in his [C++ version of Facebook's Llama model](https://github.com/ggerganov/llama.cpp), and it works very well.
133 |
134 | To perform quantisation, values are divided into 2^n bins of equal population, like this:
135 |
140 |
141 | And values are then stored as 4 bits, indicating which bin they correspond to. Interestingly enough, there are even binary networks with only two values. Out there in the real world, 8-bit networks [are already seeing production use](https://blog.plumerai.com/).
142 |
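A rough sketch of such equal-population binning (purely illustrative, since as the note below says, quantisation is not implemented in the demo code):

```C++
#include <algorithm>
#include <cstdint>
#include <vector>

// Map each parameter to a 4-bit bin index (0..15), with bin edges chosen so that
// every bin holds roughly the same number of parameters. To use the network you
// would also store one representative value (for example the mean) per bin.
std::vector<uint8_t> quantize4bit(const std::vector<float>& params, std::vector<float>& edges)
{
  std::vector<float> sorted = params;
  std::sort(sorted.begin(), sorted.end());

  edges.clear();
  for(size_t b = 1; b < 16; ++b)          // 15 interior edges -> 16 equal-population bins
    edges.push_back(sorted[b * sorted.size() / 16]);

  std::vector<uint8_t> out;
  out.reserve(params.size());
  for(float p : params)
    out.push_back(std::lower_bound(edges.begin(), edges.end(), p) - edges.begin());
  return out;
}
```
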
143 | > This is the only feature discussed in this blog series that is not (currently) present in the demo code.
144 |
145 | # Summarising
146 | There is a lot that can be done to networks to improve their efficiency and performance. By augmenting our data, through small changes, we can make sure the network is exposed to more variation, and in this way becomes more robust against real life input.
147 |
148 | Similarly, by performing internal dropout, the network is forced to learn how to recognize the input while not being able to rely on artifacts.
149 |
150 | By adding a fraction of the squared value of parameters to the loss function, we can perform weight decay, which drives parameters to zero if they are not contributing to the result. This again aids in robustness, since stray unused neurons have less chance of interfering. Furthermore, we might drop very small value neurons from our network entirely, and still have a working network.
151 |
152 | Finally, quantisation is the art of storing the weights in fewer bits which, kinda surprisingly, can be done without impacting performance too much.
153 |
154 | Next up, [we are going to do some actual OCR with what we've learned](../dl-ocr-demo)!
155 |
--------------------------------------------------------------------------------
/dl-ocr-demo/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Doing some actual OCR on handwritten characters"
3 | date: 2023-03-30T12:00:08+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dl-ocr-demo/index.md)!
7 |
8 | The [previous](../dropout-data-augmentation-weight-decay/) chapters have often mentioned the chasm between "deep learning models that work on my data" and "it actually works in the real world". It is perhaps for this reason that almost all demos and YouTube tutorials you find online never do any real world testing.
9 |
10 | Here, we are going to do it, and this will allow us to experience first hand how hard this is. We're going to build a computer program that reads handwritten letters from a provided photo, based [on the convolutional model developed earlier](../dl-convolutional/#convolutional-networks).
11 |
12 | # Training
13 | All elements described in previous chapters are present in [tensor-convo-par.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc). This tool can train our alphabet model, and optionally use the Adam optimizer, do dropout, weight decay, and data augmentation.
14 |
15 | Here are its options:
16 |
17 | ```bash
18 | Usage: tensor-convo-par [-h] [--learning-rate VAR] [--alpha VAR] [--momentum VAR]
19 | [--batch-size VAR] [--dropout] [--adam] [--threads VAR] [--mut-on-learn]
20 | [--mut-on-validate] state-file
21 |
22 | Positional arguments:
23 | state-file state file to read from [default: ""]
24 |
25 | Optional arguments:
26 | -h, --help shows help message and exits
27 | -v, --version prints version information and exits
28 | --lr, --learning-rate learning rate for SGD [default: 0.01]
29 | --alpha alpha value for adam [default: 0.001]
30 | --momentum [default: 0.9]
31 | --batch-size [default: 64]
32 | --dropout
33 | --adam
34 | --threads [default: 4]
35 | --mut-on-learn augment training data
36 | --mut-on-validate augment validation data
37 | ```
38 | When this program runs, it starts with a freshly randomised state, or one read from the specified `state-file`. It will periodically save its state to a file called `tensor-convo-par.state`. So you can restart from an existing state with ease, possibly using different settings.
39 |
40 | While the program learns, it emits statistics to a sqlite file called `tensor-convo-par-vals.sqlite3` which you can use to study what is going on, as outlined in [this chapter of the series](../hyperparameters-inspection-adam/#inspection).
41 |
42 | Details on how to build and run this program can [be found here](https://github.com/berthubert/hello-dl/blob/main/README.md).
43 |
44 | For testing purposes, settings `--adam --mut-on-learn --mut-on-validate` work well. If you run it like this, you can terminate the process after 30 minutes or so, and have a decent model.
45 |
46 | # Real world input
47 | Here is our input image, which already has a story behind it:
48 | *(image: the photo of handwritten letters used as input)*
56 | When I first got the OCR program working, results were very depressing. The network struggled mightily on some letters, often just not getting them right. Whatever I did, the 'h' would not work for example. First I blamed my own sloppy handwriting, but then I studied what the network was trained on:
57 |
58 | *(image: a poster of 'h' samples from the training data)*
65 | Compare this to how I (& many other Europeans) write an h:
66 | *(image: the 'h' as I write it)*
73 | No amount of training on the EMNIST set is going to teach a neural network to consistently recognize this as an h - this shape is simply not in the training set.
74 |
75 | So that was the first lesson - really be aware of what is in your training data. If it is different from what you thought, results might very well disappoint. To make progress, I changed my handwriting in the test image to something that looks like what is actually in the EMNIST data.
76 |
77 | # Practicalities
78 | The source code of the OCR program [is here](https://github.com/berthubert/hello-dl/blob/main/img-ocr.cc), and it is around 300 lines. For image processing, I found the [stb collection](https://github.com/nothings/stb) of single-file public domain include files very useful. To run the program, run something like `./img-ocr sample/cleaned.jpeg tensor-convo-mod.par`. It will generate a file called `boxed.png` for you with the results in there.
79 |
80 | So, getting started: what we have is a network that does pretty well on 28 by 28 pixel representations of letters, where the background pixel value is 0. By contrast, input images tend to have millions of pixels, in full colour even, and in them black pixels have a value of 0, which is the inverse of our training data.
81 |
82 | The first thing to do is to turn the image into a gray scale version, where we also adjust the white balance so black is actually black and where the gray that passes for white is actually white.
83 |
84 | From OCR theory I learned that the first step in character segmentation is to recognize lines of text. This is done by making a graph of the total intensity per horizontal line of the image. From this graph, you then try to select intervals of high intensity that look like they might represent a line of text.
85 |
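In code, the row-profile idea could look roughly like this (a simplified sketch; the real segmentation in img-ocr.cc is more involved):

```C++
#include <utility>
#include <vector>

// Sum the ink intensity of every row and report intervals of consecutive rows
// whose total exceeds a threshold: these are candidate lines of text.
std::vector<std::pair<int,int>> findTextLines(const std::vector<std::vector<float>>& ink, float threshold)
{
  std::vector<std::pair<int,int>> lines;
  int start = -1;
  for(int r = 0; r < (int)ink.size(); ++r) {
    float rowsum = 0;
    for(float v : ink[r])
      rowsum += v;

    if(rowsum > threshold && start < 0)
      start = r;                        // a line of text begins here
    else if(rowsum <= threshold && start >= 0) {
      lines.push_back({start, r});      // and ends here
      start = -1;
    }
  }
  if(start >= 0)
    lines.push_back({start, (int)ink.size()});
  return lines;
}
```
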
86 | For each line, you then travel from left to right to try to box in characters.
87 |
88 | This leads us to:
89 | *(image: the scanned text with boxes drawn around the detected characters)*
96 | Note that this is already hard work, and not very robust. How hard this is is yet another reminder that a lot of machine learning is in fact preprocessing your data so it cooperates. Compare it to painting a house - lots of sanding and tape, and finally a fun bit of painting.
97 |
98 | The code that does this segmentation in [img-ocr.cc](https://github.com/berthubert/hello-dl/blob/main/img-ocr.cc) is none too pretty and has only been worked on so it does enough of a job that we can demo our actual neural network. (Pull requests welcome!)
99 |
100 | # Loading the network
101 | Firing up the network is much like we did while training:
102 |
103 | ```C++
104 | ConvoAlphabetModel m;
105 | ConvoAlphabetModel::State s;
106 |
107 | cout<<"Loading model state from file '"<<argv[2]<<"'\n"; // reconstructed line
108 | // ... read the saved weights from the file into 's' (see img-ocr.cc for the exact code)
109 | ```
110 |
111 | Then, for each character rectangle found by the segmentation step, we build a white-balanced grayscale copy:
112 |
113 | ```C++
114 | vector<uint8_t> newpic;
124 | newpic.reserve((l.lstopcol-l.lstartcol) * (l.lstoprow-l.lstartrow));
125 |
126 | for(int r= l.lstartrow; r < l.lstoprow; ++r) {
127 | for(int c= l.lstartcol ; c < l.lstopcol; ++c) {
128 | int intensity = getintens(c, r);
129 | if(intensity < whiteballow)
130 | newpic.push_back(255);
131 | else if(intensity > whitebalhigh)
132 | newpic.push_back(0);
133 | else
134 | newpic.push_back(255*(1- pow((intensity - whiteballow)/(whitebalhigh - whiteballow), 1) ));
135 | }
136 | }
137 | ```
138 | This iterates over the rectangles, and the first thing it does is fix the white balance, and create a new balanced image in `newpic`.
139 |
140 | Our network needs a 28x28 pixel version, which is not what we get. Usually we get a lot more pixels, but not necessarily with a square aspect ratio. To make our box square, we previously enlarged the smallest dimension so it has the same size as the largest one. From [stb_image_resize.h](https://github.com/nothings/stb/blob/master/stb_image_resize.h) we get functionality to do high quality resizing.
141 |
142 | As another example of how things tend to be more difficult than you think, the MNIST training data is 28x28 pixels, BUT, the outer 2 pixels are always empty. So in fact, the network trains on 24x24 sized letters. This means that to match our image data to our network, we had best also resize letters to 24x24 pixels, and place them in the middle of a 28x28 grid:
143 |
144 |
145 | ```C++
146 | vector<uint8_t> scaledpic(24*24);
147 | stbir_resize_uint8(&newpic[0], l.lstopcol - l.lstartcol, l.lstoprow - l.lstartrow, 0,
148 |                    &scaledpic[0], 24, 24, 0, 1);
149 |
150 | m.img.zero();
151 | for(unsigned int r=0; r < 24; ++r)
152 |   for(unsigned int c=0; c < 24; ++c)
153 |     m.img(2+r,2+c) = scaledpic[c+r*24]/255.0;
154 |
155 | m.img.normalize(0.172575, 0.25);
156 | ```
157 | > Note that if we performed [data augmentation](../dropout-data-augmentation-weight-decay/#data-augmentation), our network should be robust against off center letters, or pixels in the outer two rows. But let's not make life harder than necessary for our model.
158 |
159 | On the last line, we perform normalization [as described previously](../dropout-data-augmentation-weight-decay/#normalization) so that the pixels have a similar brightness to what the network is used to. This may feel like cheating, but this kind of normalization is an objective mathematical operation. Your eyes for example do the same thing by dilating your pupils so the photoreceptor cells receive a normalized amount of photons. Those cells in turn again also change their sensitivity depending on light levels.
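
As a sketch of what I assume such a normalization step does - rescale the pixels so their mean and standard deviation match those of the training data; the real `Tensor::normalize` may differ in detail:

```C++
#include <cmath>

// Rescale n pixel values so their mean/stddev become targetMean/targetStddev
void normalizePixels(float* img, unsigned int n, float targetMean, float targetStddev)
{
  float mean = 0, var = 0;
  for(unsigned int i = 0; i < n; ++i)
    mean += img[i];
  mean /= n;
  for(unsigned int i = 0; i < n; ++i)
    var += (img[i] - mean) * (img[i] - mean);
  float stddev = sqrtf(var / n);
  if(stddev == 0)
    return;                       // a blank box, nothing to rescale
  for(unsigned int i = 0; i < n; ++i)
    img[i] = targetMean + (img[i] - mean) * targetStddev / stddev;
}
```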
160 |
161 | Next up, we can ask the network what it made of our character:
162 |
163 | ```C++
164 | m.expected.oneHotColumn(0);
165 | m.modelloss(0,0); // makes the calculation happen
166 |
167 | int predicted = m.scores.maxValueIndexOfColumn(0);
168 | cout<<"predicted: "<<(char)(predicted+'a')<
177 |
178 | 
179 |
180 |
181 |
182 |
183 | Now, if you run the training yourself (which I encourage you to do), you'll find that the network will always make a few mistakes. In this sample, it gets the L wrong and thinks it is a C. It does the same for the T. If you train with different settings, it will get other letters wrong.
184 |
185 | In this animated version, recorded while a network was learning, you can see the network flip around as it is improving:
186 |
187 |
188 |
193 |
194 |
195 | It is highly instructive to try to improve [img-ocr.cc](https://github.com/berthubert/hello-dl/blob/main/img-ocr.cc) and pull requests are welcome!
196 |
197 | But always check what the training data say - for example, the specific form of the 't' that the network got wrong above [is not well represented in the training set](t-poster.png).
198 |
199 | Also - it may be good at this point to realise we've written a functional OCR program, including training, in around 1800 lines of code. This is quite remarkable, and without neural networks this would never have worked.
200 |
201 | # Summarising
202 | The initial attempt to test this network on real life data failed somewhat because the MNIST character set does not include all forms of letters (which is by design, by the way).
203 |
204 | Secondly, we've learned that actually _doing_ something with real life data requires a lot of preprocessing: isolating letters, fixing the white balance, boxing in characters and adjusting them to what the network expects.
205 |
206 | The end result however is quite pleasing, especially since we spent only 300 lines on 'infrastructure' to get the data ready for our network.
207 |
208 | And, it should be noted that the total line count of 1500 for training and 300 for inference is impressively low.
209 |
210 |
213 | In [the next chapter](../dl-and-now-what/) you'll find further reading & pointers where to continue your deep learning journey.
214 |
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Linear combinations"
3 | date: 2023-03-30T12:00:01+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/hello-deep-learning-chapter1/index.md)!
7 |
8 | In this chapter we're going to build our first neural network and take it for a spin. Weirdly, this demonstration will likely simultaneously make you wonder "is this all??" and also impress you by what even this trivial stuff can do.
9 |
10 | The first part of this chapter covers the theory, and shows no code. The second part explains the code that makes it all happen. You can skip or skim the second part if you want to focus on the ideas.
11 |
12 | {{< rawhtml >}}
13 |
18 | {{< /rawhtml >}}
19 |
20 |
21 | ## Hello, world
22 | The "Hello, world" of neural networks is the MNIST set of handwritten digits. Meticulously collected, sanitized and labeled, this collection of 280,000 images is perfect to get started with. Most tutorials use MNIST, but because this one is written in 2023, we can use the ['extended' and improved EMNIST dataset](https://www.nist.gov/itl/products-and-services/emnist-dataset).
23 |
24 | For our first sample, we're going to write a neural network that can distinguish images of the digits 3 and 7, inspired by [this excellent FastAI tutorial for PyTorch](https://github.com/fastai/fastbook/blob/master/04_mnist_basics.ipynb).
25 |
26 | To start out with, we're not yet going to have a network that can learn things. We're going to configure it explicitly, which is a great way of figuring out what is going on.
27 |
28 | Neural networks are a lot about matrices and multiplying them. Confusingly, everyone in this world calls these matrices 'tensors', which is actually the wrong name for them. But I digress.
29 |
30 | So, the first thing we do is represent the images of digits found in the EMNIST database as matrices, which allows us to do math on them. In this way we can calculate the following three matrices (shown as images):
31 |
32 |
33 |
34 | 
35 | 
36 | 
37 | *The average 3, the average 7, the difference between these two*
38 |
39 |
40 |
41 | We can average all 3's and all 7's and get these fuzzy representations. The last picture is the most interesting one: it represents the "average 7 minus the average 3". The red pixels are high values, areas where there typically is more 'seven' than 'three'. The blue parts are low values, where there is typically more 'three' than 'seven'. Black pixels meanwhile are neutral, and confer no 'threeness' or 'sevenness'.
42 |
43 | One elementary neural network layer is the linear combination whereby we multiply the input (here, the image of a digit) by a matrix of 'weights'. These weights are the parameters that are usually evolved by training a network, but we're not going to do that yet.
44 |
45 | Instead, we're going to use the difference matrix shown above as the weights. Here is what that looks like for a typical 3:
46 |
47 |
48 |
49 |
50 |
51 |
*
52 |
53 |
=
54 |
55 |
56 |
57 |
58 |
59 |
60 |
61 | This represents a coefficient-wise product of two matrices (also known as a [Hadamard-Schur product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices))). Each pixel in the right-most image is the product of the pixel in the same place in the left and middle images. Because this is a typical 3, we see a lot of blue in the right. If we'd add up all the values of the pixels on the right, we'd end up with a negative number. This could then also be our decision rule: if the sum is negative, infer that this was an image of a 3.
62 |
63 | Conversely, this is what it looks like for a 7:
64 |
65 |
66 |
67 |
68 |
69 |
*
70 |
=
71 |
72 |
73 |
74 |
75 |
76 | Here we see a lot of red on the right, indicating a lot of higher values. The sum of all pixels is likely going to be a positive number, which means we can correctly infer this was a 7.
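
In code, this decision rule is nothing more than a coefficient-wise product followed by a sum. A tiny standalone Eigen sketch (shrunk to 3x3 'images', not the actual program):

```C++
#include <Eigen/Dense>
#include <iostream>

int main()
{
  // A toy 3x3 'image' with a vertical stroke on the right, and toy 'weights'
  // that are negative where threes live and positive where sevens live
  Eigen::Matrix3f image, weights;
  image   << 0, 0, 1,
             0, 0, 1,
             0, 0, 1;
  weights << -1, 0, 1,
             -1, 0, 1,
             -1, 0, 1;

  // coefficient-wise (Hadamard) product, then add up all entries
  float R = image.cwiseProduct(weights).sum();

  std::cout << "score R = " << R << ", verdict: " << (R > 0 ? 7 : 3) << "\n"; // R = 3 -> '7'
}
```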
77 |
78 | Now, this is all ridiculously naive, but we can give it a try using the [threeorseven.cc](https://github.com/berthubert/hello-dl/blob/main/threeorseven.cc) program:
79 | ```
80 | $ ./threeorseven
81 | Have 240000 training images and 40000 validation
82 | Three average result: -10.7929, seven average result: 0.785063
83 | 82.2125% correct
84 | ```
85 | That is quite something. So, this introduces another key aspect of machine learning: training and validation. The EMNIST set offers us 240,000 training images. To make sure that networks don't only memorise their training set, it is customary to validate models on a separate set of inputs. These are the 40,000 validation images. And true to form, the threeorseven.cc program calculates the averages only based on the training images, and then measures performance using the validation images.
86 |
87 | Now, 82.21% is nice, but something else stands out in the output of the program. We originally thought that a negative score (sum of pixels in the rightmost image) would represent a 3. We do see that the average three scores negatively (-10.79), but the average seven is only barely positive (0.785). Let's make a histogram of scores:
88 |
89 |
90 |
91 | 
92 |
93 |
94 |
95 |
96 | Clearly 0 is not the right number to compare our score against. Our histograms have a definite negative *bias*. Instead, we could use the middle between the average 3 score and the average 7 score:
97 |
98 | ```
99 | $ ./threeorseven
100 | Have 240000 training images and 40000 validation
101 | Three average result: -10.7929, seven average result: 0.785063
102 | Middle: -5.00393
103 | 97.025% correct
104 | ```
105 |
106 | That is pretty astounding. Using -5.00 as a decision rule we get 97.025% accuracy. This is approaching human level performance. Later we'll find out many reasons why we should not quite start celebrating yet though. But for now, this is quite impressive.
107 |
108 | In the above we have set the 'weights' (w) to the difference between average threes and sevens. We've also found a bias (b) that we need to apply. In formula form:
109 |
110 | {{< katex display >}} R =\sum{\mathit{image}\circ{}w} + b {{< /katex >}}
111 |
112 | Here {{< katex inline >}}\circ{{< /katex >}} stands for the coefficient-wise product, and {{< katex inline >}}\sum{{< /katex >}} means we add up all coefficients.
113 |
114 | If the result {{< katex inline >}}R{{< /katex >}} is positive, we infer that {{< katex inline >}}\mathit{image}{{< /katex >}} represents a 7.
115 |
116 | Note that for reasons which will become apparent later, neural network linear combinations mostly do not use these 'square' matrices and Hadamard products, but instead flattened versions. The central equation then becomes a regular matrix multiplication:
117 |
118 | {{< katex display >}} R =\mathit{image}\cdot{}w + b {{< /katex >}}
119 |
120 | ## Takeaway
121 | From the above, we can see that given very clean data, simple multiplication and additions are sufficient for a properly configured neural network to do pretty well on a simple task. The demo above is a completely standard neural network layer, with the only simplification that we configured it by hand instead of letting it learn. I'm simultaneously impressed by what this simple layer can do, but you might at this stage also be wondering "is that it??".
122 |
123 | In the next chapter we'll cover how training works. And you'll likely again be wondering how something so simple can be so effective.
124 |
125 | ## The code
126 | To start, clone the GitHub repository and download and unzip the EMNIST dataset:
127 | ```bash
128 | git clone https://github.com/berthubert/hello-dl.git
129 | cd hello-dl
130 | cmake .
131 | make -j4
132 | wget http://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip
133 | unzip gzip.zip
134 | ```
135 |
136 | We're going to look at [threeorseven.cc](https://github.com/berthubert/hello-dl/blob/main/threeorseven.cc).
137 |
138 | The first thing the code does is to read the EMNIST data (which is formatted in the MNIST standard):
139 |
140 | ```C++
141 | MNISTReader mn("gzip/emnist-digits-train-images-idx3-ubyte.gz", "gzip/emnist-digits-train-labels-idx1-ubyte.gz");
142 | MNISTReader mntest("gzip/emnist-digits-test-images-idx3-ubyte.gz", "gzip/emnist-digits-test-labels-idx1-ubyte.gz");
143 |
144 | cout << "Have "<}} R =\sum{\mathit{image}\circ{}w} {{}}) for every image. And using that, we calculate the average score for the threes and the sevens. And then we take the middle of those two scores.
216 |
217 | And finally, we're going to *validate* our model using the EMNIST set of test images in `mntest`:
218 |
219 | ```C++
220 | float bias = -middle;
221 | unsigned int corrects=0, wrongs=0;
222 |
223 | for(unsigned int n = 0 ; n < mntest.num(); ++n) {
224 |   int label = mntest.getLabel(n);
225 |   if(label != 3 && label != 7)
226 |     continue;
227 |
228 |   Tensor img(28,28);
229 |   mntest.pushImage(n, img);
230 |
231 |   float score = (img.dot(delta).sum()(0,0)) + bias; // the calculation
232 |   int predict = score > 0 ? 7 : 3; // the verdict
233 |
234 |   if(predict == label)
235 |     corrects++;
236 |   else {
237 |     saveTensor(img, "wrong-"+to_string(label)+"-"+to_string(wrongs)+".png", 252);
238 |     wrongs++;
239 |   }
240 | }
241 | cout<< 100.0*corrects/(corrects+wrongs) << "% correct" << endl;
242 | ```
243 |
244 | Note that on the first line we use the previously calculated `middle` to set the *bias* term. The rest of the code is straightforward. Also note that an image is generated for every incorrect result; studying those images will give you an impression of where the algorithm gets it wrong. Here's an example 7 that got classified as a 3:
245 |
246 |
247 |
248 | 
249 |
250 |
251 |
252 |
253 | And that's it! If you look at the full [threeorseven.cc](https://github.com/berthubert/hello-dl/blob/main/threeorseven.cc) you'll find that it contains some additional code to log data for the histogram we showed above, and for generating some sample images.
254 |
255 | In [the next chapter](../first-learning), we'll start doing some actual learning.
256 |
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Hyperparameters, inspection, parallelism, ADAM"
3 | date: 2023-03-30T12:00:06+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/hyperparameters-inpection-adam/index.md)!
7 |
8 | In [the previous chapter](../dl-convolutional) we successfully trained a network to recognize handwritten letters, but it took an awfully long time. This is not just inconvenient: networks that take too long to train mean we can experiment less. Some things really are out of reach if each iteration takes 24 hours, instead of 15 minutes. In addition we waste a lot of energy this way.
9 |
10 | To speed things up, [we can make our calculations go faster, or we can do less of them](https://berthub.eu/articles/posts/optimizing-optimizing-400-percent-speedup/). Deep learning employs both techniques.
11 |
12 | On this page we will go through how deep learning networks speed up their operations. In addition we'll be taking a rare look inside networks to see how parameters are actually evolving.
13 |
14 | # Parallelization
15 | One way of speeding up calculations is by making more of them happen at the same time. In deep learning, it is very typical to evaluate batches of 64 inputs at a time. Instead of shoveling a series of 64 matrices one by one through our calculations, we could also use a library that can take a stack of 64 matrices, call this a "64-high tensor", and do the calculation all at once for the whole stack. This would be especially convenient if we had 64 parallel processing units available of course.
16 |
17 | And it turns out that if you have the right kind of GPU, you do have such capacity.
18 |
19 | In addition, modern CPUs have [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) capabilities to perform certain calculations on 4, 8 or even 16 numbers at the same time. Also, most modern computers have multiple CPU cores available.
20 |
21 | Because of this, it is nearly standard in professional neural network environments to do almost everything with multidimensional tensors, as this offers your calculating backend as many opportunities as possible to use your CPU or GPU to perform many calculations in parallel. Some of the largest networks are even [distributed over multiple computers](https://pytorch.org/tutorials/beginner/dist_overview.html).
22 |
23 | The reason this works so well is that we typically evaluate a whole batch of inputs while keeping the neural network parameters constant. This means all calculations can happen in parallel - they don't need to change any common data. The only thing that needs to happen sequentially is to gather all the gradients and apply them to the network. And then a new batch can be processed in parallel again.
24 |
25 | # Being clever
26 | A mainstay of neural networks is matrix multiplication. On the surface this would appear to be an \\(O(N^3)\\) process, scaling with the number of rows and columns of both matrices (where the number of columns in the first matrix must equal the number of rows in the second). It turns out that through sufficient mathematical cleverness, [matrix multiplications can be performed a lot more efficiently](https://en.wikipedia.org/wiki/Computational_complexity_of_matrix_multiplication). In addition, if you truly understand what CPUs and caches are doing, you can speed things up even more.
27 |
28 | As an example, an earlier version of the software behind these blog posts performed naive matrix multiplication. I finally gave up and moved to a professional matrix library ([Eigen](https://en.wikipedia.org/wiki/Eigen_(C%2B%2B_library))) and this delivered a 320-fold speedup immediately. In short, unless you really know what you are doing, you have no business implementing matrix multiplication yourself.
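
As a minimal illustration of handing that work to a library (a standalone example of mine, not code from the hello-dl repository):

```C++
#include <Eigen/Dense>
#include <iostream>

int main()
{
  // With Eigen, a matrix product is one line - and it will run circles
  // around a hand-written triple loop
  Eigen::MatrixXf a = Eigen::MatrixXf::Random(128, 784);
  Eigen::MatrixXf b = Eigen::MatrixXf::Random(784, 64);
  Eigen::MatrixXf c = a * b;           // 128x64 result
  std::cout << c.rows() << "x" << c.cols() << "\n";
}
```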
29 |
30 | The software behind this series of posts benefits from SIMD vectorization because Eigen and compilers are able to make good use of parallel instructions. In addition, threads are used to just use more CPU cores.
31 |
32 | # Doing less
33 | Parallel and clever computations are nice since they allow us to do what we were doing already, but now faster. There are limits to this approach however - we can't keep on inventing faster math for example.
34 |
35 | While training our networks, we are effectively trying out sets of parameters & calculating how the outcome would change if we adjusted our parameters, based on their derivatives (through automatic differentiation).
36 |
37 | Above we described how we could perform such calculations faster, which means we can evaluate more sets of parameters per unit of time, which is nice.
38 |
39 | A key component can also be improved, however: making better use of those derivatives, so that each update improves our parameters more effectively.
40 |
41 | In previous chapters we've trained our networks by adjusting each parameter by (say) 0.2 times the derivative of the loss function with respect to the parameter. This is the simplest possible approach, but it is not the best one.
42 |
43 | Learning by equally sized increments could be compared to climbing a hill taking tiny equally sized steps, where you know that if all previous steps have been upwards, you could probably get there a lot faster if you took larger steps.
44 |
45 | Recall how we previously described 'hill climbing', where it worked pretty well:
46 |
47 |
48 |
49 | 
50 |
51 |
52 |
53 |
54 |
55 | However, on a more complex landscape, this regular gradient descent does not work so well:
56 |
57 |
58 |
59 | 
60 | *Nearly gets stuck around x=1.5, overshoots the goal*
61 |
62 |
63 |
64 | We can see that the algorithm nearly gets stuck on the horizontal part around {{< katex inline >}}x=1.5{{< /katex >}}. In addition, when it eventually gets near the goal, it ping-pongs around it and never settles on the minimum.
65 |
66 | A popular enhancement to this 'linear gradient descent' is to make it slightly more physical. For example, we could simulate a ball rolling down a hill, where the ball speeds up as it goes along, but also experiences friction:
67 |
68 |
69 |
70 | 
71 |
72 |
73 |
74 | This is called gradient descent with momentum, and it is pretty nice. Further enhancements are possible too, and most networks these days use [ADAM](https://machinelearningmastery.com/adam-optimization-from-scratch/), which not only implements momentum but also keeps smoothed (exponentially averaged) estimates of both the gradient and its square. In addition, it cleverly corrects for the zero initialization of those estimates, so the network "gets a running start". With judiciously picked parameters (\\(\alpha\\), \\(\beta_1\\) and \\(\beta_2\\)), [ADAM appears to be the best generic optimizer around](https://arxiv.org/abs/1412.6980).
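
As a sketch of what these update rules look like per parameter - the textbook ADAM formulas; the actual `learnAdam` in tensor-layers.hh may be organized differently:

```C++
#include <cmath>

// Plain SGD:          w -= lr * grad
// SGD with momentum:  v = mu*v + grad;  w -= lr * v
// ADAM keeps two smoothed quantities per parameter and uses them like this:
struct AdamState { float m = 0, v = 0; };

// Returns the adjustment to add to the parameter; t counts steps starting at 1
float adamStep(float grad, AdamState& a, int t,
               float alpha = 0.001f, float beta1 = 0.9f, float beta2 = 0.999f, float eps = 1e-8f)
{
  a.m = beta1 * a.m + (1 - beta1) * grad;          // smoothed gradient (1st moment)
  a.v = beta2 * a.v + (1 - beta2) * grad * grad;   // smoothed squared gradient (2nd moment)
  float mhat = a.m / (1 - powf(beta1, t));         // bias correction: the 'running start'
  float vhat = a.v / (1 - powf(beta2, t));
  return -alpha * mhat / (sqrtf(vhat) + eps);
}
```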
75 |
76 | As a case in point, recall how our pedestrian stochastic gradient descent took almost a whole day to learn how to read letters. Here is that same model on ADAM:
77 |
78 |
79 |
80 | 
81 |
82 |
83 |
84 | Within 1 CPU hour, [this code](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc) was recognizing 80% of letters correctly.
85 |
86 | In addition, by benefiting from four-fold parallelization (since I have 4 cores), this code becomes even faster:
87 |
88 |
93 |
94 | By achieving good performance after 15 wall clock minutes, we've increased our learning speed by over a factor of 80.
95 |
96 | These things are not just nice, they are complete game changers. Networks that otherwise take days to reach decent performance can do so in hours with the right optimizer. Also, the optimizer can actually achieve better results by not getting stuck in local minima. This brings us to a rather dark subject in deep learning.
97 |
98 | # Hyperparameters
99 | So far we've seen a number of parameters that had to be set: the learning rate, for which we suggested a value of 0.2. If we want to use momentum (the rolling ball method), we have to pick a *momentum* parameter. If we use ADAM, we need to pick \\(\alpha\\), \\(\beta_1\\) and \\(\beta_2\\) (although the default values are pretty good).
100 |
101 | In addition there is the batch size. If we set this too low, the network jumps around too much. Too high and everything takes too long.
102 |
103 | These important numbers are called *hyperparameters*, to distinguish them from the regular parameters that we are mutating within our neural network to make it learn things.
104 |
105 | If you visit the many many demos of how easy machine learning is, you'll mostly see the hyperparameters just appearing there, with no explanation how they were derived.
106 |
107 | I can authoritatively tell you that very often these numbers came from days long experimentation. If you pick the numbers wrong, nothing (good) might happen, or at least not at any reasonable speed. Many demos are not honest about this, and if you change any of their carefully chosen numbers, you might find that the network no longer converges.
108 |
109 | The *learning* part of machine learning is a lot harder than many demos make it out to be.
110 |
111 | The actual design of the neural network layers is also considered part of the hyperparameter set. So if a network sorta arbitrarily consists of three convolutional layers with N channels in and M channels out, plus three fully connected linear combinations of x by y, know that these numbers were often gleaned from an earlier implementation, or were selected only after tedious "parameter sweeping".
112 |
113 | So know that if you are ever building a novel network and it doesn't immediately perform like the many demos you saw, this is entirely normal and not your fault.
114 |
115 | # Inspection
116 | Neural networks tend to be pretty opaque, and this happens on two levels. From a theoretical standpoint, it is already hard to figure out how "a network does its thing". Much like in biology, it is not clear which neuron does what. We can sometimes "see" what is happening, as for example in our earliest 3-or-7 network. But it is hard work.
117 |
118 | On a second level, if we have a ton of parameters all being trained, it is in a practical sense not that easy to get the numbers out to figure out what is going on.
119 |
120 | For PyTorch, there are commercial platforms like [Weights & Biases](https://wandb.ai/) that can help create insight. But it turns out that with some simple measures we can also get a good look about what is going on.
121 |
122 | For logging, we use [SQLiteWriter](https://berthub.eu/articles/posts/big-data-storage/), a tiny but pretty powerful logger that works like this:
123 |
124 | ```C++
125 | SQLiteWriter sqw("convo-vals.sqlite3");
126 |
127 | ...
128 | sqw.addValue({
129 | {"startID", startID}, {"batchno", batchno},
130 | {"epoch", 1.0*batchno*batch.size()/mn.num()},
131 | {"time", time(0)}, {"elapsed", time(0) - g_starttime},
132 | {"cputime", (double)clock()/CLOCKS_PER_SEC},
133 | {"corperc", perc}, {"avgloss", totalLoss/batch.size()},
134 | {"batchsize", (int)batch.size()}, {"lr", lr*batch.size()},
135 | {"momentum", momentum}}, "training");
136 | ```
137 |
138 | This logs a modest amount of statistics to SQLite for every batch. The 'startID' is set when the program starts, which means that multiple runs of the software can log to the same SQLite database and we can distinguish what was logged by which invocation.
139 |
140 | The other numbers mostly describe themselves, with `corperc` denoting the percentage of correct digit determinations (in this case). `lr` and `momentum` are also logged since these might change from run to run. All values end up in a table called `training`, there is a similar table called `validation` which stores the same numbers, but then for the validation set.
141 |
142 | These numbers are nice to track our learning progress, but to really look inside we need to log a lot more. Recall how in the code samples so far we register the layers in our network:
143 |
144 | ```C++
145 | State()
146 | {
147 | this->d_members = {{&c1, "c1"}, {&c2, "c2"},
148 | {&c3, "c3"}, {&fc1, "fc1"},
149 | {&fc2, "fc2"}, {&fc3, "fc3"}};
150 | }
151 | ```
152 | Note that we also gave each layer a name. Our network does not itself need to know the names of layers, but it is great for logging. Each layer in the software knows how to log itself to the `SQLiteWriter` and we can make this happen like this:
153 |
154 | ```C++
155 | ConvoAlphabetModel m;
156 | ConvoAlphabetModel::State s;
157 | ..
158 | if(batchno < 32 || !(tries%32)) {
159 | s.emit(sqw, startID, batchno, batch.size());
160 | }
161 | ```
162 |
163 | This logs the full model to the `SQLiteWriter` for the first 32 batches, and from then on once every 32 batches. Since models might have millions of parameters, we do need to think this through a bit.
164 |
165 | Here is what comes out:
166 |
167 |
168 | 
169 | *Values of the kernel of the 20th filter of c2*
170 |
171 |
172 |
173 |
174 | The [code to create this](https://github.com/berthubert/hello-dl/blob/main/hello-dl.ipynb) is relatively simple. First we retrieve the data:
175 |
176 | ```Python
177 | import pandas
178 | import matplotlib.pyplot as plt
179 | from sqlalchemy import create_engine
180 |
181 | engine = create_engine("sqlite:////home/ahu/git/hello-dl/convo-vals.sqlite3")
182 | startIDs = pandas.read_sql_query("SELECT distinct(startID) as startID FROM data", engine)
183 | startID = startIDs.startID.max()
180 | ```
181 |
182 | And then select the data to plot:
183 | ```Python
184 | fig, ax1 = plt.subplots(figsize=(7,6))
185 |
186 | sel = pandas.read_sql_query(f"SELECT * FROM data where startID={startID} and name='c2' "
187 |                             "and idx=20 and subname='filter' order by batchno", engine)
188 | sel.set_index("batchno", inplace=True)
189 | for c in sel.col.unique():
190 |     for r in sel.row.unique():
191 |         v = sel[(sel.row==r) & (sel.col==c)]
192 |         ax1.plot(v.index, v.val - 1.0*v.val.mean(),
193 |                  label=v.name.unique()[0]+"["+str(v.idx.unique()[0])+"]("+str(r)+","+str(c)+")" )
194 | ax1.legend(loc=2)
195 | plt.title("Value of parameters of a convolutional filter kernel")
196 | plt.xlabel("batchno")
197 | plt.ylabel("value")
198 | ```
199 |
200 | The `data` table has fields called `batchno`, `startID`, `name`, `idx`, `row`, `col`, `value` and `grad` that fully identify an element, and also store its current value and the gradient being used for SGD or ADAM.
201 |
202 | # ADAM practicalities
203 | The ADAM optimizer does require some infrastructure. To make things work, our `Tensor` class now also has a struct storing the ADAM parameters:
204 |
205 | ```C++
206 | struct AdamVals
207 | {
208 | EigenMatrix m;
209 | EigenMatrix v;
210 | } d_adamval;
211 | ```
212 |
213 | These stand for ADAM's first and second moment estimates - the momentum and the (squared) velocity of "the ball", if you will.
214 |
215 | [Our code](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc#L259) has meanwhile grown an option parser so we can select an optimizer at will:
216 |
217 | ```C++
218 | if(program.get<bool>("--adam"))
219 |   s.learnAdam(1.0/batch.size(), batchno, program.get<double>("--alpha"));
220 | else
221 |   s.learn(lr, momentum);
222 | ```
223 |
224 | The mechanics of `learnAdam` can be found in [tensor-layers.hh](https://github.com/berthubert/hello-dl/blob/main/tensor-layers.hh#L27).
225 |
226 | # Parallelization
227 | As noted, we can evaluate a whole batch in parallel, since the network parameters stay constant during evaluation. We do however have to gather all the gradients from the individual evaluations and add them up.
228 |
229 | As is always the case, **speeding things up by parallelizing them does not make your code any more readable**. This is especially painful for an educational project like this one. I've tried hard to keep it as simple as possible. The 4- or 8-fold speedup you can achieve with this technique is important enough to warrant its use. There is a huge difference between 30 minutes of training and 4 hours.
230 |
231 | One of the simplest ways to make sure that things actually get faster with multiple threads is to use a '[shared nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture)', and this is what we do for our project.
232 |
233 | We launch a number of threads that each have a complete copy of the model we are training. These then process individual images/samples from a batch, and record the gradients.
234 |
235 | Once all threads are done, the gradients are gathered together, and then the `main()` thread copy of the model performs the learning. The new parameters are then broadcast to the thread copies again, and then the next batch is processed.
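
In heavily condensed form, with stand-in types instead of the real model class, the pattern looks roughly like this sketch (not the actual tensor-convo-par.cc code):

```C++
#include <thread>
#include <vector>

// Stand-in types so the sketch is self-contained; the real 'Model' is of
// course the whole neural network with all its layers
struct Sample { /* image plus label */ };

struct Model
{
  std::vector<float> params = std::vector<float>(1000, 0.0f);
  std::vector<float> grads  = std::vector<float>(1000, 0.0f);

  void forwardBackward(const Sample&) { /* evaluate the sample, add to grads */ }

  void learn(float lr)
  {
    for(size_t i = 0; i < params.size(); ++i) {
      params[i] -= lr * grads[i];
      grads[i] = 0;
    }
  }
};

// One batch: workers process disjoint parts of the batch in parallel on their
// own copy of the model, the gradients are then summed into the main model,
// which learns and broadcasts fresh parameters for the next round
void trainBatch(Model& mainModel, std::vector<Model>& workers, const std::vector<Sample>& batch)
{
  std::vector<std::thread> threads;
  for(size_t t = 0; t < workers.size(); ++t) {
    workers[t].params = mainModel.params;      // broadcast current parameters
    threads.emplace_back([&workers, &batch, t]() {
      for(size_t i = t; i < batch.size(); i += workers.size())
        workers[t].forwardBackward(batch[i]);  // shared nothing: touches only its own copy
    });
  }
  for(auto& th : threads)
    th.join();

  for(auto& w : workers) {                     // gather the gradients
    for(size_t i = 0; i < mainModel.grads.size(); ++i)
      mainModel.grads[i] += w.grads[i];
    w.grads.assign(w.grads.size(), 0.0f);
  }
  mainModel.learn(0.01f);                      // one learning step with all gradients
}
```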
236 |
237 | Sadly, despite my best efforts, the code in [tensor-convo-par.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc) has a hundred lines of thread handling to make this all possible.
238 |
239 | # Next up
240 | We started our quest for robust character recognition, but found that it was learning only very slowly. In this chapter we looked into various optimizers and found that ADAM converged 20 times faster. [In the next chapter](../dropout-data-augmentation-weight-decay), we are going to check if our network is actually robust, and what we can do to make it so.
241 |
--------------------------------------------------------------------------------
/autograd/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Automatic differentiation, autograd"
3 | date: 2023-03-30T12:00:03+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/autograd/index.md)!
7 |
8 | In the previous chapter we configured a neural network and made it learn to distinguish between the digits 3 and 7. The learning turned out to consist of "twisting the knobs in the right direction". Although simplistic, the results were pretty impressive. But, you might still be a bit underwhelmed - the network only distinguished between two digits.
9 |
10 | To whelm it up somewhat, in this chapter we'll introduce a 5-layer network that can learn to recognize all 10 handwritten digits with near perfect accuracy. But before we can make it learn, we need to move slightly beyond the "just twist the parameters in the right direction" algorithm.
11 |
12 | The first part of this chapter covers the theory, and shows no code. The second part explains the code that makes it all happen. You can skip or skim the second part if you want to focus on the ideas.
13 |
14 | ## The basics
15 | Our previous network consisted of one layer, a linear combination of input pixels. Here is a **preview** of the layers that achieve 98% accuracy recognizing handwritten digits:
16 |
17 | 1. Flatten 28x28 image to a 784x1 matrix
18 | 2. Multiply this matrix by a 128x784 matrix
19 | 3. Replace all negative elements of the resulting matrix by 0
20 | 4. Multiply the resulting matrix by a 64x128 matrix
21 | 5. Replace all negative elements of the resulting matrix by 0
22 | 6. Multiply the resulting matrix by a 10x64 matrix
23 | 7. Pick the highest row of the resulting 10x1 matrix, this is the digit the network thinks it saw
24 |
25 | This model involves three matrices of parameters, with in total 128\*784 + 64\*128 + 10\*64 = 109184 *weights*. There are also 128+64+10 = 202 *bias* parameters.
26 |
27 | We'll dive into this network in detail later, but for now, ponder how we'd train this thing. If the output of this model is not right, by how much should we adjust each parameter? For the one-layer model from the previous chapter this was trivial - the connection between input image intensity and a weight was clear. But here?
28 |
29 | ## Turning the knobs, or, gradient descent
30 | In our previous model, we took the formula:
31 |
32 | {{< katex display >}}R=p_1w_1 + p_2w_2 + \cdots + p_{783}w_{783} + p_{784}w_{784}{{< /katex >}}
33 |
34 | And we then performed 'learning' by increasing the {{< katex inline >}}w_n{{< /katex >}} parameters by 0.1 of their associated {{< katex inline >}}p_n{{< /katex >}}. Effectively, we took the *derivative* of the error ({{< katex inline >}}\pm R{{< /katex >}}) with respect to {{< katex inline >}}w_n{{< /katex >}}, multiplied it by 0.1, and added it to {{< katex inline >}}w_n{{< /katex >}}.
35 |
36 | This is what is called 'gradient descent', and it looks like this:
37 |
38 |
39 |
40 | 
41 | *Actually hill descending in this case*
42 |
43 |
44 |
45 |
46 | This is a one-dimensional example, and it is very successful: it quickly found the minimum of the function. Such hill climbing has a tendency of getting stuck in local optima, but in neural networks this apparently is far less of a problem. This may be because we aren't optimizing over 1 axis, we are actually optimizing over 109184 parameters (in the digit reading network described above). It probably takes quite a lot of work to create a 109184-dimensional local minimum.
47 |
48 | So, to learn this way, we need to perform all the calculations in the neural network, look at the outcome, and see if it needs to go up or down. Then we need to find the derivative of the outcome versus all parameters. And then we move all parameters by 0.1 of that derivative (the 'learning rate').
49 |
50 | This really is all there is to it, but we are now left with the problem how to determine all these derivatives. Luckily this is a well solved problem, and the solution is quite magical. And it is good that this is so, because there are models with hundreds of billions of parameters. Those derivatives should be simple and cheap.
51 |
52 | # Automatic differentiation
53 | So, unlike integration, differentiation is actually very straightforward. And it turns out that with relatively little trouble you can get a computer to do it for you. If we for example have:
54 |
55 | {{< katex display >}}y = 2x^3 + 4x^2 + 3x + 2 {{< /katex >}}
56 |
57 | It is trivial (even for a computer) to turn this into:
58 |
59 | {{< katex display >}}\frac{dy}{dx} = 6x^2 + 8x + 3 {{< /katex >}}
60 |
61 | And even if we make life more complex, the rules remain simple:
62 |
63 | {{< katex display >}}y = \sin{(2x^3 + 4x^2 + 3x + 2)} {{< /katex >}}
64 | {{< katex display >}}\frac{dy}{dx} = (6x^2+8x+3) \cos{(2x^3 + 4x^2 + 3x + 2)}{{< /katex >}}
65 |
66 | This is the '[chain rule](https://en.wikipedia.org/wiki/Chain_rule)', which says that the derivative of a compound function is the derivative of the outer function (evaluated at the inner function) multiplied by the derivative of the inner function.
67 |
68 | I don't want to flood you with too much math, but [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) is at the absolute core of neural networks, so it pays to understand what is going on.
69 |
70 | # "Autograd"
71 | Every neural network system (PyTorch, TensorFlow, [Flashlight](https://github.com/flashlight/flashlight)) implements an autogradient system that performs automatic differentiation. Such systems can be implemented easily in any programming language that supports operator overloading and reference counted objects. And in fact, the implementation is so easy that you sometimes barely see it. A great example of this is [Andrej Karpathy](https://twitter.com/karpathy)'s [micrograd](https://github.com/karpathy/micrograd) autogradient implementation, [which is a tiny work of art](https://github.com/karpathy/micrograd/blob/master/micrograd/engine.py).
72 |
73 | First, let's look at what such a system can do:
74 |
75 | ```C++
76 | Tensor x(2.0f);
77 | Tensor z(0.0f);
78 | Tensor y = Tensor(3.0f)*x*x*x + Tensor(4.0f)*x + Tensor(1.0f) + x*z;
79 | ```
80 |
81 | This configures `y` to be {{< katex inline >}}3x^3 + 4x + 1 +xz{{< /katex >}}. The notation is somewhat clunky - it is possible to make a library that automatically converts naked numbers into `Tensor`s, but such a library might also surprise you one day by doing so when you don't expect it.
82 |
83 | Next up, let's do something:
84 |
85 | ```C++
86 | cout << "y = "<< y << endl; // 3*8 + 4*2 + 1 = 33
87 |
88 | y.backward();
89 |
90 | cout << "dy/dx = " << x.getGrad() << endl; // 9*x^2 + 4 = 40
91 | cout << "dy/dz = " << z.getGrad() << endl; // 2
92 | ```
93 |
94 | This prints out the expected outputs, which is nice. The first line perhaps appears to only print out the value of `y`, but as is customary in these systems, the calculation only happens once you try to get the value. In other words, this is [lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation). This can sometimes confuse you when you set up a huge calculation that appears to happen in 'no time' - the actual calculation simply hasn't happened yet.
95 |
96 | The last line of the initial snippet of code (`Tensor y =`...) actually created a little computer program that will create the right output once run. This little computer program takes the shape of a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph):
97 |
98 |
115 |
116 | Here it is obvious that {{< katex inline >}}dy/dx=z=0{{< /katex >}}. Meanwhile, {{< katex inline >}}dy/dz=x=2{{< /katex >}}. So, looking at the directed acyclic graph (DAG), if we want to calculate the gradient or differential, each node gets the value of the opposite node:
117 |
118 |
119 |
120 | 
121 | *Red lines denote 'sending the gradient'. The right node received the value of the left node as its gradient, and vice versa*
122 |
123 |
124 |
125 |
126 | For a slightly more complicated example:
127 |
134 |
135 | Here we see that the gradients 'drop down' the tree and add up to the correct values.
136 | {{< katex inline >}}dy/dx =1{{< /katex >}}, because {{< katex inline >}}z+a=1{{< /katex >}}. Meanwhile, both
137 | {{< katex inline >}}dy/da{{< /katex >}} and {{< katex inline >}}dy/dz{{< /katex >}} are 2, because {{< katex inline >}}x=2{{< /katex >}}.
138 |
139 | Now for our full calculation:
140 |
147 |
148 |
183 | And this indeed arrives at the right numbers. To perform the actual calculation, we visit each node *once*, starting at the top, and *push down* the accumulated gradient to the child nodes.
184 |
185 | Now, in a demonstration of why a computer science education is useful (I sadly missed out), it turns out that doing such a traversal is a well solved problem. Using an elegant algorithm, a directed acyclic graph can be [sorted topologically](https://en.wikipedia.org/wiki/Topological_sorting). And in this order, we can visit each node once and in the right order to propagate the accumulated gradients downward.
186 |
187 | The elegant algorithm is so elegant you might miss it in the code. It goes like this:
188 |
189 | 1. Start at the top node
190 | 2. If a node has been visited already, return. Otherwise, add node to *visited* set
191 | 3. Visit all child nodes (ie, apply steps 2-4 to each of them)
192 | 4. Add ourselves at the end of the topological list of nodes
193 |
194 | In this way, we can see that the leaf nodes are the first to be added. The top node only gets added last, because adding to the topological list only happens once all child nodes are done. Meanwhile the *visited* set makes sure we do the process just once per node.
195 |
196 | To distribute the gradients for automatic differentiation, the topological list is processed in reverse order, which means that we start at the top.
197 |
198 | Automatic differentiation can be used for many other things, and need not stop at first derivatives. [ADOL-C](https://github.com/coin-or/ADOL-C) is an interesting library in this respect.
199 |
200 | Any time you ask ChatGPT a question, know there is a DAG containing 175 billion parameters that is processing your every word, and that it got taught what it can do by the exact autogradient process described on this page.
201 |
202 | # The code
203 | The key concept is that by typing in formulas, we get our computer to build the DAG for us - doing this by hand would be undoable. Any language that features operator overloading enables us to make this happen rather easily. This is a great example of "letting the ball do the work". By defining addition and multiplication operators that don't actually perform those calculations, but instead populate a DAG that eventually will, we get a ton of functionality for free.
204 |
205 | We need a bit more than operator overloading though. We also need objects that stay alive, either by being reference counted, or by surviving garbage collection.
206 |
207 | As an example:
208 |
209 | ```C++
210 | Tensor x(2.0f);
211 | Tensor z(0.0f);
212 | Tensor y = Tensor(3.0f)*x*x*x + Tensor(4.0f)*x + Tensor(1.0f) + x*z;
213 | ```
214 | The values 3.0, 4.0 and 1.0 are all temporaries. These instances vanish from existence by the time the final line is done executing. Yet, they must still find a place in the DAG.
215 |
216 | For this reason, a language like C++ needs to create reference counted copies. Python and other pass-by-reference languages with garbage collection may get this for free.
217 |
218 | The `Tensor` class in this series of blog posts works like this:
219 |
220 | ```C++
221 | template<typename T=float>
222 | struct Tensor
223 | {
224 |   typedef Tensor<T> us_t;
225 |   Tensor() : d_imp(std::make_shared<TensorImp<T>>())
226 |   {}
227 |
228 |   Tensor(unsigned int rows, unsigned int cols) : d_imp(std::make_shared<TensorImp<T>>(rows, cols))
229 |   {}
230 |
231 |   // ...
232 |   std::shared_ptr<TensorImp<T>> d_imp;
233 | };
234 |
235 | ```
236 | There are many other methods, but this is the key - there is an actual reference counted `TensorImp` behind this. The class is templatized, defaulting to float. Amazingly enough, machine learning has such an effect on hardware that it is triggering innovations like 16 bit floats!
237 |
238 | To actually do anything with these `Tensor`s, there are overloaded operators:
239 |
240 | ```C++
241 | template<typename T>
242 | inline Tensor<T> operator+(const Tensor<T>& lhs, const Tensor<T>& rhs)
243 | {
244 |   Tensor<T> ret;
245 |   ret.d_imp = std::make_shared<TensorImp<T>>(lhs.d_imp, rhs.d_imp, TMode::Addition);
246 |   return ret;
247 | }
248 | ```
249 | With this, you can do `Tensor z = x + w`, and `z` will end up containing a `TensorImp` containing reference counted references to `x` and `w`.
250 |
251 | Which looks like this:
252 |
253 | ```C++
254 | template<typename T>
255 | struct TensorImp
256 | {
257 |   typedef TensorImp<T> us_t;
258 |
259 |   //! Create a new parameter (value) tensor. Inits everything to zero.
260 |   TensorImp(unsigned int rows, unsigned int cols) : d_mode(TMode::Parameter)
261 |   {
262 |     d_val = Eigen::MatrixX<T>(rows, cols);
263 |     d_grads = Eigen::MatrixX<T>(rows, cols);
264 |     d_grads.setZero();
265 |     d_val.setZero();
266 |     d_haveval = true;
267 |   }
268 |
269 |   TensorImp(std::shared_ptr<us_t> lhs, std::shared_ptr<us_t> rhs, TMode m) :
270 |     d_lhs(lhs), d_rhs(rhs), d_mode(m)
271 |   {
272 |   }
273 |   ...
274 |   std::shared_ptr<us_t> d_lhs, d_rhs;
275 |   TMode d_mode;
276 | };
277 | ```
278 | Here we see a few notable things. For one, we see Eigen crop up. Eigen is a matrix library used by many machine learning projects (including TensorFlow and PyTorch). You might initially think you could do your own matrix library, but this is not the case. The Eigen matrix multiplications for example are over 300 times faster than my hand rolled previous attempts.
279 |
280 | We also see `d_lhs` and `d_rhs`; these are the embedded references to the operands of binary operators like '+', '-', '\*' etc. It is these references that allow us to build a directed acyclic graph that contains the instructions for how to calculate the outcome of the calculation.
281 |
282 | Here's an abbreviated version of how that works:
283 | ```C++
284 | void assureValue(const TensorImp* caller=0) const
285 | {
286 |   if(d_haveval || d_mode == TMode::Parameter)
287 |     return;
288 |
289 |   if(d_mode == TMode::Addition) {
290 |     d_lhs->assureValue(this);
291 |     d_rhs->assureValue(this);
292 |     d_val.noalias() = d_lhs->d_val + d_rhs->d_val;
293 |   }
294 |   else if(d_mode == TMode::Mult) {
295 |     d_lhs->assureValue(this);
296 |     d_rhs->assureValue(this);
297 |     d_val.noalias() = d_lhs->d_val * d_rhs->d_val;
298 |   }
299 |   ...
300 | }
301 | ```
302 | Nodes can contain a value that was calculated earlier, in which case `d_haveval` is set. And if needed, `assureValue` is called in turn on child nodes.
303 |
304 | 'Calculating the outcome' is what is called the 'forward pass' in neural networks. The automatic differentiation meanwhile is calculated in the opposite direction. Here is where we get all the nodes in topological (reverse) order:
305 |
306 | ```C++
307 | void build_topo(std::unordered_set<us_t*>& visited, std::vector<us_t*>& topo)
308 | {
309 |   if(visited.count(this))
310 |     return;
311 |   visited.insert(this);
312 |
313 |   if(d_lhs) {
314 |     d_lhs->build_topo(visited, topo);
315 |   }
316 |   if(d_rhs) {
317 |     d_rhs->build_topo(visited, topo);
318 |   }
319 |   topo.push_back(this);
320 | }
321 | ```
322 |
323 | As noted above, you could easily miss the magic behind this.
324 |
325 | Once we have this topological ordering, distributing the gradients downwards is simple:
326 |
327 | ```C++
328 | d_imp->d_grads.setConstant(1.0);
329 | for(auto iter = topo.rbegin(); iter != topo.rend(); ++iter) {
330 | (*iter)->doGrad();
331 | }
332 | ```
333 |
334 | The first line is important: the gradient of the top node is 1 (by definition, {{< katex inline >}}dy/dy=1{{< /katex >}}). Every other node starts at 0, and is set through the automatic differentiation.
335 | Note the `rbegin()` and `rend()`, which mean we traverse the topological ordering in reverse.
336 |
337 | The abbreviated `doGrad()` meanwhile looks like this:
338 |
339 | ```C++
340 | void doGrad()
341 | {
342 |   if(d_mode == TMode::Parameter) {
343 |     return;
344 |   }
345 |   else if(d_mode == TMode::Addition) {
346 |     d_lhs->d_grads += d_grads;
347 |     d_rhs->d_grads += d_grads;
348 |   }
349 |   else if(d_mode == TMode::Mult) {
350 |     d_lhs->d_grads.noalias() += (d_grads * d_rhs->d_val.transpose());
351 |     d_rhs->d_grads.noalias() += (d_lhs->d_val.transpose() * d_grads);
352 |   }
353 |   ...
354 | }
355 | ```
355 |
356 | If a node is just a number (`TMode::Parameter`) it has no gradient to distribute further. If a node represents an addition, the gradient gets passed on verbatim to both the left hand and right hand sides of the + operator.
357 |
358 | For the multiplication case, we see that the left hand side indeed gets a gradient delivered that is proportional to the right hand side, and vice-versa. The delivered gradient is also proportional to the gradient that has already been passed down to this node.
359 |
360 | The calls to `.transpose()` meanwhile reflect that our Tensor class is actually a matrix. So far we've been multiplying only 1x1 Tensors, which act just like numbers. In reality this class is used to multiply pretty large matrices.
361 |
362 | Rounding it off - automatic differentiation is absolutely key to neural networks. That we can assemble networks of many, many layers, each consisting of huge matrices, using a straightforward syntax makes it possible to innovate rapidly. We are lucky that modern languages make it possible to both assemble these networks easily AND perform automatic differentiation.
363 |
364 | [In the next chapter](../handwritten-digits-sgd-batches/) we'll be making our multi-layer network do some actual work in learning to recognize 10 different digits. There we'll also be introducing key concepts in machine learning like loss function, batches and the enigmatic 'softlogmax' layer.
365 |
366 |
367 |
--------------------------------------------------------------------------------
/dl-convolutional/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Convolutional networks"
3 | date: 2023-03-30T12:00:05+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dl-convolutional/index.md)!
7 |
8 | In the [previous chapter](../autograd) we taught a network of linear combinations and 'rectified linear units' to recognize handwritten digits reasonably successfully. But we already noted that the network would be sensitive to the exact location of pixels, and that it does not in any meaningful way "know" what a 7 looks like.
9 |
10 | In this chapter we're going to explore convolutional layers that can scan for shapes and make decisions based on their relative positions. And, we'll go over the design of a convolutional neural network that is quite successful at reading not just handwritten digits, but handwritten letters too.
11 |
12 | # Neural network disappointment
13 | A recurring theme in machine learning is whether the network is 'overfitting': not really learning things as we'd hope, but instead simply memorising stuff. 'Rote learning' if you will. This is in fact a constant battle, and many of the very public machine learning failures, as for example [during the COVID-19 pandemic](https://www.technologyreview.com/2021/07/30/1030329/machine-learning-ai-failed-covid-hospital-diagnosis-pandemic/), are due to networks latching on to the wrong things, or not having generalized their knowledge as broadly as we'd been assuming.
14 |
15 | If you take away one thing from this series of posts, please let it be that production use of a neural network tends to go through these four phases (if you are lucky):
16 |
17 | 1. It works on the training data
18 | 2. It also works on the validation data
19 | 3. After a lot of disappointment, we get it to work on other people's real life data too
20 | 4. Other people can get it to work on their own data as well
21 |
22 | Almost all demos declare victory after phase 2. This tutorial aims to achieve the final phase.
23 |
24 | To prove this point, here is a graph showing the validation success of our previous network with only slightly modified inputs:
25 |
26 |
27 |
28 | 
29 |
30 |
31 |
32 |
33 | Here the input was shifted around by 2 pixels, and 5 random pixels were flipped. No human would be fazed in the least by these changes, but our network's performance drops to around 50%, which is pretty disappointing.
34 |
35 | Clearly we need better things than just multiplying whole images and matrices. These things turn out to be 'convolutional' operations, 'max-pooling' and 'gelu'.
36 |
37 | # Convolutional networks
38 | Also known as [CNN, or ConvNet](https://en.wikipedia.org/wiki/Convolutional_neural_network), these may have been the first neural networks that saw bona fide production use. This video by [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) on [YouTube from 1989](https://www.youtube.com/watch?v=FwFduRA_L6Q) is absolutely worth your while ([associated paper](http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf)), especially since we're going to build a network here that is a lot like the one demonstrated there.
39 |
40 | Our previous work took a whole image as input to the network, where the position of pixels really mattered. A convolution is a matrix operation that *slides* over its input. In this way it can scan for features. What it slides over its input is a set of matrices called kernels, typically quite small. Each kernel is multiplied per element over the part of the input it lies on. The output is the sum of all these multiplications:
41 |
42 | {{< katex display >}}
43 | O_{r,c} = \sum_{i=0}^{K-1}\sum_{j=0}^{K-1} k_{i,j}\,I_{r+i,\,c+j}
44 | {{< /katex >}}
47 |
48 |
49 |
50 | ```goat
51 | input layer kernel output layer
52 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
53 | |1 |2 |3 |4 |5 |6 |7 |8 | |1 |2 |3 | |A |..|..|..|..|..|
54 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
55 | |9 |10|11|12|13|14|15|16| |4 |5 |6 | |..|B |..|..|..|..|
56 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
57 | |17|18|19|20|21|22|23|24| |7 |8 |9 | |..|..|..|..|..|C |
58 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
59 | |25|26|27|28|29|30|31|32| |..|..|..|..|..|..|
60 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+
61 | |33|34|35|36|37|38|39|40|
62 | +--+--+--+--+--+--+--+--+
63 | |41|42|43|44|45|46|47|48|
64 | +--+--+--+--+--+--+--+--+
65 | ```
66 |
67 | Here are three sample positions A, B and C in the output layer:
68 | ```
69 | A = 1*1 + 2*2 + 3*3 + 9*4 + 10*5 + 11*6 + 17*7 + 18*8 + 19*9
70 | B = 10*1 + 11*2 + 12*3 + 18*4 + 19*5 + 20*6 + 26*7 + 27*8 + 28*9
71 | C = 22*1 + 23*2 + 24*3 + 30*4 + 31*5 + 32*6 + 38*7 + 39*8 + 40*9
72 | ```
73 |
74 | Note that the output differs in dimensions from the input. If the input had R rows and a K by K kernel is used, the output will have 1+R-K rows, and similar for columns (1+C-K). The output dimensions will always be smaller. The values in the output represent the presence of features matched by the filter kernels.
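
As a sketch in code, the sliding operation described above can be written naively like this (my own illustration; the real convolutional layer in hello-dl is built from the tensor machinery of the previous chapters):

```C++
#include <vector>

// Input is R x C, the kernel K x K, the output (1+R-K) x (1+C-K)
std::vector<float> convolve(const std::vector<float>& in, int R, int C,
                            const std::vector<float>& kernel, int K)
{
  int outR = 1 + R - K, outC = 1 + C - K;
  std::vector<float> out(outR * outC, 0.0f);
  for(int r = 0; r < outR; ++r)
    for(int c = 0; c < outC; ++c) {
      float sum = 0;
      for(int i = 0; i < K; ++i)
        for(int j = 0; j < K; ++j)
          sum += in[(r + i) * C + (c + j)] * kernel[i * K + j];
      out[r * outC + c] = sum;   // 'A', 'B' and 'C' from the example are computed like this
    }
  return out;
}
```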
75 |
76 | Typically, many kernels are used, leading to a single input layer creating many output layers. Every kernel is associated with a single output layer. Conceptually this can be seen as a convolutional layer scanning for many different kinds of features, all at the same time.
77 |
78 |
79 |
80 | A convolutional network can also accept multiple input layers at the same time. In this case, each kernel for an output layer slides over every input channel, and that output is the sum of the sums of the kernel sliding over all input channels. This means the number of operations is proportional to the product of the number of output layers and the number of input layers. Quite soon we are talking billions of operations. The number of filter parameters scales with the product of the number of input and output layers, and of course the kernel size.
81 |
82 | Convolutional networks do not use a lot of parameters (since kernels tend to be small), but they simply *burn* through CPU cycles. Because they do not access a lot of memory, parallel processing can speed things up tremendously though.
83 |
84 | Chapter 7 from "Dive into Deep Learning" [has a good and more expansive explanation](https://d2l.ai/chapter_convolutional-neural-networks/index.html), and it may do a better job than this page.
85 |
86 | # Max-pooling
87 | We use convolutional layers to detect features, but we don't care that much about the exact position of a feature. In fact we may often not even want to know - the network might start to depend on it. Because of this, the output of a convolutional layer is often fed through a 'max-pool'.
88 |
89 | This is a simple operation that slides over an input matrix, but has no kernel parameters. But it does have a size, often 2 by 2. The output is the maximum value within that window.
90 |
91 | Unlike a convolutional layer, max-pooling uses non-overlapping windows. So if a 2x2 window is used, the output channels have half the number of rows and columns compared to the input channels.
92 |
93 | The essence of this is that if a feature is detected anywhere within a 2x2 window, it generates the same output independent of its position on any of the four pixels. Also, the number of outputs is divided by 4, which is useful for limiting the size of the network.
94 |
95 | > Note: Pools can of course have other sizes. Also, when two-dimensional pools are used, you'll often see them described as 'max2d'.
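To make the mechanics concrete, here is a minimal stand-alone sketch of a 2x2 max-pool in Eigen terms. This is not the `Max2dfw` implementation from hello-dl, just the idea:

```C++
#include <Eigen/Dense>

// Minimal max-pool sketch: the window size doubles as the stride, so the
// output has in.rows()/size rows and in.cols()/size columns, and every output
// element is the maximum of its (non-overlapping) window.
Eigen::MatrixXf max2d(const Eigen::MatrixXf& in, int size = 2)
{
  Eigen::MatrixXf out(in.rows() / size, in.cols() / size);
  for(int r = 0; r < out.rows(); ++r)
    for(int c = 0; c < out.cols(); ++c)
      out(r, c) = in.block(r * size, c * size, size, size).maxCoeff();
  return out;
}
```

A real implementation also has to decide what to do when the input dimensions are odd; the comments in the code further down suggest hello-dl pads, turning an 11x11 input into a 6x6 output rather than a 5x5 one.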
96 |
97 | # GELU
98 | Recall how we used the 'Rectified linear unit' (RELU) to replace all negative values by zero (leaving the rest alone). This introduces a non-linearity between matrix operations, which in turn means the network becomes something more than a simple linear combination of elements.
99 |
100 | There has been a lot of experimentation with activation functions, and it has been noted that RELU throws away a lot of information for negative values. It appears that using a different activation function can help a lot. Popular these days is GELU, the Gaussian Error Linear Unit, which is neither linear nor an error. And not even that Gaussian, actually.
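For reference, GELU multiplies its input by {{}}\Phi(x){{}}, the cumulative distribution function of the standard normal distribution, and in practice a tanh-based approximation is commonly used:

{{}}
\text{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right)\right]\right)
{{}}

For large positive inputs this behaves just like RELU, but small negative inputs produce small negative outputs instead of being clipped to zero, which is where the extra information comes from.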
101 |
102 |
103 |
104 | 
105 |
106 |
107 |
108 |
109 | More details can be found in [Gaussian Error Linear Units (GELUs)](https://arxiv.org/abs/1606.08415), and there is some more depth in [Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks](https://ojs.aaai.org/index.php/AAAI/article/view/17197/17004).
111 |
112 | After having read the literature, I'm afraid I'm left [with the impression that GELU tends to do better](https://arxiv.org/pdf/2002.05202.pdf), but that we're not that sure why. For what it's worth, the code we're developing here confirms this impression.
113 |
114 | # The whole design
115 | Here is the complete design of our neural network that we're going to use to recognize handwritten (print) letters:
116 |
117 | 1. The input is again a 28x28 image, not flattened
118 | 2. A 3x3 kernel convolutional layer with 1 input layer and 32 output layers
119 | 3. Max2d layer, 2x2
120 | 4. GELU activation
121 | 5. A 3x3 kernel convolutional layer, 13x13 input dimensions, 32 input layers, 64 output layers
122 | 6. Max2d layer, 2x2
123 | 7. GELU activation
124 | 8. A 3x3 kernel convolutional layer, 6x6 input dimensions, 64 input layers, 128 output layers
125 | 9. Max2d layer, 2x2
126 | 10. GELU activation
127 | 11. Flatten all these 128 2x2 layers to a 512x1 matrix
128 | 12. First linear combination (512 to 64)
129 | 13. GELU activation
130 | 14. Second linear combination (64 to 128)
131 | 15. GELU activation
132 | 16. Third linear combination, down to 26x1
133 | 17. LogSoftMax
134 |
135 | This looks like a lot, but if you look carefully, steps 2/3/4, 5/6/7, 8/9/10 are three times the same thing.
136 |
137 | Expressed as code it may even be easier to follow:
138 | ```C++
139 | using ActFunc = GeluFunc;
140 |
141 | auto step1 = s.c1.forward(img); // -> 26x26, 32 layers
142 | auto step2 = Max2dfw(step1, 2); // -> 13x13
143 | auto step3 = s.c2.forward(step2); // -> 11x11, 64 layers
144 | auto step4 = Max2dfw(step3, 2); // -> 6x6 (padding)
145 | auto step5 = s.c3.forward(step4); // -> 4x4, 128 layers
146 | auto step6 = Max2dfw(step5, 2); // -> 2x2
147 | auto flat = makeFlatten(step6); // -> 512x1
148 | auto output = s.fc1.forward(flat); // -> 64
149 |   auto output2 = makeFunction<ActFunc>(output);
150 |   auto output3 = makeFunction<ActFunc>(s.fc2.forward(output2)); // -> 128
151 |   auto output4 = makeFunction<ActFunc>(s.fc3.forward(output3)); // -> 26
152 | scores = makeLogSoftMax(output4);
153 | modelloss = -(expected*scores).sum();
154 | ```
155 |
156 | It is somewhat astounding that these few lines will learn to read handwritten characters.
157 |
158 | Visually:
159 |
160 | ```goat
161 | input layer
162 | +--+--+--+--+--+--+--+--+
163 | | 1| | | | | | |28| 32 x
164 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+ 64 x
165 | | | | | | | | | | | 1| | | | |13| +--+--+--+--+ 128 x
166 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+ | 1| | | 6| +--+--+
167 | | | | | | | | | | | | | | | | | +--+--+--+--+ | 1| 2|
168 | +--+--+--+--+--+--+--+--+ -> +--+--+--+--+--+--+ -> | | | | | -> +--+--+
169 | | | | | | | | | | | | | | | | | +--+--+--+--+ | 2| 2|
170 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+ | 6| | | 6| +--+--+
171 | | | | | | | | | | |13| | | | |13| +--+--+--+--+
172 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+
173 | |28| | | | | | |28|
174 | +--+--+--+--+--+--+--+--+
175 | ```
176 |
177 | These are the three convolutions and "Max2d" combinations. We end up with 128 layers of four values each. These are flattened into a 512x1 matrix,
178 | which then undergoes further multiplications:
179 |
180 | ```goat
181 | +---+--+--+--+
182 | | 1| | |64|
183 | +---+--+--+--+
184 | | | | | |
185 | +---+--+--+--+ +---+--+--+--+--+--+---+
186 | | | | | | | 1| | | | | |128|
187 | +---+--+--+--+ +---+--+--+--+--+--+---+
188 | | | | | | | | | | | | | |
189 | +---+--+--+--+ +---+--+--+--+--+--+---+
190 | | | | | | | | | | | | | |
191 | +---+--+--+--+ +---+--+--+--+--+--+---+
192 | |512| | | | | 64| | | | | | |
193 | x +---+--+--+--+ x +---+--+--+--+--+--+---+
194 |
195 | +--+--+--+--+--+--+--+---+ +---+--+--+--+ +---+--+--+--+ +---+--+--+--+--+--+---+
196 | | 1| | | | | | |512| = | 1| | |64| -> GELU -> | 1| | |64| = | 1| | | | | |128|
197 | +--+--+--+--+--+--+--+---+ +---+--+--+--+ +---+--+--+--+ +---+--+--+--+--+--+---+
198 |
199 |
200 | +---+--+--+--+
201 | | 1| | |26|
202 | +---+--+--+--+
203 | | | | | |
204 | +---+--+--+--+
205 | | | | | |
206 | +---+--+--+--+
207 | | | | | |
208 | +---+--+--+--+
209 | | | | | |
210 | +---+--+--+--+
211 | |128| | | |
212 | +---+--+--+--+
213 |
214 | x
215 | +---+--+--+--+--+--+---+ +---+--+--+--+
216 | -> GELU -> | 1| | | | | |128| = | 1| | |26| -> SoftLogMax
217 | +---+--+--+--+--+--+---+ +---+--+--+--+
218 | ```
219 |
220 | And this last matrix, 26 wide, gives us the scores for each possible character.
221 |
222 | # So where did this design come from?
223 | I copied it from [here](https://data-flair.training/blogs/handwritten-character-recognition-neural-network/). As we'll discuss in [the next chapter](../hyperparameters-inspection-adam), neural network demos and tutorials tend to make their designs appear out of thin air. In reality, designing a good neural network is a lot of hard work. What you are seeing in a demo is the outcome of an (undisclosed) search over many possibilities. If you want to learn the ropes, it is best to first copy something that is known to work. And even then you'll often find that it doesn't work as well for you as it did in the demo.
224 |
225 | # Let's fire it up!
226 | Whereas previous test programs did their learning in seconds or minutes, teaching this network to recognize letters takes *ages*. As in, most of a day:
227 |
233 |
234 | So a few things to note - even after 24 hours of training, the network was only around 85% correct. If you look at the failures however, quite a lot of the input is in fact ambiguous. The difference between a handwritten *g* and a handwritten *q* is not that obvious without context. If we count the "second best guess" as almost correct, the network scores over 95% correct or almost correct, which is not too bad.
235 |
236 | Here is a sample of the input the network has to deal with:
237 |
238 |
239 | 
240 |
241 |
242 |
243 |
244 | And here is the confusion matrix, where you can see that besides *g* and *q*, distinguishing *i* and *l* is hard, as well as *h* and *n*:
245 |
251 |
252 | So why does the training take so long? These days computers are magically fast at multiplying large matrices, which is why our earlier model learned in minutes. This model however does all these convolutions, which boil down to sliding huge numbers of small kernels over the data, and that work simply has to be done. You can get only a little bit clever about it, and that cleverness does not deliver orders of magnitude of speedup. The only way to make the convolutional filters a lot faster is to get your hardware to do many of them at the same time. There are however other techniques that make the whole process converge faster, so that fewer calculations need to be done. [We'll cover these in the next chapter](../hyperparameters-inspection-adam).
253 |
254 | Another interesting thing to note in the graph is that after six hours or so, the network suddenly starts to perform worse, and then it starts improving again. I'd love to tell you why that happens, but I simply don't know. It is hard to imagine stochastic gradient descent guessing wrong so badly for such a long time, but perhaps it can get stuck on a bad ridge.
255 |
256 | ## Deeper background on how convolutional networks work
257 | Above we described how convolutional networks work. A kernel is laid on top of an input, and image and kernel are multiplied element by element. The sum of all those multiplications is the output for that location. The kernel then slides over the entire input, producing a somewhat smaller output.
258 |
259 | In code this looks like this:
260 |
261 | ```C++
262 | ...
263 | else if(d_mode == TMode::Convo) {
264 | d_lhs->assureValue();
265 | d_rhs->assureValue(); // the weights
266 |
267 | d_val = EigenMatrix(1 + d_lhs->d_val.rows() - d_convop.kernel,
268 | 1 + d_lhs->d_val.cols() - d_convop.kernel);
269 | for(int r = 0 ; r < d_val.rows(); ++r)
270 | for(int c = 0 ; c < d_val.cols(); ++c)
271 | d_val(r,c) = d_lhs->d_val.block(r, c, d_convop.kernel, d_convop.kernel)
272 | .cwiseProduct(d_rhs->d_val).sum()
273 | + d_convop.bias->d_val(0,0);
274 | }
275 | ```
276 | This is part of [tensor2.hh](https://github.com/berthubert/hello-dl/blob/main/tensor2.hh) that implements the core neural network operations and the automatic differentiation.
277 |
278 | When doing the *forward* pass, this code first assures that the input (in `d_lhs`, aka left hand side) and the kernel (in `d_rhs`) are calculated. The output of the convolutional operation is the value of this node, and it ends up in `d_val`. On the third and fourth lines of the snippet, `d_val` gets sized to the output dimensions of the convolution: 1 + input rows - kernel size, and similarly for the columns.
279 |
280 | The for-loops meanwhile slide the kernel over the input, using the Eigen `.block()` primitive to focus on the input part covered by the kernel. Finally, the bias gets added.
281 |
282 | This is all really straightforward, as the *forward* pass tends to be. But backpropagation requires a bit more thinking: how does changing the kernel parameters impact the output of a convolutional layer? And, how does changing the input change the output? It is clear we need to backpropagate in these two directions.
283 |
284 | It turns out the process is not that hard, and in fact also involves a convolution:
285 | ```C++
286 | for(int r = 0 ; r < d_val.rows(); ++r)
287 | for(int c = 0 ; c < d_val.cols(); ++c)
288 | d_lhs->d_grads.block(r,c,d_convop.kernel, d_convop.kernel)
289 | += d_rhs->d_val * d_grads(r,c);
290 | ```
291 |
292 | Recall that `d_lhs` is the input to the convolution. The backward pass slides over all output positions and, for each position, adds the filter kernel scaled by the gradient of that output element to the corresponding block of the input gradients.
293 |
294 | Here is the backpropagation to the filter kernel:
295 |
296 | ```C++
297 | for(int r = 0 ; r < d_rhs->d_val.rows(); ++r)
298 | for(int c = 0 ; c < d_rhs->d_val.cols(); ++c)
299 | d_rhs->d_grads(r,c) += (d_lhs->d_val.block(r, c, d_val.rows(), d_val.cols())*d_grads).sum();
300 | d_rhs->d_grads.array() /= sqrt(d_grads.rows()*d_grads.cols());
301 | ```
302 |
303 | And finally the bias:
304 |
305 | ```C++
306 | d_convop.bias->d_grads(0,0) += d_grads.sum();
307 | ```
308 |
309 | This all is a bit 'deus ex machina', magical math making the numbers come out right. I present the code here because finding the exact instructions elsewhere is not easy. But you don't need to delve into these functions line by line to understand conceptually what is happening.
310 |
311 | # The actual code
312 | The code is in [tensor-convo.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-convo.cc) and is much like
313 | the digit reading code from [the previous chapter](../handwritten-digits-sgd-batches).
314 |
315 | Here is a key part of the difference:
316 |
317 | ```C++
318 | - MNISTReader mn("gzip/emnist-digits-train-images-idx3-ubyte.gz", "gzip/emnist-digits-train-labels-idx1-ubyte.gz");
319 | - MNISTReader mntest("gzip/emnist-digits-test-images-idx3-ubyte.gz", "gzip/emnist-digits-test-labels-idx1-ubyte.gz");
320 | + MNISTReader mn("gzip/emnist-letters-train-images-idx3-ubyte.gz", "gzip/emnist-letters-train-labels-idx1-ubyte.gz");
321 | + MNISTReader mntest("gzip/emnist-letters-test-images-idx3-ubyte.gz", "gzip/emnist-letters-test-labels-idx1-ubyte.gz");
322 |
323 | cout<<"Have "<<mn.num()<<" training images and "<<mntest.num()<<" test images"<<endl;
...
336 | Tensor img{28,28};
337 | Tensor scores{26, 1};
338 | Tensor expected{1,26};
339 | Tensor modelloss{1,1};
340 | Tensor weightsloss{1,1};
341 | Tensor loss{1,1};
342 | ```
343 | This defines the `img` variable into which we put the image to be taught or recognized. The `scores` tensor meanwhile holds the calculated score for each of the 26 possible outputs. For training purposes, we put into `expected` which letter we expect the network to output.
344 |
345 | Ignore `modelloss` and `weightsloss` for a bit, they will become relevant in a later chapter.
346 |
347 | Finally the `loss` tensor is what we train the network on, and it represents how likely the network thought itself to be right.
348 |
349 | Next up, we're going to define the state, which contains the parameters that will be trained/used:
350 |
351 | ```C++
352 | struct State : public ModelState
353 | {
354 | // r_in c k c_i c_out
355 | Conv2d c1; // -> 26*26 -> max2d -> 13*13
356 | Conv2d c2; // -> -> 11*11 -> max2d -> 6*6 //padding
357 | Conv2d c3; // -> 4*4 -> max2d -> 2*2
358 | // flattened to 512 (128*2*2)
359 | // IN OUT
360 | Linear fc1;
361 | Linear fc2;
362 | Linear fc3;
363 | ```
364 | This has three convolutional layers (`c1`, `c2`, `c3`) and three full linear combination layers (`fc1`, `fc2`, `fc3`). Note that `fc3` will end up delivering a vector of 26 scores.
365 |
366 | Finally there is some important housekeeping:
367 |
368 | ```C++
369 | State()
370 | {
371 | this->d_members = {{&c1, "c1"}, {&c2, "c2"},
372 | {&c3, "c3"}, {&fc1, "fc1"},
373 | {&fc2, "fc2"}, {&fc3, "fc3"}
374 | };
375 | }
376 | };
377 | ```
378 | `State` descends from `ModelState` which, as previously, brings a lot of logic for saving and loading parameters, as well as for modifying them during training. But to perform its services, it needs to know about the members. We also tell it the names of the members for reporting purposes, which we are going to explore in [the next chapter](../hyperparameters-inspection-adam).
379 |
380 | # Finally
381 | In this chapter we've introduced convolutional layers that are able to recognize image features, and are therefore more likely to robustly identify letters instead of just pixels in specific places. We've also found however that training such convolutional layers takes far longer. The end result is pretty decent though, given that some handwritten letters can look nearly identical on their own (*i* versus *l*, *g* versus *q* etc).
382 |
383 | In [the next chapter](../hyperparameters-inspection-adam) we'll go over what the excitingly named hyperparameters are, and how a fellow called ADAM can help us speed up training tremendously.
384 |
385 |
386 |
--------------------------------------------------------------------------------
/dl-convolutional/gelu.svg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/gelu.svg
--------------------------------------------------------------------------------
/handwritten-digits-sgd-batches/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Reading handwritten digits"
3 | date: 2023-03-30T12:00:04+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/handwritten-digits-sgd-batches/index.md)!
7 |
8 | In the [previous chapter](../autograd) we described how automatic differentiation of the result of neural networks works.
9 |
10 | In the first and second chapters we designed and trained a one-layer neural network that could distinguish images of the digit 3 from images of the digit 7, and the network did so very well. But honestly, that is not a very hard task either.
11 |
12 | The next challenge is to recognize and classify all ten digits. To do so, we'll use a network that does the following:
13 |
14 | 1. Flatten 28x28 image to a 784x1 matrix
15 | 2. Multiply this matrix by a 128x784 matrix ('lc1')
16 | 3. Replace all negative elements of the resulting matrix by 0
17 | 4. Multiply the resulting matrix by a 64x128 matrix ('lc2')
18 | 5. Replace all negative elements of the resulting matrix by 0
19 | 6. Multiply the resulting matrix by a 10x64 matrix ('lc3')
20 | 7. Pick the highest row of the resulting 10x1 matrix, this is the digit the network thinks it saw
21 |
22 | Or, in code form:
23 |
24 | ```C++
25 | auto output = s.lc1.forward(makeFlatten({img})); // 1, 2
26 | auto output2 = makeFunction<ReluFunc>(output); // 3
27 | auto output3 = s.lc2.forward(output2); // 4
28 | auto output4 = makeFunction<ReluFunc>(output3); // 5
29 | auto output5 = s.lc3.forward(output4); // 6
30 | scores = makeLogSoftMax(output5); // 7a
31 |
32 | ...
33 | int predicted = scores.maxValueIndexOfColumn(0); // 7b
34 | ```
35 |
36 | So, what is going on here? First we turn the image into a looooong matrix of 784x1, using a call to `makeFlatten`. This loses some spatial context - neighboring pixels are no longer necessarily next to each other. But it is necessary for the rest of the operations.
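To illustrate what flattening amounts to, here is a sketch in plain Eigen terms. This is not the actual `makeFlatten` from hello-dl (which also has to take part in the automatic differentiation), just the reshaping idea:

```C++
#include <Eigen/Dense>

// View a 28x28 image as one long 784x1 column vector. Eigen is column-major
// by default, so this walks the image column by column - the exact order does
// not matter, as long as it is the same for every image.
Eigen::VectorXf flatten(const Eigen::MatrixXf& img)
{
  return Eigen::Map<const Eigen::VectorXf>(img.data(), img.size());
}
```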
37 |
38 | The flattened matrix now goes through `lc1`, which is a linear combination layer. Or in other words, a matrix multiplication.
39 |
40 | Next up, the `ReluFunc`. This 'rectified linear unit' is nothing other than an if statement: `if(x<0) return 0; else return x`. If we were to stack multiple linear combinations directly, that would not add any extra smarts to the network - you could summarise two layers as one layer with different parameters. Inserting a non-linear element like 'ReLu' changes this.
41 |
42 | > From the excellent [FastAI notebook on MNIST](https://github.com/fastai/fastbook/blob/master/04_mnist_basics.ipynb):
43 | "Amazingly enough, it can be mathematically proven that this little function can solve any computable problem to an arbitrarily high level of accuracy, if you can find the right parameters for w1 and w2 and if you make these matrices big enough. For any arbitrarily wiggly function, we can approximate it as a bunch of lines joined together; to make it closer to the wiggly function, we just have to use shorter lines. This is known as the [universal approximation theorem](https://towardsdatascience.com/how-do-relu-neural-networks-approximate-any-continuous-function-f59ca3cf2c39)."
44 | > Incidentally, I can highly recommend reading the FastAI notebook after you've finished with my 'from scratch' series. The FastAI work will then make sense, and will allow you to convert your from scratch knowledge into deep learning frameworks that people actually use.
45 |
46 | After the first ReLu, we pass our data through a second linear combination, after which follows further ReLu, and a final linear combination.
47 |
48 | # LogSoftMax, "Cross Entropy Loss"
49 | Ok - here we are going to make some big steps and introduce a lot of modern machine learning vocabulary.
50 |
51 | In our previous example, the network had one output, and if it was negative, we would interpret that as a prediction of a three.
52 |
53 | Our present network has a more difficult task: determining which of 10 digits we are looking at. After step 6 of our network, we have a 10x1 matrix full of values. The convention is that the highest valued coefficient in that matrix represents the network's verdict.
54 |
55 | Recall that earlier we made the network 'learn in the right direction' by comparing the output to what we'd hope to get. We set a target of '2' or '-2' by hand, and remarked that this was an ugly trick we'd get rid of later on. That moment is now.
56 |
57 | All machine learning projects define a 'loss value' calculated by the 'loss function'. The loss represents the distance between what a network predicts, and what we'd like it to predict. In our previous example, we informally used a loss function of {{}}max(2 - R, 0){{}}, and trained our network to minimize this loss function.
58 |
59 | Or in other words, as long as {{}}R<2{{}} we'd change the parameters to increase {{}}R{{}}.
60 |
61 | This is the key concept of neural network learning: modify the parameters so that the loss function goes down. And to do so, take the derivative of the loss function against all parameters. Then subtract a fraction of that derivative from all the parameters.
62 |
63 | This only works if we can get a *single number* to come out of our network. But recall, the digit recognizing network we are designing on this page has *10* outputs. So some work is in order.
64 |
65 | In practice, we first feed all the outputs to a function called '*LogSoftMax*':
66 |
67 | {{}}
68 | \text{LogSoftmax}(x_{i}) = \log\left(\frac{\exp(x_i) }{ \sum_j \exp(x_j)} \right) = x_i - \log\left(\sum_j \exp(x_j)\right)
69 | {{}}
70 |
71 | If we put in 10 inputs, out come 10 outputs, but now lowered by the logarithm of the sum of the exponent of all elements.
72 |
73 | This looks like:
74 |
75 |
76 | ```Python
77 | # 0 1 2 3 4 5 6 7 8 9
78 | In: [-2.5, -4, -3, -0.5, -4.4, 4, -0.75, -0.25, -0.5, -2.0]
79 | Out: [-6.5, -8.04, -7.05, -4.55, -8.459, -0.05, -4.80, -4.30, -4.55, -6.05]
80 | ```
81 |
82 | This would correspond to a network saying '5', which has the highest value. The output of `LogSoftMax` is typically interpreted as a log probability. Or in other words, the network is taken to predict with {{}}e^{-0.05}\approx 95\%{{}} probability that it is looking at a 5.
83 |
84 | LogSoftMax works well for a variety of reasons, one of which is that it prevents the "pushing to infinity" we had to safeguard against in chapter 2. When using LogSoftMax as part of the loss function, we know that 0 is the best answer we're ever going to get ('100% certain'). But, because of the logarithms in there, the 'push' becomes ever smaller the closer we get to zero.
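To see why the push shrinks, take the derivative of {{}}-\text{LogSoftmax}(x_t){{}} (which, as we'll see below, is exactly our loss for true digit t) with respect to the score {{}}x_t{{}}:

{{}}
\frac{\partial}{\partial x_t}\left(-x_t + \log\sum_j \exp(x_j)\right) = \frac{\exp(x_t)}{\sum_j \exp(x_j)} - 1
{{}}

which goes to zero as the predicted probability of the true digit approaches 100%.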
85 |
86 | But, we still have 10 numbers to look at, and we need just 1 for our loss value. To do this, it is customary to construct a "one hot vector", a matrix consisting of all zeroes, except for a single one at the index of the digit we were expecting.
87 |
88 | So to get to the loss value, we do:
89 |
90 | ```
91 | scores=[-6.5, -8.04, -7.05, -4.55, -8.459, -0.05, -4.80, -4.30, -4.55, -6.05]
92 | onehotvector=[[0],[0],[0],[0],[0],[1.0],[0],[0],[0],[0]]
93 | loss = -scores*onehotvector = 0.05
94 | ```
95 | Where \* denotes a matrix multiplication. Here, the loss is 0.05, since the value at index 5 (the digit we know we put into the network) is -0.05. The negation is there because 0 is the very best we're ever going to get, but we are approaching it from negative territory. This technique goes by the pompous name of '[cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)'.
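To make the arithmetic concrete, here is a tiny stand-alone sketch in plain C++ (deliberately not using the hello-dl tensor classes) that reproduces the numbers above:

```C++
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// LogSoftMax followed by 'cross entropy loss': the loss is simply minus the
// log-probability the network assigned to the true label.
float crossEntropyLoss(const std::vector<float>& scores, int label)
{
  // log(sum(exp(x_j))), computed relative to the maximum for numerical stability
  float mx = *std::max_element(scores.begin(), scores.end());
  float sum = 0;
  for(float s : scores)
    sum += std::exp(s - mx);
  float logsumexp = mx + std::log(sum);
  return -(scores[label] - logsumexp); // -LogSoftMax(x_label)
}

int main()
{
  std::vector<float> in{-2.5, -4, -3, -0.5, -4.4, 4, -0.75, -0.25, -0.5, -2.0};
  std::cout << crossEntropyLoss(in, 5) << "\n"; // prints roughly 0.05
}
```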
96 |
97 | Once we have the loss number, we can take the derivative, and adjust the parameters of the matrices with a fraction of that derivative.
98 |
99 | # Recap
100 | Because this is so crucial, let's go over it one more time.
101 |
102 | Our neural network consists of several layers. We turn our image into a 1-dimensional matrix, multiply it by another matrix, replace all negative elements by zero, repeat the last two steps once more, and then a final matrix multiplication. Ten values come out. We do a logarithmic operation on these numbers, and turn them into 10 log-probabilities.
103 |
104 | The highest log-probability we can get is 0, which represents 100%. We multiply the ten numbers by yet another 'expectation' matrix, which is zero, except for the place corresponding to the actual number we put in. Out comes the probability the network assigned to what we know is the right digit.
105 |
106 | If this log probability is -0.05, we say that we have a 'loss' of 0.05.
107 |
108 | This loss number is the outcome of all these matrix multiplications and ReLu operations, the LogSoftMax layer and finally the multiplication with the expectation matrix.
109 |
110 | And because of the magic of [automatic differentiation from the previous chapter](../autograd), we can determine exactly how our loss function would change if we modified the three parameter matrices we used for multiplication.
111 |
112 | We then update those matrices with a fraction of the derivative, and we call this fraction the 'learning rate'.
113 |
114 | # One final complication: batches
115 | We could perform the procedure outlined above once per training digit. But this might cause our network to oscillate wildly between "getting the ones right", "getting the twos right" etc. For this and other reasons, it is customary to do the learning per batch. Picking a batch size is an important choice - if the batch size is too small (1, for example), the network might swerve. If it is too large however, we lose training opportunities.
116 |
117 | There is also a more hardware related reason to do this. Much machine learning happens on GPUs which only perform well if you give them a lot of work they can do in parallel. If you only process a single input at a time, much of your GPU will be idle.
118 |
119 | When we gather our learning from a batch and average the results, we call this [Stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) or SGD.
120 |
121 |
122 | # Getting down to work
123 | The code is in [tensor-relu.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-relu.cc), where you'll find slightly more lines of code than described below. The additional lines perform logging to generate the graphs that demonstrate the performance of this model.
124 |
125 | Here is the definition of our model:
126 |
127 | ```C++
128 | struct ReluDigitModel
129 | {
130 | Tensor img{28,28};
131 | Tensor scores{10, 1};
132 | Tensor expected{1,10};
133 | Tensor loss{1,1};
134 | struct State : public ModelState
135 | { // IN OUT
136 | Linear lc1;
137 | Linear lc2;
138 | Linear lc3;
139 |
140 | State()
141 | {
142 | this->d_members = {{&lc1, "lc1"}, {&lc2, "lc2"}, {&lc3, "lc3"}};
143 | }
144 | };
145 |
146 | ```
147 |
148 | Here we see how the state of a model is kept separate. This state is what contains the actual parameters. Note that the state derives from `ModelState`. This parent class gives us common operations like `load()`, `save()`, `randomize()`, but also logging of everything to SQLite. To make sure that the parent class knows what to do, the `State` struct registers its members in its constructor.
149 |
150 | Next up, let's hook it all up:
151 |
152 | ```C++
153 | void init(State& s)
154 | {
155 | auto output = s.lc1.forward(makeFlatten({img}));
156 |     auto output2 = makeFunction<ReluFunc>(output);
157 | auto output3 = s.lc2.forward(output2);
158 |     auto output4 = makeFunction<ReluFunc>(output3);
159 | auto output5 = s.lc3.forward(output4);
160 | scores = makeLogSoftMax(output5);
161 | loss = -(expected*scores);
162 | }
163 | };
164 | ```
165 |
166 | This mirrors the code we've seen earlier. In the last line of code we define the 'loss' function. And recall, this is all lazy evaluation - we're setting up the logic, but nothing is being calculated yet.
167 |
168 | Next up, mechanics:
169 |
170 | ```C++
171 | MNISTReader mn("gzip/emnist-digits-train-images-idx3-ubyte.gz", "gzip/emnist-digits-train-labels-idx1-ubyte.gz");
172 | MNISTReader mntest("gzip/emnist-digits-test-images-idx3-ubyte.gz", "gzip/emnist-digits-test-labels-idx1-ubyte.gz");
173 |
174 | ReluDigitModel m;
175 | ReluDigitModel::State s;
176 |
177 | if(argc==2) {
178 | cout << "Loading model state from '" << argv[1] << "'\n";
179 | loadModelState(s, argv[1]);
180 | }
181 | else
182 | s.randomize();
183 |
184 | m.init(s);
185 |
186 | auto topo = m.loss.getTopo();
187 | Batcher batcher(mn.num());
188 | ```
189 |
190 | The 'topo' line gets the topological sort we'll be using later on, as described in the previous chapter.
191 |
192 | The final line creates a helper class called `Batcher`, to which we pass the number of training images we have. This class shuffles all those indices. Later, we can request batches of N indices for processing, and we'll get a random batch every time, roughly as in the sketch below.
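The actual hello-dl `Batcher` lives in the repository; a minimal sketch of the idea could look like this:

```C++
#include <algorithm>
#include <deque>
#include <random>
#include <vector>

// Sketch of a batcher: shuffle all indices once, then hand them out in chunks
// until none are left.
class Batcher
{
public:
  explicit Batcher(int n)
  {
    for(int i = 0; i < n; ++i)
      d_idx.push_back(i);
    std::shuffle(d_idx.begin(), d_idx.end(), std::mt19937(std::random_device{}()));
  }

  std::vector<int> getBatch(int size)
  {
    std::vector<int> batch;
    while(size-- && !d_idx.empty()) {
      batch.push_back(d_idx.front());
      d_idx.pop_front();
    }
    return batch;
  }

private:
  std::deque<int> d_idx;
};
```

Once the batcher runs out, `getBatch()` returns an empty batch, which is why the training loop below stops at that point: every training image has then been handed out once.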
193 |
194 | Let's do just that:
195 |
196 | ```C++
197 | for(unsigned int tries = 0 ;; ++tries) {
198 | if(!(tries % 32)) {
199 | testModel(m, mntest);
200 | saveModelState(s, "tensor-relu.state");
201 | }
202 |
203 | auto batch = batcher.getBatch(64);
204 | if(batch.empty())
205 | break;
206 | float totalLoss = 0;
207 | unsigned int corrects=0, wrongs=0;
208 | ```
209 | Up to here it is just mechanics. Every 32 batches we test the model against our validation data, and we also save our state to disk. Next up more interesting things happen:
210 |
211 | ```C++
212 | m.loss.zeroAccumGrads(topo);
213 |
214 | for(const auto& idx : batch) {
215 | mn.pushImage(idx, m.img);
216 | int label = mn.getLabel(idx);
217 | m.expected.oneHotColumn(label);
218 |
219 | totalLoss += m.loss(0,0); // turns it into a float
220 |
221 | int predicted = m.scores.maxValueIndexOfColumn(0);
222 |
223 | if(predicted == label)
224 | corrects++;
225 | else wrongs++;
226 |
227 | ```
228 | As noted, we process a whole batch of images before starting the learning process. For each image we look at, we gather gradients through automatic differentiation. We need to add up all these gradients for the eventual learning. To make life easy, our `Tensor` class has a facility where you can stash your gradients. But before we start a batch, we must zero the accumulated numbers, which is what the first line above does.
229 |
230 | Next up, we iterate over all numbers in the batch. For each number, we fetch the EMNIST image and the label assigned to it.
231 | We then configure the `expected` variable, with the 'one hot' configuration which is 1 only for the correct outcome.
232 |
233 | The `totalLoss += m.loss(0,0);` line looks like a bit of statistics keeping, but it is what actually triggers the whole network into action. We wake up the lazy evaluation.
234 |
235 | In the next line we look up the row in the scores matrix with the highest value, which is the prediction from the model.
236 |
237 | Then we count the correct and wrong predictions.
238 |
239 | Now we come to an interesting part again:
240 | ```C++
241 | // backward the thing
242 | m.loss.backward(topo);
243 | m.loss.accumGrads(topo);
244 | // clear grads & havevalue
245 | m.loss.zerograd(topo);
246 | }
247 | ```
248 |
249 | This is where we perform the automatic differentiation (`.backward(topo)`). We then call `accumGrads(topo)` to accumulate the gradients for this specific image. Finally, there is a call to `.zerograd(topo)`. From the previous chapter, you'll recall how the gradients rain downward additively. If we run the same model a second time, we first need to zero those gradients so we start from a clean slate.
250 |
251 | Once we are done with a whole batch, we can output some statistics and do the actual learning:
252 |
253 |
254 | ```C++
255 | cout << tries << ": Average loss " << totalLoss/batch.size()<< ", percent batch correct " << 100.0*corrects/(corrects+wrongs) << "%\n";
256 |
257 | double lr=0.01 / batch.size();
258 | s.learn(lr);
259 | }
260 | ```
261 |
262 | Of note, we divide the learning rate by the batch size. This is because we've accumulated gradients for each of the images in the batch, and we want to learn from their average and not their sum.
263 |
264 | Finally, let's zoom in on what `s.learn(lr)` actually does:
265 |
266 | ```C++
267 | void learn(float lr)
268 | {
269 | for(auto& p : d_params) {
270 | auto grad1 = p.ptr->getAccumGrad();
271 | grad1 *= lr;
272 | *p.ptr -= grad1;
273 | }
274 | }
275 | ```
276 |
277 | For each parameter, the accumulated gradient is retrieved and multiplied by the learning rate. This scaled gradient is then subtracted from the actual parameter value.
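Putting the pieces together, a single batch amounts to the classic mini-batch stochastic gradient descent update for every parameter {{}}\theta{{}}, where {{}}\eta{{}} is the base learning rate (0.01 above) and B the batch:

{{}}
\theta \leftarrow \theta - \frac{\eta}{|B|} \sum_{i \in B} \frac{\partial \text{loss}_i}{\partial \theta}
{{}}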
278 |
279 | # Giving it a spin!
280 | Finally, let's run all this:
281 |
282 | ```bash
283 | $ ./tensor-relu
284 | Have 240000 training images and 40000 test images
285 |
286 |
287 | ...
288 | *XXXXX.
289 | .XXXXXXXX.
290 | *XXXX*.*XX*..
291 | *XXX* XXXXXX
292 | .XXX. XXXXXX
293 | *XX* .XXXX*
294 | *XX. *XXX*.
295 | *XXX ..XXXXX.
296 | .XXXXXXXXXXXXX
297 | *XXXXXXXXXXX*
298 | .**** .XXX.
299 | *XX*
300 | XXX.
301 | .XX*
302 | *XXX.
303 | XXXX
304 | XXX*
305 | .XXX.
306 | *XX*
307 | XXX*
308 | .XX*
309 | X*.
310 |
311 |
312 |
313 |
314 | predicted: 3, actual: 9, loss: 2.31289
315 | Validation batch average loss: 2.30657, percentage correct: 9.375%
316 | 0: Average loss 2.31008, percent batch correct 9.375%
317 | ...
318 |
319 | .X*
320 | *XX*
321 | .XX*
322 | *XX*
323 | XXX
324 | *XX.
325 | *XXX
326 | XXX
327 | *XX.
328 | XXX
329 | *XX.
330 | .XX*
331 | *XX ...
332 | XXX *XXXXX
333 | XX. .XXXXXXXXX.
334 | XX* *XXXX*. *XX.
335 | XXX XXXX*. XX*
336 | *XX.XXX* .XX
337 | .XXXXX* .XX
338 | *XXX. *XX
339 | XXX* *XX*
340 | XXXXX*......*XX*
341 | *XXXXXXXXXX*
342 | .*XXXXXX*
343 |
344 |
345 |
346 | predicted: 6, actual: 6, loss: 1.28472
347 | Validation batch average loss: 1.39618, percentage correct: 76.5625%
348 | ...
349 | Validation batch average loss: 0.276509, percentage correct: 92.225%
350 | ```
351 | 92.23%, not too shabby! Here are some customary ways of looking at performance, starting with a training/validation percentage correct graph:
352 |
353 |
368 |
369 | And finally the wonderfully named confusion matrix, which shows how often a prediction (vertical) matched up with the actual label (horizontal):
370 | | predicted \ actual | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
371 | |---|---|---|---|---|---|---|---|---|---|---|
372 | | 0 | 3750 | 3 | 27 | 25 | 12 | 42 | 16 | 5 | 10 | 10 |
373 | | 1 | 4 | 3793 | 25 | 5 | 13 | 24 | 18 | 16 | 87 | 12 |
374 | | 2 | 32 | 9 | 3665 | 74 | 22 | 40 | 24 | 12 | 44 | 3 |
375 | | 3 | 7 | 31 | 32 | 3581 | 0 | 110 | 0 | 10 | 53 | 34 |
376 | | 4 | 70 | 34 | 68 | 3 | 3766 | 86 | 32 | 17 | 63 | 123 |
377 | | 5 | 48 | 32 | 22 | 143 | 3 | 3548 | 23 | 3 | 120 | 15 |
378 | | 6 | 38 | 18 | 79 | 7 | 52 | 48 | 3865 | 0 | 8 | 0 |
379 | | 7 | 3 | 3 | 26 | 40 | 2 | 12 | 0 | 3716 | 11 | 171 |
380 | | 8 | 46 | 73 | 55 | 83 | 21 | 75 | 22 | 22 | 3556 | 37 |
381 | | 9 | 2 | 4 | 1 | 39 | 109 | 15 | 0 | 199 | 48 | 3595 |
563 |
564 | From this you can see for example that the network has some trouble distinguishing 7 and 9, but that it absolutely never confuses a 7 for a 6, or a 6 for a 9.
565 |
566 | # Discussion
567 | In the above, we've seen how we can configure a multi-layer network consisting of linear combinations, 'relu units', LogSoftMax and finally the expectation 'one hot vector'. We also made this network learn, and quantified its success.
568 |
569 | This is about as far as linear combinations can go. And although 90+% correctness is nice, this network has really only learned what perfectly centered and rather clean digits look like. Concretely, this network is really attached to *where* the pixels are. We would expect a network that somehow 'understands' what it is looking at not to be so sensitive to placement.
570 |
571 | However, we can still feel pretty good - this tiny network did really well on its simple job, and we know *exactly* how it was trained and what it does.
572 |
573 | [In the next chapter](../dl-convolutional/), we'll be adding elements that actually capture shapes and their relations, which leads to performance that generalizes better, but also to more complexity and training time. [We'll also be going over some common neural network disappointments](../dl-convolutional/).
574 |
575 |
576 |
--------------------------------------------------------------------------------