├── autograd
│   ├── grads.png
│   ├── test.png
│   ├── grads2.png
│   ├── simple.png
│   ├── grads-final.png
│   └── index.md
├── dl-ocr-demo
│   ├── h.png
│   ├── boxed.png
│   ├── boxed-only.png
│   ├── cleaned.jpeg
│   ├── h-poster.png
│   ├── t-poster.png
│   └── index.md
├── dl-convolutional
│   ├── poster.png
│   ├── animation.mp4
│   ├── confusion-letters.png
│   ├── index.md
│   └── gelu.svg
├── hello-deep-learning
│   ├── diff.png
│   ├── boxed.png
│   ├── prod3.png
│   ├── seven.png
│   ├── three.png
│   ├── learning.mp4
│   └── index.md
├── first-learning
│   ├── random-image.png
│   ├── random-prod.png
│   ├── weights-anim.gif
│   ├── random-weights.png
│   └── index.md
├── hello-deep-learning-chapter1
│   ├── diff.png
│   ├── prod3.png
│   ├── prod7.png
│   ├── seven.png
│   ├── three.png
│   ├── sevens.png
│   ├── threes.png
│   ├── wrong-7-22.png
│   └── index.md
├── hyperparameters-inspection-adam
│   ├── sgd.gif
│   ├── sgd-complex-momentum.gif
│   ├── sgd-complex-no-momentum.gif
│   └── index.md
├── README.md
├── dropout-data-augmentation-weight-decay
│   ├── weight-decay-wait-evolution-scatter.png
│   └── index.md
├── LICENSE
├── dl-gru-lstm-dna
│   └── index.md
├── dl-and-now-what
│   └── index.md
├── dl-what-does-it-all-mean
│   └── index.md
├── hello-deep-learning-intro
│   └── index.md
└── handwritten-digits-sgd-batches
    └── index.md
/autograd/grads.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/grads.png
--------------------------------------------------------------------------------
/autograd/test.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/test.png
--------------------------------------------------------------------------------
/dl-ocr-demo/h.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/h.png
--------------------------------------------------------------------------------
/autograd/grads2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/grads2.png
--------------------------------------------------------------------------------
/autograd/simple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/simple.png
--------------------------------------------------------------------------------
/dl-ocr-demo/boxed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/boxed.png
--------------------------------------------------------------------------------
/autograd/grads-final.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/autograd/grads-final.png
--------------------------------------------------------------------------------
/dl-ocr-demo/boxed-only.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/boxed-only.png
--------------------------------------------------------------------------------
/dl-ocr-demo/cleaned.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/cleaned.jpeg
--------------------------------------------------------------------------------
/dl-ocr-demo/h-poster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/h-poster.png
--------------------------------------------------------------------------------
/dl-ocr-demo/t-poster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-ocr-demo/t-poster.png
--------------------------------------------------------------------------------
/dl-convolutional/poster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/poster.png
--------------------------------------------------------------------------------
/hello-deep-learning/diff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/diff.png
--------------------------------------------------------------------------------
/dl-convolutional/animation.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/animation.mp4
--------------------------------------------------------------------------------
/first-learning/random-image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/random-image.png
--------------------------------------------------------------------------------
/first-learning/random-prod.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/random-prod.png
--------------------------------------------------------------------------------
/first-learning/weights-anim.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/weights-anim.gif
--------------------------------------------------------------------------------
/hello-deep-learning/boxed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/boxed.png
--------------------------------------------------------------------------------
/hello-deep-learning/prod3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/prod3.png
--------------------------------------------------------------------------------
/hello-deep-learning/seven.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/seven.png
--------------------------------------------------------------------------------
/hello-deep-learning/three.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/three.png
--------------------------------------------------------------------------------
/first-learning/random-weights.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/first-learning/random-weights.png
--------------------------------------------------------------------------------
/hello-deep-learning/learning.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning/learning.mp4
--------------------------------------------------------------------------------
/dl-convolutional/confusion-letters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/confusion-letters.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/diff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/diff.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/prod3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/prod3.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/prod7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/prod7.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/seven.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/seven.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/three.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/three.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/sevens.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/sevens.png
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/threes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/threes.png
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/sgd.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hyperparameters-inspection-adam/sgd.gif
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/wrong-7-22.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hello-deep-learning-chapter1/wrong-7-22.png
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/sgd-complex-momentum.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hyperparameters-inspection-adam/sgd-complex-momentum.gif
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/sgd-complex-no-momentum.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/hyperparameters-inspection-adam/sgd-complex-no-momentum.gif
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | This is the Markdown from which https://berthub.eu/articles/posts/hello-deep-learning is/will
2 | be populated. This allows everyone to contribute better wording or examples
3 | or graphs etc.
4 |
5 |
--------------------------------------------------------------------------------
/dropout-data-augmentation-weight-decay/weight-decay-wait-evolution-scatter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dropout-data-augmentation-weight-decay/weight-decay-wait-evolution-scatter.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 bert hubert
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/dl-gru-lstm-dna/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Gated Recurrent Unit / LSTM: Some language processing, DNA scanning"
3 | date: 2023-03-29T13:00:00+02:00
4 | draft: true
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts.
7 |
8 | Placeholder page. Will mostly be an homage to the essential [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).
9 |
10 | Includes a demo trained on my blog posts that writes pretty plausible sentences.
11 |
12 | For example: "Galileo Problems and Indonesia and Safety Tolar experiments and communications are available to investigate our manufacturers also do not specific and lives on the fact that we are going to be the same generation in a single time. And that's it. I can report that the communication provider the market is a previously been in a ton of work to learn about the world where one strand shows the protein expression making a satellite, and then just like winning operations (in Europe) taken everything about the same time. The reader I have left on a lot of research and will still start with the real thing. Explain the Internet is not the case. "
13 |
14 | The network constructs sentences like these character by character, which is quite impressive. It generates valid markdown links too, for example.
15 |
16 | Page will also include a demo of how Gated Recurrent Units can spot splice junctions in DNA.
17 |
18 | In [the next chapter](../dl-what-does-it-all-mean) you can find some philosophizing about what it all means, analogies to biology and what the future might hold.
19 |
--------------------------------------------------------------------------------
/dl-and-now-what/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Further reading & worthwhile projects"
3 | date: 2023-03-30T12:00:09+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dl-and-now-what/index.md)!
7 |
8 | After having completed this series of blog posts (well done!) you should have a good grounding in what deep learning is actually doing. However, this was of course only a small 20,000-word introduction, so there is a lot left to learn.
9 |
10 | Unfortunately, there is a lot of nonsense online. Either the explanations are sloppy or they are just plain wrong.
11 |
12 | Here is an as yet pretty short list of things I've found to be useful. I very much hope to hear from readers about their favorite books and sites. You can send [pull requests directly](https://github.com/berthubert/hello-dl-posts/blob/main/dl-and-now-what/index.md) or email me on bert@hubertnet.nl
13 |
14 | Sites:
15 | * The [PyTorch documentation](https://pytorch.org/docs/stable/index.html) is very useful, even if you are not using PyTorch. It describes pretty well how many layers work exactly.
16 | * [Andrej Karpathy](https://twitter.com/karpathy)'s [micrograd](https://github.com/karpathy/micrograd) Python autogradient implementation is a tiny work of art
17 | * Andrej Karpathy's post [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/), and also [this post](https://karpathy.github.io/2019/04/25/recipe/)
18 | * [FastAI](https://fast.ai)'s Jupyter notebooks.
19 |
20 | Projects:
21 | * [Whisper.cpp](https://github.com/ggerganov/whisper.cpp), by hero worker [Georgi Gerganov](https://ggerganov.com/). An open source self-contained C++ version of OpenAI's whisper speech recognition model. You can run this locally on very modest hardware and it is incredibly impressive. Because the source code is so small it is a great learning opportunity.
22 | * [Llama.cpp](https://github.com/ggerganov/llama.cpp), again by Georgi, a C++ version of Meta's Llama "small" large language model that can run on reasonable hardware. Uses quantisation to fit in normal amounts of memory. If prompted well, the Llama model shows ChatGPT-like capabilities.
23 |
24 |
--------------------------------------------------------------------------------
/dl-what-does-it-all-mean/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Deep Learning: What does it all mean?"
3 | date: 2023-03-30T12:00:10+02:00
4 | draft: true
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts.
7 |
8 | XXX ENTIRELY draft not done XXX
9 |
10 | In writing this series, I've experienced first hand the 'wow factor' of having your new neural network do impressive things. As part of my alarming lack of focus, I am also a very amateur biologist. I study DNA and evolution, as exhibited for example in [my Nature Scientific Data paper](https://www.nature.com/articles/s41597-022-01179-8).
11 |
12 | Microbial life can be completely recreated from its pretty small genome, which is typically a few million DNA letters long. Each DNA letter (A, C, G or T) carries 2 bits of information. A whole bacterium therefore can be regarded as having around a megabyte of parameters. Incidentally, this is of similar size to many interesting neural networks.
13 |
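To make the arithmetic explicit (assuming a genome of roughly four million letters):

{{< katex display >}} 4\cdot 10^6 \text{ letters} \times 2 \text{ bits/letter} = 8\cdot 10^6 \text{ bits} \approx 10^6 \text{ bytes} {{< /katex >}}
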
14 | Both bacteria and neural networks can evolve new functionality by changing random parameters. For bacteria, we can see this process in action at day long timescales. For example, under lab conditions, a bacterial strain can evolve resistance to an antibiotic within a week. Other more fundamental things take a lot longer, but still happen. For example, in the [E. coli long-term evolution experiment](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment), bacteria took around 33000 generations to evolve a way to live off citrate under aerobic (with oxygen) conditions.
15 |
16 | The similarity here is that both networks and life have millions (or billions) of parameters, and that through changes of these, there is a path towards great improvements.
17 |
18 | This is in stark contrast to traditional computer programs, where if you make a change, either nothing happens or your program crashes. There is no random walk imaginable that suddenly adds new features or higher performance to your work.
19 |
20 | Now, it is not evident that the gradient descent techniques from neural networks are guaranteed to find interesting minima. But from observation, they very often do. Similarly, life has clearly been extremely successful at achieving interesting goals by tweaking millions or billions of parameters.
21 |
22 | Traditional optimizers of simpler functions often get stuck at local minima. But it appears that if you create a solution where you can tweak not just a few parameters but millions of them, it is possible to have a fitness landscape where it is extremely hard to get stuck in a local minimum. Or in other words, even without heroics, your network can wind its way down to a very good optimum.
23 |
24 | The outrageous success of both life and neural networks appears to argue for this hypothesis.
25 |
26 | # Generative AI
27 | It has been fascinating to see the discussion around what ChatGPT and similar systems do. Are they intelligent? What does that question even mean? ChatGPT sounds unreasonably sure of itself at times, even when it is generating text that is dead wrong. To the people that use this to disparage AI, I ask, have you ever met any people?
28 |
29 |
--------------------------------------------------------------------------------
/hello-deep-learning/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning"
3 | date: 2023-03-30T11:59:00+02:00
4 | draft: false
5 | images: [boxed.png]
6 | ---
7 |
20 | A from scratch GPU-free introduction to modern machine learning. Many tutorials exist already of course, but this one aims to really explain what is going on, from the ground up. Also, we'll develop the demo until it is actually useful on **real life** data which you can supply yourself.
21 |
22 | Other documents start out from the (very impressive) PyTorch environment, or they attempt to math it up from first principles.
23 | Trying to understand deep learning via PyTorch is like trying to learn aerodynamics from flying an Airbus A380.
24 |
25 | Meanwhile the pure maths approach ("see it is easy, it is just a Jacobian matrix") is probably only suited for seasoned mathematicians.
26 |
27 | The goal of this tutorial is to develop modern neural networks entirely from scratch, but where we still end up with really impressive results.
28 |
29 | [Code is here](https://github.com/berthubert/hello-dl). Markdown for blogposts can [also be found on GitHub](https://github.com/berthubert/hello-dl-posts) so you can turn typos into pull requests (thanks, the first updates have arrived!).
30 |
31 | Chapters:
32 |
33 | * [Introduction](../hello-deep-learning-intro) (which you can skip if you want)
34 | * [Chapter 1: Linear combinations](../hello-deep-learning-chapter1)
35 | * [Chapter 2: Some actual learning, backward propagation](../first-learning)
36 | * [Chapter 3: Automatic differentiation](../autograd)
37 | * [Chapter 4: Recognizing handwritten digits using a multi-layer network: batch learning SGD](../handwritten-digits-sgd-batches)
38 | * [Chapter 5: Neural disappointments, convolutional networks, recognizing handwritten **letters**](../dl-convolutional/)
39 | * [Chapter 6: Inspecting and plotting what is going on, hyperparameters, momentum, ADAM](../hyperparameters-inspection-adam)
40 | * [Chapter 7: Dropout, data augmentation and weight decay, quantisation](../dropout-data-augmentation-weight-decay)
41 | * [Chapter 8: An actual 1700 line from scratch handwritten letter OCR program](../dl-ocr-demo)
42 | * Chapter 9: Gated Recurrent Unit / LSTM: Some language processing, DNA scanning
43 | * Chapter 10: Attention, transformers, how does this compare to ChatGPT?
44 | * [Chapter 11: Further reading & worthwhile projects](../dl-and-now-what)
45 | * Chapter 12: What does it all mean?
46 |
47 |
48 |
53 |
--------------------------------------------------------------------------------
/hello-deep-learning-intro/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Intro"
3 | date: 2023-03-30T12:00:00+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. Also, feel free to skip this intro and [head straight for chapter 1](../hello-deep-learning-chapter1) where the machine learning begins!
7 |
8 | Deep learning and 'generative AI' have now truly arrived. Whether this is a good thing very much remains to be seen. What is certain, however, is that these technologies will have a huge impact.
9 |
10 | Up to late 2022, I had unwisely derided the advances of deep learning as overhyped nonsense from people doing fake demos. Turned out this was only half false - many of the demos were indeed fake.
11 |
12 | But meanwhile, truly staggering things were happening, and I had ignored all of that. In hindsight, I wish I had read and believed Andrej Karpathy's incredibly important 2015 post [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/). The examples in there are self-contained proof something very remarkable had been discovered.
13 |
14 | For me this meant I had to catch up and figure out what was going on. What is this magical stuff really? Soon I found myself in a maze of confusing YouTube videos and Jupyter notebooks that showed me awesome things, but that did not address how all this magic worked. Also, quite often when trying to reproduce what I had seen, the magic did not actually work.
15 |
16 | To make up for my somewhat idiotic ignorance, I went back to first principles to emulate a bit of what Andrej Karpathy had achieved: I set out to build a self-contained, simple, but still impressive demo of the technologies involved, one that would really showcase this awesome new technology, including its pitfalls.
17 |
18 | The goal is to really start from the ground up. Many other projects will tell you how to use the impressive deep learning tooling that is now available. This project hopes to show you what this tooling is actually doing for you to make the magic happen. And not only show: we're going to start truly from scratch - this is not built on top of PyTorch or TensorFlow. It is built on top of plain C++.
19 |
20 | In the chapters of this 'Hello Deep Learning' project, we'll build several solutions that do actually impressive things. The first solution is a relatively small from scratch program that will learn how to recognize handwritten letters, and also perform this feat on actual real life data -- something many projects conveniently skip.
21 |
22 | Along the way we'll cover many of the latest deep learning techniques, and employ them in our little programs.
23 |
24 | In this project, the 'from scratch' part means that we'll only be depending on system libraries, [a logging library](https://berthub.eu/articles/posts/big-data-storage/), [a matrix library](https://en.wikipedia.org/wiki/Eigen_(C%2B%2B_library)) and [an image processing library](https://github.com/nothings/stb). It serves no educational purpose to develop any of these things as part of this series. Yet, we will spend time on what the matrix library is doing for us, and why you should not ever roll your own.
25 |
26 | I hope you'll enjoy this trip through the fascinating world of deep learning. It has been my personal way of making up for years of ignorance, and with some luck, this project will not only have been useful for me.
27 |
28 | Finally, all pages are [hosted on github](https://github.com/berthubert/hello-dl-posts) and I very much look forward to receiving your pull requests to fix my inevitable mistakes or fumbled explanations!
29 |
30 | Now, do head on to [Chapter 1: Linear combinations](../hello-deep-learning-chapter1).
31 |
--------------------------------------------------------------------------------
/first-learning/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: actually learning something"
3 | date: 2023-03-30T12:00:02+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/first-learning/index.md)!
7 |
8 | In this chapter we're going to take [the neural network we made earlier](../hello-deep-learning), but actually make it do some learning itself. And, oddly enough, this demonstration will again likely simultaneously make you wonder "is this all??" and also impress you by what even this trivial stuff can do.
9 |
10 | The first part of this chapter covers the theory, and shows no code. The second part explains the code that makes it all happen. You can skip or skim the second part if you want to focus on the ideas.
11 |
12 | ## The basics
13 | [Earlier we configured a linear combination neural layer](../hello-deep-learning), in which we used an element-wise multiplication to recognize if an image was a 3 or a seven:
14 |
15 | *(figure: the input image, element-wise multiplied (∗) by the weights matrix, equals (=) the product whose sum gives the score)*
37 | This network achieved impressive accuracy on the very clean and polished EMNIST testing data, but partly because we carefully configured the network by hand. It did no learning of its own.
38 |
39 | ## How about some actual learning
40 | The key to calculating the verdict if something is a 3 or a 7 is the *weights* matrix. We manually initialized that matrix in the previous chapter. In machine learning, it is customary to randomly initialize the parameters. [But to what](https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79)? In practice, libraries tend to pick values uniformly distributed between
41 | {{< katex inline >}}-1/\sqrt{N}{{< /katex >}} and {{< katex inline >}}1/\sqrt{N}{{< /katex >}}, where {{< katex inline >}}N{{< /katex >}} is the number of coefficients in the input matrix.
42 |
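As an aside, here is a minimal standalone sketch of what such an initialization looks like in plain C++; the `weights.randomize(1.0/sqrt(28*28))` call that appears in the code section further down does the equivalent for our `Tensor` class.

```C++
#include <cmath>
#include <random>
#include <vector>

// Fill a weights vector with values drawn uniformly from [-1/sqrt(N), 1/sqrt(N)],
// where N is the number of coefficients (for a 28x28 image, N = 784)
std::vector<float> uniformInit(unsigned int N)
{
  std::mt19937 gen(std::random_device{}());
  float limit = 1.0f / std::sqrt((float)N);
  std::uniform_real_distribution<float> dist(-limit, limit);

  std::vector<float> w(N);
  for(auto& v : w)
    v = dist(gen);
  return w;
}
```
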
43 | Such a randomly chosen matrix will of course not yet be of any use:
44 |
45 | *(figure: the input image ∗ the randomly initialized weights = the resulting product)*
58 | The result of this multiplication and subsequent summation is 0.529248, so our random weights got it wrong: this is actually a three, and the resulting score should have been negative.
59 |
60 | So, what to do? What's the simplest thing we could even do?
61 |
62 | Recall that the 'score' we are looking at is the sum of the element-wise product of the image pixels ({{< katex inline >}}p_n{{< /katex >}}) and the weights ({{< katex inline >}}w_n{{< /katex >}}). Or, concretely, this summation over 28*28=784 elements:
63 |
64 | {{< katex display >}}R=p_1w_1 + p_2w_2 + \cdots + p_{783}w_{783} + p_{784}w_{784}{{< /katex >}}
65 |
66 | Our current random weights delivered us an {{< katex inline >}}R{{< /katex >}} that was too high. We can't change the input image pixels, but we can simply decide to lower the various weights, as this will deliver a lower {{< katex inline >}}R{{< /katex >}}. So by how much should we lower them?
67 |
68 | There is no impact for this image if we lower {{< katex inline >}}w_1{{< /katex >}} since the first pixel {{< katex inline >}}p_1{{< /katex >}} is 0 (black). And in fact, we'll get the biggest impact if we lower parameters in places of bright pixels.
69 |
70 | In practice in neural networks, we often lower each {{< katex inline >}}w_n{{< /katex >}} by {{< katex inline >}}0.1p_n{{< /katex >}}. This is then called a 'learning rate of 0.1'. Note that this effectively means: make bigger changes where they matter more.
71 |
72 | We do this lowering (or raising) in the direction of the desired outcome. So if the network had looked at a seven and produced a negative output, we'd be doing this learning in the opposite direction by increasing the weight parameters by 0.1 of the value of the input pixel.
73 |
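Spelled out as code, one such learning step could look like this (a plain-array sketch; the `Tensor`-based version from the actual program appears in the code section below):

```C++
#include <array>

// One learning step on a single 28x28 image, as a plain-array sketch.
// For a seven the score should go up, for a three it should go down.
void learnStep(const std::array<float, 784>& pixels,
               std::array<float, 784>& weights, bool isSeven)
{
  const float learningRate = 0.1f;
  for(size_t i = 0; i < pixels.size(); ++i) {
    if(isSeven)
      weights[i] += learningRate * pixels[i];  // push the score up, most where pixels are bright
    else
      weights[i] -= learningRate * pixels[i];  // push the score down
  }
}
```
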
74 | Now, although this sounds ridiculously simple minded ("just twist the knobs so the score goes in the right direction"), let's give this a spin:
75 |
76 | ```
77 | $ ./37learn
78 | Have 240000 training images and 40000 validation images.
79 | 50.5375% correct
80 | 50% correct
81 | 81.8125% correct
82 | 86.675% correct
83 | 58.8% correct
84 | 85.65% correct
85 | 79.375% correct
86 | ...
87 | 98.025% correct
88 | ```
89 |
90 | Recall how our carefully hand-configured neural network managed to achieve 97.025%. Here is an animation (left) showing the evolution of the weights matrix, from its initial random form to something remarkably like what we hand-configured earlier (right):
91 |
92 | *(animation: the weights matrix evolving from its random start (left) into something much like the hand-configured weights (right))*
110 | It appears that even our astoundingly simplistic learning technique delivered a pretty good result.
111 |
112 | The process described above is called 'backpropagation', and it is at the absolute core of any neural network, including ChatGPT or any other mega impressive network you may have heard about. Continuing the theme from the previous chapter, it is confusing that a technique this simple and unimpressive might have such remarkable success.
113 |
114 | In the next chapter we'll talk more about this process, and the computational challenges that it brings across more complex networks.
115 |
116 | # The code
117 | Now for the real details. The code [can be found here](https://github.com/berthubert/hello-dl/blob/main/37learn.cc).
118 |
119 | ```C++
120 | Tensor weights(28,28);
121 | weights.randomize(1.0/sqrt(28*28));
122 |
123 | saveTensor(weights, "random-weights.png", 252);
124 |
125 | float bias=0;
126 | ```
127 |
128 | We start out by initializing a weights matrix to random numbers between {{< katex inline >}}-1/\sqrt{28*28}{{< /katex >}} and {{< katex inline >}}1/\sqrt{28*28}{{< /katex >}}. In the informal explanation above, I neglected to mention the *bias*, which is part of the score formula:
129 |
130 | {{< katex display >}} R =\sum{\mathit{image}\circ{}w} + b {{< /katex >}}
131 |
132 | Next up we need to set the *learning rate*:
133 | ```C++
134 | Tensor lr(28,28);
135 | lr.identity(0.01);
136 | ```
137 | The learning rate is what we need to multiply our image with to know how much to adjust the weights. Now, we'd love to just multiply the image by 0.01, but that is not how matrices work. If you want to multiply each coefficient of a matrix by a factor, you need to set up another matrix with that factor on all diagonal coefficients ('from the top left to the bottom right'). Our tensor class has an `identity()` method just for that purpose. [This Wikipedia page](https://en.wikipedia.org/wiki/Identity_matrix) may or may not be helpful.
138 |
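To illustrate the point with a tiny standalone example (plain Eigen here, not the demo's `Tensor` class):

```C++
#include <Eigen/Dense>
#include <iostream>

int main()
{
  Eigen::Matrix2f img;
  img << 0.0f, 0.5f,
         1.0f, 0.25f;

  // 0.01 on the diagonal, zero elsewhere
  Eigen::Matrix2f lr = 0.01f * Eigen::Matrix2f::Identity();

  // multiplying by this scaled identity matrix scales every coefficient by 0.01
  std::cout << img * lr << "\n";
}
```
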
139 | > Earlier on this page we mentioned 0.1 as a typical learning rate. Here I've chosen 0.01 since the network learns plenty fast already, and by slowing it down, the final results improve a bit. This is because the network can seek out optima somewhat more diligently. In a later chapter we'll read about learning rate schedulers that automate this process.
140 |
141 | Next up, let's do some learning:
142 |
143 | ```C++
144 | for(unsigned int n = 0 ; n < mn.num(); ++n) {
145 | int label = mn.getLabel(n);
146 | if(label != 3 && label != 7)
147 | continue;
148 |
149 | if(!(count % 4)) {
150 | if(doTest(mntest, weights, bias) > 98.0)
151 | break;
152 | }
153 | ```
154 |
155 | As earlier, this goes over all training samples of EMNIST. In addition, after every 4 images, we test our weights and bias against the validation database. If this `doTest` function returns that we got more than 98% of images correct, we leave the loop.
156 |
157 | Next:
158 |
159 | ```C++
160 | Tensor img(28,28);
161 | mn.pushImage(n, img);
162 | float res = (img.dot(weights).sum()(0,0)) + bias; // the calculation
163 | int verdict = res > 0 ? 7 : 3;
164 |
165 | if(label == 7) {
166 | if(res < 2.0) {
167 | weights.raw() = weights.raw() + img.raw() * lr.raw();
168 | bias += 0.01;
169 | }
170 | } else {
171 | if(res > -2.0) {
172 | weights.raw() = weights.raw() - img.raw() * lr.raw();
173 | bias -= 0.01;
174 | }
175 | }
176 | ++count;
177 | }
178 | ```
179 | This is where the actual learning happens. If we just fed the neural network a 7, and if the calculated score was less than 2, we increase all the weights by `lr` of the associated pixel value.
180 |
181 | Similarly, if we fed the network a 3, we lower all the weights, unless the score was already below -2.
182 |
183 | The reason we test against 2 or -2 is that otherwise the network would eventually move parameters all the way to infinity.
184 |
185 | In upcoming chapters we'll see how the use of [activation functions](https://en.wikipedia.org/wiki/Activation_function) replaces the need for such crude limits.
186 |
187 | As earlier, the somewhat ugly `.raw()` functions are necessary to prevent the slightly magic `Tensor` class from doing all kinds of work for us. In the next chapter, we're going to dive further into the theory of backpropagation, and that is when the `Tensor` class is going to shine.
188 |
189 | Rounding off this chapter, using a surprisingly [small amount of code](https://github.com/berthubert/hello-dl/blob/main/37learn.cc) we've been able to make a neural network learn how to distinguish between images of the digits 3 and 7. It should be noted that this is very clean data, and that by focusing on only 2 digits the task isn't that hard.
190 |
191 | Still, as earlier, it is somewhat disconcerting how effective these techniques are even when what we are doing appears to be trivial.
192 |
193 | In the next chapter, [we're going to learn all about automatic differentiation](../autograd).
194 |
195 |
--------------------------------------------------------------------------------
/dropout-data-augmentation-weight-decay/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Dropout, data augmentation, weight decay and quantisation"
3 | date: 2023-03-30T12:00:07+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dropout-data-augmentation-weight-decay/index.md)!
7 |
8 | In the previous chapter we found ways to speed up our character recognition learning by a factor of 20 by using a better optimizer, and a further factor of four by cleverly using threads in a 'shared nothing' architecture. We also learned how we can observe the development of parameters.
9 |
10 | So our convolutional network is now super fast, and performs well on training and validation sets. But is it robust? Or has it internalized too much of the training set details?
11 |
12 | Previously we found that the performance of our handwritten digit recognizer plummeted if we flipped a few pixels or moved the digit around slightly. And the reason behind that was that the linear combination behind that network was tied to actual pixel positions, and not to shapes.
13 |
14 | We now have a fancy and fast convolutional network that can theoretically do a lot better. Let's see how the network performs with slightly modified inputs:
15 |
16 | *(figure: accuracy on unmodified inputs versus inputs shifted by two pixels with five flipped pixels)*
23 | Well, that is disappointing. Although results are better than with the simple linear combination, we still take a significant hit when we move the image by only two pixels, and flip 5 random pixels from light to dark. No human being would be fooled by (or perhaps even notice) these changes.
24 |
25 | How is this possible? It turns out that a network can't learn from what it doesn't see. If all the inputs are centered exactly and have no noise, the network never learns to deal with off-center or corrupted inputs.
26 |
27 | In this post we'll go over several modern techniques to enhance performance, robustness and efficiency.
28 |
29 | # Data augmentation
30 | Through a technique called [data augmentation](https://en.wikipedia.org/wiki/Data_augmentation), we can shake up our training set, making sure our network is exposed to more variation. And lo, when we do that, training and validation again score similarly, and only slightly worse than on unmodified data:
31 |
32 | *(figure: training and validation accuracy with data augmentation enabled)*
39 | Data augmentation has several uses. It can make a network more robust by erasing hidden assumptions - even those you might not have been aware of. Hidden constant factors between training and validation sets are a major reason why networks that appear to do well, fail in the field. Because out there in the real world, samples aren't neatly centered and free from noise.
40 |
41 | In addition, if you are short on training data, you can augment it by creating modified versions of inputs. This in effect enlarges your training set. Possible modifications include skewing or rotating images, adding noise, making inputs slightly larger or smaller, or changing colors. It pays to try a lot of things - the more you try, the larger your chances are of creating a dataset that can only be learned by understanding the essence of the inputs.
42 |
43 | In the demo code, data augmentation is implemented [here](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc#L24). It moves the image around by -2 to +2 pixels, and flips the value of 5 random pixels.
44 |
45 | # Normalization
46 | The inverse of data augmentation might be called normalization. Many training sets were gathered using highly repeatable measurements. For example, faces were photographed under the same lighting, or scanned images were normalised to a certain average brightness, with similar standard deviations.
47 |
48 | You could undo such normalisation by brightening or dimming your inputs, and retraining. Or you could do the reverse and also normalise any inputs before running your network. Most networks, including our letter reading one, perform such normalisation. This is appropriate for any metric where you can objectively normalise. This is for example not the case for unskewing or rotating images back to their 'normal' state, because you don't know what that looks like.
49 |
50 | Our demos so far have been doing image normalization like this:
51 |
52 | ```C++
53 | d_model.img.normalize(0.172575, 0.25);
54 | ```
55 | This normalizes the mean pixel value to 0.172575 and the standard deviation to 0.25. So why these specific numbers? I applied a common machine learning trick: I picked them [from another model that works well](https://github.com/pytorch/examples/blob/main/cpp/mnist/mnist.cpp).
56 |
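As a sketch of what such a `normalize(mean, stdev)` call might do internally (the actual `Tensor` implementation may differ in its details):

```C++
#include <Eigen/Dense>
#include <cmath>

// Shift and scale all pixel values so they end up with the requested mean and standard deviation
void normalize(Eigen::MatrixXf& img, float targetMean, float targetStdev)
{
  float mean = img.mean();
  float stdev = std::sqrt((img.array() - mean).square().mean());

  img = ((img.array() - mean) / stdev * targetStdev + targetMean).matrix();
}
```
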
57 | # Dropout
58 | Another important technique to make networks generalise is called 'dropout'. By randomly zeroing out parts of the network, we force it to develop multiple pathways to determine how to classify an input. In addition, the network can't rely on accidental features to do classification since it won't always see those accidental features.
59 |
60 | Once the network is in production, we no longer perform the dropout, which gives a relative boost to performance. In some contexts, dropout is absolutely mandatory, but it does not do a lot for our letter recognizer. It does make learning harder:
61 |
62 | *(figure: training versus validation accuracy while learning with dropout enabled)*
68 | Note that here the validation clearly outperforms the training set, which is made harder by the dropout. Training also takes a lot longer, and in some cases does not converge. It does however lead to a network that should be immune against overtraining. Overtraining is easily recognized when performance on the training set is higher than on the validation set. Dropout reverses that.
69 |
70 | If dropout is set to 50%, on average 50% of values of a tensor will be set to zero. A little known fact is that [the other values are then doubled](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html). This means that the overall impact of this tensor is retained (on average).
71 |
72 | In our code, the implementation is an element-wise multiplication of a tensor with a tensor filled with zeros for blanked out positions, and the multiplication factor for the rest.
73 |
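A sketch of what building such a dropout mask could look like (an illustration, not the demo's exact code):

```C++
#include <Eigen/Dense>
#include <random>

// Build a dropout mask: each coefficient is 0 with probability 'rate',
// and 1/(1-rate) otherwise, so the expected overall contribution is unchanged.
// During training, multiply a layer's output element-wise with this mask.
Eigen::MatrixXf dropoutMask(int rows, int cols, float rate, std::mt19937& gen)
{
  std::bernoulli_distribution drop(rate);
  float keepScale = 1.0f / (1.0f - rate);

  Eigen::MatrixXf mask(rows, cols);
  for(int r = 0; r < rows; ++r)
    for(int c = 0; c < cols; ++c)
      mask(r, c) = drop(gen) ? 0.0f : keepScale;
  return mask;
}
```
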
74 | # Weight decay
75 | Not all parts of a network end up being used. But they are still there, and the parameters in those unused parts can have large numerical values. These parts could however become active for certain inputs that weren't observed during training, and then disrupt things, degrading the network's performance.
76 |
77 | It could therefore be useful to put a slight zero-ward pressure on parameter values that apparently have no impact. This has a deep analogy to what happens in biology, where genes that are not used tend to decay to a non-functional state, and in the case of microbial life, even get cut out of the genome.
78 |
79 | In neural networks, a surprisingly easy way to achieve a similar effect is by including the sum of all squared parameters in the loss function. Recall that the learning process effectively tries to minimise the loss to zero - by adding these squared values, there is an automatic zero-ward pressure. This works surprisingly well, and I think this is pretty meaningful.
80 |
81 | *(figure: histogram of squared parameter values before and after weight decay)*
87 | Here we take a previously trained model and then turn on weight decay. Note how the distribution shifts leftward. The move appears modest, but this is a logarithmic plot. Many parameters go down by a factor of 10 or more. Here is a cumulative distribution:
88 |
89 | *(figure: cumulative distribution of squared parameter values before and after weight decay)*
95 | Here we can see that initially 20,000 parameters had a squared value of less than 0.0001. After the weight decay process, this number goes up to 80,000.
96 |
97 | If we look at each parameter:
98 |
99 | *(figure: scatter plot of each parameter's squared value before versus after weight decay)*
105 | Here everything below the black line represents a decrease of a parameter's squared value. It can be seen that especially larger values are reduced by a large fraction.
106 |
107 | Essentially, after weight reduction we have a network that still functions well, but now effectively with a lot fewer parameters (if we remove tiny values). I find it pretty remarkable that we can achieve this just by adding the squared value of all parameters to the loss function. Such a simple mathematical operation, yet it gives us a simpler network.
108 |
109 | The implementation is near trivial:
110 | ```C++
111 | if(weightfact(0,0) != 0.0) {
112 | weightsloss = weightfact*(s.c1.SquaredWeightsSum() + s.c2.SquaredWeightsSum() + s.c3.SquaredWeightsSum() +
113 | s.fc1.SquaredWeightsSum() + s.fc2.SquaredWeightsSum() + s.fc3.SquaredWeightsSum());
114 |
115 | loss = modelloss + weightsloss;
116 | }
117 | else
118 | loss = modelloss;
119 | ```
120 |
121 | Here `weightfact` determines how heavily we weigh down on the squared weights. 0.02 appears to work well for our model.
122 |
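A one-line way to see why this pushes parameters toward zero: adding {{< katex inline >}}\lambda w^2{{< /katex >}} to the loss for each weight adds {{< katex inline >}}2\lambda w{{< /katex >}} to its gradient, so every update step shrinks the weight by an amount proportional to its own size:

{{< katex display >}} \frac{\partial}{\partial w}\left(\lambda w^2\right) = 2\lambda w \quad\Rightarrow\quad w \leftarrow w - \eta\cdot 2\lambda w {{< /katex >}}
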
123 | On a closing note, the number of parameters impacts how much memory and CPU/GPU a model requires to function. Currently, networks use gigantic amounts of electrical power, which is not sustainable. If we can use this technique to slim down networks, that would be very good.
124 |
125 | In addition, we might be able to understand better what is going on if we have fewer parameters to look at.
126 |
127 | # Quantisation
128 | From the histograms above, we can see that most parameter values cluster close together. In most networks, such parameters are stored as 32 bit single precision floating point numbers. But do we actually need all those 32 bits? Given how much we could drive down the parameter values with no impact on performance, it is clear we do not need to store very large numerical values.
129 |
130 | We can easily imagine a reduction to 16 bits working - this effectively only adds some noise to the network. And indeed, the industry is rapidly moving to 16 bits floating point. Even [processors](https://networkbuilders.intel.com/solutionslibrary/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide) and GPUs have gained native ability to perform operations on such half-precision floating point numbers.
131 |
132 | It turns out however that on large language model networks, one can go down to **4 bit precision** without appreciable loss of performance. Hero worker [Georgi Gerganov](https://ggerganov.com/) has implemented such quantisation in his [C++ version of Facebook's Llama model](https://github.com/ggerganov/llama.cpp), and it works very well.
133 |
134 | To perform quantisation, values are divided into 2^n bins of equal population, like this:
135 |
140 |
141 | And values are then stored as 4 bits, indicating which bin they correspond to. Interestingly enough, there are even binary networks with only two values. Out there in the real world, 8-bit networks [are already seeing production use](https://blog.plumerai.com/).
142 |
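A rough sketch of such equal-population binning (purely illustrative, since as the note below says, quantisation is not implemented in the demo code):

```C++
#include <algorithm>
#include <cstdint>
#include <vector>

// Map each parameter to a 4-bit bin index (0..15), with bin edges chosen so that
// every bin holds roughly the same number of parameters. To use the network you
// would also store one representative value (for example the mean) per bin.
std::vector<uint8_t> quantize4bit(const std::vector<float>& params, std::vector<float>& edges)
{
  std::vector<float> sorted = params;
  std::sort(sorted.begin(), sorted.end());

  edges.clear();
  for(size_t b = 1; b < 16; ++b)          // 15 interior edges -> 16 equal-population bins
    edges.push_back(sorted[b * sorted.size() / 16]);

  std::vector<uint8_t> out;
  out.reserve(params.size());
  for(float p : params)
    out.push_back(std::lower_bound(edges.begin(), edges.end(), p) - edges.begin());
  return out;
}
```
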
143 | > This is the only feature discussed in this blog series that is not (currently) present in the demo code.
144 |
145 | # Summarising
146 | There is a lot that can be done to networks to improve their efficiency and performance. By augmenting our data, through small changes, we can make sure the network is exposed to more variation, and in this way becomes more robust against real life input.
147 |
148 | Similarly, by performing internal dropout, the network is forced to learn how to recognize the input while not being able to rely on artifacts.
149 |
150 | By adding a fraction of the squared value of parameters to the loss function, we can perform weight decay, which drives parameters to zero if they are not contributing to the result. This again aids in robustness, since stray unused neurons have less chance of interfering. Furthermore, we might drop very small value neurons from our network entirely, and still have a working network.
151 |
152 | Finally, quantisation is the art of storing the weights in fewer bits which, kinda surprisingly, can be done without impacting performance too much.
153 |
154 | Next up, [we are going to do some actual OCR with what we've learned](../dl-ocr-demo)!
155 |
--------------------------------------------------------------------------------
/dl-ocr-demo/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Doing some actual OCR on handwritten characters"
3 | date: 2023-03-30T12:00:08+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dl-ocr-demo/index.md)!
7 |
8 | The [previous](../dropout-data-augmentation-weight-decay/) chapters have often mentioned the chasm between "deep learning models that work on my data" and "it actually works in the real world". It is perhaps for this reason that almost all demos and YouTube tutorials you find online never do any real world testing.
9 |
10 | Here, we are going to do it, and this will allow us to experience first hand how hard this is. We're going to build a computer program that reads handwritten letters from a provided photo, based [on the convolutional model developed earlier](../dl-convolutional/#convolutional-networks).
11 |
12 | # Training
13 | All elements described in previous chapters are present in [tensor-convo-par.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc). This tool can train our alphabet model, and optionally use the Adam optimizer, do dropout, weight decay, and data augmentation.
14 |
15 | Here are its options:
16 |
17 | ```bash
18 | Usage: tensor-convo-par [-h] [--learning-rate VAR] [--alpha VAR] [--momentum VAR]
19 | [--batch-size VAR] [--dropout] [--adam] [--threads VAR] [--mut-on-learn]
20 | [--mut-on-validate] state-file
21 |
22 | Positional arguments:
23 | state-file state file to read from [default: ""]
24 |
25 | Optional arguments:
26 | -h, --help shows help message and exits
27 | -v, --version prints version information and exits
28 | --lr, --learning-rate learning rate for SGD [default: 0.01]
29 | --alpha alpha value for adam [default: 0.001]
30 | --momentum [default: 0.9]
31 | --batch-size [default: 64]
32 | --dropout
33 | --adam
34 | --threads [default: 4]
35 | --mut-on-learn augment training data
36 | --mut-on-validate augment validation data
37 | ```
38 | When this program runs, it starts with a freshly randomised state, or one read from the specified `state-file`. It will periodically save its state to a file called `tensor-convo-par.state`. So you can restart from an existing state with ease, possibly using different settings.
39 |
40 | While the program learns, it emits statistics to a sqlite file called `tensor-convo-par-vals.sqlite3` which you can use to study what is going on, as outlined in [this chapter of the series](../hyperparameters-inspection-adam/#inspection).
41 |
42 | Details on how to build and run this program can [be found here](https://github.com/berthubert/hello-dl/blob/main/README.md).
43 |
44 | For testing purposes, settings `--adam --mut-on-learn --mut-on-validate` work well. If you run it like this, you can terminate the process after 30 minutes or so, and have a decent model.
45 |
46 | # Real world input
47 | Here is our input image, which already has a story behind it:
48 | *(image: the photo of handwritten letters used as input)*
56 | When I first got the OCR program working, results were very depressing. The network struggled mightily on some letters, often just not getting them right. Whatever I did, the 'h' would not work for example. First I blamed my own sloppy handwriting, but then I studied what the network was trained on:
57 |
58 | *(image: a poster of 'h' samples from the training data)*
65 | Compare this to how I (& many other Europeans) write an h:
66 | *(image: the 'h' as I write it)*
73 | No amount of training on the EMNIST set is going to teach a neural network to consistently recognize this as an h - this shape is simply not in the training set.
74 |
75 | So that was the first lesson - really be aware of what is in your training data. If it is different from what you thought, results might very well disappoint. To make progress, I changed my handwriting in the test image to something that looks like what is actually in the EMNIST data.
76 |
77 | # Practicalities
78 | The source code of the OCR program [is here](https://github.com/berthubert/hello-dl/blob/main/img-ocr.cc), and it is around 300 lines. For image processing, I found the [stb collection](https://github.com/nothings/stb) of single-file public domain include files very useful. To run the program, run something like `./img-ocr sample/cleaned.jpeg tensor-convo-mod.par`. It will generate a file called `boxed.png` for you with the results in there.
79 |
80 | So, getting started: what we have is a network that does pretty well on 28 by 28 pixel representations of letters, where the background pixel value is 0. By contrast, input images tend to have millions of pixels, in full colour even, and in them black pixels have a value of 0, which is the inverse of our training data.
81 |
82 | The first thing to do is to turn the image into a gray scale version, where we also adjust the white balance so black is actually black and where the gray that passes for white is actually white.
83 |
84 | From OCR theory I learned that the first step in character segmentation is to recognize lines of text. This is done by making a graph of the total intensity per horizontal line of the image. From this graph, you then try to select intervals of high intensity that look like they might represent a line of text.
85 |
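In code, the row-profile idea could look roughly like this (a simplified sketch; the real segmentation in img-ocr.cc is more involved):

```C++
#include <utility>
#include <vector>

// Sum the ink intensity of every row and report intervals of consecutive rows
// whose total exceeds a threshold: these are candidate lines of text.
std::vector<std::pair<int,int>> findTextLines(const std::vector<std::vector<float>>& ink, float threshold)
{
  std::vector<std::pair<int,int>> lines;
  int start = -1;
  for(int r = 0; r < (int)ink.size(); ++r) {
    float rowsum = 0;
    for(float v : ink[r])
      rowsum += v;

    if(rowsum > threshold && start < 0)
      start = r;                        // a line of text begins here
    else if(rowsum <= threshold && start >= 0) {
      lines.push_back({start, r});      // and ends here
      start = -1;
    }
  }
  if(start >= 0)
    lines.push_back({start, (int)ink.size()});
  return lines;
}
```
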
86 | For each line, you then travel from left to right to try to box in characters.
87 |
88 | This leads us to:
89 | *(image: the scanned text with boxes drawn around the detected characters)*
96 | Note that this is already hard work, and not very robust. How hard this is is yet another reminder that a lot of machine learning is in fact preprocessing your data so it cooperates. Compare it to painting a house - lots of sanding and tape, and finally a fun bit of painting.
97 |
98 | The code that does this segmentation in [img-ocr.cc](https://github.com/berthubert/hello-dl/blob/main/img-ocr.cc) is none too pretty and has only been worked on so it does enough of a job that we can demo our actual neural network. (Pull requests welcome!)
99 |
100 | # Loading the network
101 | Firing up the network is much like we did while training:
102 |
103 | ```C++
104 | ConvoAlphabetModel m;
105 | ConvoAlphabetModel::State s;
106 |
107 | cout<<"Loading model state from file '"<<argv[2]<<"'\n"; // reconstructed line
108 | // ... read the saved weights from the file into 's' (see img-ocr.cc for the exact code)
109 | ```
110 |
111 | Then, for each character rectangle found by the segmentation step, we build a white-balanced grayscale copy:
112 |
113 | ```C++
114 | vector<uint8_t> newpic;
124 | newpic.reserve((l.lstopcol-l.lstartcol) * (l.lstoprow-l.lstartrow));
125 |
126 | for(int r= l.lstartrow; r < l.lstoprow; ++r) {
127 | for(int c= l.lstartcol ; c < l.lstopcol; ++c) {
128 | int intensity = getintens(c, r);
129 | if(intensity < whiteballow)
130 | newpic.push_back(255);
131 | else if(intensity > whitebalhigh)
132 | newpic.push_back(0);
133 | else
134 | newpic.push_back(255*(1- pow((intensity - whiteballow)/(whitebalhigh - whiteballow), 1) ));
135 | }
136 | }
137 | ```
138 | This iterates over the rectangles, and the first thing it does is fix the white balance, and create a new balanced image in `newpic`.
139 |
140 | Our network needs a 28x28 pixel version, which is not what we get. Usually we get a lot more pixels, but not necessarily with a square aspect ratio. To make our box square, we previously enlarged the smallest dimension so it has the same size as the largest one. From [stb_image_resize.h](https://github.com/nothings/stb/blob/master/stb_image_resize.h) we get functionality to do high quality resizing.
141 |
142 | As another example of how things tend to be more difficult than you think, the MNIST training data is 28x28 pixels, BUT, the outer 2 pixels are always empty. So in fact, the network trains on 24x24 sized letters. This means that to match our image data to our network, we had best also resize letters to 24x24 pixels, and place them in the middle of a 28x28 grid:
143 |
144 |
145 | ```C++
146 | vector<uint8_t> scaledpic(24*24);
147 | stbir_resize_uint8(&newpic[0], l.lstopcol - l.lstartcol, l.lstoprow - l.lstartrow, 0,
148 |                    &scaledpic[0], 24, 24, 0, 1);
149 |
150 | m.img.zero();
151 | for(unsigned int r=0; r < 24; ++r)
152 |   for(unsigned int c=0; c < 24; ++c)
153 |     m.img(2+r,2+c) = scaledpic[c+r*24]/255.0;
154 |
155 | m.img.normalize(0.172575, 0.25);
156 | ```
157 | > Note that if we performed [data augmentation](../dropout-data-augmentation-weight-decay/#data-augmentation), our network should be robust against off center letters, or pixels in the outer two rows. But let's not make life harder than necessary for our model.
158 |
159 | On the last line, we perform normalization [as described previously](../dropout-data-augmentation-weight-decay/#normalization) so that the pixels have a similar brightness to what the network is used to. This may feel like cheating, but this kind of normalization is an objective mathematical operation. Your eyes for example do the same thing by dilating your pupils so the photoreceptor cells receive a normalized amount of photons. Those cells in turn again also change their sensitivity depending on light levels.
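
As a sketch of what I assume such a normalization step does - rescale the pixels so their mean and standard deviation match those of the training data; the real `Tensor::normalize` may differ in detail:

```C++
#include <cmath>

// Rescale n pixel values so their mean/stddev become targetMean/targetStddev
void normalizePixels(float* img, unsigned int n, float targetMean, float targetStddev)
{
  float mean = 0, var = 0;
  for(unsigned int i = 0; i < n; ++i)
    mean += img[i];
  mean /= n;
  for(unsigned int i = 0; i < n; ++i)
    var += (img[i] - mean) * (img[i] - mean);
  float stddev = sqrtf(var / n);
  if(stddev == 0)
    return;                       // a blank box, nothing to rescale
  for(unsigned int i = 0; i < n; ++i)
    img[i] = targetMean + (img[i] - mean) * targetStddev / stddev;
}
```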
160 |
161 | Next up, we can ask the network what it made of our character:
162 |
163 | ```C++
164 | m.expected.oneHotColumn(0);
165 | m.modelloss(0,0); // makes the calculation happen
166 |
167 | int predicted = m.scores.maxValueIndexOfColumn(0);
168 | cout<<"predicted: "<<(char)(predicted+'a')<
177 |
178 | 
179 |
180 |
181 |
182 |
183 | Now, if you run the training yourself (which I encourage you to do), you'll find that the network will always make a few mistakes. In this sample, it gets the L wrong and thinks it is a C. It does the same for the T. If you train with different settings, it will get other letters wrong.
184 |
185 | In this animated version, recorded while a network was learning, you can see the network flip around as it is improving:
186 |
187 |
188 |
193 |
194 |
195 | It is highly instructive to try to improve [img-ocr.cc](https://github.com/berthubert/hello-dl/blob/main/img-ocr.cc) and pull requests are welcome!
196 |
197 | But always check what the training data say - for example, the specific form of the 't' that the network got wrong above [is not well represented in the training set](t-poster.png).
198 |
199 | Also - it may be good at this point to realise we've written a functional OCR program, including training, in around 1800 lines of code. This is quite remarkable, and without neural networks this would never have worked.
200 |
201 | # Summarising
202 | The initial attempt to test this network on real life data failed somewhat because the MNIST character set does not include all forms of letters (which is by design, by the way).
203 |
204 | Secondly, we've learned that actually _doing_ something with real life data requires a lot of preprocessing: isolating letters, fixing the white balance, boxing in characters and adjusting them to what the network expects.
205 |
206 | The end result however is quite pleasing, especially since we spent only 300 lines on 'infrastructure' to get the data ready for our network.
207 |
208 | And, it should be noted that the total line count of 1500 for training and 300 for inference is impressively low.
209 |
210 |
213 | In [the next chapter](../dl-and-now-what/) you'll find further reading & pointers where to continue your deep learning journey.
214 |
--------------------------------------------------------------------------------
/hello-deep-learning-chapter1/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Linear combinations"
3 | date: 2023-03-30T12:00:01+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/hello-deep-learning-chapter1/index.md)!
7 |
8 | In this chapter we're going to build our first neural network and take it for a spin. Weirdly, this demonstration will likely simultaneously make you wonder "is this all??" and also impress you by what even this trivial stuff can do.
9 |
10 | The first part of this chapter covers the theory, and shows no code. The second part explains the code that makes it all happen. You can skip or skim the second part if you want to focus on the ideas.
11 |
12 | {{< rawhtml >}}
13 |
18 | {{< /rawhtml >}}
19 |
20 |
21 | ## Hello, world
22 | The "Hello, world" of neural networks is the MNIST set of handwritten digits. Meticulously collected, sanitized and labeled, this collection of 280,000 images is perfect to get started with. Most tutorials use MNIST, but because this one is written in 2023, we can use the ['extended' and improved EMNIST dataset](https://www.nist.gov/itl/products-and-services/emnist-dataset).
23 |
24 | For our first sample, we're going to write a neural network that can distinguish images of the digits 3 and 7, inspired by [this excellent FastAI tutorial for PyTorch](https://github.com/fastai/fastbook/blob/master/04_mnist_basics.ipynb).
25 |
26 | To start out with, we're not yet going to have a network that can learn things. We're going to configure it explicitly, which is a great way of figuring out what is going on.
27 |
28 | Neural networks are a lot about matrices and multiplying them. Confusingly, everyone in this world calls these matrices 'tensors', which is actually the wrong name for them. But I digress.
29 |
30 | So, the first thing we do is represent the images of digits found in the EMNIST database as matrices, which allows us to do math on them. In this way we can calculate the following three matrices (shown as images):
31 |
32 |
33 |
34 | 
35 | 
36 | 
37 | *The average 3, the average 7, the difference between these two*
38 |
39 |
40 |
41 | We can average all 3's and all 7's and get these fuzzy representations. The last picture is the most interesting one: it represents the "average 7 minus the average 3". The red pixels are high values, areas where there typically is more 'seven' than 'three'. The blue parts are low values, where there is typically more 'three' than 'seven'. Black pixels meanwhile are neutral, and confer no 'threeness' or 'sevenness'.
42 |
43 | One elementary neural network layer is the linear combination whereby we multiply the input (here, the image of a digit) by a matrix of 'weights'. These weights are the parameters that are usually evolved by training a network, but we're not going to do that yet.
44 |
45 | Instead, we're going to use the difference matrix shown above as the weights. Here is what that looks like for a typical 3:
46 |
47 |
48 |
49 |
50 |
51 |
*
52 |
53 |
=
54 |
55 |
56 |
57 |
58 |
59 |
60 |
61 | This represents a coefficient-wise product of two matrices (also known as a [Hadamard-Schur product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices))). Each pixel in the right-most image is the product of the pixel in the same place in the left and middle images. Because this is a typical 3, we see a lot of blue in the right. If we'd add up all the values of the pixels on the right, we'd end up with a negative number. This could then also be our decision rule: if the sum is negative, infer that this was an image of a 3.
62 |
63 | Conversely, this is what it looks like for a 7:
64 |
65 |
66 |
67 |
68 |
69 |
*
70 |
=
71 |
72 |
73 |
74 |
75 |
76 | Here we see a lot of red on the right, indicating a lot of higher values. The sum of all pixels is likely going to be a positive number, which means we can correctly infer this was a 7.
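
In code, this decision rule is nothing more than a coefficient-wise product followed by a sum. A tiny standalone Eigen sketch (shrunk to 3x3 'images', not the actual program):

```C++
#include <Eigen/Dense>
#include <iostream>

int main()
{
  // A toy 3x3 'image' with a vertical stroke on the right, and toy 'weights'
  // that are negative where threes live and positive where sevens live
  Eigen::Matrix3f image, weights;
  image   << 0, 0, 1,
             0, 0, 1,
             0, 0, 1;
  weights << -1, 0, 1,
             -1, 0, 1,
             -1, 0, 1;

  // coefficient-wise (Hadamard) product, then add up all entries
  float R = image.cwiseProduct(weights).sum();

  std::cout << "score R = " << R << ", verdict: " << (R > 0 ? 7 : 3) << "\n"; // R = 3 -> '7'
}
```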
77 |
78 | Now, this is all ridiculously naive, but we can give it a try using the [threeorseven.cc](https://github.com/berthubert/hello-dl/blob/main/threeorseven.cc) program:
79 | ```
80 | $ ./threeorseven
81 | Have 240000 training images and 40000 validation
82 | Three average result: -10.7929, seven average result: 0.785063
83 | 82.2125% correct
84 | ```
85 | That is quite something. So, this introduces another key aspect of machine learning: training and validation. The EMNIST set offers us 240,000 training images. To make sure that networks don't only memorise their training set, it is customary to validate models on a separate set of inputs. These are the 40,000 validation images. And true to form, the threeorseven.cc program calculates the averages only based on the training images, and then measures performance using the validation images.
86 |
87 | Now, 82.21% is nice, but something else stands out in the output of the program. We originally thought that a negative score (sum of pixels in the rightmost image) would represent a 3. We do see that the average three scores negatively (-10.79), but the average seven is only barely positive (0.785). Let's make a histogram of scores:
88 |
89 |
90 |
91 | 
92 |
93 |
94 |
95 |
96 | Clearly 0 is not the right number to compare our score against. Our histograms have a definite negative *bias*. Instead, we could use the middle between the average 3 score and the average 7 score:
97 |
98 | ```
99 | $ ./threeorseven
100 | Have 240000 training images and 40000 validation
101 | Three average result: -10.7929, seven average result: 0.785063
102 | Middle: -5.00393
103 | 97.025% correct
104 | ```
105 |
106 | That is pretty astounding. Using -5.00 as a decision rule we get 97.025% accuracy. This is approaching human level performance. Later we'll find out many reasons why we should not quite start celebrating yet though. But for now, this is quite impressive.
107 |
108 | In the above we have set the 'weights' (w) to the difference between average threes and sevens. We've also found a bias (b) that we need to apply. In formula form:
109 |
110 | {{< katex display >}} R =\sum{\mathit{image}\circ{}w} + b {{< /katex >}}
111 |
112 | Here {{< katex inline >}}\circ{{< /katex >}} stands for the coefficient-wise product, and {{< katex inline >}}\sum{{< /katex >}} means we add up all coefficients.
113 |
114 | If the result {{< katex inline >}}R{{< /katex >}} is positive, we infer that {{< katex inline >}}\mathit{image}{{< /katex >}} represents a 7.
115 |
116 | Note that for reasons which will become apparent later, neural network linear combinations mostly do not use these 'square' matrices and Hadamard products, but instead flattened versions. The central equation then becomes a regular matrix multiplication:
117 |
118 | {{< katex display >}} R =\mathit{image}\cdot{}w + b {{< /katex >}}
119 |
120 | ## Takeaway
121 | From the above, we can see that given very clean data, simple multiplication and additions are sufficient for a properly configured neural network to do pretty well on a simple task. The demo above is a completely standard neural network layer, with the only simplification that we configured it by hand instead of letting it learn. I'm simultaneously impressed by what this simple layer can do, but you might at this stage also be wondering "is that it??".
122 |
123 | In the next chapter we'll cover how training works. And you'll likely again be wondering how something so simple can be so effective.
124 |
125 | ## The code
126 | To start, clone the GitHub repository and download and unzip the EMNIST dataset:
127 | ```bash
128 | git clone https://github.com/berthubert/hello-dl.git
129 | cd hello-dl
130 | cmake .
131 | make -j4
132 | wget http://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip
133 | unzip gzip.zip
134 | ```
135 |
136 | We're going to look at [threeorseven.cc](https://github.com/berthubert/hello-dl/blob/main/threeorseven.cc).
137 |
138 | The first thing the code does is to read the EMNIST data (which is formatted in the MNIST standard):
139 |
140 | ```C++
141 | MNISTReader mn("gzip/emnist-digits-train-images-idx3-ubyte.gz", "gzip/emnist-digits-train-labels-idx1-ubyte.gz");
142 | MNISTReader mntest("gzip/emnist-digits-test-images-idx3-ubyte.gz", "gzip/emnist-digits-test-labels-idx1-ubyte.gz");
143 |
144 | cout << "Have "<}} R =\sum{\mathit{image}\circ{}w} {{}}) for every image. And using that, we calculate the average score for the threes and the sevens. And then we take the middle of those two scores.
216 |
217 | And finally, we're going to *validate* our model using the EMNIST set of test images in `mntest`:
218 |
219 | ```C++
220 | float bias = -middle;
221 | unsigned int corrects=0, wrongs=0;
222 |
223 | for(unsigned int n = 0 ; n < mntest.num(); ++n) {
224 |   int label = mntest.getLabel(n);
225 |   if(label != 3 && label != 7)
226 |     continue;
227 |
228 |   Tensor img(28,28);
229 |   mntest.pushImage(n, img);
230 |
231 |   float score = (img.dot(delta).sum()(0,0)) + bias; // the calculation
232 |   int predict = score > 0 ? 7 : 3; // the verdict
233 |
234 |   if(predict == label)
235 |     corrects++;
236 |   else {
237 |     saveTensor(img, "wrong-"+to_string(label)+"-"+to_string(wrongs)+".png", 252);
238 |     wrongs++;
239 |   }
240 | }
241 | cout<< 100.0*corrects/(corrects+wrongs) << "% correct" << endl;
242 | ```
243 |
244 | Note that on the first line we use the previously calculated `middle` to set the *bias* term. The rest of the code is straightforward. Also note that an image is generated for every incorrect result; studying those images will give you an impression of where the algorithm gets it wrong. Here's an example 7 that got classified as a 3:
245 |
246 |
247 |
248 | 
249 |
250 |
251 |
252 |
253 | And that's it! If you look at the full [threeorseven.cc](https://github.com/berthubert/hello-dl/blob/main/threeorseven.cc) you'll find that it contains some additional code to log data for the histogram we showed above, and for generating some sample images.
254 |
255 | In [the next chapter](../first-learning), we'll start doing some actual learning.
256 |
--------------------------------------------------------------------------------
/hyperparameters-inspection-adam/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Hyperparameters, inspection, parallelism, ADAM"
3 | date: 2023-03-30T12:00:06+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/hyperparameters-inpection-adam/index.md)!
7 |
8 | In [the previous chapter](../dl-convolutional) we successfully trained a network to recognize handwritten letters, but it took an awfully long time. This is not just inconvenient: networks that take too long to train mean we can experiment less. Some things really are out of reach if each iteration takes 24 hours, instead of 15 minutes. In addition we waste a lot of energy this way.
9 |
10 | To speed things up, [we can make our calculations go faster, or we can do less of them](https://berthub.eu/articles/posts/optimizing-optimizing-400-percent-speedup/). Deep learning employs both techniques.
11 |
12 | On this page we will go through how deep learning networks speed up their operations. In addition we'll be taking a rare look inside networks to see how parameters are actually evolving.
13 |
14 | # Parallelization
15 | One way of speeding up calculations is by making more of them happen at the same time. In deep learning, it is very typical to evaluate batches of 64 inputs at a time. Instead of shoveling a series of 64 matrices one by one through our calculations, we could also use a library that can take a stack of 64 matrices, call this a "64-high tensor", and do the calculation all at once for the whole stack. This would be especially convenient if we had 64 parallel processing units available of course.
16 |
17 | And it turns out that if you have the right kind of GPU, you do have such capacity.
18 |
19 | In addition, modern CPUs have [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) capabilities to perform certain calculations on 4, 8 or even 16 numbers at the same time. Also, most modern computers have multiple CPU cores available.
20 |
21 | Because of this, it is nearly standard in professional neural network environments to do almost everything with multidimensional tensors, as this offers your calculating backend as many opportunities as possible to use your CPU or GPU to perform many calculations in parallel. Some of the largest networks are even [distributed over multiple computers](https://pytorch.org/tutorials/beginner/dist_overview.html).
22 |
23 | The reason this works so well is that we typically evaluate a whole batch of inputs while keeping the neural network parameters constant. This means all calculations can happen in parallel - they don't need to change any common data. The only thing that needs to happen sequentially is to gather all the gradients and apply them to the network. And then a new batch can be processed in parallel again.
24 |
25 | # Being clever
26 | A mainstay of neural networks is matrix multiplication. On the surface this would appear to be an \\(O(N^3)\\) process, scaling with the number of rows and columns of both matrices (where the number of columns in the first matrix must equal the number of rows in the second). It turns out that through sufficient mathematical cleverness, [matrix multiplications can be performed a lot more efficiently](https://en.wikipedia.org/wiki/Computational_complexity_of_matrix_multiplication). In addition, if you truly understand what CPUs and caches are doing, you can speed things up even more.
27 |
28 | As an example, an earlier version of the software behind these blog posts performed naive matrix multiplication. I finally gave up and moved to a professional matrix library ([Eigen](https://en.wikipedia.org/wiki/Eigen_(C%2B%2B_library))) and this delivered a 320-fold speedup immediately. In short, unless you really know what you are doing, you have no business implementing matrix multiplication yourself.
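
As a minimal illustration of handing that work to a library (a standalone example of mine, not code from the hello-dl repository):

```C++
#include <Eigen/Dense>
#include <iostream>

int main()
{
  // With Eigen, a matrix product is one line - and it will run circles
  // around a hand-written triple loop
  Eigen::MatrixXf a = Eigen::MatrixXf::Random(128, 784);
  Eigen::MatrixXf b = Eigen::MatrixXf::Random(784, 64);
  Eigen::MatrixXf c = a * b;           // 128x64 result
  std::cout << c.rows() << "x" << c.cols() << "\n";
}
```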
29 |
30 | The software behind this series of posts benefits from SIMD vectorization because Eigen and compilers are able to make good use of parallel instructions. In addition, threads are used to just use more CPU cores.
31 |
32 | # Doing less
33 | Parallel and clever computations are nice since they allow us to do what we were doing already, but now faster. There are limits to this approach however - we can't keep on inventing faster math for example.
34 |
35 | While training our networks, we are effectively trying out sets of parameters & calculating how the outcome would change if we adjusted our parameters, based on their derivatives (through automatic differentiation).
36 |
37 | Above we described how we could perform such calculations faster, which means we can evaluate more sets of parameters per unit of time, which is nice.
38 |
39 | A key component can also be improved, however: making better use of those derivatives, so that each update improves our parameters more effectively.
40 |
41 | In previous chapters we've trained our networks by adjusting each parameter by (say) 0.2 times the derivative of the loss function with respect to the parameter. This is the simplest possible approach, but it is not the best one.
42 |
43 | Learning by equally sized increments could be compared to climbing a hill taking tiny equally sized steps, where you know that if all previous steps have been upwards, you could probably get there a lot faster if you took larger steps.
44 |
45 | Recall how we previously described 'hill climbing', where it worked pretty well:
46 |
47 |
48 |
49 | 
50 |
51 |
52 |
53 |
54 |
55 | However, on a more complex landscape, this regular gradient descent does not work so well:
56 |
57 |
58 |
59 | 
60 | *Nearly gets stuck around x=1.5, overshoots the goal*
61 |
62 |
63 |
64 | We can see that the algorithm nearly gets stuck on the horizontal part around {{< katex inline >}}x=1.5{{< /katex >}}. In addition, when it eventually gets near the goal, it ping-pongs around it and never settles on the minimum.
65 |
66 | A popular enhancement to this 'linear gradient descent' is to make it slightly more physical. For example, we could simulate a ball rolling down a hill, where the ball speeds up as it goes along, but also experiences friction:
67 |
68 |
69 |
70 | 
71 |
72 |
73 |
74 | This is called gradient descent with momentum, and it is pretty nice. Further enhancements are possible too, and most networks these days use [ADAM](https://machinelearningmastery.com/adam-optimization-from-scratch/), which not only implements momentum but also keeps smoothed (exponentially averaged) estimates of both the gradient and its square. In addition, it cleverly corrects for the zero initialization of those estimates, so the network "gets a running start". With judiciously picked parameters (\\(\alpha\\), \\(\beta_1\\) and \\(\beta_2\\)), [ADAM appears to be the best generic optimizer around](https://arxiv.org/abs/1412.6980).
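
As a sketch of what these update rules look like per parameter - the textbook ADAM formulas; the actual `learnAdam` in tensor-layers.hh may be organized differently:

```C++
#include <cmath>

// Plain SGD:          w -= lr * grad
// SGD with momentum:  v = mu*v + grad;  w -= lr * v
// ADAM keeps two smoothed quantities per parameter and uses them like this:
struct AdamState { float m = 0, v = 0; };

// Returns the adjustment to add to the parameter; t counts steps starting at 1
float adamStep(float grad, AdamState& a, int t,
               float alpha = 0.001f, float beta1 = 0.9f, float beta2 = 0.999f, float eps = 1e-8f)
{
  a.m = beta1 * a.m + (1 - beta1) * grad;          // smoothed gradient (1st moment)
  a.v = beta2 * a.v + (1 - beta2) * grad * grad;   // smoothed squared gradient (2nd moment)
  float mhat = a.m / (1 - powf(beta1, t));         // bias correction: the 'running start'
  float vhat = a.v / (1 - powf(beta2, t));
  return -alpha * mhat / (sqrtf(vhat) + eps);
}
```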
75 |
76 | As a case in point, recall how our pedestrian stochastic gradient descent took almost a whole day to learn how to read letters. Here is that same model on ADAM:
77 |
78 |
79 |
80 | 
81 |
82 |
83 |
84 | Within 1 CPU hour, [this code](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc) was recognizing 80% of letters correctly.
85 |
86 | In addition, by benefiting from four-fold parallelization (since I have 4 cores), this code becomes even faster:
87 |
88 |
93 |
94 | By achieving good performance after 15 wall clock minutes, we've increased our learning speed by over a factor of 80.
95 |
96 | These things are not just nice, they are complete game changers. Networks that otherwise take days to reach decent performance can do so in hours with the right optimizer. Also, the optimizer can actually achieve better results by not getting stuck in local minima. This brings us to a rather dark subject in deep learning.
97 |
98 | # Hyperparameters
99 | So far we've seen a number of parameters that had to be set: the learning rate, for which we suggested a value of 0.2. If we want to use momentum (the rolling ball method), we have to pick a *momentum* parameter. If we use ADAM, we need to pick \\(\alpha\\), \\(\beta_1\\) and \\(\beta_2\\) (although the default values are pretty good).
100 |
101 | In addition there is the batch size. If we set this too low, the network jumps around too much. Too high and everything takes too long.
102 |
103 | These important numbers are called *hyperparameters*, to distinguish them from the regular parameters that we are mutating within our neural network to make it learn things.
104 |
105 | If you visit the many many demos of how easy machine learning is, you'll mostly see the hyperparameters just appearing there, with no explanation how they were derived.
106 |
107 | I can authoritatively tell you that very often these numbers came from days long experimentation. If you pick the numbers wrong, nothing (good) might happen, or at least not at any reasonable speed. Many demos are not honest about this, and if you change any of their carefully chosen numbers, you might find that the network no longer converges.
108 |
109 | The *learning* part of machine learning is a lot harder than many demos make it out to be.
110 |
111 | The actual design of the neural network layers is also considered part of the hyperparameter set. So if a network sorta arbitrarily consists of three convolutional layers with N channels in and M channels out, plus three fully connected linear combinations of x by y, know that these numbers were often gleaned from an earlier implementation, or were selected only after tedious "parameter sweeping".
112 |
113 | So know that if you are ever building a novel network and it doesn't immediately perform like the many demos you saw, this is entirely normal and not your fault.
114 |
115 | # Inspection
116 | Neural networks tend to be pretty opaque, and this happens on two levels. From a theoretical standpoint, it is already hard to figure out how "a network does its thing". Much like in biology, it is not clear which neuron does what. We can sometimes "see" what is happening, as for example in our earliest 3-or-7 network. But it is hard work.
117 |
118 | On a second level, if we have a ton of parameters all being trained, it is in a practical sense not that easy to get the numbers out to figure out what is going on.
119 |
120 | For PyTorch, there are commercial platforms like [Weights & Biases](https://wandb.ai/) that can help create insight. But it turns out that with some simple measures we can also get a good look about what is going on.
121 |
122 | For logging, we use [SQLiteWriter](https://berthub.eu/articles/posts/big-data-storage/), a tiny but pretty powerful logger that works like this:
123 |
124 | ```C++
125 | SQLiteWriter sqw("convo-vals.sqlite3");
126 |
127 | ...
128 | sqw.addValue({
129 | {"startID", startID}, {"batchno", batchno},
130 | {"epoch", 1.0*batchno*batch.size()/mn.num()},
131 | {"time", time(0)}, {"elapsed", time(0) - g_starttime},
132 | {"cputime", (double)clock()/CLOCKS_PER_SEC},
133 | {"corperc", perc}, {"avgloss", totalLoss/batch.size()},
134 | {"batchsize", (int)batch.size()}, {"lr", lr*batch.size()},
135 | {"momentum", momentum}}, "training");
136 | ```
137 |
138 | This logs a modest amount of statistics to SQLite for every batch. The 'startID' is set when the program starts, which means that multiple runs of the software can log to the same SQLite database and we can distinguish what was logged by which invocation.
139 |
140 | The other numbers mostly describe themselves, with `corperc` denoting the percentage of correct digit determinations (in this case). `lr` and `momentum` are also logged since these might change from run to run. All values end up in a table called `training`, there is a similar table called `validation` which stores the same numbers, but then for the validation set.
141 |
142 | These numbers are nice to track our learning progress, but to really look inside we need to log a lot more. Recall how in the code samples so far we register the layers in our network:
143 |
144 | ```C++
145 | State()
146 | {
147 | this->d_members = {{&c1, "c1"}, {&c2, "c2"},
148 | {&c3, "c3"}, {&fc1, "fc1"},
149 | {&fc2, "fc2"}, {&fc3, "fc3"}};
150 | }
151 | ```
152 | Note that we also gave each layer a name. Our network does not itself need to know the names of layers, but it is great for logging. Each layer in the software knows how to log itself to the `SQLiteWriter` and we can make this happen like this:
153 |
154 | ```C++
155 | ConvoAlphabetModel m;
156 | ConvoAlphabetModel::State s;
157 | ..
158 | if(batchno < 32 || !(tries%32)) {
159 | s.emit(sqw, startID, batchno, batch.size());
160 | }
161 | ```
162 |
163 | This logs the full model to the `SQLiteWriter` for the first 32 batches, and from then on once every 32 batches. Since models might have millions of parameters, we do need to think this through a bit.
164 |
165 | Here is what comes out:
166 |
167 |
168 | 
169 | *Values of the kernel of the 20th filter of c2*
170 |
171 |
172 |
173 |
174 | The [code to create this](https://github.com/berthubert/hello-dl/blob/main/hello-dl.ipynb) is relatively simple. First we retrieve the data:
175 |
176 | ```Python
177 | import pandas
178 | import matplotlib.pyplot as plt
179 | from sqlalchemy import create_engine
180 |
181 | engine = create_engine("sqlite:////home/ahu/git/hello-dl/convo-vals.sqlite3")
182 | startIDs = pandas.read_sql_query("SELECT distinct(startID) as startID FROM data", engine)
183 | startID = startIDs.startID.max()
180 | ```
181 |
182 | And then select the data to plot:
183 | ```Python
184 | fig, ax1 = plt.subplots(figsize=(7,6))
185 |
186 | sel = pandas.read_sql_query(f"SELECT * FROM data where startID={startID} and name='c2' "
187 |                             "and idx=20 and subname='filter' order by batchno", engine)
188 | sel.set_index("batchno", inplace=True)
189 | for c in sel.col.unique():
190 |     for r in sel.row.unique():
191 |         v = sel[(sel.row==r) & (sel.col==c)]
192 |         ax1.plot(v.index, v.val - 1.0*v.val.mean(),
193 |                  label=v.name.unique()[0]+"["+str(v.idx.unique()[0])+"]("+str(r)+","+str(c)+")" )
194 | ax1.legend(loc=2)
195 | plt.title("Value of parameters of a convolutional filter kernel")
196 | plt.xlabel("batchno")
197 | plt.ylabel("value")
198 | ```
199 |
200 | The `data` table has fields called `batchno`, `startID`, `name`, `idx`, `row`, `col`, `value` and `grad` that fully identify an element, and also store its current value and the gradient being used for SGD or ADAM.
201 |
202 | # ADAM practicalities
203 | The ADAM optimizer does require some infrastructure. To make things work, our `Tensor` class now also has a struct storing the ADAM parameters:
204 |
205 | ```C++
206 | struct AdamVals
207 | {
208 | EigenMatrix m;
209 | EigenMatrix v;
210 | } d_adamval;
211 | ```
212 |
213 | These stand for ADAM's first and second moment estimates - the momentum and the (squared) velocity of "the ball", if you will.
214 |
215 | [Our code](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc#L259) has meanwhile grown an option parser so we can select an optimizer at will:
216 |
217 | ```C++
218 | if(program.get<bool>("--adam"))
219 |   s.learnAdam(1.0/batch.size(), batchno, program.get<double>("--alpha"));
220 | else
221 |   s.learn(lr, momentum);
222 | ```
223 |
224 | The mechanics of `learnAdam` can be found in [tensor-layers.hh](https://github.com/berthubert/hello-dl/blob/main/tensor-layers.hh#L27).
225 |
226 | # Parallelization
227 | As noted, we can evaluate a whole batch in parallel, since the network parameters stay constant during evaluation. We do however have to gather all the gradients from the individual evaluations and add them up.
228 |
229 | As is always the case, **speeding things up by parallelizing them does not make your code any more readable**. This is especially painful for an educational project like this one. I've tried hard to keep it as simple as possible. The 4- or 8-fold speedup you can achieve with this technique is important enough to warrant its use. There is a huge difference between 30 minutes of training and 4 hours.
230 |
231 | One of the simplest ways to make sure that things actually get faster with multiple threads is to use a '[shared nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture)', and this is what we do for our project.
232 |
233 | We launch a number of threads that each have a complete copy of the model we are training. These then process individual images/samples from a batch, and record the gradients.
234 |
235 | Once all threads are done, the gradients are gathered together, and then the `main()` thread copy of the model performs the learning. The new parameters are then broadcast to the thread copies again, and then the next batch is processed.
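
In heavily condensed form, with stand-in types instead of the real model class, the pattern looks roughly like this sketch (not the actual tensor-convo-par.cc code):

```C++
#include <thread>
#include <vector>

// Stand-in types so the sketch is self-contained; the real 'Model' is of
// course the whole neural network with all its layers
struct Sample { /* image plus label */ };

struct Model
{
  std::vector<float> params = std::vector<float>(1000, 0.0f);
  std::vector<float> grads  = std::vector<float>(1000, 0.0f);

  void forwardBackward(const Sample&) { /* evaluate the sample, add to grads */ }

  void learn(float lr)
  {
    for(size_t i = 0; i < params.size(); ++i) {
      params[i] -= lr * grads[i];
      grads[i] = 0;
    }
  }
};

// One batch: workers process disjoint parts of the batch in parallel on their
// own copy of the model, the gradients are then summed into the main model,
// which learns and broadcasts fresh parameters for the next round
void trainBatch(Model& mainModel, std::vector<Model>& workers, const std::vector<Sample>& batch)
{
  std::vector<std::thread> threads;
  for(size_t t = 0; t < workers.size(); ++t) {
    workers[t].params = mainModel.params;      // broadcast current parameters
    threads.emplace_back([&workers, &batch, t]() {
      for(size_t i = t; i < batch.size(); i += workers.size())
        workers[t].forwardBackward(batch[i]);  // shared nothing: touches only its own copy
    });
  }
  for(auto& th : threads)
    th.join();

  for(auto& w : workers) {                     // gather the gradients
    for(size_t i = 0; i < mainModel.grads.size(); ++i)
      mainModel.grads[i] += w.grads[i];
    w.grads.assign(w.grads.size(), 0.0f);
  }
  mainModel.learn(0.01f);                      // one learning step with all gradients
}
```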
236 |
237 | Sadly, despite my best efforts, the code in [tensor-convo-par.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-convo-par.cc) has a hundred lines of thread handling to make this all possible.
238 |
239 | # Next up
240 | We started our quest for robust character recognition, but found that it was learning only very slowly. In this chapter we looked into various optimizers and found that ADAM converged 20 times faster. [In the next chapter](../dropout-data-augmentation-weight-decay), we are going to check if our network is actually robust, and what we can do to make it so.
241 |
--------------------------------------------------------------------------------
/autograd/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Automatic differentiation, autograd"
3 | date: 2023-03-30T12:00:03+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/autograd/index.md)!
7 |
8 | In the previous chapter we configured a neural network and made it learn to distinguish between the digits 3 and 7. The learning turned out to consist of "twisting the knobs in the right direction". Although simplistic, the results were pretty impressive. But, you might still be a bit underwhelmed - the network only distinguished between two digits.
9 |
10 | To whelm it up somewhat, in this chapter we'll introduce a 5-layer network that can learn to recognize all 10 handwritten digits with near perfect accuracy. But before we can make it learn, we need to move slightly beyond the "just twist the parameters in the right direction" algorithm.
11 |
12 | The first part of this chapter covers the theory, and shows no code. The second part explains the code that makes it all happen. You can skip or skim the second part if you want to focus on the ideas.
13 |
14 | ## The basics
15 | Our previous network consisted of one layer, a linear combination of input pixels. Here is a **preview** of the layers that achieve 98% accuracy recognizing handwritten digits:
16 |
17 | 1. Flatten 28x28 image to a 784x1 matrix
18 | 2. Multiply this matrix by a 128x784 matrix
19 | 3. Replace all negative elements of the resulting matrix by 0
20 | 4. Multiply the resulting matrix by a 64x128 matrix
21 | 5. Replace all negative elements of the resulting matrix by 0
22 | 6. Multiply the resulting matrix by a 10x64 matrix
23 | 7. Pick the highest row of the resulting 10x1 matrix, this is the digit the network thinks it saw
24 |
25 | This model involves three matrices of parameters, with in total 128\*784 + 64\*128 + 10\*64 = 109184 *weights*. There are also 128+64+10 = 202 *bias* parameters.
26 |
27 | We'll dive into this network in detail later, but for now, ponder how we'd train this thing. If the output of this model is not right, by how much should we adjust each parameter? For the one-layer model from the previous chapter this was trivial - the connection between input image intensity and a weight was clear. But here?
28 |
29 | ## Turning the knobs, or, gradient descent
30 | In our previous model, we took the formula:
31 |
32 | {{< katex display >}}R=p_1w_1 + p_2w_2 + \cdots + p_{783}w_{783} + p_{784}w_{784}{{< /katex >}}
33 |
34 | And we then performed 'learning' by increasing the {{< katex inline >}}w_n{{< /katex >}} parameters by 0.1 of their associated {{< katex inline >}}p_n{{< /katex >}}. Effectively, we took the *derivative* of the error ({{< katex inline >}}\pm R{{< /katex >}}) with respect to {{< katex inline >}}w_n{{< /katex >}}, multiplied it by 0.1, and added it to {{< katex inline >}}w_n{{< /katex >}}.
35 |
36 | This is what is called 'gradient descent', and it looks like this:
37 |
38 |
39 |
40 | 
41 | *Actually hill descending in this case*
42 |
43 |
44 |
45 |
46 | This is a one-dimensional example, and it is very successful: it quickly found the minimum of the function. Such hill climbing has a tendency of getting stuck in local optima, but in neural networks this apparently is far less of a problem. This may be because we aren't optimizing over 1 axis, we are actually optimizing over 109184 parameters (in the digit reading network described above). It probably takes quite a lot of work to create a 109184-dimensional local minimum.
47 |
48 | So, to learn this way, we need to perform all the calculations in the neural network, look at the outcome, and see if it needs to go up or down. Then we need to find the derivative of the outcome versus all parameters. And then we move all parameters by 0.1 of that derivative (the 'learning rate').
49 |
50 | This really is all there is to it, but we are now left with the problem how to determine all these derivatives. Luckily this is a well solved problem, and the solution is quite magical. And it is good that this is so, because there are models with hundreds of billions of parameters. Those derivatives should be simple and cheap.
51 |
52 | # Automatic differentiation
53 | So, unlike integration, differentiation is actually very straightforward. And it turns out that with relatively little trouble you can get a computer to do it for you. If we for example have:
54 |
55 | {{< katex display >}}y = 2x^3 + 4x^2 + 3x + 2 {{< /katex >}}
56 |
57 | It is trivial (even for a computer) to turn this into:
58 |
59 | {{< katex display >}}\frac{dy}{dx} = 6x^2 + 8x + 3 {{< /katex >}}
60 |
61 | And even if we make life more complex, the rules remain simple:
62 |
63 | {{< katex display >}}y = \sin{(2x^3 + 4x^2 + 3x + 2)} {{< /katex >}}
64 | {{< katex display >}}\frac{dy}{dx} = (6x^2+8x+3) \cos{(2x^3 + 4x^2 + 3x + 2)}{{< /katex >}}
65 |
66 | This is the '[chain rule](https://en.wikipedia.org/wiki/Chain_rule)', which says that the derivative of a compound function is the derivative of the outer function (evaluated at the inner function) multiplied by the derivative of the inner function.
67 |
68 | I don't want to flood you with too much math, but [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) is at the absolute core of neural networks, so it pays to understand what is going on.
69 |
70 | # "Autograd"
71 | Every neural network system (PyTorch, TensorFlow, [Flashlight](https://github.com/flashlight/flashlight)) implements an autogradient system that performs automatic differentiation. Such systems can be implemented easily in any programming language that supports operator overloading and reference counted objects. And in fact, the implementation is so easy that you sometimes barely see it. A great example of this is [Andrej Karpathy](https://twitter.com/karpathy)'s [micrograd](https://github.com/karpathy/micrograd) autogradient implementation, [which is a tiny work of art](https://github.com/karpathy/micrograd/blob/master/micrograd/engine.py).
72 |
73 | First, let's look at what such a system can do:
74 |
75 | ```C++
76 | Tensor x(2.0f);
77 | Tensor z(0.0f);
78 | Tensor y = Tensor(3.0f)*x*x*x + Tensor(4.0f)*x + Tensor(1.0f) + x*z;
79 | ```
80 |
81 | This configures `y` to be {{< katex inline >}}3x^3 + 4x + 1 +xz{{< /katex >}}. The notation is somewhat clunky - it is possible to make a library that automatically converts naked numbers into `Tensor`s, but such a library might also surprise you one day by doing so when you don't expect it.
82 |
83 | Next up, let's do something:
84 |
85 | ```C++
86 | cout << "y = "<< y << endl; // 3*8 + 4*2 + 1 = 33
87 |
88 | y.backward();
89 |
90 | cout << "dy/dx = " << x.getGrad() << endl; // 9*x^2 + 4 = 40
91 | cout << "dy/dz = " << z.getGrad() << endl; // 2
92 | ```
93 |
94 | This prints out the expected outputs, which is nice. The first line perhaps appears to only print out the value of `y`, but as is customary in these systems, the calculation only happens once you try to get the value. In other words, this is [lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation). This can sometimes confuse you when you set up a huge calculation that appears to happen in 'no time' - the actual calculation simply hasn't happened yet.
95 |
96 | The last line of the initial snippet of code (`Tensor y =`...) actually created a little computer program that will create the right output once run. This little computer program takes the shape of a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph):
97 |
98 |
115 |
116 | Here it is obvious that {{< katex inline >}}dy/dx=z=0{{< /katex >}}. Meanwhile, {{< katex inline >}}dy/dz=x=2{{< /katex >}}. So, looking at the directed acyclic graph (DAG), if we want to calculate the gradient or differential, each node gets the value of the opposite node:
117 |
118 |
119 |
120 | 
121 | *Red lines denote 'sending the gradient'. The right node received the value of the left node as its gradient, and vice versa*
122 |
123 |
124 |
125 |
126 | For a slightly more complicated example:
127 |
134 |
135 | Here we see that the gradients 'drop down' the tree and add up to the correct values.
136 | {{< katex inline >}}dy/dx =1{{< /katex >}}, because {{< katex inline >}}z+a=1{{< /katex >}}. Meanwhile, both
137 | {{< katex inline >}}dy/da{{< /katex >}} and {{< katex inline >}}dy/dz{{< /katex >}} are 2, because {{< katex inline >}}x=2{{< /katex >}}.
138 |
139 | Now for our full calculation:
140 |
147 |
148 |
183 | And this indeed arrives at the right numbers. To perform the actual calculation, we visit each node *once*, starting at the top, and *push down* the accumulated gradient to the child nodes.
184 |
185 | Now, in a demonstration of why a computer science education is useful (I sadly missed out), it turns out that doing such a traversal is a well solved problem. Using an elegant algorithm, a directed acyclic graph can be [sorted topologically](https://en.wikipedia.org/wiki/Topological_sorting). And in this order, we can visit each node once and in the right order to propagate the accumulated gradients downward.
186 |
187 | The elegant algorithm is so elegant you might miss it in the code. It goes like this:
188 |
189 | 1. Start at the top node
190 | 2. If a node has been visited already, return. Otherwise, add node to *visited* set
191 | 3. Visit all child nodes (ie, apply steps 2-4 to each of them)
192 | 4. Add ourselves at the end of the topological list of nodes
193 |
194 | In this way, we can see that the leaf nodes are the first to be added. The top node only gets added last, because adding to the topological list only happens once all child nodes are done. Meanwhile the *visited* set makes sure we do the process just once per node.
195 |
196 | To distribute the gradients for automatic differentiation, the topological list is processed in reverse order, which means that we start at the top.
197 |
198 | Automatic differentiation can be used for many other things, and need not stop at first derivatives. [ADOL-C](https://github.com/coin-or/ADOL-C) is an interesting library in this respect.
199 |
200 | Any time you ask ChatGPT a question, know there is a DAG containing 175 billion parameters that is processing your every word, and that it got taught what it can do by the exact autogradient process described on this page.
201 |
202 | # The code
203 | The key concept is that by typing in formulas, we get our computer to build the DAG for us - doing this by hand would be undoable. Any language that features operator overloading enables us to make this happen rather easily. This is a great example of "letting the ball do the work". By defining addition and multiplication operators that don't actually perform those calculations, but instead populate a DAG that eventually will, we get a ton of functionality for free.
204 |
205 | We need a bit more than operator overloading though. We also need objects that stay alive, either by being reference counted, or by surviving garbage collection.
206 |
207 | As an example:
208 |
209 | ```C++
210 | Tensor x(2.0f);
211 | Tensor z(0.0f);
212 | Tensor y = Tensor(3.0f)*x*x*x + Tensor(4.0f)*x + Tensor(1.0f) + x*z;
213 | ```
214 | The values 3.0, 4.0 and 1.0 are all temporaries. These instances vanish from existence by the time the final line is done executing. Yet, they must still find a place in the DAG.
215 |
216 | For this reason, a language like C++ needs to create reference counted copies. Python and other pass-by-reference languages with garbage collection may get this for free.
217 |
218 | The `Tensor` class in this series of blog posts works like this:
219 |
220 | ```C++
221 | template<typename T=float>
222 | struct Tensor
223 | {
224 |   typedef Tensor<T> us_t;
225 |   Tensor() : d_imp(std::make_shared<TensorImp<T>>())
226 |   {}
227 |
228 |   Tensor(unsigned int rows, unsigned int cols) : d_imp(std::make_shared<TensorImp<T>>(rows, cols))
229 |   {}
230 |
231 |   // ...
232 |   std::shared_ptr<TensorImp<T>> d_imp;
233 | };
234 |
235 | ```
236 | There are many other methods, but this is the key - there is an actual reference counted `TensorImp` behind this. The class is templatized, defaulting to float. Amazingly enough, machine learning has such an effect on hardware that it is triggering innovations like 16 bit floats!
237 |
238 | To actually do anything with these `Tensor`s, there are overloaded operators:
239 |
240 | ```C++
241 | template<typename T>
242 | inline Tensor<T> operator+(const Tensor<T>& lhs, const Tensor<T>& rhs)
243 | {
244 |   Tensor<T> ret;
245 |   ret.d_imp = std::make_shared<TensorImp<T>>(lhs.d_imp, rhs.d_imp, TMode::Addition);
246 |   return ret;
247 | }
248 | ```
249 | With this, you can do `Tensor z = x + w`, and `z` will end up containing a `TensorImp` containing reference counted references to `x` and `w`.
250 |
251 | Which looks like this:
252 |
253 | ```C++
254 | template<typename T>
255 | struct TensorImp
256 | {
257 |   typedef TensorImp<T> us_t;
258 |
259 |   //! Create a new parameter (value) tensor. Inits everything to zero.
260 |   TensorImp(unsigned int rows, unsigned int cols) : d_mode(TMode::Parameter)
261 |   {
262 |     d_val = Eigen::MatrixX<T>(rows, cols);
263 |     d_grads = Eigen::MatrixX<T>(rows, cols);
264 |     d_grads.setZero();
265 |     d_val.setZero();
266 |     d_haveval = true;
267 |   }
268 |
269 |   TensorImp(std::shared_ptr<us_t> lhs, std::shared_ptr<us_t> rhs, TMode m) :
270 |     d_lhs(lhs), d_rhs(rhs), d_mode(m)
271 |   {
272 |   }
273 |   ...
274 |   std::shared_ptr<us_t> d_lhs, d_rhs;
275 |   TMode d_mode;
276 | };
277 | ```
278 | Here we see a few notable things. For one, we see Eigen crop up. Eigen is a matrix library used by many machine learning projects (including TensorFlow and PyTorch). You might initially think you could do your own matrix library, but this is not the case. The Eigen matrix multiplications for example are over 300 times faster than my hand rolled previous attempts.
279 |
280 | We also see `d_lhs` and `d_rhs`; these are the embedded references to the operands of binary operators like '+', '-', '\*' etc. It is these references that allow us to build a directed acyclic graph that contains the instructions for how to calculate the outcome of the calculation.
281 |
282 | Here's an abbreviated version of how that works:
283 | ```C++
284 | void assureValue(const TensorImp* caller=0) const
285 | {
286 |   if(d_haveval || d_mode == TMode::Parameter)
287 |     return;
288 |
289 |   if(d_mode == TMode::Addition) {
290 |     d_lhs->assureValue(this);
291 |     d_rhs->assureValue(this);
292 |     d_val.noalias() = d_lhs->d_val + d_rhs->d_val;
293 |   }
294 |   else if(d_mode == TMode::Mult) {
295 |     d_lhs->assureValue(this);
296 |     d_rhs->assureValue(this);
297 |     d_val.noalias() = d_lhs->d_val * d_rhs->d_val;
298 |   }
299 |   ...
300 | }
301 | ```
302 | Nodes can contain a value that was calculated earlier, in which case `d_haveval` is set. And if needed, `assureValue` is called in turn on child nodes.
303 |
304 | 'Calculating the outcome' is what is called the 'forward pass' in neural networks. The automatic differentiation meanwhile is calculated in the opposite direction. Here is where we get all the nodes in topological (reverse) order:
305 |
306 | ```C++
307 | void build_topo(std::unordered_set<us_t*>& visited, std::vector<us_t*>& topo)
308 | {
309 |   if(visited.count(this))
310 |     return;
311 |   visited.insert(this);
312 |
313 |   if(d_lhs) {
314 |     d_lhs->build_topo(visited, topo);
315 |   }
316 |   if(d_rhs) {
317 |     d_rhs->build_topo(visited, topo);
318 |   }
319 |   topo.push_back(this);
320 | }
321 | ```
322 |
323 | As noted above, you could easily miss the magic behind this.
324 |
325 | Once we have this topological ordering, distributing the gradients downwards is simple:
326 |
327 | ```C++
328 | d_imp->d_grads.setConstant(1.0);
329 | for(auto iter = topo.rbegin(); iter != topo.rend(); ++iter) {
330 | (*iter)->doGrad();
331 | }
332 | ```
333 |
334 | The first line is important: the gradient of the top node is 1 (by definition, {{< katex inline >}}dy/dy=1{{< /katex >}}). Every other node starts at 0, and is set through the automatic differentiation.
335 | Note the `rbegin()` and `rend()`, which mean we traverse the topological ordering in reverse.
336 |
337 | The abbreviated `doGrad()` meanwhile looks like this:
338 |
339 | ```C++
340 | void doGrad()
341 | {
342 |   if(d_mode == TMode::Parameter) {
343 |     return;
344 |   }
345 |   else if(d_mode == TMode::Addition) {
346 |     d_lhs->d_grads += d_grads;
347 |     d_rhs->d_grads += d_grads;
348 |   }
349 |   else if(d_mode == TMode::Mult) {
350 |     d_lhs->d_grads.noalias() += (d_grads * d_rhs->d_val.transpose());
351 |     d_rhs->d_grads.noalias() += (d_lhs->d_val.transpose() * d_grads);
352 |   }
353 |   ...
354 | }
355 | ```
355 |
356 | If a node is just a number (`TMode::Parameter`) it has no gradient to distribute further. If a node represents an addition, the gradient gets passed on verbatim to both the left hand and right hand sides of the + operator.
357 |
358 | For the multiplication case, we see that the left hand side indeed gets a gradient delivered that is proportional to the right hand side, and vice-versa. The delivered gradient is also proportional to the gradient that has already been passed down to this node.
359 |
360 | The calls to `.transpose()` meanwhile reflect that our Tensor class is actually a matrix. So far we've been multiplying only 1x1 Tensors, which act just like numbers. In reality this class is used to multiply pretty large matrices.
361 |
362 | Rounding it off - automatic differentiation is absolutely key to neural networks. That we can assemble networks of many, many layers, each consisting of huge matrices, using a straightforward syntax makes it possible to innovate rapidly. We are lucky that modern languages make it possible to both assemble these networks easily AND perform automatic differentiation.
363 |
364 | [In the next chapter](../handwritten-digits-sgd-batches/) we'll be making our multi-layer network do some actual work in learning to recognize 10 different digits. There we'll also be introducing key concepts in machine learning like loss function, batches and the enigmatic 'softlogmax' layer.
365 |
366 |
367 |
--------------------------------------------------------------------------------
/dl-convolutional/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Convolutional networks"
3 | date: 2023-03-30T12:00:05+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/dl-convolutional/index.md)!
7 |
8 | In the [previous chapter](../autograd) we taught a network of linear combinations and 'rectified linear units' to recognize handwritten digits reasonably successfully. But we already noted that the network would be sensitive to the exact location of pixels, and that it does not in any meaningful way "know" what a 7 looks like.
9 |
10 | In this chapter we're going to explore convolutional layers that can scan for shapes and make decisions based on their relative positions. And, we'll go over the design of a convolutional neural network that is quite successful at reading not just handwritten digits, but handwritten letters too.
11 |
12 | # Neural network disappointment
13 | A recurring theme in machine learning is whether the network is 'overfitting': not really learning things as we'd hope, but instead simply memorising stuff. 'Rote learning' if you will. This is in fact a constant battle, and many of the very public machine learning failures, as for example [during the COVID-19 pandemic](https://www.technologyreview.com/2021/07/30/1030329/machine-learning-ai-failed-covid-hospital-diagnosis-pandemic/), are due to networks latching on to the wrong things, or not having generalized their knowledge as broadly as we'd been assuming.
14 |
15 | If you take away one thing from this series of posts, please let it be that production use of a neural network tends to go through these four phases (if you are lucky):
16 |
17 | 1. It works on the training data
18 | 2. It also works on the validation data
19 | 3. After a lot of disappointment, we get it to work on other people's real life data too
20 | 4. Other people can get it to work on their own data as well
21 |
22 | Almost all demos declare victory after phase 2. This tutorial aims to achieve the final phase.
23 |
24 | To prove this point, here is a graph showing the validation success of our previous network with only slightly modified inputs:
25 |
26 |
27 |
28 | 
29 |
30 |
31 |
32 |
33 | Here the input was shifted around by 2 pixels, and 5 random pixels were flipped. No human would be fazed in the least by these changes, but our network's performance drops to around 50%, which is pretty disappointing.
34 |
35 | Clearly we need better things than just multiplying whole images and matrices. These things turn out to be 'convolutional' operations, 'max-pooling' and 'gelu'.
36 |
37 | # Convolutional networks
38 | Also known as [CNN, or ConvNet](https://en.wikipedia.org/wiki/Convolutional_neural_network), these may have been the first neural networks that saw bona fide production use. This video by [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) on [YouTube from 1989](https://www.youtube.com/watch?v=FwFduRA_L6Q) is absolutely worth your while ([associated paper](http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf)), especially since we're going to build a network here that is a lot like the one demonstrated there.
39 |
40 | Our previous work took a whole image as input to the network, where the position of pixels really mattered. A convolution is a matrix operation that *slides* over its input. In this way it can scan for features. What it slides over its input is a set of matrices called kernels, typically quite small. Each kernel is multiplied per element over the part of the input it lies on. The output is the sum of all these multiplications:
41 |
42 | {{< katex display >}}
43 | O_{r,c} = \sum_{i=0}^{K-1}\sum_{j=0}^{K-1} k_{i,j}\,I_{r+i,\,c+j}
44 | {{< /katex >}}
47 |
48 |
49 |
50 | ```goat
51 | input layer kernel output layer
52 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
53 | |1 |2 |3 |4 |5 |6 |7 |8 | |1 |2 |3 | |A |..|..|..|..|..|
54 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
55 | |9 |10|11|12|13|14|15|16| |4 |5 |6 | |..|B |..|..|..|..|
56 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
57 | |17|18|19|20|21|22|23|24| |7 |8 |9 | |..|..|..|..|..|C |
58 | +--+--+--+--+--+--+--+--+ +--+--+--+ +--+--+--+--+--+--+
59 | |25|26|27|28|29|30|31|32| |..|..|..|..|..|..|
60 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+
61 | |33|34|35|36|37|38|39|40|
62 | +--+--+--+--+--+--+--+--+
63 | |41|42|43|44|45|46|47|48|
64 | +--+--+--+--+--+--+--+--+
65 | ```
66 |
67 | Here are three sample positions A, B and C in the output layer:
68 | ```
69 | A = 1*1 + 2*2 + 3*3 + 9*4 + 10*5 + 11*6 + 17*7 + 18*8 + 19*9
70 | B = 10*1 + 11*2 + 12*3 + 18*4 + 19*5 + 20*6 + 26*7 + 27*8 + 28*9
71 | C = 22*1 + 23*2 + 24*3 + 30*4 + 31*5 + 32*6 + 38*7 + 39*8 + 40*9
72 | ```
73 |
74 | Note that the output differs in dimensions from the input. If the input had R rows and a K by K kernel is used, the output will have 1+R-K rows, and similar for columns (1+C-K). The output dimensions will always be smaller. The values in the output represent the presence of features matched by the filter kernels.
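
As a sketch in code, the sliding operation described above can be written naively like this (my own illustration; the real convolutional layer in hello-dl is built from the tensor machinery of the previous chapters):

```C++
#include <vector>

// Input is R x C, the kernel K x K, the output (1+R-K) x (1+C-K)
std::vector<float> convolve(const std::vector<float>& in, int R, int C,
                            const std::vector<float>& kernel, int K)
{
  int outR = 1 + R - K, outC = 1 + C - K;
  std::vector<float> out(outR * outC, 0.0f);
  for(int r = 0; r < outR; ++r)
    for(int c = 0; c < outC; ++c) {
      float sum = 0;
      for(int i = 0; i < K; ++i)
        for(int j = 0; j < K; ++j)
          sum += in[(r + i) * C + (c + j)] * kernel[i * K + j];
      out[r * outC + c] = sum;   // 'A', 'B' and 'C' from the example are computed like this
    }
  return out;
}
```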
75 |
76 | Typically, many kernels are used, leading to a single input layer creating many output layers. Every kernel is associated with a single output layer. Conceptually this can be seen as a convolutional layer scanning for many different kinds of features, all at the same time.
77 |
78 |
79 |
80 | A convolutional network can also accept multiple input layers at the same time. In this case, each kernel for an output layer slides over every input channel, and that output is the sum of the sums of the kernel sliding over all input channels. This means the number of operations is proportional to the product of the number of output layers and the number of input layers. Quite soon we are talking billions of operations. The number of filter parameters scales with the product of the number of input and output layers, and of course the kernel size.
81 |
82 | Convolutional networks do not use a lot of parameters (since kernels tend to be small), but they simply *burn* through CPU cycles. Because they do not access a lot of memory, parallel processing can speed things up tremendously though.
83 |
84 | Chapter 7 from "Dive into Deep Learning" [has a good and more expansive explanation](https://d2l.ai/chapter_convolutional-neural-networks/index.html), and it may do a better job than this page.
85 |
86 | # Max-pooling
87 | We use convolutional layers to detect features, but we don't care that much about the exact position of a feature. In fact we may often not even want to know - the network might start to depend on it. Because of this, the output of a convolutional layer is often fed through a 'max-pool'.
88 |
89 | This is a simple operation that slides over an input matrix, but has no kernel parameters. But it does have a size, often 2 by 2. The output is the maximum value within that window.
90 |
91 | Unlike a convolutional layer, max-pooling uses non-overlapping windows. So if a 2x2 window is used, the output channels have half the number of rows and columns compared to the input channels.
92 |
93 | The essence of this is that if a feature is detected anywhere within a 2x2 window, it generates the same output independent of its position on any of the four pixels. Also, the number of outputs is divided by 4, which is useful for limiting the size of the network.
94 |
95 | > Note: Pools can of course have other sizes. Also, when two-dimensional pools are used, you'll often see them described as 'max2d'.
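To make the mechanics concrete, here is a minimal stand-alone sketch of a 2x2 max-pool in Eigen terms. This is not the `Max2dfw` implementation from hello-dl, just the idea:

```C++
#include <Eigen/Dense>

// Minimal max-pool sketch: the window size doubles as the stride, so the
// output has in.rows()/size rows and in.cols()/size columns, and every output
// element is the maximum of its (non-overlapping) window.
Eigen::MatrixXf max2d(const Eigen::MatrixXf& in, int size = 2)
{
  Eigen::MatrixXf out(in.rows() / size, in.cols() / size);
  for(int r = 0; r < out.rows(); ++r)
    for(int c = 0; c < out.cols(); ++c)
      out(r, c) = in.block(r * size, c * size, size, size).maxCoeff();
  return out;
}
```

A real implementation also has to decide what to do when the input dimensions are odd; the comments in the code further down suggest hello-dl pads, turning an 11x11 input into a 6x6 output rather than a 5x5 one.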
96 |
97 | # GELU
98 | Recall how we used the 'Rectified linear unit' (RELU) to replace all negative values by zero (leaving the rest alone). This introduces a non-linearity between matrix operations, which in turn means the network becomes something more than a simple linear combination of elements.
99 |
100 | There has been a lot of experimentation with activation functions, and it has been noted that RELU throws away a lot of information for negative values. It appears that using a different activation function can help a lot. Popular these days is GELU, the Gaussian Error Linear Unit, which is neither linear nor an error. And not even that Gaussian, actually.
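For reference, GELU multiplies its input by {{}}\Phi(x){{}}, the cumulative distribution function of the standard normal distribution, and in practice a tanh-based approximation is commonly used:

{{}}
\text{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right)\right]\right)
{{}}

For large positive inputs this behaves just like RELU, but small negative inputs produce small negative outputs instead of being clipped to zero, which is where the extra information comes from.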
101 |
102 |
103 |
104 | 
105 |
106 |
107 |
108 |
109 | More details can be found in [Gaussian Error Linear Units (GELUs)](https://arxiv.org/abs/1606.08415), and there is some more depth in [Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks](https://ojs.aaai.org/index.php/AAAI/article/view/17197/17004).
111 |
112 | After having read the literature, I'm afraid I'm left [with the impression that GELU tends to do better](https://arxiv.org/pdf/2002.05202.pdf), but that we're not that sure why. For what it's worth, the code we're developing here confirms this impression.
113 |
114 | # The whole design
115 | Here is the complete design of our neural network that we're going to use to recognize handwritten (print) letters:
116 |
117 | 1. The input is again a 28x28 image, not flattened
118 | 2. A 3x3 kernel convolutional layer with 1 input layer and 32 output layers
119 | 3. Max2d layer, 2x2
120 | 4. GELU activation
121 | 5. A 3x3 kernel convolutional layer, 13x13 input dimensions, 32 input layers, 64 output layers
122 | 6. Max2d layer, 2x2
123 | 7. GELU activation
124 | 8. A 3x3 kernel convolutional layer, 6x6 input dimensions, 64 input layers, 128 output layers
125 | 9. Max2d layer, 2x2
126 | 10. GELU activation
127 | 11. Flatten all these 128 2x2 layers to a 512x1 matrix
128 | 12. First linear combination (512 to 64)
129 | 13. GELU activation
130 | 14. Second linear combination (64 to 128)
131 | 15. GELU activation
132 | 16. Third linear combination, down to 26x1
133 | 17. LogSoftMax
134 |
135 | This looks like a lot, but if you look carefully, steps 2/3/4, 5/6/7, 8/9/10 are three times the same thing.
136 |
137 | Expressed as code it may even be easier to follow:
138 | ```C++
139 | using ActFunc = GeluFunc;
140 |
141 | auto step1 = s.c1.forward(img); // -> 26x26, 32 layers
142 | auto step2 = Max2dfw(step1, 2); // -> 13x13
143 | auto step3 = s.c2.forward(step2); // -> 11x11, 64 layers
144 | auto step4 = Max2dfw(step3, 2); // -> 6x6 (padding)
145 | auto step5 = s.c3.forward(step4); // -> 4x4, 128 layers
146 | auto step6 = Max2dfw(step5, 2); // -> 2x2
147 | auto flat = makeFlatten(step6); // -> 512x1
148 | auto output = s.fc1.forward(flat); // -> 64
149 |   auto output2 = makeFunction<ActFunc>(output);
150 |   auto output3 = makeFunction<ActFunc>(s.fc2.forward(output2)); // -> 128
151 |   auto output4 = makeFunction<ActFunc>(s.fc3.forward(output3)); // -> 26
152 | scores = makeLogSoftMax(output4);
153 | modelloss = -(expected*scores).sum();
154 | ```
155 |
156 | It is somewhat astounding that these few lines will learn to read handwritten characters.
157 |
158 | Visually:
159 |
160 | ```goat
161 | input layer
162 | +--+--+--+--+--+--+--+--+
163 | | 1| | | | | | |28| 32 x
164 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+ 64 x
165 | | | | | | | | | | | 1| | | | |13| +--+--+--+--+ 128 x
166 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+ | 1| | | 6| +--+--+
167 | | | | | | | | | | | | | | | | | +--+--+--+--+ | 1| 2|
168 | +--+--+--+--+--+--+--+--+ -> +--+--+--+--+--+--+ -> | | | | | -> +--+--+
169 | | | | | | | | | | | | | | | | | +--+--+--+--+ | 2| 2|
170 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+ | 6| | | 6| +--+--+
171 | | | | | | | | | | |13| | | | |13| +--+--+--+--+
172 | +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+
173 | |28| | | | | | |28|
174 | +--+--+--+--+--+--+--+--+
175 | ```
176 |
177 | These are the three convolutions and "Max2d" combinations. We end up with 128 layers of four values each. These are flattened into a 512x1 matrix,
178 | which then undergoes further multiplications:
179 |
180 | ```goat
181 | +---+--+--+--+
182 | | 1| | |64|
183 | +---+--+--+--+
184 | | | | | |
185 | +---+--+--+--+ +---+--+--+--+--+--+---+
186 | | | | | | | 1| | | | | |128|
187 | +---+--+--+--+ +---+--+--+--+--+--+---+
188 | | | | | | | | | | | | | |
189 | +---+--+--+--+ +---+--+--+--+--+--+---+
190 | | | | | | | | | | | | | |
191 | +---+--+--+--+ +---+--+--+--+--+--+---+
192 | |512| | | | | 64| | | | | | |
193 | x +---+--+--+--+ x +---+--+--+--+--+--+---+
194 |
195 | +--+--+--+--+--+--+--+---+ +---+--+--+--+ +---+--+--+--+ +---+--+--+--+--+--+---+
196 | | 1| | | | | | |512| = | 1| | |64| -> GELU -> | 1| | |64| = | 1| | | | | |128|
197 | +--+--+--+--+--+--+--+---+ +---+--+--+--+ +---+--+--+--+ +---+--+--+--+--+--+---+
198 |
199 |
200 | +---+--+--+--+
201 | | 1| | |26|
202 | +---+--+--+--+
203 | | | | | |
204 | +---+--+--+--+
205 | | | | | |
206 | +---+--+--+--+
207 | | | | | |
208 | +---+--+--+--+
209 | | | | | |
210 | +---+--+--+--+
211 | |128| | | |
212 | +---+--+--+--+
213 |
214 | x
215 | +---+--+--+--+--+--+---+ +---+--+--+--+
216 | -> GELU -> | 1| | | | | |128| = | 1| | |26| -> SoftLogMax
217 | +---+--+--+--+--+--+---+ +---+--+--+--+
218 | ```
219 |
220 | And this last matrix, 26 wide, gives us the scores for each possible character.
221 |
222 | # So where did this design come from?
223 | I copied it from [here](https://data-flair.training/blogs/handwritten-character-recognition-neural-network/). As we'll discuss in [the next chapter](../hyperparameters-inspection-adam), neural network demos and tutorials tend to make their designs appear out of thin air. In reality, designing a good neural network is a lot of hard work. What you are seeing in a demo is the outcome of an (undisclosed) search over many possibilities. If you want to learn the ropes, it is best to first copy something that is known to work. And even then you'll often find that it doesn't work as well for you as it did in the demo.
224 |
225 | # Let's fire it up!
226 | Whereas previous test programs did their learning in seconds or minutes, teaching this network to recognize letters takes *ages*. As in, most of a day:
227 |
233 |
234 | So a few things to note - even after 24 hours of training, the network was only around 85% correct. If you look at the failures however, quite a lot of the input is in fact ambiguous. The difference between a handwritten *g* and a handwritten *q* is not that obvious without context. If we count the "second best guess" as almost correct, the network scores over 95% correct or almost correct, which is not too bad.
235 |
236 | Here is a sample of the input the network has to deal with:
237 |
238 |
239 | 
240 |
241 |
242 |
243 |
244 | And here is the confusion matrix, where you can see that besides *g* and *q*, distinguishing *i* and *l* is hard, as well as *h* and *n*:
245 |
251 |
252 | So why does the training take so long? These days computers are magically fast at multiplying large matrices, which is why our earlier model learned in minutes. This model however does all these convolutions, which boil down to sliding huge numbers of small kernels over the data, and that work simply has to be done. You can get only a little bit clever about it, and that cleverness does not deliver orders of magnitude of speedup. The only way to make the convolutional filters a lot faster is to get your hardware to do many of them at the same time. There are however other techniques that make the whole process converge faster, so that fewer calculations need to be done. [We'll cover these in the next chapter](../hyperparameters-inspection-adam).
253 |
254 | Another interesting thing to note in the graph is that after six hours or so, the network suddenly starts to perform worse, and then it starts improving again. I'd love to tell you why that happens, but I simply don't know. It is hard to imagine stochastic gradient descent guessing wrong so badly for such a long time, but perhaps it can get stuck on a bad ridge.
255 |
256 | ## Deeper background on how convolutional networks work
257 | Above we described how convolutional networks work. A kernel is laid on top of an input, and image and kernel are multiplied element by element. The sum of all those multiplications is the output for that location. The kernel then slides over the entire input, producing a somewhat smaller output.
258 |
259 | In code this looks like this:
260 |
261 | ```C++
262 | ...
263 | else if(d_mode == TMode::Convo) {
264 | d_lhs->assureValue();
265 | d_rhs->assureValue(); // the weights
266 |
267 | d_val = EigenMatrix(1 + d_lhs->d_val.rows() - d_convop.kernel,
268 | 1 + d_lhs->d_val.cols() - d_convop.kernel);
269 | for(int r = 0 ; r < d_val.rows(); ++r)
270 | for(int c = 0 ; c < d_val.cols(); ++c)
271 | d_val(r,c) = d_lhs->d_val.block(r, c, d_convop.kernel, d_convop.kernel)
272 | .cwiseProduct(d_rhs->d_val).sum()
273 | + d_convop.bias->d_val(0,0);
274 | }
275 | ```
276 | This is part of [tensor2.hh](https://github.com/berthubert/hello-dl/blob/main/tensor2.hh) that implements the core neural network operations and the automatic differentiation.
277 |
278 | When doing the *forward* pass, this code first assures that the input (in `d_lhs`, aka left hand side) and the kernel (in `d_rhs`) are calculated. The output of the convolutional operation is the value of this node, and it ends up in `d_val`. On the third and fourth lines of the snippet, `d_val` gets sized to the output dimensions of the convolution: 1 + input rows - kernel size, and similarly for the columns.
279 |
280 | The for-loops meanwhile slide the kernel over the input, using the Eigen `.block()` primitive to focus on the input part covered by the kernel. Finally, the bias gets added.
281 |
282 | This is all really straightforward, as the *forward* pass tends to be. But backpropagation requires a bit more thinking: how does changing the kernel parameters impact the output of a convolutional layer? And, how does changing the input change the output? It is clear we need to backpropagate in these two directions.
283 |
284 | It turns out the process is not that hard, and in fact also involves a convolution:
285 | ```C++
286 | for(int r = 0 ; r < d_val.rows(); ++r)
287 | for(int c = 0 ; c < d_val.cols(); ++c)
288 | d_lhs->d_grads.block(r,c,d_convop.kernel, d_convop.kernel)
289 | += d_rhs->d_val * d_grads(r,c);
290 | ```
291 |
292 | Recall that `d_lhs` is the input to the convolution. The backward pass slides over all output positions and, for each position, adds the filter kernel scaled by the gradient of that output element to the corresponding block of the input gradients.
293 |
294 | Here is the backpropagation to the filter kernel:
295 |
296 | ```C++
297 | for(int r = 0 ; r < d_rhs->d_val.rows(); ++r)
298 | for(int c = 0 ; c < d_rhs->d_val.cols(); ++c)
299 | d_rhs->d_grads(r,c) += (d_lhs->d_val.block(r, c, d_val.rows(), d_val.cols())*d_grads).sum();
300 | d_rhs->d_grads.array() /= sqrt(d_grads.rows()*d_grads.cols());
301 | ```
302 |
303 | And finally the bias:
304 |
305 | ```C++
306 | d_convop.bias->d_grads(0,0) += d_grads.sum();
307 | ```
308 |
309 | This all is a bit 'deus ex machina', magical math making the numbers come out right. I present the code here because finding the exact instructions elsewhere is not easy. But you don't need to delve into these functions line by line to understand conceptually what is happening.
310 |
311 | # The actual code
312 | The code is in [tensor-convo.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-convo.cc) and is much like
313 | the digit reading code from [the previous chapter](../handwritten-digits-sgd-batches).
314 |
315 | Here is a key part of the difference:
316 |
317 | ```C++
318 | - MNISTReader mn("gzip/emnist-digits-train-images-idx3-ubyte.gz", "gzip/emnist-digits-train-labels-idx1-ubyte.gz");
319 | - MNISTReader mntest("gzip/emnist-digits-test-images-idx3-ubyte.gz", "gzip/emnist-digits-test-labels-idx1-ubyte.gz");
320 | + MNISTReader mn("gzip/emnist-letters-train-images-idx3-ubyte.gz", "gzip/emnist-letters-train-labels-idx1-ubyte.gz");
321 | + MNISTReader mntest("gzip/emnist-letters-test-images-idx3-ubyte.gz", "gzip/emnist-letters-test-labels-idx1-ubyte.gz");
322 |
323 | cout<<"Have "<<mn.num()<<" training images and "<<mntest.num()<<" test images"<<endl;
...
336 | Tensor img{28,28};
337 | Tensor scores{26, 1};
338 | Tensor expected{1,26};
339 | Tensor modelloss{1,1};
340 | Tensor weightsloss{1,1};
341 | Tensor loss{1,1};
342 | ```
343 | This defines the `img` variable into which we put the image to be taught or recognized. The `scores` tensor meanwhile holds the calculated score for each of the 26 possible outputs. For training purposes, we put into `expected` which letter we expect the network to output.
344 |
345 | Ignore `modelloss` and `weightsloss` for a bit, they will become relevant in a later chapter.
346 |
347 | Finally the `loss` tensor is what we train the network on, and it represents how likely the network thought itself to be right.
348 |
349 | Next up, we're going to define the state, which contains the parameters that will be trained/used:
350 |
351 | ```C++
352 | struct State : public ModelState
353 | {
354 | // r_in c k c_i c_out
355 | Conv2d c1; // -> 26*26 -> max2d -> 13*13
356 | Conv2d c2; // -> -> 11*11 -> max2d -> 6*6 //padding
357 | Conv2d c3; // -> 4*4 -> max2d -> 2*2
358 | // flattened to 512 (128*2*2)
359 | // IN OUT
360 | Linear fc1;
361 | Linear fc2;
362 | Linear fc3;
363 | ```
364 | This has three convolutional layers (`c1`, `c2`, `c3`) and three full linear combination layers (`fc1`, `fc2`, `fc3`). Note that `fc3` will end up delivering a vector of 26 scores.
365 |
366 | Finally there is some important housekeeping:
367 |
368 | ```C++
369 | State()
370 | {
371 | this->d_members = {{&c1, "c1"}, {&c2, "c2"},
372 | {&c3, "c3"}, {&fc1, "fc1"},
373 | {&fc2, "fc2"}, {&fc3, "fc3"}
374 | };
375 | }
376 | };
377 | ```
378 | `State` descends from `ModelState` which, as previously, brings a lot of logic for saving and loading parameters, as well as for modifying them during training. But to perform its services, it needs to know about the members. We also tell it the names of the members for reporting purposes, which we are going to explore in [the next chapter](../hyperparameters-inspection-adam).
379 |
380 | # Finally
381 | In this chapter we've introduced convolutional layers that are able to recognize image features, and are therefore more likely to robustly identify letters instead of just pixels in specific places. We've also found however that training such convolutional layers takes far longer. The end result is pretty decent though, given that some handwritten letters can look nearly identical on their own (*i* versus *l*, *g* versus *q* etc).
382 |
383 | In [the next chapter](../hyperparameters-inspection-adam) we'll go over what the excitingly named hyperparameters are, and how a fellow called ADAM can help us speed up training tremendously.
384 |
385 |
386 |
--------------------------------------------------------------------------------
/dl-convolutional/gelu.svg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berthubert/hello-dl-posts/main/dl-convolutional/gelu.svg
--------------------------------------------------------------------------------
/handwritten-digits-sgd-batches/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hello Deep Learning: Reading handwritten digits"
3 | date: 2023-03-30T12:00:04+02:00
4 | draft: false
5 | ---
6 | > This page is part of the [Hello Deep Learning](../hello-deep-learning) series of blog posts. You are very welcome to improve this page [via GitHub](https://github.com/berthubert/hello-dl-posts/blob/main/handwritten-digits-sgd-batches/index.md)!
7 |
8 | In the [previous chapter](../autograd) we described how automatic differentiation of the result of neural networks works.
9 |
10 | In the first and second chapters we designed and trained a one-layer neural network that could distinguish images of the digit 3 from images of the digit 7, and the network did so very well. But honestly, that is not a very hard task either.
11 |
12 | The next challenge is to recognize and classify all ten digits. To do so, we'll use a network that does the following:
13 |
14 | 1. Flatten 28x28 image to a 784x1 matrix
15 | 2. Multiply this matrix by a 128x784 matrix ('lc1')
16 | 3. Replace all negative elements of the resulting matrix by 0
17 | 4. Multiply the resulting matrix by a 64x128 matrix ('lc2')
18 | 5. Replace all negative elements of the resulting matrix by 0
19 | 6. Multiply the resulting matrix by a 10x64 matrix ('lc3')
20 | 7. Pick the highest row of the resulting 10x1 matrix, this is the digit the network thinks it saw
21 |
22 | Or, in code form:
23 |
24 | ```C++
25 | auto output = s.lc1.forward(makeFlatten({img})); // 1, 2
26 | auto output2 = makeFunction<ReluFunc>(output); // 3
27 | auto output3 = s.lc2.forward(output2); // 4
28 | auto output4 = makeFunction<ReluFunc>(output3); // 5
29 | auto output5 = s.lc3.forward(output4); // 6
30 | scores = makeLogSoftMax(output5); // 7a
31 |
32 | ...
33 | int predicted = scores.maxValueIndexOfColumn(0); // 7b
34 | ```
35 |
36 | So, what is going on here? First we turn the image into a looooong matrix of 784x1, using a call to `makeFlatten`. This loses some spatial context - neighboring pixels are no longer necessarily next to each other. But it is necessary for the rest of the operations.
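To illustrate what flattening amounts to, here is a sketch in plain Eigen terms. This is not the actual `makeFlatten` from hello-dl (which also has to take part in the automatic differentiation), just the reshaping idea:

```C++
#include <Eigen/Dense>

// View a 28x28 image as one long 784x1 column vector. Eigen is column-major
// by default, so this walks the image column by column - the exact order does
// not matter, as long as it is the same for every image.
Eigen::VectorXf flatten(const Eigen::MatrixXf& img)
{
  return Eigen::Map<const Eigen::VectorXf>(img.data(), img.size());
}
```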
37 |
38 | The flattened matrix now goes through `lc1`, which is a linear combination layer. Or in other words, a matrix multiplication.
39 |
40 | Next up, the `ReluFunc`. This 'rectified linear unit' is nothing other than an if statement: `if(x<0) return 0; else return x`. If we were to stack multiple linear combinations directly, that would not add any extra smarts to the network - you could summarise two layers as one layer with different parameters. Inserting a non-linear element like 'ReLu' changes this.
41 |
42 | > From the excellent [FastAI notebook on MNIST](https://github.com/fastai/fastbook/blob/master/04_mnist_basics.ipynb):
43 | "Amazingly enough, it can be mathematically proven that this little function can solve any computable problem to an arbitrarily high level of accuracy, if you can find the right parameters for w1 and w2 and if you make these matrices big enough. For any arbitrarily wiggly function, we can approximate it as a bunch of lines joined together; to make it closer to the wiggly function, we just have to use shorter lines. This is known as the [universal approximation theorem](https://towardsdatascience.com/how-do-relu-neural-networks-approximate-any-continuous-function-f59ca3cf2c39)."
44 | > Incidentally, I can highly recommend reading the FastAI notebook after you've finished with my 'from scratch' series. The FastAI work will then make sense, and will allow you to convert your from scratch knowledge into deep learning frameworks that people actually use.
45 |
46 | After the first ReLu, we pass our data through a second linear combination, after which follows further ReLu, and a final linear combination.
47 |
48 | # LogSoftMax, "Cross Entropy Loss"
49 | Ok - here we are going to make some big steps and introduce a lot of modern machine learning vocabulary.
50 |
51 | In our previous example, the network had one output, and if it was negative, we would interpret that as a prediction of a three.
52 |
53 | Our present network has a more difficult task: determining which of 10 digits we are looking at. After step 6 of our network, we have a 10x1 matrix full of values. The convention is that the highest valued coefficient in that matrix represents the network's verdict.
54 |
55 | Recall that earlier we made the network 'learn in the right direction' by comparing the output to what we'd hope to get. We set a target of '2' or '-2' by hand, and remarked that this was an ugly trick we'd get rid of later on. That moment is now.
56 |
57 | All machine learning projects define a 'loss value' calculated by the 'loss function'. The loss represents the distance between what a network predicts, and what we'd like it to predict. In our previous example, we informally used a loss function of {{}}max(2 - R, 0){{}}, and trained our network to minimize this loss function.
58 |
59 | Or in other words, as long as {{}}R<2{{}} we'd change the parameters to increase {{}}R{{}}.
60 |
61 | This is the key concept of neural network learning: modify the parameters so that the loss function goes down. And to do so, take the derivative of the loss function against all parameters. Then subtract a fraction of that derivative from all the parameters.
62 |
63 | This only works if we can get a *single number* to come out of our network. But recall, the digit recognizing network we are designing on this page has *10* outputs. So some work is in order.
64 |
65 | In practice, we first feed all the outputs to a function called '*LogSoftMax*':
66 |
67 | {{}}
68 | \text{LogSoftmax}(x_{i}) = \log\left(\frac{\exp(x_i) }{ \sum_j \exp(x_j)} \right) = x_i - \log\left(\sum_j \exp(x_j)\right)
69 | {{}}
70 |
71 | If we put in 10 inputs, out come 10 outputs, but now lowered by the logarithm of the sum of the exponent of all elements.
72 |
73 | This looks like:
74 |
75 |
76 | ```Python
77 | # 0 1 2 3 4 5 6 7 8 9
78 | In: [-2.5, -4, -3, -0.5, -4.4, 4, -0.75, -0.25, -0.5, -2.0]
79 | Out: [-6.5, -8.04, -7.05, -4.55, -8.459, -0.05, -4.80, -4.30, -4.55, -6.05]
80 | ```
81 |
82 | This would correspond to a network saying '5', which has the highest value. The output of `LogSoftMax` is typically interpreted as a log probability. Or in other words, the network is taken to predict with {{}}e^{-0.05}\approx 95\%{{}} probability that it is looking at a 5.
83 |
84 | LogSoftMax works well for a variety of reasons, one of which is that it prevents the "pushing to infinity" we had to safeguard against in chapter 2. When using LogSoftMax as part of the loss function, we know that 0 is the best answer we're ever going to get ('100% certain'). But, because of the logarithms in there, the 'push' becomes ever smaller the closer we get to zero.
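To see why the push shrinks, take the derivative of {{}}-\text{LogSoftmax}(x_t){{}} (which, as we'll see below, is exactly our loss for true digit t) with respect to the score {{}}x_t{{}}:

{{}}
\frac{\partial}{\partial x_t}\left(-x_t + \log\sum_j \exp(x_j)\right) = \frac{\exp(x_t)}{\sum_j \exp(x_j)} - 1
{{}}

which goes to zero as the predicted probability of the true digit approaches 100%.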
85 |
86 | But, we still have 10 numbers to look at, and we need just 1 for our loss value. To do this, it is customary to construct a "one hot vector", a matrix consisting of all zeroes, except for a single one at the index of the digit we were expecting.
87 |
88 | So to get to the loss value, we do:
89 |
90 | ```
91 | scores=[-6.5, -8.04, -7.05, -4.55, -8.459, -0.05, -4.80, -4.30, -4.55, -6.05]
92 | onehotvector=[[0],[0],[0],[0],[0],[1.0],[0],[0],[0],[0]]
93 | loss = -scores*onehotvector = 0.05
94 | ```
95 | Where \* denotes a matrix multiplication. Here, the loss is 0.05, since the value at index 5 (the digit we know we put into the network) is -0.05. The negation is there because 0 is the very best we're ever going to get, but we are approaching it from negative territory. This technique goes by the pompous name of '[cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)'.
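To make the arithmetic concrete, here is a tiny stand-alone sketch in plain C++ (deliberately not using the hello-dl tensor classes) that reproduces the numbers above:

```C++
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// LogSoftMax followed by 'cross entropy loss': the loss is simply minus the
// log-probability the network assigned to the true label.
float crossEntropyLoss(const std::vector<float>& scores, int label)
{
  // log(sum(exp(x_j))), computed relative to the maximum for numerical stability
  float mx = *std::max_element(scores.begin(), scores.end());
  float sum = 0;
  for(float s : scores)
    sum += std::exp(s - mx);
  float logsumexp = mx + std::log(sum);
  return -(scores[label] - logsumexp); // -LogSoftMax(x_label)
}

int main()
{
  std::vector<float> in{-2.5, -4, -3, -0.5, -4.4, 4, -0.75, -0.25, -0.5, -2.0};
  std::cout << crossEntropyLoss(in, 5) << "\n"; // prints roughly 0.05
}
```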
96 |
97 | Once we have the loss number, we can take the derivative, and adjust the parameters of the matrices with a fraction of that derivative.
98 |
99 | # Recap
100 | Because this is so crucial, let's go over it one more time.
101 |
102 | Our neural network consists of several layers. We turn our image into a 1-dimensional matrix, multiply it by another matrix, replace all negative elements by zero, repeat the last two steps once more, and then a final matrix multiplication. Ten values come out. We do a logarithmic operation on these numbers, and turn them into 10 log-probabilities.
103 |
104 | The highest log-probability we can get is 0, which represents 100%. We multiply the ten numbers by yet another 'expectation' matrix, which is zero, except for the place corresponding to the actual number we put in. Out comes the probability the network assigned to what we know is the right digit.
105 |
106 | If this log probability is -0.05, we say that we have a 'loss' of 0.05.
107 |
108 | This loss number is the outcome of all these matrix multiplications and ReLu operations, the LogSoftMax layer and finally the multiplication with the expectation matrix.
109 |
110 | And because of the magic of [automatic differentiation from the previous chapter](../autograd), we can determine exactly how our loss function would change if we modified the three parameter matrices we used for multiplication.
111 |
112 | We then update those matrices with a fraction of the derivative, and we call this fraction the 'learning rate'.
113 |
114 | # One final complication: batches
115 | We could perform the procedure outlined above once per training digit. But this might cause our network to oscillate wildly between "getting the ones right", "getting the twos right" etc. For this and other reasons, it is customary to do the learning per batch. Picking a batch size is an important choice - if the batch size is too small (1, for example), the network might swerve. If it is too large however, we lose training opportunities.
116 |
117 | There is also a more hardware related reason to do this. Much machine learning happens on GPUs which only perform well if you give them a lot of work they can do in parallel. If you only process a single input at a time, much of your GPU will be idle.
118 |
119 | When we gather our learning from a batch and average the results, we call this [Stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) or SGD.
120 |
121 |
122 | # Getting down to work
123 | The code is in [tensor-relu.cc](https://github.com/berthubert/hello-dl/blob/main/tensor-relu.cc), where you'll find slightly more lines of code than described below. The additional lines perform logging to generate the graphs that demonstrate the performance of this model.
124 |
125 | Here is the definition of our model:
126 |
127 | ```C++
128 | struct ReluDigitModel
129 | {
130 | Tensor img{28,28};
131 | Tensor scores{10, 1};
132 | Tensor expected{1,10};
133 | Tensor loss{1,1};
134 | struct State : public ModelState
135 | { // IN OUT
136 | Linear lc1;
137 | Linear lc2;
138 | Linear lc3;
139 |
140 | State()
141 | {
142 | this->d_members = {{&lc1, "lc1"}, {&lc2, "lc2"}, {&lc3, "lc3"}};
143 | }
144 | };
145 |
146 | ```
147 |
148 | Here we see how the state of a model is kept separate. This state is what contains the actual parameters. Note that the state derives from `ModelState`. This parent class gives us common operations like `load()`, `save()`, `randomize()`, but also logging of everything to SQLite. To make sure that the parent class knows what to do, the `State` struct registers its members in its constructor.
149 |
150 | Next up, let's hook it all up:
151 |
152 | ```C++
153 | void init(State& s)
154 | {
155 | auto output = s.lc1.forward(makeFlatten({img}));
156 |     auto output2 = makeFunction<ReluFunc>(output);
157 | auto output3 = s.lc2.forward(output2);
158 |     auto output4 = makeFunction<ReluFunc>(output3);
159 | auto output5 = s.lc3.forward(output4);
160 | scores = makeLogSoftMax(output5);
161 | loss = -(expected*scores);
162 | }
163 | };
164 | ```
165 |
166 | This mirrors the code we've seen earlier. In the last line of code we define the 'loss' function. And recall, this is all lazy evaluation - we're setting up the logic, but nothing is being calculated yet.
167 |
168 | Next up, mechanics:
169 |
170 | ```C++
171 | MNISTReader mn("gzip/emnist-digits-train-images-idx3-ubyte.gz", "gzip/emnist-digits-train-labels-idx1-ubyte.gz");
172 | MNISTReader mntest("gzip/emnist-digits-test-images-idx3-ubyte.gz", "gzip/emnist-digits-test-labels-idx1-ubyte.gz");
173 |
174 | ReluDigitModel m;
175 | ReluDigitModel::State s;
176 |
177 | if(argc==2) {
178 | cout << "Loading model state from '" << argv[1] << "'\n";
179 | loadModelState(s, argv[1]);
180 | }
181 | else
182 | s.randomize();
183 |
184 | m.init(s);
185 |
186 | auto topo = m.loss.getTopo();
187 | Batcher batcher(mn.num());
188 | ```
189 |
190 | The 'topo' line gets the topological sort we'll be using later on, as described in the previous chapter.
191 |
192 | The final line creates a helper class called `Batcher`, to which we pass the number of training images we have. This class shuffles all those indices. Later, we can request batches of N indices for processing, and we'll get a random batch every time, roughly as in the sketch below.
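The actual hello-dl `Batcher` lives in the repository; a minimal sketch of the idea could look like this:

```C++
#include <algorithm>
#include <deque>
#include <random>
#include <vector>

// Sketch of a batcher: shuffle all indices once, then hand them out in chunks
// until none are left.
class Batcher
{
public:
  explicit Batcher(int n)
  {
    for(int i = 0; i < n; ++i)
      d_idx.push_back(i);
    std::shuffle(d_idx.begin(), d_idx.end(), std::mt19937(std::random_device{}()));
  }

  std::vector<int> getBatch(int size)
  {
    std::vector<int> batch;
    while(size-- && !d_idx.empty()) {
      batch.push_back(d_idx.front());
      d_idx.pop_front();
    }
    return batch;
  }

private:
  std::deque<int> d_idx;
};
```

Once the batcher runs out, `getBatch()` returns an empty batch, which is why the training loop below stops at that point: every training image has then been handed out once.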
193 |
194 | Let's do just that:
195 |
196 | ```C++
197 | for(unsigned int tries = 0 ;; ++tries) {
198 | if(!(tries % 32)) {
199 | testModel(m, mntest);
200 | saveModelState(s, "tensor-relu.state");
201 | }
202 |
203 | auto batch = batcher.getBatch(64);
204 | if(batch.empty())
205 | break;
206 | float totalLoss = 0;
207 | unsigned int corrects=0, wrongs=0;
208 | ```
209 | Up to here it is just mechanics. Every 32 batches we test the model against our validation data, and we also save our state to disk. Next up more interesting things happen:
210 |
211 | ```C++
212 | m.loss.zeroAccumGrads(topo);
213 |
214 | for(const auto& idx : batch) {
215 | mn.pushImage(idx, m.img);
216 | int label = mn.getLabel(idx);
217 | m.expected.oneHotColumn(label);
218 |
219 | totalLoss += m.loss(0,0); // turns it into a float
220 |
221 | int predicted = m.scores.maxValueIndexOfColumn(0);
222 |
223 | if(predicted == label)
224 | corrects++;
225 | else wrongs++;
226 |
227 | ```
228 | As noted, we process a whole batch of images before starting the learning process. For each image we look at, we gather gradients through automatic differentiation. We need to add up all these gradients for the eventual learning. To make life easy, our `Tensor` class has a facility where you can stash your gradients. But before we start a batch, we must zero the accumulated numbers, which is what the first line above does.
229 |
230 | Next up, we iterate over all numbers in the batch. For each number, we fetch the EMNIST image and the label assigned to it.
231 | We then configure the `expected` variable, with the 'one hot' configuration which is 1 only for the correct outcome.
232 |
233 | The `totalLoss += m.loss(0,0);` line looks like a bit of statistics keeping, but it is what actually triggers the whole network into action. We wake up the lazy evaluation.
234 |
235 | In the next line we look up the row in the scores matrix with the highest value, which is the prediction from the model.
236 |
237 | Then we count the correct and wrong predictions.
238 |
239 | Now we come to an interesting part again:
240 | ```C++
241 | // backward the thing
242 | m.loss.backward(topo);
243 | m.loss.accumGrads(topo);
244 | // clear grads & havevalue
245 | m.loss.zerograd(topo);
246 | }
247 | ```
248 |
249 | This is where we perform the automatic differentiation (`.backward(topo)`). We then call `accumGrads(topo)` to accumulate the gradients for this specific image. Finally, there is a call to `.zerograd(topo)`. From the previous chapter, you'll recall how the gradients rain downward additively. If we run the same model a second time, we first need to zero those gradients so we start from a clean slate.
250 |
251 | Once we are done with a whole batch, we can output some statistics and do the actual learning:
252 |
253 |
254 | ```C++
255 | cout << tries << ": Average loss " << totalLoss/batch.size()<< ", percent batch correct " << 100.0*corrects/(corrects+wrongs) << "%\n";
256 |
257 | double lr=0.01 / batch.size();
258 | s.learn(lr);
259 | }
260 | ```
261 |
262 | Of note, we divide the learning rate by the batch size. This is because we've accumulated gradients for each of the images in the batch, and we want to learn from their average and not their sum.
263 |
264 | Finally, let's zoom in on what `s.learn(lr)` actually does:
265 |
266 | ```C++
267 | void learn(float lr)
268 | {
269 | for(auto& p : d_params) {
270 | auto grad1 = p.ptr->getAccumGrad();
271 | grad1 *= lr;
272 | *p.ptr -= grad1;
273 | }
274 | }
275 | ```
276 |
277 | For each parameter, the accumulated gradient is retrieved and multiplied by the learning rate. This scaled gradient is then subtracted from the actual parameter value.
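Putting the pieces together, a single batch amounts to the classic mini-batch stochastic gradient descent update for every parameter {{}}\theta{{}}, where {{}}\eta{{}} is the base learning rate (0.01 above) and B the batch:

{{}}
\theta \leftarrow \theta - \frac{\eta}{|B|} \sum_{i \in B} \frac{\partial \text{loss}_i}{\partial \theta}
{{}}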
278 |
279 | # Giving it a spin!
280 | Finally, let's run all this:
281 |
282 | ```bash
283 | $ ./tensor-relu
284 | Have 240000 training images and 40000 test images
285 |
286 |
287 | ...
288 | *XXXXX.
289 | .XXXXXXXX.
290 | *XXXX*.*XX*..
291 | *XXX* XXXXXX
292 | .XXX. XXXXXX
293 | *XX* .XXXX*
294 | *XX. *XXX*.
295 | *XXX ..XXXXX.
296 | .XXXXXXXXXXXXX
297 | *XXXXXXXXXXX*
298 | .**** .XXX.
299 | *XX*
300 | XXX.
301 | .XX*
302 | *XXX.
303 | XXXX
304 | XXX*
305 | .XXX.
306 | *XX*
307 | XXX*
308 | .XX*
309 | X*.
310 |
311 |
312 |
313 |
314 | predicted: 3, actual: 9, loss: 2.31289
315 | Validation batch average loss: 2.30657, percentage correct: 9.375%
316 | 0: Average loss 2.31008, percent batch correct 9.375%
317 | ...
318 |
319 | .X*
320 | *XX*
321 | .XX*
322 | *XX*
323 | XXX
324 | *XX.
325 | *XXX
326 | XXX
327 | *XX.
328 | XXX
329 | *XX.
330 | .XX*
331 | *XX ...
332 | XXX *XXXXX
333 | XX. .XXXXXXXXX.
334 | XX* *XXXX*. *XX.
335 | XXX XXXX*. XX*
336 | *XX.XXX* .XX
337 | .XXXXX* .XX
338 | *XXX. *XX
339 | XXX* *XX*
340 | XXXXX*......*XX*
341 | *XXXXXXXXXX*
342 | .*XXXXXX*
343 |
344 |
345 |
346 | predicted: 6, actual: 6, loss: 1.28472
347 | Validation batch average loss: 1.39618, percentage correct: 76.5625%
348 | ...
349 | Validation batch average loss: 0.276509, percentage correct: 92.225%
350 | ```
351 | 92.23%, not too shabby! Here are some customary ways of looking at performance, starting with a training/validation percentage correct graph:
352 |
353 |
368 |
369 | And finally the wonderfully named confusion matrix, which shows how often a prediction (vertical) matched up with the actual label (horizontal):
370 | | predicted \ actual | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
371 | |---|---|---|---|---|---|---|---|---|---|---|
372 | | 0 | 3750 | 3 | 27 | 25 | 12 | 42 | 16 | 5 | 10 | 10 |
373 | | 1 | 4 | 3793 | 25 | 5 | 13 | 24 | 18 | 16 | 87 | 12 |
374 | | 2 | 32 | 9 | 3665 | 74 | 22 | 40 | 24 | 12 | 44 | 3 |
375 | | 3 | 7 | 31 | 32 | 3581 | 0 | 110 | 0 | 10 | 53 | 34 |
376 | | 4 | 70 | 34 | 68 | 3 | 3766 | 86 | 32 | 17 | 63 | 123 |
377 | | 5 | 48 | 32 | 22 | 143 | 3 | 3548 | 23 | 3 | 120 | 15 |
378 | | 6 | 38 | 18 | 79 | 7 | 52 | 48 | 3865 | 0 | 8 | 0 |
379 | | 7 | 3 | 3 | 26 | 40 | 2 | 12 | 0 | 3716 | 11 | 171 |
380 | | 8 | 46 | 73 | 55 | 83 | 21 | 75 | 22 | 22 | 3556 | 37 |
381 | | 9 | 2 | 4 | 1 | 39 | 109 | 15 | 0 | 199 | 48 | 3595 |
563 |
564 | From this you can see for example that the network has some trouble distinguishing 7 and 9, but that it absolutely never confuses a 7 for a 6, or a 6 for a 9.
565 |
566 | # Discussion
567 | In the above, we've seen how we can configure a multi-layer network consisting of linear combinations, 'relu units', LogSoftMax and finally the expectation 'one hot vector'. We also made this network learn, and quantified its success.
568 |
569 | This is about as far as linear combinations can go. And although 90+% correctness is nice, this network has really only learned what perfectly centered and rather clean digits look like. Concretely, this network is really attached to *where* the pixels are. We would expect a network that somehow 'understands' what it is looking at not to be so sensitive to placement.
570 |
571 | However, we can still feel pretty good - this tiny network did really well on its simple job, and we know *exactly* how it was trained and what it does.
572 |
573 | [In the next chapter](../dl-convolutional/), we'll be adding elements that actually capture shapes and their relations, which leads to performance that generalizes better, but also to more complexity and training time. [We'll also be going over some common neural network disappointments](../dl-convolutional/).
574 |
575 |
576 |
--------------------------------------------------------------------------------