├── .gitattributes
├── book
    ├── images
    │   ├── comic.png
    │   ├── logo.png
    │   └── favicon.ico
    ├── layers
    │   ├── transformer
    │   │   ├── training
    │   │   │   ├── training.ipynb
    │   │   │   ├── no-training
    │   │   │   │   └── no-training.ipynb
    │   │   │   ├── teacher
    │   │   │   │   └── teacher.ipynb
    │   │   │   └── token
    │   │   │   │   └── token.ipynb
    │   │   ├── attn
    │   │   │   ├── self-attn.ipynb
    │   │   │   └── attn.ipynb
    │   │   ├── transformer-vs-rnn.ipynb
    │   │   └── transformer.ipynb
    │   ├── linear
    │   │   ├── linear-grad.ipynb
    │   │   └── linear.ipynb
    │   ├── dropout
    │   │   └── dropout.ipynb
    │   ├── emb
    │   │   └── emb.ipynb
    │   ├── norm
    │   │   └── norm.ipynb
    │   ├── rnn
    │   │   ├── gru
    │   │   │   └── gru.ipynb
    │   │   ├── lstm
    │   │   │   └── lstm.ipynb
    │   │   └── rnn.ipynb
    │   ├── padding
    │   │   └── padding.ipynb
    │   ├── pooling
    │   │   └── pooling.ipynb
    │   ├── cnn
    │   │   └── cnn.ipynb
    │   ├── activation
    │   │   ├── tanh
    │   │   │   └── tanh.ipynb
    │   │   ├── sigmoid
    │   │   │   └── sigmoid.ipynb
    │   │   ├── relu
    │   │   │   └── relu.ipynb
    │   │   ├── activation.ipynb
    │   │   └── softmax
    │   │   │   └── softmax.ipynb
    │   └── layers.ipynb
    ├── reuse
    │   ├── reuse.ipynb
    │   ├── distil
    │   │   └── distil.ipynb
    │   └── transfer
    │   │   ├── tl-vs-da.ipynb
    │   │   └── tl-da.ipynb
    ├── better
    │   ├── explainable
    │   │   ├── saliency.ipynb
    │   │   └── explainable.ipynb
    │   ├── better.ipynb
    │   ├── meta
    │   │   └── meta.ipynb
    │   ├── compression
    │   │   └── compression.ipynb
    │   └── lll
    │   │   └── lll.ipynb
    ├── reinforce
    │   ├── essential
    │   │   ├── reward.ipynb
    │   │   ├── state.ipynb
    │   │   ├── action.ipynb
    │   │   ├── agent.ipynb
    │   │   └── online-offline.ipynb
    │   ├── policy
    │   │   ├── policy.ipynb
    │   │   └── policy-gradient.ipynb
    │   ├── value
    │   │   ├── value.ipynb
    │   │   └── q-learning.ipynb
    │   ├── ac
    │   │   └── ac.ipynb
    │   └── reinforce.ipynb
    ├── notice
    │   ├── optimizer
    │   │   └── optimizer.ipynb
    │   ├── data
    │   │   ├── underfit.ipynb
    │   │   └── overfit.ipynb
    │   ├── gradient
    │   │   ├── saddle.ipynb
    │   │   └── norm.ipynb
    │   ├── lr
    │   │   └── lr.ipynb
    │   ├── notice.ipynb
    │   └── batch
    │   │   └── batch.ipynb
    ├── unsupervised
    │   ├── self-supervised
    │   │   └── self-supervised.ipynb
    │   ├── semi-supervised
    │   │   └── semi-supervised.ipynb
    │   ├── decision-tree
    │   │   └── decision-tree.ipynb
    │   ├── unsupervised.ipynb
    │   └── clustering
    │   │   └── clustering.ipynb
    ├── _config.yml
    ├── tasks
    │   ├── regression
    │   │   ├── auto
    │   │   │   └── auto.ipynb
    │   │   └── regression.ipynb
    │   ├── tasks.ipynb
    │   └── classification
    │   │   ├── classification.ipynb
    │   │   └── multilabel
    │   │       └── multilabel.ipynb
    ├── basics
    │   ├── gradients
    │   │   ├── loss-fn-derivative.ipynb
    │   │   ├── gradients.ipynb
    │   │   └── back-prop.ipynb
    │   ├── loss
    │   │   └── loss.ipynb
    │   ├── model
    │   │   └── model.ipynb
    │   ├── data
    │   │   └── data.ipynb
    │   ├── basics.ipynb
    │   └── approx
    │   │   └── approx.ipynb
    ├── generative
    │   ├── ae
    │   │   ├── ae-semi.ipynb
    │   │   ├── vae
    │   │   │   └── vae.ipynb
    │   │   ├── ae-arch.ipynb
    │   │   └── ae.ipynb
    │   ├── generative.ipynb
    │   ├── gan
    │   │   └── gan.ipynb
    │   └── gmm
    │   │   └── gmm.ipynb
    ├── intro.ipynb
    └── _toc.yml
├── CONTRIBUTING.md
├── pyproject.toml
├── .github
    └── workflows
    │   └── build.yaml
└── .gitignore


/.gitattributes:
--------------------------------------------------------------------------------
1 | *.ipynb linguist-language=Python
2 | 


--------------------------------------------------------------------------------
/book/images/comic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rentruewang/learning-machine/HEAD/book/images/comic.png


--------------------------------------------------------------------------------
/book/images/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rentruewang/learning-machine/HEAD/book/images/logo.png


--------------------------------------------------------------------------------
/book/images/favicon.ico:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rentruewang/learning-machine/HEAD/book/images/favicon.ico


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing to this repository <!-- omit in toc -->
 2 | 
 3 | ## Getting started <!-- omit in toc -->
 4 | 
 5 | Before you begin:
 6 | 
 7 | - The site is powered by [Jupyter-Book](https://jupyterbook.org). Please check there if there are building issues.
 8 | - Have you read the [code of conduct](CODE_OF_CONDUCT.md)?
 9 | - Check out the [existing issues](https://github.com/rentruewang/learning-machine/issues) for your type of issue.
10 | 


--------------------------------------------------------------------------------
/book/layers/transformer/training/training.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Training Transformers"
 8 |    ]
 9 |   }
10 |  ],
11 |  "metadata": {
12 |   "language_info": {
13 |    "codemirror_mode": {
14 |     "name": "ipython",
15 |     "version": 3
16 |    },
17 |    "file_extension": ".py",
18 |    "mimetype": "text/x-python",
19 |    "name": "python",
20 |    "nbconvert_exporter": "python",
21 |    "pygments_lexer": "ipython3",
22 |    "version": 3
23 |   }
24 |  },
25 |  "nbformat": 4,
26 |  "nbformat_minor": 2
27 | }
28 | 


--------------------------------------------------------------------------------
/book/layers/linear/linear-grad.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Calculate gradients for Linear Layers"
 8 |    ]
 9 |   }
10 |  ],
11 |  "metadata": {
12 |   "language_info": {
13 |    "codemirror_mode": {
14 |     "name": "ipython",
15 |     "version": 3
16 |    },
17 |    "file_extension": ".py",
18 |    "mimetype": "text/x-python",
19 |    "name": "python",
20 |    "nbconvert_exporter": "python",
21 |    "pygments_lexer": "ipython3",
22 |    "version": 3
23 |   }
24 |  },
25 |  "nbformat": 4,
26 |  "nbformat_minor": 2
27 | }
28 | 


--------------------------------------------------------------------------------
/book/reuse/reuse.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Reusing Existing Models"
 8 |    ]
 9 |   }
10 |  ],
11 |  "metadata": {
12 |   "kernelspec": {
13 |    "display_name": "Python 3",
14 |    "language": "python",
15 |    "name": "python3"
16 |   },
17 |   "language_info": {
18 |    "codemirror_mode": {
19 |     "name": "ipython",
20 |     "version": 3
21 |    },
22 |    "file_extension": ".py",
23 |    "mimetype": "text/x-python",
24 |    "name": "python",
25 |    "nbconvert_exporter": "python",
26 |    "pygments_lexer": "ipython3",
27 |    "version": "3.9.5"
28 |   }
29 |  },
30 |  "nbformat": 4,
31 |  "nbformat_minor": 2
32 | }
33 | 


--------------------------------------------------------------------------------
/book/layers/transformer/attn/self-attn.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Self Attention"
 8 |    ]
 9 |   }
10 |  ],
11 |  "metadata": {
12 |   "kernelspec": {
13 |    "display_name": "Python 3",
14 |    "language": "python",
15 |    "name": "python3"
16 |   },
17 |   "language_info": {
18 |    "codemirror_mode": {
19 |     "name": "ipython",
20 |     "version": 3
21 |    },
22 |    "file_extension": ".py",
23 |    "mimetype": "text/x-python",
24 |    "name": "python",
25 |    "nbconvert_exporter": "python",
26 |    "pygments_lexer": "ipython3",
27 |    "version": "3.9.5"
28 |   }
29 |  },
30 |  "nbformat": 4,
31 |  "nbformat_minor": 2
32 | }
33 | 


--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
 1 | [project]
 2 | name = "learning-machine"
 3 | description = "Machine learning based on answers"
 4 | authors = [
 5 |     {name = "RenChu Wang", email = "patrick1031wang@gmail.com"},
 6 | ]
 7 | dependencies = [
 8 |     "jupyter-book>=0.15.1",
 9 |     "matplotlib>=3.8.2",
10 |     "numpy>=1.26.3",
11 |     "scipy>=1.11.4",
12 |     "torch>=2.1.2",
13 | ]
14 | requires-python = "==3.10.*"
15 | readme = "README.md"
16 | license = {text = "Apache-2.0"}
17 | dynamic = ["version"]
18 | 
19 | 
20 | [build-system]
21 | requires = ["setuptools", "wheel", "setuptools-scm"]
22 | build-backend = "setuptools.build_meta"
23 | 
24 | [tool.setuptools_scm]
25 | 
26 | [tool.pdm]
27 | distribution = false
28 | [tool.pdm.dev-dependencies]
29 | format = [
30 |     "autoflake>=2.2.1",
31 |     "black>=23.12.1",
32 |     "isort>=5.13.2",
33 | ]
34 | 


--------------------------------------------------------------------------------
/book/better/explainable/saliency.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Saliency Maps"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What are saliency maps?\n",
15 |     "\n",
16 |     "Saliency maps aims to display _what part of the input is the most important part for a model._ It tries to do so by trying to maximize the input (with respect to an classifier output). For example, we have a picture of a panda, and our model tells us so. To generate a saliency map, we try to maximize the label `panda` by modifying the input. If we In other words, **a saliency map is generated by trying to maximize the chance of the picture being classified.** Most of the time, we use the gradient _ascent_ algorithm to achieve this."
17 |    ]
18 |   }
19 |  ],
20 |  "metadata": {
21 |   "language_info": {
22 |    "name": "python"
23 |   }
24 |  },
25 |  "nbformat": 4,
26 |  "nbformat_minor": 2
27 | }
28 | 


--------------------------------------------------------------------------------
/book/reinforce/essential/reward.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Reward"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Rewards are for?\n",
15 |     "\n",
16 |     "Rewards are for good actions. Good actions in RL get more reward. Rewards are used in updating the agent such that the agent tries to take actions that yields more rewards in the future."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Rewards in deep learning.\n",
24 |     "\n",
25 |     "Rewards in deep learning is still a scalar number. It is used in the loss function to update the model. On thing worth notice is: The bigger the reward, the better. However, losses should be as minimal as possible. So a common way to do that is to set $ loss = - reward $. This way, minimizing losses is equal to maximizing rewards."
26 |    ]
27 |   }
28 |  ],
29 |  "metadata": {
30 |   "language_info": {
31 |    "name": "python"
32 |   }
33 |  },
34 |  "nbformat": 4,
35 |  "nbformat_minor": 2
36 | }
37 | 


--------------------------------------------------------------------------------
/.github/workflows/build.yaml:
--------------------------------------------------------------------------------
 1 | name: Website Build
 2 | 
 3 | on:
 4 |   push:
 5 |     branches:
 6 |       - main
 7 | 
 8 | jobs:
 9 |   build-and-deploy:
10 |     name: 🌎 Build the pages
11 |     runs-on: ubuntu-latest
12 |     steps:
13 |       - name: 🔔 Checkout
14 |         uses: actions/checkout@v3
15 | 
16 |       - name: 🏗️ python 3.10
17 |         uses: actions/setup-python@v4
18 |         with:
19 |           python-version: "3.10"
20 | 
21 |       - name: ⬇️ Python PDM
22 |         uses: pdm-project/setup-pdm@v3
23 | 
24 |       - name: ⬇️ Python Dependencies
25 |         run: pdm install
26 | 
27 |       - name: 🚂 Activate environment
28 |         run: echo "$(pdm venv --path in-project)/bin" >> $GITHUB_PATH
29 | 
30 |       - name: 🇺🇸 Build English version of the book
31 |         run: jupyter-book build book
32 | 
33 |       - name: 🇺🇸 Deploy English book
34 |         uses: JamesIves/github-pages-deploy-action@4.1.1
35 |         with:
36 |           branch: gh-pages
37 |           folder: ./book/_build/html
38 |           git-config-name: "github-actions[bot]"
39 |           git-config-email: "github-actions[bot]@users.noreply.github.com"
40 |           commit-message: 🎉 Book deployed
41 | 


--------------------------------------------------------------------------------
/book/notice/optimizer/optimizer.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "bf577c74",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Optimizer"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "987e1cf7",
14 |    "metadata": {},
15 |    "source": [
16 |     "## What are optimizers?\n",
17 |     "\n",
18 |     "Optimizers are responsible for making the updates to your model. A good optimizer can help your model converge faster by using momentum to roll past terrain where gradients are small, while not skipping the convergence point. A good optimizing strategy is essential in training effectively."
19 |    ]
20 |   }
21 |  ],
22 |  "metadata": {
23 |   "kernelspec": {
24 |    "display_name": "Python 3",
25 |    "language": "python",
26 |    "name": "python3"
27 |   },
28 |   "language_info": {
29 |    "codemirror_mode": {
30 |     "name": "ipython",
31 |     "version": 3
32 |    },
33 |    "file_extension": ".py",
34 |    "mimetype": "text/x-python",
35 |    "name": "python",
36 |    "nbconvert_exporter": "python",
37 |    "pygments_lexer": "ipython3",
38 |    "version": "3.9.6"
39 |   }
40 |  },
41 |  "nbformat": 4,
42 |  "nbformat_minor": 5
43 | }
44 | 


--------------------------------------------------------------------------------
/book/unsupervised/self-supervised/self-supervised.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Self Supervised Learning"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Self supervised learning vs unsupervised learning?\n",
15 |     "\n",
16 |     "Self-supervised methods are just unsupervised methods, but specifically used to describe the _Masked Language Model (MLM)_ methods used in natural language processing. In the task, the input sentences are randomly masked, and the mission of the model is to find out what the word that is masked is."
17 |    ]
18 |   }
19 |  ],
20 |  "metadata": {
21 |   "kernelspec": {
22 |    "display_name": "Python 3",
23 |    "language": "python",
24 |    "name": "python3"
25 |   },
26 |   "language_info": {
27 |    "codemirror_mode": {
28 |     "name": "ipython",
29 |     "version": 3
30 |    },
31 |    "file_extension": ".py",
32 |    "mimetype": "text/x-python",
33 |    "name": "python",
34 |    "nbconvert_exporter": "python",
35 |    "pygments_lexer": "ipython3",
36 |    "version": "3.9.6"
37 |   }
38 |  },
39 |  "nbformat": 4,
40 |  "nbformat_minor": 2
41 | }
42 | 


--------------------------------------------------------------------------------
/book/unsupervised/semi-supervised/semi-supervised.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "catholic-champagne",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Semi Supervised Training"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "intellectual-broadcasting",
14 |    "metadata": {},
15 |    "source": [
16 |     "## How is semi-supervised training different from unsupervised training?\n",
17 |     "\n",
18 |     "Semi-supervised methods refers to training on a dataset in which only a little portion have labels while most of the data remain unlabeled. Usually these methods train like a supervised model on labeled data, and performing auxiliary updates on unlabeled data."
19 |    ]
20 |   }
21 |  ],
22 |  "metadata": {
23 |   "kernelspec": {
24 |    "display_name": "Python 3",
25 |    "language": "python",
26 |    "name": "python3"
27 |   },
28 |   "language_info": {
29 |    "codemirror_mode": {
30 |     "name": "ipython",
31 |     "version": 3
32 |    },
33 |    "file_extension": ".py",
34 |    "mimetype": "text/x-python",
35 |    "name": "python",
36 |    "nbconvert_exporter": "python",
37 |    "pygments_lexer": "ipython3",
38 |    "version": "3.9.6"
39 |   }
40 |  },
41 |  "nbformat": 4,
42 |  "nbformat_minor": 5
43 | }
44 | 


--------------------------------------------------------------------------------
/book/reinforce/essential/state.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# State"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What state are you in?\n",
15 |     "\n",
16 |     "States are used to describe the conditions of a thing. In other words, if the state of an object is given, you can tell everything that you care about it. For example, for a ball, you may only care about its position and its velocity. But for a knife, you would also want to know where the blade faces in order to know if it's going to hurt you. That is, the position and velocity is a ball's state, while the state of a knife may be an additional info stored in state."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## State in deep learning.\n",
24 |     "\n",
25 |     "In deep learning, everything is modelled by a set of numbers (tensors). States are no exception. A state is often observed by a model as a tensor, and we call this tensor observation."
26 |    ]
27 |   }
28 |  ],
29 |  "metadata": {
30 |   "language_info": {
31 |    "name": "python"
32 |   }
33 |  },
34 |  "nbformat": 4,
35 |  "nbformat_minor": 2
36 | }
37 | 


--------------------------------------------------------------------------------
/book/reinforce/essential/action.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Action"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Do actions speak louder than words?\n",
15 |     "\n",
16 |     "Yes. Actions are taken to change the world around us, while words don't always do that. In RL's world, an action is used to change the environment, also known as move between different states. It's also possible that after an action is taken, the state is transitioned into itself."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Action in deep learning.\n",
24 |     "\n",
25 |     "Action is usually the output of the agent (that is modelled by a deep learning function). If the function is a classifier, then the action would be an integer showing which action to take. If the function is a regression model, then the action is merely the value of the output. Really, what an action is depends on how it is used, and anything can be considered an action so long as it makes the state change to a new one."
26 |    ]
27 |   }
28 |  ],
29 |  "metadata": {
30 |   "language_info": {
31 |    "name": "python"
32 |   }
33 |  },
34 |  "nbformat": 4,
35 |  "nbformat_minor": 2
36 | }
37 | 


--------------------------------------------------------------------------------
/book/layers/dropout/dropout.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Dropout Layer"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What does dropout layers do?\n",
15 |     "\n",
16 |     "Dropout layers throw things away. Now you would be asking, why would I want my model to throw data away? It turns out that throwing things away when training a model can drastically improve a model's performance in testing (where data is not throw away)."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## When to use dropout layers?\n",
24 |     "\n",
25 |     "When you feel like your model is overfitting the input, makes the probability of dropping out higher. Often you dropout as much as possible because dropout usually makes a model more robust to noisy inputs."
26 |    ]
27 |   }
28 |  ],
29 |  "metadata": {
30 |   "language_info": {
31 |    "codemirror_mode": {
32 |     "name": "ipython",
33 |     "version": 3
34 |    },
35 |    "file_extension": ".py",
36 |    "mimetype": "text/x-python",
37 |    "name": "python",
38 |    "nbconvert_exporter": "python",
39 |    "pygments_lexer": "ipython3",
40 |    "version": 3
41 |   }
42 |  },
43 |  "nbformat": 4,
44 |  "nbformat_minor": 2
45 | }
46 | 


--------------------------------------------------------------------------------
/book/_config.yml:
--------------------------------------------------------------------------------
 1 | # Book settings
 2 | # Learn more at https://jupyterbook.org/customize/config.html
 3 | 
 4 | title: Learning Machine
 5 | author: RenChu Wang
 6 | copyright: "RenChu Wang, 2021"
 7 | logo: images/logo.png
 8 | exclude_patterns: [_build]
 9 | only_build_toc_files: true
10 | 
11 | # Force re-execution of notebooks on each build.
12 | # See https://jupyterbook.org/content/execute.html
13 | execute:
14 |   execute_notebooks: force
15 | 
16 | # Define the name of the latex output file for PDF builds
17 | latex:
18 |   latex_documents:
19 |     targetname: book.tex
20 | 
21 | # Add a bibtex file so that we can create citations
22 | # bibtex_bibfiles:
23 | #   - references.bib
24 | 
25 | # Information about where the book exists on the web
26 | repository:
27 |   url: https://github.com/rentruewang/learning-machine # Online location of your book
28 |   path_to_book: book # Optional path to your book, relative to the repository root
29 |   branch: main # Which branch of the repository should be used when creating links (optional)
30 | 
31 | # Add GitHub buttons to your book
32 | # See https://jupyterbook.org/customize/config.html#add-a-link-to-your-repository
33 | html:
34 |   favicon: images/favicon.ico
35 |   home_page_in_navbar: true
36 |   use_edit_page_button: true
37 |   use_issues_button: true
38 |   use_repository_button: true
39 |   baseurl: https://rentruewang.github.io/learning-machine/
40 | 


--------------------------------------------------------------------------------
/book/layers/emb/emb.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Embedding Layer"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What do embedding layers do?\n",
15 |     "\n",
16 |     "Embedding layers convert a token (an integer) to a vector (a list of floating point numbers)."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## When to use embedding layers?\n",
24 |     "\n",
25 |     "When you want to process text. Text can be converted to integers, but because neural networks don't directly understand integers (because they are based on gradients, and you can't compute gradients on integers), you have to use embedding layers to convert those integers into a list of 'features'. For example, if one of the dimension shows how red the word is, then apple should score a lot higher than banana in that dimension."
26 |    ]
27 |   }
28 |  ],
29 |  "metadata": {
30 |   "language_info": {
31 |    "codemirror_mode": {
32 |     "name": "ipython",
33 |     "version": 3
34 |    },
35 |    "file_extension": ".py",
36 |    "mimetype": "text/x-python",
37 |    "name": "python",
38 |    "nbconvert_exporter": "python",
39 |    "pygments_lexer": "ipython3",
40 |    "version": 3
41 |   }
42 |  },
43 |  "nbformat": 4,
44 |  "nbformat_minor": 2
45 | }
46 | 


--------------------------------------------------------------------------------
/book/tasks/regression/auto/auto.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Auto Regression"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What does auto-regression mean?\n",
15 |     "\n",
16 |     "Auto regression refers to predicting the next token in time-sequence models. Given many previous tokens, predict the next one. Given \"How are you\", predict \"today\". Now you may be wondering, why is it called auto-regression?\n",
17 |     "\n",
18 |     "Regression is the process of predicting a value y' for an unseen input x', given many pairs of (x, y). Auto-regressive model's name comes from regression, do it does something similar. It takes in a sequence and predicts one output. Naturally, it's the next token that's going to be appended to the sequence!"
19 |    ]
20 |   }
21 |  ],
22 |  "metadata": {
23 |   "kernelspec": {
24 |    "display_name": "Python 3",
25 |    "language": "python",
26 |    "name": "python3"
27 |   },
28 |   "language_info": {
29 |    "codemirror_mode": {
30 |     "name": "ipython",
31 |     "version": 3
32 |    },
33 |    "file_extension": ".py",
34 |    "mimetype": "text/x-python",
35 |    "name": "python",
36 |    "nbconvert_exporter": "python",
37 |    "pygments_lexer": "ipython3",
38 |    "version": "3.9.5"
39 |   }
40 |  },
41 |  "nbformat": 4,
42 |  "nbformat_minor": 2
43 | }
44 | 


--------------------------------------------------------------------------------
/book/basics/gradients/loss-fn-derivative.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Do loss functions have to be differentiable?\n",
 8 |     "\n",
 9 |     "Yes and no. If the loss function is only used for evaluating the quality of the prediction (on an evaluation set), then it does not have to be differentiable. However, if the loss function is the objective of a gradient descent algorithm, then yes, it has to be differentiable to do that.\n"
10 |    ]
11 |   },
12 |   {
13 |    "cell_type": "markdown",
14 |    "metadata": {},
15 |    "source": [
16 |     "## Example of loss functions that don't need to be differentiable to be useful?\n",
17 |     "\n",
18 |     "F1 score, BLEU score, rewards in reinforcement learning. Those losses are for evaluation of how good a certain model performs on a set of data. Therefore, only used in comparing models, but not in directly training models."
19 |    ]
20 |   }
21 |  ],
22 |  "metadata": {
23 |   "kernelspec": {
24 |    "display_name": "Python 3",
25 |    "language": "python",
26 |    "name": "python3"
27 |   },
28 |   "language_info": {
29 |    "codemirror_mode": {
30 |     "name": "ipython",
31 |     "version": 3
32 |    },
33 |    "file_extension": ".py",
34 |    "mimetype": "text/x-python",
35 |    "name": "python",
36 |    "nbconvert_exporter": "python",
37 |    "pygments_lexer": "ipython3",
38 |    "version": "3.9.5"
39 |   }
40 |  },
41 |  "nbformat": 4,
42 |  "nbformat_minor": 2
43 | }
44 | 


--------------------------------------------------------------------------------
/book/layers/norm/norm.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Normalization Layer"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "\n",
15 |     "```{note}\n",
16 |     "We will refer to normalization layers as Norm layers.\n",
17 |     "```"
18 |    ]
19 |   },
20 |   {
21 |    "cell_type": "markdown",
22 |    "metadata": {},
23 |    "source": [
24 |     "## What do Norm layers do?\n",
25 |     "\n",
26 |     "Norm layers normalize the input. Normalization really helps stabilize training and improve training speed. Before norm layers are introduced, people had a very difficult time training huge models (big and deep) because of exploding/vanishing gradients, and normalization simply removes that issue mostly."
27 |    ]
28 |   },
29 |   {
30 |    "cell_type": "markdown",
31 |    "metadata": {},
32 |    "source": [
33 |     "## When to use Norm layers?\n",
34 |     "\n",
35 |     "Almost always. If you want to train a big and deep neural network, remember to normalize your input."
36 |    ]
37 |   }
38 |  ],
39 |  "metadata": {
40 |   "language_info": {
41 |    "codemirror_mode": {
42 |     "name": "ipython",
43 |     "version": 3
44 |    },
45 |    "file_extension": ".py",
46 |    "mimetype": "text/x-python",
47 |    "name": "python",
48 |    "nbconvert_exporter": "python",
49 |    "pygments_lexer": "ipython3",
50 |    "version": 3
51 |   }
52 |  },
53 |  "nbformat": 4,
54 |  "nbformat_minor": 2
55 | }
56 | 


--------------------------------------------------------------------------------
/book/layers/rnn/gru/gru.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Gated Linear Unit"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "A gated linear unit is often abbreviated as a **GRU**. Not to be confused with the one in Despicable Me!\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What are GRUs?\n",
24 |     "\n",
25 |     "GRU is a special kind of recurrent layer. It allows some input to pass the 'gate', but transform the other parts. The mechanism is highly inspired by LSTMs."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## When to use GRUs?\n",
33 |     "\n",
34 |     "Gate linear units are a lot like LSTMs. It is much less complicated compare to LSTM, so it's often used as a cheap replacement to LSTMs. Its performance is not too shabby, and it trains a lot faster compared to similar sized LSTM networks."
35 |    ]
36 |   }
37 |  ],
38 |  "metadata": {
39 |   "language_info": {
40 |    "codemirror_mode": {
41 |     "name": "ipython",
42 |     "version": 3
43 |    },
44 |    "file_extension": ".py",
45 |    "mimetype": "text/x-python",
46 |    "name": "python",
47 |    "nbconvert_exporter": "python",
48 |    "pygments_lexer": "ipython3",
49 |    "version": 3
50 |   }
51 |  },
52 |  "nbformat": 4,
53 |  "nbformat_minor": 2
54 | }
55 | 


--------------------------------------------------------------------------------
/book/tasks/tasks.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Types of tasks"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "All problems in deep learning fall into the following two types."
15 |    ]
16 |   },
17 |   {
18 |    "cell_type": "markdown",
19 |    "metadata": {},
20 |    "source": [
21 |     "## [Classification](./classification/classification)\n",
22 |     "\n",
23 |     "Classification asks a question: which one? Which type of animal is in your image? Which action is better?\n",
24 |     "In classification problems, you choose from several labels."
25 |    ]
26 |   },
27 |   {
28 |    "cell_type": "markdown",
29 |    "metadata": {},
30 |    "source": [
31 |     "## [Regression](./regression/regression)\n",
32 |     "\n",
33 |     "Regression asks another question: how much? What is the probability of raining tomorrow? How much do you think the stock is worth next year? (You probably shouldn't use a model to predict that) In regression problems, you guess a number (or several numbers)."
34 |    ]
35 |   }
36 |  ],
37 |  "metadata": {
38 |   "language_info": {
39 |    "codemirror_mode": {
40 |     "name": "ipython",
41 |     "version": 3
42 |    },
43 |    "file_extension": ".py",
44 |    "mimetype": "text/x-python",
45 |    "name": "python",
46 |    "nbconvert_exporter": "python",
47 |    "pygments_lexer": "ipython3",
48 |    "version": 3
49 |   }
50 |  },
51 |  "nbformat": 4,
52 |  "nbformat_minor": 2
53 | }
54 | 


--------------------------------------------------------------------------------
/book/layers/padding/padding.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Padding Layer"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What do padding layers do?\n",
15 |     "\n",
16 |     "Padding layers pad special values to your input. Usually padding layers are used on images, to pad the edge pixels such that the convolution filters can move all the way to the edge of the image.\n",
17 |     "There are several ways to pad. Most common ways include padding 0 values, or mirror padding. 0 padding is self-explanatory, and mirror padding is like having a little mirror on the edge of your image. The padded pixels 'mirror' the real pixels."
18 |    ]
19 |   },
20 |   {
21 |    "cell_type": "markdown",
22 |    "metadata": {},
23 |    "source": [
24 |     "## When to use padding layers?\n",
25 |     "\n",
26 |     "Almost always when you are using a convolution-based model. Due to the nature of sliding convolution filters, you don't really have an alternative to padding layers if you want the convolution window to slide all the way to the edge."
27 |    ]
28 |   }
29 |  ],
30 |  "metadata": {
31 |   "language_info": {
32 |    "codemirror_mode": {
33 |     "name": "ipython",
34 |     "version": 3
35 |    },
36 |    "file_extension": ".py",
37 |    "mimetype": "text/x-python",
38 |    "name": "python",
39 |    "nbconvert_exporter": "python",
40 |    "pygments_lexer": "ipython3",
41 |    "version": 3
42 |   }
43 |  },
44 |  "nbformat": 4,
45 |  "nbformat_minor": 2
46 | }
47 | 


--------------------------------------------------------------------------------
/book/layers/rnn/lstm/lstm.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Long Short Term Memory"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "Long short term memory is often abbreviated as **LSTM**.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What are LSTMs?\n",
24 |     "\n",
25 |     "LSTM is a special kind of recurrent layer. A human brain use both long-term memory and short-term memory to remember things, and LSTM is that idea in neural network. Its construct allows some input to be unprocessed by the layer (long term memory) while processing a portion of the input (short term memory)."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## When to use LSTMs?\n",
33 |     "\n",
34 |     "Compare to vanilla RNN (the one introduced in the previous section), almost always. Vanilla RNNs are too difficult to train because of gradient issues, while because LSTM allow some input to escape processing, it helps tremendously in that regard."
35 |    ]
36 |   }
37 |  ],
38 |  "metadata": {
39 |   "language_info": {
40 |    "codemirror_mode": {
41 |     "name": "ipython",
42 |     "version": 3
43 |    },
44 |    "file_extension": ".py",
45 |    "mimetype": "text/x-python",
46 |    "name": "python",
47 |    "nbconvert_exporter": "python",
48 |    "pygments_lexer": "ipython3",
49 |    "version": 3
50 |   }
51 |  },
52 |  "nbformat": 4,
53 |  "nbformat_minor": 2
54 | }
55 | 


--------------------------------------------------------------------------------
/book/layers/pooling/pooling.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Pooling Layer"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What are pooling layers?\n",
15 |     "\n",
16 |     "A pooling layer generates an output from a 'pool', such as a $ 2 \\times 2 $ block, usually by selecting from that block. For example, commonly used pooling layers include max-pooling, which retains the maximum of every $ 2 \\times 2 $ block. But there are also some fewer used layers like average pooling, which takes an weighted average from the block (acts a lot like a convolution layer)."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## When to use pooling layers?\n",
24 |     "\n",
25 |     "Use pooling layers when you want to drastically reduce the input size. For example, $ 2 \\times 2 $ blocks reduce the height and the width by 2 times. \n",
26 |     "Pooling layers also have the meaning of 'focusing on the best part'. When you're using a pooling layer, you usually don't care about individual points in the input, but whether there's something special in one region of the input."
27 |    ]
28 |   }
29 |  ],
30 |  "metadata": {
31 |   "language_info": {
32 |    "codemirror_mode": {
33 |     "name": "ipython",
34 |     "version": 3
35 |    },
36 |    "file_extension": ".py",
37 |    "mimetype": "text/x-python",
38 |    "name": "python",
39 |    "nbconvert_exporter": "python",
40 |    "pygments_lexer": "ipython3",
41 |    "version": 3
42 |   }
43 |  },
44 |  "nbformat": 4,
45 |  "nbformat_minor": 2
46 | }
47 | 


--------------------------------------------------------------------------------
/book/notice/data/underfit.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Underfit"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What does underfit mean?\n",
15 |     "\n",
16 |     "Underfitting is the opposite of overfitting. An underfitting models does not seem to perform well on the task you're training it on. It happens when you have too much data that's very diverse relative to model's size."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## How to deal with underfitting?"
24 |    ]
25 |   },
26 |   {
27 |    "cell_type": "markdown",
28 |    "metadata": {},
29 |    "source": [
30 |     "### Reduce some data.\n",
31 |     "\n",
32 |     "You don't need to use all of your data for training the model."
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "markdown",
37 |    "metadata": {},
38 |    "source": [
39 |     "### Increase model size.\n",
40 |     "\n",
41 |     "Tune the hyper-parameters. Making the model bigger such that it could learn more!"
42 |    ]
43 |   }
44 |  ],
45 |  "metadata": {
46 |   "kernelspec": {
47 |    "display_name": "Python 3",
48 |    "language": "python",
49 |    "name": "python3"
50 |   },
51 |   "language_info": {
52 |    "codemirror_mode": {
53 |     "name": "ipython",
54 |     "version": 3
55 |    },
56 |    "file_extension": ".py",
57 |    "mimetype": "text/x-python",
58 |    "name": "python",
59 |    "nbconvert_exporter": "python",
60 |    "pygments_lexer": "ipython3",
61 |    "version": "3.9.6"
62 |   }
63 |  },
64 |  "nbformat": 4,
65 |  "nbformat_minor": 2
66 | }
67 | 


--------------------------------------------------------------------------------
/book/layers/rnn/rnn.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Recurrent Layer"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "A neural network that use recurrent layers is often called a RNN, a recurrent neural network.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What do recurrent layers do?\n",
24 |     "\n",
25 |     "Recurrent layer transform an input many times. Everytime an input is transformed, the output is fed back into the recurrent layer to process again (with some new inputs)."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## When to use recurrent layers?\n",
33 |     "\n",
34 |     "Recurrent layers are really good at predicting what's happening next. Because it's like seeing all the past things and then predict the next thing that happens. We usually use it in sequence processing, whether that's text or voice or videos, anything that's time related."
35 |    ]
36 |   }
37 |  ],
38 |  "metadata": {
39 |   "kernelspec": {
40 |    "display_name": "Python 3",
41 |    "language": "python",
42 |    "name": "python3"
43 |   },
44 |   "language_info": {
45 |    "codemirror_mode": {
46 |     "name": "ipython",
47 |     "version": 3
48 |    },
49 |    "file_extension": ".py",
50 |    "mimetype": "text/x-python",
51 |    "name": "python",
52 |    "nbconvert_exporter": "python",
53 |    "pygments_lexer": "ipython3",
54 |    "version": "3.9.5"
55 |   }
56 |  },
57 |  "nbformat": 4,
58 |  "nbformat_minor": 2
59 | }
60 | 


--------------------------------------------------------------------------------
/book/layers/transformer/training/no-training/no-training.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "precise-amendment",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Using Bert without training?"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "ahead-palestinian",
14 |    "metadata": {},
15 |    "source": [
16 |     "It is possible. In fact, Facebook AI Research (FAIR) published papers on using it completely un-trained, randomly initialized, proving that the structure of transformer itself is enough to extract information (to some extent).\n",
17 |     "\n",
18 |     "However, the appeal of Bert is its readily available pretrained models, and using it without training it first (or train it yourself) kind of defeats the purpose.\n",
19 |     "\n",
20 |     "Bert is essentially a building block for your model, the idea behind Bert is that essentially, you can add very few layers (one linear layer achieved 85% in spam classification), and get a very good model, without training a lot. So, except the case when you are FAIR (Facebook AI Research), which released several papers about feature-extraction of completely untrained model, you would want to use a pre-trained version of Bert."
21 |    ]
22 |   }
23 |  ],
24 |  "metadata": {
25 |   "kernelspec": {
26 |    "display_name": "Python 3",
27 |    "language": "python",
28 |    "name": "python3"
29 |   },
30 |   "language_info": {
31 |    "codemirror_mode": {
32 |     "name": "ipython",
33 |     "version": 3
34 |    },
35 |    "file_extension": ".py",
36 |    "mimetype": "text/x-python",
37 |    "name": "python",
38 |    "nbconvert_exporter": "python",
39 |    "pygments_lexer": "ipython3",
40 |    "version": "3.9.5"
41 |   }
42 |  },
43 |  "nbformat": 4,
44 |  "nbformat_minor": 5
45 | }
46 | 


--------------------------------------------------------------------------------
/book/basics/loss/loss.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Loss Function"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## The need for loss functions\n",
15 |     "\n",
16 |     "Students take exams to evaluate how well they learn in school. Employees are evaluated by companies to measure their job performance. The thing is, sometimes we just need to know how good a person is in doing a certain task (learning/working). Loss functions are like exams to models. We use loss functions to evaluate models to quantitatively measure how they perform. This is a common metric used to measure model performance. We can compare which model performs better using this. We all want the best model to perform our task, don't we?"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What exactly is a loss function?\n",
24 |     "\n",
25 |     "A loss function takes in the decision made by a model, and returns a scalar value which we called the loss value. In deep learning, the lower the loss, the closer the predictions to the actual values, the better the model performs."
26 |    ]
27 |   }
28 |  ],
29 |  "metadata": {
30 |   "kernelspec": {
31 |    "display_name": "Python 3",
32 |    "language": "python",
33 |    "name": "python3"
34 |   },
35 |   "language_info": {
36 |    "codemirror_mode": {
37 |     "name": "ipython",
38 |     "version": 3
39 |    },
40 |    "file_extension": ".py",
41 |    "mimetype": "text/x-python",
42 |    "name": "python",
43 |    "nbconvert_exporter": "python",
44 |    "pygments_lexer": "ipython3",
45 |    "version": "3.9.5"
46 |   }
47 |  },
48 |  "nbformat": 4,
49 |  "nbformat_minor": 2
50 | }
51 | 


--------------------------------------------------------------------------------
/book/reinforce/essential/agent.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Agent"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Agent, is that you?\n",
15 |     "\n",
16 |     "Anything is an agent in RL settings. An agent make things happen around it, and make changes to the environment. You have a big carbon footprint? You're changing the earth, so even you are an agent in RL's world!"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Agent in deep learning.\n",
24 |     "\n",
25 |     "An agent in deep learning is nothing different from a model. We've learnt many different models, and they can all be an agent if we apply the correct RL setting. For example, the cat/dog classifier. If we design the algorithm such that the cat/dog classifier prints 'cat' when a cat is found, and 'dog' when a dog is seen, it is changing the computer's state (its environment)!\n",
26 |     "\n",
27 |     "However, usually we choose a classifier for environments where only a few actions are allowed (like key pressing), and regressive models for tasks where any action can be selected (like steering a wheel)."
28 |    ]
29 |   }
30 |  ],
31 |  "metadata": {
32 |   "kernelspec": {
33 |    "display_name": "Python 3",
34 |    "language": "python",
35 |    "name": "python3"
36 |   },
37 |   "language_info": {
38 |    "codemirror_mode": {
39 |     "name": "ipython",
40 |     "version": 3
41 |    },
42 |    "file_extension": ".py",
43 |    "mimetype": "text/x-python",
44 |    "name": "python",
45 |    "nbconvert_exporter": "python",
46 |    "pygments_lexer": "ipython3",
47 |    "version": "3.9.6"
48 |   }
49 |  },
50 |  "nbformat": 4,
51 |  "nbformat_minor": 2
52 | }
53 | 


--------------------------------------------------------------------------------
/book/basics/model/model.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Model"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why do we need models?\n",
15 |     "\n",
16 |     "We use our brain to process and understand the world around us. Think about what a brain does. First, it takes in some data, which we call senses on our surroundings. The five human senses are sight, sound, smell, taste and touch. These are all data that our brain processes. Secondly, it performs some actions. Like our hands can respond to our will and our eyes can roll to see what's beside us. Models are brains for machines. After all, we're building smart machines."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What is a model?\n",
24 |     "\n",
25 |     "A model mimics how human makes decisions. So a model takes in some data (that it uses to make decisions), and outputs some other data (the decisions).\n",
26 |     "\n",
27 |     "A model is basically a function that makes decisions. It takes in some input, and outputs some decision. Is the object a cat or a dog? The model will decide based on what it was trained on."
28 |    ]
29 |   }
30 |  ],
31 |  "metadata": {
32 |   "kernelspec": {
33 |    "display_name": "Python 3",
34 |    "language": "python",
35 |    "name": "python3"
36 |   },
37 |   "language_info": {
38 |    "codemirror_mode": {
39 |     "name": "ipython",
40 |     "version": 3
41 |    },
42 |    "file_extension": ".py",
43 |    "mimetype": "text/x-python",
44 |    "name": "python",
45 |    "nbconvert_exporter": "python",
46 |    "pygments_lexer": "ipython3",
47 |    "version": "3.9.5"
48 |   }
49 |  },
50 |  "nbformat": 4,
51 |  "nbformat_minor": 2
52 | }
53 | 


--------------------------------------------------------------------------------
/book/layers/cnn/cnn.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Convolution Layer"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "A neural network that use convolution layers is often called a CNN, a convolution neural network.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What do convolution layers do?\n",
24 |     "\n",
25 |     "Taking image processing for example, a convolution layer compute the local features (whether there's an edge here, a dot there, etc), and aggregate the result into another image, which can sometimes be visualized.\n",
26 |     "There are different kinds of convolution layers. We often use the 1-dimension version for text processing, 2-dimension version for image processing (most common)."
27 |    ]
28 |   },
29 |   {
30 |    "cell_type": "markdown",
31 |    "metadata": {},
32 |    "source": [
33 |     "## When to use a convolution layer?\n",
34 |     "\n",
35 |     "Convolution layers are often used to directly process inputs, which means they are often places in the front of a model.\n",
36 |     "Use of convolution in image processing is very common (usually paired with linear layers and pooling layers for classification). Convolution layers are also quite common in text processing."
37 |    ]
38 |   }
39 |  ],
40 |  "metadata": {
41 |   "language_info": {
42 |    "codemirror_mode": {
43 |     "name": "ipython",
44 |     "version": 3
45 |    },
46 |    "file_extension": ".py",
47 |    "mimetype": "text/x-python",
48 |    "name": "python",
49 |    "nbconvert_exporter": "python",
50 |    "pygments_lexer": "ipython3",
51 |    "version": 3
52 |   }
53 |  },
54 |  "nbformat": 4,
55 |  "nbformat_minor": 2
56 | }
57 | 


--------------------------------------------------------------------------------
/book/tasks/classification/classification.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Classification"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why classification?\n",
15 |     "\n",
16 |     "There are many times where you have to make a decision. Either A or B. Either Emily or Tom (It's 2021. It's ok to be bisexual). This either ... or ... theme is the heart of classification problems."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## How does classification work?\n",
24 |     "\n",
25 |     "Let's say we are classifying an image. The image is said to be either a cat or a dog. But which one? If we are given the information that there are only black cats and white dogs in the image, we would probably apply a heuristic saying that if the object in the image is bright, then it's likely a dog. If it's dark then it's a cat. In machine words, if the image's average is bright (assuming that the cat/dog will occupy the majority of the image), then it's a dog. Or else it's a cat.\n",
26 |     "\n",
27 |     "What we did above is basically mapping from an image to 2 labels, cat and dog. This is how all models solving classification problem work. If you can find a model that maps the input image (or sound or anything) to your desired label, then it's a good classifier."
28 |    ]
29 |   }
30 |  ],
31 |  "metadata": {
32 |   "language_info": {
33 |    "codemirror_mode": {
34 |     "name": "ipython",
35 |     "version": 3
36 |    },
37 |    "file_extension": ".py",
38 |    "mimetype": "text/x-python",
39 |    "name": "python",
40 |    "nbconvert_exporter": "python",
41 |    "pygments_lexer": "ipython3",
42 |    "version": 3
43 |   }
44 |  },
45 |  "nbformat": 4,
46 |  "nbformat_minor": 2
47 | }
48 | 


--------------------------------------------------------------------------------
/book/basics/data/data.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Data"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why do we need data?\n",
15 |     "\n",
16 |     "A person born blind will never understand light. A person born deaf will never know what music sounds like. The point is, as smart as humans are, we need real world data in order to know what real world looks like (Obviously, duh!). The same can be said for machines."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What are data?\n",
24 |     "\n",
25 |     "Data can be anything that you know. Images, text, or even videos. Or just numbers. Data are there, so that machine knows what to learn.\n",
26 |     "\n",
27 |     "When we ask machine to learn stuff, a common approach is we provide data that come in pairs (the input and the desired output). We call this supervised learning where we 'supervise' the machine with the correct answers. There are cases where data are not in pairs, we call this unsupervised learning. In unsupervised learning, machines have no idea what the desired output is. The machine will have to figure out something based on the pattern in the given input. We'll discuss about these topics later."
28 |    ]
29 |   }
30 |  ],
31 |  "metadata": {
32 |   "kernelspec": {
33 |    "display_name": "Python 3",
34 |    "language": "python",
35 |    "name": "python3"
36 |   },
37 |   "language_info": {
38 |    "codemirror_mode": {
39 |     "name": "ipython",
40 |     "version": 3
41 |    },
42 |    "file_extension": ".py",
43 |    "mimetype": "text/x-python",
44 |    "name": "python",
45 |    "nbconvert_exporter": "python",
46 |    "pygments_lexer": "ipython3",
47 |    "version": "3.9.5"
48 |   }
49 |  },
50 |  "nbformat": 4,
51 |  "nbformat_minor": 2
52 | }
53 | 


--------------------------------------------------------------------------------
/book/reinforce/policy/policy.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Policy"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Policy in RL.\n",
15 |     "\n",
16 |     "An agent makes a decision according to its **policy**. For humans, brain is our policy because it observes what environment we are in (in a swimming pool!) and what actions we will react accordingly (swim)."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "```{warning}\n",
24 |     "Incoming math!\n",
25 |     "```"
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "Formally, policy $ \\pi $ is a function that takes in the state $ s $, and outputs an action $ a $, which the agent takes and arrive at a new state $ s' $."
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "markdown",
37 |    "metadata": {},
38 |    "source": [
39 |     "## Are policies the same thing as agents?\n",
40 |     "\n",
41 |     "Agents are different from policies. An agent acts according a policy, which can be shared between agents. For example, in a futuristic world, you can control clones of your body. You have multiple bodies, but only a single brain. In such a case, the brain is the policy, and the agents are your bodies."
42 |    ]
43 |   }
44 |  ],
45 |  "metadata": {
46 |   "kernelspec": {
47 |    "display_name": "Python 3",
48 |    "language": "python",
49 |    "name": "python3"
50 |   },
51 |   "language_info": {
52 |    "codemirror_mode": {
53 |     "name": "ipython",
54 |     "version": 3
55 |    },
56 |    "file_extension": ".py",
57 |    "mimetype": "text/x-python",
58 |    "name": "python",
59 |    "nbconvert_exporter": "python",
60 |    "pygments_lexer": "ipython3",
61 |    "version": "3.9.6"
62 |   }
63 |  },
64 |  "nbformat": 4,
65 |  "nbformat_minor": 2
66 | }
67 | 


--------------------------------------------------------------------------------
/book/basics/basics.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Holy Trinity for Machine Learning"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## I want to do machine learning. How?\n",
15 |     "\n",
16 |     "Machine learning systems may look very different, but under the hood, every machine learning system are made up of three essential parts."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "### [Data](./data/data)\n",
24 |     "\n",
25 |     "We are talking about machine _learning_. Without data, what's there to learn?"
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "### [Model](./model/model)\n",
33 |     "\n",
34 |     "The task of machine learning is to have machines to make decisions for us. Without a model to do these, who will make the decision?"
35 |    ]
36 |   },
37 |   {
38 |    "cell_type": "markdown",
39 |    "metadata": {},
40 |    "source": [
41 |     "### [Loss](./loss/loss)\n",
42 |     "\n",
43 |     "Whether it's playing a game or driving a car, we want our model to perform well on a task. You have some data, and a model to make predictions based on the inputs. You need to measure how good your model is. Without loss functions, you do not know how well your model performs."
44 |    ]
45 |   }
46 |  ],
47 |  "metadata": {
48 |   "kernelspec": {
49 |    "display_name": "Python 3",
50 |    "language": "python",
51 |    "name": "python3"
52 |   },
53 |   "language_info": {
54 |    "codemirror_mode": {
55 |     "name": "ipython",
56 |     "version": 3
57 |    },
58 |    "file_extension": ".py",
59 |    "mimetype": "text/x-python",
60 |    "name": "python",
61 |    "nbconvert_exporter": "python",
62 |    "pygments_lexer": "ipython3",
63 |    "version": "3.9.5"
64 |   }
65 |  },
66 |  "nbformat": 4,
67 |  "nbformat_minor": 2
68 | }
69 | 


--------------------------------------------------------------------------------
/book/tasks/regression/regression.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Regression"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why regression?\n",
15 |     "\n",
16 |     "Well, not all cases are classification problems. There are times where you can easily separate things into labels. For example, if the probability of rain is 59% and 60%, you'll think that it's basically the same. However, getting 59 and 60 in an exam can mean failing the semester vs passing the semester. In these two cases, we say the first one is a regression problem and the second one a classification one. We need regression instead of classification when it just makes no sense to even try separate close numbers."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## How does regression work?\n",
24 |     "\n",
25 |     "Image we are predicting the stock market next year (again, not a wise thing to do, but a good example). You noticed in the previous years that the stock market index is related to the number of your hairs by the relation $ s = 0.05h - 3000 $, $ s $ is the index and $ h $ is the number of hair. You create a model and then make a profit, while doing hair transplants to ensure that the stock performs well.\n",
26 |     "\n",
27 |     "In this case, you created a model that maps some inputs (the number of hairs) to some outputs (in this case, the number of stock). If you can find a good model that does this mapping well, then you found a good regressor."
28 |    ]
29 |   }
30 |  ],
31 |  "metadata": {
32 |   "language_info": {
33 |    "codemirror_mode": {
34 |     "name": "ipython",
35 |     "version": 3
36 |    },
37 |    "file_extension": ".py",
38 |    "mimetype": "text/x-python",
39 |    "name": "python",
40 |    "nbconvert_exporter": "python",
41 |    "pygments_lexer": "ipython3",
42 |    "version": 3
43 |   }
44 |  },
45 |  "nbformat": 4,
46 |  "nbformat_minor": 2
47 | }
48 | 


--------------------------------------------------------------------------------
/book/unsupervised/decision-tree/decision-tree.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "d1f2392d",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Decision Tree"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "a68e4bdf",
14 |    "metadata": {},
15 |    "source": [
16 |     "## How to make a decision?\n",
17 |     "\n",
18 |     "Naturally, people make decisions based on a set of rules. If the sky looks gloomy, it's likely going to rain so taking the umbrella would be a nice idea. Decision tree is based on that idea. A good decision tree separate things into several groups, based on some features, and hopefully it could be able to understand each and every data."
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "id": "74b5ab9e",
24 |    "metadata": {},
25 |    "source": [
26 |     "## How do decision trees work?\n",
27 |     "\n",
28 |     "A good decision should tell you a lot of things. Decision trees are based on that idea. Decision trees assume that each class given to it is distinct, and it should be relatively easy to distinguish between each and every class. That is, it has to differentiate between every two cases, with as few decisions as possible.\n",
29 |     "\n",
30 |     "A decision tree works by maximizing the information obtained by a split. You don't need to know what that means yet, but in simple words, it splits the group as evenly as possible. That way, it can reach every base case fast, and it makes the decision tree more balanced (more even)."
31 |    ]
32 |   }
33 |  ],
34 |  "metadata": {
35 |   "kernelspec": {
36 |    "display_name": "Python 3",
37 |    "language": "python",
38 |    "name": "python3"
39 |   },
40 |   "language_info": {
41 |    "codemirror_mode": {
42 |     "name": "ipython",
43 |     "version": 3
44 |    },
45 |    "file_extension": ".py",
46 |    "mimetype": "text/x-python",
47 |    "name": "python",
48 |    "nbconvert_exporter": "python",
49 |    "pygments_lexer": "ipython3",
50 |    "version": "3.9.5"
51 |   }
52 |  },
53 |  "nbformat": 4,
54 |  "nbformat_minor": 5
55 | }
56 | 


--------------------------------------------------------------------------------
/book/reinforce/essential/online-offline.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "b66a254c",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Online Methods vs Offline Methods"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "77954a20",
14 |    "metadata": {},
15 |    "source": [
16 |     "## Online methods.\n",
17 |     "\n",
18 |     "In RL, online methods refer to methods updating the model while interacting with the environment.\n",
19 |     "\n",
20 |     "**Pro**\n",
21 |     "- Mathematically easy because most RL theories consider this method.\n",
22 |     "- Is usually much simpler than offline methods.\n",
23 |     "- Online methods are more stable.\n",
24 |     "\n",
25 |     "**Con**\n",
26 |     "- Very data inefficient because each observation/state can be used once."
27 |    ]
28 |   },
29 |   {
30 |    "cell_type": "markdown",
31 |    "id": "ecb8cb51",
32 |    "metadata": {},
33 |    "source": [
34 |     "## Offline methods.\n",
35 |     "\n",
36 |     "In RL, offline methods refer to methods updating the model after having interacted with the environment.\n",
37 |     "\n",
38 |     "**Pro**\n",
39 |     "- Data efficient, the same trajectory (history data) can be used for many updates.\n",
40 |     "\n",
41 |     "**Con**\n",
42 |     "- Mathematically difficult because most RL theories are based on online methods.\n",
43 |     "- Is usually more complicated because it involves storing histories, re-weighting different history entries, and discarding old entries.\n",
44 |     "- Offline methods are less stable."
45 |    ]
46 |   }
47 |  ],
48 |  "metadata": {
49 |   "kernelspec": {
50 |    "display_name": "Python 3 (ipykernel)",
51 |    "language": "python",
52 |    "name": "python3"
53 |   },
54 |   "language_info": {
55 |    "codemirror_mode": {
56 |     "name": "ipython",
57 |     "version": 3
58 |    },
59 |    "file_extension": ".py",
60 |    "mimetype": "text/x-python",
61 |    "name": "python",
62 |    "nbconvert_exporter": "python",
63 |    "pygments_lexer": "ipython3",
64 |    "version": "3.9.5"
65 |   }
66 |  },
67 |  "nbformat": 4,
68 |  "nbformat_minor": 5
69 | }
70 | 


--------------------------------------------------------------------------------
/book/layers/transformer/training/teacher/teacher.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Teacher Forcing vs Scheduled Sampling vs Normal Mode"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "There are 3 ways of training:"
15 |    ]
16 |   },
17 |   {
18 |    "cell_type": "markdown",
19 |    "metadata": {},
20 |    "source": [
21 |     "## Normal mode\n",
22 |     "\n",
23 |     "This mode predicts the next token based on the sentence the model is generating. The benefit of this method is that it knows what to say even if the sentence being generated is rubbish. (Which can't be said for models trained for teacher-forcing)"
24 |    ]
25 |   },
26 |   {
27 |    "cell_type": "markdown",
28 |    "metadata": {},
29 |    "source": [
30 |     "## Teacher forcing\n",
31 |     "\n",
32 |     "This mode predicts the next token based on the correct input. The benefit of this method is that 1. it is trained on the correct labels (normal mode's label is generated by itself, not necessarily accurate always), and 2. it tends to prevent gradient explosion (especially in the case of RNN)."
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "markdown",
37 |    "metadata": {},
38 |    "source": [
39 |     "## Scheduled sampling\n",
40 |     "\n",
41 |     "If you find the above two ways too extreme, this is a compromise. It sometimes uses normal mode (the half-finished sentence the model is generating), and sometimes uses teacher-forcing (using the correct sentence to predict the next token)"
42 |    ]
43 |   }
44 |  ],
45 |  "metadata": {
46 |   "kernelspec": {
47 |    "display_name": "Python 3",
48 |    "language": "python",
49 |    "name": "python3"
50 |   },
51 |   "language_info": {
52 |    "codemirror_mode": {
53 |     "name": "ipython",
54 |     "version": 3
55 |    },
56 |    "file_extension": ".py",
57 |    "mimetype": "text/x-python",
58 |    "name": "python",
59 |    "nbconvert_exporter": "python",
60 |    "pygments_lexer": "ipython3",
61 |    "version": "3.9.5"
62 |   }
63 |  },
64 |  "nbformat": 4,
65 |  "nbformat_minor": 2
66 | }
67 | 


--------------------------------------------------------------------------------
/book/generative/ae/ae-semi.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "indie-charleston",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Improving Auto Encoders with Semi Supervised Training"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "innovative-involvement",
14 |    "metadata": {},
15 |    "source": [
16 |     "Semi-supervised training refers to training with mostly unlabeled data, with a small portion of that data labeled.\n",
17 |     "\n",
18 |     "Most of the time semi-supervised training can help the classifier learn a better latent representation, which in turns makes the decoder's job easier, which makes the entire system better."
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "id": "developing-protection",
24 |    "metadata": {},
25 |    "source": [
26 |     "## Using semi-supervised to aid AE's performance\n",
27 |     "\n",
28 |     "There are many ways to use classification to aid in the process of encoding. One straightforward way is to directly train a small linear classifier to classify the latent encoded by encoder. It's essentially encouraging the encoder to separate different images with different classes into different clusters, which makes decoding simpler.\n",
29 |     "\n",
30 |     "Another good example is **capsule network**. When training a capsule network, the magnitude of its latent vector is passed through softmax function, and used as the confidence different classes. The way of training makes capsule networks' decoders very good generators (for MNIST numbers), despite using only linear layers."
31 |    ]
32 |   }
33 |  ],
34 |  "metadata": {
35 |   "kernelspec": {
36 |    "display_name": "Python 3",
37 |    "language": "python",
38 |    "name": "python3"
39 |   },
40 |   "language_info": {
41 |    "codemirror_mode": {
42 |     "name": "ipython",
43 |     "version": 3
44 |    },
45 |    "file_extension": ".py",
46 |    "mimetype": "text/x-python",
47 |    "name": "python",
48 |    "nbconvert_exporter": "python",
49 |    "pygments_lexer": "ipython3",
50 |    "version": "3.9.5"
51 |   }
52 |  },
53 |  "nbformat": 4,
54 |  "nbformat_minor": 5
55 | }
56 | 


--------------------------------------------------------------------------------
/book/generative/ae/vae/vae.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Variational AutoEncoder Model"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "We will refer to Auto Encoders as AE.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "```{note}\n",
24 |     "We will refer to Variational Auto Encoders as VAE.\n",
25 |     "```"
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## What do you mean by variational?\n",
33 |     "\n",
34 |     "VAEs are normal AEs with a twist: the encoded vector is constrained to be a noisy Gaussian distribution. In VAEs, during encoding of images, the encoded latents are assumed to be the mean, `mean`, and standard deviation, `stddev`, of some distribution corresponding to the image, and noises are added to the `mean` latent, with scale `stddev` to ensure that the model is robust."
35 |    ]
36 |   },
37 |   {
38 |    "cell_type": "markdown",
39 |    "metadata": {},
40 |    "source": [
41 |     "### Are VAEs better?\n",
42 |     "\n",
43 |     "For generators, definitely. VAEs usually generates better than normal AEs do, because it has to learn to generate images when the latent is a little bit off (because of how the noises are added to `mean`).\n",
44 |     "\n",
45 |     "However, for training a good compression model, VAEs usually cannot reduce the latent size as aggressively as AEs because of the same reason, it has to be more robust so more information has to be passed through to ensure that."
46 |    ]
47 |   }
48 |  ],
49 |  "metadata": {
50 |   "kernelspec": {
51 |    "display_name": "Python 3",
52 |    "language": "python",
53 |    "name": "python3"
54 |   },
55 |   "language_info": {
56 |    "codemirror_mode": {
57 |     "name": "ipython",
58 |     "version": 3
59 |    },
60 |    "file_extension": ".py",
61 |    "mimetype": "text/x-python",
62 |    "name": "python",
63 |    "nbconvert_exporter": "python",
64 |    "pygments_lexer": "ipython3",
65 |    "version": "3.9.5"
66 |   }
67 |  },
68 |  "nbformat": 4,
69 |  "nbformat_minor": 2
70 | }
71 | 


--------------------------------------------------------------------------------
/book/notice/gradient/saddle.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Saddle point"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What is a saddle point?\n",
15 |     "\n",
16 |     "A saddle point is a point where all the slopes and derivatives are all zero, but is not the minimum (or maximum) of the loss function. When the parameter is very close to the saddle point, the gradient gets extremely close to zero, and may slow down training. We call this phenomenon _\"stuck in the saddle point\"_, though visually speaking, it should be called _sitting on the saddle_."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## When do saddle points appear?\n",
24 |     "\n",
25 |     "Saddle points appear where all the dimension has zero derivative. Usually this means that there's a local minimum or local maximum (which is less likely, as gradients should point away from maximums). However, there are also chances that this is a saddle point."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## Why not to worry about saddle points?\n",
33 |     "\n",
34 |     "Saddle points exist, no doubt. However, encountering one is very unlikely especially for very large nets. To have a chance of being stuck in a saddle point, we have to cross out fingers and hope that all (not some) parameters are stuck in their maximum. Sounds unlikely, right? Even if some parameters are stuck in their maximum, it usually does not matter when 99.999% of the parameters are in their minimum. (That maximum has to be huge!) With larger nets, it's even less likely that we encounter saddle points that do affect our training. So don't fear!"
35 |    ]
36 |   }
37 |  ],
38 |  "metadata": {
39 |   "language_info": {
40 |    "codemirror_mode": {
41 |     "name": "ipython",
42 |     "version": 3
43 |    },
44 |    "file_extension": ".py",
45 |    "mimetype": "text/x-python",
46 |    "name": "python",
47 |    "nbconvert_exporter": "python",
48 |    "pygments_lexer": "ipython3",
49 |    "version": 3
50 |   }
51 |  },
52 |  "nbformat": 4,
53 |  "nbformat_minor": 2
54 | }
55 | 


--------------------------------------------------------------------------------
/book/better/better.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Improvements to a model"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Areas to improve\n",
15 |     "\n",
16 |     "Now we know how to train a model. However, machine learning is so much more than training. How to train super fast on unseen data? How to make the model trained extremely small to fit in mobile devices? How to make a model perform well on multiple different domains? And how to explain how your model makes prediction?\n",
17 |     "\n",
18 |     "In this section we will walk over these questions and show you that, yes, we can make the model even better!"
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "metadata": {},
24 |    "source": [
25 |     "### [Meta learning](./meta/meta)\n",
26 |     "\n",
27 |     "It doesn't get any easier, you just have to get faster."
28 |    ]
29 |   },
30 |   {
31 |    "cell_type": "markdown",
32 |    "metadata": {},
33 |    "source": [
34 |     "### [Model compression](./compression/compression)\n",
35 |     "\n",
36 |     "It's not about being small, it's about doing more with less."
37 |    ]
38 |   },
39 |   {
40 |    "cell_type": "markdown",
41 |    "metadata": {},
42 |    "source": [
43 |     "### [Life long learning](./lll/lll)\n",
44 |     "\n",
45 |     "Wisdom is not a product of school but of the life long attempt to acquire it."
46 |    ]
47 |   },
48 |   {
49 |    "cell_type": "markdown",
50 |    "metadata": {},
51 |    "source": [
52 |     "### [Explainable AI](./explainable/explainable)\n",
53 |     "\n",
54 |     "Sometimes being understanding is more important than being right..."
55 |    ]
56 |   }
57 |  ],
58 |  "metadata": {
59 |   "kernelspec": {
60 |    "display_name": "Python 3",
61 |    "language": "python",
62 |    "name": "python3"
63 |   },
64 |   "language_info": {
65 |    "codemirror_mode": {
66 |     "name": "ipython",
67 |     "version": 3
68 |    },
69 |    "file_extension": ".py",
70 |    "mimetype": "text/x-python",
71 |    "name": "python",
72 |    "nbconvert_exporter": "python",
73 |    "pygments_lexer": "ipython3",
74 |    "version": "3.9.5"
75 |   }
76 |  },
77 |  "nbformat": 4,
78 |  "nbformat_minor": 2
79 | }
80 | 


--------------------------------------------------------------------------------
/book/generative/generative.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Generative Models"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why generative models?\n",
15 |     "\n",
16 |     "Humans can think of marvelous things. Look at Mozart's ability to create music, and Van Gough's prowess to paint! In machine learning words, we call these people **good generators**. They are able to generate stuff that are so good it's called art.\n",
17 |     "\n",
18 |     "Generative models strive to do the same for computers. It's very different from the normal classification/regression problem (even though the models still use them under the hood), because in classification/regression, you are reducing the input (an image, a voice command) to something simpler. For classification, it's a label. For regression, it's a handful of numbers. While generative models want to create output that are complicated, and real."
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "metadata": {},
24 |    "source": [
25 |     "## How do generative models generate things?\n",
26 |     "\n",
27 |     "A computer can simply generate things because it would look like garbage, a few real world data need to be used to tell the machine what are good real world examples.\n",
28 |     "\n",
29 |     "In training the generator, we usually use several ways to _correct_ the behavior of the model by providing incentive. For AE/VAE it's the reconstruction loss. For GAN it's the discriminator's loss. You don't need to know what it means yet, they will be introduced in the following sections, just have to bear in mind that even generative models require real world examples (contrary to common believes)."
30 |    ]
31 |   }
32 |  ],
33 |  "metadata": {
34 |   "kernelspec": {
35 |    "display_name": "Python 3",
36 |    "language": "python",
37 |    "name": "python3"
38 |   },
39 |   "language_info": {
40 |    "codemirror_mode": {
41 |     "name": "ipython",
42 |     "version": 3
43 |    },
44 |    "file_extension": ".py",
45 |    "mimetype": "text/x-python",
46 |    "name": "python",
47 |    "nbconvert_exporter": "python",
48 |    "pygments_lexer": "ipython3",
49 |    "version": "3.9.5"
50 |   }
51 |  },
52 |  "nbformat": 4,
53 |  "nbformat_minor": 2
54 | }
55 | 


--------------------------------------------------------------------------------
/book/unsupervised/unsupervised.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Unsupervised Learning"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Unsupervised learning?\n",
15 |     "\n",
16 |     "Most methods we cover are supervised learning. Supervised learning refers to using the dataset where the training data and their labels are both present. Unsupervised methods, on the other hand, do not use labels. They find out traits in different parts of the dataset all by themselves."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Most common unsupervised methods"
24 |    ]
25 |   },
26 |   {
27 |    "cell_type": "markdown",
28 |    "metadata": {},
29 |    "source": [
30 |     "### [Clustering](./clustering/clustering)\n",
31 |     "\n",
32 |     "Clustering clusters the data entry that are close to each other together."
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "markdown",
37 |    "metadata": {},
38 |    "source": [
39 |     "### [Decision tree](./decision-tree/decision-tree)\n",
40 |     "\n",
41 |     "Decision tree discriminate between different classes of points by making decisions."
42 |    ]
43 |   },
44 |   {
45 |    "cell_type": "markdown",
46 |    "metadata": {},
47 |    "source": [
48 |     "### [Self supervised](./self-supervised/self-supervised)\n",
49 |     "\n",
50 |     "Self supervised methods usually refer to MLM (Masked Language Model) training."
51 |    ]
52 |   },
53 |   {
54 |    "cell_type": "markdown",
55 |    "metadata": {},
56 |    "source": [
57 |     "### [Semi supervised](./semi-supervised/semi-supervised)\n",
58 |     "\n",
59 |     "Semi supervised methods are used when most data don't have labels but a small portion of the dataset does have labels."
60 |    ]
61 |   }
62 |  ],
63 |  "metadata": {
64 |   "kernelspec": {
65 |    "display_name": "Python 3",
66 |    "language": "python",
67 |    "name": "python3"
68 |   },
69 |   "language_info": {
70 |    "codemirror_mode": {
71 |     "name": "ipython",
72 |     "version": 3
73 |    },
74 |    "file_extension": ".py",
75 |    "mimetype": "text/x-python",
76 |    "name": "python",
77 |    "nbconvert_exporter": "python",
78 |    "pygments_lexer": "ipython3",
79 |    "version": "3.9.6"
80 |   }
81 |  },
82 |  "nbformat": 4,
83 |  "nbformat_minor": 2
84 | }
85 | 


--------------------------------------------------------------------------------
/book/layers/transformer/training/token/token.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Token\n",
 8 |     "\n",
 9 |     "A token is what makes up a sequence. You can tokenize on the word level. In such a case, \"Hello world.\" becomes [\"Hello\", \"World\", \".\"]. Or if you decide that words are too big and say that you want to tokenize on a character level, in which case the sequence becomes ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '.']. Notice that the space character is significant in the second case but is ignored in the first case.\n",
10 |     "\n",
11 |     "In neural machine translation (NMT), the transformer encoder will take the sequence of language A, and the encoder will output the probability distribution over the tokens in language B at each time step (in the case of autoregressive NMT). \n",
12 |     "\n",
13 |     "In autoregressive NMT, the decoder input will be the tokens previously generated by the decoder. Let's say that at time T=0, the decoder outputs a token 'P'. At the next time step T=1, the decoder generates another token 'Q' conditioning on the previously generated token 'P'. So how does the decoder condition on the previously generated 'P'? It simply takes 'P' as the decoder input. At T=2, the decoder input will be 'PQ'.\n",
14 |     "\n",
15 |     "The input sequence contains tokens; however, the transformer model can only take vectors (or tensors) as its input, so we need to convert each token in the input sequence into its corresponding vector, those vectors are called embedding vectors. If the vector corresponds to the input tokens (of the encoder, which would be the tokens of language A), then we call those vectors input embedding (vector).  If the vector corresponds to the output tokens (of the decoder, which would be the tokens of language B), then we call those vectors output embedding (vector).  "
16 |    ]
17 |   }
18 |  ],
19 |  "metadata": {
20 |   "kernelspec": {
21 |    "display_name": "Python 3",
22 |    "language": "python",
23 |    "name": "python3"
24 |   },
25 |   "language_info": {
26 |    "codemirror_mode": {
27 |     "name": "ipython",
28 |     "version": 3
29 |    },
30 |    "file_extension": ".py",
31 |    "mimetype": "text/x-python",
32 |    "name": "python",
33 |    "nbconvert_exporter": "python",
34 |    "pygments_lexer": "ipython3",
35 |    "version": "3.9.5"
36 |   }
37 |  },
38 |  "nbformat": 4,
39 |  "nbformat_minor": 2
40 | }
41 | 


--------------------------------------------------------------------------------
/book/reuse/distil/distil.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Knowledge Distillation"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "We will use the abbreviation of knowledge distilation, KD, in this chapter.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What is KD, and do we need it?\n",
24 |     "\n",
25 |     "KD refers to the transferring the knowledge of a model into another model. It is mainly used in transferring the knowledge of a big model with billions of parameters into a smaller model that's small enough to deploy on edge devices. The small model is encouraged to perform exactly the same as the bigger model by replicating the bigger model's output, all while being more efficient."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## Why don't we train a model from scratch?\n",
33 |     "\n",
34 |     "We could, of course, train a smaller model from scratch. However, bigger models have tendencies to do a better job in searching a better solution than smaller ones, and trying to replicate bigger models tends to perform better than training a smaller model from scratch, which often stuck in local optimum points."
35 |    ]
36 |   },
37 |   {
38 |    "cell_type": "markdown",
39 |    "metadata": {},
40 |    "source": [
41 |     "## The training flow of KD.\n",
42 |     "\n",
43 |     "1. First, train a **teacher network**.\n",
44 |     "    This network is much more complicated than the **student network** that we wish to train in the end.\n",
45 |     "2. Freeze the teacher network, and generate some inputs.\n",
46 |     "3. Use the frozen teacher network and inputs to generate some target outputs.\n",
47 |     "4. Use the input and output pairs to train the student network in a supervised manner."
48 |    ]
49 |   }
50 |  ],
51 |  "metadata": {
52 |   "kernelspec": {
53 |    "display_name": "Python 3 (ipykernel)",
54 |    "language": "python",
55 |    "name": "python3"
56 |   },
57 |   "language_info": {
58 |    "codemirror_mode": {
59 |     "name": "ipython",
60 |     "version": 3
61 |    },
62 |    "file_extension": ".py",
63 |    "mimetype": "text/x-python",
64 |    "name": "python",
65 |    "nbconvert_exporter": "python",
66 |    "pygments_lexer": "ipython3",
67 |    "version": "3.9.5"
68 |   }
69 |  },
70 |  "nbformat": 4,
71 |  "nbformat_minor": 2
72 | }
73 | 


--------------------------------------------------------------------------------
/book/notice/lr/lr.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Learning Rate"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "We will abbreviate learning rate as LR.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "```{note}\n",
24 |     "Learning rate is also called step size.\n",
25 |     "```"
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## What are LR?\n",
33 |     "\n",
34 |     "In vanilla gradient descent, with the LR $ \\eta $, parameter $ x $, loss function $ f $, the update formula is as follows:\n",
35 |     "\n",
36 |     "$$\n",
37 |     "x' = x - \\eta * \\frac{df}{dx}\n",
38 |     "$$\n",
39 |     "\n",
40 |     "Other optimizers are updated in a similar fashion. The takeaway is, LR is how much you step forward performing a gradient update. If LR is huge, you update the parameters more. If LR is tiny, you don't update the parameter as much."
41 |    ]
42 |   },
43 |   {
44 |    "cell_type": "markdown",
45 |    "metadata": {},
46 |    "source": [
47 |     "## How should I choose my LR?\n",
48 |     "\n",
49 |     "Initially, we want the LR to be as big as possible (because updates are faster) whenever possible. However, with big LRs, it's difficult to move into a precise location because it only takes big steps. With training progresses, we want to slightly reduce the LR such that we find a more finetuned solution. Some optimizers sort of do this internally (reducing step size by reducing the gradients' scale), but we can always use a learning rate scheduler if we want an explicit LR schedule."
50 |    ]
51 |   },
52 |   {
53 |    "cell_type": "markdown",
54 |    "metadata": {},
55 |    "source": [
56 |     "## Can different learning rates be used on different parameters at the same time?\n",
57 |     "\n",
58 |     "The learning rate of different parameters can be different. Algorithms like Adam and Adagrad explicitly re-scales the learning rate to achieve faster training."
59 |    ]
60 |   }
61 |  ],
62 |  "metadata": {
63 |   "language_info": {
64 |    "codemirror_mode": {
65 |     "name": "ipython",
66 |     "version": 3
67 |    },
68 |    "file_extension": ".py",
69 |    "mimetype": "text/x-python",
70 |    "name": "python",
71 |    "nbconvert_exporter": "python",
72 |    "pygments_lexer": "ipython3",
73 |    "version": 3
74 |   }
75 |  },
76 |  "nbformat": 4,
77 |  "nbformat_minor": 2
78 | }
79 | 


--------------------------------------------------------------------------------
/book/layers/transformer/transformer-vs-rnn.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Transformer vs RNN"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why are RNNs good?\n",
15 |     "\n",
16 |     "For many years RNNs are the undisputed champion in sequence processing. Sequences include texts, voice data, and all time-related data. The reason RNNs are so good is that it sees through all past things (or events), decides what's important to keep, then makes the next prediction. RNN works like a human mind does, seeing through past things to predict the future."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Why are RNNs not good enough?\n",
24 |     "\n",
25 |     "All kinds of RNNs suffer from gradient explosion/vanishing. That means it's very difficult to train large scale RNN, process over long sequences, or just continuously improve upon the result because bigger RNNs are not necessarily better.\n",
26 |     "\n",
27 |     "Also, because RNNs have to be trained on the sequence tokens in a one-by-one fashion, it's difficult to parallelize that and make it faster."
28 |    ]
29 |   },
30 |   {
31 |    "cell_type": "markdown",
32 |    "metadata": {},
33 |    "source": [
34 |     "## Why are transformers all the rage?\n",
35 |     "\n",
36 |     "Transformers are not RNNs. That mean, it doesn't suffer from all those weaknesses of RNNs like training slowly or unable to scale up. However, that's not the reason transformers have all the attention (pun intended) right now.\n",
37 |     "\n",
38 |     "The reason transformers are so popular started with Bert, a massive pretrained transformer based model that you can easily use for other tasks. Being pretrained means that you don't need to train it yourself, you can simply use the model as a preprocessor, a feature extractor, and train a much smaller model for your task. And Bert is the first wildly successful language processing model to do that."
39 |    ]
40 |   }
41 |  ],
42 |  "metadata": {
43 |   "kernelspec": {
44 |    "display_name": "Python 3",
45 |    "language": "python",
46 |    "name": "python3"
47 |   },
48 |   "language_info": {
49 |    "codemirror_mode": {
50 |     "name": "ipython",
51 |     "version": 3
52 |    },
53 |    "file_extension": ".py",
54 |    "mimetype": "text/x-python",
55 |    "name": "python",
56 |    "nbconvert_exporter": "python",
57 |    "pygments_lexer": "ipython3",
58 |    "version": "3.9.5"
59 |   }
60 |  },
61 |  "nbformat": 4,
62 |  "nbformat_minor": 2
63 | }
64 | 


--------------------------------------------------------------------------------
/book/better/meta/meta.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Meta Learning"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What's meta learning?\n",
15 |     "\n",
16 |     "Meta means referring to oneself. Meta learning means learning, for learning. Or learning in order to learn better, or so on. Meta learning focuses on help models learn faster on the first run, training faster without needing to worry about bad things. Learning new things fast and efficient --- That's the realm of meta learning."
17 |    ]
18 |   },
19 |   {
20 |    "attachments": {},
21 |    "cell_type": "markdown",
22 |    "metadata": {},
23 |    "source": [
24 |     "## Difference with Life Long Learning.\n",
25 |     "\n",
26 |     "Life long learning refers to a model learning and adapting to different environments and datasets through out the lifespan of the model (not just when it's trained). So, how's that different from meta learning?Life long learning focuses on the improvement of new tasks with the same model, in contrast, meta learning hopes to learn a good initial model that can learn many tasks fast."
27 |    ]
28 |   },
29 |   {
30 |    "cell_type": "markdown",
31 |    "metadata": {},
32 |    "source": [
33 |     "## How to meta learning?\n",
34 |     "\n",
35 |     "In traditional ML methods, we are given a lot of pairs of `input` and `target` data. To train a model, we initialize the model with some parameters, then _update the parameters towards the goal of minimization of the loss function between the target and the model's output._\n",
36 |     "\n",
37 |     "In meta learning however, instead of finding the global minimum right off the bat, we are searching for a location from which it's easy to find global minimum for each tasks. In other words, we are **pre-training** a model such that each sub-task can be trained really fast. For example, in _MAML_ and _Reptile_, two meta learning algorithms, models are updated towards minimizing each task, not just one single task in particular."
38 |    ]
39 |   }
40 |  ],
41 |  "metadata": {
42 |   "kernelspec": {
43 |    "display_name": "Python 3",
44 |    "language": "python",
45 |    "name": "python3"
46 |   },
47 |   "language_info": {
48 |    "codemirror_mode": {
49 |     "name": "ipython",
50 |     "version": 3
51 |    },
52 |    "file_extension": ".py",
53 |    "mimetype": "text/x-python",
54 |    "name": "python",
55 |    "nbconvert_exporter": "python",
56 |    "pygments_lexer": "ipython3",
57 |    "version": "3.9.6"
58 |   }
59 |  },
60 |  "nbformat": 4,
61 |  "nbformat_minor": 2
62 | }
63 | 


--------------------------------------------------------------------------------
/book/generative/ae/ae-arch.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "cultural-muslim",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Auto Encoder Architecture"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "light-audience",
14 |    "metadata": {},
15 |    "source": [
16 |     "## Should I design my AE's decoder to be symmetry to its encoder?\n",
17 |     "\n",
18 |     "An AE’s encoder and decoder will not need to be symmetric in terms of structure. However, symmetric model structures work well in practice, so it’s usually preferable to non-symmetric design.\n",
19 |     "\n",
20 |     "So the answer: Yes, no, maybe. It depends on how well your model is performing, which ultimately is down to experiments."
21 |    ]
22 |   },
23 |   {
24 |    "cell_type": "markdown",
25 |    "id": "hawaiian-brother",
26 |    "metadata": {},
27 |    "source": [
28 |     "## How to take an AE and say: this part is the encoder, and this part is the decoder?\n",
29 |     "\n",
30 |     "Usually the layer in AE with the smallest dimension is its latent. Encoders encodes inputs into latents, and decoder decodes latents into outputs.\n",
31 |     "\n",
32 |     "It helps to think in terms of information flowing through the network. Take a look at all auto-encoders. They usually consist of layers that scale down the input (encoder) and layers scaling up the input (decoders). Suppose that no information is lost (the decoder decodes the image perfectly), then the latent (vector created during the pass through the model) with the lowest dimension is most densely-packed with information, therefore, the best latent."
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "markdown",
37 |    "id": "official-benefit",
38 |    "metadata": {},
39 |    "source": [
40 |     "## Can we share latent between different decoders (maybe of different architecture)?\n",
41 |     "\n",
42 |     "Yes but no. It is feasible, but no-one does that.\n",
43 |     "\n",
44 |     "If you really want to do it, you can simply train a new decoder that maps the latent to an image in a supervised manner. But wait! There is already the original decoder that is trained in the same way. So why bother training a new model like that?"
45 |    ]
46 |   }
47 |  ],
48 |  "metadata": {
49 |   "kernelspec": {
50 |    "display_name": "Python 3",
51 |    "language": "python",
52 |    "name": "python3"
53 |   },
54 |   "language_info": {
55 |    "codemirror_mode": {
56 |     "name": "ipython",
57 |     "version": 3
58 |    },
59 |    "file_extension": ".py",
60 |    "mimetype": "text/x-python",
61 |    "name": "python",
62 |    "nbconvert_exporter": "python",
63 |    "pygments_lexer": "ipython3",
64 |    "version": "3.9.5"
65 |   }
66 |  },
67 |  "nbformat": 4,
68 |  "nbformat_minor": 5
69 | }
70 | 


--------------------------------------------------------------------------------
/book/layers/activation/tanh/tanh.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Hyperbolic Tangent"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{note}\n",
 15 |     "Hyperbolic tangent is also called tanh for short.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## Introduction\n",
 24 |     "\n",
 25 |     "Tanh is used when you want to limit the range of the output to $ (-1, 1) $. It looks like sigmoid."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## How does tanh look, and how it works in code?"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": null,
 38 |    "metadata": {},
 39 |    "outputs": [],
 40 |    "source": [
 41 |     "%matplotlib inline\n",
 42 |     "\n",
 43 |     "import numpy as np\n",
 44 |     "from matplotlib import pyplot as plt"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "def tanh(x):\n",
 54 |     "    exp = np.exp(x)\n",
 55 |     "    inv_exp = 1 / exp\n",
 56 |     "    return (exp - inv_exp) / (exp + inv_exp)"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "code",
 61 |    "execution_count": null,
 62 |    "metadata": {},
 63 |    "outputs": [],
 64 |    "source": [
 65 |     "x = np.arange(-10, 11)\n",
 66 |     "y = tanh(x)\n",
 67 |     "print(\"x =\", x)\n",
 68 |     "print(\"y =\", y)"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "code",
 73 |    "execution_count": null,
 74 |    "metadata": {},
 75 |    "outputs": [],
 76 |    "source": [
 77 |     "x = np.arange(-200, 210) / 20\n",
 78 |     "y = tanh(x)\n",
 79 |     "plt.plot(x, y)\n",
 80 |     "plt.show()"
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "markdown",
 85 |    "metadata": {},
 86 |    "source": [
 87 |     "Notice that the range of `tanh` function is in the range $ (-1, 1) $ instead of $ (0, 1) $ like `sigmoid`."
 88 |    ]
 89 |   }
 90 |  ],
 91 |  "metadata": {
 92 |   "kernelspec": {
 93 |    "display_name": "Python 3",
 94 |    "language": "python",
 95 |    "name": "python3"
 96 |   },
 97 |   "language_info": {
 98 |    "codemirror_mode": {
 99 |     "name": "ipython",
100 |     "version": 3
101 |    },
102 |    "file_extension": ".py",
103 |    "mimetype": "text/x-python",
104 |    "name": "python",
105 |    "nbconvert_exporter": "python",
106 |    "pygments_lexer": "ipython3",
107 |    "version": "3.9.5"
108 |   }
109 |  },
110 |  "nbformat": 4,
111 |  "nbformat_minor": 2
112 | }
113 | 


--------------------------------------------------------------------------------
/book/layers/activation/sigmoid/sigmoid.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Sigmoid"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## Introduction\n",
 15 |     "\n",
 16 |     "Sigmoid is often used to approximate probabilities, due to the fact that a sigmoid function's maximum value will never exceed $ 1 $, and will never get below $ 0 $."
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## Definition\n",
 24 |     "\n",
 25 |     "sigmoid($ x $) = $ \\frac{1}{1 + e^{-x}} $"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## How does sigmoid look, and how it works in code?"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": null,
 38 |    "metadata": {},
 39 |    "outputs": [],
 40 |    "source": [
 41 |     "%matplotlib inline\n",
 42 |     "\n",
 43 |     "import numpy as np\n",
 44 |     "from matplotlib import pyplot as plt"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "def sigmoid(x):\n",
 54 |     "    return 1 / (1 + np.exp(-x))"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": null,
 60 |    "metadata": {},
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "x = np.arange(-10, 11)\n",
 64 |     "y = sigmoid(x)\n",
 65 |     "print(\"x =\", x)\n",
 66 |     "print(\"y =\", y)"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "markdown",
 71 |    "metadata": {},
 72 |    "source": [
 73 |     "When the number gets big, the value of sigmoid gets very close to $ 1 $. When the number gets very negative, the value of sigmoid gets very close to $ 0 $."
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": null,
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "x = np.arange(-200, 210) / 20\n",
 83 |     "y = sigmoid(x)\n",
 84 |     "plt.plot(x, y)\n",
 85 |     "plt.show()"
 86 |    ]
 87 |   }
 88 |  ],
 89 |  "metadata": {
 90 |   "kernelspec": {
 91 |    "display_name": "Python 3",
 92 |    "language": "python",
 93 |    "name": "python3"
 94 |   },
 95 |   "language_info": {
 96 |    "codemirror_mode": {
 97 |     "name": "ipython",
 98 |     "version": 3
 99 |    },
100 |    "file_extension": ".py",
101 |    "mimetype": "text/x-python",
102 |    "name": "python",
103 |    "nbconvert_exporter": "python",
104 |    "pygments_lexer": "ipython3",
105 |    "version": "3.9.5"
106 |   }
107 |  },
108 |  "nbformat": 4,
109 |  "nbformat_minor": 2
110 | }
111 | 


--------------------------------------------------------------------------------
/book/reuse/transfer/tl-vs-da.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "270b6b05",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "# Transfer Learning vs Domain Adaptation"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "c36d380c",
14 |    "metadata": {},
15 |    "source": [
16 |     "## How is TL different from DA?\n",
17 |     "\n",
18 |     "In the previous section, we have introduced TL and DA as if they are the same thing. While there are some arguments whether DA is a part of TL, we lean towards treating them as cousins, similar, but different."
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "id": "16b35af8",
24 |    "metadata": {},
25 |    "source": [
26 |     "## Transfer learning definition.\n",
27 |     "\n",
28 |     "Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.\n",
29 |     "\n",
30 |     "-- Wikipedia"
31 |    ]
32 |   },
33 |   {
34 |    "cell_type": "markdown",
35 |    "id": "d99f285f",
36 |    "metadata": {},
37 |    "source": [
38 |     "## Domain adaptation definition.\n",
39 |     "\n",
40 |     "Domain adaptation is used when we aim at learning from a source data distribution, a well performing model on a different (but related) target data distribution. For example, an algorithm trained on news-wires might have to adapt to a new dataset of biomedical documents.\n",
41 |     "\n",
42 |     "-- Wikipedia"
43 |    ]
44 |   },
45 |   {
46 |    "cell_type": "markdown",
47 |    "id": "5d5469bf",
48 |    "metadata": {},
49 |    "source": [
50 |     "## In simple terms.\n",
51 |     "\n",
52 |     "See the difference? Transfer learning is used when you want to _transfer_ the knowledge of a model over to another model, while domain adaptation is used when you want to _adapt_ a model to what it has never seen before. Transfer learning is about keeping what a model knows, and domain adaptation is more about making a model work in a new environment.\n",
53 |     "\n",
54 |     "**In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distributions stays the same, but the labels change.**"
55 |    ]
56 |   }
57 |  ],
58 |  "metadata": {
59 |   "kernelspec": {
60 |    "display_name": "Python 3 (ipykernel)",
61 |    "language": "python",
62 |    "name": "python3"
63 |   },
64 |   "language_info": {
65 |    "codemirror_mode": {
66 |     "name": "ipython",
67 |     "version": 3
68 |    },
69 |    "file_extension": ".py",
70 |    "mimetype": "text/x-python",
71 |    "name": "python",
72 |    "nbconvert_exporter": "python",
73 |    "pygments_lexer": "ipython3",
74 |    "version": "3.9.5"
75 |   }
76 |  },
77 |  "nbformat": 4,
78 |  "nbformat_minor": 5
79 | }
80 | 


--------------------------------------------------------------------------------
/book/reinforce/policy/policy-gradient.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Policy Gradient"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Where can policy gradients be applied?\n",
15 |     "\n",
16 |     "Policy gradient is specifically used to update policy networks that outputs probability for each actions."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "```{warning}\n",
24 |     "Incoming math!\n",
25 |     "```"
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## The policy gradient in simple terms.\n",
33 |     "\n",
34 |     "First, we have a policy network $ \\pi $, the current state $ s $, reward for each action $ r_i $, and probability for each actions $ \\pi(s)_i $.\n",
35 |     "\n",
36 |     "The expected future reward (value function) of this state is thus\n",
37 |     "\n",
38 |     "$$\n",
39 |     "G = \\sum_i \\pi(s)_i r_i\n",
40 |     "$$\n",
41 |     "\n",
42 |     "To maximize the value function of this state, we wish to find the gradient of the expected future reward $ \\nabla G $, which is \n",
43 |     "\n",
44 |     "$$\n",
45 |     "\\nabla \\sum_i \\pi(s)_i r_i\n",
46 |     "$$\n",
47 |     "\n",
48 |     "Because rewards are scalars and determined by the environment,\n",
49 |     "\n",
50 |     "$$\n",
51 |     "\\nabla G = \\sum_i r_i \\nabla \\pi(s)_i\n",
52 |     "$$\n",
53 |     "\n",
54 |     "Since $ \\pi(s)_i $ is the probability of each action (which usually is non-zero), we divide and multiply by it in the previous equation\n",
55 |     "\n",
56 |     "$$\n",
57 |     "\\nabla G = \\sum_i r_i \\frac{ \\nabla \\pi(s)_i }{ \\pi(s)_i } \\pi(s)_i\n",
58 |     "$$\n",
59 |     "\n",
60 |     "We notice that this looks like the formula for expectation! So the equation is reduced to:\n",
61 |     "\n",
62 |     "$$\n",
63 |     "\\nabla G = E_{\\sim\\pi(s)}[ r_i \\frac{ \\nabla \\pi(s)_i }{ \\pi(s)_i } ]\n",
64 |     "$$\n",
65 |     "\n",
66 |     "Which is equivalent to\n",
67 |     "\n",
68 |     "$$\n",
69 |     "\\nabla G = E_{\\sim\\pi(s)}[ r_i \\nabla \\log \\pi(s)_i ]\n",
70 |     "$$\n",
71 |     "\n",
72 |     "Voila! This is the reason you see log-probability quite often in policy gradients."
73 |    ]
74 |   }
75 |  ],
76 |  "metadata": {
77 |   "kernelspec": {
78 |    "display_name": "Python 3 (ipykernel)",
79 |    "language": "python",
80 |    "name": "python3"
81 |   },
82 |   "language_info": {
83 |    "codemirror_mode": {
84 |     "name": "ipython",
85 |     "version": 3
86 |    },
87 |    "file_extension": ".py",
88 |    "mimetype": "text/x-python",
89 |    "name": "python",
90 |    "nbconvert_exporter": "python",
91 |    "pygments_lexer": "ipython3",
92 |    "version": "3.9.5"
93 |   }
94 |  },
95 |  "nbformat": 4,
96 |  "nbformat_minor": 2
97 | }
98 | 


--------------------------------------------------------------------------------
/book/notice/data/overfit.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Overfit"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What happens when a model overfit?\n",
15 |     "\n",
16 |     "An overfitting model seems to perform really well when you train it. In training it yields very small losses. However, it may perform terribly when evaluating the data."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## When does overfitting happen?\n",
24 |     "\n",
25 |     "Generally speaking, the more data you have, the bigger model you'll be using. Overfitting happens when you don't have enough data for your model. Or your model is too big for the task."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## How to deal with overfitting?"
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "markdown",
37 |    "metadata": {},
38 |    "source": [
39 |     "### Add data.\n",
40 |     "\n",
41 |     "Overfitting happens when data is too few relative to model size. So adding data will help."
42 |    ]
43 |   },
44 |   {
45 |    "cell_type": "markdown",
46 |    "metadata": {},
47 |    "source": [
48 |     "### Reduce model size."
49 |    ]
50 |   },
51 |   {
52 |    "cell_type": "markdown",
53 |    "metadata": {},
54 |    "source": [
55 |     "#### Reduce the hyper-parameters of the model.\n",
56 |     "\n",
57 |     "Making the model smaller will definitely help with solving overfitting. However, this is not a practical solution because to reduce the model's hyper-parameters, the model will have to be re-trained in the process, and that's a lot of computing power!"
58 |    ]
59 |   },
60 |   {
61 |    "cell_type": "markdown",
62 |    "metadata": {},
63 |    "source": [
64 |     "#### Regularization.\n",
65 |     "p\n",
66 |     "Regularization is a technique to make a model smaller than it is. Regularization makes the model smaller by paralyzing part of the model. For example: dropout layers purposefully drops out part of the model's inner layer output to make the inner layers appear smaller. L1 and L2 regularization induced penalties on big weights and encourages weights to go to zero so that the inner layer's effective nodes are reduced (zero stops the data from flowing)."
67 |    ]
68 |   }
69 |  ],
70 |  "metadata": {
71 |   "kernelspec": {
72 |    "display_name": "Python 3",
73 |    "language": "python",
74 |    "name": "python3"
75 |   },
76 |   "language_info": {
77 |    "codemirror_mode": {
78 |     "name": "ipython",
79 |     "version": 3
80 |    },
81 |    "file_extension": ".py",
82 |    "mimetype": "text/x-python",
83 |    "name": "python",
84 |    "nbconvert_exporter": "python",
85 |    "pygments_lexer": "ipython3",
86 |    "version": "3.9.6"
87 |   }
88 |  },
89 |  "nbformat": 4,
90 |  "nbformat_minor": 2
91 | }
92 | 


--------------------------------------------------------------------------------
/book/tasks/classification/multilabel/multilabel.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Multi Label Classification"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Multi-hot labels\n",
15 |     "\n",
16 |     "Multi-hot label is a simple extension of the one-hot label: In one-hot label, only the maximum value is selected. While in multi-hot label, the top $ n $ labels are selected at once. It's useful when all the labels are independent of one another, in such case, multi-hot labels can be treated as many independent binary classifiers (every label is a yes/no)."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Deep learning based methods\n",
24 |     "\n",
25 |     "The best deep learning based multi-classification method is to run a `Seq2Seq` model over the input, and convert that input into a sequence of labels. It sounds stupid, I know, but sometimes it performs surprisingly well."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## Why are Seq2Seq sometimes more preferable to Multi-hot labels?\n",
33 |     "\n",
34 |     "Because multi-hot models (treating every label as a binary classification problem) don't take into account the relationships between labels.\n",
35 |     "\n",
36 |     "If we treat a multi-classification problem as several different binary classification problems, in most cases it works well, but in cases where the labels are correlated, it works less well.\n",
37 |     "\n",
38 |     "Example question:\n",
39 |     "\n",
40 |     "When classifying an image, we may have several labels: [apples, oranges, banana, lemon]. We want our model to output two labels to describe the current picture at hand. Suppose we are treating each label as a different binary classification problem.\n",
41 |     "\n",
42 |     "Example case:\n",
43 |     "\n",
44 |     "We show the model a picture of a banana, the model could tell us the probabilities of each label as [8%, 23%, 98%, 67%]. In this case, the model says that the picture shows a picture that looks is a banana \"and\" lemon. But how can a fruit be both a banana and lemon? The model is not making any sense!\n",
45 |     "\n",
46 |     "However, if the problem is about what the fruit looks like, then banana \"or\" lemon is probably a good answer."
47 |    ]
48 |   }
49 |  ],
50 |  "metadata": {
51 |   "kernelspec": {
52 |    "display_name": "Python 3",
53 |    "language": "python",
54 |    "name": "python3"
55 |   },
56 |   "language_info": {
57 |    "codemirror_mode": {
58 |     "name": "ipython",
59 |     "version": 3
60 |    },
61 |    "file_extension": ".py",
62 |    "mimetype": "text/x-python",
63 |    "name": "python",
64 |    "nbconvert_exporter": "python",
65 |    "pygments_lexer": "ipython3",
66 |    "version": "3.9.5"
67 |   }
68 |  },
69 |  "nbformat": 4,
70 |  "nbformat_minor": 2
71 | }
72 | 


--------------------------------------------------------------------------------
/book/generative/ae/ae.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# AutoEncoder Model"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why autoencoders?\n",
15 |     "\n",
16 |     "Machine learning is all about information. AutoEncoder was initially developed to capture that information.\n",
17 |     "\n",
18 |     "What do you mean by information? Well, information is the most crucial part of data. For example, if I'm trying to tell you: \"You are a beautiful human being\". You'll likely understand me when I say like a caveman: \"You, beautiful, human!\". Notice that the second sentence is much shorter than the first one, yet it's perfectly understandable by any person (and caveman). We say that the amount of information is the same in the first one and the second one. Weirdly enough, scientists like the second one (caveman version) better because it is shorter and takes less space on you computer.\n",
19 |     "\n",
20 |     "What an autoencoder does is no different. It takes in an input (image etc) and convert it into a more compact format."
21 |    ]
22 |   },
23 |   {
24 |    "cell_type": "markdown",
25 |    "metadata": {},
26 |    "source": [
27 |     "## How autoencoders do this?\n",
28 |     "\n",
29 |     "Autoencoders do this by first reduce the size of the input array with layers after layers of processing, and then increase the size of the array also by multiple layers, until the size of the array is the same as the input. The array is then compared to the original input. If the difference is small, then congratulations, you have a good model (because it is able to preserve information). In this case, we take the smallest array along the processing path and say that the array is the most compressed (think in data flowing through the model). We call the compressed vector a **latent representation**. We separate the model into two parts, the part that processes the input into the compressed input is called **encoder**, and the part decompressing the compressed input is called **decoder**."
30 |    ]
31 |   },
32 |   {
33 |    "cell_type": "markdown",
34 |    "metadata": {},
35 |    "source": [
36 |     "## Why are autoencoders generative models?\n",
37 |     "\n",
38 |     "The decoder part of an autoencoder is a generative model.\n",
39 |     "\n",
40 |     "Why? Well, take an autoencoder and cover the encoder. You see that the decoder takes in a lower dimension input and gives a high dimension output that looks like real world examples. That's basically what every generative model does!"
41 |    ]
42 |   }
43 |  ],
44 |  "metadata": {
45 |   "kernelspec": {
46 |    "display_name": "Python 3",
47 |    "language": "python",
48 |    "name": "python3"
49 |   },
50 |   "language_info": {
51 |    "codemirror_mode": {
52 |     "name": "ipython",
53 |     "version": 3
54 |    },
55 |    "file_extension": ".py",
56 |    "mimetype": "text/x-python",
57 |    "name": "python",
58 |    "nbconvert_exporter": "python",
59 |    "pygments_lexer": "ipython3",
60 |    "version": "3.9.5"
61 |   }
62 |  },
63 |  "nbformat": 4,
64 |  "nbformat_minor": 2
65 | }
66 | 


--------------------------------------------------------------------------------
/book/better/explainable/explainable.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Explainable AI"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why?\n",
15 |     "\n",
16 |     "Well, we want to know why. Why does a machine think that way. When AlphaGo is placing a stone, what value does it see in that move? When a model says that a fish is a human, what does he see in that fish that makes it a human? Being able to explain how a model works really helps us people understand, and improve on existing systems."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Main approaches for explainable AI systems\n",
24 |     "\n",
25 |     "Explainable AI systems tend to "
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "### Just explain it\n",
33 |     "\n",
34 |     "Models like decision tree is easy to explain, because it's just making decisions along the way. It's fairly easy to see what goes wrong along the multiple decisions. Or models like KNN, you know how your model clusters things because it follows an algorithm that you understands.\n",
35 |     "\n",
36 |     "Most of the models that are directly explainable are based on some algorithms, as opposed to numeric models (that crunch numbers), which deep learning models are."
37 |    ]
38 |   },
39 |   {
40 |    "cell_type": "markdown",
41 |    "metadata": {},
42 |    "source": [
43 |     "### Removing a part, and see what gets affected\n",
44 |     "\n",
45 |     "If a panda does not have his/her head, is he/she still a human? Probably. However, if it is doesn't have its belly, it probably looks like a zebra. Sometimes when you want to see what part affects the decision of the model the most, you simply remove that part and see if the model changes its mind. In the above case, belly of a panda is much more characteristic of a panda than the head of a panda, because your model relies on it to make the correct decision."
46 |    ]
47 |   },
48 |   {
49 |    "cell_type": "markdown",
50 |    "metadata": {},
51 |    "source": [
52 |     "### See what parts of the input yield the biggest gradients\n",
53 |     "\n",
54 |     "We love using gradients in deep learning. Gradients of inputs mean the amount of change to the output if input changes in the direction of the gradients. So if the gradients of input is huge in some region, it probably means that that part of the input is very important. This technique is very popular in images, and has a special name: saliency map."
55 |    ]
56 |   }
57 |  ],
58 |  "metadata": {
59 |   "kernelspec": {
60 |    "display_name": "Python 3",
61 |    "language": "python",
62 |    "name": "python3"
63 |   },
64 |   "language_info": {
65 |    "codemirror_mode": {
66 |     "name": "ipython",
67 |     "version": 3
68 |    },
69 |    "file_extension": ".py",
70 |    "mimetype": "text/x-python",
71 |    "name": "python",
72 |    "nbconvert_exporter": "python",
73 |    "pygments_lexer": "ipython3",
74 |    "version": "3.9.6"
75 |   }
76 |  },
77 |  "nbformat": 4,
78 |  "nbformat_minor": 2
79 | }
80 | 


--------------------------------------------------------------------------------
/book/basics/gradients/gradients.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Gradients"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Why are gradients useful?\n",
15 |     "\n",
16 |     "Gradients are useful because it gives us information about the surface around the parameter. Gradient always points upwards, it tells us the direction in order to maximize the result.\n",
17 |     "\n",
18 |     "There are many optimization methods rely on the existence of gradients. The most prominent one is the one used in deep learning, gradient descent. That's why we need differentiable functions, functions that we can take gradients of, in deep learning in general."
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "metadata": {},
24 |    "source": [
25 |     "## Gradient descent.\n",
26 |     "\n",
27 |     "In deep learning models are optimized with a technique called gradient descent. Basically what gradient descent tries to do is to move along the opposite direction of the gradient, which always points **up**, so moving against this direction reduces the loss value.\n",
28 |     "\n",
29 |     "The simplest form of gradient descent is:\n",
30 |     "\n",
31 |     "$$\n",
32 |     "\\theta' = \\theta - \\eta \\frac{d f}{d \\theta}\n",
33 |     "$$\n",
34 |     "\n",
35 |     "where $ \\theta' $ denotes the new value of $ \\theta $, and $ \\eta $ is what we call **learning rate** or **step size**. It's usually a small, positive number."
36 |    ]
37 |   },
38 |   {
39 |    "cell_type": "markdown",
40 |    "metadata": {},
41 |    "source": [
42 |     "## How to determine if a function is differentiable?\n",
43 |     "\n",
44 |     "There are so many functions in the world. Most of them we don't know the gradients of. In those cases, we simply can't take the gradients of those functions (at least in a computer program). \n",
45 |     "\n",
46 |     "However, if we can approximate the function well enough with a differentiable function, then we can suddenly take the gradients from the function. For instance, `sin` and `cos` function are approximated using Taylor's expansion series.\n",
47 |     "\n",
48 |     "However, there are functions that we just don't know how to calculate the gradients for. In the context of deep learning, those functions are just never used. So no worries.\n",
49 |     "\n",
50 |     "The rule of thumb is, if a function is differentiable, is composed of differentiable functions, or is approximated by a composed differentiable function, then it's differentiable. Chances are you can use them in deep learning. Other functions, no luck."
51 |    ]
52 |   }
53 |  ],
54 |  "metadata": {
55 |   "kernelspec": {
56 |    "display_name": "Python 3 (ipykernel)",
57 |    "language": "python",
58 |    "name": "python3"
59 |   },
60 |   "language_info": {
61 |    "codemirror_mode": {
62 |     "name": "ipython",
63 |     "version": 3
64 |    },
65 |    "file_extension": ".py",
66 |    "mimetype": "text/x-python",
67 |    "name": "python",
68 |    "nbconvert_exporter": "python",
69 |    "pygments_lexer": "ipython3",
70 |    "version": "3.9.5"
71 |   }
72 |  },
73 |  "nbformat": 4,
74 |  "nbformat_minor": 2
75 | }
76 | 


--------------------------------------------------------------------------------
/book/generative/gan/gan.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Generative Adversarial Models"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "Generative adversarial networks are often abbreviated as GANs.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What do GANs do?\n",
24 |     "\n",
25 |     "GANs are used to generate things. A face generator may use GANs to generate new faces. A property owner may use GANs to generate the images of their property. GANs are used when we need almost-good-as-real data. Such as a fake photo, fake song etc."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## How do GANs work?\n",
33 |     "\n",
34 |     "Generating things are not easy. Even us humans aren't that good at producing things. Think of the last time you try to create a new song or paint. It's really difficult.\n",
35 |     "\n",
36 |     "GANs do this by using competition. There are two players in this game: A **generator** and a **discriminator**. A generator is used to generate your target image/audio/video etc. A discriminator is used to tell how good the result is: it tries to tell the real images from the fake ones.\n",
37 |     "\n",
38 |     "Initially, both generator and discriminator are bad at doing their jobs. A generator generates things that are very rough. Discriminator has no experiences at all. However, as times go on, both get more experiences. Generators try its best to fool the discriminator, and discriminator try to discriminate the real images from the fake ones. The end result is that generators get so good at creating good images that the images it creates can fool us humans (but maybe not discriminators)!"
39 |    ]
40 |   },
41 |   {
42 |    "cell_type": "markdown",
43 |    "metadata": {},
44 |    "source": [
45 |     "## GANs in algorithm.\n",
46 |     "\n",
47 |     "Settings:\n",
48 |     "Generator $ G $\n",
49 |     "Discriminator $ D $\n",
50 |     "Real data $ X $\n",
51 |     "\n",
52 |     "Generate some fake data from $ G $. We call this fake data $ \\tilde{X} $.\n",
53 |     "$ D $ is a binary classifier, which outputs $ 1 $ when $ X $ is encountered, and $ 0 $ when $ \\tilde{X} $ is seen.\n",
54 |     "$ D $ tries to get the result of $ D(X) $ close to $ 1 $, and $ D(\\tilde{X}) $ close to $ 0 $.\n",
55 |     "$ G $ tries to generate $ \\tilde{X} $ so that $ D(\\tilde{X}) $ gets close to $ 1 $.\n",
56 |     "Repeat the above procedure many times. Voila, you have a good generator ready to generate fake things!"
57 |    ]
58 |   }
59 |  ],
60 |  "metadata": {
61 |   "kernelspec": {
62 |    "display_name": "Python 3",
63 |    "language": "python",
64 |    "name": "python3"
65 |   },
66 |   "language_info": {
67 |    "codemirror_mode": {
68 |     "name": "ipython",
69 |     "version": 3
70 |    },
71 |    "file_extension": ".py",
72 |    "mimetype": "text/x-python",
73 |    "name": "python",
74 |    "nbconvert_exporter": "python",
75 |    "pygments_lexer": "ipython3",
76 |    "version": "3.9.6"
77 |   }
78 |  },
79 |  "nbformat": 4,
80 |  "nbformat_minor": 2
81 | }
82 | 


--------------------------------------------------------------------------------
/book/layers/activation/relu/relu.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Rectified Linear Unit"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{note}\n",
 15 |     "A **R**ctified **L**inear **U**nit is usually called **ReLU**.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## Introduction\n",
 24 |     "\n",
 25 |     "ReLU is one of most frequently used activations for hidden layers because of the following two reasons.\n",
 26 |     "\n",
 27 |     "1. Using ReLU typically avoids gradient vanishing/exploding.\n",
 28 |     "\n",
 29 |     "2. Because of how simple ReLU is, networks with ReLU train quite fast compared to more complicated activation functions like $ Tanh $."
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "metadata": {},
 35 |    "source": [
 36 |     "## Definition\n",
 37 |     "\n",
 38 |     "ReLU($ x $) = $ \\max \\{ 0, x \\} $"
 39 |    ]
 40 |   },
 41 |   {
 42 |    "cell_type": "markdown",
 43 |    "metadata": {},
 44 |    "source": [
 45 |     "## How does ReLU look, and how it works in code?"
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "code",
 50 |    "execution_count": null,
 51 |    "metadata": {},
 52 |    "outputs": [],
 53 |    "source": [
 54 |     "%matplotlib inline\n",
 55 |     "\n",
 56 |     "import numpy as np\n",
 57 |     "from matplotlib import pyplot as plt"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "code",
 62 |    "execution_count": null,
 63 |    "metadata": {},
 64 |    "outputs": [],
 65 |    "source": [
 66 |     "def ReLU(x):\n",
 67 |     "    return np.maximum(0, x)"
 68 |    ]
 69 |   },
 70 |   {
 71 |    "cell_type": "code",
 72 |    "execution_count": null,
 73 |    "metadata": {},
 74 |    "outputs": [],
 75 |    "source": [
 76 |     "x = np.arange(-10, 11)\n",
 77 |     "y = ReLU(x)\n",
 78 |     "print(\"x = \", x)\n",
 79 |     "print(\"y = \", y)"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "markdown",
 84 |    "metadata": {},
 85 |    "source": [
 86 |     "See how all negative numbers are replaced by 0.\n",
 87 |     "\n",
 88 |     "How does ReLU's input-output looks like?"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "code",
 93 |    "execution_count": null,
 94 |    "metadata": {},
 95 |    "outputs": [],
 96 |    "source": [
 97 |     "x = np.arange(-100, 110) / 100\n",
 98 |     "y = ReLU(x)\n",
 99 |     "plt.plot(x, y)\n",
100 |     "plt.show()"
101 |    ]
102 |   }
103 |  ],
104 |  "metadata": {
105 |   "interpreter": {
106 |    "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90"
107 |   },
108 |   "kernelspec": {
109 |    "display_name": "Python 3",
110 |    "language": "python",
111 |    "name": "python3"
112 |   },
113 |   "language_info": {
114 |    "codemirror_mode": {
115 |     "name": "ipython",
116 |     "version": 3
117 |    },
118 |    "file_extension": ".py",
119 |    "mimetype": "text/x-python",
120 |    "name": "python",
121 |    "nbconvert_exporter": "python",
122 |    "pygments_lexer": "ipython3",
123 |    "version": "3.9.5"
124 |   }
125 |  },
126 |  "nbformat": 4,
127 |  "nbformat_minor": 2
128 | }
129 | 


--------------------------------------------------------------------------------
/book/layers/activation/activation.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Activation Functions"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "We will refer to activation functions as activations.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## What are activation functions?\n",
24 |     "\n",
25 |     "Activations transform an input vector into another vector. These functions are usually applied after a linear layer, so that the output of a linear layer will abide by some rules. A $ RELU(x) $ function sets the rule that only positive numbers are allowed, it's defined as $ \\max \\{0, x\\} $. A $ Sigmoid(x) $ function limits the output values to be within $ (0, 1) $ because it's defined as $ \\frac{1}{1 + e^{-x}} $.\n",
26 |     "\n",
27 |     "In deep learning though, when there are many linear layers stacked on top of each other, usually we don't care too much about the activations that are hidden, that is, not at the last layer."
28 |    ]
29 |   },
30 |   {
31 |    "cell_type": "markdown",
32 |    "metadata": {},
33 |    "source": [
34 |     "## Why activations?\n",
35 |     "\n",
36 |     "Without activations, deep learning is meaningless. To understand why that's the case, we use a very simple example.\n",
37 |     "\n",
38 |     "Suppose that there is a small neural network that has only two layer. To simplify the problem further, let's assume that these two layers only have weight matrices but not bias matrices. That is, the network can be represented by the function: $ F(x) = (B)(A) x $, with $ A $ the weight of the first layer, and $ B $ the weight of the second layer.\n",
39 |     "\n",
40 |     "However, we can see the function in a different way: $ F(x) = (BA)x $, which means that we can construct a simpler network, with only one layer, whose weight matrix is $ BA $, that does the exactly same thing! It means that adding layer literally doesn't really help us in anyway.\n",
41 |     "\n",
42 |     "Truly, because of how every neural network can be thought of as a chain of matrix multiplications, simply adding more linear layers are never going to help expand the type of functions that we want to approximate, because with only linear functions, we can only approximate linear functions! That's without activation function.\n",
43 |     "\n",
44 |     "If we apply activation $ \\sigma $ to the network: $ F(X) = \\sigma (((B) \\sigma (A) ) x $, then we can't decompose the function easily. In fact, with enough layers, the function would be so complicated that it can approximate any function! And that's all because of the power of activation functions."
45 |    ]
46 |   }
47 |  ],
48 |  "metadata": {
49 |   "kernelspec": {
50 |    "display_name": "Python 3",
51 |    "language": "python",
52 |    "name": "python3"
53 |   },
54 |   "language_info": {
55 |    "codemirror_mode": {
56 |     "name": "ipython",
57 |     "version": 3
58 |    },
59 |    "file_extension": ".py",
60 |    "mimetype": "text/x-python",
61 |    "name": "python",
62 |    "nbconvert_exporter": "python",
63 |    "pygments_lexer": "ipython3",
64 |    "version": "3.9.5"
65 |   }
66 |  },
67 |  "nbformat": 4,
68 |  "nbformat_minor": 2
69 | }
70 | 


--------------------------------------------------------------------------------
/book/reinforce/value/value.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Value"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Value function in RL\n",
15 |     "\n",
16 |     "Value function in RL refers to how much reward an agent will get until it arrives at the end state. For example, suppose you get one point each time you celebrate your birthday. And you have a 50 % chance of living to 100 years old, and 50 % of living to 110 years old. Then the value function would be $ 0.5 \\times (100 - x) + 0.5 \\times (110 - x) $, with $ x $ being your current age."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "```{warning}\n",
24 |     "Incoming math!\n",
25 |     "```"
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "Formally, value function can be determined by weighted sum of all rewards, and calculated by a recursive function:\n",
33 |     "\n",
34 |     "$$\n",
35 |     "\n",
36 |     "G_s = R_a + \\gamma \\sum_a \\pi_a G_{s + a}\n",
37 |     "\n",
38 |     "$$\n",
39 |     "\n",
40 |     "where $ s $ is the current state, $ a $ an action that this state can take, $ \\pi_a $ the probability of taking action $ a $, $ s + a $ the next state after taking action $ a $, and finally, $ G_s $ is the value function at state $ s $. $ 0 \\le \\gamma \\le 1 $ is the decay factor because distant rewards are less valuable (more uncertainty).\n",
41 |     "\n",
42 |     "Since the value function is defined as the rewards an agent receives along its life, value function at the end state is $ 0 $.\n",
43 |     "\n",
44 |     "$$\n",
45 |     "\n",
46 |     "G_{end} = 0\n",
47 |     "\n",
48 |     "$$\n",
49 |     "\n",
50 |     "Where $ end $ is the end state."
51 |    ]
52 |   },
53 |   {
54 |    "cell_type": "markdown",
55 |    "metadata": {},
56 |    "source": [
57 |     "## Relation between value function and rewards.\n",
58 |     "\n",
59 |     "Rewards are observed through actions with the environment, and value functions are defined as the expected summation of rewards until termination state is encountered."
60 |    ]
61 |   },
62 |   {
63 |    "cell_type": "markdown",
64 |    "metadata": {},
65 |    "source": [
66 |     "## Why value functions are important?\n",
67 |     "\n",
68 |     "Value functions provide an easy way to rank a state. A state with a higher value is expected to receive more rewards in the future, so it's better. Intuitively, arriving at a state with a higher value means that there would be more rewards in the future, and that's the foundation of RL: _RL aims to maximize the rewards an agent would get, in other words, tries to modify an agent's behavior so that it can be in a state where values are higher._ "
69 |    ]
70 |   }
71 |  ],
72 |  "metadata": {
73 |   "kernelspec": {
74 |    "display_name": "Python 3",
75 |    "language": "python",
76 |    "name": "python3"
77 |   },
78 |   "language_info": {
79 |    "codemirror_mode": {
80 |     "name": "ipython",
81 |     "version": 3
82 |    },
83 |    "file_extension": ".py",
84 |    "mimetype": "text/x-python",
85 |    "name": "python",
86 |    "nbconvert_exporter": "python",
87 |    "pygments_lexer": "ipython3",
88 |    "version": "3.9.6"
89 |   }
90 |  },
91 |  "nbformat": 4,
92 |  "nbformat_minor": 2
93 | }
94 | 


--------------------------------------------------------------------------------
/book/notice/gradient/norm.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Gradient Vanishing / Gradient Explosion"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What happens when gradients vanish or explode?\n",
15 |     "\n",
16 |     "Remember that the neural network trains by using chain rules. That is, earlier layers' gradients are calculated by scaling later layers' gradients. Chain rule:\n",
17 |     "\n",
18 |     "$$\n",
19 |     "\\frac{dy}{dx} = \\frac{df}{du} \\frac{du}{dx}\n",
20 |     "$$\n",
21 |     "\n",
22 |     "If later layers gradients are extremely large, then it's going to scale the earlier gradients by a huge factor. Due to how the computer represent numbers, this number may just be `INFINITY`. If later layers gradients are extremely small, then it's going to scale down the earlier gradients a lot. The earlier gradients may just become `0` in such a case.\n",
23 |     "\n",
24 |     "In either cases, the gradients calculated are not true gradients, and may just make the network un-trainable. Do you really want to update your parameter by `INFINITY`?"
25 |    ]
26 |   },
27 |   {
28 |    "cell_type": "markdown",
29 |    "metadata": {},
30 |    "source": [
31 |     "## When do gradients vanish or explode?\n",
32 |     "\n",
33 |     "In very deep networks, vanishing or explosion is more likely to happen. For example, if every layer scales the norm by 10 (not that big considering that there are many parameters in a layer), then after 300 layers (which is not uncommon in current neural networks), the gradients will approach `INFINITY`. The same arguments can be applied to vanishing gradients. In deep networks, these things happen quite often."
34 |    ]
35 |   },
36 |   {
37 |    "cell_type": "markdown",
38 |    "metadata": {},
39 |    "source": [
40 |     "## How to deal with gradient vanishing or explosion?\n",
41 |     "\n",
42 |     "There are several ways to deal with vanishing or exploding gradients."
43 |    ]
44 |   },
45 |   {
46 |    "cell_type": "markdown",
47 |    "metadata": {},
48 |    "source": [
49 |     "### Normalization\n",
50 |     "\n",
51 |     "Normalization makes sure that in training, the size of the gradient does not get out of hand, because it is normalized by passing through this layer."
52 |    ]
53 |   },
54 |   {
55 |    "cell_type": "markdown",
56 |    "metadata": {},
57 |    "source": [
58 |     "### Residual Networks\n",
59 |     "\n",
60 |     "Residual networks reduces the depth of the networks by providing shortcuts. Shallower networks are less likely to have gradient explosion/vanishing problems."
61 |    ]
62 |   },
63 |   {
64 |    "cell_type": "markdown",
65 |    "metadata": {},
66 |    "source": [
67 |     "### Don't use certain activation functions\n",
68 |     "\n",
69 |     "Activation functions like `Sigmoid`, `Softmax`, `Tanh`, if applied over and over in several layers, will make the gradients of the network is extremely small. This is the reason `ReLU` is popular, because it doesn't suffer from gradient problems as often."
70 |    ]
71 |   }
72 |  ],
73 |  "metadata": {
74 |   "language_info": {
75 |    "codemirror_mode": {
76 |     "name": "ipython",
77 |     "version": 3
78 |    },
79 |    "file_extension": ".py",
80 |    "mimetype": "text/x-python",
81 |    "name": "python",
82 |    "nbconvert_exporter": "python",
83 |    "pygments_lexer": "ipython3",
84 |    "version": 3
85 |   }
86 |  },
87 |  "nbformat": 4,
88 |  "nbformat_minor": 2
89 | }
90 | 


--------------------------------------------------------------------------------
/book/layers/transformer/attn/attn.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Attention"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## Why attention?\n",
 15 |     "\n",
 16 |     "Attention means keeping tabs on the most important parts. Attention comes from a key observation: Not all words are equal, and some words are more crucial to understanding the sentence than other. For example, the sentence \"It is raining outside\". You probably understand that it's raining outside if I say: \"Rain! Out!\". In this case, _it_ and _is_ are completely redundant. And if a model is trying to understand the sentence, throwing out _it_ and _is_ is probably not going to make a difference."
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## What's attention?\n",
 24 |     "\n",
 25 |     "So how to focus only on the most important part? One way to do it is to multiply the important parts by a large factor, while reducing the unimportant parts values (those parts are, in fact, numbers in machine's language). And that's what attention mechanism does."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## Try attention in code"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": null,
 38 |    "metadata": {},
 39 |    "outputs": [],
 40 |    "source": [
 41 |     "%matplotlib inline\n",
 42 |     "\n",
 43 |     "import numpy as np\n",
 44 |     "from matplotlib import pyplot as plt"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "def softmax(x, t = 1):\n",
 54 |     "    exp = np.exp(x / t)\n",
 55 |     "\n",
 56 |     "    # sums over the last axis\n",
 57 |     "    sum_exp = exp.sum(-1, keepdims=True)\n",
 58 |     "    \n",
 59 |     "    return exp / sum_exp"
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "code",
 64 |    "execution_count": null,
 65 |    "metadata": {},
 66 |    "outputs": [],
 67 |    "source": [
 68 |     "num = 5\n",
 69 |     "\n",
 70 |     "weights = softmax(np.random.randn(num), t=0.1)\n",
 71 |     "data = np.random.randn(num)\n",
 72 |     "\n",
 73 |     "print(weights)\n",
 74 |     "print(data)"
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "code",
 79 |    "execution_count": null,
 80 |    "metadata": {},
 81 |    "outputs": [],
 82 |    "source": [
 83 |     "average = data.sum() / data.size\n",
 84 |     "attn_applied = weights @ data\n",
 85 |     "\n",
 86 |     "print(average)\n",
 87 |     "print(attn_applied)\n",
 88 |     "\n",
 89 |     "print(weights.argmax())\n",
 90 |     "print(data[weights.argmax()])"
 91 |    ]
 92 |   },
 93 |   {
 94 |    "cell_type": "markdown",
 95 |    "metadata": {},
 96 |    "source": [
 97 |     "See how the attention mask makes the weighted average of data closer to the desired place."
 98 |    ]
 99 |   }
100 |  ],
101 |  "metadata": {
102 |   "kernelspec": {
103 |    "display_name": "Python 3",
104 |    "language": "python",
105 |    "name": "python3"
106 |   },
107 |   "language_info": {
108 |    "codemirror_mode": {
109 |     "name": "ipython",
110 |     "version": 3
111 |    },
112 |    "file_extension": ".py",
113 |    "mimetype": "text/x-python",
114 |    "name": "python",
115 |    "nbconvert_exporter": "python",
116 |    "pygments_lexer": "ipython3",
117 |    "version": "3.9.5"
118 |   }
119 |  },
120 |  "nbformat": 4,
121 |  "nbformat_minor": 2
122 | }
123 | 


--------------------------------------------------------------------------------
/book/better/compression/compression.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Model Compression"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## The need of model compression\n",
15 |     "\n",
16 |     "Models are getting larger and larger everyday. State of the art models gets super large super fast. Model compression is a method to combat the stress that this trend puts on your device: it makes your model smaller, so that it can be transferred over the Internet, it can fit in your memory to run faster, or it can just save a lot of disk usage. Model compression is the science of reducing the size of a model.\n",
17 |     "\n",
18 |     "Of course, model compression does come with its downsides. After compressed, models will get less accurate. In many cases though, it's a sacrifice that people are willing to take."
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "metadata": {},
24 |    "source": [
25 |     "## Ways of doing model compression\n",
26 |     "\n",
27 |     "There are many ways of doing model compression:"
28 |    ]
29 |   },
30 |   {
31 |    "cell_type": "markdown",
32 |    "metadata": {},
33 |    "source": [
34 |     "### Unstructured pruning\n",
35 |     "\n",
36 |     "Because of how deep learning models are based on linear algebra, zero values in a layer in the model simply does not do anything but waste space in memory. Pruning is the art of making the model's layers less dense and more sparse, so that it can only store things that matter. "
37 |    ]
38 |   },
39 |   {
40 |    "cell_type": "markdown",
41 |    "metadata": {},
42 |    "source": [
43 |     "### Structured pruning\n",
44 |     "\n",
45 |     "As great as unstructured pruning is, dealing with sparse matrices (which is produced a lot in unstructured pruning) is slow because it's difficult to run it on GPU. Structured pruning does the opposite, it finds a filter/channel/matrix to prune, so that the end result is still a network that consists of dense matrices."
46 |    ]
47 |   },
48 |   {
49 |    "cell_type": "markdown",
50 |    "metadata": {},
51 |    "source": [
52 |     "### Quantization\n",
53 |     "\n",
54 |     "Quantization means to store the weights of the model in a less accurate format to save weight. For example, if your model's weight is 64-bit floating point numbers, converting those numbers to 32-bit floating point numbers will slash off half the amount of space. It's as simple as that. Recently there are also 16-bit floating point models that makes storing the models efficiently even easier.\n",
55 |     "\n",
56 |     "Some people take quantization a bit far and use integers for storing the values of the model. It's feasible but hurts performance quite a lot."
57 |    ]
58 |   },
59 |   {
60 |    "cell_type": "markdown",
61 |    "metadata": {},
62 |    "source": [
63 |     "### Summary\n",
64 |     "\n",
65 |     "These three ways (or two if you merge the two pruning methods) are the main ways people reduce the size of their model without training new ones.\n",
66 |     "\n",
67 |     "If training a new model is an option, also see knowledge distillation."
68 |    ]
69 |   }
70 |  ],
71 |  "metadata": {
72 |   "kernelspec": {
73 |    "display_name": "Python 3",
74 |    "language": "python",
75 |    "name": "python3"
76 |   },
77 |   "language_info": {
78 |    "codemirror_mode": {
79 |     "name": "ipython",
80 |     "version": 3
81 |    },
82 |    "file_extension": ".py",
83 |    "mimetype": "text/x-python",
84 |    "name": "python",
85 |    "nbconvert_exporter": "python",
86 |    "pygments_lexer": "ipython3",
87 |    "version": "3.9.5"
88 |   }
89 |  },
90 |  "nbformat": 4,
91 |  "nbformat_minor": 2
92 | }
93 | 


--------------------------------------------------------------------------------
/book/notice/notice.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Other Things To Notice"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## Besides what we've seen so far.\n",
 15 |     "\n",
 16 |     "There are other things to notice if you want to make training a machine learning model easy."
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## How hard can training be?\n",
 24 |     "\n",
 25 |     "Let's face it, training is very hard, because of how hard it is to debug a machine learning system. `print`ing out all the values in the model? You'll still not understand why something goes wrong because those numbers are meaningless to you. Trial and error? Well, it's super time-consuming to do so, and will make your computer very heated.\n",
 26 |     "\n",
 27 |     "But fret not! Most training issues can be categorized into the following types:"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
 34 |     "## Gradient issues\n",
 35 |     "\n",
 36 |     "Gradient issues are the most common ones. It happens when you're models gradients are out of control, not in the desired range. When [gradients are too large or small](./gradient/norm), or when you are unlucky and stuck in a [saddle point](./gradient/saddle)."
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "markdown",
 41 |    "metadata": {},
 42 |    "source": [
 43 |     "## Learning rate\n",
 44 |     "\n",
 45 |     "Learning rate is how much you update your model. When [learning rate is too small or large](./lr/lr), training may get super slow."
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "markdown",
 50 |    "metadata": {},
 51 |    "source": [
 52 |     "## Optimizer\n",
 53 |     "\n",
 54 |     "An [optimizer](./optimizer/optimizer) is responsible for updating the model. If the wrong optimizer is selected, training can be deceptively slow and ineffective."
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {},
 60 |    "source": [
 61 |     "## Batch size\n",
 62 |     "\n",
 63 |     "When you have a [too big or small batch](./batch/batch), bad things happen because of probability."
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "markdown",
 68 |    "metadata": {},
 69 |    "source": [
 70 |     "## Overfitting and underfitting\n",
 71 |     "\n",
 72 |     "Sometimes when you're model is [too complicated for the task](./data/overfit), the model looks dope in training but is useless when being used on real world data. Or the model is just [overly simple](./data/underfit) for the task, the model doesn't seem to learn anything in training."
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "markdown",
 77 |    "metadata": {},
 78 |    "source": [
 79 |     "## Summary\n",
 80 |     "\n",
 81 |     "Most of the issues are caused because of how machine learning systems' intolerance to big or small numbers, so carefully selecting/tuning a hyperparameter is key to solving many ML issues encountered in training."
 82 |    ]
 83 |   }
 84 |  ],
 85 |  "metadata": {
 86 |   "kernelspec": {
 87 |    "display_name": "Python 3 (ipykernel)",
 88 |    "language": "python",
 89 |    "name": "python3"
 90 |   },
 91 |   "language_info": {
 92 |    "codemirror_mode": {
 93 |     "name": "ipython",
 94 |     "version": 3
 95 |    },
 96 |    "file_extension": ".py",
 97 |    "mimetype": "text/x-python",
 98 |    "name": "python",
 99 |    "nbconvert_exporter": "python",
100 |    "pygments_lexer": "ipython3",
101 |    "version": "3.9.5"
102 |   }
103 |  },
104 |  "nbformat": 4,
105 |  "nbformat_minor": 2
106 | }
107 | 


--------------------------------------------------------------------------------
/book/unsupervised/clustering/clustering.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Clustering"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## What are clustering methods?\n",
15 |     "\n",
16 |     "A cluster is a group of things that are close together. Clustering is a very important idea in machine learning in general. Clustering work well with feature vectors, which is a vector of numbers representing features. Because feature vectors that are close to one another is supposed to be related in a deeper way (because of closer features), clustering is very useful in determining whether two data features are in the same group (same cluster)."
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Clustering clustering methods \n",
24 |     "\n",
25 |     "Most clustering methods are related to the following clustering methods:"
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "### K-means\n",
33 |     "\n",
34 |     "K-means is a kind of **partition based methods**, which means that it partition the (feature) space into different regions, and predicts arbitrary points in the (feature) space by locating which partition the point falls into.\n",
35 |     "\n",
36 |     "K-means is one of the most popular clustering methods, because of how simple it is. Its algorithm is as follows:\n",
37 |     "\n",
38 |     "1. Select initial center points. These points will act as the center of the classes later.\n",
39 |     "2. Assign class labels to each point using the rule that a point is assigned the label of the closest center.\n",
40 |     "3. Shifting the center to be the mean of all the points that have the same class label as the center.\n",
41 |     "4. Repeat."
42 |    ]
43 |   },
44 |   {
45 |    "cell_type": "markdown",
46 |    "metadata": {},
47 |    "source": [
48 |     "### DBSCAN\n",
49 |     "\n",
50 |     "DBSCAN, abbreviation for **D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise, is a **density based method**, which means that it clusters the points based on densities. If there are many points in a certain region, the algorithm simply assumes that they are all of the same base class."
51 |    ]
52 |   },
53 |   {
54 |    "cell_type": "markdown",
55 |    "metadata": {},
56 |    "source": [
57 |     "### GMM\n",
58 |     "\n",
59 |     "**G**aussian **M**ixure **M**odels, introduced previously, can do more than generating data. The Gaussians can fit existing data, and be used to explain clusters with probabilities. A GMM, composed of several Gaussians, can be used to fit the data, and the end results will be several Gaussians, which act as clusters."
60 |    ]
61 |   },
62 |   {
63 |    "cell_type": "markdown",
64 |    "metadata": {},
65 |    "source": [
66 |     "### Hierarchy methods\n",
67 |     "\n",
68 |     "Treating every point as its independent cluster, we combine multiple points into the same cluster and do that until there is one big cluster left. We can then see the hierarchy of the problem: later grouped clusters are inherently farther away form each other than earlier grouped clusters."
69 |    ]
70 |   }
71 |  ],
72 |  "metadata": {
73 |   "kernelspec": {
74 |    "display_name": "Python 3",
75 |    "language": "python",
76 |    "name": "python3"
77 |   },
78 |   "language_info": {
79 |    "codemirror_mode": {
80 |     "name": "ipython",
81 |     "version": 3
82 |    },
83 |    "file_extension": ".py",
84 |    "mimetype": "text/x-python",
85 |    "name": "python",
86 |    "nbconvert_exporter": "python",
87 |    "pygments_lexer": "ipython3",
88 |    "version": "3.9.5"
89 |   }
90 |  },
91 |  "nbformat": 4,
92 |  "nbformat_minor": 2
93 | }
94 | 


--------------------------------------------------------------------------------
/book/better/lll/lll.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Life Long Learning"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "```{note}\n",
15 |     "**L**ife **L**ong **L**earning are also called LLL.\n",
16 |     "```"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "metadata": {},
22 |    "source": [
23 |     "## Why LLL?\n",
24 |     "\n",
25 |     "Humans can learn and memorize things. When we learn about new things, we don't really forget how to do it. A person can probably swim after years of staying out of a swimming pool, or ride a bicycle despite not having ridden on one in decades. However, machine learning models don't seem to be able to do that. When performing updates to a model, the model's ability to perform on the previous tasks are simply destroyed. That's the reason for the existence of LLL, to help machine retain memory over its lifetime."
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "## How LLL works?\n",
33 |     "\n",
34 |     "There are multiple ways to perform LLL. Most LLL methods use one or more of the following principle."
35 |    ]
36 |   },
37 |   {
38 |    "cell_type": "markdown",
39 |    "metadata": {},
40 |    "source": [
41 |     "## Don't change the model too much.\n",
42 |     "\n",
43 |     "If a model isn't changed too much, normally you wouldn't expect the output to change by a huge amount. This method is called **weight consolidation**. Using this method, you either hard constrain the maximum distance between the old weight and the new weight, or apply penalties for big distances."
44 |    ]
45 |   },
46 |   {
47 |    "cell_type": "markdown",
48 |    "metadata": {},
49 |    "source": [
50 |     "## Make the model bigger.\n",
51 |     "\n",
52 |     "By making the model bigger, you have the option to not modify the old weights by constraining new weights to not affect the old tasks, and only updating the new weights in learning new tasks."
53 |    ]
54 |   },
55 |   {
56 |    "cell_type": "markdown",
57 |    "metadata": {},
58 |    "source": [
59 |     "## Mix-in data in old tasks when learning new tasks.\n",
60 |     "\n",
61 |     "When you're learning something new, sometimes reviewing what you already know helps clarify the difference between what you already know and the thing that's new. This approach is no different. When learning a new task, this approach will try to also re-learn old tasks so the model's ability in performing old tasks don't get affected."
62 |    ]
63 |   },
64 |   {
65 |    "cell_type": "markdown",
66 |    "metadata": {},
67 |    "source": [
68 |     "## Don't learn new things that conflict with what the model already knows.\n",
69 |     "\n",
70 |     "When learning a new task, the model is updated in a special direction. This direction can be thought of as the knowledge that the model learns. What this method does is that it retains every _updated paths_ that the model has learned. In learning any new task, the knowledge of previous tasks should never be reduced. In mathematical terms, the gradients in learning new tasks and the gradients in learning old tasks should never have a negative dot product."
71 |    ]
72 |   }
73 |  ],
74 |  "metadata": {
75 |   "kernelspec": {
76 |    "display_name": "Python 3 (ipykernel)",
77 |    "language": "python",
78 |    "name": "python3"
79 |   },
80 |   "language_info": {
81 |    "codemirror_mode": {
82 |     "name": "ipython",
83 |     "version": 3
84 |    },
85 |    "file_extension": ".py",
86 |    "mimetype": "text/x-python",
87 |    "name": "python",
88 |    "nbconvert_exporter": "python",
89 |    "pygments_lexer": "ipython3",
90 |    "version": "3.9.5"
91 |   }
92 |  },
93 |  "nbformat": 4,
94 |  "nbformat_minor": 2
95 | }
96 | 


--------------------------------------------------------------------------------
/book/reinforce/ac/ac.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Actor Critic"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## What is actor critic?\n",
 15 |     "\n",
 16 |     "Actor critic sounds cool, but it's nothing special. Remember that policy networks outputs probability? The probability is generated by taking `softmax` (see the previous chapters about softmax), over a vector of scalars, which is usually called **logits**. Here's an interesting fact about logits: because softmax is applied on logits, adding a scalar to all the logits doesn't change the softmax output!"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "```{warning}\n",
 24 |     "Incoming math!\n",
 25 |     "```"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "Because the definition of softmax is\n",
 33 |     "\n",
 34 |     "$$\n",
 35 |     "\\frac{e^{x_i}}{\\sum_j e^{x_j}}\n",
 36 |     "$$\n",
 37 |     "\n",
 38 |     "You notice that adding a scalar to all $ x_i $ is equivalent to\n",
 39 |     "\n",
 40 |     "$$\n",
 41 |     "\\frac{e^{x_i + s}}{\\sum_j e^{x_j + s}}\n",
 42 |     "$$\n",
 43 |     "\n",
 44 |     "which can be written as\n",
 45 |     "\n",
 46 |     "$$\n",
 47 |     "\\frac{e^{x_i} e^s}{\\sum_j e^{x_j} e^s}\n",
 48 |     "$$\n",
 49 |     "\n",
 50 |     "Which is reduced to the original definition of softmax because $ e^s $ cancels each other.\n",
 51 |     "\n",
 52 |     "$$\n",
 53 |     "\\frac{e^{x_i}}{\\sum_j e^{x_j}}\n",
 54 |     "$$\n"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {},
 60 |    "source": [
 61 |     "What actor critic does is essentially this, it uses a policy network to generate logits, and uses a value network to generate a scalar. It then subtracts the scalar from the logits, and call the result **advantage**. The scalar acts similarly to the mean of logits, and although it doesn't change the output of the policy model (because the logits are eventually passed into a softmax layer), it helps reduce the variance of the model and makes the model more robust."
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "markdown",
 66 |    "metadata": {},
 67 |    "source": [
 68 |     "## Proximal Policy Optimization.\n",
 69 |     "\n",
 70 |     "PPO is one of the most famous actor critic methods, developed by OpenAI. PPO is an offline RL optimization method. It follows the principle that the data-collecting agent (that's not updated) should not be too different from the agent being trained (updated). The PPO algorithm optimizes the model in the following steps:\n",
 71 |     "\n",
 72 |     "1. Collects a batch of trajectories (data).\n",
 73 |     "2. Re-weights the probabilities so that the data-collecting agent's rewards are re-scaled according to the probability ratio (because of how policy based methods take the expectation of future reward, see the policy section).\n",
 74 |     "3. Clip the re-scaling factor because the new model shouldn't be too different from the old model (which would increase variance).\n",
 75 |     "4. Apply normal policy gradient methods.\n",
 76 |     "5. Repeat."
 77 |    ]
 78 |   }
 79 |  ],
 80 |  "metadata": {
 81 |   "kernelspec": {
 82 |    "display_name": "Python 3 (ipykernel)",
 83 |    "language": "python",
 84 |    "name": "python3"
 85 |   },
 86 |   "language_info": {
 87 |    "codemirror_mode": {
 88 |     "name": "ipython",
 89 |     "version": 3
 90 |    },
 91 |    "file_extension": ".py",
 92 |    "mimetype": "text/x-python",
 93 |    "name": "python",
 94 |    "nbconvert_exporter": "python",
 95 |    "pygments_lexer": "ipython3",
 96 |    "version": "3.9.5"
 97 |   }
 98 |  },
 99 |  "nbformat": 4,
100 |  "nbformat_minor": 2
101 | }
102 | 


--------------------------------------------------------------------------------
/book/layers/linear/linear.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Linear Layer\n"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{note}\n",
 15 |     "Sometimes, Linear Layers are also called Dense Layers, like in the toolkit Keras.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## What do linear layers do?\n",
 24 |     "\n",
 25 |     "A linear layer transforms a vector into another vector. For example, you can transform a vector `[1, 2, 3]` to `[1, 2, 3, 4]` with a linear layer."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "\n",
 33 |     "## When to use linear layers?\n",
 34 |     "\n",
 35 |     "Use linear layers when you want to change a vector into another vector. This often happens when the target vector's shape is different from the vector at hand.\n"
 36 |    ]
 37 |   },
 38 |   {
 39 |    "cell_type": "markdown",
 40 |    "metadata": {},
 41 |    "source": [
 42 |     "\n",
 43 |     "```{note}\n",
 44 |     "Linear layers are often called linear transformation or linear mapping.\n",
 45 |     "```\n"
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "markdown",
 50 |    "metadata": {},
 51 |    "source": [
 52 |     "\n",
 53 |     "## How does a linear layer work?\n",
 54 |     "\n",
 55 |     "There are two components in a linear layer. A weight $ W $, and a bias $ B $. If the input of a linear layer is a vector $ X $, then the output is $ W X + B $.\n",
 56 |     "\n",
 57 |     "If the linear layer transforms a vector of dimension $ N $ to dimension $ M $, then $ W $ is a $ M \\times N $ matrix, $ X $ is of dimension $ N $, $ B $ is of dimension $ M $."
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "markdown",
 62 |    "metadata": {},
 63 |    "source": [
 64 |     "## Linear layers in code?"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "code",
 69 |    "execution_count": null,
 70 |    "metadata": {},
 71 |    "outputs": [],
 72 |    "source": [
 73 |     "import torch\n",
 74 |     "from torch.nn import Linear"
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "code",
 79 |    "execution_count": null,
 80 |    "metadata": {},
 81 |    "outputs": [],
 82 |    "source": [
 83 |     "linear = Linear(3, 4)\n",
 84 |     "print(linear.weight.detach())\n",
 85 |     "print(linear.bias.detach())"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "markdown",
 90 |    "metadata": {},
 91 |    "source": [
 92 |     "You see, linear layers are just 2 matrices, weight and bias."
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "code",
 97 |    "execution_count": null,
 98 |    "metadata": {},
 99 |    "outputs": [],
100 |    "source": [
101 |     "x = torch.tensor([1., 2., 3.])\n",
102 |     "y1 = linear(x)\n",
103 |     "y2 = linear.weight @ x + linear.bias\n",
104 |     "print(y1)\n",
105 |     "print(y2)\n",
106 |     "print(y1 == y2)"
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "markdown",
111 |    "metadata": {},
112 |    "source": [
113 |     "All that a linear layer do is to `matmul` the input vector, then added by the bias. It's the linear algebra notation of $ WX+B $, with $ W $ the weight matrix, and $ B $ the bias vector."
114 |    ]
115 |   }
116 |  ],
117 |  "metadata": {
118 |   "kernelspec": {
119 |    "display_name": "Python 3",
120 |    "language": "python",
121 |    "name": "python3"
122 |   },
123 |   "language_info": {
124 |    "codemirror_mode": {
125 |     "name": "ipython",
126 |     "version": 3
127 |    },
128 |    "file_extension": ".py",
129 |    "mimetype": "text/x-python",
130 |    "name": "python",
131 |    "nbconvert_exporter": "python",
132 |    "pygments_lexer": "ipython3",
133 |    "version": "3.8.6"
134 |   }
135 |  },
136 |  "nbformat": 4,
137 |  "nbformat_minor": 2
138 | }
139 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | share/python-wheels/
 24 | *.egg-info/
 25 | .installed.cfg
 26 | *.egg
 27 | MANIFEST
 28 | 
 29 | # PyInstaller
 30 | #  Usually these files are written by a python script from a template
 31 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 32 | *.manifest
 33 | *.spec
 34 | 
 35 | # Installer logs
 36 | pip-log.txt
 37 | pip-delete-this-directory.txt
 38 | 
 39 | # Unit test / coverage reports
 40 | htmlcov/
 41 | .tox/
 42 | .nox/
 43 | .coverage
 44 | .coverage.*
 45 | .cache
 46 | nosetests.xml
 47 | coverage.xml
 48 | *.cover
 49 | *.py,cover
 50 | .hypothesis/
 51 | .pytest_cache/
 52 | cover/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | .pybuilder/
 76 | target/
 77 | 
 78 | # Jupyter Notebook
 79 | .ipynb_checkpoints
 80 | 
 81 | # IPython
 82 | profile_default/
 83 | ipython_config.py
 84 | 
 85 | # pyenv
 86 | #   For a library or package, you might want to ignore these files since the code is
 87 | #   intended to run in multiple environments; otherwise, check them in:
 88 | # .python-version
 89 | 
 90 | # pipenv
 91 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 92 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 93 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 94 | #   install all needed dependencies.
 95 | #Pipfile.lock
 96 | 
 97 | # poetry
 98 | #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 99 | #   This is especially recommended for binary packages to ensure reproducibility, and is more
100 | #   commonly ignored for libraries.
101 | #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102 | #poetry.lock
103 | 
104 | # pdm
105 | #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106 | #pdm.lock
107 | #   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108 | #   in version control.
109 | #   https://pdm.fming.dev/#use-with-ide
110 | .pdm.toml
111 | .pdm-python
112 | 
113 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
114 | __pypackages__/
115 | 
116 | # Celery stuff
117 | celerybeat-schedule
118 | celerybeat.pid
119 | 
120 | # SageMath parsed files
121 | *.sage.py
122 | 
123 | # Environments
124 | .env
125 | .venv
126 | env/
127 | venv/
128 | ENV/
129 | env.bak/
130 | venv.bak/
131 | 
132 | # Spyder project settings
133 | .spyderproject
134 | .spyproject
135 | 
136 | # Rope project settings
137 | .ropeproject
138 | 
139 | # mkdocs documentation
140 | /site
141 | 
142 | # mypy
143 | .mypy_cache/
144 | .dmypy.json
145 | dmypy.json
146 | 
147 | # Pyre type checker
148 | .pyre/
149 | 
150 | # pytype static type analyzer
151 | .pytype/
152 | 
153 | # Cython debug symbols
154 | cython_debug/
155 | 
156 | # PyCharm
157 | #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
158 | #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
159 | #  and can be added to the global gitignore or merged into this file.  For a more nuclear
160 | #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
161 | #.idea/
162 | 
163 | 
164 | _build/
165 | 


--------------------------------------------------------------------------------
/book/basics/gradients/back-prop.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# How Gradients Are Calculated?"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{warning}\n",
 15 |     "This part may be more mathematics focused. If you simply want to grasp the intuition behind deep learning, feel free to skip the section.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "```{note}\n",
 24 |     "This explanation will focus on how PyTorch calculates gradients. Recently TensorFlow has switched to the same model so the method seems pretty good.\n",
 25 |     "```"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## Chain rule\n",
 33 |     "\n",
 34 |     "$$\n",
 35 |     "\\frac{d f}{d x} = \\frac{d f}{d y} \\frac{d y}{d x}\n",
 36 |     "$$\n",
 37 |     "\n",
 38 |     "Chain rule is basically a way to calculate derivatives for functions that are very composed and complicated. With chain rule at hand, we will be able to take derivatives for functions that look familiar but complicated."
 39 |    ]
 40 |   },
 41 |   {
 42 |    "cell_type": "markdown",
 43 |    "metadata": {},
 44 |    "source": [
 45 |     "## Example usage of chain rule\n",
 46 |     "\n",
 47 |     "It's very easy to calculate the derivative of $ 5 x $, which is $ 5 $. It's also obvious that the derivative of $ x^3 $ is $ 3 x^2 $. However, what's the gradient of $ (5 x)^3 $? \n",
 48 |     "\n",
 49 |     "It's easy, you say. $ (5 x)^3 = 125 x^3 $, so the derivative is $ 125 (3 x^2) = 375 x^2 $. You're right. For demonstration purpose, let's see how chain rule can derive the same answer.\n",
 50 |     "\n",
 51 |     "\n",
 52 |     "First, let $ y = 5 x $. Then, the chain rule:\n",
 53 |     "\n",
 54 |     "$$\n",
 55 |     "\\frac{d f}{d x} = \\frac{d f}{d y} \\frac{d y}{d x}\n",
 56 |     "$$\n",
 57 |     "\n",
 58 |     "will reduce to:\n",
 59 |     "\n",
 60 |     "$$\n",
 61 |     "\\frac{d (5 x)^3}{d x} = \\frac{d (5 x)^3}{d (5 x)} \\frac{d (5 x)}{d x}\n",
 62 |     "$$\n",
 63 |     "\n",
 64 |     "We know that $ \\frac{d x^3}{x} = 3 x^2 $.\n",
 65 |     "\n",
 66 |     "So that:\n",
 67 |     "$$\n",
 68 |     "\\frac{d (5 x)^3}{d (5 x)} \\frac{d (5 x)}{d x} = 3 (5 x)^2 5 = 375 x^2\n",
 69 |     "$$\n",
 70 |     "\n",
 71 |     "The same answer. So why chain rule? Imagine if it's not $ (5 x)^3 $, but $ \\cos(\\sin (5 x) 8 x) $? This is where chain rule comes in handy. It can decompose the complicated function step by step and arrive at the solution that may be otherwise too complicated. I'm not going to derive the derivatives of $ \\cos(\\sin (5 x) 8 x) $. Try it yourself!"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "markdown",
 76 |    "metadata": {},
 77 |    "source": [
 78 |     "## So how does PyTorch calculate gradients?\n",
 79 |     "\n",
 80 |     "Gradients are multi-dimensional derivatives. A gradient for a list of parameter $ X $ with regards to the number $ y $ can be defined as:\n",
 81 |     "\n",
 82 |     "$$\n",
 83 |     "\\begin{bmatrix}\n",
 84 |     "\\frac{d y}{d x_1} \\\\\n",
 85 |     "\\frac{d y}{d x_2} \\\\\n",
 86 |     "\\vdots \\\\\n",
 87 |     "\\frac{d y}{d x_n}\n",
 88 |     "\\end{bmatrix}\n",
 89 |     "$$\n",
 90 |     "\n",
 91 |     "Gradients are calculated using chain rule. Turns out that most functions you use, that PyTorch provides, are composed by easy functions, and PyTorch derives the gradients for the "
 92 |    ]
 93 |   }
 94 |  ],
 95 |  "metadata": {
 96 |   "kernelspec": {
 97 |    "display_name": "Python 3 (ipykernel)",
 98 |    "language": "python",
 99 |    "name": "python3"
100 |   },
101 |   "language_info": {
102 |    "codemirror_mode": {
103 |     "name": "ipython",
104 |     "version": 3
105 |    },
106 |    "file_extension": ".py",
107 |    "mimetype": "text/x-python",
108 |    "name": "python",
109 |    "nbconvert_exporter": "python",
110 |    "pygments_lexer": "ipython3",
111 |    "version": "3.9.9"
112 |   }
113 |  },
114 |  "nbformat": 4,
115 |  "nbformat_minor": 2
116 | }
117 | 


--------------------------------------------------------------------------------
/book/intro.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Introduction"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{seealso}\n",
 15 |     "The book's source code is hosted on [GitHub](https://github.com/rentruewang/learning-machine). Please consider giving it a star (★)  if you like it! Please file an issue if you find anything wrong.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "```{seealso}\n",
 24 |     "This book accompanies [Machine Learning with Hung-Yi Lee](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.html). Check it out!\n",
 25 |     "```"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## Why this Book?\n",
 33 |     "\n",
 34 |     "There are many resources for machine learning on the internet. However, most of them are either\n",
 35 |     "\n",
 36 |     "1. Too long. It takes half an hour just to read through.\n",
 37 |     "\n",
 38 |     "2. Too math heavy. It takes you forever to understand.\n",
 39 |     "\n",
 40 |     "3. Too confusing. The concepts are not straight-forward.\n",
 41 |     "\n",
 42 |     "This book aims to solve all of that. It tries to be as concise but easy to grasp as possible."
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "markdown",
 47 |    "metadata": {},
 48 |    "source": [
 49 |     "## What is this book?\n",
 50 |     "\n",
 51 |     "This book is for learners who want to quickly grasp an idea, without diving deep into a topic (it takes way too long!). The book is a handbook for people who want to preserve their time."
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "markdown",
 56 |    "metadata": {},
 57 |    "source": [
 58 |     "## How to use this book?\n",
 59 |     "\n",
 60 |     "Don't use this book as a reference, use it as a handbook instead. We'll cover all the basics.\n"
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "markdown",
 65 |    "metadata": {},
 66 |    "source": [
 67 |     "## Before we start\n",
 68 |     "\n",
 69 |     "In this book, we're mainly focused on deep learning, which is a branch of machine learning that leverages of a lot of computing power and yield incredible result. Sometimes it's called artificial intelligence."
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "markdown",
 74 |    "metadata": {},
 75 |    "source": [
 76 |     "```{note}\n",
 77 |     "AI (artificial intelligence), ML (machine learning), DL (deep learning). This book use these terms interchangeably. They are not equal terms outside of this book, however, in the subset we cover in this book they can be seen as equal.\n",
 78 |     "```"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "markdown",
 83 |    "metadata": {},
 84 |    "source": [
 85 |     "## Why learn machine learning?\n",
 86 |     "\n",
 87 |     "There are many reason to learn ML. The most obvious reason is that ML is quite powerful and useful, and many systems use it ranging from Google's search to snapchat filters. Secondly, it's quite the rage nowadays, so it certainly would help you land a prestigious job. Thirdly, in the case where robots take over the world (like Elon Musk feared), having learned machines' ways of dealing things may be your only path to survival! Jokes aside, learning ML does help to calm your nerves since most of the theories on the Internet about robots are just over the top. And learning ML helps keeping you from being tricked by those deceptive information."
 88 |    ]
 89 |   }
 90 |  ],
 91 |  "metadata": {
 92 |   "kernelspec": {
 93 |    "display_name": "Python 3 (ipykernel)",
 94 |    "language": "python",
 95 |    "name": "python3"
 96 |   },
 97 |   "language_info": {
 98 |    "codemirror_mode": {
 99 |     "name": "ipython",
100 |     "version": 3
101 |    },
102 |    "file_extension": ".py",
103 |    "mimetype": "text/x-python",
104 |    "name": "python",
105 |    "nbconvert_exporter": "python",
106 |    "pygments_lexer": "ipython3",
107 |    "version": "3.9.5"
108 |   }
109 |  },
110 |  "nbformat": 4,
111 |  "nbformat_minor": 2
112 | }
113 | 


--------------------------------------------------------------------------------
/book/_toc.yml:
--------------------------------------------------------------------------------
  1 | format: jb-book
  2 | root: intro
  3 | parts:
  4 |   - caption: Getting Started
  5 |     chapters:
  6 |       - file: basics/basics
  7 |         sections:
  8 |           - file: basics/data/data
  9 |           - file: basics/model/model
 10 |           - file: basics/loss/loss
 11 |       - file: basics/approx/approx
 12 |       - file: basics/gradients/gradients
 13 |         sections:
 14 |           - file: basics/gradients/loss-fn-derivative
 15 |           - file: basics/gradients/back-prop
 16 |   - caption: Common Tasks
 17 |     chapters:
 18 |       - file: tasks/tasks
 19 |       - file: tasks/regression/regression
 20 |         sections:
 21 |           - file: tasks/regression/auto/auto
 22 |       - file: tasks/classification/classification
 23 |   - caption: Common Building Blocks
 24 |     chapters:
 25 |       - file: layers/layers
 26 |       - file: layers/linear/linear
 27 |         sections:
 28 |           - file: layers/linear/linear-grad
 29 |       - file: layers/cnn/cnn
 30 |       - file: layers/rnn/rnn
 31 |         sections:
 32 |           - file: layers/rnn/lstm/lstm
 33 |           - file: layers/rnn/gru/gru
 34 |       - file: layers/emb/emb
 35 |       - file: layers/dropout/dropout
 36 |       - file: layers/norm/norm
 37 |       - file: layers/padding/padding
 38 |       - file: layers/pooling/pooling
 39 |       - file: layers/transformer/transformer
 40 |         sections:
 41 |           - file: layers/transformer/attn/attn
 42 |           - file: layers/transformer/attn/self-attn
 43 |           - file: layers/transformer/transformer-vs-rnn
 44 |           - file: layers/transformer/training/training
 45 |           - file: layers/transformer/training/teacher/teacher
 46 |           - file: layers/transformer/training/token/token
 47 |           - file: layers/transformer/training/no-training/no-training
 48 |       - file: layers/activation/activation
 49 |         sections:
 50 |           - file: layers/activation/relu/relu
 51 |           - file: layers/activation/sigmoid/sigmoid
 52 |           - file: layers/activation/softmax/softmax
 53 |           - file: layers/activation/tanh/tanh
 54 |   - caption: Some Other Important Things To Notice
 55 |     chapters:
 56 |       - file: notice/notice
 57 |       - file: notice/batch/batch
 58 |       - file: notice/gradient/norm
 59 |       - file: notice/gradient/saddle
 60 |       - file: notice/lr/lr
 61 |       - file: notice/optimizer/optimizer
 62 |       - file: notice/data/overfit
 63 |       - file: notice/data/underfit
 64 |   - caption: Generative Models
 65 |     chapters:
 66 |       - file: generative/generative
 67 |       - file: generative/ae/ae
 68 |         sections:
 69 |           - file: generative/ae/ae-arch
 70 |           - file: generative/ae/ae-semi
 71 |           - file: generative/ae/vae/vae
 72 |       - file: generative/gan/gan
 73 |       - file: generative/gmm/gmm
 74 |   - caption: Improving Models
 75 |     chapters:
 76 |       - file: better/better
 77 |       - file: better/explainable/explainable
 78 |         sections:
 79 |           - file: better/explainable/saliency
 80 |       - file: better/meta/meta
 81 |       - file: better/lll/lll
 82 |       - file: better/compression/compression
 83 |   - caption: Reuse Existing Models
 84 |     chapters:
 85 |       - file: reuse/reuse
 86 |       - file: reuse/transfer/tl-da
 87 |         sections:
 88 |           - file: reuse/transfer/tl-vs-da
 89 |       - file: reuse/distil/distil
 90 |   - caption: Beyond Supervised Training
 91 |     chapters:
 92 |       - file: unsupervised/unsupervised
 93 |       - file: unsupervised/clustering/clustering
 94 |       - file: unsupervised/decision-tree/decision-tree
 95 |       - file: unsupervised/self-supervised/self-supervised
 96 |       - file: unsupervised/semi-supervised/semi-supervised
 97 |   - caption: Reinforcement Learning
 98 |     chapters:
 99 |       - file: reinforce/reinforce
100 |         sections:
101 |           - file: reinforce/essential/state
102 |           - file: reinforce/essential/agent
103 |           - file: reinforce/essential/action
104 |           - file: reinforce/essential/reward
105 |           - file: reinforce/essential/online-offline
106 |       - file: reinforce/value/value
107 |         sections:
108 |           - file: reinforce/value/q-learning
109 |       - file: reinforce/policy/policy
110 |         sections:
111 |           - file: reinforce/policy/policy-gradient
112 |       - file: reinforce/ac/ac
113 | 


--------------------------------------------------------------------------------
/book/layers/layers.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Layers"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## What are layers?\n",
 15 |     "\n",
 16 |     "Layers are basic building blocks of a neural network. You can think of layers as filters. Each layer do something that helps you in your task. It's convenient to think of a big neural network system as many layers working together to achieve a common goal. Don't think too hard. Anything can be a layer. A layer is just a fancy way to refer to a function. It takes in an input and spits out some output. It can be anything, trust me."
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## What are some common layers?\n",
 24 |     "\n",
 25 |     "- [Linear layer](./linear/linear) is everywhere. \n",
 26 |     "- [Convolution layer](./cnn/cnn) is a layer that specialize in processing data that have patterns. Like images and voice.\n",
 27 |     "- [Recurrent layer](./rnn/rnn) and [Transformer](./transformer/transformer) are good in processing sequences and text.\n",
 28 |     "- [Padding layer](./padding/padding) and [Pooling layer](./pooling/pooling) layers are good at reshaping the input data.\n",
 29 |     "- [Embedding layer](./emb/emb) are good at converting tokens (like characters) to vector (its meanings).\n",
 30 |     "- ..._and a lot more_"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "markdown",
 35 |    "metadata": {},
 36 |    "source": [
 37 |     "## Layers in code"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": null,
 43 |    "metadata": {},
 44 |    "outputs": [],
 45 |    "source": [
 46 |     "import torch\n",
 47 |     "from torch.nn import Conv2d, Linear, Module, Sequential"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "markdown",
 52 |    "metadata": {},
 53 |    "source": [
 54 |     "Layers in PyTorch is represented by `Module` class. All layers, such as `Linear`, are subclasses of it."
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": null,
 60 |    "metadata": {},
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "print(issubclass(Linear, Module))\n",
 64 |     "print(issubclass(Conv2d, Module))"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "markdown",
 69 |    "metadata": {},
 70 |    "source": [
 71 |     "This is how you define a custom `Module`. It's easy."
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": null,
 77 |    "metadata": {},
 78 |    "outputs": [],
 79 |    "source": [
 80 |     "class Identity(Module):\n",
 81 |     "    def __init__(self):\n",
 82 |     "        super().__init__()\n",
 83 |     "\n",
 84 |     "    def forward(x):\n",
 85 |     "        return x"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "markdown",
 90 |    "metadata": {},
 91 |    "source": [
 92 |     "We created an identity function! Which means it does nothing but spit out what's passed in. But isn't it very easy! You now have a reusable module that can be put into a neural network! And can use a lot of PyTorch's function such as callbacks or printing."
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "markdown",
 97 |    "metadata": {},
 98 |    "source": [
 99 |     "Now let's create a sequential model."
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": null,
105 |    "metadata": {},
106 |    "outputs": [],
107 |    "source": [
108 |     "model = Sequential(\n",
109 |     "    Linear(3, 4),\n",
110 |     "    Linear(4, 5),\n",
111 |     "    Linear(5, 6),\n",
112 |     ")\n",
113 |     "\n",
114 |     "x = torch.randn(3)\n",
115 |     "print(x)\n",
116 |     "print(x.shape)\n",
117 |     "\n",
118 |     "y = model(x)\n",
119 |     "print(y)\n",
120 |     "print(y.shape)"
121 |    ]
122 |   }
123 |  ],
124 |  "metadata": {
125 |   "kernelspec": {
126 |    "display_name": "Python 3",
127 |    "language": "python",
128 |    "name": "python3"
129 |   },
130 |   "language_info": {
131 |    "codemirror_mode": {
132 |     "name": "ipython",
133 |     "version": 3
134 |    },
135 |    "file_extension": ".py",
136 |    "mimetype": "text/x-python",
137 |    "name": "python",
138 |    "nbconvert_exporter": "python",
139 |    "pygments_lexer": "ipython3",
140 |    "version": "3.8.6"
141 |   }
142 |  },
143 |  "nbformat": 4,
144 |  "nbformat_minor": 2
145 | }
146 | 


--------------------------------------------------------------------------------
/book/reinforce/value/q-learning.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Q Learning"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## What is a Q function?\n",
 15 |     "\n",
 16 |     "A Q function is a more complicated version of a value function. In RL, a value function $ v(s) $ tries to approximate all the future rewards a state $ s $ will get. A Q function basically does the same thing, but with a twist.\n",
 17 |     "\n",
 18 |     "Remember that in RL, state transition happens when an agent, standing in a state $ s^1 $, takes an action $ a $, and ends up in another state $ s^2 $? We notice that instead of directly approximating the value of $ s^2 $, which is $ v(s^2) $, we could use a **Q function**, which takes $ s^1 $ and $ a $ as parameters, and define the q function $ Q(s^1, a) = v(s^2) $ given that $ (s^1, a) \\rightarrow s^2 $.\n",
 19 |     "\n",
 20 |     "So, compare to value function $ v(s) $ that takes in a state $ s $ as input and outputs a scalar, a Q function $ Q(s) $ that takes in $ s $ as input will output a vector of possible rewards corresponding to possible states we will end up in after taking each possible actions."
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "markdown",
 25 |    "metadata": {},
 26 |    "source": [
 27 |     "## How are Q functions better?\n",
 28 |     "\n",
 29 |     "Q functions are better when the number of possible actions are pre-determined, and when we can generate a batch of values faster than we can generate them one-by-one. Because we already know all the possible actions, we could generate a vector faster than a value function, which has to take in the next states one-by-one, which takes precious computational time."
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "metadata": {},
 35 |    "source": [
 36 |     "## Q learning in simple terms."
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "markdown",
 41 |    "metadata": {},
 42 |    "source": [
 43 |     "```{warning}\n",
 44 |     "Incoming math!\n",
 45 |     "```"
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "markdown",
 50 |    "metadata": {},
 51 |    "source": [
 52 |     "In Q-learning, we often make use of an algorithm called $ \\epsilon $-greedy. What the algorithm does is that given an epsilon that's in the range $ [0,1] $, $ \\epsilon $ is the probability that we randomly select an action, or else we greedily choose the best value of all possible actions.\n",
 53 |     "\n",
 54 |     "Remember that the Q function $ Q(s) $ outputs the future rewards associated with each state that this action can transition to? Suppose that our Q function is accurate enough, $ \\arg_a \\max Q(s) $ is the best action to take in all possible actions.\n",
 55 |     "\n",
 56 |     "Imagine that we are in a state $ s^1 $, after taking the action $ \\arg_a \\max Q(s) $, we arrive at the state $ s^2 $. If our Q function is accurate, then we have\n",
 57 |     "\n",
 58 |     "$$\n",
 59 |     "Q(s^1, a) = \\gamma R_a + \\arg_a \\max Q(s^2)\n",
 60 |     "$$\n",
 61 |     "\n",
 62 |     "because of the RL equation that\n",
 63 |     "\n",
 64 |     "$$\n",
 65 |     "G_t = \\gamma R_a + G_{t+1}\n",
 66 |     "$$\n",
 67 |     "\n",
 68 |     "So the update rules are easy. We want to update\n",
 69 |     "\n",
 70 |     "$$\n",
 71 |     "Q(s^1, a)\n",
 72 |     "$$\n",
 73 |     "\n",
 74 |     "to be as close to \n",
 75 |     "\n",
 76 |     "$$\n",
 77 |     "\\gamma R_a + \\arg_a \\max Q(s^2)\n",
 78 |     "$$\n",
 79 |     "\n",
 80 |     "as possible (but not vice versa! The reason is a little big complicated. See Sutton and Barto's Intro to RL book).\n",
 81 |     "\n",
 82 |     "Then we have the update rule:\n",
 83 |     "\n",
 84 |     "$$\n",
 85 |     "Q(s^1, a) = Q(s^1, a) - \\eta \\nabla L(Q(s^1, a), \\gamma R_a + \\arg_a \\max Q(s^2))\n",
 86 |     "$$\n",
 87 |     "\n",
 88 |     "where $ L $ is the loss function."
 89 |    ]
 90 |   }
 91 |  ],
 92 |  "metadata": {
 93 |   "kernelspec": {
 94 |    "display_name": "Python 3 (ipykernel)",
 95 |    "language": "python",
 96 |    "name": "python3"
 97 |   },
 98 |   "language_info": {
 99 |    "codemirror_mode": {
100 |     "name": "ipython",
101 |     "version": 3
102 |    },
103 |    "file_extension": ".py",
104 |    "mimetype": "text/x-python",
105 |    "name": "python",
106 |    "nbconvert_exporter": "python",
107 |    "pygments_lexer": "ipython3",
108 |    "version": "3.9.5"
109 |   }
110 |  },
111 |  "nbformat": 4,
112 |  "nbformat_minor": 2
113 | }
114 | 


--------------------------------------------------------------------------------
/book/layers/activation/softmax/softmax.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Softmax"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## Introduction\n",
 15 |     "\n",
 16 |     "`Softmax` is a multi-dimension version of `sigmoid`. Softmax is used when:\n",
 17 |     "\n",
 18 |     "1. Used as a _softer_ max function, as it makes the max value more pronounced in its output.\n",
 19 |     "2. Approximating a probability distribution, because the output of softmax will never exceed $ 1 $ or get below $ 0 $."
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {},
 25 |    "source": [
 26 |     "## Definition\n",
 27 |     "\n",
 28 |     "softmax($ x_i $) = $ \\frac{e^{x_i}}{\\sum_j e^{x_j}} $\n",
 29 |     "\n",
 30 |     "With temperature\n",
 31 |     "\n",
 32 |     "softmax($ x_i $, $ t $) = $ \\frac{e^{\\frac{x_i}{t}}}{\\sum_j e^{\\frac{x_j}{t}}} $"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "markdown",
 37 |    "metadata": {},
 38 |    "source": [
 39 |     "## How does softmax look, and how it works in code?"
 40 |    ]
 41 |   },
 42 |   {
 43 |    "cell_type": "code",
 44 |    "execution_count": null,
 45 |    "metadata": {},
 46 |    "outputs": [],
 47 |    "source": [
 48 |     "%matplotlib inline\n",
 49 |     "\n",
 50 |     "import numpy as np\n",
 51 |     "from matplotlib import pyplot as plt"
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "code",
 56 |    "execution_count": null,
 57 |    "metadata": {},
 58 |    "outputs": [],
 59 |    "source": [
 60 |     "def softmax(x, t = 1):\n",
 61 |     "    exp = np.exp(x / t)\n",
 62 |     "\n",
 63 |     "    # sums over the last axis\n",
 64 |     "    sum_exp = exp.sum(-1, keepdims=True)\n",
 65 |     "    \n",
 66 |     "    return exp / sum_exp"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "markdown",
 71 |    "metadata": {},
 72 |    "source": [
 73 |     "Now let's see how softmax approaches the max function"
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": null,
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "array = np.random.randn(5)\n",
 83 |     "softer_max = softmax(array)\n",
 84 |     "print(array)\n",
 85 |     "print(softer_max)"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "markdown",
 90 |    "metadata": {},
 91 |    "source": [
 92 |     "See how the maximum value gets emphasized and gets a much larger share of probability. Applying weighted average would make it even clearer."
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "code",
 97 |    "execution_count": null,
 98 |    "metadata": {},
 99 |    "outputs": [],
100 |    "source": [
101 |     "average = array.sum() / array.size\n",
102 |     "weighted = array @ softer_max\n",
103 |     "print(average)\n",
104 |     "print(weighted)\n",
105 |     "print(array.max())"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "markdown",
110 |    "metadata": {},
111 |    "source": [
112 |     "See how the weighted average gets closer to the real maximum. To make it even closer to max, reduce the temperature."
113 |    ]
114 |   },
115 |   {
116 |    "cell_type": "code",
117 |    "execution_count": null,
118 |    "metadata": {},
119 |    "outputs": [],
120 |    "source": [
121 |     "colder_max = softmax(array, 0.1)\n",
122 |     "weighted = array @ colder_max\n",
123 |     "print(average)\n",
124 |     "print(weighted)\n",
125 |     "print(array.max())"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "metadata": {},
131 |    "source": [
132 |     "Softmax is a generalization of sigmoid. Sigmoid can be seen as softmax($ [x, 0] $). Plotting shows that."
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "code",
137 |    "execution_count": null,
138 |    "metadata": {},
139 |    "outputs": [],
140 |    "source": [
141 |     "x = np.zeros([410, 2])\n",
142 |     "x[:, 0] = np.arange(-200, 210) / 20\n",
143 |     "y = softmax(x)\n",
144 |     "plt.plot(x[:, 0], y[:, 0])\n",
145 |     "plt.show()"
146 |    ]
147 |   }
148 |  ],
149 |  "metadata": {
150 |   "kernelspec": {
151 |    "display_name": "Python 3",
152 |    "language": "python",
153 |    "name": "python3"
154 |   },
155 |   "language_info": {
156 |    "codemirror_mode": {
157 |     "name": "ipython",
158 |     "version": 3
159 |    },
160 |    "file_extension": ".py",
161 |    "mimetype": "text/x-python",
162 |    "name": "python",
163 |    "nbconvert_exporter": "python",
164 |    "pygments_lexer": "ipython3",
165 |    "version": "3.9.5"
166 |   }
167 |  },
168 |  "nbformat": 4,
169 |  "nbformat_minor": 2
170 | }
171 | 


--------------------------------------------------------------------------------
/book/reinforce/reinforce.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Reinforcement Learning"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{note}\n",
 15 |     "We will use RL to refer to reinforcement learning.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## What is RL?\n",
 24 |     "\n",
 25 |     "RL is a branch of machine learning, focusing on interacting with things. RL was mainly developed by observing animal/human behavior, so it has a lot in common with how humans make decisions. In RL, an **agent** makes an **action** that changes an **environment**, and receives **rewards** in the process. So for example, RL can be used to model how a person, _agent_, decides to have curry for dinner, _action_, which causes some carbon footprint on earth, _environment_, and feels happy about it, _reward_. In other words, RL can be used to model problems that are interactive, about things changing, and how an action will impact future behavior, and making the right decisions. Oh, and eating curry isn't that bad for the planet earth."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## Reinforce?\n",
 33 |     "\n",
 34 |     "I agree that it's a bad name. RL in its early days referred to updating a model, that's initially random, and **reinforce**/enhance the actions that yield good rewards."
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "markdown",
 39 |    "metadata": {},
 40 |    "source": [
 41 |     "## Markov Decision Process"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "markdown",
 46 |    "metadata": {},
 47 |    "source": [
 48 |     "```{note}\n",
 49 |     "Markov Decision Process is also called MDP.\n",
 50 |     "```"
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "markdown",
 55 |    "metadata": {},
 56 |    "source": [
 57 |     "RL is designed to optimize the rewards out of an MDP. An MDP consists of several parts we previously mentioned:"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "markdown",
 62 |    "metadata": {},
 63 |    "source": [
 64 |     "### [Agent](./essential/agent)\n",
 65 |     "\n",
 66 |     "An agent is a person or a computer or an animal, anything that makes the world around it change. "
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "markdown",
 71 |    "metadata": {},
 72 |    "source": [
 73 |     "### [State](./essential/state)\n",
 74 |     "\n",
 75 |     "An agent interacts with an environment. And state is used to describe that environment. If the agent modifies the environment, we say that the state of the environment is changed."
 76 |    ]
 77 |   },
 78 |   {
 79 |    "cell_type": "markdown",
 80 |    "metadata": {},
 81 |    "source": [
 82 |     "### [Action](./essential/action)\n",
 83 |     "\n",
 84 |     "An agent makes an action to change the environment, which is, an agent makes an action to transition between states."
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "markdown",
 89 |    "metadata": {},
 90 |    "source": [
 91 |     "### [Reward](./essential/reward)\n",
 92 |     "\n",
 93 |     "Reward is obtained when making actions. Rewards are used to measure how good or bad an action is."
 94 |    ]
 95 |   },
 96 |   {
 97 |    "cell_type": "markdown",
 98 |    "metadata": {},
 99 |    "source": [
100 |     "So basically what RL tries to solve is to have a good agent, that takes reasonable actions between states, and try to get the most rewards."
101 |    ]
102 |   },
103 |   {
104 |    "cell_type": "markdown",
105 |    "metadata": {},
106 |    "source": [
107 |     "## Important terms in RL."
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "markdown",
112 |    "metadata": {},
113 |    "source": [
114 |     "### [Value](./value/value)\n",
115 |     "\n",
116 |     "Value function refers to the total of rewards an agent will get before it dies (enters a terminated state)."
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "markdown",
121 |    "metadata": {},
122 |    "source": [
123 |     "### [Policy](./policy/policy)\n",
124 |     "\n",
125 |     "A policy refers to how an agent makes a decision."
126 |    ]
127 |   }
128 |  ],
129 |  "metadata": {
130 |   "kernelspec": {
131 |    "display_name": "Python 3",
132 |    "language": "python",
133 |    "name": "python3"
134 |   },
135 |   "language_info": {
136 |    "codemirror_mode": {
137 |     "name": "ipython",
138 |     "version": 3
139 |    },
140 |    "file_extension": ".py",
141 |    "mimetype": "text/x-python",
142 |    "name": "python",
143 |    "nbconvert_exporter": "python",
144 |    "pygments_lexer": "ipython3",
145 |    "version": "3.9.6"
146 |   }
147 |  },
148 |  "nbformat": 4,
149 |  "nbformat_minor": 2
150 | }
151 | 


--------------------------------------------------------------------------------
/book/reuse/transfer/tl-da.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Transfer Learning and Domain Adaptation"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{note}\n",
 15 |     "We will use the abbreviation TL for transfer learning.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "```{note}\n",
 24 |     "We will use the abbreviation DA for domain adaptation.\n",
 25 |     "```"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## When do we need TL/DA?\n",
 33 |     "\n",
 34 |     "Suppose that you are born and raised in the US, you've not learned a foreign language, and you've never been to another country. Now all of a sudden you are thrown into a small town in Russia, what would you think?\n",
 35 |     "\n",
 36 |     "Machine learning models feel the same way when you feed them completely different inputs. For example, when a model has only seen MNIST digits, but now you're asking it to classify Full-HD, colorful, picturesque printed numbers taken from a photo, it's no wonder that your model decides not to work.\n",
 37 |     "\n",
 38 |     "However, those Full-HD numbers are still numbers right? Sure they have something in common with MNIST? Indeed, the same number 1 share some traits and the same number 2 also are roughly similar (just like Russian and English, they are both from humans). But it's still different enough that the machine learning model decides to give up.\n",
 39 |     "\n",
 40 |     "TL/DA aims to solve that. What TL/DA tries to do, is to make a model work across several similar environments that are different. By applying techniques in TL/DA, a model can perform better in not-so-similar-but-arguably-the-same environments, compared to the the environment where the model is trained."
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "markdown",
 45 |    "metadata": {},
 46 |    "source": [
 47 |     "## Different techniques in TL/DA.\n",
 48 |     "\n",
 49 |     "Most TL/DA methods fall in one of the following three categories."
 50 |    ]
 51 |   },
 52 |   {
 53 |    "cell_type": "markdown",
 54 |    "metadata": {},
 55 |    "source": [
 56 |     "### Discrepancy-based methods.\n",
 57 |     "\n",
 58 |     "Discrepancy-based methods utilize a **feature-extractor** and a very simple **classifier**. Those methods try to align the **statistical measures** of features of different domains.\n",
 59 |     "\n",
 60 |     "For example, we have a text feature-extractor, trained on news. Suppose that when reading from news, the mean of the features is $ \\mu $ and the stddev $ \\sigma $. When reading from medical documents, the mean of the features is $ \\mu' $ and the stddev $ \\sigma' $. A discrepancy-based method basically rescales the features extracted from the feature-extractor $ x' $ to $ \\frac{\\sigma}{\\sigma'} (x' - \\mu') + \\mu $. And then pass the rescaled features through the classifier."
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "markdown",
 65 |    "metadata": {},
 66 |    "source": [
 67 |     "### Adversarial-based methods.\n",
 68 |     "\n",
 69 |     "Adversarial-based methods utilize a **feature-extractor** and a **discriminator**. Those methods aim to train a feature-extractor that extract features that are common to different domains.\n",
 70 |     "\n",
 71 |     "For example, we have an audio feature-extractor. When training on classical music and funk, both types are encoded into a feature vector, and the discriminator's job is to tell apart classical music and funk. And we train it like a [GAN](/generative/gan/gan)."
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "markdown",
 76 |    "metadata": {},
 77 |    "source": [
 78 |     "### Reconstruction-based methods.\n",
 79 |     "\n",
 80 |     "Reconstruction-based methods try to utilize one **encoder** and multiple **decoders**. Those methods try to encode features that can be reconstructed by different decoders onto different domains.\n",
 81 |     "\n",
 82 |     "For example, we have an mammal image encoder, but an gorilla decoder and a monkey decoder. During training, the encoder tries to encode both images from gorilla domain, and monkey domain. The gorilla decoder would try to decode the gorilla images' features into gorillas, and the monkey decoder would try to decode the monkey images' features into monkeys."
 83 |    ]
 84 |   }
 85 |  ],
 86 |  "metadata": {
 87 |   "kernelspec": {
 88 |    "display_name": "Python 3 (ipykernel)",
 89 |    "language": "python",
 90 |    "name": "python3"
 91 |   },
 92 |   "language_info": {
 93 |    "codemirror_mode": {
 94 |     "name": "ipython",
 95 |     "version": 3
 96 |    },
 97 |    "file_extension": ".py",
 98 |    "mimetype": "text/x-python",
 99 |    "name": "python",
100 |    "nbconvert_exporter": "python",
101 |    "pygments_lexer": "ipython3",
102 |    "version": "3.9.5"
103 |   }
104 |  },
105 |  "nbformat": 4,
106 |  "nbformat_minor": 2
107 | }
108 | 


--------------------------------------------------------------------------------
/book/generative/gmm/gmm.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Gaussian Mixture Model"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{note}\n",
 15 |     "We will refer to Gaussian mixture model as GMM in this section.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "```{note}\n",
 24 |     "We will refer to Gaussian distribution as simply Gaussian in this section\n",
 25 |     "\n",
 26 |     "```"
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "markdown",
 31 |    "metadata": {},
 32 |    "source": [
 33 |     "## What are GMMs?\n",
 34 |     "\n",
 35 |     "A GMM uses many Gaussians to approximate the probability of events. With the model, you can guess at what place the new event will happen."
 36 |    ]
 37 |   },
 38 |   {
 39 |    "cell_type": "markdown",
 40 |    "metadata": {},
 41 |    "source": [
 42 |     "## When to use GMMs?\n",
 43 |     "\n",
 44 |     "GMMs are good for creating input. Because it models a distribution by combining several Gaussians, it is capable of generating sample data from that distribution.\n",
 45 |     "A GMM is also super useful in that it uses minimal parameters, one average vector $ \\mu $, and one covariance matrix $ \\sigma $ for each Gaussian. And that's it! Usually, it's used when we want a fast way to have a relatively good model."
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "markdown",
 50 |    "metadata": {},
 51 |    "source": [
 52 |     "## Why GMMs use Gaussians?\n",
 53 |     "\n",
 54 |     "We all know gaussians right? The one that looks a bit funny, like a pile of slime. It's also called normal distribution, because of how common it is in modelling the real world. That's the reason for many cases using a few Gaussians in a GMM will produce sufficely good results.\n",
 55 |     "Gaussian also has a special property, that it can approximate any distribution given enough Gaussian distributions, which is also the reason it's used in GMM."
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "markdown",
 60 |    "metadata": {},
 61 |    "source": [
 62 |     "## How does GMM look?\n",
 63 |     "\n",
 64 |     "We said that GMM is basically a mix of multiple Gaussian distributions. So how does this new distribution look like?"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "code",
 69 |    "execution_count": null,
 70 |    "metadata": {},
 71 |    "outputs": [],
 72 |    "source": [
 73 |     "%matplotlib inline\n",
 74 |     "\n",
 75 |     "import numpy as np\n",
 76 |     "from matplotlib import pyplot as plt"
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "code",
 81 |    "execution_count": null,
 82 |    "metadata": {},
 83 |    "outputs": [],
 84 |    "source": [
 85 |     "def gaussian(x, mean, stddev):\n",
 86 |     "    exponent = - (((x - mean) / stddev) ** 2) / 2\n",
 87 |     "    numerator = np.exp(exponent)\n",
 88 |     "    denominator = (stddev * np.sqrt(2 * np.pi))\n",
 89 |     "    return numerator / denominator"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "code",
 94 |    "execution_count": null,
 95 |    "metadata": {},
 96 |    "outputs": [],
 97 |    "source": [
 98 |     "# Choose number of gaussians\n",
 99 |     "num_gaussians = 6\n",
100 |     "\n",
101 |     "means = np.random.randn(num_gaussians)\n",
102 |     "# stddev should be larger than 0\n",
103 |     "stddevs = abs(np.random.randn(num_gaussians))\n",
104 |     "\n",
105 |     "# weights should sum to 1\n",
106 |     "weights = abs(np.random.randn(num_gaussians))\n",
107 |     "weights /= weights.sum()"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "code",
112 |    "execution_count": null,
113 |    "metadata": {},
114 |    "outputs": [],
115 |    "source": [
116 |     "x = (np.arange(-200, 201) / 20).tolist()\n",
117 |     "\n",
118 |     "for n in range(num_gaussians):\n",
119 |     "    y = [gaussian(x[i], means[n], stddevs[n]) * weights[n] for i in range(len(x))]\n",
120 |     "    plt.plot(x, y)\n",
121 |     "\n",
122 |     "plt.show()"
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "markdown",
127 |    "metadata": {},
128 |    "source": [
129 |     "The sum of Gaussians is the look of the new distribution."
130 |    ]
131 |   },
132 |   {
133 |    "cell_type": "code",
134 |    "execution_count": null,
135 |    "metadata": {},
136 |    "outputs": [],
137 |    "source": [
138 |     "y = [0] * len(x)\n",
139 |     "for n in range(num_gaussians):\n",
140 |     "    for i in range(len(x)):\n",
141 |     "        y[i] += gaussian(x[i], means[n], stddevs[n]) * weights[n]\n",
142 |     "\n",
143 |     "plt.plot(x, y)\n",
144 |     "plt.show()"
145 |    ]
146 |   }
147 |  ],
148 |  "metadata": {
149 |   "kernelspec": {
150 |    "display_name": "Python 3",
151 |    "language": "python",
152 |    "name": "python3"
153 |   },
154 |   "language_info": {
155 |    "codemirror_mode": {
156 |     "name": "ipython",
157 |     "version": 3
158 |    },
159 |    "file_extension": ".py",
160 |    "mimetype": "text/x-python",
161 |    "name": "python",
162 |    "nbconvert_exporter": "python",
163 |    "pygments_lexer": "ipython3",
164 |    "version": "3.9.5"
165 |   }
166 |  },
167 |  "nbformat": 4,
168 |  "nbformat_minor": 2
169 | }
170 | 


--------------------------------------------------------------------------------
/book/basics/approx/approx.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Approximation models"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "```{warning}\n",
 15 |     "This part may be more mathematics focused. If you simply want to grasp the intuition behind deep learning, feel free to skip the section.\n",
 16 |     "```"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## What is an approximation model?\n",
 24 |     "\n",
 25 |     "Mathematically speaking, an approximation model approximates (but may never be) the output function. For example, we can approximate $ 0 $ with the function $ \\frac{1}{x} $. When $ x \\rightarrow \\inf $, the function $ \\frac{1}{x} $ gets very close to $ 0 $. Notice that $ \\frac{1}{x} $ can never become $ 0 $ no matter how big $ x $ gets, but $ \\frac{1}{x} $ gets close enough to $ 0 $ that we don't care about that anymore.\n",
 26 |     "\n",
 27 |     "A machine learning model is no different. Taking the cat/dog differentiator for example. Since mapping from images and labels can be seen as a function's inputs and outputs, if we can use a model to approximate the mapping, and do a well enough job, then the model is essentially a good appproximation model to the function that maps from cat images to cat labels and dog images to dog labels."
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
 34 |     "## How well can a model approximate?\n",
 35 |     "\n",
 36 |     "Well, it's proven that a model can be as accurate as it wants, provided enough parameters.\n",
 37 |     "\n",
 38 |     "Suppose we have a function mapping from $ x $ to $ y $. With such a simple function, we can plot it on paper."
 39 |    ]
 40 |   },
 41 |   {
 42 |    "cell_type": "code",
 43 |    "execution_count": null,
 44 |    "metadata": {},
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "%matplotlib inline\n",
 48 |     "\n",
 49 |     "import numpy as np\n",
 50 |     "from matplotlib import pyplot as plt"
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "code",
 55 |    "execution_count": null,
 56 |    "metadata": {},
 57 |    "outputs": [],
 58 |    "source": [
 59 |     "x = np.arange(60000) / 10000\n",
 60 |     "y = np.sin(x)\n",
 61 |     "plt.plot(x, y)\n",
 62 |     "plt.show()"
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "markdown",
 67 |    "metadata": {},
 68 |    "source": [
 69 |     "You see a sine wave right? However, this sine wave isn't really smooth! It consists of a lot of little segments that are just too small to see. Let's scale up a bit."
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": null,
 75 |    "metadata": {},
 76 |    "outputs": [],
 77 |    "source": [
 78 |     "x = np.arange(12) / 2\n",
 79 |     "y = np.sin(x)\n",
 80 |     "plt.plot(x, y)\n",
 81 |     "plt.show()"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "markdown",
 86 |    "metadata": {},
 87 |    "source": [
 88 |     "See how the shape is approximated by a lot of line segments? Here's the deal, we could approximate any function with line segments, as long as we have enough of it. A machine learning model basically works the same way, creating truly complicated approximations to real world functions (like cat images to cat labels) by using a lot of simple functions (activation functions, more on that later)."
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "metadata": {},
 94 |    "source": [
 95 |     "Now you should use your imagination. For higher dimension $ x $, we can still think of it as a \"line\", just that this line is high dimensional. If $ x $ is 2D, then this line is a surface. For higher dimension though, you'll have to use your imagination."
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "markdown",
100 |    "metadata": {},
101 |    "source": [
102 |     "## Why deeper is better?\n",
103 |     "\n",
104 |     "Suppose that we have a simple model where a neuron (a node in a layer) makes a decision: is the number bigger than a threshold? In other words, each neuron `n` separates the numbers into two intervals, bigger than `n`, or smaller than `n`.\n",
105 |     "\n",
106 |     "We have two models:\n",
107 |     "\n",
108 |     "1. One layer with three nodes.\n",
109 |     "2. Three layers with one node each.\n",
110 |     "\n",
111 |     "For model 1, it could separate all possible inputs (numbers) into 4 intervals (because each neuron separates the current interval in half.)\n",
112 |     "For model 2, because every neuron depends on the neuron comes before it, it could separate all possible inputs (numbers) into 8 intervals (2 to the power of 3).\n",
113 |     "\n",
114 |     "8 is bigger than 4!\n",
115 |     "\n",
116 |     "The reason deeper models perform better follows the same reason."
117 |    ]
118 |   }
119 |  ],
120 |  "metadata": {
121 |   "kernelspec": {
122 |    "display_name": "Python 3",
123 |    "language": "python",
124 |    "name": "python3"
125 |   },
126 |   "language_info": {
127 |    "codemirror_mode": {
128 |     "name": "ipython",
129 |     "version": 3
130 |    },
131 |    "file_extension": ".py",
132 |    "mimetype": "text/x-python",
133 |    "name": "python",
134 |    "nbconvert_exporter": "python",
135 |    "pygments_lexer": "ipython3",
136 |    "version": "3.9.5"
137 |   }
138 |  },
139 |  "nbformat": 4,
140 |  "nbformat_minor": 2
141 | }
142 | 


--------------------------------------------------------------------------------
/book/layers/transformer/transformer.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Transformer Block"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "## Attention is all you need\n",
15 |     "\n",
16 |     "The transformer model comes out of the paper _attention is all you need_. The paper shows how powerful pure attention mechanisms can be. They introduced a new kind of attention mechanism called self-attention, which we'll discuss later.\n",
17 |     "\n",
18 |     "The significance about self-attention is that with only attention mechanism, the model achieves state-of-the-art performance on many datasets, in a field previously dominated by RNNs."
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "metadata": {},
24 |    "source": [
25 |     "## How does the transformer work?\n",
26 |     "\n",
27 |     "The transformer architecture is based on a Seq2Seq model. Traditionally, a seq2seq model is basically an encoder and a decoder, like auto-encoders, but both encoder and decoder are RNNs. The encoder first process through the input, then feeds the encoder's RNN state or output to the decoder to decode the full sentence. The idea is that the encoder should be able to encode the input into some kind of representation that contains the meaning of the sentence, and the decoder should be able to understand that representation.\n",
28 |     "\n",
29 |     "In the case of transformers, because it's not an RNN, so instead of RNN state, the attention produced by the encoder is used and sent to the decoder. Decoder uses that global information to produce the output."
30 |    ]
31 |   },
32 |   {
33 |    "cell_type": "markdown",
34 |    "metadata": {},
35 |    "source": [
36 |     "## Transformer encoder\n",
37 |     "\n",
38 |     "The encoding component is a stack of smaller encoders. An encoder does the following thing\n",
39 |     "\n",
40 |     "1. Calculate self-attention score for the input $ I $.\n",
41 |     "2. Weigh the input by self-attention scores $ S(I) $.\n",
42 |     "3. Pass it through an add-and-normalize layer $ O = I + S(I) $.\n",
43 |     "4. Feed the processed data through a linear layer $ F = f(O) $.\n",
44 |     "5. Perform activation on the linear layer's output $ F' = \\sigma(F) $.\n",
45 |     "6. Multiply the mutated output with the output itself $ F'' = F'F $.\n",
46 |     "7. Pass it through an add-and-normalize layer $ F'' + O $.\n",
47 |     "\n",
48 |     "An add-and-normalize layer performs a residual add, adding the input and the processed input."
49 |    ]
50 |   },
51 |   {
52 |    "cell_type": "markdown",
53 |    "metadata": {},
54 |    "source": [
55 |     "## Transformer decoder\n",
56 |     "\n",
57 |     "The decoding component look a lot like the encoding component, it's also a stack of decoders. Decoders are basically encoders, but they take the attention provided by encoders, and perform step $ 3 $ twice, the first time adds it by the decoder generated attention, the second time adds by the encoder generated attention.\n",
58 |     "\n",
59 |     "For transformers, the decoder is an auto-regressive model. In inference mode, what it does is no different from any other decoder. It takes in the sequence it previously generated (it uses <bos> it's the first token being predicted), and predicts a new token. So what is the encoder doing? Turns out it is used for storing attention information. The stored information is then passed to decoders (on different layers) so that decoders know the meaning of the input sentence."
60 |    ]
61 |   },
62 |   {
63 |    "cell_type": "markdown",
64 |    "metadata": {},
65 |    "source": [
66 |     "## Positional encoding\n",
67 |     "\n",
68 |     "Words in a sentence have different meanings if they are ordered differently. The sentence _Alice ate Bob_ is very different in meaning to _Bob ate Alice_. Well, at least for Alice and Bob. For RNNs, that isn't an issue. Because RNNs run over a sentence sequentially. So it will either see Alice or Bob first, and knows who appears to be eaten. However, transformers have no way of knowing who comes first because of how self-attention mechanism is symmetry to each position.\n",
69 |     "\n",
70 |     "That is the reason we need to add information for position to the model. In _Attention is all you need_, a positional encoding is added to the input. A positional encoding is basically an embedding, with different values for every indices, so that the model knows what a word's position is when it processes it.\n",
71 |     "\n",
72 |     "A very interesting fact is that changing the order of the tokens does not actually change the output of the model (unlike RNNs), as long as the right positional encoding is associated with the right position. That means that after applying (usually by adding) positional encoding to the input word embedding vector, you can shuffle the order of the vector (along the time axis) all you want without affecting the output of the model. Very cool indeed."
73 |    ]
74 |   }
75 |  ],
76 |  "metadata": {
77 |   "kernelspec": {
78 |    "display_name": "Python 3",
79 |    "language": "python",
80 |    "name": "python3"
81 |   },
82 |   "language_info": {
83 |    "codemirror_mode": {
84 |     "name": "ipython",
85 |     "version": 3
86 |    },
87 |    "file_extension": ".py",
88 |    "mimetype": "text/x-python",
89 |    "name": "python",
90 |    "nbconvert_exporter": "python",
91 |    "pygments_lexer": "ipython3",
92 |    "version": "3.9.5"
93 |   }
94 |  },
95 |  "nbformat": 4,
96 |  "nbformat_minor": 2
97 | }
98 | 


--------------------------------------------------------------------------------
/book/notice/batch/batch.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Batch size"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## What are batches\n",
 15 |     "\n",
 16 |     "Machine learning models are usually trained on batches of data. A batch is simply a number (usually the power of 2), that a model trains itself on in an iteration. For example, batch size 32 means that the model takes in these 32 entries of data, averaging its output, and trains on the 32 labels of the entires."
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "metadata": {},
 22 |    "source": [
 23 |     "## Batch size too small\n",
 24 |     "\n",
 25 |     "When batch sizes are too small, there are several issues that we may encounter."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "### Training takes a long time.\n",
 33 |     "\n",
 34 |     "When training on GPUs, data is sent to the GPU batch by batch, with the overhead of transfering data back and forth. If the batch size is too small, we spend a much higher percentage of time sending data than actually computing. This is the reason we usually prefer bigger batches."
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "markdown",
 39 |    "metadata": {},
 40 |    "source": [
 41 |     "### The training does not converge.\n",
 42 |     "\n",
 43 |     "In probability theory, variance measures how much the input varies, how close on average are two inputs. Let's plot two distributions with different variances. "
 44 |    ]
 45 |   },
 46 |   {
 47 |    "cell_type": "code",
 48 |    "execution_count": null,
 49 |    "metadata": {},
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "%matplotlib inline\n",
 53 |     "\n",
 54 |     "import numpy as np\n",
 55 |     "from matplotlib import pyplot as plt\n",
 56 |     "from scipy import stats"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "code",
 61 |    "execution_count": null,
 62 |    "metadata": {},
 63 |    "outputs": [],
 64 |    "source": [
 65 |     "mu = 0\n",
 66 |     "variance = 1\n",
 67 |     "sigma = np.sqrt(variance)\n",
 68 |     "x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)\n",
 69 |     "plt.xlim(-10, 10)\n",
 70 |     "plt.ylim(0, 1)\n",
 71 |     "plt.plot(x, stats.norm.pdf(x, mu, sigma))\n",
 72 |     "plt.show()"
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "code",
 77 |    "execution_count": null,
 78 |    "metadata": {},
 79 |    "outputs": [],
 80 |    "source": [
 81 |     "mu = 0\n",
 82 |     "variance = 10\n",
 83 |     "sigma = np.sqrt(variance)\n",
 84 |     "x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)\n",
 85 |     "plt.xlim(-10, 10)\n",
 86 |     "plt.ylim(0, 1)\n",
 87 |     "plt.plot(x, stats.norm.pdf(x, mu, sigma))\n",
 88 |     "plt.show()"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "metadata": {},
 94 |    "source": [
 95 |     "See how the first distribution is narrower than the second distribution. That means, the number sampled (numbers yielded from) the distribution is closer to one another, and changes little. A smaller variance helps the machine learning model to learn faster, because it's easier to learn from a simpler input than from inputs that can change a lot. For that reason, averages are easier to learn than individuals."
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "markdown",
100 |    "metadata": {},
101 |    "source": [
102 |     "## Batch size too big"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "markdown",
107 |    "metadata": {},
108 |    "source": [
109 |     "### Training takes a lot of time.\n",
110 |     "\n",
111 |     "When batch size is too big, trainig can also take a long time, but for different reasons. It's again related to variance.\n",
112 |     "\n",
113 |     "There are certain problems in machine learning that gradients can be small, close to 0. In such a case, when the batch is overly big, the gradients can average so close to zero that it hinders progress of convergence. Well, worry not! Most of us aren't rich enough to use those kind of batch sizes (because of expensive GPUs)."
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "markdown",
118 |    "metadata": {},
119 |    "source": [
120 |     "## So, how should I choose batch sizes?\n",
121 |     "\n",
122 |     "Batch sizes are very tricky. Both too big and too small can make training very slow. So make sure to try different batch sizes and observe before commiting to training the whole dataset."
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "markdown",
127 |    "metadata": {},
128 |    "source": [
129 |     "## Is it possible that the model only remembers the last batch it sees?\n",
130 |     "\n",
131 |     "Yes, in theory it's possible that the model only remembers the last batch it sees if the model isn't big enough or if you update too much. However, in practice, we iterate over the dataset over and over, and update the model little by little. It's very unlikely that the model only remembers the last batch it has seen."
132 |    ]
133 |   }
134 |  ],
135 |  "metadata": {
136 |   "interpreter": {
137 |    "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90"
138 |   },
139 |   "kernelspec": {
140 |    "display_name": "Python 3",
141 |    "language": "python",
142 |    "name": "python3"
143 |   },
144 |   "language_info": {
145 |    "codemirror_mode": {
146 |     "name": "ipython",
147 |     "version": 3
148 |    },
149 |    "file_extension": ".py",
150 |    "mimetype": "text/x-python",
151 |    "name": "python",
152 |    "nbconvert_exporter": "python",
153 |    "pygments_lexer": "ipython3",
154 |    "version": "3.9.6"
155 |   }
156 |  },
157 |  "nbformat": 4,
158 |  "nbformat_minor": 2
159 | }
160 | 


--------------------------------------------------------------------------------