\n",
13 | " What we are NOT going to cover in this course: \n",
14 | " How to implement SOTA models\n",
15 | " \n",
16 | " How to optimize our code\n",
17 | " \n",
18 | " How autograd is implemented\n",
19 | " \n",
20 | " How to use the new fancy stuff: mobile support, distributed training, quantization, sparse tensors, etc.\n",
21 | "
\n",
22 | " Instead, we are going to: \n",
23 | " Understand the key PyTorch concepts (e.g., tensors, modules, autograd, broadcasting, ...)\n",
24 | " \n",
25 | " Understand what PyTorch can and cannot do\n",
26 | " \n",
27 | " Create simple neural networks and get and idea of how we can implement more complex models in the future\n",
28 | " \n",
29 | " Kick off with PyTorch 🚀\n",
30 | "
\n",
31 | "\n",
32 | "> If you use PyTorch on a daily basis, you will most probably not learn a lot during this lecture."
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "---"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "# Quick Recap of Jupyter Notebooks\n",
47 | "\n",
48 | "A jupyter notebook document has the `.ipynb` extension and is composed of a number of cells. In cells, you can write program code in Python and create notes in markdown style. These three types of cells correspond to:\n",
49 | " \n",
50 | " 1. code\n",
51 | " 2. markdown\n",
52 | " 3. raw\n",
53 | " \n",
54 | "To work with the contents of a cell, use *Edit mode* (turns on by pressing **Enter** after selecting a cell), and to navigate between cells, use *command mode* (turns on by pressing **Esc**).\n",
55 | "\n",
56 | "The cell type can be set in command mode either using hotkeys (**y** to code, **m** to markdown, **r** to edit raw text), or in the menu *Cell -> Cell type* ... "
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "### Example"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "# cell with code\n",
73 | "a = 1"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "a = 2"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "a\n",
92 | "print(a)"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "Cell with markdown text"
100 | ]
101 | },
102 | {
103 | "cell_type": "raw",
104 | "metadata": {},
105 | "source": [
106 | "Cell with raw text"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "Next, press `Shift + Enter` to process the contents of the cell:\n",
114 | "interpret the code or lay out the marked-up text."
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "### Basic shortcuts\n",
122 | "\n",
123 | "- `a` creates a cell above the current cell\n",
124 | "- `b` creates a cell below the current cell\n",
125 | "- `dd` deletes the curent cell\n",
126 | "- `Enter` enters in edit mode\n",
127 | "- `Esc` exits edit mode\n",
128 | "- `Ctrl` + `Enter` runs the cell\n",
129 | "- `Shift` + `Enter` runs the cell and creates (or jumps to) a next one\n",
130 | "- `m` converts the current cell to markdown\n",
131 | "- `y` converts the current cell to code"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "> ***Word of caution*** \n",
139 | "> Jupyter-notebook is a great tool for data science since we can see the direct effect of a snippet of code, either by plotting the result or by inspecting the direct output. However, we should be careful with the order in which we run cells (this is a common source of errors).\n"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "---"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "# PyTorch Overview\n",
154 | "\n",
155 | "\n",
156 | "> \"PyTorch - From Research To Production\n",
157 | "> \n",
158 | "> An open source machine learning framework that accelerates the path from research prototyping to production deployment.\"\n",
159 | "> -- https://pytorch.org/"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "## \"Build by run\" - what is that and why do I care?\n",
167 | "\n"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | ""
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "A very practical reason to use PyTorch:"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "import torch\n",
191 | "import ipdb\n",
192 | "\n",
193 | "def f(x):\n",
194 | " res = x + x\n",
195 | " ipdb.set_trace() # <-- :o\n",
196 | " return res\n",
197 | "\n",
198 | "x = torch.randn(1, 8)\n",
199 | "f(x)"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "## Other reasons for using PyTorch\n"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | ""
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "\n",
221 | "- Seamless GPU integration\n",
222 | "- Production ready\n",
223 | "- Distributed training\n",
224 | "- Mobile support\n",
225 | "- Cloud support\n",
226 | "- Robust ecosystem\n",
227 | "- C++ front-end\n"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "## Other neural network toolkits you might want to check out\n",
235 | "- TensorFlow\n",
236 | "- JAX\n",
237 | "- MXNet\n",
238 | "- Keras\n",
239 | "- CNTK\n",
240 | "- Chainer\n",
241 | "- caffe\n",
242 | "- caffe2\n",
243 | "- dynet\n",
244 | "- many many more\n",
245 | "\n",
246 | "Which one to choose? There is no bullet silver. All of them are good!\n"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 | "---"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "# Useful Links\n",
261 | "\n",
262 | "- Twitter: https://twitter.com/PyTorch\n",
263 | "- Forum: https://discuss.pytorch.org/\n",
264 | "- Tutorials: https://pytorch.org/tutorials/\n",
265 | "- Examples: https://github.com/pytorch/examples\n",
266 | "- API Reference: https://pytorch.org/docs/stable/index.html\n",
267 | "- Torchvision: https://pytorch.org/docs/stable/torchvision/index.html\n",
268 | "- PyTorch Text: https://github.com/pytorch/text\n",
269 | "- PyTorch Audio: https://github.com/pytorch/audio\n",
270 | "\n",
271 | "\n",
272 | "More tutorials:\n",
273 | "- https://github.com/sotte/pytorch_tutorial\n",
274 | "- https://github.com/erickrf/pytorch-lecture\n",
275 | "- https://github.com/goncalomcorreia/pytorch-lecture"
276 | ]
277 | }
278 | ],
279 | "metadata": {
280 | "kernelspec": {
281 | "display_name": "Python 3 (ipykernel)",
282 | "language": "python",
283 | "name": "python3"
284 | },
285 | "language_info": {
286 | "codemirror_mode": {
287 | "name": "ipython",
288 | "version": 3
289 | },
290 | "file_extension": ".py",
291 | "mimetype": "text/x-python",
292 | "name": "python",
293 | "nbconvert_exporter": "python",
294 | "pygments_lexer": "ipython3",
295 | "version": "3.9.7"
296 | }
297 | },
298 | "nbformat": 4,
299 | "nbformat_minor": 2
300 | }
301 |
--------------------------------------------------------------------------------
/01-pytorch-basics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# An introduction to PyTorch\n",
8 | "\n",
9 | "PyTorch is a platform for deep learning in Python or C++. In this lecture we will focus in the **Python** landscape. "
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "# Tensors\n",
17 | "\n",
18 | "Tensors are elementary units of PyTorch. They are very similar to numpy arrays"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": null,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "import numpy as np\n",
28 | "np.random.seed(0)\n",
29 | "\n",
30 | "import torch\n",
31 | "torch.manual_seed(0)"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "x = np.array([1.0, 2.0, 3.0])\n",
41 | "y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "x"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "y"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "z = y ** 2\n",
69 | "z"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "Broadly speaking, a tensor is like a numpy array that can carry gradient information from the chain of operations applied on top of it. There are other flavors that make them different, but this is the key distinction."
77 | ]
78 | },
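{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of that gradient tracking (assuming the `y` and `z` defined in the cells above), we can backpropagate through `z` and inspect `y.grad`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# z = y ** 2 was built from torch operations, so autograd can compute dz/dy\n",
"z.sum().backward()\n",
"y.grad  # 2 * y, i.e. tensor([2., 4., 6.])"
]
},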
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "## Creating tensors "
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "# directly from data\n",
93 | "data = [[0, 1], [1, 0]]\n",
94 | "x_data = torch.tensor(data)\n",
95 | "x_data"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": [
104 | "# from a numpy array\n",
105 | "x_numpy = np.array([[1, 2], [3, 4]])\n",
106 | "x_torch = torch.from_numpy(x_numpy)\n",
107 | "x_torch"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "# convert it back to a numpy array\n",
117 | "x_numpy = x_torch.numpy()\n",
118 | "x_numpy"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": null,
124 | "metadata": {},
125 | "outputs": [],
126 | "source": [
127 | "# with constant data\n",
128 | "x = torch.ones(2, 3) # 2 rows and 3 columns\n",
129 | "print(x)\n",
130 | "y = torch.zeros(3, 2) # 3 rows and 2 columns\n",
131 | "print(y)\n",
132 | "z = torch.full((3, 1), -5) # 3 row and 1 columns (aka column vector)\n",
133 | "print(z)"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "# with random data\n",
143 | "x = torch.rand(2, 3) # uniform distribution U(0, 1)\n",
144 | "print(x)\n",
145 | "y = torch.randn(2, 3) # standard gaussian N(0, 1)\n",
146 | "print(y)\n",
147 | "z = torch.randint(0, 10, size=(2, 3)) # random integers [0, 10)\n",
148 | "print(z)"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "# other initializations\n",
158 | "print(torch.arange(5)) # from 0 (inclusive) to 5 (exclusive)\n",
159 | "print(torch.arange(2, 8)) # from 2 to 8\n",
160 | "print(torch.arange(2, 8, 2)) # from 2 to 8, with stepsize=2\n",
161 | "\n",
162 | "print(torch.linspace(0, 1, 6)) # returns 6 linear spaced numbers from 0 to 1 (inclusive)\n",
163 | "print(torch.linspace(-1, 1, 8)) # returns 8 linear spaced numbers form -1 to 1 \n",
164 | "\n",
165 | "print(torch.eye(3)) # identity matrix"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "See the full set of creation ops [here](https://pytorch.org/docs/stable/torch.html#creation-ops)."
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "## Tensor attributes"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "x = torch.rand(3, 4, requires_grad=True)\n",
189 | "print(x.device)\n",
190 | "print(x.shape)\n",
191 | "print(x.dtype)\n",
192 | "print(x)\n",
193 | "print(x.data)\n",
194 | "print(x[0, 0])\n",
195 | "print(x[0, 0].item())"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "Tensor data types:\n",
203 | "\n",
204 | "
\n",
1599 | " $$\n",
1600 | " \\dfrac{\\partial \\big[\\sum_{x_i} e^{0.001 x_i^2} + \\sin(x_i^3) \\cdot \\log(x_i)\\big]}{\\partial x}\n",
1601 | " $$\n",
1602 | " \n",
1603 | " and make a function that computes it. Check that it gives the same output as `x.grad` in our previous example.\n",
1604 | "
"
1605 | ]
1606 | }
1607 | ],
1608 | "metadata": {
1609 | "kernelspec": {
1610 | "display_name": "Python 3 (ipykernel)",
1611 | "language": "python",
1612 | "name": "python3"
1613 | },
1614 | "language_info": {
1615 | "codemirror_mode": {
1616 | "name": "ipython",
1617 | "version": 3
1618 | },
1619 | "file_extension": ".py",
1620 | "mimetype": "text/x-python",
1621 | "name": "python",
1622 | "nbconvert_exporter": "python",
1623 | "pygments_lexer": "ipython3",
1624 | "version": "3.9.7"
1625 | }
1626 | },
1627 | "nbformat": 4,
1628 | "nbformat_minor": 2
1629 | }
1630 |
--------------------------------------------------------------------------------
/02-linear-regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Linear Regression and Gradient Descent"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook we will see how we can perform linear regression in three different ways: \n",
15 | "1. pure numpy\n",
16 | "2. numpy + pytorch's autograd \n",
17 | "3. pure pytorch"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": null,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "import numpy as np\n",
27 | "import torch\n",
28 | "import matplotlib.pyplot as plt\n",
29 | "from pprint import pprint"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "np.random.seed(0)\n",
39 | "torch.manual_seed(0);"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## The Problem"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "Suppose that we want to predict a real-valued quantity $y \\in \\mathbb{R}$ for a given input $\\mathbf{x} \\in \\mathbb{R}^d$. This is known as **regression**. \n",
54 | "\n",
55 | "The most common loss function for regression is the **quadractic loss** or **$\\ell_2$ loss**:\n",
56 | "\n",
57 | "$$\n",
58 | "\\ell_2(y, \\hat{y}) = (y - \\hat{y})^2\n",
59 | "$$\n",
60 | "\n",
61 | "The empirical risk becomes the **mean squared error (MSE)**:\n",
62 | "\n",
63 | "$$\n",
64 | "MSE(\\theta) = \\frac{1}{N} \\sum\\limits_{n=1}^{N} (y_n - f(\\mathbf{x}_n; \\theta))^2\n",
65 | "$$\n",
66 | "\n",
67 | "The model $f(\\mathbf{x}_n; \\theta)$ can be parameterized in many ways. In this lecture we will focus on a linear parameterization, leading to the well-known **Linear Regression** formulation:\n",
68 | "\n",
69 | "$$\n",
70 | "f(\\mathbf{x}; \\theta) = \\mathbf{w}^\\top \\mathbf{x} + b = w_1 x_1 + w_2 x_2 + \\cdots + w_D x_D + b\n",
71 | "$$\n",
72 | "\n",
73 | "where $\\theta = (b, \\mathbf{w})$ are the parameters of the model."
74 | ]
75 | },
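{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before generating data, here is a tiny worked example of the quadratic loss and the MSE defined above (the numbers are arbitrary, just for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# per-example quadratic loss and its mean (the MSE)\n",
"y_true = torch.tensor([1.0, 2.0, 3.0])\n",
"y_pred = torch.tensor([1.5, 1.5, 2.0])\n",
"l2 = (y_true - y_pred) ** 2\n",
"print(l2)         # tensor([0.2500, 0.2500, 1.0000])\n",
"print(l2.mean())  # MSE = 0.5"
]
},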
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "## Example\n",
81 | "\n",
82 | "Let's create a synthetic regression dataset using `sklearn`'s `make_regression` function. For better visualization, we will use only a single feature.$"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "from sklearn.datasets import make_regression\n",
92 | "\n",
93 | "\n",
94 | "n_features = 1\n",
95 | "n_samples = 100\n",
96 | "\n",
97 | "X, y = make_regression(\n",
98 | " n_samples=n_samples,\n",
99 | " n_features=n_features,\n",
100 | " noise=20,\n",
101 | " random_state=42,\n",
102 | ")\n",
103 | "\n",
104 | "fix, ax = plt.subplots()\n",
105 | "ax.plot(X, y, \".\")\n",
106 | "print(X.shape, y.shape)"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "For instance, by looking at the plot above, let's say that $w \\approx 40$ and $b \\approx 2$. Then, we would arrive at the following predictions (with vertical bars indicating the errors)."
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "# our estimate\n",
123 | "w = 40.0\n",
124 | "b = 2.0\n",
125 | "y_pred = w*X + b\n",
126 | "\n",
127 | "# subplots\n",
128 | "fig, axs = plt.subplots(1, 2, figsize=(16, 4))\n",
129 | "\n",
130 | "# left plot\n",
131 | "axs[0].plot(X, y, 'o')\n",
132 | "axs[0].plot(X, y_pred, '-')\n",
133 | "\n",
134 | "# right plot\n",
135 | "axs[1].vlines(X, y, y_pred, color='black')\n",
136 | "axs[1].plot(X, y, 'o')\n",
137 | "axs[1].plot(X, y_pred, '-')"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "By adjusting our parameters $\\theta=(w, b)$, we can minimize the sum of squared errors to find the **least squares solution**\n",
145 | "\n",
146 | "$$\n",
147 | "\\begin{align}\n",
148 | "\\hat{\\theta} &= \\arg\\min_\\theta MSE(\\theta) \\\\\n",
149 | "&= \\arg\\min_\\theta \\frac{1}{N} \\sum\\limits_{n=1}^{N} (y_n - f(\\mathbf{x}_n; \\theta))^2 \\\\\n",
150 | "&= \\arg\\min_{w,b} \\frac{1}{N} \\sum\\limits_{n=1}^{N} (y_n - (w \\cdot x_n + b))^2\n",
151 | "\\end{align}\n",
152 | "$$\n",
153 | "\n",
154 | "Which can be found by taking the gradient of the loss function w.r.t. $\\theta$. \n",
155 | "\n",
156 | "\n"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "In general, for inputs with higher dimensionality $d$, we have $\\mathbf{w} \\in \\mathbb{R}^d$, and thus we have the following gradient (assuming that $b$ is absorbed by $w$):\n",
209 | "\n",
210 | "$$\n",
211 | "\\begin{align}\n",
212 | "\\nabla_\\mathbf{w} MSE(\\theta) &= \\nabla_\\mathbf{w} \\frac{1}{N} \\sum\\limits_{n=1}^{N} (y_n - f(\\mathbf{x}_n; \\theta))^2 \\\\\n",
213 | "&= \\frac{-2}{N} \\sum\\limits_{n=1}^{N} (y_n - f(\\mathbf{x}_n; \\theta)) \\cdot \\nabla_\\mathbf{w} f(\\mathbf{x}_n; \\theta) \\\\\n",
214 | "&= \\frac{-2}{N} \\sum\\limits_{n=1}^{N} (y_n - (\\mathbf{w}^\\top \\mathbf{x}_n + b)) \\cdot \\mathbf{x}_n\n",
215 | "\\end{align}\n",
216 | "$$\n",
217 | "\n",
218 | "Now, we just have follow the gradient descent rule to update $\\mathbf{w}$: \n",
219 | "\n",
220 | "$$\n",
221 | "\\mathbf{w}_{t+1} = \\mathbf{w}_{t} - \\alpha \\nabla_{\\mathbf{w}} MSE(\\theta)\n",
222 | "$$\n",
223 | "\n",
224 | "Where $\\alpha$ represents the learning rate. So, let's implement this in numpy to see what happens."
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "# Numpy Solution"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": null,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "class LinearRegression(object):\n",
241 | " def __init__(self, n_features, n_targets=1, lr=0.1):\n",
242 | " self.W = np.zeros((n_targets, n_features))\n",
243 | " self.lr = lr\n",
244 | "\n",
245 | " def update_weight(self, X, y, y_hat):\n",
246 | " N = X.shape[0]\n",
247 | " W_grad = - 2 * np.dot(X.T, y - y_hat) / N\n",
248 | " self.W = self.W - self.lr * W_grad\n",
249 | "\n",
250 | " def loss(self, y_hat, y):\n",
251 | " return np.mean(np.power(y - y_hat, 2))\n",
252 | "\n",
253 | " def predict(self, X):\n",
254 | " return np.dot(X, self.W.T).squeeze(-1)\n",
255 | "\n",
256 | " def train(self, X, y, epochs=50):\n",
257 | " \"\"\"\n",
258 | " X (n_examples x n_features): input matrix\n",
259 | " y (n_examples): gold labels\n",
260 | " \"\"\"\n",
261 | " loss_history = []\n",
262 | " for _ in range(epochs):\n",
263 | " # get prediction for computing the loss\n",
264 | " y_hat = self.predict(X)\n",
265 | " loss = self.loss(y_hat, y)\n",
266 | "\n",
267 | " # update weights\n",
268 | " self.update_weight(X, y, y_hat)\n",
269 | " # (thought exercise): what happens if we do this instead?\n",
270 | " # for x_i, y_i in zip(X, y):\n",
271 | " # self.update_weight(x_i, y_i)\n",
272 | "\n",
273 | " # save loss value\n",
274 | " loss_history.append(loss)\n",
275 | " return loss_history"
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": null,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "# trick for handling the bias term:\n",
285 | "# concat a columns of 1s to the original input matrix X\n",
286 | "use_bias = True\n",
287 | "if use_bias:\n",
288 | " X_np = np.hstack([np.ones((n_samples,1)), X])\n",
289 | " n_features += 1\n",
290 | "else:\n",
291 | " X_np = X"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": [
300 | "model = LinearRegression(n_features=n_features, n_targets=1, lr=0.1)\n",
301 | "loss_history = model.train(X_np, y, epochs=50)\n",
302 | "y_hat = model.predict(X_np)"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "print('b:', model.W[0,0])\n",
312 | "print('W:', model.W[0,1])\n",
313 | "plt.plot(loss_history)\n",
314 | "plt.title('Loss per epoch')"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "# Vis\n",
324 | "fig, axs = plt.subplots(1, 2, figsize=(16, 4))\n",
325 | "axs[0].plot(X, y, \"o\", label=\"data\")\n",
326 | "axs[0].plot(X, 40*X + 2, \"-\", label=\"pred\")\n",
327 | "axs[0].set_title(\"Guess\")\n",
328 | "axs[0].legend();\n",
329 | "\n",
330 | "axs[1].plot(X, y, \"o\", label=\"data\")\n",
331 | "axs[1].plot(X, y_hat, \"-\", label=\"pred\")\n",
332 | "axs[1].set_title(\"Numpy solution\")\n",
333 | "axs[1].legend();"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "# Numpy + Autograd Solution\n",
341 | "\n",
342 | "In the previous implementation, we had to derive the gradient $\\frac{\\partial MSE(\\theta)}{\\partial \\theta}$ manually. If the model $f(\\cdot;\\theta)$ is more complex, this might be a cumbersome and error-prone task. To avoid this, we will use PyTorch `autograd` to automatically compute gradients.\n"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": [
351 | "class MixedLinearRegression(object):\n",
352 | " def __init__(self, n_features, n_targets=1, lr=0.01):\n",
353 | " # note requires_grad=True!\n",
354 | " self.W = torch.zeros(n_targets, n_features, requires_grad=True)\n",
355 | " self.lr = lr\n",
356 | " \n",
357 | " def update_weight(self):\n",
358 | " # Gradients are given to us by autograd!\n",
359 | " self.W.data = self.W.data - self.lr * self.W.grad.data\n",
360 | "\n",
361 | " def loss(self, y_hat, y):\n",
362 | " return torch.mean(torch.pow(y - y_hat, 2))\n",
363 | "\n",
364 | " def predict(self, X):\n",
365 | " return torch.matmul(X, self.W.t()).squeeze(-1)\n",
366 | "\n",
367 | " def train(self, X, y, epochs=50):\n",
368 | " \"\"\"\n",
369 | " X (n_examples x n_features): input matrix\n",
370 | " y (n_examples): gold labels\n",
371 | " \"\"\"\n",
372 | " loss_history = []\n",
373 | " for _ in range(epochs):\n",
374 | " # Our neural net is a Line function!\n",
375 | " y_hat = self.predict(X)\n",
376 | " \n",
377 | " # Compute the loss using torch operations so they are saved in the gradient history.\n",
378 | " loss = self.loss(y_hat, y)\n",
379 | " \n",
380 | " # Computes the gradient of loss with respect to all Variables with requires_grad=True.\n",
381 | " # where Variables are tensors with requires_grad=True\n",
382 | " loss.backward()\n",
383 | " loss_history.append(loss.item())\n",
384 | "\n",
385 | " # Update weights using gradient descent; W.data is a Tensor.\n",
386 | " self.update_weight()\n",
387 | "\n",
388 | " # Reset the accumulated gradients\n",
389 | " self.W.grad.data.zero_()\n",
390 | " \n",
391 | " return loss_history"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "X_pt = torch.from_numpy(X_np).float()\n",
401 | "y_pt = torch.from_numpy(y).float()"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": null,
407 | "metadata": {},
408 | "outputs": [],
409 | "source": [
410 | "model = MixedLinearRegression(n_features=n_features, n_targets=1, lr=0.1)\n",
411 | "loss_history = model.train(X_pt, y_pt, epochs=50)\n",
412 | "with torch.no_grad():\n",
413 | " y_hat = model.predict(X_pt)"
414 | ]
415 | },
416 | {
417 | "cell_type": "code",
418 | "execution_count": null,
419 | "metadata": {},
420 | "outputs": [],
421 | "source": [
422 | "print('b:', model.W[0,0].item())\n",
423 | "print('W:', model.W[0,1].item())\n",
424 | "plt.plot(loss_history)\n",
425 | "plt.title('Loss per epoch');"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": null,
431 | "metadata": {},
432 | "outputs": [],
433 | "source": [
434 | "# Vis\n",
435 | "fig, axs = plt.subplots(1, 3, figsize=(16, 4))\n",
436 | "axs[0].plot(X, y, \"o\", label=\"data\")\n",
437 | "axs[0].plot(X, 40*X + 2, \"-\", label=\"pred\")\n",
438 | "axs[0].set_title(\"Guess\")\n",
439 | "axs[0].legend();\n",
440 | "\n",
441 | "axs[1].plot(X, y, \"o\", label=\"data\")\n",
442 | "axs[1].plot(X, 47.12483907744531*X + 2.3264433961431727, \"-\", label=\"pred\")\n",
443 | "axs[1].set_title(\"Numpy solution\")\n",
444 | "axs[1].legend();\n",
445 | "\n",
446 | "axs[2].plot(X, y, \"o\", label=\"data\")\n",
447 | "axs[2].plot(X, y_hat, \"-\", label=\"pred\")\n",
448 | "axs[2].set_title(\"Mixed solution\")\n",
449 | "axs[2].legend();"
450 | ]
451 | },
452 | {
453 | "cell_type": "markdown",
454 | "metadata": {},
455 | "source": [
456 | "# PyTorch Solution\n",
457 | "\n",
458 | "Mixing PyTorch and Numpy is no fun. PyTorch is actually very powerful and provides most of the things we need to apply gradient descent for any model $f$, as long all operations applied over the inputs are Torch operations (so gradients can be tracked). \n",
459 | "\n",
460 | "To this end, we will use the submodule `torch.nn`, which provides us a way for encapsulating our model into a `nn.Module`. With this, all we need to do is define the our parameters in the `__init__` method and then the _forward_ pass of our model in the `forward` method. "
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": null,
466 | "metadata": {},
467 | "outputs": [],
468 | "source": [
469 | "from torch import nn\n",
470 | "from torch import optim\n",
471 | "\n",
472 | "# See the inheritance from nn.Module\n",
473 | "class TorchLinearRegression(nn.Module):\n",
474 | " \n",
475 | " def __init__(self, n_features, n_targets=1):\n",
476 | " super().__init__() # this is mandatory!\n",
477 | " \n",
478 | " # encapsulate our weights into a nn.Parameter object\n",
479 | " self.W = torch.nn.Parameter(torch.zeros(n_targets, n_features))\n",
480 | "\n",
481 | " def forward(self, X):\n",
482 | " \"\"\"\n",
483 | " X (n_examples x n_features): input matrix\n",
484 | " \"\"\"\n",
485 | " #if self.training:\n",
486 | " # X = X ** 2\n",
487 | " #else:\n",
488 | " # X = X ** 3\n",
489 | " # import ipdb; ipdb. set_trace()\n",
490 | " return X @ self.W.t()"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {},
497 | "outputs": [],
498 | "source": [
499 | "# define model, loss function and optmizer\n",
500 | "model = TorchLinearRegression(n_features)\n",
501 | "loss_fn = nn.MSELoss()\n",
502 | "optimizer = optim.SGD(model.parameters(), lr=0.1)\n",
503 | "\n",
504 | "# move to CUDA if available\n",
505 | "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
506 | "model = model.to(device)\n",
507 | "X = X_pt.to(device)\n",
508 | "y = y_pt.to(device).unsqueeze(-1)"
509 | ]
510 | },
511 | {
512 | "cell_type": "markdown",
513 | "metadata": {},
514 | "source": [
515 | "All done! Now we just have to write a training loop, which is more or less a standard set of steps for training all models:"
516 | ]
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": null,
521 | "metadata": {},
522 | "outputs": [],
523 | "source": [
524 | "def train(model, X, y, epochs=50):\n",
525 | " # inform PyTorch that we are in \"training\" mode\n",
526 | " model.train()\n",
527 | " \n",
528 | " loss_history = []\n",
529 | " for _ in range(epochs):\n",
530 | " # reset gradients before learning\n",
531 | " optimizer.zero_grad()\n",
532 | " \n",
533 | " # get predictions and and the final score from the loss function \n",
534 | " y_hat = model(X)\n",
535 | " loss = loss_fn(y_hat, y)\n",
536 | " loss_history.append(loss.item())\n",
537 | " \n",
538 | " # compute gradients of the loss wrt parameters\n",
539 | " loss.backward()\n",
540 | " \n",
541 | " # perform gradient step to update the parameters\n",
542 | " optimizer.step()\n",
543 | "\n",
544 | " return loss_history"
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "execution_count": null,
550 | "metadata": {},
551 | "outputs": [],
552 | "source": [
553 | "def evaluate(model, X):\n",
554 | " # inform PyTorch that we are in \"evaluation\" mode\n",
555 | " model.eval()\n",
556 | " \n",
557 | " # disable gradient tracking\n",
558 | " with torch.no_grad():\n",
559 | " # get prediction\n",
560 | " y_hat = model(X)\n",
561 | " \n",
562 | " return y_hat"
563 | ]
564 | },
565 | {
566 | "cell_type": "code",
567 | "execution_count": null,
568 | "metadata": {},
569 | "outputs": [],
570 | "source": [
571 | "loss_history = train(model, X, y, epochs=50)\n",
572 | "y_hat = evaluate(model, X)"
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "execution_count": null,
578 | "metadata": {},
579 | "outputs": [],
580 | "source": [
581 | "print('b:', model.W[0,0].item())\n",
582 | "print('W:', model.W[0,1].item())\n",
583 | "plt.plot(loss_history)\n",
584 | "plt.title('Loss per epoch');"
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "execution_count": null,
590 | "metadata": {},
591 | "outputs": [],
592 | "source": [
593 | "# Vis\n",
594 | "X = X_pt[:, 1:].numpy()\n",
595 | "y = y_pt.squeeze(-1).numpy()\n",
596 | "\n",
597 | "fig, axs = plt.subplots(1, 4, figsize=(16, 4))\n",
598 | "axs[0].plot(X, y, \"o\", label=\"data\")\n",
599 | "axs[0].plot(X, 40*X + 2, \"-\", label=\"pred\")\n",
600 | "axs[0].set_title(\"Guess\")\n",
601 | "axs[0].legend();\n",
602 | "\n",
603 | "axs[1].plot(X, y, \"o\", label=\"data\")\n",
604 | "axs[1].plot(X, 47.12483907744531*X + 2.3264433961431727, \"-\", label=\"pred\")\n",
605 | "axs[1].set_title(\"Numpy solution\")\n",
606 | "axs[1].legend();\n",
607 | "\n",
608 | "axs[2].plot(X, y, \"o\", label=\"data\")\n",
609 | "axs[2].plot(X, 47.12483596801758*X + 2.3264429569244385, \"-\", label=\"pred\")\n",
610 | "axs[2].set_title(\"Mixed solution\")\n",
611 | "axs[2].legend();\n",
612 | "\n",
613 | "axs[3].plot(X, y, \"o\", label=\"data\")\n",
614 | "axs[3].plot(X, y_hat, \"-\", label=\"pred\")\n",
615 | "axs[3].set_title(\"PyTorch solution\")\n",
616 | "axs[3].legend();"
617 | ]
618 | },
619 | {
620 | "cell_type": "markdown",
621 | "metadata": {},
622 | "source": [
623 | "**Note:** I did gradient descent with the entire dataset rather than splitting the data into `train` and `valid` subsets, which should be done in practice!"
624 | ]
625 | },
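{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such a split (assuming the `X_pt` and `y_pt` tensors from above; the 20% validation fraction is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hold out 20% of the examples for validation\n",
"perm = torch.randperm(X_pt.shape[0])\n",
"n_valid = int(0.2 * X_pt.shape[0])\n",
"valid_idx, train_idx = perm[:n_valid], perm[n_valid:]\n",
"X_train, y_train = X_pt[train_idx], y_pt[train_idx]\n",
"X_valid, y_valid = X_pt[valid_idx], y_pt[valid_idx]\n",
"print(X_train.shape, X_valid.shape)"
]
},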
626 | {
627 | "cell_type": "markdown",
628 | "metadata": {},
629 | "source": [
630 | "## Exercises"
631 | ]
632 | },
633 | {
634 | "cell_type": "markdown",
635 | "metadata": {},
636 | "source": [
637 | "- Write a proper training loop for PyTorch:\n",
638 | " - add support for batches\n",
639 | " - add a stop criterion for the convergence of the model\n",
640 | " \n",
641 | "- Add L2 regularization"
642 | ]
643 | }
644 | ],
645 | "metadata": {
646 | "anaconda-cloud": {},
647 | "kernelspec": {
648 | "display_name": "Python 3 (ipykernel)",
649 | "language": "python",
650 | "name": "python3"
651 | },
652 | "language_info": {
653 | "codemirror_mode": {
654 | "name": "ipython",
655 | "version": 3
656 | },
657 | "file_extension": ".py",
658 | "mimetype": "text/x-python",
659 | "name": "python",
660 | "nbconvert_exporter": "python",
661 | "pygments_lexer": "ipython3",
662 | "version": "3.9.7"
663 | }
664 | },
665 | "nbformat": 4,
666 | "nbformat_minor": 1
667 | }
668 |
--------------------------------------------------------------------------------
/03-modules-and-mlps.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Why Modules\n",
8 | "\n",
9 | "A typical training procedure for a neural net:\n",
10 | "\n",
11 | "0. Define a dataset ($X$ and $Y$)\n",
12 | "1. Define the neural network with some learnable weights\n",
13 | "2. Iterate over the dataset\n",
14 | "3. Pass inputs to the network (forward pass)\n",
15 | "4. Compute the loss\n",
16 | "5. Compute gradients w.r.t. network's weights (backward pass)\n",
17 | "6. Update weights (e.g., weight = weight - lr * gradient)\n",
18 | "\n",
19 | "PyTorch handles 1-6 for you via encapsulation, so you still have the flexibility to change something in between if you want! "
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "## Example: MNIST classifier\n",
27 | "\n",
28 | "The MNIST dataset is composed of images of digits that must be classified with labels from 0 to 9. The inputs are 28x28 matrices containing the grayscale intensity in each pixel.\n",
29 | "\n",
30 | "We will download the MNIST dataset for training a classifier. PyTorch provides a convenient function for that."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "import torch\n",
40 | "import torch.nn as nn\n",
41 | "import torch.optim as optim\n",
42 | "from torchvision import datasets\n",
43 | "import matplotlib.pyplot as plt\n",
44 | "torch.manual_seed(0);"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "# Dataset\n",
52 | "It's easy to create your `Dataset`,\n",
53 | "but PyTorch comes with several built-in datasets for [vision](https://pytorch.org/vision/stable/datasets.html), [audio](https://pytorch.org/audio/stable/datasets.html), and [text](https://pytorch.org/text/stable/datasets.html) modalities.\n",
54 | "\n",
55 | "The class `Dataset` gives you information about the number of samples (implement `__len__`) and gives you the sample at a given index (implement `__getitem__`). It's a nice and simple abstraction to work with data. It has the following structure:\n",
56 | "\n",
57 | "```python\n",
58 | "class Dataset(object):\n",
59 | " def __getitem__(self, index):\n",
60 | " raise NotImplementedError\n",
61 | "\n",
62 | " def __len__(self):\n",
63 | " raise NotImplementedError\n",
64 | "\n",
65 | " def __add__(self, other):\n",
66 | " return ConcatDataset([self, other])\n",
67 | "```\n",
68 | "\n",
69 | "For now, let's use MNIST. But feel free to use another `Dataset` as an exercise."
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "from torch.utils.data import Dataset"
79 | ]
80 | },
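{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before loading MNIST, here is a minimal sketch of a custom `Dataset` (the `ToyDataset` class and the random tensors below are purely illustrative and are not used later):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class ToyDataset(Dataset):\n",
"    \"\"\"A tiny in-memory dataset wrapping an input tensor and a target tensor.\"\"\"\n",
"    def __init__(self, inputs, targets):\n",
"        self.inputs = inputs\n",
"        self.targets = targets\n",
"\n",
"    def __len__(self):\n",
"        return self.inputs.shape[0]\n",
"\n",
"    def __getitem__(self, index):\n",
"        return self.inputs[index], self.targets[index]\n",
"\n",
"toy = ToyDataset(torch.randn(5, 3), torch.arange(5))\n",
"print(len(toy), toy[0])"
]
},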
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "# download MNIST and store it in \"../data\"\n",
88 | "# PyTorch.datasets also handles caching for you so you don't have to download the dataset twice\n",
89 | "train_data = datasets.MNIST('../data', train=True, download=True)\n",
90 | "test_data = datasets.MNIST('../data', train=False)\n",
91 | "\n",
92 | "train_x = train_data.data\n",
93 | "train_y = train_data.targets\n",
94 | "test_x = test_data.data\n",
95 | "test_y = test_data.targets"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": [
104 | "n_train_examples = train_x.shape[0]\n",
105 | "n_test_examples = test_x.shape[0]\n",
106 | "print('Training instances:', n_train_examples)\n",
107 | "print('Test instances:', n_test_examples)"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "Check the shape of our training data to see how many input features we have:"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "train_x.shape, train_y.shape"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "And what the images looks like:"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {
137 | "scrolled": true
138 | },
139 | "outputs": [],
140 | "source": [
141 | "C = 8\n",
142 | "fig, axs = plt.subplots(3, C, figsize=(12, 4))\n",
143 | "for i in range(3):\n",
144 | " for j in range(C):\n",
145 | " axs[i, j].imshow(train_x[i*C + j], cmap='gray')\n",
146 | " axs[i, j].set_axis_off()\n",
147 | "print(train_y[:24].reshape(3, C))"
148 | ]
149 | },
150 | {
151 | "cell_type": "markdown",
152 | "metadata": {},
153 | "source": [
154 | "### Formatting"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "Each sample is a 28x28 matrix. But we want to represent them as vectors, since our model (which will be a simple MLP) doesn't take any advantage of the 2D nature of the data.\n",
162 | "\n",
163 | "So, we reshape the data:"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "num_features = 28 * 28\n",
173 | "train_x_vectors = train_x.view(n_train_examples, num_features)\n",
174 | "print(train_x_vectors.shape)"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "When we reshape an array (or torch tensor, for that matter), we don't need to specify all dimensions. We can leave one as -1, and it will be automatically determined from the size of the data. This is useful when we don't know a priori the shape of some array."
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "train_x_vectors = train_x.view(n_train_examples, -1)\n",
191 | "test_x_vectors = test_x.view(n_test_examples, -1)\n",
192 | "\n",
193 | "print(train_x_vectors.shape, test_x_vectors.shape)"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "Also, the values are integers in the range $[0, 255]$. It is better to work with float values in a smaller interval, such as $[0, 1]$ or $[-1, 1]$. There are some more elaborate normalization techniques, but for now let's just normalize the data into $[0, 1]$."
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {},
207 | "outputs": [],
208 | "source": [
209 | "train_x_norm = train_x_vectors / 255.0\n",
210 | "test_x_norm = test_x_vectors / 255.0\n",
211 | "print(train_x_norm.max(), train_x_norm.min(), train_x_norm.mean(), train_x_norm.std())"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "Now, let's check all the available labels:"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": null,
224 | "metadata": {},
225 | "outputs": [],
226 | "source": [
227 | "print(torch.unique(train_y))\n",
228 | "num_classes = len(torch.unique(train_y))\n",
229 | "print('Num classes:', num_classes)"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "# Modules and MLPs\n",
237 | "\n",
238 | "We've seen how the internals of a simple linear classifier work. However, we still had to set a lot of things manually. It's much better to have a higher-level API that encapsulates the classifier.\n",
239 | "\n",
240 | "We are going to see that now, with pytorch Module objects. Then, it will allow us to build more complex models, like a multilayer perceptron."
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "We begin by loading, reshaping and normalizing the data again (so the code looks concise):"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": null,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "from torchvision.transforms import ToTensor\n",
257 | "\n",
258 | "train_dataset = datasets.MNIST('../data', train=True, download=True, transform=ToTensor())\n",
259 | "test_dataset = datasets.MNIST('../data', train=False, transform=ToTensor())\n",
260 | "\n",
261 | "train_x = train_dataset.data\n",
262 | "train_y = train_dataset.targets\n",
263 | "test_x = test_dataset.data\n",
264 | "test_y = test_dataset.targets\n",
265 | "\n",
266 | "num_features = 28 * 28\n",
267 | "num_classes = len(torch.unique(train_y))\n",
268 | "new_shape = [-1, num_features]\n",
269 | "train_x_vectors = train_x.reshape(new_shape)\n",
270 | "test_x_vectors = test_x.reshape(new_shape)\n",
271 | "\n",
272 | "# shorten the names\n",
273 | "train_x = train_x_vectors.float() / 255\n",
274 | "test_x = test_x_vectors.float() / 255"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "## Using Modules\n",
282 | "\n",
283 | "PyTorch provides some basic building blocks for neural nets under `.nn` module. Here you can check the complete list of available blocks: https://pytorch.org/docs/stable/nn.html\n",
284 | "\n",
285 | "For now, let's recreate a simple linear model using `nn.Linear` (see [doc](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear))."
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "class LinearModel(nn.Module):\n",
295 | " def __init__(self, n_features, n_classes):\n",
296 | " super().__init__()\n",
297 | " self.linear_layer = nn.Linear(n_features, n_classes)\n",
298 | " \n",
299 | " def forward(self, X):\n",
300 | " # This is the same as doing:\n",
301 | " # return X @ self.linear_layer.weight.t() + self.linear_layer.bias\n",
302 | " # where weight and bias are instances of nn.Parameter\n",
303 | " return self.linear_layer(X)\n",
304 | "\n",
305 | "linear_model = LinearModel(num_features, num_classes)"
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {},
311 | "source": [
312 | "As before, the model can be called as function in order to produce an output:"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {},
319 | "outputs": [],
320 | "source": [
321 | "batch = train_x[:2]\n",
322 | "outputs = linear_model(batch)\n",
323 | "outputs"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "Same as doing the forward method $$w^T x + b$$"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {},
337 | "outputs": [],
338 | "source": [
339 | "batch @ linear_model.linear_layer.weight.t() + linear_model.linear_layer.bias"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "Now that we defined our model, we just have to: \n",
347 | "- define an iterator\n",
348 | "- define and compute the loss\n",
349 | "- compute gradients\n",
350 | "- define the strategy to update the parameters of our model\n",
351 | "- glue previous steps to form the training loop!"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "#### Batching\n",
359 | "\n",
360 | "Batching can be boring to code. PyTorch provides the `DataLoader` class to help us! Dealing with data is one of the most important yet more time consuming tasks. Take a look in the PyTorch `data` submodule to [learn more](https://pytorch.org/docs/stable/data.html).\n",
361 | "\n",
362 | "In general, we just have to pass a torch `Dataset` object as input to the dataloader, and then set some hyperparams for the iterator: "
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {},
369 | "outputs": [],
370 | "source": [
371 | "from torch.utils.data import DataLoader\n",
372 | "print(type(train_dataset))\n",
373 | "\n",
374 | "train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)"
375 | ]
376 | },
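{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the iterator yields, we can peek at a single batch (with the `ToTensor` transform and `batch_size=64` above, the images come as `[64, 1, 28, 28]` float tensors):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# grab one batch from the dataloader and inspect its shapes\n",
"batch_x, batch_y = next(iter(train_dataloader))\n",
"print(batch_x.shape, batch_y.shape)  # torch.Size([64, 1, 28, 28]) torch.Size([64])"
]
},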
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "#### Loss\n",
382 | "\n",
383 | "Here is the complete list of available [loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions).\n",
384 | "If the provided loss functions don't satisfy your constraints, it is easy to define your own loss function: just use torch operations (and be careful with differentiability issues). For example:"
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": null,
390 | "metadata": {},
391 | "outputs": [],
392 | "source": [
393 | "with torch.no_grad(): # disable gradient-tracking\n",
394 | " \n",
395 | " dummy_loss = nn.CrossEntropyLoss()\n",
396 | " \n",
397 | " # try other losses!\n",
398 | " # multi-class classification hinge loss (margin-based loss):\n",
399 | " # dummy_loss = nn.MultiMarginLoss() \n",
400 | " batch = train_x[:2]\n",
401 | " targets = train_y[:2]\n",
402 | " predictions = linear_model(batch)\n",
403 | " \n",
404 | " print(predictions.shape, targets.shape)\n",
405 | " print(dummy_loss(predictions, targets))"
406 | ]
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "And writing our own function (from the definition of the Cross Entropy loss):\n",
413 | "\n",
414 | "$$\n",
415 | "CE(p,y) = - \\log\\frac{\\exp(p_y)}{\\sum_c \\exp(p_c)}\n",
416 | "$$"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": null,
422 | "metadata": {},
423 | "outputs": [],
424 | "source": [
425 | "def dummy_loss(y_pred, y):\n",
426 | " one_hot = y.unsqueeze(1) == torch.arange(num_classes).unsqueeze(0)\n",
427 | " res = - torch.log(torch.exp(y_pred) / torch.exp(y_pred).sum(-1).unsqueeze(-1))[one_hot]\n",
428 | " return res.mean() # average per sample\n",
429 | "\n",
430 | "print(dummy_loss(predictions, targets))"
431 | ]
432 | },
433 | {
434 | "cell_type": "markdown",
435 | "metadata": {},
436 | "source": [
437 | "We will use the CrossEntropy function as our loss"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {},
444 | "outputs": [],
445 | "source": [
446 | "loss_function = nn.CrossEntropyLoss()"
447 | ]
448 | },
449 | {
450 | "cell_type": "markdown",
451 | "metadata": {},
452 | "source": [
453 | "#### Optimizer\n",
454 | "\n",
455 | "The optimizer is the object which handles the update of the model's parameters. In the previous exercise, we were using the famous \"delta\" rule to update our weights:\n",
456 | "\n",
457 | "$$\\mathbf{w}_t = \\mathbf{w}_{t-1} - \\alpha \\frac{\\partial L}{\\partial \\mathbf{w}}.$$\n",
458 | "\n",
459 | "But there are more ellaborate ways of updating our parameters: \n",
460 | "\n",
461 | "\n",
462 | "\n",
463 | "\n",
464 | "\n",
465 | "\n",
466 | "PyTorch provides an extensive list of optimizers: https://pytorch.org/docs/stable/optim.html. Notice that, as everything else, it should be easy to define your own optimizer procedure. \n",
467 | "\n",
468 | "We will use the simple yet powerful SGD optmizer. The optimizer needs to be told which are the parameters to optimize."
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": null,
474 | "metadata": {},
475 | "outputs": [],
476 | "source": [
477 | "parameters = linear_model.parameters() # we will optimize all model's parameters!\n",
478 | "optimizer = torch.optim.SGD(parameters, lr=0.1)"
479 | ]
480 | },
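{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, writing the update rule ourselves is also easy. Here is a minimal sketch of a plain SGD step (just for illustration; the rest of the notebook keeps using `torch.optim.SGD`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def manual_sgd_step(params, lr=0.1):\n",
"    # apply w <- w - lr * grad to every parameter, outside of autograd tracking\n",
"    with torch.no_grad():\n",
"        for p in params:\n",
"            if p.grad is not None:\n",
"                p -= lr * p.grad\n",
"                p.grad.zero_()"
]
},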
481 | {
482 | "cell_type": "markdown",
483 | "metadata": {},
484 | "source": [
485 | "#### Training loop\n",
486 | "\n",
487 | "Now we write the main training loop. This is the basic skeleton for training PyTorch models."
488 | ]
489 | },
490 | {
491 | "cell_type": "code",
492 | "execution_count": null,
493 | "metadata": {},
494 | "outputs": [],
495 | "source": [
496 | "def train_model(model, dataloader, optimizer, loss_function, num_epochs=1):\n",
497 | " # Tell PyTorch that we are in training mode.\n",
498 | " # This is useful for mechanisms that work differently during training and test time, like Dropout. \n",
499 | " model.train()\n",
500 | " \n",
501 | " losses = []\n",
502 | " for epoch in range(1, num_epochs+1):\n",
503 | " print('Starting epoch %d' % epoch)\n",
504 | " total_loss = 0\n",
505 | " hits = 0\n",
506 | "\n",
507 | " for batch_x, batch_y in dataloader:\n",
508 | " # check shapes with:\n",
509 | " # import ipdb; ipdb.set_trace()\n",
510 | " # batch_x.shape is (batch_size, 28, 28)\n",
511 | " # batch_y.shape is (batch_size, )\n",
512 | " \n",
513 | " # Step 1. Remember that PyTorch accumulates gradients.\n",
514 | " # We need to clear them out before each step\n",
515 | " optimizer.zero_grad()\n",
516 | " \n",
517 | " # Step 2. Preprocess the data\n",
518 | " # (batch_size, 28, 28) -> (batch_size, 784 = 28 * 28)\n",
519 | " batch_x = batch_x.reshape(batch_x.shape[0], -1)\n",
520 | " batch_x = batch_x.to(torch.float) / 255.0\n",
521 | "\n",
522 | " # Step 3. Run forward pass.\n",
523 | " logits = model(batch_x)\n",
524 | "\n",
525 | " # Step 4. Compute loss\n",
526 | " loss = loss_function(logits, batch_y)\n",
527 | " \n",
528 | " # Step 5. Compute gradeints\n",
529 | " loss.backward()\n",
530 | " \n",
531 | " # Step 6. After determining the gradients, take a step toward their (neg-)direction\n",
532 | " optimizer.step()\n",
533 | " \n",
534 | " # Optional. Save statistics of your training\n",
535 | " loss_value = loss.item()\n",
536 | " total_loss += loss_value\n",
537 | " losses.append(loss_value)\n",
538 | " y_pred = logits.argmax(dim=1)\n",
539 | " hits += torch.sum(y_pred == batch_y).item()\n",
540 | " \n",
541 | " avg_loss = total_loss / len(train_dataloader.dataset)\n",
542 | " print('Epoch loss: %.4f' % avg_loss)\n",
543 | " acc = hits / len(train_dataloader.dataset)\n",
544 | " print('Epoch accuracy: %.4f' % acc)\n",
545 | " \n",
546 | " print('Done!')\n",
547 | " return losses"
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": null,
553 | "metadata": {},
554 | "outputs": [],
555 | "source": [
556 | "linear_losses = train_model(linear_model, train_dataloader, optimizer, loss_function, num_epochs=10)"
557 | ]
558 | },
559 | {
560 | "cell_type": "markdown",
561 | "metadata": {},
562 | "source": [
563 | "Graphics are good to understand the performance of a model. Let's plot the loss curve by training step:"
564 | ]
565 | },
566 | {
567 | "cell_type": "code",
568 | "execution_count": null,
569 | "metadata": {},
570 | "outputs": [],
571 | "source": [
572 | "fig, ax = plt.subplots()\n",
573 | "ax.plot(linear_losses, \"-\")\n",
574 | "ax.set_xlabel('Step')\n",
575 | "ax.set_ylabel('Loss');"
576 | ]
577 | },
578 | {
579 | "cell_type": "markdown",
580 | "metadata": {},
581 | "source": [
582 | "What can you conclude from this?"
583 | ]
584 | },
585 | {
586 | "cell_type": "markdown",
587 | "metadata": {},
588 | "source": [
589 | "## Multilayer Perceptron\n",
590 | "\n",
591 | "We can now proceed to a more sofisticated classifier: a multilayer perceptron. Let's build one using the Sequential API."
592 | ]
593 | },
594 | {
595 | "cell_type": "code",
596 | "execution_count": null,
597 | "metadata": {},
598 | "outputs": [],
599 | "source": [
600 | "class MLP(nn.Module):\n",
601 | " def __init__(self, n_features, hidden_size, n_classes):\n",
602 | " super().__init__()\n",
603 | " linear_layer1 = nn.Linear(n_features, hidden_size)\n",
604 | " linear_layer2 = nn.Linear(hidden_size, hidden_size)\n",
605 | " linear_layer3 = nn.Linear(hidden_size, n_classes)\n",
606 | " self.feedforward = nn.Sequential(\n",
607 | " linear_layer1, \n",
608 | " nn.Tanh(), \n",
609 | " linear_layer2, \n",
610 | " nn.Tanh(),\n",
611 | " linear_layer3\n",
612 | " )\n",
613 | "\n",
614 | " def forward(self, X):\n",
615 | " return self.feedforward(X)\n",
616 | "\n",
617 | "hidden_size = 200\n",
618 | "mlp = MLP(num_features, hidden_size, num_classes)\n",
619 | "loss_function = nn.CrossEntropyLoss()\n",
620 | "optimizer = torch.optim.SGD(mlp.parameters(), lr=0.1)"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "Now let's train the model."
628 | ]
629 | },
630 | {
631 | "cell_type": "code",
632 | "execution_count": null,
633 | "metadata": {},
634 | "outputs": [],
635 | "source": [
636 | "mlp_losses = train_model(mlp, train_dataloader, optimizer, loss_function, num_epochs=5)"
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "How do the loss and accuracy compare with the linear model?\n",
644 | "\n",
645 | "You probably also noticed a difference in running time!"
646 | ]
647 | },
648 | {
649 | "cell_type": "code",
650 | "execution_count": null,
651 | "metadata": {},
652 | "outputs": [],
653 | "source": [
654 | "fig, ax = plt.subplots()\n",
655 | "ax.plot(linear_losses, \".\", label=\"linear\")\n",
656 | "ax.plot(mlp_losses, \".\", label=\"mlp\")\n",
657 | "ax.legend()"
658 | ]
659 | },
660 | {
661 | "cell_type": "markdown",
662 | "metadata": {},
663 | "source": [
664 | "Note the different concentration of dots in the MLP and Linear graphics!"
665 | ]
666 | },
667 | {
668 | "cell_type": "markdown",
669 | "metadata": {},
670 | "source": [
671 | "### Validation data\n",
672 | "\n",
673 | "Evaluating the performance on training data is important to understand if the model is actually learning, but if we want to know if our model has any usefulness, we should evaluate its performance on validation or test data.\n",
674 | "\n"
675 | ]
676 | },
677 | {
678 | "cell_type": "code",
679 | "execution_count": null,
680 | "metadata": {},
681 | "outputs": [],
682 | "source": [
683 | "def evaluate_model(model, test_x, test_y):\n",
684 | " # Tell PyTorch that we are in evaluation mode.\n",
685 | " model.eval()\n",
686 | "\n",
687 | " with torch.no_grad():\n",
688 | " loss_function = torch.nn.CrossEntropyLoss()\n",
689 | " logits = model(test_x)\n",
690 | " loss = loss_function(logits, test_y)\n",
691 | "\n",
692 | " y_pred = logits.argmax(dim=1)\n",
693 | " hits = torch.sum(y_pred == test_y).item()\n",
694 | " \n",
695 | " return loss.item() / len(test_x), hits / len(test_x)"
696 | ]
697 | },
698 | {
699 | "cell_type": "code",
700 | "execution_count": null,
701 | "metadata": {},
702 | "outputs": [],
703 | "source": [
704 | "evaluate_model(mlp, train_x, train_y)"
705 | ]
706 | },
707 | {
708 | "cell_type": "code",
709 | "execution_count": null,
710 | "metadata": {},
711 | "outputs": [],
712 | "source": [
713 | "evaluate_model(mlp, test_x, test_y)"
714 | ]
715 | },
716 | {
717 | "cell_type": "code",
718 | "execution_count": null,
719 | "metadata": {},
720 | "outputs": [],
721 | "source": [
722 | "evaluate_model(linear_model, train_x, train_y)"
723 | ]
724 | },
725 | {
726 | "cell_type": "code",
727 | "execution_count": null,
728 | "metadata": {},
729 | "outputs": [],
730 | "source": [
731 | "evaluate_model(linear_model, test_x, test_y)"
732 | ]
733 | },
734 | {
735 | "cell_type": "markdown",
736 | "metadata": {},
737 | "source": [
738 | "How can we make our model better? There are two things to be done:\n",
739 | "\n",
740 | "1. **Hyperparameter search**. Do a grid search or random search on the hyperparameters (hidden size, learning rate, batch size, activation function, type of optimizer, ...)\n",
741 | "2. **Generalize better**. This include either finding some better feature representation or regularizing, i.e., add some kind of penalty to the model weights that encourages it to find a more general solution. Examples: L2-norm weight regularization, dropout.\n",
742 | "3. **Early stop**. Evaluate the model on validation data after each epoch or some number of batches; only save it when validation performance increases. This means detecting when the model achieved its performance peak."
743 | ]
744 | },
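{
"cell_type": "markdown",
"metadata": {},
"source": [
"For item 2, a common shortcut for L2-norm weight regularization is the `weight_decay` argument available in PyTorch optimizers (the value below is an arbitrary, untuned example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# L2 regularization via weight decay (illustrative value, not tuned)\n",
"regularized_optimizer = torch.optim.SGD(mlp.parameters(), lr=0.1, weight_decay=1e-4)"
]
},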
745 | {
746 | "cell_type": "markdown",
747 | "metadata": {},
748 | "source": [
749 | "#### Dropout\n",
750 | "\n",
751 | "We could try dropout. It effectivelly deactivates some neural connections at random, forcing the network to avoid depending on specific inputs."
752 | ]
753 | },
754 | {
755 | "cell_type": "code",
756 | "execution_count": null,
757 | "metadata": {},
758 | "outputs": [],
759 | "source": [
760 | "class MLPDropout(nn.Module):\n",
761 | " def __init__(self, n_features, hidden_size, n_classes, p_dropout):\n",
762 | " super().__init__()\n",
763 | " linear_layer1 = nn.Linear(n_features, hidden_size)\n",
764 | " linear_layer2 = nn.Linear(hidden_size, n_classes)\n",
765 | " self.feedforward = nn.Sequential(\n",
766 | " linear_layer1,\n",
767 | " nn.Tanh(),\n",
768 | " nn.Dropout(p_dropout),\n",
769 | " linear_layer2\n",
770 | " )\n",
771 | "\n",
772 | " def forward(self, X):\n",
773 | " return self.feedforward(X)\n",
774 | "\n",
775 | "hidden_size = 200\n",
776 | "p_dropout = 0.5\n",
777 | "mlp_dropout = MLPDropout(num_features, hidden_size, num_classes, p_dropout)\n",
778 | "loss_function = nn.CrossEntropyLoss()\n",
779 | "optimizer = torch.optim.SGD(mlp_dropout.parameters(), lr=0.1)"
780 | ]
781 | },
782 | {
783 | "cell_type": "code",
784 | "execution_count": null,
785 | "metadata": {},
786 | "outputs": [],
787 | "source": [
788 | "losses = train_model(mlp_dropout, train_dataloader, optimizer, loss_function, num_epochs=5)"
789 | ]
790 | },
791 | {
792 | "cell_type": "markdown",
793 | "metadata": {},
794 | "source": [
795 | "Training loss is a bit worse, as expected. After all, we are obstructing some connections.\n",
796 | "\n",
797 | "Now let's check validation performance:"
798 | ]
799 | },
800 | {
801 | "cell_type": "code",
802 | "execution_count": null,
803 | "metadata": {},
804 | "outputs": [],
805 | "source": [
806 | "evaluate_model(mlp, test_x, test_y)"
807 | ]
808 | },
809 | {
810 | "cell_type": "code",
811 | "execution_count": null,
812 | "metadata": {},
813 | "outputs": [],
814 | "source": [
815 | "evaluate_model(mlp_dropout, test_x, test_y)"
816 | ]
817 | },
818 | {
819 | "cell_type": "markdown",
820 | "metadata": {},
821 | "source": [
822 | "No improvement. Ideally, we should retrain our model with different hyperparamters (learning rates, layer sizes, number of layers, dropout rate) as well as some changes in the structure (different optimizers, activation functions, losses). However, data representation plays a key role. \n",
823 | "\n",
824 | " \n",
825 | "
\n",
826 | "Do you think representing the input as independent pixels is a good idea for recognizing digits?\n",
827 | "