├── README.md ├── Simple_Grad.ipynb ├── autograd_tutorial.ipynb ├── cifar10_tutorial.ipynb ├── dp.py ├── neural_networks_tutorial.ipynb └── tensor_tutorial.ipynb /README.md: -------------------------------------------------------------------------------- 1 | This project contains the notebook files for the four parts of the course, which the creator reformatted (Markdown display) and annotated with study notes after working through the official PyTorch tutorial Deep Learning with PyTorch: A 60 Minute Blitz, along with one notebook that the original tutorial published on Colab (uploaded to this public GitHub project so that readers in China can download the files easily). 2 | (2021.12.8: added dp.py, the source code for the data parallelism part) 3 | For the order of the files and the study notes I wrote, see my CSDN blog post [Study notes on Deep Learning with PyTorch: A 60 Minute Blitz](https://blog.csdn.net/PolarisRisingWar/article/details/116069338) -------------------------------------------------------------------------------- /Simple_Grad.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Simple Grad", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "rnvTWs4W4Hea" 20 | }, 21 | "source": [ 22 | "# Simple Autograd\n", 23 | "\n", 24 | "This notebook walks through a self-contained implementation of reverse mode auto-differentiation. The intention is to make it easier to understand PyTorch's implementation of auto-diff and how TorchScript interacts with it without having to work through all the complexity that the real implementation contains.\n", 25 | "\n", 26 | "\n", 27 | "To get started, we import some helper functions." 
28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "metadata": { 33 | "id": "2Lf8JaEK4MHJ" 34 | }, 35 | "source": [ 36 | "import torch\n", 37 | "from typing import List, NamedTuple, Callable, Dict, Optional\n", 38 | "\n", 39 | "_name: int = 0\n", 40 | "def fresh_name() -> str:\n", 41 | " \"\"\" create a new unique name for a variable: v0, v1, v2 \"\"\"\n", 42 | " global _name\n", 43 | " r = f'v{_name}'\n", 44 | " _name += 1\n", 45 | " return r\n" 46 | ], 47 | "execution_count": null, 48 | "outputs": [] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": { 53 | "id": "erjC686T4S4c" 54 | }, 55 | "source": [ 56 | "To keep the system fully understandable, it does not rely on PyTorch's autograd at all. It only uses the Tensor object to do compute. We add our own Variable class to track the gradients of computation, and a `grad` function to compute gradients.\n", 57 | "\n", 58 | "\n", 59 | "Similar to PyTorch, we use tape-based reverse mode auto-differentiation to\n", 60 | "compute the gradient. For some scalar loss `l`, we will compute the value `dl/dX` for \n", 61 | "_every_ value `X` computed in the program (`l` is always a scalar, but the `X`s can be tensors).\n", 62 | "We do this by starting with `dl/dl == 1`, and using the partial derivatives plus \n", 63 | "the chain rule to propagate the values backward,\n", 64 | "e.g. `dl/dx * dx/dy = dl/dy`.\n", 65 | "\n", 66 | "https://sidsite.com/posts/autodiff/ might be a good place to start if you \n", 67 | "haven't seen reverse mode auto-diff before.\n", 68 | "\n", 69 | "For the purpose of this example, we primarily use point-wise tensor operators like `+` to keep the partial derivatives simple.\n", 70 | "\n", 71 | "\n", 72 | "# The Implementation\n", 73 | "\n", 74 | "Variable is a wrapper around Tensor that tracks the compute.\n", 75 | " Each variable has a globally unique name so that we can track the gradient\n", 76 | "for this Variable in a dictionary. 
For ease of understanding,\n", 77 | "we sometimes provide this name as an argument. Otherwise, we \n", 78 | "generate a fresh temporary each time.\n", 79 | " " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "metadata": { 85 | "id": "sfaIcdiXEZpO" 86 | }, 87 | "source": [ 88 | "class Variable:\n", 89 | " def __init__(self, value : torch.Tensor, name: Optional[str]=None):\n", 90 | " self.value = value\n", 91 | " self.name = name or fresh_name()\n", 92 | "\n", 93 | " # We need to start with some tensors whose values were not computed\n", 94 | " # inside the autograd. This function constructs leaf nodes. \n", 95 | " @staticmethod\n", 96 | " def constant(value: torch.Tensor, name: Optional[str]=None):\n", 97 | " r = Variable(value, name)\n", 98 | " print(f'{r.name} = {value}')\n", 99 | " return r\n", 100 | "\n", 101 | " def __repr__(self):\n", 102 | " return repr(self.value)\n", 103 | "\n", 104 | "\n", 105 | " # This performs a pointwise multiplication of a Variable, tracking gradients\n", 106 | " def __mul__(self, rhs: 'Variable') -> 'Variable':\n", 107 | " # defined later in the notebook\n", 108 | " return operator_mul(self, rhs)\n", 109 | "\n", 110 | " def __add__(self, rhs: 'Variable') -> 'Variable':\n", 111 | " return operator_add(self, rhs)\n", 112 | " \n", 113 | " def sum(self, name: Optional[str]=None) -> 'Variable':\n", 114 | " return operator_sum(self, name)\n", 115 | " \n", 116 | " def expand(self, sizes: List[int]) -> 'Variable':\n", 117 | " return operator_expand(self, sizes)\n" 118 | ], 119 | "execution_count": null, 120 | "outputs": [] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": { 125 | "id": "o4gBKqAGEoHD" 126 | }, 127 | "source": [ 128 | "We need to keep track of all the computation so we can apply the\n", 129 | "chain rule backward. A tape entry will help us do this." 
130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "metadata": { 135 | "id": "Y5QC7spUEuVV" 136 | }, 137 | "source": [ 138 | "class TapeEntry(NamedTuple):\n", 139 | " # names of the inputs to the original computation\n", 140 | " inputs : List[str]\n", 141 | " # names of the outputs of the original computation\n", 142 | " outputs: List[str]\n", 143 | " # apply chain rule\n", 144 | " propagate: 'Callable[[List[Variable]], List[Variable]]'" 145 | ], 146 | "execution_count": null, 147 | "outputs": [] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": { 152 | "id": "byeB7Cg_E8Kz" 153 | }, 154 | "source": [ 155 | "The `inputs` and `outputs` are the unique names of the Variables that are inputs and outputs of the _original_ computation. `propagate` is a closure that propagates the gradient of the outputs of this function to the inputs using the chain rule. This is specific to each leaf operator. Its inputs are `dL/dOutputs`, and its outputs are `dL/dInputs`. The tape is just a list of accumulated entries recording all compute. We provide a way to reset it so we can run multiple examples." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "metadata": { 161 | "id": "XmfuXusUFVL3" 162 | }, 163 | "source": [ 164 | "gradient_tape : List[TapeEntry] = []\n", 165 | "\n", 166 | "def reset_tape():\n", 167 | " gradient_tape.clear()\n", 168 | " global _name\n", 169 | " _name = 0 # reset variable names too to keep them small.\n" 170 | ], 171 | "execution_count": null, 172 | "outputs": [] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": { 177 | "id": "hb--WbxnFvec" 178 | }, 179 | "source": [ 180 | "Now let's look at how an operator is defined. First we calculate the forward result and create a new Variable to represent it. Then we define the `propagate` closure, which uses the chain rule to backprop the gradient." 
181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "metadata": { 186 | "id": "qJYuZnYqGZrt" 187 | }, 188 | "source": [ 189 | "def operator_mul(self : Variable, rhs: Variable) -> Variable:\n", 190 | " if isinstance(rhs, float) and rhs == 1.0:\n", 191 | " # peephole optimization\n", 192 | " return self\n", 193 | "\n", 194 | " # define forward\n", 195 | " r = Variable(self.value * rhs.value)\n", 196 | " print(f'{r.name} = {self.name} * {rhs.name}')\n", 197 | "\n", 198 | " # record what the inputs and outputs of the op were\n", 199 | " inputs = [self.name, rhs.name]\n", 200 | " outputs = [r.name]\n", 201 | "\n", 202 | " # define backprop\n", 203 | " def propagate(dL_doutputs: List[Variable]):\n", 204 | " dL_dr, = dL_doutputs\n", 205 | " \n", 206 | " dr_dself = rhs # partial derivative of r = self*rhs\n", 207 | " dr_drhs = self # partial derivative of r = self*rhs\n", 208 | "\n", 209 | " # chain rule propagation from outputs to inputs of multiply\n", 210 | " dL_dself = dL_dr * dr_dself\n", 211 | " dL_drhs = dL_dr * dr_drhs\n", 212 | " dL_dinputs = [dL_dself, dL_drhs] \n", 213 | " return dL_dinputs\n", 214 | " # finally, we record the compute we did on the tape\n", 215 | " gradient_tape.append(TapeEntry(inputs=inputs, outputs=outputs, propagate=propagate))\n", 216 | " return r" 217 | ], 218 | "execution_count": null, 219 | "outputs": [] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": { 224 | "id": "bn69AKteGaWN" 225 | }, 226 | "source": [ 227 | " Notice how both `rhs` and `self` are captured by this closure.\n", 228 | " Their values have to be saved for the backward pass.\n", 229 | " PyTorch does something similar, but because PyTorch allows for\n", 230 | " mutable tensors, it has additional logic to make sure these captured\n", 231 | " variables are not mutated.\n", 232 | "\n", 233 | " We'll define the other operators later. Let's look at how we can define a `grad` function that puts these pieces together. 
`grad` calculates the gradient of `L` with respect to `desired_results`. We first calculate the gradient of `L` with respect to _all_ computed values and then just extract `desired_results` from them. Real systems do more pruning ahead of time to avoid computing unused values.\n" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "metadata": { 239 | "id": "_iWNs9YfH8KM" 240 | }, 241 | "source": [ 242 | "def grad(L, desired_results: List[Variable]) -> List[Variable]:\n", 243 | " # this map holds dL/dX for all values X\n", 244 | " dL_d : Dict[str, Variable] = {}\n", 245 | " # It starts by initializing the 'seed' dL/dL, which is 1\n", 246 | " dL_d[L.name] = Variable(torch.ones(()))\n", 247 | " print(f'd{L.name} ------------------------')\n", 248 | "\n", 249 | " # look up dL_dentries. If a variable is never used to compute the loss,\n", 250 | " # we consider its gradient None, see the note below about zeros for more information.\n", 251 | " def gather_grad(entries: List[str]):\n", 252 | " return [dL_d[entry] if entry in dL_d else None for entry in entries]\n", 253 | "\n", 254 | " # propagate the gradient information backward\n", 255 | " for entry in reversed(gradient_tape):\n", 256 | " dL_doutputs = gather_grad(entry.outputs)\n", 257 | " if all(dL_doutput is None for dL_doutput in dL_doutputs):\n", 258 | " # optimize for the case where some gradient pathways are zero. See\n", 259 | " # The note below for more details.\n", 260 | " continue\n", 261 | "\n", 262 | " # perform chain rule propagation specific to each compute\n", 263 | " dL_dinputs = entry.propagate(dL_doutputs)\n", 264 | "\n", 265 | " # Accumulate the gradient produced for each input.\n", 266 | " # Each use of a variable produces some gradient dL_dinput for that \n", 267 | " # use. 
The multivariate chain rule tells us it is safe to sum \n", 268 | " # all the contributions together.\n", 269 | " for input, dL_dinput in zip(entry.inputs, dL_dinputs):\n", 270 | " if input not in dL_d:\n", 271 | " dL_d[input] = dL_dinput\n", 272 | " else:\n", 273 | " dL_d[input] += dL_dinput\n", 274 | "\n", 275 | " # print some information to understand the values of each intermediate \n", 276 | " for name, value in dL_d.items():\n", 277 | " print(f'd{L.name}_d{name} = {value.name}')\n", 278 | " print(f'------------------------')\n", 279 | "\n", 280 | " return gather_grad(desired.name for desired in desired_results)\n" 281 | ], 282 | "execution_count": null, 283 | "outputs": [] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": { 288 | "id": "DKF1A2XFKOtj" 289 | }, 290 | "source": [ 291 | "# Some more operators\n", 292 | "\n", 293 | "We'll use these in our examples. Their implementation is very similar to `operator_mul`." 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "metadata": { 299 | "id": "tRqwgR64KWLb" 300 | }, 301 | "source": [ 302 | "def operator_add(self : Variable, rhs: Variable) -> Variable:\n", 303 | " # Add follows a similar pattern to Mul, but it doesn't end up\n", 304 | " # capturing any variables.\n", 305 | " r = Variable(self.value + rhs.value)\n", 306 | " print(f'{r.name} = {self.name} + {rhs.name}')\n", 307 | " def propagate(dL_doutputs: List[Variable]):\n", 308 | " dL_dr, = dL_doutputs\n", 309 | " dr_dself = 1.0\n", 310 | " dr_drhs = 1.0\n", 311 | " dL_dself = dL_dr * dr_dself\n", 312 | " dL_drhs = dL_dr * dr_drhs\n", 313 | " return [dL_dself, dL_drhs]\n", 314 | " gradient_tape.append(TapeEntry(inputs=[self.name, rhs.name], outputs=[r.name], propagate=propagate))\n", 315 | " return r\n", 316 | "\n", 317 | "# sum is used to turn our matrices into a single scalar to get a loss.\n", 318 | "# expand is the backward of sum, so it is added to make sure our Variable\n", 319 | "# is closed under differentiation. 
Both have rules similar to mul above.\n", 320 | "\n", 321 | "def operator_sum(self: Variable, name: Optional[str]) -> 'Variable':\n", 322 | " r = Variable(torch.sum(self.value), name=name)\n", 323 | " print(f'{r.name} = {self.name}.sum()')\n", 324 | " def propagate(dL_doutputs: List[Variable]):\n", 325 | " dL_dr, = dL_doutputs\n", 326 | " size = self.value.size()\n", 327 | " return [dL_dr.expand(*size)]\n", 328 | " gradient_tape.append(TapeEntry(inputs=[self.name], outputs=[r.name], propagate=propagate))\n", 329 | " return r\n", 330 | "\n", 331 | "\n", 332 | "def operator_expand(self: Variable, sizes: List[int]) -> 'Variable':\n", 333 | " assert(self.value.dim() == 0) # only works for scalars\n", 334 | " r = Variable(self.value.expand(sizes))\n", 335 | " print(f'{r.name} = {self.name}.expand({sizes})')\n", 336 | " def propagate(dL_doutputs: List[Variable]):\n", 337 | " dL_dr, = dL_doutputs\n", 338 | " return [dL_dr.sum()]\n", 339 | " gradient_tape.append(TapeEntry(inputs=[self.name], outputs=[r.name], propagate=propagate))\n", 340 | " return r" 341 | ], 342 | "execution_count": null, 343 | "outputs": [] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": { 348 | "id": "i1TS_fbDBQry" 349 | }, 350 | "source": [ 351 | "# Using `grad`\n", 352 | "Let's use the implementation to calculate some gradients" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "metadata": { 358 | "id": "njvtatdLDrBz", 359 | "colab": { 360 | "base_uri": "https://localhost:8080/", 361 | "height": 272 362 | }, 363 | "outputId": "0002f41e-4ce1-411f-f6bc-5c66f424382a" 364 | }, 365 | "source": [ 366 | "a_global, b_global = torch.rand(4), torch.rand(4)\n", 367 | "\n", 368 | "def simple(a, b):\n", 369 | " t = a + b\n", 370 | " return t * b\n", 371 | "\n", 372 | "reset_tape() # reset any compute from other cells\n", 373 | "a = Variable.constant(a_global, name='a')\n", 374 | "b = Variable.constant(b_global, name='b')\n", 375 | "loss = simple(a, b)\n", 376 | "da, db = grad(loss, [a, 
b])\n", 377 | "print(\"da\", da)\n", 378 | "print(\"db\", db)" 379 | ], 380 | "execution_count": null, 381 | "outputs": [ 382 | { 383 | "output_type": "stream", 384 | "text": [ 385 | "a = tensor([0.0171, 0.1633, 0.5833, 0.3794])\n", 386 | "b = tensor([0.3774, 0.6308, 0.5239, 0.1387])\n", 387 | "v0 = a + b\n", 388 | "v1 = v0 * b\n", 389 | "dv1 ------------------------\n", 390 | "v3 = v2 * b\n", 391 | "v4 = v2 * v0\n", 392 | "v5 = v4 + v3\n", 393 | "dv1_dv1 = v2\n", 394 | "dv1_dv0 = v3\n", 395 | "dv1_db = v5\n", 396 | "dv1_da = v3\n", 397 | "------------------------\n", 398 | "da tensor([0.3774, 0.6308, 0.5239, 0.1387])\n", 399 | "db tensor([0.7719, 1.4249, 1.6311, 0.6567])\n" 400 | ], 401 | "name": "stdout" 402 | } 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": { 408 | "id": "dszwikD7ENj5" 409 | }, 410 | "source": [ 411 | "# Zero Gradients\n", 412 | "\n", 413 | "An interesting case to look at is when the gradient is zero." 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "metadata": { 419 | "id": "PkmEfLA9EV46", 420 | "colab": { 421 | "base_uri": "https://localhost:8080/", 422 | "height": 187 423 | }, 424 | "outputId": "23de0b78-cd3b-4f4d-f7ca-5dc0aec42b3a" 425 | }, 426 | "source": [ 427 | "reset_tape()\n", 428 | "loss = a*a\n", 429 | "da, db = grad(loss, [a, b])\n", 430 | "print(\"da\", da)\n", 431 | "print(\"db\", db)" 432 | ], 433 | "execution_count": null, 434 | "outputs": [ 435 | { 436 | "output_type": "stream", 437 | "text": [ 438 | "v0 = a * a\n", 439 | "dv0 ------------------------\n", 440 | "v2 = v1 * a\n", 441 | "v3 = v1 * a\n", 442 | "v4 = v2 + v3\n", 443 | "dv0_dv0 = v1\n", 444 | "dv0_da = v4\n", 445 | "------------------------\n", 446 | "da tensor([0.9209, 0.8121, 1.8843, 0.7893])\n", 447 | "db None\n" 448 | ], 449 | "name": "stdout" 450 | } 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": { 456 | "id": "bwfRHXd3ElEF" 457 | }, 458 | "source": [ 459 | "Notice that `db` has the value `None`. 
Another, perhaps more mathematically appropriate, choice would be to return a 4 element tensor of zeros because a value that does not contribute to the loss will have a gradient of zero. So why do we use `None` instead? The reason is that we want to be able to quickly check that a gradient value is zero, so that we can skip `propagate` functions that involve it in `grad`:\n", 460 | "\n", 461 | "```\n", 462 | "if all(dL_doutput is None for dL_doutput in dL_doutputs):\n", 463 | " # optimize for the case where some gradient pathways are zero. See\n", 464 | " # The note below for more details.\n", 465 | " continue\n", 466 | "```\n", 467 | "\n", 468 | "How does this skipping optimization work? Each propagate function is applying the chain rule.\n", 469 | "In the general case where there is a vector of inputs and vector\n", 470 | "of outputs to the function, the Jacobian `J` represents the pairwise\n", 471 | "partial derivatives of each output with respect to each input (`doutput_j/dinput_i`) in matrix form.\n", 472 | "The multiplication `v*J` (equivalently `J^t*v` if you treat `v` as a column vector) propagates the chain \n", 473 | "rule backward. This is why propagate is sometimes called the\n", 474 | "vector-Jacobian product, or `vjp` (and also why forward autodiff uses the Jacobian-vector product).\n", 475 | "\n", 476 | "In practice, we do not construct the `J` matrix, because it often\n", 477 | "has a lot of structure in it. For instance, in pointwise operations,\n", 478 | "it is a diagonal matrix (element `i` of the input affects only element `i` of the output). Constructing it would create `N^2` entries when we only have `N` non-zeros.\n", 479 | "\n", 480 | "However, we know that propagate always computes a matrix product\n", 481 | "against `J`. One important property is if `v` is 0, we know from\n", 482 | "the fact that matrix multiplication is a linear operator, that `v*J`\n", 483 | "is also 0. This is what the `if`-statement is saying. 
If all the \n", 484 | "input derivatives are 0, we know the outputs are 0, even without\n", 485 | "running propagate. This property is important in autograd as we often\n", 486 | "do more compute that is not related to the loss, and do not\n", 487 | "want to waste time computing zero gradients for it.\n", 488 | "\n", 489 | "This would be more expensive to check if we have to check that each element of a matrix was zero. So we use `None` in grad to represent a value _known_ to be full of only zeros, making the check constant time. PyTorch's autograd does the exact same check. For historical reasons it uses undefined tensors (`at::Tensor()`) in C++ to represent these known-to-be-zero tensors. This has implications for when we generate gradients for aggregate operators as we will see later. When working with the PyTorch autograd, you should keep in mind that undefined tensors are always used to represent these known-to-be-zero values. " 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": { 495 | "id": "D-4kNZj7GWzR" 496 | }, 497 | "source": [ 498 | "# Gradients of Gradients\n", 499 | "\n", 500 | "Notice how the definition of `propagate` works on `Variables` not `Tensors`. This is so that it can calculate the gradient of some other gradient. Just think of the first gradient computation like any other compute you can do. There is no reason why you can't take a gradient of that compute as well. 
As a concrete example let's look at this code:\n", 501 | "\n" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "metadata": { 507 | "id": "_-Vjp9w4G8qo", 508 | "colab": { 509 | "base_uri": "https://localhost:8080/", 510 | "height": 884 511 | }, 512 | "outputId": "698a70ad-01c7-4578-c089-9995a8bf3389" 513 | }, 514 | "source": [ 515 | "def run_gradients(my_fn, second_loss=True):\n", 516 | " reset_tape()\n", 517 | " a = Variable.constant(a_global, name='a')\n", 518 | " b = Variable.constant(b_global, name='b')\n", 519 | "\n", 520 | " # our first loss\n", 521 | " L0 = (my_fn(a, b)).sum(name='L0')\n", 522 | "\n", 523 | " # compute derivatives of our inputs\n", 524 | " dL0_da, dL0_db = grad(L0, [a, b])\n", 525 | " if not second_loss:\n", 526 | " return dL0_da, dL0_db\n", 527 | "\n", 528 | " # now let's compute the squared L2 norm of our derivatives\n", 529 | " L1 = (dL0_da*dL0_da + dL0_db*dL0_db).sum(name='L1')\n", 530 | "\n", 531 | " # and take the gradient of that.\n", 532 | " # notice there are two losses involved.\n", 533 | " dL1_da, dL1_db = grad(L1, [a, b])\n", 534 | " return dL1_da, dL1_db\n", 535 | "\n", 536 | "da, db = run_gradients(simple)\n", 537 | "print(\"da\", da)\n", 538 | "print(\"db\", db)" 539 | ], 540 | "execution_count": null, 541 | "outputs": [ 542 | { 543 | "output_type": "stream", 544 | "text": [ 545 | "a = tensor([0.4605, 0.4061, 0.9422, 0.3946])\n", 546 | "b = tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 547 | "v0 = a + b\n", 548 | "v1 = v0 * b\n", 549 | "L0 = v1.sum()\n", 550 | "dL0 ------------------------\n", 551 | "v3 = v2.expand(4)\n", 552 | "v4 = v3 * b\n", 553 | "v5 = v3 * v0\n", 554 | "v6 = v5 + v4\n", 555 | "dL0_dL0 = v2\n", 556 | "dL0_dv1 = v3\n", 557 | "dL0_dv0 = v4\n", 558 | "dL0_db = v6\n", 559 | "dL0_da = v4\n", 560 | "------------------------\n", 561 | "v7 = v4 * v4\n", 562 | "v8 = v6 * v6\n", 563 | "v9 = v7 + v8\n", 564 | "L1 = v9.sum()\n", 565 | "dL1 ------------------------\n", 566 | "v11 = v10.expand(4)\n", 567 | "v12 = v11 * 
v6\n", 568 | "v13 = v11 * v6\n", 569 | "v14 = v12 + v13\n", 570 | "v15 = v11 * v4\n", 571 | "v16 = v11 * v4\n", 572 | "v17 = v15 + v16\n", 573 | "v18 = v17 + v14\n", 574 | "v19 = v14 * v0\n", 575 | "v20 = v14 * v3\n", 576 | "v21 = v18 * b\n", 577 | "v22 = v18 * v3\n", 578 | "v23 = v19 + v21\n", 579 | "v24 = v23.sum()\n", 580 | "v25 = v22 + v20\n", 581 | "dL1_dL1 = v10\n", 582 | "dL1_dv9 = v11\n", 583 | "dL1_dv7 = v11\n", 584 | "dL1_dv8 = v11\n", 585 | "dL1_dv6 = v14\n", 586 | "dL1_dv4 = v18\n", 587 | "dL1_dv5 = v14\n", 588 | "dL1_dv3 = v23\n", 589 | "dL1_dv0 = v20\n", 590 | "dL1_db = v25\n", 591 | "dL1_dv2 = v24\n", 592 | "dL1_da = v20\n", 593 | "------------------------\n", 594 | "da tensor([1.2611, 2.1304, 5.8394, 3.3869])\n", 595 | "db tensor([ 2.6923, 4.9201, 13.6563, 8.0727])\n" 596 | ], 597 | "name": "stdout" 598 | } 599 | ] 600 | }, 601 | { 602 | "cell_type": "markdown", 603 | "metadata": { 604 | "id": "pxdBcwoPHvRE" 605 | }, 606 | "source": [ 607 | "Notice how the `gradient_tape` just keeps accumulating more entries as we run `grad` twice. This is because in the second call to `grad` we still have to consider all the pathways through which the gradient flows all the way from `L1` back to the inputs `a` and `b`. One implication is that the entries that are run in the first call to `grad` actually get run _again_ in the second call to `grad`. In practice this means that if you append a `propagate` function to the tape in a gradient-of-gradient scenario, you should expect it to run multiple times! If a single gradient compute is \"forward, backward\", then a gradient of gradient compute could be thought of as \"forward-part-0, backward-part-0, forward-part-1, backward-part-1, backward-part-0 (again)\".\n", 608 | "\n", 609 | "Issues with how autograd functions behave often _only_ appear when considering higher order gradients so it is important to test changes on these cases. We'll see an example later." 
610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": { 615 | "id": "XVos1LB_JgA5" 616 | }, 617 | "source": [ 618 | "# Rules of thumb for Autograd\n", 619 | "\n", 620 | "## Every use of a Variable generates a gradient specific to that use\n", 621 | "\n", 622 | "If you use a temporary variable `t` in two different subsequent computations, each _use_ of that value will have a gradient associated with it from the using operator. The multivariate chain rule tells us we can sum these gradients to get the overall contribution of `t`. We always have to account for all uses of a variable. If we forget about one, we will calculate the wrong value.\n", 623 | "\n", 624 | "## Inputs become outputs, outputs become inputs, reads become writes, writes become reads\n", 625 | "\n", 626 | "When we record a `TapeEntry` we also record the inputs and outputs of the compute _from the perspective of the forward pass_. The inputs/outputs of the propagate function in the backward pass are _flipped_. You get `dL/doutputs` and you produce `dL/dinputs`. It is easy to get confused by names like input or output. You have to keep in mind what they are relative to. A corollary here occurs at the level of compute. Because every read of a value in a matrix produces a gradient, it implies that in the backward pass we will be computing (and writing) a value for every read in forward. For instance, the `sum` operator reads an entire matrix and produces one value. So its reverse must be an operator that reads one value and writes an entire matrix. Indeed, the backward of `sum` is `expand`, which does precisely that.\n", 627 | "\n", 628 | "## Each call to grad produces gradients for a different loss\n", 629 | "\n", 630 | "When you call `grad(l, [a,b])` you are computing a set of gradients `dl_da`, `dl_db`. A subsequent call to `grad` will use a different loss, and potentially care about different inputs. If you abbreviate the loss, e.g. 
by saying `da`, you better be sure there aren't additional losses or you will quickly get confused. Gradient-of-gradient, or higher-order gradient, just means that we are computing some loss that was based on the gradients of another loss. There are an infinite number of calculations that compute gradients. There isn't a single \"grad-of-grad\" compute.\n" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": { 636 | "id": "SI-bjH-1NHNx" 637 | }, 638 | "source": [ 639 | "# Creating Aggregate Operations\n", 640 | "\n", 641 | "While autograd is a great way to piece together fundamental operators, sometimes you want to create aggregate operators that do not perform autograd operations internally. Fusion is one common example of this where, for instance, you may want to generate a single CUDA kernel for `(a + b)*b`. TorchScript's symbolic autograd internally can separate this compute from autograd and generate explicit forward/backward passes. Let's look at what issues arise when trying to do this. This should help with understanding the PyTorch implementation, and also make it possible to create custom aggregate operators with correct autograd implementations. 
Let's try to turn our `simple` function from before into one that computes its entire body as an aggregate:" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "metadata": { 647 | "id": "bqpAiClBJnM2", 648 | "colab": { 649 | "base_uri": "https://localhost:8080/", 650 | "height": 443 651 | }, 652 | "outputId": "713e7e56-0bd6-4677-f657-712c55fbe51f" 653 | }, 654 | "source": [ 655 | "def simple(a, b):\n", 656 | " t = a + b\n", 657 | " return t * b\n", 658 | "\n", 659 | "def simple_type_error(a, b):\n", 660 | " t = (a.value + b.value)\n", 661 | " r = Variable(t * b.value)\n", 662 | " def propagate(dL_doutputs: List[Variable]) -> List[Variable]:\n", 663 | " # manually apply the chain rule to the compute,\n", 664 | " # in practice a symbolic differentiator might create this code\n", 665 | " dL_dr, = dL_doutputs\n", 666 | " dr_dt = b # partial from: r = t * b\n", 667 | " dr_db = t # partial from: r = t * b\n", 668 | " dL_dt = dL_dr*dr_dt # chain rule\n", 669 | " dt_da = 1.0 # partial from t = a + b\n", 670 | " dt_db = 1.0 # partial from t = a + b\n", 671 | " dL_da = dL_dt * dt_da # chain rule\n", 672 | " dL_db = dL_dt * dt_db + dL_dr * dr_db # ERROR! 
dr_db is a Tensor not a Variable\n", 673 | " return [dL_da, dL_db]\n", 674 | " gradient_tape.append(TapeEntry(inputs=[a.name, b.name], outputs=[r.name], propagate=propagate))\n", 675 | " return r\n", 676 | "\n", 677 | "da, db = run_gradients(simple_type_error)" 678 | ], 679 | "execution_count": null, 680 | "outputs": [ 681 | { 682 | "output_type": "stream", 683 | "text": [ 684 | "a = tensor([0.4605, 0.4061, 0.9422, 0.3946])\n", 685 | "b = tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 686 | "L0 = v0.sum()\n", 687 | "dL0 ------------------------\n", 688 | "v2 = v1.expand(4)\n", 689 | "v3 = v2 * b\n" 690 | ], 691 | "name": "stdout" 692 | }, 693 | { 694 | "output_type": "error", 695 | "ename": "AttributeError", 696 | "evalue": "ignored", 697 | "traceback": [ 698 | "---------------------------------------------------------------------------", 699 | "AttributeError Traceback (most recent call last)", 700 | " in ()\n 21 return r\n 22 \n---> 23 da, db = run_gradients(simple_type_error)\n", 701 | " in run_gradients(my_fn, second_loss)\n 8 \n 9 # compute derivatives of our inputs\n---> 10 dL0_da, dL0_db = grad(L0, [a, b])\n 11 if not second_loss:\n 12 return dL0_da, dL0_db\n", 702 | " in grad(L, desired_results)\n 20 \n 21 # perform chain rule propagation specific to each compute\n---> 22 dL_dinputs = entry.propagate(dL_doutputs)\n 23 \n 24 # Accululate the gradient produced for each input.\n", 703 | " in propagate(dL_doutputs)\n 16 dt_db = 1.0 # partial from t = a + b\n 17 dL_da = dL_dt * dt_da # chain rule\n---> 18 dL_db = dL_dt * dt_db + dL_dr * dr_db # ERROR! dr_db is a Tensor not a Variable\n 19 return [dL_da, dL_db]\n 20 gradient_tape.append(TapeEntry(inputs=[a.name, b.name], outputs=[r.name], propagate=propagate))\n", 704 | " in __mul__(self, rhs)\n 19 def __mul__(self, rhs: 'Variable') -> 'Variable':\n 20 # defined later in the notebook\n---> 21 return operator_mul(self, rhs)\n 22 \n 23 def __add__(self, rhs: 'Variable') -> 'Variable':\n", 705 | " in operator_mul(self, rhs)\n 5 \n 6 # define forward\n----> 7 r = Variable(self.value * rhs.value)\n 8 print(f'{r.name} = {self.name} * {rhs.name}')\n 9 \n", 706 | "AttributeError: 'Tensor' object has no attribute 'value'" 707 | ] 708 | } 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "metadata": { 714 | "id": "pa5P78Y4PHfT" 715 | }, 716 | "source": [ 717 | "This doesn't work because `t` is being captured and used in propagate, 
but `propagate` expects to compute on Variables. Because `t` was extracted from autograd (it is a plain Tensor, not a Variable), it can no longer directly participate in the `propagate` call. One way to fix this is to recompute `t`." 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "metadata": { 723 | "id": "gY9_KfgZPgql", 724 | "colab": { 725 | "base_uri": "https://localhost:8080/", 726 | "height": 1000 727 | }, 728 | "outputId": "f64d673a-7e95-42d6-b605-9ceec8efd6ae" 729 | }, 730 | "source": [ 731 | "def simple_recompute(a, b):\n", 732 | " t = (a.value + b.value)\n", 733 | " r = Variable(t * b.value)\n", 734 | " def propagate(dL_doutputs: List[Variable]) -> List[Variable]:\n", 735 | " dL_dr, = dL_doutputs\n", 736 | " dr_dt = b # partial from: r = t * b\n", 737 | " t = a + b # RECOMPUTE!\n", 738 | " dr_db = t # partial from: r = t * b\n", 739 | " dL_dt = dL_dr*dr_dt # chain rule\n", 740 | " dt_da = 1.0 # partial from t = a + b\n", 741 | " dt_db = 1.0 # partial from t = a + b\n", 742 | " dL_da = dL_dt * dt_da # chain rule\n", 743 | " dL_db = dL_dt * dt_db + dL_dr * dr_db # chain rule\n", 744 | " return [dL_da, dL_db]\n", 745 | " gradient_tape.append(TapeEntry(inputs=[a.name, b.name], outputs=[r.name], propagate=propagate))\n", 746 | " return r\n", 747 | "\n", 748 | "da, db = run_gradients(simple_recompute)\n", 749 | "da_ref, db_ref = run_gradients(simple)\n", 750 | "print(\"da\", da, da_ref)\n", 751 | "print(\"db\", db, db_ref)" 752 | ], 753 | "execution_count": null, 754 | "outputs": [ 755 | { 756 | "output_type": "stream", 757 | "text": [ 758 | "a = tensor([0.4605, 0.4061, 0.9422, 0.3946])\n", 759 | "b = tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 760 | "L0 = v0.sum()\n", 761 | "dL0 ------------------------\n", 762 | "v2 = v1.expand(4)\n", 763 | "v3 = a + b\n", 764 | "v4 = v2 * b\n", 765 | "v5 = v2 * v3\n", 766 | "v6 = v4 + v5\n", 767 | "dL0_dL0 = v1\n", 768 | "dL0_dv0 = v2\n", 769 | "dL0_da = v4\n", 770 | "dL0_db = v6\n", 771 | "------------------------\n", 772 | "v7 = v4 * v4\n", 773 | "v8 = 
v6 * v6\n", 774 | "v9 = v7 + v8\n", 775 | "L1 = v9.sum()\n", 776 | "dL1 ------------------------\n", 777 | "v11 = v10.expand(4)\n", 778 | "v12 = v11 * v6\n", 779 | "v13 = v11 * v6\n", 780 | "v14 = v12 + v13\n", 781 | "v15 = v11 * v4\n", 782 | "v16 = v11 * v4\n", 783 | "v17 = v15 + v16\n", 784 | "v18 = v17 + v14\n", 785 | "v19 = v14 * v3\n", 786 | "v20 = v14 * v2\n", 787 | "v21 = v18 * b\n", 788 | "v22 = v18 * v2\n", 789 | "v23 = v19 + v21\n", 790 | "v24 = v22 + v20\n", 791 | "v25 = v23.sum()\n", 792 | "dL1_dL1 = v10\n", 793 | "dL1_dv9 = v11\n", 794 | "dL1_dv7 = v11\n", 795 | "dL1_dv8 = v11\n", 796 | "dL1_dv6 = v14\n", 797 | "dL1_dv4 = v18\n", 798 | "dL1_dv5 = v14\n", 799 | "dL1_dv2 = v23\n", 800 | "dL1_dv3 = v20\n", 801 | "dL1_db = v24\n", 802 | "dL1_da = v20\n", 803 | "dL1_dv1 = v25\n", 804 | "------------------------\n", 805 | "a = tensor([0.4605, 0.4061, 0.9422, 0.3946])\n", 806 | "b = tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 807 | "v0 = a + b\n", 808 | "v1 = v0 * b\n", 809 | "L0 = v1.sum()\n", 810 | "dL0 ------------------------\n", 811 | "v3 = v2.expand(4)\n", 812 | "v4 = v3 * b\n", 813 | "v5 = v3 * v0\n", 814 | "v6 = v5 + v4\n", 815 | "dL0_dL0 = v2\n", 816 | "dL0_dv1 = v3\n", 817 | "dL0_dv0 = v4\n", 818 | "dL0_db = v6\n", 819 | "dL0_da = v4\n", 820 | "------------------------\n", 821 | "v7 = v4 * v4\n", 822 | "v8 = v6 * v6\n", 823 | "v9 = v7 + v8\n", 824 | "L1 = v9.sum()\n", 825 | "dL1 ------------------------\n", 826 | "v11 = v10.expand(4)\n", 827 | "v12 = v11 * v6\n", 828 | "v13 = v11 * v6\n", 829 | "v14 = v12 + v13\n", 830 | "v15 = v11 * v4\n", 831 | "v16 = v11 * v4\n", 832 | "v17 = v15 + v16\n", 833 | "v18 = v17 + v14\n", 834 | "v19 = v14 * v0\n", 835 | "v20 = v14 * v3\n", 836 | "v21 = v18 * b\n", 837 | "v22 = v18 * v3\n", 838 | "v23 = v19 + v21\n", 839 | "v24 = v23.sum()\n", 840 | "v25 = v22 + v20\n", 841 | "dL1_dL1 = v10\n", 842 | "dL1_dv9 = v11\n", 843 | "dL1_dv7 = v11\n", 844 | "dL1_dv8 = v11\n", 845 | "dL1_dv6 = v14\n", 846 | "dL1_dv4 = v18\n", 
847 | "dL1_dv5 = v14\n", 848 | "dL1_dv3 = v23\n", 849 | "dL1_dv0 = v20\n", 850 | "dL1_db = v25\n", 851 | "dL1_dv2 = v24\n", 852 | "dL1_da = v20\n", 853 | "------------------------\n", 854 | "da tensor([1.2611, 2.1304, 5.8394, 3.3869]) tensor([1.2611, 2.1304, 5.8394, 3.3869])\n", 855 | "db tensor([ 2.6923, 4.9201, 13.6563, 8.0727]) tensor([ 2.6923, 4.9201, 13.6563, 8.0727])\n" 856 | ], 857 | "name": "stdout" 858 | } 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": { 864 | "id": "HLUaN1UeP4C_" 865 | }, 866 | "source": [ 867 | "This recompute works but it is not ideal. First, the original compute may have been expensive (think a bunch of convolutions and multiplies), so redoing it in the backward pass may take significant time. Second, we need to save `a` and `b` to recompute `t`. Previously we only had to save `b`. What if `a` was a _huge_ matrix but `t` was small? Then we are using _more total memory_ by doing this recompute as well. In general, we want to avoid recomputing things unless we know it won't be expensive in time or space.\n", 868 | "\n", 869 | "Let's consider another approach. What happens if we just make `t` into a Variable?" 
870 | ] 871 | }, 872 | { 873 | "cell_type": "code", 874 | "metadata": { 875 | "id": "ZIysmiVCQaw5", 876 | "colab": { 877 | "base_uri": "https://localhost:8080/", 878 | "height": 799 879 | }, 880 | "outputId": "7b1a0d2a-a546-436f-af00-9b5d5cc53986" 881 | }, 882 | "source": [ 883 | "def simple_variable_wrong(a, b):\n", 884 | " t = (a.value + b.value)\n", 885 | " t_v = Variable(t, name='t') # named for debugging\n", 886 | " r = Variable(t * b.value)\n", 887 | " def propagate(dL_doutputs: List[Variable]) -> List[Variable]:\n", 888 | " dL_dr, = dL_doutputs\n", 889 | " dr_dt = b # partial from: r = t * b\n", 890 | " dr_db = t_v # partial from: r = t * b\n", 891 | " dL_dt = dL_dr*dr_dt # chain rule\n", 892 | " dt_da = 1.0 # partial from t = a + b\n", 893 | " dt_db = 1.0 # partial from t = a + b\n", 894 | " dL_da = dL_dt * dt_da # chain rule\n", 895 | " dL_db = dL_dt * dt_db + dL_dr * dr_db # chain rule\n", 896 | " return [dL_da, dL_db]\n", 897 | " gradient_tape.append(TapeEntry(inputs=[a.name, b.name], outputs=[r.name], propagate=propagate))\n", 898 | " return r\n", 899 | "\n", 900 | "da, db = run_gradients(simple_variable_wrong)\n", 901 | "print(\"da\", da) # ERROR: da is None!!!????\n", 902 | "print(\"db\", db)" 903 | ], 904 | "execution_count": null, 905 | "outputs": [ 906 | { 907 | "output_type": "stream", 908 | "text": [ 909 | "a = tensor([0.4605, 0.4061, 0.9422, 0.3946])\n", 910 | "b = tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 911 | "L0 = v0.sum()\n", 912 | "dL0 ------------------------\n", 913 | "v2 = v1.expand(4)\n", 914 | "v3 = v2 * b\n", 915 | "v4 = v2 * t\n", 916 | "v5 = v3 + v4\n", 917 | "dL0_dL0 = v1\n", 918 | "dL0_dv0 = v2\n", 919 | "dL0_da = v3\n", 920 | "dL0_db = v5\n", 921 | "------------------------\n", 922 | "v6 = v3 * v3\n", 923 | "v7 = v5 * v5\n", 924 | "v8 = v6 + v7\n", 925 | "L1 = v8.sum()\n", 926 | "dL1 ------------------------\n", 927 | "v10 = v9.expand(4)\n", 928 | "v11 = v10 * v5\n", 929 | "v12 = v10 * v5\n", 930 | "v13 = v11 + v12\n", 931 | 
"v14 = v10 * v3\n", 932 | "v15 = v10 * v3\n", 933 | "v16 = v14 + v15\n", 934 | "v17 = v16 + v13\n", 935 | "v18 = v13 * t\n", 936 | "v19 = v13 * v2\n", 937 | "v20 = v17 * b\n", 938 | "v21 = v17 * v2\n", 939 | "v22 = v18 + v20\n", 940 | "v23 = v22.sum()\n", 941 | "dL1_dL1 = v9\n", 942 | "dL1_dv8 = v10\n", 943 | "dL1_dv6 = v10\n", 944 | "dL1_dv7 = v10\n", 945 | "dL1_dv5 = v13\n", 946 | "dL1_dv3 = v17\n", 947 | "dL1_dv4 = v13\n", 948 | "dL1_dv2 = v22\n", 949 | "dL1_dt = v19\n", 950 | "dL1_db = v21\n", 951 | "dL1_dv1 = v23\n", 952 | "------------------------\n", 953 | "da None\n", 954 | "db tensor([1.4312, 2.7896, 7.8169, 4.6857])\n" 955 | ], 956 | "name": "stdout" 957 | } 958 | ] 959 | }, 960 | { 961 | "cell_type": "markdown", 962 | "metadata": { 963 | "id": "v6D1xMlfQ4xE" 964 | }, 965 | "source": [ 966 | "While we do not get an error, something is clearly wrong. `dL1/da` is None, but we _know_ that the value of `a` affects the norm of the gradients of the original loss so this value should not be None. 
We are not propagating a gradient somewhere!\n", 967 | "\n", 968 | "Let's see what happens when we run just the first gradient.\n" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "metadata": { 974 | "id": "KsCpcoCYRVgR", 975 | "colab": { 976 | "base_uri": "https://localhost:8080/", 977 | "height": 544 978 | }, 979 | "outputId": "c67a729e-0dfe-4e0e-b267-50d7ee05e307" 980 | }, 981 | "source": [ 982 | "da, db = run_gradients(simple_variable_wrong, second_loss=False)\n", 983 | "da_ref, db_ref = run_gradients(simple, second_loss=False)\n", 984 | "print(\"da\", da, da_ref) \n", 985 | "print(\"db\", db, db_ref)" 986 | ], 987 | "execution_count": null, 988 | "outputs": [ 989 | { 990 | "output_type": "stream", 991 | "text": [ 992 | "a = tensor([0.4605, 0.4061, 0.9422, 0.3946])\n", 993 | "b = tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 994 | "L0 = v0.sum()\n", 995 | "dL0 ------------------------\n", 996 | "v2 = v1.expand(4)\n", 997 | "v3 = v2 * b\n", 998 | "v4 = v2 * t\n", 999 | "v5 = v3 + v4\n", 1000 | "dL0_dL0 = v1\n", 1001 | "dL0_dv0 = v2\n", 1002 | "dL0_da = v3\n", 1003 | "dL0_db = v5\n", 1004 | "------------------------\n", 1005 | "a = tensor([0.4605, 0.4061, 0.9422, 0.3946])\n", 1006 | "b = tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 1007 | "v0 = a + b\n", 1008 | "v1 = v0 * b\n", 1009 | "L0 = v1.sum()\n", 1010 | "dL0 ------------------------\n", 1011 | "v3 = v2.expand(4)\n", 1012 | "v4 = v3 * b\n", 1013 | "v5 = v3 * v0\n", 1014 | "v6 = v5 + v4\n", 1015 | "dL0_dL0 = v2\n", 1016 | "dL0_dv1 = v3\n", 1017 | "dL0_dv0 = v4\n", 1018 | "dL0_db = v6\n", 1019 | "dL0_da = v4\n", 1020 | "------------------------\n", 1021 | "da tensor([0.0850, 0.3296, 0.9888, 0.6494]) tensor([0.0850, 0.3296, 0.9888, 0.6494])\n", 1022 | "db tensor([0.6306, 1.0652, 2.9197, 1.6935]) tensor([0.6306, 1.0652, 2.9197, 1.6935])\n" 1023 | ], 1024 | "name": "stdout" 1025 | } 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": { 1031 | "id": "-ck3a9nQRil7" 1032 | }, 1033 
| "source": [ 1034 | "In the single-backward case, we get the right answer! This illustrates a key part of autograd: it is _very easy_ to make code appear to work for a single backward pass while it is still broken for higher-order gradients. \n", 1035 | "\n", 1036 | "So what is going wrong? Look at the debug trace from the first time we ran `simple_variable_wrong`. Inside the compute of `dL0` (the first backward), you can see a line: `v4 = v2 * t`. The first backward is using the value of `t`. But if a computation _uses_ `t`, then any future loss (`L1`) that depends on the result of that computation will have a non-zero gradient `dL1/dt`. This future use of `t` is not accounted for in `simple_variable_wrong`! We account for the gradient that reaches `t` through `r` as `dL_dt = dL_dr*dr_dt`, but do not consider uses of `t` outside the local aggregate. This is because the way `t` can be used in the future is subtle: it escapes from our compute _only_ through its use as a closed-over variable in `propagate`. So this gradient pathway can only be non-zero for higher-order gradients where we are differentiating through this use.\n", 1037 | "\n", 1038 | "The problem arises because `t` was not declared as an output of the original computation, even though it was defined by the computation and used by later computations. We can fix this by declaring it as an output in the gradient tape and then using the derivative contribution that comes from it." 
1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | "metadata": { 1044 | "id": "2cbRekShbmxX", 1045 | "colab": { 1046 | "base_uri": "https://localhost:8080/", 1047 | "height": 986 1048 | }, 1049 | "outputId": "950d5943-f63f-4ff3-e370-72e9316d340b" 1050 | }, 1051 | "source": [ 1052 | "def simple_variable_almost(a, b):\n", 1053 | " t = (a.value + b.value)\n", 1054 | " t_v = Variable(t, name='t_v')\n", 1055 | " r = Variable(t * b.value)\n", 1056 | " def propagate(dL_doutputs: List[Variable]) -> List[Variable]:\n", 1057 | " # t is considered an output, so we now get dL_dt0 as an input.\n", 1058 | " dL_dr, dL_dt0 = dL_doutputs\n", 1059 | " ###### new gradient contribution\n", 1060 | "\n", 1061 | " # Handle cases where one incoming gradient is zero (None)\n", 1062 | " if dL_dr is None:\n", 1063 | " dL_dr = Variable.constant(torch.zeros(()))\n", 1064 | " if dL_dt0 is None:\n", 1065 | " dL_dt0 = Variable.constant(torch.zeros(()))\n", 1066 | " \n", 1067 | "\n", 1068 | " dr_dt = b \n", 1069 | " dr_db = t_v \n", 1070 | " # we combine this with the contribution from r to calculate \n", 1071 | " # all gradient paths to dL_dt\n", 1072 | " dL_dt = dL_dt0 + dL_dr*dr_dt # chain rule\n", 1073 | " ######\n", 1074 | "\n", 1075 | " dt_da = 1.0 \n", 1076 | " dt_db = 1.0 \n", 1077 | " dL_db = dL_dr * dr_db + dL_dt * dt_db \n", 1078 | " dL_da = dL_dt * dt_da\n", 1079 | " return [dL_da, dL_db]\n", 1080 | "\n", 1081 | " # note: t_v is now considered an output in the tape\n", 1082 | " gradient_tape.append(TapeEntry(inputs=[a.name, b.name], outputs=[r.name, t_v.name], propagate=propagate))\n", 1083 | " ######### new output\n", 1084 | " return r\n", 1085 | "da, db = run_gradients(simple_variable_almost)\n", 1086 | "print(\"da\", da) \n", 1087 | "print(\"db\", db)" 1088 | ], 1089 | "execution_count": null, 1090 | "outputs": [ 1091 | { 1092 | "output_type": "stream", 1093 | "text": [ 1094 | "a = tensor([0.0171, 0.1633, 0.5833, 0.3794])\n", 1095 | "b = tensor([0.3774, 0.6308, 0.5239, 
0.1387])\n", 1096 | "L0 = v0.sum()\n", 1097 | "dL0 ------------------------\n", 1098 | "v2 = v1.expand(4)\n", 1099 | "v3 = 0.0\n", 1100 | "v4 = v2 * b\n", 1101 | "v5 = v3 + v4\n", 1102 | "v6 = v2 * t_v\n", 1103 | "v7 = v6 + v5\n", 1104 | "dL0_dL0 = v1\n", 1105 | "dL0_dv0 = v2\n", 1106 | "dL0_da = v5\n", 1107 | "dL0_db = v7\n", 1108 | "------------------------\n", 1109 | "v8 = v5 * v5\n", 1110 | "v9 = v7 * v7\n", 1111 | "v10 = v8 + v9\n", 1112 | "L1 = v10.sum()\n", 1113 | "dL1 ------------------------\n", 1114 | "v12 = v11.expand(4)\n", 1115 | "v13 = v12 * v7\n", 1116 | "v14 = v12 * v7\n", 1117 | "v15 = v13 + v14\n", 1118 | "v16 = v12 * v5\n", 1119 | "v17 = v12 * v5\n", 1120 | "v18 = v16 + v17\n", 1121 | "v19 = v18 + v15\n", 1122 | "v20 = v15 * t_v\n", 1123 | "v21 = v15 * v2\n", 1124 | "v22 = v19 * b\n", 1125 | "v23 = v19 * v2\n", 1126 | "v24 = v20 + v22\n", 1127 | "v25 = v24.sum()\n", 1128 | "v26 = 0.0\n", 1129 | "v27 = v26 * b\n", 1130 | "v28 = v21 + v27\n", 1131 | "v29 = v26 * t_v\n", 1132 | "v30 = v29 + v28\n", 1133 | "v31 = v23 + v30\n", 1134 | "dL1_dL1 = v11\n", 1135 | "dL1_dv10 = v12\n", 1136 | "dL1_dv8 = v12\n", 1137 | "dL1_dv9 = v12\n", 1138 | "dL1_dv7 = v15\n", 1139 | "dL1_dv5 = v19\n", 1140 | "dL1_dv6 = v15\n", 1141 | "dL1_dv2 = v24\n", 1142 | "dL1_dt_v = v21\n", 1143 | "dL1_dv3 = v19\n", 1144 | "dL1_dv4 = v19\n", 1145 | "dL1_db = v31\n", 1146 | "dL1_dv1 = v25\n", 1147 | "dL1_da = v28\n", 1148 | "------------------------\n", 1149 | "da tensor([1.5438, 2.8499, 3.2622, 1.3134])\n", 1150 | "db tensor([3.8424, 6.9614, 7.5721, 2.9042])\n" 1151 | ], 1152 | "name": "stdout" 1153 | } 1154 | ] 1155 | }, 1156 | { 1157 | "cell_type": "markdown", 1158 | "metadata": { 1159 | "id": "GjIEL9nncdmO" 1160 | }, 1161 | "source": [ 1162 | "This code is now correct! However, it has some non-optimal behavior. Notice how at the beginning of `propagate` we need to handle the cases where the gradients coming in are `None`. 
Recall that when a pathway has no gradient we give it the value `None`. The first time through `propagate`, `dL_dt0` will be `None` since `t` is not used outside of the `propagate` function itself on the first backward. The _second_ time through `propagate`, `dL_dt0` will have a value but `dL_dr` will be `None`. Exercise: convince yourself why `dL_dr` is `None` the second time through. When we fix this by changing the `None` into zeros, we get the right answer but at the cost of always doing more compute. For instance, in this case it adds a pointwise addition of a zero tensor to every single-backward call, to handle the `dL_dt0` input, which will be zero.\n", 1163 | "\n", 1164 | "It makes sense to use a constant-time check for zero to eliminate a tensor-sized amount of work. So we optimize this code by replicating some of the `None` handling logic in `grad` directly into the aggregate op. It is a little messy but it handles the cases where inputs might be `None` with a minimal amount of compute." 
1165 | ] 1166 | }, 1167 | { 1168 | "cell_type": "code", 1169 | "metadata": { 1170 | "id": "fshJIV4xcJKW", 1171 | "colab": { 1172 | "base_uri": "https://localhost:8080/", 1173 | "height": 833 1174 | }, 1175 | "outputId": "90eee555-de63-4fa5-fdf1-313d9fae8116" 1176 | }, 1177 | "source": [ 1178 | "def add_optional(a: Optional['Variable'], b: Optional['Variable']):\n", 1179 | " if a is None:\n", 1180 | " return b\n", 1181 | " if b is None:\n", 1182 | " return a\n", 1183 | " return a + b\n", 1184 | "\n", 1185 | "def simple_variable(a, b):\n", 1186 | " t = (a.value + b.value)\n", 1187 | " t_v = Variable(t, name='t_v')\n", 1188 | " r = Variable(t * b.value)\n", 1189 | " def propagate(dL_doutputs: List[Variable]) -> List[Variable]:\n", 1190 | " dL_dr, dL_dt0 = dL_doutputs\n", 1191 | " dr_dt = b # partial from: r = t * b\n", 1192 | " dr_db = t_v # partial from: r = t * b\n", 1193 | " dL_dt = dL_dt0\n", 1194 | " if dL_dr is not None:\n", 1195 | " dL_dt = add_optional(dL_dt, dL_dr*dr_dt) # chain rule\n", 1196 | "\n", 1197 | " dt_da = 1.0 # partial from t = a + b\n", 1198 | " dt_db = 1.0 # partial from t = a + b\n", 1199 | " if dL_dr is not None:\n", 1200 | " dL_db = dL_dr * dr_db # chain rule\n", 1201 | " else:\n", 1202 | " dL_db = None\n", 1203 | "\n", 1204 | " if dL_dt is not None:\n", 1205 | " dL_da = dL_dt * dt_da # chain rule\n", 1206 | " dL_db = add_optional(dL_db, dL_dt * dt_db)\n", 1207 | " else:\n", 1208 | " dL_da = None\n", 1209 | "\n", 1210 | " return [dL_da, dL_db]\n", 1211 | "\n", 1212 | " gradient_tape.append(TapeEntry(inputs=[a.name, b.name], outputs=[r.name, t_v.name], propagate=propagate))\n", 1213 | " return r\n", 1214 | "da, db = run_gradients(simple_variable)\n", 1215 | "print(\"da\", da) \n", 1216 | "print(\"db\", db)" 1217 | ], 1218 | "execution_count": null, 1219 | "outputs": [ 1220 | { 1221 | "output_type": "stream", 1222 | "text": [ 1223 | "a = tensor([0.0171, 0.1633, 0.5833, 0.3794])\n", 1224 | "b = tensor([0.3774, 0.6308, 0.5239, 0.1387])\n", 1225 | 
"L0 = v0.sum()\n", 1226 | "dL0 ------------------------\n", 1227 | "v2 = v1.expand(4)\n", 1228 | "v3 = v2 * b\n", 1229 | "v4 = v2 * t_v\n", 1230 | "v5 = v4 + v3\n", 1231 | "dL0_dL0 = v1\n", 1232 | "dL0_dv0 = v2\n", 1233 | "dL0_da = v3\n", 1234 | "dL0_db = v5\n", 1235 | "------------------------\n", 1236 | "v6 = v3 * v3\n", 1237 | "v7 = v5 * v5\n", 1238 | "v8 = v6 + v7\n", 1239 | "L1 = v8.sum()\n", 1240 | "dL1 ------------------------\n", 1241 | "v10 = v9.expand(4)\n", 1242 | "v11 = v10 * v5\n", 1243 | "v12 = v10 * v5\n", 1244 | "v13 = v11 + v12\n", 1245 | "v14 = v10 * v3\n", 1246 | "v15 = v10 * v3\n", 1247 | "v16 = v14 + v15\n", 1248 | "v17 = v16 + v13\n", 1249 | "v18 = v13 * t_v\n", 1250 | "v19 = v13 * v2\n", 1251 | "v20 = v17 * b\n", 1252 | "v21 = v17 * v2\n", 1253 | "v22 = v18 + v20\n", 1254 | "v23 = v22.sum()\n", 1255 | "v24 = v21 + v19\n", 1256 | "dL1_dL1 = v9\n", 1257 | "dL1_dv8 = v10\n", 1258 | "dL1_dv6 = v10\n", 1259 | "dL1_dv7 = v10\n", 1260 | "dL1_dv5 = v13\n", 1261 | "dL1_dv3 = v17\n", 1262 | "dL1_dv4 = v13\n", 1263 | "dL1_dv2 = v22\n", 1264 | "dL1_dt_v = v19\n", 1265 | "dL1_db = v24\n", 1266 | "dL1_dv1 = v23\n", 1267 | "dL1_da = v19\n", 1268 | "------------------------\n", 1269 | "da tensor([1.5438, 2.8499, 3.2622, 1.3134])\n", 1270 | "db tensor([3.8424, 6.9614, 7.5721, 2.9042])\n" 1271 | ], 1272 | "name": "stdout" 1273 | } 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "markdown", 1278 | "metadata": { 1279 | "id": "r77nkn7Y5NjU" 1280 | }, 1281 | "source": [ 1282 | "**Exercise:** Modify `run_gradients` such that the second call to `grad` produces non-zero values for both `dL_dr` and `dL_dt`. Hint: it can be done with the addition of 2 characters." 
1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "markdown", 1287 | "metadata": { 1288 | "id": "lZvYIrBS5fXr" 1289 | }, 1290 | "source": [ 1291 | "In PyTorch's symbolic autodiff implementation, the handling of zero tensors is done with undefined tensors in the place of `None` values, but the logic in TorchScript is very similar. The function `any_defined(...)` is used to check if any inputs are non-zero and guards the calculation of unused parts of the autograd. The `AutogradAdd(a, b)` operator adds two tensors, handling the case where either is undefined, similar to `add_optional`. \n", 1292 | "\n", 1293 | "The backward pass is very messy as-is with all of this conditional logic. Furthermore, as you have seen in these examples, in many cases the logic will branch in the same direction. This is especially true for single-backward, where gradients from captured temporaries will always be zero. It is profitable to try to specialize this code for particular patterns of non-zeros since it allows more aggressive fusion of the backward pass." 1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "markdown", 1298 | "metadata": { 1299 | "id": "sAApdEb27KTA" 1300 | }, 1301 | "source": [ 1302 | "# PyTorch vs Simple Grad\n", 1303 | "\n", 1304 | "Simple Grad gives a good overview of how PyTorch's autograd works. TorchScript's symbolic gradient pass can generate aggregate operators from subsets of the IR by automating the process we went through to define `simple` as an aggregate operator.\n", 1305 | "\n", 1306 | "The real PyTorch autograd has some features that go beyond this example related to mutable tensors. Simple Grad assumes that tensors are immutable, so saving a Tensor for use in `propagate` is as simple as saving a reference to it. In PyTorch, the gradient formulas need to explicitly mark a Tensor as needing to be saved so we can track future potential mutations. The idea is to be able to track if a user mutated a tensor that is needed by backward and report an error on use. 
Mutable ops themselves also affect how the `propagate` functions get recorded. If a tensor is mutated, uses of the tensor _before_ the mutation need to propagate gradient to the original value, while uses _after_ propagate gradient to the new mutated value. Since tensors can be views of other mutable tensors, PyTorch needs bookkeeping to make sure any time a tensor is updated all views of the tensor now propagate gradient to the new value and not the old one. \n", 1307 | "\n", 1308 | "# Where to go from here\n", 1309 | "\n", 1310 | "If you still have questions about how this process works, I encourage you to edit this notebook with additional debug information and play around with compute. You can try:\n", 1311 | "* Add a new operator with a `propagate` formula (use `torch.autograd.grad` to verify correctness).\n", 1312 | "* Modify `run_gradients` to calculate weirder higher-order gradients and see if it behaves as you expect.\n", 1313 | "* Remove `None` and implement gradients using Tensor zeros.\n", 1314 | "* Try to manually define another aggregate operator for something similar to `simple`.\n", 1315 | "* Write a 'compiler' that can take a small expression similar to `simple` and transform it automatically into a forward and `propagate`, similar to autodiff.cpp.\n", 1316 | "* Rewrite `simple_variable` so all the branching for `None` checks is at the top of `propagate`. Can you generalize this such that a compiler can generate specializations for the seen non-zero patterns?\n", 1317 | "* Read `autodiff.cpp` and add a description to this document about how the concepts in here directly relate to that code." 
1318 | ] 1319 | } 1320 | ] 1321 | } -------------------------------------------------------------------------------- /autograd_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | "Source of this notebook: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html" 6 | ], 7 | "cell_type": "markdown", 8 | "metadata": {} 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "# A Gentle Introduction to ``torch.autograd``\n", 26 | "\n", 27 | "``torch.autograd`` is PyTorch’s automatic differentiation engine that powers\n", 28 | "neural network training. In this section, you will get a conceptual\n", 29 | "understanding of how autograd helps a neural network train.\n", 30 | "\n", 31 | "## Background\n", 32 | "Neural networks (NNs) are a collection of nested functions that are\n", 33 | "executed on some input data. These functions are defined by *parameters*\n", 34 | "(consisting of weights and biases), which in PyTorch are stored in\n", 35 | "tensors. \n", 36 | "(Note: nested functions are functions embedded inside one another.)\n", 37 | "\n", 38 | "Training a NN happens in two steps:\n", 39 | "\n", 40 | "**Forward Propagation**: In forward prop, the NN makes its best guess\n", 41 | "about the correct output. It runs the input data through each of its\n", 42 | "functions to make this guess.\n", 43 | "\n", 44 | "**Backward Propagation**: In backprop, the NN adjusts its parameters\n", 45 | "proportionate to the error in its guess. It does this by traversing\n", 46 | "backwards from the output, collecting the derivatives of the error with\n", 47 | "respect to the parameters of the functions (*gradients*), and optimizing\n", 48 | "the parameters using gradient descent. 
For a more detailed walkthrough\n", 49 | "of backprop, check out this [video from\n", 50 | "3Blue1Brown](https://www.youtube.com/watch?v=tIeHLnjs5U8).\n", 51 | "\n", 52 | "## Usage in PyTorch\n", 53 | "Let's take a look at a single training step.\n", 54 | "For this example, we load a pretrained resnet18 model from ``torchvision``.\n", 55 | "We create a random data tensor to represent a single image with 3 channels, and height & width of 64,\n", 56 | "and its corresponding ``label`` initialized to some random values.\n", 57 | "\n" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "import torch, torchvision\n", 69 | "model = torchvision.models.resnet18(pretrained=True)\n", 70 | "data = torch.rand(1, 3, 64, 64)\n", 71 | "labels = torch.rand(1, 1000)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Next, we run the input data through each of the model's layers to make a prediction.\n", 79 | "This is the **forward pass**.\n", 80 | "\n", 81 | "\n" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": { 88 | "collapsed": false 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "prediction = model(data) # forward pass" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "We use the model's prediction and the corresponding label to calculate the error (``loss``).\n", 100 | "The next step is to backpropagate this error through the network.\n", 101 | "Backward propagation is kicked off when we call ``.backward()`` on the error tensor.\n", 102 | "Autograd then calculates and stores the gradients for each model parameter in the parameter's ``.grad`` attribute.\n", 103 | "\n", 104 | "(Note: to kick off means to start.)\n" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": { 111 | "collapsed": false 112 
}, 113 | "outputs": [], 114 | "source": [ 115 | "loss = (prediction - labels).sum()\n", 116 | "loss.backward() # backward pass" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and momentum of 0.9.\n", 124 | "We register all the parameters of the model in the optimizer.\n", 125 | "\n", 126 | "\n" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 5, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Finally, we call ``.step()`` to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in ``.grad``.\n", 145 | "\n", 146 | "\n" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 6, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "optim.step() #gradient descent" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "At this point, you have everything you need to train your neural network.\n", 165 | "The below sections detail the workings of autograd - feel free to skip them.\n", 166 | "\n", 167 | "\n" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "--------------\n", 175 | "\n", 176 | "\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "## Differentiation in Autograd\n", 184 | "\n", 185 | "Let's take a look at how ``autograd`` collects gradients. We create two tensors ``a`` and ``b`` with\n", 186 | "``requires_grad=True``. 
This signals to ``autograd`` that every operation on them should be tracked.\n", 187 | "\n", 188 | "\n" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 7, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "import torch\n", 200 | "\n", 201 | "a = torch.tensor([2., 3.], requires_grad=True)\n", 202 | "b = torch.tensor([6., 4.], requires_grad=True)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "We create another tensor ``Q`` from ``a`` and ``b``.\n", 210 | "\n", 211 | "\\begin{align}Q = 3a^3 - b^2\\end{align}\n", 212 | "\n" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 8, 218 | "metadata": { 219 | "collapsed": false 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "Q = 3*a**3 - b**2" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Let's assume ``a`` and ``b`` to be parameters of an NN, and ``Q``\n", 231 | "to be the error. In NN training, we want gradients of the error\n", 232 | "w.r.t. parameters, i.e.\n", 233 | "\n", 234 | "$$\\begin{align}\\frac{\\partial Q}{\\partial a} = 9a^2\\end{align}$$\n", 235 | "\n", 236 | "\\begin{align}\\frac{\\partial Q}{\\partial b} = -2b\\end{align}\n", 237 | "\n", 238 | "\n", 239 | "When we call ``.backward()`` on ``Q``, autograd calculates these gradients\n", 240 | "and stores them in the respective tensors' ``.grad`` attribute.\n", 241 | "\n", 242 | "(respective: belonging to each one individually)\n", 243 | "\n", 244 | "We need to explicitly pass a ``gradient`` argument in ``Q.backward()`` because ``Q`` is a vector.\n", 245 | "``gradient`` is a tensor of the same shape as ``Q``, and it represents the\n", 246 | "gradient of Q w.r.t. 
itself, i.e.\n", 247 | "\n", 248 | "\\begin{align}\\frac{dQ}{dQ} = 1\\end{align}\n", 249 | "\n", 250 | "Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like ``Q.sum().backward()``.\n", 251 | "\n", 252 | "\n" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 9, 258 | "metadata": { 259 | "collapsed": false 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "external_grad = torch.tensor([1., 1.])\n", 264 | "Q.backward(gradient=external_grad)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "Gradients are now deposited in ``a.grad`` and ``b.grad``\n", 272 | "\n" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 10, 278 | "metadata": { 279 | "collapsed": false 280 | }, 281 | "outputs": [ 282 | { 283 | "output_type": "stream", 284 | "name": "stdout", 285 | "text": [ 286 | "tensor([True, True])\ntensor([True, True])\n" 287 | ] 288 | } 289 | ], 290 | "source": [ 291 | "# check if collected gradients are correct\n", 292 | "print(9*a**2 == a.grad)\n", 293 | "print(-2*b == b.grad)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "## Optional Reading - Vector Calculus using ``autograd``\n", 301 | "\n", 302 | "Mathematically, if you have a vector valued function\n", 303 | "$\\vec{y}=f(\\vec{x})$, then the gradient of $\\vec{y}$ with\n", 304 | "respect to $\\vec{x}$ is a Jacobian matrix $J$:\n", 305 | "\n", 306 | "\\begin{align}J\n", 307 | " =\n", 308 | " \\left(\\begin{array}{cc}\n", 309 | " \\frac{\\partial \\bf{y}}{\\partial x_{1}} &\n", 310 | " ... 
&\n", 311 | " \\frac{\\partial \\bf{y}}{\\partial x_{n}}\n", 312 | " \\end{array}\\right)\n", 313 | " =\n", 314 | " \\left(\\begin{array}{ccc}\n", 315 | " \\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{1}}{\\partial x_{n}}\\\\\n", 316 | " \\vdots & \\ddots & \\vdots\\\\\n", 317 | " \\frac{\\partial y_{m}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n", 318 | " \\end{array}\\right)\\end{align}\n", 319 | "\n", 320 | "Generally speaking, ``torch.autograd`` is an engine for computing\n", 321 | "vector-Jacobian product. That is, given any vector $\\vec{v}$, compute the product\n", 322 | "$J^{T}\\cdot \\vec{v}$\n", 323 | "\n", 324 | "If $\\vec{v}$ happens to be the gradient of a scalar function $l=g\\left(\\vec{y}\\right)$:\n", 325 | "\n", 326 | "\\begin{align}\\vec{v}\n", 327 | " =\n", 328 | " \\left(\\begin{array}{ccc}\\frac{\\partial l}{\\partial y_{1}} & \\cdots & \\frac{\\partial l}{\\partial y_{m}}\\end{array}\\right)^{T}\\end{align}\n", 329 | "\n", 330 | "then by the chain rule, the vector-Jacobian product would be the\n", 331 | "gradient of $l$ with respect to $\\vec{x}$:\n", 332 | "\n", 333 | "\\begin{align}J^{T}\\cdot \\vec{v}=\\left(\\begin{array}{ccc}\n", 334 | " \\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{1}}\\\\\n", 335 | " \\vdots & \\ddots & \\vdots\\\\\n", 336 | " \\frac{\\partial y_{1}}{\\partial x_{n}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n", 337 | " \\end{array}\\right)\\left(\\begin{array}{c}\n", 338 | " \\frac{\\partial l}{\\partial y_{1}}\\\\\n", 339 | " \\vdots\\\\\n", 340 | " \\frac{\\partial l}{\\partial y_{m}}\n", 341 | " \\end{array}\\right)=\\left(\\begin{array}{c}\n", 342 | " \\frac{\\partial l}{\\partial x_{1}}\\\\\n", 343 | " \\vdots\\\\\n", 344 | " \\frac{\\partial l}{\\partial x_{n}}\n", 345 | " \\end{array}\\right)\\end{align}\n", 346 | "\n", 347 | "This characteristic of vector-Jacobian product is what we use in the above 
example;\n", 348 | "``external_grad`` represents $\\vec{v}$.\n", 349 | "\n", 350 | "\n" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "## Computational Graph\n", 358 | "\n", 359 | "Conceptually, autograd keeps a record of data (tensors) & all executed\n", 360 | "operations (along with the resulting new tensors) in a directed acyclic\n", 361 | "graph (DAG) consisting of\n", 362 | "[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)\n", 363 | "objects. In this DAG, leaves are the input tensors, roots are the output\n", 364 | "tensors. By tracing this graph from roots to leaves, you can\n", 365 | "automatically compute the gradients using the chain rule.\n", 366 | "\n", 367 | "In a forward pass, autograd does two things simultaneously:\n", 368 | "\n", 369 | "- run the requested operation to compute a resulting tensor, and\n", 370 | "- maintain the operation’s *gradient function* in the DAG.\n", 371 | "\n", 372 | "The backward pass kicks off when ``.backward()`` is called on the DAG\n", 373 | "root. ``autograd`` then:\n", 374 | "\n", 375 | "- computes the gradients from each ``.grad_fn``,\n", 376 | "- accumulates them in the respective tensor’s ``.grad`` attribute, and\n", 377 | "- using the chain rule, propagates all the way to the leaf tensors.\n", 378 | "\n", 379 | "Below is a visual representation of the DAG in our example. In the graph,\n", 380 | "the arrows are in the direction of the forward pass. The nodes represent the backward functions\n", 381 | "of each operation in the forward pass. The leaf nodes in blue represent our leaf tensors ``a`` and ``b``.\n", 382 | "\n", 383 | "![figure](https://pytorch.org/tutorials/_images/dag_autograd.png)\n", 384 | "\n", 385 | "

Note

\n", 386 | "\n", 387 | "**DAGs are dynamic in PyTorch**\n", 388 | "\n", 389 | " An important thing to note is that the graph is recreated from scratch; after each\n", 390 | " ``.backward()`` call, autograd starts populating a new graph. This is\n", 391 | " exactly what allows you to use control flow statements in your model;\n", 392 | " you can change the shape, size and operations at every iteration if\n", 393 | " needed.\n", 394 | "\n", 395 | "## Exclusion from the DAG\n", 396 | "\n", 397 | "``torch.autograd`` tracks operations on all tensors which have their\n", 398 | "``requires_grad`` flag set to ``True``. For tensors that don’t require\n", 399 | "gradients, setting this attribute to ``False`` excludes it from the\n", 400 | "gradient computation DAG.\n", 401 | "\n", 402 | "The output tensor of an operation will require gradients even if only a\n", 403 | "single input tensor has ``requires_grad=True``.\n", 404 | "\n", 405 | "\n" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 11, 411 | "metadata": { 412 | "collapsed": false 413 | }, 414 | "outputs": [ 415 | { 416 | "output_type": "stream", 417 | "name": "stdout", 418 | "text": [ 419 | "Does `a` require gradients? : False\nDoes `b` require gradients?: True\n" 420 | ] 421 | } 422 | ], 423 | "source": [ 424 | "x = torch.rand(5, 5) \n", 425 | "y = torch.rand(5, 5)\n", 426 | "z = torch.rand((5, 5), requires_grad=True)\n", 427 | "\n", 428 | "a = x + y\n", 429 | "print(f\"Does `a` require gradients? 
: {a.requires_grad}\")\n", 430 | "b = x + z\n", 431 | "print(f\"Does `b` require gradients?: {b.requires_grad}\")" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "In an NN, parameters that don't compute gradients are usually called **frozen parameters**.\n", 439 | "It is useful to \"freeze\" part of your model if you know in advance that you won't need the gradients of those parameters\n", 440 | "(this offers some performance benefits by reducing autograd computations).\n", 441 | "\n", 442 | "Another common use case where exclusion from the DAG is important is\n", 443 | "[finetuning a pretrained network](https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html).\n", 444 | "\n", 445 | "In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels.\n", 446 | "Let's walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters.\n", 447 | "\n" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 12, 453 | "metadata": { 454 | "collapsed": false 455 | }, 456 | "outputs": [], 457 | "source": [ 458 | "from torch import nn, optim\n", 459 | "\n", 460 | "model = torchvision.models.resnet18(pretrained=True)\n", 461 | "\n", 462 | "# Freeze all the parameters in the network\n", 463 | "for param in model.parameters():\n", 464 | " param.requires_grad = False" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "Let's say we want to finetune the model on a new dataset with 10 labels.\n", 472 | "In resnet, the classifier is the last linear layer ``model.fc``.\n", 473 | "We can simply replace it with a new linear layer (unfrozen by default)\n", 474 | "that acts as our classifier.\n", 475 | "\n" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 13, 481 | "metadata": { 482 | 
"collapsed": false 483 | }, 484 | "outputs": [], 485 | "source": [ 486 | "model.fc = nn.Linear(512, 10)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "Now all parameters in the model, except the parameters of ``model.fc``, are frozen.\n", 494 | "The only parameters that compute gradients are the weights and bias of ``model.fc``.\n", 495 | "\n" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 14, 501 | "metadata": { 502 | "collapsed": false 503 | }, 504 | "outputs": [], 505 | "source": [ 506 | "# Optimize only the classifier\n", 507 | "optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "Notice although we register all the parameters in the optimizer,\n", 515 | "the only parameters that are computing gradients (and hence updated in gradient descent)\n", 516 | "are the weights and bias of the classifier.\n", 517 | "\n", 518 | "The same exclusionary functionality is available as a context manager in\n", 519 | "[torch.no_grad()](https://pytorch.org/docs/stable/generated/torch.no_grad.html)\n", 520 | "\n", 521 | "\n" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "--------------\n", 529 | "\n", 530 | "\n" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "## Further readings:\n", 538 | "\n", 539 | "- [In-place operations & Multithreaded Autograd](https://pytorch.org/docs/stable/notes/autograd.html)\n", 540 | "- [Example implementation of reverse-mode autodiff](https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC)\n", 541 | "\n" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": null, 547 | "metadata": {}, 548 | "outputs": [], 549 | "source": [] 550 | } 551 | ], 552 | "metadata": { 553 | "kernelspec": { 554 | "name": 
"python376jvsc74a57bd07779bbc2c1aef615d27866e0f56ee45ae126f5562611ad823fbbdf236dddc76d", 555 | "display_name": "Python 3.7.6 64-bit ('base': conda)" 556 | }, 557 | "language_info": { 558 | "codemirror_mode": { 559 | "name": "ipython", 560 | "version": 3 561 | }, 562 | "file_extension": ".py", 563 | "mimetype": "text/x-python", 564 | "name": "python", 565 | "nbconvert_exporter": "python", 566 | "pygments_lexer": "ipython3", 567 | "version": "3.7.6" 568 | } 569 | }, 570 | "nbformat": 4, 571 | "nbformat_minor": 0 572 | } -------------------------------------------------------------------------------- /dp.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from torch.utils.data import Dataset, DataLoader 4 | 5 | # Parameters and DataLoaders 6 | input_size = 5 7 | output_size = 2 8 | 9 | batch_size = 30 10 | data_size = 100 11 | 12 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 13 | 14 | class RandomDataset(Dataset): 15 | 16 | def __init__(self, size, length): 17 | self.len = length 18 | self.data = torch.randn(length, size) 19 | 20 | def __getitem__(self, index): 21 | return self.data[index] 22 | 23 | def __len__(self): 24 | return self.len 25 | 26 | rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size), 27 | batch_size=batch_size, shuffle=True) 28 | 29 | class Model(nn.Module): 30 | # Our model 31 | 32 | def __init__(self, input_size, output_size): 33 | super(Model, self).__init__() 34 | self.fc = nn.Linear(input_size, output_size) 35 | 36 | def forward(self, input): 37 | output = self.fc(input) 38 | print("\tIn Model: input size", input.size(), 39 | "output size", output.size()) 40 | 41 | return output 42 | 43 | model = Model(input_size, output_size) 44 | if torch.cuda.device_count() > 1: 45 | print("Let's use", torch.cuda.device_count(), "GPUs!") 46 | # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] 
on 3 GPUs 47 | model = nn.DataParallel(model) 48 | 49 | model.to(device) 50 | 51 | for data in rand_loader: 52 | input = data.to(device) 53 | output = model(input) 54 | print("Outside: input size", input.size(), 55 | "output_size", output.size()) 56 | 57 | # Output on 4 GPUs: 58 | """ 59 | Let's use 4 GPUs! 60 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 61 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 62 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 63 | In Model: input size torch.Size([6, 5]) output size torch.Size([6, 2]) 64 | Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2]) 65 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 66 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 67 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 68 | In Model: input size torch.Size([6, 5]) output size torch.Size([6, 2]) 69 | Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2]) 70 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 71 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 72 | In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2]) 73 | In Model: input size torch.Size([6, 5]) output size torch.Size([6, 2]) 74 | Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2]) 75 | In Model: input size torch.Size([3, 5]) output size torch.Size([3, 2]) 76 | In Model: input size torch.Size([3, 5]) output size torch.Size([3, 2]) 77 | In Model: input size torch.Size([3, 5]) output size torch.Size([3, 2]) 78 | In Model: input size torch.Size([1, 5]) output size torch.Size([1, 2]) 79 | Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2]) 80 | """ -------------------------------------------------------------------------------- /neural_networks_tutorial.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | "Source of this notebook: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html" 6 | ], 7 | "cell_type": "markdown", 8 | "metadata": {} 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "\n", 26 | "Neural Networks\n", 27 | "===============\n", 28 | "\n", 29 | "Neural networks can be constructed using the ``torch.nn`` package.\n", 30 | "\n", 31 | "Now that you have had a glimpse of ``autograd``, note that ``nn`` depends on\n", 32 | "``autograd`` to define models and differentiate them.\n", 33 | "An ``nn.Module`` contains layers, and a method ``forward(input)`` that\n", 34 | "returns the ``output``.\n", 35 | "\n", 36 | "(differentiate: to take the derivative)\n", 37 | "\n", 38 | "For example, look at this network that classifies digit images:\n", 39 | "\n", 40 | "![figure](https://pytorch.org/tutorials/_images/mnist.png)\n", 41 | "\n", 42 | " convnet\n", 43 | "\n", 44 | "It is a simple feed-forward network. 
It takes the input, feeds it\n", 45 | "through several layers one after the other, and then finally gives the\n", 46 | "output.\n", 47 | "\n", 48 | "(feed-forward network: a feedforward neural network)\n", 49 | "\n", 50 | "A typical training procedure for a neural network is as follows:\n", 51 | "\n", 52 | "- Define the neural network that has some learnable parameters (or\n", 53 | " weights)\n", 54 | "- Iterate over a dataset of inputs\n", 55 | "- Process input through the network\n", 56 | "- Compute the loss (how far is the output from being correct)\n", 57 | "- Propagate gradients back into the network’s parameters\n", 58 | "- Update the weights of the network, typically using a simple update rule:\n", 59 | " ``weight = weight - learning_rate * gradient``\n", 60 | "\n", 61 | "Define the network\n", 62 | "------------------\n", 63 | "\n", 64 | "Let’s define this network:\n", 65 | "\n" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 2, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [ 75 | { 76 | "output_type": "stream", 77 | "name": "stdout", 78 | "text": [ 79 | "Net(\n (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))\n (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))\n (fc1): Linear(in_features=576, out_features=120, bias=True)\n (fc2): Linear(in_features=120, out_features=84, bias=True)\n (fc3): Linear(in_features=84, out_features=10, bias=True)\n)\n" 80 | ] 81 | } 82 | ], 83 | "source": [ 84 | "import torch\n", 85 | "import torch.nn as nn\n", 86 | "import torch.nn.functional as F\n", 87 | "\n", 88 | "\n", 89 | "class Net(nn.Module):\n", 90 | "\n", 91 | " def __init__(self):\n", 92 | " super(Net, self).__init__()\n", 93 | " # Conv2d applies a 2D convolution over an input signal composed of several image planes\n", 94 | " # 1 input image channel, 6 output channels, 3x3 square convolution kernel\n", 95 | " self.conv1 = nn.Conv2d(1, 6, 3)\n", 96 | " self.conv2 = nn.Conv2d(6, 16, 3)\n", 97 | " # an affine operation: y = Wx + b\n", 98 | " # (affine: a linear map plus a bias term)\n", 99 | " self.fc1 = nn.Linear(16 * 6 
* 6, 120)\n", 100 | " # 16 is the number of output channels of conv2; 6*6 is the spatial size of the feature map\n", 101 | " # (a 32*32 input becomes 6*30*30 after conv1, 6*15*15 after pooling, 16*13*13 after conv2, and 16*6*6 after pooling)\n", 102 | " # the output-size formula for each layer can be found in the documentation of that layer's class\n", 103 | " self.fc2 = nn.Linear(120, 84)\n", 104 | " self.fc3 = nn.Linear(84, 10)\n", 105 | "\n", 106 | " def forward(self, x):\n", 107 | " # Max pooling over a (2, 2) window\n", 108 | " x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))\n", 109 | " # If the size is a square, you can specify with a single number\n", 110 | " x = F.max_pool2d(F.relu(self.conv2(x)), 2)\n", 111 | " x = x.view(-1, self.num_flat_features(x))\n", 112 | " # reshape x into a Tensor of size [-1, self.num_flat_features(x)] with the same elements\n", 113 | " # the size of the -1 dimension is inferred from the other dimension\n", 114 | " # since the other dimension is the total number of features per sample, this step flattens each sample from a 3D tensor into a 1D vector\n", 115 | " x = F.relu(self.fc1(x))\n", 116 | " x = F.relu(self.fc2(x))\n", 117 | " x = self.fc3(x)\n", 118 | " return x\n", 119 | "\n", 120 | " def num_flat_features(self, x):\n", 121 | " # compute the total number of features of x (the product of all non-batch dimensions)\n", 122 | " size = x.size()[1:] # all dimensions except the batch dimension\n", 123 | " num_features = 1\n", 124 | " for s in size:\n", 125 | " num_features *= s\n", 126 | " return num_features\n", 127 | "\n", 128 | "\n", 129 | "net = Net()\n", 130 | "print(net)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "You just have to define the ``forward`` function, and the ``backward``\n", 138 | "function (where gradients are computed) is automatically defined for you\n", 139 | "using ``autograd``.\n", 140 | "You can use any of the Tensor operations in the ``forward`` function.\n", 141 | "\n", 142 | "The learnable parameters of a model are returned by ``net.parameters()``\n", 143 | "\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 3, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [ 153 | { 154 | "output_type": "stream", 155 | "name": "stdout", 156 | "text": [ 157 | "10\ntorch.Size([6, 1, 3, 3])\n" 158 | ] 159 | } 160 | ], 
161 | "source": [ 162 | "params = list(net.parameters())\n", 163 | "print(len(params))\n", 164 | "print(params[0].size()) # conv1's .weight" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Let's try a random 32x32 input.\n", 172 | "Note: expected input size of this net (LeNet) is 32x32. To use this net on\n", 173 | "the MNIST dataset, please resize the images from the dataset to 32x32.\n", 174 | "\n" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 4, 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "outputs": [ 184 | { 185 | "output_type": "stream", 186 | "name": "stdout", 187 | "text": [ 188 | "tensor([[-0.0894, -0.0754, -0.1340, 0.0693, 0.0953, 0.0937, -0.1054, 0.0527,\n 0.0066, 0.0356]], grad_fn=)\n" 189 | ] 190 | } 191 | ], 192 | "source": [ 193 | "input = torch.randn(1, 1, 32, 32)\n", 194 | "out = net(input)\n", 195 | "print(out)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "Zero the gradient buffers of all parameters and backprops with random\n", 203 | "gradients:\n", 204 | "\n" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 5, 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "net.zero_grad()\n", 216 | "out.backward(torch.randn(1, 10))" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "### Note\n", 224 | "``torch.nn`` only supports mini-batches. 
The entire ``torch.nn``\n", 225 | " package only supports inputs that are a mini-batch of samples, and not\n", 226 | " a single sample.\n", 227 | "\n", 228 | "For example, ``nn.Conv2d`` will take in a 4D Tensor of\n", 229 | "``nSamples x nChannels x Height x Width``.\n", 230 | "\n", 231 | "If you have a single sample, just use ``input.unsqueeze(0)`` to add\n", 232 | "a fake batch dimension.\n", 233 | "\n", 234 | "Before proceeding further, let's recap all the classes you’ve seen so far.\n", 235 | "\n", 236 | "**Recap:**\n", 237 | " - ``torch.Tensor`` - A *multi-dimensional array* with support for autograd\n", 238 | " operations like ``backward()``. Also *holds the gradient* w.r.t. the\n", 239 | " tensor.\n", 240 | " - ``nn.Module`` - Neural network module. *Convenient way of\n", 241 | " encapsulating parameters*, with helpers for moving them to GPU,\n", 242 | " exporting, loading, etc.\n", 243 | " - ``nn.Parameter`` - A kind of Tensor, that is *automatically\n", 244 | " registered as a parameter when assigned as an attribute to a*\n", 245 | " ``Module``.\n", 246 | " - ``autograd.Function`` - Implements *forward and backward definitions\n", 247 | " of an autograd operation*. 
Every ``Tensor`` operation creates at\n", 248 | " least a single ``Function`` node that connects to functions that\n", 249 | " created a ``Tensor`` and *encodes its history*.\n", 250 | "\n", 251 | "**At this point, we covered:**\n", 252 | " - Defining a neural network\n", 253 | " - Processing inputs and calling backward\n", 254 | "\n", 255 | "**Still Left:**\n", 256 | " - Computing the loss\n", 257 | " - Updating the weights of the network\n", 258 | "\n", 259 | "## Loss Function\n", 260 | "A loss function takes the (output, target) pair of inputs, and computes a\n", 261 | "value that estimates how far away the output is from the target.\n", 262 | "\n", 263 | "There are several different\n", 264 | "[loss functions](https://pytorch.org/docs/nn.html#loss-functions) under the\n", 265 | "nn package .\n", 266 | "A simple loss is: ``nn.MSELoss`` which computes the mean-squared error\n", 267 | "between the input and the target.\n", 268 | "\n", 269 | "For example:\n", 270 | "\n" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 7, 276 | "metadata": { 277 | "collapsed": false 278 | }, 279 | "outputs": [ 280 | { 281 | "output_type": "stream", 282 | "name": "stdout", 283 | "text": [ 284 | "tensor(0.7491, grad_fn=)\n" 285 | ] 286 | } 287 | ], 288 | "source": [ 289 | "output = net(input)\n", 290 | "target = torch.randn(10) # a dummy target, for example\n", 291 | "target = target.view(1, -1) # make it the same shape as output\n", 292 | "criterion = nn.MSELoss()\n", 293 | "\n", 294 | "loss = criterion(output, target)\n", 295 | "print(loss)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "Now, if you follow ``loss`` in the backward direction, using its\n", 303 | "``.grad_fn`` attribute, you will see a graph of computations that looks\n", 304 | "like this:\n", 305 | "\n", 306 | "\n", 307 | "\n", 308 | " input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d\n", 309 | " -> view -> linear -> 
relu -> linear -> relu -> linear\n", 310 | " -> MSELoss\n", 311 | " -> loss\n", 312 | "\n", 313 | "So, when we call ``loss.backward()``, the whole graph is differentiated\n", 314 | "w.r.t. the loss, and all Tensors in the graph that have ``requires_grad=True``\n", 315 | "will have their ``.grad`` Tensor accumulated with the gradient.\n", 316 | "\n", 317 | "For illustration, let us follow a few steps backward:\n", 318 | "\n" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 8, 324 | "metadata": { 325 | "collapsed": false 326 | }, 327 | "outputs": [ 328 | { 329 | "output_type": "stream", 330 | "name": "stdout", 331 | "text": [ 332 | "\n\n\n" 333 | ] 334 | } 335 | ], 336 | "source": [ 337 | "print(loss.grad_fn) # MSELoss\n", 338 | "print(loss.grad_fn.next_functions[0][0]) # Linear\n", 339 | "print(loss.grad_fn.next_functions[0][0].next_functions[0][0]) # ReLU" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "Backprop\n", 347 | "--------\n", 348 | "To backpropagate the error all we have to do is to ``loss.backward()``.\n", 349 | "You need to clear the existing gradients though, else gradients will be\n", 350 | "accumulated to existing gradients.\n", 351 | "\n", 352 | "\n", 353 | "Now we shall call ``loss.backward()``, and have a look at conv1's bias\n", 354 | "gradients before and after the backward.\n", 355 | "\n" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 9, 361 | "metadata": { 362 | "collapsed": false 363 | }, 364 | "outputs": [ 365 | { 366 | "output_type": "stream", 367 | "name": "stdout", 368 | "text": [ 369 | "conv1.bias.grad before backward\ntensor([0., 0., 0., 0., 0., 0.])\nconv1.bias.grad after backward\ntensor([ 0.0095, -0.0128, 0.0023, 0.0062, -0.0013, -0.0055])\n" 370 | ] 371 | } 372 | ], 373 | "source": [ 374 | "net.zero_grad() # zeroes the gradient buffers of all parameters\n", 375 | "\n", 376 | "print('conv1.bias.grad before backward')\n", 
377 | "print(net.conv1.bias.grad)\n", 378 | "\n", 379 | "loss.backward()\n", 380 | "\n", 381 | "print('conv1.bias.grad after backward')\n", 382 | "print(net.conv1.bias.grad)" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "Now, we have seen how to use loss functions.\n", 390 | "\n", 391 | "**Read Later:**\n", 392 | "\n", 393 | " The neural network package contains various modules and loss functions\n", 394 | " that form the building blocks of deep neural networks. A full list with\n", 395 | " documentation is [here](https://pytorch.org/docs/nn).\n", 396 | "\n", 397 | "**The only thing left to learn is:**\n", 398 | "\n", 399 | " - Updating the weights of the network\n", 400 | "\n", 401 | "## Update the weights\n", 402 | "\n", 403 | "The simplest update rule used in practice is the Stochastic Gradient\n", 404 | "Descent (SGD):\n", 405 | "\n", 406 | "$weight = weight - learning\\_rate * gradient$\n", 407 | "\n", 408 | "We can implement this using simple Python code:\n", 409 | "\n", 410 | "```python\n", 411 | " learning_rate = 0.01\n", 412 | " for f in net.parameters():\n", 413 | " f.data.sub_(f.grad.data * learning_rate)\n", 414 | "```\n", 415 | "\n", 416 | "However, as you use neural networks, you want to use various different\n", 417 | "update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.\n", 418 | "To enable this, we built a small package: ``torch.optim`` that\n", 419 | "implements all these methods. 
Using it is very simple:\n", 420 | "\n" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 10, 426 | "metadata": { 427 | "collapsed": false 428 | }, 429 | "outputs": [], 430 | "source": [ 431 | "import torch.optim as optim\n", 432 | "\n", 433 | "# create your optimizer\n", 434 | "optimizer = optim.SGD(net.parameters(), lr=0.01)\n", 435 | "\n", 436 | "# in your training loop:\n", 437 | "optimizer.zero_grad() # zero the gradient buffers\n", 438 | "output = net(input)\n", 439 | "loss = criterion(output, target)\n", 440 | "loss.backward()\n", 441 | "optimizer.step() # Does the update" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "Note:\n", 449 | "\n", 450 | "Observe how gradient buffers had to be manually set to zero using\n", 451 | "``optimizer.zero_grad()``. This is because gradients are accumulated\n", 452 | "as explained in the `Backprop` section.\n", 453 | "\n" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [] 462 | } 463 | ], 464 | "metadata": { 465 | "kernelspec": { 466 | "name": "python376jvsc74a57bd07779bbc2c1aef615d27866e0f56ee45ae126f5562611ad823fbbdf236dddc76d", 467 | "display_name": "Python 3.7.6 64-bit ('base': conda)" 468 | }, 469 | "language_info": { 470 | "codemirror_mode": { 471 | "name": "ipython", 472 | "version": 3 473 | }, 474 | "file_extension": ".py", 475 | "mimetype": "text/x-python", 476 | "name": "python", 477 | "nbconvert_exporter": "python", 478 | "pygments_lexer": "ipython3", 479 | "version": "3.7.6" 480 | } 481 | }, 482 | "nbformat": 4, 483 | "nbformat_minor": 0 484 | } -------------------------------------------------------------------------------- /tensor_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | 
"Source of this notebook: https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html" 6 | ], 7 | "cell_type": "markdown", 8 | "metadata": {} 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "\n", 26 | "Tensors\n", 27 | "--------------------------------------------\n", 28 | "\n", 29 | "Tensors are a specialized data structure that is very similar to arrays\n", 30 | "and matrices. In PyTorch, we use tensors to encode the inputs and\n", 31 | "outputs of a model, as well as the model’s parameters.\n", 32 | "\n", 33 | "Tensors are similar to NumPy’s ndarrays, except that tensors can run on\n", 34 | "GPUs or other specialized hardware to accelerate computing. If you’re familiar with ndarrays, you’ll\n", 35 | "be right at home with the Tensor API. If not, follow along in this quick\n", 36 | "API walkthrough.\n", 37 | "\n", 38 | "\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": { 45 | "collapsed": false 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "import torch\n", 50 | "import numpy as np" 51 | ] 52 | }, 53 | { 54 | "source": [ 55 | "## Tensor Initialization\n", 56 | "\n", 57 | "Tensors can be initialized in various ways. Take a look at the following examples:\n", 58 | "\n", 59 | "**Directly from data**\n", 60 | "\n", 61 | "Tensors can be created directly from data. The data type is automatically inferred." 
62 | ], 63 | "cell_type": "markdown", 64 | "metadata": {} 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": { 70 | "collapsed": false 71 | }, 72 | "outputs": [ 73 | { 74 | "output_type": "execute_result", 75 | "data": { 76 | "text/plain": [ 77 | "tensor([[1, 2],\n", 78 | " [3, 4]])" 79 | ] 80 | }, 81 | "metadata": {}, 82 | "execution_count": 4 83 | } 84 | ], 85 | "source": [ 86 | "data = [[1, 2],[3, 4]]\n", 87 | "x_data = torch.tensor(data)\n", 88 | "x_data" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "**From a NumPy array**\n", 96 | "\n", 97 | "Tensors can be created from NumPy arrays (and vice versa - see [Bridge with NumPy](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#bridge-to-np-label)).\n", 98 | "\n" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 7, 104 | "metadata": { 105 | "collapsed": false 106 | }, 107 | "outputs": [ 108 | { 109 | "output_type": "execute_result", 110 | "data": { 111 | "text/plain": [ 112 | "tensor([[1, 2],\n", 113 | " [3, 4]], dtype=torch.int32)" 114 | ] 115 | }, 116 | "metadata": {}, 117 | "execution_count": 7 118 | } 119 | ], 120 | "source": [ 121 | "np_array = np.array(data)\n", 122 | "x_np = torch.from_numpy(np_array)\n", 123 | "x_np" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "**From another tensor:**\n", 131 | "\n", 132 | "The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden.\n", 133 | "\n" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 7, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [ 143 | { 144 | "output_type": "stream", 145 | "name": "stdout", 146 | "text": [ 147 | "Ones Tensor: \n tensor([[1, 1],\n [1, 1]]) \n\nRandom Tensor: \n tensor([[0.7270, 0.7228],\n [0.9934, 0.3336]]) \n\n" 148 | ] 149 | } 150 | ], 151 | "source": [ 152 
| "x_ones = torch.ones_like(x_data) # retains the properties of x_data\n", 153 | "print(f\"Ones Tensor: \\n {x_ones} \\n\")\n", 154 | "\n", 155 | "x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data\n", 156 | "print(f\"Random Tensor: \\n {x_rand} \\n\")" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "**With random or constant values:**\n", 164 | "\n", 165 | "``shape`` is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor.\n", 166 | "\n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 8, 172 | "metadata": { 173 | "collapsed": false 174 | }, 175 | "outputs": [ 176 | { 177 | "output_type": "stream", 178 | "name": "stdout", 179 | "text": [ 180 | "Random Tensor: \n tensor([[0.5604, 0.7527, 0.6460],\n [0.1415, 0.4168, 0.1672]]) \n\nOnes Tensor: \n tensor([[1., 1., 1.],\n [1., 1., 1.]]) \n\nZeros Tensor: \n tensor([[0., 0., 0.],\n [0., 0., 0.]])\n" 181 | ] 182 | } 183 | ], 184 | "source": [ 185 | "shape = (2,3,)\n", 186 | "rand_tensor = torch.rand(shape)\n", 187 | "ones_tensor = torch.ones(shape)\n", 188 | "zeros_tensor = torch.zeros(shape)\n", 189 | "\n", 190 | "print(f\"Random Tensor: \\n {rand_tensor} \\n\")\n", 191 | "print(f\"Ones Tensor: \\n {ones_tensor} \\n\")\n", 192 | "print(f\"Zeros Tensor: \\n {zeros_tensor}\")" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "--------------\n", 200 | "\n", 201 | "\n" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "## Tensor Attributes\n", 209 | "\n", 210 | "Tensor attributes describe their shape, datatype, and the device on which they are stored.\n", 211 | "\n" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 9, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [ 221 | { 222 | "output_type": 
"stream", 223 | "name": "stdout", 224 | "text": [ 225 | "Shape of tensor: torch.Size([3, 4])\nDatatype of tensor: torch.float32\nDevice tensor is stored on: cpu\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "tensor = torch.rand(3,4)\n", 231 | "\n", 232 | "print(f\"Shape of tensor: {tensor.shape}\")\n", 233 | "print(f\"Datatype of tensor: {tensor.dtype}\")\n", 234 | "print(f\"Device tensor is stored on: {tensor.device}\")" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "--------------\n", 242 | "\n", 243 | "\n" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "## Tensor Operations\n", 251 | "\n", 252 | "Over 100 tensor operations, including transposing, indexing, slicing,\n", 253 | "mathematical operations, linear algebra, random sampling, and more are\n", 254 | "comprehensively described\n", 255 | "[here](https://pytorch.org/docs/stable/torch.html).\n", 256 | "\n", 257 | "Each of them can be run on the GPU (at typically higher speeds than on a\n", 258 | "CPU). 
If you’re using Colab, allocate a GPU by going to Edit > Notebook\n", 259 | "Settings.\n", 260 | "\n", 261 | "\n" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 10, 267 | "metadata": { 268 | "collapsed": false 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "# We move our tensor to the GPU if available\n", 273 | "if torch.cuda.is_available():\n", 274 | " tensor = tensor.to('cuda')" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "Try out some of the operations from the list.\n", 282 | "If you're familiar with the NumPy API, you'll find the Tensor API a breeze to use.\n", 283 | "\n", 284 | "\n" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "**Standard numpy-like indexing and slicing:**\n", 292 | "\n" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 12, 298 | "metadata": { 299 | "collapsed": false 300 | }, 301 | "outputs": [ 302 | { 303 | "output_type": "stream", 304 | "name": "stdout", 305 | "text": [ 306 | "tensor([[1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.]])\n" 307 | ] 308 | } 309 | ], 310 | "source": [ 311 | "tensor = torch.ones(4, 4)\n", 312 | "tensor[:,1] = 0\n", 313 | "print(tensor)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "**Joining tensors** You can use ``torch.cat`` to concatenate a sequence of tensors along a given dimension.\n", 321 | "See also [torch.stack](https://pytorch.org/docs/stable/generated/torch.stack.html),\n", 322 | "another tensor joining op that is subtly different from ``torch.cat``.\n", 323 | "\n" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 13, 329 | "metadata": { 330 | "collapsed": false 331 | }, 332 | "outputs": [ 333 | { 334 | "output_type": "stream", 335 | "name": "stdout", 336 | "text": [ 337 | "tensor([[1., 0., 1., 1., 1., 0., 1., 1., 
1., 0., 1., 1.],\n [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],\n [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],\n [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.]])\n" 338 | ] 339 | } 340 | ], 341 | "source": [ 342 | "t1 = torch.cat([tensor, tensor, tensor], dim=1)\n", 343 | "print(t1)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "**Multiplying tensors**\n", 351 | "\n" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 14, 357 | "metadata": { 358 | "collapsed": false 359 | }, 360 | "outputs": [ 361 | { 362 | "output_type": "stream", 363 | "name": "stdout", 364 | "text": [ 365 | "tensor.mul(tensor) \n tensor([[1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.]]) \n\ntensor * tensor \n tensor([[1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.]])\n" 366 | ] 367 | } 368 | ], 369 | "source": [ 370 | "# This computes the element-wise product\n", 371 | "print(f\"tensor.mul(tensor) \\n {tensor.mul(tensor)} \\n\")\n", 372 | "# Alternative syntax:\n", 373 | "print(f\"tensor * tensor \\n {tensor * tensor}\")" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "This computes the matrix multiplication between two tensors\n", 381 | "\n" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 15, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "output_type": "stream", 393 | "name": "stdout", 394 | "text": [ 395 | "tensor.matmul(tensor.T) \n tensor([[3., 3., 3., 3.],\n [3., 3., 3., 3.],\n [3., 3., 3., 3.],\n [3., 3., 3., 3.]]) \n\ntensor @ tensor.T \n tensor([[3., 3., 3., 3.],\n [3., 3., 3., 3.],\n [3., 3., 3., 3.],\n [3., 3., 3., 3.]])\n" 396 | ] 397 | } 398 | ], 399 | "source": [ 400 | "print(f\"tensor.matmul(tensor.T) \\n {tensor.matmul(tensor.T)} \\n\")\n", 401 | "#https://pytorch.org/docs/stable/generated/torch.matmul.html 
I have a feeling I will need this function again later\n", 402 | "# Alternative syntax:\n", 403 | "print(f\"tensor @ tensor.T \\n {tensor @ tensor.T}\")" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "**In-place operations**\n", 411 | "Operations that have a ``_`` suffix are in-place. For example, ``x.copy_(y)`` and ``x.t_()`` will change ``x``.\n", 412 | "\n", 413 | "in-place: the operation modifies the tensor itself instead of returning a new one" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 16, 419 | "metadata": { 420 | "collapsed": false 421 | }, 422 | "outputs": [ 423 | { 424 | "output_type": "stream", 425 | "name": "stdout", 426 | "text": [ 427 | "tensor([[1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.],\n [1., 0., 1., 1.]]) \n\ntensor([[6., 5., 6., 6.],\n [6., 5., 6., 6.],\n [6., 5., 6., 6.],\n [6., 5., 6., 6.]])\n" 428 | ] 429 | } 430 | ], 431 | "source": [ 432 | "print(tensor, \"\\n\")\n", 433 | "tensor.add_(5)\n", 434 | "print(tensor)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "

**Note:** In-place operations save some memory, but can be problematic when computing derivatives because of an immediate loss\n", 442 | " of history. Hence, their use is discouraged.
\n", 443 | "\n" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "--------------\n", 451 | "\n", 452 | "\n" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "\n", 460 | "## Bridge with NumPy\n", 461 | "\n", 462 | "Tensors on the CPU and NumPy arrays can share their underlying memory\n", 463 | "locations, and changing one will change the other.\n", 464 | "\n" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "## Tensor to NumPy array\n" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 17, 477 | "metadata": { 478 | "collapsed": false 479 | }, 480 | "outputs": [ 481 | { 482 | "output_type": "stream", 483 | "name": "stdout", 484 | "text": [ 485 | "t: tensor([1., 1., 1., 1., 1.])\nn: [1. 1. 1. 1. 1.]\n" 486 | ] 487 | } 488 | ], 489 | "source": [ 490 | "t = torch.ones(5)\n", 491 | "print(f\"t: {t}\")\n", 492 | "n = t.numpy()\n", 493 | "print(f\"n: {n}\")" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "A change in the tensor is reflected in the NumPy array.\n", 501 | "\n" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 18, 507 | "metadata": { 508 | "collapsed": false 509 | }, 510 | "outputs": [ 511 | { 512 | "output_type": "stream", 513 | "name": "stdout", 514 | "text": [ 515 | "t: tensor([2., 2., 2., 2., 2.])\nn: [2. 2. 2. 2. 
2.]\n" 516 | ] 517 | } 518 | ], 519 | "source": [ 520 | "t.add_(1)\n", 521 | "print(f\"t: {t}\")\n", 522 | "print(f\"n: {n}\")" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "## NumPy array to Tensor" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": 19, 535 | "metadata": { 536 | "collapsed": false 537 | }, 538 | "outputs": [], 539 | "source": [ 540 | "n = np.ones(5)\n", 541 | "t = torch.from_numpy(n)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "Changes in the NumPy array are reflected in the tensor.\n", 549 | "\n" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": 20, 555 | "metadata": { 556 | "collapsed": false 557 | }, 558 | "outputs": [ 559 | { 560 | "output_type": "stream", 561 | "name": "stdout", 562 | "text": [ 563 | "t: tensor([2., 2., 2., 2., 2.], dtype=torch.float64)\nn: [2. 2. 2. 2. 2.]\n" 564 | ] 565 | } 566 | ], 567 | "source": [ 568 | "np.add(n, 1, out=n)\n", 569 | "print(f\"t: {t}\")\n", 570 | "print(f\"n: {n}\")" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [] 579 | } 580 | ], 581 | "metadata": { 582 | "kernelspec": { 583 | "name": "python376jvsc74a57bd07779bbc2c1aef615d27866e0f56ee45ae126f5562611ad823fbbdf236dddc76d", 584 | "display_name": "Python 3.7.6 64-bit ('base': conda)" 585 | }, 586 | "language_info": { 587 | "codemirror_mode": { 588 | "name": "ipython", 589 | "version": 3 590 | }, 591 | "file_extension": ".py", 592 | "mimetype": "text/x-python", 593 | "name": "python", 594 | "nbconvert_exporter": "python", 595 | "pygments_lexer": "ipython3", 596 | "version": "3.7.6" 597 | } 598 | }, 599 | "nbformat": 4, 600 | "nbformat_minor": 0 601 | } --------------------------------------------------------------------------------