├── .gitignore
├── .pre-commit-config.yaml
├── 00_intro.ipynb
├── 01a_fractal_cupy.ipynb
├── 01b_fractal_numba.ipynb
├── 02a_nll_cupy.ipynb
├── 02b_nll_tensorflow.ipynb
├── 02c_nll_torch.ipynb
├── 02x_torch_autograd.ipynb
├── 03_nll.ipynb
├── 04_ode.ipynb
├── ExampleRunner.ipynb
├── ExampleRunnerExample.ipynb
├── README.md
├── environment.yml
├── images
│   ├── ButtonToClick.png
│   ├── HeaderBar.png
│   ├── LanguageInterest.png
│   ├── LibraryInterest.png
│   └── SetupPage.png
├── interactive
│   ├── MinicondaInstallNotes.txt
│   ├── course.pygpu.default
│   ├── default
│   ├── environment.yml
│   └── interactive.sbatch
└── sbatch_magic.py

/.gitignore:
--------------------------------------------------------------------------------
1 | *nbconvert*
2 | tmp.sbatch
3 | /*.sbatch
4 | *.out
5 | __pycache__
6 | *.html
7 | .ipynb_checkpoints
--------------------------------------------------------------------------------
/.pre-commit-config.yaml:
--------------------------------------------------------------------------------
1 | repos:
2 |   - repo: https://github.com/pre-commit/pre-commit-hooks
3 |     rev: "v4.2.0"
4 |     hooks:
5 |       - id: check-added-large-files
6 |       - id: check-case-conflict
7 |       - id: check-merge-conflict
8 |       - id: check-symlinks
9 |       - id: check-yaml
10 |       - id: debug-statements
11 |       - id: end-of-file-fixer
12 |       - id: mixed-line-ending
13 |       - id: requirements-txt-fixer
14 |       - id: trailing-whitespace
15 |
16 |   - repo: https://github.com/psf/black
17 |     rev: "22.3.0"
18 |     hooks:
19 |       - id: black-jupyter
20 |
21 |   - repo: https://github.com/kynan/nbstripout
22 |     rev: "0.5.0"
23 |     hooks:
24 |       - id: nbstripout
25 |
26 |   - repo: https://github.com/codespell-project/codespell
27 |     rev: "v2.1.0"
28 |     hooks:
29 |       - id: codespell
30 |         args: ["-L", "hist"]
31 |
32 |   - repo: https://github.com/pre-commit/mirrors-prettier
33 |     rev: "v2.6.2"
34 |     hooks:
35 |       - id: prettier
36 |         types_or: [yaml, markdown]
37 |
38 |   - repo: local
39 |     hooks:
40 |       - id: disallow-caps
41 |         name: Disallow improper capitalization
42 |         language: pygrep
43 |         entry: PyBind|Numpy|Cmake|CCache|Github|PyTest
44 |         exclude: .pre-commit-config.yaml
--------------------------------------------------------------------------------
/00_intro.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# High Performance Python: GPUs\n",
8 | "## Henry Schreiner\n",
9 | "\n",
10 | "05-03-2022\n",
11 | "\n",
12 | "Survey: TBD\n",
13 | "\n",
14 | "Useful links:\n",
15 | "* [High Performance Python: CPUs](https://github.com/henryiii/python-performance-minicourse)\n",
16 | "* [Compiled Python](https://github.com/henryiii/python-compiled-minicourse)\n",
17 | "* [iscinumpy.dev](https://iscinumpy.dev)\n",
18 | "* [CompClass](https://github.com/henryiii/compclass)\n",
19 | "* [Level Up Your Python](https://henryiii.github.io/level-up-your-python)"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "## Intro to GPUs\n",
27 | "\n",
28 | "GPUs are \"graphics processing units\" designed to compute pixels on a screen. The massively parallel design can be useful for general purpose computing;\n",
29 | "GPU companies started providing ways to use GPUs as \"GPGPU\"s, general purpose GPUs."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Computing platform\n",
37 | "\n",
38 | "We will be using Conda, mostly through the conda-forge channel, which gained support for proper CUDA libraries. We are getting PyTorch from the pytorch channel, and TensorFlow 2 snuck in as well. Pip support for ML libraries is not too bad, either (both are rapidly improving).\n",
39 | "\n",
40 | "We will be using Python 3.9 because it was the default in the installer, though all of these libraries now work with Python 3.10. Numba, PyTorch, and TensorFlow are usually slow to support newer Python versions (3-5 months), but we are past that now for 3.10."
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## Languages/platforms\n",
48 | "\n",
49 | "For differences in terminology, the ROCm page is quite good.\n",
50 | "\n",
51 | "![Language interest](images/LanguageInterest.png)"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "#### CUDA\n",
59 | "\n",
60 | "The leader of the pack is easily NVIDIA; they were first into the fray with the CUDA language, and they easily lead for scientific computation.\n",
61 | "\n",
62 | "\n",
63 | "* Wildly popular\n",
64 | "* NVIDIA only\n",
65 | "* A C++-like language, single source (with JIT option)\n",
66 | "\n",
67 | "#### OpenCL\n",
68 | "\n",
69 | "AMD was late to the game, and tried to support an open standard, OpenCL - but poor support from the other players caused it to be almost AMD exclusive.\n",
70 | "Apple released Metal as a replacement for OpenGL & OpenCL; they have worked with Intel & AMD on it. They are dropping their (almost non-existent) support for OpenCL.\n",
71 | "The Khronos Group (which works on OpenGL/CL) has released a successor, Vulkan, but it mostly focuses on graphics (OpenGL) at the moment.\n",
72 | "Intel is planning to drop OpenCL in 2-3 years, too.\n",
73 | "\n",
74 | "* Works on most platforms\n",
75 | "* Most platforms have buggy, older support except AMD\n",
76 | "* Only a JIT-like option\n",
77 | "* Also supports other compute backends, like FPGAs and CPUs\n",
78 | "\n",
79 | "\n",
80 | "\n",
81 | "#### ROCm\n",
82 | "\n",
83 | "* AMD only\n",
84 | "* Open, interacts with others at various levels\n",
85 | "\n",
86 | "ROCm is AMD's custom platform.\n",
87 | "\n",
88 | "#### SYCL\n",
89 | "\n",
90 | "SYCL was a CUDA-like single source language built on OpenCL, and is now part of Intel's oneAPI plan.\n",
91 | "\n",
92 | "\n",
93 | "#### OpenMP\n",
94 | "\n",
95 | "OpenMP now has tools to target GPUs, but it can be tricky to program (especially if you expect to write the same code to run in multiple places). There's also OpenACC."
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "Today we will focus on CUDA, since it has good Python support and is the current lingua franca for scientific computing. OpenCL is not as popular, but has some Python libraries. ROCm has recently been showing up in Numba and CuPy."
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "# Libraries\n",
110 | "\n",
111 | "In the CPU class, we covered several libraries, but Numba was a clear standout in terms of high performance and ease of use. In GPU computing, the landscape is still quite varied.
It's much harder to select a clear winner; each has features and drawbacks.\n",
112 | "\n",
113 | "![Library interest](images/LibraryInterest.png)\n",
114 | "\n",
115 | "Note that this is dominated by ML."
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "## CuPy\n",
123 | "\n",
124 | "This was designed for the ML framework Chainer, but it is becoming quite popular on its own.\n",
125 | "\n",
126 | "* *Very* close to (often a drop-in replacement for) NumPy\n",
127 | "* Custom kernel support, including element-wise and reduction kernels\n",
128 | "  * Written in CUDA\n",
129 | "* Fusion support (although limited)\n",
130 | "* Strong development\n",
131 | "* Experimental ROCm support\n",
132 | "* Supports Numba's GPU array interface"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "## Numba\n",
140 | "\n",
141 | "This comes up again, since it has a GPU mode too!\n",
142 | "\n",
143 | "* Powerful but limited vectorize (elementwise ufunc)\n",
144 | "* Full kernel mode, but hand launched\n",
145 | "  * Written in a Python subset\n",
146 | "* Device function support\n",
147 | "* ~~New ROCm mode, but different terms (removed)~~\n",
148 | "* Developed the GPU array interface"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "## PyTorch\n",
156 | "\n",
157 | "This is Facebook's ML library.\n",
158 | "\n",
159 | "* NumPy-like\n",
160 | "* Has tape-based gradient support\n",
161 | "* Has a fusion mode (TorchScript), can support multiple languages\n",
162 | "* Hard to make custom kernels\n",
163 | "* Supports Numba's GPU array interface\n",
164 | "* *Great* tutorials"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "## TensorFlow\n",
172 | "\n",
173 | "This is Google's ML library.\n",
174 | "\n",
175 | "* New API is similar to PyTorch\n",
176 | "* Fusion mode builds a graph, initially slower than API 1\n",
177 | "* API 1 was very fast, but hard to *set up* (computations easy, though)\n",
178 | "* Hard to make custom kernels\n",
179 | "* You're lucky to get NumPy's array interface support; no GPU array interface (yet?)\n",
180 | "* Multiple language backends, including Swift"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "## CUDA Python\n",
188 | "\n",
189 | "This is a new library by NVIDIA to support Python + CUDA. Still quite new, and really targeting library authors (like CuPy) to simplify and standardize the work they have to do."
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "# GPU basics\n",
197 | "\n",
198 | "GPU programming has several characteristics:"
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "### Memory\n",
206 | "\n",
207 | "GPU memory is separate from main memory, and the transfer cost is high. You will constantly be thinking about *where* your memory is, and how to reduce the transfer of that memory between host and device.\n",
208 | "\n",
209 | "Note that there are techniques, like pinned/unified memory, that can hide this from the programmer to an extent.\n",
210 | "\n",
211 | "Also there are several types of memory, going from local to global, along with specialized memories like constant and texture.\n",
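"\n",
"As a minimal sketch of why this matters (assuming CuPy and a visible GPU), compare keeping a computation on the device with forcing an early round trip:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import cupy as cp\n",
"\n",
"host = np.random.rand(10_000_000)  # lives in host (CPU) memory\n",
"dev = cp.asarray(host)  # one host-to-device transfer\n",
"\n",
"good = cp.sqrt(dev**2 + 1).sum()  # everything stays on the GPU\n",
"bad = cp.sqrt(dev**2 + 1).get().sum()  # .get() copies back early; the sum runs on the CPU\n",
"```"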
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "### Parallel computation\n",
219 | "\n",
220 | "It is tempting to think of GPU threads as CPU threads, but they are much more like vector registers. A GPU computes a \"warp\" at a time (32 threads); each thread does the *same* computation. So, for example, how many times will the following code run:\n",
221 | "\n",
222 | "```python\n",
223 | "if x < 0:\n",
224 | "    x = 0\n",
225 | "else:\n",
226 | "    x = x\n",
227 | "```\n",
228 | "\n",
229 | "This will first compute `x < 0` and create a mask. It will then run `x = 0` with some threads masked, then `x = x` with the other threads masked!"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "### Synchronization\n",
237 | "\n",
238 | "GPUs can operate on streams (somewhat like threads in CPU programming). You can give one stream commands to work on while the other stream is loading data. However, this means a lot of commands are asynchronous; that is, they return immediately and just schedule work to be done, rather than waiting until the work is done to return. If you are using the results, this is fine (things wait properly), but if you are timing runs, you should add a \"synchronize\" step to make sure the work is done."
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "### Caching (and other smart CPU things)\n",
246 | "\n",
247 | "GPUs are not as smart as CPUs, and cannot do as much branch prediction and caching as a CPU can."
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "Other parallel concepts, like atomics, still apply for GPUs as well."
255 | ]
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "## Multiple GPUs\n",
262 | "\n",
263 | "You may have multiple GPUs connected to a single CPU system. Most GPU libraries have a context system that lets you switch between the GPUs, but it's usually another thing you have to program for.\n",
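"\n",
"For example, CuPy exposes the context as a device object you can switch with a `with` block (a sketch, assuming at least two GPUs are visible):\n",
"\n",
"```python\n",
"import cupy as cp\n",
"\n",
"with cp.cuda.Device(0):\n",
"    a = cp.arange(10)  # allocated on GPU 0\n",
"\n",
"with cp.cuda.Device(1):\n",
"    b = cp.arange(10)  # allocated on GPU 1\n",
"\n",
"# Mixing arrays from different devices generally requires an explicit copy first\n",
"```"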
264 | ] 265 | } 266 | ], 267 | "metadata": { 268 | "kernelspec": { 269 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 270 | "language": "python", 271 | "name": "sys_pygpu201912" 272 | }, 273 | "language_info": { 274 | "codemirror_mode": { 275 | "name": "ipython", 276 | "version": 3 277 | }, 278 | "file_extension": ".py", 279 | "mimetype": "text/x-python", 280 | "name": "python", 281 | "nbconvert_exporter": "python", 282 | "pygments_lexer": "ipython3", 283 | "version": "3.9.7" 284 | } 285 | }, 286 | "nbformat": 4, 287 | "nbformat_minor": 4 288 | } 289 | -------------------------------------------------------------------------------- /01a_fractal_cupy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "height = 2_000\n", 10 | "width = 3_000\n", 11 | "maxiterations = 20" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import numpy as np\n", 21 | "import numba\n", 22 | "import math\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "import cupy as cp\n", 25 | "import cupyx" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Let's start by checking to see which GPU we have, using a shell command (sadly CuPy does not seem to be able to query names; it is currently limited to numerical attributes):\n" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "!nvidia-smi" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "You will either have a V100 or A100. Performance will vary by device." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "# Mandelbrot Fractal\n", 56 | "\n", 57 | "From the CPU course, we had the Mandelbrot fractal, which we will be covering today as well." 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "You can generate a Mandelbrot fractal by applying the transform:\n", 65 | "\n", 66 | "$$\n", 67 | "z_{n+1}=z_{n}^{2}+c\n", 68 | "$$\n", 69 | "\n", 70 | "repeatedly to a regular matrix of complex numbers $c$, and recording the iteration number where the value $|z|$ surpassed some bound $N$, usually $N=2$. 
You start at $z_0 = c$.\n", 71 | "\n", 72 | "\n", 73 | "\n", 74 | "Let's set up some initial parameters and a helper matrix:" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "def prepare(height, width, xp=np):\n", 84 | " x, y = xp.ogrid[-1.5j : 1.5j : height * 1j, -2 : 2 : width * 1j]\n", 85 | " c = x + y\n", 86 | " fractal = xp.zeros(c.shape, dtype=xp.int32)\n", 87 | " return c, fractal" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "## NumPy\n", 95 | "\n", 96 | "Let's try a NumPy run (we will use `%%time` instead of `%%timeit`, since this takes several seconds to run so we don't need a precision measurement and don't want to waste time):" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "def fractal_x(c, f, maxiterations):\n", 106 | " xp = cp.get_array_module(c)\n", 107 | " f *= 0 # set to 0\n", 108 | " z = c.copy()\n", 109 | "\n", 110 | " for i in range(1, maxiterations + 1):\n", 111 | " z = z**2 + c # Compute z\n", 112 | " diverge = xp.abs(z**2) > 2**2 # Divergence criteria\n", 113 | "\n", 114 | " z[diverge] = 2 # Keep number size small\n", 115 | " f[~diverge] = i # Fill in non-diverged iteration number\n", 116 | "\n", 117 | " return f" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "c, fractal = prepare(height, width, np)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "%%time\n", 136 | "_ = fractal_x(c, fractal, 20)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "plt.imshow(fractal)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## Numba\n", 153 | "\n", 154 | "Let's do a quick check with Numba from the CPU course, just to see how fast we can get on single CPU:" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "@numba.vectorize([numba.int32(numba.complex128, numba.int32)])\n", 164 | "def on_each_numba(cxy, maxiterations):\n", 165 | " z = cxy\n", 166 | " for i in range(maxiterations):\n", 167 | " z = z**2 + cxy\n", 168 | " if abs(z) > 2:\n", 169 | " return i\n", 170 | " return maxiterations" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "c, fractal = prepare(height, width, np)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "%%time\n", 189 | "r = on_each_numba(c, 20)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "plt.imshow(r);" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "## CuPy: NumPy interface\n", 206 | "\n", 207 | "Now, let's try a CuPy run (We will run a synchronize call just for good measure, since we are not using the output):" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | 
"metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "import cupy as cp" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "c, fractal = prepare(height, width, cp)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "%%timeit\n", 235 | "fractal_x(c, fractal, 20)\n", 236 | "cp.cuda.get_current_stream().synchronize()" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "## CuPy: Fuse interface\n", 244 | "\n", 245 | "This is a \"Numba vectorize\"-like interface for making elementwise interfaces and simple reductions. It's quite limited, though." 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "@cp.fuse()\n", 255 | "def cupy_fuse_combine(z, c):\n", 256 | " x = z**2 + c\n", 257 | " return x, cp.abs(x**2)\n", 258 | "\n", 259 | "\n", 260 | "def fractal_fuse(c, f, maxiterations):\n", 261 | " xp = cp.get_array_module(c)\n", 262 | " f *= 0 # set to 0\n", 263 | " z = c.copy()\n", 264 | "\n", 265 | " for i in range(1, maxiterations + 1):\n", 266 | " z, az2 = cupy_fuse_combine(z, c) # Compute z\n", 267 | " diverge = az2 > 2**2 # Divergence criteria\n", 268 | "\n", 269 | " z[diverge] = 2 # Keep number size small\n", 270 | " f[~diverge] = i # Fill in non-diverged iteration number\n", 271 | "\n", 272 | " return f" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "c, fractal = prepare(height, width, cp)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "%%timeit\n", 291 | "fractal_fuse(c, fractal, 20)\n", 292 | "cp.cuda.get_current_stream().synchronize()" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "## CuPy: Elementwise Kernel\n", 300 | "\n", 301 | "Now, let's try a custom elementwise kernel." 
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "cupy_single = cp.ElementwiseKernel(\n",
311 | "    \"complex128 c, int32 maxiterations\",\n",
312 | "    \"int32 res\",\n",
313 | "    \"\"\"\n",
314 | "    res = 0;\n",
315 | "    complex<double> z = c;\n",
316 | "\n",
317 | "    for (int i=0; i<maxiterations; i++) {\n",
318 | "        z = z*z + c;\n",
319 | "\n",
320 | "        if (abs(z*z) > 4)\n",
321 | "            break;\n",
322 | "\n",
323 | "        res = i;\n",
324 | "    }\n",
325 | "\n",
326 | "    \"\"\",\n",
327 | "    \"fract_el\",\n",
328 | ")"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {},
335 | "outputs": [],
336 | "source": [
337 | "%%timeit\n",
338 | "f = cupy_single(c, 20).get()\n",
339 | "cp.cuda.get_current_stream().synchronize()"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {},
346 | "outputs": [],
347 | "source": [
348 | "f = cupy_single(c, 20)\n",
349 | "plt.imshow(f.get())"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "We could also try writing everything ourselves with a pure, raw CUDA kernel:\n",
357 | "\n",
358 | "> Note: width/height are confusing here"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "cupy_kernel = cp.RawKernel(\n",
368 | "    \"\"\"\n",
369 | "extern \"C\"\n",
370 | "__global__ void fractal(double* c, int* fractal, int height, int width, int maxiterations) {\n",
371 | "    const int x = threadIdx.x + blockIdx.x*blockDim.x;\n",
372 | "    const int y = threadIdx.y + blockIdx.y*blockDim.y;\n",
373 | "\n",
374 | "    // Manual check for out-of-bounds (since blocks may be partial)\n",
375 | "    if (x >= height || y >= width)\n",
376 | "        return;\n",
377 | "\n",
378 | "    // Access c\n",
379 | "    double creal = c[2 * (x + height*y)];\n",
380 | "    double cimag = c[2 * (x + height*y) + 1];\n",
381 | "\n",
382 | "    // z = c\n",
383 | "    double zreal = creal;\n",
384 | "    double zimag = cimag;\n",
385 | "\n",
386 | "    fractal[x + height*y] = 0;\n",
387 | "    for (int i = 0; i < maxiterations; i++) {\n",
388 | "        // z = z*z + c\n",
389 | "        double zreal_new = zreal*zreal - zimag*zimag + creal;\n",
390 | "        double zimag_new = 2*zreal*zimag + cimag;\n",
391 | "        zreal = zreal_new;\n",
392 | "        zimag = zimag_new;\n",
393 | "\n",
394 | "        if (zreal*zreal + zimag*zimag > 4) {\n",
395 | "            break;\n",
396 | "        }\n",
397 | "        fractal[x + height*y] = i;\n",
398 | "    }\n",
399 | "}\n",
400 | "\"\"\",\n",
401 | "    \"fractal\",\n",
402 | ")"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": null,
408 | "metadata": {},
409 | "outputs": [],
410 | "source": [
411 | "def prepare_pycuda(c, fractal, maxiterations):\n",
412 | "    threadsperblock = (32, 32)\n",
413 | "    blockspergrid = (\n",
414 | "        math.ceil(c.shape[0] / threadsperblock[0]),\n",
415 | "        math.ceil(c.shape[1] / threadsperblock[1]),\n",
416 | "    )\n",
417 | "\n",
418 | "    return (\n",
419 | "        blockspergrid,\n",
420 | "        threadsperblock,\n",
421 | "        [\n",
422 | "            c.view(cp.double),\n",
423 | "            fractal,\n",
424 | "            cp.int32(height),\n",
425 | "            cp.int32(width),\n",
426 | "            cp.int32(maxiterations),\n",
427 | "        ],\n",
428 | "    )"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": null,
434 | "metadata": {},
435 | "outputs": [],
436 | "source": [
437 | "c, fractal = prepare(height, width, cp)\n",
438 | "args = prepare_pycuda(c, fractal, maxiterations)"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": null,
444 | "metadata": {},
445 | "outputs": [],
446 | "source": [
447 | "%%timeit\n",
448 | "cupy_kernel(*args)\n",
449 | "fractal.get()\n",
450 | "cp.cuda.get_current_stream().synchronize()"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": null,
456 | "metadata": {},
457 | "outputs": [],
458 | "source": [
459 | "plt.imshow(fractal.get());"
460 | ]
461 | },
462 | {
463 | "cell_type": "markdown",
464 | "metadata": {},
465 | "source": [
466 | "# Extra features\n",
467 | "\n",
468 | "I've skipped a key example not included above: reduction kernels. These let you perform an element-wise calculation as well as a binary reduction (like a sum); see the sketch after the version notes below.\n",
469 | "\n",
470 | "You can also use generic (template, in C++ terms) types \"T\", and you can use \"raw\" generics, which are arrays that do not participate in the element-wise portion of the kernel (that is, they do not broadcast in NumPy terms)."
471 | ]
472 | },
473 | {
474 | "cell_type": "markdown",
475 | "metadata": {},
476 | "source": [
477 | "# New versions\n",
478 | "\n",
479 | "CuPy 7.0 brought a host of new features, including:\n",
480 | "\n",
481 | "* Removed Python 2 support\n",
482 | "* RawModule, for building larger projects\n",
483 | "* NVCC support (instead of just NVRTC)\n",
484 | "* TensorCore support\n",
485 | "* High speed CUB routines, like sum and more\n",
486 | "\n",
487 | "CuPy 8.0 brought even more:\n",
488 | "\n",
489 | "* Optional activation of more CUB routines\n",
490 | "* More kernel fusion, with more reducers\n",
491 | "* More SciPy support, better external library integration\n",
492 | "\n",
493 | "CuPy 9.0 gave even more performance and filled out the library. JIT support is experimental.\n",
494 | "\n",
495 | "CuPy 10 added multi-node/GPU features via a new `cupyx.distributed` module. ARM binaries are provided. JIT now covers lambdas, atomics, and more built-ins. 51 new APIs were added for NumPy/SciPy. This is the first version implementing the Python array API standard introduced in NumPy 1.22.\n",
496 | "\n",
497 | "CuPy 11 has some exciting new features. It can use the Graph API to speed up kernel launches. `__device__` functions are supported in JIT, along with groups and shape/strides. Static typing is being added. CUB is now enabled by default. MPI/sparse matrix support in distributed. A new wheel package makes it easier to install.\n",
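"\n",
"As an illustration of the reduction kernels mentioned under \"Extra features\" (a sketch, not part of the original material), here is a sum of squares:\n",
"\n",
"```python\n",
"sum_sq = cp.ReductionKernel(\n",
"    in_params=\"T x\",\n",
"    out_params=\"T y\",\n",
"    map_expr=\"x * x\",  # element-wise part\n",
"    reduce_expr=\"a + b\",  # binary reduction\n",
"    post_map_expr=\"y = a\",  # final assignment\n",
"    identity=\"0\",\n",
"    name=\"sum_sq\",\n",
")\n",
"\n",
"sum_sq(cp.arange(10.0))  # -> array(285.)\n",
"```"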
498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "\n" 505 | ] 506 | } 507 | ], 508 | "metadata": { 509 | "kernelspec": { 510 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 511 | "language": "python", 512 | "name": "sys_pygpu201912" 513 | }, 514 | "language_info": { 515 | "codemirror_mode": { 516 | "name": "ipython", 517 | "version": 3 518 | }, 519 | "file_extension": ".py", 520 | "mimetype": "text/x-python", 521 | "name": "python", 522 | "nbconvert_exporter": "python", 523 | "pygments_lexer": "ipython3", 524 | "version": "3.9.7" 525 | } 526 | }, 527 | "nbformat": 4, 528 | "nbformat_minor": 4 529 | } 530 | -------------------------------------------------------------------------------- /01b_fractal_numba.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "height = 2_000\n", 10 | "width = 3_000\n", 11 | "maxiterations = 20" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import numpy as np\n", 21 | "import numba\n", 22 | "import numba.cuda\n", 23 | "import math\n", 24 | "import matplotlib.pyplot as plt" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "This time we can actually check the name of the device using the API:" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "numba.cuda.get_current_device().name" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Let's make the data each time (we won't always use the output `fractal`)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "def prepare(height, width):\n", 57 | " x, y = np.ogrid[-1.5j : 1.5j : height * 1j, -2 : 2 : width * 1j]\n", 58 | " c = x + y\n", 59 | " fractal = np.zeros(c.shape, dtype=np.int32)\n", 60 | " return c, fractal" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## NumPy\n", 68 | "\n", 69 | "Let's try a NumPy run (we will use `%%time` instead of `%%timeit`, since this takes several seconds to run so we don't need a precision measurement and don't want to waste time):" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "def fractal_numpy(c, maxiterations):\n", 79 | " f = np.zeros_like(c, dtype=np.int32)\n", 80 | " z = c.copy()\n", 81 | "\n", 82 | " for i in range(1, maxiterations + 1):\n", 83 | " z = z**2 + c # Compute z\n", 84 | " diverge = np.abs(z**2) > 2**2 # Divergence criteria\n", 85 | "\n", 86 | " z[diverge] = 2 # Keep number size small\n", 87 | " f[~diverge] = i # Fill in non-diverged iteration number\n", 88 | "\n", 89 | " return f" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "c, _ = prepare(height, width)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "%%time\n", 108 | "_ = fractal_numpy(c, maxiterations)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | 
"metadata": {}, 114 | "source": [ 115 | "## Numba\n", 116 | "\n", 117 | "Let's do a quick check with Numba from the CPU course, just to see how fast we can get on single CPU:" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "@numba.vectorize([numba.int32(numba.complex128, numba.int32)])\n", 127 | "def fractal_numba_vectorize(cxy, maxiterations):\n", 128 | " z = cxy\n", 129 | " for i in range(maxiterations):\n", 130 | " z = z**2 + cxy\n", 131 | " if abs(z) > 2:\n", 132 | " return i\n", 133 | " return maxiterations" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "c, _ = prepare(height, width)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "%%timeit\n", 152 | "fractal_numba_vectorize(c, maxiterations)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "## Numba CUDA: vectorize, host memory" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "@numba.vectorize([numba.int32(numba.complex128, numba.int32)], target=\"cuda\")\n", 169 | "def fractal_cuda_vectorize(cxy, maxiterations):\n", 170 | " z = cxy\n", 171 | " for i in range(maxiterations):\n", 172 | " z = z**2 + cxy\n", 173 | " if abs(z) > 2:\n", 174 | " return i\n", 175 | " return maxiterations" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "c, _ = prepare(height, width)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "%%timeit\n", 194 | "fractal_cuda_vectorize(c, maxiterations);" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "## Numba CUDA: vectorize, GPU memory" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "c, _ = prepare(height, width)\n", 211 | "c = numba.cuda.to_device(c)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "c" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "%%timeit\n", 230 | "fractal_cuda_vectorize(c, maxiterations)\n", 231 | "numba.cuda.synchronize()" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "Note that we did not copy the memory back to the CPU; it's still on the GPU." 
239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "## Numba CUDA: vectorize, skip allocation" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "c, f = prepare(height, width)\n", 255 | "c = numba.cuda.to_device(c)\n", 256 | "f = numba.cuda.to_device(f)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "%%timeit\n", 266 | "fractal_cuda_vectorize(c, maxiterations, out=f)\n", 267 | "numba.cuda.synchronize()" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "## Numba CUDA: custom kernel" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "@numba.cuda.jit\n", 284 | "def fractal_cuda_kernel(c_array, f, maxiterations):\n", 285 | " x, y = numba.cuda.grid(2)\n", 286 | " if x < c_array.shape[0] and y < c_array.shape[1]:\n", 287 | " f[x, y] = 0\n", 288 | " z = c_array[x, y]\n", 289 | " for i in range(maxiterations):\n", 290 | " z = z**2 + c_array[x, y]\n", 291 | " if abs(z**2) > 4:\n", 292 | " break\n", 293 | " f[x, y] = i" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "c, f = prepare(height, width)\n", 303 | "c = numba.cuda.to_device(c)\n", 304 | "f = numba.cuda.to_device(f)" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "Now we have to specify a custom kernel launch, rather than having it automated." 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "threadsperblock = (8, 8)\n", 321 | "blockspergrid = (\n", 322 | " math.ceil(c.shape[0] / threadsperblock[0]),\n", 323 | " math.ceil(c.shape[1] / threadsperblock[1]),\n", 324 | ")" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": null, 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [ 333 | "blockspergrid" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "%%timeit\n", 343 | "fractal_cuda_kernel[blockspergrid, threadsperblock](c, f, maxiterations)\n", 344 | "np.array(f, dtype=f.dtype)\n", 345 | "numba.cuda.synchronize()" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "We can plot this, just in case we made a mistake (even though we ran timeit above, plotting is valid, since we are reusing the same preallocated memory location):" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "plt.imshow(np.array(f, dtype=f.dtype));" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [] 370 | } 371 | ], 372 | "metadata": { 373 | "kernelspec": { 374 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 375 | "language": "python", 376 | "name": "sys_pygpu201912" 377 | }, 378 | "language_info": { 379 | "codemirror_mode": { 380 | "name": "ipython", 381 | "version": 3 382 | }, 383 | "file_extension": ".py", 384 | "mimetype": 
"text/x-python", 385 | "name": "python", 386 | "nbconvert_exporter": "python", 387 | "pygments_lexer": "ipython3", 388 | "version": "3.9.7" 389 | } 390 | }, 391 | "nbformat": 4, 392 | "nbformat_minor": 4 393 | } 394 | -------------------------------------------------------------------------------- /02a_nll_cupy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Fitting: Computing an NLL\n", 8 | "\n", 9 | "We will be using CuPy to compute a negative log likelihood, for an unbinned fit (not performed). Like before, let's set up the data and then try a solution with NumPy:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "!nvidia-smi" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## Dataset" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import numpy as np\n", 35 | "import matplotlib.pyplot as plt\n", 36 | "import math\n", 37 | "\n", 38 | "np.random.seed(42)\n", 39 | "\n", 40 | "dist = np.hstack(\n", 41 | " [\n", 42 | " np.random.normal(loc=1, scale=2.0, size=500_000),\n", 43 | " np.random.normal(loc=1, scale=0.5, size=500_000),\n", 44 | " ]\n", 45 | ")" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "plt.hist(dist, bins=\"auto\");" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "## NumPy" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "def gaussian(x, μ, σ):\n", 71 | " return 1 / np.sqrt(2 * np.pi * σ**2) * np.exp(-((x - μ) ** 2) / (2 * σ**2))\n", 72 | "\n", 73 | "\n", 74 | "def add(x, f_0, mean, sigma, sigma2):\n", 75 | " return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)\n", 76 | "\n", 77 | "\n", 78 | "def nll(dist, f_0, mean, sigma, sigma2):\n", 79 | " return -np.sum(np.log(add(dist, f_0, mean, sigma, sigma2)))" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "%%timeit\n", 89 | "nll(dist, *np.random.rand(4))" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "We may get a divide by 0 error, since we are randomly setting parameters. That's okay." 
97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## CuPy: simple" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "import cupy as cp" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "d_dist = cp.asarray(dist)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "%%timeit\n", 131 | "nll(d_dist, *cp.random.rand(4))\n", 132 | "cp.cuda.get_current_stream().synchronize()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "Because CuPy supports the NumPy 1.13 ufunc dispatch, we didn't even need to replace the `np.*` in the lines above!" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## CuPy: Fuse\n", 147 | "\n", 148 | "We can get even a *little* better by using fuse:" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "@cp.fuse()\n", 158 | "def gaussian(x, μ, σ):\n", 159 | " return 1 / cp.sqrt(2 * cp.pi * σ**2) * cp.exp(-((x - μ) ** 2) / (2 * σ**2))\n", 160 | "\n", 161 | "\n", 162 | "@cp.fuse()\n", 163 | "def add(x, f_0, mean, sigma, sigma2):\n", 164 | " return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)\n", 165 | "\n", 166 | "\n", 167 | "# @cp.fuse() # Actually slower; it seems to reorder the sum into a linear reduction\n", 168 | "def nll(dist, f_0, mean, sigma, sigma2):\n", 169 | " return -cp.sum(cp.log(add(dist, f_0, mean, sigma, sigma2)))" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "%%timeit\n", 179 | "nll(d_dist, *cp.random.rand(4))\n", 180 | "cp.cuda.get_current_stream().synchronize()" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "## CuPy: Custom kernels\n", 188 | "\n", 189 | "Let's try a custom reduction kernel:" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "device_fns = \"\"\"\n", 199 | "#define POW2(x) ((x)*(x))\n", 200 | "__device__\n", 201 | "double gaussian(double x, double mu, double sigma) {\n", 202 | " return rsqrt(2*M_PI*POW2(sigma)) * exp(-POW2(x-mu)/(2*POW2(sigma)));\n", 203 | "}\n", 204 | "\n", 205 | "__device__ double add(double x, double f_0, double mean, double sigma, double sigma2) {\n", 206 | " return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2);\n", 207 | "}\n", 208 | "\"\"\"" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "nll_kernel = cp.ReductionKernel(\n", 218 | " in_params=\"T dist, T f_0, T mean, T sigma, T sigma2\",\n", 219 | " out_params=\"T y\",\n", 220 | " map_expr=f\"log(add(dist, f_0, mean, sigma, sigma2))\",\n", 221 | " reduce_expr=\"a + b\",\n", 222 | " post_map_expr=\"y = -a\",\n", 223 | " identity=\"0\",\n", 224 | " name=\"nll_kernel\",\n", 225 | " preamble=device_fns,\n", 226 | ")" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "And, 
when we run it, we get the kernel-fusion speedup, but combined with the slowdown of the large linear reduction:"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "%%timeit\n",
243 | "nll_kernel(d_dist, *cp.random.rand(4))\n",
244 | "cp.cuda.get_current_stream().synchronize()"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "#### CuPy Elementwise + sum algorithm\n",
252 | "\n",
253 | "This is the best we can do (without implementing a RawKernel with a smart reduction, anyway):"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": null,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "inside_nll = cp.ElementwiseKernel(\n",
263 | "    in_params=\"T dist, T f_0, T mean, T sigma, T sigma2\",\n",
264 | "    out_params=\"T y\",\n",
265 | "    operation=\"y = log(add(dist, f_0, mean, sigma, sigma2))\",\n",
266 | "    name=\"inside_nll\",\n",
267 | "    preamble=device_fns,\n",
268 | ")"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": null,
274 | "metadata": {},
275 | "outputs": [],
276 | "source": [
277 | "%%timeit\n",
278 | "-cp.sum(inside_nll(d_dist, *cp.random.rand(4)))\n",
279 | "cp.cuda.get_current_stream().synchronize()"
280 | ]
281 | },
282 | {
283 | "cell_type": "markdown",
284 | "metadata": {},
285 | "source": [
286 | "# Exercise\n",
287 | "\n",
288 | "Take one or more of the above examples, and convert them to 32-bit floats. How does the performance compare? (Pay attention to the GPU you get when running the example.)\n",
289 | "\n",
290 | "Be careful when you do so not to let 64 bits sneak in. Check the output and/or in-between steps regularly!\n"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": null,
296 | "metadata": {},
297 | "outputs": [],
298 | "source": []
299 | }
300 | ],
301 | "metadata": {
302 | "kernelspec": {
303 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]",
304 | "language": "python",
305 | "name": "sys_pygpu201912"
306 | },
307 | "language_info": {
308 | "codemirror_mode": {
309 | "name": "ipython",
310 | "version": 3
311 | },
312 | "file_extension": ".py",
313 | "mimetype": "text/x-python",
314 | "name": "python",
315 | "nbconvert_exporter": "python",
316 | "pygments_lexer": "ipython3",
317 | "version": "3.8.6"
318 | }
319 | },
320 | "nbformat": 4,
321 | "nbformat_minor": 4
322 | }
323 |
--------------------------------------------------------------------------------
/02b_nll_tensorflow.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Fitting: Computing an NLL\n",
8 | "\n",
9 | "We will be using TensorFlow's new eager mode, the new JIT static graph, and a classic API static graph to solve a different problem: fitting unbinned datasets.
Like before, let's set up the data and then try a solution with NumPy:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "!nvidia-smi" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import numpy as np\n", 28 | "import math\n", 29 | "\n", 30 | "np.random.seed(42)\n", 31 | "\n", 32 | "dist = np.hstack(\n", 33 | " [\n", 34 | " np.random.normal(loc=1, scale=2.0, size=500_000),\n", 35 | " np.random.normal(loc=1, scale=0.5, size=500_000),\n", 36 | " ]\n", 37 | ")" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Now let's load TensorFlow 2.0:" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import tensorflow as tf\n", 54 | "\n", 55 | "print(f\"{tf.__version__ = }\")" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "The dataset does not change, so that can be a constant." 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "d_dist = tf.constant(dist, name=\"dist\")" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Notice that this looks a lot like NumPy, except most of the names are different. Also this is the same on both APIs; the main difference is the setup and debugging." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "def gaussian(x, μ, σ):\n", 88 | " return 1 / tf.sqrt(2 * np.pi * σ**2) * tf.math.exp(-((x - μ) ** 2) / (2 * σ**2))\n", 89 | "\n", 90 | "\n", 91 | "def add(x, f_0, mean, sigma, sigma2):\n", 92 | " return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)\n", 93 | "\n", 94 | "\n", 95 | "def make_nll(dist, f_0, mean, sigma, sigma2):\n", 96 | " return -tf.reduce_sum(tf.math.log(add(dist, f_0, mean, sigma, sigma2)))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "%%timeit\n", 106 | "make_nll(d_dist, *np.random.rand(4))" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Let's try using the autograph technique to convert this into something like a static graph (it gets cached on first use). 
This could be written as a decorator, `@tf.function`:"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "nll = tf.function(make_nll)"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "For the static graph to work, we need to be careful and use all TensorFlow objects:"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {},
136 | "outputs": [],
137 | "source": [
138 | "tf_f_0 = tf.Variable(0, dtype=tf.float64)\n",
139 | "tf_mean = tf.Variable(0, dtype=tf.float64)\n",
140 | "tf_sigma = tf.Variable(0, dtype=tf.float64)\n",
141 | "tf_sigma2 = tf.Variable(0, dtype=tf.float64)"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "metadata": {},
148 | "outputs": [],
149 | "source": [
150 | "%%timeit\n",
151 | "r = np.random.rand(4)\n",
152 | "\n",
153 | "tf_f_0.assign(r[0])\n",
154 | "tf_mean.assign(r[1])\n",
155 | "tf_sigma.assign(r[2])\n",
156 | "tf_sigma2.assign(r[3])\n",
157 | "\n",
158 | "nll(d_dist, tf_f_0, tf_mean, tf_sigma, tf_sigma2)"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "### Static Graph (classic API)\n",
166 | "\n",
167 | "Let's try the classic API, and build a static graph:"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "import tensorflow.compat.v1 as tf\n",
177 | "\n",
178 | "tf.disable_eager_execution()"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "Repeating this here for good measure:"
186 | ]
187 | },
188 | {
189 | "cell_type": "code",
190 | "execution_count": null,
191 | "metadata": {},
192 | "outputs": [],
193 | "source": [
194 | "def gaussian(x, μ, σ):\n",
195 | "    return 1 / tf.sqrt(2 * np.pi * σ**2) * tf.math.exp(-((x - μ) ** 2) / (2 * σ**2))\n",
196 | "\n",
197 | "\n",
198 | "def add(x, f_0, mean, sigma, sigma2):\n",
199 | "    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)\n",
200 | "\n",
201 | "\n",
202 | "def make_nll(dist, f_0, mean, sigma, sigma2):\n",
203 | "    return -tf.reduce_sum(tf.math.log(add(dist, f_0, mean, sigma, sigma2)))"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "The API is different (we still have constant, but placeholder is classic API):"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "x = tf.constant(dist)\n",
220 | "tf_f_0 = tf.placeholder(dtype=tf.float64)\n",
221 | "tf_mean = tf.placeholder(dtype=tf.float64)\n",
222 | "tf_sigma = tf.placeholder(dtype=tf.float64)\n",
223 | "tf_sigma2 = tf.placeholder(dtype=tf.float64)\n",
224 | "\n",
225 | "graph = make_nll(x, tf_f_0, tf_mean, tf_sigma, tf_sigma2)"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "graph"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "We have to start up a session, then feed the hungry graph with values:"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "metadata": {},
248 | "outputs": [],
249 | "source": [
250 | "sess = tf.Session()\n",
251 | "\n",
252 | "\n", 253 | "def nll(f_0, mean, sigma, sigma2):\n", 254 | " loss_value = sess.run(\n", 255 | " graph,\n", 256 | " feed_dict={tf_f_0: f_0, tf_mean: mean, tf_sigma: sigma, tf_sigma2: sigma2},\n", 257 | " )\n", 258 | " return loss_value" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "%%timeit\n", 268 | "nll(*np.random.rand(4))" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "sess.close()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": {}, 284 | "outputs": [], 285 | "source": [] 286 | } 287 | ], 288 | "metadata": { 289 | "kernelspec": { 290 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 291 | "language": "python", 292 | "name": "sys_pygpu201912" 293 | }, 294 | "language_info": { 295 | "codemirror_mode": { 296 | "name": "ipython", 297 | "version": 3 298 | }, 299 | "file_extension": ".py", 300 | "mimetype": "text/x-python", 301 | "name": "python", 302 | "nbconvert_exporter": "python", 303 | "pygments_lexer": "ipython3", 304 | "version": "3.8.6" 305 | } 306 | }, 307 | "nbformat": 4, 308 | "nbformat_minor": 4 309 | } 310 | -------------------------------------------------------------------------------- /02c_nll_torch.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Fitting: Computing an NLL\n", 8 | "\n", 9 | "We will be using PyTorch this time." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "!nvidia-smi" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import numpy as np\n", 28 | "import math\n", 29 | "\n", 30 | "np.random.seed(42)\n", 31 | "\n", 32 | "dist = np.hstack(\n", 33 | " [\n", 34 | " np.random.normal(loc=1, scale=2.0, size=500_000),\n", 35 | " np.random.normal(loc=1, scale=0.5, size=500_000),\n", 36 | " ]\n", 37 | ")" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import torch" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Torch: CPU\n", 54 | "By default, Torch data will be on the CPU unless sent to a GPU. 
Let's start with the CPU, then:"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "d_dist = torch.tensor(dist)"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "This is similar to NumPy, though we'll have to be careful to use Torch's own `sqrt` function, since `math.sqrt` does not operate on a Torch Tensor:"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "def gaussian(x, μ, σ):\n",
80 | "    return (\n",
81 | "        1 / torch.sqrt(2 * np.pi * σ**2) * torch.exp(-((x - μ) ** 2) / (2 * σ**2))\n",
82 | "    )\n",
83 | "\n",
84 | "\n",
85 | "def add(x, f_0, mean, sigma, sigma2):\n",
86 | "    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)\n",
87 | "\n",
88 | "\n",
89 | "@torch.jit.script\n",
90 | "def nll(dist, f_0, mean, sigma, sigma2):\n",
91 | "    return -torch.sum(torch.log(add(dist, f_0, mean, sigma, sigma2)))"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "Now, let's check the performance:"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "%%timeit\n",
108 | "vals = [torch.tensor(v) for v in np.random.rand(4)]\n",
109 | "nll(d_dist, *vals)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "## Torch: GPU\n",
117 | "\n",
118 | "Moving this to the GPU is very simple; we get a CUDA device and then use `.to` to send data to the device. *Note that we do not have to send functions to the device, only data. (If you are doing ML, models usually also have to be sent to the device, because they contain weights, and weights are data.)*"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": null,
124 | "metadata": {},
125 | "outputs": [],
126 | "source": [
127 | "device = torch.device(\"cuda:0\")"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": null,
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "dev_dist = d_dist.to(device)"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "> Warning: in the current environment, this is a little broken - PyTorch and conda-forge are conflicting, I believe."
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "%%timeit\n",
153 | "vals = [torch.tensor(v).to(device) for v in np.random.rand(4)]\n",
154 | "nll(dev_dist, *vals)\n",
155 | "torch.cuda.synchronize()"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "## Exercise\n",
163 | "\n",
164 | "Try toggling the `torch.jit.script` decorator. What happens to the performance? How does it compare with the other methods now?"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "## PyTorch gradients\n",
172 | "\n",
173 | "Torch's strong point (along with TensorFlow) is the gradient functionality. If you make a tensor with `requires_grad=True`, it then keeps a record of what happens to it during calculations, called a tape. If you call `result.backward(values)`, it replays the tape of gradient operations in reverse, allowing you to get the gradient.
This is very powerful in fitting problems, such as those encountered in ML." 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [] 182 | } 183 | ], 184 | "metadata": { 185 | "kernelspec": { 186 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 187 | "language": "python", 188 | "name": "sys_pygpu201912" 189 | }, 190 | "language_info": { 191 | "codemirror_mode": { 192 | "name": "ipython", 193 | "version": 3 194 | }, 195 | "file_extension": ".py", 196 | "mimetype": "text/x-python", 197 | "name": "python", 198 | "nbconvert_exporter": "python", 199 | "pygments_lexer": "ipython3", 200 | "version": "3.8.6" 201 | } 202 | }, 203 | "nbformat": 4, 204 | "nbformat_minor": 4 205 | } 206 | -------------------------------------------------------------------------------- /02x_torch_autograd.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import torch" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Let's take a sneak peek at autograd:" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "$$\n", 24 | "x = 3\n", 25 | "$$" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "x = torch.tensor([3.0], requires_grad=True)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "Let's make a computation:" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "$$\n", 49 | "y = x^3 + x^2 + x = 27 + 9 + 3 = 39\n", 50 | "$$" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "y = x**3 + x**2 + x" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "y" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "Notice how it's keeping a \"tape\" of all the backward operations.\n", 76 | "\n", 77 | "And now, let's compute the gradient:" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "$$\n", 85 | "\\frac{dy}{dx} = 3x^2 + 2x + 1 = 3\\cdot 9 + 6 + 1 = 34\n", 86 | "$$" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "We call `.backward()` on the final product (`y`) to fill in the `.grad` properties on the inputs (`x`).",
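"\n",
"The tape is freed as it is replayed, so calling `y.backward()` a second time raises an error. A minimal sketch of keeping the tape alive (not run here):\n",
"\n",
"```python\n",
"y.backward(retain_graph=True)  # keep the tape for another replay\n",
"y.backward()                   # gradients accumulate, so x.grad doubles\n",
"```"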
94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "y.backward()" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "y" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "x.grad" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [] 136 | } 137 | ], 138 | "metadata": { 139 | "kernelspec": { 140 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 141 | "language": "python", 142 | "name": "sys_pygpu201912" 143 | }, 144 | "language_info": { 145 | "codemirror_mode": { 146 | "name": "ipython", 147 | "version": 3 148 | }, 149 | "file_extension": ".py", 150 | "mimetype": "text/x-python", 151 | "name": "python", 152 | "nbconvert_exporter": "python", 153 | "pygments_lexer": "ipython3", 154 | "version": "3.8.6" 155 | } 156 | }, 157 | "nbformat": 4, 158 | "nbformat_minor": 4 159 | } 160 | -------------------------------------------------------------------------------- /03_nll.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "e4575a55-e1b2-4ee8-9cba-257212934011", 6 | "metadata": {}, 7 | "source": [ 8 | "# Intro: CuPy and Numba on the GPU\n", 9 | "\n", 10 | "10-20-2021\n", 11 | "\n", 12 | "\n", 13 | "Useful links:\n", 14 | "* [High Performance Python: CPUs](https://github.com/henryiii/python-performance-minicourse)\n", 15 | "* [iscinumpy.gitlab.io](https://iscinumpy.gitlab.io)\n", 16 | "* [CompClass](https://github.com/henryiii/compclass)\n", 17 | "\n", 18 | "Note that we are using CPython 3.9. 3.10 is out, but is not quite ready for conda yet. And even when it is, Numba is slow to update due to heavy usage of bytecode, which is not (and is not supposed to be) stable between releases." 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "de66b640-8e48-451c-9f6c-f3f8db03c7e9", 24 | "metadata": {}, 25 | "source": [ 26 | "## Problem 1: Negative Log Likelihood\n", 27 | "\n", 28 | "Let's start with an NLL calculation. If you are doing an unbinned likelihood fit, this is the main computation loop that drives that sort of fit. It's also _mostly_ embarrassingly parallel, except for a final reduction." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "20e1482b-5a33-4614-bd01-36f14fe54bb3", 34 | "metadata": {}, 35 | "source": [ 36 | "### NumPy (normal CPU solution)\n", 37 | "\n", 38 | "Let's try with numpy so you can see the three lines of actual code involved.
First our imports:" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "id": "a18534ad-b843-4bd4-bfa8-becf5b350488", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "import numpy as np\n", 49 | "import math" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "id": "8d015e9b-2eba-4b17-8dab-9acfc49e4bd3", 55 | "metadata": {}, 56 | "source": [ 57 | "Now we make some artificial data to run on (this is what we'd fit if we added the fitter):" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "id": "4285ea09-0b30-48f9-b2f8-71467dcb2509", 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "rng = np.random.default_rng(seed=42)\n", 68 | "\n", 69 | "dist = np.hstack(\n", 70 | " [\n", 71 | " rng.normal(loc=1, scale=2.0, size=500_000),\n", 72 | " rng.normal(loc=1, scale=0.5, size=500_000),\n", 73 | " ]\n", 74 | ")" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "id": "a79d24f2-0c57-49d3-80fc-aa329b7c2c8f", 80 | "metadata": {}, 81 | "source": [ 82 | "Now we define a gaussian, a weighted sum of two gaussians, and an nll function:" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "id": "c60acd4c-132b-4d03-a94a-618d96e38ff1", 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "def gaussian(x, μ, σ):\n", 93 | " return 1 / math.sqrt(2 * np.pi * σ**2) * np.exp(-((x - μ) ** 2) / (2 * σ**2))\n", 94 | "\n", 95 | "\n", 96 | "def add(x, f_0, μ, σ_1, σ_2):\n", 97 | " return f_0 * gaussian(x, μ, σ_1) + (1 - f_0) * gaussian(x, μ, σ_2)\n", 98 | "\n", 99 | "\n", 100 | "def nll(x, f_0, μ, σ_1, σ_2):\n", 101 | " return -np.sum(np.log(add(x, f_0, μ, σ_1, σ_2)))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "6bf36950-1995-43c8-9145-5164cad744bf", 107 | "metadata": {}, 108 | "source": [ 109 | "Let's just show the actual value at the minimum for comparison later:" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "id": "0ea84184-840a-4a34-ac83-6026a3570427", 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "nll(dist, 0.5, 1.0, 2.0, 0.5)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "8b748987-94c7-4ffc-a729-f4673229cdf5", 125 | "metadata": {}, 126 | "source": [ 127 | "Let's see how much time this takes to compute:" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "id": "2548b64c-792d-4abe-a0cd-b49f04c4c6ab", 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "%%timeit\n", 138 | "nll(\n", 139 | " dist,\n", 140 | " rng.random(),\n", 141 | " rng.normal(loc=1, scale=0.3),\n", 142 | " rng.normal(loc=2, scale=0.5),\n", 143 | " rng.normal(loc=0.5, scale=0.1),\n", 144 | ")" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "id": "7a0b0fc0-5630-41ec-a838-79d4c7f7c572", 150 | "metadata": {}, 151 | "source": [ 152 | "FYI, this is _very_ good. NumPy is probably using multiple threads for parts of this computation, and fusing simple expressions." 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "id": "b12bc9a3-7b68-4114-9548-3ffa314a0317", 158 | "metadata": {}, 159 | "source": [ 160 | "### CuPy\n", 161 | "\n", 162 | "#### CuPy drop-in\n", 163 | "\n", 164 | "\n", "We are going to import cupy. `import cupy as cp` is very common, due to similarity with `np` (and you will sometimes see `import cupy as np`).",
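"\n",
"Because CuPy arrays implement the NumPy override protocols, plain `np.*` ufunc calls operate on them directly. A minimal sketch (this small test array is illustrative only, not part of the original notebook):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import cupy\n",
"\n",
"a = cupy.arange(5.0)  # lives on the GPU\n",
"out = np.exp(a)       # dispatches to cupy.exp; `out` is a cupy.ndarray\n",
"```"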
165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "id": "2b7ed27e-6ba1-4546-8a87-1173811d4945", 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "import cupy" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "id": "070e3eaf-8a39-4e79-b22a-9a224b61f5d0", 180 | "metadata": {}, 181 | "source": [ 182 | "The first thing we need to do is move the NumPy array over to the GPU. We do that with `cupy.array`." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "a3e373fb-11c0-40e9-bae0-992155b121dd", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "cpdist = cupy.array(dist)" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "id": "b35cb1f6-908a-4ca9-88f9-762f52a6e8ce", 198 | "metadata": {}, 199 | "source": [ 200 | "Actually, that's the last thing we need to do, as long as you have NumPy 1.18 or better. Everything works now:" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "id": "b11a63f1-ad8b-48ad-8fbf-7305245c9984", 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "%%timeit\n", 211 | "nll(\n", 212 | " cpdist,\n", 213 | " rng.random(),\n", 214 | " rng.normal(loc=1, scale=0.3),\n", 215 | " rng.normal(loc=2, scale=0.5),\n", 216 | " rng.normal(loc=0.5, scale=0.1),\n", 217 | ").get()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "id": "73d25568-f723-4399-9ce9-9f25064de5a4", 223 | "metadata": {}, 224 | "source": [ 225 | "NumPy 1.13 added the ability to override ufuncs, and 1.18 added the ability to override general functions; CuPy uses this, so you don't need to replace `np` with `cupy` unless you are making arrays (`array`, `asarray`, `empty`, `zeros`, etc.). If you do need to make an array, you can use `xp = cupy.get_array_module(existing_array)`, then `xp` will be either `numpy` or `cupy`, depending on the input array." 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "id": "55d6fca6-946f-4d27-b027-50b0d1f3b2d4", 231 | "metadata": {}, 232 | "source": [ 233 | "We can try to do better, though - cupy is making temporaries, which are costly.
Since we are doing a reduction, let's write a reduction kernel:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "id": "c069bac6-5248-4bbc-8276-b32ce78fa9c9", 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "rku = cupy.ReductionKernel(\n", 244 | " \"float64 x, float64 f_0, float64 mean, float64 sigma, float64 sigma2\",\n", 245 | " \"float64 r\",\n", 246 | " \"\"\"\n", 247 | " log( f_0 * rsqrt(2*M_PI*sigma*sigma) * exp(-(x-mean)*(x-mean)/(2*sigma*sigma)) +\n", 248 | " (1 - f_0) * rsqrt(2*M_PI*sigma2*sigma2) * exp(-(x-mean)*(x-mean)/(2*sigma2*sigma2)))\n", 249 | " \"\"\",\n", 250 | " \"a + b\",\n", 251 | " \"r = -a\",\n", 252 | " \"0\",\n", 253 | " \"redu_kernel\",\n", 254 | ")" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "id": "1fcedfd1-7cfa-459e-86e4-8b4d459ead37", 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "def nll(dist, f_0, mean, sigma, sigma2):\n", 265 | " return rku(dist, f_0, mean, sigma, sigma2)" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "id": "9e9199f7-119f-47f8-95c3-641f8ef18975", 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "nll(cpdist, 0.5, 1.0, 2.0, 0.5).get()" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "id": "b42482bb-a4f7-4541-aa17-446fadd56acd", 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "%%timeit\n", 286 | "nll(\n", 287 | " cpdist,\n", 288 | " rng.random(),\n", 289 | " rng.normal(loc=1, scale=0.3),\n", 290 | " rng.normal(loc=2, scale=0.5),\n", 291 | " rng.normal(loc=0.5, scale=0.1),\n", 292 | ").get()" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "id": "e9452b7f-a5b3-4901-975d-f6e46a8372b4", 298 | "metadata": {}, 299 | "source": [ 300 | "This is actually a bit worse. We did much better in the middle, not needing as many temporaries, but did much worse in the reduction, as this is not as optimized as `cp.sum`. 
Let's try a hybrid solution:" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "id": "5570ec62-2465-499f-a624-bc66a18b05df", 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "mykernel = cupy.ElementwiseKernel(\n", 311 | " \"float64 x, float64 f_0, float64 mean, float64 sigma, float64 sigma2\",\n", 312 | " \"float64 z\",\n", 313 | " \"\"\"\n", 314 | " \n", 315 | " double s12 = 2*sigma*sigma;\n", 316 | " double s22 = 2*sigma2*sigma2;\n", 317 | " \n", 318 | " double p = -(x-mean)*(x-mean);\n", 319 | " double g = rsqrt(M_PI*s12) * exp(p/s12);\n", 320 | " double g2 = rsqrt(M_PI*s22) * exp(p/s22);\n", 321 | " \n", 322 | " z = log(f_0 * g + (1 - f_0) * g2);\n", 323 | " \n", 324 | " \"\"\",\n", 325 | " \"mykernel\",\n", 326 | ")" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "id": "14c3cd7f-9ef9-4085-a885-35740239991e", 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "def nll(dist, f_0, mean, sigma, sigma2):\n", 337 | " return -cupy.sum(mykernel(dist, f_0, mean, sigma, sigma2))" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "id": "d19233ac-e605-4760-bdea-7694a2235758", 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "nll(cpdist, 0.5, 1.0, 2.0, 0.5).get()" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "id": "4db1c290-1dbc-4ad4-9b98-2ebe74a8e7e8", 354 | "metadata": {}, 355 | "outputs": [], 356 | "source": [ 357 | "%%timeit\n", 358 | "nll(\n", 359 | " cpdist,\n", 360 | " rng.random(),\n", 361 | " rng.normal(loc=1, scale=0.3),\n", 362 | " rng.normal(loc=2, scale=0.5),\n", 363 | " rng.normal(loc=0.5, scale=0.1),\n", 364 | ").get()" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "id": "6ab90bd6-77d8-4ba8-80f8-3949937e3b38", 370 | "metadata": {}, 371 | "source": [ 372 | "This is optimal - we are using the CUB sum as well as avoiding temporaries." 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "id": "24461dd2-e663-4caa-ad1a-b469f9cf099a", 378 | "metadata": {}, 379 | "source": [ 380 | "## Numba GPU\n", 381 | "\n", 382 | "Another solution is Numba's JIT compiler, which supports CUDA." 
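,
"\n",
"As a warm-up, `@numba.vectorize` with `target=\"cuda\"` compiles a scalar function into a GPU ufunc; a minimal sketch (the `gpu_add` name is hypothetical, just for illustration):\n",
"\n",
"```python\n",
"import numba\n",
"\n",
"@numba.vectorize([\"float64(float64, float64)\"], target=\"cuda\")\n",
"def gpu_add(a, b):\n",
"    return a + b\n",
"```"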
383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "id": "7df237f3-8cd1-434f-9e11-026c9914eb52", 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "import numba.cuda\n", 393 | "import math\n", 394 | "\n", 395 | "\n", 396 | "@numba.cuda.jit(\"float64(float64,float64,float64)\", device=True, inline=True)\n", 397 | "def gaussian(x, μ, σ):\n", 398 | " return 1 / math.sqrt(2 * np.pi * σ**2) * math.exp(-((x - μ) ** 2) / (2 * σ**2))\n", 399 | "\n", 400 | "\n", 401 | "@numba.vectorize([\"float64(float64,float64,float64,float64,float64)\"], target=\"cuda\")\n", 402 | "def log_add(x, f_0, mean, sigma, sigma2):\n", 403 | " return -math.log(\n", 404 | " f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)\n", 405 | " )\n", 406 | "\n", 407 | "\n", 408 | "@numba.cuda.reduce\n", 409 | "def sum_reduce(a, b):\n", 410 | " return a + b\n", 411 | "\n", 412 | "\n", 413 | "def nll(dist, f_0, mean, sigma, sigma2):\n", 414 | " return sum_reduce(log_add(dist, f_0, mean, sigma, sigma2))" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "id": "25072a53-73a9-4bf8-abdd-823a7f5b5ad7", 420 | "metadata": {}, 421 | "source": [ 422 | "Numba and CuPy support 0-cost transfer between libraries, so you can select the tool that's best for you! We'll make a Numba device vector from our CuPy one. `cupy.asarray(nbdist)` would convert back." 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "6e80879e-88e3-4873-9ced-ee9dffb3b3f9", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "nbdist = numba.cuda.to_device(cpdist)" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "id": "7eb11cd4-422e-4660-9528-48c980e371d0", 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "nll(nbdist, 0.5, 1.0, 2.0, 0.5)" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "id": "942114a0-0f7b-41a2-a1fb-315317508ad2", 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "%%timeit\n", 453 | "nll(\n", 454 | " nbdist,\n", 455 | " rng.random(),\n", 456 | " rng.normal(loc=1, scale=0.3),\n", 457 | " rng.normal(loc=2, scale=0.5),\n", 458 | " rng.normal(loc=0.5, scale=0.1),\n", 459 | ")" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "id": "f659e0f9-7156-43ca-bd54-4b8a21a4ebf7", 465 | "metadata": {}, 466 | "source": [ 467 | "This is basically on par with the ReductionKernel, as expected." 468 | ] 469 | } 470 | ], 471 | "metadata": { 472 | "kernelspec": { 473 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 474 | "language": "python", 475 | "name": "sys_pygpu201912" 476 | }, 477 | "language_info": { 478 | "codemirror_mode": { 479 | "name": "ipython", 480 | "version": 3 481 | }, 482 | "file_extension": ".py", 483 | "mimetype": "text/x-python", 484 | "name": "python", 485 | "nbconvert_exporter": "python", 486 | "pygments_lexer": "ipython3", 487 | "version": "3.9.7" 488 | } 489 | }, 490 | "nbformat": 4, 491 | "nbformat_minor": 5 492 | } 493 | -------------------------------------------------------------------------------- /04_ode.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3144bb5e-9afe-47e6-a45e-0ba66de386de", 6 | "metadata": {}, 7 | "source": [ 8 | "# Failure: not all code is faster\n", 9 | "\n", 10 | "Let's look at a problem that is slower on a GPU when you convert to CuPy. 
This is a simple ODE solver." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "7dcb100d-b4cf-4135-a8e9-1fc58da6d69e", 16 | "metadata": {}, 17 | "source": [ 18 | "## Classic NumPy code" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "4d72dca1-3074-416b-a5d4-78892de62ad2", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np\n", 29 | "import matplotlib.pyplot as plt" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "id": "cb85fd9e-d62c-4de8-8924-a04052a3402f", 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "x_max = 1 # Size of x max\n", 40 | "v_0 = 0\n", 41 | "koverm = 1 # k / m" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "id": "c4ac304d-2faa-464f-8f86-caaa5c8ab670", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "def f(t, y):\n", 52 | " \"Y has two elements, x and v\"\n", 53 | " return np.array([-koverm * y[1], y[0]])\n", 54 | "\n", 55 | "\n", 56 | "def euler_ivp(f, init_y, t):\n", 57 | " steps = len(t)\n", 58 | " order = len(init_y) # Number of equations\n", 59 | "\n", 60 | " y = np.empty((steps, order))\n", 61 | " y[0] = init_y # Note that this sets the elements of the first row\n", 62 | "\n", 63 | " for n in range(steps - 1):\n", 64 | " h = t[n + 1] - t[n]\n", 65 | "\n", 66 | " # Compute dydt based on *current* position\n", 67 | " dydt = f(t[n], y[n])\n", 68 | "\n", 69 | " # Compute next velocity and position\n", 70 | " y[n + 1] = y[n] - dydt * h\n", 71 | "\n", 72 | " return y" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "id": "fa07c447-c9b5-4cd1-a93b-46e2ec34b0ce", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "ts = np.linspace(0, 40, 1000 + 1)\n", 83 | "y = euler_ivp(f, [x_max, v_0], ts)\n", 84 | "plt.plot(ts, np.cos(ts))\n", 85 | "plt.plot(ts, y[:, 0], \"--\")" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "id": "cc7e4e91-e9cf-4ee7-b144-9cb71733a456", 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "%%timeit\n", 96 | "y = euler_ivp(f, [x_max, v_0], ts)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "id": "b90716ba-151d-4efa-97b4-17cd400c3650", 102 | "metadata": {}, 103 | "source": [ 104 | "## CuPy" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "id": "4d776345-b11b-4df4-8b84-9f5419804bcb", 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "import cupy as cp\n", 115 | "import cupyx" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "d9018b6c-d927-4def-84dd-ca4eb9724e66", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "def f(t, y):\n", 126 | " \"Y has two elements, x and v\"\n", 127 | " # xp = cp.get_array_module(t)\n", 128 | " return cp.array([-koverm * y[1], y[0]])\n", 129 | "\n", 130 | "\n", 131 | "def euler_ivp(f, init_y, t):\n", 132 | " # xp = cp.get_array_module(t)\n", 133 | " steps = len(t)\n", 134 | " order = len(init_y) # Number of equations\n", 135 | "\n", 136 | " y = cp.empty((steps, order))\n", 137 | " y[0] = init_y # Note that this sets the elements of the first row\n", 138 | "\n", 139 | " for n in range(steps - 1):\n", 140 | " h = t[n + 1] - t[n]\n", 141 | "\n", 142 | " # Compute dydt based on *current* position\n", 143 | " dydt = f(t[n], y[n])\n", 144 | "\n", 145 | " # Compute next velocity and position\n", 146 | " y[n + 1] = y[n] - dydt * 
h\n", 147 | "\n", 148 | " return y" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "id": "7eaf16b6-1139-4146-bd2a-c9abb3d0efd4", 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "ts = cp.linspace(0, 40, 1000 + 1)\n", 159 | "y = euler_ivp(f, cp.array([x_max, v_0]), ts)\n", 160 | "plt.plot(ts.get(), np.cos(ts).get())\n", 161 | "plt.plot(ts.get(), y[:, 0].get(), \"--\")" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "id": "f0a4742c-1548-4031-a3ab-52094781192a", 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "%%timeit\n", 172 | "y = euler_ivp(f, cp.array([x_max, v_0]), ts)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "4552b388-2109-4982-9a79-0e55351444e7", 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "def f(t, y):\n", 183 | " \"Y has two elements, x and v\"\n", 184 | " return -koverm * y[1], y[0]\n", 185 | "\n", 186 | "\n", 187 | "def euler_ivp(f, init_y, t):\n", 188 | " steps = len(t)\n", 189 | " order = len(init_y) # Number of equations (2)\n", 190 | "\n", 191 | " y = cp.empty((steps, order))\n", 192 | " y[0] = init_y # Note that this sets the elements of the first row\n", 193 | "\n", 194 | " for n in range(steps - 1):\n", 195 | " h = t[n + 1] - t[n]\n", 196 | "\n", 197 | " # Compute dydt based on *current* position\n", 198 | " dydt_0, dydt_1 = f(t[n], y[n])\n", 199 | "\n", 200 | " # Compute next velocity and position\n", 201 | " y[n + 1, 0] = y[n, 0] - dydt_0 * h\n", 202 | " y[n + 1, 1] = y[n, 1] - dydt_1 * h\n", 203 | "\n", 204 | " return y" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "id": "40a391d0-37b8-4cef-8e05-f8a8e92c2e77", 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "ts = cp.linspace(0, 40, 1000 + 1)\n", 215 | "y = euler_ivp(f, cp.array([x_max, v_0]), ts)\n", 216 | "plt.plot(ts.get(), np.cos(ts).get())\n", 217 | "plt.plot(ts.get(), y[:, 0].get(), \"--\")" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "id": "cfc5b3c0-d238-4e24-91aa-e264505471ca", 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "%%timeit\n", 228 | "y = euler_ivp(f, cp.array([x_max, v_0]), ts)" 229 | ] 230 | } 231 | ], 232 | "metadata": { 233 | "kernelspec": { 234 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 235 | "language": "python", 236 | "name": "sys_pygpu201912" 237 | }, 238 | "language_info": { 239 | "codemirror_mode": { 240 | "name": "ipython", 241 | "version": 3 242 | }, 243 | "file_extension": ".py", 244 | "mimetype": "text/x-python", 245 | "name": "python", 246 | "nbconvert_exporter": "python", 247 | "pygments_lexer": "ipython3", 248 | "version": "3.9.7" 249 | } 250 | }, 251 | "nbformat": 4, 252 | "nbformat_minor": 5 253 | } 254 | -------------------------------------------------------------------------------- /ExampleRunner.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%load_ext sbatch_magic" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Enter the name of the example you want to run below, without the extension (`such as 01a_fractal_cupy`)." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "%%sbatch 01a_fractal_cupy\n", 26 | "#!/bin/bash\n", 27 | "# GPU job\n", 28 | "\n", 29 | "#SBATCH --nodes=1 # node count\n", 30 | "#SBATCH --ntasks=1 # total number of tasks across all nodes\n", 31 | "#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)\n", 32 | "#SBATCH --gres=gpu # number of gpus per node\n", 33 | "#SBATCH --mem=1G # total memory (RAM) per node\n", 34 | "#SBATCH --reservation=pygpu # reservation for the class\n", 35 | "#SBATCH --time=00:01:30 # total run time limit (HH:MM:SS)\n", 36 | "\n", 37 | "module purge\n", 38 | "module load course/pygpu/default\n", 39 | "\n", 40 | "time jupyter nbconvert --to html --execute --ExecutePreprocessor.timeout=60 {name}.ipynb" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "!squeue -u $USER" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "Open the new html page with the same name as the input to see the output!" 57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "kernelspec": { 62 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 63 | "language": "python", 64 | "name": "sys_pygpu201912" 65 | }, 66 | "language_info": { 67 | "codemirror_mode": { 68 | "name": "ipython", 69 | "version": 3 70 | }, 71 | "file_extension": ".py", 72 | "mimetype": "text/x-python", 73 | "name": "python", 74 | "nbconvert_exporter": "python", 75 | "pygments_lexer": "ipython3", 76 | "version": "3.8.6" 77 | } 78 | }, 79 | "nbformat": 4, 80 | "nbformat_minor": 4 81 | } 82 | -------------------------------------------------------------------------------- /ExampleRunnerExample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Set this name to the base name of the notebook you want to run on a GPU." 
8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "name = \"01a_fractal_cupy\"" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "from IPython.core.magic import register_line_cell_magic\n", 26 | "import time\n", 27 | "from pathlib import Path" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "This is a quick magic command that behaves exactly like `%%writefile`, except with variable templating." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "@register_line_cell_magic\n", 44 | "def writetemplate(line, cell):\n", 45 | " print(\"(Over)writing\", line)\n", 46 | " with open(line, \"w\") as f:\n", 47 | " f.write(cell.format(**globals()))" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "And a quick utility to watch the file until it signals that it is done:" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "def watch_file(filename):\n", 64 | " filename = Path(filename)\n", 65 | " while not filename.exists():\n", 66 | " time.sleep(0.5)\n", 67 | " with open(filename) as f:\n", 68 | " while True:\n", 69 | " r = f.readline()\n", 70 | " if \"[ADROIT-DONE]\" in r:\n", 71 | " break\n", 72 | " elif r == \"\":\n", 73 | " time.sleep(0.5)\n", 74 | " else:\n", 75 | " print(r, end=\"\")\n", 76 | "\n", 77 | " print(\"Done!\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "> Note:\n", 85 | "> \n", 86 | "> In Python 3.8 the loop could be written:\n", 87 | "> ```python\n", 88 | "> while \"[ADROIT-DONE]\" not in (r := f.readline()):\n", 89 | ">     if r == '':\n", 90 | ">         time.sleep(.5)\n", 91 | ">     else:\n", 92 | ">         print(r, end='')\n", 93 | "> ```\n", 94 | ">" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "%%writetemplate {name}.sbatch\n", 104 | "#!/bin/bash\n", 105 | "# GPU job\n", 106 | "\n", 107 | "#SBATCH --job-name={name} # create a short name for your job\n", 108 | "#SBATCH -o {name}.out # Name of stdout output file (%j expands to jobId)\n", 109 | "#SBATCH --nodes=1 # node count\n", 110 | "#SBATCH --ntasks=1 # total number of tasks across all nodes\n", 111 | "#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)\n", 112 | "#SBATCH --gres=gpu # number of gpus per node\n", 113 | "#SBATCH --mem=1G # total memory (RAM) per node\n", 114 | "#SBATCH --reservation=pygpu # reservation for the class\n", 115 | "#SBATCH --time=00:01:30 # total run time limit (HH:MM:SS)\n", 116 | "\n", 117 | "module purge\n", 118 | "module load course/pygpu/default\n", 119 | "\n", 120 | "time jupyter nbconvert --to html --execute --ExecutePreprocessor.timeout=60 {name}.ipynb\n", 121 | "\n", 122 | "echo\n", 123 | "echo \"[ADROIT-DONE]\"" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "To request a specific GPU model, append `:tesla_v100:1` to the `--gres=gpu` line above (as is done in `interactive/interactive.sbatch`)." 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "We remove the old job log and the converted HTML output, if present.",
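"\n",
"Note that plain `rm` complains when the files do not exist yet; `rm -f` is a quiet alternative if you prefer:\n",
"\n",
"```python\n",
"!rm -f {name}.out {name}.html\n",
"```"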
138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "!rm {name}.out {name}.html" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "Time to submit the run!" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "!sbatch {name}.sbatch" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "And now let's watch the output file:" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "watch_file(f\"{name}.out\");" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "Open the new HTML file produced by nbconvert to see the output!" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [] 194 | } 195 | ], 196 | "metadata": { 197 | "kernelspec": { 198 | "display_name": "PyGPU Course 2019/12 [course/pygpu/default]", 199 | "language": "python", 200 | "name": "sys_pygpu201912" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | "name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.8.6" 213 | } 214 | }, 215 | "nbformat": 4, 216 | "nbformat_minor": 4 217 | } 218 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PyGPU: High performance Python for GPUs 2 | 3 | ## Henry Schreiner 4 | 5 | This minicourse covers ways to speed up your code using GPUs. 6 | Since many of us do not have a reasonable (NVidia) GPU on our 7 | laptops, the course is designed to be run on our local teaching 8 | cluster. You will need to be on the Princeton network, and will 9 | need to be able to access Adroit (registration beforehand required). 10 | 11 | ## Princeton setup (Adroit) 12 | 13 | #### Git clone 14 | 15 | Log into our OnDemand site. You will want to 16 | select "Clusters -> Shell" on the header bar. 17 | 18 | ![Header bar image](./images/HeaderBar.png) 19 | 20 | Now, you'll want to type: 21 | 22 | ```bash 23 | git clone https://github.com/henryiii/pygpu-minicourse 24 | ``` 25 | 26 | This will get the course materials. Press CTRL+D to quit. 27 | 28 | #### Start up a CPU instance 29 | 30 | We will be working with a small number of shared GPUs, so you'll want to work 31 | in a CPU only instance, and only submit notebooks to the GPUs one at a time (so 32 | you don't block them for others). 33 | 34 | Back on the header bar on the original page, click "Interactive Apps" or "My 35 | Interactive sessions", then select "Jupyter". You should see a page that looks 36 | like this: 37 | 38 | > ![Setup page](./images/SetupPage.png) 39 | 40 | Make sure you have **checked the JupyterLab** checkbox, that you have enough 41 | time (at least 2 hours), and that you have entered our reservation (`pygpu`). 42 | Leave the extra slurm options blank. (Without a reservation, 43 | `--gres=gpu --constraint=a100` would request a GPU and set its type.) 44 | 45 | The Anaconda3 version is `custom`.
The module name is `course/pygpu/default`. 46 | 47 | After you click launch, you should soon see a button that looks like this: 48 | 49 | ![Button to click](./images/ButtonToClick.png) 50 | 51 | Click it to enter JupyterLab! 52 | 53 | ### Local setup 54 | 55 | If you have a GPU, you can install the environment provided in 56 | `environment.yml` with Conda. You'll probably have to choose a 57 | kernel when you launch it (and you may need the `nb_conda_kernels` package). 58 | 59 | ## Running GPU kernels 60 | 61 | Load the `ExampleRunner.ipynb` notebook. You can enter the name of a GPU 62 | notebook (without the extension) at the top of the provided cell, and run that 63 | to submit the notebook as a job. 64 | 65 | ## Survey 66 | 67 | Link: See chat. 68 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: gpu-minicourse 2 | channels: 3 | - pytorch 4 | - conda-forge 5 | dependencies: 6 | - cudatoolkit 7 | - cudnn 8 | - cupy>=9.0 9 | - cutensor 10 | - iminuit 11 | - ipympl 12 | - ipywidgets 13 | - jupyterlab>=3 14 | - line_profiler 15 | - matplotlib>=3.4 16 | - nb_conda_kernels 17 | - nccl 18 | - nodejs 19 | - numba>=0.55 20 | - numpy>=1.20 21 | - plumbum>=1.7.0 22 | - python-graphviz 23 | - python>=3.9 24 | - pytorch 25 | - scipy 26 | # Disabled 27 | #- tensorflow-gpu 28 | -------------------------------------------------------------------------------- /images/ButtonToClick.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/henryiii/pygpu-minicourse/56eab26e8b645709890af8e9c4691b986875927a/images/ButtonToClick.png -------------------------------------------------------------------------------- /images/HeaderBar.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/henryiii/pygpu-minicourse/56eab26e8b645709890af8e9c4691b986875927a/images/HeaderBar.png -------------------------------------------------------------------------------- /images/LanguageInterest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/henryiii/pygpu-minicourse/56eab26e8b645709890af8e9c4691b986875927a/images/LanguageInterest.png -------------------------------------------------------------------------------- /images/LibraryInterest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/henryiii/pygpu-minicourse/56eab26e8b645709890af8e9c4691b986875927a/images/LibraryInterest.png -------------------------------------------------------------------------------- /images/SetupPage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/henryiii/pygpu-minicourse/56eab26e8b645709890af8e9c4691b986875927a/images/SetupPage.png -------------------------------------------------------------------------------- /interactive/MinicondaInstallNotes.txt: -------------------------------------------------------------------------------- 1 | # I downloaded Miniconda from the miniconda website (now Mambaforge) and ran it: 2 | 3 | # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 4 | # wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh 5 | 6 | # chmod +x Mambaforge-Linux-x86_64.sh 7 | # ./Mambaforge-Linux-x86_64.sh
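# (The installer asks for an install path and whether to initialize your shell; the answers used are noted below.)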
8 | 9 | 10 | # I gave the /opt/export/course/pygpu/miniconda path, and answered no when 11 | # it wanted me to set up the conda command. 12 | 13 | # eval "$(/opt/export/course/pygpu/miniconda/bin/conda shell.bash hook)" 14 | 15 | # mamba env update -f environment.yml -n base 16 | 17 | # jupyter labextension install @ijmbarr/jupyterlab_spellchecker 18 | 19 | # Then I added the environment module. I had 20 | # to update the environment "base" since that's what we tie to. :/ 21 | 22 | 23 | # New version: 24 | 25 | cd /opt/export/course/pygpu 26 | curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba 27 | ./bin/micromamba create -f environment.yml -r ./miniconda -n base 28 | 29 | -------------------------------------------------------------------------------- /interactive/course.pygpu.default: -------------------------------------------------------------------------------- 1 | #%Module 2 | 3 | proc ModulesHelp { } { 4 | puts stderr "This module adds pygpu miniconda to your path" 5 | } 6 | 7 | module-whatis "Sets up pygpu miniconda in your environment" 8 | 9 | 10 | prepend-path PATH "/opt/export/course/pygpu/miniconda/bin" 11 | setenv CONDA_DEFAULT_ENV base 12 | setenv CONDA_EXE "/opt/export/course/pygpu/miniconda/bin/conda" 13 | setenv CONDA_PREFIX "/opt/export/course/pygpu/miniconda" 14 | setenv CONDA_PROMPT_MODIFIER '' 15 | setenv CONDA_SHLVL 1 16 | setenv _CE_CONDA "" 17 | setenv _CE_M "" 18 | 19 | set-alias conda { 20 | if [ "$#" -lt 1 ]; then 21 | "$CONDA_EXE" $_CE_M $_CE_CONDA; 22 | else 23 | \\local cmd="$1"; 24 | shift; 25 | case "$cmd" in 26 | activate | deactivate) 27 | __conda_activate "$cmd" "$@" 28 | ;; 29 | install | update | upgrade | remove | uninstall) 30 | "$CONDA_EXE" $_CE_M $_CE_CONDA "$cmd" "$@" && __conda_reactivate 31 | ;; 32 | *) 33 | "$CONDA_EXE" $_CE_M $_CE_CONDA "$cmd" "$@" 34 | ;; 35 | esac; 36 | fi 37 | } 38 | 39 | set-alias __conda_activate { 40 | if [ -n "${CONDA_PS1_BACKUP:+x}" ]; then 41 | PS1="$CONDA_PS1_BACKUP"; 42 | \\unset CONDA_PS1_BACKUP; 43 | fi; 44 | \\local cmd="$1"; 45 | shift; 46 | \\local ask_conda; 47 | ask_conda="$(PS1="$PS1" "$CONDA_EXE" $_CE_M $_CE_CONDA shell.posix "$cmd" "$@")" || \\return $?; 48 | \\eval "$ask_conda"; 49 | \\hash -r 50 | } 51 | 52 | set-alias __conda_reactivate { 53 | \\local ask_conda; 54 | ask_conda="$(PS1="$PS1" "$CONDA_EXE" $_CE_M $_CE_CONDA shell.posix reactivate)" || \\return $?; 55 | \\eval "$ask_conda"; 56 | \\hash -r 57 | } 58 | -------------------------------------------------------------------------------- /interactive/default: -------------------------------------------------------------------------------- 1 | #%Module 2 | 3 | proc ModulesHelp { } { 4 | puts stderr "This module adds pygpu 2019.12 miniconda to your path" 5 | } 6 | 7 | module-whatis "Sets up pygpu 2019.12 miniconda in your environment" 8 | 9 | prepend-path PATH "/opt/export/course/pygpu/miniconda/bin" 10 | setenv CONDA_DEFAULT_ENV base 11 | setenv CONDA_EXE "/opt/export/course/pygpu/miniconda/bin/conda" 12 | setenv CONDA_PREFIX "/opt/export/course/pygpu/miniconda" 13 | setenv CONDA_PROMPT_MODIFIER '' 14 | setenv CONDA_SHLVL 1 15 | setenv _CE_CONDA "" 16 | setenv _CE_M "" 17 | 18 | set-alias conda { 19 | if [ "$#" -lt 1 ]; then 20 | "$CONDA_EXE" $_CE_M $_CE_CONDA; 21 | else 22 | \\local cmd="$1"; 23 | shift; 24 | case "$cmd" in 25 | activate | deactivate) 26 | __conda_activate "$cmd" "$@" 27 | ;; 28 | install | update | upgrade | remove | uninstall) 29 | "$CONDA_EXE" $_CE_M $_CE_CONDA "$cmd" "$@" &&
__conda_reactivate 30 | ;; 31 | *) 32 | "$CONDA_EXE" $_CE_M $_CE_CONDA "$cmd" "$@" 33 | ;; 34 | esac; 35 | fi 36 | } 37 | 38 | set-alias __conda_activate { 39 | if [ -n "${CONDA_PS1_BACKUP:+x}" ]; then 40 | PS1="$CONDA_PS1_BACKUP"; 41 | \\unset CONDA_PS1_BACKUP; 42 | fi; 43 | \\local cmd="$1"; 44 | shift; 45 | \\local ask_conda; 46 | ask_conda="$(PS1="$PS1" "$CONDA_EXE" $_CE_M $_CE_CONDA shell.posix "$cmd" "$@")" || \\return $?; 47 | \\eval "$ask_conda"; 48 | \\hash -r 49 | } 50 | 51 | set-alias __conda_reactivate { 52 | \\local ask_conda; 53 | ask_conda="$(PS1="$PS1" "$CONDA_EXE" $_CE_M $_CE_CONDA shell.posix reactivate)" || \\return $?; 54 | \\eval "$ask_conda"; 55 | \\hash -r 56 | } 57 | -------------------------------------------------------------------------------- /interactive/environment.yml: -------------------------------------------------------------------------------- 1 | ../environment.yml -------------------------------------------------------------------------------- /interactive/interactive.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # GPU job 3 | 4 | #SBATCH --job-name=cupy-job # create a short name for your job 5 | #SBATCH -o jupyterlab.out # Name of stdout output file (%j expands to jobId) 6 | #SBATCH --nodes=1 # node count 7 | #SBATCH --ntasks=1 # total number of tasks across all nodes 8 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 9 | #SBATCH --gres=gpu:tesla_v100:1 # number of gpus per node 10 | #SBATCH --mem=1G # total memory (RAM) per node 11 | #SBATCH --time=01:00:00 # total run time limit (HH:MM:SS) 12 | 13 | module purge 14 | module load course/pygpu 15 | 16 | echo "Job $SLURM_JOB_ID execution at: `date`" 17 | 18 | NODE_HOSTNAME=`hostname -s` 19 | LOCAL_PORT="8123" 20 | 21 | echo "" 22 | echo "Your jupyter lab server is about to start!" 23 | echo "To connect:" 24 | echo " ssh -J $USER@adroit $NODE_HOSTNAME -L $LOCAL_PORT:localhost:$LOCAL_PORT" 25 | echo "The web address and token should be listed below." 26 | echo "Manually cancel with:" 27 | echo " scancel $SLURM_JOB_ID" 28 | echo "" 29 | 30 | # Execution holds here until user clicks quit or time runs out 31 | 32 | jupyter lab --no-browser --port=$LOCAL_PORT 33 | 34 | echo "job $SLURM_JOB_ID execution finished at: `date`" 35 | -------------------------------------------------------------------------------- /sbatch_magic.py: -------------------------------------------------------------------------------- 1 | from IPython.core.magic import line_cell_magic 2 | from IPython import display 3 | import asyncio 4 | from io import StringIO 5 | from pathlib import Path 6 | 7 | 8 | class IWriter(object): 9 | "Class that sets up a live output display cell. .add(msg) will append to output." 
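# Class-level counter used as a unique display_id for each new live-output display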
10 | WATCH_COUNT = 1 11 | 12 | def __init__(self, msg): 13 | self.io = StringIO() 14 | self.io.write(msg) 15 | self.handle = display.display( 16 | {"text/plain": self.io.getvalue()}, 17 | raw=True, 18 | display_id=self.__class__.WATCH_COUNT, 19 | ) 20 | self.__class__.WATCH_COUNT += 1 21 | 22 | def add(self, msg): 23 | self.io.write(msg) 24 | self.handle.update({"text/plain": self.io.getvalue()}, raw=True) 25 | 26 | 27 | async def submit_file(name, writer): 28 | "Submit a file, print the submission message and return the final item (the job number)" 29 | proc = await asyncio.create_subprocess_exec( 30 | "sbatch", 31 | f"--job-name={name}", 32 | f"--output={name}.out", 33 | f"{name}.sbatch", 34 | stdout=asyncio.subprocess.PIPE, 35 | stderr=asyncio.subprocess.PIPE, 36 | ) 37 | 38 | stdout, stderr = await proc.communicate() 39 | stdout = stdout.decode() 40 | 41 | writer.add(stdout) 42 | assert "Submitted batch job " in stdout, "Invalid job submission output" 43 | return stdout.split()[-1] 44 | 45 | 46 | async def watch_file(filename, writer): 47 | "Watch a file, print live output, and exit once the done sentinel is seen." 48 | writer.add(f"Waiting for {filename}...\n") 49 | 50 | filename = Path(filename) 51 | while not filename.exists(): 52 | await asyncio.sleep(0.5) 53 | with open(filename) as f: 54 | while True: 55 | r = f.readline() 56 | if "[SBATCH-DONE]" in r: 57 | break 58 | elif r == "": 59 | await asyncio.sleep(0.5) 60 | else: 61 | writer.add(r) 62 | 63 | writer.add("Done!") 64 | 65 | 66 | async def submit_and_watch(name): 67 | "Run submit and watch jobs" 68 | 69 | writer = IWriter(f"Submitting {name}.sbatch\n") 70 | 71 | sbatch = Path(f"{name}.sbatch") 72 | out = Path(f"{name}.out") 73 | 74 | if out.exists(): 75 | out.unlink() 76 | 77 | jobnum = await submit_file(name, writer) 78 | 79 | await watch_file(out, writer) 80 | 81 | if out.exists(): 82 | out.unlink() 83 | 84 | if sbatch.exists(): 85 | sbatch.unlink() 86 | 87 | 88 | def sbatch(line, cell): 89 | "Submit a job by name" 90 | (name,) = line.split() # Name required 91 | assert "." not in name, "Do not include extension!" 92 | 93 | txt = cell.format(name=name, **globals()) + 'echo\n echo "[SBATCH-DONE]"\n' 94 | 95 | with open(f"{name}.sbatch", "w") as f: 96 | f.write(txt) 97 | 98 | asyncio.ensure_future(submit_and_watch(name)) 99 | 100 | 101 | def load_ipython_extension(ipython): 102 | ipython.register_magic_function(sbatch, magic_kind="cell") 103 | --------------------------------------------------------------------------------