├── .github └── workflows │ └── deploy.yml ├── .gitignore ├── CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── Makefile ├── README.md ├── hpc_lecture_notes ├── 2021-assignment_1.ipynb ├── 2021-assignment_2.md ├── 2021-assignment_3.md ├── 2021-assignment_4.md ├── 2022-a4-A_and_b.md ├── 2022-assignment_1.md ├── 2022-assignment_2.md ├── 2022-assignment_3.md ├── 2022-assignment_4.md ├── 2022-class_1.md ├── 2022-class_2.md ├── 2022-class_3.md ├── 2022-class_4.md ├── 2022-class_5.md ├── 2022-class_6.md ├── 2022-class_7.md ├── 2022-lsa_1.md ├── 2022-lsa_3.md ├── 2022-lsa_4.md ├── 2022_classes.md ├── 2022_matrices_and_simultaneous_equations.md ├── 2023-assignment_1-lsa.md ├── 2023-assignment_1.md ├── 2023-assignment_2-lsa.md ├── 2023-assignment_2.md ├── 2023-assignment_3-lsa.md ├── 2023-assignment_3.md ├── 2023-assignment_4-lsa.md ├── 2023-assignment_4.md ├── _config.yml ├── _toc.yml ├── cpu_logo.png ├── cuda_introduction.md ├── favicon.ico ├── further_topics.ipynb ├── gpu_introduction.md ├── hpc_languages.md ├── img │ ├── 2022a4-mesh.png │ ├── a100_sm.png │ ├── byte_array.png │ ├── simd_addition.png │ ├── thread_numbering.png │ └── top500development.png ├── intro.md ├── it_solvers1.ipynb ├── it_solvers2.ipynb ├── it_solvers3.ipynb ├── it_solvers4.ipynb ├── multigrid.ipynb ├── numba_cuda.ipynb ├── numexpr.ipynb ├── numpy_and_data_layouts.ipynb ├── parallel_principles.md ├── pde_example.md ├── petsc_for_sparse_systems.ipynb ├── python_hpc_tools.md ├── rbf_evaluation.ipynb ├── references.bib ├── simd.ipynb ├── simple_time_stepping.ipynb ├── sparse_data_structures.ipynb ├── sparse_direct_solvers.ipynb ├── sparse_linalg_pde.ipynb ├── sparse_solvers_introduction.ipynb ├── wave_equation.ipynb ├── what_is_hpc.md └── working_with_numba.ipynb ├── other ├── byte_array.odg └── simd_addition.odg └── requirements.txt /.github/workflows/deploy.yml: -------------------------------------------------------------------------------- 1 | name: deploy 2 | 3 | on: 4 | # Trigger the workflow on 
push to master branch 5 | push: 6 | branches: 7 | - master 8 | 9 | # This job installs dependencies, builds the book, and pushes it to `gh-pages` 10 | jobs: 11 | build-and-deploy-book: 12 | runs-on: ${{ matrix.os }} 13 | strategy: 14 | matrix: 15 | os: [ubuntu-latest] 16 | python-version: [3.8] 17 | steps: 18 | - uses: actions/checkout@v2 19 | 20 | # Install dependencies 21 | - name: Set up Python ${{ matrix.python-version }} 22 | uses: actions/setup-python@v1 23 | with: 24 | python-version: ${{ matrix.python-version }} 25 | - name: Install dependencies 26 | run: | 27 | pip install -r requirements.txt 28 | 29 | # Build the book 30 | - name: Build the book 31 | run: | 32 | jupyter-book build hpc_lecture_notes 33 | 34 | # Deploy the book's HTML to gh-pages branch 35 | - name: GitHub Pages action 36 | uses: peaceiris/actions-gh-pages@v3.6.1 37 | with: 38 | github_token: ${{ secrets.GITHUB_TOKEN }} 39 | publish_dir: hpc_lecture_notes/_build/html -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | */_build/* 2 | */.ipynb_checkpoints/* 3 | 4 | **/.DS_Store 5 | .*.swp 6 | -------------------------------------------------------------------------------- /CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
7 | 8 | ## Our Standards 9 | 10 | Examples of behavior that contributes to creating a positive environment include: 11 | 12 | * Using welcoming and inclusive language 13 | * Being respectful of differing viewpoints and experiences 14 | * Gracefully accepting constructive criticism 15 | * Focusing on what is best for the community 16 | * Showing empathy towards other community members 17 | 18 | Examples of unacceptable behavior by participants include: 19 | 20 | * The use of sexualized language or imagery and unwelcome sexual attention or advances 21 | * Trolling, insulting/derogatory comments, and personal or political attacks 22 | * Public or private harassment 23 | * Publishing others' private information, such as a physical or electronic address, without explicit permission 24 | * Other conduct which could reasonably be considered inappropriate in a professional setting 25 | 26 | ## Our Responsibilities 27 | 28 | Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. 29 | 30 | Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. 31 | 32 | ## Scope 33 | 34 | This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. 
35 | 36 | ## Enforcement 37 | 38 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. 39 | 40 | Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership. 41 | 42 | ## Attribution 43 | 44 | This Code of Conduct is adapted from the [Contributor Covenant, version 1.4](http://contributor-covenant.org/version/1/4). 45 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | Contributions are welcome, and they are greatly appreciated! Every little bit 4 | helps, and credit will always be given. You can contribute in the ways listed below. 5 | 6 | ## Report Bugs 7 | 8 | Report bugs using GitHub issues. 9 | 10 | If you are reporting a bug, please include: 11 | 12 | * Your operating system name and version. 13 | * Any details about your local setup that might be helpful in troubleshooting. 14 | * Detailed steps to reproduce the bug. 15 | 16 | ## Fix Bugs 17 | 18 | Look through the GitHub issues for bugs. Anything tagged with "bug" and "help 19 | wanted" is open to whoever wants to implement it. 20 | 21 | ## Implement Features 22 | 23 | Look through the GitHub issues for features. Anything tagged with "enhancement" 24 | and "help wanted" is open to whoever wants to implement it. 
25 | 26 | ## Write Documentation 27 | 28 | Techniques of High-Performance Computing - Lecture Notes could always use more documentation, whether as part of the 29 | official Techniques of High-Performance Computing - Lecture Notes docs, in docstrings, or even on the web in blog posts, 30 | articles, and such. 31 | 32 | ## Submit Feedback 33 | 34 | The best way to send feedback is to file an issue on GitHub. 35 | 36 | If you are proposing a feature: 37 | 38 | * Explain in detail how it would work. 39 | * Keep the scope as narrow as possible, to make it easier to implement. 40 | * Remember that this is a volunteer-driven project, and that contributions 41 | are welcome :) 42 | 43 | ## Get Started 44 | 45 | Ready to contribute? Here's how to set up `Techniques of High-Performance Computing - Lecture Notes` for local development. 46 | 47 | 1. Fork the repo on GitHub. 48 | 2. Clone your fork locally. 49 | 3. Install your local copy into a virtualenv, e.g., using `conda`. 50 | 4. Create a branch for local development and make changes locally. 51 | 5. Commit your changes and push your branch to GitHub. 52 | 6. Submit a pull request through the GitHub website. 53 | 54 | ## Code of Conduct 55 | 56 | Please note that the Techniques of High-Performance Computing - Lecture Notes project is released with a [Contributor Code of Conduct](CONDUCT.md). By contributing to this project you agree to abide by its terms. 57 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD License 2 | 3 | Copyright (c) 2020--22, Timo Betcke & Matthew Scroggs 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without modification, 7 | are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 
11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, this 13 | list of conditions and the following disclaimer in the documentation and/or 14 | other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from this 18 | software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 21 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 22 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 23 | IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, 24 | INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 25 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 26 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY 27 | OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE 28 | OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED 29 | OF THE POSSIBILITY OF SUCH DAMAGE. 
30 | 31 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | default: 2 | ~/miniconda3/envs/jupyter-book/bin/jupyter-book build ./hpc_lecture_notes 3 | 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Techniques of High-Performance Computing - Lecture Notes 2 | 3 | Lecture Notes for the module Techniques of High-Performance Computing 4 | 5 | ## Usage 6 | 7 | ### Building the book 8 | 9 | If you'd like to develop on and build the Techniques of High-Performance Computing - Lecture Notes book, you should: 10 | 11 | - Clone this repository 12 | - Run `pip install -r requirements.txt` (it is recommended you do this within a virtual environment) 13 | - (Recommended) Remove the existing `hpc_lecture_notes/_build/` directory 14 | - Run `jupyter-book build hpc_lecture_notes/` 15 | 16 | A fully-rendered HTML version of the book will be built in `hpc_lecture_notes/_build/html/`. 17 | 18 | ### Hosting the book 19 | 20 | The HTML version of the book is hosted on the `gh-pages` branch of this repo. A GitHub Actions workflow has been created that automatically builds and pushes the book to this branch on a push or pull request to master. 21 | 22 | If you wish to disable this automation, you may remove the GitHub Actions workflow and build the book manually by: 23 | 24 | - Navigating to your local build directory and running 25 | - `ghp-import -n -p -f hpc_lecture_notes/_build/html` 26 | 27 | This will automatically push your build to the `gh-pages` branch.
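Concretely, the full manual build-and-publish sequence amounts to the following commands. This is a sketch, assuming `pip`, `jupyter-book` and `ghp-import` are on your `PATH`; the `hpc_lecture_notes` path follows this repository's layout as used in the Makefile and the deploy workflow:

```shell
# Build the book and publish the rendered HTML to the gh-pages branch.
pip install -r requirements.txt                    # installs jupyter-book and its dependencies
jupyter-book build hpc_lecture_notes/              # renders HTML into hpc_lecture_notes/_build/html/
ghp-import -n -p -f hpc_lecture_notes/_build/html  # force-push the HTML to gh-pages (-n adds .nojekyll)
```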
More information on this hosting process can be found [here](https://jupyterbook.org/publish/gh-pages.html#manually-host-your-book-with-github-pages). 28 | 29 | ## Contributors 30 | 31 | We welcome and recognize all contributions. You can see a list of current contributors in the [contributors tab](https://github.com/tbetcke/hpc_lecture_notes/graphs/contributors). 32 | 33 | ## Credits 34 | 35 | This project was created using the excellent open source [Jupyter Book project](https://jupyterbook.org/) and the [executablebooks/cookiecutter-jupyter-book template](https://github.com/executablebooks/cookiecutter-jupyter-book). 36 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2021-assignment_1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Assignment 1 - Matrix multiplication in Numba\n", 7 | "\n", 8 | "**Note: This is the assignment from the 2021-22 academic year.**\n", 9 | "\n", 10 | "**Note: You must submit this assignment, including code and comments, as a single Jupyter notebook. To submit, make sure that you run all the code and show the outputs in your notebook. Print out the notebook as a PDF and submit the PDF of the assignment.**\n", 11 | "\n", 12 | "\n", 13 | "We consider the problem of evaluating the matrix multiplication $C = A\\times B$ for matrices $A, B\\in\\mathbb{R}^{n\\times n}$.\n", 14 | "A simple Python implementation of the matrix-matrix product is given below through the function `matrix_product`. At the end this\n", 15 | "function is checked against the Numpy implementation of the matrix-matrix product."
16 | ], 17 | "metadata": {} 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 4, 22 | "source": [ 23 | "import numpy as np\n", 24 | "\n", 25 | "def matrix_product(mat_a, mat_b):\n", 26 | " \"\"\"Returns the product of the matrices mat_a and mat_b.\"\"\"\n", 27 | " m = mat_a.shape[0]\n", 28 | " n = mat_b.shape[1]\n", 29 | "\n", 30 | " assert(mat_a.shape[1] == mat_b.shape[0])\n", 31 | "\n", 32 | " ncol = mat_a.shape[1]\n", 33 | "\n", 34 | " mat_c = np.zeros((m, n), dtype=np.float64)\n", 35 | "\n", 36 | " for row_ind in range(m):\n", 37 | " for col_ind in range(n):\n", 38 | " for k in range(ncol):\n", 39 | " mat_c[row_ind, col_ind] += mat_a[row_ind, k] * mat_b[k, col_ind]\n", 40 | "\n", 41 | " return mat_c\n", 42 | "\n", 43 | "a = np.random.randn(10, 10)\n", 44 | "b = np.random.randn(10, 10)\n", 45 | "\n", 46 | "c_actual = matrix_product(a, b)\n", 47 | "c_expected = a @ b\n", 48 | "\n", 49 | "error = np.linalg.norm(c_actual - c_expected) / np.linalg.norm(c_expected)\n", 50 | "print(f\"The error is {error}.\")\n" 51 | ], 52 | "outputs": [ 53 | { 54 | "output_type": "stream", 55 | "name": "stdout", 56 | "text": [ 57 | "The error is 1.0814245296430078e-16.\n" 58 | ] 59 | } 60 | ], 61 | "metadata": {} 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "source": [ 66 | "The matrix product is one of the most fundamental operations on modern computers. Most algorithms eventually make use of this operation. A lot of effort is therefore spent on optimising the matrix product. Vendors provide hardware-optimised BLAS (Basic Linear Algebra Subprograms) libraries that provide highly efficient versions of the matrix product.
Alternatively, open-source libraries such as OpenBLAS provide widely used generic open-source implementations of this operation.\n", 67 | "\n", 68 | "In this assignment we want to use the example of matrix-matrix products to learn about the possible speedups offered by Numba, and the effects of cache-efficient programming.\n", 69 | "\n", 70 | "* Benchmark the above function against the Numpy dot product for matrix sizes up to 1000. Plot the timing results of the above function against the timing results for the Numpy dot product. You need not benchmark every dimension up to 1000. Figure out what dimensions to use so that you can represent the result without spending too much time waiting for the code to finish. To perform benchmarks you can use the `%timeit` magic command. An example is\n", 71 | " ```\n", 72 | " timeit_result = %timeit -o matrix_product(a, b)\n", 73 | " print(timeit_result.best)\n", 74 | " ```\n", 75 | "* Now optimise the code by using Numba to JIT-compile it. Also, there is lots of scope for parallelisation in the code. You can, for example, parallelise the outer-most for-loop. Benchmark the JIT-compiled serial code against the JIT-compiled parallel code. Comment on the expected performance on your system against the observed performance.\n", 76 | "\n", 77 | "* Now let us improve cache efficiency. Notice that in the matrix $B$ we traverse by columns. However, the default storage ordering in Numpy is row-based. Hence, the expression `mat_b[k, col_ind]` jumps in memory by `n` units if we move from $k$ to $k+1$. Run your parallelised JIT-compiled Numba code again. But this time choose a matrix $B$ that is stored in column-major order. To change an array to column-major order you can use the command `np.asfortranarray`.\n", 78 | "\n", 79 | "* We can still try to improve efficiency. A frequent technique to improve efficiency for the matrix-matrix product is through blocking.
Consider the command in the inner-most loop `mat_c[row_ind, col_ind] += mat_a[row_ind, k] * mat_b[k, col_ind]`. Instead of updating a single element `mat_c[row_ind, col_ind]` we want to update an $\\ell\\times\\ell$ submatrix. Hence, the inner multiplication becomes itself the product of two $\\ell\\times\\ell$ submatrices, and instead of iterating element by element we move forward in terms of $\\ell\\times \\ell$ blocks. Implement this scheme. For the innermost $\\ell\\times\\ell$ matrix use a standard serial triple loop. Investigate how benchmark timings depend on the parameter $\\ell$ and how this implementation compares to your previous schemes. For simplicity you may want to choose outer-matrix dimensions that are multiples of $\\ell$ so that you need not deal in your code with the remainder part of the matrix if the dimensions are not divisible by $\\ell$. Note that while such schemes are used in practical implementations of the matrix-matrix product it is not immediately clear that a Numba implementation here will be advantageous. There is a lot going on in the compiler in between writing Numba loops and actually producing machine code. Real libraries are written in much lower-level languages and can optimise closer to the hardware. Your task is to experiment to see if this blocked approach has advantages within Numba.\n", 80 | "\n", 81 | "**In all your implementations make sure that you write your code in such a way that SIMD code can be produced. 
Demonstrate whether your produced codes are SIMD optimised.**\n" 82 | ], 83 | "metadata": {} 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "source": [], 88 | "metadata": {} 89 | } 90 | ], 91 | "metadata": { 92 | "orig_nbformat": 4, 93 | "language_info": { 94 | "name": "python", 95 | "version": "3.9.7", 96 | "mimetype": "text/x-python", 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 3 100 | }, 101 | "pygments_lexer": "ipython3", 102 | "nbconvert_exporter": "python", 103 | "file_extension": ".py" 104 | }, 105 | "kernelspec": { 106 | "name": "python3", 107 | "display_name": "Python 3.9.7 64-bit ('dev': conda)" 108 | }, 109 | "interpreter": { 110 | "hash": "433c521acb4ce4629f0708f9192dd3599e20d0bdeb40353d5f8d17c68a66b248" 111 | } 112 | }, 113 | "nbformat": 4, 114 | "nbformat_minor": 2 115 | } 116 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2021-assignment_2.md: -------------------------------------------------------------------------------- 1 | # Assignment 2 - GPU-accelerated solution of Poisson problems 2 | 3 | **Note: This is the assignment from the 2021-22 academic year.** 4 | 5 | In this assignment we consider the solution of Poisson problems of the form 6 | 7 | $$ 8 | -\Delta u(x, y) = f(x, y) 9 | $$ 10 | with $\Delta u := u_{xx} + u_{yy}$ 11 | for $(x, y)\in\Omega\subset\mathbb{R}^2$ and boundary conditions $u(x, y) = g(x, y)$ on $\Gamma :=\partial\Omega$. 12 | 13 | For all our experiments the domain $\Omega$ is the unit square $\Omega :=[0, 1]^2$. 14 | 15 | To numerically solve this problem we define grid points $x_i := ih$ and $y_j :=jh$ with $i, j=1, \dots, N$ and $h=1/(N+1)$. We can now approximate 16 | 17 | $$ 18 | -\Delta u(x_i, y_j) \approx \frac{1}{h^2}(4 u(x_i, y_j) - u(x_{i-1}, y_j) - u(x_{i+1}, y_j) - u(x_{i}, y_{j-1}) - u(x_i, y_{j+1})).
19 | $$ 20 | If the neighbouring point of $(x_i, y_j)$ is at the boundary we simply use the corresponding value of the boundary data $g$ in the above approximation. 21 | 22 | The above Poisson problem now becomes the system of $N^2$ equations given by 23 | 24 | $$ 25 | \frac{1}{h^2}(4 u(x_i, y_j) - u(x_{i-1}, y_j) - u(x_{i+1}, y_j) - u(x_{i}, y_{j-1}) - u(x_i, y_{j+1})) = f(x_i, y_j) 26 | $$ 27 | for $i, j=1,\dots, N$. 28 | 29 | **Task 1** We first need to create a verified reference solution to this problem. Implement a function ```discretise(f, g, N)``` that takes a Python callable $f$, a Python callable $g$ and the parameter $N$ and returns a sparse CSR matrix $A$ and the corresponding right-hand side $b$ of the above discretised Poisson problem. 30 | 31 | To verify your code we use the method of manufactured solutions. Let $u(x, y)$ be the exact function $u_{exact}(x, y) = e^{(x-0.5)^2 + (y-0.5)^2}$. By taking $-\Delta u_{exact}$ you can compute the corresponding right-hand side $f$ so that this function $u_{exact}$ will be the exact solution of the Poisson equation $-\Delta u(x, y) = f(x, y)$ with boundary conditions given by the boundary data of your known $u_{exact}$. 32 | 33 | For growing values of $N$ solve the linear system of equations using the `scipy.sparse.linalg.spsolve` command. Plot the maximum relative error of your computed grid values $u(x_i, y_j)$ against the exact solution $u_{exact}$ as $N$ increases. The relative error at a given point is 34 | 35 | $$ 36 | e_{rel} = \frac{|u(x_i, y_j) - u_{exact}(x_i, y_j)|}{|u_{exact}(x_i, y_j)|} 37 | $$ 38 | 39 | For your plot you should use a double logarithmic plot (```loglog``` in Matplotlib). As $N$ increases the error should go to zero. What can you conjecture about the rate of convergence? 40 | 41 | **Task 2** With your verified code we now have something to compare a GPU code against. On the GPU we want to implement a simple iterative scheme to solve the Poisson equation.
The idea is to rewrite the above discrete linear system as 42 | 43 | $$ 44 | u(x_i, y_j) = \frac{1}{4}\left(h^2f(x_i, y_j) + u(x_{i-1}, y_j) + u(x_{i+1}, y_j) + u(x_{i}, y_{j-1}) + u(x_i, y_{j+1})\right) 45 | $$ 46 | 47 | You can notice that if $f$ is zero then the left-hand side $u(x_i, y_j)$ is just the average of all the neighbouring grid points. This motivates a simple iterative scheme, namely 48 | 49 | $$ 50 | u^{k+1}(x_i, y_j) = \frac{1}{4}\left(h^2f(x_i, y_j) + u^k(x_{i-1}, y_j) + u^k(x_{i+1}, y_j) + u^k(x_{i}, y_{j-1}) + u^k(x_i, y_{j+1})\right). 51 | $$ 52 | 53 | In other words, the value of $u$ at iteration $k+1$ is just the average of all the values at iteration $k$ plus the contribution from the right-hand side. 54 | 55 | Your task is to implement this iterative scheme in Numba Cuda. A few hints are in order: 56 | 57 | * Make sure that when possible you only copy data from the GPU to the host at the end of your computation. To initialize the iteration you can for example take $u=0$. You do not want to copy data after each iteration step. 58 | * You will need two global buffers, one for the current iteration $k$ and one for the next iteration. 59 | * Your compute kernel will execute one iteration of the scheme and you run multiple iterations by repeatedly calling the kernel from the host. 60 | * To check for convergence you should investigate the relative change of your values from $u^k$ to $u^{k+1}$ and take the maximum relative change as a measure of how accurate your solution is. Decide how you implement this (in the same kernel or through a separate kernel). Also, decide how often you check for convergence. You may not want to check in each iteration as it is an expensive operation. 61 | * Verify your GPU code by comparing against the exact discrete solution in Task 1. Generate a convergence plot of how the values in your iterative scheme converge against the exact discrete solution. For this use a few selected values of $N$.
How does the convergence change as $N$ increases? 62 | * Try to optimise memory accesses. You will notice that if you consider a grid value $u(i, j)$ it will be read multiple times from the global buffer. Try to optimise memory accesses by preloading a block of values into local shared memory and have a thread block read the data from there. When you do this, benchmark against an implementation where each thread just reads from global memory. 63 | 64 | ***Carefully describe your computations and observations. Explain what you are doing and try to be scientifically precise in your observations and conclusions. Carefully designing and interpreting your convergence and benchmark experiments is a significant component of this assignment.*** 65 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2021-assignment_3.md: -------------------------------------------------------------------------------- 1 | # Assignment 3 - Sparse matrix formats on GPUs 2 | 3 | **Note: This is the assignment from the 2021-22 academic year.** 4 | 5 | **Task 1**: So far we have learned about the CSR format. On CPUs this is a widely used standard format. However, it has some severe disadvantages on GPUs, but also on modern vector extensions (AVX, etc.) of CPUs. The paper [Improving the performance of the sparse matrix 6 | vector product with GPUs](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5577904) by Vazquez et al. describes these difficulties and also the alternative Ellpack and Ellpack-R formats that improve on the shortcomings of CSR on GPUs and offer more performance on these devices. Summarize the difficulties of the CSR format and how Ellpack and Ellpack-R solve these difficulties. For a very high mark, look for some more modern papers on GPU-based matvecs and describe what other modern developments have happened since that paper. 7 | 8 | **Task 2**: You are given a sparse matrix in CSR format.
Write a CPU-based Numba routine that converts this matrix into the Ellpack-R format. 9 | 10 | Implement a new class ```EllpackMatrix``` derived from ```scipy.sparse.linalg.LinearOperator```, which in its constructor takes a Scipy sparse matrix in CSR format, converts it to Ellpack-R and provides a routine for the matrix-vector product in the Ellpack-R format. At the end the following prototype commands should be possible with your class. 11 | 12 | ``` 13 | my_sparse_mat = EllpackMatrix(csr_mat) 14 | x = numpy.random.randn(my_sparse_mat.shape[1]) 15 | y = my_sparse_mat @ x 16 | ``` 17 | The sparse matrix-vector product at the end shall be executed in the Ellpack-R format. 18 | 19 | For your implementation you can either use CPU-based Numba code using ```prange``` for multithreading or write an implementation in Numba-Cuda. For an overall mark of the assignment beyond 80% we would expect a Numba-Cuda implementation (assuming also all other parts are of high standard). 20 | 21 | Demonstrate in your solution that your code provides the correct result by verifying for a $1000\times 1000$ sparse random matrix against the standard CSR matvec of sparse matrices, showing for 3 random vectors that the relative distance of your Ellpack-R matvec to the CSR matvec result is on the order of machine precision. 22 | 23 | Use the ```discretise_poisson``` method from the lecture notes to generate the sparse matrix for the Poisson discretisation and plot the times for a single matvec for growing matrix sizes (go as high as you think is reasonable) using the standard Scipy CSR matvec and your own Ellpack-R implementation. 24 | 25 | Your implementation may not be faster since the matrix derives from a very simple 2d problem with few elements per row. This is not the situation where the additional complexity is usually needed.
26 | 27 | Finally, go shopping in the [Matrix Market](https://math.nist.gov/MatrixMarket/) and try to find two matrices that better show off the Ellpack-R format and do timing comparisons for your chosen matrices. 28 | 29 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2021-assignment_4.md: -------------------------------------------------------------------------------- 1 | # Assignment 4 - Time-dependent problems 2 | 3 | **Note: This is the assignment from the 2021-22 academic year.** 4 | 5 | Consider a square plate with sides $[-1, 1] \times [-1, 1]$. At time $t = 0$ we are heating the plate up 6 | such that the temperature is $u = 5$ on one side and $u = 0$ on the other sides. The temperature 7 | evolves according to $u_t = \Delta u$. At what time $t^*$ does the plate reach $u = 1$ at the centre of the plate? 8 | Implement a finite difference scheme and try both explicit and implicit time-stepping. By increasing 9 | the number of discretisation points, demonstrate how many correct digits you can achieve. Also, 10 | plot the convergence of your computed time $t^*$ against the actual time. To 12 digits the desired 11 | solution is $t^* = 0.424011387033$. 12 | 13 | A GPU implementation of the explicit time-stepping scheme is not necessary but would be expected for a very high mark beyond 80%. 14 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-a4-A_and_b.md: -------------------------------------------------------------------------------- 1 | ### Examples for assignment 4 2 | 3 | The code snippet below gives the matrices $\mathrm{A}$ and vectors $\mathbf{b}$ 4 | used in [assignment 4](2022-assignment_4.md) for $N=2$, $N=3$, $N=4$, $N=5$. 5 | 6 | For $N=2$, there is only one point in the interior, so we have a 1 by 1 matrix and a vector 7 | of length 1.
8 | 9 | For $N=3$, there are four points in the interior, so we have a 4 by 4 matrix and a vector of 10 | length 4. In this example, we number the points like this: 11 | 12 | $$ 13 | \begin{array}{cc} 14 | 2&3\\ 15 | 0&1 16 | \end{array} 17 | $$ 18 | 19 | For $N=4$, there are nine points in the interior, so we have a 9 by 9 matrix and a vector of 20 | length 9. In this example, we number the points like this: 21 | 22 | $$ 23 | \begin{array}{ccc} 24 | 6&7&8\\ 25 | 3&4&5\\ 26 | 0&1&2 27 | \end{array} 28 | $$ 29 | 30 | For $N=5$, there are 16 points in the interior, so we have a 16 by 16 matrix and a vector of 31 | length 16. In this example, we number the points like this: 32 | 33 | $$ 34 | \begin{array}{cccc} 35 | 12&13&14&15\\ 36 | 8&9&10&11\\ 37 | 4&5&6&7\\ 38 | 0&1&2&3 39 | \end{array} 40 | $$ 41 | 42 | ```python 43 | import numpy as np 44 | 45 | # A and b for N=2 46 | A_2 = np.array([ 47 | [-0.11111111111111116], 48 | ]) 49 | b_2 = np.array([0.2699980311833446]) 50 | 51 | # A and b for N=3 52 | A_3 = np.array([ 53 | [1.4320987654320987, -0.6419753086419753, -0.6419753086419753, -0.4104938271604938], 54 | [-0.6419753086419753, 1.4320987654320987, -0.4104938271604938, -0.6419753086419753], 55 | [-0.6419753086419753, -0.4104938271604938, 1.4320987654320987, -0.6419753086419753], 56 | [-0.4104938271604938, -0.6419753086419753, -0.6419753086419753, 1.4320987654320987], 57 | ]) 58 | b_3 = np.array([1.7251323007221917, 0.15334285313223067, -0.34843455260733003, -1.0558651156722307]) 59 | 60 | # A and b for N=4 61 | A_4 = np.array([ 62 | [1.972222222222222, -0.5069444444444444, 0.0, -0.5069444444444444, -0.3767361111111111, 0.0, 0.0, 0.0, 0.0], 63 | [-0.5069444444444444, 1.972222222222222, -0.5069444444444444, -0.3767361111111111, -0.5069444444444444, -0.3767361111111111, 0.0, 0.0, 0.0], 64 | [0.0, -0.5069444444444444, 1.972222222222222, 0.0, -0.3767361111111111, -0.5069444444444444, 0.0, 0.0, 0.0], 65 | [-0.5069444444444444, -0.3767361111111111, 0.0, 1.972222222222222, 
-0.5069444444444444, 0.0, -0.5069444444444444, -0.3767361111111111, 0.0], 66 | [-0.3767361111111111, -0.5069444444444444, -0.3767361111111111, -0.5069444444444444, 1.972222222222222, -0.5069444444444444, -0.3767361111111111, -0.5069444444444444, -0.3767361111111111], 67 | [0.0, -0.3767361111111111, -0.5069444444444444, 0.0, -0.5069444444444444, 1.972222222222222, 0.0, -0.3767361111111111, -0.5069444444444444], 68 | [0.0, 0.0, 0.0, -0.5069444444444444, -0.3767361111111111, 0.0, 1.972222222222222, -0.5069444444444444, 0.0], 69 | [0.0, 0.0, 0.0, -0.3767361111111111, -0.5069444444444444, -0.3767361111111111, -0.5069444444444444, 1.972222222222222, -0.5069444444444444], 70 | [0.0, 0.0, 0.0, 0.0, -0.3767361111111111, -0.5069444444444444, 0.0, -0.5069444444444444, 1.972222222222222], 71 | ]) 72 | b_4 = np.array([1.4904895819530766, 1.055600747809247, 0.07847904705126368, 0.8311407883427149, 0.0, -0.8765020708205272, -0.6433980946818605, -0.7466392365712349, -0.538021498324083]) 73 | 74 | # A and b for N=5 75 | A_5 = np.array([ 76 | [2.222222222222222, -0.4444444444444444, 0.0, 0.0, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 77 | [-0.4444444444444444, 2.222222222222222, -0.4444444444444444, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 78 | [0.0, -0.4444444444444444, 2.222222222222222, -0.4444444444444444, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 79 | [0.0, 0.0, -0.4444444444444444, 2.222222222222222, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 80 | [-0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 2.222222222222222, -0.4444444444444444, 0.0, 0.0, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 81 | [-0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, -0.4444444444444444, 2.222222222222222, 
-0.4444444444444444, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 0.0, 0.0, 0.0], 82 | [0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, -0.4444444444444444, 2.222222222222222, -0.4444444444444444, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 0.0, 0.0], 83 | [0.0, 0.0, -0.3611111111111111, -0.4444444444444444, 0.0, 0.0, -0.4444444444444444, 2.222222222222222, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, 0.0, 0.0, 0.0, 0.0], 84 | [0.0, 0.0, 0.0, 0.0, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 2.222222222222222, -0.4444444444444444, 0.0, 0.0, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0], 85 | [0.0, 0.0, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, -0.4444444444444444, 2.222222222222222, -0.4444444444444444, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0], 86 | [0.0, 0.0, 0.0, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, -0.4444444444444444, 2.222222222222222, -0.4444444444444444, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111], 87 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, 0.0, 0.0, -0.4444444444444444, 2.222222222222222, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444], 88 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.4444444444444444, -0.3611111111111111, 0.0, 0.0, 2.222222222222222, -0.4444444444444444, 0.0, 0.0], 89 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, -0.4444444444444444, 2.222222222222222, -0.4444444444444444, 0.0], 90 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, -0.3611111111111111, 0.0, -0.4444444444444444, 2.222222222222222, -0.4444444444444444], 91 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.3611111111111111, -0.4444444444444444, 0.0, 0.0, -0.4444444444444444, 
2.222222222222222], 92 | ]) 93 | b_5 = np.array([1.2673039440507343, 0.9698054647507671, 1.0133080988552785, 0.07206335813040798, 0.9472174493756345, 0.0, 0.0, -0.9416429716282946, 0.6400834406610956, 0.0, 0.0, -0.7322882523543968, -0.8159823324771336, -0.9192523853093425, -0.48342793699793585, -0.19471066818706848]) 94 | 95 | ``` 96 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-assignment_1.md: -------------------------------------------------------------------------------- 1 | # Assignment 1 - Matrix-matrix multiplication 2 | 3 | This assignment makes up 20% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Thursday 20 October 2022**. 4 | 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers 6 | to any text questions included in the assessment. The easiest ways to create this file are: 7 | 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 10 | 11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below. 12 | 13 | ## The assignment 14 | 15 | In this assignment, we will look at computing the product $AB$ of two matrices $A,B\in\mathbb{R}^{n\times n}$. The following snippet of code defines a function that computes the 16 | product of two matrices. As an example, the product of two 10 by 10 matrices is printed. The final line prints `matrix1 @ matrix2` - the `@` symbol denotes matrix multiplication, and 17 | Python will get Numpy to compute the product of two matrices. By looking at the output, it's possible to check that the two results are the same. 
18 | 19 | ```python 20 | import numpy as np 21 | 22 | 23 | def slow_matrix_product(mat1, mat2): 24 | """Multiply two matrices.""" 25 | assert mat1.shape[1] == mat2.shape[0] 26 | result = [] 27 | for c in range(mat2.shape[1]): 28 | column = [] 29 | for r in range(mat1.shape[0]): 30 | value = 0 31 | for i in range(mat1.shape[1]): 32 | value += mat1[r, i] * mat2[i, c] 33 | column.append(value) 34 | result.append(column) 35 | return np.array(result).transpose() 36 | 37 | 38 | matrix1 = np.random.rand(10, 10) 39 | matrix2 = np.random.rand(10, 10) 40 | 41 | print(slow_matrix_product(matrix1, matrix2)) 42 | print(matrix1 @ matrix2) 43 | ``` 44 | 45 | The function in this snippet isn't very good. 46 | 47 | ### Part 1: a better function 48 | **Write your own function called `faster_matrix_product` that computes the product of two matrices more efficiently than `slow_matrix_product`.** 49 | Your function may use functions from Numpy (eg `np.dot`) to complete part of its calculation, but your function should not use `np.dot` or `@` to compute 50 | the full matrix-matrix product. 51 | 52 | Before you look at the performance of your function, you should check that it is computing the correct results. **Write a Python script using an `assert` 53 | statement that checks that your function gives the same result as using `@` for random 2 by 2, 3 by 3, 4 by 4, and 5 by 5 matrices.** 54 | 55 | In a text box, **give two brief reasons (1-2 sentences for each) why your function is better than `slow_matrix_product`.** At least one of your 56 | reasons should be related to the time you expect the two functions to take. 57 | 58 | Next, we want to compare the speed of `slow_matrix_product` and `faster_matrix_product`. 
**Write a Python script that runs the two functions for matrices of a range of sizes,
59 | and use `matplotlib` to create a plot showing the time taken for different sized matrices for both functions.** You should be able to run the functions for matrices
60 | of size up to around 1000 by 1000 (but if you're using an older/slower computer, you may decide to decrease the maximums slightly). You do not need to run your functions for
61 | every size between your minimum and maximum, but should choose a set of 10-15 values that will give you an informative plot.
62 | 
63 | ### Part 2: speeding it up with Numba
64 | In the second part of this assignment, you're going to use Numba to speed up your function.
65 | 
66 | **Create a copy of your function `faster_matrix_product` that is just-in-time (JIT) compiled using Numba.** To demonstrate the speed improvement achieved by using Numba,
67 | **make a plot (similar to that you made in the first part) that shows the times taken to multiply matrices using `faster_matrix_product`, `faster_matrix_product` with
68 | Numba JIT compilation, and Numpy (`@`).** Numpy's matrix-matrix multiplication is highly optimised, so you should not expect to be as fast as it.
69 | 
70 | You may be able to achieve further speed up of your function by adjusting the memory layout used. The function `np.asfortranarray` will make a copy of an array that uses
71 | Fortran-style ordering, for example:
72 | 
73 | ```python
74 | import numpy as np
75 | 
76 | a = np.random.rand(10, 10)
77 | fortran_a = np.asfortranarray(a)
78 | ```
79 | 
80 | **Make a plot that compares the times taken by your JIT compiled function when the inputs have different combinations of C-style and Fortran-style ordering**
81 | (ie the plot should have lines for when both inputs are C-style, when the first is C-style and the second is Fortran-style, and so on).
Focusing on the fact
82 | that it is more efficient to access memory that is close to previous accesses, **comment (in 1-2 sentences) on why one of these orderings appears to be faster than the others**.
83 | (Numba can do a lot of different things when compiling code, so depending on your function there may or may not be a large difference: if there is little change in speeds
84 | for your function, you can comment on which ordering you might expect to be faster and why, but conclude that Numba is doing something more advanced.)
85 | 
86 | 
--------------------------------------------------------------------------------
/hpc_lecture_notes/2022-assignment_2.md:
--------------------------------------------------------------------------------
1 | # Assignment 2 - Solving two 1D problems
2 | 
3 | This assignment makes up 20% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Thursday 3 November 2022**.
4 | 
5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers
6 | to any text questions included in the assessment. The easiest ways to create this file are:
7 | 
8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf).
9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf.
10 | 
11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below.
12 | 
13 | ## The assignment
14 | 
15 | ### Part 1: Solving a wave problem with sparse matrices
16 | In this part of the assignment, we want to compute the solution to the following (time-harmonic) wave problem:
17 | 
18 | $$
19 | \begin{align*}
20 | \frac{\mathrm{d}^2 u}{\mathrm{d}x^2} + k^2u &= 0&&\text{in }(0, 1),\\
21 | u &= 0&&\text{if }x=0,\\
22 | u &= 1&&\text{if }x=1,\\
23 | \end{align*}
24 | $$
25 | with wavenumber $k=29\mathrm{\pi}/2$.
26 | 
27 | In this part, we will approximately solve this problem using the method of finite differences.
28 | We do this by taking evenly spaced values
29 | $x_0=0, x_1, x_2, ..., x_N=1$
30 | and approximating the value of $u$ at each of these points: we will call these approximations $u_i$.
31 | To compute them, we use the approximation
32 | 
33 | $$
34 | \frac{\mathrm{d}^2u_{i}}{\mathrm{d}x^2} \approx \frac{
35 | u_{i-1}-2u_i+u_{i+1}
36 | }{h^2},
37 | $$
38 | where $h = 1/N$.
39 | 
40 | With a bit of algebra, we see that the wave problem can be written as
41 | 
42 | $$
43 | (2-h^2k^2)u_i-u_{i-1}-u_{i+1} = 0
44 | $$
45 | if $x_i$ is not 0 or 1, and
46 | 
47 | $$
48 | \begin{align*}
49 | u_i &= 0
50 | &&\text{if }x_i=0,\\
51 | u_i &= 1
52 | &&\text{if }x_i=1.
53 | \end{align*}
54 | $$
55 | 
56 | This information can be used to re-write the problem as the matrix-vector problem
57 | $\mathrm{A}\mathbf{u}=\mathbf{f},$
58 | where $\mathrm{A}$ is a known matrix, $\mathbf{f}$ is a known vector, and $\mathbf{u}$ is an unknown vector that we want to compute.
59 | The entries of
60 | $\mathbf{f}$ and $\mathbf{u}$ are given by
61 | 
62 | $$
63 | \begin{align*}
64 | \left[\mathbf{u}\right]_i &= u_i,\\
65 | \left[\mathbf{f}\right]_i &= \begin{cases}
66 | 1&\text{if }i=N,\\
67 | 0&\text{otherwise}.
68 | \end{cases}
69 | \end{align*}
70 | $$
71 | The rows of $\mathrm{A}$ are given by
72 | 
73 | $$
74 | \left[\mathrm{A}\right]_{i,j} =
75 | \begin{cases}
76 | 1&\text{if }i=j,\\
77 | 0&\text{otherwise},
78 | \end{cases}
79 | $$
80 | if $i=0$ or $i=N$; and
81 | 
82 | $$
83 | \left[\mathrm{A}\right]_{i, j} =
84 | \begin{cases}
85 | 2-h^2k^2&\text{if }j=i,\\
86 | -1&\text{if }j=i+1,\\
87 | -1&\text{if }j=i-1,\\
88 | 0&\text{otherwise},
89 | \end{cases}
90 | $$
91 | otherwise.
92 | 
93 | **Write a Python function that takes $N$ as an input and returns the matrix $\mathrm{A}$ and vector $\mathbf{f}$**.
94 | You should use an appropriate sparse storage format for the matrix $\mathrm{A}$.
95 | 
96 | The function `scipy.sparse.linalg.spsolve` can be used to solve a sparse matrix-vector problem. Use this to **compute
97 | the approximate solution for your problem for $N=10$, $N=100$, and $N=1000$**. Use `matplotlib` (or any other plotting library)
98 | to **plot the solutions for these three values of $N$**.
99 | 
100 | **Briefly (1-2 sentences) comment on your plots**: How different are they to each other? Which do you expect to be closest to the
101 | actual solution of the wave problem?
102 | 
103 | This wave problem was carefully chosen so that its exact solution is known: this solution is
104 | $u_\text{exact}(x) = \sin(kx)$. (You can check this by differentiating this twice and substituting, but you
105 | do not need to do this as part of this assignment.)
106 | 
107 | A possible approximate measure of the error in your solution can be found by computing
108 | 
109 | $$
110 | \max_i\left|u_i-u_\text{exact}(x_i)\right|.
111 | $$
112 | **Compute this error for a range of values for $N$ of your choice, for the method you wrote above**. On axes that both use log scales,
113 | **plot $N$ against the error in your solution**. You should pick a range of values for $N$ so that this plot will give you useful information about the
114 | method.
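To make this workflow concrete, here is a minimal sketch of the kind of code Part 1 asks for: sparse assembly, a solve with `scipy.sparse.linalg.spsolve`, and the error measure above. The function name `wave_system` and the loop-based assembly are our own illustrative choices, not a required structure:

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def wave_system(N, k=29 * np.pi / 2):
    """Assemble the sparse matrix A and vector f defined above (sketch)."""
    h = 1.0 / N
    A = lil_matrix((N + 1, N + 1))
    f = np.zeros(N + 1)
    A[0, 0] = 1.0  # boundary row: u = 0 at x = 0
    A[N, N] = 1.0  # boundary row: u = 1 at x = 1
    f[N] = 1.0
    for i in range(1, N):  # interior rows of the finite difference scheme
        A[i, i] = 2 - h**2 * k**2
        A[i, i - 1] = -1.0
        A[i, i + 1] = -1.0
    return A.tocsr(), f  # convert to CSR before solving

N = 1000
A, f = wave_system(N)
u = spsolve(A, f)
x = np.linspace(0, 1, N + 1)
error = np.max(np.abs(u - np.sin(29 * np.pi / 2 * x)))
```

A log-log plot of `error` against $N$ (for example with `matplotlib.pyplot.loglog`) then reveals the convergence rate.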
115 | 
116 | For the same values of $N$, **measure the time taken to compute your approximation using your function**. On axes that both use log scales,
117 | **plot $N$ against the time taken to compute a solution**.
118 | 
119 | We now want to compute an approximate solution where the measure of error is $10^{-8}$ or less. By looking at your plots, **pick a value of $N$
120 | that you would expect to give an error of $10^{-8}$ or less**. **Briefly (1-2 sentences) explain how you picked your value of $N$
121 | and predict how long the computation will take**.
122 | 
123 | **Compute the approximate solution with your value of $N$**. Measure the time taken and the error, and **briefly (1-2 sentences) comment
124 | on how these compare to your predictions**. Your error may turn out to be higher than $10^{-8}$ for your value of $N$: if so, you can still get full marks for commenting on
125 | why your prediction was not correct. Depending on your implementation and your prediction,
126 | a valid conclusion in this section could be "My value of $N$ is too large for it to be feasible to complete this computation in a reasonable amount of time / without running out of memory".
127 | 
128 | ### Part 2: Solving the heat equation with GPU acceleration
129 | 
130 | In this part of the assignment, we want to solve the heat equation
131 | 
132 | $$
133 | \begin{align*}
134 | \frac{\mathrm{d}u}{\mathrm{d}t} &= \frac{1}{1000}\frac{\mathrm{d}^2u}{\mathrm{d}x^2}&&\text{for }x\in(0,1),\\
135 | u(x, 0) &= 0&&\text{if }x\not=0\text{ and }x\not=1,\\
136 | u(0,t) &= 10,\\
137 | u(1,t) &= 10.
138 | \end{align*}
139 | $$
140 | This represents a rod that starts at temperature 0 and is heated to a temperature of 10 at both ends.
141 | 
142 | Again, we will approximately solve this by taking evenly spaced values
143 | $x_0=0, x_1, x_2, ..., x_N=1$.
144 | Additionally, we will take a set of evenly spaced times
145 | $t_0=0,t_1=h, t_2=2h, t_3=3h, ...$, where $h=1/N$.
146 | We will write $u^{(j)}_{i}$ for the approximate value of $u$ at point $x_i$ and time $t_j$ 147 | (ie $u^{(j)}_{i}\approx u(x_i, t_j)$). 148 | 149 | Approximating both derivatives (similar to what we did in part 1), and doing some algebra, we can rewrite the 150 | heat equation as 151 | 152 | $$ 153 | \begin{align*} 154 | u^{(j + 1)}_i&=u^{(j)}_i + \frac{u^{(j)}_{i-1}-2u^{(j)}_i+u^{(j)}_{i+1}}{1000h},\\ 155 | u^{(0)}_i &= 0,\\ 156 | u^{(j)}_{0}&=10,\\ 157 | u^{(j)}_{N}&=10. 158 | \end{align*} 159 | $$ 160 | 161 | This leads us to an iterative method for solving this problem: first, at $t=0$, we set 162 | 163 | $$ 164 | u^{(0)}_i = 165 | \begin{cases} 166 | 10 &\text{if }i=0\text{ or }i=N,\\ 167 | 0 &\text{otherwise}; 168 | \end{cases} 169 | $$ 170 | then for all later values of time, we set 171 | 172 | $$ 173 | u^{(j+1)}_i = 174 | \begin{cases} 175 | 10 &\text{if }i=0\text{ or }i=N,\\ 176 | \displaystyle u^{(j)}_i + \frac{u^{(j)}_{i-1}-2u^{(j)}_i+u^{(j)}_{i+1}}{1000h} &\text{otherwise}. 177 | \end{cases} 178 | $$ 179 | 180 | **Implement this iterative scheme in Python**. You should implement this as a function that takes $N$ as an input. 181 | 182 | Using a sensible value of $N$, **plot the temperature of the rod at $t=1$, $t=2$ and $t=10$**. **Briefly (1-2 sentences) 183 | comment on how you picked a value for $N$**. 184 | 185 | **Use `numba.cuda` to parallelise your implementation on a GPU**. 186 | You should think carefully about when data needs to be copied, and be careful not to copy data to/from the GPU when not needed. 187 | 188 | 189 | **Use your code to estimate the time at which the temperature of the midpoint of the rod first exceeds a temperature of 9.8**. 190 | **Briefly (2-3 sentences) describe how you estimated this time**. You may choose to use a plot or diagram to aid your description, 191 | but it is not essential to include a plot. 
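As an aside on Part 2: before porting anything to `numba.cuda`, it can help to have a serial reference version of the scheme to compare the GPU results against. A minimal NumPy sketch (all names here are our own):

```python
import numpy as np

def heat_step(u, h):
    """One step of the update rule above: interior update, then boundary reset."""
    new = u.copy()
    new[1:-1] = u[1:-1] + (u[:-2] - 2 * u[1:-1] + u[2:]) / (1000 * h)
    new[0] = new[-1] = 10.0
    return new

N = 100
h = 1.0 / N
u = np.zeros(N + 1)
u[0] = u[-1] = 10.0  # initial condition at t = 0
for _ in range(10 * N):  # each step advances time by h, so this reaches t = 10
    u = heat_step(u, h)
```

Note that the update coefficient is $1/(1000h)=N/1000$, so this explicit scheme is stable only while that stays below $1/2$ (roughly $N\leqslant 500$), which is worth bearing in mind when picking $N$.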
192 | 
--------------------------------------------------------------------------------
/hpc_lecture_notes/2022-assignment_3.md:
--------------------------------------------------------------------------------
1 | # Assignment 3 - Sparse matrices
2 | 
3 | This assignment makes up 30% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Thursday 1 December 2022**.
4 | 
5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers
6 | to any text questions included in the assessment. The easiest ways to create this file are:
7 | 
8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf).
9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf.
10 | 
11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below.
12 | 
13 | ## The assignment
14 | 
15 | ### Part 1: Implementing a CSR matrix
16 | Scipy allows you to define your own objects that can be used with its sparse solvers. You can do this
17 | by creating a subclass of `scipy.sparse.linalg.LinearOperator`. In the first part of this assignment, you are going to
18 | implement your own CSR matrix format.
19 | 
20 | The following code snippet shows how you can define your own matrix-like operator.
21 | 
22 | ```python
23 | from scipy.sparse.linalg import LinearOperator
24 | 
25 | 
26 | class CSRMatrix(LinearOperator):
27 |     def __init__(self, coo_matrix):
28 |         self.shape = coo_matrix.shape
29 |         self.dtype = coo_matrix.dtype
30 |         # You'll need to put more code here
31 | 
32 |     def __add__(self, other):
33 |         """Add the CSR matrix other to this matrix."""
34 |         pass
35 | 
36 |     def _matvec(self, vector):
37 |         """Compute a matrix-vector product."""
38 |         pass
39 | ```
40 | 
41 | Make a copy of this code snippet and **implement the methods `__init__`, `__add__` and `_matvec`.**
42 | The method `__init__` takes a COO matrix as input and will initialise the CSR matrix: it currently includes one line
43 | that will store the shape of the input matrix. You should add code here that extracts the important data from a Scipy COO matrix and computes and stores the appropriate data
44 | for a CSR matrix. You may use any functionality of Python and various libraries in your code, but you should not use any library's implementation of a CSR matrix.
45 | The method `__add__` will overload `+` and so allow you to add two of your CSR matrices together.
46 | The `__add__` method should avoid converting any matrices to dense matrices. You could implement this in one of two ways: you could convert both matrices to COO matrices,
47 | compute the sum, then pass this into `CSRMatrix()`; or you could compute the data, indices and indptr for the sum, and use these to create the new CSR matrix directly.
48 | The method `_matvec` will define a matrix-vector product: Scipy will use this when you tell it to use a sparse solver on your operator.
49 | 
50 | **Write tests to check that the `__add__` and `_matvec` methods that you have written are correct.** These tests should use appropriate `assert` statements.
51 | 
52 | For a collection of sparse matrices of your choice and a random vector, **measure the time taken to perform a `matvec` product**.
Convert the same matrices to dense matrices and **measure
53 | the time taken to compute a dense matrix-vector product using Numpy**. **Create a plot showing the times of `matvec` and Numpy for a range of matrix sizes** and
54 | **briefly (1-2 sentences) comment on what your plot shows**.
55 | 
56 | For a matrix of your choice and a random vector, **use Scipy's `gmres` and `cg` sparse solvers to solve a matrix problem using your CSR matrix**.
57 | Check if the two solutions obtained are the same.
58 | **Briefly comment (1-2 sentences) on why the solutions are or are not the same (or are nearly but not exactly the same).**
59 | 
60 | ### Part 2: Implementing a custom matrix
61 | Let $\mathrm{A}$ be a $2n$ by $2n$ matrix with the following structure:
62 | 
63 | - The top left $n$ by $n$ block of $\mathrm{A}$ is a diagonal matrix
64 | - The top right $n$ by $n$ block of $\mathrm{A}$ is zero
65 | - The bottom left $n$ by $n$ block of $\mathrm{A}$ is zero
66 | - The bottom right $n$ by $n$ block of $\mathrm{A}$ is dense (but has a special structure defined below)
67 | 
68 | In other words, $\mathrm{A}$ looks like this, where $*$ represents a non-zero value:
69 | 
70 | $$
71 | \mathrm{A}=\begin{pmatrix}
72 | *&0&0&\cdots&0&\hspace{3mm}0&0&\cdots&0\\
73 | 0&*&0&\cdots&0&\hspace{3mm}0&0&\cdots&0\\
74 | 0&0&*&\cdots&0&\hspace{3mm}0&0&\cdots&0\\
75 | \vdots&\vdots&\vdots&\ddots&0&\hspace{3mm}\vdots&\vdots&\ddots&\vdots\\
76 | 0&0&0&\cdots&*&\hspace{3mm}0&0&\cdots&0\\[3mm]
77 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&*\\
78 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&*\\
79 | \vdots&\vdots&\vdots&\ddots&\vdots&\hspace{3mm}\vdots&\vdots&\ddots&\vdots\\
80 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&*
81 | \end{pmatrix}
82 | $$
83 | 
84 | Let $\tilde{\mathrm{A}}$ be the bottom right $n$ by $n$ block of $\mathrm{A}$.
85 | Suppose that $\tilde{\mathrm{A}}$ is a matrix that can be written as
86 | 
87 | $$
88 | \tilde{\mathrm{A}} = \mathrm{T}\mathrm{W},
89 | $$
90 | where $\mathrm{T}$ is an $n$ by 2 matrix (a tall matrix);
91 | and
92 | where $\mathrm{W}$ is a 2 by $n$ matrix (a wide matrix).
93 | 
94 | **Implement a Scipy `LinearOperator` for matrices of this form**. Your implementation must include a matrix-vector product (`matvec`) and the shape of the matrix (`self.shape`), but
95 | does not need to include an `__add__` function. In your implementation of `matvec`, you should be careful to ensure that the product does not have more computational complexity than necessary.
96 | 
97 | For a range of values of $n$, **create matrices where the entries on the diagonal of the top-left block and in the matrices $\mathrm{T}$ and $\mathrm{W}$ are random numbers**.
98 | For each of these matrices, **compute matrix-vector products using your implementation and measure the time taken to compute these**. Create an alternative version of each matrix,
99 | stored using a Scipy or Numpy format of your choice,
100 | and **measure the time taken to compute matrix-vector products using this format**. **Make a plot showing time taken against $n$**. **Comment (2-4 sentences) on what your plot shows, and why you think
101 | one of these methods is faster than the other** (or why they take the same amount of time if this is the case).
102 | 
--------------------------------------------------------------------------------
/hpc_lecture_notes/2022-assignment_4.md:
--------------------------------------------------------------------------------
1 | # Assignment 4 - Solving a finite element system
2 | 
3 | This assignment makes up 30% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Thursday 15 December 2022**.
4 | 
5 | Coursework is to be submitted using the link on Moodle.
You should submit a single pdf file containing your code, the output when you run your code, and your answers 6 | to any text questions included in the assessment. The easiest ways to create this file are: 7 | 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 10 | 11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below. 12 | 13 | ## The assignment 14 | 15 | ### Mathematical background 16 | In this assignment, we are going to solve a Helmholtz wave problem: 17 | 18 | $$\begin{align*} 19 | -\Delta u - k^2u &= 0&\text{in }\Omega,\\ 20 | u &= g&\text{on the boundary of }\Omega. 21 | \end{align*}$$ 22 | 23 | As our domain we will use the unit square, ie $\Omega=[0,1]^2$. 24 | In this assignment, we will use $k=5$ and 25 | 26 | $$ 27 | g(x,y)= 28 | \begin{cases} 29 | \sin(4y)&\text{if }x=0,\\ 30 | \sin(3x)&\text{if }y=0,\\ 31 | \sin(3+4y)&\text{if }x=1,\\ 32 | \sin(3x+4)&\text{if }y=1. 33 | \end{cases} 34 | $$ 35 | 36 | The finite element method is a method that can approximately solve problems like this. We first split the square $[0,1]^2$ into a mesh of $N$ squares by $N$ squares 37 | (or $N+1$ points by $N+1$ points - 38 | note that there are $N$ squares along each side, but $N+1$ points along each side (watch out for off-by-one errors)): 39 | 40 | ![A mesh of $N$ squares by $N$ squares](img/2022a4-mesh.png) 41 | 42 | As shown in the diagram, we let $h=1/N$. 43 | 44 | The (degree 1) finite element method looks for an approximate solution by placing an unknown value/variable at each point, and approximating the solution as some 45 | linear combination of the functions $1$, $x$, $y$ and $xy$ inside each square. 
Re-writing the problem as an integral equation (and doing a bit of algebra) allows 46 | us to turn the problem into the matrix vector problem 47 | 48 | $$\mathrm{A}\mathbf{x}=\mathbf{b}.$$ 49 | 50 | (We do not need to go into details of how this method is derived, but if you're curious, the first chapter of 51 | *Numerical Solution of Partial Differential Equations by the Finite Element Method* by Claes Johnson 52 | gives a good introduction to this method.) 53 | 54 | Let $\mathbf{p}_0$, $\mathbf{p}_1$, ..., $\mathbf{p}_{(N-1)^2-1}$ be the points in our mesh that are not on the boundary (in some order). Let $x_0$, $x_1$, ..., $x_{(N-1)^2-1}$ be 55 | the values/variables at the points (these are the entries of the unknown vector $\mathbf{x}$). 56 | 57 | $\mathrm{A}$ is an $(N-1)^2$ by $(N-1)^2$ matrix. $\mathbf{b}$ is a vector with $(N-1)^2$ entries. The entries $a_{i,j}$ of the matrix $\mathrm{A}$ are given by 58 | 59 | $$ 60 | a_{i,j} =\begin{cases} 61 | \displaystyle 62 | \frac{24-4h^2k^2}{9}&\text{if }i=j\\ 63 | \displaystyle 64 | \frac{-3-h^2k^2}{9} 65 | &\text{if }\mathbf{p}_i\text{ and }\mathbf{p}_j\text{ are horizontally or vertically adjacent}\\ 66 | \displaystyle 67 | \frac{-12-h^2k^2}{36} 68 | &\text{if }\mathbf{p}_i\text{ and }\mathbf{p}_j\text{ are diagonally adjacent}\\ 69 | 0&\text{otherwise} 70 | \end{cases} 71 | $$ 72 | 73 | The entries $b_j$ of the vector $\mathbf{b}$ are given by 74 | 75 | $$ 76 | b_{j} =\begin{cases} 77 | \displaystyle 78 | \frac{12+h^2k^2}{36}\left(g(0,0)+g(2h,0)+g(0,2h)\right)+\frac{3+h^2k^2}{9}\left(g(h,0)+g(0, h)\right) 79 | &\text{if }\mathbf{p}_j=(h,h)\\ 80 | \displaystyle 81 | \frac{12+h^2k^2}{36}\left(g(1,0)+g(1,2h)+g(1-2h,0)\right)+\frac{3+h^2k^2}{9}\left(g(1-h,0)+g(1, h)\right) 82 | &\text{if }\mathbf{p}_j=(1-h,h)\\ 83 | \displaystyle 84 | \frac{12+h^2k^2}{36}\left(g(0,1)+g(2h,1)+g(0,1-2h)\right)+\frac{3+h^2k^2}{9}\left(g(h,1)+g(0, 1-h)\right) 85 | &\text{if }\mathbf{p}_j=(h,1-h)\\ 86 | \displaystyle 87 | 
\frac{12+h^2k^2}{36}\left(g(1,1)+g(1-2h,1)+g(1,1-2h)\right)+\frac{3+h^2k^2}{9}\left(g(1-h,1)+g(1, 1-h)\right) 88 | &\text{if }\mathbf{p}_j=(1-h,1-h)\\ 89 | \\[3mm] 90 | \displaystyle 91 | \frac{12+h^2k^2}{36}\left(g(0,c_j+h)+g(0,c_j-h)\right)+ 92 | \frac{3+h^2k^2}{9} g(0,c_j) 93 | &\text{if }\mathbf{p}_j=(h,c_j)\text{, with }c_j\not=h\text{ and }c_j\not=1-h\\ 94 | \displaystyle 95 | \frac{12+h^2k^2}{36}\left(g(1,c_j+h)+g(1,c_j-h)\right)+ 96 | \frac{3+h^2k^2}{9} g(1,c_j) 97 | &\text{if }\mathbf{p}_j=(1-h,c_j)\text{, with }c_j\not=h\text{ and }c_j\not=1-h\\ 98 | \displaystyle 99 | \frac{12+h^2k^2}{36}\left(g(c_j+h,0)+g(c_j-h,0)\right)+ 100 | \frac{3+h^2k^2}{9} g(c_j,0) 101 | &\text{if }\mathbf{p}_j=(c_j,h)\text{, with }c_j\not=h\text{ and }c_j\not=1-h\\ 102 | \displaystyle 103 | \frac{12+h^2k^2}{36}\left(g(c_j+h,1)+g(c_j-h,1)\right)+ 104 | \frac{3+h^2k^2}{9} g(c_j,1) 105 | &\text{if }\mathbf{p}_j=(c_j,1-h)\text{, with }c_j\not=h\text{ and }c_j\not=1-h 106 | \\[3mm] 107 | 0&\text{otherwise} 108 | \end{cases} 109 | $$ 110 | 111 | You could alternatively write this as 112 | 113 | $$\begin{align*} 114 | b_j &= \frac{12+h^2k^2}{36} 115 | \left(\text{sum of evaluations of $g$ at all points on the boundary that are diagonally adjacent to $\mathbf{p}_j$}\right) 116 | \\&\hspace{5mm}+ 117 | \frac{3+h^2k^2}{9} 118 | \left(\text{sum of evaluations of $g$ at all points on the boundary that are horizontally or vertically adjacent to $\mathbf{p}_j$}\right) 119 | \end{align*}$$ 120 | 121 | For example (using $k$ and $g$ as given above) when $N=2$, 122 | 123 | $$ 124 | \mathrm{A}=\begin{pmatrix} 125 | -0.11111111 126 | \end{pmatrix}. 127 | $$ 128 | 129 | For $N=2$, the definition of $\mathbf{b}$ is different to above, as the point at $(1/2,1/2)$ is adjacent to all three sides and so the conditions above are all true at once. 130 | The alternate value of $\mathbf{b}$ used in this case is not important, as we will later 131 | take $N>2$. 
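The boundary data $g$ defined above can be transcribed directly into Python. A sketch, which also checks that the four pieces agree at the corners (they do, because on the boundary $g$ coincides with $\sin(3x+4y)$):

```python
import numpy as np

def g(x, y):
    """The boundary data g, transcribed from the definition above."""
    if x == 0:
        return np.sin(4 * y)
    if y == 0:
        return np.sin(3 * x)
    if x == 1:
        return np.sin(3 + 4 * y)
    if y == 1:
        return np.sin(3 * x + 4)
    raise ValueError("(x, y) is not on the boundary")

# The four pieces agree wherever two of them apply (ie at the corners)
for corner in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert np.isclose(g(*corner), np.sin(3 * corner[0] + 4 * corner[1]))
```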
132 | 
133 | As a second example, when $N=3$,
134 | 
135 | $$
136 | \mathrm{A}=\begin{pmatrix}
137 | 1.43209877& -0.64197531& -0.64197531& -0.41049383\\
138 | -0.64197531& 1.43209877& -0.41049383& -0.64197531\\
139 | -0.64197531& -0.41049383& 1.43209877& -0.64197531\\
140 | -0.41049383& -0.64197531& -0.64197531& 1.43209877
141 | \end{pmatrix},
142 | $$
143 | 
144 | $$
145 | \mathbf{b}=\begin{pmatrix}
146 | 1.72513230\\0.15334285\\-0.34843455\\-1.05586511
147 | \end{pmatrix}.
148 | $$
149 | 
150 | In this second example, I have numbered the points not on the boundary like this:
151 | 
152 | $$
153 | \begin{array}{cc}
154 | 2&3\\
155 | 0&1
156 | \end{array}
157 | $$
158 | 
159 | ### Part 1: creating the matrix and vector
160 | **Write a function that takes $N$ as an input and returns the matrix $\mathrm{A}$ and the vector $\mathbf{b}$**. The matrix should be stored using an appropriate sparse format - you may use Scipy for this, and do not need to implement your own format.
161 | 
162 | You can find [example matrices and vectors for $N=2$, $N=3$, $N=4$ and $N=5$ here](2022-a4-A_and_b.md). You may wish to use them to validate your function, but you do not need to include this validation as
163 | part of the assignment.
164 | 
165 | ### Part 2: solving the system
166 | Solving the matrix-vector problem will lead to an approximate solution to the Helmholtz problem:
167 | we call this approximate solution $u_h$.
168 | 
169 | Using any matrix-vector solver, **solve the matrix-vector problem for $N=4$, $N=8$, and $N=16$** and **plot the approximate solutions
170 | to the Helmholtz problem**. To plot
171 | the solutions, you can pass the $x$- and $y$-coordinates of the points and the value of $u_h$ at each
172 | point into matplotlib's 3D plotting function. For the points on the boundary, the value of $u_h$ is
173 | given by the function $g$; for interior points, the value will be one of the entries of the solution
174 | vector $\mathbf{x}$.
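One fiddly step when plotting is combining the solution vector with boundary evaluations of $g$ on the full grid. A sketch using the interior numbering shown above (the helper name `full_grid` and its signature are our own):

```python
import numpy as np

def full_grid(x_sol, N, g):
    """Values of u_h at every mesh point, as an (N+1) by (N+1) array.

    x_sol holds the interior values, numbered left to right and then bottom
    to top (as in the examples above); g(x, y) supplies the boundary values.
    """
    h = 1.0 / N
    u = np.empty((N + 1, N + 1))
    for j in range(N + 1):  # row index: y = j * h
        for i in range(N + 1):  # column index: x = i * h
            if i in (0, N) or j in (0, N):
                u[j, i] = g(i * h, j * h)  # boundary point
            else:
                u[j, i] = x_sol[(j - 1) * (N - 1) + (i - 1)]  # interior point
    return u
```

The array returned, together with a `numpy.meshgrid` of the coordinates, can then be passed to matplotlib's `plot_surface`.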
175 | 176 | An example of 3D plotting in matplotlib can be found in the [sparse PDE example](sparse_linalg_pde.ipynb) from earlier in the course. 177 | 178 | ### Part 3: comparing solvers and preconditioners 179 | In this section, your task is to evaluate the performance of various matrix-vector solvers. 180 | To do this, **solve the matrix-vector problem with small to medium sized values of $N$ using a range of different solvers of your choice, 181 | measuring factors you deem to be important for your evaluation.** These factors should include 182 | the time taken by the solver, and may additionally include many other things such as the number of 183 | iterations taken by an iterative solver, or the size of the residual after each iteration. 184 | **Make a set of plots that show the measurements you have made and allow you to compare the solvers**. 185 | 186 | You should compare at least five matrix-vector solvers: at least two of these should be iterative 187 | solvers, and at least one should be a direct solver. You can use solvers from the Scipy 188 | library. (You may optionally use additional solvers from other linear algebra 189 | libraries such as PETSc, but you do not need to do this to achieve high marks. 190 | You should use solvers from these libraries and do not need to implement your own solvers.) 191 | For two of the iterative solvers you have chosen to use, 192 | **repeat the comparisons with three different choices of preconditioner**. 193 | 194 | Based on your experiments, **pick a solver** (and a preconditioner if it improves the solver) 195 | that you think is most appropriate to solve this matrix-vector problem. **Explain, making use 196 | of the data from your experiments, why this is the best solver for this problem**. 197 | 198 | ### Part 4: increasing $N$ 199 | In this section, you are going to use the solver you picked in part 3 to compute the solution 200 | for larger values of $N$.
201 | 202 | The problem we have been solving in this assignment has the exact solution $u_\text{exact}=\sin(3x+4y)$. 203 | A measure of the error of an approximate solution $u_h$ can be computed using 204 | 205 | $$ 206 | \sum_{i=0}^{N^2-1} h^2\left|u_\text{exact}(\mathbf{m}_i)-u_h(\mathbf{m}_i)\right|, 207 | $$ 208 | 209 | where $\mathbf{m}_i$ is the midpoint of the $i$th square in the finite element mesh: the value of 210 | $u_h$ at this midpoint will be the mean of the values at the four corners of the square. For points on 211 | the boundary, we set $u_h=g$ and so combine evaluations of $g$ and values in the solution vector to compute some of the values in this sum. 212 | 213 | For a range of values of $N$ from small to large, **compute the solution to the matrix-vector 214 | problem**. **Measure the time taken to compute this solution**, and **compute the error of the solution**. 215 | **Make plots showing the time taken and error as $N$ is increased**. 216 | 217 | Using your plots, **estimate the complexity of the solver you are using** (ie is it $\mathcal{O}(N)$? 218 | Is it $\mathcal{O}(N^2)$?), and **estimate the order of convergence of your solution** (your error 219 | should decrease like $\mathcal{O}(N^{-\alpha})$ for some $\alpha>0$). Briefly (1-2 sentences) 220 | **comment on how you have made these estimates of the complexity and order.** 221 | 222 | ### Part 5: parallelisation 223 | In this section, we will consider how your solution method could be parallelised; you do not need, 224 | however, to implement a parallel version of your solution method. 225 | 226 | **Comment on how your solution method could be parallelised.** Which parts (if any) would be trivial 227 | to parallelise? Which parts (if any) would be difficult to parallelise? By how much would you expect 228 | parallelisation to speed up your solution method? 
229 | 230 | If in part 4 you used a solver that we have not studied in lectures, you can discuss different solvers in parts 4 and 5. 231 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-class_1.md: -------------------------------------------------------------------------------- 1 | # Class 1 (Monday 10 October) 2 | 3 | These tasks are designed to be worked on in the practical class on Monday 10 October. 4 | 5 | ## Getting to know Numpy 6 | Practice using Numpy by doing the following tasks: 7 | 8 | ```python 9 | import numpy as np 10 | ``` 11 | 12 | - Create the vectors $\mathbf{a} = \begin{pmatrix}1\\2\\0\end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix}-3\\\tfrac32\\1\end{pmatrix}$. 13 | ```python 14 | a = np.array([1, 2, 0]) 15 | b = np.array([-3, 3/2, 1]) 16 | ``` 17 | - Create the matrix $A=\begin{pmatrix}1&0&0\\0&2&0\\1&0&1\end{pmatrix}$. 18 | ```python 19 | A = np.array([[1, 0, 0], [0, 2, 0], [1, 0, 1]]) 20 | ``` 21 | - Compute the matrix-vector product $Aa$. 22 | ```python 23 | print(A @ a) 24 | ``` 25 | - Compute the dot product $\mathbf{a}\cdot\mathbf{b}$. 26 | ```python 27 | print(np.dot(a, b)) 28 | print(a.dot(b)) 29 | ``` 30 | - Find a vector that is perpendicular to both $\mathbf{a}$ and $\mathbf{b}$. 31 | ```python 32 | print(np.cross(a, b)) 33 | ``` 34 | - Find the vector $\mathbf{x}$ such that $A\mathbf{x}=\mathbf{a}$ 35 | ```python 36 | print(np.linalg.solve(A, a)) 37 | 38 | # This is slower for larger matrices but will give the same result: 39 | print(np.linalg.inv(A) @ a) 40 | ``` 41 | 42 | The following snippet of code defines a function that computes a matrix-vector product. The last two lines should print the same result: 43 | this can be used to check that the function is behaving as expected. 
44 | ```python 45 | import numpy as np 46 | 47 | def slow_matvec(matrix, vector): 48 | assert matrix.shape[1] == vector.shape[0] 49 | result = [] 50 | for r in range(matrix.shape[0]): 51 | value = 0 52 | for c in range(matrix.shape[1]): 53 | value += matrix[r, c] * vector[c] 54 | result.append(value) 55 | return np.array(result) 56 | 57 | 58 | # Example of using this function 59 | matrix = np.random.rand(3, 3) 60 | vector = np.random.rand(3) 61 | print(slow_matvec(matrix, vector)) 62 | print(matrix @ vector) 63 | ``` 64 | Using this code as a template, write your own function called `faster_matvec` that computes matrix-vector products by taking the dot product of each row of the matrix with the vector. 65 | Check that your function also gives the same result as these two functions. 66 | 67 | ## Testing with asserts 68 | When writing code that you want to run on a large HPC system, it is important to test that the code is correct before setting it off. 69 | There are better ways to do this than manually comparing outputs like you did above. 70 | 71 | Probably the best way to test your code for correctness is to write some `assert` statements to assert that your function gives the correct result for some small problems that 72 | you know the answer to. For example, the following code tests a function that's been written to add two integers. 73 | ```python 74 | def add(a, b): 75 | return a + b 76 | 77 | 78 | assert add(1, 1) == 2 79 | assert add(4, 5) == 9 80 | ``` 81 | 82 | When using floating point numbers, asserts using `==` can fail even when the numbers should be the same (due to differences around the size of machine precision).
83 | For example, the following assert fails (or at least, it fails on my computer using Python 3.10.4): 84 | ```python 85 | assert 100 / 3 * 30 == 1000.0 86 | ``` 87 | To avoid this issue, the function `np.isclose` should be used, eg: 88 | ```python 89 | import numpy as np 90 | assert np.isclose(100 / 3 * 30, 1000.0) 91 | ``` 92 | For vectors and matrices, `np.allclose` can be used. 93 | 94 | Write some Python code that tests your `faster_matvec` function by computing the matrix-vector product of a random matrix and vector and comparing the result to 95 | the result when using `@`. 96 | 97 | (Python libraries commonly use the `pytest` library to carry out automated testing. Use of `pytest` is beyond the scope of this course.) 98 | 99 | ## Timing a function
You're now going to measure the time it takes to compute matrix-vector products with the two functions `slow_matvec` and `faster_matvec`. The following code snippet measures the time taken 101 | to run a function `f`. This runs the function 1000 times and prints the total time taken. 102 | ```python 103 | from timeit import timeit 104 | 105 | 106 | def f(): 107 | # contents of the function go here 108 | ... 109 | 110 | t = timeit(f, number=1000) 111 | print(t) 112 | ``` 113 | 114 | Write some Python code that measures the time taken to compute a matrix-vector product of an $n\times n$ matrix and a vector with $n$ entries for $n=2,10,100$ for both `slow_matvec` and `faster_matvec`. 115 | How much faster is your function than `slow_matvec`? 116 | 117 | ## Plotting with matplotlib 118 | The Python library matplotlib is commonly used to draw plots of data. Matplotlib gives you a lot of freedom to do whatever you want, but this means that it has a lot of options/functions you can use. 119 | 120 | (When making plots with matplotlib, you may want to bear in mind that around 4% of people have some form of colourblindness.
It can be very helpful to use different line styles and/or markers as well 121 | as colour differences.) 122 | 123 | The following example code makes a plot with the curves $y=x$, $y=x^2$ and $y=3x^2$ between $x=1$ and $x=3$. 124 | 125 | ```python 126 | import matplotlib.pylab as plt 127 | import numpy as np 128 | 129 | x = np.linspace(1, 3, 20) 130 | y0 = x 131 | y1 = x ** 2 132 | y2 = 3 * x ** 2 133 | 134 | plt.plot(x, y0, "ro-") 135 | plt.plot(x, y1, "g^-") 136 | plt.plot(x, y2, "b-") 137 | 138 | plt.xlabel("Values of x") 139 | plt.ylabel("Values of y") 140 | plt.legend(["$y=x$", "$y=x^2$", "$y=3x^2$"]) 141 | ``` 142 | 143 | (If you're using a Jupyter notebook, you may need to put the magic command `%matplotlib inline` at the start of the cell to display the plot. If you're running Python from the command line, 144 | you'll need to put `plt.show()` at the end to show the plot.) 145 | 146 | Make an alternative version of this plot with both axes using a log scale (hint: `plt.xscale("log")`). (Personally, I think log-log plots are much clearer if the $x$ and $y$ axes use equal tick sizes: 147 | you can do this with `plt.axis("equal")`.) What do you notice about your curves on this log-log plot? (Pen and paper task: starting with $y=ax^b$, work out why what you've observed happens.) 148 | 149 | Using matplotlib, make a plot showing the timings you measured in the previous part. Add some more data to your plot to make it more informative. 150 | 151 | ## Saving data to a file 152 | If your problem takes a long time to solve, you don't want to have to re-solve it to get the data when you want to tweak a matplotlib plot. It's therefore a good idea to save your data to a file 153 | and make a pretty plot with the data separately.
154 | 155 | In Jupyter notebooks, this can be achieved by running cells selectively: once you've run the cell with your solver once, you can tweak your plotting cell and re-run it as many times as you like 156 | without having to re-solve your problem. 157 | 158 | When running through a command line interface, you can use the functions `numpy.save` and `numpy.load` to save and load Numpy objects. For example, the following snippet saves a random matrix to a 159 | `.npy` file. 160 | 161 | ```python 162 | import numpy as np 163 | 164 | data = np.random.rand(10, 10) 165 | 166 | np.save("my_results.npy", data) 167 | ``` 168 | 169 | This matrix can then be loaded in a different file: 170 | 171 | ```python 172 | import numpy as np 173 | 174 | data = np.load("my_results.npy") 175 | ``` 176 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-class_2.md: -------------------------------------------------------------------------------- 1 | # Class 2 (Monday 17 October) 2 | 3 | These tasks are designed to be worked on in the practical class on Monday 17 October. 4 | 5 | ## Experimenting with Numba 6 | Let's start by looking at the matvec code we wrote last week. 7 | 8 | ```python 9 | import numpy as np 10 | 11 | def slow_matvec(matrix, vector): 12 | assert matrix.shape[1] == vector.shape[0] 13 | result = [] 14 | for r in range(matrix.shape[0]): 15 | value = 0 16 | for c in range(matrix.shape[1]): 17 | value += matrix[r, c] * vector[c] 18 | result.append(value) 19 | return np.array(result) 20 | 21 | 22 | # Example of using this function 23 | matrix = np.random.rand(3, 3) 24 | vector = np.random.rand(3) 25 | print(slow_matvec(matrix, vector)) 26 | print(matrix @ vector) 27 | ``` 28 | 29 | Use `numba.njit` to tell Numba to just-in-time (JIT) compile this function. 30 | 31 | Numba appears to be giving incorrect results for this function.
This is because Numba interprets `value = 0` as "make an **integer** `value` that is equal to 0", then 32 | will not allow `value` to take non-integer values. To fix this, replace `value = 0` with `value = 0.0`. 33 | 34 | Using matplotlib, make a plot that shows the time this function takes to compute a matrix-vector product with and without Numba acceleration. 35 | Add timings for the `faster_matvec` function that you wrote to this plot. The first time you call your function, it will need to do the JIT 36 | compilation: you may want to measure the time the first run takes separately. 37 | 38 | ## `jit` vs `njit` 39 | Add another line to your plot to show the timings if you use `numba.jit` instead of `numba.njit`. Which is faster? 40 | 41 | `numba.njit` will use "no Python mode", while `numba.jit` uses "Python compatibility mode". We would expect `numba.njit` to produce faster code, but `numba.jit` is able to compile a wider range of 42 | functions. 43 | 44 | ## Parallel range 45 | Replace any `range`s in your function with `numba.prange`: this will make your function use a parallel for loop. Compare the timings of your function with and without 46 | parallel ranges. How big does your matrix need to be before parallelisation becomes worth using? 47 | 48 | ## Optimising your code 49 | Take the fastest version of your function you've obtained so far. Is there anything else you can try doing to it to make it faster? Try a few things and see if you can get any more speed. 50 | 51 | Compare the time your function takes to the time Numpy takes to multiply two matrices. How close to Numpy's speed can you get? 52 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-class_3.md: -------------------------------------------------------------------------------- 1 | # Class 3 (Monday 24 October) 2 | 3 | These tasks are designed to be worked on in the practical class on Monday 24 October.
4 | 5 | In this class, we'll be running code on a GPU using Cuda. To see if you have a device available, you can run: 6 | 7 | ```python 8 | from numba import cuda 9 | cuda.detect() 10 | ``` 11 | 12 | If you don't have a suitable device on your own computer, you should 13 | use Google Colab this week: you can use a GPU in Colab by selecting **Runtime -> Change Runtime Type** and selecting **GPU**. 14 | 15 | During this class, you may wish to use the [GPU accelerated evaluation of particle sums](rbf_evaluation.ipynb) section of the lecture notes, where a 16 | similar example is worked through using a radial basis function kernel. 17 | 18 | ## Background 19 | In lots of applications, it is useful to calculate the sum 20 | 21 | $$\sum_jc_jg(\mathbf{x}, \mathbf{y}_j),$$ 22 | where $g$ is a "kernel" function, $\mathbf{x}$ is a point in $\mathbb{R}^3$, 23 | $\mathbf{y}_0,...,\mathbf{y}_{n-1}$ are (known) points in $\mathbb{R}^3$, and 24 | $c_0,...,c_{n-1}$ are (known) values in $\mathbb{C}$. 25 | 26 | In this class, we're going to use the acoustic Green's function 27 | 28 | $$g(\mathbf{x},\mathbf{y})=\mathrm{e}^{-\mathrm{i}k\left|\mathbf{x}-\mathbf{y}\right|}/\left(4\mathrm{\pi}\left|\mathbf{x}-\mathbf{y}\right|\right),$$ 29 | where $k$ is the wavenumber of the wave. 30 | This is the acoustic wave due to a point source: if there are point sources at points $\mathbf{y}_0,...,\mathbf{y}_{n-1}$ of sizes 31 | $c_0,...,c_{n-1}$, then the sum above can be used to compute the magnitude of a (time-harmonic) acoustic wave at each point. 32 | 33 | ## Plotting some waves 34 | There is a point source with wavenumber 10 at the point $(-1.2, 0, 0)$ with magnitude 1. 35 | The following code plots a slice through the wave due to this source in the plane $z=0$ with $0\leqslant x\leqslant 3$ and $-\frac32\leqslant y\leqslant \frac32$. 36 | 37 | ```python 38 | import numpy as np 39 | import matplotlib.pylab as plt 40 | 41 | k = 10.
42 | 43 | 44 | def g(x, y): 45 | """Evaluate real part of the acoustic Green's function.""" 46 | return np.cos(k * np.linalg.norm(x - y)) / 4 / np.pi / np.linalg.norm(x - y) 47 | 48 | 49 | sources = np.array([[-1.2, 0., 0.]]) 50 | magnitudes = np.array([1.0]) 51 | 52 | img_size = 250 53 | values = np.empty((img_size, img_size), dtype="float64") 54 | 55 | xmin = 0 56 | xmax = 3 57 | ymin = -1.5 58 | ymax = 1.5 59 | 60 | # plt.imshow interprets data as the colour of pixels starting at the top left then 61 | # row by row. For example, if an image was 5 pixels wide, the order of the pixels 62 | # would be: 63 | # 0 1 2 3 4 64 | # 5 6 7 8 9 65 | # etc 66 | # Due to this ordering, the y values here might at first glance appear to be backwards 67 | for i in range(img_size): 68 | y = ymax + (ymin - ymax) * i / (img_size - 1) 69 | for j in range(img_size): 70 | x = xmin + (xmax - xmin) * j / (img_size - 1) 71 | point = np.array([x, y, 0]) 72 | v = 0 73 | for m, s in zip(magnitudes, sources): 74 | v += m * g(s, point) 75 | values[i, j] = v 76 | 77 | plt.imshow(values, extent=[xmin, xmax, ymin, ymax]) 78 | plt.show() 79 | ``` 80 | 81 | Adapt this code so that it plots the real part of a wave due to two point sources with magnitude 1 at the points 82 | $(-1.2, 0.5, 0)$ and $(-1.2, -0.5, 0)$. 83 | 84 | Adapt this code so that it plots the real part of a wave due to four point sources with random magnitudes between 0 and 1 85 | at random points in the region $x=-1.2$, $-1\leqslant y\leqslant1$, $-1\leqslant z\leqslant1$. 86 | 87 | ## GPU acceleration 88 | Write a new version of the code for four sources that runs on a GPU using Numba's Cuda functionality. 89 | You should use blocks of 16 by 4 threads (I picked 4 as this is the number of sources), and an appropriately sized grid so that 90 | there is a thread for each point you want to compute the wave at.
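The number of blocks needed can be computed from the number of evaluation points; here is a minimal sketch (the 250 by 250 image size is taken from the code above, and the kernel would then be launched as `kernel[blocks_per_grid, threads_per_block](...)`):

```python
import math

img_size = 250  # number of evaluation points in each direction, as above
threads_per_block = (16, 4)  # blocks of 16 by 4 threads, as suggested

# Round up so that the grid of threads covers every point; any threads that
# fall past the edge of the image must return early inside the kernel.
blocks_per_grid = (
    math.ceil(img_size / threads_per_block[0]),
    math.ceil(img_size / threads_per_block[1]),
)
print(blocks_per_grid)  # (16, 63)
```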
91 | 92 | You may use the function `rbf_evaluation_cuda` from the lecture notes section [GPU accelerated evaluation of particle sums](rbf_evaluation.ipynb) 93 | as inspiration for your function. The function in the lecture notes uses the following features of `numba.cuda` that we didn't use in the lecture: 94 | 95 | - `cuda.shared.array` creates a shared array. This array can then be used by threads in the same block. 96 | - `cuda.grid` returns the current thread's absolute position in the entire grid of threads. For a two-dimensional grid, this is a pair of indices. 97 | - `cuda.syncthreads` synchronises all the threads in the same block. This allows you to ensure that all the threads are ready to perform the next operation at the same time 98 | (which is important as a GPU will perform best if all the threads in a block are performing the same operation). 99 | - `cuda.threadIdx.x` and `cuda.threadIdx.y` get the position of the current thread in the current block of threads. 100 | 101 | You may wish to create an array of points that you want to evaluate the wave at rather than computing `x` and `y` inside the loops; you could 102 | do this either by using two for loops to generate the points or by using `np.mgrid` and `ravel` as done in the [GPU accelerated evaluation of particle sums](rbf_evaluation.ipynb) section. 103 | 104 | Create a plot using single precision floating point numbers using your Cuda-accelerated function and a plot using double precision numbers using the standard Python code above. 105 | Visually compare the two plots: can you see any differences? 106 | 107 | ## Comparing GPU and CPU acceleration 108 | Write a version of the code that uses `numba.njit(parallel=True)` and `numba.prange` to create the plot in parallel on a CPU. (You may want to use the function `rbf_evaluation` 109 | from the lecture notes section [GPU accelerated evaluation of particle sums](rbf_evaluation.ipynb) as inspiration for your function.) 110 | 111 | Time your GPU and CPU functions.
(For your GPU function, you might want to copy a small array to the device (eg `a = cuda.to_device(np.array([1.]))`) before you start timing 112 | to be sure that the time waiting for the GPU to become available on Colab is not included in your timing.) Which is faster? 113 | 114 | ## Extension task 115 | Time your two functions for higher and lower numbers of points at which you compute the wave. Create a plot showing the time taken for the two functions 116 | as you vary the number of points. 117 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-class_4.md: -------------------------------------------------------------------------------- 1 | # Class 4 (Monday 31 October) 2 | 3 | These tasks are designed to be worked on in the practical class on Monday 31 October. 4 | 5 | In this class, we will make heavy use of the [finite difference code for solving a Poisson problem](https://gist.github.com/mscroggs/45ab606d6e69b811122b2697821267b1) 6 | that we wrote in lectures. 7 | 8 | ## Comparing dense and sparse storage 9 | Copy the code that we wrote to generate the matrix in dense and sparse formats. 10 | 11 | For a sensible range of $N$, measure how much memory a dense matrix and a COO matrix use. 12 | You can print the amount of memory a dense matrix uses by running: 13 | 14 | ```python 15 | import numpy as np 16 | 17 | a = np.zeros(...) 18 | print(a.nbytes) 19 | ``` 20 | 21 | You can print the amount of memory a COO matrix uses by running: 22 | 23 | ```python 24 | from scipy.sparse import coo_matrix 25 | 26 | b = coo_matrix(...) 27 | print(b.row.nbytes + b.col.nbytes + b.data.nbytes) 28 | ``` 29 | 30 | Create a plot that shows the memory used by the dense and COO sparse format against $N$. 31 | What do you notice?
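As a rough sketch of this kind of measurement, here is a single comparison using a hypothetical tridiagonal matrix (the Poisson matrix from the lecture has a similar sparsity pattern); the matrix built here is an assumption of the sketch, not the lecture's code:

```python
import numpy as np
from scipy.sparse import coo_matrix

# A hypothetical n by n tridiagonal matrix: 2 on the diagonal, -1 beside it.
n = 1000
data = np.repeat([2.0, -1.0, -1.0], [n, n - 1, n - 1])
rows = np.concatenate([np.arange(n), np.arange(n - 1), np.arange(1, n)])
cols = np.concatenate([np.arange(n), np.arange(1, n), np.arange(n - 1)])

sparse_mat = coo_matrix((data, (rows, cols)), shape=(n, n))
dense_mat = sparse_mat.toarray()

sparse_bytes = sparse_mat.row.nbytes + sparse_mat.col.nbytes + sparse_mat.data.nbytes
print(dense_mat.nbytes, sparse_bytes)  # the dense matrix is far larger
```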
32 | 33 | Use `scipy.sparse.linalg.spsolve` and `numpy.linalg.solve` to solve the problem for a range of values of $N$. 34 | Plot the time both solution methods take against $N$. 35 | What do you notice? 36 | 37 | ## Comparing sparse formats 38 | SciPy can convert between different sparse formats, for example 39 | 40 | ```python 41 | from scipy.sparse import coo_matrix 42 | 43 | matrix = coo_matrix(...) 44 | csr_mat = matrix.tocsr() 45 | ``` 46 | 47 | For a range of values of $N$, measure how much storage space is needed to store the matrix for the Poisson problem if the matrix is stored as 48 | a COO matrix, a CSR matrix, or a CSC matrix. 49 | For a CSR matrix, the amount of memory used can be printed by running 50 | 51 | ```python 52 | from scipy.sparse import coo_matrix 53 | 54 | matrix = coo_matrix(...) 55 | csr_mat = matrix.tocsr() 56 | print(csr_mat.data.nbytes + csr_mat.indices.nbytes + csr_mat.indptr.nbytes) 57 | ``` 58 | 59 | Make a plot showing the amount of memory needed vs $N$. Which format is the most memory efficient? 60 | 61 | Optional extension: Scipy also supports LIL, DIA, DOK, and BSR sparse formats. Add these to your plot. 62 | 63 | ## When is a sparse matrix worth it? 64 | In this section, we will investigate how many zeros we need a matrix to have for sparse storage to be worth doing. 65 | 66 | Create a 10 by 10 matrix which is all zeros except for $M$ random numbers in random positions. 67 | Measure the amount of memory needed to store this as a dense matrix and in different sparse matrix formats. 68 | Make a plot showing the amount of memory needed against $M$. 69 | What proportion of the matrix needs to be zeros for sparse storage to use less space? When are different sparse formats more efficient? 70 | 71 | Repeat this with a 40 by 40 matrix. 72 | What proportion of the matrix needs to be zeros for sparse storage to use less space? When are different sparse formats more efficient?
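One possible sketch of the measurement for a single value of $M$ (the loop over a range of $M$ values and the plotting are left out):

```python
import numpy as np
from scipy.sparse import coo_matrix

# A 10 by 10 matrix with M random values in random positions.
rng = np.random.default_rng(0)
M = 20
dense = np.zeros((10, 10))
positions = rng.choice(100, size=M, replace=False)
dense.flat[positions] = rng.random(M)

coo = coo_matrix(dense)
dense_bytes = dense.nbytes
coo_bytes = coo.row.nbytes + coo.col.nbytes + coo.data.nbytes
print(dense_bytes, coo_bytes)
```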
73 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-class_5.md: -------------------------------------------------------------------------------- 1 | # Class 5 (Monday 14 November) 2 | 3 | These tasks are designed to be worked on in the practical class on Monday 14 November. 4 | 5 | ## Using GMRES 6 | You can create an identity matrix ($\mathrm{I}$) with Numpy by running: 7 | ```python 8 | identity_matrix = np.eye(N) 9 | ``` 10 | 11 | Solve $\mathrm{I}\mathbf{x}=\mathrm{b}$ for a random vector $\mathbf{b}$ using GMRES (`scipy.sparse.linalg.gmres`). 12 | Make a plot showing the number of iterations vs the size of the residual. 13 | 14 | ## Experimenting with GMRES 15 | You can create a random matrix ($\mathrm{A}$) with Numpy by running: 16 | ```python 17 | a_matrix = np.random.randn(N, N) / np.sqrt(N) 18 | ``` 19 | 20 | Solve $\mathrm{A}\mathbf{x}=\mathrm{b}$ for a random vector $\mathbf{b}$ using GMRES. 21 | Make a plot showing the number of iterations vs the size of the residual. 22 | Compare this plot to the plot for the identity matrix. 23 | 24 | You can control the stopping criteria of GMRES by passing `tol` and `atol` parameters into GMRES (eg `scipy.sparse.linalg.gmres(A, b, tol=1e-8, atol=1e-8)`). 25 | Adjust these parameters so that GMRES gets to a lower residual than the default values. What is the lowest you can get the size of the residual to be? 26 | 27 | Consider the matrix $\mathrm{A}+\alpha\mathrm{I}$ for some constant $\alpha$. 28 | Solve $(\mathrm{A}+\alpha\mathrm{I})\mathbf{x}=\mathrm{b}$ for a random vector $\mathbf{b}$ using GMRES 29 | for a range of values of $\alpha$. 30 | Make a plot showing the number of iterations vs the size of the residual for each value of $\alpha$ 31 | that you chose. How is the value of $\alpha$ related to the performance of GMRES? 32 | 33 | Have you considered what happens if $\alpha$ is negative? Or complex?
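One way to record the residual at each iteration is to pass a callback to GMRES. A minimal sketch (the shift by $2\mathrm{I}$ is an arbitrary choice made here so the sketch converges quickly; with Scipy's default "legacy" callback, the argument passed to the callback is the residual norm itself):

```python
import numpy as np
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(0)
N = 100
A = rng.standard_normal((N, N)) / np.sqrt(N) + 2 * np.eye(N)
b = rng.standard_normal(N)

# With the default ("legacy") callback, gmres passes the residual norm
# to the callback on each inner iteration.
residuals = []
x, info = gmres(A, b, callback=lambda res: residuals.append(res))

print(info, len(residuals))
# `residuals` can now be plotted against the iteration number with matplotlib.
```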
34 | 35 | ## Plotting eigenvalues 36 | The performance of GMRES is related to the eigenvalues of the matrix. You can plot 37 | the eigenvalues of a matrix `A` with the following code: 38 | 39 | ```python 40 | import numpy as np 41 | from matplotlib import pyplot as plt 42 | from scipy.linalg import eigvals 43 | 44 | eigenvalues = eigvals(A) 45 | 46 | plt.plot(np.real(eigenvalues), np.imag(eigenvalues), 'rx', markersize=1) 47 | ``` 48 | 49 | Plot the eigenvalues of the matrices you used in the previous section. What do you notice about the eigenvalues 50 | when GMRES does not perform well? Based on your observations, pick a value of $\alpha$ for which you think 51 | GMRES will perform badly, and a value for which you think it will perform well. Make plots 52 | showing the number of iterations vs the size of the residual for these values of $\alpha$. Were your predictions correct? 53 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-class_6.md: -------------------------------------------------------------------------------- 1 | # Class 6 (Monday 21 November) 2 | 3 | These tasks are designed to be worked on in the practical class on Monday 21 November. 4 | 5 | ## Using CG 6 | You can create a random 500 by 500 symmetric positive definite matrix by running: 7 | ```python 8 | import numpy as np 9 | from numpy.random import RandomState 10 | 11 | n = 500 12 | 13 | rand = RandomState(0) 14 | 15 | Q, _ = np.linalg.qr(rand.randn(n, n)) 16 | D = np.diag(rand.rand(n)) 17 | A = Q.T @ D @ Q 18 | ``` 19 | 20 | Solve $\mathrm{A}\mathbf{x}=\mathrm{b}$ for a random vector $\mathbf{b}$ using CG (`scipy.sparse.linalg.cg`). 21 | Make a plot showing the number of iterations vs the size of the residual.
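Unlike GMRES's legacy callback, `cg`'s callback receives the current iterate rather than a norm, so the residual has to be computed inside the callback. A minimal sketch (the matrix here is a smaller, diagonally shifted variant of the one above; the shift is an assumption of the sketch, made so that it stays well conditioned and converges quickly):

```python
import numpy as np
from numpy.random import RandomState
from scipy.sparse.linalg import cg

# A smaller SPD matrix than above, with eigenvalues shifted into (1, 2).
n = 100
rand = RandomState(0)
Q, _ = np.linalg.qr(rand.randn(n, n))
D = np.diag(1 + rand.rand(n))
A = Q.T @ D @ Q

b = rand.rand(n)

# cg's callback receives the iterate xk, so compute the residual norm ourselves.
residuals = []
x, info = cg(A, b, callback=lambda xk: residuals.append(np.linalg.norm(b - A @ xk)))

print(info, len(residuals))
# `residuals` can now be plotted against the iteration number.
```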
22 | 23 | ## SPAI preconditioning 24 | The SPAI preconditioner is defined by 25 | 26 | $$ 27 | \begin{align*} 28 | \mathrm{C}_k &= \mathrm{A} \mathrm{M}_k\\ 29 | \mathrm{G}_k &= \mathrm{I} - \mathrm{C}_k\\ 30 | \alpha_k &=\operatorname{tr}(\mathrm{G}_k^\text{T}\mathrm{A}\mathrm{G}_k) / \|\mathrm{A}\mathrm{G}_k\|_\text{F}^2\\ 31 | \mathrm{M}_{k+1} &= \mathrm{M}_k + \alpha_k \mathrm{G}_k 32 | \end{align*} 33 | $$ 34 | 35 | Implement this preconditioner. Solve $\mathrm{A}\mathbf{x}=\mathrm{b}$ using $\mathrm{M}_m$ as a preconditioner for a range of values of $m$ and make a plot showing 36 | the number of iterations vs the size of the residual for each of these. 37 | If $m$ is too large, the preconditioner will take a long time to compute; if $m$ is too small, $\mathrm{M}_m$ will not be a good preconditioner. Experiment to find a good value to use for $m$. 38 | 39 | You may wish to use the code included in [the preconditioning section of the lecture notes](https://tbetcke.github.io/hpc_lecture_notes/it_solvers4.html) 40 | as a template. 41 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-class_7.md: -------------------------------------------------------------------------------- 1 | # Class 7 (Monday 5 December) 2 | 3 | These tasks are designed to be worked on in the practical class on Monday 5 December. 4 | 5 | ## LU for a tridiagonal matrix 6 | In Friday's lecture, we computed the LU factorisation of a dense matrix. 7 | The code we wrote during Friday's lecture can be found [at this link](https://gist.github.com/mscroggs/7c1b4440942fa48fe4a8cab0b9cb4a49). 8 | In today's class, we are going to compute the LU decomposition of a tridiagonal matrix. 
9 | 10 | We will use the following $n$ by $n$ matrix: 11 | 12 | $$ 13 | \mathrm{A}=\begin{pmatrix} 14 | a_0&-1&0&0&\cdots&0\\ 15 | -1&a_1&-1&0&\cdots&0\\ 16 | 0&-1&a_2&-1&\cdots&0\\ 17 | 0&0&-1&a_3&\cdots&0\\ 18 | \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 19 | 0&0&0&0&\cdots&a_{n-1}\\ 20 | \end{pmatrix}, 21 | $$ 22 | where $a_0$ to $a_{n-1}$ are random decimal values between 5 and 10. 23 | 24 | Write a function that takes $n$ as an input and returns the matrix $\mathrm{A}$ stored in a sparse format of your choice. 25 | 26 | Using [the code we wrote in Friday's lecture](https://gist.github.com/mscroggs/7c1b4440942fa48fe4a8cab0b9cb4a49) as a template, 27 | write a function that computes the LU decomposition of $\mathrm{A}$, and returns the factors $\mathrm{L}$ and $\mathrm{U}$ in a 28 | sparse format of your choice. Due to the structure of the matrix, you should not need to do any permuting of the rows. 29 | 30 | For a range of values of $n$, compute the LU decomposition using your function and measure the time this takes. 31 | Convert each matrix to a dense matrix and compute the LU decomposition using the code we wrote in Friday's lecture, timing 32 | this too. 33 | Plot these timings on log-log axes. What do you notice? 34 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-lsa_1.md: -------------------------------------------------------------------------------- 1 | # LSA Assignment 1 - Matrix-matrix multiplication 2 | 3 | This assignment makes up 20% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Friday 1 September 2023**. 4 | 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers 6 | to any text questions included in the assessment. 
The easiest ways to create this file are: 7 | 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 10 | 11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below. 12 | 13 | ## The assignment 14 | 15 | In this assignment, we will look at computing the product $AB$ of two matrices $A,B\in\mathbb{R}^{n\times n}$. The following snippet of code defines a function that computes the 16 | product of two matrices. As an example, the product of two 10 by 10 matrices is printed. The final line prints `matrix1 @ matrix2` - the `@` symbol denotes matrix multiplication, and 17 | Python will get Numpy to compute the product of two matrices. By looking at the output, it's possible to check that the two results are the same. 18 | 19 | ```python 20 | import numpy as np 21 | 22 | 23 | def slow_matrix_product(mat1, mat2): 24 | """Multiply two matrices.""" 25 | result = [] 26 | for c in range(mat2.shape[1]): 27 | column = [] 28 | for r in range(mat1.shape[0]): 29 | value = 0 30 | for i in range(mat1.shape[1]): 31 | value += mat1[r, i] * mat2[i, c] 32 | column.append(value) 33 | result.append(column) 34 | return np.array(result).transpose() 35 | 36 | 37 | matrix1 = np.random.rand(10, 10) 38 | matrix2 = np.random.rand(10, 10) 39 | 40 | print(slow_matrix_product(matrix1, matrix2)) 41 | print(matrix1 @ matrix2) 42 | ``` 43 | 44 | The function in this snippet isn't very good. 45 | 46 | ### Part 1: a better function 47 | **Write your own function called `faster_matrix_product` that computes the product of two matrices more efficiently than `slow_matrix_product`.** 48 | Your function may use functions from Numpy (eg `np.dot`) to complete part of its calculation, but your function should not use `np.dot` or `@` to compute 49 | the full matrix-matrix product. 
50 | 
51 | Before you look at the performance of your function, you should check that it is computing the correct results. **Write a Python script using an `assert`
52 | statement that checks that your function gives the same result as using `@` for a pair of random 5 by 5 matrices.**
53 | 
54 | In a text box, **give two brief reasons (1-2 sentences for each) why your function is better than `slow_matrix_product`.** At least one of your
55 | reasons should be related to the time you expect the two functions to take.
56 | 
57 | Next, we want to compare the speed of `slow_matrix_product` and `faster_matrix_product`. **Write a Python script that runs the two functions for matrices of a range of sizes,
58 | and use `matplotlib` to create a plot showing the time taken for different sized matrices for both functions.** You should be able to run the functions for matrices
59 | of size up to around 1000 by 1000 (but if you're using an older/slower computer, you may decide to decrease the maximums slightly). You do not need to run your functions for
60 | every size between your minimum and maximum, but should choose a set of 10-15 values that will give you an informative plot.
61 | 
62 | ### Part 2: speeding it up with Numba
63 | In the second part of this assignment, you're going to use Numba to speed up your function.
64 | 
65 | **Create a copy of your function `faster_matrix_product` that is just-in-time (JIT) compiled using Numba.** To demonstrate the speed improvement achieved by using Numba,
66 | **make a plot (similar to that you made in the first part) that shows the times taken to multiply matrices using `faster_matrix_product`, `faster_matrix_product` with
67 | Numba JIT compilation, and Numpy (`@`).** Numpy's matrix-matrix multiplication is highly optimised, so you should not expect to be as fast as it.
68 | 
69 | You may be able to achieve further speed up of your function by adjusting the memory layout used.
Focusing on the fact
70 | that it is more efficient to access memory that is close to previous accesses, **comment (in 1-2 sentences) on which ordering for each matrix you would expect to lead to the fastest
71 | matrix-matrix multiplication**.
72 | 
73 | 
--------------------------------------------------------------------------------
/hpc_lecture_notes/2022-lsa_3.md:
--------------------------------------------------------------------------------
 1 | # LSA Assignment 3 - Sparse matrices
 2 | 
 3 | This assignment makes up 30% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Friday 1 September 2023**.
 4 | 
 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers
 6 | to any text questions included in the assessment. The easiest ways to create this file are:
 7 | 
 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf).
 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf.
10 | 
11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below.
12 | 
13 | ## The assignment
14 | 
15 | ### Part 1: Comparing sparse and dense matrices
16 | For a collection of sparse matrices of your choice and a random vector, **measure the time taken to perform a `matvec` product**.
17 | You should use Scipy to store the sparse matrix in the most suitable format.
18 | Convert the same matrices to Numpy dense matrices and **measure the time taken to compute a dense matrix-vector product using Numpy**.
19 | **Create a plot showing the times of the sparse and dense products for a range of matrix sizes** and
20 | **briefly (1-2 sentences) comment on what your plot shows**.
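A minimal timing harness along these lines might look as follows (the matrix size, density, and use of `time.perf_counter` are illustrative choices, not requirements):

```python
import time

import numpy as np
import scipy.sparse as sp


def time_matvec(matrix, vector, repeats=10):
    """Return the best of several wall-clock timings of matrix @ vector."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        matrix @ vector
        times.append(time.perf_counter() - start)
    return min(times)


n = 2000
sparse_mat = sp.random(n, n, density=0.001, format="csr")  # CSR suits matvec products
dense_mat = sparse_mat.toarray()
vec = np.random.rand(n)

print(time_matvec(sparse_mat, vec), time_matvec(dense_mat, vec))
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement.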
21 | 
22 | For a matrix of your choice and a random vector, **use Scipy's `gmres` and `cg` sparse solvers to solve a matrix problem using your CSR matrix**.
23 | Check if the two solutions obtained are the same.
24 | **Briefly comment (1-2 sentences) on why the solutions are or are not the same (or are nearly but not exactly the same).**
25 | 
26 | ### Part 2: Implementing a custom matrix
27 | The following code snippet shows how you can define your own matrix-like operator.
28 | 
29 | ```python
30 | from scipy.sparse.linalg import LinearOperator
31 | 
32 | 
33 | class CSRMatrix(LinearOperator):
34 |     def __init__(self, coo_matrix):
35 |         self.shape = coo_matrix.shape
36 |         self.dtype = coo_matrix.dtype
37 |         # You'll need to put more code here
38 | 
39 |     def _matvec(self, vector):
40 |         """Compute a matrix-vector product."""
41 |         pass
42 | ```
43 | 
44 | Let $\mathrm{A}$ be an $n+1$ by $n+1$ matrix with the following structure:
45 | 
46 | - The top left $n$ by $n$ block of $\mathrm{A}$ is a diagonal matrix
47 | - The bottom right entry is 0, and the remaining entries of the last row and last column are non-zero
48 | 
49 | In other words, $\mathrm{A}$ looks like this, where $*$ represents a non-zero value
50 | 
51 | $$
52 | \mathrm{A}=\begin{pmatrix}
53 | *&0&0&\cdots&0&\hspace{3mm}*\\
54 | 0&*&0&\cdots&0&\hspace{3mm}*\\
55 | 0&0&*&\cdots&0&\hspace{3mm}*\\
56 | \vdots&\vdots&\vdots&\ddots&0&\hspace{3mm}\vdots\\
57 | 0&0&0&\cdots&*&\hspace{3mm}*\\[3mm]
58 | *&*&*&\cdots&*&\hspace{3mm}0\\
59 | \end{pmatrix}
60 | $$
61 | 
62 | **Implement a Scipy `LinearOperator` for matrices of this form**. Your implementation must include a matrix-vector product (`matvec`) and the shape of the matrix (`self.shape`).
63 | In your implementation of `matvec`, you should be careful to ensure that the product does not have more computational complexity than necessary.
64 | 
65 | For a range of values of $n$, **create matrices in your format where each entry is a random number**.
66 | For each of these matrices, **compute matrix-vector products using your implementation and measure the time taken to compute these**. 67 | Create an alternative version of each matrix, stored using a Scipy or Numpy format of your choice, 68 | and **measure the time taken to compute matrix-vector products using this format**. **Make a plot showing time taken against $n$**. 69 | **Comment (2-4 sentences) on what your plot shows, and why you think one of these methods is faster than the other** (or why they take the same amount of time if this is the case). 70 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022-lsa_4.md: -------------------------------------------------------------------------------- 1 | # LSA Assignment 4 - Solving a finite element system 2 | 3 | This assignment makes up 30% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Friday 1 September 2023**. 4 | 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers 6 | to any text questions included in the assessment. The easiest ways to create this file are: 7 | 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 10 | 11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below. 12 | 13 | ## The assignment 14 | 15 | In this assignment, we will look at solving the matrix-vector problem 16 | 17 | $$Ax=b,$$ 18 | 19 | where $A$ is an $n$ by $n$ matrix and $b$ is a vector with $n$ entries. 
The entries $a_{ij}$ (with $0\leqslant i,j\leqslant n-1$) of $A$ and $b_j$ of $b$ are given by:
20 | 
21 | $$\begin{align*}
22 | a_{ij} &=
23 | \begin{cases}
24 | 2&i=j\\
25 | 1&i=n-1\text{ and }j=0\\
26 | 1&i=n-1\text{ and }j=1\\
27 | 1&i=n-2\text{ and }j=0\\
28 | 1&i=0\text{ and }j=n-1\\
29 | 1&i=1\text{ and }j=n-1\\
30 | 1&i=0\text{ and }j=n-2\\
31 | -1/i&i=j+1\\
32 | -1/j&i+1=j\\
33 | 0&\text{otherwise}
34 | \end{cases}\\[3mm]
35 | b_j &= 1.
36 | \end{align*}$$
37 | 
38 | For example, if $n=6$, then the matrix is
39 | 
40 | $$\begin{pmatrix}
41 | 2&-1&0&0&1&1\\
42 | -1&2&-0.5&0&0&1\\
43 | 0&-0.5&2&-0.33333333&0&0\\
44 | 0&0&-0.33333333&2&-0.25&0\\
45 | 1&0&0&-0.25&2&-0.2\\
46 | 1&1&0&0&-0.2&2
47 | \end{pmatrix}$$
48 | 
49 | ### Part 1: creating a matrix and vector
50 | **Write a function that takes $N$ as an input and returns the matrix $\mathrm{A}$ and the vector $\mathbf{b}$**. The matrix should be stored using an appropriate sparse format - you may use Scipy for this, and do not need to implement your own format.
51 | 
52 | ### Part 2: comparing solvers and preconditioners
53 | In this section, your task is to evaluate the performance of various matrix-vector solvers.
54 | To do this, **solve the matrix-vector problem for small to medium sized values of $N$ using a range of different solvers of your choice,
55 | measuring factors you deem to be important for your evaluation.** These factors should include
56 | the time taken by the solver, and may additionally include many other things such as the number of
57 | iterations taken by an iterative solver, or the size of the residual after each iteration.
58 | **Make a set of plots that show the measurements you have made and allow you to compare the solvers**.
59 | 
60 | You should compare at least five matrix-vector solvers: at least two of these should be iterative
61 | solvers, and at least one should be a direct solver. You can use solvers from the Scipy
62 | library.
(You may optionally use additional solvers from other linear algebra
63 | libraries such as PETSc, but you do not need to do this to achieve high marks.
64 | In all cases, you should use solvers from these libraries and do not need to implement your own solvers.)
65 | For two of the iterative solvers you have chosen to use,
66 | **repeat the comparisons with three different choices of preconditioner**.
67 | 
68 | Based on your experiments, **pick a solver** (and a preconditioner if it improves the solver)
69 | that you think is most appropriate to solve this matrix-vector problem. **Explain, making use
70 | of the data from your experiments, why this is the best solver for this problem**.
71 | 
72 | ### Part 3: increasing $N$
73 | In this section, you are going to use the solver you picked in part 2 to compute the solution
74 | for larger values of $N$.
75 | 
76 | For a range of values of $N$ from small to large, **compute the solution to the matrix-vector
77 | problem**. **Measure the time taken to compute this solution**.
78 | **Make a plot showing the time taken and error as $N$ is increased**.
79 | 
80 | Using your plots, **estimate the complexity of the solver you are using** (ie is it $\mathcal{O}(N)$?
81 | Is it $\mathcal{O}(N^2)$?). Briefly (1-2 sentences)
82 | **comment on how you have made these estimates of the complexity and order.**
83 | 
84 | ### Part 4: parallelisation
85 | In this section, we will consider how your solution method could be parallelised; you do not need,
86 | however, to implement a parallel version of your solution method.
87 | 
88 | **Comment on how your solution method could be parallelised.** Which parts (if any) would be trivial
89 | to parallelise? Which parts (if any) would be difficult to parallelise? By how much would you expect
90 | parallelisation to speed up your solution method?
91 | 
92 | If in part 3 you used a solver that we have not studied in lectures, you can discuss different solvers in parts 3 and 4.
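As an illustration of the kind of measurements described in part 2 (iteration counts via a callback, and the final residual), Scipy's iterative solvers can be instrumented as below; the small tridiagonal test matrix here is a placeholder, not the assignment's matrix:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Placeholder system: a symmetric positive definite tridiagonal matrix
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

iteration_count = [0]


def callback(xk):
    """Called by the solver once per iteration with the current iterate."""
    iteration_count[0] += 1


x, info = spla.cg(A, b, callback=callback)  # info == 0 means the solver converged
residual = np.linalg.norm(b - A @ x)
print(iteration_count[0], residual)
```

Wrapping this in a loop over solvers and values of $N$, and recording the times alongside the iteration counts, gives the data needed for the comparison plots.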
93 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2022_classes.md: -------------------------------------------------------------------------------- 1 | # Python classes 2 | If you've not used Python classes before, this guide will give you an introduction to what they are and what they are used for. 3 | 4 | In general, a `class` is a set of instructions for how to create an object. An instance of a class is the object that is created from the class. 5 | As an example, we can create a class that defines a fraction. 6 | 7 | ```python 8 | from math import gcd 9 | 10 | 11 | class Fraction: 12 | """A fraction.""" 13 | 14 | def __init__(self, numerator, denominator): 15 | self.numerator = numerator 16 | self.denominator = denominator 17 | 18 | def print_numerator(self): 19 | print(self.numerator) 20 | 21 | def __str__(self): 22 | """Get string representation.""" 23 | return str(self.numerator) + " over " + str(self.denominator) 24 | 25 | def __add__(self, other): 26 | """Add two fractions.""" 27 | assert isinstance(other, Fraction) 28 | new_numerator = self.numerator * other.denominator + other.numerator * self.denominator 29 | new_denominator = self.denominator * other.denominator 30 | common_factor = gcd(new_numerator, new_denominator) 31 | return Fraction(new_numerator // common_factor, new_denominator // common_factor) 32 | 33 | def __mul__(self, other): 34 | """Multiply two fractions.""" 35 | assert isinstance(other, Fraction) 36 | new_numerator = self.numerator * other.numerator 37 | new_denominator = self.denominator * other.denominator 38 | common_factor = gcd(new_numerator, new_denominator) 39 | return Fraction(new_numerator // common_factor, new_denominator // common_factor) 40 | ``` 41 | 42 | This class contains a number of functions: functions inside a class are called methods. 
43 | Each method takes `self` as its first input: this refers to the instance of the class itself and can be used to store information
44 | that other methods will need. (You could use another name instead of `self` but it's very common practice to use `self`.)
45 | Functions that start and end with a double underscore (`__`) are special methods.
46 | 
47 | The special method `__init__` is run when an instance of the class is created. In this function, we store the numerator and denominator of the
48 | fraction as `self.numerator` and `self.denominator`.
49 | 
50 | We can create a fraction and call the `print_numerator` method like this:
51 | 
52 | ```python
53 | half = Fraction(1, 2)
54 | half.print_numerator()
55 | ```
56 | 
57 | The first line will run `__init__` with `half` as `self`, `1` as `numerator` and `2` as `denominator`. The second line will run `print_numerator` with `half` as `self`, and will therefore
58 | print `1`.
59 | 
60 | The special method `__str__` defines what happens when you `print` an instance of your class. In our example, `print(half)` will print `1 over 2`.
61 | 
62 | The special method `__add__` defines what the `+` operator does. If you read the implementation above, you can see that `__add__`
63 | is adding fractions in the way you would expect. For example, you can add two fractions like this:
64 | 
65 | ```python
66 | half = Fraction(1, 2)
67 | third = Fraction(1, 3)
68 | print(half + third)
69 | ```
70 | 
71 | The special method `__mul__` defines what the `*` operator does. For example, you can multiply two fractions like this:
72 | 
73 | ```python
74 | half = Fraction(1, 2)
75 | third = Fraction(1, 3)
76 | print(half * third)
77 | ```
78 | 
79 | There are a lot more special methods that you can use, including those to define the behaviour of `-`, `/`, `//`, `**`, `@`, `[]`, `=`, `<`, `<=`, `>`, and `>=`.
You can find
80 | [details of these in the Python documentation](https://docs.python.org/3/reference/datamodel.html#specialnames).
81 | 
--------------------------------------------------------------------------------
/hpc_lecture_notes/2022_matrices_and_simultaneous_equations.md:
--------------------------------------------------------------------------------
 1 | # Matrices and simultaneous equations
 2 | It is common to rewrite systems of simultaneous equations as a matrix-vector problem. For example, the equations
 3 | 
 4 | $$
 5 | \begin{align*}
 6 | 4a_0 + 3a_1 &= 2\\
 7 | a_0 - a_3 &= 1\\
 8 | -a_2 - a_3 &= 0\\
 9 | 2a_0 &= 1
10 | \end{align*}
11 | $$
12 | 
13 | can be written as the matrix-vector problem
14 | 
15 | $$
16 | \begin{pmatrix}
17 | 4&3&0&0\\
18 | 1&0&0&-1\\
19 | 0&0&-1&-1\\
20 | 2&0&0&0
21 | \end{pmatrix}
22 | \begin{pmatrix}
23 | a_0\\a_1\\a_2\\a_3
24 | \end{pmatrix}
25 | =
26 | \begin{pmatrix}
27 | 2\\1\\0\\1
28 | \end{pmatrix}.
29 | $$
30 | 
31 | By multiplying out the matrix, you can see that each row of the matrix paired with one entry in the vector represents one of the simultaneous equations.
32 | 
33 | This is what I did in the lecture with the (more complicated) equations
34 | 
35 | $$
36 | \begin{align*}
37 | u_{i,j} &= 0&&\text{if the point is on the boundary},\\
38 | \frac{4u_{i,j}-u_{i+1,j}-u_{i-1,j}-u_{i,j+1}-u_{i,j-1}}{h^2} &= 1&&\text{otherwise},
39 | \end{align*}
40 | $$
41 | 
42 | to get the matrix problem
43 | 
44 | $$
45 | \mathrm{A}
46 | \begin{pmatrix}
47 | u_{0,0}\\
48 | u_{1,0}\\
49 | u_{2,0}\\
50 | \vdots\\
51 | u_{N,0}\\
52 | u_{0,1}\\
53 | u_{1,1}\\
54 | u_{2,1}\\
55 | \vdots\\
56 | u_{N,N}\\
57 | \end{pmatrix}
58 | =\mathbf{b}.
59 | $$
60 | 
61 | If you didn't follow how I got the matrix during the lecture, take a look at [the code we wrote during the lecture](https://gist.github.com/mscroggs/45ab606d6e69b811122b2697821267b1),
62 | and see if you can work out how the matrix corresponds to the simultaneous equations.
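To check the correspondence concretely, the small system at the top of this page can be assembled and solved with Numpy (this is just a sanity check of the rewriting, not part of the lecture code):

```python
import numpy as np

# Each row of the matrix encodes one of the simultaneous equations
A = np.array([
    [4.0, 3.0, 0.0, 0.0],    # 4*a0 + 3*a1 = 2
    [1.0, 0.0, 0.0, -1.0],   # a0 - a3 = 1
    [0.0, 0.0, -1.0, -1.0],  # -a2 - a3 = 0
    [2.0, 0.0, 0.0, 0.0],    # 2*a0 = 1
])
b = np.array([2.0, 1.0, 0.0, 1.0])

a = np.linalg.solve(A, b)
# Multiplying out reproduces the right-hand side of each equation
assert np.allclose(A @ a, b)
print(a)  # the values of a0, a1, a2, a3
```

Substituting the computed values back into the original equations is a good way to convince yourself that the matrix form really is the same system.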
63 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2023-assignment_1-lsa.md: -------------------------------------------------------------------------------- 1 | # LSA Assignment 1 - Matrix-matrix multiplication 2 | 3 | The deadline for submitting this assignment is **Midnight Friday 30 August 2024**. 4 | 5 | The easiest ways to create this file are: 6 | 7 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 8 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 9 | 10 | Tasks you are required to carry out and questions you are required to answer are shown in bold below. 11 | 12 | ## The assignment 13 | 14 | In this assignment, we will look at computing the product $AB$ of two matrices $A,B\in\mathbb{R}^{n\times n}$. The following snippet of code defines a function that computes the 15 | product of two matrices. As an example, the product of two 10 by 10 matrices is printed. The final line prints `matrix1 @ matrix2` - the `@` symbol denotes matrix multiplication, and 16 | Python will get Numpy to compute the product of two matrices. By looking at the output, it's possible to check that the two results are the same. 17 | 18 | ```python 19 | import numpy as np 20 | 21 | 22 | def slow_matrix_product(mat1, mat2): 23 | """Multiply two matrices.""" 24 | assert mat1.shape[1] == mat2.shape[0] 25 | result = [] 26 | for c in range(mat2.shape[1]): 27 | column = [] 28 | for r in range(mat1.shape[0]): 29 | value = 0 30 | for i in range(mat1.shape[1]): 31 | value += mat1[r, i] * mat2[i, c] 32 | column.append(value) 33 | result.append(column) 34 | return np.array(result).transpose() 35 | 36 | 37 | matrix1 = np.random.rand(10, 10) 38 | matrix2 = np.random.rand(10, 10) 39 | 40 | print(slow_matrix_product(matrix1, matrix2)) 41 | print(matrix1 @ matrix2) 42 | ``` 43 | 44 | The function in this snippet isn't very good. 
45 | 46 | ### Part 1: a better function 47 | **Write your own function called `faster_matrix_product` that computes the product of two matrices more efficiently than `slow_matrix_product`.** 48 | Your function may use functions from Numpy (eg `np.dot`) to complete part of its calculation, but your function should not use `np.dot` or `@` to compute 49 | the full matrix-matrix product. 50 | 51 | Before you look at the performance of your function, you should check that it is computing the correct results. **Write a Python script using an `assert` 52 | statement that checks that your function gives the same result as using `@` for random 2 by 2, 3 by 3, 4 by 4, and 5 by 5 matrices.** 53 | 54 | In a text box, **give two brief reasons (1-2 sentences for each) why your function is better than `slow_matrix_product`.** At least one of your 55 | reasons should be related to the time you expect the two functions to take. 56 | 57 | Next, we want to compare the speed of `slow_matrix_product` and `faster_matrix_product`. **Write a Python script that runs the two functions for matrices of a range of sizes, 58 | and use `matplotlib` to create a plot showing the time taken for different sized matrices for both functions.** You should be able to run the functions for matrices 59 | of size up to around 1000 by 1000 (but if you're using an older/slower computer, you may decide to decrease the maximums slightly). You do not need to run your functions for 60 | every size between your minimum and maximum, but should choose a set of 10-15 values that will give you an informative plot. 61 | 62 | ### Part 2: speeding it up with Numba 63 | In the second part of this assignment, you're going to use Numba to speed up your function. 
64 | 
65 | **Create a copy of your function `faster_matrix_product` that is just-in-time (JIT) compiled using Numba.** To demonstrate the speed improvement achieved by using Numba,
66 | **make a plot (similar to that you made in the first part) that shows the times taken to multiply matrices using `faster_matrix_product`, `faster_matrix_product` with
67 | Numba JIT compilation, and Numpy (`@`).** Numpy's matrix-matrix multiplication is highly optimised, so you should not expect to be as fast as it.
68 | 
69 | You may be able to achieve further speed up of your function by adjusting the memory layout used. The function `np.asfortranarray` will make a copy of an array that uses
70 | Fortran-style ordering, for example:
71 | 
72 | ```python
73 | import numpy as np
74 | 
75 | a = np.random.rand(10, 10)
76 | fortran_a = np.asfortranarray(a)
77 | ```
78 | 
79 | **Make a plot that compares the times taken by your JIT compiled function when the inputs have different combinations of C-style and Fortran-style ordering**
80 | (ie the plot should have lines for when both inputs are C-style, when the first is C-style and the second is Fortran-style, and so on). Focusing on the fact
81 | that it is more efficient to access memory that is close to previous accesses, **comment (in 1-2 sentences) on why one of these orderings appears to be faster than the others**.
82 | (Numba can do a lot of different things when compiling code, so depending on your function there may or may not be a large difference: if there is little change in speeds
83 | for your function, you can comment on which ordering you might expect to be faster and why, but conclude that Numba is doing something more advanced.)
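The difference between the two layouts can be inspected through an array's `flags` attribute (a small check that is independent of whatever Numba does with your function):

```python
import numpy as np

a = np.random.rand(10, 10)        # C-style (row-major) layout by default
fortran_a = np.asfortranarray(a)  # same values, Fortran-style (column-major) layout

# The entries are identical; only the order they are stored in memory differs
assert np.array_equal(a, fortran_a)
assert a.flags["C_CONTIGUOUS"] and not a.flags["F_CONTIGUOUS"]
assert fortran_a.flags["F_CONTIGUOUS"] and not fortran_a.flags["C_CONTIGUOUS"]
```

In C-style order the elements of each row are adjacent in memory; in Fortran-style order the elements of each column are, which is what makes the ordering of each input matter for access patterns.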
84 | 
85 | 
--------------------------------------------------------------------------------
/hpc_lecture_notes/2023-assignment_1.md:
--------------------------------------------------------------------------------
 1 | # Assignment 1 - Matrix-matrix multiplication
 2 | 
 3 | This assignment makes up 20% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Thursday 19 October 2023**.
 4 | 
 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers
 6 | to any text questions included in the assessment. The easiest ways to create this file are:
 7 | 
 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf).
 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf.
10 | 
11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below.
12 | 
13 | ## The assignment
14 | 
15 | In this assignment, we will look at computing the product $AB$ of two matrices $A,B\in\mathbb{R}^{n\times n}$. The following snippet of code defines a function that computes the
16 | product of two matrices. As an example, the product of two 10 by 10 matrices is printed. The final line prints `matrix1 @ matrix2` - the `@` symbol denotes matrix multiplication, and
17 | Python will get Numpy to compute the product of two matrices. By looking at the output, it's possible to check that the two results are the same.
18 | 19 | ```python 20 | import numpy as np 21 | 22 | 23 | def slow_matrix_product(mat1, mat2): 24 | """Multiply two matrices.""" 25 | assert mat1.shape[1] == mat2.shape[0] 26 | result = [] 27 | for c in range(mat2.shape[1]): 28 | column = [] 29 | for r in range(mat1.shape[0]): 30 | value = 0 31 | for i in range(mat1.shape[1]): 32 | value += mat1[r, i] * mat2[i, c] 33 | column.append(value) 34 | result.append(column) 35 | return np.array(result).transpose() 36 | 37 | 38 | matrix1 = np.random.rand(10, 10) 39 | matrix2 = np.random.rand(10, 10) 40 | 41 | print(slow_matrix_product(matrix1, matrix2)) 42 | print(matrix1 @ matrix2) 43 | ``` 44 | 45 | The function in this snippet isn't very good. 46 | 47 | ### Part 1: a better function 48 | **Write your own function called `faster_matrix_product` that computes the product of two matrices more efficiently than `slow_matrix_product`.** 49 | Your function may use functions from Numpy (eg `np.dot`) to complete part of its calculation, but your function should not use `np.dot` or `@` to compute 50 | the full matrix-matrix product. 51 | 52 | Before you look at the performance of your function, you should check that it is computing the correct results. **Write a Python script using an `assert` 53 | statement that checks that your function gives the same result as using `@` for random 2 by 2, 3 by 3, 4 by 4, and 5 by 5 matrices.** 54 | 55 | In a text box, **give two brief reasons (1-2 sentences for each) why your function is better than `slow_matrix_product`.** At least one of your 56 | reasons should be related to the time you expect the two functions to take. 57 | 58 | Next, we want to compare the speed of `slow_matrix_product` and `faster_matrix_product`. 
**Write a Python script that runs the two functions for matrices of a range of sizes,
59 | and use `matplotlib` to create a plot showing the time taken for different sized matrices for both functions.** You should be able to run the functions for matrices
60 | of size up to around 1000 by 1000 (but if you're using an older/slower computer, you may decide to decrease the maximums slightly). You do not need to run your functions for
61 | every size between your minimum and maximum, but should choose a set of 10-15 values that will give you an informative plot.
62 | 
63 | ### Part 2: speeding it up with Numba
64 | In the second part of this assignment, you're going to use Numba to speed up your function.
65 | 
66 | **Create a copy of your function `faster_matrix_product` that is just-in-time (JIT) compiled using Numba.** To demonstrate the speed improvement achieved by using Numba,
67 | **make a plot (similar to that you made in the first part) that shows the times taken to multiply matrices using `faster_matrix_product`, `faster_matrix_product` with
68 | Numba JIT compilation, and Numpy (`@`).** Numpy's matrix-matrix multiplication is highly optimised, so you should not expect to be as fast as it.
69 | 
70 | You may be able to achieve further speed up of your function by adjusting the memory layout used. The function `np.asfortranarray` will make a copy of an array that uses
71 | Fortran-style ordering, for example:
72 | 
73 | ```python
74 | import numpy as np
75 | 
76 | a = np.random.rand(10, 10)
77 | fortran_a = np.asfortranarray(a)
78 | ```
79 | 
80 | **Make a plot that compares the times taken by your JIT compiled function when the inputs have different combinations of C-style and Fortran-style ordering**
81 | (ie the plot should have lines for when both inputs are C-style, when the first is C-style and the second is Fortran-style, and so on).
Focusing on the fact
82 | that it is more efficient to access memory that is close to previous accesses, **comment (in 1-2 sentences) on why one of these orderings appears to be faster than the others**.
83 | (Numba can do a lot of different things when compiling code, so depending on your function there may or may not be a large difference: if there is little change in speeds
84 | for your function, you can comment on which ordering you might expect to be faster and why, but conclude that Numba is doing something more advanced.)
85 | 
86 | 
--------------------------------------------------------------------------------
/hpc_lecture_notes/2023-assignment_2-lsa.md:
--------------------------------------------------------------------------------
 1 | # LSA Assignment 2 - GPU Accelerated solution of Poisson problems
 2 | 
 3 | **Note: This is the assignment from the 2021-22 academic year.**
 4 | 
 5 | The deadline for submitting this assignment is **Midnight Friday 30 August 2024**.
 6 | 
 7 | You should submit a single pdf file containing your code, the output when you run your code, and your answers to any text questions included in the assessment. The easiest ways to create this file are:
 8 | 
 9 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf).
10 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf.
11 | In this assignment we consider the solution of Poisson problems of the form
12 | 
13 | $$
14 | -\Delta u(x, y) = f(x, y)
15 | $$
16 | with $\Delta u := u_{xx} + u_{yy}$
17 | for $(x, y)\in\Omega\subset\mathbb{R}^2$ and boundary conditions $u(x, y) = g(x, y)$ on $\Gamma :=\partial\Omega$.
18 | 
19 | For all our experiments the domain $\Omega$ is the unit square $\Omega :=[0, 1]^2$.
20 | 
21 | To numerically solve this problem we define grid points $x_i := ih$ and $y_j :=jh$ with $i, j=1, \dots, N$ and $h=1/(N+1)$. We can now approximate
22 | 
23 | $$
24 | -\Delta u(x_i, y_j) \approx \frac{1}{h^2}(4 u(x_i, y_j) - u(x_{i-1}, y_j) - u(x_{i+1}, y_j) - u(x_{i}, y_{j-1}) - u(x_i, y_{j+1})).
25 | $$
26 | If the neighboring point of $(x_i, y_j)$ is at the boundary we simply use the corresponding value of the boundary data $g$ in the above approximation.
27 | 
28 | The above Poisson problem now becomes the system of $N^2$ equations given by
29 | 
30 | $$
31 | \frac{1}{h^2}(4 u(x_i, y_j) - u(x_{i-1}, y_j) - u(x_{i+1}, y_j) - u(x_{i}, y_{j-1}) - u(x_i, y_{j+1})) = f(x_i, y_j)
32 | $$
33 | for $i, j=1,\dots, N$.
34 | 
35 | **Task 1** We first need to create a verified reference solution to this problem. Implement a function `discretise(f, g, N)` that takes a Python callable $f$, a Python callable $g$ and the parameter $N$ and returns a sparse CSR matrix $A$ and the corresponding right-hand side $b$ of the above discretised Poisson problem.
36 | 
37 | To verify your code we use the method of manufactured solutions. Let $u(x, y)$ be the exact function $u_{exact}(x, y) = e^{(x-0.5)^2 + (y-0.5)^2}$. By taking $-\Delta u_{exact}$ you can compute the corresponding right-hand side $f$ so that this function $u_{exact}$ will be the exact solution of the Poisson equation $-\Delta u(x, y) = f(x, y)$ with boundary conditions given by the boundary data of your known $u_{exact}$.
38 | 
39 | For growing values of $N$ solve the linear system of equations using the `scipy.sparse.linalg.spsolve` command. Plot the maximum relative error of your computed grid values $u(x_i, y_j)$ against the exact solution $u_{exact}$ as $N$ increases. The relative error at a given point is
40 | 
41 | $$
42 | e_{rel} = \frac{|u(x_i, y_j) - u_{exact}(x_i, y_j)|}{|u_{exact}(x_i, y_j)|}
43 | $$
44 | 
45 | For your plot you should use a double logarithmic plot (`loglog` in Matplotlib). As $N$ increases the error should go to zero. What can you conjecture about the rate of convergence?
46 | 
47 | **Task 2** With your verified code we now have something to compare a GPU code against. On the GPU we want to implement a simple iterative scheme to solve the Poisson equation.
The idea is to rewrite the above discrete linear system as
48 | 
49 | $$
50 | u(x_i, y_j) = \frac{1}{4}\left(h^2f(x_i, y_j) + u(x_{i-1}, y_j) + u(x_{i+1}, y_j) + u(x_{i}, y_{j-1}) + u(x_i, y_{j+1})\right)
51 | $$
52 | 
53 | You can notice that if $f$ is zero then the left-hand side $u(x_i, y_j)$ is just the average of all the neighbouring grid points. This motivates a simple iterative scheme, namely
54 | 
55 | $$
56 | u^{k+1}(x_i, y_j) = \frac{1}{4}\left(h^2f(x_i, y_j) + u^k(x_{i-1}, y_j) + u^k(x_{i+1}, y_j) + u^k(x_{i}, y_{j-1}) + u^k(x_i, y_{j+1})\right).
57 | $$
58 | 
59 | In other words, the value of $u$ at iteration $k+1$ is just the average of all the values at iteration $k$ plus the contribution from the right-hand side.
60 | 
61 | Your task is to implement this iterative scheme in Numba CUDA. A few hints are in order:
62 | 
63 | * Make sure that when possible you only copy data from the GPU to the host at the end of your computation. To initialize the iteration you can for example take $u=0$. You do not want to copy data after each iteration step.
64 | * You will need two global buffers, one for the current iteration $k$ and one for the next iteration.
65 | * Your compute kernel will execute one iteration of the scheme and you run multiple iterations by repeatedly calling the kernel from the host.
66 | * To check for convergence you should investigate the relative change of your values from $u^k$ to $u^{k+1}$ and take the maximum relative change as a measure of how accurate your solution is. Decide how you implement this (in the same kernel or through a separate kernel). Also, decide how often you check for convergence. You may not want to check in each iteration as it is an expensive operation.
67 | * Verify your GPU code by comparing against the exact discrete solution in Task 1. Generate a convergence plot of how the values in your iterative scheme converge against the exact discrete solution. For this use a few selected values of $N$.
How does the convergence change as $N$ increases? 68 | * Try to optimise memory accesses. You will notice that a given grid value $u(i, j)$ is read multiple times from the global buffer. Reduce these global reads by preloading a block of values into local shared memory and having the threads of a block read the data from there. When you do this, benchmark against an implementation where each thread just reads from global memory. 69 | 70 | ***Carefully describe your computations and observations. Explain what you are doing and try to be scientifically precise in your observations and conclusions. Carefully designing and interpreting your convergence and benchmark experiments is a significant component of this assignment.*** 71 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2023-assignment_2.md: -------------------------------------------------------------------------------- 1 | # Assignment 2 - Solving two 1D problems 2 | 3 | This assignment makes up 20% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Thursday 2 November 2023**. 4 | 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers 6 | to any text questions included in the assessment. The easiest ways to create this file are: 7 | 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 10 | 11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below.
12 | 13 | ## The assignment 14 | 15 | ### Part 1: Solving a wave problem with sparse matrices 16 | In this part of the assignment, we want to compute the solution to the following (time-harmonic) wave problem: 17 | 18 | $$ 19 | \begin{align*} 20 | \frac{\mathrm{d}^2 u}{\mathrm{d}x^2} + k^2u &= 0&&\text{in }(0, 1),\\ 21 | u &= 0&&\text{if }x=0,\\ 22 | u &= 1&&\text{if }x=1,\\ 23 | \end{align*} 24 | $$ 25 | with wavenumber $k=29\mathrm{\pi}/2$. 26 | 27 | In this part, we will approximately solve this problem using the method of finite differences. 28 | We do this by taking evenly spaced values 29 | $x_0=0, x_1, x_2, ..., x_N=1$ 30 | and approximating the value of $u$ at each of these points: we will call these approximations $u_i$. 31 | To compute these approximations, we use the approximation 32 | 33 | $$ 34 | \left.\frac{\mathrm{d}^2u}{\mathrm{d}x^2}\right|_{x=x_i} \approx \frac{ 35 | u_{i-1}-2u_i+u_{i+1} 36 | }{h^2}, 37 | $$ 38 | where $h = 1/N$. 39 | 40 | With a bit of algebra, we see that the wave problem can be written as 41 | 42 | $$ 43 | (2-h^2k^2)u_i-u_{i-1}-u_{i+1} = 0 44 | $$ 45 | if $x_i$ is not 0 or 1, and 46 | 47 | $$ 48 | \begin{align*} 49 | u_i &= 0 50 | &&\text{if }x_i=0,\\ 51 | u_i &= 1 52 | &&\text{if }x_i=1. 53 | \end{align*} 54 | $$ 55 | 56 | This information can be used to re-write the problem as the matrix-vector problem 57 | $\mathrm{A}\mathbf{u}=\mathbf{f},$ 58 | where $\mathrm{A}$ is a known matrix, $\mathbf{f}$ is a known vector, and $\mathbf{u}$ is an unknown vector that we want to compute. 59 | The entries of 60 | $\mathbf{f}$ and $\mathbf{u}$ are given by 61 | 62 | $$ 63 | \begin{align*} 64 | \left[\mathbf{u}\right]_i &= u_i,\\ 65 | \left[\mathbf{f}\right]_i &= \begin{cases} 66 | 1&\text{if }i=N,\\ 67 | 0&\text{otherwise}.
68 | \end{cases} 69 | \end{align*} 70 | $$ 71 | The rows of $\mathrm{A}$ are given by 72 | 73 | $$ 74 | \left[\mathrm{A}\right]_{i,j} = 75 | \begin{cases} 76 | 1&\text{if }i=j,\\ 77 | 0&\text{otherwise}, 78 | \end{cases} 79 | $$ 80 | if $i=0$ or $i=N$; and 81 | 82 | $$ 83 | \left[\mathrm{A}\right]_{i, j} = 84 | \begin{cases} 85 | 2-h^2k^2&\text{if }j=i,\\ 86 | -1&\text{if }j=i+1,\\ 87 | -1&\text{if }j=i-1,\\ 88 | 0&\text{otherwise}, 89 | \end{cases} 90 | $$ 91 | otherwise. 92 | 93 | **Write a Python function that takes $N$ as an input and returns the matrix $\mathrm{A}$ and vector $\mathbf{f}$**. 94 | You should use an appropriate sparse storage format for the matrix $\mathrm{A}$. 95 | 96 | The function `scipy.sparse.linalg.spsolve` can be used to solve a sparse matrix-vector problem. Use this to **compute 97 | the approximate solution for your problem for $N=10$, $N=100$, and $N=1000$**. Use `matplotlib` (or any other plotting library) 98 | to **plot the solutions for these three values of $N$**. 99 | 100 | **Briefly (1-2 sentences) comment on your plots**: How different are they to each other? Which do you expect to be closest to the 101 | actual solution of the wave problem? 102 | 103 | This wave problem was carefully chosen so that its exact solution is known: this solution is 104 | $u_\text{exact}(x) = \sin(kx)$. (You can check this by differentiating this twice and substituting, but you 105 | do not need to do this as part of this assignment.) 106 | 107 | A possible approximate measure of the error in your solution can be found by computing 108 | 109 | $$ 110 | \max_i\left|u_i-u_\text{exact}(x_i)\right|. 111 | $$ 112 | **Compute this error for a range of values for $N$ of your choice, for the method you wrote above**. On axes that both use log scales, 113 | **plot $N$ against the error in your solution**. You should pick a range of values for $N$ so that this plot will give you useful information about the 114 | method.
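For orientation, a system with exactly the structure described above can be assembled with `scipy.sparse` along the following lines. This is a sketch under illustrative choices (the function name, the LIL-then-CSR assembly route and the default wavenumber argument are not prescribed by the assignment), not a model answer:

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve


def build_system(N, k=29 * np.pi / 2):
    """Assemble the sparse matrix A and vector f on the N + 1 grid points."""
    h = 1.0 / N
    A = lil_matrix((N + 1, N + 1))
    f = np.zeros(N + 1)
    A[0, 0] = 1.0  # boundary row enforcing u_0 = 0
    A[N, N] = 1.0  # boundary row enforcing u_N = 1
    f[N] = 1.0
    # interior rows: (2 - h^2 k^2) u_i - u_{i-1} - u_{i+1} = 0
    for i in range(1, N):
        A[i, i - 1] = -1.0
        A[i, i] = 2.0 - h**2 * k**2
        A[i, i + 1] = -1.0
    return A.tocsr(), f


A, f = build_system(1000)
u = spsolve(A, f)
```

For large $N$, building the three diagonals with `scipy.sparse.diags` and then overwriting the two boundary rows avoids the Python-level assembly loop.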
115 | 116 | For the same values of $N$, **measure the time taken to compute your approximation using your function**. On axes that both use log scales, 117 | **plot $N$ against the time taken to compute a solution**. 118 | 119 | We now want to compute an approximate solution where the measure of error is $10^{-8}$ or less. By looking at your plots, **pick a value of $N$ 120 | that you would expect to give an error of $10^{-8}$ or less**. **Briefly (1-2 sentences) explain how you picked your value of $N$ 121 | and predict how long the computation will take**. 122 | 123 | **Compute the approximate solution with your value of $N$**. Measure the time taken and the error, and **briefly (1-2 sentences) comment 124 | on how these compare to your predictions**. Your error may turn out to be higher than $10^{-8}$ for your value of $N$: if so, you can still get full marks for commenting on 125 | why your prediction was not correct. Depending on your implementation and your prediction, 126 | a valid conclusion in this section could be "My value of $N$ is too large for it to be feasible to complete this computation in a reasonable amount of time / without running out of memory". 127 | 128 | ### Part 2: Solving the heat equation with GPU acceleration 129 | 130 | In this part of the assignment, we want to solve the heat equation 131 | 132 | $$ 133 | \begin{align*} 134 | \frac{\partial u}{\partial t} &= \frac{1}{1000}\frac{\partial^2 u}{\partial x^2}&&\text{for }x\in(0,1),\\ 135 | u(x, 0) &= 0&&\text{if }x\not=0\text{ and }x\not=1,\\ 136 | u(0,t) &= 10,\\ 137 | u(1,t) &= 10. 138 | \end{align*} 139 | $$ 140 | This represents a rod that starts at temperature 0 and is heated to a temperature of 10 at both ends. 141 | 142 | Again, we will approximately solve this by taking evenly spaced values 143 | $x_0=0, x_1, x_2, ..., x_N=1$. 144 | Additionally, we will take a set of evenly spaced times 145 | $t_0=0,t_1=h, t_2=2h, t_3=3h, ...$, where $h=1/N$.
146 | We will write $u^{(j)}_{i}$ for the approximate value of $u$ at point $x_i$ and time $t_j$ 147 | (ie $u^{(j)}_{i}\approx u(x_i, t_j)$). 148 | 149 | Approximating both derivatives (similar to what we did in part 1), and doing some algebra, we can rewrite the 150 | heat equation as 151 | 152 | $$ 153 | \begin{align*} 154 | u^{(j + 1)}_i&=u^{(j)}_i + \frac{u^{(j)}_{i-1}-2u^{(j)}_i+u^{(j)}_{i+1}}{1000h},\\ 155 | u^{(0)}_i &= 0,\\ 156 | u^{(j)}_{0}&=10,\\ 157 | u^{(j)}_{N}&=10. 158 | \end{align*} 159 | $$ 160 | 161 | This leads us to an iterative method for solving this problem: first, at $t=0$, we set 162 | 163 | $$ 164 | u^{(0)}_i = 165 | \begin{cases} 166 | 10 &\text{if }i=0\text{ or }i=N,\\ 167 | 0 &\text{otherwise}; 168 | \end{cases} 169 | $$ 170 | then for all later values of time, we set 171 | 172 | $$ 173 | u^{(j+1)}_i = 174 | \begin{cases} 175 | 10 &\text{if }i=0\text{ or }i=N,\\ 176 | \displaystyle u^{(j)}_i + \frac{u^{(j)}_{i-1}-2u^{(j)}_i+u^{(j)}_{i+1}}{1000h} &\text{otherwise}. 177 | \end{cases} 178 | $$ 179 | 180 | **Implement this iterative scheme in Python**. You should implement this as a function that takes $N$ as an input. 181 | 182 | Using a sensible value of $N$, **plot the temperature of the rod at $t=1$, $t=2$ and $t=10$**. **Briefly (1-2 sentences) 183 | comment on how you picked a value for $N$**. 184 | 185 | **Use `numba.cuda` to parallelise your implementation on a GPU**. 186 | You should think carefully about when data needs to be copied, and be careful not to copy data to/from the GPU when not needed. 187 | 188 | 189 | **Use your code to estimate the time at which the temperature of the midpoint of the rod first exceeds a temperature of 9.8**. 190 | **Briefly (2-3 sentences) describe how you estimated this time**. You may choose to use a plot or diagram to aid your description, 191 | but it is not essential to include a plot. 
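As a plain-NumPy point of reference before any GPU work, the iterative scheme above can be sketched as follows. The function name and the vectorised interior update are illustrative choices, not requirements:

```python
import numpy as np


def heat_step(u, h):
    """Apply one time step of the scheme above; u holds the N + 1 grid values."""
    u_new = u.copy()
    # interior points: u^{(j+1)}_i = u^{(j)}_i + (u_{i-1} - 2 u_i + u_{i+1}) / (1000 h)
    u_new[1:-1] = u[1:-1] + (u[:-2] - 2.0 * u[1:-1] + u[2:]) / (1000.0 * h)
    u_new[0] = u_new[-1] = 10.0  # boundary values stay fixed at 10
    return u_new


N = 100
h = 1.0 / N
u = np.zeros(N + 1)
u[0] = u[-1] = 10.0
for _ in range(10 * N):  # 10 * N steps of size h advance the solution to t = 10
    u = heat_step(u, h)
```

Note that the update multiplies the second difference by $1/(1000h) = N/1000$, so for very large $N$ this explicit scheme eventually becomes unstable; that interacts with how you choose a "sensible" value of $N$.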
192 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2023-assignment_3-lsa.md: -------------------------------------------------------------------------------- 1 | # LSA Assignment 3 - Sparse matrices 2 | 3 | The deadline for submitting this assignment is **Midnight Friday 30 August 2024**. 4 | 5 | The easiest ways to create the pdf file that you will submit are: 6 | 7 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 8 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 9 | 10 | ## The assignment 11 | 12 | ### Part 1: Implementing a CSR matrix 13 | Scipy allows you to define your own objects that can be used with their sparse solvers. You can do this 14 | by creating a subclass of `scipy.sparse.LinearOperator`. In the first part of this assignment, you are going to 15 | implement your own CSR matrix format. 16 | 17 | The following code snippet shows how you can define your own matrix-like operator. 18 | 19 | ```python 20 | from scipy.sparse.linalg import LinearOperator 21 | 22 | 23 | class CSRMatrix(LinearOperator): 24 | def __init__(self, coo_matrix): 25 | self.shape = coo_matrix.shape 26 | self.dtype = coo_matrix.dtype 27 | # You'll need to put more code here 28 | 29 | def __add__(self, other): 30 | """Add the CSR matrix other to this matrix.""" 31 | pass 32 | 33 | def _matvec(self, vector): 34 | """Compute a matrix-vector product.""" 35 | pass 36 | ``` 37 | 38 | Make a copy of this code snippet and **implement the methods `__init__`, `__add__` and `matvec`.** 39 | The method `__init__` takes a COO matrix as input and will initialise the CSR matrix: it currently includes one line 40 | that will store the shape of the input matrix. You should add code here that extracts important data from a Scipy COO matrix and computes and stores the appropriate data 41 | for a CSR matrix.
You may use any functionality of Python and various libraries in your code, but you should not use a library's implementation of a CSR matrix. 42 | The method `__add__` will overload `+` and so allow you to add two of your CSR matrices together. 43 | The `__add__` method should avoid converting any matrices to dense matrices. You could implement this in one of two ways: you could convert both matrices to COO matrices, 44 | compute the sum, then pass this into `CSRMatrix()`; or you could compute the data, indices and indptr for the sum, and use these to create a SciPy CSR matrix. 45 | The method `matvec` will define a matrix-vector product: Scipy will use this when you tell it to use a sparse solver on your operator. 46 | 47 | **Write tests to check that the `__add__` and `matvec` methods that you have written are correct.** These tests should use appropriate `assert` statements. 48 | 49 | For a collection of sparse matrices of your choice and a random vector, **measure the time taken to perform a `matvec` product**. Convert the same matrices to dense matrices and **measure 50 | the time taken to compute a dense matrix-vector product using Numpy**. **Create a plot showing the times of `matvec` and Numpy for a range of matrix sizes** and 51 | **briefly (1-2 sentences) comment on what your plot shows**. 52 | 53 | For a matrix of your choice and a random vector, **use Scipy's `gmres` and `cg` sparse solvers to solve a matrix problem using your CSR matrix**. 54 | Check if the two solutions obtained are the same.
55 | **Briefly comment (1-2 sentences) on why the solutions are or are not the same (or are nearly but not exactly the same).** 56 | 57 | ### Part 2: Implementing a custom matrix 58 | Let $\mathrm{A}$ be a $2n$ by $2n$ matrix with the following structure: 59 | 60 | - The top left $n$ by $n$ block of $\mathrm{A}$ is a diagonal matrix 61 | - The top right $n$ by $n$ block of $\mathrm{A}$ is zero 62 | - The bottom left $n$ by $n$ block of $\mathrm{A}$ is zero 63 | - The bottom right $n$ by $n$ block of $\mathrm{A}$ is dense (but has a special structure defined below) 64 | 65 | In other words, $\mathrm{A}$ looks like this, where $*$ represents a non-zero value 66 | 67 | $$ 68 | \mathrm{A}=\begin{pmatrix} 69 | *&0&0&\cdots&0&\hspace{3mm}0&0&\cdots&0\\ 70 | 0&*&0&\cdots&0&\hspace{3mm}0&0&\cdots&0\\ 71 | 0&0&*&\cdots&0&\hspace{3mm}0&0&\cdots&0\\ 72 | \vdots&\vdots&\vdots&\ddots&0&\hspace{3mm}\vdots&\vdots&\ddots&\vdots\\ 73 | 0&0&0&\cdots&*&\hspace{3mm}0&0&\cdots&0\\[3mm] 74 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&*\\ 75 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&*\\ 76 | \vdots&\vdots&\vdots&\ddots&\vdots&\hspace{3mm}\vdots&\vdots&\ddots&\vdots\\ 77 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&* 78 | \end{pmatrix} 79 | $$ 80 | 81 | Let $\tilde{\mathrm{A}}$ be the bottom right $n$ by $n$ block of $\mathrm{A}$. 82 | Suppose that $\tilde{\mathrm{A}}$ is a matrix that can be written as 83 | 84 | $$ 85 | \tilde{\mathrm{A}} = \mathrm{T}\mathrm{W}, 86 | $$ 87 | where $\mathrm{T}$ is an $n$ by 2 matrix (a tall matrix); 88 | and 89 | where $\mathrm{W}$ is a 2 by $n$ matrix (a wide matrix). 90 | 91 | **Implement a Scipy `LinearOperator` for matrices of this form**. Your implementation must include a matrix-vector product (`matvec`) and the shape of the matrix (`self.shape`), but 92 | does not need to include an `__add__` function. In your implementation of `matvec`, you should be careful to ensure that the product does not have more computational complexity than necessary.
93 | 94 | For a range of values of $n$, **create matrices where the entries on the diagonal of the top-left block and in the matrices $\mathrm{T}$ and $\mathrm{W}$ are random numbers**. 95 | For each of these matrices, **compute matrix-vector products using your implementation and measure the time taken to compute these**. Create an alternative version of each matrix, 96 | stored using a Scipy or Numpy format of your choice, 97 | and **measure the time taken to compute matrix-vector products using this format**. **Make a plot showing time taken against $n$**. **Comment (2-4 sentences) on what your plot shows, and why you think 98 | one of these methods is faster than the other** (or why they take the same amount of time if this is the case). 99 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2023-assignment_3.md: -------------------------------------------------------------------------------- 1 | # Assignment 3 - Sparse matrices 2 | 3 | This assignment makes up 30% of the overall marks for the course. The deadline for submitting this assignment is **5pm on Thursday 30 November 2023**. 4 | 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers 6 | to any text questions included in the assessment. The easiest ways to create this file are: 7 | 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 10 | 11 | Tasks you are required to carry out and questions you are required to answer are shown in bold below. 12 | 13 | ## The assignment 14 | 15 | ### Part 1: Implementing a CSR matrix 16 | Scipy allows you to define your own objects that can be used with their sparse solvers. 
You can do this 17 | by creating a subclass of `scipy.sparse.LinearOperator`. In the first part of this assignment, you are going to 18 | implement your own CSR matrix format. 19 | 20 | The following code snippet shows how you can define your own matrix-like operator. 21 | 22 | ```python 23 | from scipy.sparse.linalg import LinearOperator 24 | 25 | 26 | class CSRMatrix(LinearOperator): 27 | def __init__(self, coo_matrix): 28 | self.shape = coo_matrix.shape 29 | self.dtype = coo_matrix.dtype 30 | # You'll need to put more code here 31 | 32 | def __add__(self, other): 33 | """Add the CSR matrix other to this matrix.""" 34 | pass 35 | 36 | def _matvec(self, vector): 37 | """Compute a matrix-vector product.""" 38 | pass 39 | ``` 40 | 41 | Make a copy of this code snippet and **implement the methods `__init__`, `__add__` and `matvec`.** 42 | The method `__init__` takes a COO matrix as input and will initialise the CSR matrix: it currently includes one line 43 | that will store the shape of the input matrix. You should add code here that extracts important data from a Scipy COO matrix and computes and stores the appropriate data 44 | for a CSR matrix. You may use any functionality of Python and various libraries in your code, but you should not use a library's implementation of a CSR matrix. 45 | The method `__add__` will overload `+` and so allow you to add two of your CSR matrices together. 46 | The `__add__` method should avoid converting any matrices to dense matrices. You could implement this in one of two ways: you could convert both matrices to COO matrices, 47 | compute the sum, then pass this into `CSRMatrix()`; or you could compute the data, indices and indptr for the sum, and use these to create a SciPy CSR matrix. 48 | The method `matvec` will define a matrix-vector product: Scipy will use this when you tell it to use a sparse solver on your operator.
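For orientation, the heart of a CSR matrix-vector product is a row-wise loop over the `data`, `indices` and `indptr` arrays. The following standalone sketch (deliberately not a `LinearOperator` subclass, and not a model answer) shows the access pattern; Scipy's own CSR matrices expose the same three attributes, which is convenient for testing:

```python
import numpy as np


def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x for a CSR matrix stored as data/indices/indptr."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for row in range(n_rows):
        # the non-zeros of this row occupy data[indptr[row]:indptr[row + 1]]
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y
```

In an actual implementation the inner loop can be replaced by slicing (`data[start:end] @ x[indices[start:end]]`), which is considerably faster in NumPy.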
49 | 50 | **Write tests to check that the `__add__` and `matvec` methods that you have written are correct.** These tests should use appropriate `assert` statements. 51 | 52 | For a collection of sparse matrices of your choice and a random vector, **measure the time taken to perform a `matvec` product**. Convert the same matrices to dense matrices and **measure 53 | the time taken to compute a dense matrix-vector product using Numpy**. **Create a plot showing the times of `matvec` and Numpy for a range of matrix sizes** and 54 | **briefly (1-2 sentences) comment on what your plot shows**. 55 | 56 | For a matrix of your choice and a random vector, **use Scipy's `gmres` and `cg` sparse solvers to solve a matrix problem using your CSR matrix**. 57 | Check if the two solutions obtained are the same. 58 | **Briefly comment (1-2 sentences) on why the solutions are or are not the same (or are nearly but not exactly the same).** 59 | 60 | ### Part 2: Implementing a custom matrix 61 | Let $\mathrm{A}$ be a $2n$ by $2n$ matrix with the following structure: 62 | 63 | - The top left $n$ by $n$ block of $\mathrm{A}$ is a diagonal matrix 64 | - The top right $n$ by $n$ block of $\mathrm{A}$ is zero 65 | - The bottom left $n$ by $n$ block of $\mathrm{A}$ is zero 66 | - The bottom right $n$ by $n$ block of $\mathrm{A}$ is dense (but has a special structure defined below) 67 | 68 | In other words, $\mathrm{A}$ looks like this, where $*$ represents a non-zero value 69 | 70 | $$ 71 | \mathrm{A}=\begin{pmatrix} 72 | *&0&0&\cdots&0&\hspace{3mm}0&0&\cdots&0\\ 73 | 0&*&0&\cdots&0&\hspace{3mm}0&0&\cdots&0\\ 74 | 0&0&*&\cdots&0&\hspace{3mm}0&0&\cdots&0\\ 75 | \vdots&\vdots&\vdots&\ddots&0&\hspace{3mm}\vdots&\vdots&\ddots&\vdots\\ 76 | 0&0&0&\cdots&*&\hspace{3mm}0&0&\cdots&0\\[3mm] 77 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&*\\ 78 | 0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&*\\ 79 | \vdots&\vdots&\vdots&\ddots&\vdots&\hspace{3mm}\vdots&\vdots&\ddots&\vdots\\ 80 | 
0&0&0&\cdots&0&\hspace{3mm}*&*&\cdots&* 81 | \end{pmatrix} 82 | $$ 83 | 84 | Let $\tilde{\mathrm{A}}$ be the bottom right $n$ by $n$ block of $\mathrm{A}$. 85 | Suppose that $\tilde{\mathrm{A}}$ is a matrix that can be written as 86 | 87 | $$ 88 | \tilde{\mathrm{A}} = \mathrm{T}\mathrm{W}, 89 | $$ 90 | where $\mathrm{T}$ is an $n$ by 2 matrix (a tall matrix); 91 | and 92 | where $\mathrm{W}$ is a 2 by $n$ matrix (a wide matrix). 93 | 94 | **Implement a Scipy `LinearOperator` for matrices of this form**. Your implementation must include a matrix-vector product (`matvec`) and the shape of the matrix (`self.shape`), but 95 | does not need to include an `__add__` function. In your implementation of `matvec`, you should be careful to ensure that the product does not have more computational complexity than necessary. 96 | 97 | For a range of values of $n$, **create matrices where the entries on the diagonal of the top-left block and in the matrices $\mathrm{T}$ and $\mathrm{W}$ are random numbers**. 98 | For each of these matrices, **compute matrix-vector products using your implementation and measure the time taken to compute these**. Create an alternative version of each matrix, 99 | stored using a Scipy or Numpy format of your choice, 100 | and **measure the time taken to compute matrix-vector products using this format**. **Make a plot showing time taken against $n$**. **Comment (2-4 sentences) on what your plot shows, and why you think 101 | one of these methods is faster than the other** (or why they take the same amount of time if this is the case). 102 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2023-assignment_4-lsa.md: -------------------------------------------------------------------------------- 1 | # LSA Assignment 4 - Time-dependent problems 2 | 3 | The deadline for submitting this assignment is **Midnight Friday 30 August 2024**.
4 | 5 | The easiest ways to create the pdf file that you will submit are: 6 | 7 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf). 8 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 9 | 10 | Consider a square plate with sides $[-1, 1] \times [-1, 1]$. At time $t = 0$ we are heating the plate up 11 | such that the temperature is $u = 5$ on one side and $u = 0$ on the other sides. The temperature 12 | evolves according to $u_t = \Delta u$. At what time $t^*$ does the plate reach $u = 1$ at the center of the plate? 13 | Implement a finite difference scheme and try both explicit and implicit time-stepping. Numerically investigate the stability of your schemes. 14 | By increasing the number of discretisation points, demonstrate how many correct digits you can achieve. Also, 15 | plot the convergence of your computed time $t^*$ against the actual time. To 12 digits, the desired 16 | solution is $t^* = 0.424011387033$. 17 | 18 | A GPU implementation of the explicit time-stepping scheme is not necessary but would be expected for a very high mark beyond 80%. 19 | -------------------------------------------------------------------------------- /hpc_lecture_notes/2023-assignment_4.md: -------------------------------------------------------------------------------- 1 | # Assignment 4 - Time-dependent problems 2 | 3 | This assignment makes up 30% of the overall marks for the course. The deadline for submitting this assignment is **5pm on 14 December 2023**. 4 | 5 | Coursework is to be submitted using the link on Moodle. You should submit a single pdf file containing your code, the output when you run your code, and your answers 6 | to any text questions included in the assessment. The easiest ways to create this file are: 7 | 8 | - Write your code and answers in a Jupyter notebook, then select File -> Download as -> PDF via LaTeX (.pdf).
9 | - Write your code and answers on Google Colab, then select File -> Print, and print it as a pdf. 10 | 11 | Consider a square plate with sides $[-1, 1] \times [-1, 1]$. At time $t = 0$ we are heating the plate up 12 | such that the temperature is $u = 5$ on one side and $u = 0$ on the other sides. The temperature 13 | evolves according to $u_t = \Delta u$. At what time $t^*$ does the plate reach $u = 1$ at the center of the plate? 14 | Implement a finite difference scheme and try both explicit and implicit time-stepping. Numerically investigate the stability of your schemes. 15 | By increasing the number of discretisation points, demonstrate how many correct digits you can achieve. Also, 16 | plot the convergence of your computed time $t^*$ against the actual time. To 12 digits, the desired 17 | solution is $t^* = 0.424011387033$. 18 | 19 | A GPU implementation of the explicit time-stepping scheme is not necessary but would be expected for a very high mark beyond 80%. 20 | -------------------------------------------------------------------------------- /hpc_lecture_notes/_config.yml: -------------------------------------------------------------------------------- 1 | ####################################################################################### 2 | # A default configuration that will be loaded for all jupyter books 3 | # See the documentation for help and more options: 4 | # https://jupyterbook.org/customize/config.html 5 | 6 | ####################################################################################### 7 | # Book settings 8 | title : Techniques of High-Performance Computing - Lecture Notes # The title of the book. Will be placed in the left navbar.
9 | author : Timo Betcke & Matthew Scroggs # The author of the book 10 | copyright : "2020-22" # Copyright year to be placed in the footer 11 | logo : cpu_logo.png # A path to the book logo 12 | execute: 13 | execute_notebooks: 'off' 14 | 15 | html: 16 | favicon: favicon.ico 17 | -------------------------------------------------------------------------------- /hpc_lecture_notes/_toc.yml: -------------------------------------------------------------------------------- 1 | format: jb-book 2 | root: intro 3 | parts: 4 | - caption: High-Performance Computing with Python 5 | chapters: 6 | - file: what_is_hpc 7 | - file: hpc_languages 8 | - file: python_hpc_tools 9 | - file: numpy_and_data_layouts 10 | - file: parallel_principles 11 | - file: working_with_numba 12 | - file: simd 13 | - file: numexpr 14 | - file: gpu_introduction 15 | - file: cuda_introduction 16 | - file: numba_cuda 17 | - file: rbf_evaluation 18 | - caption: Sparse Linear Algebra 19 | chapters: 20 | - file: sparse_linalg_pde 21 | - file: sparse_data_structures 22 | - file: sparse_solvers_introduction 23 | - file: it_solvers1 24 | - file: it_solvers2 25 | - file: it_solvers3 26 | - file: it_solvers4 27 | - file: sparse_direct_solvers 28 | - file: petsc_for_sparse_systems 29 | - file: multigrid 30 | - caption: Time-Dependent Problems 31 | chapters: 32 | - file: simple_time_stepping 33 | - file: wave_equation 34 | - caption: Conclusions 35 | chapters: 36 | - file: further_topics 37 | - caption: Late Summer Assessments 38 | chapters: 39 | - file: 2023-assignment_1-lsa 40 | - file: 2023-assignment_2-lsa 41 | - file: 2023-assignment_3-lsa 42 | - file: 2023-assignment_4-lsa 43 | # - file: 2022-assignment_4 44 | #- caption: LSA 45 | # chapters: 46 | # - file: 2022-lsa_1 47 | # - file: 2022-lsa_3 48 | # - file: 2022-lsa_4 49 | #- caption: Tasks for Monday Practical Classes 50 | # chapters: 51 | # - file: 2022-class_1 52 | # - file: 2022-class_2 53 | # - file: 2022-class_3 54 | # - file: 2022-class_4 55 | # - 
file: 2022-class_5 56 | # - file: 2022-class_6 57 | # - file: 2022-class_7 58 | 59 | - caption: Additional notes from 2022 60 | chapters: 61 | - file: 2022_matrices_and_simultaneous_equations 62 | - file: 2022_classes 63 | -------------------------------------------------------------------------------- /hpc_lecture_notes/cpu_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/cpu_logo.png -------------------------------------------------------------------------------- /hpc_lecture_notes/cuda_introduction.md: -------------------------------------------------------------------------------- 1 | # A tour of CUDA 2 | 3 | In this chapter we will dive into CUDA, the standard GPU development model for Nvidia devices. To understand the basics of CUDA we first need to understand how GPU devices are organised. 4 | 5 | ## CUDA Device Model 6 | 7 | At the most basic level, GPU accelerators are massively parallel compute devices that can run a huge number of threads concurrently. Compute devices have global memory, shared memory and local memory for threads. Moreover, threads are grouped into thread blocks that allow shared memory access. We will discuss all these points in more detail below. 8 | 9 | ### Threads, Cuda Cores, Warps and Streaming Multiprocessors 10 | 11 | A GPU device is organised into a number of Streaming Multiprocessors (SM). Each SM is responsible for scheduling and executing a number of thread blocks. Below we show the design of a SM for the Nvidia A100 Architecture (see [https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/](https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/)). 12 | 13 | ![SM Architecture](./img/a100_sm.png) 14 | 15 | Each SM in the A100 architecture consists of integer cores, floating point cores and tensor cores. 
Tensor cores are relatively new and optimised for mixed precision multiply/add operations for deep learning. The SM is responsible for scheduling threads onto the different compute cores. For the developer, the lowest logical entity is a thread. Threads are organised in thread blocks in CUDA. A thread block is a group of threads that are allowed to access fast shared memory together. In terms of implementation, thread blocks are divided into Warps, where each Warp contains 32 threads. Within a Warp all threads must follow the same execution path, which has implications for branch statements that we will discuss later. A Warp is roughly comparable to a SIMD vector register in CPU architectures. 16 | 17 | The scheduling into Warps is important for the organisation of thread blocks. Ideally, thread block sizes are multiples of 32. If the size of a thread block is not a multiple of 32, cores may be underutilised. Consider a thread block of 48 threads. This will take up two Warps, as we have 32 + 16 threads. Hence, the second Warp will not be fully utilised. 18 | 19 | ### Numbering of threads 20 | 21 | The numbering of a thread is shown in the following figure (see [https://developer.nvidia.com/blog/even-easier-introduction-cuda/](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)). 22 | 23 | ![Thread Numbering](./img/thread_numbering.png) 24 | 25 | As mentioned above, threads are organised in thread blocks. All thread blocks together form a thread grid. The thread grid does not need to be one-dimensional. It can also be two or three dimensional. This is convenient if the computational domain is better represented in two or three dimensions. The figure demonstrates how, in one dimension, the global number of a thread is computed from its block index and its index within the block. 26 | 27 | ### Memory Hierarchy 28 | 29 | CUDA distinguishes three types of memory: 30 | 31 | * The **global memory** is a block of memory accessible by all threads in a device.
This is the largest chunk of memory and the place where we create GPU buffers to store our input to computations and output results. While a GPU typically has a few gigabytes of global memory, access to it is relatively slow from the individual threads. 32 | 33 | * All threads within a given block have access to local **shared memory**. This shared memory is fast and available within the lifetime of the thread block. Together with local synchronisation it can be efficiently used to process workload within a given thread block without having to write back and forth to the global memory. 34 | 35 | * Each thread has its own **private memory**. This is very fast and used to store local intermediate results that are only needed in the current thread. 36 | 37 | ## An example 38 | 39 | The following example from the [official Numba documentation](https://numba.pydata.org/numba-doc/dev/cuda/examples.html#matrix-multiplication) uses all of the principles mentioned above. It is an implementation of a matrix-matrix product that makes use of shared memory for block-wise multiplication. Note that it assumes the matrix dimensions are multiples of TPB. 40 | 41 | ```python 42 | from numba import cuda, float32 43 | 44 | # Controls threads per block and shared memory usage. 45 | # The computation will be done on blocks of TPBxTPB elements. 46 | TPB = 16 47 | 48 | @cuda.jit 49 | def fast_matmul(A, B, C): 50 | # Define an array in the shared memory 51 | # The size and type of the arrays must be known at compile time 52 | sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32) 53 | sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32) 54 | 55 | x, y = cuda.grid(2) 56 | 57 | tx = cuda.threadIdx.x 58 | ty = cuda.threadIdx.y 59 | bpg = cuda.gridDim.x # blocks per grid 60 | 61 | if x >= C.shape[0] or y >= C.shape[1]: 62 | # Quit if (x, y) is outside of valid C boundary 63 | return 64 | 65 | # Each thread computes one element in the result matrix. 66 | # The dot product is chunked into dot products of TPB-long vectors. 67 | tmp = 0.
68 | for i in range(bpg): 69 | # Preload data into shared memory 70 | sA[tx, ty] = A[x, ty + i * TPB] 71 | sB[tx, ty] = B[tx + i * TPB, y] 72 | 73 | # Wait until all threads finish preloading 74 | cuda.syncthreads() 75 | 76 | # Computes partial product on the shared memory 77 | for j in range(TPB): 78 | tmp += sA[tx, j] * sB[j, ty] 79 | 80 | # Wait until all threads finish computing 81 | cuda.syncthreads() 82 | 83 | C[x, y] = tmp 84 | ``` 85 | -------------------------------------------------------------------------------- /hpc_lecture_notes/favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/favicon.ico -------------------------------------------------------------------------------- /hpc_lecture_notes/further_topics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Further topics" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "We want to list some further topics of current interest in Scientific Computing that we did not cover in this module. The list cannot be exhaustive and there will be things of importance that I am leaving out. But it should give some pointers for those who are interested in diving more into the research side." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Domain Decomposition Methods" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "We have not discussed domain decomposition methods at all. The idea is to decompose a computational problem into a number of subproblems that can be solved independently and exchange information via interface conditions.
The individual subproblems can then be solved in parallel. This approach is important to achieve weak scaling for larger problems on massively parallel architectures. Research into domain decomposition methods combines techniques from Partial Differential Equations (coupled PDE problems, design of interface conditions), Numerical Linear Algebra (efficient preconditioning, adapted iterative solvers, etc.) and Computer Science (load balancing, network communication, and much more)." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## Randomized Linear Algebra\n", 36 | "\n", 37 | "It might seem counterintuitive, but we can learn a lot about matrices by multiplication with random vectors. Indeed, we can learn so much that we can design efficient methods for problems such as singular value decompositions, dense linear solvers, eigenvalue computations, and many other standard problems of numerical linear algebra. Modern randomized methods approximate solutions to these problems in a probabilistic sense, with an error probability that is so small that in practice we need not worry about it. This area of research has become extremely prominent in the last 10 years and is increasingly used for application problems of all sizes. A very recent overview is given in the article [Randomized numerical linear algebra: Foundations and algorithms](https://www.cambridge.org/core/journals/acta-numerica/article/abs/randomized-numerical-linear-algebra-foundations-and-algorithms/4486926746CFF4547F42A2996C7DC09C) by Martinsson and Tropp." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "## Fast direct solvers\n", 45 | "\n", 46 | "Fast direct solvers are a technology to compute approximate inverses to many relevant PDE problems in close to linear complexity.
This is made possible through compressing Green's function interactions between different points of the computational domain in the inversion. Fast direct solvers have shown tremendous success for certain types of linear systems arising from the solution of non-oscillatory stationary problems. A beautiful overview article is [Fast direct solvers for integral equations in complex three-dimensional domains](https://www.cambridge.org/core/journals/acta-numerica/article/abs/fast-direct-solvers-for-integral-equations-in-complex-threedimensional-domains/B3CCDC43521A7EEEC0207786AE469C4B) by Greengard, Gueyffier, Martinsson and Rokhlin." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Space-Time Parallel Methods\n", 54 | "\n", 55 | "We have seen very simple discretisations of time-stepping methods for first and second order systems. However, while we can easily parallelise the space discretisation, we have to compute one timestep after another, which limits parallelisation opportunities on large HPC machines. The idea of space-time parallel methods is to form a large linear system that contains all space variables at all time steps and then to develop efficient preconditioned methods that solve these problems. More information is given in the article [50 Years of Time Parallel Time Integration](https://www.unige.ch/~gander/Preprints/50YearsTimeParallel.pdf) by Martin Gander." 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Reproducibility in Computational Sciences using containers\n", 63 | "\n", 64 | "Reproducibility means that we provide the means that make it easy for other people to run our codes and data to reproduce the output that we are using in our publications. This sounds straightforward, but is fiendishly complex due to widely varying hardware configurations, operating systems, tool libraries, etc.
In recent years Docker containers have become more and more established as a tool to achieve reproducibility. Docker containers allow us to pack all required libraries and software into one image that can be run on various operating systems. A great introduction to reproducibility and the use of containers in computational sciences can be found at [https://lorenabarba.com/tag/reproducibility/](https://lorenabarba.com/tag/reproducibility/)." 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "## Machine Learning meets Scientific Computing\n", 72 | "\n", 73 | "The tremendous importance of machine learning is undisputed and there are a number of modules at UCL that teach all aspects of machine learning. However, machine learning has developed quite independently from traditional scientific computing, with its own tools and libraries. Recently more and more researchers have become interested in merging more traditional scientific computing and modelling with machine learning. The idea is to mix statistical and PDE based models to significantly improve the predictive power of computational simulations. This is an emerging area with a huge need for mathematical, computational and natural sciences research, together with the development of suitable computational tools." 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "## The Julia Programming Language\n", 81 | "\n", 82 | "We have focused in this module on using Python. It is a very powerful programming language, which is used in a number of HPC projects. However, Julia is growing significantly as a programming language and it brings a number of novel ideas and concepts. A few years ago it was demonstrated that a pure Julia application was able to scale to petascale performance, which was a huge breakthrough.
For new projects that do not need to be developed in a low-level language such as C++ or Fortran, I strongly recommend evaluating both Julia and Python as environments." 83 | ] 84 | } 85 | ], 86 | "metadata": { 87 | "kernelspec": { 88 | "display_name": "Python [conda env:hpc]", 89 | "language": "python", 90 | "name": "conda-env-hpc-py" 91 | }, 92 | "language_info": { 93 | "codemirror_mode": { 94 | "name": "ipython", 95 | "version": 3 96 | }, 97 | "file_extension": ".py", 98 | "mimetype": "text/x-python", 99 | "name": "python", 100 | "nbconvert_exporter": "python", 101 | "pygments_lexer": "ipython3", 102 | "version": "3.8.5" 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 4 107 | } 108 | -------------------------------------------------------------------------------- /hpc_lecture_notes/hpc_languages.md: -------------------------------------------------------------------------------- 1 | # Languages for High-Performance Computing 2 | 3 | There exists a zoo of programming languages. In this short section we briefly want to discuss some frequently used programming languages and their suitability for High-Performance Computing. 4 | 5 | ## Fortran 6 | 7 | [Fortran](https://en.wikipedia.org/wiki/Fortran) is one of the dinosaurs of scientific computing. Fortran originated in the 1950s and its most recent incarnation is Fortran 2018. Fortran is still actively used for a lot of HPC code, especially when it comes to legacy applications. But new projects should not be started in Fortran. 8 | 9 | ## C/C++ 10 | 11 | [C++](https://en.wikipedia.org/wiki/C%2B%2B) is the default language of Scientific Computing. It is mature, has a huge ecosystem and most modern heterogeneous compute environments (Cuda/Sycl, etc.) are developed for C++. Some HPC projects also still use the C programming language, in particular for library development. But C++ is the best choice for most new projects.
12 | 13 | ## Julia 14 | 15 | [Julia](https://en.wikipedia.org/wiki/Julia_(programming_language)) is a relatively recent programming language that has quickly built up a sizeable support community. Even though Julia is a high-level language, it has been successfully used for simulations at Petaflops scale. It should definitely be considered for new applications. 16 | 17 | ## Python 18 | 19 | [Python](https://en.wikipedia.org/wiki/Python_(programming_language)) is the most widely used high-productivity language in Scientific Computing. Its very simple syntax and broad library support make it ideal for quickly building scalable applications. Unlike Julia, Python itself is not at all HPC capable. The language does not natively support the type of data structures and other features needed for fast computations. However, over the years Python interfaces to many C/C++ libraries have been developed, and with the Numpy extension Python has data types and algorithms for very efficient array operations available. Python today is the number one language for machine learning and many other demanding HPC applications. In this module we will mainly use Python and deep dive into how to develop performant HPC applications in Python. 20 | 21 | ## Matlab 22 | 23 | [Matlab](https://en.wikipedia.org/wiki/MATLAB) is one of the oldest high-productivity languages and has been the de facto standard for fast numerical prototyping before Python. It is still heavily used in many numerical applications, given its excellent toolboxes and the huge amount of legacy code that exists. While Matlab has quite favourable licenses for academic use, it is expensive for commercial use, and if possible Python, as an open-source alternative, is preferable for new projects. 24 | 25 | ## Rust 26 | 27 | [Rust](https://en.wikipedia.org/wiki/Rust_(programming_language)) is a very recent newcomer. Its first stable release occurred in 2015.
However, it has quickly become extremely popular as a direct competitor to C++. The main feature of Rust is its ownership-based memory safety model, which allows many errors to be detected at compile time that would lead to runtime crashes in C++. Rust does not yet have a wide HPC ecosystem and most numerical libraries are in their infancy. Nevertheless, it is feasible that Rust over time matures into a serious HPC language. 28 | 29 | ## Other languages 30 | 31 | Other modern programming languages include Go, Java, and C#. Java and C# are business languages that are not designed for demanding HPC applications. Go may be suitable for some HPC type applications. But this is not where its focus lies. 32 | 33 | 34 | -------------------------------------------------------------------------------- /hpc_lecture_notes/img/2022a4-mesh.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/img/2022a4-mesh.png -------------------------------------------------------------------------------- /hpc_lecture_notes/img/a100_sm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/img/a100_sm.png -------------------------------------------------------------------------------- /hpc_lecture_notes/img/byte_array.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/img/byte_array.png -------------------------------------------------------------------------------- /hpc_lecture_notes/img/simd_addition.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/img/simd_addition.png -------------------------------------------------------------------------------- /hpc_lecture_notes/img/thread_numbering.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/img/thread_numbering.png -------------------------------------------------------------------------------- /hpc_lecture_notes/img/top500development.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/hpc_lecture_notes/img/top500development.png -------------------------------------------------------------------------------- /hpc_lecture_notes/intro.md: -------------------------------------------------------------------------------- 1 | # Welcome to Techniques of High-Performance Computing 2 | 3 | In this module we learn a number of basic techniques from High-Performance Computing. 4 | The module starts with an introduction and overview of parallel computing techniques and 5 | programming languages. We then take a deeper dive into Python, the main language for this module. 6 | In particular, we will get to learn libraries and programming techniques that will allow us to 7 | take full advantage of modern programming architectures in Python. 8 | 9 | This will be followed by an introduction to sparse linear algebra for very large linear systems of 10 | equations. You will get to know the basics of sparse direct solvers and sparse iterative solvers. 11 | 12 | Finally, we will turn to partial differential equations and implement some typical PDEs using the 13 | techniques learned in this module. 
14 | 15 | 16 | -------------------------------------------------------------------------------- /hpc_lecture_notes/numba_cuda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Numba Cuda in Practice" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "To enable Cuda in Numba with conda just execute `conda install cudatoolkit` on the command line." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "The Cuda extension supports almost all Cuda features with the exception of dynamic parallelism and texture memory. Dynamic parallelism allows launching compute kernels from within other compute kernels. Texture memory has a caching pattern based on spatial locality. We will not go into detail on these here." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Finding out about Cuda devices" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Let us first check what kind of Cuda device we have in the system."
36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "from numba import cuda" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "name": "stdout", 54 | "output_type": "stream", 55 | "text": [ 56 | "Found 1 CUDA devices\n", 57 | "id 0 b'Quadro RTX 3000' [SUPPORTED]\n", 58 | " compute capability: 7.5\n", 59 | " pci device id: 0\n", 60 | " pci bus id: 1\n", 61 | "Summary:\n", 62 | "\t1/1 devices are supported\n" 63 | ] 64 | }, 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "True" 69 | ] 70 | }, 71 | "execution_count": 2, 72 | "metadata": {}, 73 | "output_type": "execute_result" 74 | } 75 | ], 76 | "source": [ 77 | "cuda.detect()" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Launching kernels" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "Launching a Cuda kernel from Numba is very easy. 
A kernel is defined by using the `@cuda.jit` decorator as" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 3, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "@cuda.jit\n", 101 | "def an_empty_kernel():\n", 102 | " \"\"\"A kernel that doesn't do anything.\"\"\"\n", 103 | " # Get my current position in the global grid\n", 104 | " [pos_x, pos_y] = cuda.grid(2)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "The type of the kernel is" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 4, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "" 123 | ] 124 | }, 125 | "execution_count": 4, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "an_empty_kernel" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "In order to launch the kernel we need to specify the thread layout. The following commands define a two-dimensional thread layout of $16\times 16$ threads per block and $256\times 256$ blocks. In total this gives us $16,777,216$ threads. This sounds huge. But GPUs are designed to launch large numbers of threads. The only restriction is that we are allowed to have at most 1024 threads in total (product of all dimensions) within a single thread block.
139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 5, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "threadsperblock = (16, 16) # Should be a multiple of 32 if possible.\n", 148 | "blockspergrid = (256, 256) # Blocks per grid" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "We can now launch all 16.8 million threads by calling" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 6, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "an_empty_kernel[blockspergrid, threadsperblock]()" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Inside a kernel we can use the following commands to get the position of the thread." 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 7, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "@cuda.jit\n", 181 | "def another_kernel():\n", 182 | " \"\"\"Commands to get thread positions\"\"\"\n", 183 | " # Get the thread position in a thread block\n", 184 | " tx = cuda.threadIdx.x\n", 185 | " ty = cuda.threadIdx.y\n", 186 | " tz = cuda.threadIdx.z\n", 187 | " \n", 188 | " # Get the id of the thread block\n", 189 | " block_x = cuda.blockIdx.x\n", 190 | " block_y = cuda.blockIdx.y\n", 191 | " block_z = cuda.blockIdx.z\n", 192 | " \n", 193 | " # Number of threads per block\n", 194 | " dim_x = cuda.blockDim.x\n", 195 | " dim_y = cuda.blockDim.y\n", 196 | " dim_z = cuda.blockDim.z\n", 197 | " \n", 198 | " # Global thread position\n", 199 | " pos_x = tx + block_x * dim_x\n", 200 | " pos_y = ty + block_y * dim_y\n", 201 | " pos_z = tz + block_z * dim_z\n", 202 | " \n", 203 | " # We can also use the grid function to get\n", 204 | " # the global position\n", 205 | " \n", 206 | " (pos_x, pos_y, pos_z) = cuda.grid(3)\n", 207 | " # For a 1-or 2-d grid use grid(1) or grid(2)\n", 208 | " # to return a scalar or a two 
tuple.\n", 209 | " \n", 210 | " \n", 211 | "threadsperblock = (16, 16, 4) # Should be a multiple of 32 if possible.\n", 212 | "blockspergrid = (256, 256, 256) # Blocks per grid\n", 213 | "\n", 214 | "another_kernel[blockspergrid, threadsperblock]()" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "## Python features in Numba for Cuda" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "In Cuda kernels, Numba supports only a selected set of Python features that are compatible with the Cuda standard. Not allowed are exceptions, context managers, list comprehensions and yield statements. Supported types are `int`, `float`, `complex`, `bool`, `None`, `tuple`. For a complete overview of supported features see [https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html#](https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html#). Only a small set of Numpy functions are supported. Essentially, everything that requires dynamic memory management will not work due to the restrictions that the Cuda programming model places on kernels." 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "## Memory management" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "For simple kernels we can rely on Numba copying data to and from the device. For more complex code we need to manually manage buffers on the device.
243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "Copy data to the device" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 8, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "import numpy as np\n", 259 | "\n", 260 | "arr = np.arange(10)\n", 261 | "device_arr = cuda.to_device(arr)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "Copy data from the device back to the host" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 9, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "host_arr = device_arr.copy_to_host() " 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "Copy into an existing array" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 10, 290 | "metadata": {}, 291 | "outputs": [ 292 | { 293 | "data": { 294 | "text/plain": [ 295 | "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" 296 | ] 297 | }, 298 | "execution_count": 10, 299 | "metadata": {}, 300 | "output_type": "execute_result" 301 | } 302 | ], 303 | "source": [ 304 | "host_array = np.empty(shape=device_arr.shape, dtype=device_arr.dtype)\n", 305 | "device_arr.copy_to_host(host_array)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "Generate a new array on the device" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 11, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "device_array = cuda.device_array((10,), dtype=np.float32)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "## Advanced features" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "Cuda has a number of advanced features that are supported by Numba. 
Some of them are:\n", 336 | "\n", 337 | "* Pinned Memory is a form of memory allocation that allows much faster data transfer than standard buffers.\n", 338 | "* Streams are a way to run multiple tasks on a GPU concurrently. By default, Cuda executes one command after another on the device. Streams allow us to create several concurrent queues for scheduling tasks onto the device. This allows us, for example, to have a kernel stream that performs computations and a memory stream that does memory transfers concurrently. One can use events to synchronize between different streams.\n", 339 | "* Multiple devices are well supported by Numba. There exist helper routines to enumerate and select different devices.\n" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "For a full list of features check out the guide at [https://numba.pydata.org/numba-doc/latest/cuda/index.html](https://numba.pydata.org/numba-doc/latest/cuda/index.html)" 347 | ] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "Python [conda env:dev] *", 353 | "language": "python", 354 | "name": "conda-env-dev-py" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.8.5" 367 | } 368 | }, 369 | "nbformat": 4, 370 | "nbformat_minor": 4 371 | } -------------------------------------------------------------------------------- /hpc_lecture_notes/numexpr.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# A Numexpr example\n", 8 | "\n", 9 | "Numexpr is a library for the fast execution of array transformations.
One can define complex elementwise operations on arrays and Numexpr will generate efficient code to execute the operations." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import numexpr as ne\n", 19 | "import numpy as np" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "Numexpr provides fast multithreaded operations on array elements. Let's test it on some large arrays." 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "a = np.random.rand(1000000)\n", 36 | "b = np.random.rand(1000000)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "Componentwise addition is easy." 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 4, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "data": { 53 | "text/plain": [ 54 | "array([1.3195833 , 0.92546223, 1.68758307, ..., 1.19557921, 1.19559017,\n", 55 | " 0.24145174])" 56 | ] 57 | }, 58 | "execution_count": 4, 59 | "metadata": {}, 60 | "output_type": "execute_result" 61 | } 62 | ], 63 | "source": [ 64 | "ne.evaluate(\"a + 1\")\n", 65 | "ne.evaluate(\"a + b\")" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "We can evaluate complex expressions." 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/plain": [ 83 | "array([False, False, False, ..., False, False, False])" 84 | ] 85 | }, 86 | "execution_count": 5, 87 | "metadata": {}, 88 | "output_type": "execute_result" 89 | } 90 | ], 91 | "source": [ 92 | "ne.evaluate('a*b-4.1*a > 2.5*b')" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "Let's compare the performance with Numpy.
100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 6, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "7.89 ms ± 91.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "%%timeit\n", 117 | "a * b - 4.1 * a > 2.5 * b" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 7, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "name": "stdout", 127 | "output_type": "stream", 128 | "text": [ 129 | "995 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" 130 | ] 131 | } 132 | ], 133 | "source": [ 134 | "%%timeit\n", 135 | "ne.evaluate('a*b-4.1*a > 2.5*b')" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "Numexpr is roughly a factor of 8 faster than Numpy here, a nice improvement with very little effort. Let us compare some more complex operations." 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 8, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "5.4 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" 155 | ] 156 | } 157 | ], 158 | "source": [ 159 | "%%timeit\n", 160 | "ne.evaluate(\"sin(a) + arcsinh(a/b)\")" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "We can compare it with Numpy." 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 9, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "39.5 ms ± 664 µs per loop (mean ± std. dev.
of 7 runs, 10 loops each)\n" 180 | ] 181 | } 182 | ], 183 | "source": [ 184 | "%%timeit\n", 185 | "np.sin(a) + np.arcsinh(a / b)" 186 | ] 187 | } 188 | ], 189 | "metadata": { 190 | "kernelspec": { 191 | "display_name": "Python [conda env:jupyter-book]", 192 | "language": "python", 193 | "name": "conda-env-jupyter-book-py" 194 | }, 195 | "language_info": { 196 | "codemirror_mode": { 197 | "name": "ipython", 198 | "version": 3 199 | }, 200 | "file_extension": ".py", 201 | "mimetype": "text/x-python", 202 | "name": "python", 203 | "nbconvert_exporter": "python", 204 | "pygments_lexer": "ipython3", 205 | "version": "3.8.5" 206 | } 207 | }, 208 | "nbformat": 4, 209 | "nbformat_minor": 4 210 | } 211 | -------------------------------------------------------------------------------- /hpc_lecture_notes/pde_example.md: -------------------------------------------------------------------------------- 1 | # The need for sparse linear algebra - A PDE example 2 | 3 | In modern applications we are dealing with matrices that have 4 | hundreds of thousands to billions of unknowns. A typical feature 5 | of these matrices is that they are highly sparse. By sparsity we 6 | mean that a matrix consists mostly of zero elements, so that it is 7 | more economical to store just the nonzero entries rather than the 8 | whole matrix. This requires different storage structures and 9 | algorithms that can deal with these matrices and efficiently 10 | exploit the sparsity property. 11 | 12 | Before we dive into sparse matrix storage formats, we want to give a simple 13 | example that demonstrates the necessity of sparse matrix formats and 14 | of algorithms to deal with them. 15 | 16 | ## Solving a Poisson problem on the unit square
17 | 18 | We consider the -------------------------------------------------------------------------------- /hpc_lecture_notes/python_hpc_tools.md: -------------------------------------------------------------------------------- 1 | # Python HPC Tools 2 | 3 | Python has an incredible ecosystem for scientific computing. In this chapter we provide a brief overview of some of the existing libraries before diving deeper in the following parts. 4 | 5 | ## Jupyter Notebook 6 | 7 | [Jupyter](https://jupyter.org/) is a key part of the Python ecosystem. It allows the creation of Jupyter Notebook documents that mix executable code, descriptions, figures and formulas in a single file that can be viewed and edited inside a web browser. Jupyter notebooks can be used either through the Jupyter Notebook tool or the more recent JupyterLab environment. 8 | 9 | ## Numpy and Scipy 10 | 11 | [Numpy](https://numpy.org/) and [Scipy](https://www.scipy.org/) are key tools for any scientific Python installation. Numpy defines a fast array data type and provides a huge number of operations on this type, including all common linear algebra operations. Moreover, Numpy handles multi-dimensional arrays and operations on them beautifully. While Numpy provides fairly low-level routines, Scipy builds on top of Numpy to provide a collection of high-level routines, including graph algorithms, optimisation, sparse matrices, ODE solvers, and many more. Together, Numpy and Scipy are one of the main reasons for the success of Python in scientific computing. 12 | 13 | ## Numba 14 | 15 | [Numba](https://numba.pydata.org/) is a tool for the just-in-time compilation of Python functions. Python itself is a slow language. Each operation has considerable overhead from the Python interpreter, making time-critical for-loops especially inefficient in Python. Numba can just-in-time compile Python functions into machine code that does not need to access the Python interpreter.
Moreover, in doing so it allows the use of simple loop parallelisations that cover a lot of use-cases for parallel computing on a single machine. In addition to all this, Numba has features to directly cross-compile code for use on GPU accelerators. Numba will be one of our main tools to write performant Python code for CPUs and GPUs. 16 | 17 | ## Matplotlib 18 | 19 | [Matplotlib](https://matplotlib.org/) is the standard tool in Python for data visualization in two dimensions. It is incredibly feature rich. While simple plots are easy to do, there is a huge underlying API that allows very fine-grained control for complex data visualization settings. Indeed, the complexity of this API has led to the development of other libraries that build on Matplotlib and provide simplified interfaces for specific application areas. 20 | 21 | ## Dask 22 | 23 | [Dask](https://dask.org/) provides a powerful environment to specify complex calculations as a graph that is then optimally executed in a given parallel environment. It allows algorithms to scale from a single desktop up to HPC systems with thousands of nodes. 24 | 25 | ## Pandas 26 | 27 | [Pandas](https://pandas.pydata.org/) is the standard data-analysis library for Python. It has efficient data structures and tools to handle and process large-scale data sets in Python. 28 | 29 | 30 | ## Tensorflow, PyTorch, and scikit-learn 31 | 32 | Python is the preferred environment for machine learning. We will not go into the various tools as part of this module, but just mention some of them for completeness, in particular [Tensorflow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/) and [scikit-learn](https://scikit-learn.org/stable/).
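To give a flavour of the array-centric style that the libraries above build on, here is a minimal Numpy sketch; the array contents are arbitrary illustrative values, not taken from any particular application:

```python
import numpy as np

# Vectorised operations act on whole arrays at once, avoiding
# slow Python-level for-loops.
a = np.arange(6.0).reshape(2, 3)  # the 2x3 array [[0, 1, 2], [3, 4, 5]]
b = np.ones(3)

c = a + b     # broadcasting: b is added to every row of a
m = a @ a.T   # matrix product with the transpose, a 2x2 result

print(c.sum())   # 21.0
print(m.shape)   # (2, 2)
```

Numba, Dask, and Pandas all operate on exactly this kind of array, which is why the Numpy array type is such a central piece of the ecosystem.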
33 | 34 | 35 | ## Other tools 36 | 37 | There are many other tools available for scientific computing in Python, such as [mpi4py](https://mpi4py.readthedocs.io/en/stable/) for data communication on clusters using the MPI standard, [petsc4py](https://pypi.org/project/petsc4py/), an interface to the widely used PETSc library for parallel sparse matrix operations, or [FEniCS](https://fenicsproject.org/), a powerful PDE solution package using the finite element method. Many more large and small packages exist for specific application areas, too numerous to mention here. 38 | 39 | -------------------------------------------------------------------------------- /hpc_lecture_notes/references.bib: -------------------------------------------------------------------------------- 1 | --- 2 | --- 3 | 4 | @inproceedings{holdgraf_evidence_2014, 5 | address = {Brisbane, Australia}, 6 | title = {Evidence for {Predictive} {Coding} in {Human} {Auditory} {Cortex}}, 7 | booktitle = {International {Conference} on {Cognitive} {Neuroscience}}, 8 | publisher = {Frontiers in Neuroscience}, 9 | author = {Holdgraf, Christopher Ramsay and de Heer, Wendy and Pasley, Brian N. and Knight, Robert T.}, 10 | year = {2014} 11 | } 12 | 13 | @article{holdgraf_rapid_2016, 14 | title = {Rapid tuning shifts in human auditory cortex enhance speech intelligibility}, 15 | volume = {7}, 16 | issn = {2041-1723}, 17 | url = {http://www.nature.com/doifinder/10.1038/ncomms13654}, 18 | doi = {10.1038/ncomms13654}, 19 | number = {May}, 20 | journal = {Nature Communications}, 21 | author = {Holdgraf, Christopher Ramsay and de Heer, Wendy and Pasley, Brian N. and Rieger, Jochem W. and Crone, Nathan and Lin, Jack J. and Knight, Robert T. and Theunissen, Frédéric E.}, 22 | year = {2016}, 23 | pages = {13654}, 24 | file = {Holdgraf et al. - 2016 - Rapid tuning shifts in human auditory cortex enhance speech intelligibility.pdf:C\:\\Users\\chold\\Zotero\\storage\\MDQP3JWE\\Holdgraf et al.
- 2016 - Rapid tuning shifts in human auditory cortex enhance speech intelligibility.pdf:application/pdf} 25 | } 26 | 27 | @inproceedings{holdgraf_portable_2017, 28 | title = {Portable learning environments for hands-on computational instruction using container-and cloud-based technology to teach data science}, 29 | volume = {Part F1287}, 30 | isbn = {978-1-4503-5272-7}, 31 | doi = {10.1145/3093338.3093370}, 32 | abstract = {© 2017 ACM. There is an increasing interest in learning outside of the traditional classroom setting. This is especially true for topics covering computational tools and data science, as both are challenging to incorporate in the standard curriculum. These atypical learning environments offer new opportunities for teaching, particularly when it comes to combining conceptual knowledge with hands-on experience/expertise with methods and skills. Advances in cloud computing and containerized environments provide an attractive opportunity to improve the effciency and ease with which students can learn. This manuscript details recent advances towards using commonly-Available cloud computing services and advanced cyberinfrastructure support for improving the learning experience in bootcamp-style events. We cover the benets (and challenges) of using a server hosted remotely instead of relying on student laptops, discuss the technology that was used in order to make this possible, and give suggestions for how others could implement and improve upon this model for pedagogy and reproducibility.}, 33 | author = {Holdgraf, Christopher Ramsay and Culich, A. and Rokem, A. and Deniz, F. and Alegro, M. 
and Ushizima, D.}, 34 | year = {2017}, 35 | keywords = {Teaching, Bootcamps, Cloud computing, Data science, Docker, Pedagogy} 36 | } 37 | 38 | @article{holdgraf_encoding_2017, 39 | title = {Encoding and decoding models in cognitive electrophysiology}, 40 | volume = {11}, 41 | issn = {16625137}, 42 | doi = {10.3389/fnsys.2017.00061}, 43 | abstract = {© 2017 Holdgraf, Rieger, Micheli, Martin, Knight and Theunissen. Cognitive neuroscience has seen rapid growth in the size and complexity of data recorded from the human brain as well as in the computational tools available to analyze this data. This data explosion has resulted in an increased use of multivariate, model-based methods for asking neuroscience questions, allowing scientists to investigate multiple hypotheses with a single dataset, to use complex, time-varying stimuli, and to study the human brain under more naturalistic conditions. These tools come in the form of “Encoding” models, in which stimulus features are used to model brain activity, and “Decoding” models, in which neural features are used to generated a stimulus output. Here we review the current state of encoding and decoding models in cognitive electrophysiology and provide a practical guide toward conducting experiments and analyses in this emerging field. Our examples focus on using linear models in the study of human language and audition. We show how to calculate auditory receptive fields from natural sounds as well as how to decode neural recordings to predict speech. The paper aims to be a useful tutorial to these approaches, and a practical introduction to using machine learning and applied statistics to build models of neural activity. The data analytic approaches we discuss may also be applied to other sensory modalities, motor systems, and cognitive systems, and we cover some examples in these areas. 
In addition, a collection of Jupyter notebooks is publicly available as a complement to the material covered in this paper, providing code examples and tutorials for predictive modeling in python. The aimis to provide a practical understanding of predictivemodeling of human brain data and to propose best-practices in conducting these analyses.}, 44 | journal = {Frontiers in Systems Neuroscience}, 45 | author = {Holdgraf, Christopher Ramsay and Rieger, J.W. and Micheli, C. and Martin, S. and Knight, R.T. and Theunissen, F.E.}, 46 | year = {2017}, 47 | keywords = {Decoding models, Encoding models, Electrocorticography (ECoG), Electrophysiology/evoked potentials, Machine learning applied to neuroscience, Natural stimuli, Predictive modeling, Tutorials} 48 | } 49 | 50 | @book{ruby, 51 | title = {The Ruby Programming Language}, 52 | author = {Flanagan, David and Matsumoto, Yukihiro}, 53 | year = {2008}, 54 | publisher = {O'Reilly Media} 55 | } 56 | -------------------------------------------------------------------------------- /hpc_lecture_notes/simple_time_stepping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Simple time-stepping" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Our previous examples were all stationary problems. However, many practical simulations describe processes that change over time. In this part we want to start looking a bit closer onto time-domain partial differential equations and their efficient implementation." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Finite Difference Approximation for the time-derivative" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "We want to approximate the derivative $\\frac{du}{dt}$. 
Remember that\n", 29 | "\n", 30 | "$$\n", 31 | "\\frac{du}{dt} \\approx \\frac{u(t+\\Delta t) - u(t)}{\\Delta t}\n", 32 | "$$\n", 33 | "\n", 34 | "for sufficiently small $\\Delta t$." 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "There are three standard approximations for the time-derivative:\n", 42 | "\n", 43 | "* The forward difference: \n", 44 | "\n", 45 | "$$\n", 46 | "\\frac{du}{dt}\\approx \\frac{u(t + \\Delta t) - u(t)}{\\Delta t}.\n", 47 | "$$\n", 48 | "\n", 49 | "* The backward difference: \n", 50 | "\n", 51 | "$$\n", 52 | "\\frac{du}{dt}\\approx \\frac{u(t) - u(t - \\Delta t)}{\\Delta t}.\n", 53 | "$$\n", 54 | "\n", 55 | "* The centered difference: \n", 56 | "\n", 57 | "$$\n", 58 | "\\frac{du}{dt} \\approx \\frac{u(t + \\Delta t) - u(t -\\Delta t)}{2\\Delta t}.\n", 59 | "$$" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "To understand the error of these schemes we can use Taylor expansions to obtain\n", 67 | "\n", 68 | "$$\n", 69 | "\\frac{u(t + \\Delta t) - u(t)}{\\Delta t} = u'(t) + \\frac{1}{2}\\Delta t u''(t) + \\dots\n", 70 | "$$\n", 71 | "\n", 72 | "$$\n", 73 | "\\frac{u(t) - u(t-\\Delta t)}{\\Delta t} = u'(t) - \\frac{1}{2}\\Delta t u''(t) + \\dots\n", 74 | "$$\n", 75 | "\n", 76 | "$$\n", 77 | "\\frac{u(t + \\Delta t) - u(t-\\Delta t)}{2\\Delta t} = u'(t) + \\frac{1}{6}\\Delta t^2 u'''(t) + \\dots\n", 78 | "$$\n", 79 | "\n", 80 | "Hence, the error of the first two schemes decreases linearly with $\\Delta t$ and the error in the centred scheme decreases quadratically with $\\Delta t$." 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## The 3-point stencil for the second derivative" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "For simplicity we denote $u_i := u(t)$, $u_{i + 1} := u(t + \\Delta t)$, $u_{i-1} := u(t - \\Delta t)$.
We want to approximate\n", 95 | "\n", 96 | "$$\n", 97 | "\\frac{d}{dt}\\left[\\frac{du}{dt}\\right].\n", 98 | "$$\n", 99 | "\n", 100 | "The trick is to use an approximation around half-steps for the outer derivative, resulting in\n", 101 | "\n", 102 | "$$\n", 103 | "\\frac{d}{dt}\\left[\\frac{du}{dt}\\right]\\approx \\frac{1}{\\Delta t}\\left[{u_{i+\\frac{1}{2}}'} - {u_{i-\\frac{1}{2}}'}\\right].\n", 104 | "$$\n", 105 | "\n", 106 | "The derivatives at the half-steps are now again approximated by centered differences, resulting in\n", 107 | "\n", 108 | "$$\n", 109 | "\\begin{align}\n", 110 | "\\frac{d}{dt}\\left[\\frac{du}{dt}\\right]&\\approx \\frac{1}{\\Delta t}\\left[\\frac{u_{i+1} - u_i}{\\Delta t} - \\frac{u_i - u_{i-1}}{\\Delta t}\\right]\\\\\n", 111 | "&= \\frac{u_{i+1} - 2u_i + u_{i-1}}{\\Delta t^2}\\\\\n", 112 | "&= u''(t) + \\mathcal{O}(\\Delta t^2)\n", 113 | "\\end{align}\n", 114 | "$$\n", 115 | "\n", 116 | "This is the famous second order finite difference operator that we have already used before. Its error is quadratically small in $\\Delta t$."
117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "## Application to time-dependent problems" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "We now want to solve\n", 131 | "\n", 132 | "$$\n", 133 | "\\begin{align}\n", 134 | "\\frac{dU}{dt} &= f(U, t)\\\\\n", 135 | " U(0) &= U_0,\n", 136 | "\\end{align}\n", 137 | "$$\n", 138 | "where $U(t):\\mathbb{R}\\rightarrow \\mathbb{R}^n$ is some vector-valued function.\n", 139 | "\n", 140 | "The idea is to replace $\\frac{dU}{dt}$ by a finite difference approximation.\n", 141 | "\n", 142 | "* Forward Euler Method\n", 143 | "\n", 144 | "$$\n", 145 | "\\frac{U_{n+1} - U_n}{\\Delta t} = f(U_n, t_n)\n", 146 | "$$\n", 147 | "\n", 148 | "* Backward Euler Method\n", 149 | "\n", 150 | "$$\n", 151 | "\\frac{U_{n+1} - U_n}{\\Delta t} = f(U_{n+1}, t_{n+1})\n", 152 | "$$\n", 153 | "\n", 154 | "The forward Euler method is an explicit method. We have that\n", 155 | "\n", 156 | "$$\n", 157 | "U_{n+1} = U_n + \\Delta t f(U_n, t_n).\n", 158 | "$$\n", 159 | "\n", 160 | "and the right-hand side only has known values.\n", 161 | "\n", 162 | "In contrast to this is the backward Euler method, which is an implicit method since\n", 163 | "\n", 164 | "$$\n", 165 | "U_{n+1} = U_{n} + \\Delta t f(U_{n+1}, t_{n+1}).\n", 166 | "$$\n", 167 | "\n", 168 | "We hence need to solve a linear or nonlinear system of equations to compute $U_{n+1}$." 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "## Stability of forward Euler" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "We consider the model problem\n", 183 | "\n", 184 | "$$\n", 185 | "u'=\\alpha u\n", 186 | "$$\n", 187 | "\n", 188 | "for $\\alpha < 0$. Note that the explicit solution of this problem is $u(t) = u_0e^{\\alpha t}$.
For $t\\rightarrow\\infty$ we have $u(t)\\rightarrow 0$ if $\\alpha < 0$.\n", 189 | "\n", 190 | "The forward Euler method can now be written as\n", 191 | "\n", 192 | "$$\n", 193 | "\\begin{align}\n", 194 | "U_{n+1} &= (1+\\alpha\\Delta t)U_n\\\\\n", 195 | " &= (1+\\alpha\\Delta t)^{n+1}U_0.\n", 196 | "\\end{align}\n", 197 | "$$\n", 198 | "\n", 199 | "Hence, in order for the solution to decay we need that $|1+\\alpha\\Delta t| < 1$ or equivalently\n", 200 | "\n", 201 | "$$\n", 202 | "-1 < 1 + \\alpha \\Delta t < 1,\n", 203 | "$$\n", 204 | "\n", 205 | "from which we obtain $|\\alpha\\Delta t| < 2$ (since $\\alpha$ is negative). Now consider the problem\n", 206 | "\n", 207 | "$$\n", 208 | "\\frac{dU}{dt} = AU\n", 209 | "$$\n", 210 | "\n", 211 | "with $A\\in\\mathbb{R}^{n\\times n}$ diagonalizable. For any eigenpair $(\\lambda, \\hat{U})$ of $A$ satisfying $A\\hat{U} = \\lambda\\hat{U}$ the function $U(t) = e^{\\lambda t}\\hat{U}$ is a solution of this problem.\n", 212 | "If the eigenvalues are real and negative we require for forward Euler to be stable that\n", 213 | "\n", 214 | "$$\n", 215 | "\\Delta t < \\frac{2}{|\\lambda_{max}(A)|},\n", 216 | "$$\n", 217 | "\n", 218 | "where $\\lambda_{max}$ is the largest eigenvalue by magnitude. Note that if the eigenvalues are complex the condition becomes\n", 219 | "\n", 220 | "$$\n", 221 | "\\Delta t < \\frac{2|\\text{Re}(\\lambda)|}{|\\lambda(A)|^2},\n", 222 | "$$\n", 223 | "\n", 224 | "which has to hold for every eigenvalue $\\lambda$ of $A$ with negative real part.\n", 225 | "\n", 226 | "As an example let us take a look at the problem\n", 227 | "\n", 228 | "$$\n", 229 | "\\frac{\\partial u(x, t)}{\\partial t} = \\frac{\\partial^2 u(x, t)}{\\partial x^2}\n", 230 | "$$\n", 231 | "\n", 232 | "with $u(x, 0) = u_0(x)$, $u(0, t) = u(1, t) = 0$. We can discretise the right-hand side using our usual second order finite difference scheme. For the left-hand side, we use the forward Euler method.
This gives us the recurrence equation\n", 233 | "\n", 234 | "$$\n", 235 | "U_{n+1} = U_n + \\Delta t A U_n,\n", 236 | "$$\n", 237 | "\n", 238 | "with $A = \\frac{1}{h^2}\\text{tridiag}(1, -2, 1)$.\n", 239 | "\n", 240 | "The eigenvalues of $A$ are given explicitly as\n", 241 | "\n", 242 | "$$\n", 243 | "\\lambda_k = -\\frac{4}{h^2}\\sin^2\\frac{k\\pi}{2(n+1)}\n", 244 | "$$\n", 245 | "\n", 246 | "We therefore have that $|\\lambda_{max}|\\sim \\frac{4}{h^2}$. Hence, for forward Euler to be stable we require that\n", 247 | "\n", 248 | "$$\n", 249 | "\\frac{\\Delta t}{h^2} \\lesssim \\frac{1}{2}.\n", 250 | "$$\n", 251 | "\n", 252 | "Hence, we need that $\\Delta t\\sim h^2$, meaning that the number of required time steps grows quadratically with the discretisation accuracy." 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## Stability of backward Euler" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "For backward Euler we obtain\n", 267 | "\n", 268 | "$$\n", 269 | "\\begin{align}\n", 270 | "U_{n+1} &= (1-\\alpha\\Delta t)^{-1}U_n\\\\\n", 271 | " &= (1-\\alpha\\Delta t)^{-(n+1)}U_0.\n", 272 | "\\end{align}\n", 273 | "$$\n", 274 | "\n", 275 | "We now require that $|(1-\\alpha \\Delta t)^{-1}| < 1$. But for $\\alpha<0$ this is always true. Hence, the backward Euler method is unconditionally stable." 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## Implicit vs explicit methods" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "This analysis is very typical. In computational sciences we always have to make a choice between implicit and explicit methods. The advantage of implicit methods is their very good stability properties, which for the backward Euler method allow us to choose the time-discretisation independently of the spatial discretisation.
For explicit methods we have to be much more careful, and in the case of Euler we have the quadratic dependency between time-steps and spatial discretisation. However, a single time-step is much cheaper for explicit Euler as we do not need to solve a linear or nonlinear system of equations in each step. The right choice of solver depends on a huge number of factors and is very application dependent." 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "## Time-Stepping Methods in Software" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "In practice we do not necessarily use explicit or implicit Euler. There are many better methods out there. The Scipy library provides a number of time-stepping algorithms. For PDE problems PETSc has an excellent infrastructure of time-stepping methods built in to support the solution of time-dependent PDEs." 304 | ] 305 | } 306 | ], 307 | "metadata": { 308 | "kernelspec": { 309 | "display_name": "Python [conda env:dev] *", 310 | "language": "python", 311 | "name": "conda-env-dev-py" 312 | }, 313 | "language_info": { 314 | "codemirror_mode": { 315 | "name": "ipython", 316 | "version": 3 317 | }, 318 | "file_extension": ".py", 319 | "mimetype": "text/x-python", 320 | "name": "python", 321 | "nbconvert_exporter": "python", 322 | "pygments_lexer": "ipython3", 323 | "version": "3.8.5" 324 | } 325 | }, 326 | "nbformat": 4, 327 | "nbformat_minor": 4 328 | } 329 | -------------------------------------------------------------------------------- /hpc_lecture_notes/sparse_solvers_introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# An introduction to sparse linear system solvers" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "We have seen that we can
efficiently represent large sparse matrices with suitable data structures. Moreover, we can efficiently evaluate matrix-vector products if the sparse matrix is given in CSR format." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "What is missing is a way to efficiently solve linear systems with these data structures. We could attempt to use standard LU decomposition (Gaussian elimination). But the computational complexity is $O(n^3)$, making this method infeasible for very large sparse systems. The issue is that standard LU decomposition does not take into account that most elements of the matrix are zero. There are different ways to overcome this issue.\n", 22 | "\n", 23 | "* **Sparse direct solvers**. Sparse direct solvers are essentially variants of LU decomposition, but tuned to take into account that most of the matrix consists of zero elements. Sparse direct solvers are highly efficient for PDE problems in two dimensions and still very good for many three dimensional problems. However, their performance deteriorates on matrices arising from complex three dimensional meshes.\n", 24 | "\n", 25 | "* **Iterative methods**. The most widely used iterative solvers are based on so-called Krylov subspace iterations. The idea is that the matrix is only known through its action on vectors, that is, we are allowed to use matrix-vector products only. A sequence of matrix-vector products is then used to build up a low-dimensional model of the matrix that can be solved efficiently and well approximates the solution of the original large linear system. Iterative methods are widely used in applications and can give almost optimal complexity in the number of unknowns. However, the performance of iterative methods depends very much on certain properties of the matrix that reflect the underlying physical problem, and so-called preconditioning techniques often need to be used to accelerate iterative solvers.
These preconditioners can themselves be complex to develop for specific applications and are a topic of much research.\n", 26 | "\n", 27 | "* **Multigrid methods**. Multigrid methods follow a different idea. Here, starting from our discretisation we move to coarser and coarser discretisation levels to improve the solution. Eventually, we are on a very coarse level on which the problem is trivial to solve. From there we refine again. Multigrid can be used as a solver on its own or as a preconditioner for iterative methods.\n" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Software for sparse solvers" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "In the following we want to give a very incomplete overview of some frequently used software packages for the solution of sparse linear systems of equations." 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "### Sparse direct solver packages\n", 49 | "\n", 50 | "* [UMFPACK (Part of Suitesparse)](https://people.engr.tamu.edu/davis/suitesparse.html) is a widely used sparse direct solver. It is built into Matlab and also available in Python through scikit-umfpack. It is very efficient and constantly being developed.\n", 51 | "* [Pardiso](https://www.pardiso-project.org/) is available either directly under a closed source license or as part of the Intel MKL, with the caveat that the Intel MKL version is old and significantly slower than the directly available version.\n", 52 | "* [SuperLU](https://portal.nersc.gov/project/sparse/superlu/) is the standard sparse solver that is also built into Scipy. Scipy only offers the serial version of the library, which is sufficient for small to medium problems.\n", 53 | "* [Mumps](http://mumps.enseeiht.fr/) is a massively parallel sparse direct solver.
It is often used on parallel clusters.\n", 54 | "* [Amesos2](https://trilinos.github.io/amesos2.html) is part of [Trilinos](https://trilinos.github.io/), a large collection of libraries for the parallel solution of partial differential equations. Amesos2 provides its own sparse direct solver, as well as interfaces to many other sparse direct solvers.\n", 55 | "* [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) is a templated C++ linear algebra library for dense and sparse operations. It provides its own sparse direct solver and also interfaces to many external solvers." 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "### Sparse iterative solvers\n", 63 | "\n", 64 | "* [Scipy](https://www.scipy.org/) has a good selection of sparse iterative solvers built in. For medium sized matrix problems it is a very good choice.\n", 65 | "* [PETSc](https://www.mcs.anl.gov/petsc/) is a parallel sparse solver library with a range of built-in iterative solvers.\n", 66 | "* [Belos](https://trilinos.github.io/belos.html) is part of Trilinos and provides a number of parallel iterative solvers.\n", 67 | "* [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) not only has sparse direct but also several iterative solvers built in." 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Multigrid solvers\n", 75 | "\n", 76 | "* [AmgX](https://developer.nvidia.com/amgx) is an algebraic multigrid library for Nvidia GPUs.\n", 77 | "* [PyAMG](https://github.com/pyamg/pyamg) is a Python-based algebraic multigrid package.\n", 78 | "* [ML](https://trilinos.github.io/ml.html) is the multigrid solver as part of the Trilinos package." 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "Most of the above packages are written in C/C++. But many of them have Python bindings.
In the following sessions we will discuss sparse direct solvers, iterative solvers, and multigrid in more detail, and then give examples using some of the above software packages." 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [] 94 | } 95 | ], 96 | "metadata": { 97 | "kernelspec": { 98 | "display_name": "Python [conda env:dev] *", 99 | "language": "python", 100 | "name": "conda-env-dev-py" 101 | }, 102 | "language_info": { 103 | "codemirror_mode": { 104 | "name": "ipython", 105 | "version": 3 106 | }, 107 | "file_extension": ".py", 108 | "mimetype": "text/x-python", 109 | "name": "python", 110 | "nbconvert_exporter": "python", 111 | "pygments_lexer": "ipython3", 112 | "version": "3.8.5" 113 | } 114 | }, 115 | "nbformat": 4, 116 | "nbformat_minor": 4 117 | } 118 | -------------------------------------------------------------------------------- /hpc_lecture_notes/what_is_hpc.md: -------------------------------------------------------------------------------- 1 | # What is High-Performance Computing? 2 | 3 | ## Floating point numbers 4 | 5 | Most applications in High-Performance Computing rely on floating point operations. These are 6 | operations such as `1.2 + 3.7`, or `2.8 * 5.7`. Most large scale computational simulations rely 7 | on these operations. Floating point numbers are defined in the [IEEE 754 standard](https://en.wikipedia.org/wiki/IEEE_754). This is one of the most famous standards in numerical computing. Here, we will not go into all the details of this 8 | standard but just consider the two most important types: **single precision** and **double precision** floating point numbers. 9 | 10 | We will define floating point numbers using the following slightly simplified model. The floating point numbers are the set 11 | 12 | $$ 13 | \mathcal{F} = \left\{(-1)^s\cdot b^e \cdot \frac{m}{b^{p-1}} :\right. 14 | \left. 
s = 0,1; e_{min}\leq e \leq e_{max}; b^{p-1}\leq m\leq b^{p}-1\right\}. 15 | $$ 16 | 17 | The number $b$ is the base, which is always $2$ on modern computers, $m$ denotes the mantissa, $e$ the exponent, and $p$ determines the available precision. 18 | 19 | Floating point numbers are not equally spaced on the number line. To understand the spacing, consider the term $\frac{m}{b^{p-1}}$. We have 20 | 21 | $$ 22 | \frac{m}{b^{p-1}} = 1, 1 + 2^{1-p}, 1 + 2\times 2^{1-p}, \dots, 2 - 2^{1-p}. 23 | $$ 24 | 25 | These are all the possible floating point numbers between $1$ and $2$. To obtain the floating point numbers between $2$ and $4$ we multiply these numbers by $2$, and so on. Hence, the floating point numbers become more coarsely spaced the larger they get. This makes sense. 26 | 27 | The two most important classes of floating point numbers are the following: 28 | 29 | * `IEEE double precision`: $e_{min} = -1022, e_{max} = 1023, p=53$ 30 | * `IEEE single precision`: $e_{min} = -126, e_{max} = 127, p=24$ 31 | 32 | Roughly, double precision numbers give around 16 digits of accuracy, while single precision numbers give around 8 digits of accuracy. 33 | 34 | An important number is $\epsilon_{rel} = 2^{1-p}$. This is the smallest number in floating point arithmetic such that $1 + \epsilon_{rel}\neq 1$. In double precision we have $\epsilon_{rel}\approx 2.2\times 10^{-16}$ and in single precision $\epsilon_{rel}\approx 1.2\times 10^{-7}$. 35 | 36 | ## How many Flops/s do I have? 37 | 38 | One of the most important measures for the performance of a computing device in High-Performance Computing is the number of floating point operations per second that it can perform. Below are a couple of performance numbers for different types of CPUs/GPUs.
39 | 40 | | Name | Peak Performance (GFlop/s) | Notes | 41 | | ---- | --------------------------- | ----- | 42 | | Intel Xeon Platinum 8280M (28 Cores) | 1,612.8 | A fast workstation CPU | 43 | | Raspberry Pi 4 Model B | 24 | A very cheap and fun-to-program ARM CPU | 44 | | Nvidia RTX 4090 | 73,000 | Single precision peak for Nvidia's new GPU generation | 45 | | PS5 GPU | 10,280 | Single precision peak of the new GPU for the PS5 | 46 | | XBOX Series X GPU | 12,500 | Single precision peak of the new GPU for the XBOX Series X | 47 | 48 | The measure here is GFlop/s, that is, $10^9$ floating point operations per second. The table covers some very different systems. The first is a fast Intel server CPU. On the other end of the spectrum we have the Raspberry Pi with just 24 GFlop/s peak performance. It is representative of typical low-power, cheap ARM CPUs, which have very little in common with the very fast ARM-based chips in recent Apple computers. 49 | 50 | The table also contains numbers for the processing power of GPUs (graphics processing units). They are not only good for displaying great graphics but are also highly parallel compute devices. We will later learn more about the differences between CPUs and GPUs. The Nvidia RTX 4090 manages around 73 TFlop/s for single precision operations. While CPUs are good at both single and double precision operations, GPUs are usually optimised for single precision, since the reduced precision does not matter for graphics. One can, however, also buy specialised compute devices from Nvidia that are optimised for double precision. Finally, we have the PS5 and XBOX Series X with around 10 and 12.5 TFlop/s of GPU performance, both based on AMD's RDNA 2 architecture. 51 | 52 | ## The Top 500 53 | 54 | For really large machines there is a regularly updated list of the world's fastest supercomputers. The [Top 500](https://top500.org/) shows which machines are the fastest in the world.
The current number one is Frontier, with a peak performance of 1.7 EFlop/s (1 EFlop/s is $10^{18}$ Flop/s, or $10^6$ TFlop/s). It is interesting to consider the [performance over time](https://top500.org/statistics/perfdevel/). 55 | 56 | ![Performance over time](./img/top500development.png) 57 | 58 | ## A biased definition of High-Performance Computing 59 | 60 | Only 20 years ago a PS5 would have been the world's fastest supercomputer. This observation carries an important message: what we consider a supercomputer now will be a standard desktop system in the not-too-distant future. 61 | 62 | It therefore makes little sense to talk about High-Performance Computing only when we develop on very big systems. What we have under our desks now was a big system just a few years ago. My personal definition of High-Performance Computing is the following: 63 | 64 | **High-Performance Computing is concerned with developing tools, algorithms, and applications that can make optimal use of a given hardware environment.** 65 | 66 | In this sense, we can also perform High-Performance Computing on a Raspberry Pi or a mobile phone. Indeed, the trend is towards scalable development environments that allow us to make optimal use of hardware from a small low-powered ARM device up to the fastest supercomputers in the world. 67 | 68 | While the software landscape is fast moving, certain development principles have proven useful on any kind of device. In this module we want to discuss these techniques and how to achieve high-performing code on current CPU and GPU systems.
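As a first hands-on experiment in this spirit, the following sketch estimates the sustained Flop/s your own machine reaches on an $n\times n$ matrix product, which costs roughly $2n^3$ floating point operations. The numbers measured this way will typically sit well below the vendor peak figures quoted earlier, since peak performance is rarely achievable in practice.

```python
import time

import numpy as np

def measured_gflops(n=2000, dtype=np.float64):
    """Estimate sustained GFlop/s from an n x n matrix product (~2*n**3 flops)."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    a @ b  # warm-up run so one-off BLAS setup costs are not timed
    start = time.perf_counter()
    a @ b
    elapsed = time.perf_counter() - start
    return 2 * n ** 3 / elapsed / 1e9

print(f"double precision: {measured_gflops():8.1f} GFlop/s")
print(f"single precision: {measured_gflops(dtype=np.float32):8.1f} GFlop/s")
```

On most systems the single precision figure comes out noticeably higher than the double precision one, which reflects the precision trade-off discussed above.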
69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | -------------------------------------------------------------------------------- /other/byte_array.odg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/other/byte_array.odg -------------------------------------------------------------------------------- /other/simd_addition.odg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tbetcke/hpc_lecture_notes/62a4164ce7cb8a3da008dccb77b398b5fd5edd62/other/simd_addition.odg -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | jupyter-book 2 | matplotlib 3 | numpy 4 | numba 5 | ghp-import 6 | --------------------------------------------------------------------------------