├── README.md ├── 8_OpenCL.md ├── FAQ.md ├── 7_OpenMP.md ├── 4_Compilers.md ├── 0_Prerequisites.md ├── 2_Modules.md ├── 6_MPI.md ├── 3_Queueing_Systems.md ├── 1_Connecting_to_BlueCrystal.md └── 5_Performance_Analysis_Tools.md /README.md: --------------------------------------------------------------------------------
1 | Getting Started with the HPC Course
2 | ===================================
3 | 
4 | 
5 | This is a collection of documents aiming to introduce the environment and typical workflows in HPC systems.
6 | The tutorial addresses — and should be useful for — the University of Bristol High Performance Computing course.
7 | Note, however, that we likely cover more material than is directly relevant for the assignments—just because a technique or issue is mentioned in this tutorial, it does _not_ mean that you necessarily need to tackle it in your submitted code or report.
8 | 
9 | ## Structure
10 | 
11 | This repository aims to serve both as introductory material for the course labs and as a reference to be used throughout the assignment.
12 | When reading this material for the first time, we suggest that you follow the ordering presented in the [section below](#suggested-reading-order) and try the commands and code yourself.
13 | However, each page aims to be self-contained and cover its topic without relying on a specific reading order.
14 | 
15 | ### Suggested reading order
16 | 
17 | If you are unfamiliar with HPC systems, it's worth covering the basics of connecting to and using a supercomputer before moving on to programming tools.
18 | We suggest you go through this tutorial as follows:
19 | 
20 | 0. [Prerequisites](0_Prerequisites.md) – _Please read this first!_
21 | 1. [Connecting to BlueCrystal](1_Connecting_to_BlueCrystal.md)
22 | 2. [Software Modules and BlueCrystal](2_Modules.md)
23 | 3. [Queueing Systems](3_Queueing_Systems.md)
24 | 4. [HPC Compilers](4_Compilers.md)
25 | 5. [Performance Analysis Tools](5_Performance_Analysis_Tools.md)
26 | 6. [OpenMP](7_OpenMP.md)
27 | 7. [MPI](6_MPI.md)
28 | 8. [OpenCL](8_OpenCL.md)
29 | 
30 | 
31 | 
32 | Note that this is not required reading _per se_.
33 | The purpose of this tutorial is to help you get started with your assignment, but as long as you acquire the necessary skills to complete the unit, feel free to skip sections or use any alternative or additional material.
34 | 
35 | Items marked with † cover advanced topics and may exceed the scope of the Introduction course.
36 | 
37 | ### FAQs
38 | 
39 | In addition to the reference pages, we are collecting answers to common questions in [a dedicated FAQ page](FAQ.md).
40 | If you encounter a problem, please check this page before asking a question to see if an answer is already available.
41 | 
42 | ## Additional Material
43 | 
44 | In the past, we have used various materials that you may find useful throughout the unit. The list below is in no particular order.
45 | 
46 | 
47 | - [OpenMP and MPI Examples on GitHub](https://github.com/UoB-HPC/hpc-course-examples)
48 | - [OpenMP Tutorial Videos by Tim Mattson @ Intel](https://www.youtube.com/watch?v=nE-xN4Bf8XI&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG)
49 | - [GPU Programming Examples on GitHub](https://github.com/UoB-HPC/advanced-hpc-examples) (likely relevant for Advanced HPC only)
50 | 
51 | In addition, there are many books and links to online resources [on the unit's BlackBoard page](https://www.ole.bris.ac.uk/webapps/blackboard/content/listContentEditable.jsp?content_id=_4566993_1&course_id=_240828_1&mode=reset).
52 | 
53 | ----
54 | 
55 | ## Issues
56 | 
57 | Please report any problems or mistakes and suggest any potential improvements using [Issues](https://github.com/UoB-HPC/hpc-course-getting-started/issues).
58 | 
-------------------------------------------------------------------------------- /8_OpenCL.md: --------------------------------------------------------------------------------
1 | # OpenCL
2 | 
3 | This section discusses compiling and running OpenCL programs on BlueCrystal Phase 3.
4 | As with [OpenMP](7_OpenMP.md), this is relatively straightforward: you only need to load an OpenCL module and run on a node with your desired hardware, e.g. GPUs.
5 | 
6 | Note that this is **not** an OpenCL programming tutorial.
7 | 
8 | ## OpenCL implementations
9 | 
10 | OpenCL implementations are generally distributed by hardware vendors to target their own hardware.
11 | For example, NVIDIA distribute their OpenCL library as part of the CUDA package, and so it can only be used on GPUs that support CUDA.
12 | Conversely, Intel distribute their own implementation, which targets Intel CPUs and GPUs (although not all models are supported, particularly older ones).
13 | 
14 | One thing to be aware of is that there are several versions of the OpenCL standard.
15 | Most vendors only support [OpenCL 1.2](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/), and not [OpenCL 2.0](https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/) or newer, so make sure you read the appropriate documentation.
16 | On BCp3, you will be using version 1.2.
17 | 
18 | If you want to install OpenCL on your own machine:
19 | 
20 | * On **Linux**, you should follow the instructions for your distribution and hardware. If you have a discrete GPU, the easiest way is through the [AMD](https://www.amd.com/en/support/kb/faq/amdgpu-installation)/[NVIDIA](https://developer.nvidia.com/cuda-downloads) drivers. If you only have integrated (Intel) graphics or you want to use OpenCL on the CPU, use the [Intel implementation](https://software.intel.com/en-us/intel-opencl) if your hardware supports it, or try [Beignet](https://www.freedesktop.org/wiki/Software/Beignet/) otherwise.
21 | * On **macOS**, OpenCL should be installed and ready to be used by default. It _should_ work on both CPU and GPU, although results may vary depending on hardware.
22 | * On **Windows**, this is just asking for trouble and you shouldn't do it...
23 | 
24 | ## Compiling OpenCL programs
25 | 
26 | On BCp3, you can use OpenCL on the GPUs with the `cuda/toolkit/6.5` module.
27 | After loading the module, all you need to do is link against the OpenCL library, e.g. by adding `-lOpenCL` to your linker flags.
28 | You should use GCC to compile your host code, so your compilation line may look similar to:
29 | 
30 | ```bash
31 | gcc -std=c99 -O3 d2q9-bgk.c -lm -lOpenCL -o d2q9-bgk
32 | ```
33 | 
34 | If you want to try OpenCL on the CPU, use the `intel-opencl/1.2-3.0.67279` module.
35 | However, you should be aware that people have had mixed success with this in the past.
36 | You should also (already) know that programs optimised for GPUs don't necessarily perform well on CPUs, especially when running on older-generation hardware such as Sandy Bridge.
37 | 
38 | ## Running OpenCL jobs
39 | 
40 | To run an OpenCL program, you don't need to do anything in addition to having an OpenCL device available.
41 | When targeting a GPU on BCp3, this means requesting a node with a GPU through the queueing system.
42 | This is done by adding the `gpus=n` constraint, where `n` is the number of GPUs you want to use and can be either `1` or `2` (since GPU nodes in BCp3 have two GPUs per node).
43 | Your request may look similar to:
44 | 
45 | ```bash
46 | qsub -lnodes=1:ppn=16:gpus=1 ...
47 | ```
48 | 
49 | You can use all [other `qsub` parameters](3_Queueing_Systems.md#bcp3--pbs) as usual.
50 | 
51 | ## Further reference
52 | 
53 | * The [OpenCL homepage](https://www.khronos.org/opencl/) has links to many documentation resources.
54 | * Just as with OpenMP, [the OpenCL specification](https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf) is not hard to read and will answer a lot of questions. Note that some behaviour is still implementation-dependent.
55 | * There are [online OpenCL reference pages](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/) and a [cheatsheet-style reference card](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/OpenCL-1.2-refcard.pdf).
56 | * NVIDIA have a [page of OpenCL docs and examples](https://developer.nvidia.com/opencl); so do [AMD](https://rocm-documentation.readthedocs.io/en/latest/Programming_Guides/Opencl-programming-guide.html).
57 | * Finally, a (long) list of OpenCL material can be found [on the Khronos Group website](https://www.khronos.org/opencl/resources). This page has links to all the implementations and bindings for different programming languages.
58 | 
-------------------------------------------------------------------------------- /FAQ.md: --------------------------------------------------------------------------------
1 | Frequently Asked Questions
2 | ==========================
3 | 
4 | This page contains a collection of answers to common questions.
5 | It is constantly being updated.
6 | 
7 | ## Logging into BlueCrystal
8 | 
9 | **Q**: I am having trouble logging into BlueCrystal, I've forgotten my password, etc. What should I do?<br
10 | **A**: Email the HPC support team explaining your problem.
11 | 
12 | **Q**: I think I've set up an SSH key, but I am still being asked for my password. What could have gone wrong?<br
13 | **A**: It's likely that you either 1) haven't placed your public key in `authorized_keys` on BlueCrystal, 2) you have not set `0644` permissions on `authorized_keys`, or 3) you haven't set your `.ssh/config` to use the right username and key file. Try following the manual steps again to make sure everything is there. 14 | 15 | **Q**: I have not used Linux much before and I don't know many commands, or I don't know which terminal editor to use and how it works. What should I do?
16 | **A**: There are plenty of resources online explaining your options and how to use them. For CLI text editors, a few options are suggested in [the _Connecting_ section](1_Connecting_to_BlueCrystal.md#editing-files-remotely), and all of them have manpages, a large userbase, and plenty of documentation online.
17 | 
18 | ## Compiling and running code
19 | 
20 | **Q**: Can I access GitHub from BlueCrystal, or do I need to pull my repository locally and upload it manually?<br
21 | **A**: You can access any online resource _from inside BC_; what you can't do is get _into BC_ from outside the University's network. Load the `tools/git/2.35.1` module and clone your repository directly on BlueCrystal. 22 | 23 | **Q**: I have changed my Makefile, but typing `make` doesn't run the updated commands. What do I need to do?
24 | **A**: Make will only recompile if _source files_ have changed, not Makefiles themselves. You can ask it to rebuild anyway with a flag: `make -B`. 25 | 26 | **Q**: I am encountering segmentation faults or bus errors, but they don't seem to happen all the time. Is it an issue with the compute nodes?
27 | **A**: These errors [generally mean you have attempted to access invalid memory locations](https://stackoverflow.com/questions/212466/what-is-a-bus-error), e.g. outside your program's memory space. If you have a parallel program, they may depend on the order of execution of some statements, which can be non-deterministic. Try using Valgrind to check for memory access issues and see if you can reproduce the problem with a single process/thread. 28 | 29 | **Q**: When compiling with ICC, I get a `Catastrophic error: could not set locale "" to allow processing of multibyte characters` error. What does this mean?
30 | **A**: You need to `export LANG=C LC_ALL=C`; add this to your `.bashrc`, so that it's automatically done every time you log in.
31 | 
32 | **Q**: On Phase 4, why does my application crash under `mpirun`?<br
33 | **A**: While the reason could be down to a bug in _your application_, you should use `srun` as your parallel launcher, and not `mpirun`. Make sure to `export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so` _before_ you run your application. 34 | 35 | 36 | 37 | ## The queueing system 38 | 39 | **Q**: When I try to submit a job, there is no job output file and I get a `Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)` error in my UNIX mail. What went wrong?
40 | **A**: This normally happens when your `authorized_keys` file has been damaged. Try running the following command (note that there are _two_ angle brackets, _not one_!), then attempt your job again: 41 | ``` 42 | $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys 43 | ``` 44 | 45 | **Q**: When I try to submit a job using `qsub`, I get the following error message: `qsub: file must be an ascii script`. What am I doing wrong?
46 | **A**: You are trying to pass an application binary directly to `qsub`, which it doesn't allow. You need to write a [job script](3_Queueing_Systems.md#job-files). There is a note about this in the [previous part](3_Queueing_Systems.md#submitting-jobs).
47 | 
48 | ## Performance tools
49 | 
50 | **Q**: What does the VTune `Amplxe-cl Cannot enable Hardware Event-based Sampling: problem with the driver` error mean?<br
51 | **A**: You can run the Hotspot analysis in VTune, but the more advanced analyses need an additional driver installed. Due to some (legacy) configuration issues, this is hard to set up on BCp3. If you need advanced analyses, try running VTune on your own machine.
52 | 
53 | ## Other system tools
54 | 
55 | **Q**: When I try to push/pull from a remote git repository on BCp4, I get an SSL library error. What am I doing wrong?<br
56 | **A**: There seem to be intermittent issues with git on Phase 4 if you have Intel modules loaded. Log out completely, log back in, then attempt your git operations again _without loading any Intel module_.
57 | 
58 | 
62 | 
63 | 
-------------------------------------------------------------------------------- /7_OpenMP.md: --------------------------------------------------------------------------------
1 | # OpenMP
2 | 
3 | This section discusses compiling and running OpenMP programs on BlueCrystal Phase 4.
4 | As you will find, and unlike with [MPI](6_MPI.md), there is almost no difference compared to when dealing with serial problems.
5 | Still, there are a number of issues that you should be aware of and hints worth remembering.
6 | 
7 | Note that this is **not** an OpenMP programming tutorial.
8 | 
9 | ## OpenMP implementations
10 | 
11 | OpenMP libraries are implemented by and shipped with all the compilers available on BCp4.
12 | You _do not_ need to load any additional module or use a compiler wrapper.
13 | However, be aware that each compiler vendor uses their own implementation of OpenMP, so some may perform better than others for a given code.
14 | 
15 | ## Compiling OpenMP programs
16 | 
17 | You should use the latest version of your compiler of choice, so that you can take advantage of as many optimisations and as much functionality as possible.
18 | The relevant modules are:
19 | 
20 | - GCC 10.4:
21 | ```bash
22 | module load languages/gcc-10.4.0
23 | ```
24 | - Intel 2020:
25 | ```bash
26 | module load languages/intel/2020-u4
27 | ```
28 | 
29 | When compiling, you need to use a specific flag to enable OpenMP, _in addition to optimisation flags_, even at high optimisation levels.
30 | If you don't, then your program will still compile, but the `pragma` directives will be ignored and no OpenMP will be used.
31 | The flags you should use are:
32 | 
33 | | Compiler | Flag |
34 | | -------- | ---------- |
35 | | GCC | `-fopenmp` |
36 | | Intel | `-qopenmp` |
37 | 
38 | ## Running OpenMP jobs
39 | 
40 | There is no special action you need to take to run an OpenMP program.
41 | If it has been compiled appropriately, it will spawn threads as expected on its own.
42 | However, you should always set the `OMP_NUM_THREADS` environment variable, which tells the OpenMP runtime how many threads to use for parallel regions.
43 | If you don't set this, a system default value will be used, and on some systems this is 1, i.e. serial execution!
44 | 
45 | The most common ways to do this are:
46 | 
47 | - Export the variable in your job script or interactive session. This will make it available to all further commands.
48 | ```bash
49 | export OMP_NUM_THREADS=16
50 | ./application
51 | ```
52 | - Use `env` to only pass it to one command. This is useful if you only want to set it temporarily, e.g. for testing with different values.
53 | ```bash
54 | env OMP_NUM_THREADS=16 ./application
55 | ```
56 | 
57 | ### Thread placement and binding
58 | 
59 | In addition to setting the number of threads to be used, several environment variables can be used to influence the behaviour of the OpenMP runtime with regards to _where_ the threads are run.
60 | The most important ones are `OMP_PLACES` and `OMP_PROC_BIND`; they are generally used together.
61 | 
62 | One easy way to think about specifying thread affinity this way is that `OMP_PLACES` states _where_ to look, i.e. at what hardware level, and then `OMP_PROC_BIND` says _how_ the affinity should be set at that level.
63 | Two situations are commonly identified:
64 | 
65 | 1. 
You want to have the threads as close together as possible, e.g. to minimise communication costs. This usually means placing threads on cores that are next to each other, filling up a socket before moving to the next one.
66 | ```bash
67 | export OMP_PLACES=cores OMP_PROC_BIND=close
68 | ```
69 | 2. You want to fill up your NUMA nodes equally, e.g. to spread memory access over the two sockets as much as possible.
70 | ```bash
71 | export OMP_PLACES=cores OMP_PROC_BIND=spread
72 | ```
73 | 
74 | There is also [more advanced syntax](https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPLACES.html) for defining several levels of _places_, each with its own placement strategy.
75 | If you go down this route, make sure you read the documentation carefully and check that you are actually binding the way you want to.
76 | 
77 | TACC have [a set of affinity examples](http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-affinity.html) that shows more combinations.
78 | 
79 | If you start using these variables, it is useful to know that you can set `OMP_DISPLAY_ENV=true` to have the runtime print its relevant variables before running your program.
80 | This can be helpful if you think the same variable might be set several times and you want to confirm what value is being used.
81 | 
82 | ## OpenMP examples
83 | 
84 | We have written a set of [OpenMP examples](https://github.com/UoB-HPC/hpc-course-examples/tree/master/openmp) relevant for this course.
85 | These discuss basic implementation patterns, as well as pitfalls you need to be aware of when parallelising your code.
86 | 
87 | ## Further reference
88 | 
89 | The [OpenMP specification](https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf) is a great resource for clarifying any doubts you might have about how OpenMP should behave.
90 | If the spec doesn't say it, then you should treat it as undefined behaviour.
91 | It is well organised and very easy to read, so don't be intimidated!
92 | 
93 | Documentation is also provided by the compilers available on BCp4 (each of which implements OpenMP):
94 | 
95 | - [GNU (libgomp)](https://gcc.gnu.org/onlinedocs/libgomp/index.html)
96 | - [Intel (libiomp)](https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-openmp-support)
97 | 
-------------------------------------------------------------------------------- /4_Compilers.md: --------------------------------------------------------------------------------
1 | # Compilers
2 | 
3 | Compilers generate machine code from source program code.
4 | If a compiler is able to optimise better for the target platform, then the application will run faster.
5 | 
6 | The compilers most commonly used in HPC on x86 are [the GNU Compiler](https://www.gnu.org/software/gcc/) and [the Intel Compiler](https://software.intel.com/en-us/intel-compilers/).
7 | They are both available on BlueCrystal.
8 | There is no reliable way to tell which compiler will be faster _a priori_, as this can vary greatly depending on the code and the settings used to compile the application.
9 | In general, you should try all the options and see which works better for your particular case.
10 | If there are several versions of the same compiler, it makes sense to use the latest one, although there may be cases where an older version is faster, i.e. performance regressions.
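Since the only reliable way to choose is to measure, it is often worth scripting a quick like-for-like comparison. A minimal sketch (assuming a hypothetical `stream.c` benchmark; the module names are the ones shown below, but adjust both to your own code and system):

```bash
# Build the same source with GCC and Intel, then time each binary.
module load languages/gcc/10.4.0
gcc -O3 -march=native stream.c -o stream-gcc

module load languages/intel/2020-u4
icc -O3 -xHOST stream.c -o stream-icc

# Run each a few times and keep the best result; single runs are noisy.
for bin in ./stream-gcc ./stream-icc; do
    echo "== $bin =="
    for run in 1 2 3; do /usr/bin/time -f "%e s" "$bin"; done
done
```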
11 | 
12 | 
14 | 
15 | ## Compiler modules
16 | 
17 | On BCp4, the GCC modules are named `languages/gcc/<version>`:
18 | 
19 | ```
20 | $ module av languages/gcc
21 | 
22 | -------------------------------------------------------------- /mnt/storage/easybuild/modules/local ---------------------------------------------------------------
23 | languages/gcc/7.5.0 languages/gcc/9.1.0 languages/gcc/9.3.0 languages/gcc/10.4.0 (D)
24 | ```
25 | 
26 | Similarly, Intel Compiler modules are named `languages/intel/<version>`:
27 | 
28 | ```
29 | $ module av languages/intel
30 | 
31 | -------------------------------------------------------------- /mnt/storage/easybuild/modules/local ---------------------------------------------------------------
32 | languages/intel/2016-u3-cuda-8.0 languages/intel/2017-u4 languages/intel/2017.01 languages/intel/2018-u3 languages/intel/2020-u4 (D)
33 | 
34 | Where:
35 | D: Default Module
36 | ```
37 | 
38 | Load the module for the desired compiler and version before building your application.
39 | 
40 | **Note**: The `gcc` command is available in the system by default, i.e. without loading a module, because the Linux distribution comes packaged with a compiler.
41 | However, note that this version is significantly older than what is available through modules, and so may generate slower code:
42 | 
43 | ```
44 | $ gcc -v
45 | gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)
46 | $ module load languages/gcc/9.1.0
47 | $ gcc -v
48 | gcc version 9.1.0 (GCC)
49 | ```
50 | 
51 | Also note that the alias `cc` refers to the system's default compiler.
52 | This is _not a different compiler_; it's just a command alias:
53 | 
54 | ```
55 | $ ls -l $(which cc)
56 | lrwxrwxrwx 1 root root 3 Mar 21 2017 /usr/bin/cc -> gcc
57 | ```
58 | 
59 | ## General considerations
60 | 
61 | When building your application, it may be helpful to consider the following compiler options:
62 | 
63 | | Option | Relevant GCC flags | Relevant Intel flags |
64 | | ---------------------------------------------- | ---------------------------------------------- | ---------------------- |
65 | | Optimisation level | `-O<n>` | `-O<n>` |
66 | | Target platform | `-march=<arch>`, `-mcpu=<cpu>`, `-mtune=<cpu>` | `-x<code>` |
67 | | Fast (potentially unsafe!) maths optimisations | `-ffast-math`, `-funsafe-math-optimizations` | `-fast` |
68 | | General optimisation reports | `-fopt-info-<kind>` | `-qopt-report=<n>` |
69 | | Vectorisation reports | `-fopt-info-vec-<kind>` | `-qopt-report=<n> -qopt-report-phase=vec` |
70 | 
71 | The following sections discuss some GCC and Intel-specific options.
72 | 
73 | ## Compiling with GCC
74 | 
75 | A complete list of the GCC flags can be found [in the GCC online docs](https://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html).
76 | 
77 | The default optimisation level with GCC is equivalent to `-O0`.
78 | Be careful that this may not be the same across all compilers.
79 | 
80 | When specifying the target platform, you should be able to use `-march=native` to have the compiler detect what processor it's running on and optimise for it in particular.
81 | However, there are cases when the detected processor type isn't accurate.
82 | Be careful if the CPU you're compiling on is different from the one that will run your code.
83 | 
84 | If you attempt to make use of the hardware's SIMD features, you can use the vectorisation report to check if vectorised code has been generated.
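For example, a typical invocation (illustrative only; `compute.c` is a stand-in for your own source file) might be:

```bash
# Optimise aggressively, target the build machine, and report missed vectorisation.
gcc -O3 -march=native -fopt-info-vec-missed -c compute.c
```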
85 | The following flavours of `-fopt-info-vec-*` may be useful:
86 | - `-fopt-info-vec-optimized` to show where vectorisation was used
87 | - `-fopt-info-vec-missed` to show failed attempts to vectorise
88 | - `-fopt-info-vec-all` to show all vectorisation-related messages
89 | 
90 | GCC's vectorisation reports are rather verbose and can be hard to parse.
91 | Look for lines similar to the following that show reasons why vectorisation may have failed:
92 | 
93 | ```
94 | test.c:38:3: note: bad loop form.
95 | test.c:39:5: note: reduction used in loop.
96 | test.c:39:5: note: Unknown def-use cycle pattern.
97 | ```
98 | 
99 | However, note that these reports contain merged output from several optimisation passes, so it is possible to have _some_ passes fail but still obtain SIMD code in the end.
100 | 
101 | If you get no vectorisation report output, try using the legacy flag instead: `-ftree-vectorizer-verbose=<level>`, where level 5 is a good place to start.
102 | You can also show _all_ optimisation reports, i.e. not just from the vectorizer, by omitting `vec` from the flag above: `-fopt-info-[all|missed|optimized]`.
103 | 
104 | ## Compiling with Intel
105 | 
106 | Virtually all the options of the Intel compiler, including flags, pragmas, and intrinsics, can be found [here](https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference).
107 | 
108 | You can use `-xHOST` to optimise for the platform you are compiling on.
109 | As with GCC, consider that this may not always work as expected.
110 | Also note that GNU and Intel have different names for the same platforms.
111 | 
112 | The default optimisation level with Intel is equivalent to `-O2`.
113 | Be careful that this may not be the same across all compilers.
114 | 
115 | To generate a vectorisation report, start with `-qopt-report=1 -qopt-report-phase=vec`.
116 | If you need more verbose messages, increase to `-qopt-report=2`.
117 | The output is stored in a `.optrpt` file that matches the name of your binary, with a separate section for each loop in your program.
118 | If you have nested loops, expect to see the same nesting structure in the optimisation report.
119 | 
120 | Where vectorisation was successful, you should see this line:
121 | 
122 | ```
123 | remark #15300: LOOP WAS VECTORIZED
124 | ```
125 | 
126 | If vectorisation failed, you should see a message listing a cause that prevented optimisation:
127 | 
128 | ```
129 | remark #15344: loop was not vectorized: vector dependence prevents vectorization
130 | ```
131 | 
132 | Finally, Intel have a [tutorial page on using auto-vectorisation](https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2023-1/automatic-vectorization.html) that you may find useful.
133 | 
-------------------------------------------------------------------------------- /0_Prerequisites.md: --------------------------------------------------------------------------------
1 | Prerequisites
2 | =============
3 | 
4 | ## Assumptions
5 | 
6 | Throughout this tutorial-style introduction to working with HPC codes, we will assume that the reader:
7 | 
8 | - Is familiar with the C programming language.
9 | - Is familiar with using a [command line](https://en.wikipedia.org/wiki/Command-line_interface) in a [Linux](https://en.wikipedia.org/wiki/Linux) environment.
10 | Since the supercomputers used for this course are accessed _exclusively_ through a [terminal](https://en.wikipedia.org/wiki/Terminal_emulator), this is _essential_.
11 | This tutorial assumes that the shell used is [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)).
12 | - Can efficiently navigate documentation and reference material, _both_ offline, e.g. [man pages](https://en.wikipedia.org/wiki/Man_page) or language specifications, and online, e.g. discussion forums or mailing lists, in order to _independently_ find answers to questions and solve programming and system usage issues.
13 | 
14 | ## What this tutorial is _not_
15 | 
16 | While there may be _some_ overlap with the lectures, particularly concerning tools and programming frameworks, it is important to keep in mind that **this tutorial is not a replacement for the lectures**.
17 | It also does not aim to:
18 | 
19 | - Give a complete course on any of the programming languages and frameworks discussed
20 | - Provide a "walkthrough" for the coursework
21 |   - In particular, copy-pasting all the commands into your terminal will not get you closer to the goals of this unit. **Do not** type anything into a terminal until you've understood exactly what it does and why it is useful to you.
22 | - Explain how to write a good report (although this will be covered later in class)
23 | - Replace engagement in labs, including any feedback, suggestions, and clarifications
24 | - Deter students from taking novel approaches or in any way limit their use of technologies not discussed in these pages
25 |   - For example, even though this repository shows sample code in C, feel free to write Fortran!
26 | - Substitute for interactive learning, e.g. in lectures, labs, and on the forum
27 | 
28 | ## Required (and useful) tools
29 | 
30 | To connect to the BlueCrystal supercomputer, you will need an [SSH](https://en.wikipedia.org/wiki/Secure_Shell) client.
31 | If you use any Unix derivative, chances are you have a working `ssh` command for this purpose; if you're on Windows, some of your options are listed [below](#suggestions-for-windows-users).
32 | 
33 | To write your code, we suggest you use an editor that you can customise to your liking, e.g. indentation or syntax highlighting.
34 | Many editors have out-of-the-box support for the languages you will likely use throughout this course; these include:
35 | 
36 | - [Visual Studio Code](https://code.visualstudio.com/) and [Atom](https://atom.io/), which are modern open-source text editors with rich plugin support, although they are written in JavaScript.
37 | - [gedit](https://en.wikipedia.org/wiki/Gedit) or [Kate](https://en.wikipedia.org/wiki/Kate_(text_editor)), as your [DE](https://en.wikipedia.org/wiki/Desktop_environment)'s packaged text editor.
38 | - [Vim](https://en.wikipedia.org/wiki/Vim_(text_editor)) and [Emacs](https://en.wikipedia.org/wiki/Emacs), in the terminal.
39 | 
40 | If you want to be able to compile and run your code locally (i.e. on your own laptop, or on a machine in the lab), you will need a compiler supporting your languages and frameworks.
41 | This is _not required_, but it may ease and speed up debugging and correctness testing.
42 | For C, both [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) and [LLVM/Clang](https://llvm.org/) are available for virtually all Linux distributions and support OpenMP.
43 | As a student, you can also get access to a copy of [the Intel Compiler](https://software.intel.com/en-us/parallel-studio-xe/choose-download/student-linux-fortran).
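If you do install a local toolchain, a quick sanity check that it can build OpenMP code (a throwaway example; any small parallel program will do) is:

```bash
# Compile and run a trivial OpenMP program to confirm local support.
cat > omp_check.c <<'EOF'
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Each thread prints its own ID. */
    #pragma omp parallel
    printf("Hello from thread %d\n", omp_get_thread_num());
    return 0;
}
EOF
gcc -fopenmp omp_check.c -o omp_check && ./omp_check
```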
44 | To compile MPI programs, you will need to install an MPI implementation, with the most common choices being [Open MPI](https://www.open-mpi.org/) and [MPICH](http://www.mpich.org/); it does not matter which one you choose, but make sure you are not following the documentation for the other choice!
45 | 
46 | To transfer files between your machine and the supercomputer, you can use `scp`, which is part of OpenSSH and likely already available on your Linux box. There is also [rsync](https://en.wikipedia.org/wiki/Rsync), which may speed up repeated transfers.
47 | However, manually transferring files every time you make a change is cumbersome, so a better alternative is to set up [SSHFS](https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh) to mount a remote directory as a virtual drive on your local machine.
48 | Another option is to set up your text editor to automatically sync with a remote folder ([example for VS Code](https://code.visualstudio.com/docs/remote/ssh)).
49 | 
50 | We provide the starting code and some of the examples through GitHub repositories.
51 | Although you can obtain the files without using `git`, we _strongly_ encourage you to use version control for your assignment.
52 | Without such a system, you will find yourself saving multiple copies of your files with attempted optimisations, and you risk losing track of which changes stay and which go.
53 | If you are not familiar with Git, a good starting point is [the Atlassian tutorial series](https://www.atlassian.com/git/tutorials)—use it, it may well save you a great deal of wasted effort!
54 | 
55 | ### Suggestions for Windows Users
56 | 
57 | While we _encourage_ you to use a Linux machine, or at least a [*nix](https://en.wikipedia.org/wiki/Unix-like) environment, this is _not strictly required_.
58 | If you use Windows, you are advised to:
59 | 
60 | - Set up a more _Linux-like_ environment, in order to avoid many (sometimes subtle!) inter-platform issues.
61 | Avoid using `cmd`, as it lacks features; some better options are (in rough order of "niceness"):
62 |   - Make use of [Windows Subsystem for Linux (WSL)](https://docs.microsoft.com/en-us/windows/wsl/about) to obtain a full Unix shell. This will give you an environment which is virtually indistinguishable from running full Linux for the purpose of this course. [WSLtty](https://github.com/mintty/wsltty) is a very good terminal for use with WSL; [ConEmu](https://conemu.github.io/) is also good, but you may run into text display issues.
63 |   - If you are not on Windows 10 (although arguably you should be!), [Cygwin](https://cygwin.com/) can provide a Linux-like environment that makes many Linux tools available on Windows.
64 |   - Install Linux in a [Virtual Machine](https://en.wikipedia.org/wiki/Virtual_machine). Both [VirtualBox](https://www.virtualbox.org/) and [VMWare](https://www.vmware.com/uk/products/workstation-player/workstation-player-evaluation.html) provide free virtualisation software. If in doubt which Linux distribution to choose, go with [Ubuntu](https://www.ubuntu.com/) (for large community support), [Fedora](https://getfedora.org/) (for a stable and up-to-date distribution), or [Debian](https://www.debian.org/) (for long-term stability).
65 |   - Install [Git for Windows](https://git-scm.com/downloads) and choose the option to use the packaged terminal (mintty). This will have Bash and OpenSSH packaged in alongside git.
66 |   - ConEmu (mentioned above) includes a packaged, but minimal, version of Bash that you can use without either WSL or Cygwin. This _may_ be enough for your needs. [cmder](https://cmder.app/) is ConEmu with more sane defaults and Bash and OpenSSH built-in.
67 | - Use [Microsoft's OpenSSH for Windows](https://blogs.msdn.microsoft.com/commandline/2018/01/22/openssh-in-windows-10/) or a stand-alone SSH client ([PuTTY](https://putty.org/) is a reliable and featureful choice) to allow you to connect to the supercomputer remotely. Note that if you choose this option only, _you will not be able to test your code on your own machine_. This likely needs to be paired with an [SFTP](https://en.wikipedia.org/wiki/SSH_File_Transfer_Protocol) client, such as [WinSCP](https://winscp.net/eng/index.php).
68 | - Read about [issues with line endings](https://help.github.com/articles/dealing-with-line-endings/#platform-windows) and make sure you work around them. Also keep in mind that on Windows [NTFS](https://en.wikipedia.org/wiki/NTFS) is _case-insensitive_ ([_but case-preserving_](https://superuser.com/questions/364057/why-is-ntfs-case-sensitive)!), so a file that looks as if it's named `Makefile` may actually be `MaKEfILe`—take care when (re)naming files.
69 | - Use a text editor designed with programming in mind, and **not** Notepad or—even worse—Word or WordPad. This will help you avoid silly mistakes such as [spaces instead of tabs in Makefiles](https://stackoverflow.com/a/28720186).
70 | [Visual Studio Code](https://code.visualstudio.com/) is a very good (extensible) editor, and [Notepad++](https://notepad-plus-plus.org/) is a more lightweight alternative. There is a related issue with file extensions: if you have those hidden (which is the default in Windows), make sure your editor _doesn't_ automatically add one, e.g. `Makefile.txt`.
71 | 
72 | Note that the rest of this tutorial assumes a Linux environment, and so will only show command examples for Linux.
73 | If you use Windows, or software that differs from our standard choices, _you_ are responsible for setting it up in such a way that you can follow the tutorial.
74 | As long as you choose reasonably popular software, there should be plenty of support available online, and we will do our best to highlight any issues that we know might impact non-Linux platforms.
75 | 
-------------------------------------------------------------------------------- /2_Modules.md: --------------------------------------------------------------------------------
1 | # Software Modules
2 | 
3 | On supercomputers, it is common practice to manage software using [environment modules](http://modules.sourceforge.net/).
4 | Modules allow the admins to install a wide variety of software, potentially including several versions of the same package, while trying to avoid conflicts.
5 | 
6 | The basic working principle for modules is simple: instead of installing everything in the default system location, e.g. `/usr/local`, each software package is installed into its own directory.
7 | Because these individual directories are not on the system's default search paths, the packages will not appear to be installed.
8 | Instead, each package has an associated _module file_, which defines the paths to these individual directories.
9 | When a user wants to use a particular package, they first need to _load_ the associated module, which adds the custom paths to the system's list.
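In practice, "loading" a module mostly just manipulates environment variables such as `PATH`. A hypothetical session (the exact paths are illustrative, based on the `module show` output later in this page) might look like:

```bash
$ echo $PATH
/usr/local/bin:/usr/bin:/bin
$ module load languages/gcc/9.1.0
$ echo $PATH   # the module's bin directory is now searched first
/mnt/storage/software/languages/gcc-9.1/bin:/usr/local/bin:/usr/bin:/bin
```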
10 | From then on, everything will be virtually the same as having the package installed in a standard location. 11 | 12 | ## Common modules operations 13 | 14 | Although the implementation of the modules package itself is different between BCp3 and BCp4, there should be very few user-facing differences. 15 | The commands listed below apply to both phases—and virtually any other modules implementation you may come across. 16 | 17 | If you try the examples in your terminal, keep in mind that the snippets below are from Phase 4. 18 | Therefore, you may see and need to use slightly different module names on other systems. 19 | 20 | ### Listing loaded modules 21 | 22 | To see what modules you _have loaded_, use the `list` (shorthand `li`) command: 23 | 24 | ```bash 25 | $ module list 26 | 27 | Currently Loaded Modules: 28 | 1) tools/git/2.18.0 29 | 30 | ``` 31 | 32 | ### Listing available modules 33 | 34 | Use `available` (shorthand `avail` or `av`) to see _all defined modules_: 35 | 36 | ```bash 37 | $ module available 38 | 39 | --------------------------------- /mnt/storage/easybuild/modules/local --------------------------------- 40 | apps/grace/5.1.25 languages/anaconda3/2019.07-3.7.3-biopython (D) 41 | apps/gromacs/5.1.4-plumed-mpi-intel languages/gcc/9.1.0 42 | apps/gromacs/5.1.4-mpi-intel languages/go/1.10 43 | apps/gromacs/2018-mpi-gpu-intel languages/intel/2016-u3-cuda-8.0 44 | apps/gromacs/2019.3-mpi-intel (D) languages/intel/2017.01 45 | # Many more lines not shown... 46 | ``` 47 | 48 | You can also specify a pattern to `avail`, which will filter available modules (but see the note below!): 49 | 50 | ```bash 51 | $ module av gcc 52 | 53 | -------------------------------------------------------- /mnt/storage/easybuild/modules/local -------------------------------------------------------- 54 | languages/gcc/9.1.0 libs/cuda/9.0-gcc-5.4.0-2.26 libs/cuda/10.0-gcc-5.4.0-2.26 (D) libs/gsl/2.5-gcc-5.5.0 55 | ``` 56 | 57 | Note that on most systems **the pattern is only matched at the beginning of the module name**. 58 | So, a module called `languages/gcc` will be filtered _out_ by the command above. 59 | You can avoid this issue by filtering modules yourself using `grep`, but note that `module` _outputs to standard error_ (see [How do I capture the module command output?](https://modules.readthedocs.io/en/latest/FAQ.html#how-do-i-capture-the-module-command-output) for an explanation why): 60 | 61 | ```bash 62 | $ module av |& grep -i gcc 63 | languages/gcc/9.1.0 64 | libs/cuda/9.0-gcc-5.4.0-2.26 65 | libs/cuda/10.0-gcc-5.4.0-2.26 (D) 66 | libs/gsl/2.5-gcc-5.5.0 67 | ATLAS/3.10.2-GCC-5.4.0-2.26-LAPACK-3.6.1acml/gcc/64/5.1.0 68 | fftw3/gcc/64/3.3.3 69 | gcc/4.7.0 70 | languages/R-3.4.4-ATLAS-gcc-7.1.0 71 | languages/gcc-7.1.0 72 | openmpi/gcc/64/2.1.1 73 | # Many lines omitted... 74 | ``` 75 | 76 | On BCp4 in particular, the modules are split into sections in such a way that modules intended for end-users are displayed at the top, under the `/mnt/storage/easybuild/modules/local` heading. 77 | You will not need to use modules outside this section for your coursework, so please **do not** use those unless you know _exactly_ what you are doing. 
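If the modules system is Lmod, as the output format above suggests, there is also a built-in search command that matches anywhere in a module's name, not just at the beginning (a sketch; check `module help` on your system to confirm it is available):

```bash
$ module spider gcc   # lists all modules whose names contain "gcc"
```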
78 | 
79 | ### Loading modules
80 | 
81 | To _load_ a module, use the `add` or `load` commands:
82 | 
83 | ```bash
84 | $ module list
85 | No modules loaded
86 | $ module load GCC/7.2.0-2.29
87 | $ module list
88 | 
89 | Currently Loaded Modules:
90 |   1) GCCcore/7.2.0
91 |   2) binutils/2.29-GCCcore-7.2.0
92 |   3) GCC/7.2.0-2.29
93 | ```
94 | 
95 | As shown above, loading a module may automatically load other modules marked as dependencies.
96 | 
97 | It is common practice for the version of a package to be listed as the last component of the module name.
98 | If you don't give a specific version, then the default one will be loaded.
99 | The default version is marked when you list modules with `avail`:
100 | 
101 | ```bash
102 | $ module av GCC
103 | --------------------------- /mnt/storage/easybuild/modules/all ---------------------------
104 | GCC/4.9.3-2.25
105 | GCC/5.4.0-2.26
106 | GCC/6.4.0-2.28
107 | GCC/7.2.0-2.29 (D)
108 | 
109 | Where:
110 | D: Default Module
111 | ```
112 | 
113 | ### Unloading modules
114 | 
115 | Use `rm` or `unload` to remove a previously loaded module:
116 | 
117 | ```bash
118 | $ module list
119 | 
120 | Currently Loaded Modules:
121 |   1) GCCcore/7.2.0
122 |   2) binutils/2.29-GCCcore-7.2.0
123 |   3) GCC/7.2.0-2.29
124 | $ module rm GCC/7.2.0-2.29
125 | $ module list
126 | 
127 | Currently Loaded Modules:
128 |   1) GCCcore/7.2.0
129 |   2) binutils/2.29-GCCcore-7.2.0
130 | ```
131 | 
132 | **Warning**: The `languages/intel` modules on BCp4 are _not_ set up to unload properly.
133 | If you try to remove them, it will appear as if everything worked fine—they will disappear from the list of loaded modules—but _the environment will not be cleaned as it should_.
134 | If you have loaded one of those modules and wish to remove it, you will need to log out of the system and log back in ([which gives you the default environment](#sessions-and-persisting-modules), before loading any modules).
135 | 
136 | You can use `purge` to _unload **all** loaded modules_:
137 | 
138 | ```bash
139 | $ module list
140 | 
141 | Currently Loaded Modules:
142 |   1) GCCcore/7.2.0
143 |   2) binutils/2.29-GCCcore-7.2.0
144 | $ module purge
145 | $ module list
146 | No modules loaded
147 | ```
148 | 
149 | Note that _`purge` will also remove modules loaded by default_, so you may find that certain functionality no longer works.
150 | Only use this if you're confident you understand all the modules in your environment, and keep in mind you can always log off and back on to ensure you are using the default environment.
151 | 
152 | ### Sessions and persisting modules
153 | 
154 | Any changes you make to your environment will only persist until you log out.
155 | When you log back in, you will find that you only have the default modules loaded.
156 | 
157 | If you want to add more modules to be loaded automatically when you log in, one way to do this is to configure your shell to run your required `module` commands.
158 | If you use bash, you can use your `.bashrc` file for this (more information on `.bashrc` in the [answer and comments here](https://unix.stackexchange.com/a/129144); other shells will have similar mechanisms, e.g. `.zshrc` or `config.fish`):
159 | 
160 | ```bash
161 | $ grep module .bashrc
162 | module load tools/git-2.18.0
163 | ```
164 | 
165 | We recommend you do _not_ add module commands to your shell's start-up script, because it may prevent you from getting a "clean" environment when you expect one.
166 | If you still decide to do it, _you_ are responsible for any issues you create, e.g.
with conflicting library versions.
167 | When you encounter unexplained bugs and when you collect performance data, it is always worth running with a clean environment.
168 | See the [note about how environment is preserved in compute jobs in Queueing Systems](3_Queueing_Systems.md#environment-modules-and-queueing-systems).
169 | 
170 | ## A worked example
171 | 
172 | Assume you want to use a more recent version of the GNU Compiler on Phase 4.
173 | When you first log in, you only have the default modules loaded:
174 | 
175 | ```bash
176 | $ module list
177 | No modules loaded
178 | 
179 | ```
180 | 
181 | If you examine the `gcc` command, you will notice that it points to the (old) version packaged with the operating system:
182 | 
183 | ```bash
184 | $ gcc -v
185 | # Some lines omitted
186 | gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)
187 | $ which gcc
188 | /usr/bin/gcc
189 | ```
190 | 
191 | You first need to find the module that you want:
192 | 
193 | ```bash
194 | $ module av |& grep -i gcc
195 | -------------------------------------------------------- /mnt/storage/easybuild/modules/local --------------------------------------------------------
196 | languages/gcc/7.2.0 languages/gcc/8.2.0 languages/gcc/9.1.0
197 | # Many lines omitted
198 | ```
199 | 
200 | Say you select version 9.1.0.
201 | Now you need to load it:
202 | 
203 | ```bash
204 | $ module load languages/gcc/9.1.0
205 | ```
206 | 
207 | You can now check that you are using the desired version:
208 | 
209 | ```bash
210 | $ module list
211 | 
212 | Currently Loaded Modules:
213 |   1) languages/gcc/9.1.0
214 | 
215 | $ which gcc
216 | /mnt/storage/software/languages/gcc-9.1/bin/gcc
217 | $ gcc -v
218 | # Some lines omitted
219 | gcc version 9.1.0 (GCC)
220 | ```
221 | 
222 | Loading the `languages/gcc/9.1.0` module has put the installation directory of the version 9.1.0 at the front of the system's path, so binaries and libraries are now searched here first, effectively _hiding_ the system version.
223 | If you are interested, you can look at everything the module does with the `show` command (the [modulefile reference](http://modules.sourceforge.net/man/modulefile.html) may come in handy):
224 | 
225 | ```
226 | $ module show languages/gcc/9.1.0
227 | -------------------------------------------------------------------
228 | help([[
229 | Description
230 | ===========
231 | The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Ada, Go, and D, as well as libraries for these languages.
232 | More information
233 | ================
234 |  - Homepage: https://gcc.gnu.org
235 | ]])
236 | whatis("The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Ada, Go, and D, as well as libraries for these languages")
237 | whatis("Homepage: https://gcc.gnu.org")
238 | prepend_path("PATH","/mnt/storage/software/languages/gcc-9.1/bin")
239 | prepend_path("LD_LIBRARY_PATH","/mnt/storage/software/languages/gcc-9.1/lib64")
240 | prepend_path("LD_LIBRARY_PATH","/mnt/storage/software/languages/gcc-9.1/lib")
241 | prepend_path("MANPATH","/mnt/storage/software/languages/gcc-9.1/share/man")
242 | -------------------------------------------------------------------
243 | ```
-------------------------------------------------------------------------------- /6_MPI.md: --------------------------------------------------------------------------------
1 | MPI
2 | ===
3 | 
4 | This section will deal with running MPI jobs on BlueCrystal.
5 | It will cover the `mpirun` launcher used to execute parallel jobs and how this interacts with the [queueing system](3_Queueing_Systems.md). 6 | It is **not** an MPI programming tutorial, and it requires you to already be familiar with MPI terminology, e.g. what is a _rank_ and how it relates to threads, processes, and compute nodes. 7 | 8 | [The last part of this section](#further-reference) lists some useful documentation links. 9 | 10 | ## MPI Implementations 11 | 12 | When compiling MPI programs, you will need to choose an MPI implementation. 13 | The most common choices are [Open MPI](https://www.open-mpi.org/), [MPICH](http://www.mpich.org/), and [Intel MPI](https://software.intel.com/en-us/mpi-library). 14 | The first two are open-source libraries which you can install on your own machine, and all three are available on BlueCrystal. 15 | While feature-wise they should be largely equivalent, some options may differ in name and performance might vary. 16 | You are encouraged to explore all the available options and discover any differences on your own. 17 | 18 | ### BCp4 19 | 20 | On Phase 4, use the Intel MPI. 21 | It is part of the compiler modules, e.g. `languages/intel/2020-u4`. 22 | 23 | ## Compiling MPI programs 24 | 25 | The choice of MPI library is (mostly) independent of the compiler choice. 26 | Therefore, you should be able to use (for example) the GNU compiler with any of the MPI libraries listed above. 27 | In practice, there are sometimes issues when you use a proprietary MPI implementation with compilers from other vendors. 28 | Although unlikely, this means that you _may_ encounter issues when using, for example, Intel MPI with GCC. 29 | However, this shouldn't deter you from exploring your options! 30 | 31 | MPI implementations generally provide a _compiler wrapper_, which is a command that calls the underlying compiler with the parameters necessary for the MPI code. 32 | The advantage of using this wrapper is that you _don't_ need to manually pass the compiler and linker flags for the library, and you can change to a different implementation without changing your build command. 33 | 34 | For the open-source libraries, commands are usually named as follows: 35 | 36 | | Language | Command | 37 | | -------- | -------- | 38 | | C | `mpicc` | 39 | | C++ | `mpicxx` | 40 | | Fortran | `mpif90` | 41 | 42 | If you use Intel software, the commands above will use Intel _MPI_ with the GNU compilers. 43 | If you want to use _both_ Intel MPI and Compilers, the commands are named by joining `mpi` with the normal Intel Compiler command: 44 | 45 | | Language | Command | 46 | | -------- | ---------- | 47 | | C | `mpiicc` | 48 | | C++ | `mpiicpc` | 49 | | Fortran | `mpiifort` | 50 | 51 | If in doubt what the right command is called, first load the modules for your desired compiler and MPI library, then use your shell's autocomplete to list the available options: 52 | 53 | ``` 54 | $ mpi 55 | mpicc mpiicc mpiicpc mpigcc 56 | mpigxx mpicxx mpiexec mpivars.sh 57 | mpif77 mpif90 mpiifort 58 | ``` 59 | 60 | Then, run with `-v` to check what compiler and library will be used: 61 | 62 | ``` 63 | $ mpicc -v 64 | mpigcc for the Intel(R) MPI Library 2018 Update 3 for Linux* 65 | Copyright(C) 2003-2018, Intel Corporation. All rights reserved. 66 | Using built-in specs. 
67 | COLLECT_GCC=gcc
68 | COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
69 | Target: x86_64-redhat-linux
70 | Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
71 | Thread model: posix
72 | gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)
73 | $ mpiicc -v
74 | mpiicc for the Intel(R) MPI Library 2018 Update 3 for Linux*
75 | Copyright(C) 2003-2018, Intel Corporation. All rights reserved.
76 | icc version 18.0.3 (gcc version 4.8.5 compatibility)
77 | ```
78 | 
79 | Once you know which wrapper to use, just replace the regular compiler command:
80 | 
81 | ```bash
82 | # Without MPI
83 | $ icc -o test-nompi test.c
84 | 
85 | # With MPI
86 | $ mpiicc -o test-mpi test.c
87 | ```
88 | 
89 | ## Running MPI jobs
90 | 
91 | MPI applications generally run several processes, which need to be orchestrated, e.g. started, synchronised, and terminated.
92 | This is done by using an MPI _launcher_, usually called `mpirun` or `mpiexec` (synonyms), which creates as many instances of your application as you instruct it and manages the processes over their lifetime.
93 | 
94 | The simplest MPI launch specifies only the number of ranks (`-np`) and the command to run:
95 | 
96 | ```bash
97 | $ mpirun -np 4 ./test-mpi
98 | ```
99 | 
100 | However, there are many more options, and you should read about them in `mpirun --help` and in the online documentation: [Open MPI](https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php), [MPICH](http://www.mpich.org/static/docs/latest/www/www1/mpiexec.html), [Intel MPI](https://software.intel.com/en-us/mpi-developer-reference-linux-mpirun).
101 | Make sure that you look at the right documentation for the version of the library you are using, and keep in mind that **some options may differ between implementations**.
102 | 
103 | **Important**: In order to run more than a single MPI rank, you _need to_ use a launcher.
104 | If you don't—and just run your binary directly—only a single instance will run, so any MPI code will be redundant.
105 | 
106 | ### SLURM and BCp4
107 | 
108 | SLURM provides its own parallel launcher, called `srun`.
109 | The degree to which this integrates with the available hardware and software varies, but in general you can replace `mpirun` with `srun` and expect everything to work fine.
110 | Some SLURM systems, e.g. CS-series Crays, don't provide `mpirun` and _require_ you to use `srun`.
111 | 
112 | The advantage of using `srun` is that it automatically reads your run configuration from your job script, so you usually don't need to specify the number of ranks. You do however need to include the `--mpi=pmi2` argument to ensure we are using the correct version of MPI.
113 | The following example script, which only uses `sbatch` arguments and passes just the binary to `srun`, runs 8 MPI processes evenly split between two nodes: 114 | 115 | ```bash 116 | #SBATCH --nodes 2 117 | #SBATCH --ntasks-per-node 4 118 | 119 | srun --mpi=pmi2 ./test-mpi 120 | ``` 121 | 122 | On BCp4, you can use both `mpirun` and `srun`. 123 | We recommend using `srun`, because your parallel configuration will be automatically read from your job script, so you won't have to repeat it in `mpirun` arguments. 124 | 125 | **NOTE**: As of Jan. 2024, `mpirun` does not execute jobs as expected using previous methods (described above). Please **only** use `srun` for the time being. 126 | 127 | ### Tagging output 128 | 129 | One useful setting for debugging is tagging each line of output with the number of the rank that produced it. 130 | Since the order in which print statements will be executed across several ranks is not fixed, this will help you identify which rank printed each line: 131 | 132 | ``` 133 | # Without tags 134 | $ srun --mpi=pmi2 ./hi 135 | Hello from rank 2, on node31-031. 136 | Hello from rank 0, on node31-031. 137 | Hello from rank 3, on node31-031. 138 | Hello from rank 1, on node31-031. 139 | 140 | # With tags 141 | $ srun --mpi=pmi2 --label ./hi 142 | 2: Hello from rank 2, on node31-031. 143 | 0: Hello from rank 0, on node31-031. 144 | 3: Hello from rank 3, on node31-031. 145 | 1: Hello from rank 1, on node31-031. 146 | ``` 147 | 148 | With `mpirun`, the equivalent flag is `--tag-output`. This option works with Open MPI; with Intel MPI, use `-l`. 149 | Other libraries likely have similar options, but they might have different names. 150 | 151 | ### Binding processes 152 | 153 | By default, `mpirun` will allow you to launch one rank per CPU core available. 154 | However, sometimes you may want to launch _fewer_ ranks than you have cores, e.g. because each rank might run several threads internally, or _more_ ranks per core, particularly if your processor supports [simultaneous multithreading](https://en.wikipedia.org/wiki/Simultaneous_multithreading) and you want to run a rank per hardware thread. 155 | The process of assigning processes (or threads) to hardware resources is commonly referred to as _binding_. 156 | 157 | Most MPI implementations offer some support for binding ranks as part of the launcher. 158 | For example, you can restrict each rank to run on a specific core (as opposed to any core, which is the default): 159 | 160 | ```bash 161 | # Without binding 162 | $ mpirun -np 4 ./hi 163 | Hello from rank 2, on node31-031. (core affinity = 0-15) 164 | Hello from rank 0, on node31-031. (core affinity = 0-15) 165 | Hello from rank 3, on node31-031. (core affinity = 0-15) 166 | Hello from rank 1, on node31-031. (core affinity = 0-15) 167 | 168 | # With binding 169 | $ mpirun -np 4 -bind-to-core ./hi 170 | Hello from rank 0, on node31-031. (core affinity = 0) 171 | Hello from rank 1, on node31-031. (core affinity = 1) 172 | Hello from rank 2, on node31-031. (core affinity = 2) 173 | Hello from rank 3, on node31-031. (core affinity = 3) 174 | ``` 175 | 176 | Another example is splitting the total number of processes between several nodes: 177 | 178 | ```bash 179 | # Without mapping (all on first node) 180 | $ mpirun -np 4 ./hi 181 | Hello from rank 0, on compute091. 182 | Hello from rank 1, on compute091. 183 | Hello from rank 2, on compute091. 184 | Hello from rank 3, on compute091. 
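# (All four ranks above landed on the same node, which is the default placement.)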
185 | 
186 | # With mapping (split across 2 nodes)
187 | $ mpirun -np 4 -npernode 2 ./hi
188 | Hello from rank 0, on compute091.
189 | Hello from rank 1, on compute091.
190 | Hello from rank 2, on compute092.
191 | Hello from rank 3, on compute092.
192 | ```
193 | 
194 | There are many more options available, and they are all explained in the manuals.
195 | As above, the options may differ slightly between implementations.
196 | 
197 | ## MPI examples
198 | 
199 | We have provided a [set of working MPI examples](https://github.com/UoB-HPC/hpc-course-examples/tree/master/mpi), ranging from a simple ["Hello World" MPI program](https://github.com/UoB-HPC/hpc-course-examples/tree/master/mpi/example1) to an implementation of [the "halo exchange" message passing pattern](https://github.com/UoB-HPC/hpc-course-examples/tree/master/mpi/example5) you need for the MPI assignment.
200 | 
201 | ## Further reference
202 | 
203 | Here are some handy links to MPI docs:
204 | 
205 | - Open MPI
206 |   - [v3.1](https://www.open-mpi.org/doc/v3.1/) (a recent, supported version)
207 |   - [v2.1](https://www.open-mpi.org/doc/v2.1/) (on BCp3)
208 |   - [v1.6](https://www.open-mpi.org/doc/v1.6/) (on BCp3)
209 | - [Intel MPI guides](https://software.intel.com/en-us/mpi-developer-guide-linux)
210 | - MPICH
211 |   - [Installation guide](http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1-installguide.pdf)
212 |   - [User guide](http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1-userguide.pdf)
213 |   - [Latest manpages](http://www.mpich.org/static/docs/latest/www/)
214 |   - [Older manpages](http://www.mpich.org/documentation/manpages/)
215 | 
216 | You can find some MPI programming tutorials [on the MPICH guides page](http://www.mpich.org/documentation/guides/).
217 | The MPI standard spec is also [available online](https://www.mpi-forum.org/docs/).
218 | 
--------------------------------------------------------------------------------
/3_Queueing_Systems.md:
--------------------------------------------------------------------------------
1 | # Queueing Systems
2 | 
3 | **Important**: Before reading this section, you need to be familiar with the concepts of _login nodes_ and _compute nodes_.
4 | In a typical supercomputer system, when you connect to the machine you will get to a _login node_, which is shared with other users connected to the system.
5 | This is where you usually compile your application and set up your environment.
6 | In contrast, running applications happens on _compute nodes_, which are allocated and freed as needed, and are generally dedicated to a single user for the duration of their application's execution.
7 | 
8 | Supercomputers house a large amount of resources, and it is common for many users to be running their applications at the same time.
9 | However, unlike in a desktop scenario, application performance is important—often critical—so it is common for each user to run on a dedicated part of the system.
10 | In order to manage the allocation of resources to users, supercomputers run _workload managers_ (WLMs) that often implement a _job queue_.
11 | 
12 | The typical workflow with a WLM can be summarised as follows:
13 | 1. The user defines their _job_. This includes the application to be run and the amount of hardware resources that it will need, e.g. number of processor cores, amount of RAM, and so on. This is done on a login node.
14 | 2. The job is submitted to the system's queue. It is not unusual to have tens or hundreds of jobs in the queue at a given time.
15 | 3. 
The WLM looks at available resources together with queued jobs and any higher/lower priorities applied to them in order to decide which job(s) will run next.
16 | 4. When the requested resources are available and the job starts running, those resources (on compute nodes) become allocated to it, and no other job will be able to use them until the current one is stopped.
17 | 5. The job is run according to its definition. It will stop when it either completes, crashes, or exceeds the requested resources, e.g. it has used more CPU time than the job definition requested.
18 | 6. When the job is done, its resources are freed so they can be used for other jobs.
19 | 
20 | BlueCrystal uses **SLURM** on Phase 4.
21 | 
22 | ## BCp4 – SLURM
23 | 
24 | ### Listing jobs
25 | 
26 | To see _all_ the jobs queued on the system, type `squeue`:
27 | 
28 | ```bash
29 | $ squeue
30 |   JOBID PARTITION     NAME     USER ST        TIME  NODES NODELIST(REASON)
31 | 1340878       gpu  2ycu2u1  ck14921 PD        0:00      1 (Resources)
32 | 1340605       gpu   run_tf   yl1220 PD        0:00      1 (Priority)
33 | 1340938       gpu run_atte  hd12584 PD        0:00      1 (Priority)
34 | # Many lines omitted...
35 | 1330979       cpu      m12  wk14463  R 11-02:07:08      1 compute480
36 | 1329938       cpu texit000    ggpoh  R 12-14:19:34      1 compute329
37 | 1328086       cpu      m22  wk14463  R 13-18:04:55      1 compute503
38 | ```
39 | 
40 | The state is usually either running (`R`) or pending (`PD`).
41 | When a job is running, you can see which compute nodes it is using in the rightmost column.
42 | When it is pending, the reason why it hasn't started yet is shown.
43 | 
44 | To see a _single user's jobs_, use `squeue -u <username>`:
45 | 
46 | ```bash
47 | $ squeue -u $USER
48 |   JOBID PARTITION     NAME     USER ST        TIME  NODES NODELIST(REASON)
49 | 1340971       cpu     bash  ab12345  R        0:05      1 compute382
50 | ```
51 | 
52 | You can also filter jobs by partition (`-p`) or account (`-A`):
53 | 
54 | ```bash
55 | $ squeue -p teach_cpu
56 |   JOBID PARTITION     NAME     USER ST        TIME  NODES NODELIST(REASON)
57 | 2481514 teach_cpu     bash  ab12345  R        0:06      1 compute084
58 | 
59 | $ squeue -A COMS031424
60 |   JOBID PARTITION     NAME     USER ST        TIME  NODES NODELIST(REASON)
61 | 2481514 teach_cpu     bash  ab12345  R        0:15      1 compute084
62 | ```
63 | 
64 | You can see all the details about a job, even after it has completed, using `scontrol`:
65 | 
66 | ```bash
67 | $ scontrol show -d job 1340971
68 | JobId=1340971 JobName=bash
69 |    UserId=ab12345(999999) GroupId=mven(16621) MCS_label=N/A
70 |    Priority=988 Nice=0 Account=default QOS=normal WCKey=*cosc17r
71 |    JobState=COMPLETED Reason=None Dependency=(null)
72 |    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
73 |    DerivedExitCode=0:0
74 |    RunTime=00:05:22 TimeLimit=2-00:00:00 TimeMin=N/A
75 |    SubmitTime=2018-09-25T12:22:37 EligibleTime=2018-09-25T12:22:37
76 |    StartTime=2018-09-25T12:22:37 EndTime=2018-09-25T12:27:59 Deadline=N/A
77 |    PreemptTime=None SuspendTime=None SecsPreSuspend=0
78 |    Partition=teach_cpu AllocNode:Sid=bc4login1:12671
79 |    ReqNodeList=(null) ExcNodeList=(null)
80 |    NodeList=compute382
81 |    BatchHost=compute382
82 |    NumNodes=1 NumCPUs=28 NumTasks=28 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
83 |    TRES=cpu=28,mem=28000M,node=1
84 |    Socks/Node=* NtasksPerN:B:S:C=28:0:*:* CoreSpec=*
85 |      Nodes=compute382 CPU_IDs=0-27 Mem=28000
86 |    MinCPUsNode=28 MinMemoryCPU=1000M MinTmpDiskNode=0
87 |    Features=(null) Gres=(null) Reservation=(null)
88 |    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
89 |    Command=bash
90 |    WorkDir=/mnt/storage/home/ab12345
91 |    Power=
92 | ```
93 | 
94 | In SLURM, resources are organised into _partitions_ (as opposed to PBS
queues), which can be listed with `sinfo -s`:
95 | 
96 | ```bash
97 | $ sinfo -s
98 | PARTITION     AVAIL  TIMELIMIT  NODES(A/I/O/T)  NODELIST
99 | cpu*             up 14-00:00:0     445/0/0/445  compute[068-176,178-241,246-260,262-320,322-519]
100 | hmem             up 14-00:00:0         6/2/0/8  highmem[10-17]
101 | gpu              up 7-00:00:00       21/5/0/26  gpu[06-31]
102 | gpu_veryshort    up    6:00:00         0/1/0/1  gpu32
103 | test             up    1:00:00     434/3/0/437  compute[080-241,246-520]
104 | veryshort        up    6:00:00     440/5/0/445  compute[080-241,246-520],highmem[10-17]
105 | dcv              up   infinite         0/0/1/1  bc4vis1
106 | teach_cpu        up    3:00:00         0/9/0/9  compute[242-245,521-525]
107 | teach_gpu        up    3:00:00         0/5/0/5  gpu[01-05]
108 | # Some lines omitted...
109 | ```
110 | 
111 | By state, nodes can be allocated (`A`), idle (`I`), or other (`O`), with `T` representing the total number of nodes in the partition. Note that partitions are not necessarily disjoint.
112 | 
113 | You can restrict the query to a single partition using `-p`:
114 | 
115 | ```bash
116 | $ sinfo -p hmem
117 | PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
118 | hmem         up 14-00:00:0      8  alloc highmem[10-17]
119 | ```
120 | 
121 | More details and usage examples for these commands are in the manpages: `man squeue`, `man sinfo`, `man scontrol`.
122 | 
123 | ### Deleting jobs
124 | 
125 | Use `scancel` to remove jobs from the queue or stop in-progress ones:
126 | 
127 | ```bash
128 | $ scancel 1340971
129 | ```
130 | 
131 | ### Submitting jobs
132 | 
133 | Unlike PBS, SLURM distinguishes between two ways to run jobs:
134 | 
135 | - You can run your application directly using `srun`. Your binary is executed as-is, and resources will be allocated and freed automatically. You cannot do this with PBS.
136 | - You can use a job script, which is submitted using `sbatch`. This is similar to using `qsub`.
137 | 
138 | Both approaches take the same arguments, and the syntax is as follows:
139 | 
140 | ```bash
141 | $ srun [options] /path/to/binary
142 | $ sbatch [options] /path/to/script
143 | ```
144 | 
145 | The following table lists a few common job control options:
146 | 
147 | | SLURM argument  | Meaning             |
148 | | --------------- | ------------------- |
149 | | `--partition=<name>` | The partition to run on |
150 | | `--nodes=<n>` | Request `n` nodes |
151 | | `--ntasks-per-node=<c>` | Request `c` tasks to be run on each node (often related to number of cores required) |
152 | | `--time=<t>` | Specifies that your job should be allowed to run for at most `t` (time). You should specify `t` as `hh:mm:ss` |
153 | | `--job-name=<name>` | Sets the job's name, so you can easily identify it later |
154 | | `--output=<file>` | Sets a name for the file where the job's output will be saved. If you don't set this, an automatically generated name will be used |
155 | | `--exclusive` | Does not allow other jobs to be scheduled on your allocated compute nodes, even if you don't fully utilise their resources |
156 | | `--gres=gpu:<g>` | Request `g` GPUs. GPUs are only present in nodes in the `gpu` partition, where each node has 2 GPUs |
157 | | `--account=<account>` | SLURM allows users to be organised into groups (_accounts_) that share resources. A user can be part of several groups simultaneously, so `-A` is used to pick which account to use |
158 | | `--reservation=<name>` | Nodes can be reserved for subsets of users. If you are part of a reservation, specify its name to use it |
159 | 
160 | If you compare this to the equivalent PBS table above, note that `-j oe` and `-V` are implied on SLURM.
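To illustrate, several of these options can be combined directly on the command line (the values here are only an example):

```bash
$ sbatch --job-name=test --partition=teach_cpu --account=COMS031424 \
         --nodes=1 --ntasks-per-node=28 --time=00:10:00 my.job
```

Options passed on the command line take precedence over the matching `#SBATCH` lines in the script itself.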
161 | 
162 | **Important**: If you are taking the COMS30053 unit in 2024, please ensure you are using the `teach_cpu` partition along with account code `COMS031424` throughout the course.
163 | 
164 | #### Job files
165 | 
166 | SLURM job files work virtually the same way as [PBS job files](#job-files) (please read this section before continuing if you haven't used job files before).
167 | The notable differences are:
168 | 
169 | - Job parameters are prefixed with `#SBATCH` in the script.
170 | - You need to use SLURM arguments, and the script is submitted using `sbatch`.
171 | 
172 | Here is the same example script shown above, but using SLURM parameters instead:
173 | 
174 | ```bash
175 | $ cat my.job
176 | #!/bin/bash
177 | #SBATCH --job-name=LBM
178 | #SBATCH --output=lbm.out
179 | #SBATCH --partition=teach_cpu
180 | #SBATCH --account=COMS031424
181 | #SBATCH --nodes=1
182 | #SBATCH --ntasks-per-node=14
183 | #SBATCH --time=00:05:00
184 | 
185 | $HOME/work/d2q9-bgk
186 | ```
187 | 
188 | #### Interactive jobs
189 | 
190 | Use `srun --pty bash` to run an interactive job:
191 | 
192 | ```bash
193 | [ab12345@bc4login1 ~]$ srun -N1 --tasks-per-node 28 --pty bash
194 | [ab12345@compute084 ~]$ echo "Now running on a compute node."
195 | Now running on a compute node.
196 | ```
197 | 
198 | Please note that using an interactive session will keep the node(s) requested allocated _for the whole session_, not just when you are actively running commands.
199 | Since all the resources are shared with the other users on the system, **only use an interactive job for tasks that you cannot do on a login node or through job scripts, and give up your allocation as soon as you have finished**.
200 | 
201 | ### More resources
202 | 
203 | You can find documentation both in the manpages, e.g. `man srun`, and [online](https://slurm.schedmd.com/documentation.html).
204 | There is also an [online version of the manpages](https://slurm.schedmd.com/man_index.html).
205 | However, please note that web-based documentation may target a different version than what is used on BCp4, and not all features supported by SLURM may be enabled and available on BlueCrystal.
206 | If in doubt, always check the manpage on the system.
207 | 
208 | You may also find the [SLURM command summary sheet](https://slurm.schedmd.com/pdfs/summary.pdf) useful, and the ACRC have [online documentation for BCp4](https://www.acrc.bris.ac.uk/protected/bc4-docs/).
209 | 
210 | ## Environment modules and queueing systems
211 | 
212 | One important difference between PBS and SLURM is how the environment is preserved when you run a job:
213 | 
214 | - For **PBS**, _the job runs in a clean environment_, so any modules you load or variables you set in your interactive session on the login node will not be automatically forwarded to compute jobs. However, your shell start-up scripts will still be executed, e.g. commands in `.bashrc` will be run.
215 | - For **SLURM**, _jobs start in the same environment that you had on the login node_, so you don't _need to_ load all the required modules in your job script.
216 | 
217 | Regardless of which system you're using, it is good practice to only have loaded those modules that are required for your job.
218 | This avoids issues where your job is affected by modules you may have loaded for testing (or other purposes) and forgot about.
219 | Therefore, consider starting your job script with a `module purge`, followed by `load`s for _only the required modules_; see the sketch below.
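For example, a job script preamble following this advice might look like the following (the module shown is just an example):

```bash
#!/bin/bash
#SBATCH --job-name=LBM
#SBATCH --partition=teach_cpu
#SBATCH --nodes=1
#SBATCH --time=00:05:00

# Start from a clean environment, then load only what this job needs.
module purge
module load languages/intel/2020-u4

$HOME/work/d2q9-bgk
```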
220 | 
--------------------------------------------------------------------------------
/1_Connecting_to_BlueCrystal.md:
--------------------------------------------------------------------------------
1 | Connecting to BlueCrystal
2 | =========================
3 | 
4 | Throughout the HPC course, you will use [BlueCrystal](https://www.acrc.bris.ac.uk/), the University of Bristol's supercomputer.
5 | This part of the tutorial shows how to connect to and accomplish basic tasks on BlueCrystal.
6 | 
7 | ## Before you start
8 | 
9 | You will connect to BlueCrystal from your shell, over SSH. Make sure you have read [the Prerequisites section](0_Prerequisites.md) and have set up your software accordingly.
10 | If you are using a lab machine, everything is already set up.
11 | 
12 | In order to access BlueCrystal, you need to have an account.
13 | If you don't have an account yet and haven't received any instructions, e.g. via email, on how to obtain one, you can find them on the unit's BlackBoard page.
14 | If you have any issues signing up, please ask a lab helper.
15 | 
16 | In the sections below, shell commands are prefixed with a `$` symbol.
17 | This is to indicate that you would type them at a shell prompt and _you do not need to type the `$`_.
18 | In the following example, `echo hello` is the command typed into the shell, and `hello` (on the following line) is the output of the command.
19 | 
20 | ```bash
21 | $ echo hello
22 | hello
23 | ```
24 | 
25 | Where a username is required, we either use the placeholder `<username>` or the example `ab12345`.
26 | **Make sure you use _your own_ username in your commands and configuration files**.
27 | 
28 | BlueCrystal is split into several _phases_, which are system parts that have been added over time to keep up with hardware advances.
29 | The phases are independent and differ slightly in their configuration, so it's important that you don't confuse them.
30 | For 2024 we will use [BlueCrystal Phase 4](https://www.acrc.bris.ac.uk/acrc/phase4.htm) (BCp4).
31 | 
32 | ## Connecting
33 | 
34 | BlueCrystal is only accessible from inside the University's network.
35 | If you require access from outside, e.g. from your home network, see [Connecting from outside the University](#connecting-from-outside-the-university).
36 | 
37 | From a laptop connected to eduroam, a lab machine, or a computer otherwise attached to the University's network, run the following command to connect to BCp4, replacing `<username>` with your UoB username:
38 | 
39 | ```
40 | $ ssh <username>@bc4login.acrc.bris.ac.uk
41 | ```
42 | 
43 | After you type your password, you should get a prompt showing your username and the system's hostname, for example:
44 | 
45 | ```
46 | [ab12345@bc4login2 ~]$
47 | ```
48 | 
49 | **Note**: There are _five_ login nodes on BCp4: `bc4login[1-5]`.
50 | You will not always get assigned to the same one, for load-balancing reasons, but all the nodes are the same and share the file system.
51 | This means that when you log in, the hostname you see may end in a different digit than it does in the examples below—as long as it starts with `bc4login`, you're on the right system!
52 | 
53 | 
59 | 
60 | Your BlueCrystal Phase 4 password is the same as your university password.
61 | If you want to change the provided password, simply run `passwd` _on the system for which you want to change your password_ and follow the instructions.
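Coming back to the note about login nodes: if you are ever unsure which one you have landed on, `hostname` will tell you (the output shown is illustrative):

```bash
[ab12345@bc4login2 ~]$ hostname
bc4login2.acrc.bris.ac.uk
```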
62 | 
63 | 
66 | 
67 | ### Running graphical programs
68 | 
69 | If you want to run a program that has a GUI, you need to enable X forwarding:
70 | 
71 | ```
72 | $ ssh -X <username>@bc4login.acrc.bris.ac.uk
73 | ```
74 | 
75 | You can test it works by running a program with a graphical interface, for example:
76 | 
77 | ```bash
78 | $ xeyes
79 | ```
80 | 
81 | You should see a small window with a pair of eyes (running on BlueCrystal) on your screen.
82 | If you get a `Can't open display` error, then you have not used `-X`.
83 | You will need to log out and log back in using the forwarding option.
84 | If you get a warning message about failing to enable _untrusted_ SSH X forwarding when you log in, or if you _can_ start GUI programs but parts of the interface are missing, try logging out and using the trusted forwarding option instead:
85 | 
86 | ```
87 | $ ssh -Y <username>@bc4login.acrc.bris.ac.uk
88 | ```
89 | 
90 | On Windows machines, if you get an error which looks something like this:
91 | 
92 | ```
93 | Warning: untrusted X11 forwarding setup failed: xauth key data not generated
94 | ```
95 | 
96 | then you aren't running X (the graphical windowing subsystem) on your local machine. To solve this problem, run Xming on your Windows machine, and enable X11 forwarding for whichever shell tool you're using (PuTTY, etc.). For more details on how to do this, [see this helpful IT services webpage](https://www.bristol.ac.uk//it-services/locations/fits/science/putty_xming).
97 | 
98 | On macOS, you will need to install [XQuartz](https://www.xquartz.org/) to enable running graphical programs over SSH.
99 | 
100 | 
101 | ### Passwordless SSH access
102 | 
103 | You can simplify your login process by setting up authentication using an SSH key instead of your password.
104 | The key is transmitted automatically, saving you from having to type your password for every connection to BlueCrystal.
105 | Note that the content of this section applies to _any remote server running Linux_, not just BlueCrystal.
106 | 
107 | If you don't already have an SSH key, or if you want to use a new one for BlueCrystal, we will generate one now.
108 | **On your local machine**, go into the `.ssh` directory in your home folder (creating it if it doesn't exist) and follow the instructions to create your key:
109 | 
110 | ```
111 | $ mkdir ~/.ssh  # if it doesn't exist
112 | $ cd ~/.ssh
113 | $ ssh-keygen -t rsa -b 4096
114 | Generating public/private rsa key pair.
115 | Enter file in which to save the key (/home/andrei/.ssh/id_rsa): uob
116 | Enter passphrase (empty for no passphrase):
117 | Enter same passphrase again:
118 | ```
119 | 
120 | Although `rsa` should be the default type, we specify it for completeness.
121 | We then ask for a 4096-bit key, which should be a good size to use in 2024.
122 | You can give your key file any name when prompted to do so (`uob` is used in this example).
123 | 
124 | You can also set a password for your key, if you choose to.
125 | If you do, then the key will be encrypted before being stored on your disk and you will have to enter the password to unlock it before it can be used.
126 | If you don't set a password, no interaction will be required, but keep in mind that this means anyone can take your key and use it as if they were you!
127 | There are ways to set up your key so that it only needs to be unlocked once per session, i.e.
until you log off, but there are subtle security issues that could come up; see [this Stack Exchange answer](https://unix.stackexchange.com/a/90869) for an in-depth look at your options, and make sure you understand the consequences of your actions!
128 | 
129 | Once you have your SSH key, observe that it is made up of two parts, a private key and a public key:
130 | 
131 | ```bash
132 | $ ls uob*
133 | uob  uob.pub
134 | ```
135 | 
136 | The private key (`uob`) **never leaves your machine**.
137 | The public key (`uob.pub`) needs to be uploaded on any server where you want to use it for authentication.
138 | To do this, first make sure that you have a `.ssh` folder on BlueCrystal, and an `authorized_keys` file inside it with `0644` permissions:
139 | 
140 | ```bash
141 | $ ssh ab12345@bc4login.acrc.bris.ac.uk
142 | [ab12345@bc4login2 ~]$ mkdir .ssh
143 | [ab12345@bc4login2 ~]$ touch .ssh/authorized_keys
144 | [ab12345@bc4login2 ~]$ chmod 644 .ssh/authorized_keys
145 | ```
146 | 
147 | Then, coming back to your local machine, upload your public key to the `.ssh` folder on BlueCrystal (make sure you don't upload your private key by mistake!) and add it to `authorized_keys`:
148 | 
149 | ```bash
150 | $ scp uob.pub ab12345@bc4login.acrc.bris.ac.uk:.ssh/
151 | $ ssh ab12345@bc4login.acrc.bris.ac.uk
152 | [ab12345@bc4login2 ~]$ cat .ssh/uob.pub >> .ssh/authorized_keys
153 | ```
154 | 
155 | You can also do all this in a single command _from your own machine_:
156 | 
157 | ```bash
158 | $ cat ~/.ssh/uob.pub | ssh ab12345@bc4login.acrc.bris.ac.uk 'cat >> .ssh/authorized_keys'
159 | ```
160 | 
161 | **Note**: Make sure you use **two** angle brackets (`>>`), **not one** (`>`), since you want to _append_ to `authorized_keys`, not _overwrite_ it.
162 | If you accidentally overwrite this file, your jobs will fail to create the output file and it will appear as if there is no output.
163 | To fix this, read the answer under [_The queueing system_ in the FAQ](https://github.com/UoB-HPC/hpc-course-getting-started/blob/master/FAQ.md#the-queueing-system).
164 | 
165 | Finally, on your local machine, configure SSH to use the key.
166 | To do this, open the `~/.ssh/config` file (creating it if it does not exist) and add a configuration for BlueCrystal:
167 | 
168 | ```
169 | Host bcp4
170 |     HostName bc4login.acrc.bris.ac.uk
171 |     User ab12345
172 |     IdentityFile ~/.ssh/uob
173 | ```
174 | 
175 | The `Host` alias (`bcp4` in this case) can be set to anything you want and will be used as an identifier for this connection.
176 | The rest of the parameters should be self-explanatory.
177 | You should now be able to connect to BlueCrystal easily from your local machine:
178 | 
179 | ```bash
180 | $ ssh bcp4
181 | [ab12345@bc4login2 ~]$
182 | ```
183 | 
184 | **Note**: There are several schools of thought regarding how many SSH keys you should be using (see [this Stack Exchange discussion](https://security.stackexchange.com/questions/40050/best-practice-separate-ssh-key-per-host-and-user-vs-one-ssh-key-for-all-hos) for some revealing insight).
185 | If you have followed this tutorial to create a key specifically for BlueCrystal (and potentially other UoB services), then you should add the following line at the very top of your `~/.ssh/config` to avoid [your machine offering any other SSH keys you might have](https://superuser.com/a/187790):
186 | 
187 | ```
188 | IdentitiesOnly yes
189 | ```
190 | 
191 | ### Connecting from outside the University
192 | 
193 | If you want to connect to BlueCrystal from outside the University, you first need to connect to the University's network.
194 | 
195 | Your first option is to use [the University's VPN](https://www.bris.ac.uk/it-services/advice/homeusers/uobonly/uobvpn/).
196 | Once you set this up and connect to it, your traffic will be tunnelled to a server on the University's network.
197 | From there you can access BlueCrystal (and any other University resources) as if you were connected to eduroam.
198 | **WARNING**: While connected to the VPN, _all your traffic will go via the University_, so make sure to disable it when you don't need it.
199 | 
200 | The other option is to connect to seis, a server in the University that exists specifically to enable proxy access to other systems.
201 | This server will not offer you an environment where you can work, but from it you can connect to BlueCrystal.
202 | To connect to seis, use your UoB username and password:
203 | 
204 | ```
205 | $ ssh <username>@seis.bris.ac.uk
206 | ```
207 | 
208 | If you have [set up access using an SSH key](#passwordless-ssh-access), then you can take it a step further and automate connecting to seis and from there to BlueCrystal.
209 | First, set up your SSH key on seis following the same procedure you used for BlueCrystal and confirm it works, i.e. that you can connect without typing your password.
210 | Then, add the following `ProxyCommand` line to your SSH configuration for BlueCrystal (the example assumes that you have used the alias `seis`):
211 | 
212 | ```
213 | ProxyCommand ssh -q -W %h:%p seis
214 | ```
215 | 
216 | You can find more details about SSH proxy jumps [on WikiBooks](https://en.wikibooks.org/wiki/OpenSSH%2FCookbook%2FProxies_and_Jump_Hosts#Passing_Through_One_or_More_Gateways_Using_ProxyJump).
217 | In particular, if you have a recent version of OpenSSH, then you can use the `ProxyJump` directive.
218 | 
219 | If you have followed all the steps so far, your SSH configuration file should contain the following (but using your own username):
220 | 
221 | ```
222 | IdentitiesOnly yes
223 | 
224 | Host seis
225 |     HostName seis.bris.ac.uk
226 |     User ab12345
227 |     IdentityFile ~/.ssh/uob
228 | 
229 | Host bcp4
230 |     HostName bc4login.acrc.bris.ac.uk
231 |     User ab12345
232 |     IdentityFile ~/.ssh/uob
233 |     ProxyCommand ssh -q -W %h:%p seis
234 | ```
235 | 
236 | If you want to [use GUI applications](#running-graphical-programs) when connecting with a proxy jump, you will need to use `-X` (or `-Y`) for _all_ the proxy connections.
237 | If you use the `ProxyCommand` configuration, just add the flag to the `ssh` command line.
238 | 
239 | ## Transferring files
240 | 
241 | The easiest way to transfer files to BlueCrystal is through SFTP, for which we will use the `scp` command.
242 | `scp` works just like `cp`, in that its first argument is the source, the second one is the destination, and it will _not_ copy directories recursively unless given the `-r` flag.
243 | In addition, you can specify a remote host for either the source or the destination (or even both, but there are [some subtleties](https://superuser.com/a/686527)) using the colon syntax:
244 | 
245 | ```bash
246 | $ scp bcp4:hello.c test.c  #1
247 | $ scp hello.c bcp4:test.c  #2
248 | ```
249 | 
250 | In the example above, the first command _downloads_ `hello.c` from BlueCrystal into a local file called `test.c`.
251 | The second command does the reverse process, _uploading_ the local `hello.c` file to a remote file called `test.c`.
252 | Both commands assume you have followed the [SSH key setup above](#passwordless-ssh-access), which we recommend; if you haven't, then you will need to use the full `username@host` syntax and enter your password:
253 | 
254 | ```bash
255 | $ scp hello.c ab12345@bc4login.acrc.bris.ac.uk:test.c
256 | ```
257 | 
258 | If you want to use a GUI for your file transfers, you can use:
259 | 
260 | - Your native file explorer on Linux
261 | - [WinSCP](https://winscp.net/eng/index.php) on Windows
262 | - [Cyberduck](https://cyberduck.io/) on Mac
263 | 
264 | ## Editing files remotely
265 | 
266 | Depending on your editor of choice, you will either be editing files on your own machine and transferring them using, for example, SSHFS, or editing directly on BlueCrystal.
267 | 
268 | If you want to edit files locally and transfer, some options are discussed in [_Prerequisites_](0_Prerequisites.md#required-and-useful-tools).
269 | 
270 | If your editor supports editing files remotely, then you can use your _local editor_ to open _files on BlueCrystal_.
271 | For example, you can achieve this using [a VS Code plugin](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) or [the `scp://` scheme in Vim](http://vim.wikia.com/wiki/Editing_remote_files_via_scp_in_vim).
272 | 
273 | Finally, you can use a terminal editor on BlueCrystal.
274 | Emacs, Vim, and nano are available, among others.
275 | If you decide to use Vim and you haven't used it before, you can greatly improve the default configuration; see [amix's basic vimrc](https://github.com/amix/vimrc/blob/master/vimrcs/basic.vim) for a good starting point or [Will Price's vim configuration](https://github.com/willprice/dotfiles/tree/master/vim/.vim) for a more advanced setup.
276 | Emacs has more sensible defaults, so you may find that it needs less tweaking.
277 | Using nano on anything other than a temporary basis or in emergencies is controversial...
278 | 
279 | ## Further resources
280 | 
281 | A variety of guides and cheat sheets on using BlueCrystal are available on the ACRC website: <https://www.acrc.bris.ac.uk/>.
--------------------------------------------------------------------------------
/5_Performance_Analysis_Tools.md:
--------------------------------------------------------------------------------
1 | Performance Analysis Tools
2 | ==========================
3 | 
4 | As the name suggests, High _Performance_ Computing revolves around analysing and improving the performance of various applications and hardware platforms.
5 | In order to identify, understand, and attempt to tackle performance issues, you first need to collect a range of data that shows how your application and host hardware interact.
6 | This section presents some of the tools that are commonly used to collect and interpret such data.
7 | It is by no means an exhaustive list, but it should provide a good starting point.
8 | 
9 | Note that you are not _required_ to use everything shown below.
10 | While it is likely that you will need _some_ of these tools to perform your analysis, you are not expected to utilise _all_ of them.
11 | You should try to identify which tools are most useful for the result _you_ are trying to achieve and focus on getting the most out of them.
12 | 
13 | While all tools apart from those in the [_Other tools_ section](#other-notable-tools) should work on BlueCrystal, not all of them may be installed by default.
14 | For the open-source tools, you can always download the source code and compile them yourself.
15 | For the Intel tools, if you require anything that is not installed, [obtaining a student licence](https://software.intel.com/en-us/qualify-for-free-software/student) and installing them on your own machine is likely the easiest option.
16 | 
17 | 
18 | 
19 | ## Summary of tools
20 | 
21 | This table shows a summary of all the tools presented below and their support on BCp4.
22 | 
23 | | Tool       | Installed | Compatible | Usage                                                     |
24 | | ---------- | :-------: | :--------: | --------------------------------------------------------- |
25 | | perf       | ✔         | ✔          | Run `perf`                                                 |
26 | | gprof      | ✔         | ✔          | Compile with `gcc -pg`, run `gprof`                        |
27 | | PAPI       | ✗         | ✔          | Install from source                                        |
28 | | Valgrind   | ✔         | ✔          | `module load tools/valgrind/3.15.0`                        |
29 | | TAU        | ✗         | ✔          | Install from source                                        |
30 | | vTune      | ✔         | ✔          | `module load VTune/2016_update3`                           |
31 | | Advisor    | ✔         | ✔          | Part of the `languages/intel/2020-u4` module               |
32 | | MPI Tracer | ✗         | ✗          | Install through student licence on own machine             |
33 | | Extrae     | ✗         | ✔          | Install from source                                        |
34 | 
35 | ## Command-line tools
36 | 
37 | ### perf
38 | 
39 | One of the easiest ways you can start obtaining performance data about your application is using a tool that comes with the Linux kernel: `perf`.
40 | In fact, `perf` is a collection of tools, and you pick which tool to use through arguments.
41 | The general syntax is:
42 | 
43 | ```
44 | $ perf <tool> [<options>] -- <command>
45 | ```
46 | 
47 | You can use perf to record a profile of your application's run, then look at it alongside the code, using `perf record` and `perf annotate`, respectively.
48 | It also offers a quick way to access hardware performance counters: `perf stat`.
49 | For the latter, run `perf list` to show the available counters, then select the ones you want to query with `-e`.
50 | Here is an example where we've picked some of the counters at the top of the list:
51 | 
52 | ```
53 | $ perf stat -e cycles,instructions,cache-references,cache-misses -- ./test
54 | 
55 |  Performance counter stats for './test':
56 | 
57 |        292,217,550      cycles                    #    0.000 GHz
58 |        675,095,645      instructions              #    2.31  insns per cycle
59 |          2,298,419      cache-references
60 |              5,093      cache-misses              #    0.222 % of all cache refs
61 | 
62 |        0.094668510 seconds time elapsed
63 | ```
64 | 
65 | See `man perf` for a list of all the available tools and their associated manpages.
66 | There is also [plenty of documentation available online](https://perf.wiki.kernel.org/index.php/Tutorial).
67 | 
68 | ### gprof
69 | 
70 | gprof is a GNU profiler for Linux.
71 | It is a command-line tool that can sample your application at run time and produce a summary of where time was spent executing code.
72 | It is available by default on BlueCrystal, but it can only be used with the GNU compiler.
73 | 
74 | To use gprof, follow these steps:
75 | 
76 | 1. Compile your application using GCC and the `-pg` flag.
77 | 2. Run your produced binary. This will generate a `gmon.out` file.
78 | 3. 
To produce a profiler report, use the following syntax:
79 | ```
80 | $ gprof [<options>] <binary> gmon.out
81 | ```
82 | 
83 | Here is an example for the [STREAM benchmark](https://www.cs.virginia.edu/stream/), where the `-l` option is used to show a line-by-line profile:
84 | 
85 | ```
86 | $ gcc -O3 -fopenmp -pg -g -o stream.gprof stream.c
87 | $ ./stream.gprof
88 | $ gprof -l stream.gprof gmon.out
89 | Flat profile:
90 | 
91 | Each sample counts as 0.01 seconds.
92 |   %   cumulative   self              self     total
93 |  time   seconds   seconds    calls  Ts/call  Ts/call  name
94 |  31.77      0.72     0.72                             main._omp_fn.7 (stream.c:345 @ 400a70)
95 |  30.44      1.41     0.69                             main._omp_fn.6 (stream.c:335 @ 400bb0)
96 |  28.24      2.05     0.64                             main._omp_fn.5 (stream.c:325 @ 400cf0)
97 |   2.21      2.10     0.05                             main._omp_fn.2 (stream.c:267 @ 400fac)
98 |   2.21      2.15     0.05                             main._omp_fn.3 (stream.c:288 @ 400e90)
99 |   1.76      2.19     0.04                             main._omp_fn.2 (stream.c:269 @ 400fd4)
100 |   1.32      2.22     0.03                             checkSTREAMresults (stream.c:463 @ 4014c4)
101 |   1.10      2.25     0.03                             checkSTREAMresults (stream.c:465 @ 4014ec)
102 |   0.44      2.26     0.01                             main._omp_fn.3 (stream.c:286 @ 400e7d)
103 |   0.44      2.27     0.01                             main._omp_fn.5 (stream.c:323 @ 400ccd)
104 |   0.22      2.27     0.01                             checkSTREAMresults (stream.c:464 @ 4014d8)
105 |   0.00      2.27     0.00        1     0.00     0.00  checkSTREAMresults (stream.c:434 @ 401480)
106 |   0.00      2.27     0.00        1     0.00     0.00  checktick (stream.c:385 @ 401160)
107 | # More output omitted
108 | ```
109 | 
110 | Note that gprof doesn't "understand" OpenMP, so there is no breakdown by thread or by parallel region.
111 | It does not support MPI either.
112 | For more usage instructions, see `man gprof` or [the online documentation](https://sourceware.org/binutils/docs/gprof/index.html#SEC_Contents).
113 | 
114 | ### PAPI
115 | 
116 | [PAPI](http://icl.cs.utk.edu/papi/) is a collection of tools and libraries to instrument and record the performance of an application.
117 | It provides three main components:
118 | 
119 | - Binaries that can be used directly to obtain performance data. This is one of the easiest ways to access hardware counters.
120 | - A programming interface (API) that allows you to call on PAPI from your code and record data programmatically. This requires source code modifications; see the sketch at the end of this section.
121 | - A library that other tools can use and build on. For example, valgrind and Extrae (presented below) can be compiled with PAPI support to enable additional data to be collected.
122 | 
123 | PAPI is available as a module on BCp3:
124 | 
125 | ```
126 | $ module av |& grep -i papi
127 | libraries/gnu_builds/papi-5.3.0
128 | libraries/intel_builds/papi-5.3.0
129 | ```
130 | 
131 | If you need a newer version, it can also be [compiled from source](http://icl.cs.utk.edu/papi/software/index.html).
132 | 
133 | A good place to start is checking which hardware data PAPI can access:
134 | 
135 | ```
136 | $ papi_avail
137 | Available events and hardware information.
138 | 
139 |     Name        Code    Avail Deriv Description (Note)
140 | PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
141 | PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
142 | PAPI_L2_DCM  0x80000002  Yes   Yes  Level 2 data cache misses
143 | PAPI_L2_ICM  0x80000003  Yes   No   Level 2 instruction cache misses
144 | PAPI_L3_DCM  0x80000004  No    No   Level 3 data cache misses
145 | PAPI_L3_ICM  0x80000005  No    No   Level 3 instruction cache misses
146 | # More lines omitted
147 | ```
148 | 
149 | Then, read [the online manual](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) to learn how to use the library's functionality.
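To give a flavour of the programming interface mentioned above, here is a minimal sketch using the high-level counter calls from the PAPI 5.x API (compile and link with `-lpapi`); treat it as an illustration rather than a template:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    /* Count total instructions and L1 data cache misses. */
    int events[2] = { PAPI_TOT_INS, PAPI_L1_DCM };
    long long counters[2];

    if (PAPI_start_counters(events, 2) != PAPI_OK) {
        fprintf(stderr, "Failed to start counters\n");
        return EXIT_FAILURE;
    }

    /* The region of code to measure. */
    volatile double sum = 0.0;
    for (int i = 0; i < 10000000; i++)
        sum += i * 0.5;

    if (PAPI_stop_counters(counters, 2) != PAPI_OK) {
        fprintf(stderr, "Failed to read counters\n");
        return EXIT_FAILURE;
    }

    printf("Instructions: %lld\nL1 data cache misses: %lld\n",
           counters[0], counters[1]);
    return EXIT_SUCCESS;
}
```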
150 | 
151 | ### Valgrind
152 | 
153 | [Valgrind](http://valgrind.org/) is a collection of tools for instrumentation and analysis of binaries.
154 | [Memcheck](http://valgrind.org/info/tools.html#memcheck) is the tool for debugging memory issues, which may be useful if you run into invalid accesses or segmentation faults.
155 | There are also [Cachegrind and Callgrind](http://valgrind.org/info/tools.html#cachegrind), which can give you insight into how your application uses the cache, although you should be careful if you try to use these with MPI or OpenMP.
156 | 
157 | Here is sample basic output from the Cachegrind tool:
158 | 
159 | ```
160 | $ valgrind --tool=cachegrind ./test
161 | 
162 | ==31134== 
163 | ==31134== I   refs:      832,790
164 | ==31134== I1  misses:      1,255
165 | ==31134== LLi misses:      1,207
166 | ==31134== I1  miss rate:    0.15%
167 | ==31134== LLi miss rate:    0.14%
168 | ==31134== 
169 | ==31134== D   refs:      311,489  (226,286 rd   + 85,203 wr)
170 | ==31134== D1  misses:      9,581  (  8,034 rd   +  1,547 wr)
171 | ==31134== LLd misses:      5,996  (  4,631 rd   +  1,365 wr)
172 | ==31134== D1  miss rate:     3.1% (    3.6%     +    1.8%  )
173 | ==31134== LLd miss rate:     1.9% (    2.0%     +    1.6%  )
174 | ==31134== 
175 | ==31134== LL refs:        10,836  (  9,289 rd   +  1,547 wr)
176 | ==31134== LL misses:       7,203  (  5,838 rd   +  1,365 wr)
177 | ==31134== LL miss rate:      0.6% (    0.6%     +    1.6%  )
178 | ```
179 | 
180 | Valgrind is already installed on BCp3 and you don't need to load any module; just run `valgrind`.
181 | Documentation is [available online](http://valgrind.org/docs/manual/manual.html) and there is also a [quick start guide](http://valgrind.org/docs/manual/QuickStart.html).
182 | See `man valgrind` for CLI usage information.
183 | 
184 | On your own machine, you can install and use [KCachegrind](https://kcachegrind.github.io/html/Home.html) to graphically visualise Callgrind profiles.
185 | 
186 | ### TAU
187 | 
188 | [TAU](https://www.cs.uoregon.edu/research/tau/home.php) is a flexible open-source profiler.
189 | It supports hardware counters, OpenMP and MPI code, and some versions even work with GPU applications.
190 | On BCp3, there are several modules available, and you should choose the one for the compiler and library you are using:
191 | 
192 | ```
193 | $ module av |& grep -i tau-
194 | tools/gnu_builds/tau-2.23.1-openmp
195 | tools/gnu_builds/tau-2.23.1-openmpi
196 | tools/intel_builds/tau-2.23-openmp
197 | tools/intel_builds/tau-2.23-openmpi
198 | tools/intel_builds/tau-2.23.1-openmp
199 | tools/intel_builds/tau-2.23.1-openmpi
200 | tools/intel_builds/tau-2.27-intel-16u2-mpi
201 | ```
202 | 
203 | The tool has plenty of [documentation online](https://www.cs.uoregon.edu/research/tau/docs.php), including video guides.
204 | There is also a [tutorial for new users](http://tau.uoregon.edu/tau.ppt), which is a good place to start.
205 | 
206 | Basic usage is as follows:
207 | 
208 | 1. Compile your application using the `tau_cc.sh` wrapper script. This is used to specify what TAU will record and then to call on the underlying compiler.
209 | 2. Run the application. This will produce `profile.*` files for each of your MPI ranks and OpenMP threads.
210 | 3. Run `pprof` to display a text report of the collected profiler data. There are other visualisations available, including GUIs. See the documentation for more details.
211 | 
212 | The following is a basic example:
213 | 
214 | ```
215 | $ module load tools/gnu_builds/tau-2.23.1-openmp
216 | $ tau_cc.sh -O3 -fopenmp -o stream.tau.gcc stream.c
217 | $ OMP_NUM_THREADS=16 ./stream.tau.gcc
218 | $ ls -d profile.0.0.*
219 | profile.0.0.0   profile.0.0.10  profile.0.0.12  profile.0.0.14  profile.0.0.2  profile.0.0.4  profile.0.0.6  profile.0.0.8
220 | profile.0.0.1   profile.0.0.11  profile.0.0.13  profile.0.0.15  profile.0.0.3  profile.0.0.5  profile.0.0.7  profile.0.0.9
221 | $ pprof
222 | NODE 0;CONTEXT 0;THREAD 0:
223 | ---------------------------------------------------------------------------------------
224 | %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
225 |               msec   total msec                          usec/call
226 | ---------------------------------------------------------------------------------------
227 | 100.0          143        1,353           1          44    1353319 .TAU application
228 |  89.4            1        1,210          44          44      27504 parallel begin/end [OpenMP]
229 |  88.9        0.729        1,202          42          42      28639 for enter/exit [OpenMP]
230 |  46.4        0.002          627           1           1     627323 parallelfor (parallel begin/end) [OpenMP location: file:/panfs/panasas01/cosc/ab12345/workspace/STREAM/stream.c <267, 272>]
231 |  46.4          321          627           1           1     627306 parallelfor (loop body) [OpenMP location: file:/panfs/panasas01/cosc/ab12345/workspace/STREAM/stream.c <267, 272>]
232 |  32.7            3          443          44          44      10069 barrier enter/exit [OpenMP]
233 | # More output omitted
234 | ```
235 | 
236 | Note that TAU is a complex tool and may require significant effort to learn.
237 | We suggest you also consider the Intel tools below, [vTune](#intel-vtune) and [Advisor](#intel-advisor), before you decide to use TAU as your main profiler.
238 | 
239 | ### likwid
240 | 
241 | A newer collection of tools aiming to simplify performance analysis and benchmarking is [likwid](https://github.com/RRZE-HPC/likwid).
242 | It can access hardware counters, run microbenchmarks to reveal some peak capabilities of your platform, and help with binding threads and MPI ranks.
243 | One of its aims is to present results in a user-friendly way.
244 | 
245 | Here is a snippet of the output produced when using likwid to look at L2 cache counters on the STREAM benchmark:
246 | 
247 | ```
248 | $ env OMP_NUM_THREADS=1 likwid-perfctr -g L2 ./stream.ivy.intel
249 | +----------------------------+---------+--------+-----+--------+------------+
250 | |           Event            | Counter |  Sum   | Min |  Max   |    Avg     |
251 | +----------------------------+---------+--------+-----+--------+------------+
252 | |  INSTR_RETIRED_ANY STAT    |  FIXC0  |  95886 |  0  |  51966 |  3995.2500 |
253 | | CPU_CLK_UNHALTED_CORE STAT |  FIXC1  | 519359 |  0  | 217214 | 21639.9583 |
254 | | CPU_CLK_UNHALTED_REF STAT  |  FIXC2  | 808272 |  0  | 324459 |      33678 |
255 | |   L1D_REPLACEMENT STAT     |   PMC0  |   4327 |  0  |   1921 |   180.2917 |
256 | |     L1D_M_EVICT STAT       |   PMC1  |   1073 |  0  |    454 |    44.7083 |
257 | |    ICACHE_MISSES STAT      |   PMC2  |  13846 |  0  |   6563 |   576.9167 |
258 | +----------------------------+---------+--------+-----+--------+------------+
259 | ```
260 | 
261 | If you want to try likwid, you will need to install it from source.
262 | There is a guide on how to do this, along with detailed usage instructions and examples, on [their wiki](https://github.com/RRZE-HPC/likwid/wiki).
263 | Note that this is a relatively new tool, so there are no guarantees about how easy it is to install or how well it works on BlueCrystal.
264 | 
265 | ## Graphical interface tools
266 | 
267 | ### Intel vTune
268 | 
269 | [vTune Amplifier](https://software.intel.com/en-us/vtune) is a graphical toolkit for performance analysis from Intel.
270 | It contains a profiler that supports both OpenMP and MPI, and which can access a wide range of hardware counters.
271 | The GUI then provides various options to visualise the results.
272 | 
273 | ![vTune](https://i.imgur.com/pDVFa8x.png)
274 | _Example of the vTune Amplifier interface_
275 | 
276 | To use vTune on BCp4, load its module, then run the GUI:
277 | 
278 | ```
279 | $ module load vtune/2017.1.132-GCC-5.4.0-2.26
280 | $ amplxe-gui &
281 | ```
282 | 
283 | Of course, you will need to enable SSH X forwarding to be able to view the GUI on your machine.
284 | When you connect to BlueCrystal (and any other intermediate hops you may be jumping through), use `ssh -X`.
285 | This was discussed in the [_Connecting_ section](1_Connecting_to_BlueCrystal.md#running-graphical-programs).
286 | 
287 | Once the GUI is running, you will need to create a new _project_ for each binary that you want to analyse.
288 | The _Hotspots_ analysis is a good place to start, as it will give you a high-level overview of what's happening in your program.
289 | 
290 | Intel have a [quick start guide](https://software.intel.com/en-us/get-started-with-vtune-linux-os) which we recommend reading to introduce you to vTune.
291 | The same page links to [more tutorials](https://software.intel.com/en-us/articles/intel-vtune-amplifier-tutorials) that cover various tasks which can be done in vTune.
292 | The guides have step-by-step instructions, including screenshots and explanations of the UI elements, so they are well worth the time.
293 | 
294 | vTune can also be used from the command line, which doesn't require you to use X forwarding over SSH.
295 | If you want to use this approach, the command is called `amplxe-cl` and you can find [a guide on Intel's website](https://software.intel.com/en-us/vtune-amplifier-help-running-command-line-analysis).
296 | Perhaps more importantly, this lets you profile your code _on a compute node_ by just adding the command to your job script.
297 | This way, you don't get any of the noise on the login node interfering with your analysis.
298 | 
299 | **Important**: The licence on BlueCrystal only allows a limited number of users to run vTune simultaneously, and it is likely that this number is significantly smaller than the number of students on the unit.
300 | Therefore, occasionally you may be unable to run vTune if it is also being used by many others at the same time, e.g. during labs.
301 | This also means that you should terminate your sessions as soon as possible, so that you don't hold up a licence unnecessarily.
302 | A sensible strategy is to use the CLI tools for data collection—which _doesn't_ use a licence—then use a local installation to look at the results.
303 | This maximises the availability of the shared resources.
304 | 
305 | ### Intel Advisor
306 | 
307 | [Intel Advisor](https://software.intel.com/en-us/advisor) is a performance tool that focuses on vectorisation and parallelisation of applications.
308 | As such, it is less comprehensive than vTune, but it can be easier to work with when vectorisation and threading are the main focus.
309 | 
310 | On BCp4, Advisor is part of the latest Intel compiler module, `languages/intel/2020-u4`. The binaries are named similarly to those in vTune: the GUI is called `advixe-gui`, and the CLI interface is `advixe-cl`.
311 | 
312 | ![Advisor table](https://i.imgur.com/HMytbaD.png)
313 | _Screenshot of the Advisor profiler table_ 314 | 315 | ![Advisor mix](https://i.imgur.com/XkVQT1C.png)
316 | _Screenshot of Advisor's instruction mix summary_
317 | 
318 | We recommend using Advisor if you are attempting to vectorise your code.
319 | The tool will show you exactly which loops were vectorised and the strategies used for each, and when loops aren't vectorised, it will suggest issues that prevented vectorisation.
320 | Together with compiler reports, it should give you enough information to improve your code.
321 | Advisor will also show estimated speed-ups from vectorisation, as well as data that you need if you want to plot a roofline graph.
322 | 
323 | Similarly to vTune, Intel have a [getting started guide](https://software.intel.com/en-us/get-started-with-advisor) which presents the commands and UI elements.
324 | We recommend reading this first, then referring to [the usage guide](https://software.intel.com/en-us/advisor-user-guide) for further documentation.
325 | 
326 | **Important**: Advisor is subject to the same licence limitations as vTune above.
327 | We suggest a similar strategy: use the CLI tools to collect data—ideally on a compute node—then run the GUI on your local machine.
328 | 
329 | ### Intel MPI Trace Analyzer and Collector
330 | 
331 | When writing MPI programs, it is important that you understand your communication patterns and how they affect the performance (and correctness!) of your application.
332 | One tool that can help with visualising your MPI application's structure and identifying potential issues is [Intel's MPI Trace tools](https://software.intel.com/en-us/intel-trace-analyzer).
333 | 
334 | 
336 | 
337 | To use this tool, you will need to install it on your own machine through the Intel student licence.
338 | Documentation is [available online](https://software.intel.com/en-us/articles/intel-trace-analyzer-and-collector-documentation), and there is a [getting started guide](https://software.intel.com/en-us/get-started-with-itac).
339 | There is also a [page explaining the GUI](https://software.intel.com/en-us/articles/introducing-intel-trace-analyzer-gui).
340 | 
341 | ### Extrae and Paraver
342 | 
343 | This is a pair of profiling tools, developed at the [Barcelona Supercomputing Center](https://www.bsc.es/), that often go hand-in-hand.
344 | [Extrae](https://tools.bsc.es/extrae) handles the collection stage, and its output is generally visualised using [Paraver](https://tools.bsc.es/paraver).
345 | 
346 | Below is an example screenshot of Paraver from their website.
347 | If you want to use these tools, you'll have to compile from source, but the code comes with [installation instructions](https://github.com/bsc-performance-tools/extrae/blob/master/INSTALL).
348 | 
349 | ![Paraver](https://tools.bsc.es/sites/default/files/pictures/main-paraver-generalOverview_files/1333.gif)
350 | _Paraver screenshot._
351 | 
352 | 
353 | ## Other notable tools
354 | 
355 | The tools in this section **aren't available on BlueCrystal or as a free download, so you will likely not be able to use them**.
356 | However, they are important in the wider HPC context, so we mention them for completeness.
357 | 
358 | ### Cray tools
359 | 
360 | The Cray software stack—which runs exclusively on Cray machines—includes a range of tools collectively known as _perftools_.
361 | One of those tools is the versatile CrayPAT profiler, which can collect data from any combination of CPU, GPU, memory, I/O, and various networking frameworks/languages.
362 | Another is Reveal, a code analyser that helps with extracting parallelism from serial code by identifying dependencies and suggesting potential ways to get around them.
363 | 
364 | Below is an example report produced by CrayPAT.
365 | You can read the [documentation online](https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00113914en_us&page=About_the_Cray_Performance_Measurement_and_Analysis_Tools_User_Guide.html).
366 | 
367 | ```
368 | CrayPat/X:  Version 6.5.2 Revision ba33e9b  08/22/17 20:38:22
369 | Experiment:                  lite  lite/sample_profile
370 | Number of PEs (MPI ranks):      1
371 | Numbers of PEs per Node:        1
372 | Numbers of Threads per PE:     36
373 | Number of Cores per Socket:    18
374 | Execution start time:  Wed Sep 26 16:49:16 2018
375 | System name and speed:  pascal-002  1200 MHz (approx)
376 | Intel Broadwell CPU  Family:  6  Model: 79  Stepping:  1
377 | 
378 | 
379 | Avg Process Time:       0.57 secs
380 | High Memory:          801.8 MBytes     801.8 MBytes per PE
381 | MFLOPS (aggregate):    42.25 M/sec      42.25 M/sec per PE
382 | I/O Write Rate:     1.834425 MBytes/sec
383 | 
384 | Table 1:  Profile by Function
385 | 
386 |   Samp% |  Samp |  Imb. 
| Group
387 |         |       |  Samp |  Samp% | Function=[MAX10]
388 |         |       |       |        |  Thread=HIDE
389 | 
390 |  100.0% |  53.0 |    -- |     -- | Total
391 | |-----------------------------------------------------------------------
392 | |  50.9% |  27.0 |    -- |     -- | USER
393 | ||----------------------------------------------------------------------
394 | ||  15.1% |   8.0 |    -- |     -- | checkSTREAMresults
395 | ||  13.2% |   7.0 |   0.4 |   5.7% | main.REGION@li.334
396 | ||  13.2% |   7.0 |   0.6 |   9.0% | main.REGION@li.344
397 | ||   9.4% |   5.0 |   1.5 |  21.6% | main.REGION@li.324
398 | ||======================================================================
399 | |  24.5% |  13.0 |    -- |     -- | OMP
400 | ||----------------------------------------------------------------------
401 | ||  13.2% |   7.0 |    -- |     -- | _cray$mt_execute_parallel_with_proc_bind
402 | ||  11.3% |   6.0 |    -- |     -- | _cray$mt_start_two_code_parallel
403 | ||======================================================================
404 | |  17.0% |   9.0 |    -- |     -- | ETC
405 | ||----------------------------------------------------------------------
406 | ||  11.3% |   6.0 |    -- |     -- | pthread_create@@GLIBC_2.2.5
407 | ||   5.7% |   3.0 |    -- |     -- | fullscan_barrier_list]
408 | ||======================================================================
409 | |   5.7% |   3.0 |    -- |     -- | PTHREAD
410 | ||----------------------------------------------------------------------
411 | ||   5.7% |   3.0 |    -- |     -- | pthread_join
412 | ||======================================================================
413 | |   1.9% |   1.0 |    -- |     -- | RT
414 | ||----------------------------------------------------------------------
415 | ||   1.9% |   1.0 |    -- |     -- | nanosleep
416 | |=======================================================================
417 | ```
418 | 
419 | ### Arm tools
420 | 
421 | Arm provide the Arm Forge, a collection of tools previously known as the Allinea Forge.
422 | They primarily target Arm platforms, but some of the tools also run on x86.
423 | The two main tools are DDT, a parallel debugger that supports OpenMP and MPI, and MAP, a graphical profiler that aims to be both easy-to-use and feature-rich.
424 | 
425 | ![Arm DDT: https://static.docs.arm.com/101136/1822/images/DDTWithVersionControlInformation.png](https://i.imgur.com/nuYmEHa.png)
426 | _Screenshot of Arm DDT_ 427 | 428 | ![Arm MAP: https://static.docs.arm.com/101136/1822/images/MapOpenMpSourceCodeView.png](https://i.imgur.com/793DzgT.png)
429 | _Screenshot of Arm MAP in line profiling mode_ 430 | 431 | You can find more details [on the Arm Developer website](https://developer.arm.com/documentation/101136/22-1-3). 432 | --------------------------------------------------------------------------------