├── 01_what_is_a_gpu ├── README.md └── pli.md ├── 02_cuda_toolkit └── README.md ├── 03_your_first_gpu_job ├── README.md ├── cupy │ ├── job.slurm │ ├── lu.py │ └── svd.py ├── julia │ ├── job.slurm │ └── svd.jl ├── matlab │ ├── job.slurm │ └── svd.m ├── pytorch │ ├── job.slurm │ └── svd.py └── tensorflow │ ├── job.slurm │ └── svd.py ├── 04_gpu_tools └── README.md ├── 05_cuda_libraries ├── README.md ├── gesvdj_example.cpp ├── hello_world_gpu_library │ ├── README.md │ ├── cumessage.cu │ ├── cumessage.h │ ├── job.slurm │ └── myapp.cu ├── job.slurm └── matrixMul │ └── job.slurm ├── 06_cuda_kernels ├── 01_hello_world │ ├── README.md │ ├── hello_world.c │ ├── hello_world_gpu.cu │ └── job.slurm ├── 02_simple_kernel │ ├── README.md │ ├── first_parallel.cu │ ├── job.slurm │ └── solution.cu ├── 03_thread_indices │ ├── README.md │ ├── for_loop.c │ ├── for_loop.cu │ ├── hint.md │ ├── job.slurm │ └── solution.cu ├── 04_vector_addition │ ├── README.md │ ├── job.slurm │ ├── timer.h │ ├── vector_add_cpu.c │ └── vector_add_gpu.cu ├── 05_multiple_gpus │ ├── README.md │ ├── job.slurm │ └── multi_gpu.cu └── README.md ├── 07_advanced_and_other └── README.md ├── README.md └── setup.md /01_what_is_a_gpu/README.md: -------------------------------------------------------------------------------- 1 | # What is a GPU? 2 | 3 | A GPU, or Graphics Processing Unit, is an electronic device originally designed for manipulating the images that appear on a computer monitor. However, beginning in 2006 with NVIDIA CUDA, GPUs have become widely used for accelerating computation in various fields including image processing and machine learning. 4 | 5 | Relative to the CPU, GPUs have a far greater number of processing cores but with slower clock speeds. Within a block of threads called a warp (NVIDIA), each thread carries out the same operation on a different piece of data. This is the SIMT paradigm (single instruction, multiple threads). GPUs tend to have much less memory than what is available on a CPU. For instance, the H100 GPUs on Della have 80 GB compared to 1000 GB available to the CPU cores. This is an important consideration when designing algorithms and running jobs. Furthermore, GPUs are intended for highly parallel algorithms. The CPU can often out-perform a GPU on algorithms that are not highly parallelizable such as those that rely on data caching and flow control (e.g., "if" statements). 6 | 7 | Many of the fastest supercomputers in the world use GPUs (see [Top 500](https://top500.org/lists/top500/2024/11/)). How many of the top 10 supercomputers use GPUs? 8 | 9 | NVIDIA has been the leading player in GPUs for HPC. However, the GPU market landscape changed in May 2019 when the US DoE announced that Frontier, the first exascale supercomputer in the US, would be based on [AMD GPUs](https://www.hpcwire.com/2019/05/07/cray-amd-exascale-frontier-at-oak-ridge/) and CPUs. Princeton has a two [MI210 GPUs](https://researchcomputing.princeton.edu/amd-mi100-gpu-testing) which you can use for testing. Intel is also a GPU producer with the [Aurora supercomputer](https://en.wikipedia.org/wiki/Aurora_(supercomputer)) being an example. 10 | 11 | All laptops have a GPU for graphics. It is becoming standard for a laptop to have a second GPU dedicated for compute (see the latest [MacBook Pro](https://www.apple.com/macbook-pro/)). 
12 | 13 | ![cpu-vs-gpu](http://blog.itvce.com/wp-content/uploads/2016/03/032216_1532_DustFreeNVI2.png) 14 | 15 | The image below emphasizes the cache sizes and flow control: 16 | 17 | ![cache_flow_control](https://tigress-web.princeton.edu/~jdh4/gpu-devotes-more-transistors-to-data-processing.png) 18 | 19 | Like a CPU, a GPU has a hierarchical structure with respect to both the execution units and memory. A warp is a unit of 32 threads. NVIDIA GPUs impose a limit of 1024 threads per block. Some integral number of warps are grouped into a streaming multiprocessor (SM). There are tens of SMs per GPU. Each thread has its own memory. There is limited shared memory between a block of threads. And, finally, there is the global memory which is accessible to each grid or collection of blocks. 20 | 21 | ![ampere](https://developer-blogs.nvidia.com/wp-content/uploads/2022/03/H100-Streaming-Multiprocessor-SM-625x869.png) 22 | 23 | The figure above is a diagram of a streaming multiprocessor (SM) for the [NVIDIA H100 GPU](https://www.nvidia.com/en-us/data-center/h100/). The H100 is composed of up to 132 SMs. 24 | 25 | # Princeton Language and Intelligence 26 | 27 | The university spent $9.6M on a new [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) cluster for research involving large AI models. The cluster provides 37 nodes with 8 GPUs per node. The H100 GPU is optimized for training transformer models. [Learn more](https://pli.princeton.edu/about-pli/directors-message) about this. 28 | 29 | # Overview of using a GPU 30 | 31 | This is the essence of how every GPU is used as an accelerator for compute: 32 | 33 | + Copy data from the CPU (host) to the GPU (device) 34 | 35 | + Launch a kernel to carry out computations on the GPU 36 | 37 | + Copy data from the GPU (device) back to the CPU (host) 38 | 39 | ![gpu-overview](https://tigress-web.princeton.edu/~jdh4/gpu_as_accelerator_to_cpu_diagram.png) 40 | 41 | The diagram above and the accompanying pseudocode present a simplified view of how GPUs are used in scientific computing. To fully understand how things work you will need to learn more about memory cache, interconnects, CUDA streams and much more. 42 | 43 | [NVLink](https://www.nvidia.com/en-us/data-center/nvlink/) on Traverse enables fast CPU-to-GPU and GPU-to-GPU data transfers with a peak rate of 75 GB/s per direction. Della has this fast GPU-GPU interconnect on each pair of GPUs on 70 of the 90 GPU nodes. 44 | 45 | Given the significant performance penalty for moving data between the CPU and GPU, it is natural to work toward "unifying" the CPU and GPU. For instance, read about the [NVIDIA Grace Hopper Superchip](https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/). 46 | 47 | # What GPU resources does Princeton have? 48 | 49 | See the "Hardware Resources" on the [GPU Computing](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) page for a complete list. 50 | 51 | ## Adroit 52 | 53 | There are 3 GPU nodes on Adroit: `adroit-h11g1`, `adroit-h11g2` and `adroit-h11g3` 54 | 55 |
 56 | $ ssh <NetID>@adroit.princeton.edu
 57 | $ snodes
 58 | HOSTNAMES      STATE  CPUS S:C:T  CPUS(A/I/O/T) CPU_LOAD MEMORY  PARTITION  AVAIL_FEATURES
 59 | adroit-08      alloc  32   2:16:1 32/0/0/32     1.27     384000  class      skylake,intel
 60 | adroit-09      alloc  32   2:16:1 32/0/0/32     0.75     384000  class      skylake,intel
 61 | adroit-10      alloc  32   2:16:1 32/0/0/32     0.63     384000  class      skylake,intel
 62 | adroit-11      mix    32   2:16:1 29/3/0/32     0.28     384000  class      skylake,intel
 63 | adroit-12      mix    32   2:16:1 16/16/0/32    0.28     384000  class      skylake,intel
 64 | adroit-13      mix    32   2:16:1 25/7/0/32     0.22     384000  all*       skylake,intel
 65 | adroit-13      mix    32   2:16:1 25/7/0/32     0.22     384000  class      skylake,intel
 66 | adroit-14      alloc  32   2:16:1 32/0/0/32     32.29    384000  all*       skylake,intel
 67 | adroit-14      alloc  32   2:16:1 32/0/0/32     32.29    384000  class      skylake,intel
 68 | adroit-15      mix    32   2:16:1 22/10/0/32    9.68     384000  all*       skylake,intel
 69 | adroit-15      mix    32   2:16:1 22/10/0/32    9.68     384000  class      skylake,intel
 70 | adroit-16      alloc  32   2:16:1 32/0/0/32     24.13    384000  all*       skylake,intel
 71 | adroit-16      alloc  32   2:16:1 32/0/0/32     24.13    384000  class      skylake,intel
 72 | adroit-h11g1   plnd   48   2:24:1 0/48/0/48     0.00     1000000 gpu        a100,intel,gpu80
 73 | adroit-h11g2   plnd   48   2:24:1 0/48/0/48     0.76     1000000 gpu        a100,intel
 74 | adroit-h11g3   mix    56   4:14:1 5/51/0/56     1.05     760000  gpu        v100,intel
 75 | adroit-h11n1   idle   128  2:64:1 0/128/0/128   0.00     256000  class      amd,rome
 76 | adroit-h11n2   alloc  64   2:32:1 64/0/0/64     49.07    500000  all*       intel,ice
 77 | adroit-h11n3   mix    64   2:32:1 50/14/0/64    40.54    500000  all*       intel,ice
 78 | adroit-h11n4   mix    64   2:32:1 48/16/0/64    40.33    500000  all*       intel,ice
 79 | adroit-h11n5   mix    64   2:32:1 32/32/0/64    32.94    500000  all*       intel,ice
 80 | adroit-h11n6   mix    64   2:32:1 62/2/0/64     38.95    500000  all*       intel,ice
 81 | 
82 | 83 | To only see the GPU nodes: 84 | 85 |
 86 | $ shownodes -p gpu
 87 | NODELIST      STATE      FREE/TOTAL CPUs  CPU_LOAD  AVAIL/TOTAL MEMORY  FREE/TOTAL GPUs          FEATURES
 88 | adroit-h11g1  planned              48/48      0.00   1000000/1000000MB  4/4 nvidia_a100  a100,intel,gpu80
 89 | adroit-h11g2  planned              48/48      0.76   1000000/1000000MB      8/8 3g.20gb        a100,intel
 90 | adroit-h11g3  mixed                51/56      1.05     736960/760000MB   0/4 tesla_v100        v100,intel
 91 | 
92 | 93 | ### adroit-h11g1 94 | 95 | This node has 4 NVIDIA A100 GPUs with 80 GB of memory each. Each A100 GPU has 108 streaming multiprocessors (SM) and 64 FP32 CUDA cores per SM. 96 | 97 | Here is some information about the A100 GPUs on this node: 98 | 99 | ``` 100 | CUDADevice with properties: 101 | 102 | Name: 'NVIDIA A100 80GB PCIe' 103 | Index: 1 104 | ComputeCapability: '8.0' 105 | SupportsDouble: 1 106 | DriverVersion: 12.2000 107 | ToolkitVersion: 11.2000 108 | MaxThreadsPerBlock: 1024 109 | MaxShmemPerBlock: 49152 110 | MaxThreadBlockSize: [1024 1024 64] 111 | MaxGridSize: [2.1475e+09 65535 65535] 112 | SIMDWidth: 32 113 | TotalMemory: 8.5175e+10 114 | AvailableMemory: 8.4519e+10 115 | MultiprocessorCount: 108 116 | ClockRateKHz: 1410000 117 | ComputeMode: 'Default' 118 | GPUOverlapsTransfers: 1 119 | KernelExecutionTimeout: 0 120 | CanMapHostMemory: 1 121 | DeviceSupported: 1 122 | DeviceAvailable: 1 123 | DeviceSelected: 1 124 | ``` 125 | 126 | Here is infomation about the CPUs on this node: 127 | 128 |
129 | $ ssh <NetID>@adroit.princeton.edu
130 | $ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --gres=gpu:1 --constraint=gpu80 --reservation=gpuprimer
131 | $ lscpu | grep -v Flags
132 | Architecture:        x86_64
133 | CPU op-mode(s):      32-bit, 64-bit
134 | Byte Order:          Little Endian
135 | CPU(s):              48
136 | On-line CPU(s) list: 0-47
137 | Thread(s) per core:  1
138 | Core(s) per socket:  24
139 | Socket(s):           2
140 | NUMA node(s):        2
141 | Vendor ID:           GenuineIntel
142 | CPU family:          6
143 | Model:               143
144 | Model name:          Intel(R) Xeon(R) Gold 6442Y
145 | Stepping:            8
146 | CPU MHz:             3707.218
147 | CPU max MHz:         4000.0000
148 | CPU min MHz:         800.0000
149 | BogoMIPS:            5200.00
150 | Virtualization:      VT-x
151 | L1d cache:           48K
152 | L1i cache:           32K
153 | L2 cache:            2048K
154 | L3 cache:            61440K
155 | NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
156 | NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
157 | $ exit
158 | 
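The GPU properties shown above for `adroit-h11g1` (name, compute capability, SM count, memory) can also be queried directly from the CUDA runtime API. Below is a minimal sketch (the file name `query_gpu.cu` is only an example) that can be compiled with `nvcc query_gpu.cu -o query_gpu` after loading a `cudatoolkit` module. On an A100 node it should report a compute capability of 8.0 and 108 multiprocessors, in agreement with the listing above.

```
// query_gpu.cu -- illustrative sketch: print basic properties of each visible GPU
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s\n", i, prop.name);
    printf("  Compute capability:        %d.%d\n", prop.major, prop.minor);
    printf("  Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("  Total global memory:       %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("  Max threads per block:     %d\n", prop.maxThreadsPerBlock);
  }
  return 0;
}
```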
159 | 160 | 161 | ### adroit-h11g2 162 | 163 | `adroit-h11g2` has 4 NVIDIA A100 GPUs with 40 GB of memory per GPU. The 4 GPUs have been divided into 8 less powerful GPUs with 20 GB of memory each. To connect to this node use: 164 | 165 | ``` 166 | $ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --gres=gpu:1 --nodelist=adroit-h11g2 --reservation=gpuprimer 167 | ``` 168 | 169 | Below is information about the A100 GPUs: 170 | 171 | ``` 172 | $ nvidia-smi -a 173 | Using a NVIDIA A100-PCIE-40GB GPU. 174 | CUDADevice with properties: 175 | 176 | Name: 'NVIDIA A100-PCIE-40GB' 177 | Index: 1 178 | ComputeCapability: '8.0' 179 | SupportsDouble: 1 180 | DriverVersion: 11.7000 181 | ToolkitVersion: 11.2000 182 | MaxThreadsPerBlock: 1024 183 | MaxShmemPerBlock: 49152 184 | MaxThreadBlockSize: [1024 1024 64] 185 | MaxGridSize: [2.1475e+09 65535 65535] 186 | SIMDWidth: 32 187 | TotalMemory: 4.2351e+10 188 | AvailableMemory: 4.1703e+10 189 | MultiprocessorCount: 108 190 | ClockRateKHz: 1410000 191 | ComputeMode: 'Default' 192 | GPUOverlapsTransfers: 1 193 | KernelExecutionTimeout: 0 194 | CanMapHostMemory: 1 195 | DeviceSupported: 1 196 | DeviceAvailable: 1 197 | DeviceSelected: 1 198 | ``` 199 | 200 | Below is information about the CPUs: 201 | 202 | ``` 203 | $ lscpu | grep -v Flags 204 | Architecture: x86_64 205 | CPU op-mode(s): 32-bit, 64-bit 206 | Byte Order: Little Endian 207 | CPU(s): 48 208 | On-line CPU(s) list: 0-47 209 | Thread(s) per core: 1 210 | Core(s) per socket: 24 211 | Socket(s): 2 212 | NUMA node(s): 2 213 | Vendor ID: GenuineIntel 214 | CPU family: 6 215 | Model: 106 216 | Model name: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz 217 | Stepping: 6 218 | CPU MHz: 3499.996 219 | CPU max MHz: 3500.0000 220 | CPU min MHz: 800.0000 221 | BogoMIPS: 5600.00 222 | L1d cache: 48K 223 | L1i cache: 32K 224 | L2 cache: 1280K 225 | L3 cache: 36864K 226 | NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 227 | NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 228 | ``` 229 | 230 | See the necessary Slurm directives to [run on specific GPUs](https://researchcomputing.princeton.edu/systems/adroit#gpus) on Adroit. 231 | 232 | To see a wealth of information about the GPUs use: 233 | 234 | ``` 235 | $ nvidia-smi -q | less 236 | ``` 237 | 238 | ### adroit-h11g3 239 | 240 | This node offers the older V100 GPUs. 241 | 242 | ### Grace Hopper Superchip 243 | 244 | See the [Grace Hopper Superchip webpage](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/) by NVIDIA. Here is a schematic diagram of the superchip: 245 | 246 | ![grace](https://developer-blogs.nvidia.com/wp-content/uploads/2022/11/grace-hopper-overview.png) 247 | 248 | ``` 249 | aturing@della-gh:~$ nvidia-smi -a 250 | 251 | ==============NVSMI LOG============== 252 | 253 | Timestamp : Mon Apr 22 11:24:41 2024 254 | Driver Version : 545.23.08 255 | CUDA Version : 12.3 256 | 257 | Attached GPUs : 1 258 | GPU 00000009:01:00.0 259 | Product Name : GH200 480GB 260 | Product Brand : NVIDIA 261 | Product Architecture : Hopper 262 | Display Mode : Disabled 263 | Display Active : Disabled 264 | Persistence Mode : Enabled 265 | Addressing Mode : ATS 266 | MIG Mode 267 | Current : Disabled 268 | Pending : Disabled 269 | ... 
270 | ``` 271 | 272 | The CPU on the GH Superchip: 273 | 274 | ``` 275 | jdh4@della-gh:~$ lscpu 276 | Architecture: aarch64 277 | CPU op-mode(s): 64-bit 278 | Byte Order: Little Endian 279 | CPU(s): 72 280 | On-line CPU(s) list: 0-71 281 | Vendor ID: ARM 282 | Model name: Neoverse-V2 283 | Model: 0 284 | Thread(s) per core: 1 285 | Core(s) per socket: 72 286 | Socket(s): 1 287 | Stepping: r0p0 288 | Frequency boost: disabled 289 | CPU max MHz: 3510.0000 290 | CPU min MHz: 81.0000 291 | BogoMIPS: 2000.00 292 | Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm di 293 | t uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh 294 | Caches (sum of all): 295 | L1d: 4.5 MiB (72 instances) 296 | L1i: 4.5 MiB (72 instances) 297 | L2: 72 MiB (72 instances) 298 | L3: 114 MiB (1 instance) 299 | NUMA: 300 | NUMA node(s): 9 301 | NUMA node0 CPU(s): 0-71 302 | NUMA node1 CPU(s): 303 | NUMA node2 CPU(s): 304 | NUMA node3 CPU(s): 305 | NUMA node4 CPU(s): 306 | NUMA node5 CPU(s): 307 | NUMA node6 CPU(s): 308 | NUMA node7 CPU(s): 309 | NUMA node8 CPU(s): 310 | Vulnerabilities: 311 | Gather data sampling: Not affected 312 | Itlb multihit: Not affected 313 | L1tf: Not affected 314 | Mds: Not affected 315 | Meltdown: Not affected 316 | Mmio stale data: Not affected 317 | Retbleed: Not affected 318 | Spec rstack overflow: Not affected 319 | Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 320 | Spectre v1: Mitigation; __user pointer sanitization 321 | Spectre v2: Not affected 322 | Srbds: Not affected 323 | Tsx async abort: Not affected 324 | ``` 325 | 326 | ### Compute Capability and Building Optimized Codes 327 | 328 | Some software will only run on a GPU of a given compute capability. To find these values for a given NVIDIA Telsa card see [this page](https://en.wikipedia.org/wiki/Nvidia_Tesla). The compute capability of the A100's on Della is 8.0. For various build systems this translates to `sm_80`. 329 | 330 | The following is from `$ nvcc --help` after loading a `cudatoolkit` module: 331 | 332 | ``` 333 | Options for steering GPU code generation. 334 | ========================================= 335 | 336 | --gpu-architecture (-arch) 337 | Specify the name of the class of NVIDIA 'virtual' GPU architecture for which 338 | the CUDA input files must be compiled. 339 | With the exception as described for the shorthand below, the architecture 340 | specified with this option must be a 'virtual' architecture (such as compute_50). 341 | Normally, this option alone does not trigger assembly of the generated PTX 342 | for a 'real' architecture (that is the role of nvcc option '--gpu-code', 343 | see below); rather, its purpose is to control preprocessing and compilation 344 | of the input to PTX. 345 | For convenience, in case of simple nvcc compilations, the following shorthand 346 | is supported. If no value for option '--gpu-code' is specified, then the 347 | value of this option defaults to the value of '--gpu-architecture'. In this 348 | situation, as only exception to the description above, the value specified 349 | for '--gpu-architecture' may be a 'real' architecture (such as a sm_50), 350 | in which case nvcc uses the specified 'real' architecture and its closest 351 | 'virtual' architecture as effective architecture values. 
For example, 'nvcc 352 | --gpu-architecture=sm_50' is equivalent to 'nvcc --gpu-architecture=compute_50 353 | --gpu-code=sm_50,compute_50'. 354 | -arch=all build for all supported architectures (sm_*), and add PTX 355 | for the highest major architecture to the generated code. 356 | -arch=all-major build for just supported major versions (sm_*0), plus the 357 | earliest supported, and add PTX for the highest major architecture to the 358 | generated code. 359 | -arch=native build for all architectures (sm_*) on the current system 360 | Note: -arch=native, -arch=all, -arch=all-major cannot be used with the -code 361 | option, but can be used with -gencode options 362 | Note: the values compute_30, compute_32, compute_35, compute_37, compute_50, 363 | sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may be removed in 364 | a future release. 365 | Allowed values for this option: 'all','all-major','compute_35','compute_37', 366 | 'compute_50','compute_52','compute_53','compute_60','compute_61','compute_62', 367 | 'compute_70','compute_72','compute_75','compute_80','compute_86','compute_87', 368 | 'lto_35','lto_37','lto_50','lto_52','lto_53','lto_60','lto_61','lto_62', 369 | 'lto_70','lto_72','lto_75','lto_80','lto_86','lto_87','native','sm_35','sm_37', 370 | 'sm_50','sm_52','sm_53','sm_60','sm_61','sm_62','sm_70','sm_72','sm_75', 371 | 'sm_80','sm_86','sm_87'. 372 | ``` 373 | 374 | Hence, a starting point for optimization flags for the A100 GPUs on Della and Adroit: 375 | 376 | ``` 377 | nvcc -O3 --use_fast_math --gpu-architecture=sm_80 -o myapp myapp.cu 378 | ``` 379 | 380 | For the H100 GPUs on Della: 381 | 382 | ``` 383 | nvcc -O3 --use_fast_math --gpu-architecture=sm_90 -o myapp myapp.cu 384 | ``` 385 | 386 | ## Comparison of GPU Resources 387 | 388 | | Cluster | Number of Nodes | GPUs per Node | NVIDIA GPU Model | Number of FP32 Cores| SM Count | GPU Memory (GB) | 389 | |:----------:|:----------:|:---------:|:-------:|:-------:|:-------:|:-------:| 390 | | Adroit | 1 | 4 | A100 | 6912 | 108 | 80 | 391 | | Adroit | 1 | 8 | A100 | -- | -- | 20 | 392 | | Adroit | 1 | 4 | V100 | 5120 | 80 | 32 | 393 | | Della | 37 | 8 | H100 | 14592 | 132 | 80 | 394 | | Della | 69 | 4 | A100 | 6912 | 108 | 80 | 395 | | Della | 20 | 2 | A100 | 6912 | 108 | 40 | 396 | | Della | 2 | 28 | A100 | -- | -- | 10 | 397 | | Stellar | 6 | 2 | A100 | 6912 | 108 | 40 | 398 | | Stellar | 1 | 8 | A100 | 6912 | 108 | 40 | 399 | | Tiger | 12 | 4 | H100 | 14592 | 132 | 80 | 400 | 401 | SM is streaming multiprocessor. Note that the V100 GPUs have 640 [Tensor Cores](https://devblogs.nvidia.com/cuda-9-features-revealed/) (8 per SM) where half-precision Warp Matrix-Matrix and Accumulate (WMMA) operations can be carried out. That is, each core can perform a 4x4 matrix-matrix multiply and add the result to a third matrix. There are differences between the V100 node on Adroit and the Traverse nodes (see [PCIe versus SXM2](https://www.nextplatform.com/micro-site-content/achieving-maximum-compute-throughput-pcie-vs-sxm2/)). 402 | 403 | 404 | ## GPU Hackathon at Princeton 405 | 406 | The next hackathon will take place in [June of 2025](https://www.openhackathons.org/s/siteevent/a0CUP00000rwmKa2AI/se000356). This is a great opportunity to get help from experts in porting your code to a GPU. Or you can participate as a mentor and help a team rework their code. See the [GPU Computing](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) page for details. 
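To tie this page together, below is a minimal sketch of the copy-launch-copy pattern described in the "Overview of using a GPU" section above. The file name `add_one.cu` and the array size are arbitrary choices for illustration; on the A100 nodes it could be built with the flags shown earlier, e.g., `nvcc -O3 --gpu-architecture=sm_80 -o add_one add_one.cu`. The exercises in `06_cuda_kernels` develop this pattern step by step.

```
// add_one.cu -- illustrative sketch of the copy-compute-copy pattern
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n) x[i] += 1.0f;
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);

  float *h_x = (float *)malloc(bytes);                  // host (CPU) array
  for (int i = 0; i < n; i++) h_x[i] = 1.0f;

  float *d_x;
  cudaMalloc(&d_x, bytes);                              // device (GPU) array
  cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // 1. copy host to device

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  add_one<<<blocks, threads>>>(d_x, n);                 // 2. launch the kernel

  cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // 3. copy device to host

  printf("h_x[0] = %f (expect 2.0)\n", h_x[0]);
  cudaFree(d_x);
  free(h_x);
  return 0;
}
```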
407 | -------------------------------------------------------------------------------- /01_what_is_a_gpu/pli.md: -------------------------------------------------------------------------------- 1 | # PLI Nodes 2 | 3 | ``` 4 | Architecture: x86_64 5 | CPU op-mode(s): 32-bit, 64-bit 6 | Byte Order: Little Endian 7 | CPU(s): 96 8 | On-line CPU(s) list: 0-95 9 | Thread(s) per core: 1 10 | Core(s) per socket: 48 11 | Socket(s): 2 12 | NUMA node(s): 2 13 | Vendor ID: GenuineIntel 14 | CPU family: 6 15 | Model: 143 16 | Model name: Intel(R) Xeon(R) Platinum 8468 17 | Stepping: 8 18 | CPU MHz: 3645.945 19 | CPU max MHz: 3800.0000 20 | CPU min MHz: 800.0000 21 | BogoMIPS: 4200.00 22 | L1d cache: 48K 23 | L1i cache: 32K 24 | L2 cache: 2048K 25 | L3 cache: 107520K 26 | NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94 27 | NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95 28 | Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities 29 | ``` 30 | 31 | ``` 32 | $ nvidia-smi 33 | Fri Feb 23 11:51:11 2024 34 | +---------------------------------------------------------------------------------------+ 35 | | NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 | 36 | |-----------------------------------------+----------------------+----------------------+ 37 | | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 38 | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 39 | | | | MIG M. 
| 40 | |=========================================+======================+======================| 41 | | 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | 42 | | N/A 33C P0 72W / 700W | 2MiB / 81559MiB | 0% Default | 43 | | | | Disabled | 44 | +-----------------------------------------+----------------------+----------------------+ 45 | 46 | +---------------------------------------------------------------------------------------+ 47 | | Processes: | 48 | | GPU GI CI PID Type Process name GPU Memory | 49 | | ID ID Usage | 50 | |=======================================================================================| 51 | | No running processes found | 52 | +---------------------------------------------------------------------------------------+ 53 | ``` 54 | 55 | ``` 56 | jdh4@della-j11g1:~$ nvidia-smi -a 57 | ==============NVSMI LOG============== 58 | Timestamp : Fri Feb 23 11:51:29 2024 59 | Driver Version : 545.23.08 60 | CUDA Version : 12.3 61 | 62 | Attached GPUs : 1 63 | GPU 00000000:19:00.0 64 | Product Name : NVIDIA H100 80GB HBM3 65 | Product Brand : NVIDIA 66 | Product Architecture : Hopper 67 | Display Mode : Enabled 68 | Display Active : Disabled 69 | Persistence Mode : Enabled 70 | Addressing Mode : None 71 | MIG Mode 72 | Current : Disabled 73 | Pending : Disabled 74 | Accounting Mode : Disabled 75 | Accounting Mode Buffer Size : 4000 76 | Driver Model 77 | Current : N/A 78 | Pending : N/A 79 | Serial Number : 1654123038646 80 | GPU UUID : GPU-10f35015-e921-bfab-2eb8-4e9b6664d5f1 81 | Minor Number : 0 82 | VBIOS Version : 96.00.74.00.0D 83 | MultiGPU Board : No 84 | Board ID : 0x1900 85 | Board Part Number : 692-2G520-0200-000 86 | GPU Part Number : 2330-885-A1 87 | FRU Part Number : N/A 88 | Module ID : 2 89 | Inforom Version 90 | Image Version : G520.0200.00.05 91 | OEM Object : 2.1 92 | ECC Object : 7.16 93 | Power Management Object : N/A 94 | Inforom BBX Object Flush 95 | Latest Timestamp : 2024/02/22 13:09:29.459 96 | Latest Duration : 119019 us 97 | GPU Operation Mode 98 | Current : N/A 99 | Pending : N/A 100 | GSP Firmware Version : N/A 101 | GPU C2C Mode : Disabled 102 | GPU Virtualization Mode 103 | Virtualization Mode : None 104 | Host VGPU Mode : N/A 105 | GPU Reset Status 106 | Reset Required : No 107 | Drain and Reset Recommended : No 108 | IBMNPU 109 | Relaxed Ordering Mode : N/A 110 | PCI 111 | Bus : 0x19 112 | Device : 0x00 113 | Domain : 0x0000 114 | Device Id : 0x233010DE 115 | Bus Id : 00000000:19:00.0 116 | Sub System Id : 0x16C110DE 117 | GPU Link Info 118 | PCIe Generation 119 | Max : 5 120 | Current : 5 121 | Device Current : 5 122 | Device Max : 5 123 | Host Max : 5 124 | Link Width 125 | Max : 16x 126 | Current : 16x 127 | Bridge Chip 128 | Type : N/A 129 | Firmware : N/A 130 | Replays Since Reset : 0 131 | Replay Number Rollovers : 0 132 | Tx Throughput : 464 KB/s 133 | Rx Throughput : 2593 KB/s 134 | Atomic Caps Inbound : N/A 135 | Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 136 | Fan Speed : N/A 137 | Performance State : P0 138 | Clocks Event Reasons 139 | Idle : Active 140 | Applications Clocks Setting : Not Active 141 | SW Power Cap : Not Active 142 | HW Slowdown : Not Active 143 | HW Thermal Slowdown : Not Active 144 | HW Power Brake Slowdown : Not Active 145 | Sync Boost : Not Active 146 | SW Thermal Slowdown : Not Active 147 | Display Clock Setting : Not Active 148 | FB Memory Usage 149 | Total : 81559 MiB 150 | Reserved : 328 MiB 151 | Used : 2 MiB 152 | Free : 81227 MiB 153 | BAR1 Memory Usage 154 | Total 
: 131072 MiB 155 | Used : 1 MiB 156 | Free : 131071 MiB 157 | Conf Compute Protected Memory Usage 158 | Total : 0 MiB 159 | Used : 0 MiB 160 | Free : 0 MiB 161 | Compute Mode : Default 162 | Utilization 163 | Gpu : 0 % 164 | Memory : 0 % 165 | Encoder : 0 % 166 | Decoder : 0 % 167 | JPEG : 0 % 168 | OFA : 0 % 169 | Encoder Stats 170 | Active Sessions : 0 171 | Average FPS : 0 172 | Average Latency : 0 173 | FBC Stats 174 | Active Sessions : 0 175 | Average FPS : 0 176 | Average Latency : 0 177 | ECC Mode 178 | Current : Enabled 179 | Pending : Enabled 180 | ECC Errors 181 | Volatile 182 | SRAM Correctable : 0 183 | SRAM Uncorrectable : 0 184 | DRAM Correctable : 0 185 | DRAM Uncorrectable : 0 186 | Aggregate 187 | SRAM Correctable : 0 188 | SRAM Uncorrectable : 0 189 | DRAM Correctable : 0 190 | DRAM Uncorrectable : 0 191 | Retired Pages 192 | Single Bit ECC : N/A 193 | Double Bit ECC : N/A 194 | Pending Page Blacklist : N/A 195 | Remapped Rows 196 | Correctable Error : 0 197 | Uncorrectable Error : 0 198 | Pending : No 199 | Remapping Failure Occurred : No 200 | Bank Remap Availability Histogram 201 | Max : 2560 bank(s) 202 | High : 0 bank(s) 203 | Partial : 0 bank(s) 204 | Low : 0 bank(s) 205 | None : 0 bank(s) 206 | Temperature 207 | GPU Current Temp : 33 C 208 | GPU T.Limit Temp : 54 C 209 | GPU Shutdown T.Limit Temp : -8 C 210 | GPU Slowdown T.Limit Temp : -2 C 211 | GPU Max Operating T.Limit Temp : 0 C 212 | GPU Target Temperature : N/A 213 | Memory Current Temp : 41 C 214 | Memory Max Operating T.Limit Temp : 0 C 215 | GPU Power Readings 216 | Power Draw : 72.02 W 217 | Current Power Limit : 700.00 W 218 | Requested Power Limit : 700.00 W 219 | Default Power Limit : 700.00 W 220 | Min Power Limit : 200.00 W 221 | Max Power Limit : 700.00 W 222 | GPU Memory Power Readings 223 | Power Draw : 47.78 W 224 | Module Power Readings 225 | Power Draw : N/A 226 | Current Power Limit : N/A 227 | Requested Power Limit : N/A 228 | Default Power Limit : N/A 229 | Min Power Limit : N/A 230 | Max Power Limit : N/A 231 | Clocks 232 | Graphics : 345 MHz 233 | SM : 345 MHz 234 | Memory : 2619 MHz 235 | Video : 765 MHz 236 | Applications Clocks 237 | Graphics : 1980 MHz 238 | Memory : 2619 MHz 239 | Default Applications Clocks 240 | Graphics : 1980 MHz 241 | Memory : 2619 MHz 242 | Deferred Clocks 243 | Memory : N/A 244 | Max Clocks 245 | Graphics : 1980 MHz 246 | SM : 1980 MHz 247 | Memory : 2619 MHz 248 | Video : 1545 MHz 249 | Max Customer Boost Clocks 250 | Graphics : 1980 MHz 251 | Clock Policy 252 | Auto Boost : N/A 253 | Auto Boost Default : N/A 254 | Voltage 255 | Graphics : 670.000 mV 256 | Fabric 257 | State : Completed 258 | Status : Success 259 | Processes : None 260 | ``` 261 | 262 | ``` 263 | $ numactl -H 264 | available: 2 nodes (0-1) 265 | node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 266 | node 0 size: 515020 MB 267 | node 0 free: 509047 MB 268 | node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 269 | node 1 size: 516037 MB 270 | node 1 free: 489964 MB 271 | node distances: 272 | node 0 1 273 | 0: 10 21 274 | 1: 21 10 275 | ``` 276 | 277 | ## Intra-Node Topology 278 | 279 | ``` 280 | jdh4@della-k17g3:~$ nvidia-smi topo -m 281 | GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID 282 | GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 
PIX PIX NODE NODE NODE NODE 0 N/A 283 | GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE 0 N/A 284 | GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE 0 N/A 285 | GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX NODE NODE NODE 0 N/A 286 | GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 NODE NODE NODE PIX PIX NODE 1 1 N/A 287 | GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 NODE NODE NODE NODE NODE NODE 1 1 N/A 288 | GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 NODE NODE NODE NODE NODE PIX 1 1 N/A 289 | GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X NODE NODE NODE NODE NODE NODE 1 1 N/A 290 | NIC0 PIX NODE NODE NODE NODE NODE NODE NODE X PIX NODE NODE NODE NODE 291 | NIC1 PIX NODE NODE NODE NODE NODE NODE NODE PIX X NODE NODE NODE NODE 292 | NIC2 NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X NODE NODE NODE 293 | NIC3 NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X PIX NODE 294 | NIC4 NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE PIX X NODE 295 | NIC5 NODE NODE NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X 296 | 297 | Legend: 298 | 299 | X = Self 300 | SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) 301 | NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node 302 | PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) 303 | PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) 304 | PIX = Connection traversing at most a single PCIe bridge 305 | NV# = Connection traversing a bonded set of # NVLinks 306 | 307 | NIC Legend: 308 | 309 | NIC0: mlx5_0 310 | NIC1: mlx5_1 311 | NIC2: mlx5_2 312 | NIC3: mlx5_3 313 | NIC4: mlx5_4 314 | NIC5: mlx5_5 315 | ``` 316 | -------------------------------------------------------------------------------- /02_cuda_toolkit/README.md: -------------------------------------------------------------------------------- 1 | # NVIDIA CUDA Toolkit 2 | 3 | ![NVIDIA CUDA](https://upload.wikimedia.org/wikipedia/en/b/b9/Nvidia_CUDA_Logo.jpg) 4 | 5 | The [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) provides a comprehensive set of libraries and tools for developing and running GPU-accelerated applications. 
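Before listing what the toolkit provides, here is a small sketch of what calling one of its libraries looks like from host code. It uses cuBLAS (the `libcublas.so` library listed further down this page) to compute a SAXPY on the GPU; the file name and sizes are illustrative, and the `05_cuda_libraries` directory covers the libraries in more depth. It assumes a `cudatoolkit` module has been loaded so that `nvcc saxpy_cublas.cu -o saxpy_cublas -lcublas` can find the headers and library.

```
// saxpy_cublas.cu -- illustrative sketch: call a CUDA Toolkit library (cuBLAS)
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
  const int n = 1 << 20;
  const float alpha = 2.0f;
  size_t bytes = n * sizeof(float);

  float *h_x = (float *)malloc(bytes);
  float *h_y = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 3.0f; }

  float *d_x, *d_y;
  cudaMalloc(&d_x, bytes);
  cudaMalloc(&d_y, bytes);
  cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y on the GPU
  cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
  cublasDestroy(handle);

  printf("h_y[0] = %f (expect 5.0)\n", h_y[0]);
  cudaFree(d_x); cudaFree(d_y);
  free(h_x); free(h_y);
  return 0;
}
```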
6 | 7 | List the available modules that are related to CUDA: 8 | 9 | ``` 10 | $ module avail cudatoolkit 11 | ------------ /usr/local/share/Modules/modulefiles ------------- 12 | cudatoolkit/10.2 cudatoolkit/11.7 cudatoolkit/12.4 13 | cudatoolkit/11.1 cudatoolkit/12.0 cudatoolkit/12.5 14 | cudatoolkit/11.3 cudatoolkit/12.2 cudatoolkit/12.6 15 | cudatoolkit/11.4 cudatoolkit/12.3 16 | ``` 17 | 18 | Run the following command to see which environment variables the `cudatoolkit` module is modifying: 19 | 20 | ``` 21 | $ $ module show cudatoolkit/12.5 22 | ------------------------------------------------------------------- 23 | /usr/local/share/Modules/modulefiles/cudatoolkit/12.5: 24 | 25 | module-whatis {Sets up cudatoolkit125 12.5 in your environment} 26 | prepend-path PATH /usr/local/cuda-12.5/bin 27 | prepend-path LD_LIBRARY_PATH /usr/local/cuda-12.5/lib64 28 | prepend-path LIBRARY_PATH /usr/local/cuda-12.5/lib64 29 | prepend-path MANPATH /usr/local/cuda-12.5/doc/man 30 | append-path -d { } LDFLAGS -L/usr/local/cuda-12.5/lib64 31 | append-path -d { } INCLUDE -I/usr/local/cuda-12.5/include 32 | append-path CPATH /usr/local/cuda-12.5/include 33 | append-path -d { } FFLAGS -I/usr/local/cuda-12.5/include 34 | append-path -d { } LOCAL_LDFLAGS -L/usr/local/cuda-12.5/lib64 35 | append-path -d { } LOCAL_INCLUDE -I/usr/local/cuda-12.5/include 36 | append-path -d { } LOCAL_CFLAGS -I/usr/local/cuda-12.5/include 37 | append-path -d { } LOCAL_FFLAGS -I/usr/local/cuda-12.5/include 38 | append-path -d { } LOCAL_CXXFLAGS -I/usr/local/cuda-12.5/include 39 | setenv CUDA_HOME /usr/local/cuda-12.5 40 | ------------------------------------------------------------------- 41 | ``` 42 | 43 | Let's look at the files in `/usr/local/cuda-12.5/bin`: 44 | 45 | ``` 46 | $ ls -ltrh /usr/local/cuda-12.5/bin 47 | total 243M 48 | -rwxr-xr-x. 1 root root 49M Apr 15 22:46 nvdisasm 49 | -rwxr-xr-x. 1 root root 688K Apr 15 22:47 cuobjdump 50 | -rwxr-xr-x. 6 root root 11K May 17 18:50 __nvcc_device_query 51 | -rwxr-xr-x. 14 root root 285 May 17 18:50 nvvp 52 | -rwxr-xr-x. 1 root root 111K Jun 6 06:03 nvprune 53 | -rwxr-xr-x. 1 root root 75K Jun 6 06:09 cu++filt 54 | -rwxr-xr-x. 1 root root 30M Jun 6 06:12 ptxas 55 | -rwxr-xr-x. 1 root root 30M Jun 6 06:12 nvlink 56 | -rw-r--r--. 1 root root 465 Jun 6 06:12 nvcc.profile 57 | -rwxr-xr-x. 1 root root 22M Jun 6 06:12 nvcc 58 | -rwxr-xr-x. 1 root root 1.2M Jun 6 06:12 fatbinary 59 | -rwxr-xr-x. 1 root root 7.1M Jun 6 06:12 cudafe++ 60 | -rwxr-xr-x. 1 root root 87K Jun 6 06:12 bin2c 61 | -rwxr-xr-x. 1 root root 803K Jun 6 07:25 cuda-gdbserver 62 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.9-tui 63 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.8-tui 64 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.12-tui 65 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.11-tui 66 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.10-tui 67 | -rwxr-xr-x. 1 root root 15M Jun 6 07:25 cuda-gdb-minimal 68 | -rwxr-xr-x. 1 root root 1.9K Jun 6 07:25 cuda-gdb 69 | -rwxr-xr-x. 1 root root 5.8M Jun 6 07:56 nvprof 70 | lrwxrwxrwx. 1 root root 4 Jun 6 08:04 computeprof -> nvvp 71 | -rwxr-xr-x. 11 root root 1.6K Jun 14 19:56 nsight_ee_plugins_manage.sh 72 | -rwxr-xr-x. 1 root root 833 Jun 25 17:54 nsys-ui 73 | -rwxr-xr-x. 1 root root 743 Jun 25 17:54 nsys 74 | -rwxr-xr-x. 5 root root 112 Jul 12 02:21 compute-sanitizer 75 | -rwxr-xr-x. 5 root root 3.6K Jul 26 18:06 ncu-ui 76 | -rwxr-xr-x. 5 root root 3.8K Jul 26 18:06 ncu 77 | -rwxr-xr-x. 
4 root root 197 Jul 26 18:06 nsight-sys 78 | drwxr-xr-x. 2 root root 43 Aug 28 10:24 crt 79 | ``` 80 | 81 | `nvcc` is the NVIDIA CUDA Compiler. Note that `nvcc` is built on `llvm` as [described here](https://developer.nvidia.com/cuda-llvm-compiler). To learn more about an executable, use the help option. For instance: `nvcc --help`. 82 | 83 | 84 | Let's look at the libraries: 85 | 86 | ``` 87 | $ ls -lL /usr/local/cuda-12.5/lib64/lib*.so 88 | -rwxr-xr-x. 1 root root 2412216 Jun 6 07:56 /usr/local/cuda-12.5/lib64/libaccinj64.so 89 | -rwxr-xr-x. 1 root root 1505608 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libcheckpoint.so 90 | -rwxr-xr-x. 1 root root 446820528 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libcublasLt.so 91 | -rwxr-xr-x. 1 root root 104128480 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libcublas.so 92 | -rwxr-xr-x. 1 root root 712032 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libcudart.so 93 | -rwxr-xr-x. 1 root root 276080616 Jun 6 06:16 /usr/local/cuda-12.5/lib64/libcufft.so 94 | -rwxr-xr-x. 1 root root 974920 Jun 6 06:16 /usr/local/cuda-12.5/lib64/libcufftw.so 95 | -rwxr-xr-x. 6 root root 43320 Jun 5 13:57 /usr/local/cuda-12.5/lib64/libcufile_rdma.so 96 | -rwxr-xr-x. 1 root root 2993816 Jun 6 06:53 /usr/local/cuda-12.5/lib64/libcufile.so 97 | -rwxr-xr-x. 1 root root 2832640 Jun 6 07:56 /usr/local/cuda-12.5/lib64/libcuinj64.so 98 | -rwxr-xr-x. 1 root root 7807144 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libcupti.so 99 | -rwxr-xr-x. 1 root root 96529840 Jun 6 06:14 /usr/local/cuda-12.5/lib64/libcurand.so 100 | -rwxr-xr-x. 1 root root 82234792 Jun 6 06:55 /usr/local/cuda-12.5/lib64/libcusolverMg.so 101 | -rwxr-xr-x. 1 root root 122162688 Jun 6 06:55 /usr/local/cuda-12.5/lib64/libcusolver.so 102 | -rwxr-xr-x. 1 root root 294682616 Jun 6 06:29 /usr/local/cuda-12.5/lib64/libcusparse.so 103 | -rwxr-xr-x. 1 root root 1651184 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppc.so 104 | -rwxr-xr-x. 1 root root 17736496 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppial.so 105 | -rwxr-xr-x. 1 root root 7689032 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppicc.so 106 | -rwxr-xr-x. 1 root root 11248792 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppidei.so 107 | -rwxr-xr-x. 1 root root 101120104 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppif.so 108 | -rwxr-xr-x. 1 root root 41165712 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppig.so 109 | -rwxr-xr-x. 1 root root 10703688 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppim.so 110 | -rwxr-xr-x. 1 root root 37897296 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppist.so 111 | -rwxr-xr-x. 1 root root 724392 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppisu.so 112 | -rwxr-xr-x. 1 root root 5595760 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppitc.so 113 | -rwxr-xr-x. 1 root root 14169336 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnpps.so 114 | -rwxr-xr-x. 1 root root 757496 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libnvblas.so 115 | -rwxr-xr-x. 1 root root 2409960 Jun 6 06:08 /usr/local/cuda-12.5/lib64/libnvfatbin.so 116 | -rwxr-xr-x. 1 root root 54560656 Jun 6 06:11 /usr/local/cuda-12.5/lib64/libnvJitLink.so 117 | -rwxr-xr-x. 1 root root 6726448 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvjpeg.so 118 | -rwxr-xr-x. 1 root root 28139320 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libnvperf_host.so 119 | -rwxr-xr-x. 1 root root 5579216 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libnvperf_target.so 120 | -rwxr-xr-x. 1 root root 5322632 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvrtc-builtins.so 121 | -rwxr-xr-x. 
1 root root 61401616 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvrtc.so 122 | -rwxr-xr-x. 10 root root 40136 May 17 18:50 /usr/local/cuda-12.5/lib64/libnvToolsExt.so 123 | -rwxr-xr-x. 10 root root 30856 May 17 18:50 /usr/local/cuda-12.5/lib64/libOpenCL.so 124 | -rwxr-xr-x. 1 root root 920920 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libpcsamplingutil.so 125 | ``` 126 | 127 | ## cuDNN 128 | 129 | There is also the [CUDA Deep Neural Net](https://developer.nvidia.com/cudnn) (cuDNN) library. It is external to the NVIDIA CUDA Toolkit and is used with TensorFlow, for instance, to provide GPU routines for training neural nets. See the available modules with: 130 | 131 | ``` 132 | $ module avail cudnn 133 | ``` 134 | 135 | ## Conda Installations 136 | 137 | When you install [CuPy](https://cupy.dev), for instance, which is like NumPy for GPUs, Conda will include the CUDA libraries: 138 | 139 |
140 | $ module load anaconda3/2024.6
141 | $ conda create --name cupy-env cupy --channel conda-forge
142 | ...
143 |   _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
144 |   _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
145 |   bzip2              conda-forge/linux-64::bzip2-1.0.8-hd590300_5 
146 |   ca-certificates    conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0 
147 |   cuda-nvrtc         conda-forge/linux-64::cuda-nvrtc-12.5.82-he02047a_0 
148 |   cuda-version       conda-forge/noarch::cuda-version-12.5-hd4f0392_3 
149 |   cupy               conda-forge/linux-64::cupy-13.2.0-py312had87585_0 
150 |   cupy-core          conda-forge/linux-64::cupy-core-13.2.0-py312hd074ebb_0 
151 |   fastrlock          conda-forge/linux-64::fastrlock-0.8.2-py312h30efb56_2 
152 |   ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.40-hf3520f5_7 
153 |   libblas            conda-forge/linux-64::libblas-3.9.0-22_linux64_openblas 
154 |   libcblas           conda-forge/linux-64::libcblas-3.9.0-22_linux64_openblas 
155 |   libcublas          conda-forge/linux-64::libcublas-12.5.3.2-he02047a_0 
156 |   libcufft           conda-forge/linux-64::libcufft-11.2.3.61-he02047a_0 
157 |   libcurand          conda-forge/linux-64::libcurand-10.3.6.82-he02047a_0 
158 |   libcusolver        conda-forge/linux-64::libcusolver-11.6.3.83-he02047a_0 
159 |   libcusparse        conda-forge/linux-64::libcusparse-12.5.1.3-he02047a_0 
160 |   libexpat           conda-forge/linux-64::libexpat-2.6.2-h59595ed_0 
161 |   libffi             conda-forge/linux-64::libffi-3.4.2-h7f98852_5 
162 |   libgcc-ng          conda-forge/linux-64::libgcc-ng-14.1.0-h77fa898_0 
163 |   libgfortran-ng     conda-forge/linux-64::libgfortran-ng-14.1.0-h69a702a_0 
164 |   libgfortran5       conda-forge/linux-64::libgfortran5-14.1.0-hc5f4f2c_0 
165 |   libgomp            conda-forge/linux-64::libgomp-14.1.0-h77fa898_0 
166 |   liblapack          conda-forge/linux-64::liblapack-3.9.0-22_linux64_openblas 
167 |   libnsl             conda-forge/linux-64::libnsl-2.0.1-hd590300_0 
168 |   libnvjitlink       conda-forge/linux-64::libnvjitlink-12.5.82-he02047a_0 
169 |   libopenblas        conda-forge/linux-64::libopenblas-0.3.27-pthreads_hac2b453_1 
170 |   libsqlite          conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0 
171 |   libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-14.1.0-hc0a3c3a_0 
172 |   libuuid            conda-forge/linux-64::libuuid-2.38.1-h0b41bf4_0 
173 |   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1 
174 |   libzlib            conda-forge/linux-64::libzlib-1.3.1-h4ab18f5_1 
175 |   ncurses            conda-forge/linux-64::ncurses-6.5-h59595ed_0 
176 |   numpy              conda-forge/linux-64::numpy-2.0.0-py312h22e1c76_0 
177 |   openssl            conda-forge/linux-64::openssl-3.3.1-h4ab18f5_1 
178 |   pip                conda-forge/noarch::pip-24.0-pyhd8ed1ab_0 
179 |   python             conda-forge/linux-64::python-3.12.4-h194c7f8_0_cpython 
180 |   python_abi         conda-forge/linux-64::python_abi-3.12-4_cp312 
181 |   readline           conda-forge/linux-64::readline-8.2-h8228510_1 
182 |   setuptools         conda-forge/noarch::setuptools-70.1.1-pyhd8ed1ab_0 
183 |   tk                 conda-forge/linux-64::tk-8.6.13-noxft_h4845f30_101 
184 |   tzdata             conda-forge/noarch::tzdata-2024a-h0c530f3_0 
185 |   wheel              conda-forge/noarch::wheel-0.43.0-pyhd8ed1ab_1 
186 |   xz                 conda-forge/linux-64::xz-5.2.6-h166bdaf_0 
187 | 
188 | 189 | When using `pip` to do the installation, one needs to load the `cudatoolkit` module since that dependency is assumed to be available on the local system. The Conda approach installs all the dependencies so one does not load the module. 190 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/README.md: -------------------------------------------------------------------------------- 1 | # Your First GPU Job 2 | 3 | Using the GPUs on the Princeton HPC clusters is easy. Pick one of the applications below to get started. To obtain the materials to run the examples, use these commands: 4 | 5 | ``` 6 | $ ssh @adroit.princeton.edu 7 | $ cd /scratch/network/ 8 | $ git clone https://github.com/PrincetonUniversity/gpu_programming_intro.git 9 | ``` 10 | 11 | To add a GPU to your Slurm allocation: 12 | 13 | ``` 14 | #SBATCH --gres=gpu:1 # number of gpus per node 15 | ``` 16 | 17 | For Adroit, one can specify the GPU type using a constraint: 18 | 19 | ``` 20 | #SBATCH --constraint=a100 # set to gpu80, a100 or v100 21 | #SBATCH --gres=gpu:1 # number of gpus per node 22 | ``` 23 | 24 | For more on specifying the GPU type on Adroit [see this page](https://researchcomputing.princeton.edu/systems/adroit#gpus). 25 | 26 | ## CuPy 27 | 28 | [CuPy](https://cupy.chainer.org) provides a Python interface to set of common numerical routines (e.g., matrix factorizations) which are executed on a GPU (see the [Reference Manual](https://docs-cupy.chainer.org/en/stable/reference/index.html)). You can roughly think of CuPy as NumPy for GPUs. This example is set to use the CuPy installation of the workshop instructor. If you use CuPy for your research work then you should [install it](https://github.com/PrincetonUniversity/gpu_programming_intro/tree/master/02_cuda_toolkit#conda-installations) into your account. 29 | 30 | Examine the Python script before running the code: 31 | 32 | ```python 33 | $ cd gpu_programming_intro/03_your_first_gpu_job/cupy 34 | $ cat svd.py 35 | from time import perf_counter 36 | import cupy as cp 37 | 38 | N = 1000 39 | X = cp.random.randn(N, N, dtype=cp.float64) 40 | 41 | trials = 5 42 | times = [] 43 | for _ in range(trials): 44 | t0 = perf_counter() 45 | u, s, v = cp.linalg.svd(X) 46 | cp.cuda.Device(0).synchronize() 47 | times.append(perf_counter() - t0) 48 | print("Execution time: ", min(times)) 49 | print("sum(s) = ", s.sum()) 50 | print("CuPy version: ", cp.__version__) 51 | ``` 52 | 53 | Below is a sample Slurm script: 54 | 55 | ```bash 56 | $ cat job.slurm 57 | #!/bin/bash 58 | #SBATCH --job-name=cupy-job # create a short name for your job 59 | #SBATCH --nodes=1 # node count 60 | #SBATCH --ntasks=1 # total number of tasks across all nodes 61 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 62 | #SBATCH --gres=gpu:1 # number of gpus per node 63 | #SBATCH --mem=4G # total memory (RAM) per node 64 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 65 | #SBATCH --constraint=a100 # choose a100 or v100 66 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 67 | 68 | module purge 69 | module load anaconda3/2024.6 70 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/cupy-env 71 | 72 | python svd.py 73 | ``` 74 | 75 | A GPU is allocated using the Slurm directive `#SBATCH --gres=gpu:1`. 76 | 77 | Submit the job: 78 | 79 | ``` 80 | $ sbatch job.slurm 81 | ``` 82 | 83 | Wait a few seconds for the job to run. 
Inspect the output: 84 | 85 | ``` 86 | $ cat slurm-*.out 87 | ``` 88 | 89 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. What happens if you re-run the script with the matrix in single precision? Does the execution time double if N is doubled? There is a CPU version of the code at the bottom of this page. Does the operation run faster on the CPU with NumPy or on the GPU with CuPy? Try [this exercise](https://github.com/PrincetonUniversity/a100_workshop/tree/main/06_cupy#cupy-uses-tensor-cores) where the Tensor Cores are utilized by using less than single precision (i.e., TensorFloat32). 90 | 91 | Why are multiple trials used when measuring the execution time? `CuPy` compiles a custom GPU kernel for each GPU operation (e.g., SVD). This means the first time a `CuPy` function is called the measured time is the sum of the compile time plus the time to execute the operation. The second and later calls only include the time to execute the operation. 92 | 93 | In addition to CuPy, Python programmers looking to run their code on GPUs should also be aware of [Numba](https://numba.pydata.org/) and [JAX](https://github.com/google/jax). 94 | 95 | To see performance comparison between the CPU and GPU, see `matmul_numpy.py` and `matmul_cupy.py` in [this repo](https://github.com/jdh4/python-gpu/tree/main/cupy). 96 | 97 | ## PyTorch 98 | 99 | [PyTorch](https://pytorch.org) is a popular deep learning framework. See its documentation for [Tensor operations](https://pytorch.org/docs/stable/tensors.html). This example is set to use the PyTorch installation of the workshop instructor. If you use PyTorch for your research work then you should [install it](https://researchcomputing.princeton.edu/support/knowledge-base/pytorch) into your account. 100 | 101 | Examine the Python script before running the code: 102 | 103 | ```python 104 | $ cd gpu_programming_intro/03_your_first_gpu_job/pytorch 105 | $ cat svd.py 106 | from time import perf_counter 107 | import torch 108 | 109 | N = 1000 110 | 111 | cuda0 = torch.device('cuda:0') 112 | x = torch.randn(N, N, dtype=torch.float64, device=cuda0) 113 | t0 = perf_counter() 114 | u, s, v = torch.svd(x) 115 | elapsed_time = perf_counter() - t0 116 | 117 | print("Execution time: ", elapsed_time) 118 | print("Result: ", torch.sum(s).cpu().numpy()) 119 | print("PyTorch version: ", torch.__version__) 120 | ``` 121 | 122 | Here is a sample Slurm script: 123 | 124 | ```bash 125 | $ cat job.slurm 126 | #!/bin/bash 127 | #SBATCH --job-name=torch-svd # create a short name for your job 128 | #SBATCH --nodes=1 # node count 129 | #SBATCH --ntasks=1 # total number of tasks across all nodes 130 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 131 | #SBATCH --mem-per-cpu=4G # memory per cpu-core 132 | #SBATCH --gres=gpu:1 # number of gpus per node 133 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 134 | #SBATCH --constraint=a100 # choose a100 or v100 on adroit 135 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 136 | 137 | module purge 138 | module load anaconda3/2024.6 139 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/torch-env 140 | 141 | python svd.py 142 | ``` 143 | 144 | Submit the job: 145 | 146 | ``` 147 | $ sbatch job.slurm 148 | ``` 149 | 150 | Wait a few seconds for the job to run. 
Inspect the output: 151 | 152 | ``` 153 | $ cat slurm-*.out 154 | ``` 155 | 156 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. 157 | 158 | ## TensorFlow 159 | 160 | [TensorFlow](https://www.tensorflow.org) is popular library for training deep neural networks. It can also be used for various numerical computations (see [documentation](https://www.tensorflow.org/api_docs/python/tf)). This example is set to use the TensorFlow installation of the workshop instructor. If you use TensorFlow for your research work then you should [install it](https://researchcomputing.princeton.edu/support/knowledge-base/tensorflow) into your account. 161 | 162 | Examine the Python script before running the code: 163 | 164 | ```python 165 | $ cd gpu_programming_intro/03_your_first_gpu_job/tensorflow 166 | $ cat svd.py 167 | from time import perf_counter 168 | 169 | import os 170 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 171 | 172 | import tensorflow as tf 173 | print("TensorFlow version: ", tf.__version__) 174 | 175 | N = 100 176 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64) 177 | t0 = perf_counter() 178 | s, u, v = tf.linalg.svd(x) 179 | elapsed_time = perf_counter() - t0 180 | print("Execution time: ", elapsed_time) 181 | print("Result: ", tf.reduce_sum(s).numpy()) 182 | ``` 183 | 184 | Below is a sample Slurm script: 185 | 186 | ```bash 187 | $ cat job.slurm 188 | #!/bin/bash 189 | #SBATCH --job-name=svd-tf # create a short name for your job 190 | #SBATCH --nodes=1 # node count 191 | #SBATCH --ntasks=1 # total number of tasks across all nodes 192 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 193 | #SBATCH --mem=4G # total memory (RAM) per node 194 | #SBATCH --gres=gpu:1 # number of gpus per node 195 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 196 | #SBATCH --constraint=a100 # choose a100 or v100 197 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 198 | 199 | module load anaconda3/2024.6 200 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/tf2-gpu 201 | 202 | python svd.py 203 | ``` 204 | 205 | Submit the job: 206 | 207 | ``` 208 | $ sbatch job.slurm 209 | ``` 210 | 211 | Wait a few seconds for the job to run. Inspect the output: 212 | 213 | ``` 214 | $ cat slurm-*.out 215 | ``` 216 | 217 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. 218 | 219 | 224 | 225 | ## R with NVBLAS 226 | 227 | Take a look at [this page](https://github.com/PrincetonUniversity/HPC_R_Workshop/tree/master/07_NVBLAS) and then run the commands below: 228 | 229 | ``` 230 | $ git clone https://github.com/PrincetonUniversity/HPC_R_Workshop 231 | $ cd HPC_R_Workshop/07_NVBLAS 232 | $ mv nvblas.conf ~ 233 | $ sbatch 07_NVBLAS.cmd 234 | ``` 235 | 236 | Here is the sample output: 237 | 238 | ``` 239 | $ cat slurm-*.out 240 | ... 241 | [1] "Matrix multiply:" 242 | user system elapsed 243 | 0.166 0.137 0.304 244 | [1] "----" 245 | [1] "Cholesky Factorization:" 246 | user system elapsed 247 | 1.053 0.041 1.096 248 | [1] "----" 249 | [1] "Singular Value Decomposition:" 250 | user system elapsed 251 | 8.060 1.837 5.345 252 | [1] "----" 253 | [1] "Principal Components Analysis:" 254 | user system elapsed 255 | 16.814 5.987 11.252 256 | [1] "----" 257 | [1] "Linear Discriminant Analysis:" 258 | user system elapsed 259 | 25.955 3.080 20.830 260 | [1] "----" 261 | ... 
262 | ``` 263 | 264 | See the [user guide](https://docs.nvidia.com/cuda/nvblas/index.html) for NVBLAS. 265 | 266 | ## MATLAB 267 | 268 | MATLAB is already installed on the cluster. Simply follow these steps: 269 | 270 | ```bash 271 | $ cd gpu_programming_intro/03_your_first_gpu_job/matlab 272 | $ cat svd.m 273 | ``` 274 | 275 | Here is the MATLAB script: 276 | 277 | ```matlab 278 | gpu = gpuDevice(); 279 | fprintf('Using a %s GPU.\n', gpu.Name); 280 | disp(gpuDevice); 281 | 282 | X = gpuArray([1 0 2; -1 5 0; 0 3 -9]); 283 | whos X 284 | [U,S,V] = svd(X) 285 | fprintf('trace(S): %f\n', trace(S)) 286 | quit; 287 | ``` 288 | 289 | Below is a sample Slurm script: 290 | 291 | ```bash 292 | #!/bin/bash 293 | #SBATCH --job-name=matlab-svd # create a short name for your job 294 | #SBATCH --nodes=1 # node count 295 | #SBATCH --ntasks=1 # total number of tasks across all nodes 296 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 297 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default) 298 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS) 299 | #SBATCH --gres=gpu:1 # number of gpus per node 300 | #SBATCH --constraint=a100 # choose a100 or v100 301 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 302 | 303 | module purge 304 | module load matlab/R2023a 305 | 306 | matlab -singleCompThread -nodisplay -nosplash -r svd 307 | ``` 308 | 309 | Submit the job: 310 | 311 | ``` 312 | $ sbatch job.slurm 313 | ``` 314 | 315 | Wait a few seconds for the job to run. Inspect the output: 316 | 317 | ``` 318 | $ cat slurm-*.out 319 | ``` 320 | 321 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. Learn more about [MATLAB on the Research Computing clusters](https://researchcomputing.princeton.edu/support/knowledge-base/matlab). 322 | 323 | Here is an [intro](https://www.mathworks.com/help/parallel-computing/run-matlab-functions-on-a-gpu.html) to using MATLAB with GPUs. 324 | 325 | ## Julia 326 | 327 | Install the `CUDA` package then run the script in `03_your_first_gpu_job/julia`. See our [Julia webage](https://researchcomputing.princeton.edu/support/knowledge-base/julia). 328 | 329 | ## Monitoring GPU Usage 330 | 331 | To monitor jobs in our reservation: 332 | 333 | ``` 334 | $ watch -n 1 squeue -R gpuprimer 335 | ``` 336 | 337 | ## Benchmarks 338 | 339 | ### Matrix Multiplication 340 | 341 | | cluster | code | CPU-cores | time (s) | 342 | |:--------------------:|:----:|:-----------:|:--------:| 343 | | adroit (CPU) | NumPy | 1 | 24.2 | 344 | | adroit (CPU) | NumPy | 2 | 15.5 | 345 | | adroit (CPU) | NumPy | 4 | 5.3 | 346 | | adroit (V100) | CuPy | 1 | 0.3 | 347 | | adroit (K40c) | CuPy | 1 | 1.7 | 348 | 349 | Times are best of 5 for a square matrix with N=10000 in double precision. 350 | 351 | ### LU Decomposition 352 | 353 | | cluster | code | CPU-cores | time (s) | 354 | |:--------------------:|:-----------:|:----------:|:--------:| 355 | | adroit (CPU) | SciPy | 1 | 9.4 | 356 | | adroit (CPU) | SciPy | 2 | 7.9 | 357 | | adroit (CPU) | SciPy | 4 | 6.5 | 358 | | adroit (V100) | CuPy | 1 | 0.3 | 359 | | adroit (K40c) | CuPy | 1 | 1.1 | 360 | | adroit (V100) | Tensorflow | 1 | 0.3 | 361 | | adroit (K40c) | Tensorflow | 1 | 1.1 | 362 | | adroit (CPU) | Tensorflow | 1 | 50.8 | 363 | 364 | Times are best of 5 for a square matrix with N=10000 in double precision. 
365 | 366 | ### Singular Value Decomposition 367 | 368 | | cluster | code | CPU-cores | time (s) | 369 | |:--------------------:|:----------:|:----------:|:--------:| 370 | | adroit (CPU) | NumPy | 1 | 3.6 | 371 | | adroit (CPU) | NumPy | 2 | 3.0 | 372 | | adroit (CPU) | NumPy | 4 | 1.2 | 373 | | adroit (V100) | CuPy | 1 | 24.7 | 374 | | adroit (K40c) | CuPy | 1 | 30.5 | 375 | | adroit (V100) | Torch | 1 | 0.9 | 376 | | adroit (K40c) | Torch | 1 | 1.5 | 377 | | adroit (CPU) | Torch | 1 | 3.0 | 378 | | adroit (V100) | TensorFlow | 1 | 24.8 | 379 | | adroit (K40c) | TensorFlow | 1 | 29.7 | 380 | | adroit (CPU) | TensorFlow | 1 | 9.2 | 381 | 382 | Times are best of 5 for a square matrix with N=2000 in double precision. 383 | 384 | For the LU decomposition using SciPy: 385 | 386 | ``` 387 | from time import perf_counter 388 | 389 | import numpy as np 390 | import scipy as sp 391 | from scipy.linalg import lu 392 | 393 | N = 10000 394 | cpu_runs = 5 395 | 396 | times = [] 397 | X = np.random.randn(N, N).astype(np.float64) 398 | for _ in range(cpu_runs): 399 | t0 = perf_counter() 400 | p, l, u = lu(X, check_finite=False) 401 | times.append(perf_counter() - t0) 402 | print("CPU time: ", min(times)) 403 | print("NumPy version: ", np.__version__) 404 | print("SciPy version: ", sp.__version__) 405 | print(p.sum()) 406 | print(times) 407 | ``` 408 | 409 | For the LU decomposition on the CPU: 410 | 411 | ``` 412 | from time import perf_counter 413 | 414 | import os 415 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 416 | 417 | import tensorflow as tf 418 | print("TensorFlow version: ", tf.__version__) 419 | 420 | times = [] 421 | N = 10000 422 | with tf.device("/cpu:0"): 423 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64) 424 | for _ in range(5): 425 | t0 = perf_counter() 426 | lu, p = tf.linalg.lu(x) 427 | elapsed_time = perf_counter() - t0 428 | times.append(elapsed_time) 429 | print("Execution time: ", min(times)) 430 | print(times) 431 | print("Result: ", tf.reduce_sum(p).numpy()) 432 | ``` 433 | 434 | SVD with NumPy: 435 | 436 | ``` 437 | from time import perf_counter 438 | 439 | N = 2000 440 | cpu_runs = 5 441 | 442 | times = [] 443 | import numpy as np 444 | X = np.random.randn(N, N).astype(np.float64) 445 | for _ in range(cpu_runs): 446 | t0 = perf_counter() 447 | u, s, v = np.linalg.svd(X) 448 | times.append(perf_counter() - t0) 449 | print("CPU time: ", min(times)) 450 | print("NumPy version: ", np.__version__) 451 | print(s.sum()) 452 | print(times) 453 | ``` 454 | 455 | Performing benchmarks with R: 456 | 457 | ``` 458 | # install.packages("microbenchmark") 459 | library(microbenchmark) 460 | library(Matrix) 461 | 462 | N <- 10000 463 | microbenchmark(lu(matrix(rnorm(N*N), N, N)), times=5, unit="s") 464 | ``` 465 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/cupy/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cupy-job # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --gres=gpu:1 # number of gpus per node 7 | #SBATCH --mem=4G # total memory (RAM) per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | 
module load anaconda3/2024.6 14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/cupy-env 15 | 16 | python svd.py 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/cupy/lu.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | import numpy as np 3 | import cupy as cp 4 | import cupyx.scipy.linalg 5 | 6 | N = 10000 7 | X = cp.random.randn(N, N, dtype=np.float64) 8 | 9 | trials = 5 10 | times = [] 11 | for _ in range(trials): 12 | start_time = perf_counter() 13 | lu, piv = cupyx.scipy.linalg.lu_factor(X, check_finite=False) 14 | cp.cuda.Device(0).synchronize() 15 | times.append(perf_counter() - start_time) 16 | 17 | print("Execution time: ", min(times)) 18 | print("CuPy version: ", cp.__version__) 19 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/cupy/svd.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | import cupy as cp 3 | 4 | N = 1000 5 | X = cp.random.randn(N, N, dtype=cp.float64) 6 | 7 | trials = 5 8 | times = [] 9 | for _ in range(trials): 10 | t0 = perf_counter() 11 | u, s, v = cp.linalg.svd(X) 12 | cp.cuda.Device(0).synchronize() 13 | times.append(perf_counter() - t0) 14 | print("Execution time: ", min(times)) 15 | print("sum(s) = ", s.sum()) 16 | print("CuPy version: ", cp.__version__) 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/julia/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=julia_gpu # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --gres=gpu:1 # number of gpus per node 7 | #SBATCH --mem=4G # total memory (RAM) per node 8 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose gpu80, a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load julia/1.8.2 14 | 15 | julia svd.jl 16 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/julia/svd.jl: -------------------------------------------------------------------------------- 1 | using CUDA 2 | N = 8000 3 | F = CUDA.svd(CUDA.rand(N, N)) 4 | println(sum(F.S)) 5 | println("completed") 6 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/matlab/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=matlab-svd # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default) 7 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS) 8 | #SBATCH --gres=gpu:1 # number of gpus per node 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load matlab/R2023a 14 | 15 | matlab -singleCompThread -nodisplay -nosplash -r svd 16 | 
-------------------------------------------------------------------------------- /03_your_first_gpu_job/matlab/svd.m: -------------------------------------------------------------------------------- 1 | gpu = gpuDevice(); 2 | fprintf('Using a %s GPU.\n', gpu.Name); 3 | disp(gpuDevice); 4 | 5 | X = gpuArray([1 0 2; -1 5 0; 0 3 -9]); 6 | whos X; 7 | [U,S,V] = svd(X) 8 | fprintf('trace(S): %f\n', trace(S)) 9 | quit; 10 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/pytorch/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=torch-svd # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load anaconda3/2023.9 14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/torch-env 15 | 16 | python svd.py 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/pytorch/svd.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | import torch 3 | 4 | N = 1000 5 | 6 | cuda0 = torch.device('cuda:0') 7 | x = torch.randn(N, N, dtype=torch.float64, device=cuda0) 8 | t0 = perf_counter() 9 | u, s, v = torch.svd(x) 10 | elapsed_time = perf_counter() - t0 11 | 12 | print("Execution time: ", elapsed_time) 13 | print("Result: ", torch.sum(s).cpu().numpy()) 14 | print("PyTorch version: ", torch.__version__) 15 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/tensorflow/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=svd-tf # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem=4G # total memory (RAM) per node 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:02:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load anaconda3/2024.6 14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/tf2-gpu 15 | 16 | python svd.py 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/tensorflow/svd.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | 3 | import os 4 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 5 | 6 | import tensorflow as tf 7 | print("TensorFlow version: ", tf.__version__) 8 | 9 | N = 100 10 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64) 11 | t0 = perf_counter() 12 | s, u, v = tf.linalg.svd(x) 13 | elapsed_time = perf_counter() - t0 14 | print("Execution time: ", elapsed_time) 15 | print("Result: ", tf.reduce_sum(s).numpy()) 16 | 
-------------------------------------------------------------------------------- /04_gpu_tools/README.md: -------------------------------------------------------------------------------- 1 | # GPU Tools 2 | 3 | This page presents common tools and utilities for GPU computing. 4 | 5 | # nvidia-smi 6 | 7 | This is the NVIDIA Systems Management Interface. This utility can be used to monitor GPU usage and GPU memory usage. It is a comprehensive tool with many options. 8 | 9 | ``` 10 | $ nvidia-smi 11 | Wed May 28 09:39:23 2025 12 | +-----------------------------------------------------------------------------------------+ 13 | | NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 | 14 | |-----------------------------------------+------------------------+----------------------+ 15 | | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 16 | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 17 | | | | MIG M. | 18 | |=========================================+========================+======================| 19 | | 0 NVIDIA A100 80GB PCIe On | 00000000:17:00.0 Off | 0 | 20 | | N/A 39C P0 57W / 300W | 0MiB / 81920MiB | 0% Default | 21 | | | | Disabled | 22 | +-----------------------------------------+------------------------+----------------------+ 23 | 24 | +-----------------------------------------------------------------------------------------+ 25 | | Processes: | 26 | | GPU GI CI PID Type Process name GPU Memory | 27 | | ID ID Usage | 28 | |=========================================================================================| 29 | | No running processes found | 30 | +-----------------------------------------------------------------------------------------+ 31 | ``` 32 | 33 | To see all of the available options, view the help: 34 | 35 | ```$ nvidia-smi --help``` 36 | 37 | Here is an an example that produces CSV output of various metrics: 38 | 39 | ``` 40 | $ nvidia-smi --query-gpu=timestamp,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5 41 | ``` 42 | 43 | The command above takes a reading every 5 seconds. 44 | 45 | # Nsight Systems (nsys) for Profiling 46 | 47 | The `nsys` command can be used to generate a timeline of the execution of your code. `nsys-ui` provides a GUI to examine the profiling data generated by `nsys`. See the NVIDIA Nsight Systems [getting started guide](https://docs.nvidia.com/nsight-systems/) and notes on [Summit](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#profiling-gpu-code-with-nvidia-developer-tools). 48 | 49 | To see the help menu: 50 | 51 | ``` 52 | $ /usr/local/bin/nsys --help 53 | $ /usr/local/bin/nsys --help profile 54 | ``` 55 | 56 | IMPORTANT: Do not run profiling jobs in your `/home` directory because large files are often written during these jobs which can exceed your quota. Instead launch jobs from `/scratch/gpfs/` where you have lots of space. 
Here's an example: 57 | 58 | ``` 59 | $ ssh @della-gpu.princeton.edu 60 | $ cd /scratch/gpfs/ 61 | $ mkdir myjob && cd myjob 62 | # prepare Slurm script 63 | $ sbatch job.slurm 64 | ``` 65 | 66 | Below is an example Slurm script: 67 | 68 | ``` 69 | #!/bin/bash 70 | #SBATCH --job-name=profile # create a short name for your job 71 | #SBATCH --nodes=1 # node count 72 | #SBATCH --ntasks=1 # total number of tasks across all nodes 73 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 74 | #SBATCH --mem=4G # total memory per node 75 | #SBATCH --gres=gpu:1 # number of gpus per node 76 | #SBATCH --time=00:10:00 # total run time limit (HH:MM:SS) 77 | 78 | module purge 79 | module load anaconda3/2024.10 80 | conda activate myenv 81 | 82 | /usr/local/bin/nsys profile --trace=cuda,nvtx,osrt -o myprofile_${SLURM_JOBID} python myscript.py 83 | ``` 84 | 85 | For an MPI code you should use: 86 | 87 | ``` 88 | srun --wait=0 /usr/local/bin/nsys profile --trace=cuda,nvtx,osrt,mpi -o myprofile_${SLURM_JOBID} ./my_mpi_exe 89 | ``` 90 | 91 | Run this command to see the summary statistics: 92 | 93 | ``` 94 | $ /usr/local/bin/nsys stats myprofile_*.nsys-rep 95 | ``` 96 | 97 | To work the the graphical interface (nsys-ui) you can either (1) download the `.qdrep` file to your local machine or (2) create a graphical desktop session on [https://mydella.princeton.edu](https://mydella.princeton.edu/) or [https://mystellar.princeton.edu](https://mystellar.princeton.edu/). To create the graphical desktop, choose "Interactive Apps" then "Desktop of Della/Stellar Vis Nodes". Once the session starts, click on the black terminal icon and then run: 98 | 99 | ``` 100 | $ /usr/local/bin/nsys-ui myprofile_*.nsys-rep 101 | ``` 102 | 103 | # Nsight Compute (ncu) for GPU Kernel Profiling 104 | 105 | The `ncu` command is used for detailed profiling of GPU kernels. See the NVIDIA [documentation](https://docs.nvidia.com/nsight-compute/). On some clusters you will need to load a module to make the command available: 106 | 107 | ``` 108 | $ module load cudatoolkit/12.9 109 | $ ncu --help 110 | ``` 111 | 112 | The idea is to use `ncu` for the profiling and `ncu-ui` for examining the data in a GUI. 113 | 114 | Below is a sample slurm script: 115 | 116 | ``` 117 | #!/bin/bash 118 | #SBATCH --job-name=profile # create a short name for your job 119 | #SBATCH --nodes=1 # node count 120 | #SBATCH --ntasks=1 # total number of tasks across all nodes 121 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 122 | #SBATCH --mem=4G # total memory per node 123 | #SBATCH --gres=gpu:1 # number of gpus per node 124 | #SBATCH --time=00:10:00 # total run time limit (HH:MM:SS) 125 | 126 | export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 127 | 128 | module purge 129 | module load cudatoolkit/12.9 130 | module load anaconda3/2024.10 131 | conda activate myenv 132 | 133 | ncu -o my_report_${SLURM_JOBID} python myscript.py 134 | ``` 135 | 136 | Note: the `ncu` profiler can significantly slow down the execution time of the code. 137 | 138 | To work the the graphical interface (ncu-ui) you can either (1) download the `.ncu-rep` file to your local machine or (2) create a graphical desktop session on [https://mydella.princeton.edu](https://mydella.princeton.edu/) or [https://mystellar.princeton.edu](https://mystellar.princeton.edu/). To create the graphical desktop, choose "Interactive Apps" then "Desktop of Della/Stellar Vis Nodes". 
Once the session starts, click on the black terminal icon and then run: 139 | 140 | ``` 141 | $ module load cudatoolkit/12.9 142 | $ ncu-ui my_report_*.ncu-rep 143 | ``` 144 | 145 | # line_profiler for Python Profiling 146 | 147 | The [line_profiler](https://researchcomputing.princeton.edu/python-profiling) tool provides profiling information for each line of a function. It is easy to use and works for Python codes that run on CPUs and/or GPUs. 148 | 149 | # nvcc 150 | 151 | This is the NVIDIA CUDA compiler. It is based on LLVM. To compile a simple code: 152 | 153 | ``` 154 | $ module load cudatoolkit/12.9 155 | $ nvcc -o hello_world hello_world.cu 156 | ``` 157 | 158 | # Job Statistics 159 | 160 | Follow [this procedure](https://researchcomputing.princeton.edu/support/knowledge-base/job-stats) to view detailed metrics for your Slurm jobs. This includes GPU utilization and memory as a function of time. 161 | 162 | # GPU Computing 163 | 164 | See [this page](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) for an overview of the hardware at Princeton as well as useful commands like `gpudash` and `shownodes`. 165 | 166 | # Debuggers 167 | 168 | ### ARM DDT 169 | 170 | The general directions for using the DDT debugger are [here](https://researchcomputing.princeton.edu/faq/debugging-with-ddt-on-the). The getting started guide is [here](https://developer.arm.com/tools-and-software/server-and-hpc/debug-and-profile/arm-forge/arm-ddt). 171 | 172 | ``` 173 | $ ssh -X @adroit.princeton.edu # better to use graphical desktop via myadroit 174 | $ git clone https://github.com/PrincetonUniversity/hpc_beginning_workshop 175 | $ cd hpc_beginning_workshop/RC_example_jobs/simple_gpu_kernel 176 | $ salloc -N 1 -n 1 -t 10:00 --gres=gpu:1 --x11 177 | $ module load cudatoolkit/12.9 178 | $ nvcc -g -G hello_world_gpu.cu 179 | $ module load ddt/24.1 180 | $ #export ALLINEA_FORCE_CUDA_VERSION=10.1 181 | $ ddt 182 | # check cuda, uncheck "submit to queue", and click on "Run" 183 | ``` 184 | 185 | The `-g` debugging flag is for CPU code while the `-G` flag is for GPU code. `-G` turns off compiler optimizations. 186 | 187 | If the graphics are not displaying fast enough then consider using [TurboVNC](https://researchcomputing.princeton.edu/faq/how-do-i-use-vnc-on-tigre). 188 | 189 | ### `cuda-gdb` 190 | 191 | `cuda-gdb` is a free debugger available as part of the CUDA Toolkit. 192 | -------------------------------------------------------------------------------- /05_cuda_libraries/README.md: -------------------------------------------------------------------------------- 1 | # GPU-Accelerated Libraries 2 | 3 | Let's say you have a CPU code and you are thinking about writing GPU kernels to accelerate the performance of the slow parts of the code. Before doing this, you should see if there are GPU libraries that already have implemented the routines that you need. This page presents an overview of the NVIDIA GPU-accelerated libraries. 4 | 5 | According to NVIDIA: "NVIDIA GPU-accelerated libraries provide highly-optimized functions that perform 2x-10x faster than CPU-only alternatives. Using drop-in interfaces, you can replace CPU-only libraries such as MKL, IPP and FFTW with GPU-accelerated versions with almost no code changes. The libraries can optimally scale your application across multiple GPUs."
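To make the "drop-in" idea concrete, here is a minimal sketch (the file name `axpy_cublas.cu` and the input values are made up for illustration, and error checking is omitted) in which a CPU BLAS `daxpy` call is replaced by `cublasDaxpy` so that y = alpha*x + y is computed on the GPU. After loading a `cudatoolkit` module it could be built with `nvcc -o axpy_cublas axpy_cublas.cu -lcublas`:

```
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  const int n = 5;
  const double alpha = 2.0;
  double x[n] = {1.0, 2.0, 3.0, 4.0, 5.0};
  double y[n] = {10.0, 20.0, 30.0, 40.0, 50.0};

  // copy the input vectors from the CPU (host) to the GPU (device)
  double *d_x, *d_y;
  cudaMalloc(&d_x, n * sizeof(double));
  cudaMalloc(&d_y, n * sizeof(double));
  cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);

  // y = alpha*x + y computed by cuBLAS (this call replaces a CPU daxpy)
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
  cublasDestroy(handle);

  // copy the result back to the CPU and print it
  cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) printf("%g\n", y[i]);

  cudaFree(d_x);
  cudaFree(d_y);
  return 0;
}
```

The structure is the usual one for library calls: copy data to the GPU, call the GPU-accelerated routine, and copy the result back.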
6 | 7 | ![NVIDIA-GPU-Libraries](https://tigress-web.princeton.edu/~jdh4/nv_libraries.jpeg) 8 | 9 | ### Selected libraries 10 | 11 | + **cuDNN** - GPU-accelerated library of primitives for deep neural networks 12 | + **cuBLAS** - GPU-accelerated standard BLAS library 13 | + **cuSPARSE** - GPU-accelerated BLAS for sparse matrices 14 | + **cuRAND** - GPU-accelerated random number generation (RNG) 15 | + **cuSOLVER** - Dense and sparse direct solvers for computer vision, CFD and other applications 16 | + **cuTENSOR** - GPU-accelerated tensor linear algebra library 17 | + **cuFFT** - GPU-accelerated library for Fast Fourier Transforms 18 | + **NPP** - GPU-accelerated image, video, and signal processing functions 19 | + **NCCL** - Collective Communications Library for scaling apps across multiple GPUs and nodes 20 | + **nvGRAPH** - GPU-accelerated library for graph analytics 21 | 22 | For the complete list see [GPU libraries](https://developer.nvidia.com/gpu-accelerated-libraries) by NVIDIA. 23 | 24 | ## Where to find the libraries 25 | 26 | Run the commands below to examine the libraries: 27 | 28 | ``` 29 | $ module show cudatoolkit/12.2 30 | $ ls -lL /usr/local/cuda-12.2/lib64/lib*.so 31 | ``` 32 | 33 | ## Example 34 | 35 | Make sure that you are on the `adroit5` login node: 36 | 37 | ``` 38 | $ hostname 39 | adroit5 40 | ``` 41 | 42 | Instead of computing the singular value decomposition (SVD) on the CPU, this example computes it on the GPU using `libcusolver`. First look over the source code: 43 | 44 | ``` 45 | $ cd gpu_programming_intro/05_cuda_libraries 46 | $ cat gesvdj_example.cpp | less # q to quit 47 | ``` 48 | 49 | The header file `cusolverDn.h` included by `gesvdj_example.cpp` contains the line `cuSolverDN : Dense Linear Algebra Library` providing information about its purpose. See the [cuSOLVER API](https://docs.nvidia.com/cuda/cusolver/index.html) for more. 50 | 51 | 52 | Next, compile and link the code as follows: 53 | 54 | ``` 55 | $ module load cudatoolkit/12.2 56 | $ g++ -o gesvdj_example gesvdj_example.cpp -lcudart -lcusolver 57 | ``` 58 | 59 | Run `ldd gesvdj_example` to check the linking against cuSOLVER (i.e., `libcusolver.so`). 60 | 61 | Submit the job to the scheduler with: 62 | 63 | ``` 64 | $ sbatch job.slurm 65 | ``` 66 | 67 | The output should appear as: 68 | 69 | ``` 70 | $ cat slurm-*.out 71 | 72 | example of gesvdj 73 | tol = 1.000000E-07, default value is machine zero 74 | max.
sweeps = 15, default value is 100 75 | econ = 0 76 | A = (matlab base-1) 77 | A(1,1) = 1.0000000000000000E+00 78 | A(1,2) = 2.0000000000000000E+00 79 | A(2,1) = 4.0000000000000000E+00 80 | A(2,2) = 5.0000000000000000E+00 81 | A(3,1) = 2.0000000000000000E+00 82 | A(3,2) = 1.0000000000000000E+00 83 | ===== 84 | gesvdj converges 85 | S = singular values (matlab base-1) 86 | S(1,1) = 7.0652834970827287E+00 87 | S(2,1) = 1.0400812977120775E+00 88 | ===== 89 | U = left singular vectors (matlab base-1) 90 | U(1,1) = 3.0821892063278472E-01 91 | U(1,2) = -4.8819507401989848E-01 92 | U(1,3) = 8.1649658092772659E-01 93 | U(2,1) = 9.0613333377729299E-01 94 | U(2,2) = -1.1070553170904460E-01 95 | U(2,3) = -4.0824829046386302E-01 96 | U(3,1) = 2.8969549251172333E-01 97 | U(3,2) = 8.6568461633075366E-01 98 | U(3,3) = 4.0824829046386224E-01 99 | ===== 100 | V = right singular vectors (matlab base-1) 101 | V(1,1) = 6.3863583713639760E-01 102 | V(1,2) = 7.6950910814953477E-01 103 | V(2,1) = 7.6950910814953477E-01 104 | V(2,2) = -6.3863583713639760E-01 105 | ===== 106 | |S - S_exact|_sup = 4.440892E-16 107 | residual |A - U*S*V**H|_F = 3.511066E-16 108 | number of executed sweeps = 1 109 | ``` 110 | 111 | ## NVIDIA CUDA Samples 112 | 113 | Run the following command to obtain a copy of the [NVIDIA CUDA Samples](https://github.com/NVIDIA/cuda-samples): 114 | 115 | ``` 116 | $ cd gpu_programming_intro 117 | $ git clone https://github.com/NVIDIA/cuda-samples.git 118 | $ cd cuda-samples/Samples 119 | ``` 120 | 121 | Then browse the directories: 122 | 123 | ``` 124 | $ ls -ltrh 125 | total 20K 126 | drwxr-xr-x. 55 jdh4 cses 4.0K Oct 9 18:23 0_Introduction 127 | drwxr-xr-x. 6 jdh4 cses 130 Oct 9 18:23 1_Utilities 128 | drwxr-xr-x. 36 jdh4 cses 4.0K Oct 9 18:23 2_Concepts_and_Techniques 129 | drwxr-xr-x. 25 jdh4 cses 4.0K Oct 9 18:23 3_CUDA_Features 130 | drwxr-xr-x. 40 jdh4 cses 4.0K Oct 9 18:23 4_CUDA_Libraries 131 | drwxr-xr-x. 52 jdh4 cses 4.0K Oct 9 18:23 5_Domain_Specific 132 | drwxr-xr-x. 5 jdh4 cses 105 Oct 9 18:23 6_Performance 133 | ``` 134 | 135 | Pick an example and then build and run it. For instance: 136 | 137 | ``` 138 | $ module load cudatoolkit/12.2 139 | $ cd 0_Introduction/matrixMul 140 | $ make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ # use 90 for H100 GPUs on Tiger and Della (PLI) 141 | ``` 142 | 143 | This will produce `matrixMul`. If you run the `ldd` command on `matrixMul` you will see that it does not link against `cublas.so`. Instead it uses a naive implementation of the routine which is surely not as efficient as the library implementation. 144 | 145 | ``` 146 | $ cp /gpu_programming_intro/05_cuda_libraries/matrixMul/job.slurm . 147 | ``` 148 | 149 | Submit the job: 150 | 151 | ``` 152 | $ sbatch job.slurm 153 | ``` 154 | 155 | See `4_CUDA_Libraries` for more examples. For instance, take a look at `4_CUDA_Libraries/matrixMulCUBLAS`. Does the resulting executable link against `libcublas.so`? 156 | 157 | ``` 158 | $ cd ../../4_CUDA_Libraries/matrixMulCUBLAS 159 | $ make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ 160 | $ ldd matrixMulCUBLAS 161 | ``` 162 | 163 | Similarly, does the code in `4_CUDA_Libraries/simpleCUFFT_MGPU` link against `libcufft.so`? 164 | 165 | To run code that uses the Tensor Cores see examples such as `3_CUDA_Features/bf16TensorCoreGemm`. That example uses the bfloat16 floating-point format. 166 | 167 | Note that some examples have dependencies that will not be satisfied so they will not build. This can be resolved if it relates to your research work. 
For instance, to build `5_Domain_Specific/nbody` use: 168 | 169 | ``` 170 | GLPATH=/lib64 make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ # use 90 for H100 GPUs on Tiger and Della (PLI) 171 | ``` 172 | 173 | Note that `nbody` will not run successfully on adroit since the GPU nodes do not have `libglut.so`. The library could be added if needed. One can compile and run this code on adroit-vis using `TARGET_ARCH=x86_64 SMS="80"`. 174 | -------------------------------------------------------------------------------- /05_cuda_libraries/gesvdj_example.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | * * How to compile (assume cuda is installed at /usr/local/cuda-10.1/) 3 | * * nvcc -c -I/usr/local/cuda-10.1/include gesvdj_example.cpp 4 | * * g++ -o gesvdj_example gesvdj_example.o -L/usr/local/cuda-10.1/lib64 -lcudart -lcusolver 5 | * */ 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | 13 | void printMatrix(int m, int n, const double*A, int lda, const char* name) 14 | { 15 | for(int row = 0 ; row < m ; row++){ 16 | for(int col = 0 ; col < n ; col++){ 17 | double Areg = A[row + col*lda]; 18 | printf("%s(%d,%d) = %20.16E\n", name, row+1, col+1, Areg); 19 | } 20 | } 21 | } 22 | 23 | int main(int argc, char*argv[]) 24 | { 25 | cusolverDnHandle_t cusolverH = NULL; 26 | cudaStream_t stream = NULL; 27 | gesvdjInfo_t gesvdj_params = NULL; 28 | 29 | cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; 30 | cudaError_t cudaStat1 = cudaSuccess; 31 | cudaError_t cudaStat2 = cudaSuccess; 32 | cudaError_t cudaStat3 = cudaSuccess; 33 | cudaError_t cudaStat4 = cudaSuccess; 34 | cudaError_t cudaStat5 = cudaSuccess; 35 | const int m = 3; 36 | const int n = 2; 37 | const int lda = m; 38 | /* | 1 2 | 39 | * * A = | 4 5 | 40 | * * | 2 1 | 41 | * */ 42 | double A[lda*n] = { 1.0, 4.0, 2.0, 2.0, 5.0, 1.0}; 43 | double U[lda*m]; /* m-by-m unitary matrix, left singular vectors */ 44 | double V[lda*n]; /* n-by-n unitary matrix, right singular vectors */ 45 | double S[n]; /* numerical singular value */ 46 | /* exact singular values */ 47 | double S_exact[n] = {7.065283497082729, 1.040081297712078}; 48 | double *d_A = NULL; /* device copy of A */ 49 | double *d_S = NULL; /* singular values */ 50 | double *d_U = NULL; /* left singular vectors */ 51 | double *d_V = NULL; /* right singular vectors */ 52 | int *d_info = NULL; /* error info */ 53 | int lwork = 0; /* size of workspace */ 54 | double *d_work = NULL; /* devie workspace for gesvdj */ 55 | int info = 0; /* host copy of error info */ 56 | 57 | /* configuration of gesvdj */ 58 | const double tol = 1.e-7; 59 | const int max_sweeps = 15; 60 | const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvectors. 61 | const int econ = 0 ; /* econ = 1 for economy size */ 62 | 63 | /* numerical results of gesvdj */ 64 | double residual = 0; 65 | int executed_sweeps = 0; 66 | 67 | printf("example of gesvdj \n"); 68 | printf("tol = %E, default value is machine zero \n", tol); 69 | printf("max. 
sweeps = %d, default value is 100\n", max_sweeps); 70 | printf("econ = %d \n", econ); 71 | 72 | printf("A = (matlab base-1)\n"); 73 | printMatrix(m, n, A, lda, "A"); 74 | printf("=====\n"); 75 | 76 | /* step 1: create cusolver handle, bind a stream */ 77 | status = cusolverDnCreate(&cusolverH); 78 | assert(CUSOLVER_STATUS_SUCCESS == status); 79 | 80 | cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); 81 | assert(cudaSuccess == cudaStat1); 82 | 83 | status = cusolverDnSetStream(cusolverH, stream); 84 | assert(CUSOLVER_STATUS_SUCCESS == status); 85 | 86 | /* step 2: configuration of gesvdj */ 87 | status = cusolverDnCreateGesvdjInfo(&gesvdj_params); 88 | assert(CUSOLVER_STATUS_SUCCESS == status); 89 | 90 | /* default value of tolerance is machine zero */ 91 | status = cusolverDnXgesvdjSetTolerance( 92 | gesvdj_params, 93 | tol); 94 | assert(CUSOLVER_STATUS_SUCCESS == status); 95 | 96 | /* default value of max. sweeps is 100 */ 97 | status = cusolverDnXgesvdjSetMaxSweeps( 98 | gesvdj_params, 99 | max_sweeps); 100 | assert(CUSOLVER_STATUS_SUCCESS == status); 101 | 102 | /* step 3: copy A and B to device */ 103 | cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double)*lda*n); 104 | cudaStat2 = cudaMalloc ((void**)&d_S , sizeof(double)*n); 105 | cudaStat3 = cudaMalloc ((void**)&d_U , sizeof(double)*lda*m); 106 | cudaStat4 = cudaMalloc ((void**)&d_V , sizeof(double)*lda*n); 107 | cudaStat5 = cudaMalloc ((void**)&d_info, sizeof(int)); 108 | assert(cudaSuccess == cudaStat1); 109 | assert(cudaSuccess == cudaStat2); 110 | assert(cudaSuccess == cudaStat3); 111 | assert(cudaSuccess == cudaStat4); 112 | assert(cudaSuccess == cudaStat5); 113 | 114 | cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n, cudaMemcpyHostToDevice); 115 | assert(cudaSuccess == cudaStat1); 116 | 117 | /* step 4: query workspace of SVD */ 118 | status = cusolverDnDgesvdj_bufferSize( 119 | cusolverH, 120 | jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */ 121 | /* CUSOLVER_EIG_MODE_VECTOR: compute singular value and singular vectors */ 122 | econ, /* econ = 1 for economy size */ 123 | m, /* nubmer of rows of A, 0 <= m */ 124 | n, /* number of columns of A, 0 <= n */ 125 | d_A, /* m-by-n */ 126 | lda, /* leading dimension of A */ 127 | d_S, /* min(m,n) */ 128 | /* the singular values in descending order */ 129 | d_U, /* m-by-m if econ = 0 */ 130 | /* m-by-min(m,n) if econ = 1 */ 131 | lda, /* leading dimension of U, ldu >= max(1,m) */ 132 | d_V, /* n-by-n if econ = 0 */ 133 | /* n-by-min(m,n) if econ = 1 */ 134 | lda, /* leading dimension of V, ldv >= max(1,n) */ 135 | &lwork, 136 | gesvdj_params); 137 | assert(CUSOLVER_STATUS_SUCCESS == status); 138 | 139 | cudaStat1 = cudaMalloc((void**)&d_work , sizeof(double)*lwork); 140 | assert(cudaSuccess == cudaStat1); 141 | 142 | /* step 5: compute SVD */ 143 | status = cusolverDnDgesvdj( 144 | cusolverH, 145 | jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */ 146 | /* CUSOLVER_EIG_MODE_VECTOR: compute singular value and singular vectors */ 147 | econ, /* econ = 1 for economy size */ 148 | m, /* nubmer of rows of A, 0 <= m */ 149 | n, /* number of columns of A, 0 <= n */ 150 | d_A, /* m-by-n */ 151 | lda, /* leading dimension of A */ 152 | d_S, /* min(m,n) */ 153 | /* the singular values in descending order */ 154 | d_U, /* m-by-m if econ = 0 */ 155 | /* m-by-min(m,n) if econ = 1 */ 156 | lda, /* leading dimension of U, ldu >= max(1,m) */ 157 | d_V, /* n-by-n if econ = 0 */ 158 | /* n-by-min(m,n) if econ = 1 */ 159 | lda, /* leading 
dimension of V, ldv >= max(1,n) */ 160 | d_work, 161 | lwork, 162 | d_info, 163 | gesvdj_params); 164 | cudaStat1 = cudaDeviceSynchronize(); 165 | assert(CUSOLVER_STATUS_SUCCESS == status); 166 | assert(cudaSuccess == cudaStat1); 167 | 168 | cudaStat1 = cudaMemcpy(U, d_U, sizeof(double)*lda*m, cudaMemcpyDeviceToHost); 169 | cudaStat2 = cudaMemcpy(V, d_V, sizeof(double)*lda*n, cudaMemcpyDeviceToHost); 170 | cudaStat3 = cudaMemcpy(S, d_S, sizeof(double)*n , cudaMemcpyDeviceToHost); 171 | cudaStat4 = cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost); 172 | cudaStat5 = cudaDeviceSynchronize(); 173 | assert(cudaSuccess == cudaStat1); 174 | assert(cudaSuccess == cudaStat2); 175 | assert(cudaSuccess == cudaStat3); 176 | assert(cudaSuccess == cudaStat4); 177 | assert(cudaSuccess == cudaStat5); 178 | 179 | if ( 0 == info ){ 180 | printf("gesvdj converges \n"); 181 | }else if ( 0 > info ){ 182 | printf("%d-th parameter is wrong \n", -info); 183 | exit(1); 184 | }else{ 185 | printf("WARNING: info = %d : gesvdj does not converge \n", info ); 186 | } 187 | 188 | printf("S = singular values (matlab base-1)\n"); 189 | printMatrix(n, 1, S, lda, "S"); 190 | printf("=====\n"); 191 | 192 | printf("U = left singular vectors (matlab base-1)\n"); 193 | printMatrix(m, m, U, lda, "U"); 194 | printf("=====\n"); 195 | 196 | printf("V = right singular vectors (matlab base-1)\n"); 197 | printMatrix(n, n, V, lda, "V"); 198 | printf("=====\n"); 199 | 200 | /* step 6: measure error of singular value */ 201 | double ds_sup = 0; 202 | for(int j = 0; j < n; j++){ 203 | double err = fabs( S[j] - S_exact[j] ); 204 | ds_sup = (ds_sup > err)? ds_sup : err; 205 | } 206 | printf("|S - S_exact|_sup = %E \n", ds_sup); 207 | 208 | status = cusolverDnXgesvdjGetSweeps( 209 | cusolverH, 210 | gesvdj_params, 211 | &executed_sweeps); 212 | assert(CUSOLVER_STATUS_SUCCESS == status); 213 | 214 | status = cusolverDnXgesvdjGetResidual( 215 | cusolverH, 216 | gesvdj_params, 217 | &residual); 218 | assert(CUSOLVER_STATUS_SUCCESS == status); 219 | 220 | printf("residual |A - U*S*V**H|_F = %E \n", residual ); 221 | printf("number of executed sweeps = %d \n", executed_sweeps ); 222 | 223 | /* free resources */ 224 | if (d_A ) cudaFree(d_A); 225 | if (d_S ) cudaFree(d_S); 226 | if (d_U ) cudaFree(d_U); 227 | if (d_V ) cudaFree(d_V); 228 | if (d_info) cudaFree(d_info); 229 | if (d_work ) cudaFree(d_work); 230 | 231 | if (cusolverH) cusolverDnDestroy(cusolverH); 232 | if (stream ) cudaStreamDestroy(stream); 233 | if (gesvdj_params) cusolverDnDestroyGesvdjInfo(gesvdj_params); 234 | 235 | cudaDeviceReset(); 236 | return 0; 237 | } 238 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/README.md: -------------------------------------------------------------------------------- 1 | # Building a Simple GPU Library 2 | 3 | In this exercise we will construct a "hello world" GPU library called `cumessage` and then link and run a code against it. 4 | 5 | ### Create the GPU Library 6 | 7 | Inspect the files that compose the GPU library: 8 | 9 | ```bash 10 | $ cd 05_cuda_libraries/hello_world_gpu_library 11 | $ cat cumessage.h 12 | $ cat cumessage.cu 13 | ``` 14 | 15 | `cumessage.h` is the header file. It contains the signature or protocol of one function. That is, the name and the input/output types are specified but the function body is not implemented here. The implementation is done in `cumessage.cu`. There is some CUDA code in that file. 
It will be explained in `06_cuda_kernels`. 16 | 17 | Libraries are standalone. That is, there is nothing at present waiting to use our library. We will simply create it and then write a code that can use it. Create the library by running the following commands: 18 | 19 | ```bash 20 | $ module load cudatoolkit/11.7 21 | $ nvcc -Xcompiler -fPIC -o libcumessage.so -shared cumessage.cu 22 | $ ls -ltr 23 | ``` 24 | 25 | This will produce `libcumessage.so` which is a GPU library with a single function. Add the option "-v" to the line beginning with `nvcc` above to see more details. You will see that `gcc` is being called. 26 | 27 | ### Use the GPU Library 28 | 29 | Take a look at our simple code in `myapp.cu` that will use our GPU library: 30 | 31 | ```bash 32 | $ cat myapp.cu 33 | ``` 34 | 35 | Once again, note that `myapp.cu` only needs to know about the inputs and outputs of `GPUfunction` through the header file. Nothing is known to `myapp.cu` about how that function is implemented. 36 | 37 | Compile the main routine against our GPU library: 38 | 39 | ``` 40 | $ nvcc -I. -o myapp myapp.cu -L. -lcudart -lcumessage 41 | $ ls -ltr 42 | ``` 43 | 44 | This will produce `myapp` which is a GPU application that links against our GPU library `libcumessage.so`: 45 | 46 | ``` 47 | $ env LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ldd myapp 48 | linux-vdso.so.1 (0x00007fffdaf61000) 49 | libcumessage.so => ./libcumessage.so (0x000014d68450a000) 50 | libcudart.so.11.0 => /usr/local/cuda-11.4/lib64/libcudart.so.11.0 (0x000014d684268000) 51 | librt.so.1 => /lib64/librt.so.1 (0x000014d684060000) 52 | libpthread.so.0 => /lib64/libpthread.so.0 (0x000014d683e40000) 53 | libdl.so.2 => /lib64/libdl.so.2 (0x000014d683c3c000) 54 | libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014d6838a7000) 55 | libm.so.6 => /lib64/libm.so.6 (0x000014d683525000) 56 | libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014d68330d000) 57 | libc.so.6 => /lib64/libc.so.6 (0x000014d682f48000) 58 | /lib64/ld-linux-x86-64.so.2 (0x000014d6847a9000) 59 | ``` 60 | Finally, submit the job and inspect the output: 61 | 62 | ``` 63 | $ sbatch job.slurm 64 | $ cat slurm-*.out 65 | Hello world from the CPU. 66 | Hello world from the GPU. 
67 | ``` 68 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/cumessage.cu: -------------------------------------------------------------------------------- 1 | #include 2 | #include "cumessage.h" 3 | 4 | __global__ void GPUFunction_kernel() { 5 | printf("Hello world from the GPU.\n"); 6 | } 7 | 8 | void GPUFunction() { 9 | GPUFunction_kernel<<<1,1>>>(); 10 | 11 | // kernel execution is asynchronous so sync on its completion 12 | cudaDeviceSynchronize(); 13 | } 14 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/cumessage.h: -------------------------------------------------------------------------------- 1 | void GPUFunction(); 2 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=gpu-lib # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G per cpu-core is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:01:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | module purge 12 | module load cudatoolkit/11.7 13 | export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH 14 | 15 | ./myapp 16 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/myapp.cu: -------------------------------------------------------------------------------- 1 | #include 2 | #include "cumessage.h" 3 | 4 | void CPUFunction() { 5 | printf("Hello world from the CPU.\n"); 6 | } 7 | 8 | int main() { 9 | // function to run on the cpu 10 | CPUFunction(); 11 | 12 | // function to run on the gpu 13 | GPUFunction(); 14 | 15 | return 0; 16 | } 17 | -------------------------------------------------------------------------------- /05_cuda_libraries/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cuda-libs # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose gpu80, a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load cudatoolkit/12.2 14 | 15 | ./gesvdj_example 16 | -------------------------------------------------------------------------------- /05_cuda_libraries/matrixMul/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cuda-libs # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default) 
7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load cudatoolkit/12.2 14 | 15 | ./matrixMul 16 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/README.md: -------------------------------------------------------------------------------- 1 | # Hello World 2 | 3 | On this page we consider the simplest CPU C code and the simplest CUDA C GPU code. 4 | 5 | ## CPU 6 | 7 | A simple CPU-only code: 8 | 9 | ```C 10 | #include <stdio.h> 11 | 12 | void CPUFunction() { 13 | printf("Hello world from the CPU.\n"); 14 | } 15 | 16 | int main() { 17 | // function to run on the cpu 18 | CPUFunction(); 19 | } 20 | ``` 21 | 22 | This can be compiled and run with: 23 | 24 | ``` 25 | $ cd gpu_programming_intro/06_cuda_kernels/01_hello_world 26 | $ gcc -o hello_world hello_world.c 27 | $ ./hello_world 28 | ``` 29 | 30 | The output is 31 | 32 | ``` 33 | Hello world from the CPU. 34 | ``` 35 | 36 | ## GPU 37 | 38 | Below is a simple GPU code that calls a CPU function followed by a GPU function: 39 | 40 | ```C 41 | #include <stdio.h> 42 | 43 | void CPUFunction() { 44 | printf("Hello world from the CPU.\n"); 45 | } 46 | 47 | __global__ void GPUFunction() { 48 | printf("Hello world from the GPU.\n"); 49 | } 50 | 51 | int main() { 52 | // function to run on the cpu 53 | CPUFunction(); 54 | 55 | // function to run on the gpu 56 | GPUFunction<<<1, 1>>>(); 57 | 58 | // kernel execution is asynchronous so sync on its completion 59 | cudaDeviceSynchronize(); 60 | } 61 | ``` 62 | 63 | The GPU code above can be compiled and executed with: 64 | 65 | ``` 66 | $ module load cudatoolkit/12.2 67 | $ nvcc -o hello_world_gpu hello_world_gpu.cu 68 | $ sbatch job.slurm 69 | ``` 70 | 71 | The output should be: 72 | 73 | ``` 74 | $ cat slurm-*.out 75 | Hello world from the CPU. 76 | Hello world from the GPU. 77 | ``` 78 | 79 | `nvcc` is the NVIDIA CUDA Compiler. It compiles the GPU code itself and uses GNU `gcc` to compile the CPU code. CUDA provides extensions for many common programming languages (e.g., C/C++/Fortran). These language extensions allow developers to write GPU functions. 80 | 81 | From this simple example we learn that GPU functions are declared with `__global__`, which is a CUDA C/C++ keyword. The triple angle brackets or so-called "triple chevron" is used to specify the execution configuration of the kernel launch, which is a call from host code to device code. 82 | 83 | Here is the general form for the execution configuration: `<<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>`. In the example above we used 1 block and 1 thread per block. At a high level, the execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks), as well as how many threads to execute in each block. 84 | 85 | Notice the return type of `void` for GPUFunction. It is required that GPU functions defined with the `__global__` keyword return type `void`. 86 | 87 | ### Exercises 88 | 89 | 1. What happens if you remove `__global__`? 90 | 91 | 2. Can you rewrite the code so that the output is: 92 | 93 | ``` 94 | Hello world from the CPU. 95 | Hello world from the GPU. 96 | Hello world from the CPU. 97 | ``` 98 | 99 | 3. What happens if you comment out the `cudaDeviceSynchronize()` line by preceding it with `//`?
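When experimenting with the exercises above, it can help to inspect the error codes returned by the CUDA runtime. Below is a small sketch (not part of the exercise files) showing how `cudaGetLastError()` and the return value of `cudaDeviceSynchronize()` can be used to confirm that the kernel launched and ran:

```C
#include <stdio.h>

__global__ void GPUFunction() {
  printf("Hello world from the GPU.\n");
}

int main() {
  GPUFunction<<<1, 1>>>();

  // cudaGetLastError() reports problems with the launch itself (e.g., a bad configuration)
  cudaError_t launch_err = cudaGetLastError();
  if (launch_err != cudaSuccess)
    printf("Launch error: %s\n", cudaGetErrorString(launch_err));

  // cudaDeviceSynchronize() waits for the kernel and reports errors that occur while it runs
  cudaError_t sync_err = cudaDeviceSynchronize();
  if (sync_err != cudaSuccess)
    printf("Execution error: %s\n", cudaGetErrorString(sync_err));

  return 0;
}
```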
100 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/hello_world.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | int main() { 8 | // function to run on the cpu 9 | CPUFunction(); 10 | } 11 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/hello_world_gpu.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | __global__ void GPUFunction() { 8 | printf("Hello world from the GPU.\n"); 9 | } 10 | 11 | int main() { 12 | // function to run on the cpu 13 | CPUFunction(); 14 | 15 | // function to run on the gpu 16 | GPUFunction<<<1, 1>>>(); 17 | 18 | // kernel execution is asynchronous so sync on its completion 19 | cudaDeviceSynchronize(); 20 | } 21 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=hw-gpu # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./hello_world_gpu 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/README.md: -------------------------------------------------------------------------------- 1 | # Launching Parallel Kernels 2 | 3 | The execution configuration allows programmers to specify details about launching the kernel to run in parallel on multiple GPU threads. More precisely, the execution configuration allows programmers to specify how many groups of threads (called thread blocks) and how many threads they would like each thread block to contain. The syntax for this is: 4 | 5 | ``` 6 | <<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>> 7 | ``` 8 | 9 | The kernel code is executed by every thread in every thread block configured when the kernel is launched. The image below corresponds to `<<<1, 5>>>`: 10 | 11 | ![thread-block](https://miro.medium.com/max/1118/1*e_FAITzOXSearSZYNWnmKQ.png) 12 | 13 | 14 | ## CPU Code 15 | 16 | ```c 17 | #include <stdio.h> 18 | 19 | void firstParallel() 20 | { 21 | printf("This should be running in parallel.\n"); 22 | } 23 | 24 | int main() 25 | { 26 | firstParallel(); 27 | } 28 | ``` 29 | 30 | ## Exercise: GPU implementation 31 | 32 | ``` 33 | # rewrite the CPU code above so that it runs on a GPU using multiple threads 34 | # save your file as first_parallel.cu (a starting file by this name is given -- see below) 35 | ``` 36 | 37 | The objective is to write a GPU code with one kernel launch that produces the following 6 lines of output: 38 | 39 | ``` 40 | This should be running in parallel. 41 | This should be running in parallel. 42 | This should be running in parallel. 43 | This should be running in parallel. 44 | This should be running in parallel. 45 | This should be running in parallel.
46 | ``` 47 | 48 | To get started: 49 | 50 | ``` 51 | $ cd gpu_programming_intro/06_cuda_kernels/02_simple_kernel 52 | # edit first_parallel.cu (use a text editor of your choice) 53 | $ nvcc -o first_parallel first_parallel.cu 54 | $ sbatch job.slurm 55 | ``` 56 | 57 | There are multiple possible solutions. 58 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/first_parallel.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | __global__ void GPUFunction() { 8 | printf("Hello world from the GPU.\n"); 9 | } 10 | 11 | int main() { 12 | // function to run on the cpu 13 | CPUFunction(); 14 | 15 | // function to run on the gpu 16 | GPUFunction<<<1, 1>>>(); 17 | 18 | // kernel execution is asynchronous so sync on its completion 19 | cudaDeviceSynchronize(); 20 | } 21 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=serial_c # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./first_parallel 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/solution.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | __global__ void firstParallel() 4 | { 5 | printf("This is running in parallel.\n"); 6 | } 7 | 8 | int main() 9 | { 10 | firstParallel<<<2, 3>>>(); 11 | cudaDeviceSynchronize(); 12 | } 13 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/README.md: -------------------------------------------------------------------------------- 1 | # Built-in Thread and Block Indices 2 | 3 | Each thread is given an index within its thread block, starting at 0. Additionally, each block is given an index, starting at 0. Threads are grouped into thread blocks, and thread blocks are grouped into grids, the highest entity in the CUDA thread hierarchy. (On recent GPUs such as the H100, thread blocks within a grid can optionally be grouped into thread block clusters.) 4 | 5 | ![intrinsic-indices](https://devblogs.nvidia.com/wp-content/uploads/2017/01/cuda_indexing.png) 6 | 7 | CUDA kernels have access to special variables identifying both the index of the thread (within the block) that is executing the kernel and the index of the block (within the grid) that the thread is in. These variables are `threadIdx.x` and `blockIdx.x` respectively.
Below is an example use of `threadIdx.x`: 8 | 9 | ```C 10 | __global__ void GPUFunction() { 11 | printf("My thread index is: %d\n", threadIdx.x); 12 | } 13 | ``` 14 | 15 | ## CPU implementation of a for loop 16 | 17 | ```C 18 | #include <stdio.h> 19 | 20 | void printLoopIndex() { 21 | int N = 100; 22 | for (int i = 0; i < N; ++i) 23 | printf("%d\n", i); 24 | } 25 | 26 | int main() { 27 | // function to run on the cpu 28 | printLoopIndex(); 29 | } 30 | ``` 31 | 32 | Run the CPU code above by following these commands: 33 | 34 | ```bash 35 | $ cd gpu_programming_intro/06_cuda_kernels/03_thread_indices 36 | $ nvcc -o for_loop for_loop.c 37 | $ ./for_loop 38 | ``` 39 | 40 | The output of the above is 41 | 42 | ``` 43 | 0 44 | 1 45 | 2 46 | ... 47 | 97 48 | 98 49 | 99 50 | ``` 51 | 52 | ## Exercise: GPU implementation 53 | 54 | In the CPU code above, the loop is carried out in serial. That is, loop iterations take place one at a time. Can you write a GPU code that produces the same output as that above but does so in parallel using a CUDA kernel? 55 | 56 | ``` 57 | // write a GPU kernel to produce the output above 58 | ``` 59 | 60 | To get started: 61 | 62 | ```bash 63 | $ module load cudatoolkit/12.2 64 | # edit for_loop.cu 65 | $ nvcc -o for_loop for_loop.cu 66 | $ sbatch job.slurm 67 | ``` 68 | 69 | Click [here](hint.md) to see some hints. 70 | 71 | One possible solution is [here](solution.cu) (try for yourself first). 72 | 73 | Are you seeing any behavior which is a multiple of 32 in this exercise? For NVIDIA, the threads within a thread block are organized into "warps". A "warp" is composed of 32 threads. [Read more](http://15418.courses.cs.cmu.edu/spring2013/article/15) about how `printf` works in CUDA. 74 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/for_loop.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void printLoopIndex() { 4 | int i; 5 | int N = 100; 6 | for (i = 0; i < N; ++i) 7 | printf("%d\n", i); 8 | } 9 | 10 | int main() { 11 | // function to run on the cpu 12 | printLoopIndex(); 13 | } 14 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/for_loop.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void printLoopIndex() { 4 | int N = 100; 5 | for (int i = 0; i < N; ++i) 6 | printf("%d\n", i); 7 | } 8 | 9 | int main() { 10 | // function to run on the cpu 11 | printLoopIndex(); 12 | } 13 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/hint.md: -------------------------------------------------------------------------------- 1 | ## Hints 2 | 3 | To understand how to do this exercise, take a look at the code below which uses `threadIdx.x`: 4 | 5 | ```C 6 | #include <stdio.h> 7 | 8 | __global__ void GPUFunction() { 9 | printf("My thread index is: %d\n", threadIdx.x); 10 | } 11 | 12 | int main() { 13 | GPUFunction<<<1, 1>>>(); 14 | cudaDeviceSynchronize(); 15 | } 16 | ``` 17 | 18 | The output of the code above is 19 | 20 | ``` 21 | My thread index is: 0 22 | ``` 23 | 24 | We need to replace the i variable in the CPU code. In a CUDA kernel, each thread has an index 25 | associated with it called `threadIdx.x`. So use that as the substitution for i.
26 | 27 | Next, to generate 100 threads, try a kernel launch like this: `<<<1, 100>>>` 28 | 29 | The above will give you 1 block composed of 100 threads. 30 | 31 | Be sure to add `__global__` to your GPU function and don't forget to call `cudaDeviceSynchronize()`. 32 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=for_loop # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./for_loop 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/solution.cu: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | __global__ void printLoopIndex() { 4 | printf("%d\n", threadIdx.x); 5 | } 6 | 7 | int main() { 8 | printLoopIndex<<<1, 100>>>(); 9 | cudaDeviceSynchronize(); 10 | } 11 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/README.md: -------------------------------------------------------------------------------- 1 | # Elementwise Vector Addition 2 | 3 | ## A Word on Allocating Memory 4 | 5 | Here is an example on the CPU where 10 integers are dynamically allocated and the last line frees the memory: 6 | 7 | ```C 8 | int N = 10; 9 | size_t size = N * sizeof(int); 10 | 11 | int *a; 12 | a = (int*)malloc(size); 13 | free(a); 14 | ``` 15 | 16 | On the GPU: 17 | 18 | ```C 19 | int N = 10; 20 | size_t size = N * sizeof(int); 21 | 22 | int *d_a; 23 | cudaMalloc(&d_a, size); 24 | cudaFree(d_a); 25 | ``` 26 | Note that we write `d_a` for the GPU case instead of `a` to remind ourselves that we are allocating memory on the "device" or GPU. Sometimes developers will prefix CPU variables with 'h' to denote "host". 27 | 28 | ![add-arrays](https://www3.ntu.edu.sg/home/ehchua/programming/cpp/images/Array.png) 29 | 30 | The vectors `a` and `b` are added elementwise to produce the vector `c`: 31 | 32 | ``` 33 | c[0] = a[0] + b[0] 34 | c[1] = a[1] + b[1] 35 | ... 
36 | c[N-1] = a[N-1] + b[N-1] 37 | ``` 38 | 39 | ## CPU 40 | 41 | The following code adds two vectors together on a CPU: 42 | 43 | ```C 44 | #include <stdio.h> 45 | #include <stdlib.h> 46 | #include <math.h> 47 | #include "timer.h" 48 | 49 | void vecAdd(double *a, double *b, double *c, int n) 50 | { 51 | int i; 52 | for (i = 0; i < n; i++) { 53 | c[i] = a[i] + b[i]; 54 | } 55 | } 56 | 57 | int main(int argc, char* argv[]) 58 | { 59 | // Size of vectors 60 | int n = 2000; 61 | 62 | // Host input vectors 63 | double *h_a; 64 | double *h_b; 65 | //Host output vector 66 | double *h_c; 67 | 68 | // Size, in bytes, of each vector 69 | size_t bytes = n*sizeof(double); 70 | 71 | // Allocate memory for each vector on host 72 | h_a = (double*)malloc(bytes); 73 | h_b = (double*)malloc(bytes); 74 | h_c = (double*)malloc(bytes); 75 | 76 | int i; 77 | // Initialize vectors on host 78 | for (i = 0; i < n; i++) { 79 | h_a[i] = sin(i)*sin(i); 80 | h_b[i] = cos(i)*cos(i); 81 | } 82 | 83 | // add the two vectors 84 | vecAdd(h_a, h_b, h_c, n); 85 | 86 | // Release host memory 87 | free(h_a); 88 | free(h_b); 89 | free(h_c); 90 | 91 | return 0; 92 | } 93 | ``` 94 | 95 | Take a look at `vector_add_cpu.c`. You will see that it allocates three arrays of size `n` and then fills `a` and `b` with values. The `vecAdd` function is then called to perform the elementwise addition of the two arrays producing a third array `c`: 96 | 97 | ```C 98 | void vecAdd(double *a, double *b, double *c, int n) { 99 | int i; 100 | for (i = 0; i < n; i++) { 101 | c[i] = a[i] + b[i]; 102 | } 103 | } 104 | ``` 105 | 106 | 107 | The output reports the time taken to perform the addition ignoring the memory allocation and initialization. Build and run the code: 108 | 109 | ``` 110 | $ cd gpu_programming_intro/06_cuda_kernels/04_vector_addition 111 | $ gcc -O3 -march=native -o vector_add_cpu vector_add_cpu.c -lm 112 | $ ./vector_add_cpu 113 | ``` 114 | 115 | ## GPU 116 | 117 | The following code adds two vectors together on a GPU: 118 | 119 | ```C 120 | #include <stdio.h> 121 | #include <stdlib.h> 122 | #include <math.h> 123 | #include "timer.h" 124 | 125 | // each thread is responsible for one element of c 126 | __global__ void vecAdd(double *a, double *b, double *c, int n) 127 | { 128 | // Get our global thread ID 129 | int id = blockIdx.x * blockDim.x + threadIdx.x; 130 | int stride = gridDim.x * blockDim.x; 131 | 132 | // Make sure we do not go out of bounds 133 | int i; 134 | for (i = id; i < n; i += stride) 135 | c[i] = a[i] + b[i]; 136 | } 137 | 138 | int main(int argc, char* argv[]) 139 | { 140 | // Size of vectors 141 | int n = 2000; 142 | 143 | // Host input vectors 144 | double *h_a; 145 | double *h_b; 146 | //Host output vector 147 | double *h_c; 148 | 149 | // Device input vectors 150 | double *d_a; 151 | double *d_b; 152 | //Device output vector 153 | double *d_c; 154 | 155 | // Size, in bytes, of each vector 156 | size_t bytes = n*sizeof(double); 157 | 158 | // Allocate memory for each vector on host 159 | h_a = (double*)malloc(bytes); 160 | h_b = (double*)malloc(bytes); 161 | h_c = (double*)malloc(bytes); 162 | 163 | int i; 164 | // Initialize vectors on host 165 | for (i = 0; i < n; i++) { 166 | h_a[i] = sin(i)*sin(i); 167 | h_b[i] = cos(i)*cos(i); 168 | } 169 | 170 | // Allocate memory for each vector on GPU 171 | cudaMalloc(&d_a, bytes); 172 | cudaMalloc(&d_b, bytes); 173 | cudaMalloc(&d_c, bytes); 174 | 175 | // Copy host vectors to device 176 | cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); 177 | cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice); 178 | 179 | int blockSize,
gridSize; 180 | 181 | // Number of threads in each thread block 182 | blockSize = 1024; 183 | 184 | // Number of thread blocks in grid 185 | gridSize = (int)ceil((double)n/blockSize); 186 | if (gridSize > 65535) gridSize = 32000; 187 | // Execute the kernel 188 | vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); 189 | 190 | // Copy array back to host 191 | cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); 192 | 193 | // Release device memory 194 | cudaFree(d_a); 195 | cudaFree(d_b); 196 | cudaFree(d_c); 197 | 198 | cudaDeviceSynchronize(); 199 | 200 | // Release host memory 201 | free(h_a); 202 | free(h_b); 203 | free(h_c); 204 | 205 | return 0; 206 | } 207 | ``` 208 | 209 | The `vecAdd` function has been replaced with a CUDA kernel: 210 | 211 | ```C 212 | __global__ void vecAdd(double *a, double *b, double *c, int n) 213 | { 214 | // Get our global thread ID 215 | int id = blockIdx.x * blockDim.x + threadIdx.x; 216 | int stride = gridDim.x * blockDim.x; 217 | 218 | // Make sure we do not go out of bounds 219 | int i; 220 | for (i = id; i < n; i += stride) 221 | c[i] = a[i] + b[i]; 222 | } 223 | ``` 224 | 225 | The kernel uses special variables which are CUDA extensions to allow threads to distinguish themselves and operate on different data. Specifically, `blockIdx.x` is the block index within a grid, `blockDim.x` is the number of threads per block and `threadIdx.x` is the thread index within a block. Let's build and run the code. The `nvcc` compiler will compile the kernel function while `gcc` will be used in the background to compile the CPU code. 226 | 227 | ``` 228 | $ module load cudatoolkit/12.2 229 | $ nvcc -O3 -arch=sm_80 -o vector_add_gpu vector_add_gpu.cu # use sm_70 on Traverse or an Adroit V100 node 230 | $ sbatch job.slurm 231 | ``` 232 | 233 | The output of the code will be something like: 234 | ``` 235 | Allocating CPU memory and populating arrays of length 2000 ... done. 236 | GridSize 2 and total_threads 2048 237 | Performing vector addition (timer started) ... done in 0.09 s. 238 | ``` 239 | 240 | Note that the reported time includes more than just the addition on the GPU. It also includes the time required to allocate and deallocate memory on the GPU and the time required to move the data to and from the GPU. 241 | 242 | To use a GPU effectively, the problem you are solving must have a vast amount of data parallelism and a sufficiently large overall amount of computation. In the example here the parallelism is high (one can assign a different thread to each of the individual elements) but the overall amount of computation is low, so the CPU wins out in performance. Contrast this with a large matrix-matrix multiply where both conditions are satisfied and the GPU wins. For problems involving recursion, sorting or small amounts of data, it becomes difficult to take advantage of a GPU. 243 | 244 | ## Advanced Examples 245 | 246 | For more advanced examples return to the NVIDIA CUDA samples at the bottom of [this page](https://github.com/PrincetonUniversity/gpu_programming_intro/tree/master/05_cuda_libraries#nvidia-cuda-samples).
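The timing discussion above lumps the kernel together with device allocation and host-device transfers. As a rough, illustrative sketch (not part of `vector_add_gpu.cu`), one way to time only the kernel is with CUDA events; it assumes the `vecAdd` kernel, `gridSize`, `blockSize`, `n` and the device arrays `d_a`, `d_b`, `d_c` are defined as in the code above:

```C
// Illustrative sketch: time only the kernel with CUDA events.
// Assumes d_a, d_b, d_c are already allocated and populated on the device.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Timed this way, the kernel alone is typically far faster than the end-to-end number reported via `timer.h`, which is another way of seeing that data movement dominates for such a small problem.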
247 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=vec-add # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./vector_add_gpu 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/timer.h: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2012 NVIDIA Corporation 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 15 | */ 16 | 17 | #ifndef TIMER_H 18 | #define TIMER_H 19 | 20 | #include <stdio.h> 21 | 22 | #ifdef WIN32 23 | #define WIN32_LEAN_AND_MEAN 24 | #include <windows.h> 25 | #else 26 | #include <sys/time.h> 27 | #endif 28 | 29 | #ifdef WIN32 30 | double PCFreq = 0.0; 31 | __int64 timerStart = 0; 32 | #else 33 | struct timeval timerStart; 34 | #endif 35 | 36 | void StartTimer() 37 | { 38 | #ifdef WIN32 39 | LARGE_INTEGER li; 40 | if(!QueryPerformanceFrequency(&li)) 41 | printf("QueryPerformanceFrequency failed!\n"); 42 | 43 | PCFreq = (double)li.QuadPart/1000.0; 44 | 45 | QueryPerformanceCounter(&li); 46 | timerStart = li.QuadPart; 47 | #else 48 | gettimeofday(&timerStart, NULL); 49 | #endif 50 | } 51 | 52 | // time elapsed in ms 53 | double GetTimer() 54 | { 55 | #ifdef WIN32 56 | LARGE_INTEGER li; 57 | QueryPerformanceCounter(&li); 58 | return (double)(li.QuadPart-timerStart)/PCFreq; 59 | #else 60 | struct timeval timerStop, timerElapsed; 61 | gettimeofday(&timerStop, NULL); 62 | timersub(&timerStop, &timerStart, &timerElapsed); 63 | return timerElapsed.tv_sec*1000.0+timerElapsed.tv_usec/1000.0; 64 | #endif 65 | } 66 | 67 | #endif // TIMER_H 68 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/vector_add_cpu.c: -------------------------------------------------------------------------------- 1 | /* CPU VERSION */ 2 | 3 | // modified from https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/ 4 | 5 | #include <stdio.h> 6 | #include <stdlib.h> 7 | #include <math.h> 8 | #include "timer.h" 9 | 10 | void vecAdd(double *a, double *b, double *c, int n) 11 | { 12 | int i; 13 | for(i = 0; i < n; i++) { 14 | c[i] = a[i] + b[i]; 15 | } 16 | } 17 | 18 | int main( int argc, char* argv[] ) 19 | { 20 | // Size of vectors 21 | int n = 2000; 22 | 23 | // Host input vectors 24 | double *h_a; 25 | double *h_b; 26 | //Host output vector 27 | double *h_c; 28 | 29 | // Size, in bytes, of each vector 30 | size_t bytes =
n*sizeof(double); 31 | 32 | // Allocate memory for each vector on host 33 | fprintf(stderr, "Allocating memory and populating arrays of length %d ...", n); 34 | h_a = (double*)malloc(bytes); 35 | h_b = (double*)malloc(bytes); 36 | h_c = (double*)malloc(bytes); 37 | 38 | int i; 39 | // Initialize vectors on host 40 | for( i = 0; i < n; i++ ) { 41 | h_a[i] = sin(i)*sin(i); 42 | h_b[i] = cos(i)*cos(i); 43 | } 44 | 45 | fprintf(stderr, " done.\n"); 46 | fprintf(stderr, "Performing vector addition (timer started) ..."); 47 | StartTimer(); 48 | 49 | // add the two vectors 50 | vecAdd(h_a, h_b, h_c, n); 51 | 52 | double runtime = GetTimer(); 53 | fprintf(stderr, " done in %.2f s.\n", runtime / 1000); 54 | 55 | // Sum up vector c and print result divided by n, this should equal 1 within error 56 | double sum = 0; 57 | for(i=0; i<n; i++) sum += h_c[i]; 58 | sum = sum/n; 59 | double tol = 1e-6; /* assumed tolerance value */ 60 | if (fabs(sum - 1.0) > tol) printf("Warning: potential numerical problems.\n"); 61 | 62 | // Release host memory 63 | free(h_a); 64 | free(h_b); 65 | free(h_c); 66 | 67 | return 0; 68 | } 69 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/vector_add_gpu.cu: -------------------------------------------------------------------------------- 1 | /* GPU Version */ 2 | 3 | // original file is https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/ 4 | 5 | #include <stdio.h> 6 | #include <stdlib.h> 7 | #include <math.h> 8 | #include "timer.h" 9 | 10 | // CUDA kernel. Each thread takes care of one element of c 11 | __global__ void vecAdd(double *a, double *b, double *c, int n) 12 | { 13 | // Get our global thread ID 14 | int id = blockIdx.x * blockDim.x + threadIdx.x; 15 | int stride = gridDim.x * blockDim.x; 16 | 17 | // Make sure we do not go out of bounds 18 | int i; 19 | for (i = id; i < n; i += stride) 20 | c[i] = a[i] + b[i]; 21 | } 22 | 23 | int main( int argc, char* argv[] ) 24 | { 25 | // Size of vectors 26 | int n = 2000; 27 | 28 | // Host input vectors 29 | double *h_a; 30 | double *h_b; 31 | //Host output vector 32 | double *h_c; 33 | 34 | // Device input vectors 35 | double *d_a; 36 | double *d_b; 37 | //Device output vector 38 | double *d_c; 39 | 40 | // Size, in bytes, of each vector 41 | size_t bytes = n*sizeof(double); 42 | 43 | // Allocate memory for each vector on host 44 | fprintf(stderr, "Allocating CPU memory and populating arrays of length %d ...", n); 45 | h_a = (double*)malloc(bytes); 46 | h_b = (double*)malloc(bytes); 47 | h_c = (double*)malloc(bytes); 48 | 49 | int i; 50 | // Initialize vectors on host 51 | for( i = 0; i < n; i++ ) { 52 | h_a[i] = sin(i)*sin(i); 53 | h_b[i] = cos(i)*cos(i); 54 | } 55 | fprintf(stderr, " done.\n"); 56 | 57 | fprintf(stderr, "Performing vector addition (timer started) ..."); 58 | StartTimer(); 59 | 60 | // Allocate memory for each vector on GPU 61 | cudaMalloc(&d_a, bytes); 62 | cudaMalloc(&d_b, bytes); 63 | cudaMalloc(&d_c, bytes); 64 | 65 | // Copy host vectors to device 66 | cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); 67 | cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice); 68 | 69 | int blockSize, gridSize; 70 | 71 | // Number of threads in each thread block 72 | blockSize = 1024; 73 | 74 | // Number of thread blocks in grid 75 | gridSize = (int)ceil((double)n/blockSize); 76 | if (gridSize > 65535) gridSize = 32000; 77 | printf("GridSize %d and total_threads %d\n", gridSize, gridSize * blockSize); 78 | // Execute the kernel 79 | vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); 80 | 81 | // Copy array back to host 82 | cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost ); 83 | 84 | // Release device memory 85
| cudaFree(d_a); 86 | cudaFree(d_b); 87 | cudaFree(d_c); 88 | 89 | cudaDeviceSynchronize(); 90 | 91 | double runtime = GetTimer(); 92 | fprintf(stderr, " done in %.2f s.\n", runtime / 1000); 93 | 94 | // Sum up vector c and print result divided by n, this should equal 1 within error 95 | double sum = 0; 96 | for(i=0; i<n; i++) 97 | sum += h_c[i]; 98 | sum = sum/n; 99 | double tol = 1e-6; /* assumed tolerance value */ 100 | if (fabs(sum - 1.0) > tol) printf("Warning: potential numerical problems.\n"); 101 | 102 | // Release host memory 103 | free(h_a); 104 | free(h_b); 105 | free(h_c); 106 | 107 | return 0; 108 | } 109 | -------------------------------------------------------------------------------- /06_cuda_kernels/05_multiple_gpus/README.md: -------------------------------------------------------------------------------- 1 | # Multiple GPUs 2 | 3 | The code in this directory illustrates the use of multiple GPUs. To compile and execute the example, run the following commands: 4 | 5 | ``` 6 | $ module load cudatoolkit/12.2 7 | $ nvcc -O3 -arch=sm_80 -o multi_gpu multi_gpu.cu 8 | $ sbatch job.slurm 9 | ``` 10 | 11 | On Traverse and the Adroit V100 nodes, replace `sm_80` with `sm_70`. 12 | 13 | See also `Samples/0_Introduction/simpleMultiGPU` in the NVIDIA samples which are discussed in `05_cuda_libraries`. 14 | -------------------------------------------------------------------------------- /06_cuda_kernels/05_multiple_gpus/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=multi-gpu # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G per cpu-core is default) 7 | #SBATCH --gres=gpu:2 # number of gpus per node 8 | #SBATCH --time=00:01:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | module purge 12 | module load cudatoolkit/12.2 13 | 14 | ./multi_gpu 15 | -------------------------------------------------------------------------------- /06_cuda_kernels/05_multiple_gpus/multi_gpu.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | __global__ void GPUFunction(int myid) { 8 | printf("Hello world from GPU %d.\n", myid); 9 | } 10 | 11 | int main() { 12 | 13 | // function to run on the cpu 14 | CPUFunction(); 15 | 16 | int deviceCount; 17 | cudaGetDeviceCount(&deviceCount); 18 | int device; 19 | for (device=0; device < deviceCount; ++device) { 20 | cudaDeviceProp deviceProp; 21 | cudaGetDeviceProperties(&deviceProp, device); 22 | printf("Device %d has compute capability %d.%d.\n", 23 | device, deviceProp.major, deviceProp.minor); 24 | } 25 | 26 | // run on gpu 0 27 | int device_id = 0; 28 | cudaSetDevice(device_id); 29 | GPUFunction<<<1, 1>>>(device_id); 30 | 31 | // run on gpu 1 32 | device_id = 1; 33 | cudaSetDevice(device_id); 34 | GPUFunction<<<1, 1>>>(device_id); 35 | 36 | // kernel execution is asynchronous so sync on its completion 37 | cudaDeviceSynchronize(); 38 | } 39 | -------------------------------------------------------------------------------- /06_cuda_kernels/README.md: -------------------------------------------------------------------------------- 1 | # CUDA kernels 2 | 3 | In this section you will write GPU kernels from scratch. To get started click on `01_hello_world` above.
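One thing to note about `multi_gpu.cu` above: it launches on device 1 unconditionally, so it needs at least two visible GPUs (hence `--gres=gpu:2` in the Slurm script). As a rough, hypothetical sketch (not part of the repository code), the second launch could be guarded using the device count that the program already queries:

```C
// Hypothetical guard: only launch on a second GPU if one is visible.
// Assumes the GPUFunction kernel and deviceCount from multi_gpu.cu above.
if (deviceCount > 1) {
  cudaSetDevice(1);              // switch to the second GPU
  GPUFunction<<<1, 1>>>(1);      // launch the same kernel there
} else {
  printf("Only %d GPU visible; skipping the second launch.\n", deviceCount);
}
cudaDeviceSynchronize();         // wait for any outstanding kernels to finish
```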
4 | -------------------------------------------------------------------------------- /07_advanced_and_other/README.md: -------------------------------------------------------------------------------- 1 | # Advanced and Other 2 | 3 | ## CUDA-Aware MPI 4 | 5 | On Della you will see MPI modules that have been built against CUDA. These modules enable [CUDA-aware MPI](https://developer.nvidia.com/mpi-solutions-gpus) where 6 | memory on a GPU can be sent to another GPU without involving the CPU. According to NVIDIA: 7 | 8 | > Regular MPI implementations pass pointers to host memory, staging GPU buffers through host memory using cudaMemcopy. 9 | 10 | > With [CUDA-aware MPI](https://developer.nvidia.com/mpi-solutions-gpus), the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. Implementation of CUDA-aware MPI was simplified by Unified Virtual Addressing (UVA) in CUDA 4.0 – which enables a single address space for all CPU and GPU memory. CUDA-aware implementations of MPI have several advantages. 11 | 12 | See the CUDA-aware MPI modules on Della: 13 | 14 | ``` 15 | $ ssh <YourNetID>@della.princeton.edu 16 | $ module avail openmpi/cuda 17 | 18 | ------------- /usr/local/share/Modules/modulefiles ------------- 19 | openmpi/cuda-11.1/gcc/4.1.1 openmpi/cuda-11.3/nvhpc-21.5/4.1.1 20 | ``` 21 | 22 | ## GPU Direct 23 | 24 | [GPU Direct](https://developer.nvidia.com/gpudirect) is a solution to the problem of data-starved GPUs. 25 | 26 | ![gpu-direct](https://developer.nvidia.com/sites/default/files/akamai/GPUDirect/cuda-gpu-direct-blog-refresh_diagram_1.png) 27 | 28 | > Using GPUDirect™, multiple GPUs, network adapters, solid-state drives (SSDs) and now NVMe drives can directly read and write CUDA host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead, and reducing latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA Tesla™ and Quadro™ products 29 | 30 | GPUDirect is enabled on `della` and `traverse`. 31 | 32 | ## GPU Sharing 33 | 34 | Many GPU applications only use the GPU for a fraction of the time. For many years, a goal of GPU vendors has been to allow for GPU sharing between applications. Slurm is capable of supporting this through the `--gpu-mps` option. 35 | 36 | ## OpenMP 4.5+ 37 | 38 | Recent implementations of [OpenMP](https://www.openmp.org/) support GPU programming. However, they are not mature and should not be favored. 39 | 40 | ## CUDA Kernels versus OpenACC over the Long Term 41 | 42 | CUDA kernels are written at a low level. OpenACC is a high-level programming model. Because GPU hardware is changing rapidly, some argue that writing GPU codes with OpenACC is a better choice because there is much less work to do when new hardware comes out. The same holds true for Kokkos. 43 | 44 | [See the materials](http://w3.pppl.gov/~ethier/PICSCIE/Intro_to_OpenACC_Nov_2019.pdf) for an OpenACC workshop by Stephane Ethier. Be aware of the OpenACC Slack channel for getting help. 45 | 46 | ## Using the Intel Compiler 47 | 48 | Note the use of `auto` in the code below: 49 | 50 | ```c++ 51 | #include <stdio.h> 52 | 53 | __global__ void simpleKernel() 54 | { 55 | auto i = blockDim.x * blockIdx.x + threadIdx.x; 56 | printf("Index: %d\n", i); 57 | } 58 | 59 | int main() 60 | { 61 | simpleKernel<<<2, 3>>>(); 62 | cudaDeviceSynchronize(); 63 | } 64 | ``` 65 | 66 | The C++11 language standard introduced the `auto` keyword.
To compile the code with the Intel compiler for Della: 67 | 68 | ``` 69 | $ module load intel/19.1.1.217 70 | $ module load cudatoolkit/11.7 71 | $ nvcc -ccbin=icpc -std=c++11 -arch=sm_80 -o simple simple.cu 72 | ``` 73 | 74 | In general, NVIDIA engineers strongly recommend using GCC over the Intel compiler. 75 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to GPU Computing 2 | 3 | ## About 4 | 5 | This guide provides materials for getting started with running GPU codes on the Princeton Research Computing clusters. It also provides an introduction to writing CUDA kernels and examples of using the NVIDIA GPU-accelerated libraries (e.g., cuBLAS). 6 | 7 | ## Upcoming GPU Training 8 | 9 | [Princeton GPU User Group](https://researchcomputing.princeton.edu/learn/user-groups/gpu) 10 | [See all PICSciE/RC workshops](https://researchcomputing.princeton.edu/learn/workshops-live-training) 11 | [Subscribe to PICSciE/RC Mailing List](https://researchcomputing.princeton.edu/subscribe) 12 | 13 | ## Learning Resources 14 | 15 | [GPU Computing at Princeton](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) 16 | [2025 Princeton GPU Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwmKa2AI/se000356) 17 | [Resource List by Open Hackathons](https://www.openhackathons.org/s/technical-resources) 18 | [Training Archive at Oak Ridge National Laboratory](https://docs.olcf.ornl.gov/training/training_archive.html) 19 | [LeetGPU - Free GPU Simulator](https://leetgpu.com/) 20 | [CUDA C++ Programming Guide by NVIDIA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) 21 | [CUDA Fortran Programming Guide by NVIDIA](https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html) 22 | [Intro to CUDA Blog Post](https://developer.nvidia.com/blog/even-easier-introduction-cuda/?mkt_tok=MTU2LU9GTi03NDIAAAGad2PhouORjrUMHihUOvdy-syejFRkc-7otOyEDUy4HXOnJ85JjZ-gUs-lGlbdvG-hpVpXtxlpVN4EOvosdmaWcaSV9TQa84zICsZ3IdKBp5L69uOLQDsm) 23 | [Online Book Available through PU Library](https://catalog.princeton.edu/catalog/99125304171206421) 24 | [Princeton A100 GPU Workshop](https://github.com/PrincetonUniversity/a100_workshop) 25 | 26 | ## Getting Help 27 | 28 | If you encounter any difficulties with this material then please send an email to cses@princeton.edu or attend a help session. 29 | 30 | ## Authorship 31 | 32 | This guide was created by Jonathan Halverson and members of Princeton Research Computing. 33 | -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | # Introduction to GPU Computing 2 | 3 | ## Setup for live workshop 4 | 5 | ### Point your browser to `https://bit.ly/36g5YUS` 6 | 7 | + Connect to the eduroam wireless network 8 | 9 | + Open a terminal (e.g., Terminal, PowerShell, PuTTY) [click here for help] 10 | 11 | + Request an [account on Adroit](https://forms.rc.princeton.edu/registration/?q=adroit). 
12 | 13 | + Please SSH to Adroit in the terminal: `ssh <YourNetID>@adroit.princeton.edu` [click [here](https://researchcomputing.princeton.edu/faq/why-cant-i-login-to-a-clu) for help] 14 | 15 | + If you are new to Linux then consider using the MyAdroit web portal: [https://myadroit.princeton.edu](https://myadroit.princeton.edu) (VPN required from off-campus) 16 | 17 | + Clone this repo on Adroit: 18 | 19 | ``` 20 | $ cd /scratch/network/$USER 21 | $ git clone https://github.com/PrincetonUniversity/gpu_programming_intro.git 22 | $ cd gpu_programming_intro 23 | ``` 24 | 25 | + For the live workshop, to get access to the GPU nodes on Adroit, submit your jobs using the workshop reservation: 26 | 27 | `$ sbatch --reservation=gpuprimer job.slurm` 28 | 29 | + Go to the [main page](https://github.com/PrincetonUniversity/gpu_programming_intro) of this repo 30 | --------------------------------------------------------------------------------