├── 01_what_is_a_gpu
│   ├── README.md
│   └── pli.md
├── 02_cuda_toolkit
│   └── README.md
├── 03_your_first_gpu_job
│   ├── README.md
│   ├── cupy
│   │   ├── job.slurm
│   │   ├── lu.py
│   │   └── svd.py
│   ├── julia
│   │   ├── job.slurm
│   │   └── svd.jl
│   ├── matlab
│   │   ├── job.slurm
│   │   └── svd.m
│   ├── pytorch
│   │   ├── job.slurm
│   │   └── svd.py
│   └── tensorflow
│       ├── job.slurm
│       └── svd.py
├── 04_gpu_tools
│   └── README.md
├── 05_cuda_libraries
│   ├── README.md
│   ├── gesvdj_example.cpp
│   ├── hello_world_gpu_library
│   │   ├── README.md
│   │   ├── cumessage.cu
│   │   ├── cumessage.h
│   │   ├── job.slurm
│   │   └── myapp.cu
│   ├── job.slurm
│   └── matrixMul
│       └── job.slurm
├── 06_cuda_kernels
│   ├── 01_hello_world
│   │   ├── README.md
│   │   ├── hello_world.c
│   │   ├── hello_world_gpu.cu
│   │   └── job.slurm
│   ├── 02_simple_kernel
│   │   ├── README.md
│   │   ├── first_parallel.cu
│   │   ├── job.slurm
│   │   └── solution.cu
│   ├── 03_thread_indices
│   │   ├── README.md
│   │   ├── for_loop.c
│   │   ├── for_loop.cu
│   │   ├── hint.md
│   │   ├── job.slurm
│   │   └── solution.cu
│   ├── 04_vector_addition
│   │   ├── README.md
│   │   ├── job.slurm
│   │   ├── timer.h
│   │   ├── vector_add_cpu.c
│   │   └── vector_add_gpu.cu
│   ├── 05_multiple_gpus
│   │   ├── README.md
│   │   ├── job.slurm
│   │   └── multi_gpu.cu
│   └── README.md
├── 07_advanced_and_other
│   └── README.md
├── README.md
└── setup.md
/01_what_is_a_gpu/README.md:
--------------------------------------------------------------------------------
1 | # What is a GPU?
2 |
3 | A GPU, or Graphics Processing Unit, is an electronic device originally designed for manipulating the images that appear on a computer monitor. However, beginning in 2006 with NVIDIA CUDA, GPUs have become widely used for accelerating computation in various fields including image processing and machine learning.
4 |
5 | Relative to the CPU, GPUs have a far greater number of processing cores but with slower clock speeds. Within a group of 32 threads called a warp (in NVIDIA terminology), each thread carries out the same operation on a different piece of data. This is the SIMT paradigm (single instruction, multiple threads). GPUs tend to have much less memory than what is available to the CPU. For instance, the H100 GPUs on Della have 80 GB compared to 1000 GB available to the CPU cores. This is an important consideration when designing algorithms and running jobs. Furthermore, GPUs are intended for highly parallel algorithms. The CPU can often outperform a GPU on algorithms that are not highly parallelizable, such as those that rely on data caching and flow control (e.g., "if" statements).
6 |
7 | Many of the fastest supercomputers in the world use GPUs (see [Top 500](https://top500.org/lists/top500/2024/11/)). How many of the top 10 supercomputers use GPUs?
8 |
9 | NVIDIA has been the leading player in GPUs for HPC. However, the GPU market landscape changed in May 2019 when the US DoE announced that Frontier, the first exascale supercomputer in the US, would be based on [AMD GPUs](https://www.hpcwire.com/2019/05/07/cray-amd-exascale-frontier-at-oak-ridge/) and CPUs. Princeton has two [MI210 GPUs](https://researchcomputing.princeton.edu/amd-mi100-gpu-testing) which you can use for testing. Intel is also a GPU producer with the [Aurora supercomputer](https://en.wikipedia.org/wiki/Aurora_(supercomputer)) being an example.
10 |
11 | All laptops have a GPU for graphics. It is becoming standard for a laptop to have a second GPU dedicated for compute (see the latest [MacBook Pro](https://www.apple.com/macbook-pro/)).
12 |
13 | 
14 |
15 | The image below emphasizes the cache sizes and flow control:
16 |
17 | 
18 |
19 | Like a CPU, a GPU has a hierarchical structure with respect to both the execution units and memory. A warp is a unit of 32 threads. Threads are launched in blocks, and NVIDIA GPUs impose a limit of 1024 threads per block. Each block consists of an integral number of warps and executes on a streaming multiprocessor (SM). There are tens of SMs per GPU. Each thread has its own private memory, the threads of a block share a limited amount of fast shared memory, and, finally, there is the global memory, which is accessible to every block in the grid (the collection of all blocks).
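
To make the hierarchy concrete, below is a minimal sketch (illustrative only, not part of this repository) that uses CuPy's `RawKernel` to launch a grid of 4 blocks with 256 threads per block; each thread computes its global index from its block and thread indices:

```python
import cupy as cp

# Each thread writes its global index into the output array.
kernel = cp.RawKernel(r'''
extern "C" __global__
void global_index(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block and thread indices
    out[i] = i;
}
''', 'global_index')

threads_per_block = 256   # must not exceed 1024 (here, 8 warps of 32 threads)
blocks_per_grid = 4       # the grid is the collection of blocks
out = cp.zeros(threads_per_block * blocks_per_grid, dtype=cp.int32)

kernel((blocks_per_grid,), (threads_per_block,), (out,))
print(out[:4], out[-4:])  # [0 1 2 3] [1020 1021 1022 1023]
```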
20 |
21 | 
22 |
23 | The figure above is a diagram of a streaming multiprocessor (SM) for the [NVIDIA H100 GPU](https://www.nvidia.com/en-us/data-center/h100/). The H100 is composed of up to 132 SMs.
24 |
25 | # Princeton Language and Intelligence
26 |
27 | The university spent $9.6M on a new [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) cluster for research involving large AI models. The cluster provides 37 nodes with 8 GPUs per node. The H100 GPU is optimized for training transformer models. [Learn more](https://pli.princeton.edu/about-pli/directors-message) about this.
28 |
29 | # Overview of using a GPU
30 |
31 | This is the essence of how every GPU is used as an accelerator for compute:
32 |
33 | + Copy data from the CPU (host) to the GPU (device)
34 |
35 | + Launch a kernel to carry out computations on the GPU
36 |
37 | + Copy data from the GPU (device) back to the CPU (host)
38 |
39 | 
40 |
41 | The diagram above and the accompanying pseudocode present a simplified view of how GPUs are used in scientific computing. To fully understand how things work you will need to learn more about memory cache, interconnects, CUDA streams and much more.
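
In code, the three steps above might look like the following minimal CuPy sketch (CuPy is introduced in a later section; the matrix size and operation are arbitrary):

```python
import numpy as np
import cupy as cp

x_host = np.random.randn(1000, 1000)   # data starts on the CPU (host)

x_gpu = cp.asarray(x_host)             # 1. copy data from the host to the device
y_gpu = cp.matmul(x_gpu, x_gpu)        # 2. the computation runs as kernels on the GPU
y_host = cp.asnumpy(y_gpu)             # 3. copy the result back to the host

print(type(x_gpu), type(y_host))
```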
42 |
43 | [NVLink](https://www.nvidia.com/en-us/data-center/nvlink/) on Traverse enables fast CPU-to-GPU and GPU-to-GPU data transfers with a peak rate of 75 GB/s per direction. Della has this fast GPU-GPU interconnect on each pair of GPUs on 70 of the 90 GPU nodes.
44 |
45 | Given the significant performance penalty for moving data between the CPU and GPU, it is natural to work toward "unifying" the CPU and GPU. For instance, read about the [NVIDIA Grace Hopper Superchip](https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/).
46 |
47 | # What GPU resources does Princeton have?
48 |
49 | See the "Hardware Resources" on the [GPU Computing](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) page for a complete list.
50 |
51 | ## Adroit
52 |
53 | There are 3 GPU nodes on Adroit: `adroit-h11g1`, `adroit-h11g2` and `adroit-h11g3`.
54 |
55 | ```
56 | $ ssh <NetID>@adroit.princeton.edu
57 | $ snodes
58 | HOSTNAMES STATE CPUS S:C:T CPUS(A/I/O/T) CPU_LOAD MEMORY PARTITION AVAIL_FEATURES
59 | adroit-08 alloc 32 2:16:1 32/0/0/32 1.27 384000 class skylake,intel
60 | adroit-09 alloc 32 2:16:1 32/0/0/32 0.75 384000 class skylake,intel
61 | adroit-10 alloc 32 2:16:1 32/0/0/32 0.63 384000 class skylake,intel
62 | adroit-11 mix 32 2:16:1 29/3/0/32 0.28 384000 class skylake,intel
63 | adroit-12 mix 32 2:16:1 16/16/0/32 0.28 384000 class skylake,intel
64 | adroit-13 mix 32 2:16:1 25/7/0/32 0.22 384000 all* skylake,intel
65 | adroit-13 mix 32 2:16:1 25/7/0/32 0.22 384000 class skylake,intel
66 | adroit-14 alloc 32 2:16:1 32/0/0/32 32.29 384000 all* skylake,intel
67 | adroit-14 alloc 32 2:16:1 32/0/0/32 32.29 384000 class skylake,intel
68 | adroit-15 mix 32 2:16:1 22/10/0/32 9.68 384000 all* skylake,intel
69 | adroit-15 mix 32 2:16:1 22/10/0/32 9.68 384000 class skylake,intel
70 | adroit-16 alloc 32 2:16:1 32/0/0/32 24.13 384000 all* skylake,intel
71 | adroit-16 alloc 32 2:16:1 32/0/0/32 24.13 384000 class skylake,intel
72 | adroit-h11g1 plnd 48 2:24:1 0/48/0/48 0.00 1000000 gpu a100,intel,gpu80
73 | adroit-h11g2 plnd 48 2:24:1 0/48/0/48 0.76 1000000 gpu a100,intel
74 | adroit-h11g3 mix 56 4:14:1 5/51/0/56 1.05 760000 gpu v100,intel
75 | adroit-h11n1 idle 128 2:64:1 0/128/0/128 0.00 256000 class amd,rome
76 | adroit-h11n2 alloc 64 2:32:1 64/0/0/64 49.07 500000 all* intel,ice
77 | adroit-h11n3 mix 64 2:32:1 50/14/0/64 40.54 500000 all* intel,ice
78 | adroit-h11n4 mix 64 2:32:1 48/16/0/64 40.33 500000 all* intel,ice
79 | adroit-h11n5 mix 64 2:32:1 32/32/0/64 32.94 500000 all* intel,ice
80 | adroit-h11n6 mix 64 2:32:1 62/2/0/64 38.95 500000 all* intel,ice
81 | ```
82 |
83 | To only see the GPU nodes:
84 |
85 | ```
86 | $ shownodes -p gpu
87 | NODELIST STATE FREE/TOTAL CPUs CPU_LOAD AVAIL/TOTAL MEMORY FREE/TOTAL GPUs FEATURES
88 | adroit-h11g1 planned 48/48 0.00 1000000/1000000MB 4/4 nvidia_a100 a100,intel,gpu80
89 | adroit-h11g2 planned 48/48 0.76 1000000/1000000MB 8/8 3g.20gb a100,intel
90 | adroit-h11g3 mixed 51/56 1.05 736960/760000MB 0/4 tesla_v100 v100,intel
91 | ```
92 |
93 | ### adroit-h11g1
94 |
95 | This node has 4 NVIDIA A100 GPUs with 80 GB of memory each. Each A100 GPU has 108 streaming multiprocessors (SMs) and 64 FP32 CUDA cores per SM, for a total of 108 × 64 = 6912 FP32 cores per GPU.
96 |
97 | Here is some information about the A100 GPUs on this node:
98 |
99 | ```
100 | CUDADevice with properties:
101 |
102 | Name: 'NVIDIA A100 80GB PCIe'
103 | Index: 1
104 | ComputeCapability: '8.0'
105 | SupportsDouble: 1
106 | DriverVersion: 12.2000
107 | ToolkitVersion: 11.2000
108 | MaxThreadsPerBlock: 1024
109 | MaxShmemPerBlock: 49152
110 | MaxThreadBlockSize: [1024 1024 64]
111 | MaxGridSize: [2.1475e+09 65535 65535]
112 | SIMDWidth: 32
113 | TotalMemory: 8.5175e+10
114 | AvailableMemory: 8.4519e+10
115 | MultiprocessorCount: 108
116 | ClockRateKHz: 1410000
117 | ComputeMode: 'Default'
118 | GPUOverlapsTransfers: 1
119 | KernelExecutionTimeout: 0
120 | CanMapHostMemory: 1
121 | DeviceSupported: 1
122 | DeviceAvailable: 1
123 | DeviceSelected: 1
124 | ```
125 |
126 | Here is information about the CPUs on this node:
127 |
128 | ```
129 | $ ssh <NetID>@adroit.princeton.edu
130 | $ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --gres=gpu:1 --constraint=gpu80 --reservation=gpuprimer
131 | $ lscpu | grep -v Flags
132 | Architecture: x86_64
133 | CPU op-mode(s): 32-bit, 64-bit
134 | Byte Order: Little Endian
135 | CPU(s): 48
136 | On-line CPU(s) list: 0-47
137 | Thread(s) per core: 1
138 | Core(s) per socket: 24
139 | Socket(s): 2
140 | NUMA node(s): 2
141 | Vendor ID: GenuineIntel
142 | CPU family: 6
143 | Model: 143
144 | Model name: Intel(R) Xeon(R) Gold 6442Y
145 | Stepping: 8
146 | CPU MHz: 3707.218
147 | CPU max MHz: 4000.0000
148 | CPU min MHz: 800.0000
149 | BogoMIPS: 5200.00
150 | Virtualization: VT-x
151 | L1d cache: 48K
152 | L1i cache: 32K
153 | L2 cache: 2048K
154 | L3 cache: 61440K
155 | NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
156 | NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
157 | $ exit
158 | ```
159 |
160 |
161 | ### adroit-h11g2
162 |
163 | `adroit-h11g2` has 4 NVIDIA A100 GPUs with 40 GB of memory per GPU. The 4 GPUs have been divided (using NVIDIA's Multi-Instance GPU, or MIG, feature) into 8 less powerful GPUs with 20 GB of memory each. To connect to this node use:
164 |
165 | ```
166 | $ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --gres=gpu:1 --nodelist=adroit-h11g2 --reservation=gpuprimer
167 | ```
168 |
169 | Below is information about the A100 GPUs:
170 |
171 | ```
173 | Using a NVIDIA A100-PCIE-40GB GPU.
174 | CUDADevice with properties:
175 |
176 | Name: 'NVIDIA A100-PCIE-40GB'
177 | Index: 1
178 | ComputeCapability: '8.0'
179 | SupportsDouble: 1
180 | DriverVersion: 11.7000
181 | ToolkitVersion: 11.2000
182 | MaxThreadsPerBlock: 1024
183 | MaxShmemPerBlock: 49152
184 | MaxThreadBlockSize: [1024 1024 64]
185 | MaxGridSize: [2.1475e+09 65535 65535]
186 | SIMDWidth: 32
187 | TotalMemory: 4.2351e+10
188 | AvailableMemory: 4.1703e+10
189 | MultiprocessorCount: 108
190 | ClockRateKHz: 1410000
191 | ComputeMode: 'Default'
192 | GPUOverlapsTransfers: 1
193 | KernelExecutionTimeout: 0
194 | CanMapHostMemory: 1
195 | DeviceSupported: 1
196 | DeviceAvailable: 1
197 | DeviceSelected: 1
198 | ```
199 |
200 | Below is information about the CPUs:
201 |
202 | ```
203 | $ lscpu | grep -v Flags
204 | Architecture: x86_64
205 | CPU op-mode(s): 32-bit, 64-bit
206 | Byte Order: Little Endian
207 | CPU(s): 48
208 | On-line CPU(s) list: 0-47
209 | Thread(s) per core: 1
210 | Core(s) per socket: 24
211 | Socket(s): 2
212 | NUMA node(s): 2
213 | Vendor ID: GenuineIntel
214 | CPU family: 6
215 | Model: 106
216 | Model name: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
217 | Stepping: 6
218 | CPU MHz: 3499.996
219 | CPU max MHz: 3500.0000
220 | CPU min MHz: 800.0000
221 | BogoMIPS: 5600.00
222 | L1d cache: 48K
223 | L1i cache: 32K
224 | L2 cache: 1280K
225 | L3 cache: 36864K
226 | NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
227 | NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
228 | ```
229 |
230 | See the necessary Slurm directives to [run on specific GPUs](https://researchcomputing.princeton.edu/systems/adroit#gpus) on Adroit.
231 |
232 | To see a wealth of information about the GPUs use:
233 |
234 | ```
235 | $ nvidia-smi -q | less
236 | ```
237 |
238 | ### adroit-h11g3
239 |
240 | This node offers four of the older V100 GPUs (32 GB of memory each).
241 |
242 | ### Grace Hopper Superchip
243 |
244 | See the [Grace Hopper Superchip webpage](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/) by NVIDIA. Here is a schematic diagram of the superchip:
245 |
246 | 
247 |
248 | ```
249 | aturing@della-gh:~$ nvidia-smi -a
250 |
251 | ==============NVSMI LOG==============
252 |
253 | Timestamp : Mon Apr 22 11:24:41 2024
254 | Driver Version : 545.23.08
255 | CUDA Version : 12.3
256 |
257 | Attached GPUs : 1
258 | GPU 00000009:01:00.0
259 | Product Name : GH200 480GB
260 | Product Brand : NVIDIA
261 | Product Architecture : Hopper
262 | Display Mode : Disabled
263 | Display Active : Disabled
264 | Persistence Mode : Enabled
265 | Addressing Mode : ATS
266 | MIG Mode
267 | Current : Disabled
268 | Pending : Disabled
269 | ...
270 | ```
271 |
272 | The CPU on the GH Superchip:
273 |
274 | ```
275 | jdh4@della-gh:~$ lscpu
276 | Architecture: aarch64
277 | CPU op-mode(s): 64-bit
278 | Byte Order: Little Endian
279 | CPU(s): 72
280 | On-line CPU(s) list: 0-71
281 | Vendor ID: ARM
282 | Model name: Neoverse-V2
283 | Model: 0
284 | Thread(s) per core: 1
285 | Core(s) per socket: 72
286 | Socket(s): 1
287 | Stepping: r0p0
288 | Frequency boost: disabled
289 | CPU max MHz: 3510.0000
290 | CPU min MHz: 81.0000
291 | BogoMIPS: 2000.00
292 | Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
294 | Caches (sum of all):
295 | L1d: 4.5 MiB (72 instances)
296 | L1i: 4.5 MiB (72 instances)
297 | L2: 72 MiB (72 instances)
298 | L3: 114 MiB (1 instance)
299 | NUMA:
300 | NUMA node(s): 9
301 | NUMA node0 CPU(s): 0-71
302 | NUMA node1 CPU(s):
303 | NUMA node2 CPU(s):
304 | NUMA node3 CPU(s):
305 | NUMA node4 CPU(s):
306 | NUMA node5 CPU(s):
307 | NUMA node6 CPU(s):
308 | NUMA node7 CPU(s):
309 | NUMA node8 CPU(s):
310 | Vulnerabilities:
311 | Gather data sampling: Not affected
312 | Itlb multihit: Not affected
313 | L1tf: Not affected
314 | Mds: Not affected
315 | Meltdown: Not affected
316 | Mmio stale data: Not affected
317 | Retbleed: Not affected
318 | Spec rstack overflow: Not affected
319 | Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
320 | Spectre v1: Mitigation; __user pointer sanitization
321 | Spectre v2: Not affected
322 | Srbds: Not affected
323 | Tsx async abort: Not affected
324 | ```
325 |
326 | ### Compute Capability and Building Optimized Codes
327 |
328 | Some software will only run on a GPU of a given compute capability. To find these values for a given NVIDIA Tesla card see [this page](https://en.wikipedia.org/wiki/Nvidia_Tesla). The compute capability of the A100s on Della is 8.0. For various build systems this translates to `sm_80`.
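
The compute capability can also be queried at run time. Here is a sketch using CuPy (assuming a CuPy environment is available; any GPU-aware library offers something similar):

```python
import cupy as cp

props = cp.cuda.runtime.getDeviceProperties(0)
print(props["name"].decode())                                     # e.g., NVIDIA A100 80GB PCIe
print(f"compute capability: {props['major']}.{props['minor']}")   # e.g., 8.0, i.e., sm_80
```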
329 |
330 | The following is from `$ nvcc --help` after loading a `cudatoolkit` module:
331 |
332 | ```
333 | Options for steering GPU code generation.
334 | =========================================
335 |
336 | --gpu-architecture (-arch)
337 | Specify the name of the class of NVIDIA 'virtual' GPU architecture for which
338 | the CUDA input files must be compiled.
339 | With the exception as described for the shorthand below, the architecture
340 | specified with this option must be a 'virtual' architecture (such as compute_50).
341 | Normally, this option alone does not trigger assembly of the generated PTX
342 | for a 'real' architecture (that is the role of nvcc option '--gpu-code',
343 | see below); rather, its purpose is to control preprocessing and compilation
344 | of the input to PTX.
345 | For convenience, in case of simple nvcc compilations, the following shorthand
346 | is supported. If no value for option '--gpu-code' is specified, then the
347 | value of this option defaults to the value of '--gpu-architecture'. In this
348 | situation, as only exception to the description above, the value specified
349 | for '--gpu-architecture' may be a 'real' architecture (such as a sm_50),
350 | in which case nvcc uses the specified 'real' architecture and its closest
351 | 'virtual' architecture as effective architecture values. For example, 'nvcc
352 | --gpu-architecture=sm_50' is equivalent to 'nvcc --gpu-architecture=compute_50
353 | --gpu-code=sm_50,compute_50'.
354 | -arch=all build for all supported architectures (sm_*), and add PTX
355 | for the highest major architecture to the generated code.
356 | -arch=all-major build for just supported major versions (sm_*0), plus the
357 | earliest supported, and add PTX for the highest major architecture to the
358 | generated code.
359 | -arch=native build for all architectures (sm_*) on the current system
360 | Note: -arch=native, -arch=all, -arch=all-major cannot be used with the -code
361 | option, but can be used with -gencode options
362 | Note: the values compute_30, compute_32, compute_35, compute_37, compute_50,
363 | sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may be removed in
364 | a future release.
365 | Allowed values for this option: 'all','all-major','compute_35','compute_37',
366 | 'compute_50','compute_52','compute_53','compute_60','compute_61','compute_62',
367 | 'compute_70','compute_72','compute_75','compute_80','compute_86','compute_87',
368 | 'lto_35','lto_37','lto_50','lto_52','lto_53','lto_60','lto_61','lto_62',
369 | 'lto_70','lto_72','lto_75','lto_80','lto_86','lto_87','native','sm_35','sm_37',
370 | 'sm_50','sm_52','sm_53','sm_60','sm_61','sm_62','sm_70','sm_72','sm_75',
371 | 'sm_80','sm_86','sm_87'.
372 | ```
373 |
374 | Hence, a starting point for optimization flags for the A100 GPUs on Della and Adroit:
375 |
376 | ```
377 | nvcc -O3 --use_fast_math --gpu-architecture=sm_80 -o myapp myapp.cu
378 | ```
379 |
380 | For the H100 GPUs on Della:
381 |
382 | ```
383 | nvcc -O3 --use_fast_math --gpu-architecture=sm_90 -o myapp myapp.cu
384 | ```
385 |
386 | ## Comparison of GPU Resources
387 |
388 | | Cluster | Number of Nodes | GPUs per Node | NVIDIA GPU Model | Number of FP32 Cores| SM Count | GPU Memory (GB) |
389 | |:----------:|:----------:|:---------:|:-------:|:-------:|:-------:|:-------:|
390 | | Adroit | 1 | 4 | A100 | 6912 | 108 | 80 |
391 | | Adroit | 1 | 8 | A100 | -- | -- | 20 |
392 | | Adroit | 1 | 4 | V100 | 5120 | 80 | 32 |
393 | | Della | 37 | 8 | H100 | 14592 | 132 | 80 |
394 | | Della | 69 | 4 | A100 | 6912 | 108 | 80 |
395 | | Della | 20 | 2 | A100 | 6912 | 108 | 40 |
396 | | Della | 2 | 28 | A100 | -- | -- | 10 |
397 | | Stellar | 6 | 2 | A100 | 6912 | 108 | 40 |
398 | | Stellar | 1 | 8 | A100 | 6912 | 108 | 40 |
399 | | Tiger | 12 | 4 | H100 | 14592 | 132 | 80 |
400 |
401 | SM stands for streaming multiprocessor. Note that the V100 GPUs have 640 [Tensor Cores](https://devblogs.nvidia.com/cuda-9-features-revealed/) (8 per SM) where half-precision Warp Matrix Multiply-Accumulate (WMMA) operations can be carried out. That is, each core can perform a 4x4 matrix-matrix multiply and add the result to a third matrix. There are differences between the V100 node on Adroit and the Traverse nodes (see [PCIe versus SXM2](https://www.nextplatform.com/micro-site-content/achieving-maximum-compute-throughput-pcie-vs-sxm2/)).
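
In other words, each Tensor Core computes D = A*B + C on 4x4 tiles, with half-precision inputs and (typically) single-precision accumulation. A NumPy sketch that merely emulates the arithmetic (the real operation happens in hardware):

```python
import numpy as np

A = np.random.randn(4, 4).astype(np.float16)   # half-precision inputs
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)   # accumulator in single precision

# Emulate one WMMA step: D = A*B + C, accumulating in float32.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D)
```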
402 |
403 |
404 | ## GPU Hackathon at Princeton
405 |
406 | The next hackathon will take place in [June of 2025](https://www.openhackathons.org/s/siteevent/a0CUP00000rwmKa2AI/se000356). This is a great opportunity to get help from experts in porting your code to a GPU. Or you can participate as a mentor and help a team rework their code. See the [GPU Computing](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) page for details.
407 |
--------------------------------------------------------------------------------
/01_what_is_a_gpu/pli.md:
--------------------------------------------------------------------------------
1 | # PLI Nodes
2 |
3 | ```
4 | Architecture: x86_64
5 | CPU op-mode(s): 32-bit, 64-bit
6 | Byte Order: Little Endian
7 | CPU(s): 96
8 | On-line CPU(s) list: 0-95
9 | Thread(s) per core: 1
10 | Core(s) per socket: 48
11 | Socket(s): 2
12 | NUMA node(s): 2
13 | Vendor ID: GenuineIntel
14 | CPU family: 6
15 | Model: 143
16 | Model name: Intel(R) Xeon(R) Platinum 8468
17 | Stepping: 8
18 | CPU MHz: 3645.945
19 | CPU max MHz: 3800.0000
20 | CPU min MHz: 800.0000
21 | BogoMIPS: 4200.00
22 | L1d cache: 48K
23 | L1i cache: 32K
24 | L2 cache: 2048K
25 | L3 cache: 107520K
26 | NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94
27 | NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
28 | Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
29 | ```
30 |
31 | ```
32 | $ nvidia-smi
33 | Fri Feb 23 11:51:11 2024
34 | +---------------------------------------------------------------------------------------+
35 | | NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
36 | |-----------------------------------------+----------------------+----------------------+
37 | | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
38 | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
39 | | | | MIG M. |
40 | |=========================================+======================+======================|
41 | | 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
42 | | N/A 33C P0 72W / 700W | 2MiB / 81559MiB | 0% Default |
43 | | | | Disabled |
44 | +-----------------------------------------+----------------------+----------------------+
45 |
46 | +---------------------------------------------------------------------------------------+
47 | | Processes: |
48 | | GPU GI CI PID Type Process name GPU Memory |
49 | | ID ID Usage |
50 | |=======================================================================================|
51 | | No running processes found |
52 | +---------------------------------------------------------------------------------------+
53 | ```
54 |
55 | ```
56 | jdh4@della-j11g1:~$ nvidia-smi -a
57 | ==============NVSMI LOG==============
58 | Timestamp : Fri Feb 23 11:51:29 2024
59 | Driver Version : 545.23.08
60 | CUDA Version : 12.3
61 |
62 | Attached GPUs : 1
63 | GPU 00000000:19:00.0
64 | Product Name : NVIDIA H100 80GB HBM3
65 | Product Brand : NVIDIA
66 | Product Architecture : Hopper
67 | Display Mode : Enabled
68 | Display Active : Disabled
69 | Persistence Mode : Enabled
70 | Addressing Mode : None
71 | MIG Mode
72 | Current : Disabled
73 | Pending : Disabled
74 | Accounting Mode : Disabled
75 | Accounting Mode Buffer Size : 4000
76 | Driver Model
77 | Current : N/A
78 | Pending : N/A
79 | Serial Number : 1654123038646
80 | GPU UUID : GPU-10f35015-e921-bfab-2eb8-4e9b6664d5f1
81 | Minor Number : 0
82 | VBIOS Version : 96.00.74.00.0D
83 | MultiGPU Board : No
84 | Board ID : 0x1900
85 | Board Part Number : 692-2G520-0200-000
86 | GPU Part Number : 2330-885-A1
87 | FRU Part Number : N/A
88 | Module ID : 2
89 | Inforom Version
90 | Image Version : G520.0200.00.05
91 | OEM Object : 2.1
92 | ECC Object : 7.16
93 | Power Management Object : N/A
94 | Inforom BBX Object Flush
95 | Latest Timestamp : 2024/02/22 13:09:29.459
96 | Latest Duration : 119019 us
97 | GPU Operation Mode
98 | Current : N/A
99 | Pending : N/A
100 | GSP Firmware Version : N/A
101 | GPU C2C Mode : Disabled
102 | GPU Virtualization Mode
103 | Virtualization Mode : None
104 | Host VGPU Mode : N/A
105 | GPU Reset Status
106 | Reset Required : No
107 | Drain and Reset Recommended : No
108 | IBMNPU
109 | Relaxed Ordering Mode : N/A
110 | PCI
111 | Bus : 0x19
112 | Device : 0x00
113 | Domain : 0x0000
114 | Device Id : 0x233010DE
115 | Bus Id : 00000000:19:00.0
116 | Sub System Id : 0x16C110DE
117 | GPU Link Info
118 | PCIe Generation
119 | Max : 5
120 | Current : 5
121 | Device Current : 5
122 | Device Max : 5
123 | Host Max : 5
124 | Link Width
125 | Max : 16x
126 | Current : 16x
127 | Bridge Chip
128 | Type : N/A
129 | Firmware : N/A
130 | Replays Since Reset : 0
131 | Replay Number Rollovers : 0
132 | Tx Throughput : 464 KB/s
133 | Rx Throughput : 2593 KB/s
134 | Atomic Caps Inbound : N/A
135 | Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
136 | Fan Speed : N/A
137 | Performance State : P0
138 | Clocks Event Reasons
139 | Idle : Active
140 | Applications Clocks Setting : Not Active
141 | SW Power Cap : Not Active
142 | HW Slowdown : Not Active
143 | HW Thermal Slowdown : Not Active
144 | HW Power Brake Slowdown : Not Active
145 | Sync Boost : Not Active
146 | SW Thermal Slowdown : Not Active
147 | Display Clock Setting : Not Active
148 | FB Memory Usage
149 | Total : 81559 MiB
150 | Reserved : 328 MiB
151 | Used : 2 MiB
152 | Free : 81227 MiB
153 | BAR1 Memory Usage
154 | Total : 131072 MiB
155 | Used : 1 MiB
156 | Free : 131071 MiB
157 | Conf Compute Protected Memory Usage
158 | Total : 0 MiB
159 | Used : 0 MiB
160 | Free : 0 MiB
161 | Compute Mode : Default
162 | Utilization
163 | Gpu : 0 %
164 | Memory : 0 %
165 | Encoder : 0 %
166 | Decoder : 0 %
167 | JPEG : 0 %
168 | OFA : 0 %
169 | Encoder Stats
170 | Active Sessions : 0
171 | Average FPS : 0
172 | Average Latency : 0
173 | FBC Stats
174 | Active Sessions : 0
175 | Average FPS : 0
176 | Average Latency : 0
177 | ECC Mode
178 | Current : Enabled
179 | Pending : Enabled
180 | ECC Errors
181 | Volatile
182 | SRAM Correctable : 0
183 | SRAM Uncorrectable : 0
184 | DRAM Correctable : 0
185 | DRAM Uncorrectable : 0
186 | Aggregate
187 | SRAM Correctable : 0
188 | SRAM Uncorrectable : 0
189 | DRAM Correctable : 0
190 | DRAM Uncorrectable : 0
191 | Retired Pages
192 | Single Bit ECC : N/A
193 | Double Bit ECC : N/A
194 | Pending Page Blacklist : N/A
195 | Remapped Rows
196 | Correctable Error : 0
197 | Uncorrectable Error : 0
198 | Pending : No
199 | Remapping Failure Occurred : No
200 | Bank Remap Availability Histogram
201 | Max : 2560 bank(s)
202 | High : 0 bank(s)
203 | Partial : 0 bank(s)
204 | Low : 0 bank(s)
205 | None : 0 bank(s)
206 | Temperature
207 | GPU Current Temp : 33 C
208 | GPU T.Limit Temp : 54 C
209 | GPU Shutdown T.Limit Temp : -8 C
210 | GPU Slowdown T.Limit Temp : -2 C
211 | GPU Max Operating T.Limit Temp : 0 C
212 | GPU Target Temperature : N/A
213 | Memory Current Temp : 41 C
214 | Memory Max Operating T.Limit Temp : 0 C
215 | GPU Power Readings
216 | Power Draw : 72.02 W
217 | Current Power Limit : 700.00 W
218 | Requested Power Limit : 700.00 W
219 | Default Power Limit : 700.00 W
220 | Min Power Limit : 200.00 W
221 | Max Power Limit : 700.00 W
222 | GPU Memory Power Readings
223 | Power Draw : 47.78 W
224 | Module Power Readings
225 | Power Draw : N/A
226 | Current Power Limit : N/A
227 | Requested Power Limit : N/A
228 | Default Power Limit : N/A
229 | Min Power Limit : N/A
230 | Max Power Limit : N/A
231 | Clocks
232 | Graphics : 345 MHz
233 | SM : 345 MHz
234 | Memory : 2619 MHz
235 | Video : 765 MHz
236 | Applications Clocks
237 | Graphics : 1980 MHz
238 | Memory : 2619 MHz
239 | Default Applications Clocks
240 | Graphics : 1980 MHz
241 | Memory : 2619 MHz
242 | Deferred Clocks
243 | Memory : N/A
244 | Max Clocks
245 | Graphics : 1980 MHz
246 | SM : 1980 MHz
247 | Memory : 2619 MHz
248 | Video : 1545 MHz
249 | Max Customer Boost Clocks
250 | Graphics : 1980 MHz
251 | Clock Policy
252 | Auto Boost : N/A
253 | Auto Boost Default : N/A
254 | Voltage
255 | Graphics : 670.000 mV
256 | Fabric
257 | State : Completed
258 | Status : Success
259 | Processes : None
260 | ```
261 |
262 | ```
263 | $ numactl -H
264 | available: 2 nodes (0-1)
265 | node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94
266 | node 0 size: 515020 MB
267 | node 0 free: 509047 MB
268 | node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
269 | node 1 size: 516037 MB
270 | node 1 free: 489964 MB
271 | node distances:
272 | node 0 1
273 | 0: 10 21
274 | 1: 21 10
275 | ```
276 |
277 | ## Intra-Node Topology
278 |
279 | ```
280 | jdh4@della-k17g3:~$ nvidia-smi topo -m
281 | GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
282 | GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX NODE NODE NODE NODE 0 N/A
283 | GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE 0 N/A
284 | GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE 0 N/A
285 | GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX NODE NODE NODE 0 N/A
286 | GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 NODE NODE NODE PIX PIX NODE 1 1 N/A
287 | GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 NODE NODE NODE NODE NODE NODE 1 1 N/A
288 | GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 NODE NODE NODE NODE NODE PIX 1 1 N/A
289 | GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X NODE NODE NODE NODE NODE NODE 1 1 N/A
290 | NIC0 PIX NODE NODE NODE NODE NODE NODE NODE X PIX NODE NODE NODE NODE
291 | NIC1 PIX NODE NODE NODE NODE NODE NODE NODE PIX X NODE NODE NODE NODE
292 | NIC2 NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X NODE NODE NODE
293 | NIC3 NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X PIX NODE
294 | NIC4 NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE PIX X NODE
295 | NIC5 NODE NODE NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X
296 |
297 | Legend:
298 |
299 | X = Self
300 | SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
301 | NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
302 | PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
303 | PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
304 | PIX = Connection traversing at most a single PCIe bridge
305 | NV# = Connection traversing a bonded set of # NVLinks
306 |
307 | NIC Legend:
308 |
309 | NIC0: mlx5_0
310 | NIC1: mlx5_1
311 | NIC2: mlx5_2
312 | NIC3: mlx5_3
313 | NIC4: mlx5_4
314 | NIC5: mlx5_5
315 | ```
316 |
--------------------------------------------------------------------------------
/02_cuda_toolkit/README.md:
--------------------------------------------------------------------------------
1 | # NVIDIA CUDA Toolkit
2 |
3 | 
4 |
5 | The [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) provides a comprehensive set of libraries and tools for developing and running GPU-accelerated applications.
6 |
7 | List the available modules that are related to CUDA:
8 |
9 | ```
10 | $ module avail cudatoolkit
11 | ------------ /usr/local/share/Modules/modulefiles -------------
12 | cudatoolkit/10.2 cudatoolkit/11.7 cudatoolkit/12.4
13 | cudatoolkit/11.1 cudatoolkit/12.0 cudatoolkit/12.5
14 | cudatoolkit/11.3 cudatoolkit/12.2 cudatoolkit/12.6
15 | cudatoolkit/11.4 cudatoolkit/12.3
16 | ```
17 |
18 | Run the following command to see which environment variables the `cudatoolkit` module is modifying:
19 |
20 | ```
21 | $ module show cudatoolkit/12.5
22 | -------------------------------------------------------------------
23 | /usr/local/share/Modules/modulefiles/cudatoolkit/12.5:
24 |
25 | module-whatis {Sets up cudatoolkit125 12.5 in your environment}
26 | prepend-path PATH /usr/local/cuda-12.5/bin
27 | prepend-path LD_LIBRARY_PATH /usr/local/cuda-12.5/lib64
28 | prepend-path LIBRARY_PATH /usr/local/cuda-12.5/lib64
29 | prepend-path MANPATH /usr/local/cuda-12.5/doc/man
30 | append-path -d { } LDFLAGS -L/usr/local/cuda-12.5/lib64
31 | append-path -d { } INCLUDE -I/usr/local/cuda-12.5/include
32 | append-path CPATH /usr/local/cuda-12.5/include
33 | append-path -d { } FFLAGS -I/usr/local/cuda-12.5/include
34 | append-path -d { } LOCAL_LDFLAGS -L/usr/local/cuda-12.5/lib64
35 | append-path -d { } LOCAL_INCLUDE -I/usr/local/cuda-12.5/include
36 | append-path -d { } LOCAL_CFLAGS -I/usr/local/cuda-12.5/include
37 | append-path -d { } LOCAL_FFLAGS -I/usr/local/cuda-12.5/include
38 | append-path -d { } LOCAL_CXXFLAGS -I/usr/local/cuda-12.5/include
39 | setenv CUDA_HOME /usr/local/cuda-12.5
40 | -------------------------------------------------------------------
41 | ```
42 |
43 | Let's look at the files in `/usr/local/cuda-12.5/bin`:
44 |
45 | ```
46 | $ ls -ltrh /usr/local/cuda-12.5/bin
47 | total 243M
48 | -rwxr-xr-x. 1 root root 49M Apr 15 22:46 nvdisasm
49 | -rwxr-xr-x. 1 root root 688K Apr 15 22:47 cuobjdump
50 | -rwxr-xr-x. 6 root root 11K May 17 18:50 __nvcc_device_query
51 | -rwxr-xr-x. 14 root root 285 May 17 18:50 nvvp
52 | -rwxr-xr-x. 1 root root 111K Jun 6 06:03 nvprune
53 | -rwxr-xr-x. 1 root root 75K Jun 6 06:09 cu++filt
54 | -rwxr-xr-x. 1 root root 30M Jun 6 06:12 ptxas
55 | -rwxr-xr-x. 1 root root 30M Jun 6 06:12 nvlink
56 | -rw-r--r--. 1 root root 465 Jun 6 06:12 nvcc.profile
57 | -rwxr-xr-x. 1 root root 22M Jun 6 06:12 nvcc
58 | -rwxr-xr-x. 1 root root 1.2M Jun 6 06:12 fatbinary
59 | -rwxr-xr-x. 1 root root 7.1M Jun 6 06:12 cudafe++
60 | -rwxr-xr-x. 1 root root 87K Jun 6 06:12 bin2c
61 | -rwxr-xr-x. 1 root root 803K Jun 6 07:25 cuda-gdbserver
62 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.9-tui
63 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.8-tui
64 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.12-tui
65 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.11-tui
66 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.10-tui
67 | -rwxr-xr-x. 1 root root 15M Jun 6 07:25 cuda-gdb-minimal
68 | -rwxr-xr-x. 1 root root 1.9K Jun 6 07:25 cuda-gdb
69 | -rwxr-xr-x. 1 root root 5.8M Jun 6 07:56 nvprof
70 | lrwxrwxrwx. 1 root root 4 Jun 6 08:04 computeprof -> nvvp
71 | -rwxr-xr-x. 11 root root 1.6K Jun 14 19:56 nsight_ee_plugins_manage.sh
72 | -rwxr-xr-x. 1 root root 833 Jun 25 17:54 nsys-ui
73 | -rwxr-xr-x. 1 root root 743 Jun 25 17:54 nsys
74 | -rwxr-xr-x. 5 root root 112 Jul 12 02:21 compute-sanitizer
75 | -rwxr-xr-x. 5 root root 3.6K Jul 26 18:06 ncu-ui
76 | -rwxr-xr-x. 5 root root 3.8K Jul 26 18:06 ncu
77 | -rwxr-xr-x. 4 root root 197 Jul 26 18:06 nsight-sys
78 | drwxr-xr-x. 2 root root 43 Aug 28 10:24 crt
79 | ```
80 |
81 | `nvcc` is the NVIDIA CUDA Compiler. Note that `nvcc` is built on `llvm` as [described here](https://developer.nvidia.com/cuda-llvm-compiler). To learn more about an executable, use the help option. For instance: `nvcc --help`.
82 |
83 |
84 | Let's look at the libraries:
85 |
86 | ```
87 | $ ls -lL /usr/local/cuda-12.5/lib64/lib*.so
88 | -rwxr-xr-x. 1 root root 2412216 Jun 6 07:56 /usr/local/cuda-12.5/lib64/libaccinj64.so
89 | -rwxr-xr-x. 1 root root 1505608 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libcheckpoint.so
90 | -rwxr-xr-x. 1 root root 446820528 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libcublasLt.so
91 | -rwxr-xr-x. 1 root root 104128480 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libcublas.so
92 | -rwxr-xr-x. 1 root root 712032 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libcudart.so
93 | -rwxr-xr-x. 1 root root 276080616 Jun 6 06:16 /usr/local/cuda-12.5/lib64/libcufft.so
94 | -rwxr-xr-x. 1 root root 974920 Jun 6 06:16 /usr/local/cuda-12.5/lib64/libcufftw.so
95 | -rwxr-xr-x. 6 root root 43320 Jun 5 13:57 /usr/local/cuda-12.5/lib64/libcufile_rdma.so
96 | -rwxr-xr-x. 1 root root 2993816 Jun 6 06:53 /usr/local/cuda-12.5/lib64/libcufile.so
97 | -rwxr-xr-x. 1 root root 2832640 Jun 6 07:56 /usr/local/cuda-12.5/lib64/libcuinj64.so
98 | -rwxr-xr-x. 1 root root 7807144 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libcupti.so
99 | -rwxr-xr-x. 1 root root 96529840 Jun 6 06:14 /usr/local/cuda-12.5/lib64/libcurand.so
100 | -rwxr-xr-x. 1 root root 82234792 Jun 6 06:55 /usr/local/cuda-12.5/lib64/libcusolverMg.so
101 | -rwxr-xr-x. 1 root root 122162688 Jun 6 06:55 /usr/local/cuda-12.5/lib64/libcusolver.so
102 | -rwxr-xr-x. 1 root root 294682616 Jun 6 06:29 /usr/local/cuda-12.5/lib64/libcusparse.so
103 | -rwxr-xr-x. 1 root root 1651184 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppc.so
104 | -rwxr-xr-x. 1 root root 17736496 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppial.so
105 | -rwxr-xr-x. 1 root root 7689032 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppicc.so
106 | -rwxr-xr-x. 1 root root 11248792 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppidei.so
107 | -rwxr-xr-x. 1 root root 101120104 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppif.so
108 | -rwxr-xr-x. 1 root root 41165712 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppig.so
109 | -rwxr-xr-x. 1 root root 10703688 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppim.so
110 | -rwxr-xr-x. 1 root root 37897296 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppist.so
111 | -rwxr-xr-x. 1 root root 724392 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppisu.so
112 | -rwxr-xr-x. 1 root root 5595760 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppitc.so
113 | -rwxr-xr-x. 1 root root 14169336 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnpps.so
114 | -rwxr-xr-x. 1 root root 757496 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libnvblas.so
115 | -rwxr-xr-x. 1 root root 2409960 Jun 6 06:08 /usr/local/cuda-12.5/lib64/libnvfatbin.so
116 | -rwxr-xr-x. 1 root root 54560656 Jun 6 06:11 /usr/local/cuda-12.5/lib64/libnvJitLink.so
117 | -rwxr-xr-x. 1 root root 6726448 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvjpeg.so
118 | -rwxr-xr-x. 1 root root 28139320 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libnvperf_host.so
119 | -rwxr-xr-x. 1 root root 5579216 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libnvperf_target.so
120 | -rwxr-xr-x. 1 root root 5322632 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvrtc-builtins.so
121 | -rwxr-xr-x. 1 root root 61401616 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvrtc.so
122 | -rwxr-xr-x. 10 root root 40136 May 17 18:50 /usr/local/cuda-12.5/lib64/libnvToolsExt.so
123 | -rwxr-xr-x. 10 root root 30856 May 17 18:50 /usr/local/cuda-12.5/lib64/libOpenCL.so
124 | -rwxr-xr-x. 1 root root 920920 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libpcsamplingutil.so
125 | ```
126 |
127 | ## cuDNN
128 |
129 | There is also the [CUDA Deep Neural Network](https://developer.nvidia.com/cudnn) (cuDNN) library. It is external to the NVIDIA CUDA Toolkit and is used with TensorFlow, for instance, to provide GPU routines for training neural nets. See the available modules with:
130 |
131 | ```
132 | $ module avail cudnn
133 | ```
134 |
135 | ## Conda Installations
136 |
137 | When you install [CuPy](https://cupy.dev), for instance, which is like NumPy for GPUs, Conda will include the CUDA libraries:
138 |
139 | ```
140 | $ module load anaconda3/2024.6
141 | $ conda create --name cupy-env cupy --channel conda-forge
142 | ...
143 | _libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
144 | _openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
145 | bzip2 conda-forge/linux-64::bzip2-1.0.8-hd590300_5
146 | ca-certificates conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0
147 | cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.5.82-he02047a_0
148 | cuda-version conda-forge/noarch::cuda-version-12.5-hd4f0392_3
149 | cupy conda-forge/linux-64::cupy-13.2.0-py312had87585_0
150 | cupy-core conda-forge/linux-64::cupy-core-13.2.0-py312hd074ebb_0
151 | fastrlock conda-forge/linux-64::fastrlock-0.8.2-py312h30efb56_2
152 | ld_impl_linux-64 conda-forge/linux-64::ld_impl_linux-64-2.40-hf3520f5_7
153 | libblas conda-forge/linux-64::libblas-3.9.0-22_linux64_openblas
154 | libcblas conda-forge/linux-64::libcblas-3.9.0-22_linux64_openblas
155 | libcublas conda-forge/linux-64::libcublas-12.5.3.2-he02047a_0
156 | libcufft conda-forge/linux-64::libcufft-11.2.3.61-he02047a_0
157 | libcurand conda-forge/linux-64::libcurand-10.3.6.82-he02047a_0
158 | libcusolver conda-forge/linux-64::libcusolver-11.6.3.83-he02047a_0
159 | libcusparse conda-forge/linux-64::libcusparse-12.5.1.3-he02047a_0
160 | libexpat conda-forge/linux-64::libexpat-2.6.2-h59595ed_0
161 | libffi conda-forge/linux-64::libffi-3.4.2-h7f98852_5
162 | libgcc-ng conda-forge/linux-64::libgcc-ng-14.1.0-h77fa898_0
163 | libgfortran-ng conda-forge/linux-64::libgfortran-ng-14.1.0-h69a702a_0
164 | libgfortran5 conda-forge/linux-64::libgfortran5-14.1.0-hc5f4f2c_0
165 | libgomp conda-forge/linux-64::libgomp-14.1.0-h77fa898_0
166 | liblapack conda-forge/linux-64::liblapack-3.9.0-22_linux64_openblas
167 | libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0
168 | libnvjitlink conda-forge/linux-64::libnvjitlink-12.5.82-he02047a_0
169 | libopenblas conda-forge/linux-64::libopenblas-0.3.27-pthreads_hac2b453_1
170 | libsqlite conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
171 | libstdcxx-ng conda-forge/linux-64::libstdcxx-ng-14.1.0-hc0a3c3a_0
172 | libuuid conda-forge/linux-64::libuuid-2.38.1-h0b41bf4_0
173 | libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
174 | libzlib conda-forge/linux-64::libzlib-1.3.1-h4ab18f5_1
175 | ncurses conda-forge/linux-64::ncurses-6.5-h59595ed_0
176 | numpy conda-forge/linux-64::numpy-2.0.0-py312h22e1c76_0
177 | openssl conda-forge/linux-64::openssl-3.3.1-h4ab18f5_1
178 | pip conda-forge/noarch::pip-24.0-pyhd8ed1ab_0
179 | python conda-forge/linux-64::python-3.12.4-h194c7f8_0_cpython
180 | python_abi conda-forge/linux-64::python_abi-3.12-4_cp312
181 | readline conda-forge/linux-64::readline-8.2-h8228510_1
182 | setuptools conda-forge/noarch::setuptools-70.1.1-pyhd8ed1ab_0
183 | tk conda-forge/linux-64::tk-8.6.13-noxft_h4845f30_101
184 | tzdata conda-forge/noarch::tzdata-2024a-h0c530f3_0
185 | wheel conda-forge/noarch::wheel-0.43.0-pyhd8ed1ab_1
186 | xz conda-forge/linux-64::xz-5.2.6-h166bdaf_0
187 | ```
188 |
189 | When using `pip` to do the installation, one needs to load the `cudatoolkit` module since that dependency is assumed to be available on the local system. The Conda approach installs all the dependencies so one does not load the module.
190 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/README.md:
--------------------------------------------------------------------------------
1 | # Your First GPU Job
2 |
3 | Using the GPUs on the Princeton HPC clusters is easy. Pick one of the applications below to get started. To obtain the materials to run the examples, use these commands:
4 |
5 | ```
6 | $ ssh <NetID>@adroit.princeton.edu
7 | $ cd /scratch/network/<NetID>
8 | $ git clone https://github.com/PrincetonUniversity/gpu_programming_intro.git
9 | ```
10 |
11 | To add a GPU to your Slurm allocation:
12 |
13 | ```
14 | #SBATCH --gres=gpu:1 # number of gpus per node
15 | ```
16 |
17 | For Adroit, one can specify the GPU type using a constraint:
18 |
19 | ```
20 | #SBATCH --constraint=a100 # set to gpu80, a100 or v100
21 | #SBATCH --gres=gpu:1 # number of gpus per node
22 | ```
23 |
24 | For more on specifying the GPU type on Adroit [see this page](https://researchcomputing.princeton.edu/systems/adroit#gpus).
25 |
26 | ## CuPy
27 |
28 | [CuPy](https://cupy.chainer.org) provides a Python interface to a set of common numerical routines (e.g., matrix factorizations) which are executed on a GPU (see the [Reference Manual](https://docs-cupy.chainer.org/en/stable/reference/index.html)). You can roughly think of CuPy as NumPy for GPUs. This example is set to use the CuPy installation of the workshop instructor. If you use CuPy for your research work then you should [install it](https://github.com/PrincetonUniversity/gpu_programming_intro/tree/master/02_cuda_toolkit#conda-installations) into your account.
29 |
30 | Examine the Python script before running the code:
31 |
32 | ```python
33 | $ cd gpu_programming_intro/03_your_first_gpu_job/cupy
34 | $ cat svd.py
35 | from time import perf_counter
36 | import cupy as cp
37 |
38 | N = 1000
39 | X = cp.random.randn(N, N, dtype=cp.float64)
40 |
41 | trials = 5
42 | times = []
43 | for _ in range(trials):
44 | t0 = perf_counter()
45 | u, s, v = cp.linalg.svd(X)
46 | cp.cuda.Device(0).synchronize()
47 | times.append(perf_counter() - t0)
48 | print("Execution time: ", min(times))
49 | print("sum(s) = ", s.sum())
50 | print("CuPy version: ", cp.__version__)
51 | ```
52 |
53 | Below is a sample Slurm script:
54 |
55 | ```bash
56 | $ cat job.slurm
57 | #!/bin/bash
58 | #SBATCH --job-name=cupy-job # create a short name for your job
59 | #SBATCH --nodes=1 # node count
60 | #SBATCH --ntasks=1 # total number of tasks across all nodes
61 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
62 | #SBATCH --gres=gpu:1 # number of gpus per node
63 | #SBATCH --mem=4G # total memory (RAM) per node
64 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
65 | #SBATCH --constraint=a100 # choose a100 or v100
66 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
67 |
68 | module purge
69 | module load anaconda3/2024.6
70 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/cupy-env
71 |
72 | python svd.py
73 | ```
74 |
75 | A GPU is allocated using the Slurm directive `#SBATCH --gres=gpu:1`.
76 |
77 | Submit the job:
78 |
79 | ```
80 | $ sbatch job.slurm
81 | ```
82 |
83 | Wait a few seconds for the job to run. Inspect the output:
84 |
85 | ```
86 | $ cat slurm-*.out
87 | ```
88 |
89 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. What happens if you re-run the script with the matrix in single precision? Does the execution time double if N is doubled? There is a CPU version of the code at the bottom of this page. Does the operation run faster on the CPU with NumPy or on the GPU with CuPy? Try [this exercise](https://github.com/PrincetonUniversity/a100_workshop/tree/main/06_cupy#cupy-uses-tensor-cores) where the Tensor Cores are utilized by using less than single precision (i.e., TensorFloat32).
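
As a hint for the single-precision question, only the `dtype` of the random matrix in `svd.py` needs to change:

```python
X = cp.random.randn(N, N, dtype=cp.float32)  # single precision instead of cp.float64
```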
90 |
91 | Why are multiple trials used when measuring the execution time? `CuPy` compiles a custom GPU kernel for each GPU operation (e.g., SVD). This means the first time a `CuPy` function is called the measured time is the sum of the compile time plus the time to execute the operation. The second and later calls only include the time to execute the operation.
92 |
93 | In addition to CuPy, Python programmers looking to run their code on GPUs should also be aware of [Numba](https://numba.pydata.org/) and [JAX](https://github.com/google/jax).
94 |
95 | To see a performance comparison between the CPU and GPU, see `matmul_numpy.py` and `matmul_cupy.py` in [this repo](https://github.com/jdh4/python-gpu/tree/main/cupy).
96 |
97 | ## PyTorch
98 |
99 | [PyTorch](https://pytorch.org) is a popular deep learning framework. See its documentation for [Tensor operations](https://pytorch.org/docs/stable/tensors.html). This example is set to use the PyTorch installation of the workshop instructor. If you use PyTorch for your research work then you should [install it](https://researchcomputing.princeton.edu/support/knowledge-base/pytorch) into your account.
100 |
101 | Examine the Python script before running the code:
102 |
103 | ```python
104 | $ cd gpu_programming_intro/03_your_first_gpu_job/pytorch
105 | $ cat svd.py
106 | from time import perf_counter
107 | import torch
108 |
109 | N = 1000
110 |
111 | cuda0 = torch.device('cuda:0')
112 | x = torch.randn(N, N, dtype=torch.float64, device=cuda0)
113 | t0 = perf_counter()
114 | u, s, v = torch.svd(x)
115 | elapsed_time = perf_counter() - t0
116 |
117 | print("Execution time: ", elapsed_time)
118 | print("Result: ", torch.sum(s).cpu().numpy())
119 | print("PyTorch version: ", torch.__version__)
120 | ```
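
Note that GPU operations in PyTorch are generally launched asynchronously, so careful timing calls `torch.cuda.synchronize()` before reading the clock (the CuPy example above does the analogous thing with `cp.cuda.Device(0).synchronize()`). Below is a sketch of the same measurement with explicit synchronization (illustrative only; it is not part of the repository's `svd.py`):

```python
from time import perf_counter
import torch

N = 1000
x = torch.randn(N, N, dtype=torch.float64, device='cuda:0')

torch.cuda.synchronize()   # wait for setup work on the GPU to finish
t0 = perf_counter()
u, s, v = torch.svd(x)
torch.cuda.synchronize()   # wait for the SVD to finish before stopping the clock
print("Execution time: ", perf_counter() - t0)
```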
121 |
122 | Here is a sample Slurm script:
123 |
124 | ```bash
125 | $ cat job.slurm
126 | #!/bin/bash
127 | #SBATCH --job-name=torch-svd # create a short name for your job
128 | #SBATCH --nodes=1 # node count
129 | #SBATCH --ntasks=1 # total number of tasks across all nodes
130 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
131 | #SBATCH --mem-per-cpu=4G # memory per cpu-core
132 | #SBATCH --gres=gpu:1 # number of gpus per node
133 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
134 | #SBATCH --constraint=a100 # choose a100 or v100 on adroit
135 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
136 |
137 | module purge
138 | module load anaconda3/2024.6
139 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/torch-env
140 |
141 | python svd.py
142 | ```
143 |
144 | Submit the job:
145 |
146 | ```
147 | $ sbatch job.slurm
148 | ```
149 |
150 | Wait a few seconds for the job to run. Inspect the output:
151 |
152 | ```
153 | $ cat slurm-*.out
154 | ```
155 |
156 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`.
157 |
158 | ## TensorFlow
159 |
160 | [TensorFlow](https://www.tensorflow.org) is a popular library for training deep neural networks. It can also be used for various numerical computations (see [documentation](https://www.tensorflow.org/api_docs/python/tf)). This example is set to use the TensorFlow installation of the workshop instructor. If you use TensorFlow for your research work then you should [install it](https://researchcomputing.princeton.edu/support/knowledge-base/tensorflow) into your account.
161 |
162 | Examine the Python script before running the code:
163 |
164 | ```python
165 | $ cd gpu_programming_intro/03_your_first_gpu_job/tensorflow
166 | $ cat svd.py
167 | from time import perf_counter
168 |
169 | import os
170 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
171 |
172 | import tensorflow as tf
173 | print("TensorFlow version: ", tf.__version__)
174 |
175 | N = 100
176 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64)
177 | t0 = perf_counter()
178 | s, u, v = tf.linalg.svd(x)
179 | elapsed_time = perf_counter() - t0
180 | print("Execution time: ", elapsed_time)
181 | print("Result: ", tf.reduce_sum(s).numpy())
182 | ```
183 |
184 | Below is a sample Slurm script:
185 |
186 | ```bash
187 | $ cat job.slurm
188 | #!/bin/bash
189 | #SBATCH --job-name=svd-tf # create a short name for your job
190 | #SBATCH --nodes=1 # node count
191 | #SBATCH --ntasks=1 # total number of tasks across all nodes
192 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
193 | #SBATCH --mem=4G # total memory (RAM) per node
194 | #SBATCH --gres=gpu:1 # number of gpus per node
195 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
196 | #SBATCH --constraint=a100 # choose a100 or v100
197 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
198 |
199 | module load anaconda3/2024.6
200 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/tf2-gpu
201 |
202 | python svd.py
203 | ```
204 |
205 | Submit the job:
206 |
207 | ```
208 | $ sbatch job.slurm
209 | ```
210 |
211 | Wait a few seconds for the job to run. Inspect the output:
212 |
213 | ```
214 | $ cat slurm-*.out
215 | ```
216 |
217 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`.
218 |
219 |
224 |
225 | ## R with NVBLAS
226 |
227 | Take a look at [this page](https://github.com/PrincetonUniversity/HPC_R_Workshop/tree/master/07_NVBLAS) and then run the commands below:
228 |
229 | ```
230 | $ git clone https://github.com/PrincetonUniversity/HPC_R_Workshop
231 | $ cd HPC_R_Workshop/07_NVBLAS
232 | $ mv nvblas.conf ~
233 | $ sbatch 07_NVBLAS.cmd
234 | ```
235 |
236 | Here is the sample output:
237 |
238 | ```
239 | $ cat slurm-*.out
240 | ...
241 | [1] "Matrix multiply:"
242 | user system elapsed
243 | 0.166 0.137 0.304
244 | [1] "----"
245 | [1] "Cholesky Factorization:"
246 | user system elapsed
247 | 1.053 0.041 1.096
248 | [1] "----"
249 | [1] "Singular Value Decomposition:"
250 | user system elapsed
251 | 8.060 1.837 5.345
252 | [1] "----"
253 | [1] "Principal Components Analysis:"
254 | user system elapsed
255 | 16.814 5.987 11.252
256 | [1] "----"
257 | [1] "Linear Discriminant Analysis:"
258 | user system elapsed
259 | 25.955 3.080 20.830
260 | [1] "----"
261 | ...
262 | ```
263 |
264 | See the [user guide](https://docs.nvidia.com/cuda/nvblas/index.html) for NVBLAS.
265 |
266 | ## MATLAB
267 |
268 | MATLAB is already installed on the cluster. Simply follow these steps:
269 |
270 | ```bash
271 | $ cd gpu_programming_intro/03_your_first_gpu_job/matlab
272 | $ cat svd.m
273 | ```
274 |
275 | Here is the MATLAB script:
276 |
277 | ```matlab
278 | gpu = gpuDevice();
279 | fprintf('Using a %s GPU.\n', gpu.Name);
280 | disp(gpuDevice);
281 |
282 | X = gpuArray([1 0 2; -1 5 0; 0 3 -9]);
283 | whos X
284 | [U,S,V] = svd(X)
285 | fprintf('trace(S): %f\n', trace(S))
286 | quit;
287 | ```
288 |
289 | Below is a sample Slurm script:
290 |
291 | ```bash
292 | #!/bin/bash
293 | #SBATCH --job-name=matlab-svd # create a short name for your job
294 | #SBATCH --nodes=1 # node count
295 | #SBATCH --ntasks=1 # total number of tasks across all nodes
296 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
297 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default)
298 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS)
299 | #SBATCH --gres=gpu:1 # number of gpus per node
300 | #SBATCH --constraint=a100 # choose a100 or v100
301 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
302 |
303 | module purge
304 | module load matlab/R2023a
305 |
306 | matlab -singleCompThread -nodisplay -nosplash -r svd
307 | ```
308 |
309 | Submit the job:
310 |
311 | ```
312 | $ sbatch job.slurm
313 | ```
314 |
315 | Wait a few seconds for the job to run. Inspect the output:
316 |
317 | ```
318 | $ cat slurm-*.out
319 | ```
320 |
321 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. Learn more about [MATLAB on the Research Computing clusters](https://researchcomputing.princeton.edu/support/knowledge-base/matlab).
322 |
323 | Here is an [intro](https://www.mathworks.com/help/parallel-computing/run-matlab-functions-on-a-gpu.html) to using MATLAB with GPUs.
324 |
325 | ## Julia
326 |
327 | Install the `CUDA` package and then run the script in `03_your_first_gpu_job/julia`. See our [Julia webpage](https://researchcomputing.princeton.edu/support/knowledge-base/julia).
328 |
329 | ## Monitoring GPU Usage
330 |
331 | To monitor jobs in our reservation:
332 |
333 | ```
334 | $ watch -n 1 squeue -R gpuprimer
335 | ```
336 |
337 | ## Benchmarks
338 |
339 | ### Matrix Multiplication
340 |
341 | | cluster | code | CPU-cores | time (s) |
342 | |:--------------------:|:----:|:-----------:|:--------:|
343 | | adroit (CPU) | NumPy | 1 | 24.2 |
344 | | adroit (CPU) | NumPy | 2 | 15.5 |
345 | | adroit (CPU) | NumPy | 4 | 5.3 |
346 | | adroit (V100) | CuPy | 1 | 0.3 |
347 | | adroit (K40c) | CuPy | 1 | 1.7 |
348 |
349 | Times are best of 5 for a square matrix with N=10000 in double precision.
350 |
351 | ### LU Decomposition
352 |
353 | | cluster | code | CPU-cores | time (s) |
354 | |:--------------------:|:-----------:|:----------:|:--------:|
355 | | adroit (CPU) | SciPy | 1 | 9.4 |
356 | | adroit (CPU) | SciPy | 2 | 7.9 |
357 | | adroit (CPU) | SciPy | 4 | 6.5 |
358 | | adroit (V100) | CuPy | 1 | 0.3 |
359 | | adroit (K40c) | CuPy | 1 | 1.1 |
360 | | adroit (V100) | Tensorflow | 1 | 0.3 |
361 | | adroit (K40c) | Tensorflow | 1 | 1.1 |
362 | | adroit (CPU) | Tensorflow | 1 | 50.8 |
363 |
364 | Times are best of 5 for a square matrix with N=10000 in double precision.
365 |
366 | ### Singular Value Decomposition
367 |
368 | | cluster | code | CPU-cores | time (s) |
369 | |:--------------------:|:----------:|:----------:|:--------:|
370 | | adroit (CPU) | NumPy | 1 | 3.6 |
371 | | adroit (CPU) | NumPy | 2 | 3.0 |
372 | | adroit (CPU) | NumPy | 4 | 1.2 |
373 | | adroit (V100) | CuPy | 1 | 24.7 |
374 | | adroit (K40c) | CuPy | 1 | 30.5 |
375 | | adroit (V100) | Torch | 1 | 0.9 |
376 | | adroit (K40c) | Torch | 1 | 1.5 |
377 | | adroit (CPU) | Torch | 1 | 3.0 |
378 | | adroit (V100) | TensorFlow | 1 | 24.8 |
379 | | adroit (K40c) | TensorFlow | 1 | 29.7 |
380 | | adroit (CPU) | TensorFlow | 1 | 9.2 |
381 |
382 | Times are best of 5 for a square matrix with N=2000 in double precision.
383 |
384 | For the LU decomposition using SciPy:
385 |
386 | ```
387 | from time import perf_counter
388 |
389 | import numpy as np
390 | import scipy as sp
391 | from scipy.linalg import lu
392 |
393 | N = 10000
394 | cpu_runs = 5
395 |
396 | times = []
397 | X = np.random.randn(N, N).astype(np.float64)
398 | for _ in range(cpu_runs):
399 | t0 = perf_counter()
400 | p, l, u = lu(X, check_finite=False)
401 | times.append(perf_counter() - t0)
402 | print("CPU time: ", min(times))
403 | print("NumPy version: ", np.__version__)
404 | print("SciPy version: ", sp.__version__)
405 | print(p.sum())
406 | print(times)
407 | ```
408 |
409 | For the LU decomposition with TensorFlow on the CPU:
410 |
411 | ```
412 | from time import perf_counter
413 |
414 | import os
415 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
416 |
417 | import tensorflow as tf
418 | print("TensorFlow version: ", tf.__version__)
419 |
420 | times = []
421 | N = 10000
422 | with tf.device("/cpu:0"):
423 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64)
424 | for _ in range(5):
425 | t0 = perf_counter()
426 | lu, p = tf.linalg.lu(x)
427 | elapsed_time = perf_counter() - t0
428 | times.append(elapsed_time)
429 | print("Execution time: ", min(times))
430 | print(times)
431 | print("Result: ", tf.reduce_sum(p).numpy())
432 | ```
433 |
434 | SVD with NumPy:
435 |
436 | ```
437 | from time import perf_counter
438 |
439 | N = 2000
440 | cpu_runs = 5
441 |
442 | times = []
443 | import numpy as np
444 | X = np.random.randn(N, N).astype(np.float64)
445 | for _ in range(cpu_runs):
446 | t0 = perf_counter()
447 | u, s, v = np.linalg.svd(X)
448 | times.append(perf_counter() - t0)
449 | print("CPU time: ", min(times))
450 | print("NumPy version: ", np.__version__)
451 | print(s.sum())
452 | print(times)
453 | ```
454 |
455 | Performing benchmarks with R:
456 |
457 | ```
458 | # install.packages("microbenchmark")
459 | library(microbenchmark)
460 | library(Matrix)
461 |
462 | N <- 10000
463 | microbenchmark(lu(matrix(rnorm(N*N), N, N)), times=5, unit="s")
464 | ```
465 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/cupy/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=cupy-job # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --gres=gpu:1 # number of gpus per node
7 | #SBATCH --mem=4G # total memory (RAM) per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --constraint=a100 # choose a100 or v100
10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
11 |
12 | module purge
13 | module load anaconda3/2024.6
14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/cupy-env
15 |
16 | python svd.py
17 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/cupy/lu.py:
--------------------------------------------------------------------------------
1 | from time import perf_counter
2 | import numpy as np
3 | import cupy as cp
4 | import cupyx.scipy.linalg
5 |
6 | N = 10000
7 | X = cp.random.randn(N, N, dtype=np.float64)
8 |
9 | trials = 5
10 | times = []
11 | for _ in range(trials):
12 | start_time = perf_counter()
13 | lu, piv = cupyx.scipy.linalg.lu_factor(X, check_finite=False)
14 | cp.cuda.Device(0).synchronize()
15 | times.append(perf_counter() - start_time)
16 |
17 | print("Execution time: ", min(times))
18 | print("CuPy version: ", cp.__version__)
19 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/cupy/svd.py:
--------------------------------------------------------------------------------
1 | from time import perf_counter
2 | import cupy as cp
3 |
4 | N = 1000
5 | X = cp.random.randn(N, N, dtype=cp.float64)
6 |
7 | trials = 5
8 | times = []
9 | for _ in range(trials):
10 | t0 = perf_counter()
11 | u, s, v = cp.linalg.svd(X)
12 | cp.cuda.Device(0).synchronize()
13 | times.append(perf_counter() - t0)
14 | print("Execution time: ", min(times))
15 | print("sum(s) = ", s.sum())
16 | print("CuPy version: ", cp.__version__)
17 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/julia/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=julia_gpu # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --gres=gpu:1 # number of gpus per node
7 | #SBATCH --mem=4G # total memory (RAM) per node
8 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS)
9 | #SBATCH --constraint=a100 # choose gpu80, a100 or v100
10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
11 |
12 | module purge
13 | module load julia/1.8.2
14 |
15 | julia svd.jl
16 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/julia/svd.jl:
--------------------------------------------------------------------------------
1 | using CUDA
2 | N = 8000
3 | F = CUDA.svd(CUDA.rand(N, N))
4 | println(sum(F.S))
5 | println("completed")
6 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/matlab/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=matlab-svd # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default)
7 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS)
8 | #SBATCH --gres=gpu:1 # number of gpus per node
9 | #SBATCH --constraint=a100 # choose a100 or v100
10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
11 |
12 | module purge
13 | module load matlab/R2023a
14 |
15 | matlab -singleCompThread -nodisplay -nosplash -r svd
16 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/matlab/svd.m:
--------------------------------------------------------------------------------
1 | gpu = gpuDevice();
2 | fprintf('Using a %s GPU.\n', gpu.Name);
3 | disp(gpuDevice);
4 |
5 | X = gpuArray([1 0 2; -1 5 0; 0 3 -9]);
6 | whos X;
7 | [U,S,V] = svd(X)
8 | fprintf('trace(S): %f\n', trace(S))
9 | quit;
10 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/pytorch/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=torch-svd # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --constraint=a100 # choose a100 or v100
10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
11 |
12 | module purge
13 | module load anaconda3/2023.9
14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/torch-env
15 |
16 | python svd.py
17 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/pytorch/svd.py:
--------------------------------------------------------------------------------
1 | from time import perf_counter
2 | import torch
3 |
4 | N = 1000
5 |
6 | cuda0 = torch.device('cuda:0')
7 | x = torch.randn(N, N, dtype=torch.float64, device=cuda0)
8 | t0 = perf_counter()
9 | u, s, v = torch.svd(x)
10 | elapsed_time = perf_counter() - t0
11 |
12 | print("Execution time: ", elapsed_time)
13 | print("Result: ", torch.sum(s).cpu().numpy())
14 | print("PyTorch version: ", torch.__version__)
15 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/tensorflow/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=svd-tf # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem=4G # total memory (RAM) per node
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:02:00 # total run time limit (HH:MM:SS)
9 | #SBATCH --constraint=a100 # choose a100 or v100
10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
11 |
12 | module purge
13 | module load anaconda3/2024.6
14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/tf2-gpu
15 |
16 | python svd.py
17 |
--------------------------------------------------------------------------------
/03_your_first_gpu_job/tensorflow/svd.py:
--------------------------------------------------------------------------------
1 | from time import perf_counter
2 |
3 | import os
4 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
5 |
6 | import tensorflow as tf
7 | print("TensorFlow version: ", tf.__version__)
8 |
9 | N = 100
10 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64)
11 | t0 = perf_counter()
12 | s, u, v = tf.linalg.svd(x)
13 | elapsed_time = perf_counter() - t0
14 | print("Execution time: ", elapsed_time)
15 | print("Result: ", tf.reduce_sum(s).numpy())
16 |
--------------------------------------------------------------------------------
/04_gpu_tools/README.md:
--------------------------------------------------------------------------------
1 | # GPU Tools
2 |
3 | This page presents common tools and utilities for GPU computing.
4 |
5 | # nvidia-smi
6 |
7 | This is the NVIDIA Systems Management Interface. This utility can be used to monitor GPU usage and GPU memory usage. It is a comprehensive tool with many options.
8 |
9 | ```
10 | $ nvidia-smi
11 | Wed May 28 09:39:23 2025
12 | +-----------------------------------------------------------------------------------------+
13 | | NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
14 | |-----------------------------------------+------------------------+----------------------+
15 | | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
16 | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
17 | | | | MIG M. |
18 | |=========================================+========================+======================|
19 | | 0 NVIDIA A100 80GB PCIe On | 00000000:17:00.0 Off | 0 |
20 | | N/A 39C P0 57W / 300W | 0MiB / 81920MiB | 0% Default |
21 | | | | Disabled |
22 | +-----------------------------------------+------------------------+----------------------+
23 |
24 | +-----------------------------------------------------------------------------------------+
25 | | Processes: |
26 | | GPU GI CI PID Type Process name GPU Memory |
27 | | ID ID Usage |
28 | |=========================================================================================|
29 | | No running processes found |
30 | +-----------------------------------------------------------------------------------------+
31 | ```
32 |
33 | To see all of the available options, view the help:
34 |
35 | ```$ nvidia-smi --help```
36 |
37 | Here is an example that produces CSV output of various metrics:
38 |
39 | ```
40 | $ nvidia-smi --query-gpu=timestamp,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
41 | ```
42 |
43 | The command above takes a reading every 5 seconds.
44 |
45 | # Nsight Systems (nsys) for Profiling
46 |
47 | The `nsys` command can be used to generate a timeline of the execution of your code. `nsys-ui` provides a GUI to examine the profiling data generated by `nsys`. See the NVIDIA Nsight Systems [getting started guide](https://docs.nvidia.com/nsight-systems/) and notes on [Summit](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#profiling-gpu-code-with-nvidia-developer-tools).
48 |
49 | To see the help menu:
50 |
51 | ```
52 | $ /usr/local/bin/nsys --help
53 | $ /usr/local/bin/nsys --help profile
54 | ```
55 |
56 | IMPORTANT: Do not run profiling jobs in your `/home` directory because these jobs often write large files that can exceed your quota. Instead launch jobs from `/scratch/gpfs/` where you have lots of space. Here's an example:
57 |
58 | ```
59 | $ ssh @della-gpu.princeton.edu
60 | $ cd /scratch/gpfs/
61 | $ mkdir myjob && cd myjob
62 | # prepare Slurm script
63 | $ sbatch job.slurm
64 | ```
65 |
66 | Below is an example Slurm script:
67 |
68 | ```
69 | #!/bin/bash
70 | #SBATCH --job-name=profile # create a short name for your job
71 | #SBATCH --nodes=1 # node count
72 | #SBATCH --ntasks=1 # total number of tasks across all nodes
73 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
74 | #SBATCH --mem=4G # total memory per node
75 | #SBATCH --gres=gpu:1 # number of gpus per node
76 | #SBATCH --time=00:10:00 # total run time limit (HH:MM:SS)
77 |
78 | module purge
79 | module load anaconda3/2024.10
80 | conda activate myenv
81 |
82 | /usr/local/bin/nsys profile --trace=cuda,nvtx,osrt -o myprofile_${SLURM_JOBID} python myscript.py
83 | ```
84 |
85 | For an MPI code you should use:
86 |
87 | ```
88 | srun --wait=0 /usr/local/bin/nsys profile --trace=cuda,nvtx,osrt,mpi -o myprofile_${SLURM_JOBID} ./my_mpi_exe
89 | ```
90 |
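The `nvtx` trace option above picks up ranges that you annotate in your own code, which makes it easier to locate specific phases of a program in the timeline. Below is a minimal, illustrative sketch (not part of this repository) using the NVTX C API that ships with the CUDA Toolkit; the file name and range label are made up, and it would be compiled with something like `nvcc -o annotated annotated.cu -lnvToolsExt`:

```
#include <stdio.h>
#include <nvToolsExt.h>  // NVTX annotations from the CUDA Toolkit

__global__ void busyKernel() {
  printf("Hello from the GPU.\n");
}

int main() {
  nvtxRangePushA("kernel launch and sync");  // this named range appears in the nsys timeline
  busyKernel<<<1, 1>>>();
  cudaDeviceSynchronize();
  nvtxRangePop();
  return 0;
}
```
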
91 | Run this command to see the summary statistics:
92 |
93 | ```
94 | $ /usr/local/bin/nsys stats myprofile_*.nsys-rep
95 | ```
96 |
97 | To work with the graphical interface (nsys-ui) you can either (1) download the `.nsys-rep` file to your local machine or (2) create a graphical desktop session on [https://mydella.princeton.edu](https://mydella.princeton.edu/) or [https://mystellar.princeton.edu](https://mystellar.princeton.edu/). To create the graphical desktop, choose "Interactive Apps" then "Desktop of Della/Stellar Vis Nodes". Once the session starts, click on the black terminal icon and then run:
98 |
99 | ```
100 | $ /usr/local/bin/nsys-ui myprofile_*.nsys-rep
101 | ```
102 |
103 | # Nsight Compute (ncu) for GPU Kernel Profiling
104 |
105 | The `ncu` command is used for detailed profiling of GPU kernels. See the NVIDIA [documentation](https://docs.nvidia.com/nsight-compute/). On some clusters you will need to load a module to make the command available:
106 |
107 | ```
108 | $ module load cudatoolkit/12.9
109 | $ ncu --help
110 | ```
111 |
112 | The idea is to use `ncu` for the profiling and `ncu-ui` for examining the data in a GUI.
113 |
114 | Below is a sample slurm script:
115 |
116 | ```
117 | #!/bin/bash
118 | #SBATCH --job-name=profile # create a short name for your job
119 | #SBATCH --nodes=1 # node count
120 | #SBATCH --ntasks=1 # total number of tasks across all nodes
121 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
122 | #SBATCH --mem=4G # total memory per node
123 | #SBATCH --gres=gpu:1 # number of gpus per node
124 | #SBATCH --time=00:10:00 # total run time limit (HH:MM:SS)
125 |
126 | export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
127 |
128 | module purge
129 | module load cudatoolkit/12.9
130 | module load anaconda3/2024.10
131 | conda activate myenv
132 |
133 | ncu -o my_report_${SLURM_JOBID} python myscript.py
134 | ```
135 |
136 | Note: the `ncu` profiler can significantly slow down the execution time of the code.
137 |
138 | To work with the graphical interface (ncu-ui) you can either (1) download the `.ncu-rep` file to your local machine or (2) create a graphical desktop session on [https://mydella.princeton.edu](https://mydella.princeton.edu/) or [https://mystellar.princeton.edu](https://mystellar.princeton.edu/). To create the graphical desktop, choose "Interactive Apps" then "Desktop of Della/Stellar Vis Nodes". Once the session starts, click on the black terminal icon and then run:
139 |
140 | ```
141 | $ module load cudatoolkit/12.9
142 | $ ncu-ui my_report_*.ncu-rep
143 | ```
144 |
145 | # line_profiler for Python Profiling
146 |
147 | The [line_profiler](https://researchcomputing.princeton.edu/python-profiling) tool provides profiling information for each line of a function. It is easy to use and it can be used for Python codes that run on CPUs and/or GPUs.
148 |
149 | # nvcc
150 |
151 | This is the NVIDIA CUDA compiler. It is based on LLVM. To compile a simple code:
152 |
153 | ```
154 | $ module load cudatoolkit/12.9
155 | $ nvcc -o hello_world hello_world.cu
156 | ```
157 |
158 | # Job Statistics
159 |
160 | Follow [this procedure](https://researchcomputing.princeton.edu/support/knowledge-base/job-stats) to view detailed metrics for your Slurm jobs. This includes GPU utilization and memory as a function of time.
161 |
162 | # GPU Computing
163 |
164 | See [this page](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) for an overview of the hardware at Princeton as well as useful commands like `gpudash` and `shownodes`.
165 |
166 | # Debuggers
167 |
168 | ### ARM DDT
169 |
170 | The general directions for using the DDT debugger are [here](https://researchcomputing.princeton.edu/faq/debugging-with-ddt-on-the). The getting started guide is [here](https://developer.arm.com/tools-and-software/server-and-hpc/debug-and-profile/arm-forge/arm-ddt).
171 |
172 | ```
173 | $ ssh -X @adroit.princeton.edu # better to use graphical desktop via myadroit
174 | $ git clone https://github.com/PrincetonUniversity/hpc_beginning_workshop
175 | $ cd hpc_beginning_workshop/RC_example_jobs/simple_gpu_kernel
176 | $ salloc -N 1 -n 1 -t 10:00 --gres=gpu:1 --x11
177 | $ module load cudatoolkit/12.9
178 | $ nvcc -g -G hello_world_gpu.cu
179 | $ module load ddt/24.1
180 | $ #export ALLINEA_FORCE_CUDA_VERSION=10.1
181 | $ ddt
182 | # check cuda, uncheck "submit to queue", and click on "Run"
183 | ```
184 |
185 | The `-g` debugging flag is for CPU code while the `-G` flag is for GPU code. `-G` turns off compiler optimizations.
186 |
187 | If the graphics are not displaying fast enough then consider using [TurboVNC](https://researchcomputing.princeton.edu/faq/how-do-i-use-vnc-on-tigre).
188 |
189 | ### `cuda-gdb`
190 |
191 | `cuda-gdb` is a free debugger available as part of the CUDA Toolkit.
192 |
--------------------------------------------------------------------------------
/05_cuda_libraries/README.md:
--------------------------------------------------------------------------------
1 | # GPU-Accelerated Libraries
2 |
3 | Let's say you have a CPU code and you are thinking about writing GPU kernels to accelerate the performance of the slow parts of the code. Before doing this, you should first check whether a GPU library already implements the routines that you need. This page presents an overview of the NVIDIA GPU-accelerated libraries.
4 |
5 | According to NVIDIA: "NVIDIA GPU-accelerated libraries provide highly-optimized functions that perform 2x-10x faster than CPU-only alternatives. Using drop-in interfaces, you can replace CPU-only libraries such as MKL, IPP and FFTW with GPU-accelerated versions with almost no code changes. The libraries can optimally scale your application across multiple GPUs."
6 |
7 | 
8 |
9 | ### Selected libraries
10 |
11 | + **cuDNN** - GPU-accelerated library of primitives for deep neural networks
12 | + **cuBLAS** - GPU-accelerated standard BLAS library
13 | + **cuSPARSE** - GPU-accelerated BLAS for sparse matrices
14 | + **cuRAND** - GPU-accelerated random number generation (RNG)
15 | + **cuSOLVER** - Dense and sparse direct solvers for computer vision, CFD and other applications
16 | + **cuTENSOR** - GPU-accelerated tensor linear algebra library
17 | + **cuFFT** - GPU-accelerated library for Fast Fourier Transforms
18 | + **NPP** - GPU-accelerated image, video, and signal processing functions
19 | + **NCCL** - Collective Communications Library for scaling apps across multiple GPUs and nodes
20 | + **nvGRAPH** - GPU-accelerated library for graph analytics
21 |
22 | For the complete list see [GPU libraries](https://developer.nvidia.com/gpu-accelerated-libraries) by NVIDIA.
23 |
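As a small illustration of the "drop-in" idea, the hedged sketch below uses cuBLAS to compute `y = alpha*x + y` (the BLAS `daxpy` routine) on the GPU instead of writing a custom kernel. The file name and vector length are made up for this example; it would be compiled with something like `nvcc -o daxpy daxpy.cu -lcublas`:

```
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  const int n = 5;
  const double alpha = 2.0;
  double x[n] = {1, 2, 3, 4, 5};
  double y[n] = {10, 20, 30, 40, 50};

  // allocate device memory and copy the input vectors to the GPU
  double *d_x, *d_y;
  cudaMalloc(&d_x, n * sizeof(double));
  cudaMalloc(&d_y, n * sizeof(double));
  cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);

  // y = alpha*x + y computed by the library on the GPU
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
  cublasDestroy(handle);

  // copy the result back and print it (expected: 12 24 36 48 60)
  cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) printf("%g ", y[i]);
  printf("\n");

  cudaFree(d_x);
  cudaFree(d_y);
  return 0;
}
```

The same pattern of creating a handle, copying data to the device, calling the library routine, and copying the result back applies to most of the libraries listed above.
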
24 | ## Where to find the libraries
25 |
26 | Run the commands below to examine the libraries:
27 |
28 | ```
29 | $ module show cudatoolkit/12.2
30 | $ ls -lL /usr/local/cuda-12.2/lib64/lib*.so
31 | ```
32 |
33 | ## Example
34 |
35 | Make sure that you are on the `adroit5` login node:
36 |
37 | ```
38 | $ hostname
39 | adroit5
40 | ```
41 |
42 | Instead of computing the singular value decomposition (SVD) on the CPU, this example computes it on the GPU using `libcusolver`. First look over the source code:
43 |
44 | ```
45 | $ cd gpu_programming_intro/05_cuda_libraries
46 | $ cat gesvdj_example.cpp | less # q to quit
47 | ```
48 |
49 | The header file `cusolverDn.h` included by `gesvdj_example.cpp` contains the line `cuSolverDN : Dense Linear Algebra Library` providing information about its purpose. See the [cuSOLVER API](https://docs.nvidia.com/cuda/cusolver/index.html) for more.
50 |
51 |
52 | Next, compile and link the code as follows:
53 |
54 | ```
55 | $ module load cudatoolkit/12.2
56 | $ g++ -o gesvdj_example gesvdj_example.cpp -lcudart -lcusolver
57 | ```
58 |
59 | Run `ldd gesvdj_example` to check the linking against cuSOLVER (i.e., `libcusolver.so`).
60 |
61 | Submit the job to the scheduler with:
62 |
63 | ```
64 | $ sbatch job.slurm
65 | ```
66 |
67 | The output should appear as:
68 |
69 | ```
70 | $ cat slurm-*.out
71 |
72 | example of gesvdj
73 | tol = 1.000000E-07, default value is machine zero
74 | max. sweeps = 15, default value is 100
75 | econ = 0
76 | A = (matlab base-1)
77 | A(1,1) = 1.0000000000000000E+00
78 | A(1,2) = 2.0000000000000000E+00
79 | A(2,1) = 4.0000000000000000E+00
80 | A(2,2) = 5.0000000000000000E+00
81 | A(3,1) = 2.0000000000000000E+00
82 | A(3,2) = 1.0000000000000000E+00
83 | =====
84 | gesvdj converges
85 | S = singular values (matlab base-1)
86 | S(1,1) = 7.0652834970827287E+00
87 | S(2,1) = 1.0400812977120775E+00
88 | =====
89 | U = left singular vectors (matlab base-1)
90 | U(1,1) = 3.0821892063278472E-01
91 | U(1,2) = -4.8819507401989848E-01
92 | U(1,3) = 8.1649658092772659E-01
93 | U(2,1) = 9.0613333377729299E-01
94 | U(2,2) = -1.1070553170904460E-01
95 | U(2,3) = -4.0824829046386302E-01
96 | U(3,1) = 2.8969549251172333E-01
97 | U(3,2) = 8.6568461633075366E-01
98 | U(3,3) = 4.0824829046386224E-01
99 | =====
100 | V = right singular vectors (matlab base-1)
101 | V(1,1) = 6.3863583713639760E-01
102 | V(1,2) = 7.6950910814953477E-01
103 | V(2,1) = 7.6950910814953477E-01
104 | V(2,2) = -6.3863583713639760E-01
105 | =====
106 | |S - S_exact|_sup = 4.440892E-16
107 | residual |A - U*S*V**H|_F = 3.511066E-16
108 | number of executed sweeps = 1
109 | ```
110 |
111 | ## NVIDIA CUDA Samples
112 |
113 | Run the following command to obtain a copy of the [NVIDIA CUDA Samples](https://github.com/NVIDIA/cuda-samples):
114 |
115 | ```
116 | $ cd gpu_programming_intro
117 | $ git clone https://github.com/NVIDIA/cuda-samples.git
118 | $ cd cuda-samples/Samples
119 | ```
120 |
121 | Then browse the directories:
122 |
123 | ```
124 | $ ls -ltrh
125 | total 20K
126 | drwxr-xr-x. 55 jdh4 cses 4.0K Oct 9 18:23 0_Introduction
127 | drwxr-xr-x. 6 jdh4 cses 130 Oct 9 18:23 1_Utilities
128 | drwxr-xr-x. 36 jdh4 cses 4.0K Oct 9 18:23 2_Concepts_and_Techniques
129 | drwxr-xr-x. 25 jdh4 cses 4.0K Oct 9 18:23 3_CUDA_Features
130 | drwxr-xr-x. 40 jdh4 cses 4.0K Oct 9 18:23 4_CUDA_Libraries
131 | drwxr-xr-x. 52 jdh4 cses 4.0K Oct 9 18:23 5_Domain_Specific
132 | drwxr-xr-x. 5 jdh4 cses 105 Oct 9 18:23 6_Performance
133 | ```
134 |
135 | Pick an example and then build and run it. For instance:
136 |
137 | ```
138 | $ module load cudatoolkit/12.2
139 | $ cd 0_Introduction/matrixMul
140 | $ make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ # use 90 for H100 GPUs on Tiger and Della (PLI)
141 | ```
142 |
143 | This will produce `matrixMul`. If you run the `ldd` command on `matrixMul` you will see that it does not link against `cublas.so`. Instead it uses a naive implementation of the routine which is surely not as efficient as the library implementation.
144 |
145 | ```
146 | $ cp /gpu_programming_intro/05_cuda_libraries/matrixMul/job.slurm .
147 | ```
148 |
149 | Submit the job:
150 |
151 | ```
152 | $ sbatch job.slurm
153 | ```
154 |
155 | See `4_CUDA_Libraries` for more examples. For instance, take a look at `4_CUDA_Libraries/matrixMulCUBLAS`. Does the resulting executable link against `libcublas.so`?
156 |
157 | ```
158 | $ cd ../../4_CUDA_Libraries/matrixMulCUBLAS
159 | $ make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++
160 | $ ldd matrixMulCUBLAS
161 | ```
162 |
163 | Similarly, does the code in `4_CUDA_Libraries/simpleCUFFT_MGPU` link against `libcufft.so`?
164 |
165 | To run code that uses the Tensor Cores see examples such as `3_CUDA_Features/bf16TensorCoreGemm`. That example uses the bfloat16 floating-point format.
166 |
167 | Note that some examples have dependencies that are not satisfied on the cluster, so they will not build. This can be resolved if the example relates to your research work. For instance, to build `5_Domain_Specific/nbody` use:
168 |
169 | ```
170 | GLPATH=/lib64 make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ # use 90 for H100 GPUs on Tiger and Della (PLI)
171 | ```
172 |
173 | Note that `nbody` will not run successfully on adroit since the GPU nodes do not have `libglut.so`. The library could be added if needed. One can compile and run this code on adroit-vis using `TARGET_ARCH=x86_64 SMS="80"`.
174 |
--------------------------------------------------------------------------------
/05_cuda_libraries/gesvdj_example.cpp:
--------------------------------------------------------------------------------
1 | /*
2 | * * How to compile (assume cuda is installed at /usr/local/cuda-10.1/)
3 | * * nvcc -c -I/usr/local/cuda-10.1/include gesvdj_example.cpp
4 | * * g++ -o gesvdj_example gesvdj_example.o -L/usr/local/cuda-10.1/lib64 -lcudart -lcusolver
5 | * */
6 | #include <stdio.h>
7 | #include <stdlib.h>
8 | #include <string.h>
9 | #include <assert.h>
10 | #include <cuda_runtime.h>
11 | #include <cusolverDn.h>
12 |
13 | void printMatrix(int m, int n, const double*A, int lda, const char* name)
14 | {
15 | for(int row = 0 ; row < m ; row++){
16 | for(int col = 0 ; col < n ; col++){
17 | double Areg = A[row + col*lda];
18 | printf("%s(%d,%d) = %20.16E\n", name, row+1, col+1, Areg);
19 | }
20 | }
21 | }
22 |
23 | int main(int argc, char*argv[])
24 | {
25 | cusolverDnHandle_t cusolverH = NULL;
26 | cudaStream_t stream = NULL;
27 | gesvdjInfo_t gesvdj_params = NULL;
28 |
29 | cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS;
30 | cudaError_t cudaStat1 = cudaSuccess;
31 | cudaError_t cudaStat2 = cudaSuccess;
32 | cudaError_t cudaStat3 = cudaSuccess;
33 | cudaError_t cudaStat4 = cudaSuccess;
34 | cudaError_t cudaStat5 = cudaSuccess;
35 | const int m = 3;
36 | const int n = 2;
37 | const int lda = m;
38 | /* | 1 2 |
39 | * * A = | 4 5 |
40 | * * | 2 1 |
41 | * */
42 | double A[lda*n] = { 1.0, 4.0, 2.0, 2.0, 5.0, 1.0};
43 | double U[lda*m]; /* m-by-m unitary matrix, left singular vectors */
44 | double V[lda*n]; /* n-by-n unitary matrix, right singular vectors */
45 | double S[n]; /* numerical singular value */
46 | /* exact singular values */
47 | double S_exact[n] = {7.065283497082729, 1.040081297712078};
48 | double *d_A = NULL; /* device copy of A */
49 | double *d_S = NULL; /* singular values */
50 | double *d_U = NULL; /* left singular vectors */
51 | double *d_V = NULL; /* right singular vectors */
52 | int *d_info = NULL; /* error info */
53 | int lwork = 0; /* size of workspace */
54 |     double *d_work = NULL; /* device workspace for gesvdj */
55 | int info = 0; /* host copy of error info */
56 |
57 | /* configuration of gesvdj */
58 | const double tol = 1.e-7;
59 | const int max_sweeps = 15;
60 | const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvectors.
61 | const int econ = 0 ; /* econ = 1 for economy size */
62 |
63 | /* numerical results of gesvdj */
64 | double residual = 0;
65 | int executed_sweeps = 0;
66 |
67 | printf("example of gesvdj \n");
68 | printf("tol = %E, default value is machine zero \n", tol);
69 | printf("max. sweeps = %d, default value is 100\n", max_sweeps);
70 | printf("econ = %d \n", econ);
71 |
72 | printf("A = (matlab base-1)\n");
73 | printMatrix(m, n, A, lda, "A");
74 | printf("=====\n");
75 |
76 | /* step 1: create cusolver handle, bind a stream */
77 | status = cusolverDnCreate(&cusolverH);
78 | assert(CUSOLVER_STATUS_SUCCESS == status);
79 |
80 | cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
81 | assert(cudaSuccess == cudaStat1);
82 |
83 | status = cusolverDnSetStream(cusolverH, stream);
84 | assert(CUSOLVER_STATUS_SUCCESS == status);
85 |
86 | /* step 2: configuration of gesvdj */
87 | status = cusolverDnCreateGesvdjInfo(&gesvdj_params);
88 | assert(CUSOLVER_STATUS_SUCCESS == status);
89 |
90 | /* default value of tolerance is machine zero */
91 | status = cusolverDnXgesvdjSetTolerance(
92 | gesvdj_params,
93 | tol);
94 | assert(CUSOLVER_STATUS_SUCCESS == status);
95 |
96 | /* default value of max. sweeps is 100 */
97 | status = cusolverDnXgesvdjSetMaxSweeps(
98 | gesvdj_params,
99 | max_sweeps);
100 | assert(CUSOLVER_STATUS_SUCCESS == status);
101 |
102 | /* step 3: copy A and B to device */
103 | cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double)*lda*n);
104 | cudaStat2 = cudaMalloc ((void**)&d_S , sizeof(double)*n);
105 | cudaStat3 = cudaMalloc ((void**)&d_U , sizeof(double)*lda*m);
106 | cudaStat4 = cudaMalloc ((void**)&d_V , sizeof(double)*lda*n);
107 | cudaStat5 = cudaMalloc ((void**)&d_info, sizeof(int));
108 | assert(cudaSuccess == cudaStat1);
109 | assert(cudaSuccess == cudaStat2);
110 | assert(cudaSuccess == cudaStat3);
111 | assert(cudaSuccess == cudaStat4);
112 | assert(cudaSuccess == cudaStat5);
113 |
114 | cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n, cudaMemcpyHostToDevice);
115 | assert(cudaSuccess == cudaStat1);
116 |
117 | /* step 4: query workspace of SVD */
118 | status = cusolverDnDgesvdj_bufferSize(
119 | cusolverH,
120 | jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */
121 | /* CUSOLVER_EIG_MODE_VECTOR: compute singular value and singular vectors */
122 | econ, /* econ = 1 for economy size */
123 |         m,    /* number of rows of A, 0 <= m */
124 | n, /* number of columns of A, 0 <= n */
125 | d_A, /* m-by-n */
126 | lda, /* leading dimension of A */
127 | d_S, /* min(m,n) */
128 | /* the singular values in descending order */
129 | d_U, /* m-by-m if econ = 0 */
130 | /* m-by-min(m,n) if econ = 1 */
131 | lda, /* leading dimension of U, ldu >= max(1,m) */
132 | d_V, /* n-by-n if econ = 0 */
133 | /* n-by-min(m,n) if econ = 1 */
134 | lda, /* leading dimension of V, ldv >= max(1,n) */
135 | &lwork,
136 | gesvdj_params);
137 | assert(CUSOLVER_STATUS_SUCCESS == status);
138 |
139 | cudaStat1 = cudaMalloc((void**)&d_work , sizeof(double)*lwork);
140 | assert(cudaSuccess == cudaStat1);
141 |
142 | /* step 5: compute SVD */
143 | status = cusolverDnDgesvdj(
144 | cusolverH,
145 | jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */
146 | /* CUSOLVER_EIG_MODE_VECTOR: compute singular value and singular vectors */
147 | econ, /* econ = 1 for economy size */
148 |         m,    /* number of rows of A, 0 <= m */
149 | n, /* number of columns of A, 0 <= n */
150 | d_A, /* m-by-n */
151 | lda, /* leading dimension of A */
152 | d_S, /* min(m,n) */
153 | /* the singular values in descending order */
154 | d_U, /* m-by-m if econ = 0 */
155 | /* m-by-min(m,n) if econ = 1 */
156 | lda, /* leading dimension of U, ldu >= max(1,m) */
157 | d_V, /* n-by-n if econ = 0 */
158 | /* n-by-min(m,n) if econ = 1 */
159 | lda, /* leading dimension of V, ldv >= max(1,n) */
160 | d_work,
161 | lwork,
162 | d_info,
163 | gesvdj_params);
164 | cudaStat1 = cudaDeviceSynchronize();
165 | assert(CUSOLVER_STATUS_SUCCESS == status);
166 | assert(cudaSuccess == cudaStat1);
167 |
168 | cudaStat1 = cudaMemcpy(U, d_U, sizeof(double)*lda*m, cudaMemcpyDeviceToHost);
169 | cudaStat2 = cudaMemcpy(V, d_V, sizeof(double)*lda*n, cudaMemcpyDeviceToHost);
170 | cudaStat3 = cudaMemcpy(S, d_S, sizeof(double)*n , cudaMemcpyDeviceToHost);
171 | cudaStat4 = cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
172 | cudaStat5 = cudaDeviceSynchronize();
173 | assert(cudaSuccess == cudaStat1);
174 | assert(cudaSuccess == cudaStat2);
175 | assert(cudaSuccess == cudaStat3);
176 | assert(cudaSuccess == cudaStat4);
177 | assert(cudaSuccess == cudaStat5);
178 |
179 | if ( 0 == info ){
180 | printf("gesvdj converges \n");
181 | }else if ( 0 > info ){
182 | printf("%d-th parameter is wrong \n", -info);
183 | exit(1);
184 | }else{
185 | printf("WARNING: info = %d : gesvdj does not converge \n", info );
186 | }
187 |
188 | printf("S = singular values (matlab base-1)\n");
189 | printMatrix(n, 1, S, lda, "S");
190 | printf("=====\n");
191 |
192 | printf("U = left singular vectors (matlab base-1)\n");
193 | printMatrix(m, m, U, lda, "U");
194 | printf("=====\n");
195 |
196 | printf("V = right singular vectors (matlab base-1)\n");
197 | printMatrix(n, n, V, lda, "V");
198 | printf("=====\n");
199 |
200 | /* step 6: measure error of singular value */
201 | double ds_sup = 0;
202 | for(int j = 0; j < n; j++){
203 | double err = fabs( S[j] - S_exact[j] );
204 | ds_sup = (ds_sup > err)? ds_sup : err;
205 | }
206 | printf("|S - S_exact|_sup = %E \n", ds_sup);
207 |
208 | status = cusolverDnXgesvdjGetSweeps(
209 | cusolverH,
210 | gesvdj_params,
211 | &executed_sweeps);
212 | assert(CUSOLVER_STATUS_SUCCESS == status);
213 |
214 | status = cusolverDnXgesvdjGetResidual(
215 | cusolverH,
216 | gesvdj_params,
217 | &residual);
218 | assert(CUSOLVER_STATUS_SUCCESS == status);
219 |
220 | printf("residual |A - U*S*V**H|_F = %E \n", residual );
221 | printf("number of executed sweeps = %d \n", executed_sweeps );
222 |
223 | /* free resources */
224 | if (d_A ) cudaFree(d_A);
225 | if (d_S ) cudaFree(d_S);
226 | if (d_U ) cudaFree(d_U);
227 | if (d_V ) cudaFree(d_V);
228 | if (d_info) cudaFree(d_info);
229 | if (d_work ) cudaFree(d_work);
230 |
231 | if (cusolverH) cusolverDnDestroy(cusolverH);
232 | if (stream ) cudaStreamDestroy(stream);
233 | if (gesvdj_params) cusolverDnDestroyGesvdjInfo(gesvdj_params);
234 |
235 | cudaDeviceReset();
236 | return 0;
237 | }
238 |
--------------------------------------------------------------------------------
/05_cuda_libraries/hello_world_gpu_library/README.md:
--------------------------------------------------------------------------------
1 | # Building a Simple GPU Library
2 |
3 | In this exercise we will construct a "hello world" GPU library called `cumessage` and then link and run a code against it.
4 |
5 | ### Create the GPU Library
6 |
7 | Inspect the files that compose the GPU library:
8 |
9 | ```bash
10 | $ cd 05_cuda_libraries/hello_world_gpu_library
11 | $ cat cumessage.h
12 | $ cat cumessage.cu
13 | ```
14 |
15 | `cumessage.h` is the header file. It contains the prototype (signature) of one function. That is, the name and the input/output types are specified but the function body is not implemented here. The implementation is in `cumessage.cu`. There is some CUDA code in that file. It will be explained in `06_cuda_kernels`.
16 |
17 | Libraries are standalone. That is, there is nothing at present waiting to use our library. We will simply create it and then write a code that can use it. Create the library by running the following commands:
18 |
19 | ```bash
20 | $ module load cudatoolkit/11.7
21 | $ nvcc -Xcompiler -fPIC -o libcumessage.so -shared cumessage.cu
22 | $ ls -ltr
23 | ```
24 |
25 | This will produce `libcumessage.so` which is a GPU library with a single function. Add the option "-v" to the line beginning with `nvcc` above to see more details. You will see that `gcc` is being called.
26 |
27 | ### Use the GPU Library
28 |
29 | Take a look at our simple code in `myapp.cu` that will use our GPU library:
30 |
31 | ```bash
32 | $ cat myapp.cu
33 | ```
34 |
35 | Once again, note that `myapp.cu` only needs to know about the inputs and outputs of `GPUfunction` through the header file. Nothing is known to `myapp.cu` about how that function is implemented.
36 |
37 | Compile the main routine against our GPU library:
38 |
39 | ```
40 | $ nvcc -I. -o myapp myapp.cu -L. -lcudart -lcumessage
41 | $ ls -ltr
42 | ```
43 |
44 | This will produce `myapp` which is a GPU application that links against our GPU library `libcumessage.so`:
45 |
46 | ```
47 | $ env LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ldd myapp
48 | linux-vdso.so.1 (0x00007fffdaf61000)
49 | libcumessage.so => ./libcumessage.so (0x000014d68450a000)
50 | libcudart.so.11.0 => /usr/local/cuda-11.4/lib64/libcudart.so.11.0 (0x000014d684268000)
51 | librt.so.1 => /lib64/librt.so.1 (0x000014d684060000)
52 | libpthread.so.0 => /lib64/libpthread.so.0 (0x000014d683e40000)
53 | libdl.so.2 => /lib64/libdl.so.2 (0x000014d683c3c000)
54 | libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014d6838a7000)
55 | libm.so.6 => /lib64/libm.so.6 (0x000014d683525000)
56 | libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014d68330d000)
57 | libc.so.6 => /lib64/libc.so.6 (0x000014d682f48000)
58 | /lib64/ld-linux-x86-64.so.2 (0x000014d6847a9000)
59 | ```
60 | Finally, submit the job and inspect the output:
61 |
62 | ```
63 | $ sbatch job.slurm
64 | $ cat slurm-*.out
65 | Hello world from the CPU.
66 | Hello world from the GPU.
67 | ```
68 |
--------------------------------------------------------------------------------
/05_cuda_libraries/hello_world_gpu_library/cumessage.cu:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include "cumessage.h"
3 |
4 | __global__ void GPUFunction_kernel() {
5 | printf("Hello world from the GPU.\n");
6 | }
7 |
8 | void GPUFunction() {
9 | GPUFunction_kernel<<<1,1>>>();
10 |
11 | // kernel execution is asynchronous so sync on its completion
12 | cudaDeviceSynchronize();
13 | }
14 |
--------------------------------------------------------------------------------
/05_cuda_libraries/hello_world_gpu_library/cumessage.h:
--------------------------------------------------------------------------------
1 | void GPUFunction();
2 |
--------------------------------------------------------------------------------
/05_cuda_libraries/hello_world_gpu_library/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=gpu-lib # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G per cpu-core is default)
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:01:00 # total run time limit (HH:MM:SS)
9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
10 |
11 | module purge
12 | module load cudatoolkit/11.7
13 | export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
14 |
15 | ./myapp
16 |
--------------------------------------------------------------------------------
/05_cuda_libraries/hello_world_gpu_library/myapp.cu:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include "cumessage.h"
3 |
4 | void CPUFunction() {
5 | printf("Hello world from the CPU.\n");
6 | }
7 |
8 | int main() {
9 | // function to run on the cpu
10 | CPUFunction();
11 |
12 | // function to run on the gpu
13 | GPUFunction();
14 |
15 | return 0;
16 | }
17 |
--------------------------------------------------------------------------------
/05_cuda_libraries/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=cuda-libs # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default)
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --constraint=a100 # choose gpu80, a100 or v100
10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
11 |
12 | module purge
13 | module load cudatoolkit/12.2
14 |
15 | ./gesvdj_example
16 |
--------------------------------------------------------------------------------
/05_cuda_libraries/matrixMul/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=cuda-libs # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default)
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --constraint=a100 # choose a100 or v100
10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
11 |
12 | module purge
13 | module load cudatoolkit/12.2
14 |
15 | ./matrixMul
16 |
--------------------------------------------------------------------------------
/06_cuda_kernels/01_hello_world/README.md:
--------------------------------------------------------------------------------
1 | # Hello World
2 |
3 | On this page we consider the simplest CPU C code and the simplest CUDA C GPU code.
4 |
5 | ## CPU
6 |
7 | A simple CPU-only code:
8 |
9 | ```C
10 | #include <stdio.h>
11 |
12 | void CPUFunction() {
13 | printf("Hello world from the CPU.\n");
14 | }
15 |
16 | int main() {
17 | // function to run on the cpu
18 | CPUFunction();
19 | }
20 | ```
21 |
22 | This can be compiled and run with:
23 |
24 | ```
25 | $ cd gpu_programming_intro/06_cuda_kernels/01_hello_world
26 | $ gcc -o hello_world hello_world.c
27 | $ ./hello_world
28 | ```
29 |
30 | The output is
31 |
32 | ```
33 | Hello world from the CPU.
34 | ```
35 |
36 | ## GPU
37 |
38 | Below is a simple GPU code that calls a CPU function followed by a GPU function:
39 |
40 | ```C
41 | #include <stdio.h>
42 |
43 | void CPUFunction() {
44 | printf("Hello world from the CPU.\n");
45 | }
46 |
47 | __global__ void GPUFunction() {
48 | printf("Hello world from the GPU.\n");
49 | }
50 |
51 | int main() {
52 | // function to run on the cpu
53 | CPUFunction();
54 |
55 | // function to run on the gpu
56 | GPUFunction<<<1, 1>>>();
57 |
58 | // kernel execution is asynchronous so sync on its completion
59 | cudaDeviceSynchronize();
60 | }
61 | ```
62 |
63 | The GPU code above can be compiled and executed with:
64 |
65 | ```
66 | $ module load cudatoolkit/12.2
67 | $ nvcc -o hello_world_gpu hello_world_gpu.cu
68 | $ sbatch job.slurm
69 | ```
70 |
71 | The output should be:
72 |
73 | ```
74 | $ cat slurm-*.out
75 | Hello world from the CPU.
76 | Hello world from the GPU.
77 | ```
78 |
79 | `nvcc` is the NVIDIA CUDA Compiler. It compiles the GPU code itself and uses GNU `gcc` to compile the CPU code. CUDA provides extensions for many common programming languages (e.g., C/C++/Fortran). These language extensions allow developers to write GPU functions.
80 |
81 | From this simple example we learn that GPU functions are declared with `__global__`, which is a CUDA C/C++ keyword. The triple angle brackets or so-called "triple chevron" is used to specify the execution configuration of the kernel launch which is a call from host code to device code.
82 |
83 | Here is the general form for the execution configuration: `<<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>`. In the example above we used 1 block and 1 thread per block. At a high level, the execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks), as well as how many threads to execute in each block.
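
As a concrete illustration (a hedged variation on the code above, not a separate file in this repository), the launch below uses 2 blocks of 4 threads each, so the message is printed 8 times:

```C
#include <stdio.h>

__global__ void GPUFunction() {
  printf("Hello world from the GPU.\n");
}

int main() {
  // execution configuration: 2 blocks, 4 threads per block (8 threads in total)
  GPUFunction<<<2, 4>>>();

  // the kernel launch is asynchronous, so wait for it to finish
  cudaDeviceSynchronize();
}
```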
84 |
85 | Notice the return type of `void` for `GPUFunction`. Functions defined with the `__global__` keyword are required to return `void`.
86 |
87 | ### Exercises
88 |
89 | 1. What happens if you remove `__global__`?
90 |
91 | 2. Can you rewrite the code so that the output is:
92 |
93 | ```
94 | Hello world from the CPU.
95 | Hello world from the GPU.
96 | Hello world from the CPU.
97 | ```
98 |
99 | 3. What happens if you comment out the `cudaDeviceSynchronize()` line by preceding it with `//`?
100 |
--------------------------------------------------------------------------------
/06_cuda_kernels/01_hello_world/hello_world.c:
--------------------------------------------------------------------------------
1 | #include
2 |
3 | void CPUFunction() {
4 | printf("Hello world from the CPU.\n");
5 | }
6 |
7 | int main() {
8 | // function to run on the cpu
9 | CPUFunction();
10 | }
11 |
--------------------------------------------------------------------------------
/06_cuda_kernels/01_hello_world/hello_world_gpu.cu:
--------------------------------------------------------------------------------
1 | #include
2 |
3 | void CPUFunction() {
4 | printf("Hello world from the CPU.\n");
5 | }
6 |
7 | __global__ void GPUFunction() {
8 | printf("Hello world from the GPU.\n");
9 | }
10 |
11 | int main() {
12 | // function to run on the cpu
13 | CPUFunction();
14 |
15 | // function to run on the gpu
16 | GPUFunction<<<1, 1>>>();
17 |
18 | // kernel execution is asynchronous so sync on its completion
19 | cudaDeviceSynchronize();
20 | }
21 |
--------------------------------------------------------------------------------
/06_cuda_kernels/01_hello_world/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=hw-gpu # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default)
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
10 |
11 | ./hello_world_gpu
12 |
--------------------------------------------------------------------------------
/06_cuda_kernels/02_simple_kernel/README.md:
--------------------------------------------------------------------------------
1 | # Launching Parallel Kernels
2 |
3 | The execution configuration allows programmers to specify details about launching the kernel to run in parallel on multiple GPU threads. More precisely, the execution configuration allows programmers to specify how many groups of threads (called thread blocks) to launch and how many threads each thread block should contain. The syntax for this is:
4 |
5 | ```
6 | <<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>
7 | ```
8 |
9 | The kernel code is executed by every thread in every thread block configured when the kernel is launched. The image below corresponds to `<<<1, 5>>>`:
10 |
11 | 
12 |
13 |
14 | ## CPU Code
15 |
16 | ```c
17 | #include <stdio.h>
18 |
19 | void firstParallel()
20 | {
21 | printf("This should be running in parallel.\n");
22 | }
23 |
24 | int main()
25 | {
26 | firstParallel();
27 | }
28 | ```
29 |
30 | ## Exercise: GPU implementation
31 |
32 | ```
33 | # rewrite the CPU code above so that it runs on a GPU using multiple threads
34 | # save your file as first_parallel.cu (a starting file by this name is given -- see below)
35 | ```
36 |
37 | The objective is to write a GPU code with one kernel launch that produces the following 6 lines of output:
38 |
39 | ```
40 | This should be running in parallel.
41 | This should be running in parallel.
42 | This should be running in parallel.
43 | This should be running in parallel.
44 | This should be running in parallel.
45 | This should be running in parallel.
46 | ```
47 |
48 | To get started:
49 |
50 | ```
51 | $ cd gpu_programming_intro/06_cuda_kernels/02_simple_kernel
52 | # edit first_parallel.cu (use a text editor of your choice)
53 | $ nvcc -o first_parallel first_parallel.cu
54 | $ sbatch job.slurm
55 | ```
56 |
57 | There are multiple possible solutions.
58 |
--------------------------------------------------------------------------------
/06_cuda_kernels/02_simple_kernel/first_parallel.cu:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | void CPUFunction() {
4 | printf("Hello world from the CPU.\n");
5 | }
6 |
7 | __global__ void GPUFunction() {
8 | printf("Hello world from the GPU.\n");
9 | }
10 |
11 | int main() {
12 | // function to run on the cpu
13 | CPUFunction();
14 |
15 | // function to run on the gpu
16 | GPUFunction<<<1, 1>>>();
17 |
18 | // kernel execution is asynchronous so sync on its completion
19 | cudaDeviceSynchronize();
20 | }
21 |
--------------------------------------------------------------------------------
/06_cuda_kernels/02_simple_kernel/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=serial_c # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default)
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
10 |
11 | ./first_parallel
12 |
--------------------------------------------------------------------------------
/06_cuda_kernels/02_simple_kernel/solution.cu:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | __global__ void firstParallel()
4 | {
5 | printf("This is running in parallel.\n");
6 | }
7 |
8 | int main()
9 | {
10 | firstParallel<<<2, 3>>>();
11 | cudaDeviceSynchronize();
12 | }
13 |
--------------------------------------------------------------------------------
/06_cuda_kernels/03_thread_indices/README.md:
--------------------------------------------------------------------------------
1 | # Built-in Thread and Block Indices
2 |
3 | Each thread is given an index within its thread block, starting at 0. Additionally, each block is given an index, starting at 0. Threads are grouped into thread blocks, and thread blocks are grouped into a grid, which is the highest entity in the CUDA thread hierarchy. (On recent GPUs, thread blocks can optionally be grouped into thread block clusters within the grid.)
4 |
5 | 
6 |
7 | CUDA kernels have access to special variables identifying both the index of the thread (within the block) that is executing the kernel, and, the index of the block (within the grid) that the thread is within. These variables are `threadIdx.x` and `blockIdx.x` respectively. Below is an example use of `threadIdx.x`:
8 |
9 | ```C
10 | __global__ void GPUFunction() {
11 | printf("My thread index is: %d\n", threadIdx.x);
12 | }
13 | ```
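
The block index can be printed in the same way. Here is a small illustrative sketch (not one of the exercise files) that reports both built-in indices:

```C
#include <stdio.h>

__global__ void GPUFunction() {
  // each thread prints its block index and its thread index within that block
  printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
  GPUFunction<<<2, 3>>>();  // 2 blocks with 3 threads each
  cudaDeviceSynchronize();
}
```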
14 |
15 | ## CPU implementation of a for loop
16 |
17 | ```C
18 | #include <stdio.h>
19 |
20 | void printLoopIndex() {
21 | int N = 100;
22 | for (int i = 0; i < N; ++i)
23 | printf("%d\n", i);
24 | }
25 |
26 | int main() {
27 | // function to run on the cpu
28 | printLoopIndex();
29 | }
30 | ```
31 |
32 | Run the CPU code above by following these commands:
33 |
34 | ```bash
35 | $ cd gpu_programming_intro/06_cuda_kernels/03_thread_indices
36 | $ nvcc -o for_loop for_loop.c
37 | $ ./for_loop
38 | ```
39 |
40 | The output of the above is
41 |
42 | ```
43 | 0
44 | 1
45 | 2
46 | ...
47 | 97
48 | 98
49 | 99
50 | ```
51 |
52 | ## Exercise: GPU implementation
53 |
54 | In the CPU code above, the loop is carried out in serial. That is, loop iterations take place one at a time. Can you write a GPU code that produces the same output as that above but does so in parallel using a CUDA kernel?
55 |
56 | ```
57 | // write a GPU kernel to produce the output above
58 | ```
59 |
60 | To get started:
61 |
62 | ```bash
63 | $ module load cudatoolkit/12.2
64 | # edit for_loop.cu
65 | $ nvcc -o for_loop for_loop.cu
66 | $ sbatch job.slurm
67 | ```
68 |
69 | Click [here](hint.md) to see some hints.
70 |
71 | One possible solution is [here](solution.cu) (try for yourself first).
72 |
73 | Do you see any behavior involving multiples of 32 in this exercise? For NVIDIA GPUs, the threads within a thread block are organized into "warps". A "warp" is composed of 32 threads. [Read more](http://15418.courses.cs.cmu.edu/spring2013/article/15) about how `printf` works in CUDA.
74 |
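A natural extension is to split the work over multiple thread blocks. The hedged sketch below, which is not one of the provided files, combines `blockIdx.x`, `blockDim.x`, and `threadIdx.x` to produce the numbers 0 through 99 using 4 blocks of 25 threads (the lines may not appear in ascending order since the threads run in parallel):

```C
#include <stdio.h>

__global__ void printLoopIndex() {
  // global index of this thread across all blocks in the grid
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  printf("%d\n", i);
}

int main() {
  printLoopIndex<<<4, 25>>>();  // 4 blocks x 25 threads = 100 threads
  cudaDeviceSynchronize();
}
```
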
--------------------------------------------------------------------------------
/06_cuda_kernels/03_thread_indices/for_loop.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | void printLoopIndex() {
4 | int i;
5 | int N = 100;
6 | for (i = 0; i < N; ++i)
7 | printf("%d\n", i);
8 | }
9 |
10 | int main() {
11 | // function to run on the cpu
12 | printLoopIndex();
13 | }
14 |
--------------------------------------------------------------------------------
/06_cuda_kernels/03_thread_indices/for_loop.cu:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | void printLoopIndex() {
4 | int N = 100;
5 | for (int i = 0; i < N; ++i)
6 | printf("%d\n", i);
7 | }
8 |
9 | int main() {
10 | // function to run on the cpu
11 | printLoopIndex();
12 | }
13 |
--------------------------------------------------------------------------------
/06_cuda_kernels/03_thread_indices/hint.md:
--------------------------------------------------------------------------------
1 | ## Hints
2 |
3 | To understand how to do this exercise, take a look at the code below which uses `threadIdx.x`:
4 |
5 | ```C
6 | #include <stdio.h>
7 |
8 | __global__ void GPUFunction() {
9 |   printf("My thread index is: %d\n", threadIdx.x);
10 | }
11 |
12 | int main() {
13 | GPUFunction<<<1, 1>>>();
14 | cudaDeviceSynchronize();
15 | }
16 | ```
17 |
18 | The output of the code above is
19 |
20 | ```
21 | My thread index is: 0
22 | ```
23 |
24 | We need to replace the `i` variable in the CPU code. In a CUDA kernel, each thread has an index
25 | associated with it called `threadIdx.x`. So use that as the substitution for `i`.
26 |
27 | Next, to generate 100 threads, try a kernel launch like this: `<<<1, 100>>>`
28 |
29 | The above will give you 1 block composed of 100 threads.
30 |
31 | Be sure to add `__global__` to your GPU function and don't forget to call `cudaDeviceSynchronize()`.
32 |
--------------------------------------------------------------------------------
/06_cuda_kernels/03_thread_indices/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=for_loop # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default)
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
10 |
11 | ./for_loop
12 |
--------------------------------------------------------------------------------
/06_cuda_kernels/03_thread_indices/solution.cu:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | __global__ void printLoopIndex() {
4 | printf("%d\n", threadIdx.x);
5 | }
6 |
7 | int main() {
8 | printLoopIndex<<<1, 100>>>();
9 | cudaDeviceSynchronize();
10 | }
11 |
--------------------------------------------------------------------------------
/06_cuda_kernels/04_vector_addition/README.md:
--------------------------------------------------------------------------------
1 | # Elementwise Vector Addition
2 |
3 | ## A Word on Allocating Memory
4 |
5 | Here is an example on the CPU where 10 integers are dynamically allocated and the last line frees the memory:
6 |
7 | ```C
8 | int N = 10;
9 | size_t size = N * sizeof(int);
10 |
11 | int *a;
12 | a = (int*)malloc(size);
13 | free(a);
14 | ```
15 |
16 | On the GPU:
17 |
18 | ```C
19 | int N = 10;
20 | size_t size = N * sizeof(int);
21 |
22 | int *d_a;
23 | cudaMalloc(&d_a, size);
24 | cudaFree(d_a);
25 | ```
26 | Note that we write `d_a` for the GPU case instead of `a` to remind ourselves that we are allocating memory on the "device" or GPU. Sometimes developers will prefix CPU variables with 'h' to denote "host".
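
To actually use the device buffer, data must be copied between host and device with `cudaMemcpy`. Here is a minimal sketch combining the pieces above (the kernel launch itself is elided):

```C
int N = 10;
size_t size = N * sizeof(int);

int *h_a = (int*)malloc(size);   // host ("h") buffer
int *d_a;                        // device ("d") buffer
cudaMalloc(&d_a, size);

cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);   // host -> device
// ... launch a kernel that reads or writes d_a ...
cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);   // device -> host

cudaFree(d_a);
free(h_a);
```

This allocate, copy, compute, copy-back, free pattern appears again in the full vector addition code below.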
27 |
28 | 
29 |
30 | The vectors `a` and `b` are added elementwise to produce the vector `c`:
31 |
32 | ```
33 | c[0] = a[0] + b[0]
34 | c[1] = a[1] + b[1]
35 | ...
36 | c[N-1] = a[N-1] + b[N-1]
37 | ```
38 |
39 | ## CPU
40 |
41 | The following code adds two vectors together on a CPU:
42 |
43 | ```C
44 | #include <stdio.h>
45 | #include <stdlib.h>
46 | #include <math.h>
47 | #include "timer.h"
48 |
49 | void vecAdd(double *a, double *b, double *c, int n)
50 | {
51 | int i;
52 | for (i = 0; i < n; i++) {
53 | c[i] = a[i] + b[i];
54 | }
55 | }
56 |
57 | int main(int argc, char* argv[])
58 | {
59 | // Size of vectors
60 | int n = 2000;
61 |
62 | // Host input vectors
63 | double *h_a;
64 | double *h_b;
65 | //Host output vector
66 | double *h_c;
67 |
68 | // Size, in bytes, of each vector
69 | size_t bytes = n*sizeof(double);
70 |
71 | // Allocate memory for each vector on host
72 | h_a = (double*)malloc(bytes);
73 | h_b = (double*)malloc(bytes);
74 | h_c = (double*)malloc(bytes);
75 |
76 | int i;
77 | // Initialize vectors on host
78 | for (i = 0; i < n; i++) {
79 | h_a[i] = sin(i)*sin(i);
80 | h_b[i] = cos(i)*cos(i);
81 | }
82 |
83 | // add the two vectors
84 | vecAdd(h_a, h_b, h_c, n);
85 |
86 | // Release host memory
87 | free(h_a);
88 | free(h_b);
89 | free(h_c);
90 |
91 | return 0;
92 | }
93 | ```
94 |
95 | Take a look at `vector_add_cpu.c`. You will see that it allocates three arrays of size `n` and then fills `a` and `b` with values. The `vecAdd` function is then called to perform the elementwise addition of the two arrays producing a third array `c`:
96 |
97 | ```C
98 | void vecAdd(double *a, double *b, double *c, int n) {
99 | int i;
100 | for (i = 0; i < n; i++) {
101 | c[i] = a[i] + b[i];
102 | }
103 | }
104 | ```
105 |
106 |
107 | The output reports the time taken to perform the addition ignoring the memory allocation and initialization. Build and run the code:
108 |
109 | ```
110 | $ cd gpu_programming_intro/06_cuda_kernels/04_vector_addition
111 | $ gcc -O3 -march=native -o vector_add_cpu vector_add_cpu.c -lm
112 | $ ./vector_add_cpu
113 | ```
114 |
115 | ## GPU
116 |
117 | The following code adds two vectors together on a GPU:
118 |
119 | ```C
120 | #include <stdio.h>
121 | #include <stdlib.h>
122 | #include <math.h>
123 | #include "timer.h"
124 |
125 | // each thread is responsible for one element of c
126 | __global__ void vecAdd(double *a, double *b, double *c, int n)
127 | {
128 | // Get our global thread ID
129 | int id = blockIdx.x * blockDim.x + threadIdx.x;
130 | int stride = gridDim.x * blockDim.x;
131 |
132 | // Make sure we do not go out of bounds
133 | int i;
134 | for (i = id; i < n; i += stride)
135 | c[i] = a[i] + b[i];
136 | }
137 |
138 | int main(int argc, char* argv[])
139 | {
140 | // Size of vectors
141 | int n = 2000;
142 |
143 | // Host input vectors
144 | double *h_a;
145 | double *h_b;
146 | //Host output vector
147 | double *h_c;
148 |
149 | // Device input vectors
150 | double *d_a;
151 | double *d_b;
152 | //Device output vector
153 | double *d_c;
154 |
155 | // Size, in bytes, of each vector
156 | size_t bytes = n*sizeof(double);
157 |
158 | // Allocate memory for each vector on host
159 | h_a = (double*)malloc(bytes);
160 | h_b = (double*)malloc(bytes);
161 | h_c = (double*)malloc(bytes);
162 |
163 | int i;
164 | // Initialize vectors on host
165 | for (i = 0; i < n; i++) {
166 | h_a[i] = sin(i)*sin(i);
167 | h_b[i] = cos(i)*cos(i);
168 | }
169 |
170 | // Allocate memory for each vector on GPU
171 | cudaMalloc(&d_a, bytes);
172 | cudaMalloc(&d_b, bytes);
173 | cudaMalloc(&d_c, bytes);
174 |
175 | // Copy host vectors to device
176 | cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
177 | cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
178 |
179 | int blockSize, gridSize;
180 |
181 | // Number of threads in each thread block
182 | blockSize = 1024;
183 |
184 | // Number of thread blocks in grid
185 | gridSize = (int)ceil((double)n/blockSize);
186 | if (gridSize > 65535) gridSize = 32000;
187 | // Execute the kernel
188 | vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
189 |
190 | // Copy array back to host
191 | cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
192 |
193 | // Release device memory
194 | cudaFree(d_a);
195 | cudaFree(d_b);
196 | cudaFree(d_c);
197 |
198 | cudaDeviceSynchronize();
199 |
200 | // Release host memory
201 | free(h_a);
202 | free(h_b);
203 | free(h_c);
204 |
205 | return 0;
206 | }
207 | ```
208 |
209 | The `vecAdd` function has been replaced with a CUDA kernel:
210 |
211 | ```C
212 | __global__ void vecAdd(double *a, double *b, double *c, int n)
213 | {
214 | // Get our global thread ID
215 | int id = blockIdx.x * blockDim.x + threadIdx.x;
216 | int stride = gridDim.x * blockDim.x;
217 |
218 | // Make sure we do not go out of bounds
219 | int i;
220 | for (i = id; i < n; i += stride)
221 | c[i] = a[i] + b[i];
222 | }
223 | ```
224 |
225 | The kernel uses special variables, which are CUDA extensions, to allow threads to distinguish themselves and operate on different data. Specifically, `blockIdx.x` is the block index within the grid, `blockDim.x` is the number of threads per block and `threadIdx.x` is the thread index within the block. The loop strides by `gridDim.x * blockDim.x`, the total number of threads in the grid, so the kernel remains correct even when fewer threads are launched than there are elements. Let's build and run the code. The `nvcc` compiler will compile the kernel function while `gcc` will be used in the background to compile the CPU code.
226 |
227 | ```
228 | $ module load cudatoolkit/12.2
229 | $ nvcc -O3 -arch=sm_80 -o vector_add_gpu vector_add_gpu.cu # use sm_70 on Traverse or an Adroit V100 node
230 | $ sbatch job.slurm
231 | ```
232 |
233 | The output of the code will be something like:
234 | ```
235 | Allocating CPU memory and populating arrays of length 2000 ... done.
236 | GridSize 2 and total_threads 2048
237 | Performing vector addition (timer started) ... done in 0.09 s.
238 | ```
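
The reported values follow directly from the launch configuration. As a quick check of the indexing arithmetic:

```C
// n = 2000 elements, blockSize = 1024 threads per block
// gridSize      = ceil(2000 / 1024.0) = 2       -> "GridSize 2"
// total threads = 2 * 1024            = 2048    -> "total_threads 2048"
// Each thread computes id = blockIdx.x * blockDim.x + threadIdx.x,
// e.g., block 1, thread 5 gives id = 1 * 1024 + 5 = 1029.
// Threads with id >= n (ids 2000..2047) fail the loop condition and do no work.
```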
239 |
240 | Note that the reported time includes more than just the arithmetic carried out on the GPU: it also includes the time required to allocate and deallocate memory on the GPU and the time required to move the data to and from the GPU.
241 |
242 | To use a GPU effectively, the problem you are solving must have a vast amount of data parallelism and a sufficiently large overall amount of computation. In the example here the parallelism is high (one can assign a different thread to each of the individual elements) but the overall amount of computation is low, so the CPU wins out in performance. Contrast this with a large matrix-matrix multiply where both conditions are satisfied and the GPU wins. For problems involving recursion, sorting, or small amounts of data, it is difficult to take advantage of a GPU.
243 |
244 | ## Advanced Examples
245 |
246 | For more advanced examples return to the NVIDIA CUDA samples at the bottom of [this page](https://github.com/PrincetonUniversity/gpu_programming_intro/tree/master/05_cuda_libraries#nvidia-cuda-samples).
247 |
--------------------------------------------------------------------------------
/06_cuda_kernels/04_vector_addition/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=vec-add # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default)
7 | #SBATCH --gres=gpu:1 # number of gpus per node
8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS)
9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
10 |
11 | ./vector_add_gpu
12 |
--------------------------------------------------------------------------------
/06_cuda_kernels/04_vector_addition/timer.h:
--------------------------------------------------------------------------------
1 | /*
2 | * Copyright 2012 NVIDIA Corporation
3 | *
4 | * Licensed under the Apache License, Version 2.0 (the "License");
5 | * you may not use this file except in compliance with the License.
6 | * You may obtain a copy of the License at
7 | *
8 | * http://www.apache.org/licenses/LICENSE-2.0
9 | *
10 | * Unless required by applicable law or agreed to in writing, software
11 | * distributed under the License is distributed on an "AS IS" BASIS,
12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | * See the License for the specific language governing permissions and
14 | * limitations under the License.
15 | */
16 |
17 | #ifndef TIMER_H
18 | #define TIMER_H
19 |
20 | #include <stdio.h>
21 |
22 | #ifdef WIN32
23 | #define WIN32_LEAN_AND_MEAN
24 | #include <windows.h>
25 | #else
26 | #include <sys/time.h>
27 | #endif
28 |
29 | #ifdef WIN32
30 | double PCFreq = 0.0;
31 | __int64 timerStart = 0;
32 | #else
33 | struct timeval timerStart;
34 | #endif
35 |
36 | void StartTimer()
37 | {
38 | #ifdef WIN32
39 | LARGE_INTEGER li;
40 | if(!QueryPerformanceFrequency(&li))
41 | printf("QueryPerformanceFrequency failed!\n");
42 |
43 | PCFreq = (double)li.QuadPart/1000.0;
44 |
45 | QueryPerformanceCounter(&li);
46 | timerStart = li.QuadPart;
47 | #else
48 | gettimeofday(&timerStart, NULL);
49 | #endif
50 | }
51 |
52 | // time elapsed in ms
53 | double GetTimer()
54 | {
55 | #ifdef WIN32
56 | LARGE_INTEGER li;
57 | QueryPerformanceCounter(&li);
58 | return (double)(li.QuadPart-timerStart)/PCFreq;
59 | #else
60 | struct timeval timerStop, timerElapsed;
61 | gettimeofday(&timerStop, NULL);
62 | timersub(&timerStop, &timerStart, &timerElapsed);
63 | return timerElapsed.tv_sec*1000.0+timerElapsed.tv_usec/1000.0;
64 | #endif
65 | }
66 |
67 | #endif // TIMER_H
68 |
--------------------------------------------------------------------------------
/06_cuda_kernels/04_vector_addition/vector_add_cpu.c:
--------------------------------------------------------------------------------
1 | /* CPU VERSION */
2 |
3 | // modified from https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
4 |
5 | #include <stdio.h>
6 | #include <stdlib.h>
7 | #include <math.h>
8 | #include "timer.h"
9 |
10 | void vecAdd(double *a, double *b, double *c, int n)
11 | {
12 | int i;
13 | for(i = 0; i < n; i++) {
14 | c[i] = a[i] + b[i];
15 | }
16 | }
17 |
18 | int main( int argc, char* argv[] )
19 | {
20 | // Size of vectors
21 | int n = 2000;
22 |
23 | // Host input vectors
24 | double *h_a;
25 | double *h_b;
26 | //Host output vector
27 | double *h_c;
28 |
29 | // Size, in bytes, of each vector
30 | size_t bytes = n*sizeof(double);
31 |
32 | // Allocate memory for each vector on host
33 | fprintf(stderr, "Allocating memory and populating arrays of length %d ...", n);
34 | h_a = (double*)malloc(bytes);
35 | h_b = (double*)malloc(bytes);
36 | h_c = (double*)malloc(bytes);
37 |
38 | int i;
39 | // Initialize vectors on host
40 | for( i = 0; i < n; i++ ) {
41 | h_a[i] = sin(i)*sin(i);
42 | h_b[i] = cos(i)*cos(i);
43 | }
44 |
45 | fprintf(stderr, " done.\n");
46 | fprintf(stderr, "Performing vector addition (timer started) ...");
47 | StartTimer();
48 |
49 | // add the two vectors
50 | vecAdd(h_a, h_b, h_c, n);
51 |
52 | double runtime = GetTimer();
53 | fprintf(stderr, " done in %.2f s.\n", runtime / 1000);
54 |
55 | // Sum up vector c and print result divided by n, this should equal 1 within error
56 | double sum = 0;
57 | for(i = 0; i < n; i++)
58 |     sum += h_c[i];
59 | double tol = 1e-6;
60 | if (fabs(sum/n - 1.0) > tol) printf("Warning: potential numerical problems.\n");
61 |
62 | // Release host memory
63 | free(h_a);
64 | free(h_b);
65 | free(h_c);
66 |
67 | return 0;
68 | }
69 |
--------------------------------------------------------------------------------
/06_cuda_kernels/04_vector_addition/vector_add_gpu.cu:
--------------------------------------------------------------------------------
1 | /* GPU Version */
2 |
3 | // original file is https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
4 |
5 | #include <stdio.h>
6 | #include <stdlib.h>
7 | #include <math.h>
8 | #include "timer.h"
9 |
10 | // CUDA kernel. Each thread takes care of one element of c
11 | __global__ void vecAdd(double *a, double *b, double *c, int n)
12 | {
13 | // Get our global thread ID
14 | int id = blockIdx.x * blockDim.x + threadIdx.x;
15 | int stride = gridDim.x * blockDim.x;
16 |
17 | // Make sure we do not go out of bounds
18 | int i;
19 | for (i = id; i < n; i += stride)
20 | c[i] = a[i] + b[i];
21 | }
22 |
23 | int main( int argc, char* argv[] )
24 | {
25 | // Size of vectors
26 | int n = 2000;
27 |
28 | // Host input vectors
29 | double *h_a;
30 | double *h_b;
31 | //Host output vector
32 | double *h_c;
33 |
34 | // Device input vectors
35 | double *d_a;
36 | double *d_b;
37 | //Device output vector
38 | double *d_c;
39 |
40 | // Size, in bytes, of each vector
41 | size_t bytes = n*sizeof(double);
42 |
43 | // Allocate memory for each vector on host
44 | fprintf(stderr, "Allocating CPU memory and populating arrays of length %d ...", n);
45 | h_a = (double*)malloc(bytes);
46 | h_b = (double*)malloc(bytes);
47 | h_c = (double*)malloc(bytes);
48 |
49 | int i;
50 | // Initialize vectors on host
51 | for( i = 0; i < n; i++ ) {
52 | h_a[i] = sin(i)*sin(i);
53 | h_b[i] = cos(i)*cos(i);
54 | }
55 | fprintf(stderr, " done.\n");
56 |
57 | fprintf(stderr, "Performing vector addition (timer started) ...");
58 | StartTimer();
59 |
60 | // Allocate memory for each vector on GPU
61 | cudaMalloc(&d_a, bytes);
62 | cudaMalloc(&d_b, bytes);
63 | cudaMalloc(&d_c, bytes);
64 |
65 | // Copy host vectors to device
66 | cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
67 | cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
68 |
69 | int blockSize, gridSize;
70 |
71 | // Number of threads in each thread block
72 | blockSize = 1024;
73 |
74 | // Number of thread blocks in grid
75 | gridSize = (int)ceil((double)n/blockSize);
76 | if (gridSize > 65535) gridSize = 32000;
77 | printf("GridSize %d and total_threads %d\n", gridSize, gridSize * blockSize);
78 | // Execute the kernel
79 | vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
80 |
81 | // Copy array back to host
82 | cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost );
83 |
84 | // Release device memory
85 | cudaFree(d_a);
86 | cudaFree(d_b);
87 | cudaFree(d_c);
88 |
89 | cudaDeviceSynchronize();
90 |
91 | double runtime = GetTimer();
92 | fprintf(stderr, " done in %.2f s.\n", runtime / 1000);
93 |
94 | // Sum up vector c and print result divided by n, this should equal 1 within error
95 | double sum = 0;
96 | for(i = 0; i < n; i++)
97 |     sum += h_c[i];
98 |
99 | double tol = 1e-6;
100 | if (fabs(sum/n - 1.0) > tol) printf("Warning: potential numerical problems.\n");
101 |
102 | // Release host memory
103 | free(h_a);
104 | free(h_b);
105 | free(h_c);
106 |
107 | return 0;
108 | }
109 |
--------------------------------------------------------------------------------
/06_cuda_kernels/05_multiple_gpus/README.md:
--------------------------------------------------------------------------------
1 | # Multiple GPUs
2 |
3 | The code in this directory illustrates the use of multiple GPUs. To compile and execute the example, run the following commands:
4 |
5 | ```
6 | $ module load cudatoolkit/12.2
7 | $ nvcc -O3 -arch=sm_80 -o multi_gpu multi_gpu.cu
8 | $ sbatch job.slurm
9 | ```
10 |
11 | On Traverse and the Adroit V100 nodes, replace `sm_80` with `sm_70`.
12 |
13 | See also `Samples/0_Introduction/simpleMultiGPU` in the NVIDIA samples which are discussed in `05_cuda_libraries`.
14 |
--------------------------------------------------------------------------------
/06_cuda_kernels/05_multiple_gpus/job.slurm:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=multi-gpu # create a short name for your job
3 | #SBATCH --nodes=1 # node count
4 | #SBATCH --ntasks=1 # total number of tasks across all nodes
5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G per cpu-core is default)
7 | #SBATCH --gres=gpu:2 # number of gpus per node
8 | #SBATCH --time=00:01:00 # total run time limit (HH:MM:SS)
9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP
10 |
11 | module purge
12 | module load cudatoolkit/12.2
13 |
14 | ./multi_gpu
15 |
--------------------------------------------------------------------------------
/06_cuda_kernels/05_multiple_gpus/multi_gpu.cu:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | void CPUFunction() {
4 | printf("Hello world from the CPU.\n");
5 | }
6 |
7 | __global__ void GPUFunction(int myid) {
8 | printf("Hello world from GPU %d.\n", myid);
9 | }
10 |
11 | int main() {
12 |
13 | // function to run on the cpu
14 | CPUFunction();
15 |
16 | int deviceCount;
17 | cudaGetDeviceCount(&deviceCount);
18 | int device;
19 | for (device=0; device < deviceCount; ++device) {
20 | cudaDeviceProp deviceProp;
21 | cudaGetDeviceProperties(&deviceProp, device);
22 | printf("Device %d has compute capability %d.%d.\n",
23 | device, deviceProp.major, deviceProp.minor);
24 | }
25 |
26 | // run on gpu 0
27 | int device_id = 0;
28 | cudaSetDevice(device_id);
29 | GPUFunction<<<1, 1>>>(device_id);
30 |
31 | // run on gpu 1
32 | device_id = 1;
33 | cudaSetDevice(device_id);
34 | GPUFunction<<<1, 1>>>(device_id);
35 |
36 | // kernel execution is asynchronous so sync on its completion
37 | cudaDeviceSynchronize();
38 | }
39 |
--------------------------------------------------------------------------------
/06_cuda_kernels/README.md:
--------------------------------------------------------------------------------
1 | # CUDA kernels
2 |
3 | In this section you will write GPU kernels from scratch. To get started click on `01_hello_world` above.
4 |
--------------------------------------------------------------------------------
/07_advanced_and_other/README.md:
--------------------------------------------------------------------------------
1 | # Advanced and Other
2 |
3 | ## CUDA-Aware MPI
4 |
5 | On Della you will see MPI modules that have been built against CUDA. These modules enable [CUDA-aware MPI](https://developer.nvidia.com/mpi-solutions-gpus) where
6 | memory on a GPU can be sent directly to memory on another GPU without being staged through the CPU. According to NVIDIA:
7 |
8 | > Regular MPI implementations pass pointers to host memory, staging GPU buffers through host memory using cudaMemcpy.
9 |
10 | > With [CUDA-aware MPI](https://developer.nvidia.com/mpi-solutions-gpus), the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. Implementation of CUDA-aware MPI was simplified by Unified Virtual Addressing (UVA) in CUDA 4.0 – which enables a single address space for all CPU and GPU memory. CUDA-aware implementations of MPI have several advantages.
11 |
12 | See the CUDA-aware MPI modules on Della:
13 |
14 | ```
15 | $ ssh <YourNetID>@della.princeton.edu
16 | $ module avail openmpi/cuda
17 |
18 | ------------- /usr/local/share/Modules/modulefiles -------------
19 | openmpi/cuda-11.1/gcc/4.1.1 openmpi/cuda-11.3/nvhpc-21.5/4.1.1
20 | ```
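
With a CUDA-aware MPI build, a pointer to GPU memory can be passed directly to MPI calls. Below is a minimal sketch, assuming a CUDA-aware MPI implementation and at least two ranks:

```C
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1024;
  double *d_buf;
  cudaMalloc(&d_buf, n * sizeof(double));   // buffer lives in GPU memory

  if (rank == 0)
    MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   // device pointer passed directly
  else if (rank == 1)
    MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  cudaFree(d_buf);
  MPI_Finalize();
  return 0;
}
```

With a regular (non-CUDA-aware) MPI, the same exchange would require an explicit `cudaMemcpy` to a host buffer before the send and after the receive.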
21 |
22 | ## GPU Direct
23 |
24 | [GPU Direct](https://developer.nvidia.com/gpudirect) is a solution to the problem of data-starved GPUs.
25 |
26 | 
27 |
28 | > Using GPUDirect™, multiple GPUs, network adapters, solid-state drives (SSDs) and now NVMe drives can directly read and write CUDA host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead, and reducing latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA Tesla™ and Quadro™ products
29 |
30 | GPUDirect is enabled on `della` and `traverse`.
31 |
32 | ## GPU Sharing
33 |
34 | Many GPU applications only use the GPU for a fraction of the time. For many years, a goal of GPU vendors has been to allow for GPU sharing between applications. Slurm is capable of supporting this through the `--gpu-mps` option.
35 |
36 | ## OpenMP 4.5+
37 |
38 | Recent implementations of [OpenMP](https://www.openmp.org/) support GPU offloading. However, these implementations are not yet mature and are generally not the preferred approach.
39 |
40 | ## CUDA Kernels versus OpenACC on the Long Term
41 |
42 | CUDA kernels are written at a low level. OpenACC is a high-level programming model. Because GPU hardware is changing rapidly, some argue that writing GPU codes with OpenACC is a better choice because there is much less work to do when new hardware comes out. The same holds true for Kokkos.
43 |
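As an illustration of the difference, the elementwise vector addition from `06_cuda_kernels/04_vector_addition` could be written in OpenACC roughly as follows (a sketch, assuming an OpenACC-capable compiler such as `nvc` from the NVIDIA HPC SDK):

```C
// compile with, e.g., nvc -acc -o vecadd_acc vecadd_acc.c
void vecAdd(double *a, double *b, double *c, int n) {
  // the compiler generates the kernel and the host/device data movement
  #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```

No explicit `cudaMalloc`, `cudaMemcpy` or kernel launch syntax is needed; the directives describe the parallelism and the data movement instead.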
44 | [See the materials](http://w3.pppl.gov/~ethier/PICSCIE/Intro_to_OpenACC_Nov_2019.pdf) for an OpenACC workshop by Stephane Ethier. There is also an OpenACC Slack channel where you can get help.
45 |
46 | ## Using the Intel Compiler
47 |
48 | Note the use of `auto` in the code below:
49 |
50 | ```c++
51 | #include <stdio.h>
52 |
53 | __global__ void simpleKernel()
54 | {
55 | auto i = blockDim.x * blockIdx.x + threadIdx.x;
56 | printf("Index: %d\n", i);
57 | }
58 |
59 | int main()
60 | {
61 | simpleKernel<<<2, 3>>>();
62 | cudaDeviceSynchronize();
63 | }
64 | ```
65 |
66 | The C++11 language standard introduced the `auto` keyword. To compile the code with the Intel compiler on Della:
67 |
68 | ```
69 | $ module load intel/19.1.1.217
70 | $ module load cudatoolkit/11.7
71 | $ nvcc -ccbin=icpc -std=c++11 -arch=sm_80 -o simple simple.cu
72 | ```
73 |
74 | In general, NVIDIA engineers strongly recommend using GCC over the Intel compiler.
75 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Introduction to GPU Computing
2 |
3 | ## About
4 |
5 | This guide provides materials for getting started with running GPU codes on the Princeton Research Computing clusters. It also provides an introduction to writing CUDA kernels and examples of using the NVIDIA GPU-accelerated libraries (e.g., cuBLAS).
6 |
7 | ## Upcoming GPU Training
8 |
9 | [Princeton GPU User Group](https://researchcomputing.princeton.edu/learn/user-groups/gpu)
10 | [See all PICSciE/RC workshops](https://researchcomputing.princeton.edu/learn/workshops-live-training)
11 | [Subscribe to PICSciE/RC Mailing List](https://researchcomputing.princeton.edu/subscribe)
12 |
13 | ## Learning Resources
14 |
15 | [GPU Computing at Princeton](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing)
16 | [2025 Princeton GPU Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwmKa2AI/se000356)
17 | [Resource List by Open Hackathons](https://www.openhackathons.org/s/technical-resources)
18 | [Training Archive at Oak Ridge National Laboratory](https://docs.olcf.ornl.gov/training/training_archive.html)
19 | [LeetGPU - Free GPU Simulator](https://leetgpu.com/)
20 | [CUDA C++ Programming Guide by NVIDIA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
21 | [CUDA Fortran Programming Guide by NVIDIA](https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html)
22 | [Intro to CUDA Blog Post](https://developer.nvidia.com/blog/even-easier-introduction-cuda/?mkt_tok=MTU2LU9GTi03NDIAAAGad2PhouORjrUMHihUOvdy-syejFRkc-7otOyEDUy4HXOnJ85JjZ-gUs-lGlbdvG-hpVpXtxlpVN4EOvosdmaWcaSV9TQa84zICsZ3IdKBp5L69uOLQDsm)
23 | [Online Book Available through PU Library](https://catalog.princeton.edu/catalog/99125304171206421)
24 | [Princeton A100 GPU Workshop](https://github.com/PrincetonUniversity/a100_workshop)
25 |
26 | ## Getting Help
27 |
28 | If you encounter any difficulties with this material then please send an email to cses@princeton.edu or attend a help session.
29 |
30 | ## Authorship
31 |
32 | This guide was created by Jonathan Halverson and members of Princeton Research Computing.
33 |
--------------------------------------------------------------------------------
/setup.md:
--------------------------------------------------------------------------------
1 | # Introduction to GPU Computing
2 |
3 | ## Setup for live workshop
4 |
5 | ### Point your browser to `https://bit.ly/36g5YUS`
6 |
7 | + Connect to the eduroam wireless network
8 |
9 | + Open a terminal (e.g., Terminal, PowerShell, PuTTY) [click here for help]
10 |
11 | + Request an [account on Adroit](https://forms.rc.princeton.edu/registration/?q=adroit).
12 |
13 | + Please SSH to Adroit in the terminal: `ssh <YourNetID>@adroit.princeton.edu` [click [here](https://researchcomputing.princeton.edu/faq/why-cant-i-login-to-a-clu) for help]
14 |
15 | + If you are new to Linux then consider using the MyAdroit web portal: [https://myadroit.princeton.edu](https://myadroit.princeton.edu) (VPN required from off-campus)
16 |
17 | + Clone this repo on Adroit:
18 |
19 | ```
20 | $ cd /scratch/network/$USER
21 | $ git clone https://github.com/PrincetonUniversity/gpu_programming_intro.git
22 | $ cd gpu_programming_intro
23 | ```
24 |
25 | + For the live workshop, to get access to the GPU nodes on Adroit, submit your jobs using the workshop reservation:
26 |
27 | `$ sbatch --reservation=gpuprimer job.slurm`
28 |
29 | + Go to the [main page](https://github.com/PrincetonUniversity/gpu_programming_intro) of this repo
30 |
--------------------------------------------------------------------------------