├── 01_what_is_a_gpu ├── README.md └── pli.md ├── 02_cuda_toolkit └── README.md ├── 03_your_first_gpu_job ├── README.md ├── cupy │ ├── job.slurm │ ├── lu.py │ └── svd.py ├── julia │ ├── job.slurm │ └── svd.jl ├── matlab │ ├── job.slurm │ └── svd.m ├── pytorch │ ├── job.slurm │ └── svd.py └── tensorflow │ ├── job.slurm │ └── svd.py ├── 04_gpu_tools └── README.md ├── 05_cuda_libraries ├── README.md ├── gesvdj_example.cpp ├── hello_world_gpu_library │ ├── README.md │ ├── cumessage.cu │ ├── cumessage.h │ ├── job.slurm │ └── myapp.cu ├── job.slurm └── matrixMul │ └── job.slurm ├── 06_cuda_kernels ├── 01_hello_world │ ├── README.md │ ├── hello_world.c │ ├── hello_world_gpu.cu │ └── job.slurm ├── 02_simple_kernel │ ├── README.md │ ├── first_parallel.cu │ ├── job.slurm │ └── solution.cu ├── 03_thread_indices │ ├── README.md │ ├── for_loop.c │ ├── for_loop.cu │ ├── hint.md │ ├── job.slurm │ └── solution.cu ├── 04_vector_addition │ ├── README.md │ ├── job.slurm │ ├── timer.h │ ├── vector_add_cpu.c │ └── vector_add_gpu.cu ├── 05_multiple_gpus │ ├── README.md │ ├── job.slurm │ └── multi_gpu.cu └── README.md ├── 07_advanced_and_other └── README.md ├── README.md └── setup.md /01_what_is_a_gpu/README.md: -------------------------------------------------------------------------------- 1 | # What is a GPU? 2 | 3 | A GPU, or Graphics Processing Unit, is an electronic device originally designed for manipulating the images that appear on a computer monitor. However, beginning in 2006 with NVIDIA CUDA, GPUs have become widely used for accelerating computation in various fields including image processing and machine learning. 4 | 5 | Relative to the CPU, GPUs have a far greater number of processing cores but with slower clock speeds. Within a block of threads called a warp (NVIDIA), each thread carries out the same operation on a different piece of data. This is the SIMT paradigm (single instruction, multiple threads). GPUs tend to have much less memory than what is available on a CPU. For instance, the H100 GPUs on Della have 80 GB compared to 1000 GB available to the CPU cores. This is an important consideration when designing algorithms and running jobs. Furthermore, GPUs are intended for highly parallel algorithms. The CPU can often out-perform a GPU on algorithms that are not highly parallelizable such as those that rely on data caching and flow control (e.g., "if" statements). 6 | 7 | Many of the fastest supercomputers in the world use GPUs (see [Top 500](https://top500.org/lists/top500/2024/11/)). How many of the top 10 supercomputers use GPUs? 8 | 9 | NVIDIA has been the leading player in GPUs for HPC. However, the GPU market landscape changed in May 2019 when the US DoE announced that Frontier, the first exascale supercomputer in the US, would be based on [AMD GPUs](https://www.hpcwire.com/2019/05/07/cray-amd-exascale-frontier-at-oak-ridge/) and CPUs. Princeton has a two [MI210 GPUs](https://researchcomputing.princeton.edu/amd-mi100-gpu-testing) which you can use for testing. Intel is also a GPU producer with the [Aurora supercomputer](https://en.wikipedia.org/wiki/Aurora_(supercomputer)) being an example. 10 | 11 | All laptops have a GPU for graphics. It is becoming standard for a laptop to have a second GPU dedicated for compute (see the latest [MacBook Pro](https://www.apple.com/macbook-pro/)). 
12 | 13 | ![cpu-vs-gpu](http://blog.itvce.com/wp-content/uploads/2016/03/032216_1532_DustFreeNVI2.png) 14 | 15 | The image below emphasizes the cache sizes and flow control: 16 | 17 | ![cache_flow_control](https://tigress-web.princeton.edu/~jdh4/gpu-devotes-more-transistors-to-data-processing.png) 18 | 19 | Like a CPU, a GPU has a hierarchical structure with respect to both the execution units and memory. A warp is a unit of 32 threads. NVIDIA GPUs impose a limit of 1024 threads per block. Some integral number of warps are grouped into a streaming multiprocessor (SM). There are tens of SMs per GPU. Each thread has its own memory. There is limited shared memory between a block of threads. And, finally, there is the global memory which is accessible to each grid or collection of blocks. 20 | 21 | ![ampere](https://developer-blogs.nvidia.com/wp-content/uploads/2022/03/H100-Streaming-Multiprocessor-SM-625x869.png) 22 | 23 | The figure above is a diagram of a streaming multiprocessor (SM) for the [NVIDIA H100 GPU](https://www.nvidia.com/en-us/data-center/h100/). The H100 is composed of up to 132 SMs. 24 | 25 | # Princeton Language and Intelligence 26 | 27 | The university spent $9.6M on a new [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) cluster for research involving large AI models. The cluster provides 37 nodes with 8 GPUs per node. The H100 GPU is optimized for training transformer models. [Learn more](https://pli.princeton.edu/about-pli/directors-message) about this. 28 | 29 | # Overview of using a GPU 30 | 31 | This is the essence of how every GPU is used as an accelerator for compute: 32 | 33 | + Copy data from the CPU (host) to the GPU (device) 34 | 35 | + Launch a kernel to carry out computations on the GPU 36 | 37 | + Copy data from the GPU (device) back to the CPU (host) 38 | 39 | ![gpu-overview](https://tigress-web.princeton.edu/~jdh4/gpu_as_accelerator_to_cpu_diagram.png) 40 | 41 | The diagram above and the accompanying pseudocode present a simplified view of how GPUs are used in scientific computing. To fully understand how things work you will need to learn more about memory cache, interconnects, CUDA streams and much more. 42 | 43 | [NVLink](https://www.nvidia.com/en-us/data-center/nvlink/) on Traverse enables fast CPU-to-GPU and GPU-to-GPU data transfers with a peak rate of 75 GB/s per direction. Della has this fast GPU-GPU interconnect on each pair of GPUs on 70 of the 90 GPU nodes. 44 | 45 | Given the significant performance penalty for moving data between the CPU and GPU, it is natural to work toward "unifying" the CPU and GPU. For instance, read about the [NVIDIA Grace Hopper Superchip](https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/). 46 | 47 | # What GPU resources does Princeton have? 48 | 49 | See the "Hardware Resources" on the [GPU Computing](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) page for a complete list. 50 | 51 | ## Adroit 52 | 53 | There are 3 GPU nodes on Adroit: `adroit-h11g1`, `adroit-h11g2` and `adroit-h11g3` 54 | 55 |
 56 | $ ssh <NetID>@adroit.princeton.edu
 57 | $ snodes
 58 | HOSTNAMES      STATE  CPUS S:C:T  CPUS(A/I/O/T) CPU_LOAD MEMORY  PARTITION  AVAIL_FEATURES
 59 | adroit-08      alloc  32   2:16:1 32/0/0/32     1.27     384000  class      skylake,intel
 60 | adroit-09      alloc  32   2:16:1 32/0/0/32     0.75     384000  class      skylake,intel
 61 | adroit-10      alloc  32   2:16:1 32/0/0/32     0.63     384000  class      skylake,intel
 62 | adroit-11      mix    32   2:16:1 29/3/0/32     0.28     384000  class      skylake,intel
 63 | adroit-12      mix    32   2:16:1 16/16/0/32    0.28     384000  class      skylake,intel
 64 | adroit-13      mix    32   2:16:1 25/7/0/32     0.22     384000  all*       skylake,intel
 65 | adroit-13      mix    32   2:16:1 25/7/0/32     0.22     384000  class      skylake,intel
 66 | adroit-14      alloc  32   2:16:1 32/0/0/32     32.29    384000  all*       skylake,intel
 67 | adroit-14      alloc  32   2:16:1 32/0/0/32     32.29    384000  class      skylake,intel
 68 | adroit-15      mix    32   2:16:1 22/10/0/32    9.68     384000  all*       skylake,intel
 69 | adroit-15      mix    32   2:16:1 22/10/0/32    9.68     384000  class      skylake,intel
 70 | adroit-16      alloc  32   2:16:1 32/0/0/32     24.13    384000  all*       skylake,intel
 71 | adroit-16      alloc  32   2:16:1 32/0/0/32     24.13    384000  class      skylake,intel
 72 | adroit-h11g1   plnd   48   2:24:1 0/48/0/48     0.00     1000000 gpu        a100,intel,gpu80
 73 | adroit-h11g2   plnd   48   2:24:1 0/48/0/48     0.76     1000000 gpu        a100,intel
 74 | adroit-h11g3   mix    56   4:14:1 5/51/0/56     1.05     760000  gpu        v100,intel
 75 | adroit-h11n1   idle   128  2:64:1 0/128/0/128   0.00     256000  class      amd,rome
 76 | adroit-h11n2   alloc  64   2:32:1 64/0/0/64     49.07    500000  all*       intel,ice
 77 | adroit-h11n3   mix    64   2:32:1 50/14/0/64    40.54    500000  all*       intel,ice
 78 | adroit-h11n4   mix    64   2:32:1 48/16/0/64    40.33    500000  all*       intel,ice
 79 | adroit-h11n5   mix    64   2:32:1 32/32/0/64    32.94    500000  all*       intel,ice
 80 | adroit-h11n6   mix    64   2:32:1 62/2/0/64     38.95    500000  all*       intel,ice
 81 | 
82 | 83 | To only see the GPU nodes: 84 | 85 |
 86 | $ shownodes -p gpu
 87 | NODELIST      STATE      FREE/TOTAL CPUs  CPU_LOAD  AVAIL/TOTAL MEMORY  FREE/TOTAL GPUs          FEATURES
 88 | adroit-h11g1  planned              48/48      0.00   1000000/1000000MB  4/4 nvidia_a100  a100,intel,gpu80
 89 | adroit-h11g2  planned              48/48      0.76   1000000/1000000MB      8/8 3g.20gb        a100,intel
 90 | adroit-h11g3  mixed                51/56      1.05     736960/760000MB   0/4 tesla_v100        v100,intel
 91 | 
92 | 93 | ### adroit-h11g1 94 | 95 | This node has 4 NVIDIA A100 GPUs with 80 GB of memory each. Each A100 GPU has 108 streaming multiprocessors (SM) and 64 FP32 CUDA cores per SM. 96 | 97 | Here is some information about the A100 GPUs on this node: 98 | 99 | ``` 100 | CUDADevice with properties: 101 | 102 | Name: 'NVIDIA A100 80GB PCIe' 103 | Index: 1 104 | ComputeCapability: '8.0' 105 | SupportsDouble: 1 106 | DriverVersion: 12.2000 107 | ToolkitVersion: 11.2000 108 | MaxThreadsPerBlock: 1024 109 | MaxShmemPerBlock: 49152 110 | MaxThreadBlockSize: [1024 1024 64] 111 | MaxGridSize: [2.1475e+09 65535 65535] 112 | SIMDWidth: 32 113 | TotalMemory: 8.5175e+10 114 | AvailableMemory: 8.4519e+10 115 | MultiprocessorCount: 108 116 | ClockRateKHz: 1410000 117 | ComputeMode: 'Default' 118 | GPUOverlapsTransfers: 1 119 | KernelExecutionTimeout: 0 120 | CanMapHostMemory: 1 121 | DeviceSupported: 1 122 | DeviceAvailable: 1 123 | DeviceSelected: 1 124 | ``` 125 | 126 | Here is infomation about the CPUs on this node: 127 | 128 |
129 | $ ssh <NetID>@adroit.princeton.edu
130 | $ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --gres=gpu:1 --constraint=gpu80 --reservation=gpuprimer
131 | $ lscpu | grep -v Flags
132 | Architecture:        x86_64
133 | CPU op-mode(s):      32-bit, 64-bit
134 | Byte Order:          Little Endian
135 | CPU(s):              48
136 | On-line CPU(s) list: 0-47
137 | Thread(s) per core:  1
138 | Core(s) per socket:  24
139 | Socket(s):           2
140 | NUMA node(s):        2
141 | Vendor ID:           GenuineIntel
142 | CPU family:          6
143 | Model:               143
144 | Model name:          Intel(R) Xeon(R) Gold 6442Y
145 | Stepping:            8
146 | CPU MHz:             3707.218
147 | CPU max MHz:         4000.0000
148 | CPU min MHz:         800.0000
149 | BogoMIPS:            5200.00
150 | Virtualization:      VT-x
151 | L1d cache:           48K
152 | L1i cache:           32K
153 | L2 cache:            2048K
154 | L3 cache:            61440K
155 | NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
156 | NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
157 | $ exit
158 | 
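The GPU properties shown above for `adroit-h11g1` (name, compute capability, SM count, memory) can also be queried directly from the CUDA runtime API. Below is a minimal sketch (the file name `query_gpu.cu` is only an example) that can be compiled with `nvcc query_gpu.cu -o query_gpu` after loading a `cudatoolkit` module. On an A100 node it should report a compute capability of 8.0 and 108 multiprocessors, in agreement with the listing above.

```
// query_gpu.cu -- illustrative sketch: print basic properties of each visible GPU
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s\n", i, prop.name);
    printf("  Compute capability:        %d.%d\n", prop.major, prop.minor);
    printf("  Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("  Total global memory:       %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("  Max threads per block:     %d\n", prop.maxThreadsPerBlock);
  }
  return 0;
}
```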
159 | 160 | 161 | ### adroit-h11g2 162 | 163 | `adroit-h11g2` has 4 NVIDIA A100 GPUs with 40 GB of memory per GPU. The 4 GPUs have been divided into 8 less powerful GPUs with 20 GB of memory each. To connect to this node use: 164 | 165 | ``` 166 | $ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --gres=gpu:1 --nodelist=adroit-h11g2 --reservation=gpuprimer 167 | ``` 168 | 169 | Below is information about the A100 GPUs: 170 | 171 | ``` 172 | $ nvidia-smi -a 173 | Using a NVIDIA A100-PCIE-40GB GPU. 174 | CUDADevice with properties: 175 | 176 | Name: 'NVIDIA A100-PCIE-40GB' 177 | Index: 1 178 | ComputeCapability: '8.0' 179 | SupportsDouble: 1 180 | DriverVersion: 11.7000 181 | ToolkitVersion: 11.2000 182 | MaxThreadsPerBlock: 1024 183 | MaxShmemPerBlock: 49152 184 | MaxThreadBlockSize: [1024 1024 64] 185 | MaxGridSize: [2.1475e+09 65535 65535] 186 | SIMDWidth: 32 187 | TotalMemory: 4.2351e+10 188 | AvailableMemory: 4.1703e+10 189 | MultiprocessorCount: 108 190 | ClockRateKHz: 1410000 191 | ComputeMode: 'Default' 192 | GPUOverlapsTransfers: 1 193 | KernelExecutionTimeout: 0 194 | CanMapHostMemory: 1 195 | DeviceSupported: 1 196 | DeviceAvailable: 1 197 | DeviceSelected: 1 198 | ``` 199 | 200 | Below is information about the CPUs: 201 | 202 | ``` 203 | $ lscpu | grep -v Flags 204 | Architecture: x86_64 205 | CPU op-mode(s): 32-bit, 64-bit 206 | Byte Order: Little Endian 207 | CPU(s): 48 208 | On-line CPU(s) list: 0-47 209 | Thread(s) per core: 1 210 | Core(s) per socket: 24 211 | Socket(s): 2 212 | NUMA node(s): 2 213 | Vendor ID: GenuineIntel 214 | CPU family: 6 215 | Model: 106 216 | Model name: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz 217 | Stepping: 6 218 | CPU MHz: 3499.996 219 | CPU max MHz: 3500.0000 220 | CPU min MHz: 800.0000 221 | BogoMIPS: 5600.00 222 | L1d cache: 48K 223 | L1i cache: 32K 224 | L2 cache: 1280K 225 | L3 cache: 36864K 226 | NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 227 | NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 228 | ``` 229 | 230 | See the necessary Slurm directives to [run on specific GPUs](https://researchcomputing.princeton.edu/systems/adroit#gpus) on Adroit. 231 | 232 | To see a wealth of information about the GPUs use: 233 | 234 | ``` 235 | $ nvidia-smi -q | less 236 | ``` 237 | 238 | ### adroit-h11g3 239 | 240 | This node offers the older V100 GPUs. 241 | 242 | ### Grace Hopper Superchip 243 | 244 | See the [Grace Hopper Superchip webpage](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/) by NVIDIA. Here is a schematic diagram of the superchip: 245 | 246 | ![grace](https://developer-blogs.nvidia.com/wp-content/uploads/2022/11/grace-hopper-overview.png) 247 | 248 | ``` 249 | aturing@della-gh:~$ nvidia-smi -a 250 | 251 | ==============NVSMI LOG============== 252 | 253 | Timestamp : Mon Apr 22 11:24:41 2024 254 | Driver Version : 545.23.08 255 | CUDA Version : 12.3 256 | 257 | Attached GPUs : 1 258 | GPU 00000009:01:00.0 259 | Product Name : GH200 480GB 260 | Product Brand : NVIDIA 261 | Product Architecture : Hopper 262 | Display Mode : Disabled 263 | Display Active : Disabled 264 | Persistence Mode : Enabled 265 | Addressing Mode : ATS 266 | MIG Mode 267 | Current : Disabled 268 | Pending : Disabled 269 | ... 
270 | ``` 271 | 272 | The CPU on the GH Superchip: 273 | 274 | ``` 275 | jdh4@della-gh:~$ lscpu 276 | Architecture: aarch64 277 | CPU op-mode(s): 64-bit 278 | Byte Order: Little Endian 279 | CPU(s): 72 280 | On-line CPU(s) list: 0-71 281 | Vendor ID: ARM 282 | Model name: Neoverse-V2 283 | Model: 0 284 | Thread(s) per core: 1 285 | Core(s) per socket: 72 286 | Socket(s): 1 287 | Stepping: r0p0 288 | Frequency boost: disabled 289 | CPU max MHz: 3510.0000 290 | CPU min MHz: 81.0000 291 | BogoMIPS: 2000.00 292 | Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm di 293 | t uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh 294 | Caches (sum of all): 295 | L1d: 4.5 MiB (72 instances) 296 | L1i: 4.5 MiB (72 instances) 297 | L2: 72 MiB (72 instances) 298 | L3: 114 MiB (1 instance) 299 | NUMA: 300 | NUMA node(s): 9 301 | NUMA node0 CPU(s): 0-71 302 | NUMA node1 CPU(s): 303 | NUMA node2 CPU(s): 304 | NUMA node3 CPU(s): 305 | NUMA node4 CPU(s): 306 | NUMA node5 CPU(s): 307 | NUMA node6 CPU(s): 308 | NUMA node7 CPU(s): 309 | NUMA node8 CPU(s): 310 | Vulnerabilities: 311 | Gather data sampling: Not affected 312 | Itlb multihit: Not affected 313 | L1tf: Not affected 314 | Mds: Not affected 315 | Meltdown: Not affected 316 | Mmio stale data: Not affected 317 | Retbleed: Not affected 318 | Spec rstack overflow: Not affected 319 | Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 320 | Spectre v1: Mitigation; __user pointer sanitization 321 | Spectre v2: Not affected 322 | Srbds: Not affected 323 | Tsx async abort: Not affected 324 | ``` 325 | 326 | ### Compute Capability and Building Optimized Codes 327 | 328 | Some software will only run on a GPU of a given compute capability. To find these values for a given NVIDIA Telsa card see [this page](https://en.wikipedia.org/wiki/Nvidia_Tesla). The compute capability of the A100's on Della is 8.0. For various build systems this translates to `sm_80`. 329 | 330 | The following is from `$ nvcc --help` after loading a `cudatoolkit` module: 331 | 332 | ``` 333 | Options for steering GPU code generation. 334 | ========================================= 335 | 336 | --gpu-architecture (-arch) 337 | Specify the name of the class of NVIDIA 'virtual' GPU architecture for which 338 | the CUDA input files must be compiled. 339 | With the exception as described for the shorthand below, the architecture 340 | specified with this option must be a 'virtual' architecture (such as compute_50). 341 | Normally, this option alone does not trigger assembly of the generated PTX 342 | for a 'real' architecture (that is the role of nvcc option '--gpu-code', 343 | see below); rather, its purpose is to control preprocessing and compilation 344 | of the input to PTX. 345 | For convenience, in case of simple nvcc compilations, the following shorthand 346 | is supported. If no value for option '--gpu-code' is specified, then the 347 | value of this option defaults to the value of '--gpu-architecture'. In this 348 | situation, as only exception to the description above, the value specified 349 | for '--gpu-architecture' may be a 'real' architecture (such as a sm_50), 350 | in which case nvcc uses the specified 'real' architecture and its closest 351 | 'virtual' architecture as effective architecture values. 
For example, 'nvcc 352 | --gpu-architecture=sm_50' is equivalent to 'nvcc --gpu-architecture=compute_50 353 | --gpu-code=sm_50,compute_50'. 354 | -arch=all build for all supported architectures (sm_*), and add PTX 355 | for the highest major architecture to the generated code. 356 | -arch=all-major build for just supported major versions (sm_*0), plus the 357 | earliest supported, and add PTX for the highest major architecture to the 358 | generated code. 359 | -arch=native build for all architectures (sm_*) on the current system 360 | Note: -arch=native, -arch=all, -arch=all-major cannot be used with the -code 361 | option, but can be used with -gencode options 362 | Note: the values compute_30, compute_32, compute_35, compute_37, compute_50, 363 | sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may be removed in 364 | a future release. 365 | Allowed values for this option: 'all','all-major','compute_35','compute_37', 366 | 'compute_50','compute_52','compute_53','compute_60','compute_61','compute_62', 367 | 'compute_70','compute_72','compute_75','compute_80','compute_86','compute_87', 368 | 'lto_35','lto_37','lto_50','lto_52','lto_53','lto_60','lto_61','lto_62', 369 | 'lto_70','lto_72','lto_75','lto_80','lto_86','lto_87','native','sm_35','sm_37', 370 | 'sm_50','sm_52','sm_53','sm_60','sm_61','sm_62','sm_70','sm_72','sm_75', 371 | 'sm_80','sm_86','sm_87'. 372 | ``` 373 | 374 | Hence, a starting point for optimization flags for the A100 GPUs on Della and Adroit: 375 | 376 | ``` 377 | nvcc -O3 --use_fast_math --gpu-architecture=sm_80 -o myapp myapp.cu 378 | ``` 379 | 380 | For the H100 GPUs on Della: 381 | 382 | ``` 383 | nvcc -O3 --use_fast_math --gpu-architecture=sm_90 -o myapp myapp.cu 384 | ``` 385 | 386 | ## Comparison of GPU Resources 387 | 388 | | Cluster | Number of Nodes | GPUs per Node | NVIDIA GPU Model | Number of FP32 Cores| SM Count | GPU Memory (GB) | 389 | |:----------:|:----------:|:---------:|:-------:|:-------:|:-------:|:-------:| 390 | | Adroit | 1 | 4 | A100 | 6912 | 108 | 80 | 391 | | Adroit | 1 | 8 | A100 | -- | -- | 20 | 392 | | Adroit | 1 | 4 | V100 | 5120 | 80 | 32 | 393 | | Della | 37 | 8 | H100 | 14592 | 132 | 80 | 394 | | Della | 69 | 4 | A100 | 6912 | 108 | 80 | 395 | | Della | 20 | 2 | A100 | 6912 | 108 | 40 | 396 | | Della | 2 | 28 | A100 | -- | -- | 10 | 397 | | Stellar | 6 | 2 | A100 | 6912 | 108 | 40 | 398 | | Stellar | 1 | 8 | A100 | 6912 | 108 | 40 | 399 | | Tiger | 12 | 4 | H100 | 14592 | 132 | 80 | 400 | 401 | SM is streaming multiprocessor. Note that the V100 GPUs have 640 [Tensor Cores](https://devblogs.nvidia.com/cuda-9-features-revealed/) (8 per SM) where half-precision Warp Matrix-Matrix and Accumulate (WMMA) operations can be carried out. That is, each core can perform a 4x4 matrix-matrix multiply and add the result to a third matrix. There are differences between the V100 node on Adroit and the Traverse nodes (see [PCIe versus SXM2](https://www.nextplatform.com/micro-site-content/achieving-maximum-compute-throughput-pcie-vs-sxm2/)). 402 | 403 | 404 | ## GPU Hackathon at Princeton 405 | 406 | The next hackathon will take place in [June of 2025](https://www.openhackathons.org/s/siteevent/a0CUP00000rwmKa2AI/se000356). This is a great opportunity to get help from experts in porting your code to a GPU. Or you can participate as a mentor and help a team rework their code. See the [GPU Computing](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) page for details. 
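To tie this page together, below is a minimal sketch of the copy-launch-copy pattern described in the "Overview of using a GPU" section above. The file name `add_one.cu` and the array size are arbitrary choices for illustration; on the A100 nodes it could be built with the flags shown earlier, e.g., `nvcc -O3 --gpu-architecture=sm_80 -o add_one add_one.cu`. The exercises in `06_cuda_kernels` develop this pattern step by step.

```
// add_one.cu -- illustrative sketch of the copy-compute-copy pattern
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n) x[i] += 1.0f;
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);

  float *h_x = (float *)malloc(bytes);                  // host (CPU) array
  for (int i = 0; i < n; i++) h_x[i] = 1.0f;

  float *d_x;
  cudaMalloc(&d_x, bytes);                              // device (GPU) array
  cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // 1. copy host to device

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  add_one<<<blocks, threads>>>(d_x, n);                 // 2. launch the kernel

  cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // 3. copy device to host

  printf("h_x[0] = %f (expect 2.0)\n", h_x[0]);
  cudaFree(d_x);
  free(h_x);
  return 0;
}
```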
407 | -------------------------------------------------------------------------------- /01_what_is_a_gpu/pli.md: -------------------------------------------------------------------------------- 1 | # PLI Nodes 2 | 3 | ``` 4 | Architecture: x86_64 5 | CPU op-mode(s): 32-bit, 64-bit 6 | Byte Order: Little Endian 7 | CPU(s): 96 8 | On-line CPU(s) list: 0-95 9 | Thread(s) per core: 1 10 | Core(s) per socket: 48 11 | Socket(s): 2 12 | NUMA node(s): 2 13 | Vendor ID: GenuineIntel 14 | CPU family: 6 15 | Model: 143 16 | Model name: Intel(R) Xeon(R) Platinum 8468 17 | Stepping: 8 18 | CPU MHz: 3645.945 19 | CPU max MHz: 3800.0000 20 | CPU min MHz: 800.0000 21 | BogoMIPS: 4200.00 22 | L1d cache: 48K 23 | L1i cache: 32K 24 | L2 cache: 2048K 25 | L3 cache: 107520K 26 | NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94 27 | NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95 28 | Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities 29 | ``` 30 | 31 | ``` 32 | $ nvidia-smi 33 | Fri Feb 23 11:51:11 2024 34 | +---------------------------------------------------------------------------------------+ 35 | | NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 | 36 | |-----------------------------------------+----------------------+----------------------+ 37 | | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 38 | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 39 | | | | MIG M. 
| 40 | |=========================================+======================+======================| 41 | | 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | 42 | | N/A 33C P0 72W / 700W | 2MiB / 81559MiB | 0% Default | 43 | | | | Disabled | 44 | +-----------------------------------------+----------------------+----------------------+ 45 | 46 | +---------------------------------------------------------------------------------------+ 47 | | Processes: | 48 | | GPU GI CI PID Type Process name GPU Memory | 49 | | ID ID Usage | 50 | |=======================================================================================| 51 | | No running processes found | 52 | +---------------------------------------------------------------------------------------+ 53 | ``` 54 | 55 | ``` 56 | jdh4@della-j11g1:~$ nvidia-smi -a 57 | ==============NVSMI LOG============== 58 | Timestamp : Fri Feb 23 11:51:29 2024 59 | Driver Version : 545.23.08 60 | CUDA Version : 12.3 61 | 62 | Attached GPUs : 1 63 | GPU 00000000:19:00.0 64 | Product Name : NVIDIA H100 80GB HBM3 65 | Product Brand : NVIDIA 66 | Product Architecture : Hopper 67 | Display Mode : Enabled 68 | Display Active : Disabled 69 | Persistence Mode : Enabled 70 | Addressing Mode : None 71 | MIG Mode 72 | Current : Disabled 73 | Pending : Disabled 74 | Accounting Mode : Disabled 75 | Accounting Mode Buffer Size : 4000 76 | Driver Model 77 | Current : N/A 78 | Pending : N/A 79 | Serial Number : 1654123038646 80 | GPU UUID : GPU-10f35015-e921-bfab-2eb8-4e9b6664d5f1 81 | Minor Number : 0 82 | VBIOS Version : 96.00.74.00.0D 83 | MultiGPU Board : No 84 | Board ID : 0x1900 85 | Board Part Number : 692-2G520-0200-000 86 | GPU Part Number : 2330-885-A1 87 | FRU Part Number : N/A 88 | Module ID : 2 89 | Inforom Version 90 | Image Version : G520.0200.00.05 91 | OEM Object : 2.1 92 | ECC Object : 7.16 93 | Power Management Object : N/A 94 | Inforom BBX Object Flush 95 | Latest Timestamp : 2024/02/22 13:09:29.459 96 | Latest Duration : 119019 us 97 | GPU Operation Mode 98 | Current : N/A 99 | Pending : N/A 100 | GSP Firmware Version : N/A 101 | GPU C2C Mode : Disabled 102 | GPU Virtualization Mode 103 | Virtualization Mode : None 104 | Host VGPU Mode : N/A 105 | GPU Reset Status 106 | Reset Required : No 107 | Drain and Reset Recommended : No 108 | IBMNPU 109 | Relaxed Ordering Mode : N/A 110 | PCI 111 | Bus : 0x19 112 | Device : 0x00 113 | Domain : 0x0000 114 | Device Id : 0x233010DE 115 | Bus Id : 00000000:19:00.0 116 | Sub System Id : 0x16C110DE 117 | GPU Link Info 118 | PCIe Generation 119 | Max : 5 120 | Current : 5 121 | Device Current : 5 122 | Device Max : 5 123 | Host Max : 5 124 | Link Width 125 | Max : 16x 126 | Current : 16x 127 | Bridge Chip 128 | Type : N/A 129 | Firmware : N/A 130 | Replays Since Reset : 0 131 | Replay Number Rollovers : 0 132 | Tx Throughput : 464 KB/s 133 | Rx Throughput : 2593 KB/s 134 | Atomic Caps Inbound : N/A 135 | Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 136 | Fan Speed : N/A 137 | Performance State : P0 138 | Clocks Event Reasons 139 | Idle : Active 140 | Applications Clocks Setting : Not Active 141 | SW Power Cap : Not Active 142 | HW Slowdown : Not Active 143 | HW Thermal Slowdown : Not Active 144 | HW Power Brake Slowdown : Not Active 145 | Sync Boost : Not Active 146 | SW Thermal Slowdown : Not Active 147 | Display Clock Setting : Not Active 148 | FB Memory Usage 149 | Total : 81559 MiB 150 | Reserved : 328 MiB 151 | Used : 2 MiB 152 | Free : 81227 MiB 153 | BAR1 Memory Usage 154 | Total 
: 131072 MiB 155 | Used : 1 MiB 156 | Free : 131071 MiB 157 | Conf Compute Protected Memory Usage 158 | Total : 0 MiB 159 | Used : 0 MiB 160 | Free : 0 MiB 161 | Compute Mode : Default 162 | Utilization 163 | Gpu : 0 % 164 | Memory : 0 % 165 | Encoder : 0 % 166 | Decoder : 0 % 167 | JPEG : 0 % 168 | OFA : 0 % 169 | Encoder Stats 170 | Active Sessions : 0 171 | Average FPS : 0 172 | Average Latency : 0 173 | FBC Stats 174 | Active Sessions : 0 175 | Average FPS : 0 176 | Average Latency : 0 177 | ECC Mode 178 | Current : Enabled 179 | Pending : Enabled 180 | ECC Errors 181 | Volatile 182 | SRAM Correctable : 0 183 | SRAM Uncorrectable : 0 184 | DRAM Correctable : 0 185 | DRAM Uncorrectable : 0 186 | Aggregate 187 | SRAM Correctable : 0 188 | SRAM Uncorrectable : 0 189 | DRAM Correctable : 0 190 | DRAM Uncorrectable : 0 191 | Retired Pages 192 | Single Bit ECC : N/A 193 | Double Bit ECC : N/A 194 | Pending Page Blacklist : N/A 195 | Remapped Rows 196 | Correctable Error : 0 197 | Uncorrectable Error : 0 198 | Pending : No 199 | Remapping Failure Occurred : No 200 | Bank Remap Availability Histogram 201 | Max : 2560 bank(s) 202 | High : 0 bank(s) 203 | Partial : 0 bank(s) 204 | Low : 0 bank(s) 205 | None : 0 bank(s) 206 | Temperature 207 | GPU Current Temp : 33 C 208 | GPU T.Limit Temp : 54 C 209 | GPU Shutdown T.Limit Temp : -8 C 210 | GPU Slowdown T.Limit Temp : -2 C 211 | GPU Max Operating T.Limit Temp : 0 C 212 | GPU Target Temperature : N/A 213 | Memory Current Temp : 41 C 214 | Memory Max Operating T.Limit Temp : 0 C 215 | GPU Power Readings 216 | Power Draw : 72.02 W 217 | Current Power Limit : 700.00 W 218 | Requested Power Limit : 700.00 W 219 | Default Power Limit : 700.00 W 220 | Min Power Limit : 200.00 W 221 | Max Power Limit : 700.00 W 222 | GPU Memory Power Readings 223 | Power Draw : 47.78 W 224 | Module Power Readings 225 | Power Draw : N/A 226 | Current Power Limit : N/A 227 | Requested Power Limit : N/A 228 | Default Power Limit : N/A 229 | Min Power Limit : N/A 230 | Max Power Limit : N/A 231 | Clocks 232 | Graphics : 345 MHz 233 | SM : 345 MHz 234 | Memory : 2619 MHz 235 | Video : 765 MHz 236 | Applications Clocks 237 | Graphics : 1980 MHz 238 | Memory : 2619 MHz 239 | Default Applications Clocks 240 | Graphics : 1980 MHz 241 | Memory : 2619 MHz 242 | Deferred Clocks 243 | Memory : N/A 244 | Max Clocks 245 | Graphics : 1980 MHz 246 | SM : 1980 MHz 247 | Memory : 2619 MHz 248 | Video : 1545 MHz 249 | Max Customer Boost Clocks 250 | Graphics : 1980 MHz 251 | Clock Policy 252 | Auto Boost : N/A 253 | Auto Boost Default : N/A 254 | Voltage 255 | Graphics : 670.000 mV 256 | Fabric 257 | State : Completed 258 | Status : Success 259 | Processes : None 260 | ``` 261 | 262 | ``` 263 | $ numactl -H 264 | available: 2 nodes (0-1) 265 | node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 266 | node 0 size: 515020 MB 267 | node 0 free: 509047 MB 268 | node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 269 | node 1 size: 516037 MB 270 | node 1 free: 489964 MB 271 | node distances: 272 | node 0 1 273 | 0: 10 21 274 | 1: 21 10 275 | ``` 276 | 277 | ## Intra-Node Topology 278 | 279 | ``` 280 | jdh4@della-k17g3:~$ nvidia-smi topo -m 281 | GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID 282 | GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 
PIX PIX NODE NODE NODE NODE 0 N/A 283 | GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE 0 N/A 284 | GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE 0 N/A 285 | GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX NODE NODE NODE 0 N/A 286 | GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 NODE NODE NODE PIX PIX NODE 1 1 N/A 287 | GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 NODE NODE NODE NODE NODE NODE 1 1 N/A 288 | GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 NODE NODE NODE NODE NODE PIX 1 1 N/A 289 | GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X NODE NODE NODE NODE NODE NODE 1 1 N/A 290 | NIC0 PIX NODE NODE NODE NODE NODE NODE NODE X PIX NODE NODE NODE NODE 291 | NIC1 PIX NODE NODE NODE NODE NODE NODE NODE PIX X NODE NODE NODE NODE 292 | NIC2 NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X NODE NODE NODE 293 | NIC3 NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X PIX NODE 294 | NIC4 NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE PIX X NODE 295 | NIC5 NODE NODE NODE NODE NODE NODE PIX NODE NODE NODE NODE NODE NODE X 296 | 297 | Legend: 298 | 299 | X = Self 300 | SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) 301 | NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node 302 | PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) 303 | PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) 304 | PIX = Connection traversing at most a single PCIe bridge 305 | NV# = Connection traversing a bonded set of # NVLinks 306 | 307 | NIC Legend: 308 | 309 | NIC0: mlx5_0 310 | NIC1: mlx5_1 311 | NIC2: mlx5_2 312 | NIC3: mlx5_3 313 | NIC4: mlx5_4 314 | NIC5: mlx5_5 315 | ``` 316 | -------------------------------------------------------------------------------- /02_cuda_toolkit/README.md: -------------------------------------------------------------------------------- 1 | # NVIDIA CUDA Toolkit 2 | 3 | ![NVIDIA CUDA](https://upload.wikimedia.org/wikipedia/en/b/b9/Nvidia_CUDA_Logo.jpg) 4 | 5 | The [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) provides a comprehensive set of libraries and tools for developing and running GPU-accelerated applications. 
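Before listing what the toolkit provides, here is a small sketch of what calling one of its libraries looks like from host code. It uses cuBLAS (the `libcublas.so` library listed further down this page) to compute a SAXPY on the GPU; the file name and sizes are illustrative, and the `05_cuda_libraries` directory covers the libraries in more depth. It assumes a `cudatoolkit` module has been loaded so that `nvcc saxpy_cublas.cu -o saxpy_cublas -lcublas` can find the headers and library.

```
// saxpy_cublas.cu -- illustrative sketch: call a CUDA Toolkit library (cuBLAS)
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
  const int n = 1 << 20;
  const float alpha = 2.0f;
  size_t bytes = n * sizeof(float);

  float *h_x = (float *)malloc(bytes);
  float *h_y = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 3.0f; }

  float *d_x, *d_y;
  cudaMalloc(&d_x, bytes);
  cudaMalloc(&d_y, bytes);
  cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y on the GPU
  cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
  cublasDestroy(handle);

  printf("h_y[0] = %f (expect 5.0)\n", h_y[0]);
  cudaFree(d_x); cudaFree(d_y);
  free(h_x); free(h_y);
  return 0;
}
```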
6 | 7 | List the available modules that are related to CUDA: 8 | 9 | ``` 10 | $ module avail cudatoolkit 11 | ------------ /usr/local/share/Modules/modulefiles ------------- 12 | cudatoolkit/10.2 cudatoolkit/11.7 cudatoolkit/12.4 13 | cudatoolkit/11.1 cudatoolkit/12.0 cudatoolkit/12.5 14 | cudatoolkit/11.3 cudatoolkit/12.2 cudatoolkit/12.6 15 | cudatoolkit/11.4 cudatoolkit/12.3 16 | ``` 17 | 18 | Run the following command to see which environment variables the `cudatoolkit` module is modifying: 19 | 20 | ``` 21 | $ $ module show cudatoolkit/12.5 22 | ------------------------------------------------------------------- 23 | /usr/local/share/Modules/modulefiles/cudatoolkit/12.5: 24 | 25 | module-whatis {Sets up cudatoolkit125 12.5 in your environment} 26 | prepend-path PATH /usr/local/cuda-12.5/bin 27 | prepend-path LD_LIBRARY_PATH /usr/local/cuda-12.5/lib64 28 | prepend-path LIBRARY_PATH /usr/local/cuda-12.5/lib64 29 | prepend-path MANPATH /usr/local/cuda-12.5/doc/man 30 | append-path -d { } LDFLAGS -L/usr/local/cuda-12.5/lib64 31 | append-path -d { } INCLUDE -I/usr/local/cuda-12.5/include 32 | append-path CPATH /usr/local/cuda-12.5/include 33 | append-path -d { } FFLAGS -I/usr/local/cuda-12.5/include 34 | append-path -d { } LOCAL_LDFLAGS -L/usr/local/cuda-12.5/lib64 35 | append-path -d { } LOCAL_INCLUDE -I/usr/local/cuda-12.5/include 36 | append-path -d { } LOCAL_CFLAGS -I/usr/local/cuda-12.5/include 37 | append-path -d { } LOCAL_FFLAGS -I/usr/local/cuda-12.5/include 38 | append-path -d { } LOCAL_CXXFLAGS -I/usr/local/cuda-12.5/include 39 | setenv CUDA_HOME /usr/local/cuda-12.5 40 | ------------------------------------------------------------------- 41 | ``` 42 | 43 | Let's look at the files in `/usr/local/cuda-12.5/bin`: 44 | 45 | ``` 46 | $ ls -ltrh /usr/local/cuda-12.5/bin 47 | total 243M 48 | -rwxr-xr-x. 1 root root 49M Apr 15 22:46 nvdisasm 49 | -rwxr-xr-x. 1 root root 688K Apr 15 22:47 cuobjdump 50 | -rwxr-xr-x. 6 root root 11K May 17 18:50 __nvcc_device_query 51 | -rwxr-xr-x. 14 root root 285 May 17 18:50 nvvp 52 | -rwxr-xr-x. 1 root root 111K Jun 6 06:03 nvprune 53 | -rwxr-xr-x. 1 root root 75K Jun 6 06:09 cu++filt 54 | -rwxr-xr-x. 1 root root 30M Jun 6 06:12 ptxas 55 | -rwxr-xr-x. 1 root root 30M Jun 6 06:12 nvlink 56 | -rw-r--r--. 1 root root 465 Jun 6 06:12 nvcc.profile 57 | -rwxr-xr-x. 1 root root 22M Jun 6 06:12 nvcc 58 | -rwxr-xr-x. 1 root root 1.2M Jun 6 06:12 fatbinary 59 | -rwxr-xr-x. 1 root root 7.1M Jun 6 06:12 cudafe++ 60 | -rwxr-xr-x. 1 root root 87K Jun 6 06:12 bin2c 61 | -rwxr-xr-x. 1 root root 803K Jun 6 07:25 cuda-gdbserver 62 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.9-tui 63 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.8-tui 64 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.12-tui 65 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.11-tui 66 | -rwxr-xr-x. 1 root root 17M Jun 6 07:25 cuda-gdb-python3.10-tui 67 | -rwxr-xr-x. 1 root root 15M Jun 6 07:25 cuda-gdb-minimal 68 | -rwxr-xr-x. 1 root root 1.9K Jun 6 07:25 cuda-gdb 69 | -rwxr-xr-x. 1 root root 5.8M Jun 6 07:56 nvprof 70 | lrwxrwxrwx. 1 root root 4 Jun 6 08:04 computeprof -> nvvp 71 | -rwxr-xr-x. 11 root root 1.6K Jun 14 19:56 nsight_ee_plugins_manage.sh 72 | -rwxr-xr-x. 1 root root 833 Jun 25 17:54 nsys-ui 73 | -rwxr-xr-x. 1 root root 743 Jun 25 17:54 nsys 74 | -rwxr-xr-x. 5 root root 112 Jul 12 02:21 compute-sanitizer 75 | -rwxr-xr-x. 5 root root 3.6K Jul 26 18:06 ncu-ui 76 | -rwxr-xr-x. 5 root root 3.8K Jul 26 18:06 ncu 77 | -rwxr-xr-x. 
4 root root 197 Jul 26 18:06 nsight-sys 78 | drwxr-xr-x. 2 root root 43 Aug 28 10:24 crt 79 | ``` 80 | 81 | `nvcc` is the NVIDIA CUDA Compiler. Note that `nvcc` is built on `llvm` as [described here](https://developer.nvidia.com/cuda-llvm-compiler). To learn more about an executable, use the help option. For instance: `nvcc --help`. 82 | 83 | 84 | Let's look at the libraries: 85 | 86 | ``` 87 | $ ls -lL /usr/local/cuda-12.5/lib64/lib*.so 88 | -rwxr-xr-x. 1 root root 2412216 Jun 6 07:56 /usr/local/cuda-12.5/lib64/libaccinj64.so 89 | -rwxr-xr-x. 1 root root 1505608 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libcheckpoint.so 90 | -rwxr-xr-x. 1 root root 446820528 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libcublasLt.so 91 | -rwxr-xr-x. 1 root root 104128480 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libcublas.so 92 | -rwxr-xr-x. 1 root root 712032 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libcudart.so 93 | -rwxr-xr-x. 1 root root 276080616 Jun 6 06:16 /usr/local/cuda-12.5/lib64/libcufft.so 94 | -rwxr-xr-x. 1 root root 974920 Jun 6 06:16 /usr/local/cuda-12.5/lib64/libcufftw.so 95 | -rwxr-xr-x. 6 root root 43320 Jun 5 13:57 /usr/local/cuda-12.5/lib64/libcufile_rdma.so 96 | -rwxr-xr-x. 1 root root 2993816 Jun 6 06:53 /usr/local/cuda-12.5/lib64/libcufile.so 97 | -rwxr-xr-x. 1 root root 2832640 Jun 6 07:56 /usr/local/cuda-12.5/lib64/libcuinj64.so 98 | -rwxr-xr-x. 1 root root 7807144 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libcupti.so 99 | -rwxr-xr-x. 1 root root 96529840 Jun 6 06:14 /usr/local/cuda-12.5/lib64/libcurand.so 100 | -rwxr-xr-x. 1 root root 82234792 Jun 6 06:55 /usr/local/cuda-12.5/lib64/libcusolverMg.so 101 | -rwxr-xr-x. 1 root root 122162688 Jun 6 06:55 /usr/local/cuda-12.5/lib64/libcusolver.so 102 | -rwxr-xr-x. 1 root root 294682616 Jun 6 06:29 /usr/local/cuda-12.5/lib64/libcusparse.so 103 | -rwxr-xr-x. 1 root root 1651184 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppc.so 104 | -rwxr-xr-x. 1 root root 17736496 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppial.so 105 | -rwxr-xr-x. 1 root root 7689032 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppicc.so 106 | -rwxr-xr-x. 1 root root 11248792 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppidei.so 107 | -rwxr-xr-x. 1 root root 101120104 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppif.so 108 | -rwxr-xr-x. 1 root root 41165712 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppig.so 109 | -rwxr-xr-x. 1 root root 10703688 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppim.so 110 | -rwxr-xr-x. 1 root root 37897296 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppist.so 111 | -rwxr-xr-x. 1 root root 724392 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppisu.so 112 | -rwxr-xr-x. 1 root root 5595760 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnppitc.so 113 | -rwxr-xr-x. 1 root root 14169336 Jun 6 06:37 /usr/local/cuda-12.5/lib64/libnpps.so 114 | -rwxr-xr-x. 1 root root 757496 Jun 6 06:10 /usr/local/cuda-12.5/lib64/libnvblas.so 115 | -rwxr-xr-x. 1 root root 2409960 Jun 6 06:08 /usr/local/cuda-12.5/lib64/libnvfatbin.so 116 | -rwxr-xr-x. 1 root root 54560656 Jun 6 06:11 /usr/local/cuda-12.5/lib64/libnvJitLink.so 117 | -rwxr-xr-x. 1 root root 6726448 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvjpeg.so 118 | -rwxr-xr-x. 1 root root 28139320 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libnvperf_host.so 119 | -rwxr-xr-x. 1 root root 5579216 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libnvperf_target.so 120 | -rwxr-xr-x. 1 root root 5322632 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvrtc-builtins.so 121 | -rwxr-xr-x. 
1 root root 61401616 Jun 6 06:07 /usr/local/cuda-12.5/lib64/libnvrtc.so 122 | -rwxr-xr-x. 10 root root 40136 May 17 18:50 /usr/local/cuda-12.5/lib64/libnvToolsExt.so 123 | -rwxr-xr-x. 10 root root 30856 May 17 18:50 /usr/local/cuda-12.5/lib64/libOpenCL.so 124 | -rwxr-xr-x. 1 root root 920920 Jun 6 07:30 /usr/local/cuda-12.5/lib64/libpcsamplingutil.so 125 | ``` 126 | 127 | ## cuDNN 128 | 129 | There is also the [CUDA Deep Neural Net](https://developer.nvidia.com/cudnn) (cuDNN) library. It is external to the NVIDIA CUDA Toolkit and is used with TensorFlow, for instance, to provide GPU routines for training neural nets. See the available modules with: 130 | 131 | ``` 132 | $ module avail cudnn 133 | ``` 134 | 135 | ## Conda Installations 136 | 137 | When you install [CuPy](https://cupy.dev), for instance, which is like NumPy for GPUs, Conda will include the CUDA libraries: 138 | 139 |
140 | $ module load anaconda3/2024.6
141 | $ conda create --name cupy-env cupy --channel conda-forge
142 | ...
143 |   _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
144 |   _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
145 |   bzip2              conda-forge/linux-64::bzip2-1.0.8-hd590300_5 
146 |   ca-certificates    conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0 
147 |   cuda-nvrtc         conda-forge/linux-64::cuda-nvrtc-12.5.82-he02047a_0 
148 |   cuda-version       conda-forge/noarch::cuda-version-12.5-hd4f0392_3 
149 |   cupy               conda-forge/linux-64::cupy-13.2.0-py312had87585_0 
150 |   cupy-core          conda-forge/linux-64::cupy-core-13.2.0-py312hd074ebb_0 
151 |   fastrlock          conda-forge/linux-64::fastrlock-0.8.2-py312h30efb56_2 
152 |   ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.40-hf3520f5_7 
153 |   libblas            conda-forge/linux-64::libblas-3.9.0-22_linux64_openblas 
154 |   libcblas           conda-forge/linux-64::libcblas-3.9.0-22_linux64_openblas 
155 |   libcublas          conda-forge/linux-64::libcublas-12.5.3.2-he02047a_0 
156 |   libcufft           conda-forge/linux-64::libcufft-11.2.3.61-he02047a_0 
157 |   libcurand          conda-forge/linux-64::libcurand-10.3.6.82-he02047a_0 
158 |   libcusolver        conda-forge/linux-64::libcusolver-11.6.3.83-he02047a_0 
159 |   libcusparse        conda-forge/linux-64::libcusparse-12.5.1.3-he02047a_0 
160 |   libexpat           conda-forge/linux-64::libexpat-2.6.2-h59595ed_0 
161 |   libffi             conda-forge/linux-64::libffi-3.4.2-h7f98852_5 
162 |   libgcc-ng          conda-forge/linux-64::libgcc-ng-14.1.0-h77fa898_0 
163 |   libgfortran-ng     conda-forge/linux-64::libgfortran-ng-14.1.0-h69a702a_0 
164 |   libgfortran5       conda-forge/linux-64::libgfortran5-14.1.0-hc5f4f2c_0 
165 |   libgomp            conda-forge/linux-64::libgomp-14.1.0-h77fa898_0 
166 |   liblapack          conda-forge/linux-64::liblapack-3.9.0-22_linux64_openblas 
167 |   libnsl             conda-forge/linux-64::libnsl-2.0.1-hd590300_0 
168 |   libnvjitlink       conda-forge/linux-64::libnvjitlink-12.5.82-he02047a_0 
169 |   libopenblas        conda-forge/linux-64::libopenblas-0.3.27-pthreads_hac2b453_1 
170 |   libsqlite          conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0 
171 |   libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-14.1.0-hc0a3c3a_0 
172 |   libuuid            conda-forge/linux-64::libuuid-2.38.1-h0b41bf4_0 
173 |   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1 
174 |   libzlib            conda-forge/linux-64::libzlib-1.3.1-h4ab18f5_1 
175 |   ncurses            conda-forge/linux-64::ncurses-6.5-h59595ed_0 
176 |   numpy              conda-forge/linux-64::numpy-2.0.0-py312h22e1c76_0 
177 |   openssl            conda-forge/linux-64::openssl-3.3.1-h4ab18f5_1 
178 |   pip                conda-forge/noarch::pip-24.0-pyhd8ed1ab_0 
179 |   python             conda-forge/linux-64::python-3.12.4-h194c7f8_0_cpython 
180 |   python_abi         conda-forge/linux-64::python_abi-3.12-4_cp312 
181 |   readline           conda-forge/linux-64::readline-8.2-h8228510_1 
182 |   setuptools         conda-forge/noarch::setuptools-70.1.1-pyhd8ed1ab_0 
183 |   tk                 conda-forge/linux-64::tk-8.6.13-noxft_h4845f30_101 
184 |   tzdata             conda-forge/noarch::tzdata-2024a-h0c530f3_0 
185 |   wheel              conda-forge/noarch::wheel-0.43.0-pyhd8ed1ab_1 
186 |   xz                 conda-forge/linux-64::xz-5.2.6-h166bdaf_0 
187 | 
188 | 189 | When using `pip` to do the installation, one needs to load the `cudatoolkit` module since that dependency is assumed to be available on the local system. The Conda approach installs all the dependencies so one does not load the module. 190 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/README.md: -------------------------------------------------------------------------------- 1 | # Your First GPU Job 2 | 3 | Using the GPUs on the Princeton HPC clusters is easy. Pick one of the applications below to get started. To obtain the materials to run the examples, use these commands: 4 | 5 | ``` 6 | $ ssh @adroit.princeton.edu 7 | $ cd /scratch/network/ 8 | $ git clone https://github.com/PrincetonUniversity/gpu_programming_intro.git 9 | ``` 10 | 11 | To add a GPU to your Slurm allocation: 12 | 13 | ``` 14 | #SBATCH --gres=gpu:1 # number of gpus per node 15 | ``` 16 | 17 | For Adroit, one can specify the GPU type using a constraint: 18 | 19 | ``` 20 | #SBATCH --constraint=a100 # set to gpu80, a100 or v100 21 | #SBATCH --gres=gpu:1 # number of gpus per node 22 | ``` 23 | 24 | For more on specifying the GPU type on Adroit [see this page](https://researchcomputing.princeton.edu/systems/adroit#gpus). 25 | 26 | ## CuPy 27 | 28 | [CuPy](https://cupy.chainer.org) provides a Python interface to set of common numerical routines (e.g., matrix factorizations) which are executed on a GPU (see the [Reference Manual](https://docs-cupy.chainer.org/en/stable/reference/index.html)). You can roughly think of CuPy as NumPy for GPUs. This example is set to use the CuPy installation of the workshop instructor. If you use CuPy for your research work then you should [install it](https://github.com/PrincetonUniversity/gpu_programming_intro/tree/master/02_cuda_toolkit#conda-installations) into your account. 29 | 30 | Examine the Python script before running the code: 31 | 32 | ```python 33 | $ cd gpu_programming_intro/03_your_first_gpu_job/cupy 34 | $ cat svd.py 35 | from time import perf_counter 36 | import cupy as cp 37 | 38 | N = 1000 39 | X = cp.random.randn(N, N, dtype=cp.float64) 40 | 41 | trials = 5 42 | times = [] 43 | for _ in range(trials): 44 | t0 = perf_counter() 45 | u, s, v = cp.linalg.svd(X) 46 | cp.cuda.Device(0).synchronize() 47 | times.append(perf_counter() - t0) 48 | print("Execution time: ", min(times)) 49 | print("sum(s) = ", s.sum()) 50 | print("CuPy version: ", cp.__version__) 51 | ``` 52 | 53 | Below is a sample Slurm script: 54 | 55 | ```bash 56 | $ cat job.slurm 57 | #!/bin/bash 58 | #SBATCH --job-name=cupy-job # create a short name for your job 59 | #SBATCH --nodes=1 # node count 60 | #SBATCH --ntasks=1 # total number of tasks across all nodes 61 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 62 | #SBATCH --gres=gpu:1 # number of gpus per node 63 | #SBATCH --mem=4G # total memory (RAM) per node 64 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 65 | #SBATCH --constraint=a100 # choose a100 or v100 66 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 67 | 68 | module purge 69 | module load anaconda3/2024.6 70 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/cupy-env 71 | 72 | python svd.py 73 | ``` 74 | 75 | A GPU is allocated using the Slurm directive `#SBATCH --gres=gpu:1`. 76 | 77 | Submit the job: 78 | 79 | ``` 80 | $ sbatch job.slurm 81 | ``` 82 | 83 | Wait a few seconds for the job to run. 
Inspect the output: 84 | 85 | ``` 86 | $ cat slurm-*.out 87 | ``` 88 | 89 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. What happens if you re-run the script with the matrix in single precision? Does the execution time double if N is doubled? There is a CPU version of the code at the bottom of this page. Does the operation run faster on the CPU with NumPy or on the GPU with CuPy? Try [this exercise](https://github.com/PrincetonUniversity/a100_workshop/tree/main/06_cupy#cupy-uses-tensor-cores) where the Tensor Cores are utilized by using less than single precision (i.e., TensorFloat32). 90 | 91 | Why are multiple trials used when measuring the execution time? `CuPy` compiles a custom GPU kernel for each GPU operation (e.g., SVD). This means the first time a `CuPy` function is called the measured time is the sum of the compile time plus the time to execute the operation. The second and later calls only include the time to execute the operation. 92 | 93 | In addition to CuPy, Python programmers looking to run their code on GPUs should also be aware of [Numba](https://numba.pydata.org/) and [JAX](https://github.com/google/jax). 94 | 95 | To see performance comparison between the CPU and GPU, see `matmul_numpy.py` and `matmul_cupy.py` in [this repo](https://github.com/jdh4/python-gpu/tree/main/cupy). 96 | 97 | ## PyTorch 98 | 99 | [PyTorch](https://pytorch.org) is a popular deep learning framework. See its documentation for [Tensor operations](https://pytorch.org/docs/stable/tensors.html). This example is set to use the PyTorch installation of the workshop instructor. If you use PyTorch for your research work then you should [install it](https://researchcomputing.princeton.edu/support/knowledge-base/pytorch) into your account. 100 | 101 | Examine the Python script before running the code: 102 | 103 | ```python 104 | $ cd gpu_programming_intro/03_your_first_gpu_job/pytorch 105 | $ cat svd.py 106 | from time import perf_counter 107 | import torch 108 | 109 | N = 1000 110 | 111 | cuda0 = torch.device('cuda:0') 112 | x = torch.randn(N, N, dtype=torch.float64, device=cuda0) 113 | t0 = perf_counter() 114 | u, s, v = torch.svd(x) 115 | elapsed_time = perf_counter() - t0 116 | 117 | print("Execution time: ", elapsed_time) 118 | print("Result: ", torch.sum(s).cpu().numpy()) 119 | print("PyTorch version: ", torch.__version__) 120 | ``` 121 | 122 | Here is a sample Slurm script: 123 | 124 | ```bash 125 | $ cat job.slurm 126 | #!/bin/bash 127 | #SBATCH --job-name=torch-svd # create a short name for your job 128 | #SBATCH --nodes=1 # node count 129 | #SBATCH --ntasks=1 # total number of tasks across all nodes 130 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 131 | #SBATCH --mem-per-cpu=4G # memory per cpu-core 132 | #SBATCH --gres=gpu:1 # number of gpus per node 133 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 134 | #SBATCH --constraint=a100 # choose a100 or v100 on adroit 135 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 136 | 137 | module purge 138 | module load anaconda3/2024.6 139 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/torch-env 140 | 141 | python svd.py 142 | ``` 143 | 144 | Submit the job: 145 | 146 | ``` 147 | $ sbatch job.slurm 148 | ``` 149 | 150 | Wait a few seconds for the job to run. 
Inspect the output: 151 | 152 | ``` 153 | $ cat slurm-*.out 154 | ``` 155 | 156 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. 157 | 158 | ## TensorFlow 159 | 160 | [TensorFlow](https://www.tensorflow.org) is popular library for training deep neural networks. It can also be used for various numerical computations (see [documentation](https://www.tensorflow.org/api_docs/python/tf)). This example is set to use the TensorFlow installation of the workshop instructor. If you use TensorFlow for your research work then you should [install it](https://researchcomputing.princeton.edu/support/knowledge-base/tensorflow) into your account. 161 | 162 | Examine the Python script before running the code: 163 | 164 | ```python 165 | $ cd gpu_programming_intro/03_your_first_gpu_job/tensorflow 166 | $ cat svd.py 167 | from time import perf_counter 168 | 169 | import os 170 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 171 | 172 | import tensorflow as tf 173 | print("TensorFlow version: ", tf.__version__) 174 | 175 | N = 100 176 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64) 177 | t0 = perf_counter() 178 | s, u, v = tf.linalg.svd(x) 179 | elapsed_time = perf_counter() - t0 180 | print("Execution time: ", elapsed_time) 181 | print("Result: ", tf.reduce_sum(s).numpy()) 182 | ``` 183 | 184 | Below is a sample Slurm script: 185 | 186 | ```bash 187 | $ cat job.slurm 188 | #!/bin/bash 189 | #SBATCH --job-name=svd-tf # create a short name for your job 190 | #SBATCH --nodes=1 # node count 191 | #SBATCH --ntasks=1 # total number of tasks across all nodes 192 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 193 | #SBATCH --mem=4G # total memory (RAM) per node 194 | #SBATCH --gres=gpu:1 # number of gpus per node 195 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 196 | #SBATCH --constraint=a100 # choose a100 or v100 197 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 198 | 199 | module load anaconda3/2024.6 200 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/tf2-gpu 201 | 202 | python svd.py 203 | ``` 204 | 205 | Submit the job: 206 | 207 | ``` 208 | $ sbatch job.slurm 209 | ``` 210 | 211 | Wait a few seconds for the job to run. Inspect the output: 212 | 213 | ``` 214 | $ cat slurm-*.out 215 | ``` 216 | 217 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. 218 | 219 | 224 | 225 | ## R with NVBLAS 226 | 227 | Take a look at [this page](https://github.com/PrincetonUniversity/HPC_R_Workshop/tree/master/07_NVBLAS) and then run the commands below: 228 | 229 | ``` 230 | $ git clone https://github.com/PrincetonUniversity/HPC_R_Workshop 231 | $ cd HPC_R_Workshop/07_NVBLAS 232 | $ mv nvblas.conf ~ 233 | $ sbatch 07_NVBLAS.cmd 234 | ``` 235 | 236 | Here is the sample output: 237 | 238 | ``` 239 | $ cat slurm-*.out 240 | ... 241 | [1] "Matrix multiply:" 242 | user system elapsed 243 | 0.166 0.137 0.304 244 | [1] "----" 245 | [1] "Cholesky Factorization:" 246 | user system elapsed 247 | 1.053 0.041 1.096 248 | [1] "----" 249 | [1] "Singular Value Decomposition:" 250 | user system elapsed 251 | 8.060 1.837 5.345 252 | [1] "----" 253 | [1] "Principal Components Analysis:" 254 | user system elapsed 255 | 16.814 5.987 11.252 256 | [1] "----" 257 | [1] "Linear Discriminant Analysis:" 258 | user system elapsed 259 | 25.955 3.080 20.830 260 | [1] "----" 261 | ... 
262 | ``` 263 | 264 | See the [user guide](https://docs.nvidia.com/cuda/nvblas/index.html) for NVBLAS. 265 | 266 | ## MATLAB 267 | 268 | MATLAB is already installed on the cluster. Simply follow these steps: 269 | 270 | ```bash 271 | $ cd gpu_programming_intro/03_your_first_gpu_job/matlab 272 | $ cat svd.m 273 | ``` 274 | 275 | Here is the MATLAB script: 276 | 277 | ```matlab 278 | gpu = gpuDevice(); 279 | fprintf('Using a %s GPU.\n', gpu.Name); 280 | disp(gpuDevice); 281 | 282 | X = gpuArray([1 0 2; -1 5 0; 0 3 -9]); 283 | whos X 284 | [U,S,V] = svd(X) 285 | fprintf('trace(S): %f\n', trace(S)) 286 | quit; 287 | ``` 288 | 289 | Below is a sample Slurm script: 290 | 291 | ```bash 292 | #!/bin/bash 293 | #SBATCH --job-name=matlab-svd # create a short name for your job 294 | #SBATCH --nodes=1 # node count 295 | #SBATCH --ntasks=1 # total number of tasks across all nodes 296 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 297 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default) 298 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS) 299 | #SBATCH --gres=gpu:1 # number of gpus per node 300 | #SBATCH --constraint=a100 # choose a100 or v100 301 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 302 | 303 | module purge 304 | module load matlab/R2023a 305 | 306 | matlab -singleCompThread -nodisplay -nosplash -r svd 307 | ``` 308 | 309 | Submit the job: 310 | 311 | ``` 312 | $ sbatch job.slurm 313 | ``` 314 | 315 | Wait a few seconds for the job to run. Inspect the output: 316 | 317 | ``` 318 | $ cat slurm-*.out 319 | ``` 320 | 321 | You can monitor the progress of the job with `squeue -u $USER`. Once the job completes, view the output with `cat slurm-*.out`. Learn more about [MATLAB on the Research Computing clusters](https://researchcomputing.princeton.edu/support/knowledge-base/matlab). 322 | 323 | Here is an [intro](https://www.mathworks.com/help/parallel-computing/run-matlab-functions-on-a-gpu.html) to using MATLAB with GPUs. 324 | 325 | ## Julia 326 | 327 | Install the `CUDA` package then run the script in `03_your_first_gpu_job/julia`. See our [Julia webage](https://researchcomputing.princeton.edu/support/knowledge-base/julia). 328 | 329 | ## Monitoring GPU Usage 330 | 331 | To monitor jobs in our reservation: 332 | 333 | ``` 334 | $ watch -n 1 squeue -R gpuprimer 335 | ``` 336 | 337 | ## Benchmarks 338 | 339 | ### Matrix Multiplication 340 | 341 | | cluster | code | CPU-cores | time (s) | 342 | |:--------------------:|:----:|:-----------:|:--------:| 343 | | adroit (CPU) | NumPy | 1 | 24.2 | 344 | | adroit (CPU) | NumPy | 2 | 15.5 | 345 | | adroit (CPU) | NumPy | 4 | 5.3 | 346 | | adroit (V100) | CuPy | 1 | 0.3 | 347 | | adroit (K40c) | CuPy | 1 | 1.7 | 348 | 349 | Times are best of 5 for a square matrix with N=10000 in double precision. 350 | 351 | ### LU Decomposition 352 | 353 | | cluster | code | CPU-cores | time (s) | 354 | |:--------------------:|:-----------:|:----------:|:--------:| 355 | | adroit (CPU) | SciPy | 1 | 9.4 | 356 | | adroit (CPU) | SciPy | 2 | 7.9 | 357 | | adroit (CPU) | SciPy | 4 | 6.5 | 358 | | adroit (V100) | CuPy | 1 | 0.3 | 359 | | adroit (K40c) | CuPy | 1 | 1.1 | 360 | | adroit (V100) | Tensorflow | 1 | 0.3 | 361 | | adroit (K40c) | Tensorflow | 1 | 1.1 | 362 | | adroit (CPU) | Tensorflow | 1 | 50.8 | 363 | 364 | Times are best of 5 for a square matrix with N=10000 in double precision. 
365 | 366 | ### Singular Value Decomposition 367 | 368 | | cluster | code | CPU-cores | time (s) | 369 | |:--------------------:|:----------:|:----------:|:--------:| 370 | | adroit (CPU) | NumPy | 1 | 3.6 | 371 | | adroit (CPU) | NumPy | 2 | 3.0 | 372 | | adroit (CPU) | NumPy | 4 | 1.2 | 373 | | adroit (V100) | CuPy | 1 | 24.7 | 374 | | adroit (K40c) | CuPy | 1 | 30.5 | 375 | | adroit (V100) | Torch | 1 | 0.9 | 376 | | adroit (K40c) | Torch | 1 | 1.5 | 377 | | adroit (CPU) | Torch | 1 | 3.0 | 378 | | adroit (V100) | TensorFlow | 1 | 24.8 | 379 | | adroit (K40c) | TensorFlow | 1 | 29.7 | 380 | | adroit (CPU) | TensorFlow | 1 | 9.2 | 381 | 382 | Times are best of 5 for a square matrix with N=2000 in double precision. 383 | 384 | For the LU decomposition using SciPy: 385 | 386 | ``` 387 | from time import perf_counter 388 | 389 | import numpy as np 390 | import scipy as sp 391 | from scipy.linalg import lu 392 | 393 | N = 10000 394 | cpu_runs = 5 395 | 396 | times = [] 397 | X = np.random.randn(N, N).astype(np.float64) 398 | for _ in range(cpu_runs): 399 | t0 = perf_counter() 400 | p, l, u = lu(X, check_finite=False) 401 | times.append(perf_counter() - t0) 402 | print("CPU time: ", min(times)) 403 | print("NumPy version: ", np.__version__) 404 | print("SciPy version: ", sp.__version__) 405 | print(p.sum()) 406 | print(times) 407 | ``` 408 | 409 | For the LU decomposition on the CPU: 410 | 411 | ``` 412 | from time import perf_counter 413 | 414 | import os 415 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 416 | 417 | import tensorflow as tf 418 | print("TensorFlow version: ", tf.__version__) 419 | 420 | times = [] 421 | N = 10000 422 | with tf.device("/cpu:0"): 423 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64) 424 | for _ in range(5): 425 | t0 = perf_counter() 426 | lu, p = tf.linalg.lu(x) 427 | elapsed_time = perf_counter() - t0 428 | times.append(elapsed_time) 429 | print("Execution time: ", min(times)) 430 | print(times) 431 | print("Result: ", tf.reduce_sum(p).numpy()) 432 | ``` 433 | 434 | SVD with NumPy: 435 | 436 | ``` 437 | from time import perf_counter 438 | 439 | N = 2000 440 | cpu_runs = 5 441 | 442 | times = [] 443 | import numpy as np 444 | X = np.random.randn(N, N).astype(np.float64) 445 | for _ in range(cpu_runs): 446 | t0 = perf_counter() 447 | u, s, v = np.linalg.svd(X) 448 | times.append(perf_counter() - t0) 449 | print("CPU time: ", min(times)) 450 | print("NumPy version: ", np.__version__) 451 | print(s.sum()) 452 | print(times) 453 | ``` 454 | 455 | Performing benchmarks with R: 456 | 457 | ``` 458 | # install.packages("microbenchmark") 459 | library(microbenchmark) 460 | library(Matrix) 461 | 462 | N <- 10000 463 | microbenchmark(lu(matrix(rnorm(N*N), N, N)), times=5, unit="s") 464 | ``` 465 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/cupy/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cupy-job # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --gres=gpu:1 # number of gpus per node 7 | #SBATCH --mem=4G # total memory (RAM) per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | 
module load anaconda3/2024.6 14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/cupy-env 15 | 16 | python svd.py 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/cupy/lu.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | import numpy as np 3 | import cupy as cp 4 | import cupyx.scipy.linalg 5 | 6 | N = 10000 7 | X = cp.random.randn(N, N, dtype=np.float64) 8 | 9 | trials = 5 10 | times = [] 11 | for _ in range(trials): 12 | start_time = perf_counter() 13 | lu, piv = cupyx.scipy.linalg.lu_factor(X, check_finite=False) 14 | cp.cuda.Device(0).synchronize() 15 | times.append(perf_counter() - start_time) 16 | 17 | print("Execution time: ", min(times)) 18 | print("CuPy version: ", cp.__version__) 19 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/cupy/svd.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | import cupy as cp 3 | 4 | N = 1000 5 | X = cp.random.randn(N, N, dtype=cp.float64) 6 | 7 | trials = 5 8 | times = [] 9 | for _ in range(trials): 10 | t0 = perf_counter() 11 | u, s, v = cp.linalg.svd(X) 12 | cp.cuda.Device(0).synchronize() 13 | times.append(perf_counter() - t0) 14 | print("Execution time: ", min(times)) 15 | print("sum(s) = ", s.sum()) 16 | print("CuPy version: ", cp.__version__) 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/julia/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=julia_gpu # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --gres=gpu:1 # number of gpus per node 7 | #SBATCH --mem=4G # total memory (RAM) per node 8 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose gpu80, a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load julia/1.8.2 14 | 15 | julia svd.jl 16 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/julia/svd.jl: -------------------------------------------------------------------------------- 1 | using CUDA 2 | N = 8000 3 | F = CUDA.svd(CUDA.rand(N, N)) 4 | println(sum(F.S)) 5 | println("completed") 6 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/matlab/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=matlab-svd # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default) 7 | #SBATCH --time=00:05:00 # total run time limit (HH:MM:SS) 8 | #SBATCH --gres=gpu:1 # number of gpus per node 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load matlab/R2023a 14 | 15 | matlab -singleCompThread -nodisplay -nosplash -r svd 16 | 
-------------------------------------------------------------------------------- /03_your_first_gpu_job/matlab/svd.m: -------------------------------------------------------------------------------- 1 | gpu = gpuDevice(); 2 | fprintf('Using a %s GPU.\n', gpu.Name); 3 | disp(gpuDevice); 4 | 5 | X = gpuArray([1 0 2; -1 5 0; 0 3 -9]); 6 | whos X; 7 | [U,S,V] = svd(X) 8 | fprintf('trace(S): %f\n', trace(S)) 9 | quit; 10 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/pytorch/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=torch-svd # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load anaconda3/2023.9 14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/torch-env 15 | 16 | python svd.py 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/pytorch/svd.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | import torch 3 | 4 | N = 1000 5 | 6 | cuda0 = torch.device('cuda:0') 7 | x = torch.randn(N, N, dtype=torch.float64, device=cuda0) 8 | t0 = perf_counter() 9 | u, s, v = torch.svd(x) 10 | elapsed_time = perf_counter() - t0 11 | 12 | print("Execution time: ", elapsed_time) 13 | print("Result: ", torch.sum(s).cpu().numpy()) 14 | print("PyTorch version: ", torch.__version__) 15 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/tensorflow/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=svd-tf # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem=4G # total memory (RAM) per node 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:02:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load anaconda3/2024.6 14 | conda activate /scratch/network/jdh4/.gpu_workshop/envs/tf2-gpu 15 | 16 | python svd.py 17 | -------------------------------------------------------------------------------- /03_your_first_gpu_job/tensorflow/svd.py: -------------------------------------------------------------------------------- 1 | from time import perf_counter 2 | 3 | import os 4 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 5 | 6 | import tensorflow as tf 7 | print("TensorFlow version: ", tf.__version__) 8 | 9 | N = 100 10 | x = tf.random.normal((N, N), dtype=tf.dtypes.float64) 11 | t0 = perf_counter() 12 | s, u, v = tf.linalg.svd(x) 13 | elapsed_time = perf_counter() - t0 14 | print("Execution time: ", elapsed_time) 15 | print("Result: ", tf.reduce_sum(s).numpy()) 16 | 
-------------------------------------------------------------------------------- /04_gpu_tools/README.md: -------------------------------------------------------------------------------- 1 | # GPU Tools 2 | 3 | This page presents common tools and utilities for GPU computing. 4 | 5 | # nvidia-smi 6 | 7 | This is the NVIDIA Systems Management Interface. This utility can be used to monitor GPU usage and GPU memory usage. It is a comprehensive tool with many options. 8 | 9 | ``` 10 | $ nvidia-smi 11 | Wed May 28 09:39:23 2025 12 | +-----------------------------------------------------------------------------------------+ 13 | | NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 | 14 | |-----------------------------------------+------------------------+----------------------+ 15 | | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 16 | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 17 | | | | MIG M. | 18 | |=========================================+========================+======================| 19 | | 0 NVIDIA A100 80GB PCIe On | 00000000:17:00.0 Off | 0 | 20 | | N/A 39C P0 57W / 300W | 0MiB / 81920MiB | 0% Default | 21 | | | | Disabled | 22 | +-----------------------------------------+------------------------+----------------------+ 23 | 24 | +-----------------------------------------------------------------------------------------+ 25 | | Processes: | 26 | | GPU GI CI PID Type Process name GPU Memory | 27 | | ID ID Usage | 28 | |=========================================================================================| 29 | | No running processes found | 30 | +-----------------------------------------------------------------------------------------+ 31 | ``` 32 | 33 | To see all of the available options, view the help: 34 | 35 | ```$ nvidia-smi --help``` 36 | 37 | Here is an an example that produces CSV output of various metrics: 38 | 39 | ``` 40 | $ nvidia-smi --query-gpu=timestamp,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5 41 | ``` 42 | 43 | The command above takes a reading every 5 seconds. 44 | 45 | # Nsight Systems (nsys) for Profiling 46 | 47 | The `nsys` command can be used to generate a timeline of the execution of your code. `nsys-ui` provides a GUI to examine the profiling data generated by `nsys`. See the NVIDIA Nsight Systems [getting started guide](https://docs.nvidia.com/nsight-systems/) and notes on [Summit](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#profiling-gpu-code-with-nvidia-developer-tools). 48 | 49 | To see the help menu: 50 | 51 | ``` 52 | $ /usr/local/bin/nsys --help 53 | $ /usr/local/bin/nsys --help profile 54 | ``` 55 | 56 | IMPORTANT: Do not run profiling jobs in your `/home` directory because large files are often written during these jobs which can exceed your quota. Instead launch jobs from `/scratch/gpfs/` where you have lots of space. 
Here's an example: 57 | 58 | ``` 59 | $ ssh @della-gpu.princeton.edu 60 | $ cd /scratch/gpfs/ 61 | $ mkdir myjob && cd myjob 62 | # prepare Slurm script 63 | $ sbatch job.slurm 64 | ``` 65 | 66 | Below is an example Slurm script: 67 | 68 | ``` 69 | #!/bin/bash 70 | #SBATCH --job-name=profile # create a short name for your job 71 | #SBATCH --nodes=1 # node count 72 | #SBATCH --ntasks=1 # total number of tasks across all nodes 73 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 74 | #SBATCH --mem=4G # total memory per node 75 | #SBATCH --gres=gpu:1 # number of gpus per node 76 | #SBATCH --time=00:10:00 # total run time limit (HH:MM:SS) 77 | 78 | module purge 79 | module load anaconda3/2024.10 80 | conda activate myenv 81 | 82 | /usr/local/bin/nsys profile --trace=cuda,nvtx,osrt -o myprofile_${SLURM_JOBID} python myscript.py 83 | ``` 84 | 85 | For an MPI code you should use: 86 | 87 | ``` 88 | srun --wait=0 /usr/local/bin/nsys profile --trace=cuda,nvtx,osrt,mpi -o myprofile_${SLURM_JOBID} ./my_mpi_exe 89 | ``` 90 | 91 | Run this command to see the summary statistics: 92 | 93 | ``` 94 | $ /usr/local/bin/nsys stats myprofile_*.nsys-rep 95 | ``` 96 | 97 | To work the the graphical interface (nsys-ui) you can either (1) download the `.qdrep` file to your local machine or (2) create a graphical desktop session on [https://mydella.princeton.edu](https://mydella.princeton.edu/) or [https://mystellar.princeton.edu](https://mystellar.princeton.edu/). To create the graphical desktop, choose "Interactive Apps" then "Desktop of Della/Stellar Vis Nodes". Once the session starts, click on the black terminal icon and then run: 98 | 99 | ``` 100 | $ /usr/local/bin/nsys-ui myprofile_*.nsys-rep 101 | ``` 102 | 103 | # Nsight Compute (ncu) for GPU Kernel Profiling 104 | 105 | The `ncu` command is used for detailed profiling of GPU kernels. See the NVIDIA [documentation](https://docs.nvidia.com/nsight-compute/). On some clusters you will need to load a module to make the command available: 106 | 107 | ``` 108 | $ module load cudatoolkit/12.9 109 | $ ncu --help 110 | ``` 111 | 112 | The idea is to use `ncu` for the profiling and `ncu-ui` for examining the data in a GUI. 113 | 114 | Below is a sample slurm script: 115 | 116 | ``` 117 | #!/bin/bash 118 | #SBATCH --job-name=profile # create a short name for your job 119 | #SBATCH --nodes=1 # node count 120 | #SBATCH --ntasks=1 # total number of tasks across all nodes 121 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 122 | #SBATCH --mem=4G # total memory per node 123 | #SBATCH --gres=gpu:1 # number of gpus per node 124 | #SBATCH --time=00:10:00 # total run time limit (HH:MM:SS) 125 | 126 | export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 127 | 128 | module purge 129 | module load cudatoolkit/12.9 130 | module load anaconda3/2024.10 131 | conda activate myenv 132 | 133 | ncu -o my_report_${SLURM_JOBID} python myscript.py 134 | ``` 135 | 136 | Note: the `ncu` profiler can significantly slow down the execution time of the code. 137 | 138 | To work the the graphical interface (ncu-ui) you can either (1) download the `.ncu-rep` file to your local machine or (2) create a graphical desktop session on [https://mydella.princeton.edu](https://mydella.princeton.edu/) or [https://mystellar.princeton.edu](https://mystellar.princeton.edu/). To create the graphical desktop, choose "Interactive Apps" then "Desktop of Della/Stellar Vis Nodes". 
Once the session starts, click on the black terminal icon and then run: 139 | 140 | ``` 141 | $ module load cudatoolkit/12.9 142 | $ ncu-ui my_report_*.ncu-rep 143 | ``` 144 | 145 | # line_profiler for Python Profiling 146 | 147 | The [line_profiler](https://researchcomputing.princeton.edu/python-profiling) tool provides profiling information for each line of a function. It is easy to use and works for Python codes that run on CPUs and/or GPUs. 148 | 149 | # nvcc 150 | 151 | This is the NVIDIA CUDA compiler. It is based on LLVM. To compile a simple code: 152 | 153 | ``` 154 | $ module load cudatoolkit/12.9 155 | $ nvcc -o hello_world hello_world.cu 156 | ``` 157 | 158 | # Job Statistics 159 | 160 | Follow [this procedure](https://researchcomputing.princeton.edu/support/knowledge-base/job-stats) to view detailed metrics for your Slurm jobs. This includes GPU utilization and memory as a function of time. 161 | 162 | # GPU Computing 163 | 164 | See [this page](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) for an overview of the hardware at Princeton as well as useful commands like `gpudash` and `shownodes`. 165 | 166 | # Debuggers 167 | 168 | ### ARM DDT 169 | 170 | The general directions for using the DDT debugger are [here](https://researchcomputing.princeton.edu/faq/debugging-with-ddt-on-the). The getting started guide is [here](https://developer.arm.com/tools-and-software/server-and-hpc/debug-and-profile/arm-forge/arm-ddt). 171 | 172 | ``` 173 | $ ssh -X @adroit.princeton.edu # better to use graphical desktop via myadroit 174 | $ git clone https://github.com/PrincetonUniversity/hpc_beginning_workshop 175 | $ cd hpc_beginning_workshop/RC_example_jobs/simple_gpu_kernel 176 | $ salloc -N 1 -n 1 -t 10:00 --gres=gpu:1 --x11 177 | $ module load cudatoolkit/12.9 178 | $ nvcc -g -G hello_world_gpu.cu 179 | $ module load ddt/24.1 180 | $ #export ALLINEA_FORCE_CUDA_VERSION=10.1 181 | $ ddt 182 | # check cuda, uncheck "submit to queue", and click on "Run" 183 | ``` 184 | 185 | The `-g` debugging flag is for CPU code while the `-G` flag is for GPU code. `-G` turns off compiler optimizations. 186 | 187 | If the graphics are not displaying fast enough then consider using [TurboVNC](https://researchcomputing.princeton.edu/faq/how-do-i-use-vnc-on-tigre). 188 | 189 | ### `cuda-gdb` 190 | 191 | `cuda-gdb` is a free debugger available as part of the CUDA Toolkit. 192 | -------------------------------------------------------------------------------- /05_cuda_libraries/README.md: -------------------------------------------------------------------------------- 1 | # GPU-Accelerated Libraries 2 | 3 | Let's say you have a CPU code and you are thinking about writing GPU kernels to accelerate the performance of the slow parts of the code. Before doing this, you should see if there are GPU libraries that already have implemented the routines that you need. This page presents an overview of the NVIDIA GPU-accelerated libraries. 4 | 5 | According to NVIDIA: "NVIDIA GPU-accelerated libraries provide highly-optimized functions that perform 2x-10x faster than CPU-only alternatives. Using drop-in interfaces, you can replace CPU-only libraries such as MKL, IPP and FFTW with GPU-accelerated versions with almost no code changes. The libraries can optimally scale your application across multiple GPUs."
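To make the "drop-in" idea concrete, here is a minimal sketch (the file name `axpy_cublas.cu` and the input values are made up for illustration, and error checking is omitted) in which a CPU BLAS `daxpy` call is replaced by `cublasDaxpy` so that y = alpha*x + y is computed on the GPU. After loading a `cudatoolkit` module it could be built with `nvcc -o axpy_cublas axpy_cublas.cu -lcublas`:

```
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  const int n = 5;
  const double alpha = 2.0;
  double x[n] = {1.0, 2.0, 3.0, 4.0, 5.0};
  double y[n] = {10.0, 20.0, 30.0, 40.0, 50.0};

  // copy the input vectors from the CPU (host) to the GPU (device)
  double *d_x, *d_y;
  cudaMalloc(&d_x, n * sizeof(double));
  cudaMalloc(&d_y, n * sizeof(double));
  cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);

  // y = alpha*x + y computed by cuBLAS (this call replaces a CPU daxpy)
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
  cublasDestroy(handle);

  // copy the result back to the CPU and print it
  cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) printf("%g\n", y[i]);

  cudaFree(d_x);
  cudaFree(d_y);
  return 0;
}
```

The structure is the usual one for library calls: copy data to the GPU, call the GPU-accelerated routine, and copy the result back.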
6 | 7 | ![NVIDIA-GPU-Libraries](https://tigress-web.princeton.edu/~jdh4/nv_libraries.jpeg) 8 | 9 | ### Selected libraries 10 | 11 | + **cuDNN** - GPU-accelerated library of primitives for deep neural networks 12 | + **cuBLAS** - GPU-accelerated standard BLAS library 13 | + **cuSPARSE** - GPU-accelerated BLAS for sparse matrices 14 | + **cuRAND** - GPU-accelerated random number generation (RNG) 15 | + **cuSOLVER** - Dense and sparse direct solvers for computer vision, CFD and other applications 16 | + **cuTENSOR** - GPU-accelerated tensor linear algebra library 17 | + **cuFFT** - GPU-accelerated library for Fast Fourier Transforms 18 | + **NPP** - GPU-accelerated image, video, and signal processing functions 19 | + **NCCL** - Collective Communications Library for scaling apps across multiple GPUs and nodes 20 | + **nvGRAPH** - GPU-accelerated library for graph analytics 21 | 22 | For the complete list see [GPU libraries](https://developer.nvidia.com/gpu-accelerated-libraries) by NVIDIA. 23 | 24 | ## Where to find the libraries 25 | 26 | Run the commands below to examine the libraries: 27 | 28 | ``` 29 | $ module show cudatoolkit/12.2 30 | $ ls -lL /usr/local/cuda-12.2/lib64/lib*.so 31 | ``` 32 | 33 | ## Example 34 | 35 | Make sure that you are on the `adroit5` login node: 36 | 37 | ``` 38 | $ hostname 39 | adroit5 40 | ``` 41 | 42 | Instead of computing the singular value decomposition (SVD) on the CPU, this example computes it on the GPU using `libcusolver`. First look over the source code: 43 | 44 | ``` 45 | $ cd gpu_programming_intro/05_cuda_libraries 46 | $ cat gesvdj_example.cpp | less # q to quit 47 | ``` 48 | 49 | The header file `cusolverDn.h` included by `gesvdj_example.cpp` contains the line `cuSolverDN : Dense Linear Algebra Library` providing information about its purpose. See the [cuSOLVER API](https://docs.nvidia.com/cuda/cusolver/index.html) for more. 50 | 51 | 52 | Next, compile and link the code as follows: 53 | 54 | ``` 55 | $ module load cudatoolkit/12.2 56 | $ g++ -o gesvdj_example gesvdj_example.cpp -lcudart -lcusolver 57 | ``` 58 | 59 | Run `ldd gesvdj_example` to check the linking against cuSOLVER (i.e., `libcusolver.so`). 60 | 61 | Submit the job to the scheduler with: 62 | 63 | ``` 64 | $ sbatch job.slurm 65 | ``` 66 | 67 | The output should appear as: 68 | 69 | ``` 70 | $ cat slurm-*.out 71 | 72 | example of gesvdj 73 | tol = 1.000000E-07, default value is machine zero 74 | max.
sweeps = 15, default value is 100 75 | econ = 0 76 | A = (matlab base-1) 77 | A(1,1) = 1.0000000000000000E+00 78 | A(1,2) = 2.0000000000000000E+00 79 | A(2,1) = 4.0000000000000000E+00 80 | A(2,2) = 5.0000000000000000E+00 81 | A(3,1) = 2.0000000000000000E+00 82 | A(3,2) = 1.0000000000000000E+00 83 | ===== 84 | gesvdj converges 85 | S = singular values (matlab base-1) 86 | S(1,1) = 7.0652834970827287E+00 87 | S(2,1) = 1.0400812977120775E+00 88 | ===== 89 | U = left singular vectors (matlab base-1) 90 | U(1,1) = 3.0821892063278472E-01 91 | U(1,2) = -4.8819507401989848E-01 92 | U(1,3) = 8.1649658092772659E-01 93 | U(2,1) = 9.0613333377729299E-01 94 | U(2,2) = -1.1070553170904460E-01 95 | U(2,3) = -4.0824829046386302E-01 96 | U(3,1) = 2.8969549251172333E-01 97 | U(3,2) = 8.6568461633075366E-01 98 | U(3,3) = 4.0824829046386224E-01 99 | ===== 100 | V = right singular vectors (matlab base-1) 101 | V(1,1) = 6.3863583713639760E-01 102 | V(1,2) = 7.6950910814953477E-01 103 | V(2,1) = 7.6950910814953477E-01 104 | V(2,2) = -6.3863583713639760E-01 105 | ===== 106 | |S - S_exact|_sup = 4.440892E-16 107 | residual |A - U*S*V**H|_F = 3.511066E-16 108 | number of executed sweeps = 1 109 | ``` 110 | 111 | ## NVIDIA CUDA Samples 112 | 113 | Run the following command to obtain a copy of the [NVIDIA CUDA Samples](https://github.com/NVIDIA/cuda-samples): 114 | 115 | ``` 116 | $ cd gpu_programming_intro 117 | $ git clone https://github.com/NVIDIA/cuda-samples.git 118 | $ cd cuda-samples/Samples 119 | ``` 120 | 121 | Then browse the directories: 122 | 123 | ``` 124 | $ ls -ltrh 125 | total 20K 126 | drwxr-xr-x. 55 jdh4 cses 4.0K Oct 9 18:23 0_Introduction 127 | drwxr-xr-x. 6 jdh4 cses 130 Oct 9 18:23 1_Utilities 128 | drwxr-xr-x. 36 jdh4 cses 4.0K Oct 9 18:23 2_Concepts_and_Techniques 129 | drwxr-xr-x. 25 jdh4 cses 4.0K Oct 9 18:23 3_CUDA_Features 130 | drwxr-xr-x. 40 jdh4 cses 4.0K Oct 9 18:23 4_CUDA_Libraries 131 | drwxr-xr-x. 52 jdh4 cses 4.0K Oct 9 18:23 5_Domain_Specific 132 | drwxr-xr-x. 5 jdh4 cses 105 Oct 9 18:23 6_Performance 133 | ``` 134 | 135 | Pick an example and then build and run it. For instance: 136 | 137 | ``` 138 | $ module load cudatoolkit/12.2 139 | $ cd 0_Introduction/matrixMul 140 | $ make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ # use 90 for H100 GPUs on Tiger and Della (PLI) 141 | ``` 142 | 143 | This will produce `matrixMul`. If you run the `ldd` command on `matrixMul` you will see that it does not link against `cublas.so`. Instead it uses a naive implementation of the routine which is surely not as efficient as the library implementation. 144 | 145 | ``` 146 | $ cp /gpu_programming_intro/05_cuda_libraries/matrixMul/job.slurm . 147 | ``` 148 | 149 | Submit the job: 150 | 151 | ``` 152 | $ sbatch job.slurm 153 | ``` 154 | 155 | See `4_CUDA_Libraries` for more examples. For instance, take a look at `4_CUDA_Libraries/matrixMulCUBLAS`. Does the resulting executable link against `libcublas.so`? 156 | 157 | ``` 158 | $ cd ../../4_CUDA_Libraries/matrixMulCUBLAS 159 | $ make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ 160 | $ ldd matrixMulCUBLAS 161 | ``` 162 | 163 | Similarly, does the code in `4_CUDA_Libraries/simpleCUFFT_MGPU` link against `libcufft.so`? 164 | 165 | To run code that uses the Tensor Cores see examples such as `3_CUDA_Features/bf16TensorCoreGemm`. That example uses the bfloat16 floating-point format. 166 | 167 | Note that some examples have dependencies that will not be satisfied so they will not build. This can be resolved if it relates to your research work. 
For instance, to build `5_Domain_Specific/nbody` use: 168 | 169 | ``` 170 | GLPATH=/lib64 make TARGET_ARCH=x86_64 SMS="80" HOST_COMPILER=g++ # use 90 for H100 GPUs on Tiger and Della (PLI) 171 | ``` 172 | 173 | Note that `nbody` will not run successfully on adroit since the GPU nodes do not have `libglut.so`. The library could be added if needed. One can compile and run this code on adroit-vis using `TARGET_ARCH=x86_64 SMS="80"`. 174 | -------------------------------------------------------------------------------- /05_cuda_libraries/gesvdj_example.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | * * How to compile (assume cuda is installed at /usr/local/cuda-10.1/) 3 | * * nvcc -c -I/usr/local/cuda-10.1/include gesvdj_example.cpp 4 | * * g++ -o gesvdj_example gesvdj_example.o -L/usr/local/cuda-10.1/lib64 -lcudart -lcusolver 5 | * */ 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | 13 | void printMatrix(int m, int n, const double*A, int lda, const char* name) 14 | { 15 | for(int row = 0 ; row < m ; row++){ 16 | for(int col = 0 ; col < n ; col++){ 17 | double Areg = A[row + col*lda]; 18 | printf("%s(%d,%d) = %20.16E\n", name, row+1, col+1, Areg); 19 | } 20 | } 21 | } 22 | 23 | int main(int argc, char*argv[]) 24 | { 25 | cusolverDnHandle_t cusolverH = NULL; 26 | cudaStream_t stream = NULL; 27 | gesvdjInfo_t gesvdj_params = NULL; 28 | 29 | cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; 30 | cudaError_t cudaStat1 = cudaSuccess; 31 | cudaError_t cudaStat2 = cudaSuccess; 32 | cudaError_t cudaStat3 = cudaSuccess; 33 | cudaError_t cudaStat4 = cudaSuccess; 34 | cudaError_t cudaStat5 = cudaSuccess; 35 | const int m = 3; 36 | const int n = 2; 37 | const int lda = m; 38 | /* | 1 2 | 39 | * * A = | 4 5 | 40 | * * | 2 1 | 41 | * */ 42 | double A[lda*n] = { 1.0, 4.0, 2.0, 2.0, 5.0, 1.0}; 43 | double U[lda*m]; /* m-by-m unitary matrix, left singular vectors */ 44 | double V[lda*n]; /* n-by-n unitary matrix, right singular vectors */ 45 | double S[n]; /* numerical singular value */ 46 | /* exact singular values */ 47 | double S_exact[n] = {7.065283497082729, 1.040081297712078}; 48 | double *d_A = NULL; /* device copy of A */ 49 | double *d_S = NULL; /* singular values */ 50 | double *d_U = NULL; /* left singular vectors */ 51 | double *d_V = NULL; /* right singular vectors */ 52 | int *d_info = NULL; /* error info */ 53 | int lwork = 0; /* size of workspace */ 54 | double *d_work = NULL; /* devie workspace for gesvdj */ 55 | int info = 0; /* host copy of error info */ 56 | 57 | /* configuration of gesvdj */ 58 | const double tol = 1.e-7; 59 | const int max_sweeps = 15; 60 | const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvectors. 61 | const int econ = 0 ; /* econ = 1 for economy size */ 62 | 63 | /* numerical results of gesvdj */ 64 | double residual = 0; 65 | int executed_sweeps = 0; 66 | 67 | printf("example of gesvdj \n"); 68 | printf("tol = %E, default value is machine zero \n", tol); 69 | printf("max. 
sweeps = %d, default value is 100\n", max_sweeps); 70 | printf("econ = %d \n", econ); 71 | 72 | printf("A = (matlab base-1)\n"); 73 | printMatrix(m, n, A, lda, "A"); 74 | printf("=====\n"); 75 | 76 | /* step 1: create cusolver handle, bind a stream */ 77 | status = cusolverDnCreate(&cusolverH); 78 | assert(CUSOLVER_STATUS_SUCCESS == status); 79 | 80 | cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); 81 | assert(cudaSuccess == cudaStat1); 82 | 83 | status = cusolverDnSetStream(cusolverH, stream); 84 | assert(CUSOLVER_STATUS_SUCCESS == status); 85 | 86 | /* step 2: configuration of gesvdj */ 87 | status = cusolverDnCreateGesvdjInfo(&gesvdj_params); 88 | assert(CUSOLVER_STATUS_SUCCESS == status); 89 | 90 | /* default value of tolerance is machine zero */ 91 | status = cusolverDnXgesvdjSetTolerance( 92 | gesvdj_params, 93 | tol); 94 | assert(CUSOLVER_STATUS_SUCCESS == status); 95 | 96 | /* default value of max. sweeps is 100 */ 97 | status = cusolverDnXgesvdjSetMaxSweeps( 98 | gesvdj_params, 99 | max_sweeps); 100 | assert(CUSOLVER_STATUS_SUCCESS == status); 101 | 102 | /* step 3: copy A and B to device */ 103 | cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double)*lda*n); 104 | cudaStat2 = cudaMalloc ((void**)&d_S , sizeof(double)*n); 105 | cudaStat3 = cudaMalloc ((void**)&d_U , sizeof(double)*lda*m); 106 | cudaStat4 = cudaMalloc ((void**)&d_V , sizeof(double)*lda*n); 107 | cudaStat5 = cudaMalloc ((void**)&d_info, sizeof(int)); 108 | assert(cudaSuccess == cudaStat1); 109 | assert(cudaSuccess == cudaStat2); 110 | assert(cudaSuccess == cudaStat3); 111 | assert(cudaSuccess == cudaStat4); 112 | assert(cudaSuccess == cudaStat5); 113 | 114 | cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n, cudaMemcpyHostToDevice); 115 | assert(cudaSuccess == cudaStat1); 116 | 117 | /* step 4: query workspace of SVD */ 118 | status = cusolverDnDgesvdj_bufferSize( 119 | cusolverH, 120 | jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */ 121 | /* CUSOLVER_EIG_MODE_VECTOR: compute singular value and singular vectors */ 122 | econ, /* econ = 1 for economy size */ 123 | m, /* nubmer of rows of A, 0 <= m */ 124 | n, /* number of columns of A, 0 <= n */ 125 | d_A, /* m-by-n */ 126 | lda, /* leading dimension of A */ 127 | d_S, /* min(m,n) */ 128 | /* the singular values in descending order */ 129 | d_U, /* m-by-m if econ = 0 */ 130 | /* m-by-min(m,n) if econ = 1 */ 131 | lda, /* leading dimension of U, ldu >= max(1,m) */ 132 | d_V, /* n-by-n if econ = 0 */ 133 | /* n-by-min(m,n) if econ = 1 */ 134 | lda, /* leading dimension of V, ldv >= max(1,n) */ 135 | &lwork, 136 | gesvdj_params); 137 | assert(CUSOLVER_STATUS_SUCCESS == status); 138 | 139 | cudaStat1 = cudaMalloc((void**)&d_work , sizeof(double)*lwork); 140 | assert(cudaSuccess == cudaStat1); 141 | 142 | /* step 5: compute SVD */ 143 | status = cusolverDnDgesvdj( 144 | cusolverH, 145 | jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */ 146 | /* CUSOLVER_EIG_MODE_VECTOR: compute singular value and singular vectors */ 147 | econ, /* econ = 1 for economy size */ 148 | m, /* nubmer of rows of A, 0 <= m */ 149 | n, /* number of columns of A, 0 <= n */ 150 | d_A, /* m-by-n */ 151 | lda, /* leading dimension of A */ 152 | d_S, /* min(m,n) */ 153 | /* the singular values in descending order */ 154 | d_U, /* m-by-m if econ = 0 */ 155 | /* m-by-min(m,n) if econ = 1 */ 156 | lda, /* leading dimension of U, ldu >= max(1,m) */ 157 | d_V, /* n-by-n if econ = 0 */ 158 | /* n-by-min(m,n) if econ = 1 */ 159 | lda, /* leading 
dimension of V, ldv >= max(1,n) */ 160 | d_work, 161 | lwork, 162 | d_info, 163 | gesvdj_params); 164 | cudaStat1 = cudaDeviceSynchronize(); 165 | assert(CUSOLVER_STATUS_SUCCESS == status); 166 | assert(cudaSuccess == cudaStat1); 167 | 168 | cudaStat1 = cudaMemcpy(U, d_U, sizeof(double)*lda*m, cudaMemcpyDeviceToHost); 169 | cudaStat2 = cudaMemcpy(V, d_V, sizeof(double)*lda*n, cudaMemcpyDeviceToHost); 170 | cudaStat3 = cudaMemcpy(S, d_S, sizeof(double)*n , cudaMemcpyDeviceToHost); 171 | cudaStat4 = cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost); 172 | cudaStat5 = cudaDeviceSynchronize(); 173 | assert(cudaSuccess == cudaStat1); 174 | assert(cudaSuccess == cudaStat2); 175 | assert(cudaSuccess == cudaStat3); 176 | assert(cudaSuccess == cudaStat4); 177 | assert(cudaSuccess == cudaStat5); 178 | 179 | if ( 0 == info ){ 180 | printf("gesvdj converges \n"); 181 | }else if ( 0 > info ){ 182 | printf("%d-th parameter is wrong \n", -info); 183 | exit(1); 184 | }else{ 185 | printf("WARNING: info = %d : gesvdj does not converge \n", info ); 186 | } 187 | 188 | printf("S = singular values (matlab base-1)\n"); 189 | printMatrix(n, 1, S, lda, "S"); 190 | printf("=====\n"); 191 | 192 | printf("U = left singular vectors (matlab base-1)\n"); 193 | printMatrix(m, m, U, lda, "U"); 194 | printf("=====\n"); 195 | 196 | printf("V = right singular vectors (matlab base-1)\n"); 197 | printMatrix(n, n, V, lda, "V"); 198 | printf("=====\n"); 199 | 200 | /* step 6: measure error of singular value */ 201 | double ds_sup = 0; 202 | for(int j = 0; j < n; j++){ 203 | double err = fabs( S[j] - S_exact[j] ); 204 | ds_sup = (ds_sup > err)? ds_sup : err; 205 | } 206 | printf("|S - S_exact|_sup = %E \n", ds_sup); 207 | 208 | status = cusolverDnXgesvdjGetSweeps( 209 | cusolverH, 210 | gesvdj_params, 211 | &executed_sweeps); 212 | assert(CUSOLVER_STATUS_SUCCESS == status); 213 | 214 | status = cusolverDnXgesvdjGetResidual( 215 | cusolverH, 216 | gesvdj_params, 217 | &residual); 218 | assert(CUSOLVER_STATUS_SUCCESS == status); 219 | 220 | printf("residual |A - U*S*V**H|_F = %E \n", residual ); 221 | printf("number of executed sweeps = %d \n", executed_sweeps ); 222 | 223 | /* free resources */ 224 | if (d_A ) cudaFree(d_A); 225 | if (d_S ) cudaFree(d_S); 226 | if (d_U ) cudaFree(d_U); 227 | if (d_V ) cudaFree(d_V); 228 | if (d_info) cudaFree(d_info); 229 | if (d_work ) cudaFree(d_work); 230 | 231 | if (cusolverH) cusolverDnDestroy(cusolverH); 232 | if (stream ) cudaStreamDestroy(stream); 233 | if (gesvdj_params) cusolverDnDestroyGesvdjInfo(gesvdj_params); 234 | 235 | cudaDeviceReset(); 236 | return 0; 237 | } 238 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/README.md: -------------------------------------------------------------------------------- 1 | # Building a Simple GPU Library 2 | 3 | In this exercise we will construct a "hello world" GPU library called `cumessage` and then link and run a code against it. 4 | 5 | ### Create the GPU Library 6 | 7 | Inspect the files that compose the GPU library: 8 | 9 | ```bash 10 | $ cd 05_cuda_libraries/hello_world_gpu_library 11 | $ cat cumessage.h 12 | $ cat cumessage.cu 13 | ``` 14 | 15 | `cumessage.h` is the header file. It contains the signature or protocol of one function. That is, the name and the input/output types are specified but the function body is not implemented here. The implementation is done in `cumessage.cu`. There is some CUDA code in that file. 
It will be explained in `06_cuda_kernels`. 16 | 17 | Libraries are standalone. That is, there is nothing at present waiting to use our library. We will simply create it and then write a code that can use it. Create the library by running the following commands: 18 | 19 | ```bash 20 | $ module load cudatoolkit/11.7 21 | $ nvcc -Xcompiler -fPIC -o libcumessage.so -shared cumessage.cu 22 | $ ls -ltr 23 | ``` 24 | 25 | This will produce `libcumessage.so` which is a GPU library with a single function. Add the option "-v" to the line beginning with `nvcc` above to see more details. You will see that `gcc` is being called. 26 | 27 | ### Use the GPU Library 28 | 29 | Take a look at our simple code in `myapp.cu` that will use our GPU library: 30 | 31 | ```bash 32 | $ cat myapp.cu 33 | ``` 34 | 35 | Once again, note that `myapp.cu` only needs to know about the inputs and outputs of `GPUfunction` through the header file. Nothing is known to `myapp.cu` about how that function is implemented. 36 | 37 | Compile the main routine against our GPU library: 38 | 39 | ``` 40 | $ nvcc -I. -o myapp myapp.cu -L. -lcudart -lcumessage 41 | $ ls -ltr 42 | ``` 43 | 44 | This will produce `myapp` which is a GPU application that links against our GPU library `libcumessage.so`: 45 | 46 | ``` 47 | $ env LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ldd myapp 48 | linux-vdso.so.1 (0x00007fffdaf61000) 49 | libcumessage.so => ./libcumessage.so (0x000014d68450a000) 50 | libcudart.so.11.0 => /usr/local/cuda-11.4/lib64/libcudart.so.11.0 (0x000014d684268000) 51 | librt.so.1 => /lib64/librt.so.1 (0x000014d684060000) 52 | libpthread.so.0 => /lib64/libpthread.so.0 (0x000014d683e40000) 53 | libdl.so.2 => /lib64/libdl.so.2 (0x000014d683c3c000) 54 | libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014d6838a7000) 55 | libm.so.6 => /lib64/libm.so.6 (0x000014d683525000) 56 | libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014d68330d000) 57 | libc.so.6 => /lib64/libc.so.6 (0x000014d682f48000) 58 | /lib64/ld-linux-x86-64.so.2 (0x000014d6847a9000) 59 | ``` 60 | Finally, submit the job and inspect the output: 61 | 62 | ``` 63 | $ sbatch job.slurm 64 | $ cat slurm-*.out 65 | Hello world from the CPU. 66 | Hello world from the GPU. 
67 | ``` 68 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/cumessage.cu: -------------------------------------------------------------------------------- 1 | #include 2 | #include "cumessage.h" 3 | 4 | __global__ void GPUFunction_kernel() { 5 | printf("Hello world from the GPU.\n"); 6 | } 7 | 8 | void GPUFunction() { 9 | GPUFunction_kernel<<<1,1>>>(); 10 | 11 | // kernel execution is asynchronous so sync on its completion 12 | cudaDeviceSynchronize(); 13 | } 14 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/cumessage.h: -------------------------------------------------------------------------------- 1 | void GPUFunction(); 2 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=gpu-lib # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G per cpu-core is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:01:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | module purge 12 | module load cudatoolkit/11.7 13 | export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH 14 | 15 | ./myapp 16 | -------------------------------------------------------------------------------- /05_cuda_libraries/hello_world_gpu_library/myapp.cu: -------------------------------------------------------------------------------- 1 | #include 2 | #include "cumessage.h" 3 | 4 | void CPUFunction() { 5 | printf("Hello world from the CPU.\n"); 6 | } 7 | 8 | int main() { 9 | // function to run on the cpu 10 | CPUFunction(); 11 | 12 | // function to run on the gpu 13 | GPUFunction(); 14 | 15 | return 0; 16 | } 17 | -------------------------------------------------------------------------------- /05_cuda_libraries/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cuda-libs # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose gpu80, a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load cudatoolkit/12.2 14 | 15 | ./gesvdj_example 16 | -------------------------------------------------------------------------------- /05_cuda_libraries/matrixMul/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cuda-libs # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default) 
7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --constraint=a100 # choose a100 or v100 10 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 11 | 12 | module purge 13 | module load cudatoolkit/12.2 14 | 15 | ./matrixMul 16 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/README.md: -------------------------------------------------------------------------------- 1 | # Hello World 2 | 3 | On this page we consider the simplest CPU C code and the simplest CUDA C GPU code. 4 | 5 | ## CPU 6 | 7 | A simple CPU-only code: 8 | 9 | ```C 10 | #include <stdio.h> 11 | 12 | void CPUFunction() { 13 | printf("Hello world from the CPU.\n"); 14 | } 15 | 16 | int main() { 17 | // function to run on the cpu 18 | CPUFunction(); 19 | } 20 | ``` 21 | 22 | This can be compiled and run with: 23 | 24 | ``` 25 | $ cd gpu_programming_intro/06_cuda_kernels/01_hello_world 26 | $ gcc -o hello_world hello_world.c 27 | $ ./hello_world 28 | ``` 29 | 30 | The output is 31 | 32 | ``` 33 | Hello world from the CPU. 34 | ``` 35 | 36 | ## GPU 37 | 38 | Below is a simple GPU code that calls a CPU function followed by a GPU function: 39 | 40 | ```C 41 | #include <stdio.h> 42 | 43 | void CPUFunction() { 44 | printf("Hello world from the CPU.\n"); 45 | } 46 | 47 | __global__ void GPUFunction() { 48 | printf("Hello world from the GPU.\n"); 49 | } 50 | 51 | int main() { 52 | // function to run on the cpu 53 | CPUFunction(); 54 | 55 | // function to run on the gpu 56 | GPUFunction<<<1, 1>>>(); 57 | 58 | // kernel execution is asynchronous so sync on its completion 59 | cudaDeviceSynchronize(); 60 | } 61 | ``` 62 | 63 | The GPU code above can be compiled and executed with: 64 | 65 | ``` 66 | $ module load cudatoolkit/12.2 67 | $ nvcc -o hello_world_gpu hello_world_gpu.cu 68 | $ sbatch job.slurm 69 | ``` 70 | 71 | The output should be: 72 | 73 | ``` 74 | $ cat slurm-*.out 75 | Hello world from the CPU. 76 | Hello world from the GPU. 77 | ``` 78 | 79 | `nvcc` is the NVIDIA CUDA Compiler. It compiles the GPU code itself and uses GNU `gcc` to compile the CPU code. CUDA provides extensions for many common programming languages (e.g., C/C++/Fortran). These language extensions allow developers to write GPU functions. 80 | 81 | From this simple example we learn that GPU functions are declared with `__global__`, which is a CUDA C/C++ keyword. The triple angle brackets or so-called "triple chevron" is used to specify the execution configuration of the kernel launch, which is a call from host code to device code. 82 | 83 | Here is the general form for the execution configuration: `<<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>`. In the example above we used 1 block and 1 thread per block. At a high level, the execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks), as well as how many threads to execute in each block. 84 | 85 | Notice the return type of `void` for GPUFunction. It is required that GPU functions defined with the `__global__` keyword return type `void`. 86 | 87 | ### Exercises 88 | 89 | 1. What happens if you remove `__global__`? 90 | 91 | 2. Can you rewrite the code so that the output is: 92 | 93 | ``` 94 | Hello world from the CPU. 95 | Hello world from the GPU. 96 | Hello world from the CPU. 97 | ``` 98 | 99 | 3. What happens if you comment out the `cudaDeviceSynchronize()` line by preceding it with `//`?
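When experimenting with the exercises above, it can help to inspect the error codes returned by the CUDA runtime. Below is a small sketch (not part of the exercise files) showing how `cudaGetLastError()` and the return value of `cudaDeviceSynchronize()` can be used to confirm that the kernel launched and ran:

```C
#include <stdio.h>

__global__ void GPUFunction() {
  printf("Hello world from the GPU.\n");
}

int main() {
  GPUFunction<<<1, 1>>>();

  // cudaGetLastError() reports problems with the launch itself (e.g., a bad configuration)
  cudaError_t launch_err = cudaGetLastError();
  if (launch_err != cudaSuccess)
    printf("Launch error: %s\n", cudaGetErrorString(launch_err));

  // cudaDeviceSynchronize() waits for the kernel and reports errors that occur while it runs
  cudaError_t sync_err = cudaDeviceSynchronize();
  if (sync_err != cudaSuccess)
    printf("Execution error: %s\n", cudaGetErrorString(sync_err));

  return 0;
}
```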
100 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/hello_world.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | int main() { 8 | // function to run on the cpu 9 | CPUFunction(); 10 | } 11 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/hello_world_gpu.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | __global__ void GPUFunction() { 8 | printf("Hello world from the GPU.\n"); 9 | } 10 | 11 | int main() { 12 | // function to run on the cpu 13 | CPUFunction(); 14 | 15 | // function to run on the gpu 16 | GPUFunction<<<1, 1>>>(); 17 | 18 | // kernel execution is asynchronous so sync on its completion 19 | cudaDeviceSynchronize(); 20 | } 21 | -------------------------------------------------------------------------------- /06_cuda_kernels/01_hello_world/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=hw-gpu # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./hello_world_gpu 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/README.md: -------------------------------------------------------------------------------- 1 | # Launching Parallel Kernels 2 | 3 | The execution configuration allows programmers to specify details about launching the kernel to run in parallel on multiple GPU threads. More precisely, the execution configuration allows programmers to specify how many groups of threads (called thread blocks) and how many threads they would like each thread block to contain. The syntax for this is: 4 | 5 | ``` 6 | <<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>> 7 | ``` 8 | 9 | The kernel code is executed by every thread in every thread block configured when the kernel is launched. The image below corresponds to `<<<1, 5>>>`: 10 | 11 | ![thread-block](https://miro.medium.com/max/1118/1*e_FAITzOXSearSZYNWnmKQ.png) 12 | 13 | 14 | ## CPU Code 15 | 16 | ```c 17 | #include <stdio.h> 18 | 19 | void firstParallel() 20 | { 21 | printf("This should be running in parallel.\n"); 22 | } 23 | 24 | int main() 25 | { 26 | firstParallel(); 27 | } 28 | ``` 29 | 30 | ## Exercise: GPU implementation 31 | 32 | ``` 33 | # rewrite the CPU code above so that it runs on a GPU using multiple threads 34 | # save your file as first_parallel.cu (a starting file by this name is given -- see below) 35 | ``` 36 | 37 | The objective is to write a GPU code with one kernel launch that produces the following 6 lines of output: 38 | 39 | ``` 40 | This should be running in parallel. 41 | This should be running in parallel. 42 | This should be running in parallel. 43 | This should be running in parallel. 44 | This should be running in parallel. 45 | This should be running in parallel.
46 | ``` 47 | 48 | To get started: 49 | 50 | ``` 51 | $ cd gpu_programming_intro/06_cuda_kernels/02_simple_kernel 52 | # edit first_parallel.cu (use a text editor of your choice) 53 | $ nvcc -o first_parallel first_parallel.cu 54 | $ sbatch job.slurm 55 | ``` 56 | 57 | There are multiple possible solutions. 58 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/first_parallel.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | __global__ void GPUFunction() { 8 | printf("Hello world from the GPU.\n"); 9 | } 10 | 11 | int main() { 12 | // function to run on the cpu 13 | CPUFunction(); 14 | 15 | // function to run on the gpu 16 | GPUFunction<<<1, 1>>>(); 17 | 18 | // kernel execution is asynchronous so sync on its completion 19 | cudaDeviceSynchronize(); 20 | } 21 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=serial_c # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./first_parallel 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/02_simple_kernel/solution.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | __global__ void firstParallel() 4 | { 5 | printf("This is running in parallel.\n"); 6 | } 7 | 8 | int main() 9 | { 10 | firstParallel<<<2, 3>>>(); 11 | cudaDeviceSynchronize(); 12 | } 13 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/README.md: -------------------------------------------------------------------------------- 1 | # Built-in Thread and Block Indices 2 | 3 | Each thread is given an index within its thread block, starting at 0. Additionally, each block is given an index, starting at 0. Threads are grouped into thread blocks, and thread blocks are grouped into grids, the highest entity in the CUDA thread hierarchy. (On recent GPUs such as the H100, thread blocks within a grid can optionally be grouped into thread block clusters.) 4 | 5 | ![intrinsic-indices](https://devblogs.nvidia.com/wp-content/uploads/2017/01/cuda_indexing.png) 6 | 7 | CUDA kernels have access to special variables identifying both the index of the thread (within the block) that is executing the kernel and the index of the block (within the grid) that the thread is in. These variables are `threadIdx.x` and `blockIdx.x` respectively.
Below is an example use of `threadIdx.x`: 8 | 9 | ```C 10 | __global__ void GPUFunction() { 11 | printf("My thread index is: %d\n", threadIdx.x); 12 | } 13 | ``` 14 | 15 | ## CPU implementation of a for loop 16 | 17 | ```C 18 | #include <stdio.h> 19 | 20 | void printLoopIndex() { 21 | int N = 100; 22 | for (int i = 0; i < N; ++i) 23 | printf("%d\n", i); 24 | } 25 | 26 | int main() { 27 | // function to run on the cpu 28 | printLoopIndex(); 29 | } 30 | ``` 31 | 32 | Run the CPU code above by following these commands: 33 | 34 | ```bash 35 | $ cd gpu_programming_intro/06_cuda_kernels/03_thread_indices 36 | $ nvcc -o for_loop for_loop.c 37 | $ ./for_loop 38 | ``` 39 | 40 | The output of the above is 41 | 42 | ``` 43 | 0 44 | 1 45 | 2 46 | ... 47 | 97 48 | 98 49 | 99 50 | ``` 51 | 52 | ## Exercise: GPU implementation 53 | 54 | In the CPU code above, the loop is carried out in serial. That is, loop iterations take place one at a time. Can you write a GPU code that produces the same output as that above but does so in parallel using a CUDA kernel? 55 | 56 | ``` 57 | // write a GPU kernel to produce the output above 58 | ``` 59 | 60 | To get started: 61 | 62 | ```bash 63 | $ module load cudatoolkit/12.2 64 | # edit for_loop.cu 65 | $ nvcc -o for_loop for_loop.cu 66 | $ sbatch job.slurm 67 | ``` 68 | 69 | Click [here](hint.md) to see some hints. 70 | 71 | One possible solution is [here](solution.cu) (try for yourself first). 72 | 73 | Are you seeing any behavior which is a multiple of 32 in this exercise? For NVIDIA, the threads within a thread block are organized into "warps". A "warp" is composed of 32 threads. [Read more](http://15418.courses.cs.cmu.edu/spring2013/article/15) about how `printf` works in CUDA. 74 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/for_loop.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void printLoopIndex() { 4 | int i; 5 | int N = 100; 6 | for (i = 0; i < N; ++i) 7 | printf("%d\n", i); 8 | } 9 | 10 | int main() { 11 | // function to run on the cpu 12 | printLoopIndex(); 13 | } 14 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/for_loop.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void printLoopIndex() { 4 | int N = 100; 5 | for (int i = 0; i < N; ++i) 6 | printf("%d\n", i); 7 | } 8 | 9 | int main() { 10 | // function to run on the cpu 11 | printLoopIndex(); 12 | } 13 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/hint.md: -------------------------------------------------------------------------------- 1 | ## Hints 2 | 3 | To understand how to do this exercise, take a look at the code below which uses `threadIdx.x`: 4 | 5 | ```C 6 | #include <stdio.h> 7 | 8 | __global__ void GPUFunction() { 9 | printf("My thread index is: %d\n", threadIdx.x); 10 | } 11 | 12 | int main() { 13 | GPUFunction<<<1, 1>>>(); 14 | cudaDeviceSynchronize(); 15 | } 16 | ``` 17 | 18 | The output of the code above is 19 | 20 | ``` 21 | My thread index is: 0 22 | ``` 23 | 24 | We need to replace the i variable in the CPU code. In a CUDA kernel, each thread has an index 25 | associated with it called `threadIdx.x`. So use that as the substitution for i.
26 | 27 | Next, to generate 100 threads, try a kernel launch like this: `<<<1, 100>>>` 28 | 29 | The above will give you 1 block composed of 100 threads. 30 | 31 | Be sure to add `__global__` to your GPU function and don't forget to call `cudaDeviceSynchronize()`. 32 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=for_loop # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./for_loop 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/03_thread_indices/solution.cu: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | __global__ void printLoopIndex() { 4 | printf("%d\n", threadIdx.x); 5 | } 6 | 7 | int main() { 8 | printLoopIndex<<<1, 100>>>(); 9 | cudaDeviceSynchronize(); 10 | } 11 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/README.md: -------------------------------------------------------------------------------- 1 | # Elementwise Vector Addition 2 | 3 | ## A Word on Allocating Memory 4 | 5 | Here is an example on the CPU where 10 integers are dynamically allocated and the last line frees the memory: 6 | 7 | ```C 8 | int N = 10; 9 | size_t size = N * sizeof(int); 10 | 11 | int *a; 12 | a = (int*)malloc(size); 13 | free(a); 14 | ``` 15 | 16 | On the GPU: 17 | 18 | ```C 19 | int N = 10; 20 | size_t size = N * sizeof(int); 21 | 22 | int *d_a; 23 | cudaMalloc(&d_a, size); 24 | cudaFree(d_a); 25 | ``` 26 | Note that we write `d_a` for the GPU case instead of `a` to remind ourselves that we are allocating memory on the "device" or GPU. Sometimes developers will prefix CPU variables with 'h' to denote "host". 27 | 28 | ![add-arrays](https://www3.ntu.edu.sg/home/ehchua/programming/cpp/images/Array.png) 29 | 30 | The vectors `a` and `b` are added elementwise to produce the vector `c`: 31 | 32 | ``` 33 | c[0] = a[0] + b[0] 34 | c[1] = a[1] + b[1] 35 | ... 
36 | c[N-1] = a[N-1] + b[N-1] 37 | ``` 38 | 39 | ## CPU 40 | 41 | The following code adds two vectors together on a CPU: 42 | 43 | ```C 44 | #include <stdio.h> 45 | #include <stdlib.h> 46 | #include <math.h> 47 | #include "timer.h" 48 | 49 | void vecAdd(double *a, double *b, double *c, int n) 50 | { 51 | int i; 52 | for (i = 0; i < n; i++) { 53 | c[i] = a[i] + b[i]; 54 | } 55 | } 56 | 57 | int main(int argc, char* argv[]) 58 | { 59 | // Size of vectors 60 | int n = 2000; 61 | 62 | // Host input vectors 63 | double *h_a; 64 | double *h_b; 65 | //Host output vector 66 | double *h_c; 67 | 68 | // Size, in bytes, of each vector 69 | size_t bytes = n*sizeof(double); 70 | 71 | // Allocate memory for each vector on host 72 | h_a = (double*)malloc(bytes); 73 | h_b = (double*)malloc(bytes); 74 | h_c = (double*)malloc(bytes); 75 | 76 | int i; 77 | // Initialize vectors on host 78 | for (i = 0; i < n; i++) { 79 | h_a[i] = sin(i)*sin(i); 80 | h_b[i] = cos(i)*cos(i); 81 | } 82 | 83 | // add the two vectors 84 | vecAdd(h_a, h_b, h_c, n); 85 | 86 | // Release host memory 87 | free(h_a); 88 | free(h_b); 89 | free(h_c); 90 | 91 | return 0; 92 | } 93 | ``` 94 | 95 | Take a look at `vector_add_cpu.c`. You will see that it allocates three arrays of size `n` and then fills `a` and `b` with values. The `vecAdd` function is then called to perform the elementwise addition of the two arrays producing a third array `c`: 96 | 97 | ```C 98 | void vecAdd(double *a, double *b, double *c, int n) { 99 | int i; 100 | for (i = 0; i < n; i++) { 101 | c[i] = a[i] + b[i]; 102 | } 103 | } 104 | ``` 105 | 106 | 107 | The output reports the time taken to perform the addition ignoring the memory allocation and initialization. Build and run the code: 108 | 109 | ``` 110 | $ cd gpu_programming_intro/06_cuda_kernels/04_vector_addition 111 | $ gcc -O3 -march=native -o vector_add_cpu vector_add_cpu.c -lm 112 | $ ./vector_add_cpu 113 | ``` 114 | 115 | ## GPU 116 | 117 | The following code adds two vectors together on a GPU: 118 | 119 | ```C 120 | #include <stdio.h> 121 | #include <stdlib.h> 122 | #include <math.h> 123 | #include "timer.h" 124 | 125 | // each thread is responsible for one element of c 126 | __global__ void vecAdd(double *a, double *b, double *c, int n) 127 | { 128 | // Get our global thread ID 129 | int id = blockIdx.x * blockDim.x + threadIdx.x; 130 | int stride = gridDim.x * blockDim.x; 131 | 132 | // Make sure we do not go out of bounds 133 | int i; 134 | for (i = id; i < n; i += stride) 135 | c[i] = a[i] + b[i]; 136 | } 137 | 138 | int main(int argc, char* argv[]) 139 | { 140 | // Size of vectors 141 | int n = 2000; 142 | 143 | // Host input vectors 144 | double *h_a; 145 | double *h_b; 146 | //Host output vector 147 | double *h_c; 148 | 149 | // Device input vectors 150 | double *d_a; 151 | double *d_b; 152 | //Device output vector 153 | double *d_c; 154 | 155 | // Size, in bytes, of each vector 156 | size_t bytes = n*sizeof(double); 157 | 158 | // Allocate memory for each vector on host 159 | h_a = (double*)malloc(bytes); 160 | h_b = (double*)malloc(bytes); 161 | h_c = (double*)malloc(bytes); 162 | 163 | int i; 164 | // Initialize vectors on host 165 | for (i = 0; i < n; i++) { 166 | h_a[i] = sin(i)*sin(i); 167 | h_b[i] = cos(i)*cos(i); 168 | } 169 | 170 | // Allocate memory for each vector on GPU 171 | cudaMalloc(&d_a, bytes); 172 | cudaMalloc(&d_b, bytes); 173 | cudaMalloc(&d_c, bytes); 174 | 175 | // Copy host vectors to device 176 | cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); 177 | cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice); 178 | 179 | int blockSize,
gridSize; 180 | 181 | // Number of threads in each thread block 182 | blockSize = 1024; 183 | 184 | // Number of thread blocks in grid 185 | gridSize = (int)ceil((double)n/blockSize); 186 | if (gridSize > 65535) gridSize = 32000; 187 | // Execute the kernel 188 | vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); 189 | 190 | // Copy array back to host 191 | cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); 192 | 193 | // Release device memory 194 | cudaFree(d_a); 195 | cudaFree(d_b); 196 | cudaFree(d_c); 197 | 198 | cudaDeviceSynchronize(); 199 | 200 | // Release host memory 201 | free(h_a); 202 | free(h_b); 203 | free(h_c); 204 | 205 | return 0; 206 | } 207 | ``` 208 | 209 | The `vecAdd` function has been replaced with a CUDA kernel: 210 | 211 | ```C 212 | __global__ void vecAdd(double *a, double *b, double *c, int n) 213 | { 214 | // Get our global thread ID 215 | int id = blockIdx.x * blockDim.x + threadIdx.x; 216 | int stride = gridDim.x * blockDim.x; 217 | 218 | // Make sure we do not go out of bounds 219 | int i; 220 | for (i = id; i < n; i += stride) 221 | c[i] = a[i] + b[i]; 222 | } 223 | ``` 224 | 225 | The kernel uses special variables which are CUDA extensions to allow threads to distinguish themselves and operate on different data. Specifically, `blockIdx.x` is the block index within a grid, `blockDim.x` is the number of threads per block and `threadIdx.x` is the thread index within a block. Let's build and run the code. The `nvcc` compiler will compile the kernel function while `gcc` will be used in the background to compile the CPU code. 226 | 227 | ``` 228 | $ module load cudatoolkit/12.2 229 | $ nvcc -O3 -arch=sm_80 -o vector_add_gpu vector_add_gpu.cu # use sm_70 on Traverse or an Adroit V100 node 230 | $ sbatch job.slurm 231 | ``` 232 | 233 | The output of the code will be something like: 234 | ``` 235 | Allocating CPU memory and populating arrays of length 2000 ... done. 236 | GridSize 2 and total_threads 2048 237 | Performing vector addition (timer started) ... done in 0.09 s. 238 | ``` 239 | 240 | Note that the reported time includes more than just the addition on the GPU. It also includes the time required to allocate and deallocate memory on the GPU and the time required to move the data to and from the GPU. 241 | 242 | To use a GPU effectively, the problem you are solving must have a vast amount of data parallelism and a sufficiently large overall amount of computation. In the example here the parallelism is high (one can assign a different thread to each of the individual elements) but the overall amount of computation is low, so the CPU wins out in performance. Contrast this with a large matrix-matrix multiply where both conditions are satisfied and the GPU wins. For problems involving recursion, sorting or small amounts of data, it becomes difficult to take advantage of a GPU. 243 | 244 | ## Advanced Examples 245 | 246 | For more advanced examples return to the NVIDIA CUDA samples at the bottom of [this page](https://github.com/PrincetonUniversity/gpu_programming_intro/tree/master/05_cuda_libraries#nvidia-cuda-samples).
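The timing discussion above lumps the kernel together with device allocation and host-device transfers. As a rough, illustrative sketch (not part of `vector_add_gpu.cu`), one way to time only the kernel is with CUDA events; it assumes the `vecAdd` kernel, `gridSize`, `blockSize`, `n` and the device arrays `d_a`, `d_b`, `d_c` are defined as in the code above:

```C
// Illustrative sketch: time only the kernel with CUDA events.
// Assumes d_a, d_b, d_c are already allocated and populated on the device.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Timed this way, the kernel alone is typically far faster than the end-to-end number reported via `timer.h`, which is another way of seeing that data movement dominates for such a small problem.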
247 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=vec-add # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default) 7 | #SBATCH --gres=gpu:1 # number of gpus per node 8 | #SBATCH --time=00:00:30 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | ./vector_add_gpu 12 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/timer.h: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2012 NVIDIA Corporation 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 15 | */ 16 | 17 | #ifndef TIMER_H 18 | #define TIMER_H 19 | 20 | #include <stdio.h> 21 | 22 | #ifdef WIN32 23 | #define WIN32_LEAN_AND_MEAN 24 | #include <windows.h> 25 | #else 26 | #include <sys/time.h> 27 | #endif 28 | 29 | #ifdef WIN32 30 | double PCFreq = 0.0; 31 | __int64 timerStart = 0; 32 | #else 33 | struct timeval timerStart; 34 | #endif 35 | 36 | void StartTimer() 37 | { 38 | #ifdef WIN32 39 | LARGE_INTEGER li; 40 | if(!QueryPerformanceFrequency(&li)) 41 | printf("QueryPerformanceFrequency failed!\n"); 42 | 43 | PCFreq = (double)li.QuadPart/1000.0; 44 | 45 | QueryPerformanceCounter(&li); 46 | timerStart = li.QuadPart; 47 | #else 48 | gettimeofday(&timerStart, NULL); 49 | #endif 50 | } 51 | 52 | // time elapsed in ms 53 | double GetTimer() 54 | { 55 | #ifdef WIN32 56 | LARGE_INTEGER li; 57 | QueryPerformanceCounter(&li); 58 | return (double)(li.QuadPart-timerStart)/PCFreq; 59 | #else 60 | struct timeval timerStop, timerElapsed; 61 | gettimeofday(&timerStop, NULL); 62 | timersub(&timerStop, &timerStart, &timerElapsed); 63 | return timerElapsed.tv_sec*1000.0+timerElapsed.tv_usec/1000.0; 64 | #endif 65 | } 66 | 67 | #endif // TIMER_H 68 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/vector_add_cpu.c: -------------------------------------------------------------------------------- 1 | /* CPU VERSION */ 2 | 3 | // modified from https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/ 4 | 5 | #include <stdio.h> 6 | #include <stdlib.h> 7 | #include <math.h> 8 | #include "timer.h" 9 | 10 | void vecAdd(double *a, double *b, double *c, int n) 11 | { 12 | int i; 13 | for(i = 0; i < n; i++) { 14 | c[i] = a[i] + b[i]; 15 | } 16 | } 17 | 18 | int main( int argc, char* argv[] ) 19 | { 20 | // Size of vectors 21 | int n = 2000; 22 | 23 | // Host input vectors 24 | double *h_a; 25 | double *h_b; 26 | //Host output vector 27 | double *h_c; 28 | 29 | // Size, in bytes, of each vector 30 | size_t bytes =
n*sizeof(double); 31 | 32 | // Allocate memory for each vector on host 33 | fprintf(stderr, "Allocating memory and populating arrays of length %d ...", n); 34 | h_a = (double*)malloc(bytes); 35 | h_b = (double*)malloc(bytes); 36 | h_c = (double*)malloc(bytes); 37 | 38 | int i; 39 | // Initialize vectors on host 40 | for( i = 0; i < n; i++ ) { 41 | h_a[i] = sin(i)*sin(i); 42 | h_b[i] = cos(i)*cos(i); 43 | } 44 | 45 | fprintf(stderr, " done.\n"); 46 | fprintf(stderr, "Performing vector addition (timer started) ..."); 47 | StartTimer(); 48 | 49 | // add the two vectors 50 | vecAdd(h_a, h_b, h_c, n); 51 | 52 | double runtime = GetTimer(); 53 | fprintf(stderr, " done in %.2f s.\n", runtime / 1000); 54 | 55 | // Sum up vector c and print result divided by n, this should equal 1 within error 56 | double sum = 0; 57 | for(i=0; i<n; i++) sum += h_c[i]; 58 | sum = sum/n; 59 | double tol = 1e-6; /* assumed tolerance value */ 60 | if (fabs(sum - 1.0) > tol) printf("Warning: potential numerical problems.\n"); 61 | 62 | // Release host memory 63 | free(h_a); 64 | free(h_b); 65 | free(h_c); 66 | 67 | return 0; 68 | } 69 | -------------------------------------------------------------------------------- /06_cuda_kernels/04_vector_addition/vector_add_gpu.cu: -------------------------------------------------------------------------------- 1 | /* GPU Version */ 2 | 3 | // original file is https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/ 4 | 5 | #include <stdio.h> 6 | #include <stdlib.h> 7 | #include <math.h> 8 | #include "timer.h" 9 | 10 | // CUDA kernel. Each thread takes care of one element of c 11 | __global__ void vecAdd(double *a, double *b, double *c, int n) 12 | { 13 | // Get our global thread ID 14 | int id = blockIdx.x * blockDim.x + threadIdx.x; 15 | int stride = gridDim.x * blockDim.x; 16 | 17 | // Make sure we do not go out of bounds 18 | int i; 19 | for (i = id; i < n; i += stride) 20 | c[i] = a[i] + b[i]; 21 | } 22 | 23 | int main( int argc, char* argv[] ) 24 | { 25 | // Size of vectors 26 | int n = 2000; 27 | 28 | // Host input vectors 29 | double *h_a; 30 | double *h_b; 31 | //Host output vector 32 | double *h_c; 33 | 34 | // Device input vectors 35 | double *d_a; 36 | double *d_b; 37 | //Device output vector 38 | double *d_c; 39 | 40 | // Size, in bytes, of each vector 41 | size_t bytes = n*sizeof(double); 42 | 43 | // Allocate memory for each vector on host 44 | fprintf(stderr, "Allocating CPU memory and populating arrays of length %d ...", n); 45 | h_a = (double*)malloc(bytes); 46 | h_b = (double*)malloc(bytes); 47 | h_c = (double*)malloc(bytes); 48 | 49 | int i; 50 | // Initialize vectors on host 51 | for( i = 0; i < n; i++ ) { 52 | h_a[i] = sin(i)*sin(i); 53 | h_b[i] = cos(i)*cos(i); 54 | } 55 | fprintf(stderr, " done.\n"); 56 | 57 | fprintf(stderr, "Performing vector addition (timer started) ..."); 58 | StartTimer(); 59 | 60 | // Allocate memory for each vector on GPU 61 | cudaMalloc(&d_a, bytes); 62 | cudaMalloc(&d_b, bytes); 63 | cudaMalloc(&d_c, bytes); 64 | 65 | // Copy host vectors to device 66 | cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); 67 | cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice); 68 | 69 | int blockSize, gridSize; 70 | 71 | // Number of threads in each thread block 72 | blockSize = 1024; 73 | 74 | // Number of thread blocks in grid 75 | gridSize = (int)ceil((double)n/blockSize); 76 | if (gridSize > 65535) gridSize = 32000; 77 | printf("GridSize %d and total_threads %d\n", gridSize, gridSize * blockSize); 78 | // Execute the kernel 79 | vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); 80 | 81 | // Copy array back to host 82 | cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost ); 83 | 84 | // Release device memory 85
| cudaFree(d_a); 86 | cudaFree(d_b); 87 | cudaFree(d_c); 88 | 89 | cudaDeviceSynchronize(); 90 | 91 | double runtime = GetTimer(); 92 | fprintf(stderr, " done in %.2f s.\n", runtime / 1000); 93 | 94 | // Sum up vector c and print result divided by n, this should equal 1 within error 95 | double sum = 0; 96 | for(i=0; i<n; i++) 97 | sum += h_c[i]; 98 | sum = sum/n; 99 | double tol = 1e-6; /* assumed tolerance value */ 100 | if (fabs(sum - 1.0) > tol) printf("Warning: potential numerical problems.\n"); 101 | 102 | // Release host memory 103 | free(h_a); 104 | free(h_b); 105 | free(h_c); 106 | 107 | return 0; 108 | } 109 | -------------------------------------------------------------------------------- /06_cuda_kernels/05_multiple_gpus/README.md: -------------------------------------------------------------------------------- 1 | # Multiple GPUs 2 | 3 | The code in this directory illustrates the use of multiple GPUs. To compile and execute the example, run the following commands: 4 | 5 | ``` 6 | $ module load cudatoolkit/12.2 7 | $ nvcc -O3 -arch=sm_80 -o multi_gpu multi_gpu.cu 8 | $ sbatch job.slurm 9 | ``` 10 | 11 | On Traverse and the Adroit V100 nodes, replace `sm_80` with `sm_70`. 12 | 13 | See also `Samples/0_Introduction/simpleMultiGPU` in the NVIDIA samples which are discussed in `05_cuda_libraries`. 14 | -------------------------------------------------------------------------------- /06_cuda_kernels/05_multiple_gpus/job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=multi-gpu # create a short name for your job 3 | #SBATCH --nodes=1 # node count 4 | #SBATCH --ntasks=1 # total number of tasks across all nodes 5 | #SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks) 6 | #SBATCH --mem-per-cpu=1G # memory per cpu-core (4G per cpu-core is default) 7 | #SBATCH --gres=gpu:2 # number of gpus per node 8 | #SBATCH --time=00:01:00 # total run time limit (HH:MM:SS) 9 | #SBATCH --reservation=gpuprimer # REMOVE THIS LINE AFTER THE WORKSHOP 10 | 11 | module purge 12 | module load cudatoolkit/12.2 13 | 14 | ./multi_gpu 15 | -------------------------------------------------------------------------------- /06_cuda_kernels/05_multiple_gpus/multi_gpu.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | void CPUFunction() { 4 | printf("Hello world from the CPU.\n"); 5 | } 6 | 7 | __global__ void GPUFunction(int myid) { 8 | printf("Hello world from GPU %d.\n", myid); 9 | } 10 | 11 | int main() { 12 | 13 | // function to run on the cpu 14 | CPUFunction(); 15 | 16 | int deviceCount; 17 | cudaGetDeviceCount(&deviceCount); 18 | int device; 19 | for (device=0; device < deviceCount; ++device) { 20 | cudaDeviceProp deviceProp; 21 | cudaGetDeviceProperties(&deviceProp, device); 22 | printf("Device %d has compute capability %d.%d.\n", 23 | device, deviceProp.major, deviceProp.minor); 24 | } 25 | 26 | // run on gpu 0 27 | int device_id = 0; 28 | cudaSetDevice(device_id); 29 | GPUFunction<<<1, 1>>>(device_id); 30 | 31 | // run on gpu 1 32 | device_id = 1; 33 | cudaSetDevice(device_id); 34 | GPUFunction<<<1, 1>>>(device_id); 35 | 36 | // kernel execution is asynchronous so sync on its completion 37 | cudaDeviceSynchronize(); 38 | } 39 | -------------------------------------------------------------------------------- /06_cuda_kernels/README.md: -------------------------------------------------------------------------------- 1 | # CUDA kernels 2 | 3 | In this section you will write GPU kernels from scratch. To get started click on `01_hello_world` above.
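One thing to note about `multi_gpu.cu` above: it launches on device 1 unconditionally, so it needs at least two visible GPUs (hence `--gres=gpu:2` in the Slurm script). As a rough, hypothetical sketch (not part of the repository code), the second launch could be guarded using the device count that the program already queries:

```C
// Hypothetical guard: only launch on a second GPU if one is visible.
// Assumes the GPUFunction kernel and deviceCount from multi_gpu.cu above.
if (deviceCount > 1) {
  cudaSetDevice(1);              // switch to the second GPU
  GPUFunction<<<1, 1>>>(1);      // launch the same kernel there
} else {
  printf("Only %d GPU visible; skipping the second launch.\n", deviceCount);
}
cudaDeviceSynchronize();         // wait for any outstanding kernels to finish
```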
4 | -------------------------------------------------------------------------------- /07_advanced_and_other/README.md: -------------------------------------------------------------------------------- 1 | # Advanced and Other 2 | 3 | ## CUDA-Aware MPI 4 | 5 | On Della you will see MPI modules that have been built against CUDA. These modules enable [CUDA-aware MPI](https://developer.nvidia.com/mpi-solutions-gpus) where 6 | memory on a GPU can be sent to another GPU without involving the CPU. According to NVIDIA: 7 | 8 | > Regular MPI implementations pass pointers to host memory, staging GPU buffers through host memory using cudaMemcopy. 9 | 10 | > With [CUDA-aware MPI](https://developer.nvidia.com/mpi-solutions-gpus), the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. Implementation of CUDA-aware MPI was simplified by Unified Virtual Addressing (UVA) in CUDA 4.0 – which enables a single address space for all CPU and GPU memory. CUDA-aware implementations of MPI have several advantages. 11 | 12 | See the CUDA-aware MPI modules on Della: 13 | 14 | ``` 15 | $ ssh <YourNetID>@della.princeton.edu 16 | $ module avail openmpi/cuda 17 | 18 | ------------- /usr/local/share/Modules/modulefiles ------------- 19 | openmpi/cuda-11.1/gcc/4.1.1 openmpi/cuda-11.3/nvhpc-21.5/4.1.1 20 | ``` 21 | 22 | ## GPU Direct 23 | 24 | [GPU Direct](https://developer.nvidia.com/gpudirect) is a solution to the problem of data-starved GPUs. 25 | 26 | ![gpu-direct](https://developer.nvidia.com/sites/default/files/akamai/GPUDirect/cuda-gpu-direct-blog-refresh_diagram_1.png) 27 | 28 | > Using GPUDirect™, multiple GPUs, network adapters, solid-state drives (SSDs) and now NVMe drives can directly read and write CUDA host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead, and reducing latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA Tesla™ and Quadro™ products 29 | 30 | GPUDirect is enabled on `della` and `traverse`. 31 | 32 | ## GPU Sharing 33 | 34 | Many GPU applications only use the GPU for a fraction of the time. For many years, a goal of GPU vendors has been to allow for GPU sharing between applications. Slurm is capable of supporting this through the `--gpu-mps` option. 35 | 36 | ## OpenMP 4.5+ 37 | 38 | Recent implementations of [OpenMP](https://www.openmp.org/) support GPU programming. However, they are not mature and should not be favored. 39 | 40 | ## CUDA Kernels versus OpenACC over the Long Term 41 | 42 | CUDA kernels are written at a low level. OpenACC is a high-level programming model. Because GPU hardware is changing rapidly, some argue that writing GPU codes with OpenACC is a better choice because there is much less work to do when new hardware comes out. The same holds true for Kokkos. 43 | 44 | [See the materials](http://w3.pppl.gov/~ethier/PICSCIE/Intro_to_OpenACC_Nov_2019.pdf) for an OpenACC workshop by Stephane Ethier. Be aware of the OpenACC Slack channel for getting help. 45 | 46 | ## Using the Intel Compiler 47 | 48 | Note the use of `auto` in the code below: 49 | 50 | ```c++ 51 | #include <stdio.h> 52 | 53 | __global__ void simpleKernel() 54 | { 55 | auto i = blockDim.x * blockIdx.x + threadIdx.x; 56 | printf("Index: %d\n", i); 57 | } 58 | 59 | int main() 60 | { 61 | simpleKernel<<<2, 3>>>(); 62 | cudaDeviceSynchronize(); 63 | } 64 | ``` 65 | 66 | The C++11 language standard introduced the `auto` keyword.
To compile the code with the Intel compiler for Della: 67 | 68 | ``` 69 | $ module load intel/19.1.1.217 70 | $ module load cudatoolkit/11.7 71 | $ nvcc -ccbin=icpc -std=c++11 -arch=sm_80 -o simple simple.cu 72 | ``` 73 | 74 | In general, NVIDIA engineers strongly recommend using GCC over the Intel compiler. 75 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to GPU Computing 2 | 3 | ## About 4 | 5 | This guide provides materials for getting started with running GPU codes on the Princeton Research Computing clusters. It also provides an introduction to writing CUDA kernels and examples of using the NVIDIA GPU-accelerated libraries (e.g., cuBLAS). 6 | 7 | ## Upcoming GPU Training 8 | 9 | [Princeton GPU User Group](https://researchcomputing.princeton.edu/learn/user-groups/gpu) 10 | [See all PICSciE/RC workshops](https://researchcomputing.princeton.edu/learn/workshops-live-training) 11 | [Subscribe to PICSciE/RC Mailing List](https://researchcomputing.princeton.edu/subscribe) 12 | 13 | ## Learning Resources 14 | 15 | [GPU Computing at Princeton](https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing) 16 | [2025 Princeton GPU Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwmKa2AI/se000356) 17 | [Resource List by Open Hackathons](https://www.openhackathons.org/s/technical-resources) 18 | [Training Archive at Oak Ridge National Laboratory](https://docs.olcf.ornl.gov/training/training_archive.html) 19 | [LeetGPU - Free GPU Simulator](https://leetgpu.com/) 20 | [CUDA C++ Programming Guide by NVIDIA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) 21 | [CUDA Fortran Programming Guide by NVIDIA](https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html) 22 | [Intro to CUDA Blog Post](https://developer.nvidia.com/blog/even-easier-introduction-cuda/?mkt_tok=MTU2LU9GTi03NDIAAAGad2PhouORjrUMHihUOvdy-syejFRkc-7otOyEDUy4HXOnJ85JjZ-gUs-lGlbdvG-hpVpXtxlpVN4EOvosdmaWcaSV9TQa84zICsZ3IdKBp5L69uOLQDsm) 23 | [Online Book Available through PU Library](https://catalog.princeton.edu/catalog/99125304171206421) 24 | [Princeton A100 GPU Workshop](https://github.com/PrincetonUniversity/a100_workshop) 25 | 26 | ## Getting Help 27 | 28 | If you encounter any difficulties with this material then please send an email to cses@princeton.edu or attend a help session. 29 | 30 | ## Authorship 31 | 32 | This guide was created by Jonathan Halverson and members of Princeton Research Computing. 33 | -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | # Introduction to GPU Computing 2 | 3 | ## Setup for live workshop 4 | 5 | ### Point your browser to `https://bit.ly/36g5YUS` 6 | 7 | + Connect to the eduroam wireless network 8 | 9 | + Open a terminal (e.g., Terminal, PowerShell, PuTTY) [click here for help] 10 | 11 | + Request an [account on Adroit](https://forms.rc.princeton.edu/registration/?q=adroit). 
12 | 13 | + Please SSH to Adroit in the terminal: `ssh <YourNetID>@adroit.princeton.edu` [click [here](https://researchcomputing.princeton.edu/faq/why-cant-i-login-to-a-clu) for help] 14 | 15 | + If you are new to Linux then consider using the MyAdroit web portal: [https://myadroit.princeton.edu](https://myadroit.princeton.edu) (VPN required from off-campus) 16 | 17 | + Clone this repo on Adroit: 18 | 19 | ``` 20 | $ cd /scratch/network/$USER 21 | $ git clone https://github.com/PrincetonUniversity/gpu_programming_intro.git 22 | $ cd gpu_programming_intro 23 | ``` 24 | 25 | + For the live workshop, to get access to the GPU nodes on Adroit, submit your jobs using the workshop reservation: 26 | 27 | `$ sbatch --reservation=gpuprimer job.slurm` 28 | 29 | + Go to the [main page](https://github.com/PrincetonUniversity/gpu_programming_intro) of this repo 30 | --------------------------------------------------------------------------------