├── An Even Easier Introduction to CUDA.md
├── CUDA CC++ Basics.md
├── CUDA IPC.md
├── CUDA Memory Optimizations.md
├── CUDA Multi-Process Service.md
└── README.md

/An Even Easier Introduction to CUDA.md:
--------------------------------------------------------------------------------

# An Even Easier Introduction to CUDA

[Source link](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)

## CUDA Terminologies

**kernel**: a function that the GPU can run

**\__global__**: a specifier telling the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code

**device code**: code that runs on the GPU

**host code**: code that runs on the CPU

## Memory Allocation in CUDA

### Unified Memory[^1]

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using **a single pointer**. The key is that the system automatically **migrates** data allocated in Unified Memory between host and device so that it **looks like** CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.

```c++
char *data;
cudaMallocManaged(&data, N); // makes the data pointer accessible from both the host and the device
```

**Note**: this is a programming model meant to simplify CUDA code. However, a carefully tuned CUDA program that uses streams and `cudaMemcpyAsync()` to efficiently overlap execution with data transfers **may very well perform better than one that relies only on Unified Memory**.

Q: Are we using unified memory or traditional memory allocation techniques?

## Execution Configuration

**execution configuration**: the `<<<...>>>` part of a kernel launch; it tells the CUDA runtime how many parallel threads to use for the launch on the GPU.
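As a concrete illustration (a minimal sketch following the blog post; the `add` kernel is the grid-stride version shown in the next section, and the element count is only illustrative):

```c++
#include <cuda_runtime.h>

// Kernel definition is the grid-stride loop shown below.
__global__ void add(int n, float *x, float *y);

int main() {
    int N = 1 << 20;                                 // 1M elements (illustrative size)
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));        // Unified Memory: accessible from host and device
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 256;                             // threads per block
    int numBlocks = (N + blockSize - 1) / blockSize; // enough blocks to cover all N elements
    add<<<numBlocks, blockSize>>>(N, x, y);          // execution configuration: <<<blocks, threads>>>

    cudaDeviceSynchronize();                         // wait for the GPU before reading results on the host

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```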
**Streaming Multiprocessors**: each runs multiple concurrent thread blocks

**threadIdx.x / y / z**: the index of the current thread within its block

**blockIdx.x / y / z**: the index of the current thread block in the grid

**blockDim.x / y / z**: the number of threads in the block

**gridDim.x / y / z**: the number of blocks in the grid

### Grid-Stride Loop

```c++
__global__
void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x; // the thread index in the grid
    int stride = blockDim.x * gridDim.x;               // total number of threads in the grid
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
```

[^1]: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

--------------------------------------------------------------------------------
/CUDA CC++ Basics.md:
--------------------------------------------------------------------------------

# CUDA C/C++ Basics

[Source link](https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf)

## Cooperating Threads

Unlike parallel blocks, threads have mechanisms to efficiently:

- Communicate
- Synchronize

### Sharing Data Between Threads

**shared memory**: shared data for threads within a block

- extremely fast on-chip memory, as opposed to global memory
- like a user-managed cache
- declared using `__shared__`, allocated per block
- data is not visible to threads in other blocks

**\__syncthreads()**: synchronizes all threads **within a block**

- prevents data hazards

## Managing The Device

### Coordinating Host & Device

Kernel launches are asynchronous

- control returns to the CPU immediately

CPU needs to synchronize before consuming the results

- cudaMemcpy():
  - blocks the CPU until the copy is complete
  - **copy begins when all preceding CUDA calls have completed**
- cudaMemcpyAsync(): asynchronous, does not block the CPU
- cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed

### Reporting Errors

```c++
// Get the error code for the last error
cudaError_t cudaGetLastError(void);

// Get a string describing the error
const char *cudaGetErrorString(cudaError_t error);
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
```

### Device Management

**Application can query and select GPUs**

- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

```c++
// Source: https://forums.developer.nvidia.com/t/beginner-cudagetdevicecount/16403
int device = 0;
int gpuDeviceCount = 0;
struct cudaDeviceProp properties;

cudaError_t cudaResultCode = cudaGetDeviceCount(&gpuDeviceCount);

if (cudaResultCode == cudaSuccess)
{
    cudaGetDeviceProperties(&properties, device);
    printf("%d CUDA device(s) found (compute capability %d.%d)\n", gpuDeviceCount, properties.major, properties.minor);
    printf("\t Product Name: %s\n", properties.name);
    printf("\t TotalGlobalMem: %zu MB\n", properties.totalGlobalMem / (1024 * 1024)); // totalGlobalMem is in bytes
    printf("\t Multiprocessor (SM) Count: %d\n", properties.multiProcessorCount);
    printf("\t Concurrent Kernels Supported: %d\n", properties.concurrentKernels);
}
```

**Multiple host threads can share a device**

**A single host thread can manage multiple devices**

- cudaSetDevice(i): to select the current device
- cudaMemcpy(...): for peer-to-peer copies (see the sketch below)
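A minimal sketch of the "single host thread, multiple devices" pattern (assuming at least one CUDA-capable GPU is present; device-to-device copies could additionally use `cudaMemcpyPeer()` or plain `cudaMemcpy()` under UVA):

```c++
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);       // how many GPUs are visible to this process

    // One host thread drives every device in turn.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                 // make `dev` the current device
        float *d_buf;
        cudaMalloc(&d_buf, 1 << 20);        // allocation lands on the current device
        // ... launch kernels / issue copies for this device here ...
        cudaFree(d_buf);
    }
    return 0;
}
```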
--------------------------------------------------------------------------------
/CUDA IPC.md:
--------------------------------------------------------------------------------

# CUDA IPC

CUDA IPC (Inter-Process Communication) is a feature in NVIDIA's CUDA (Compute Unified Device Architecture) programming model that allows data sharing and synchronization between different processes running on the same GPU. It is helpful when multiple applications need to share GPU resources without staging data through host memory.

## Terminologies

### CUDA Context

A CUDA context is like a container that holds the GPU resources for a specific process. It includes device memory, loaded modules, and other resources needed for executing your GPU kernels (the code that runs on the GPU). Each process has its own context, which isolates its resources from other processes.

Imagine the context as a "workspace" for each GPU application, keeping its data and settings separate from other applications.

The CUDA context is implemented as an opaque data structure in the CUDA runtime, which is managed by the CUDA driver. When a process initializes the CUDA runtime (usually by calling `cudaSetDevice()` or similar functions), a context is created for that process. The context is associated with the chosen GPU device and provides a separate environment for each process that runs on the GPU.

The context ensures resource isolation between different processes, preventing them from interfering with each other's memory, kernels, and other resources. The context also maintains the memory allocation state and handles memory operations such as allocation, deallocation, and data transfers between host and device memory.

### Memory Handle

A MemHandle, or memory handle, is a lightweight reference to a piece of GPU memory that can be shared between processes. It allows different processes to access the same device memory without copying data between host and device.

Think of the MemHandle as a "ticket" that grants access to a specific memory location on the GPU. You can share this ticket with other processes, allowing them to use the same memory.

Example:

Process A has a piece of data in its GPU memory that it wants to share with Process B. Process A creates a MemHandle for that memory and shares it with Process B. Now, both processes can access the same data on the GPU without copying it back and forth.

### CUDA Event

A CUDA event is a synchronization primitive used to track the progress of various operations on the GPU. It can be used to measure the time taken by a specific operation, or to coordinate the execution of multiple tasks.

Imagine events as "milestones" in your GPU code. When a certain task reaches a milestone, it records an event. Other tasks can then wait for these events to occur, ensuring proper execution order.

### CUDA IPC Event Handle

An EventHandle is similar to a MemHandle, but for events instead of memory. It is a reference to a CUDA event that can be shared between processes to synchronize their execution.

Consider the EventHandle as a "ticket" for a specific event, like the one we described for MemHandles. By sharing this ticket, multiple processes can coordinate their execution based on the same event.

Example:

Process A and Process B both need to perform calculations on the shared memory (accessed using a MemHandle). Process A needs to finish its calculations before Process B can start. To ensure this, Process A records a CUDA event when it completes its task and shares the EventHandle with Process B. Process B then waits for the event to be recorded before starting its calculations.
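A sketch of that coordination, assuming the event handle is passed between the two processes over an ordinary host IPC channel (the `send_handle_to_process_B` / `receive_handle_from_process_A` helpers are hypothetical placeholders for a pipe, socket, or shared-memory exchange):

```c++
#include <cuda_runtime.h>

// Process A: creates an interprocess event and exports a handle for it.
void processA(cudaStream_t stream) {
    cudaEvent_t done;
    // Events shared across processes must be created with both of these flags.
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming | cudaEventInterprocess);

    cudaIpcEventHandle_t handle;
    cudaIpcGetEventHandle(&handle, done);
    // send_handle_to_process_B(&handle, sizeof(handle));   // hypothetical host IPC send

    // ... enqueue Process A's kernels on `stream` ...
    cudaEventRecord(done, stream);   // milestone: A's work up to this point is done
}

// Process B: opens the shared event and makes its stream wait on it.
void processB(cudaStream_t stream) {
    cudaIpcEventHandle_t handle;
    // receive_handle_from_process_A(&handle, sizeof(handle)); // hypothetical host IPC receive

    cudaEvent_t done;
    cudaIpcOpenEventHandle(&done, handle);

    cudaStreamWaitEvent(stream, done, 0);  // B's work on `stream` starts only after A records `done`
    // ... enqueue Process B's kernels on `stream` ...
}
```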
## APIs

### cudaIpcGetMemHandle

`cudaIpcGetMemHandle()` is a function in the CUDA API that allows you to create a memory handle (IPC handle) for a specific GPU memory allocation. This handle can be shared with other processes, enabling them to access the same device memory without copying data.

#### Function signature

```c++
cudaError_t cudaIpcGetMemHandle(cudaIpcMemHandle_t *handle, void *devPtr);
```

#### Parameters

1. `handle`: A pointer to a `cudaIpcMemHandle_t` structure that will store the memory handle upon successful completion of the function. This handle can be shared with other processes to access the device memory.
2. `devPtr`: A pointer to the device memory for which you want to create the memory handle. This memory must have been allocated using `cudaMalloc()` or a similar function **within the same process**.

#### Return type

`cudaError_t`: An enumerated type that indicates the success or failure of the function. A value of `cudaSuccess` (0) means the function succeeded, while any other value indicates an error.

#### Possible failure cases

1. `cudaErrorInvalidDevicePointer`: If `devPtr` is not a valid device memory pointer, this error is returned.

   **Example:** If you accidentally pass a host pointer or an uninitialized pointer to the function, you may get this error.

   ```c++
   int *h_data = (int *) malloc(sizeof(int) * 10);
   cudaIpcMemHandle_t handle;
   // Incorrectly passing a host pointer instead of a device pointer
   cudaError_t err = cudaIpcGetMemHandle(&handle, h_data);
   ```

2. `cudaErrorMemoryAllocation`: If there's a failure in allocating the memory handle, this error is returned. It typically occurs when the system is under high memory pressure or if there is a bug in the driver.

3. Invalid-handle errors: the memory handle for the device pointer cannot be created or is not valid. This can happen if the pointer does not refer to a `cudaMalloc()` allocation owned by this process, or if there is an issue with the CUDA driver or IPC setup. Always check the returned error code:

   ```c++
   int *d_A;
   cudaMalloc((void **)&d_A, 10 * sizeof(int));
   cudaIpcMemHandle_t handle;
   cudaError_t err = cudaIpcGetMemHandle(&handle, d_A); // inspect err if handle creation fails
   ```

### cudaIpcOpenMemHandle

`cudaIpcOpenMemHandle` is a function used in CUDA IPC (Inter-Process Communication) to open a remote memory handle for access by the calling process. This allows different processes to share GPU memory without copying data between host and device.
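A minimal producer/consumer sketch of the memory-handle flow (the `send_handle_to_consumer` / `receive_handle_from_producer` helpers are hypothetical placeholders for any host IPC mechanism):

```c++
#include <cuda_runtime.h>

// Producer process: allocates device memory and exports an IPC handle for it.
void producer() {
    int *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(int));      // must be a cudaMalloc'd allocation in this process

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_data);
    // send_handle_to_consumer(&handle, sizeof(handle)); // hypothetical: pipe/socket/shared memory
    // ... keep d_data alive while the consumer is using it ...
}

// Consumer process: opens the handle and uses the same device memory directly.
void consumer() {
    cudaIpcMemHandle_t handle;
    // receive_handle_from_producer(&handle, sizeof(handle)); // hypothetical host IPC receive

    int *d_data;
    cudaIpcOpenMemHandle((void **)&d_data, handle, cudaIpcMemLazyEnablePeerAccess);

    // ... launch kernels that read/write d_data ...

    cudaIpcCloseMemHandle(d_data);   // unmap the remote allocation when done
}
```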
### cudaEventCreate

Creates a CUDA event. An event that will be shared through IPC must be created with `cudaEventCreateWithFlags()` using the `cudaEventInterprocess | cudaEventDisableTiming` flags.

### cudaIpcGetEventHandle

Obtains an IPC handle (`cudaIpcEventHandle_t`) for a previously created interprocess event, so it can be passed to another process.

### cudaIpcOpenEventHandle

Opens an event handle exported by another process, yielding a `cudaEvent_t` that the calling process can wait on or query.

--------------------------------------------------------------------------------
/CUDA Memory Optimizations.md:
--------------------------------------------------------------------------------

# CUDA Memory Optimizations

[CUDA TOOLKIT DOCUMENTATION Chapter 9](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations)

[How to Overlap Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/)

Maximize bandwidth => use more fast memory and less slow-access memory

## 1 Data Transfer Between Host and Device

**Choices**:

- Minimize data transfer between the host and the device -- even if that means running kernels on the GPU that show no speedup compared with running them on the CPU.
- Intermediate data structures should be created, operated on, and destroyed by the device.
- Batch small transfers into one larger transfer -- even if it requires packing non-contiguous regions of memory into a contiguous buffer and then unpacking after the transfer.
- Use page-locked (or pinned) memory.

### Pinned Memory

> With paged memory, the specific memory, which is allowed to be paged in or paged out, is called *pageable memory*. Conversely, the specific memory, which is not allowed to be paged in or paged out, is called *page-locked memory* or *pinned memory*.
>
> Page-locked memory will not communicate with the hard drive. Therefore, the efficiency of reading and writing in page-locked memory is more guaranteed.[^1]

[`cudaHostAlloc()`](http://horacio9573.no-ip.org/cuda/group__CUDART__MEMORY_g15a3871f15f8c38f5b7190946845758c.html): Allocates host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as `cudaMemcpy()`.

[`cudaHostRegister()`](http://horacio9573.no-ip.org/cuda/group__CUDART__MEMORY_g36b9fe28f547f28d23742e8c7cd18141.html): Page-locks a specified range of memory.

**Note:** Allocating excessive pinned memory may degrade system performance, since it reduces the memory available for paging. **Test the application and the systems it runs on for optimal performance parameters.**
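A minimal sketch of allocating pinned memory and using it for an asynchronous transfer (sizes and stream usage are illustrative):

```c++
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;

    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, bytes);   // page-locked (pinned) host allocation
    // alternatives: cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);
    //               cudaHostRegister(existing_ptr, bytes, cudaHostRegisterDefault);

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned host memory is what allows this copy to be truly asynchronous w.r.t. the host.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    // ... independent host work can proceed here ...
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);                      // pinned allocations are freed with cudaFreeHost
    cudaStreamDestroy(stream);
    return 0;
}
```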
### Overlap Data Transfers with Computation on the Host

(See NVIDIA's Technical Blog: [How to Overlap Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/))

#### (1) CUDA Streams

**Definition:** A sequence of operations that execute on the device in the order in which they are issued by the host code. All device operations (kernels and data transfers) run in a stream.

- Default stream: Used when no stream is specified

  - Synchronizing stream => synchronizes with operations in other streams => an operation begins after all previously issued operations *in any stream on the device* have completed, and completes before any other operation *in any stream on the device* will begin. (Exception: [CUDA 7](https://developer.nvidia.com/blog/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/))
  - Overlapping strategy: based on the **asynchronous behavior of kernel launches**

- Non-default stream: Explicitly declared, created, and destroyed by the host

  - Non-blocking stream => sometimes needs to synchronize with the host code
    - `cudaDeviceSynchronize()`: blocks the host code until *all* previously issued operations on the device have completed
    - `cudaStreamSynchronize(stream)`: blocks the host thread until all previously issued operations *in the specified stream* have completed
    - `cudaEventSynchronize(event)`: blocks the host thread until *the specified event* has completed

  - Overlapping strategy: based on **asynchronous data transfers**

#### (2) Overlapping Kernel Execution and Data Transfers

`cudaMemcpyAsync()`: A non-blocking variant of `cudaMemcpy()`. Requires pinned host memory.

Asynchronous transfers enable overlap of data transfers by:

- Overlapping host computation with async data transfers and with device computations.

- Overlapping kernel execution with async data transfer. On devices that are capable of concurrent copy and compute (see `asyncEngineCount`), the data transfer and kernel must use different, non-default streams (streams with non-zero IDs).

```c++
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1); // async copy issued in stream1
kernel<<<grid, block, 0, stream2>>>(otherData_d);                 // kernel in stream2 overlaps the copy
```

##### **Notice:** Different GPU architectures have different numbers of copy and kernel engines, which may differ in performance when using asynchronous transfers.

### Zero Copy

This feature enables GPU threads to directly access host memory. It requires mapped pinned memory.

**Note:** Mapped pinned host memory allows you to overlap CPU-GPU memory transfers with computation while avoiding the use of CUDA streams. But since any repeated access to such memory areas causes repeated CPU-GPU transfers, consider creating a second area in device memory to manually cache the previously read host memory data.
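A minimal zero-copy sketch using mapped pinned memory (on older devices `cudaSetDeviceFlags(cudaDeviceMapHost)` may need to be called before any CUDA work; the size and kernel are illustrative):

```c++
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // reads/writes go directly to host memory over the bus
}

int main() {
    const int N = 1 << 20;

    float *h_mapped, *d_alias;
    // Mapped pinned allocation: host memory that the GPU can address directly.
    cudaHostAlloc((void **)&h_mapped, N * sizeof(float), cudaHostAllocMapped);
    // Device-side alias for the same memory (with UVA this is typically the same address).
    cudaHostGetDevicePointer((void **)&d_alias, h_mapped, 0);

    for (int i = 0; i < N; ++i) h_mapped[i] = 1.0f;

    scale<<<(N + 255) / 256, 256>>>(d_alias, N);  // kernel accesses host memory directly: zero copy
    cudaDeviceSynchronize();

    cudaFreeHost(h_mapped);
    return 0;
}
```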
### Unified Virtual Addressing

With UVA, the host memory and the device memories of all installed supported devices share a single virtual address space.

## 2 Device Memory Spaces

[Salient Features of Device Memory](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#device-memory-spaces__salient-features-device-memory)

**Choices**:

- Ensure global memory accesses are coalesced whenever possible.
- Non-unit-stride global memory accesses should be avoided whenever possible.
- Use shared memory to avoid redundant transfers from global memory.
- Use asynchronous copies from global to shared memory with an element size of 8 or 16 bytes.
- Use texture reads for streaming fetches with a constant latency.
- Use constant memory when threads in the same warp access only a few distinct locations.

### Coalesced Access to Global Memory

**Ensure global memory accesses are coalesced whenever possible.**

> Coalesced memory access, or memory coalescing, refers to combining multiple memory accesses into a single transaction. However, the following conditions may result in uncoalesced loads (serialized memory accesses):[^2]
>
> - Memory (access) is not sequential
> - Memory access is sparse
> - Misaligned memory access
>
> Memory is accessed at 32-byte granularity.[^3]

The global access requirements for coalescing depend on the compute capability of the device.

Coalescing concepts are illustrated in the following simple examples. Assume:

- Compute capability 6.0 or higher
- Accesses are for 4-byte words, unless otherwise noted.

#### (1) A Simple Access Pattern

Sequential and aligned access: the k-th thread accesses the k-th word in a 32-byte aligned array. Not all threads need to participate.

#### (2) A Sequential but Misaligned Access Pattern

Requires the original transactions to load the first `X` words, and another transaction to load the remaining `32-X` words, where `X` is the offset of the misalignment. More transactions are required.

When `X` is a multiple of 8, the global memory access bandwidth can be the same as for aligned accesses.

**Cache line reuse increases throughput**: adjacent warps reuse the cache lines their neighbors fetched, so the impact of misalignment is not as large as we might have expected.

#### (3) Strided Accesses

**Ensuring that as much of the data in each fetched cache line as possible is actually used is an important part of optimizing memory accesses.** Strided access results in low load/store efficiency, since elements in the transaction are not fully used and represent wasted bandwidth.

In this case, **non-unit-stride global memory accesses should be avoided whenever possible**.

=> utilize shared memory

### L2 Cache

On-chip => higher bandwidth and lower latency accesses to global memory

A portion of the L2 cache can be set aside for persistent (repeated) accesses to a data region in global memory:

```c++
cudaGetDeviceProperties(&prop, device_id);
/* Set aside the maximum possible size of L2 cache for persisting accesses */
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);
```

**Cache Access Window** -- `accessPolicyWindow` includes:

- `base_ptr`: Global memory data pointer
- `num_bytes`: Number of bytes for persisting accesses. Must be less than the max window size.
- `hitProp`: Type of access property on cache hit (persisting access)
- `missProp`: Type of access property on cache miss (streaming access)
- `hitRatio`: Percentage of lines assigned `hitProp`; the rest are assigned `missProp` [^4]

Depending on the value of the `num_bytes` parameter and the size of the L2 cache, one may need to tune the value of `hitRatio` to avoid thrashing of L2 cache lines. (A sketch of setting the window on a stream follows.)
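A sketch of configuring an access policy window on a stream, assuming a device of compute capability 8.0+ and CUDA 11+ where L2 persistence is available (the 60% hit ratio is only an example value to tune):

```c++
#include <cuda_runtime.h>

void configure_persisting_l2(cudaStream_t stream, void *data, size_t num_bytes, int device_id) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device_id);
    // Reserve as much L2 as the device allows for persisting accesses.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = data;                          // region of global memory
    attr.accessPolicyWindow.num_bytes = num_bytes;                     // must fit in the max window size
    attr.accessPolicyWindow.hitRatio  = 0.6f;                          // 60% of lines get hitProp
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;  // hits persist in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;   // the rest are streamed

    // Kernels launched in `stream` after this call see the persisting window.
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```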
### Shared Memory

On-chip => higher bandwidth and lower latency than local and global memory

#### (1) Memory Banks

**Definition:** Shared memory is divided into equally sized memory modules (banks) that can be accessed concurrently.

**Memory Bank Conflict:** Accesses to addresses that are mapped to the same memory bank can only be served serially.

Exception **(Broadcast)**: multiple threads in a warp address the same shared memory location. In this case, multiple broadcasts from different banks are coalesced into a single multicast from the requested shared memory locations to the threads.

**Addresses <-> Banks Mapping Strategy**:

- On devices of compute capability 5.x or newer, each bank has a bandwidth of 32 bits per clock cycle, and successive 32-bit words are assigned to successive banks.
- On devices of compute capability 3.x, each bank has a bandwidth of 64 bits per clock cycle. Either successive 32-bit words (in 32-bit mode) or successive 64-bit words (in 64-bit mode) are assigned to successive banks.

#### (2) Matrix Multiplication (C=AB)

Aside from memory bank conflicts, there is no penalty for non-sequential or unaligned accesses by a warp in shared memory.

**Use shared memory to avoid redundant transfers from global memory.**

#### (3) Matrix Multiplication (C=AA^T)

**Analyzing and eliminating bank conflicts:** pad the shared memory array

```c++
__shared__ float transposedTile[TILE_DIM][TILE_DIM+1];
```

This padding eliminates the conflicts entirely, because now the stride between threads is w+1 banks (i.e., 33 for current devices), which, due to the modulo arithmetic used to compute bank indices, is equivalent to a unit stride.

#### (4) Async Copy from Global Memory to Shared Memory

Overlaps copying data from global to shared memory with computation.

The synchronous version of the kernel loads an element from global memory into an intermediate register and then stores the intermediate register value to shared memory.

In the asynchronous version of the kernel, instructions to load from global memory and store directly into shared memory are issued as soon as the `__pipeline_memcpy_async()` function is called. Using asynchronous copies does not use any intermediate register.

Best performance is achieved when using asynchronous copies with an element size of 8 or 16 bytes.

**Q: What is the role of intermediate registers?**

### Local Memory

Off-chip => as expensive to access as global memory => **no faster access**

Usage: holding automatic variables

- large structures or arrays that would consume too much register space
- arrays that the compiler determines may be indexed dynamically

Inspection of the PTX assembly code (obtained by compiling with the `-ptx` or `-keep` command-line options to `nvcc`) reveals whether a variable has been placed in local memory during the first compilation phases.

### Texture Memory (Texture Cache)

Read-only => costs a device memory read only on a cache miss

Optimized for 2D spatial locality => reading texture addresses that are close together achieves the best performance

Designed for **streaming fetches** with a constant latency

**Caveat: Within a kernel call, the texture cache is not kept coherent with respect to global memory writes.** A thread can safely read a memory location via texture if the location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread within the same kernel call.

### Constant Memory (Constant Cache)

Size: 64 KB

Costs a device memory read only on a cache miss

Best when threads in the same warp access only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.
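A minimal sketch of the constant-memory pattern (the `c_params` array, kernel, and helper are illustrative names, not from the guide):

```c++
#include <cuda_runtime.h>

// Small, read-only parameters live in constant memory and are served by the constant cache.
__constant__ float c_params[2];

__global__ void scale_and_shift(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Every thread in the warp reads the same two constant locations,
        // so the values are broadcast through the constant cache.
        out[i] = in[i] * c_params[0] + c_params[1];
}

void set_params(float scale, float shift) {
    float h_params[2] = {scale, shift};
    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params));
}
```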
## 3 Allocation

**Device memory should be reused and/or sub-allocated by the application whenever possible** to minimize the impact of allocations on overall performance.

## Other References

[^1]: [Page-Locked Host Memory for Data Transfer](https://leimao.github.io/blog/Page-Locked-Host-Memory-Data-Transfer/#:~:text=With%20paged%20memory%2C%20the%20specific,locked%20memory%20or%20pinned%20memory.)
[^2]: [Introduction to GPGPU and CUDA Programming: Memory Coalescing](https://cvw.cac.cornell.edu/gpu/coalesced#:~:text=Coalesced%20memory%20access%20or%20memory,threads%20in%20a%20single%20transaction.)
[^3]: [GPU Optimization Fundamentals](https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf)
[^4]: [CUaccessPolicyWindow_v1 Struct Reference](https://docs.nvidia.com/cuda/cuda-driver-api/structCUaccessPolicyWindow__v1.html#structCUaccessPolicyWindow__v1_1d6ed5cd7bb416976b45e75bafce547e9)

--------------------------------------------------------------------------------
/CUDA Multi-Process Service.md:
--------------------------------------------------------------------------------

# CUDA Multi-Process Service

[Improving GPU Utilization with MPS](https://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf)

Used to optimize performance when multiple CUDA processes share the same GPU.

## Background

### CUDA Context [^1]

- Global memory allocated by the CPU
- Stack/heap space (local memory) used by kernels
- CUDA stream & event objects
- Code modules (*.cubin, *.ptx)

Each process has its own CUDA context.

Each context has its own memory space and cannot access other CUDA contexts' spaces.

### Hyper-Q [^2] -- Hyper Queue (Hardware Feature)

Hyper-Q enables multiple threads or processes to launch work on a single GPU simultaneously.

- Increases GPU utilization and reduces CPU idle time
- Eliminates false dependencies across tasks

Before Hyper-Q: Fermi's single pipeline. There is only one hardware work queue, so there can be false dependencies across tasks.

Kepler GK110 introduces the Grid Management Unit, which creates multiple hardware work queues to reduce or eliminate false dependencies. (The feedback path from the SMXs to the GMU also enables dynamic parallelism.)

Kepler allows 32-way concurrency.

## Multi-Process Service (MPS)

A feature that allows multiple CUDA processes (contexts) to share a single GPU context. Each process receives some subset of the available connections to that GPU.

MPS allows overlapping of kernel and memcopy operations *from different processes* on the GPU to achieve maximum utilization.
MPS Server: Hyper-Q/MPI

- All MPS client processes started after the MPS server will communicate with the GPU through the MPS server only
- Many-to-one context mapping
  - Allows multiple CUDA processes to share one or more GPU contexts

## Usage

See the slides

## Summary

Best for adding GPU acceleration to legacy applications

Enables overlapping of memory copies and compute between different MPI ranks

Ideal for applications with

- MPI-everywhere
- Non-negligible CPU work
- Code partially migrated to the GPU

## Other References

[^1]: [如何使用MPS提升GPU计算收益 (How to use MPS to improve GPU computing efficiency)](https://www.nvidia.cn/content/dam/en-zz/zh_cn/assets/webinars/31oct2019c/20191031_MPS_davidwu.pdf)
[^2]: [Hyper-Q Example](https://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# GPU-Tutorials

Tutorials on GPU programming. Reading notes.

--------------------------------------------------------------------------------