├── An Even Easier Introduction to CUDA.md
├── CUDA CC++ Basics.md
├── CUDA IPC.md
├── CUDA Memory Optimizations.md
├── CUDA Multi-Process Service.md
└── README.md

/An Even Easier Introduction to CUDA.md:
--------------------------------------------------------------------------------

# An Even Easier Introduction to CUDA

[Source link](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)

## CUDA Terminologies

**kernel**: a function that the GPU can run

**\__global__**: a specifier telling the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code

**device code**: code that runs on the GPU

**host code**: code that runs on the CPU

## Memory Allocation in CUDA

### Unified Memory[^1]

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using **a single pointer**. The key is that the system automatically **migrates** data allocated in Unified Memory between host and device so that it **looks like** CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.

```c++
char *data;
cudaMallocManaged(&data, N); // makes the data pointer accessible from both the host and the device
```

**Note**: this is a programming model meant to simplify CUDA code. However, a carefully tuned CUDA program that uses streams and `cudaMemcpyAsync()` to efficiently overlap execution with data transfers **may very well perform better than one that relies only on Unified Memory**.

Q: Are we using unified memory or traditional memory allocation techniques?

## Execution Configuration

**execution configuration**: the `<<<...>>>` part of a kernel launch; it tells the CUDA runtime how many parallel threads to use for the launch on the GPU.
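As a concrete illustration (a minimal sketch following the blog post; the `add` kernel is the grid-stride version shown in the next section, and the element count is only illustrative):

```c++
#include <cuda_runtime.h>

// Kernel definition is the grid-stride loop shown below.
__global__ void add(int n, float *x, float *y);

int main() {
    int N = 1 << 20;                                 // 1M elements (illustrative size)
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));        // Unified Memory: accessible from host and device
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 256;                             // threads per block
    int numBlocks = (N + blockSize - 1) / blockSize; // enough blocks to cover all N elements
    add<<<numBlocks, blockSize>>>(N, x, y);          // execution configuration: <<<blocks, threads>>>

    cudaDeviceSynchronize();                         // wait for the GPU before reading results on the host

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```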
**Streaming Multiprocessors**: each runs multiple concurrent thread blocks

**threadIdx.x / y / z**: the index of the current thread within its block

**blockIdx.x / y / z**: the index of the current thread block in the grid

**blockDim.x / y / z**: the number of threads in the block

**gridDim.x / y / z**: the number of blocks in the grid

### Grid-Stride Loop

```c++
__global__
void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x; // the thread index in the grid
    int stride = blockDim.x * gridDim.x;               // total number of threads in the grid
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
```

[^1]: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

--------------------------------------------------------------------------------
/CUDA CC++ Basics.md:
--------------------------------------------------------------------------------

# CUDA C/C++ Basics

[Source link](https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf)

## Cooperating Threads

Unlike parallel blocks, threads have mechanisms to efficiently:

- Communicate
- Synchronize

### Sharing Data Between Threads

**shared memory**: shared data for threads within a block

- extremely fast on-chip memory, as opposed to global memory
- like a user-managed cache
- declared using `__shared__`, allocated per block
- data is not visible to threads in other blocks

**\__syncthreads()**: synchronizes all threads **within a block**

- prevents data hazards

## Managing The Device

### Coordinating Host & Device

Kernel launches are asynchronous

- control returns to the CPU immediately

CPU needs to synchronize before consuming the results

- cudaMemcpy():
  - blocks the CPU until the copy is complete
  - **copy begins when all preceding CUDA calls have completed**
- cudaMemcpyAsync(): asynchronous, does not block the CPU
- cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed

### Reporting Errors

```c++
// Get the error code for the last error
cudaError_t cudaGetLastError(void);

// Get a string describing the error
const char *cudaGetErrorString(cudaError_t error);
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
```

### Device Management

**Application can query and select GPUs**

- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

```c++
// Source: https://forums.developer.nvidia.com/t/beginner-cudagetdevicecount/16403
int device = 0;
int gpuDeviceCount = 0;
struct cudaDeviceProp properties;

cudaError_t cudaResultCode = cudaGetDeviceCount(&gpuDeviceCount);

if (cudaResultCode == cudaSuccess)
{
    cudaGetDeviceProperties(&properties, device);
    printf("%d CUDA device(s) found (compute capability %d.%d)\n", gpuDeviceCount, properties.major, properties.minor);
    printf("\t Product Name: %s\n", properties.name);
    printf("\t TotalGlobalMem: %zu MB\n", properties.totalGlobalMem / (1024 * 1024)); // totalGlobalMem is in bytes
    printf("\t Multiprocessor (SM) Count: %d\n", properties.multiProcessorCount);
    printf("\t Concurrent Kernels Supported: %d\n", properties.concurrentKernels);
}
```

**Multiple host threads can share a device**

**A single host thread can manage multiple devices**

- cudaSetDevice(i): to select the current device
- cudaMemcpy(...): for peer-to-peer copies (see the sketch below)
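A minimal sketch of the "single host thread, multiple devices" pattern (assuming at least one CUDA-capable GPU is present; device-to-device copies could additionally use `cudaMemcpyPeer()` or plain `cudaMemcpy()` under UVA):

```c++
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);       // how many GPUs are visible to this process

    // One host thread drives every device in turn.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                 // make `dev` the current device
        float *d_buf;
        cudaMalloc(&d_buf, 1 << 20);        // allocation lands on the current device
        // ... launch kernels / issue copies for this device here ...
        cudaFree(d_buf);
    }
    return 0;
}
```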
--------------------------------------------------------------------------------
/CUDA IPC.md:
--------------------------------------------------------------------------------

# CUDA IPC

CUDA IPC (Inter-Process Communication) is a feature in NVIDIA's CUDA (Compute Unified Device Architecture) programming model that allows data sharing and synchronization between different processes running on the same GPU. It is helpful when multiple applications need to share GPU resources without staging data through host memory.

## Terminologies

### CUDA Context

A CUDA context is like a container that holds the GPU resources for a specific process. It includes device memory, loaded modules, and other resources needed for executing your GPU kernels (the code that runs on the GPU). Each process has its own context, which isolates its resources from other processes.

Imagine the context as a "workspace" for each GPU application, keeping its data and settings separate from other applications.

The CUDA context is implemented as an opaque data structure in the CUDA runtime, which is managed by the CUDA driver. When a process initializes the CUDA runtime (usually by calling `cudaSetDevice()` or similar functions), a context is created for that process. The context is associated with the chosen GPU device and provides a separate environment for each process that runs on the GPU.

The context ensures resource isolation between different processes, preventing them from interfering with each other's memory, kernels, and other resources. The context also maintains the memory allocation state and handles memory operations such as allocation, deallocation, and data transfers between host and device memory.

### Memory Handle

A MemHandle, or memory handle, is a lightweight reference to a piece of GPU memory that can be shared between processes. It allows different processes to access the same device memory without copying data between host and device.

Think of the MemHandle as a "ticket" that grants access to a specific memory location on the GPU. You can share this ticket with other processes, allowing them to use the same memory.

Example:

Process A has a piece of data in its GPU memory that it wants to share with Process B. Process A creates a MemHandle for that memory and shares it with Process B. Now, both processes can access the same data on the GPU without copying it back and forth.

### CUDA Event

A CUDA event is a synchronization primitive used to track the progress of various operations on the GPU. It can be used to measure the time taken by a specific operation, or to coordinate the execution of multiple tasks.

Imagine events as "milestones" in your GPU code. When a certain task reaches a milestone, it records an event. Other tasks can then wait for these events to occur, ensuring proper execution order.

### CUDA IPC Event Handle

An EventHandle is similar to a MemHandle, but for events instead of memory. It is a reference to a CUDA event that can be shared between processes to synchronize their execution.

Consider the EventHandle as a "ticket" for a specific event, like the one we described for MemHandles. By sharing this ticket, multiple processes can coordinate their execution based on the same event.

Example:

Process A and Process B both need to perform calculations on the shared memory (accessed using a MemHandle). Process A needs to finish its calculations before Process B can start. To ensure this, Process A records a CUDA event when it completes its task and shares the EventHandle with Process B. Process B then waits for the event to be recorded before starting its calculations.
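A sketch of that coordination, assuming the event handle is passed between the two processes over an ordinary host IPC channel (the `send_handle_to_process_B` / `receive_handle_from_process_A` helpers are hypothetical placeholders for a pipe, socket, or shared-memory exchange):

```c++
#include <cuda_runtime.h>

// Process A: creates an interprocess event and exports a handle for it.
void processA(cudaStream_t stream) {
    cudaEvent_t done;
    // Events shared across processes must be created with both of these flags.
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming | cudaEventInterprocess);

    cudaIpcEventHandle_t handle;
    cudaIpcGetEventHandle(&handle, done);
    // send_handle_to_process_B(&handle, sizeof(handle));   // hypothetical host IPC send

    // ... enqueue Process A's kernels on `stream` ...
    cudaEventRecord(done, stream);   // milestone: A's work up to this point is done
}

// Process B: opens the shared event and makes its stream wait on it.
void processB(cudaStream_t stream) {
    cudaIpcEventHandle_t handle;
    // receive_handle_from_process_A(&handle, sizeof(handle)); // hypothetical host IPC receive

    cudaEvent_t done;
    cudaIpcOpenEventHandle(&done, handle);

    cudaStreamWaitEvent(stream, done, 0);  // B's work on `stream` starts only after A records `done`
    // ... enqueue Process B's kernels on `stream` ...
}
```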
## APIs

### cudaIpcGetMemHandle

`cudaIpcGetMemHandle()` is a function in the CUDA API that allows you to create a memory handle (IPC handle) for a specific GPU memory allocation. This handle can be shared with other processes, enabling them to access the same device memory without copying data.

#### Function signature

```c++
cudaError_t cudaIpcGetMemHandle(cudaIpcMemHandle_t *handle, void *devPtr);
```

#### Parameters

1. `handle`: A pointer to a `cudaIpcMemHandle_t` structure that will store the memory handle upon successful completion of the function. This handle can be shared with other processes to access the device memory.
2. `devPtr`: A pointer to the device memory for which you want to create the memory handle. This memory must have been allocated using `cudaMalloc()` or a similar function **within the same process**.

#### Return type

`cudaError_t`: An enumerated type that indicates the success or failure of the function. A value of `cudaSuccess` (0) means the function succeeded, while any other value indicates an error.

#### Possible failure cases

1. `cudaErrorInvalidDevicePointer`: If `devPtr` is not a valid device memory pointer, this error is returned.

   **Example:** If you accidentally pass a host pointer or an uninitialized pointer to the function, you may get this error.

   ```c++
   int *h_data = (int *) malloc(sizeof(int) * 10);
   cudaIpcMemHandle_t handle;
   // Incorrectly passing a host pointer instead of a device pointer
   cudaError_t err = cudaIpcGetMemHandle(&handle, h_data);
   ```

2. `cudaErrorMemoryAllocation`: If there's a failure in allocating the memory handle, this error is returned. It typically occurs when the system is under high memory pressure or if there is a bug in the driver.

3. Invalid-handle errors: the memory handle for the device pointer cannot be created or is not valid. This can happen if the pointer does not refer to a `cudaMalloc()` allocation owned by this process, or if there is an issue with the CUDA driver or IPC setup. Always check the returned error code:

   ```c++
   int *d_A;
   cudaMalloc((void **)&d_A, 10 * sizeof(int));
   cudaIpcMemHandle_t handle;
   cudaError_t err = cudaIpcGetMemHandle(&handle, d_A); // inspect err if handle creation fails
   ```

### cudaIpcOpenMemHandle

`cudaIpcOpenMemHandle` is a function used in CUDA IPC (Inter-Process Communication) to open a remote memory handle for access by the calling process. This allows different processes to share GPU memory without copying data between host and device.
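A minimal producer/consumer sketch of the memory-handle flow (the `send_handle_to_consumer` / `receive_handle_from_producer` helpers are hypothetical placeholders for any host IPC mechanism):

```c++
#include <cuda_runtime.h>

// Producer process: allocates device memory and exports an IPC handle for it.
void producer() {
    int *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(int));      // must be a cudaMalloc'd allocation in this process

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_data);
    // send_handle_to_consumer(&handle, sizeof(handle)); // hypothetical: pipe/socket/shared memory
    // ... keep d_data alive while the consumer is using it ...
}

// Consumer process: opens the handle and uses the same device memory directly.
void consumer() {
    cudaIpcMemHandle_t handle;
    // receive_handle_from_producer(&handle, sizeof(handle)); // hypothetical host IPC receive

    int *d_data;
    cudaIpcOpenMemHandle((void **)&d_data, handle, cudaIpcMemLazyEnablePeerAccess);

    // ... launch kernels that read/write d_data ...

    cudaIpcCloseMemHandle(d_data);   // unmap the remote allocation when done
}
```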
### cudaEventCreate

Creates a CUDA event. An event that will be shared through IPC must be created with `cudaEventCreateWithFlags()` using the `cudaEventInterprocess | cudaEventDisableTiming` flags.

### cudaIpcGetEventHandle

Obtains an IPC handle (`cudaIpcEventHandle_t`) for a previously created interprocess event, so it can be passed to another process.

### cudaIpcOpenEventHandle

Opens an event handle exported by another process, yielding a `cudaEvent_t` that the calling process can wait on or query.

--------------------------------------------------------------------------------
/CUDA Memory Optimizations.md:
--------------------------------------------------------------------------------

# CUDA Memory Optimizations

[CUDA TOOLKIT DOCUMENTATION Chapter 9](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations)

[How to Overlap Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/)

Maximize bandwidth => use more fast memory and less slow-access memory

## 1 Data Transfer Between Host and Device

**Choices**:

- Minimize data transfer between the host and the device -- even if that means running kernels on the GPU that show no speedup compared with running them on the CPU.
- Intermediate data structures should be created, operated on, and destroyed by the device.
- Batch small transfers into one larger transfer -- even if it requires packing non-contiguous regions of memory into a contiguous buffer and then unpacking after the transfer.
- Use page-locked (or pinned) memory.

### Pinned Memory

> With paged memory, the specific memory, which is allowed to be paged in or paged out, is called *pageable memory*. Conversely, the specific memory, which is not allowed to be paged in or paged out, is called *page-locked memory* or *pinned memory*.
>
> Page-locked memory will not communicate with the hard drive. Therefore, the efficiency of reading and writing in page-locked memory is more guaranteed.[^1]

[`cudaHostAlloc()`](http://horacio9573.no-ip.org/cuda/group__CUDART__MEMORY_g15a3871f15f8c38f5b7190946845758c.html): Allocates host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as `cudaMemcpy()`.

[`cudaHostRegister()`](http://horacio9573.no-ip.org/cuda/group__CUDART__MEMORY_g36b9fe28f547f28d23742e8c7cd18141.html): Page-locks a specified range of memory.

**Note:** Allocating excessive pinned memory may degrade system performance, since it reduces the memory available for paging. **Test the application and the systems it runs on for optimal performance parameters.**
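A minimal sketch of allocating pinned memory and using it for an asynchronous transfer (sizes and stream usage are illustrative):

```c++
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;

    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, bytes);   // page-locked (pinned) host allocation
    // alternatives: cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);
    //               cudaHostRegister(existing_ptr, bytes, cudaHostRegisterDefault);

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned host memory is what allows this copy to be truly asynchronous w.r.t. the host.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    // ... independent host work can proceed here ...
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);                      // pinned allocations are freed with cudaFreeHost
    cudaStreamDestroy(stream);
    return 0;
}
```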
### Overlap Data Transfers with Computation on the Host

(See NVIDIA's Technical Blog: [How to Overlap Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/))

#### (1) CUDA Streams

**Definition:** A sequence of operations that execute on the device in the order in which they are issued by the host code. All device operations (kernels and data transfers) run in a stream.

- Default stream: Used when no stream is specified

  - Synchronizing stream => synchronizes with operations in other streams => an operation begins after all previously issued operations *in any stream on the device* have completed, and completes before any other operation *in any stream on the device* will begin. (Exception: [CUDA 7](https://developer.nvidia.com/blog/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/))
  - Overlapping strategy: based on the **asynchronous behavior of kernel launches**

- Non-default stream: Explicitly declared, created, and destroyed by the host

  - Non-blocking stream => sometimes needs to synchronize with the host code
    - `cudaDeviceSynchronize()`: blocks the host code until *all* previously issued operations on the device have completed
    - `cudaStreamSynchronize(stream)`: blocks the host thread until all previously issued operations *in the specified stream* have completed
    - `cudaEventSynchronize(event)`: blocks the host thread until *the specified event* has completed

  - Overlapping strategy: based on **asynchronous data transfers**

#### (2) Overlapping Kernel Execution and Data Transfers

`cudaMemcpyAsync()`: A non-blocking variant of `cudaMemcpy()`. Requires pinned host memory.

Asynchronous transfers enable overlap of data transfers by:

- Overlapping host computation with async data transfers and with device computations.

- Overlapping kernel execution with async data transfer. On devices that are capable of concurrent copy and compute (see `asyncEngineCount`), the data transfer and kernel must use different, non-default streams (streams with non-zero IDs).

```c++
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1); // async copy issued in stream1
kernel<<<grid, block, 0, stream2>>>(otherData_d);                 // kernel in stream2 overlaps the copy
```

##### **Notice:** Different GPU architectures have different numbers of copy and kernel engines, which may differ in performance when using asynchronous transfers.

### Zero Copy

This feature enables GPU threads to directly access host memory. It requires mapped pinned memory.

**Note:** Mapped pinned host memory allows you to overlap CPU-GPU memory transfers with computation while avoiding the use of CUDA streams. But since any repeated access to such memory areas causes repeated CPU-GPU transfers, consider creating a second area in device memory to manually cache the previously read host memory data.
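A minimal zero-copy sketch using mapped pinned memory (on older devices `cudaSetDeviceFlags(cudaDeviceMapHost)` may need to be called before any CUDA work; the size and kernel are illustrative):

```c++
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // reads/writes go directly to host memory over the bus
}

int main() {
    const int N = 1 << 20;

    float *h_mapped, *d_alias;
    // Mapped pinned allocation: host memory that the GPU can address directly.
    cudaHostAlloc((void **)&h_mapped, N * sizeof(float), cudaHostAllocMapped);
    // Device-side alias for the same memory (with UVA this is typically the same address).
    cudaHostGetDevicePointer((void **)&d_alias, h_mapped, 0);

    for (int i = 0; i < N; ++i) h_mapped[i] = 1.0f;

    scale<<<(N + 255) / 256, 256>>>(d_alias, N);  // kernel accesses host memory directly: zero copy
    cudaDeviceSynchronize();

    cudaFreeHost(h_mapped);
    return 0;
}
```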
### Unified Virtual Addressing

With UVA, the host memory and the device memories of all installed supported devices share a single virtual address space.

## 2 Device Memory Spaces

[Salient Features of Device Memory](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#device-memory-spaces__salient-features-device-memory)

**Choices**:

- Ensure global memory accesses are coalesced whenever possible.
- Non-unit-stride global memory accesses should be avoided whenever possible.
- Use shared memory to avoid redundant transfers from global memory.
- Use asynchronous copies from global to shared memory with an element size of 8 or 16 bytes.
- Use texture reads for streaming fetches with a constant latency.
- Use constant memory when threads in the same warp access only a few distinct locations.

### Coalesced Access to Global Memory

**Ensure global memory accesses are coalesced whenever possible.**

> Coalesced memory access, or memory coalescing, refers to combining multiple memory accesses into a single transaction. However, the following conditions may result in uncoalesced loads (serialized memory accesses):[^2]
>
> - Memory (access) is not sequential
> - Memory access is sparse
> - Misaligned memory access
>
> Memory is accessed at 32-byte granularity.[^3]

The global access requirements for coalescing depend on the compute capability of the device.

Coalescing concepts are illustrated in the following simple examples. Assume:

- Compute capability 6.0 or higher
- Accesses are for 4-byte words, unless otherwise noted.

#### (1) A Simple Access Pattern

Sequential and aligned access: the k-th thread accesses the k-th word in a 32-byte aligned array. Not all threads need to participate.

#### (2) A Sequential but Misaligned Access Pattern

Requires the original transactions to load the first `X` words, and another transaction to load the remaining `32-X` words, where `X` is the offset of the misalignment. More transactions are required.

When `X` is a multiple of 8, the global memory access bandwidth can be the same as for aligned accesses.

**Cache line reuse increases throughput**: adjacent warps reuse the cache lines their neighbors fetched, so the impact of misalignment is not as large as we might have expected.

#### (3) Strided Accesses

**Ensuring that as much of the data in each fetched cache line as possible is actually used is an important part of optimizing memory accesses.** Strided access results in low load/store efficiency, since elements in the transaction are not fully used and represent wasted bandwidth.

In this case, **non-unit-stride global memory accesses should be avoided whenever possible**.

=> utilize shared memory

### L2 Cache

On-chip => higher bandwidth and lower latency accesses to global memory

A portion of the L2 cache can be set aside for persistent (repeated) accesses to a data region in global memory:

```c++
cudaGetDeviceProperties(&prop, device_id);
/* Set aside the maximum possible size of L2 cache for persisting accesses */
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);
```

**Cache Access Window** -- `accessPolicyWindow` includes:

- `base_ptr`: Global memory data pointer
- `num_bytes`: Number of bytes for persisting accesses. Must be less than the max window size.
- `hitProp`: Type of access property on cache hit (persisting access)
- `missProp`: Type of access property on cache miss (streaming access)
- `hitRatio`: Percentage of lines assigned `hitProp`; the rest are assigned `missProp` [^4]

Depending on the value of the `num_bytes` parameter and the size of the L2 cache, one may need to tune the value of `hitRatio` to avoid thrashing of L2 cache lines. (A sketch of setting the window on a stream follows.)
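A sketch of configuring an access policy window on a stream, assuming a device of compute capability 8.0+ and CUDA 11+ where L2 persistence is available (the 60% hit ratio is only an example value to tune):

```c++
#include <cuda_runtime.h>

void configure_persisting_l2(cudaStream_t stream, void *data, size_t num_bytes, int device_id) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device_id);
    // Reserve as much L2 as the device allows for persisting accesses.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = data;                          // region of global memory
    attr.accessPolicyWindow.num_bytes = num_bytes;                     // must fit in the max window size
    attr.accessPolicyWindow.hitRatio  = 0.6f;                          // 60% of lines get hitProp
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;  // hits persist in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;   // the rest are streamed

    // Kernels launched in `stream` after this call see the persisting window.
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```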
### Shared Memory

On-chip => higher bandwidth and lower latency than local and global memory

#### (1) Memory Banks

**Definition:** Shared memory is divided into equally sized memory modules (banks) that can be accessed concurrently.

**Memory Bank Conflict:** Accesses to addresses that are mapped to the same memory bank can only be served serially.

Exception **(Broadcast)**: multiple threads in a warp address the same shared memory location. In this case, multiple broadcasts from different banks are coalesced into a single multicast from the requested shared memory locations to the threads.

**Addresses <-> Banks Mapping Strategy**:

- On devices of compute capability 5.x or newer, each bank has a bandwidth of 32 bits per clock cycle, and successive 32-bit words are assigned to successive banks.
- On devices of compute capability 3.x, each bank has a bandwidth of 64 bits per clock cycle. Either successive 32-bit words (in 32-bit mode) or successive 64-bit words (in 64-bit mode) are assigned to successive banks.

#### (2) Matrix Multiplication (C=AB)

Aside from memory bank conflicts, there is no penalty for non-sequential or unaligned accesses by a warp in shared memory.

**Use shared memory to avoid redundant transfers from global memory.**

#### (3) Matrix Multiplication (C=AA^T)

**Analyzing and eliminating bank conflicts:** pad the shared memory array

```c++
__shared__ float transposedTile[TILE_DIM][TILE_DIM+1];
```

This padding eliminates the conflicts entirely, because now the stride between threads is w+1 banks (i.e., 33 for current devices), which, due to the modulo arithmetic used to compute bank indices, is equivalent to a unit stride.

#### (4) Async Copy from Global Memory to Shared Memory

Overlaps copying data from global to shared memory with computation.

The synchronous version of the kernel loads an element from global memory into an intermediate register and then stores the intermediate register value to shared memory.

In the asynchronous version of the kernel, instructions to load from global memory and store directly into shared memory are issued as soon as the `__pipeline_memcpy_async()` function is called. Using asynchronous copies does not use any intermediate register.

Best performance is achieved when using asynchronous copies with an element size of 8 or 16 bytes.

**Q: What is the role of intermediate registers?**

### Local Memory

Off-chip => as expensive to access as global memory => **no faster access**

Usage: holding automatic variables

- large structures or arrays that would consume too much register space
- arrays that the compiler determines may be indexed dynamically

Inspection of the PTX assembly code (obtained by compiling with the `-ptx` or `-keep` command-line options to `nvcc`) reveals whether a variable has been placed in local memory during the first compilation phases.

### Texture Memory (Texture Cache)

Read-only => costs a device memory read only on a cache miss

Optimized for 2D spatial locality => reading texture addresses that are close together achieves the best performance

Designed for **streaming fetches** with a constant latency

**Caveat: Within a kernel call, the texture cache is not kept coherent with respect to global memory writes.** A thread can safely read a memory location via texture if the location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread within the same kernel call.

### Constant Memory (Constant Cache)

Size: 64 KB

Costs a device memory read only on a cache miss

Best when threads in the same warp access only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.
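A minimal sketch of the constant-memory pattern (the `c_params` array, kernel, and helper are illustrative names, not from the guide):

```c++
#include <cuda_runtime.h>

// Small, read-only parameters live in constant memory and are served by the constant cache.
__constant__ float c_params[2];

__global__ void scale_and_shift(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Every thread in the warp reads the same two constant locations,
        // so the values are broadcast through the constant cache.
        out[i] = in[i] * c_params[0] + c_params[1];
}

void set_params(float scale, float shift) {
    float h_params[2] = {scale, shift};
    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params));
}
```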
## 3 Allocation

**Device memory should be reused and/or sub-allocated by the application whenever possible** to minimize the impact of allocations on overall performance.

## Other References

[^1]: [Page-Locked Host Memory for Data Transfer](https://leimao.github.io/blog/Page-Locked-Host-Memory-Data-Transfer/#:~:text=With%20paged%20memory%2C%20the%20specific,locked%20memory%20or%20pinned%20memory.)
[^2]: [Introduction to GPGPU and CUDA Programming: Memory Coalescing](https://cvw.cac.cornell.edu/gpu/coalesced#:~:text=Coalesced%20memory%20access%20or%20memory,threads%20in%20a%20single%20transaction.)
[^3]: [GPU Optimization Fundamentals](https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf)
[^4]: [CUaccessPolicyWindow_v1 Struct Reference](https://docs.nvidia.com/cuda/cuda-driver-api/structCUaccessPolicyWindow__v1.html#structCUaccessPolicyWindow__v1_1d6ed5cd7bb416976b45e75bafce547e9)

--------------------------------------------------------------------------------
/CUDA Multi-Process Service.md:
--------------------------------------------------------------------------------

# CUDA Multi-Process Service

[Improving GPU Utilization with MPS](https://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf)

Used to optimize performance when multiple CUDA processes share the same GPU.

## Background

### CUDA Context [^1]

- Global memory allocated by the CPU
- Stack/heap space (local memory) used by kernels
- CUDA stream & event objects
- Code modules (*.cubin, *.ptx)

Each process has its own CUDA context.

Each context has its own memory space and cannot access other CUDA contexts' spaces.

### Hyper-Q [^2] -- Hyper Queue (Hardware Feature)

Hyper-Q enables multiple threads or processes to launch work on a single GPU simultaneously.

- Increases GPU utilization and reduces CPU idle time
- Eliminates false dependencies across tasks

Before Hyper-Q: Fermi's single pipeline. There is only one hardware work queue, so there can be false dependencies across tasks.

Kepler GK110 introduces the Grid Management Unit, which creates multiple hardware work queues to reduce or eliminate false dependencies. (The feedback path from the SMXs to the GMU also enables dynamic parallelism.)

Kepler allows 32-way concurrency.

## Multi-Process Service (MPS)

A feature that allows multiple CUDA processes (contexts) to share a single GPU context. Each process receives some subset of the available connections to that GPU.

MPS allows overlapping of kernel and memcopy operations *from different processes* on the GPU to achieve maximum utilization.
MPS Server: Hyper-Q/MPI

- All MPS client processes started after the MPS server will communicate with the GPU through the MPS server only
- Many-to-one context mapping
  - Allows multiple CUDA processes to share one or more GPU contexts

## Usage

See the slides

## Summary

Best for adding GPU acceleration to legacy applications

Enables overlapping of memory copies and compute between different MPI ranks

Ideal for applications with

- MPI-everywhere
- Non-negligible CPU work
- Code partially migrated to the GPU

## Other References

[^1]: [如何使用MPS提升GPU计算收益 (How to use MPS to improve GPU computing efficiency)](https://www.nvidia.cn/content/dam/en-zz/zh_cn/assets/webinars/31oct2019c/20191031_MPS_davidwu.pdf)
[^2]: [Hyper-Q Example](https://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# GPU-Tutorials

Tutorials on GPU programming. Reading notes.

--------------------------------------------------------------------------------