├── Chapter01
│   └── README.md
├── Chapter02
│   └── README.md
├── Chapter03
│   └── README.md
├── Chapter04
│   └── README.md
├── Chapter05
│   ├── README.md
│   ├── coalesced.ncu-rep
│   ├── coalescing.cu
│   └── mix.ncu-rep
├── Chapter06
│   ├── README.md
│   ├── cameraman.png
│   ├── circles.png
│   ├── opencv_add.cpp
│   └── test.nsys-rep
├── Chapter07
│   ├── Occupancy.cu
│   └── README.md
├── Chapter08
│   └── README.md
├── Chapter09
│   ├── README.md
│   ├── demo.cu
│   ├── device_detail.cu
│   └── profile.ncu-rep
├── Chapter10
│   └── README.md
├── Fix-Bug
│   ├── README.md
│   └── test.cu
└── README.md
/Chapter01/README.md: -------------------------------------------------------------------------------- 1 |

2 |

Introduction to Nsight Systems - Nsight Compute

3 |

4 | 5 | In this article, I will provide a brief introduction to Nsight Systems and Nsight Compute, giving you an overview of which tool to use for your specific needs. 6 | 7 | Please note that this article serves as a high-level introduction to these two tools and does not delve into every detail. Therefore, the content below provides an overview of what to pay attention to, while in-depth explanations, debugging, and optimization will be covered in future articles. 8 | 9 |

10 |

Nsight Systems - Nsight Compute

11 |

12 | 13 | Before we go through these two tools, let me give you an example to make it easier for you to understand. When you go to the doctor for a regular check-up, you will first have a general check-up. If everything is fine, then you can leave. But if there is a problem (for example, with your heart or lungs), then you will need to have a more detailed examination of the parts that are not working properly. In this case, **the performance of our code** is similar to our health. First, we will use **Nsight Systems to check our code overall** to see if there are any problems (for example, with the **functions or the data transfers**). If there are, then we will use **Nsight Compute to pinpoint the problem in that function/data transfer** so that we can **optimize and debug it.** 14 | 15 |

16 | 17 |

18 | 19 | `As you can see in the figure, we will start with Nsight Systems (general check-up) and then move on to Nsight Compute (detailed analysis of the kernels, also known as functions on the GPU). It is important to note that I will not be covering Nsight Graphics because it is for the graphics and gaming industry. However, you should not be disappointed because the metrics are very similar to those of Nsight Compute.` 20 | 21 | **One thing to keep in mind is that these two tools, Compute and Systems, are ONLY for programs that use GPUs to run. That is why in this series, I will only be showing how to use them for parallel programming or Deep Learning models.** 22 | 23 |

24 |

Nsight Systems

25 |

26 | 27 | As you can see in the figure, Nsight Systems is first used to analyze the program. So, what specifically do we analyze here? 28 | 29 | ## 1. Time/speed/size when transferring data from the host to the device and vice versa 30 | 31 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/662fb9fd-032b-4d69-aaf6-5704e6282694) 32 | 33 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/ec6d18f0-4bc5-4d90-9260-34b59b543748) 34 | 35 |

36 | 37 |

38 | 39 | Based on the three images above, we can see that we can **improve the copy from the host to the device.** 40 | 41 | ## 2. Next, we can look at an overview of our kernel (kernel name: mygemm) 42 | 43 | The metrics that we will need to focus on for analysis are: **Grid/block/Theoretical occupancy** 44 | 45 | **Summary: After the general check-up, we see that we can improve the code in two areas: copy data and kernel.** 46 | 47 |

48 |

Nsight Compute

49 |

50 | 51 | After confirming that the two problems to be addressed are data copy and kernel, we will use Nsight Compute to analyze in more detail what the problem is. 52 | 53 | ### 1. First, the "summary" will show us where we are having problems and how to solve them (I will not go into too much detail here, but I will provide a brief explanation). 54 | 55 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/6bec7cab-c45e-4067-9234-3f5a63945cc5) 56 | 57 | As you can see in the figure, we can improve **three things, including two** that I have already analyzed above: 58 | 59 | - **Theoretical warps speedup 33.33%:** You will notice that in the kernel overview figure, the Theoretical occupancy is 66.66%, which means that we can improve it further (in theory, it can reach 100%). 60 | 61 | - **DRAM Excessive Read Sectors:** This means that our memory allocation and organization is not optimized, which leads to problems with read/write during data transfer. 62 | 63 | 64 | ### 2. Next, the "Source" will show us the line of code that is performing the heaviest work (consuming a lot of time/memory). 65 | 66 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/51881ee3-ba12-4875-96d4-8b6b8bd288e4) 67 | 68 | ### 3. The "Detail" section is the most difficult one and contains the most information to analyze. We will focus on the following: 69 | 70 | - **GPU Speed of Light Throughput** 71 | 72 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/48d60176-6cb8-4d9a-889d-1f70bbe81686) 73 | 74 | 75 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/8441d8de-31b4-484e-96eb-62f6ce0aa02d) 76 | 77 | 78 | - **Memory Workload Analysis** 79 | 80 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/d60e8d78-0409-43e6-a5d6-b831f3e3033b) 81 | 82 | 83 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/7f0afaf4-dec6-4d2a-bf4c-ad523033a16b) 84 | 85 | - **Scheduler Statistics** 86 | 87 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/739ae9d8-1788-43d6-bf05-cffa8c7a53ce) 88 | 89 | 90 | - **Occupancy** 91 | 92 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/637c70ca-06ad-4296-8cae-2da0346551d4) 93 | 94 |

95 |

Summary

96 |

97 | 98 |

99 | 100 |

101 | 102 | After reading this article, you should have a good idea of the usefulness of Nsight Systems and Nsight Compute. In the following articles, I will go into more detail. 103 | 104 | -------------------------------------------------------------------------------- /Chapter02/README.md: -------------------------------------------------------------------------------- 1 | 2 | Before using Nvidia's profiling tools, it's essential to have a basic understanding of how CUDA works. In this article, I'll briefly explain two commonly mentioned concepts in CUDA: CUDA Toolkit and CUDA Driver. 3 | 4 | I will provide a simple explanation without diving too deep into the details, so don't worry. 5 | 6 |

7 |

Cuda toolkit - Cuda driver

8 |

9 | 10 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/720652d1-dab8-44cd-8ad4-4048f6f3dafb) 11 | 12 | 13 | Before explaining these two terms, let's start with an analogy to help you understand better: Imagine you're playing a video game, and your character is at level 10, equipped with a level 5 weapon. In this scenario, your total combat power is 100. You have two ways to increase your character's combat power: 14 | 15 | - The Easy Way: Find a level 10 weapon that matches your character's level. 16 | - The Hard Way: Increase your character's level. 17 | 18 | A small note: You cannot equip a weapon with a higher level than your character's level. 19 | 20 | In the context of CUDA, it's similar. If you want to optimize a CUDA program (excluding code-related factors), you have two options: increase the level of the CUDA Toolkit or increase the level of the CUDA Driver. 21 | 22 | - CUDA Driver: This represents the capability of your computer (similar to your character's level). The more powerful your computer is, the faster it can run, and each computer will have a certain level of capability. 23 | 24 | 25 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/ec0c590f-a7f6-436e-bed0-fdda364f64a1) 26 | 27 | 28 | - CUDA Toolkit: This represents the version of CUDA you are using (similar to the level of your weapon). A higher version **can POTENTIALLY run faster** than an older version (because newer versions are usually more optimized and may have more advanced functions compared to older versions). 29 | 30 | **In summary, CUDA Driver is physical, representing the maximum capability your CUDA program can run at, while CUDA Toolkit is logical, representing the level of CUDA utilization. A higher version of the CUDA Toolkit indicates a more advanced level of utilization.** 31 | 32 | 33 |

34 | 35 |

36 | 37 | 38 | When coding, we have two perspectives: the coder view (logical view) and the hardware view (physical view). This means that when we optimize our code, it's optimized at the logical level, and that code is then compiled into binary code for the hardware to execute and further optimize. In the case of CUDA, the CUDA Toolkit and Driver operate similarly. We use the CUDA Toolkit to optimize our CUDA code, and the CUDA Driver optimizes the hardware for us. 39 | 40 | **The question then arises: How do we determine which CUDA Driver and Toolkit versions to use?** 41 | 42 | It's quite simple. We use the [NVIDIA driver](https://www.nvidia.com/Download/index.aspx?lang=en-us) to determine the level of driver our computer is using. Here's an example from my computer: 43 | 44 |

45 | 46 |

47 | 48 |

49 | 50 |

51 | 52 | 53 | Here you will see that the suitable driver version for me is version 535. After downloading and installing driver version 535, open a terminal and run this command to check the compatible CUDA Toolkit version: 54 | ``` 55 | $nvidia-smi 56 | ``` 57 | 58 |

59 | 60 |
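If you prefer to check the same information from code, the CUDA runtime exposes both numbers. Below is a small sketch (my own illustration, not part of the original article): it prints the highest CUDA version the installed driver supports and the toolkit (runtime) version the program was built with, both encoded as major*1000 + minor*10.

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int driverVersion = 0, runtimeVersion = 0;

    // Highest CUDA version supported by the installed driver (the "physical" side)
    cudaDriverGetVersion(&driverVersion);

    // CUDA Toolkit (runtime) version this program was compiled against (the "logical" side)
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("Driver supports CUDA %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("Toolkit (runtime)    %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}
```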

61 | -------------------------------------------------------------------------------- /Chapter03/README.md: -------------------------------------------------------------------------------- 1 |

2 |

NVIDIA Compute Sanitizer Part 1

3 |

4 | 5 | In this article, I will guide you on how to use the NVIDIA Compute Sanitizer, a fantastic tool to support those who are new to CUDA. 6 | 7 | For those who are already very familiar with CUDA, NVIDIA Compute Sanitizer may not be of much help, but it's still better to know about it than not. 8 | 9 | NVIDIA Compute Sanitizer helps us check for four important errors that CUDA beginners often encounter: 10 | - **Memcheck** for memory access error and leak detection 11 | - **Racecheck**, a shared memory data access hazard detection tool 12 | - **Initcheck**, an uninitialized device global memory access detection tool 13 | - **Synccheck** for thread synchronization hazard detection 14 | 15 | This is a simple code snippet (adding two vectors) to analyze four cases. 16 | 17 | ``` 18 | #include <cuda_runtime.h> 19 | 20 | __global__ void vectorAdd(int *a, int *b, int *c, int n) { 21 | int tid = blockIdx.x * blockDim.x + threadIdx.x; 22 | c[tid] = a[tid] + b[tid]; 23 | } 24 | 25 | int main() { 26 | int n = 10; 27 | int *a, *b, *c; 28 | int *d_a, *d_b, *d_c; 29 | int size = n * sizeof(int); 30 | 31 | a = (int*)malloc(size); 32 | b = (int*)malloc(size); 33 | c = (int*)malloc(size); 34 | 35 | for (int i = 0; i < n; i++) { 36 | a[i] = i; 37 | b[i] = i; 38 | } 39 | 40 | cudaMalloc((void**)&d_a, size); 41 | cudaMalloc((void**)&d_b, size); 42 | cudaMalloc((void**)&d_c, size); 43 | 44 | cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); 45 | cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); 46 | 47 | 48 | vectorAdd<<<1, ?>>>(d_a, d_b, d_c, n); 49 | 50 | cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); 51 | 52 | free(a); 53 | free(b); 54 | free(c); 55 | 56 | return 0; 57 | } 58 | ``` 59 | 60 | When you start coding in CUDA, one common mistake is using too few or too many threads compared to the data, and this can lead to an **"undefined behavior"** bug. It may work without giving any error messages. In a larger program, this can result in logic issues and have a significant impact on memory allocations. 61 | 62 |

63 |

Initcheck

64 |

65 | 66 | Here, I will provide a specific example. 67 | 68 | ``` 69 | vectorAdd<<<1, ?>>>(d_a, d_b, d_c, n) --> vectorAdd<<<1, 9>>>(d_a, d_b, d_c, n) 70 | ``` 71 | 72 | As you can see, we have N = 10 but are using only 9 threads for processing, so part of the allocated device memory is never touched (unused memory). We will use this command to profile. 73 | 74 | ``` 75 | compute-sanitizer --tool initcheck --track-unused-memory yes --show-backtrace no ./a.out 76 | ``` 77 | 78 |

79 | 80 |

81 | 82 | We use N = 10 (int), so the total bytes are 40 bytes, and we use 9 threads, leaving 10% of memory unused. 83 | 84 | And here are the results when we fix it to use 10 threads. 85 | 86 |

87 | 88 |

89 | 90 | 91 |

92 |

Memcheck

93 |

94 | 95 | 96 | Above, we discussed the case of using fewer threads. Now, let's consider the case of using more threads. 97 | 98 | ``` 99 | vectorAdd<<<1, ?>>>(d_a, d_b, d_c, n) --> vectorAdd<<<1, 11>>>(d_a, d_b, d_c, n) 100 | ``` 101 | As you can see, we only allocate enough memory for 10 elements, which leads to the 11th thread suffering from **out-of-bounds array access**. We will use this command to profile. 102 | 103 | ``` 104 | compute-sanitizer --tool memcheck --show-backtrace no ./a.out 105 | ``` 106 | 107 |

108 | 109 |

110 | 111 | 112 | To explain it simply, the arrays only hold 10 elements, so when the kernel runs with 11 threads the last access goes past the end of the buffers. The failure is reported at thread(10, 0, 0), which means the 11th thread is accessing data beyond the boundary, leading to **"undefined behavior."** 113 | 114 | The solution is to either adjust to use 10 threads or **add boundary checks.** 115 | 116 | ``` 117 | if (tid < n) { 118 | c[tid] = a[tid] + b[tid]; 119 | } 120 | ``` 121 | 122 | Additionally, if you notice, in this code snippet, we are missing the **cudaFree**, which will also lead to **memory leaks**. We will use this command to profile. 123 | 124 | ``` 125 | compute-sanitizer --tool memcheck --leak-check=full --show-backtrace no ./a.out 126 | ``` 127 |

128 | 129 |
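Putting the two fixes mentioned above together (the boundary check inside the kernel and the missing cudaFree at the end of main), the relevant parts of the corrected program would look roughly like this; it is only a sketch of the changed lines, the rest stays as in the listing above:

```
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {                     // boundary check: extra threads simply do nothing
        c[tid] = a[tid] + b[tid];
    }
}

// ... and at the end of main(), release the device buffers as well:
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

free(a);
free(b);
free(c);
```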

130 | 131 | 132 | The remaining two errors, Synccheck and Racecheck, will be discussed later after we cover atomic functions and data hazards. 133 | -------------------------------------------------------------------------------- /Chapter04/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | In this article, I will continue discussing how to use the NVIDIA Compute Sanitizer. Please read these articles: [NVIDIA Compute Sanitize Part 1](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter03), [Data Hazard](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter12) before reading this one. 4 | 5 |

6 |

NVIDIA Compute Sanitizer Part 2

7 |

8 | 9 | Following up on Part 1, Part 2 will cover the remaining two tools: 10 | 11 | - **Racecheck**, a shared memory data access hazard detection tool 12 | 13 | - **Synccheck** for thread synchronization hazard detection 14 | 15 |

16 |

Racecheck

17 |

18 | 19 | As NVIDIA has mentioned about the [NVIDIA Compute Sanitizer](https://developer.nvidia.com/blog/debugging-cuda-more-efficiently-with-nvidia-compute-sanitizer/), Racecheck is used to check for hazards when using shared memory. So, if you test on global memory, it will not yield any results. 20 | 21 |

22 |

Code

23 |

24 | 25 | ``` 26 | __global__ void sumWithSharedMemory(int* input) { 27 | __shared__ int sharedMemory[4]; 28 | 29 | int tid = threadIdx.x; 30 | int i = blockIdx.x * blockDim.x + threadIdx.x; 31 | 32 | sharedMemory[tid] = input[i]; 33 | 34 | for (int stride = 1; stride < blockDim.x; stride *= 2) { 35 | 36 | // __syncthreads(); -----> barrier 37 | 38 | if (tid % (2 * stride) == 0) { 39 | sharedMemory[tid] += sharedMemory[tid + stride]; 40 | } 41 | } 42 | 43 | printf("blockIdx.x=%d --> %d\n", blockIdx.x, sharedMemory[tid]); 44 | 45 | } 46 | ``` 47 | This code snippet is identical to the one in the [Data Hazard](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter12) article, and here is how it works. 48 | 49 |

50 | 51 |

52 | 53 | The only difference is that instead of using global memory, here we use shared memory. 54 | 55 | Shared memory is a topic I will discuss separately since it is a very important concept when talking about CUDA. So, in this article, you only need to understand that instead of using global memory to perform the addition 1+2+3+4 in parallel, we use shared memory. 56 | 57 | And now we use the NVIDIA Compute Sanitizer to check for data hazards with a command line. 58 | 59 | ``` 60 | compute-sanitizer --tool racecheck --show-backtrace no ./a.out 61 | ``` 62 | 63 |

64 | 65 |

66 | 67 | 68 | Here, you will find a surprising fact that the result is still correct even though there has been a data hazard. As I mentioned in the [previous article](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter12) about the phenomenon of **"undefined behavior,"** this is exactly it. We cannot determine whether it will cause an error or not. It may be that my machine produces the correct result, BUT yours might differ ==> thus, the phenomenon of **"undefined behavior"** can be quite troublesome. 69 | 70 | At this point, if we use __syncthreads(), it will solve this problem. 71 | 72 |
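As a sketch, the fixed kernel differs only in the barrier before each reduction step (it is the same code as above with the commented-out __syncthreads() put back in):

```
__global__ void sumWithSharedMemory(int* input) {
    __shared__ int sharedMemory[4];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sharedMemory[tid] = input[i];

    for (int stride = 1; stride < blockDim.x; stride *= 2) {

        __syncthreads();   // barrier: all writes above are visible before anyone reads

        if (tid % (2 * stride) == 0) {
            sharedMemory[tid] += sharedMemory[tid + stride];
        }
    }

    printf("blockIdx.x=%d --> %d\n", blockIdx.x, sharedMemory[tid]);
}
```

With the barrier in place, racecheck should no longer report hazards for this kernel.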

73 | 74 |

75 | 76 | 77 | Regarding Synccheck, from my tests and searches through NVIDIA's blog, I noticed they didn't mention anything about the code, so I cannot provide an illustration for you. I will skip this part, but I will add it later if I find any relevant information (if you find something, please comment below). 78 | 79 | 80 |

81 |

Exercise

82 |

83 | 84 | 85 |

86 | 87 |

88 | 89 | The actual error in our case is just 4 data hazards (N = 4), but why does the image above show us having 2 data hazard errors (4 and 8)? 90 | 91 | Hint: 1 data hazard = 1 read or 1 write. 92 | 93 | Is 4 data hazards comprised of 4 reads or 4 writes? 94 | 95 | And is 8 data hazards comprised of 8 reads, 8 writes, or 4 reads and 4 writes? 96 | -------------------------------------------------------------------------------- /Chapter05/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | Global memory is the largest memory BUT also the slowest on the GPU, so in this article we will analyze what factors lead to **"low performance"** as well as how to fix them. Before diving into this, it is recommended to review articles on [GPU memories](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter05) and [their utilization](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter06) to better understand the context. 4 | 5 |

6 |

Global Memory Coalescing

7 |

8 | 9 | 10 | Before we dive into the lesson, let's start with an example: 11 | 12 | Imagine you have a task to distribute candies and cakes to children, each with different preferences. Instead of waiting for their turn to come up and asking what they like, which can be time-consuming (in terms of both asking and fetching the respective item), you decide to organize them by preference from the start: those who choose cakes on the left and candies on the right. This way, the distribution process is optimized. 13 | 14 | When discussing global memory access, three key concepts often come up: 15 | 16 | - **Coalescing:** This is the process by which **threads within the same warp** access memory simultaneously, optimizing memory access by reducing the number of necessary accesses and speeding up data transfer (similar to the candy and cake distribution, where instead of asking each time, it's already known what to give out, leading to cache hits). 17 | - **Alignment:** This relates to organizing data in memory optimally to ensure memory accesses are as efficient as possible, minimizing unnecessary data reads and enhancing processing performance (like organizing children by their preference for cakes or candies on different sides to avoid confusion during distribution). 18 | - **Sector:** Refers to the basic unit of memory that can be accessed simultaneously in a single access, clarifying the scope and method by which data is retrieved or written to memory. 19 | Though these are three distinct concepts, they share a common goal: optimizing access to a large memory space. 20 | 21 | **In summary, coalescing is about accessing memory in the most optimal way possible (the fewer accesses, the better), alignment involves arranging data optimally, and a sector is the unit of each access.** 22 | 23 |

24 | 25 |

26 | 27 |

28 | 29 |

30 | 31 | 32 |

33 |

Code

34 |

35 | 36 | 37 | I will demonstrate a simple piece of code using 32 blocks ( 32 threads / block ) and elements (number of elements) = 1024. 38 | 39 |

40 |

Coalescing

41 |

42 | 43 | 44 | ``` 45 | __global__ void testCoalesced(int* in, int* out, int elements) 46 | { 47 | int id = blockDim.x * blockIdx.x +threadIdx.x; 48 | out[id] = in[id]; 49 | } 50 | ``` 51 | 52 |

53 | 54 |

55 | 56 | 57 | And we will profile the above code: 58 | 59 | - **global load transactions per request: the smaller, the better (this is about copying chunks --> checking coalescing)** 60 | 61 | ``` 62 | ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio ./a.out 63 | ``` 64 | 65 |

66 | 67 |

68 | 69 | - **global store transactions per request: the smaller, the better (this is about copying chunks --> checking coalescing)** 70 | 71 | ``` 72 | ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio ./a.out 73 | ``` 74 | 75 |

76 | 77 |

78 | 79 | 80 | 81 | - **global load transactions: (compare to see which kernel has coalescing || the smaller, the better).** 82 | 83 | ``` 84 | ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum ./a.out 85 | ``` 86 | 87 |

88 | 89 |

90 | 91 | 92 | 93 | - **global store transactions:(compare to see which kernel has coalescing || the smaller, the better).** 94 | 95 | ``` 96 | ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum ./a.out 97 | ``` 98 | 99 |

100 | 101 |

102 | 103 | 104 | `The reason "the smaller, the better" applies is because it's akin to distributing candies; the fewer times we need to exchange cookies for candies, the quicker the distribution process. Here, sector/request means that for each request, we only use 4 sectors, totaling just 256 sectors (load and store).` 105 | 106 | `It's important to note that "sector" here does not refer to the number of elements processed per request but to the number of simultaneous data storage access operations the computer performs to process a request. The fewer the accesses, the faster it is (hit cache).` 107 | 108 | 109 |
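To make the numbers above concrete, here is where the 4 sectors/request and the 256 total sectors come from in this example (each sector is 32 bytes, and each element is a 4-byte int):

```
1 warp = 32 threads, each loading one 4-byte int   -> 32 * 4 B = 128 B of contiguous data
128 B / 32 B per sector                            -> 4 sectors per request
1024 elements / 32 threads per warp                -> 32 load requests and 32 store requests
32 requests * 4 sectors (loads) + 32 * 4 (stores)  -> 128 + 128 = 256 sectors in total
```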

110 |

Mix but in cache line

111 |

112 | 113 | 114 | ``` 115 | __global__ void testMixed(int* in, int* out, int elements) 116 | { 117 | int id = ((blockDim.x * blockIdx.x +threadIdx.x* 7) % elements) %elements; 118 | out[id] = in[id]; 119 | } 120 | ``` 121 | 122 |

123 | 124 |

125 | 126 | 127 | 128 | Here, we profile the same: 129 | 130 | - **global load transactions per request: the smaller, the better (this is about copying chunks --> checking coalescing)** 131 | 132 | ``` 133 | ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio ./a.out 134 | ``` 135 | 136 |

137 | 138 |

139 | 140 | 141 | - **global store transactions per request: the smaller, the better (this is about copying chunks --> checking coalescing)** 142 | 143 | ``` 144 | ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio ./a.out 145 | ``` 146 | 147 |

148 | 149 |

150 | 151 | 152 | 153 | - **global load transactions: (compare to see which kernel has coalescing || the smaller, the better).** 154 | 155 | ``` 156 | ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum ./a.out 157 | ``` 158 | 159 |

160 | 161 |

162 | 163 | 164 | 165 | - **global store transactions: (compare to see which kernel has coalescing || the smaller, the better).** 166 | 167 | ``` 168 | ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum ./a.out 169 | ``` 170 | 171 |

172 | 173 |

174 | 175 | 176 | As I mentioned, even though it still resides within the cache line (meaning the threads do not exceed the array space), because it is not coalesced (not in order, such as cookies first then candies, or vice versa), it results in more sectors/request, leading to slower performance. 177 | 178 | 179 |

180 | 181 |

182 | 183 |

184 | 185 |

186 | 187 | 188 | BUT IF YOU PROFILE FULLY (meaning to output to a .ncu-rep file for use with Nsight Compute, here is the command line) 189 | 190 | `One note is that I will not delve too deeply into analyzing Nsight Compute but will leave it for a later article.` 191 | 192 | ``` 193 | ncu --set full -o coalesced ./a.out 194 | ``` 195 | 196 | **And you will notice a somewhat strange point:** 197 | 198 |

199 |

Coalescing

200 |

201 | 202 | 203 |

204 | 205 |

206 | 207 |

208 |

Mix

209 |

210 | 211 | 212 |

213 | 214 |

215 | 216 | Why does the Coalescing kernel show lower throughput (GB/s) and a lower L2 cache hit rate than Mix, yet finish in less total time? 217 | 218 | Here (as I speculate), the computer optimizes for us: meaning for a certain amount of bytes, it will optimize what the transfer speed needs to be. It's not always the case that higher is better because if it's too high, it can lead to: 219 | 220 | - When the data transfer rate is too high, it may cause congestion, reducing data transfer efficiency. 221 | - A high data transfer rate may also consume more energy. 222 | - In some cases, a high data transfer rate does not significantly benefit, for example, when transferring small files. 223 | 224 | `It's like shopping; the most expensive option isn't always the best, and sometimes it depends on our needs.` 225 | 226 | **That is why Mix shows more GB/s and a higher hit rate, yet is still slower overall.** 227 | 228 | **In summary: In this article, you have learned how to analyze and optimize when using global memory (and from what I've researched, 4 sectors/request is best ==> meaning we achieve coalescing when sector/request = 4).** 229 | 230 | 231 |

232 |

Exercise

233 |

234 | 235 | - Try to code a case with an offset and profile it (a small sketch of such a kernel follows below) 236 | 237 |
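For this exercise, a minimal sketch of what such an offset kernel could look like (the kernel name and the offset parameter are just placeholders I chose for illustration; allocate the buffers with some extra room, as the article does with its 2x-sized array, so the shifted accesses stay in bounds, and then profile it with the same ncu metrics as above):

```
__global__ void testOffset(int* in, int* out, int elements, int offset)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x + offset;   // every access shifted by `offset`
    out[id] = in[id];
}

// launched like the other kernels, for example:
// testOffset<<<32, 32>>>(d_in, d_out, elements, 2);
```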

238 | 239 |

240 | 241 | In the picture above, the offset is 2, and having an offset leads to going out of the cache line (meaning instead of using 1024 * 4 bytes (since it's an int) for an array, here we use 1024 * 2 * 4 bytes). 242 | 243 | - An interesting question: **(WE STILL USE GLOBAL MEMORY)** Although it is coalescing, we can still improve, so before improving, what is the reason for its slowness? 244 | 245 | Hint: 246 | - memory bound (not yet fully utilizing the computer's capabilities) 247 | 248 |

249 | 250 |

251 | 252 | - datatype between int ( 4 bytes ) and int4 ( 16 bytes ) 253 | 254 | 255 | 256 | 257 | 258 | 259 | -------------------------------------------------------------------------------- /Chapter05/coalesced.ncu-rep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CisMine/Guide-NVIDIA-Tools/e39727ad4f6f0d695f595a91852ec892ce293395/Chapter05/coalesced.ncu-rep -------------------------------------------------------------------------------- /Chapter05/coalescing.cu: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | __global__ void testCoalesced(int* in, int* out, int elements) 5 | { 6 | int id = (blockDim.x * blockIdx.x +threadIdx.x) % elements; 7 | out[id] = in[id]; 8 | } 9 | 10 | __global__ void testMixed(int* in, int* out, int elements) 11 | { 12 | int id = ((blockDim.x * blockIdx.x +threadIdx.x* 7) % elements) %elements; 13 | out[id] = in[id]; 14 | } 15 | 16 | 17 | 18 | int main() { 19 | int elements = 1024; 20 | size_t size = elements * sizeof(int); 21 | 22 | int *in, *out; 23 | int *d_in, *d_out; 24 | 25 | in = (int*)malloc(size); 26 | out = (int*)malloc(size); 27 | 28 | for (int i = 0; i < elements; i++) { 29 | in[i] = i; 30 | } 31 | 32 | cudaMalloc(&d_in, size); 33 | cudaMalloc(&d_out, size); 34 | 35 | cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice); 36 | 37 | int threadsPerBlock = 32; 38 | int blocksPerGrid = 32; 39 | 40 | 41 | 42 | testCoalesced<<>>(d_in, d_out, elements); 43 | testMixed<<>>(d_in, d_out, elements); 44 | 45 | 46 | 47 | cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost); 48 | 49 | // for(int i = 0; i < elements; i++) { 50 | // std::cout << out[i] << " "; 51 | // } 52 | 53 | 54 | 55 | cudaFree(d_in); 56 | cudaFree(d_out); 57 | free(in); 58 | free(out); 59 | 60 | return 0; 61 | } 62 | -------------------------------------------------------------------------------- /Chapter05/mix.ncu-rep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CisMine/Guide-NVIDIA-Tools/e39727ad4f6f0d695f595a91852ec892ce293395/Chapter05/mix.ncu-rep -------------------------------------------------------------------------------- /Chapter06/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | In the article on [Synchronization - Asynchronization](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter08), we mentioned the concept of **latency hiding**, a very common term when talking about CUDA. When discussing **latency hiding**, it often involves the idea of **always keeping threads busy**. Therefore, in this article, I will explain this concept in more detail as well as its operating mechanism - it can be said to be indispensable because it greatly helps us in optimizing code. 5 | 6 | `The purpose of this article is to help you better understand the operation mechanism of CUDA, so it will be quite important in the NVIDIA Tools series. However, if you are only interested in coding with CUDA at a basic level, you may skip this article.` 7 | 8 |

9 |

Warp Scheduler

10 |

11 | 12 | 13 | Before diving into the lesson, let's use an example to make it easier to visualize: 14 | 15 | Imagine there are 100 people coming to the post office to send parcels, and there's only one worker available. To successfully send a parcel, you must complete two steps: filling in the parcel information form (which takes a lot of time), then having the staff confirm the form and carry out the parcel-sending procedure (which is quite fast). Here, instead of the staff waiting for each person to finish filling out the form to proceed with the procedure, as soon as someone's turn comes, they are given a form and go somewhere else to fill it out, and once completed, they return to the queue ==> this is much faster compared to waiting for each person to fill out the form one by one. 16 | 17 | Similarly, with computers, let's assume we have a problem: y[i] += x[i] * 3. The computer also has to perform two steps: 18 | 19 | - **Memory instruction:** the time between the load/store operation being issued and the data arriving at its destination. 20 | - **Arithmetic instruction:** the time from when an arithmetic operation starts until its output is ready. 21 | 22 | Going back to the example y[i] += x[i] * 3, instead of the computer having to wait to load/store x[0] and y[0], the computer will move on to load/store x[1] and y[1], and continue doing so until x[0] and y[0] are loaded/stored before returning to compute. 23 | 24 | 25 |

26 |

1st method

27 |

28 | 29 |

30 | 31 |

32 | 33 | 34 | 35 | 36 |

37 |

2nd method

38 |

39 | 40 | 41 | 42 |

43 | 44 |

45 | 46 | 47 | **To summarize, the Warp Scheduler performs the action of swapping busy warps to save time, hence it's often referred to as latency hiding or always keeping threads busy (depending on the machine, a Warp Scheduler can control a certain number of warps).** 48 | 49 | 50 | If you find the above example similar to the candy distribution example in the article on [how computers work](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter02), then you are correct. 51 | 52 | 53 | When discussing warps, we typically encounter three states of a warp: 54 | - **Stalled:** The warp is busy executing something. 55 | - **Eligible:** The warp is idle and ready to participate. 56 | - **Selected:** The warp is chosen for execution. 57 | 58 | The idea is that after a warp is **selected**, it will execute a Memory instruction and, during the wait time, it will be swapped for another warp. Here, there can be two scenarios: the subsequent warp is either **stalled or eligible**. If it's **eligible**, that's great; if it's **stalled**, it will be swapped again until an **eligible** warp is found. 59 | 60 | **The question arises: If so, can we just create many warps so the number of eligibles increases?** 61 | 62 | **If you think so, that's a mistake**. Creating more warps means the warp scheduler has to do more work, and creating many warps (i.e., many threads) leads to a decrease in the number of registers available per thread ==> causing the SM (Streaming Multiprocessors) to run slower ==> we have to consider how many threads are appropriate to use. 63 | 64 | **If you think that if 128 people come to mail letters, we should use 128 workers, that's incorrect. Similarly, if we need to process an array of 128 elements using 128 threads (4 warps), that's a mistake.** 65 | 66 | Reason: It wastes resources and, given today's computers are very powerful, it means one worker can handle two people at once, but if we only have them handle one person at a time, it's somewhat wasteful ==> one thread handles two elements ===> reduces the number of threads initiated ==> increases registers for each thread + reduces the workload for the warp scheduler. 67 | 68 | `For the same reason, when you profile OpenCV CUDA code with Nsight Systems, you will see very few threads being used. Here is the example using opencv cuda to add 2 images` 69 | 70 | ``` 71 | #include "opencv2/opencv.hpp" 72 | #include 73 | 74 | cv::Mat opencv_add(const cv::Mat &img1, const cv::Mat &img2) 75 | { 76 | cv::cuda::GpuMat d_img1, d_img2, d_result; 77 | 78 | d_img1.upload(img1); 79 | d_img2.upload(img2); 80 | 81 | cv::cuda::add(d_img1, d_img2, d_result); 82 | 83 | cv::Mat result; 84 | d_result.download(result); 85 | 86 | return result; 87 | } 88 | int main() 89 | { 90 | cv::Mat img1 = cv::imread("circles.png"); 91 | cv::Mat img2 = cv::imread("cameraman.png"); 92 | 93 | cv::Mat result = opencv_add(img1, img2); 94 | 95 | cv::imshow("Result", result); 96 | 97 | cv::waitKey(); 98 | 99 | return 0; 100 | } 101 | ``` 102 | 103 | `Then profile by using Nsight system to see the kernel by this command:` 104 | 105 | ``` 106 | nsys profile -o test ./a.out 107 | ``` 108 |

109 | 110 |
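To make the "one thread handles more than one element" idea concrete, here is a minimal grid-stride sketch of the y[i] += x[i] * 3 example (the kernel name and launch numbers are illustrative, not taken from the article): it is launched with far fewer threads than elements, and each thread walks over several elements.

```
__global__ void scaleAdd(float* y, const float* x, int n)
{
    // fewer threads than elements: each thread processes every `stride`-th element
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
    {
        y[i] += x[i] * 3.0f;
    }
}

// e.g. scaleAdd<<<64, 128>>>(d_y, d_x, n);   // 8192 threads in total, even if n is much larger
```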

111 | 112 | 113 | 114 | **The question arises: So how many threads should we use?** 115 | 116 | ==> It depends on the configuration of each computer as well as the reasons causing warps to be stalled. In the next article, I will analyze the reasons causing warp stalls as well as how to determine the appropriate number of threads. 117 | 118 | 119 | 120 | 121 | -------------------------------------------------------------------------------- /Chapter06/cameraman.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CisMine/Guide-NVIDIA-Tools/e39727ad4f6f0d695f595a91852ec892ce293395/Chapter06/cameraman.png -------------------------------------------------------------------------------- /Chapter06/circles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CisMine/Guide-NVIDIA-Tools/e39727ad4f6f0d695f595a91852ec892ce293395/Chapter06/circles.png -------------------------------------------------------------------------------- /Chapter06/opencv_add.cpp: -------------------------------------------------------------------------------- 1 | #include "opencv2/opencv.hpp" 2 | #include 3 | 4 | cv::Mat opencv_add(const cv::Mat &img1, const cv::Mat &img2) 5 | { 6 | cv::cuda::GpuMat d_img1, d_img2, d_result; 7 | 8 | d_img1.upload(img1); 9 | d_img2.upload(img2); 10 | 11 | cv::cuda::add(d_img1, d_img2, d_result); 12 | 13 | cv::Mat result; 14 | d_result.download(result); 15 | 16 | return result; 17 | } 18 | 19 | int main() 20 | { 21 | cv::Mat img1 = cv::imread("circles.png"); 22 | cv::Mat img2 = cv::imread("cameraman.png"); 23 | 24 | cv::Mat result = opencv_add(img1, img2); 25 | 26 | cv::imshow("Result", result); 27 | 28 | cv::waitKey(); 29 | 30 | return 0; 31 | } 32 | -------------------------------------------------------------------------------- /Chapter06/test.nsys-rep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CisMine/Guide-NVIDIA-Tools/e39727ad4f6f0d695f595a91852ec892ce293395/Chapter06/test.nsys-rep -------------------------------------------------------------------------------- /Chapter07/Occupancy.cu: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | const int manualBlockSize = 1024; 4 | 5 | //////////////////////////////////////////////////////////////////////////////// 6 | // Test kernel 7 | // 8 | // This kernel squares each array element. Each thread addresses 9 | // himself with threadIdx and blockIdx, so that it can handle any 10 | // execution configuration, including anything the launch configurator 11 | // API suggests. 12 | //////////////////////////////////////////////////////////////////////////////// 13 | __global__ void square(int *array, int N) 14 | { 15 | int idx = threadIdx.x + blockIdx.x * blockDim.x; 16 | 17 | if (idx < N) 18 | { 19 | array[idx] *= array[idx]; 20 | } 21 | } 22 | 23 | //////////////////////////////////////////////////////////////////////////////// 24 | // Potential occupancy calculator 25 | // 26 | // The potential occupancy is calculated according to the kernel and 27 | // execution configuration the user desires. Occupancy is defined in 28 | // terms of active blocks per multiprocessor, and the user can convert 29 | // it to other metrics. 30 | // 31 | // This wrapper routine computes the occupancy of kernel, and reports 32 | // it in terms of active warps / maximum warps per SM. 
33 | //////////////////////////////////////////////////////////////////////////////// 34 | static double reportPotentialOccupancy(void *kernel, int blockSize, size_t dynamicSMem) 35 | { 36 | int device; 37 | cudaDeviceProp prop; 38 | 39 | int numBlocks; 40 | int activeWarps; 41 | int maxWarps; 42 | 43 | double occupancy; 44 | 45 | (cudaGetDevice(&device)); 46 | (cudaGetDeviceProperties(&prop, device)); 47 | 48 | (cudaOccupancyMaxActiveBlocksPerMultiprocessor( 49 | &numBlocks, 50 | kernel, 51 | blockSize, 52 | dynamicSMem)); 53 | 54 | activeWarps = numBlocks * blockSize / prop.warpSize; 55 | maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize; 56 | 57 | occupancy = (double)activeWarps / maxWarps; 58 | 59 | return occupancy; 60 | } 61 | 62 | //////////////////////////////////////////////////////////////////////////////// 63 | // Occupancy-based launch configurator 64 | // 65 | // The launch configurator, cudaOccupancyMaxPotentialBlockSize and 66 | // cudaOccupancyMaxPotentialBlockSizeVariableSMem, suggests a block 67 | // size that achieves the best theoretical occupancy. It also returns 68 | // the minimum number of blocks needed to achieve the occupancy on the 69 | // whole device. 70 | // 71 | // This launch configurator is purely occupancy-based. It doesn't 72 | // translate directly to performance, but the suggestion should 73 | // nevertheless be a good starting point for further optimizations. 74 | // 75 | // This function configures the launch based on the "automatic" 76 | // argument, records the runtime, and reports occupancy and runtime. 77 | //////////////////////////////////////////////////////////////////////////////// 78 | static int launchConfig(int *array, int arrayCount, bool automatic) 79 | { 80 | int blockSize; 81 | int minGridSize; 82 | int gridSize; 83 | size_t dynamicSMemUsage = 0; 84 | 85 | cudaEvent_t start; 86 | cudaEvent_t end; 87 | 88 | float elapsedTime; 89 | 90 | double potentialOccupancy; 91 | 92 | (cudaEventCreate(&start)); 93 | (cudaEventCreate(&end)); 94 | 95 | if (automatic) 96 | { 97 | (cudaOccupancyMaxPotentialBlockSize( 98 | &minGridSize, 99 | &blockSize, 100 | (void *)square, 101 | dynamicSMemUsage, 102 | arrayCount)); 103 | 104 | std::cout << "Suggested block size: " << blockSize << std::endl 105 | << "Minimum grid size for maximum occupancy: " << minGridSize << std::endl; 106 | } 107 | else 108 | { 109 | // This block size is too small. Given limited number of 110 | // active blocks per multiprocessor, the number of active 111 | // threads will be limited, and thus unable to achieve maximum 112 | // occupancy. 
113 | // 114 | blockSize = manualBlockSize; 115 | } 116 | 117 | // Round up 118 | // 119 | gridSize = (arrayCount + blockSize - 1) / blockSize; 120 | 121 | // Launch and profile 122 | // 123 | (cudaEventRecord(start)); 124 | square<<>>(array, arrayCount); 125 | (cudaEventRecord(end)); 126 | 127 | (cudaDeviceSynchronize()); 128 | 129 | // Calculate occupancy 130 | // 131 | potentialOccupancy = reportPotentialOccupancy((void *)square, blockSize, dynamicSMemUsage); 132 | 133 | std::cout << "Potential occupancy: " << potentialOccupancy * 100 << "%" << std::endl; 134 | 135 | // Report elapsed time 136 | // 137 | (cudaEventElapsedTime(&elapsedTime, start, end)); 138 | std::cout << "Elapsed time: " << elapsedTime << "ms" << std::endl; 139 | 140 | return 0; 141 | } 142 | 143 | //////////////////////////////////////////////////////////////////////////////// 144 | // The test 145 | // 146 | // The test generates an array and squares it with a CUDA kernel, then 147 | // verifies the result. 148 | //////////////////////////////////////////////////////////////////////////////// 149 | static int test(bool automaticLaunchConfig, const int count = 1000000) 150 | { 151 | int *array; 152 | int *dArray; 153 | int size = count * sizeof(int); 154 | 155 | array = new int[count]; 156 | 157 | for (int i = 0; i < count; i += 1) 158 | { 159 | array[i] = i; 160 | } 161 | 162 | (cudaMalloc(&dArray, size)); 163 | (cudaMemcpy(dArray, array, size, cudaMemcpyHostToDevice)); 164 | 165 | for (int i = 0; i < count; i += 1) 166 | { 167 | array[i] = 0; 168 | } 169 | 170 | launchConfig(dArray, count, automaticLaunchConfig); 171 | 172 | (cudaMemcpy(array, dArray, size, cudaMemcpyDeviceToHost)); 173 | (cudaFree(dArray)); 174 | 175 | // Verify the return data 176 | // 177 | for (int i = 0; i < count; i += 1) 178 | { 179 | if (array[i] != i * i) 180 | { 181 | std::cout << "element " << i << " expected " << i * i << " actual " << array[i] << std::endl; 182 | return 1; 183 | } 184 | } 185 | 186 | (cudaDeviceReset()); 187 | 188 | delete[] array; 189 | 190 | return 0; 191 | } 192 | 193 | //////////////////////////////////////////////////////////////////////////////// 194 | // Sample Main 195 | // 196 | // The sample runs the test with manually configured launch and 197 | // automatically configured launch, and reports the occupancy and 198 | // performance. 
199 | //////////////////////////////////////////////////////////////////////////////// 200 | int main() 201 | { 202 | int status; 203 | 204 | std::cout << "starting Simple Occupancy" << std::endl 205 | << std::endl; 206 | 207 | std::cout << "[ Manual configuration with " << manualBlockSize 208 | << " threads per block ]" << std::endl; 209 | 210 | status = test(false); 211 | if (status) 212 | { 213 | std::cerr << "Test failed\n" 214 | << std::endl; 215 | return -1; 216 | } 217 | 218 | std::cout << std::endl; 219 | 220 | std::cout << "[ Automatic, occupancy-based configuration ]" << std::endl; 221 | status = test(true); 222 | if (status) 223 | { 224 | std::cerr << "Test failed\n" 225 | << std::endl; 226 | return -1; 227 | } 228 | 229 | std::cout << std::endl; 230 | std::cout << "Test PASSED\n" 231 | << std::endl; 232 | 233 | return 0; 234 | } 235 | -------------------------------------------------------------------------------- /Chapter07/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | In lesson 6, I discussed the issue of **how to select the suitable number of threads.** In this article, I will share a quite common method to determine this. Many of you might wonder **why we don't just simplify the problem by running multiple cases with different thread counts to determine the appropriate number.** This approach is only suitable if your code is simple because, if it is complex, each run will take a long time. Therefore, running multiple cases to choose the appropriate number of threads is not a good choice. 5 | 6 |

7 |

Occupancy Part 1

8 |

9 | 10 | Before diving into the article, let me give an example to help you understand what occupancy is and its utility. 11 | 12 | For instance, we have 6 workers and 6 tasks. The simplest way to distribute the work is to assign each worker one task. However, if each worker has the capability to handle 3 tasks simultaneously, then we only need 2 workers for the 6 tasks. **This results in hiring fewer workers, which costs less money, and the number of tasks is always greater than the number of workers, so optimizing workers for the tasks is necessary.** 13 | 14 | Here, the workers are threads and the tasks are data. The question arises: how do we determine how many tasks each worker can handle (how much data a thread can process)? This is where NVIDIA's Occupancy metric comes into play. 15 | 16 | **Occupancy is used to determine the optimal number of threads to be used in a kernel to achieve the highest performance.** 17 | 18 | 19 |

20 |

Code

21 |

22 | 23 | With N = 1000000 24 | 25 | ``` 26 | __global__ void square(int *array, int N) 27 | { 28 | int idx = threadIdx.x + blockIdx.x * blockDim.x; 29 | 30 | if (idx < N) 31 | { 32 | array[idx] *= array[idx]; 33 | } 34 | } 35 | ``` 36 | 37 | Normally, we would use something like this: 38 | 39 | ``` 40 | blockSize = 1024; 41 | gridSize = (N + blockSize - 1) / blockSize; 42 | ``` 43 | However, whether the number 1024 is optimal or not, we need to profile to check the occupancy. 44 | 45 | ``` 46 | ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./a.out 47 | ``` 48 | 49 |
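Before looking at the profiler output below, note that the block size does not have to be hard-coded at all: the same cudaOccupancyMaxPotentialBlockSize call introduced a few lines further down can pick it for us. A small self-contained sketch (simplified from Occupancy.cu above; error checking omitted):

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(int *array, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) array[idx] *= array[idx];
}

int main()
{
    const int N = 1000000;
    int *d_array;
    cudaMalloc(&d_array, N * sizeof(int));

    int minGridSize = 0, blockSize = 0;
    // ask the runtime for the block size that maximizes theoretical occupancy of `square`
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, square, 0, N);

    int gridSize = (N + blockSize - 1) / blockSize;
    printf("suggested blockSize = %d, gridSize = %d\n", blockSize, gridSize);

    square<<<gridSize, blockSize>>>(d_array, N);
    cudaDeviceSynchronize();
    cudaFree(d_array);
    return 0;
}
```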

50 | 51 |

52 | 53 | 54 | It can be seen that using 1024 threads is a waste of resources (because we are using the maximum number of threads per block, but the occupancy is only 53.73%) ==> This indicates that one thread can handle more than one task. 55 | 56 | NVIDIA has created a function to determine the appropriate occupancy. 57 | 58 | ``` 59 | template 60 | __inline__ __host__ CUDART_DEVICE cudaError_t cudaOccupancyMaxPotentialBlockSize( 61 | int *minGridSize, 62 | int *blockSize, 63 | T func, 64 | size_t dynamicSMemSize = 0, 65 | int blockSizeLimit = 0) 66 | { 67 | return cudaOccupancyMaxPotentialBlockSizeVariableSMem(minGridSize, blockSize, func, __cudaOccupancyB2DHelper(dynamicSMemSize), blockSizeLimit); 68 | } 69 | 70 | minGridSize = Suggested min grid size to achieve a full machine launch. 71 | blockSize = Suggested block size to achieve maximum occupancy. 72 | func = Kernel function. 73 | dynamicSMemSize = Size of dynamically allocated shared memory. Of course, it is known at runtime before any kernel launch. The size of the statically allocated shared memory is not needed as it is inferred by the properties of func. 74 | blockSizeLimit = Maximum size for each block. In the case of 1D kernels, it can coincide with the number of input elements. 75 | ``` 76 | 77 | And here are the results when using Occupancy to determine the number of threads 78 | 79 |

80 | 81 |

82 | 83 | 84 | 85 |

86 | 87 |

88 | 89 | **One small note is that each computer has different configurations, leading to different numbers of threads needed to achieve 100% occupancy.** 90 | 91 | Here, you might wonder why there are two different occupancy values: one is 74.89% and the other is 100%, even though the same code is being used. 92 | 93 | This introduces a new concept called **Theoretical Occupancy vs. Achieved Occupancy.** You can understand these simply as the theoretical value expected vs. the actual value when the code runs. 94 | 95 | The reason why Theoretical Occupancy and Achieved Occupancy yield different results is that when the code runs, threads are influenced by many other factors. 96 | 97 | 98 |

99 | 100 |

101 | 102 | 103 | **We should focus on Achieved Occupancy rather than worrying too much about Theoretical Occupancy.** 104 | Example: 105 | - Theoretical 100% and Achieved 50% 106 | - Theoretical 80% and Achieved 70% 107 | 108 | In this case, we should choose the second scenario. 109 | 110 | In the upcoming lessons, I will guide you through methods to increase Achieved Occupancy. 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | -------------------------------------------------------------------------------- /Chapter08/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 |

4 |

Occupancy Part 2

5 |

6 | 7 | In the first part, I introduced occupancy; in this second part, I'll delve deeper into how to improve achieved occupancy. 8 | 9 | Before we dive into the lesson, let me explain two important concepts for this article: 10 | 11 | - Tail Effect: If the total number of threads isn't divisible by the warp size (32), the tail effect occurs. The tail effect is the remaining threads that run last in the warp. The fewer the remaining threads, the more significant the tail effect, leading to slower program execution. 12 | - For example, if we have data with N = 80 and 40 threads, we would need two warps to assign 40 threads, meaning the second warp would only use 8 threads, wasting resources. Given N = 80, we would need 4 warps when only 3 warps should suffice. 13 | 14 | - Waves: a wave is the set of blocks that can be resident on all SMs at the same time (so one full wave = number of SMs * active blocks per SM). 15 | 16 | 17 | 18 |

19 |

Achieved Occupancy

20 |

21 | 22 | 23 | **Theoretical occupancy** gives us the upper bound of active warps per SM. However, in practice, threads in blocks may execute at different speeds and complete their executions at different times. Thus, the actual number of active warps per SM fluctuates over time, depending on how the threads in the blocks execute. 24 | 25 | This brings us to a new concept - **achieved occupancy**, which addresses this issue: **achieved occupancy** looks at warp schedulers and uses hardware performance counters to determine the number of active warps per clock cycle. 26 | 27 | You can refer to the [warp schedulers](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter06) article for a better understanding 28 | 29 | 30 | 31 |

32 |

Causes of Low Achieved Occupancy

33 |

34 | 35 | Achieved occupancy cannot exceed theoretical occupancy, so the first step toward increasing occupancy should be to increase theoretical occupancy by adjusting the limiting factors, for example using **cudaOccupancyMaxPotentialBlockSize** to gain 100% Theoretical occupancy. The next step is to check if the achieved value is close to the theoretical value. The achieved value will be lower than the theoretical value when the theoretical number of active warps is not maintained for the full time the SM is active. This occurs in the following situations 36 | 37 | 38 |

39 |

Unbalanced workload within blocks

40 |

41 | 42 | If the warps within a block do not execute simultaneously, we encounter an Unbalanced issue within the block. This can be understood as having too many threads in one block, leading to some warps being stalled because each thread requires a certain number of registers. When we use many threads in one block, it results in less utilization of the last warps, leading to the appearance of the tail effect. 43 | 44 |

45 | 46 |

47 | 48 | 49 | Instead of using the maximum number of threads in one block (1024 threads), consider and choose a number that is appropriate for the data. 50 | 51 |

52 |

Unbalanced workload across blocks

53 |

54 | 55 | If the blocks do not execute simultaneously within each SM, we also encounter an Unbalanced issue within the grid. This can be understood as having too many blocks per SM, leading to stalls. We can address this by adjusting the number of blocks in the grid or using streams to run the kernels concurrently. 56 | 57 | You can review the [streaming](https://github.com/CisMine/Parallel-Computing-Cuda-C/tree/main/Chapter11) section to understand this better. To answer the question of how many streams are appropriate, divide in a way that minimizes the tail effect of each thread in the block. 58 | 59 | 60 |

61 |

Too few blocks launched

62 |

63 | 64 | We are not utilizing the maximum capacity of the SMs (the number of blocks used is less than the number of blocks that can run simultaneously in an SM). The phenomenon of full waves - full warps occurs when the total number of SMs multiplied by the total number of active warps per SM is achieved. 65 | 66 | For example, if we have 15 SMs and 100% theoretical occupancy with 4 blocks per SM, the full waves would be 60 blocks. If we only use 45 blocks, we would only achieve 75% achieved occupancy. 67 | 68 | 69 | **However, never focus too much on improving occupancy, because high occupancy is not necessarily good** 70 | 71 | 72 | 73 |

74 | 75 |

76 | 77 | As I said in previous articles, each thread will have a certain number of registers. If the occupancy is higher ==> multiple threads are used ==> the number of registers per thread decreases ==> the thread's computational ability also decreases. 78 | 79 | `For example: if 6 workers do 6 jobs, if each worker is divided equally into 1 job ==> the worker will be comfortable, if 2 workers do 6 jobs, the worker will use more energy ==> so we We need to determine whether the job is heavy or not to know how many workers are appropriate` 80 | 81 | **In conclusion: before improving occupancy, consider whether our algorithm is complex and then use the number of threads accordingly.** 82 | 83 | The question is how do we determine what is complex and what is simple? Because each computer has a different processing speed, it cannot be generalized? 84 | 85 | In the following articles I will talk more about this 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | -------------------------------------------------------------------------------- /Chapter09/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | In this article, I will introduce three critical concepts in profiling: Bandwidth, Throughput, and Latency. 5 | 6 |

7 |

Bandwidth - Throughput - Latency

8 |

9 | 10 | When evaluating a piece of code or a program, three important concepts need to be considered: **bandwidth, throughput, and latency**. However, it is easy to get confused when only one of these pieces of information is provided without the others, leading to an inaccurate assessment of performance. Since different computers can have varying latency or bandwidth, providing only one piece of information will not reflect the true performance of the code 11 | 12 |

13 |

Latency

14 |

15 | 16 | **Latency (s):** is the time taken to complete a task. An extremely important note is that profiling (e.g., using **cudaEvent_t start, stop**) can affect performance, so profiling code should be removed during the final run of the program. 17 | 18 | Instead of using cudaEvent_t start, stop to check latency, we can use Nsight System with the command: 19 | 20 | ``` 21 | nsys profile -o timeline --trace cuda,nvtx,osrt,openacc ./a.out 22 | ``` 23 | 24 | This way, everything running on the GPU will be measured in detail. 25 | 26 | 27 | 28 |

29 | 30 |

31 | 32 | Based on the article [Introduction to Nsight Systems - Nsight Compute](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter01) we can identify that our code needs improvement in data allocation for the GPU (cudaMalloc). 33 | 34 |

35 |

Bandwidth

36 |

37 | 38 | **Bandwidth (GB/s):** represents the data transfer speed. 39 | 40 | When discussing bandwidth, we refer to two concepts: 41 | 42 | - **Theoretical Peak Bandwidth:** The theoretical ideal speed for data transfer. 43 | 44 |
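The screenshots below give the exact formula; as a quick worked example (the memory clock and bus width here are hypothetical illustrative numbers, not my card's real specifications):

```
Theoretical Peak Bandwidth (GB/s) = memory clock (Hz) * (bus width (bits) / 8) * 2 (if DDR) / 10^9

e.g. a hypothetical 3,000 MHz DDR memory clock on a 128-bit bus:
3.0e9 * (128 / 8) * 2 / 1e9 = 96 GB/s
```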

45 | 46 |

47 | 48 | 49 |

50 | 51 |

52 | 53 |

54 | 55 |

56 | 57 | ``` 58 | Divide by 8 to convert from bits to bytes. 59 | 60 | Divide by 10^9 to convert to GB/s. 61 | 62 | DDR (Double Data Rate): multiply by 2. 63 | 64 | SDR (Single Data Rate): multiply by 1. 65 | ``` 66 | 67 | To determine DDR or SDR, use the command: 68 | 69 | ``` 70 | sudo lshw -c memory 71 | ``` 72 | 73 | Since my machine is DDR: 74 | 75 |

76 | 77 |
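If you prefer to compute the theoretical peak directly from the device properties that device_detail.cu prints (memory clock in kHz, bus width in bits), a minimal sketch (assuming DDR memory, hence the factor of 2) looks like this:

```
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is reported in kHz, memoryBusWidth in bits.
    // DDR memory transfers data twice per clock, hence the factor of 2.
    double peakGB = 2.0 * (prop.memoryClockRate * 1e3)   // clocks per second
                        * (prop.memoryBusWidth / 8.0)    // bytes per transfer
                        / 1e9;                           // bytes/s -> GB/s
    printf("Theoretical Peak Bandwidth: %.1f GB/s\n", peakGB);
    return 0;
}
```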

78 | 79 | 80 | - **Effective Bandwidth:** The actual data transfer speed of the kernel. 81 | 82 | 83 |

84 | 85 |

86 | 87 | 88 | - R(B): number of bytes read by the kernel 89 | 90 | - W(B): number of bytes written by the kernel 91 | 92 | - t(s): latency (the kernel's elapsed time, in seconds) 93 | 94 | 95 | 96 |
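Putting those three quantities together (this simply restates the formula used throughout this section, with R and W in bytes and t in seconds):

```
Effective Bandwidth (GB/s) = (R + W) / (t * 10^9)
```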

97 |

Code

98 |

99 | 100 | 101 | We'll implement this: y[i] = a*x[i] + y[i] with N = 20 * (1 << 20) 102 | 103 | 104 | ``` 105 | __global__ 106 | void saxpy(int n, float a, float *x, float *y) 107 | { 108 | int i = blockIdx.x*blockDim.x + threadIdx.x; 109 | if (i < n) y[i] = a*x[i] + y[i]; 110 | } 111 | ``` 112 | 113 | ``` 114 | cudaEvent_t start, stop; 115 | cudaEventCreate(&start); 116 | cudaEventCreate(&stop); 117 | 118 | cudaEventRecord(start); 119 | 120 | // Perform SAXPY on N elements 121 | saxpy<<<(N+511)/512, 512>>>(N, 2.0f, d_x, d_y); 122 | 123 | cudaEventRecord(stop); 124 | 125 | cudaEventSynchronize(stop); 126 | float milliseconds = 0; 127 | cudaEventElapsedTime(&milliseconds, start, stop); 128 | 129 | printf("time: %f\n", milliseconds); 130 | printf("Effective Bandwidth (GB/s): %f", N*4*3/milliseconds/1e6); 131 | } 132 | ``` 133 | 134 | Using the formula, R + W = N * 3 accesses (read x + read y + write y) * 4 bytes (1 float = 4 bytes). 135 | 136 | And this is the output: 137 | 138 |

139 | 140 |

141 | 142 | But if we profile using Nsight compute with this command: 143 | 144 | ``` 145 | ncu -o profile --set full ./a.out 146 | ``` 147 | 148 | We'll see that 149 | 150 |

151 | 152 |

153 | 154 | 155 |

156 | 157 |

158 | 159 | 160 |

161 | 162 |

163 | 164 | From here we can calculate that 165 | 166 |

167 | 168 |

169 | 170 | 171 | As I mentioned above, profiling with cudaEvent_t start, stop affects performance: the measured bandwidth drops from **91 GB/s to 87.8 GB/s.** 172 | 173 | We can also work backwards from these numbers to estimate the Theoretical Peak Bandwidth. 174 | 175 |

176 | 177 |

178 | 179 | 180 | The slight deviation from the formula above (**96 GB/s to 95.8 GB/s**) is due to factors such as memory and kernel impact. 181 | 182 | From this, we can conclude that our code is **very efficient in terms of data transfer**, as there is **no significant gap between Theoretical and Effective bandwidth.** 183 | 184 | You can determine the effective bandwidth more quickly using the command: 185 | 186 | ``` 187 | Load: ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second ./a.out 188 | Store: ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second ./a.out 189 | ``` 190 | 191 |

192 | 193 |

194 | 195 | 196 |

197 | 198 |

199 | 200 | We can see that 60.68 + 30.40 = 91.08 GB/s, which is close to 96 GB/s, indicating that the code is efficient. 201 | 202 | 203 |

204 |

Computational Throughput

205 |

206 | 207 | **Throughput(GFLOP/s):** refers to the number of Floating Point Operations (FLOPs) a kernel can perform in one second. 208 | 209 | `A FLOP is a floating point operation, which includes basic arithmetic operations such as addition, subtraction, multiplication, division, as well as more complex operations like square roots, sine, cosine, etc.` 210 | 211 |
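For the SAXPY kernel above, each element costs one multiply and one add, i.e. 2 FLOPs, so a rough throughput figure can be printed next to the bandwidth one. This is my own addition, reusing the `milliseconds` value measured in the code earlier; the original demo does not print it:

```
// SAXPY does 2 FLOPs per element (one multiply, one add), so total FLOPs = 2 * N.
// GFLOP/s = 2 * N / (milliseconds * 1e6), analogous to the bandwidth print above.
printf("Throughput (GFLOP/s): %f\n", 2.0 * N / milliseconds / 1e6);
```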

212 | 213 |

214 | 215 | The question posed is whether to improve **bandwidth or throughput**, as improving one might affect the other. How can we determine the optimal balance? 216 | 217 | `Increasing bandwidth ==> more data is read/written ==> the compute load increases ==> throughput decreases, and vice versa` 218 | 219 | To answer this question, we can use a technique called the **roofline chart**, which I will discuss in future articles. 220 | 221 |

222 | 223 |

224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | -------------------------------------------------------------------------------- /Chapter09/demo.cu: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | 4 | __global__ 5 | void saxpy(int n, float a, float *x, float *y) 6 | { 7 | int i = blockIdx.x*blockDim.x + threadIdx.x; 8 | if (i < n) y[i] = a*x[i] + y[i]; 9 | } 10 | 11 | int main(void) 12 | { 13 | int N = 20 * (1 << 20); 14 | float *x, *y, *d_x, *d_y; 15 | x = (float*)malloc(N*sizeof(float)); 16 | y = (float*)malloc(N*sizeof(float)); 17 | 18 | cudaMalloc(&d_x, N*sizeof(float)); 19 | cudaMalloc(&d_y, N*sizeof(float)); 20 | 21 | for (int i = 0; i < N; i++) { 22 | x[i] = 1.0f; 23 | y[i] = 2.0f; 24 | } 25 | 26 | cudaEvent_t start, stop; 27 | cudaEventCreate(&start); 28 | cudaEventCreate(&stop); 29 | 30 | cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice); 31 | cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice); 32 | 33 | cudaEventRecord(start); 34 | 35 | // Perform SAXPY on 1M elements 36 | saxpy<<<(N+511)/512, 512>>>(N, 2.0f, d_x, d_y); 37 | 38 | cudaEventRecord(stop); 39 | 40 | cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); 41 | 42 | cudaEventSynchronize(stop); 43 | float milliseconds = 0; 44 | cudaEventElapsedTime(&milliseconds, start, stop); 45 | 46 | float maxError = 0.0f; 47 | for (int i = 0; i < N; i++) { 48 | maxError = max(maxError, abs(y[i]-4.0f)); 49 | } 50 | 51 | printf("time: %f\n", milliseconds); 52 | printf("Effective Bandwidth (GB/s): %f\n", N*4*3/milliseconds/1e6); 53 | } 54 | 55 | -------------------------------------------------------------------------------- /Chapter09/device_detail.cu: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | int main(int argc, char **argv) 5 | { 6 | printf("%s Starting...\n", argv[0]); 7 | int deviceCount = 0; 8 | cudaError_t error_id = cudaGetDeviceCount(&deviceCount); 9 | 10 | if (error_id != cudaSuccess) 11 | { 12 | printf("cudaGetDeviceCount returned %d\n-> %s\n", 13 | (int)error_id, cudaGetErrorString(error_id)); 14 | printf("Result = FAIL\n"); 15 | exit(EXIT_FAILURE); 16 | } 17 | 18 | if (deviceCount == 0) 19 | { 20 | printf("There are no available devices that support CUDA\n"); 21 | } 22 | else 23 | { 24 | printf("Detected %d CUDA Capable device(s)\n", deviceCount); 25 | } 26 | 27 | int dev, driverVersion = 0, runtimeVersion = 0; 28 | dev = 0; 29 | cudaSetDevice(dev); 30 | cudaDeviceProp deviceProp; 31 | cudaGetDeviceProperties(&deviceProp, dev); 32 | printf("Device %d: \"%s\"\n", dev, deviceProp.name); 33 | cudaDriverGetVersion(&driverVersion); 34 | cudaRuntimeGetVersion(&runtimeVersion); 35 | printf("CUDA Driver Version / Runtime Version: %d.%d / %d.%d\n", 36 | driverVersion / 1000, 37 | (driverVersion % 100) / 10, 38 | runtimeVersion / 1000, 39 | (runtimeVersion % 100) / 10); 40 | printf("CUDA Capability Major/Minor version number: %d.%d\n", 41 | deviceProp.major, 42 | deviceProp.minor); 43 | printf("Total amount of global memory: %.2f MBytes (%llu bytes)\n", 44 | (float)deviceProp.totalGlobalMem / (pow(1024.0, 3)), 45 | (unsigned long long)deviceProp.totalGlobalMem); 46 | printf("GPU Clock rate: %.0f MHz (%0.2f GHz)\n", 47 | deviceProp.clockRate * 1e-3f, 48 | deviceProp.clockRate * 1e-6f); 49 | printf("Memory Clock rate: %.0f MHz\n", 50 | 
deviceProp.memoryClockRate * 1e-3f); 51 | printf("Memory Bus Width: %d-bit\n", 52 | deviceProp.memoryBusWidth); 53 | if (deviceProp.l2CacheSize) 54 | { 55 | printf("L2 Cache Size: %d bytes\n", 56 | deviceProp.l2CacheSize); 57 | } 58 | printf("Max Texture Dimension Size (x,y,z)\n" 59 | "1D = (%d), 2D = (%d, %d), 3D = (%d, %d, %d)\n", 60 | deviceProp.maxTexture1D, 61 | deviceProp.maxTexture2D[0], 62 | deviceProp.maxTexture2D[1], 63 | deviceProp.maxTexture3D[0], 64 | deviceProp.maxTexture3D[1], 65 | deviceProp.maxTexture3D[2]); 66 | printf("Max Layered Texture Size (dim) x layers\n" 67 | "1D = (%d) x %d, 2D = (%d, %d) x %d\n", 68 | deviceProp.maxTexture1DLayered[0], 69 | deviceProp.maxTexture1DLayered[1], 70 | deviceProp.maxTexture2DLayered[0], 71 | deviceProp.maxTexture2DLayered[1], 72 | deviceProp.maxTexture2DLayered[2]); 73 | printf("Total amount of constant memory: %lu bytes\n", 74 | deviceProp.totalConstMem); 75 | printf("Total amount of shared memory per block: %lu bytes\n", 76 | deviceProp.sharedMemPerBlock); 77 | printf("Total number of registers available per block: %d\n", 78 | deviceProp.regsPerBlock); 79 | printf("Warp size: %d\n", deviceProp.warpSize); 80 | printf("Maximum number of threads per multiprocessor: %d\n", 81 | deviceProp.maxThreadsPerMultiProcessor); 82 | printf("Maximum number of threads per block: %d\n", 83 | deviceProp.maxThreadsPerBlock); 84 | printf("Maximum sizes of each dimension of a block: (%d, %d, %d)\n", 85 | deviceProp.maxThreadsDim[0], 86 | deviceProp.maxThreadsDim[1], 87 | deviceProp.maxThreadsDim[2]); 88 | printf("Maximum sizes of each dimension of a grid: (%d, %d, %d)\n", 89 | deviceProp.maxGridSize[0], 90 | deviceProp.maxGridSize[1], 91 | deviceProp.maxGridSize[2]); 92 | printf("Maximum memory pitch: %lu bytes\n", deviceProp.memPitch); 93 | 94 | exit(EXIT_SUCCESS); 95 | } 96 | 97 | -------------------------------------------------------------------------------- /Chapter09/profile.ncu-rep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CisMine/Guide-NVIDIA-Tools/e39727ad4f6f0d695f595a91852ec892ce293395/Chapter09/profile.ncu-rep -------------------------------------------------------------------------------- /Chapter10/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 |

4 |

Compute Bound & Memory Bound

5 |

6 | 7 | 8 | In any program, we need to do 2 things: 9 | - Bring data from memory 10 | - Perform computation on the data 11 | 12 | In other words: when discussing the performance of a piece of code, we consider two main concepts: **memory and compute**. 13 | 14 |

15 |

What are memory and compute, and why are they so important?

16 |

17 | 18 | 19 |

20 |

Memory - Compute

21 |

22 | 23 | 24 | - Compute: Refers to computational power, often measured using a popular metric called the **FLOPS rate (floating point operations per second)**. It quantifies how many floating-point calculations a computer can execute in one second. 25 | 26 |

27 | 28 |

29 | 30 | 31 | - Memory: This doesn’t refer to the total memory used but rather the memory bandwidth (GB/s), which is the rate at which data can be loaded or stored between memory and processing components. 32 | 33 |

34 | 35 |

36 | 37 | 38 |

39 |

How do we determine a good FLOPS rate or Memory Bandwidth?

40 |

41 | 42 | 43 |

44 |

Desired Compute to Memory Ratio (OP/B)

45 |

46 | 47 | 48 |

49 | 50 |

50 | 51 | 52 | This is a critical metric for balancing a computer's processing power and its load/store capabilities in memory. 53 | 54 | ``` 55 | Why balance them? - Balancing ensures efficient hardware resource utilization, avoiding performance bottlenecks. 56 | ``` 57 | 58 | - If OP/B is low: The system is handling heavy computational tasks, but the computational power is limited. This leads to a **compute-bound scenario.** 59 | - If OP/B is high: The system cannot supply enough data for processing, leading to data starvation or a **memory-bound scenario.** 60 | 61 | 62 |

63 |

What are Compute/Memory Bound, and how can we identify and resolve them?

64 |

65 | 66 | - Compute-bound: Occurs when a computer's performance is limited by its computational capacity. This is common when executing complex calculations. 67 | - Memory-bound: Occurs when performance is limited by the ability to access data from memory. This usually happens when a large volume of data has to be loaded/stored relative to the amount of computation (see the toy kernels sketched below). 68 | 69 | 70 |
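Two toy kernels (my own illustration, not from the original article) make the distinction concrete. The first moves data with almost no math, while the second does thousands of FLOPs per element it touches:

```
// Memory-bound: one read and one write per element, almost no arithmetic.
__global__ void copyKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Compute-bound: the same single read/write pair, but ~2000 FLOPs per element.
__global__ void heavyMath(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        for (int k = 0; k < 1000; ++k)
            v = fmaf(v, 1.000001f, 0.5f);   // fused multiply-add: 2 FLOPs per iteration
        out[i] = v;
    }
}
```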

71 |

Identifying Compute/Memory Bound in Your Code

72 |

73 | 74 | 75 | 76 |

77 |

Speed Of Light Throughput (SoL)

78 |

79 | 80 | 81 |

82 | 83 |

84 | 85 | **Nsight Compute** can help us identify whether a kernel is **compute- or memory-bound** by using the **SoL** section (see the command sketched below). 86 | 87 | **SoL**: the achieved % of utilization with respect to the maximum; it represents the level of activity of the computer hardware, not the quality of the code. 88 | 89 | Our goal is to ensure that compute and memory resources are utilized evenly, without a significant imbalance. 90 | 91 | - Balanced utilization prevents bottlenecks, where one resource (compute or memory) becomes a limiting factor for overall performance. 92 | - Uniform usage also helps maximize the efficiency of the hardware, getting closer to the theoretical peak performance. 93 | 94 | 95 |
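If you only want this section rather than the full report, recent Nsight Compute versions let you collect it on its own (section names can vary between versions, so check `--list-sections` first):

```
ncu --list-sections                     # confirm the section name on your version
ncu --section SpeedOfLight ./a.out      # collect only the Speed Of Light section
```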

96 | 97 |

98 | 99 | 100 | - Latency (M & SM < 60): 101 | 102 | As explained earlier, SoL reflects the level of activity of the computer. However, in this case, we can see that both M (memory) and SM (compute) are not being used at their full potential. This suggests that the system isn't fully utilizing the available resources, which could mean that the workload is not heavy enough to stress the hardware. 103 | 104 | - Compute Bound (SM > 60 & M < 60): 105 | 106 | This situation suggests that the system has enough data to work on, but it is spending most of its time in complex calculations, so the compute units become the limiting factor, i.e., a compute-bound scenario. 107 | 108 | 109 | - Memory Bound (SM < 60 & M > 60): 110 | 111 | This leads to data starvation, where the system cannot provide enough data to the computation resources in time, even though the calculations themselves may be simple. 112 | 113 | - Compute/Memory Bound (SM & M > 60): 114 | 115 | In this case, the system is operating at a high capacity for both compute and memory. It's important to monitor performance carefully to avoid potential bottlenecks. 116 | 117 | 118 | 119 | **In summary, each of these situations reflects a different type of resource imbalance:** 120 | 121 | - Latency: Low usage of both resources. 122 | - Compute Bound: Excessive compute usage with underused memory. 123 | - Memory Bound: High memory usage with underused compute. 124 | - Balanced Usage: Optimized usage of both compute and memory, or encountering performance limits with both resources. 125 | 126 |

127 | 128 |

129 | 130 | - SM: Inst Executed Pipe Lsu (%): if this percentage is high, the SM is spending its time on load/store instructions rather than on computation, so the load/store unit is where the time goes 131 | - SM: Pipe Fma/Alu Cycles Active (%): the percentage of SM cycles spent on actual compute (FMA/ALU) 132 | 133 |

135 |

Roofline chart

136 |

137 | 138 |

139 | 140 |

141 | 142 | 143 | Before explaining the roofline chart in depth, I will go over the definitions you need to know. 144 | 145 | **Arithmetic Intensity (FLOP/B)** is the ratio between the computation a kernel performs (FLOP - Floating Point Operations) and the data it moves (bytes transferred); it tells us how much work is done per byte of memory traffic (a worked example follows below). 146 | 147 |
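As a concrete illustration (my own calculation, not from the original article), take the SAXPY kernel from Chapter09:

```
SAXPY: y[i] = a*x[i] + y[i]

FLOPs per element : 2     (one multiply, one add)
Bytes per element : 12    (read x, read y, write y -> 3 * 4-byte floats)

Arithmetic Intensity = 2 / 12 ≈ 0.17 FLOP/B
```

An intensity this low sits well to the left of the knee point on typical GPUs, which is why SAXPY is a textbook memory-bound kernel.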

148 | 149 |

150 | 151 | 152 | As we know, math is much faster than memory, so we need to balance the ratio appropriately to avoid memory/compute bound cases. Each computer will have a different ratio, and to determine this, you can click on the square as shown in the image. 153 | 154 | 155 |

156 | 157 |

158 | 159 | We will analyze this diagram in more detail. 160 | 161 | 162 |

163 | 164 |

165 | 166 | 167 | - Peak FLOP/s: The maximum computational speed that a computer can achieve. 168 | 169 | - Bandwidth GB/s: The rate at which the computer can load/store memory, reaching its peak at the intersection of the red and blue lines. This point is called the key point or knee point. 170 | 171 | **Key point (knee point):** The point where there is a transition between two stages: 172 | 173 | - Memory bound stage. 174 | - Compute bound stage. 175 | 176 | In theory: If we achieve the key point ratio (as shown in the image, AI = 0.55), it means our code is nearly perfect (balanced between math and memory). 177 | 178 | In practice: Reaching a point along the diagonal line, as shown in the image, is already a very good outcome. 179 | 180 | Bottleneck situation: 181 | 182 | 183 |

184 | 185 |

186 | 187 | ``` 188 | 189 | P (FLOP/s): Represents the speed at which a task is actually executed. 190 | 191 | P (peak): The maximum computational speed that the computer can theoretically achieve. 192 | 193 | I . b (FLOP/byte * byte/s): The speed attainable by a specific piece of code, given its arithmetic intensity I and the memory bandwidth b. The roofline model combines these as P = min( P(peak), I . b ). 194 | 195 | ``` 196 | 197 | When we use the min function to determine whether the system is compute-bound or memory-bound: 198 | 199 | If **P(peak) is the minimum ==> compute-bound:** the code could in principle be fed data faster (I . b is higher), but the machine is limited by its peak computational speed and cannot go beyond it. Solution: Use a larger unit size. 200 | 201 | If **I . b is the minimum ==> memory-bound:** the compute units are not fully utilized because memory cannot feed them fast enough. Solution: Use coarsening. (For example, with SAXPY's intensity of about 0.17 FLOP/B and the ~96 GB/s bandwidth from Chapter09, I . b is only about 16 GFLOP/s, far below any modern GPU's peak, so SAXPY lands firmly on the memory-bound side.) 202 | 203 | Thus, through these two aspects, we can determine whether our code has issues with computation or with load/store data. In the following sessions, I will guide you on how to specifically address each case. 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | -------------------------------------------------------------------------------- /Fix-Bug/README.md: -------------------------------------------------------------------------------- 1 |

2 |

Common errors when using 3 | 4 | Nsight Systems - Nsight Compute

5 |

6 | 7 | 8 | Even though the download was successful and verified with the commands nsys -v or ncu -v, you may still run into some errors, so I will show you how to fix them. 9 | 10 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/349f86a1-d566-4c4f-b227-36ff70816c33) 11 | 12 | ``` 13 | #include <stdio.h> 14 | 15 | 16 | __global__ void kernel() 17 | { 18 | 19 | printf("hello world"); 20 | } 21 | 22 | int main() 23 | { 24 | kernel<<<1,1>>>(); 25 | cudaDeviceSynchronize(); 26 | 27 | return 0; 28 | } 29 | ``` 30 | 31 | This is a simple piece of code for us to test the two tools we just downloaded. 32 | 33 |

34 |

Nsight Systems

35 |

36 | 37 | Run these commands (I'll explain them in other chapters): 38 | 39 | $nvcc test.cu 40 | 41 | $./a.out 42 | 43 | $nsys profile ./a.out 44 | 45 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/2fdad835-220d-48a0-a18d-4e91c60df6ef) 46 | 47 | Open Nsight Systems and open that file (.nsys-rep). 48 | 49 | Click the warnings and you'll see the Daemon warning. 50 | 51 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/7c22fd95-baa8-4091-938f-c705496c6755) 52 | 53 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/5cef9e18-2fb6-4c78-92a4-ed1a2bf6bfc3) 54 | 55 | 56 | Then run these commands to fix it. PLEASE NOTE THAT EACH COMPUTER MAY REPORT A DIFFERENT LEVEL, SO PLEASE PAY ATTENTION. 57 | 58 | $cat /proc/sys/kernel/perf_event_paranoid 59 | 60 | $sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid' (change the number 1 to the level appropriate for your machine) 61 | 62 | $sudo sh -c 'echo kernel.perf_event_paranoid=2 > /etc/sysctl.d/local.conf' 63 | 64 | Run this to check: 65 | 66 | $cat /proc/sys/kernel/perf_event_paranoid 67 | 68 | 69 |

70 |

Nsight Compute

71 |

72 | 73 | Instead of running $nsys profile ./a.out, run this: 74 | 75 | $ncu --set full -o test ./a.out 76 | 77 | 78 | ![image](https://github.com/user-attachments/assets/b6441013-116f-4056-91ac-d70d9f33fcb7) 79 | 80 | If it creates the .ncu-rep file, it is successful, BUT if you encounter the **nsight compute permission denied** problem, then run these commands: 81 | 82 | $sudo nano /etc/modprobe.d/nvidia.conf 83 | 84 | and add this line to the file (it is a module option, not a shell command): options nvidia NVreg_RestrictProfilingToAdminUsers=0 85 | 86 | then press Ctrl+O and Ctrl+X (save and exit), and reboot so the option takes effect. 87 | -------------------------------------------------------------------------------- /Fix-Bug/test.cu: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | 4 | __global__ void kernel() 5 | { 6 | 7 | printf("hello world"); 8 | } 9 | 10 | int main() 11 | { 12 | kernel<<<1,1>>>(); 13 | cudaDeviceSynchronize(); 14 | 15 | return 0; 16 | } 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 |

NVIDIA Tools Usage Guide

3 |

4 | 5 | This repository contains documentation and examples on how to use NVIDIA tools for profiling, analyzing, and optimizing GPU-accelerated applications, giving beginners a starting point. Currently, it covers NVIDIA Nsight Systems and NVIDIA Nsight Compute, and it may include guidance on using NVIDIA Nsight Deep Learning Designer in the future as well. 6 | 7 |

9 |

Introduction to NVIDIA Tools

10 |

11 | 12 | NVIDIA provides a suite of powerful profiling and analysis tools to help developers optimize, debug, and accelerate GPU applications. This repository aims to provide comprehensive guidance on using these tools effectively in your GPU development workflow. 13 | 14 | - ### NVIDIA Nsight Systems: 15 | This tool allows you to profile CPU and GPU activities, view timeline traces, and analyze system-wide performance bottlenecks. It helps you gain insights into how your application is utilizing GPU resources. 16 | 17 | - ### NVIDIA Nsight Compute: 18 | Nsight Compute is a GPU profiler that provides detailed insights into the performance of individual GPU kernels. It helps you identify performance bottlenecks at the kernel level and optimize your GPU code accordingly. 19 | 20 | - ### NVIDIA Compute Sanitizer: 21 | NVIDIA Compute Sanitizer is a tool that helps developers (cuda beginners) find and fix programming errors and memory issues in GPU-accelerated applications, improving reliability and performance. 22 | 23 | - ### NVIDIA Nsight Deep Learning Designer (Future): 24 | Nsight Deep Learning Designer is designed for deep learning model optimization and debugging. While it may not be covered in this repository yet, future updates may include guidance on using this tool for your deep learning projects. 25 | 26 |

27 |

Getting Started

28 |

29 | 30 | ### Download Nsight Systems 31 | - Follow this [link for downloading](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2023-3) 32 | 33 | - You can use this command to verify its existence: 34 | 35 | $nsys -v 36 | 37 | 38 | ### Download Nsight Compute 39 | - Nsight Compute is bundled within the CUDA Toolkit. If you've already installed the CUDA Toolkit, there's no need to download Nsight Compute separately. If you wish to switch to a different version, you can do so by using the [provided link](https://developer.nvidia.com/tools-overview/nsight-compute/get-started) 40 | 41 | - You can use this command to verify its existence: 42 | 43 | $ncu -v 44 | 45 | ![image](https://github.com/CisMine/Guide-NVIDIA-Tools/assets/122800932/6d0bb179-42a1-4bce-b1ed-3f5682a988b4) 46 | 47 | - If you haven't installed the CUDA Toolkit yet, please follow these steps: 48 | - If your computer has a GPU, follow these steps from NVIDIA to install the [Cuda Toolkit](https://developer.nvidia.com/cuda-downloads) 49 | 50 | - If you are using Linux, I advise you to watch [this video](https://www.youtube.com/watch?v=wxNQQP9U1Bc) 51 | 52 | - If you are using Windows, this is [your video](https://www.youtube.com/watch?v=cuCWbztXk4Y&t=49s) 53 | 54 | 55 | - If your computer doesn't have a GPU: 56 | 57 | - Don't worry; I'll demonstrate how to set up and use Google Colab to write CUDA code [in here](https://medium.com/@giahuy04/the-easiest-way-to-run-cuda-c-in-google-colab-831efbc33d7a) 58 | 59 | 60 |

61 |

Prerequisites

62 |

63 | 64 | - Basic knowledge of C/C++ programming. 65 | - Understanding of parallel programming concepts. 66 | - Familiarity with the CUDA programming model. 67 | - Access to a CUDA-capable GPU. 68 | 69 | ### If you are unfamiliar with these concepts, please refer to this [parallel computing series](https://github.com/CisMine/Parallel-Computing-Cuda-C) 70 | 71 | 72 | 73 |

74 |

Table of Contents

75 |

76 | 77 | [Fix-Bug](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Fix-Bug) 78 | 79 | [Chapter01: Introduction to Nsight Systems - Nsight Compute](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter01) 80 | 81 | [Chapter02: Cuda toolkit - Cuda driver](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter02) 82 | 83 | [Chapter03: NVIDIA Compute Sanitizer Part 1](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter03) 84 | 85 | [Chapter04: NVIDIA Compute Sanitizer Part 2 ](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter04) 86 | 87 | [Chapter05: Global Memory Coalescing](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter05) 88 | 89 | [Chapter06: Warp Scheduler](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter06) 90 | 91 | [Chapter07: Occupancy Part 1](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter07) 92 | 93 | [Chapter08: Occupancy Part 2](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter08) 94 | 95 | [Chapter09: Bandwidth - Throughput - Latency](https://github.com/CisMine/Guide-NVIDIA-Tools/tree/main/Chapter09) 96 | 97 | [Chapter10: Compute Bound - Memory Bound](https://github.com/CisMine/Guide-NVIDIA-Tools/blob/main/Chapter10/README.md) 98 | 99 | 100 |

101 |

Resources

102 |

103 | 104 | In addition to the code examples, this repository provides a curated list of resources, including books, tutorials, online courses, and research papers, to further enhance your understanding of using NVIDIA Tools. These resources will help you delve deeper into the subject and explore advanced topics and techniques. 105 | 106 | - [Nsight Systems v2023.3.1 Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) 107 | - [Nsight Compute v2023.2.1 Guide](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html) 108 | --------------------------------------------------------------------------------