├── .gitattributes ├── MXMACA编程的内存层次模型.png ├── README.md ├── chapter10 ├── mcBlas.c ├── mcDNN.cpp ├── mcblas命令.txt └── usingThrust.cpp ├── chapter11 ├── Makefile ├── simple2DFD.cpp └── vectorAddMultiGpus.cpp ├── chapter2 └── helloFromGpu.c ├── chapter3 ├── cpuVectorAdd.cpp └── gpuVectorAdd.cpp ├── chapter4 └── grammar.cpp ├── chapter5 ├── Cooperative_Groups.cpp ├── assignKernel.cpp ├── information.cpp └── nestedHelloWorld.cpp ├── chapter6 ├── AplusB_with_managed.cpp ├── AplusB_with_unified_addressing.cpp ├── AplusB_without_unified_addressing.cpp ├── BC_addKernel.cpp ├── NBC_addKernel2.cpp ├── __shfl_down_syncExample.cpp ├── __shfl_syncExample.cpp ├── __shfl_up_syncExample.cpp ├── __shfl_xor_syncExample.cpp ├── checkGlobalVariable.cpp ├── information.cpp ├── vectorAddUnifiedVirtualAddressing.cpp └── vectorAddZerocopy.cpp ├── chapter7 ├── Makefile.txt ├── my_program │ ├── CMakeLists.txt │ ├── include │ │ ├── a.h │ │ └── b.h │ ├── main.cpp │ └── src │ │ ├── a.cpp │ │ └── b.cpp ├── trigger_memory_violation.cpp ├── trigger_memory_violation_repaired.cpp └── vectorAdd.cpp ├── chapter8 ├── myKernel.cpp └── stream_parallel_execution.cpp ├── chapter9 ├── shortKernelsAsyncLaunch.cpp ├── shortKernelsGraphLaunch.cpp └── shortKernelsSyncLaunch.cpp ├── common └── common.h ├── 习题运行结果 ├── 3.1.png ├── 3.2.png ├── 5.2.9.1运行结果 │ ├── 1.png │ ├── 2.png │ └── 3.png ├── 5.2.9.2运行结果 │ ├── 1.png │ ├── 2.png │ └── 3.png ├── T4运行结果.png ├── answer.md ├── nestedMandelbrot.cpp └── 统一内存寻址运行结果.png ├── 开源的完整示例代码表.md └── 示例代码运行截图 ├── chapter2 └── 2-1.png ├── chapter3 └── 3-2.png ├── chapter4 └── 4-1.png ├── chapter5 ├── 5-1.png ├── 5-3.png └── 5-5.png ├── chapter6 ├── 6-1-1.png ├── 6-1-2.png ├── 6-10-1.png ├── 6-10-2.png ├── 6-11-1.png ├── 6-11-2.png ├── 6-12-1.png ├── 6-12-2.png ├── 6-2-1.png ├── 6-2-2.png ├── 6-3-1.png ├── 6-3-2.png ├── 6-30.png ├── 6-4.png ├── 6-5.png ├── 6-6.png ├── 6-7.png ├── 6-8.png └── 6-9.png ├── chapter7 ├── 7-4.png └── 7-5.png ├── chapter8 ├── 8-1.png └── 8-2.png └── 示例代码运行截图.md /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /MXMACA编程的内存层次模型.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/MXMACA编程的内存层次模型.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # getting-started-guide-and-summary-of-MXMACA 2 | 3 | ## CPU VS GPU 4 | 5 | CPU,即中央处理器,由数百万个晶体管构成,可以具有多个处理核心,是计算机系统的运算和控制核心。CPU涉及到通用计算,适合少量的复杂计算。CPU虽然处理核心远没有GPU多,但是可以将核心集中在单个任务上并快速完成工作。 6 | 7 | GPU,即图形处理器,由许多更小、更专业的核心组成的处理器。适合大量的简单运算。GPU最初是用来加速3D渲染任务,但是随着时间的推移,这些固定功能的引擎变得更加可编程、更加灵活。虽然图形和日益逼真的视觉效果仍然是GPU的主要功能,但GPU也已发展成为更通用的并行处理器,可以处理越来越多的应用程序。 8 | 9 | | CPU | GPU | 10 | | ---------------------------------- | -------------------------------- | 11 | | 通用组件,负责计算机的主要处理功能 | 专用组件,主要负责图形和视频渲染 | 12 | | 核心数:2-64 | 核心数:数千 | 13 | | 串行运行进程 | 并行运行进程 | 14 | | 更适合处理一项大任务 | 更适合处理多个较小的任务 | 15 | 16 | 17 | 18 | ### 加速深度学习和人工智能 19 | 20 | GPU或其他加速器非常适合用神经网络或大量特定数据(e.g. 
2D图像)进行深度学习训练。 21 | 22 | GPU加速方法已经适用于深度学习算法,可以显著提升算法性能。 23 | 24 | 25 | 26 | ## 基本概念的解释 27 | 28 | 内存部分的解释详见MXMACA内存模型和管理。 29 | 30 | ### 主机端(host) 31 | 32 | CPU所在的位置称为主机端。 33 | 34 | 可以简单理解为CPU。 35 | 36 | ### 设备端(device) 37 | 38 | GPU所在的位置称为设备端。 39 | 40 | 可以简单理解为GPU。 41 | 42 | 主机和设备之间通过PCIe总线连接,用于传递指令和数据,让CPU和GPU一起来协同工作。 43 | 44 | ### 加速处理器(Accelerated Processors,AP) 45 | 46 | 每个AP都能支持数千个GPU线程并发执行。 47 | 48 | 执行具体的指令和指令和任务。 49 | 50 | ### 核函数(kernel) 51 | 52 | 核函数在设备端执行,需要为一个线程规定所进行的计算和访问的数据。当核函数被调用时,许多不同的MXMACA线程并行执行同一计算任务。 53 | 54 | 在设备侧(GPU)执行,可以在设备侧(GPU)和主机侧(CPU)被调用。 55 | 56 | ### 线程(thread) 57 | 58 | 一般通过GPU的一个核进行处理。 59 | 60 | 每个线程是Kernel的单个执行实例。在一个block中的所有线程可以共享一些资源,并能够相互通信。 61 | 62 | ### 线程束(wave) 63 | 64 | GPU执行程序时的调度单位。 65 | 66 | 64个线程组成一个线程束,线程束中每个线程在不同数据集上同时执行相同的指令。 67 | 68 | ### 线程块(thread block) 69 | 70 | 由多个线程组成。可以是一维、二维或三维的。 71 | 72 | 各block是并行执行的。 73 | 74 | 同一个线程块内的线程可以相互协作,不同线程块内的线程不能协作。 75 | 76 | 当启动一个核函数网格时,它的GPU线程会被分配到可用的AP上执行。一旦线程块被调度到一个AP上,其中的线程将只在该指定的AP上并发执行。 77 | 78 | 多个线程块根据AP资源的可用性进行调度,可能会被分配到同一个AP上或不同的AP上。 79 | 80 | ### 线程网格(grid) 81 | 82 | 多个线程块可以构成线程网格。 83 | 84 | 和核函数(kernel)的关系:启动核函数(kernel)时,会定义一个线程网格(grid)。 85 | 86 | 网格可以是一维的、二维的或三维的。 87 | 88 | ### 流(stream) 89 | 90 | 相当于是GPU上的任务队列。 91 | 92 | 同一个stream的任务是严格保证顺序的,上一个命令执行完成才会执行下一个命令。 93 | 94 | 不同stream的命令不保证任何执行顺序。部分优化技巧需要用到多个stream才能实现。如在执行kernel的同时进行数据拷贝,需要一个stream执行kernel,另一个stream进行数据拷贝。 95 | 96 | 97 | 98 | ## 基本编程模型 99 | 100 | 1. 用户可以通过调用动态运行时库,申请、释放显存,并在内存和显存间进行数据拷贝。 101 | 102 | 2. 典型的MXMACA程序实现流程遵循以下模式: 103 | 104 | 1. 把数据从CPU内存拷贝到GPU内存; 105 | 2. 调用核函数对GPU内存的数据进行处理; 106 | 3. 将数据从GPU内存传送回CPU内存。 107 | 108 | 3. 用户可以编写kernel函数,在主机侧调用kernel函数,调用将创建GPU线程。 109 | 110 | 1. 用户可以在Kernel Launch时分别指定网格中的线程块数量、线程块中包含的线程数量。当用户指定的线程数量超过64,这些线程会被拆分成多个线程束,并在同一个AP上执行,这些线程束可能并发执行,也可能串行执行。 111 | 2. 每个GPU线程都会完整执行一次kernel函数,kernel函数可以对显存进行读、写等操作,也可以调用设备侧函数对显存进行读、写等操作。不同的GPU线程可以通过内置变量进行区分,只需要通过读取内置变量,分别找到线程块的位置、线程的位置,就可以给每一个线程唯一地标识ThreadIdx(可以参考后文,相关的几个内置变量)。 112 | 113 | 4. 相关的几个内置变量 114 | 115 | 1. `threadIdx`,获取线程`thread`的ID索引;如果线程是一维的那么就取`threadIdx.x`,二维的还可以多取到一个值`threadIdx.y`,以此类推到三维`threadIdx.z`。可以在一个线程块中唯一的标识线程。 116 | 2. `blockIdx`,线程块的ID索引;同样有`blockIdx.x`,`blockIdx.y`,`blockIdx.z`。可以在一个网格中唯一标识线程块。 117 | 3. `blockDim`,线程块的维度,同样有`blockDim.x`,`blockDim.y`,`blockDim.z`。可以代表每个维度下线程的最大数量。 118 | 1. 对于一维的`block`,线程的`threadID=threadIdx.x`。 119 | 2. 对于大小为`(blockDim.x, blockDim.y)`的 二维`block`,线程的`threadID=threadIdx.x+threadIdx.y*blockDim.x`。 120 | 3. 对于大小为`(blockDim.x, blockDim.y, blockDim.z)`的 三维 `block`,线程的`threadID=threadIdx.x+threadIdx.y*blockDim.x+threadIdx.z*blockDim.x*blockDim.y`。 121 | 4. `gridDim`,线程格的维度,同样有`gridDim.x`,`gridDim.y`,`gridDim.z`。可以代表每个唯独下线程块的最大数量。 122 | 123 | 5. 常用的GPU函数 124 | 125 | 1. `mcMalloc()` 126 | 127 | 负责内存分配。类似与C语言中的`malloc`。不过mcMalloc是在GPU上分配内存,返回device指针。 128 | 129 | 2. `mcMemcpy()` 130 | 131 | 负责内存复制。 132 | 133 | 可以把数据从host搬到device,再从device搬回host。 134 | 135 | 3. 
`mcFree()` 136 | 137 | 释放显存的指针。 138 | 139 | (可以参考示例代码) 140 | 141 | ## 基本硬件架构及其在Kernel执行中的作用 142 | 143 | ## MXMACA内存模型和管理 144 | 145 | ### MXMACA内存模型 146 | 147 | MXMACA的内存是分层次的,每个不同类型的内存空间有不同的作用域、生命周期和缓存行为。一个内核函数中,每个线程有自己的私有内存,每个线程块有自己工作组的共享内存并对块内的所有线程可见,一个线程网格中的所有线程都可以访问全局内存和常量。可以参考下图: 148 | 149 | 150 | 151 | 书里提到了它们的初始化方式,这里主要介绍它们的用途、局限性。 152 | 153 | #### 可编程存储器、不可编程存储器 154 | 155 | 根据存储器能否被程序员控制,可分为:可编程存储器、不可编程存储器。 156 | 157 | 可编程存储器:需要显示控制哪些数据放在可编程内存中。包括全局存储、常量存储、共享存储、本地存储和寄存器等。 158 | 159 | 不可编程存储器:不能决定哪些数据放在这些存储器中,也不能决定数据在存储器中的位置。包括一级缓存、二级缓存等。 160 | 161 | #### GPU寄存器 162 | 163 | 寄存器延迟极低,对于每个线程是私有的,与核函数的生命周期相同。 164 | 165 | 寄存器是稀有资源,使用过多的寄存器也会影响到性能,可以添加辅助信息控制限定寄存器数量。 166 | 167 | 书中也提到了一些方式,可以让一个线程束内的两个线程相互访问对方的寄存器,而不需要访问全局内存或者共享内存,延迟很低且不消耗额外内存。 168 | 169 | #### GPU私有内存 170 | 171 | 私有内存是每个线程私有的。 172 | 173 | 私有内存在物理上与全局内存在同一块储存区域,因此具有较高的延迟和低带宽。 174 | 175 | #### GPU线程块共享内存 176 | 177 | 共享内存的地址空间被线程块中所有的线程共享。它的内容和创建时所在的线程块具有相同生命周期。 178 | 179 | 共享内存让同一个线程块中的线程能够相互协作,便于重用片上数据,可以降低核函数所需的全局内存带宽。 180 | 181 | 相较于全局内存,共享内存延迟更低,带宽更高。 182 | 183 | 适合在数据需要重复利用、全局内存合并或线程之间有共享数据时使用共享内存。 184 | 185 | 不能过度使用,否则会限制活跃线程束的数量。 186 | 187 | 书里也提到了共享内存的分配、共享内存的地址映射方式、bank冲突以及最小化bank冲突的方法。bank冲突时,多个访问操作会被序列化,降低内存带宽,就没有什么并行的意义了。 188 | 189 | #### GPU常量内存 190 | 191 | 常量内存在设备内存中,并在每个AP专用的常量缓存中缓存。 192 | 193 | 如果线程束中所有线程都从相同内存读取数据,常量内存表现最好,因为每从一个常量内存中读取一次数据,都会广播给线程束里的所有线程。 194 | 195 | #### GPU全局内存 196 | 197 | GPU中内存最大、延迟最高、最常使用。 198 | 199 | 可以在任何AP上被访问,并且贯穿应用程序的整个生命周期。 200 | 201 | 优化时需要注意对齐内存访问与合并内存访问。 202 | 203 | ## MXMACA程序优化 204 | 205 | ### 性能优化的目标 206 | 207 | 1. 提高程序执行效率,减少运行时间,提高程序的处理能力和吞吐量。 208 | 2. 优化资源利用率,避免资源的浪费和滥用。 209 | 3. 改善程序的响应时间。 210 | 211 | ### 程序性能评估 212 | 213 | #### 精度 214 | 215 | GPU 的单精度计算性能要远远超过双精度计算性能,需要在速度与精度之间选取合适的平衡。 216 | 217 | #### 延迟 218 | 219 | #### 计算量 220 | 221 | 如果计算量很小,或者串行部分占用时间较长,并行部分占用时间较短,都不适合用GPU进行并行计算。 222 | 223 | ### 优化的主要策略 224 | 225 | #### 硬件性能优化 226 | 227 | #### 并行性优化 228 | 229 | 可以通过设置线程块的大小、每个线程块的共享内存使用量、每个线程使用的寄存器数量,尽量提升occupancy。 230 | 231 | #### 内存访问优化 232 | 233 | ##### 提高`Global Memory`访存效率 234 | 235 | 对齐内存访问:一个内存事务的首个访问地址尽量是缓存粒度(32或128字节)的偶数倍,减少带宽浪费。 236 | 237 | 合并内存访问:尽量让一个线程束的线程访问的内存都在一个线程块。 238 | 239 | ##### 提高`Shared Memory`访存效率 240 | 241 | 若`wave`中不同的线程访问相同的`bank`,则会发生bank冲突(bank conflict),bank冲突时,`wave`的一条访存指令会被拆分为n条不冲突的访存请求,降低`shared memory`的有效带宽。所以需要尽量避免bank冲突。 242 | 243 | #### 算法优化 244 | 245 | 1. 如何将问题分解成块、束、线程 246 | 2. 线程如何访问数据以及产生什么样的内存模式 247 | 3. 数据的重用性 248 | 4. 算法总共要执行多少工作,与串行化的方法之间的差异 249 | 250 | #### 算数运算密度优化 251 | 252 | 1. 超越函数操作:可以查阅平方根等超越函数和加速函数,以及设备接口函数 253 | 2. 近似:可以在速度和精度之间进行折衷 254 | 3. 查找表:用空间换时间。适合GPU高占用率的情况,也要考虑到计算的复杂度,计算复杂度低时,计算速度可能大大快于低GPU占用下的内存查找方式。 255 | 256 | #### 编译器优化 257 | 258 | 1. 展开循环 259 | 2. 常量折叠 e.g. 编译时直接计算常数,从而简化常数 260 | 3. 常量传播:将表达式中的变量替换为已知常数 261 | 4. 公共子表达式消除:将该类公共子表达式的值临时记录,并传播到子表达式使用的语句 262 | 5. 目标相关优化:用复杂指令取代简单通用的指令组合,使程序获得更高的性能 263 | 264 | #### 其他 265 | 266 | 1. 用结构体数组(结构体的成员是数组),而不是数组结构体(数组的每个元素都是结构体)。 267 | 2. 尽量少用条件分支。CPU具有分支预测的功能,GPU没有这一功能,GPU执行if,else语句的效率非常低。因此只能让束内每一线程在每个分支都经过一遍(但不一定执行),当然如果所有线程都不用执行,就可以忽略这一分支。只要有一个线程需要执行某一个分支,其他线程即使不需要执行,也要等着一个线程执行完才能开始自己的计算任务。而且不同的分支是串行执行的,因此要减少分支的数目。 268 | 1. 通过计算,去掉分支(可以参考书中8.3.4相关内容)。 269 | 2. 通过查找表去掉分支。 270 | 3. 尽量使`wave`块完美对齐,让一个`wave`里的所有线程都满足条件或者都不满足条件。 271 | 3. 引入一些指令级并行操作,尽可能终止最后的线程束以使整个线程块都闲置出来,并替换为另一个包含一组更活跃线程束的线程块。 272 | 273 | ### 优化性能需要考虑的指标 274 | 275 | 1. 最大化利用率 276 | 2. 最大化存储吞吐量 277 | 3. 最大化指令吞吐量 278 | 4. 最小化内存抖动 279 | 5. 
时间消耗(整体运行所需时间、GPU和CPU之间的传输所需时间、核函数运行所需时间) 280 | 281 | ## MXMACA生态的人工智能和计算加速库 282 | 283 | ### mcBLAS 284 | 285 | 主要用于多种形式的计算。 286 | 287 | `Level-1 Functions`定义了向量与向量、向量与标量之间的运算,还为多种数据类型(单精度浮点实数、单精度浮点复数、双精度浮点实数、双精度浮点复数)定义了专用的接口。 288 | 289 | `Level-2 Functions`定义了矩阵与向量之间的运算。 290 | 291 | `Level-3 Functions`定义了矩阵与矩阵之间的运算。是求解器和深度神经网络库的底层实现基础。 292 | 293 | ### mcDNN 294 | 295 | 提供常用深度学习算子。 296 | 297 | ### mcSPARSE 298 | 299 | 稀疏矩阵线性代数库。稀疏矩阵是指零元素数目远多于非零元素数目的矩阵。 300 | 301 | 可以用对应的接口完成稀疏矩阵线性代数运算。 302 | 303 | ### mcSOLVER 304 | 305 | 稠密矩阵线性方程组的求解函数库。 306 | 307 | ### mcFFT 308 | 309 | 快速傅里叶变换库。 310 | 311 | 312 | 313 | -------------------------------------------------------------------------------- /chapter10/mcBlas.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include "mcblas.h" 6 | 7 | /* cpu implementation of sgemm */ 8 | static void cpu_sgemm(int m, int n, int k, float alpha, const float *A, const float *B, float beta, float *C_in, 9 | float *C_out) { 10 | int i; 11 | int j; 12 | int kk; 13 | 14 | for (i = 0; i < m; ++i) { 15 | for (j = 0; j < n; ++j) { 16 | float prod = 0; 17 | 18 | for (kk = 0; kk < k; ++kk) { 19 | prod += A[kk * m + i] * B[j * k + kk]; 20 | } 21 | 22 | C_out[j * m + i] = alpha * prod + beta * C_in[j * m + i]; 23 | } 24 | } 25 | } 26 | 27 | int main(int argc, char **argv) { 28 | float *h_A; 29 | float *h_B; 30 | float *h_C; 31 | float *h_C_ref; 32 | float *d_A = 0; 33 | float *d_B = 0; 34 | float *d_C = 0; 35 | float alpha = 1.0f; 36 | float beta = 0.0f; 37 | int m = 256; 38 | int n = 128; 39 | int k = 64; 40 | int size_a = m * n; // the element num of A matrix 41 | int size_b = n * k; // the element num of B matrix 42 | int size_c = m * n; // the element num of C matrix 43 | float error_norm; 44 | float ref_norm; 45 | float diff; 46 | mcblasHandle_t handle; 47 | mcblasStatus_t status; 48 | 49 | /* Initialize mcBLAS */ 50 | status = mcblasCreate(&handle); 51 | if (status != MCBLAS_STATUS_SUCCESS) { 52 | fprintf(stderr, "Init failed\n"); 53 | return EXIT_FAILURE; 54 | } 55 | 56 | /* Allocate host memory for A/B/C matrix*/ 57 | h_A = (float *)malloc(size_a * sizeof(float)); 58 | if (h_A == NULL) { 59 | fprintf(stderr, "A host memory allocation failed\n"); 60 | return EXIT_FAILURE; 61 | } 62 | h_B = (float *)malloc(size_b * sizeof(float)); 63 | if (h_B == NULL) { 64 | fprintf(stderr, "B host memory allocation failed\n"); 65 | return EXIT_FAILURE; 66 | } 67 | h_C = (float *)malloc(size_c * sizeof(float)); 68 | if (h_C == 0) { 69 | fprintf(stderr, "C host memory allocation failed\n"); 70 | return EXIT_FAILURE; 71 | } 72 | h_C_ref = (float *)malloc(size_c * sizeof(float)); 73 | if (h_C_ref == 0) { 74 | fprintf(stderr, "C_ref host memory allocation failed\n"); 75 | return EXIT_FAILURE; 76 | } 77 | 78 | /* Fill the matrices with test data */ 79 | for (int i = 0; i < size_a; ++i) { 80 | h_A[i] = cos(i + 0.125); 81 | } 82 | for (int i = 0; i < size_b; ++i) { 83 | h_B[i] = cos(i - 0.125); 84 | } 85 | for (int i = 0; i < size_c; ++i) { 86 | h_C[i] = sin(i + 0.25); 87 | } 88 | 89 | /* Allocate device memory for the matrices */ 90 | if (mcMalloc((void **)(&d_A), size_a * sizeof(float)) != mcSuccess) { 91 | fprintf(stderr, "A device memory allocation failed\n"); 92 | return EXIT_FAILURE; 93 | } 94 | if (mcMalloc((void **)(&d_B), size_b * sizeof(float)) != mcSuccess) { 95 | fprintf(stderr, "B device memory allocation failed\n"); 96 | return EXIT_FAILURE; 97 | } 98 | if (mcMalloc((void **)(&d_C), size_c 
* sizeof(float)) != mcSuccess) {
99 | fprintf(stderr, "C device memory allocation failed\n");
100 | return EXIT_FAILURE;
101 | }
102 |
103 | /* Initialize the device matrices with the host matrices */
104 | if (mcblasSetVector(size_a, sizeof(float), h_A, 1, d_A, 1) != MCBLAS_STATUS_SUCCESS) {
105 | fprintf(stderr, "Copy A from host to device failed\n");
106 | return EXIT_FAILURE;
107 | }
108 | if (mcblasSetVector(size_b, sizeof(float), h_B, 1, d_B, 1) != MCBLAS_STATUS_SUCCESS) {
109 | fprintf(stderr, "Copy B from host to device failed\n");
110 | return EXIT_FAILURE;
111 | }
112 | if (mcblasSetVector(size_c, sizeof(float), h_C, 1, d_C, 1) != MCBLAS_STATUS_SUCCESS) {
113 | fprintf(stderr, "Copy C from host to device failed\n");
114 | return EXIT_FAILURE;
115 | }
116 |
117 | /* compute the reference result */
118 | cpu_sgemm(m, n, k, alpha, h_A, h_B, beta, h_C, h_C_ref);
119 |
120 | /* Performs operation using mcblas: column-major, no transpose, A(m x k) lda=m, B(k x n) ldb=k, C(m x n) ldc=m */
121 | status = mcblasSgemm(handle, MCBLAS_OP_N, MCBLAS_OP_N, m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
122 | if (status != MCBLAS_STATUS_SUCCESS) {
123 | fprintf(stderr, "Sgemm kernel execution failed\n");
124 | return EXIT_FAILURE;
125 | }
126 | /* Read the result back */
127 | status = mcblasGetVector(size_c, sizeof(float), d_C, 1, h_C, 1);
128 | if (status != MCBLAS_STATUS_SUCCESS) {
129 | fprintf(stderr, "C data reading failed\n");
130 | return EXIT_FAILURE;
131 | }
132 |
133 | /* Check result against reference */
134 | error_norm = 0;
135 | ref_norm = 0;
136 |
137 | for (int i = 0; i < size_c; ++i) {
138 | diff = h_C_ref[i] - h_C[i];
139 | error_norm += diff * diff;
140 | ref_norm += h_C_ref[i] * h_C_ref[i];
141 | }
142 |
143 | error_norm = (float)sqrt((double)error_norm);
144 | ref_norm = (float)sqrt((double)ref_norm);
145 |
146 | if (error_norm / ref_norm < 1e-6f) {
147 | printf("McBLAS test passed.\n");
148 | } else {
149 | printf("McBLAS test failed.\n");
150 | }
151 |
152 | /* Memory clean up */
153 | free(h_A);
154 | free(h_B);
155 | free(h_C);
156 | free(h_C_ref);
157 |
158 | if (mcFree(d_A) != mcSuccess) {
159 | fprintf(stderr, "A device mem free failed\n");
160 | return EXIT_FAILURE;
161 | }
162 |
163 | if (mcFree(d_B) != mcSuccess) {
164 | fprintf(stderr, "B device mem free failed\n");
165 | return EXIT_FAILURE;
166 | }
167 |
168 | if (mcFree(d_C) != mcSuccess) {
169 | fprintf(stderr, "C device mem free failed\n");
170 | return EXIT_FAILURE;
171 | }
172 |
173 | /* Shutdown */
174 | status = mcblasDestroy(handle);
175 | if (status != MCBLAS_STATUS_SUCCESS) {
176 | fprintf(stderr, "Destroy failed\n");
177 | return EXIT_FAILURE;
178 | }
179 |
180 | return EXIT_SUCCESS;
181 | }
182 |
-------------------------------------------------------------------------------- /chapter10/mcDNN.cpp: --------------------------------------------------------------------------------
1 | #include <iostream>   // NOTE: the include names in the original listing were lost; the headers below are assumed
2 | #include <vector>
3 | #include <cmath>
4 | #include <cstdlib>
5 | #include <mc_runtime.h>   // assumed MXMACA runtime header name (for mcMalloc/mcMemcpy/mcFree)
#include "mcdnn.h"   // assumed, by analogy with "mcblas.h" in chapter10/mcBlas.c
6 |
7 | #define MCDNN_CHECK(f) \
8 | { \
9 | mcdnnStatus_t err = static_cast<mcdnnStatus_t>(f); \
10 | if (err != MCDNN_STATUS_SUCCESS) { \
11 | std::cout << "Error occurred : " << err << std::endl; \
12 | std::exit(1); \
13 | } \
14 | }
15 |
16 | int main() {
17 | // data shape
18 | int batch = 3;
19 | int data_w = 224;
20 | int data_h = 224;
21 | int in_channel = 3;
22 | int out_channel = 8;
23 | int filter_w = 5;
24 | int filter_h = 5;
25 | int stride[2] = {1, 1};
26 | int dilate[2] = {1, 1};
int pad[4] = {0, 0, 0, 0};   // added: `pad` is used below but never declared in the original; zero padding assumed
27 | float alpha = 2.f;
28 | float beta = 5.f;
29 |
30 | // model selected
31 | mcdnnConvolutionMode_t mode = MCDNN_CROSS_CORRELATION;
32 | mcdnnConvolutionFwdAlgo_t algo
= MCDNN_CONVOLUTION_FWD_ALGO_FFT_TILING;
33 | // data type selected float, double, half, etc.
34 | mcdnnDataType_t data_type = MCDNN_DATA_FLOAT;
35 |
36 | // init handle
37 | mcdnnHandle_t handle;
38 | MCDNN_CHECK(mcdnnCreate(&handle));
39 |
40 | // create descriptor
41 | mcdnnTensorDescriptor_t x_desc;
42 | mcdnnFilterDescriptor_t w_desc;
43 | mcdnnTensorDescriptor_t y_desc;
44 | mcdnnConvolutionDescriptor_t conv_desc;
45 | MCDNN_CHECK(mcdnnCreateTensorDescriptor(&x_desc));
46 | MCDNN_CHECK(mcdnnCreateFilterDescriptor(&w_desc));
47 | MCDNN_CHECK(mcdnnCreateTensorDescriptor(&y_desc));
48 | MCDNN_CHECK(mcdnnCreateConvolutionDescriptor(&conv_desc));
49 |
50 | // convolution padding
51 | // out size = (input + pad - kernel) / stride + 1
52 | uint32_t padding_w = data_w + pad[2] + pad[3];
53 | uint32_t padding_h = data_h + pad[0] + pad[1];
54 | uint32_t out_h = padding_h - filter_h + 1;
55 | uint32_t out_w = padding_w - filter_w + 1;
56 | // init tensor descriptor, set data type, layout format, shape, etc.
57 | mcdnnSetTensor4dDescriptor(x_desc, MCDNN_TENSOR_NCHW, data_type, batch,
58 | in_channel, data_h, data_w);
59 | mcdnnSetFilter4dDescriptor(w_desc, data_type, MCDNN_TENSOR_NCHW, out_channel,
60 | in_channel, filter_h, filter_w);
61 | mcdnnSetTensor4dDescriptor(y_desc, MCDNN_TENSOR_NCHW, data_type, batch,
62 | out_channel, out_h, out_w);
63 | // init convolution descriptor, set padding, stride, data_type, etc.
64 | mcdnnSetConvolution2dDescriptor(conv_desc, pad[1], pad[2], stride[0],
65 | stride[1], dilate[0], dilate[1], mode,
66 | data_type);
67 |
68 | // init input data
69 | uint32_t input_data_numbers = batch * in_channel * data_h * data_w;
70 | uint32_t filter_data_numbers = out_channel * in_channel * filter_h * filter_w;
71 | uint32_t out_data_numbers = batch * out_channel * out_h * out_w;
72 |
73 | std::vector<float> x(input_data_numbers);
74 | std::vector<float> w(filter_data_numbers);
75 | std::vector<float> y(out_data_numbers);
76 | for (int i = 0; i < input_data_numbers; ++i) {
77 | x[i] = std::cos(i) * i;
78 | }
79 | for (int i = 0; i < filter_data_numbers; ++i) {
80 | w[i] = std::sin(i) / 10;
81 | }
82 |
83 | for (int i = 0; i < out_data_numbers; ++i) {
84 | y[i] = std::cos(i + 0.5);
85 | }
86 |
87 | // alloc x device memory
88 | void *ptr_x_dev = nullptr;
89 | MCDNN_CHECK(mcMalloc(&ptr_x_dev, x.size() * sizeof(float)));
90 | // copy data to device
91 | MCDNN_CHECK(mcMemcpy(ptr_x_dev, x.data(), x.size() * sizeof(float),
92 | mcMemcpyHostToDevice));
93 | // alloc w device memory
94 | void *ptr_w_dev = nullptr;
95 | MCDNN_CHECK(mcMalloc(&ptr_w_dev, w.size() * sizeof(float)));
96 | // copy data to device
97 | MCDNN_CHECK(mcMemcpy(ptr_w_dev, w.data(), w.size() * sizeof(float),
98 | mcMemcpyHostToDevice));
99 | // alloc y device memory
100 | void *ptr_y_dev = nullptr;
101 | MCDNN_CHECK(mcMalloc(&ptr_y_dev, y.size() * sizeof(float)));
102 | // copy data to device
103 | MCDNN_CHECK(mcMemcpy(ptr_y_dev, y.data(), y.size() * sizeof(float),
104 | mcMemcpyHostToDevice));
105 |
106 | uint32_t padding_src_elements = batch * in_channel * padding_h * padding_w;
107 |
108 | size_t workspace_size = 0;
109 | MCDNN_CHECK(mcdnnGetConvolutionForwardWorkspaceSize(
110 | handle, x_desc, w_desc, conv_desc, y_desc, algo, &workspace_size));
111 |
112 | void *ptr_worksapce = nullptr;
113 | if (workspace_size > 0) {
114 | MCDNN_CHECK(mcMalloc(&ptr_worksapce, workspace_size));
115 | }
116 |
117 | // convolution forward
118 | MCDNN_CHECK(mcdnnConvolutionForward(handle, &alpha, x_desc, ptr_x_dev, w_desc,
119 | ptr_w_dev,
conv_desc, algo, ptr_worksapce, 120 | workspace_size, &beta, y_desc, ptr_y_dev)); 121 | MCDNN_CHECK(mcMemcpy(y.data(), ptr_y_dev, y.size() * sizeof(float), 122 | mcMemcpyDeviceToHost)); 123 | 124 | // free device pointer and handle 125 | MCDNN_CHECK(mcFree(ptr_x_dev)); 126 | MCDNN_CHECK(mcFree(ptr_w_dev)); 127 | MCDNN_CHECK(mcFree(ptr_y_dev)); 128 | MCDNN_CHECK(mcFree(ptr_w_dev)); 129 | MCDNN_CHECK(mcdnnDestoryTensorDescriptor(x_desc)); 130 | MCDNN_CHECK(mcdnnDestoryTensorDescriptor(y_desc)); 131 | MCDNN_CHECK(mcdnnDestoryFilterDescriptor(w_desc)); 132 | MCDNN_CHECK(mcdnnDestoryConvolutionDescriptor(conv_desc)); 133 | MCDNN_CHECK(mcdnnDestory(handle)); 134 | 135 | return 0; 136 | } 137 | -------------------------------------------------------------------------------- /chapter10/mcblas命令.txt: -------------------------------------------------------------------------------- 1 | mxcc sample_mcblas.c -I${MACA_PATH}/include -I${MACA_PATH}/include/mcblas -I${MACA_PATH}/include/mcr -L${MACA_PATH}/lib -lmcruntime -lmcblas -------------------------------------------------------------------------------- /chapter10/usingThrust.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | #include 7 | #include 8 | 9 | int main(void) { 10 | // the following code shows how to use thrust::sort and thrust::host_vector 11 | std::vector array = {2, 4, 6, 8, 0, 9, 7, 5, 3, 1}; 12 | thrust::host_vector vec; 13 | vec = array; // now vec has storage for 10 integers 14 | std::cout << "vec has size: " << vec.size() << std::endl; 15 | 16 | std::cout << "vec before sorting:" << std::endl; 17 | for (size_t i = 0; i < vec.size(); ++i) 18 | std::cout << vec[i] << " "; 19 | std::cout << std::endl; 20 | 21 | thrust::sort(vec.begin(), vec.end()); 22 | std::cout << "vec after sorting:" << std::endl; 23 | for (size_t i = 0; i < vec.size(); ++i) 24 | std::cout << vec[i] << " "; 25 | std::cout << std::endl; 26 | 27 | vec.resize(2); 28 | std::cout << "now vec has size: " << vec.size() << std::endl; 29 | 30 | return 0; 31 | } 32 | -------------------------------------------------------------------------------- /chapter11/Makefile: -------------------------------------------------------------------------------- 1 | DEBUG ?= 0 2 | MCCL ?=0 3 | MCCLCMMD = -D_USE_MCCL -lmccl 4 | 5 | ifeq ($(DEBUG), 0) 6 | ifeq ($(MCCL),0) 7 | simple2DFD_rls: simple2DFD.cpp 8 | mxcc -x maca -O3 ./simple2DFD.cpp -I./ -o ./build/$@ 9 | else 10 | simple2DFD_rls_mccl: simple2DFD.cpp 11 | mxcc -x maca -O3 ./simple2DFD.cpp $(MCCLCMMD) -I./ -o ./build/$@ 12 | @echo Useing mccl now! 13 | endif 14 | else 15 | ifeq ($(MCCL),0) 16 | simple2DFD_dbg: simple2DFD.cpp 17 | mxcc -x maca -g -G ./simple2DFD.cpp -I./ -o ./build/$@ 18 | else 19 | simple2DFD_dbg_mccl: simple2DFD.cpp 20 | mxcc -x maca -g -G ./simple2DFD.cpp $(MCCLCMMD) -I./ -o ./build/$@ 21 | @echo Useing mccl now! 22 | endif 23 | endif 24 | 25 | clean: 26 | rm -f ./build/simple2DFD_* 27 | 28 | -------------------------------------------------------------------------------- /chapter11/simple2DFD.cpp: -------------------------------------------------------------------------------- 1 | #include "../common/common.h" 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | 9 | #include 10 | 11 | #ifdef _USE_MCCL 12 | #include 13 | #endif 14 | 15 | 16 | /* 17 | * This example implements a 2D stencil computation, spreading the computation 18 | * across multiple GPUs. 
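 * With two GPUs each device holds a slab of ny/ngpus rows plus NPAD (= 4) extra
 * halo rows next to its neighbour, matching the radius of the 8th-order stencil.
 * Per device, one stream is used for the halo rows and one for the interior, so
 * the halo exchange (mcMemcpyAsync peer copies, or mcclSend/mcclRecv when built
 * with MCCL=1) can overlap with the interior computation.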
This requires communicating halo regions between GPUs 19 | * on every iteration of the stencil as well as managing multiple GPUs from a 20 | * single host application. Here, kernels and transfers are issued in 21 | * breadth-first order to each maca stream. Each maca stream is associated with 22 | * a single maca device. 23 | */ 24 | 25 | #define a0 -3.0124472f 26 | #define a1 1.7383092f 27 | #define a2 -0.2796695f 28 | #define a3 0.0547837f 29 | #define a4 -0.0073118f 30 | 31 | // cnst for gpu 32 | #define BDIMX 32 33 | #define NPAD 4 34 | #define NPAD2 8 35 | 36 | // constant memories for 8 order FD coefficients 37 | __device__ __constant__ float coef[5]; 38 | 39 | // set up fd coefficients 40 | void setup_coef (void) 41 | { 42 | const float h_coef[] = {a0, a1, a2, a3, a4}; 43 | CHECK( mcMemcpyToSymbol( coef, h_coef, 5 * sizeof(float) )); 44 | } 45 | 46 | void saveSnapshotIstep( 47 | int istep, 48 | int nx, 49 | int ny, 50 | int ngpus, 51 | float **g_u2) 52 | { 53 | float *iwave = (float *)malloc(nx * ny * sizeof(float)); 54 | 55 | if (ngpus > 1) 56 | { 57 | unsigned int skiptop = nx * 4; 58 | unsigned int gsize = nx * ny / 2; 59 | 60 | for (int i = 0; i < ngpus; i++) 61 | { 62 | CHECK(mcSetDevice(i)); 63 | int iskip = (i == 0 ? 0 : skiptop); 64 | int ioff = (i == 0 ? 0 : gsize); 65 | CHECK(mcMemcpy(iwave + ioff, g_u2[i] + iskip, 66 | gsize * sizeof(float), mcMemcpyDeviceToHost)); 67 | 68 | // int iskip = (i == 0 ? nx*ny/2-4*nx : 0+4*nx); 69 | // int ioff = (i == 0 ? 0 : nx*4); 70 | // CHECK(mcMemcpy(iwave + ioff, g_u2[i] + iskip, 71 | // skiptop * sizeof(float), mcMemcpyDeviceToHost)); 72 | } 73 | } 74 | else 75 | { 76 | unsigned int isize = nx * ny; 77 | CHECK(mcMemcpy (iwave, g_u2[0], isize * sizeof(float), 78 | mcMemcpyDeviceToHost)); 79 | } 80 | 81 | char fname[50]; 82 | sprintf(fname, "snap_at_step_%d.data", istep); 83 | 84 | FILE *fp_snap = fopen(fname, "w"); 85 | 86 | fwrite(iwave, sizeof(float), nx * ny, fp_snap); 87 | // fwrite(iwave, sizeof(float), nx * 4, fp_snap); 88 | printf("%s: nx = %d ny = %d istep = %d\n", fname, nx, ny, istep); 89 | fflush(stdout); 90 | fclose(fp_snap); 91 | 92 | free(iwave); 93 | return; 94 | } 95 | // 判断算力是否大于2,大于2则就支持P2P通信 96 | inline bool isCapableP2P(int ngpus) 97 | { 98 | mcDeviceProp_t prop[ngpus]; 99 | int iCount = 0; 100 | 101 | for (int i = 0; i < ngpus; i++) 102 | { 103 | CHECK(mcGetDeviceProperties(&prop[i], i)); 104 | 105 | if (prop[i].major >= 2) iCount++; 106 | 107 | printf("> GPU%d: %s %s Peer-to-Peer access\n", i, 108 | prop[i].name, (prop[i].major >= 2 ? 
"supports" : "doesn't support")); 109 | fflush(stdout); 110 | } 111 | 112 | if(iCount != ngpus) 113 | { 114 | printf("> no enough device to run this application\n"); 115 | fflush(stdout); 116 | } 117 | 118 | return (iCount == ngpus); 119 | } 120 | 121 | /* 122 | * enable P2P memcopies between GPUs (all GPUs must be compute capability 2.0 or 123 | * later (Fermi or later)) 124 | */ 125 | inline void enableP2P (int ngpus) 126 | { 127 | for (int i = 0; i < ngpus; i++) 128 | { 129 | CHECK(mcSetDevice(i)); 130 | 131 | for (int j = 0; j < ngpus; j++) 132 | { 133 | if (i == j) continue; 134 | 135 | int peer_access_available = 0; 136 | CHECK(mcDeviceCanAccessPeer(&peer_access_available, i, j)); 137 | 138 | if (peer_access_available) CHECK(mcDeviceEnablePeerAccess(j, 0)); 139 | } 140 | } 141 | } 142 | // 是否支持UnifiedAddressing 143 | inline bool isUnifiedAddressing (int ngpus) 144 | { 145 | mcDeviceProp_t prop[ngpus]; 146 | 147 | for (int i = 0; i < ngpus; i++) 148 | { 149 | CHECK(mcGetDeviceProperties(&prop[i], i)); 150 | } 151 | 152 | const bool iuva = (prop[0].unifiedAddressing && prop[1].unifiedAddressing); 153 | printf("> GPU%d: %s %s Unified Addressing\n", 0, prop[0].name, 154 | (prop[0].unifiedAddressing ? "supports" : "doesn't support")); 155 | printf("> GPU%d: %s %s Unified Addressing\n", 1, prop[1].name, 156 | (prop[1].unifiedAddressing ? "supports" : "doesn't support")); 157 | fflush(stdout); 158 | return iuva; 159 | } 160 | // 2GPU的结果为252,256,4,252 161 | inline void calcIndex(int *haloStart, int *haloEnd, int *bodyStart, 162 | int *bodyEnd, const int ngpus, const int iny) 163 | { 164 | // for halo 165 | for (int i = 0; i < ngpus; i++) 166 | { 167 | if (i == 0 && ngpus == 2) 168 | { 169 | haloStart[i] = iny - NPAD2; // 260-8=252 170 | haloEnd[i] = iny - NPAD; // 260-4=256 171 | 172 | } 173 | else 174 | { 175 | haloStart[i] = NPAD; 176 | haloEnd[i] = NPAD2; 177 | } 178 | } 179 | 180 | // for body 181 | for (int i = 0; i < ngpus; i++) 182 | { 183 | if (i == 0 && ngpus == 2) 184 | { 185 | bodyStart[i] = NPAD; // 4 186 | bodyEnd[i] = iny - NPAD2; // 260-8=252 187 | } 188 | else 189 | { 190 | bodyStart[i] = NPAD + NPAD; 191 | bodyEnd[i] = iny - NPAD; 192 | } 193 | } 194 | } 195 | // // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512 196 | inline void calcSkips(int *src_skip, int *dst_skip, const int nx, 197 | const int iny) 198 | { 199 | src_skip[0] = nx * (iny - NPAD2);// 512*(260-8) 200 | dst_skip[0] = 0; 201 | src_skip[1] = NPAD * nx; // 4*512 202 | dst_skip[1] = (iny - NPAD) * nx; // (260-4)*512 203 | } 204 | 205 | // wavelet 206 | __global__ void kernel_add_wavelet ( float *g_u2, float wavelets, const int nx, 207 | const int ny, const int ngpus) 208 | { // ny为iny=260,nx=512 209 | // global grid idx for (x,y) plane 若gpu个数为2,则 210 | int ipos = (ngpus == 2 ? 
ny - 10 : ny / 2 - 10); // ipos=250 211 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x; // ix就是x方向上节点编号 212 | unsigned int idx = ipos * nx + ix; // idx=250*512+ix 213 | 214 | if(ix == nx / 2) g_u2[idx] += wavelets; // 这里是说ix==256时,则 215 | } 216 | 217 | // fd kernel function 218 | __global__ void kernel_2dfd_last(float *g_u1, float *g_u2, const int nx, 219 | const int iStart, const int iEnd) 220 | { 221 | // global to slice : global grid idx for (x,y) plane 222 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x; 223 | 224 | // smem idx for current point 225 | unsigned int stx = threadIdx.x + NPAD; 226 | unsigned int idx = ix + iStart * nx; 227 | 228 | // shared memory for u2 with size [4+16+4][4+16+4] 229 | __shared__ float tile[BDIMX + NPAD2]; 230 | 231 | const float alpha = 0.12f; 232 | 233 | // register for y value 234 | float yval[9]; 235 | 236 | for (int i = 0; i < 8; i++) yval[i] = g_u2[idx + (i - 4) * nx]; 237 | 238 | // to be used in z loop 239 | int iskip = NPAD * nx; 240 | 241 | #pragma unroll 9 242 | for (int iy = iStart; iy < iEnd; iy++) 243 | { 244 | // get front3 here 245 | yval[8] = g_u2[idx + iskip]; 246 | 247 | if(threadIdx.x < NPAD) 248 | { 249 | tile[threadIdx.x] = g_u2[idx - NPAD]; 250 | tile[stx + BDIMX] = g_u2[idx + BDIMX]; 251 | } 252 | 253 | tile[stx] = yval[4]; 254 | __syncthreads(); 255 | 256 | if ( (ix >= NPAD) && (ix < nx - NPAD) ) 257 | { 258 | // 8rd fd operator 259 | float tmp = coef[0] * tile[stx] * 2.0f; 260 | 261 | #pragma unroll 262 | for(int d = 1; d <= 4; d++) 263 | { 264 | tmp += coef[d] * (tile[stx - d] + tile[stx + d]); 265 | } 266 | 267 | #pragma unroll 268 | for(int d = 1; d <= 4; d++) 269 | { 270 | tmp += coef[d] * (yval[4 - d] + yval[4 + d]); 271 | } 272 | 273 | // time dimension 274 | g_u1[idx] = yval[4] + yval[4] - g_u1[idx] + alpha * tmp; 275 | } 276 | 277 | #pragma unroll 8 278 | for (int i = 0; i < 8 ; i++) 279 | { 280 | yval[i] = yval[i + 1]; 281 | } 282 | 283 | // advancd on global idx 284 | idx += nx; 285 | __syncthreads(); 286 | } 287 | } 288 | 289 | __global__ void kernel_2dfd(float *g_u1, float *g_u2, const int nx, 290 | const int iStart, const int iEnd) 291 | { 292 | // global to line index 293 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x; 294 | 295 | // smem idx for current point 296 | unsigned int stx = threadIdx.x + NPAD; 297 | unsigned int idx = ix + iStart * nx; // ix+4*512,idx表示插值的中心点坐标 298 | 299 | // shared memory for x dimension 300 | __shared__ float line[BDIMX + NPAD2];// 对于一个block,根据模板,需要的共享内存元素数量为block线程大小+NPAD*2 301 | 302 | // a coefficient related to physical properties 303 | const float alpha = 0.12f; // 关于时间步长的系数 304 | 305 | // register for y value 306 | float yval[9]; // 寄存器数组 307 | // 从GPU主存中获取值,这里数据由于是沿着坐标x轴排布的,所以获取主存的数据是不连续的 308 | for (int i = 0; i < 8; i++) yval[i] = g_u2[idx + (i - 4) * nx]; 309 | 310 | // skip for the bottom most y value 311 | int iskip = NPAD * nx; // 4*512,看上面for循环,最大下标到idx+3*nx,这里多加了1 312 | 313 | #pragma unroll 9 314 | for (int iy = iStart; iy < iEnd; iy++)//对y方向的数据点进行循环 315 | { 316 | // get yval[8] here 317 | yval[8] = g_u2[idx + iskip];//这里每次yval的最后一个数据从主存获取,其他数据最后从寄存器获取 318 | // 所以内存是按坐标轴的x方向上排布的 319 | // read halo partk // 320 | if(threadIdx.x < NPAD) 321 | { // 共享内存的最前最后4个数据即(0,1,2,3)和(36,37,38,39) 322 | line[threadIdx.x] = g_u2[idx - NPAD]; 323 | line[stx + BDIMX] = g_u2[idx + BDIMX]; 324 | } 325 | 326 | line[stx] = yval[4]; // line获取中心点的值,注意由于每个线程的yval[4]和stx都不同,所以这样可以将line[4-35]的所有数据填满 327 | __syncthreads();// 直到块内线程同步 328 | 329 | // 8rd fd operator 
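// 8th-order 2D stencil: tmp = 2*coef[0]*u(ix,iy) + sum_{d=1..4} coef[d]*(u(ix-d,iy)+u(ix+d,iy)+u(ix,iy-d)+u(ix,iy+d));
// the update below is then u_new = 2*u_cur - u_old + alpha*tmp (second-order central difference in time)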
这里的ix>=4,ix<512-4 330 | if ( (ix >= NPAD) && (ix < nx - NPAD) ) 331 | { 332 | // center point 333 | float tmp = coef[0] * line[stx] * 2.0f; 334 | 335 | #pragma unroll 336 | for(int d = 1; d <= 4; d++) 337 | { 338 | tmp += coef[d] * ( line[stx - d] + line[stx + d]); 339 | } 340 | 341 | #pragma unroll 342 | for(int d = 1; d <= 4; d++) 343 | { 344 | tmp += coef[d] * (yval[4 - d] + yval[4 + d]); 345 | } 346 | 347 | // time dimension yval[4]=gu2[idx],g_u1和g_u2和时间推进有关 348 | g_u1[idx] = yval[4] + yval[4] - g_u1[idx] + alpha * tmp; 349 | } 350 | 351 | #pragma unroll 8 // 这里将下移一格,即沿着坐标y轴下移,进行下一层(沿着x轴为一层) 352 | for (int i = 0; i < 8 ; i++) 353 | { 354 | yval[i] = yval[i + 1]; 355 | } 356 | 357 | // advancd on global idx 358 | idx += nx; // idx+一层的点数,接着循环 359 | __syncthreads(); 360 | } 361 | } 362 | // 程序有多个参数,第一个为要使用的GPU个数,第二个为保存哪个时间步的波场 363 | /* 364 | 1. argv[1]:gpu数量 365 | 2. argv[2]: 每隔多少个时间步存储数据 366 | 3. argv[3]: 一共多少时间步 367 | 4. argv[4]: 每个方向上的网格数 368 | */ 369 | int main( int argc, char *argv[] ) 370 | { 371 | int ngpus=2; 372 | 373 | // check device count 374 | CHECK(mcGetDeviceCount(&ngpus)); 375 | printf("> Number of devices available: %i\n", ngpus); 376 | 377 | // check p2p capability 378 | isCapableP2P(ngpus); 379 | isUnifiedAddressing(ngpus); 380 | 381 | // get it from command line 382 | if (argc > 1) 383 | { 384 | if (atoi(argv[1]) > ngpus) 385 | { 386 | fprintf(stderr, "Invalid number of GPUs specified: %d is greater " 387 | "than the total number of GPUs in this platform (%d)\n", 388 | atoi(argv[1]), ngpus); 389 | exit(1); 390 | } 391 | 392 | ngpus = atoi(argv[1]); 393 | } 394 | 395 | int iMovie = 100; // 这里现在表示每隔多少步存一次数据 396 | 397 | if(argc >= 3) iMovie = atoi(argv[2]); 398 | 399 | // size 400 | // 时间步 401 | int nsteps = 1001; 402 | if(argc>=4) nsteps=atoi(argv[3]); 403 | 404 | printf("> run with %i devices: nsteps = %i\n", ngpus, nsteps); 405 | 406 | // x方向点数 407 | const int nx = 512; 408 | // y方向点数 409 | const int ny = 512; 410 | // 计算每个gpu上点数,这里每个线程负责所有y方向的数据点计算 411 | const int iny = ny / ngpus + NPAD * (ngpus - 1); 412 | 413 | size_t isize = nx * iny; // 总的数据点数 414 | size_t ibyte = isize * sizeof(float); // 每块总的数据字节数 415 | #ifndef _USE_MCCL 416 | size_t iexchange = NPAD * nx * sizeof(float); // 交换区域的字节数 417 | #endif 418 | 419 | // set up gpu card 420 | float *d_u2[ngpus], *d_u1[ngpus]; 421 | 422 | for(int i = 0; i < ngpus; i++) 423 | { 424 | // set device 425 | CHECK(mcSetDevice(i)); 426 | 427 | // allocate device memories // d_u1,d_u2分别存着两个时间步的数据 428 | CHECK(mcMalloc ((void **) &d_u1[i], ibyte)); 429 | CHECK(mcMalloc ((void **) &d_u2[i], ibyte)); 430 | 431 | CHECK(mcMemset (d_u1[i], 0, ibyte)); 432 | CHECK(mcMemset (d_u2[i], 0, ibyte)); 433 | printf("GPU %i: %.2f MB global memory allocated\n", i, 434 | (4.f * ibyte) / (1024.f * 1024.f) ); 435 | setup_coef (); 436 | } 437 | 438 | // stream definition 439 | mcStream_t stream_halo[ngpus], stream_body[ngpus]; 440 | 441 | for (int i = 0; i < ngpus; i++) 442 | { 443 | CHECK(mcSetDevice(i)); 444 | CHECK(mcStreamCreate( &stream_halo[i] )); 445 | CHECK(mcStreamCreate( &stream_body[i] )); 446 | } 447 | 448 | // calculate index for computation 449 | int haloStart[ngpus], bodyStart[ngpus], haloEnd[ngpus], bodyEnd[ngpus]; 450 | // 根据iny进行处理 ,2GPU的结果为252,256,4,252 451 | calcIndex(haloStart, haloEnd, bodyStart, bodyEnd, ngpus, iny); 452 | 453 | int src_skip[ngpus], dst_skip[ngpus]; 454 | // // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512 455 | // 根据nx,iny进行处理 456 | if(ngpus > 1) calcSkips(src_skip, dst_skip, nx, iny); 457 | 458 | // 
kernel launch configuration 459 | // block 中的线程数量 460 | dim3 block(BDIMX); 461 | // block数量 这样的话一个线程要处理所有y向的数据。y方向被所有的GPU分块 462 | dim3 grid(nx / block.x); 463 | 464 | // set up event for timing 465 | CHECK(mcSetDevice(0)); 466 | mcEvent_t start, stop; 467 | CHECK (mcEventCreate(&start)); 468 | CHECK (mcEventCreate(&stop )); 469 | CHECK(mcEventRecord( start, 0 )); 470 | #ifdef _USE_MCCL 471 | int devs[2] = {0, 1}; 472 | mcclComm_t comms[2]; 473 | assert(mcclSuccess==mcclCommInitAll(comms, ngpus, devs)); 474 | #endif 475 | // main loop for wave propagation 476 | for(int istep = 0; istep < nsteps; istep++) 477 | { 478 | 479 | // save snap image 480 | if(istep%iMovie==0) saveSnapshotIstep(istep, nx, ny, ngpus, d_u2); 481 | 482 | // add wavelet only onto gpu0 483 | if (istep == 0) 484 | { 485 | CHECK(mcSetDevice(0)); 486 | kernel_add_wavelet<<>>(d_u2[0], 20.0, nx, iny, ngpus); 487 | } 488 | 489 | // halo part 490 | for (int i = 0; i < ngpus; i++) 491 | { 492 | CHECK(mcSetDevice(i)); 493 | 494 | // compute halo 495 | kernel_2dfd<<>>(d_u1[i], d_u2[i], 496 | nx, haloStart[i], haloEnd[i]); 497 | 498 | // compute internal 499 | kernel_2dfd<<>>(d_u1[i], d_u2[i], 500 | nx, bodyStart[i], bodyEnd[i]); 501 | } 502 | 503 | /* 504 | ================================================================================ 505 | 506 | ***************************使用不同的方式在GPU间交换数据**************************** 507 | 508 | ================================================================================ 509 | */ 510 | 511 | #ifndef _USE_MCCL 512 | // exchange halo 513 | // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512 514 | if (ngpus > 1) 515 | { 516 | // 交换两个GPU的数据注意都是d_u1的数据,即新的时间步上的数据 这里可以考虑使用mccl? 517 | // 这里是将gpu0的halo区域数据给gpu1的填充区域 518 | CHECK(mcMemcpyAsync(d_u1[1] + dst_skip[0], d_u1[0] + src_skip[0], 519 | iexchange, mcMemcpyDefault, stream_halo[0])); 520 | // 这里是将gpu1的halo区域数据给gpu0的填充区域 521 | CHECK(mcMemcpyAsync(d_u1[0] + dst_skip[1], d_u1[1] + src_skip[1], 522 | iexchange, mcMemcpyDefault, stream_halo[1])); 523 | } 524 | #else 525 | // 使用mccl发送填充区数据 526 | assert(mcclSuccess == mcclGroupStart()); 527 | for (int i = 0; i < ngpus; ++i) 528 | { 529 | mcSetDevice(i); 530 | int tag = (i + 1) % 2; 531 | mcclSend(d_u1[i] + src_skip[i], NPAD * nx, mcclFloat, tag, comms[i], stream_halo[i]); 532 | mcclRecv(d_u1[i] + dst_skip[tag], NPAD * nx, mcclFloat, tag, comms[i], stream_halo[i]); 533 | } 534 | assert(mcclSuccess == mcclGroupEnd()); 535 | 536 | for (int i = 0; i < ngpus; ++i) 537 | { 538 | mcSetDevice(i); 539 | // it will stall host until all operations are done 540 | mcStreamSynchronize(stream_halo[i]); 541 | } 542 | #endif 543 | for (int i = 0; i < ngpus; i++) 544 | { 545 | CHECK(mcSetDevice(i)); 546 | CHECK(mcDeviceSynchronize()); 547 | // 交换时间步的指针 548 | float *tmpu0 = d_u1[i]; 549 | d_u1[i] = d_u2[i]; 550 | d_u2[i] = tmpu0; 551 | } 552 | 553 | } // 关于istep的for循环结束 554 | 555 | CHECK(mcSetDevice(0)); 556 | CHECK(mcEventRecord(stop, 0)); 557 | 558 | CHECK(mcDeviceSynchronize()); 559 | CHECK(mcGetLastError()); 560 | 561 | float elapsed_time_ms = 0.0f; 562 | CHECK(mcEventElapsedTime(&elapsed_time_ms, start, stop)); 563 | 564 | elapsed_time_ms /= nsteps; 565 | /* 566 | 1. nsteps=30000,NCCL:845.04 MCells/s,origin:941.21 MCells/s 567 | 2. nsteps=15000,NCCL:817.91 MCells/s,origin:935.47 MCells/s 568 | 3. nsteps=10000,NCCL:793.62 MCells/s,origin:925.97 MCells/s 569 | 4. nsteps=05000,NCCL:756.32 MCells/s,origin:925.32 MCells/s 570 | 5. nsteps=02000,NCCL:599.61 MCells/s,origin:889.43 MCells/s 571 | 6. 
nsteps=01000,NCCL:470.81 MCells/s,origin:802.86 MCells/s 572 | 可见随着循环步骤数的增加,mccl通信与原有程序的速度逐渐接近 573 | */ 574 | printf("gputime: %8.2fms ", elapsed_time_ms); 575 | printf("performance: %8.2f MCells/s\n", 576 | (double)nx * ny / (elapsed_time_ms * 1e3f)); 577 | fflush(stdout); 578 | 579 | CHECK(mcEventDestroy(start)); 580 | CHECK(mcEventDestroy(stop)); 581 | 582 | // clear 583 | for (int i = 0; i < ngpus; i++) 584 | { 585 | CHECK(mcSetDevice(i)); 586 | 587 | CHECK(mcStreamDestroy(stream_halo[i])); 588 | CHECK(mcStreamDestroy(stream_body[i])); 589 | 590 | CHECK(mcFree(d_u1[i])); 591 | CHECK(mcFree(d_u2[i])); 592 | 593 | // CHECK(mcDeviceReset()); // 不注释掉会mcclCommDestroy出现段错误 594 | } 595 | #ifdef _USE_MCCL 596 | for (int i = 0; i < ngpus; ++i) 597 | { 598 | assert(mcclSuccess == mcclCommDestroy(comms[i])); 599 | } 600 | #endif 601 | exit(EXIT_SUCCESS); 602 | } 603 | -------------------------------------------------------------------------------- /chapter11/vectorAddMultiGpus.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | 7 | #define USECPSEC 1000000ULL 8 | 9 | unsigned long long dtime_usec(unsigned long long start){ 10 | 11 | timeval tv; 12 | gettimeofday(&tv, 0); 13 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 14 | } 15 | 16 | // error checking macro 17 | #define macaCheckErrors(msg) \ 18 | do { \ 19 | mcError_t __err = mcGetLastError(); \ 20 | if (__err != mcSuccess) { \ 21 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 22 | msg, mcGetErrorString(__err), \ 23 | __FILE__, __LINE__); \ 24 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 25 | exit(1); \ 26 | } \ 27 | } while (0) 28 | 29 | 30 | const int DSIZE = 1 << 26; //64MB 31 | #define NGPUS 4 32 | 33 | // generate different seed for random number 34 | void initialData(float *ip, int size) 35 | { 36 | time_t t; 37 | srand((unsigned) time(&t)); 38 | 39 | for (int i = 0; i < size; i++) 40 | { 41 | ip[i] = (float)(rand() & 0xFF) / 10.0f; 42 | } 43 | 44 | return; 45 | } 46 | 47 | // vector add function: C = A + B 48 | void cpuVectorAdd(float *A, float *B, float *C, const int N) 49 | { 50 | for (int idx = 0; idx < N; idx++) 51 | C[idx] = A[idx] + B[idx]; 52 | } 53 | 54 | // vector add kernel: C = A + B 55 | __global__ void gpuVectorAddKernel(const float *A, const float *B, float *C, const int N){ 56 | 57 | for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < N; idx+=gridDim.x*blockDim.x) // a grid-stride loop 58 | C[idx] = A[idx] + B[idx]; // do the vector (element) add here 59 | } 60 | 61 | // check results from host and gpu 62 | void checkResult(float *hostRef, float *gpuRef, const int N) 63 | { 64 | double epsilon = 1.0E-8; 65 | bool match = 1; 66 | for (int i = 0; i < N; i++) 67 | { 68 | if (abs(hostRef[i] - gpuRef[i]) > epsilon) 69 | { 70 | match = 0; 71 | printf("The vector-add results do not match!\n"); 72 | printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i], 73 | gpuRef[i], i); 74 | break; 75 | } 76 | } 77 | // if (match) printf("The vector-add results match.\n\n"); 78 | return; 79 | } 80 | 81 | // 程序有多个参数,第一个为要使用的GPU个数,第二个为保存哪个时间步的波场 82 | /* 83 | 1. argv[1]:GPU数量 (nGpus) 84 | 2. argv[2]:线程块大小(blockSize) 85 | 3. argv[3]:数据量(dataSize), default is 26(1<<26=64MB) 86 | */ 87 | int main( int argc, char *argv[] ) 88 | { 89 | int nGpus; 90 | mcGetDeviceCount(&nGpus); 91 | nGpus = (nGpus > NGPUS) ? 
NGPUS : nGpus; 92 | printf("> Number of devices available: %i\n", nGpus); 93 | // get it from command line 94 | if (argc > 1) 95 | { 96 | if (atoi(argv[1]) > nGpus) 97 | { 98 | fprintf(stderr, "Invalid number of GPUs specified: %d is greater " 99 | "than the total number of GPUs in this platform (%d)\n", 100 | atoi(argv[1]), nGpus); 101 | exit(1); 102 | } 103 | nGpus = atoi(argv[1]); 104 | } 105 | 106 | // blockSize is set to 1 for slowing execution time per GPU 107 | int blockSize = 1; 108 | // It would be faster if blockSize is set to multiples of 64(waveSize) 109 | if(argc >= 3) blockSize = atoi(argv[2]); 110 | int dataSize = DSIZE; 111 | if(argc >= 4) dataSize = 1 << abs(atoi(argv[3])); 112 | printf("> total array size is %iMB, using %i devices with each device handling %iMB\n", dataSize/1024/1024, nGpus, dataSize/1024/1024/nGpus); 113 | 114 | float *d_A[NGPUS], *d_B[NGPUS], *d_C[NGPUS]; 115 | float *h_A[NGPUS], *h_B[NGPUS], *hostRef[NGPUS], *gpuRef[NGPUS]; 116 | mcStream_t stream[NGPUS]; 117 | 118 | int iSize = dataSize / nGpus; 119 | size_t iBytes = iSize * sizeof(float); 120 | for (int i = 0; i < nGpus; i++) { 121 | //set current device 122 | mcSetDevice(i); 123 | 124 | //allocate device memory 125 | mcMalloc((void **) &d_A[i], iBytes); 126 | mcMalloc((void **) &d_B[i], iBytes); 127 | mcMalloc((void **) &d_C[i], iBytes); 128 | 129 | //allocate page locked host memory for asynchronous data transfer 130 | mcMallocHost((void **) &h_A[i], iBytes); 131 | mcMallocHost((void **) &h_B[i], iBytes); 132 | mcMallocHost((void **) &hostRef[i], iBytes); 133 | mcMallocHost((void **) &gpuRef[i], iBytes); 134 | 135 | // initialize data at host side 136 | initialData(h_A[i], iSize); 137 | initialData(h_B[i], iSize); 138 | //memset(hostRef[i], 0, iBytes); 139 | //memset(gpuRef[i], 0, iBytes); 140 | } 141 | mcDeviceSynchronize(); 142 | 143 | // distribute the workload across multiple devices 144 | unsigned long long dt = dtime_usec(0); 145 | for (int i = 0; i < nGpus; i++) { 146 | //set current device 147 | mcSetDevice(i); 148 | mcStreamCreate(&stream[i]); 149 | 150 | // transfer data from host to device 151 | mcMemcpyAsync(d_A[i],h_A[i], iBytes, mcMemcpyHostToDevice, stream[i]); 152 | mcMemcpyAsync(d_B[i],h_B[i], iBytes, mcMemcpyHostToDevice, stream[i]); 153 | 154 | // invoke kernel at host side 155 | dim3 block (blockSize); 156 | dim3 grid (iSize/blockSize); 157 | gpuVectorAddKernel<<>>(d_A[i], d_B[i], d_C[i], iSize); 158 | 159 | // copy kernel result back to host side 160 | mcMemcpyAsync(gpuRef[i],d_C[i],iBytes,mcMemcpyDeviceToHost,stream[i]); 161 | } 162 | mcDeviceSynchronize(); 163 | dt = dtime_usec(dt); 164 | std::cout << "> The execution time with " << nGpus <<"GPUs: "<< dt/(float)USECPSEC << "s" << std::endl; 165 | 166 | // check the results from host and gpu devices 167 | for (int i = 0; i < nGpus; i++) { 168 | // add vector at host side for result checks 169 | cpuVectorAdd(h_A[i], h_B[i], hostRef[i], iSize); 170 | 171 | // check device results 172 | checkResult(hostRef[i], gpuRef[i], iSize); 173 | 174 | // free device global memory 175 | mcSetDevice(i); 176 | mcFree(d_A[i]); 177 | mcFree(d_B[i]); 178 | mcFree(d_C[i]); 179 | 180 | // free host memory 181 | mcFreeHost(h_A[i]); 182 | mcFreeHost(h_B[i]); 183 | mcFreeHost(hostRef[i]); 184 | mcFreeHost(gpuRef[i]); 185 | 186 | mcStreamSynchronize(stream[i]); 187 | mcStreamDestroy(stream[i]); 188 | } 189 | mcDeviceSynchronize(); 190 | return 0; 191 | } 192 | -------------------------------------------------------------------------------- 
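The multi-GPU vector-add above measures the whole asynchronous pipeline on the host with gettimeofday() plus mcDeviceSynchronize(). When only the device-side time of the work queued in one stream is of interest, the event API already used in chapter11/simple2DFD.cpp (mcEventCreate / mcEventRecord / mcEventElapsedTime) can bracket that work directly. Below is a minimal, self-contained sketch of that pattern, not part of the original samples: the header name mc_runtime.h and dummyKernel are assumptions, everything else uses only calls that appear elsewhere in this repository; error checks are omitted for brevity.

#include <cstdio>
#include <mc_runtime.h>   // assumed MXMACA runtime header name (include names in this listing were stripped)

// Hypothetical kernel, stands in for gpuVectorAddKernel above (grid-stride loop).
__global__ void dummyKernel(float *p, int n) {
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += gridDim.x * blockDim.x)
        p[i] = p[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = NULL;
    mcMalloc((void **)&d, n * sizeof(float));
    mcMemset(d, 0, n * sizeof(float));

    mcStream_t s;
    mcStreamCreate(&s);
    mcEvent_t start, stop;
    mcEventCreate(&start);
    mcEventCreate(&stop);

    mcEventRecord(start, s);                  // timestamp queued before the work
    dummyKernel<<<256, 256, 0, s>>>(d, n);    // the work being timed, on stream s
    mcEventRecord(stop, s);                   // timestamp queued after the work
    mcStreamSynchronize(s);                   // wait until 'stop' has actually been reached

    float ms = 0.0f;
    mcEventElapsedTime(&ms, start, stop);     // GPU-side time between the two events
    printf("kernel time on stream s: %.3f ms\n", ms);

    mcEventDestroy(start);
    mcEventDestroy(stop);
    mcStreamDestroy(s);
    mcFree(d);
    return 0;
}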
/chapter2/helloFromGpu.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | __global__ void helloFromGpu (void) 6 | { 7 | printf("Hello World from GPU!\n"); 8 | } 9 | 10 | int main(void) 11 | { 12 | printf("Hello World from CPU!\n"); 13 | 14 | helloFromGpu <<<1, 10>>>(); 15 | mcDeviceReset(); 16 | //mcDeviceReset()用来显式销毁并清除与当前设备有关的所有资源。 17 | return 0; 18 | } 19 | -------------------------------------------------------------------------------- /chapter3/cpuVectorAdd.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | using namespace std; 6 | 7 | void cpuVectorAdd(float* A, float* B, float* C, int n) { 8 | for (int i = 0; i < n; i++) { 9 | C[i] = A[i] + B[i]; 10 | } 11 | } 12 | 13 | int main(int argc, char *argv[]) { 14 | 15 | int n = atoi(argv[1]); 16 | cout << n << endl; 17 | 18 | size_t size = n * sizeof(float); 19 | 20 | // host memery 21 | float *a = (float *)malloc(size); //分配一段内存,使用指针 a 指向它。 22 | float *b = (float *)malloc(size); 23 | float *c = (float *)malloc(size); 24 | 25 | // for 循环产生一些随机数,并放在分配的内存里面。 26 | for (int i = 0; i < n; i++) { 27 | float af = rand() / double(RAND_MAX); 28 | float bf = rand() / double(RAND_MAX); 29 | a[i] = af; 30 | b[i] = bf; 31 | } 32 | 33 | struct timeval t1, t2; 34 | 35 | // gettimeofday 函数来得到精确时间。它的精度可以达到微秒,是C标准库的函数。 36 | gettimeofday(&t1, NULL); 37 | 38 | // 输入指向3段内存的指针名,也就是 a, b, c。 39 | cpuVectorAdd(a, b, c, n); 40 | 41 | gettimeofday(&t2, NULL); 42 | 43 | //for (int i = 0; i < 10; i++) 44 | // cout << vecA[i] << " " << vecB[i] << " " << vecC[i] << endl; 45 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 46 | cout << timeuse << endl; 47 | 48 | // free 函数把申请的3段内存释放掉。 49 | free(a); 50 | free(b); 51 | free(c); 52 | return 0; 53 | } 54 | -------------------------------------------------------------------------------- /chapter3/gpuVectorAdd.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | 8 | // 要用 __global__ 来修饰。 9 | // 输入指向3段显存的指针名。 10 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N) 11 | { 12 | int i = threadIdx.x + blockDim.x * blockIdx.x; 13 | if (i < N) C_d[i] = A_d[i] + B_d[i]; 14 | } 15 | 16 | int main(int argc, char *argv[]) { 17 | 18 | int n = atoi(argv[1]); 19 | cout << n << endl; 20 | 21 | size_t size = n * sizeof(float); 22 | 23 | // host memery 24 | float *a = (float *)malloc(size); 25 | float *b = (float *)malloc(size); 26 | float *c = (float *)malloc(size); 27 | 28 | for (int i = 0; i < n; i++) { 29 | float af = rand() / double(RAND_MAX); 30 | float bf = rand() / double(RAND_MAX); 31 | a[i] = af; 32 | b[i] = bf; 33 | } 34 | 35 | // 定义空指针。 36 | float *da = NULL; 37 | float *db = NULL; 38 | float *dc = NULL; 39 | 40 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。 41 | mcMalloc((void **)&da, size); 42 | mcMalloc((void **)&db, size); 43 | mcMalloc((void **)&dc, size); 44 | 45 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。 46 | mcMemcpy(da,a,size,mcMemcpyHostToDevice); 47 | mcMemcpy(db,b,size,mcMemcpyHostToDevice); 48 | 49 | struct timeval t1, t2; 50 | 51 | // 计算线程块和网格的数量。 52 | int threadPerBlock = 256; 53 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 54 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid); 55 | 56 | 
gettimeofday(&t1, NULL); 57 | 58 | // 调用核函数。 59 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 60 | 61 | gettimeofday(&t2, NULL); 62 | 63 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost); 64 | 65 | // for (int i = 0; i < 10; i++) 66 | // cout< 12 | __device__ __host__ void count_if(int *count, T *data, int start, int end, int stride, P p) { 13 | for(int i = start; i < end; i += stride){ 14 | if(p(data[i])){ 15 | // __MACA_ARCH__ 宏仅在编译设备侧代码时生效 16 | #ifdef __MACA_ARCH__ 17 | // 使用原子操作保证设备侧多线程执行时的正确性 18 | atomicAdd(count, 1); 19 | #else 20 | *count += 1; 21 | #endif 22 | } 23 | } 24 | } 25 | // 定义核函数 26 | __global__ void count_xyzw(int *res) { 27 | // 利用内建变量gridDim, blockDim, blockIdx, threadIdx对每个线程操作的字符串进行分割 28 | const int start = blockDim.x * blockIdx.x + threadIdx.x; 29 | const int stride = gridDim.x * blockDim.x; 30 | // 在设备侧调用count_if 31 | count_if(res, dstrlist, start, dsize, stride, [=](char c){ 32 | for(auto i: letters) 33 | if(i == c) return true; 34 | return false; 35 | }); 36 | } 37 | 38 | int main(void){ 39 | // 初始化字符串 40 | char test_data[SIZE]; 41 | for(int i = 0; i < SIZE; i ++){ 42 | test_data[i] = 'a' + i % 26; 43 | } 44 | // 拷贝字符串数据至设备侧 45 | mcMemcpyToSymbol(dstrlist, test_data, SIZE); 46 | // 开辟设备侧的计数器内存并赋值为0 47 | int *dcnt; 48 | mcMalloc(&dcnt, sizeof(int)); 49 | int dinit = 0; 50 | mcMemcpy(dcnt, &dinit, sizeof(int), mcMemcpyHostToDevice); 51 | // 启动核函数 52 | count_xyzw<<<4, 64>>>(dcnt); 53 | // 拷贝计数器值到主机侧 54 | int dres; 55 | mcMemcpy(&dres, dcnt, sizeof(int), mcMemcpyDeviceToHost); 56 | // 释放设备侧开辟的内存 57 | mcFree(dcnt); 58 | printf("xyzw counted by device: %d\n", dres); 59 | 60 | // 在主机侧调用count_if 61 | int hcnt = 0; 62 | count_if(&hcnt, test_data, 0, SIZE, 1, [=](char c){ 63 | for(auto i: letters) 64 | if(i == c) return true; 65 | return false; 66 | }); 67 | printf("xyzw counted by host: %d\n", hcnt); 68 | return 0; 69 | } 70 | -------------------------------------------------------------------------------- /chapter5/Cooperative_Groups.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | using namespace cooperative_groups; 6 | __device__ int reduce_sum(thread_group g, int *temp, int val) 7 | { 8 | int lane = g.thread_rank(); 9 | 10 | // Each iteration halves the number of active threads 11 | // Each thread adds its partial sum[i] to sum[lane+i] 12 | for (int i = g.size() / 2; i > 0; i /= 2) 13 | { 14 | temp[lane] = val; 15 | g.sync(); // wait for all threads to store 16 | if(lane 2 | 3 | int main( void ) { 4 | mcDeviceProp_t prop; 5 | 6 | int count; 7 | mcGetDeviceCount( &count ); 8 | for (int i=0; i< count; i++) { 9 | mcGetDeviceProperties( &prop, i ); 10 | printf( " --- General Information for device %d ---\n", i ); 11 | printf( "Name: %s\n", prop.name ); 12 | printf( "Compute capability: %d.%d\n", prop.major, prop.minor ); 13 | printf( "Clock rate: %d\n", prop.clockRate ); 14 | printf( "Device copy overlap: " ); 15 | if (prop.deviceOverlap) 16 | printf( "Enabled\n" ); 17 | else 18 | printf( "Disabled\n" ); 19 | printf( "Kernel execition timeout : " ); 20 | if (prop.kernelExecTimeoutEnabled) 21 | printf( "Enabled\n" ); 22 | else 23 | printf( "Disabled\n" ); 24 | 25 | printf( " --- MP Information for device %d ---\n", i ); 26 | printf( "Multiprocessor count: %d\n", 27 | prop.multiProcessorCount ); 28 | printf( "Threads in wave: %d\n", prop.waveSize ); 29 | printf( "Max threads per block: %d\n", 30 | prop.maxThreadsPerBlock ); 31 | printf( "Max thread dimensions: (%d, %d, 
%d)\n", 32 | prop.maxThreadsDim[0], prop.maxThreadsDim[1], 33 | prop.maxThreadsDim[2] ); 34 | printf( "Max grid dimensions: (%d, %d, %d)\n", 35 | prop.maxGridSize[0], prop.maxGridSize[1], 36 | prop.maxGridSize[2] ); 37 | printf( "\n" ); 38 | } 39 | } 40 | -------------------------------------------------------------------------------- /chapter5/nestedHelloWorld.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | 5 | __global__ void nestedHelloWorld(int const iSize, int iDepth) { 6 | int tid = threadIdx.x; 7 | printf("Recursion=%d: Hello World from thread %d" 8 | " block %d\n", iDepth, tid, blockIdx.x); 9 | 10 | // condition to stop recursive execution 11 | if (iSize==1) return; 12 | 13 | //reduce block size to half 14 | int nThreads = iSize >> 1; 15 | 16 | //thread 0 lauches child grid recursively 17 | if (tid == 0 && nThreads >0) { 18 | nestedHelloWorld<<<1, nThreads>>>(nThreads, ++iDepth); 19 | printf("------> nested execution depth: %d\n", iDepth); 20 | } 21 | } 22 | 23 | int main(int argc, char *argv[]) 24 | { 25 | // launch nestedHelloWorld 26 | nestedHelloWorld<<<1,8>>>(8,0); 27 | mcDeviceSynchronize(); 28 | return 0; 29 | } 30 | -------------------------------------------------------------------------------- /chapter6/AplusB_with_managed.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | using namespace std; 6 | 7 | __device__ __managed__ int ret[1000]; 8 | __global__ void AplusB(int a, int b) { 9 | ret[threadIdx.x] = a + b + threadIdx.x; 10 | } 11 | int main() { 12 | AplusB<<< 1, 1000 >>>(10, 100); 13 | mcDeviceSynchronize(); 14 | for(int i = 0; i < 1000; i++) 15 | printf("%d: A+B = %d\n", i, ret[i]); 16 | return 0; 17 | } 18 | -------------------------------------------------------------------------------- /chapter6/AplusB_with_unified_addressing.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | __global__ void AplusB(int *ret, int a, int b) { 8 | ret[threadIdx.x] = a + b + threadIdx.x; 9 | } 10 | int main() { 11 | int *ret; 12 | mcMallocManaged(&ret, 1000 * sizeof(int)); 13 | AplusB<<< 1, 1000 >>>(ret, 10, 100); 14 | mcDeviceSynchronize(); 15 | for(int i = 0; i < 1000; i++) 16 | printf("%d: A+B = %d\n", i, ret[i]); 17 | mcFree(ret); 18 | return 0; 19 | } 20 | -------------------------------------------------------------------------------- /chapter6/AplusB_without_unified_addressing.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | 7 | __global__ void AplusB(int *ret, int a, int b) { 8 | ret[threadIdx.x] = a + b + threadIdx.x; 9 | } 10 | int main() { 11 | int *ret; 12 | mcMalloc(&ret, 1000 * sizeof(int)); 13 | AplusB<<< 1, 1000 >>>(ret, 10, 100); 14 | int *host_ret = (int *)malloc(1000 * sizeof(int)); 15 | mcMemcpy(host_ret, ret, 1000 * sizeof(int), mcMemcpyDefault); 16 | for(int i = 0; i < 1000; i++) 17 | printf("%d: A+B = %d\n", i, host_ret[i]); 18 | free(host_ret); 19 | mcFree(ret); 20 | return 0; 21 | } 22 | -------------------------------------------------------------------------------- /chapter6/BC_addKernel.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | #define ThreadsPerBlock 256 7 | #define 
maxGridSize 16 8 | __global__ void BC_addKernel(const int *a, int *r) 9 | { 10 | __shared__ int cache[ThreadsPerBlock]; 11 | int tid = blockIdx.x * blockDim.x + threadIdx.x; 12 | int cacheIndex = threadIdx.x; 13 | 14 | // copy data to shared memory from global memory 15 | cache[cacheIndex] = a[tid]; 16 | __syncthreads(); 17 | 18 | // add these data using reduce 19 | for (int i = 1; i < blockDim.x; i *= 2) 20 | { 21 | int index = 2 * i * cacheIndex; 22 | if (index < blockDim.x) 23 | { 24 | cache[index] += cache[index + i]; 25 | } 26 | __syncthreads(); 27 | } 28 | 29 | // copy the result of reduce to global memory 30 | if (cacheIndex == 0){ 31 | r[blockIdx.x] = cache[cacheIndex]; 32 | printf("blockIdx.x:%d r[blockIdx.x]:%d\n",blockIdx.x,r[blockIdx.x]); 33 | } 34 | 35 | } 36 | 37 | int test(int *h_a,int n){ 38 | int *a; 39 | mcMalloc(&a,n*sizeof(int)); 40 | mcMemcpy(a,h_a,n*sizeof(int),mcMemcpyHostToDevice); 41 | int *r; 42 | int h_r[maxGridSize]={0}; 43 | mcMalloc(&r,maxGridSize*sizeof(int)); 44 | mcMemcpy(r,h_r,maxGridSize*sizeof(int),mcMemcpyHostToDevice); 45 | BC_addKernel<<>>(a,r); 46 | mcMemcpy(h_a,a,n*sizeof(int),mcMemcpyDeviceToHost); 47 | mcMemcpy(h_r,r,maxGridSize*sizeof(int),mcMemcpyDeviceToHost); 48 | mcFree(r); 49 | mcFree(a); 50 | int sum=0; 51 | for(int i=0;i 2 | #include 3 | #include 4 | #include 5 | 6 | #define ThreadsPerBlock 256 7 | #define maxGridSize 16 8 | __global__ void NBC_addKernel2(const int *a, int *r) 9 | { 10 | __shared__ int cache[ThreadsPerBlock]; 11 | int tid = blockIdx.x * blockDim.x + threadIdx.x; 12 | int cacheIndex = threadIdx.x; 13 | 14 | // copy data to shared memory from global memory 15 | cache[cacheIndex] = a[tid]; 16 | __syncthreads(); 17 | 18 | // add these data using reduce 19 | for (int i = blockDim.x / 2; i > 0; i /= 2) 20 | { 21 | if (cacheIndex < i) 22 | { 23 | cache[cacheIndex] += cache[cacheIndex + i]; 24 | } 25 | __syncthreads(); 26 | } 27 | 28 | // copy the result of reduce to global memory 29 | if (cacheIndex == 0){ 30 | r[blockIdx.x] = cache[cacheIndex]; 31 | printf("blockIdx.x:%d r[blockIdx.x]:%d\n",blockIdx.x,r[blockIdx.x]); 32 | } 33 | } 34 | 35 | 36 | int test(int *h_a,int n){ 37 | int *a; 38 | mcMalloc(&a,n*sizeof(int)); 39 | mcMemcpy(a,h_a,n*sizeof(int),mcMemcpyHostToDevice); 40 | int *r; 41 | int h_r[maxGridSize]={0}; 42 | mcMalloc(&r,maxGridSize*sizeof(int)); 43 | mcMemcpy(r,h_r,maxGridSize*sizeof(int),mcMemcpyHostToDevice); 44 | NBC_addKernel2<<>>(a,r); 45 | mcMemcpy(h_a,a,n*sizeof(int),mcMemcpyDeviceToHost); 46 | mcMemcpy(h_r,r,maxGridSize*sizeof(int),mcMemcpyDeviceToHost); 47 | mcFree(r); 48 | mcFree(a); 49 | int sum=0; 50 | for(int i=0;i 2 | #include 3 | #include 4 | using namespace std; 5 | 6 | __global__ void test_shfl_down_sync(int A[], int B[]) 7 | { 8 | int tid = threadIdx.x; 9 | int value = B[tid]; 10 | 11 | value = __shfl_down_sync(0xffffffffffffffff, value, 2); 12 | A[tid] = value; 13 | 14 | } 15 | 16 | 17 | int main() 18 | { 19 | int *A,*Ad, *B, *Bd; 20 | int n = 64; 21 | int size = n * sizeof(int); 22 | 23 | // CPU端分配内存 24 | A = (int*)malloc(size); 25 | B = (int*)malloc(size); 26 | 27 | for (int i = 0; i < n; i++) 28 | { 29 | B[i] = rand()%101; 30 | std::cout << B[i] << std::endl; 31 | } 32 | 33 | std::cout <<"----------------------------" << std::endl; 34 | 35 | // GPU端分配内存 36 | mcMalloc((void**)&Ad, size); 37 | mcMalloc((void**)&Bd, size); 38 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice); 39 | 40 | // 定义kernel执行配置,(1024*1024/512)个block,每个block里面有512个线程 41 | dim3 dimBlock(128); 42 | dim3 dimGrid(1000); 43 | 44 | // 
执行kernel 45 | test_shfl_down_sync <<<1, 64 >>> (Ad,Bd); 46 | 47 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost); 48 | 49 | // 校验误差 50 | float max_error = 0.0; 51 | for (int i = 0; i < 64; i++) 52 | { 53 | std::cout << A[i] << std::endl; 54 | } 55 | 56 | cout << "max error is " << max_error << endl; 57 | 58 | // 释放CPU端、GPU端的内存 59 | free(A); 60 | free(B); 61 | mcFree(Ad); 62 | mcFree(Bd); 63 | 64 | return 0; 65 | } 66 | -------------------------------------------------------------------------------- /chapter6/__shfl_syncExample.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | using namespace std; 5 | 6 | __global__ void test_shfl_sync(int A[], int B[]) 7 | { 8 | int tid = threadIdx.x; 9 | int value = B[tid]; 10 | 11 | value = __shfl_sync(0xffffffffffffffff, value, 2); 12 | A[tid] = value; 13 | } 14 | 15 | int main() 16 | { 17 | int *A,*Ad, *B, *Bd; 18 | int n = 64; 19 | int size = n * sizeof(int); 20 | 21 | // CPU端分配内存 22 | A = (int*)malloc(size); 23 | B = (int*)malloc(size); 24 | 25 | for (int i = 0; i < n; i++) 26 | { 27 | B[i] = rand()%101; 28 | std::cout << B[i] << std::endl; 29 | } 30 | 31 | std::cout <<"----------------------------" << std::endl; 32 | 33 | // GPU端分配内存 34 | mcMalloc((void**)&Ad, size); 35 | mcMalloc((void**)&Bd, size); 36 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice); 37 | 38 | // 定义kernel执行配置,(1024*1024/512)个block,每个block里面有512个线程 39 | dim3 dimBlock(128); 40 | dim3 dimGrid(1000); 41 | 42 | // 执行kernel 43 | test_shfl_sync <<<1, 64 >>> (Ad,Bd); 44 | 45 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost); 46 | 47 | // 校验误差 48 | float max_error = 0.0; 49 | for (int i = 0; i < 64; i++) 50 | { 51 | std::cout << A[i] << std::endl; 52 | } 53 | 54 | cout << "max error is " << max_error << endl; 55 | 56 | // 释放CPU端、GPU端的内存 57 | free(A); 58 | free(B); 59 | mcFree(Ad); 60 | mcFree(Bd); 61 | 62 | return 0; 63 | } 64 | -------------------------------------------------------------------------------- /chapter6/__shfl_up_syncExample.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | using namespace std; 5 | 6 | __global__ void test_shfl_up_sync(int A[], int B[]) 7 | { 8 | int tid = threadIdx.x; 9 | int value = B[tid]; 10 | 11 | value = __shfl_up_sync(0xffffffffffffffff, value, 2); 12 | A[tid] = value; 13 | 14 | } 15 | 16 | 17 | int main() 18 | { 19 | int *A,*Ad, *B, *Bd; 20 | int n = 64; 21 | int size = n * sizeof(int); 22 | 23 | // CPU端分配内存 24 | A = (int*)malloc(size); 25 | B = (int*)malloc(size); 26 | 27 | for (int i = 0; i < n; i++) 28 | { 29 | B[i] = rand()%101; 30 | std::cout << B[i] << std::endl; 31 | } 32 | 33 | std::cout <<"----------------------------" << std::endl; 34 | 35 | // GPU端分配内存 36 | mcMalloc((void**)&Ad, size); 37 | mcMalloc((void**)&Bd, size); 38 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice); 39 | 40 | // 定义kernel执行配置,(1024*1024/512)个block,每个block里面有512个线程 41 | dim3 dimBlock(128); 42 | dim3 dimGrid(1000); 43 | 44 | // 执行kernel 45 | test_shfl_up_sync <<<1, 64 >>> (Ad,Bd); 46 | 47 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost); 48 | 49 | // 校验误差 50 | float max_error = 0.0; 51 | for (int i = 0; i < 64; i++) 52 | { 53 | std::cout << A[i] << std::endl; 54 | } 55 | 56 | cout << "max error is " << max_error << endl; 57 | 58 | // 释放CPU端、GPU端的内存 59 | free(A); 60 | free(B); 61 | mcFree(Ad); 62 | mcFree(Bd); 63 | 64 | return 0; 65 | } 66 | -------------------------------------------------------------------------------- 
/chapter6/__shfl_xor_syncExample.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | __global__ void waveReduce() { 5 | int laneId = threadIdx.x & 0x3f; 6 | // Seed starting value as inverse lane ID 7 | int value = 63 - laneId; 8 | 9 | // Use XOR mode to perform butterfly reduction 10 | for (int i=1; i<64; i*=2) 11 | value += __shfl_xor_sync(0xffffffffffffffff, value, i, 64); 12 | 13 | // "value" now contains the sum across all threads 14 | printf("Thread %d final value = %d\n", threadIdx.x, value); 15 | } 16 | 17 | int main() { 18 | waveReduce<<< 1, 64 >>>(); 19 | mcDeviceSynchronize(); 20 | return 0; 21 | } 22 | -------------------------------------------------------------------------------- /chapter6/checkGlobalVariable.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | __device__ float devData; 5 | __global__ void checkGlobalVariable(){ 6 | printf("Device: the value of the global variable is %f\n", devData); 7 | devData += 2.0; 8 | } 9 | 10 | int main(){ 11 | float value = 3.14f; 12 | mcMemcpyToSymbol(devData, &value, sizeof(float)); 13 | printf("Host: copy %f to the global variable\n", value); 14 | checkGlobalVariable<<<1,1>>>(); 15 | mcMemcpyFromSymbol(&value, devData, sizeof(float)); 16 | printf("Host: the value changed by the kernel to %f\n", value); 17 | mcDeviceReset(); 18 | return EXIT_SUCCESS; 19 | } 20 | -------------------------------------------------------------------------------- /chapter6/information.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | int main( void ) { 4 | mcDeviceProp_t prop; 5 | 6 | int count; 7 | mcGetDeviceCount( &count ); 8 | for (int i=0; i< count; i++) { 9 | mcGetDeviceProperties( &prop, i ); 10 | printf( " --- Memory Information for device %d ---\n", i ); 11 | printf( "Total global mem: %ld[bytes]\n", prop.totalGlobalMem ); 12 | printf( "Total constant Mem: %ld[bytes]\n", prop.totalConstMem ); 13 | printf( "Max mem pitch: %ld[bytes]\n", prop.memPitch ); 14 | printf( "Texture alignment: %ld[bytes]\n", prop.textureAlignment ); 15 | printf( "Shared mem per AP: %ld[bytes]\n",prop.sharedMemPerBlock ); 16 | printf( "Registers per AP: %d[bytes]\n", prop.regsPerBlock ); 17 | printf( "\n" ); 18 | } 19 | } 20 | -------------------------------------------------------------------------------- /chapter6/vectorAddUnifiedVirtualAddressing.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | 8 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N) 9 | { 10 | int i = threadIdx.x + blockDim.x * blockIdx.x; 11 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f; 12 | } 13 | 14 | int main(int argc, char *argv[]) { 15 | 16 | int n = atoi(argv[1]); 17 | cout << n << endl; 18 | 19 | size_t size = n * sizeof(float); 20 | mcError_t err; 21 | 22 | // Allocate the host vectors of A&B&C 23 | unsigned int flag = mcMallocHostPortable; 24 | float *a = NULL; 25 | float *b = NULL; 26 | float *c = NULL; 27 | err = mcMallocHost((void**)&a, size, flag); 28 | err = mcMallocHost((void**)&b, size, flag); 29 | err = mcMallocHost((void**)&c, size, flag); 30 | 31 | // Initialize the host vectors of A&B 32 | for (int i = 0; i < n; i++) { 33 | float af = rand() / double(RAND_MAX); 34 | float bf = rand() / double(RAND_MAX); 35 | a[i] = af; 36 | b[i] = bf; 37 | } 38 | 39 | // 
Launch the vector add kernel 40 | struct timeval t1, t2; 41 | int threadPerBlock = 256; 42 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 43 | printf("threadPerBlock: %d \nblockPerGrid: %d \n",threadPerBlock,blockPerGrid); 44 | gettimeofday(&t1, NULL); 45 | vectorAdd<<< blockPerGrid, threadPerBlock >>> (a, b, c, n); 46 | gettimeofday(&t2, NULL); 47 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 48 | cout << timeuse << endl; 49 | 50 | // Free host memory 51 | err = mcFreeHost(a); 52 | err = mcFreeHost(b); 53 | err = mcFreeHost(c); 54 | 55 | return 0; 56 | } 57 | -------------------------------------------------------------------------------- /chapter6/vectorAddZerocopy.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | 8 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N) 9 | { 10 | int i = threadIdx.x + blockDim.x * blockIdx.x; 11 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f; 12 | } 13 | 14 | int main(int argc, char *argv[]) { 15 | 16 | int n = atoi(argv[1]); 17 | cout << n << endl; 18 | 19 | size_t size = n * sizeof(float); 20 | mcError_t err; 21 | 22 | // Allocate the host vectors of A&B&C 23 | unsigned int flag = mcMallocHostMapped; 24 | float *a = NULL; 25 | float *b = NULL; 26 | float *c = NULL; 27 | err = mcMallocHost((void**)&a, size, flag); 28 | err = mcMallocHost((void**)&b, size, flag); 29 | err = mcMallocHost((void**)&c, size, flag); 30 | 31 | // Initialize the host vectors of A&B 32 | for (int i = 0; i < n; i++) { 33 | float af = rand() / double(RAND_MAX); 34 | float bf = rand() / double(RAND_MAX); 35 | a[i] = af; 36 | b[i] = bf; 37 | } 38 | 39 | // Get the pointer in device on the vectors of A&B&C 40 | float *da = NULL; 41 | float *db = NULL; 42 | float *dc = NULL; 43 | err = mcHostGetDevicePointer((void**)&da, (void *)a, 0); 44 | err = mcHostGetDevicePointer((void**)&db, (void *)b, 0); 45 | err = mcHostGetDevicePointer((void**)&dc, (void *)c, 0); 46 | 47 | // Launch the vector add kernel 48 | struct timeval t1, t2; 49 | int threadPerBlock = 256; 50 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 51 | printf("threadPerBlock: %d \nblockPerGrid: %d \n",threadPerBlock,blockPerGrid); 52 | gettimeofday(&t1, NULL); 53 | vectorAdd<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 54 | gettimeofday(&t2, NULL); 55 | double timeuse = (t2.tv_sec - t1.tv_sec) 56 | + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 57 | cout << timeuse << endl; 58 | 59 | // Free host memory 60 | err = mcFreeHost(a); 61 | err = mcFreeHost(b); 62 | err = mcFreeHost(c); 63 | 64 | return 0; 65 | } 66 | -------------------------------------------------------------------------------- /chapter7/Makefile.txt: -------------------------------------------------------------------------------- 1 | # MXMACA Compiler 2 | MXCC = $(MACA_PATH)/mxgpu_llvm/bin/mxcc 3 | 4 | # Compiler flags 5 | MXCCFLAGS = -xmaca 6 | 7 | # Source files 8 | SRCS= main.cpp src/a.cpp src/b.cpp 9 | 10 | # Object files 11 | OBJS = $(SRCS:.cpp=.o) 12 | 13 | # Executable 14 | EXEC = my_program 15 | 16 | # Default target 17 | all: $(EXEC) 18 | 19 | # Link object files to create executable 20 | $(EXEC): $(OBJS) 21 | $(MXCC) $(OBJS) -o $(EXEC) 22 | 23 | %.o: %.cpp 24 | $(MXCC) $(MXCCFLAGS) -c $< -o $@ -I include 25 | 26 | # clean up object files and executable 27 | clean: 28 | rm -f $(OBJS) $(EXEC) 29 | 
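A minimal usage sketch for the Makefile above. The targets (all, clean), the mxcc compiler with -xmaca, and the my_program executable name come from the Makefile itself; the /opt/maca prefix is only an assumed example value for MACA_PATH, not something stated in the repository.
# 1) setting:  export MACA_PATH=/opt/maca   # assumed install prefix; adjust to the local MXMACA installation
# 2) building: make                         # compiles main.cpp, src/a.cpp and src/b.cpp with mxcc -xmaca, then links my_program
# 3) running:  ./my_program
# 4) cleaning: make clean                   # removes the object files and the executable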
-------------------------------------------------------------------------------- /chapter7/my_program/CMakeLists.txt: -------------------------------------------------------------------------------- 1 | # Specify the minimum CMake version required 2 | cmake_minimum_required(VERSION 3.0) 3 | 4 | # Set the project name 5 | project(my_program) 6 | 7 | # Set the path to the compiler 8 | set(MXCC_PATH $ENV{MACA_PATH}) 9 | set(CMAKE_CXX_COMPILER ${MXCC_PATH}/mxgpu_llvm/bin/mxcc) 10 | 11 | # Set the compiler flags 12 | set(MXCC_COMPILE_FLAGS -x maca) 13 | add_compile_options(${MXCC_COMPILE_FLAGS}) 14 | 15 | # Add source files 16 | File(GLOB SRCS src/*.cpp main.cpp) 17 | add_executable(my_program ${SRCS}) 18 | 19 | # Set the include paths 20 | target_include_directories(my_program PRIVATE include) 21 | -------------------------------------------------------------------------------- /chapter7/my_program/include/a.h: -------------------------------------------------------------------------------- 1 | extern void func_a(); -------------------------------------------------------------------------------- /chapter7/my_program/include/b.h: -------------------------------------------------------------------------------- 1 | extern void func_b(); -------------------------------------------------------------------------------- /chapter7/my_program/main.cpp: -------------------------------------------------------------------------------- 1 | //main.cpp: 2 | #include <stdio.h> 3 | #include "a.h" 4 | #include "b.h" 5 | int main() 6 | { 7 | func_a(); 8 | func_b(); 9 | printf("my program!\n"); 10 | return 1; 11 | } 12 | -------------------------------------------------------------------------------- /chapter7/my_program/src/a.cpp: -------------------------------------------------------------------------------- 1 | //a.cpp: 2 | #include 3 | #include 4 | extern "C" __global__ void vector_add(int *A_d, size_t num) 5 | { 6 | size_t offset = (blockIdx.x * blockDim.x + threadIdx.x); 7 | size_t stride = blockDim.x * gridDim.x; 8 | for (size_t i = offset; i < num; i += stride) { 9 | A_d[i]++; 10 | } 11 | } 12 | void func_a() 13 | { 14 | size_t arrSize = 100; 15 | mcDeviceptr_t a_d; 16 | int *a_h = (int *)malloc(sizeof(int) * arrSize); 17 | memset(a_h, 0, sizeof(int) * arrSize); 18 | mcMalloc(&a_d, sizeof(int) * arrSize); 19 | mcMemcpyHtoD(a_d, a_h, sizeof(int) * arrSize); 20 | vector_add<<<1, arrSize>>>(reinterpret_cast<int *>(a_d), arrSize); 21 | mcMemcpyDtoH(a_h, a_d, sizeof(int) * arrSize); 22 | bool resCheck = true; 23 | for (int i = 0; i < arrSize; i++) { 24 | if (a_h[i] != 1){ 25 | resCheck = false; 26 | } 27 | } 28 | printf("vector add result: %s\n", resCheck ? 
"success": "fail"); 29 | free(a_h); 30 | mcFree(a_d); 31 | } 32 | 33 | //a.h: 34 | extern void func_a(); 35 | -------------------------------------------------------------------------------- /chapter7/my_program/src/b.cpp: -------------------------------------------------------------------------------- 1 | //b.cpp: 2 | #include 3 | __global__ void kernel_b() 4 | { 5 | /* kernel code*/ 6 | } 7 | void func_b() 8 | { 9 | /* launch kernel */ 10 | kernel_b<<<1, 1>>>(); 11 | } 12 | 13 | //b.h: 14 | extern void func_b(); 15 | -------------------------------------------------------------------------------- /chapter7/trigger_memory_violation.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | typedef struct 4 | { 5 | alignas(4)float f; 6 | double d; 7 | }__attribute__((packed)) test_type_mem_violation; 8 | 9 | __global__ void trigger_memory_violation(test_type_mem_violation *dst) 10 | { 11 | atomicAdd(&dst->f,1.23); 12 | atomicAdd(&dst->d,20); 13 | dst->f=9.8765; 14 | } 15 | 16 | int main() 17 | { 18 | test_type_mem_violation hd={0}; 19 | test_type_mem_violation *ddd; 20 | mcMalloc((void**)&ddd,sizeof(test_type_mem_violation)); 21 | mcMemcpy(ddd,&hd,sizeof(test_type_mem_violation),mcMemcpyHostToDevice); 22 | trigger_memory_violation<<>>(ddd); 23 | mcMemcpy(&hd,ddd,sizeof(test_type_mem_violation),mcMemcpyDeviceToHost); 24 | mcFree(ddd); 25 | return 0; 26 | } 27 | -------------------------------------------------------------------------------- /chapter7/trigger_memory_violation_repaired.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | typedef struct 4 | { 5 | float f; 6 | double d; 7 | }test_type_mem_violation; 8 | 9 | __global__ void trigger_memory_violation(test_type_mem_violation *dst) 10 | { 11 | atomicAdd(&dst->f,1.23); 12 | atomicAdd(&dst->d,20); 13 | dst->f=9.8765; 14 | } 15 | 16 | int main() 17 | { 18 | test_type_mem_violation hd={0}; 19 | test_type_mem_violation *ddd; 20 | mcMalloc((void**)&ddd,sizeof(test_type_mem_violation)); 21 | mcMemcpy(ddd,&hd,sizeof(test_type_mem_violation),mcMemcpyHostToDevice); 22 | trigger_memory_violation<<>>(ddd); 23 | mcMemcpy(&hd,ddd,sizeof(test_type_mem_violation),mcMemcpyDeviceToHost); 24 | mcFree(ddd); 25 | return 0; 26 | } 27 | -------------------------------------------------------------------------------- /chapter7/vectorAdd.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | __global__ void vectorADD(const float* A_d, const float* B_d, float* C_d, size_t NELEM) { 4 | size_t offset = (blockIdx.x * blockDim.x + threadIdx.x); 5 | size_t stride = blockDim.x * gridDim.x; 6 | 7 | for (size_t i = offset; i < NELEM; i += stride) { 8 | C_d[i] = A_d[i] + B_d[i]; 9 | } 10 | } 11 | 12 | int main() 13 | { 14 | int blocks=20; 15 | int threadsPerBlock=1024; 16 | int numSize=1024*1024; 17 | 18 | float *A_d=nullptr; 19 | float *B_d=nullptr; 20 | float *C_d=nullptr; 21 | 22 | float *A_h=nullptr; 23 | float *B_h=nullptr; 24 | float *C_h=nullptr; 25 | 26 | mcMalloc((void**)&A_d,numSize*sizeof(float)); 27 | mcMalloc((void**)&B_d,numSize*sizeof(float)); 28 | mcMalloc((void**)&C_d,numSize*sizeof(float)); 29 | 30 | A_h=(float*)malloc(numSize*sizeof(float)); 31 | B_h=(float*)malloc(numSize*sizeof(float)); 32 | C_h=(float*)malloc(numSize*sizeof(float)); 33 | 34 | for(int i=0;i>>(A_d,B_d,C_d,numSize); 45 | 46 | mcMemcpy(C_h,C_d,numSize*sizeof(float),mcMemcpyDeviceToHost); 47 | 48 | mcFree(A_d); 49 | mcFree(B_d); 50 
| mcFree(C_d); 51 | 52 | free(A_h); 53 | free(B_h); 54 | free(C_h); 55 | 56 | return 0; 57 | } 58 | -------------------------------------------------------------------------------- /chapter8/myKernel.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | // #include "device_launch_parameters.h" 5 | 6 | __global__ void myKernel(float* devPtr, int height, int width, int pitch) 7 | { 8 | int row, col; 9 | float *rowHead; 10 | 11 | for (row = 0; row < height; row++) 12 | { 13 | rowHead = (float*)((char*)devPtr + row * pitch); 14 | 15 | for (col = 0; col < width; col++) 16 | { 17 | printf("\t%f", rowHead[col]);// 逐个打印并自增 1 18 | rowHead[col]++; 19 | } 20 | printf("\n"); 21 | } 22 | } 23 | 24 | int main() 25 | { 26 | size_t width = 6; 27 | size_t height = 5; 28 | float *h_data, *d_data; 29 | size_t pitch; 30 | 31 | h_data = (float *)malloc(sizeof(float)*width*height); 32 | for (int i = 0; i < width*height; i++) 33 | h_data[i] = (float)i; 34 | 35 | printf("\n\tAlloc memory."); 36 | mcMallocPitch((void **)&d_data, &pitch, sizeof(float)*width, height); 37 | printf("\n\tPitch = %d B\n", pitch); 38 | 39 | printf("\n\tCopy to Device.\n"); 40 | mcMemcpy2D(d_data, pitch, h_data, sizeof(float)*width, sizeof(float)*width, height, mcMemcpyHostToDevice); 41 | 42 | myKernel <<<1, 1 >>> (d_data, height, width, pitch); 43 | mcDeviceSynchronize(); 44 | 45 | printf("\n\tCopy back to Host.\n"); 46 | mcMemcpy2D(h_data, sizeof(float)*width, d_data, pitch, sizeof(float)*width, height, mcMemcpyDeviceToHost); 47 | 48 | for (int i = 0; i < width*height; i++) 49 | { 50 | printf("\t%f", h_data[i]); 51 | if ((i + 1) % width == 0) 52 | printf("\n"); 53 | } 54 | 55 | free(h_data); 56 | mcFree(d_data); 57 | 58 | getchar(); 59 | return 0; 60 | } 61 | -------------------------------------------------------------------------------- /chapter8/stream_parallel_execution.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #define FULL_DATA_SIZE 10000 5 | #define N 1000 6 | #define BLOCKNUM 16 7 | #define THREADNUM 64 8 | 9 | __global__ void kernel(int *a,int *b,int *c){ 10 | int idx=threadIdx.x+blockIdx.x*blockDim.x; 11 | if (idx>>(dev0_a, dev0_b, dev0_c); 76 | 77 | kernel <<>>(dev1_a, dev1_b, dev1_c); 78 | 79 | mcStatus = mcMemcpyAsync(host_c + i, dev0_c, N * sizeof(int), 80 | mcMemcpyDeviceToHost, stream0); 81 | if (mcStatus != mcSuccess) 82 | { 83 | printf("mcMemcpyAsync0 c failed!\n"); 84 | } 85 | 86 | mcStatus = mcMemcpyAsync(host_c + N + i, dev1_c, N * sizeof(int), 87 | mcMemcpyDeviceToHost, stream1); 88 | if (mcStatus != mcSuccess) 89 | { 90 | printf("mcMemcpyAsync1 c failed!\n"); 91 | } 92 | } 93 | for(i=0;i<20;i++){ 94 | printf("%d ",host_a[i]); 95 | } 96 | printf("\n"); 97 | for(i=0;i<20;i++){ 98 | printf("%d ",host_b[i]); 99 | } 100 | printf("\n"); 101 | for(i=0;i<20;i++){ 102 | printf("%d ",host_c[i]); 103 | } 104 | printf("\n"); 105 | mcStreamSynchronize(stream1); 106 | mcStreamSynchronize(stream0); 107 | mcStreamDestroy(stream1); 108 | mcStreamDestroy(stream0); 109 | mcFree(dev0_a); 110 | mcFree(dev1_a); 111 | mcFree(dev0_b); 112 | mcFree(dev1_b); 113 | mcFree(dev0_c); 114 | mcFree(dev1_c); 115 | free(host_a); 116 | free(host_b); 117 | free(host_c); 118 | } 119 | -------------------------------------------------------------------------------- /chapter9/shortKernelsAsyncLaunch.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | 
* 9.4.1: 1) lots of short kernels launched asynchronously 3 | * 9.4.1 {Sample#2} lots of short kernels launched asynchronously 4 | * Usage: 5 | * 1) compiling: mxcc -x maca shortKernelsAsyncLaunch.cpp -o shortKernelsAsyncLaunch 6 | * 2) running:./shortKernelsAsyncLaunch 7 | */ 8 | #include 9 | #include 10 | #include "mc_runtime.h" 11 | 12 | #define macaCheckErrors(msg) \ 13 | do { \ 14 | mcError_t __err = mcGetLastError(); \ 15 | if (__err != mcSuccess) { \ 16 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 17 | msg, mcGetErrorString(__err), \ 18 | __FILE__, __LINE__); \ 19 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 20 | exit(1); \ 21 | } \ 22 | } while (0) 23 | 24 | 25 | #include 26 | #include 27 | #define USECPSEC 1000000ULL 28 | 29 | unsigned long long dtime_usec(unsigned long long start){ 30 | timeval tv; 31 | gettimeofday(&tv, 0); 32 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 33 | } 34 | 35 | #define N 400000 // tuned until kernel takes a few microseconds 36 | __global__ void shortKernel(float * out_d, float * in_d){ 37 | int idx=blockIdx.x*blockDim.x+threadIdx.x; 38 | if(idx>>(d_output, d_input); 58 | macaCheckErrors("kernel launch failure"); 59 | mcDeviceSynchronize(); 60 | macaCheckErrors("kernel execution failure"); 61 | // run on device and measure execution time 62 | unsigned long long dt = dtime_usec(0); 63 | dt = dtime_usec(0); 64 | for(int istep=0; istep>>(d_output, d_input); 67 | } 68 | } 69 | mcStreamSynchronize(stream); 70 | 71 | macaCheckErrors("kernel execution failure"); 72 | dt = dtime_usec(dt); 73 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." << std::endl; 74 | return 0; 75 | } -------------------------------------------------------------------------------- /chapter9/shortKernelsGraphLaunch.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | * 9.4.1 {Sample#3} lots of short kernels launched by graph APIs 3 | * Usage: 4 | * 1) compiling: mxcc -x maca shortKernelsGraphLaunch.cpp -o shortKernelsGraphLaunch 5 | * 2) setting: export MACA_GRAPH_LAUNCH_MODE=1 6 | * 3) running:./shortKernelsGraphLaunch 7 | */ 8 | #include 9 | #include 10 | #include "mc_runtime.h" 11 | 12 | #define macaCheckErrors(msg) \ 13 | do { \ 14 | mcError_t __err = mcGetLastError(); \ 15 | if (__err != mcSuccess) { \ 16 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 17 | msg, mcGetErrorString(__err), \ 18 | __FILE__, __LINE__); \ 19 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 20 | exit(1); \ 21 | } \ 22 | } while (0) 23 | 24 | 25 | #include 26 | #include 27 | #define USECPSEC 1000000ULL 28 | 29 | unsigned long long dtime_usec(unsigned long long start){ 30 | timeval tv; 31 | gettimeofday(&tv, 0); 32 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 33 | } 34 | 35 | #define N 400000 // tuned until kernel takes a few microseconds 36 | __global__ void shortKernel(float * out_d, float * in_d){ 37 | int idx=blockIdx.x*blockDim.x+threadIdx.x; 38 | if(idx>>(d_output, d_input); 58 | macaCheckErrors("kernel launch failure"); 59 | mcDeviceSynchronize(); 60 | macaCheckErrors("kernel execution failure"); 61 | // run on device and measure execution time 62 | unsigned long long dt = dtime_usec(0); 63 | dt = dtime_usec(0); 64 | bool graphCreated=false; 65 | mcGraph_t graph; 66 | mcGraphExec_t instance; 67 | for(int istep=0; istep>>(d_output, d_input); 72 | } 73 | mcStreamEndCapture(stream, &graph); 74 | mcGraphInstantiate(&instance, graph, 
NULL, NULL, 0); 75 | graphCreated=true; 76 | } 77 | mcGraphLaunch(instance, stream); 78 | mcStreamSynchronize(stream); 79 | } 80 | macaCheckErrors("kernel execution failure"); 81 | dt = dtime_usec(dt); 82 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." << std::endl; 83 | return 0; 84 | } -------------------------------------------------------------------------------- /chapter9/shortKernelsSyncLaunch.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | * 9.4.1 {Sample#1} lots of short kernels launched synchronously 3 | * Usage: 4 | * 1) compiling: mxcc -x maca shortKernelsSyncLaunch.cpp -o shortKernelsSyncLaunch 5 | * 2) running:./shortKernelsSyncLaunch 6 | */ 7 | #include 8 | #include 9 | #include "mc_runtime.h" 10 | 11 | #define macaCheckErrors(msg) \ 12 | do { \ 13 | mcError_t __err = mcGetLastError(); \ 14 | if (__err != mcSuccess) { \ 15 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 16 | msg, mcGetErrorString(__err), \ 17 | __FILE__, __LINE__); \ 18 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 19 | exit(1); \ 20 | } \ 21 | } while (0) 22 | 23 | 24 | #include 25 | #include 26 | #define USECPSEC 1000000ULL 27 | 28 | unsigned long long dtime_usec(unsigned long long start){ 29 | timeval tv; 30 | gettimeofday(&tv, 0); 31 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 32 | } 33 | 34 | #define N 400000 // tuned until kernel takes a few microseconds 35 | __global__ void shortKernel(float * out_d, float * in_d){ 36 | int idx=blockIdx.x*blockDim.x+threadIdx.x; 37 | if(idx>>(d_output, d_input); 57 | macaCheckErrors("kernel launch failure"); 58 | mcDeviceSynchronize(); 59 | macaCheckErrors("kernel execution failure"); 60 | // run on device and measure execution time 61 | unsigned long long dt = dtime_usec(0); 62 | dt = dtime_usec(0); 63 | for(int istep=0; istep>>(d_output, d_input); 66 | mcStreamSynchronize(stream); 67 | } 68 | } 69 | macaCheckErrors("kernel execution failure"); 70 | dt = dtime_usec(dt); 71 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." 
<< std::endl; 72 | return 0; 73 | } -------------------------------------------------------------------------------- /common/common.h: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | #ifndef _COMMON_H 4 | #define _COMMON_H 5 | 6 | #define CHECK(call) \ 7 | { \ 8 | const mcError_t error = call; \ 9 | if (error != mcSuccess) \ 10 | { \ 11 | fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__); \ 12 | fprintf(stderr, "code: %d, reason: %s\n", error, \ 13 | mcGetErrorString(error)); \ 14 | } \ 15 | } 16 | 17 | inline double seconds() 18 | { 19 | struct timeval tp; 20 | struct timezone tzp; 21 | int i = gettimeofday(&tp, &tzp); 22 | return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6); 23 | } 24 | 25 | #endif // _COMMON_H 26 | -------------------------------------------------------------------------------- /习题运行结果/3.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/3.1.png -------------------------------------------------------------------------------- /习题运行结果/3.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/3.2.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.1运行结果/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/1.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.1运行结果/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/2.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.1运行结果/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/3.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.2运行结果/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/1.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.2运行结果/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/2.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.2运行结果/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/3.png 
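A short, hedged usage sketch for common/common.h, listed above just before the exercise screenshots: none of the chapter sources in this listing include it, so the kernel name demoKernel, the vector length n and the launch configuration below are illustrative assumptions rather than code from the repository. It only shows the intended pattern: wrap runtime calls in CHECK() to report file, line and mcGetErrorString on failure, and bracket a region with seconds() to time it.
#include "mc_runtime.h"
#include <stdio.h>
#include "common.h"   // provides the CHECK macro and seconds() shown above

__global__ void demoKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                                  // illustrative size
    float *d_data = NULL;
    CHECK(mcMalloc((void **)&d_data, n * sizeof(float)));   // prints file/line and reason if the allocation fails
    double t0 = seconds();
    demoKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    mcDeviceSynchronize();                                  // wait so the timing below covers the kernel
    printf("demoKernel took %f s\n", seconds() - t0);
    CHECK(mcFree(d_data));
    return 0;
}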
-------------------------------------------------------------------------------- /习题运行结果/T4运行结果.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/T4运行结果.png -------------------------------------------------------------------------------- /习题运行结果/answer.md: -------------------------------------------------------------------------------- 1 | # new answer 2 | 3 | ## Chapter 2 4 | 5 | ### Exercise 1 6 | 7 | #### 参考代码 8 | 9 | ```c 10 | #include 11 | #include 12 | #include 13 | 14 | __global__ void helloFromGpu (void) 15 | 16 | { 17 | printf("Hello World from GPU!\n"); 18 | } 19 | 20 | int main(void) 21 | { 22 | printf("Hello World from CPU!\n"); 23 | helloFromGpu <<<1, 10>>>(); 24 | return 0; 25 | } 26 | ``` 27 | 28 | #### 编译结果 29 | 30 | 函数mcDeviceReset()用来显式销毁并清除与当前设备有关的所有资源。 31 | 32 | 当重置函数移除,编译运行则只输出 33 | 34 | ``` 35 | Hello World from CPU! 36 | ``` 37 | 38 | 当printf在gpu上被调用,mcDeviceReset()函数使这些来自gpu的输出发送到主机,然后在控制台输出。 39 | 40 | 没有调用cudaDeviceReset()函数就不能保证这些可以被显示。 41 | 42 | ### Exercise 2 43 | 44 | #### 参考代码 45 | 46 | ```c 47 | #include 48 | #include 49 | #include 50 | 51 | __global__ void helloFromGpu (void) 52 | { 53 | printf("Hello World from GPU!\n"); 54 | } 55 | 56 | int main(void) 57 | { 58 | printf("Hello World from CPU!\n"); 59 | 60 | helloFromGpu <<<1, 10>>>(); 61 | mcDeviceSynchronize(); 62 | return 0; 63 | } 64 | 65 | ``` 66 | 67 | #### 编译结果 68 | 69 | ``` 70 | Hello World from CPU! 71 | Hello World from GPU! 72 | Hello World from GPU! 73 | Hello World from GPU! 74 | Hello World from GPU! 75 | Hello World from GPU! 76 | Hello World from GPU! 77 | Hello World from GPU! 78 | Hello World from GPU! 79 | Hello World from GPU! 80 | Hello World from GPU! 
81 | ``` 82 | 83 | 输出效果和helloFromGpu.c一样。 84 | 85 | mcDeviceSynchronize()也可以用来使gpu的输出打印在用户可见控制台。 86 | 87 | ### Exercise 3 88 | 89 | #### 参考代码 90 | 91 | ```c 92 | #include 93 | #include 94 | #include 95 | 96 | __global__ void helloFromGpu (void) 97 | { 98 | if (threadIdx.x==9) printf("Hello World from GPU Thread 9!\n"); 99 | } 100 | int main(void) 101 | { 102 | printf("Hello World from CPU!\n"); 103 | helloFromGpu <<<1, 10>>>(); 104 | mcDeviceReset(); 105 | return 0; 106 | } 107 | ``` 108 | 109 | ## Chapter 3 110 | 111 | ### Exercise 1 112 | 113 | #### 参考代码 114 | 115 | ```c++ 116 | #include 117 | #include 118 | #include 119 | #include 120 | 121 | using namespace std; 122 | 123 | // 要用 __global__ 来修饰。 124 | // 输入指向3段显存的指针名。 125 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N) 126 | { 127 | int i = threadIdx.x + blockDim.x * blockIdx.x; 128 | // printf("threadIdx.x:%d blockDim.x:%d blockIdx.x:%d\n",threadIdx.x,blockDim.x,blockIdx.x); 129 | if (i < N) C_d[i] = A_d[i] + B_d[i]; 130 | } 131 | 132 | int main(int argc, char *argv[]) { 133 | 134 | int n = 2048; 135 | cout << n << endl; 136 | 137 | size_t size = n * sizeof(float); 138 | 139 | // host memery 140 | float *a = (float *)malloc(size); 141 | float *b = (float *)malloc(size); 142 | float *c = (float *)malloc(size); 143 | 144 | for (int i = 0; i < n; i++) { 145 | float af = rand() / double(RAND_MAX); 146 | float bf = rand() / double(RAND_MAX); 147 | a[i] = af; 148 | b[i] = bf; 149 | } 150 | 151 | // 定义空指针。 152 | float *da = NULL; 153 | float *db = NULL; 154 | float *dc = NULL; 155 | 156 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。 157 | mcMalloc((void **)&da, size); 158 | mcMalloc((void **)&db, size); 159 | mcMalloc((void **)&dc, size); 160 | 161 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。 162 | mcMemcpy(da,a,size,mcMemcpyHostToDevice); 163 | mcMemcpy(db,b,size,mcMemcpyHostToDevice); 164 | 165 | struct timeval t1, t2; 166 | 167 | // 计算线程块和网格的数量。 168 | int threadPerBlock_array[8]={1,16,32,64,128,256,512,1024}; 169 | for(int i=0;i<8;i++){ 170 | int threadPerBlock = threadPerBlock_array[i]; 171 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 172 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid); 173 | 174 | gettimeofday(&t1, NULL); 175 | 176 | // 调用核函数。 177 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 178 | 179 | gettimeofday(&t2, NULL); 180 | 181 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost); 182 | 183 | // for (int i = 0; i < 10; i++) 184 | // cout< 207 | 208 | ### Exercise 2 209 | 210 | #### 参考代码 211 | 212 | ```c++ 213 | #include 214 | #include 215 | #include 216 | #include 217 | 218 | using namespace std; 219 | 220 | // 要用 __global__ 来修饰。 221 | // 输入指向3段显存的指针名。 222 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N) 223 | { 224 | int i = threadIdx.x + blockDim.x * blockIdx.x; 225 | // printf("threadIdx.x:%d blockDim.x:%d blockIdx.x:%d\n",threadIdx.x,blockDim.x,blockIdx.x); 226 | if (i < N) C_d[i] = A_d[i] + B_d[i]; 227 | } 228 | 229 | int main(int argc, char *argv[]) { 230 | 231 | int n = 256; 232 | cout << n << endl; 233 | 234 | size_t size = n * sizeof(float); 235 | 236 | // host memory 237 | float *a = (float *)malloc(size); 238 | float *b = (float *)malloc(size); 239 | float *c = (float *)malloc(size); 240 | 241 | for (int i = 0; i < n; i++) { 242 | float af = rand() / double(RAND_MAX); 243 | float bf = rand() / double(RAND_MAX); 244 | a[i] = af; 245 | b[i] = bf; 246 | } 247 | 248 | // 
定义空指针。 249 | float *da = NULL; 250 | float *db = NULL; 251 | float *dc = NULL; 252 | 253 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。 254 | mcMalloc((void **)&da, size); 255 | mcMalloc((void **)&db, size); 256 | mcMalloc((void **)&dc, size); 257 | 258 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。 259 | mcMemcpy(da,a,size,mcMemcpyHostToDevice); 260 | mcMemcpy(db,b,size,mcMemcpyHostToDevice); 261 | 262 | struct timeval t1, t2; 263 | 264 | // 计算线程块和网格的数量。 265 | int threadPerBlock_array[2]={1,256}; 266 | for(int i=0;i<2;i++){ 267 | int threadPerBlock = threadPerBlock_array[i]; 268 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 269 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid); 270 | 271 | gettimeofday(&t1, NULL); 272 | 273 | // 调用核函数。 274 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 275 | 276 | gettimeofday(&t2, NULL); 277 | 278 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost); 279 | 280 | // for (int i = 0; i < 10; i++) 281 | // cout< 303 | 304 | ### Exercise 3 305 | 306 | 执行每个数值计算的速度并没有CPU快,CPU更适合处理逻辑控制密集的计算任务,GPU更适合处理数据密集的计算任务 307 | 308 | ### Exercise 4 309 | 310 | #### 参考代码 311 | 312 | ```c 313 | #include 314 | #include 315 | #include 316 | #include 317 | 318 | using namespace std; 319 | 320 | 321 | __global__ void matrixMultiplication(int *A_d,int *B_d,int *Result_d,int width){ 322 | int i=threadIdx.x+blockDim.x*blockIdx.x; 323 | int j=threadIdx.y+blockDim.y*blockIdx.y; 324 | int sum=0; 325 | int count; 326 | for(count=0;count>>(da,db,d_result,col); 357 | // 把显存的东西拷贝回内存 358 | mcMemcpy(result,d_result,sizeof(int)*row*col,mcMemcpyDeviceToHost); 359 | // print矩阵,这里row和col相等,所以统一用col表示 360 | int j; 361 | printf("a:\n"); 362 | for(i=0;i 398 | 399 | ## Chapter 5 400 | 401 | ### 5.2.9 402 | 403 | #### Exercise 1 404 | 405 | ##### 参考代码 406 | 407 | ```c 408 | #include 409 | #include 410 | #include 411 | using namespace std; 412 | 413 | 414 | __global__ void print() 415 | { 416 | printf("blockIdx.x:%d threadIdx.x:%d\n",blockIdx.x, threadIdx.x); 417 | } 418 | 419 | int main(void) 420 | { 421 | const dim3 block_size(16); 422 | print<<<10, block_size>>>(); 423 | mcDeviceSynchronize(); 424 | return 0; 425 | } 426 | 427 | 428 | ``` 429 | 430 | ##### 运行结果(一部分) 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 同一个wave内部thread的执行是顺序的。block的执行不是顺序的。 439 | 440 | 在MXMACA中,wave对程序员来说是透明的,它的大小可能会随着硬件的发展发生变化,在当前版本的MXMACA中,每个wave是由64个thread组成的。由64个thread组成的wave是MACA程序执行的最小单位,并且同一个wave是串行的。在一个SM中可能同时有来自不同block的wave。当一个block中的wave在进行访存或者同步等高延迟操作时,另一个block可以占用SM中的计算资源。这样,在SM内就实现了简单的乱序执行。不同block之间的执行没有顺序,完全并行。并且,一个sm只会执行一个block里的wave,当该block里的wave执行完才会执行其他block里的wave。 441 | 442 | #### Exercise 2 443 | 444 | ##### 参考代码 445 | 446 | ```c 447 | #include 448 | #include 449 | #include 450 | using namespace std; 451 | 452 | 453 | __global__ void print() 454 | { 455 | printf("blockIdx.x:%d threadIdx.x:%d threadIdx.y:%d threadIdx.z:%d\n",blockIdx.x, threadIdx.x, threadIdx.y, threadIdx.z); 456 | } 457 | 458 | int main(void) 459 | { 460 | const dim3 block_size(16); 461 | print<<<10, block_size>>>(); 462 | mcDeviceSynchronize(); 463 | return 0; 464 | } 465 | 466 | 467 | ``` 468 | 469 | 470 | 471 | ##### 运行结果 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 没有定义,默认为0. 
480 | 481 | 可以在定义block_size时对三个维度的size都进行设置(注意三者的乘积不可以超过maxThreadsPerBlock)。 482 | 483 | ### 5.4.4(待更正) 484 | 485 | #### Exercise 1 486 | 487 | ##### 参考代码 488 | 489 | ```c 490 | // #include 491 | #include 492 | #include 493 | #include 494 | #include 495 | #include 496 | // #include 497 | // #include "dynamicParallelism.h" 498 | #include 499 | /** block size along */ 500 | #define BSX 64 501 | #define BSY 4 502 | /** maximum recursion depth */ 503 | #define MAX_DEPTH 4 504 | /** region below which do per-pixel */ 505 | #define MIN_SIZE 32 506 | /** subdivision factor along each axis */ 507 | #define SUBDIV 4 508 | /** subdivision when launched from host */ 509 | #define INIT_SUBDIV 32 510 | #define H (16 * 1024) 511 | #define W (16 * 1024) 512 | #define MAX_DWELL 512 513 | using namespace std; 514 | 515 | 516 | 517 | /** a useful function to compute the number of threads */ 518 | int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); } 519 | 520 | /** a simple complex type */ 521 | struct complex { 522 | __host__ __device__ complex(float re, float im = 0) 523 | { 524 | this->re = re; 525 | this->im = im; 526 | } 527 | /** real and imaginary part */ 528 | float re, im; 529 | }; // struct complex 530 | 531 | // operator overloads for complex numbers 532 | inline __host__ __device__ complex operator+(const complex &a, const complex &b) 533 | { 534 | return complex(a.re + b.re, a.im + b.im); 535 | } 536 | inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); } 537 | inline __host__ __device__ complex operator-(const complex &a, const complex &b) 538 | { 539 | return complex(a.re - b.re, a.im - b.im); 540 | } 541 | inline __host__ __device__ complex operator*(const complex &a, const complex &b) 542 | { 543 | return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im); 544 | } 545 | inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; } 546 | inline __host__ __device__ complex operator/(const complex &a, const complex &b) 547 | { 548 | float invabs2 = 1 / abs2(b); 549 | return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2); 550 | } // operator/ 551 | /** find the dwell for the pixel */ 552 | __device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y) 553 | { 554 | complex dc = cmax - cmin; 555 | float fx = (float)x / w, fy = (float)y / h; 556 | complex c = cmin + complex(fx * dc.re, fy * dc.im); 557 | int dwell = 0; 558 | complex z = c; 559 | while (dwell < max_dwell && abs2(z) < 2 * 2) { 560 | z = z * z + c; 561 | dwell++; 562 | } 563 | return dwell; 564 | } // pixel_dwell 565 | 566 | /** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral 567 | element, -1 = dwells are different */ 568 | // #define NEUT_DWELL (MAX_DWELL + 1) 569 | #define DIFF_DWELL (-1) 570 | __device__ int same_dwell(int d1, int d2, int max_dwell) 571 | { 572 | if (d1 == d2) 573 | return d1; 574 | else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1)) 575 | return min(d1, d2); 576 | else 577 | return DIFF_DWELL; 578 | } // same_dwell 579 | 580 | /** evaluates the common border dwell, if it exists */ 581 | __device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0, 582 | int d) 583 | { 584 | // check whether all boundary pixels have the same dwell 585 | int tid = threadIdx.y * blockDim.x + threadIdx.x; 586 | int bs = blockDim.x * blockDim.y; 587 | int comm_dwell = (max_dwell + 1); 
588 | // for all boundary pixels, distributed across threads 589 | for (int r = tid; r < d; r += bs) { 590 | // for each boundary: b = 0 is east, then counter-clockwise 591 | for (int b = 0; b < 4; b++) { 592 | int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0); 593 | int y = b % 2 == 0 ? y0 + r : (b == 1 ? y0 + d - 1 : y0); 594 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y); 595 | comm_dwell = same_dwell(comm_dwell, dwell, max_dwell); 596 | } 597 | } // for all boundary pixels 598 | // reduce across threads in the block 599 | __shared__ int ldwells[BSX * BSY]; 600 | int nt = min(d, BSX * BSY); 601 | if (tid < nt) 602 | ldwells[tid] = comm_dwell; 603 | __syncthreads(); 604 | for (; nt > 1; nt /= 2) { 605 | if (tid < nt / 2) 606 | ldwells[tid] = same_dwell(ldwells[tid], ldwells[tid + nt / 2], max_dwell); 607 | __syncthreads(); 608 | } 609 | return ldwells[0]; 610 | } // border_dwell 611 | 612 | /** the kernel to fill the image region with a specific dwell value */ 613 | __global__ void dwell_fill_k(int *dwells, int w, int x0, int y0, int d, int dwell) 614 | { 615 | int x = threadIdx.x + blockIdx.x * blockDim.x; 616 | int y = threadIdx.y + blockIdx.y * blockDim.y; 617 | if (x < d && y < d) { 618 | x += x0, y += y0; 619 | dwells[y * w + x] = dwell; 620 | } 621 | } // dwell_fill_k 622 | 623 | /** 624 | * the kernel to fill in per-pixel values of the portion of the Mandelbrot set 625 | */ 626 | __global__ void mandelbrot_pixel_k(int *dwells, int w, int h, int max_dwell, complex cmin, 627 | complex cmax, int x0, int y0, int d) 628 | { 629 | int x = threadIdx.x + blockDim.x * blockIdx.x; 630 | int y = threadIdx.y + blockDim.y * blockIdx.y; 631 | if (x < d && y < d) { 632 | x += x0, y += y0; 633 | dwells[y * w + x] = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y); 634 | } 635 | } // mandelbrot_pixel_k 636 | 637 | /** computes the dwells for Mandelbrot image using dynamic parallelism; one block is launched per 638 | pixel 639 | @param dwells the output array 640 | @param w the width of the output image 641 | @param h the height of the output image 642 | @param cmin the complex value associated with the left-bottom corner of the image 643 | @param cmax the complex value associated with the right-top corner of the image 644 | @param x0 the starting x coordinate of the portion to compute 645 | @param y0 the starting y coordinate of the portion to compute 646 | @param d the size of the portion to compute (the portion is always a square) 647 | @param depth kernel invocation depth 648 | @remarks the algorithm reverts to per-pixel Mandelbrot evaluation once either maximum depth or 649 | minimum size is reached 650 | */ 651 | __global__ void mandelbrot_with_dp(int *dwells, int w, int h, int max_dwell, complex cmin, 652 | complex cmax, int x0, int y0, int d, int depth) 653 | { 654 | x0 += d * blockIdx.x, y0 += d * blockIdx.y; 655 | int comm_dwell = border_dwell(w, h, max_dwell, cmin, cmax, x0, y0, d); 656 | if (threadIdx.x == 0 && threadIdx.y == 0) { 657 | if (comm_dwell != DIFF_DWELL) { 658 | // uniform dwell, just fill 659 | dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY)); 660 | dwell_fill_k<<>>(dwells, w, x0, y0, d, comm_dwell); 661 | } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) { 662 | // subdivide recursively 663 | dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV); 664 | mandelbrot_with_dp<<>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, 665 | d / SUBDIV, depth + 1); 666 | } else { 667 | // leaf, per-pixel kernel 668 | dim3 bs(BSX, BSY), grid(divup(d, BSX), 
divup(d, BSY)); 669 | mandelbrot_pixel_k<<>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, d); 670 | } 671 | // check_error(x0, y0, d); 672 | } 673 | } // mandelbrot_with_dp 674 | 675 | /** computes the dwells for Mandelbrot image 676 | @param dwells the output array 677 | @param w the width of the output image 678 | @param h the height of the output image 679 | @param cmin the complex value associated with the left-bottom corner of the image 680 | @param cmax the complex value associated with the right-top corner of the image 681 | */ 682 | __global__ void mandelbrot_without_dp(int *dwells, int w, int h, int max_dwell, complex cmin, 683 | complex cmax) 684 | { 685 | // complex value to start iteration (c) 686 | int x = threadIdx.x + blockIdx.x * blockDim.x; 687 | int y = threadIdx.y + blockIdx.y * blockDim.y; 688 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y); 689 | dwells[y * w + x] = dwell; 690 | } 691 | 692 | __global__ void dwell_fill_k_null() { printf("111 \n"); } // dwell_fill_k 693 | 694 | __global__ void mandelbrot_with_dp_cpu_perf() { dwell_fill_k_null<<<1, 1>>>(); } 695 | 696 | __global__ void mandelbrot_without_dp_cpu_perf() { printf("222 \n"); } 697 | 698 | struct timeval t1, t2; 699 | 700 | static void BM_DynamicParallelism_WithDP() 701 | { 702 | static char env_str[] = "DOORBELL_LISTEN=ON"; 703 | putenv(env_str); 704 | 705 | // allocate memory 706 | int w = W; 707 | int h = H; 708 | int max_dwell = MAX_DWELL; 709 | 710 | size_t dwell_sz = w * h * sizeof(int); 711 | int *h_dwells, *d_dwells; 712 | mcMalloc((void **)&d_dwells, dwell_sz); 713 | h_dwells = (int *)malloc(dwell_sz); 714 | 715 | dim3 bs(BSX, BSY), grid(INIT_SUBDIV, INIT_SUBDIV); 716 | gettimeofday(&t1, NULL); 717 | mandelbrot_with_dp<<>>(d_dwells, w, h, max_dwell, complex(-1.5, -1), 718 | complex(0.5, 1), 0, 0, w / INIT_SUBDIV, 1); 719 | gettimeofday(&t2, NULL); 720 | mcDeviceSynchronize(); 721 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost); 722 | 723 | // free data 724 | mcFree(d_dwells); 725 | free(h_dwells); 726 | cout<<"BM_DynamicParallelism_WithDP over "<>>(d_dwells, w, h, max_dwell, complex(-1.5, -1), 747 | complex(0.5, 1)); 748 | gettimeofday(&t2, NULL); 749 | mcDeviceSynchronize(); 750 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost); 751 | 752 | // free data 753 | mcFree(d_dwells); 754 | free(h_dwells); 755 | cout<<"BM_DynamicParallelism_WithoutDP over"<>>(); 770 | 771 | mcDeviceSynchronize(); 772 | cout<<"BM_DynamicParallelism_WithDP_CPU_Perf over"<>>(); 782 | 783 | mcDeviceSynchronize(); 784 | cout<<"BM_DynamicParallelism_WithoutDP_CPU_Perf over"< 812 | #include 813 | #include 814 | #include 815 | 816 | using namespace std; 817 | 818 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N){ 819 | int i = threadIdx.x + blockDim.x * blockIdx.x; 820 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f; 821 | } 822 | 823 | int main(int argc,char *argv[]){ 824 | int n = atoi(argv[1]); 825 | cout << n << endl; 826 | 827 | float *A,*B,*C; 828 | mcMallocManaged(&A,n*sizeof(float)); 829 | mcMallocManaged(&B,n*sizeof(float)); 830 | mcMallocManaged(&C,n*sizeof(float)); 831 | 832 | for(int i=0;i>>(A,B,C,n); 840 | mcDeviceSynchronize(); 841 | for(int i=0;i 864 | 865 | ### Exercise 2 866 | 867 | ```c++ 868 | #include 869 | #include 870 | #include 871 | #include 872 | #include 873 | #include 874 | using namespace std; 875 | 876 | #define M 512 877 | #define K 512 878 | #define N 512 879 | 880 | void initial(float *array, int size) 881 | { 882 | for (int i = 0; i < size; 
i++) 883 | { 884 | array[i] = (float)(rand() % 10 + 1); 885 | } 886 | } 887 | 888 | //核函数(静态共享内存版) 889 | __global__ void matrixMultiplyShared(float *A, float *B, float *C, 890 | int numARows, int numAColumns, int numBRows, int numBColumns, int numCRows, int numCColumns) 891 | { 892 | //分配共享内存 893 | // __shared__ float sharedM[blockDim.y][blockDim.x]; 894 | // __shared__ float sharedN[blockDim.x][blockDim.y]; 895 | __shared__ float sharedM[16][32]; 896 | __shared__ float sharedN[16][32]; 897 | 898 | int bx = blockIdx.x; 899 | int by = blockIdx.y; 900 | int tx = threadIdx.x; 901 | int ty = threadIdx.y; 902 | 903 | int row = by * blockDim.y + ty; 904 | int col = bx * blockDim.x + tx; 905 | 906 | float Csub = 0.0; 907 | 908 | //将保存在全局内存中的矩阵M&N分块存放到共享内存中 909 | for (int i = 0; i < (int)(ceil((float)numAColumns / blockDim.x)); i++) 910 | { 911 | if (i*blockDim.x + tx < numAColumns && row < numARows) 912 | sharedM[ty][tx] = A[row*numAColumns + i * blockDim.x + tx]; 913 | else 914 | sharedM[ty][tx] = 0.0; 915 | 916 | if (i*blockDim.y + ty < numBRows && col < numBColumns)//分割N矩阵 917 | sharedN[ty][tx] = B[(i*blockDim.y + ty)*numBColumns + col]; 918 | else 919 | sharedN[ty][tx] = 0.0; 920 | __syncthreads(); 921 | 922 | for (int j = 0; j < blockDim.x; j++)//分块后的矩阵相乘 923 | Csub += sharedM[ty][j] * sharedN[j][tx]; 924 | __syncthreads(); 925 | } 926 | 927 | if (row < numCRows && col < numCColumns)//将计算后的矩阵块放到结果矩阵C中 928 | C[row*numCColumns + col] = Csub; 929 | } 930 | 931 | 932 | int main(int argc, char **argv) 933 | { 934 | int Axy = M * K; 935 | int Bxy = K * N; 936 | int Cxy = M * N; 937 | 938 | float *h_A, *h_B, *h_C; 939 | h_A = (float*)malloc(Axy * sizeof(float)); 940 | h_B = (float*)malloc(Bxy * sizeof(float)); 941 | 942 | h_C = (float*)malloc(Cxy * sizeof(float)); 943 | 944 | initial(h_A, Axy); 945 | initial(h_B, Bxy); 946 | 947 | float *d_A, *d_B, *d_C; 948 | mcMalloc((void**)&d_A, Axy * sizeof(float)); 949 | mcMalloc((void**)&d_B, Bxy * sizeof(float)); 950 | mcMalloc((void**)&d_C, Cxy * sizeof(float)); 951 | 952 | mcMemcpy(d_A, h_A, Axy * sizeof(float), mcMemcpyHostToDevice); 953 | mcMemcpy(d_B, h_B, Bxy * sizeof(float), mcMemcpyHostToDevice); 954 | 955 | int dimx = 32; 956 | int dimy = 16; 957 | dim3 block(dimx, dimy); 958 | dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y); 959 | struct timeval t1, t2; 960 | gettimeofday(&t1, NULL); 961 | matrixMultiplyShared <<< grid, block >>> (d_A, d_B, d_C, M, K, K, N, M, N); 962 | mcMemcpy(h_C, d_C, Cxy * sizeof(float), mcMemcpyDeviceToHost); 963 | gettimeofday(&t2, NULL); 964 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 965 | cout << "timeuse: " << timeuse << endl; 966 | mcFree(d_A); 967 | mcFree(d_B); 968 | mcFree(d_C); 969 | 970 | free(h_A); 971 | free(h_B); 972 | free(h_C); 973 | } 974 | 975 | ``` 976 | 977 | 978 | 979 | -------------------------------------------------------------------------------- /习题运行结果/nestedMandelbrot.cpp: -------------------------------------------------------------------------------- 1 | // #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | // #include 8 | // #include "dynamicParallelism.h" 9 | #include 10 | /** block size along */ 11 | #define BSX 64 12 | #define BSY 4 13 | /** maximum recursion depth */ 14 | #define MAX_DEPTH 4 15 | /** region below which do per-pixel */ 16 | #define MIN_SIZE 32 17 | /** subdivision factor along each axis */ 18 | #define SUBDIV 4 19 | /** subdivision when launched from host */ 20 | #define INIT_SUBDIV 
32 21 | #define H (16 * 1024) 22 | #define W (16 * 1024) 23 | #define MAX_DWELL 512 24 | using namespace std; 25 | 26 | 27 | 28 | /** a useful function to compute the number of threads */ 29 | int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); } 30 | 31 | /** a simple complex type */ 32 | struct complex { 33 | __host__ __device__ complex(float re, float im = 0) 34 | { 35 | this->re = re; 36 | this->im = im; 37 | } 38 | /** real and imaginary part */ 39 | float re, im; 40 | }; // struct complex 41 | 42 | // operator overloads for complex numbers 43 | inline __host__ __device__ complex operator+(const complex &a, const complex &b) 44 | { 45 | return complex(a.re + b.re, a.im + b.im); 46 | } 47 | inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); } 48 | inline __host__ __device__ complex operator-(const complex &a, const complex &b) 49 | { 50 | return complex(a.re - b.re, a.im - b.im); 51 | } 52 | inline __host__ __device__ complex operator*(const complex &a, const complex &b) 53 | { 54 | return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im); 55 | } 56 | inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; } 57 | inline __host__ __device__ complex operator/(const complex &a, const complex &b) 58 | { 59 | float invabs2 = 1 / abs2(b); 60 | return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2); 61 | } // operator/ 62 | /** find the dwell for the pixel */ 63 | __device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y) 64 | { 65 | complex dc = cmax - cmin; 66 | float fx = (float)x / w, fy = (float)y / h; 67 | complex c = cmin + complex(fx * dc.re, fy * dc.im); 68 | int dwell = 0; 69 | complex z = c; 70 | while (dwell < max_dwell && abs2(z) < 2 * 2) { 71 | z = z * z + c; 72 | dwell++; 73 | } 74 | return dwell; 75 | } // pixel_dwell 76 | 77 | /** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral 78 | element, -1 = dwells are different */ 79 | // #define NEUT_DWELL (MAX_DWELL + 1) 80 | #define DIFF_DWELL (-1) 81 | __device__ int same_dwell(int d1, int d2, int max_dwell) 82 | { 83 | if (d1 == d2) 84 | return d1; 85 | else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1)) 86 | return min(d1, d2); 87 | else 88 | return DIFF_DWELL; 89 | } // same_dwell 90 | 91 | /** evaluates the common border dwell, if it exists */ 92 | __device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0, 93 | int d) 94 | { 95 | // check whether all boundary pixels have the same dwell 96 | int tid = threadIdx.y * blockDim.x + threadIdx.x; 97 | int bs = blockDim.x * blockDim.y; 98 | int comm_dwell = (max_dwell + 1); 99 | // for all boundary pixels, distributed across threads 100 | for (int r = tid; r < d; r += bs) { 101 | // for each boundary: b = 0 is east, then counter-clockwise 102 | for (int b = 0; b < 4; b++) { 103 | int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0); 104 | int y = b % 2 == 0 ? y0 + r : (b == 1 ? 


/** a useful function to compute the number of threads */
int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); }

/** a simple complex type */
struct complex {
    __host__ __device__ complex(float re, float im = 0)
    {
        this->re = re;
        this->im = im;
    }
    /** real and imaginary part */
    float re, im;
}; // struct complex

// operator overloads for complex numbers
inline __host__ __device__ complex operator+(const complex &a, const complex &b)
{
    return complex(a.re + b.re, a.im + b.im);
}
inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); }
inline __host__ __device__ complex operator-(const complex &a, const complex &b)
{
    return complex(a.re - b.re, a.im - b.im);
}
inline __host__ __device__ complex operator*(const complex &a, const complex &b)
{
    return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im);
}
inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; }
inline __host__ __device__ complex operator/(const complex &a, const complex &b)
{
    float invabs2 = 1 / abs2(b);
    return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2);
} // operator/

/** find the dwell for the pixel */
__device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y)
{
    complex dc = cmax - cmin;
    float fx = (float)x / w, fy = (float)y / h;
    complex c = cmin + complex(fx * dc.re, fy * dc.im);
    int dwell = 0;
    complex z = c;
    while (dwell < max_dwell && abs2(z) < 2 * 2) {
        z = z * z + c;
        dwell++;
    }
    return dwell;
} // pixel_dwell

/** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral
    element, -1 = dwells are different */
// #define NEUT_DWELL (MAX_DWELL + 1)
#define DIFF_DWELL (-1)
__device__ int same_dwell(int d1, int d2, int max_dwell)
{
    if (d1 == d2)
        return d1;
    else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1))
        return min(d1, d2);
    else
        return DIFF_DWELL;
} // same_dwell

/** evaluates the common border dwell, if it exists */
__device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0,
                            int d)
{
    // check whether all boundary pixels have the same dwell
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int bs = blockDim.x * blockDim.y;
    int comm_dwell = (max_dwell + 1);
    // for all boundary pixels, distributed across threads
    for (int r = tid; r < d; r += bs) {
        // for each boundary: b = 0 is east, then counter-clockwise
        for (int b = 0; b < 4; b++) {
            int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0);
            int y = b % 2 == 0 ? y0 + r : (b == 1 ? y0 + d - 1 : y0);
            int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
            comm_dwell = same_dwell(comm_dwell, dwell, max_dwell);
        }
    } // for all boundary pixels
    // reduce across threads in the block
    __shared__ int ldwells[BSX * BSY];
    int nt = min(d, BSX * BSY);
    if (tid < nt)
        ldwells[tid] = comm_dwell;
    __syncthreads();
    for (; nt > 1; nt /= 2) {
        if (tid < nt / 2)
            ldwells[tid] = same_dwell(ldwells[tid], ldwells[tid + nt / 2], max_dwell);
        __syncthreads();
    }
    return ldwells[0];
} // border_dwell
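// border_dwell() above is the heart of the subdivision strategy: because the
// Mandelbrot set and its dwell level sets are connected, a tile whose border has
// one uniform dwell has that same dwell everywhere inside (the Mariani-Silver
// idea). mandelbrot_with_dp() below uses this to pick one of three actions per
// tile: fill it in one shot (dwell_fill_k), split it again with device-side child
// launches, or fall back to plain per-pixel evaluation (mandelbrot_pixel_k).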

/** the kernel to fill the image region with a specific dwell value */
__global__ void dwell_fill_k(int *dwells, int w, int x0, int y0, int d, int dwell)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < d && y < d) {
        x += x0, y += y0;
        dwells[y * w + x] = dwell;
    }
} // dwell_fill_k

/**
 * the kernel to fill in per-pixel values of the portion of the Mandelbrot set
 */
__global__ void mandelbrot_pixel_k(int *dwells, int w, int h, int max_dwell, complex cmin,
                                   complex cmax, int x0, int y0, int d)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    if (x < d && y < d) {
        x += x0, y += y0;
        dwells[y * w + x] = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
    }
} // mandelbrot_pixel_k

/** computes the dwells for the Mandelbrot image using dynamic parallelism; one block is launched
    per sub-region (tile)
    @param dwells the output array
    @param w the width of the output image
    @param h the height of the output image
    @param cmin the complex value associated with the left-bottom corner of the image
    @param cmax the complex value associated with the right-top corner of the image
    @param x0 the starting x coordinate of the portion to compute
    @param y0 the starting y coordinate of the portion to compute
    @param d the size of the portion to compute (the portion is always a square)
    @param depth kernel invocation depth
    @remarks the algorithm reverts to per-pixel Mandelbrot evaluation once either maximum depth or
    minimum size is reached
 */
__global__ void mandelbrot_with_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
                                   complex cmax, int x0, int y0, int d, int depth)
{
    x0 += d * blockIdx.x, y0 += d * blockIdx.y;
    int comm_dwell = border_dwell(w, h, max_dwell, cmin, cmax, x0, y0, d);
    // only one thread per block launches the child grids
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        if (comm_dwell != DIFF_DWELL) {
            // uniform dwell, just fill
            dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
            dwell_fill_k<<<grid, bs>>>(dwells, w, x0, y0, d, comm_dwell);
        } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
            // subdivide recursively
            dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
            mandelbrot_with_dp<<<grid, bs>>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0,
                                             d / SUBDIV, depth + 1);
        } else {
            // leaf, per-pixel kernel
            dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
            mandelbrot_pixel_k<<<grid, bs>>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, d);
        }
        // check_error(x0, y0, d);
    }
} // mandelbrot_with_dp

/** computes the dwells for the Mandelbrot image
    @param dwells the output array
    @param w the width of the output image
    @param h the height of the output image
    @param cmin the complex value associated with the left-bottom corner of the image
    @param cmax the complex value associated with the right-top corner of the image
 */
__global__ void mandelbrot_without_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
                                      complex cmax)
{
    // complex value to start iteration (c)
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
    dwells[y * w + x] = dwell;
}

__global__ void dwell_fill_k_null() { printf("111 \n"); } // dwell_fill_k_null

__global__ void mandelbrot_with_dp_cpu_perf() { dwell_fill_k_null<<<1, 1>>>(); }

__global__ void mandelbrot_without_dp_cpu_perf() { printf("222 \n"); }

struct timeval t1, t2;

static void BM_DynamicParallelism_WithDP()
{
    static char env_str[] = "DOORBELL_LISTEN=ON";
    putenv(env_str);

    // allocate memory
    int w = W;
    int h = H;
    int max_dwell = MAX_DWELL;

    size_t dwell_sz = w * h * sizeof(int);
    int *h_dwells, *d_dwells;
    mcMalloc((void **)&d_dwells, dwell_sz);
    h_dwells = (int *)malloc(dwell_sz);

    dim3 bs(BSX, BSY), grid(INIT_SUBDIV, INIT_SUBDIV);
    gettimeofday(&t1, NULL);
    mandelbrot_with_dp<<<grid, bs>>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
                                     complex(0.5, 1), 0, 0, w / INIT_SUBDIV, 1);
    // note: t2 is taken before mcDeviceSynchronize(), so t2 - t1 measures the
    // asynchronous launch rather than the kernel's execution time
    gettimeofday(&t2, NULL);
    mcDeviceSynchronize();
    mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);

    // free data
    mcFree(d_dwells);
    free(h_dwells);
    cout << "BM_DynamicParallelism_WithDP over " << endl;
}

// same setup as BM_DynamicParallelism_WithDP, but every pixel is computed by one flat launch
static void BM_DynamicParallelism_WithoutDP()
{
    // allocate memory
    int w = W;
    int h = H;
    int max_dwell = MAX_DWELL;

    size_t dwell_sz = w * h * sizeof(int);
    int *h_dwells, *d_dwells;
    mcMalloc((void **)&d_dwells, dwell_sz);
    h_dwells = (int *)malloc(dwell_sz);

    dim3 bs(BSX, BSY), grid(divup(w, bs.x), divup(h, bs.y));
    gettimeofday(&t1, NULL);
    mandelbrot_without_dp<<<grid, bs>>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
                                        complex(0.5, 1));
    gettimeofday(&t2, NULL);
    mcDeviceSynchronize();
    mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);

    // free data
    mcFree(d_dwells);
    free(h_dwells);
    cout << "BM_DynamicParallelism_WithoutDP over" << endl;
}

// host-side cost of launching a kernel that itself launches a child kernel
static void BM_DynamicParallelism_WithDP_CPU_Perf()
{
    mandelbrot_with_dp_cpu_perf<<<1, 1>>>();

    mcDeviceSynchronize();
    cout << "BM_DynamicParallelism_WithDP_CPU_Perf over" << endl;
}

// host-side cost of launching a plain kernel, for comparison
static void BM_DynamicParallelism_WithoutDP_CPU_Perf()
{
    mandelbrot_without_dp_cpu_perf<<<1, 1>>>();

    mcDeviceSynchronize();
    cout << "BM_DynamicParallelism_WithoutDP_CPU_Perf over" << endl;
}

int main()
{
    // run the four variants back to back
    BM_DynamicParallelism_WithDP();
    BM_DynamicParallelism_WithoutDP();
    BM_DynamicParallelism_WithDP_CPU_Perf();
    BM_DynamicParallelism_WithoutDP_CPU_Perf();
}

--------------------------------------------------------------------------------
/示例代码运行截图/示例代码运行截图.md:
--------------------------------------------------------------------------------
## chapter 2

### 2-1

## chapter 3

### 3-2

## chapter 4

### 4-1

## chapter 5

### 5-1

### 5-3

### 5-5

## chapter 6

### 6-1

### 6-2

### 6-3

### 6-4

### 6-5

### 6-6

### 6-7

### 6-8

### 6-9

### 6-10

### 6-11

### 6-12

### 6-30

## chapter 7

### 7-4

### 7-5

## chapter 8

### 8-1

### 8-2

--------------------------------------------------------------------------------