├── .gitattributes ├── MXMACA编程的内存层次模型.png ├── README.md ├── chapter10 ├── mcBlas.c ├── mcDNN.cpp ├── mcblas命令.txt └── usingThrust.cpp ├── chapter11 ├── Makefile ├── simple2DFD.cpp └── vectorAddMultiGpus.cpp ├── chapter2 └── helloFromGpu.c ├── chapter3 ├── cpuVectorAdd.cpp └── gpuVectorAdd.cpp ├── chapter4 └── grammar.cpp ├── chapter5 ├── Cooperative_Groups.cpp ├── assignKernel.cpp ├── information.cpp └── nestedHelloWorld.cpp ├── chapter6 ├── AplusB_with_managed.cpp ├── AplusB_with_unified_addressing.cpp ├── AplusB_without_unified_addressing.cpp ├── BC_addKernel.cpp ├── NBC_addKernel2.cpp ├── __shfl_down_syncExample.cpp ├── __shfl_syncExample.cpp ├── __shfl_up_syncExample.cpp ├── __shfl_xor_syncExample.cpp ├── checkGlobalVariable.cpp ├── information.cpp ├── vectorAddUnifiedVirtualAddressing.cpp └── vectorAddZerocopy.cpp ├── chapter7 ├── Makefile.txt ├── my_program │ ├── CMakeLists.txt │ ├── include │ │ ├── a.h │ │ └── b.h │ ├── main.cpp │ └── src │ │ ├── a.cpp │ │ └── b.cpp ├── trigger_memory_violation.cpp ├── trigger_memory_violation_repaired.cpp └── vectorAdd.cpp ├── chapter8 ├── myKernel.cpp └── stream_parallel_execution.cpp ├── chapter9 ├── shortKernelsAsyncLaunch.cpp ├── shortKernelsGraphLaunch.cpp └── shortKernelsSyncLaunch.cpp ├── common └── common.h ├── 习题运行结果 ├── 3.1.png ├── 3.2.png ├── 5.2.9.1运行结果 │ ├── 1.png │ ├── 2.png │ └── 3.png ├── 5.2.9.2运行结果 │ ├── 1.png │ ├── 2.png │ └── 3.png ├── T4运行结果.png ├── answer.md ├── nestedMandelbrot.cpp └── 统一内存寻址运行结果.png ├── 开源的完整示例代码表.md └── 示例代码运行截图 ├── chapter2 └── 2-1.png ├── chapter3 └── 3-2.png ├── chapter4 └── 4-1.png ├── chapter5 ├── 5-1.png ├── 5-3.png └── 5-5.png ├── chapter6 ├── 6-1-1.png ├── 6-1-2.png ├── 6-10-1.png ├── 6-10-2.png ├── 6-11-1.png ├── 6-11-2.png ├── 6-12-1.png ├── 6-12-2.png ├── 6-2-1.png ├── 6-2-2.png ├── 6-3-1.png ├── 6-3-2.png ├── 6-30.png ├── 6-4.png ├── 6-5.png ├── 6-6.png ├── 6-7.png ├── 6-8.png └── 6-9.png ├── chapter7 ├── 7-4.png └── 7-5.png ├── chapter8 ├── 8-1.png └── 8-2.png └── 示例代码运行截图.md /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /MXMACA编程的内存层次模型.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/MXMACA编程的内存层次模型.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # getting-started-guide-and-summary-of-MXMACA 2 | 3 | ## CPU VS GPU 4 | 5 | CPU,即中央处理器,由数百万个晶体管构成,可以具有多个处理核心,是计算机系统的运算和控制核心。CPU涉及到通用计算,适合少量的复杂计算。CPU虽然处理核心远没有GPU多,但是可以将核心集中在单个任务上并快速完成工作。 6 | 7 | GPU,即图形处理器,由许多更小、更专业的核心组成的处理器。适合大量的简单运算。GPU最初是用来加速3D渲染任务,但是随着时间的推移,这些固定功能的引擎变得更加可编程、更加灵活。虽然图形和日益逼真的视觉效果仍然是GPU的主要功能,但GPU也已发展成为更通用的并行处理器,可以处理越来越多的应用程序。 8 | 9 | | CPU | GPU | 10 | | ---------------------------------- | -------------------------------- | 11 | | 通用组件,负责计算机的主要处理功能 | 专用组件,主要负责图形和视频渲染 | 12 | | 核心数:2-64 | 核心数:数千 | 13 | | 串行运行进程 | 并行运行进程 | 14 | | 更适合处理一项大任务 | 更适合处理多个较小的任务 | 15 | 16 | 17 | 18 | ### 加速深度学习和人工智能 19 | 20 | GPU或其他加速器非常适合用神经网络或大量特定数据(e.g. 
2D图像)进行深度学习训练。 21 | 22 | GPU加速方法已经适用于深度学习算法,可以显著提升算法性能。 23 | 24 | 25 | 26 | ## 基本概念的解释 27 | 28 | 内存部分的解释详见MXMACA内存模型和管理。 29 | 30 | ### 主机端(host) 31 | 32 | CPU所在的位置称为主机端。 33 | 34 | 可以简单理解为CPU。 35 | 36 | ### 设备端(device) 37 | 38 | GPU所在的位置称为设备端。 39 | 40 | 可以简单理解为GPU。 41 | 42 | 主机和设备之间通过PCIe总线连接,用于传递指令和数据,让CPU和GPU一起来协同工作。 43 | 44 | ### 加速处理器(Accelerated Processors,AP) 45 | 46 | 每个AP都能支持数千个GPU线程并发执行。 47 | 48 | 执行具体的指令和指令和任务。 49 | 50 | ### 核函数(kernel) 51 | 52 | 核函数在设备端执行,需要为一个线程规定所进行的计算和访问的数据。当核函数被调用时,许多不同的MXMACA线程并行执行同一计算任务。 53 | 54 | 在设备侧(GPU)执行,可以在设备侧(GPU)和主机侧(CPU)被调用。 55 | 56 | ### 线程(thread) 57 | 58 | 一般通过GPU的一个核进行处理。 59 | 60 | 每个线程是Kernel的单个执行实例。在一个block中的所有线程可以共享一些资源,并能够相互通信。 61 | 62 | ### 线程束(wave) 63 | 64 | GPU执行程序时的调度单位。 65 | 66 | 64个线程组成一个线程束,线程束中每个线程在不同数据集上同时执行相同的指令。 67 | 68 | ### 线程块(thread block) 69 | 70 | 由多个线程组成。可以是一维、二维或三维的。 71 | 72 | 各block是并行执行的。 73 | 74 | 同一个线程块内的线程可以相互协作,不同线程块内的线程不能协作。 75 | 76 | 当启动一个核函数网格时,它的GPU线程会被分配到可用的AP上执行。一旦线程块被调度到一个AP上,其中的线程将只在该指定的AP上并发执行。 77 | 78 | 多个线程块根据AP资源的可用性进行调度,可能会被分配到同一个AP上或不同的AP上。 79 | 80 | ### 线程网格(grid) 81 | 82 | 多个线程块可以构成线程网格。 83 | 84 | 和核函数(kernel)的关系:启动核函数(kernel)时,会定义一个线程网格(grid)。 85 | 86 | 网格可以是一维的、二维的或三维的。 87 | 88 | ### 流(stream) 89 | 90 | 相当于是GPU上的任务队列。 91 | 92 | 同一个stream的任务是严格保证顺序的,上一个命令执行完成才会执行下一个命令。 93 | 94 | 不同stream的命令不保证任何执行顺序。部分优化技巧需要用到多个stream才能实现。如在执行kernel的同时进行数据拷贝,需要一个stream执行kernel,另一个stream进行数据拷贝。 95 | 96 | 97 | 98 | ## 基本编程模型 99 | 100 | 1. 用户可以通过调用动态运行时库,申请、释放显存,并在内存和显存间进行数据拷贝。 101 | 102 | 2. 典型的MXMACA程序实现流程遵循以下模式: 103 | 104 | 1. 把数据从CPU内存拷贝到GPU内存; 105 | 2. 调用核函数对GPU内存的数据进行处理; 106 | 3. 将数据从GPU内存传送回CPU内存。 107 | 108 | 3. 用户可以编写kernel函数,在主机侧调用kernel函数,调用将创建GPU线程。 109 | 110 | 1. 用户可以在Kernel Launch时分别指定网格中的线程块数量、线程块中包含的线程数量。当用户指定的线程数量超过64,这些线程会被拆分成多个线程束,并在同一个AP上执行,这些线程束可能并发执行,也可能串行执行。 111 | 2. 每个GPU线程都会完整执行一次kernel函数,kernel函数可以对显存进行读、写等操作,也可以调用设备侧函数对显存进行读、写等操作。不同的GPU线程可以通过内置变量进行区分,只需要通过读取内置变量,分别找到线程块的位置、线程的位置,就可以给每一个线程唯一地标识ThreadIdx(可以参考后文,相关的几个内置变量)。 112 | 113 | 4. 相关的几个内置变量 114 | 115 | 1. `threadIdx`,获取线程`thread`的ID索引;如果线程是一维的那么就取`threadIdx.x`,二维的还可以多取到一个值`threadIdx.y`,以此类推到三维`threadIdx.z`。可以在一个线程块中唯一的标识线程。 116 | 2. `blockIdx`,线程块的ID索引;同样有`blockIdx.x`,`blockIdx.y`,`blockIdx.z`。可以在一个网格中唯一标识线程块。 117 | 3. `blockDim`,线程块的维度,同样有`blockDim.x`,`blockDim.y`,`blockDim.z`。可以代表每个维度下线程的最大数量。 118 | 1. 对于一维的`block`,线程的`threadID=threadIdx.x`。 119 | 2. 对于大小为`(blockDim.x, blockDim.y)`的 二维`block`,线程的`threadID=threadIdx.x+threadIdx.y*blockDim.x`。 120 | 3. 对于大小为`(blockDim.x, blockDim.y, blockDim.z)`的 三维 `block`,线程的`threadID=threadIdx.x+threadIdx.y*blockDim.x+threadIdx.z*blockDim.x*blockDim.y`。 121 | 4. `gridDim`,线程格的维度,同样有`gridDim.x`,`gridDim.y`,`gridDim.z`。可以代表每个唯独下线程块的最大数量。 122 | 123 | 5. 常用的GPU函数 124 | 125 | 1. `mcMalloc()` 126 | 127 | 负责内存分配。类似与C语言中的`malloc`。不过mcMalloc是在GPU上分配内存,返回device指针。 128 | 129 | 2. `mcMemcpy()` 130 | 131 | 负责内存复制。 132 | 133 | 可以把数据从host搬到device,再从device搬回host。 134 | 135 | 3. 
`mcFree()` 136 | 137 | 释放显存的指针。 138 | 139 | (可以参考示例代码) 140 | 141 | ## 基本硬件架构及其在Kernel执行中的作用 142 | 143 | ## MXMACA内存模型和管理 144 | 145 | ### MXMACA内存模型 146 | 147 | MXMACA的内存是分层次的,每个不同类型的内存空间有不同的作用域、生命周期和缓存行为。一个内核函数中,每个线程有自己的私有内存,每个线程块有自己工作组的共享内存并对块内的所有线程可见,一个线程网格中的所有线程都可以访问全局内存和常量。可以参考下图: 148 | 149 | 150 | 151 | 书里提到了它们的初始化方式,这里主要介绍它们的用途、局限性。 152 | 153 | #### 可编程存储器、不可编程存储器 154 | 155 | 根据存储器能否被程序员控制,可分为:可编程存储器、不可编程存储器。 156 | 157 | 可编程存储器:需要显示控制哪些数据放在可编程内存中。包括全局存储、常量存储、共享存储、本地存储和寄存器等。 158 | 159 | 不可编程存储器:不能决定哪些数据放在这些存储器中,也不能决定数据在存储器中的位置。包括一级缓存、二级缓存等。 160 | 161 | #### GPU寄存器 162 | 163 | 寄存器延迟极低,对于每个线程是私有的,与核函数的生命周期相同。 164 | 165 | 寄存器是稀有资源,使用过多的寄存器也会影响到性能,可以添加辅助信息控制限定寄存器数量。 166 | 167 | 书中也提到了一些方式,可以让一个线程束内的两个线程相互访问对方的寄存器,而不需要访问全局内存或者共享内存,延迟很低且不消耗额外内存。 168 | 169 | #### GPU私有内存 170 | 171 | 私有内存是每个线程私有的。 172 | 173 | 私有内存在物理上与全局内存在同一块储存区域,因此具有较高的延迟和低带宽。 174 | 175 | #### GPU线程块共享内存 176 | 177 | 共享内存的地址空间被线程块中所有的线程共享。它的内容和创建时所在的线程块具有相同生命周期。 178 | 179 | 共享内存让同一个线程块中的线程能够相互协作,便于重用片上数据,可以降低核函数所需的全局内存带宽。 180 | 181 | 相较于全局内存,共享内存延迟更低,带宽更高。 182 | 183 | 适合在数据需要重复利用、全局内存合并或线程之间有共享数据时使用共享内存。 184 | 185 | 不能过度使用,否则会限制活跃线程束的数量。 186 | 187 | 书里也提到了共享内存的分配、共享内存的地址映射方式、bank冲突以及最小化bank冲突的方法。bank冲突时,多个访问操作会被序列化,降低内存带宽,就没有什么并行的意义了。 188 | 189 | #### GPU常量内存 190 | 191 | 常量内存在设备内存中,并在每个AP专用的常量缓存中缓存。 192 | 193 | 如果线程束中所有线程都从相同内存读取数据,常量内存表现最好,因为每从一个常量内存中读取一次数据,都会广播给线程束里的所有线程。 194 | 195 | #### GPU全局内存 196 | 197 | GPU中内存最大、延迟最高、最常使用。 198 | 199 | 可以在任何AP上被访问,并且贯穿应用程序的整个生命周期。 200 | 201 | 优化时需要注意对齐内存访问与合并内存访问。 202 | 203 | ## MXMACA程序优化 204 | 205 | ### 性能优化的目标 206 | 207 | 1. 提高程序执行效率,减少运行时间,提高程序的处理能力和吞吐量。 208 | 2. 优化资源利用率,避免资源的浪费和滥用。 209 | 3. 改善程序的响应时间。 210 | 211 | ### 程序性能评估 212 | 213 | #### 精度 214 | 215 | GPU 的单精度计算性能要远远超过双精度计算性能,需要在速度与精度之间选取合适的平衡。 216 | 217 | #### 延迟 218 | 219 | #### 计算量 220 | 221 | 如果计算量很小,或者串行部分占用时间较长,并行部分占用时间较短,都不适合用GPU进行并行计算。 222 | 223 | ### 优化的主要策略 224 | 225 | #### 硬件性能优化 226 | 227 | #### 并行性优化 228 | 229 | 可以通过设置线程块的大小、每个线程块的共享内存使用量、每个线程使用的寄存器数量,尽量提升occupancy。 230 | 231 | #### 内存访问优化 232 | 233 | ##### 提高`Global Memory`访存效率 234 | 235 | 对齐内存访问:一个内存事务的首个访问地址尽量是缓存粒度(32或128字节)的偶数倍,减少带宽浪费。 236 | 237 | 合并内存访问:尽量让一个线程束的线程访问的内存都在一个线程块。 238 | 239 | ##### 提高`Shared Memory`访存效率 240 | 241 | 若`wave`中不同的线程访问相同的`bank`,则会发生bank冲突(bank conflict),bank冲突时,`wave`的一条访存指令会被拆分为n条不冲突的访存请求,降低`shared memory`的有效带宽。所以需要尽量避免bank冲突。 242 | 243 | #### 算法优化 244 | 245 | 1. 如何将问题分解成块、束、线程 246 | 2. 线程如何访问数据以及产生什么样的内存模式 247 | 3. 数据的重用性 248 | 4. 算法总共要执行多少工作,与串行化的方法之间的差异 249 | 250 | #### 算数运算密度优化 251 | 252 | 1. 超越函数操作:可以查阅平方根等超越函数和加速函数,以及设备接口函数 253 | 2. 近似:可以在速度和精度之间进行折衷 254 | 3. 查找表:用空间换时间。适合GPU高占用率的情况,也要考虑到计算的复杂度,计算复杂度低时,计算速度可能大大快于低GPU占用下的内存查找方式。 255 | 256 | #### 编译器优化 257 | 258 | 1. 展开循环 259 | 2. 常量折叠 e.g. 编译时直接计算常数,从而简化常数 260 | 3. 常量传播:将表达式中的变量替换为已知常数 261 | 4. 公共子表达式消除:将该类公共子表达式的值临时记录,并传播到子表达式使用的语句 262 | 5. 目标相关优化:用复杂指令取代简单通用的指令组合,使程序获得更高的性能 263 | 264 | #### 其他 265 | 266 | 1. 用结构体数组(结构体的成员是数组),而不是数组结构体(数组的每个元素都是结构体)。 267 | 2. 尽量少用条件分支。CPU具有分支预测的功能,GPU没有这一功能,GPU执行if,else语句的效率非常低。因此只能让束内每一线程在每个分支都经过一遍(但不一定执行),当然如果所有线程都不用执行,就可以忽略这一分支。只要有一个线程需要执行某一个分支,其他线程即使不需要执行,也要等着一个线程执行完才能开始自己的计算任务。而且不同的分支是串行执行的,因此要减少分支的数目。 268 | 1. 通过计算,去掉分支(可以参考书中8.3.4相关内容)。 269 | 2. 通过查找表去掉分支。 270 | 3. 尽量使`wave`块完美对齐,让一个`wave`里的所有线程都满足条件或者都不满足条件。 271 | 3. 引入一些指令级并行操作,尽可能终止最后的线程束以使整个线程块都闲置出来,并替换为另一个包含一组更活跃线程束的线程块。 272 | 273 | ### 优化性能需要考虑的指标 274 | 275 | 1. 最大化利用率 276 | 2. 最大化存储吞吐量 277 | 3. 最大化指令吞吐量 278 | 4. 最小化内存抖动 279 | 5. 
时间消耗(整体运行所需时间、GPU和CPU之间的传输所需时间、核函数运行所需时间) 280 | 281 | ## MXMACA生态的人工智能和计算加速库 282 | 283 | ### mcBLAS 284 | 285 | 主要用于多种形式的计算。 286 | 287 | `Level-1 Functions`定义了向量与向量、向量与标量之间的运算,还为多种数据类型(单精度浮点实数、单精度浮点复数、双精度浮点实数、双精度浮点复数)定义了专用的接口。 288 | 289 | `Level-2 Functions`定义了矩阵与向量之间的运算。 290 | 291 | `Level-3 Functions`定义了矩阵与矩阵之间的运算。是求解器和深度神经网络库的底层实现基础。 292 | 293 | ### mcDNN 294 | 295 | 提供常用深度学习算子。 296 | 297 | ### mcSPARSE 298 | 299 | 稀疏矩阵线性代数库。稀疏矩阵是指零元素数目远多于非零元素数目的矩阵。 300 | 301 | 可以用对应的接口完成稀疏矩阵线性代数运算。 302 | 303 | ### mcSOLVER 304 | 305 | 稠密矩阵线性方程组的求解函数库。 306 | 307 | ### mcFFT 308 | 309 | 快速傅里叶变换库。 310 | 311 | 312 | 313 | -------------------------------------------------------------------------------- /chapter10/mcBlas.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include "mcblas.h" 6 | 7 | /* cpu implementation of sgemm */ 8 | static void cpu_sgemm(int m, int n, int k, float alpha, const float *A, const float *B, float beta, float *C_in, 9 | float *C_out) { 10 | int i; 11 | int j; 12 | int kk; 13 | 14 | for (i = 0; i < m; ++i) { 15 | for (j = 0; j < n; ++j) { 16 | float prod = 0; 17 | 18 | for (kk = 0; kk < k; ++kk) { 19 | prod += A[kk * m + i] * B[j * k + kk]; 20 | } 21 | 22 | C_out[j * m + i] = alpha * prod + beta * C_in[j * m + i]; 23 | } 24 | } 25 | } 26 | 27 | int main(int argc, char **argv) { 28 | float *h_A; 29 | float *h_B; 30 | float *h_C; 31 | float *h_C_ref; 32 | float *d_A = 0; 33 | float *d_B = 0; 34 | float *d_C = 0; 35 | float alpha = 1.0f; 36 | float beta = 0.0f; 37 | int m = 256; 38 | int n = 128; 39 | int k = 64; 40 | int size_a = m * n; // the element num of A matrix 41 | int size_b = n * k; // the element num of B matrix 42 | int size_c = m * n; // the element num of C matrix 43 | float error_norm; 44 | float ref_norm; 45 | float diff; 46 | mcblasHandle_t handle; 47 | mcblasStatus_t status; 48 | 49 | /* Initialize mcBLAS */ 50 | status = mcblasCreate(&handle); 51 | if (status != MCBLAS_STATUS_SUCCESS) { 52 | fprintf(stderr, "Init failed\n"); 53 | return EXIT_FAILURE; 54 | } 55 | 56 | /* Allocate host memory for A/B/C matrix*/ 57 | h_A = (float *)malloc(size_a * sizeof(float)); 58 | if (h_A == NULL) { 59 | fprintf(stderr, "A host memory allocation failed\n"); 60 | return EXIT_FAILURE; 61 | } 62 | h_B = (float *)malloc(size_b * sizeof(float)); 63 | if (h_B == NULL) { 64 | fprintf(stderr, "B host memory allocation failed\n"); 65 | return EXIT_FAILURE; 66 | } 67 | h_C = (float *)malloc(size_c * sizeof(float)); 68 | if (h_C == 0) { 69 | fprintf(stderr, "C host memory allocation failed\n"); 70 | return EXIT_FAILURE; 71 | } 72 | h_C_ref = (float *)malloc(size_c * sizeof(float)); 73 | if (h_C_ref == 0) { 74 | fprintf(stderr, "C_ref host memory allocation failed\n"); 75 | return EXIT_FAILURE; 76 | } 77 | 78 | /* Fill the matrices with test data */ 79 | for (int i = 0; i < size_a; ++i) { 80 | h_A[i] = cos(i + 0.125); 81 | } 82 | for (int i = 0; i < size_b; ++i) { 83 | h_B[i] = cos(i - 0.125); 84 | } 85 | for (int i = 0; i < size_c; ++i) { 86 | h_C[i] = sin(i + 0.25); 87 | } 88 | 89 | /* Allocate device memory for the matrices */ 90 | if (mcMalloc((void **)(&d_A), size_a * sizeof(float)) != mcSuccess) { 91 | fprintf(stderr, "A device memory allocation failed\n"); 92 | return EXIT_FAILURE; 93 | } 94 | if (mcMalloc((void **)(&d_B), size_b * sizeof(float)) != mcSuccess) { 95 | fprintf(stderr, "B device memory allocation failed\n"); 96 | return EXIT_FAILURE; 97 | } 98 | if (mcMalloc((void **)(&d_C), size_c 
* sizeof(float)) != mcSuccess) {
99 | fprintf(stderr, "C device memory allocation failed\n");
100 | return EXIT_FAILURE;
101 | }
102 |
103 | /* Initialize the device matrices with the host matrices */
104 | if (mcblasSetVector(size_a, sizeof(float), h_A, 1, d_A, 1) != MCBLAS_STATUS_SUCCESS) {
105 | fprintf(stderr, "Copy A from host to device failed\n");
106 | return EXIT_FAILURE;
107 | }
108 | if (mcblasSetVector(size_b, sizeof(float), h_B, 1, d_B, 1) != MCBLAS_STATUS_SUCCESS) {
109 | fprintf(stderr, "Copy B from host to device failed\n");
110 | return EXIT_FAILURE;
111 | }
112 | if (mcblasSetVector(size_c, sizeof(float), h_C, 1, d_C, 1) != MCBLAS_STATUS_SUCCESS) {
113 | fprintf(stderr, "Copy C from host to device failed\n");
114 | return EXIT_FAILURE;
115 | }
116 |
117 | /* compute the reference result */
118 | cpu_sgemm(m, n, k, alpha, h_A, h_B, beta, h_C, h_C_ref);
119 |
120 | /* Performs operation using mcblas: column-major, no transpose, A(m x k) lda=m, B(k x n) ldb=k, C(m x n) ldc=m */
121 | status = mcblasSgemm(handle, MCBLAS_OP_N, MCBLAS_OP_N, m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
122 | if (status != MCBLAS_STATUS_SUCCESS) {
123 | fprintf(stderr, "Sgemm kernel execution failed\n");
124 | return EXIT_FAILURE;
125 | }
126 | /* Read the result back */
127 | status = mcblasGetVector(size_c, sizeof(float), d_C, 1, h_C, 1);
128 | if (status != MCBLAS_STATUS_SUCCESS) {
129 | fprintf(stderr, "C data reading failed\n");
130 | return EXIT_FAILURE;
131 | }
132 |
133 | /* Check result against reference */
134 | error_norm = 0;
135 | ref_norm = 0;
136 |
137 | for (int i = 0; i < size_c; ++i) {
138 | diff = h_C_ref[i] - h_C[i];
139 | error_norm += diff * diff;
140 | ref_norm += h_C_ref[i] * h_C_ref[i];
141 | }
142 |
143 | error_norm = (float)sqrt((double)error_norm);
144 | ref_norm = (float)sqrt((double)ref_norm);
145 |
146 | if (error_norm / ref_norm < 1e-6f) {
147 | printf("McBLAS test passed.\n");
148 | } else {
149 | printf("McBLAS test failed.\n");
150 | }
151 |
152 | /* Memory clean up */
153 | free(h_A);
154 | free(h_B);
155 | free(h_C);
156 | free(h_C_ref);
157 |
158 | if (mcFree(d_A) != mcSuccess) {
159 | fprintf(stderr, "A device mem free failed\n");
160 | return EXIT_FAILURE;
161 | }
162 |
163 | if (mcFree(d_B) != mcSuccess) {
164 | fprintf(stderr, "B device mem free failed\n");
165 | return EXIT_FAILURE;
166 | }
167 |
168 | if (mcFree(d_C) != mcSuccess) {
169 | fprintf(stderr, "C device mem free failed\n");
170 | return EXIT_FAILURE;
171 | }
172 |
173 | /* Shutdown */
174 | status = mcblasDestroy(handle);
175 | if (status != MCBLAS_STATUS_SUCCESS) {
176 | fprintf(stderr, "Destroy failed\n");
177 | return EXIT_FAILURE;
178 | }
179 |
180 | return EXIT_SUCCESS;
181 | }
182 |
-------------------------------------------------------------------------------- /chapter10/mcDNN.cpp: --------------------------------------------------------------------------------
1 | #include <iostream>   // NOTE: the include names in the original listing were lost; the headers below are assumed
2 | #include <vector>
3 | #include <cmath>
4 | #include <cstdlib>
5 | #include <mc_runtime.h>   // assumed MXMACA runtime header name (for mcMalloc/mcMemcpy/mcFree)
#include "mcdnn.h"   // assumed, by analogy with "mcblas.h" in chapter10/mcBlas.c
6 |
7 | #define MCDNN_CHECK(f) \
8 | { \
9 | mcdnnStatus_t err = static_cast<mcdnnStatus_t>(f); \
10 | if (err != MCDNN_STATUS_SUCCESS) { \
11 | std::cout << "Error occurred : " << err << std::endl; \
12 | std::exit(1); \
13 | } \
14 | }
15 |
16 | int main() {
17 | // data shape
18 | int batch = 3;
19 | int data_w = 224;
20 | int data_h = 224;
21 | int in_channel = 3;
22 | int out_channel = 8;
23 | int filter_w = 5;
24 | int filter_h = 5;
25 | int stride[2] = {1, 1};
26 | int dilate[2] = {1, 1};
int pad[4] = {0, 0, 0, 0};   // added: `pad` is used below but never declared in the original; zero padding assumed
27 | float alpha = 2.f;
28 | float beta = 5.f;
29 |
30 | // model selected
31 | mcdnnConvolutionMode_t mode = MCDNN_CROSS_CORRELATION;
32 | mcdnnConvolutionFwdAlgo_t algo
= MCDNN_CONVOLUTION_FWD_ALGO_FFT_TILING;
33 | // data type selected float, double, half, etc.
34 | mcdnnDataType_t data_type = MCDNN_DATA_FLOAT;
35 |
36 | // init handle
37 | mcdnnHandle_t handle;
38 | MCDNN_CHECK(mcdnnCreate(&handle));
39 |
40 | // create descriptor
41 | mcdnnTensorDescriptor_t x_desc;
42 | mcdnnFilterDescriptor_t w_desc;
43 | mcdnnTensorDescriptor_t y_desc;
44 | mcdnnConvolutionDescriptor_t conv_desc;
45 | MCDNN_CHECK(mcdnnCreateTensorDescriptor(&x_desc));
46 | MCDNN_CHECK(mcdnnCreateFilterDescriptor(&w_desc));
47 | MCDNN_CHECK(mcdnnCreateTensorDescriptor(&y_desc));
48 | MCDNN_CHECK(mcdnnCreateConvolutionDescriptor(&conv_desc));
49 |
50 | // convolution padding
51 | // out size = (input + pad - kernel) / stride + 1
52 | uint32_t padding_w = data_w + pad[2] + pad[3];
53 | uint32_t padding_h = data_h + pad[0] + pad[1];
54 | uint32_t out_h = padding_h - filter_h + 1;
55 | uint32_t out_w = padding_w - filter_w + 1;
56 | // init tensor descriptor, set data type, layout format, shape, etc.
57 | mcdnnSetTensor4dDescriptor(x_desc, MCDNN_TENSOR_NCHW, data_type, batch,
58 | in_channel, data_h, data_w);
59 | mcdnnSetFilter4dDescriptor(w_desc, data_type, MCDNN_TENSOR_NCHW, out_channel,
60 | in_channel, filter_h, filter_w);
61 | mcdnnSetTensor4dDescriptor(y_desc, MCDNN_TENSOR_NCHW, data_type, batch,
62 | out_channel, out_h, out_w);
63 | // init convolution descriptor, set padding, stride, data_type, etc.
64 | mcdnnSetConvolution2dDescriptor(conv_desc, pad[1], pad[2], stride[0],
65 | stride[1], dilate[0], dilate[1], mode,
66 | data_type);
67 |
68 | // init input data
69 | uint32_t input_data_numbers = batch * in_channel * data_h * data_w;
70 | uint32_t filter_data_numbers = out_channel * in_channel * filter_h * filter_w;
71 | uint32_t out_data_numbers = batch * out_channel * out_h * out_w;
72 |
73 | std::vector<float> x(input_data_numbers);
74 | std::vector<float> w(filter_data_numbers);
75 | std::vector<float> y(out_data_numbers);
76 | for (int i = 0; i < input_data_numbers; ++i) {
77 | x[i] = std::cos(i) * i;
78 | }
79 | for (int i = 0; i < filter_data_numbers; ++i) {
80 | w[i] = std::sin(i) / 10;
81 | }
82 |
83 | for (int i = 0; i < out_data_numbers; ++i) {
84 | y[i] = std::cos(i + 0.5);
85 | }
86 |
87 | // alloc x device memory
88 | void *ptr_x_dev = nullptr;
89 | MCDNN_CHECK(mcMalloc(&ptr_x_dev, x.size() * sizeof(float)));
90 | // copy data to device
91 | MCDNN_CHECK(mcMemcpy(ptr_x_dev, x.data(), x.size() * sizeof(float),
92 | mcMemcpyHostToDevice));
93 | // alloc w device memory
94 | void *ptr_w_dev = nullptr;
95 | MCDNN_CHECK(mcMalloc(&ptr_w_dev, w.size() * sizeof(float)));
96 | // copy data to device
97 | MCDNN_CHECK(mcMemcpy(ptr_w_dev, w.data(), w.size() * sizeof(float),
98 | mcMemcpyHostToDevice));
99 | // alloc y device memory
100 | void *ptr_y_dev = nullptr;
101 | MCDNN_CHECK(mcMalloc(&ptr_y_dev, y.size() * sizeof(float)));
102 | // copy data to device
103 | MCDNN_CHECK(mcMemcpy(ptr_y_dev, y.data(), y.size() * sizeof(float),
104 | mcMemcpyHostToDevice));
105 |
106 | uint32_t padding_src_elements = batch * in_channel * padding_h * padding_w;
107 |
108 | size_t workspace_size = 0;
109 | MCDNN_CHECK(mcdnnGetConvolutionForwardWorkspaceSize(
110 | handle, x_desc, w_desc, conv_desc, y_desc, algo, &workspace_size));
111 |
112 | void *ptr_worksapce = nullptr;
113 | if (workspace_size > 0) {
114 | MCDNN_CHECK(mcMalloc(&ptr_worksapce, workspace_size));
115 | }
116 |
117 | // convolution forward
118 | MCDNN_CHECK(mcdnnConvolutionForward(handle, &alpha, x_desc, ptr_x_dev, w_desc,
119 | ptr_w_dev,
conv_desc, algo, ptr_worksapce, 120 | workspace_size, &beta, y_desc, ptr_y_dev)); 121 | MCDNN_CHECK(mcMemcpy(y.data(), ptr_y_dev, y.size() * sizeof(float), 122 | mcMemcpyDeviceToHost)); 123 | 124 | // free device pointer and handle 125 | MCDNN_CHECK(mcFree(ptr_x_dev)); 126 | MCDNN_CHECK(mcFree(ptr_w_dev)); 127 | MCDNN_CHECK(mcFree(ptr_y_dev)); 128 | MCDNN_CHECK(mcFree(ptr_w_dev)); 129 | MCDNN_CHECK(mcdnnDestoryTensorDescriptor(x_desc)); 130 | MCDNN_CHECK(mcdnnDestoryTensorDescriptor(y_desc)); 131 | MCDNN_CHECK(mcdnnDestoryFilterDescriptor(w_desc)); 132 | MCDNN_CHECK(mcdnnDestoryConvolutionDescriptor(conv_desc)); 133 | MCDNN_CHECK(mcdnnDestory(handle)); 134 | 135 | return 0; 136 | } 137 | -------------------------------------------------------------------------------- /chapter10/mcblas命令.txt: -------------------------------------------------------------------------------- 1 | mxcc sample_mcblas.c -I${MACA_PATH}/include -I${MACA_PATH}/include/mcblas -I${MACA_PATH}/include/mcr -L${MACA_PATH}/lib -lmcruntime -lmcblas -------------------------------------------------------------------------------- /chapter10/usingThrust.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | #include 7 | #include 8 | 9 | int main(void) { 10 | // the following code shows how to use thrust::sort and thrust::host_vector 11 | std::vector array = {2, 4, 6, 8, 0, 9, 7, 5, 3, 1}; 12 | thrust::host_vector vec; 13 | vec = array; // now vec has storage for 10 integers 14 | std::cout << "vec has size: " << vec.size() << std::endl; 15 | 16 | std::cout << "vec before sorting:" << std::endl; 17 | for (size_t i = 0; i < vec.size(); ++i) 18 | std::cout << vec[i] << " "; 19 | std::cout << std::endl; 20 | 21 | thrust::sort(vec.begin(), vec.end()); 22 | std::cout << "vec after sorting:" << std::endl; 23 | for (size_t i = 0; i < vec.size(); ++i) 24 | std::cout << vec[i] << " "; 25 | std::cout << std::endl; 26 | 27 | vec.resize(2); 28 | std::cout << "now vec has size: " << vec.size() << std::endl; 29 | 30 | return 0; 31 | } 32 | -------------------------------------------------------------------------------- /chapter11/Makefile: -------------------------------------------------------------------------------- 1 | DEBUG ?= 0 2 | MCCL ?=0 3 | MCCLCMMD = -D_USE_MCCL -lmccl 4 | 5 | ifeq ($(DEBUG), 0) 6 | ifeq ($(MCCL),0) 7 | simple2DFD_rls: simple2DFD.cpp 8 | mxcc -x maca -O3 ./simple2DFD.cpp -I./ -o ./build/$@ 9 | else 10 | simple2DFD_rls_mccl: simple2DFD.cpp 11 | mxcc -x maca -O3 ./simple2DFD.cpp $(MCCLCMMD) -I./ -o ./build/$@ 12 | @echo Useing mccl now! 13 | endif 14 | else 15 | ifeq ($(MCCL),0) 16 | simple2DFD_dbg: simple2DFD.cpp 17 | mxcc -x maca -g -G ./simple2DFD.cpp -I./ -o ./build/$@ 18 | else 19 | simple2DFD_dbg_mccl: simple2DFD.cpp 20 | mxcc -x maca -g -G ./simple2DFD.cpp $(MCCLCMMD) -I./ -o ./build/$@ 21 | @echo Useing mccl now! 22 | endif 23 | endif 24 | 25 | clean: 26 | rm -f ./build/simple2DFD_* 27 | 28 | -------------------------------------------------------------------------------- /chapter11/simple2DFD.cpp: -------------------------------------------------------------------------------- 1 | #include "../common/common.h" 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | 9 | #include 10 | 11 | #ifdef _USE_MCCL 12 | #include 13 | #endif 14 | 15 | 16 | /* 17 | * This example implements a 2D stencil computation, spreading the computation 18 | * across multiple GPUs. 
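 * With two GPUs each device holds a slab of ny/ngpus rows plus NPAD (= 4) extra
 * halo rows next to its neighbour, matching the radius of the 8th-order stencil.
 * Per device, one stream is used for the halo rows and one for the interior, so
 * the halo exchange (mcMemcpyAsync peer copies, or mcclSend/mcclRecv when built
 * with MCCL=1) can overlap with the interior computation.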
This requires communicating halo regions between GPUs 19 | * on every iteration of the stencil as well as managing multiple GPUs from a 20 | * single host application. Here, kernels and transfers are issued in 21 | * breadth-first order to each maca stream. Each maca stream is associated with 22 | * a single maca device. 23 | */ 24 | 25 | #define a0 -3.0124472f 26 | #define a1 1.7383092f 27 | #define a2 -0.2796695f 28 | #define a3 0.0547837f 29 | #define a4 -0.0073118f 30 | 31 | // cnst for gpu 32 | #define BDIMX 32 33 | #define NPAD 4 34 | #define NPAD2 8 35 | 36 | // constant memories for 8 order FD coefficients 37 | __device__ __constant__ float coef[5]; 38 | 39 | // set up fd coefficients 40 | void setup_coef (void) 41 | { 42 | const float h_coef[] = {a0, a1, a2, a3, a4}; 43 | CHECK( mcMemcpyToSymbol( coef, h_coef, 5 * sizeof(float) )); 44 | } 45 | 46 | void saveSnapshotIstep( 47 | int istep, 48 | int nx, 49 | int ny, 50 | int ngpus, 51 | float **g_u2) 52 | { 53 | float *iwave = (float *)malloc(nx * ny * sizeof(float)); 54 | 55 | if (ngpus > 1) 56 | { 57 | unsigned int skiptop = nx * 4; 58 | unsigned int gsize = nx * ny / 2; 59 | 60 | for (int i = 0; i < ngpus; i++) 61 | { 62 | CHECK(mcSetDevice(i)); 63 | int iskip = (i == 0 ? 0 : skiptop); 64 | int ioff = (i == 0 ? 0 : gsize); 65 | CHECK(mcMemcpy(iwave + ioff, g_u2[i] + iskip, 66 | gsize * sizeof(float), mcMemcpyDeviceToHost)); 67 | 68 | // int iskip = (i == 0 ? nx*ny/2-4*nx : 0+4*nx); 69 | // int ioff = (i == 0 ? 0 : nx*4); 70 | // CHECK(mcMemcpy(iwave + ioff, g_u2[i] + iskip, 71 | // skiptop * sizeof(float), mcMemcpyDeviceToHost)); 72 | } 73 | } 74 | else 75 | { 76 | unsigned int isize = nx * ny; 77 | CHECK(mcMemcpy (iwave, g_u2[0], isize * sizeof(float), 78 | mcMemcpyDeviceToHost)); 79 | } 80 | 81 | char fname[50]; 82 | sprintf(fname, "snap_at_step_%d.data", istep); 83 | 84 | FILE *fp_snap = fopen(fname, "w"); 85 | 86 | fwrite(iwave, sizeof(float), nx * ny, fp_snap); 87 | // fwrite(iwave, sizeof(float), nx * 4, fp_snap); 88 | printf("%s: nx = %d ny = %d istep = %d\n", fname, nx, ny, istep); 89 | fflush(stdout); 90 | fclose(fp_snap); 91 | 92 | free(iwave); 93 | return; 94 | } 95 | // 判断算力是否大于2,大于2则就支持P2P通信 96 | inline bool isCapableP2P(int ngpus) 97 | { 98 | mcDeviceProp_t prop[ngpus]; 99 | int iCount = 0; 100 | 101 | for (int i = 0; i < ngpus; i++) 102 | { 103 | CHECK(mcGetDeviceProperties(&prop[i], i)); 104 | 105 | if (prop[i].major >= 2) iCount++; 106 | 107 | printf("> GPU%d: %s %s Peer-to-Peer access\n", i, 108 | prop[i].name, (prop[i].major >= 2 ? 
"supports" : "doesn't support")); 109 | fflush(stdout); 110 | } 111 | 112 | if(iCount != ngpus) 113 | { 114 | printf("> no enough device to run this application\n"); 115 | fflush(stdout); 116 | } 117 | 118 | return (iCount == ngpus); 119 | } 120 | 121 | /* 122 | * enable P2P memcopies between GPUs (all GPUs must be compute capability 2.0 or 123 | * later (Fermi or later)) 124 | */ 125 | inline void enableP2P (int ngpus) 126 | { 127 | for (int i = 0; i < ngpus; i++) 128 | { 129 | CHECK(mcSetDevice(i)); 130 | 131 | for (int j = 0; j < ngpus; j++) 132 | { 133 | if (i == j) continue; 134 | 135 | int peer_access_available = 0; 136 | CHECK(mcDeviceCanAccessPeer(&peer_access_available, i, j)); 137 | 138 | if (peer_access_available) CHECK(mcDeviceEnablePeerAccess(j, 0)); 139 | } 140 | } 141 | } 142 | // 是否支持UnifiedAddressing 143 | inline bool isUnifiedAddressing (int ngpus) 144 | { 145 | mcDeviceProp_t prop[ngpus]; 146 | 147 | for (int i = 0; i < ngpus; i++) 148 | { 149 | CHECK(mcGetDeviceProperties(&prop[i], i)); 150 | } 151 | 152 | const bool iuva = (prop[0].unifiedAddressing && prop[1].unifiedAddressing); 153 | printf("> GPU%d: %s %s Unified Addressing\n", 0, prop[0].name, 154 | (prop[0].unifiedAddressing ? "supports" : "doesn't support")); 155 | printf("> GPU%d: %s %s Unified Addressing\n", 1, prop[1].name, 156 | (prop[1].unifiedAddressing ? "supports" : "doesn't support")); 157 | fflush(stdout); 158 | return iuva; 159 | } 160 | // 2GPU的结果为252,256,4,252 161 | inline void calcIndex(int *haloStart, int *haloEnd, int *bodyStart, 162 | int *bodyEnd, const int ngpus, const int iny) 163 | { 164 | // for halo 165 | for (int i = 0; i < ngpus; i++) 166 | { 167 | if (i == 0 && ngpus == 2) 168 | { 169 | haloStart[i] = iny - NPAD2; // 260-8=252 170 | haloEnd[i] = iny - NPAD; // 260-4=256 171 | 172 | } 173 | else 174 | { 175 | haloStart[i] = NPAD; 176 | haloEnd[i] = NPAD2; 177 | } 178 | } 179 | 180 | // for body 181 | for (int i = 0; i < ngpus; i++) 182 | { 183 | if (i == 0 && ngpus == 2) 184 | { 185 | bodyStart[i] = NPAD; // 4 186 | bodyEnd[i] = iny - NPAD2; // 260-8=252 187 | } 188 | else 189 | { 190 | bodyStart[i] = NPAD + NPAD; 191 | bodyEnd[i] = iny - NPAD; 192 | } 193 | } 194 | } 195 | // // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512 196 | inline void calcSkips(int *src_skip, int *dst_skip, const int nx, 197 | const int iny) 198 | { 199 | src_skip[0] = nx * (iny - NPAD2);// 512*(260-8) 200 | dst_skip[0] = 0; 201 | src_skip[1] = NPAD * nx; // 4*512 202 | dst_skip[1] = (iny - NPAD) * nx; // (260-4)*512 203 | } 204 | 205 | // wavelet 206 | __global__ void kernel_add_wavelet ( float *g_u2, float wavelets, const int nx, 207 | const int ny, const int ngpus) 208 | { // ny为iny=260,nx=512 209 | // global grid idx for (x,y) plane 若gpu个数为2,则 210 | int ipos = (ngpus == 2 ? 
ny - 10 : ny / 2 - 10); // ipos=250 211 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x; // ix就是x方向上节点编号 212 | unsigned int idx = ipos * nx + ix; // idx=250*512+ix 213 | 214 | if(ix == nx / 2) g_u2[idx] += wavelets; // 这里是说ix==256时,则 215 | } 216 | 217 | // fd kernel function 218 | __global__ void kernel_2dfd_last(float *g_u1, float *g_u2, const int nx, 219 | const int iStart, const int iEnd) 220 | { 221 | // global to slice : global grid idx for (x,y) plane 222 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x; 223 | 224 | // smem idx for current point 225 | unsigned int stx = threadIdx.x + NPAD; 226 | unsigned int idx = ix + iStart * nx; 227 | 228 | // shared memory for u2 with size [4+16+4][4+16+4] 229 | __shared__ float tile[BDIMX + NPAD2]; 230 | 231 | const float alpha = 0.12f; 232 | 233 | // register for y value 234 | float yval[9]; 235 | 236 | for (int i = 0; i < 8; i++) yval[i] = g_u2[idx + (i - 4) * nx]; 237 | 238 | // to be used in z loop 239 | int iskip = NPAD * nx; 240 | 241 | #pragma unroll 9 242 | for (int iy = iStart; iy < iEnd; iy++) 243 | { 244 | // get front3 here 245 | yval[8] = g_u2[idx + iskip]; 246 | 247 | if(threadIdx.x < NPAD) 248 | { 249 | tile[threadIdx.x] = g_u2[idx - NPAD]; 250 | tile[stx + BDIMX] = g_u2[idx + BDIMX]; 251 | } 252 | 253 | tile[stx] = yval[4]; 254 | __syncthreads(); 255 | 256 | if ( (ix >= NPAD) && (ix < nx - NPAD) ) 257 | { 258 | // 8rd fd operator 259 | float tmp = coef[0] * tile[stx] * 2.0f; 260 | 261 | #pragma unroll 262 | for(int d = 1; d <= 4; d++) 263 | { 264 | tmp += coef[d] * (tile[stx - d] + tile[stx + d]); 265 | } 266 | 267 | #pragma unroll 268 | for(int d = 1; d <= 4; d++) 269 | { 270 | tmp += coef[d] * (yval[4 - d] + yval[4 + d]); 271 | } 272 | 273 | // time dimension 274 | g_u1[idx] = yval[4] + yval[4] - g_u1[idx] + alpha * tmp; 275 | } 276 | 277 | #pragma unroll 8 278 | for (int i = 0; i < 8 ; i++) 279 | { 280 | yval[i] = yval[i + 1]; 281 | } 282 | 283 | // advancd on global idx 284 | idx += nx; 285 | __syncthreads(); 286 | } 287 | } 288 | 289 | __global__ void kernel_2dfd(float *g_u1, float *g_u2, const int nx, 290 | const int iStart, const int iEnd) 291 | { 292 | // global to line index 293 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x; 294 | 295 | // smem idx for current point 296 | unsigned int stx = threadIdx.x + NPAD; 297 | unsigned int idx = ix + iStart * nx; // ix+4*512,idx表示插值的中心点坐标 298 | 299 | // shared memory for x dimension 300 | __shared__ float line[BDIMX + NPAD2];// 对于一个block,根据模板,需要的共享内存元素数量为block线程大小+NPAD*2 301 | 302 | // a coefficient related to physical properties 303 | const float alpha = 0.12f; // 关于时间步长的系数 304 | 305 | // register for y value 306 | float yval[9]; // 寄存器数组 307 | // 从GPU主存中获取值,这里数据由于是沿着坐标x轴排布的,所以获取主存的数据是不连续的 308 | for (int i = 0; i < 8; i++) yval[i] = g_u2[idx + (i - 4) * nx]; 309 | 310 | // skip for the bottom most y value 311 | int iskip = NPAD * nx; // 4*512,看上面for循环,最大下标到idx+3*nx,这里多加了1 312 | 313 | #pragma unroll 9 314 | for (int iy = iStart; iy < iEnd; iy++)//对y方向的数据点进行循环 315 | { 316 | // get yval[8] here 317 | yval[8] = g_u2[idx + iskip];//这里每次yval的最后一个数据从主存获取,其他数据最后从寄存器获取 318 | // 所以内存是按坐标轴的x方向上排布的 319 | // read halo partk // 320 | if(threadIdx.x < NPAD) 321 | { // 共享内存的最前最后4个数据即(0,1,2,3)和(36,37,38,39) 322 | line[threadIdx.x] = g_u2[idx - NPAD]; 323 | line[stx + BDIMX] = g_u2[idx + BDIMX]; 324 | } 325 | 326 | line[stx] = yval[4]; // line获取中心点的值,注意由于每个线程的yval[4]和stx都不同,所以这样可以将line[4-35]的所有数据填满 327 | __syncthreads();// 直到块内线程同步 328 | 329 | // 8rd fd operator 
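// 8th-order 2D stencil: tmp = 2*coef[0]*u(ix,iy) + sum_{d=1..4} coef[d]*(u(ix-d,iy)+u(ix+d,iy)+u(ix,iy-d)+u(ix,iy+d));
// the update below is then u_new = 2*u_cur - u_old + alpha*tmp (second-order central difference in time)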
这里的ix>=4,ix<512-4 330 | if ( (ix >= NPAD) && (ix < nx - NPAD) ) 331 | { 332 | // center point 333 | float tmp = coef[0] * line[stx] * 2.0f; 334 | 335 | #pragma unroll 336 | for(int d = 1; d <= 4; d++) 337 | { 338 | tmp += coef[d] * ( line[stx - d] + line[stx + d]); 339 | } 340 | 341 | #pragma unroll 342 | for(int d = 1; d <= 4; d++) 343 | { 344 | tmp += coef[d] * (yval[4 - d] + yval[4 + d]); 345 | } 346 | 347 | // time dimension yval[4]=gu2[idx],g_u1和g_u2和时间推进有关 348 | g_u1[idx] = yval[4] + yval[4] - g_u1[idx] + alpha * tmp; 349 | } 350 | 351 | #pragma unroll 8 // 这里将下移一格,即沿着坐标y轴下移,进行下一层(沿着x轴为一层) 352 | for (int i = 0; i < 8 ; i++) 353 | { 354 | yval[i] = yval[i + 1]; 355 | } 356 | 357 | // advancd on global idx 358 | idx += nx; // idx+一层的点数,接着循环 359 | __syncthreads(); 360 | } 361 | } 362 | // 程序有多个参数,第一个为要使用的GPU个数,第二个为保存哪个时间步的波场 363 | /* 364 | 1. argv[1]:gpu数量 365 | 2. argv[2]: 每隔多少个时间步存储数据 366 | 3. argv[3]: 一共多少时间步 367 | 4. argv[4]: 每个方向上的网格数 368 | */ 369 | int main( int argc, char *argv[] ) 370 | { 371 | int ngpus=2; 372 | 373 | // check device count 374 | CHECK(mcGetDeviceCount(&ngpus)); 375 | printf("> Number of devices available: %i\n", ngpus); 376 | 377 | // check p2p capability 378 | isCapableP2P(ngpus); 379 | isUnifiedAddressing(ngpus); 380 | 381 | // get it from command line 382 | if (argc > 1) 383 | { 384 | if (atoi(argv[1]) > ngpus) 385 | { 386 | fprintf(stderr, "Invalid number of GPUs specified: %d is greater " 387 | "than the total number of GPUs in this platform (%d)\n", 388 | atoi(argv[1]), ngpus); 389 | exit(1); 390 | } 391 | 392 | ngpus = atoi(argv[1]); 393 | } 394 | 395 | int iMovie = 100; // 这里现在表示每隔多少步存一次数据 396 | 397 | if(argc >= 3) iMovie = atoi(argv[2]); 398 | 399 | // size 400 | // 时间步 401 | int nsteps = 1001; 402 | if(argc>=4) nsteps=atoi(argv[3]); 403 | 404 | printf("> run with %i devices: nsteps = %i\n", ngpus, nsteps); 405 | 406 | // x方向点数 407 | const int nx = 512; 408 | // y方向点数 409 | const int ny = 512; 410 | // 计算每个gpu上点数,这里每个线程负责所有y方向的数据点计算 411 | const int iny = ny / ngpus + NPAD * (ngpus - 1); 412 | 413 | size_t isize = nx * iny; // 总的数据点数 414 | size_t ibyte = isize * sizeof(float); // 每块总的数据字节数 415 | #ifndef _USE_MCCL 416 | size_t iexchange = NPAD * nx * sizeof(float); // 交换区域的字节数 417 | #endif 418 | 419 | // set up gpu card 420 | float *d_u2[ngpus], *d_u1[ngpus]; 421 | 422 | for(int i = 0; i < ngpus; i++) 423 | { 424 | // set device 425 | CHECK(mcSetDevice(i)); 426 | 427 | // allocate device memories // d_u1,d_u2分别存着两个时间步的数据 428 | CHECK(mcMalloc ((void **) &d_u1[i], ibyte)); 429 | CHECK(mcMalloc ((void **) &d_u2[i], ibyte)); 430 | 431 | CHECK(mcMemset (d_u1[i], 0, ibyte)); 432 | CHECK(mcMemset (d_u2[i], 0, ibyte)); 433 | printf("GPU %i: %.2f MB global memory allocated\n", i, 434 | (4.f * ibyte) / (1024.f * 1024.f) ); 435 | setup_coef (); 436 | } 437 | 438 | // stream definition 439 | mcStream_t stream_halo[ngpus], stream_body[ngpus]; 440 | 441 | for (int i = 0; i < ngpus; i++) 442 | { 443 | CHECK(mcSetDevice(i)); 444 | CHECK(mcStreamCreate( &stream_halo[i] )); 445 | CHECK(mcStreamCreate( &stream_body[i] )); 446 | } 447 | 448 | // calculate index for computation 449 | int haloStart[ngpus], bodyStart[ngpus], haloEnd[ngpus], bodyEnd[ngpus]; 450 | // 根据iny进行处理 ,2GPU的结果为252,256,4,252 451 | calcIndex(haloStart, haloEnd, bodyStart, bodyEnd, ngpus, iny); 452 | 453 | int src_skip[ngpus], dst_skip[ngpus]; 454 | // // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512 455 | // 根据nx,iny进行处理 456 | if(ngpus > 1) calcSkips(src_skip, dst_skip, nx, iny); 457 | 458 | // 
kernel launch configuration 459 | // block 中的线程数量 460 | dim3 block(BDIMX); 461 | // block数量 这样的话一个线程要处理所有y向的数据。y方向被所有的GPU分块 462 | dim3 grid(nx / block.x); 463 | 464 | // set up event for timing 465 | CHECK(mcSetDevice(0)); 466 | mcEvent_t start, stop; 467 | CHECK (mcEventCreate(&start)); 468 | CHECK (mcEventCreate(&stop )); 469 | CHECK(mcEventRecord( start, 0 )); 470 | #ifdef _USE_MCCL 471 | int devs[2] = {0, 1}; 472 | mcclComm_t comms[2]; 473 | assert(mcclSuccess==mcclCommInitAll(comms, ngpus, devs)); 474 | #endif 475 | // main loop for wave propagation 476 | for(int istep = 0; istep < nsteps; istep++) 477 | { 478 | 479 | // save snap image 480 | if(istep%iMovie==0) saveSnapshotIstep(istep, nx, ny, ngpus, d_u2); 481 | 482 | // add wavelet only onto gpu0 483 | if (istep == 0) 484 | { 485 | CHECK(mcSetDevice(0)); 486 | kernel_add_wavelet<<>>(d_u2[0], 20.0, nx, iny, ngpus); 487 | } 488 | 489 | // halo part 490 | for (int i = 0; i < ngpus; i++) 491 | { 492 | CHECK(mcSetDevice(i)); 493 | 494 | // compute halo 495 | kernel_2dfd<<>>(d_u1[i], d_u2[i], 496 | nx, haloStart[i], haloEnd[i]); 497 | 498 | // compute internal 499 | kernel_2dfd<<>>(d_u1[i], d_u2[i], 500 | nx, bodyStart[i], bodyEnd[i]); 501 | } 502 | 503 | /* 504 | ================================================================================ 505 | 506 | ***************************使用不同的方式在GPU间交换数据**************************** 507 | 508 | ================================================================================ 509 | */ 510 | 511 | #ifndef _USE_MCCL 512 | // exchange halo 513 | // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512 514 | if (ngpus > 1) 515 | { 516 | // 交换两个GPU的数据注意都是d_u1的数据,即新的时间步上的数据 这里可以考虑使用mccl? 517 | // 这里是将gpu0的halo区域数据给gpu1的填充区域 518 | CHECK(mcMemcpyAsync(d_u1[1] + dst_skip[0], d_u1[0] + src_skip[0], 519 | iexchange, mcMemcpyDefault, stream_halo[0])); 520 | // 这里是将gpu1的halo区域数据给gpu0的填充区域 521 | CHECK(mcMemcpyAsync(d_u1[0] + dst_skip[1], d_u1[1] + src_skip[1], 522 | iexchange, mcMemcpyDefault, stream_halo[1])); 523 | } 524 | #else 525 | // 使用mccl发送填充区数据 526 | assert(mcclSuccess == mcclGroupStart()); 527 | for (int i = 0; i < ngpus; ++i) 528 | { 529 | mcSetDevice(i); 530 | int tag = (i + 1) % 2; 531 | mcclSend(d_u1[i] + src_skip[i], NPAD * nx, mcclFloat, tag, comms[i], stream_halo[i]); 532 | mcclRecv(d_u1[i] + dst_skip[tag], NPAD * nx, mcclFloat, tag, comms[i], stream_halo[i]); 533 | } 534 | assert(mcclSuccess == mcclGroupEnd()); 535 | 536 | for (int i = 0; i < ngpus; ++i) 537 | { 538 | mcSetDevice(i); 539 | // it will stall host until all operations are done 540 | mcStreamSynchronize(stream_halo[i]); 541 | } 542 | #endif 543 | for (int i = 0; i < ngpus; i++) 544 | { 545 | CHECK(mcSetDevice(i)); 546 | CHECK(mcDeviceSynchronize()); 547 | // 交换时间步的指针 548 | float *tmpu0 = d_u1[i]; 549 | d_u1[i] = d_u2[i]; 550 | d_u2[i] = tmpu0; 551 | } 552 | 553 | } // 关于istep的for循环结束 554 | 555 | CHECK(mcSetDevice(0)); 556 | CHECK(mcEventRecord(stop, 0)); 557 | 558 | CHECK(mcDeviceSynchronize()); 559 | CHECK(mcGetLastError()); 560 | 561 | float elapsed_time_ms = 0.0f; 562 | CHECK(mcEventElapsedTime(&elapsed_time_ms, start, stop)); 563 | 564 | elapsed_time_ms /= nsteps; 565 | /* 566 | 1. nsteps=30000,NCCL:845.04 MCells/s,origin:941.21 MCells/s 567 | 2. nsteps=15000,NCCL:817.91 MCells/s,origin:935.47 MCells/s 568 | 3. nsteps=10000,NCCL:793.62 MCells/s,origin:925.97 MCells/s 569 | 4. nsteps=05000,NCCL:756.32 MCells/s,origin:925.32 MCells/s 570 | 5. nsteps=02000,NCCL:599.61 MCells/s,origin:889.43 MCells/s 571 | 6. 
nsteps=01000,NCCL:470.81 MCells/s,origin:802.86 MCells/s 572 | 可见随着循环步骤数的增加,mccl通信与原有程序的速度逐渐接近 573 | */ 574 | printf("gputime: %8.2fms ", elapsed_time_ms); 575 | printf("performance: %8.2f MCells/s\n", 576 | (double)nx * ny / (elapsed_time_ms * 1e3f)); 577 | fflush(stdout); 578 | 579 | CHECK(mcEventDestroy(start)); 580 | CHECK(mcEventDestroy(stop)); 581 | 582 | // clear 583 | for (int i = 0; i < ngpus; i++) 584 | { 585 | CHECK(mcSetDevice(i)); 586 | 587 | CHECK(mcStreamDestroy(stream_halo[i])); 588 | CHECK(mcStreamDestroy(stream_body[i])); 589 | 590 | CHECK(mcFree(d_u1[i])); 591 | CHECK(mcFree(d_u2[i])); 592 | 593 | // CHECK(mcDeviceReset()); // 不注释掉会mcclCommDestroy出现段错误 594 | } 595 | #ifdef _USE_MCCL 596 | for (int i = 0; i < ngpus; ++i) 597 | { 598 | assert(mcclSuccess == mcclCommDestroy(comms[i])); 599 | } 600 | #endif 601 | exit(EXIT_SUCCESS); 602 | } 603 | -------------------------------------------------------------------------------- /chapter11/vectorAddMultiGpus.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | 7 | #define USECPSEC 1000000ULL 8 | 9 | unsigned long long dtime_usec(unsigned long long start){ 10 | 11 | timeval tv; 12 | gettimeofday(&tv, 0); 13 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 14 | } 15 | 16 | // error checking macro 17 | #define macaCheckErrors(msg) \ 18 | do { \ 19 | mcError_t __err = mcGetLastError(); \ 20 | if (__err != mcSuccess) { \ 21 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 22 | msg, mcGetErrorString(__err), \ 23 | __FILE__, __LINE__); \ 24 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 25 | exit(1); \ 26 | } \ 27 | } while (0) 28 | 29 | 30 | const int DSIZE = 1 << 26; //64MB 31 | #define NGPUS 4 32 | 33 | // generate different seed for random number 34 | void initialData(float *ip, int size) 35 | { 36 | time_t t; 37 | srand((unsigned) time(&t)); 38 | 39 | for (int i = 0; i < size; i++) 40 | { 41 | ip[i] = (float)(rand() & 0xFF) / 10.0f; 42 | } 43 | 44 | return; 45 | } 46 | 47 | // vector add function: C = A + B 48 | void cpuVectorAdd(float *A, float *B, float *C, const int N) 49 | { 50 | for (int idx = 0; idx < N; idx++) 51 | C[idx] = A[idx] + B[idx]; 52 | } 53 | 54 | // vector add kernel: C = A + B 55 | __global__ void gpuVectorAddKernel(const float *A, const float *B, float *C, const int N){ 56 | 57 | for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < N; idx+=gridDim.x*blockDim.x) // a grid-stride loop 58 | C[idx] = A[idx] + B[idx]; // do the vector (element) add here 59 | } 60 | 61 | // check results from host and gpu 62 | void checkResult(float *hostRef, float *gpuRef, const int N) 63 | { 64 | double epsilon = 1.0E-8; 65 | bool match = 1; 66 | for (int i = 0; i < N; i++) 67 | { 68 | if (abs(hostRef[i] - gpuRef[i]) > epsilon) 69 | { 70 | match = 0; 71 | printf("The vector-add results do not match!\n"); 72 | printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i], 73 | gpuRef[i], i); 74 | break; 75 | } 76 | } 77 | // if (match) printf("The vector-add results match.\n\n"); 78 | return; 79 | } 80 | 81 | // 程序有多个参数,第一个为要使用的GPU个数,第二个为保存哪个时间步的波场 82 | /* 83 | 1. argv[1]:GPU数量 (nGpus) 84 | 2. argv[2]:线程块大小(blockSize) 85 | 3. argv[3]:数据量(dataSize), default is 26(1<<26=64MB) 86 | */ 87 | int main( int argc, char *argv[] ) 88 | { 89 | int nGpus; 90 | mcGetDeviceCount(&nGpus); 91 | nGpus = (nGpus > NGPUS) ? 
NGPUS : nGpus; 92 | printf("> Number of devices available: %i\n", nGpus); 93 | // get it from command line 94 | if (argc > 1) 95 | { 96 | if (atoi(argv[1]) > nGpus) 97 | { 98 | fprintf(stderr, "Invalid number of GPUs specified: %d is greater " 99 | "than the total number of GPUs in this platform (%d)\n", 100 | atoi(argv[1]), nGpus); 101 | exit(1); 102 | } 103 | nGpus = atoi(argv[1]); 104 | } 105 | 106 | // blockSize is set to 1 for slowing execution time per GPU 107 | int blockSize = 1; 108 | // It would be faster if blockSize is set to multiples of 64(waveSize) 109 | if(argc >= 3) blockSize = atoi(argv[2]); 110 | int dataSize = DSIZE; 111 | if(argc >= 4) dataSize = 1 << abs(atoi(argv[3])); 112 | printf("> total array size is %iMB, using %i devices with each device handling %iMB\n", dataSize/1024/1024, nGpus, dataSize/1024/1024/nGpus); 113 | 114 | float *d_A[NGPUS], *d_B[NGPUS], *d_C[NGPUS]; 115 | float *h_A[NGPUS], *h_B[NGPUS], *hostRef[NGPUS], *gpuRef[NGPUS]; 116 | mcStream_t stream[NGPUS]; 117 | 118 | int iSize = dataSize / nGpus; 119 | size_t iBytes = iSize * sizeof(float); 120 | for (int i = 0; i < nGpus; i++) { 121 | //set current device 122 | mcSetDevice(i); 123 | 124 | //allocate device memory 125 | mcMalloc((void **) &d_A[i], iBytes); 126 | mcMalloc((void **) &d_B[i], iBytes); 127 | mcMalloc((void **) &d_C[i], iBytes); 128 | 129 | //allocate page locked host memory for asynchronous data transfer 130 | mcMallocHost((void **) &h_A[i], iBytes); 131 | mcMallocHost((void **) &h_B[i], iBytes); 132 | mcMallocHost((void **) &hostRef[i], iBytes); 133 | mcMallocHost((void **) &gpuRef[i], iBytes); 134 | 135 | // initialize data at host side 136 | initialData(h_A[i], iSize); 137 | initialData(h_B[i], iSize); 138 | //memset(hostRef[i], 0, iBytes); 139 | //memset(gpuRef[i], 0, iBytes); 140 | } 141 | mcDeviceSynchronize(); 142 | 143 | // distribute the workload across multiple devices 144 | unsigned long long dt = dtime_usec(0); 145 | for (int i = 0; i < nGpus; i++) { 146 | //set current device 147 | mcSetDevice(i); 148 | mcStreamCreate(&stream[i]); 149 | 150 | // transfer data from host to device 151 | mcMemcpyAsync(d_A[i],h_A[i], iBytes, mcMemcpyHostToDevice, stream[i]); 152 | mcMemcpyAsync(d_B[i],h_B[i], iBytes, mcMemcpyHostToDevice, stream[i]); 153 | 154 | // invoke kernel at host side 155 | dim3 block (blockSize); 156 | dim3 grid (iSize/blockSize); 157 | gpuVectorAddKernel<<>>(d_A[i], d_B[i], d_C[i], iSize); 158 | 159 | // copy kernel result back to host side 160 | mcMemcpyAsync(gpuRef[i],d_C[i],iBytes,mcMemcpyDeviceToHost,stream[i]); 161 | } 162 | mcDeviceSynchronize(); 163 | dt = dtime_usec(dt); 164 | std::cout << "> The execution time with " << nGpus <<"GPUs: "<< dt/(float)USECPSEC << "s" << std::endl; 165 | 166 | // check the results from host and gpu devices 167 | for (int i = 0; i < nGpus; i++) { 168 | // add vector at host side for result checks 169 | cpuVectorAdd(h_A[i], h_B[i], hostRef[i], iSize); 170 | 171 | // check device results 172 | checkResult(hostRef[i], gpuRef[i], iSize); 173 | 174 | // free device global memory 175 | mcSetDevice(i); 176 | mcFree(d_A[i]); 177 | mcFree(d_B[i]); 178 | mcFree(d_C[i]); 179 | 180 | // free host memory 181 | mcFreeHost(h_A[i]); 182 | mcFreeHost(h_B[i]); 183 | mcFreeHost(hostRef[i]); 184 | mcFreeHost(gpuRef[i]); 185 | 186 | mcStreamSynchronize(stream[i]); 187 | mcStreamDestroy(stream[i]); 188 | } 189 | mcDeviceSynchronize(); 190 | return 0; 191 | } 192 | -------------------------------------------------------------------------------- 
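The multi-GPU vector-add above measures the whole asynchronous pipeline on the host with gettimeofday() plus mcDeviceSynchronize(). When only the device-side time of the work queued in one stream is of interest, the event API already used in chapter11/simple2DFD.cpp (mcEventCreate / mcEventRecord / mcEventElapsedTime) can bracket that work directly. Below is a minimal, self-contained sketch of that pattern, not part of the original samples: the header name mc_runtime.h and dummyKernel are assumptions, everything else uses only calls that appear elsewhere in this repository; error checks are omitted for brevity.

#include <cstdio>
#include <mc_runtime.h>   // assumed MXMACA runtime header name (include names in this listing were stripped)

// Hypothetical kernel, stands in for gpuVectorAddKernel above (grid-stride loop).
__global__ void dummyKernel(float *p, int n) {
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += gridDim.x * blockDim.x)
        p[i] = p[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = NULL;
    mcMalloc((void **)&d, n * sizeof(float));
    mcMemset(d, 0, n * sizeof(float));

    mcStream_t s;
    mcStreamCreate(&s);
    mcEvent_t start, stop;
    mcEventCreate(&start);
    mcEventCreate(&stop);

    mcEventRecord(start, s);                  // timestamp queued before the work
    dummyKernel<<<256, 256, 0, s>>>(d, n);    // the work being timed, on stream s
    mcEventRecord(stop, s);                   // timestamp queued after the work
    mcStreamSynchronize(s);                   // wait until 'stop' has actually been reached

    float ms = 0.0f;
    mcEventElapsedTime(&ms, start, stop);     // GPU-side time between the two events
    printf("kernel time on stream s: %.3f ms\n", ms);

    mcEventDestroy(start);
    mcEventDestroy(stop);
    mcStreamDestroy(s);
    mcFree(d);
    return 0;
}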
/chapter2/helloFromGpu.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | __global__ void helloFromGpu (void) 6 | { 7 | printf("Hello World from GPU!\n"); 8 | } 9 | 10 | int main(void) 11 | { 12 | printf("Hello World from CPU!\n"); 13 | 14 | helloFromGpu <<<1, 10>>>(); 15 | mcDeviceReset(); 16 | //mcDeviceReset()用来显式销毁并清除与当前设备有关的所有资源。 17 | return 0; 18 | } 19 | -------------------------------------------------------------------------------- /chapter3/cpuVectorAdd.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | using namespace std; 6 | 7 | void cpuVectorAdd(float* A, float* B, float* C, int n) { 8 | for (int i = 0; i < n; i++) { 9 | C[i] = A[i] + B[i]; 10 | } 11 | } 12 | 13 | int main(int argc, char *argv[]) { 14 | 15 | int n = atoi(argv[1]); 16 | cout << n << endl; 17 | 18 | size_t size = n * sizeof(float); 19 | 20 | // host memery 21 | float *a = (float *)malloc(size); //分配一段内存,使用指针 a 指向它。 22 | float *b = (float *)malloc(size); 23 | float *c = (float *)malloc(size); 24 | 25 | // for 循环产生一些随机数,并放在分配的内存里面。 26 | for (int i = 0; i < n; i++) { 27 | float af = rand() / double(RAND_MAX); 28 | float bf = rand() / double(RAND_MAX); 29 | a[i] = af; 30 | b[i] = bf; 31 | } 32 | 33 | struct timeval t1, t2; 34 | 35 | // gettimeofday 函数来得到精确时间。它的精度可以达到微秒,是C标准库的函数。 36 | gettimeofday(&t1, NULL); 37 | 38 | // 输入指向3段内存的指针名,也就是 a, b, c。 39 | cpuVectorAdd(a, b, c, n); 40 | 41 | gettimeofday(&t2, NULL); 42 | 43 | //for (int i = 0; i < 10; i++) 44 | // cout << vecA[i] << " " << vecB[i] << " " << vecC[i] << endl; 45 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 46 | cout << timeuse << endl; 47 | 48 | // free 函数把申请的3段内存释放掉。 49 | free(a); 50 | free(b); 51 | free(c); 52 | return 0; 53 | } 54 | -------------------------------------------------------------------------------- /chapter3/gpuVectorAdd.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | 8 | // 要用 __global__ 来修饰。 9 | // 输入指向3段显存的指针名。 10 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N) 11 | { 12 | int i = threadIdx.x + blockDim.x * blockIdx.x; 13 | if (i < N) C_d[i] = A_d[i] + B_d[i]; 14 | } 15 | 16 | int main(int argc, char *argv[]) { 17 | 18 | int n = atoi(argv[1]); 19 | cout << n << endl; 20 | 21 | size_t size = n * sizeof(float); 22 | 23 | // host memery 24 | float *a = (float *)malloc(size); 25 | float *b = (float *)malloc(size); 26 | float *c = (float *)malloc(size); 27 | 28 | for (int i = 0; i < n; i++) { 29 | float af = rand() / double(RAND_MAX); 30 | float bf = rand() / double(RAND_MAX); 31 | a[i] = af; 32 | b[i] = bf; 33 | } 34 | 35 | // 定义空指针。 36 | float *da = NULL; 37 | float *db = NULL; 38 | float *dc = NULL; 39 | 40 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。 41 | mcMalloc((void **)&da, size); 42 | mcMalloc((void **)&db, size); 43 | mcMalloc((void **)&dc, size); 44 | 45 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。 46 | mcMemcpy(da,a,size,mcMemcpyHostToDevice); 47 | mcMemcpy(db,b,size,mcMemcpyHostToDevice); 48 | 49 | struct timeval t1, t2; 50 | 51 | // 计算线程块和网格的数量。 52 | int threadPerBlock = 256; 53 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 54 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid); 55 | 56 | 
gettimeofday(&t1, NULL); 57 | 58 | // 调用核函数。 59 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 60 | 61 | gettimeofday(&t2, NULL); 62 | 63 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost); 64 | 65 | // for (int i = 0; i < 10; i++) 66 | // cout< 12 | __device__ __host__ void count_if(int *count, T *data, int start, int end, int stride, P p) { 13 | for(int i = start; i < end; i += stride){ 14 | if(p(data[i])){ 15 | // __MACA_ARCH__ 宏仅在编译设备侧代码时生效 16 | #ifdef __MACA_ARCH__ 17 | // 使用原子操作保证设备侧多线程执行时的正确性 18 | atomicAdd(count, 1); 19 | #else 20 | *count += 1; 21 | #endif 22 | } 23 | } 24 | } 25 | // 定义核函数 26 | __global__ void count_xyzw(int *res) { 27 | // 利用内建变量gridDim, blockDim, blockIdx, threadIdx对每个线程操作的字符串进行分割 28 | const int start = blockDim.x * blockIdx.x + threadIdx.x; 29 | const int stride = gridDim.x * blockDim.x; 30 | // 在设备侧调用count_if 31 | count_if(res, dstrlist, start, dsize, stride, [=](char c){ 32 | for(auto i: letters) 33 | if(i == c) return true; 34 | return false; 35 | }); 36 | } 37 | 38 | int main(void){ 39 | // 初始化字符串 40 | char test_data[SIZE]; 41 | for(int i = 0; i < SIZE; i ++){ 42 | test_data[i] = 'a' + i % 26; 43 | } 44 | // 拷贝字符串数据至设备侧 45 | mcMemcpyToSymbol(dstrlist, test_data, SIZE); 46 | // 开辟设备侧的计数器内存并赋值为0 47 | int *dcnt; 48 | mcMalloc(&dcnt, sizeof(int)); 49 | int dinit = 0; 50 | mcMemcpy(dcnt, &dinit, sizeof(int), mcMemcpyHostToDevice); 51 | // 启动核函数 52 | count_xyzw<<<4, 64>>>(dcnt); 53 | // 拷贝计数器值到主机侧 54 | int dres; 55 | mcMemcpy(&dres, dcnt, sizeof(int), mcMemcpyDeviceToHost); 56 | // 释放设备侧开辟的内存 57 | mcFree(dcnt); 58 | printf("xyzw counted by device: %d\n", dres); 59 | 60 | // 在主机侧调用count_if 61 | int hcnt = 0; 62 | count_if(&hcnt, test_data, 0, SIZE, 1, [=](char c){ 63 | for(auto i: letters) 64 | if(i == c) return true; 65 | return false; 66 | }); 67 | printf("xyzw counted by host: %d\n", hcnt); 68 | return 0; 69 | } 70 | -------------------------------------------------------------------------------- /chapter5/Cooperative_Groups.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | using namespace cooperative_groups; 6 | __device__ int reduce_sum(thread_group g, int *temp, int val) 7 | { 8 | int lane = g.thread_rank(); 9 | 10 | // Each iteration halves the number of active threads 11 | // Each thread adds its partial sum[i] to sum[lane+i] 12 | for (int i = g.size() / 2; i > 0; i /= 2) 13 | { 14 | temp[lane] = val; 15 | g.sync(); // wait for all threads to store 16 | if(lane 2 | 3 | int main( void ) { 4 | mcDeviceProp_t prop; 5 | 6 | int count; 7 | mcGetDeviceCount( &count ); 8 | for (int i=0; i< count; i++) { 9 | mcGetDeviceProperties( &prop, i ); 10 | printf( " --- General Information for device %d ---\n", i ); 11 | printf( "Name: %s\n", prop.name ); 12 | printf( "Compute capability: %d.%d\n", prop.major, prop.minor ); 13 | printf( "Clock rate: %d\n", prop.clockRate ); 14 | printf( "Device copy overlap: " ); 15 | if (prop.deviceOverlap) 16 | printf( "Enabled\n" ); 17 | else 18 | printf( "Disabled\n" ); 19 | printf( "Kernel execition timeout : " ); 20 | if (prop.kernelExecTimeoutEnabled) 21 | printf( "Enabled\n" ); 22 | else 23 | printf( "Disabled\n" ); 24 | 25 | printf( " --- MP Information for device %d ---\n", i ); 26 | printf( "Multiprocessor count: %d\n", 27 | prop.multiProcessorCount ); 28 | printf( "Threads in wave: %d\n", prop.waveSize ); 29 | printf( "Max threads per block: %d\n", 30 | prop.maxThreadsPerBlock ); 31 | printf( "Max thread dimensions: (%d, %d, 
%d)\n", 32 | prop.maxThreadsDim[0], prop.maxThreadsDim[1], 33 | prop.maxThreadsDim[2] ); 34 | printf( "Max grid dimensions: (%d, %d, %d)\n", 35 | prop.maxGridSize[0], prop.maxGridSize[1], 36 | prop.maxGridSize[2] ); 37 | printf( "\n" ); 38 | } 39 | } 40 | -------------------------------------------------------------------------------- /chapter5/nestedHelloWorld.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | 5 | __global__ void nestedHelloWorld(int const iSize, int iDepth) { 6 | int tid = threadIdx.x; 7 | printf("Recursion=%d: Hello World from thread %d" 8 | " block %d\n", iDepth, tid, blockIdx.x); 9 | 10 | // condition to stop recursive execution 11 | if (iSize==1) return; 12 | 13 | //reduce block size to half 14 | int nThreads = iSize >> 1; 15 | 16 | //thread 0 lauches child grid recursively 17 | if (tid == 0 && nThreads >0) { 18 | nestedHelloWorld<<<1, nThreads>>>(nThreads, ++iDepth); 19 | printf("------> nested execution depth: %d\n", iDepth); 20 | } 21 | } 22 | 23 | int main(int argc, char *argv[]) 24 | { 25 | // launch nestedHelloWorld 26 | nestedHelloWorld<<<1,8>>>(8,0); 27 | mcDeviceSynchronize(); 28 | return 0; 29 | } 30 | -------------------------------------------------------------------------------- /chapter6/AplusB_with_managed.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | using namespace std; 6 | 7 | __device__ __managed__ int ret[1000]; 8 | __global__ void AplusB(int a, int b) { 9 | ret[threadIdx.x] = a + b + threadIdx.x; 10 | } 11 | int main() { 12 | AplusB<<< 1, 1000 >>>(10, 100); 13 | mcDeviceSynchronize(); 14 | for(int i = 0; i < 1000; i++) 15 | printf("%d: A+B = %d\n", i, ret[i]); 16 | return 0; 17 | } 18 | -------------------------------------------------------------------------------- /chapter6/AplusB_with_unified_addressing.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | __global__ void AplusB(int *ret, int a, int b) { 8 | ret[threadIdx.x] = a + b + threadIdx.x; 9 | } 10 | int main() { 11 | int *ret; 12 | mcMallocManaged(&ret, 1000 * sizeof(int)); 13 | AplusB<<< 1, 1000 >>>(ret, 10, 100); 14 | mcDeviceSynchronize(); 15 | for(int i = 0; i < 1000; i++) 16 | printf("%d: A+B = %d\n", i, ret[i]); 17 | mcFree(ret); 18 | return 0; 19 | } 20 | -------------------------------------------------------------------------------- /chapter6/AplusB_without_unified_addressing.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | 7 | __global__ void AplusB(int *ret, int a, int b) { 8 | ret[threadIdx.x] = a + b + threadIdx.x; 9 | } 10 | int main() { 11 | int *ret; 12 | mcMalloc(&ret, 1000 * sizeof(int)); 13 | AplusB<<< 1, 1000 >>>(ret, 10, 100); 14 | int *host_ret = (int *)malloc(1000 * sizeof(int)); 15 | mcMemcpy(host_ret, ret, 1000 * sizeof(int), mcMemcpyDefault); 16 | for(int i = 0; i < 1000; i++) 17 | printf("%d: A+B = %d\n", i, host_ret[i]); 18 | free(host_ret); 19 | mcFree(ret); 20 | return 0; 21 | } 22 | -------------------------------------------------------------------------------- /chapter6/BC_addKernel.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | #define ThreadsPerBlock 256 7 | #define 
maxGridSize 16 8 | __global__ void BC_addKernel(const int *a, int *r) 9 | { 10 | __shared__ int cache[ThreadsPerBlock]; 11 | int tid = blockIdx.x * blockDim.x + threadIdx.x; 12 | int cacheIndex = threadIdx.x; 13 | 14 | // copy data to shared memory from global memory 15 | cache[cacheIndex] = a[tid]; 16 | __syncthreads(); 17 | 18 | // add these data using reduce 19 | for (int i = 1; i < blockDim.x; i *= 2) 20 | { 21 | int index = 2 * i * cacheIndex; 22 | if (index < blockDim.x) 23 | { 24 | cache[index] += cache[index + i]; 25 | } 26 | __syncthreads(); 27 | } 28 | 29 | // copy the result of reduce to global memory 30 | if (cacheIndex == 0){ 31 | r[blockIdx.x] = cache[cacheIndex]; 32 | printf("blockIdx.x:%d r[blockIdx.x]:%d\n",blockIdx.x,r[blockIdx.x]); 33 | } 34 | 35 | } 36 | 37 | int test(int *h_a,int n){ 38 | int *a; 39 | mcMalloc(&a,n*sizeof(int)); 40 | mcMemcpy(a,h_a,n*sizeof(int),mcMemcpyHostToDevice); 41 | int *r; 42 | int h_r[maxGridSize]={0}; 43 | mcMalloc(&r,maxGridSize*sizeof(int)); 44 | mcMemcpy(r,h_r,maxGridSize*sizeof(int),mcMemcpyHostToDevice); 45 | BC_addKernel<<>>(a,r); 46 | mcMemcpy(h_a,a,n*sizeof(int),mcMemcpyDeviceToHost); 47 | mcMemcpy(h_r,r,maxGridSize*sizeof(int),mcMemcpyDeviceToHost); 48 | mcFree(r); 49 | mcFree(a); 50 | int sum=0; 51 | for(int i=0;i 2 | #include 3 | #include 4 | #include 5 | 6 | #define ThreadsPerBlock 256 7 | #define maxGridSize 16 8 | __global__ void NBC_addKernel2(const int *a, int *r) 9 | { 10 | __shared__ int cache[ThreadsPerBlock]; 11 | int tid = blockIdx.x * blockDim.x + threadIdx.x; 12 | int cacheIndex = threadIdx.x; 13 | 14 | // copy data to shared memory from global memory 15 | cache[cacheIndex] = a[tid]; 16 | __syncthreads(); 17 | 18 | // add these data using reduce 19 | for (int i = blockDim.x / 2; i > 0; i /= 2) 20 | { 21 | if (cacheIndex < i) 22 | { 23 | cache[cacheIndex] += cache[cacheIndex + i]; 24 | } 25 | __syncthreads(); 26 | } 27 | 28 | // copy the result of reduce to global memory 29 | if (cacheIndex == 0){ 30 | r[blockIdx.x] = cache[cacheIndex]; 31 | printf("blockIdx.x:%d r[blockIdx.x]:%d\n",blockIdx.x,r[blockIdx.x]); 32 | } 33 | } 34 | 35 | 36 | int test(int *h_a,int n){ 37 | int *a; 38 | mcMalloc(&a,n*sizeof(int)); 39 | mcMemcpy(a,h_a,n*sizeof(int),mcMemcpyHostToDevice); 40 | int *r; 41 | int h_r[maxGridSize]={0}; 42 | mcMalloc(&r,maxGridSize*sizeof(int)); 43 | mcMemcpy(r,h_r,maxGridSize*sizeof(int),mcMemcpyHostToDevice); 44 | NBC_addKernel2<<>>(a,r); 45 | mcMemcpy(h_a,a,n*sizeof(int),mcMemcpyDeviceToHost); 46 | mcMemcpy(h_r,r,maxGridSize*sizeof(int),mcMemcpyDeviceToHost); 47 | mcFree(r); 48 | mcFree(a); 49 | int sum=0; 50 | for(int i=0;i 2 | #include 3 | #include 4 | using namespace std; 5 | 6 | __global__ void test_shfl_down_sync(int A[], int B[]) 7 | { 8 | int tid = threadIdx.x; 9 | int value = B[tid]; 10 | 11 | value = __shfl_down_sync(0xffffffffffffffff, value, 2); 12 | A[tid] = value; 13 | 14 | } 15 | 16 | 17 | int main() 18 | { 19 | int *A,*Ad, *B, *Bd; 20 | int n = 64; 21 | int size = n * sizeof(int); 22 | 23 | // CPU端分配内存 24 | A = (int*)malloc(size); 25 | B = (int*)malloc(size); 26 | 27 | for (int i = 0; i < n; i++) 28 | { 29 | B[i] = rand()%101; 30 | std::cout << B[i] << std::endl; 31 | } 32 | 33 | std::cout <<"----------------------------" << std::endl; 34 | 35 | // GPU端分配内存 36 | mcMalloc((void**)&Ad, size); 37 | mcMalloc((void**)&Bd, size); 38 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice); 39 | 40 | // 定义kernel执行配置,(1024*1024/512)个block,每个block里面有512个线程 41 | dim3 dimBlock(128); 42 | dim3 dimGrid(1000); 43 | 44 | // 
执行kernel 45 | test_shfl_down_sync <<<1, 64 >>> (Ad,Bd); 46 | 47 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost); 48 | 49 | // 校验误差 50 | float max_error = 0.0; 51 | for (int i = 0; i < 64; i++) 52 | { 53 | std::cout << A[i] << std::endl; 54 | } 55 | 56 | cout << "max error is " << max_error << endl; 57 | 58 | // 释放CPU端、GPU端的内存 59 | free(A); 60 | free(B); 61 | mcFree(Ad); 62 | mcFree(Bd); 63 | 64 | return 0; 65 | } 66 | -------------------------------------------------------------------------------- /chapter6/__shfl_syncExample.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | using namespace std; 5 | 6 | __global__ void test_shfl_sync(int A[], int B[]) 7 | { 8 | int tid = threadIdx.x; 9 | int value = B[tid]; 10 | 11 | value = __shfl_sync(0xffffffffffffffff, value, 2); 12 | A[tid] = value; 13 | } 14 | 15 | int main() 16 | { 17 | int *A,*Ad, *B, *Bd; 18 | int n = 64; 19 | int size = n * sizeof(int); 20 | 21 | // CPU端分配内存 22 | A = (int*)malloc(size); 23 | B = (int*)malloc(size); 24 | 25 | for (int i = 0; i < n; i++) 26 | { 27 | B[i] = rand()%101; 28 | std::cout << B[i] << std::endl; 29 | } 30 | 31 | std::cout <<"----------------------------" << std::endl; 32 | 33 | // GPU端分配内存 34 | mcMalloc((void**)&Ad, size); 35 | mcMalloc((void**)&Bd, size); 36 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice); 37 | 38 | // 定义kernel执行配置,(1024*1024/512)个block,每个block里面有512个线程 39 | dim3 dimBlock(128); 40 | dim3 dimGrid(1000); 41 | 42 | // 执行kernel 43 | test_shfl_sync <<<1, 64 >>> (Ad,Bd); 44 | 45 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost); 46 | 47 | // 校验误差 48 | float max_error = 0.0; 49 | for (int i = 0; i < 64; i++) 50 | { 51 | std::cout << A[i] << std::endl; 52 | } 53 | 54 | cout << "max error is " << max_error << endl; 55 | 56 | // 释放CPU端、GPU端的内存 57 | free(A); 58 | free(B); 59 | mcFree(Ad); 60 | mcFree(Bd); 61 | 62 | return 0; 63 | } 64 | -------------------------------------------------------------------------------- /chapter6/__shfl_up_syncExample.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | using namespace std; 5 | 6 | __global__ void test_shfl_up_sync(int A[], int B[]) 7 | { 8 | int tid = threadIdx.x; 9 | int value = B[tid]; 10 | 11 | value = __shfl_up_sync(0xffffffffffffffff, value, 2); 12 | A[tid] = value; 13 | 14 | } 15 | 16 | 17 | int main() 18 | { 19 | int *A,*Ad, *B, *Bd; 20 | int n = 64; 21 | int size = n * sizeof(int); 22 | 23 | // CPU端分配内存 24 | A = (int*)malloc(size); 25 | B = (int*)malloc(size); 26 | 27 | for (int i = 0; i < n; i++) 28 | { 29 | B[i] = rand()%101; 30 | std::cout << B[i] << std::endl; 31 | } 32 | 33 | std::cout <<"----------------------------" << std::endl; 34 | 35 | // GPU端分配内存 36 | mcMalloc((void**)&Ad, size); 37 | mcMalloc((void**)&Bd, size); 38 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice); 39 | 40 | // 定义kernel执行配置,(1024*1024/512)个block,每个block里面有512个线程 41 | dim3 dimBlock(128); 42 | dim3 dimGrid(1000); 43 | 44 | // 执行kernel 45 | test_shfl_up_sync <<<1, 64 >>> (Ad,Bd); 46 | 47 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost); 48 | 49 | // 校验误差 50 | float max_error = 0.0; 51 | for (int i = 0; i < 64; i++) 52 | { 53 | std::cout << A[i] << std::endl; 54 | } 55 | 56 | cout << "max error is " << max_error << endl; 57 | 58 | // 释放CPU端、GPU端的内存 59 | free(A); 60 | free(B); 61 | mcFree(Ad); 62 | mcFree(Bd); 63 | 64 | return 0; 65 | } 66 | -------------------------------------------------------------------------------- 
/chapter6/__shfl_xor_syncExample.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | __global__ void waveReduce() { 5 | int laneId = threadIdx.x & 0x3f; 6 | // Seed starting value as inverse lane ID 7 | int value = 63 - laneId; 8 | 9 | // Use XOR mode to perform butterfly reduction 10 | for (int i=1; i<64; i*=2) 11 | value += __shfl_xor_sync(0xffffffffffffffff, value, i, 64); 12 | 13 | // "value" now contains the sum across all threads 14 | printf("Thread %d final value = %d\n", threadIdx.x, value); 15 | } 16 | 17 | int main() { 18 | waveReduce<<< 1, 64 >>>(); 19 | mcDeviceSynchronize(); 20 | return 0; 21 | } 22 | -------------------------------------------------------------------------------- /chapter6/checkGlobalVariable.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | __device__ float devData; 5 | __global__ void checkGlobalVariable(){ 6 | printf("Device: the value of the global variable is %f\n", devData); 7 | devData += 2.0; 8 | } 9 | 10 | int main(){ 11 | float value = 3.14f; 12 | mcMemcpyToSymbol(devData, &value, sizeof(float)); 13 | printf("Host: copy %f to the global variable\n", value); 14 | checkGlobalVariable<<<1,1>>>(); 15 | mcMemcpyFromSymbol(&value, devData, sizeof(float)); 16 | printf("Host: the value changed by the kernel to %f\n", value); 17 | mcDeviceReset(); 18 | return EXIT_SUCCESS; 19 | } 20 | -------------------------------------------------------------------------------- /chapter6/information.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | int main( void ) { 4 | mcDeviceProp_t prop; 5 | 6 | int count; 7 | mcGetDeviceCount( &count ); 8 | for (int i=0; i< count; i++) { 9 | mcGetDeviceProperties( &prop, i ); 10 | printf( " --- Memory Information for device %d ---\n", i ); 11 | printf( "Total global mem: %ld[bytes]\n", prop.totalGlobalMem ); 12 | printf( "Total constant Mem: %ld[bytes]\n", prop.totalConstMem ); 13 | printf( "Max mem pitch: %ld[bytes]\n", prop.memPitch ); 14 | printf( "Texture alignment: %ld[bytes]\n", prop.textureAlignment ); 15 | printf( "Shared mem per AP: %ld[bytes]\n",prop.sharedMemPerBlock ); 16 | printf( "Registers per AP: %d[bytes]\n", prop.regsPerBlock ); 17 | printf( "\n" ); 18 | } 19 | } 20 | -------------------------------------------------------------------------------- /chapter6/vectorAddUnifiedVirtualAddressing.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | 8 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N) 9 | { 10 | int i = threadIdx.x + blockDim.x * blockIdx.x; 11 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f; 12 | } 13 | 14 | int main(int argc, char *argv[]) { 15 | 16 | int n = atoi(argv[1]); 17 | cout << n << endl; 18 | 19 | size_t size = n * sizeof(float); 20 | mcError_t err; 21 | 22 | // Allocate the host vectors of A&B&C 23 | unsigned int flag = mcMallocHostPortable; 24 | float *a = NULL; 25 | float *b = NULL; 26 | float *c = NULL; 27 | err = mcMallocHost((void**)&a, size, flag); 28 | err = mcMallocHost((void**)&b, size, flag); 29 | err = mcMallocHost((void**)&c, size, flag); 30 | 31 | // Initialize the host vectors of A&B 32 | for (int i = 0; i < n; i++) { 33 | float af = rand() / double(RAND_MAX); 34 | float bf = rand() / double(RAND_MAX); 35 | a[i] = af; 36 | b[i] = bf; 37 | } 38 | 39 | // 
Launch the vector add kernel 40 | struct timeval t1, t2; 41 | int threadPerBlock = 256; 42 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 43 | printf("threadPerBlock: %d \nblockPerGrid: %d \n",threadPerBlock,blockPerGrid); 44 | gettimeofday(&t1, NULL); 45 | vectorAdd<<< blockPerGrid, threadPerBlock >>> (a, b, c, n); 46 | gettimeofday(&t2, NULL); 47 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 48 | cout << timeuse << endl; 49 | 50 | // Free host memory 51 | err = mcFreeHost(a); 52 | err = mcFreeHost(b); 53 | err = mcFreeHost(c); 54 | 55 | return 0; 56 | } 57 | -------------------------------------------------------------------------------- /chapter6/vectorAddZerocopy.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | using namespace std; 7 | 8 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N) 9 | { 10 | int i = threadIdx.x + blockDim.x * blockIdx.x; 11 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f; 12 | } 13 | 14 | int main(int argc, char *argv[]) { 15 | 16 | int n = atoi(argv[1]); 17 | cout << n << endl; 18 | 19 | size_t size = n * sizeof(float); 20 | mcError_t err; 21 | 22 | // Allocate the host vectors of A&B&C 23 | unsigned int flag = mcMallocHostMapped; 24 | float *a = NULL; 25 | float *b = NULL; 26 | float *c = NULL; 27 | err = mcMallocHost((void**)&a, size, flag); 28 | err = mcMallocHost((void**)&b, size, flag); 29 | err = mcMallocHost((void**)&c, size, flag); 30 | 31 | // Initialize the host vectors of A&B 32 | for (int i = 0; i < n; i++) { 33 | float af = rand() / double(RAND_MAX); 34 | float bf = rand() / double(RAND_MAX); 35 | a[i] = af; 36 | b[i] = bf; 37 | } 38 | 39 | // Get the pointer in device on the vectors of A&B&C 40 | float *da = NULL; 41 | float *db = NULL; 42 | float *dc = NULL; 43 | err = mcHostGetDevicePointer((void**)&da, (void *)a, 0); 44 | err = mcHostGetDevicePointer((void**)&db, (void *)b, 0); 45 | err = mcHostGetDevicePointer((void**)&dc, (void *)c, 0); 46 | 47 | // Launch the vector add kernel 48 | struct timeval t1, t2; 49 | int threadPerBlock = 256; 50 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 51 | printf("threadPerBlock: %d \nblockPerGrid: %d \n",threadPerBlock,blockPerGrid); 52 | gettimeofday(&t1, NULL); 53 | vectorAdd<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 54 | gettimeofday(&t2, NULL); 55 | double timeuse = (t2.tv_sec - t1.tv_sec) 56 | + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 57 | cout << timeuse << endl; 58 | 59 | // Free host memory 60 | err = mcFreeHost(a); 61 | err = mcFreeHost(b); 62 | err = mcFreeHost(c); 63 | 64 | return 0; 65 | } 66 | -------------------------------------------------------------------------------- /chapter7/Makefile.txt: -------------------------------------------------------------------------------- 1 | # MXMACA Compiler 2 | MXCC = $(MACA_PATH)/mxgpu_llvm/bin/mxcc 3 | 4 | # Compiler flags 5 | MXCCFLAGS = -xmaca 6 | 7 | # Source files 8 | SRCS= main.cpp src/a.cpp src/b.cpp 9 | 10 | # Object files 11 | OBJS = $(SRCS:.cpp=.o) 12 | 13 | # Executable 14 | EXEC = my_program 15 | 16 | # Default target 17 | all: $(EXEC) 18 | 19 | # Link object files to create executable 20 | $(EXEC): $(OBJS) 21 | $(MXCC) $(OBJS) -o $(EXEC) 22 | 23 | %.o: %.cpp 24 | $(MXCC) $(MXCCFLAGS) -c $< -o $@ -I include 25 | 26 | # clean up object files and executable 27 | clean: 28 | rm -f $(OBJS) $(EXEC) 29 | 
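A minimal usage sketch for the Makefile above. The targets (all, clean), the mxcc compiler with -xmaca, and the my_program executable name come from the Makefile itself; the /opt/maca prefix is only an assumed example value for MACA_PATH, not something stated in the repository.
# 1) setting:  export MACA_PATH=/opt/maca   # assumed install prefix; adjust to the local MXMACA installation
# 2) building: make                         # compiles main.cpp, src/a.cpp and src/b.cpp with mxcc -xmaca, then links my_program
# 3) running:  ./my_program
# 4) cleaning: make clean                   # removes the object files and the executable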
-------------------------------------------------------------------------------- /chapter7/my_program/CMakeLists.txt: -------------------------------------------------------------------------------- 1 | # Specify the minimum CMake version required 2 | cmake_minimum_required(VERSION 3.0) 3 | 4 | # Set the project name 5 | project(my_program) 6 | 7 | # Set the path to the compiler 8 | set(MXCC_PATH $ENV{MACA_PATH}) 9 | set(CMAKE_CXX_COMPILER ${MXCC_PATH}/mxgpu_llvm/bin/mxcc) 10 | 11 | # Set the compiler flags 12 | set(MXCC_COMPILE_FLAGS -x maca) 13 | add_compile_options(${MXCC_COMPILE_FLAGS}) 14 | 15 | # Add source files 16 | File(GLOB SRCS src/*.cpp main.cpp) 17 | add_executable(my_program ${SRCS}) 18 | 19 | # Set the include paths 20 | target_include_directories(my_program PRIVATE include) 21 | -------------------------------------------------------------------------------- /chapter7/my_program/include/a.h: -------------------------------------------------------------------------------- 1 | extern void func_a(); -------------------------------------------------------------------------------- /chapter7/my_program/include/b.h: -------------------------------------------------------------------------------- 1 | extern void func_b(); -------------------------------------------------------------------------------- /chapter7/my_program/main.cpp: -------------------------------------------------------------------------------- 1 | //main.cpp: 2 | #include <stdio.h> 3 | #include "a.h" 4 | #include "b.h" 5 | int main() 6 | { 7 | func_a(); 8 | func_b(); 9 | printf("my program!\n"); 10 | return 1; 11 | } 12 | -------------------------------------------------------------------------------- /chapter7/my_program/src/a.cpp: -------------------------------------------------------------------------------- 1 | //a.cpp: 2 | #include 3 | #include 4 | extern "C" __global__ void vector_add(int *A_d, size_t num) 5 | { 6 | size_t offset = (blockIdx.x * blockDim.x + threadIdx.x); 7 | size_t stride = blockDim.x * gridDim.x; 8 | for (size_t i = offset; i < num; i += stride) { 9 | A_d[i]++; 10 | } 11 | } 12 | void func_a() 13 | { 14 | size_t arrSize = 100; 15 | mcDeviceptr_t a_d; 16 | int *a_h = (int *)malloc(sizeof(int) * arrSize); 17 | memset(a_h, 0, sizeof(int) * arrSize); 18 | mcMalloc(&a_d, sizeof(int) * arrSize); 19 | mcMemcpyHtoD(a_d, a_h, sizeof(int) * arrSize); 20 | vector_add<<<1, arrSize>>>(reinterpret_cast<int *>(a_d), arrSize); 21 | mcMemcpyDtoH(a_h, a_d, sizeof(int) * arrSize); 22 | bool resCheck = true; 23 | for (int i = 0; i < arrSize; i++) { 24 | if (a_h[i] != 1){ 25 | resCheck = false; 26 | } 27 | } 28 | printf("vector add result: %s\n", resCheck ? 
"success": "fail"); 29 | free(a_h); 30 | mcFree(a_d); 31 | } 32 | 33 | //a.h: 34 | extern void func_a(); 35 | -------------------------------------------------------------------------------- /chapter7/my_program/src/b.cpp: -------------------------------------------------------------------------------- 1 | //b.cpp: 2 | #include 3 | __global__ void kernel_b() 4 | { 5 | /* kernel code*/ 6 | } 7 | void func_b() 8 | { 9 | /* launch kernel */ 10 | kernel_b<<<1, 1>>>(); 11 | } 12 | 13 | //b.h: 14 | extern void func_b(); 15 | -------------------------------------------------------------------------------- /chapter7/trigger_memory_violation.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | typedef struct 4 | { 5 | alignas(4)float f; 6 | double d; 7 | }__attribute__((packed)) test_type_mem_violation; 8 | 9 | __global__ void trigger_memory_violation(test_type_mem_violation *dst) 10 | { 11 | atomicAdd(&dst->f,1.23); 12 | atomicAdd(&dst->d,20); 13 | dst->f=9.8765; 14 | } 15 | 16 | int main() 17 | { 18 | test_type_mem_violation hd={0}; 19 | test_type_mem_violation *ddd; 20 | mcMalloc((void**)&ddd,sizeof(test_type_mem_violation)); 21 | mcMemcpy(ddd,&hd,sizeof(test_type_mem_violation),mcMemcpyHostToDevice); 22 | trigger_memory_violation<<>>(ddd); 23 | mcMemcpy(&hd,ddd,sizeof(test_type_mem_violation),mcMemcpyDeviceToHost); 24 | mcFree(ddd); 25 | return 0; 26 | } 27 | -------------------------------------------------------------------------------- /chapter7/trigger_memory_violation_repaired.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | typedef struct 4 | { 5 | float f; 6 | double d; 7 | }test_type_mem_violation; 8 | 9 | __global__ void trigger_memory_violation(test_type_mem_violation *dst) 10 | { 11 | atomicAdd(&dst->f,1.23); 12 | atomicAdd(&dst->d,20); 13 | dst->f=9.8765; 14 | } 15 | 16 | int main() 17 | { 18 | test_type_mem_violation hd={0}; 19 | test_type_mem_violation *ddd; 20 | mcMalloc((void**)&ddd,sizeof(test_type_mem_violation)); 21 | mcMemcpy(ddd,&hd,sizeof(test_type_mem_violation),mcMemcpyHostToDevice); 22 | trigger_memory_violation<<>>(ddd); 23 | mcMemcpy(&hd,ddd,sizeof(test_type_mem_violation),mcMemcpyDeviceToHost); 24 | mcFree(ddd); 25 | return 0; 26 | } 27 | -------------------------------------------------------------------------------- /chapter7/vectorAdd.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | __global__ void vectorADD(const float* A_d, const float* B_d, float* C_d, size_t NELEM) { 4 | size_t offset = (blockIdx.x * blockDim.x + threadIdx.x); 5 | size_t stride = blockDim.x * gridDim.x; 6 | 7 | for (size_t i = offset; i < NELEM; i += stride) { 8 | C_d[i] = A_d[i] + B_d[i]; 9 | } 10 | } 11 | 12 | int main() 13 | { 14 | int blocks=20; 15 | int threadsPerBlock=1024; 16 | int numSize=1024*1024; 17 | 18 | float *A_d=nullptr; 19 | float *B_d=nullptr; 20 | float *C_d=nullptr; 21 | 22 | float *A_h=nullptr; 23 | float *B_h=nullptr; 24 | float *C_h=nullptr; 25 | 26 | mcMalloc((void**)&A_d,numSize*sizeof(float)); 27 | mcMalloc((void**)&B_d,numSize*sizeof(float)); 28 | mcMalloc((void**)&C_d,numSize*sizeof(float)); 29 | 30 | A_h=(float*)malloc(numSize*sizeof(float)); 31 | B_h=(float*)malloc(numSize*sizeof(float)); 32 | C_h=(float*)malloc(numSize*sizeof(float)); 33 | 34 | for(int i=0;i>>(A_d,B_d,C_d,numSize); 45 | 46 | mcMemcpy(C_h,C_d,numSize*sizeof(float),mcMemcpyDeviceToHost); 47 | 48 | mcFree(A_d); 49 | mcFree(B_d); 50 
| mcFree(C_d); 51 | 52 | free(A_h); 53 | free(B_h); 54 | free(C_h); 55 | 56 | return 0; 57 | } 58 | -------------------------------------------------------------------------------- /chapter8/myKernel.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | // #include "device_launch_parameters.h" 5 | 6 | __global__ void myKernel(float* devPtr, int height, int width, int pitch) 7 | { 8 | int row, col; 9 | float *rowHead; 10 | 11 | for (row = 0; row < height; row++) 12 | { 13 | rowHead = (float*)((char*)devPtr + row * pitch); 14 | 15 | for (col = 0; col < width; col++) 16 | { 17 | printf("\t%f", rowHead[col]);// 逐个打印并自增 1 18 | rowHead[col]++; 19 | } 20 | printf("\n"); 21 | } 22 | } 23 | 24 | int main() 25 | { 26 | size_t width = 6; 27 | size_t height = 5; 28 | float *h_data, *d_data; 29 | size_t pitch; 30 | 31 | h_data = (float *)malloc(sizeof(float)*width*height); 32 | for (int i = 0; i < width*height; i++) 33 | h_data[i] = (float)i; 34 | 35 | printf("\n\tAlloc memory."); 36 | mcMallocPitch((void **)&d_data, &pitch, sizeof(float)*width, height); 37 | printf("\n\tPitch = %d B\n", pitch); 38 | 39 | printf("\n\tCopy to Device.\n"); 40 | mcMemcpy2D(d_data, pitch, h_data, sizeof(float)*width, sizeof(float)*width, height, mcMemcpyHostToDevice); 41 | 42 | myKernel <<<1, 1 >>> (d_data, height, width, pitch); 43 | mcDeviceSynchronize(); 44 | 45 | printf("\n\tCopy back to Host.\n"); 46 | mcMemcpy2D(h_data, sizeof(float)*width, d_data, pitch, sizeof(float)*width, height, mcMemcpyDeviceToHost); 47 | 48 | for (int i = 0; i < width*height; i++) 49 | { 50 | printf("\t%f", h_data[i]); 51 | if ((i + 1) % width == 0) 52 | printf("\n"); 53 | } 54 | 55 | free(h_data); 56 | mcFree(d_data); 57 | 58 | getchar(); 59 | return 0; 60 | } 61 | -------------------------------------------------------------------------------- /chapter8/stream_parallel_execution.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #define FULL_DATA_SIZE 10000 5 | #define N 1000 6 | #define BLOCKNUM 16 7 | #define THREADNUM 64 8 | 9 | __global__ void kernel(int *a,int *b,int *c){ 10 | int idx=threadIdx.x+blockIdx.x*blockDim.x; 11 | if (idx>>(dev0_a, dev0_b, dev0_c); 76 | 77 | kernel <<>>(dev1_a, dev1_b, dev1_c); 78 | 79 | mcStatus = mcMemcpyAsync(host_c + i, dev0_c, N * sizeof(int), 80 | mcMemcpyDeviceToHost, stream0); 81 | if (mcStatus != mcSuccess) 82 | { 83 | printf("mcMemcpyAsync0 c failed!\n"); 84 | } 85 | 86 | mcStatus = mcMemcpyAsync(host_c + N + i, dev1_c, N * sizeof(int), 87 | mcMemcpyDeviceToHost, stream1); 88 | if (mcStatus != mcSuccess) 89 | { 90 | printf("mcMemcpyAsync1 c failed!\n"); 91 | } 92 | } 93 | for(i=0;i<20;i++){ 94 | printf("%d ",host_a[i]); 95 | } 96 | printf("\n"); 97 | for(i=0;i<20;i++){ 98 | printf("%d ",host_b[i]); 99 | } 100 | printf("\n"); 101 | for(i=0;i<20;i++){ 102 | printf("%d ",host_c[i]); 103 | } 104 | printf("\n"); 105 | mcStreamSynchronize(stream1); 106 | mcStreamSynchronize(stream0); 107 | mcStreamDestroy(stream1); 108 | mcStreamDestroy(stream0); 109 | mcFree(dev0_a); 110 | mcFree(dev1_a); 111 | mcFree(dev0_b); 112 | mcFree(dev1_b); 113 | mcFree(dev0_c); 114 | mcFree(dev1_c); 115 | free(host_a); 116 | free(host_b); 117 | free(host_c); 118 | } 119 | -------------------------------------------------------------------------------- /chapter9/shortKernelsAsyncLaunch.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | 
* 9.4.1: 1) lots of short kernels launched asynchronously 3 | * 9.4.1 {Sample#2} lots of short kernels launched asynchronously 4 | * Usage: 5 | * 1) compiling: mxcc -x maca shortKernelsAsyncLaunch.cpp -o shortKernelsAsyncLaunch 6 | * 2) running:./shortKernelsAsyncLaunch 7 | */ 8 | #include 9 | #include 10 | #include "mc_runtime.h" 11 | 12 | #define macaCheckErrors(msg) \ 13 | do { \ 14 | mcError_t __err = mcGetLastError(); \ 15 | if (__err != mcSuccess) { \ 16 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 17 | msg, mcGetErrorString(__err), \ 18 | __FILE__, __LINE__); \ 19 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 20 | exit(1); \ 21 | } \ 22 | } while (0) 23 | 24 | 25 | #include 26 | #include 27 | #define USECPSEC 1000000ULL 28 | 29 | unsigned long long dtime_usec(unsigned long long start){ 30 | timeval tv; 31 | gettimeofday(&tv, 0); 32 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 33 | } 34 | 35 | #define N 400000 // tuned until kernel takes a few microseconds 36 | __global__ void shortKernel(float * out_d, float * in_d){ 37 | int idx=blockIdx.x*blockDim.x+threadIdx.x; 38 | if(idx>>(d_output, d_input); 58 | macaCheckErrors("kernel launch failure"); 59 | mcDeviceSynchronize(); 60 | macaCheckErrors("kernel execution failure"); 61 | // run on device and measure execution time 62 | unsigned long long dt = dtime_usec(0); 63 | dt = dtime_usec(0); 64 | for(int istep=0; istep>>(d_output, d_input); 67 | } 68 | } 69 | mcStreamSynchronize(stream); 70 | 71 | macaCheckErrors("kernel execution failure"); 72 | dt = dtime_usec(dt); 73 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." << std::endl; 74 | return 0; 75 | } -------------------------------------------------------------------------------- /chapter9/shortKernelsGraphLaunch.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | * 9.4.1 {Sample#3} lots of short kernels launched by graph APIs 3 | * Usage: 4 | * 1) compiling: mxcc -x maca shortKernelsGraphLaunch.cpp -o shortKernelsGraphLaunch 5 | * 2) setting: export MACA_GRAPH_LAUNCH_MODE=1 6 | * 3) running:./shortKernelsGraphLaunch 7 | */ 8 | #include 9 | #include 10 | #include "mc_runtime.h" 11 | 12 | #define macaCheckErrors(msg) \ 13 | do { \ 14 | mcError_t __err = mcGetLastError(); \ 15 | if (__err != mcSuccess) { \ 16 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 17 | msg, mcGetErrorString(__err), \ 18 | __FILE__, __LINE__); \ 19 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 20 | exit(1); \ 21 | } \ 22 | } while (0) 23 | 24 | 25 | #include 26 | #include 27 | #define USECPSEC 1000000ULL 28 | 29 | unsigned long long dtime_usec(unsigned long long start){ 30 | timeval tv; 31 | gettimeofday(&tv, 0); 32 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 33 | } 34 | 35 | #define N 400000 // tuned until kernel takes a few microseconds 36 | __global__ void shortKernel(float * out_d, float * in_d){ 37 | int idx=blockIdx.x*blockDim.x+threadIdx.x; 38 | if(idx>>(d_output, d_input); 58 | macaCheckErrors("kernel launch failure"); 59 | mcDeviceSynchronize(); 60 | macaCheckErrors("kernel execution failure"); 61 | // run on device and measure execution time 62 | unsigned long long dt = dtime_usec(0); 63 | dt = dtime_usec(0); 64 | bool graphCreated=false; 65 | mcGraph_t graph; 66 | mcGraphExec_t instance; 67 | for(int istep=0; istep>>(d_output, d_input); 72 | } 73 | mcStreamEndCapture(stream, &graph); 74 | mcGraphInstantiate(&instance, graph, 
NULL, NULL, 0); 75 | graphCreated=true; 76 | } 77 | mcGraphLaunch(instance, stream); 78 | mcStreamSynchronize(stream); 79 | } 80 | macaCheckErrors("kernel execution failure"); 81 | dt = dtime_usec(dt); 82 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." << std::endl; 83 | return 0; 84 | } -------------------------------------------------------------------------------- /chapter9/shortKernelsSyncLaunch.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | * 9.4.1 {Sample#1} lots of short kernels launched synchronously 3 | * Usage: 4 | * 1) compiling: mxcc -x maca shortKernelsSyncLaunch.cpp -o shortKernelsSyncLaunch 5 | * 2) running:./shortKernelsSyncLaunch 6 | */ 7 | #include 8 | #include 9 | #include "mc_runtime.h" 10 | 11 | #define macaCheckErrors(msg) \ 12 | do { \ 13 | mcError_t __err = mcGetLastError(); \ 14 | if (__err != mcSuccess) { \ 15 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 16 | msg, mcGetErrorString(__err), \ 17 | __FILE__, __LINE__); \ 18 | fprintf(stderr, "*** FAILED - ABORTING\n"); \ 19 | exit(1); \ 20 | } \ 21 | } while (0) 22 | 23 | 24 | #include 25 | #include 26 | #define USECPSEC 1000000ULL 27 | 28 | unsigned long long dtime_usec(unsigned long long start){ 29 | timeval tv; 30 | gettimeofday(&tv, 0); 31 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start; 32 | } 33 | 34 | #define N 400000 // tuned until kernel takes a few microseconds 35 | __global__ void shortKernel(float * out_d, float * in_d){ 36 | int idx=blockIdx.x*blockDim.x+threadIdx.x; 37 | if(idx>>(d_output, d_input); 57 | macaCheckErrors("kernel launch failure"); 58 | mcDeviceSynchronize(); 59 | macaCheckErrors("kernel execution failure"); 60 | // run on device and measure execution time 61 | unsigned long long dt = dtime_usec(0); 62 | dt = dtime_usec(0); 63 | for(int istep=0; istep>>(d_output, d_input); 66 | mcStreamSynchronize(stream); 67 | } 68 | } 69 | macaCheckErrors("kernel execution failure"); 70 | dt = dtime_usec(dt); 71 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." 
<< std::endl; 72 | return 0; 73 | } -------------------------------------------------------------------------------- /common/common.h: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | #ifndef _COMMON_H 4 | #define _COMMON_H 5 | 6 | #define CHECK(call) \ 7 | { \ 8 | const mcError_t error = call; \ 9 | if (error != mcSuccess) \ 10 | { \ 11 | fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__); \ 12 | fprintf(stderr, "code: %d, reason: %s\n", error, \ 13 | mcGetErrorString(error)); \ 14 | } \ 15 | } 16 | 17 | inline double seconds() 18 | { 19 | struct timeval tp; 20 | struct timezone tzp; 21 | int i = gettimeofday(&tp, &tzp); 22 | return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6); 23 | } 24 | 25 | #endif // _COMMON_H 26 | -------------------------------------------------------------------------------- /习题运行结果/3.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/3.1.png -------------------------------------------------------------------------------- /习题运行结果/3.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/3.2.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.1运行结果/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/1.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.1运行结果/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/2.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.1运行结果/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/3.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.2运行结果/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/1.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.2运行结果/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/2.png -------------------------------------------------------------------------------- /习题运行结果/5.2.9.2运行结果/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/3.png 
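A short, hedged usage sketch for common/common.h, listed above just before the exercise screenshots: none of the chapter sources in this listing include it, so the kernel name demoKernel, the vector length n and the launch configuration below are illustrative assumptions rather than code from the repository. It only shows the intended pattern: wrap runtime calls in CHECK() to report file, line and mcGetErrorString on failure, and bracket a region with seconds() to time it.
#include "mc_runtime.h"
#include <stdio.h>
#include "common.h"   // provides the CHECK macro and seconds() shown above

__global__ void demoKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                                  // illustrative size
    float *d_data = NULL;
    CHECK(mcMalloc((void **)&d_data, n * sizeof(float)));   // prints file/line and reason if the allocation fails
    double t0 = seconds();
    demoKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    mcDeviceSynchronize();                                  // wait so the timing below covers the kernel
    printf("demoKernel took %f s\n", seconds() - t0);
    CHECK(mcFree(d_data));
    return 0;
}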
-------------------------------------------------------------------------------- /习题运行结果/T4运行结果.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/T4运行结果.png -------------------------------------------------------------------------------- /习题运行结果/answer.md: -------------------------------------------------------------------------------- 1 | # new answer 2 | 3 | ## Chapter 2 4 | 5 | ### Exercise 1 6 | 7 | #### 参考代码 8 | 9 | ```c 10 | #include 11 | #include 12 | #include 13 | 14 | __global__ void helloFromGpu (void) 15 | 16 | { 17 | printf("Hello World from GPU!\n"); 18 | } 19 | 20 | int main(void) 21 | { 22 | printf("Hello World from CPU!\n"); 23 | helloFromGpu <<<1, 10>>>(); 24 | return 0; 25 | } 26 | ``` 27 | 28 | #### 编译结果 29 | 30 | 函数mcDeviceReset()用来显式销毁并清除与当前设备有关的所有资源。 31 | 32 | 当重置函数移除,编译运行则只输出 33 | 34 | ``` 35 | Hello World from CPU! 36 | ``` 37 | 38 | 当printf在gpu上被调用,mcDeviceReset()函数使这些来自gpu的输出发送到主机,然后在控制台输出。 39 | 40 | 没有调用cudaDeviceReset()函数就不能保证这些可以被显示。 41 | 42 | ### Exercise 2 43 | 44 | #### 参考代码 45 | 46 | ```c 47 | #include 48 | #include 49 | #include 50 | 51 | __global__ void helloFromGpu (void) 52 | { 53 | printf("Hello World from GPU!\n"); 54 | } 55 | 56 | int main(void) 57 | { 58 | printf("Hello World from CPU!\n"); 59 | 60 | helloFromGpu <<<1, 10>>>(); 61 | mcDeviceSynchronize(); 62 | return 0; 63 | } 64 | 65 | ``` 66 | 67 | #### 编译结果 68 | 69 | ``` 70 | Hello World from CPU! 71 | Hello World from GPU! 72 | Hello World from GPU! 73 | Hello World from GPU! 74 | Hello World from GPU! 75 | Hello World from GPU! 76 | Hello World from GPU! 77 | Hello World from GPU! 78 | Hello World from GPU! 79 | Hello World from GPU! 80 | Hello World from GPU! 
81 | ``` 82 | 83 | 输出效果和helloFromGpu.c一样。 84 | 85 | mcDeviceSynchronize()也可以用来使gpu的输出打印在用户可见控制台。 86 | 87 | ### Exercise 3 88 | 89 | #### 参考代码 90 | 91 | ```c 92 | #include 93 | #include 94 | #include 95 | 96 | __global__ void helloFromGpu (void) 97 | { 98 | if (threadIdx.x==9) printf("Hello World from GPU Thread 9!\n"); 99 | } 100 | int main(void) 101 | { 102 | printf("Hello World from CPU!\n"); 103 | helloFromGpu <<<1, 10>>>(); 104 | mcDeviceReset(); 105 | return 0; 106 | } 107 | ``` 108 | 109 | ## Chapter 3 110 | 111 | ### Exercise 1 112 | 113 | #### 参考代码 114 | 115 | ```c++ 116 | #include 117 | #include 118 | #include 119 | #include 120 | 121 | using namespace std; 122 | 123 | // 要用 __global__ 来修饰。 124 | // 输入指向3段显存的指针名。 125 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N) 126 | { 127 | int i = threadIdx.x + blockDim.x * blockIdx.x; 128 | // printf("threadIdx.x:%d blockDim.x:%d blockIdx.x:%d\n",threadIdx.x,blockDim.x,blockIdx.x); 129 | if (i < N) C_d[i] = A_d[i] + B_d[i]; 130 | } 131 | 132 | int main(int argc, char *argv[]) { 133 | 134 | int n = 2048; 135 | cout << n << endl; 136 | 137 | size_t size = n * sizeof(float); 138 | 139 | // host memery 140 | float *a = (float *)malloc(size); 141 | float *b = (float *)malloc(size); 142 | float *c = (float *)malloc(size); 143 | 144 | for (int i = 0; i < n; i++) { 145 | float af = rand() / double(RAND_MAX); 146 | float bf = rand() / double(RAND_MAX); 147 | a[i] = af; 148 | b[i] = bf; 149 | } 150 | 151 | // 定义空指针。 152 | float *da = NULL; 153 | float *db = NULL; 154 | float *dc = NULL; 155 | 156 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。 157 | mcMalloc((void **)&da, size); 158 | mcMalloc((void **)&db, size); 159 | mcMalloc((void **)&dc, size); 160 | 161 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。 162 | mcMemcpy(da,a,size,mcMemcpyHostToDevice); 163 | mcMemcpy(db,b,size,mcMemcpyHostToDevice); 164 | 165 | struct timeval t1, t2; 166 | 167 | // 计算线程块和网格的数量。 168 | int threadPerBlock_array[8]={1,16,32,64,128,256,512,1024}; 169 | for(int i=0;i<8;i++){ 170 | int threadPerBlock = threadPerBlock_array[i]; 171 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 172 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid); 173 | 174 | gettimeofday(&t1, NULL); 175 | 176 | // 调用核函数。 177 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 178 | 179 | gettimeofday(&t2, NULL); 180 | 181 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost); 182 | 183 | // for (int i = 0; i < 10; i++) 184 | // cout< 207 | 208 | ### Exercise 2 209 | 210 | #### 参考代码 211 | 212 | ```c++ 213 | #include 214 | #include 215 | #include 216 | #include 217 | 218 | using namespace std; 219 | 220 | // 要用 __global__ 来修饰。 221 | // 输入指向3段显存的指针名。 222 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N) 223 | { 224 | int i = threadIdx.x + blockDim.x * blockIdx.x; 225 | // printf("threadIdx.x:%d blockDim.x:%d blockIdx.x:%d\n",threadIdx.x,blockDim.x,blockIdx.x); 226 | if (i < N) C_d[i] = A_d[i] + B_d[i]; 227 | } 228 | 229 | int main(int argc, char *argv[]) { 230 | 231 | int n = 256; 232 | cout << n << endl; 233 | 234 | size_t size = n * sizeof(float); 235 | 236 | // host memory 237 | float *a = (float *)malloc(size); 238 | float *b = (float *)malloc(size); 239 | float *c = (float *)malloc(size); 240 | 241 | for (int i = 0; i < n; i++) { 242 | float af = rand() / double(RAND_MAX); 243 | float bf = rand() / double(RAND_MAX); 244 | a[i] = af; 245 | b[i] = bf; 246 | } 247 | 248 | // 
定义空指针。 249 | float *da = NULL; 250 | float *db = NULL; 251 | float *dc = NULL; 252 | 253 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。 254 | mcMalloc((void **)&da, size); 255 | mcMalloc((void **)&db, size); 256 | mcMalloc((void **)&dc, size); 257 | 258 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。 259 | mcMemcpy(da,a,size,mcMemcpyHostToDevice); 260 | mcMemcpy(db,b,size,mcMemcpyHostToDevice); 261 | 262 | struct timeval t1, t2; 263 | 264 | // 计算线程块和网格的数量。 265 | int threadPerBlock_array[2]={1,256}; 266 | for(int i=0;i<2;i++){ 267 | int threadPerBlock = threadPerBlock_array[i]; 268 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock; 269 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid); 270 | 271 | gettimeofday(&t1, NULL); 272 | 273 | // 调用核函数。 274 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n); 275 | 276 | gettimeofday(&t2, NULL); 277 | 278 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost); 279 | 280 | // for (int i = 0; i < 10; i++) 281 | // cout< 303 | 304 | ### Exercise 3 305 | 306 | 执行每个数值计算的速度并没有CPU快,CPU更适合处理逻辑控制密集的计算任务,GPU更适合处理数据密集的计算任务 307 | 308 | ### Exercise 4 309 | 310 | #### 参考代码 311 | 312 | ```c 313 | #include 314 | #include 315 | #include 316 | #include 317 | 318 | using namespace std; 319 | 320 | 321 | __global__ void matrixMultiplication(int *A_d,int *B_d,int *Result_d,int width){ 322 | int i=threadIdx.x+blockDim.x*blockIdx.x; 323 | int j=threadIdx.y+blockDim.y*blockIdx.y; 324 | int sum=0; 325 | int count; 326 | for(count=0;count>>(da,db,d_result,col); 357 | // 把显存的东西拷贝回内存 358 | mcMemcpy(result,d_result,sizeof(int)*row*col,mcMemcpyDeviceToHost); 359 | // print矩阵,这里row和col相等,所以统一用col表示 360 | int j; 361 | printf("a:\n"); 362 | for(i=0;i 398 | 399 | ## Chapter 5 400 | 401 | ### 5.2.9 402 | 403 | #### Exercise 1 404 | 405 | ##### 参考代码 406 | 407 | ```c 408 | #include 409 | #include 410 | #include 411 | using namespace std; 412 | 413 | 414 | __global__ void print() 415 | { 416 | printf("blockIdx.x:%d threadIdx.x:%d\n",blockIdx.x, threadIdx.x); 417 | } 418 | 419 | int main(void) 420 | { 421 | const dim3 block_size(16); 422 | print<<<10, block_size>>>(); 423 | mcDeviceSynchronize(); 424 | return 0; 425 | } 426 | 427 | 428 | ``` 429 | 430 | ##### 运行结果(一部分) 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 同一个wave内部thread的执行是顺序的。block的执行不是顺序的。 439 | 440 | 在MXMACA中,wave对程序员来说是透明的,它的大小可能会随着硬件的发展发生变化,在当前版本的MXMACA中,每个wave是由64个thread组成的。由64个thread组成的wave是MACA程序执行的最小单位,并且同一个wave是串行的。在一个SM中可能同时有来自不同block的wave。当一个block中的wave在进行访存或者同步等高延迟操作时,另一个block可以占用SM中的计算资源。这样,在SM内就实现了简单的乱序执行。不同block之间的执行没有顺序,完全并行。并且,一个sm只会执行一个block里的wave,当该block里的wave执行完才会执行其他block里的wave。 441 | 442 | #### Exercise 2 443 | 444 | ##### 参考代码 445 | 446 | ```c 447 | #include 448 | #include 449 | #include 450 | using namespace std; 451 | 452 | 453 | __global__ void print() 454 | { 455 | printf("blockIdx.x:%d threadIdx.x:%d threadIdx.y:%d threadIdx.z:%d\n",blockIdx.x, threadIdx.x, threadIdx.y, threadIdx.z); 456 | } 457 | 458 | int main(void) 459 | { 460 | const dim3 block_size(16); 461 | print<<<10, block_size>>>(); 462 | mcDeviceSynchronize(); 463 | return 0; 464 | } 465 | 466 | 467 | ``` 468 | 469 | 470 | 471 | ##### 运行结果 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 没有定义,默认为0. 
480 | 481 | 可以在定义block_size时对三个维度的size都进行设置(注意三者的乘积不可以超过maxThreadsPerBlock)。 482 | 483 | ### 5.4.4(待更正) 484 | 485 | #### Exercise 1 486 | 487 | ##### 参考代码 488 | 489 | ```c 490 | // #include 491 | #include 492 | #include 493 | #include 494 | #include 495 | #include 496 | // #include 497 | // #include "dynamicParallelism.h" 498 | #include 499 | /** block size along */ 500 | #define BSX 64 501 | #define BSY 4 502 | /** maximum recursion depth */ 503 | #define MAX_DEPTH 4 504 | /** region below which do per-pixel */ 505 | #define MIN_SIZE 32 506 | /** subdivision factor along each axis */ 507 | #define SUBDIV 4 508 | /** subdivision when launched from host */ 509 | #define INIT_SUBDIV 32 510 | #define H (16 * 1024) 511 | #define W (16 * 1024) 512 | #define MAX_DWELL 512 513 | using namespace std; 514 | 515 | 516 | 517 | /** a useful function to compute the number of threads */ 518 | int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); } 519 | 520 | /** a simple complex type */ 521 | struct complex { 522 | __host__ __device__ complex(float re, float im = 0) 523 | { 524 | this->re = re; 525 | this->im = im; 526 | } 527 | /** real and imaginary part */ 528 | float re, im; 529 | }; // struct complex 530 | 531 | // operator overloads for complex numbers 532 | inline __host__ __device__ complex operator+(const complex &a, const complex &b) 533 | { 534 | return complex(a.re + b.re, a.im + b.im); 535 | } 536 | inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); } 537 | inline __host__ __device__ complex operator-(const complex &a, const complex &b) 538 | { 539 | return complex(a.re - b.re, a.im - b.im); 540 | } 541 | inline __host__ __device__ complex operator*(const complex &a, const complex &b) 542 | { 543 | return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im); 544 | } 545 | inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; } 546 | inline __host__ __device__ complex operator/(const complex &a, const complex &b) 547 | { 548 | float invabs2 = 1 / abs2(b); 549 | return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2); 550 | } // operator/ 551 | /** find the dwell for the pixel */ 552 | __device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y) 553 | { 554 | complex dc = cmax - cmin; 555 | float fx = (float)x / w, fy = (float)y / h; 556 | complex c = cmin + complex(fx * dc.re, fy * dc.im); 557 | int dwell = 0; 558 | complex z = c; 559 | while (dwell < max_dwell && abs2(z) < 2 * 2) { 560 | z = z * z + c; 561 | dwell++; 562 | } 563 | return dwell; 564 | } // pixel_dwell 565 | 566 | /** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral 567 | element, -1 = dwells are different */ 568 | // #define NEUT_DWELL (MAX_DWELL + 1) 569 | #define DIFF_DWELL (-1) 570 | __device__ int same_dwell(int d1, int d2, int max_dwell) 571 | { 572 | if (d1 == d2) 573 | return d1; 574 | else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1)) 575 | return min(d1, d2); 576 | else 577 | return DIFF_DWELL; 578 | } // same_dwell 579 | 580 | /** evaluates the common border dwell, if it exists */ 581 | __device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0, 582 | int d) 583 | { 584 | // check whether all boundary pixels have the same dwell 585 | int tid = threadIdx.y * blockDim.x + threadIdx.x; 586 | int bs = blockDim.x * blockDim.y; 587 | int comm_dwell = (max_dwell + 1); 
588 | // for all boundary pixels, distributed across threads 589 | for (int r = tid; r < d; r += bs) { 590 | // for each boundary: b = 0 is east, then counter-clockwise 591 | for (int b = 0; b < 4; b++) { 592 | int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0); 593 | int y = b % 2 == 0 ? y0 + r : (b == 1 ? y0 + d - 1 : y0); 594 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y); 595 | comm_dwell = same_dwell(comm_dwell, dwell, max_dwell); 596 | } 597 | } // for all boundary pixels 598 | // reduce across threads in the block 599 | __shared__ int ldwells[BSX * BSY]; 600 | int nt = min(d, BSX * BSY); 601 | if (tid < nt) 602 | ldwells[tid] = comm_dwell; 603 | __syncthreads(); 604 | for (; nt > 1; nt /= 2) { 605 | if (tid < nt / 2) 606 | ldwells[tid] = same_dwell(ldwells[tid], ldwells[tid + nt / 2], max_dwell); 607 | __syncthreads(); 608 | } 609 | return ldwells[0]; 610 | } // border_dwell 611 | 612 | /** the kernel to fill the image region with a specific dwell value */ 613 | __global__ void dwell_fill_k(int *dwells, int w, int x0, int y0, int d, int dwell) 614 | { 615 | int x = threadIdx.x + blockIdx.x * blockDim.x; 616 | int y = threadIdx.y + blockIdx.y * blockDim.y; 617 | if (x < d && y < d) { 618 | x += x0, y += y0; 619 | dwells[y * w + x] = dwell; 620 | } 621 | } // dwell_fill_k 622 | 623 | /** 624 | * the kernel to fill in per-pixel values of the portion of the Mandelbrot set 625 | */ 626 | __global__ void mandelbrot_pixel_k(int *dwells, int w, int h, int max_dwell, complex cmin, 627 | complex cmax, int x0, int y0, int d) 628 | { 629 | int x = threadIdx.x + blockDim.x * blockIdx.x; 630 | int y = threadIdx.y + blockDim.y * blockIdx.y; 631 | if (x < d && y < d) { 632 | x += x0, y += y0; 633 | dwells[y * w + x] = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y); 634 | } 635 | } // mandelbrot_pixel_k 636 | 637 | /** computes the dwells for Mandelbrot image using dynamic parallelism; one block is launched per 638 | pixel 639 | @param dwells the output array 640 | @param w the width of the output image 641 | @param h the height of the output image 642 | @param cmin the complex value associated with the left-bottom corner of the image 643 | @param cmax the complex value associated with the right-top corner of the image 644 | @param x0 the starting x coordinate of the portion to compute 645 | @param y0 the starting y coordinate of the portion to compute 646 | @param d the size of the portion to compute (the portion is always a square) 647 | @param depth kernel invocation depth 648 | @remarks the algorithm reverts to per-pixel Mandelbrot evaluation once either maximum depth or 649 | minimum size is reached 650 | */ 651 | __global__ void mandelbrot_with_dp(int *dwells, int w, int h, int max_dwell, complex cmin, 652 | complex cmax, int x0, int y0, int d, int depth) 653 | { 654 | x0 += d * blockIdx.x, y0 += d * blockIdx.y; 655 | int comm_dwell = border_dwell(w, h, max_dwell, cmin, cmax, x0, y0, d); 656 | if (threadIdx.x == 0 && threadIdx.y == 0) { 657 | if (comm_dwell != DIFF_DWELL) { 658 | // uniform dwell, just fill 659 | dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY)); 660 | dwell_fill_k<<>>(dwells, w, x0, y0, d, comm_dwell); 661 | } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) { 662 | // subdivide recursively 663 | dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV); 664 | mandelbrot_with_dp<<>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, 665 | d / SUBDIV, depth + 1); 666 | } else { 667 | // leaf, per-pixel kernel 668 | dim3 bs(BSX, BSY), grid(divup(d, BSX), 
divup(d, BSY)); 669 | mandelbrot_pixel_k<<>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, d); 670 | } 671 | // check_error(x0, y0, d); 672 | } 673 | } // mandelbrot_with_dp 674 | 675 | /** computes the dwells for Mandelbrot image 676 | @param dwells the output array 677 | @param w the width of the output image 678 | @param h the height of the output image 679 | @param cmin the complex value associated with the left-bottom corner of the image 680 | @param cmax the complex value associated with the right-top corner of the image 681 | */ 682 | __global__ void mandelbrot_without_dp(int *dwells, int w, int h, int max_dwell, complex cmin, 683 | complex cmax) 684 | { 685 | // complex value to start iteration (c) 686 | int x = threadIdx.x + blockIdx.x * blockDim.x; 687 | int y = threadIdx.y + blockIdx.y * blockDim.y; 688 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y); 689 | dwells[y * w + x] = dwell; 690 | } 691 | 692 | __global__ void dwell_fill_k_null() { printf("111 \n"); } // dwell_fill_k 693 | 694 | __global__ void mandelbrot_with_dp_cpu_perf() { dwell_fill_k_null<<<1, 1>>>(); } 695 | 696 | __global__ void mandelbrot_without_dp_cpu_perf() { printf("222 \n"); } 697 | 698 | struct timeval t1, t2; 699 | 700 | static void BM_DynamicParallelism_WithDP() 701 | { 702 | static char env_str[] = "DOORBELL_LISTEN=ON"; 703 | putenv(env_str); 704 | 705 | // allocate memory 706 | int w = W; 707 | int h = H; 708 | int max_dwell = MAX_DWELL; 709 | 710 | size_t dwell_sz = w * h * sizeof(int); 711 | int *h_dwells, *d_dwells; 712 | mcMalloc((void **)&d_dwells, dwell_sz); 713 | h_dwells = (int *)malloc(dwell_sz); 714 | 715 | dim3 bs(BSX, BSY), grid(INIT_SUBDIV, INIT_SUBDIV); 716 | gettimeofday(&t1, NULL); 717 | mandelbrot_with_dp<<>>(d_dwells, w, h, max_dwell, complex(-1.5, -1), 718 | complex(0.5, 1), 0, 0, w / INIT_SUBDIV, 1); 719 | gettimeofday(&t2, NULL); 720 | mcDeviceSynchronize(); 721 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost); 722 | 723 | // free data 724 | mcFree(d_dwells); 725 | free(h_dwells); 726 | cout<<"BM_DynamicParallelism_WithDP over "<>>(d_dwells, w, h, max_dwell, complex(-1.5, -1), 747 | complex(0.5, 1)); 748 | gettimeofday(&t2, NULL); 749 | mcDeviceSynchronize(); 750 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost); 751 | 752 | // free data 753 | mcFree(d_dwells); 754 | free(h_dwells); 755 | cout<<"BM_DynamicParallelism_WithoutDP over"<>>(); 770 | 771 | mcDeviceSynchronize(); 772 | cout<<"BM_DynamicParallelism_WithDP_CPU_Perf over"<>>(); 782 | 783 | mcDeviceSynchronize(); 784 | cout<<"BM_DynamicParallelism_WithoutDP_CPU_Perf over"< 812 | #include 813 | #include 814 | #include 815 | 816 | using namespace std; 817 | 818 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N){ 819 | int i = threadIdx.x + blockDim.x * blockIdx.x; 820 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f; 821 | } 822 | 823 | int main(int argc,char *argv[]){ 824 | int n = atoi(argv[1]); 825 | cout << n << endl; 826 | 827 | float *A,*B,*C; 828 | mcMallocManaged(&A,n*sizeof(float)); 829 | mcMallocManaged(&B,n*sizeof(float)); 830 | mcMallocManaged(&C,n*sizeof(float)); 831 | 832 | for(int i=0;i>>(A,B,C,n); 840 | mcDeviceSynchronize(); 841 | for(int i=0;i 864 | 865 | ### Exercise 2 866 | 867 | ```c++ 868 | #include 869 | #include 870 | #include 871 | #include 872 | #include 873 | #include 874 | using namespace std; 875 | 876 | #define M 512 877 | #define K 512 878 | #define N 512 879 | 880 | void initial(float *array, int size) 881 | { 882 | for (int i = 0; i < size; 
i++) 883 | { 884 | array[i] = (float)(rand() % 10 + 1); 885 | } 886 | } 887 | 888 | //核函数(静态共享内存版) 889 | __global__ void matrixMultiplyShared(float *A, float *B, float *C, 890 | int numARows, int numAColumns, int numBRows, int numBColumns, int numCRows, int numCColumns) 891 | { 892 | //分配共享内存 893 | // __shared__ float sharedM[blockDim.y][blockDim.x]; 894 | // __shared__ float sharedN[blockDim.x][blockDim.y]; 895 | __shared__ float sharedM[16][32]; 896 | __shared__ float sharedN[16][32]; 897 | 898 | int bx = blockIdx.x; 899 | int by = blockIdx.y; 900 | int tx = threadIdx.x; 901 | int ty = threadIdx.y; 902 | 903 | int row = by * blockDim.y + ty; 904 | int col = bx * blockDim.x + tx; 905 | 906 | float Csub = 0.0; 907 | 908 | //将保存在全局内存中的矩阵M&N分块存放到共享内存中 909 | for (int i = 0; i < (int)(ceil((float)numAColumns / blockDim.x)); i++) 910 | { 911 | if (i*blockDim.x + tx < numAColumns && row < numARows) 912 | sharedM[ty][tx] = A[row*numAColumns + i * blockDim.x + tx]; 913 | else 914 | sharedM[ty][tx] = 0.0; 915 | 916 | if (i*blockDim.y + ty < numBRows && col < numBColumns)//分割N矩阵 917 | sharedN[ty][tx] = B[(i*blockDim.y + ty)*numBColumns + col]; 918 | else 919 | sharedN[ty][tx] = 0.0; 920 | __syncthreads(); 921 | 922 | for (int j = 0; j < blockDim.x; j++)//分块后的矩阵相乘 923 | Csub += sharedM[ty][j] * sharedN[j][tx]; 924 | __syncthreads(); 925 | } 926 | 927 | if (row < numCRows && col < numCColumns)//将计算后的矩阵块放到结果矩阵C中 928 | C[row*numCColumns + col] = Csub; 929 | } 930 | 931 | 932 | int main(int argc, char **argv) 933 | { 934 | int Axy = M * K; 935 | int Bxy = K * N; 936 | int Cxy = M * N; 937 | 938 | float *h_A, *h_B, *h_C; 939 | h_A = (float*)malloc(Axy * sizeof(float)); 940 | h_B = (float*)malloc(Bxy * sizeof(float)); 941 | 942 | h_C = (float*)malloc(Cxy * sizeof(float)); 943 | 944 | initial(h_A, Axy); 945 | initial(h_B, Bxy); 946 | 947 | float *d_A, *d_B, *d_C; 948 | mcMalloc((void**)&d_A, Axy * sizeof(float)); 949 | mcMalloc((void**)&d_B, Bxy * sizeof(float)); 950 | mcMalloc((void**)&d_C, Cxy * sizeof(float)); 951 | 952 | mcMemcpy(d_A, h_A, Axy * sizeof(float), mcMemcpyHostToDevice); 953 | mcMemcpy(d_B, h_B, Bxy * sizeof(float), mcMemcpyHostToDevice); 954 | 955 | int dimx = 32; 956 | int dimy = 16; 957 | dim3 block(dimx, dimy); 958 | dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y); 959 | struct timeval t1, t2; 960 | gettimeofday(&t1, NULL); 961 | matrixMultiplyShared <<< grid, block >>> (d_A, d_B, d_C, M, K, K, N, M, N); 962 | mcMemcpy(h_C, d_C, Cxy * sizeof(float), mcMemcpyDeviceToHost); 963 | gettimeofday(&t2, NULL); 964 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0; 965 | cout << "timeuse: " << timeuse << endl; 966 | mcFree(d_A); 967 | mcFree(d_B); 968 | mcFree(d_C); 969 | 970 | free(h_A); 971 | free(h_B); 972 | free(h_C); 973 | } 974 | 975 | ``` 976 | 977 | 978 | 979 | -------------------------------------------------------------------------------- /习题运行结果/nestedMandelbrot.cpp: -------------------------------------------------------------------------------- 1 | // #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | // #include 8 | // #include "dynamicParallelism.h" 9 | #include 10 | /** block size along */ 11 | #define BSX 64 12 | #define BSY 4 13 | /** maximum recursion depth */ 14 | #define MAX_DEPTH 4 15 | /** region below which do per-pixel */ 16 | #define MIN_SIZE 32 17 | /** subdivision factor along each axis */ 18 | #define SUBDIV 4 19 | /** subdivision when launched from host */ 20 | #define INIT_SUBDIV 
32 21 | #define H (16 * 1024) 22 | #define W (16 * 1024) 23 | #define MAX_DWELL 512 24 | using namespace std; 25 | 26 | 27 | 28 | /** a useful function to compute the number of threads */ 29 | int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); } 30 | 31 | /** a simple complex type */ 32 | struct complex { 33 | __host__ __device__ complex(float re, float im = 0) 34 | { 35 | this->re = re; 36 | this->im = im; 37 | } 38 | /** real and imaginary part */ 39 | float re, im; 40 | }; // struct complex 41 | 42 | // operator overloads for complex numbers 43 | inline __host__ __device__ complex operator+(const complex &a, const complex &b) 44 | { 45 | return complex(a.re + b.re, a.im + b.im); 46 | } 47 | inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); } 48 | inline __host__ __device__ complex operator-(const complex &a, const complex &b) 49 | { 50 | return complex(a.re - b.re, a.im - b.im); 51 | } 52 | inline __host__ __device__ complex operator*(const complex &a, const complex &b) 53 | { 54 | return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im); 55 | } 56 | inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; } 57 | inline __host__ __device__ complex operator/(const complex &a, const complex &b) 58 | { 59 | float invabs2 = 1 / abs2(b); 60 | return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2); 61 | } // operator/ 62 | /** find the dwell for the pixel */ 63 | __device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y) 64 | { 65 | complex dc = cmax - cmin; 66 | float fx = (float)x / w, fy = (float)y / h; 67 | complex c = cmin + complex(fx * dc.re, fy * dc.im); 68 | int dwell = 0; 69 | complex z = c; 70 | while (dwell < max_dwell && abs2(z) < 2 * 2) { 71 | z = z * z + c; 72 | dwell++; 73 | } 74 | return dwell; 75 | } // pixel_dwell 76 | 77 | /** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral 78 | element, -1 = dwells are different */ 79 | // #define NEUT_DWELL (MAX_DWELL + 1) 80 | #define DIFF_DWELL (-1) 81 | __device__ int same_dwell(int d1, int d2, int max_dwell) 82 | { 83 | if (d1 == d2) 84 | return d1; 85 | else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1)) 86 | return min(d1, d2); 87 | else 88 | return DIFF_DWELL; 89 | } // same_dwell 90 | 91 | /** evaluates the common border dwell, if it exists */ 92 | __device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0, 93 | int d) 94 | { 95 | // check whether all boundary pixels have the same dwell 96 | int tid = threadIdx.y * blockDim.x + threadIdx.x; 97 | int bs = blockDim.x * blockDim.y; 98 | int comm_dwell = (max_dwell + 1); 99 | // for all boundary pixels, distributed across threads 100 | for (int r = tid; r < d; r += bs) { 101 | // for each boundary: b = 0 is east, then counter-clockwise 102 | for (int b = 0; b < 4; b++) { 103 | int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0); 104 | int y = b % 2 == 0 ? y0 + r : (b == 1 ? 


/** a useful function to compute the number of threads */
int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); }

/** a simple complex type */
struct complex {
    __host__ __device__ complex(float re, float im = 0)
    {
        this->re = re;
        this->im = im;
    }
    /** real and imaginary part */
    float re, im;
}; // struct complex

// operator overloads for complex numbers
inline __host__ __device__ complex operator+(const complex &a, const complex &b)
{
    return complex(a.re + b.re, a.im + b.im);
}
inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); }
inline __host__ __device__ complex operator-(const complex &a, const complex &b)
{
    return complex(a.re - b.re, a.im - b.im);
}
inline __host__ __device__ complex operator*(const complex &a, const complex &b)
{
    return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im);
}
inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; }
inline __host__ __device__ complex operator/(const complex &a, const complex &b)
{
    float invabs2 = 1 / abs2(b);
    return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2);
} // operator/

/** find the dwell for the pixel */
__device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y)
{
    complex dc = cmax - cmin;
    float fx = (float)x / w, fy = (float)y / h;
    complex c = cmin + complex(fx * dc.re, fy * dc.im);
    int dwell = 0;
    complex z = c;
    while (dwell < max_dwell && abs2(z) < 2 * 2) {
        z = z * z + c;
        dwell++;
    }
    return dwell;
} // pixel_dwell

/** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral
    element, -1 = dwells are different */
// #define NEUT_DWELL (MAX_DWELL + 1)
#define DIFF_DWELL (-1)
__device__ int same_dwell(int d1, int d2, int max_dwell)
{
    if (d1 == d2)
        return d1;
    else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1))
        return min(d1, d2);
    else
        return DIFF_DWELL;
} // same_dwell

/** evaluates the common border dwell, if it exists */
__device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0,
                            int d)
{
    // check whether all boundary pixels have the same dwell
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int bs = blockDim.x * blockDim.y;
    int comm_dwell = (max_dwell + 1);
    // for all boundary pixels, distributed across threads
    for (int r = tid; r < d; r += bs) {
        // for each boundary: b = 0 is east, then counter-clockwise
        for (int b = 0; b < 4; b++) {
            int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0);
            int y = b % 2 == 0 ? y0 + r : (b == 1 ? y0 + d - 1 : y0);
            int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
            comm_dwell = same_dwell(comm_dwell, dwell, max_dwell);
        }
    } // for all boundary pixels
    // reduce across threads in the block
    __shared__ int ldwells[BSX * BSY];
    int nt = min(d, BSX * BSY);
    if (tid < nt)
        ldwells[tid] = comm_dwell;
    __syncthreads();
    for (; nt > 1; nt /= 2) {
        if (tid < nt / 2)
            ldwells[tid] = same_dwell(ldwells[tid], ldwells[tid + nt / 2], max_dwell);
        __syncthreads();
    }
    return ldwells[0];
} // border_dwell
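// border_dwell() above is the heart of the subdivision strategy: because the
// Mandelbrot set and its dwell level sets are connected, a tile whose border has
// one uniform dwell has that same dwell everywhere inside (the Mariani-Silver
// idea). mandelbrot_with_dp() below uses this to pick one of three actions per
// tile: fill it in one shot (dwell_fill_k), split it again with device-side child
// launches, or fall back to plain per-pixel evaluation (mandelbrot_pixel_k).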

/** the kernel to fill the image region with a specific dwell value */
__global__ void dwell_fill_k(int *dwells, int w, int x0, int y0, int d, int dwell)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < d && y < d) {
        x += x0, y += y0;
        dwells[y * w + x] = dwell;
    }
} // dwell_fill_k

/**
 * the kernel to fill in per-pixel values of the portion of the Mandelbrot set
 */
__global__ void mandelbrot_pixel_k(int *dwells, int w, int h, int max_dwell, complex cmin,
                                   complex cmax, int x0, int y0, int d)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    if (x < d && y < d) {
        x += x0, y += y0;
        dwells[y * w + x] = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
    }
} // mandelbrot_pixel_k

/** computes the dwells for the Mandelbrot image using dynamic parallelism; one block is launched
    per sub-region (tile)
    @param dwells the output array
    @param w the width of the output image
    @param h the height of the output image
    @param cmin the complex value associated with the left-bottom corner of the image
    @param cmax the complex value associated with the right-top corner of the image
    @param x0 the starting x coordinate of the portion to compute
    @param y0 the starting y coordinate of the portion to compute
    @param d the size of the portion to compute (the portion is always a square)
    @param depth kernel invocation depth
    @remarks the algorithm reverts to per-pixel Mandelbrot evaluation once either maximum depth or
    minimum size is reached
 */
__global__ void mandelbrot_with_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
                                   complex cmax, int x0, int y0, int d, int depth)
{
    x0 += d * blockIdx.x, y0 += d * blockIdx.y;
    int comm_dwell = border_dwell(w, h, max_dwell, cmin, cmax, x0, y0, d);
    // only one thread per block launches the child grids
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        if (comm_dwell != DIFF_DWELL) {
            // uniform dwell, just fill
            dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
            dwell_fill_k<<<grid, bs>>>(dwells, w, x0, y0, d, comm_dwell);
        } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
            // subdivide recursively
            dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
            mandelbrot_with_dp<<<grid, bs>>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0,
                                             d / SUBDIV, depth + 1);
        } else {
            // leaf, per-pixel kernel
            dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
            mandelbrot_pixel_k<<<grid, bs>>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, d);
        }
        // check_error(x0, y0, d);
    }
} // mandelbrot_with_dp

/** computes the dwells for the Mandelbrot image
    @param dwells the output array
    @param w the width of the output image
    @param h the height of the output image
    @param cmin the complex value associated with the left-bottom corner of the image
    @param cmax the complex value associated with the right-top corner of the image
 */
__global__ void mandelbrot_without_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
                                      complex cmax)
{
    // complex value to start iteration (c)
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
    dwells[y * w + x] = dwell;
}

__global__ void dwell_fill_k_null() { printf("111 \n"); } // dwell_fill_k_null

__global__ void mandelbrot_with_dp_cpu_perf() { dwell_fill_k_null<<<1, 1>>>(); }

__global__ void mandelbrot_without_dp_cpu_perf() { printf("222 \n"); }

struct timeval t1, t2;

static void BM_DynamicParallelism_WithDP()
{
    static char env_str[] = "DOORBELL_LISTEN=ON";
    putenv(env_str);

    // allocate memory
    int w = W;
    int h = H;
    int max_dwell = MAX_DWELL;

    size_t dwell_sz = w * h * sizeof(int);
    int *h_dwells, *d_dwells;
    mcMalloc((void **)&d_dwells, dwell_sz);
    h_dwells = (int *)malloc(dwell_sz);

    dim3 bs(BSX, BSY), grid(INIT_SUBDIV, INIT_SUBDIV);
    gettimeofday(&t1, NULL);
    mandelbrot_with_dp<<<grid, bs>>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
                                     complex(0.5, 1), 0, 0, w / INIT_SUBDIV, 1);
    // note: t2 is taken before mcDeviceSynchronize(), so t2 - t1 measures the
    // asynchronous launch rather than the kernel's execution time
    gettimeofday(&t2, NULL);
    mcDeviceSynchronize();
    mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);

    // free data
    mcFree(d_dwells);
    free(h_dwells);
    cout << "BM_DynamicParallelism_WithDP over " << endl;
}

// same setup as BM_DynamicParallelism_WithDP, but every pixel is computed by one flat launch
static void BM_DynamicParallelism_WithoutDP()
{
    // allocate memory
    int w = W;
    int h = H;
    int max_dwell = MAX_DWELL;

    size_t dwell_sz = w * h * sizeof(int);
    int *h_dwells, *d_dwells;
    mcMalloc((void **)&d_dwells, dwell_sz);
    h_dwells = (int *)malloc(dwell_sz);

    dim3 bs(BSX, BSY), grid(divup(w, bs.x), divup(h, bs.y));
    gettimeofday(&t1, NULL);
    mandelbrot_without_dp<<<grid, bs>>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
                                        complex(0.5, 1));
    gettimeofday(&t2, NULL);
    mcDeviceSynchronize();
    mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);

    // free data
    mcFree(d_dwells);
    free(h_dwells);
    cout << "BM_DynamicParallelism_WithoutDP over" << endl;
}

// host-side cost of launching a kernel that itself launches a child kernel
static void BM_DynamicParallelism_WithDP_CPU_Perf()
{
    mandelbrot_with_dp_cpu_perf<<<1, 1>>>();

    mcDeviceSynchronize();
    cout << "BM_DynamicParallelism_WithDP_CPU_Perf over" << endl;
}

// host-side cost of launching a plain kernel, for comparison
static void BM_DynamicParallelism_WithoutDP_CPU_Perf()
{
    mandelbrot_without_dp_cpu_perf<<<1, 1>>>();

    mcDeviceSynchronize();
    cout << "BM_DynamicParallelism_WithoutDP_CPU_Perf over" << endl;
}

int main()
{
    // run the four variants back to back
    BM_DynamicParallelism_WithDP();
    BM_DynamicParallelism_WithoutDP();
    BM_DynamicParallelism_WithDP_CPU_Perf();
    BM_DynamicParallelism_WithoutDP_CPU_Perf();
}

--------------------------------------------------------------------------------
/示例代码运行截图/示例代码运行截图.md:
--------------------------------------------------------------------------------
## chapter 2

### 2-1

## chapter 3

### 3-2

## chapter 4

### 4-1

## chapter 5

### 5-1

### 5-3

### 5-5

## chapter 6

### 6-1

### 6-2

### 6-3

### 6-4

### 6-5

### 6-6

### 6-7

### 6-8

### 6-9

### 6-10

### 6-11

### 6-12

### 6-30

## chapter 7

### 7-4

### 7-5

## chapter 8

### 8-1

### 8-2

--------------------------------------------------------------------------------