├── .gitattributes
├── MXMACA编程的内存层次模型.png
├── README.md
├── chapter10
│   ├── mcBlas.c
│   ├── mcDNN.cpp
│   ├── mcblas命令.txt
│   └── usingThrust.cpp
├── chapter11
│   ├── Makefile
│   ├── simple2DFD.cpp
│   └── vectorAddMultiGpus.cpp
├── chapter2
│   └── helloFromGpu.c
├── chapter3
│   ├── cpuVectorAdd.cpp
│   └── gpuVectorAdd.cpp
├── chapter4
│   └── grammar.cpp
├── chapter5
│   ├── Cooperative_Groups.cpp
│   ├── assignKernel.cpp
│   ├── information.cpp
│   └── nestedHelloWorld.cpp
├── chapter6
│   ├── AplusB_with_managed.cpp
│   ├── AplusB_with_unified_addressing.cpp
│   ├── AplusB_without_unified_addressing.cpp
│   ├── BC_addKernel.cpp
│   ├── NBC_addKernel2.cpp
│   ├── __shfl_down_syncExample.cpp
│   ├── __shfl_syncExample.cpp
│   ├── __shfl_up_syncExample.cpp
│   ├── __shfl_xor_syncExample.cpp
│   ├── checkGlobalVariable.cpp
│   ├── information.cpp
│   ├── vectorAddUnifiedVirtualAddressing.cpp
│   └── vectorAddZerocopy.cpp
├── chapter7
│   ├── Makefile.txt
│   ├── my_program
│   │   ├── CMakeLists.txt
│   │   ├── include
│   │   │   ├── a.h
│   │   │   └── b.h
│   │   ├── main.cpp
│   │   └── src
│   │       ├── a.cpp
│   │       └── b.cpp
│   ├── trigger_memory_violation.cpp
│   ├── trigger_memory_violation_repaired.cpp
│   └── vectorAdd.cpp
├── chapter8
│   ├── myKernel.cpp
│   └── stream_parallel_execution.cpp
├── chapter9
│   ├── shortKernelsAsyncLaunch.cpp
│   ├── shortKernelsGraphLaunch.cpp
│   └── shortKernelsSyncLaunch.cpp
├── common
│   └── common.h
├── 习题运行结果
│   ├── 3.1.png
│   ├── 3.2.png
│   ├── 5.2.9.1运行结果
│   │   ├── 1.png
│   │   ├── 2.png
│   │   └── 3.png
│   ├── 5.2.9.2运行结果
│   │   ├── 1.png
│   │   ├── 2.png
│   │   └── 3.png
│   ├── T4运行结果.png
│   ├── answer.md
│   ├── nestedMandelbrot.cpp
│   └── 统一内存寻址运行结果.png
├── 开源的完整示例代码表.md
└── 示例代码运行截图
    ├── chapter2
    │   └── 2-1.png
    ├── chapter3
    │   └── 3-2.png
    ├── chapter4
    │   └── 4-1.png
    ├── chapter5
    │   ├── 5-1.png
    │   ├── 5-3.png
    │   └── 5-5.png
    ├── chapter6
    │   ├── 6-1-1.png
    │   ├── 6-1-2.png
    │   ├── 6-10-1.png
    │   ├── 6-10-2.png
    │   ├── 6-11-1.png
    │   ├── 6-11-2.png
    │   ├── 6-12-1.png
    │   ├── 6-12-2.png
    │   ├── 6-2-1.png
    │   ├── 6-2-2.png
    │   ├── 6-3-1.png
    │   ├── 6-3-2.png
    │   ├── 6-30.png
    │   ├── 6-4.png
    │   ├── 6-5.png
    │   ├── 6-6.png
    │   ├── 6-7.png
    │   ├── 6-8.png
    │   └── 6-9.png
    ├── chapter7
    │   ├── 7-4.png
    │   └── 7-5.png
    ├── chapter8
    │   ├── 8-1.png
    │   └── 8-2.png
    └── 示例代码运行截图.md
/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 |
--------------------------------------------------------------------------------
/MXMACA编程的内存层次模型.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/MXMACA编程的内存层次模型.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # getting-started-guide-and-summary-of-MXMACA
2 |
3 | ## CPU VS GPU
4 |
5 | CPU,即中央处理器,由数百万个晶体管构成,可以具有多个处理核心,是计算机系统的运算和控制核心。CPU涉及到通用计算,适合少量的复杂计算。CPU虽然处理核心远没有GPU多,但是可以将核心集中在单个任务上并快速完成工作。
6 |
7 | GPU,即图形处理器,是由许多更小、更专业的核心组成的处理器,适合大量的简单运算。GPU最初是用来加速3D渲染任务的,但是随着时间的推移,这些固定功能的引擎变得更加可编程、更加灵活。虽然图形和日益逼真的视觉效果仍然是GPU的主要功能,但GPU也已发展成为更通用的并行处理器,可以处理越来越多的应用程序。
8 |
9 | | CPU | GPU |
10 | | ---------------------------------- | -------------------------------- |
11 | | 通用组件,负责计算机的主要处理功能 | 专用组件,主要负责图形和视频渲染 |
12 | | 核心数:2-64 | 核心数:数千 |
13 | | 串行运行进程 | 并行运行进程 |
14 | | 更适合处理一项大任务 | 更适合处理多个较小的任务 |
15 |
16 |
17 |
18 | ### 加速深度学习和人工智能
19 |
20 | GPU或其他加速器非常适合用大量特定数据(如二维图像)对神经网络进行深度学习训练。
21 |
22 | GPU加速方法已经适用于深度学习算法,可以显著提升算法性能。
23 |
24 |
25 |
26 | ## 基本概念的解释
27 |
28 | 内存部分的解释详见MXMACA内存模型和管理。
29 |
30 | ### 主机端(host)
31 |
32 | CPU所在的位置称为主机端。
33 |
34 | 可以简单理解为CPU。
35 |
36 | ### 设备端(device)
37 |
38 | GPU所在的位置称为设备端。
39 |
40 | 可以简单理解为GPU。
41 |
42 | 主机和设备之间通过PCIe总线连接,用于传递指令和数据,让CPU和GPU一起来协同工作。
43 |
44 | ### 加速处理器(Accelerated Processors,AP)
45 |
46 | 每个AP都能支持数千个GPU线程并发执行。
47 |
48 | 负责执行具体的指令和任务。
49 |
50 | ### 核函数(kernel)
51 |
52 | 核函数在设备端执行,需要为一个线程规定所进行的计算和访问的数据。当核函数被调用时,许多不同的MXMACA线程并行执行同一计算任务。
53 |
54 | 在设备侧(GPU)执行,可以在设备侧(GPU)和主机侧(CPU)被调用。
55 |
56 | ### 线程(thread)
57 |
58 | 一般通过GPU的一个核进行处理。
59 |
60 | 每个线程是Kernel的单个执行实例。在一个block中的所有线程可以共享一些资源,并能够相互通信。
61 |
62 | ### 线程束(wave)
63 |
64 | GPU执行程序时的调度单位。
65 |
66 | 64个线程组成一个线程束,线程束中每个线程在不同数据集上同时执行相同的指令。
67 |
68 | ### 线程块(thread block)
69 |
70 | 由多个线程组成。可以是一维、二维或三维的。
71 |
72 | 各block是并行执行的。
73 |
74 | 同一个线程块内的线程可以相互协作,不同线程块内的线程不能协作。
75 |
76 | 当启动一个核函数网格时,它的GPU线程会被分配到可用的AP上执行。一旦线程块被调度到一个AP上,其中的线程将只在该指定的AP上并发执行。
77 |
78 | 多个线程块根据AP资源的可用性进行调度,可能会被分配到同一个AP上或不同的AP上。
79 |
80 | ### 线程网格(grid)
81 |
82 | 多个线程块可以构成线程网格。
83 |
84 | 和核函数(kernel)的关系:启动核函数(kernel)时,会定义一个线程网格(grid)。
85 |
86 | 网格可以是一维的、二维的或三维的。
87 |
88 | ### 流(stream)
89 |
90 | 相当于是GPU上的任务队列。
91 |
92 | 同一个stream的任务是严格保证顺序的,上一个命令执行完成才会执行下一个命令。
93 |
94 | 不同stream的命令不保证任何执行顺序。部分优化技巧需要用到多个stream才能实现。如在执行kernel的同时进行数据拷贝,需要一个stream执行kernel,另一个stream进行数据拷贝。
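
下面用两个stream演示"一个stream拷贝数据、另一个stream执行核函数"式的重叠(仅为示意用的最小草图:假设MXMACA运行时头文件名为`mc_runtime.h`,省略错误检查;所用接口名均来自本仓库后续章节的示例):

```cpp
#include <mc_runtime.h>   // 假设的MXMACA运行时头文件名,请以实际SDK为准

__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h0, *h1, *d0, *d1;
    mcMallocHost((void **)&h0, n * sizeof(float));   // 页锁定内存,异步拷贝需要
    mcMallocHost((void **)&h1, n * sizeof(float));
    mcMalloc((void **)&d0, n * sizeof(float));
    mcMalloc((void **)&d1, n * sizeof(float));
    for (int i = 0; i < n; i++) { h0[i] = 1.0f; h1[i] = 2.0f; }

    mcStream_t s0, s1;
    mcStreamCreate(&s0);
    mcStreamCreate(&s1);

    // 同一stream内严格按提交顺序执行;两个stream之间的拷贝与核函数可以相互重叠
    mcMemcpyAsync(d0, h0, n * sizeof(float), mcMemcpyHostToDevice, s0);
    scaleKernel<<<(n + 255) / 256, 256, 0, s0>>>(d0, n);
    mcMemcpyAsync(d1, h1, n * sizeof(float), mcMemcpyHostToDevice, s1);
    scaleKernel<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);

    mcStreamSynchronize(s0);
    mcStreamSynchronize(s1);

    mcStreamDestroy(s0);  mcStreamDestroy(s1);
    mcFree(d0);  mcFree(d1);
    mcFreeHost(h0);  mcFreeHost(h1);
    return 0;
}
```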
95 |
96 |
97 |
98 | ## 基本编程模型
99 |
100 | 1. 用户可以通过调用动态运行时库,申请、释放显存,并在内存和显存间进行数据拷贝。
101 |
102 | 2. 典型的MXMACA程序实现流程遵循以下模式:
103 |
104 | 1. 把数据从CPU内存拷贝到GPU内存;
105 | 2. 调用核函数对GPU内存的数据进行处理;
106 | 3. 将数据从GPU内存传送回CPU内存。
107 |
108 | 3. 用户可以编写kernel函数,在主机侧调用kernel函数,调用将创建GPU线程。
109 |
110 | 1. 用户可以在Kernel Launch时分别指定网格中的线程块数量、线程块中包含的线程数量。当用户指定的线程数量超过64,这些线程会被拆分成多个线程束,并在同一个AP上执行,这些线程束可能并发执行,也可能串行执行。
111 | 2. 每个GPU线程都会完整执行一次kernel函数,kernel函数可以对显存进行读、写等操作,也可以调用设备侧函数对显存进行读、写等操作。不同的GPU线程可以通过内置变量进行区分,只需要通过读取内置变量,分别找到线程块的位置、线程的位置,就可以给每一个线程唯一地标识ThreadIdx(可以参考后文,相关的几个内置变量)。
112 |
113 | 4. 相关的几个内置变量
114 |
115 | 1. `threadIdx`,获取线程`thread`的ID索引;如果线程是一维的那么就取`threadIdx.x`,二维的还可以多取到一个值`threadIdx.y`,以此类推到三维`threadIdx.z`。可以在一个线程块中唯一的标识线程。
116 | 2. `blockIdx`,线程块的ID索引;同样有`blockIdx.x`,`blockIdx.y`,`blockIdx.z`。可以在一个网格中唯一标识线程块。
117 | 3. `blockDim`,线程块的维度,同样有`blockDim.x`,`blockDim.y`,`blockDim.z`。表示线程块在每个维度上包含的线程数量。
118 | 1. 对于一维的`block`,线程的`threadID=threadIdx.x`。
119 | 2. 对于大小为`(blockDim.x, blockDim.y)`的 二维`block`,线程的`threadID=threadIdx.x+threadIdx.y*blockDim.x`。
120 | 3. 对于大小为`(blockDim.x, blockDim.y, blockDim.z)`的 三维 `block`,线程的`threadID=threadIdx.x+threadIdx.y*blockDim.x+threadIdx.z*blockDim.x*blockDim.y`。
121 | 4. `gridDim`,线程格的维度,同样有`gridDim.x`,`gridDim.y`,`gridDim.z`。表示网格在每个维度上包含的线程块数量。
122 |
123 | 5. 常用的GPU函数
124 |
125 | 1. `mcMalloc()`
126 |
127 | 负责内存分配。类似于C语言中的`malloc`,不过mcMalloc是在GPU上分配显存,返回device指针。
128 |
129 | 2. `mcMemcpy()`
130 |
131 | 负责内存复制。
132 |
133 | 可以把数据从host搬到device,再从device搬回host。
134 |
135 | 3. `mcFree()`
136 |
137 | 释放显存的指针。
138 |
139 | (可以参考示例代码)
140 |
141 | ## 基本硬件架构及其在Kernel执行中的作用
142 |
143 | ## MXMACA内存模型和管理
144 |
145 | ### MXMACA内存模型
146 |
147 | MXMACA的内存是分层次的,每种类型的内存空间有不同的作用域、生命周期和缓存行为。在一个核函数中,每个线程有自己的私有内存;每个线程块有自己工作组的共享内存,对块内的所有线程可见;一个线程网格中的所有线程都可以访问全局内存和常量内存。可以参考下图:
148 |
149 | ![MXMACA编程的内存层次模型](MXMACA编程的内存层次模型.png)
150 |
151 | 书里提到了它们的初始化方式,这里主要介绍它们的用途、局限性。
152 |
153 | #### 可编程存储器、不可编程存储器
154 |
155 | 根据存储器能否被程序员控制,可分为:可编程存储器、不可编程存储器。
156 |
157 | 可编程存储器:需要显式控制哪些数据放在可编程内存中。包括全局存储、常量存储、共享存储、本地存储和寄存器等。
158 |
159 | 不可编程存储器:不能决定哪些数据放在这些存储器中,也不能决定数据在存储器中的位置。包括一级缓存、二级缓存等。
160 |
161 | #### GPU寄存器
162 |
163 | 寄存器延迟极低,对于每个线程是私有的,与核函数的生命周期相同。
164 |
165 | 寄存器是稀有资源,使用过多的寄存器也会影响到性能,可以添加辅助信息控制限定寄存器数量。
166 |
167 | 书中也提到了一些方式,可以让一个线程束内的两个线程相互访问对方的寄存器,而不需要访问全局内存或者共享内存,延迟很低且不消耗额外内存。
168 |
169 | #### GPU私有内存
170 |
171 | 私有内存是每个线程私有的。
172 |
173 | 私有内存在物理上与全局内存在同一块储存区域,因此具有较高的延迟和低带宽。
174 |
175 | #### GPU线程块共享内存
176 |
177 | 共享内存的地址空间被线程块中所有的线程共享。它的内容和创建时所在的线程块具有相同生命周期。
178 |
179 | 共享内存让同一个线程块中的线程能够相互协作,便于重用片上数据,可以降低核函数所需的全局内存带宽。
180 |
181 | 相较于全局内存,共享内存延迟更低,带宽更高。
182 |
183 | 适合在数据需要重复利用、全局内存合并或线程之间有共享数据时使用共享内存。
184 |
185 | 不能过度使用,否则会限制活跃线程束的数量。
186 |
187 | 书里也提到了共享内存的分配、共享内存的地址映射方式、bank冲突以及最小化bank冲突的方法。bank冲突时,多个访问操作会被序列化,降低内存带宽,就没有什么并行的意义了。
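
下面以块内求和为例,示意共享内存的声明、写入与同步(与本仓库chapter6/NBC_addKernel2.cpp做法相同的草图):

```cpp
#define THREADS_PER_BLOCK 256

// 每个线程块用共享内存做折半归约,输出一个部分和
__global__ void blockSum(const int *in, int *out) {
    __shared__ int cache[THREADS_PER_BLOCK];   // 块内所有线程共享,生命周期与线程块相同
    int tid = threadIdx.x;

    cache[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // 等待全块写入完成再读取

    // 折半归约:数据留在片上被反复使用,避免重复访问全局内存
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = cache[0];  // 每个线程块写回一个部分和
}
// 调用示意:blockSum<<<gridSize, THREADS_PER_BLOCK>>>(d_in, d_out);
```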
188 |
189 | #### GPU常量内存
190 |
191 | 常量内存在设备内存中,并在每个AP专用的常量缓存中缓存。
192 |
193 | 如果线程束中所有线程都从相同内存读取数据,常量内存表现最好,因为每从一个常量内存中读取一次数据,都会广播给线程束里的所有线程。
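
下面是常量内存的典型用法示意(与本仓库chapter11/simple2DFD.cpp中差分系数的写法相同的草图):

```cpp
// 常量内存中的系数表:当线程束内所有线程读同一个元素时,一次读取即可广播给整个线程束
__device__ __constant__ float coef[5];

void setupCoef(void) {
    const float h_coef[5] = {1.0f, 0.5f, 0.25f, 0.125f, 0.0625f};
    mcMemcpyToSymbol(coef, h_coef, 5 * sizeof(float));   // 主机侧写入常量内存
}

__global__ void applyCoef(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= coef[0];   // 同一线程束的所有线程读取相同地址,效率最高
}
```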
194 |
195 | #### GPU全局内存
196 |
197 | 全局内存是GPU中容量最大、延迟最高、也最常使用的内存。
198 |
199 | 可以在任何AP上被访问,并且贯穿应用程序的整个生命周期。
200 |
201 | 优化时需要注意对齐内存访问与合并内存访问。
202 |
203 | ## MXMACA程序优化
204 |
205 | ### 性能优化的目标
206 |
207 | 1. 提高程序执行效率,减少运行时间,提高程序的处理能力和吞吐量。
208 | 2. 优化资源利用率,避免资源的浪费和滥用。
209 | 3. 改善程序的响应时间。
210 |
211 | ### 程序性能评估
212 |
213 | #### 精度
214 |
215 | GPU 的单精度计算性能要远远超过双精度计算性能,需要在速度与精度之间选取合适的平衡。
216 |
217 | #### 延迟
218 |
219 | #### 计算量
220 |
221 | 如果计算量很小,或者串行部分占用时间较长,并行部分占用时间较短,都不适合用GPU进行并行计算。
222 |
223 | ### 优化的主要策略
224 |
225 | #### 硬件性能优化
226 |
227 | #### 并行性优化
228 |
229 | 可以通过设置线程块的大小、每个线程块的共享内存使用量、每个线程使用的寄存器数量,尽量提升occupancy。
230 |
231 | #### 内存访问优化
232 |
233 | ##### 提高`Global Memory`访存效率
234 |
235 | 对齐内存访问:让一个内存事务的首个访问地址尽量是缓存粒度(32或128字节)的整数倍,减少带宽浪费。
236 |
237 | 合并内存访问:尽量让一个线程束内各线程访问的内存落在同一段连续的内存区域(如同一条缓存行)内,使多次访问可以合并成少量内存事务。
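
下面用两个核函数对比合并访问与跨步访问(示意草图):前者相邻线程访问相邻元素,可合并为少量内存事务;后者访问被拆散,带宽利用率低。

```cpp
// 合并访问:线程i访问元素i,同一线程束的访问落在连续内存上
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// 跨步访问:相邻线程访问相距stride的元素,难以合并
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```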
238 |
239 | ##### 提高`Shared Memory`访存效率
240 |
241 | 若`wave`中不同的线程访问同一个`bank`中的不同地址,则会发生bank冲突(bank conflict)。发生bank冲突时,`wave`的一条访存指令会被拆分为n条互不冲突的访存请求,降低`shared memory`的有效带宽,所以需要尽量避免bank冲突。
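
一个常见的规避手段是给二维共享内存数组多加一列填充,使同一列的元素错开到不同的bank。下面以单个线程块内的矩阵转置为例(示意草图:bank数量和每块最大线程数以实际硬件为准,这里假设每块可用32×32个线程):

```cpp
#define TILE 32

// 转置一个 TILE x TILE 的矩阵,调用示意:transposeTile<<<1, dim3(TILE, TILE)>>>(d_in, d_out);
__global__ void transposeTile(const float *in, float *out) {
    // 第二维多申请1个元素(TILE+1),让同一列的元素分布到不同bank
    __shared__ float tile[TILE][TILE + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];   // 按行写入共享内存,天然无冲突
    __syncthreads();

    // 按列读取:同一线程束的线程访问同一列的不同行,
    // 若没有 +1 填充,这些访问会命中同一个bank而被串行化
    out[y * TILE + x] = tile[x][y];
}
```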
242 |
243 | #### 算法优化
244 |
245 | 1. 如何将问题分解成块、束、线程
246 | 2. 线程如何访问数据以及产生什么样的内存模式
247 | 3. 数据的重用性
248 | 4. 算法总共要执行多少工作,与串行化的方法之间的差异
249 |
250 | #### 算数运算密度优化
251 |
252 | 1. 超越函数操作:可以查阅平方根等超越函数和加速函数,以及设备接口函数
253 | 2. 近似:可以在速度和精度之间进行折衷
254 | 3. 查找表:用空间换时间。适合GPU高占用率的情况,也要考虑到计算的复杂度,计算复杂度低时,计算速度可能大大快于低GPU占用下的内存查找方式。
255 |
256 | #### 编译器优化
257 |
258 | 1. 展开循环
259 | 2. 常量折叠:在编译期直接求出常量表达式的值,从而简化运行时计算
260 | 3. 常量传播:将表达式中的变量替换为已知常数
261 | 4. 公共子表达式消除:将该类公共子表达式的值临时记录,并传播到子表达式使用的语句
262 | 5. 目标相关优化:用复杂指令取代简单通用的指令组合,使程序获得更高的性能
263 |
264 | #### 其他
265 |
266 | 1. 尽量使用数组的结构体(SoA,结构体的成员是数组),而不是结构体的数组(AoS,数组的每个元素都是结构体)。
267 | 2. 尽量少用条件分支。CPU具有分支预测功能,而GPU没有,执行if、else语句的效率非常低:束内每个线程在每个分支上都要"经过"一遍(但不一定真正执行);只有当所有线程都不需要执行某一分支时,该分支才能被整体跳过。只要有一个线程需要执行某个分支,其他线程即使不需要执行,也要等它执行完才能开始自己的计算任务,而且不同的分支是串行执行的,因此要减少分支的数目(可参考本列表之后的示意代码)。
268 | 1. 通过计算,去掉分支(可以参考书中8.3.4相关内容)。
269 | 2. 通过查找表去掉分支。
270 | 3. 尽量使`wave`块完美对齐,让一个`wave`里的所有线程都满足条件或者都不满足条件。
271 | 3. 引入一些指令级并行操作,尽早结束线程块中最后仍在执行的线程束,使整个线程块尽快空闲出来,从而被另一个包含更多活跃线程束的线程块替换。
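
下面示意"用计算去掉分支"的思路(草图,与本仓库chapter5/assignKernel.cpp的赋值逻辑对应):两个核函数结果相同,但后者不会在线程束内产生分支。

```cpp
// 有分支的写法:同一线程束内奇偶线程走不同分支,两个分支会被串行执行
__global__ void withBranch(int *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) data[tid] = 20;
    else              data[tid] = 10;
}

// 用计算代替分支:所有线程执行同一条指令流
__global__ void withoutBranch(int *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = 10 + 10 * (1 - tid % 2);   // 偶数线程得到20,奇数线程得到10
}
```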
272 |
273 | ### 优化性能需要考虑的指标
274 |
275 | 1. 最大化利用率
276 | 2. 最大化存储吞吐量
277 | 3. 最大化指令吞吐量
278 | 4. 最小化内存抖动
279 | 5. 时间消耗(整体运行所需时间、GPU和CPU之间的传输所需时间、核函数运行所需时间)
280 |
281 | ## MXMACA生态的人工智能和计算加速库
282 |
283 | ### mcBLAS
284 |
285 | 主要用于各类基础线性代数(BLAS)运算。
286 |
287 | `Level-1 Functions`定义了向量与向量、向量与标量之间的运算,还为多种数据类型(单精度浮点实数、单精度浮点复数、双精度浮点实数、双精度浮点复数)定义了专用的接口。
288 |
289 | `Level-2 Functions`定义了矩阵与向量之间的运算。
290 |
291 | `Level-3 Functions`定义了矩阵与矩阵之间的运算。是求解器和深度神经网络库的底层实现基础。
292 |
293 | ### mcDNN
294 |
295 | 提供常用深度学习算子。
296 |
297 | ### mcSPARSE
298 |
299 | 稀疏矩阵线性代数库。稀疏矩阵是指零元素数目远多于非零元素数目的矩阵。
300 |
301 | 可以用对应的接口完成稀疏矩阵线性代数运算。
302 |
303 | ### mcSOLVER
304 |
305 | 稠密矩阵线性方程组的求解函数库。
306 |
307 | ### mcFFT
308 |
309 | 快速傅里叶变换库。
310 |
311 |
312 |
313 |
--------------------------------------------------------------------------------
/chapter10/mcBlas.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <math.h>
4 | #include <mc_runtime.h> /* MXMACA runtime header (assumed name) */
5 | #include "mcblas.h"
6 |
7 | /* cpu implementation of sgemm */
8 | static void cpu_sgemm(int m, int n, int k, float alpha, const float *A, const float *B, float beta, float *C_in,
9 | float *C_out) {
10 | int i;
11 | int j;
12 | int kk;
13 |
14 | for (i = 0; i < m; ++i) {
15 | for (j = 0; j < n; ++j) {
16 | float prod = 0;
17 |
18 | for (kk = 0; kk < k; ++kk) {
19 | prod += A[kk * m + i] * B[j * k + kk];
20 | }
21 |
22 | C_out[j * m + i] = alpha * prod + beta * C_in[j * m + i];
23 | }
24 | }
25 | }
26 |
27 | int main(int argc, char **argv) {
28 | float *h_A;
29 | float *h_B;
30 | float *h_C;
31 | float *h_C_ref;
32 | float *d_A = 0;
33 | float *d_B = 0;
34 | float *d_C = 0;
35 | float alpha = 1.0f;
36 | float beta = 0.0f;
37 | int m = 256;
38 | int n = 128;
39 | int k = 64;
40 | int size_a = m * k; // the element num of A matrix (m x k)
41 | int size_b = k * n; // the element num of B matrix (k x n)
42 | int size_c = m * n; // the element num of C matrix
43 | float error_norm;
44 | float ref_norm;
45 | float diff;
46 | mcblasHandle_t handle;
47 | mcblasStatus_t status;
48 |
49 | /* Initialize mcBLAS */
50 | status = mcblasCreate(&handle);
51 | if (status != MCBLAS_STATUS_SUCCESS) {
52 | fprintf(stderr, "Init failed\n");
53 | return EXIT_FAILURE;
54 | }
55 |
56 | /* Allocate host memory for A/B/C matrix*/
57 | h_A = (float *)malloc(size_a * sizeof(float));
58 | if (h_A == NULL) {
59 | fprintf(stderr, "A host memory allocation failed\n");
60 | return EXIT_FAILURE;
61 | }
62 | h_B = (float *)malloc(size_b * sizeof(float));
63 | if (h_B == NULL) {
64 | fprintf(stderr, "B host memory allocation failed\n");
65 | return EXIT_FAILURE;
66 | }
67 | h_C = (float *)malloc(size_c * sizeof(float));
68 | if (h_C == 0) {
69 | fprintf(stderr, "C host memory allocation failed\n");
70 | return EXIT_FAILURE;
71 | }
72 | h_C_ref = (float *)malloc(size_c * sizeof(float));
73 | if (h_C_ref == 0) {
74 | fprintf(stderr, "C_ref host memory allocation failed\n");
75 | return EXIT_FAILURE;
76 | }
77 |
78 | /* Fill the matrices with test data */
79 | for (int i = 0; i < size_a; ++i) {
80 | h_A[i] = cos(i + 0.125);
81 | }
82 | for (int i = 0; i < size_b; ++i) {
83 | h_B[i] = cos(i - 0.125);
84 | }
85 | for (int i = 0; i < size_c; ++i) {
86 | h_C[i] = sin(i + 0.25);
87 | }
88 |
89 | /* Allocate device memory for the matrices */
90 | if (mcMalloc((void **)(&d_A), size_a * sizeof(float)) != mcSuccess) {
91 | fprintf(stderr, "A device memory allocation failed\n");
92 | return EXIT_FAILURE;
93 | }
94 | if (mcMalloc((void **)(&d_B), size_b * sizeof(float)) != mcSuccess) {
95 | fprintf(stderr, "B device memory allocation failed\n");
96 | return EXIT_FAILURE;
97 | }
98 | if (mcMalloc((void **)(&d_C), size_c * sizeof(float)) != mcSuccess) {
99 | fprintf(stderr, "C device memory allocation failed\n");
100 | return EXIT_FAILURE;
101 | }
102 |
103 | /* Initialize the device matrices with the host matrices */
104 | if (mcblasSetVector(size_a, sizeof(float), h_A, 1, d_A, 1) != MCBLAS_STATUS_SUCCESS) {
105 | fprintf(stderr, "Copy A from host to device failed\n");
106 | return EXIT_FAILURE;
107 | }
108 | if (mcblasSetVector(size_b, sizeof(float), h_B, 1, d_B, 1) != MCBLAS_STATUS_SUCCESS) {
109 | fprintf(stderr, "Copy B from host to device failed\n");
110 | return EXIT_FAILURE;
111 | }
112 | if (mcblasSetVector(size_c, sizeof(float), h_C, 1, d_C, 1) != MCBLAS_STATUS_SUCCESS) {
113 | fprintf(stderr, "Copy C from host to device failed\n");
114 | return EXIT_FAILURE;
115 | }
116 |
117 | /* compute the reference result */
118 | cpu_sgemm(m, n, k, alpha, h_A, h_B, beta, h_C, h_C_ref);
119 |
120 | /* Performs operation using mcblas */
121 | status = mcblasSgemm(handle, MCBLAS_OP_N, MCBLAS_OP_N, m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
122 | if (status != MCBLAS_STATUS_SUCCESS) {
123 | fprintf(stderr, "Sgemm kernel execution failed\n");
124 | return EXIT_FAILURE;
125 | }
126 | /* Read the result back */
127 | status = mcblasGetVector(size_c, sizeof(float), d_C, 1, h_C, 1);
128 | if (status != MCBLAS_STATUS_SUCCESS) {
129 | fprintf(stderr, "C data reading failed\n");
130 | return EXIT_FAILURE;
131 | }
132 |
133 | /* Check result against reference */
134 | error_norm = 0;
135 | ref_norm = 0;
136 |
137 | for (int i = 0; i < size_c; ++i) {
138 | diff = h_C_ref[i] - h_C[i];
139 | error_norm += diff * diff;
140 | ref_norm += h_C_ref[i] * h_C_ref[i];
141 | }
142 |
143 | error_norm = (float)sqrt((double)error_norm);
144 | ref_norm = (float)sqrt((double)ref_norm);
145 |
146 | if (error_norm / ref_norm < 1e-6f) {
147 | printf("McBLAS test passed.\n");
148 | } else {
149 | printf("McBLAS test failed.\n");
150 | }
151 |
152 | /* Memory clean up */
153 | free(h_A);
154 | free(h_B);
155 | free(h_C);
156 | free(h_C_ref);
157 |
158 | if (mcFree(d_A) != mcSuccess) {
159 | fprintf(stderr, "A device mem free failed\n");
160 | return EXIT_FAILURE;
161 | }
162 |
163 | if (mcFree(d_B) != mcSuccess) {
164 | fprintf(stderr, "B device mem free failed\n");
165 | return EXIT_FAILURE;
166 | }
167 |
168 | if (mcFree(d_C) != mcSuccess) {
169 | fprintf(stderr, "C device mem free failed\n");
170 | return EXIT_FAILURE;
171 | }
172 |
173 | /* Shutdown */
174 | status = mcblasDestroy(handle);
175 | if (status != MCBLAS_STATUS_SUCCESS) {
176 | fprintf(stderr, "Destroy failed\n");
177 | return EXIT_FAILURE;
178 | }
179 |
180 | return EXIT_SUCCESS;
181 | }
182 |
--------------------------------------------------------------------------------
/chapter10/mcDNN.cpp:
--------------------------------------------------------------------------------
1 | #include <iostream>
2 | #include <cstdlib>
3 | #include <vector>
4 | #include <cmath>
5 | #include <mcdnn.h>  // mcDNN header (assumed name); the MXMACA runtime header may also be required
6 |
7 | #define MCDNN_CHECK(f)                                        \
8 |   {                                                           \
9 |     mcdnnStatus_t err = static_cast<mcdnnStatus_t>(f);        \
10 |     if (err != MCDNN_STATUS_SUCCESS) {                        \
11 |       std::cout << "Error occurred : " << err << std::endl;   \
12 |       std::exit(1);                                           \
13 |     }                                                         \
14 |   }
15 |
16 | int main() {
17 | // data shape
18 | int batch = 3;
19 | int data_w = 224;
20 | int data_h = 224;
21 | int in_channel = 3;
22 | int out_channel = 8;
23 | int filter_w = 5;
24 | int filter_h = 5;
25 | int stride[2] = {1, 1};
26 | int dilate[2] = {1, 1}; int pad[4] = {0, 0, 0, 0};  // top/bottom/left/right padding (assumed: no padding)
27 | float alpha = 2.f;
28 | float beta = 5.f;
29 |
30 | // model selected
31 | mcdnnConvolutionMode_t mode = MCDNN_CROSS_CORRELATION;
32 | mcdnnConvolutionFwdAlgo_t algo = MCDNN_CONVOLUTION_FWD_ALGO_FFT_TILING;
33 | // data type selected float, double, half, etc.
34 | mcdnnDataType_t data_type = MCDNN_DATA_FLOAT;
35 |
36 | // init handle
37 | mcdnnHandle_t handle;
38 | MCDNN_CHECK(mcdnnCreate(&handle));
39 |
40 | // create descriptor
41 | mcdnnTensorDescriptor_t x_desc;
42 | mcdnnFilterDescriptor_t w_desc;
43 | mcdnnTensorDescriptor_t y_desc;
44 | mcdnnConvolutionDescriptor_t conv_desc;
45 | MCDNN_CHECK(mcdnnCreateTensorDescriptor(&x_desc));
46 | MCDNN_CHECK(mcdnnCreateFilterDescriptor(&w_desc));
47 | MCDNN_CHECK(mcdnnCreateTensorDescriptor(&y_desc));
48 | MCDNN_CHECK(mcdnnCreateConvolutionDescriptor(&conv_desc));
49 |
50 | // convolution padding
51 | // out size = (input + pad - kernel) / stride + 1
52 | uint32_t padding_w = data_w + pad[2] + pad[3];
53 | uint32_t padding_h = data_h + pad[0] + pad[1];
54 | uint32_t out_h = padding_h - filter_h + 1;
55 | uint32_t out_w = padding_w - filter_w + 1;
56 | // init tensor descriptor, set data type, layout format, shape, etc.
57 | mcdnnSetTensor4dDescriptor(x_desc, MCDNN_TENSOR_NCHW, data_type, batch,
58 | in_channel, data_h, data_w);
59 | mcdnnSetFilter4dDescriptor(w_desc, data_type, MCDNN_TENSOR_NCHW, out_channel,
60 | in_channel, filter_h, filter_w);
61 | mcdnnSetTensor4dDescriptor(y_desc, MCDNN_TENSOR_NCHW, data_type, batch,
62 | out_channel, out_h, out_w);
63 | // int convolution descriptor, set padding, stride date_type, etc.
64 | mcdnnSetConvolution2dDescriptor(conv_desc, pad[1], pad[2], stride[0],
65 | stride[1], dilate[0], dilate[1], mode,
66 | data_type);
67 |
68 | // init input data
69 | uint32_t input_data_numbers = batch * in_channel * data_h * data_w;
70 | uint32_t filter_data_numbers = out_channel * in_channel * filter_h * filter_w;
71 | uint32_t out_data_numbers = batch * out_channel * out_h * out_w;
72 |
73 | std::vector<float> x(input_data_numbers);
74 | std::vector<float> w(filter_data_numbers);
75 | std::vector<float> y(out_data_numbers);
76 | for (int i = 0; i < input_data_numbers; ++i) {
77 | x[i] = std::cos(i) * i;
78 | }
79 | for (int i = 0; i < filter_data_numbers; ++i) {
80 | w[i] = std::sin(i) / 10;
81 | }
82 |
83 | for (int i = 0; i < out_data_numbers; ++i) {
84 | y[i] = std::cos(i + 0.5);
85 | }
86 |
87 | // alloc x device memory
88 | void *ptr_x_dev = nullptr;
89 | MCDNN_CHECK(mcMalloc(&ptr_x_dev, x.size() * sizeof(float)));
90 | // copy data to device
91 | MCDNN_CHECK(mcMemcpy(ptr_x_dev, x.data(), x.size() * sizeof(float),
92 | mcMemcpyHostToDevice));
93 | // alloc w device memory
94 | void *ptr_w_dev = nullptr;
95 | MCDNN_CHECK(mcMalloc(&ptr_w_dev, w.size() * sizeof(float)));
96 | // copy data to device
97 | MCDNN_CHECK(mcMemcpy(ptr_w_dev, w.data(), w.size() * sizeof(float),
98 | mcMemcpyHostToDevice));
99 | // alloc y device memory
100 | void *ptr_y_dev = nullptr;
101 | MCDNN_CHECK(mcMalloc(&ptr_y_dev, y.size() * sizeof(float)));
102 | // copy data to device
103 | MCDNN_CHECK(mcMemcpy(ptr_y_dev, y.data(), y.size() * sizeof(float),
104 | mcMemcpyHostToDevice));
105 |
106 | uint32_t padding_src_elements = batch * in_channel * padding_h * padding_w;
107 |
108 | size_t workspace_size = 0;
109 | MCDNN_CHECK(mcdnnGetConvolutionForwardWorkspaceSize(
110 | handle, x_desc, w_desc, conv_desc, y_desc, algo, &workspace_size));
111 |
112 | void *ptr_worksapce = nullptr;
113 | if (workspace_size > 0) {
114 | MCDNN_CHECK(mcMalloc(&ptr_worksapce, workspace_size));
115 | }
116 |
117 | // convolution forward
118 | MCDNN_CHECK(mcdnnConvolutionForward(handle, &alpha, x_desc, ptr_x_dev, w_desc,
119 | ptr_w_dev, conv_desc, algo, ptr_worksapce,
120 | workspace_size, &beta, y_desc, ptr_y_dev));
121 | MCDNN_CHECK(mcMemcpy(y.data(), ptr_y_dev, y.size() * sizeof(float),
122 | mcMemcpyDeviceToHost));
123 |
124 | // free device pointer and handle
125 | MCDNN_CHECK(mcFree(ptr_x_dev));
126 | MCDNN_CHECK(mcFree(ptr_w_dev));
127 | MCDNN_CHECK(mcFree(ptr_y_dev));
128 | MCDNN_CHECK(mcdnnDestroyTensorDescriptor(x_desc));
129 | MCDNN_CHECK(mcdnnDestroyTensorDescriptor(y_desc));
130 | MCDNN_CHECK(mcdnnDestroyFilterDescriptor(w_desc));
131 | MCDNN_CHECK(mcdnnDestroyConvolutionDescriptor(conv_desc));
132 | MCDNN_CHECK(mcdnnDestroy(handle));
134 |
135 | return 0;
136 | }
137 |
--------------------------------------------------------------------------------
/chapter10/mcblas命令.txt:
--------------------------------------------------------------------------------
1 | mxcc sample_mcblas.c -I${MACA_PATH}/include -I${MACA_PATH}/include/mcblas -I${MACA_PATH}/include/mcr -L${MACA_PATH}/lib -lmcruntime -lmcblas
--------------------------------------------------------------------------------
/chapter10/usingThrust.cpp:
--------------------------------------------------------------------------------
1 | #include <algorithm>
2 | #include <cstdlib>
3 | #include <iostream>
4 | #include <vector>
5 |
6 | #include <thrust/host_vector.h>
7 | #include <thrust/sort.h>
8 |
9 | int main(void) {
10 | // the following code shows how to use thrust::sort and thrust::host_vector
11 | std::vector<int> array = {2, 4, 6, 8, 0, 9, 7, 5, 3, 1};
12 | thrust::host_vector<int> vec;
13 | vec = array; // now vec has storage for 10 integers
14 | std::cout << "vec has size: " << vec.size() << std::endl;
15 |
16 | std::cout << "vec before sorting:" << std::endl;
17 | for (size_t i = 0; i < vec.size(); ++i)
18 | std::cout << vec[i] << " ";
19 | std::cout << std::endl;
20 |
21 | thrust::sort(vec.begin(), vec.end());
22 | std::cout << "vec after sorting:" << std::endl;
23 | for (size_t i = 0; i < vec.size(); ++i)
24 | std::cout << vec[i] << " ";
25 | std::cout << std::endl;
26 |
27 | vec.resize(2);
28 | std::cout << "now vec has size: " << vec.size() << std::endl;
29 |
30 | return 0;
31 | }
32 |
--------------------------------------------------------------------------------
/chapter11/Makefile:
--------------------------------------------------------------------------------
1 | DEBUG ?= 0
2 | MCCL ?=0
3 | MCCLCMMD = -D_USE_MCCL -lmccl
4 |
5 | ifeq ($(DEBUG), 0)
6 | ifeq ($(MCCL),0)
7 | simple2DFD_rls: simple2DFD.cpp
8 | mxcc -x maca -O3 ./simple2DFD.cpp -I./ -o ./build/$@
9 | else
10 | simple2DFD_rls_mccl: simple2DFD.cpp
11 | mxcc -x maca -O3 ./simple2DFD.cpp $(MCCLCMMD) -I./ -o ./build/$@
12 | @echo Using mccl now!
13 | endif
14 | else
15 | ifeq ($(MCCL),0)
16 | simple2DFD_dbg: simple2DFD.cpp
17 | mxcc -x maca -g -G ./simple2DFD.cpp -I./ -o ./build/$@
18 | else
19 | simple2DFD_dbg_mccl: simple2DFD.cpp
20 | mxcc -x maca -g -G ./simple2DFD.cpp $(MCCLCMMD) -I./ -o ./build/$@
21 | @echo Using mccl now!
22 | endif
23 | endif
24 |
25 | clean:
26 | rm -f ./build/simple2DFD_*
27 |
28 |
--------------------------------------------------------------------------------
/chapter11/simple2DFD.cpp:
--------------------------------------------------------------------------------
1 | #include "../common/common.h"
2 | #include <stdio.h>
3 | #include <stdlib.h>
4 | #include <string.h>
5 | #include <math.h>
6 | #include <assert.h>
7 | #include <time.h>
8 |
9 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name)
10 |
11 | #ifdef _USE_MCCL
12 | #include <mccl.h>  // MCCL header (assumed name)
13 | #endif
14 |
15 |
16 | /*
17 | * This example implements a 2D stencil computation, spreading the computation
18 | * across multiple GPUs. This requires communicating halo regions between GPUs
19 | * on every iteration of the stencil as well as managing multiple GPUs from a
20 | * single host application. Here, kernels and transfers are issued in
21 | * breadth-first order to each maca stream. Each maca stream is associated with
22 | * a single maca device.
23 | */
24 |
25 | #define a0 -3.0124472f
26 | #define a1 1.7383092f
27 | #define a2 -0.2796695f
28 | #define a3 0.0547837f
29 | #define a4 -0.0073118f
30 |
31 | // cnst for gpu
32 | #define BDIMX 32
33 | #define NPAD 4
34 | #define NPAD2 8
35 |
36 | // constant memories for 8 order FD coefficients
37 | __device__ __constant__ float coef[5];
38 |
39 | // set up fd coefficients
40 | void setup_coef (void)
41 | {
42 | const float h_coef[] = {a0, a1, a2, a3, a4};
43 | CHECK( mcMemcpyToSymbol( coef, h_coef, 5 * sizeof(float) ));
44 | }
45 |
46 | void saveSnapshotIstep(
47 | int istep,
48 | int nx,
49 | int ny,
50 | int ngpus,
51 | float **g_u2)
52 | {
53 | float *iwave = (float *)malloc(nx * ny * sizeof(float));
54 |
55 | if (ngpus > 1)
56 | {
57 | unsigned int skiptop = nx * 4;
58 | unsigned int gsize = nx * ny / 2;
59 |
60 | for (int i = 0; i < ngpus; i++)
61 | {
62 | CHECK(mcSetDevice(i));
63 | int iskip = (i == 0 ? 0 : skiptop);
64 | int ioff = (i == 0 ? 0 : gsize);
65 | CHECK(mcMemcpy(iwave + ioff, g_u2[i] + iskip,
66 | gsize * sizeof(float), mcMemcpyDeviceToHost));
67 |
68 | // int iskip = (i == 0 ? nx*ny/2-4*nx : 0+4*nx);
69 | // int ioff = (i == 0 ? 0 : nx*4);
70 | // CHECK(mcMemcpy(iwave + ioff, g_u2[i] + iskip,
71 | // skiptop * sizeof(float), mcMemcpyDeviceToHost));
72 | }
73 | }
74 | else
75 | {
76 | unsigned int isize = nx * ny;
77 | CHECK(mcMemcpy (iwave, g_u2[0], isize * sizeof(float),
78 | mcMemcpyDeviceToHost));
79 | }
80 |
81 | char fname[50];
82 | sprintf(fname, "snap_at_step_%d.data", istep);
83 |
84 | FILE *fp_snap = fopen(fname, "w");
85 |
86 | fwrite(iwave, sizeof(float), nx * ny, fp_snap);
87 | // fwrite(iwave, sizeof(float), nx * 4, fp_snap);
88 | printf("%s: nx = %d ny = %d istep = %d\n", fname, nx, ny, istep);
89 | fflush(stdout);
90 | fclose(fp_snap);
91 |
92 | free(iwave);
93 | return;
94 | }
95 | // 判断算力是否大于2,大于2则就支持P2P通信
96 | inline bool isCapableP2P(int ngpus)
97 | {
98 | mcDeviceProp_t prop[ngpus];
99 | int iCount = 0;
100 |
101 | for (int i = 0; i < ngpus; i++)
102 | {
103 | CHECK(mcGetDeviceProperties(&prop[i], i));
104 |
105 | if (prop[i].major >= 2) iCount++;
106 |
107 | printf("> GPU%d: %s %s Peer-to-Peer access\n", i,
108 | prop[i].name, (prop[i].major >= 2 ? "supports" : "doesn't support"));
109 | fflush(stdout);
110 | }
111 |
112 | if(iCount != ngpus)
113 | {
114 | printf("> no enough device to run this application\n");
115 | fflush(stdout);
116 | }
117 |
118 | return (iCount == ngpus);
119 | }
120 |
121 | /*
122 | * enable P2P memcopies between GPUs (all GPUs must be compute capability 2.0 or
123 | * later (Fermi or later))
124 | */
125 | inline void enableP2P (int ngpus)
126 | {
127 | for (int i = 0; i < ngpus; i++)
128 | {
129 | CHECK(mcSetDevice(i));
130 |
131 | for (int j = 0; j < ngpus; j++)
132 | {
133 | if (i == j) continue;
134 |
135 | int peer_access_available = 0;
136 | CHECK(mcDeviceCanAccessPeer(&peer_access_available, i, j));
137 |
138 | if (peer_access_available) CHECK(mcDeviceEnablePeerAccess(j, 0));
139 | }
140 | }
141 | }
142 | // 是否支持UnifiedAddressing
143 | inline bool isUnifiedAddressing (int ngpus)
144 | {
145 | mcDeviceProp_t prop[ngpus];
146 |
147 | for (int i = 0; i < ngpus; i++)
148 | {
149 | CHECK(mcGetDeviceProperties(&prop[i], i));
150 | }
151 |
152 | const bool iuva = (prop[0].unifiedAddressing && prop[1].unifiedAddressing);
153 | printf("> GPU%d: %s %s Unified Addressing\n", 0, prop[0].name,
154 | (prop[0].unifiedAddressing ? "supports" : "doesn't support"));
155 | printf("> GPU%d: %s %s Unified Addressing\n", 1, prop[1].name,
156 | (prop[1].unifiedAddressing ? "supports" : "doesn't support"));
157 | fflush(stdout);
158 | return iuva;
159 | }
160 | // 2GPU的结果为252,256,4,252
161 | inline void calcIndex(int *haloStart, int *haloEnd, int *bodyStart,
162 | int *bodyEnd, const int ngpus, const int iny)
163 | {
164 | // for halo
165 | for (int i = 0; i < ngpus; i++)
166 | {
167 | if (i == 0 && ngpus == 2)
168 | {
169 | haloStart[i] = iny - NPAD2; // 260-8=252
170 | haloEnd[i] = iny - NPAD; // 260-4=256
171 |
172 | }
173 | else
174 | {
175 | haloStart[i] = NPAD;
176 | haloEnd[i] = NPAD2;
177 | }
178 | }
179 |
180 | // for body
181 | for (int i = 0; i < ngpus; i++)
182 | {
183 | if (i == 0 && ngpus == 2)
184 | {
185 | bodyStart[i] = NPAD; // 4
186 | bodyEnd[i] = iny - NPAD2; // 260-8=252
187 | }
188 | else
189 | {
190 | bodyStart[i] = NPAD + NPAD;
191 | bodyEnd[i] = iny - NPAD;
192 | }
193 | }
194 | }
195 | // // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512
196 | inline void calcSkips(int *src_skip, int *dst_skip, const int nx,
197 | const int iny)
198 | {
199 | src_skip[0] = nx * (iny - NPAD2);// 512*(260-8)
200 | dst_skip[0] = 0;
201 | src_skip[1] = NPAD * nx; // 4*512
202 | dst_skip[1] = (iny - NPAD) * nx; // (260-4)*512
203 | }
204 |
205 | // wavelet
206 | __global__ void kernel_add_wavelet ( float *g_u2, float wavelets, const int nx,
207 | const int ny, const int ngpus)
208 | { // ny为iny=260,nx=512
209 | // global grid idx for (x,y) plane 若gpu个数为2,则
210 | int ipos = (ngpus == 2 ? ny - 10 : ny / 2 - 10); // ipos=250
211 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x; // ix就是x方向上节点编号
212 | unsigned int idx = ipos * nx + ix; // idx=250*512+ix
213 |
214 | if(ix == nx / 2) g_u2[idx] += wavelets; // 这里是说ix==256时,则
215 | }
216 |
217 | // fd kernel function
218 | __global__ void kernel_2dfd_last(float *g_u1, float *g_u2, const int nx,
219 | const int iStart, const int iEnd)
220 | {
221 | // global to slice : global grid idx for (x,y) plane
222 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
223 |
224 | // smem idx for current point
225 | unsigned int stx = threadIdx.x + NPAD;
226 | unsigned int idx = ix + iStart * nx;
227 |
228 | // shared memory for u2 with size [4+16+4][4+16+4]
229 | __shared__ float tile[BDIMX + NPAD2];
230 |
231 | const float alpha = 0.12f;
232 |
233 | // register for y value
234 | float yval[9];
235 |
236 | for (int i = 0; i < 8; i++) yval[i] = g_u2[idx + (i - 4) * nx];
237 |
238 | // to be used in z loop
239 | int iskip = NPAD * nx;
240 |
241 | #pragma unroll 9
242 | for (int iy = iStart; iy < iEnd; iy++)
243 | {
244 | // get front3 here
245 | yval[8] = g_u2[idx + iskip];
246 |
247 | if(threadIdx.x < NPAD)
248 | {
249 | tile[threadIdx.x] = g_u2[idx - NPAD];
250 | tile[stx + BDIMX] = g_u2[idx + BDIMX];
251 | }
252 |
253 | tile[stx] = yval[4];
254 | __syncthreads();
255 |
256 | if ( (ix >= NPAD) && (ix < nx - NPAD) )
257 | {
258 | // 8rd fd operator
259 | float tmp = coef[0] * tile[stx] * 2.0f;
260 |
261 | #pragma unroll
262 | for(int d = 1; d <= 4; d++)
263 | {
264 | tmp += coef[d] * (tile[stx - d] + tile[stx + d]);
265 | }
266 |
267 | #pragma unroll
268 | for(int d = 1; d <= 4; d++)
269 | {
270 | tmp += coef[d] * (yval[4 - d] + yval[4 + d]);
271 | }
272 |
273 | // time dimension
274 | g_u1[idx] = yval[4] + yval[4] - g_u1[idx] + alpha * tmp;
275 | }
276 |
277 | #pragma unroll 8
278 | for (int i = 0; i < 8 ; i++)
279 | {
280 | yval[i] = yval[i + 1];
281 | }
282 |
283 | // advancd on global idx
284 | idx += nx;
285 | __syncthreads();
286 | }
287 | }
288 |
289 | __global__ void kernel_2dfd(float *g_u1, float *g_u2, const int nx,
290 | const int iStart, const int iEnd)
291 | {
292 | // global to line index
293 | unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
294 |
295 | // smem idx for current point
296 | unsigned int stx = threadIdx.x + NPAD;
297 | unsigned int idx = ix + iStart * nx; // ix+4*512,idx表示插值的中心点坐标
298 |
299 | // shared memory for x dimension
300 | __shared__ float line[BDIMX + NPAD2];// 对于一个block,根据模板,需要的共享内存元素数量为block线程大小+NPAD*2
301 |
302 | // a coefficient related to physical properties
303 | const float alpha = 0.12f; // 关于时间步长的系数
304 |
305 | // register for y value
306 | float yval[9]; // 寄存器数组
307 | // 从GPU主存中获取值,这里数据由于是沿着坐标x轴排布的,所以获取主存的数据是不连续的
308 | for (int i = 0; i < 8; i++) yval[i] = g_u2[idx + (i - 4) * nx];
309 |
310 | // skip for the bottom most y value
311 | int iskip = NPAD * nx; // 4*512,看上面for循环,最大下标到idx+3*nx,这里多加了1
312 |
313 | #pragma unroll 9
314 | for (int iy = iStart; iy < iEnd; iy++)//对y方向的数据点进行循环
315 | {
316 | // get yval[8] here
317 | yval[8] = g_u2[idx + iskip];//这里每次yval的最后一个数据从主存获取,其他数据最后从寄存器获取
318 | // 所以内存是按坐标轴的x方向上排布的
319 | // read halo partk //
320 | if(threadIdx.x < NPAD)
321 | { // 共享内存的最前最后4个数据即(0,1,2,3)和(36,37,38,39)
322 | line[threadIdx.x] = g_u2[idx - NPAD];
323 | line[stx + BDIMX] = g_u2[idx + BDIMX];
324 | }
325 |
326 | line[stx] = yval[4]; // line获取中心点的值,注意由于每个线程的yval[4]和stx都不同,所以这样可以将line[4-35]的所有数据填满
327 | __syncthreads();// 直到块内线程同步
328 |
329 | // 8rd fd operator 这里的ix>=4,ix<512-4
330 | if ( (ix >= NPAD) && (ix < nx - NPAD) )
331 | {
332 | // center point
333 | float tmp = coef[0] * line[stx] * 2.0f;
334 |
335 | #pragma unroll
336 | for(int d = 1; d <= 4; d++)
337 | {
338 | tmp += coef[d] * ( line[stx - d] + line[stx + d]);
339 | }
340 |
341 | #pragma unroll
342 | for(int d = 1; d <= 4; d++)
343 | {
344 | tmp += coef[d] * (yval[4 - d] + yval[4 + d]);
345 | }
346 |
347 | // time dimension yval[4]=gu2[idx],g_u1和g_u2和时间推进有关
348 | g_u1[idx] = yval[4] + yval[4] - g_u1[idx] + alpha * tmp;
349 | }
350 |
351 | #pragma unroll 8 // 这里将下移一格,即沿着坐标y轴下移,进行下一层(沿着x轴为一层)
352 | for (int i = 0; i < 8 ; i++)
353 | {
354 | yval[i] = yval[i + 1];
355 | }
356 |
357 | // advancd on global idx
358 | idx += nx; // idx+一层的点数,接着循环
359 | __syncthreads();
360 | }
361 | }
362 | // 程序有多个参数,第一个为要使用的GPU个数,第二个为保存哪个时间步的波场
363 | /*
364 | 1. argv[1]:gpu数量
365 | 2. argv[2]: 每隔多少个时间步存储数据
366 | 3. argv[3]: 一共多少时间步
367 | 4. argv[4]: 每个方向上的网格数
368 | */
369 | int main( int argc, char *argv[] )
370 | {
371 | int ngpus=2;
372 |
373 | // check device count
374 | CHECK(mcGetDeviceCount(&ngpus));
375 | printf("> Number of devices available: %i\n", ngpus);
376 |
377 | // check p2p capability
378 | isCapableP2P(ngpus);
379 | isUnifiedAddressing(ngpus);
380 |
381 | // get it from command line
382 | if (argc > 1)
383 | {
384 | if (atoi(argv[1]) > ngpus)
385 | {
386 | fprintf(stderr, "Invalid number of GPUs specified: %d is greater "
387 | "than the total number of GPUs in this platform (%d)\n",
388 | atoi(argv[1]), ngpus);
389 | exit(1);
390 | }
391 |
392 | ngpus = atoi(argv[1]);
393 | }
394 |
395 | int iMovie = 100; // 这里现在表示每隔多少步存一次数据
396 |
397 | if(argc >= 3) iMovie = atoi(argv[2]);
398 |
399 | // size
400 | // 时间步
401 | int nsteps = 1001;
402 | if(argc>=4) nsteps=atoi(argv[3]);
403 |
404 | printf("> run with %i devices: nsteps = %i\n", ngpus, nsteps);
405 |
406 | // x方向点数
407 | const int nx = 512;
408 | // y方向点数
409 | const int ny = 512;
410 | // 计算每个gpu上点数,这里每个线程负责所有y方向的数据点计算
411 | const int iny = ny / ngpus + NPAD * (ngpus - 1);
412 |
413 | size_t isize = nx * iny; // 总的数据点数
414 | size_t ibyte = isize * sizeof(float); // 每块总的数据字节数
415 | #ifndef _USE_MCCL
416 | size_t iexchange = NPAD * nx * sizeof(float); // 交换区域的字节数
417 | #endif
418 |
419 | // set up gpu card
420 | float *d_u2[ngpus], *d_u1[ngpus];
421 |
422 | for(int i = 0; i < ngpus; i++)
423 | {
424 | // set device
425 | CHECK(mcSetDevice(i));
426 |
427 | // allocate device memories // d_u1,d_u2分别存着两个时间步的数据
428 | CHECK(mcMalloc ((void **) &d_u1[i], ibyte));
429 | CHECK(mcMalloc ((void **) &d_u2[i], ibyte));
430 |
431 | CHECK(mcMemset (d_u1[i], 0, ibyte));
432 | CHECK(mcMemset (d_u2[i], 0, ibyte));
433 | printf("GPU %i: %.2f MB global memory allocated\n", i,
434 | (4.f * ibyte) / (1024.f * 1024.f) );
435 | setup_coef ();
436 | }
437 |
438 | // stream definition
439 | mcStream_t stream_halo[ngpus], stream_body[ngpus];
440 |
441 | for (int i = 0; i < ngpus; i++)
442 | {
443 | CHECK(mcSetDevice(i));
444 | CHECK(mcStreamCreate( &stream_halo[i] ));
445 | CHECK(mcStreamCreate( &stream_body[i] ));
446 | }
447 |
448 | // calculate index for computation
449 | int haloStart[ngpus], bodyStart[ngpus], haloEnd[ngpus], bodyEnd[ngpus];
450 | // 根据iny进行处理 ,2GPU的结果为252,256,4,252
451 | calcIndex(haloStart, haloEnd, bodyStart, bodyEnd, ngpus, iny);
452 |
453 | int src_skip[ngpus], dst_skip[ngpus];
454 | // // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512
455 | // 根据nx,iny进行处理
456 | if(ngpus > 1) calcSkips(src_skip, dst_skip, nx, iny);
457 |
458 | // kernel launch configuration
459 | // block 中的线程数量
460 | dim3 block(BDIMX);
461 | // block数量 这样的话一个线程要处理所有y向的数据。y方向被所有的GPU分块
462 | dim3 grid(nx / block.x);
463 |
464 | // set up event for timing
465 | CHECK(mcSetDevice(0));
466 | mcEvent_t start, stop;
467 | CHECK (mcEventCreate(&start));
468 | CHECK (mcEventCreate(&stop ));
469 | CHECK(mcEventRecord( start, 0 ));
470 | #ifdef _USE_MCCL
471 | int devs[2] = {0, 1};
472 | mcclComm_t comms[2];
473 | assert(mcclSuccess==mcclCommInitAll(comms, ngpus, devs));
474 | #endif
475 | // main loop for wave propagation
476 | for(int istep = 0; istep < nsteps; istep++)
477 | {
478 |
479 | // save snap image
480 | if(istep%iMovie==0) saveSnapshotIstep(istep, nx, ny, ngpus, d_u2);
481 |
482 | // add wavelet only onto gpu0
483 | if (istep == 0)
484 | {
485 | CHECK(mcSetDevice(0));
486 | kernel_add_wavelet<<<grid, block>>>(d_u2[0], 20.0, nx, iny, ngpus);
487 | }
488 |
489 | // halo part
490 | for (int i = 0; i < ngpus; i++)
491 | {
492 | CHECK(mcSetDevice(i));
493 |
494 | // compute halo
495 | kernel_2dfd<<<grid, block, 0, stream_halo[i]>>>(d_u1[i], d_u2[i],
496 | nx, haloStart[i], haloEnd[i]);
497 |
498 | // compute internal
499 | kernel_2dfd<<<grid, block, 0, stream_body[i]>>>(d_u1[i], d_u2[i],
500 | nx, bodyStart[i], bodyEnd[i]);
501 | }
502 |
503 | /*
504 | ================================================================================
505 |
506 | ***************************使用不同的方式在GPU间交换数据****************************
507 |
508 | ================================================================================
509 | */
510 |
511 | #ifndef _USE_MCCL
512 | // exchange halo
513 | // src_skip: 512*(260-8) 4*512 dst_skip:0 (260-4)*512
514 | if (ngpus > 1)
515 | {
516 | // 交换两个GPU的数据注意都是d_u1的数据,即新的时间步上的数据 这里可以考虑使用mccl?
517 | // 这里是将gpu0的halo区域数据给gpu1的填充区域
518 | CHECK(mcMemcpyAsync(d_u1[1] + dst_skip[0], d_u1[0] + src_skip[0],
519 | iexchange, mcMemcpyDefault, stream_halo[0]));
520 | // 这里是将gpu1的halo区域数据给gpu0的填充区域
521 | CHECK(mcMemcpyAsync(d_u1[0] + dst_skip[1], d_u1[1] + src_skip[1],
522 | iexchange, mcMemcpyDefault, stream_halo[1]));
523 | }
524 | #else
525 | // 使用mccl发送填充区数据
526 | assert(mcclSuccess == mcclGroupStart());
527 | for (int i = 0; i < ngpus; ++i)
528 | {
529 | mcSetDevice(i);
530 | int tag = (i + 1) % 2;
531 | mcclSend(d_u1[i] + src_skip[i], NPAD * nx, mcclFloat, tag, comms[i], stream_halo[i]);
532 | mcclRecv(d_u1[i] + dst_skip[tag], NPAD * nx, mcclFloat, tag, comms[i], stream_halo[i]);
533 | }
534 | assert(mcclSuccess == mcclGroupEnd());
535 |
536 | for (int i = 0; i < ngpus; ++i)
537 | {
538 | mcSetDevice(i);
539 | // it will stall host until all operations are done
540 | mcStreamSynchronize(stream_halo[i]);
541 | }
542 | #endif
543 | for (int i = 0; i < ngpus; i++)
544 | {
545 | CHECK(mcSetDevice(i));
546 | CHECK(mcDeviceSynchronize());
547 | // 交换时间步的指针
548 | float *tmpu0 = d_u1[i];
549 | d_u1[i] = d_u2[i];
550 | d_u2[i] = tmpu0;
551 | }
552 |
553 | } // 关于istep的for循环结束
554 |
555 | CHECK(mcSetDevice(0));
556 | CHECK(mcEventRecord(stop, 0));
557 |
558 | CHECK(mcDeviceSynchronize());
559 | CHECK(mcGetLastError());
560 |
561 | float elapsed_time_ms = 0.0f;
562 | CHECK(mcEventElapsedTime(&elapsed_time_ms, start, stop));
563 |
564 | elapsed_time_ms /= nsteps;
565 | /*
566 | 1. nsteps=30000,NCCL:845.04 MCells/s,origin:941.21 MCells/s
567 | 2. nsteps=15000,NCCL:817.91 MCells/s,origin:935.47 MCells/s
568 | 3. nsteps=10000,NCCL:793.62 MCells/s,origin:925.97 MCells/s
569 | 4. nsteps=05000,NCCL:756.32 MCells/s,origin:925.32 MCells/s
570 | 5. nsteps=02000,NCCL:599.61 MCells/s,origin:889.43 MCells/s
571 | 6. nsteps=01000,NCCL:470.81 MCells/s,origin:802.86 MCells/s
572 | 可见随着循环步骤数的增加,mccl通信与原有程序的速度逐渐接近
573 | */
574 | printf("gputime: %8.2fms ", elapsed_time_ms);
575 | printf("performance: %8.2f MCells/s\n",
576 | (double)nx * ny / (elapsed_time_ms * 1e3f));
577 | fflush(stdout);
578 |
579 | CHECK(mcEventDestroy(start));
580 | CHECK(mcEventDestroy(stop));
581 |
582 | // clear
583 | for (int i = 0; i < ngpus; i++)
584 | {
585 | CHECK(mcSetDevice(i));
586 |
587 | CHECK(mcStreamDestroy(stream_halo[i]));
588 | CHECK(mcStreamDestroy(stream_body[i]));
589 |
590 | CHECK(mcFree(d_u1[i]));
591 | CHECK(mcFree(d_u2[i]));
592 |
593 | // CHECK(mcDeviceReset()); // 不注释掉会mcclCommDestroy出现段错误
594 | }
595 | #ifdef _USE_MCCL
596 | for (int i = 0; i < ngpus; ++i)
597 | {
598 | assert(mcclSuccess == mcclCommDestroy(comms[i]));
599 | }
600 | #endif
601 | exit(EXIT_SUCCESS);
602 | }
603 |
--------------------------------------------------------------------------------
/chapter11/vectorAddMultiGpus.cpp:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <sys/time.h>
4 | #include <iostream>
5 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name)
6 |
7 | #define USECPSEC 1000000ULL
8 |
9 | unsigned long long dtime_usec(unsigned long long start){
10 |
11 | timeval tv;
12 | gettimeofday(&tv, 0);
13 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
14 | }
15 |
16 | // error checking macro
17 | #define macaCheckErrors(msg) \
18 | do { \
19 | mcError_t __err = mcGetLastError(); \
20 | if (__err != mcSuccess) { \
21 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
22 | msg, mcGetErrorString(__err), \
23 | __FILE__, __LINE__); \
24 | fprintf(stderr, "*** FAILED - ABORTING\n"); \
25 | exit(1); \
26 | } \
27 | } while (0)
28 |
29 |
30 | const int DSIZE = 1 << 26; //64MB
31 | #define NGPUS 4
32 |
33 | // generate different seed for random number
34 | void initialData(float *ip, int size)
35 | {
36 | time_t t;
37 | srand((unsigned) time(&t));
38 |
39 | for (int i = 0; i < size; i++)
40 | {
41 | ip[i] = (float)(rand() & 0xFF) / 10.0f;
42 | }
43 |
44 | return;
45 | }
46 |
47 | // vector add function: C = A + B
48 | void cpuVectorAdd(float *A, float *B, float *C, const int N)
49 | {
50 | for (int idx = 0; idx < N; idx++)
51 | C[idx] = A[idx] + B[idx];
52 | }
53 |
54 | // vector add kernel: C = A + B
55 | __global__ void gpuVectorAddKernel(const float *A, const float *B, float *C, const int N){
56 |
57 | for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < N; idx+=gridDim.x*blockDim.x) // a grid-stride loop
58 | C[idx] = A[idx] + B[idx]; // do the vector (element) add here
59 | }
60 |
61 | // check results from host and gpu
62 | void checkResult(float *hostRef, float *gpuRef, const int N)
63 | {
64 | double epsilon = 1.0E-8;
65 | bool match = 1;
66 | for (int i = 0; i < N; i++)
67 | {
68 | if (abs(hostRef[i] - gpuRef[i]) > epsilon)
69 | {
70 | match = 0;
71 | printf("The vector-add results do not match!\n");
72 | printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i],
73 | gpuRef[i], i);
74 | break;
75 | }
76 | }
77 | // if (match) printf("The vector-add results match.\n\n");
78 | return;
79 | }
80 |
81 | // 程序有多个参数,第一个为要使用的GPU个数,第二个为保存哪个时间步的波场
82 | /*
83 | 1. argv[1]:GPU数量 (nGpus)
84 | 2. argv[2]:线程块大小(blockSize)
85 | 3. argv[3]:数据量(dataSize), default is 26(1<<26=64MB)
86 | */
87 | int main( int argc, char *argv[] )
88 | {
89 | int nGpus;
90 | mcGetDeviceCount(&nGpus);
91 | nGpus = (nGpus > NGPUS) ? NGPUS : nGpus;
92 | printf("> Number of devices available: %i\n", nGpus);
93 | // get it from command line
94 | if (argc > 1)
95 | {
96 | if (atoi(argv[1]) > nGpus)
97 | {
98 | fprintf(stderr, "Invalid number of GPUs specified: %d is greater "
99 | "than the total number of GPUs in this platform (%d)\n",
100 | atoi(argv[1]), nGpus);
101 | exit(1);
102 | }
103 | nGpus = atoi(argv[1]);
104 | }
105 |
106 | // blockSize is set to 1 for slowing execution time per GPU
107 | int blockSize = 1;
108 | // It would be faster if blockSize is set to multiples of 64(waveSize)
109 | if(argc >= 3) blockSize = atoi(argv[2]);
110 | int dataSize = DSIZE;
111 | if(argc >= 4) dataSize = 1 << abs(atoi(argv[3]));
112 | printf("> total array size is %iMB, using %i devices with each device handling %iMB\n", dataSize/1024/1024, nGpus, dataSize/1024/1024/nGpus);
113 |
114 | float *d_A[NGPUS], *d_B[NGPUS], *d_C[NGPUS];
115 | float *h_A[NGPUS], *h_B[NGPUS], *hostRef[NGPUS], *gpuRef[NGPUS];
116 | mcStream_t stream[NGPUS];
117 |
118 | int iSize = dataSize / nGpus;
119 | size_t iBytes = iSize * sizeof(float);
120 | for (int i = 0; i < nGpus; i++) {
121 | //set current device
122 | mcSetDevice(i);
123 |
124 | //allocate device memory
125 | mcMalloc((void **) &d_A[i], iBytes);
126 | mcMalloc((void **) &d_B[i], iBytes);
127 | mcMalloc((void **) &d_C[i], iBytes);
128 |
129 | //allocate page locked host memory for asynchronous data transfer
130 | mcMallocHost((void **) &h_A[i], iBytes);
131 | mcMallocHost((void **) &h_B[i], iBytes);
132 | mcMallocHost((void **) &hostRef[i], iBytes);
133 | mcMallocHost((void **) &gpuRef[i], iBytes);
134 |
135 | // initialize data at host side
136 | initialData(h_A[i], iSize);
137 | initialData(h_B[i], iSize);
138 | //memset(hostRef[i], 0, iBytes);
139 | //memset(gpuRef[i], 0, iBytes);
140 | }
141 | mcDeviceSynchronize();
142 |
143 | // distribute the workload across multiple devices
144 | unsigned long long dt = dtime_usec(0);
145 | for (int i = 0; i < nGpus; i++) {
146 | //set current device
147 | mcSetDevice(i);
148 | mcStreamCreate(&stream[i]);
149 |
150 | // transfer data from host to device
151 | mcMemcpyAsync(d_A[i],h_A[i], iBytes, mcMemcpyHostToDevice, stream[i]);
152 | mcMemcpyAsync(d_B[i],h_B[i], iBytes, mcMemcpyHostToDevice, stream[i]);
153 |
154 | // invoke kernel at host side
155 | dim3 block (blockSize);
156 | dim3 grid (iSize/blockSize);
157 | gpuVectorAddKernel<<<grid, block, 0, stream[i]>>>(d_A[i], d_B[i], d_C[i], iSize);
158 |
159 | // copy kernel result back to host side
160 | mcMemcpyAsync(gpuRef[i],d_C[i],iBytes,mcMemcpyDeviceToHost,stream[i]);
161 | }
162 | mcDeviceSynchronize();
163 | dt = dtime_usec(dt);
164 | std::cout << "> The execution time with " << nGpus <<"GPUs: "<< dt/(float)USECPSEC << "s" << std::endl;
165 |
166 | // check the results from host and gpu devices
167 | for (int i = 0; i < nGpus; i++) {
168 | // add vector at host side for result checks
169 | cpuVectorAdd(h_A[i], h_B[i], hostRef[i], iSize);
170 |
171 | // check device results
172 | checkResult(hostRef[i], gpuRef[i], iSize);
173 |
174 | // free device global memory
175 | mcSetDevice(i);
176 | mcFree(d_A[i]);
177 | mcFree(d_B[i]);
178 | mcFree(d_C[i]);
179 |
180 | // free host memory
181 | mcFreeHost(h_A[i]);
182 | mcFreeHost(h_B[i]);
183 | mcFreeHost(hostRef[i]);
184 | mcFreeHost(gpuRef[i]);
185 |
186 | mcStreamSynchronize(stream[i]);
187 | mcStreamDestroy(stream[i]);
188 | }
189 | mcDeviceSynchronize();
190 | return 0;
191 | }
192 |
--------------------------------------------------------------------------------
/chapter2/helloFromGpu.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <mc_runtime.h> /* MXMACA runtime header (assumed name) */
4 |
5 | __global__ void helloFromGpu (void)
6 | {
7 | printf("Hello World from GPU!\n");
8 | }
9 |
10 | int main(void)
11 | {
12 | printf("Hello World from CPU!\n");
13 |
14 | helloFromGpu <<<1, 10>>>();
15 | mcDeviceReset();
16 | //mcDeviceReset()用来显式销毁并清除与当前设备有关的所有资源。
17 | return 0;
18 | }
19 |
--------------------------------------------------------------------------------
/chapter3/cpuVectorAdd.cpp:
--------------------------------------------------------------------------------
1 | #include <iostream>
2 | #include <cstdlib>
3 | #include <sys/time.h>
4 |
5 | using namespace std;
6 |
7 | void cpuVectorAdd(float* A, float* B, float* C, int n) {
8 | for (int i = 0; i < n; i++) {
9 | C[i] = A[i] + B[i];
10 | }
11 | }
12 |
13 | int main(int argc, char *argv[]) {
14 |
15 | int n = atoi(argv[1]);
16 | cout << n << endl;
17 |
18 | size_t size = n * sizeof(float);
19 |
20 | // host memery
21 | float *a = (float *)malloc(size); //分配一段内存,使用指针 a 指向它。
22 | float *b = (float *)malloc(size);
23 | float *c = (float *)malloc(size);
24 |
25 | // for 循环产生一些随机数,并放在分配的内存里面。
26 | for (int i = 0; i < n; i++) {
27 | float af = rand() / double(RAND_MAX);
28 | float bf = rand() / double(RAND_MAX);
29 | a[i] = af;
30 | b[i] = bf;
31 | }
32 |
33 | struct timeval t1, t2;
34 |
35 | // gettimeofday 函数来得到精确时间。它的精度可以达到微秒,是C标准库的函数。
36 | gettimeofday(&t1, NULL);
37 |
38 | // 输入指向3段内存的指针名,也就是 a, b, c。
39 | cpuVectorAdd(a, b, c, n);
40 |
41 | gettimeofday(&t2, NULL);
42 |
43 | //for (int i = 0; i < 10; i++)
44 | // cout << vecA[i] << " " << vecB[i] << " " << vecC[i] << endl;
45 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
46 | cout << timeuse << endl;
47 |
48 | // free 函数把申请的3段内存释放掉。
49 | free(a);
50 | free(b);
51 | free(c);
52 | return 0;
53 | }
54 |
--------------------------------------------------------------------------------
/chapter3/gpuVectorAdd.cpp:
--------------------------------------------------------------------------------
1 | #include <iostream>
2 | #include <cstdlib>
3 | #include <sys/time.h>
4 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name)
5 |
6 | using namespace std;
7 |
8 | // 要用 __global__ 来修饰。
9 | // 输入指向3段显存的指针名。
10 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N)
11 | {
12 | int i = threadIdx.x + blockDim.x * blockIdx.x;
13 | if (i < N) C_d[i] = A_d[i] + B_d[i];
14 | }
15 |
16 | int main(int argc, char *argv[]) {
17 |
18 | int n = atoi(argv[1]);
19 | cout << n << endl;
20 |
21 | size_t size = n * sizeof(float);
22 |
23 | // host memery
24 | float *a = (float *)malloc(size);
25 | float *b = (float *)malloc(size);
26 | float *c = (float *)malloc(size);
27 |
28 | for (int i = 0; i < n; i++) {
29 | float af = rand() / double(RAND_MAX);
30 | float bf = rand() / double(RAND_MAX);
31 | a[i] = af;
32 | b[i] = bf;
33 | }
34 |
35 | // 定义空指针。
36 | float *da = NULL;
37 | float *db = NULL;
38 | float *dc = NULL;
39 |
40 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。
41 | mcMalloc((void **)&da, size);
42 | mcMalloc((void **)&db, size);
43 | mcMalloc((void **)&dc, size);
44 |
45 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。
46 | mcMemcpy(da,a,size,mcMemcpyHostToDevice);
47 | mcMemcpy(db,b,size,mcMemcpyHostToDevice);
48 |
49 | struct timeval t1, t2;
50 |
51 | // 计算线程块和网格的数量。
52 | int threadPerBlock = 256;
53 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock;
54 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid);
55 |
56 | gettimeofday(&t1, NULL);
57 |
58 | // 调用核函数。
59 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n);
60 |
61 | gettimeofday(&t2, NULL);
62 |
63 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost);
64 |
65 | // for (int i = 0; i < 10; i++)
66 | // cout<
12 | __device__ __host__ void count_if(int *count, T *data, int start, int end, int stride, P p) {
13 | for(int i = start; i < end; i += stride){
14 | if(p(data[i])){
15 | // __MACA_ARCH__ 宏仅在编译设备侧代码时生效
16 | #ifdef __MACA_ARCH__
17 | // 使用原子操作保证设备侧多线程执行时的正确性
18 | atomicAdd(count, 1);
19 | #else
20 | *count += 1;
21 | #endif
22 | }
23 | }
24 | }
25 | // 定义核函数
26 | __global__ void count_xyzw(int *res) {
27 | // 利用内建变量gridDim, blockDim, blockIdx, threadIdx对每个线程操作的字符串进行分割
28 | const int start = blockDim.x * blockIdx.x + threadIdx.x;
29 | const int stride = gridDim.x * blockDim.x;
30 | // 在设备侧调用count_if
31 | count_if(res, dstrlist, start, dsize, stride, [=](char c){
32 | for(auto i: letters)
33 | if(i == c) return true;
34 | return false;
35 | });
36 | }
37 |
38 | int main(void){
39 | // 初始化字符串
40 | char test_data[SIZE];
41 | for(int i = 0; i < SIZE; i ++){
42 | test_data[i] = 'a' + i % 26;
43 | }
44 | // 拷贝字符串数据至设备侧
45 | mcMemcpyToSymbol(dstrlist, test_data, SIZE);
46 | // 开辟设备侧的计数器内存并赋值为0
47 | int *dcnt;
48 | mcMalloc(&dcnt, sizeof(int));
49 | int dinit = 0;
50 | mcMemcpy(dcnt, &dinit, sizeof(int), mcMemcpyHostToDevice);
51 | // 启动核函数
52 | count_xyzw<<<4, 64>>>(dcnt);
53 | // 拷贝计数器值到主机侧
54 | int dres;
55 | mcMemcpy(&dres, dcnt, sizeof(int), mcMemcpyDeviceToHost);
56 | // 释放设备侧开辟的内存
57 | mcFree(dcnt);
58 | printf("xyzw counted by device: %d\n", dres);
59 |
60 | // 在主机侧调用count_if
61 | int hcnt = 0;
62 | count_if(&hcnt, test_data, 0, SIZE, 1, [=](char c){
63 | for(auto i: letters)
64 | if(i == c) return true;
65 | return false;
66 | });
67 | printf("xyzw counted by host: %d\n", hcnt);
68 | return 0;
69 | }
70 |
--------------------------------------------------------------------------------
/chapter5/Cooperative_Groups.cpp:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <mc_runtime.h>          // MXMACA runtime header (assumed name)
3 | #include <cooperative_groups.h>  // cooperative groups header (assumed name)
4 |
5 | using namespace cooperative_groups;
6 | __device__ int reduce_sum(thread_group g, int *temp, int val)
7 | {
8 | int lane = g.thread_rank();
9 |
10 | // Each iteration halves the number of active threads
11 | // Each thread adds its partial sum[i] to sum[lane+i]
12 | for (int i = g.size() / 2; i > 0; i /= 2)
13 | {
14 | temp[lane] = val;
15 | g.sync(); // wait for all threads to store
16 | if(lane < i) val += temp[lane + i];
17 | g.sync(); // wait for all threads to add
18 | }
19 | return val; // note: only thread 0 holds the complete sum
20 | }

--------------------------------------------------------------------------------
/chapter5/assignKernel.cpp:
--------------------------------------------------------------------------------
1 | #include <iostream>
2 | #include <cstdlib>
3 | #include <cstring>
4 | #include <ctime>
5 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name)
6 | using namespace std;
7 |
8 | __global__ void assignKernel(int *data) {
9 | int tid = blockIdx.x * blockDim.x + threadIdx.x;
10 |
11 | if (tid % 2 == 0) {
12 | data[tid] = 20;
13 | } else {
14 | data[tid] = 10;
15 | }
16 | }
17 | int main(){
18 | int *a;
19 | a=(int *)malloc(sizeof(int)*16*16);
20 | int i;
21 | for(i=0;i<16*16;i++) a[i]=(int)rand() %10+1;
22 | int *da;
23 | mcMalloc((void **)&da,sizeof(int)*16*16);
24 | mcMemcpy(da,a,sizeof(int)*16*16,mcMemcpyHostToDevice);
25 | assignKernel<<<16,16>>>(da);
26 | mcMemcpy(a,da,sizeof(int)*16*16,mcMemcpyDeviceToHost);
27 | for(i=0;i<16*16;i++) cout << a[i] << " ";
28 | cout << endl;
29 | mcFree(da);
30 | free(a);
31 | return 0;
32 | }

--------------------------------------------------------------------------------
/chapter5/information.cpp:
--------------------------------------------------------------------------------
1 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name); stdio.h may also be needed for printf
2 |
3 | int main( void ) {
4 | mcDeviceProp_t prop;
5 |
6 | int count;
7 | mcGetDeviceCount( &count );
8 | for (int i=0; i< count; i++) {
9 | mcGetDeviceProperties( &prop, i );
10 | printf( " --- General Information for device %d ---\n", i );
11 | printf( "Name: %s\n", prop.name );
12 | printf( "Compute capability: %d.%d\n", prop.major, prop.minor );
13 | printf( "Clock rate: %d\n", prop.clockRate );
14 | printf( "Device copy overlap: " );
15 | if (prop.deviceOverlap)
16 | printf( "Enabled\n" );
17 | else
18 | printf( "Disabled\n" );
19 | printf( "Kernel execution timeout : " );
20 | if (prop.kernelExecTimeoutEnabled)
21 | printf( "Enabled\n" );
22 | else
23 | printf( "Disabled\n" );
24 |
25 | printf( " --- MP Information for device %d ---\n", i );
26 | printf( "Multiprocessor count: %d\n",
27 | prop.multiProcessorCount );
28 | printf( "Threads in wave: %d\n", prop.waveSize );
29 | printf( "Max threads per block: %d\n",
30 | prop.maxThreadsPerBlock );
31 | printf( "Max thread dimensions: (%d, %d, %d)\n",
32 | prop.maxThreadsDim[0], prop.maxThreadsDim[1],
33 | prop.maxThreadsDim[2] );
34 | printf( "Max grid dimensions: (%d, %d, %d)\n",
35 | prop.maxGridSize[0], prop.maxGridSize[1],
36 | prop.maxGridSize[2] );
37 | printf( "\n" );
38 | }
39 | }
40 |
--------------------------------------------------------------------------------
/chapter5/nestedHelloWorld.cpp:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name)
3 |
4 |
5 | __global__ void nestedHelloWorld(int const iSize, int iDepth) {
6 | int tid = threadIdx.x;
7 | printf("Recursion=%d: Hello World from thread %d"
8 | " block %d\n", iDepth, tid, blockIdx.x);
9 |
10 | // condition to stop recursive execution
11 | if (iSize==1) return;
12 |
13 | //reduce block size to half
14 | int nThreads = iSize >> 1;
15 |
16 | //thread 0 launches child grid recursively
17 | if (tid == 0 && nThreads >0) {
18 | nestedHelloWorld<<<1, nThreads>>>(nThreads, ++iDepth);
19 | printf("------> nested execution depth: %d\n", iDepth);
20 | }
21 | }
22 |
23 | int main(int argc, char *argv[])
24 | {
25 | // launch nestedHelloWorld
26 | nestedHelloWorld<<<1,8>>>(8,0);
27 | mcDeviceSynchronize();
28 | return 0;
29 | }
30 |
--------------------------------------------------------------------------------
/chapter6/AplusB_with_managed.cpp:
--------------------------------------------------------------------------------
1 | #include <iostream>
2 | #include <cstdio>
3 | #include <cstdlib>
4 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name)
5 | using namespace std;
6 |
7 | __device__ __managed__ int ret[1000];
8 | __global__ void AplusB(int a, int b) {
9 | ret[threadIdx.x] = a + b + threadIdx.x;
10 | }
11 | int main() {
12 | AplusB<<< 1, 1000 >>>(10, 100);
13 | mcDeviceSynchronize();
14 | for(int i = 0; i < 1000; i++)
15 | printf("%d: A+B = %d\n", i, ret[i]);
16 | return 0;
17 | }
18 |
--------------------------------------------------------------------------------
/chapter6/AplusB_with_unified_addressing.cpp:
--------------------------------------------------------------------------------
1 | #include <iostream>
2 | #include <cstdio>
3 | #include <cstdlib>
4 | #include <mc_runtime.h>  // MXMACA runtime header (assumed name)
5 |
6 | using namespace std;
7 | __global__ void AplusB(int *ret, int a, int b) {
8 | ret[threadIdx.x] = a + b + threadIdx.x;
9 | }
10 | int main() {
11 | int *ret;
12 | mcMallocManaged(&ret, 1000 * sizeof(int));
13 | AplusB<<< 1, 1000 >>>(ret, 10, 100);
14 | mcDeviceSynchronize();
15 | for(int i = 0; i < 1000; i++)
16 | printf("%d: A+B = %d\n", i, ret[i]);
17 | mcFree(ret);
18 | return 0;
19 | }
20 |
--------------------------------------------------------------------------------
/chapter6/AplusB_without_unified_addressing.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | #include
5 | #include
6 |
7 | __global__ void AplusB(int *ret, int a, int b) {
8 | ret[threadIdx.x] = a + b + threadIdx.x;
9 | }
10 | int main() {
11 | int *ret;
12 | mcMalloc(&ret, 1000 * sizeof(int));
13 | AplusB<<< 1, 1000 >>>(ret, 10, 100);
14 | int *host_ret = (int *)malloc(1000 * sizeof(int));
15 | mcMemcpy(host_ret, ret, 1000 * sizeof(int), mcMemcpyDefault);
16 | for(int i = 0; i < 1000; i++)
17 | printf("%d: A+B = %d\n", i, host_ret[i]);
18 | free(host_ret);
19 | mcFree(ret);
20 | return 0;
21 | }
22 |
--------------------------------------------------------------------------------
/chapter6/BC_addKernel.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | #include
5 |
6 | #define ThreadsPerBlock 256
7 | #define maxGridSize 16
8 | __global__ void BC_addKernel(const int *a, int *r)
9 | {
10 | __shared__ int cache[ThreadsPerBlock];
11 | int tid = blockIdx.x * blockDim.x + threadIdx.x;
12 | int cacheIndex = threadIdx.x;
13 |
14 | // copy data to shared memory from global memory
15 | cache[cacheIndex] = a[tid];
16 | __syncthreads();
17 |
18 | // add these data using reduce
19 | for (int i = 1; i < blockDim.x; i *= 2)
20 | {
21 | int index = 2 * i * cacheIndex;
22 | if (index < blockDim.x)
23 | {
24 | cache[index] += cache[index + i];
25 | }
26 | __syncthreads();
27 | }
28 |
29 | // copy the result of reduce to global memory
30 | if (cacheIndex == 0){
31 | r[blockIdx.x] = cache[cacheIndex];
32 | printf("blockIdx.x:%d r[blockIdx.x]:%d\n",blockIdx.x,r[blockIdx.x]);
33 | }
34 |
35 | }
36 |
37 | int test(int *h_a,int n){
38 | int *a;
39 | mcMalloc(&a,n*sizeof(int));
40 | mcMemcpy(a,h_a,n*sizeof(int),mcMemcpyHostToDevice);
41 | int *r;
42 | int h_r[maxGridSize]={0};
43 | mcMalloc(&r,maxGridSize*sizeof(int));
44 | mcMemcpy(r,h_r,maxGridSize*sizeof(int),mcMemcpyHostToDevice);
45 | BC_addKernel<<<maxGridSize,ThreadsPerBlock>>>(a,r);
46 | mcMemcpy(h_a,a,n*sizeof(int),mcMemcpyDeviceToHost);
47 | mcMemcpy(h_r,r,maxGridSize*sizeof(int),mcMemcpyDeviceToHost);
48 | mcFree(r);
49 | mcFree(a);
50 | int sum=0;
51 | for(int i=0;i
--------------------------------------------------------------------------------
/chapter6/NBC_addKernel2.cpp:
--------------------------------------------------------------------------------
2 | #include
3 | #include
4 | #include
5 |
6 | #define ThreadsPerBlock 256
7 | #define maxGridSize 16
8 | __global__ void NBC_addKernel2(const int *a, int *r)
9 | {
10 | __shared__ int cache[ThreadsPerBlock];
11 | int tid = blockIdx.x * blockDim.x + threadIdx.x;
12 | int cacheIndex = threadIdx.x;
13 |
14 | // copy data to shared memory from global memory
15 | cache[cacheIndex] = a[tid];
16 | __syncthreads();
17 |
18 | // add these data using reduce
19 | for (int i = blockDim.x / 2; i > 0; i /= 2)
20 | {
21 | if (cacheIndex < i)
22 | {
23 | cache[cacheIndex] += cache[cacheIndex + i];
24 | }
25 | __syncthreads();
26 | }
27 |
28 | // copy the result of reduce to global memory
29 | if (cacheIndex == 0){
30 | r[blockIdx.x] = cache[cacheIndex];
31 | printf("blockIdx.x:%d r[blockIdx.x]:%d\n",blockIdx.x,r[blockIdx.x]);
32 | }
33 | }
34 |
35 |
36 | int test(int *h_a,int n){
37 | int *a;
38 | mcMalloc(&a,n*sizeof(int));
39 | mcMemcpy(a,h_a,n*sizeof(int),mcMemcpyHostToDevice);
40 | int *r;
41 | int h_r[maxGridSize]={0};
42 | mcMalloc(&r,maxGridSize*sizeof(int));
43 | mcMemcpy(r,h_r,maxGridSize*sizeof(int),mcMemcpyHostToDevice);
44 | NBC_addKernel2<<<maxGridSize,ThreadsPerBlock>>>(a,r);
45 | mcMemcpy(h_a,a,n*sizeof(int),mcMemcpyDeviceToHost);
46 | mcMemcpy(h_r,r,maxGridSize*sizeof(int),mcMemcpyDeviceToHost);
47 | mcFree(r);
48 | mcFree(a);
49 | int sum=0;
50 | for(int i=0;i
--------------------------------------------------------------------------------
/chapter6/__shfl_down_syncExample.cpp:
--------------------------------------------------------------------------------
2 | #include
3 | #include
4 | using namespace std;
5 |
6 | __global__ void test_shfl_down_sync(int A[], int B[])
7 | {
8 | int tid = threadIdx.x;
9 | int value = B[tid];
10 |
11 | value = __shfl_down_sync(0xffffffffffffffff, value, 2);
12 | A[tid] = value;
13 |
14 | }
15 |
16 |
17 | int main()
18 | {
19 | int *A,*Ad, *B, *Bd;
20 | int n = 64;
21 | int size = n * sizeof(int);
22 |
23 | // CPU端分配内存
24 | A = (int*)malloc(size);
25 | B = (int*)malloc(size);
26 |
27 | for (int i = 0; i < n; i++)
28 | {
29 | B[i] = rand()%101;
30 | std::cout << B[i] << std::endl;
31 | }
32 |
33 | std::cout <<"----------------------------" << std::endl;
34 |
35 | // GPU端分配内存
36 | mcMalloc((void**)&Ad, size);
37 | mcMalloc((void**)&Bd, size);
38 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice);
39 |
40 | // 定义kernel执行配置(本例实际以<<<1, 64>>>启动,下面的dimBlock和dimGrid并未被使用)
41 | dim3 dimBlock(128);
42 | dim3 dimGrid(1000);
43 |
44 | // 执行kernel
45 | test_shfl_down_sync <<<1, 64 >>> (Ad,Bd);
46 |
47 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost);
48 |
49 | // 校验误差
50 | float max_error = 0.0;
51 | for (int i = 0; i < 64; i++)
52 | {
53 | std::cout << A[i] << std::endl;
54 | }
55 |
56 | cout << "max error is " << max_error << endl;
57 |
58 | // 释放CPU端、GPU端的内存
59 | free(A);
60 | free(B);
61 | mcFree(Ad);
62 | mcFree(Bd);
63 |
64 | return 0;
65 | }
66 |
--------------------------------------------------------------------------------
/chapter6/__shfl_syncExample.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | using namespace std;
5 |
6 | __global__ void test_shfl_sync(int A[], int B[])
7 | {
8 | int tid = threadIdx.x;
9 | int value = B[tid];
10 |
11 | value = __shfl_sync(0xffffffffffffffff, value, 2);
12 | A[tid] = value;
13 | }
14 |
15 | int main()
16 | {
17 | int *A,*Ad, *B, *Bd;
18 | int n = 64;
19 | int size = n * sizeof(int);
20 |
21 | // CPU端分配内存
22 | A = (int*)malloc(size);
23 | B = (int*)malloc(size);
24 |
25 | for (int i = 0; i < n; i++)
26 | {
27 | B[i] = rand()%101;
28 | std::cout << B[i] << std::endl;
29 | }
30 |
31 | std::cout <<"----------------------------" << std::endl;
32 |
33 | // GPU端分配内存
34 | mcMalloc((void**)&Ad, size);
35 | mcMalloc((void**)&Bd, size);
36 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice);
37 |
38 | // 定义kernel执行配置(本例实际以<<<1, 64>>>启动,下面的dimBlock和dimGrid并未被使用)
39 | dim3 dimBlock(128);
40 | dim3 dimGrid(1000);
41 |
42 | // 执行kernel
43 | test_shfl_sync <<<1, 64 >>> (Ad,Bd);
44 |
45 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost);
46 |
47 | // 校验误差
48 | float max_error = 0.0;
49 | for (int i = 0; i < 64; i++)
50 | {
51 | std::cout << A[i] << std::endl;
52 | }
53 |
54 | cout << "max error is " << max_error << endl;
55 |
56 | // 释放CPU端、GPU端的内存
57 | free(A);
58 | free(B);
59 | mcFree(Ad);
60 | mcFree(Bd);
61 |
62 | return 0;
63 | }
64 |
--------------------------------------------------------------------------------
/chapter6/__shfl_up_syncExample.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | using namespace std;
5 |
6 | __global__ void test_shfl_up_sync(int A[], int B[])
7 | {
8 | int tid = threadIdx.x;
9 | int value = B[tid];
10 |
11 | value = __shfl_up_sync(0xffffffffffffffff, value, 2);
12 | A[tid] = value;
13 |
14 | }
15 |
16 |
17 | int main()
18 | {
19 | int *A,*Ad, *B, *Bd;
20 | int n = 64;
21 | int size = n * sizeof(int);
22 |
23 | // CPU端分配内存
24 | A = (int*)malloc(size);
25 | B = (int*)malloc(size);
26 |
27 | for (int i = 0; i < n; i++)
28 | {
29 | B[i] = rand()%101;
30 | std::cout << B[i] << std::endl;
31 | }
32 |
33 | std::cout <<"----------------------------" << std::endl;
34 |
35 | // GPU端分配内存
36 | mcMalloc((void**)&Ad, size);
37 | mcMalloc((void**)&Bd, size);
38 | mcMemcpy(Bd, B, size, mcMemcpyHostToDevice);
39 |
40 | // 定义kernel执行配置(本例实际以<<<1, 64>>>启动,下面的dimBlock和dimGrid并未被使用)
41 | dim3 dimBlock(128);
42 | dim3 dimGrid(1000);
43 |
44 | // 执行kernel
45 | test_shfl_up_sync <<<1, 64 >>> (Ad,Bd);
46 |
47 | mcMemcpy(A, Ad, size, mcMemcpyDeviceToHost);
48 |
49 | // 校验误差
50 | float max_error = 0.0;
51 | for (int i = 0; i < 64; i++)
52 | {
53 | std::cout << A[i] << std::endl;
54 | }
55 |
56 | cout << "max error is " << max_error << endl;
57 |
58 | // 释放CPU端、GPU端的内存
59 | free(A);
60 | free(B);
61 | mcFree(Ad);
62 | mcFree(Bd);
63 |
64 | return 0;
65 | }
66 |
--------------------------------------------------------------------------------
/chapter6/__shfl_xor_syncExample.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 |
4 | __global__ void waveReduce() {
5 | int laneId = threadIdx.x & 0x3f;
6 | // Seed starting value as inverse lane ID
7 | int value = 63 - laneId;
8 |
9 | // Use XOR mode to perform butterfly reduction
10 | for (int i=1; i<64; i*=2)
11 | value += __shfl_xor_sync(0xffffffffffffffff, value, i, 64);
12 |
13 | // "value" now contains the sum across all threads
14 | printf("Thread %d final value = %d\n", threadIdx.x, value);
15 | }
16 |
17 | int main() {
18 | waveReduce<<< 1, 64 >>>();
19 | mcDeviceSynchronize();
20 | return 0;
21 | }
22 |
--------------------------------------------------------------------------------
/chapter6/checkGlobalVariable.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 |
4 | __device__ float devData;
5 | __global__ void checkGlobalVariable(){
6 | printf("Device: the value of the global variable is %f\n", devData);
7 | devData += 2.0;
8 | }
9 |
10 | int main(){
11 | float value = 3.14f;
12 | mcMemcpyToSymbol(devData, &value, sizeof(float));
13 | printf("Host: copy %f to the global variable\n", value);
14 | checkGlobalVariable<<<1,1>>>();
15 | mcMemcpyFromSymbol(&value, devData, sizeof(float));
16 | printf("Host: the value changed by the kernel to %f\n", value);
17 | mcDeviceReset();
18 | return EXIT_SUCCESS;
19 | }
20 |
--------------------------------------------------------------------------------
/chapter6/information.cpp:
--------------------------------------------------------------------------------
1 | #include
2 |
3 | int main( void ) {
4 | mcDeviceProp_t prop;
5 |
6 | int count;
7 | mcGetDeviceCount( &count );
8 | for (int i=0; i< count; i++) {
9 | mcGetDeviceProperties( &prop, i );
10 | printf( " --- Memory Information for device %d ---\n", i );
11 | printf( "Total global mem: %ld[bytes]\n", prop.totalGlobalMem );
12 | printf( "Total constant Mem: %ld[bytes]\n", prop.totalConstMem );
13 | printf( "Max mem pitch: %ld[bytes]\n", prop.memPitch );
14 | printf( "Texture alignment: %ld[bytes]\n", prop.textureAlignment );
15 | printf( "Shared mem per AP: %ld[bytes]\n",prop.sharedMemPerBlock );
16 | printf( "Registers per AP: %d[bytes]\n", prop.regsPerBlock );
17 | printf( "\n" );
18 | }
19 | }
20 |
--------------------------------------------------------------------------------
/chapter6/vectorAddUnifiedVirtualAddressing.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | #include
5 |
6 | using namespace std;
7 |
8 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N)
9 | {
10 | int i = threadIdx.x + blockDim.x * blockIdx.x;
11 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f;
12 | }
13 |
14 | int main(int argc, char *argv[]) {
15 |
16 | int n = atoi(argv[1]);
17 | cout << n << endl;
18 |
19 | size_t size = n * sizeof(float);
20 | mcError_t err;
21 |
22 | // Allocate the host vectors of A&B&C
23 | unsigned int flag = mcMallocHostPortable;
24 | float *a = NULL;
25 | float *b = NULL;
26 | float *c = NULL;
27 | err = mcMallocHost((void**)&a, size, flag);
28 | err = mcMallocHost((void**)&b, size, flag);
29 | err = mcMallocHost((void**)&c, size, flag);
30 |
31 | // Initialize the host vectors of A&B
32 | for (int i = 0; i < n; i++) {
33 | float af = rand() / double(RAND_MAX);
34 | float bf = rand() / double(RAND_MAX);
35 | a[i] = af;
36 | b[i] = bf;
37 | }
38 |
39 | // Launch the vector add kernel
40 | struct timeval t1, t2;
41 | int threadPerBlock = 256;
42 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock;
43 | printf("threadPerBlock: %d \nblockPerGrid: %d \n",threadPerBlock,blockPerGrid);
44 | gettimeofday(&t1, NULL);
45 | vectorAdd<<< blockPerGrid, threadPerBlock >>> (a, b, c, n);
46 | gettimeofday(&t2, NULL);
47 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
48 | cout << timeuse << endl;
49 |
50 | // Free host memory
51 | err = mcFreeHost(a);
52 | err = mcFreeHost(b);
53 | err = mcFreeHost(c);
54 |
55 | return 0;
56 | }
57 |
--------------------------------------------------------------------------------
/chapter6/vectorAddZerocopy.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | #include
5 |
6 | using namespace std;
7 |
8 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N)
9 | {
10 | int i = threadIdx.x + blockDim.x * blockIdx.x;
11 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f;
12 | }
13 |
14 | int main(int argc, char *argv[]) {
15 |
16 | int n = atoi(argv[1]);
17 | cout << n << endl;
18 |
19 | size_t size = n * sizeof(float);
20 | mcError_t err;
21 |
22 | // Allocate the host vectors of A&B&C
23 | unsigned int flag = mcMallocHostMapped;
24 | float *a = NULL;
25 | float *b = NULL;
26 | float *c = NULL;
27 | err = mcMallocHost((void**)&a, size, flag);
28 | err = mcMallocHost((void**)&b, size, flag);
29 | err = mcMallocHost((void**)&c, size, flag);
30 |
31 | // Initialize the host vectors of A&B
32 | for (int i = 0; i < n; i++) {
33 | float af = rand() / double(RAND_MAX);
34 | float bf = rand() / double(RAND_MAX);
35 | a[i] = af;
36 | b[i] = bf;
37 | }
38 |
39 | // Get the pointer in device on the vectors of A&B&C
40 | float *da = NULL;
41 | float *db = NULL;
42 | float *dc = NULL;
43 | err = mcHostGetDevicePointer((void**)&da, (void *)a, 0);
44 | err = mcHostGetDevicePointer((void**)&db, (void *)b, 0);
45 | err = mcHostGetDevicePointer((void**)&dc, (void *)c, 0);
46 |
47 | // Launch the vector add kernel
48 | struct timeval t1, t2;
49 | int threadPerBlock = 256;
50 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock;
51 | printf("threadPerBlock: %d \nblockPerGrid: %d \n",threadPerBlock,blockPerGrid);
52 | gettimeofday(&t1, NULL);
53 | vectorAdd<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n);
54 | gettimeofday(&t2, NULL);
55 | double timeuse = (t2.tv_sec - t1.tv_sec)
56 | + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
57 | cout << timeuse << endl;
58 |
59 | // Free host memory
60 | err = mcFreeHost(a);
61 | err = mcFreeHost(b);
62 | err = mcFreeHost(c);
63 |
64 | return 0;
65 | }
66 |
--------------------------------------------------------------------------------
/chapter7/Makefile.txt:
--------------------------------------------------------------------------------
1 | # MXMACA Compiler
2 | MXCC = $(MACA_PATH)/mxgpu_llvm/bin/mxcc
3 |
4 | # Compiler flags
5 | MXCCFLAGS = -xmaca
6 |
7 | # Source files
8 | SRCS= main.cpp src/a.cpp src/b.cpp
9 |
10 | # Object files
11 | OBJS = $(SRCS:.cpp=.o)
12 |
13 | # Executable
14 | EXEC = my_program
15 |
16 | # Default target
17 | all: $(EXEC)
18 |
19 | # Link object files to create executable
20 | $(EXEC): $(OBJS)
21 | $(MXCC) $(OBJS) -o $(EXEC)
22 |
23 | %.o: %.cpp
24 | $(MXCC) $(MXCCFLAGS) -c $< -o $@ -I include
25 |
26 | # clean up object files and executable
27 | clean:
28 | rm -f $(OBJS) $(EXEC)
29 |
--------------------------------------------------------------------------------
/chapter7/my_program/CMakeLists.txt:
--------------------------------------------------------------------------------
1 | # Specify the minimum CMake version required
2 | cmake_minimum_required(VERSION 3.0)
3 |
4 | # Set the project name
5 | project(my_program)
6 |
7 | # Set the path to the compiler
8 | set(MXCC_PATH $ENV{MACA_PATH})
9 | set(CMAKE_CXX_COMPILER ${MXCC_PATH}/mxgpu_llvm/bin/mxcc)
10 |
11 | # Set the compiler flags
12 | set(MXCC_COMPILE_FLAGS -x maca)
13 | add_compile_options(${MXCC_COMPILE_FLAGS})
14 |
15 | # Add source files
16 | File(GLOB SRCS src/*.cpp main.cpp)
17 | add_executable(my_program ${SRCS})
18 |
19 | # Set the include paths
20 | target_include_directories(my_program PRIVATE include)
21 |
--------------------------------------------------------------------------------
/chapter7/my_program/include/a.h:
--------------------------------------------------------------------------------
1 | extern void func_a();
--------------------------------------------------------------------------------
/chapter7/my_program/include/b.h:
--------------------------------------------------------------------------------
1 | extern void func_b();
--------------------------------------------------------------------------------
/chapter7/my_program/main.cpp:
--------------------------------------------------------------------------------
1 | //main.cpp:
2 | #include
3 | #include "a.h"
4 | #include "b.h"
5 | int main()
6 | {
7 | func_a();
8 | func_b();
9 | printf("my program!\n");
10 | return 1;
11 | }
12 |
--------------------------------------------------------------------------------
/chapter7/my_program/src/a.cpp:
--------------------------------------------------------------------------------
1 | //a.cpp:
2 | #include
3 | #include
4 | extern "C" __global__ void vector_add(int *A_d, size_t num)
5 | {
6 | size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
7 | size_t stride = blockDim.x * gridDim.x;
8 | for (size_t i = offset; i < num; i += stride) {
9 | A_d[i]++;
10 | }
11 | }
12 | void func_a()
13 | {
14 | size_t arrSize = 100;
15 | mcDeviceptr_t a_d;
16 | int *a_h = (int *)malloc(sizeof(int) * arrSize);
17 | memset(a_h, 0, sizeof(int) * arrSize);
18 | mcMalloc(&a_d, sizeof(int) * arrSize);
19 | mcMemcpyHtoD(a_d, a_h, sizeof(int) * arrSize);
20 | vector_add<<<1, arrSize>>>(reinterpret_cast<int *>(a_d), arrSize);
21 | mcMemcpyDtoH(a_h, a_d, sizeof(int) * arrSize);
22 | bool resCheck = true;
23 | for (int i = 0; i < arrSize; i++) {
24 | if (a_h[i] != 1){
25 | resCheck = false;
26 | }
27 | }
28 | printf("vector add result: %s\n", resCheck ? "success": "fail");
29 | free(a_h);
30 | mcFree(a_d);
31 | }
32 |
33 | //a.h:
34 | extern void func_a();
35 |
--------------------------------------------------------------------------------
/chapter7/my_program/src/b.cpp:
--------------------------------------------------------------------------------
1 | //b.cpp:
2 | #include
3 | __global__ void kernel_b()
4 | {
5 | /* kernel code*/
6 | }
7 | void func_b()
8 | {
9 | /* launch kernel */
10 | kernel_b<<<1, 1>>>();
11 | }
12 |
13 | //b.h:
14 | extern void func_b();
15 |
--------------------------------------------------------------------------------
/chapter7/trigger_memory_violation.cpp:
--------------------------------------------------------------------------------
1 | #include
2 |
3 | typedef struct
4 | {
5 | alignas(4)float f;
6 | double d;
7 | }__attribute__((packed)) test_type_mem_violation;
8 |
9 | __global__ void trigger_memory_violation(test_type_mem_violation *dst)
10 | {
11 | atomicAdd(&dst->f,1.23);
12 | atomicAdd(&dst->d,20);
13 | dst->f=9.8765;
14 | }
15 |
16 | int main()
17 | {
18 | test_type_mem_violation hd={0};
19 | test_type_mem_violation *ddd;
20 | mcMalloc((void**)&ddd,sizeof(test_type_mem_violation));
21 | mcMemcpy(ddd,&hd,sizeof(test_type_mem_violation),mcMemcpyHostToDevice);
22 | trigger_memory_violation<<>>(ddd);
23 | mcMemcpy(&hd,ddd,sizeof(test_type_mem_violation),mcMemcpyDeviceToHost);
24 | mcFree(ddd);
25 | return 0;
26 | }
27 |
--------------------------------------------------------------------------------
/chapter7/trigger_memory_violation_repaired.cpp:
--------------------------------------------------------------------------------
1 | #include
2 |
3 | typedef struct
4 | {
5 | float f;
6 | double d;
7 | }test_type_mem_violation;
8 |
9 | __global__ void trigger_memory_violation(test_type_mem_violation *dst)
10 | {
11 | atomicAdd(&dst->f,1.23);
12 | atomicAdd(&dst->d,20);
13 | dst->f=9.8765;
14 | }
15 |
16 | int main()
17 | {
18 | test_type_mem_violation hd={0};
19 | test_type_mem_violation *ddd;
20 | mcMalloc((void**)&ddd,sizeof(test_type_mem_violation));
21 | mcMemcpy(ddd,&hd,sizeof(test_type_mem_violation),mcMemcpyHostToDevice);
22 | trigger_memory_violation<<>>(ddd);
23 | mcMemcpy(&hd,ddd,sizeof(test_type_mem_violation),mcMemcpyDeviceToHost);
24 | mcFree(ddd);
25 | return 0;
26 | }
27 |
--------------------------------------------------------------------------------
/chapter7/vectorAdd.cpp:
--------------------------------------------------------------------------------
1 | #include
2 |
3 | __global__ void vectorADD(const float* A_d, const float* B_d, float* C_d, size_t NELEM) {
4 | size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
5 | size_t stride = blockDim.x * gridDim.x;
6 |
7 | for (size_t i = offset; i < NELEM; i += stride) {
8 | C_d[i] = A_d[i] + B_d[i];
9 | }
10 | }
11 |
12 | int main()
13 | {
14 | int blocks=20;
15 | int threadsPerBlock=1024;
16 | int numSize=1024*1024;
17 |
18 | float *A_d=nullptr;
19 | float *B_d=nullptr;
20 | float *C_d=nullptr;
21 |
22 | float *A_h=nullptr;
23 | float *B_h=nullptr;
24 | float *C_h=nullptr;
25 |
26 | mcMalloc((void**)&A_d,numSize*sizeof(float));
27 | mcMalloc((void**)&B_d,numSize*sizeof(float));
28 | mcMalloc((void**)&C_d,numSize*sizeof(float));
29 |
30 | A_h=(float*)malloc(numSize*sizeof(float));
31 | B_h=(float*)malloc(numSize*sizeof(float));
32 | C_h=(float*)malloc(numSize*sizeof(float));
33 |
34 | for(int i=0;i>>(A_d,B_d,C_d,numSize);
45 |
46 | mcMemcpy(C_h,C_d,numSize*sizeof(float),mcMemcpyDeviceToHost);
47 |
48 | mcFree(A_d);
49 | mcFree(B_d);
50 | mcFree(C_d);
51 |
52 | free(A_h);
53 | free(B_h);
54 | free(C_h);
55 |
56 | return 0;
57 | }
58 |
--------------------------------------------------------------------------------
/chapter8/myKernel.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | // #include "device_launch_parameters.h"
5 |
6 | __global__ void myKernel(float* devPtr, int height, int width, int pitch)
7 | {
8 | int row, col;
9 | float *rowHead;
10 |
11 | for (row = 0; row < height; row++)
12 | {
13 | rowHead = (float*)((char*)devPtr + row * pitch);
14 |
15 | for (col = 0; col < width; col++)
16 | {
17 | printf("\t%f", rowHead[col]);// 逐个打印并自增 1
18 | rowHead[col]++;
19 | }
20 | printf("\n");
21 | }
22 | }
23 |
24 | int main()
25 | {
26 | size_t width = 6;
27 | size_t height = 5;
28 | float *h_data, *d_data;
29 | size_t pitch;
30 |
31 | h_data = (float *)malloc(sizeof(float)*width*height);
32 | for (int i = 0; i < width*height; i++)
33 | h_data[i] = (float)i;
34 |
35 | printf("\n\tAlloc memory.");
36 | mcMallocPitch((void **)&d_data, &pitch, sizeof(float)*width, height);
37 | printf("\n\tPitch = %d B\n", pitch);
38 |
39 | printf("\n\tCopy to Device.\n");
40 | mcMemcpy2D(d_data, pitch, h_data, sizeof(float)*width, sizeof(float)*width, height, mcMemcpyHostToDevice);
41 |
42 | myKernel <<<1, 1 >>> (d_data, height, width, pitch);
43 | mcDeviceSynchronize();
44 |
45 | printf("\n\tCopy back to Host.\n");
46 | mcMemcpy2D(h_data, sizeof(float)*width, d_data, pitch, sizeof(float)*width, height, mcMemcpyDeviceToHost);
47 |
48 | for (int i = 0; i < width*height; i++)
49 | {
50 | printf("\t%f", h_data[i]);
51 | if ((i + 1) % width == 0)
52 | printf("\n");
53 | }
54 |
55 | free(h_data);
56 | mcFree(d_data);
57 |
58 | getchar();
59 | return 0;
60 | }
61 |
--------------------------------------------------------------------------------
/chapter8/stream_parallel_execution.cpp:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | #define FULL_DATA_SIZE 10000
5 | #define N 1000
6 | #define BLOCKNUM 16
7 | #define THREADNUM 64
8 |
9 | __global__ void kernel(int *a,int *b,int *c){
10 | int idx=threadIdx.x+blockIdx.x*blockDim.x;
11 | if (idx>>(dev0_a, dev0_b, dev0_c);
76 |
77 | kernel <<>>(dev1_a, dev1_b, dev1_c);
78 |
79 | mcStatus = mcMemcpyAsync(host_c + i, dev0_c, N * sizeof(int),
80 | mcMemcpyDeviceToHost, stream0);
81 | if (mcStatus != mcSuccess)
82 | {
83 | printf("mcMemcpyAsync0 c failed!\n");
84 | }
85 |
86 | mcStatus = mcMemcpyAsync(host_c + N + i, dev1_c, N * sizeof(int),
87 | mcMemcpyDeviceToHost, stream1);
88 | if (mcStatus != mcSuccess)
89 | {
90 | printf("mcMemcpyAsync1 c failed!\n");
91 | }
92 | }
93 | for(i=0;i<20;i++){
94 | printf("%d ",host_a[i]);
95 | }
96 | printf("\n");
97 | for(i=0;i<20;i++){
98 | printf("%d ",host_b[i]);
99 | }
100 | printf("\n");
101 | for(i=0;i<20;i++){
102 | printf("%d ",host_c[i]);
103 | }
104 | printf("\n");
105 | mcStreamSynchronize(stream1);
106 | mcStreamSynchronize(stream0);
107 | mcStreamDestroy(stream1);
108 | mcStreamDestroy(stream0);
109 | mcFree(dev0_a);
110 | mcFree(dev1_a);
111 | mcFree(dev0_b);
112 | mcFree(dev1_b);
113 | mcFree(dev0_c);
114 | mcFree(dev1_c);
115 | free(host_a);
116 | free(host_b);
117 | free(host_c);
118 | }
119 |
--------------------------------------------------------------------------------
/chapter9/shortKernelsAsyncLaunch.cpp:
--------------------------------------------------------------------------------
1 | /*
2 | * 9.4.1: 1) lots of short kernels launched asynchronously
3 | * 9.4.1 {Sample#2} lots of short kernels launched asynchronously
4 | * Usage:
5 | * 1) compiling: mxcc -x maca shortKernelsAsyncLaunch.cpp -o shortKernelsAsyncLaunch
6 | * 2) running:./shortKernelsAsyncLaunch
7 | */
8 | #include
9 | #include
10 | #include "mc_runtime.h"
11 |
12 | #define macaCheckErrors(msg) \
13 | do { \
14 | mcError_t __err = mcGetLastError(); \
15 | if (__err != mcSuccess) { \
16 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
17 | msg, mcGetErrorString(__err), \
18 | __FILE__, __LINE__); \
19 | fprintf(stderr, "*** FAILED - ABORTING\n"); \
20 | exit(1); \
21 | } \
22 | } while (0)
23 |
24 |
25 | #include
26 | #include
27 | #define USECPSEC 1000000ULL
28 |
29 | unsigned long long dtime_usec(unsigned long long start){
30 | timeval tv;
31 | gettimeofday(&tv, 0);
32 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
33 | }
34 |
35 | #define N 400000 // tuned until kernel takes a few microseconds
36 | __global__ void shortKernel(float * out_d, float * in_d){
37 | int idx=blockIdx.x*blockDim.x+threadIdx.x;
38 | if(idx>>(d_output, d_input);
58 | macaCheckErrors("kernel launch failure");
59 | mcDeviceSynchronize();
60 | macaCheckErrors("kernel execution failure");
61 | // run on device and measure execution time
62 | unsigned long long dt = dtime_usec(0);
63 | dt = dtime_usec(0);
64 | for(int istep=0; istep>>(d_output, d_input);
67 | }
68 | }
69 | mcStreamSynchronize(stream);
70 |
71 | macaCheckErrors("kernel execution failure");
72 | dt = dtime_usec(dt);
73 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." << std::endl;
74 | return 0;
75 | }
--------------------------------------------------------------------------------
/chapter9/shortKernelsGraphLaunch.cpp:
--------------------------------------------------------------------------------
1 | /*
2 | * 9.4.1 {Sample#3} lots of short kernels launched by graph APIs
3 | * Usage:
4 | * 1) compiling: mxcc -x maca shortKernelsGraphLaunch.cpp -o shortKernelsGraphLaunch
5 | * 2) setting: export MACA_GRAPH_LAUNCH_MODE=1
6 | * 3) running:./shortKernelsGraphLaunch
7 | */
8 | #include
9 | #include
10 | #include "mc_runtime.h"
11 |
12 | #define macaCheckErrors(msg) \
13 | do { \
14 | mcError_t __err = mcGetLastError(); \
15 | if (__err != mcSuccess) { \
16 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
17 | msg, mcGetErrorString(__err), \
18 | __FILE__, __LINE__); \
19 | fprintf(stderr, "*** FAILED - ABORTING\n"); \
20 | exit(1); \
21 | } \
22 | } while (0)
23 |
24 |
25 | #include
26 | #include
27 | #define USECPSEC 1000000ULL
28 |
29 | unsigned long long dtime_usec(unsigned long long start){
30 | timeval tv;
31 | gettimeofday(&tv, 0);
32 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
33 | }
34 |
35 | #define N 400000 // tuned until kernel takes a few microseconds
36 | __global__ void shortKernel(float * out_d, float * in_d){
37 | int idx=blockIdx.x*blockDim.x+threadIdx.x;
38 | if(idx>>(d_output, d_input);
58 | macaCheckErrors("kernel launch failure");
59 | mcDeviceSynchronize();
60 | macaCheckErrors("kernel execution failure");
61 | // run on device and measure execution time
62 | unsigned long long dt = dtime_usec(0);
63 | dt = dtime_usec(0);
64 | bool graphCreated=false;
65 | mcGraph_t graph;
66 | mcGraphExec_t instance;
67 | for(int istep=0; istep>>(d_output, d_input);
72 | }
73 | mcStreamEndCapture(stream, &graph);
74 | mcGraphInstantiate(&instance, graph, NULL, NULL, 0);
75 | graphCreated=true;
76 | }
77 | mcGraphLaunch(instance, stream);
78 | mcStreamSynchronize(stream);
79 | }
80 | macaCheckErrors("kernel execution failure");
81 | dt = dtime_usec(dt);
82 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." << std::endl;
83 | return 0;
84 | }
--------------------------------------------------------------------------------
/chapter9/shortKernelsSyncLaunch.cpp:
--------------------------------------------------------------------------------
1 | /*
2 | * 9.4.1 {Sample#1} lots of short kernels launched synchronously
3 | * Usage:
4 | * 1) compiling: mxcc -x maca shortKernelsSyncLaunch.cpp -o shortKernelsSyncLaunch
5 | * 2) running:./shortKernelsSyncLaunch
6 | */
7 | #include
8 | #include
9 | #include "mc_runtime.h"
10 |
11 | #define macaCheckErrors(msg) \
12 | do { \
13 | mcError_t __err = mcGetLastError(); \
14 | if (__err != mcSuccess) { \
15 | fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
16 | msg, mcGetErrorString(__err), \
17 | __FILE__, __LINE__); \
18 | fprintf(stderr, "*** FAILED - ABORTING\n"); \
19 | exit(1); \
20 | } \
21 | } while (0)
22 |
23 |
24 | #include
25 | #include
26 | #define USECPSEC 1000000ULL
27 |
28 | unsigned long long dtime_usec(unsigned long long start){
29 | timeval tv;
30 | gettimeofday(&tv, 0);
31 | return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
32 | }
33 |
34 | #define N 400000 // tuned until kernel takes a few microseconds
35 | __global__ void shortKernel(float * out_d, float * in_d){
36 | int idx=blockIdx.x*blockDim.x+threadIdx.x;
37 | if(idx>>(d_output, d_input);
57 | macaCheckErrors("kernel launch failure");
58 | mcDeviceSynchronize();
59 | macaCheckErrors("kernel execution failure");
60 | // run on device and measure execution time
61 | unsigned long long dt = dtime_usec(0);
62 | dt = dtime_usec(0);
63 | for(int istep=0; istep>>(d_output, d_input);
66 | mcStreamSynchronize(stream);
67 | }
68 | }
69 | macaCheckErrors("kernel execution failure");
70 | dt = dtime_usec(dt);
71 | std::cout << "Kernel execution time: total=" << dt/(float)USECPSEC << "s, perKernelInAvg=" << 1000*1000*dt/NKERNEL/NSTEP/(float)USECPSEC << "us." << std::endl;
72 | return 0;
73 | }
--------------------------------------------------------------------------------
/common/common.h:
--------------------------------------------------------------------------------
1 | #include
2 |
3 | #ifndef _COMMON_H
4 | #define _COMMON_H
5 |
6 | #define CHECK(call) \
7 | { \
8 | const mcError_t error = call; \
9 | if (error != mcSuccess) \
10 | { \
11 | fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__); \
12 | fprintf(stderr, "code: %d, reason: %s\n", error, \
13 | mcGetErrorString(error)); \
14 | } \
15 | }
16 |
17 | inline double seconds()
18 | {
19 | struct timeval tp;
20 | struct timezone tzp;
21 | int i = gettimeofday(&tp, &tzp);
22 | return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
23 | }
24 |
25 | #endif // _COMMON_H
26 |
--------------------------------------------------------------------------------
/习题运行结果/3.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/3.1.png
--------------------------------------------------------------------------------
/习题运行结果/3.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/3.2.png
--------------------------------------------------------------------------------
/习题运行结果/5.2.9.1运行结果/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/1.png
--------------------------------------------------------------------------------
/习题运行结果/5.2.9.1运行结果/2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/2.png
--------------------------------------------------------------------------------
/习题运行结果/5.2.9.1运行结果/3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.1运行结果/3.png
--------------------------------------------------------------------------------
/习题运行结果/5.2.9.2运行结果/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/1.png
--------------------------------------------------------------------------------
/习题运行结果/5.2.9.2运行结果/2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/2.png
--------------------------------------------------------------------------------
/习题运行结果/5.2.9.2运行结果/3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/5.2.9.2运行结果/3.png
--------------------------------------------------------------------------------
/习题运行结果/T4运行结果.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bxttttt/getting-started-guide-and-introduction-to-MXMACA/59c1ae53649538582bd5a577dcb9cda7efa83e2a/习题运行结果/T4运行结果.png
--------------------------------------------------------------------------------
/习题运行结果/answer.md:
--------------------------------------------------------------------------------
1 | # new answer
2 |
3 | ## Chapter 2
4 |
5 | ### Exercise 1
6 |
7 | #### 参考代码
8 |
9 | ```c
10 | #include
11 | #include
12 | #include
13 |
14 | __global__ void helloFromGpu (void)
15 |
16 | {
17 | printf("Hello World from GPU!\n");
18 | }
19 |
20 | int main(void)
21 | {
22 | printf("Hello World from CPU!\n");
23 | helloFromGpu <<<1, 10>>>();
24 | return 0;
25 | }
26 | ```
27 |
28 | #### 编译结果
29 |
30 | 函数mcDeviceReset()用来显式销毁并清除与当前设备有关的所有资源。
31 |
32 | 当移除该重置函数后,再编译运行则只输出
33 |
34 | ```
35 | Hello World from CPU!
36 | ```
37 |
38 | 当printf在GPU上被调用时,mcDeviceReset()函数会把这些来自GPU的输出刷新到主机端,然后在控制台显示。
39 |
40 | 如果没有调用mcDeviceReset()函数,就不能保证这些输出一定会被显示。
41 |
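下面给出一个最小示意(在上面的参考代码基础上仅补回重置函数调用,头文件写法与第9章示例一致),用于对照说明 mcDeviceReset() 对 GPU 端 printf 输出的影响:

```c
#include <stdio.h>
#include "mc_runtime.h"   // MXMACA运行时头文件,写法与第9章示例一致

__global__ void helloFromGpu(void)
{
    printf("Hello World from GPU!\n");
}

int main(void)
{
    printf("Hello World from CPU!\n");
    helloFromGpu<<<1, 10>>>();
    // 显式销毁并清除与当前设备有关的所有资源,
    // 同时把GPU端printf的输出刷新到主机控制台
    mcDeviceReset();
    return 0;
}
```

按照上文的解释,补上这一行后,GPU侧的输出也会出现在控制台上。
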
42 | ### Exercise 2
43 |
44 | #### 参考代码
45 |
46 | ```c
47 | #include
48 | #include
49 | #include
50 |
51 | __global__ void helloFromGpu (void)
52 | {
53 | printf("Hello World from GPU!\n");
54 | }
55 |
56 | int main(void)
57 | {
58 | printf("Hello World from CPU!\n");
59 |
60 | helloFromGpu <<<1, 10>>>();
61 | mcDeviceSynchronize();
62 | return 0;
63 | }
64 |
65 | ```
66 |
67 | #### 编译结果
68 |
69 | ```
70 | Hello World from CPU!
71 | Hello World from GPU!
72 | Hello World from GPU!
73 | Hello World from GPU!
74 | Hello World from GPU!
75 | Hello World from GPU!
76 | Hello World from GPU!
77 | Hello World from GPU!
78 | Hello World from GPU!
79 | Hello World from GPU!
80 | Hello World from GPU!
81 | ```
82 |
83 | 输出效果和helloFromGpu.c一样。
84 |
85 | mcDeviceSynchronize()也可以用来使gpu的输出打印在用户可见控制台。
86 |
87 | ### Exercise 3
88 |
89 | #### 参考代码
90 |
91 | ```c
92 | #include
93 | #include
94 | #include
95 |
96 | __global__ void helloFromGpu (void)
97 | {
98 | if (threadIdx.x==9) printf("Hello World from GPU Thread 9!\n");
99 | }
100 | int main(void)
101 | {
102 | printf("Hello World from CPU!\n");
103 | helloFromGpu <<<1, 10>>>();
104 | mcDeviceReset();
105 | return 0;
106 | }
107 | ```
108 |
109 | ## Chapter 3
110 |
111 | ### Exercise 1
112 |
113 | #### 参考代码
114 |
115 | ```c++
116 | #include
117 | #include
118 | #include
119 | #include
120 |
121 | using namespace std;
122 |
123 | // 要用 __global__ 来修饰。
124 | // 输入指向3段显存的指针名。
125 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N)
126 | {
127 | int i = threadIdx.x + blockDim.x * blockIdx.x;
128 | // printf("threadIdx.x:%d blockDim.x:%d blockIdx.x:%d\n",threadIdx.x,blockDim.x,blockIdx.x);
129 | if (i < N) C_d[i] = A_d[i] + B_d[i];
130 | }
131 |
132 | int main(int argc, char *argv[]) {
133 |
134 | int n = 2048;
135 | cout << n << endl;
136 |
137 | size_t size = n * sizeof(float);
138 |
139 | // host memory
140 | float *a = (float *)malloc(size);
141 | float *b = (float *)malloc(size);
142 | float *c = (float *)malloc(size);
143 |
144 | for (int i = 0; i < n; i++) {
145 | float af = rand() / double(RAND_MAX);
146 | float bf = rand() / double(RAND_MAX);
147 | a[i] = af;
148 | b[i] = bf;
149 | }
150 |
151 | // 定义空指针。
152 | float *da = NULL;
153 | float *db = NULL;
154 | float *dc = NULL;
155 |
156 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。
157 | mcMalloc((void **)&da, size);
158 | mcMalloc((void **)&db, size);
159 | mcMalloc((void **)&dc, size);
160 |
161 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。
162 | mcMemcpy(da,a,size,mcMemcpyHostToDevice);
163 | mcMemcpy(db,b,size,mcMemcpyHostToDevice);
164 |
165 | struct timeval t1, t2;
166 |
167 | // 计算线程块和网格的数量。
168 | int threadPerBlock_array[8]={1,16,32,64,128,256,512,1024};
169 | for(int i=0;i<8;i++){
170 | int threadPerBlock = threadPerBlock_array[i];
171 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock;
172 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid);
173 |
174 | gettimeofday(&t1, NULL);
175 |
176 | // 调用核函数。
177 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n);
178 |
179 | gettimeofday(&t2, NULL);
180 |
181 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost);
182 |
183 | // for (int i = 0; i < 10; i++)
184 | // cout<
207 |
208 | ### Exercise 2
209 |
210 | #### 参考代码
211 |
212 | ```c++
213 | #include
214 | #include
215 | #include
216 | #include
217 |
218 | using namespace std;
219 |
220 | // 要用 __global__ 来修饰。
221 | // 输入指向3段显存的指针名。
222 | __global__ void gpuVectorAddKernel(float* A_d,float* B_d,float* C_d, int N)
223 | {
224 | int i = threadIdx.x + blockDim.x * blockIdx.x;
225 | // printf("threadIdx.x:%d blockDim.x:%d blockIdx.x:%d\n",threadIdx.x,blockDim.x,blockIdx.x);
226 | if (i < N) C_d[i] = A_d[i] + B_d[i];
227 | }
228 |
229 | int main(int argc, char *argv[]) {
230 |
231 | int n = 256;
232 | cout << n << endl;
233 |
234 | size_t size = n * sizeof(float);
235 |
236 | // host memory
237 | float *a = (float *)malloc(size);
238 | float *b = (float *)malloc(size);
239 | float *c = (float *)malloc(size);
240 |
241 | for (int i = 0; i < n; i++) {
242 | float af = rand() / double(RAND_MAX);
243 | float bf = rand() / double(RAND_MAX);
244 | a[i] = af;
245 | b[i] = bf;
246 | }
247 |
248 | // 定义空指针。
249 | float *da = NULL;
250 | float *db = NULL;
251 | float *dc = NULL;
252 |
253 | // 申请显存,da 指向申请的显存,注意 mcMalloc 函数传入指针的指针 (指向申请得到的显存的指针)。
254 | mcMalloc((void **)&da, size);
255 | mcMalloc((void **)&db, size);
256 | mcMalloc((void **)&dc, size);
257 |
258 | // 把内存的东西拷贝到显存,也就是把 a, b, c 里面的东西拷贝到 d_a, d_b, d_c 中。
259 | mcMemcpy(da,a,size,mcMemcpyHostToDevice);
260 | mcMemcpy(db,b,size,mcMemcpyHostToDevice);
261 |
262 | struct timeval t1, t2;
263 |
264 | // 计算线程块和网格的数量。
265 | int threadPerBlock_array[2]={1,256};
266 | for(int i=0;i<2;i++){
267 | int threadPerBlock = threadPerBlock_array[i];
268 | int blockPerGrid = (n + threadPerBlock - 1)/threadPerBlock;
269 | printf("threadPerBlock: %d \nblockPerGrid: %d\n", threadPerBlock,blockPerGrid);
270 |
271 | gettimeofday(&t1, NULL);
272 |
273 | // 调用核函数。
274 | gpuVectorAddKernel<<< blockPerGrid, threadPerBlock >>> (da, db, dc, n);
275 |
276 | gettimeofday(&t2, NULL);
277 |
278 | mcMemcpy(c,dc,size,mcMemcpyDeviceToHost);
279 |
280 | // for (int i = 0; i < 10; i++)
281 | // cout<
303 |
304 | ### Exercise 3
305 |
306 | GPU执行单个数值计算的速度并不比CPU快:CPU更适合处理逻辑控制密集的计算任务,GPU更适合处理数据密集、可大规模并行的计算任务。
307 |
308 | ### Exercise 4
309 |
310 | #### 参考代码
311 |
312 | ```c
313 | #include
314 | #include
315 | #include
316 | #include
317 |
318 | using namespace std;
319 |
320 |
321 | __global__ void matrixMultiplication(int *A_d,int *B_d,int *Result_d,int width){
322 | int i=threadIdx.x+blockDim.x*blockIdx.x;
323 | int j=threadIdx.y+blockDim.y*blockIdx.y;
324 | int sum=0;
325 | int count;
326 | for(count=0;count>>(da,db,d_result,col);
357 | // 把显存的东西拷贝回内存
358 | mcMemcpy(result,d_result,sizeof(int)*row*col,mcMemcpyDeviceToHost);
359 | // print矩阵,这里row和col相等,所以统一用col表示
360 | int j;
361 | printf("a:\n");
362 | for(i=0;i
398 |
399 | ## Chapter 5
400 |
401 | ### 5.2.9
402 |
403 | #### Exercise 1
404 |
405 | ##### 参考代码
406 |
407 | ```c
408 | #include
409 | #include
410 | #include
411 | using namespace std;
412 |
413 |
414 | __global__ void print()
415 | {
416 | printf("blockIdx.x:%d threadIdx.x:%d\n",blockIdx.x, threadIdx.x);
417 | }
418 |
419 | int main(void)
420 | {
421 | const dim3 block_size(16);
422 | print<<<10, block_size>>>();
423 | mcDeviceSynchronize();
424 | return 0;
425 | }
426 |
427 |
428 | ```
429 |
430 | ##### 运行结果(一部分)
431 |
432 |
433 |
434 |
435 |
436 |
437 |
438 | 从输出可以看到:同一个wave内部thread的输出是按线程编号顺序出现的,而block之间的执行顺序是不确定的。
439 |
440 | 在MXMACA中,wave对程序员来说是透明的,它的大小可能会随着硬件的发展发生变化。在当前版本的MXMACA中,每个wave由64个thread组成,wave是MACA程序执行的最小调度单位,同一个wave内的thread以锁步方式执行同一条指令,因此其输出按线程编号依次出现。在一个SM中可能同时有来自不同block的wave:当一个block中的wave在进行访存或者同步等高延迟操作时,另一个block的wave可以占用SM中的计算资源,从而在SM内实现简单的乱序执行。不同block之间的执行没有固定顺序,完全并行;此外,一个wave中的thread只会来自同一个block,不会跨block组合。
441 |
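下面是一个最小示意(假设wave大小为64,与上文一致;编号的算法借用了本仓库 __shfl_xor_syncExample.cpp 中 `threadIdx.x & 0x3f` 的写法):每个thread打印自己所在的wave编号和wave内编号,可以直观看到同一wave内的输出按编号顺序出现,而不同block的先后顺序不固定。

```c
#include <stdio.h>
#include "mc_runtime.h"

__global__ void printWaveInfo()
{
    int waveId = threadIdx.x >> 6;     // 假设wave大小为64:相当于 threadIdx.x / 64
    int laneId = threadIdx.x & 0x3f;   // wave内编号:相当于 threadIdx.x % 64
    printf("blockIdx.x:%d waveId:%d laneId:%d\n", blockIdx.x, waveId, laneId);
}

int main(void)
{
    printWaveInfo<<<4, 128>>>();   // 每个block包含2个wave
    mcDeviceSynchronize();
    return 0;
}
```
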
442 | #### Exercise 2
443 |
444 | ##### 参考代码
445 |
446 | ```c
447 | #include
448 | #include
449 | #include
450 | using namespace std;
451 |
452 |
453 | __global__ void print()
454 | {
455 | printf("blockIdx.x:%d threadIdx.x:%d threadIdx.y:%d threadIdx.z:%d\n",blockIdx.x, threadIdx.x, threadIdx.y, threadIdx.z);
456 | }
457 |
458 | int main(void)
459 | {
460 | const dim3 block_size(16);
461 | print<<<10, block_size>>>();
462 | mcDeviceSynchronize();
463 | return 0;
464 | }
465 |
466 |
467 | ```
468 |
469 |
470 |
471 | ##### 运行结果
472 |
473 |
474 |
475 |
476 |
477 |
478 |
479 | 没有在block_size中定义的维度(这里是threadIdx.y和threadIdx.z)默认为0。
480 |
481 | 可以在定义block_size时对三个维度的size都进行设置(注意三者的乘积不可以超过maxThreadsPerBlock)。
482 |
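下面是一个最小示意(假设maxThreadsPerBlock不小于64):把block_size设置成三维后,threadIdx.y和threadIdx.z就不再恒为0。

```c
#include <stdio.h>
#include "mc_runtime.h"

__global__ void print()
{
    printf("blockIdx.x:%d threadIdx.x:%d threadIdx.y:%d threadIdx.z:%d\n",
           blockIdx.x, threadIdx.x, threadIdx.y, threadIdx.z);
}

int main(void)
{
    // 三个维度的乘积为4*4*4=64,不超过maxThreadsPerBlock
    const dim3 block_size(4, 4, 4);
    print<<<2, block_size>>>();
    mcDeviceSynchronize();
    return 0;
}
```
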
483 | ### 5.4.4(待更正)
484 |
485 | #### Exercise 1
486 |
487 | ##### 参考代码
488 |
489 | ```c
490 | // #include
491 | #include
492 | #include
493 | #include
494 | #include
495 | #include
496 | // #include
497 | // #include "dynamicParallelism.h"
498 | #include
499 | /** block size along */
500 | #define BSX 64
501 | #define BSY 4
502 | /** maximum recursion depth */
503 | #define MAX_DEPTH 4
504 | /** region below which do per-pixel */
505 | #define MIN_SIZE 32
506 | /** subdivision factor along each axis */
507 | #define SUBDIV 4
508 | /** subdivision when launched from host */
509 | #define INIT_SUBDIV 32
510 | #define H (16 * 1024)
511 | #define W (16 * 1024)
512 | #define MAX_DWELL 512
513 | using namespace std;
514 |
515 |
516 |
517 | /** a useful function to compute the number of threads */
518 | int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); }
519 |
520 | /** a simple complex type */
521 | struct complex {
522 | __host__ __device__ complex(float re, float im = 0)
523 | {
524 | this->re = re;
525 | this->im = im;
526 | }
527 | /** real and imaginary part */
528 | float re, im;
529 | }; // struct complex
530 |
531 | // operator overloads for complex numbers
532 | inline __host__ __device__ complex operator+(const complex &a, const complex &b)
533 | {
534 | return complex(a.re + b.re, a.im + b.im);
535 | }
536 | inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); }
537 | inline __host__ __device__ complex operator-(const complex &a, const complex &b)
538 | {
539 | return complex(a.re - b.re, a.im - b.im);
540 | }
541 | inline __host__ __device__ complex operator*(const complex &a, const complex &b)
542 | {
543 | return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im);
544 | }
545 | inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; }
546 | inline __host__ __device__ complex operator/(const complex &a, const complex &b)
547 | {
548 | float invabs2 = 1 / abs2(b);
549 | return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2);
550 | } // operator/
551 | /** find the dwell for the pixel */
552 | __device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y)
553 | {
554 | complex dc = cmax - cmin;
555 | float fx = (float)x / w, fy = (float)y / h;
556 | complex c = cmin + complex(fx * dc.re, fy * dc.im);
557 | int dwell = 0;
558 | complex z = c;
559 | while (dwell < max_dwell && abs2(z) < 2 * 2) {
560 | z = z * z + c;
561 | dwell++;
562 | }
563 | return dwell;
564 | } // pixel_dwell
565 |
566 | /** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral
567 | element, -1 = dwells are different */
568 | // #define NEUT_DWELL (MAX_DWELL + 1)
569 | #define DIFF_DWELL (-1)
570 | __device__ int same_dwell(int d1, int d2, int max_dwell)
571 | {
572 | if (d1 == d2)
573 | return d1;
574 | else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1))
575 | return min(d1, d2);
576 | else
577 | return DIFF_DWELL;
578 | } // same_dwell
579 |
580 | /** evaluates the common border dwell, if it exists */
581 | __device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0,
582 | int d)
583 | {
584 | // check whether all boundary pixels have the same dwell
585 | int tid = threadIdx.y * blockDim.x + threadIdx.x;
586 | int bs = blockDim.x * blockDim.y;
587 | int comm_dwell = (max_dwell + 1);
588 | // for all boundary pixels, distributed across threads
589 | for (int r = tid; r < d; r += bs) {
590 | // for each boundary: b = 0 is east, then counter-clockwise
591 | for (int b = 0; b < 4; b++) {
592 | int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0);
593 | int y = b % 2 == 0 ? y0 + r : (b == 1 ? y0 + d - 1 : y0);
594 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
595 | comm_dwell = same_dwell(comm_dwell, dwell, max_dwell);
596 | }
597 | } // for all boundary pixels
598 | // reduce across threads in the block
599 | __shared__ int ldwells[BSX * BSY];
600 | int nt = min(d, BSX * BSY);
601 | if (tid < nt)
602 | ldwells[tid] = comm_dwell;
603 | __syncthreads();
604 | for (; nt > 1; nt /= 2) {
605 | if (tid < nt / 2)
606 | ldwells[tid] = same_dwell(ldwells[tid], ldwells[tid + nt / 2], max_dwell);
607 | __syncthreads();
608 | }
609 | return ldwells[0];
610 | } // border_dwell
611 |
612 | /** the kernel to fill the image region with a specific dwell value */
613 | __global__ void dwell_fill_k(int *dwells, int w, int x0, int y0, int d, int dwell)
614 | {
615 | int x = threadIdx.x + blockIdx.x * blockDim.x;
616 | int y = threadIdx.y + blockIdx.y * blockDim.y;
617 | if (x < d && y < d) {
618 | x += x0, y += y0;
619 | dwells[y * w + x] = dwell;
620 | }
621 | } // dwell_fill_k
622 |
623 | /**
624 | * the kernel to fill in per-pixel values of the portion of the Mandelbrot set
625 | */
626 | __global__ void mandelbrot_pixel_k(int *dwells, int w, int h, int max_dwell, complex cmin,
627 | complex cmax, int x0, int y0, int d)
628 | {
629 | int x = threadIdx.x + blockDim.x * blockIdx.x;
630 | int y = threadIdx.y + blockDim.y * blockIdx.y;
631 | if (x < d && y < d) {
632 | x += x0, y += y0;
633 | dwells[y * w + x] = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
634 | }
635 | } // mandelbrot_pixel_k
636 |
637 | /** computes the dwells for Mandelbrot image using dynamic parallelism; one block is launched per
638 | pixel
639 | @param dwells the output array
640 | @param w the width of the output image
641 | @param h the height of the output image
642 | @param cmin the complex value associated with the left-bottom corner of the image
643 | @param cmax the complex value associated with the right-top corner of the image
644 | @param x0 the starting x coordinate of the portion to compute
645 | @param y0 the starting y coordinate of the portion to compute
646 | @param d the size of the portion to compute (the portion is always a square)
647 | @param depth kernel invocation depth
648 | @remarks the algorithm reverts to per-pixel Mandelbrot evaluation once either maximum depth or
649 | minimum size is reached
650 | */
651 | __global__ void mandelbrot_with_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
652 | complex cmax, int x0, int y0, int d, int depth)
653 | {
654 | x0 += d * blockIdx.x, y0 += d * blockIdx.y;
655 | int comm_dwell = border_dwell(w, h, max_dwell, cmin, cmax, x0, y0, d);
656 | if (threadIdx.x == 0 && threadIdx.y == 0) {
657 | if (comm_dwell != DIFF_DWELL) {
658 | // uniform dwell, just fill
659 | dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
660 | dwell_fill_k<<>>(dwells, w, x0, y0, d, comm_dwell);
661 | } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
662 | // subdivide recursively
663 | dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
664 | mandelbrot_with_dp<<>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0,
665 | d / SUBDIV, depth + 1);
666 | } else {
667 | // leaf, per-pixel kernel
668 | dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
669 | mandelbrot_pixel_k<<>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, d);
670 | }
671 | // check_error(x0, y0, d);
672 | }
673 | } // mandelbrot_with_dp
674 |
675 | /** computes the dwells for Mandelbrot image
676 | @param dwells the output array
677 | @param w the width of the output image
678 | @param h the height of the output image
679 | @param cmin the complex value associated with the left-bottom corner of the image
680 | @param cmax the complex value associated with the right-top corner of the image
681 | */
682 | __global__ void mandelbrot_without_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
683 | complex cmax)
684 | {
685 | // complex value to start iteration (c)
686 | int x = threadIdx.x + blockIdx.x * blockDim.x;
687 | int y = threadIdx.y + blockIdx.y * blockDim.y;
688 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
689 | dwells[y * w + x] = dwell;
690 | }
691 |
692 | __global__ void dwell_fill_k_null() { printf("111 \n"); } // dwell_fill_k
693 |
694 | __global__ void mandelbrot_with_dp_cpu_perf() { dwell_fill_k_null<<<1, 1>>>(); }
695 |
696 | __global__ void mandelbrot_without_dp_cpu_perf() { printf("222 \n"); }
697 |
698 | struct timeval t1, t2;
699 |
700 | static void BM_DynamicParallelism_WithDP()
701 | {
702 | static char env_str[] = "DOORBELL_LISTEN=ON";
703 | putenv(env_str);
704 |
705 | // allocate memory
706 | int w = W;
707 | int h = H;
708 | int max_dwell = MAX_DWELL;
709 |
710 | size_t dwell_sz = w * h * sizeof(int);
711 | int *h_dwells, *d_dwells;
712 | mcMalloc((void **)&d_dwells, dwell_sz);
713 | h_dwells = (int *)malloc(dwell_sz);
714 |
715 | dim3 bs(BSX, BSY), grid(INIT_SUBDIV, INIT_SUBDIV);
716 | gettimeofday(&t1, NULL);
717 | mandelbrot_with_dp<<>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
718 | complex(0.5, 1), 0, 0, w / INIT_SUBDIV, 1);
719 | gettimeofday(&t2, NULL);
720 | mcDeviceSynchronize();
721 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);
722 |
723 | // free data
724 | mcFree(d_dwells);
725 | free(h_dwells);
726 | cout<<"BM_DynamicParallelism_WithDP over "<>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
747 | complex(0.5, 1));
748 | gettimeofday(&t2, NULL);
749 | mcDeviceSynchronize();
750 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);
751 |
752 | // free data
753 | mcFree(d_dwells);
754 | free(h_dwells);
755 | cout<<"BM_DynamicParallelism_WithoutDP over"<>>();
770 |
771 | mcDeviceSynchronize();
772 | cout<<"BM_DynamicParallelism_WithDP_CPU_Perf over"<>>();
782 |
783 | mcDeviceSynchronize();
784 | cout<<"BM_DynamicParallelism_WithoutDP_CPU_Perf over"<
812 | #include
813 | #include
814 | #include
815 |
816 | using namespace std;
817 |
818 | __global__ void vectorAdd(float* A_d, float* B_d, float* C_d, int N){
819 | int i = threadIdx.x + blockDim.x * blockIdx.x;
820 | if (i < N) C_d[i] = A_d[i] + B_d[i] + 0.0f;
821 | }
822 |
823 | int main(int argc,char *argv[]){
824 | int n = atoi(argv[1]);
825 | cout << n << endl;
826 |
827 | float *A,*B,*C;
828 | mcMallocManaged(&A,n*sizeof(float));
829 | mcMallocManaged(&B,n*sizeof(float));
830 | mcMallocManaged(&C,n*sizeof(float));
831 |
832 | for(int i=0;i>>(A,B,C,n);
840 | mcDeviceSynchronize();
841 | for(int i=0;i
864 |
865 | ### Exercise 2
866 |
867 | ```c++
868 | #include
869 | #include
870 | #include
871 | #include
872 | #include
873 | #include
874 | using namespace std;
875 |
876 | #define M 512
877 | #define K 512
878 | #define N 512
879 |
880 | void initial(float *array, int size)
881 | {
882 | for (int i = 0; i < size; i++)
883 | {
884 | array[i] = (float)(rand() % 10 + 1);
885 | }
886 | }
887 |
888 | //核函数(静态共享内存版)
889 | __global__ void matrixMultiplyShared(float *A, float *B, float *C,
890 | int numARows, int numAColumns, int numBRows, int numBColumns, int numCRows, int numCColumns)
891 | {
892 | //分配共享内存
893 | // __shared__ float sharedM[blockDim.y][blockDim.x];
894 | // __shared__ float sharedN[blockDim.x][blockDim.y];
895 | __shared__ float sharedM[16][32];
896 | __shared__ float sharedN[16][32];
897 |
898 | int bx = blockIdx.x;
899 | int by = blockIdx.y;
900 | int tx = threadIdx.x;
901 | int ty = threadIdx.y;
902 |
903 | int row = by * blockDim.y + ty;
904 | int col = bx * blockDim.x + tx;
905 |
906 | float Csub = 0.0;
907 |
908 | //将保存在全局内存中的矩阵M&N分块存放到共享内存中
909 | for (int i = 0; i < (int)(ceil((float)numAColumns / blockDim.x)); i++)
910 | {
911 | if (i*blockDim.x + tx < numAColumns && row < numARows)
912 | sharedM[ty][tx] = A[row*numAColumns + i * blockDim.x + tx];
913 | else
914 | sharedM[ty][tx] = 0.0;
915 |
916 | if (i*blockDim.y + ty < numBRows && col < numBColumns)//分割N矩阵
917 | sharedN[ty][tx] = B[(i*blockDim.y + ty)*numBColumns + col];
918 | else
919 | sharedN[ty][tx] = 0.0;
920 | __syncthreads();
921 |
922 | for (int j = 0; j < blockDim.x; j++)//分块后的矩阵相乘
923 | Csub += sharedM[ty][j] * sharedN[j][tx];
924 | __syncthreads();
925 | }
926 |
927 | if (row < numCRows && col < numCColumns)//将计算后的矩阵块放到结果矩阵C中
928 | C[row*numCColumns + col] = Csub;
929 | }
930 |
931 |
932 | int main(int argc, char **argv)
933 | {
934 | int Axy = M * K;
935 | int Bxy = K * N;
936 | int Cxy = M * N;
937 |
938 | float *h_A, *h_B, *h_C;
939 | h_A = (float*)malloc(Axy * sizeof(float));
940 | h_B = (float*)malloc(Bxy * sizeof(float));
941 |
942 | h_C = (float*)malloc(Cxy * sizeof(float));
943 |
944 | initial(h_A, Axy);
945 | initial(h_B, Bxy);
946 |
947 | float *d_A, *d_B, *d_C;
948 | mcMalloc((void**)&d_A, Axy * sizeof(float));
949 | mcMalloc((void**)&d_B, Bxy * sizeof(float));
950 | mcMalloc((void**)&d_C, Cxy * sizeof(float));
951 |
952 | mcMemcpy(d_A, h_A, Axy * sizeof(float), mcMemcpyHostToDevice);
953 | mcMemcpy(d_B, h_B, Bxy * sizeof(float), mcMemcpyHostToDevice);
954 |
955 | int dimx = 32;
956 | int dimy = 16;
957 | dim3 block(dimx, dimy);
958 | dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y);
959 | struct timeval t1, t2;
960 | gettimeofday(&t1, NULL);
961 | matrixMultiplyShared <<< grid, block >>> (d_A, d_B, d_C, M, K, K, N, M, N);
962 | mcMemcpy(h_C, d_C, Cxy * sizeof(float), mcMemcpyDeviceToHost);
963 | gettimeofday(&t2, NULL);
964 | double timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
965 | cout << "timeuse: " << timeuse << endl;
966 | mcFree(d_A);
967 | mcFree(d_B);
968 | mcFree(d_C);
969 |
970 | free(h_A);
971 | free(h_B);
972 | free(h_C);
973 | }
974 |
975 | ```
976 |
977 |
978 |
979 |
--------------------------------------------------------------------------------
/习题运行结果/nestedMandelbrot.cpp:
--------------------------------------------------------------------------------
1 | // #include
2 | #include <stdio.h>
3 | #include <stdlib.h>
4 | #include <math.h>
5 | #include <sys/time.h>
6 | #include <iostream>
7 | // #include
8 | // #include "dynamicParallelism.h"
9 | #include <mc_runtime.h>   // MXMACA runtime API header (name assumed)
10 | /** block size along x and y */
11 | #define BSX 64
12 | #define BSY 4
13 | /** maximum recursion depth */
14 | #define MAX_DEPTH 4
15 | /** region size below which the kernel falls back to per-pixel evaluation */
16 | #define MIN_SIZE 32
17 | /** subdivision factor along each axis */
18 | #define SUBDIV 4
19 | /** subdivision when launched from host */
20 | #define INIT_SUBDIV 32
21 | #define H (16 * 1024)
22 | #define W (16 * 1024)
23 | #define MAX_DWELL 512
24 | using namespace std;
25 |
26 |
27 |
28 | /** a useful function to compute the number of threads */
29 | int __host__ __device__ divup(int x, int y) { return x / y + (x % y ? 1 : 0); }
30 |
31 | /** a simple complex type */
32 | struct complex {
33 | __host__ __device__ complex(float re, float im = 0)
34 | {
35 | this->re = re;
36 | this->im = im;
37 | }
38 | /** real and imaginary part */
39 | float re, im;
40 | }; // struct complex
41 |
42 | // operator overloads for complex numbers
43 | inline __host__ __device__ complex operator+(const complex &a, const complex &b)
44 | {
45 | return complex(a.re + b.re, a.im + b.im);
46 | }
47 | inline __host__ __device__ complex operator-(const complex &a) { return complex(-a.re, -a.im); }
48 | inline __host__ __device__ complex operator-(const complex &a, const complex &b)
49 | {
50 | return complex(a.re - b.re, a.im - b.im);
51 | }
52 | inline __host__ __device__ complex operator*(const complex &a, const complex &b)
53 | {
54 | return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im);
55 | }
56 | inline __host__ __device__ float abs2(const complex &a) { return a.re * a.re + a.im * a.im; }
57 | inline __host__ __device__ complex operator/(const complex &a, const complex &b)
58 | {
59 | float invabs2 = 1 / abs2(b);
60 | return complex((a.re * b.re + a.im * b.im) * invabs2, (a.im * b.re - b.im * a.re) * invabs2);
61 | } // operator/
62 | /** find the dwell for the pixel */
63 | __device__ int pixel_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x, int y)
64 | {
65 | complex dc = cmax - cmin;
66 | float fx = (float)x / w, fy = (float)y / h;
67 | complex c = cmin + complex(fx * dc.re, fy * dc.im);
68 | int dwell = 0;
69 | complex z = c;
70 | while (dwell < max_dwell && abs2(z) < 2 * 2) {
71 | z = z * z + c;
72 | dwell++;
73 | }
74 | return dwell;
75 | } // pixel_dwell
76 |
77 | /** binary operation for common dwell "reduction": MAX_DWELL + 1 = neutral
78 | element, -1 = dwells are different */
79 | // #define NEUT_DWELL (MAX_DWELL + 1)
80 | #define DIFF_DWELL (-1)
81 | __device__ int same_dwell(int d1, int d2, int max_dwell)
82 | {
83 | if (d1 == d2)
84 | return d1;
85 | else if (d1 == (max_dwell + 1) || d2 == (max_dwell + 1))
86 | return min(d1, d2);
87 | else
88 | return DIFF_DWELL;
89 | } // same_dwell
90 |
91 | /** evaluates the common border dwell, if it exists */
92 | __device__ int border_dwell(int w, int h, int max_dwell, complex cmin, complex cmax, int x0, int y0,
93 | int d)
94 | {
95 | // check whether all boundary pixels have the same dwell
96 | int tid = threadIdx.y * blockDim.x + threadIdx.x;
97 | int bs = blockDim.x * blockDim.y;
98 | int comm_dwell = (max_dwell + 1);
99 | // for all boundary pixels, distributed across threads
100 | for (int r = tid; r < d; r += bs) {
101 | // for each boundary: b = 0 is east, then counter-clockwise
102 | for (int b = 0; b < 4; b++) {
103 | int x = b % 2 != 0 ? x0 + r : (b == 0 ? x0 + d - 1 : x0);
104 | int y = b % 2 == 0 ? y0 + r : (b == 1 ? y0 + d - 1 : y0);
105 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
106 | comm_dwell = same_dwell(comm_dwell, dwell, max_dwell);
107 | }
108 | } // for all boundary pixels
109 | // reduce across threads in the block
110 | __shared__ int ldwells[BSX * BSY];
111 | int nt = min(d, BSX * BSY);
112 | if (tid < nt)
113 | ldwells[tid] = comm_dwell;
114 | __syncthreads();
115 | for (; nt > 1; nt /= 2) {
116 | if (tid < nt / 2)
117 | ldwells[tid] = same_dwell(ldwells[tid], ldwells[tid + nt / 2], max_dwell);
118 | __syncthreads();
119 | }
120 | return ldwells[0];
121 | } // border_dwell
122 |
123 | /** the kernel to fill the image region with a specific dwell value */
124 | __global__ void dwell_fill_k(int *dwells, int w, int x0, int y0, int d, int dwell)
125 | {
126 | int x = threadIdx.x + blockIdx.x * blockDim.x;
127 | int y = threadIdx.y + blockIdx.y * blockDim.y;
128 | if (x < d && y < d) {
129 | x += x0, y += y0;
130 | dwells[y * w + x] = dwell;
131 | }
132 | } // dwell_fill_k
133 |
134 | /**
135 | * the kernel to fill in per-pixel values of the portion of the Mandelbrot set
136 | */
137 | __global__ void mandelbrot_pixel_k(int *dwells, int w, int h, int max_dwell, complex cmin,
138 | complex cmax, int x0, int y0, int d)
139 | {
140 | int x = threadIdx.x + blockDim.x * blockIdx.x;
141 | int y = threadIdx.y + blockDim.y * blockIdx.y;
142 | if (x < d && y < d) {
143 | x += x0, y += y0;
144 | dwells[y * w + x] = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
145 | }
146 | } // mandelbrot_pixel_k
147 |
148 | /** computes the dwells for Mandelbrot image using dynamic parallelism; one block is launched per
149 | pixel
150 | @param dwells the output array
151 | @param w the width of the output image
152 | @param h the height of the output image
153 | @param cmin the complex value associated with the left-bottom corner of the image
154 | @param cmax the complex value associated with the right-top corner of the image
155 | @param x0 the starting x coordinate of the portion to compute
156 | @param y0 the starting y coordinate of the portion to compute
157 | @param d the size of the portion to compute (the portion is always a square)
158 | @param depth kernel invocation depth
159 | @remarks the algorithm reverts to per-pixel Mandelbrot evaluation once either maximum depth or
160 | minimum size is reached
161 | */
162 | __global__ void mandelbrot_with_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
163 | complex cmax, int x0, int y0, int d, int depth)
164 | {
165 | x0 += d * blockIdx.x, y0 += d * blockIdx.y;
166 | int comm_dwell = border_dwell(w, h, max_dwell, cmin, cmax, x0, y0, d);
167 | if (threadIdx.x == 0 && threadIdx.y == 0) {
168 | if (comm_dwell != DIFF_DWELL) {
169 | // uniform dwell, just fill
170 | dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
171 | dwell_fill_k<<<grid, bs>>>(dwells, w, x0, y0, d, comm_dwell);
172 | } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
173 | // subdivide recursively
174 | dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
175 | mandelbrot_with_dp<<<grid, bs>>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0,
176 | d / SUBDIV, depth + 1);
177 | } else {
178 | // leaf, per-pixel kernel
179 | dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
180 | mandelbrot_pixel_k<<<grid, bs>>>(dwells, w, h, max_dwell, cmin, cmax, x0, y0, d);
181 | }
182 | // check_error(x0, y0, d);
183 | }
184 | } // mandelbrot_with_dp
185 |
186 | /** computes the dwells for Mandelbrot image
187 | @param dwells the output array
188 | @param w the width of the output image
189 | @param h the height of the output image
190 | @param cmin the complex value associated with the left-bottom corner of the image
191 | @param cmax the complex value associated with the right-top corner of the image
192 | */
193 | __global__ void mandelbrot_without_dp(int *dwells, int w, int h, int max_dwell, complex cmin,
194 | complex cmax)
195 | {
196 | // complex value to start iteration (c)
197 | int x = threadIdx.x + blockIdx.x * blockDim.x;
198 | int y = threadIdx.y + blockIdx.y * blockDim.y;
199 | int dwell = pixel_dwell(w, h, max_dwell, cmin, cmax, x, y);
200 | dwells[y * w + x] = dwell;
201 | }
202 |
203 | __global__ void dwell_fill_k_null() { printf("111 \n"); } // dwell_fill_k_null
204 |
205 | __global__ void mandelbrot_with_dp_cpu_perf() { dwell_fill_k_null<<<1, 1>>>(); }
206 |
207 | __global__ void mandelbrot_without_dp_cpu_perf() { printf("222 \n"); }
208 |
209 | struct timeval t1, t2;
210 |
211 | static void BM_DynamicParallelism_WithDP()
212 | {
213 | static char env_str[] = "DOORBELL_LISTEN=ON";
214 | putenv(env_str);
215 |
216 | // allocate memory
217 | int w = W;
218 | int h = H;
219 | int max_dwell = MAX_DWELL;
220 |
221 | size_t dwell_sz = w * h * sizeof(int);
222 | int *h_dwells, *d_dwells;
223 | mcMalloc((void **)&d_dwells, dwell_sz);
224 | h_dwells = (int *)malloc(dwell_sz);
225 |
226 | dim3 bs(BSX, BSY), grid(INIT_SUBDIV, INIT_SUBDIV);
227 | gettimeofday(&t1, NULL);
228 | mandelbrot_with_dp<<<grid, bs>>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
229 | complex(0.5, 1), 0, 0, w / INIT_SUBDIV, 1);
230 | mcDeviceSynchronize();   // wait for the kernel so the measured time includes execution
231 | gettimeofday(&t2, NULL);
232 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);
233 |
234 | // free data
235 | mcFree(d_dwells);
236 | free(h_dwells);
237 | cout << "BM_DynamicParallelism_WithDP over " << endl;
238 | }
239 | 
240 | static void BM_DynamicParallelism_WithoutDP()
241 | {
242 | static char env_str[] = "DOORBELL_LISTEN=ON";
243 | putenv(env_str);
244 | 
245 | // allocate memory
246 | int w = W;
247 | int h = H;
248 | int max_dwell = MAX_DWELL;
249 | 
250 | size_t dwell_sz = w * h * sizeof(int);
251 | int *h_dwells, *d_dwells;
252 | mcMalloc((void **)&d_dwells, dwell_sz);
253 | h_dwells = (int *)malloc(dwell_sz);
254 | 
255 | dim3 bs(BSX, BSY), grid(divup(w, bs.x), divup(h, bs.y));
256 | gettimeofday(&t1, NULL);
257 | mandelbrot_without_dp<<<grid, bs>>>(d_dwells, w, h, max_dwell, complex(-1.5, -1),
258 | complex(0.5, 1));
259 | mcDeviceSynchronize();   // wait for the kernel so the measured time includes execution
260 | gettimeofday(&t2, NULL);
261 | mcMemcpy(h_dwells, d_dwells, dwell_sz, mcMemcpyDeviceToHost);
262 |
263 | // free data
264 | mcFree(d_dwells);
265 | free(h_dwells);
266 | cout << "BM_DynamicParallelism_WithoutDP over" << endl;
267 | }
268 | 
269 | static void BM_DynamicParallelism_WithDP_CPU_Perf()
270 | {
271 | static char env_str[] = "DOORBELL_LISTEN=ON";
272 | putenv(env_str);
273 | 
280 | mandelbrot_with_dp_cpu_perf<<<1, 1>>>();
281 | 
282 | mcDeviceSynchronize();
283 | cout << "BM_DynamicParallelism_WithDP_CPU_Perf over" << endl;
284 | }
285 | 
286 | static void BM_DynamicParallelism_WithoutDP_CPU_Perf()
287 | {
292 | mandelbrot_without_dp_cpu_perf<<<1, 1>>>();
293 | 
294 | mcDeviceSynchronize();
295 | cout << "BM_DynamicParallelism_WithoutDP_CPU_Perf over" << endl;
296 | }
--------------------------------------------------------------------------------
/示例代码运行截图/示例代码运行截图.md:
--------------------------------------------------------------------------------
1 | ## chapter 2
2 | 
3 | ### 2-1
4 | 
5 | 
6 | 
7 | 
8 |
9 | ## chapter 3
10 |
11 | ### 3-2
12 |
13 |
14 |
15 | ## chapter 4
16 |
17 | ### 4-1
18 |
19 |
20 |
21 | ## chapter 5
22 |
23 | ### 5-1
24 |
25 |
26 |
27 | ### 5-3
28 |
29 |
30 |
31 | ### 5-5
32 |
33 |
34 |
35 | ## chapter 6
36 |
37 | ### 6-1
38 |
39 |
40 |
41 |
42 |
43 | ### 6-2
44 |
45 |
46 |
47 |
48 |
49 | ### 6-3
50 |
51 |
52 |
53 |
54 |
55 | ### 6-4
56 |
57 |
58 |
59 | ### 6-5
60 |
61 |
62 |
63 | ### 6-6
64 |
65 |
66 |
67 | ### 6-7
68 |
69 |
70 |
71 | ### 6-8
72 |
73 |
74 |
75 | ### 6-9
76 |
77 |
78 |
79 | ### 6-10
80 |
81 |
82 |
83 |
84 |
85 | ### 6-11
86 |
87 |
88 |
89 |
90 |
91 | ### 6-12
92 |
93 |
94 |
95 |
96 |
97 | ### 6-30
98 |
99 |
100 |
101 | ## chapter 7
102 |
103 | ### 7-4
104 |
105 |
106 |
107 | ### 7-5
108 |
109 |
110 |
111 | ## chapter 8
112 |
113 | ### 8-1
114 |
115 |
116 |
117 | ### 8-2
118 |
119 |
--------------------------------------------------------------------------------