├── pic
    └── gemm_mma.png
└── README.md


/pic/gemm_mma.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gty111/GEMM_MMA/HEAD/pic/gemm_mma.png


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # [GEMM MMA](https://gty111.github.io/2023/06/20/gemm-optimize/)
 2 | 
 3 | > cutlass:3.1 CUDA:11.4.4
 4 | 
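 5 | GEMM MMA starts from a naive GEMM kernel that uses the CUDA `mma.sync` instruction to drive the GPU's Tensor Cores, then introduces one optimization at a time and compares the resulting performance.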
 6 | 
 7 | Final optimized performance: 73.65% (relative to the CUTLASS kernel, measured at 8192x8192x8192)
 8 | 
 9 | [source code: gemm.cu](https://github.com/gty111/GEMM_MMA/blob/epilogue/gemm.cu)
10 | 
11 | ## [Optimize GEMM step by step](https://zhuanlan.zhihu.com/p/638522893)
12 | 
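13 | A step-by-step GEMM optimization series. Each step introduces one optimization and compares the resulting performance; the code for each step is in `gemm.cu` on the corresponding branch.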
14 | 
15 | Baseline performance: 3.44% (relative to the CUTLASS kernel)
16 | 
17 | ### [1. Vectorization (vector)](https://github.com/gty111/GEMM_MMA/tree/vector)
18 | 
19 | The vector branch introduces vectorized loads and stores.
20 | 
21 | Performance after optimization: 4.74%
22 | 
23 | ### [2. Avoiding bank conflicts and coalescing memory accesses (bfco)](https://github.com/gty111/GEMM_MMA/tree/bfco)
24 | 
25 | The bfco branch shows how to improve performance by eliminating shared memory bank conflicts and coalescing memory accesses.
26 | 
27 | Performance after optimization: 5.00%
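
As a rough illustration of what a shared memory bank conflict is, the host-side C++ sketch below models the usual layout of 32 banks, each one 32-bit word wide. It counts how many threads of a warp land in the same bank when each thread reads one element of a column in a row-major tile. The names `kNumBanks`, `bank_of`, and `max_bank_conflict` are hypothetical, and padding the row stride is only one well-known remedy; the bfco branch may use a different layout.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumption: 32 shared memory banks, each one 32-bit word wide.
constexpr int kNumBanks = 32;

// Bank that holds the 32-bit word at the given word index.
int bank_of(int word_index) {
    assert(word_index >= 0);
    return word_index % kNumBanks;
}

// Worst-case number of threads in a 32-thread warp that hit the same bank
// when thread t reads word (t * row_stride + col) of a row-major tile.
// 1 means conflict-free; 32 means the access is fully serialized.
int max_bank_conflict(int row_stride, int col) {
    std::array<int, kNumBanks> hits{};
    for (int t = 0; t < 32; ++t) {
        ++hits[bank_of(t * row_stride + col)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row stride of 32 words, every thread maps to the same bank, so `max_bank_conflict(32, 0)` is 32. Padding the stride to 33 words spreads the column across all banks, so `max_bank_conflict(33, 0)` is 1.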
28 | 
29 | ### [3. Asynchronous copy (ldgsts)](https://github.com/gty111/GEMM_MMA/tree/ldgsts)
30 | 
31 | The ldgsts branch uses the asynchronous copy introduced with Ampere to improve performance.
32 | 
33 | Performance after optimization: 5.36%
34 | 
35 | ### [4. Using registers (reg)](https://github.com/gty111/GEMM_MMA/tree/reg)
36 | 
37 | The reg branch uses registers to improve performance.
38 | 
39 | Performance after optimization: 35.39%
40 | 
41 | ### [5. Data prefetching (prefetch)](https://github.com/gty111/GEMM_MMA/tree/prefetch)
42 | 
43 | The prefetch branch uses data prefetching to improve performance.
44 | 
45 | Performance after optimization: 39.36%
46 | 
47 | ### [6. An interesting finding about PTXAS (ptxas)](https://github.com/gty111/GEMM_MMA/tree/ptxas)
48 | 
49 | The ptxas branch shares an interesting observation about ptxas (the PTX assembler) made during tuning.
50 | 
51 | ### [7. Improved data prefetching (prefetchx)](https://github.com/gty111/GEMM_MMA/tree/prefetchx)
52 | 
53 | The prefetchx branch is similar to the earlier prefetch branch, except that it prefetches more data and uses the synchronization instruction `cp.async.wait_group N`.
54 | 
55 | Performance after optimization: 46.89%
56 | 
57 | ### [8. Adjusting the tile sizes computed by each thread block and warp (shape)](https://github.com/gty111/GEMM_MMA/tree/shape)
58 | 
59 | The shape branch adjusts the size of the tile of matrix C computed by each block and each warp.
60 | 
61 | Performance after optimization: 62.39%
62 | 
63 | ### [9. Adjusting which tile each thread block computes (swizzle)](https://github.com/gty111/GEMM_MMA/tree/swizzle)
64 | 
65 | The swizzle branch changes which output tile each thread block is assigned to in order to improve performance.
66 | 
67 | Performance after optimization: 68.43%
68 | 
69 | ### [10. Using the ldmatrix instruction (ldmatrix)](https://github.com/gty111/GEMM_MMA/tree/ldmatrix)
70 | 
71 | The ldmatrix branch uses the `ldmatrix` instruction to load from shared memory.
72 | 
73 | Performance after optimization: 73.65%
74 | 
75 | ### [11. Adding support for the alpha and beta parameters (epilogue)](https://github.com/gty111/GEMM_MMA/tree/epilogue)
76 | 
77 | The epilogue branch adds support for the `alpha` and `beta` parameters.
78 | 


--------------------------------------------------------------------------------