├── .clang-format
├── .gitignore
├── CMakeLists.txt
├── LICENSE
├── README.md
├── docker
    └── gemm-cuda.Dockerfile
├── include
    ├── cuda_gemm.hpp
    ├── cuda_gemm_utils.cuh
    ├── cuda_gemm_utils.hpp
    └── profile_utils.cuh
└── src
    ├── 00_non_coalesced_global_memory_access.cu
    ├── 01_coalesced_global_memory_access.cu
    ├── 02_2d_block_tiling.cu
    ├── 02_2d_block_tiling_vectorized_memory_access.cu
    ├── 03_2d_block_tiling_1d_thread_tiling.cu
    ├── 03_2d_block_tiling_1d_thread_tiling_vectorized_memory_access.cu
    ├── 04_2d_block_tiling_2d_thread_tiling.cu
    ├── 04_2d_block_tiling_2d_thread_tiling_vectorized_memory_access.cu
    ├── 05_2d_block_tiling_2d_thread_tiling_matrix_transpose.cu
    ├── 05_2d_block_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access.cu
    ├── 06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose.cu
    ├── 06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access.cu
    ├── 06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access_double_buffered.cu
    ├── 07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma.cu
    ├── 07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma_vectorized_memory_access.cu
    ├── 07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma_vectorized_memory_access_double_buffered.cu
    ├── CMakeLists.txt
    ├── cuda_gemm_utils.cu
    ├── profile_cuda_gemm_fp16.cu
    └── profile_cuda_gemm_fp32.cu


/.clang-format:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/.clang-format


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/.gitignore


--------------------------------------------------------------------------------
/CMakeLists.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/CMakeLists.txt


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/LICENSE


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/README.md


--------------------------------------------------------------------------------
/docker/gemm-cuda.Dockerfile:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/docker/gemm-cuda.Dockerfile


--------------------------------------------------------------------------------
/include/cuda_gemm.hpp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/include/cuda_gemm.hpp


--------------------------------------------------------------------------------
/include/cuda_gemm_utils.cuh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/include/cuda_gemm_utils.cuh


--------------------------------------------------------------------------------
/include/cuda_gemm_utils.hpp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/include/cuda_gemm_utils.hpp


--------------------------------------------------------------------------------
/include/profile_utils.cuh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/include/profile_utils.cuh


--------------------------------------------------------------------------------
/src/00_non_coalesced_global_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/00_non_coalesced_global_memory_access.cu


--------------------------------------------------------------------------------
/src/01_coalesced_global_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/01_coalesced_global_memory_access.cu


--------------------------------------------------------------------------------
/src/02_2d_block_tiling.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/02_2d_block_tiling.cu


--------------------------------------------------------------------------------
/src/02_2d_block_tiling_vectorized_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/02_2d_block_tiling_vectorized_memory_access.cu


--------------------------------------------------------------------------------
/src/03_2d_block_tiling_1d_thread_tiling.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/03_2d_block_tiling_1d_thread_tiling.cu


--------------------------------------------------------------------------------
/src/03_2d_block_tiling_1d_thread_tiling_vectorized_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/03_2d_block_tiling_1d_thread_tiling_vectorized_memory_access.cu


--------------------------------------------------------------------------------
/src/04_2d_block_tiling_2d_thread_tiling.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/04_2d_block_tiling_2d_thread_tiling.cu


--------------------------------------------------------------------------------
/src/04_2d_block_tiling_2d_thread_tiling_vectorized_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/04_2d_block_tiling_2d_thread_tiling_vectorized_memory_access.cu


--------------------------------------------------------------------------------
/src/05_2d_block_tiling_2d_thread_tiling_matrix_transpose.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/05_2d_block_tiling_2d_thread_tiling_matrix_transpose.cu


--------------------------------------------------------------------------------
/src/05_2d_block_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/05_2d_block_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access.cu


--------------------------------------------------------------------------------
/src/06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose.cu


--------------------------------------------------------------------------------
/src/06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access.cu


--------------------------------------------------------------------------------
/src/06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access_double_buffered.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/06_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_vectorized_memory_access_double_buffered.cu


--------------------------------------------------------------------------------
/src/07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma.cu


--------------------------------------------------------------------------------
/src/07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma_vectorized_memory_access.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma_vectorized_memory_access.cu


--------------------------------------------------------------------------------
/src/07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma_vectorized_memory_access_double_buffered.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/07_2d_block_tiling_2d_warp_tiling_2d_thread_tiling_matrix_transpose_wmma_vectorized_memory_access_double_buffered.cu


--------------------------------------------------------------------------------
/src/CMakeLists.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/CMakeLists.txt


--------------------------------------------------------------------------------
/src/cuda_gemm_utils.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/cuda_gemm_utils.cu


--------------------------------------------------------------------------------
/src/profile_cuda_gemm_fp16.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/profile_cuda_gemm_fp16.cu


--------------------------------------------------------------------------------
/src/profile_cuda_gemm_fp32.cu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leimao/CUDA-GEMM-Optimization/HEAD/src/profile_cuda_gemm_fp32.cu


--------------------------------------------------------------------------------