├── .gitattributes
├── .gitignore
├── LICENSE
└── README.md


/.gitattributes:
--------------------------------------------------------------------------------
 1 | # Auto detect text files and perform LF normalization
 2 | * text=auto
 3 | 
 4 | # Custom for Visual Studio
 5 | *.cs     diff=csharp
 6 | 
 7 | # Standard to msysgit
 8 | *.doc	 diff=astextplain
 9 | *.DOC	 diff=astextplain
10 | *.docx diff=astextplain
11 | *.DOCX diff=astextplain
12 | *.dot  diff=astextplain
13 | *.DOT  diff=astextplain
14 | *.pdf  diff=astextplain
15 | *.PDF	 diff=astextplain
16 | *.rtf	 diff=astextplain
17 | *.RTF	 diff=astextplain
18 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Windows image file caches
 2 | Thumbs.db
 3 | ehthumbs.db
 4 | 
 5 | # Folder config file
 6 | Desktop.ini
 7 | 
 8 | # Recycle Bin used on file shares
 9 | $RECYCLE.BIN/
10 | 
11 | # Windows Installer files
12 | *.cab
13 | *.msi
14 | *.msm
15 | *.msp
16 | 
17 | # Windows shortcuts
18 | *.lnk
19 | 
20 | # =========================
21 | # Operating System Files
22 | # =========================
23 | 
24 | # OSX
25 | # =========================
26 | 
27 | .DS_Store
28 | .AppleDouble
29 | .LSOverride
30 | 
31 | # Thumbnails
32 | ._*
33 | 
34 | # Files that might appear in the root of a volume
35 | .DocumentRevisions-V100
36 | .fseventsd
37 | .Spotlight-V100
38 | .TemporaryItems
39 | .Trashes
40 | .VolumeIcon.icns
41 | 
42 | # Directories potentially created on remote AFP share
43 | .AppleDB
44 | .AppleDesktop
45 | Network Trash Folder
46 | Temporary Items
47 | .apdisk
48 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | BSD 3-Clause License
 2 | 
 3 | Copyright (c) 2022, Keren Zhou
 4 | All rights reserved.
 5 | 
 6 | Redistribution and use in source and binary forms, with or without
 7 | modification, are permitted provided that the following conditions are met:
 8 | 
 9 | 1. Redistributions of source code must retain the above copyright notice, this
10 |    list of conditions and the following disclaimer.
11 | 
12 | 2. Redistributions in binary form must reproduce the above copyright notice,
13 |    this list of conditions and the following disclaimer in the documentation
14 |    and/or other materials provided with the distribution.
15 | 
16 | 3. Neither the name of the copyright holder nor the names of its
17 |    contributors may be used to endorse or promote products derived from
18 |    this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Awesome-GPU
  2 | =================
  3 | 
  4 |    * [Architecture](#architecture)
  5 |       * [Resources Management](#resources-management)
  6 |       * [Parallelism](#parallelism)
  7 |       * [Cache](#cache)
  8 |       * [Memory](#memory)
  9 |       * [White Papers](#white-papers)
 10 |    * [Algorithms](#algorithms)
 11 |       * [BLAS](#blas)
 12 |       * [Stencils](#stencils)
 13 |       * [Scans](#scans)
 14 |    * [Applications](#applications)
 15 |       * [Deep Learning](#deep-learning)
 16 |    * [Tools](#tools)
 17 |       * [Benchmarks](#benchmarks)
 18 |       * [Models](#models)
 19 |       * [Simulators](#simulators)
 20 |       * [Profilers](#profilers)
 21 |    * [Runtime](#runtime)
 22 |       * [Scheduling](#scheduling)
 23 |    * [Code Generation](#code-generation)
 24 |       * [Compilers](#compilers)
 25 |       * [Programming Models](#programming-models)
 26 |       * [Profile Guided Optimization](#profile-guided-optimization)
 27 |       * [Binaries](#binaries)
 28 | 
 29 | ## Architecture
 30 | 
 31 | ### Resources Management
 32 | 
 33 | - **TECS'21**-[Reducing Energy in GPGPUs through Approximate Trivial Bypassing](https://dl.acm.org/doi/10.1145/3429440)
 34 | - **ASPLOS'17**-[Locality-Aware CTA Clustering for Modern GPUs](http://dl.acm.org/citation.cfm?id=3037709)
 35 | - **ASPLOS'17**-[Dynamic Resource Management for Efficient Utilization of Multitasking GPUs](http://dl.acm.org/citation.cfm?id=3037707)
 36 | - **HPCA'17**-[Dynamic GPGPU Power Management Using Adaptive Model Predictive Control](http://ieeexplore.ieee.org/document/7920860/)
 37 | - **ISCA'16**-[Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems](http://ieeexplore.ieee.org/document/7551394/)
 38 | 
 39 | ### Parallelism
 40 | 
 41 | - **HPCA'18**-[Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls](https://ieeexplore.ieee.org/abstract/document/8327010)
 42 | - **HPCA'17**-[Controlled Kernel Launch for Dynamic Parallelism in GPUs](http://ieeexplore.ieee.org/document/7920863/)
 43 | - **GTC'17**-[COOPERATIVE GROUPS](http://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf)
 44 | - **ISCA'16**-[LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs](http://ieeexplore.ieee.org/document/7551424/)
 45 | - **ISCA'16**-[Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit](http://ieeexplore.ieee.org/document/7551426/)
 46 | - **Berkeley TechRpts'16**-[Understanding Latency Hiding on GPUs](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html)
 47 | 
 48 | ### Cache
 49 | 
 50 | - **ISCA'16**-[APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs](http://ieeexplore.ieee.org/document/7551393/)
 51 | - **SC'15**-[Adaptive and Transparent Cache Bypassing for GPUs](https://ieeexplore.ieee.org/document/7832791)
 52 | 
 53 | ### Memory
 54 | 
 55 | - **ICCAD'21**-[Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU](https://ieeexplore.ieee.org/document/9643535)
 56 | - **SC'21**-[In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing](https://dl.acm.org/doi/10.1145/3458817.3480855)
 57 | - **IBM'20**-[Umpire: Application-Focused Management and Coordination of Complex Hierarchical Memory](https://ieeexplore.ieee.org/document/8907404)
 58 | - **HPCA'13**-[Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization](https://ieeexplore.ieee.org/document/6522332)
 59 | 
 60 | ### White Papers
 61 | 
 62 | - **NVIDIA Hopper**-[NVIDIA H100 Tensor Core GPU Architecture](https://resources.nvidia.com/en-us-tensor-core)
 63 | - **NVIDIA Ampere**-[NVIDIA A100 Tensor Core GPU Architecture](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf)
 64 | - **NVIDIA Turing**-[NVIDIA TURING GPU ARCHITECTURE](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
 65 | - **NVIDIA Volta**-[NVIDIA TESLA V100](http://www.nvidia.com/object/volta-architecture-whitepaper.html)
 66 | - **NVIDIA Pascal**-[NVIDIA TESLA P100](http://www.nvidia.com/object/gpu-architecture.html)
 67 | - **NVIDIA Kepler**-[NVIDIA’s Next Generation CUDA Compute Architecture: Kepler](https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf)
 68 | - **NVIDIA Fermi**-[NVIDIA’s Next Generation CUDA Compute Architecture: Fermi](https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf)
 69 | - **AMD CDNA 2**-[INTRODUCING AMD CDNA 2 ARCHITECTURE](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
 70 | - **AMD CDNA**-[INTRODUCING AMD CDNA ARCHITECTURE](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
 71 | 
 72 | ## Algorithms
 73 | 
 74 | ### BLAS
 75 | 
 76 | - **GTC'20**-[DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100](https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf)
 77 | - **IPDPS'20**-[Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply](https://ieeexplore.ieee.org/abstract/document/9139835)
 78 | - **PPoPP'19**-[A Coordinated Tiling and Batching Framework for Efficient GEMM on GPU](https://dl.acm.org/doi/10.1145/3293883.3295734)
 79 | - **GTC'18**-[CUTLASS: CUDA TEMPLATE LIBRARY FOR DENSE LINEAR ALGEBRA AT ALL LEVELS AND SCALES](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf)
 80 | 
 81 | ### Stencils
 82 | 
 83 | - **CGO'20**-[AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs](https://dl.acm.org/doi/10.1145/3368826.3377904)
 84 | - **IPDPS'19**-[On Optimizing Complex Stencils on GPUs](https://ieeexplore.ieee.org/document/8820786)
 85 | - **PPoPP'18**-[Register Optimizations for Stencils on GPUs](https://dl.acm.org/doi/abs/10.1145/3178487.3178500)
 86 | 
 87 | ### Scans
 88 | 
 89 | - **NVResearch TechRpts'16**-[Single-pass Parallel Prefix Scan with Decoupled Look-back](https://research.nvidia.com/publication/single-pass-parallel-prefix-scan-decoupled-look-back)
 90 | 
 91 | ## Applications
 92 | 
 93 | ### Deep Learning
 94 | 
 95 | - **PPoPP'21**-[Understanding and bridging the gaps in current GNN performance optimizations](https://dl.acm.org/doi/10.1145/3437801.3441585)
 96 | - **SC'21**-[E.T.: re-thinking self-attention for transformer models on GPUs](https://dl.acm.org/doi/abs/10.1145/3458817.3476138)
 97 | - **OSDI'21**-[GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs](https://www.usenix.org/system/files/osdi21-wang-yuke.pdf)
 98 | - **SC'20**-[Sparse GPU Kernels for Deep Learning](https://arxiv.org/abs/2006.10901)
 99 | - **PPoPP'18**-[SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks](https://arxiv.org/abs/1801.04380)
100 | - **HPCA'17**-[Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures](http://ieeexplore.ieee.org/document/7920809/)
101 | 
102 | ## Tools
103 | 
104 | ### Benchmarks
105 | 
106 | - **GTC'18**-[Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking](https://arxiv.org/pdf/1804.06826.pdf)
107 | - **ISPASS'10**-[Demystifying GPU Microarchitecture through Microbenchmarking](http://ieeexplore.ieee.org/document/5452013/)
108 | 
109 | ### Models
110 | 
111 | - **PMBS'19**-[Instruction Roofline: An insightful visual performance model for GPUs](https://ieeexplore.ieee.org/document/9059264)
112 | - **ECP'19**-[Performance Tuning of Scientific Codes with the Roofline Model](https://crd.lbl.gov/assets/Uploads/ECP19-Roofline-1-intro.pdf)
113 | - **GTC'18**-[VOLTA Architecture and performance optimization](http://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf)
114 | - **Synthesis Lectures on Computer Architecture'12**-[Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6812836&newsearch=true&queryText=Performance%20Analysis%20and%20Tuning%20for%20General%20Purpose%20Graphics%20Processing%20Units%2038%20.LB.GPGPU.RB.)
115 | - **SC'10**-[Fundamental Optimizations](https://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Fundamental_Optimizations.pdf)
116 | 
117 | ### Simulators
118 | 
119 | - **ISPASS'10**-[Visualizing Complex Dynamics in Many-Core Accelerator Architectures](http://ieeexplore.ieee.org/document/5452029/)
120 | - **ISPASS'09**-[Analyzing CUDA Workloads Using a Detailed GPU Simulator](http://ieeexplore.ieee.org/abstract/document/4919648/)
121 | 
122 | ### Profilers
123 | 
124 | - **PLDI'18**-[GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis](https://dl.acm.org/citation.cfm?id=3192397)
125 | - **CGO'18**-[CUDAAdvisor: LLVM-based runtime profiling for modern GPUs](https://dl.acm.org/citation.cfm?id=3168831)
126 | - **CCGRID'18**-[Exposing Hidden Performance Opportunities in High Performance GPU Applications](https://ieeexplore.ieee.org/document/8411034)
127 | - **THPC'16**-[Monitoring Heterogeneous Applications with the OpenMP Tools Interface](https://link.springer.com/chapter/10.1007/978-3-319-56702-0_3)
128 | - **Euro-Par'15**-[Identifying Optimization Opportunities Within Kernel Execution in GPU Codes](https://link.springer.com/chapter/10.1007/978-3-319-27308-2_16)
129 | - **SC'13**-[Effective sampling-driven performance tools for GPU-accelerated supercomputers](https://dl.acm.org/citation.cfm?id=2503299)
130 | - **ISPASS'12**-[Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures](https://ieeexplore.ieee.org/document/6189206)
131 | - **ICPP'11**-[Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs](https://dl.acm.org/citation.cfm?id=2066951)
132 | - [**Vampir|Score-P**](http://www.vi-hps.org/projects/score-p/)
133 | - [**TAU**](https://www.cs.uoregon.edu/research/tau/home.php)
134 | - [**PAPI**](http://icl.utk.edu/papi/)
135 | - [**Allinea MAP**](https://www.allinea.com/products/map/)
136 | - [**Open|SpeedShop**](https://openspeedshop.org/)
137 | - [**HPCToolkit**](http://hpctoolkit.org/)
138 | - [**NVIDIA Nsight Systems**](https://developer.nvidia.com/nsight-systems)
139 | - [**NVIDIA Nsight Compute**](https://developer.nvidia.com/nsight-compute)
140 | - [**SASSI**](https://github.com/NVlabs/SASSI/blob/master/doc/SASSI-Tutorial-Micro2015.pptx)
141 | - [**NVBit**](https://github.com/NVlabs/NVBit/releases)
142 | 
143 | ## Runtime
144 | 
145 | ### Scheduling
146 | 
147 | - **PPoPP'22**-[CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems](https://arxiv.org/abs/2107.08538)
148 | - **TPDS'20**-[cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs](https://www.computer.org/csdl/journal/td/2020/04/08853389/1dKnnndWFwY)
149 | 
150 | ## Code Generation
151 | 
152 | ### Compilers
153 | 
154 | - **AMD'21**-[Generating GPU Compiler Heuristics using Reinforcement Learning](https://arxiv.org/abs/2111.12055)
155 | - **TACO'21**-[Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation](https://dl.acm.org/doi/10.1145/3469030)
156 | - **LLVM'17**-[Implementing implicit OpenMP data sharing on GPUs](https://dl.acm.org/citation.cfm?id=3148189)
157 | - **CGO'16**-[gpucc: An Open-Source GPGPU Compiler](http://dl.acm.org/citation.cfm?id=2854041)
158 | - **LLVM'16**-[Offloading Support for OpenMP in Clang and LLVM](https://dl.acm.org/citation.cfm?id=3018870)
159 | - **PMBS'15**-[Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application](https://dl.acm.org/citation.cfm?id=2832089)
160 | - **LLVM'15**-[Integrating GPU Support for OpenMP Offloading Directives into Clang](https://dl.acm.org/citation.cfm?id=2833161)
161 | - **LLVM'14**-[Coordinating GPU Threads for OpenMP 4.0 in LLVM](https://dl.acm.org/citation.cfm?id=2688364)
162 | 
163 | ### Programming Models
164 | 
165 | - **CGO'21**-[C-for-metal: high performance SIMD programming on intel GPUs](https://dl.acm.org/doi/abs/10.1109/CGO51591.2021.9370324)
166 | - **ECRTS'19**-[Novel Methodologies for Predictable CPU-To-GPU Command Offloading](https://drops.dagstuhl.de/opus/volltexte/2019/10759/)
167 | - **ASPLOS'14**-[Paraprox: Pattern-Based Approximation for Data Parallel Applications](https://dl.acm.org/citation.cfm?id=2541948)
168 | 
169 | ### Profile Guided Optimization
170 | 
171 | - **CGF'21**-[Cooperative Profile Guided Optimizations](https://doi.org/10.1111/cgf.14382)
172 | - **IPDPS'13**-[Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)](https://ieeexplore.ieee.org/document/6569883)
173 | 
174 | ### Binaries
175 | 
176 | - **CGO'19**-[Decoding CUDA binary](https://dl.acm.org/citation.cfm?id=3314900)
177 | - **ISCA'15**-[Flexible software profiling of GPU architectures](http://ieeexplore.ieee.org/document/7284065/)
178 | 
179 | 


--------------------------------------------------------------------------------