├── README.md
├── Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation
│   ├── Presentation Slides - A Crash course on GPU optimization.pdf
│   ├── Speaker Q&A - A Crash course in GPU optimization.pdf
│   └── Summary Notes - A Crash Course on GPU Optimization.pdf
├── Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia
│   ├── Presentation Slides - High Performance LLM Serving on Nvidia GPUs.pdf
│   ├── Speaker Q&A - High Performance LLM Serving on Nvidia GPUs.pdf
│   └── Summary Notes - High Performance LLM Serving on Nvidia GPUs.pdf
├── Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI
│   ├── Presentation Slides - Block Based GPU Programming with Triton.pdf
│   ├── Speaker Q&A - Block Based GPU Programming using Triton.pdf
│   └── Summary Notes - Block-based GPU Programming with Triton.pdf
├── Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data
│   ├── Presentation Slides - Scaling data processing from CPU to distributed GPU.pdf
│   ├── Speaker Q&A - Scaling data processing from CPU to distributed GPUs.pdf
│   ├── Speaker Q&A - Scaling data processing from CPUs to distributed GPUs.pdf
│   └── Summary Notes - Scaling data processing from CPU to distributed GPUs.pdf
└── community-note.md

/README.md:
--------------------------------------------------------------------------------
# GPU Optimization Workshop (May 2024)
Slides, notes, and materials for the workshop

- RSVP link: https://lu.ma/1wu5ppl5
- Host: [@chiphuyen](https://github.com/chiphuyen)'s [Discord community](https://discord.gg/C8duCmvngk)
- [YouTube recording](https://www.youtube.com/watch?v=v_q2JTIqE20)

## Pre-event note
* The talks are pretty technical, given that this is a workshop on GPU optimization. The speakers try their best to make their topics accessible, but you’ll get more out of the workshop if you familiarize yourself with the basic concepts in advance (see Reading materials below).
* The event will be livestreamed on YouTube, but questions should be asked on Discord, not YouTube.
* Given that 2,000+ people have signed up for the event, we expect a lot of interesting live discussions on Discord.
* Workshop TAs who will be helping us run the workshop:
  * [Roland Tannous](https://www.linkedin.com/in/rolandjosephtannous/)
  * [Chris Alexiuk](https://www.linkedin.com/in/csalexiuk/)
  * [Matúš Jurák](https://www.linkedin.com/in/mat%C3%BA%C5%A1-jur%C3%A1k-8bb680139/)

## Schedule
**[12:00] Crash course on GPU optimization ([Mark Saroufim](https://www.linkedin.com/in/marksaroufim/) @ Meta)**

_Mark is a PyTorch core developer and cofounder of CUDA MODE. He also ran the really fun NeurIPS [LLM Efficiency Challenge](https://neurips.cc/virtual/2023/competition/66594) last year. Previously, he was at Graphcore and Microsoft._

Mark will give an overview of why we use GPUs, the metrics that matter, and the different GPU programming models (thread-based CUDA and block-based Triton). He promises this will be a painless guide to writing CUDA/Triton kernels! This talk will give us the basics to understand the rest of the workshop.

**[12:45] High-performance LLM serving on GPUs ([Sharan Chetlur](https://www.linkedin.com/in/sharan-chetlur-1bb35912/) @ NVIDIA)**

_Sharan is a principal engineer working on TensorRT-LLM at NVIDIA.
He’s been working on CUDA since 2012, optimizing the performance of deep learning models from a single GPU up to full data center scale. Previously, he was the Director of Engineering at Cerebras._

Sharan will discuss how to build performant, flexible solutions to optimize LLM serving given the rapid evolution of new models and techniques. The talk will cover optimization techniques such as token concatenation, different batching strategies, and caching.

**[13:20] Block-based GPU Programming with Triton ([Philippe Tillet](https://www.linkedin.com/in/philippe-tillet-809b5536/) @ OpenAI)**

_Philippe is currently leading the Triton team at OpenAI. Previously, he was at pretty much all major chip makers, including NVIDIA, AMD, Intel, and Nervana._

Philippe will explain how Triton works and how its block-based programming model differs from the traditional single instruction, multiple threads (SIMT) programming model that CUDA follows. Triton aims to be higher-level than CUDA while being more expressive (lower-level) than common graph compilers like XLA and Torch-Inductor.

**[14:00] Scaling data processing from CPU to distributed GPUs ([William Malpica](https://www.linkedin.com/in/william-malpica-68577a44/) @ Voltron Data)**

_William is a co-founder of Voltron Data and the creator of BlazingSQL. He helped scale Theseus, a GPU-native query engine, to handle 100TB queries!_

Most people today use GPUs for training and inference. A category of workloads that GPUs excel at but are underutilized for is data processing. In this talk, William will discuss why large-scale data processing should be done on GPUs instead of CPUs, and how tools like cuDF, RAPIDS, and Theseus leverage GPUs for data processing.

## Reading materials

Please read the schedule above carefully. If there are terms you’re not familiar with, you might want to look them up in advance. Examples:

1. **Memory bound vs. compute bound**: whether the bottleneck is the GPU’s memory bandwidth or its compute capability.
2. **Thread-based vs. block-based**: different GPU programming models. CUDA is thread-based, while Triton is block-based.

Tools that will be discussed in the workshop:

1. [Development repository for the Triton language and compiler](https://github.com/triton-lang/triton)
    1. [Introducing Triton: Open-source GPU programming for neural networks](https://openai.com/index/triton/)
    2. [Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations](https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf)
2. [TensorRT](https://github.com/NVIDIA/TensorRT) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
3. Check out Mark’s lecture on [profiling CUDA in PyTorch](https://www.youtube.com/watch?v=LuhJEEJQgUM&ab_channel=CUDAMODE).
    1. [Model Inference Optimization Checklist](https://pytorch.org/serve/performance_checklist.html)
    2. [Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/)
4. [rapidsai/cudf - GPU DataFrame Library](https://github.com/rapidsai/cudf)
5. [Benchmarking Report: Theseus Engine | Voltron Data](https://voltrondata.com/benchmarks/theseus)

Recommended resources:
1. [How CUDA Programming Works - Stephen Jones, NVIDIA](https://www.youtube.com/watch?v=QQceTDjA4f4&ab_channel=ChristopherHollinworth) (great lecture)
2. [The Best GPUs for Deep Learning in 2023 — An In-depth Analysis](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) (Tim Dettmers)
3. [CUDA MODE Discord](https://discord.gg/cudamode). They have a great [lecture series on GPU optimization](https://github.com/cuda-mode/lectures/tree/main).

--------------------------------------------------------------------------------
/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Presentation Slides - A Crash course on GPU optimization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Presentation Slides - A Crash course on GPU optimization.pdf
--------------------------------------------------------------------------------
/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Speaker Q&A - A Crash course in GPU optimization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Speaker Q&A - A Crash course in GPU optimization.pdf
--------------------------------------------------------------------------------
/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Summary Notes - A Crash Course on GPU Optimization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Summary Notes - A Crash Course on GPU Optimization.pdf
--------------------------------------------------------------------------------
/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Presentation Slides - High Performance LLM Serving on Nvidia GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Presentation Slides - High Performance LLM Serving on Nvidia GPUs.pdf
--------------------------------------------------------------------------------
/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Speaker Q&A - High Performance LLM Serving on Nvidia GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Speaker Q&A - High Performance LLM Serving on Nvidia GPUs.pdf
--------------------------------------------------------------------------------
/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Summary Notes - High Performance LLM Serving on Nvidia GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Summary Notes - High Performance LLM Serving on Nvidia GPUs.pdf
--------------------------------------------------------------------------------
/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Presentation Slides - Block Based GPU Programming with Triton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Presentation Slides - Block Based GPU Programming with Triton.pdf
--------------------------------------------------------------------------------
/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Speaker Q&A - Block Based GPU Programming using Triton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Speaker Q&A - Block Based GPU Programming using Triton.pdf
--------------------------------------------------------------------------------
/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Summary Notes - Block-based GPU Programming with Triton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Summary Notes - Block-based GPU Programming with Triton.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Presentation Slides - Scaling data processing from CPU to distributed GPU.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Presentation Slides - Scaling data processing from CPU to distributed GPU.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPU to distributed GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPU to distributed GPUs.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPUs to distributed GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPUs to distributed GPUs.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Summary Notes - Scaling data processing from CPU to distributed GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Summary Notes - Scaling data processing from CPU to distributed GPUs.pdf
--------------------------------------------------------------------------------
/community-note.md:
--------------------------------------------------------------------------------
_Thanks Ajinkya Tejankar, Mohsin Iqbal, Sujit Ahirrao, and Krishna Gupta for the notes!_

## Crash course on GPU optimization
1. PyTorch
    1. Needs to support different dtypes, layouts, and devices, so its kernels have to stay general.
    2. Eager execution makes debugging easy, but this trades off performance.
    3. Models written for eager execution may not get the best performance.
2. Pointwise ops
    1. Each element is assigned to a single thread, and the threads run in parallel.
3. Memory hierarchy
    1. Load the data into shared memory, then apply the ReLU.
    2. This is done differently in Triton.
4. Eager execution can lead to unnecessary memory accesses when kernels are called repeatedly.
5. GPU memory bandwidth is the bottleneck, not FLOPs.
6. Arithmetic intensity = number of operations / number of bytes accessed.
7. Repeated calls to kernels can be fused together with `torch.compile`, which generates Triton (see the PyTorch 2 paper for more information, and the sketch after this list).
8. Moving from FP32 to FP16 or BF16 improves performance significantly.
9. `torch.set_float32_matmul_precision('high')` => uses tensor cores => see the talk by NVIDIA about tensor cores on CUDA MODE.
10. Much of the time may be spent figuring out which GPU kernel to use, because of point 1.1 above.
11. CUDA kernels are async, so queue them up -> CUDA graphs ("reduce-overhead" mode in `torch.compile`).
12. Quantization helps compute-bound but also memory-bound kernels, since it reduces the number of bytes accessed in the arithmetic-intensity calculation.
13. gpt-fast: weight-only quantization.
14. "Int8" is ambiguous - quantize the optimizer states? The gradients? It is not applied over the whole model, only the linear layers. W8A16 -> int8 weights (16-bit activations).
15. Bit packing: pack two int4 values into a single int8 (sketch after this list).
16. Compute-bound problems: become better at math.
17. Why couldn’t a compiler have figured out FlashAttention (question from a reviewer)? Compilers are good at fusing kernels, but not at the math of the operations.
18. The online softmax paper explains FlashAttention better.
19. Learn the basics of CUDA - Programming Massively Parallel Processors: A Hands-on Approach - it helps with compute-bound kernels.
20. The `load_inline` function in PyTorch’s `cpp_extension` lets you compile a custom kernel from Python (sketch after this list).
21. NVIDIA provides a profiler, `ncu` - a good supplement to reading the above book.
22. Write kernels!! There is good content on CUDA MODE; join it to write custom kernels.
23. Karpathy is building in raw CUDA.
24. Reach out to Mark for shipping your hand-written CUDA kernels (he’ll help with the release).
25. Learning through mentorship is great, since public docs are not great at the moment.
26. Quantization is not possible through `torch.compile`.
27. How to make PyTorch models faster, in order: fuse more, use tensor cores, reduce overhead, quantize, use a custom kernel.
28. How is ExecuTorch different from `torch.compile`? It focuses on more constrained devices. However, Dynamo (a part of the compile subsystem) is shared.
29. How does PyTorch treat GPUs other than NVIDIA’s? Triton provides backends that work on Intel and AMD GPUs, so PyTorch just generates Triton. Hierarchical IR and code gen.
30. What do you think about 1-bit quantization? Eval does not scale. Bit packing can help.
31. Common pitfalls of running GPUs?
    1. Eager: profile first to figure out the real bottlenecks.
    2. Compile: enable the first three items in point 27.
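
To make points 4, 7, 9, and 11 above concrete, here is a minimal sketch (not from the talk) of letting `torch.compile` fuse a chain of pointwise ops into one generated Triton kernel. The function, shapes, and tensor names are made up for illustration; a CUDA GPU and PyTorch 2.x are assumed.

```python
import torch

torch.set_float32_matmul_precision("high")  # allow TF32 tensor cores for fp32 matmuls (point 9)

def bias_gelu(x, bias):
    # Two pointwise ops: in eager mode each op reads and writes the full tensor
    # from/to global memory, so the work is memory-bandwidth bound (points 4-6).
    return torch.nn.functional.gelu(x + bias)

# torch.compile traces the function and emits one fused Triton kernel, so the
# intermediate (x + bias) never round-trips through DRAM. mode="reduce-overhead"
# additionally captures CUDA graphs to cut kernel-launch overhead (point 11).
compiled = torch.compile(bias_gelu, mode="reduce-overhead")

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = compiled(x, bias)
```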
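
A toy illustration of the bit-packing idea in point 15 - packing pairs of int4 values (held here in uint8 containers) into one byte. This is just the arithmetic, not the packing layout of any particular library.

```python
import torch

def pack_int4(x: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (0..15, stored in uint8) into one byte each."""
    assert x.dtype == torch.uint8 and x.numel() % 2 == 0
    lo, hi = x[0::2], x[1::2]
    return (lo & 0x0F) | (hi << 4)          # low nibble = even elements, high nibble = odd

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack([lo, hi], dim=-1).flatten()  # restore the original interleaving

vals = torch.randint(0, 16, (8,), dtype=torch.uint8)
assert torch.equal(unpack_int4(pack_int4(vals)), vals)  # halves the bytes stored
```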
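
And a sketch of the `load_inline` workflow from point 20, compiling a toy element-wise square kernel from a Python string. The kernel and extension names are invented; an NVCC toolchain and a CUDA GPU are assumed.

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    square_kernel<<<(n + threads - 1) / threads, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# load_inline compiles the extension on the fly and exposes `square` as a Python function.
mod = load_inline(
    name="square_ext",
    cpp_sources="torch::Tensor square(torch::Tensor x);",
    cuda_sources=cuda_src,
    functions=["square"],
)

x = torch.arange(8, dtype=torch.float32, device="cuda")
print(mod.square(x))
```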
### Relevant resources
* CUDA programming model basics:
    * [https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#4](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#4)
    * [https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a4](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a4)
    * [https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a7](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a7)
* [Programming Massively Parallel Processors: A Hands-on Approach](https://a.co/d/folc0LI)
* CUDA MODE: [discord.gg/cudamode](https://discord.gg/cudamode)
* Native PyTorch library for quantization and sparsity: [https://github.com/pytorch/ao](https://github.com/pytorch/ao)
* Learn Triton: [https://github.com/cuda-mode/triton-index/](https://github.com/cuda-mode/triton-index/)

## LLM Serving optimization

1. The focus is on server-based systems, not edge; end-user latencies are important.
2. Multi-functional, accurate models are large - deployment and optimization is a challenge.
3. Many models, very big models, new operators (optimization becomes a moving target).
4. Goal: SoTA performance for LLMs in production deployments.
5. A fast forward pass is very important. Intelligent batching is also important.
6. Other techniques, like KV-cache optimization, improve the GPU workload.
7. Quantization
    1. As long as you can preserve accuracy, lower bit-width precisions are great.
        1. Less memory, higher-throughput communication between GPUs, faster computation (an all-round win).
    2. Post-training quantization is the most common.
    3. The TensorRT Model Optimizer offers a bunch of techniques:
        1. PTQ (post-training quantization) and QAT (quantization-aware training).
        2. SmoothQuant and INT4 AWQ don’t lead to much of a drop in accuracy (MMLU).
8. An LLM request has two phases:
    1. Prefill: process the prompt, generate the first token, and initialize the KV cache. Called only once per request. Lots of parallel operations across tokens.
    2. Generate: starts from the prior state (KV cache) and generates the next token, updating the KV cache. Called in a loop for each request. Lots of memory-bound operations.
    3. Attention is complex - features like GQA and speculative decoding increase the math-to-data-movement ratio (arithmetic intensity).
    4. TRT-LLM’s fastest implementations use hand-tuned custom CUDA kernels.
9. Traditional request scheduling (static batching)
    1. Accumulate, batch, forward.
    2. Treating a request as an atomic operation works for fixed-length inputs; however, for tasks like completion, where outputs differ in length, it is not great (image vs. chat).
    3. The largest completion in a batch can stall the smallest one. Padding also wastes computation.
10. LLM request properties
    1. Multiple forward passes, and the number is unknown a priori.
    2. Online setting: request arrival times are not known a priori.
    3. In-flight batching (see the sketch after this list)
        1. On EOS, max tokens reached, or a stop phrase -> send the response and evict the request.
        2. Process new work on the next iteration of the LLM:
            1. A new prompt goes to the prefill phase.
            2. Prefill goes to generate.
            3. Generate keeps generating.
        3. Transformer ops
            1. Token-parallel: matmul, LayerNorm.
            2. Sequence-parallel: MHA.
            3. Tokens across the two types above are concatenated in in-flight batching to improve memory-bound kernels (makes them more compute-intensive).
14. Paged KV cache (see the sketch after this list)
    1. A contiguous KV cache leads to wasted memory allocation, since all of a request’s KV-cache memory must be contiguous.
    2. Instead, think of the memory as a linked list of pages - this reduces unused memory and allows lazy allocation, at the cost of a more complex attention kernel.
    3. It also allows sharing of KV cache between requests! E.g., the system prompt’s KV-cache blocks can be part of the linked lists of different requests.
15. Speculative decoding (see the sketch after this list)
    1. Instead of generating a single token as in regular autoregressive generation, generate many tokens at once.
    2. Evaluate whether the draft tokens are valid in the same time it takes to generate a single token.
    3. He speculates that speculative decoding will be used everywhere ;)
    4. It turns a latency problem into a throughput problem, which is where GPUs are great.
16. Time to first token vs. time between tokens: which is more important? Time between tokens, since time to first token is easily optimized.
17. Online vs. batch inference: which is more common? Online is important, but the idea is to turn online inference into batch inference.
18. Any specific techniques for streaming mode? Not much - stream out tokens as they are generated, since everything is async anyway.
19. Quantization sounds too good to be true. Any caveats? PTQ is model dependent.
20. A good intro paper for the changing workload? The Orca paper. Link in the Discord.
21. There are many LLM inference services. Which one to use? Each is optimized for specific use cases, so explore.
22. What questions should people ask when evaluating inference services? Clarity of quality of service (latency, throughput, accuracy) for your use case.
23. No way to avoid multi-GPU, since models keep getting bigger? For many use cases, a single GPU is just fine.
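
A toy scheduler loop illustrating the in-flight batching idea in point 10.3 above: finished requests are evicted at every model iteration and newly arrived requests join immediately, so short completions never wait for the longest one. The `engine` and request interfaces are invented for the sketch; this is not TensorRT-LLM's API.

```python
def serve_loop(engine, request_queue, max_batch_size=64):
    active = []
    while True:
        # Admit new requests up to the batch limit; their prompts enter the prefill phase.
        while request_queue and len(active) < max_batch_size:
            active.append(request_queue.pop(0))
        if not active:
            continue  # nothing to do this iteration
        # One model iteration over the whole batch: prefill for newly admitted
        # requests, one generate step for everyone else.
        engine.step(active)
        # Evict requests that hit EOS, a stop phrase, or the max-token limit,
        # and send their responses.
        for req in [r for r in active if r.is_done()]:
            req.send_response()
        active = [r for r in active if not r.is_done()]
```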
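
A minimal block-table sketch of the paged KV cache in point 14. Memory is handed out one fixed-size page at a time instead of as one contiguous slab per request; the class, page size, and bookkeeping here are made up (real implementations such as PagedAttention also store the K/V tensors inside the pages and can share pages, e.g. for a common system prompt).

```python
class PagedKVCache:
    """Toy bookkeeping only -- real caches keep K/V tensors inside each page."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size                  # tokens per page
        self.free_pages = list(range(num_pages))    # pool shared by all requests
        self.block_tables = {}                      # request_id -> list of page indices
        self.lengths = {}                           # request_id -> tokens written so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % self.page_size == 0:                 # current page full (or first token)
            page = self.free_pages.pop()            # lazy, page-at-a-time allocation
            self.block_tables.setdefault(request_id, []).append(page)
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        # On EOS / max tokens, the request's pages go straight back to the pool.
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```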
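
And a greedy-only sketch of the speculative decoding loop in point 15: a cheap draft model proposes `k` tokens, the target model checks all of them in one forward pass, and the longest agreeing prefix is kept. `draft_model.next_token` and `target_model.greedy_tokens_for` are invented interfaces, and the real acceptance rule for sampled decoding is probabilistic (see the speculative-sampling paper in the resources below).

```python
def speculative_step(draft_model, target_model, token_ids, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    draft = list(token_ids)
    for _ in range(k):
        draft.append(draft_model.next_token(draft))
    proposed = draft[len(token_ids):]

    # 2. Target model evaluates all k positions in a single forward pass:
    #    greedy_tokens_for(prefix, proposed)[i] is the target's argmax token
    #    after seeing prefix + proposed[:i].
    target_choice = target_model.greedy_tokens_for(token_ids, proposed)

    # 3. Accept draft tokens while the target agrees; on the first disagreement,
    #    keep the target's own token instead and stop.
    accepted = []
    for p, t in zip(proposed, target_choice):
        accepted.append(t)
        if p != t:
            break
    return token_ids + accepted   # between 1 and k tokens gained per target pass
```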
### Relevant resources

* [Decoding Speculative Decoding](https://arxiv.org/html/2402.01528v1)
* [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)
* [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180)

## Block-based optimization with Triton

1. CUDA - all sorts of things can be done on GPUs, but because it allows anything to be done, it creates problems and hampers productivity.
    1. The first few months of support are okay.
    2. Supporting different GPUs becomes a problem.
    3. It is opaque to researchers - they cannot read CUDA code, and reading tensor-core code requires real proficiency - so it becomes a black box and slows down research.
    4. This is addressed with graph compilers, which are better for research:
        1. Walking a tree or a linked list in PyTorch is very slow.
        2. Control flow becomes complicated with graph operators.
        3. Code generation from graph compilers is a very difficult problem - this gives rise to FlashAttention-like custom CUDA kernels.
        4. Simplicity at the cost of flexibility.
2. Triton - lower level than graph compilers, but much easier to work with than CUDA.
    1. You can write algorithms that are out of scope for graph compilers: trees, linked lists, radix sort.
    2. The code still remains readable/modifiable by researchers.
    3. Performance is portable across different vendors.
    4. Less expressive than CUDA, so not as fast.
3. Triton machine model: DRAM, L1 and L2 cache, cores, memory controllers - a basic von Neumann machine.
4. Programming model
    1. Tensors are defined in SRAM and modified using torch-like operators.
    2. Embedded in Python and just-in-time compiled.
    3. Tensors of pointers!
    4. Tensor shapes must be powers of 2!
5. Vector addition (see the sketch after this list)
    1. Each program gets a different slice of the input via tl.program_id.
6. Softmax (see the sketch after this list)
    1. Entirely fused kernels in fewer than 10 lines.
    2. The data is loaded only once, unlike in PyTorch eager mode.
7. Why a blocked program representation?
    1. Peephole optimization.
    2. SRAM allocation.
    3. Automatic vectorization - you need to issue big enough loads to keep the memory bandwidth busy.
    4. The compiler allocates shared memory in addition to registers.
    5. There is a lot of value in researchers doing kernel development!
    6. The technical debt is manageable.
8. Challenges of building kernels at OpenAI scale? Reliability vs. agility of the code base.
9. Tricks for a single GPU? Consumer GPUs have restrictions on tensor cores; go out of your way to use 16-bit tensor cores. Not a priority for OpenAI, but TinyGrad focuses on it.
10. Can model performance change after optimizations? Kernel output shouldn’t change with respect to a reference non-optimized implementation. Power-of-2 inputs.
11. Surprising kernels built on top of Triton? A sorting kernel. Hypercubes.
12. Why block-based? It grew out of his dissertation.
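
For point 5, here is the vector-addition kernel essentially as it appears in the Triton tutorials - each program instance handles one `BLOCK_SIZE`-wide slice selected via `tl.program_id`, with a mask guarding the ragged tail. Block sizes must be powers of two (point 4.4).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)           # "tensor of pointers" addressing
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)   # BLOCK_SIZE must be a power of 2
    return out
```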
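
And a row-wise fused softmax in the spirit of point 6 - the row is loaded into SRAM once, reduced, normalized, and stored once. This sketch assumes a contiguous 2-D float tensor and a `BLOCK_SIZE` (power of two) at least as large as the row length.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                            # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                         # subtract the row max for stability
    num = tl.exp(x)
    tl.store(out_ptr + row * n_cols + cols, num / tl.sum(num, axis=0), mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK_SIZE=triton.next_power_of_2(n_cols))
    return out
```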
### Relevant resources

* [https://openai.com/index/triton/](https://openai.com/index/triton/)

## Scaling data workloads on GPUs

1. Transactional databases - not GPU friendly - row oriented - CSV.
2. Analytics datasets - GPU friendly - column oriented - Parquet, Apache Arrow. Apache Arrow is everywhere today; it makes it easy to move data across multiple data platforms.
3. NVIDIA RAPIDS contains many libraries for GPU processing: cuPy, cuDF, cuML, cuGraph (see the cuDF sketch at the end of these notes).
4. Benchmarks show the performance boost from moving from CPU to GPU can be up to 100x. The speedup is larger for bigger workloads.
5. Data processing on CPUs eventually hits a wall.
6. GPUs are fast for data processing because many data-processing jobs are naturally parallelizable and GPUs have many cores.
7. What to do depends on where your job bottlenecks: memory bound, latency bound, or compute bound. Figure out where the bottleneck is by using profiling tools.

### Relevant resources

* [The composable codex](https://voltrondata.com/codex.html)

--------------------------------------------------------------------------------
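
To close out the data-processing notes, a minimal sketch of the pandas-style cuDF API mentioned in item 3 of the last section. The file and column names are invented; a RAPIDS install and an NVIDIA GPU are assumed.

```python
import cudf

# Columnar formats such as Parquet/Arrow map naturally onto GPU dataframes.
df = cudf.read_parquet("transactions.parquet")

# The same dataframe operations you would write in pandas, executed on the GPU.
top_customers = (
    df[df["amount"] > 100.0]
    .groupby("customer_id")["amount"]
    .sum()
    .sort_values(ascending=False)
)
print(top_customers.head())
```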