├── README.md
├── Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation
│   ├── Presentation Slides - A Crash course on GPU optimization.pdf
│   ├── Speaker Q&A - A Crash course in GPU optimization.pdf
│   └── Summary Notes - A Crash Course on GPU Optimization.pdf
├── Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia
│   ├── Presentation Slides - High Performance LLM Serving on Nvidia GPUs.pdf
│   ├── Speaker Q&A - High Performance LLM Serving on Nvidia GPUs.pdf
│   └── Summary Notes - High Performance LLM Serving on Nvidia GPUs.pdf
├── Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI
│   ├── Presentation Slides - Block Based GPU Programming with Triton.pdf
│   ├── Speaker Q&A - Block Based GPU Programming using Triton.pdf
│   └── Summary Notes - Block-based GPU Programming with Triton.pdf
├── Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data
│   ├── Presentation Slides - Scaling data processing from CPU to distributed GPU.pdf
│   ├── Speaker Q&A - Scaling data processing from CPU to distributed GPUs.pdf
│   ├── Speaker Q&A - Scaling data processing from CPUs to distributed GPUs.pdf
│   └── Summary Notes - Scaling data processing from CPU to distributed GPUs.pdf
└── community-note.md

/README.md:
--------------------------------------------------------------------------------
# GPU Optimization Workshop (May 2024)
Slides, notes, and materials for the workshop

- RSVP link: https://lu.ma/1wu5ppl5
- Host: [@chiphuyen](https://github.com/chiphuyen)'s [Discord community](https://discord.gg/C8duCmvngk)
- [YouTube recording](https://www.youtube.com/watch?v=v_q2JTIqE20)

## Pre-event note
* The talks are pretty technical, given that this is a workshop on GPU optimization. The speakers try their best to make their topics accessible, but you’ll get more out of the workshop if you familiarize yourself with the basic concepts in advance (see Reading materials below).
* The event will be livestreamed on YouTube, but questions should be asked on Discord, not YouTube.
* Given that 2,000+ people have signed up for the event, we expect a lot of interesting live discussions on Discord.
* Workshop TAs who will be helping us run the workshop:
  * [Roland Tannous](https://www.linkedin.com/in/rolandjosephtannous/)
  * [Chris Alexiuk](https://www.linkedin.com/in/csalexiuk/)
  * [Matúš Jurák](https://www.linkedin.com/in/mat%C3%BA%C5%A1-jur%C3%A1k-8bb680139/)

## Schedule
**[12:00] Crash course on GPU optimization ([Mark Saroufim](https://www.linkedin.com/in/marksaroufim/) @ Meta)**

_Mark is a PyTorch core developer and cofounder of CUDA MODE. He also ran the really fun NeurIPS [LLM Efficiency Challenge](https://neurips.cc/virtual/2023/competition/66594) last year. Previously, he was at Graphcore and Microsoft._

Mark will give an overview of why we use GPUs, the metrics that matter, and the different GPU programming models (thread-based CUDA and block-based Triton). He promises this will be a painless guide to writing CUDA/Triton kernels! This talk will give us the basics to understand the rest of the workshop.

**[12:45] High-performance LLM serving on GPUs ([Sharan Chetlur](https://www.linkedin.com/in/sharan-chetlur-1bb35912/) @ NVIDIA)**

_Sharan is a principal engineer working on TensorRT-LLM at NVIDIA.
He’s been working on CUDA since 2012, optimizing the performance of deep learning models from a single GPU up to full data center scale. Previously, he was the Director of Engineering at Cerebras._

Sharan will discuss how to build performant, flexible solutions to optimize LLM serving given the rapid evolution of new models and techniques. The talk will cover optimization techniques such as token concatenation, different batching strategies, and caching.

**[13:20] Block-based GPU Programming with Triton ([Philippe Tillet](https://www.linkedin.com/in/philippe-tillet-809b5536/) @ OpenAI)**

_Philippe is currently leading the Triton team at OpenAI. Previously, he was at pretty much all major chip makers, including NVIDIA, AMD, Intel, and Nervana._

Philippe will explain how Triton works and how its block-based programming model differs from the traditional single instruction, multiple threads (SIMT) programming model that CUDA follows. Triton aims to be higher-level than CUDA while being more expressive (lower-level) than common graph compilers like XLA and Torch-Inductor.

**[14:00] Scaling data processing from CPU to distributed GPUs ([William Malpica](https://www.linkedin.com/in/william-malpica-68577a44/) @ Voltron Data)**

_William is a co-founder of Voltron Data and the creator of BlazingSQL. He helped scale Theseus, a GPU-native query engine, to handle 100TB queries!_

Most people today use GPUs for training and inference. A category of workloads that GPUs excel at but are underutilized for is data processing. In this talk, William will discuss why large-scale data processing should be done on GPUs instead of CPUs, and how tools like cuDF, RAPIDS, and Theseus leverage GPUs for data processing.

## Reading materials

Please read the schedule above carefully. If there are terms you’re not familiar with, you might want to look them up in advance. Examples:

1. **Memory bound vs. compute bound**: whether the bottleneck is the GPU’s memory bandwidth or its compute capability.
2. **Thread-based vs. block-based**: different GPU programming models. CUDA is thread-based, while Triton is block-based.

Tools that will be discussed in the workshop:

1. [Development repository for the Triton language and compiler](https://github.com/triton-lang/triton)
    1. [Introducing Triton: Open-source GPU programming for neural networks](https://openai.com/index/triton/)
    2. [Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations](https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf)
2. [TensorRT](https://github.com/NVIDIA/TensorRT) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
3. Check out Mark’s lecture on [profiling CUDA in PyTorch](https://www.youtube.com/watch?v=LuhJEEJQgUM&ab_channel=CUDAMODE).
    1. [Model Inference Optimization Checklist](https://pytorch.org/serve/performance_checklist.html)
    2. [Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/)
4. [rapidsai/cudf - GPU DataFrame Library](https://github.com/rapidsai/cudf)
5. [Benchmarking Report: Theseus Engine | Voltron Data](https://voltrondata.com/benchmarks/theseus)

Recommended resources:
1. [How CUDA Programming Works - Stephen Jones, NVIDIA](https://www.youtube.com/watch?v=QQceTDjA4f4&ab_channel=ChristopherHollinworth) (great lecture)
2. [The Best GPUs for Deep Learning in 2023 — An In-depth Analysis](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) (Tim Dettmers)
3. [CUDA MODE Discord](https://discord.gg/cudamode). They have a great [lecture series on GPU optimization](https://github.com/cuda-mode/lectures/tree/main).

--------------------------------------------------------------------------------
/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Presentation Slides - A Crash course on GPU optimization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Presentation Slides - A Crash course on GPU optimization.pdf
--------------------------------------------------------------------------------
/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Speaker Q&A - A Crash course in GPU optimization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Speaker Q&A - A Crash course in GPU optimization.pdf
--------------------------------------------------------------------------------
/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Summary Notes - A Crash Course on GPU Optimization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 1 - A Crash course on GPU Optimization - Mark Saroufim - Meta corporation/Summary Notes - A Crash Course on GPU Optimization.pdf
--------------------------------------------------------------------------------
/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Presentation Slides - High Performance LLM Serving on Nvidia GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Presentation Slides - High Performance LLM Serving on Nvidia GPUs.pdf
--------------------------------------------------------------------------------
/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Speaker Q&A - High Performance LLM Serving on Nvidia GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Speaker Q&A - High Performance LLM Serving on Nvidia GPUs.pdf
--------------------------------------------------------------------------------
/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Summary Notes - High Performance LLM Serving on Nvidia GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 2 - High Performance LLM Serving on Nvidia GPUs - Sharan Chetlur -Nvidia/Summary Notes - High Performance LLM Serving on Nvidia GPUs.pdf
--------------------------------------------------------------------------------
/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Presentation Slides - Block Based GPU Programming with Triton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Presentation Slides - Block Based GPU Programming with Triton.pdf
--------------------------------------------------------------------------------
/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Speaker Q&A - Block Based GPU Programming using Triton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Speaker Q&A - Block Based GPU Programming using Triton.pdf
--------------------------------------------------------------------------------
/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Summary Notes - Block-based GPU Programming with Triton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 3 - Block Based GPU Programming with Triton - Phil Tillet - OpenAI/Summary Notes - Block-based GPU Programming with Triton.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Presentation Slides - Scaling data processing from CPU to distributed GPU.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Presentation Slides - Scaling data processing from CPU to distributed GPU.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPU to distributed GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPU to distributed GPUs.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPUs to distributed GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Speaker Q&A - Scaling data processing from CPUs to distributed GPUs.pdf
--------------------------------------------------------------------------------
/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Summary Notes - Scaling data processing from CPU to distributed GPUs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlops-discord/gpu-optimization-workshop/df2e9c3f73e53a85bde69545048e49e5b4333135/Talk 4 - Scaling data processing from CPU to distributed GPU - William Malpica - Voltron Data/Summary Notes - Scaling data processing from CPU to distributed GPUs.pdf
--------------------------------------------------------------------------------
/community-note.md:
--------------------------------------------------------------------------------
_Thanks Ajinkya Tejankar, Mohsin Iqbal, Sujit Ahirrao, and Krishna Gupta for the notes!_

## Crash course on GPU optimization
1. PyTorch
    1. Needs to support different dtypes, layouts, and devices, so its kernels have to stay general.
    2. Eager execution makes debugging easy, but this trades off performance.
    3. Models written for eager execution may not get the best performance.
2. Pointwise ops
    1. Each element is assigned to a single thread, and the threads run in parallel.
3. Memory hierarchy
    1. Load the data into shared memory, then apply the ReLU.
    2. This is done differently in Triton.
4. Eager execution can lead to unnecessary memory accesses when kernels are called repeatedly.
5. GPU memory bandwidth is the bottleneck, not FLOPs.
6. Arithmetic intensity = number of operations / number of bytes accessed.
7. Repeated calls to kernels can be fused together with `torch.compile`, which generates Triton (see the PyTorch 2 paper for more information, and the sketch after this list).
8. Moving from FP32 to FP16 or BF16 improves performance significantly.
9. `torch.set_float32_matmul_precision('high')` => uses tensor cores => see the talk by NVIDIA about tensor cores on CUDA MODE.
10. Much of the time may be spent figuring out which GPU kernel to use, because of point 1.1 above.
11. CUDA kernels are async, so queue them up -> CUDA graphs ("reduce-overhead" mode in `torch.compile`).
12. Quantization helps compute-bound but also memory-bound kernels, since it reduces the number of bytes accessed in the arithmetic-intensity calculation.
13. gpt-fast: weight-only quantization.
14. "Int8" is ambiguous - quantize the optimizer states? The gradients? It is not applied over the whole model, only the linear layers. W8A16 -> int8 weights (16-bit activations).
15. Bit packing: pack two int4 values into a single int8 (sketch after this list).
16. Compute-bound problems: become better at math.
17. Why couldn’t a compiler have figured out FlashAttention (question from a reviewer)? Compilers are good at fusing kernels, but not at the math of the operations.
18. The online softmax paper explains FlashAttention better.
19. Learn the basics of CUDA - Programming Massively Parallel Processors: A Hands-on Approach - it helps with compute-bound kernels.
20. The `load_inline` function in PyTorch’s `cpp_extension` lets you compile a custom kernel from Python (sketch after this list).
21. NVIDIA provides a profiler, `ncu` - a good supplement to reading the above book.
22. Write kernels!! There is good content on CUDA MODE; join it to write custom kernels.
23. Karpathy is building in raw CUDA.
24. Reach out to Mark for shipping your hand-written CUDA kernels (he’ll help with the release).
25. Learning through mentorship is great, since public docs are not great at the moment.
26. Quantization is not possible through `torch.compile`.
27. How to make PyTorch models faster, in order: fuse more, use tensor cores, reduce overhead, quantize, use a custom kernel.
28. How is ExecuTorch different from `torch.compile`? It focuses on more constrained devices. However, Dynamo (a part of the compile subsystem) is shared.
29. How does PyTorch treat GPUs other than NVIDIA’s? Triton provides backends that work on Intel and AMD GPUs, so PyTorch just generates Triton. Hierarchical IR and code gen.
30. What do you think about 1-bit quantization? Eval does not scale. Bit packing can help.
31. Common pitfalls of running GPUs?
    1. Eager: profile first to figure out the real bottlenecks.
    2. Compile: enable the first three items in point 27.
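
To make points 4, 7, 9, and 11 above concrete, here is a minimal sketch (not from the talk) of letting `torch.compile` fuse a chain of pointwise ops into one generated Triton kernel. The function, shapes, and tensor names are made up for illustration; a CUDA GPU and PyTorch 2.x are assumed.

```python
import torch

torch.set_float32_matmul_precision("high")  # allow TF32 tensor cores for fp32 matmuls (point 9)

def bias_gelu(x, bias):
    # Two pointwise ops: in eager mode each op reads and writes the full tensor
    # from/to global memory, so the work is memory-bandwidth bound (points 4-6).
    return torch.nn.functional.gelu(x + bias)

# torch.compile traces the function and emits one fused Triton kernel, so the
# intermediate (x + bias) never round-trips through DRAM. mode="reduce-overhead"
# additionally captures CUDA graphs to cut kernel-launch overhead (point 11).
compiled = torch.compile(bias_gelu, mode="reduce-overhead")

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = compiled(x, bias)
```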
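
A toy illustration of the bit-packing idea in point 15 - packing pairs of int4 values (held here in uint8 containers) into one byte. This is just the arithmetic, not the packing layout of any particular library.

```python
import torch

def pack_int4(x: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (0..15, stored in uint8) into one byte each."""
    assert x.dtype == torch.uint8 and x.numel() % 2 == 0
    lo, hi = x[0::2], x[1::2]
    return (lo & 0x0F) | (hi << 4)          # low nibble = even elements, high nibble = odd

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack([lo, hi], dim=-1).flatten()  # restore the original interleaving

vals = torch.randint(0, 16, (8,), dtype=torch.uint8)
assert torch.equal(unpack_int4(pack_int4(vals)), vals)  # halves the bytes stored
```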
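
And a sketch of the `load_inline` workflow from point 20, compiling a toy element-wise square kernel from a Python string. The kernel and extension names are invented; an NVCC toolchain and a CUDA GPU are assumed.

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    square_kernel<<<(n + threads - 1) / threads, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# load_inline compiles the extension on the fly and exposes `square` as a Python function.
mod = load_inline(
    name="square_ext",
    cpp_sources="torch::Tensor square(torch::Tensor x);",
    cuda_sources=cuda_src,
    functions=["square"],
)

x = torch.arange(8, dtype=torch.float32, device="cuda")
print(mod.square(x))
```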
### Relevant resources
* CUDA programming model basics:
    * [https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#4](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#4)
    * [https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a4](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a4)
    * [https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a7](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf#a7)
* [Programming Massively Parallel Processors: A Hands-on Approach](https://a.co/d/folc0LI)
* CUDA MODE: [discord.gg/cudamode](https://discord.gg/cudamode)
* Native PyTorch library for quantization and sparsity: [https://github.com/pytorch/ao](https://github.com/pytorch/ao)
* Learn Triton: [https://github.com/cuda-mode/triton-index/](https://github.com/cuda-mode/triton-index/)

## LLM Serving optimization

1. The focus is on server-based systems, not edge; end-user latencies are important.
2. Multi-functional, accurate models are large - deployment and optimization is a challenge.
3. Many models, very big models, new operators (optimization becomes a moving target).
4. Goal: SoTA performance for LLMs in production deployments.
5. A fast forward pass is very important. Intelligent batching is also important.
6. Other techniques, like KV-cache optimization, improve the GPU workload.
7. Quantization
    1. As long as you can preserve accuracy, lower bit-width precisions are great.
        1. Less memory, higher-throughput communication between GPUs, faster computation (an all-round win).
    2. Post-training quantization is the most common.
    3. The TensorRT Model Optimizer offers a bunch of techniques:
        1. PTQ (post-training quantization) and QAT (quantization-aware training).
        2. SmoothQuant and INT4 AWQ don’t lead to much of a drop in accuracy (MMLU).
8. An LLM request has two phases:
    1. Prefill: process the prompt, generate the first token, and initialize the KV cache. Called only once per request. Lots of parallel operations across tokens.
    2. Generate: starts from the prior state (KV cache) and generates the next token, updating the KV cache. Called in a loop for each request. Lots of memory-bound operations.
    3. Attention is complex - features like GQA and speculative decoding increase the math-to-data-movement ratio (arithmetic intensity).
    4. TRT-LLM’s fastest implementations use hand-tuned custom CUDA kernels.
9. Traditional request scheduling (static batching)
    1. Accumulate, batch, forward.
    2. Treating a request as an atomic operation works for fixed-length inputs; however, for tasks like completion, where outputs differ in length, it is not great (image vs. chat).
    3. The largest completion in a batch can stall the smallest one. Padding also wastes computation.
10. LLM request properties
    1. Multiple forward passes, and the number is unknown a priori.
    2. Online setting: request arrival times are not known a priori.
    3. In-flight batching (see the sketch after this list)
        1. On EOS, max tokens reached, or a stop phrase -> send the response and evict the request.
        2. Process new work on the next iteration of the LLM:
            1. A new prompt goes to the prefill phase.
            2. Prefill goes to generate.
            3. Generate keeps generating.
        3. Transformer ops
            1. Token-parallel: matmul, LayerNorm.
            2. Sequence-parallel: MHA.
            3. Tokens across the two types above are concatenated in in-flight batching to improve memory-bound kernels (makes them more compute-intensive).
14. Paged KV cache (see the sketch after this list)
    1. A contiguous KV cache leads to wasted memory allocation, since all of a request’s KV-cache memory must be contiguous.
    2. Instead, think of the memory as a linked list of pages - this reduces unused memory and allows lazy allocation, at the cost of a more complex attention kernel.
    3. It also allows sharing of KV cache between requests! E.g., the system prompt’s KV-cache blocks can be part of the linked lists of different requests.
15. Speculative decoding (see the sketch after this list)
    1. Instead of generating a single token as in regular autoregressive generation, generate many tokens at once.
    2. Evaluate whether the draft tokens are valid in the same time it takes to generate a single token.
    3. He speculates that speculative decoding will be used everywhere ;)
    4. It turns a latency problem into a throughput problem, which is where GPUs are great.
16. Time to first token vs. time between tokens: which is more important? Time between tokens, since time to first token is easily optimized.
17. Online vs. batch inference: which is more common? Online is important, but the idea is to turn online inference into batch inference.
18. Any specific techniques for streaming mode? Not much - stream out tokens as they are generated, since everything is async anyway.
19. Quantization sounds too good to be true. Any caveats? PTQ is model dependent.
20. A good intro paper for the changing workload? The Orca paper. Link in the Discord.
21. There are many LLM inference services. Which one to use? Each is optimized for specific use cases, so explore.
22. What questions should people ask when evaluating inference services? Clarity of quality of service (latency, throughput, accuracy) for your use case.
23. No way to avoid multi-GPU, since models keep getting bigger? For many use cases, a single GPU is just fine.
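
A toy scheduler loop illustrating the in-flight batching idea in point 10.3 above: finished requests are evicted at every model iteration and newly arrived requests join immediately, so short completions never wait for the longest one. The `engine` and request interfaces are invented for the sketch; this is not TensorRT-LLM's API.

```python
def serve_loop(engine, request_queue, max_batch_size=64):
    active = []
    while True:
        # Admit new requests up to the batch limit; their prompts enter the prefill phase.
        while request_queue and len(active) < max_batch_size:
            active.append(request_queue.pop(0))
        if not active:
            continue  # nothing to do this iteration
        # One model iteration over the whole batch: prefill for newly admitted
        # requests, one generate step for everyone else.
        engine.step(active)
        # Evict requests that hit EOS, a stop phrase, or the max-token limit,
        # and send their responses.
        for req in [r for r in active if r.is_done()]:
            req.send_response()
        active = [r for r in active if not r.is_done()]
```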
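
A minimal block-table sketch of the paged KV cache in point 14. Memory is handed out one fixed-size page at a time instead of as one contiguous slab per request; the class, page size, and bookkeeping here are made up (real implementations such as PagedAttention also store the K/V tensors inside the pages and can share pages, e.g. for a common system prompt).

```python
class PagedKVCache:
    """Toy bookkeeping only -- real caches keep K/V tensors inside each page."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size                  # tokens per page
        self.free_pages = list(range(num_pages))    # pool shared by all requests
        self.block_tables = {}                      # request_id -> list of page indices
        self.lengths = {}                           # request_id -> tokens written so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % self.page_size == 0:                 # current page full (or first token)
            page = self.free_pages.pop()            # lazy, page-at-a-time allocation
            self.block_tables.setdefault(request_id, []).append(page)
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        # On EOS / max tokens, the request's pages go straight back to the pool.
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```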
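
And a greedy-only sketch of the speculative decoding loop in point 15: a cheap draft model proposes `k` tokens, the target model checks all of them in one forward pass, and the longest agreeing prefix is kept. `draft_model.next_token` and `target_model.greedy_tokens_for` are invented interfaces, and the real acceptance rule for sampled decoding is probabilistic (see the speculative-sampling paper in the resources below).

```python
def speculative_step(draft_model, target_model, token_ids, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    draft = list(token_ids)
    for _ in range(k):
        draft.append(draft_model.next_token(draft))
    proposed = draft[len(token_ids):]

    # 2. Target model evaluates all k positions in a single forward pass:
    #    greedy_tokens_for(prefix, proposed)[i] is the target's argmax token
    #    after seeing prefix + proposed[:i].
    target_choice = target_model.greedy_tokens_for(token_ids, proposed)

    # 3. Accept draft tokens while the target agrees; on the first disagreement,
    #    keep the target's own token instead and stop.
    accepted = []
    for p, t in zip(proposed, target_choice):
        accepted.append(t)
        if p != t:
            break
    return token_ids + accepted   # between 1 and k tokens gained per target pass
```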
### Relevant resources

* [Decoding Speculative Decoding](https://arxiv.org/html/2402.01528v1)
* [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)
* [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180)

## Block-based optimization with Triton

1. CUDA - all sorts of things can be done on GPUs, but because it allows anything to be done, it creates problems and hampers productivity.
    1. The first few months of support are okay.
    2. Supporting different GPUs becomes a problem.
    3. It is opaque to researchers - they cannot read CUDA code, and reading tensor-core code requires real proficiency - so it becomes a black box and slows down research.
    4. This is addressed with graph compilers, which are better for research:
        1. Walking a tree or a linked list in PyTorch is very slow.
        2. Control flow becomes complicated with graph operators.
        3. Code generation from graph compilers is a very difficult problem - this gives rise to FlashAttention-like custom CUDA kernels.
        4. Simplicity at the cost of flexibility.
2. Triton - lower level than graph compilers, but much easier to work with than CUDA.
    1. You can write algorithms that are out of scope for graph compilers: trees, linked lists, radix sort.
    2. The code still remains readable/modifiable by researchers.
    3. Performance is portable across different vendors.
    4. Less expressive than CUDA, so not as fast.
3. Triton machine model: DRAM, L1 and L2 cache, cores, memory controllers - a basic von Neumann machine.
4. Programming model
    1. Tensors are defined in SRAM and modified using torch-like operators.
    2. Embedded in Python and just-in-time compiled.
    3. Tensors of pointers!
    4. Tensor shapes must be powers of 2!
5. Vector addition (see the sketch after this list)
    1. Each program gets a different slice of the input via tl.program_id.
6. Softmax (see the sketch after this list)
    1. Entirely fused kernels in fewer than 10 lines.
    2. The data is loaded only once, unlike in PyTorch eager mode.
7. Why a blocked program representation?
    1. Peephole optimization.
    2. SRAM allocation.
    3. Automatic vectorization - you need to issue big enough loads to keep the memory bandwidth busy.
    4. The compiler allocates shared memory in addition to registers.
    5. There is a lot of value in researchers doing kernel development!
    6. The technical debt is manageable.
8. Challenges of building kernels at OpenAI scale? Reliability vs. agility of the code base.
9. Tricks for a single GPU? Consumer GPUs have restrictions on tensor cores; go out of your way to use 16-bit tensor cores. Not a priority for OpenAI, but TinyGrad focuses on it.
10. Can model performance change after optimizations? Kernel output shouldn’t change with respect to a reference non-optimized implementation. Power-of-2 inputs.
11. Surprising kernels built on top of Triton? A sorting kernel. Hypercubes.
12. Why block-based? It grew out of his dissertation.
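
For point 5, here is the vector-addition kernel essentially as it appears in the Triton tutorials - each program instance handles one `BLOCK_SIZE`-wide slice selected via `tl.program_id`, with a mask guarding the ragged tail. Block sizes must be powers of two (point 4.4).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)           # "tensor of pointers" addressing
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)   # BLOCK_SIZE must be a power of 2
    return out
```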
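
And a row-wise fused softmax in the spirit of point 6 - the row is loaded into SRAM once, reduced, normalized, and stored once. This sketch assumes a contiguous 2-D float tensor and a `BLOCK_SIZE` (power of two) at least as large as the row length.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                            # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                         # subtract the row max for stability
    num = tl.exp(x)
    tl.store(out_ptr + row * n_cols + cols, num / tl.sum(num, axis=0), mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK_SIZE=triton.next_power_of_2(n_cols))
    return out
```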
### Relevant resources

* [https://openai.com/index/triton/](https://openai.com/index/triton/)

## Scaling data workloads on GPUs

1. Transactional databases - not GPU friendly - row oriented - CSV.
2. Analytics datasets - GPU friendly - column oriented - Parquet, Apache Arrow. Apache Arrow is everywhere today; it makes it easy to move data across multiple data platforms.
3. NVIDIA RAPIDS contains many libraries for GPU processing: cuPy, cuDF, cuML, cuGraph (see the cuDF sketch at the end of these notes).
4. Benchmarks show the performance boost from moving from CPU to GPU can be up to 100x. The speedup is larger for bigger workloads.
5. Data processing on CPUs eventually hits a wall.
6. GPUs are fast for data processing because many data-processing jobs are naturally parallelizable and GPUs have many cores.
7. What to do depends on where your job bottlenecks: memory bound, latency bound, or compute bound. Figure out where the bottleneck is by using profiling tools.

### Relevant resources

* [The composable codex](https://voltrondata.com/codex.html)

--------------------------------------------------------------------------------
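
To close out the data-processing notes, a minimal sketch of the pandas-style cuDF API mentioned in item 3 of the last section. The file and column names are invented; a RAPIDS install and an NVIDIA GPU are assumed.

```python
import cudf

# Columnar formats such as Parquet/Arrow map naturally onto GPU dataframes.
df = cudf.read_parquet("transactions.parquet")

# The same dataframe operations you would write in pandas, executed on the GPU.
top_customers = (
    df[df["amount"] > 100.0]
    .groupby("customer_id")["amount"]
    .sum()
    .sort_values(ascending=False)
)
print(top_customers.head())
```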