├── README.md
└── conferences.md

/README.md:
--------------------------------------------------------------------------------
## ML Systems Onboarding Reading List

This is a reading list of papers/videos/repos I've personally found useful while ramping up on ML Systems, and that I wish more people would just sit and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below and enjoy!

[Conferences](conferences.md) where MLSys papers get published

## Attention Mechanism
* [Attention is all you need](https://arxiv.org/abs/1706.03762): Start here; still one of the best intros
* [Online normalizer calculation for softmax](https://arxiv.org/abs/1805.02867): A must-read before the Flash Attention papers; it will help you get the main "trick" (see the sketch after this list)
* [Self Attention does not need O(n^2) memory](https://arxiv.org/abs/2112.05682): Shows that attention can be computed in chunks without ever materializing the full attention matrix
* [Flash Attention 2](https://arxiv.org/abs/2307.08691): The diagrams here do a better job of explaining Flash Attention 1 as well
* [Llama 2 paper](https://arxiv.org/abs/2307.09288): Skim it for the model details
* [gpt-fast](https://github.com/pytorch-labs/gpt-fast): A great repo to come back to for minimal yet performant code
* [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409): There are tons of papers on long context lengths, but I found this to be among the clearest
* Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional

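If the online softmax trick is new to you, here is a minimal NumPy sketch of the idea behind the online normalizer paper (my own illustration, not code from the paper): the running max and the running normalizer are updated together in a single streaming pass over the scores, rescaling the partial sum whenever a larger max shows up.

```python
# Minimal illustration of online softmax: stream over the scores in blocks,
# keeping a running max and a running normalizer that get rescaled whenever a
# larger max appears, so both are found in one pass instead of two.
import numpy as np

def online_softmax(scores: np.ndarray, block_size: int = 4) -> np.ndarray:
    running_max, running_sum = -np.inf, 0.0
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        new_max = max(running_max, block.max())
        # Rescale the old partial sum to the new max before adding this block
        running_sum = running_sum * np.exp(running_max - new_max) + np.exp(block - new_max).sum()
        running_max = new_max
    return np.exp(scores - running_max) / running_sum

x = np.random.randn(128)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```

Flash Attention applies the same rescaling to a partial output accumulator for each tile of keys and values, which is why the full attention matrix never has to be materialized.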

## Performance Optimizations
* [Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems](https://arxiv.org/abs/2312.15234): Wonderful survey, start here
* [Efficiently Scaling transformer inference](https://arxiv.org/abs/2211.05102): Introduced many ideas, most notably KV caches (see the sketch after this list)
* [Making Deep Learning go Brrr from First Principles](https://horace.io/brrr_intro.html): One of the best intros to fusions and overhead
* [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192): This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding
* [Group Query Attention](https://arxiv.org/pdf/2305.13245): KV caches can be chunky; this is how you fix it
* [Orca: A Distributed Serving System for Transformer-Based Generative Models](https://www.usenix.org/conference/osdi22/presentation/yu): Introduced continuous batching (a great pre-read for the PagedAttention paper)
* [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180): The most crucial optimization for high-throughput batch inference
* [Colfax Research Blog](https://research.colfax-intl.com/blog/): Excellent blog if you're interested in learning more about CUTLASS and modern GPU programming
* [Sarathi LLM](https://arxiv.org/abs/2308.16369): Introduces chunked prefill to make workloads more balanced between prefill and decode
* [Epilogue Visitor Tree](https://dl.acm.org/doi/10.1145/3620666.3651369): Fuse custom epilogues by adding more epilogues to the same class (visitor design pattern) and represent the whole epilogue as a tree

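Several of the entries above revolve around the KV cache (GQA shrinks it, Orca batches around it, PagedAttention manages its memory), so here is a toy single-head sketch of the idea, my own illustration rather than code from any of these papers: prefill projects keys and values for the whole prompt once, and each decode step appends one row to the cache and attends over it instead of re-encoding the entire prefix.

```python
# Toy single-head decode loop with a KV cache (illustrative only: no batching,
# no real model). Prefill fills the cache once; each decode step appends a
# single K/V row and attends over the cache, so per-token work stays small
# while the cache grows linearly with sequence length.
import numpy as np

d = 64                                   # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)          # one query against all cached keys
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V                     # weighted sum of cached values

# Prefill: project the whole prompt once and store K/V
prompt = rng.standard_normal((16, d))    # 16 "token embeddings"
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: one token at a time, appending to the cache instead of recomputing
x = rng.standard_normal(d)               # current token embedding
for _ in range(8):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    x = attend(q, K_cache, V_cache)      # stand-in for the rest of the model
```

The cache is what grows with sequence length and batch size: GQA reduces it by sharing K/V heads across groups of query heads, and PagedAttention allocates it in fixed-size blocks so that many requests can pack GPU memory tightly.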

## Quantization
* [A White Paper on Neural Network Quantization](https://arxiv.org/abs/2106.08295): Start here; this will give you the foundation to quickly skim all the other papers
* [LLM.int8](https://arxiv.org/abs/2208.07339): All of Dettmers' papers are great, but this is a natural intro
* [FP8 formats for deep learning](https://arxiv.org/abs/2209.05433): For a first-hand look at how new number formats come about
* [SmoothQuant](https://arxiv.org/abs/2211.10438): Balancing rounding errors between weights and activations
* [Mixed precision training](https://arxiv.org/abs/1710.03740): The OG paper describing mixed precision training strategies for half precision

## Long context length
* [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864): The paper that introduced rotary positional embeddings
* [YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/abs/2309.00071): Extend base model context lengths with finetuning
* [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/abs/2310.01889): Scale to near-infinite context lengths as long as you can stack more GPUs

## Sparsity
* [Venom](https://arxiv.org/pdf/2310.02065): A vectorized N:M format for sparse tensor cores when the hardware only supports 2:4
* [MegaBlocks](https://arxiv.org/pdf/2211.15841): Efficient sparse training with mixture of experts
* [ReLU Strikes Back](https://openreview.net/pdf?id=osoWxY8q2E): Really enjoyed this paper as an example of doing model surgery for more efficient inference

## Distributed
* [Singularity](https://arxiv.org/abs/2202.07848): Shows how to make jobs preemptible, migratable, and elastic
* [Local SGD](https://arxiv.org/abs/1805.09767): So hot right now; reduces communication by synchronizing workers only every few steps instead of every step
* [OpenDiLoCo](https://arxiv.org/abs/2407.07852): Open replication of DiLoCo, low-communication training across decentralized, globally distributed workers
* [torchtitan](https://arxiv.org/abs/2410.06511): Minimal repository showing how to implement 4D parallelism in pure PyTorch
* [PipeDream](https://arxiv.org/abs/1806.03377): The pipeline parallelism paper
* [JIT checkpointing](https://dl.acm.org/doi/pdf/10.1145/3627703.3650085): A very clever alternative to periodic checkpointing
* [Reducing Activation Recomputation in Large Transformer models](https://arxiv.org/abs/2205.05198): The paper that introduced selective activation checkpointing and goes over activation recomputation strategies
* [Breaking the computation and communication abstraction barrier](https://arxiv.org/abs/2105.05720): God-tier paper that goes over research at the intersection of distributed computing and compilers to maximize comms overlap
* [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054): The ZeRO algorithm behind FSDP and DeepSpeed, which intelligently reduces the memory usage of data parallelism
* [Megatron-LM](https://arxiv.org/abs/1909.08053): For an introduction to tensor parallelism

--------------------------------------------------------------------------------
/conferences.md:
--------------------------------------------------------------------------------
# Conferences

* NeurIPS: https://neurips.cc/
* OSDI: https://www.usenix.org/conference/osdi
* MLSys: https://mlsys.org/
* ICLR: https://iclr.cc/
* ICML: https://icml.cc/
* ASPLOS: https://www.asplos-conference.org/
* ISCA: https://iscaconf.org/isca2024/
* HPCA: https://hpca-conf.org/2025/
* MICRO: https://microarch.org/micro57/
* SOCP: https://www.socp.org/
* SOSP: https://www.sosp.org/
* NSDI: https://www.usenix.org/conference/nsdi25

# Labs
* https://catalyst.cs.cmu.edu/

# Researchers (WIP)

A very incomplete list of researchers and their Google Scholar profiles.

* Bill Dally: https://scholar.google.com/citations?user=YZHj-Y4AAAAJ&hl=en on efficiency and compression
* Jeff Dean: https://scholar.google.com/citations?user=NMS69lQAAAAJ&hl=en on everything
* Matei Zaharia: https://scholar.google.com/citations?user=I1EvjZsAAAAJ&hl=en on large scale data processing
* Xupeng Miao: https://scholar.google.com/citations?user=aCAgdYkAAAAJ&hl=zh-CN on LLM inference
* Minjia Zhang: https://scholar.google.com/citations?user=98vX7S8AAAAJ&hl=en on LLM inference, MoE and open models
* Dan Fu: https://scholar.google.com/citations?user=Ov-sMBIAAAAJ&hl=en on flash attention and state space models
* Lianmin Zheng: https://scholar.google.com/citations?user=_7Q8uIYAAAAJ&hl=en on LLM inference and evals
* Hui Guan: https://scholar.google.com/citations?user=rfPAfBkAAAAJ&hl=en on quantization and data reuse
* Tian Li: https://scholar.google.com/citations?user=8JWoJrAAAAAJ&hl=en on federated learning
* Gauri Joshi: https://scholar.google.com/citations?user=yqIoH34AAAAJ&hl=en on federated learning
* Tianqi Chen: https://scholar.google.com/citations?user=7nlvOMQAAAAJ&hl=en on compilers and a lot more stuff
* Philip Gibbons: https://scholar.google.com/citations?user=F9kqUXkAAAAJ&hl=en on federated learning
* Christopher De Sa: https://scholar.google.com/citations?user=v7EjGHkAAAAJ&hl=en on data augmentation
* Gennady Pekhimenko: https://scholar.google.com/citations?user=ZgqVLuMAAAAJ&hl=en on MLPerf and DRAM
* Onur Mutlu: https://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en on computer architecture
* Michael Carbin: https://scholar.google.com/citations?user=mtejbKYAAAAJ&hl=en on pruning
--------------------------------------------------------------------------------