├── README.md
└── conferences.md

/README.md:
--------------------------------------------------------------------------------
## ML Systems Onboarding Reading List

This is a reading list of papers/videos/repos I've personally found useful while ramping up on ML Systems, and that I wish more people would just sit and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below and enjoy!

[Conferences](conferences.md) where MLSys papers get published

## Attention Mechanism
* [Attention is all you need](https://arxiv.org/abs/1706.03762): Start here; still one of the best intros
* [Online normalizer calculation for softmax](https://arxiv.org/abs/1805.02867): A must-read before the Flash Attention papers; it will help you get the main "trick" (see the sketch after this list)
* [Self Attention does not need O(n^2) memory](https://arxiv.org/abs/2112.05682): Shows that attention can be computed in chunks without ever materializing the full attention matrix
* [Flash Attention 2](https://arxiv.org/abs/2307.08691): The diagrams here do a better job of explaining Flash Attention 1 as well
* [Llama 2 paper](https://arxiv.org/abs/2307.09288): Skim it for the model details
* [gpt-fast](https://github.com/pytorch-labs/gpt-fast): A great repo to come back to for minimal yet performant code
* [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409): There are tons of papers on long context lengths, but I found this to be among the clearest
* Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional

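If the online softmax trick is new to you, here is a minimal NumPy sketch of the idea behind the online normalizer paper (my own illustration, not code from the paper): the running max and the running normalizer are updated together in a single streaming pass over the scores, rescaling the partial sum whenever a larger max shows up.

```python
# Minimal illustration of online softmax: stream over the scores in blocks,
# keeping a running max and a running normalizer that get rescaled whenever a
# larger max appears, so both are found in one pass instead of two.
import numpy as np

def online_softmax(scores: np.ndarray, block_size: int = 4) -> np.ndarray:
    running_max, running_sum = -np.inf, 0.0
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        new_max = max(running_max, block.max())
        # Rescale the old partial sum to the new max before adding this block
        running_sum = running_sum * np.exp(running_max - new_max) + np.exp(block - new_max).sum()
        running_max = new_max
    return np.exp(scores - running_max) / running_sum

x = np.random.randn(128)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```

Flash Attention applies the same rescaling to a partial output accumulator for each tile of keys and values, which is why the full attention matrix never has to be materialized.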

## Performance Optimizations
* [Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems](https://arxiv.org/abs/2312.15234): Wonderful survey, start here
* [Efficiently Scaling transformer inference](https://arxiv.org/abs/2211.05102): Introduced many ideas, most notably KV caches (see the sketch after this list)
* [Making Deep Learning go Brrr from First Principles](https://horace.io/brrr_intro.html): One of the best intros to fusions and overhead
* [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192): This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding
* [Group Query Attention](https://arxiv.org/pdf/2305.13245): KV caches can be chunky; this is how you fix it
* [Orca: A Distributed Serving System for Transformer-Based Generative Models](https://www.usenix.org/conference/osdi22/presentation/yu): Introduced continuous batching (a great pre-read for the PagedAttention paper)
* [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180): The most crucial optimization for high-throughput batch inference
* [Colfax Research Blog](https://research.colfax-intl.com/blog/): Excellent blog if you're interested in learning more about CUTLASS and modern GPU programming
* [Sarathi LLM](https://arxiv.org/abs/2308.16369): Introduces chunked prefill to make workloads more balanced between prefill and decode
* [Epilogue Visitor Tree](https://dl.acm.org/doi/10.1145/3620666.3651369): Fuse custom epilogues by adding more epilogues to the same class (visitor design pattern) and represent the whole epilogue as a tree

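Several of the entries above revolve around the KV cache (GQA shrinks it, Orca batches around it, PagedAttention manages its memory), so here is a toy single-head sketch of the idea, my own illustration rather than code from any of these papers: prefill projects keys and values for the whole prompt once, and each decode step appends one row to the cache and attends over it instead of re-encoding the entire prefix.

```python
# Toy single-head decode loop with a KV cache (illustrative only: no batching,
# no real model). Prefill fills the cache once; each decode step appends a
# single K/V row and attends over the cache, so per-token work stays small
# while the cache grows linearly with sequence length.
import numpy as np

d = 64                                   # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)          # one query against all cached keys
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V                     # weighted sum of cached values

# Prefill: project the whole prompt once and store K/V
prompt = rng.standard_normal((16, d))    # 16 "token embeddings"
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: one token at a time, appending to the cache instead of recomputing
x = rng.standard_normal(d)               # current token embedding
for _ in range(8):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    x = attend(q, K_cache, V_cache)      # stand-in for the rest of the model
```

The cache is what grows with sequence length and batch size: GQA reduces it by sharing K/V heads across groups of query heads, and PagedAttention allocates it in fixed-size blocks so that many requests can pack GPU memory tightly.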

## Quantization
* [A White Paper on Neural Network Quantization](https://arxiv.org/abs/2106.08295): Start here; this will give you the foundation to quickly skim all the other papers
* [LLM.int8](https://arxiv.org/abs/2208.07339): All of Dettmers' papers are great, but this is a natural intro
* [FP8 formats for deep learning](https://arxiv.org/abs/2209.05433): For a first-hand look at how new number formats come about
* [SmoothQuant](https://arxiv.org/abs/2211.10438): Balancing rounding errors between weights and activations
* [Mixed precision training](https://arxiv.org/abs/1710.03740): The OG paper describing mixed precision training strategies for half precision

## Long context length
* [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864): The paper that introduced rotary positional embeddings
* [YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/abs/2309.00071): Extend base model context lengths with finetuning
* [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/abs/2310.01889): Scale to near-infinite context lengths as long as you can stack more GPUs

## Sparsity
* [Venom](https://arxiv.org/pdf/2310.02065): A vectorized N:M format for sparse tensor cores when the hardware only supports 2:4
* [MegaBlocks](https://arxiv.org/pdf/2211.15841): Efficient sparse training with mixture of experts
* [ReLU Strikes Back](https://openreview.net/pdf?id=osoWxY8q2E): Really enjoyed this paper as an example of doing model surgery for more efficient inference

## Distributed
* [Singularity](https://arxiv.org/abs/2202.07848): Shows how to make jobs preemptible, migratable, and elastic
* [Local SGD](https://arxiv.org/abs/1805.09767): So hot right now; reduces communication by synchronizing workers only every few steps instead of every step
* [OpenDiLoCo](https://arxiv.org/abs/2407.07852): Open replication of DiLoCo, low-communication training across decentralized, globally distributed workers
* [torchtitan](https://arxiv.org/abs/2410.06511): Minimal repository showing how to implement 4D parallelism in pure PyTorch
* [PipeDream](https://arxiv.org/abs/1806.03377): The pipeline parallelism paper
* [JIT checkpointing](https://dl.acm.org/doi/pdf/10.1145/3627703.3650085): A very clever alternative to periodic checkpointing
* [Reducing Activation Recomputation in Large Transformer models](https://arxiv.org/abs/2205.05198): The paper that introduced selective activation checkpointing and goes over activation recomputation strategies
* [Breaking the computation and communication abstraction barrier](https://arxiv.org/abs/2105.05720): God-tier paper that goes over research at the intersection of distributed computing and compilers to maximize comms overlap
* [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054): The ZeRO algorithm behind FSDP and DeepSpeed, which intelligently reduces the memory usage of data parallelism
* [Megatron-LM](https://arxiv.org/abs/1909.08053): For an introduction to tensor parallelism

--------------------------------------------------------------------------------
/conferences.md:
--------------------------------------------------------------------------------
# Conferences

* NeurIPS: https://neurips.cc/
* OSDI: https://www.usenix.org/conference/osdi
* MLSys: https://mlsys.org/
* ICLR: https://iclr.cc/
* ICML: https://icml.cc/
* ASPLOS: https://www.asplos-conference.org/
* ISCA: https://iscaconf.org/isca2024/
* HPCA: https://hpca-conf.org/2025/
* MICRO: https://microarch.org/micro57/
* SOCP: https://www.socp.org/
* SOSP: https://www.sosp.org/
* NSDI: https://www.usenix.org/conference/nsdi25

# Labs
* https://catalyst.cs.cmu.edu/

# Researchers (WIP)

A very incomplete list of researchers and their Google Scholar profiles.

* Bill Dally: https://scholar.google.com/citations?user=YZHj-Y4AAAAJ&hl=en on efficiency and compression
* Jeff Dean: https://scholar.google.com/citations?user=NMS69lQAAAAJ&hl=en on everything
* Matei Zaharia: https://scholar.google.com/citations?user=I1EvjZsAAAAJ&hl=en on large scale data processing
* Xupeng Miao: https://scholar.google.com/citations?user=aCAgdYkAAAAJ&hl=zh-CN on LLM inference
* Minjia Zhang: https://scholar.google.com/citations?user=98vX7S8AAAAJ&hl=en on LLM inference, MoE and open models
* Dan Fu: https://scholar.google.com/citations?user=Ov-sMBIAAAAJ&hl=en on flash attention and state space models
* Lianmin Zheng: https://scholar.google.com/citations?user=_7Q8uIYAAAAJ&hl=en on LLM inference and evals
* Hui Guan: https://scholar.google.com/citations?user=rfPAfBkAAAAJ&hl=en on quantization and data reuse
* Tian Li: https://scholar.google.com/citations?user=8JWoJrAAAAAJ&hl=en on federated learning
* Gauri Joshi: https://scholar.google.com/citations?user=yqIoH34AAAAJ&hl=en on federated learning
* Tianqi Chen: https://scholar.google.com/citations?user=7nlvOMQAAAAJ&hl=en on compilers and a lot more stuff
* Philip Gibbons: https://scholar.google.com/citations?user=F9kqUXkAAAAJ&hl=en on federated learning
* Christopher De Sa: https://scholar.google.com/citations?user=v7EjGHkAAAAJ&hl=en on data augmentation
* Gennady Pekhimenko: https://scholar.google.com/citations?user=ZgqVLuMAAAAJ&hl=en on MLPerf and DRAM
* Onur Mutlu: https://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en on computer architecture
* Michael Carbin: https://scholar.google.com/citations?user=mtejbKYAAAAJ&hl=en on pruning
--------------------------------------------------------------------------------