# Awesome-KV-Cache-Management


### News

**📢 New Benchmark Released (2025-02-18):** *"Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models"* [[PDF]](https://arxiv.org/pdf/2502.11075) [[Dataset]](https://github.com/TreeAI-Lab/NumericBench), proposing NumericBench to assess LLMs' fundamental numerical abilities!


## A Survey on Large Language Model Acceleration based on KV Cache Management [[PDF]](https://arxiv.org/pdf/2412.19442)

> *Haoyang Li 1, Yiming Li 2, Anxin Tian 2, Tianhao Tang 2, Zhanchao Xu 4, Xuejia Chen 4, Nicole Hu 3, Wei Dong 5, Qing Li 1, Lei Chen 2*

> *1 Hong Kong Polytechnic University, 2 Hong Kong University of Science and Technology, 3 The Chinese University of Hong Kong, 4 Huazhong University of Science and Technology, 5 Nanyang Technological University.*

- This repository collects papers on KV cache management for LLM acceleration. The survey will be updated regularly. If you find this survey helpful for your work, please consider citing it:
```
@article{li2024surveylargelanguagemodel,
  title={A Survey on Large Language Model Acceleration based on KV Cache Management},
  author={Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen},
  journal={arXiv preprint arXiv:2412.19442},
  year={2024}
}
```
- If you would like your paper included in this survey and repository, or would like to suggest any modifications, please feel free to send an email to haoyang-comp.li@polyu.edu.hk or open an issue with your paper's title, category, and a brief summary highlighting its key techniques. Thank you!


## Taxonomy and Papers

- [Awesome-KV-Cache-Management](#awesome-kv-cache-management)
  - [Token-level Optimization](#token-level-optimization)
    - [KV Cache Selection](#kv-cache-selection)
      - [Static KV Cache Selection (To Top👆🏻)](#static-kv-cache-selection-to-top)
      - [Dynamic Selection with Permanent Eviction (To Top👆🏻)](#dynamic-selection-with-permanent-eviction-to-top)
      - [Dynamic Selection without Permanent Eviction (To Top👆🏻)](#dynamic-selection-without-permanent-eviction-to-top)
    - [KV Cache Budget Allocation](#kv-cache-budget-allocation)
      - [Layer-wise Budget Allocation (To Top👆🏻)](#layer-wise-budget-allocation-to-top)
      - [Head-wise Budget Allocation (To Top👆🏻)](#head-wise-budget-allocation-to-top)
    - [KV Cache Merging](#kv-cache-merging)
      - [Intra-layer Merging (To Top👆🏻)](#intra-layer-merging-to-top)
      - [Cross-layer Merging (To Top👆🏻)](#cross-layer-merging-to-top)
    - [KV Cache Quantization](#kv-cache-quantization)
      - [Fixed-precision Quantization (To Top👆🏻)](#fixed-precision-quantization-to-top)
      - [Mixed-precision Quantization (To Top👆🏻)](#mixed-precision-quantization-to-top)
      - [Outlier Redistribution (To Top👆🏻)](#outlier-redistribution-to-top)
    - [KV Cache Low-rank Decomposition](#kv-cache-low-rank-decomposition)
      - [Singular Value Decomposition (To Top👆🏻)](#singular-value-decomposition-to-top)
      - [Tensor Decomposition (To Top👆🏻)](#tensor-decomposition-to-top)
      - [Learned Low-rank Approximation (To Top👆🏻)](#learned-low-rank-approximation-to-top)
  - [Model-level Optimization](#model-level-optimization)
    - [Attention Grouping and Sharing](#attention-grouping-and-sharing)
      - [Intra-Layer Grouping (To Top👆🏻)](#intra-layer-grouping-to-top)
      - [Cross-Layer Sharing (To Top👆🏻)](#cross-layer-sharing-to-top)
    - [Architecture Alteration](#architecture-alteration)
      - [Enhanced Attention (To Top👆🏻)](#enhanced-attention-to-top)
      - [Augmented Architecture (To Top👆🏻)](#augmented-architecture-to-top)
    - [Non-transformer Architecture](#non-transformer-architecture)
      - [Adaptive Sequence Processing Architecture (To Top👆🏻)](#adaptive-sequence-processing-architecture-to-top)
      - [Hybrid Architecture (To Top👆🏻)](#hybrid-architecture-to-top)
  - [System-level Optimization](#system-level-optimization)
    - [Memory Management](#memory-management)
      - [Architectural Design (To Top👆🏻)](#architectural-design-to-top)
      - [Prefix-aware Design (To Top👆🏻)](#prefix-aware-design-to-top)
    - [Scheduling](#scheduling)
      - [Prefix-aware Scheduling (To Top👆🏻)](#prefix-aware-scheduling-to-top)
      - [Preemptive and Fairness-oriented Scheduling (To Top👆🏻)](#preemptive-and-fairness-oriented-scheduling-to-top)
      - [Layer-specific and Hierarchical Scheduling (To Top👆🏻)](#layer-specific-and-hierarchical-scheduling-to-top)
    - [Hardware-aware Design](#hardware-aware-design)
      - [Single/Multi-GPU Design (To Top👆🏻)](#singlemulti-gpu-design-to-top)
      - [I/O-based Design (To Top👆🏻)](#io-based-design-to-top)
      - [Heterogeneous Design (To Top👆🏻)](#heterogeneous-design-to-top)
      - [SSD-based Design (To Top👆🏻)](#ssd-based-design-to-top)
  - [Datasets and Benchmarks](#datasets-and-benchmarks)

---

# Token-level Optimization

## KV Cache Selection

### Static KV Cache Selection ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | Static KV Cache Selection | ICLR | [Link](https://arxiv.org/pdf/2310.01801) | |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | Static KV Cache Selection | NeurIPS | [Link](https://arxiv.org/pdf/2404.14469) | [Link](https://github.com/FasterDecoding/SnapKV) |
| 2024 | In-context KV-Cache Eviction for LLMs via Attention-Gate | Static KV Cache Selection | arXiv | [Link](https://arxiv.org/pdf/2410.12876) | |

### Dynamic Selection with Permanent Eviction ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | Dynamic Selection with Permanent Eviction | MLSys | [Link](https://arxiv.org/pdf/2403.09054) | |
| 2024 | BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference | Dynamic Selection with Permanent Eviction | arXiv | [Link](https://arxiv.org/pdf/2410.23079) | [Link](https://github.com/JunqiZhao888/buzz-llm) |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | Dynamic Selection with Permanent Eviction | ACL | [Link](https://arxiv.org/pdf/2408.03675) | [Link](https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL) |
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Dynamic Selection with Permanent Eviction | NeurIPS | [Link](https://arxiv.org/pdf/2306.14048) | [Link](https://github.com/FMInference/H2O) |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | Dynamic Selection with Permanent Eviction | NeurIPS | [Link](https://arxiv.org/pdf/2305.17118) | |

### Dynamic Selection without Permanent Eviction ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | Dynamic Selection without Permanent Eviction | arXiv | [Link](https://arxiv.org/pdf/2402.04617) | [Link](https://github.com/thunlp/InfLLM) |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Dynamic Selection without Permanent Eviction | ICML | [Link](https://arxiv.org/pdf/2406.10774) | [Link](https://github.com/mit-han-lab/Quest) |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | [Link](https://arxiv.org/pdf/2407.12820) | |
| 2024 | Squeezed Attention: Accelerating Long Context Length LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | [Link](https://arxiv.org/pdf/2411.09688) | [Link](https://github.com/SqueezeAILab/SqueezedAttention) |
| 2024 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Dynamic Selection without Permanent Eviction | arXiv | [Link](https://arxiv.org/pdf/2409.10516) | [Link](https://github.com/jzbjyb/ReAtt) |
| 2024 | Human-like Episodic Memory for Infinite Context LLMs | Dynamic Selection without Permanent Eviction | arXiv | [Link](https://arxiv.org/pdf/2407.09450) | |
| 2024 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Dynamic Selection without Permanent Eviction | arXiv | [Link](https://arxiv.org/pdf/2412.03213) | |

## KV Cache Budget Allocation

### Layer-wise Budget Allocation ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | Layer-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2406.02069) | [Link](https://github.com/Zefan-Cai/KVCache-Factory) |
| 2024 | PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | Layer-wise Budget Allocation | ACL Findings | [Link](https://arxiv.org/pdf/2405.12532) | [Link](https://github.com/mutonix/pyramidinfer) |
| 2024 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | Layer-wise Budget Allocation | ICLR sub. | [Link](https://arxiv.org/pdf/2412.14838) | |
| 2024 | PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation | Layer-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2412.03409) | [Link](https://github.com/THU-MIG/PrefixKV) |
| 2024 | SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction | Layer-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2410.13846) | [Link](https://github.com/sail-sg/SimLayerKV) |

### Head-wise Budget Allocation ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | Head-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2407.11550) | |
| 2024 | Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective | Head-wise Budget Allocation | ICLR sub. | [Link](https://openreview.net/forum?id=lRTDMGYCpy) | |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Head-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2412.03131) | |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Head-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2407.15891) | |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Head-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2410.19258) | [Link](https://github.com/FYYFU/HeadKV/tree/main) |
| 2024 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Head-wise Budget Allocation | arXiv | [Link](https://arxiv.org/pdf/2410.10819) | [Link](https://github.com/mit-han-lab/duo-attention) |

## KV Cache Merging

### Intra-layer Merging ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Compressed Context Memory for Online Language Model Interaction | Intra-layer Merging | ICLR | [Link](https://openreview.net/forum?id=64kSvC4iPg) | [Link](https://github.com/snu-mllab/context-memory) |
| 2024 | LoMA: Lossless Compressed Memory Attention | Intra-layer Merging | arXiv | [Link](https://arxiv.org/pdf/2401.09486) | |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Intra-layer Merging | ICML | [Link](https://openreview.net/forum?id=tDRYrAkOB7) | [Link](https://github.com/NVIDIA/Megatron-LM/tree/dmc) |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | Intra-layer Merging | ICML | [Link](https://openreview.net/forum?id=LCTmppB165) | [Link](https://github.com/zyxxmu/cam) |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | Intra-layer Merging | arXiv | [Link](https://arxiv.org/pdf/2406.13035) | |
| 2024 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Intra-layer Merging | arXiv | [Link](https://arxiv.org/pdf/2412.03248) | [Link](https://github.com/LaVi-Lab/AIM) |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Intra-layer Merging | EMNLP Findings | [Link](https://aclanthology.org/2024.findings-emnlp.235/) | [Link](https://github.com/SUSTechBruce/LOOK-M) |
| 2024 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Intra-layer Merging | arXiv | [Link](https://arxiv.org/pdf/2407.08454) | |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | Intra-layer Merging | arXiv | [Link](https://arxiv.org/pdf/2403.08058) | |

### Cross-layer Merging ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | Cross-layer Merging | arXiv | [Link](https://arxiv.org/pdf/2405.14366) | [Link](https://github.com/AkideLiu/MiniCache) |
| 2024 | KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cross-Layer Sharing | Cross-layer Merging | arXiv | [Link](https://arxiv.org/pdf/2410.18517) | [Link](https://github.com/yangyifei729/KVSharer) |

## KV Cache Quantization

### Fixed-precision Quantization ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Fixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2406.03482) | [Link](https://github.com/amirzandieh/QJL) |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Fixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2407.12820) | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Fixed-precision Quantization | ICML | [Link](https://proceedings.mlr.press/v202/sheng23a.html) | [Link](https://github.com/FMInference/FlexLLMGen) |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Fixed-precision Quantization | NeurIPS | [Link](https://arxiv.org/pdf/2206.01861) | [Link](https://github.com/microsoft/DeepSpeed) |

### Mixed-precision Quantization ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2401.18079) | [Link](https://github.com/SqueezeAILab/KVQuant) |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2403.01241) | [Link](https://github.com/ruikangliu/IntactKV) |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2405.06219) | [Link](https://github.com/cat538/SKVQ) |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2402.02750) | [Link](https://github.com/jy-yuan/KIVI) |
| 2024 | WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2402.12065) | |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2403.05527) | [Link](https://github.com/opengear-project/GEAR) |
| 2024 | No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2402.18096) | |
| 2024 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2410.08584) | |
| 2024 | ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2405.14256) | [Link](https://github.com/ThisisBillhe/ZipCache) |
| 2024 | PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2410.05265) | [Link](https://github.com/ChenMnZ/PrefixQuant) |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Mixed-precision Quantization | arXiv | [Link](https://arxiv.org/pdf/2411.18077) | |

### Outlier Redistribution ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Massive Activations in Large Language Models | Outlier Redistribution | arXiv | [Link](https://arxiv.org/pdf/2402.17762) | [Link](https://github.com/locuslab/massive-activations) |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Outlier Redistribution | arXiv | [Link](https://arxiv.org/pdf/2404.00456) | [Link](https://github.com/spcl/QuaRot) |
| 2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Outlier Redistribution | arXiv | [Link](https://arxiv.org/pdf/2405.04532) | [Link](https://github.com/mit-han-lab/qserve) |
| 2024 | SpinQuant: LLM Quantization with Learned Rotations | Outlier Redistribution | arXiv | [Link](https://arxiv.org/pdf/2405.16406) | [Link](https://github.com/facebookresearch/SpinQuant) |
| 2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | Outlier Redistribution | NeurIPS | [Link](https://arxiv.org/pdf/2406.01721) | [Link](https://github.com/Hsu1023/DuQuant) |
| 2023 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Outlier Redistribution | ICML | [Link](https://proceedings.mlr.press/v202/xiao23c.html) | [Link](https://github.com/mit-han-lab/smoothquant) |
| 2023 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | Outlier Redistribution | EMNLP | [Link](https://arxiv.org/pdf/2304.09145) | [Link](https://github.com/ModelTC/Outlier_Suppression_Plus) |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | Outlier Redistribution | arXiv | [Link](https://arxiv.org/pdf/2403.12544) | [Link](https://github.com/bytedance/AffineQuant) |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | Outlier Redistribution | arXiv | [Link](https://arxiv.org/pdf/2410.09426) | [Link](https://github.com/ruikangliu/FlatQuant) |
| 2024 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration | Outlier Redistribution | MLSys | [Link](https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html) | [Link](https://github.com/mit-han-lab/llm-awq) |
| 2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | Outlier Redistribution | arXiv | [Link](https://arxiv.org/pdf/2308.13137) | [Link](https://github.com/OpenGVLab/OmniQuant) |
| 2023 | Training Transformers with 4-bit Integers | Outlier Redistribution | NeurIPS | [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/99fc8bc48b917c301a80cb74d91c0c06-Abstract-Conference.html) | [Link](https://github.com/xijiu9/Train_Transformers_with_INT4) |

## KV Cache Low-rank Decomposition

### Singular Value Decomposition ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Effectively Compress KV Heads for LLM | Singular Value Decomposition | arXiv | [Link](https://arxiv.org/pdf/2406.07056) | |
| 2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Singular Value Decomposition | arXiv | [Link](https://arxiv.org/pdf/2408.05646) | [Link](https://github.com/UtkarshSaxena1/EigenAttn/tree/main) |
| 2024 | Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference | Singular Value Decomposition | arXiv | [Link](https://arxiv.org/pdf/2408.04107) | |
| 2024 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Singular Value Decomposition | arXiv | [Link](https://arxiv.org/pdf/2410.03111) | |
| 2024 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Singular Value Decomposition | arXiv | [Link](https://arxiv.org/pdf/2410.21465) | [Link](https://github.com/bytedance/ShadowKV) |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | Singular Value Decomposition | arXiv | [Link](https://arxiv.org/pdf/2407.21118) | [Link](https://github.com/shadowpa0327/Palu) |

### Tensor Decomposition ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | Tensor Decomposition | ACL | [Link](https://arxiv.org/pdf/2405.12591) | [Link](https://github.com/lpyhdzx/DecoQuant_code) |

### Learned Low-rank Approximation ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | Learned Low-rank Approximation | arXiv | [Link](https://arxiv.org/pdf/2402.09398) | [Link](https://github.com/hdong920/LESS) |
| 2024 | MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection | Learned Low-rank Approximation | arXiv | [Link](https://arxiv.org/pdf/2410.14731) | |

---

# Model-level Optimization

## Attention Grouping and Sharing

### Intra-Layer Grouping ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2019 | Fast Transformer Decoding: One Write-Head is All You Need | Intra-Layer Grouping | arXiv | [Link](https://arxiv.org/pdf/1911.02150) | |
| 2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Intra-Layer Grouping | EMNLP | [Link](https://arxiv.org/pdf/2305.13245) | [Link](https://github.com/fkodom/grouped-query-attention-pytorch) |
| 2024 | Optimised Grouped-Query Attention Mechanism for Transformers | Intra-Layer Grouping | ICML | [Link](https://openreview.net/pdf?id=13MMghY6Kh) | |
| 2024 | Weighted Grouped Query Attention in Transformers | Intra-Layer Grouping | arXiv | [Link](https://arxiv.org/pdf/2407.10855) | |
| 2024 | QCQA: Quality and Capacity-aware grouped Query Attention | Intra-Layer Grouping | arXiv | [Link](https://arxiv.org/pdf/2406.10247) | [Non-official Link](https://github.com/vinayjoshi22/qcqa) |
| 2024 | Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention | Intra-Layer Grouping | arXiv | [Link](https://arxiv.org/pdf/2408.08454) | [Link](https://github.com/zohaib-khan5040/key-driven-gqa) |
| 2023 | GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | Intra-Layer Grouping | NeurIPS | [Link](https://arxiv.org/pdf/2311.03426) | |

### Cross-Layer Sharing ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Cross-Layer Sharing | arXiv | [Link](https://arxiv.org/pdf/2405.12981) | [Non-official Link](https://github.com/JerryYin777/Cross-Layer-Attention) |
| 2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | Cross-Layer Sharing | ACL | [Link](https://aclanthology.org/2024.acl-long.602.pdf) | [Link](https://github.com/whyNLP/LCKV) |
| 2024 | Beyond KV Caching: Shared Attention for Efficient LLMs | Cross-Layer Sharing | arXiv | [Link](https://arxiv.org/pdf/2407.12866) | [Link](https://github.com/metacarbon/shareAtt) |
| 2024 | MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | Cross-Layer Sharing | arXiv | [Link](https://arxiv.org/pdf/2406.09297) | [Link](https://github.com/zaydzuhri/mlkv) |
| 2024 | Cross-layer Attention Sharing for Large Language Models | Cross-Layer Sharing | arXiv | [Link](https://arxiv.org/pdf/2408.01890) | |
| 2024 | A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference | Cross-Layer Sharing | arXiv | [Link](https://arxiv.org/pdf/2410.14442) | |
| 2024 | Lossless KV Cache Compression to 2% | Cross-Layer Sharing | arXiv | [Link](https://arxiv.org/pdf/2410.15252) | |
| 2024 | DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion | Cross-Layer Sharing | NeurIPS | [Link](https://arxiv.org/pdf/2406.06567) | |
| 2024 | Value Residual Learning For Alleviating Attention Concentration In Transformers | Cross-Layer Sharing | arXiv | [Link](https://arxiv.org/pdf/2410.17897) | [Link](https://github.com/Zcchill/Value-Residual-Learning) |

## Architecture Alteration

### Enhanced Attention ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Enhanced Attention | arXiv | [Link](https://arxiv.org/pdf/2405.04434) | [Link](https://github.com/deepseek-ai/DeepSeek-V2) |
| 2022 | Transformer Quality in Linear Time | Enhanced Attention | ICML | [Link](https://proceedings.mlr.press/v162/hua22a/hua22a.pdf) | |
| 2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Enhanced Attention | arXiv | [Link](https://arxiv.org/pdf/2404.07143) | |

### Augmented Architecture ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | Augmented Architecture | arXiv | [Link](https://arxiv.org/pdf/2405.05254) | [Link](https://github.com/microsoft/unilm/tree/master/YOCO) |
| 2024 | Long-Context Language Modeling with Parallel Context Encoding | Augmented Architecture | ACL | [Link](https://aclanthology.org/2024.acl-long.142.pdf) | [Link](https://github.com/princeton-nlp/CEPE) |
| 2024 | XC-CACHE: Cross-Attending to Cached Context for Efficient LLM Inference | Augmented Architecture | EMNLP Findings | [Link](https://aclanthology.org/2024.findings-emnlp.896.pdf) | |
| 2024 | Block Transformer: Global-to-Local Language Modeling for Fast Inference | Augmented Architecture | arXiv | [Link](https://arxiv.org/pdf/2406.02657) | [Link](https://github.com/itsnamgyu/block-transformer) |

## Non-transformer Architecture

### Adaptive Sequence Processing Architecture ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2023 | RWKV: Reinventing RNNs for the Transformer Era | Adaptive Sequence Processing Architecture | EMNLP Findings | [Link](https://arxiv.org/pdf/2305.13048) | [Link](https://github.com/BlinkDL/RWKV-LM) |
| 2024 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Adaptive Sequence Processing Architecture | arXiv | [Link](https://arxiv.org/pdf/2312.00752) | [Link](https://github.com/state-spaces/mamba) |
| 2023 | Retentive Network: A Successor to Transformer for Large Language Models | Adaptive Sequence Processing Architecture | arXiv | [Link](https://arxiv.org/pdf/2307.08621) | [Link](https://github.com/microsoft/unilm/tree/master/retnet) |
| 2024 | MCSD: An Efficient Language Model with Diverse Fusion | Adaptive Sequence Processing Architecture | arXiv | [Link](https://arxiv.org/pdf/2406.12230) | |

### Hybrid Architecture ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling | Hybrid Architecture | ECAI | [Link](https://zhouchenlin.github.io/Publications/2024-ECAI-MixCon.pdf) | |
| 2024 | GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression | Hybrid Architecture | arXiv | [Link](https://arxiv.org/pdf/2407.12077) | [Link](https://github.com/recursal/GoldFinch-paper) |
| 2024 | RecurFormer: Not All Transformer Heads Need Self-Attention | Hybrid Architecture | arXiv | [Link](https://arxiv.org/pdf/2410.12850) | |

---

# System-level Optimization

## Memory Management

### Architectural Design ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Architectural Design | arXiv | [Link](https://arxiv.org/pdf/2407.15309) | [Link](https://github.com/antgroup/glake) |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Architectural Design | arXiv | [Link](https://arxiv.org/pdf/2412.03131) | |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Architectural Design | SOSP | [Link](https://arxiv.org/pdf/2309.06180) | [Link](https://github.com/vllm-project/vllm) |

### Prefix-aware Design ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Prefix-aware Design | ACL | [Link](https://arxiv.org/pdf/2402.15220) | [Link](https://github.com/microsoft/chunk-attention) |
| 2024 | MemServe: Flexible MemPool for Building Disaggregated LLM Serving with Caching | Prefix-aware Design | arXiv | [Link](https://arxiv.org/pdf/2406.17565) | |

## Scheduling

### Prefix-aware Scheduling ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Prefix-aware Scheduling | arXiv | [Link](https://arxiv.org/pdf/2412.03594) | |
| 2024 | SGLang: Efficient Execution of Structured Language Model Programs | Prefix-aware Scheduling | NeurIPS | [Link](https://arxiv.org/pdf/2312.07104) | [Link](https://github.com/sgl-project/sglang) |

### Preemptive and Fairness-oriented Scheduling ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2023 | Fast Distributed Inference Serving for Large Language Models | Preemptive and Fairness-oriented Scheduling | arXiv | [Link](https://arxiv.org/pdf/2305.05920) | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Preemptive and Fairness-oriented Scheduling | arXiv | [Link](https://arxiv.org/pdf/2411.18424) | |

### Layer-specific and Hierarchical Scheduling ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Layer-specific and Hierarchical Scheduling | arXiv | [Link](https://arxiv.org/pdf/2410.00428) | [Link](https://github.com/antgroup/glake) |
| 2024 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Layer-specific and Hierarchical Scheduling | USENIX ATC | [Link](https://arxiv.org/pdf/2403.19708) | |
| 2024 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Layer-specific and Hierarchical Scheduling | ISCA | [Link](https://arxiv.org/pdf/2403.17312) | |
| 2024 | Fast Inference for Augmented Large Language Models | Layer-specific and Hierarchical Scheduling | arXiv | [Link](https://arxiv.org/pdf/2410.18248) | |

## Hardware-aware Design

### Single/Multi-GPU Design ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Single/Multi-GPU Design | arXiv | [Link](https://arxiv.org/pdf/2402.05099) | [Link](https://github.com/ScalingIntelligence/hydragen) |
| 2024 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Single/Multi-GPU Design | arXiv | [Link](https://arxiv.org/pdf/2404.00242) | |
| 2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Single/Multi-GPU Design | OSDI | [Link](https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf) | [Link](https://github.com/LLMServe/DistServe) |
| 2024 | Multi-Bin Batching for Increasing LLM Inference Throughput | Single/Multi-GPU Design | arXiv | [Link](https://openreview.net/pdf?id=WVmarX0RNd) | |
| 2024 | Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters | Single/Multi-GPU Design | arXiv | [Link](https://arxiv.org/pdf/2408.04093) | [Link](https://github.com/Zyphra/tree_attention) |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Single/Multi-GPU Design | SOSP | [Link](https://arxiv.org/pdf/2309.06180) | [Link](https://github.com/vllm-project/vllm) |
| 2022 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Single/Multi-GPU Design | OSDI | [Link](https://www.usenix.org/system/files/osdi22-yu.pdf) | |

### I/O-based Design ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | I/O-based Design | arXiv | [Link](https://arxiv.org/pdf/2403.08845) | [Link](https://github.com/bifurcated-attn-icml-2024/gpt-fast-parallel-sampling) |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | I/O-based Design | arXiv | [Link](https://arxiv.org/pdf/2411.17089) | |
| 2024 | Fast State Restoration in LLM Serving with HCache | I/O-based Design | arXiv | [Link](https://arxiv.org/pdf/2410.05004) | |
| 2024 | Compute Or Load KV Cache? Why Not Both? | I/O-based Design | arXiv | [Link](https://arxiv.org/pdf/2410.03065) | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | I/O-based Design | arXiv | [Link](https://arxiv.org/pdf/2411.18424) | |
| 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | I/O-based Design | NeurIPS | [Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf) | [Link](https://github.com/Dao-AILab/flash-attention) |

### Heterogeneous Design ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Heterogeneous Design | arXiv | [Link](https://arxiv.org/pdf/2411.01142) | |
| 2024 | FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous Design | arXiv | [Link](https://arxiv.org/pdf/2403.11421) | |
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Heterogeneous Design | arXiv | [Link](https://arxiv.org/pdf/2407.15309) | |
| 2024 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Heterogeneous Design | arXiv | [Link](https://arxiv.org/pdf/2406.19707) | |
| 2023 | Fast Distributed Inference Serving for Large Language Models | Heterogeneous Design | arXiv | [Link](https://arxiv.org/pdf/2305.05920) | |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | Heterogeneous Design | arXiv | [Link](https://arxiv.org/pdf/2411.17089) | |
| 2023 | Stateful Large Language Model Serving with Pensieve | Heterogeneous Design | arXiv | [Link](https://arxiv.org/pdf/2312.05516) | |

### SSD-based Design ([To Top👆🏻](#awesome-kv-cache-management))

| Year | Title | Type | Venue | Paper | Code |
| ---- | ----- | ---- | ----- | ----- | ---- |
| 2024 | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | SSD-based Design | arXiv | [Link](https://arxiv.org/pdf/2409.04992) | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | SSD-based Design | ICML | [Link](https://proceedings.mlr.press/v202/sheng23a/sheng23a.pdf) | [Link](https://github.com/FMInference/FlexLLMGen) |

---

# Datasets and Benchmarks

Please refer to our paper for detailed information on this section.

---