├── README.md
└── img
├── overview.png
└── title.png
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome LLM Inference Serving
2 |
3 | ![title](img/title.png)
4 |
5 | ![overview](img/overview.png)
6 |
7 | This repository contains the literature referenced in [Taming the Titans: A Survey of Efficient LLM Inference Serving](https://arxiv.org/abs/2504.19720), and will be updated regularly.
8 |
9 |
10 |
11 | ## Table of Contents
12 | 
13 | - [Table of Contents](#table-of-contents)
14 | - [LLM Inference Serving in Instance](#llm-inference-serving-in-instance)
15 |   - [Model Placement](#model-placement)
16 |     - [Model Parallelism](#model-parallelism)
17 |     - [Offloading](#offloading)
18 |   - [Request Scheduling](#request-scheduling)
19 |     - [Inter-Request Scheduling](#inter-request-scheduling)
20 |     - [Intra-Request Scheduling](#intra-request-scheduling)
21 |   - [Decoding Length Prediction](#decoding-length-prediction)
22 |     - [Exact Length Prediction](#exact-length-prediction)
23 |     - [Range-Based Classification](#range-based-classification)
24 |     - [Relative Ranking Prediction](#relative-ranking-prediction)
25 |   - [KV Cache Optimization](#kv-cache-optimization)
26 |     - [Memory Management](#memory-management)
27 |     - [Reuse Strategies](#reuse-strategies)
28 |     - [Compression Techniques](#compression-techniques)
29 |   - [PD Disaggregation](#pd-disaggregation)
30 | - [LLM Inference Serving in Cluster](#llm-inference-serving-in-cluster)
31 |   - [Cluster Optimization](#cluster-optimization)
32 |     - [Architecture and Optimization for Heterogeneous Resources](#architecture-and-optimization-for-heterogeneous-resources)
33 |     - [Service-Aware Scheduling](#service-aware-scheduling)
34 |   - [Load Balancing](#load-balancing)
35 |     - [Heuristic Algorithm](#heuristic-algorithm)
36 |     - [Dynamic Scheduling](#dynamic-scheduling)
37 |     - [Intelligent Predictive Scheduling](#intelligent-predictive-scheduling)
38 |   - [Cloud-Based LLM Serving](#cloud-based-llm-serving)
39 |     - [Deployment and Computing Effectiveness](#deployment-and-computing-effectiveness)
40 |     - [Cooperation with Edge Devices](#cooperation-with-edge-devices)
41 | - [Emerging Scenarios](#emerging-scenarios)
42 |   - [Long Context](#long-context)
43 |     - [Parallel Processing](#parallel-processing)
44 |     - [Attention Computation](#attention-computation)
45 |     - [KV Cache Management](#kv-cache-management)
46 |   - [RAG](#rag)
47 |     - [Workflow Scheduling](#workflow-scheduling)
48 |     - [Storage Optimization](#storage-optimization)
49 |   - [MoE](#moe)
50 |     - [Expert Placement](#expert-placement)
51 |     - [Expert Load Balancing](#expert-load-balancing)
52 |     - [All-to-All Communication](#all-to-all-communication)
53 |   - [LoRA](#lora)
54 |   - [Speculative Decoding](#speculative-decoding)
55 |   - [Augmented LLMs](#augmented-llms)
56 |   - [Test-Time Reasoning](#test-time-reasoning)
57 | - [Miscellaneous Areas](#miscellaneous-areas)
58 |   - [Hardware](#hardware)
59 |   - [Privacy](#privacy)
60 |   - [Simulator](#simulator)
61 |   - [Fairness](#fairness)
62 |   - [Energy](#energy)
63 | - [Reference](#reference)
64 |
65 | ## LLM Inference Serving in Instance
66 |
67 | ### Model Placement
68 |
69 | #### Model Parallelism
70 |
71 | - **GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism** [arxiv 2019.7] [paper](https://arxiv.org/abs/1811.06965)
72 | - **PipeDream: Fast and Efficient Pipeline Parallel DNN Training** [arxiv 2018.6] [paper](https://arxiv.org/abs/1806.03377)
73 | - **Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM** [arxiv 2021.4] [paper](https://dl.acm.org/doi/abs/10.1145/3458817.3476209) [code](https://github.com/nvidia/megatron-lm)
74 | - **Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism** [arxiv 2020.3] [paper](https://arxiv.org/abs/1909.08053) [code](https://github.com/NVIDIA/Megatron-LM)
75 | - **Reducing Activation Recomputation in Large Transformer Models** [arxiv 2022.5] [paper](https://arxiv.org/abs/2205.05198)
76 | - **Context Parallelism (Megatron Core Documentation)** [NVIDIA 2024] [docs](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html)
77 | - **Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity** [arxiv 2022.6] [paper](https://arxiv.org/abs/2101.03961) [code](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py)
78 |
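The papers above cover pipeline, tensor, context, and expert parallelism. As a rough intuition for the tensor-parallel case, here is a toy NumPy sketch of a Megatron-style MLP split: the first matmul is partitioned column-wise, the second row-wise, so each shard works independently and only one reduction is needed per block. Shapes, shard count, and function names are illustrative only, not taken from any paper above.

```python
import numpy as np

# Toy Megatron-style tensor parallelism: column-parallel first matmul,
# row-parallel second matmul, one "all-reduce" (a plain sum) per MLP block.
# NumPy slices stand in for weight shards living on different GPUs.
def parallel_mlp(x, W1, W2, num_shards=2):
    W1_shards = np.split(W1, num_shards, axis=1)   # column-parallel
    W2_shards = np.split(W2, num_shards, axis=0)   # row-parallel
    partials = []
    for W1_s, W2_s in zip(W1_shards, W2_shards):
        h = np.maximum(x @ W1_s, 0.0)              # each shard computes its slice
        partials.append(h @ W2_s)                  # partial result per shard
    return sum(partials)                           # reduction across shards

x, W1, W2 = np.random.randn(4, 8), np.random.randn(8, 16), np.random.randn(16, 8)
assert np.allclose(parallel_mlp(x, W1, W2), np.maximum(x @ W1, 0.0) @ W2)
```
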
79 | #### Offloading
80 |
81 | - **ZeRO-Offload: Democratizing Billion-Scale Model Training** [arxiv 2021.1] [paper](https://arxiv.org/abs/2101.06840)
82 | - **DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale** [arxiv 2022.6] [paper](https://arxiv.org/abs/2207.00032) [code](https://github.com/deepspeedai/DeepSpeed)
83 | - **FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU** [arxiv 2023.6] [paper](https://arxiv.org/abs/2303.06865) [code](https://github.com/FMInference/FlexLLMGen)
84 | - **PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU** [arxiv 2024.12] [paper](https://arxiv.org/abs/2312.12456)
85 | - **TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference** [SYSTOR 2024] [paper](https://dl.acm.org/doi/10.1145/3688351.3689164)
86 | - **Improving Throughput-oriented LLM Inference with CPU Computations** [PACT 2024] [paper](https://dl.acm.org/doi/abs/10.1145/3656019.3676949)
87 |
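As a minimal sketch of what these offloading systems automate, the snippet below keeps all layer weights in host memory and stages only the currently executing layer into a small "GPU" buffer. The dictionaries, sizes, and device names are stand-ins for illustration, not any system's real API.

```python
import numpy as np

# Toy layer-wise weight offloading: weights are resident on the "CPU" (a dict)
# and only the active layer is staged into the "GPU" buffer before it runs.
cpu_weights = {i: np.random.randn(64, 64) * 0.01 for i in range(12)}
gpu_buffer = {}                                   # holds at most one layer at a time

def run_layer(i, x):
    gpu_buffer.clear()                            # evict the previous layer
    gpu_buffer[i] = cpu_weights[i]                # stands in for a host-to-device copy
    return np.tanh(x @ gpu_buffer[i])

x = np.random.randn(1, 64)
for i in range(12):
    x = run_layer(i, x)
print("layers resident on 'GPU':", len(gpu_buffer), "of", len(cpu_weights))
```
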
88 | ### Request Scheduling
89 |
90 | #### Inter-Request Scheduling
91 |
92 | - **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
93 | - **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
94 | - **Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2407.00079) [code](https://github.com/kvcache-ai/Mooncake)
95 | - **Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency** [NSDI 2019] [paper](https://www.usenix.org/conference/nsdi19/presentation/kaffes) [code](https://github.com/stanford-mast/shinjuku)
96 | - **Fast Distributed Inference Serving for Large Language Models** [arxiv 2024.9] [paper](https://arxiv.org/abs/2305.05920)
97 | - **Efficient LLM Scheduling by Learning to Rank** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.15792) [code](https://github.com/hao-ai-lab/vllm-ltr)
98 | - **Don't Stop Me Now: Embedding Based Scheduling for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.01035)
99 | - **Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking** [scs.stanford.edu 2024] [paper](https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf)
100 | - **The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.07447)
101 | - **BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching** [arxiv 2025.1] [paper](https://arxiv.org/abs/2412.03594)
102 |
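Several of these works build on the iteration-level (continuous) batching introduced by Orca. The sketch below shows the core loop under synthetic request lengths: after every decode step, finished requests leave the batch and queued requests are admitted immediately, instead of waiting for the whole batch to drain.

```python
from collections import deque

# Toy continuous batching: each request is [id, remaining_decode_steps].
queue = deque([[i, 2 + (i * 3) % 7] for i in range(8)])   # synthetic workload
running, MAX_BATCH, step = [], 4, 0

while queue or running:
    while queue and len(running) < MAX_BATCH:     # admit whenever a slot frees up
        running.append(queue.popleft())
    for req in running:                           # one decode iteration for the batch
        req[1] -= 1
    done = [rid for rid, left in running if left == 0]
    running = [req for req in running if req[1] > 0]
    step += 1
    if done:
        print(f"step {step:2d}: finished {done}, batch size now {len(running)}")
```
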
103 | #### Intra-Request Scheduling
104 |
105 | - **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
106 | - **DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.08671) [code](https://github.com/deepspeedai/DeepSpeed-MII)
107 | - **Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)
108 | - **Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving** [arxiv 2024.6] [paper](https://arxiv.org/abs/2406.13511)
109 |
110 | ### Decoding Length Prediction
111 |
112 | #### Exact Length Prediction
113 |
114 | - **Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction** [ICWS 2024] [paper](https://ieeexplore.ieee.org/abstract/document/10707595)
115 | - **Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11181)
116 | - **Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction** [arxiv 2024.11] [paper](https://arxiv.org/abs/2404.08509) [code](https://github.com/James-QiuHaoran/LLM-serving-with-proxy-models)
117 |
118 | #### Range-Based Classification
119 |
120 | - **Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline** [NeurIPS 2023] [paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/ce7ff3405c782f761fac7f849b41ae9a-Abstract-Conference.html) [code](https://github.com/zhengzangw/Sequence-Scheduling)
121 | - **S3: Increasing GPU Utilization during Generative Inference for Higher Throughput** [NeurIPS 2023] [paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/3a13be0c5dae69e0f08065f113fb10b8-Abstract-Conference.html)
122 | - **Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.13510)
123 | - **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
124 | - **Power-aware Deep Learning Model Serving with μ-Serve** [ATC 2024] [paper](https://www.usenix.org/conference/atc24/presentation/qiu)
125 | - **Don't Stop Me Now: Embedding Based Scheduling for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.01035)
126 | - **SyncIntellects: Orchestrating LLM Inference with Progressive Prediction and QoS-Friendly Control** [IWQoS 2024] [paper](https://ieeexplore.ieee.org/document/10682949)
127 |
128 | #### Relative Ranking Prediction
129 |
130 | - **Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction** [arxiv 2024.11] [paper](https://arxiv.org/abs/2404.08509) [code](https://github.com/James-QiuHaoran/LLM-serving-with-proxy-models)
131 | - **Efficient LLM Scheduling by Learning to Rank** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.15792) [code](https://github.com/hao-ai-lab/vllm-ltr)
132 | - **SkipPredict: When to Invest in Predictions for Scheduling** [arxiv 2024.2] [paper](https://arxiv.org/abs/2402.03564)
133 | - **BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching** [arxiv 2025.1] [paper](https://arxiv.org/abs/2412.03594)
134 | - **Predicting LLM Inference Latency: A Roofline-Driven ML Method** [NeurIPS 2024] [paper](https://mlforsystems.org/assets/papers/neurips2024/paper28.pdf)
135 |
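The usual consumer of these predictors is the scheduler: with an estimated (or merely ranked) output length per request, the server can approximate shortest-job-first and cut head-of-line blocking. A minimal sketch, with a stub predictor returning made-up lengths:

```python
import heapq

# Stub length predictor: real systems use a proxy model, a bucket classifier,
# or a pairwise ranker; the numbers below are invented for illustration.
predicted_len = {"req-a": 512, "req-b": 32, "req-c": 128, "req-d": 900}

def schedule_sjf(request_ids):
    """Serve requests shortest-predicted-first."""
    heap = [(predicted_len[r], r) for r in request_ids]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)[1]

print(list(schedule_sjf(["req-a", "req-b", "req-c", "req-d"])))
# ['req-b', 'req-c', 'req-a', 'req-d']
```
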
136 | ### KV Cache Optimization
137 |
138 | #### Memory Management
139 |
140 | - **Efficient Memory Management for Large Language Model Serving with PagedAttention** [SOSP 2023] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
141 | - **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.02669)
142 | - **FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.11421)
143 | - **LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.00428)
144 | - **KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.18169)
145 | - **SYMPHONY: Improving Memory Management for LLM Inference Workloads** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.16434)
146 | - **InstCache: A Predictive Cache for LLM Serving** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.13820)
147 | - **PQCache: Product Quantization-based KVCache for Long Context LLM Inference** [arxiv 2025.3] [paper](https://arxiv.org/abs/2407.12820)
148 | - **InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management** [OSDI 2024] [paper](https://arxiv.org/abs/2406.19707)
149 |
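A minimal sketch of the paged idea behind several of these systems (in the spirit of PagedAttention, not its actual implementation): KV tensors live in fixed-size physical blocks drawn from a shared pool, and each sequence keeps a block table from token positions to blocks, so memory is allocated on demand and returned without fragmentation. Block size and tensor shapes are arbitrary.

```python
import numpy as np

BLOCK_SIZE, NUM_BLOCKS, HEAD_DIM = 16, 64, 8
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, 2, HEAD_DIM))   # physical KV blocks
free_blocks = list(range(NUM_BLOCKS))                        # free-list allocator
block_tables = {}                                            # seq_id -> [block ids]
seq_lens = {}

def append_kv(seq_id, k, v):
    """Write one token's K/V, allocating a new block when the last one is full."""
    table = block_tables.setdefault(seq_id, [])
    pos = seq_lens.get(seq_id, 0)
    if pos % BLOCK_SIZE == 0:                     # last block full (or first token)
        table.append(free_blocks.pop())
    block, offset = table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
    kv_pool[block, offset, 0], kv_pool[block, offset, 1] = k, v
    seq_lens[seq_id] = pos + 1

def free_sequence(seq_id):
    free_blocks.extend(block_tables.pop(seq_id))  # return blocks, no fragmentation
    seq_lens.pop(seq_id)

for _ in range(40):                               # 40 tokens -> ceil(40/16) = 3 blocks
    append_kv("seq0", np.ones(HEAD_DIM), np.ones(HEAD_DIM))
print(len(block_tables["seq0"]), "blocks used")   # 3
free_sequence("seq0")
```
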
150 | #### Reuse Strategies
151 |
152 | - **Efficient Memory Management for Large Language Model Serving with PagedAttention** [SOSP 2023] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
153 | - **MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool** [arxiv 2024.12] [paper](https://arxiv.org/abs/2406.17565)
154 | - **Preble: Efficient Distributed Prompt Scheduling for LLM Serving** [arxiv 2024.10] [paper](https://arxiv.org/abs/2407.00023) [code](https://github.com/WukLab/preble)
155 | - **Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention** [ATC 2024] [paper](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost)
156 | - **GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings** [NLP-OSS 2023] [paper](https://aclanthology.org/2023.nlposs-1.24/)
157 | - **SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models** [arxiv 2024.5] [paper](https://arxiv.org/abs/2406.00025)
158 |
159 | #### Compression Techniques
160 |
161 | - **Model Compression and Efficient Inference for Large Language Models: A Survey** [arxiv 2024.2] [paper](https://arxiv.org/abs/2402.09748)
162 | - **FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU** [arxiv 2023.6] [paper](https://arxiv.org/abs/2303.06865) [code](https://github.com/FMInference/FlexLLMGen)
163 | - **KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization** [2024] [paper](https://www.researchgate.net/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization?channel=doi&linkId=658b5d282468df72d3db3280&showFulltext=true)
164 | - **MiniCache: KV Cache Compression in Depth Dimension for Large Language Models** [arxiv 2024.9] [paper](https://arxiv.org/abs/2405.14366) [code](https://github.com/AkideLiu/MiniCache)
165 | - **AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration** [MLSys 2024] [paper](https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html) [code](https://github.com/mit-han-lab/llm-awq)
166 | - **Atom: Low-bit Quantization for Efficient and Accurate LLM Serving** [arxiv 2024.4] [paper](https://arxiv.org/abs/2310.19102)
167 | - **QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.04532) [code](https://github.com/mit-han-lab/omniserve)
168 | - **CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2310.07240) [code](https://github.com/UChi-JCL/CacheGen)
169 |
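Most KV-cache quantization schemes start from the same primitive: quantize each channel (or group) with its own scale and dequantize on the fly at attention time. The sketch below is a plain symmetric per-channel 8-bit quantizer; the 2-bit/4-bit, asymmetric, and grouped variants in the papers above refine this idea, and the shapes here are arbitrary.

```python
import numpy as np

def quantize_per_channel(x, bits=8):
    """Symmetric per-channel quantization: one scale per last-dim channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=0, keepdims=True) / qmax      # per-channel scale
    scale = np.where(scale == 0, 1.0, scale)                 # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(128, 64).astype(np.float32)             # (tokens, head_dim)
q, scale = quantize_per_channel(kv, bits=8)
err = np.abs(dequantize(q, scale) - kv).max()
print(f"stored {q.nbytes + scale.nbytes} bytes vs {kv.nbytes}, max abs error {err:.4f}")
```
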
170 | ### PD Disaggregation
171 |
172 | - **DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin) [code](https://github.com/LLMServe/DistServe)
173 | - **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/abstract/document/10609649)
174 | - **DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.01876)
175 | - **Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2407.00079) [code](https://github.com/kvcache-ai/Mooncake)
176 | - **Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11181)
177 | - **P/D-Serve: Serving Disaggregated Large Language Model at Scale** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.08147)
178 |
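In miniature, prefill/decode disaggregation looks like the sketch below: a prefill worker runs the prompt once and hands its KV cache to a decode worker, which then generates token by token. The dict handoff stands in for the RDMA/NVLink KV transfer of real systems such as those above, and both "models" are stubs.

```python
import numpy as np

def prefill_worker(prompt_tokens, head_dim=8):
    """Compute-bound phase: process the whole prompt, emit its KV cache."""
    kv_cache = np.random.randn(len(prompt_tokens), 2, head_dim)  # stand-in for real KV
    first_token = 101                                            # stand-in for a sampled token
    return first_token, kv_cache

def decode_worker(first_token, kv_cache, steps=4, head_dim=8):
    """Memory-bound phase: extend the received KV cache one token at a time."""
    tokens = [first_token]
    for _ in range(steps):
        new_kv = np.random.randn(1, 2, head_dim)
        kv_cache = np.concatenate([kv_cache, new_kv], axis=0)
        tokens.append(tokens[-1] + 1)                            # fake "sampling"
    return tokens, kv_cache

tok, kv = prefill_worker(list(range(32)))        # runs on the prefill instance
out, kv = decode_worker(tok, kv)                 # kv "transferred" to the decode instance
print(out, kv.shape)                             # [...], (36, 2, 8)
```
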
179 | ## LLM Inference Serving in Cluster
180 | ### Cluster Optimization
181 | #### Architecture and Optimization for Heterogeneous Resources
182 |
183 | - **Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling** [SOSP 2023] [paper](https://dl.acm.org/doi/10.1145/3600006.3613175)
184 | - **Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow** [ASPLOS 2025] [paper](https://arxiv.org/abs/2406.01566) [code](https://github.com/Thesys-lab/Helix-ASPLOS25)
185 | - **LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.01136) [code](https://github.com/tonyzhao-jt/LLM-PQ)
186 | - **HexGen: Generative Inference of Large Language Model over Heterogeneous Environment** [ICML 2024] [paper](https://arxiv.org/abs/2311.11514) [code](https://github.com/Relaxed-System-Lab/HexGen)
187 | - **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/document/10609649)
188 | - **DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving** [OSDI 2024] [paper](https://arxiv.org/abs/2401.09670) [code](https://github.com/LLMServe/DistServe)
189 | - **HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment** [ICLR 2025] [paper](https://arxiv.org/abs/2502.07903)
190 | - **Optimizing LLM Inference Clusters for Enhanced Performance and Energy Efficiency** [TechRxiv 2024.12] [paper](https://www.techrxiv.org/users/812455/articles/1213926-optimizing-llm-inference-clusters-for-enhanced-performance-and-energy-efficiency)
191 |
192 | #### Service-Aware Scheduling
193 |
194 | - **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
195 | - **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/document/10609649)
196 |
197 |
198 | ### Load Balancing
199 |
200 | - **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
201 | - **Efficient Memory Management for Large Language Model Serving with PagedAttention** [SOSP 2023] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
202 | - **DeepSpeed-MII** [code](https://github.com/deepspeedai/DeepSpeed-MII)
203 |
204 | #### Heuristic Algorithm
205 |
206 | - **Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving** [arxiv 2024.6] [paper](https://arxiv.org/abs/2406.13511)
207 | - **A Unified Framework for Max-Min and Min-Max Fairness With Applications** [TNET 2007.8] [paper](https://ieeexplore.ieee.org/document/4346554)
208 | - **Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.17840)
209 |
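A representative heuristic (illustrative only, not any paper's exact policy) is least-outstanding-work routing: send each request to the replica with the fewest pending tokens, counting prompt plus predicted decode length.

```python
# Toy heuristic load balancer: route to the replica with the fewest outstanding tokens.
outstanding = {"replica-0": 0, "replica-1": 0, "replica-2": 0}

def route(prompt_tokens, predicted_decode_tokens):
    cost = prompt_tokens + predicted_decode_tokens
    target = min(outstanding, key=outstanding.get)    # least-loaded replica wins
    outstanding[target] += cost
    return target, cost

def complete(replica, cost):
    outstanding[replica] -= cost                      # called when the request finishes

a, cost_a = route(512, 64);  print(a)                 # replica-0
b, cost_b = route(64, 512);  print(b)                 # replica-1
complete(a, cost_a)
print(route(128, 128)[0])                             # replica-0 again: least loaded now
```
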
210 | #### Dynamic Scheduling
211 |
212 | - **Llumnix: Dynamic Scheduling for Large Language Model Serving** [OSDI 2024] [paper](https://arxiv.org/abs/2406.03243) [code](https://github.com/AlibabaPAI/llumnix)
213 |
214 | #### Intelligent Predictive Scheduling
215 |
216 | - **Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.13510)
217 |
218 | ### Cloud-Based LLM Serving
219 |
220 | #### Deployment and Computing Effectiveness
221 |
222 | - **SpotServe: Serving Generative Large Language Models on Preemptible Instances** [ASPLOS 2024] [paper](https://arxiv.org/abs/2311.15566) [code](https://github.com/Hsword/SpotServe)
223 | - **ServerlessLLM: Low-Latency Serverless Inference for Large Language Models** [OSDI 2024] [paper](https://arxiv.org/abs/2401.14351) [code](https://github.com/ServerlessLLM/ServerlessLLM)
224 | - **Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity** [arxiv 2024.4] [paper](https://arxiv.org/abs/2404.14527) [code](https://github.com/tyler-griggs/melange-release)
225 | - **Characterizing Power Management Opportunities for LLMs in the Cloud** [ASPLOS 2024] [paper](https://dl.acm.org/doi/10.1145/3620666.3651329)
226 | - **Predicting LLM Inference Latency: A Roofline-Driven ML Method** [NeurIPS 2024] [paper](https://mlforsystems.org/assets/papers/neurips2024/paper28.pdf)
227 | - **Distributed Inference and Fine-tuning of Large Language Models Over The Internet** [NeurIPS 2023] [paper](https://arxiv.org/abs/2312.08361)
228 |
229 | #### Cooperation with Edge Devices
230 |
231 | - **EdgeShard: Efficient LLM Inference via Collaborative Edge Computing** [JIOT 2024.12] [paper](https://ieeexplore.ieee.org/abstract/document/10818760)
232 | - **PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.14636)
233 | - **Hybrid SLM and LLM for Edge-Cloud Collaborative Inference** [EdgeFM 2024] [paper](https://dl.acm.org/doi/10.1145/3662006.3662067)
234 | - **Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach** [TMC 2024.12] [paper](https://ieeexplore.ieee.org/document/10591707)
235 |
236 | ## Emerging Scenarios
237 |
238 | ### Long Context
239 |
240 | #### Parallel Processing
241 |
242 | - **LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism** [SOSP 2024] [paper](https://dl.acm.org/doi/10.1145/3694715.3695948) [code](https://github.com/LoongServe/LoongServe)
243 |
244 | #### Attention Computation
245 |
246 | - **Ring Attention with Blockwise Transformers for Near-Infinite Context** [arxiv 2023.10] [paper](https://arxiv.org/abs/2310.01889) [code](https://github.com/haoliuhl/ringattention)
247 | - **Striped Attention: Faster Ring Attention for Causal Transformers** [arxiv 2023.11] [paper](https://arxiv.org/abs/2311.09431) [code](https://github.com/exists-forall/striped_attention/)
248 | - **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.02669)
249 | - **InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference** [arxiv 2024.9] [paper](https://arxiv.org/abs/2409.04992)
250 |
251 | #### KV Cache Management
252 |
253 | - **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.02669)
254 | - **InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management** [OSDI 2024] [paper](https://arxiv.org/abs/2406.19707)
255 | - **Marconi: Prefix Caching for the Era of Hybrid LLMs** [MLSys 2025] [paper](https://arxiv.org/abs/2411.19379) [code](https://github.com/ruipeterpan/marconi)
256 |
257 | ### RAG
258 |
259 | #### Workflow Scheduling
260 |
261 | - **PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.05676) [code](https://github.com/amazon-science/piperag)
262 | - **Teola: Towards End-to-End Optimization of LLM-based Applications** [ASPLOS 2025] [paper](https://dl.acm.org/doi/10.1145/3676641.3716278) [code](https://github.com/NetX-lab/Ayo)
263 | - **Accelerating Retrieval-Augmented Language Model Serving with Speculation** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.14021)
264 | - **RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.10543)
265 |
266 | #### Storage Optimization
267 |
268 | - **RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation** [arxiv 2024.4] [paper](https://arxiv.org/abs/2404.12457)
269 | - **Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.16178)
270 | - **CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion** [EuroSys 2025] [paper](https://dl.acm.org/doi/10.1145/3689031.3696098) [code](https://github.com/YaoJiayi/CacheBlend)
271 | - **EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.15332)
272 |
273 | ### MoE
274 |
275 | - **A Survey on Inference Optimization Techniques for Mixture of Experts Models** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.14219) [code](https://github.com/MoE-Inf/awesome-moe-inference/)
276 |
277 | #### Expert Placement
278 |
279 | - **Tutel: Adaptive Mixture-of-Experts at Scale** [MLSys 2023] [paper](https://proceedings.mlsys.org/paper_files/paper/2023/hash/5616d34cf8ff73942cfd5aa922842556-Abstract-mlsys2023.html) [code](https://github.com/microsoft/tutel)
280 | - **DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale** [ICML 2022] [paper](https://proceedings.mlr.press/v162/rajbhandari22a) [code](https://github.com/deepspeedai/DeepSpeed)
281 | - **FastMoE: A Fast Mixture-of-Expert Training System** [arxiv 2021.3] [paper](https://arxiv.org/abs/2103.13262) [code](https://github.com/laekov/fastmoe)
282 | - **GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding** [ICLR 2021] [paper](https://iclr.cc/virtual/2021/poster/3196) [code](https://github.com/lucidrains/mixture-of-experts)
283 |
284 | #### Expert Load Balancing
285 |
286 | - **Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference** [arxiv 2023.3] [paper](https://arxiv.org/abs/2303.06182)
287 | - **Optimizing Dynamic Neural Networks with Brainstorm** [OSDI 2023] [paper](https://www.usenix.org/conference/osdi23/presentation/cui) [code](https://github.com/Raphael-Hao/brainstorm)
288 | - **Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.08982)
289 | - **Mixture-of-Experts with Expert Choice Routing** [NeurIPS 2022] [paper](https://dl.acm.org/doi/abs/10.5555/3600270.3600785)
290 |
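The imbalance these papers target comes from top-k gating with finite expert capacity. A toy version: route each token to its top-k experts, enforce a per-expert slot budget, and count the overflow that a real system would re-route, drop, or rebalance. Gating scores here are random and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, top_k, capacity_factor = 64, 8, 2, 1.25
capacity = int(capacity_factor * num_tokens * top_k / num_experts)   # per-expert slot budget

logits = rng.normal(size=(num_tokens, num_experts))                  # stand-in gating scores
top_experts = np.argsort(-logits, axis=1)[:, :top_k]                 # each token's top-k experts

assignments = {e: [] for e in range(num_experts)}
dropped = 0
for token, experts in enumerate(top_experts):
    for e in experts:
        if len(assignments[e]) < capacity:        # enforce expert capacity
            assignments[e].append(token)
        else:
            dropped += 1                          # real systems re-route or drop overflow

loads = [len(v) for v in assignments.values()]
print("per-expert load:", loads, "| capacity:", capacity, "| dropped:", dropped)
```
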
291 | #### All-to-All Communication
292 |
293 | - **Tutel: Adaptive Mixture-of-Experts at Scale** [MLSys 2023] [paper](https://proceedings.mlsys.org/paper_files/paper/2023/hash/5616d34cf8ff73942cfd5aa922842556-Abstract-mlsys2023.html) [code](https://github.com/microsoft/tutel)
294 | - **Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.17043)
295 | - **Accelerating Distributed MoE Training and Inference with Lina** [USENIX ATC 2023] [paper](https://www.usenix.org/conference/atc23/presentation/li-jiamin)
296 |
297 | ### LoRA
298 |
299 | - **LoRA: Low-Rank Adaptation of Large Language Models** [ICLR 2022] [paper](https://iclr.cc/virtual/2022/poster/6319) [code](https://github.com/microsoft/LoRA)
300 | - **LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models** [ICLR 2024] [paper](https://arxiv.org/abs/2309.12307) [code](https://github.com/dvlab-research/LongLoRA)
301 | - **QLoRA: Efficient Finetuning of Quantized LLMs** [NeurIPS 2023] [paper](https://arxiv.org/abs/2305.14314) [code](https://github.com/artidoro/qlora)
302 | - **CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11240)
303 | - **dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang) [code](https://github.com/LLMServe/dLoRA-artifact)
304 |
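For reference, the adapter math these serving systems batch across requests is just a low-rank update on a frozen weight: merging it into the base weight removes per-token overhead, while keeping it separate lets many adapters share one base model. The layout and numbers below are illustrative only.

```python
import numpy as np

d_in, d_out, r, alpha = 512, 512, 8, 16
W = np.random.randn(d_in, d_out) * 0.02       # frozen base weight
A = np.random.randn(d_in, r) * 0.02           # low-rank adapter factors (pretend trained)
B = np.random.randn(r, d_out) * 0.02

def lora_forward(x, W, A, B, alpha, r):
    # Base path plus scaled low-rank update; only A and B are adapter-specific.
    return x @ W + (alpha / r) * (x @ A) @ B

x = np.random.randn(2, d_in)
merged = W + (alpha / r) * A @ B              # offline merge gives the same output
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ merged)
print("extra params per adapter:", A.size + B.size, "vs base:", W.size)
```
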
305 | ### Speculative Decoding
306 |
307 | - **Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding** [ACL 2024] [paper](https://arxiv.org/abs/2401.07851) [code](https://github.com/hemingkx/SpeculativeDecodingPapers)
308 | - **OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure** [TACL 2025] [paper](https://aclanthology.org/2025.tacl-1.8/)
309 | - **SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification** [ASPLOS 2024] [paper](https://dl.acm.org/doi/10.1145/3620666.3651335) [code](https://github.com/goliaro/specinfer-ae)
310 |
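The common skeleton of these methods is draft-then-verify: a cheap draft model proposes a few tokens, the target model checks them, and generation keeps the verified prefix plus one token from the target. The sketch below uses greedy accept/reject and stub models over a toy vocabulary; the actual algorithms score all draft positions in a single target forward pass and use rejection sampling to preserve the target distribution.

```python
import random

random.seed(0)
VOCAB = list(range(10))

def draft_model(prefix):          # cheap proposer (stub)
    return random.choice(VOCAB)

def target_model(prefix):         # expensive model (stub); a real implementation
    return random.choice(VOCAB)   # scores all draft positions in one forward pass

def speculative_step(prefix, gamma=4):
    """Draft gamma tokens, keep the verified prefix, append one target token."""
    draft = []
    for _ in range(gamma):
        draft.append(draft_model(prefix + draft))
    accepted = []
    for tok in draft:
        expected = target_model(prefix + accepted)   # greedy check per position
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)                # correction from the target model
            break
    else:
        accepted.append(target_model(prefix + accepted))  # bonus token if all accepted
    return accepted

prefix = [1, 2, 3]
for _ in range(3):
    new_tokens = speculative_step(prefix)
    prefix += new_tokens
    print(f"accepted {len(new_tokens)} token(s) -> {prefix}")
```
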
311 | ### Augmented LLMs
312 |
313 | - **InferCept: Efficient Intercept Support for Augmented Large Language Model Inference** [ICML 2024] [paper](https://icml.cc/virtual/2024/poster/32755) [code](https://github.com/WukLab/InferCept)
314 | - **Fast Inference for Augmented Large Language Models** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.18248)
315 | - **Parrot: Efficient Serving of LLM-based Applications with Semantic Variable** [OSDI 2024] [paper](https://arxiv.org/abs/2405.19888) [code](https://github.com/microsoft/ParrotServe)
316 |
317 | ### Test-Time Reasoning
318 |
319 | - **Test-Time Compute: from System-1 Thinking to System-2 Thinking** [arxiv 2025.1] [paper](https://arxiv.org/abs/2501.02497) [code](https://github.com/Dereck0602/Awesome_Test_Time_LLMs)
320 | - **Efficiently Serving LLM Reasoning Programs with Certaindex** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.20993) [code](https://github.com/hao-ai-lab/Dynasor)
321 | - **Learning How Hard to Think: Input-Adaptive Allocation of LM Computation** [ICLR 2025] [paper](https://arxiv.org/abs/2410.04707)
322 |
323 | ## Miscellaneous Areas
324 |
325 | ### Hardware
326 |
327 | - **Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.14740)
328 | - **Efficient LLM inference solution on Intel GPU** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.05391)
329 | - **LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.02425)
330 | - **Demystifying Platform Requirements for Diverse LLM Inference Use Cases** [arxiv 2024.6] [paper](https://arxiv.org/abs/2406.01698) [code](https://github.com/abhibambhaniya/GenZ-LLM-Analyzer)
331 | - **Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs** [arxiv 2024.7] [paper](https://arxiv.org/abs/2403.20041)
332 | - **LLM as a System Service on Mobile Devices** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.11805)
333 | - **Fast On-device LLM Inference with NPUs** [arxiv 2024.12] [paper](https://arxiv.org/abs/2407.05858)
334 |
335 | ### Privacy
336 |
337 | - **A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage** [MobiArch 2024] [paper](https://arxiv.org/abs/2409.04040)
338 | - **No Free Lunch Theorem for Privacy-Preserving LLM Inference** [AIJ 2025.4] [paper](https://www.sciencedirect.com/science/article/pii/S0004370225000128)
339 | - **MPC-Minimized Secure LLM Inference** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.03561)
340 |
341 | ### Simulator
342 |
343 | - **Vidur: A Large-Scale Simulation Framework For LLM Inference** [MLSys 2024] [paper](https://arxiv.org/abs/2405.05465) [code](https://github.com/microsoft/vidur)
344 | - **Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow** [ASPLOS 2025] [paper](https://arxiv.org/abs/2406.01566) [code](https://github.com/Thesys-lab/Helix-ASPLOS25)
345 |
346 | ### Fairness
347 |
348 | - **Fairness in Serving Large Language Models** [OSDI 2024] [paper](https://arxiv.org/abs/2401.00588) [code](https://github.com/Ying1123/VTC-artifact)
349 |
350 | ### Energy
351 |
352 | - **Towards Sustainable Large Language Model Serving** [HotCarbon 2024] [paper](https://arxiv.org/abs/2501.01990)
353 |
354 | ## Reference
355 | If you find this survey useful for your research, we would be grateful if you cite it:
356 | ```
357 | @misc{zhen2025tamingtitanssurveyefficient,
358 |       title={Taming the Titans: A Survey of Efficient LLM Inference Serving},
359 |       author={Ranran Zhen and Juntao Li and Yixin Ji and Zhenlin Yang and Tong Liu and Qingrong Xia and Xinyu Duan and Zhefeng Wang and Baoxing Huai and Min Zhang},
360 |       year={2025},
361 |       eprint={2504.19720},
362 |       archivePrefix={arXiv},
363 |       primaryClass={cs.CL},
364 |       url={https://arxiv.org/abs/2504.19720},
365 | }
366 | ```
--------------------------------------------------------------------------------
/img/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zenrran4nlp/Awesome-LLM-Inference-Serving/06159e484c18859ba9da6f221d029fe413a33321/img/overview.png
--------------------------------------------------------------------------------
/img/title.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zenrran4nlp/Awesome-LLM-Inference-Serving/06159e484c18859ba9da6f221d029fe413a33321/img/title.png
--------------------------------------------------------------------------------