# Awesome LLM Inference Serving

![](./img/title.png)

This repository contains the literature referenced in [Taming the Titans: A Survey of Efficient LLM Inference Serving](https://arxiv.org/abs/2504.19720), and will be updated regularly.

## Table of Contents
![](./img/overview.png)
- [Table of Contents](#table-of-contents)
- [LLM Inference Serving in Instance](#llm-inference-serving-in-instance)
  - [Model Placement](#model-placement)
    - [Model Parallelism](#model-parallelism)
    - [Offloading](#offloading)
  - [Request Scheduling](#request-scheduling)
    - [Inter-Request Scheduling](#inter-request-scheduling)
    - [Intra-Request Scheduling](#intra-request-scheduling)
  - [Decoding Length Prediction](#decoding-length-prediction)
    - [Exact Length Prediction](#exact-length-prediction)
    - [Range-Based Classification](#range-based-classification)
    - [Relative Ranking Prediction](#relative-ranking-prediction)
  - [KV Cache Optimization](#kv-cache-optimization)
    - [Memory Management](#memory-management)
    - [Reuse Strategies](#reuse-strategies)
    - [Compression Techniques](#compression-techniques)
  - [PD Disaggregation](#pd-disaggregation)
- [LLM Inference Serving in Cluster](#llm-inference-serving-in-cluster)
  - [Cluster Optimization](#cluster-optimization)
    - [Architecture and Optimization for Heterogeneous Resources](#architecture-and-optimization-for-heterogeneous-resources)
    - [Service-Aware Scheduling](#service-aware-scheduling)
  - [Load Balancing](#load-balancing)
    - [Heuristic Algorithm](#heuristic-algorithm)
    - [Dynamic Scheduling](#dynamic-scheduling)
    - [Intelligent Predictive Scheduling](#intelligent-predictive-scheduling)
  - [Cloud-Based LLM Serving](#cloud-based-llm-serving)
    - [Deployment and Computing Effective](#deployment-and-computing-effective)
    - [Cooperation with Edge Device](#cooperation-with-edge-device)
- [Emerging Scenarios](#emerging-scenarios)
  - [Long Context](#long-context)
    - [Parallel Processing](#parallel-processing)
    - [Attention Computation](#attention-computation)
    - [KV Cache Management](#kv-cache-management)
  - [RAG](#rag)
    - [Workflow Scheduling](#workflow-scheduling)
    - [Storage Optimization](#storage-optimization)
  - [MoE](#moe)
    - [Expert Placement](#expert-placement)
    - [Expert Load Balancing](#expert-load-balancing)
    - [All-to-All Communication](#all-to-all-communication)
  - [LoRA](#lora)
  - [Speculative Decoding](#speculative-decoding)
  - [Augmented LLMs](#augmented-llms)
  - [Test-Time Reasoning](#test-time-reasoning)
- [Miscellaneous Areas](#miscellaneous-areas)
  - [Hardware](#hardware)
  - [Privacy](#privacy)
  - [Simulator](#simulator)
  - [Fairness](#fairness)
  - [Energy](#energy)
- [Reference](#reference)

## LLM Inference Serving in Instance

### Model Placement

#### Model Parallelism

- **GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism** [arxiv 2019.7] [paper](https://arxiv.org/abs/1811.06965)
- **PipeDream: Fast and Efficient Pipeline Parallel DNN Training** [arxiv 2018.6] [paper](https://arxiv.org/abs/1806.03377)
- **Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM** [arxiv 2021.4] [paper](https://dl.acm.org/doi/abs/10.1145/3458817.3476209) [code](https://github.com/nvidia/megatron-lm)
- **Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism** [arxiv 2020.3] [paper](https://arxiv.org/abs/1909.08053) [code](https://github.com/NVIDIA/Megatron-LM)
- **Reducing Activation Recomputation in Large Transformer Models** [arxiv 2022.5] [paper](https://arxiv.org/abs/2205.05198)
- **Context Parallelism (NVIDIA Megatron Core documentation)** [2024] [paper](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html)
- **Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity** [arxiv 2022.6] [paper](https://arxiv.org/abs/2101.03961) [code](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py)

#### Offloading

- **ZeRO-Offload: Democratizing Billion-Scale Model Training** [arxiv 2021.1] [paper](https://arxiv.org/abs/2101.06840)
- **DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale** [arxiv 2022.6] [paper](https://arxiv.org/abs/2207.00032) [code](https://github.com/deepspeedai/DeepSpeed)
- **FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU** [arxiv 2023.6] [paper](https://arxiv.org/abs/2303.06865) [code](https://github.com/FMInference/FlexLLMGen)
- **PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU** [arxiv 2024.12] [paper](https://arxiv.org/abs/2312.12456)
- **TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference** [SYSTOR 2024] [paper](https://dl.acm.org/doi/10.1145/3688351.3689164)
- **Improving Throughput-oriented LLM Inference with CPU Computations** [PACT 2024] [paper](https://dl.acm.org/doi/abs/10.1145/3656019.3676949)

### Request Scheduling

#### Inter-Request Scheduling

- **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
- **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
- **Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2407.00079) [code](https://github.com/kvcache-ai/Mooncake)
- **Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency** [NSDI 2019] [paper](https://www.usenix.org/conference/nsdi19/presentation/kaffes) [code](https://github.com/stanford-mast/shinjuku)
- **Fast Distributed Inference Serving for Large Language Models** [arxiv 2024.9] [paper](https://arxiv.org/abs/2305.05920)
- **Efficient LLM Scheduling by Learning to Rank** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.15792) [code](https://github.com/hao-ai-lab/vllm-ltr)
- **Don't Stop Me Now: Embedding Based Scheduling for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.01035)
- **Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking** [scs.stanford.edu 2024] [paper](https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf)
- **The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.07447)
- **BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching** [arxiv 2025.1] [paper](https://arxiv.org/abs/2412.03594)

#### Intra-Request Scheduling

- **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
- **DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.08671) [code](https://github.com/deepspeedai/DeepSpeed-MII)
- **Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)
- **Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving** [arxiv 2025.3] [paper](https://arxiv.org/abs/2406.13511)

### Decoding Length Prediction

#### Exact Length Prediction

- **Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction** [ICWS 2024] [paper](https://ieeexplore.ieee.org/abstract/document/10707595)
- **Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11181)
- **Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction** [arxiv 2024.11] [paper](https://arxiv.org/abs/2404.08509) [code](https://github.com/James-QiuHaoran/LLM-serving-with-proxy-models)

#### Range-Based Classification

- **Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline** [NeurIPS 2023] [paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/ce7ff3405c782f761fac7f849b41ae9a-Abstract-Conference.html) [code](https://github.com/zhengzangw/Sequence-Scheduling)
- **S3: Increasing GPU Utilization during Generative Inference for Higher Throughput** [NeurIPS 2023] [paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/3a13be0c5dae69e0f08065f113fb10b8-Abstract-Conference.html)
- **Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing** [arxiv 2025.1] [paper](https://arxiv.org/abs/2408.13510)
- **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
- **Power-aware Deep Learning Model Serving with μ-Serve** [ATC 2024] [paper](https://www.usenix.org/conference/atc24/presentation/qiu)
- **Don't Stop Me Now: Embedding Based Scheduling for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.01035)
- **SyncIntellects: Orchestrating LLM Inference with Progressive Prediction and QoS-Friendly Control** [IWQoS 2024] [paper](https://ieeexplore.ieee.org/document/10682949)

#### Relative Ranking Prediction

- **Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction** [arxiv 2024.11] [paper](https://arxiv.org/abs/2404.08509) [code](https://github.com/James-QiuHaoran/LLM-serving-with-proxy-models)
- **Efficient LLM Scheduling by Learning to Rank** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.15792) [code](https://github.com/hao-ai-lab/vllm-ltr)
- **SkipPredict: When to Invest in Predictions for Scheduling** [arxiv 2024.2] [paper](https://arxiv.org/abs/2402.03564)
- **BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching** [arxiv 2025.1] [paper](https://arxiv.org/abs/2412.03594)
- **Predicting LLM Inference Latency: A Roofline-Driven ML Method** [NeurIPS 2024] [paper](https://mlforsystems.org/assets/papers/neurips2024/paper28.pdf)
### KV Cache Optimization

#### Memory Management

- **Efficient Memory Management for Large Language Model Serving with PagedAttention** [arxiv 2023.9] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
- **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.7] [paper](https://arxiv.org/abs/2401.02669)
- **FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.11421)
- **LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.00428)
- **KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.18169)
- **SYMPHONY: Improving Memory Management for LLM Inference Workloads** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.16434)
- **InstCache: A Predictive Cache for LLM Serving** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.13820)
- **PQCache: Product Quantization-based KVCache for Long Context LLM Inference** [arxiv 2025.3] [paper](https://arxiv.org/abs/2407.12820)
- **InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management** [arxiv 2024.6] [paper](https://arxiv.org/abs/2406.19707)

#### Reuse Strategies

- **Efficient Memory Management for Large Language Model Serving with PagedAttention** [arxiv 2023.9] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
- **MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool** [arxiv 2024.12] [paper](https://arxiv.org/abs/2406.17565)
- **Preble: Efficient Distributed Prompt Scheduling for LLM Serving** [arxiv 2024.10] [paper](https://arxiv.org/abs/2407.00023) [code](https://github.com/WukLab/preble)
- **Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention** [ATC 2024] [paper](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost)
- **GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings** [NLP-OSS 2023] [paper](https://aclanthology.org/2023.nlposs-1.24/)
- **SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models** [arxiv 2024.5] [paper](https://arxiv.org/abs/2406.00025)

#### Compression Techniques

- **Model Compression and Efficient Inference for Large Language Models: A Survey** [arxiv 2024.2] [paper](https://arxiv.org/abs/2402.09748)
- **FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU** [arxiv 2023.6] [paper](https://arxiv.org/abs/2303.06865) [code](https://github.com/FMInference/FlexLLMGen)
- **KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization** [2024] [paper](https://www.researchgate.net/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization?channel=doi&linkId=658b5d282468df72d3db3280&showFulltext=true)
- **MiniCache: KV Cache Compression in Depth Dimension for Large Language Models** [arxiv 2024.9] [paper](https://arxiv.org/abs/2405.14366) [code](https://github.com/AkideLiu/MiniCache)
- **AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration** [MLSys 2024] [paper](https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html) [code](https://github.com/mit-han-lab/llm-awq)
- **Atom: Low-bit Quantization for Efficient and Accurate LLM Serving** [arxiv 2024.4] [paper](https://arxiv.org/abs/2310.19102)
- **QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.04532) [code](https://github.com/mit-han-lab/omniserve)
- **CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2310.07240) [code](https://github.com/UChi-JCL/CacheGen)

### PD Disaggregation

- **DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin) [code](https://github.com/LLMServe/DistServe)
- **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/abstract/document/10609649)
- **DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.01876)
- **Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2407.00079) [code](https://github.com/kvcache-ai/Mooncake)
- **Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11181)
- **P/D-Serve: Serving Disaggregated Large Language Model at Scale** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.08147)

## LLM Inference Serving in Cluster

### Cluster Optimization

#### Architecture and Optimization for Heterogeneous Resources

- **Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling** [SOSP 2023] [paper](https://dl.acm.org/doi/10.1145/3600006.3613175)
- **Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow** [ASPLOS 2025] [paper](https://arxiv.org/abs/2406.01566) [code](https://github.com/Thesys-lab/Helix-ASPLOS25)
- **LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.01136) [code](https://github.com/tonyzhao-jt/LLM-PQ)
- **HexGen: Generative Inference of Large Language Model over Heterogeneous Environment** [ICML 2024] [paper](https://arxiv.org/abs/2311.11514) [code](https://github.com/Relaxed-System-Lab/HexGen)
- **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/document/10609649)
- **DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving** [OSDI 2024] [paper](https://arxiv.org/abs/2401.09670) [code](https://github.com/LLMServe/DistServe)
- **HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment** [ICLR 2025] [paper](https://arxiv.org/abs/2502.07903)
- **Optimizing LLM Inference Clusters for Enhanced Performance and Energy Efficiency** [TechRxiv 2024.12] [paper](https://www.techrxiv.org/users/812455/articles/1213926-optimizing-llm-inference-clusters-for-enhanced-performance-and-energy-efficiency)

#### Service-Aware Scheduling

- **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
- **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/document/10609649)

### Load Balancing

- **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
- **Efficient Memory Management for Large Language Model Serving with PagedAttention** [SOSP 2023] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
- **DeepSpeed-MII** [code](https://github.com/deepspeedai/DeepSpeed-MII)

#### Heuristic Algorithm

- **Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving** [arxiv 2024.6] [paper](https://arxiv.org/abs/2406.13511)
- **A Unified Framework for Max-Min and Min-Max Fairness With Applications** [TNET 2007.8] [paper](https://ieeexplore.ieee.org/document/4346554)
- **Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.17840)

#### Dynamic Scheduling

- **Llumnix: Dynamic Scheduling for Large Language Model Serving** [OSDI 2024] [paper](https://arxiv.org/abs/2406.03243) [code](https://github.com/AlibabaPAI/llumnix)

#### Intelligent Predictive Scheduling

- **Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.13510)

### Cloud-Based LLM Serving

#### Deployment and Computing Effective

- **SpotServe: Serving Generative Large Language Models on Preemptible Instances** [ASPLOS 2024] [paper](https://arxiv.org/abs/2311.15566) [code](https://github.com/Hsword/SpotServe)
- **ServerlessLLM: Low-Latency Serverless Inference for Large Language Models** [OSDI 2024] [paper](https://arxiv.org/abs/2401.14351) [code](https://github.com/ServerlessLLM/ServerlessLLM)
- **Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity** [arxiv 2024.4] [paper](https://arxiv.org/abs/2404.14527) [code](https://github.com/tyler-griggs/melange-release)
- **Characterizing Power Management Opportunities for LLMs in the Cloud** [ASPLOS 2024] [paper](https://dl.acm.org/doi/10.1145/3620666.3651329)
- **Predicting LLM Inference Latency: A Roofline-Driven ML Method** [NeurIPS 2024] [paper](https://mlforsystems.org/assets/papers/neurips2024/paper28.pdf)
- **Distributed Inference and Fine-tuning of Large Language Models Over The Internet** [NeurIPS 2023] [paper](https://arxiv.org/abs/2312.08361)

#### Cooperation with Edge Device

- **EdgeShard: Efficient LLM Inference via Collaborative Edge Computing** [JIOT 2024.12] [paper](https://ieeexplore.ieee.org/abstract/document/10818760)
- **PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.14636)
- **Hybrid SLM and LLM for Edge-Cloud Collaborative Inference** [EdgeFM 2024] [paper](https://dl.acm.org/doi/10.1145/3662006.3662067)
- **Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach** [TMC 2024.12] [paper](https://ieeexplore.ieee.org/document/10591707)
## Emerging Scenarios

### Long Context

#### Parallel Processing

- **LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism** [SOSP 2024] [paper](https://dl.acm.org/doi/10.1145/3694715.3695948) [code](https://github.com/LoongServe/LoongServe)

#### Attention Computation

- **Ring Attention with Blockwise Transformers for Near-Infinite Context** [arxiv 2023.10] [paper](https://arxiv.org/abs/2310.01889) [code](https://github.com/haoliuhl/ringattention)
- **Striped Attention: Faster Ring Attention for Causal Transformers** [arxiv 2023.10] [paper](https://arxiv.org/abs/2311.09431) [code](https://github.com/exists-forall/striped_attention/)
- **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.02669)
- **InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference** [arxiv 2024.9] [paper](https://arxiv.org/abs/2409.04992)

#### KV Cache Management

- **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.02669)
- **InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management** [OSDI 2024] [paper](https://arxiv.org/abs/2406.19707)
- **Marconi: Prefix Caching for the Era of Hybrid LLMs** [MLSys 2025] [paper](https://arxiv.org/abs/2411.19379) [code](https://github.com/ruipeterpan/marconi)

### RAG

#### Workflow Scheduling

- **PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.05676) [code](https://github.com/amazon-science/piperag)
- **Teola: Towards End-to-End Optimization of LLM-based Applications** [ASPLOS 2025] [paper](https://dl.acm.org/doi/10.1145/3676641.3716278) [code](https://github.com/NetX-lab/Ayo)
- **Accelerating Retrieval-Augmented Language Model Serving with Speculation** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.14021)
- **RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.10543)

#### Storage Optimization

- **RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation** [arxiv 2024.4] [paper](https://arxiv.org/abs/2404.12457)
- **Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.16178)
- **CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion** [EuroSys 2025] [paper](https://dl.acm.org/doi/10.1145/3689031.3696098) [code](https://github.com/YaoJiayi/CacheBlend)
- **EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.15332)

### MoE

- **A Survey on Inference Optimization Techniques for Mixture of Experts Models** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.14219) [code](https://github.com/MoE-Inf/awesome-moe-inference/)

#### Expert Placement

- **Tutel: Adaptive Mixture-of-Experts at Scale** [MLSys 2023] [paper](https://proceedings.mlsys.org/paper_files/paper/2023/hash/5616d34cf8ff73942cfd5aa922842556-Abstract-mlsys2023.html) [code](https://github.com/microsoft/tutel)
- **DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale** [ICML 2022] [paper](https://proceedings.mlr.press/v162/rajbhandari22a) [code](https://github.com/deepspeedai/DeepSpeed)
- **FastMoE: A Fast Mixture-of-Expert Training System** [arxiv 2021.3] [paper](https://arxiv.org/abs/2103.13262) [code](https://github.com/laekov/fastmoe)
- **GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding** [ICLR 2021] [paper](https://iclr.cc/virtual/2021/poster/3196) [code](https://github.com/lucidrains/mixture-of-experts)

#### Expert Load Balancing

- **Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference** [arxiv 2023.3] [paper](https://arxiv.org/abs/2303.06182)
- **Optimizing Dynamic Neural Networks with Brainstorm** [OSDI 2023] [paper](https://www.usenix.org/conference/osdi23/presentation/cui) [code](https://github.com/Raphael-Hao/brainstorm)
- **Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.08982)
- **Mixture-of-Experts with Expert Choice Routing** [NeurIPS 2022] [paper](https://dl.acm.org/doi/abs/10.5555/3600270.3600785)

#### All-to-All Communication

- **Tutel: Adaptive Mixture-of-Experts at Scale** [MLSys 2023] [paper](https://proceedings.mlsys.org/paper_files/paper/2023/hash/5616d34cf8ff73942cfd5aa922842556-Abstract-mlsys2023.html) [code](https://github.com/microsoft/tutel)
- **Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.17043)
- **Accelerating Distributed MoE Training and Inference with Lina** [USENIX ATC 2023] [paper](https://www.usenix.org/conference/atc23/presentation/li-jiamin)

### LoRA

- **LoRA: Low-Rank Adaptation of Large Language Models** [ICLR 2022] [paper](https://iclr.cc/virtual/2022/poster/6319) [code](https://github.com/microsoft/LoRA)
- **LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models** [ICLR 2024] [paper](https://arxiv.org/abs/2309.12307) [code](https://github.com/dvlab-research/LongLoRA)
- **QLoRA: Efficient Finetuning of Quantized LLMs** [NeurIPS 2023] [paper](https://arxiv.org/abs/2305.14314) [code](https://github.com/artidoro/qlora)
- **CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11240)
- **dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang) [code](https://github.com/LLMServe/dLoRA-artifact)

### Speculative Decoding

- **Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding** [ACL 2024] [paper](https://arxiv.org/abs/2401.07851) [code](https://github.com/hemingkx/SpeculativeDecodingPapers)
- **OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure** [TACL 2025] [paper](https://aclanthology.org/2025.tacl-1.8/)
- **SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification** [ASPLOS 2024] [paper](https://dl.acm.org/doi/10.1145/3620666.3651335) [code](https://github.com/goliaro/specinfer-ae)

### Augmented LLMs
- **InferCept: Efficient Intercept Support for Augmented Large Language Model Inference** [ICML 2024] [paper](https://icml.cc/virtual/2024/poster/32755) [code](https://github.com/WukLab/InferCept)
- **Fast Inference for Augmented Large Language Models** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.18248)
- **Parrot: Efficient Serving of LLM-based Applications with Semantic Variable** [OSDI 2024] [paper](https://arxiv.org/abs/2405.19888) [code](https://github.com/microsoft/ParrotServe)

### Test-Time Reasoning

- **Test-Time Compute: from System-1 Thinking to System-2 Thinking** [arxiv 2025.1] [paper](https://arxiv.org/abs/2501.02497) [code](https://github.com/Dereck0602/Awesome_Test_Time_LLMs)
- **Efficiently Serving LLM Reasoning Programs with Certaindex** [arxiv 2024.10] [paper](https://arxiv.org/abs/2412.20993) [code](https://github.com/hao-ai-lab/Dynasor)
- **Learning How Hard to Think: Input-Adaptive Allocation of LM Computation** [ICLR 2025] [paper](https://arxiv.org/abs/2410.04707)

## Miscellaneous Areas

### Hardware

- **Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.14740)
- **Efficient LLM inference solution on Intel GPU** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.05391)
- **LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.02425)
- **Demystifying Platform Requirements for Diverse LLM Inference Use Cases** [arxiv 2024.1] [paper](https://arxiv.org/abs/2406.01698) [code](https://github.com/abhibambhaniya/GenZ-LLM-Analyzer)
- **Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs** [arxiv 2024.7] [paper](https://arxiv.org/abs/2403.20041)
- **LLM as a System Service on Mobile Devices** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.11805)
- **Fast On-device LLM Inference with NPUs** [arxiv 2024.12] [paper](https://arxiv.org/abs/2407.05858)

### Privacy

- **A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage** [MobiArch 2024] [paper](https://arxiv.org/abs/2409.04040)
- **No Free Lunch Theorem for Privacy-Preserving LLM Inference** [AIJ 2025.4] [paper](https://www.sciencedirect.com/science/article/pii/S0004370225000128)
- **MPC-Minimized Secure LLM Inference** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.03561)

### Simulator

- **Vidur: A Large-Scale Simulation Framework For LLM Inference** [MLSys 2024] [paper](https://arxiv.org/abs/2405.05465) [code](https://github.com/microsoft/vidur)
- **Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow** [ASPLOS 2025] [paper](https://arxiv.org/abs/2406.01566) [code](https://github.com/Thesys-lab/Helix-ASPLOS25)

### Fairness

- **Fairness in Serving Large Language Models** [OSDI 2024] [paper](https://arxiv.org/abs/2401.00588) [code](https://github.com/Ying1123/VTC-artifact)

### Energy

- **Towards Sustainable Large Language Model Serving** [HotCarbon 2024] [paper](https://arxiv.org/abs/2501.01990)

## Reference
We would be grateful if you could cite our survey in your research if you find it useful:
```
@misc{zhen2025tamingtitanssurveyefficient,
      title={Taming the Titans: A Survey of Efficient LLM Inference Serving},
      author={Ranran Zhen and Juntao Li and Yixin Ji and Zhenlin Yang and Tong Liu and Qingrong Xia and Xinyu Duan and Zhefeng Wang and Baoxing Huai and Min Zhang},
      year={2025},
      eprint={2504.19720},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.19720},
}
```