# Awesome LLM Inference Serving

![](./img/title.png)

This repository contains the literature referenced in [Taming the Titans: A Survey of Efficient LLM Inference Serving](https://arxiv.org/abs/2504.19720), and will be updated regularly.

## Table of Contents
![](./img/overview.png)
- [Table of Contents](#table-of-contents)
- [LLM Inference Serving in Instance](#llm-inference-serving-in-instance)
  - [Model Placement](#model-placement)
    - [Model Parallelism](#model-parallelism)
    - [Offloading](#offloading)
  - [Request Scheduling](#request-scheduling)
    - [Inter-Request Scheduling](#inter-request-scheduling)
    - [Intra-Request Scheduling](#intra-request-scheduling)
  - [Decoding Length Prediction](#decoding-length-prediction)
    - [Exact Length Prediction](#exact-length-prediction)
    - [Range-Based Classification](#range-based-classification)
    - [Relative Ranking Prediction](#relative-ranking-prediction)
  - [KV Cache Optimization](#kv-cache-optimization)
    - [Memory Management](#memory-management)
    - [Reuse Strategies](#reuse-strategies)
    - [Compression Techniques](#compression-techniques)
  - [PD Disaggregation](#pd-disaggregation)
- [LLM Inference Serving in Cluster](#llm-inference-serving-in-cluster)
  - [Cluster Optimization](#cluster-optimization)
    - [Architecture and Optimization for Heterogeneous Resources](#architecture-and-optimization-for-heterogeneous-resources)
    - [Service-Aware Scheduling](#service-aware-scheduling)
  - [Load Balancing](#load-balancing)
    - [Heuristic Algorithm](#heuristic-algorithm)
    - [Dynamic Scheduling](#dynamic-scheduling)
    - [Intelligent Predictive Scheduling](#intelligent-predictive-scheduling)
  - [Cloud-Based LLM Serving](#cloud-based-llm-serving)
    - [Deployment and Computing Effective](#deployment-and-computing-effective)
    - [Cooperation with Edge Device](#cooperation-with-edge-device)
- [Emerging Scenarios](#emerging-scenarios)
  - [Long Context](#long-context)
    - [Parallel Processing](#parallel-processing)
    - [Attention Computation](#attention-computation)
    - [KV Cache Management](#kv-cache-management)
  - [RAG](#rag)
    - [Workflow Scheduling](#workflow-scheduling)
    - [Storage Optimization](#storage-optimization)
  - [MoE](#moe)
    - [Expert Placement](#expert-placement)
    - [Expert Load Balancing](#expert-load-balancing)
    - [All-to-All Communication](#all-to-all-communication)
  - [LoRA](#lora)
  - [Speculative Decoding](#speculative-decoding)
  - [Augmented LLMs](#augmented-llms)
  - [Test-Time Reasoning](#test-time-reasoning)
- [Miscellaneous Areas](#miscellaneous-areas)
  - [Hardware](#hardware)
  - [Privacy](#privacy)
  - [Simulator](#simulator)
  - [Fairness](#fairness)
  - [Energy](#energy)
- [Reference](#reference)

## LLM Inference Serving in Instance

### Model Placement

#### Model Parallelism

- **GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism** [arxiv 2019.7] [paper](https://arxiv.org/abs/1811.06965)
- **PipeDream: Fast and Efficient Pipeline Parallel DNN Training** [arxiv 2018.6] [paper](https://arxiv.org/abs/1806.03377)
- **Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM** [arxiv 2021.4] [paper](https://dl.acm.org/doi/abs/10.1145/3458817.3476209) [code](https://github.com/nvidia/megatron-lm)
- **Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism** [arxiv 2020.3] [paper](https://arxiv.org/abs/1909.08053) [code](https://github.com/NVIDIA/Megatron-LM)
- **Reducing Activation Recomputation in Large Transformer Models** [arxiv 2022.5] [paper](https://arxiv.org/abs/2205.05198)
- **Context Parallelism (NVIDIA Megatron Core documentation)** [2024] [paper](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html)
- **Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity** [arxiv 2022.6] [paper](https://arxiv.org/abs/2101.03961) [code](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py)

#### Offloading

- **ZeRO-Offload: Democratizing Billion-Scale Model Training** [arxiv 2021.1] [paper](https://arxiv.org/abs/2101.06840)
- **DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale** [arxiv 2022.6] [paper](https://arxiv.org/abs/2207.00032) [code](https://github.com/deepspeedai/DeepSpeed)
- **FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU** [arxiv 2023.6] [paper](https://arxiv.org/abs/2303.06865) [code](https://github.com/FMInference/FlexLLMGen)
- **PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU** [arxiv 2024.12] [paper](https://arxiv.org/abs/2312.12456)
- **TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference** [SYSTOR 2024] [paper](https://dl.acm.org/doi/10.1145/3688351.3689164)
- **Improving Throughput-oriented LLM Inference with CPU Computations** [PACT 2024] [paper](https://dl.acm.org/doi/abs/10.1145/3656019.3676949)

### Request Scheduling

#### Inter-Request Scheduling

- **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
- **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
- **Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2407.00079) [code](https://github.com/kvcache-ai/Mooncake)
- **Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency** [NSDI 2019] [paper](https://www.usenix.org/conference/nsdi19/presentation/kaffes) [code](https://github.com/stanford-mast/shinjuku)
- **Fast Distributed Inference Serving for Large Language Models** [arxiv 2024.9] [paper](https://arxiv.org/abs/2305.05920)
- **Efficient LLM Scheduling by Learning to Rank** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.15792) [code](https://github.com/hao-ai-lab/vllm-ltr)
- **Don't Stop Me Now: Embedding Based Scheduling for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.01035)
- **Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking** [scs.stanford.edu 2024] [paper](https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf)
- **The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.07447)
- **BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching** [arxiv 2025.1] [paper](https://arxiv.org/abs/2412.03594)

#### Intra-Request Scheduling

- **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
- **DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.08671) [code](https://github.com/deepspeedai/DeepSpeed-MII)
- **Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)
- **Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving** [arxiv 2025.3] [paper](https://arxiv.org/abs/2406.13511)

### Decoding Length Prediction

#### Exact Length Prediction

- **Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction** [ICWS 2024] [paper](https://ieeexplore.ieee.org/abstract/document/10707595)
- **Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11181)
- **Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction** [arxiv 2024.11] [paper](https://arxiv.org/abs/2404.08509) [code](https://github.com/James-QiuHaoran/LLM-serving-with-proxy-models)

#### Range-Based Classification

- **Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline** [NeurIPS 2023] [paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/ce7ff3405c782f761fac7f849b41ae9a-Abstract-Conference.html) [code](https://github.com/zhengzangw/Sequence-Scheduling)
- **S3: Increasing GPU Utilization during Generative Inference for Higher Throughput** [NeurIPS 2023] [paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/3a13be0c5dae69e0f08065f113fb10b8-Abstract-Conference.html)
- **Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing** [arxiv 2025.1] [paper](https://arxiv.org/abs/2408.13510)
- **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
- **Power-aware Deep Learning Model Serving with μ-Serve** [ATC 2024] [paper](https://www.usenix.org/conference/atc24/presentation/qiu)
- **Don't Stop Me Now: Embedding Based Scheduling for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.01035)
- **SyncIntellects: Orchestrating LLM Inference with Progressive Prediction and QoS-Friendly Control** [IWQoS 2024] [paper](https://ieeexplore.ieee.org/document/10682949)

#### Relative Ranking Prediction

- **Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction** [arxiv 2024.11] [paper](https://arxiv.org/abs/2404.08509) [code](https://github.com/James-QiuHaoran/LLM-serving-with-proxy-models)
- **Efficient LLM Scheduling by Learning to Rank** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.15792) [code](https://github.com/hao-ai-lab/vllm-ltr)
- **SkipPredict: When to Invest in Predictions for Scheduling** [arxiv 2024.2] [paper](https://arxiv.org/abs/2402.03564)
- **BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching** [arxiv 2025.1] [paper](https://arxiv.org/abs/2412.03594)
- **Predicting LLM Inference Latency: A Roofline-Driven ML Method** [NeurIPS 2024] [paper](https://mlforsystems.org/assets/papers/neurips2024/paper28.pdf)
### KV Cache Optimization

#### Memory Management

- **Efficient Memory Management for Large Language Model Serving with PagedAttention** [arxiv 2023.9] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
- **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.7] [paper](https://arxiv.org/abs/2401.02669)
- **FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.11421)
- **LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.00428)
- **KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.18169)
- **SYMPHONY: Improving Memory Management for LLM Inference Workloads** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.16434)
- **InstCache: A Predictive Cache for LLM Serving** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.13820)
- **PQCache: Product Quantization-based KVCache for Long Context LLM Inference** [arxiv 2025.3] [paper](https://arxiv.org/abs/2407.12820)
- **InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management** [arxiv 2024.6] [paper](https://arxiv.org/abs/2406.19707)

#### Reuse Strategies

- **Efficient Memory Management for Large Language Model Serving with PagedAttention** [arxiv 2023.9] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
- **MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool** [arxiv 2024.12] [paper](https://arxiv.org/abs/2406.17565)
- **Preble: Efficient Distributed Prompt Scheduling for LLM Serving** [arxiv 2024.10] [paper](https://arxiv.org/abs/2407.00023) [code](https://github.com/WukLab/preble)
- **Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention** [ATC 2024] [paper](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost)
- **GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings** [NLP-OSS 2023] [paper](https://aclanthology.org/2023.nlposs-1.24/)
- **SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models** [arxiv 2024.5] [paper](https://arxiv.org/abs/2406.00025)

#### Compression Techniques

- **Model Compression and Efficient Inference for Large Language Models: A Survey** [arxiv 2024.2] [paper](https://arxiv.org/abs/2402.09748)
- **FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU** [arxiv 2023.6] [paper](https://arxiv.org/abs/2303.06865) [code](https://github.com/FMInference/FlexLLMGen)
- **KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization** [2024] [paper](https://www.researchgate.net/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization?channel=doi&linkId=658b5d282468df72d3db3280&showFulltext=true)
- **MiniCache: KV Cache Compression in Depth Dimension for Large Language Models** [arxiv 2024.9] [paper](https://arxiv.org/abs/2405.14366) [code](https://github.com/AkideLiu/MiniCache)
- **AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration** [MLSys 2024] [paper](https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html) [code](https://github.com/mit-han-lab/llm-awq)
- **Atom: Low-bit Quantization for Efficient and Accurate LLM Serving** [arxiv 2024.4] [paper](https://arxiv.org/abs/2310.19102)
- **QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.04532) [code](https://github.com/mit-han-lab/omniserve)
- **CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2310.07240) [code](https://github.com/UChi-JCL/CacheGen)

### PD Disaggregation

- **DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin) [code](https://github.com/LLMServe/DistServe)
- **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/abstract/document/10609649)
- **DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.01876)
- **Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving** [arxiv 2024.7] [paper](https://arxiv.org/abs/2407.00079) [code](https://github.com/kvcache-ai/Mooncake)
- **Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11181)
- **P/D-Serve: Serving Disaggregated Large Language Model at Scale** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.08147)

## LLM Inference Serving in Cluster

### Cluster Optimization

#### Architecture and Optimization for Heterogeneous Resources

- **Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling** [SOSP 2023] [paper](https://dl.acm.org/doi/10.1145/3600006.3613175)
- **Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow** [ASPLOS 2025] [paper](https://arxiv.org/abs/2406.01566) [code](https://github.com/Thesys-lab/Helix-ASPLOS25)
- **LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.01136) [code](https://github.com/tonyzhao-jt/LLM-PQ)
- **HexGen: Generative Inference of Large Language Model over Heterogeneous Environment** [ICML 2024] [paper](https://arxiv.org/abs/2311.11514) [code](https://github.com/Relaxed-System-Lab/HexGen)
- **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/document/10609649)
- **DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving** [OSDI 2024] [paper](https://arxiv.org/abs/2401.09670) [code](https://github.com/LLMServe/DistServe)
- **HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment** [ICLR 2025] [paper](https://arxiv.org/abs/2502.07903)
- **Optimizing LLM Inference Clusters for Enhanced Performance and Energy Efficiency** [TechRxiv 2024.12] [paper](https://www.techrxiv.org/users/812455/articles/1213926-optimizing-llm-inference-clusters-for-enhanced-performance-and-energy-efficiency)

#### Service-Aware Scheduling

- **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.00741)
- **Splitwise: Efficient Generative LLM Inference Using Phase Splitting** [ISCA 2024] [paper](https://ieeexplore.ieee.org/document/10609649)

### Load Balancing

- **Orca: A Distributed Serving System for Transformer-Based Generative Models** [OSDI 2022] [paper](https://www.usenix.org/conference/osdi22/presentation/yu)
- **Efficient Memory Management for Large Language Model Serving with PagedAttention** [SOSP 2023] [paper](https://arxiv.org/abs/2309.06180) [code](https://github.com/vllm-project/vllm)
- **DeepSpeed-MII** [code](https://github.com/deepspeedai/DeepSpeed-MII)

#### Heuristic Algorithm

- **Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving** [arxiv 2024.6] [paper](https://arxiv.org/abs/2406.13511)
- **A Unified Framework for Max-Min and Min-Max Fairness With Applications** [TNET 2007.8] [paper](https://ieeexplore.ieee.org/document/4346554)
- **Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.17840)

#### Dynamic Scheduling

- **Llumnix: Dynamic Scheduling for Large Language Model Serving** [OSDI 2024] [paper](https://arxiv.org/abs/2406.03243) [code](https://github.com/AlibabaPAI/llumnix)

#### Intelligent Predictive Scheduling

- **Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.13510)

### Cloud-Based LLM Serving

#### Deployment and Computing Effective

- **SpotServe: Serving Generative Large Language Models on Preemptible Instances** [ASPLOS 2024] [paper](https://arxiv.org/abs/2311.15566) [code](https://github.com/Hsword/SpotServe)
- **ServerlessLLM: Low-Latency Serverless Inference for Large Language Models** [OSDI 2024] [paper](https://arxiv.org/abs/2401.14351) [code](https://github.com/ServerlessLLM/ServerlessLLM)
- **Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity** [arxiv 2024.4] [paper](https://arxiv.org/abs/2404.14527) [code](https://github.com/tyler-griggs/melange-release)
- **Characterizing Power Management Opportunities for LLMs in the Cloud** [ASPLOS 2024] [paper](https://dl.acm.org/doi/10.1145/3620666.3651329)
- **Predicting LLM Inference Latency: A Roofline-Driven ML Method** [NeurIPS 2024] [paper](https://mlforsystems.org/assets/papers/neurips2024/paper28.pdf)
- **Distributed Inference and Fine-tuning of Large Language Models Over The Internet** [NeurIPS 2023] [paper](https://arxiv.org/abs/2312.08361)

#### Cooperation with Edge Device

- **EdgeShard: Efficient LLM Inference via Collaborative Edge Computing** [JIOT 2024.12] [paper](https://ieeexplore.ieee.org/abstract/document/10818760)
- **PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.14636)
- **Hybrid SLM and LLM for Edge-Cloud Collaborative Inference** [EdgeFM 2024] [paper](https://dl.acm.org/doi/10.1145/3662006.3662067)
- **Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach** [TMC 2024.12] [paper](https://ieeexplore.ieee.org/document/10591707)
## Emerging Scenarios

### Long Context

#### Parallel Processing

- **LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism** [SOSP 2024] [paper](https://dl.acm.org/doi/10.1145/3694715.3695948) [code](https://github.com/LoongServe/LoongServe)

#### Attention Computation

- **Ring Attention with Blockwise Transformers for Near-Infinite Context** [arxiv 2023.10] [paper](https://arxiv.org/abs/2310.01889) [code](https://github.com/haoliuhl/ringattention)
- **Striped Attention: Faster Ring Attention for Causal Transformers** [arxiv 2023.10] [paper](https://arxiv.org/abs/2311.09431) [code](https://github.com/exists-forall/striped_attention/)
- **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.02669)
- **InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference** [arxiv 2024.9] [paper](https://arxiv.org/abs/2409.04992)

#### KV Cache Management

- **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.02669)
- **InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management** [OSDI 2024] [paper](https://arxiv.org/abs/2406.19707)
- **Marconi: Prefix Caching for the Era of Hybrid LLMs** [MLSys 2025] [paper](https://arxiv.org/abs/2411.19379) [code](https://github.com/ruipeterpan/marconi)

### RAG

#### Workflow Scheduling

- **PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.05676) [code](https://github.com/amazon-science/piperag)
- **Teola: Towards End-to-End Optimization of LLM-based Applications** [ASPLOS 2025] [paper](https://dl.acm.org/doi/10.1145/3676641.3716278) [code](https://github.com/NetX-lab/Ayo)
- **Accelerating Retrieval-Augmented Language Model Serving with Speculation** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.14021)
- **RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.10543)

#### Storage Optimization

- **RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation** [arxiv 2024.4] [paper](https://arxiv.org/abs/2404.12457)
- **Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection** [arxiv 2024.5] [paper](https://arxiv.org/abs/2405.16178)
- **CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion** [EuroSys 2025] [paper](https://dl.acm.org/doi/10.1145/3689031.3696098) [code](https://github.com/YaoJiayi/CacheBlend)
- **EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.15332)

### MoE

- **A Survey on Inference Optimization Techniques for Mixture of Experts Models** [arxiv 2024.12] [paper](https://arxiv.org/abs/2412.14219) [code](https://github.com/MoE-Inf/awesome-moe-inference/)

#### Expert Placement

- **Tutel: Adaptive Mixture-of-Experts at Scale** [MLSys 2023] [paper](https://proceedings.mlsys.org/paper_files/paper/2023/hash/5616d34cf8ff73942cfd5aa922842556-Abstract-mlsys2023.html) [code](https://github.com/microsoft/tutel)
- **DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale** [ICML 2022] [paper](https://proceedings.mlr.press/v162/rajbhandari22a) [code](https://github.com/deepspeedai/DeepSpeed)
- **FastMoE: A Fast Mixture-of-Expert Training System** [arxiv 2021.3] [paper](https://arxiv.org/abs/2103.13262) [code](https://github.com/laekov/fastmoe)
- **GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding** [ICLR 2021] [paper](https://iclr.cc/virtual/2021/poster/3196) [code](https://github.com/lucidrains/mixture-of-experts)

#### Expert Load Balancing

- **Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference** [arxiv 2023.3] [paper](https://arxiv.org/abs/2303.06182)
- **Optimizing Dynamic Neural Networks with Brainstorm** [OSDI 2023] [paper](https://www.usenix.org/conference/osdi23/presentation/cui) [code](https://github.com/Raphael-Hao/brainstorm)
- **Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection** [arxiv 2024.11] [paper](https://arxiv.org/abs/2411.08982)
- **Mixture-of-Experts with Expert Choice Routing** [NeurIPS 2022] [paper](https://dl.acm.org/doi/abs/10.5555/3600270.3600785)

#### All-to-All Communication

- **Tutel: Adaptive Mixture-of-Experts at Scale** [MLSys 2023] [paper](https://proceedings.mlsys.org/paper_files/paper/2023/hash/5616d34cf8ff73942cfd5aa922842556-Abstract-mlsys2023.html) [code](https://github.com/microsoft/tutel)
- **Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.17043)
- **Accelerating Distributed MoE Training and Inference with Lina** [USENIX ATC 2023] [paper](https://www.usenix.org/conference/atc23/presentation/li-jiamin)

### LoRA

- **LoRA: Low-Rank Adaptation of Large Language Models** [ICLR 2022] [paper](https://iclr.cc/virtual/2022/poster/6319) [code](https://github.com/microsoft/LoRA)
- **LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models** [ICLR 2024] [paper](https://arxiv.org/abs/2309.12307) [code](https://github.com/dvlab-research/LongLoRA)
- **QLoRA: Efficient Finetuning of Quantized LLMs** [NeurIPS 2023] [paper](https://arxiv.org/abs/2305.14314) [code](https://github.com/artidoro/qlora)
- **CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.11240)
- **dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving** [OSDI 2024] [paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang) [code](https://github.com/LLMServe/dLoRA-artifact)

### Speculative Decoding

- **Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding** [ACL 2024] [paper](https://arxiv.org/abs/2401.07851) [code](https://github.com/hemingkx/SpeculativeDecodingPapers)
- **OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure** [TACL 2025] [paper](https://aclanthology.org/2025.tacl-1.8/)
- **SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification** [ASPLOS 2024] [paper](https://dl.acm.org/doi/10.1145/3620666.3651335) [code](https://github.com/goliaro/specinfer-ae)

### Augmented LLMs
- **InferCept: Efficient Intercept Support for Augmented Large Language Model Inference** [ICML 2024] [paper](https://icml.cc/virtual/2024/poster/32755) [code](https://github.com/WukLab/InferCept)
- **Fast Inference for Augmented Large Language Models** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.18248)
- **Parrot: Efficient Serving of LLM-based Applications with Semantic Variable** [OSDI 2024] [paper](https://arxiv.org/abs/2405.19888) [code](https://github.com/microsoft/ParrotServe)

### Test-Time Reasoning

- **Test-Time Compute: from System-1 Thinking to System-2 Thinking** [arxiv 2025.1] [paper](https://arxiv.org/abs/2501.02497) [code](https://github.com/Dereck0602/Awesome_Test_Time_LLMs)
- **Efficiently Serving LLM Reasoning Programs with Certaindex** [arxiv 2024.10] [paper](https://arxiv.org/abs/2412.20993) [code](https://github.com/hao-ai-lab/Dynasor)
- **Learning How Hard to Think: Input-Adaptive Allocation of LM Computation** [ICLR 2025] [paper](https://arxiv.org/abs/2410.04707)

## Miscellaneous Areas

### Hardware

- **Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.14740)
- **Efficient LLM inference solution on Intel GPU** [arxiv 2024.1] [paper](https://arxiv.org/abs/2401.05391)
- **LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services** [arxiv 2024.10] [paper](https://arxiv.org/abs/2410.02425)
- **Demystifying Platform Requirements for Diverse LLM Inference Use Cases** [arxiv 2024.1] [paper](https://arxiv.org/abs/2406.01698) [code](https://github.com/abhibambhaniya/GenZ-LLM-Analyzer)
- **Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs** [arxiv 2024.7] [paper](https://arxiv.org/abs/2403.20041)
- **LLM as a System Service on Mobile Devices** [arxiv 2024.3] [paper](https://arxiv.org/abs/2403.11805)
- **Fast On-device LLM Inference with NPUs** [arxiv 2024.12] [paper](https://arxiv.org/abs/2407.05858)

### Privacy

- **A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage** [MobiArch 2024] [paper](https://arxiv.org/abs/2409.04040)
- **No Free Lunch Theorem for Privacy-Preserving LLM Inference** [AIJ 2025.4] [paper](https://www.sciencedirect.com/science/article/pii/S0004370225000128)
- **MPC-Minimized Secure LLM Inference** [arxiv 2024.8] [paper](https://arxiv.org/abs/2408.03561)

### Simulator

- **Vidur: A Large-Scale Simulation Framework For LLM Inference** [MLSys 2024] [paper](https://arxiv.org/abs/2405.05465) [code](https://github.com/microsoft/vidur)
- **Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow** [ASPLOS 2025] [paper](https://arxiv.org/abs/2406.01566) [code](https://github.com/Thesys-lab/Helix-ASPLOS25)

### Fairness

- **Fairness in Serving Large Language Models** [OSDI 2024] [paper](https://arxiv.org/abs/2401.00588) [code](https://github.com/Ying1123/VTC-artifact)

### Energy

- **Towards Sustainable Large Language Model Serving** [HotCarbon 2024] [paper](https://arxiv.org/abs/2501.01990)

## Reference
We would be grateful if you could cite our survey in your research if you find it useful:
```
@misc{zhen2025tamingtitanssurveyefficient,
      title={Taming the Titans: A Survey of Efficient LLM Inference Serving},
      author={Ranran Zhen and Juntao Li and Yixin Ji and Zhenlin Yang and Tong Liu and Qingrong Xia and Xinyu Duan and Zhefeng Wang and Baoxing Huai and Min Zhang},
      year={2025},
      eprint={2504.19720},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.19720},
}
```