├── project.html
├── .gitignore
├── LICENSE
└── README.md

/project.html:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Editor directories and files
.idea/
.vscode/
*.swp
*.swo
*~

# Backup files
*.bak
*.backup

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Awesome-Context-Compression-LLMs Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome-Context-Compression-LLMs 🗜️

[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)]()

> A curated collection of research papers focused on enhancing the efficiency of Large Language Models (LLMs) through context compression techniques. These methods aim to **reduce token usage**, **compress latent states**, and **optimize memory footprints (KV Cache)**.

## 📑 Table of Contents

- [Introduction](#-introduction)
- [Taxonomy](#-taxonomy)
- [Explicit Context Compression (Prompt/Input Level)](#-explicit-context-compression-promptinput-level)
- [Implicit Context Compression (Latent/Reasoning Level)](#-implicit-context-compression-latentreasoning-level)
- [Inference-Time KV Compression (Memory/Cache Level)](#-inference-time-kv-compression-memorycache-level)
- [Surveys](#-surveys)
- [Contributing](#-contributing)
- [Star History](#-star-history)

## 🎯 Introduction

As LLMs scale to handle longer contexts and more complex tasks, efficient context management becomes crucial.
This repository organizes papers into three distinct categories based on **where** and **how** compression occurs:

1. **Explicit Compression**: Operates on input tokens before/during encoding
2. **Implicit Compression**: Compresses context into latent representations
3. **KV Compression**: Optimizes the Key-Value cache during inference

## 🗂️ Taxonomy

```
Context Compression Methods
├── Explicit Context Compression (Input Level)
│   ├── Token Pruning (LLMLingua, Selective-Context)
│   ├── Summarization-based
│   └── Information-theoretic Selection
│
├── Implicit Context Compression (Latent Level)
│   ├── Soft Prompt (AutoCompressor)
│   ├── Autoencoder-based (ICAE, CoCom)
│   └── Latent Reasoning (Coconut)
│
└── Inference-Time KV Compression (Cache Level)
    ├── Eviction Policies (H2O, TOVA, SnapKV)
    ├── Quantization (KIVI)
    └── Sparse Attention (StreamingLLM)
```

---

## 📝 Explicit Context Compression (Prompt/Input Level)

**Definition**: Methods that operate primarily on the input text or input tokens. They select, prune, or summarize the context **before or during the initial encoding** to shorten the input sequence length. The goal is often to fit more context into the window or to reduce API costs.

**Keywords**: `Prompt Compression`, `Token Pruning`, `Summarization`, `Information Entropy`, `Token Selection`, `Coarse-grained Pruning`

| Paper Title | Venue/Date | Tags | Code | TL;DR |
| :--- | :--- | :--- | :--- | :--- |
| [TokenSkip: Controllable Chain-of-Thought Compression in LLMs](https://arxiv.org/abs/2502.12067) | EMNLP 2025 Main | `Token Pruning`, `Distillation` | [GitHub](https://github.com/hemingkx/TokenSkip) | A simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. |
| [LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression](https://arxiv.org/abs/2403.12968) | ACL 2024 | `Pruning`, `Distillation` | [GitHub](https://github.com/microsoft/LLMLingua) | Learns compression from GPT-4 annotations, 3x-6x faster than LLMLingua |
| [LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression](https://arxiv.org/abs/2310.06839) | ACL 2024 | `Pruning`, `RAG` | [GitHub](https://github.com/microsoft/LLMLingua) | Question-aware compression for RAG scenarios, reorders retrieved documents by relevance |
| [Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference](https://arxiv.org/abs/2403.09054) | MLSys 2024 | `Pruning` | [GitHub](https://github.com/d-matrix-ai/keyformer) | Identifies key tokens at each layer for selective retention |
| [RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation](https://arxiv.org/abs/2310.04408) | ICLR 2024 | `Summarization`, `RAG` | [GitHub](https://github.com/carriex/recomp) | Trains extractive/abstractive compressors for retrieved documents |
| [Nugget: Neural Compression for Efficient Prompt Decoding](https://arxiv.org/abs/2310.04749) | ICLR 2024 | `Pruning` | - | Learns to identify and preserve "nugget" tokens for compression |
| [Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading](https://arxiv.org/abs/2310.05029) | NAACL 2024 | `Summarization` | - | MemWalker: iteratively summarizes and navigates long documents |
| [LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models](https://arxiv.org/abs/2310.05736) | EMNLP 2023 | `Pruning` | [GitHub](https://github.com/microsoft/LLMLingua) | Uses a small LM to calculate perplexity and prune less informative tokens, achieving up to 20x compression |
| [Selective-Context: Compressing Contexts for Efficient Inference](https://arxiv.org/abs/2310.06201) | EMNLP 2023 | `Pruning`, `Self-Information` | [GitHub](https://github.com/liyucheng09/Selective_Context) | Filters out low self-information content using a small LM |
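To make the token-pruning idea above concrete, here is a minimal sketch in the spirit of Selective-Context / LLMLingua: a small causal LM scores each token by its self-information (surprisal under the prefix), and the least informative tokens are dropped. This is a simplified illustration rather than either paper's exact algorithm; the scorer model (`gpt2`), the keep ratio, and the token-level budgeting are placeholder choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small scorer LM; any causal LM works for this sketch ("gpt2" is just a placeholder).
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prune_by_self_information(text: str, keep_ratio: float = 0.5) -> str:
    """Drop the lowest-surprisal tokens, keeping roughly `keep_ratio` of them."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                              # [1, T, vocab]
    # Self-information of token t given its prefix: -log p(x_t | x_<t).
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    surprisal = -logprobs.gather(-1, ids[:, 1:, None]).squeeze(-1)[0]  # [T-1]
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = torch.zeros_like(surprisal, dtype=torch.bool)
    keep[surprisal.topk(k).indices] = True                   # keep the most surprising tokens
    kept = [ids[0, 0].item()] + [ids[0, i + 1].item() for i in range(surprisal.numel()) if keep[i]]
    return tok.decode(kept)

print(prune_by_self_information("The quick brown fox jumps over the lazy dog.", keep_ratio=0.6))
```

Real systems go further, e.g. question-aware scoring and document reordering (LongLLMLingua) or token classifiers distilled from GPT-4 annotations (LLMLingua-2).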
---

## 🧠 Implicit Context Compression (Latent/Reasoning Level)

**Definition**: Methods that compress context into **soft vectors, embeddings, or latent states**. This includes encoding long text into compact vector representations, as well as **latent reasoning**, where the chain of thought or intermediate reasoning steps are carried out in latent space (without emitting tokens) to reduce generation overhead.

**Keywords**: `Soft Prompt`, `Autoencoder`, `Memory Vectors`, `Latent Space Reasoning`, `Continuous Chain of Thought`, `Internal State Compression`

| Paper Title | Venue/Date | Tags | Code | TL;DR |
| :--- | :--- | :--- | :--- | :--- |
| [CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning](https://arxiv.org/pdf/2511.18659) | - | `Memory Slots`, `RAG` | [GitHub](https://github.com/apple/ml-clara) | CLaRa achieves significant compression rates (32x-64x) while preserving essential information for accurate answer generation. |
| [REFRAG: Rethinking RAG based Decoding](https://arxiv.org/abs/2509.01092) | - | `Autoencoder`, `Memory Slots`, `RAG` | [GitHub](https://github.com/facebookresearch/refrag) | Reduces the time-to-first-token (TTFT) of RAG systems by up to 30x. |
| [PCC: Pretraining Context Compressor for Large Language Models with Embedding-Based Memory](https://aclanthology.org/2025.acl-long.1394.pdf) | ACL 2025 Main | `Autoencoder`, `Memory Slots` | [GitHub](https://github.com/microsoft/AnthropomorphicIntelligence) | Explores the upper limit of implicit compression ratios and connects the compressed embeddings to downstream LLMs efficiently. |
| [500xCompressor: Generalized Prompt Compression for Large Language Models](https://arxiv.org/abs/2408.03094) | ACL 2025 Main | `Autoencoder`, `Memory Slots` | [GitHub](https://github.com/ZongqianLi/500xCompressor) | Compresses up to 500 natural-language tokens into a single special token. |
| [Coconut: Chain of Continuous Thought](https://arxiv.org/abs/2412.06769) | arXiv 2024.12 | `Latent Reasoning`, `CoT` | [GitHub](https://github.com/facebookresearch/coconut) | Performs reasoning in continuous latent space without outputting tokens |
| [xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token](https://arxiv.org/abs/2405.13792) | NeurIPS 2024 | `RAG`, `Compression` | [GitHub](https://github.com/Hannibal046/xRAG) | Compresses retrieved documents into dense representations |
| [PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents](https://arxiv.org/abs/2305.14564) | ACL 2024 | `Planning`, `Long Doc` | [GitHub](https://github.com/SimengSun/pearl) | Decomposes long document QA into planning and execution |
| [ICAE: In-context Autoencoder for Context Compression in a Large Language Model](https://arxiv.org/abs/2307.06945) | ICLR 2024 | `Autoencoder`, `Memory Slots` | [GitHub](https://github.com/getao/ICAE) | ICAE compresses 512 tokens into 128 memory slots for 4x compression |
| [AutoCompressor: Adapting Language Models to Compress Contexts](https://arxiv.org/abs/2305.14788) | EMNLP 2023 | `Soft Prompt` | [GitHub](https://github.com/princeton-nlp/AutoCompressors) | AutoCompressor: recursively compresses segments into summary vectors |
| [Scaling Latent Reasoning via Thinking Tokens](https://arxiv.org/abs/2311.04254) | arXiv 2023.11 | `Latent Reasoning`, `Pause Token` | - | Uses "thinking tokens" for implicit reasoning steps |
| [Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) | NeurIPS 2023 | `Contrastive`, `Long-Context` | [GitHub](https://github.com/CStanKonrad/long_llama) | LongLLaMA: uses contrastive learning to focus on relevant context |
| [Learning to Compress Prompts with Gist Tokens](https://arxiv.org/abs/2304.08467) | NeurIPS 2023 | `Soft Prompt` | [GitHub](https://github.com/jayelm/gisting) | Compresses instructions into learnable gist tokens |
| [Parallel Context Windows for Large Language Models](https://arxiv.org/abs/2212.10947) | ACL 2023 | `Parallel`, `Memory` | - | PCW: processes context in parallel windows and aggregates |
| [Training Language Models with Memory Augmentation](https://arxiv.org/abs/2205.12674) | EMNLP 2022 | `Memory`, `TRIME` | [GitHub](https://github.com/princeton-nlp/TRIME) | TRIME: retrieves and integrates memory tokens during training |
| [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507) | ICLR 2020 | `Compression`, `Memory` | - | Uses compressed memory to extend context beyond window limits |
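As a rough illustration of the memory-slot idea behind ICAE / AutoCompressor-style methods, the sketch below appends a few learnable memory embeddings to the context and reads off their final hidden states as the compressed representation. The class name, the `gpt2` backbone, and the slot count are illustrative assumptions; the actual papers additionally train these slots with autoencoding / language-modeling objectives and lightweight adapters, all of which is omitted here.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class MemorySlotCompressor(nn.Module):
    """Compress a context into `num_slots` dense vectors (ICAE-style sketch, not the paper's exact architecture)."""

    def __init__(self, model_name: str = "gpt2", num_slots: int = 16):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(model_name)
        hidden = self.lm.config.hidden_size
        # Learnable memory-slot embeddings appended after the context tokens.
        self.memory_embeds = nn.Parameter(torch.randn(num_slots, hidden) * 0.02)

    def compress(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok_embeds = self.lm.get_input_embeddings()(input_ids)              # [B, T, H]
        mem = self.memory_embeds.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        inputs = torch.cat([tok_embeds, mem], dim=1)                        # [B, T + k, H]
        out = self.lm(inputs_embeds=inputs, output_hidden_states=True)
        # The hidden states over the memory positions serve as the compressed context.
        return out.hidden_states[-1][:, -mem.size(1):]                      # [B, k, H]

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("A long document that we want to squeeze into a few vectors.", return_tensors="pt").input_ids
with torch.no_grad():
    slots = MemorySlotCompressor(num_slots=8).compress(ids)
print(slots.shape)  # e.g. torch.Size([1, 8, 768])
```

At generation time, these vectors (or soft prompts derived from them) are fed to the decoder in place of the original tokens, which is where the compression ratio comes from.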
---

## ⚡ Inference-Time KV Compression (Memory/Cache Level)

**Definition**: Methods that specifically target the **Key-Value (KV) Cache** during the generation phase. They aim to reduce GPU memory usage and latency by evicting "unimportant" KV pairs, quantizing the cache, or using sparse attention patterns. These methods typically operate **on the fly during inference**.

**Keywords**: `KV Cache Eviction`, `Heavy Hitters`, `Sparse Attention`, `Cache Quantization`, `Budget-constrained Generation`, `Infinite Context`, `Streaming Inference`
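Before the paper list, here is a minimal sketch of the eviction idea shared by methods such as H2O and SnapKV: keep the most recent tokens plus the "heavy hitters" that have received the most accumulated attention, and drop the rest of the cache. The shapes, budget, and recent-window size below are illustrative assumptions; real implementations track these statistics per layer and per head inside the serving stack.

```python
import torch

def evict_kv(keys, values, attn_scores, budget: int = 256, recent: int = 32):
    """Keep a fixed KV budget: the `recent` newest positions plus the heavy hitters
    with the highest accumulated attention (H2O-flavoured sketch).

    keys, values: [num_heads, seq_len, head_dim]
    attn_scores:  [num_heads, seq_len] accumulated attention each position has received
    """
    seq_len = keys.size(1)
    if seq_len <= budget:
        return keys, values
    recent_idx = torch.arange(seq_len - recent, seq_len)
    older = attn_scores.sum(dim=0).clone()      # aggregate importance over heads
    older[recent_idx] = float("-inf")           # the recent window is always kept
    heavy_idx = older.topk(budget - recent).indices
    keep = torch.cat([heavy_idx, recent_idx]).sort().values
    return keys[:, keep], values[:, keep]

# Toy usage: 8 heads, 1024 cached tokens, head_dim 64 -> compressed to 256 tokens.
k, v = torch.randn(8, 1024, 64), torch.randn(8, 1024, 64)
k2, v2 = evict_kv(k, v, torch.rand(8, 1024))
print(k2.shape)  # torch.Size([8, 256, 64])
```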
| Paper Title | Venue/Date | Tags | Code | TL;DR |
| :--- | :--- | :--- | :--- | :--- |
| [ParallelComp: Parallel Long-Context Compressor for Length Extrapolation](https://arxiv.org/pdf/2502.14317) | ICML 2025 | `Sparse` | [GitHub](https://github.com/menik1126/ParallelComp) | Splits long inputs into chunks processed in parallel while automatically removing redundant or irrelevant parts, improving both efficiency and performance. |
| [UNComp: Can Matrix Entropy Uncover Sparsity? — A Compressor Design from an Uncertainty-Aware Perspective](https://arxiv.org/pdf/2410.03090) | EMNLP 2025 | `Sparse`, `KV Cache Eviction` | [GitHub](https://github.com/menik1126/UNComp) | An uncertainty-aware framework that uses truncated matrix entropy to locate low-information regions, revealing sparsity patterns that can be exploited for adaptive compression. |
| [Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference](https://arxiv.org/abs/2406.10774) | ICML 2024 | `Sparse`, `Query-Aware` | [GitHub](https://github.com/mit-han-lab/Quest) | Page-based KV management with query-aware selection |
| [PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling](https://arxiv.org/abs/2406.02069) | arXiv 2024.06 | `Eviction`, `Layer-wise` | [GitHub](https://github.com/Zefan-Cai/PyramidKV) | Different layers retain different amounts of KV pairs (pyramid structure) |
| [MiniCache: KV Cache Compression in Depth Dimension for Large Language Models](https://arxiv.org/abs/2405.14366) | arXiv 2024.05 | `Eviction`, `Layer Merge` | - | Merges KV caches across similar layers to reduce memory |
| [SnapKV: LLM Knows What You are Looking for Before Generation](https://arxiv.org/abs/2404.14469) | arXiv 2024.04 | `Eviction`, `Observation Window` | [GitHub](https://github.com/FasterDecoding/SnapKV) | Uses observation window at prompt end to identify important KV pairs |
| [CaM: Cache Merging for Memory-efficient LLMs Inference](https://arxiv.org/abs/2403.17696) | ICLR 2024 | `Merging` | - | Merges similar KV pairs instead of hard eviction |
| [Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https://arxiv.org/abs/2403.09636) | ICML 2024 | `Compression`, `Learned` | - | DMC: learns to decide what to keep/discard dynamically |
| [Gear: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference](https://arxiv.org/abs/2403.05527) | arXiv 2024.03 | `Quantization`, `Residual` | [GitHub](https://github.com/HaoKang-Timmy/Gear) | Quantize majority + low-rank for outliers + sparse residual |
| [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750) | ICML 2024 | `Quantization`, `2-bit` | [GitHub](https://github.com/jy-yuan/KIVI) | Asymmetric 2-bit quantization: Keys quantized per-channel, Values per-token |
| [Anchor-based Large Language Models](https://arxiv.org/abs/2402.07616) | ACL 2024 | `Anchor`, `Compression` | [GitHub](https://github.com/lancopku/Anchor-LLM) | Groups and anchors tokens for parallel compression |
| [InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences](https://arxiv.org/abs/2402.04617) | arXiv 2024.02 | `Block`, `Memory` | [GitHub](https://github.com/thunlp/InfLLM) | Uses block-level memory units for extreme-length processing |
| [KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization](https://arxiv.org/abs/2401.18079) | arXiv 2024.01 | `Quantization`, `Per-Channel` | [GitHub](https://github.com/SqueezeAILab/KVQuant) | Per-channel quantization with outlier handling for extreme compression |
| [LoMA: Lossless Compressed Memory Attention](https://arxiv.org/abs/2401.09486) | arXiv 2024.01 | `Compression`, `Lossless` | - | Achieves lossless compression via efficient memory management |
| [TOVA: Token-wise Attention for Optimal KV-Cache Reduction](https://arxiv.org/abs/2401.06104) | arXiv 2024.01 | `Eviction`, `Token-wise` | [GitHub](https://github.com/schwartz-lab-NLP/TOVA) | Evicts tokens based on attention received in each generation step |
| [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) | ICLR 2024 | `Streaming`, `Attention Sink` | [GitHub](https://github.com/mit-han-lab/streaming-llm) | StreamingLLM: keeps initial "sink" tokens + recent window for infinite streaming |
| [H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models](https://arxiv.org/abs/2306.14048) | NeurIPS 2023 | `Eviction`, `Heavy Hitter` | [GitHub](https://github.com/FMInference/H2O) | Keeps only "heavy-hitter" tokens (high cumulative attention) plus recent tokens |
| [Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression](https://arxiv.org/abs/2305.17118) | NeurIPS 2023 | `Eviction`, `Persistence` | - | Important tokens remain important; prunes based on historical importance |
| [FastGen: Adaptive KV Cache Compression for Efficient LLM Inference](https://arxiv.org/abs/2310.01801) | arXiv 2023.10 | `Eviction`, `Adaptive` | - | Adaptive compression policies based on attention patterns |
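Complementing the eviction sketch above, the snippet below shows a bare-bones round trip of low-bit KV quantization in the spirit of KIVI / KVQuant: asymmetric per-channel quantization to a few integer levels, together with the scale and zero-point needed to dequantize. It is a toy under stated assumptions, not any paper's actual recipe; KIVI, for instance, quantizes Keys per-channel but Values per-token, handles outliers, and packs the 2-bit codes, none of which is shown here.

```python
import torch

def quantize_per_channel(x: torch.Tensor, n_bits: int = 2):
    """Asymmetric per-channel quantization of a KV tensor (simplified sketch).

    x: [num_heads, seq_len, head_dim]; the channel axis is the last dimension.
    Returns integer codes plus per-channel scale/zero-point for dequantization.
    """
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=(0, 1), keepdim=True)
    x_max = x.amax(dim=(0, 1), keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    codes = ((x - x_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, x_min

def dequantize(codes, scale, zero):
    return codes.float() * scale + zero

keys = torch.randn(8, 1024, 64)
codes, scale, zero = quantize_per_channel(keys, n_bits=2)
err = (dequantize(codes, scale, zero) - keys).abs().mean()
print(codes.dtype, f"mean abs error: {err:.3f}")
```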
---

## 📚 Surveys

| Paper Title | Venue/Date | Focus |
| :--- | :--- | :--- |
| [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) | NAACL 2025 | Comprehensive survey on prompt compression |
| [A Survey on Efficient Inference for Large Language Models](https://arxiv.org/abs/2404.14294) | arXiv 2024 | Covers KV cache and other inference optimizations |
| [A Survey on Model Compression for Large Language Models](https://arxiv.org/abs/2308.07633) | TACL 2024 | General LLM compression (quantization, pruning, distillation) |
| [Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models](https://arxiv.org/abs/2402.02244) | arXiv 2024 | Focuses on extending context length |
| [Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding](https://arxiv.org/abs/2312.17044) | arXiv 2023 | Survey on positional encoding for long contexts |

---

## 🤝 Contributing

We welcome contributions! Please follow these steps:

1. Fork the repository
2. Add your paper following the table format
3. Ensure the paper is correctly categorized
4. Submit a Pull Request
### Categorization Guidelines

- **Cat 1 (Explicit)**: If the paper discusses compressing prompts *before* sending to the model/API
- **Cat 2 (Implicit)**: If the paper compresses into latent vectors or performs latent reasoning
- **Cat 3 (KV Cache)**: If the paper manages GPU memory by manipulating KV pairs *during* inference

---

## ⭐ Star History

[![Star History Chart](https://api.star-history.com/svg?repos=broalantaps/Awesome-Context-Compression-LLMs)](https://star-history.com/#broalantaps/Awesome-Context-Compression-LLMs)

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgements

Special thanks to all researchers whose work is featured in this repository. Your contributions to making LLMs more efficient benefit the entire community.

---

If you find this repository helpful, please consider giving it a ⭐!
--------------------------------------------------------------------------------