├── project.html
├── .gitignore
├── LICENSE
└── README.md

/project.html:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Editor directories and files
.idea/
.vscode/
*.swp
*.swo
*~

# Backup files
*.bak
*.backup

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Awesome-Context-Compression-LLMs Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome-Context-Compression-LLMs 🗜️

[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)]()

> A curated collection of research papers focused on enhancing the efficiency of Large Language Models (LLMs) through context compression techniques. These methods aim to **reduce token usage**, **compress latent states**, and **optimize memory footprints (KV Cache)**.

## 📑 Table of Contents

- [Introduction](#-introduction)
- [Taxonomy](#-taxonomy)
- [Explicit Context Compression (Prompt/Input Level)](#-explicit-context-compression-promptinput-level)
- [Implicit Context Compression (Latent/Reasoning Level)](#-implicit-context-compression-latentreasoning-level)
- [Inference-Time KV Compression (Memory/Cache Level)](#-inference-time-kv-compression-memorycache-level)
- [Surveys](#-surveys)
- [Contributing](#-contributing)
- [Star History](#-star-history)

## 🎯 Introduction

As LLMs scale to handle longer contexts and more complex tasks, efficient context management becomes crucial.
This repository organizes papers into three distinct categories based on **where** and **how** compression occurs:

1. **Explicit Compression**: Operates on input tokens before/during encoding
2. **Implicit Compression**: Compresses context into latent representations
3. **KV Compression**: Optimizes the Key-Value cache during inference

## 🗂️ Taxonomy

```
Context Compression Methods
├── Explicit Context Compression (Input Level)
│   ├── Token Pruning (LLMLingua, Selective-Context)
│   ├── Summarization-based
│   └── Information-theoretic Selection
│
├── Implicit Context Compression (Latent Level)
│   ├── Soft Prompt (AutoCompressor)
│   ├── Autoencoder-based (ICAE, CoCom)
│   └── Latent Reasoning (Coconut)
│
└── Inference-Time KV Compression (Cache Level)
    ├── Eviction Policies (H2O, TOVA, SnapKV)
    ├── Quantization (KIVI)
    └── Sparse Attention (StreamingLLM)
```

---

## 📝 Explicit Context Compression (Prompt/Input Level)

**Definition**: Methods that operate primarily on the input text or input tokens. They select, prune, or summarize the context **before or during the initial encoding** to shorten the input sequence length. The goal is often to fit more context into the window or to reduce API costs.

**Keywords**: `Prompt Compression`, `Token Pruning`, `Summarization`, `Information Entropy`, `Token Selection`, `Coarse-grained Pruning`

| Paper Title | Venue/Date | Tags | Code | TL;DR |
| :--- | :--- | :--- | :--- | :--- |
| [TokenSkip: Controllable Chain-of-Thought Compression in LLMs](https://arxiv.org/abs/2502.12067) | EMNLP 2025 Main | `Token Pruning`, `Distillation` | [GitHub](https://github.com/hemingkx/TokenSkip) | A simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. |
| [LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression](https://arxiv.org/abs/2403.12968) | ACL 2024 | `Pruning`, `Distillation` | [GitHub](https://github.com/microsoft/LLMLingua) | Learns compression from GPT-4 annotations, 3x-6x faster than LLMLingua |
| [LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression](https://arxiv.org/abs/2310.06839) | ACL 2024 | `Pruning`, `RAG` | [GitHub](https://github.com/microsoft/LLMLingua) | Question-aware compression for RAG scenarios, reorders retrieved documents by relevance |
| [Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference](https://arxiv.org/abs/2403.09054) | MLSys 2024 | `Pruning` | [GitHub](https://github.com/d-matrix-ai/keyformer) | Identifies key tokens at each layer for selective retention |
| [RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation](https://arxiv.org/abs/2310.04408) | ICLR 2024 | `Summarization`, `RAG` | [GitHub](https://github.com/carriex/recomp) | Trains extractive/abstractive compressors for retrieved documents |
| [Nugget: Neural Compression for Efficient Prompt Decoding](https://arxiv.org/abs/2310.04749) | ICLR 2024 | `Pruning` | - | Learns to identify and preserve "nugget" tokens for compression |
| [Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading](https://arxiv.org/abs/2310.05029) | NAACL 2024 | `Summarization` | - | MemWalker: iteratively summarizes and navigates long documents |
| [LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models](https://arxiv.org/abs/2310.05736) | EMNLP 2023 | `Pruning` | [GitHub](https://github.com/microsoft/LLMLingua) | Uses a small LM to calculate perplexity and prune less informative tokens, achieving up to 20x compression |
| [Selective-Context: Compressing Contexts for Efficient Inference](https://arxiv.org/abs/2310.06201) | EMNLP 2023 | `Pruning`, `Self-Information` | [GitHub](https://github.com/liyucheng09/Selective_Context) | Filters out low self-information content using a small LM |
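To make the token-pruning idea above concrete, here is a minimal sketch in the spirit of Selective-Context / LLMLingua: a small causal LM scores each token by its self-information (surprisal under the prefix), and the least informative tokens are dropped. This is a simplified illustration rather than either paper's exact algorithm; the scorer model (`gpt2`), the keep ratio, and the token-level budgeting are placeholder choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small scorer LM; any causal LM works for this sketch ("gpt2" is just a placeholder).
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prune_by_self_information(text: str, keep_ratio: float = 0.5) -> str:
    """Drop the lowest-surprisal tokens, keeping roughly `keep_ratio` of them."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                              # [1, T, vocab]
    # Self-information of token t given its prefix: -log p(x_t | x_<t).
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    surprisal = -logprobs.gather(-1, ids[:, 1:, None]).squeeze(-1)[0]  # [T-1]
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = torch.zeros_like(surprisal, dtype=torch.bool)
    keep[surprisal.topk(k).indices] = True                   # keep the most surprising tokens
    kept = [ids[0, 0].item()] + [ids[0, i + 1].item() for i in range(surprisal.numel()) if keep[i]]
    return tok.decode(kept)

print(prune_by_self_information("The quick brown fox jumps over the lazy dog.", keep_ratio=0.6))
```

Real systems go further, e.g. question-aware scoring and document reordering (LongLLMLingua) or token classifiers distilled from GPT-4 annotations (LLMLingua-2).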
---

## 🧠 Implicit Context Compression (Latent/Reasoning Level)

**Definition**: Methods that compress context into **soft vectors, embeddings, or latent states**. This includes encoding long text into compact vector representations, as well as **latent reasoning**, where the chain of thought or intermediate reasoning steps are carried out in latent space (without emitting tokens) to reduce generation overhead.

**Keywords**: `Soft Prompt`, `Autoencoder`, `Memory Vectors`, `Latent Space Reasoning`, `Continuous Chain of Thought`, `Internal State Compression`

| Paper Title | Venue/Date | Tags | Code | TL;DR |
| :--- | :--- | :--- | :--- | :--- |
| [CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning](https://arxiv.org/pdf/2511.18659) | - | `Memory Slots`, `RAG` | [GitHub](https://github.com/apple/ml-clara) | CLaRa achieves significant compression rates (32x-64x) while preserving essential information for accurate answer generation. |
| [REFRAG: Rethinking RAG based Decoding](https://arxiv.org/abs/2509.01092) | - | `Autoencoder`, `Memory Slots`, `RAG` | [GitHub](https://github.com/facebookresearch/refrag) | Reduces the time-to-first-token (TTFT) of RAG systems by up to 30x. |
| [PCC: Pretraining Context Compressor for Large Language Models with Embedding-Based Memory](https://aclanthology.org/2025.acl-long.1394.pdf) | ACL 2025 Main | `Autoencoder`, `Memory Slots` | [GitHub](https://github.com/microsoft/AnthropomorphicIntelligence) | Explores the upper limit of implicit compression ratios and connects the compressed embeddings to downstream LLMs efficiently. |
| [500xCompressor: Generalized Prompt Compression for Large Language Models](https://arxiv.org/abs/2408.03094) | ACL 2025 Main | `Autoencoder`, `Memory Slots` | [GitHub](https://github.com/ZongqianLi/500xCompressor) | Compresses up to 500 natural-language tokens into a single special token. |
| [Coconut: Chain of Continuous Thought](https://arxiv.org/abs/2412.06769) | arXiv 2024.12 | `Latent Reasoning`, `CoT` | [GitHub](https://github.com/facebookresearch/coconut) | Performs reasoning in continuous latent space without outputting tokens |
| [xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token](https://arxiv.org/abs/2405.13792) | NeurIPS 2024 | `RAG`, `Compression` | [GitHub](https://github.com/Hannibal046/xRAG) | Compresses retrieved documents into dense representations |
| [PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents](https://arxiv.org/abs/2305.14564) | ACL 2024 | `Planning`, `Long Doc` | [GitHub](https://github.com/SimengSun/pearl) | Decomposes long document QA into planning and execution |
| [ICAE: In-context Autoencoder for Context Compression in a Large Language Model](https://arxiv.org/abs/2307.06945) | ICLR 2024 | `Autoencoder`, `Memory Slots` | [GitHub](https://github.com/getao/ICAE) | ICAE compresses 512 tokens into 128 memory slots for 4x compression |
| [AutoCompressor: Adapting Language Models to Compress Contexts](https://arxiv.org/abs/2305.14788) | EMNLP 2023 | `Soft Prompt` | [GitHub](https://github.com/princeton-nlp/AutoCompressors) | AutoCompressor: recursively compresses segments into summary vectors |
| [Scaling Latent Reasoning via Thinking Tokens](https://arxiv.org/abs/2311.04254) | arXiv 2023.11 | `Latent Reasoning`, `Pause Token` | - | Uses "thinking tokens" for implicit reasoning steps |
| [Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) | NeurIPS 2023 | `Contrastive`, `Long-Context` | [GitHub](https://github.com/CStanKonrad/long_llama) | LongLLaMA: uses contrastive learning to focus on relevant context |
| [Learning to Compress Prompts with Gist Tokens](https://arxiv.org/abs/2304.08467) | NeurIPS 2023 | `Soft Prompt` | [GitHub](https://github.com/jayelm/gisting) | Compresses instructions into learnable gist tokens |
| [Parallel Context Windows for Large Language Models](https://arxiv.org/abs/2212.10947) | ACL 2023 | `Parallel`, `Memory` | - | PCW: processes context in parallel windows and aggregates |
| [Training Language Models with Memory Augmentation](https://arxiv.org/abs/2205.12674) | EMNLP 2022 | `Memory`, `TRIME` | [GitHub](https://github.com/princeton-nlp/TRIME) | TRIME: retrieves and integrates memory tokens during training |
| [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507) | ICLR 2020 | `Compression`, `Memory` | - | Uses compressed memory to extend context beyond window limits |
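As a rough illustration of the memory-slot idea behind ICAE / AutoCompressor-style methods, the sketch below appends a few learnable memory embeddings to the context and reads off their final hidden states as the compressed representation. The class name, the `gpt2` backbone, and the slot count are illustrative assumptions; the actual papers additionally train these slots with autoencoding / language-modeling objectives and lightweight adapters, all of which is omitted here.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class MemorySlotCompressor(nn.Module):
    """Compress a context into `num_slots` dense vectors (ICAE-style sketch, not the paper's exact architecture)."""

    def __init__(self, model_name: str = "gpt2", num_slots: int = 16):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(model_name)
        hidden = self.lm.config.hidden_size
        # Learnable memory-slot embeddings appended after the context tokens.
        self.memory_embeds = nn.Parameter(torch.randn(num_slots, hidden) * 0.02)

    def compress(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok_embeds = self.lm.get_input_embeddings()(input_ids)              # [B, T, H]
        mem = self.memory_embeds.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        inputs = torch.cat([tok_embeds, mem], dim=1)                        # [B, T + k, H]
        out = self.lm(inputs_embeds=inputs, output_hidden_states=True)
        # The hidden states over the memory positions serve as the compressed context.
        return out.hidden_states[-1][:, -mem.size(1):]                      # [B, k, H]

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("A long document that we want to squeeze into a few vectors.", return_tensors="pt").input_ids
with torch.no_grad():
    slots = MemorySlotCompressor(num_slots=8).compress(ids)
print(slots.shape)  # e.g. torch.Size([1, 8, 768])
```

At generation time, these vectors (or soft prompts derived from them) are fed to the decoder in place of the original tokens, which is where the compression ratio comes from.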
---

## ⚡ Inference-Time KV Compression (Memory/Cache Level)

**Definition**: Methods that specifically target the **Key-Value (KV) Cache** during the generation phase. They aim to reduce GPU memory usage and latency by evicting "unimportant" KV pairs, quantizing the cache, or using sparse attention patterns. These methods typically operate **on the fly during inference**.

**Keywords**: `KV Cache Eviction`, `Heavy Hitters`, `Sparse Attention`, `Cache Quantization`, `Budget-constrained Generation`, `Infinite Context`, `Streaming Inference`
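Before the paper list, here is a minimal sketch of the eviction idea shared by methods such as H2O and SnapKV: keep the most recent tokens plus the "heavy hitters" that have received the most accumulated attention, and drop the rest of the cache. The shapes, budget, and recent-window size below are illustrative assumptions; real implementations track these statistics per layer and per head inside the serving stack.

```python
import torch

def evict_kv(keys, values, attn_scores, budget: int = 256, recent: int = 32):
    """Keep a fixed KV budget: the `recent` newest positions plus the heavy hitters
    with the highest accumulated attention (H2O-flavoured sketch).

    keys, values: [num_heads, seq_len, head_dim]
    attn_scores:  [num_heads, seq_len] accumulated attention each position has received
    """
    seq_len = keys.size(1)
    if seq_len <= budget:
        return keys, values
    recent_idx = torch.arange(seq_len - recent, seq_len)
    older = attn_scores.sum(dim=0).clone()      # aggregate importance over heads
    older[recent_idx] = float("-inf")           # the recent window is always kept
    heavy_idx = older.topk(budget - recent).indices
    keep = torch.cat([heavy_idx, recent_idx]).sort().values
    return keys[:, keep], values[:, keep]

# Toy usage: 8 heads, 1024 cached tokens, head_dim 64 -> compressed to 256 tokens.
k, v = torch.randn(8, 1024, 64), torch.randn(8, 1024, 64)
k2, v2 = evict_kv(k, v, torch.rand(8, 1024))
print(k2.shape)  # torch.Size([8, 256, 64])
```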
| Paper Title | Venue/Date | Tags | Code | TL;DR |
| :--- | :--- | :--- | :--- | :--- |
| [ParallelComp: Parallel Long-Context Compressor for Length Extrapolation](https://arxiv.org/pdf/2502.14317) | ICML 2025 | `Sparse` | [GitHub](https://github.com/menik1126/ParallelComp) | Splits long inputs into chunks processed in parallel while automatically removing redundant or irrelevant parts, improving both efficiency and performance. |
| [UNComp: Can Matrix Entropy Uncover Sparsity? — A Compressor Design from an Uncertainty-Aware Perspective](https://arxiv.org/pdf/2410.03090) | EMNLP 2025 | `Sparse`, `KV Cache Eviction` | [GitHub](https://github.com/menik1126/UNComp) | An uncertainty-aware framework that uses truncated matrix entropy to locate low-information regions, revealing sparsity patterns that can be exploited for adaptive compression. |
| [Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference](https://arxiv.org/abs/2406.10774) | ICML 2024 | `Sparse`, `Query-Aware` | [GitHub](https://github.com/mit-han-lab/Quest) | Page-based KV management with query-aware selection |
| [PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling](https://arxiv.org/abs/2406.02069) | arXiv 2024.06 | `Eviction`, `Layer-wise` | [GitHub](https://github.com/Zefan-Cai/PyramidKV) | Different layers retain different amounts of KV pairs (pyramid structure) |
| [MiniCache: KV Cache Compression in Depth Dimension for Large Language Models](https://arxiv.org/abs/2405.14366) | arXiv 2024.05 | `Eviction`, `Layer Merge` | - | Merges KV caches across similar layers to reduce memory |
| [SnapKV: LLM Knows What You are Looking for Before Generation](https://arxiv.org/abs/2404.14469) | arXiv 2024.04 | `Eviction`, `Observation Window` | [GitHub](https://github.com/FasterDecoding/SnapKV) | Uses observation window at prompt end to identify important KV pairs |
| [CaM: Cache Merging for Memory-efficient LLMs Inference](https://arxiv.org/abs/2403.17696) | ICLR 2024 | `Merging` | - | Merges similar KV pairs instead of hard eviction |
| [Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https://arxiv.org/abs/2403.09636) | ICML 2024 | `Compression`, `Learned` | - | DMC: learns to decide what to keep/discard dynamically |
| [Gear: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference](https://arxiv.org/abs/2403.05527) | arXiv 2024.03 | `Quantization`, `Residual` | [GitHub](https://github.com/HaoKang-Timmy/Gear) | Quantize majority + low-rank for outliers + sparse residual |
| [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750) | ICML 2024 | `Quantization`, `2-bit` | [GitHub](https://github.com/jy-yuan/KIVI) | Asymmetric 2-bit quantization: Keys quantized per-channel, Values per-token |
| [Anchor-based Large Language Models](https://arxiv.org/abs/2402.07616) | ACL 2024 | `Anchor`, `Compression` | [GitHub](https://github.com/lancopku/Anchor-LLM) | Groups and anchors tokens for parallel compression |
| [InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences](https://arxiv.org/abs/2402.04617) | arXiv 2024.02 | `Block`, `Memory` | [GitHub](https://github.com/thunlp/InfLLM) | Uses block-level memory units for extreme-length processing |
| [KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization](https://arxiv.org/abs/2401.18079) | arXiv 2024.01 | `Quantization`, `Per-Channel` | [GitHub](https://github.com/SqueezeAILab/KVQuant) | Per-channel quantization with outlier handling for extreme compression |
| [LoMA: Lossless Compressed Memory Attention](https://arxiv.org/abs/2401.09486) | arXiv 2024.01 | `Compression`, `Lossless` | - | Achieves lossless compression via efficient memory management |
| [TOVA: Token-wise Attention for Optimal KV-Cache Reduction](https://arxiv.org/abs/2401.06104) | arXiv 2024.01 | `Eviction`, `Token-wise` | [GitHub](https://github.com/schwartz-lab-NLP/TOVA) | Evicts tokens based on attention received in each generation step |
| [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) | ICLR 2024 | `Streaming`, `Attention Sink` | [GitHub](https://github.com/mit-han-lab/streaming-llm) | StreamingLLM: keeps initial "sink" tokens + recent window for infinite streaming |
| [H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models](https://arxiv.org/abs/2306.14048) | NeurIPS 2023 | `Eviction`, `Heavy Hitter` | [GitHub](https://github.com/FMInference/H2O) | Keeps only "heavy-hitter" tokens (high cumulative attention) plus recent tokens |
| [Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression](https://arxiv.org/abs/2305.17118) | NeurIPS 2023 | `Eviction`, `Persistence` | - | Important tokens remain important; prunes based on historical importance |
| [FastGen: Adaptive KV Cache Compression for Efficient LLM Inference](https://arxiv.org/abs/2310.01801) | arXiv 2023.10 | `Eviction`, `Adaptive` | - | Adaptive compression policies based on attention patterns |
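Complementing the eviction sketch above, the snippet below shows a bare-bones round trip of low-bit KV quantization in the spirit of KIVI / KVQuant: asymmetric per-channel quantization to a few integer levels, together with the scale and zero-point needed to dequantize. It is a toy under stated assumptions, not any paper's actual recipe; KIVI, for instance, quantizes Keys per-channel but Values per-token, handles outliers, and packs the 2-bit codes, none of which is shown here.

```python
import torch

def quantize_per_channel(x: torch.Tensor, n_bits: int = 2):
    """Asymmetric per-channel quantization of a KV tensor (simplified sketch).

    x: [num_heads, seq_len, head_dim]; the channel axis is the last dimension.
    Returns integer codes plus per-channel scale/zero-point for dequantization.
    """
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=(0, 1), keepdim=True)
    x_max = x.amax(dim=(0, 1), keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    codes = ((x - x_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, x_min

def dequantize(codes, scale, zero):
    return codes.float() * scale + zero

keys = torch.randn(8, 1024, 64)
codes, scale, zero = quantize_per_channel(keys, n_bits=2)
err = (dequantize(codes, scale, zero) - keys).abs().mean()
print(codes.dtype, f"mean abs error: {err:.3f}")
```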
---

## 📚 Surveys

| Paper Title | Venue/Date | Focus |
| :--- | :--- | :--- |
| [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) | NAACL 2025 | Comprehensive survey on prompt compression |
| [A Survey on Efficient Inference for Large Language Models](https://arxiv.org/abs/2404.14294) | arXiv 2024 | Covers KV cache and other inference optimizations |
| [A Survey on Model Compression for Large Language Models](https://arxiv.org/abs/2308.07633) | TACL 2024 | General LLM compression (quantization, pruning, distillation) |
| [Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models](https://arxiv.org/abs/2402.02244) | arXiv 2024 | Focuses on extending context length |
| [Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding](https://arxiv.org/abs/2312.17044) | arXiv 2023 | Survey on positional encoding for long contexts |

---

## 🤝 Contributing

We welcome contributions! Please follow these steps:

1. Fork the repository
2. Add your paper following the table format
3. Ensure the paper is correctly categorized
4. Submit a Pull Request
### Categorization Guidelines

- **Cat 1 (Explicit)**: If the paper discusses compressing prompts *before* sending to the model/API
- **Cat 2 (Implicit)**: If the paper compresses into latent vectors or performs latent reasoning
- **Cat 3 (KV Cache)**: If the paper manages GPU memory by manipulating KV pairs *during* inference

---

## ⭐ Star History

[![Star History Chart](https://api.star-history.com/svg?repos=broalantaps/Awesome-Context-Compression-LLMs)](https://star-history.com/#broalantaps/Awesome-Context-Compression-LLMs)

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgements

Special thanks to all researchers whose work is featured in this repository. Your contributions to making LLMs more efficient benefit the entire community.

---

If you find this repository helpful, please consider giving it a ⭐!
--------------------------------------------------------------------------------