# Awesome-Token-Compress
🔥🔥🔥 A paper list of recent works on token compression for ViT and VLM.
## VLM
### 2025
- [AdaTP: Attention-Debiased Token Pruning for Video Large Language Models](https://arxiv.org/pdf/2505.20100). [AdaTP; Video]
- [CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms](https://arxiv.org/pdf/2505.17020). [CrossLMM; Video; [GitHub](https://github.com/shilinyan99/CrossLMM)]
- [Clapper: Compact Learning and Video Representation in VLMs](https://arxiv.org/pdf/2505.15529). [Clapper; Video]
- [Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models](https://arxiv.org/pdf/2505.14454). [VidCom2; Video; [GitHub](https://github.com/xuyang-liu16/VidCom2)]
- [Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning](https://arxiv.org/pdf/2505.11945). [LLaVA-Meteor]
- [FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding](https://arxiv.org/pdf/2504.20384). [FiLA-Video; Video]
- [VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning](https://arxiv.org/pdf/2504.19627). [VCM]
- [TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos](https://arxiv.org/pdf/2504.17343). [TimeChat-Online; Video; [GitHub](https://github.com/yaolinli/TimeChat-Online)]
- [DYMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs](https://arxiv.org/pdf/2504.17040). [DYMU; [GitHub](https://github.com/MikeWangWZHL/dymu)]
- [Quicksviewer: An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes](https://arxiv.org/pdf/2504.15270). [Quicksviewer; Video; [GitHub](https://github.com/quicksviewer/quicksviewer)]
- [PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models](https://arxiv.org/pdf/2504.08966). [PACT; CVPR 2025; [GitHub](https://github.com/orailix/PACT/tree/main)]
- [QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA](https://arxiv.org/pdf/2504.00654). [QG-VTC; VQA]
- [InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression](https://arxiv.org/pdf/2503.21307). [InternVL-X; [GitHub](https://github.com/ludc506/InternVL-X)]
- [Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models](https://arxiv.org/pdf/2503.16980). [Token Dynamics; Video]
- [HICom: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models](https://arxiv.org/pdf/2503.16036). [HICom; CVPR 2025; Video; [GitHub](https://github.com/lntzm/HICom)]
- [FastVID: Dynamic Density Pruning for Fast Video Large Language Models](https://arxiv.org/abs/2503.11187). [FastVID; [GitHub](https://github.com/LunarShen/FastVID)]
- [SAINT: Similarity-Aware Token Pruning: Your VLM but Faster](https://arxiv.org/pdf/2503.11549). [SAINT; [GitHub](https://github.com/ArmenJeddi/saint)]
- [STORM: Token-Efficient Long Video Understanding for Multimodal LLMs](https://arxiv.org/pdf/2503.04130). [STORM; Video; NVIDIA]
- OpenReview [Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification](https://openreview.net/pdf?id=hzVpZDrW73). [Dynamic-LLaVA; ICLR 2025; [GitHub](https://github.com/Osilly/dynamic_llava)]
- [DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models](https://arxiv.org/pdf/2503.02175). [DivPrune; [GitHub](https://github.com/vbdi/divprune)]
- [FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression](https://arxiv.org/pdf/2502.18512). [FCoT-VL]
- [Beyond Token Compression: A Training-Free Reduction Framework for Efficient Visual Processing in MLLMs](https://arxiv.org/pdf/2501.19036). [Beyond Token Compression; [GitHub](https://github.com/L-Hugh/Beyond-Token-Compression)]
- [DyRate: Dynamic Token Reduction during Generation for Vision Language Models](https://arxiv.org/pdf/2501.14204). [DyRate]
- [AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture](https://arxiv.org/pdf/2501.09532). [AdaFV]
- [LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token](https://arxiv.org/pdf/2501.03895). [LLaVA-Mini; [GitHub](https://github.com/ictnlp/LLaVA-Mini)]
- [FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models](https://arxiv.org/pdf/2501.01986). [FrameFusion; Video; [GitHub](https://github.com/thu-nics/FrameFusion)]
- [VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling](https://arxiv.org/pdf/2501.00574). [VideoChat-Flash; Video; [GitHub](https://github.com/OpenGVLab/VideoChat-Flash)]
### 2024
- [ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding](https://arxiv.org/pdf/2412.20504). [ReTaKe; Video; [GitHub](https://github.com/SCZwangxiao/video-ReTaKe)]
- [FastVLM: Efficient Vision Encoding for Vision Language Models](https://arxiv.org/pdf/2412.13303). [FastVLM; Apple]
- [PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models](https://arxiv.org/pdf/2412.09613). [PVC; Video; [GitHub](https://github.com/OpenGVLab/PVC)]
- [Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM](https://arxiv.org/pdf/2412.09530). [Dynamic-VLM; Video]
- [VisionZip: Longer is Better but Not Necessary in Vision Language Models](https://arxiv.org/pdf/2412.04467). [VisionZip; Video; [GitHub](https://github.com/dvlab-research/VisionZip)]
- [p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay](https://arxiv.org/pdf/2412.04449). [p-MoD; [GitHub](https://github.com/MCG-NJU/p-MoD)]
- [[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster](https://arxiv.org/pdf/2412.01818). [FasterVLM; [GitHub](https://github.com/Theia-4869/FasterVLM)]
- [ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models](https://arxiv.org/pdf/2412.00447). [ATP-LLaVA]
- OpenReview [LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token](https://openreview.net/pdf?id=UQJ7CDW8nb). [LLaVA-Mini]
- [Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration](https://arxiv.org/pdf/2411.17686). [FiCoCo]
- OpenReview [LVP: Language-guide Visual Projector for Efficient Multimodal LLM](https://openreview.net/pdf?id=PxBzxO02Ef). [LVP]
- OpenReview [Efficient Multi-modal Large Language Models via Visual Token Grouping](https://openreview.net/pdf?id=ym1dS37mZE). [VisToG]
- [DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models](https://arxiv.org/pdf/2411.15024). [DyCoke; Video; [GitHub](https://github.com/KD-TAO/DyCoke)]
- [FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression](https://arxiv.org/pdf/2411.14228). [FocusLLaVA]
- [MustDrop: Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model](https://arxiv.org/pdf/2411.10803). [MustDrop; [GitHub](https://github.com/liuting20/MustDrop)]
- [Don't Look Twice: Faster Video Transformers with Run-Length Tokenization](https://arxiv.org/pdf/2411.05222). [RLT; Video; NeurIPS 2024; [GitHub](https://rccchoudhury.github.io/projects/rlt/)]
- [Inference Optimal VLMs Need Only One Visual Token but Larger Models](https://arxiv.org/pdf/2411.03312). [QueCC; [GitHub](https://github.com/locuslab/llava-token-compression)]
- [Video Token Merging for Long-form Video Understanding](https://arxiv.org/pdf/2410.23782). [Learnable VTM; Video]
- [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/pdf/2410.17434). [LongVU; Video; [GitHub](https://github.com/Vision-CAIR/LongVU)]
- [PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction](https://arxiv.org/pdf/2410.17247). [PyramidDrop; [GitHub](https://github.com/Cooperx521/PyramidDrop)]
- [Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers](https://arxiv.org/pdf/2410.14072). [Victor]
- [VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models](https://arxiv.org/pdf/2410.11417). [VidCompress]
- [Retrieval Replace Reduction: An effective visual token reduction method via semantic match](https://arxiv.org/pdf/2410.07278). [TRSM]
- [AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity](https://arxiv.org/pdf/2410.02745). [AVG-LLaVA; [GitHub](https://github.com/DeepLearnXMU/AVG-LLaVA)]
- [Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs](https://arxiv.org/pdf/2409.10994). [TRIM]
- [TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration](https://arxiv.org/pdf/2409.03206). [TC-LLaVA; Video]
- [TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings](https://arxiv.org/pdf/2409.09564). [TG-LLaVA]
- [mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding](https://arxiv.org/abs/2409.03420). [mPLUG-DocOwl2; [GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)]
- [TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval](https://arxiv.org/pdf/2409.01156). [TempMe; Video; ICLR 2025; [GitHub](https://github.com/LunarShen/TempMe)]
- [Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information](https://arxiv.org/pdf/2409.01179). [Recoverable Compression]
- [HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments](https://arxiv.org/pdf/2408.10945). [HiRED; [GitHub](https://github.com/hasanar1f/HiRED)]
- [mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models](https://arxiv.org/abs/2408.04840). [mPLUG-Owl3; [GitHub](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl3)]
- [Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding](https://arxiv.org/pdf/2407.14439). [Token-level; [GitHub](https://github.com/JiuTian-VL/TokenCorrCompressor)]
- [HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models](https://arxiv.org/pdf/2407.08706). [HiRes-LLaVA]
- [TokenPacker: Efficient Visual Projector for Multimodal LLM](https://arxiv.org/abs/2407.02392). [TokenPacker; [GitHub](https://github.com/CircleRadon/TokenPacker)]
- [VoCo-LLaMA: Towards Vision Compression with Large Language Models](https://arxiv.org/pdf/2406.12275). [VoCo-LLaMA; [GitHub](https://github.com/Yxxxb/VoCo-LLaMA)]
- [DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models](https://arxiv.org/pdf/2405.20985). [DeCo; [GitHub](https://github.com/yaolinli/DeCo)]
- [Matryoshka Query Transformer for Large Vision-Language Models](https://arxiv.org/pdf/2405.19315). [MQT-LLaVA; NeurIPS 2024; [GitHub](https://github.com/gordonhu608/MQT-LLaVA)]
- [Matryoshka Multimodal Models](https://arxiv.org/pdf/2405.17430). [Matryoshka; M3; [GitHub](https://github.com/mu-cai/matryoshka-mm)]
- [How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821). [InternVL; Pixel-Shuffle; [GitHub](https://github.com/OpenGVLab/InternVL)]
- [CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference](https://arxiv.org/pdf/2404.08567). [CATP]
- [LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388). [LLaVA-PruMerge; [GitHub](https://github.com/42Shawn/LLaVA-PruMerge)]
- [An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference](https://arxiv.org/pdf/2403.06764). [FastV; ECCV 2024; [GitHub](https://github.com/pkunlp-icler/FastV)]
- [MobileVLM V2: Faster and Stronger Baseline for Vision Language Model](https://arxiv.org/abs/2402.03766). [LDP-v2; [GitHub](https://github.com/Meituan-AutoML/MobileVLM)]
### 2023
- [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742). [C-Abstractor; CVPR 2024; [GitHub](https://github.com/khanrc/honeybee?tab=readme-ov-file)]
- [LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models](https://arxiv.org/abs/2311.17043). [LLaMA-VID; ECCV 2024; [GitHub](https://github.com/dvlab-research/LLaMA-VID/tree/main)]
- [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/pdf/2308.12966v2). [Resampler; [GitHub](https://github.com/QwenLM/Qwen-VL)]
- [CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers](https://arxiv.org/pdf/2305.17455v4). [CrossGET; ICML 2024; [GitHub](https://github.com/sdc17/CrossGET)]
- [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597). [Q-Former; [GitHub](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)]
## ViT
### 2025
- [Lossless Token Merging Even Without Fine-Tuning in Vision Transformers](https://arxiv.org/pdf/2505.15160). [ATM]
- [Prune and Merge: Efficient Token Compression For Vision Transformer With Spatial Information Preserved](https://arxiv.org/pdf/2503.23455). [Prune and Merge; [GitHub](https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge)]
### 2024
- [Token Cropr: Faster ViTs for Quite a Few Tasks](https://arxiv.org/pdf/2412.00965). [Token Cropr]
- [Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer](https://arxiv.org/pdf/2408.17062). [Vote&Mix]
- [Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning](https://arxiv.org/pdf/2408.06798). [Token Compensator; ToCom; [GitHub](https://github.com/JieShibo/ToCom)]
- [Dynamic and Compressive Adaptation of Transformers From Images to Videos](https://arxiv.org/pdf/2408.06840). [InTI]
- [LookupViT: Compressing visual information to a limited number of tokens](https://arxiv.org/pdf/2407.12753). [LookupViT; DeepMind]
- [PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation](https://arxiv.org/abs/2403.09192). [PYRA; ECCV 2024; [GitHub](https://github.com/THU-MIG/PYRA?tab=readme-ov-file)]
### 2023
- [PPT: Token Pruning and Pooling for Efficient Vision Transformers](https://arxiv.org/pdf/2310.01812). [PPT; [GitHub](https://github.com/xjwu1024/PPT)]
- [DiffRate: Differentiable Compression Rate for Efficient Vision Transformers](https://arxiv.org/abs/2305.17997). [DiffRate; ICCV 2023; [GitHub](https://github.com/OpenGVLab/DiffRate)]
- [Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers](https://arxiv.org/pdf/2304.10716). [TPS; CVPR 2023; [GitHub](https://github.com/megvii-research/TPS-CVPR2023)]
### 2022
- [Token Merging: Your ViT But Faster](https://arxiv.org/pdf/2210.09461). [ToMe; Token Merging; ICLR 2023]
- [Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention](https://arxiv.org/pdf/2209.13802). [Adaptive Sparse ViT]
- [EViT: Expediting Vision Transformers via Token Reorganizations](https://arxiv.org/pdf/2202.07800). [EViT; ICLR 2022; [GitHub](https://github.com/youweiliang/evit?tab=readme-ov-file)]
- [Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space](https://arxiv.org/pdf/2201.00814). [ViT-Slim; CVPR 2022; [GitHub](https://github.com/Arnav0400/ViT-Slim)]
### 2021
- [A-ViT: Adaptive Tokens for Efficient Vision Transformer](https://arxiv.org/pdf/2112.07658). [A-ViT]
- [ATS: Adaptive Token Sampling For Efficient Vision Transformers](https://arxiv.org/abs/2111.15667). [ATS; ECCV 2022; [GitHub](https://github.com/adaptivetokensampling/ATS)]
- [Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer](https://arxiv.org/abs/2108.01390). [Evo-ViT; AAAI 2022; [GitHub](https://github.com/YifanXu74/Evo-ViT)]
- [Patch Slimming for Efficient Vision Transformers](https://arxiv.org/abs/2106.02852). [Patch Slimming]
- [DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification](https://arxiv.org/abs/2106.02034). [DynamicViT; NeurIPS 2021; [GitHub](https://github.com/raoyongming/DynamicViT)]