# Awesome-Token-Compress

🔥🔥🔥 A paper list of recent works on token compression for ViTs and VLMs.

## VLM

### 2025

- arXiv [AdaTP: Attention-Debiased Token Pruning for Video Large Language Models](https://arxiv.org/pdf/2505.20100). [AdaTP; Video]
- arXiv [CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms](https://arxiv.org/pdf/2505.17020). [CrossLMM; Video; [GitHub](https://github.com/shilinyan99/CrossLMM)]
- arXiv [Clapper: Compact Learning and Video Representation in VLMs](https://arxiv.org/pdf/2505.15529). [Clapper; Video]
- arXiv [Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models](https://arxiv.org/pdf/2505.14454). [VidCom2; Video; [GitHub](https://github.com/xuyang-liu16/VidCom2)]
- arXiv [Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning](https://arxiv.org/pdf/2505.11945). [LLaVA-Meteor]
- arXiv [FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding](https://arxiv.org/pdf/2504.20384). [FiLA-Video; Video]
- arXiv [VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning](https://arxiv.org/pdf/2504.19627). [VCM]
- arXiv [TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos](https://arxiv.org/pdf/2504.17343). [TimeChat-Online; Video; [GitHub](https://github.com/yaolinli/TimeChat-Online)]
- arXiv [DYMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs](https://arxiv.org/pdf/2504.17040). [DYMU; [GitHub](https://github.com/MikeWangWZHL/dymu)]
- arXiv [Quicksviewer: An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes](https://arxiv.org/pdf/2504.15270). [Quicksviewer; Video; [GitHub](https://github.com/quicksviewer/quicksviewer)]
- arXiv [PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models](https://arxiv.org/pdf/2504.08966). [PACT; CVPR 2025; [GitHub](https://github.com/orailix/PACT/tree/main)]
- arXiv [QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA](https://arxiv.org/pdf/2504.00654). [QG-VTC; VQA]
- arXiv [InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression](https://arxiv.org/pdf/2503.21307). [InternVL-X; [GitHub](https://github.com/ludc506/InternVL-X)]
- arXiv [Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Model](https://arxiv.org/pdf/2503.16980). [Token Dynamics; Video]
- arXiv [HICom: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models](https://arxiv.org/pdf/2503.16036). [HICom; CVPR 2025; Video; [GitHub](https://github.com/lntzm/HICom)]
- arXiv [FastVID: Dynamic Density Pruning for Fast Video Large Language Models](https://arxiv.org/abs/2503.11187). [FastVID; [GitHub](https://github.com/LunarShen/FastVID)]
- arXiv [SAINT: Similarity-Aware Token Pruning: Your VLM but Faster](https://arxiv.org/pdf/2503.11549). [SAINT; [GitHub](https://github.com/ArmenJeddi/saint)]
- arXiv [STORM: Token-Efficient Long Video Understanding for Multimodal LLMs](https://arxiv.org/pdf/2503.04130). [STORM; Video; NVIDIA]
- OpenReview [Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification](https://openreview.net/pdf?id=hzVpZDrW73). [Dynamic-LLaVA; ICLR 2025; [GitHub](https://github.com/Osilly/dynamic_llava)]
- arXiv [DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models](https://arxiv.org/pdf/2503.02175). [DivPrune; [GitHub](https://github.com/vbdi/divprune)]
- arXiv [FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression](https://arxiv.org/pdf/2502.18512). [FCoT-VL]
- arXiv [Beyond Token Compression: A Training-Free Reduction Framework for Efficient Visual Processing in MLLMs](https://arxiv.org/pdf/2501.19036). [Beyond Token Compression; [GitHub](https://github.com/L-Hugh/Beyond-Token-Compression)]
- arXiv [DyRate: Dynamic Token Reduction during Generation for Vision Language Models](https://arxiv.org/pdf/2501.14204). [DyRate]
- arXiv [AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture](https://arxiv.org/pdf/2501.09532). [AdaFV]
- arXiv [LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token](https://arxiv.org/pdf/2501.03895). [LLaVA-Mini; [GitHub](https://github.com/ictnlp/LLaVA-Mini)]
- arXiv [FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models](https://arxiv.org/pdf/2501.01986). [FrameFusion; Video; [GitHub](https://github.com/thu-nics/FrameFusion)]
- arXiv [VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling](https://arxiv.org/pdf/2501.00574). [VideoChat-Flash; Video; [GitHub](https://github.com/OpenGVLab/VideoChat-Flash)]

### 2024

- arXiv [ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding](https://arxiv.org/pdf/2412.20504). [ReTaKe; Video; [GitHub](https://github.com/SCZwangxiao/video-ReTaKe)]
- arXiv [FastVLM: Efficient Vision Encoding for Vision Language Models](https://arxiv.org/pdf/2412.13303). [FastVLM; Apple]
- arXiv [PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models](https://arxiv.org/pdf/2412.09613). [PVC; Video; [GitHub](https://github.com/OpenGVLab/PVC)]
- arXiv [Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM](https://arxiv.org/pdf/2412.09530). [Dynamic-VLM; Video]
- arXiv [VisionZip: Longer is Better but Not Necessary in Vision Language Models](https://arxiv.org/pdf/2412.04467). [VisionZip; Video; [GitHub](https://github.com/dvlab-research/VisionZip)]
- arXiv [p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay](https://arxiv.org/pdf/2412.04449). [p-MoD; [GitHub](https://github.com/MCG-NJU/p-MoD)]
- arXiv [[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster](https://arxiv.org/pdf/2412.01818). [FasterVLM; [GitHub](https://github.com/Theia-4869/FasterVLM)]
- arXiv [ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models](https://arxiv.org/pdf/2412.00447). [ATP-LLaVA]
- OpenReview [LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token](https://openreview.net/pdf?id=UQJ7CDW8nb). [LLaVA-Mini]
- arXiv [Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration](https://arxiv.org/pdf/2411.17686). [FiCoCo]
- OpenReview [LVP: Language-guided Visual Projector for Efficient Multimodal LLM](https://openreview.net/pdf?id=PxBzxO02Ef). [LVP]
- OpenReview [Efficient Multi-modal Large Language Models via Visual Token Grouping](https://openreview.net/pdf?id=ym1dS37mZE). [VisToG]
- arXiv [DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models](https://arxiv.org/pdf/2411.15024). [DyCoke; Video; [GitHub](https://github.com/KD-TAO/DyCoke)]
- arXiv [FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression](https://arxiv.org/pdf/2411.14228). [FocusLLaVA]
- arXiv [MustDrop: Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model](https://arxiv.org/pdf/2411.10803). [MustDrop; [GitHub](https://github.com/liuting20/MustDrop)]
- arXiv [Don't Look Twice: Faster Video Transformers with Run-Length Tokenization](https://arxiv.org/pdf/2411.05222). [RLT; Video; NeurIPS 2024; [GitHub](https://rccchoudhury.github.io/projects/rlt/)]
- arXiv [Inference Optimal VLMs Need Only One Visual Token but Larger Models](https://arxiv.org/pdf/2411.03312). [QueCC; [GitHub](https://github.com/locuslab/llava-token-compression)]
- arXiv [Video Token Merging for Long-form Video Understanding](https://arxiv.org/pdf/2410.23782). [Learnable VTM; Video]
- arXiv [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/pdf/2410.17434). [LongVU; Video; [GitHub](https://github.com/Vision-CAIR/LongVU)]
- arXiv [PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction](https://arxiv.org/pdf/2410.17247). [PyramidDrop; [GitHub](https://github.com/Cooperx521/PyramidDrop)]
- arXiv [Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers](https://arxiv.org/pdf/2410.14072). [Victor]
- arXiv [VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models](https://arxiv.org/pdf/2410.11417). [VidCompress]
- arXiv [Retrieval Replace Reduction: An effective visual token reduction method via semantic match](https://arxiv.org/pdf/2410.07278). [TRSM]
- arXiv [AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity](https://arxiv.org/pdf/2410.02745). [AVG-LLaVA; [GitHub](https://github.com/DeepLearnXMU/AVG-LLaVA)]
- arXiv [Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs](https://arxiv.org/pdf/2409.10994). [TRIM]
- arXiv [TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration](https://arxiv.org/pdf/2409.03206). [TC-LLaVA; Video]
- arXiv [TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings](https://arxiv.org/pdf/2409.09564). [TG-LLaVA]
- arXiv [mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding](https://arxiv.org/abs/2409.03420). [mPLUG-DocOwl2; [GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)]
- arXiv [TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval](https://arxiv.org/pdf/2409.01156). [TempMe; Video; ICLR 2025; [GitHub](https://github.com/LunarShen/TempMe)]
- arXiv [Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information](https://arxiv.org/pdf/2409.01179). [Recoverable Compression]
- arXiv [HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments](https://arxiv.org/pdf/2408.10945). [HiRED; [GitHub](https://github.com/hasanar1f/HiRED)]
- arXiv [mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models](https://arxiv.org/abs/2408.04840). [mPLUG-Owl3; [GitHub](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl3)]
- arXiv [Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding](https://arxiv.org/pdf/2407.14439). [Token-level; [GitHub](https://github.com/JiuTian-VL/TokenCorrCompressor)]
- arXiv [HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models](https://arxiv.org/pdf/2407.08706). [HiRes-LLaVA]
- arXiv [TokenPacker: Efficient Visual Projector for Multimodal LLM](https://arxiv.org/abs/2407.02392). [TokenPacker; [GitHub](https://github.com/CircleRadon/TokenPacker)]
- arXiv [VoCo-LLaMA: Towards Vision Compression with Large Language Models](https://arxiv.org/pdf/2406.12275). [VoCo-LLaMA; [GitHub](https://github.com/Yxxxb/VoCo-LLaMA)]
- arXiv [DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models](https://arxiv.org/pdf/2405.20985). [DeCo; [GitHub](https://github.com/yaolinli/DeCo)]
- arXiv [Matryoshka Query Transformer for Large Vision-Language Models](https://arxiv.org/pdf/2405.19315). [MQT-LLaVA; NeurIPS 2024; [GitHub](https://github.com/gordonhu608/MQT-LLaVA)]
- arXiv [Matryoshka Multimodal Models](https://arxiv.org/pdf/2405.17430). [Matryoshka; M3; [GitHub](https://github.com/mu-cai/matryoshka-mm)]
- arXiv [How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821). [InternVL; Pixel-Shuffle; [GitHub](https://github.com/OpenGVLab/InternVL)]
- arXiv [CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference](https://arxiv.org/pdf/2404.08567). [CATP]
- arXiv [LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388). [LLaVA-PruMerge; [GitHub](https://github.com/42Shawn/LLaVA-PruMerge)]
- arXiv [An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference](https://arxiv.org/pdf/2403.06764). [FastV; ECCV 2024; [GitHub](https://github.com/pkunlp-icler/FastV)]
- arXiv [MobileVLM V2: Faster and Stronger Baseline for Vision Language Model](https://arxiv.org/abs/2402.03766). [LDP-v2; [GitHub](https://github.com/Meituan-AutoML/MobileVLM)]

### 2023

- arXiv [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742). [C-Abstractor; CVPR 2024; [GitHub](https://github.com/khanrc/honeybee?tab=readme-ov-file)]
- arXiv [LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models](https://arxiv.org/abs/2311.17043). [LLaMA-VID; ECCV 2024; [GitHub](https://github.com/dvlab-research/LLaMA-VID/tree/main)]
- arXiv [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/pdf/2308.12966v2). [Resampler; [GitHub](https://github.com/QwenLM/Qwen-VL)]
- arXiv [CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers](https://arxiv.org/pdf/2305.17455v4). [CrossGET; ICML 2024; [GitHub](https://github.com/sdc17/CrossGET)]
- arXiv [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597). [Q-Former; [GitHub](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)]

## ViT

### 2025

- arXiv [Lossless Token Merging Even Without Fine-Tuning in Vision Transformers](https://arxiv.org/pdf/2505.15160). [ATM]
- arXiv [Prune and Merge: Efficient Token Compression for Vision Transformer with Spatial Information Preserved](https://arxiv.org/pdf/2503.23455). [Prune and Merge; [GitHub](https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge)]

### 2024

- arXiv [Token Cropr: Faster ViTs for Quite a Few Tasks](https://arxiv.org/pdf/2412.00965). [Token Cropr]
- arXiv [Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer](https://arxiv.org/pdf/2408.17062). [Vote&Mix]
- arXiv [Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning](https://arxiv.org/pdf/2408.06798). [Token Compensator; ToCom; [GitHub](https://github.com/JieShibo/ToCom)]
- arXiv [Dynamic and Compressive Adaptation of Transformers From Images to Videos](https://arxiv.org/pdf/2408.06840). [InTI]
- arXiv [LookupViT: Compressing visual information to a limited number of tokens](https://arxiv.org/pdf/2407.12753). [LookupViT; DeepMind]
- arXiv [PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation](https://arxiv.org/abs/2403.09192). [PYRA; ECCV 2024; [GitHub](https://github.com/THU-MIG/PYRA?tab=readme-ov-file)]

### 2023

- arXiv [PPT: Token Pruning and Pooling for Efficient Vision Transformers](https://arxiv.org/pdf/2310.01812). [PPT; [GitHub](https://github.com/xjwu1024/PPT)]
- arXiv [DiffRate: Differentiable Compression Rate for Efficient Vision Transformers](https://arxiv.org/abs/2305.17997). [DiffRate; ICCV 2023; [GitHub](https://github.com/OpenGVLab/DiffRate)]
- arXiv [Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers](https://arxiv.org/pdf/2304.10716). [TPS; CVPR 2023; [GitHub](https://github.com/megvii-research/TPS-CVPR2023)]

### 2022

- arXiv [Token Merging: Your ViT But Faster](https://arxiv.org/pdf/2210.09461). [ToMe; Token Merging; ICLR 2023]
- arXiv [Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention](https://arxiv.org/pdf/2209.13802). [Adaptive Sparse ViT]
- arXiv [EViT: Expediting Vision Transformers via Token Reorganizations](https://arxiv.org/pdf/2202.07800). [EViT; ICLR 2022; [GitHub](https://github.com/youweiliang/evit?tab=readme-ov-file)]
- arXiv [Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space](https://arxiv.org/pdf/2201.00814). [ViT-Slim; CVPR 2022; [GitHub](https://github.com/Arnav0400/ViT-Slim)]

### 2021

- arXiv [A-ViT: Adaptive Tokens for Efficient Vision Transformer](https://arxiv.org/pdf/2112.07658). [A-ViT]
- arXiv [ATS: Adaptive Token Sampling For Efficient Vision Transformers](https://arxiv.org/abs/2111.15667). [ATS; ECCV 2022; [GitHub](https://github.com/adaptivetokensampling/ATS)]
- arXiv [Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer](https://arxiv.org/abs/2108.01390). [Evo-ViT; AAAI 2022; [GitHub](https://github.com/YifanXu74/Evo-ViT)]
- arXiv [Patch Slimming for Efficient Vision Transformers](https://arxiv.org/abs/2106.02852). [Patch Slimming]
- arXiv [DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification](https://arxiv.org/abs/2106.02034). [DynamicViT; NeurIPS 2021; [GitHub](https://github.com/raoyongming/DynamicViT)]