├── .gitignore
├── images
    └── motivation.png
├── LICENSE
├── audio-transformer.md
├── vision-transformer.md
├── audio-llm.md
├── image-llm.md
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | .vscode
3 | *.py
4 | *.csv
5 | 
6 | *template*


--------------------------------------------------------------------------------
/images/motivation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cokeshao/Awesome-Multimodal-Token-Compression/HEAD/images/motivation.png


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2025 cokeshao
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/audio-transformer.md:
--------------------------------------------------------------------------------
 1 | 
 2 | <details open>
 3 | <summary><strong>2025 AST</strong></summary>
 4 | 
 5 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
 6 | | --- | --- | --- | :---: | 
 7 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/yangdongchao/ALMTokenizer.svg?style=social&label=Star)](https://github.com/yangdongchao/ALMTokenizer)<br>[ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling](https://arxiv.org/abs/2504.10344)<br>Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.10344)<br> [GitHub](https://github.com/yangdongchao/ALMTokenizer)<br> | 
 8 | |  [![Publish](https://img.shields.io/badge/ECAI-2025-blue)]() [![Star](https://img.shields.io/github/stars/andylee-24/token-pruning-audio-transformer.svg?style=social&label=Star)](https://github.com/andylee-24/token-pruning-audio-transformer)<br>[Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance](https://arxiv.org/abs/2504.01690)<br>Taehan Lee, Hyukjun Lee |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.01690)<br> [GitHub](https://github.com/andylee-24/token-pruning-audio-transformer)<br> [Model](https://drive.google.com/drive/folders/1cBDXh98m2qDlYLLX3q6xB-gtU1uUtxhK)<br> | 
 9 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.01-red)]() [![Star](https://img.shields.io/github/stars/VITA-MLLM/LUCY.svg?style=social&label=Star)](https://github.com/VITA-MLLM/LUCY)<br>[LUCY: Linguistic Understanding and Control Yielding Early Stage of Her](https://arxiv.org/abs/2501.16327)<br>Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2501.16327)<br> [GitHub](https://github.com/VITA-MLLM/LUCY)<br> [Model](https://huggingface.co/VITA-MLLM)<br> | 
10 | </details>
11 | 
12 | <details open>
13 | <summary><strong>2024 AST</strong></summary>
14 | 
15 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
16 | | --- | --- | --- | :---: | 
17 | |  [![Publish](https://img.shields.io/badge/Interspeech-2024-blue)]() [![Star](https://img.shields.io/github/stars/swarupbehera/FastAST.svg?style=social&label=Star)](https://github.com/swarupbehera/FastAST)<br>[FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation](https://arxiv.org/abs/2406.07676)<br>Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.07676)<br> [GitHub](https://github.com/swarupbehera/FastAST)<br> | 
18 | </details>
19 | 
20 | <details open>
21 | <summary><strong>2023 AST</strong></summary>
22 | 
23 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
24 | | --- | --- | --- | :---: | 
25 | |  [![Publish](https://img.shields.io/badge/Interspeech-2023-blue)]() <br>[Accelerating Transducers through Adjacent Token Merging](https://arxiv.org/abs/2306.16009)<br>Yuang Li, Yu Wu, Jinyu Li, Shujie Liu |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2306.16009)<br> | 
26 | </details>
27 | 
28 | <details open>
29 | <summary><strong>2022 AST</strong></summary>
30 | 
31 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
32 | | --- | --- | --- | :---: | 
33 | |  [![Publish](https://img.shields.io/badge/ICASSP-2022-blue)]() [![Star](https://img.shields.io/github/stars/RetroCirce/HTS-Audio-Transformer.svg?style=social&label=Star)](https://github.com/RetroCirce/HTS-Audio-Transformer)<br>[HTS-AT: A Hierarchical Token-Semantic Audio-Transformer for Sound Classification and Detection](https://arxiv.org/abs/2202.00874)<br>Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2202.00874)<br> [GitHub](https://github.com/RetroCirce/HTS-Audio-Transformer)<br> | 
34 | </details>
35 | 


--------------------------------------------------------------------------------
/vision-transformer.md:
--------------------------------------------------------------------------------
 1 | 
 2 | <details open>
 3 | <summary><strong>2025 ViT</strong></summary>
 4 | 
 5 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
 6 | | --- | --- | --- | :---: | 
 7 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation](https://arxiv.org/abs/2508.04058)<br>Zunhui Xia, Hongxing Li, Libin Lan |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.04058)<br> | 
 8 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation](https://arxiv.org/abs/2508.03388)<br>Yizhe Xiong, Zihan Zhou, Yiwen Liang, Hui Chen, Zijia Lin, Tianxiang Hao, Fan Zhang, Jungong Han, Guiguang Ding |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.03388)<br> | 
 9 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/mlvlab/Representation-Shift.svg?style=social&label=Star)](https://github.com/mlvlab/Representation-Shift)<br>[Representation Shift: Unifying Token Compression with FlashAttention](https://arxiv.org/abs/2508.00367)<br>Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.00367)<br> [GitHub](https://github.com/mlvlab/Representation-Shift)<br> | 
10 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference](https://arxiv.org/abs/2507.16260)<br>Haoyue Zhang, Jie Zhang, Song Guo |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.16260)<br> | 
11 | </details>
12 | 
13 | <details open>
14 | <summary><strong>2024 ViT</strong></summary>
15 | 
16 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
17 | | --- | --- | --- | :---: | 
18 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.05-red)]() [![Star](https://img.shields.io/github/stars/yaolinli/DeCo.svg?style=social&label=Star)](https://github.com/yaolinli/DeCo)<br>[DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models](https://arxiv.org/abs/2405.20985)<br>Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2405.20985)<br> [GitHub](https://github.com/yaolinli/DeCo)<br> | 
19 | |  [![Publish](https://img.shields.io/badge/CVPR-2024-blue)]() [![Star](https://img.shields.io/github/stars/double125/MADTP-plus.svg?style=social&label=Star)](https://github.com/double125/MADTP-plus)<br>[MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer](https://arxiv.org/abs/2403.02991)<br>Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2403.02991)<br> [GitHub](https://github.com/double125/MADTP-plus)<br> | 
20 | </details>
21 | 
22 | <details open>
23 | <summary><strong>2023 ViT</strong></summary>
24 | 
25 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
26 | | --- | --- | --- | :---: | 
27 | |  [![Publish](https://img.shields.io/badge/ACL-2023-blue)]() [![Star](https://img.shields.io/github/stars/csarron/PuMer.svg?style=social&label=Star)](https://github.com/csarron/PuMer)<br>[PuMer: Pruning and Merging Tokens for Efficient Vision Language Models](https://arxiv.org/abs/2305.17530)<br>Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2305.17530)<br> [GitHub](https://github.com/csarron/PuMer)<br> | 
28 | </details>
29 | 
30 | <details open>
31 | <summary><strong>2022 ViT</strong></summary>
32 | 
33 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
34 | | --- | --- | --- | :---: | 
35 | |  [![Publish](https://img.shields.io/badge/ICLR_Oral-2023-blue)]() [![Star](https://img.shields.io/github/stars/facebookresearch/ToMe.svg?style=social&label=Star)](https://github.com/facebookresearch/ToMe)<br>[Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461)<br>Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2210.09461)<br> [GitHub](https://github.com/facebookresearch/ToMe)<br> | 
36 | </details>
37 | 
38 | <details open>
39 | <summary><strong>2021 ViT</strong></summary>
40 | 
41 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
42 | | --- | --- | --- | :---: | 
43 | |  [![Publish](https://img.shields.io/badge/ICML-2021-blue)]() <br>[Perceiver: General Perception with Iterative Attention](https://arxiv.org/abs/2103.03206)<br>Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2103.03206)<br> | 
44 | </details>
45 | 


--------------------------------------------------------------------------------
/audio-llm.md:
--------------------------------------------------------------------------------
 1 | 
 2 | <details open>
 3 | <summary><strong>2025 Audio</strong></summary>
 4 | 
 5 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
 6 | | --- | --- | --- | :---: | 
 7 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.12-red)]() <br>[EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs](https://arxiv.org/abs/2512.10324)<br>Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2512.10324)<br> | 
 8 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/cokeshao/Awesome-Multimodal-Token-Compression.svg?style=social&label=Star)](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br>[When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)<br>Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() [![Area](https://img.shields.io/badge/Survey-purple)]() |  |  [Paper](https://arxiv.org/abs/2507.20198)<br> [GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br> | 
 9 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.05-red)]() [![Star](https://img.shields.io/github/stars/ZLKong/Awesome-Collection-Token-Reduction.svg?style=social&label=Star)](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)<br>[Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality](https://arxiv.org/abs/2505.18227)<br>Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() [![Area](https://img.shields.io/badge/Position--Paper-purple)]() |  |  [Paper](https://arxiv.org/abs/2505.18227)<br> [GitHub](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)<br> | 
10 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/yangdongchao/ALMTokenizer.svg?style=social&label=Star)](https://github.com/yangdongchao/ALMTokenizer)<br>[ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling](https://arxiv.org/abs/2504.10344)<br>Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.10344)<br> [GitHub](https://github.com/yangdongchao/ALMTokenizer)<br> | 
11 | |  [![Publish](https://img.shields.io/badge/ECAI-2025-blue)]() [![Star](https://img.shields.io/github/stars/andylee-24/token-pruning-audio-transformer.svg?style=social&label=Star)](https://github.com/andylee-24/token-pruning-audio-transformer)<br>[Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance](https://arxiv.org/abs/2504.01690)<br>Taehan Lee, Hyukjun Lee |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.01690)<br> [GitHub](https://github.com/andylee-24/token-pruning-audio-transformer)<br> [Model](https://drive.google.com/drive/folders/1cBDXh98m2qDlYLLX3q6xB-gtU1uUtxhK)<br> | 
12 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.03-red)]() [![Star](https://img.shields.io/github/stars/QwenLM/Qwen2.5-Omni.svg?style=social&label=Star)](https://github.com/QwenLM/Qwen2.5-Omni)<br>[Qwen2.5-Omni Technical Report](https://arxiv.org/abs/2503.20215)<br>Qwen Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.20215)<br> [GitHub](https://github.com/QwenLM/Qwen2.5-Omni)<br> [Model](https://huggingface.co/collections/Qwen/qwen25-omni-67de1e5f0f9464dc6314b36e)<br> | 
13 | |  [![Publish](https://img.shields.io/badge/ACL_Findings-2025-blue)]() [![Star](https://img.shields.io/github/stars/JeongHun0716/MMS-LLaMA.svg?style=social&label=Star)](https://github.com/JeongHun0716/MMS-LLaMA)<br>[MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens](https://arxiv.org/abs/2503.11315)<br>Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.11315)<br> [GitHub](https://github.com/JeongHun0716/MMS-LLaMA)<br> | 
14 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.03-red)]() <br>[Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs](https://arxiv.org/abs/2503.06362)<br>Umberto Cappellazzo, Minsu Kim, Stavros Petridis |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.06362)<br> | 
15 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.02-red)]() [![Star](https://img.shields.io/github/stars/baichuan-inc/Baichuan-Audio.svg?style=social&label=Star)](https://github.com/baichuan-inc/Baichuan-Audio)<br>[Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction](https://arxiv.org/abs/2502.17239)<br>Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2502.17239)<br> [GitHub](https://github.com/baichuan-inc/Baichuan-Audio)<br> [Model](https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct)<br> | 
16 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.01-red)]() [![Star](https://img.shields.io/github/stars/VITA-MLLM/LUCY.svg?style=social&label=Star)](https://github.com/VITA-MLLM/LUCY)<br>[LUCY: Linguistic Understanding and Control Yielding Early Stage of Her](https://arxiv.org/abs/2501.16327)<br>Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2501.16327)<br> [GitHub](https://github.com/VITA-MLLM/LUCY)<br> [Model](https://huggingface.co/VITA-MLLM)<br> | 
17 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.01-red)]() [![Star](https://img.shields.io/github/stars/ASLP-lab/OSUM.svg?style=social&label=Star)](https://github.com/ASLP-lab/OSUM)<br>[OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia](https://arxiv.org/abs/2501.13306)<br>ASLP@NPU |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2501.13306)<br> [GitHub](https://github.com/ASLP-lab/OSUM)<br> [Model](https://huggingface.co/ASLP-lab/OSUM)<br> | 
18 | </details>
19 | 
20 | <details open>
21 | <summary><strong>2024 Audio</strong></summary>
22 | 
23 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
24 | | --- | --- | --- | :---: | 
25 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.12-red)]() [![Star](https://img.shields.io/github/stars/scb-10x/typhoon2-audio.svg?style=social&label=Star)](https://github.com/scb-10x/typhoon2-audio)<br>[Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models](https://arxiv.org/abs/2412.13702)<br>Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2412.13702)<br> [GitHub](https://github.com/scb-10x/typhoon2-audio)<br> [Model](https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct)<br> | 
26 | |  [![Publish](https://img.shields.io/badge/ICME-2025-blue)]() <br>[SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval](https://arxiv.org/abs/2412.12009)<br>Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.12009)<br> | 
27 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/Lyra.svg?style=social&label=Star)](https://github.com/dvlab-research/Lyra)<br>[Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition](https://arxiv.org/abs/2412.09501)<br>Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2412.09501)<br> [GitHub](https://github.com/dvlab-research/Lyra)<br> [Model](https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc)<br> [Dataset](https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82)<br> | 
28 | |  [![Publish](https://img.shields.io/badge/ICASSP-2025-blue)]() [![Star](https://img.shields.io/github/stars/umbertocappellazzo/Llama-AVSR.svg?style=social&label=Star)](https://github.com/umbertocappellazzo/Llama-AVSR)<br>[Large Language Models are Strong Audio-Visual Speech Recognition Learners](https://arxiv.org/abs/2409.12319)<br>Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2409.12319)<br> [GitHub](https://github.com/umbertocappellazzo/Llama-AVSR)<br> | 
29 | |  [![Publish](https://img.shields.io/badge/ICLR-2025-blue)]() [![Star](https://img.shields.io/github/stars/ictnlp/LLaMA-Omni.svg?style=social&label=Star)](https://github.com/ictnlp/LLaMA-Omni)<br>[LLaMA-Omni: Seamless Speech Interaction with Large Language Models](https://arxiv.org/abs/2409.06666)<br>Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2409.06666)<br> [GitHub](https://github.com/ictnlp/LLaMA-Omni)<br> [Model](https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni)<br> [Dataset](https://huggingface.co/datasets/ICTNLP/InstructS2S-200K)<br> | 
30 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.07-red)]() [![Star](https://img.shields.io/github/stars/QwenLM/Qwen2-Audio.svg?style=social&label=Star)](https://github.com/QwenLM/Qwen2-Audio)<br>[Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759)<br>Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2407.10759)<br> [GitHub](https://github.com/QwenLM/Qwen2-Audio)<br> [Model](https://huggingface.co/collections/Qwen/qwen2-audio-66b628d694096020e0c52ff6)<br> | 
31 | |  [![Publish](https://img.shields.io/badge/ICML-2024-blue)]() [![Star](https://img.shields.io/github/stars/bytedance/SALMONN.svg?style=social&label=Star)](https://github.com/bytedance/SALMONN/tree/videosalmonn)<br>[video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models](https://arxiv.org/abs/2406.15704)<br>Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.15704)<br> [GitHub](https://github.com/bytedance/SALMONN/tree/videosalmonn)<br> [Model](https://huggingface.co/tsinghua-ee/Video-SALMONN/tree/main)<br> | 
32 | |  [![Publish](https://img.shields.io/badge/Interspeech-2024-blue)]() <br>[Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding](https://arxiv.org/abs/2406.13275)<br>Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.13275)<br> | 
33 | |  [![Publish](https://img.shields.io/badge/Interspeech-2024-blue)]() [![Star](https://img.shields.io/github/stars/swarupbehera/FastAST.svg?style=social&label=Star)](https://github.com/swarupbehera/FastAST)<br>[FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation](https://arxiv.org/abs/2406.07676)<br>Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.07676)<br> [GitHub](https://github.com/swarupbehera/FastAST)<br> | 
34 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.06-red)]() [![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA2.svg?style=social&label=Star)](https://github.com/DAMO-NLP-SG/VideoLLaMA2)<br>[VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs](https://arxiv.org/abs/2406.07476)<br>Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.07476)<br> [GitHub](https://github.com/DAMO-NLP-SG/VideoLLaMA2)<br> [Model](https://huggingface.co/collections/DAMO-NLP-SG/videollama2-6669b6b6f0493188305c87ed)<br> | 
35 | |  [![Publish](https://img.shields.io/badge/Interspeech-2024-blue)]() <br>[Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model](https://arxiv.org/abs/2406.03706)<br>Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.03706)<br> | 
36 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.05-red)]() <br>[SpeechVerse: A Large-scale Generalizable Audio Language Model](https://arxiv.org/abs/2405.08295)<br>AWS AI Team |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2405.08295)<br> | 
37 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.02-red)]() <br>[An Embarrassingly Simple Approach for LLM with Strong ASR Capacity](https://arxiv.org/abs/2402.08846)<br>Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2402.08846)<br> | 
38 | </details>
39 | 
40 | <details open>
41 | <summary><strong>2023 Audio</strong></summary>
42 | 
43 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
44 | | --- | --- | --- | :---: | 
45 | |  [![Publish](https://img.shields.io/badge/ICLR-2024-blue)]() [![Star](https://img.shields.io/github/stars/Render-AI/salmonn.svg?style=social&label=Star)](https://github.com/Render-AI/salmonn)<br>[SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/abs/2310.13289)<br>Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2310.13289)<br> [GitHub](https://github.com/Render-AI/salmonn)<br> [Model](https://huggingface.co/tsinghua-ee/SALMONN)<br> | 
46 | |  [![Publish](https://img.shields.io/badge/ICASSP-2024-blue)]() <br>[Connecting Speech Encoder and Large Language Model for ASR](https://arxiv.org/abs/2309.13963)<br>Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2309.13963)<br> | 
47 | |  [![Publish](https://img.shields.io/badge/ICASSP-2024-blue)]() <br>[Prompting Large Language Models with Speech Recognition Abilities](https://arxiv.org/abs/2307.11795)<br>Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2307.11795)<br> | 
48 | |  [![Publish](https://img.shields.io/badge/Interspeech-2023-blue)]() <br>[Accelerating Transducers through Adjacent Token Merging](https://arxiv.org/abs/2306.16009)<br>Yuang Li, Yu Wu, Jinyu Li, Shujie Liu |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2306.16009)<br> | 
49 | |  [![Publish](https://img.shields.io/badge/EMNLP-2023-blue)]() [![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social&label=Star)](https://github.com/DAMO-NLP-SG/Video-LLaMA)<br>[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)<br>Hang Zhang, Xin Li, Lidong Bing |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2306.02858)<br> [GitHub](https://github.com/DAMO-NLP-SG/Video-LLaMA)<br> [Model](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series)<br> | 
50 | </details>
51 | 
52 | <details open>
53 | <summary><strong>2022 Audio</strong></summary>
54 | 
55 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
56 | | --- | --- | --- | :---: | 
57 | |  [![Publish](https://img.shields.io/badge/ICASSP-2022-blue)]() [![Star](https://img.shields.io/github/stars/RetroCirce/HTS-Audio-Transformer.svg?style=social&label=Star)](https://github.com/RetroCirce/HTS-Audio-Transformer)<br>[HTS-AT: A Hierarchical Token-Semantic Audio-Transformer for Sound Classification and Detection](https://arxiv.org/abs/2202.00874)<br>Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2202.00874)<br> [GitHub](https://github.com/RetroCirce/HTS-Audio-Transformer)<br> | 
58 | </details>
59 | 


--------------------------------------------------------------------------------
/image-llm.md:
--------------------------------------------------------------------------------
  1 | 
  2 | <details open>
  3 | <summary><strong>2025 Image</strong></summary>
  4 | 
  5 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
  6 | | --- | --- | --- | :---: | 
  7 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.10-red)]() [![Star](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-OCR.svg?style=social&label=Star)](https://github.com/deepseek-ai/DeepSeek-OCR)<br>[DeepSeek-OCR: Contexts Optical Compression](https://arxiv.org/abs/2510.18234)<br>Haoran Wei, Yaofeng Sun, Yukun Li |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2510.18234)<br> [GitHub](https://github.com/deepseek-ai/DeepSeek-OCR)<br> [Model](https://huggingface.co/deepseek-ai/DeepSeek-OCR)<br> | 
  8 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.10-red)]() [![Star](https://img.shields.io/github/stars/JulietChoo/VisionSelector.svg?style=social&label=Star)](https://github.com/JulietChoo/VisionSelector)<br>[VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs](https://arxiv.org/abs/2510.16598)<br>Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2510.16598)<br> [GitHub](https://github.com/JulietChoo/VisionSelector)<br> [Model](https://huggingface.co/JulietChoo/VisionSelector-Qwen2.5-VL-7B)<br> | 
  9 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.10-red)]() [![Star](https://img.shields.io/github/stars/Chenfei-Liao/VTC-Bench.svg?style=social&label=Star)](https://github.com/Chenfei-Liao/VTC-Bench)<br>[Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods](https://arxiv.org/abs/2510.07143)<br>Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  |  [Paper](https://arxiv.org/abs/2510.07143)<br> [GitHub](https://github.com/Chenfei-Liao/VTC-Bench)<br> | 
 10 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models](https://arxiv.org/abs/2509.24837)<br>Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.24837)<br> | 
 11 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/AutoLab-SAI-SJTU/AutoPrune.svg?style=social&label=Star)](https://github.com/AutoLab-SAI-SJTU/AutoPrune)<br>[AutoPrune: Each Complexity Deserves a Pruning Policy](https://arxiv.org/abs/2509.23931)<br>Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.23931)<br> [GitHub](https://github.com/AutoLab-SAI-SJTU/AutoPrune)<br> | 
 12 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score](https://arxiv.org/abs/2509.23663)<br>Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.23663)<br> | 
 13 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance](https://arxiv.org/abs/2509.15704)<br>Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.15704)<br> | 
 14 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression](https://arxiv.org/abs/2509.12159)<br>Jingyu Xiao, Zhongyi Zhang, Yuxuan Wan, Yintong Huo, Yang Liu, Michael R.Lyu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/GUI--Agent-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.12159)<br> | 
 15 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge](https://arxiv.org/abs/2509.09955)<br>Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.09955)<br> | 
 16 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() [![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social&label=Star)](https://github.com/OpenGVLab/InternVL)<br>[InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://arxiv.org/abs/2508.18265)<br>InternVL Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.18265)<br> [GitHub](https://github.com/OpenGVLab/InternVL)<br> [Model](https://huggingface.co/collections/OpenGVLab/internvl35-68ac87bd52ebe953485927fb)<br> | 
 17 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() <br>[VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference](https://arxiv.org/abs/2508.17857)<br>Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.17857)<br> | 
 18 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[Revisiting MLLM Token Technology through the Lens of Classical Visual Coding](https://arxiv.org/abs/2508.13460)<br>Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Visual--Coding-purple)]() |  |  [Paper](https://arxiv.org/abs/2508.13460)<br> | 
 19 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models](https://arxiv.org/abs/2508.11886)<br>Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.11886)<br> | 
 20 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning](https://arxiv.org/abs/2508.07871)<br>Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.07871)<br> | 
 21 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance](https://arxiv.org/abs/2508.06084)<br>Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.06084)<br> | 
 22 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models](https://arxiv.org/abs/2508.06038)<br>Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.06038)<br> | 
 23 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/sihany077/VFlowOpt.svg?style=social&label=Star)](https://github.com/sihany077/VFlowOpt)<br>[VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization](https://arxiv.org/abs/2508.05211)<br>Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.05211)<br> [GitHub](https://github.com/sihany077/VFlowOpt)<br> | 
 24 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() [![Star](https://img.shields.io/github/stars/HVision-NKU/GlimpsePrune.svg?style=social&label=Star)](https://github.com/HVision-NKU/GlimpsePrune)<br>[A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models](https://arxiv.org/abs/2508.01548)<br>Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.01548)<br> [GitHub](https://github.com/HVision-NKU/GlimpsePrune)<br> [Model](https://huggingface.co/collections/ashun989/glimpseprune-688d8826ef5bd09db6af145e)<br> | 
 25 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() <br>[Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models](https://arxiv.org/abs/2508.01236)<br>Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.01236)<br> | 
 26 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models](https://arxiv.org/abs/2508.00553)<br>Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.00553)<br> | 
 27 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning](https://arxiv.org/abs/2507.23318)<br>Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/VLA-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.23318)<br> | 
 28 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/YuchenLiu98/METEOR.svg?style=social&label=Star)](https://github.com/YuchenLiu98/METEOR)<br>[METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models](https://arxiv.org/abs/2507.20842)<br>Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.20842)<br> [GitHub](https://github.com/YuchenLiu98/METEOR)<br> | 
 29 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/liaolea/TransPrune.svg?style=social&label=Star)](https://github.com/liaolea/TransPrune)<br>[TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model](https://arxiv.org/abs/2507.20630)<br>Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.20630)<br> [GitHub](https://github.com/liaolea/TransPrune)<br> | 
 30 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/cokeshao/Awesome-Multimodal-Token-Compression.svg?style=social&label=Star)](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br>[When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)<br>Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() [![Area](https://img.shields.io/badge/Survey-purple)]() |  |  [Paper](https://arxiv.org/abs/2507.20198)<br> [GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br> | 
 31 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[Efficient Whole Slide Pathology VQA via Token Compression](https://arxiv.org/abs/2507.14497)<br>Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.14497)<br> | 
 32 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[Training-free Token Reduction for Vision Mamba](https://arxiv.org/abs/2507.14042)<br>Qiankun Ma, Ziyao Zhang, Chi Su, Jie Chen, Zhen Song, Hairong Zheng, Wen Gao |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.14042)<br> | 
 33 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/VisionThink.svg?style=social&label=Star)](https://github.com/dvlab-research/VisionThink)<br>[VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning](https://arxiv.org/abs/2507.13348)<br>Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.13348)<br> [GitHub](https://github.com/dvlab-research/VisionThink)<br> [Model](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)<br> [Dataset](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)<br> | 
 34 | |  [![Publish](https://img.shields.io/badge/EMNLP_Findings-2024-blue)]() <br>[LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models](https://arxiv.org/abs/2507.02279)<br>Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.02279)<br> | 
 35 | |  [![Publish](https://img.shields.io/badge/IROS-2025-blue)]() <br>[ToSA: Token Merging with Spatial Awareness](https://arxiv.org/abs/2506.20066)<br>Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.20066)<br> | 
 36 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/Theia-4869/CDPruner.svg?style=social&label=Star)](https://github.com/Theia-4869/CDPruner)<br>[Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/abs/2506.10967)<br>Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.10967)<br> [GitHub](https://github.com/Theia-4869/CDPruner)<br> | 
 37 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.06-red)]() <br>[Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective](https://arxiv.org/abs/2506.01097)<br>Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2506.01097)<br> | 
 38 | |  [![Publish](https://img.shields.io/badge/ACL-2025-blue)]() [![Star](https://img.shields.io/github/stars/EffiVLM-Bench/EffiVLM-Bench.svg?style=social&label=Star)](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br>[EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Languge Models](https://arxiv.org/abs/2506.00479)<br>Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  |  [Paper](https://arxiv.org/abs/2506.00479)<br> [GitHub](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br> | 
 39 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.05-red)]() [![Star](https://img.shields.io/github/stars/Tencent/SelfEvolvingAgent.svg?style=social&label=Star)](https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan)<br>[VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models](https://arxiv.org/abs/2505.22654)<br>Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.22654)<br> [GitHub](https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan)<br> | 
 40 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() <br>[Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization](https://arxiv.org/abs/2505.22038)<br>Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.22038)<br> | 
 41 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/wangqinsi1/2025-ICML-CoreMatching.svg?style=social&label=Star)](https://github.com/wangqinsi1/2025-ICML-CoreMatching)<br>[CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models](https://arxiv.org/abs/2505.19235)<br>Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.19235)<br> [GitHub](https://github.com/wangqinsi1/2025-ICML-CoreMatching)<br> | 
 42 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.05-red)]() [![Star](https://img.shields.io/github/stars/xuyang-liu16/Awesome-Token-level-Model-Compression.svg?style=social&label=Star)](https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression)<br>[Shifting AI Efficiency From Model-Centric to Data-Centric Compression](https://arxiv.org/abs/2505.19147)<br>Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Position--Paper-purple)]() |  |  [Paper](https://arxiv.org/abs/2505.19147)<br> [GitHub](https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression)<br> | 
 43 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.05-red)]() [![Star](https://img.shields.io/github/stars/ZLKong/Awesome-Collection-Token-Reduction.svg?style=social&label=Star)](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)<br>[Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality](https://arxiv.org/abs/2505.18227)<br>Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() [![Area](https://img.shields.io/badge/Position--Paper-purple)]() |  |  [Paper](https://arxiv.org/abs/2505.18227)<br> [GitHub](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)<br> | 
 44 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() <br>[Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering](https://arxiv.org/abs/2505.10118)<br>Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.10118)<br> | 
 45 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.05-red)]() [![Star](https://img.shields.io/github/stars/ByteDance-Seed/Seed1.5-VL.svg?style=social&label=Star)](https://github.com/ByteDance-Seed/Seed1.5-VL)<br>[Seed1.5-VL Technical Report](https://arxiv.org/abs/2505.07062)<br>Seed Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2505.07062)<br> [GitHub](https://github.com/ByteDance-Seed/Seed1.5-VL)<br> | 
 46 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.04-red)]() [![Star](https://img.shields.io/github/stars/MikeWangWZHL/dymu.svg?style=social&label=Star)](https://github.com/MikeWangWZHL/dymu)<br>[DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs](https://arxiv.org/abs/2504.17040)<br>Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2504.17040)<br> [GitHub](https://github.com/MikeWangWZHL/dymu)<br> | 
 47 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() [![Star](https://img.shields.io/github/stars/orailix/PACT.svg?style=social&label=Star)](https://github.com/orailix/PACT)<br>[PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models](https://arxiv.org/abs/2504.08966)<br>Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2504.08966)<br> [GitHub](https://github.com/orailix/PACT)<br> | 
 48 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.04-red)]() <br>[QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA](https://arxiv.org/abs/2504.00654)<br>Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.00654)<br> | 
 49 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/zwl666666/Skip-Vision.svg?style=social&label=Star)](https://github.com/zwl666666/Skip-Vision)<br>[Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping](https://arxiv.org/abs/2503.21817)<br>Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.21817)<br> [GitHub](https://github.com/zwl666666/Skip-Vision)<br> | 
 50 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.03-red)]() [![Star](https://img.shields.io/github/stars/ludc506/InternVL-X.svg?style=social&label=Star)](https://github.com/ludc506/InternVL-X)<br>[InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression](https://arxiv.org/abs/2503.21307)<br>Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.21307)<br> [GitHub](https://github.com/ludc506/InternVL-X)<br> [Model](https://huggingface.co/LLCC506/InternVL-X-8B-HD)<br> | 
 51 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.03-red)]() [![Star](https://img.shields.io/github/stars/QwenLM/Qwen2.5-Omni.svg?style=social&label=Star)](https://github.com/QwenLM/Qwen2.5-Omni)<br>[Qwen2.5-Omni Technical Report](https://arxiv.org/abs/2503.20215)<br>Qwen Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.20215)<br> [GitHub](https://github.com/QwenLM/Qwen2.5-Omni)<br> [Model](https://huggingface.co/collections/Qwen/qwen25-omni-67de1e5f0f9464dc6314b36e)<br> | 
 52 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() <br>[TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model](https://arxiv.org/abs/2503.18278)<br>Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2503.18278)<br> | 
 53 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() <br>[Growing a Twig to Accelerate Large Vision-Language Models](https://arxiv.org/abs/2503.14075)<br>Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.14075)<br> | 
 54 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.03-red)]() [![Star](https://img.shields.io/github/stars/ShawnTan86/TokenCarve.svg?style=social&label=Star)](https://github.com/ShawnTan86/TokenCarve)<br>[TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models](https://arxiv.org/abs/2503.10501)<br>Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2503.10501)<br> [GitHub](https://github.com/ShawnTan86/TokenCarve)<br> | 
 55 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() [![Star](https://img.shields.io/github/stars/vbdi/divprune.svg?style=social&label=Star)](https://github.com/vbdi/divprune)<br>[DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models](https://arxiv.org/abs/2503.02175)<br>Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2503.02175)<br> [GitHub](https://github.com/vbdi/divprune)<br> | 
 56 | |  [![Publish](https://img.shields.io/badge/NAACL-2025-blue)]() [![Star](https://img.shields.io/github/stars/AIoT-MLSys-Lab/MEDA.svg?style=social&label=Star)](https://github.com/AIoT-MLSys-Lab/MEDA)<br>[MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2502.17599)<br>Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2502.17599)<br> [GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)<br> | 
 57 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.02-red)]() [![Star](https://img.shields.io/github/stars/ZichenWen1/DART.svg?style=social&label=Star)](https://github.com/ZichenWen1/DART)<br>[Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More](https://arxiv.org/abs/2502.11494)<br>Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2502.11494)<br> [GitHub](https://github.com/ZichenWen1/DART)<br> | 
 58 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.01-red)]() <br>[AdaFV: Rethinking of Visual-Language alignment for VLM acceleration](https://arxiv.org/abs/2501.09532)<br>Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2501.09532)<br> | 
 59 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.01-red)]() [![Star](https://img.shields.io/github/stars/xuyang-liu16/GlobalCom2.svg?style=social&label=Star)](https://github.com/xuyang-liu16/GlobalCom2)<br>[Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models](https://arxiv.org/abs/2501.05179)<br>Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2501.05179)<br> [GitHub](https://github.com/xuyang-liu16/GlobalCom2)<br> | 
 60 | |  [![Publish](https://img.shields.io/badge/ICLR-2025-blue)]() [![Star](https://img.shields.io/github/stars/ictnlp/LLaVA-Mini.svg?style=social&label=Star)](https://github.com/ictnlp/LLaVA-Mini)<br>[LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token](https://arxiv.org/abs/2501.03895)<br>Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2501.03895)<br> [GitHub](https://github.com/ictnlp/LLaVA-Mini)<br> [Model](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b)<br> | 
 61 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/anakin-skywalker-Joseph/Folder.svg?style=social&label=Star)](https://github.com/anakin-skywalker-Joseph/Folder)<br>[FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance](https://arxiv.org/abs/2501.02430)<br>Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2501.02430)<br> [GitHub](https://github.com/anakin-skywalker-Joseph/Folder)<br> | 
 62 | |  [![Publish](https://img.shields.io/badge/AAAI-2025-blue)]() [![Star](https://img.shields.io/github/stars/jytmelon/G-Prune.svg?style=social&label=Star)](https://github.com/jytmelon/G-Prune)<br>[What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph](https://arxiv.org/abs/2501.02268)<br>Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2501.02268)<br> [GitHub](https://github.com/jytmelon/G-Prune)<br> | 
 63 | </details>
 64 | 
 65 | <details open>
 66 | <summary><strong>2024 Image</strong></summary>
 67 | 
 68 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
 69 | | --- | --- | --- | :---: | 
 70 | |  [![Publish](https://img.shields.io/badge/AAAI-2025-blue)]() <br>[ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming](https://arxiv.org/abs/2412.20105)<br>Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.20105)<br> | 
 71 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() [![Star](https://img.shields.io/github/stars/OpenGVLab/PVC.svg?style=social&label=Star)](https://github.com/OpenGVLab/PVC)<br>[PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models](https://arxiv.org/abs/2412.09613)<br>Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2412.09613)<br> [GitHub](https://github.com/OpenGVLab/PVC)<br> [Model](https://huggingface.co/OpenGVLab/PVC-InternVL2-8B)<br> | 
 72 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/Lyra.svg?style=social&label=Star)](https://github.com/dvlab-research/Lyra)<br>[Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition](https://arxiv.org/abs/2412.09501)<br>Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2412.09501)<br> [GitHub](https://github.com/dvlab-research/Lyra)<br> [Model](https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc)<br> [Dataset](https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82)<br> | 
 73 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.12-red)]() [![Star](https://img.shields.io/github/stars/hulianyuyy/iLLaVA.svg?style=social&label=Star)](https://github.com/hulianyuyy/iLLaVA)<br>[iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models](https://arxiv.org/abs/2412.06263)<br>Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.06263)<br> [GitHub](https://github.com/hulianyuyy/iLLaVA)<br> | 
 74 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/VisionZip.svg?style=social&label=Star)](https://github.com/dvlab-research/VisionZip)<br>[VisionZip: Longer is Better but Not Necessary in Vision Language Models](https://arxiv.org/abs/2412.04467)<br>Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.04467)<br> [GitHub](https://github.com/dvlab-research/VisionZip)<br> | 
 75 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/Theia-4869/VisPruner.svg?style=social&label=Star)](https://github.com/Theia-4869/VisPruner)<br>[Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs](https://arxiv.org/abs/2412.01818)<br>Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.01818)<br> [GitHub](https://github.com/Theia-4869/VisPruner)<br> | 
 76 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() <br>[Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction](https://arxiv.org/abs/2412.00556)<br>Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.00556)<br> | 
 77 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() <br>[ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models](https://arxiv.org/abs/2412.00447)<br>Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2412.00447)<br> | 
 78 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.11-red)]() <br>[Efficient Multi-modal Large Language Models via Visual Token Grouping](https://arxiv.org/abs/2411.17773)<br>Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2411.17773)<br> | 
 79 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.11-red)]() [![Star](https://img.shields.io/github/stars/kawhiiiileo/FiCoCo.svg?style=social&label=Star)](https://github.com/kawhiiiileo/FiCoCo)<br>[Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration](https://arxiv.org/abs/2411.17686)<br>Yuhang Han, Xuyang Liu, Zihan Zhang, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2411.17686)<br> [GitHub](https://github.com/kawhiiiileo/FiCoCo)<br> | 
 80 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.11-red)]() <br>[FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression](https://arxiv.org/abs/2411.14228)<br>Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2411.14228)<br> | 
 81 | |  [![Publish](https://img.shields.io/badge/CVPR_Highlight-2025-blue)]() <br>[AdaCM2: Adaptive Cross‑Modality Memory Reduction](https://arxiv.org/abs/2411.12593)<br>Yuanbin Man, Ying Huang, Chengming Zhang, Bingzhe Li, Wei Niu, Miao Yin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2411.12593)<br> | 
 82 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.11-red)]() [![Star](https://img.shields.io/github/stars/liuting20/MustDrop.svg?style=social&label=Star)](https://github.com/liuting20/MustDrop)<br>[Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model](https://arxiv.org/abs/2411.10803)<br>Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2411.10803)<br> [GitHub](https://github.com/liuting20/MustDrop)<br> | 
 83 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() [![Star](https://img.shields.io/github/stars/Cooperx521/PyramidDrop.svg?style=social&label=Star)](https://github.com/Cooperx521/PyramidDrop)<br>[PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction](https://arxiv.org/abs/2410.17247)<br>Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2410.17247)<br> [GitHub](https://github.com/Cooperx521/PyramidDrop)<br> | 
 84 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.10-red)]() <br>[Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers](https://arxiv.org/abs/2410.14072)<br>Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2410.14072)<br> | 
 85 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() <br>[ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification](https://arxiv.org/abs/2410.08584)<br>Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2410.08584)<br> | 
 86 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/Gumpest/SparseVLMs.svg?style=social&label=Star)](https://github.com/Gumpest/SparseVLMs)<br>[SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference](https://arxiv.org/abs/2410.04417)<br>Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2410.04417)<br> [GitHub](https://github.com/Gumpest/SparseVLMs)<br> | 
 87 | |  [![Publish](https://img.shields.io/badge/ICLR-2025-blue)]() [![Star](https://img.shields.io/github/stars/rese1f/aurora.svg?style=social&label=Star)](https://github.com/rese1f/aurora)<br>[AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark](https://arxiv.org/abs/2410.03051)<br>Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2410.03051)<br> [GitHub](https://github.com/rese1f/aurora)<br> [Model](https://huggingface.co/collections/wchai/auroracap-66d117ffe13bedda96702013)<br> [Dataset](https://huggingface.co/datasets/wchai/Video-Detailed-Caption)<br> | 
 88 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.10-red)]() [![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-NeXT.svg?style=social&label=Star)](https://github.com/LLaVA-VL/LLaVA-NeXT)<br>[Video Instruction Tuning with Synthetic Data](https://arxiv.org/abs/2410.02713)<br>Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2410.02713)<br> [GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT)<br> [Model](https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944)<br> | 
 89 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.09-red)]() [![Star](https://img.shields.io/github/stars/NVIDIA/Megatron-LM.svg?style=social&label=Star)](https://github.com/NVIDIA/Megatron-LM/tree/NVLM-1.0/examples/multimodal/nvlm)<br>[NVLM: Open Frontier-Class Multimodal LLMs](https://arxiv.org/abs/2409.11402)<br>Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2409.11402)<br> [GitHub](https://github.com/NVIDIA/Megatron-LM/tree/NVLM-1.0/examples/multimodal/nvlm)<br> [Model](https://huggingface.co/collections/nvidia/nvlm-10-66e9f407c764a0ee6e37b7f4)<br> | 
 90 | |  [![Publish](https://img.shields.io/badge/COLING-2025-blue)]() [![Star](https://img.shields.io/github/stars/FreedomIntelligence/TRIM.svg?style=social&label=Star)](https://github.com/FreedomIntelligence/TRIM)<br>[Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs](https://arxiv.org/abs/2409.10994)<br>Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2409.10994)<br> [GitHub](https://github.com/FreedomIntelligence/TRIM)<br> | 
 91 | |  [![Publish](https://img.shields.io/badge/AAAI-2025-blue)]() [![Star](https://img.shields.io/github/stars/ywh187/FitPrune.svg?style=social&label=Star)](https://github.com/ywh187/FitPrune)<br>[Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models](https://arxiv.org/abs/2409.10197)<br>Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2409.10197)<br> [GitHub](https://github.com/ywh187/FitPrune)<br> | 
 92 | |  [![Publish](https://img.shields.io/badge/AAAI-2025-blue)]() [![Star](https://img.shields.io/github/stars/hasanar1f/HiRED.svg?style=social&label=Star)](https://github.com/hasanar1f/HiRED)<br>[HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models](https://arxiv.org/abs/2408.10945)<br>Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2408.10945)<br> [GitHub](https://github.com/hasanar1f/HiRED)<br> | 
 93 | |  [![Publish](https://img.shields.io/badge/Trans._Mach._Learn._Res.-2025-blue)]() [![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-NeXT.svg?style=social&label=Star)](https://github.com/LLaVA-VL/LLaVA-NeXT)<br>[LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326)<br>Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2408.03326)<br> [GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT)<br> [Model](https://huggingface.co/collections/lmms-lab/llava-onevision-66a259c3526e15166d6bba37)<br> | 
 94 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.07-red)]() [![Star](https://img.shields.io/github/stars/JiuTian-VL/TokenCorrCompressor.svg?style=social&label=Star)](https://github.com/JiuTian-VL/TokenCorrCompressor)<br>[Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding](https://arxiv.org/abs/2407.14439)<br>Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2407.14439)<br> [GitHub](https://github.com/JiuTian-VL/TokenCorrCompressor)<br> | 
 95 | |  [![Publish](https://img.shields.io/badge/IJCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/CircleRadon/TokenPacker.svg?style=social&label=Star)](https://github.com/CircleRadon/TokenPacker)<br>[TokenPacker: Efficient Visual Projector for Multimodal LLM](https://arxiv.org/abs/2407.02392)<br>Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2407.02392)<br> [GitHub](https://github.com/CircleRadon/TokenPacker)<br> | 
 96 | |  [![Publish](https://img.shields.io/badge/EMNLP_Findings-2024-blue)]() [![Star](https://img.shields.io/github/stars/SUSTechBruce/LOOK-M.svg?style=social&label=Star)](https://github.com/SUSTechBruce/LOOK-M)<br>[LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2406.18139)<br>Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2406.18139)<br> [GitHub](https://github.com/SUSTechBruce/LOOK-M)<br> | 
 97 | |  [![Publish](https://img.shields.io/badge/Trans._Mach._Learn._Res.-2025-blue)]() [![Star](https://img.shields.io/github/stars/EvolvingLMMs-Lab/LongVA.svg?style=social&label=Star)](https://github.com/EvolvingLMMs-Lab/LongVA)<br>[Long Context Transfer from Language to Vision](https://arxiv.org/abs/2406.16852)<br>Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.16852)<br> [GitHub](https://github.com/EvolvingLMMs-Lab/LongVA)<br> | 
 98 | |  [![Publish](https://img.shields.io/badge/ICML-2024-blue)]() [![Star](https://img.shields.io/github/stars/bytedance/SALMONN.svg?style=social&label=Star)](https://github.com/bytedance/SALMONN/tree/videosalmonn)<br>[video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models](https://arxiv.org/abs/2406.15704)<br>Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.15704)<br> [GitHub](https://github.com/bytedance/SALMONN/tree/videosalmonn)<br> [Model](https://huggingface.co/tsinghua-ee/Video-SALMONN/tree/main)<br> | 
 99 | |  [![Publish](https://img.shields.io/badge/CVPR-2025-blue)]() [![Star](https://img.shields.io/github/stars/Yxxxb/VoCo-LLaMA.svg?style=social&label=Star)](https://github.com/Yxxxb/VoCo-LLaMA)<br>[VoCo-LLaMA: Towards Vision Compression with Large Language Models](https://arxiv.org/abs/2406.12275)<br>Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2406.12275)<br> [GitHub](https://github.com/Yxxxb/VoCo-LLaMA)<br> | 
100 | |  [![Publish](https://img.shields.io/badge/ICLR-2025-blue)]() [![Star](https://img.shields.io/github/stars/mu-cai/matryoshka-mm.svg?style=social&label=Star)](https://github.com/mu-cai/matryoshka-mm)<br>[Matryoshka Multimodal Models](https://arxiv.org/abs/2405.17430)<br>Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2405.17430)<br> [GitHub](https://github.com/mu-cai/matryoshka-mm)<br> | 
101 | |  [![Publish](https://img.shields.io/badge/AAAI_Oral-2025-blue)]() [![Star](https://img.shields.io/github/stars/lzhxmu/VTW.svg?style=social&label=Star)](https://github.com/lzhxmu/VTW)<br>[Boosting multimodal large language models with visual tokens withdrawal for rapid inference.](https://arxiv.org/abs/2405.05803)<br>Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2405.05803)<br> [GitHub](https://github.com/lzhxmu/VTW)<br> | 
102 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.04-red)]() [![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social&label=Star)](https://github.com/OpenGVLab/InternVL)<br>[How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821)<br>InternVL Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2404.16821)<br> [GitHub](https://github.com/OpenGVLab/InternVL)<br> [Model](https://huggingface.co/collections/OpenGVLab/internvl15-6675ae031d45e5a07007f260)<br> | 
103 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/42Shawn/LLaVA-PruMerge.svg?style=social&label=Star)](https://github.com/42Shawn/LLaVA-PruMerge)<br>[LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388)<br>Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2403.15388)<br> [GitHub](https://github.com/42Shawn/LLaVA-PruMerge)<br> | 
104 | |  [![Publish](https://img.shields.io/badge/ECCV_Oral-2024-blue)]() [![Star](https://img.shields.io/github/stars/pkunlp-icler/FastV.svg?style=social&label=Star)](https://github.com/pkunlp-icler/FastV)<br>[An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models](https://arxiv.org/abs/2403.06764)<br>Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2403.06764)<br> [GitHub](https://github.com/pkunlp-icler/FastV)<br> | 
105 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2024\.02-red)]() [![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star)](https://github.com/Meituan-AutoML/MobileVLM)<br>[MobileVLM V2: Faster and Stronger Baseline for Vision Language Model](https://arxiv.org/abs/2402.03766)<br>Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2402.03766)<br> [GitHub](https://github.com/Meituan-AutoML/MobileVLM)<br> [Model](https://huggingface.co/mtgv/models)<br> | 
106 | </details>
107 | 
108 | <details open>
109 | <summary><strong>2023 Image</strong></summary>
110 | 
111 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
112 | | --- | --- | --- | :---: | 
113 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2023\.12-red)]() [![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star)](https://github.com/Meituan-AutoML/MobileVLM)<br>[MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices](https://arxiv.org/abs/2312.16886)<br>Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2312.16886)<br> [GitHub](https://github.com/Meituan-AutoML/MobileVLM)<br> [Model](https://huggingface.co/mtgv/models)<br> | 
114 | |  [![Publish](https://img.shields.io/badge/CVPR-2024-blue)]() [![Star](https://img.shields.io/github/stars/khanrc/honeybee.svg?style=social&label=Star)](https://github.com/khanrc/honeybee?tab=readme-ov-file)<br>[Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742)<br>Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2312.06742)<br> [GitHub](https://github.com/khanrc/honeybee?tab=readme-ov-file)<br> | 
115 | |  [![Publish](https://img.shields.io/badge/ECCV-2024-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/LLaMA-VID.svg?style=social&label=Star)](https://github.com/dvlab-research/LLaMA-VID)<br>[LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models](https://arxiv.org/abs/2311.17043)<br>Yanwei Li, Chengyao Wang, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2311.17043)<br> [GitHub](https://github.com/dvlab-research/LLaMA-VID)<br> [Model](https://huggingface.co/collections/YanweiLi/llama-vid-656741a92f3ec92d7e484dea)<br> [Dataset](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data/tree/main)<br> | 
116 | |  [![Publish](https://img.shields.io/badge/CVPR_Highlight-2024-blue)]() [![Star](https://img.shields.io/github/stars/PKU-YuanGroup/Chat-UniVi.svg?style=social&label=Star)](https://github.com/PKU-YuanGroup/Chat-UniVi)<br>[Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding](https://arxiv.org/abs/2311.08046)<br>Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2311.08046)<br> [GitHub](https://github.com/PKU-YuanGroup/Chat-UniVi)<br> [Model](https://huggingface.co/collections/Chat-UniVi/chat-univi-66f4265ee4c51e5acf255f2e)<br> [Dataset](https://github.com/PKU-YuanGroup/Chat-UniVi/blob/main/DATA.md)<br> | 
117 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2023\.08-red)]() [![Star](https://img.shields.io/github/stars/QwenLM/Qwen-VL.svg?style=social&label=Star)](https://github.com/QwenLM/Qwen-VL)<br>[Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/abs/2308.12966)<br>Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2308.12966)<br> [GitHub](https://github.com/QwenLM/Qwen-VL)<br> [Model](https://huggingface.co/Qwen/Qwen-VL)<br> | 
118 | |  [![Publish](https://img.shields.io/badge/EMNLP-2023-blue)]() [![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social&label=Star)](https://github.com/DAMO-NLP-SG/Video-LLaMA)<br>[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)<br>Hang Zhang, Xin Li, Lidong Bing |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2306.02858)<br> [GitHub](https://github.com/DAMO-NLP-SG/Video-LLaMA)<br> [Model](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series)<br> | 
119 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2023-blue)]() [![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)<br>[InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)<br>Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2305.06500)<br> [GitHub](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)<br> | 
120 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2023\.04-red)]() [![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star)](https://github.com/X-PLUG/mPLUG-Owl)<br>[mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https://arxiv.org/abs/2304.14178)<br>DAMO Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2304.14178)<br> [GitHub](https://github.com/X-PLUG/mPLUG-Owl)<br> | 
121 | |  [![Publish](https://img.shields.io/badge/ICLR-2024-blue)]() [![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star)](https://github.com/Vision-CAIR/MiniGPT-4)<br>[Minigpt-4: Enhancing vision-language understanding with advanced large language models.](https://arxiv.org/abs/2304.10592)<br>Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2304.10592)<br> [GitHub](https://github.com/Vision-CAIR/MiniGPT-4)<br> [Model](https://huggingface.co/Vision-CAIR/MiniGPT-4)<br> [Dataset](https://github.com/Vision-CAIR/MiniGPT-4/tree/main/dataset)<br> | 
122 | |  [![Publish](https://img.shields.io/badge/ICML-2023-blue)]() [![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)<br>[BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597)<br>Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2301.12597)<br> [GitHub](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)<br> | 
123 | </details>
124 | 
125 | <details open>
126 | <summary><strong>2022 Image</strong></summary>
127 | 
128 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
129 | | --- | --- | --- | :---: | 
130 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2022-blue)]() <br>[Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)<br>DeepMind Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2204.14198)<br> | 
131 | </details>
132 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | <div align=center>
  2 | 
  3 | # Awesome Multimodal Token Compression
  4 | 
  5 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  6 | [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)
  7 | [![arXiv](https://img.shields.io/badge/arXiv-2507\.20198-red.svg)](https://arxiv.org/abs/2507.20198)
  8 | [![Last Commit](https://img.shields.io/github/last-commit/cokeshao/Awesome-Multimodal-Token-Compression.svg?style=flat&color=orange)](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)
  9 | 
 10 | [[arXiv]](https://arxiv.org/abs/2507.20198) [[HuggingFace]](https://huggingface.co/papers/2507.20198) [[Database]](https://oasis-paddleboat-fc1.notion.site/when-tokens-talk-too-much-database)
 11 | 
 12 | </div>
 13 | 
 14 | > **When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios** [[arXiv]](https://arxiv.org/pdf/2507.20198)   
 15 | > [Kele Shao](https://cokeshao.github.io/)<sup>\*,1,2</sup>, [Keda Tao](https://kd-tao.github.io/)<sup>\*,1,2</sup>, [Kejia Zhang](https://kejiazhang-robust.github.io/)<sup>3</sup>, [Sicheng Feng](https://fscdc.github.io/)<sup>2,4</sup>, [Mu Cai](https://pages.cs.wisc.edu/~mucai/)<sup>5</sup>, [Yuzhang Shang](https://42shawn.github.io/)<sup>6</sup>, [Haoxuan You](https://hxyou.github.io/)<sup>7</sup>, [Can Qin](https://canqin.tech/)<sup>8</sup>, [Yang Sui](https://eclipsess.github.io/yangsui.github.io/)<sup>9</sup>, [Huan Wang](https://huanwang.tech/)<sup>†,2</sup>
 16 | > 
 17 | > <sup>1</sup>Zhejiang University, <sup>2</sup>Westlake University, <sup>3</sup>Xiamen University, <sup>4</sup>National University of Singapore, <sup>5</sup>University of Wisconsin-Madison, <sup>6</sup>University of Central Florida, <sup>7</sup>Columbia University, <sup>8</sup>Salesforce AI Research, <sup>9</sup>Rice University
 18 | > 
 19 | > \* Equal Contribution.  † Corresponding Author (wanghuan@westlake.edu.cn).
 20 | 
 21 | ---
 22 | 
 23 | > [!IMPORTANT]
 24 | > We welcome your help in improving the repository and paper. Please feel free to submit a [pull request](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression/pulls) or [contact us](#️-contact) to:
 25 | > 
 26 | > - Add a relevant paper not yet included.
 27 | >
 28 | > - Suggest a more suitable category.
 29 | >
 30 | > - Update the information.
 31 | >
 32 | > - Ask for clarification about any content.
 33 | 
 34 | ---
 35 | 
 36 | ## 🔥 News
 37 | 
 38 | - **[2025.10.11]** Papers accepted by **NeurIPS'25** about MLLM token compression have been updated [here](#published-in-recent-conferencejournal). Congratulations! 🎉🎉🎉
 39 | - **[2025.08.14]** ❗ Added [Recent Papers](#recent-papers-last-6-months), [Papers Published in Recent Conference/Journal](#published-in-recent-conferencejournal), and a [database](https://oasis-paddleboat-fc1.notion.site/when-tokens-talk-too-much-database) for quick-search.
 40 | - **[2025.07.29]** The v1 survey is now published! We've also initialized the repository.
 41 | 
 42 | ## 🎯 Motivation
 43 | <div align="left">
 44 |   <img src="images/motivation.png" alt="Awesome Token Compression" width="400"/>
 45 | </div>
 46 | 
 47 | > **Motivation:** **Up:** Image, video, and audio data types can scale in their representation dimensions, leading to a corresponding increase in the number of tokens. **Down:** Top-performing MLLMs cannot address real-world demands, as the number of tokens for multimodal information, especially video, vastly exceeds that of text. Therefore, token compression is crucial to address this limitation.
 48 | 
 49 | ## 📌 Citation
 50 | 
 51 | If you find our paper or this resource helpful, please consider cite:
 52 | 
 53 | ```bibtex
 54 | @article{shao2025tokens,
 55 |   title={When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios},
 56 |   author={Shao, Kele and Tao, Keda and Zhang, Kejia and Feng, Sicheng and Cai, Mu and Shang, Yuzhang and You, Haoxuan and Qin, Can and Sui, Yang and Wang, Huan},
 57 |   journal={arXiv preprint arXiv:2507.20198},
 58 |   year={2025}
 59 | }
 60 | ```
 61 | 
 62 | ## 📚 Contents
 63 | 
 64 | - [Awesome Token Compression](#awesome-multimodal-token-compression)
 65 |     - [Image LLM](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression/tree/main/image-llm.md)
 66 |     - [Video LLM](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression/tree/main/video-llm.md)
 67 |     - [Audio LLM](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression/tree/main/audio-llm.md)
 68 |     - [Vision Transformer](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression/tree/main/vision-transformer.md)
 69 |     - [Audio Transformer](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression/tree/main/audio-transformer.md)
 70 | 
 71 | **Please check out all the papers by selecting the sub-area you're interested in. On this main page, only papers released in the past 6 months are shown.**
 72 | 
 73 | ---
 74 | 
 75 | ### Badge Colors
 76 | - ![arXiv Badge](https://img.shields.io/badge/arXiv-red) `red` for arXiv papers
 77 | - ![PDF Badge](https://img.shields.io/badge/PDF-blue) `blue` for conference/journal papers
 78 | - ![GitHub Badge](https://img.shields.io/badge/GitHub-white) `white` for GitHub repositories
 79 | - ![Research Areas Badge](https://img.shields.io/badge/Areas-purple) `purple` for research areas
 80 | - ![Categories Badge](https://img.shields.io/badge/Categories-green) `green` for categories
 81 | - ![Cost Badge](https://img.shields.io/badge/Cost-yellow) `yellow` for training cost
 82 | 
 83 | ### Recent Papers (Last 6 Months)
 84 | 
 85 | 
 86 | <details open>
 87 | <summary><strong>Image</strong></summary>
 88 | 
 89 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
 90 | | --- | --- | --- | :---: | 
 91 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.10-red)]() [![Star](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-OCR.svg?style=social&label=Star)](https://github.com/deepseek-ai/DeepSeek-OCR)<br>[DeepSeek-OCR: Contexts Optical Compression](https://arxiv.org/abs/2510.18234)<br>Haoran Wei, Yaofeng Sun, Yukun Li |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2510.18234)<br> [GitHub](https://github.com/deepseek-ai/DeepSeek-OCR)<br> [Model](https://huggingface.co/deepseek-ai/DeepSeek-OCR)<br> | 
 92 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.10-red)]() [![Star](https://img.shields.io/github/stars/JulietChoo/VisionSelector.svg?style=social&label=Star)](https://github.com/JulietChoo/VisionSelector)<br>[VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs](https://arxiv.org/abs/2510.16598)<br>Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2510.16598)<br> [GitHub](https://github.com/JulietChoo/VisionSelector)<br> [Model](https://huggingface.co/JulietChoo/VisionSelector-Qwen2.5-VL-7B)<br> | 
 93 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.10-red)]() [![Star](https://img.shields.io/github/stars/Chenfei-Liao/VTC-Bench.svg?style=social&label=Star)](https://github.com/Chenfei-Liao/VTC-Bench)<br>[Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods](https://arxiv.org/abs/2510.07143)<br>Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  |  [Paper](https://arxiv.org/abs/2510.07143)<br> [GitHub](https://github.com/Chenfei-Liao/VTC-Bench)<br> | 
 94 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models](https://arxiv.org/abs/2509.24837)<br>Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.24837)<br> | 
 95 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/AutoLab-SAI-SJTU/AutoPrune.svg?style=social&label=Star)](https://github.com/AutoLab-SAI-SJTU/AutoPrune)<br>[AutoPrune: Each Complexity Deserves a Pruning Policy](https://arxiv.org/abs/2509.23931)<br>Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.23931)<br> [GitHub](https://github.com/AutoLab-SAI-SJTU/AutoPrune)<br> | 
 96 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score](https://arxiv.org/abs/2509.23663)<br>Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.23663)<br> | 
 97 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance](https://arxiv.org/abs/2509.15704)<br>Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.15704)<br> | 
 98 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression](https://arxiv.org/abs/2509.12159)<br>Jingyu Xiao, Zhongyi Zhang, Yuxuan Wan, Yintong Huo, Yang Liu, Michael R.Lyu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/GUI--Agent-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.12159)<br> | 
 99 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge](https://arxiv.org/abs/2509.09955)<br>Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.09955)<br> | 
100 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() [![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social&label=Star)](https://github.com/OpenGVLab/InternVL)<br>[InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://arxiv.org/abs/2508.18265)<br>InternVL Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.18265)<br> [GitHub](https://github.com/OpenGVLab/InternVL)<br> [Model](https://huggingface.co/collections/OpenGVLab/internvl35-68ac87bd52ebe953485927fb)<br> | 
101 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() <br>[VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference](https://arxiv.org/abs/2508.17857)<br>Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.17857)<br> | 
102 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[Revisiting MLLM Token Technology through the Lens of Classical Visual Coding](https://arxiv.org/abs/2508.13460)<br>Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Visual--Coding-purple)]() |  |  [Paper](https://arxiv.org/abs/2508.13460)<br> | 
103 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models](https://arxiv.org/abs/2508.11886)<br>Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.11886)<br> | 
104 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning](https://arxiv.org/abs/2508.07871)<br>Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.07871)<br> | 
105 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance](https://arxiv.org/abs/2508.06084)<br>Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.06084)<br> | 
106 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models](https://arxiv.org/abs/2508.06038)<br>Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.06038)<br> | 
107 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/sihany077/VFlowOpt.svg?style=social&label=Star)](https://github.com/sihany077/VFlowOpt)<br>[VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization](https://arxiv.org/abs/2508.05211)<br>Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.05211)<br> [GitHub](https://github.com/sihany077/VFlowOpt)<br> | 
108 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() [![Star](https://img.shields.io/github/stars/HVision-NKU/GlimpsePrune.svg?style=social&label=Star)](https://github.com/HVision-NKU/GlimpsePrune)<br>[A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models](https://arxiv.org/abs/2508.01548)<br>Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.01548)<br> [GitHub](https://github.com/HVision-NKU/GlimpsePrune)<br> [Model](https://huggingface.co/collections/ashun989/glimpseprune-688d8826ef5bd09db6af145e)<br> | 
109 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() <br>[Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models](https://arxiv.org/abs/2508.01236)<br>Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.01236)<br> | 
110 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models](https://arxiv.org/abs/2508.00553)<br>Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.00553)<br> | 
111 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning](https://arxiv.org/abs/2507.23318)<br>Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/VLA-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.23318)<br> | 
112 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/YuchenLiu98/METEOR.svg?style=social&label=Star)](https://github.com/YuchenLiu98/METEOR)<br>[METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models](https://arxiv.org/abs/2507.20842)<br>Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.20842)<br> [GitHub](https://github.com/YuchenLiu98/METEOR)<br> | 
113 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/liaolea/TransPrune.svg?style=social&label=Star)](https://github.com/liaolea/TransPrune)<br>[TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model](https://arxiv.org/abs/2507.20630)<br>Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.20630)<br> [GitHub](https://github.com/liaolea/TransPrune)<br> | 
114 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/cokeshao/Awesome-Multimodal-Token-Compression.svg?style=social&label=Star)](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br>[When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)<br>Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() [![Area](https://img.shields.io/badge/Survey-purple)]() |  |  [Paper](https://arxiv.org/abs/2507.20198)<br> [GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br> | 
115 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[Efficient Whole Slide Pathology VQA via Token Compression](https://arxiv.org/abs/2507.14497)<br>Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.14497)<br> | 
116 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[Training-free Token Reduction for Vision Mamba](https://arxiv.org/abs/2507.14042)<br>Qiankun Ma, Ziyao Zhang, Chi Su, Jie Chen, Zhen Song, Hairong Zheng, Wen Gao |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.14042)<br> | 
117 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/VisionThink.svg?style=social&label=Star)](https://github.com/dvlab-research/VisionThink)<br>[VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning](https://arxiv.org/abs/2507.13348)<br>Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.13348)<br> [GitHub](https://github.com/dvlab-research/VisionThink)<br> [Model](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)<br> [Dataset](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)<br> | 
118 | |  [![Publish](https://img.shields.io/badge/EMNLP_Findings-2024-blue)]() <br>[LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models](https://arxiv.org/abs/2507.02279)<br>Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.02279)<br> | 
119 | |  [![Publish](https://img.shields.io/badge/IROS-2025-blue)]() <br>[ToSA: Token Merging with Spatial Awareness](https://arxiv.org/abs/2506.20066)<br>Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.20066)<br> | 
120 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/Theia-4869/CDPruner.svg?style=social&label=Star)](https://github.com/Theia-4869/CDPruner)<br>[Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/abs/2506.10967)<br>Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.10967)<br> [GitHub](https://github.com/Theia-4869/CDPruner)<br> | 
121 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.06-red)]() <br>[Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective](https://arxiv.org/abs/2506.01097)<br>Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2506.01097)<br> | 
122 | |  [![Publish](https://img.shields.io/badge/ACL-2025-blue)]() [![Star](https://img.shields.io/github/stars/EffiVLM-Bench/EffiVLM-Bench.svg?style=social&label=Star)](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br>[EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Languge Models](https://arxiv.org/abs/2506.00479)<br>Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  |  [Paper](https://arxiv.org/abs/2506.00479)<br> [GitHub](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br> | 
123 | </details>
124 | 
125 | <details open>
126 | <summary><strong>Video</strong></summary>
127 | 
128 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
129 | | --- | --- | --- | :---: | 
130 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.12-red)]() <br>[EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs](https://arxiv.org/abs/2512.10324)<br>Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2512.10324)<br> | 
131 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() <br>[Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs](https://arxiv.org/abs/2510.17364)<br>Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2510.17364)<br> | 
132 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.10-red)]() [![Star](https://img.shields.io/github/stars/JulietChoo/VisionSelector.svg?style=social&label=Star)](https://github.com/JulietChoo/VisionSelector)<br>[VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs](https://arxiv.org/abs/2510.16598)<br>Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2510.16598)<br> [GitHub](https://github.com/JulietChoo/VisionSelector)<br> [Model](https://huggingface.co/JulietChoo/VisionSelector-Qwen2.5-VL-7B)<br> | 
133 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning](https://arxiv.org/abs/2509.22481)<br>Xiangmo Zhao, Nan Yang, Yang Wang, Zhanwen Liu |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Event--Camera-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.22481)<br> | 
134 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning](https://arxiv.org/abs/2509.15250)<br>Wenda Qin, Andrea Burns, Bryan A. Plummer, Margrit Betke |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/VLN-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.15250)<br> | 
135 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() <br>[The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning](https://arxiv.org/abs/2509.12594)<br>Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/VLA-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2509.12594)<br> | 
136 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.09-red)]() [![Star](https://img.shields.io/github/stars/Zizzzzzzz/FocusMamba.svg?style=social&label=Star)](https://github.com/Zizzzzzzz/FocusMamba)<br>[Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection](https://arxiv.org/abs/2509.03872)<br>Nan Yang, Yang Wang, Zhanwen Liu, Yuchao Dai, Yang Liu, Xiangmo Zhao |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Event--Camera-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2509.03872)<br> [GitHub](https://github.com/Zizzzzzzz/FocusMamba)<br> | 
137 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() [![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social&label=Star)](https://github.com/OpenGVLab/InternVL)<br>[InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://arxiv.org/abs/2508.18265)<br>InternVL Team |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.18265)<br> [GitHub](https://github.com/OpenGVLab/InternVL)<br> [Model](https://huggingface.co/collections/OpenGVLab/internvl35-68ac87bd52ebe953485927fb)<br> | 
138 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() <br>[VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference](https://arxiv.org/abs/2508.17857)<br>Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.17857)<br> | 
139 | |  [![Publish](https://img.shields.io/badge/EMNLP--blue)]() [![Star](https://img.shields.io/github/stars/yogesh-iitj/LGTTP.svg?style=social&label=Star)](https://github.com/yogesh-iitj/LGTTP)<br>[Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing](https://arxiv.org/abs/2508.17686)<br>Yogesh Kumar |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.17686)<br> [GitHub](https://github.com/yogesh-iitj/LGTTP)<br> | 
140 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() [![Star](https://img.shields.io/github/stars/zju-jiyicheng/SpecVLM.svg?style=social&label=Star)](https://github.com/zju-jiyicheng/SpecVLM)<br>[SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning](https://arxiv.org/abs/2508.16201)<br>Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() |  [Paper](https://arxiv.org/abs/2508.16201)<br> [GitHub](https://github.com/zju-jiyicheng/SpecVLM)<br> | 
141 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding](https://arxiv.org/abs/2508.15717)<br>Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, Mengye Ren |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Streaming--Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.15717)<br> | 
142 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.08-red)]() <br>[EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models](https://arxiv.org/abs/2508.11886)<br>Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.11886)<br> | 
143 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/sihany077/VFlowOpt.svg?style=social&label=Star)](https://github.com/sihany077/VFlowOpt)<br>[VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization](https://arxiv.org/abs/2508.05211)<br>Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.05211)<br> [GitHub](https://github.com/sihany077/VFlowOpt)<br> | 
144 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/cokeshao/Awesome-Multimodal-Token-Compression.svg?style=social&label=Star)](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br>[When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)<br>Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() [![Area](https://img.shields.io/badge/Survey-purple)]() |  |  [Paper](https://arxiv.org/abs/2507.20198)<br> [GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br> | 
145 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() <br>[EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent](https://arxiv.org/abs/2507.15428)<br>Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.15428)<br> | 
146 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/HYUNJS/STTM.svg?style=social&label=Star)](https://github.com/HYUNJS/STTM)<br>[Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video-LLMs](https://arxiv.org/abs/2507.07990)<br>Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.07990)<br> [GitHub](https://github.com/HYUNJS/STTM)<br> | 
147 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/InternRobotics/StreamVLN.svg?style=social&label=Star)](https://github.com/InternRobotics/StreamVLN)<br>[StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling](https://arxiv.org/abs/2507.05240)<br>Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/VLN-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.05240)<br> [GitHub](https://github.com/InternRobotics/StreamVLN)<br> [Dataset](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data)<br> | 
148 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() <br>[AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding](https://arxiv.org/abs/2507.02591)<br>Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.02591)<br> | 
149 | |  [![Publish](https://img.shields.io/badge/EMNLP_Findings-2024-blue)]() <br>[LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models](https://arxiv.org/abs/2507.02279)<br>Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.02279)<br> | 
150 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.06-red)]() [![Star](https://img.shields.io/github/stars/HumanMLLM/LLaVA-Scissor.svg?style=social&label=Star)](https://github.com/HumanMLLM/LLaVA-Scissor)<br>[LLaVA-Scissor: Token Compression with Semantic Connected Components for Video-LLMs](https://arxiv.org/abs/2506.21862)<br>Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.21862)<br> [GitHub](https://github.com/HumanMLLM/LLaVA-Scissor)<br> | 
151 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.06-red)]() [![Star](https://img.shields.io/github/stars/VectorSpaceLab/Video-XL.svg?style=social&label=Star)](https://github.com/VectorSpaceLab/Video-XL)<br>[Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification](https://arxiv.org/abs/2506.19225)<br>Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2506.19225)<br> [GitHub](https://github.com/VectorSpaceLab/Video-XL)<br> [Model](https://huggingface.co/collections/BAAI/video-xl-683973cd45636acda09a11bd)<br> | 
152 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/Theia-4869/CDPruner.svg?style=social&label=Star)](https://github.com/Theia-4869/CDPruner)<br>[Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/abs/2506.10967)<br>Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.10967)<br> [GitHub](https://github.com/Theia-4869/CDPruner)<br> | 
153 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.06-red)]() <br>[DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding](https://arxiv.org/abs/2506.03990)<br>Hongzhi Zhang, Jingyuan Zhang, Xingguang Ji, Qi Wang, Fuzheng Zhang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.03990)<br> | 
154 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.06-red)]() [![Star](https://img.shields.io/github/stars/mnyuew/METok.svg?style=social&label=Star)](https://github.com/mnyuew/METok)<br>[METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding](https://arxiv.org/abs/2506.02850)<br>Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, Yunpu Ma |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.02850)<br> [GitHub](https://github.com/mnyuew/METok)<br> | 
155 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.06-red)]() <br>[Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective](https://arxiv.org/abs/2506.01097)<br>Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2506.01097)<br> | 
156 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/yunzhuzhang0918/flexselect.svg?style=social&label=Star)](https://github.com/yunzhuzhang0918/flexselect)<br>[FlexSelect: Flexible Token Selection for Efficient Long Video Understanding](https://arxiv.org/abs/2506.00993)<br>Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.00993)<br> [GitHub](https://github.com/yunzhuzhang0918/flexselect)<br> | 
157 | |  [![Publish](https://img.shields.io/badge/ACL-2025-blue)]() [![Star](https://img.shields.io/github/stars/EffiVLM-Bench/EffiVLM-Bench.svg?style=social&label=Star)](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br>[EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Languge Models](https://arxiv.org/abs/2506.00479)<br>Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  |  [Paper](https://arxiv.org/abs/2506.00479)<br> [GitHub](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br> | 
158 | </details>
159 | 
160 | <details open>
161 | <summary><strong>Audio</strong></summary>
162 | 
163 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
164 | | --- | --- | --- | :---: | 
165 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.12-red)]() <br>[EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs](https://arxiv.org/abs/2512.10324)<br>Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2512.10324)<br> | 
166 | |  [![Arxiv](https://img.shields.io/badge/arXiv-2025\.07-red)]() [![Star](https://img.shields.io/github/stars/cokeshao/Awesome-Multimodal-Token-Compression.svg?style=social&label=Star)](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br>[When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)<br>Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() [![Area](https://img.shields.io/badge/Survey-purple)]() |  |  [Paper](https://arxiv.org/abs/2507.20198)<br> [GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)<br> | 
167 | </details>
168 | 
169 | 
170 | ### Published in Recent Conference/Journal
171 | 
172 | 
173 | <details open>
174 | <summary><strong>NeurIPS 2025</strong></summary>
175 | 
176 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
177 | | --- | --- | --- | :---: | 
178 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() <br>[Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs](https://arxiv.org/abs/2510.17364)<br>Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2510.17364)<br> | 
179 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/AutoLab-SAI-SJTU/AutoPrune.svg?style=social&label=Star)](https://github.com/AutoLab-SAI-SJTU/AutoPrune)<br>[AutoPrune: Each Complexity Deserves a Pruning Policy](https://arxiv.org/abs/2509.23931)<br>Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2509.23931)<br> [GitHub](https://github.com/AutoLab-SAI-SJTU/AutoPrune)<br> | 
180 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/VisionThink.svg?style=social&label=Star)](https://github.com/dvlab-research/VisionThink)<br>[VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning](https://arxiv.org/abs/2507.13348)<br>Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.13348)<br> [GitHub](https://github.com/dvlab-research/VisionThink)<br> [Model](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)<br> [Dataset](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)<br> | 
181 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/Theia-4869/CDPruner.svg?style=social&label=Star)](https://github.com/Theia-4869/CDPruner)<br>[Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/abs/2506.10967)<br>Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.10967)<br> [GitHub](https://github.com/Theia-4869/CDPruner)<br> | 
182 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/yunzhuzhang0918/flexselect.svg?style=social&label=Star)](https://github.com/yunzhuzhang0918/flexselect)<br>[FlexSelect: Flexible Token Selection for Efficient Long Video Understanding](https://arxiv.org/abs/2506.00993)<br>Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2506.00993)<br> [GitHub](https://github.com/yunzhuzhang0918/flexselect)<br> | 
183 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() <br>[Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization](https://arxiv.org/abs/2505.22038)<br>Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.22038)<br> | 
184 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/cokeshao/HoliTom.svg?style=social&label=Star)](https://github.com/cokeshao/HoliTom)<br>[HoliTom: Holistic Token Merging for Fast Video Large Language Models](https://arxiv.org/abs/2505.21334)<br>Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.21334)<br> [GitHub](https://github.com/cokeshao/HoliTom)<br> | 
185 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() <br>[Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering](https://arxiv.org/abs/2505.10118)<br>Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.10118)<br> | 
186 | |  [![Publish](https://img.shields.io/badge/NeurIPS-2025-blue)]() [![Star](https://img.shields.io/github/stars/LunarShen/FastVID.svg?style=social&label=Star)](https://github.com/LunarShen/FastVID)<br>[FastVID: Dynamic Density Pruning for Fast Video Large Language Models](https://arxiv.org/abs/2503.11187)<br>Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2503.11187)<br> [GitHub](https://github.com/LunarShen/FastVID)<br> | 
187 | </details>
188 | 
189 | <details open>
190 | <summary><strong>ICCV 2025</strong></summary>
191 | 
192 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
193 | | --- | --- | --- | :---: | 
194 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/sihany077/VFlowOpt.svg?style=social&label=Star)](https://github.com/sihany077/VFlowOpt)<br>[VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization](https://arxiv.org/abs/2508.05211)<br>Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.05211)<br> [GitHub](https://github.com/sihany077/VFlowOpt)<br> | 
195 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/mlvlab/Representation-Shift.svg?style=social&label=Star)](https://github.com/mlvlab/Representation-Shift)<br>[Representation Shift: Unifying Token Compression with FlashAttention](https://arxiv.org/abs/2508.00367)<br>Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim |  [![Area](https://img.shields.io/badge/Vision--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.00367)<br> [GitHub](https://github.com/mlvlab/Representation-Shift)<br> | 
196 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/YuchenLiu98/METEOR.svg?style=social&label=Star)](https://github.com/YuchenLiu98/METEOR)<br>[METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models](https://arxiv.org/abs/2507.20842)<br>Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2507.20842)<br> [GitHub](https://github.com/YuchenLiu98/METEOR)<br> | 
197 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/HYUNJS/STTM.svg?style=social&label=Star)](https://github.com/HYUNJS/STTM)<br>[Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video-LLMs](https://arxiv.org/abs/2507.07990)<br>Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.07990)<br> [GitHub](https://github.com/HYUNJS/STTM)<br> | 
198 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() <br>[AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding](https://arxiv.org/abs/2507.02591)<br>Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2507.02591)<br> | 
199 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/zwl666666/Skip-Vision.svg?style=social&label=Star)](https://github.com/zwl666666/Skip-Vision)<br>[Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping](https://arxiv.org/abs/2503.21817)<br>Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.21817)<br> [GitHub](https://github.com/zwl666666/Skip-Vision)<br> | 
200 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() <br>[Growing a Twig to Accelerate Large Vision-Language Models](https://arxiv.org/abs/2503.14075)<br>Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.14075)<br> | 
201 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/LSDBench.svg?style=social&label=Star)](https://github.com/dvlab-research/LSDBench)<br>[Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?](https://arxiv.org/abs/2503.12496)<br>Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  |  [Paper](https://arxiv.org/abs/2503.12496)<br> [GitHub](https://github.com/dvlab-research/LSDBench)<br> [Dataset](https://huggingface.co/datasets/TainU/LSDBench)<br> | 
202 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/anakin-skywalker-Joseph/Folder.svg?style=social&label=Star)](https://github.com/anakin-skywalker-Joseph/Folder)<br>[FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance](https://arxiv.org/abs/2501.02430)<br>Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2501.02430)<br> [GitHub](https://github.com/anakin-skywalker-Joseph/Folder)<br> | 
203 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/thu-nics/FrameFusion.svg?style=social&label=Star)](https://github.com/thu-nics/FrameFusion)<br>[FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models](https://arxiv.org/abs/2501.01986)<br>Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]() |  [Paper](https://arxiv.org/abs/2501.01986)<br> [GitHub](https://github.com/thu-nics/FrameFusion)<br> | 
204 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/Hon-Wong/ByteVideoLLM.svg?style=social&label=Star)](https://github.com/Hon-Wong/ByteVideoLLM)<br>[Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM](https://arxiv.org/abs/2412.09530)<br>Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2412.09530)<br> [GitHub](https://github.com/Hon-Wong/ByteVideoLLM)<br> | 
205 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/dvlab-research/Lyra.svg?style=social&label=Star)](https://github.com/dvlab-research/Lyra)<br>[Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition](https://arxiv.org/abs/2412.09501)<br>Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2412.09501)<br> [GitHub](https://github.com/dvlab-research/Lyra)<br> [Model](https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc)<br> [Dataset](https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82)<br> | 
206 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/LaVi-Lab/AIM.svg?style=social&label=Star)](https://github.com/LaVi-Lab/AIM)<br>[AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning](https://arxiv.org/abs/2412.03248)<br>Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.03248)<br> [GitHub](https://github.com/LaVi-Lab/AIM)<br> | 
207 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/Theia-4869/VisPruner.svg?style=social&label=Star)](https://github.com/Theia-4869/VisPruner)<br>[Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs](https://arxiv.org/abs/2412.01818)<br>Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.01818)<br> [GitHub](https://github.com/Theia-4869/VisPruner)<br> | 
208 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() <br>[ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification](https://arxiv.org/abs/2410.08584)<br>Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2410.08584)<br> | 
209 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/joslefaure/HERMES.svg?style=social&label=Star)](https://github.com/joslefaure/HERMES)<br>[HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics](https://arxiv.org/abs/2408.17443)<br>Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2408.17443)<br> [GitHub](https://github.com/joslefaure/HERMES)<br> | 
210 | |  [![Publish](https://img.shields.io/badge/ICCV-2025-blue)]() [![Star](https://img.shields.io/github/stars/42Shawn/LLaVA-PruMerge.svg?style=social&label=Star)](https://github.com/42Shawn/LLaVA-PruMerge)<br>[LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388)<br>Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Transformation--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2403.15388)<br> [GitHub](https://github.com/42Shawn/LLaVA-PruMerge)<br> | 
211 | </details>
212 | 
213 | <details open>
214 | <summary><strong>ACL 2025</strong></summary>
215 | 
216 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
217 | | --- | --- | --- | :---: | 
218 | |  [![Publish](https://img.shields.io/badge/ACL-2025-blue)]() [![Star](https://img.shields.io/github/stars/EffiVLM-Bench/EffiVLM-Bench.svg?style=social&label=Star)](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br>[EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Languge Models](https://arxiv.org/abs/2506.00479)<br>Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() [![Area](https://img.shields.io/badge/Benchmark-purple)]() |  |  [Paper](https://arxiv.org/abs/2506.00479)<br> [GitHub](https://github.com/EffiVLM-Bench/EffiVLM-Bench)<br> | 
219 | |  [![Publish](https://img.shields.io/badge/ACL_Findings-2025-blue)]() [![Star](https://img.shields.io/github/stars/JeongHun0716/MMS-LLaMA.svg?style=social&label=Star)](https://github.com/JeongHun0716/MMS-LLaMA)<br>[MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens](https://arxiv.org/abs/2503.11315)<br>Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro |  [![Area](https://img.shields.io/badge/Audio--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2503.11315)<br> [GitHub](https://github.com/JeongHun0716/MMS-LLaMA)<br> | 
220 | |  [![Publish](https://img.shields.io/badge/NAACL-2025-blue)]() [![Star](https://img.shields.io/github/stars/AIoT-MLSys-Lab/MEDA.svg?style=social&label=Star)](https://github.com/AIoT-MLSys-Lab/MEDA)<br>[MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2502.17599)<br>Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2502.17599)<br> [GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)<br> | 
221 | |  [![Publish](https://img.shields.io/badge/ACL-2025-blue)]() [![Star](https://img.shields.io/github/stars/Visual-AI/PruneVid.svg?style=social&label=Star)](https://github.com/Visual-AI/PruneVid)<br>[PruneVid: Visual Token Pruning for Efficient Video Large Language Models](https://arxiv.org/abs/2412.16117)<br>Xiaohu Huang, Hao Zhou, Kai Han |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2412.16117)<br> [GitHub](https://github.com/Visual-AI/PruneVid)<br> | 
222 | |  [![Publish](https://img.shields.io/badge/NAACL_Oral-2025-blue)]() [![Star](https://img.shields.io/github/stars/ZongqianLi/Prompt-Compression-Survey.svg?style=social&label=Star)](https://github.com/ZongqianLi/Prompt-Compression-Survey)<br>[Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388)<br>Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier |  [![Area](https://img.shields.io/badge/LLM-purple)]() [![Area](https://img.shields.io/badge/Survey-purple)]() |  |  [Paper](https://arxiv.org/abs/2410.12388)<br> [GitHub](https://github.com/ZongqianLi/Prompt-Compression-Survey)<br> | 
223 | </details>
224 | 
225 | <details open>
226 | <summary><strong>ICML 2025</strong></summary>
227 | 
228 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
229 | | --- | --- | --- | :---: | 
230 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/wangqinsi1/2025-ICML-CoreMatching.svg?style=social&label=Star)](https://github.com/wangqinsi1/2025-ICML-CoreMatching)<br>[CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models](https://arxiv.org/abs/2505.19235)<br>Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2505.19235)<br> [GitHub](https://github.com/wangqinsi1/2025-ICML-CoreMatching)<br> | 
231 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/yangdongchao/ALMTokenizer.svg?style=social&label=Star)](https://github.com/yangdongchao/ALMTokenizer)<br>[ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling](https://arxiv.org/abs/2504.10344)<br>Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng |  [![Area](https://img.shields.io/badge/Audio--Transformer-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.10344)<br> [GitHub](https://github.com/yangdongchao/ALMTokenizer)<br> | 
232 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/steven-ccq/ViLAMP.svg?style=social&label=Star)](https://github.com/steven-ccq/ViLAMP)<br>[Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation](https://arxiv.org/abs/2504.02438)<br>Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.02438)<br> [GitHub](https://github.com/steven-ccq/ViLAMP)<br> [Model](https://huggingface.co/orange-sk/ViLAMP-llava-qwen)<br> | 
233 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/Vision-CAIR/LongVU.svg?style=social&label=Star)](https://github.com/Vision-CAIR/LongVU)<br>[LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/abs/2410.17434)<br>Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Query--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2410.17434)<br> [GitHub](https://github.com/Vision-CAIR/LongVU)<br> [Model](https://huggingface.co/collections/Vision-CAIR/longvu-67181d2debabfc1eb050c21d)<br> | 
234 | |  [![Publish](https://img.shields.io/badge/ICML-2025-blue)]() [![Star](https://img.shields.io/github/stars/Gumpest/SparseVLMs.svg?style=social&label=Star)](https://github.com/Gumpest/SparseVLMs)<br>[SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference](https://arxiv.org/abs/2410.04417)<br>Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Query--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2410.04417)<br> [GitHub](https://github.com/Gumpest/SparseVLMs)<br> | 
235 | </details>
236 | 
237 | <details open>
238 | <summary><strong>ACM MM 2025</strong></summary>
239 | 
240 | | **Title & Authors** | **Areas** | **Tags** | **Links** | 
241 | | --- | --- | --- | :---: | 
242 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() <br>[VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference](https://arxiv.org/abs/2508.17857)<br>Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Attention--Based-green)]() [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Free-yellow)]() |  [Paper](https://arxiv.org/abs/2508.17857)<br> | 
243 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() <br>[Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models](https://arxiv.org/abs/2508.01236)<br>Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang |  [![Area](https://img.shields.io/badge/Image--LLM-purple)]() |  [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2508.01236)<br> | 
244 | |  [![Publish](https://img.shields.io/badge/ACM_MM-2025-blue)]() [![Star](https://img.shields.io/github/stars/yaolinli/TimeChat-Online.svg?style=social&label=Star)](https://github.com/yaolinli/TimeChat-Online)<br>[TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos](https://arxiv.org/abs/2504.17343)<br>Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun |  [![Area](https://img.shields.io/badge/Video--LLM-purple)]() |  [![Type](https://img.shields.io/badge/Similarity--Based-green)]()<br> [![Cost](https://img.shields.io/badge/Training--Based-yellow)]() |  [Paper](https://arxiv.org/abs/2504.17343)<br> [GitHub](https://github.com/yaolinli/TimeChat-Online)<br> [Model](https://huggingface.co/wyccccc/TimeChatOnline-7B)<br> [Dataset](https://huggingface.co/datasets/yaolily/TimeChat-Online-139K)<br> | 
245 | </details>
246 | 
247 | 
248 | ---
249 | 
250 | ## 📄 License
251 | 
252 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
253 | 
254 | ---
255 | 
256 | ## 🙏 Acknowledgments
257 | 
258 | This repository is inspired by [Awesome-Efficient-Reasoning-Models](https://github.com/fscdc/Awesome-Efficient-Reasoning-Models), [Awesome-Efficient-LLM](https://github.com/horseee/Awesome-Efficient-LLM/), [Awesome-Context-Engineering](https://github.com/Meirtz/Awesome-Context-Engineering)
259 | 
260 | ## 🧑‍💻 Contributors
261 | 
262 | 👏 Thanks to these contributors for this excellent work！
263 | 
264 | <a href="https://github.com/cokeshao/Awesome-Multimodal-Token-Compression/graphs/contributors">
265 |   <img src="https://contrib.rocks/image?repo=cokeshao/Awesome-Multimodal-Token-Compression" />
266 | </a>
267 | 
268 | ## ✉️ Contact
269 | 
270 | For questions, suggestions, or collaboration opportunities, please feel free to reach out:
271 | 
272 | ✉️ Email:  [shaokele@gmail.com](mailto:shaokele@gmail.com) / [KD.TAO.CT@outlook.com](mailto:KD.TAO.CT@outlook.com)
273 | 
274 | ## ✨ Star History
275 | 
276 | [![Star History Chart](https://api.star-history.com/svg?repos=cokeshao/Awesome-Multimodal-Token-Compression&type=date&legend=top-left)](https://www.star-history.com/#cokeshao/Awesome-Multimodal-Token-Compression&type=date&legend=top-left)
277 | 
278 | [**⬆ Back to top**](#awesome-multimodal-token-compression)
279 | 


--------------------------------------------------------------------------------