├── .gitignore
├── images
│   └── motivation.png
├── LICENSE
├── audio-transformer.md
├── vision-transformer.md
├── audio-llm.md
├── image-llm.md
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | .vscode
3 | *.py
4 | *.csv
5 |
6 | *template*
--------------------------------------------------------------------------------
/images/motivation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cokeshao/Awesome-Multimodal-Token-Compression/HEAD/images/motivation.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 cokeshao
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/audio-transformer.md:
--------------------------------------------------------------------------------
1 |
2 | 2025 AST
4 |
5 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
6 | | --- | --- | --- | :---: |
7 | | []() [](https://github.com/yangdongchao/ALMTokenizer)
[ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling](https://arxiv.org/abs/2504.10344)
Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng | []() | []()
[]() | [Paper](https://arxiv.org/abs/2504.10344)
[GitHub](https://github.com/yangdongchao/ALMTokenizer)
|
8 | | []() [](https://github.com/andylee-24/token-pruning-audio-transformer)
[Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance](https://arxiv.org/abs/2504.01690)
Taehan Lee, Hyukjun Lee | []() | []()
[]() | [Paper](https://arxiv.org/abs/2504.01690)
[GitHub](https://github.com/andylee-24/token-pruning-audio-transformer)
[Model](https://drive.google.com/drive/folders/1cBDXh98m2qDlYLLX3q6xB-gtU1uUtxhK)
|
9 | | []() [](https://github.com/VITA-MLLM/LUCY)
[LUCY: Linguistic Understanding and Control Yielding Early Stage of Her](https://arxiv.org/abs/2501.16327)
Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun | []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.16327)
[GitHub](https://github.com/VITA-MLLM/LUCY)
[Model](https://huggingface.co/VITA-MLLM)
|
10 | 2024 AST
14 |
15 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
16 | | --- | --- | --- | :---: |
17 | | []() [](https://github.com/swarupbehera/FastAST)
[FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation](https://arxiv.org/abs/2406.07676)
Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani | []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.07676)
[GitHub](https://github.com/swarupbehera/FastAST)
|
18 | 2023 AST
22 |
23 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
24 | | --- | --- | --- | :---: |
25 | | []()
[Accelerating Transducers through Adjacent Token Merging](https://arxiv.org/abs/2306.16009)
Yuang Li, Yu Wu, Jinyu Li, Shujie Liu | []() | []()
[]() | [Paper](https://arxiv.org/abs/2306.16009)
|
26 | 2022 AST
30 |
31 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
32 | | --- | --- | --- | :---: |
33 | | []() [](https://github.com/RetroCirce/HTS-Audio-Transformer)
[HTS-AT: A Hierarchical Token-Semantic Audio-Transformer for Sound Classification and Detection](https://arxiv.org/abs/2202.00874)
Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov | []() | []()
[]() | [Paper](https://arxiv.org/abs/2202.00874)
[GitHub](https://github.com/RetroCirce/HTS-Audio-Transformer)
|
34 |
--------------------------------------------------------------------------------
/vision-transformer.md:
--------------------------------------------------------------------------------
1 |
2 | 2025 ViT
4 |
5 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
6 | | --- | --- | --- | :---: |
7 | | []()
[TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation](https://arxiv.org/abs/2508.04058)
Zunhui Xia, Hongxing Li, Libin Lan | []() | []() []() []()
[]() | [Paper](https://arxiv.org/abs/2508.04058)
|
8 | | []()
[Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation](https://arxiv.org/abs/2508.03388)
Yizhe Xiong, Zihan Zhou, Yiwen Liang, Hui Chen, Zijia Lin, Tianxiang Hao, Fan Zhang, Jungong Han, Guiguang Ding | []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.03388)
|
9 | | []() [](https://github.com/mlvlab/Representation-Shift)
[Representation Shift: Unifying Token Compression with FlashAttention](https://arxiv.org/abs/2508.00367)
Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim | []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.00367)
[GitHub](https://github.com/mlvlab/Representation-Shift)
|
10 | | []()
[ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference](https://arxiv.org/abs/2507.16260)
Haoyue Zhang, Jie Zhang, Song Guo | []() | []()
[]() | [Paper](https://arxiv.org/abs/2507.16260)
|
11 | 2024 ViT
15 |
16 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
17 | | --- | --- | --- | :---: |
18 | | []() [](https://github.com/yaolinli/DeCo)
[DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models](https://arxiv.org/abs/2405.20985)
Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou | []() | []()
[]() | [Paper](https://arxiv.org/abs/2405.20985)
[GitHub](https://github.com/yaolinli/DeCo)
|
19 | | []() [](https://github.com/double125/MADTP-plus)
[MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer](https://arxiv.org/abs/2403.02991)
Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2403.02991)
[GitHub](https://github.com/double125/MADTP-plus)
|
20 | 2023 ViT
24 |
25 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
26 | | --- | --- | --- | :---: |
27 | | []() [](https://github.com/csarron/PuMer)
[PuMer: Pruning and Merging Tokens for Efficient Vision Language Models](https://arxiv.org/abs/2305.17530)
Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi | []() | []()
[]() | [Paper](https://arxiv.org/abs/2305.17530)
[GitHub](https://github.com/csarron/PuMer)
|
28 | 2022 ViT
32 |
33 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
34 | | --- | --- | --- | :---: |
35 | | []() [](https://github.com/facebookresearch/ToMe)
[Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461)
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman | []() | []()
[]() []() | [Paper](https://arxiv.org/abs/2210.09461)
[GitHub](https://github.com/facebookresearch/ToMe)
|
36 | 2021 ViT
40 |
41 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
42 | | --- | --- | --- | :---: |
43 | | []()
[Perceiver: General Perception with Iterative Attention](https://arxiv.org/abs/2103.03206)
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira | []() | []()
[]() | [Paper](https://arxiv.org/abs/2103.03206)
|
44 |
--------------------------------------------------------------------------------
/audio-llm.md:
--------------------------------------------------------------------------------
1 |
2 | 2025 Audio
4 |
5 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
6 | | --- | --- | --- | :---: |
7 | | []()
[EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs](https://arxiv.org/abs/2512.10324)
Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen | []() []() | []() | [Paper](https://arxiv.org/abs/2512.10324)
|
8 | | []() [](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)
[When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang | []() []() []() []() | | [Paper](https://arxiv.org/abs/2507.20198)
[GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)
|
9 | | []() [](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)
[Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality](https://arxiv.org/abs/2505.18227)
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik | []() []() []() []() | | [Paper](https://arxiv.org/abs/2505.18227)
[GitHub](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)
|
10 | | []() [](https://github.com/yangdongchao/ALMTokenizer)
[ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling](https://arxiv.org/abs/2504.10344)
Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng | []() | []()
[]() | [Paper](https://arxiv.org/abs/2504.10344)
[GitHub](https://github.com/yangdongchao/ALMTokenizer)
|
11 | | []() [](https://github.com/andylee-24/token-pruning-audio-transformer)
[Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance](https://arxiv.org/abs/2504.01690)
Taehan Lee, Hyukjun Lee | []() | []()
[]() | [Paper](https://arxiv.org/abs/2504.01690)
[GitHub](https://github.com/andylee-24/token-pruning-audio-transformer)
[Model](https://drive.google.com/drive/folders/1cBDXh98m2qDlYLLX3q6xB-gtU1uUtxhK)
|
12 | | []() [](https://github.com/QwenLM/Qwen2.5-Omni)
[Qwen2.5-Omni Technical Report](https://arxiv.org/abs/2503.20215)
Qwen Team | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.20215)
[GitHub](https://github.com/QwenLM/Qwen2.5-Omni)
[Model](https://huggingface.co/collections/Qwen/qwen25-omni-67de1e5f0f9464dc6314b36e)
|
13 | | []() [](https://github.com/JeongHun0716/MMS-LLaMA)
[MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens](https://arxiv.org/abs/2503.11315)
Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro | []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.11315)
[GitHub](https://github.com/JeongHun0716/MMS-LLaMA)
|
14 | | []()
[Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs](https://arxiv.org/abs/2503.06362)
Umberto Cappellazzo, Minsu Kim, Stavros Petridis | []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.06362)
|
15 | | []() [](https://github.com/baichuan-inc/Baichuan-Audio)
[Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction](https://arxiv.org/abs/2502.17239)
Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2502.17239)
[GitHub](https://github.com/baichuan-inc/Baichuan-Audio)
[Model](https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct)
|
16 | | []() [](https://github.com/VITA-MLLM/LUCY)
[LUCY: Linguistic Understanding and Control Yielding Early Stage of Her](https://arxiv.org/abs/2501.16327)
Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun | []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.16327)
[GitHub](https://github.com/VITA-MLLM/LUCY)
[Model](https://huggingface.co/VITA-MLLM)
|
17 | | []() [](https://github.com/ASLP-lab/OSUM)
[OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia](https://arxiv.org/abs/2501.13306)
ASLP@NPU | []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.13306)
[GitHub](https://github.com/ASLP-lab/OSUM)
[Model](https://huggingface.co/ASLP-lab/OSUM)
|
18 | 2024 Audio
22 |
23 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
24 | | --- | --- | --- | :---: |
25 | | []() [](https://github.com/scb-10x/typhoon2-audio)
[Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models](https://arxiv.org/abs/2412.13702)
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai | []() | []()
[]() | [Paper](https://arxiv.org/abs/2412.13702)
[GitHub](https://github.com/scb-10x/typhoon2-audio)
[Model](https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct)
|
26 | | []()
[SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval](https://arxiv.org/abs/2412.12009)
Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen | []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2412.12009)
|
27 | | []() [](https://github.com/dvlab-research/Lyra)
[Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition](https://arxiv.org/abs/2412.09501)
Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2412.09501)
[GitHub](https://github.com/dvlab-research/Lyra)
[Model](https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc)
[Dataset](https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82)
|
28 | | []() [](https://github.com/umbertocappellazzo/Llama-AVSR)
[Large Language Models are Strong Audio-Visual Speech Recognition Learners](https://arxiv.org/abs/2409.12319)
Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic | []() | []()
[]() | [Paper](https://arxiv.org/abs/2409.12319)
[GitHub](https://github.com/umbertocappellazzo/Llama-AVSR)
|
29 | | []() [](https://github.com/ictnlp/LLaMA-Omni)
[LLaMA-Omni: Seamless Speech Interaction with Large Language Models](https://arxiv.org/abs/2409.06666)
Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng | []() | []()
[]() | [Paper](https://arxiv.org/abs/2409.06666)
[GitHub](https://github.com/ictnlp/LLaMA-Omni)
[Model](https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni)
[Dataset](https://huggingface.co/datasets/ICTNLP/InstructS2S-200K)
|
30 | | []() [](https://github.com/QwenLM/Qwen2-Audio)
[Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759)
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou | []() | []()
[]() | [Paper](https://arxiv.org/abs/2407.10759)
[GitHub](https://github.com/QwenLM/Qwen2-Audio)
[Model](https://huggingface.co/collections/Qwen/qwen2-audio-66b628d694096020e0c52ff6)
|
31 | | []() [](https://github.com/bytedance/SALMONN/tree/videosalmonn)
[video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models](https://arxiv.org/abs/2406.15704)
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.15704)
[GitHub](https://github.com/bytedance/SALMONN/tree/videosalmonn)
[Model](https://huggingface.co/tsinghua-ee/Video-SALMONN/tree/main)
|
32 | | []()
[Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding](https://arxiv.org/abs/2406.13275)
Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.13275)
|
33 | | []() [](https://github.com/swarupbehera/FastAST)
[FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation](https://arxiv.org/abs/2406.07676)
Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani | []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.07676)
[GitHub](https://github.com/swarupbehera/FastAST)
|
34 | | []() [](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
[VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs](https://arxiv.org/abs/2406.07476)
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.07476)
[GitHub](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
[Model](https://huggingface.co/collections/DAMO-NLP-SG/videollama2-6669b6b6f0493188305c87ed)
|
35 | | []()
[Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model](https://arxiv.org/abs/2406.03706)
Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li | []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.03706)
|
36 | | []()
[SpeechVerse: A Large-scale Generalizable Audio Language Model](https://arxiv.org/abs/2405.08295)
AWS AI Team | []() | []()
[]() | [Paper](https://arxiv.org/abs/2405.08295)
|
37 | | []()
[An Embarrassingly Simple Approach for LLM with Strong ASR Capacity](https://arxiv.org/abs/2402.08846)
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2402.08846)
|
38 | 2023 Audio
42 |
43 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
44 | | --- | --- | --- | :---: |
45 | | []() [](https://github.com/Render-AI/salmonn)
[SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/abs/2310.13289)
Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2310.13289)
[GitHub](https://github.com/Render-AI/salmonn)
[Model](https://huggingface.co/tsinghua-ee/SALMONN)
|
46 | | []()
[Connecting Speech Encoder and Large Language Model for ASR](https://arxiv.org/abs/2309.13963)
Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2309.13963)
|
47 | | []()
[Prompting Large Language Models with Speech Recognition Abilities](https://arxiv.org/abs/2307.11795)
Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer | []() | []()
[]() | [Paper](https://arxiv.org/abs/2307.11795)
|
48 | | []()
[Accelerating Transducers through Adjacent Token Merging](https://arxiv.org/abs/2306.16009)
Yuang Li, Yu Wu, Jinyu Li, Shujie Liu | []() | []()
[]() | [Paper](https://arxiv.org/abs/2306.16009)
|
49 | | []() [](https://github.com/DAMO-NLP-SG/Video-LLaMA)
[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)
Hang Zhang, Xin Li, Lidong Bing | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2306.02858)
[GitHub](https://github.com/DAMO-NLP-SG/Video-LLaMA)
[Model](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series)
|
50 | 2022 Audio
54 |
55 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
56 | | --- | --- | --- | :---: |
57 | | []() [](https://github.com/RetroCirce/HTS-Audio-Transformer)
[HTS-AT: A Hierarchical Token-Semantic Audio-Transformer for Sound Classification and Detection](https://arxiv.org/abs/2202.00874)
Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov | []() | []()
[]() | [Paper](https://arxiv.org/abs/2202.00874)
[GitHub](https://github.com/RetroCirce/HTS-Audio-Transformer)
|
58 |
--------------------------------------------------------------------------------
/image-llm.md:
--------------------------------------------------------------------------------
1 |
2 | 2025 Image
4 |
5 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
6 | | --- | --- | --- | :---: |
7 | | []() [](https://github.com/deepseek-ai/DeepSeek-OCR)
[DeepSeek-OCR: Contexts Optical Compression](https://arxiv.org/abs/2510.18234)
Haoran Wei, Yaofeng Sun, Yukun Li | []() | []()
[]() | [Paper](https://arxiv.org/abs/2510.18234)
[GitHub](https://github.com/deepseek-ai/DeepSeek-OCR)
[Model](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
|
8 | | []() [](https://github.com/JulietChoo/VisionSelector)
[VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs](https://arxiv.org/abs/2510.16598)
Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha | []() []() | []() | [Paper](https://arxiv.org/abs/2510.16598)
[GitHub](https://github.com/JulietChoo/VisionSelector)
[Model](https://huggingface.co/JulietChoo/VisionSelector-Qwen2.5-VL-7B)
|
9 | | []() [](https://github.com/Chenfei-Liao/VTC-Bench)
[Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods](https://arxiv.org/abs/2510.07143)
Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu | []() []() | | [Paper](https://arxiv.org/abs/2510.07143)
[GitHub](https://github.com/Chenfei-Liao/VTC-Bench)
|
10 | | []()
[Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models](https://arxiv.org/abs/2509.24837)
Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong | []() | []() | [Paper](https://arxiv.org/abs/2509.24837)
|
11 | | []() [](https://github.com/AutoLab-SAI-SJTU/AutoPrune)
[AutoPrune: Each Complexity Deserves a Pruning Policy](https://arxiv.org/abs/2509.23931)
Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang | []() | []() | [Paper](https://arxiv.org/abs/2509.23931)
[GitHub](https://github.com/AutoLab-SAI-SJTU/AutoPrune)
|
12 | | []()
[HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score](https://arxiv.org/abs/2509.23663)
Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel | []() | []()
[]() | [Paper](https://arxiv.org/abs/2509.23663)
|
13 | | []()
[Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance](https://arxiv.org/abs/2509.15704)
Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue | []() | []() []() []()
[]() | [Paper](https://arxiv.org/abs/2509.15704)
|
14 | | []()
[EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression](https://arxiv.org/abs/2509.12159)
Jingyu Xiao, Zhongyi Zhang, Yuxuan Wan, Yintong Huo, Yang Liu, Michael R. Lyu | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2509.12159)
|
15 | | []()
[Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge](https://arxiv.org/abs/2509.09955)
Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat | []() | []()
[]() | [Paper](https://arxiv.org/abs/2509.09955)
|
16 | | []() [](https://github.com/OpenGVLab/InternVL)
[InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://arxiv.org/abs/2508.18265)
InternVL Team | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.18265)
[GitHub](https://github.com/OpenGVLab/InternVL)
[Model](https://huggingface.co/collections/OpenGVLab/internvl35-68ac87bd52ebe953485927fb)
|
17 | | []()
[VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference](https://arxiv.org/abs/2508.17857)
Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2508.17857)
|
18 | | []()
[Revisiting MLLM Token Technology through the Lens of Classical Visual Coding](https://arxiv.org/abs/2508.13460)
Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin | []() []() | | [Paper](https://arxiv.org/abs/2508.13460)
|
19 | | []()
[EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models](https://arxiv.org/abs/2508.11886)
Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.11886)
|
20 | | []()
[CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning](https://arxiv.org/abs/2508.07871)
Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang | []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2508.07871)
|
21 | | []()
[AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance](https://arxiv.org/abs/2508.06084)
Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu | []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.06084)
|
22 | | []()
[Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models](https://arxiv.org/abs/2508.06038)
Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin | []() | []() | [Paper](https://arxiv.org/abs/2508.06038)
|
23 | | []() [](https://github.com/sihany077/VFlowOpt)
[VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization](https://arxiv.org/abs/2508.05211)
Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.05211)
[GitHub](https://github.com/sihany077/VFlowOpt)
|
24 | | []() [](https://github.com/HVision-NKU/GlimpsePrune)
[A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models](https://arxiv.org/abs/2508.01548)
Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou | []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.01548)
[GitHub](https://github.com/HVision-NKU/GlimpsePrune)
[Model](https://huggingface.co/collections/ashun989/glimpseprune-688d8826ef5bd09db6af145e)
|
25 | | []()
[Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models](https://arxiv.org/abs/2508.01236)
Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang | []() | []() | [Paper](https://arxiv.org/abs/2508.01236)
|
26 | | []()
[HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models](https://arxiv.org/abs/2508.00553)
Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2508.00553)
|
27 | | []()
[FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning](https://arxiv.org/abs/2507.23318)
Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang | []() []() | []() | [Paper](https://arxiv.org/abs/2507.23318)
|
28 | | []() [](https://github.com/YuchenLiu98/METEOR)
[METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models](https://arxiv.org/abs/2507.20842)
Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian | []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2507.20842)
[GitHub](https://github.com/YuchenLiu98/METEOR)
|
29 | | []() [](https://github.com/liaolea/TransPrune)
[TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model](https://arxiv.org/abs/2507.20630)
Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2507.20630)
[GitHub](https://github.com/liaolea/TransPrune)
|
30 | | []() [](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)
[When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang | []() []() []() []() | | [Paper](https://arxiv.org/abs/2507.20198)
[GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)
|
31 | | []()
[Efficient Whole Slide Pathology VQA via Token Compression](https://arxiv.org/abs/2507.14497)
Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2507.14497)
|
32 | | []()
[Training-free Token Reduction for Vision Mamba](https://arxiv.org/abs/2507.14042)
Qiankun Ma, Ziyao Zhang, Chi Su, Jie Chen, Zhen Song, Hairong Zheng, Wen Gao | []() | []() | [Paper](https://arxiv.org/abs/2507.14042)
|
33 | | []() [](https://github.com/dvlab-research/VisionThink)
[VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning](https://arxiv.org/abs/2507.13348)
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia | []() | []()
[]() | [Paper](https://arxiv.org/abs/2507.13348)
[GitHub](https://github.com/dvlab-research/VisionThink)
[Model](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)
[Dataset](https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe)
|
34 | | []()
[LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models](https://arxiv.org/abs/2507.02279)
Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2507.02279)
|
35 | | []()
[ToSA: Token Merging with Spatial Awareness](https://arxiv.org/abs/2506.20066)
Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2506.20066)
|
36 | | []() [](https://github.com/Theia-4869/CDPruner)
[Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/abs/2506.10967)
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2506.10967)
[GitHub](https://github.com/Theia-4869/CDPruner)
|
37 | | []()
[Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective](https://arxiv.org/abs/2506.01097)
Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2506.01097)
|
38 | | []() [](https://github.com/EffiVLM-Bench/EffiVLM-Bench)
[EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Language Models](https://arxiv.org/abs/2506.00479)
Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin | []() []() []() | | [Paper](https://arxiv.org/abs/2506.00479)
[GitHub](https://github.com/EffiVLM-Bench/EffiVLM-Bench)
|
39 | | []() [](https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan)
[VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models](https://arxiv.org/abs/2505.22654)
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2505.22654)
[GitHub](https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan)
|
40 | | []()
[Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization](https://arxiv.org/abs/2505.22038)
Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen | []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2505.22038)
|
41 | | []() [](https://github.com/wangqinsi1/2025-ICML-CoreMatching)
[CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models](https://arxiv.org/abs/2505.19235)
Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2505.19235)
[GitHub](https://github.com/wangqinsi1/2025-ICML-CoreMatching)
|
42 | | []() [](https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression)
[Shifting AI Efficiency From Model-Centric to Data-Centric Compression](https://arxiv.org/abs/2505.19147)
Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang | []() []() []() | | [Paper](https://arxiv.org/abs/2505.19147)
[GitHub](https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression)
|
43 | | []() [](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)
[Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality](https://arxiv.org/abs/2505.18227)
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik | []() []() []() []() | | [Paper](https://arxiv.org/abs/2505.18227)
[GitHub](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)
|
44 | | []()
[Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering](https://arxiv.org/abs/2505.10118)
Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu | []() | []()
[]() | [Paper](https://arxiv.org/abs/2505.10118)
|
45 | | []() [](https://github.com/ByteDance-Seed/Seed1.5-VL)
[Seed1.5-VL Technical Report](https://arxiv.org/abs/2505.07062)
Seed Team | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2505.07062)
[GitHub](https://github.com/ByteDance-Seed/Seed1.5-VL)
|
46 | | []() [](https://github.com/MikeWangWZHL/dymu)
[DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs](https://arxiv.org/abs/2504.17040)
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2504.17040)
[GitHub](https://github.com/MikeWangWZHL/dymu)
|
47 | | []() [](https://github.com/orailix/PACT)
[PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models](https://arxiv.org/abs/2504.08966)
Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2504.08966)
[GitHub](https://github.com/orailix/PACT)
|
48 | | []()
[QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA](https://arxiv.org/abs/2504.00654)
Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2504.00654)
|
49 | | []() [](https://github.com/zwl666666/Skip-Vision)
[Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping](https://arxiv.org/abs/2503.21817)
Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan | []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.21817)
[GitHub](https://github.com/zwl666666/Skip-Vision)
|
50 | | []() [](https://github.com/ludc506/InternVL-X)
[InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression](https://arxiv.org/abs/2503.21307)
Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2503.21307)
[GitHub](https://github.com/ludc506/InternVL-X)
[Model](https://huggingface.co/LLCC506/InternVL-X-8B-HD)
|
51 | | []() [](https://github.com/QwenLM/Qwen2.5-Omni)
[Qwen2.5-Omni Technical Report](https://arxiv.org/abs/2503.20215)
Qwen Team | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.20215)
[GitHub](https://github.com/QwenLM/Qwen2.5-Omni)
[Model](https://huggingface.co/collections/Qwen/qwen25-omni-67de1e5f0f9464dc6314b36e)
|
52 | | []()
[TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model](https://arxiv.org/abs/2503.18278)
Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.18278)
|
53 | | []()
[Growing a Twig to Accelerate Large Vision-Language Models](https://arxiv.org/abs/2503.14075)
Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu | []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.14075)
|
54 | | []() [](https://github.com/ShawnTan86/TokenCarve)
[TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models](https://arxiv.org/abs/2503.10501)
Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.10501)
[GitHub](https://github.com/ShawnTan86/TokenCarve)
|
55 | | []() [](https://github.com/vbdi/divprune)
[DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models](https://arxiv.org/abs/2503.02175)
Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2503.02175)
[GitHub](https://github.com/vbdi/divprune)
|
56 | | []() [](https://github.com/AIoT-MLSys-Lab/MEDA)
[MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2502.17599)
Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2502.17599)
[GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)
|
57 | | []() [](https://github.com/ZichenWen1/DART)
[Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More](https://arxiv.org/abs/2502.11494)
Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang | []() []() | []()
[]() []() | [Paper](https://arxiv.org/abs/2502.11494)
[GitHub](https://github.com/ZichenWen1/DART)
|
58 | | []()
[AdaFV: Rethinking of Visual-Language alignment for VLM acceleration](https://arxiv.org/abs/2501.09532)
Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng | []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.09532)
|
59 | | []() [](https://github.com/xuyang-liu16/GlobalCom2)
[Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models](https://arxiv.org/abs/2501.05179)
Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.05179)
[GitHub](https://github.com/xuyang-liu16/GlobalCom2)
|
60 | | []() [](https://github.com/ictnlp/LLaVA-Mini)
[LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token](https://arxiv.org/abs/2501.03895)
Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.03895)
[GitHub](https://github.com/ictnlp/LLaVA-Mini)
[Model](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b)
|
61 | | []() [](https://github.com/anakin-skywalker-Joseph/Folder)
[FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance](https://arxiv.org/abs/2501.02430)
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.02430)
[GitHub](https://github.com/anakin-skywalker-Joseph/Folder)
|
62 | | []() [](https://github.com/jytmelon/G-Prune)
[What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph](https://arxiv.org/abs/2501.02268)
Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou | []() | []()
[]() | [Paper](https://arxiv.org/abs/2501.02268)
[GitHub](https://github.com/jytmelon/G-Prune)
|
63 | 2024 Image
67 |
68 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
69 | | --- | --- | --- | :---: |
70 | | []()
[ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming](https://arxiv.org/abs/2412.20105)
Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu | []() | []()
[]() | [Paper](https://arxiv.org/abs/2412.20105)
|
71 | | []() [](https://github.com/OpenGVLab/PVC)
[PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models](https://arxiv.org/abs/2412.09613)
Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2412.09613)
[GitHub](https://github.com/OpenGVLab/PVC)
[Model](https://huggingface.co/OpenGVLab/PVC-InternVL2-8B)
|
72 | | []() [](https://github.com/dvlab-research/Lyra)
[Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition](https://arxiv.org/abs/2412.09501)
Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2412.09501)
[GitHub](https://github.com/dvlab-research/Lyra)
[Model](https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc)
[Dataset](https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82)
|
73 | | []() [](https://github.com/hulianyuyy/iLLaVA)
[iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models](https://arxiv.org/abs/2412.06263)
Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2412.06263)
[GitHub](https://github.com/hulianyuyy/iLLaVA)
|
74 | | []() [](https://github.com/dvlab-research/VisionZip)
[VisionZip: Longer is Better but Not Necessary in Vision Language Models](https://arxiv.org/abs/2412.04467)
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2412.04467)
[GitHub](https://github.com/dvlab-research/VisionZip)
|
75 | | []() [](https://github.com/Theia-4869/VisPruner)
[Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs](https://arxiv.org/abs/2412.01818)
Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2412.01818)
[GitHub](https://github.com/Theia-4869/VisPruner)
|
76 | | []()
[Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction](https://arxiv.org/abs/2412.00556)
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2412.00556)
|
77 | | []()
[ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models](https://arxiv.org/abs/2412.00447)
Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2412.00447)
|
78 | | []()
[Efficient Multi-modal Large Language Models via Visual Token Grouping](https://arxiv.org/abs/2411.17773)
Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng | []() | []()
[]() | [Paper](https://arxiv.org/abs/2411.17773)
|
79 | | []() [](https://github.com/kawhiiiileo/FiCoCo)
[Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration](https://arxiv.org/abs/2411.17686)
Yuhang Han, Xuyang Liu, Zihan Zhang, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2411.17686)
[GitHub](https://github.com/kawhiiiileo/FiCoCo)
|
80 | | []()
[FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression](https://arxiv.org/abs/2411.14228)
Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo | []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2411.14228)
|
81 | | []()
[AdaCM2: Adaptive Cross‑Modality Memory Reduction](https://arxiv.org/abs/2411.12593)
Yuanbin Man, Ying Huang, Chengming Zhang, Bingzhe Li, Wei Niu, Miao Yin | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2411.12593)
|
82 | | []() [](https://github.com/liuting20/MustDrop)
[Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model](https://arxiv.org/abs/2411.10803)
Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2411.10803)
[GitHub](https://github.com/liuting20/MustDrop)
|
83 | | []() [](https://github.com/Cooperx521/PyramidDrop)
[PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction](https://arxiv.org/abs/2410.17247)
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin | []() []() | []()
[]() []() | [Paper](https://arxiv.org/abs/2410.17247)
[GitHub](https://github.com/Cooperx521/PyramidDrop)
|
84 | | []()
[Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers](https://arxiv.org/abs/2410.14072)
Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi | []() | []()
[]() | [Paper](https://arxiv.org/abs/2410.14072)
|
85 | | []()
[ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification](https://arxiv.org/abs/2410.08584)
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2410.08584)
|
86 | | []() [](https://github.com/Gumpest/SparseVLMs)
[SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference](https://arxiv.org/abs/2410.04417)
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2410.04417)
[GitHub](https://github.com/Gumpest/SparseVLMs)
|
87 | | []() [](https://github.com/rese1f/aurora)
[AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark](https://arxiv.org/abs/2410.03051)
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2410.03051)
[GitHub](https://github.com/rese1f/aurora)
[Model](https://huggingface.co/collections/wchai/auroracap-66d117ffe13bedda96702013)
[Dataset](https://huggingface.co/datasets/wchai/Video-Detailed-Caption)
|
88 | | []() [](https://github.com/LLaVA-VL/LLaVA-NeXT)
[Video Instruction Tuning with Synthetic Data](https://arxiv.org/abs/2410.02713)
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2410.02713)
[GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT)
[Model](https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944)
|
89 | | []() [](https://github.com/NVIDIA/Megatron-LM/tree/NVLM-1.0/examples/multimodal/nvlm)
[NVLM: Open Frontier-Class Multimodal LLMs](https://arxiv.org/abs/2409.11402)
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping | []() | []()
[]() | [Paper](https://arxiv.org/abs/2409.11402)
[GitHub](https://github.com/NVIDIA/Megatron-LM/tree/NVLM-1.0/examples/multimodal/nvlm)
[Model](https://huggingface.co/collections/nvidia/nvlm-10-66e9f407c764a0ee6e37b7f4)
|
90 | | []() [](https://github.com/FreedomIntelligence/TRIM)
[Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs](https://arxiv.org/abs/2409.10994)
Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2409.10994)
[GitHub](https://github.com/FreedomIntelligence/TRIM)
|
91 | | []() [](https://github.com/ywh187/FitPrune)
[Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models](https://arxiv.org/abs/2409.10197)
Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou | []() | []()
[]() | [Paper](https://arxiv.org/abs/2409.10197)
[GitHub](https://github.com/ywh187/FitPrune)
|
92 | | []() [](https://github.com/hasanar1f/HiRED)
[HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models](https://arxiv.org/abs/2408.10945)
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji | []() | []()
[]() | [Paper](https://arxiv.org/abs/2408.10945)
[GitHub](https://github.com/hasanar1f/HiRED)
|
93 | | []() [](https://github.com/LLaVA-VL/LLaVA-NeXT)
[LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326)
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2408.03326)
[GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT)
[Model](https://huggingface.co/collections/lmms-lab/llava-onevision-66a259c3526e15166d6bba37)
|
94 | | []() [](https://github.com/JiuTian-VL/TokenCorrCompressor)
[Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding](https://arxiv.org/abs/2407.14439)
Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie | []() | []()
[]() []() | [Paper](https://arxiv.org/abs/2407.14439)
[GitHub](https://github.com/JiuTian-VL/TokenCorrCompressor)
|
95 | | []() [](https://github.com/CircleRadon/TokenPacker)
[TokenPacker: Efficient Visual Projector for Multimodal LLM](https://arxiv.org/abs/2407.02392)
Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang | []() | []()
[]() | [Paper](https://arxiv.org/abs/2407.02392)
[GitHub](https://github.com/CircleRadon/TokenPacker)
|
96 | | []() [](https://github.com/SUSTechBruce/LOOK-M)
[LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2406.18139)
Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan | []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.18139)
[GitHub](https://github.com/SUSTechBruce/LOOK-M)
|
97 | | []() [](https://github.com/EvolvingLMMs-Lab/LongVA)
[Long Context Transfer from Language to Vision](https://arxiv.org/abs/2406.16852)
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu | []() []() | []() | [Paper](https://arxiv.org/abs/2406.16852)
[GitHub](https://github.com/EvolvingLMMs-Lab/LongVA)
|
98 | | []() [](https://github.com/bytedance/SALMONN/tree/videosalmonn)
[video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models](https://arxiv.org/abs/2406.15704)
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.15704)
[GitHub](https://github.com/bytedance/SALMONN/tree/videosalmonn)
[Model](https://huggingface.co/tsinghua-ee/Video-SALMONN/tree/main)
|
99 | | []() [](https://github.com/Yxxxb/VoCo-LLaMA)
[VoCo-LLaMA: Towards Vision Compression with Large Language Models](https://arxiv.org/abs/2406.12275)
Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2406.12275)
[GitHub](https://github.com/Yxxxb/VoCo-LLaMA)
|
100 | | []() [](https://github.com/mu-cai/matryoshka-mm)
[Matryoshka Multimodal Models](https://arxiv.org/abs/2405.17430)
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2405.17430)
[GitHub](https://github.com/mu-cai/matryoshka-mm)
|
101 | | []() [](https://github.com/lzhxmu/VTW)
[Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference](https://arxiv.org/abs/2405.05803)
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2405.05803)
[GitHub](https://github.com/lzhxmu/VTW)
|
102 | | []() [](https://github.com/OpenGVLab/InternVL)
[How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821)
InternVL Team | []() | []()
[]() | [Paper](https://arxiv.org/abs/2404.16821)
[GitHub](https://github.com/OpenGVLab/InternVL)
[Model](https://huggingface.co/collections/OpenGVLab/internvl15-6675ae031d45e5a07007f260)
|
103 | | []() [](https://github.com/42Shawn/LLaVA-PruMerge)
[LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388)
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2403.15388)
[GitHub](https://github.com/42Shawn/LLaVA-PruMerge)
|
104 | | []() [](https://github.com/pkunlp-icler/FastV)
[An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models](https://arxiv.org/abs/2403.06764)
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2403.06764)
[GitHub](https://github.com/pkunlp-icler/FastV)
|
105 | | []() [](https://github.com/Meituan-AutoML/MobileVLM)
[MobileVLM V2: Faster and Stronger Baseline for Vision Language Model](https://arxiv.org/abs/2402.03766)
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2402.03766)
[GitHub](https://github.com/Meituan-AutoML/MobileVLM)
[Model](https://huggingface.co/mtgv/models)
|
106 | 2023 Image
110 |
111 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
112 | | --- | --- | --- | :---: |
113 | | []() [](https://github.com/Meituan-AutoML/MobileVLM)
[MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices](https://arxiv.org/abs/2312.16886)
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen | []() | []()
[]() | [Paper](https://arxiv.org/abs/2312.16886)
[GitHub](https://github.com/Meituan-AutoML/MobileVLM)
[Model](https://huggingface.co/mtgv/models)
|
114 | | []() [](https://github.com/khanrc/honeybee?tab=readme-ov-file)
[Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742)
Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh | []() | []()
[]() | [Paper](https://arxiv.org/abs/2312.06742)
[GitHub](https://github.com/khanrc/honeybee?tab=readme-ov-file)
|
115 | | []() [](https://github.com/dvlab-research/LLaMA-VID)
[LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models](https://arxiv.org/abs/2311.17043)
Yanwei Li, Chengyao Wang, Jiaya Jia | []() []() | []() []()
[]() | [Paper](https://arxiv.org/abs/2311.17043)
[GitHub](https://github.com/dvlab-research/LLaMA-VID)
[Model](https://huggingface.co/collections/YanweiLi/llama-vid-656741a92f3ec92d7e484dea)
[Dataset](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data/tree/main)
|
116 | | []() [](https://github.com/PKU-YuanGroup/Chat-UniVi)
[Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding](https://arxiv.org/abs/2311.08046)
Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2311.08046)
[GitHub](https://github.com/PKU-YuanGroup/Chat-UniVi)
[Model](https://huggingface.co/collections/Chat-UniVi/chat-univi-66f4265ee4c51e5acf255f2e)
[Dataset](https://github.com/PKU-YuanGroup/Chat-UniVi/blob/main/DATA.md)
|
117 | | []() [](https://github.com/QwenLM/Qwen-VL)
[Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/abs/2308.12966)
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2308.12966)
[GitHub](https://github.com/QwenLM/Qwen-VL)
[Model](https://huggingface.co/Qwen/Qwen-VL)
|
118 | | []() [](https://github.com/DAMO-NLP-SG/Video-LLaMA)
[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)
Hang Zhang, Xin Li, Lidong Bing | []() []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2306.02858)
[GitHub](https://github.com/DAMO-NLP-SG/Video-LLaMA)
[Model](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series)
|
119 | | []() [](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
[InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi | []() | []()
[]() | [Paper](https://arxiv.org/abs/2305.06500)
[GitHub](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
|
120 | | []() [](https://github.com/X-PLUG/mPLUG-Owl)
[mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https://arxiv.org/abs/2304.14178)
DAMO Team | []() | []()
[]() | [Paper](https://arxiv.org/abs/2304.14178)
[GitHub](https://github.com/X-PLUG/mPLUG-Owl)
|
121 | | []() [](https://github.com/Vision-CAIR/MiniGPT-4)
[MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592)
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny | []() | []()
[]() | [Paper](https://arxiv.org/abs/2304.10592)
[GitHub](https://github.com/Vision-CAIR/MiniGPT-4)
[Model](https://huggingface.co/Vision-CAIR/MiniGPT-4)
[Dataset](https://github.com/Vision-CAIR/MiniGPT-4/tree/main/dataset)
|
122 | | []() [](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
[BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597)
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi | []() | []()
[]() | [Paper](https://arxiv.org/abs/2301.12597)
[GitHub](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
|
123 | 2022 Image
127 |
128 | | **Title & Authors** | **Areas** | **Tags** | **Links** |
129 | | --- | --- | --- | :---: |
130 | | []()
[Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
DeepMind Team | []() []() | []()
[]() | [Paper](https://arxiv.org/abs/2204.14198)
|
131 |