└── README.md

/README.md:
--------------------------------------------------------------------------------
**Papers of Visual Signal Coding for Machine**

**Purpose**: We aim to provide a summary of papers on visual signal coding for machines. More papers will be added over time.

University of Science and Technology of China (USTC), [Intelligent Media Computing Lab](https://scholar.google.com/citations?user=1ayDJfsAAAAJ&hl=en&oi=ao)


**📌 About new works.** If you would like your work (e.g., a link to a paper or project) on visual signal coding for machines to be included in this repository, please raise an issue or email us. We will add it as soon as possible.


## Papers for Visual Coding for Machine
### Table of contents

- [Survey & Theory](#Survey-&-Theory)
- [Compress Then Analysis](#Compress-Then-Analysis)
- [Feature Compression](#Feature-Compression)
- [Joint Human and Machine Vision](#Joint-Human-and-Machine-Vision)
- [Vision Model Token Compression](#Vision-Model-Token-Compression)


### Survey-&-Theory
| Model | Paper | First Author | Note | Venue | Data | Project |
| :--- | :---: | :---: | :--: | :--: | :--: | :--: |
| | [A Rate-Distortion-Classification Approach for Lossy Image Compression](https://arxiv.org/pdf/2405.03500) | Yuefeng Zhang | | PrePrint'24 | | |
| | [Video Coding for Machines: Compact Visual Representation Compression for Intelligent Collaborative Analytics](https://ieeexplore.ieee.org/abstract/document/10440522) | Wenhan Yang, Haofeng Huang and Yueyu Hu | survey | TPAMI2024 | Video | |
| | [Rate-Distortion Theory in Coding for Machines and its Applications](https://arxiv.org/pdf/2305.17295.pdf) | Alon Harell | | PrePrint'23 | | |
| | [Rate-Distortion in Image Coding for Machines](https://ieeexplore.ieee.org/abstract/document/10018035) | Alon Harell | | PCS2022 | | |
| | [Lossy Compression for Lossless Prediction](https://openreview.net/pdf?id=wZrOOO9XBn) | Yann Dubois | | NeurIPS 2021 Spotlight | | [Code](https://github.com/YannDubs/lossyless) |
| | [Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics](https://ieeexplore.ieee.org/abstract/document/9180095) | Lingyu Duan | survey | TIP2020 | Video | |
| | [On The Classification-Distortion-Perception Tradeoff](https://proceedings.neurips.cc/paper_files/paper/2019) | Dong Liu | | NeurIPS2019 | | |

### Compress-Then-Analysis
| Model | Paper | First Author | Note | Venue | Data | Project |
| :--- | :---: | :---: | :--: | :--: | :--: | :--: |
| DT-JRD | [DT-JRD: Deep Transformer based Just Recognizable Difference Prediction Model for Video Coding for Machines](https://arxiv.org/pdf/2411.09308) | Junqi Liu | | TMM | Video | |
| FSIC | [FSIC: Frequency-separated image compression for small object detection](https://www.sciencedirect.com/science/article/pii/S1051200424004470) | Chengjie Dai | Frequency decomposition | Digital Signal Processing2025 | Image | |
| SA-ICM | [Image Coding for Machines with Edge Information Learning Using Segment Anything](https://arxiv.org/pdf/2403.04173) | Takahiro Shindo | | ICIP2024 | Image | |
| | [Remote Sensing Image Coding for Machines on Semantic Segmentation via Contrastive Learning](https://ieeexplore.ieee.org/abstract/document/10716527) | Junxi Zhang | Remote Sensing | TGRS2024 | Image | |
| Delta-ICM | [Delta-ICM: Entropy Modeling with Delta Function for Learned Image Compression](https://arxiv.org/abs/2410.07669) | Takahiro Shindo | | PrePrint'24 | Image | |
| Free-VSC | [Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression](https://arxiv.org/pdf/2409.11718) | Yuan Tian | unsupervised | PrePrint'24 | Video | |
| | [Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs](https://www.arxiv.org/pdf/2408.08575) | Jinming Liu | using MLLM | PrePrint'24 | Image | |
| | [High Efficiency Image Compression for Large Visual-Language Models](https://www.arxiv.org/pdf/2407.17060) | Binzhe Li | for MLLM | PrePrint'24 | Image | |
| | [Feature-Preserving Rate-Distortion Optimization in Image Coding for Machines](https://arxiv.org/pdf/2408.07028) | Samuel Fernández Menduiña | Based on AVC | PrePrint'24 | Image | |
| | [Saliency Map-Guided End-to-End Image Coding for Machines](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10574324) | Bo Peng | | IEEE Signal Processing Letters 2024 | Image | |
| | [Competitive Learning for Achieving Content-specific Filters in Video Coding for Machines](https://arxiv.org/pdf/2406.12367) | Honglei Zhang | | PrePrint'24 | Video | |
| | [A Coding Framework and Benchmark towards Low-Bitrate Video Understanding](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10440520) | Yuan Tian | | TPAMI2024 | Video | |
| | [Task-Aware Encoder Control for Deep Video Compression](https://arxiv.org/pdf/2404.04848.pdf) | Xingtong Ge | Encoder Control | CVPR2024 | Video | |
| | [Super-High-Fidelity Image Compression via Hierarchical-ROI and Adaptive Quantization](https://arxiv.org/pdf/2403.13030.pdf) | Jixiang Luo | benefit for ICM | PrePrint'24 | Image | |
| SegPIC | [Region-Adaptive Transform with Segmentation Prior for Image Compression](https://arxiv.org/pdf/2403.00628.pdf) | Yuxi Liu | benefit for ICM | PrePrint'24 | Image | [Code](https://github.com/GityuxiLiu/SegPIC-for-Image-Compression) |
| VNVC | [VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision](https://ieeexplore.ieee.org/abstract/document/10411051) | Xihua Sheng | | TPAMI2024 | Video | |
| | [Task-Switchable Pre-Processor for Image Compression for Multiple Machine Vision Tasks](https://ieeexplore.ieee.org/abstract/document/10379180) | Mingyi Yang | multi-task | TCSVT2024 | Image | |
| | [Unified Architecture Adaptation for Compressed Domain Semantic Inference](https://ieeexplore.ieee.org/abstract/document/10379180) | Zhihao Duan | | TCSVT2023 | Image | |
| SMC | [Non-Semantics Suppressed Mask Learning for Unsupervised Video Semantic Compression](https://openaccess.thecvf.com/content/ICCV2023/papers/Tian_Non-Semantics_Suppressed_Mask_Learning_for_Unsupervised_Video_Semantic_Compression_ICCV_2023_paper.pdf) | Yuan Tian | | ICCV2023 | Video | |
| TransTIC | [TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_TransTIC_Transferring_Transformer-based_Image_Compression_from_Human_Perception_to_Machine_ICCV_2023_paper.pdf) | Yi-Hsin Chen | prompt | ICCV2023 | Image | [Code](https://github.com/NYCU-MAPL/TransTIC) |
| MCM | [You Can Mask More For Extremely Low-Bitrate Image Compression](https://arxiv.org/pdf/2306.15561.pdf) | Anqi Li | benefit for ICM | PrePrint'23 | Image | [Code](https://github.com/lianqi1008/MCM) |
| | [Machine Perception-Driven Image Compression: A Layered Generative Approach](https://arxiv.org/pdf/2304.06896.pdf) | Yuefeng Zhang | | PrePrint'23 | Image | |
| DMIC | [Diagnosis-oriented Medical Image Compression with Efficient Transfer Learning](https://ieeexplore.ieee.org/abstract/document/10402638) | Guangqi Xie, Xin Li | RL | VCIP2023 best paper | Medical Data | |
| | [Towards Efficient Learned Image Coding for Machines via Saliency-Driven Rate Allocation](https://ieeexplore.ieee.org/abstract/document/10402607) | Zixiang Zhang | | VCIP2023 | Image | |
| | [Composable Image Coding for Machine via Task-oriented Internal Adaptor and External Prior](https://ieeexplore.ieee.org/abstract/document/10402659) | Jinming Liu | adapter | VCIP2023 | Image | |
| | [Image Coding for Machines based on Non-Uniform Importance Allocation](https://ieeexplore.ieee.org/abstract/document/10402758) | Yunpeng Qi | | VCIP2023 | Image | |
| | [Saliency-Driven Hierarchical Learned Image Coding for Machines](https://ieeexplore.ieee.org/abstract/document/10096674) | Kristian Fischer | | ICASSP2023 | Image | |
| Neural-Syntax | [Neural Data-Dependent Transform for Learned Image Compression](https://openaccess.thecvf.com/content/CVPR2022/papers/Wang_Neural_Data-Dependent_Transform_for_Learned_Image_Compression_CVPR_2022_paper.pdf) | Dezhao Wang | benefit for ICM | CVPR2022 | Image | [Code](https://github.com/Dezhao-Wang/Neural-Syntax-Code) [Project](https://dezhao-wang.github.io/Neural-Syntax-Website/) |
| HRLVSC | [Hierarchical Reinforcement Learning Based Video Semantic Coding for Segmentation](https://ieeexplore.ieee.org/abstract/document/10008806) | Guangqi Xie | RL | VCIP2022 | Video | |
| | [Boosting Neural Image Compression for Machines Using Latent Space Masking](https://ieeexplore.ieee.org/abstract/document/9845478) | Kristian Fischer | | TCSVT2022 | Image | [Code](https://github.com/FAU-LMS/NCN_for_M2M) |
| | [Perceptual Video Coding for Machines via Satisfied Machine Ratio Modeling](https://arxiv.org/pdf/2211.06797.pdf) | Qi Zhang | | PrePrint'22 | Video | [Code](https://github.com/ywwynm/SMR) |
| | [Preprocessing Enhanced Image Compression for Machine Vision](https://arxiv.org/pdf/2206.05650.pdf) | Guo Lu | | PrePrint'22 | Image | |
| QmapCompression | [Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform](https://openaccess.thecvf.com/content/ICCV2021/papers/Song_Variable-Rate_Deep_Image_Compression_Through_Spatially-Adaptive_Feature_Transform_ICCV_2021_paper.pdf) | Myungseo Song | benefit for ICM | ICCV2021 | Image | [Code](https://github.com/micmic123/QmapCompression?tab=readme-ov-file) |
| | [A Novel Video Coding Strategy in HEVC for Object Detection](https://ieeexplore.ieee.org/abstract/document/9343857) | Qi Cai | | TCSVT2021 | Video | |
| | [Task-Driven Semantic Coding via Reinforcement Learning](https://ieeexplore.ieee.org/abstract/document/9472999) | Xin Li | RL | TIP2021 | Image | |
| | [Learned Image Coding for Machines: A Content-Adaptive Approach](https://ieeexplore.ieee.org/abstract/document/9428224) | Nam Le | | ICME2021 | Image | |
| | [Visual Analysis Motivated Rate-Distortion Model for Image Coding](https://ieeexplore.ieee.org/abstract/document/9428417) | Zhimeng Huang | | ICME2021 | Image | |
| | [Image coding for machines: an end-to-end learned approach](https://ieeexplore.ieee.org/abstract/document/9414465) | Nam Le | | ICASSP2021 | Image | |
| | [High Efficiency Compression for Object Detection](https://ieeexplore.ieee.org/abstract/document/8462653) | Hyomin Choi | | ICASSP2018 | Image | |
| | [Faster Neural Networks Straight from JPEG](https://papers.nips.cc/paper_files/paper/2018/file/7af6266cc52234b5aa339b16695f7fc4-Paper.pdf) | Lionel Gueguen | | NeurIPS2018 | Image | [Code](https://github.com/uber-research/jpeg2dct) |
| | [Towards Image Understanding from Deep Compression Without Decoding](https://openreview.net/pdf?id=HkXWCMbRW) | Robert Torfason | | ICLR2018 | Image | |

### Feature-Compression
| Model | Paper | First Author | Note | Venue | Data | Project |
| :--- | :---: | :---: | :--: | :--: | :--: | :--: |
| | [Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding](https://web3.arxiv.org/pdf/2407.11224) | Danish Nazir | | ECCV2024 | Image | |
| | [ComNeck: Bridging Compressed Image Latents and Multimodal LLMs via Universal Transform-Neck](https://arxiv.org/pdf/2407.19651) | Chia-Hao Kao | for MLLM | PrePrint'24 | Image | |
| | [Reconstruction-free Image Compression for Machine Vision via Knowledge Transfer](https://dl.acm.org/doi/abs/10.1145/3678471) | Hanyue Tu | | | Image | |
| | [Masked Feature Compression for Object Detection](https://www.mdpi.com/2227-7390/12/12/1848) | Chengjie Dai | | Mathematics 2024 | Image | |
| | [Texture-guided Coding for Deep Features](https://arxiv.org/pdf/2405.19669) | Lei Xiong | | PrePrint'24 | Image | |
| | [Split Computing With Scalable Feature Compression for Visual Analytics on the Edge](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10540267) | Zhongzheng Yuan | | TMM 2024 | Image | |
| | [Hierarchical Image Feature Compression for Machines via Feature Sparsity Learning](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10502180) | Ding Ding | | IEEE Signal Processing 2024 | Image | |
| var-feat-comp | [Flexible Variable-Rate Image Feature Compression for Edge-Cloud Systems](https://ieeexplore.ieee.org/abstract/document/10222898) | Md Adnan Faisal Hossain | variable-rate | ICME workshop2023 | Image | [Code](https://github.com/adnan-hossain/var-feat-comp) |
| NEC | [Neural Embedding Compression For Efficient Multi-Task Earth Observation Modelling](https://arxiv.org/pdf/2403.17886.pdf) | Carlos Gomes | multi-task | PrePrint'24 | Earth Observation Data | |
| | [Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models](https://openaccess.thecvf.com/content/CVPR2023/papers/Ahuja_Neural_Rate_Estimator_and_Unsupervised_Learning_for_Efficient_Distributed_Image_CVPR_2023_paper.pdf) | Nilesh Ahuja | | CVPR2023 | Image | [Code](https://github.com/IntelLabs/spic) |
| | [Scalable Feature Compression for Edge-Assisted Object Detection Over Time-Varying Networks](https://openreview.net/forum?id=b9lQw75UugS) | Zhongzheng Yuan | | MLSys workshop2023 | Image | |
| | [End-to-End Learnable Multi-Scale Feature Compression for VCM](https://ieeexplore.ieee.org/abstract/document/10210338) | Yeongwoong Kim | | TCSVT2023 | Video | |
| Prompt-ICM | [Prompt-ICM: A Unified Framework towards Image Coding for Machines with Task-driven Prompts](https://arxiv.org/pdf/2305.02578) | Ruoyu Feng & Jinming Liu | | PrePrint'23 | Image | |
| | [Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach](https://dl.acm.org/doi/abs/10.1145/3581783.3611851) | Sha Guo | | ACMMM2023 | Image | |
| | [Semantic Segmentation In Learned Compressed Domain](https://ieeexplore.ieee.org/abstract/document/10018036) | Jinming Liu | | PCS2022(Best Paper Award Finalists) | Image | |
| edge-cloud-rac | [Efficient Feature Compression for Edge-Cloud Systems](https://ieeexplore.ieee.org/document/10018075) | Zhihao Duan | | PCS2022(Best Paper Award Finalists) | Image | [Code](https://github.com/duanzhiihao/edge-cloud-rac) |
| Omni-ICM | [Image Coding for Machines with Omnipotent Feature Learning](https://link.springer.com/chapter/10.1007/978-3-031-19836-6_29) | Ruoyu Feng | | ECCV2022 | Image | |
| | [A Low-Complexity Approach to Rate-Distortion Optimized Variable Bit-Rate Compression for Split DNN Computing](https://ieeexplore.ieee.org/abstract/document/9956232) | Parual Datta | variable-rate | ICPR2022 | Image | |
| | [Improving Multiple Machine Vision Tasks in the Compressed Domain](https://ieeexplore.ieee.org/abstract/document/9956532) | Jinming Liu | | ICPR2022 | Image | |
| | [Learning from the CNN-based Compressed Domain](https://openaccess.thecvf.com/content/WACV2022/html/Wang_Learning_From_the_CNN-Based_Compressed_Domain_WACV_2022_paper.html) | Zhenzhen Wang | | WACV2022 | Image | |
| | [Supervised Compression for Resource-Constrained Edge Computing Systems](https://openaccess.thecvf.com/content/WACV2022/papers/Matsubara_Supervised_Compression_for_Resource-Constrained_Edge_Computing_Systems_WACV_2022_paper.pdf) | Yoshitomo Matsubara | | WACV2022 | Image | [Code](https://github.com/yoshitomo-matsubara/supervised-compression) |
| | [Bridging the Gap Between Image Coding for Machines and Humans](https://ieeexplore.ieee.org/abstract/document/9897916) | Nam Le | | ICIP2022 | Image | |
| | [Enhancing Image Coding for Machines with Compressed Feature Residuals](https://ieeexplore.ieee.org/document/9666106) | Joni Seppälä | | ISM2021 | Image | |
| | [Learning in Compressed Domain for Faster Machine Vision Tasks](https://ieeexplore.ieee.org/abstract/document/9675369) | Jinming Liu | | VCIP2021 | Image | |
| | [Learning in the Frequency Domain](https://openaccess.thecvf.com/content_CVPR_2020/html/Xu_Learning_in_the_Frequency_Domain_CVPR_2020_paper.html) | Kai Xu | | CVPR2020 | Image | [Code](https://github.com/kaix90/DCTNet) |
| | [Lossy Intermediate Deep Learning Feature Compression and Evaluation](https://dl.acm.org/doi/abs/10.1145/3343031.3350849) | Zhuo Chen | | ACMMM2019 | Image | |
| | [Toward Intelligent Sensing: Intermediate Deep Feature Compression](https://ieeexplore.ieee.org/abstract/document/8848858) | Zhuo Chen | | TIP2019 | Image | |
| | [End-to-End Optimized ROI Image Compression](https://ieeexplore.ieee.org/document/8943263) | Chunlei Cai | | TIP2019 | Image | |
| | [Deep Feature Compression for Collaborative Object Detection](https://ieeexplore.ieee.org/abstract/document/8451100) | Hyomin Choi | | ICIP2018 | Image | |
| | [Near-Lossless Deep Feature Compression for Collaborative Intelligence](https://arxiv.org/pdf/1804.09963) | Hyomin Choi | | MMSP2018 | Image | |

### Joint-Human-and-Machine-Vision
| Model | Paper | First Author | Note | Venue | Data | Project |
| :--- | :---: | :---: | :--: | :--: | :--: | :--: |
| | [An Efficient Adaptive Compression Method for Human Perception and Machine Vision Tasks](https://arxiv.org/pdf/2501.04329) | Lei Liu | | PrePrint'25 | Image | |
| | [A Unified Image Compression Method for Human Perception and Multiple Vision Tasks](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/09009.pdf) | Sha Guo and Lin Sui | multi-task | ECCV2024 | Image | |
| | [Rate-distortion cognitive controllable versatile neural image compression](https://arxiv.org/pdf/2407.11700) | Jinming Liu | | ECCV2024 | Image | |
| Adapt-ICMH | [Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation](https://arxiv.org/pdf/2407.09853) | Han Li | Adapter | ECCV2024 | Image | [Code](https://github.com/qingshi9974/ECCV2024-AdpatICMH) |
| | [Machine Perception-Driven Facial Image Compression: A Layered Generative Approach]() | Yuefeng Zhang | | TCSVT2024 | Image | |
| | [Towards Task-Compatible Compressible Representations](https://arxiv.org/pdf/2405.10244) | Anderson de Andrade | | ICME Workshop2024 | Image | |
| | [Scalable Image Coding for Humans and Machines Using Feature Fusion Network](https://arxiv.org/pdf/2405.09152) | Takahiro Shindo | | PrePrint'24 | Image | |
| | [Scalable Human-Machine Point Cloud Compression](https://arxiv.org/pdf/2402.12532v1.pdf) | Mateen Ulhaq | | PrePrint'24 | Pointcloud | |
| GIT-SSIC | [Semantically Structured Image Compression via Irregular Group-Based Decoupling](https://openaccess.thecvf.com/content/ICCV2023/html/Feng_Semantically_Structured_Image_Compression_via_Irregular_Group-Based_Decoupling_ICCV_2023_paper.html) | Ruoyu Feng and Yixin Gao | | ICCV2023 | Image | [Code](https://github.com/Amygyx/GIT-SSIC) |
| | [Adaptive Human-Centric Video Compression for Humans and Machines](https://openaccess.thecvf.com/content/CVPR2023W/NTIRE/papers/Jiang_Adaptive_Human-Centric_Video_Compression_for_Humans_and_Machines_CVPRW_2023_paper.pdf) | Wei Jiang | VQ | CVPR workshop 2023 | Video | |
| | [Learned point cloud compression for classification](https://ieeexplore.ieee.org/abstract/document/10337709) | Mateen Ulhaq | | MMSP2023 | Pointcloud | |
| | [Learned Disentangled Latent Representations for Scalable Image Coding for Humans and Machines](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10125297) | Ezgi Ozyılkan | | DCC2023 | Image | |
| | [Learned Scalable Video Coding For Humans and Machines](https://arxiv.org/pdf/2307.08978.pdf) | Hadi Hadizadeh | | PrePrint'23 | Video | |
| | [Scalable Face Image Coding via StyleGAN Prior: Toward Compression for Human-Machine Collaborative Vision](https://ieeexplore.ieee.org/abstract/document/10372532) | Qi Mao | | TIP2023 | Image | |
| ICMH-Net | [ICMH-Net: Neural Image Compression Towards both Machine Vision and Human Vision](https://dl.acm.org/doi/abs/10.1145/3581783.3612041) | Lei Liu | | ACM MM2023 | Image | |
| DeepSVC | [DeepSVC: Deep Scalable Video Coding for Both Machine and Human Vision](https://dl.acm.org/doi/abs/10.1145/3581783.3612500) | Hongbin Lin | | ACM MM2023 | Video | |
| | [Peering into The Sketch: Ultra-Low Bitrate Face Compression for Joint Human and Machine Perception](https://dl.acm.org/doi/abs/10.1145/3581783.3613799) | Yudong Mao | | ACM MM2023 | Image | |
| | [Sketch Assisted Face Image Coding for Human and Machine Vision: A Joint Training Approach](https://ieeexplore.ieee.org/abstract/document/10082973) | Xin Fang | | TCSVT2023 | Image | |
| | [Base Layer Efficiency in Scalable Human-Machine Coding](https://ieeexplore.ieee.org/abstract/document/10223087) | Yalda Foroutan | | ICIP2023 | Image | |
| | [Scalable Image Coding for Humans and Machines](https://ieeexplore.ieee.org/abstract/document/9741390) | Hyomin Choi | | TIP2022 | Image | |
| | [HMFVC: A Human-Machine Friendly Video Compression Scheme](https://ieeexplore.ieee.org/abstract/document/9894405) | Zhimeng Huang | | TCSVT2022 | Video | |
| | [Bridging the gap between image coding for machines and humans](https://ieeexplore.ieee.org/abstract/document/9897916) | Nam Le | | ICIP2022 | Image | |
| | [Towards End-to-End Image Compression and Analysis with Transformers](https://arxiv.org/pdf/2112.09300) | Yuanchao Bai & Xu Yang | | AAAI2022 | Image | [Code](https://github.com/BYchao100/Towards-Image-Compression-and-Analysis-with-Transformers/tree/main?tab=readme-ov-file#towards-end-to-end-image-compression-and-analysis-with-transformers) |
| | [Learned Image Compression for Machine Perception](https://arxiv.org/pdf/2111.02249) | Felipe Codevilla | | PrePrint'21 | Image | |
| | [Semantics-to-Signal Scalable Image Compression with Learned Revertible Representations](https://link.springer.com/article/10.1007/s11263-021-01491-7) | Kang Liu | | IJCV2021 | Image | |
| | [Towards Analysis-Friendly Face Representation With Scalable Feature and Texture Compression](https://ieeexplore.ieee.org/document/9473014) | Shurun Wang | | TMM2021 | Image | |
| SSSIC | [SSSIC: Semantics-to-Signal Scalable Image Coding With Learned Structural Representations](https://ieeexplore.ieee.org/document/9585549) | Ning Yan | | TIP2021 | Image | |
| | [Towards coding for human and machine vision: Scalable face image coding](https://ieeexplore.ieee.org/abstract/document/9385898) | Shuai Yang | | TMM2021 | Image | |
| | [Semantic Scalable Image Compression with Cross-Layer Priors](https://dl.acm.org/doi/abs/10.1145/3474085.3475533) | Hanyue Tu | | ACM MM2021 | Image | |
| | [Latent-space scalability for multi-task collaborative intelligence](https://arxiv.org/pdf/2105.10089) | Hyomin Choi | | ICIP2021 | Image | |
| SSIC | [Semantic Structured Image Coding Framework for Multiple Intelligent Applications](https://ieeexplore.ieee.org/abstract/document/9281078) | Simeng Sun | | TCSVT2020 | Image | |
| | [Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach](https://arxiv.org/abs/2001.02915) | Yueyu Hu | | ICME2020 | Image | [Project](https://williamyang1991.github.io/projects/VCM-Face/) |

### Vision Model Token Compression
| Model | Paper | First Author | Note | Venue | Data | Project |
| :--- | :---: | :---: | :--: | :--: | :--: | :--: |
| | [Towards Semantic Equivalence of Tokenization in Multimodal LLM](https://arxiv.org/abs/2406.05127) | Shengqiong Wu | Tokenizer | PrePrint'24 | Image | |
| PruMerge | [LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388) | Yuzhang Shang | VL-connector | PrePrint'24 | Image & Video | [Code](https://github.com/42Shawn/LLaVA-PruMerge) |
| AVG-LLaVA | [AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity](https://arxiv.org/abs/2410.02745) | Zhibin Lan | VL-connector | PrePrint'24 | Image | [Code](https://github.com/DeepLearnXMU/AVG-LLaVA) |
| | [Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information](https://arxiv.org/abs/2409.01179) | Yi Chen | VL-connector | PrePrint'24 | Image | |
| HiRED | [HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments](https://arxiv.org/abs/2408.10945) | Kazi Hasan Ibn Arif | VL-connector | PrePrint'24 | High Resolution Image | |
| DeCo | [DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models](https://arxiv.org/abs/2405.20985) | Linli Yao | VL-connector | PrePrint'24 | Image | [Code](https://github.com/yaolinli/DeCo) |
| CrossGET | [CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers](https://arxiv.org/abs/2305.17455) | Dachuan Shi | Tokenizer | ICML2024 | Image | [Code](https://github.com/sdc17/CrossGET) |
| DocPedia | [DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding](https://arxiv.org/abs/2311.11810) | Hao Feng | Tokenizer | PrePrint'23 | Document | |
| Honeybee | [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://openaccess.thecvf.com/content/CVPR2024/papers/Cha_Honeybee_Locality-enhanced_Projector_for_Multimodal_LLM_CVPR_2024_paper.pdf) | Junbum Cha, Wooyoung Kang and Jonghwan Mun | VL-connector | CVPR2024 Highlight | | [Code](https://github.com/khanrc/honeybee) |
| InternVL2 | [How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821) | Zhe Chen | VL-connector | PrePrint'24 | | [Code](https://github.com/OpenGVLab/InternVL) |
| | [Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs](https://arxiv.org/abs/2409.10994) | Dingjie Song | VL-connector | PrePrint'24 | | |
| | [Matryoshka Multimodal Models](https://arxiv.org/abs/2405.17430) | Mu Cai | VL-connector | PrePrint'24 | | [Code](https://github.com/mu-cai/matryoshka-mm) |
| MM1 | [MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training](https://arxiv.org/abs/2403.09611) | Brandon McKinzie | Ablations on each element | ECCV2024 | Image | |
| LDPv2 | [MobileVLM V2: Faster and Stronger Baseline for Vision Language Model](https://arxiv.org/abs/2402.03766) | Xiangxiang Chu | VL-connector | PrePrint'24 | Image | [Code](https://github.com/Meituan-AutoML/MobileVLM) |
| mPLUG-DocOwl2 | [High-resolution Compressing for OCR-free Multi-page Document Understanding](https://arxiv.org/abs/2402.03766) | Anwen Hu | VL-connector | PrePrint'24 | Document | [Code](https://github.com/X-PLUG/mPLUG-DocOwl) |
| TextHawk | [TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models](https://arxiv.org/abs/2404.09204) | Ya-Qi Yu | VL-connector | PrePrint'24 | Image | [Code](https://github.com/yuyq96/TextHawk) |
| TextMonkey | [TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document](https://arxiv.org/abs/2403.04473) | Yuliang Liu | VL-connector | PrePrint'24 | Document | [Code](https://github.com/Yuliang-Liu/Monkey) |
| TokenCorrCompressor | [Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding](https://arxiv.org/abs/2407.14439) | Renshan Zhang | VL-connector | PrePrint'24 | Image & Video | [Code](https://github.com/JiuTian-VL/TokenCorrCompressor) |
| ToMe | [Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461) | Daniel Bolya | Tokenizer | ICLR2023 notable top 5% | Image & Video | [Code](https://github.com/facebookresearch/ToMe) |
| TokenPacker | [TokenPacker: Efficient Visual Projector for Multimodal LLM](https://arxiv.org/abs/2407.02392) | Wentong Li | VL-connector | PrePrint'24 | Image | [Code](https://github.com/CircleRadon/TokenPacker) |
| VidToMe | [VidToMe: Video Token Merging for Zero-Shot Video Editing](https://vidtome-diffusion.github.io/VidToMe_Arxiv.pdf) | Xirui Li | Tokenizer | CVPR2024 | Video | [Code](https://github.com/lixirui142/VidToMe) |
| VoCo-LLaMA | [VoCo-LLaMA: Towards Vision Compression with Large Language Models](https://arxiv.org/abs/2406.12275) | Xubing Ye | VL-connector | PrePrint'24 | Image & Video | [Code](https://github.com/Yxxxb/VoCo-LLaMA) |
| FastV | [An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models](https://arxiv.org/abs/2403.06764) | Liang Chen | VL-connector | ECCV2024 oral | Image | [Code](https://github.com/pkunlp-icler/FastV) |
| TempMe | [TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval](https://arxiv.org/abs/2409.01156) | Leqi Shen | Tokenizer | PrePrint'24 | Image | |
--------------------------------------------------------------------------------