Video generation has advanced rapidly, and evaluation methods have improved alongside it, yet assessing the motion in generated videos remains a major challenge. Specifically, there are two key issues: (1) current motion metrics do not fully align with human perception; (2) existing motion prompts cover a limited range of motion. Based on these findings, we introduce VMBench, a comprehensive Video Motion Benchmark that provides perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: (1) Perception-Driven Motion Evaluation Metrics: we identify five dimensions of human perception in motion video assessment and develop fine-grained evaluation metrics for them, providing deeper insight into models' strengths and weaknesses in motion quality. (2) Meta-Guided Motion Prompt Generation: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. (3) Human-Aligned Validation Mechanism: we provide human preference annotations to validate our benchmark, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the motion quality of videos has been evaluated from the perspective of alignment with human perception.
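The human-alignment claim above boils down to rank correlation between metric scores and human ratings. Below is a minimal sketch of that check, assuming per-video metric scores and human preference annotations are already available; the function name and toy data are illustrative, not VMBench's actual evaluation code.

```python
# Hedged sketch of the human-alignment check: Spearman's rank correlation
# between a motion metric's scores and human preference ratings.
# `alignment_score` and the toy data are illustrative, not VMBench's API.
from scipy.stats import spearmanr

def alignment_score(metric_scores: list[float], human_ratings: list[int]) -> float:
    """Spearman's rho between a metric and human ratings for the same videos."""
    rho, _p_value = spearmanr(metric_scores, human_ratings)
    return rho

# Toy example: five videos scored by a metric and rated 1-5 by annotators.
metric = [0.91, 0.42, 0.77, 0.15, 0.60]
human = [5, 2, 4, 1, 3]
print(f"Spearman's rho = {alignment_score(metric, human):.3f}")  # 1.000 here
```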
Framework of our Perception-Driven Motion Metrics (PMM). PMM comprises five evaluation metrics: Commonsense Adherence Score (CAS), Motion Smoothness Score (MSS), Object Integrity Score (OIS), Perceptible Amplitude Score (PAS), and Temporal Coherence Score (TCS). (a-e): Computational flowcharts for each metric. The scores produced by PMM show variation trends consistent with human assessments, indicating strong alignment with human perception.
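To make the relationship between the five sub-scores concrete, here is a minimal sketch that bundles them and reduces them to a single motion-quality number. The equal weighting and the assumption that each sub-score is normalized to [0, 1] are ours for illustration, not necessarily how VMBench aggregates its metrics.

```python
# Hedged sketch: combining the five PMM sub-scores into one motion score.
# Equal weighting and [0, 1] normalization are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class PMMScores:
    cas: float  # Commonsense Adherence Score
    mss: float  # Motion Smoothness Score
    ois: float  # Object Integrity Score
    pas: float  # Perceptible Amplitude Score
    tcs: float  # Temporal Coherence Score

    def overall(self) -> float:
        """Unweighted mean of the five perception-driven sub-metrics."""
        return (self.cas + self.mss + self.ois + self.pas + self.tcs) / 5.0

scores = PMMScores(cas=0.82, mss=0.74, ois=0.91, pas=0.63, tcs=0.88)
print(f"Overall PMM score: {scores.overall():.3f}")
```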
Our metrics framework for evaluating video motion, which is inspired by the mechanisms of human perception of motion in videos. (a) Human perception of motion in videos primarily encompasses two dimensions: Comprehensive Analysis of Motion and Capture of Motion Details. (b) Our proposed metrics framework for evaluating video motion. Specifically, MSS and CAS correspond to the human process of Comprehensive Analysis of Motion, while OIS, PAS, and TCS correspond to the Capture of Motion Details.
Framework of our Meta-Guided Motion Prompt Generation (MMPG). MMPG consists of three stages: (a) Meta-information Extraction: extracting Subjects, Places, and Actions from datasets such as VidProm [30], DiDeMo [35], MSR-VTT [34], WebVid [33], Places365 [31], and Kinetics-700 [32]. (b) Self-Refining Prompt Generation: generating and iteratively refining prompts based on the extracted information. (c) Human-LLM Joint Validation: validating the prompts through a collaborative process between humans and DeepSeek-R1 to ensure they are plausible.
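The three MMPG stages compose naturally into a generate-validate loop. Here is an illustrative sketch of that flow; `llm_generate` and `llm_validate` are hypothetical stand-ins for calls to an LLM (e.g., DeepSeek-R1) and are not a real API from the VMBench codebase.

```python
# Illustrative sketch of the three-stage MMPG flow described in the caption.
# `llm_generate` and `llm_validate` are hypothetical placeholders.
import random

def llm_generate(subject: str, place: str, action: str) -> str:
    """Hypothetical LLM call: compose a motion prompt from meta-information."""
    return f"A {subject} {action} in {place}."

def llm_validate(prompt: str) -> bool:
    """Hypothetical LLM call: check the prompt for commonsense flaws."""
    return True  # placeholder; a real check would query the model

def generate_prompt(meta: dict[str, list[str]], max_rounds: int = 3) -> str | None:
    # (a) Meta-information extraction: sample a (subject, place, action) triple.
    subject = random.choice(meta["subjects"])
    place = random.choice(meta["places"])
    action = random.choice(meta["actions"])
    # (b) Self-refining generation: regenerate until validation passes.
    for _ in range(max_rounds):
        prompt = llm_generate(subject, place, action)
        # (c) Validation: in VMBench this is a joint human-LLM step;
        # only the automatic half is sketched here.
        if llm_validate(prompt):
            return prompt
    return None  # flag for human review if refinement keeps failing

meta = {
    "subjects": ["red fox", "cyclist"],
    "places": ["a snowy forest", "a busy market street"],
    "actions": ["leaps over a fallen log", "weaves between stalls"],
}
print(generate_prompt(meta))
```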
Statistical analysis of motion prompts in VMBench. (a-h): Multi-perspective statistical analysis of prompts in VMBench. These analyses demonstrate VMBench's comprehensive evaluation scope, encompassing motion dynamics, information diversity, and real-world commonsense adherence.
We visualize the evaluation results of six recent video generation models across the Perception-Driven Motion Metrics (PMM) dimensions.
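A common way to present per-dimension scores like these is a radar chart. Below is a minimal matplotlib sketch, assuming one model's five PMM scores are already at hand; the numbers are made up and the chart style is not the paper's figure.

```python
# Minimal sketch of a per-dimension radar chart for one model's PMM scores.
# The sample values are illustrative, not real benchmark results.
import matplotlib.pyplot as plt
import numpy as np

labels = ["CAS", "MSS", "OIS", "PAS", "TCS"]
scores = [0.82, 0.74, 0.91, 0.63, 0.88]  # made-up scores in [0, 1]

# Evenly spaced angles; repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 1)
ax.set_title("PMM dimension scores (illustrative)")
plt.savefig("pmm_radar.png", dpi=150)
```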
If you find our work useful, please consider citing our paper:
```bibtex
@misc{ling2025vmbenchbenchmarkperceptionalignedvideo,
      title={VMBench: A Benchmark for Perception-Aligned Video Motion Generation},
      author={Xinran Ling and Chen Zhu and Meiqi Wu and Hangyu Li and Xiaokun Feng and Cundian Yang and Aiming Hao and Jiashu Zhu and Jiahong Wu and Xiangxiang Chu},
      year={2025},
      eprint={2503.10076},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10076},
}
```