├── .gitignore
├── LICENSE
├── README.md
├── checkpoints
│   └── .gitkeep
├── docs
│   ├── data.md
│   ├── eval.md
│   ├── inference.ipynb
│   ├── inference_for_glm.ipynb
│   ├── offline_demo.md
│   └── train.md
├── images
│   ├── demo.mp4
│   ├── ex.png
│   ├── ex1.png
│   ├── ex2.png
│   ├── ex3.png
│   └── framework.png
├── requirements.txt
├── scripts
│   ├── stage1.sh
│   ├── stage1_glm.sh
│   ├── stage2.sh
│   ├── stage2_glm.sh
│   ├── stage3.sh
│   ├── stage3_glm.sh
│   ├── zero2.json
│   ├── zero3.json
│   └── zero3_offload.json
└── vtimellm
    ├── __init__.py
    ├── constants.py
    ├── conversation.py
    ├── demo_gradio.py
    ├── eval
    │   ├── data_example.json
    │   ├── dvc_eval
    │   │   ├── SODA
    │   │   │   ├── LICENSE
    │   │   │   ├── README.md
    │   │   │   ├── dataset.py
    │   │   │   ├── nlpeval
    │   │   │   │   ├── bert_f_score.py
    │   │   │   │   ├── bert_r_score.py
    │   │   │   │   └── mover.py
    │   │   │   ├── requirements.txt
    │   │   │   ├── soda.py
    │   │   │   └── utils.py
    │   │   ├── __init__.py
    │   │   ├── eval_dvc.py
    │   │   └── eval_soda.py
    │   ├── eval.py
    │   ├── log
    │   │   └── example_log.txt
    │   └── metric.py
    ├── inference.py
    ├── mm_utils.py
    ├── model
    │   ├── __init__.py
    │   ├── builder.py
    │   ├── chatglm
    │   │   ├── __init__.py
    │   │   ├── configuration_chatglm.py
    │   │   ├── modeling_chatglm.py
    │   │   ├── quantization.py
    │   │   └── tokenization_chatglm.py
    │   ├── vtimellm_arch.py
    │   ├── vtimellm_chatglm.py
    │   └── vtimellm_llama.py
    ├── train
    │   ├── dataset.py
    │   ├── llama_flash_attn_monkey_patch.py
    │   ├── train.py
    │   ├── train_mem.py
    │   └── vtimellm_trainer.py
    └── utils.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | **/__pycache__
2 | checkpoints/
3 | .ipynb_checkpoints/
4 | wandb/
5 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # VTimeLLM \[[Paper](https://arxiv.org/pdf/2311.18445.pdf)\]
2 | Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
3 | 
4 | 
5 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=vtimellm-empower-llm-to-grasp-video-moments)
6 | 
7 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/dense-video-captioning-on-activitynet)](https://paperswithcode.com/sota/dense-video-captioning-on-activitynet?p=vtimellm-empower-llm-to-grasp-video-moments)
8 | 
9 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=vtimellm-empower-llm-to-grasp-video-moments)
10 | 
11 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=vtimellm-empower-llm-to-grasp-video-moments)
12 | 
13 | 
14 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=vtimellm-empower-llm-to-grasp-video-moments)
15 | 
16 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=vtimellm-empower-llm-to-grasp-video-moments)
17 | 
18 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=vtimellm-empower-llm-to-grasp-video-moments)
19 | 
20 | ---
21 | 
22 | ## :loudspeaker: Latest Updates
23 | - **Jan-2**: Thanks to [Xiao Xia](https://github.com/Rishubi), [Shengbo Tong](https://github.com/tsb-19) and [Beining Wang](https://github.com/Benson0704), we have refactored the code, which now supports both the LLaMA and ChatGLM3 architectures. We translated the training data into Chinese and fine-tuned a Chinese version based on ChatGLM3-6B.
24 | - **Dec-14**: Released the training code and data. All the resources, including models, datasets and extracted features, are available
25 | [here](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2F&mode=list). :fire::fire:
26 | - **Dec-4**: VTimeLLM demo released.
27 | 
28 | ---
29 | 
30 | 
31 | 
32 | ## VTimeLLM Overview :bulb:
33 | 
34 | VTimeLLM is a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries.
35 | 
36 | VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multi-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and alignment with human intents.
37 | 
38 | ![framework](images/framework.png)
39 | 
40 | 
41 | ---
42 | 
43 | ## Contributions :trophy:
44 | 
45 | - We propose VTimeLLM, the first boundary-aware Video LLM, to the best of our knowledge.
46 | - We propose the boundary-aware three-stage training strategy, which consecutively leverages i) large-scale image-text data for feature alignment, ii) large-scale multi-event video-text data together with temporal-related single-turn and multi-turn QA to enhance awareness of time boundaries, and iii) instruction tuning on a high-quality dialogue dataset for better temporal reasoning ability.
47 | - We conduct extensive experiments to demonstrate that the proposed VTimeLLM significantly outperforms existing Video LLMs on various fine-grained, temporal-related video tasks, showing its superior ability for video understanding and reasoning.
48 | 
49 | 
50 | ---
51 | 
52 | ## Installation :wrench:
53 | 
54 | We recommend setting up a conda environment for the project:
55 | ```shell
56 | conda create --name=vtimellm python=3.10
57 | conda activate vtimellm
58 | 
59 | git clone https://github.com/huangb23/VTimeLLM.git
60 | cd VTimeLLM
61 | pip install -r requirements.txt
62 | ```
63 | For training, install these additional packages:
64 | ```shell
65 | pip install ninja
66 | pip install flash-attn --no-build-isolation
67 | ```
68 | 
69 | 
70 | 
71 | ## Running Demo Offline :cd:
72 | 
73 | To run the demo offline, please refer to the instructions in [offline_demo.md](docs/offline_demo.md).
74 | 
75 | ## Training :train:
76 | 
77 | For training instructions, check out [train.md](docs/train.md).
78 | 
79 | ## Qualitative Analysis :mag:
80 | 
81 | A comprehensive evaluation of VTimeLLM's performance across multiple tasks.
82 | 
83 | 
84 | ### Video Understanding and Conversational Tasks :speech_balloon:
85 | ![0](images/ex.png)
86 | 
87 | ---
88 | 
89 | ### Creative Tasks :paintbrush:
90 | ![1](images/ex1.png)
91 | 
92 | ---
93 | ### Fine-grained Understanding Tasks :globe_with_meridians:
94 | ![2](images/ex2.png)
95 | 
96 | ---
97 | ### Video Reasoning Tasks :question:
98 | ![3](images/ex3.png)
99 | 
100 | ---
101 | 
102 | 
103 | ## Acknowledgements :pray:
104 | 
105 | We are grateful to the following awesome projects that VTimeLLM builds upon:
106 | 
107 | * [LLaVA](https://github.com/haotian-liu/LLaVA): Large Language and Vision Assistant
108 | * [FastChat](https://github.com/lm-sys/FastChat): An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
109 | * [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT): Towards Detailed Video Understanding via Large Vision and Language Models
110 | * [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
111 | * [Vid2seq](https://github.com/google-research/scenic/tree/main/scenic/projects/vid2seq): Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
112 | * [InternVid](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid): A Large-scale Video-Text Dataset
113 | 
114 | 
115 | If you use VTimeLLM in your research or applications, please cite it using this BibTeX:
116 | ```bibtex
117 | @inproceedings{huang2024vtimellm,
118 |   title={{VTimeLLM}: Empower {LLM} to Grasp Video Moments},
119 |   author={Huang, Bin and Wang, Xin and Chen, Hong and Song, Zihan and Zhu, Wenwu},
120 |   booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
121 |   pages={14271--14280},
122 |   year={2024}
123 | }
124 | ```
125 | 
126 | ## License :scroll:
127 | Creative Commons License
128 | 
129 | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.
130 | 
131 | 
132 | Looking forward to your feedback, contributions, and stars! :star2:
133 | 
--------------------------------------------------------------------------------
/checkpoints/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangb23/VTimeLLM/c34ae56455c470ecbe002cbc53e30d5e67b03948/checkpoints/.gitkeep
--------------------------------------------------------------------------------
/docs/data.md:
--------------------------------------------------------------------------------
1 | Our dataset can be downloaded from [here](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2Fdata&mode=list). The data is organized in the following JSON format:
2 | 
3 | * source: The original source dataset of the video. This field is ['internvid'](https://huggingface.co/datasets/OpenGVLab/InternVid) in stage 2, and ['anet'](http://activity-net.org/download.html) or ['didemo'](https://drive.google.com/drive/u/0/folders/1_oyJ5rQiZboipbMl6tkhY8v0s9zDkvJc) in stage 3.
4 | * id: The ID of the video in the original dataset.
5 | * conversations: The conversation data used for training. We use special tokens to indicate timestamps; these need to be replaced during training using information from the 'meta' field.
6 | * meta:
7 |     * split: Two numbers denoting the start and end timestamps of the segment extracted from the original video and used as input. If this field does not exist, the entire video is used as input.
8 |     * duration: The length of the input video.
9 |     * token: The timestamps of the special tokens appearing in 'conversations'.
10 | 
11 | Here is an example:
12 | 
13 | ```json
14 | {
15 |     "source": "internvid",
16 |     "id": "3n3oCNerzV0",
17 |     "conversations": [
18 |         {
19 |             "from": "human",
20 |             "value": "